* [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
@ 2023-01-20 13:47 Daniil Tatianin
2023-01-20 13:47 ` [PATCH 1/4] oslib: introduce new qemu_prealloc_mem_with_timeout() api Daniil Tatianin
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Daniil Tatianin @ 2023-01-20 13:47 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Daniil Tatianin, qemu-devel, Stefan Weil, David Hildenbrand,
Igor Mammedov, yc-core
This series introduces a new qemu_prealloc_mem_with_timeout() API,
which allows limiting the maximum amount of time to be spent on memory
preallocation. It also adds prealloc statistics collection, exposed
via an optional timeout handler.
This new API is then used by hostmem for guest RAM preallocation,
controlled via the new object properties 'prealloc-timeout' and
'prealloc-timeout-fatal'.
This is useful for limiting VM startup time on systems with
unpredictable page allocation delays due to memory fragmentation or the
backing storage. The timeout can be configured to either simply emit a
warning and continue VM startup without having preallocated the entire
guest RAM or just abort startup entirely if that is not acceptable for
a specific use case.
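For illustration, a backend using the properties proposed in this series might be configured as below. This is a sketch only: the sizes, paths, and IDs are examples, and the exact option spelling follows the QAPI additions in patches 3 and 4.

```shell
# Hypothetical invocation: warn and continue startup if 4G of
# hugepage-backed guest RAM cannot be preallocated within 10 seconds.
qemu-system-x86_64 \
  -m 4G \
  -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/hugepages,prealloc=on,prealloc-timeout=10,prealloc-timeout-fatal=off \
  -numa node,memdev=ram0
```

Setting prealloc-timeout-fatal=on instead would abort startup when the timeout is exceeded.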
Daniil Tatianin (4):
oslib: introduce new qemu_prealloc_mem_with_timeout() api
backends/hostmem: move memory region preallocation logic into a helper
backends/hostmem: add an ability to specify prealloc timeout
backends/hostmem: add an ability to make prealloc timeout fatal
backends/hostmem.c | 112 +++++++++++++++++++++++++++++++-------
include/qemu/osdep.h | 19 +++++++
include/sysemu/hostmem.h | 3 ++
qapi/qom.json | 8 +++
util/oslib-posix.c | 114 +++++++++++++++++++++++++++++++++++----
util/oslib-win32.c | 9 ++++
6 files changed, 238 insertions(+), 27 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/4] oslib: introduce new qemu_prealloc_mem_with_timeout() api
2023-01-20 13:47 [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout Daniil Tatianin
@ 2023-01-20 13:47 ` Daniil Tatianin
2023-01-20 13:47 ` [PATCH 2/4] backends/hostmem: move memory region preallocation logic into a helper Daniil Tatianin
` (3 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Daniil Tatianin @ 2023-01-20 13:47 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Daniil Tatianin, qemu-devel, Stefan Weil, David Hildenbrand,
Igor Mammedov, yc-core
This helper allows limiting the maximum amount of time to be spent
preallocating a block of memory, which is important on systems that
might have unpredictable page allocation delays because of possible
fragmentation or other reasons specific to the backend.
It also exposes a way to register a callback that is invoked if the
specified timeout is exceeded. The callback is provided with a
PreallocStats structure containing progress statistics: the total and
allocated number of pages, as well as the page size and the number of
allocation threads.
The win32 implementation is currently a stub that just calls into the
old qemu_prealloc_mem() API.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
include/qemu/osdep.h | 19 ++++++++
util/oslib-posix.c | 114 +++++++++++++++++++++++++++++++++++++++----
util/oslib-win32.c | 9 ++++
3 files changed, 133 insertions(+), 9 deletions(-)
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index bd23a08595..21757e5144 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -595,6 +595,25 @@ typedef struct ThreadContext ThreadContext;
void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
ThreadContext *tc, Error **errp);
+typedef struct PreallocStats {
+ size_t page_size;
+ size_t total_pages;
+ size_t allocated_pages;
+ int threads;
+ time_t seconds_elapsed;
+} PreallocStats;
+
+typedef struct PreallocTimeout {
+ time_t seconds;
+ void *user;
+ void (*on_timeout)(void *user, const PreallocStats *stats);
+} PreallocTimeout;
+
+void qemu_prealloc_mem_with_timeout(int fd, char *area, size_t sz,
+ int max_threads, ThreadContext *tc,
+ const PreallocTimeout *timeout,
+ Error **errp);
+
/**
* qemu_get_pid_name:
* @pid: pid of a process
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 59a891b6a8..570fca601f 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -74,6 +74,7 @@ typedef struct MemsetContext {
bool any_thread_failed;
struct MemsetThread *threads;
int num_threads;
+ PreallocStats stats;
} MemsetContext;
struct MemsetThread {
@@ -83,6 +84,7 @@ struct MemsetThread {
QemuThread pgthread;
sigjmp_buf env;
MemsetContext *context;
+ size_t touched_pages;
};
typedef struct MemsetThread MemsetThread;
@@ -373,6 +375,7 @@ static void *do_touch_pages(void *arg)
*/
*(volatile char *)addr = *addr;
addr += hpagesize;
+ qatomic_inc(&memset_args->touched_pages);
}
}
pthread_sigmask(SIG_SETMASK, &oldset, NULL);
@@ -396,6 +399,11 @@ static void *do_madv_populate_write_pages(void *arg)
if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) {
ret = -errno;
}
+
+ if (!ret) {
+ qatomic_set(&memset_args->touched_pages, memset_args->numpages);
+ }
+
return (void *)(uintptr_t)ret;
}
@@ -418,8 +426,68 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
return ret;
}
+static int do_join_memset_threads_with_timeout(MemsetContext *context,
+ time_t seconds)
+{
+ struct timespec ts;
+ int i = 0;
+
+ if (clock_gettime(CLOCK_REALTIME, &ts) < 0) {
+ return i;
+ }
+ ts.tv_sec += seconds;
+
+ for (; i < context->num_threads; ++i) {
+ if (pthread_timedjoin_np(context->threads[i].pgthread.thread,
+ NULL, &ts)) {
+ break;
+ }
+ }
+
+ return i;
+}
+
+static void memset_stats_count_pages(MemsetContext *context)
+{
+ int i;
+
+ for (i = 0; i < context->num_threads; ++i) {
+ size_t pages = qatomic_load_acquire(
+ &context->threads[i].touched_pages);
+ context->stats.allocated_pages += pages;
+ }
+}
+
+static int timed_join_memset_threads(MemsetContext *context,
+ const PreallocTimeout *timeout)
+{
+ int i, off;
+ PreallocStats *stats = &context->stats;
+ off = do_join_memset_threads_with_timeout(context, timeout->seconds);
+
+ if (off != context->num_threads && timeout->on_timeout) {
+ memset_stats_count_pages(context);
+
+ /*
+ * Guard against possible races if preallocation finishes right
+ * after the timeout is exceeded.
+ */
+ if (stats->allocated_pages < stats->total_pages) {
+ stats->seconds_elapsed = timeout->seconds;
+ timeout->on_timeout(timeout->user, stats);
+ }
+ }
+
+ for (i = off; i < context->num_threads; ++i) {
+ pthread_cancel(context->threads[i].pgthread.thread);
+ }
+
+ return off;
+}
+
static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
int max_threads, ThreadContext *tc,
+ const PreallocTimeout *timeout,
bool use_madv_populate_write)
{
static gsize initialized = 0;
@@ -452,6 +520,9 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
}
context.threads = g_new0(MemsetThread, context.num_threads);
+ context.stats.page_size = hpagesize;
+ context.stats.total_pages = numpages;
+ context.stats.threads = context.num_threads;
numpages_per_thread = numpages / context.num_threads;
leftover = numpages % context.num_threads;
for (i = 0; i < context.num_threads; i++) {
@@ -481,11 +552,20 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
qemu_cond_broadcast(&page_cond);
qemu_mutex_unlock(&page_mutex);
- for (i = 0; i < context.num_threads; i++) {
- int tmp = (uintptr_t)qemu_thread_join(&context.threads[i].pgthread);
+ if (timeout) {
+ i = timed_join_memset_threads(&context, timeout);
+
+ if (i != context.num_threads &&
+ context.stats.allocated_pages != context.stats.total_pages) {
+ ret = -ETIMEDOUT;
+ }
+ }
+
+ for (; i < context.num_threads; i++) {
+ void *thread_ret = qemu_thread_join(&context.threads[i].pgthread);
- if (tmp) {
- ret = tmp;
+ if (thread_ret && thread_ret != PTHREAD_CANCELED) {
+ ret = (uintptr_t)thread_ret;
}
}
@@ -503,8 +583,10 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
errno != EINVAL;
}
-void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
- ThreadContext *tc, Error **errp)
+void qemu_prealloc_mem_with_timeout(int fd, char *area, size_t sz,
+ int max_threads, ThreadContext *tc,
+ const PreallocTimeout *timeout,
+ Error **errp)
{
static gsize initialized;
int ret;
@@ -546,10 +628,18 @@ void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
/* touch pages simultaneously */
ret = touch_all_pages(area, hpagesize, numpages, max_threads, tc,
- use_madv_populate_write);
+ timeout, use_madv_populate_write);
+
if (ret) {
- error_setg_errno(errp, -ret,
- "qemu_prealloc_mem: preallocating memory failed");
+ const char *msg;
+
+ if (timeout && ret == -ETIMEDOUT) {
+ msg = "preallocation timed out";
+ } else {
+ msg = "preallocating memory failed";
+ }
+
+ error_setg_errno(errp, -ret, "qemu_prealloc_mem: %s", msg);
}
if (!use_madv_populate_write) {
@@ -563,6 +653,12 @@ void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
}
}
+void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
+ ThreadContext *tc, Error **errp)
+{
+ qemu_prealloc_mem_with_timeout(fd, area, sz, max_threads, tc, NULL, errp);
+}
+
char *qemu_get_pid_name(pid_t pid)
{
char *name = NULL;
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 07ade41800..27f39ef66a 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -276,6 +276,15 @@ void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
}
}
+void qemu_prealloc_mem_with_timeout(int fd, char *area, size_t sz,
+ int max_threads, ThreadContext *tc,
+ const PreallocTimeout *timeout,
+ Error **errp)
+{
+ /* FIXME: actually implement timing out here */
+ qemu_prealloc_mem(fd, area, sz, max_threads, tc, errp);
+}
+
char *qemu_get_pid_name(pid_t pid)
{
/* XXX Implement me */
--
2.25.1
* [PATCH 2/4] backends/hostmem: move memory region preallocation logic into a helper
2023-01-20 13:47 [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout Daniil Tatianin
2023-01-20 13:47 ` [PATCH 1/4] oslib: introduce new qemu_prealloc_mem_with_timeout() api Daniil Tatianin
@ 2023-01-20 13:47 ` Daniil Tatianin
2023-01-20 13:47 ` [PATCH 3/4] backends/hostmem: add an ability to specify prealloc timeout Daniil Tatianin
` (2 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Daniil Tatianin @ 2023-01-20 13:47 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Daniil Tatianin, qemu-devel, Stefan Weil, David Hildenbrand,
Igor Mammedov, yc-core
...so that we don't have to duplicate it in multiple places throughout
the file.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
backends/hostmem.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 747e7838c0..842bfa9eb7 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -216,10 +216,26 @@ static bool host_memory_backend_get_prealloc(Object *obj, Error **errp)
return backend->prealloc;
}
+static bool do_prealloc_mr(HostMemoryBackend *backend, Error **errp)
+{
+ Error *local_err = NULL;
+ int fd = memory_region_get_fd(&backend->mr);
+ void *ptr = memory_region_get_ram_ptr(&backend->mr);
+ uint64_t sz = memory_region_size(&backend->mr);
+
+ qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads,
+ backend->prealloc_context, &local_err);
+ if (local_err) {
+ error_propagate(errp, local_err);
+ return false;
+ }
+
+ return true;
+}
+
static void host_memory_backend_set_prealloc(Object *obj, bool value,
Error **errp)
{
- Error *local_err = NULL;
HostMemoryBackend *backend = MEMORY_BACKEND(obj);
if (!backend->reserve && value) {
@@ -233,17 +249,7 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
}
if (value && !backend->prealloc) {
- int fd = memory_region_get_fd(&backend->mr);
- void *ptr = memory_region_get_ram_ptr(&backend->mr);
- uint64_t sz = memory_region_size(&backend->mr);
-
- qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads,
- backend->prealloc_context, &local_err);
- if (local_err) {
- error_propagate(errp, local_err);
- return;
- }
- backend->prealloc = true;
+ backend->prealloc = do_prealloc_mr(backend, errp);
}
}
@@ -399,12 +405,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
* specified NUMA policy in place.
*/
if (backend->prealloc) {
- qemu_prealloc_mem(memory_region_get_fd(&backend->mr), ptr, sz,
- backend->prealloc_threads,
- backend->prealloc_context, &local_err);
- if (local_err) {
- goto out;
- }
+ do_prealloc_mr(backend, errp);
+ return;
}
}
out:
--
2.25.1
* [PATCH 3/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-20 13:47 [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout Daniil Tatianin
2023-01-20 13:47 ` [PATCH 1/4] oslib: introduce new qemu_prealloc_mem_with_timeout() api Daniil Tatianin
2023-01-20 13:47 ` [PATCH 2/4] backends/hostmem: move memory region preallocation logic into a helper Daniil Tatianin
@ 2023-01-20 13:47 ` Daniil Tatianin
2023-01-20 13:47 ` [PATCH 4/4] backends/hostmem: add an ability to make prealloc timeout fatal Daniil Tatianin
2023-01-23 8:57 ` [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout David Hildenbrand
4 siblings, 0 replies; 14+ messages in thread
From: Daniil Tatianin @ 2023-01-20 13:47 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Daniil Tatianin, qemu-devel, Stefan Weil, David Hildenbrand,
Igor Mammedov, yc-core
Use the new qemu_prealloc_mem_with_timeout() API so that we can limit
the maximum amount of time spent preallocating guest RAM. We also emit
a warning from the timeout handler detailing the current prealloc
progress and letting the user know that the timeout was exceeded.
The timeout is set to zero (no timeout) by default, and can be
configured via the new 'prealloc-timeout' property.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
backends/hostmem.c | 48 ++++++++++++++++++++++++++++++++++++++--
include/sysemu/hostmem.h | 2 ++
qapi/qom.json | 4 ++++
3 files changed, 52 insertions(+), 2 deletions(-)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 842bfa9eb7..be9af7515e 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -34,6 +34,19 @@ QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
#endif
+static void
+host_memory_on_prealloc_timeout(void *opaque,
+ const PreallocStats *stats)
+{
+ HostMemoryBackend *backend = opaque;
+
+ backend->prealloc_did_timeout = true;
+ warn_report("HostMemory preallocation timeout %"PRIu64"s exceeded, "
+ "allocated %zu/%zu (%zu byte) pages (%d threads)",
+ (uint64_t)stats->seconds_elapsed, stats->allocated_pages,
+ stats->total_pages, stats->page_size, stats->threads);
+}
+
char *
host_memory_backend_get_name(HostMemoryBackend *backend)
{
@@ -223,8 +236,26 @@ static bool do_prealloc_mr(HostMemoryBackend *backend, Error **errp)
void *ptr = memory_region_get_ram_ptr(&backend->mr);
uint64_t sz = memory_region_size(&backend->mr);
- qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads,
- backend->prealloc_context, &local_err);
+ if (backend->prealloc_timeout) {
+ PreallocTimeout timeout = {
+ .seconds = (time_t)backend->prealloc_timeout,
+ .user = backend,
+ .on_timeout = host_memory_on_prealloc_timeout,
+ };
+
+ qemu_prealloc_mem_with_timeout(fd, ptr, sz, backend->prealloc_threads,
+ backend->prealloc_context, &timeout,
+ &local_err);
+ if (local_err && backend->prealloc_did_timeout) {
+ error_free(local_err);
+ local_err = NULL;
+ }
+ } else {
+ qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads,
+ backend->prealloc_context, &local_err);
+ }
+
+
if (local_err) {
error_propagate(errp, local_err);
return false;
@@ -277,6 +308,13 @@ static void host_memory_backend_set_prealloc_threads(Object *obj, Visitor *v,
backend->prealloc_threads = value;
}
+static void host_memory_backend_get_set_prealloc_timeout(Object *obj,
+ Visitor *v, const char *name, void *opaque, Error **errp)
+{
+ HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+ visit_type_uint32(v, name, &backend->prealloc_timeout, errp);
+}
+
static void host_memory_backend_init(Object *obj)
{
HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -516,6 +554,12 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
object_property_allow_set_link, OBJ_PROP_LINK_STRONG);
object_class_property_set_description(oc, "prealloc-context",
"Context to use for creating CPU threads for preallocation");
+ object_class_property_add(oc, "prealloc-timeout", "int",
+ host_memory_backend_get_set_prealloc_timeout,
+ host_memory_backend_get_set_prealloc_timeout,
+ NULL, NULL);
+ object_class_property_set_description(oc, "prealloc-timeout",
+ "Maximum memory preallocation timeout in seconds");
object_class_property_add(oc, "size", "int",
host_memory_backend_get_size,
host_memory_backend_set_size,
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 39326f1d4f..21910f3b45 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -66,7 +66,9 @@ struct HostMemoryBackend {
uint64_t size;
bool merge, dump, use_canonical_path;
bool prealloc, is_mapped, share, reserve;
+ bool prealloc_did_timeout;
uint32_t prealloc_threads;
+ uint32_t prealloc_timeout;
ThreadContext *prealloc_context;
DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
HostMemPolicy policy;
diff --git a/qapi/qom.json b/qapi/qom.json
index 30e76653ad..9149c064b8 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -581,6 +581,9 @@
# @prealloc-context: thread context to use for creation of preallocation threads
# (default: none) (since 7.2)
#
+# @prealloc-timeout: Maximum memory preallocation timeout in seconds
+# (default: 0) (since 7.3)
+#
# @share: if false, the memory is private to QEMU; if true, it is shared
# (default: false)
#
@@ -612,6 +615,7 @@
'*prealloc': 'bool',
'*prealloc-threads': 'uint32',
'*prealloc-context': 'str',
+ '*prealloc-timeout': 'uint32',
'*share': 'bool',
'*reserve': 'bool',
'size': 'size',
--
2.25.1
* [PATCH 4/4] backends/hostmem: add an ability to make prealloc timeout fatal
2023-01-20 13:47 [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout Daniil Tatianin
` (2 preceding siblings ...)
2023-01-20 13:47 ` [PATCH 3/4] backends/hostmem: add an ability to specify prealloc timeout Daniil Tatianin
@ 2023-01-20 13:47 ` Daniil Tatianin
2023-01-23 8:57 ` [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout David Hildenbrand
4 siblings, 0 replies; 14+ messages in thread
From: Daniil Tatianin @ 2023-01-20 13:47 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Daniil Tatianin, qemu-devel, Stefan Weil, David Hildenbrand,
Igor Mammedov, yc-core
This is controlled via the new 'prealloc-timeout-fatal' property. It
can be useful when we cannot afford to start the VM without having
preallocated all guest pages, but are still time constrained.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
backends/hostmem.c | 38 ++++++++++++++++++++++++++++++++++----
include/sysemu/hostmem.h | 1 +
qapi/qom.json | 4 ++++
3 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index be9af7515e..0808dc6951 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -39,12 +39,21 @@ host_memory_on_prealloc_timeout(void *opaque,
const PreallocStats *stats)
{
HostMemoryBackend *backend = opaque;
+ const char *msg = "HostMemory preallocation timeout %"PRIu64"s exceeded, "
+ "allocated %zu/%zu (%zu byte) pages (%d threads)";
+
+ if (backend->prealloc_timeout_fatal) {
+ error_report(msg, (uint64_t)stats->seconds_elapsed,
+ stats->allocated_pages, stats->total_pages,
+ stats->page_size, stats->threads);
+ exit(1);
+
+ }
backend->prealloc_did_timeout = true;
- warn_report("HostMemory preallocation timeout %"PRIu64"s exceeded, "
- "allocated %zu/%zu (%zu byte) pages (%d threads)",
- (uint64_t)stats->seconds_elapsed, stats->allocated_pages,
- stats->total_pages, stats->page_size, stats->threads);
+ warn_report(msg, (uint64_t)stats->seconds_elapsed,
+ stats->allocated_pages, stats->total_pages,
+ stats->page_size, stats->threads);
}
char *
@@ -315,6 +324,22 @@ static void host_memory_backend_get_set_prealloc_timeout(Object *obj,
visit_type_uint32(v, name, &backend->prealloc_timeout, errp);
}
+static bool host_memory_backend_get_prealloc_timeout_fatal(
+ Object *obj, Error **errp)
+{
+ HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+ return backend->prealloc_timeout_fatal;
+}
+
+static void host_memory_backend_set_prealloc_timeout_fatal(
+ Object *obj, bool value, Error **errp)
+{
+ HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+
+ backend->prealloc_timeout_fatal = value;
+}
+
static void host_memory_backend_init(Object *obj)
{
HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -560,6 +585,11 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
NULL, NULL);
object_class_property_set_description(oc, "prealloc-timeout",
"Maximum memory preallocation timeout in seconds");
+ object_class_property_add_bool(oc, "prealloc-timeout-fatal",
+ host_memory_backend_get_prealloc_timeout_fatal,
+ host_memory_backend_set_prealloc_timeout_fatal);
+ object_class_property_set_description(oc, "prealloc-timeout-fatal",
+ "Consider preallocation timeout a fatal error");
object_class_property_add(oc, "size", "int",
host_memory_backend_get_size,
host_memory_backend_set_size,
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 21910f3b45..b501b5eff2 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -67,6 +67,7 @@ struct HostMemoryBackend {
bool merge, dump, use_canonical_path;
bool prealloc, is_mapped, share, reserve;
bool prealloc_did_timeout;
+ bool prealloc_timeout_fatal;
uint32_t prealloc_threads;
uint32_t prealloc_timeout;
ThreadContext *prealloc_context;
diff --git a/qapi/qom.json b/qapi/qom.json
index 9149c064b8..70644d714b 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -584,6 +584,9 @@
# @prealloc-timeout: Maximum memory preallocation timeout in seconds
# (default: 0) (since 7.3)
#
+# @prealloc-timeout-fatal: Consider preallocation timeout a fatal error
+# (default: false) (since 7.3)
+#
# @share: if false, the memory is private to QEMU; if true, it is shared
# (default: false)
#
@@ -616,6 +619,7 @@
'*prealloc-threads': 'uint32',
'*prealloc-context': 'str',
'*prealloc-timeout': 'uint32',
+ '*prealloc-timeout-fatal': 'bool',
'*share': 'bool',
'*reserve': 'bool',
'size': 'size',
--
2.25.1
* Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-20 13:47 [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout Daniil Tatianin
` (3 preceding siblings ...)
2023-01-20 13:47 ` [PATCH 4/4] backends/hostmem: add an ability to make prealloc timeout fatal Daniil Tatianin
@ 2023-01-23 8:57 ` David Hildenbrand
2023-01-23 13:30 ` Daniil Tatianin
4 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2023-01-23 8:57 UTC (permalink / raw)
To: Daniil Tatianin, Paolo Bonzini
Cc: qemu-devel, Stefan Weil, Igor Mammedov, yc-core
On 20.01.23 14:47, Daniil Tatianin wrote:
> This series introduces new qemu_prealloc_mem_with_timeout() api,
> which allows limiting the maximum amount of time to be spent on memory
> preallocation. It also adds prealloc statistics collection that is
> exposed via an optional timeout handler.
>
> This new api is then utilized by hostmem for guest RAM preallocation
> controlled via new object properties called 'prealloc-timeout' and
> 'prealloc-timeout-fatal'.
>
> This is useful for limiting VM startup time on systems with
> unpredictable page allocation delays due to memory fragmentation or the
> backing storage. The timeout can be configured to either simply emit a
> warning and continue VM startup without having preallocated the entire
> guest RAM or just abort startup entirely if that is not acceptable for
> a specific use case.
The major use case for preallocation is memory resources that cannot be
overcommitted (hugetlb, file blocks, ...), to avoid running out of such
resources later, while the guest is already running, and crashing it.
Allocating only a fraction "because it takes too long" looks quite
useless in that (main use-case) context. We shouldn't encourage QEMU
users to play with fire in such a way. IOW, there should be no way
around "prealloc-timeout-fatal". Either preallocation succeeded and the
guest can run, or it failed, and the guest can't run.
... but then, management tools can simply start QEMU with "-S", start
their own timer, and zap QEMU if it didn't manage to come up in time,
then simply start a new QEMU instance without preallocation enabled.
The "good" thing about that approach is that it will also cover any
implicit memory preallocation, like using mlock() or VFIO, that don't
run in ordinary per-hostmem preallocation context. If setting QEMU up
takes to long, you might want to try on a different hypervisor in your
cluster instead.
I don't immediately see why we want to make our preallocation+hostmem
implementation in QEMU more complicated for such a use case.
--
Thanks,
David / dhildenb
* Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-23 8:57 ` [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout David Hildenbrand
@ 2023-01-23 13:30 ` Daniil Tatianin
2023-01-23 13:47 ` Daniel P. Berrangé
2023-01-23 13:56 ` David Hildenbrand
0 siblings, 2 replies; 14+ messages in thread
From: Daniil Tatianin @ 2023-01-23 13:30 UTC (permalink / raw)
To: David Hildenbrand, Paolo Bonzini
Cc: qemu-devel, Stefan Weil, Igor Mammedov, yc-core
On 1/23/23 11:57 AM, David Hildenbrand wrote:
> On 20.01.23 14:47, Daniil Tatianin wrote:
>> This series introduces new qemu_prealloc_mem_with_timeout() api,
>> which allows limiting the maximum amount of time to be spent on memory
>> preallocation. It also adds prealloc statistics collection that is
>> exposed via an optional timeout handler.
>>
>> This new api is then utilized by hostmem for guest RAM preallocation
>> controlled via new object properties called 'prealloc-timeout' and
>> 'prealloc-timeout-fatal'.
>>
>> This is useful for limiting VM startup time on systems with
>> unpredictable page allocation delays due to memory fragmentation or the
>> backing storage. The timeout can be configured to either simply emit a
>> warning and continue VM startup without having preallocated the entire
>> guest RAM or just abort startup entirely if that is not acceptable for
>> a specific use case.
>
> The major use case for preallocation is memory resources that cannot be
> overcommitted (hugetlb, file blocks, ...), to avoid running out of such
> resources later, while the guest is already running, and crashing it.
Wouldn't you say that preallocating memory for the sake of speeding up
guest kernel startup & runtime is a valid use case of prealloc? This way
we can avoid expensive (for a multitude of reasons) page faults that
will otherwise slow down the guest significantly at runtime and affect
the user experience.
> Allocating only a fraction "because it takes too long" looks quite
> useless in that (main use-case) context. We shouldn't encourage QEMU
> users to play with fire in such a way. IOW, there should be no way
> around "prealloc-timeout-fatal". Either preallocation succeeded and the
> guest can run, or it failed, and the guest can't run.
Here we basically accept the fact that, e.g. with fragmented memory,
the kernel might take a while in a page fault handler, especially for
hugetlb, because of the page compaction that has to run for every fault.
This way we can prefault at least some number of pages and let the guest
fault the rest on demand later on during runtime even if it's slow and
would cause a noticeable lag.
> ... but then, management tools can simply start QEMU with "-S", start an
> own timer, and zap QEMU if it didn't manage to come up in time, and
> simply start a new QEMU instance without preallocation enabled.
>
> The "good" thing about that approach is that it will also cover any
> implicit memory preallocation, like using mlock() or VFIO, that don't
> run in ordinary per-hostmem preallocation context. If setting QEMU up
> takes to long, you might want to try on a different hypervisor in your
> cluster instead.
This approach definitely works too, but again it assumes that we always
want 'prealloc-timeout-fatal' to be on, which is, for the most part,
only the case when working around issues that might be caused by
overcommit.
>
> I don't immediately see why we want to make our preallcoation+hostmem
> implementation in QEMU more complicated for such a use case.
>
* Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-23 13:30 ` Daniil Tatianin
@ 2023-01-23 13:47 ` Daniel P. Berrangé
2023-01-23 14:10 ` David Hildenbrand
2023-01-23 14:14 ` Daniil Tatianin
2023-01-23 13:56 ` David Hildenbrand
1 sibling, 2 replies; 14+ messages in thread
From: Daniel P. Berrangé @ 2023-01-23 13:47 UTC (permalink / raw)
To: Daniil Tatianin
Cc: David Hildenbrand, Paolo Bonzini, qemu-devel, Stefan Weil,
Igor Mammedov, yc-core
On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
> On 1/23/23 11:57 AM, David Hildenbrand wrote:
> > On 20.01.23 14:47, Daniil Tatianin wrote:
> > > This series introduces new qemu_prealloc_mem_with_timeout() api,
> > > which allows limiting the maximum amount of time to be spent on memory
> > > preallocation. It also adds prealloc statistics collection that is
> > > exposed via an optional timeout handler.
> > >
> > > This new api is then utilized by hostmem for guest RAM preallocation
> > > controlled via new object properties called 'prealloc-timeout' and
> > > 'prealloc-timeout-fatal'.
> > >
> > > This is useful for limiting VM startup time on systems with
> > > unpredictable page allocation delays due to memory fragmentation or the
> > > backing storage. The timeout can be configured to either simply emit a
> > > warning and continue VM startup without having preallocated the entire
> > > guest RAM or just abort startup entirely if that is not acceptable for
> > > a specific use case.
> >
> > The major use case for preallocation is memory resources that cannot be
> > overcommitted (hugetlb, file blocks, ...), to avoid running out of such
> > resources later, while the guest is already running, and crashing it.
>
> Wouldn't you say that preallocating memory for the sake of speeding up guest
> kernel startup & runtime is a valid use case of prealloc? This way we can
> avoid expensive (for a multitude of reasons) page faults that will otherwise
> slow down the guest significantly at runtime and affect the user experience.
>
> > Allocating only a fraction "because it takes too long" looks quite
> > useless in that (main use-case) context. We shouldn't encourage QEMU
> > users to play with fire in such a way. IOW, there should be no way
> > around "prealloc-timeout-fatal". Either preallocation succeeded and the
> > guest can run, or it failed, and the guest can't run.
>
> Here we basically accept the fact that e.g with fragmented memory the kernel
> might take a while in a page fault handler especially for hugetlb because of
> page compaction that has to run for every fault.
>
> This way we can prefault at least some number of pages and let the guest
> fault the rest on demand later on during runtime even if it's slow and would
> cause a noticeable lag.
Rather than treat this as a problem that needs a timeout, can we
restate it as: some situations need synchronous preallocation, while
others can tolerate asynchronous preallocation?
For the case where we need synchronous prealloc, current QEMU deals
with that. If it doesn't work quickly enough, mgmt can just kill
QEMU already today.
For the case where you would like some prealloc, but don't mind
if it runs without full prealloc, then why not just treat it as an
entirely asynchronous task ? Instead of calling qemu_prealloc_mem
and waiting for it to complete, just spawn a thread to run
qemu_prealloc_mem, so it doesn't block QEMU startup. This will
have minimal maint burden on the existing code, and will avoid
need for mgmt apps to think about what timeout value to give,
which is good because timeouts are hard to get right.
Most of the time that async background prealloc will still finish
before the guest even gets out of the firmware phase, but if it
takes longer it is no big deal. You don't need to quit the prealloc
job early, you just need it to not delay the guest OS boot IIUC.
This impl could be done with the 'prealloc' property turning from
a boolean on/off into an enum on/async/off, where 'on' == sync
prealloc. Or add a separate 'prealloc-async' bool property.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-23 13:30 ` Daniil Tatianin
2023-01-23 13:47 ` Daniel P. Berrangé
@ 2023-01-23 13:56 ` David Hildenbrand
1 sibling, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2023-01-23 13:56 UTC (permalink / raw)
To: Daniil Tatianin, Paolo Bonzini
Cc: qemu-devel, Stefan Weil, Igor Mammedov, yc-core
On 23.01.23 14:30, Daniil Tatianin wrote:
> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>> This series introduces new qemu_prealloc_mem_with_timeout() api,
>>> which allows limiting the maximum amount of time to be spent on memory
>>> preallocation. It also adds prealloc statistics collection that is
>>> exposed via an optional timeout handler.
>>>
>>> This new api is then utilized by hostmem for guest RAM preallocation
>>> controlled via new object properties called 'prealloc-timeout' and
>>> 'prealloc-timeout-fatal'.
>>>
>>> This is useful for limiting VM startup time on systems with
>>> unpredictable page allocation delays due to memory fragmentation or the
>>> backing storage. The timeout can be configured to either simply emit a
>>> warning and continue VM startup without having preallocated the entire
>>> guest RAM or just abort startup entirely if that is not acceptable for
>>> a specific use case.
>>
>> The major use case for preallocation is memory resources that cannot be
>> overcommitted (hugetlb, file blocks, ...), to avoid running out of such
>> resources later, while the guest is already running, and crashing it.
>
> Wouldn't you say that preallocating memory for the sake of speeding up
> guest kernel startup & runtime is a valid use case of prealloc? This way
> we can avoid expensive (for a multitude of reasons) page faults that
> will otherwise slow down the guest significantly at runtime and affect
> the user experience.
With "ordinary" memory (anon/shmem/file), there is no such guarantee
unless you effectively prevent swapping/writeback or run in an extremely
controlled environment. With anon memory, you further have to disable
KSM, because that could immediately de-duplicate the zeroed pages again.
For this reason, I am not aware of preallocation getting used for the
use case you mentioned. Performance-sensitive workloads want
determinism, and consequently usually use hugetlb + preallocation. Or
mlockall() to effectively allocate all memory and lock it before
starting the VM.
Regarding page faults: with THP, the guest will touch a 2 MiB range
once, and you'll get a 2 MiB page populated, requiring no further write
faults, which should already heavily reduce page faults when booting a
guest.
Preallocating all guest memory to make a guest kernel boot up faster
sounds a bit weird to me. Preallocating "some random part of guest
memory" sounds weird, too: what if the guest uses exactly the
memory locations you didn't preallocate?
I'd suggest doing some measurements to see whether there are actually
cases where "randomly preallocating some memory pages" is beneficial when
considering the overall startup time (setting up the VM + starting the OS).
>
>> Allocating only a fraction "because it takes too long" looks quite
>> useless in that (main use-case) context. We shouldn't encourage QEMU
>> users to play with fire in such a way. IOW, there should be no way
>> around "prealloc-timeout-fatal". Either preallocation succeeded and the
>> guest can run, or it failed, and the guest can't run.
>
> Here we basically accept the fact that, e.g. with fragmented memory, the
> kernel might take a while in the page fault handler, especially for hugetlb,
> because of the page compaction that has to run for every fault.
>
> This way we can prefault at least some number of pages and let the guest
> fault the rest on demand later on during runtime even if it's slow and
> would cause a noticeable lag.
Sorry, I don't really see the value of "preallocating a random
portion of guest memory".
In practice, Linux guests will only touch memory once that memory is
required (e.g., allocated), not by default during bootup.
What you could do, is start the VM from a shmem/hugetlb/... file, and
concurrently start preallocating all memory from a second process. The
guest can boot up immediately and eventually you'll have all guest
memory allocated. It won't work with anon memory (memory-backend-ram)
and private mappings (shared=false), of course.
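A minimal sketch of that scheme (illustrative only: the "second process" is modeled with fork(), and `prealloc_in_child` is a made-up helper; a real setup would use memory-backend-file with share=on and an external populating helper):

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: guest RAM is backed by a shared file; a second
 * process maps the same file and touches every page, while the first
 * process (standing in for QEMU) would be free to continue booting. */
static int prealloc_in_child(int fd, size_t size)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* "second process": map the shared file and fault in every page */
        void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) {
            _exit(1);
        }
        size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
        for (size_t off = 0; off < size; off += pagesz) {
            ((volatile char *)mem)[off] = 0;
        }
        _exit(0);
    }
    /* the parent could boot the guest here; for simplicity just reap */
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```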
>
>> ... but then, management tools can simply start QEMU with "-S", start an
>> own timer, and zap QEMU if it didn't manage to come up in time, and
>> simply start a new QEMU instance without preallocation enabled.
>>
>> The "good" thing about that approach is that it will also cover any
>> implicit memory preallocation, like using mlock() or VFIO, that doesn't
>> run in the ordinary per-hostmem preallocation context. If setting QEMU up
>> takes too long, you might want to try a different hypervisor in your
>> cluster instead.
>
> This approach definitely works too, but again it assumes that we always
> want 'prealloc-timeout-fatal' to be on, which is, for the most part, only
> the case for working around issues that might be caused by overcommit.
Can you elaborate? Thanks.
--
Thanks,
David / dhildenb
* Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-23 13:47 ` Daniel P. Berrangé
@ 2023-01-23 14:10 ` David Hildenbrand
2023-01-23 14:14 ` Daniil Tatianin
1 sibling, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2023-01-23 14:10 UTC (permalink / raw)
To: Daniel P. Berrangé, Daniil Tatianin
Cc: Paolo Bonzini, qemu-devel, Stefan Weil, Igor Mammedov, yc-core
On 23.01.23 14:47, Daniel P. Berrangé wrote:
> On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
>> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>>> This series introduces new qemu_prealloc_mem_with_timeout() api,
>>>> which allows limiting the maximum amount of time to be spent on memory
>>>> preallocation. It also adds prealloc statistics collection that is
>>>> exposed via an optional timeout handler.
>>>>
>>>> This new api is then utilized by hostmem for guest RAM preallocation
>>>> controlled via new object properties called 'prealloc-timeout' and
>>>> 'prealloc-timeout-fatal'.
>>>>
>>>> This is useful for limiting VM startup time on systems with
>>>> unpredictable page allocation delays due to memory fragmentation or the
>>>> backing storage. The timeout can be configured to either simply emit a
>>>> warning and continue VM startup without having preallocated the entire
>>>> guest RAM or just abort startup entirely if that is not acceptable for
>>>> a specific use case.
>>>
>>> The major use case for preallocation is memory resources that cannot be
>>> overcommitted (hugetlb, file blocks, ...), to avoid running out of such
>>> resources later, while the guest is already running, and crashing it.
>>
>> Wouldn't you say that preallocating memory for the sake of speeding up guest
>> kernel startup & runtime is a valid use case of prealloc? This way we can
>> avoid expensive (for a multitude of reasons) page faults that will otherwise
>> slow down the guest significantly at runtime and affect the user experience.
>>
>>> Allocating only a fraction "because it takes too long" looks quite
>>> useless in that (main use-case) context. We shouldn't encourage QEMU
>>> users to play with fire in such a way. IOW, there should be no way
>>> around "prealloc-timeout-fatal". Either preallocation succeeded and the
>>> guest can run, or it failed, and the guest can't run.
>>
>> Here we basically accept the fact that e.g with fragmented memory the kernel
>> might take a while in a page fault handler especially for hugetlb because of
>> page compaction that has to run for every fault.
>>
>> This way we can prefault at least some number of pages and let the guest
>> fault the rest on demand later on during runtime even if it's slow and would
>> cause a noticeable lag.
>
> Rather than treat this as a problem that needs a timeout, can we
> restate it as situations need synchronous vs asynchronous
> preallocation ?
>
> For the case where we need synchronous prealloc, current QEMU deals
> with that. If it doesn't work quickly enough, mgmt can just kill
> QEMU already today.
>
> For the case where you would like some prealloc, but don't mind
> if it runs without full prealloc, then why not just treat it as an
> entirely asynchronous task ? Instead of calling qemu_prealloc_mem
> and waiting for it to complete, just spawn a thread to run
> qemu_prealloc_mem, so it doesn't block QEMU startup. This will
> have minimal maint burden on the existing code, and will avoid
> need for mgmt apps to think about what timeout value to give,
> which is good because timeouts are hard to get right.
>
> Most of the time that async background prealloc will still finish
> before the guest even gets out of the firmware phase, but if it
> takes longer it is no big deal. You don't need to quit the prealloc
> job early, you just need it to not delay the guest OS boot IIUC.
>
> This impl could be done with the 'prealloc' property turning from
> a boolean on/off, to a enum on/async/off, where 'on' == sync
> prealloc. Or add a separate 'prealloc-async' bool property
That sounds better to me.
--
Thanks,
David / dhildenb
* Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-23 13:47 ` Daniel P. Berrangé
2023-01-23 14:10 ` David Hildenbrand
@ 2023-01-23 14:14 ` Daniil Tatianin
2023-01-23 14:16 ` David Hildenbrand
2023-01-24 6:57 ` Valentin Sinitsyn
1 sibling, 2 replies; 14+ messages in thread
From: Daniil Tatianin @ 2023-01-23 14:14 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: David Hildenbrand, Paolo Bonzini, qemu-devel, Stefan Weil,
Igor Mammedov, yc-core
On 1/23/23 4:47 PM, Daniel P. Berrangé wrote:
> On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
>> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>>> [...]
>
> Rather than treat this as a problem that needs a timeout, can we
> restate it as situations need synchronous vs asynchronous
> preallocation ?
>
> For the case where we need synchronous prealloc, current QEMU deals
> with that. If it doesn't work quickly enough, mgmt can just kill
> QEMU already today.
>
> For the case where you would like some prealloc, but don't mind
> if it runs without full prealloc, then why not just treat it as an
> entirely asynchronous task ? Instead of calling qemu_prealloc_mem
> and waiting for it to complete, just spawn a thread to run
> qemu_prealloc_mem, so it doesn't block QEMU startup. This will
> have minimal maint burden on the existing code, and will avoid
> need for mgmt apps to think about what timeout value to give,
> which is good because timeouts are hard to get right.
>
> Most of the time that async background prealloc will still finish
> before the guest even gets out of the firmware phase, but if it
> takes longer it is no big deal. You don't need to quit the prealloc
> job early, you just need it to not delay the guest OS boot IIUC.
>
> This impl could be done with the 'prealloc' property turning from
> a boolean on/off, to a enum on/async/off, where 'on' == sync
> prealloc. Or add a separate 'prealloc-async' bool property
I like this idea, but I'm not sure how we would go about writing to live
guest memory. Is that something that can be done safely without racing
with the guest?
> With regards,
> Daniel
* Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-23 14:14 ` Daniil Tatianin
@ 2023-01-23 14:16 ` David Hildenbrand
2023-01-23 16:01 ` Daniel P. Berrangé
2023-01-24 6:57 ` Valentin Sinitsyn
1 sibling, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2023-01-23 14:16 UTC (permalink / raw)
To: Daniil Tatianin, Daniel P. Berrangé
Cc: Paolo Bonzini, qemu-devel, Stefan Weil, Igor Mammedov, yc-core
On 23.01.23 15:14, Daniil Tatianin wrote:
> On 1/23/23 4:47 PM, Daniel P. Berrangé wrote:
>> On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
>>> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>>>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>>>> [...]
>>
>> Rather than treat this as a problem that needs a timeout, can we
>> restate it as situations need synchronous vs asynchronous
>> preallocation ?
>>
>> For the case where we need synchronous prealloc, current QEMU deals
>> with that. If it doesn't work quickly enough, mgmt can just kill
>> QEMU already today.
>>
>> For the case where you would like some prealloc, but don't mind
>> if it runs without full prealloc, then why not just treat it as an
>> entirely asynchronous task ? Instead of calling qemu_prealloc_mem
>> and waiting for it to complete, just spawn a thread to run
>> qemu_prealloc_mem, so it doesn't block QEMU startup. This will
>> have minimal maint burden on the existing code, and will avoid
>> need for mgmt apps to think about what timeout value to give,
>> which is good because timeouts are hard to get right.
>>
>> Most of the time that async background prealloc will still finish
>> before the guest even gets out of the firmware phase, but if it
>> takes longer it is no big deal. You don't need to quit the prealloc
>> job early, you just need it to not delay the guest OS boot IIUC.
>>
>> This impl could be done with the 'prealloc' property turning from
>> a boolean on/off, to a enum on/async/off, where 'on' == sync
>> prealloc. Or add a separate 'prealloc-async' bool property
>
> I like this idea, but I'm not sure how we would go about writing to live
> guest memory. Is that something that can be done safely without racing
> with the guest?
You can use MADV_POPULATE_WRITE safely, as it doesn't actually perform a
write. We'd have to fail async=true if MADV_POPULATE_WRITE cannot be used.
--
Thanks,
David / dhildenb
* Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-23 14:16 ` David Hildenbrand
@ 2023-01-23 16:01 ` Daniel P. Berrangé
0 siblings, 0 replies; 14+ messages in thread
From: Daniel P. Berrangé @ 2023-01-23 16:01 UTC (permalink / raw)
To: David Hildenbrand
Cc: Daniil Tatianin, Paolo Bonzini, qemu-devel, Stefan Weil,
Igor Mammedov, yc-core
On Mon, Jan 23, 2023 at 03:16:03PM +0100, David Hildenbrand wrote:
> On 23.01.23 15:14, Daniil Tatianin wrote:
> > On 1/23/23 4:47 PM, Daniel P. Berrangé wrote:
> > > On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
> > > > On 1/23/23 11:57 AM, David Hildenbrand wrote:
> > > > > On 20.01.23 14:47, Daniil Tatianin wrote:
> > > > > > [...]
> >
> > I like this idea, but I'm not sure how we would go about writing to live
> > guest memory. Is that something that can be done safely without racing
> > with the guest?
>
> You can use MADV_POPULATE_WRITE safely, as it doesn't actually perform a
> write. We'd have to fail async=true if MADV_POPULATE_WRITE cannot be used.
Right, in the short term that means this feature would have limited
availability on our targeted OS platforms, but such issues tend to
fade into irrelevance quicker than we anticipate, as platforms move
forward at such a fast pace.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
2023-01-23 14:14 ` Daniil Tatianin
2023-01-23 14:16 ` David Hildenbrand
@ 2023-01-24 6:57 ` Valentin Sinitsyn
1 sibling, 0 replies; 14+ messages in thread
From: Valentin Sinitsyn @ 2023-01-24 6:57 UTC (permalink / raw)
To: Daniil Tatianin, Daniel P. Berrangé
Cc: David Hildenbrand, Paolo Bonzini, qemu-devel, Stefan Weil,
Igor Mammedov, yc-core
Hello,
On 23.01.2023 19:14, Daniil Tatianin wrote:
> On 1/23/23 4:47 PM, Daniel P. Berrangé wrote:
>> On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
>>> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>>>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>>>> [...]
>>
>> Rather than treat this as a problem that needs a timeout, can we
>> restate it as situations need synchronous vs asynchronous
>> preallocation ?
>>
>> For the case where we need synchronous prealloc, current QEMU deals
>> with that. If it doesn't work quickly enough, mgmt can just kill
>> QEMU already today.
>>
>> For the case where you would like some prealloc, but don't mind
>> if it runs without full prealloc, then why not just treat it as an
>> entirely asynchronous task ? Instead of calling qemu_prealloc_mem
>> and waiting for it to complete, just spawn a thread to run
>> qemu_prealloc_mem, so it doesn't block QEMU startup. This will
>> have minimal maint burden on the existing code, and will avoid
>> need for mgmt apps to think about what timeout value to give,
>> which is good because timeouts are hard to get right.
>>
>> Most of the time that async background prealloc will still finish
>> before the guest even gets out of the firmware phase, but if it
>> takes longer it is no big deal. You don't need to quit the prealloc
>> job early, you just need it to not delay the guest OS boot IIUC.
>>
>> This impl could be done with the 'prealloc' property turning from
>> a boolean on/off, to a enum on/async/off, where 'on' == sync
>> prealloc. Or add a separate 'prealloc-async' bool property
>
> I like this idea, but I'm not sure how we would go about writing to live
> guest memory. Is that something that can be done safely without racing
> with the guest?
Don't forget that prealloc threads will need some CPUs to run, which
would likely result in increased steal time during preallocation for the
guest.
Probably not a big deal, but something to keep in mind.
Best,
Valentine
end of thread [~2023-01-24 7:49 UTC | newest]
Thread overview: 14+ messages
2023-01-20 13:47 [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout Daniil Tatianin
2023-01-20 13:47 ` [PATCH 1/4] oslib: introduce new qemu_prealloc_mem_with_timeout() api Daniil Tatianin
2023-01-20 13:47 ` [PATCH 2/4] backends/hostmem: move memory region preallocation logic into a helper Daniil Tatianin
2023-01-20 13:47 ` [PATCH 3/4] backends/hostmem: add an ability to specify prealloc timeout Daniil Tatianin
2023-01-20 13:47 ` [PATCH 4/4] backends/hostmem: add an ability to make prealloc timeout fatal Daniil Tatianin
2023-01-23 8:57 ` [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout David Hildenbrand
2023-01-23 13:30 ` Daniil Tatianin
2023-01-23 13:47 ` Daniel P. Berrangé
2023-01-23 14:10 ` David Hildenbrand
2023-01-23 14:14 ` Daniil Tatianin
2023-01-23 14:16 ` David Hildenbrand
2023-01-23 16:01 ` Daniel P. Berrangé
2023-01-24 6:57 ` Valentin Sinitsyn
2023-01-23 13:56 ` David Hildenbrand