[PATCH 0/2] overcommit: introduce mem-lock-onfault

* [PATCH 0/2] overcommit: introduce mem-lock-onfault
@ 2024-12-05 23:19 Daniil Tatianin
  2024-12-05 23:19 ` [PATCH 1/2] os: add an ability to lock memory on_fault Daniil Tatianin
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Daniil Tatianin @ 2024-12-05 23:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Daniil Tatianin, Stefan Weil, Peter Xu, Fabiano Rosas, qemu-devel

Currently, passing mem-lock=on to QEMU causes memory usage to grow by
huge amounts:

no memlock:
    $ qemu-system-x86_64 -overcommit mem-lock=off
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    45652

    $ ./qemu-system-x86_64 -overcommit mem-lock=off -enable-kvm
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    39756

memlock:
    $ qemu-system-x86_64 -overcommit mem-lock=on
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    1309876

    $ ./qemu-system-x86_64 -overcommit mem-lock=on -enable-kvm
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    259956

This happens because mlockall(2), when called without MCL_ONFAULT,
immediately write-faults every existing anonymous mapping in the
process, and does the same for every future mapping as soon as it is
created.

One of the reasons to enable mem-lock is to protect a QEMU process's
pages from being compacted and migrated by kcompactd. kcompactd does
this by modifying the page tables of a live process, which causes
thousands of TLB flush IPIs per second and basically steals all guest
time while it is active.

mem-lock=on helps against this (provided compact_unevictable_allowed is
0), but the memory overhead it introduces is an undesirable side effect.
We can avoid that overhead by passing MCL_ONFAULT to mlockall, which
this series makes possible with a new command line option called
mem-lock-onfault.
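
For readers who want to see the difference outside of QEMU, here is a
minimal standalone sketch (not part of this series). It locks the process
with or without MCL_ONFAULT, maps anonymous memory without touching it, and
uses mincore(2) to count how many pages are already resident. The helper
names (mlockall_flags, resident_pages, resident_after_map) are made up for
this illustration. MCL_ONFAULT needs Linux 4.4 or newer, and mlockall can
fail with ENOMEM or EPERM if RLIMIT_MEMLOCK is too low and the process lacks
CAP_IPC_LOCK.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Build the mlockall(2) flag set; on_fault defers locking to first touch. */
int mlockall_flags(bool on_fault)
{
    int flags = MCL_CURRENT | MCL_FUTURE;

    if (on_fault) {
        flags |= MCL_ONFAULT;
    }
    return flags;
}

/* Count resident pages in [addr, addr + len) via mincore(2); -1 on error. */
long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t npages;
    unsigned char *vec;
    long count = 0;

    assert(page > 0);
    npages = (len + (size_t)page - 1) / (size_t)page;
    vec = malloc(npages);
    if (vec == NULL) {
        return -1;
    }
    if (mincore(addr, len, vec) < 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < npages; i++) {
        count += vec[i] & 1;    /* bit 0 set means the page is resident */
    }
    free(vec);
    return count;
}

/*
 * Lock the process, map len bytes of anonymous memory without touching it,
 * and return how many of its pages are resident right after mmap().
 * Returns -1 if locking, mapping or mincore fails.
 */
long resident_after_map(bool on_fault, size_t len)
{
    void *p;
    long n;

    if (mlockall(mlockall_flags(on_fault)) < 0) {
        fprintf(stderr, "mlockall: %s\n", strerror(errno));
        return -1;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        munlockall();
        return -1;
    }
    n = resident_pages(p, len);
    munmap(p, len);
    munlockall();
    return n;
}
```

Going by the mlockall(2) description, resident_after_map(false, len) would
be expected to report every page of the new mapping as resident, while
resident_after_map(true, len) should report none until the pages are
actually touched.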

memlock-onfault:
    $ qemu-system-x86_64 -overcommit mem-lock-onfault=on
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    54004

    $ ./qemu-system-x86_64 -overcommit mem-lock-onfault=on -enable-kvm
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    47772

You may notice that the memory usage is still slightly higher, in this
case by a few megabytes, than in the mem-lock=off case. I was able to
trace this down to a bug in the Linux kernel: MCL_ONFAULT is not honored
for the early process heap (set up with brk(2) etc.), so that heap is
still write-faulted. Even so, the usage is far lower than with plain
mem-lock=on.
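
The numbers above come from ps; the same figure can also be read from inside
a process. The sketch below is only an illustration and not part of the
series: it parses the VmRSS and VmLck fields of /proc/self/status, both
reported in kB, so a program can compare resident and locked memory for
itself. The function names status_field_kb and self_status_kb are
hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find "key:" at the start of a line in /proc/<pid>/status-style text and
 * return the number that follows it, or -1 if the key is not present.
 */
long status_field_kb(const char *text, const char *key)
{
    size_t klen;
    const char *line = text;

    assert(text != NULL && key != NULL);
    klen = strlen(key);

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;

            /* %ld skips the tabs and spaces that pad the value. */
            if (sscanf(line + klen + 1, "%ld", &kb) == 1) {
                return kb;
            }
            return -1;
        }
        line = strchr(line, '\n');
        if (line != NULL) {
            line++;             /* step past the newline to the next line */
        }
    }
    return -1;
}

/* Read one field (e.g. "VmRSS" or "VmLck") for the calling process. */
long self_status_kb(const char *key)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/self/status", "r");

    if (f == NULL) {
        return -1;
    }
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return status_field_kb(buf, key);
}
```

Calling self_status_kb("VmRSS") and self_status_kb("VmLck") before and after
mlockall gives a rough view of how much memory the lock pulled in.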

Daniil Tatianin (2):
  os: add an ability to lock memory on_fault
  overcommit: introduce mem-lock-onfault

 include/sysemu/os-posix.h |  2 +-
 include/sysemu/os-win32.h |  3 ++-
 include/sysemu/sysemu.h   |  1 +
 migration/postcopy-ram.c  |  4 ++--
 os-posix.c                | 10 ++++++++--
 qemu-options.hx           | 13 ++++++++++---
 system/globals.c          |  1 +
 system/vl.c               | 18 ++++++++++++++++--
 8 files changed, 41 insertions(+), 11 deletions(-)

-- 
2.34.1

* [PATCH 1/2] os: add an ability to lock memory on_fault
  2024-12-05 23:19 [PATCH 0/2] overcommit: introduce mem-lock-onfault Daniil Tatianin
@ 2024-12-05 23:19 ` Daniil Tatianin
  2024-12-05 23:19 ` [PATCH 2/2] overcommit: introduce mem-lock-onfault Daniil Tatianin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Daniil Tatianin @ 2024-12-05 23:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Daniil Tatianin, Stefan Weil, Peter Xu, Fabiano Rosas, qemu-devel

This will be used in the following commits to make it possible to only
lock memory on fault instead of right away.

Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
 include/sysemu/os-posix.h |  2 +-
 include/sysemu/os-win32.h |  3 ++-
 migration/postcopy-ram.c  |  2 +-
 os-posix.c                | 10 ++++++++--
 system/vl.c               |  2 +-
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/sysemu/os-posix.h b/include/sysemu/os-posix.h
index b881ac6c6f..ce5b3bccf8 100644
--- a/include/sysemu/os-posix.h
+++ b/include/sysemu/os-posix.h
@@ -53,7 +53,7 @@ bool os_set_runas(const char *user_id);
 void os_set_chroot(const char *path);
 void os_setup_limits(void);
 void os_setup_post(void);
-int os_mlock(void);
+int os_mlock(bool on_fault);
 
 /**
  * qemu_alloc_stack:
diff --git a/include/sysemu/os-win32.h b/include/sysemu/os-win32.h
index b82a5d3ad9..cd61d69e10 100644
--- a/include/sysemu/os-win32.h
+++ b/include/sysemu/os-win32.h
@@ -123,8 +123,9 @@ static inline bool is_daemonized(void)
     return false;
 }
 
-static inline int os_mlock(void)
+static inline int os_mlock(bool on_fault)
 {
+    (void)on_fault;
     return -ENOSYS;
 }
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index a535fd2e30..36ec6a3d75 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -652,7 +652,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
     }
 
     if (enable_mlock) {
-        if (os_mlock() < 0) {
+        if (os_mlock(false) < 0) {
             error_report("mlock: %s", strerror(errno));
             /*
              * It doesn't feel right to fail at this point, we have a valid
diff --git a/os-posix.c b/os-posix.c
index 43f9a43f3f..0948128134 100644
--- a/os-posix.c
+++ b/os-posix.c
@@ -327,18 +327,24 @@ void os_set_line_buffering(void)
     setvbuf(stdout, NULL, _IOLBF, 0);
 }
 
-int os_mlock(void)
+int os_mlock(bool on_fault)
 {
 #ifdef HAVE_MLOCKALL
     int ret = 0;
+    int flags = MCL_CURRENT | MCL_FUTURE;
 
-    ret = mlockall(MCL_CURRENT | MCL_FUTURE);
+    if (on_fault) {
+        flags |= MCL_ONFAULT;
+    }
+
+    ret = mlockall(flags);
     if (ret < 0) {
         error_report("mlockall: %s", strerror(errno));
     }
 
     return ret;
 #else
+    (void)on_fault;
     return -ENOSYS;
 #endif
 }
diff --git a/system/vl.c b/system/vl.c
index 54998fdbc7..03819a80ef 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -793,7 +793,7 @@ static QemuOptsList qemu_run_with_opts = {
 static void realtime_init(void)
 {
     if (enable_mlock) {
-        if (os_mlock() < 0) {
+        if (os_mlock(false) < 0) {
             error_report("locking memory failed");
             exit(1);
         }
-- 
2.34.1



* [PATCH 2/2] overcommit: introduce mem-lock-onfault
  2024-12-05 23:19 [PATCH 0/2] overcommit: introduce mem-lock-onfault Daniil Tatianin
  2024-12-05 23:19 ` [PATCH 1/2] os: add an ability to lock memory on_fault Daniil Tatianin
@ 2024-12-05 23:19 ` Daniil Tatianin
  2024-12-06  1:08 ` [PATCH 0/2] " Peter Xu
  2024-12-10 14:48 ` Vladimir Sementsov-Ogievskiy
  3 siblings, 0 replies; 10+ messages in thread
From: Daniil Tatianin @ 2024-12-05 23:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Daniil Tatianin, Stefan Weil, Peter Xu, Fabiano Rosas, qemu-devel

Locking memory without MCL_ONFAULT instantly prefaults all mmap()ed
anonymous memory with a write fault, which adds a lot of memory usage
overhead when all you want is to prevent kcompactd from migrating and
compacting QEMU pages. Add an option that locks pages lazily, as the
process faults them in, by passing MCL_ONFAULT when asked.

Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
 include/sysemu/sysemu.h  |  1 +
 migration/postcopy-ram.c |  4 ++--
 qemu-options.hx          | 13 ++++++++++---
 system/globals.c         |  1 +
 system/vl.c              | 18 ++++++++++++++++--
 5 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 7ec419ce13..b6519c3c1e 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -44,6 +44,7 @@ extern const char *keyboard_layout;
 extern int old_param;
 extern uint8_t *boot_splash_filedata;
 extern bool enable_mlock;
+extern bool enable_mlock_onfault;
 extern bool enable_cpu_pm;
 extern QEMUClockType rtc_clock;
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 36ec6a3d75..8ff8c73a27 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -651,8 +651,8 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
         mis->have_fault_thread = false;
     }
 
-    if (enable_mlock) {
-        if (os_mlock(false) < 0) {
+    if (enable_mlock || enable_mlock_onfault) {
+        if (os_mlock(enable_mlock_onfault) < 0) {
             error_report("mlock: %s", strerror(errno));
             /*
              * It doesn't feel right to fail at this point, we have a valid
diff --git a/qemu-options.hx b/qemu-options.hx
index dacc9790a4..477e0e439a 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4566,21 +4566,28 @@ SRST
 ERST
 
 DEF("overcommit", HAS_ARG, QEMU_OPTION_overcommit,
-    "-overcommit [mem-lock=on|off][cpu-pm=on|off]\n"
+    "-overcommit [mem-lock=on|off][mem-lock-onfault=on|off][cpu-pm=on|off]\n"
     "                run qemu with overcommit hints\n"
     "                mem-lock=on|off controls memory lock support (default: off)\n"
+    "                mem-lock-onfault=on|off controls memory lock on fault support (default: off)\n"
     "                cpu-pm=on|off controls cpu power management (default: off)\n",
     QEMU_ARCH_ALL)
 SRST
 ``-overcommit mem-lock=on|off``
   \ 
+``-overcommit mem-lock-onfault=on|off``
+  \
 ``-overcommit cpu-pm=on|off``
     Run qemu with hints about host resource overcommit. The default is
     to assume that host overcommits all resources.
 
     Locking qemu and guest memory can be enabled via ``mem-lock=on``
-    (disabled by default). This works when host memory is not
-    overcommitted and reduces the worst-case latency for guest.
+    or ``mem-lock-onfault=on`` (disabled by default). This works when
+    host memory is not overcommitted and reduces the worst-case latency for
+    guest. The on-fault option is better for reducing the memory footprint
+    since it makes allocations lazy, but the pages still get locked in place
+    once faulted by the guest or QEMU. Note that the two options are mutually
+    exclusive.
 
     Guest ability to manage power state of host cpus (increasing latency
     for other processes on the same host cpu, but decreasing latency for
diff --git a/system/globals.c b/system/globals.c
index 84ce943ac9..43501fe690 100644
--- a/system/globals.c
+++ b/system/globals.c
@@ -35,6 +35,7 @@ enum vga_retrace_method vga_retrace_method = VGA_RETRACE_DUMB;
 int display_opengl;
 const char* keyboard_layout;
 bool enable_mlock;
+bool enable_mlock_onfault;
 bool enable_cpu_pm;
 int autostart = 1;
 int vga_interface_type = VGA_NONE;
diff --git a/system/vl.c b/system/vl.c
index 03819a80ef..89477f38bc 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -349,6 +349,10 @@ static QemuOptsList qemu_overcommit_opts = {
             .name = "mem-lock",
             .type = QEMU_OPT_BOOL,
         },
+        {
+            .name = "mem-lock-onfault",
+            .type = QEMU_OPT_BOOL,
+        },
         {
             .name = "cpu-pm",
             .type = QEMU_OPT_BOOL,
@@ -792,8 +796,8 @@ static QemuOptsList qemu_run_with_opts = {
 
 static void realtime_init(void)
 {
-    if (enable_mlock) {
-        if (os_mlock(false) < 0) {
+    if (enable_mlock || enable_mlock_onfault) {
+        if (os_mlock(enable_mlock_onfault) < 0) {
             error_report("locking memory failed");
             exit(1);
         }
@@ -3537,7 +3541,17 @@ void qemu_init(int argc, char **argv)
                 if (!opts) {
                     exit(1);
                 }
+
                 enable_mlock = qemu_opt_get_bool(opts, "mem-lock", enable_mlock);
+                enable_mlock_onfault = qemu_opt_get_bool(opts,
+                                                         "mem-lock-onfault",
+                                                         enable_mlock_onfault);
+                if (enable_mlock && enable_mlock_onfault) {
+                    error_report("mem-lock and mem-lock-onfault are mutually "
+                                 "exclusive");
+                    exit(1);
+                }
+
                 enable_cpu_pm = qemu_opt_get_bool(opts, "cpu-pm", enable_cpu_pm);
                 break;
             case QEMU_OPTION_compat:
-- 
2.34.1



* Re: [PATCH 0/2] overcommit: introduce mem-lock-onfault
  2024-12-05 23:19 [PATCH 0/2] overcommit: introduce mem-lock-onfault Daniil Tatianin
  2024-12-05 23:19 ` [PATCH 1/2] os: add an ability to lock memory on_fault Daniil Tatianin
  2024-12-05 23:19 ` [PATCH 2/2] overcommit: introduce mem-lock-onfault Daniil Tatianin
@ 2024-12-06  1:08 ` Peter Xu
  2024-12-09  7:40   ` Daniil Tatianin
  2024-12-10 14:48 ` Vladimir Sementsov-Ogievskiy
  3 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2024-12-06  1:08 UTC (permalink / raw)
  To: Daniil Tatianin; +Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, qemu-devel

On Fri, Dec 06, 2024 at 02:19:06AM +0300, Daniil Tatianin wrote:
> Currently, passing mem-lock=on to QEMU causes memory usage to grow by
> huge amounts:
> 
> no memlock:
>     $ qemu-system-x86_64 -overcommit mem-lock=off
>     $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>     45652
> 
>     $ ./qemu-system-x86_64 -overcommit mem-lock=off -enable-kvm
>     $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>     39756
> 
> memlock:
>     $ qemu-system-x86_64 -overcommit mem-lock=on
>     $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>     1309876
> 
>     $ ./qemu-system-x86_64 -overcommit mem-lock=on -enable-kvm
>     $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>     259956
> 
> This is caused by the fact that mlockall(2) automatically
> write-faults every existing and future anonymous mappings in the
> process right away.
> 
> One of the reasons to enable mem-lock is to protect a QEMU process'
> pages from being compacted and migrated by kcompactd (which does so
> by messing with a live process page tables causing thousands of TLB
> flush IPIs per second) basically stealing all guest time while it's
> active.
> 
> mem-lock=on helps against this (given compact_unevictable_allowed is 0),
> but the memory overhead it introduces is an undesirable side effect,
> which we can completely avoid by passing MCL_ONFAULT to mlockall, which
> is what this series allows to do with a new command line option called
> mem-lock-onfault.

IMHO it's always helpful to dig in and explain why such a difference
exists.  E.g. guest mem should normally be the major mem sink, and that
definitely won't be affected by whether ON_FAULT is used or not.

I had a quick look specifically at tcg (as that really surprised me a bit..).
When you look at the mappings there's a constant 1G shmem map that always
gets locked and populated.

It turns out to be tcg's jit buffer, alloc_code_gen_buffer_splitwx_memfd:

    buf_rw = qemu_memfd_alloc("tcg-jit", size, 0, &fd, errp);
    if (buf_rw == NULL) {
        goto fail;
    }

    buf_rx = mmap(NULL, size, host_prot_read_exec(), MAP_SHARED, fd, 0);
    if (buf_rx == MAP_FAILED) {
        error_setg_errno(errp, errno,
                         "failed to map shared memory for execute");
        goto fail;
    }

Looks like that's the major reason why tcg with mlockall is constantly
bloated by roughly 1G - that seems to come from tcg_init_machine().  I
didn't check kvm.

Logically, having an on-fault option won't ever hurt, so it's probably not
an issue to have it anyway.  Still, I'm sharing my finding above, as IIUC
that's mostly why it was bloated for tcg, so maybe there are other options
too.
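
To make the shape of that mapping concrete, here is a small standalone
sketch of the same pattern: one memfd mapped twice with MAP_SHARED. It is an
assumption-laden illustration, not QEMU code. It calls memfd_create(2)
directly instead of qemu_memfd_alloc(), maps the second view read-only
rather than read+exec, and the names dual_map, dual_map_create and
dual_map_destroy are invented here. Both views are backed by the same shmem
pages, which is the kind of mapping described above as getting locked and
populated under mem-lock=on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Two views of the same shared-memory file. */
struct dual_map {
    void *rw;       /* writable view (where a JIT would emit code) */
    void *ro;       /* second view of the same pages, read-only here */
    size_t size;
};

/* Create a memfd of the given size and map it twice; 0 on success. */
int dual_map_create(struct dual_map *m, size_t size)
{
    int fd = memfd_create("demo-jit", MFD_CLOEXEC);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }

    m->rw = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m->rw == MAP_FAILED) {
        close(fd);
        return -1;
    }

    m->ro = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);      /* the mappings keep the backing pages alive */
    if (m->ro == MAP_FAILED) {
        munmap(m->rw, size);
        return -1;
    }

    m->size = size;
    return 0;
}

/* Unmap both views. */
void dual_map_destroy(struct dual_map *m)
{
    assert(m != NULL);
    munmap(m->rw, m->size);
    munmap(m->ro, m->size);
}
```

Anything written through the rw view is immediately visible through the
other view, because both map the same file offset.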

> 
> memlock-onfault:
>     $ qemu-system-x86_64 -overcommit mem-lock-onfault=on
>     $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>     54004
> 
>     $ ./qemu-system-x86_64 -overcommit mem-lock-onfault=on -enable-kvm
>     $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>     47772
> 
> You may notice the memory usage is still slightly higher, in this case
> by a few megabytes over the mem-lock=off case. I was able to trace this
> down to a bug in the linux kernel with MCL_ONFAULT not being honored for
> the early process heap (with brk(2) etc.) so it is still write-faulted in
> this case, but it's still way less than it was with just the mem-lock=on.
> 
> Daniil Tatianin (2):
>   os: add an ability to lock memory on_fault
>   overcommit: introduce mem-lock-onfault
> 
>  include/sysemu/os-posix.h |  2 +-
>  include/sysemu/os-win32.h |  3 ++-
>  include/sysemu/sysemu.h   |  1 +
>  migration/postcopy-ram.c  |  4 ++--
>  os-posix.c                | 10 ++++++++--
>  qemu-options.hx           | 13 ++++++++++---
>  system/globals.c          |  1 +
>  system/vl.c               | 18 ++++++++++++++++--
>  8 files changed, 41 insertions(+), 11 deletions(-)
> 
> -- 
> 2.34.1
> 
> 

-- 
Peter Xu



* Re: [PATCH 0/2] overcommit: introduce mem-lock-onfault
  2024-12-06  1:08 ` [PATCH 0/2] " Peter Xu
@ 2024-12-09  7:40   ` Daniil Tatianin
  2024-12-10 16:48     ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Daniil Tatianin @ 2024-12-09  7:40 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, qemu-devel

On 12/6/24 4:08 AM, Peter Xu wrote:

> On Fri, Dec 06, 2024 at 02:19:06AM +0300, Daniil Tatianin wrote:
>> Currently, passing mem-lock=on to QEMU causes memory usage to grow by
>> huge amounts:
>>
>> no memlock:
>>      $ qemu-system-x86_64 -overcommit mem-lock=off
>>      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>      45652
>>
>>      $ ./qemu-system-x86_64 -overcommit mem-lock=off -enable-kvm
>>      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>      39756
>>
>> memlock:
>>      $ qemu-system-x86_64 -overcommit mem-lock=on
>>      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>      1309876
>>
>>      $ ./qemu-system-x86_64 -overcommit mem-lock=on -enable-kvm
>>      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>      259956
>>
>> This is caused by the fact that mlockall(2) automatically
>> write-faults every existing and future anonymous mappings in the
>> process right away.
>>
>> One of the reasons to enable mem-lock is to protect a QEMU process'
>> pages from being compacted and migrated by kcompactd (which does so
>> by messing with a live process page tables causing thousands of TLB
>> flush IPIs per second) basically stealing all guest time while it's
>> active.
>>
>> mem-lock=on helps against this (given compact_unevictable_allowed is 0),
>> but the memory overhead it introduces is an undesirable side effect,
>> which we can completely avoid by passing MCL_ONFAULT to mlockall, which
>> is what this series allows to do with a new command line option called
>> mem-lock-onfault.
> IMHO it'll be always helpful to dig and provide information on why such
> difference existed.  E.g. guest mem should normally be the major mem sink
> and that definitely won't be affected by either ON_FAULT or not.
>
> I had a quick look explicitly on tcg (as that really surprised me a bit..).
> When you look at the mappings there's 1G constant shmem map that always got
> locked and populated.
>
> It turns out to be tcg's jit buffer, alloc_code_gen_buffer_splitwx_memfd:

Thanks for looking into this! I'd guessed it was something to do with 
JIT, makes sense.

>      buf_rw = qemu_memfd_alloc("tcg-jit", size, 0, &fd, errp);
>      if (buf_rw == NULL) {
>          goto fail;
>      }
>
>      buf_rx = mmap(NULL, size, host_prot_read_exec(), MAP_SHARED, fd, 0);
>      if (buf_rx == MAP_FAILED) {
>          error_setg_errno(errp, errno,
>                           "failed to map shared memory for execute");
>          goto fail;
>      }
>
> Looks like that's the major reason why tcg has mlockall bloated constantly
> with roughly 1G size - that seems to be from tcg_init_machine().  I didn't
> check kvm.
>
> Logically having a on-fault option won't ever hurt, so probably not an
> issue to have it anyway.  Still, share my finding above, as IIUC that's
> mostly why it was bloated for tcg, so maybe there're other options too.

Yeah, the situation with KVM is slightly better, although it's still a 
~200MiB overhead with default Q35 and no extra devices (I haven't 
measured the difference with various devices).

I think it's definitely nice to have an on-fault option for this, as 
optimizing every possible mmap caller for the rare mem-lock=on case 
might be too ambitious.

Thanks!

>
>> memlock-onfault:
>>      $ qemu-system-x86_64 -overcommit mem-lock-onfault=on
>>      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>      54004
>>
>>      $ ./qemu-system-x86_64 -overcommit mem-lock-onfault=on -enable-kvm
>>      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>      47772
>>
>> You may notice the memory usage is still slightly higher, in this case
>> by a few megabytes over the mem-lock=off case. I was able to trace this
>> down to a bug in the linux kernel with MCL_ONFAULT not being honored for
>> the early process heap (with brk(2) etc.) so it is still write-faulted in
>> this case, but it's still way less than it was with just the mem-lock=on.
>>
>> Daniil Tatianin (2):
>>    os: add an ability to lock memory on_fault
>>    overcommit: introduce mem-lock-onfault
>>
>>   include/sysemu/os-posix.h |  2 +-
>>   include/sysemu/os-win32.h |  3 ++-
>>   include/sysemu/sysemu.h   |  1 +
>>   migration/postcopy-ram.c  |  4 ++--
>>   os-posix.c                | 10 ++++++++--
>>   qemu-options.hx           | 13 ++++++++++---
>>   system/globals.c          |  1 +
>>   system/vl.c               | 18 ++++++++++++++++--
>>   8 files changed, 41 insertions(+), 11 deletions(-)
>>
>> -- 
>> 2.34.1
>>
>>


* Re: [PATCH 0/2] overcommit: introduce mem-lock-onfault
  2024-12-05 23:19 [PATCH 0/2] overcommit: introduce mem-lock-onfault Daniil Tatianin
                   ` (2 preceding siblings ...)
  2024-12-06  1:08 ` [PATCH 0/2] " Peter Xu
@ 2024-12-10 14:48 ` Vladimir Sementsov-Ogievskiy
  3 siblings, 0 replies; 10+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2024-12-10 14:48 UTC (permalink / raw)
  To: Daniil Tatianin, Paolo Bonzini
  Cc: Stefan Weil, Peter Xu, Fabiano Rosas, qemu-devel

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir



* Re: [PATCH 0/2] overcommit: introduce mem-lock-onfault
  2024-12-09  7:40   ` Daniil Tatianin
@ 2024-12-10 16:48     ` Peter Xu
  2024-12-10 17:01       ` Daniil Tatianin
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2024-12-10 16:48 UTC (permalink / raw)
  To: Daniil Tatianin; +Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, qemu-devel

On Mon, Dec 09, 2024 at 10:40:51AM +0300, Daniil Tatianin wrote:
> On 12/6/24 4:08 AM, Peter Xu wrote:
> 
> > On Fri, Dec 06, 2024 at 02:19:06AM +0300, Daniil Tatianin wrote:
> > > Currently, passing mem-lock=on to QEMU causes memory usage to grow by
> > > huge amounts:
> > > 
> > > no memlock:
> > >      $ qemu-system-x86_64 -overcommit mem-lock=off
> > >      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
> > >      45652
> > > 
> > >      $ ./qemu-system-x86_64 -overcommit mem-lock=off -enable-kvm
> > >      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
> > >      39756
> > > 
> > > memlock:
> > >      $ qemu-system-x86_64 -overcommit mem-lock=on
> > >      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
> > >      1309876
> > > 
> > >      $ ./qemu-system-x86_64 -overcommit mem-lock=on -enable-kvm
> > >      $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
> > >      259956
> > > 
> > > This is caused by the fact that mlockall(2) automatically
> > > write-faults every existing and future anonymous mappings in the
> > > process right away.
> > > 
> > > One of the reasons to enable mem-lock is to protect a QEMU process'
> > > pages from being compacted and migrated by kcompactd (which does so
> > > by messing with a live process page tables causing thousands of TLB
> > > flush IPIs per second) basically stealing all guest time while it's
> > > active.
> > > 
> > > mem-lock=on helps against this (given compact_unevictable_allowed is 0),
> > > but the memory overhead it introduces is an undesirable side effect,
> > > which we can completely avoid by passing MCL_ONFAULT to mlockall, which
> > > is what this series allows to do with a new command line option called
> > > mem-lock-onfault.
> > IMHO it'll be always helpful to dig and provide information on why such
> > difference existed.  E.g. guest mem should normally be the major mem sink
> > and that definitely won't be affected by either ON_FAULT or not.
> > 
> > I had a quick look explicitly on tcg (as that really surprised me a bit..).
> > When you look at the mappings there's 1G constant shmem map that always got
> > locked and populated.
> > 
> > It turns out to be tcg's jit buffer, alloc_code_gen_buffer_splitwx_memfd:
> 
> Thanks for looking into this! I'd guessed it was something to do with JIT,
> makes sense.
> 
> >      buf_rw = qemu_memfd_alloc("tcg-jit", size, 0, &fd, errp);
> >      if (buf_rw == NULL) {
> >          goto fail;
> >      }
> > 
> >      buf_rx = mmap(NULL, size, host_prot_read_exec(), MAP_SHARED, fd, 0);
> >      if (buf_rx == MAP_FAILED) {
> >          error_setg_errno(errp, errno,
> >                           "failed to map shared memory for execute");
> >          goto fail;
> >      }
> > 
> > Looks like that's the major reason why tcg has mlockall bloated constantly
> > with roughly 1G size - that seems to be from tcg_init_machine().  I didn't
> > check kvm.
> > 
> > Logically having a on-fault option won't ever hurt, so probably not an
> > issue to have it anyway.  Still, share my finding above, as IIUC that's
> > mostly why it was bloated for tcg, so maybe there're other options too.
> 
> Yeah, the situation with KVM is slightly better, although it's still a
> ~200MiB overhead with default Q35 and no extra devices (I haven't measured
> the difference with various devices).
> 
> I think it's definitely nice to have an on-fault option for this, as
> optimizing every possible mmap caller for the rare mem-lock=on case might be
> too ambitious.

It really depends, IMHO, and that's why I haven't acked the series yet.

It may be relevant to the trade-off here of allowing faults to happen later
even if mem-lock=on.  The question is why you would like to lock the memory,
for example in your use case.

Take kvm-rt as an example: I believe locking is needed there because RT apps
(running in the guest) would like to avoid page faults throughout the stack,
so that the guest workload, especially its latency, is predictable.

If on-fault is enabled, it could already defeat that purpose.

Or maybe the current use case is making sure that after QEMU boots the
memory will always be present, so that even if the host later faces memory
stress it won't affect anything running the VM, as everything was
pre-allocated (so that's beyond memory-backend-*,prealloc=on, because it
covers QEMU/KVM memory too).  Meanwhile locked pages won't be swapped out,
so they're always there.

But with on-fault, the pages will only be locked upon access, which means
the guarantee that "QEMU secures the memory on boot" is gone too.

That's why I was wondering whether your specific use case really wants
on-fault, or whether you'd rather e.g. have a limit on the tcg-jit buffer
(or likewise on whatever kvm was consuming): you don't want that large a
buffer, but you still want everything locked upfront.  It can be relevant to
why your use case started to use mem-lock=on before this on-fault flag.

OTOH, I believe on-fault cannot work with kvm-rt at all, because of the
faults that can happen later on - even if a fault happens in KVM and even if
it's not about accessing guest mem, it can still add overhead later when
running the RT application in the guest, hence it can start to break RT
determinism.

Thanks,

-- 
Peter Xu

* Re: [PATCH 0/2] overcommit: introduce mem-lock-onfault
  2024-12-10 16:48     ` Peter Xu
@ 2024-12-10 17:01       ` Daniil Tatianin
  2024-12-10 17:20         ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Daniil Tatianin @ 2024-12-10 17:01 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, qemu-devel

On 12/10/24 7:48 PM, Peter Xu wrote:

> On Mon, Dec 09, 2024 at 10:40:51AM +0300, Daniil Tatianin wrote:
>> On 12/6/24 4:08 AM, Peter Xu wrote:
>>
>>> On Fri, Dec 06, 2024 at 02:19:06AM +0300, Daniil Tatianin wrote:
>>>> Currently, passing mem-lock=on to QEMU causes memory usage to grow by
>>>> huge amounts:
>>>>
>>>> no memlock:
>>>>       $ qemu-system-x86_64 -overcommit mem-lock=off
>>>>       $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>>>       45652
>>>>
>>>>       $ ./qemu-system-x86_64 -overcommit mem-lock=off -enable-kvm
>>>>       $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>>>       39756
>>>>
>>>> memlock:
>>>>       $ qemu-system-x86_64 -overcommit mem-lock=on
>>>>       $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>>>       1309876
>>>>
>>>>       $ ./qemu-system-x86_64 -overcommit mem-lock=on -enable-kvm
>>>>       $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
>>>>       259956
>>>>
>>>> This is caused by the fact that mlockall(2) automatically
>>>> write-faults every existing and future anonymous mappings in the
>>>> process right away.
>>>>
>>>> One of the reasons to enable mem-lock is to protect a QEMU process'
>>>> pages from being compacted and migrated by kcompactd (which does so
>>>> by messing with a live process page tables causing thousands of TLB
>>>> flush IPIs per second) basically stealing all guest time while it's
>>>> active.
>>>>
>>>> mem-lock=on helps against this (given compact_unevictable_allowed is 0),
>>>> but the memory overhead it introduces is an undesirable side effect,
>>>> which we can completely avoid by passing MCL_ONFAULT to mlockall, which
>>>> is what this series allows to do with a new command line option called
>>>> mem-lock-onfault.
>>> IMHO it'll be always helpful to dig and provide information on why such
>>> difference existed.  E.g. guest mem should normally be the major mem sink
>>> and that definitely won't be affected by either ON_FAULT or not.
>>>
>>> I had a quick look explicitly on tcg (as that really surprised me a bit..).
>>> When you look at the mappings there's 1G constant shmem map that always got
>>> locked and populated.
>>>
>>> It turns out to be tcg's jit buffer, alloc_code_gen_buffer_splitwx_memfd:
>> Thanks for looking into this! I'd guessed it was something to do with JIT,
>> makes sense.
>>
>>>       buf_rw = qemu_memfd_alloc("tcg-jit", size, 0, &fd, errp);
>>>       if (buf_rw == NULL) {
>>>           goto fail;
>>>       }
>>>
>>>       buf_rx = mmap(NULL, size, host_prot_read_exec(), MAP_SHARED, fd, 0);
>>>       if (buf_rx == MAP_FAILED) {
>>>           error_setg_errno(errp, errno,
>>>                            "failed to map shared memory for execute");
>>>           goto fail;
>>>       }
>>>
>>> Looks like that's the major reason why tcg has mlockall bloated constantly
>>> with roughly 1G size - that seems to be from tcg_init_machine().  I didn't
>>> check kvm.
>>>
>>> Logically having a on-fault option won't ever hurt, so probably not an
>>> issue to have it anyway.  Still, share my finding above, as IIUC that's
>>> mostly why it was bloated for tcg, so maybe there're other options too.
>> Yeah, the situation with KVM is slightly better, although it's still a
>> ~200MiB overhead with default Q35 and no extra devices (I haven't measured
>> the difference with various devices).
>>
>> I think it's definitely nice to have an on-fault option for this, as
>> optimizing every possible mmap caller for the rare mem-lock=on case might be
>> too ambitious.
> It really depends, IMHO, and that's why I didn't already ack the series.
>
> It may be relevant to the trade-off here on allowing faults to happen later
> even if mem-lock=on.  The question is why, for example in your use case,
> would like to lock the memory.
>
> Take kvm-rt as example, I believe that's needed because RT apps (running in
> the guest) would like to avoid page faults throughout the stack, so that
> guest workload, especially on the latency part of things, is predictable.
>
> Here if on-fault is enabled it could beat that purpose already.
>
> Or if the current use case is making sure after QEMU boots the memory will
> always present so that even if later the host faces memory stress it won't
> affect anything running the VM as it pre-allocated everything (so that's
> beyond memory-backend-*,prealloc=on, because it covers QEMU/KVM memory
> too).  Meanwhile locked pages won't swap out, so it's always there.
>
> But then with on-fault, it means the pages will only be locked upon access.
> Then it means the guarantee on "QEMU secures the memory on boot" is gone
> too.
>
> That's why I was thinking whether your specific use case really wants
> on-fault, or you do want e.g. to have a limit on the tcg-jit buffer instead
> (or same to whatever kvm was consuming), so you don't want that large a
> buffer, however you still want to have all things locked up upfront.  It
> can be relevant to why your use case started to use mem-lock=on before this
> on-fault flag.

I mentioned my use case in the cover letter. Basically, we want to
protect QEMU's pages from being migrated and compacted by kcompactd.
It does that by modifying live page tables and spamming the process
with TLB invalidate IPIs, which kills guest performance for the
duration of the compaction operation.

Memory locking protects a process from kcompactd page compaction and,
more importantly, migration (that is, taking a PTE and replacing it
with one that is closer in memory, to reduce fragmentation), as long
as /proc/sys/vm/compact_unevictable_allowed is 0.
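
Since the protection described above depends on that sysctl, here is a
rough sketch of how a launcher could check it before relying on memory
locking. The function names are made up for illustration and are not
part of QEMU or of this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the text of a sysctl file that holds a
 * single decimal integer. Returns 0 and stores the value in *out on
 * success, -1 if the text is not a plain integer. */
int parse_sysctl_int(const char *text, long *out)
{
    char *end;
    long v;

    assert(text != NULL && out != NULL);
    errno = 0;
    v = strtol(text, &end, 10);
    if (errno != 0 || end == text) {
        return -1;
    }
    /* Tolerate the trailing newline procfs appends. */
    while (*end == ' ' || *end == '\t' || *end == '\n') {
        end++;
    }
    if (*end != '\0') {
        return -1;
    }
    *out = v;
    return 0;
}

/* Hypothetical helper: returns 1 if compaction is allowed to touch
 * unevictable (mlocked) pages, 0 if it is not, -1 if the sysctl
 * could not be read or parsed. */
int compact_unevictable_allowed(void)
{
    char buf[32];
    long v;
    FILE *f = fopen("/proc/sys/vm/compact_unevictable_allowed", "r");

    if (f == NULL) {
        return -1;
    }
    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    if (parse_sysctl_int(buf, &v) < 0) {
        return -1;
    }
    return v != 0;
}
```

A caller would treat a return value of 1 as "locking will not keep
kcompactd away from these pages" and warn accordingly.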

For this use case we don't mind page faults, as they take more or less
constant time, and we could avoid them too if we wanted by
preallocating guest memory. We do, however, want PTEs to be left
untouched by kcompactd, which MCL_ONFAULT accomplishes just fine,
without the extra memory overhead that comes from various anonymous
mappings getting write-faulted with the currently available
mem-lock=on option.
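
To make the difference concrete, here is a minimal sketch of the two
mlockall(2) flag combinations being compared. The enum and function
names are hypothetical and are not taken from the patches; MCL_ONFAULT
needs Linux 4.4 or newer, and the call can still fail (for example
against RLIMIT_MEMLOCK), so the return value has to be checked.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

/* Hypothetical names for this sketch; not QEMU's actual types. */
enum lock_mode { LOCK_OFF, LOCK_ON, LOCK_ONFAULT };

/* Map a lock mode to mlockall(2) flags. Returns 0 for LOCK_OFF and
 * -1 if on-fault locking is unavailable in the build headers. */
int mlockall_flags(enum lock_mode mode)
{
    switch (mode) {
    case LOCK_OFF:
        return 0;
    case LOCK_ON:
        /* Lock and populate all current and future mappings. */
        return MCL_CURRENT | MCL_FUTURE;
    case LOCK_ONFAULT:
#ifdef MCL_ONFAULT
        /* Same coverage, but a page is locked only once it has
         * been faulted in, so untouched mappings stay empty. */
        return MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT;
#else
        return -1;
#endif
    }
    assert(0 && "invalid lock_mode");
    return -1;
}

/* Apply the requested mode to the calling process. Returns 0 on
 * success, -1 with errno set on failure. */
int lock_process_memory(enum lock_mode mode)
{
    int flags = mlockall_flags(mode);

    if (flags < 0) {
        errno = ENOTSUP;
        return -1;
    }
    if (flags == 0) {
        return 0; /* nothing to lock */
    }
    return mlockall(flags);
}
```

Calling lock_process_memory(LOCK_ONFAULT) early in startup would give
the kcompactd protection described above without populating every
mapping up front.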

In our case we use KVM, of course; TCG was just an experiment where I
noticed anonymous memory jump way too much.

I don't think it's feasible in our case to look for the origin of
every anonymous mapping that grew compared to the no mem-lock case
(there are about 30 of them with default Q35 + KVM, without any extra
devices) and try to optimize each one to map anonymous memory less
eagerly.

Thanks!

>
> OTOH, I believe on-fault cannot work with kvm-rt at all already, because of
> its possible faults happening later on - even if the fault can happen in
> KVM and even if it's not about accessing guest mem, it can still be part of
> overhead later when running the rt application in the guest, hence it can
> start to break RT deterministics.
>
> Thanks,
>



* Re: [PATCH 0/2] overcommit: introduce mem-lock-onfault
  2024-12-10 17:01       ` Daniil Tatianin
@ 2024-12-10 17:20         ` Peter Xu
  2024-12-10 17:23           ` Daniil Tatianin
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2024-12-10 17:20 UTC (permalink / raw)
  To: Daniil Tatianin; +Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, qemu-devel

On Tue, Dec 10, 2024 at 08:01:08PM +0300, Daniil Tatianin wrote:
> I mentioned my use case in the cover letter. Basically we want to protect
> QEMU's pages from being migrated and compacted by kcompactd, which it
> accomplishes by modifying live page tables and spamming the process with TLB
> invalidate IPIs while it does that, which kills guest performance for the
> duration of the compaction operation.

Ah right, I read it initially but just now when I scanned the cover letter
I missed that.  My fault.

> 
> Memory locking allows to protect a process from kcompactd page compaction
> and more importantly, migration (that is taking a PTE and replacing it with
> one, which is closer in memory to reduce fragmentation). (As long as
> /proc/sys/vm/compact_unevictable_allowed is 0)
> 
> For this use case we don't mind page faults as they take more or less
> constant time, which we can also avoid if we wanted by preallocating guest
> memory. We do, however, want PTEs to be untouched by kcompactd, which
> MCL_ONFAULT accomplishes just fine without the extra memory overhead that
> comes from various anonymous mappings getting write-faulted with the
> currently available mem-lock=on option.
> 
> In our case we use KVM of course, TCG was just an experiment where I noticed
> anonymous memory
> jump way too much.
> 
> I don't think it's feasible in our case to look for the origin of every
> anonymous mapping that grew compared to the no mem-lock case (which there's
> about ~30 with default Q35 + KVM, without any extra devices), and try to
> optimize it to map anonymous memory less eagerly.

Would it be better then to use mem-lock=on|off|onfault?  That turns it
into a string option, which avoids having to enforce that the two flags
are mutually exclusive (and having two separate knobs for related
things looks odd too).
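
As a rough illustration of the tri-state idea, a parser for such a
string could look like the sketch below. The enum and function names
are hypothetical; this is not how QEMU's option parsing is actually
wired up.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tri-state for the proposed mem-lock=on|off|onfault. */
enum mem_lock_mode { MEM_LOCK_OFF, MEM_LOCK_ON, MEM_LOCK_ONFAULT };

/* Parse the option value. Returns 0 and stores the mode in *out on
 * success, -1 for any string other than "on", "off" or "onfault". */
int parse_mem_lock(const char *value, enum mem_lock_mode *out)
{
    assert(value != NULL && out != NULL);

    if (strcmp(value, "off") == 0) {
        *out = MEM_LOCK_OFF;
    } else if (strcmp(value, "on") == 0) {
        *out = MEM_LOCK_ON;
    } else if (strcmp(value, "onfault") == 0) {
        *out = MEM_LOCK_ONFAULT;
    } else {
        return -1;
    }
    return 0;
}
```

With a single value there is no combination like "on" plus "onfault"
left to reject, which is the exclusiveness problem going away.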

Thanks,

-- 
Peter Xu




* Re: [PATCH 0/2] overcommit: introduce mem-lock-onfault
  2024-12-10 17:20         ` Peter Xu
@ 2024-12-10 17:23           ` Daniil Tatianin
  0 siblings, 0 replies; 10+ messages in thread
From: Daniil Tatianin @ 2024-12-10 17:23 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, qemu-devel


On 12/10/24 8:20 PM, Peter Xu wrote:
> On Tue, Dec 10, 2024 at 08:01:08PM +0300, Daniil Tatianin wrote:
>> I mentioned my use case in the cover letter. Basically we want to protect
>> QEMU's pages from being migrated and compacted by kcompactd, which it
>> accomplishes by modifying live page tables and spamming the process with TLB
>> invalidate IPIs while it does that, which kills guest performance for the
>> duration of the compaction operation.
> Ah right, I read it initially but just now when I scanned the cover letter
> I missed that.  My fault.

No worries!

>> Memory locking allows to protect a process from kcompactd page compaction
>> and more importantly, migration (that is taking a PTE and replacing it with
>> one, which is closer in memory to reduce fragmentation). (As long as
>> /proc/sys/vm/compact_unevictable_allowed is 0)
>>
>> For this use case we don't mind page faults as they take more or less
>> constant time, which we can also avoid if we wanted by preallocating guest
>> memory. We do, however, want PTEs to be untouched by kcompactd, which
>> MCL_ONFAULT accomplishes just fine without the extra memory overhead that
>> comes from various anonymous mappings getting write-faulted with the
>> currently available mem-lock=on option.
>>
>> In our case we use KVM of course, TCG was just an experiment where I noticed
>> anonymous memory
>> jump way too much.
>>
>> I don't think it's feasible in our case to look for the origin of every
>> anonymous mapping that grew compared to the no mem-lock case (which there's
>> about ~30 with default Q35 + KVM, without any extra devices), and try to
>> optimize it to map anonymous memory less eagerly.
> Would it be better then to use mem-lock=on|off|onfault?  So turns it into a
> string to avoid the "exclusiveness" needed (meanwhile having two separate
> knobs for relevant things looks odd too).

How did I not think of that.. Sounds much better IMO.

Thank you!

> Thanks,
>



Thread overview: 10+ messages
2024-12-05 23:19 [PATCH 0/2] overcommit: introduce mem-lock-onfault Daniil Tatianin
2024-12-05 23:19 ` [PATCH 1/2] os: add an ability to lock memory on_fault Daniil Tatianin
2024-12-05 23:19 ` [PATCH 2/2] overcommit: introduce mem-lock-onfault Daniil Tatianin
2024-12-06  1:08 ` [PATCH 0/2] " Peter Xu
2024-12-09  7:40   ` Daniil Tatianin
2024-12-10 16:48     ` Peter Xu
2024-12-10 17:01       ` Daniil Tatianin
2024-12-10 17:20         ` Peter Xu
2024-12-10 17:23           ` Daniil Tatianin
2024-12-10 14:48 ` Vladimir Sementsov-Ogievskiy
