[PULL 00/14] Mem next patches

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PULL 00/14] Mem next patches
@ 2025-02-11 22:50 Peter Xu
  2025-02-11 22:50 ` [PULL 01/14] system/physmem: take into account fd_offset for file fallocate Peter Xu
                   ` (13 more replies)
  0 siblings, 14 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi; +Cc: peterx, Paolo Bonzini, David Hildenbrand

The following changes since commit ffaf7f0376f8040ce9068d71ae9ae8722505c42e:

  Merge tag 'pull-10.0-testing-and-gdstub-updates-100225-1' of https://gitlab.com/stsquad/qemu into staging (2025-02-10 13:26:17 -0500)

are available in the Git repository at:

  https://gitlab.com/peterx/qemu.git tags/mem-next-pull-request

for you to fetch changes up to 3e05dedb9f3a4687bffbbc91e89e5c27887c5dcd:

  system/physmem: poisoned memory discard on reboot (2025-02-11 16:37:06 -0500)

----------------------------------------------------------------
Memory pull for 10.0

- William's fix on ram hole punching when with file offset
- Daniil's patchset to introduce mem-lock=on-fault
- William's hugetlb hwpoison fix for size report & remap
- David's series to allow qemu debug writes to MMIOs

----------------------------------------------------------------

Daniil Tatianin (4):
  os: add an ability to lock memory on_fault
  system/vl: extract overcommit option parsing into a helper
  system: introduce a new MlockState enum
  overcommit: introduce mem-lock=on-fault

David Hildenbrand (7):
  physmem: factor out memory_region_is_ram_device() check in
    memory_access_is_direct()
  physmem: factor out RAM/ROMD check in memory_access_is_direct()
  physmem: factor out direct access check into
    memory_region_supports_direct_access()
  physmem: disallow direct access to RAM DEVICE in
    address_space_write_rom()
  memory: pass MemTxAttrs to memory_access_is_direct()
  hmp: use cpu_get_phys_page_debug() in hmp_gva2gpa()
  physmem: teach cpu_memory_rw_debug() to write to more memory regions

William Roche (3):
  system/physmem: take into account fd_offset for file fallocate
  system/physmem: handle hugetlb correctly in qemu_ram_remap()
  system/physmem: poisoned memory discard on reboot

 include/exec/cpu-common.h |   2 +-
 include/exec/memattrs.h   |   5 +-
 include/exec/memory.h     |  35 ++++++++---
 include/system/os-posix.h |   2 +-
 include/system/os-win32.h |   3 +-
 include/system/system.h   |  12 +++-
 accel/kvm/kvm-all.c       |   2 +-
 hw/core/cpu-system.c      |  13 ++--
 hw/core/loader.c          |   2 +-
 hw/remote/vfio-user-obj.c |   2 +-
 hw/virtio/virtio-mem.c    |   2 +-
 migration/postcopy-ram.c  |   4 +-
 monitor/hmp-cmds-target.c |   3 +-
 os-posix.c                |  10 +++-
 system/globals.c          |  12 +++-
 system/physmem.c          | 121 ++++++++++++++++++++++++--------------
 system/vl.c               |  52 ++++++++++++----
 system/memory_ldst.c.inc  |  18 +++---
 qemu-options.hx           |  14 +++--
 19 files changed, 217 insertions(+), 97 deletions(-)

-- 
2.47.0



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PULL 01/14] system/physmem: take into account fd_offset for file fallocate
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 02/14] os: add an ability to lock memory on_fault Peter Xu
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand, William Roche

From: William Roche <william.roche@oracle.com>

Punching a hole in a file with fallocate needs to take into account the
fd_offset value for a correct file location.
But guest_memfd internal use doesn't currently consider fd_offset.

Fixes: 4b870dc4d0c0 ("hostmem-file: add offset option")

Signed-off-by: William Roche <william.roche@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250122194053.3103617-2-william.roche@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 system/physmem.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 67c9db9daa..235015f3ea 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3797,18 +3797,19 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
             }
 
             ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-                            start, length);
+                            start + rb->fd_offset, length);
             if (ret) {
                 ret = -errno;
-                error_report("%s: Failed to fallocate %s:%" PRIx64 " +%zx (%d)",
-                             __func__, rb->idstr, start, length, ret);
+                error_report("%s: Failed to fallocate %s:%" PRIx64 "+%" PRIx64
+                             " +%zx (%d)", __func__, rb->idstr, start,
+                             rb->fd_offset, length, ret);
                 goto err;
             }
 #else
             ret = -ENOSYS;
             error_report("%s: fallocate not available/file"
-                         "%s:%" PRIx64 " +%zx (%d)",
-                         __func__, rb->idstr, start, length, ret);
+                         "%s:%" PRIx64 "+%" PRIx64 " +%zx (%d)", __func__,
+                         rb->idstr, start, rb->fd_offset, length, ret);
             goto err;
 #endif
         }
@@ -3855,6 +3856,7 @@ int ram_block_discard_guest_memfd_range(RAMBlock *rb, uint64_t start,
     int ret = -1;
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+    /* ignore fd_offset with guest_memfd */
     ret = fallocate(rb->guest_memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     start, length);
 
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 02/14] os: add an ability to lock memory on_fault
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
  2025-02-11 22:50 ` [PULL 01/14] system/physmem: take into account fd_offset for file fallocate Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-12 14:13   ` Stefan Hajnoczi
  2025-02-11 22:50 ` [PULL 03/14] system/vl: extract overcommit option parsing into a helper Peter Xu
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand, Daniil Tatianin,
	Vladimir Sementsov-Ogievskiy

From: Daniil Tatianin <d-tatianin@yandex-team.ru>

This will be used in the following commits to make it possible to only
lock memory on fault instead of right away.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Link: https://lore.kernel.org/r/20250123131944.391886-2-d-tatianin@yandex-team.ru
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/system/os-posix.h |  2 +-
 include/system/os-win32.h |  3 ++-
 migration/postcopy-ram.c  |  2 +-
 os-posix.c                | 10 ++++++++--
 system/vl.c               |  2 +-
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/system/os-posix.h b/include/system/os-posix.h
index b881ac6c6f..ce5b3bccf8 100644
--- a/include/system/os-posix.h
+++ b/include/system/os-posix.h
@@ -53,7 +53,7 @@ bool os_set_runas(const char *user_id);
 void os_set_chroot(const char *path);
 void os_setup_limits(void);
 void os_setup_post(void);
-int os_mlock(void);
+int os_mlock(bool on_fault);
 
 /**
  * qemu_alloc_stack:
diff --git a/include/system/os-win32.h b/include/system/os-win32.h
index b82a5d3ad9..cd61d69e10 100644
--- a/include/system/os-win32.h
+++ b/include/system/os-win32.h
@@ -123,8 +123,9 @@ static inline bool is_daemonized(void)
     return false;
 }
 
-static inline int os_mlock(void)
+static inline int os_mlock(bool on_fault)
 {
+    (void)on_fault;
     return -ENOSYS;
 }
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 6a6da6ba7f..fc4d8a10df 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -652,7 +652,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
     }
 
     if (enable_mlock) {
-        if (os_mlock() < 0) {
+        if (os_mlock(false) < 0) {
             error_report("mlock: %s", strerror(errno));
             /*
              * It doesn't feel right to fail at this point, we have a valid
diff --git a/os-posix.c b/os-posix.c
index 9cce55ff2f..48afb2990d 100644
--- a/os-posix.c
+++ b/os-posix.c
@@ -327,18 +327,24 @@ void os_set_line_buffering(void)
     setvbuf(stdout, NULL, _IOLBF, 0);
 }
 
-int os_mlock(void)
+int os_mlock(bool on_fault)
 {
 #ifdef HAVE_MLOCKALL
     int ret = 0;
+    int flags = MCL_CURRENT | MCL_FUTURE;
 
-    ret = mlockall(MCL_CURRENT | MCL_FUTURE);
+    if (on_fault) {
+        flags |= MCL_ONFAULT;
+    }
+
+    ret = mlockall(flags);
     if (ret < 0) {
         error_report("mlockall: %s", strerror(errno));
     }
 
     return ret;
 #else
+    (void)on_fault;
     return -ENOSYS;
 #endif
 }
diff --git a/system/vl.c b/system/vl.c
index 9c6942c6cf..e94fc7ea35 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -797,7 +797,7 @@ static QemuOptsList qemu_run_with_opts = {
 static void realtime_init(void)
 {
     if (enable_mlock) {
-        if (os_mlock() < 0) {
+        if (os_mlock(false) < 0) {
             error_report("locking memory failed");
             exit(1);
         }
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PULL 02/14] os: add an ability to lock memory on_fault
  2025-02-11 22:50 ` [PULL 02/14] os: add an ability to lock memory on_fault Peter Xu
@ 2025-02-12 14:13   ` Stefan Hajnoczi
  2025-02-12 14:17     ` Daniil Tatianin
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Hajnoczi @ 2025-02-12 14:13 UTC (permalink / raw)
  To: Peter Xu, Daniil Tatianin
  Cc: qemu-devel, Stefan Hajnoczi, Paolo Bonzini, David Hildenbrand,
	Vladimir Sementsov-Ogievskiy

On Tue, Feb 11, 2025 at 5:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> From: Daniil Tatianin <d-tatianin@yandex-team.ru>
>
> This will be used in the following commits to make it possible to only
> lock memory on fault instead of right away.

Hi Peter and Daniil,
Please take a look at this CI failure:
https://gitlab.com/qemu-project/qemu/-/jobs/9117106042#L3603

Thanks,
Stefan

>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
> Link: https://lore.kernel.org/r/20250123131944.391886-2-d-tatianin@yandex-team.ru
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/system/os-posix.h |  2 +-
>  include/system/os-win32.h |  3 ++-
>  migration/postcopy-ram.c  |  2 +-
>  os-posix.c                | 10 ++++++++--
>  system/vl.c               |  2 +-
>  5 files changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/include/system/os-posix.h b/include/system/os-posix.h
> index b881ac6c6f..ce5b3bccf8 100644
> --- a/include/system/os-posix.h
> +++ b/include/system/os-posix.h
> @@ -53,7 +53,7 @@ bool os_set_runas(const char *user_id);
>  void os_set_chroot(const char *path);
>  void os_setup_limits(void);
>  void os_setup_post(void);
> -int os_mlock(void);
> +int os_mlock(bool on_fault);
>
>  /**
>   * qemu_alloc_stack:
> diff --git a/include/system/os-win32.h b/include/system/os-win32.h
> index b82a5d3ad9..cd61d69e10 100644
> --- a/include/system/os-win32.h
> +++ b/include/system/os-win32.h
> @@ -123,8 +123,9 @@ static inline bool is_daemonized(void)
>      return false;
>  }
>
> -static inline int os_mlock(void)
> +static inline int os_mlock(bool on_fault)
>  {
> +    (void)on_fault;
>      return -ENOSYS;
>  }
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 6a6da6ba7f..fc4d8a10df 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -652,7 +652,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>      }
>
>      if (enable_mlock) {
> -        if (os_mlock() < 0) {
> +        if (os_mlock(false) < 0) {
>              error_report("mlock: %s", strerror(errno));
>              /*
>               * It doesn't feel right to fail at this point, we have a valid
> diff --git a/os-posix.c b/os-posix.c
> index 9cce55ff2f..48afb2990d 100644
> --- a/os-posix.c
> +++ b/os-posix.c
> @@ -327,18 +327,24 @@ void os_set_line_buffering(void)
>      setvbuf(stdout, NULL, _IOLBF, 0);
>  }
>
> -int os_mlock(void)
> +int os_mlock(bool on_fault)
>  {
>  #ifdef HAVE_MLOCKALL
>      int ret = 0;
> +    int flags = MCL_CURRENT | MCL_FUTURE;
>
> -    ret = mlockall(MCL_CURRENT | MCL_FUTURE);
> +    if (on_fault) {
> +        flags |= MCL_ONFAULT;
> +    }
> +
> +    ret = mlockall(flags);
>      if (ret < 0) {
>          error_report("mlockall: %s", strerror(errno));
>      }
>
>      return ret;
>  #else
> +    (void)on_fault;
>      return -ENOSYS;
>  #endif
>  }
> diff --git a/system/vl.c b/system/vl.c
> index 9c6942c6cf..e94fc7ea35 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -797,7 +797,7 @@ static QemuOptsList qemu_run_with_opts = {
>  static void realtime_init(void)
>  {
>      if (enable_mlock) {
> -        if (os_mlock() < 0) {
> +        if (os_mlock(false) < 0) {
>              error_report("locking memory failed");
>              exit(1);
>          }
> --
> 2.47.0
>
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PULL 02/14] os: add an ability to lock memory on_fault
  2025-02-12 14:13   ` Stefan Hajnoczi
@ 2025-02-12 14:17     ` Daniil Tatianin
  0 siblings, 0 replies; 17+ messages in thread
From: Daniil Tatianin @ 2025-02-12 14:17 UTC (permalink / raw)
  To: Stefan Hajnoczi, Peter Xu
  Cc: qemu-devel, Stefan Hajnoczi, Paolo Bonzini, David Hildenbrand,
	Vladimir Sementsov-Ogievskiy


On 2/12/25 5:13 PM, Stefan Hajnoczi wrote:
> On Tue, Feb 11, 2025 at 5:52 PM Peter Xu <peterx@redhat.com> wrote:
>> From: Daniil Tatianin <d-tatianin@yandex-team.ru>
>>
>> This will be used in the following commits to make it possible to only
>> lock memory on fault instead of right away.
> Hi Peter and Daniil,
> Please take a look at this CI failure:
> https://gitlab.com/qemu-project/qemu/-/jobs/9117106042#L3603
>
> Thanks,
> Stefan

Looks like MCL_ONFAULT is linux-only, I guess I'll have to have add 
something similar to HAVE_MLOCKALL and put this under an ifdef.

Thank you

>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>> Reviewed-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
>> Link: https://lore.kernel.org/r/20250123131944.391886-2-d-tatianin@yandex-team.ru
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>>   include/system/os-posix.h |  2 +-
>>   include/system/os-win32.h |  3 ++-
>>   migration/postcopy-ram.c  |  2 +-
>>   os-posix.c                | 10 ++++++++--
>>   system/vl.c               |  2 +-
>>   5 files changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/system/os-posix.h b/include/system/os-posix.h
>> index b881ac6c6f..ce5b3bccf8 100644
>> --- a/include/system/os-posix.h
>> +++ b/include/system/os-posix.h
>> @@ -53,7 +53,7 @@ bool os_set_runas(const char *user_id);
>>   void os_set_chroot(const char *path);
>>   void os_setup_limits(void);
>>   void os_setup_post(void);
>> -int os_mlock(void);
>> +int os_mlock(bool on_fault);
>>
>>   /**
>>    * qemu_alloc_stack:
>> diff --git a/include/system/os-win32.h b/include/system/os-win32.h
>> index b82a5d3ad9..cd61d69e10 100644
>> --- a/include/system/os-win32.h
>> +++ b/include/system/os-win32.h
>> @@ -123,8 +123,9 @@ static inline bool is_daemonized(void)
>>       return false;
>>   }
>>
>> -static inline int os_mlock(void)
>> +static inline int os_mlock(bool on_fault)
>>   {
>> +    (void)on_fault;
>>       return -ENOSYS;
>>   }
>>
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index 6a6da6ba7f..fc4d8a10df 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -652,7 +652,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>>       }
>>
>>       if (enable_mlock) {
>> -        if (os_mlock() < 0) {
>> +        if (os_mlock(false) < 0) {
>>               error_report("mlock: %s", strerror(errno));
>>               /*
>>                * It doesn't feel right to fail at this point, we have a valid
>> diff --git a/os-posix.c b/os-posix.c
>> index 9cce55ff2f..48afb2990d 100644
>> --- a/os-posix.c
>> +++ b/os-posix.c
>> @@ -327,18 +327,24 @@ void os_set_line_buffering(void)
>>       setvbuf(stdout, NULL, _IOLBF, 0);
>>   }
>>
>> -int os_mlock(void)
>> +int os_mlock(bool on_fault)
>>   {
>>   #ifdef HAVE_MLOCKALL
>>       int ret = 0;
>> +    int flags = MCL_CURRENT | MCL_FUTURE;
>>
>> -    ret = mlockall(MCL_CURRENT | MCL_FUTURE);
>> +    if (on_fault) {
>> +        flags |= MCL_ONFAULT;
>> +    }
>> +
>> +    ret = mlockall(flags);
>>       if (ret < 0) {
>>           error_report("mlockall: %s", strerror(errno));
>>       }
>>
>>       return ret;
>>   #else
>> +    (void)on_fault;
>>       return -ENOSYS;
>>   #endif
>>   }
>> diff --git a/system/vl.c b/system/vl.c
>> index 9c6942c6cf..e94fc7ea35 100644
>> --- a/system/vl.c
>> +++ b/system/vl.c
>> @@ -797,7 +797,7 @@ static QemuOptsList qemu_run_with_opts = {
>>   static void realtime_init(void)
>>   {
>>       if (enable_mlock) {
>> -        if (os_mlock() < 0) {
>> +        if (os_mlock(false) < 0) {
>>               error_report("locking memory failed");
>>               exit(1);
>>           }
>> --
>> 2.47.0
>>
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PULL 03/14] system/vl: extract overcommit option parsing into a helper
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
  2025-02-11 22:50 ` [PULL 01/14] system/physmem: take into account fd_offset for file fallocate Peter Xu
  2025-02-11 22:50 ` [PULL 02/14] os: add an ability to lock memory on_fault Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 04/14] system: introduce a new MlockState enum Peter Xu
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand, Daniil Tatianin,
	Vladimir Sementsov-Ogievskiy

From: Daniil Tatianin <d-tatianin@yandex-team.ru>

This will be extended in the future commits, let's move it out of line
right away so that it's easier to read.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Link: https://lore.kernel.org/r/20250123131944.391886-3-d-tatianin@yandex-team.ru
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 system/vl.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/system/vl.c b/system/vl.c
index e94fc7ea35..72a40985f5 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -1875,6 +1875,19 @@ static void object_option_parse(const char *str)
     visit_free(v);
 }
 
+static void overcommit_parse(const char *str)
+{
+    QemuOpts *opts;
+
+    opts = qemu_opts_parse_noisily(qemu_find_opts("overcommit"),
+                                   str, false);
+    if (!opts) {
+        exit(1);
+    }
+    enable_mlock = qemu_opt_get_bool(opts, "mem-lock", enable_mlock);
+    enable_cpu_pm = qemu_opt_get_bool(opts, "cpu-pm", enable_cpu_pm);
+}
+
 /*
  * Very early object creation, before the sandbox options have been activated.
  */
@@ -3575,13 +3588,7 @@ void qemu_init(int argc, char **argv)
                 object_option_parse(optarg);
                 break;
             case QEMU_OPTION_overcommit:
-                opts = qemu_opts_parse_noisily(qemu_find_opts("overcommit"),
-                                               optarg, false);
-                if (!opts) {
-                    exit(1);
-                }
-                enable_mlock = qemu_opt_get_bool(opts, "mem-lock", enable_mlock);
-                enable_cpu_pm = qemu_opt_get_bool(opts, "cpu-pm", enable_cpu_pm);
+                overcommit_parse(optarg);
                 break;
             case QEMU_OPTION_compat:
                 {
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 04/14] system: introduce a new MlockState enum
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (2 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 03/14] system/vl: extract overcommit option parsing into a helper Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 05/14] overcommit: introduce mem-lock=on-fault Peter Xu
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand, Daniil Tatianin,
	Vladimir Sementsov-Ogievskiy

From: Daniil Tatianin <d-tatianin@yandex-team.ru>

Replace the boolean value enable_mlock with an enum and add a helper to
decide whether we should be calling os_mlock.

This is a stepping stone towards introducing a new mlock mode, which
will be the third possible state of this enum.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Link: https://lore.kernel.org/r/20250123131944.391886-4-d-tatianin@yandex-team.ru
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/system/system.h  | 10 +++++++++-
 hw/virtio/virtio-mem.c   |  2 +-
 migration/postcopy-ram.c |  2 +-
 system/globals.c         |  7 ++++++-
 system/vl.c              |  9 +++++++--
 5 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/system/system.h b/include/system/system.h
index 0cbb43ec30..dc7628357a 100644
--- a/include/system/system.h
+++ b/include/system/system.h
@@ -44,10 +44,18 @@ extern int display_opengl;
 extern const char *keyboard_layout;
 extern int old_param;
 extern uint8_t *boot_splash_filedata;
-extern bool enable_mlock;
 extern bool enable_cpu_pm;
 extern QEMUClockType rtc_clock;
 
+typedef enum {
+    MLOCK_OFF = 0,
+    MLOCK_ON,
+} MlockState;
+
+bool should_mlock(MlockState);
+
+extern MlockState mlock_state;
+
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
     const char *name;
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index b1a003736b..7b140add76 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -991,7 +991,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (enable_mlock) {
+    if (should_mlock(mlock_state)) {
         error_setg(errp, "Incompatible with mlock");
         return;
     }
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index fc4d8a10df..04068ee039 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -651,7 +651,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
         mis->have_fault_thread = false;
     }
 
-    if (enable_mlock) {
+    if (should_mlock(mlock_state)) {
         if (os_mlock(false) < 0) {
             error_report("mlock: %s", strerror(errno));
             /*
diff --git a/system/globals.c b/system/globals.c
index 4867c93ca6..adeff38348 100644
--- a/system/globals.c
+++ b/system/globals.c
@@ -31,10 +31,15 @@
 #include "system/cpus.h"
 #include "system/system.h"
 
+bool should_mlock(MlockState state)
+{
+    return state == MLOCK_ON;
+}
+
 enum vga_retrace_method vga_retrace_method = VGA_RETRACE_DUMB;
 int display_opengl;
 const char* keyboard_layout;
-bool enable_mlock;
+MlockState mlock_state;
 bool enable_cpu_pm;
 int autostart = 1;
 int vga_interface_type = VGA_NONE;
diff --git a/system/vl.c b/system/vl.c
index 72a40985f5..2895824c1a 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -796,7 +796,7 @@ static QemuOptsList qemu_run_with_opts = {
 
 static void realtime_init(void)
 {
-    if (enable_mlock) {
+    if (should_mlock(mlock_state)) {
         if (os_mlock(false) < 0) {
             error_report("locking memory failed");
             exit(1);
@@ -1878,13 +1878,18 @@ static void object_option_parse(const char *str)
 static void overcommit_parse(const char *str)
 {
     QemuOpts *opts;
+    bool enable_mlock;
 
     opts = qemu_opts_parse_noisily(qemu_find_opts("overcommit"),
                                    str, false);
     if (!opts) {
         exit(1);
     }
-    enable_mlock = qemu_opt_get_bool(opts, "mem-lock", enable_mlock);
+
+    enable_mlock = qemu_opt_get_bool(opts, "mem-lock",
+                                     should_mlock(mlock_state));
+    mlock_state = enable_mlock ? MLOCK_ON : MLOCK_OFF;
+
     enable_cpu_pm = qemu_opt_get_bool(opts, "cpu-pm", enable_cpu_pm);
 }
 
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 05/14] overcommit: introduce mem-lock=on-fault
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (3 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 04/14] system: introduce a new MlockState enum Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 06/14] physmem: factor out memory_region_is_ram_device() check in memory_access_is_direct() Peter Xu
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand, Daniil Tatianin,
	Vladimir Sementsov-Ogievskiy

From: Daniil Tatianin <d-tatianin@yandex-team.ru>

Locking the memory without MCL_ONFAULT instantly prefaults any mmaped
anonymous memory with a write-fault, which introduces a lot of extra
overhead in terms of memory usage when all you want to do is to prevent
kcompactd from migrating and compacting QEMU pages. Add an option to
only lock pages lazily as they're faulted by the process by using
MCL_ONFAULT if asked.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Link: https://lore.kernel.org/r/20250123131944.391886-5-d-tatianin@yandex-team.ru
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/system/system.h  |  2 ++
 migration/postcopy-ram.c |  2 +-
 system/globals.c         |  7 ++++++-
 system/vl.c              | 34 +++++++++++++++++++++++++++-------
 qemu-options.hx          | 14 +++++++++-----
 5 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/include/system/system.h b/include/system/system.h
index dc7628357a..a7effe7dfd 100644
--- a/include/system/system.h
+++ b/include/system/system.h
@@ -50,9 +50,11 @@ extern QEMUClockType rtc_clock;
 typedef enum {
     MLOCK_OFF = 0,
     MLOCK_ON,
+    MLOCK_ON_FAULT,
 } MlockState;
 
 bool should_mlock(MlockState);
+bool is_mlock_on_fault(MlockState);
 
 extern MlockState mlock_state;
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 04068ee039..5d3edfcfec 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -652,7 +652,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
     }
 
     if (should_mlock(mlock_state)) {
-        if (os_mlock(false) < 0) {
+        if (os_mlock(is_mlock_on_fault(mlock_state)) < 0) {
             error_report("mlock: %s", strerror(errno));
             /*
              * It doesn't feel right to fail at this point, we have a valid
diff --git a/system/globals.c b/system/globals.c
index adeff38348..316623bd20 100644
--- a/system/globals.c
+++ b/system/globals.c
@@ -33,7 +33,12 @@
 
 bool should_mlock(MlockState state)
 {
-    return state == MLOCK_ON;
+    return state == MLOCK_ON || state == MLOCK_ON_FAULT;
+}
+
+bool is_mlock_on_fault(MlockState state)
+{
+    return state == MLOCK_ON_FAULT;
 }
 
 enum vga_retrace_method vga_retrace_method = VGA_RETRACE_DUMB;
diff --git a/system/vl.c b/system/vl.c
index 2895824c1a..3c0fa2ff64 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -351,7 +351,7 @@ static QemuOptsList qemu_overcommit_opts = {
     .desc = {
         {
             .name = "mem-lock",
-            .type = QEMU_OPT_BOOL,
+            .type = QEMU_OPT_STRING,
         },
         {
             .name = "cpu-pm",
@@ -797,7 +797,7 @@ static QemuOptsList qemu_run_with_opts = {
 static void realtime_init(void)
 {
     if (should_mlock(mlock_state)) {
-        if (os_mlock(false) < 0) {
+        if (os_mlock(is_mlock_on_fault(mlock_state)) < 0) {
             error_report("locking memory failed");
             exit(1);
         }
@@ -1878,7 +1878,7 @@ static void object_option_parse(const char *str)
 static void overcommit_parse(const char *str)
 {
     QemuOpts *opts;
-    bool enable_mlock;
+    const char *mem_lock_opt;
 
     opts = qemu_opts_parse_noisily(qemu_find_opts("overcommit"),
                                    str, false);
@@ -1886,11 +1886,31 @@ static void overcommit_parse(const char *str)
         exit(1);
     }
 
-    enable_mlock = qemu_opt_get_bool(opts, "mem-lock",
-                                     should_mlock(mlock_state));
-    mlock_state = enable_mlock ? MLOCK_ON : MLOCK_OFF;
-
     enable_cpu_pm = qemu_opt_get_bool(opts, "cpu-pm", enable_cpu_pm);
+
+    mem_lock_opt = qemu_opt_get(opts, "mem-lock");
+    if (!mem_lock_opt) {
+        return;
+    }
+
+    if (strcmp(mem_lock_opt, "on") == 0) {
+        mlock_state = MLOCK_ON;
+        return;
+    }
+
+    if (strcmp(mem_lock_opt, "off") == 0) {
+        mlock_state = MLOCK_OFF;
+        return;
+    }
+
+    if (strcmp(mem_lock_opt, "on-fault") == 0) {
+        mlock_state = MLOCK_ON_FAULT;
+        return;
+    }
+
+    error_report("parameter 'mem-lock' expects one of "
+                 "'on', 'off', 'on-fault'");
+    exit(1);
 }
 
 /*
diff --git a/qemu-options.hx b/qemu-options.hx
index 1b26ad53bd..61270e3206 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4632,21 +4632,25 @@ SRST
 ERST
 
 DEF("overcommit", HAS_ARG, QEMU_OPTION_overcommit,
-    "-overcommit [mem-lock=on|off][cpu-pm=on|off]\n"
+    "-overcommit [mem-lock=on|off|on-fault][cpu-pm=on|off]\n"
     "                run qemu with overcommit hints\n"
-    "                mem-lock=on|off controls memory lock support (default: off)\n"
+    "                mem-lock=on|off|on-fault controls memory lock support (default: off)\n"
     "                cpu-pm=on|off controls cpu power management (default: off)\n",
     QEMU_ARCH_ALL)
 SRST
-``-overcommit mem-lock=on|off``
+``-overcommit mem-lock=on|off|on-fault``
   \ 
 ``-overcommit cpu-pm=on|off``
     Run qemu with hints about host resource overcommit. The default is
     to assume that host overcommits all resources.
 
     Locking qemu and guest memory can be enabled via ``mem-lock=on``
-    (disabled by default). This works when host memory is not
-    overcommitted and reduces the worst-case latency for guest.
+    or ``mem-lock=on-fault`` (disabled by default). This works when
+    host memory is not overcommitted and reduces the worst-case latency for
+    guest. The on-fault option is better for reducing the memory footprint
+    since it makes allocations lazy, but the pages still get locked in place
+    once faulted by the guest or QEMU. Note that the two options are mutually
+    exclusive.
 
     Guest ability to manage power state of host cpus (increasing latency
     for other processes on the same host cpu, but decreasing latency for
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 06/14] physmem: factor out memory_region_is_ram_device() check in memory_access_is_direct()
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (4 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 05/14] overcommit: introduce mem-lock=on-fault Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 07/14] physmem: factor out RAM/ROMD " Peter Xu
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi; +Cc: peterx, Paolo Bonzini, David Hildenbrand

From: David Hildenbrand <david@redhat.com>

As documented in commit 4a2e242bbb306 ("memory: Don't use memcpy for
ram_device regions"), we disallow direct access to RAM DEVICE regions.

Let's make this clearer to prepare for further changes. Note that romd
regions will never be RAM DEVICE at the same time.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250210084648.33798-2-david@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9f73b59867..5cd7574c60 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2997,12 +2997,19 @@ bool prepare_mmio_access(MemoryRegion *mr);
 
 static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
 {
+    /*
+     * RAM DEVICE regions can be accessed directly using memcpy, but it might
+     * be MMIO and access using mempy can be wrong (e.g., using instructions not
+     * intended for MMIO access). So we treat this as IO.
+     */
+    if (memory_region_is_ram_device(mr)) {
+        return false;
+    }
     if (is_write) {
         return memory_region_is_ram(mr) && !mr->readonly &&
-               !mr->rom_device && !memory_region_is_ram_device(mr);
+               !mr->rom_device;
     } else {
-        return (memory_region_is_ram(mr) && !memory_region_is_ram_device(mr)) ||
-               memory_region_is_romd(mr);
+        return memory_region_is_ram(mr) || memory_region_is_romd(mr);
     }
 }
 
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 07/14] physmem: factor out RAM/ROMD check in memory_access_is_direct()
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (5 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 06/14] physmem: factor out memory_region_is_ram_device() check in memory_access_is_direct() Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 08/14] physmem: factor out direct access check into memory_region_supports_direct_access() Peter Xu
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi; +Cc: peterx, Paolo Bonzini, David Hildenbrand

From: David Hildenbrand <david@redhat.com>

Let's factor more of the generic "is this directly accessible" check,
independent of the "write" condition out.

Note that the "!mr->rom_device" check in the write case essentially
disallows the memory_region_is_romd() condition again. Further note that
RAM DEVICE regions are also RAM regions, so we can check for RAM+ROMD
first.

This is a preparation for further changes.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250210084648.33798-3-david@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 5cd7574c60..cb35c38402 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2997,6 +2997,10 @@ bool prepare_mmio_access(MemoryRegion *mr);
 
 static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
 {
+    /* ROM DEVICE regions only allow direct access if in ROMD mode. */
+    if (!memory_region_is_ram(mr) && !memory_region_is_romd(mr)) {
+        return false;
+    }
     /*
      * RAM DEVICE regions can be accessed directly using memcpy, but it might
      * be MMIO and access using mempy can be wrong (e.g., using instructions not
@@ -3006,11 +3010,9 @@ static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
         return false;
     }
     if (is_write) {
-        return memory_region_is_ram(mr) && !mr->readonly &&
-               !mr->rom_device;
-    } else {
-        return memory_region_is_ram(mr) || memory_region_is_romd(mr);
+        return !mr->readonly && !mr->rom_device;
     }
+    return true;
 }
 
 /**
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 08/14] physmem: factor out direct access check into memory_region_supports_direct_access()
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (6 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 07/14] physmem: factor out RAM/ROMD " Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 09/14] physmem: disallow direct access to RAM DEVICE in address_space_write_rom() Peter Xu
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi; +Cc: peterx, Paolo Bonzini, David Hildenbrand

From: David Hildenbrand <david@redhat.com>

Let's factor the complete "directly accessible" check independent of
the "write" condition out so we can reuse it next.

We can now split up the checks RAM and ROMD check, so we really only check
for RAM DEVICE in case of RAM -- ROM DEVICE is neither RAM not RAM DEVICE.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250210084648.33798-4-david@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index cb35c38402..4e2cf95ab6 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2995,10 +2995,13 @@ MemTxResult address_space_write_cached_slow(MemoryRegionCache *cache,
 int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
 bool prepare_mmio_access(MemoryRegion *mr);
 
-static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
+static inline bool memory_region_supports_direct_access(MemoryRegion *mr)
 {
     /* ROM DEVICE regions only allow direct access if in ROMD mode. */
-    if (!memory_region_is_ram(mr) && !memory_region_is_romd(mr)) {
+    if (memory_region_is_romd(mr)) {
+        return true;
+    }
+    if (!memory_region_is_ram(mr)) {
         return false;
     }
     /*
@@ -3006,7 +3009,12 @@ static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
      * be MMIO and access using mempy can be wrong (e.g., using instructions not
      * intended for MMIO access). So we treat this as IO.
      */
-    if (memory_region_is_ram_device(mr)) {
+    return !memory_region_is_ram_device(mr);
+}
+
+static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
+{
+    if (!memory_region_supports_direct_access(mr)) {
         return false;
     }
     if (is_write) {
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 09/14] physmem: disallow direct access to RAM DEVICE in address_space_write_rom()
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (7 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 08/14] physmem: factor out direct access check into memory_region_supports_direct_access() Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 10/14] memory: pass MemTxAttrs to memory_access_is_direct() Peter Xu
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand, Alex Williamson

From: David Hildenbrand <david@redhat.com>

As documented in commit 4a2e242bbb306 ("memory: Don't use memcpy for
ram_device regions"), we disallow direct access to RAM DEVICE regions.

This change implies that address_space_write_rom() and
cpu_memory_rw_debug() won't be able to write to RAM DEVICE regions. It
will also affect cpu_flush_icache_range(), but it's only used by
hw/core/loader.c after writing to ROM, so it is expected to not apply
here with RAM DEVICE.

This fixes direct access to these regions where we don't want direct
access. We'll extend cpu_memory_rw_debug() next to also be able to write to
these (and IO) regions.

This is a preparation for further changes.

Cc: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250210084648.33798-5-david@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 system/physmem.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 235015f3ea..cff15ca1df 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3137,8 +3137,7 @@ static inline MemTxResult address_space_write_rom_internal(AddressSpace *as,
         l = len;
         mr = address_space_translate(as, addr, &addr1, &l, true, attrs);
 
-        if (!(memory_region_is_ram(mr) ||
-              memory_region_is_romd(mr))) {
+        if (!memory_region_supports_direct_access(mr)) {
             l = memory_access_size(mr, l, addr1);
         } else {
             /* ROM/RAM case */
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 10/14] memory: pass MemTxAttrs to memory_access_is_direct()
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (8 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 09/14] physmem: disallow direct access to RAM DEVICE in address_space_write_rom() Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 11/14] hmp: use cpu_get_phys_page_debug() in hmp_gva2gpa() Peter Xu
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand,
	Philippe Mathieu-Daudé

From: David Hildenbrand <david@redhat.com>

We want to pass another flag that will be stored in MemTxAttrs. So pass
MemTxAttrs directly.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250210084648.33798-6-david@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h     |  5 +++--
 hw/core/loader.c          |  2 +-
 hw/remote/vfio-user-obj.c |  2 +-
 system/physmem.c          | 12 ++++++------
 system/memory_ldst.c.inc  | 18 +++++++++---------
 5 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 4e2cf95ab6..b18ecf933e 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -3012,7 +3012,8 @@ static inline bool memory_region_supports_direct_access(MemoryRegion *mr)
     return !memory_region_is_ram_device(mr);
 }
 
-static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
+static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write,
+                                           MemTxAttrs attrs)
 {
     if (!memory_region_supports_direct_access(mr)) {
         return false;
@@ -3053,7 +3054,7 @@ MemTxResult address_space_read(AddressSpace *as, hwaddr addr,
             fv = address_space_to_flatview(as);
             l = len;
             mr = flatview_translate(fv, addr, &addr1, &l, false, attrs);
-            if (len == l && memory_access_is_direct(mr, false)) {
+            if (len == l && memory_access_is_direct(mr, false, attrs)) {
                 ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
                 memcpy(buf, ptr, len);
             } else {
diff --git a/hw/core/loader.c b/hw/core/loader.c
index fd25c5e01b..332b879a0b 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -144,7 +144,7 @@ ssize_t load_image_mr(const char *filename, MemoryRegion *mr)
 {
     ssize_t size;
 
-    if (!memory_access_is_direct(mr, false)) {
+    if (!memory_access_is_direct(mr, false, MEMTXATTRS_UNSPECIFIED)) {
         /* Can only load an image into RAM or ROM */
         return -1;
     }
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 9e5ff6d87a..6e51a92856 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -358,7 +358,7 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
     int access_size;
     uint64_t val;
 
-    if (memory_access_is_direct(mr, is_write)) {
+    if (memory_access_is_direct(mr, is_write, MEMTXATTRS_UNSPECIFIED)) {
         /**
          * Some devices expose a PCI expansion ROM, which could be buffer
          * based as compared to other regions which are primarily based on
diff --git a/system/physmem.c b/system/physmem.c
index cff15ca1df..8745c10c9d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -573,7 +573,7 @@ MemoryRegion *flatview_translate(FlatView *fv, hwaddr addr, hwaddr *xlat,
                                     is_write, true, &as, attrs);
     mr = section.mr;
 
-    if (xen_enabled() && memory_access_is_direct(mr, is_write)) {
+    if (xen_enabled() && memory_access_is_direct(mr, is_write, attrs)) {
         hwaddr page = ((addr & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE) - addr;
         *plen = MIN(page, *plen);
     }
@@ -2869,7 +2869,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
         return MEMTX_ACCESS_ERROR;
     }
 
-    if (!memory_access_is_direct(mr, true)) {
+    if (!memory_access_is_direct(mr, true, attrs)) {
         uint64_t val;
         MemTxResult result;
         bool release_lock = prepare_mmio_access(mr);
@@ -2965,7 +2965,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
         return MEMTX_ACCESS_ERROR;
     }
 
-    if (!memory_access_is_direct(mr, false)) {
+    if (!memory_access_is_direct(mr, false, attrs)) {
         /* I/O case */
         uint64_t val;
         MemTxResult result;
@@ -3274,7 +3274,7 @@ static bool flatview_access_valid(FlatView *fv, hwaddr addr, hwaddr len,
     while (len > 0) {
         l = len;
         mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
-        if (!memory_access_is_direct(mr, is_write)) {
+        if (!memory_access_is_direct(mr, is_write, attrs)) {
             l = memory_access_size(mr, l, addr);
             if (!memory_region_access_valid(mr, xlat, l, is_write, attrs)) {
                 return false;
@@ -3354,7 +3354,7 @@ void *address_space_map(AddressSpace *as,
     fv = address_space_to_flatview(as);
     mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
 
-    if (!memory_access_is_direct(mr, is_write)) {
+    if (!memory_access_is_direct(mr, is_write, attrs)) {
         size_t used = qatomic_read(&as->bounce_buffer_size);
         for (;;) {
             hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
@@ -3487,7 +3487,7 @@ int64_t address_space_cache_init(MemoryRegionCache *cache,
 
     mr = cache->mrs.mr;
     memory_region_ref(mr);
-    if (memory_access_is_direct(mr, is_write)) {
+    if (memory_access_is_direct(mr, is_write, MEMTXATTRS_UNSPECIFIED)) {
         /* We don't care about the memory attributes here as we're only
          * doing this if we found actual RAM, which behaves the same
          * regardless of attributes; so UNSPECIFIED is fine.
diff --git a/system/memory_ldst.c.inc b/system/memory_ldst.c.inc
index 0e6f3940a9..7f32d3d9ff 100644
--- a/system/memory_ldst.c.inc
+++ b/system/memory_ldst.c.inc
@@ -34,7 +34,7 @@ static inline uint32_t glue(address_space_ldl_internal, SUFFIX)(ARG1_DECL,
 
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, false, attrs);
-    if (l < 4 || !memory_access_is_direct(mr, false)) {
+    if (l < 4 || !memory_access_is_direct(mr, false, attrs)) {
         release_lock |= prepare_mmio_access(mr);
 
         /* I/O case */
@@ -103,7 +103,7 @@ static inline uint64_t glue(address_space_ldq_internal, SUFFIX)(ARG1_DECL,
 
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, false, attrs);
-    if (l < 8 || !memory_access_is_direct(mr, false)) {
+    if (l < 8 || !memory_access_is_direct(mr, false, attrs)) {
         release_lock |= prepare_mmio_access(mr);
 
         /* I/O case */
@@ -170,7 +170,7 @@ uint8_t glue(address_space_ldub, SUFFIX)(ARG1_DECL,
 
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, false, attrs);
-    if (!memory_access_is_direct(mr, false)) {
+    if (!memory_access_is_direct(mr, false, attrs)) {
         release_lock |= prepare_mmio_access(mr);
 
         /* I/O case */
@@ -207,7 +207,7 @@ static inline uint16_t glue(address_space_lduw_internal, SUFFIX)(ARG1_DECL,
 
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, false, attrs);
-    if (l < 2 || !memory_access_is_direct(mr, false)) {
+    if (l < 2 || !memory_access_is_direct(mr, false, attrs)) {
         release_lock |= prepare_mmio_access(mr);
 
         /* I/O case */
@@ -277,7 +277,7 @@ void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
 
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
-    if (l < 4 || !memory_access_is_direct(mr, true)) {
+    if (l < 4 || !memory_access_is_direct(mr, true, attrs)) {
         release_lock |= prepare_mmio_access(mr);
 
         r = memory_region_dispatch_write(mr, addr1, val, MO_32, attrs);
@@ -314,7 +314,7 @@ static inline void glue(address_space_stl_internal, SUFFIX)(ARG1_DECL,
 
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
-    if (l < 4 || !memory_access_is_direct(mr, true)) {
+    if (l < 4 || !memory_access_is_direct(mr, true, attrs)) {
         release_lock |= prepare_mmio_access(mr);
         r = memory_region_dispatch_write(mr, addr1, val,
                                          MO_32 | devend_memop(endian), attrs);
@@ -377,7 +377,7 @@ void glue(address_space_stb, SUFFIX)(ARG1_DECL,
 
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
-    if (!memory_access_is_direct(mr, true)) {
+    if (!memory_access_is_direct(mr, true, attrs)) {
         release_lock |= prepare_mmio_access(mr);
         r = memory_region_dispatch_write(mr, addr1, val, MO_8, attrs);
     } else {
@@ -410,7 +410,7 @@ static inline void glue(address_space_stw_internal, SUFFIX)(ARG1_DECL,
 
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
-    if (l < 2 || !memory_access_is_direct(mr, true)) {
+    if (l < 2 || !memory_access_is_direct(mr, true, attrs)) {
         release_lock |= prepare_mmio_access(mr);
         r = memory_region_dispatch_write(mr, addr1, val,
                                          MO_16 | devend_memop(endian), attrs);
@@ -474,7 +474,7 @@ static void glue(address_space_stq_internal, SUFFIX)(ARG1_DECL,
 
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
-    if (l < 8 || !memory_access_is_direct(mr, true)) {
+    if (l < 8 || !memory_access_is_direct(mr, true, attrs)) {
         release_lock |= prepare_mmio_access(mr);
         r = memory_region_dispatch_write(mr, addr1, val,
                                          MO_64 | devend_memop(endian), attrs);
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 11/14] hmp: use cpu_get_phys_page_debug() in hmp_gva2gpa()
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (9 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 10/14] memory: pass MemTxAttrs to memory_access_is_direct() Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 12/14] physmem: teach cpu_memory_rw_debug() to write to more memory regions Peter Xu
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand,
	Philippe Mathieu-Daudé

From: David Hildenbrand <david@redhat.com>

We don't need the MemTxAttrs, so let's simply use the simpler function
variant.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250210084648.33798-7-david@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 monitor/hmp-cmds-target.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
index 27ffe61818..239c2a61a4 100644
--- a/monitor/hmp-cmds-target.c
+++ b/monitor/hmp-cmds-target.c
@@ -301,7 +301,6 @@ void hmp_gpa2hva(Monitor *mon, const QDict *qdict)
 void hmp_gva2gpa(Monitor *mon, const QDict *qdict)
 {
     target_ulong addr = qdict_get_int(qdict, "addr");
-    MemTxAttrs attrs;
     CPUState *cs = mon_get_cpu(mon);
     hwaddr gpa;
 
@@ -310,7 +309,7 @@ void hmp_gva2gpa(Monitor *mon, const QDict *qdict)
         return;
     }
 
-    gpa  = cpu_get_phys_page_attrs_debug(cs, addr & TARGET_PAGE_MASK, &attrs);
+    gpa  = cpu_get_phys_page_debug(cs, addr & TARGET_PAGE_MASK);
     if (gpa == -1) {
         monitor_printf(mon, "Unmapped\n");
     } else {
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 12/14] physmem: teach cpu_memory_rw_debug() to write to more memory regions
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (10 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 11/14] hmp: use cpu_get_phys_page_debug() in hmp_gva2gpa() Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 13/14] system/physmem: handle hugetlb correctly in qemu_ram_remap() Peter Xu
  2025-02-11 22:50 ` [PULL 14/14] system/physmem: poisoned memory discard on reboot Peter Xu
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi; +Cc: peterx, Paolo Bonzini, David Hildenbrand

From: David Hildenbrand <david@redhat.com>

Right now, we only allow for writing to memory regions that allow direct
access using memcpy etc; all other writes are simply ignored. This
implies that debugging guests will not work as expected when writing
to MMIO device regions.

Let's extend cpu_memory_rw_debug() to write to more memory regions,
including MMIO device regions. Reshuffle the condition in
memory_access_is_direct() to make it easier to read and add a comment.

While this change implies that debug access can now also write to MMIO
devices, we now are also permit ELF image loads and similar users of
cpu_memory_rw_debug() to write to MMIO devices; currently we ignore
these writes.

Peter assumes [1] that there's probably a class of guest images, which
will start writing junk (likely zeroes) into device model registers; we
previously would silently ignore any such bogus ELF sections. Likely
these images are of questionable correctness and this can be ignored. If
ever a problem, we could make these cases use address_space_write_rom()
instead, which is left unchanged for now.

This patch is based on previous work by Stefan Zabka.

[1] https://lore.kernel.org/all/CAFEAcA_2CEJKFyjvbwmpt=on=GgMVamQ5hiiVt+zUr6AY3X=Xg@mail.gmail.com/

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/213
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250210084648.33798-8-david@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memattrs.h |  5 ++++-
 include/exec/memory.h   |  3 ++-
 hw/core/cpu-system.c    | 13 +++++++++----
 system/physmem.c        |  9 ++-------
 4 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/exec/memattrs.h b/include/exec/memattrs.h
index 060b7e7131..8db1d30464 100644
--- a/include/exec/memattrs.h
+++ b/include/exec/memattrs.h
@@ -44,6 +44,8 @@ typedef struct MemTxAttrs {
      * (see MEMTX_ACCESS_ERROR).
      */
     unsigned int memory:1;
+    /* Debug access that can even write to ROM. */
+    unsigned int debug:1;
     /* Requester ID (for MSI for example) */
     unsigned int requester_id:16;
 
@@ -56,7 +58,8 @@ typedef struct MemTxAttrs {
      * Bus masters which don't specify any attributes will get this
      * (via the MEMTXATTRS_UNSPECIFIED constant), so that we can
      * distinguish "all attributes deliberately clear" from
-     * "didn't specify" if necessary.
+     * "didn't specify" if necessary. "debug" can be set alongside
+     * "unspecified".
      */
     bool unspecified;
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index b18ecf933e..78c4e0aec8 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -3018,7 +3018,8 @@ static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write,
     if (!memory_region_supports_direct_access(mr)) {
         return false;
     }
-    if (is_write) {
+    /* Debug access can write to ROM. */
+    if (is_write && !attrs.debug) {
         return !mr->readonly && !mr->rom_device;
     }
     return true;
diff --git a/hw/core/cpu-system.c b/hw/core/cpu-system.c
index 6aae28a349..6e307c8959 100644
--- a/hw/core/cpu-system.c
+++ b/hw/core/cpu-system.c
@@ -51,13 +51,18 @@ hwaddr cpu_get_phys_page_attrs_debug(CPUState *cpu, vaddr addr,
                                      MemTxAttrs *attrs)
 {
     CPUClass *cc = CPU_GET_CLASS(cpu);
+    hwaddr paddr;
 
     if (cc->sysemu_ops->get_phys_page_attrs_debug) {
-        return cc->sysemu_ops->get_phys_page_attrs_debug(cpu, addr, attrs);
+        paddr = cc->sysemu_ops->get_phys_page_attrs_debug(cpu, addr, attrs);
+    } else {
+        /* Fallback for CPUs which don't implement the _attrs_ hook */
+        *attrs = MEMTXATTRS_UNSPECIFIED;
+        paddr = cc->sysemu_ops->get_phys_page_debug(cpu, addr);
     }
-    /* Fallback for CPUs which don't implement the _attrs_ hook */
-    *attrs = MEMTXATTRS_UNSPECIFIED;
-    return cc->sysemu_ops->get_phys_page_debug(cpu, addr);
+    /* Indicate that this is a debug access. */
+    attrs->debug = 1;
+    return paddr;
 }
 
 hwaddr cpu_get_phys_page_debug(CPUState *cpu, vaddr addr)
diff --git a/system/physmem.c b/system/physmem.c
index 8745c10c9d..d3efdf13d3 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3680,13 +3680,8 @@ int cpu_memory_rw_debug(CPUState *cpu, vaddr addr,
         if (l > len)
             l = len;
         phys_addr += (addr & ~TARGET_PAGE_MASK);
-        if (is_write) {
-            res = address_space_write_rom(cpu->cpu_ases[asidx].as, phys_addr,
-                                          attrs, buf, l);
-        } else {
-            res = address_space_read(cpu->cpu_ases[asidx].as, phys_addr,
-                                     attrs, buf, l);
-        }
+        res = address_space_rw(cpu->cpu_ases[asidx].as, phys_addr, attrs, buf,
+                               l, is_write);
         if (res != MEMTX_OK) {
             return -1;
         }
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 13/14] system/physmem: handle hugetlb correctly in qemu_ram_remap()
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (11 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 12/14] physmem: teach cpu_memory_rw_debug() to write to more memory regions Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  2025-02-11 22:50 ` [PULL 14/14] system/physmem: poisoned memory discard on reboot Peter Xu
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand, William Roche

From: William Roche <william.roche@oracle.com>

The list of hwpoison pages used to remap the memory on reset
is based on the backend real page size.
To correctly handle hugetlb, we must mmap(MAP_FIXED) a complete
hugetlb page; hugetlb pages cannot be partially mapped.

Signed-off-by: William Roche <william.roche@oracle.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20250211212707.302391-2-william.roche@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/cpu-common.h |  2 +-
 accel/kvm/kvm-all.c       |  2 +-
 system/physmem.c          | 38 +++++++++++++++++++++++++++++---------
 3 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index b1d76d6985..3771b2130c 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -67,7 +67,7 @@ typedef uintptr_t ram_addr_t;
 
 /* memory API */
 
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
+void qemu_ram_remap(ram_addr_t addr);
 /* This should not be used by devices.  */
 ram_addr_t qemu_ram_addr_from_host(void *ptr);
 ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c65b790433..f89568bfa3 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1288,7 +1288,7 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr);
         g_free(page);
     }
 }
diff --git a/system/physmem.c b/system/physmem.c
index d3efdf13d3..af1175a57c 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2275,17 +2275,35 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
-void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+/*
+ * qemu_ram_remap - remap a single RAM page
+ *
+ * @addr: address in ram_addr_t address space.
+ *
+ * This function will try remapping a single page of guest RAM identified by
+ * @addr, essentially discarding memory to recover from previously poisoned
+ * memory (MCE). The page size depends on the RAMBlock (i.e., hugetlb). @addr
+ * does not have to point at the start of the page.
+ *
+ * This function is only to be used during system resets; it will kill the
+ * VM if remapping failed.
+ */
+void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
-    ram_addr_t offset;
+    uint64_t offset;
     int flags;
     void *area, *vaddr;
     int prot;
+    size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
         offset = addr - block->offset;
         if (offset < block->max_length) {
+            /* Respect the pagesize of our RAMBlock */
+            page_size = qemu_ram_pagesize(block);
+            offset = QEMU_ALIGN_DOWN(offset, page_size);
+
             vaddr = ramblock_ptr(block, offset);
             if (block->flags & RAM_PREALLOC) {
                 ;
@@ -2299,21 +2317,23 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
-                    area = mmap(vaddr, length, prot, flags, block->fd,
+                    area = mmap(vaddr, page_size, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {
                     flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, length, prot, flags, -1, 0);
+                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
                 }
                 if (area != vaddr) {
-                    error_report("Could not remap addr: "
-                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 length, addr);
+                    error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
+                                 " +%zx", block->idstr, offset,
+                                 block->fd_offset, page_size);
                     exit(1);
                 }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
+                memory_try_enable_merging(vaddr, page_size);
+                qemu_ram_setup_dump(vaddr, page_size);
             }
+
+            break;
         }
     }
 }
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PULL 14/14] system/physmem: poisoned memory discard on reboot
  2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
                   ` (12 preceding siblings ...)
  2025-02-11 22:50 ` [PULL 13/14] system/physmem: handle hugetlb correctly in qemu_ram_remap() Peter Xu
@ 2025-02-11 22:50 ` Peter Xu
  13 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2025-02-11 22:50 UTC (permalink / raw)
  To: qemu-devel, Stefan Hajnoczi
  Cc: peterx, Paolo Bonzini, David Hildenbrand, William Roche

From: William Roche <william.roche@oracle.com>

Repair poisoned memory location(s), calling ram_block_discard_range():
punching a hole in the backend file when necessary and regenerating
a usable memory.
If the kernel doesn't support the madvise calls used by this function
and we are dealing with anonymous memory, fall back to remapping the
location(s).

Signed-off-by: William Roche <william.roche@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250211212707.302391-3-william.roche@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 system/physmem.c | 57 ++++++++++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index af1175a57c..67bdf631e6 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2275,6 +2275,23 @@ void qemu_ram_free(RAMBlock *block)
 }
 
 #ifndef _WIN32
+/* Simply remap the given VM memory location from start to start+length */
+static int qemu_ram_remap_mmap(RAMBlock *block, uint64_t start, size_t length)
+{
+    int flags, prot;
+    void *area;
+    void *host_startaddr = block->host + start;
+
+    assert(block->fd < 0);
+    flags = MAP_FIXED | MAP_ANONYMOUS;
+    flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
+    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+    prot = PROT_READ;
+    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+    area = mmap(host_startaddr, length, prot, flags, -1, 0);
+    return area != host_startaddr ? -errno : 0;
+}
+
 /*
  * qemu_ram_remap - remap a single RAM page
  *
@@ -2292,9 +2309,7 @@ void qemu_ram_remap(ram_addr_t addr)
 {
     RAMBlock *block;
     uint64_t offset;
-    int flags;
-    void *area, *vaddr;
-    int prot;
+    void *vaddr;
     size_t page_size;
 
     RAMBLOCK_FOREACH(block) {
@@ -2310,24 +2325,24 @@ void qemu_ram_remap(ram_addr_t addr)
             } else if (xen_enabled()) {
                 abort();
             } else {
-                flags = MAP_FIXED;
-                flags |= block->flags & RAM_SHARED ?
-                         MAP_SHARED : MAP_PRIVATE;
-                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
-                prot = PROT_READ;
-                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
-                if (block->fd >= 0) {
-                    area = mmap(vaddr, page_size, prot, flags, block->fd,
-                                offset + block->fd_offset);
-                } else {
-                    flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, page_size, prot, flags, -1, 0);
-                }
-                if (area != vaddr) {
-                    error_report("Could not remap RAM %s:%" PRIx64 "+%" PRIx64
-                                 " +%zx", block->idstr, offset,
-                                 block->fd_offset, page_size);
-                    exit(1);
+                if (ram_block_discard_range(block, offset, page_size) != 0) {
+                    /*
+                     * Fall back to using mmap() only for anonymous mapping,
+                     * as if a backing file is associated we may not be able
+                     * to recover the memory in all cases.
+                     * So don't take the risk of using only mmap and fail now.
+                     */
+                    if (block->fd >= 0) {
+                        error_report("Could not remap RAM %s:%" PRIx64 "+%"
+                                     PRIx64 " +%zx", block->idstr, offset,
+                                     block->fd_offset, page_size);
+                        exit(1);
+                    }
+                    if (qemu_ram_remap_mmap(block, offset, page_size) != 0) {
+                        error_report("Could not remap RAM %s:%" PRIx64 " +%zx",
+                                     block->idstr, offset, page_size);
+                        exit(1);
+                    }
                 }
                 memory_try_enable_merging(vaddr, page_size);
                 qemu_ram_setup_dump(vaddr, page_size);
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-02-12 14:18 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-11 22:50 [PULL 00/14] Mem next patches Peter Xu
2025-02-11 22:50 ` [PULL 01/14] system/physmem: take into account fd_offset for file fallocate Peter Xu
2025-02-11 22:50 ` [PULL 02/14] os: add an ability to lock memory on_fault Peter Xu
2025-02-12 14:13   ` Stefan Hajnoczi
2025-02-12 14:17     ` Daniil Tatianin
2025-02-11 22:50 ` [PULL 03/14] system/vl: extract overcommit option parsing into a helper Peter Xu
2025-02-11 22:50 ` [PULL 04/14] system: introduce a new MlockState enum Peter Xu
2025-02-11 22:50 ` [PULL 05/14] overcommit: introduce mem-lock=on-fault Peter Xu
2025-02-11 22:50 ` [PULL 06/14] physmem: factor out memory_region_is_ram_device() check in memory_access_is_direct() Peter Xu
2025-02-11 22:50 ` [PULL 07/14] physmem: factor out RAM/ROMD " Peter Xu
2025-02-11 22:50 ` [PULL 08/14] physmem: factor out direct access check into memory_region_supports_direct_access() Peter Xu
2025-02-11 22:50 ` [PULL 09/14] physmem: disallow direct access to RAM DEVICE in address_space_write_rom() Peter Xu
2025-02-11 22:50 ` [PULL 10/14] memory: pass MemTxAttrs to memory_access_is_direct() Peter Xu
2025-02-11 22:50 ` [PULL 11/14] hmp: use cpu_get_phys_page_debug() in hmp_gva2gpa() Peter Xu
2025-02-11 22:50 ` [PULL 12/14] physmem: teach cpu_memory_rw_debug() to write to more memory regions Peter Xu
2025-02-11 22:50 ` [PULL 13/14] system/physmem: handle hugetlb correctly in qemu_ram_remap() Peter Xu
2025-02-11 22:50 ` [PULL 14/14] system/physmem: poisoned memory discard on reboot Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).