* live-migration performance regression when using pmem
@ 2025-05-12 15:16 Chaney, Ben
2025-05-12 18:50 ` Peter Xu
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Chaney, Ben @ 2025-05-12 15:16 UTC (permalink / raw)
To: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
beata.michalska@linaro.org, richard.henderson@linaro.org,
alex.bennee@linaro.org, peter.maydell@linaro.org,
junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
peterx@redhat.com, david@redhat.com, philmd@linaro.org,
xiaoguangrong.eric@gmail.com
Cc: Tottenham, Max, Hunt, Joshua, Glasgall, Anna
Hello,
When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
Profiling suggests very high activity in this code path:
ffffffffa2956de6 clean_cache_range+0x26 ([kernel.kallsyms])
ffffffffa2359b0f dax_writeback_mapping_range+0x1ef ([kernel.kallsyms])
ffffffffc0c6336d ext4_dax_writepages+0x7d ([kernel.kallsyms])
ffffffffa2242dac do_writepages+0xbc ([kernel.kallsyms])
ffffffffa2235ea6 filemap_fdatawrite_wbc+0x66 ([kernel.kallsyms])
ffffffffa223a896 __filemap_fdatawrite_range+0x46 ([kernel.kallsyms])
ffffffffa223af73 file_write_and_wait_range+0x43 ([kernel.kallsyms])
ffffffffc0c57ecb ext4_sync_file+0xfb ([kernel.kallsyms])
ffffffffa228a331 __do_sys_msync+0x1c1 ([kernel.kallsyms])
ffffffffa2997fe6 do_syscall_64+0x56 ([kernel.kallsyms])
ffffffffa2a00126 entry_SYSCALL_64_after_hwframe+0x6e ([kernel.kallsyms])
11ec5f msync+0x4f (/usr/lib/x86_64-linux-gnu/libc.so.6)
675ada qemu_ram_msync+0x8a (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
6873c7 xbzrle_load_cleanup+0x37 (inlined)
6873c7 ram_load_cleanup+0x37 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
4ff375 qemu_loadvm_state_cleanup+0x55 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
500f0b qemu_loadvm_state+0x15b (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
4ecf85 process_incoming_migration_co+0x95 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
8b6412 qemu_coroutine_self+0x2 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
ffffffffffffffff [unknown] ([unknown])
I was able to resolve the performance issue by removing the call to qemu_ram_block_writeback in ram_load_cleanup. This causes the performance to return to normal. It looks like this code path was initially added to ensure the memory was synchronized if the persistent memory region is backed by an NVDIMM device. Does it serve any purpose if pmem is instead backed by standard DRAM?
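For reference, the change I tested is essentially the sketch below (hunk context approximate; this is only the experiment, not a patch I am proposing as-is):

diff --git a/migration/ram.c b/migration/ram.c
--- a/migration/ram.c
+++ b/migration/ram.c
@@ ... @@ static int ram_load_cleanup(void *opaque)
     RAMBlock *rb;
 
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
-        qemu_ram_block_writeback(rb);
+        /* Skipping the per-block writeback (msync()/pmem_persist()) here
+         * removes the multi-minute pause we see on the destination. */
     }
 
     xbzrle_load_cleanup();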
I'm also curious about the intended use of this code path in the NVDIMM case. It seems like it would run into a few issues. On its own it seems insufficient to restore the VM state if the host crashes after a live migration: the memory region being synced is only the guest memory, and it doesn't save the driver state on the host side. Also, once the migration completes, the guest can redirty the pages. If the host crashes after that point, the guest memory will still be in an inconsistent state unless the crash is exceptionally well timed. Does anyone have any insight into why this sync operation was introduced?
Thank you,
Ben Chaney
^ permalink raw reply [flat|nested] 11+ messages in thread

* Re: live-migration performance regression when using pmem
  2025-05-12 15:16 live-migration performance regression when using pmem Chaney, Ben
@ 2025-05-12 18:50 ` Peter Xu
  2025-05-13 15:48   ` Chaney, Ben
  2025-05-12 19:52 ` Michael S. Tsirkin
  2025-05-13 17:21 ` David Hildenbrand
  2 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2025-05-12 18:50 UTC (permalink / raw)
  To: Chaney, Ben
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      david@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com,
      Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On Mon, May 12, 2025 at 03:16:34PM +0000, Chaney, Ben wrote:
> Hello,
>
> When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
[...]
> I'm also curious about the intended use of this code path in the NVDIMM case. [...] Does anyone have any insight into why this sync operation was introduced?

What you said makes sense to me, but I'm neither a pmem user nor an
expert. Let's wait to see whether others would like to chime in.

What's the first bad commit of the regression? Is it since the v10.0 release?

I remember something changed in a relevant path last release, which
happened in the VFIO work:

https://lore.kernel.org/qemu-devel/21bb5ca337b1d5a802e697f553f37faf296b5ff4.1741193259.git.maciej.szmigiero@oracle.com/

But that doesn't look like it matters for precopy, and since you didn't
mention it I would expect you're using precopy, not postcopy.

In general, if it's a regression, having the 1st bad commit would always
be helpful.

Thanks,

-- 
Peter Xu

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-12 18:50 ` Peter Xu
@ 2025-05-13 15:48   ` Chaney, Ben
  2025-05-14 16:41     ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Chaney, Ben @ 2025-05-13 15:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      david@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com,
      Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On 5/12/25, 2:50 PM, "Peter Xu" <peterx@redhat.com> wrote:

> What you said makes sense to me, but I'm neither a pmem user nor an
> expert. Let's wait to see whether others would like to chime in.
>
> What's the first bad commit of the regression? Is it since the v10.0 release?

Hi Peter,

We are still on an old branch (7.2). The issue began when we enabled pmem,
not as the result of a code change.

Thanks,
Ben

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-13 15:48   ` Chaney, Ben
@ 2025-05-14 16:41     ` Peter Xu
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2025-05-14 16:41 UTC (permalink / raw)
  To: Chaney, Ben
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      david@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com,
      Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On Tue, May 13, 2025 at 03:48:06PM +0000, Chaney, Ben wrote:
> On 5/12/25, 2:50 PM, "Peter Xu" <peterx@redhat.com> wrote:
[...]
> Hi Peter,
>
> We are still on an old branch (7.2). The issue began when we enabled pmem,
> not as the result of a code change.

OK. Then I think it's not strictly a regression, as it may have been like
that forever.

I do see that qemu_ram_msync() has this anyway:

#ifdef CONFIG_LIBPMEM
    /* The lack of support for pmem should not block the sync */
    if (ramblock_is_pmem(block)) {
        void *addr = ramblock_ptr(block, start);
        pmem_persist(addr, length);
        return;
    }
#endif

Does it mean that you're using pmem but without libpmem compiled? From
your stack dump, it looks like msync() is triggered, and I would expect
that won't happen if the ramblock in question is pmem.

Is your case using DRAM as the backing storage (in the form of DAX) for
the ext4 file, while exposed as pmem to the guest?

I'd expect that if at least the above check passes, pmem_persist() would
be faster, though I don't know by how much.

It still looks reasonable for QEMU to always sync here if it's pmem,
because QEMU still sees this ramblock as persistent storage, and after
migration QEMU wants to make sure everything is persisted.

That said, I wonder if David was right in the other email that we still
have some regression, and at least migration should skip the sync for
!pmem, that is:

diff --git a/migration/ram.c b/migration/ram.c
index d26dbd37c4..a93da18842 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3672,7 +3672,9 @@ static int ram_load_cleanup(void *opaque)
     RAMBlock *rb;
 
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
-        qemu_ram_block_writeback(rb);
+        if (ramblock_is_pmem(rb)) {
+            qemu_ram_block_writeback(rb);
+        }
     }
 
     xbzrle_load_cleanup();

But if you're using a real pmem ramblock, it shouldn't affect your use
case.

Thanks,

-- 
Peter Xu

^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-12 15:16 live-migration performance regression when using pmem Chaney, Ben
  2025-05-12 18:50 ` Peter Xu
@ 2025-05-12 19:52 ` Michael S. Tsirkin
  2025-05-13 17:21 ` David Hildenbrand
  2 siblings, 0 replies; 11+ messages in thread
From: Michael S. Tsirkin @ 2025-05-12 19:52 UTC (permalink / raw)
  To: Chaney, Ben
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      qemu-devel@nongnu.org, pbonzini@redhat.com, peterx@redhat.com,
      david@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com,
      Tottenham, Max, Hunt, Joshua, Glasgall, Anna, Haozhong Zhang

On Mon, May 12, 2025 at 03:16:34PM +0000, Chaney, Ben wrote:
> Hello,
>
> When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
[...]
> I was able to resolve the performance issue by removing the call to qemu_ram_block_writeback in ram_load_cleanup. [...] Does anyone have any insight into why this sync operation was introduced?

It was added here:

commit 56eb90af39abf66c0e80588a9f50c31e7df7320b
Author: Junyan He <junyan.he@intel.com>
Date:   Wed Jul 18 15:48:03 2018 +0800

    migration/ram: ensure write persistence on loading all data to PMEM.
    
    Because we need to make sure the pmem kind memory data is synced
    after migration, we choose to call pmem_persist() when the migration
    finish. This will make sure the data of pmem is safe and will not
    lose if power is off.
    
    Signed-off-by: Junyan He <junyan.he@intel.com>
    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Igor Mammedov <imammedo@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

It kind of sounded reasonable ... but I don't remember.
Also CCing Haozhong Zhang, who worked in this area.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-12 15:16 live-migration performance regression when using pmem Chaney, Ben
  2025-05-12 18:50 ` Peter Xu
  2025-05-12 19:52 ` Michael S. Tsirkin
@ 2025-05-13 17:21 ` David Hildenbrand
  2025-05-13 18:40   ` Chaney, Ben
  2025-05-13 20:11   ` Michael S. Tsirkin
  2 siblings, 2 replies; 11+ messages in thread
From: David Hildenbrand @ 2025-05-13 17:21 UTC (permalink / raw)
  To: Chaney, Ben, yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      peterx@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com
  Cc: Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On 12.05.25 17:16, Chaney, Ben wrote:
> Hello,
>
> When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
[...]
> I was able to resolve the performance issue by removing the call to qemu_ram_block_writeback in ram_load_cleanup. This causes the performance to return to normal. It looks like this code path was initially added to ensure the memory was synchronized if the persistent memory region is backed by an NVDIMM device. Does it serve any purpose if pmem is instead backed by standard DRAM?

Are you using a read-only NVDIMM?

In that case, I assume we would never need msync.

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 94bb3ccbe4..819b8ef829 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -153,7 +153,8 @@ void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length);
 
 /* Clear whole block of mem */
 static inline void qemu_ram_block_writeback(RAMBlock *block)
 {
-    qemu_ram_msync(block, 0, block->used_length);
+    if (!(block->flags & RAM_READONLY))
+        qemu_ram_msync(block, 0, block->used_length);
 }

-- 
Cheers,

David / dhildenb

^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-13 17:21 ` David Hildenbrand
@ 2025-05-13 18:40   ` Chaney, Ben
  2025-05-13 20:11   ` Michael S. Tsirkin
  1 sibling, 0 replies; 11+ messages in thread
From: Chaney, Ben @ 2025-05-13 18:40 UTC (permalink / raw)
  To: David Hildenbrand, yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      peterx@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com
  Cc: Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On 5/13/25, 1:21 PM, "David Hildenbrand" <david@redhat.com> wrote:

> Are you using a read-only NVDIMM?
> In that case, I assume we would never need msync.

We aren't using an NVDIMM at all; we are using pmem that is backed by DRAM.

Ben

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-13 17:21 ` David Hildenbrand
  2025-05-13 18:40   ` Chaney, Ben
@ 2025-05-13 20:11   ` Michael S. Tsirkin
  2025-05-14 13:57     ` David Hildenbrand
  1 sibling, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2025-05-13 20:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chaney, Ben, yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      qemu-devel@nongnu.org, pbonzini@redhat.com, peterx@redhat.com,
      philmd@linaro.org, xiaoguangrong.eric@gmail.com, Tottenham, Max,
      Hunt, Joshua, Glasgall, Anna

On Tue, May 13, 2025 at 07:21:36PM +0200, David Hildenbrand wrote:
> On 12.05.25 17:16, Chaney, Ben wrote:
> > Hello,
> >
> > When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
[...]
> Are you using a read-only NVDIMM?
>
> In that case, I assume we would never need msync.
>
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 94bb3ccbe4..819b8ef829 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -153,7 +153,8 @@ void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length);
>  
>  /* Clear whole block of mem */
>  static inline void qemu_ram_block_writeback(RAMBlock *block)
>  {
> -    qemu_ram_msync(block, 0, block->used_length);
> +    if (!(block->flags & RAM_READONLY))
> +        qemu_ram_msync(block, 0, block->used_length);
>  }

I acked the original change, but now I don't understand why it is
critical to preserve memory at a random time that has nothing to do with
guest state. David, maybe you understand?

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-13 20:11   ` Michael S. Tsirkin
@ 2025-05-14 13:57     ` David Hildenbrand
  2025-06-12 15:34       ` Chaney, Ben
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-05-14 13:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Chaney, Ben, yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      qemu-devel@nongnu.org, pbonzini@redhat.com, peterx@redhat.com,
      philmd@linaro.org, xiaoguangrong.eric@gmail.com, Tottenham, Max,
      Hunt, Joshua, Glasgall, Anna, Junyan He

On 13.05.25 22:11, Michael S. Tsirkin wrote:
> On Tue, May 13, 2025 at 07:21:36PM +0200, David Hildenbrand wrote:
>> On 12.05.25 17:16, Chaney, Ben wrote:
[...]
>> Are you using a read-only NVDIMM?
>>
>> In that case, I assume we would never need msync.
[...]
> I acked the original change, but now I don't understand why it is
> critical to preserve memory at a random time that has nothing to do with
> guest state. David, maybe you understand?

Let me dig ...

As you said, we originally added pmem_persist() in:

commit 56eb90af39abf66c0e80588a9f50c31e7df7320b (mst/mst-next)
Author: Junyan He <junyan.he@intel.com>
Date:   Wed Jul 18 15:48:03 2018 +0800

    migration/ram: ensure write persistence on loading all data to PMEM.
    
    Because we need to make sure the pmem kind memory data is synced
    after migration, we choose to call pmem_persist() when the migration
    finish. This will make sure the data of pmem is safe and will not
    lose if power is off.
    
    Signed-off-by: Junyan He <junyan.he@intel.com>
    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Igor Mammedov <imammedo@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Then, we generalized it to not use pmem_persist() directly but to do a
qemu_ram_block_writeback() -- which includes a conditional pmem_persist()
-- in:

commit bd108a44bc29cb648dd930564996b0128e66ac01
Author: Beata Michalska <beata.michalska@linaro.org>
Date:   Thu Nov 21 00:08:42 2019 +0000

    migration: ram: Switch to ram block writeback
    
    Switch to ram block writeback for pmem migration.
    
    Signed-off-by: Beata Michalska <beata.michalska@linaro.org>
    Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
    Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
    Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Message-id: 20191121000843.24844-4-beata.michalska@linaro.org
    Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

That was part of the patch series "[PATCH 0/4] target/arm: Support for
Data Cache Clean up to PoP" [1]. At first it looks like a cleanup, but it
has the effect of also affecting non-pmem memory backends.

A discussion [2] includes some reasoning around libpmem not being
available, and msync being a suitable replacement in that case [3]:

"
According to the PMDK man page, pmem_persist is supposed to be equivalent
to the msync. It's just more performant. So in case of real pmem hardware
it should be all good.
"

So, the real question is: why do we have to sync *after* migration on the
migration *destination*?

I think the reason is simple if you assume that the pmem device will
differ between source and destination, and that we actually migrated that
data in the migration stream.

On the migration destination, we will fill pmem with data we obtained
from the source via the migration stream: writing the data to pmem using
ordinary memory writes. pmem requires a sync to make sure that the data
is *actually* persisted.

The VM will certainly not issue a sync, because it didn't modify any
pages. So we have to issue a sync such that pmem is guaranteed to be
persisted. In the case of ordinary files, this means writing data back to
disk ("persist on disk").
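As a toy illustration of that last point (not QEMU code; the path below is
made up and error handling is omitted), this is roughly what the destination
does in userspace terms: ordinary stores into a mapped DAX/pmem file, which
only become durable after an explicit flush:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical DAX-backed file standing in for guest pmem. */
    int fd = open("/mnt/pmem0/guest-mem", O_RDWR);
    size_t len = 2 * 1024 * 1024;
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* "Migration stream" data lands via ordinary stores and may sit in
     * volatile CPU caches. */
    memset(map, 0xab, len);

    /* Without this (or pmem_persist() via libpmem), the data is not
     * guaranteed to survive a power failure. */
    msync(map, len, MS_SYNC);

    munmap(map, len);
    close(fd);
    return 0;
}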
I'll note that NVDIMMs are not suitable for ordinary files in general,
because we cannot easily implement guest-triggered pmem syncs using the
basic instruction set. For R/O NVDIMMs it's fine. For the R/W use case,
virtio-pmem was invented, whereby the VM does the sync -> msync using an
explicit guest->host call. So once the guest has synced, the data is
actually persisted.

Now, NVDIMMs could be safely used in R/O mode backed by ordinary files.
Here, we would *still* want to do this msync.

So, we can really only safely ignore the msync if we know that the mmap()
is R/O (in which case migration would probably fail either way? unless
the RAMBlock is ignored).

We could skip the msync if we detect that we have an ordinary file, but
there might still be the case where we have a R/W NVDIMM that nobody ever
actually writes to ... so it's tricky. Certainly worth exploring. But
there would be the chance of data loss for R/O NVDIMMs after migration on
hypervisor crash ...

[1] https://patchew.org/QEMU/20191121000843.24844-1-beata.michalska@linaro.org/
[2] https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01750.html
[3] https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01772.html

-- 
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-14 13:57     ` David Hildenbrand
@ 2025-06-12 15:34       ` Chaney, Ben
  2025-06-12 16:06         ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Chaney, Ben @ 2025-06-12 15:34 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      qemu-devel@nongnu.org, pbonzini@redhat.com, peterx@redhat.com,
      philmd@linaro.org, xiaoguangrong.eric@gmail.com, Tottenham, Max,
      Hunt, Joshua, Glasgall, Anna, Junyan He

On 5/14/25, 9:59 AM, "David Hildenbrand" <david@redhat.com> wrote:

> Because we need to make sure the pmem kind memory data is synced
> after migration, we choose to call pmem_persist() when the migration
> finish. This will make sure the data of pmem is safe and will not
> lose if power is off.

Thank you for clarifying. I think I initially misunderstood the purpose of
this code path.

In that case, how about something like this to restrict the sync to only
run when it would be effective? If the memory region is volatile then
there is no benefit to syncing.

Thanks,
Ben

---
 migration/ram.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index f25ebd9620..24fb29f0a5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3930,7 +3930,9 @@ static int ram_load_cleanup(void *opaque)
     RAMBlock *rb;
 
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
-        qemu_ram_block_writeback(rb);
+        if (rb->mr->nonvolatile) {
+            qemu_ram_block_writeback(rb);
+        }
     }
 
     xbzrle_load_cleanup();
-- 
2.40.1

^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-06-12 15:34       ` Chaney, Ben
@ 2025-06-12 16:06         ` Peter Xu
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2025-06-12 16:06 UTC (permalink / raw)
  To: Chaney, Ben
  Cc: David Hildenbrand, Michael S. Tsirkin, yury-kotov@yandex-team.ru,
      dgilbert@redhat.com, beata.michalska@linaro.org,
      richard.henderson@linaro.org, alex.bennee@linaro.org,
      peter.maydell@linaro.org, junyan.he@intel.com, stefanha@redhat.com,
      imammedo@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      philmd@linaro.org, xiaoguangrong.eric@gmail.com, Tottenham, Max,
      Hunt, Joshua, Glasgall, Anna

On Thu, Jun 12, 2025 at 03:34:35PM +0000, Chaney, Ben wrote:
> On 5/14/25, 9:59 AM, "David Hildenbrand" <david@redhat.com> wrote:
[...]
> In that case, how about something like this to restrict the sync to only
> run when it would be effective? If the memory region is volatile then
> there is no benefit to syncing.
>
> diff --git a/migration/ram.c b/migration/ram.c
> index f25ebd9620..24fb29f0a5 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3930,7 +3930,9 @@ static int ram_load_cleanup(void *opaque)
>      RAMBlock *rb;
>  
>      RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
> -        qemu_ram_block_writeback(rb);
> +        if (rb->mr->nonvolatile) {
> +            qemu_ram_block_writeback(rb);
> +        }
>      }
>  
>      xbzrle_load_cleanup();

Looks good here; I think that's what I mentioned:

https://lore.kernel.org/all/aCTHwhrXROReEPEh@x1.local/

But I guess I got the use case wrong; looks like it worked for you.

In that case, please switch to memory_region_is_nonvolatile(), add a
proper Fixes tag, and copy stable.

Thanks,

-- 
Peter Xu

^ permalink raw reply [flat|nested] 11+ messages in thread
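For completeness, the adjustment Peter asks for would presumably look
something like the sketch below, using the memory_region_is_nonvolatile()
helper (one of the MemoryRegion accessors); this is an illustration of the
suggestion, not a patch that was posted in this thread:

diff --git a/migration/ram.c b/migration/ram.c
--- a/migration/ram.c
+++ b/migration/ram.c
@@ ... @@ static int ram_load_cleanup(void *opaque)
     RAMBlock *rb;
 
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
-        qemu_ram_block_writeback(rb);
+        /* Only sync blocks that model non-volatile memory (pmem/NVDIMM);
+         * volatile DRAM-backed blocks gain nothing from the writeback. */
+        if (memory_region_is_nonvolatile(rb->mr)) {
+            qemu_ram_block_writeback(rb);
+        }
     }
 
     xbzrle_load_cleanup();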
end of thread, other threads:[~2025-06-12 16:07 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-12 15:16 live-migration performance regression when using pmem Chaney, Ben
2025-05-12 18:50 ` Peter Xu
2025-05-13 15:48   ` Chaney, Ben
2025-05-14 16:41     ` Peter Xu
2025-05-12 19:52 ` Michael S. Tsirkin
2025-05-13 17:21 ` David Hildenbrand
2025-05-13 18:40   ` Chaney, Ben
2025-05-13 20:11   ` Michael S. Tsirkin
2025-05-14 13:57     ` David Hildenbrand
2025-06-12 15:34       ` Chaney, Ben
2025-06-12 16:06         ` Peter Xu