* live-migration performance regression when using pmem
@ 2025-05-12 15:16 Chaney, Ben
2025-05-12 18:50 ` Peter Xu
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Chaney, Ben @ 2025-05-12 15:16 UTC (permalink / raw)
To: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
beata.michalska@linaro.org, richard.henderson@linaro.org,
alex.bennee@linaro.org, peter.maydell@linaro.org,
junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
peterx@redhat.com, david@redhat.com, philmd@linaro.org,
xiaoguangrong.eric@gmail.com
Cc: Tottenham, Max, Hunt, Joshua, Glasgall, Anna
Hello,
When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
Profiling suggests very high activity in this code path:
ffffffffa2956de6 clean_cache_range+0x26 ([kernel.kallsyms])
ffffffffa2359b0f dax_writeback_mapping_range+0x1ef ([kernel.kallsyms])
ffffffffc0c6336d ext4_dax_writepages+0x7d ([kernel.kallsyms])
ffffffffa2242dac do_writepages+0xbc ([kernel.kallsyms])
ffffffffa2235ea6 filemap_fdatawrite_wbc+0x66 ([kernel.kallsyms])
ffffffffa223a896 __filemap_fdatawrite_range+0x46 ([kernel.kallsyms])
ffffffffa223af73 file_write_and_wait_range+0x43 ([kernel.kallsyms])
ffffffffc0c57ecb ext4_sync_file+0xfb ([kernel.kallsyms])
ffffffffa228a331 __do_sys_msync+0x1c1 ([kernel.kallsyms])
ffffffffa2997fe6 do_syscall_64+0x56 ([kernel.kallsyms])
ffffffffa2a00126 entry_SYSCALL_64_after_hwframe+0x6e ([kernel.kallsyms])
11ec5f msync+0x4f (/usr/lib/x86_64-linux-gnu/libc.so.6)
675ada qemu_ram_msync+0x8a (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
6873c7 xbzrle_load_cleanup+0x37 (inlined)
6873c7 ram_load_cleanup+0x37 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
4ff375 qemu_loadvm_state_cleanup+0x55 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
500f0b qemu_loadvm_state+0x15b (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
4ecf85 process_incoming_migration_co+0x95 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
8b6412 qemu_coroutine_self+0x2 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
ffffffffffffffff [unknown] ([unknown])
I was able to resolve the performance issue by removing the call to qemu_ram_block_writeback in ram_load_cleanup. This causes the performance to return to normal. It looks like this code path was initially added to ensure the memory was synchronized if the persistent memory region is backed by an NVDIMM device. Does it serve any purpose if pmem is instead backed by standard DRAM?
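For reference, the change I tested is essentially the sketch below (hunk context approximate; this is only the experiment, not a patch I am proposing as-is):

diff --git a/migration/ram.c b/migration/ram.c
--- a/migration/ram.c
+++ b/migration/ram.c
@@ ... @@ static int ram_load_cleanup(void *opaque)
     RAMBlock *rb;
 
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
-        qemu_ram_block_writeback(rb);
+        /* Skipping the per-block writeback (msync()/pmem_persist()) here
+         * removes the multi-minute pause we see on the destination. */
     }
 
     xbzrle_load_cleanup();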
I'm also curious about the intended use of this code path in the NVDIMM case. It seems like it would run into a few issues. On its own it seems insufficient to restore the VM state if the host crashes after a live migration: the memory region being synced is only the guest memory, and it doesn't save the driver state on the host side. Also, once the migration completes, the guest can redirty the pages. If the host crashes after that point, the guest memory will still be in an inconsistent state unless the crash is exceptionally well timed. Does anyone have any insight into why this sync operation was introduced?
Thank you,
Ben Chaney
^ permalink raw reply [flat|nested] 11+ messages in thread

* Re: live-migration performance regression when using pmem
  2025-05-12 15:16 live-migration performance regression when using pmem Chaney, Ben
@ 2025-05-12 18:50 ` Peter Xu
  2025-05-13 15:48   ` Chaney, Ben
  2025-05-12 19:52 ` Michael S. Tsirkin
  2025-05-13 17:21 ` David Hildenbrand
  2 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2025-05-12 18:50 UTC (permalink / raw)
  To: Chaney, Ben
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      david@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com,
      Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On Mon, May 12, 2025 at 03:16:34PM +0000, Chaney, Ben wrote:
> Hello,
>
> When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
[...]
> I'm also curious about the intended use of this code path in the NVDIMM case. [...] Does anyone have any insight into why this sync operation was introduced?

What you said makes sense to me, but I'm neither a pmem user nor an
expert. Let's wait to see whether others would like to chime in.

What's the first bad commit of the regression? Is it since the v10.0 release?

I remember something changed in a relevant path last release, which
happened in the VFIO work:

https://lore.kernel.org/qemu-devel/21bb5ca337b1d5a802e697f553f37faf296b5ff4.1741193259.git.maciej.szmigiero@oracle.com/

But that doesn't look like it matters for precopy, and since you didn't
mention it I would expect you're using precopy, not postcopy.

In general, if it's a regression, having the 1st bad commit would always
be helpful.

Thanks,

-- 
Peter Xu

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-12 18:50 ` Peter Xu
@ 2025-05-13 15:48   ` Chaney, Ben
  2025-05-14 16:41     ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Chaney, Ben @ 2025-05-13 15:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      david@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com,
      Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On 5/12/25, 2:50 PM, "Peter Xu" <peterx@redhat.com> wrote:

> What you said makes sense to me, but I'm neither a pmem user nor an
> expert. Let's wait to see whether others would like to chime in.
>
> What's the first bad commit of the regression? Is it since the v10.0 release?

Hi Peter,

We are still on an old branch (7.2). The issue began when we enabled pmem,
not as the result of a code change.

Thanks,
Ben

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-13 15:48   ` Chaney, Ben
@ 2025-05-14 16:41     ` Peter Xu
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2025-05-14 16:41 UTC (permalink / raw)
  To: Chaney, Ben
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      david@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com,
      Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On Tue, May 13, 2025 at 03:48:06PM +0000, Chaney, Ben wrote:
> On 5/12/25, 2:50 PM, "Peter Xu" <peterx@redhat.com> wrote:
[...]
> Hi Peter,
>
> We are still on an old branch (7.2). The issue began when we enabled pmem,
> not as the result of a code change.

OK. Then I think it's not strictly a regression, as it may have been like
that forever.

I do see that qemu_ram_msync() has this anyway:

#ifdef CONFIG_LIBPMEM
    /* The lack of support for pmem should not block the sync */
    if (ramblock_is_pmem(block)) {
        void *addr = ramblock_ptr(block, start);
        pmem_persist(addr, length);
        return;
    }
#endif

Does it mean that you're using pmem but without libpmem compiled? From
your stack dump, it looks like msync() is triggered, and I would expect
that won't happen if the ramblock in question is pmem.

Is your case using DRAM as the backing storage (in the form of DAX) for
the ext4 file, while exposed as pmem to the guest?

I'd expect that if at least the above check passes, pmem_persist() would
be faster, though I don't know by how much.

It still looks reasonable for QEMU to always sync here if it's pmem,
because QEMU still sees this ramblock as persistent storage, and after
migration QEMU wants to make sure everything is persisted.

That said, I wonder if David was right in the other email that we still
have some regression, and at least migration should skip the sync for
!pmem, that is:

diff --git a/migration/ram.c b/migration/ram.c
index d26dbd37c4..a93da18842 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3672,7 +3672,9 @@ static int ram_load_cleanup(void *opaque)
     RAMBlock *rb;
 
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
-        qemu_ram_block_writeback(rb);
+        if (ramblock_is_pmem(rb)) {
+            qemu_ram_block_writeback(rb);
+        }
     }
 
     xbzrle_load_cleanup();

But if you're using a real pmem ramblock, it shouldn't affect your use
case.

Thanks,

-- 
Peter Xu

^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-12 15:16 live-migration performance regression when using pmem Chaney, Ben
  2025-05-12 18:50 ` Peter Xu
@ 2025-05-12 19:52 ` Michael S. Tsirkin
  2025-05-13 17:21 ` David Hildenbrand
  2 siblings, 0 replies; 11+ messages in thread
From: Michael S. Tsirkin @ 2025-05-12 19:52 UTC (permalink / raw)
  To: Chaney, Ben
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      qemu-devel@nongnu.org, pbonzini@redhat.com, peterx@redhat.com,
      david@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com,
      Tottenham, Max, Hunt, Joshua, Glasgall, Anna, Haozhong Zhang

On Mon, May 12, 2025 at 03:16:34PM +0000, Chaney, Ben wrote:
> Hello,
>
> When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
[...]
> I was able to resolve the performance issue by removing the call to qemu_ram_block_writeback in ram_load_cleanup. [...] Does anyone have any insight into why this sync operation was introduced?

It was added here:

commit 56eb90af39abf66c0e80588a9f50c31e7df7320b
Author: Junyan He <junyan.he@intel.com>
Date:   Wed Jul 18 15:48:03 2018 +0800

    migration/ram: ensure write persistence on loading all data to PMEM.
    
    Because we need to make sure the pmem kind memory data is synced
    after migration, we choose to call pmem_persist() when the migration
    finish. This will make sure the data of pmem is safe and will not
    lose if power is off.
    
    Signed-off-by: Junyan He <junyan.he@intel.com>
    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Igor Mammedov <imammedo@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

It kind of sounded reasonable ... but I don't remember.
Also CCing Haozhong Zhang, who worked in this area.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-12 15:16 live-migration performance regression when using pmem Chaney, Ben
  2025-05-12 18:50 ` Peter Xu
  2025-05-12 19:52 ` Michael S. Tsirkin
@ 2025-05-13 17:21 ` David Hildenbrand
  2025-05-13 18:40   ` Chaney, Ben
  2025-05-13 20:11   ` Michael S. Tsirkin
  2 siblings, 2 replies; 11+ messages in thread
From: David Hildenbrand @ 2025-05-13 17:21 UTC (permalink / raw)
  To: Chaney, Ben, yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      peterx@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com
  Cc: Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On 12.05.25 17:16, Chaney, Ben wrote:
> Hello,
>
> When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
[...]
> I was able to resolve the performance issue by removing the call to qemu_ram_block_writeback in ram_load_cleanup. This causes the performance to return to normal. It looks like this code path was initially added to ensure the memory was synchronized if the persistent memory region is backed by an NVDIMM device. Does it serve any purpose if pmem is instead backed by standard DRAM?

Are you using a read-only NVDIMM?

In that case, I assume we would never need msync.

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 94bb3ccbe4..819b8ef829 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -153,7 +153,8 @@ void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length);
 
 /* Clear whole block of mem */
 static inline void qemu_ram_block_writeback(RAMBlock *block)
 {
-    qemu_ram_msync(block, 0, block->used_length);
+    if (!(block->flags & RAM_READONLY))
+        qemu_ram_msync(block, 0, block->used_length);
 }

-- 
Cheers,

David / dhildenb

^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-13 17:21 ` David Hildenbrand
@ 2025-05-13 18:40   ` Chaney, Ben
  2025-05-13 20:11   ` Michael S. Tsirkin
  1 sibling, 0 replies; 11+ messages in thread
From: Chaney, Ben @ 2025-05-13 18:40 UTC (permalink / raw)
  To: David Hildenbrand, yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      mst@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      peterx@redhat.com, philmd@linaro.org, xiaoguangrong.eric@gmail.com
  Cc: Tottenham, Max, Hunt, Joshua, Glasgall, Anna

On 5/13/25, 1:21 PM, "David Hildenbrand" <david@redhat.com> wrote:

> Are you using a read-only NVDIMM?
> In that case, I assume we would never need msync.

We aren't using an NVDIMM at all; we are using pmem that is backed by DRAM.

Ben

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-13 17:21 ` David Hildenbrand
  2025-05-13 18:40   ` Chaney, Ben
@ 2025-05-13 20:11   ` Michael S. Tsirkin
  2025-05-14 13:57     ` David Hildenbrand
  1 sibling, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2025-05-13 20:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chaney, Ben, yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      qemu-devel@nongnu.org, pbonzini@redhat.com, peterx@redhat.com,
      philmd@linaro.org, xiaoguangrong.eric@gmail.com, Tottenham, Max,
      Hunt, Joshua, Glasgall, Anna

On Tue, May 13, 2025 at 07:21:36PM +0200, David Hildenbrand wrote:
> On 12.05.25 17:16, Chaney, Ben wrote:
> > Hello,
> >
> > When live migrating to a destination host with pmem there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
[...]
> Are you using a read-only NVDIMM?
>
> In that case, I assume we would never need msync.
>
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 94bb3ccbe4..819b8ef829 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -153,7 +153,8 @@ void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length);
>  
>  /* Clear whole block of mem */
>  static inline void qemu_ram_block_writeback(RAMBlock *block)
>  {
> -    qemu_ram_msync(block, 0, block->used_length);
> +    if (!(block->flags & RAM_READONLY))
> +        qemu_ram_msync(block, 0, block->used_length);
>  }

I acked the original change, but now I don't understand why it is
critical to preserve memory at a random time that has nothing to do with
guest state. David, maybe you understand?

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-13 20:11   ` Michael S. Tsirkin
@ 2025-05-14 13:57     ` David Hildenbrand
  2025-06-12 15:34       ` Chaney, Ben
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-05-14 13:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Chaney, Ben, yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      qemu-devel@nongnu.org, pbonzini@redhat.com, peterx@redhat.com,
      philmd@linaro.org, xiaoguangrong.eric@gmail.com, Tottenham, Max,
      Hunt, Joshua, Glasgall, Anna, Junyan He

On 13.05.25 22:11, Michael S. Tsirkin wrote:
> On Tue, May 13, 2025 at 07:21:36PM +0200, David Hildenbrand wrote:
>> On 12.05.25 17:16, Chaney, Ben wrote:
[...]
>> Are you using a read-only NVDIMM?
>>
>> In that case, I assume we would never need msync.
[...]
> I acked the original change, but now I don't understand why it is
> critical to preserve memory at a random time that has nothing to do with
> guest state. David, maybe you understand?

Let me dig ...

As you said, we originally added pmem_persist() in:

commit 56eb90af39abf66c0e80588a9f50c31e7df7320b (mst/mst-next)
Author: Junyan He <junyan.he@intel.com>
Date:   Wed Jul 18 15:48:03 2018 +0800

    migration/ram: ensure write persistence on loading all data to PMEM.
    
    Because we need to make sure the pmem kind memory data is synced
    after migration, we choose to call pmem_persist() when the migration
    finish. This will make sure the data of pmem is safe and will not
    lose if power is off.
    
    Signed-off-by: Junyan He <junyan.he@intel.com>
    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Igor Mammedov <imammedo@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Then, we generalized it to not use pmem_persist() directly but to do a
qemu_ram_block_writeback() -- which includes a conditional pmem_persist()
-- in:

commit bd108a44bc29cb648dd930564996b0128e66ac01
Author: Beata Michalska <beata.michalska@linaro.org>
Date:   Thu Nov 21 00:08:42 2019 +0000

    migration: ram: Switch to ram block writeback
    
    Switch to ram block writeback for pmem migration.
    
    Signed-off-by: Beata Michalska <beata.michalska@linaro.org>
    Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
    Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
    Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Message-id: 20191121000843.24844-4-beata.michalska@linaro.org
    Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

That was part of the patch series "[PATCH 0/4] target/arm: Support for
Data Cache Clean up to PoP" [1]. At first it looks like a cleanup, but it
has the effect of also affecting non-pmem memory backends.

A discussion [2] includes some reasoning around libpmem not being
available, and msync being a suitable replacement in that case [3]:

"
According to the PMDK man page, pmem_persist is supposed to be equivalent
to the msync. It's just more performant. So in case of real pmem hardware
it should be all good.
"

So, the real question is: why do we have to sync *after* migration on the
migration *destination*?

I think the reason is simple if you assume that the pmem device will
differ between source and destination, and that we actually migrated that
data in the migration stream.

On the migration destination, we will fill pmem with data we obtained
from the source via the migration stream: writing the data to pmem using
ordinary memory writes. pmem requires a sync to make sure that the data
is *actually* persisted.

The VM will certainly not issue a sync, because it didn't modify any
pages. So we have to issue a sync such that pmem is guaranteed to be
persisted. In the case of ordinary files, this means writing data back to
disk ("persist on disk").
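As a toy illustration of that last point (not QEMU code; the path below is
made up and error handling is omitted), this is roughly what the destination
does in userspace terms: ordinary stores into a mapped DAX/pmem file, which
only become durable after an explicit flush:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical DAX-backed file standing in for guest pmem. */
    int fd = open("/mnt/pmem0/guest-mem", O_RDWR);
    size_t len = 2 * 1024 * 1024;
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* "Migration stream" data lands via ordinary stores and may sit in
     * volatile CPU caches. */
    memset(map, 0xab, len);

    /* Without this (or pmem_persist() via libpmem), the data is not
     * guaranteed to survive a power failure. */
    msync(map, len, MS_SYNC);

    munmap(map, len);
    close(fd);
    return 0;
}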
I'll note that NVDIMMs are not suitable for ordinary files in general,
because we cannot easily implement guest-triggered pmem syncs using the
basic instruction set. For R/O NVDIMMs it's fine. For the R/W use case,
virtio-pmem was invented, whereby the VM does the sync -> msync using an
explicit guest->host call. So once the guest has synced, the data is
actually persisted.

Now, NVDIMMs could be safely used in R/O mode backed by ordinary files.
Here, we would *still* want to do this msync.

So, we can really only safely ignore the msync if we know that the mmap()
is R/O (in which case migration would probably fail either way? unless
the RAMBlock is ignored).

We could skip the msync if we detect that we have an ordinary file, but
there might still be the case where we have a R/W NVDIMM that nobody ever
actually writes to ... so it's tricky. Certainly worth exploring. But
there would be the chance of data loss for R/O NVDIMMs after migration on
hypervisor crash ...

[1] https://patchew.org/QEMU/20191121000843.24844-1-beata.michalska@linaro.org/
[2] https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01750.html
[3] https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01772.html

-- 
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-05-14 13:57     ` David Hildenbrand
@ 2025-06-12 15:34       ` Chaney, Ben
  2025-06-12 16:06         ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Chaney, Ben @ 2025-06-12 15:34 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: yury-kotov@yandex-team.ru, dgilbert@redhat.com,
      beata.michalska@linaro.org, richard.henderson@linaro.org,
      alex.bennee@linaro.org, peter.maydell@linaro.org,
      junyan.he@intel.com, stefanha@redhat.com, imammedo@redhat.com,
      qemu-devel@nongnu.org, pbonzini@redhat.com, peterx@redhat.com,
      philmd@linaro.org, xiaoguangrong.eric@gmail.com, Tottenham, Max,
      Hunt, Joshua, Glasgall, Anna, Junyan He

On 5/14/25, 9:59 AM, "David Hildenbrand" <david@redhat.com> wrote:

> Because we need to make sure the pmem kind memory data is synced
> after migration, we choose to call pmem_persist() when the migration
> finish. This will make sure the data of pmem is safe and will not
> lose if power is off.

Thank you for clarifying. I think I initially misunderstood the purpose of
this code path.

In that case, how about something like this to restrict the sync to only
run when it would be effective? If the memory region is volatile then
there is no benefit to syncing.

Thanks,
Ben

---
 migration/ram.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index f25ebd9620..24fb29f0a5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3930,7 +3930,9 @@ static int ram_load_cleanup(void *opaque)
     RAMBlock *rb;
 
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
-        qemu_ram_block_writeback(rb);
+        if (rb->mr->nonvolatile) {
+            qemu_ram_block_writeback(rb);
+        }
     }
 
     xbzrle_load_cleanup();
-- 
2.40.1

^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: live-migration performance regression when using pmem
  2025-06-12 15:34       ` Chaney, Ben
@ 2025-06-12 16:06         ` Peter Xu
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2025-06-12 16:06 UTC (permalink / raw)
  To: Chaney, Ben
  Cc: David Hildenbrand, Michael S. Tsirkin, yury-kotov@yandex-team.ru,
      dgilbert@redhat.com, beata.michalska@linaro.org,
      richard.henderson@linaro.org, alex.bennee@linaro.org,
      peter.maydell@linaro.org, junyan.he@intel.com, stefanha@redhat.com,
      imammedo@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com,
      philmd@linaro.org, xiaoguangrong.eric@gmail.com, Tottenham, Max,
      Hunt, Joshua, Glasgall, Anna

On Thu, Jun 12, 2025 at 03:34:35PM +0000, Chaney, Ben wrote:
> On 5/14/25, 9:59 AM, "David Hildenbrand" <david@redhat.com> wrote:
[...]
> In that case, how about something like this to restrict the sync to only
> run when it would be effective? If the memory region is volatile then
> there is no benefit to syncing.
>
> diff --git a/migration/ram.c b/migration/ram.c
> index f25ebd9620..24fb29f0a5 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3930,7 +3930,9 @@ static int ram_load_cleanup(void *opaque)
>      RAMBlock *rb;
>  
>      RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
> -        qemu_ram_block_writeback(rb);
> +        if (rb->mr->nonvolatile) {
> +            qemu_ram_block_writeback(rb);
> +        }
>      }
>  
>      xbzrle_load_cleanup();

Looks good here; I think that's what I mentioned:

https://lore.kernel.org/all/aCTHwhrXROReEPEh@x1.local/

But I guess I got the use case wrong; looks like it worked for you.

In that case, please switch to memory_region_is_nonvolatile(), add a
proper Fixes tag, and copy stable.

Thanks,

-- 
Peter Xu

^ permalink raw reply [flat|nested] 11+ messages in thread
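For completeness, the adjustment Peter asks for would presumably look
something like the sketch below, using the memory_region_is_nonvolatile()
helper (one of the MemoryRegion accessors); this is an illustration of the
suggestion, not a patch that was posted in this thread:

diff --git a/migration/ram.c b/migration/ram.c
--- a/migration/ram.c
+++ b/migration/ram.c
@@ ... @@ static int ram_load_cleanup(void *opaque)
     RAMBlock *rb;
 
     RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
-        qemu_ram_block_writeback(rb);
+        /* Only sync blocks that model non-volatile memory (pmem/NVDIMM);
+         * volatile DRAM-backed blocks gain nothing from the writeback. */
+        if (memory_region_is_nonvolatile(rb->mr)) {
+            qemu_ram_block_writeback(rb);
+        }
     }
 
     xbzrle_load_cleanup();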
end of thread, other threads:[~2025-06-12 16:07 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-12 15:16 live-migration performance regression when using pmem Chaney, Ben
2025-05-12 18:50 ` Peter Xu
2025-05-13 15:48   ` Chaney, Ben
2025-05-14 16:41     ` Peter Xu
2025-05-12 19:52 ` Michael S. Tsirkin
2025-05-13 17:21 ` David Hildenbrand
2025-05-13 18:40   ` Chaney, Ben
2025-05-13 20:11   ` Michael S. Tsirkin
2025-05-14 13:57     ` David Hildenbrand
2025-06-12 15:34       ` Chaney, Ben
2025-06-12 16:06         ` Peter Xu