From: David Hildenbrand <david@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "Chaney, Ben" <bchaney@akamai.com>,
	"yury-kotov@yandex-team.ru" <yury-kotov@yandex-team.ru>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"beata.michalska@linaro.org" <beata.michalska@linaro.org>,
	"richard.henderson@linaro.org" <richard.henderson@linaro.org>,
	"alex.bennee@linaro.org" <alex.bennee@linaro.org>,
	"peter.maydell@linaro.org" <peter.maydell@linaro.org>,
	"junyan.he@intel.com" <junyan.he@intel.com>,
	"stefanha@redhat.com" <stefanha@redhat.com>,
	"imammedo@redhat.com" <imammedo@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"peterx@redhat.com" <peterx@redhat.com>,
	"philmd@linaro.org" <philmd@linaro.org>,
	"xiaoguangrong.eric@gmail.com" <xiaoguangrong.eric@gmail.com>,
	"Tottenham, Max" <mtottenh@akamai.com>,
	"Hunt, Joshua" <johunt@akamai.com>,
	"Glasgall, Anna" <aglasgal@akamai.com>,
	Junyan He <junyan.he@intel.com>
Subject: Re: live-migration performance regression when using pmem
Date: Wed, 14 May 2025 15:57:27 +0200
Message-ID: <cac9c790-c195-4d06-b3ac-894320ccbb97@redhat.com>
In-Reply-To: <20250513161036-mutt-send-email-mst@kernel.org>

On 13.05.25 22:11, Michael S. Tsirkin wrote:
> On Tue, May 13, 2025 at 07:21:36PM +0200, David Hildenbrand wrote:
>> On 12.05.25 17:16, Chaney, Ben wrote:
>>> Hello,
>>>
>>>           When live migrating to a destination host with pmem, there is a very long downtime where the guest is paused. In some cases, this can be as high as 5 minutes, compared to less than one second in the good case.
>>>
>>>
>>>           Profiling suggests very high activity in this code path:
>>>
>>>
>>> ffffffffa2956de6 clean_cache_range+0x26 ([kernel.kallsyms])
>>> ffffffffa2359b0f dax_writeback_mapping_range+0x1ef ([kernel.kallsyms])
>>> ffffffffc0c6336d ext4_dax_writepages+0x7d ([kernel.kallsyms])
>>> ffffffffa2242dac do_writepages+0xbc ([kernel.kallsyms])
>>> ffffffffa2235ea6 filemap_fdatawrite_wbc+0x66 ([kernel.kallsyms])
>>> ffffffffa223a896 __filemap_fdatawrite_range+0x46 ([kernel.kallsyms])
>>> ffffffffa223af73 file_write_and_wait_range+0x43 ([kernel.kallsyms])
>>> ffffffffc0c57ecb ext4_sync_file+0xfb ([kernel.kallsyms])
>>> ffffffffa228a331 __do_sys_msync+0x1c1 ([kernel.kallsyms])
>>> ffffffffa2997fe6 do_syscall_64+0x56 ([kernel.kallsyms])
>>> ffffffffa2a00126 entry_SYSCALL_64_after_hwframe+0x6e ([kernel.kallsyms])
>>> 11ec5f msync+0x4f (/usr/lib/x86_64-linux-gnu/libc.so.6)
>>> 675ada qemu_ram_msync+0x8a (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
>>> 6873c7 xbzrle_load_cleanup+0x37 (inlined)
>>> 6873c7 ram_load_cleanup+0x37 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
>>> 4ff375 qemu_loadvm_state_cleanup+0x55 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
>>> 500f0b qemu_loadvm_state+0x15b (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
>>> 4ecf85 process_incoming_migration_co+0x95 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
>>> 8b6412 qemu_coroutine_self+0x2 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
>>> ffffffffffffffff [unknown] ([unknown])
>>>
>>>
>>>           I was able to resolve the issue by removing the call to qemu_ram_block_writeback() in ram_load_cleanup(), which returns performance to normal. It looks like this code path was initially added to ensure the memory is synchronized when the persistent memory region is backed by an NVDIMM device. Does it serve any purpose if pmem is instead backed by standard DRAM?
>>
>> Are you using a read-only NVDIMM?
>>
>> In that case, I assume we would never need msync.
>>
>>
>> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
>> index 94bb3ccbe4..819b8ef829 100644
>> --- a/include/exec/ram_addr.h
>> +++ b/include/exec/ram_addr.h
>> @@ -153,7 +153,9 @@ void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length);
>>   /* Clear whole block of mem */
>>   static inline void qemu_ram_block_writeback(RAMBlock *block)
>>   {
>> -    qemu_ram_msync(block, 0, block->used_length);
>> +    if (!(block->flags & RAM_READONLY)) {
>> +        qemu_ram_msync(block, 0, block->used_length);
>> +    }
>>   }
>>
>>
>> -- 
>> Cheers,
>>
>> David / dhildenb
> 
> I acked the original change, but now I don't understand why it is
> critical to preserve memory at a random time that has nothing
> to do with guest state.
> David, maybe you understand?

Let me dig ...

As you said, we originally added pmem_persist() in:


commit 56eb90af39abf66c0e80588a9f50c31e7df7320b (mst/mst-next)
Author: Junyan He <junyan.he@intel.com>
Date:   Wed Jul 18 15:48:03 2018 +0800

     migration/ram: ensure write persistence on loading all data to PMEM.

     Because we need to make sure the pmem kind memory data is synced
     after migration, we choose to call pmem_persist() when the migration
     finish. This will make sure the data of pmem is safe and will not
     lose if power is off.

     Signed-off-by: Junyan He <junyan.he@intel.com>
     Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
     Reviewed-by: Igor Mammedov <imammedo@redhat.com>
     Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


Then, we generalized it: instead of calling pmem_persist() directly, we
call qemu_ram_block_writeback() -- which includes a conditional
pmem_persist() -- in:

commit bd108a44bc29cb648dd930564996b0128e66ac01
Author: Beata Michalska <beata.michalska@linaro.org>
Date:   Thu Nov 21 00:08:42 2019 +0000

     migration: ram: Switch to ram block writeback

     Switch to ram block writeback for pmem migration.

     Signed-off-by: Beata Michalska <beata.michalska@linaro.org>
     Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
     Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
     Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
     Message-id: 20191121000843.24844-4-beata.michalska@linaro.org
     Signed-off-by: Peter Maydell <peter.maydell@linaro.org>


That was part of a patch series "[PATCH 0/4] target/arm: Support for 
Data Cache Clean up to PoP" [1].

At first glance it looks like a cleanup, but it has the side effect of
also covering non-pmem memory backends.

A discussion [2] includes some reasoning around libpmem not being
available, and msync being a suitable replacement in that case [3]:
"According to the PMDK man page, pmem_persist is supposed to be
equivalent to msync. It's just more performant. So in case of real
pmem hardware it should be all good."


So, the real question is: why do we have to sync *after* migration on
the migration *destination*?

I think the reason is simple if you assume that the pmem device will
differ between source and destination, and that we actually migrated
the data in the migration stream.

On the migration destination, we fill pmem with the data we obtained
from the source via the migration stream, writing it to pmem using
ordinary memory writes.

pmem requires a sync to make sure that the data is *actually* persisted. 
The VM will certainly not issue a sync, because it didn't modify any 
pages. So we have to issue a sync such that pmem is guaranteed to be 
persisted.
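
Roughly, the sync path looks like this -- a paraphrased sketch from
memory, not the verbatim QEMU code; the real qemu_ram_msync() differs
in details:

/*
 * Paraphrased sketch (not the verbatim QEMU code): prefer CPU cache
 * flushes via libpmem for real pmem, fall back to msync() for
 * anything else that is file-backed.
 */
void qemu_ram_msync_sketch(RAMBlock *block, ram_addr_t start,
                           ram_addr_t length)
{
#ifdef CONFIG_LIBPMEM
    if (ramblock_is_pmem(block)) {
        /* Real pmem: flush CPU caches to the persistence domain. */
        pmem_persist(ramblock_ptr(block, start), length);
        return;
    }
#endif
    if (block->fd >= 0) {
        /*
         * File-backed memory (including DRAM-backed "pmem" files):
         * msync() pushes dirty pages through the filesystem -- the
         * expensive dax_writeback_mapping_range() path in the
         * profile above.
         */
        msync(ramblock_ptr(block, start), length, MS_SYNC);
    }
}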


In the case of ordinary files, this means writing the data back to disk
("persist on disk"). I'll note that NVDIMMs backed by ordinary files
are not suitable in general, because we cannot easily implement
guest-triggered pmem syncs using the CPU instruction set alone. For R/O
NVDIMMs it's fine.

For the R/W use case, virtio-pmem was invented: the guest performs the
sync via an explicit guest->host call, which results in an msync on the
host. So once the guest has synced, the data is actually persisted.
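
Schematically -- a conceptual sketch of the idea only, not the actual
hw/virtio/virtio-pmem.c code:

#include <errno.h>
#include <unistd.h>

/*
 * Conceptual sketch (not the actual QEMU code): the guest sends an
 * explicit FLUSH request over virtio instead of relying on CPU cache
 * flush instructions, and the host persists the backing file on its
 * behalf.
 */
static int handle_guest_flush(int backing_fd)
{
    /*
     * fsync() pushes the file-backed "pmem" contents to stable
     * storage -- something the guest cannot achieve with CPU
     * instructions alone when the backing store is a regular file.
     */
    return fsync(backing_fd) ? errno : 0; /* reported back to the guest */
}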


Now, NVDIMMs could safely be used in R/O mode while backed by ordinary
files. Here, we would *still* want to do this msync, because migration
populated the file through the writable host mapping.

So, we can really only safely skip the msync if we know that the mmap()
itself is R/O (in which case migration would probably fail either way,
unless the RAMBlock is ignored).


While we could skip the msync when we detect that we are backed by an
ordinary file, there might still be the case of a R/W NVDIMM that
nobody ever actually writes to ... so it's tricky. Certainly worth
exploring. But there would be a chance of data loss for R/O NVDIMMs
after migration on hypervisor crash ...
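
To make that trade-off concrete, a hypothetical sketch -- the helper
name is made up for illustration, and implementing it reliably is
exactly the tricky part:

/*
 * Hypothetical sketch only, not proposed code:
 * ramblock_backed_by_ordinary_file() is a made-up helper. Skipping
 * here avoids the expensive writeback, but risks losing the migrated
 * data of file-backed NVDIMMs if the hypervisor crashes before the
 * page cache is written back.
 */
static inline void qemu_ram_block_writeback(RAMBlock *block)
{
    if (ramblock_backed_by_ordinary_file(block)) {
        return;
    }
    qemu_ram_msync(block, 0, block->used_length);
}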



[1] https://patchew.org/QEMU/20191121000843.24844-1-beata.michalska@linaro.org/

[2] https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01750.html

[3] https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01772.html

-- 
Cheers,

David / dhildenb



Thread overview: 11+ messages
2025-05-12 15:16 live-migration performance regression when using pmem Chaney, Ben
2025-05-12 18:50 ` Peter Xu
2025-05-13 15:48   ` Chaney, Ben
2025-05-14 16:41     ` Peter Xu
2025-05-12 19:52 ` Michael S. Tsirkin
2025-05-13 17:21 ` David Hildenbrand
2025-05-13 18:40   ` Chaney, Ben
2025-05-13 20:11   ` Michael S. Tsirkin
2025-05-14 13:57     ` David Hildenbrand [this message]
2025-06-12 15:34       ` Chaney, Ben
2025-06-12 16:06         ` Peter Xu
