qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Steven Sistare <steven.sistare@oracle.com>
To: Peter Xu <peterx@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
	David Hildenbrand <david@redhat.com>,
	Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,
	Eduardo Habkost <eduardo@habkost.net>,
	Philippe Mathieu-Daude <philmd@linaro.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Daniel P. Berrange" <berrange@redhat.com>,
	Markus Armbruster <armbru@redhat.com>
Subject: Re: [PATCH V2 05/13] physmem: preserve ram blocks for cpr
Date: Thu, 31 Oct 2024 16:32:39 -0400	[thread overview]
Message-ID: <47d6d984-7002-4086-bb10-b191168f141f@oracle.com> (raw)
In-Reply-To: <ZwWk66RpczCpejhx@x1n>

On 10/8/2024 5:32 PM, Peter Xu wrote:
> On Tue, Oct 08, 2024 at 05:05:01PM -0400, Steven Sistare wrote:
>> On 10/8/2024 12:26 PM, Peter Xu wrote:
>>> On Tue, Oct 08, 2024 at 11:17:46AM -0400, Steven Sistare wrote:
>>>> On 10/7/2024 12:28 PM, Peter Xu wrote:
>>>>> On Mon, Oct 07, 2024 at 11:49:25AM -0400, Peter Xu wrote:
>>>>>> On Mon, Sep 30, 2024 at 12:40:36PM -0700, Steve Sistare wrote:
>>>>>>> Save the memfd for anonymous ramblocks in CPR state, along with a name
>>>>>>> that uniquely identifies it.  The block's idstr is not yet set, so it
>>>>>>> cannot be used for this purpose.  Find the saved memfd in new QEMU when
>>>>>>> creating a block.  QEMU hard-codes the length of some internally-created
>>>>>>> blocks, so to guard against that length changing, use lseek to get the
>>>>>>> actual length of an incoming memfd.
>>>>>>>
>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>> ---
>>>>>>>     system/physmem.c | 25 ++++++++++++++++++++++++-
>>>>>>>     1 file changed, 24 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>>>> index 174f7e0..ddbeec9 100644
>>>>>>> --- a/system/physmem.c
>>>>>>> +++ b/system/physmem.c
>>>>>>> @@ -72,6 +72,7 @@
>>>>>>>     #include "qapi/qapi-types-migration.h"
>>>>>>>     #include "migration/options.h"
>>>>>>> +#include "migration/cpr.h"
>>>>>>>     #include "migration/vmstate.h"
>>>>>>>     #include "qemu/range.h"
>>>>>>> @@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
>>>>>>>         }
>>>>>>>     }
>>>>>>> +static char *cpr_name(RAMBlock *block)
>>>>>>> +{
>>>>>>> +    MemoryRegion *mr = block->mr;
>>>>>>> +    const char *mr_name = memory_region_name(mr);
>>>>>>> +    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
>>>>>>> +
>>>>>>> +    if (id) {
>>>>>>> +        return g_strdup_printf("%s/%s", id, mr_name);
>>>>>>> +    } else {
>>>>>>> +        return g_strdup(mr_name);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>>     size_t qemu_ram_pagesize(RAMBlock *rb)
>>>>>>>     {
>>>>>>>         return rb->page_size;
>>>>>>> @@ -1858,14 +1872,18 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>>>>                                             TYPE_MEMORY_BACKEND)) {
>>>>>>>                 size_t max_length = new_block->max_length;
>>>>>>>                 MemoryRegion *mr = new_block->mr;
>>>>>>> -            const char *name = memory_region_name(mr);
>>>>>>> +            g_autofree char *name = cpr_name(new_block);
>>>>>>>                 new_block->mr->align = QEMU_VMALLOC_ALIGN;
>>>>>>>                 new_block->flags |= RAM_SHARED;
>>>>>>> +            new_block->fd = cpr_find_fd(name, 0);
>>>>>>>                 if (new_block->fd == -1) {
>>>>>>>                     new_block->fd = qemu_memfd_create(name, max_length + mr->align,
>>>>>>>                                                       0, 0, 0, errp);
>>>>>>> +                cpr_save_fd(name, 0, new_block->fd);
>>>>>>> +            } else {
>>>>>>> +                new_block->max_length = lseek(new_block->fd, 0, SEEK_END);
>>>>>>
>>>>>> So this can overwrite the max_length that the caller specified..
>>>>>>
>>>>>> I remember we used to have some tricks on specifying different max_length
>>>>>> for ROMs on dest QEMU (on which, qemu firmwares also upgraded on the dest
>>>>>> host so the size can be bigger than src qemu's old ramblocks), so that the
>>>>>> MR is always large enough to reload even the new firmwares, while migration
>>>>>> only migrates the smaller size (used_length) so it's fine as we keep the
>>>>>> extra sizes empty. I think that can relevant to the qemu_ram_resize() call
>>>>>> of parse_ramblock().
>>>>
>>>> Yes, resizable ram block for firmware blob is the only case I know of where
>>>> the length changed in the past.  If a length changes in the future, we will
>>>> need to detect and accommodate that change here, and I believe the fix will
>>>> be to simply use the actual length, as per the code above.  But if you prefer,
>>>> for now I can check for length change and return an error. New qemu will fail
>>>> to start, and old qemu will recover.
>>>>
>>>>>> The reload will not happen until some point, perhaps system resets.  I
>>>>>> wonder whether that is an issue in this case.
>>>>
>>>> Firmware is only generated once, via this path on x86:
>>>>     qmp_x_exit_preconfig
>>>>       qemu_machine_creation_done
>>>>         qdev_machine_creation_done
>>>>           pc_machine_done
>>>>             acpi_setup
>>>>               acpi_add_rom_blob
>>>>                 rom_add_blob
>>>>                   rom_set_mr
>>>>
>>>> After a system reset, the ramblock contents from memory are used as-is.
>>>>
>>>>> PS: If this is needed by CPR-transfer only because mmap() later can fail
>>>>> due to a bigger max_length,
>>>>
>>>> That is the reason.  IMO adjusting max_length is more robust than fiddling
>>>> with truncate and pretending that max_length is larger, when qemu will never
>>>> be able to use the phantom space up to max_length.
>>>
>>> I thought it was not pretending, but the ROM region might be resized after
>>> a system reset?  I worry that your change here can violate with such
>>> resizing later, so that qemu_ram_resize() can potentially fail after (1)
>>> CPR-transfer upgrades completes, then follow with (2) a system reset.
>>>
>>> We can observe such resizing kick off in every reboot, like:
>>>
>>> (gdb) bt
>>> #0  qemu_ram_resize
>>> #1  0x00005602b623b740 in memory_region_ram_resize
>>> #2  0x00005602b60f5580 in acpi_ram_update
>>> #3  0x00005602b60f5667 in acpi_build_update
>>> #4  0x00005602b5e1028b in fw_cfg_select
>>> #5  0x00005602b5e105af in fw_cfg_dma_transfer
>>> #6  0x00005602b5e109a8 in fw_cfg_dma_mem_write
>>> #7  0x00005602b62352ec in memory_region_write_accessor
>>> #8  0x00005602b62355e6 in access_with_adjusted_size
>>> #9  0x00005602b6238de8 in memory_region_dispatch_write
>>> #10 0x00005602b62488c5 in flatview_write_continue_step
>>> #11 0x00005602b6248997 in flatview_write_continue
>>> #12 0x00005602b6248abf in flatview_write
>>> #13 0x00005602b6248f39 in address_space_write
>>> #14 0x00005602b6248fb1 in address_space_rw
>>> #15 0x00005602b62a5d86 in kvm_handle_io
>>> #16 0x00005602b62a6cb2 in kvm_cpu_exec
>>> #17 0x00005602b62aa37a in kvm_vcpu_thread_fn
>>> #18 0x00005602b655da57 in qemu_thread_start
>>> #19 0x00007f120224a1b7 in start_thread
>>> #20 0x00007f12022cc39c in clone3
>>>
>>> Specifically, see this code clip:
>>>
>>> acpi_ram_update():
>>>       memory_region_ram_resize(mr, size, &error_abort);
>>>       memcpy(memory_region_get_ram_ptr(mr), data->data, size);
>>>
>>> Per my understanding, what it does is during the reset the ROM ramblock
>>> will resize to the new size (normally, only larger, in my memory there used
>>> to have a ROM grew from 256K->512K, or something like that), then the
>>> memcpy() injects the latest firmware that it pre-loaded into mem.
>>>
>>> So after such system reset, QEMU might start to see new ROM code loaded
>>> here (not the one that got migrated anymore, which will only match the
>>> version installed on src QEMU).  Here the problem is the new firmware can
>>> be larger, so I _think_ we need to make sure max_length is not modified by
>>> CPR to allow resizing happen here, while if we use truncate=true here it
>>> should just work in all cases.
>>>
>>> I think it could be verified with an old QEMU running with old ROM files
>>> (which is smaller), then CPR migrate to a new QEMU running new ROM files
>>> (which is larger), then reboot to see whether that new QEMU crash.  Maybe
>>> we can emulate that with "romfile=XXX" parameter.
>>>
>>> I am not fluent with ROM/firmware code, but please double check..
>>
>> Thank you for the detailed analysis, I was completely wrong on this one :(
>>
>> I also keep forgetting that ftruncate can grow as well as shrink a file.
>> I agree that preserving the dest qemu max_length, and using ftruncate, is the
>> correct solution, as long as dest max_length >= source max_length.
>>
>> However, IMO the extra memory created by ftruncate also needs to be pinned for DMA.
>> We disagreed on exactly what blocks needs to be pinned in previous discussions,
>> and to save time I would rather not re-open that debate right now.  Instead, I propose
>> to simply require that max_length does not change, and return an error if it does.
>> If it changes in some future qemu, we can reopen the discussion.
> 
> Hmm.. why the extra memory needs to be pinned?
> 
>  From QEMU memory topology POV, anything more than used_length is not
> visible to the guest, afaict.
> 
> In this specific ROM example, qemu_ram_resize() on src QEMU will first
> resize the ramblock (updating used_length), then set that exact same size
> with memory_region_set_size() to the MR with the size of the smaller
> firmware size when src QEMU boots:
> 
> qemu_ram_resize():
>      unaligned_size = newsize;
>      ...
>      newsize = TARGET_PAGE_ALIGN(newsize);
>      newsize = REAL_HOST_PAGE_ALIGN(newsize);
>      ...
>      block->used_length = newsize;
>      ...
>      memory_region_set_size(block->mr, unaligned_size);
> 
> Here a tiny detail is the two sizes are slightly different, but the MR size
> is even smaller than used_length.  The MR size decides what can be visible
> to the guest, when the MR that owns the ROM file is mapped into GPA range.
> That's true on the src, while after CPR migrates to dest that should still
> hold true, afaict, as all the rest memory (used->max) is not yet used
> before a system reset.
> 
> The extra memory (used->max) can be relevant only after a system reset,
> when the new firmware will be loaded, and qemu_ram_resize() can indeed
> extend that MR to cover more than before.  However that should be fine too
> because that means guest memory is being rebuilt, so VFIO memory listeners
> should do the right things (unpin old, repin the new ROM that is larger
> this time), iiuc.

So, we actually agree!  I said the extra memory needs to be pinned.  You said
the memory listeners will pin the extra memory.  Cool.

I will add ftruncate to this patch to grow the mr.

- Steve



  reply	other threads:[~2024-10-31 20:34 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-30 19:40 [PATCH V2 00/13] Live update: cpr-transfer Steve Sistare
2024-09-30 19:40 ` [PATCH V2 01/13] machine: alloc-anon option Steve Sistare
2024-10-03 16:14   ` Peter Xu
2024-10-04 10:14     ` David Hildenbrand
2024-10-04 12:33       ` Peter Xu
2024-10-04 12:54         ` David Hildenbrand
2024-10-04 13:24           ` Peter Xu
2024-10-07 16:23             ` David Hildenbrand
2024-10-07 19:05               ` Peter Xu
2024-10-07 15:36   ` Peter Xu
2024-10-07 19:30     ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 02/13] migration: cpr-state Steve Sistare
2024-10-07 14:14   ` Peter Xu
2024-10-07 19:30     ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 03/13] migration: save cpr mode Steve Sistare
2024-10-07 15:18   ` Peter Xu
2024-10-07 19:31     ` Steven Sistare
2024-10-07 20:10       ` Peter Xu
2024-10-08 15:57         ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 04/13] migration: stop vm earlier for cpr Steve Sistare
2024-10-07 15:27   ` Peter Xu
2024-10-07 20:52     ` Steven Sistare
2024-10-08 15:35       ` Peter Xu
2024-10-08 19:13         ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 05/13] physmem: preserve ram blocks " Steve Sistare
2024-10-07 15:49   ` Peter Xu
2024-10-07 16:28     ` Peter Xu
2024-10-08 15:17       ` Steven Sistare
2024-10-08 16:26         ` Peter Xu
2024-10-08 21:05           ` Steven Sistare
2024-10-08 21:32             ` Peter Xu
2024-10-31 20:32               ` Steven Sistare [this message]
2024-09-30 19:40 ` [PATCH V2 06/13] hostmem-memfd: preserve " Steve Sistare
2024-10-07 15:52   ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 07/13] migration: SCM_RIGHTS for QEMUFile Steve Sistare
2024-10-07 16:06   ` Peter Xu
2024-10-07 16:35     ` Daniel P. Berrangé
2024-10-07 18:12       ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 08/13] migration: VMSTATE_FD Steve Sistare
2024-10-07 16:36   ` Peter Xu
2024-10-07 19:31     ` Steven Sistare
2024-09-30 19:40 ` [PATCH V2 09/13] migration: cpr-transfer save and load Steve Sistare
2024-10-07 16:47   ` Peter Xu
2024-10-07 19:31     ` Steven Sistare
2024-10-08 15:36       ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 10/13] migration: cpr-uri parameter Steve Sistare
2024-10-07 16:49   ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 11/13] migration: cpr-uri option Steve Sistare
2024-10-07 16:50   ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 12/13] migration: split qmp_migrate Steve Sistare
2024-10-07 19:18   ` Peter Xu
2024-09-30 19:40 ` [PATCH V2 13/13] migration: cpr-transfer mode Steve Sistare
2024-10-07 19:44   ` Peter Xu
2024-10-07 20:39     ` Steven Sistare
2024-10-08 15:45       ` Peter Xu
2024-10-08 19:12         ` Steven Sistare
2024-10-08 19:38           ` Peter Xu
2024-10-08 18:28       ` Fabiano Rosas
2024-10-08 18:47         ` Peter Xu
2024-10-08 19:11           ` Fabiano Rosas
2024-10-08 19:33             ` Steven Sistare
2024-10-08 19:48             ` Peter Xu
2024-10-09 18:43               ` Steven Sistare
2024-10-09 19:06                 ` Peter Xu
2024-10-09 19:59                   ` Peter Xu
2024-10-09 20:18                     ` Steven Sistare
2024-10-09 20:57                       ` Peter Xu
2024-10-09 22:08                         ` Fabiano Rosas
2024-10-10 20:05                           ` Steven Sistare
2024-10-09 20:09                   ` Steven Sistare
2024-10-09 20:36                     ` Peter Xu
2024-10-10 20:06                       ` Steven Sistare
2024-10-10 21:23                         ` Peter Xu
2024-10-24 21:12                           ` Steven Sistare
2024-10-25 13:55                             ` Peter Xu
2024-10-25 15:04                               ` Steven Sistare
2024-10-08 19:29           ` Steven Sistare
2024-10-08 14:33 ` [PATCH V2 00/13] Live update: cpr-transfer Vladimir Sementsov-Ogievskiy
2024-10-08 21:13   ` Steven Sistare

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=47d6d984-7002-4086-bb10-b191168f141f@oracle.com \
    --to=steven.sistare@oracle.com \
    --cc=armbru@redhat.com \
    --cc=berrange@redhat.com \
    --cc=david@redhat.com \
    --cc=eduardo@habkost.net \
    --cc=farosas@suse.de \
    --cc=imammedo@redhat.com \
    --cc=marcel.apfelbaum@gmail.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).