From: Alex Williamson <alex.williamson@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: Steven Sistare <steven.sistare@oracle.com>,
Igor Mammedov <imammedo@redhat.com>,
"Daniel P. Berrange" <berrange@redhat.com>,
qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
David Hildenbrand <david@redhat.com>,
Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,
Eduardo Habkost <eduardo@habkost.net>,
Philippe Mathieu-Daude <philmd@linaro.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Markus Armbruster <armbru@redhat.com>
Subject: Re: [PATCH V2 01/11] machine: alloc-anon option
Date: Tue, 13 Aug 2024 11:00:37 -0600
Message-ID: <20240813110037.6f04ffe9.alex.williamson@redhat.com>
In-Reply-To: <Zrt9M00rDk3EUdNM@x1n>

On Tue, 13 Aug 2024 11:35:15 -0400
Peter Xu <peterx@redhat.com> wrote:
> On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
> > On 8/8/2024 2:32 PM, Steven Sistare wrote:
> > > On 7/29/2024 8:29 AM, Igor Mammedov wrote:
> > > > On Sat, 20 Jul 2024 16:28:25 -0400
> > > > Steven Sistare <steven.sistare@oracle.com> wrote:
> > > >
> > > > > On 7/16/2024 5:19 AM, Igor Mammedov wrote:
> > > > > > On Sun, 30 Jun 2024 12:40:24 -0700
> > > > > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > on the value of the anon-alloc machine property. This affects
> > > > > > > memory-backend-ram objects, guest RAM created with the global -m option
> > > > > > > but without an associated memory-backend object and without the -mem-path
> > > > > > > option
> > > > > > nowadays, all machines were converted to use memory backend for VM RAM.
> > > > > > so -m option implicitly creates memory-backend object,
> > > > > > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > > > > > or MEMORY_BACKEND_RAM otherwise.
> > > > >
> > > > > > Yes. I dropped an important adjective, "implicit".
> > > > >
> > > > > "guest RAM created with the global -m option but without an explicit associated
> > > > > memory-backend object and without the -mem-path option"
> > > > >
> > > > > > > To access the same memory in the old and new QEMU processes, the memory
> > > > > > > must be mapped shared. Therefore, the implementation always sets
> > > > > > > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > > > > > > user must explicitly specify the share option. In lieu of defining a new
> > > > > > so statement at the top that memory-backend-ram is affected is not
> > > > > > really valid?
> > > > >
> > > > > memory-backend-ram is affected by alloc-anon. But in addition, the user must
> > > > > explicitly add the "share" option. I don't implicitly set share in this case,
> > > > > because I would be overriding the user's specification of the memory object's property,
> > > > > which would be private if omitted.
> > > >
> > > > instead of touching implicit RAM (-m), it would be better to error out
> > > > and ask user to provide properly configured memory-backend explicitly.
> > > >
> > > > >
> > > > > > > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > > > > > > as the condition for calling memfd_create.
> > > > > >
> > > > > > In general I do dislike adding yet another option that will affect
> > > > > > guest RAM allocation (memory-backends should be sufficient).
> > > > > >
> > > > > > However I do see that you need memfd for device memory (vram, roms, ...).
> > > > > > Can we just use memfd/shared unconditionally for those and
> > > > > > avoid introducing a new confusing option?
> > > > >
> > > > > The Linux kernel has different tunables for backing memfd's with huge pages, so we
> > > > > could hurt performance if we unconditionally change to memfd. The user should have
> > > > > a choice for any segment that is large enough for huge pages to improve performance,
> > > > > which potentially is any memory-backend-object. The non memory-backend objects are
> > > > > small, and it would be OK to use memfd unconditionally for them.
> > >
> > > Thanks everyone for your feedback. The common theme is that you dislike that the
> > > new option modifies the allocation of memory-backend-objects. OK, accepted. I propose
> > > to remove that interaction, and document in the QAPI which backends work for CPR.
> > > Specifically, memory-backend-memfd or memory-backend-file object is required,
> > > with share=on (which is the default for memory-backend-memfd). CPR will be blocked
> > > otherwise. The legacy -m option without an explicit memory-backend-object will not
> > > support CPR.
> > >
> > > Non memory-backend-objects (ramblocks not described on the qemu command line) will always
> > > be allocated using memfd_create (on Linux only). The alloc-anon option is deleted.
> > > The logic in ram_block_add becomes:
> > >
> > > >     if (!new_block->host) {
> > > >         if (xen_enabled()) {
> > > >             ...
> > > >         } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
> > > >                                         TYPE_MEMORY_BACKEND)) {
> > > >             qemu_memfd_create()
> > > >         } else {
> > > >             qemu_anon_ram_alloc()
> > > >         }
> > > >     }
> > >
> > > Is that acceptable to everyone? Igor, Peter, Daniel?
>
> Sorry for a late reply.
>
> I think this may not work, as David pointed out? AFAIU it will switch
> many old anon use cases to use memfd, aka shmem, and it might be
> problematic when share=off: we have a double memory consumption issue
> with shmem under a private mapping.
>
> I assume that includes things like "-m", "memory-backend-ram", and maybe
> more. IIUC memory consumption of the VM will double with them.
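
For readers following the double-consumption concern quoted above, here is a
minimal sketch, assuming Linux with glibc 2.27 or later. It is not QEMU code;
the function name and the "demo-ram" label are hypothetical. It shows that a
write through a MAP_PRIVATE mapping of a memfd lands in a copy-on-write
anonymous page while the shmem page stays allocated with its old contents, so
one guest page can end up backed twice.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if, after writing through a private mapping, the shmem copy
 * (read via the fd) and the private copy (read via the mapping) differ.
 * Returns -1 on any setup failure.
 */
static int private_write_leaves_shmem_copy(void)
{
    const size_t len = 4096;
    char from_fd[5];
    int ok = -1;

    int fd = memfd_create("demo-ram", 0);
    if (fd < 0) {
        return -1;
    }
    /* Size the file, then populate one shmem page through the fd. */
    if (ftruncate(fd, (off_t)len) < 0 || pwrite(fd, "shmem", 5, 0) != 5) {
        close(fd);
        return -1;
    }

    /* share=off analogue: private mapping on top of shmem. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* This write triggers copy-on-write into a new anonymous page. */
    memcpy(p, "anon!", 5);

    /* The shmem page is still there, unchanged. */
    if (pread(fd, from_fd, 5, 0) == 5) {
        ok = memcmp(from_fd, "shmem", 5) == 0 && memcmp(p, "anon!", 5) == 0;
    }

    munmap(p, len);
    close(fd);
    return ok;
}
```

Calling private_write_leaves_shmem_copy() from a small main() and checking
for a return value of 1 is enough to try it out. The sketch only demonstrates
the two copies; it does not measure resident memory.
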
>
> >
> > In a simple test here are the NON-memory-backend-object ramblocks which
> > are allocated with memfd_create in my new proposal:
> >
> > memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
> > memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
> > memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
> > memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
> > memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
> > memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
> > memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
> > memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
> >
> > Of those, only a subset are mapped for DMA, per the existing QEMU logic,
> > no changes from me:
> >
> > dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
> > dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
> > dma_map: vga.rom 65536 @ 0x7fffe0400000 ro
>
> I wonder whether there's any case that the "rom"s can be DMA target at
> all.. I understand it's logically possible to be READ from as ROMs, but I
> am curious what happens if we don't map them at all when they're ROMs, or
> whether there's any device that can (in real life) DMA from device ROMs,
> and for what use.
>
> > dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
> > dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
> >
> > system.flash0 is excluded by the vfio listener because it is a rom_device.
> > The rom@etc blocks are excluded because their MemoryRegions are not added to
> > any container region, so the flatmem traversal of the AS used by the listener
> > does not see them.
> >
> > The BARs should not be mapped IMO, and I propose excluding them in the
> > iommufd series:
> > https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/
>
> Looks like this is clear now that they should be there.
>
> >
> > Note that the old-QEMU contents of all ramblocks must be preserved, just like
> > in live migration. Live migration copies the contents in the stream. Live update
> > preserves the contents in place by preserving the memfd. Thus memfd serves
> > two purposes: preserving old contents, and preserving DMA mapped pinned pages.
>
> IMHO the 1st purpose is a fake one. IOW:
>
> - Preserving content will be important on large RAM/ROM regions. When
> it's small, it shouldn't matter a huge deal, IMHO, because this is
> about "how fast we can migrate / live upgrade". IOW, this is not a
> functional requirement.

Regardless of the size of a ROM region, how would it ever be faster to
migrate ROMs rather than reload them from stable media on the target?
Furthermore, what mechanism other than migrating the ROM do we have to
guarantee the contents of the ROM are identical?

I have a hard time accepting that ROMs are only migrated for
performance and there isn't some aspect of migrating them to ensure the
contents remain identical, and by that token CPR would also need to
preserve the contents to provide the same guarantee.  Thanks,

Alex

> - DMA mapped pinned pages: instead this is a hard requirement that we
> must make sure these pages are fd-based, because only a fd-based
> mapping can persist the pages (via page cache).
>
> IMHO we shouldn't mangle them, and we should start with sticking with the
> 2nd goal here. To be explicit, if we can find a good replacement for
> -alloc-anon, IMHO we could still migrate the ramblocks that only fall into
> the 1st purpose category, e.g. device ROMs; hopefully, even if they're
> pinned, they should never be DMAed to/from.
>
> Thanks,
>
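
As an illustration of the statement quoted above that live update "preserves
the contents in place by preserving the memfd", here is a minimal sketch,
assuming Linux with glibc 2.27 or later. It is not the cpr-exec
implementation: the function name and the "demo-ramblock" label are
hypothetical, and dropping and re-creating the mapping within one process
stands in for the old and new QEMU sharing the fd across exec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if data written through one shared mapping of a memfd is
 * visible through a fresh mapping created after the first was unmapped.
 * Returns -1 on any setup failure.
 */
static int memfd_contents_survive_remap(void)
{
    const size_t len = 4096;
    const char payload[] = "rom contents";
    int ok;

    /* Flags of 0: the fd is not close-on-exec, so it would survive exec. */
    int fd = memfd_create("demo-ramblock", 0);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    /* "Old process": map shared and fill the block. */
    char *old_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(old_map, payload, sizeof(payload));
    munmap(old_map, len);

    /* "New process": only the fd was kept; map it again. */
    char *new_map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (new_map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    ok = memcmp(new_map, payload, sizeof(payload)) == 0;

    munmap(new_map, len);
    close(fd);
    return ok;
}
```

The pages live in the shmem file rather than in the mapping, which is why the
contents outlast the first munmap(). Whether that property is needed for
small ROM blocks, or only for DMA-pinned pages, is the question under
discussion in this thread; the sketch takes no position on it.
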