All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Steven Sistare <steven.sistare@oracle.com>
Cc: David Hildenbrand <david@redhat.com>,
	Fabiano Rosas <farosas@suse.de>,
	Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,
	Eduardo Habkost <eduardo@habkost.net>,
	Philippe Mathieu-Daude <philmd@linaro.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Daniel P. Berrange" <berrange@redhat.com>,
	Markus Armbruster <armbru@redhat.com>,
	qemu-devel@nongnu.org
Subject: Re: [PATCH V3 01/16] machine: anon-alloc option
Date: Wed, 6 Nov 2024 15:41:39 -0500	[thread overview]
Message-ID: <ZyvUg3CP30f3DZYY@x1n> (raw)
In-Reply-To: <66c05a06-dbb7-49ec-b58e-ccd917d098ea@oracle.com>

On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
> 
> 
> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > On 04.11.24 21:56, Steven Sistare wrote:
> > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > > on the value of the anon-alloc machine property.  This option applies to
> > > > > > > > memory allocated as a side effect of creating various devices. It does
> > > > > > > > not apply to memory-backend-objects, whether explicitly specified on
> > > > > > > > the command line, or implicitly created by the -m command line option.
> > > > > > > > 
> > > > > > > > The memfd option is intended to support new migration modes, in which the
> > > > > > > > memory region can be transferred in place to a new QEMU process, by sending
> > > > > > > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > > > > > > and if the mode also transfers device descriptors, then pages that are
> > > > > > > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > > > > > 
> > > > > > > A more portable, non-Linux specific variant of this will be using shm,
> > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > 
> > > > > > > Likely we should be using that instead of memfd, or try hiding the
> > > > > > > details. See below.
> > > > > > 
> > > > > > For this series I would prefer to use memfd and hide the details.  It's a
> > > > > > concise (and well tested) solution albeit linux only.  The code you supply
> > > > > > for posix shm would be a good follow on patch to support other unices.
> > > > > 
> > > > > Unless there is reason to use memfd we should start with the more
> > > > > generic POSIX variant that is available even on systems without memfd.
> > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > 
> > > > > I can help with the rework, and send it out separately, so you can focus
> > > > > on the "machine toggle" as part of this series.
> > > > > 
> > > > > Of course, if we find out we need the memfd internally instead under
> > > > > Linux for whatever reason later, we can use that instead.
> > > > > 
> > > > > But IIUC, the main selling point for memfd are additional features
> > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > 
> > > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> > > 
> > > Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> > > To do so using shm_open requires configuration on the mount.  One step harder to use.
> > 
> > Yes.
> > 
> > > 
> > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> > > if memory-backend-ram has hogged all the memory.
> > > 
> > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> > > 
> > > Yes, and if that is a good idea, then the same should be done for internal RAM
> > > -- memfd if available and fallback to shm_open.
> > 
> > Yes.
> > 
> > > 
> > > > I'm hoping we can find a way where it just all is rather intuitive, like
> > > > 
> > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> > > > 
> > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> > > > 
> > > > Thoughts?
> > > 
> > > Agreed, though I thought I had already landed at the intuitive specification in my patch.
> > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> > > controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> > > of options and words to describe them.
> > 
> > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> 
> Hi David and Peter,
> 
> I have implemented and tested the following, for both qemu_memfd_create
> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> for simplicity.

I'm ok with either shm or memfd, as this feature only applies to Linux
anyway.  I'll leave that part to you and David to decide.

> 
> Any comments before I submit a complete patch?
> 
> ----
> qemu-options.hx:
>     ``aux-ram-share=on|off``
>         Allocate auxiliary guest RAM as an anonymous file that is
>         shareable with an external process.  This option applies to
>         memory allocated as a side effect of creating various devices.
>         It does not apply to memory-backend-objects, whether explicitly
>         specified on the command line, or implicitly created by the -m
>         command line option.
> 
>         Some migration modes require aux-ram-share=on.
> 
> qapi/migration.json:
>     @cpr-transfer:
>          ...
>          Memory-backend objects must have the share=on attribute, but
>          memory-backend-epc is not supported.  The VM must be started
>          with the '-machine aux-ram-share=on' option.
> 
> Define RAM_PRIVATE
> 
> Define qemu_shm_alloc(), from David's tmp patch
> 
> ram_backend_memory_alloc()
>     ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>     memory_region_init_ram_flags_nomigrate(ram_flags)

Looks all good until here.

> 
> qemu_ram_alloc_internal()
>     ...
>     if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)

Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
that's equal to RAM_PREALLOC.  Meanwhile I slightly prefer we don't touch
anything if SHARED|PRIVATE is set.  All combined, it could be:

    if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) {
        // ramblock to be allocated, with no share/private request, aka,
        // aux memory chunk...
    }

>         new_block->flags |= RAM_SHARED;
> 
>     if (!host && (new_block->flags & RAM_SHARED)) {
>         qemu_ram_alloc_shared(new_block);

I'm not sure whether this needs its own helper.  Should we fallback to
ram_block_add() below, just like a RAM_SHARED?

IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and
always cache the fd (even if we don't do that before)?

>     } else
>         new_block->fd = -1;
>         new_block->host = host;
>     }
>     ram_block_add(new_block);
> 
> qemu_ram_alloc_shared()
>     if qemu_memfd_check()
>         new_block->fd = qemu_memfd_create()
>     else
>         new_block->fd = qemu_shm_alloc()
>     new_block->host = file_ram_alloc(new_block->fd)
> ----
> 
> - Steve
> 

-- 
Peter Xu



  reply	other threads:[~2024-11-06 20:42 UTC|newest]

Thread overview: 86+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
2024-11-01 13:47 ` [PATCH V3 01/16] machine: anon-alloc option Steve Sistare
2024-11-01 14:06   ` Peter Xu
2024-11-04 10:39   ` David Hildenbrand
2024-11-04 10:45     ` David Hildenbrand
2024-11-04 17:38     ` Steven Sistare
2024-11-04 19:51       ` David Hildenbrand
2024-11-04 20:14         ` Peter Xu
2024-11-04 20:17           ` David Hildenbrand
2024-11-04 20:41             ` Peter Xu
2024-11-04 20:15         ` David Hildenbrand
2024-11-04 20:56           ` Steven Sistare
2024-11-04 21:36             ` David Hildenbrand
2024-11-06 20:12               ` Steven Sistare
2024-11-06 20:41                 ` Peter Xu [this message]
2024-11-06 20:59                   ` Steven Sistare
2024-11-06 21:21                     ` Peter Xu
2024-11-07 14:03                       ` Steven Sistare
2024-11-07 13:05                     ` David Hildenbrand
2024-11-07 14:04                       ` Steven Sistare
2024-11-07 16:19                         ` David Hildenbrand
2024-11-07 18:13                           ` Steven Sistare
2024-11-07 16:32                         ` Peter Xu
2024-11-07 16:38                           ` David Hildenbrand
2024-11-07 17:48                             ` Peter Xu
2024-11-07 13:23                 ` David Hildenbrand
2024-11-07 16:02                   ` Steven Sistare
2024-11-07 16:26                     ` David Hildenbrand
2024-11-07 16:40                       ` Steven Sistare
2024-11-08 11:31                         ` David Hildenbrand
2024-11-08 13:43                           ` Peter Xu
2024-11-08 14:14                             ` Steven Sistare
2024-11-08 14:32                               ` Peter Xu
2024-11-08 14:18                             ` David Hildenbrand
2024-11-08 15:01                               ` Peter Xu
2024-11-08 13:56                           ` Steven Sistare
2024-11-08 14:20                             ` David Hildenbrand
2024-11-08 14:37                               ` Steven Sistare
2024-11-08 14:54                                 ` David Hildenbrand
2024-11-08 15:07                                   ` Peter Xu
2024-11-08 15:09                                     ` David Hildenbrand
2024-11-08 15:15                                   ` David Hildenbrand
2024-11-01 13:47 ` [PATCH V3 02/16] migration: cpr-state Steve Sistare
2024-11-13 20:36   ` Peter Xu
2024-11-01 13:47 ` [PATCH V3 03/16] physmem: preserve ram blocks for cpr Steve Sistare
2024-11-01 13:47 ` [PATCH V3 04/16] hostmem-memfd: preserve " Steve Sistare
2024-11-01 13:47 ` [PATCH V3 05/16] migration: SCM_RIGHTS for QEMUFile Steve Sistare
2024-11-13 20:54   ` Peter Xu
2024-11-14 18:34     ` Steven Sistare
2024-11-01 13:47 ` [PATCH V3 06/16] migration: VMSTATE_FD Steve Sistare
2024-11-13 20:55   ` Peter Xu
2024-11-01 13:47 ` [PATCH V3 07/16] migration: cpr-transfer save and load Steve Sistare
2024-11-01 13:47 ` [PATCH V3 08/16] migration: cpr-uri parameter Steve Sistare
2024-11-01 13:47 ` [PATCH V3 09/16] migration: cpr-uri option Steve Sistare
2024-11-01 13:47 ` [PATCH V3 10/16] migration: split qmp_migrate Steve Sistare
2024-11-13 21:11   ` Peter Xu
2024-11-14 18:33     ` Steven Sistare
2024-11-01 13:47 ` [PATCH V3 11/16] migration: cpr-transfer mode Steve Sistare
2024-11-13 21:58   ` Peter Xu
2024-11-14 18:36     ` Steven Sistare
2024-11-14 19:04       ` Peter Xu
2024-11-19 19:50         ` Steven Sistare
2024-11-19 20:16           ` Peter Xu
2024-11-19 20:32             ` Steven Sistare
2024-11-19 20:51               ` Peter Xu
2024-11-19 21:03                 ` Steven Sistare
2024-11-19 21:29                   ` Peter Xu
2024-11-19 21:41                     ` Steven Sistare
2024-11-19 21:48                       ` Peter Xu
2024-11-19 21:51                         ` Steven Sistare
2024-11-20  9:38               ` Daniel P. Berrangé
2024-11-20 16:12                 ` Steven Sistare
2024-11-20 16:26                   ` Daniel P. Berrangé
2024-11-01 13:47 ` [PATCH V3 12/16] tests/migration-test: memory_backend Steve Sistare
2024-11-13 22:19   ` Fabiano Rosas
2024-11-01 13:47 ` [PATCH V3 13/16] tests/qtest: defer connection Steve Sistare
2024-11-13 22:36   ` Fabiano Rosas
2024-11-14 18:45     ` Steven Sistare
2024-11-13 22:53   ` Peter Xu
2024-11-14 18:31     ` Steven Sistare
2024-11-01 13:47 ` [PATCH V3 14/16] tests/migration-test: " Steve Sistare
2024-11-14 12:46   ` Fabiano Rosas
2024-11-01 13:47 ` [PATCH V3 15/16] migration-test: cpr-transfer Steve Sistare
2024-11-01 13:47 ` [PATCH V3 16/16] migration: cpr-transfer documentation Steve Sistare
2024-11-13 22:02   ` Peter Xu
2024-11-14 18:31     ` Steven Sistare

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZyvUg3CP30f3DZYY@x1n \
    --to=peterx@redhat.com \
    --cc=armbru@redhat.com \
    --cc=berrange@redhat.com \
    --cc=david@redhat.com \
    --cc=eduardo@habkost.net \
    --cc=farosas@suse.de \
    --cc=marcel.apfelbaum@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=steven.sistare@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.