qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Alex Williamson <alex.williamson@redhat.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Jason Zeng" <jason.zeng@linux.intel.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Juan Quintela" <quintela@redhat.com>,
	qemu-devel@nongnu.org, "Markus Armbruster" <armbru@redhat.com>,
	"Steven Sistare" <steven.sistare@oracle.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Alex Bennée" <alex.bennee@linaro.org>
Subject: Re: [PATCH V3 00/22] Live Update
Date: Tue, 18 May 2021 14:01:55 -0600	[thread overview]
Message-ID: <20210518140155.0d302324.alex.williamson@redhat.com> (raw)
In-Reply-To: <YKQULUn5F+x1wrYI@work-vm>

On Tue, 18 May 2021 20:23:25 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Steven Sistare (steven.sistare@oracle.com) wrote:
> > On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:  
> > > * Steven Sistare (steven.sistare@oracle.com) wrote:  
> > >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
> > >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
> > >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
> > >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
> > >>>>>> Provide the cprsave and cprload commands for live update.  These save and
> > >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> > >>>>>> to a new version in between.
> > >>>>>>
> > >>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> > >>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> > >>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> > >>>>>> paused state and waits for the cprload command.  
> > >>>>>
> > >>>>> I think cprsave/cprload could be generalized by using QMP to stash the
> > >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> > >>>>> already opens fds passed using this mechanism.
> > >>>>>
> > >>>>> I haven't checked but it may be possible to drop some patches by reusing
> > >>>>> QEMU's monitor file descriptor passing since the code already knows how
> > >>>>> to open from 'getfd' fds.
> > >>>>>
> > >>>>> The reason why using QMP is interesting is because it eliminates the
> > >>>>> need for execve(2). QEMU may be unable to execute a program due to
> > >>>>> chroot, seccomp, etc.
> > >>>>>
> > >>>>> QMP would enable cprsave/cprload to work both with and without
> > >>>>> execve(2).
> > >>>>>
> > >>>>> One tricky thing with this approach might be startup ordering: how to
> > >>>>> get fds via the QMP monitor in the new process before processing the
> > >>>>> entire command-line.  
> > >>>>
> > >>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> > >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> > >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> > >>>> I suspect my recent vfio extensions would smooth the rough edges.  
> > >>>
> > >>> I wonder about the reason for VFIO's pid limitation, maybe because it
> > >>> pins pages from the original process?  
> > >>
> > >> The dma unmap code verifies that the requesting task is the same as the task that mapped
> > >> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> > >> to fix locked memory accounting, which is associated with the mm of the original task.  
> > >   
> > >>> Is this VFIO pid limitation the main reason why you chose to make QEMU
> > >>> execve(2) the new binary?  
> > >>
> > >> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> > >> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> > >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> > >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> > >> I was working against vfio rather than with it.  
> > > 
> > > OK the weirdness of vfio helps explain a bit about why you're doing it
> > > this way; can you help separate some difference between restart and
> > > reboot for me though:
> > > 
> > > In 'reboot' mode; where the guest must do suspend in it's drivers, how
> > > much of these vfio requirements are needed?  I guess the memfd use
> > > for the anonymous areas isn't any use for reboot mode.  
> > 
> > Correct.  For reboot no special vfio support or fiddling is needed.
> >   
> > > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> > > vfio still care about the currently-anonymous areas?  
> > 
> > Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
> > are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.  
> 
> Is it possible with vfio to map it into multiple processes
> simultaneously or does it have to only be one at a time?

The IOMMU maps an IOVA to a physical address, what Steve is saying is
that mapping persists across the restart.  A given IOVA can only map to
a specific physical address, so mapping into multiple processes doesn't
make any sense.  The two processes need to map the same IOVA to the
same HPA, only the HVA is allowed to change.

> Are you saying that you have no way to shut off DMA, and thus you can
> never know it's safe to terminate the source process?

Stopping DMA, ex. disabling PCI bus master, would be not only visible
to the behavior of the device, but likely detrimental.  You'd need
driver or device participation to some extent to make this seamless.

> > >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> > >> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> > >> without exec, I still need the exec option.  
> > > 
> > > Can you explain what that code injection mechanism is for those of us
> > > who didn't see that?  
> > 
> > Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC) which was not
> > accepted upstream.  
> 
> In this series, without MADV_DOEXEC, how do you guarantee the same HVA
> in source and destination - or is that not necessary?

It's not necessary, the HVA is used to establish the IOVA to HPA
mapping for the IOMMU.  We have patches upstream that suspend (block)
that translation for the window when the HVA is invalid and resume when
it becomes valid.  It's expected that the new HVA is equivalent to the
old HVA and that the user can only hurt themselves should they violate
this, ie. they can still only map+pin memory they own, so at worst they
create a bad translation for their own device.  Thanks,

Alex



  reply	other threads:[~2021-05-18 20:03 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
2021-05-07 12:24 ` [PATCH V3 01/22] as_flat_walk Steve Sistare
2021-05-07 12:25 ` [PATCH V3 02/22] qemu_ram_volatile Steve Sistare
2021-05-07 12:25 ` [PATCH V3 03/22] oslib: qemu_clr_cloexec Steve Sistare
2021-05-07 12:25 ` [PATCH V3 04/22] util: env var helpers Steve Sistare
2021-05-07 12:25 ` [PATCH V3 05/22] machine: memfd-alloc option Steve Sistare
2021-05-07 12:25 ` [PATCH V3 06/22] vl: add helper to request re-exec Steve Sistare
2021-05-07 14:31   ` Eric Blake
2021-05-13 20:19     ` Steven Sistare
2021-05-14  8:18       ` Daniel P. Berrangé
2021-05-12 16:27   ` Stefan Hajnoczi
2021-05-13 20:20     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 07/22] cpr Steve Sistare
2021-05-12 16:19   ` Stefan Hajnoczi
2021-05-13 20:21     ` Steven Sistare
2021-05-14 11:28       ` Stefan Hajnoczi
2021-05-14 15:14         ` Steven Sistare
2021-05-18 13:42           ` Stefan Hajnoczi
2021-05-07 12:25 ` [PATCH V3 08/22] cpr: QMP interfaces Steve Sistare
2021-06-04 13:59   ` Eric Blake
2021-06-07 17:19     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 09/22] cpr: HMP interfaces Steve Sistare
2021-05-07 12:25 ` [PATCH V3 10/22] pci: export functions for cpr Steve Sistare
2021-05-07 12:25 ` [PATCH V3 11/22] vfio-pci: refactor " Steve Sistare
2021-05-19 22:38   ` Alex Williamson
2021-05-21 13:33     ` Steven Sistare
2021-05-21 21:07       ` Alex Williamson
2021-05-21 21:18         ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 12/22] vfio-pci: cpr part 1 Steve Sistare
2021-05-21 22:24   ` Alex Williamson
2021-05-24 18:29     ` Steven Sistare
2021-06-11 18:15       ` Steven Sistare
2021-06-11 19:43         ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 13/22] vfio-pci: cpr part 2 Steve Sistare
2021-05-21 22:24   ` Alex Williamson
2021-05-24 18:31     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 14/22] vhost: reset vhost devices upon cprsave Steve Sistare
2021-05-07 12:25 ` [PATCH V3 15/22] hostmem-memfd: cpr support Steve Sistare
2021-05-07 12:25 ` [PATCH V3 16/22] chardev: cpr framework Steve Sistare
2021-05-07 14:33   ` Eric Blake
2021-05-13 20:19     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 17/22] chardev: cpr for simple devices Steve Sistare
2021-05-07 12:25 ` [PATCH V3 18/22] chardev: cpr for pty Steve Sistare
2021-05-07 12:25 ` [PATCH V3 19/22] chardev: cpr for sockets Steve Sistare
2021-05-07 12:25 ` [PATCH V3 20/22] cpr: only-cpr-capable option Steve Sistare
2021-05-07 12:25 ` [PATCH V3 21/22] cpr: maintainers Steve Sistare
2021-05-07 12:25 ` [PATCH V3 22/22] simplify savevm Steve Sistare
2021-05-07 13:00 ` [PATCH V3 00/22] Live Update no-reply
2021-05-13 20:42   ` Steven Sistare
2021-05-12 16:42 ` Stefan Hajnoczi
2021-05-13 20:21   ` Steven Sistare
2021-05-14 11:53     ` Stefan Hajnoczi
2021-05-14 15:15       ` Steven Sistare
2021-05-17 11:40         ` Stefan Hajnoczi
2021-05-17 19:10           ` Alex Williamson
2021-05-18 13:39             ` Stefan Hajnoczi
2021-05-18 15:48               ` Steven Sistare
2021-05-18  9:57         ` Dr. David Alan Gilbert
2021-05-18 16:00           ` Steven Sistare
2021-05-18 19:23             ` Dr. David Alan Gilbert
2021-05-18 20:01               ` Alex Williamson [this message]
2021-05-18 20:14               ` Steven Sistare
2021-05-20 13:00                 ` [PATCH V3 00/22] Live Update [reboot] Dr. David Alan Gilbert
2021-05-21 14:55                   ` Steven Sistare
2021-06-15 19:14                     ` Dr. David Alan Gilbert
2021-06-24 15:05                       ` Steven Sistare
2021-07-06 17:31                         ` Steven Sistare
2021-05-20 13:13                 ` [PATCH V3 00/22] Live Update [restart] Dr. David Alan Gilbert
2021-05-21 14:56                   ` Steven Sistare
2021-05-24 10:39                     ` Dr. David Alan Gilbert
2021-06-02 13:51                       ` Steven Sistare
2021-06-03 19:36                         ` Dr. David Alan Gilbert
2021-06-03 20:44                           ` Daniel P. Berrangé
2021-06-07 16:40                             ` [PATCH V3 00/22] Live Update [restart] : exec Steven Sistare
2021-06-14 14:31                               ` Steven Sistare
2021-06-14 14:36                                 ` Daniel P. Berrangé
2021-06-15 19:05                               ` Dr. David Alan Gilbert
2021-06-07 18:08                           ` [PATCH V3 00/22] Live Update [restart] : code replication Steven Sistare
2021-06-14 14:33                             ` Steven Sistare
2021-05-19 16:43 ` [PATCH V3 00/22] Live Update Steven Sistare
2021-06-02 15:19   ` Steven Sistare

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210518140155.0d302324.alex.williamson@redhat.com \
    --to=alex.williamson@redhat.com \
    --cc=alex.bennee@linaro.org \
    --cc=armbru@redhat.com \
    --cc=berrange@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=jason.zeng@linux.intel.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=philmd@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=stefanha@redhat.com \
    --cc=steven.sistare@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).