Re: [PATCH v2 1/1] osdep: asynchronous teardown for shutdown on Linux

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Claudio Imbrenda <imbrenda@linux.ibm.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, david@redhat.com,
	cohuck@redhat.com, thuth@redhat.com, borntraeger@de.ibm.com,
	frankja@linux.ibm.com, alex.bennee@linaro.org
Subject: Re: [PATCH v2 1/1] osdep: asynchronous teardown for shutdown on Linux
Date: Fri, 5 Aug 2022 09:32:50 +0200	[thread overview]
Message-ID: <20220805093250.1d3bcda2@p-imbrenda> (raw)
In-Reply-To: <Yuv6QtPCtABMc7J4@redhat.com>

On Thu, 4 Aug 2022 17:56:34 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Wed, Aug 03, 2022 at 07:31:41PM +0200, Claudio Imbrenda wrote:
> > This patch adds support for asynchronously tearing down a VM on Linux.
> > 
> > When qemu terminates, either naturally or because of a fatal signal,
> > the VM is torn down. If the VM is huge, it can take a considerable
> > amount of time for it to be cleaned up. In case of a protected VM, it
> > might take even longer than a non-protected VM (this is the case on
> > s390x, for example).
> > 
> > Some users might want to shut down a VM and restart it immediately,
> > without having to wait. This is especially true if management
> > infrastructure like libvirt is used.
> > 
> > This patch implements a simple trick on Linux to allow qemu to return
> > immediately, with the teardown of the VM being performed
> > asynchronously.
> > 
> > If the new commandline option -async-teardown is used, a new process is
> > spawned from qemu at startup, using the clone syscall, in such way that
> > it will share its address space with qemu.
> > 
> > The new process will then simpy wait until qemu terminates, and then it
> > will exit itself.
> > 
> > This allows qemu to terminate quickly, without having to wait for the
> > whole address space to be torn down. The teardown process will exit
> > after qemu, so it will be the last user of the address space, and
> > therefore it will take care of the actual teardown.
> > 
> > The teardown process will share the same cgroups as qemu, so both
> > memory usage and cpu time will be accounted properly.
> > 
> > This feature can already be used with libvirt by adding the following
> > to the XML domain definition:
> > 
> >   <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
> >   <arg value='-async-teardown'/>
> >   </commandline>
> > 
> > Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> > ---
> >  include/qemu/osdep.h |  2 ++
> >  os-posix.c           |  5 ++++
> >  qemu-options.hx      | 17 ++++++++++++++
> >  util/osdep.c         | 55 ++++++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 79 insertions(+)  
> 
> 
> > diff --git a/util/osdep.c b/util/osdep.c
> > index 60fcbbaebe..bb0baf97a0 100644
> > --- a/util/osdep.c
> > +++ b/util/osdep.c
> > @@ -23,6 +23,15 @@
> >   */
> >  #include "qemu/osdep.h"
> >  #include "qapi/error.h"
> > +
> > +#ifdef CONFIG_LINUX
> > +#include <sys/types.h>
> > +#include <sys/select.h>
> > +#include <sys/unistd.h>
> > +#include <sys/syscall.h>
> > +#include <signal.h>
> > +#endif
> > +
> >  #include "qemu/cutils.h"
> >  #include "qemu/sockets.h"
> >  #include "qemu/error-report.h"
> > @@ -512,6 +521,52 @@ const char *qemu_hw_version(void)
> >      return hw_version;
> >  }
> >  
> > +#ifdef __linux__
> > +static int async_teardown_fn(void *arg)
> > +{
> > +    sigset_t all_signals;
> > +    fd_set r, w, e;
> > +    int fd;
> > +
> > +    /* open a pidfd descriptor for the parent qemu process */
> > +    fd = syscall(__NR_pidfd_open, getppid(), 0);  
> 
> We ought to open this FD in the parent process to avoid a race
> where the parent crashes immediately after clone() and gets
> reparented to 'init' before this child process calls pidfd_open,
> otherwise it'll sit around waiting for init to exit.

sounds good

> 
> > +    /* if something went wrong, or if the file descriptor is too big */
> > +    if ((fd < 0) || (fd >= FD_SETSIZE)) {
> > +        _exit(1);
> > +    }
> > +    /* zero all fd sets */
> > +    FD_ZERO(&r);
> > +    FD_ZERO(&w);
> > +    FD_ZERO(&e);
> > +    /* set the fd for the pidfd in the "read" set */
> > +    FD_SET(fd, &r);
> > +    /* block all signals */
> > +    sigfillset(&all_signals);
> > +    sigprocmask(SIG_BLOCK, &all_signals, NULL);  
> 
> Technnically this is racy as there's still a window in which the
> child is running when signals are not blocked.

true, I will apply the workaround you suggest below

> 
> > +    /* wait for the pid to disappear -> fd will appear as ready for read */
> > +    (void) select(fd + 1, &r, &w, &e, NULL);  
> 
> While using pidfd can work, a stronger protection would be to use
> 
>    prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
> 
> this guarantees that the kernel will deliver SIGKILL to this
> process immediately when the parent QEMU exits.
> 
> We should probably do both in fact.

makes sense

> 
> > +
> > +    /*
> > +     * Close all file descriptors that might have been inherited from the
> > +     * main qemu process when doing clone. This is needed to make libvirt
> > +     * happy.
> > +     */
> > +    close_range(0, ~0U, 0);  
> 
> Shouldn't we close all the FDs immediately when this process is

that's exactly what I did in the beginning, but some things broke...
but I just tried again now and it seems to work, so probably the
breakage was due to something else

> created, rather than only at the end when we're exiting. I don't
> see there's any need to keep them open. Doing it immediately
> would be better when using prctl(PR_SET_PDEATHSIG)

yes

> 
> > +    _exit(0);
> > +}
> > +
> > +void init_async_teardown(void)
> > +{
> > +    const int size = 8192; /* should be more than enough */
> > +    char *stack = malloc(size);
> > +  
> 
> You need to block all signals here.
> 
> > +    /* start a new process sharing the address space with qemu */
> > +    clone(async_teardown_fn, stack + size, CLONE_VM, NULL, NULL, NULL, NULL);  
> 
> And unblock signals again here.
> 
> That way the "everything blocked"  mask is inherited by the child
> from the very first moment of its existance.

sounds like a very good idea

> 
> > +}
> > +#else /* __linux__ */
> > +void init_async_teardown(void) {}
> > +#endif
> > +
> >  #ifdef _WIN32
> >  static void socket_cleanup(void)
> >  {
> > -- 
> > 2.37.1
> >   
> 
> With regards,
> Daniel

     prev parent reply	other threads:[~2022-08-05  7:42 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-03 17:31 [PATCH v2 1/1] osdep: asynchronous teardown for shutdown on Linux Claudio Imbrenda
2022-08-03 17:34 ` Daniel P. Berrangé
2022-08-04  5:56   ` Claudio Imbrenda
2022-08-04  8:20     ` Daniel P. Berrangé
2022-08-04 16:58       ` Daniel P. Berrangé
2022-08-05  7:02         ` Claudio Imbrenda
2022-08-04  8:29   ` Daniel P. Berrangé
2022-08-04 14:49     ` Claudio Imbrenda
2022-08-04 16:41       ` Daniel P. Berrangé
2022-08-05  6:59         ` Claudio Imbrenda
2022-08-04 16:56 ` Daniel P. Berrangé
2022-08-05  7:32   ` Claudio Imbrenda [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220805093250.1d3bcda2@p-imbrenda \
    --to=imbrenda@linux.ibm.com \
    --cc=alex.bennee@linaro.org \
    --cc=berrange@redhat.com \
    --cc=borntraeger@de.ibm.com \
    --cc=cohuck@redhat.com \
    --cc=david@redhat.com \
    --cc=frankja@linux.ibm.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).