From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [Qemu-devel] live migration vs device assignment (motivation)
Date: Fri, 25 Dec 2015 14:11:57 +0200
Message-ID: <20151225140336-mutt-send-email-mst@redhat.com>
References: <20151207165039.GA20210@redhat.com>
 <56685631.50700@intel.com>
 <20151210101840.GA2570@work-vm>
 <566961C1.6030000@gmail.com>
 <20151210114114.GE2570@work-vm>
 <56698E68.5040207@intel.com>
 <CAKgT0UduOMvnVAUvRgnXkMPDwvOBh_5RimCgnb0zRr7aOyza4A@mail.gmail.com>
 <566D9320.8000209@intel.com>
 <CAKgT0Uc9g5aqKUKudD4Rj+1KfbGZn6VLzZxGv7UrRK+dy3wEVA@mail.gmail.com>
 <567CEA53.5030601@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Alexander Duyck <alexander.duyck@gmail.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Yang Zhang <yang.zhang.wz@gmail.com>, qemu-devel@nongnu.org,
	"Tantilov, Emil S" <emil.s.tantilov@intel.com>,
	kvm@vger.kernel.org, Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	aik@ozlabs.ru, "Skidmore, Donald C" <donald.c.skidmore@intel.com>,
	quintela@redhat.com, "Dong, Eddie" <eddie.dong@intel.com>,
	"Jani, Nrupal" <nrupal.jani@intel.com>,
	Alexander Graf <agraf@suse.de>,
	Blue Swirl <blauwirbel@gmail.com>, cornelia.huck@de.ibm.com,
	Alex Williamson <alex.williamson@redhat.com>,
	kraxel@redhat.com, Anthony Liguori <anthony@codemonkey.ws>,
	amit.shah@redhat.com, Paolo Bonzini <pbonzini@redhat.com>,
	"Rustad, Mark D" <mark.d.rustad@intel.com>, lcapitulino@redhat.com,
	Or Gerlitz <gerlitz.or@gmail.com>
To: Lan Tianyu <tianyu.lan@intel.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:56902 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752204AbbLYMMH (ORCPT <rfc822;kvm@vger.kernel.org>);
	Fri, 25 Dec 2015 07:12:07 -0500
Content-Disposition: inline
In-Reply-To: <567CEA53.5030601@intel.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Fri, Dec 25, 2015 at 03:03:47PM +0800, Lan Tianyu wrote:
> Merry Christmas.
> Sorry for later response due to personal affair.
>
> On 2015-12-14 03:30, Alexander Duyck wrote:
> >> > These sounds we need to add a faked bridge for migration and adding a
> >> > driver in the guest for it. It also needs to extend PCI bus/hotplug
> >> > driver to do pause/resume other devices, right?
> >> >
> >> > My concern is still that whether we can change PCI bus/hotplug like that
> >> > without spec change.
> >> >
> >> > IRQ should be general for any devices and we may extend it for
> >> > migration. Device driver also can make decision to support migration
> >> > or not.
> > The device should have no say in the matter.  Either we are going to
> > migrate or we will not.  This is why I have suggested my approach as
> > it allows for the least amount of driver intrusion while providing the
> > maximum number of ways to still perform migration even if the device
> > doesn't support it.
>
> Even if the device driver doesn't support migration, you still want to
> migrate VM? That maybe risk and we should add the "bad path" for the
> driver at least.
>
> >
> > The solution I have proposed is simple:
> >
> > 1.  Extend swiotlb to allow for a page dirtying functionality.
> >
> >      This part is pretty straight forward.  I'll submit a few patches
> > later today as RFC that can provided the minimal functionality needed
> > for this.
>
> Very appreciate to do that.
>
> >
> > 2.  Provide a vendor specific configuration space option on the QEMU
> > implementation of a PCI bridge to act as a bridge between direct
> > assigned devices and the host bridge.
> >
> >      My thought was to add some vendor specific block that includes a
> > capabilities, status, and control register so you could go through and
> > synchronize things like the DMA page dirtying feature.  The bridge
> > itself could manage the migration capable bit inside QEMU for all
> > devices assigned to it.  So if you added a VF to the bridge it would
> > flag that you can support migration in QEMU, while the bridge would
> > indicate you cannot until the DMA page dirtying control bit is set by
> > the guest.
> >
> >      We could also go through and optimize the DMA page dirtying after
> > this is added so that we can narrow down the scope of use, and as a
> > result improve the performance for other devices that don't need to
> > support migration.  It would then be a matter of adding an interrupt
> > in the device to handle an event such as the DMA page dirtying status
> > bit being set in the config space status register, while the bit is
> > not set in the control register.  If it doesn't get set then we would
> > have to evict the devices before the warm-up phase of the migration,
> > otherwise we can defer it until the end of the warm-up phase.
> >
> > 3.  Extend existing shpc driver to support the optional "pause"
> > functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> > hot-plug specification.
>
> Since your solution has added a faked PCI bridge. Why not notify the
> bridge directly during migration via irq and call device driver's
> callback in the new bridge driver?
>
> Otherwise, the new bridge driver also can check whether the device
> driver provides migration callback or not and call them to improve the
> passthough device's performance during migration.

As long as you keep up this vague talk about performance during
migration, without even bothering with any measurements, this patchset
will keep going nowhere.


There's Alex's patch that tracks memory changes during migration.  It
needs some simple enhancements to be useful in production (e.g. add a
host/guest handshake both to enable tracking in the guest and to detect
support in the host).  With those, migration can start while a device is
still assigned, and hot-unplug is invoked only after most of memory has
been migrated.
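
To make the ordering concrete, here is a minimal sketch of the handshake
and the deferred hot-unplug. It is a toy model only: the bit names, the
HandshakeRegs structure, the per-round dirty counts and the threshold are
all made up for illustration and are not taken from QEMU, Linux or Alex's
patch.

```python
from dataclasses import dataclass

# Hypothetical bit names; not taken from any real QEMU or Linux interface.
CAP_DIRTY_TRACKING = 1 << 0   # host advertises that it understands tracking
CTRL_DIRTY_TRACKING = 1 << 0  # guest sets this once it dirties DMA pages


@dataclass
class HandshakeRegs:
    capabilities: int = 0  # written by the host side
    control: int = 0       # written by the guest side


def tracking_active(regs):
    """Tracking counts only if the host offers it and the guest enabled it."""
    return bool(regs.capabilities & CAP_DIRTY_TRACKING
                and regs.control & CTRL_DIRTY_TRACKING)


def plan_migration(regs, dirty_pages_per_round, unplug_threshold):
    """Return the ordered list of steps the host would take.

    dirty_pages_per_round: pages still dirty after each pre-copy round
    (made-up input for illustration).
    unplug_threshold: hot-unplug once the remaining dirty set is this small.
    """
    if not tracking_active(regs):
        # Without the handshake the host cannot see device DMA writes,
        # so the device has to be removed before any memory is copied.
        return ["hot-unplug", "pre-copy", "stop-and-copy"]
    steps = []
    for remaining in dirty_pages_per_round:
        steps.append("pre-copy")
        if remaining <= unplug_threshold:
            break
    # Device stays assigned during pre-copy and is removed only at the end.
    steps += ["hot-unplug", "stop-and-copy"]
    return steps
```

With both bits set, the device stays assigned for every pre-copy round
and is unplugged only before the final stop-and-copy; with either bit
clear, the unplug comes first. The time spent in the unplug step is
exactly what needs to be measured.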

Please implement this in QEMU and measure the speed.
I will not be surprised if destroying/creating a netdev in Linux
turns out to take too long, but until someone has bothered to
check, it does not make sense to discuss further enhancements.


> >
> >      Note I call out "extend" here instead of saying to add this.
> > Basically what we should do is provide a means of quiescing the device
> > without unloading the driver.  This is called out as something the OS
> > vendor can optionally implement in the PCI hot-plug specification.  On
> > OSes that wouldn't support this it would just be treated as a standard
> > hot-plug event.   We could add a capability, status, and control bit
> > in the vendor specific configuration block for this as well and if we
> > set the status bit would indicate the host wants to pause instead of
> > remove and the control bit would indicate the guest supports "pause"
> > in the OS.  We then could optionally disable guest migration while the
> > VF is present and pause is not supported.
> >
> >      To support this we would need to add a timer and if a new device
> > is not inserted in some period of time (60 seconds for example), or if
> > a different device is inserted, we need to unload the original driver
> > from the device.  In addition we would need to verify if drivers can
> > call the remove function after having called suspend without resume.
> > If not, we could look at adding a recovery function to remove the
> > driver from the device in the case of a suspend with either a failed
> > resume or no resume call.  Once again it would probably be useful to
> > have for those cases where power management suspend/resume runs into
> > an issue like somebody causing a surprise removal while a device was
> > suspended.
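
For reference, the timeout rule quoted above can be sketched as a small
decision function. This is only an illustration under assumptions: the
function name, the string device identifiers and the returned action
names are hypothetical, the 60 second value is the example from the
quoted text, and resuming when the same device comes back is my reading
of "pause", not something any existing driver is known to implement.

```python
PAUSE_TIMEOUT_S = 60.0  # example value from the quoted proposal


def resolve_pause(paused_id, inserted_id, elapsed_s, timeout_s=PAUSE_TIMEOUT_S):
    """Decide what to do with a driver whose device was paused.

    paused_id: identity of the device that was paused (hypothetical key).
    inserted_id: identity of the device now in the slot, or None if empty.
    elapsed_s: seconds since the pause was requested.
    """
    if elapsed_s >= timeout_s:
        # Timer expired: fall back to ordinary hot-unplug behaviour.
        return "unload-driver"
    if inserted_id is None:
        # Slot still empty and timer still running.
        return "wait"
    if inserted_id == paused_id:
        # Same device came back in time: keep the driver, resume it.
        return "resume"
    # A different device showed up: the paused driver state is useless.
    return "unload-driver"
```

The "unload-driver" outcome is the case where remove would be called
after suspend without an intervening resume, which is the driver
behaviour the quoted text says still has to be verified.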
>
>
> -- 
> Best regards
> Tianyu Lan