Date: Wed, 5 Dec 2018 12:26:02 -0500
From: "Michael S. Tsirkin"
Message-ID: <20181205122332-mutt-send-email-mst@kernel.org>
References: <20181025140631.634922-1-sameeh@daynix.com> <20181205171818.GA1136@redhat.com>
In-Reply-To: <20181205171818.GA1136@redhat.com>
Subject: Re: [Qemu-devel] [RFC 0/2] Attempt to implement the standby feature for assigned network devices
To: Daniel P. Berrangé
Cc: Sameeh Jubran, qemu-devel@nongnu.org, Jason Wang, Yan Vugenfirer, Eduardo Habkost

On Wed, Dec 05, 2018 at 05:18:18PM +0000, Daniel P. Berrangé wrote:
> On Thu, Oct 25, 2018 at 05:06:29PM +0300, Sameeh Jubran wrote:
> > From: Sameeh Jubran
> >
> > Hi all,
> >
> > Background:
> >
> > There have been a few attempts to implement the standby feature for
> > vfio assigned devices, which aims to enable the migration of such
> > devices. This is another attempt.
> >
> > The series implements an infrastructure for hiding devices from the
> > bus upon boot. What it does is the following:
> >
> > * In the first patch the infrastructure for hiding the device is
> >   added to the qbus and qdev APIs. A "hidden" boolean is added to the
> >   device state, and it is set based on a callback to the standby
> >   device, which registers itself for handling the assessment "should
> >   the primary device be hidden?" by cross-validating the IDs of the
> >   devices.
> >
> > * In the second patch virtio-net uses the API to hide the vfio device
> >   and unhides it when the feature is acked.
>
> IIUC, the general idea is that we want to provide a pair of associated
> NIC devices to the guest, one emulated, one physical PCI device. The
> guest would put them in a bonded pair. Before migration the PCI device
> is unplugged & a new PCI device is plugged on the target after
> migration. The guest traffic continues without interruption thanks to
> the emulated device.
>
> This kind of conceptual approach can already be implemented today by
> management apps. The only hard problem that exists today is how the
> guest OS can figure out that a particular pair of devices it has are
> intended to be used together.
>
> With this series, IIUC, the virtio-net device is given a property which
> defines the qdev ID of the associated VFIO device. When the guest OS
> activates the virtio-net device and acknowledges the STANDBY feature
> bit, qdev then unhides the associated VFIO device.
>
> AFAICT the guest has to infer that the device which suddenly appears is
> the one associated with the virtio-net device it just initialized, for
> purposes of setting up the NIC bonding. There doesn't appear to be any
> explicit association between the devices exposed to the guest.
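
To make the mechanism concrete, the cover letter's description boils down
to roughly the standalone sketch below. The names and structure are
illustrative only, not the actual patch API; in the series the check would
live in the qdev/qbus realize path rather than in a toy program.

/*
 * Sketch of the "hidden primary" idea from the cover letter: the standby
 * device (virtio-net) knows the qdev ID of its primary (VFIO) device, and
 * the bus asks it whether that primary should stay hidden until the guest
 * acks the standby feature.  Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct standby_state {
    const char *id;          /* qdev ID of the standby (virtio-net)  */
    const char *primary_id;  /* qdev ID of the hidden primary device */
    bool feature_acked;      /* has the guest acked STANDBY?         */
};

/* "Should the primary device be hidden?" -- cross-validate the IDs. */
static bool primary_should_be_hidden(const struct standby_state *s,
                                     const char *dev_id)
{
    return strcmp(dev_id, s->primary_id) == 0 && !s->feature_acked;
}

int main(void)
{
    /* IDs borrowed from the command line example further down. */
    struct standby_state s = { "cc1_72", "cc1_71", false };

    printf("before ack, hide cc1_71: %d\n",
           primary_should_be_hidden(&s, "cc1_71"));  /* 1: stays hidden */

    s.feature_acked = true;  /* guest negotiated the standby feature */
    printf("after ack,  hide cc1_71: %d\n",
           primary_should_be_hidden(&s, "cc1_71"));  /* 0: unhide/plug */
    return 0;
}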
>
> This feels pretty fragile for a guest needing to match up devices when
> there are many pairs of devices exposed to a single guest.
>
> Unless I'm mis-reading the patches, it looks like the VFIO device
> always has to be available at the time QEMU is started. There's no way
> to boot a guest and then later hotplug a VFIO device to accelerate the
> existing virtio-net NIC.

That should be supported.

> Or similarly after migration there might not be any VFIO device
> available initially when QEMU is started to accept the incoming
> migration. So it might need to run in degraded mode for an extended
> period of time until one becomes available for hotplugging.

That should work too.

> The use of qdev IDs makes this troublesome, as the qdev ID of the
> future VFIO device would need to be decided upfront before it even
> exists.

I agree this sounds problematic.

>
> So overall I'm not really a fan of the dynamic hiding/unhiding of
> devices.

Dynamic hiding is an orthogonal issue though.
It's needed for error handling in case of migration failure:
we do not want to close the VFIO device, but we do need
to hide it from the guest.
libvirt should not be involved in this aspect though.

> I would much prefer to see some way to expose an explicit relationship
> between the devices to the guest.
>
> > Disclaimers:
> >
> > * I have only scratch-tested this and, from the qemu side, it seems
> >   to be working.
> > * This is an RFC, so it lacks proper error handling in a few cases
> >   as well as proper resource freeing. I wanted to get some feedback
> >   first before it is finalized.
> >
> > Command line example:
> >
> > /home/sameeh/Builds/failover/qemu/x86_64-softmmu/qemu-system-x86_64 \
> > -netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_71 \
> > -netdev tap,vhost=on,id=hostnet1,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
> > -device virtio-net,host_mtu=1500,netdev=hostnet1,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
> > -device e1000,netdev=hostnet0,id=cc1_71,standby=cc1_72 \
> >
> > Migration support:
> >
> > Pre-migration, or during the setup phase of the migration, we should
> > send an unplug request to the guest to unplug the primary device. I
> > haven't had the chance to implement that part yet but should do soon.
> > Do you know what's the best approach to do so? I wanted to have a
> > callback to the virtio-net device which tries to send an unplug
> > request to the guest and, if it succeeds, the migration continues. It
> > needs to handle the case where the migration fails and then has to
> > replug the primary device.
>
> Having QEMU do this internally gets into a world of pain when you have
> multiple devices in the guest.
>
> Consider if we have 2 pairs of devices. We unplug one VFIO device, but
> unplugging the second VFIO device fails, so we try to replug the first
> VFIO device and this now fails too. We don't even get as far as
> starting the migration before we have to return an error.
>
> The mgmt app will just see that the migration failed, but it will not
> be sure which devices are now actually exposed to the guest OS
> correctly.
>
> A similar problem hits if we have started the migration data stream but
> then have to abort, and so need to try to replug on the source but fail
> for some reason.
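
To spell that failure mode out, here is a toy sketch of the internal
unplug-with-rollback loop for two pairs. The unplug()/replug() stubs are
placeholders simulating the failures described above; nothing here is
actual QEMU or libvirt code.

/*
 * Two primaries must be unplugged before migration starts.  The second
 * unplug fails, and the rollback replug of the first fails as well, so a
 * device is now missing from the guest even though migration never
 * started.  If QEMU handles all of this internally and only reports
 * "migration failed", management cannot tell that this happened.
 */
#include <stdbool.h>
#include <stdio.h>

static bool unplug(int dev) { return dev != 2; }          /* 2nd unplug fails */
static bool replug(int dev) { (void)dev; return false; }  /* rollback fails   */

int main(void)
{
    int removed[2];
    int n = 0;

    for (int dev = 1; dev <= 2; dev++) {
        if (unplug(dev)) {
            removed[n++] = dev;
            continue;
        }

        /* Unplug failed: roll back the primaries already removed. */
        for (int i = 0; i < n; i++) {
            if (!replug(removed[i])) {
                printf("rollback of primary %d failed\n", removed[i]);
            }
        }
        printf("aborting before migration starts\n");
        return 1;
    }

    printf("all primaries unplugged; migration can start\n");
    return 0;
}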
>
> Doing the VFIO device plugging/unplugging explicitly from the mgmt app
> gives that mgmt app direct information about which devices have been
> successfully made available to the guest at all times, because the mgmt
> app can see the errors from each step of the process. Trying to do this
> inside QEMU doesn't achieve anything the mgmt app can't already do, but
> it obscures what happens during failures. The same applies at the
> libvirt level too, which is why mgmt apps today will do the VFIO
> unplug/replug on either side of migration themselves.
>
>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|