From: Jakub Kicinski
Subject: Re: [RFC] virtio-net: help live migrate SR-IOV devices
Date: Thu, 30 Nov 2017 12:48:22 -0800
Message-ID: <20171130124816.7d534cf3@cakuba.netronome.com>
References: <20171128112722.00003716@intel.com>
 <0fb552d4-1bfc-e130-4fc1-87b83873916d@redhat.com>
 <20171129195138.63512ead@cakuba.netronome.com>
 <20171130153522-mutt-send-email-mst@kernel.org>
In-Reply-To: <20171130153522-mutt-send-email-mst@kernel.org>
To: "Michael S. Tsirkin"
Cc: Jason Wang, Jesse Brandeburg, virtualization@lists.linux-foundation.org,
 Sridhar Samudrala, Achiad, Peter Waskiewicz Jr, "Singhai, Anjali",
 Andy Gospodarek, Or Gerlitz, netdev, Hannes Frederic Sowa

On Thu, 30 Nov 2017 15:54:40 +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 07:51:38PM -0800, Jakub Kicinski wrote:
> > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:
> > > On 2017年11月29日 03:27, Jesse Brandeburg wrote:
> > > > Hi, I'd like to get some feedback on a proposal to enhance
> > > > virtio-net to ease configuration of a VM and to enable live
> > > > migration of passthrough network SR-IOV devices.
> > > >
> > > > Today we have SR-IOV network devices (VFs) that can be passed
> > > > into a VM in order to enable high-performance networking
> > > > directly within the VM.  The problem I am trying to address is
> > > > that this configuration is generally difficult to live-migrate.
> > > > There is documentation [1] indicating that some OS/hypervisor
> > > > vendors will support live migration of a system with a directly
> > > > assigned networking device.  The problem I see with these
> > > > implementations is that the network configuration requirements
> > > > that are passed on to the owner of the VM are quite complicated.
> > > > You have to set up bonding, you have to configure it to enslave
> > > > two interfaces, those interfaces (one is virtio-net, the other
> > > > is an SR-IOV device/driver like ixgbevf) must support MAC
> > > > address changes requested in the VM, and on and on...
> > > >
> > > > So, on to the proposal:
> > > > Modify the virtio-net driver to be a single VM network device
> > > > that enslaves an SR-IOV network device (inside the VM) with the
> > > > same MAC address.  This would cause the virtio-net driver to
> > > > appear and work like a simplified bonding/team driver.  The live
> > > > migration problem would be solved just like in today's bonding
> > > > solution, but the VM user's networking config would be greatly
> > > > simplified.
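(To make the enslave step concrete: presumably virtio-net would register
a netdevice notifier and claim any netdev whose permanent MAC matches its
own, similar in spirit to what netvsc does.  A rough sketch of that
matching logic follows -- purely illustrative, none of these vnet_* names
exist in virtio_net today, and vnet_enslave_vf()/vnet_release_vf() are
hypothetical stand-ins for the actual enslave/release logic.)

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/notifier.h>

/* The virtio-net uplink this notifier acts on behalf of (illustrative). */
static struct net_device *vnet_uplink;

/* Hypothetical helpers -- stand-ins for the enslave/release logic. */
static void vnet_enslave_vf(struct net_device *vnet, struct net_device *vf);
static void vnet_release_vf(struct net_device *vnet, struct net_device *vf);

static int vnet_failover_event(struct notifier_block *nb,
                               unsigned long event, void *ptr)
{
        struct net_device *dev = netdev_notifier_info_to_dev(ptr);

        if (!vnet_uplink || dev == vnet_uplink)
                return NOTIFY_DONE;

        switch (event) {
        case NETDEV_REGISTER:
                /* Same permanent MAC as the virtio-net device: claim it
                 * as the preferred (fast) path.
                 */
                if (ether_addr_equal(dev->perm_addr, vnet_uplink->perm_addr))
                        vnet_enslave_vf(vnet_uplink, dev);
                break;
        case NETDEV_UNREGISTER:
                /* VF hot-unplugged (e.g. ahead of migration): fail back
                 * to the virtio path.
                 */
                vnet_release_vf(vnet_uplink, dev);
                break;
        }
        return NOTIFY_DONE;
}

static struct notifier_block vnet_failover_nb = {
        .notifier_call = vnet_failover_event,
};
/* Registered at probe time with register_netdevice_notifier(&vnet_failover_nb). */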
> > > > At its simplest, it would appear something like this in the VM.
> > > >
> > > >  ==========
> > > >  =  vnet0  =
> > > >  =============
> > > >  (virtio-  = |
> > > >    net)    = |
> > > >  =         ===========
> > > >  =         =  ixgbevf =
> > > >  ==========  ==========
> > > >
> > > > (forgive the ASCII art)
> > > >
> > > > The fast path traffic would prefer the ixgbevf or other SR-IOV
> > > > device path, and fall back to virtio's transmit/receive when
> > > > migrating.
> > > >
> > > > Compared to today's options this proposal would
> > > > 1) make virtio-net more sticky, allow fast path traffic at
> > > >    SR-IOV speeds
> > > > 2) simplify end user configuration in the VM (most if not all of
> > > >    the set up to enable migration would be done in the hypervisor)
> > > > 3) allow live migration via a simple link down and maybe a PCI
> > > >    hot-unplug of the SR-IOV device, with failover to the
> > > >    virtio-net driver core
> > > > 4) allow vendor agnostic hardware acceleration, and live migration
> > > >    between vendors if the VM OS has driver support for all the
> > > >    required SR-IOV devices.
> > > >
> > > > Runtime operation proposed:
> > > > - virtio-net driver loads, SR-IOV driver loads
> > > > - virtio-net finds other NICs that match its MAC address by
> > > >   both examining existing interfaces, and sets up a new device
> > > >   notifier
> > > > - virtio-net enslaves the first NIC with the same MAC address
> > > > - virtio-net brings up the slave, and makes it the "preferred" path
> > > > - virtio-net follows the behavior of an active backup bond/team
> > > > - virtio-net acts as the interface to the VM
> > > > - live migration initiates
> > > > - link goes down on SR-IOV, or SR-IOV device is removed
> > > > - failover to virtio-net as primary path
> > > > - migration continues to new host
> > > > - new host is started with virtio-net as primary
> > > > - if no SR-IOV, virtio-net stays primary
> > > > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
> > > > - virtio-net notices new NIC and starts over at enslave step above
> > > >
> > > > Future ideas (brainstorming):
> > > > - Optimize fast east-west by having special rules to direct
> > > >   east-west traffic through the virtio-net traffic path
> > > >
> > > > Thanks for reading!
> > > > Jesse
> > >
> > > Cc netdev.
> > >
> > > Interesting, and this method is actually used by netvsc now:
> > >
> > > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> > > Author: stephen hemminger
> > > Date:   Tue Aug 1 19:58:53 2017 -0700
> > >
> > >     netvsc: transparent VF management
> > >
> > >     This patch implements transparent fail over from synthetic NIC to
> > >     SR-IOV virtual function NIC in Hyper-V environment.  It is a better
> > >     alternative to using bonding as is done now.  Instead, the receive
> > >     and transmit fail over is done internally inside the driver.
> > >
> > >     Using bonding driver has lots of issues because it depends on the
> > >     script being run early enough in the boot process and with
> > >     sufficient information to make the association.  This patch moves
> > >     all that functionality into the kernel.
> > >
> > >     Signed-off-by: Stephen Hemminger
> > >     Signed-off-by: David S. Miller
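For anyone who hasn't read that netvsc code: the "fail over internally
inside the driver" part essentially means the transmit path prefers the
enslaved VF whenever it is present and up, and otherwise falls back to
the paravirtual queues.  A very stripped-down sketch of the idea -- this
is not the actual netvsc or virtio_net code, and the vnet_* names are
made up:

#include <linux/netdevice.h>
#include <linux/rcupdate.h>
#include <linux/skbuff.h>

/* Illustrative private state: the enslaved VF netdev, if any. */
struct vnet_priv {
        struct net_device __rcu *vf_netdev;
};

/* Hypothetical helpers for the two underlying transmit paths. */
static netdev_tx_t vnet_vf_xmit(struct net_device *ndev,
                                struct net_device *vf, struct sk_buff *skb);
static netdev_tx_t vnet_virtio_xmit(struct sk_buff *skb,
                                    struct net_device *ndev);

static netdev_tx_t vnet_start_xmit(struct sk_buff *skb,
                                   struct net_device *ndev)
{
        struct vnet_priv *priv = netdev_priv(ndev);
        struct net_device *vf;

        /* Prefer the passthrough VF whenever it is enslaved and up. */
        vf = rcu_dereference_bh(priv->vf_netdev);
        if (vf && netif_running(vf) && netif_carrier_ok(vf))
                return vnet_vf_xmit(ndev, vf, skb);

        /* No usable VF (not plugged in, or unplugged for migration):
         * fall back to the paravirtual (virtio) queues.
         */
        return vnet_virtio_xmit(skb, ndev);
}

(On the receive side netvsc grabs the VF's traffic via an rx handler so
everything shows up on the one visible device.)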
> > >
> > > If my understanding is correct there's no need for any extension of
> > > the virtio spec.  If this is true, maybe you can start to prepare
> > > the patch?
> >
> > IMHO this is as close to policy in the kernel as one can get.  User
> > land has all the information it needs to instantiate that bond/team
> > automatically.
>
> It does have this info (MAC addresses match) but where's the policy
> here?  IMHO the policy has been set by the hypervisor already.
> From the hypervisor's POV adding passthrough is a commitment not to
> migrate until the guest stops using the passthrough device.
>
> Within the guest, the bond is required for purely functional reasons -
> just to maintain a link up since we know SR-IOV will go away.
> Maintaining an uninterrupted connection is not a policy - it's what
> networking is about.
>
> > In fact I'm trying to discuss this with NetworkManager
> > folks and Red Hat right now:
> >
> > https://mail.gnome.org/archives/networkmanager-list/2017-November/msg00038.html
>
> I thought we should do it too, for a while.
>
> But now, I think that the real issue is this: the kernel exposes what
> looks like two network devices to userspace, but in fact it is just one
> backend device, just exposed by the hypervisor in a weird way for
> compatibility reasons.
>
> For example, you will not get better reliability or throughput by using
> both of them - the only bonding mode that makes sense is fail over.

Yes, I'm talking about fail over.

> As another example, if the underlying physical device lost its link,
> trying to use virtio won't help - it's only useful when the passthrough
> device is gone for good.  As another example, there is no point in not
> configuring a bond.  As a last example, depending on how the backend is
> configured, virtio might not even work when the pass-through device is
> active.
>
> So from that point of view, showing two network devices to userspace is
> a bug that we are asking userspace to work around.

I'm confused by what you're saying here.  IIRC the question is whether
we expose 2 netdevs or 3.  There will always be a virtio netdev and a
VF netdev.  I assume you're not suggesting hiding the VF netdev.  So
the question is: do we expose a VF netdev plus a combo virtio netdev
which is also a bond, or do we expose a VF netdev, a virtio netdev, and
an active/passive bond/team, which is a well understood and
architecturally correct construct?

> > Can we flip the argument and ask why the kernel is supposed to be
> > responsible for this?
>
> Because if we show a single device to userspace the number of
> misconfigured guests will go down, and we won't lose any useful
> flexibility.

Again, single device?

> > It's not like we run DHCP out of the kernel
> > on new interfaces...
>
> Because one can set up a static IP, IPv6 doesn't always need DHCP, etc.

But we don't handle LACP, etc.

Look, as much as I don't like this, I'm not going to argue about this
to death.  I just find it very dishonest to claim the kernel *has to*
do it, when no one seems to have made any honest attempts to solve this
in user space for the last 10 years :/