From: Jakub Kicinski
Subject: Re: [RFC] virtio-net: help live migrate SR-IOV devices
Date: Thu, 30 Nov 2017 12:48:22 -0800
Message-ID: <20171130124816.7d534cf3@cakuba.netronome.com>
References: <20171128112722.00003716@intel.com>
 <0fb552d4-1bfc-e130-4fc1-87b83873916d@redhat.com>
 <20171129195138.63512ead@cakuba.netronome.com>
 <20171130153522-mutt-send-email-mst@kernel.org>
In-Reply-To: <20171130153522-mutt-send-email-mst@kernel.org>
To: "Michael S. Tsirkin"
Cc: Jason Wang, Jesse Brandeburg, virtualization@lists.linux-foundation.org,
 Sridhar Samudrala, Achiad, Peter Waskiewicz Jr, "Singhai, Anjali",
 Andy Gospodarek, Or Gerlitz, netdev, Hannes Frederic Sowa

On Thu, 30 Nov 2017 15:54:40 +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 07:51:38PM -0800, Jakub Kicinski wrote:
> > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:
> > > On 2017年11月29日 03:27, Jesse Brandeburg wrote:
> > > > Hi, I'd like to get some feedback on a proposal to enhance
> > > > virtio-net to ease configuration of a VM and to enable live
> > > > migration of passthrough network SR-IOV devices.
> > > >
> > > > Today we have SR-IOV network devices (VFs) that can be passed
> > > > into a VM in order to enable high-performance networking
> > > > directly within the VM.  The problem I am trying to address is
> > > > that this configuration is generally difficult to live-migrate.
> > > > There is documentation [1] indicating that some OS/hypervisor
> > > > vendors will support live migration of a system with a directly
> > > > assigned networking device.  The problem I see with these
> > > > implementations is that the network configuration requirements
> > > > that are passed on to the owner of the VM are quite complicated.
> > > > You have to set up bonding, you have to configure it to enslave
> > > > two interfaces, those interfaces (one is virtio-net, the other
> > > > is an SR-IOV device/driver like ixgbevf) must support MAC
> > > > address changes requested in the VM, and on and on...
> > > >
> > > > So, on to the proposal:
> > > > Modify the virtio-net driver to be a single VM network device
> > > > that enslaves an SR-IOV network device (inside the VM) with the
> > > > same MAC address.  This would cause the virtio-net driver to
> > > > appear and work like a simplified bonding/team driver.  The live
> > > > migration problem would be solved just like in today's bonding
> > > > solution, but the VM user's networking config would be greatly
> > > > simplified.
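(To make the enslave step concrete: presumably virtio-net would register
a netdevice notifier and claim any netdev whose permanent MAC matches its
own, similar in spirit to what netvsc does.  A rough sketch of that
matching logic follows -- purely illustrative, none of these vnet_* names
exist in virtio_net today, and vnet_enslave_vf()/vnet_release_vf() are
hypothetical stand-ins for the actual enslave/release logic.)

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/notifier.h>

/* The virtio-net uplink this notifier acts on behalf of (illustrative). */
static struct net_device *vnet_uplink;

/* Hypothetical helpers -- stand-ins for the enslave/release logic. */
static void vnet_enslave_vf(struct net_device *vnet, struct net_device *vf);
static void vnet_release_vf(struct net_device *vnet, struct net_device *vf);

static int vnet_failover_event(struct notifier_block *nb,
                               unsigned long event, void *ptr)
{
        struct net_device *dev = netdev_notifier_info_to_dev(ptr);

        if (!vnet_uplink || dev == vnet_uplink)
                return NOTIFY_DONE;

        switch (event) {
        case NETDEV_REGISTER:
                /* Same permanent MAC as the virtio-net device: claim it
                 * as the preferred (fast) path.
                 */
                if (ether_addr_equal(dev->perm_addr, vnet_uplink->perm_addr))
                        vnet_enslave_vf(vnet_uplink, dev);
                break;
        case NETDEV_UNREGISTER:
                /* VF hot-unplugged (e.g. ahead of migration): fail back
                 * to the virtio path.
                 */
                vnet_release_vf(vnet_uplink, dev);
                break;
        }
        return NOTIFY_DONE;
}

static struct notifier_block vnet_failover_nb = {
        .notifier_call = vnet_failover_event,
};
/* Registered at probe time with register_netdevice_notifier(&vnet_failover_nb). */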
> > > > At its simplest, it would appear something like this in the VM.
> > > >
> > > >  ==========
> > > >  =  vnet0  =
> > > >  =============
> > > >  (virtio-  = |
> > > >    net)    = |
> > > >  =         ===========
> > > >  =         =  ixgbevf =
> > > >  ==========  ==========
> > > >
> > > > (forgive the ASCII art)
> > > >
> > > > The fast path traffic would prefer the ixgbevf or other SR-IOV
> > > > device path, and fall back to virtio's transmit/receive when
> > > > migrating.
> > > >
> > > > Compared to today's options this proposal would
> > > > 1) make virtio-net more sticky, allow fast path traffic at
> > > >    SR-IOV speeds
> > > > 2) simplify end user configuration in the VM (most if not all of
> > > >    the set up to enable migration would be done in the hypervisor)
> > > > 3) allow live migration via a simple link down and maybe a PCI
> > > >    hot-unplug of the SR-IOV device, with failover to the
> > > >    virtio-net driver core
> > > > 4) allow vendor agnostic hardware acceleration, and live migration
> > > >    between vendors if the VM OS has driver support for all the
> > > >    required SR-IOV devices.
> > > >
> > > > Runtime operation proposed:
> > > > - virtio-net driver loads, SR-IOV driver loads
> > > > - virtio-net finds other NICs that match its MAC address by
> > > >   both examining existing interfaces, and sets up a new device
> > > >   notifier
> > > > - virtio-net enslaves the first NIC with the same MAC address
> > > > - virtio-net brings up the slave, and makes it the "preferred" path
> > > > - virtio-net follows the behavior of an active backup bond/team
> > > > - virtio-net acts as the interface to the VM
> > > > - live migration initiates
> > > > - link goes down on SR-IOV, or SR-IOV device is removed
> > > > - failover to virtio-net as primary path
> > > > - migration continues to new host
> > > > - new host is started with virtio-net as primary
> > > > - if no SR-IOV, virtio-net stays primary
> > > > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
> > > > - virtio-net notices new NIC and starts over at enslave step above
> > > >
> > > > Future ideas (brainstorming):
> > > > - Optimize fast east-west by having special rules to direct
> > > >   east-west traffic through the virtio-net traffic path
> > > >
> > > > Thanks for reading!
> > > > Jesse
> > >
> > > Cc netdev.
> > >
> > > Interesting, and this method is actually used by netvsc now:
> > >
> > > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> > > Author: stephen hemminger
> > > Date:   Tue Aug 1 19:58:53 2017 -0700
> > >
> > >     netvsc: transparent VF management
> > >
> > >     This patch implements transparent fail over from synthetic NIC to
> > >     SR-IOV virtual function NIC in Hyper-V environment.  It is a better
> > >     alternative to using bonding as is done now.  Instead, the receive
> > >     and transmit fail over is done internally inside the driver.
> > >
> > >     Using bonding driver has lots of issues because it depends on the
> > >     script being run early enough in the boot process and with
> > >     sufficient information to make the association.  This patch moves
> > >     all that functionality into the kernel.
> > >
> > >     Signed-off-by: Stephen Hemminger
> > >     Signed-off-by: David S. Miller
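For anyone who hasn't read that netvsc code: the "fail over internally
inside the driver" part essentially means the transmit path prefers the
enslaved VF whenever it is present and up, and otherwise falls back to
the paravirtual queues.  A very stripped-down sketch of the idea -- this
is not the actual netvsc or virtio_net code, and the vnet_* names are
made up:

#include <linux/netdevice.h>
#include <linux/rcupdate.h>
#include <linux/skbuff.h>

/* Illustrative private state: the enslaved VF netdev, if any. */
struct vnet_priv {
        struct net_device __rcu *vf_netdev;
};

/* Hypothetical helpers for the two underlying transmit paths. */
static netdev_tx_t vnet_vf_xmit(struct net_device *ndev,
                                struct net_device *vf, struct sk_buff *skb);
static netdev_tx_t vnet_virtio_xmit(struct sk_buff *skb,
                                    struct net_device *ndev);

static netdev_tx_t vnet_start_xmit(struct sk_buff *skb,
                                   struct net_device *ndev)
{
        struct vnet_priv *priv = netdev_priv(ndev);
        struct net_device *vf;

        /* Prefer the passthrough VF whenever it is enslaved and up. */
        vf = rcu_dereference_bh(priv->vf_netdev);
        if (vf && netif_running(vf) && netif_carrier_ok(vf))
                return vnet_vf_xmit(ndev, vf, skb);

        /* No usable VF (not plugged in, or unplugged for migration):
         * fall back to the paravirtual (virtio) queues.
         */
        return vnet_virtio_xmit(skb, ndev);
}

(On the receive side netvsc grabs the VF's traffic via an rx handler so
everything shows up on the one visible device.)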
> > >
> > > If my understanding is correct there's no need for any extension of
> > > the virtio spec.  If this is true, maybe you can start to prepare
> > > the patch?
> >
> > IMHO this is as close to policy in the kernel as one can get.  User
> > land has all the information it needs to instantiate that bond/team
> > automatically.
>
> It does have this info (MAC addresses match) but where's the policy
> here?  IMHO the policy has been set by the hypervisor already.
> From the hypervisor's POV adding passthrough is a commitment not to
> migrate until the guest stops using the passthrough device.
>
> Within the guest, the bond is required for purely functional reasons -
> just to maintain a link up since we know SR-IOV will go away.
> Maintaining an uninterrupted connection is not a policy - it's what
> networking is about.
>
> > In fact I'm trying to discuss this with NetworkManager
> > folks and Red Hat right now:
> >
> > https://mail.gnome.org/archives/networkmanager-list/2017-November/msg00038.html
>
> I thought we should do it too, for a while.
>
> But now, I think that the real issue is this: the kernel exposes what
> looks like two network devices to userspace, but in fact it is just one
> backend device, just exposed by the hypervisor in a weird way for
> compatibility reasons.
>
> For example, you will not get better reliability or throughput by using
> both of them - the only bonding mode that makes sense is fail over.

Yes, I'm talking about fail over.

> As another example, if the underlying physical device lost its link,
> trying to use virtio won't help - it's only useful when the passthrough
> device is gone for good.  As another example, there is no point in not
> configuring a bond.  As a last example, depending on how the backend is
> configured, virtio might not even work when the pass-through device is
> active.
>
> So from that point of view, showing two network devices to userspace is
> a bug that we are asking userspace to work around.

I'm confused by what you're saying here.  IIRC the question is whether
we expose 2 netdevs or 3.  There will always be a virtio netdev and a
VF netdev.  I assume you're not suggesting hiding the VF netdev.  So
the question is: do we expose a VF netdev plus a combo virtio netdev
which is also a bond, or do we expose a VF netdev, a virtio netdev, and
an active/passive bond/team, which is a well understood and
architecturally correct construct?

> > Can we flip the argument and ask why the kernel is supposed to be
> > responsible for this?
>
> Because if we show a single device to userspace the number of
> misconfigured guests will go down, and we won't lose any useful
> flexibility.

Again, single device?

> > It's not like we run DHCP out of the kernel
> > on new interfaces...
>
> Because one can set up a static IP, IPv6 doesn't always need DHCP, etc.

But we don't handle LACP, etc.

Look, as much as I don't like this, I'm not going to argue about this
to death.  I just find it very dishonest to claim the kernel *has to*
do it, when no one seems to have made any honest attempts to solve this
in user space for the last 10 years :/