From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Samudrala, Sridhar" Subject: Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device Date: Wed, 21 Feb 2018 18:35:08 -0800 Message-ID: <0d158bf6-79b3-442b-2c61-3e900ff40922@intel.com> References: <1518804682-16881-1-git-send-email-sridhar.samudrala@intel.com> <20180216183817.42b07af6@cakuba.netronome.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Cc: Jakub Kicinski , "Michael S. Tsirkin" , Stephen Hemminger , David Miller , Netdev , virtualization@lists.linux-foundation.org, virtio-dev@lists.oasis-open.org, "Brandeburg, Jesse" , "Duyck, Alexander H" , Jason Wang To: Siwei Liu , Alexander Duyck Return-path: Sender: List-Post: List-Help: List-Unsubscribe: List-Subscribe: In-Reply-To: Content-Language: en-US List-Id: netdev.vger.kernel.org On 2/21/2018 5:59 PM, Siwei Liu wrote: > On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck > wrote: >> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: >>> I haven't checked emails for days and did not realize the new revision >>> had already came out. And thank you for the effort, this revision >>> really looks to be a step forward towards our use case and is close to >>> what we wanted to do. A few questions in line. >>> >>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck >>> wrote: >>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: >>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: >>>>>> Ppatch 2 is in response to the community request for a 3 netdev >>>>>> solution. However, it creates some issues we'll get into in a moment. >>>>>> It extends virtio_net to use alternate datapath when available and >>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates >>>>>> an additional 'bypass' netdev that acts as a master device and controls >>>>>> 2 slave devices. The original virtio_net netdev is registered as >>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets >>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >>>>>> associated with the same 'pci' device. The user accesses the network >>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev >>>>>> as default for transmits when it is available with link up and running. >>>>> Thank you do doing this. >>>>> >>>>>> We noticed a couple of issues with this approach during testing. >>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same >>>>>> virtio pci device, udev tries to rename both of them with the same name >>>>>> and the 2nd rename will fail. This would be OK as long as the first netdev >>>>>> to be renamed is the 'bypass' netdev, but the order in which udev gets >>>>>> to rename the 2 netdevs is not reliable. >>>>> Out of curiosity - why do you link the master netdev to the virtio >>>>> struct device? >>>> The basic idea of all this is that we wanted this to work with an >>>> existing VM image that was using virtio. As such we were trying to >>>> make it so that the bypass interface takes the place of the original >>>> virtio and get udev to rename the bypass to what the original >>>> virtio_net was. >>> Could it made it also possible to take over the config from VF instead >>> of virtio on an existing VM image? And get udev rename the bypass >>> netdev to what the original VF was. I don't say tightly binding the >>> bypass master to only virtio or VF, but I think we should provide both >>> options to support different upgrade paths. Possibly we could tweak >>> the device tree layout to reuse the same PCI slot for the master >>> bypass netdev, such that udev would not get confused when renaming the >>> device. The VF needs to use a different function slot afterwards. >>> Perhaps we might need to a special multiseat like QEMU device for that >>> purpose? >>> >>> Our case we'll upgrade the config from VF to virtio-bypass directly. >> So if I am understanding what you are saying you are wanting to flip >> the backup interface from the virtio to a VF. The problem is that >> becomes a bit of a vendor lock-in solution since it would rely on a >> specific VF driver. I would agree with Jiri that we don't want to go >> down that path. We don't want every VF out there firing up its own >> separate bond. Ideally you want the hypervisor to be able to manage >> all of this which is why it makes sense to have virtio manage this and >> why this is associated with the virtio_net interface. > No, that's not what I was talking about of course. I thought you > mentioned the upgrade scenario this patch would like to address is to > use the bypass interface "to take the place of the original virtio, > and get udev to rename the bypass to what the original virtio_net > was". That is one of the possible upgrade paths for sure. However the > upgrade path I was seeking is to use the bypass interface to take the > place of original VF interface while retaining the name and network > configs, which generally can be done simply with kernel upgrade. It > would become limiting as this patch makes the bypass interface share > the same virtio pci device with virito backup. Can this bypass > interface be made general to take place of any pci device other than > virtio-net? This will be more helpful as the cloud users who has > existing setup on VF interface don't have to recreate it on virtio-net > and VF separately again. Yes. This sounds interesting. Looks like you want an existing VM image with VF only configuration to get transparent live migration support by adding virtio_net with BACKUP feature.  We may need another feature bit to switch between these 2 options. > >> The other bits get into more complexity then we are ready to handle >> for now. I think I might have talked about something similar that I >> was referring to as a "virtio-bond" where you would have a PCI/PCIe >> tree topology that makes this easier to sort out, and the "virtio-bond >> would be used to handle coordination/configuration of a much more >> complex interface. > That was one way to solve this problem but I'd like to see simple ways > to sort it out. > >>>>> FWIW two solutions that immediately come to mind is to export "backup" >>>>> as phys_port_name of the backup virtio link and/or assign a name to the >>>>> master like you are doing already. I think team uses team%d and bond >>>>> uses bond%d, soft naming of master devices seems quite natural in this >>>>> case. >>>> I figured I had overlooked something like that.. Thanks for pointing >>>> this out. Okay so I think the phys_port_name approach might resolve >>>> the original issue. If I am reading things correctly what we end up >>>> with is the master showing up as "ens1" for example and the backup >>>> showing up as "ens1nbackup". Am I understanding that right? >>>> >>>> The problem with the team/bond%d approach is that it creates a new >>>> netdevice and so it would require guest configuration changes. >>>> >>>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio >>>>> link is quite neat. >>>> I agree. For non-"backup" virio_net devices would it be okay for us to >>>> just return -EOPNOTSUPP? I assume it would be and that way the legacy >>>> behavior could be maintained although the function still exists. >>>> >>>>>> - When the 'active' netdev is unplugged OR not present on a destination >>>>>> system after live migration, the user will see 2 virtio_net netdevs. >>>>> That's necessary and expected, all configuration applies to the master >>>>> so master must exist. >>>> With the naming issue resolved this is the only item left outstanding. >>>> This becomes a matter of form vs function. >>>> >>>> The main complaint about the "3 netdev" solution is a bit confusing to >>>> have the 2 netdevs present if the VF isn't there. The idea is that >>>> having the extra "master" netdev there if there isn't really a bond is >>>> a bit ugly. >>> Is it this uglier in terms of user experience rather than >>> functionality? I don't want it dynamically changed between 2-netdev >>> and 3-netdev depending on the presence of VF. That gets back to my >>> original question and suggestion earlier: why not just hide the lower >>> netdevs from udev renaming and such? Which important observability >>> benefits users may get if exposing the lower netdevs? >>> >>> Thanks, >>> -Siwei >> The only real advantage to a 2 netdev solution is that it looks like >> the netvsc solution, however it doesn't behave like it since there are >> some features like XDP that may not function correctly if they are >> left enabled in the virtio_net interface. >> >> As far as functionality the advantage of not hiding the lower devices >> is that they are free to be managed. The problem with pushing all of >> the configuration into the upper device is that you are limited to the >> intersection of the features of the lower devices. This can be >> limiting for some setups as some VFs support things like more queues, >> or better interrupt moderation options than others so trying to make >> everything work with one config would be ugly. > It depends on how you build it and the way you expect it to work. IMHO > the lower devices don't need to be directly managed at all, otherwise > it ends up with loss of configuration across migration, and it really > does not bring much value than having a general team or bond device. > Users still have to reconfigure those queue settings and interrupt > moderation options after all. The new upper device could take the > assumption that the VF/PT lower device always has superior feature set > than virtio-net in order to apply advanced configuration. The upper > device should remember all configurations previously done and apply > supporting ones to active device automatically when switching the > datapath. > It should be possible to extend this patchset to support migration of additional settings  by enabling additional ndo_ops and ethtool_ops on the upper dev and propagating them down the lower devices and replaying the settings after the VF is replugged after migration. Thanks Sridhar