Date: Wed, 20 Mar 2019 18:10:37 -0400
From: "Michael S. Tsirkin"
To: Liran Alon
Cc: Stephen Hemminger, Si-Wei Liu, Sridhar Samudrala, Alexander Duyck,
 Jakub Kicinski, Jiri Pirko, David Miller, Netdev,
 virtualization@lists.linux-foundation.org, boris.ostrovsky@oracle.com,
 vijay.balakrishna@oracle.com, jfreimann@redhat.com, ogerlitz@mellanox.com,
 vuhuong@mellanox.com
Subject: Re: [summary] virtio network device failover writeup
Message-ID: <20190320180641-mutt-send-email-mst@kernel.org>
In-Reply-To: <36772E22-7A8F-4C42-A731-398E3204B418@oracle.com>
X-Mailing-List: netdev@vger.kernel.org

On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>
>
> > On 20 Mar 2019, at 16:09, Michael S. Tsirkin wrote:
> >
> > On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> >>
> >>
> >>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin wrote:
> >>>
> >>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >>>>
> >>>>
> >>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin wrote:
> >>>>>
> >>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
> >>>>>> Liran Alon wrote:
> >>>>>>
> >>>>>>> b.3) cloud-init: If configured to perform network-configuration, it attempts to configure all available netdevs. It should avoid however doing so on net-failover slaves.
> >>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist Mellanox VF driver. However, this technique doesn’t work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> >>>>>>
> >>>>>> Cloud-init should really just ignore all devices that have a master device.
> >>>>>> That would have been more general, and safer for other use cases.
> >>>>>
> >>>>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>>>> safer to just somehow pretend to userspace that the slave links are
> >>>>> down? And add a special attribute for the actual link state.
> >>>>
> >>>> I think this may be problematic as it would also break legit use case
> >>>> of userspace attempt to set various config on VF slave.
> >>>> In general, lying to userspace usually leads to problems.
> >>>
> >>> I hear you on this. So how about instead of lying,
> >>> we basically just fail some accesses to slaves
> >>> unless a flag is set e.g. in ethtool.
> >>>
> >>> Some userspace will need to change to set it but in a minor way.
> >>> Arguably/hopefully failure to set config would generally be a safer
> >>> failure.
> >>
> >> Once userspace will set this new flag by ethtool, all operations done by other userspace components will still work.
> >
> > Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
>
> Oh. I have indeed misunderstood your previous email then. :)
> Thanks for clarifying.
>
> >
> >> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
> >
> > I think sending/receiving should probably just fail unconditionally.
>
> You mean that you wish that somehow kernel will prevent Tx on net-failover slave netdev
> unless skb is marked with some flag to indicate it has been sent via the net-failover master?

We can maybe avoid binding a protocol socket to the device?

> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (by dracut/initramfs, dhclient, etc.).
>
> However, I see a couple of down-sides to it:
> 1) It doesn’t resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev rules that match only by MAC.

How about we fail to retrieve the MAC from the slave?

> 2) It brings non-intuitive customer experience. For example, a customer may attempt to analyse a connectivity issue by checking the connectivity
> on a net-failover slave (e.g. the VF) but will see no connectivity when in fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>
> The set of changes I envision to fix our issues are:
> 1) Hide net-failover slaves in a different netns created and managed by the kernel. The user can still enter it and manage the netdevs there if they wish to do so explicitly.
> (E.g. Configure the net-failover VF slave in some special way).
> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC (similar to what NetVSC does). E.g. Provide a virtio-net interface to get PCI slot where the matching VF will be hot-plugged by hypervisor.
> 3) Have an explicit virtio-net control message to command hypervisor to switch data-path from virtio-net to VF and vice-versa, instead of relying on intercepting the PCI master enable-bit
> as an indicator on when VF is about to be set up (similar to what NetVSC does).
>
> Is there any clear issue we see regarding the above suggestion?
>
> -Liran

The issue would be this: how do we avoid conflicting with namespaces
created by users?

>
> >
> >> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
> >> But what we actually want is to never allow a net-failover slave to be operated by userspace unless it is explicitly stated
> >> by userspace that it wishes to perform a set of actions on the net-failover slave.
> >>
> >> Something that would be achieved if, for example, the net-failover slaves were in a different netns than default netns.
> >> This also aligns with expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
> >> But of course maybe there are other ideas that can achieve similar behaviour.
> >>
> >> -Liran
> >>
> >>>
> >>> Which things to fail? Probably sending/receiving packets? Getting MAC?
> >>> More?
> >>>
> >>>> If we reach
> >>>> to a scenario where we try to avoid userspace issues generically and
> >>>> not on a userspace component basis, I believe the right path should be
> >>>> to hide the net-failover slaves such that explicit action is required
> >>>> to actually manipulate them (As described in blog-post). E.g.
> >>>> Automatically move net-failover slaves by kernel to a different netns.
> >>>>
> >>>> -Liran
> >>>>
> >>>>>
> >>>>> --
> >>>>> MST
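
For reference, the userspace-side check Stephen suggests in the quoted text
("ignore all devices that have a master device") can be sketched roughly as
below. This is only an illustration, not code from cloud-init or from anyone
on this thread. It assumes that an enslaved netdev exposes a "master" link
under /sys/class/net/<dev>/ (as bond and bridge ports do) and that
net-failover slaves are linked to their master the same way. The function
names has_master() and configurable_netdevs() are made up for the example.

```python
import os


def has_master(ifname, sysfs_root="/sys/class/net"):
    """Return True if sysfs shows a 'master' link for this netdev.

    Assumption: enslaved devices carry a 'master' symlink in their
    sysfs directory. lexists() is used so a dangling link still counts.
    """
    return os.path.lexists(os.path.join(sysfs_root, ifname, "master"))


def configurable_netdevs(sysfs_root="/sys/class/net"):
    """List netdevs a config tool may touch: those without a master."""
    names = []
    for name in os.listdir(sysfs_root):
        path = os.path.join(sysfs_root, name)
        # Skip plain files that can live next to the netdev entries
        # (for example 'bonding_masters' when bonding is loaded).
        if not os.path.isdir(path):
            continue
        if not has_master(name, sysfs_root):
            names.append(name)
    return sorted(names)
```

Calling configurable_netdevs() with the default root on a guest would, under
the assumptions above, return the net-failover master but neither the
virtio-net standby nor the VF. Since all three share one MAC, a filter like
this might also sidestep the MAC-only matching problem mentioned for netplan,
but it does nothing for tools that never perform such a check, which is the
concern raised in the replies.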