Message-ID: <1845d832-e36c-fff1-0ea6-1d7aa290919f@linux.ibm.com>
Date: Thu, 31 Mar 2022 14:07:20 +0200
Subject: Re: [PATCH net-next v2] veth: Support bonding events
To: Nikolay Aleksandrov, Jay Vosburgh, Jakub Kicinski
Cc: "David S. Miller", Paolo Abeni, Hangbin Liu, netdev@vger.kernel.org, linux-s390@vger.kernel.org, Heiko Carstens, Roopa Prabhu, bridge@lists.linux-foundation.org, Ido Schimmel, Jiri Pirko
References: <20220329114052.237572-1-wintera@linux.ibm.com> <20220329175421.4a6325d9@kernel.org> <20220330085154.34440715@kernel.org> <20220330101256.53f6ef48@kernel.org> <2600.1648667758@famine>
From: Alexandra Winter
X-Mailing-List: linux-s390@vger.kernel.org

On 31.03.22 12:33, Nikolay Aleksandrov wrote:
> On 31/03/2022 12:59, Alexandra Winter wrote:
>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>> Bonding drivers generate specific events during failover that trigger
>>> switch updates. When a veth device is attached to a bridge with a
>>> bond interface, we want external switches to learn about the veth
>>> devices as well.
>>>
>>> Example:
>>>
>>>    | veth_a2 | veth_b2 | veth_c2 |
>>>    ------o-----------o----------o------
>>>           \          |         /
>>>            o         o        o
>>>         veth_a1  veth_b1  veth_c1
>>>         -------------------------
>>>         |        bridge         |
>>>         -------------------------
>>>                   bond0
>>>                   /  \
>>>                eth0  eth1
>>>
>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>> propagated, so e.g.
veth_a2 can re-announce its MAC address to the
>>> external hardware attached to eth1.
>>
>> On 30.03.22 21:15, Jay Vosburgh wrote:
>>> Jakub Kicinski wrote:
>>>
>>>> On Wed, 30 Mar 2022 19:16:42 +0300 Nikolay Aleksandrov wrote:
>>>>>> Maybe opt-out? But assuming the event is only generated on
>>>>>> active/backup switch over - when would it be okay to ignore
>>>>>> the notification?
>>>>>
>>>>> Let me just clarify, so I'm sure I've not misunderstood you. Do you mean opt-out as in
>>>>> make it default on? IMO that would be a problem, large scale setups would suddenly
>>>>> start propagating it to upper devices which would cause a lot of unnecessary bcast.
>>>>> I meant enable it only if needed, and only on specific ports (second part is not
>>>>> necessary, could be global, I think it's ok either way). I don't think any setup
>>>>> which has many upper vlans/macvlans would ever enable this.
>>>>
>>>> That may be. I don't have a good understanding of scenarios in which
>>>> GARP is required and where it's not :) Goes without saying but the
>>>> default should follow the more common scenario.
>>>
>>> At least from the bonding failover perspective, the GARP is
>>> needed when there's a visible topology change (so peers learn the new
>>> path), a change in MAC address, or both. I don't think it's possible to
>>> determine from bonding which topology changes are visible, so any
>>> failover gets a GARP. The original intent as best I recall was to cover
>>> IP addresses configured on the bond itself or on VLANs above the bond.
>>>
>>> If I understand the original problem description correctly, the
>>> bonding failover causes the connectivity issue because the network
>>> segments beyond the bond interfaces don't share forwarding information
>>> (i.e., they are completely independent).
The peer (end station or
>>> switch) at the far end of those network segments (where they converge)
>>> is unable to directly see that the "to bond eth0" port went down, and
>>> has no way to know that anything is awry, and thus won't find the new
>>> path until an ARP or forwarding entry for "veth_a2" (from the original
>>> diagram) times out at the peer out in the network.
>>>
>>>>>>> My concern was about Hangbin's alternative proposal to notify all
>>>>>>> bridge ports. I hope in my proposal I was able to avoid infinite loops.
>>>>>>
>>>>>> Possibly I'm confused as to where the notification for bridge master
>>>>>> gets sent..
>>>>>
>>>>> IIUC it bypasses the bridge and sends a notify peers for the veth peer so it would
>>>>> generate a grat arp (inetdev_event -> NETDEV_NOTIFY_PEERS).
>>>>
>>>> Ack, I was basically repeating the question of where does
>>>> the notification with dev == br get generated.
>>>>
>>>> There is a protection in this patch to make sure the other
>>>> end of the veth is not plugged into a bridge (i.e. is not
>>>> a bridge port) but there can be a macvlan on top of that
>>>> veth that is part of a bridge, so IIUC that check is either
>>>> insufficient or unnecessary.
>>>
>>> I'm a bit concerned this is becoming an interface plumbing
>>> topology change whack-a-mole.
>>>
>>> In the above, what if the veth is plugged into a bridge, and
>>> there's an end station on that bridge? If it's bridges all the way down,
>>> where does the need for some kind of TCN mechanism stop?
>>>
>>> Or instead of a veth it's a physical network hop (perhaps a
>>> tunnel; something through which notifiers do not propagate) to another
>>> host with another bridge, then what?
>>>
>>> -J
>>>
>>> ---
>>> -Jay Vosburgh, jay.vosburgh@canonical.com
>>
>> I see 3 technologies that are used for network virtualization in combination with bond for redundancy
>> (and I may miss some):
>> (1) MACVTAP/MACVLAN over bond:
>> MACVLAN propagates notifiers from bond to endpoints (same as VLAN)
>> (drivers/net/macvlan.c:
>> 	case NETDEV_NOTIFY_PEERS:
>> 	case NETDEV_BONDING_FAILOVER:
>> 	case NETDEV_RESEND_IGMP:
>> 		/* Propagate to all vlans */
>> 		list_for_each_entry(vlan, &port->vlans, list)
>> 			call_netdevice_notifiers(event, vlan->dev);
>> 	})
>> (2) OpenVSwitch:
>> OVS seems to have its own bond implementation, but sends out reverse ARP on active-backup failover
>> (3) User defined bridge over bond:
>> propagates notifiers to the bridge device itself, but not to the devices attached to bridge ports.
>> (net/bridge/br.c:
>> 	case NETDEV_RESEND_IGMP:
>> 		/* Propagate to master device */
>> 		call_netdevice_notifiers(event, br->dev);)
>>
>> Active-backup may not be the best bonding mode, but it is a simple way to achieve redundancy and I've seen it being used.
>> I don't see a use case for MACVLAN over bridge over bond (?)
>
> If you're talking about this particular case (network virtualization) - sure. But macvlans over bridges
> are heavily used in Cumulus Linux and large scale setups. For example VRRP is implemented using macvlan
> devices. Any notification that propagates to the bridge and reaches these would cause a storm of broadcasts
> being sent down which would not scale and is extremely undesirable in general.
>
>> The external HW network does not need to be updated about the instances that are connected via tunnel,
>> so I don't see an issue there.
>>
>> I had this idea how to solve the failover issue for veth pairs attached to the user defined bridge.
>> Does this need to be configurable? How? Per veth pair?
>
> That is not what I meant (if you were referring to my comment), I meant if it gets implemented in the
> bridge and it starts propagating the notify peers notifier - that _must_ be configurable.
>
>>
>> Of course a more general solution how bridge over bond could handle notifications, would be great,
>> but I'm running out of ideas. So I thought I'd address veth first.
>> Your help and ideas are highly appreciated, thank you.
>
> I'm curious why it must be done in the kernel altogether? This can obviously be solved in user-space
> by sending grat arps towards the flapped port for fdbs on other ports (e.g. veths) based on a netlink notification.
> In fact based on your description propagating NETDEV_NOTIFY_PEERS to bridge ports wouldn't help
> because in that case the remote peer veth will not generate a grat arp. The notification will
> get propagated only to local veth (bridge port), or the bridge itself depending on implementation.
>
> So from bridge perspective, if you decide to pursue a kernel solution, I think you'll need
> a new bridge port option which acts on NOTIFY_PEERS and generates a grat arp for all fdbs
> on the port where it is enabled to the port which generated the NOTIFY_PEERS. Note that is
> also fragile as I'm sure some stacked device config would not work, so I want to re-iterate
> how much easier it is to solve it in user-space which has better visibility and you can
> change much faster to accommodate new use cases.
>
> To illustrate: bond
>                   \
>                    bridge
>                    /
>                veth0
>                  |
>                veth1
>
> When bond generates NOTIFY_PEERS, and you have this new option enabled on veth0 then
> the bridge should generate grat arps for all fdbs on veth0 towards bond so the new
> path would learn them. Note that is very dangerous as veth1 can generate thousands
> of fdbs and you can potentially DDoS the whole network, so again I'd advise to do
> this in user-space where you can better control it.
>
> W.r.t. this patch, I think it will also work and will cause a single grat arp which
> is ok. Just need to make sure loops are not possible, for example I think you can loop
> your implementation by the following config (untested theory):
>  bond
>     \
>      bridge
>       \      \
>   veth2.10    veth0 - veth1
>        \                  \
>         \                  veth1.10 (vlan)
>          \                    \
>           \              bridge2
>            \              /
>             veth2 - veth3
>
>
> 1. bond generates NOTIFY_PEERS
> 2. bridge propagates to veth1 (through veth0 port)
> 3. veth1 propagates to its vlan (veth1.10)
> 4. bridge2 sees veth1.10 NOTIFY_PEERS and propagates to veth2 (through veth3 port)
> 5. veth2 propagates to its vlan (veth2.10)
> 6. veth2.10 propagates it back to bridge
>
>
> I'm sure similar setup, and maybe even simpler, can be constructed with other devices
> which can propagate or generate NOTIFY_PEERS.
>
> Cheers,
> Nik
>

Thank you very much Nik for your advice and your thorough explanations.
I think I could prevent the loop you describe above. But I agree that propagating
notifications through veth is more risky, because there is no upper-lower relationship.

Is there interest in a v3 of my patch, where I would also incorporate Jakub's comments?

Otherwise I would next explore a user-space solution like Nik proposed.

Thank you all very much
Alexandra
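
For reference, the loop in Nik's untested example can be modeled in user space as a directed graph of "which device re-raises NOTIFY_PEERS on which". The sketch below is a toy model and not kernel code: the edge table is an assumption read off steps 1-6 above, and the name `propagate` and the visited-set guard are hypothetical illustrations of one way such a loop could be cut, not what the v2 patch does.

```python
from collections import deque

# Hypothetical edge table: A -> B means "an event seen on A is re-raised on B".
# Device names come from the diagram; edges follow steps 1-6 (an assumption).
PROPAGATION = {
    "bond": ["bridge"],        # step 1: bond generates NOTIFY_PEERS
    "bridge": ["veth1"],       # step 2: via the veth0 port to its peer
    "veth1": ["veth1.10"],     # step 3: VLAN on top of veth1
    "veth1.10": ["veth2"],     # step 4: bridge2 forwards via its veth3 port
    "veth2": ["veth2.10"],     # step 5: VLAN on top of veth2
    "veth2.10": ["bridge"],    # step 6: back to the first bridge
}


def propagate(start, edges):
    """Deliver the event at most once per device.

    Returns (delivery order, list of suppressed re-deliveries).
    """
    seen = {start}
    order = [start]
    suppressed = []
    queue = deque([start])
    while queue:
        dev = queue.popleft()
        for nxt in edges.get(dev, []):
            if nxt in seen:
                # Re-raising here would close the loop, so drop it.
                suppressed.append((dev, nxt))
                continue
            seen.add(nxt)
            order.append(nxt)
            queue.append(nxt)
    return order, suppressed


if __name__ == "__main__":
    order, suppressed = propagate("bond", PROPAGATION)
    print("delivered:", " -> ".join(order))
    print("suppressed:", suppressed)
```

Without the `seen` set, the walk would re-enter `bridge` after `veth2.10` and never terminate, which is the loop closed by step 6.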