From: John Fastabend
Subject: Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
Date: Tue, 16 Dec 2014 08:35:51 -0800
Message-ID: <54905F67.2090509@gmail.com>
In-Reply-To: <54902E5E.2070405@mojatatu.com>
To: Jamal Hadi Salim
Cc: Hubert Sokolowski, Roopa Prabhu, "netdev@vger.kernel.org", Vlad Yasevich

On 12/16/2014 05:06 AM, Jamal Hadi Salim wrote:
> On 12/15/14 19:45, John Fastabend wrote:
>> On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
>>
>> hmm, good question. When I implemented this on the host nics with SR-IOV,
>> VMDQ, etc. the multi/unicast addresses were propagated into the FDB by
>> the driver.
>
> So if i understand correctly, this is a NIC with an FDB. And there is no
> concept of a bridge to which it is attached.
> To the point of classical uni/multicast addresses on a netdev
> abstraction; these are typically stored in *much simpler tables*
> (used to be IO registers back in the day)

From a model perspective it looks like an edge relay: only a single
downlink with multiple uplinks. No learning, no loops, and so no STP,
et al. required. It may or may not support MAC+VLAN forwarding or just
MAC forwarding. It may be configured via register writes, more
complicated firmware requests, or some other mechanism. This is device
dependent; even across devices from the same vendor the mechanisms
change. But the driver abstracts this.

> Do these NICs not have such a concept?
> An fdb entry has an egress port column; I have seen cases where the
> port is labeled as "Cpu port" which would mean it belongs to the host;

But in the SR-IOV case you have multiple "Cpu ports" and you want to
send packets to each of them depending on the configuration.

  port0  port1  port2  port3
    |      |      |      |       uplinks
+------------------------------+
|                              |
|       SRIOV edge relay       |
|                              |
+------------------------------+
               |
           downlink

In a host nic with SR-IOV each port will be a PCIe function, so really
they are all CPU ports. For multi-function devices they might all be
physical functions. In the hardware there needs to be a table to
forward incoming traffic to the correct port#. For L2 we use MAC+VLAN
and an egress port column to select the port. The model shouldn't care
if the port is backed by a VF or PF or a set of queues; it just needs
to forward packets to the correct uplink.

One issue we have today when writing software for these edge relays is
that we don't have a netdev representing the downlink, or a netdev
representing management functions of the device. So if I want to, say,
change the mode of the edge relay from VEB to VEPA, I usually just send
the message to the PF.
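The MAC+VLAN table with an egress port column can be modeled as a plain
lookup. This is a toy userspace sketch with made-up names (struct
fdb_entry, fdb_lookup), not any driver's actual API; real hardware
would use a TCAM or hash table rather than a linear scan:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative FDB entry: MAC + VLAN key, egress port value. */
struct fdb_entry {
    uint8_t  mac[6];
    uint16_t vlan;
    int      port;   /* uplink index (VF/PF) the frame is steered to */
};

#define FDB_SIZE 16

/* Linear scan for clarity; returns the egress port or -1 on a miss. */
static int fdb_lookup(const struct fdb_entry *fdb, int n,
                      const uint8_t *mac, uint16_t vlan)
{
    for (int i = 0; i < n; i++)
        if (fdb[i].vlan == vlan && !memcmp(fdb[i].mac, mac, 6))
            return fdb[i].port;
    return -1;
}
```

What happens on a miss is device dependent: flood to all uplinks, steer
to a default port, or drop.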
Or if I want to send packets out on the wire but not through the edge
relay, usually we do this by sending control packets over an elected PF
and it will attach a tag or something so the edge relay doesn't forward
or flood them to other uplinks. Adding a netdev for the downlink would
probably clean some of this up; right now we rely on some behaviour
that is not well-defined.

> but in this case it just seems there is no such concept and as Or
> brought up in another email - what does "VLANid" mean in such a case?

I think most host nics with SR-IOV can forward using VLAN + MAC and do
filtering on VLANid. Many can also put a default VLAN on the packet.

> If we go with a CPU port concept,
> We could then use the concept of a vlan filter on a port basis
> but then what happens when you don't have an fdb (majority of cases)?

Not sure what the question is here; I'm hoping the above helped explain
my thinking on this. Don't have an FDB? That means you don't have any
way to forward between ports, so you must have a 1:1 mapping between
the physical port and the netdev. I think it's fair to think of this as
a TPMR (Two-Port MAC Relay), although not a very useful abstraction.

>> My logic was if some netdev ethx has a set of MAC addresses
>> above it, well then any virtual function or virtual device also behind
>> the hardware shouldn't be sending those addresses out the egress
>> switch-facing port. Otherwise the switch will see packets it knows are
>> behind that port and drop them, or flood them if it hasn't learned the
>> address yet. Either way they will never get to the right netdev.
>>
>> Admittedly I wasn't thinking about switches with many ports at the time.
>
> I often struggle with trying to "box" SRIOV into some concept of a
> switch abstraction and sometimes i am puzzled.
> Would exposing the SRIOV underlay as a switch not have solved this
> problem? Then the virtual ports essentially are bridge ports.

Yes, this would help and this is how I view it.
Although the edge relay vs. "real standards-based" bridge distinction
is important, because we don't do learning, only have a single
downlink, don't run loop-detecting protocols, etc. All that stuff is
not needed on a host where you "know" your MAC addresses (at least for
many use cases) and cannot build loops.

> Maybe what we need is a concept of an "edge relay" extended netdev?

This is effectively what the fdb table does, right? Sure, it's not as
explicit as it could be, but this is how I treat the NIC when I learn
it has multiple uplinks and a single downlink. At the moment we use a
trick similar to Jiri's on rocker: when we get a switch op like
getlink or setlink we "know" what switch object it refers to because
the netdev always maps to a single switch.

> These things would have an fdb as well down and uplink relay ports that
> can be attached to them.

Right, in the current code paths there is no "attach" operation; we
assume the edge relay and ports are attached when the ports are created
via SR-IOV or hw-offload or whatever. What are we missing? We have the
FDB and a unique id to show ports on the same edge relay; user space
can build this abstraction from those two things. A downlink netdev
port would probably clean up the abstraction a bit, especially for
sending control frames.

>>> Some of these drivers may be just doing the LinuxWay (aka cut-n-paste
>>> what the other driver did).
>>
>> My original thinking here was... if it didn't implement fdb_add, fdb_del
>> and fdb_dump, then if you wanted to think of it as having a forwarding
>> database that was fine, but it was really just a two-port mac relay, in
>> which case just dump all the mac addresses it knows about. If it was
>> something more fancy it could do its own dump, like vxlan or macvlan.
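The fallback behaviour described in the quoted text, use the driver's
own dump if it implements one and otherwise dump the address lists, is
essentially the dispatch in the patch subject. Here is a simplified
userspace model; the fake_netdev types and example ops are illustrative
and only the names ndo_fdb_dump and ndo_dflt_fdb_dump are borrowed from
the kernel:

```c
#include <assert.h>
#include <stddef.h>

struct fake_netdev;

struct fake_netdev_ops {
    int (*ndo_fdb_dump)(struct fake_netdev *dev);
};

struct fake_netdev {
    const struct fake_netdev_ops *ops;
};

/* Default dump: pretend to walk the device's uni/multicast lists. */
static int ndo_dflt_fdb_dump(struct fake_netdev *dev)
{
    (void)dev;
    return 1; /* stand-in for "dumped addr lists" */
}

/* The dispatch under discussion: prefer the driver's own dump,
 * fall back to the default only when none is provided. */
static int fdb_dump(struct fake_netdev *dev)
{
    if (dev->ops && dev->ops->ndo_fdb_dump)
        return dev->ops->ndo_fdb_dump(dev); /* driver-specific FDB */
    return ndo_dflt_fdb_dump(dev);          /* default: addr lists */
}

/* Example "driver" with its own dump, the way vxlan or bridge would. */
static int custom_fdb_dump(struct fake_netdev *dev)
{
    (void)dev;
    return 2; /* stand-in for "dumped driver FDB" */
}
static const struct fake_netdev_ops custom_ops = { .ndo_fdb_dump = custom_fdb_dump };
static const struct fake_netdev_ops plain_ops  = { .ndo_fdb_dump = NULL };
```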
> The challenge here is lack of separation between a NIC's uni/multicast
> ports which it owns - which is a traditional operation regardless of
> what capabilities the NIC has - vs an fdb which may have many
> other capabilities. Probably all NICs capable of many MACs implement
> fdbs?

Yes, they must, to support forwarding. Agreed, it's a bit clunky the
way we overload uni/multicast address lists. But what does it mean to
add a unicast address to a port and not have it in the FDB? If the port
wants to receive traffic on a MAC because it's added to the unicast
list, doesn't that mean inserting it into the FDB so the packets
actually get sent to the netdev? Otherwise it's a two-step process:
first add it to the unicast list and then add it to the FDB. I'm not
sure why this is valuable.

>> For a host nic ucast/multicast and fdb are the same, I think? The
>> code we had was just short-hand to allow the common case, a host nic,
>> to work. Notice vxlan and bridge drivers didn't dump their addr lists
>> from fdb_dump until your patch.
>>
>> Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy and
>> I shouldn't overload the addr lists.
>
> Not just those - I am wondering about the general utility of what
> Hubert was trying to do if all the driver does is call the default
> dumper based on some flag's presence, and the default dumper
> does a dump of uni/multicast host entries. Those are not really fdb
> entries in the traditional sense.

But as a practical matter any uni/multicast entry is in the FDB, so
when the host nic has multiple ports we receive those mac addresses on
the port. The drivers do this today and it seems reasonable to me.

> Is there no way to get the unicast/multicast mac addresses for such
> a driver?

You can almost infer it from ip link by looking at all the stacked
drivers and figuring out how the addresses are propagated down, then
look at the routes and figure out the multicast addresses.
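The one-step shorthand argued for above, where adding a unicast address
to a port implies installing the FDB entry that steers that MAC to the
port, could be modeled like this. All the names here (toy_port,
toy_fdb, add_uc_addr) are illustrative, not real driver code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ADDRS 8

/* Toy model: a port's unicast filter list. */
struct toy_port {
    uint8_t uc_list[MAX_ADDRS][6];
    int     n_uc;
};

/* Toy model: the device-wide FDB mapping MAC -> egress port. */
struct toy_fdb {
    uint8_t mac[MAX_ADDRS][6];
    int     port[MAX_ADDRS];
    int     n;
};

/* Adding a unicast address also installs the FDB entry, so frames
 * for that MAC are actually steered to this port - one call does
 * both steps instead of two. */
static int add_uc_addr(struct toy_port *p, int port_id,
                       struct toy_fdb *fdb, const uint8_t *mac)
{
    if (p->n_uc >= MAX_ADDRS || fdb->n >= MAX_ADDRS)
        return -1;
    memcpy(p->uc_list[p->n_uc++], mac, 6);
    memcpy(fdb->mac[fdb->n], mac, 6);
    fdb->port[fdb->n++] = port_id;
    return 0;
}
```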
But other than the fdb dump mechanism I don't think there is anything.

> I think that would help bring clarity to my confusion.
> clear as mud now?
>
>> I'm interested to see what Vlad says as well. But the current situation
>> is that previously some drivers dumped their addr lists and others
>> didn't. Specifically, the more switch-like devices (bridge, vxlan)
>> didn't. Now every device will dump the addr lists. I'm not entirely
>> convinced that is correct.
>
> I am glad this happened ;-> Otherwise we wouldn't be having this
> discussion. When Vlad was asking me I was in a rush to get the patch
> out and didn't question it because I thought this was something some
> crazy virtualization people needed.
> If Vlad's use case goes away, then Hubert's little restoration is fine.

Yep, maybe we can talk about it at the netdev users conference.

>> It works OK for host nics (NICs that can't forward between ports) and
>> seems at best confusing for real switch asics.
>
> So if these NICs have fdb entries and I programmed them (meaning setting
> which port a given MAC should be sent to), would it not work?

You mean via 'bridge fdb add'? Yes, this will work. But then as a
shorthand we also program the ucast/multicast addresses. (Have I beaten
this to death yet?)

>> On a related question, do you expect the switch asic to trap any
>> packets with MAC addresses in the multi/unicast address lists and send
>> them to the correct netdev? Or will the switch forward them using
>> normal FDB tables?
>
> I think there would be a separate table for that. Roopa, can you check
> with the ASICs you guys work on? The point I was trying to make above
> is that today there is a uni/multicast list or table of sorts that all
> NICs expose.
> There's always the hack of a "cpu port".
> I have also seen the "cpu port"
> being conceptualized in L3 tables to imply "next hop is cpu" where you
> have an IP address owned by the host; so maybe we need a concept of a
> cpu port, or again the revival of TheThing class device.

OK, the confusing part of "cpu port" to me is that trying to map this
abstraction onto a host nic implies the host nic may have many "cpu
ports".

Thanks,
.John

> cheers,
> jamal

--
John Fastabend
Intel Corporation