* SRIOV switchdev mode BoF minutes
From: Or Gerlitz @ 2017-11-12 19:49 UTC
To: David Miller
Cc: Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman,
    Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko,
    Rony Efraim, Linux Netdev List

Hi Dave and all,

During and after the BoF on SRIOV switchdev mode, we came to a
consensus among the developers from four different HW vendors (CC
audience) that a correct thing to do would be to disallow any new
extensions to the legacy mode.

The idea is to put focus on the new mode and not add new UAPIs and
kernel code for a design that turned out to be wrong and does not
allow for properly offloading a kernel switching SW model to e-switch
HW.

We also had a good session the day after regarding alignment for the
representation model of the uplink (physical port) and PF/s.

The VF representor netdevs exist for all drivers that support the new
mode but the representation for the uplink and PF wasn't the same for
all. The decision was to represent the uplink and PF vports in the
same manner done for VFs, using rep netdevs. This alignment would
provide a more strict and clear view of the kernel model for e-switch
to users and upper layer control plane SW.

Or.

* Re: SRIOV switchdev mode BoF minutes
From: Alexander Duyck @ 2017-11-12 20:38 UTC
To: Or Gerlitz
Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
    Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
    Jiri Pirko, Rony Efraim, Linux Netdev List

On Sun, Nov 12, 2017 at 11:49 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> Hi Dave and all,
>
> During and after the BoF on SRIOV switchdev mode, we came to a
> consensus among the developers from four different HW vendors (CC
> audience) that a correct thing to do would be to disallow any new
> extensions to the legacy mode.
>
> The idea is to put focus on the new mode and not add new UAPIs and
> kernel code for a design that turned out to be wrong and does not
> allow for properly offloading a kernel switching SW model to e-switch
> HW.

I would have to disagree with this. For devices such as 82599 that
don't have a true switch this may limit future functionality since
we can't move it over to switchdev mode. For example one thing I may
need to add is the ability to disable multicast and broadcast receive
on a per-VF basis at some point in the future.

You may not recall but we tried to transition the i40e driver over to
SwitchDev. The parts supported by i40e have a much more robust L2
forwarding framework than the 82599, and the result was we were told
that while we might look at doing port representors some other way,
there was no way we could use switchdev since the hardware couldn't
support the requirements of switchdev in terms of default routes and
forwarding behavior. I am planning to resolve the port representor
issue by looking at coming up with something like a "source mode"
macvlan based port representor. I figure that is probably the closest
match for what the Intel hardware does since really the VFs are
nothing more than a physical macvlan interface in and of themselves as
the hardware doesn't have a full switch.

> We also had a good session the day after regarding alignment for the
> representation model of the uplink (physical port) and PF/s.
>
> The VF representor netdevs exist for all drivers that support the new
> mode but the representation for the uplink and PF wasn't the same for
> all. The decision was to represent the uplink and PF vports in the
> same manner done for VFs, using rep netdevs. This alignment would
> provide a more strict and clear view of the kernel model for e-switch
> to users and upper layer control plane SW.
>
> Or.

This part sounds fine.

- Alex

* Re: SRIOV switchdev mode BoF minutes
From: Or Gerlitz @ 2017-11-13 6:16 UTC
To: Alexander Duyck
Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
    Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
    Jiri Pirko, Rony Efraim, Linux Netdev List

On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Sun, Nov 12, 2017 at 11:49 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> Hi Dave and all,
>>
>> During and after the BoF on SRIOV switchdev mode, we came to a
>> consensus among the developers from four different HW vendors (CC
>> audience) that a correct thing to do would be to disallow any new
>> extensions to the legacy mode.
>>
>> The idea is to put focus on the new mode and not add new UAPIs and
>> kernel code for a design that turned out to be wrong and does not
>> allow for properly offloading a kernel switching SW model to
>> e-switch HW.

> You may not recall but we tried to transition the i40e driver over to
> SwitchDev. The parts supported by i40e have a much more robust L2
> forwarding framework than the 82599, and the result was we were told
> that while we might look at doing port representors some other way,
> there was no way we could use switchdev since the hardware couldn't
> support the requirements of switchdev in terms of default routes and
> forwarding behavior. I am planning to resolve the port representor
> issue by looking at coming up with something like a "source mode"
> macvlan based port representor. I figure that is probably the closest
> match for what the Intel hardware does since really the VFs are
> nothing more than a physical macvlan interface in and of themselves as
> the hardware doesn't have a full switch.

Hi Alex,

What we call the slow path requirements are the following:

1. xmit on VF rep always turns to a receive on the VF, regardless of
   the offloaded SW steering rules ("send-to-vport")

2. xmit on VF which doesn't meet any offloaded SW steering rules must
   be received into the host OS from the VF rep

1,2 above must hold also for the uplink and the PF reps

When the i40e limitation was described @ netdev, it seems you have a
problem with VF xmit that should turn into a recv on the VF rep but
also goes to the wire.

It smells as if a FW patch can solve that, can't it?

> I would have to disagree with this. For devices such as 82599 that
> don't have a true switch this may limit future functionality since
> we can't move it over to switchdev mode. For example one thing I may
> need to add is the ability to disable multicast and broadcast receive
> on a per-VF basis at some point in the future.

We are in the same boat with ConnectX3/mlx4, so lucky us that misery
loves company (my google search also yielded "many narrow-half
consolation", is that completely unrelated?) - the legacy mode for
ixgbe/mlx4 has been there for ~8-10 years - and since then both
companies have had 2-3 newer HW generations. I don't see why you can't
come to your customers and tell them that newish functionality needs
newer HW - it will also help sell more of the new stuff. If you keep
extending the legacy mode, more ppl/drivers will do that as well and it
will not let us go in the right direction.

Or.
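
The two slow path requirements listed in the message above can be
pictured with a small toy model. This is only a sketch under stated
assumptions: the class ToyESwitch, its method names, and the idea of
keying offloaded rules by (source vport, destination MAC) are
hypothetical and are not taken from any driver or from the thread.

```python
from dataclasses import dataclass, field


@dataclass
class ToyESwitch:
    """Toy model of the two slow path rules; not real driver code."""

    # Hypothetical offloaded steering rules:
    # (source vport, destination MAC) -> destination vport.
    offloaded: dict = field(default_factory=dict)

    def xmit_on_rep(self, vport: str) -> str:
        # Requirement 1 ("send-to-vport"): a packet sent on a representor
        # is always received on the vport it represents, regardless of
        # which steering rules are offloaded.
        return vport

    def xmit_on_vport(self, vport: str, dst_mac: str) -> str:
        # Fast path: a matching offloaded rule forwards the packet in HW.
        rule = self.offloaded.get((vport, dst_mac))
        if rule is not None:
            return rule
        # Requirement 2: on a miss, the packet is received by the host OS
        # on the representor of the sending vport.
        return f"rep:{vport}"


esw = ToyESwitch(offloaded={("vf0", "aa:bb"): "uplink"})
print(esw.xmit_on_rep("vf1"))             # delivered to the VF itself
print(esw.xmit_on_vport("vf0", "aa:bb"))  # hits the offloaded rule
print(esw.xmit_on_vport("vf0", "cc:dd"))  # miss, lands on the VF rep
```

In this model the same two methods would apply unchanged to the uplink
and PF vports, which is what "1,2 above must hold also for the uplink
and the PF reps" asks for.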

* Re: SRIOV switchdev mode BoF minutes
From: Alexander Duyck @ 2017-11-13 17:10 UTC
To: Or Gerlitz
Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
    Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
    Jiri Pirko, Rony Efraim, Linux Netdev List

On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Sun, Nov 12, 2017 at 11:49 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>> Hi Dave and all,
>>>
>>> During and after the BoF on SRIOV switchdev mode, we came to a
>>> consensus among the developers from four different HW vendors (CC
>>> audience) that a correct thing to do would be to disallow any new
>>> extensions to the legacy mode.
>>>
>>> The idea is to put focus on the new mode and not add new UAPIs and
>>> kernel code for a design that turned out to be wrong and does not
>>> allow for properly offloading a kernel switching SW model to
>>> e-switch HW.
>
>> You may not recall but we tried to transition the i40e driver over to
>> SwitchDev. The parts supported by i40e have a much more robust L2
>> forwarding framework than the 82599, and the result was we were told
>> that while we might look at doing port representors some other way,
>> there was no way we could use switchdev since the hardware couldn't
>> support the requirements of switchdev in terms of default routes and
>> forwarding behavior. I am planning to resolve the port representor
>> issue by looking at coming up with something like a "source mode"
>> macvlan based port representor. I figure that is probably the closest
>> match for what the Intel hardware does since really the VFs are
>> nothing more than a physical macvlan interface in and of themselves as
>> the hardware doesn't have a full switch.
>
> Hi Alex,
>
> What we call the slow path requirements are the following:
>
> 1. xmit on VF rep always turns to a receive on the VF, regardless of
>    the offloaded SW steering rules ("send-to-vport")
>
> 2. xmit on VF which doesn't meet any offloaded SW steering rules must
>    be received into the host OS from the VF rep
>
> 1,2 above must hold also for the uplink and the PF reps

I am well aware of the requirements. We discussed these with Jiri at
the previous netdev.

> When the i40e limitation was described @ netdev, it seems you have a
> problem with VF xmit that should turn into a recv on the VF rep but
> also goes to the wire.
>
> It smells as if a FW patch can solve that, can't it?

That is a huge maybe. We looked into it last time and while we can
meet requirements 1 and 2 we do so with a heavy performance penalty
due to the fact that we don't support anywhere near the same number of
flows as a true switch. Also while that might work for i40e we still
have a much larger install base of ixgbe ports that we have to
support.

>> I would have to disagree with this. For devices such as 82599 that
>> don't have a true switch this may limit future functionality since
>> we can't move it over to switchdev mode. For example one thing I may
>> need to add is the ability to disable multicast and broadcast receive
>> on a per-VF basis at some point in the future.
>
> We are in the same boat with ConnectX3/mlx4, so lucky us that misery
> loves company (my google search also yielded "many narrow-half
> consolation", is that completely unrelated?) - the legacy mode for
> ixgbe/mlx4 has been there for ~8-10 years - and since then both
> companies have had 2-3 newer HW generations. I don't see why you can't
> come to your customers and tell them that newish functionality needs
> newer HW - it will also help sell more of the new stuff. If you keep
> extending the legacy mode, more ppl/drivers will do that as well and
> it will not let us go in the right direction.
>
> Or.

Well I don't know about you guys, but we still are selling parts
supported by ixgbe and have still been adding new hardware as recently
as just a couple years ago. I'm not saying SwitchDev doesn't need to
be supported, if anything I am saying we need to leave the legacy
support extendable so that we can set up a glide path between the two.
If I can get the source mode macvlan port representor working the way
I hope we can start looking at getting our customers used to a
SwitchDev type environment without having to use full SwitchDev. That
would help to make them more amenable to moving over to devices that
support that in the future.

In addition this all works on the basis of all future SR-IOV devices
being based on a VEB. Do we know if there are any existing or future
devices that work in a VEPA type mode? The issue with ixgbe and i40e
is that they were designed to be a hybrid between the two but in my
opinion they lean much more toward the VEPA configuration with just a
little bit of loopback support to make the VEB setup work. As such we
end up with issues such as all broadcasts/multicasts always being
transmitted out the uplink port.

If anything I think what we should define as a requirement would be
that we cannot add any future legacy items without adding support for
the same via the SwitchDev port representor.

- Alex

* Re: SRIOV switchdev mode BoF minutes
From: Or Gerlitz @ 2017-11-14 16:44 UTC
To: Alexander Duyck
Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
    Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
    Jiri Pirko, Rony Efraim, Linux Netdev List

On Mon, Nov 13, 2017 at 7:10 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck

>> What we call the slow path requirements are the following:
>>
>> 1. xmit on VF rep always turns to a receive on the VF, regardless of
>>    the offloaded SW steering rules ("send-to-vport")
>>
>> 2. xmit on VF which doesn't meet any offloaded SW steering rules must
>>    be received into the host OS from the VF rep

>> 1,2 above must hold also for the uplink and the PF reps

> I am well aware of the requirements. We discussed these with Jiri at
> the previous netdev.

>> When the i40e limitation was described @ netdev, it seems you have a
>> problem with VF xmit that should turn into a recv on the VF rep but
>> also goes to the wire.

>> It smells as if a FW patch can solve that, can't it?

> That is a huge maybe. We looked into it last time and while we can
> meet requirements 1 and 2 we do so with a heavy performance penalty
> due to the fact that we don't support anywhere near the same number of
> flows as a true switch. Also while that might work for i40e

to recap on i40e, you can support the slow path requirements, but you
have an issue with the fast path (== offloaded flows)? what is the
issue there?

> we still have a much larger install base of ixgbe ports that we have
> to support.

ok, but support is one thing and continuing to enhance a ten-year-old
wrong SW model is another

>>> I would have to disagree with this. For devices such as 82599 that
>>> don't have a true switch this may limit future functionality since
>>> we can't move it over to switchdev mode. For example one thing I may
>>> need to add is the ability to disable multicast and broadcast receive
>>> on a per-VF basis at some point in the future.

>> We are in the same boat with ConnectX3/mlx4, so lucky us that misery
>> loves company (my google search also yielded "many narrow-half
>> consolation", is that completely unrelated?) - the legacy mode for
>> ixgbe/mlx4 has been there for ~8-10 years - and since then both
>> companies have had 2-3 newer HW generations. I don't see why you
>> can't come to your customers and tell them that newish functionality
>> needs newer HW - it will also help sell more of the new stuff. If you
>> keep extending the legacy mode, more ppl/drivers will do that as well
>> and it will not let us go in the right direction.

> Well I don't know about you guys, but we still are selling parts
> supported by ixgbe

Same here, we are selling lots of CX3 and have to support that, but I
don't see why someone would want new features there.

> still been adding new hardware as recently as just a couple years ago.

wait, that's a different story.

You are saying that your older HW doesn't support e-switch and you
want to keep making new parts from that older HW and you want the
kernel to keep enhancing a wrong SW model b/c you are making new parts
from old HW. I don't see why we as a community need to go there.

Let's focus on this point for a moment before discussing the other
points you raised.

Or.

* Re: SRIOV switchdev mode BoF minutes
From: Alexander Duyck @ 2017-11-14 20:00 UTC
To: Or Gerlitz
Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
    Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
    Jiri Pirko, Rony Efraim, Linux Netdev List

On Tue, Nov 14, 2017 at 8:44 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Mon, Nov 13, 2017 at 7:10 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>> On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck
>
>>> What we call the slow path requirements are the following:
>>>
>>> 1. xmit on VF rep always turns to a receive on the VF, regardless of
>>>    the offloaded SW steering rules ("send-to-vport")
>>>
>>> 2. xmit on VF which doesn't meet any offloaded SW steering rules must
>>>    be received into the host OS from the VF rep
>
>>> 1,2 above must hold also for the uplink and the PF reps
>
>> I am well aware of the requirements. We discussed these with Jiri at
>> the previous netdev.
>
>>> When the i40e limitation was described @ netdev, it seems you have a
>>> problem with VF xmit that should turn into a recv on the VF rep but
>>> also goes to the wire.
>
>>> It smells as if a FW patch can solve that, can't it?
>
>> That is a huge maybe. We looked into it last time and while we can
>> meet requirements 1 and 2 we do so with a heavy performance penalty
>> due to the fact that we don't support anywhere near the same number of
>> flows as a true switch. Also while that might work for i40e
>
> to recap on i40e, you can support the slow path requirements, but you
> have an issue with the fast path (== offloaded flows)? what is the
> issue there?

We basically need to do some feasibility research to see if we can
actually meet all the requirements for switchdev on i40e. We have been
getting mixed messages where we are given a great many "yes, but" type
answers. For i40e we are looking into it but I don't have high
confidence in our ability to actually support it in hardware/firmware.
If it were as easy as you have been led to believe, we would have done
it months ago when we were researching the requirements to support
switchdev.

In addition i40e isn't really my concern. I am much more concerned
about ixgbe as it has a much larger install base and many more
customers that are still buying it today.

>> we still have a much larger install base of ixgbe ports that we have
>> to support.
>
> ok, but support is one thing and continuing to enhance a ten-year-old
> wrong SW model is another

The model might be 10 years old, but as I said we are still shipping
new silicon that was released just over a year ago that is supported
by the ixgbe driver.

Also I don't know if the term "enhancing" is the right word for what I
am thinking. I'm not talking about adding new drivers that only
support legacy mode. We are looking at probably having to refactor
the whole concept of "trusted" VF in order to break it out into
smaller buckets. In addition I plan to come up with a source mode
macvlan based "port representor" for legacy SR-IOV and hope to be able
to use that to start working on a better path for SR-IOV live
migration.

Fundamentally the problem I have with us saying we cannot extend
legacy mode SR-IOV is that 82599 is a very large piece of the existing
install base for 10Gbit in general. We have it shipping on brand new
platforms as the silicon that is installed on the motherboard. With
that being the case people are going to want to get the most value
they can out of the silicon that they purchased since in many cases it
is just a standard part of the platform.

>>>> I would have to disagree with this. For devices such as 82599 that
>>>> don't have a true switch this may limit future functionality since
>>>> we can't move it over to switchdev mode. For example one thing I may
>>>> need to add is the ability to disable multicast and broadcast receive
>>>> on a per-VF basis at some point in the future.
>
>>> We are in the same boat with ConnectX3/mlx4, so lucky us that misery
>>> loves company (my google search also yielded "many narrow-half
>>> consolation", is that completely unrelated?) - the legacy mode for
>>> ixgbe/mlx4 has been there for ~8-10 years - and since then both
>>> companies have had 2-3 newer HW generations. I don't see why you
>>> can't come to your customers and tell them that newish functionality
>>> needs newer HW - it will also help sell more of the new stuff. If
>>> you keep extending the legacy mode, more ppl/drivers will do that as
>>> well and it will not let us go in the right direction.
>
>> Well I don't know about you guys, but we still are selling parts
>> supported by ixgbe
>
> Same here, we are selling lots of CX3 and have to support that, but I
> don't see why someone would want new features there.

I think the difference is that we get pressed on as part of the
platform instead of being a single component. If a customer wants some
specific feature enabled on 82599 as a part of the platform we tend to
need to go along with it in order to avoid being a roadblock in a sale
of other components.

>> still been adding new hardware as recently as just a couple years ago.
>
> wait, that's a different story.
>
> You are saying that your older HW doesn't support e-switch and you
> want to keep making new parts from that older HW and you want the
> kernel to keep enhancing a wrong SW model b/c you are making new parts
> from old HW. I don't see why we as a community need to go there.

I'm not saying we have new parts. I'm saying we have existing parts
that will likely need some work done. SwitchDev was only introduced
about 2 years ago. We have parts that were released around or before
then with functionality that didn't anticipate this. We still haven't
finished fully implementing all the features that were available on
the parts, that is what I am arguing. Usually new features go in for
several years after a part is released, usually something in the 3 to
5 year range.

> Let's focus on this point for a moment before discussing the other
> points you raised.
>
> Or.

When SR-IOV was introduced there were two available modes, Virtual
Ethernet Port Aggregator, aka VEPA, and Virtual Ethernet Bridging,
aka VEB. The fact is SwitchDev is designed specifically for networking
SR-IOV with VEB. You argue that the legacy model is bad, but I would
argue that is because the legacy model was really designed to work for
both VEPA and VEB, whereas SwitchDev only focuses on VEB. If you take
a look in the ixgbe or i40e drivers you will see that we support
configuring both of those modes via ndo_bridge_setlink since we have
customer install bases that actually prefer VEPA over VEB as they
prefer to have their traffic centrally managed instead of having the
local host managing the traffic. We cannot just arbitrarily tell our
customers they are doing SR-IOV using the "wrong model".

I would rather not have SwitchDev become the next SystemD. The type of
argument you are making is basically dictating to us and our customers
how things are supposed to work based on your view of things. We have
different hardware, different customers, and all of our needs aren't
necessarily met by SwitchDev. I would agree that SwitchDev is the
go-to solution for VEB configuration, and we do plan to have future
hardware support it. In addition I would argue that for the sake of
consistency we should make sure that any feature that gets added to
the legacy mode has to be supported by the SwitchDev model as well
before it can be accepted. If anything my hope is to evolve the legacy
model to have much of the same look and feel as SwitchDev, but that
will take time and require changes to the legacy model.

I don't plan to have a ton of new features added to legacy SR-IOV. As
I stated earlier my main concern is the "trusted" VF mode as that has
become a security issue as everything is getting dumped into that so
we need to break it up to get finer granularity. For example I am
looking at adding a promisc/allmulti/multicast/broadcast control per
VF to set the upper limit of what a VF can request to receive instead
of just turning on "trusted" to allow a VF to turn on promiscuous. My
only other concern is live migration. I don't know if that will
require changes to the legacy SR-IOV mode or not, but it would be
better to not have that door closed as an option than to have to work
around it entirely.

So, to summarize:
1. VEPA is still a thing, that implies no e-switch. Switchdev does not
   address that model.
2. I agree that SwitchDev is the way forward for VEB.
3. I agree we should focus on interface consistency so any new feature
   added to legacy mode has to also be enabled in SwitchDev.

I hope this makes my point a bit clearer. I don't fundamentally
disagree with the need to focus on having a consistent UAPI going
forward. The only spot where we have issues is that I don't see
SwitchDev as the only solution as we still have customers that aren't
necessarily making use of an eswitch and telling them they are "doing
it wrong" isn't really a viable solution. If nothing else I think we
can look at re-evaluating this at the next netdev/netconf, and for now
I would agree legacy SR-IOV changes should be under greater scrutiny.

- Alex
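
The per-VF receive control described in the message above (an upper
limit on what a VF may request, instead of one all-or-nothing
"trusted" bit) can be pictured as a mask. This is only an illustrative
sketch: the names RxMode and grant_rx_modes are hypothetical, and no
such interface is defined in the thread or assumed to exist in the
kernel.

```python
from enum import Flag, auto


class RxMode(Flag):
    """Hypothetical receive modes a VF might ask the PF to enable."""

    NONE = 0
    BROADCAST = auto()
    MULTICAST = auto()
    ALLMULTI = auto()
    PROMISC = auto()


def grant_rx_modes(requested: RxMode, cap: RxMode) -> RxMode:
    """Return only those requested modes that the per-VF cap allows."""
    return requested & cap


# The host admin caps this VF at broadcast + multicast.
cap = RxMode.BROADCAST | RxMode.MULTICAST
# The VF asks for multicast and promiscuous receive.
wanted = RxMode.MULTICAST | RxMode.PROMISC

granted = grant_rx_modes(wanted, cap)
print(RxMode.MULTICAST in granted)  # multicast is within the cap
print(RxMode.PROMISC in granted)    # promiscuous is filtered out
```

The design point is that the cap is set by the host per VF, so a VF can
be allowed multicast without also being handed promiscuous mode.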
* Re: SRIOV switchdev mode BoF minutes 2017-11-14 20:00 ` Alexander Duyck @ 2017-11-14 21:50 ` Or Gerlitz 2017-11-14 23:05 ` Alexander Duyck 2017-11-14 23:32 ` Jakub Kicinski 1 sibling, 1 reply; 36+ messages in thread From: Or Gerlitz @ 2017-11-14 21:50 UTC (permalink / raw) To: Alexander Duyck Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, Nov 14, 2017 at 10:00 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Tue, Nov 14, 2017 at 8:44 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote: >> On Mon, Nov 13, 2017 at 7:10 PM, Alexander Duyck >> <alexander.duyck@gmail.com> wrote: >>> On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote: >>>> On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck >> >>>> The what we call slow path requirements are the following: >>>> >>>> 1. xmit on VF rep always turns to a receive on the VF, regardless of >>>> the offloaded SW steering rules ("send-to-vport") >>>> >>>> 2. xmit on VF which doesn't meet any offloaded SW steering rules must >>>> be received into the host OS from the VF rep >> >>>> 1,2 above must hold also for the uplink and the PF reps >> >>> I am well aware of the requirements. We discussed these with Jiri at >>> the previous netdev. >> >>>> When the i40e limitation was described to @ netdev, it seems you have a problem >>>> with VF xmit that should be turned to be a recv on the VF rep but also >>>> goes to the wire. >> >>>> It smells as if a FW patch can solve that, isn't that? >> >>> That is a huge maybe. We looked into it last time and while we can >>> meet requirements 1 and 2 we do so with a heavy performance penalty >>> due to the fact that we don't support anywhere near the same number of >>> flows as a true switch. 
Also while that might work for i40e >> >> to recap on i40e, you can support the slow path requirements, but you have an >> issue with the fast path (== offloaded flows)? what is the issue there? > > We basically need to do some feasability research to see if we can > actually meet all the requirements for switchdev on i40e. We have been > getting mixed messages where we are given a great many "yes, but" type > answers. For i40e we are looking into it but I don't have high > confidence in our ability to actually support it in hardare/firmware. > If it were as easy as you have been led to believe, we would have done > it months ago when we were researching the requirements to support switchdev wait, Sridhar made seven rounds of his submission (this is the v7 pointer [1]) and you still don't know if what you were attempting to push upstream can work, something is weird here, can you clarify? Jeff? Sridhar, maybe you can explain if/what wrong assumptions you had in your code and what you think is the gap to address them and come up with proper impl for i40e? [1] https://marc.info/?l=linux-netdev&m=149083338400922&w=2 > In addition i40e isn't really my concern. I am much more > concerned about ixgbe as it has a much larger install base and many > more customers that are still buying it today. > >>> we still have a much larger install base of ixgbe ports that we have to support. >> >> ok, but support is one thing and keep enhancing a ten years old wrong >> SW model is 2nd thing > > The model might be 10 years old, but as I said we are still shipping > new silicon that was released just over a year ago that is supported > by the ixgbe driver. > > Also I don't know if the term "enhancing" is the right word for what I > am thinking. I'm not talking about adding new drivers that only > support legacy mode. We are looking at probably having to refactor > the whole concept of "trusted" VF in order to break it out into > smaller buckets. 
In addition I plan to come up with a source mode > macvlan based "port representor" for legacy SR-IOV and hope to be able > to use that to start working on a better path for SR-IOV live > migration. > > Fundamentally the problem I have with us saying we cannot extend > legacy mode SR-IOV is that 82599 is a very large piece of the existing > install base for 10Gbit in general. We have it shipping on brand new > platforms as the silicon that is installed on the motherboard. With > that being the case people are going to want to get the most value > they can out of the silicon that they purchased since in many cases it > is just a standard part of the platform. Getting the most value still doesn't mean you should approach the community and ask to keep enhancing a wrong SW model for a switch. For example, suppose a single new bit module param to IXGBE will get you to sell another 100K or 1M or 10M pieces per year but we as community decided that module params are not the way to go - will you come and ask to add the module param for you to get more biz? > I'm not saying we have new parts. I'm saying we have existing parts > that will likely need some work done. SwitchDev was only introduced > about 2 years ago. We have parts that were released around or before > then with functionality that didn't anticipate this. We still haven't > finished fully implementing all the features that were available on > the parts, that is what I am arguing. Usually new features go in for > several years after a part is released, usually something on the 3 to > 5 year range. > When SR-IOV was introduced there were two available modes, Virtual > Ethernet Port Aggregation, aka VEPA, and Virtual Ethernet Bridging, > aka VEB. The fact is SwitchDev is designed specifically for networking > SR-IOV with Virtual Ethernet Bridging, aka VEB. 
You argue that the > legacy model is bad, but I would argue that is because the legacy > model was really designed to work more for both VEPA than with VEB, > whereas SwitchDev only focuses on VEB. If you take a look in the ixgbe > or i40e drivers you will see that we support configuring both of those > modes via ndo_bridge_setlink since we have customer install bases that > actually prefer VEPA over VEB as they prefer to have their traffic > centrally managed instead of having the local host managing the > traffic. We cannot just arbitrarily tell our customers they are doing > SR-IOV using the "wrong model". > > I would rather not have SwitchDev become the next SystemD. The type > argument you are making is basically dictating to us and our customers > how things are supposed to work based on your view things. We have > different hardware, different customers, and all of our needs aren't > necessarily met by SwitchDev. I would agree that SwitchDev is the > go-to solution for VEB configuration, and we do plan to have future > hardware support it. In addition I would argue that for the sake of > consistency we should make sure that any feature that gets added to > the legacy has to be supported by the SwitchDev model as well before > it could be supported. If anything my hope is to evolve the legacy > model to have much of the same look and feel as SwitchDev, but that > will take time and require changes to the legacy model. > > I don't plan to have a ton of new features added to legacy SR-IOV, as > I stated earlier my main concern is the "trusted" VF mode as that has > become a security issue as everything is getting dumped into that so > we need to break it up to get finer granularity. For example I am > looking at adding a promisc/allmulti/multicast/broadcast control per > VF to set the upper limit of what a VF can request to receive instead > of just turning on "trusted" to allow a VF to turn on promiscuous. My > only other concern is live migration. 
I don't know if that will > require changes to the legacy SR-IOV mode or not, but it would be > better to not have that door closed as an option than to have to work > around it entirely. > > So, to summarize: > 1. VEPA is still a thing, that implies no e-switch. Switchdev does not > address that model. > 2. I agree that SwitchDev is the way forward for VEB. > 3. I agree we should focus on interface consistency so any new feature > added to legacy mode has to also be enabled in SwitchDev. > > I hope this makes my point a bit clearer. I don't fundamentally > disagree with the need to focus on having a consistent UAPI going > forward. The only spot where we have issues is that I don't see > SwitchDev as the only solution as we still have customers that aren't > necessarily making use of an eswitch and telling them they are "doing > it wrong" isn't really a viable solution. If nothing else I think we > can look at re-evaluating this at the next netdev/netconf, and for now > I would agree legacy SR-IOV changes should be under greater scrutiny. Alex, Lots of data and argumentation, it's too bad that none of it was said/presented @ the last netdev/netconf nor in the previous conferences (Feb 2016 / Oct 2016) when SRIOV switchdev was on the stage nor in the submissions that followed, doesn't seem as new data points, at least to you. As you said, the switchdev mode for SRIOV is around for two years (merged in 4.8 but was presented way back). You waited two years to provide this input and we will have to wait another 6 months for you to conduct a session on that. Can you point out public use-cases / white-papers / design documents / blue prints / etc that employ the VEPA approach? b/c really no other person/vendor brought it up... we were all dealing with the sriov e-switch as a HW switch which should be programmed by the host stack according to well known industry models that apply on physical switches, e.g 1. L2 FDB (Linux Bridge) 2. L3 FIB (Linux Routers) 3. 
ACLs (Linux TC) [3] is what is implemented by the upstream sriov switchdev drivers; [1] and [2] we discussed at netdev. Maybe you want to play with [1] for i40e? I had a slide on that in the BoF. Or.
* Re: SRIOV switchdev mode BoF minutes 2017-11-14 21:50 ` Or Gerlitz @ 2017-11-14 23:05 ` Alexander Duyck 2017-11-14 23:36 ` Jakub Kicinski 2017-11-16 17:41 ` Or Gerlitz 0 siblings, 2 replies; 36+ messages in thread From: Alexander Duyck @ 2017-11-14 23:05 UTC (permalink / raw) To: Or Gerlitz Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, Nov 14, 2017 at 1:50 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote: > On Tue, Nov 14, 2017 at 10:00 PM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Tue, Nov 14, 2017 at 8:44 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote: >>> On Mon, Nov 13, 2017 at 7:10 PM, Alexander Duyck >>> <alexander.duyck@gmail.com> wrote: >>>> On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote: >>>>> On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck >>> >>>>> The what we call slow path requirements are the following: >>>>> >>>>> 1. xmit on VF rep always turns to a receive on the VF, regardless of >>>>> the offloaded SW steering rules ("send-to-vport") >>>>> >>>>> 2. xmit on VF which doesn't meet any offloaded SW steering rules must >>>>> be received into the host OS from the VF rep >>> >>>>> 1,2 above must hold also for the uplink and the PF reps >>> >>>> I am well aware of the requirements. We discussed these with Jiri at >>>> the previous netdev. >>> >>>>> When the i40e limitation was described to @ netdev, it seems you have a problem >>>>> with VF xmit that should be turned to be a recv on the VF rep but also >>>>> goes to the wire. >>> >>>>> It smells as if a FW patch can solve that, isn't that? >>> >>>> That is a huge maybe. We looked into it last time and while we can >>>> meet requirements 1 and 2 we do so with a heavy performance penalty >>>> due to the fact that we don't support anywhere near the same number of >>>> flows as a true switch. 
Also while that might work for i40e >>> >>> to recap on i40e, you can support the slow path requirements, but you have an >>> issue with the fast path (== offloaded flows)? what is the issue there? >> >> We basically need to do some feasibility research to see if we can >> actually meet all the requirements for switchdev on i40e. We have been >> getting mixed messages where we are given a great many "yes, but" type >> answers. For i40e we are looking into it but I don't have high >> confidence in our ability to actually support it in hardware/firmware. >> If it were as easy as you have been led to believe, we would have done >> it months ago when we were researching the requirements to support switchdev > > wait, Sridhar made seven rounds of his submission (this is the v7 > pointer [1]) and you > still don't know if what you were attempting to push upstream can > work, something is > weird here, can you clarify? Jeff? Not weird so much as stubborn. The patches were being pushed based on the assumption that the community would accept a NIC generating port representors that didn't necessarily pass traffic, and then even when we had them passing traffic the PF still wasn't configured to handle being the default destination for traffic without any rules associated, instead VFs would directly send to the outside world. > Sridhar, maybe you can explain if/what wrong assumptions you had in your code > and what you think is the gap to address them and come up with proper > impl for i40e? > > [1] https://marc.info/?l=linux-netdev&m=149083338400922&w=2 For starters the firmware change you are talking about didn't exist during this time frame. We can ignore those patches as they assumed that port representors didn't necessarily have to pass traffic. >> In addition i40e isn't really my concern. I am much more >> concerned about ixgbe as it has a much larger install base and many >> more customers that are still buying it today.
>> >>>> we still have a much larger install base of ixgbe ports that we have to support. >>> >>> ok, but support is one thing and keep enhancing a ten years old wrong >>> SW model is 2nd thing >> >> The model might be 10 years old, but as I said we are still shipping >> new silicon that was released just over a year ago that is supported >> by the ixgbe driver. >> >> Also I don't know if the term "enhancing" is the right word for what I >> am thinking. I'm not talking about adding new drivers that only >> support legacy mode. We are looking at probably having to refactor >> the whole concept of "trusted" VF in order to break it out into >> smaller buckets. In addition I plan to come up with a source mode >> macvlan based "port representor" for legacy SR-IOV and hope to be able >> to use that to start working on a better path for SR-IOV live >> migration. >> >> Fundamentally the problem I have with us saying we cannot extend >> legacy mode SR-IOV is that 82599 is a very large piece of the existing >> install base for 10Gbit in general. We have it shipping on brand new >> platforms as the silicon that is installed on the motherboard. With >> that being the case people are going to want to get the most value >> they can out of the silicon that they purchased since in many cases it >> is just a standard part of the platform. > > Getting the most value still doesn't mean you should approach the community > and ask to keep enhancing a wrong SW model for a switch. > > For example, suppose a single new bit module param to IXGBE will get > you to sell another > 100K or 1M or 10M pieces per year but we as community decided that > module params are > not the way to go - will you come and ask to add the module param for > you to get more biz? The problem is that is how things have been done in the past. I don't want us going down that road. That is half of my frustration with how things have been done. Even worse is how debugfs has been mis-used. 
I'm trying to keep us from committing to an agreement that we won't abide by. >> I'm not saying we have new parts. I'm saying we have existing parts >> that will likely need some work done. SwitchDev was only introduced >> about 2 years ago. We have parts that were released around or before >> then with functionality that didn't anticipate this. We still haven't >> finished fully implementing all the features that were available on >> the parts, that is what I am arguing. Usually new features go in for >> several years after a part is released, usually something on the 3 to >> 5 year range. > >> When SR-IOV was introduced there were two available modes, Virtual >> Ethernet Port Aggregation, aka VEPA, and Virtual Ethernet Bridging, >> aka VEB. The fact is SwitchDev is designed specifically for networking >> SR-IOV with Virtual Ethernet Bridging, aka VEB. You argue that the >> legacy model is bad, but I would argue that is because the legacy >> model was really designed to work more for both VEPA than with VEB, >> whereas SwitchDev only focuses on VEB. If you take a look in the ixgbe >> or i40e drivers you will see that we support configuring both of those >> modes via ndo_bridge_setlink since we have customer install bases that >> actually prefer VEPA over VEB as they prefer to have their traffic >> centrally managed instead of having the local host managing the >> traffic. We cannot just arbitrarily tell our customers they are doing >> SR-IOV using the "wrong model". >> >> I would rather not have SwitchDev become the next SystemD. The type >> argument you are making is basically dictating to us and our customers >> how things are supposed to work based on your view things. We have >> different hardware, different customers, and all of our needs aren't >> necessarily met by SwitchDev. I would agree that SwitchDev is the >> go-to solution for VEB configuration, and we do plan to have future >> hardware support it. 
In addition I would argue that for the sake of >> consistency we should make sure that any feature that gets added to >> the legacy has to be supported by the SwitchDev model as well before >> it could be supported. If anything my hope is to evolve the legacy >> model to have much of the same look and feel as SwitchDev, but that >> will take time and require changes to the legacy model. >> >> I don't plan to have a ton of new features added to legacy SR-IOV, as >> I stated earlier my main concern is the "trusted" VF mode as that has >> become a security issue as everything is getting dumped into that so >> we need to break it up to get finer granularity. For example I am >> looking at adding a promisc/allmulti/multicast/broadcast control per >> VF to set the upper limit of what a VF can request to receive instead >> of just turning on "trusted" to allow a VF to turn on promiscuous. My >> only other concern is live migration. I don't know if that will >> require changes to the legacy SR-IOV mode or not, but it would be >> better to not have that door closed as an option than to have to work >> around it entirely. >> >> So, to summarize: >> 1. VEPA is still a thing, that implies no e-switch. Switchdev does not >> address that model. >> 2. I agree that SwitchDev is the way forward for VEB. >> 3. I agree we should focus on interface consistency so any new feature >> added to legacy mode has to also be enabled in SwitchDev. >> >> I hope this makes my point a bit clearer. I don't fundamentally >> disagree with the need to focus on having a consistent UAPI going >> forward. The only spot where we have issues is that I don't see >> SwitchDev as the only solution as we still have customers that aren't >> necessarily making use of an eswitch and telling them they are "doing >> it wrong" isn't really a viable solution. 
If nothing else I think we >> can look at re-evaluating this at the next netdev/netconf, and for now >> I would agree legacy SR-IOV changes should be under greater scrutiny. > > Alex, > > Lots of data and argumentation, it's too bad that none of it was > said/presented @ the last > netdev/netconf nor in the previous conferences (Feb 2016 / Oct 2016) > when SRIOV switchdev > was on the stage nor in the submissions that followed, doesn't seem as > new data points, at > least to you. As you said, the switchdev mode for SRIOV is around for > two years (merged in 4.8 > but was presented way back). You waited two years to provide this > input and we will have to wait > another 6 months for you to conduct a session on that. This is the first time where you have essentially said SwitchDev is the only way things are going to be done going forward. In addition I don't recall you ever using all the wording basically calling the legacy model bad for SR-IOV. That is why I have been okay with it up until now. > Can you point out public use-cases / white-papers / design documents / > blue prints / etc > that employ the VEPA approach? b/c really no other person/vendor > brought it up... we Cisco and HP were the two vendors that were pushing it hard for a while there. It isn't anywhere near as popular as VEB is, but from the looks of it Cisco is still pushing a variant on it in the form of vntag. If nothing else you can go look at the 802.1Qbg IEEE spec as it is called out there as well. > were all dealing with the sriov e-switch as a HW switch which should > be programmed > by the host stack according to well known industry models that apply > on physical switches, e.g > > 1. L2 FDB (Linux Bridge) > 2. L3 FIB (Linux Routers) > 3. ACLS (Linux TC) > > [3] is what implemented by the upstream sriov switchdev drivers, [1] and [2] we > discussed on netdev, maybe you want to play with [1] for i40e? I had a slide on > that in the BoF > > Or. 
So for i40e we will probably explore option 1, and possibly option 3 though as I said we still have to figure out what we can get the firmware to actually do for us. That ends up being the ultimate limitation. - Alex
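For reference, the two slow path requirements quoted earlier in this message can be written down as a tiny model. This is only a sketch under stated assumptions: the class, method names, and the "<vport>_rep" naming are hypothetical and exist only for illustration; they are not taken from any driver or from the switchdev API.

```python
from typing import Dict, Tuple


class ESwitchSlowPathModel:
    """Hypothetical model of the two slow path requirements (not driver code)."""

    def __init__(self) -> None:
        # Offloaded steering rules: (source vport, destination MAC) -> output vport.
        self._rules: Dict[Tuple[str, str], str] = {}

    def offload_rule(self, src_vport: str, dst_mac: str, out_vport: str) -> None:
        """Install a fast path rule for traffic sent by src_vport to dst_mac."""
        self._rules[(src_vport, dst_mac.lower())] = out_vport

    def xmit_on_rep(self, vport: str) -> str:
        """Requirement 1: a transmit on the representor of `vport` is always
        received on `vport` itself, regardless of offloaded rules."""
        return vport

    def xmit_on_vport(self, vport: str, dst_mac: str) -> str:
        """Requirement 2: a transmit on `vport` that matches no offloaded rule
        is received by the host on that vport's representor."""
        out = self._rules.get((vport, dst_mac.lower()))
        if out is not None:
            return out  # fast path: a rule matched
        return f"{vport}_rep"  # slow path: miss goes to the representor


if __name__ == "__main__":
    esw = ESwitchSlowPathModel()
    esw.offload_rule("vf0", "52:54:00:00:00:02", "vf1")
    print(esw.xmit_on_rep("vf0"))
    print(esw.xmit_on_vport("vf0", "52:54:00:00:00:02"))
    print(esw.xmit_on_vport("vf0", "52:54:00:00:00:99"))
```

The same two methods would apply unchanged to the uplink and PF vports, which is the point of representing them with rep netdevs as well. The behaviour described for i40e at the time (unmatched VF transmits going straight to the wire) corresponds to `xmit_on_vport` returning the uplink instead of the representor on a miss.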
* Re: SRIOV switchdev mode BoF minutes
  2017-11-14 23:05 ` Alexander Duyck
@ 2017-11-14 23:36 ` Jakub Kicinski
  2017-11-15  3:04 ` Alexander Duyck
  2017-11-16 17:41 ` Or Gerlitz
  1 sibling, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2017-11-14 23:36 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Or Gerlitz, David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List

On Tue, 14 Nov 2017 15:05:08 -0800, Alexander Duyck wrote:
> >> We basically need to do some feasibility research to see if we can
> >> actually meet all the requirements for switchdev on i40e. We have been
> >> getting mixed messages where we are given a great many "yes, but" type
> >> answers. For i40e we are looking into it but I don't have high
> >> confidence in our ability to actually support it in hardware/firmware.
> >> If it were as easy as you have been led to believe, we would have done
> >> it months ago when we were researching the requirements to support switchdev
> >
> > wait, Sridhar made seven rounds of his submission (this is the v7
> > pointer [1]) and you
> > still don't know if what you were attempting to push upstream can
> > work, something is
> > weird here, can you clarify? Jeff?
>
> Not weird so much as stubborn. The patches were being pushed based on
> the assumption that the community would accept a NIC generating port
> representors that didn't necessarily pass traffic, and then even when
> we had them passing traffic the PF still wasn't configured to handle
> being the default destination for traffic without any rules
> associated, instead VFs would directly send to the outside world.

Perhaps the way forward is to lift the requirement on passing traffic,
as long as the limitation is clearly expressed to the users.
* Re: SRIOV switchdev mode BoF minutes
  2017-11-14 23:36 ` Jakub Kicinski
@ 2017-11-15  3:04 ` Alexander Duyck
  2017-11-15  4:02 ` Jakub Kicinski
  0 siblings, 1 reply; 36+ messages in thread
From: Alexander Duyck @ 2017-11-15 3:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Or Gerlitz, David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List

On Tue, Nov 14, 2017 at 3:36 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> On Tue, 14 Nov 2017 15:05:08 -0800, Alexander Duyck wrote:
>> >> We basically need to do some feasibility research to see if we can
>> >> actually meet all the requirements for switchdev on i40e. We have been
>> >> getting mixed messages where we are given a great many "yes, but" type
>> >> answers. For i40e we are looking into it but I don't have high
>> >> confidence in our ability to actually support it in hardware/firmware.
>> >> If it were as easy as you have been led to believe, we would have done
>> >> it months ago when we were researching the requirements to support switchdev
>> >
>> > wait, Sridhar made seven rounds of his submission (this is the v7
>> > pointer [1]) and you
>> > still don't know if what you were attempting to push upstream can
>> > work, something is
>> > weird here, can you clarify? Jeff?
>>
>> Not weird so much as stubborn. The patches were being pushed based on
>> the assumption that the community would accept a NIC generating port
>> representors that didn't necessarily pass traffic, and then even when
>> we had them passing traffic the PF still wasn't configured to handle
>> being the default destination for traffic without any rules
>> associated, instead VFs would directly send to the outside world.
>
> Perhaps the way forward is to lift the requirement on passing traffic,
> as long as the limitation is clearly expressed to the users.

No, I am not arguing for that because then SwitchDev will fall into
disarray. If we want to have a strict definition for what is SwitchDev
and what isn't I am okay with that. It gives us a definition of what
our hardware needs to do in order to support it and without that we
are going to get hardware that just bends the rules to claim support
for it.

All I am asking for is for us to not close the door to the possibility
of adding features to legacy SR-IOV. I am hoping to use a source
macvlan based approach to make it so that we can support "port
representors" for devices that can't support full SwitchDev. The idea
would be to use them to get as close to SwitchDev level support on
legacy devices as possible without using full SwitchDev. That should
solve a good part of the issue, but I am pretty certain I need to be
able to extend legacy SR-IOV in order to support it. I had talked with
Jiri at netdev 2.1 about it back when we had submitted the v7 patches,
and the decision was to look at doing "port representors" but don't
associate them with SwitchDev. I was out on Sabbatical for most of the
summer and I am just now starting on the macvlan work I had planned. I
hope to have it done before the next netdev and then we can discuss it
there if it needs more discussion than what we can have on the mailing
list.

I'm fine with us placing any legacy SR-IOV changes under more
scrutiny. I am just not open to saying we will not extend or update
any features for legacy SR-IOV. The fact is we are still selling a ton
of ixgbe based parts, so I can't say with any certainty that there
won't be a request for some new SR-IOV feature in the future.

- Alex
* Re: SRIOV switchdev mode BoF minutes 2017-11-15 3:04 ` Alexander Duyck @ 2017-11-15 4:02 ` Jakub Kicinski 2017-11-15 18:25 ` Alexander Duyck 0 siblings, 1 reply; 36+ messages in thread From: Jakub Kicinski @ 2017-11-15 4:02 UTC (permalink / raw) To: Alexander Duyck Cc: Or Gerlitz, David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, 14 Nov 2017 19:04:36 -0800, Alexander Duyck wrote: > On Tue, Nov 14, 2017 at 3:36 PM, Jakub Kicinski > <jakub.kicinski@netronome.com> wrote: > > On Tue, 14 Nov 2017 15:05:08 -0800, Alexander Duyck wrote: > >> >> We basically need to do some feasability research to see if we can > >> >> actually meet all the requirements for switchdev on i40e. We have been > >> >> getting mixed messages where we are given a great many "yes, but" type > >> >> answers. For i40e we are looking into it but I don't have high > >> >> confidence in our ability to actually support it in hardare/firmware. > >> >> If it were as easy as you have been led to believe, we would have done > >> >> it months ago when we were researching the requirements to support switchdev > >> > > >> > wait, Sridhar made seven rounds of his submission (this is the v7 > >> > pointer [1]) and you > >> > still don't know if what you were attempting to push upstream can > >> > work, something is > >> > weird here, can you clarify? Jeff? > >> > >> Not weird so much as stubborn. The patches were being pushed based on > >> the assumption that the community would accept a NIC generating port > >> representors that didn't necessarily pass traffic, and then even when > >> we had them passing traffic the PF still wasn't configured to handle > >> being the default destination for traffic without any rules > >> associated, instead VFs would directly send to the outside world. 
> > > > Perhaps the way forward is to lift the requirement on passing traffic, > > as long as the limitation is clearly expressed to the users. > > No, I am not arguing for that because then SwitchDev will fall into > disarray. If we want to have a strict definition for what is SwitchDev > and what isn't I am okay with that. It gives us a definition of what > our hardware needs to do in order to support it and without that we > are going to get hardware that just bends the rules to claim support > for it. Let me make sure we understand each other. The switchdev SR-IOV mode is what happens when user requests DEVLINK_ESWITCH_MODE_SWITCHDEV. Are you saying you are opposed to adding DEVLINK_ESWITCH_MODE_VEPA? > All I am asking for is for us to not close the door to the possibility > of adding features to legacy SR-IOV. I am hoping to use a source > macvlan based approach to make it so that we can support "port > representors" for devices that can't support full SwitchDev. The idea > would be to use them to get as close to SwitchDev level support on > legacy devices as possible without using full SwitchDev. That should > solve a good part of the issue, but I am pretty certain I need to be > able to extend legacy SR-IOV in order to support it. I had talked with > Jiri at netdev 2.1 about it back when we had submitted the v7 patches, > and the decision was to look at doing "port representors" but don't > associate them with SwitchDev. I was out on Sabbatical for most of the > summer and I am just now starting on the macvlan work I had planned. I > hope to have it done before the next netdev and then we can discuss it > there if it needs more discussion than what we can have on the mailing > list. I don't know what you mean with the macvlan based approach. Could you perhaps describe it in more detail? Will it allow users to configure forwarding and queueing with existing, standard tools and APIs? > I'm fine with us placing any legacy SR-IOV changes under more > scrutiny. 
I am just not open to saying we will not extend or update > any features for legacy SR-IOV. The fact is we are still selling a ton > of ixgbe based parts, so I can't say with any certainty that there > won't be a request for some new SR-IOV feature in the future. > > - Alex
* Re: SRIOV switchdev mode BoF minutes 2017-11-15 4:02 ` Jakub Kicinski @ 2017-11-15 18:25 ` Alexander Duyck 0 siblings, 0 replies; 36+ messages in thread From: Alexander Duyck @ 2017-11-15 18:25 UTC (permalink / raw) To: Jakub Kicinski Cc: Or Gerlitz, David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, Nov 14, 2017 at 8:02 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote: > On Tue, 14 Nov 2017 19:04:36 -0800, Alexander Duyck wrote: >> On Tue, Nov 14, 2017 at 3:36 PM, Jakub Kicinski >> <jakub.kicinski@netronome.com> wrote: >> > On Tue, 14 Nov 2017 15:05:08 -0800, Alexander Duyck wrote: >> >> >> We basically need to do some feasability research to see if we can >> >> >> actually meet all the requirements for switchdev on i40e. We have been >> >> >> getting mixed messages where we are given a great many "yes, but" type >> >> >> answers. For i40e we are looking into it but I don't have high >> >> >> confidence in our ability to actually support it in hardare/firmware. >> >> >> If it were as easy as you have been led to believe, we would have done >> >> >> it months ago when we were researching the requirements to support switchdev >> >> > >> >> > wait, Sridhar made seven rounds of his submission (this is the v7 >> >> > pointer [1]) and you >> >> > still don't know if what you were attempting to push upstream can >> >> > work, something is >> >> > weird here, can you clarify? Jeff? >> >> >> >> Not weird so much as stubborn. The patches were being pushed based on >> >> the assumption that the community would accept a NIC generating port >> >> representors that didn't necessarily pass traffic, and then even when >> >> we had them passing traffic the PF still wasn't configured to handle >> >> being the default destination for traffic without any rules >> >> associated, instead VFs would directly send to the outside world. 
>> > >> > Perhaps the way forward is to lift the requirement on passing traffic, >> > as long as the limitation is clearly expressed to the users. >> >> No, I am not arguing for that because then SwitchDev will fall into >> disarray. If we want to have a strict definition for what is SwitchDev >> and what isn't I am okay with that. It gives us a definition of what >> our hardware needs to do in order to support it and without that we >> are going to get hardware that just bends the rules to claim support >> for it. > > Let me make sure we understand each other. The switchdev SR-IOV mode is > what happens when user requests DEVLINK_ESWITCH_MODE_SWITCHDEV. Are you > saying you are opposed to adding DEVLINK_ESWITCH_MODE_VEPA? I wouldn't say I am opposed to that idea. We just need to clearly define what MODE_VEPA is. I would say that even in MODE_VEPA we would be passing traffic. The limitation though is that we wouldn't have the same mechanisms in place to route the traffic. The big issue with VEPA is that the traffic is routed to an external entity before it makes a hairpin turn and comes back. As such we don't have the actual origin of the packet to work with other than MAC and VLAN. As far as directing a packet to a specific port the only way we really have of doing that is to direct it to the MAC/VLAN pair for the VF. This is one of the reasons why I am thinking source mode macvlan is the solution to go with for something like this. Basically the source mode macvlan can get pretty close to identifying the origin of any packet that came from the VF assuming it is programmed with all the MAC entries belonging to the VF. The only case where this doesn't work is the "trusted" legacy mode VF that is running in promiscuous with anti-spoof disabled. >> All I am asking for is for us to not close the door to the possibility >> of adding features to legacy SR-IOV. 
I am hoping to use a source >> macvlan based approach to make it so that we can support "port >> representors" for devices that can't support full SwitchDev. The idea >> would be to use them to get as close to SwitchDev level support on >> legacy devices as possible without using full SwitchDev. That should >> solve a good part of the issue, but I am pretty certain I need to be >> able to extend legacy SR-IOV in order to support it. I had talked with >> Jiri at netdev 2.1 about it back when we had submitted the v7 patches, >> and the decision was to look at doing "port representors" but don't >> associate them with SwitchDev. I was out on Sabbatical for most of the >> summer and I am just now starting on the macvlan work I had planned. I >> hope to have it done before the next netdev and then we can discuss it >> there if it needs more discussion than what we can have on the mailing >> list. > > I don't know what you mean with the macvlan based approach. Could you > perhaps describe it in more detail? Will it allow users to configure > forwarding and queueing with existing, standard tools and APIs? So there are a few issues with our devices doing SwitchDev mode that I am trying to address. One of the issues is that we have no direct way to figure out where the packets are coming from as I described above. So instead of us implementing multiple approaches for the same thing my thought was to look at using source mode macvlan which does filtering on the source MAC address instead of the destination. It shouldn't take much to extend it so that a PF could notify a source mode macvlan interface of all the unicast addresses a VF can use as a source address for transmitting. With that we would at least be able to tell where the traffic came from. Another issue is directing transmit packets to the VF for any specific interface. 
My thought for our source mode based "port representor" macvlan would be to limit the transmits so that we can only transmit unicast packets that are guaranteed to be delivered to the proper destination. Basically we would have to tag all broadcast and multicast packets as being already forwarded and they would have to be dropped on the "port representor" interfaces. Ideally there would be some sort of uplink representor that would then be able to handle the broadcast/multicast packets for the device since we end up replicating the packets across all ports on the same VLAN currently.

The last issue is that by default all transmits that don't have a matching filter in hardware are transmitted out the uplink port. That was part of the issue that we don't think can be solved for ixgbe, and even with a firmware change I am not certain how well i40e will work for this. With macvlan being used as the model we basically skirt the whole issue since that is kind of the standard behavior for macvlan anyway.

In theory this all should work together to allow forwarding with the existing tools. It would basically just mean we need to use FDB programming on the port representor to control what MAC addresses are handled for each interface. In addition we could probably handle the ndo_setup_tc call in the port representors with some limited subset of fields supported by flower to use that to route traffic.

It will be much easier to show all this once I have code. It will probably take me a month or so to dig out the technical debt that is currently present for macvlan offload, and the fact that i40e currently doesn't support it. Once I get those two items addressed my plan is to then start tackling the source mode macvlan based port representors. I hope to have an RFC ready early next year.

Thanks.

- Alex
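As a rough illustration of the source mode idea described in this message, the sketch below models the two pieces of logic involved: classifying a received frame by its source MAC, and refusing broadcast/multicast transmits on a representor. It is only a sketch under stated assumptions: every class, method, and interface name here is hypothetical, and nothing in it reflects the actual macvlan code or any RFC.

```python
from typing import Dict, Optional


def parse_mac(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    raw = bytes(int(part, 16) for part in mac.split(":"))
    if len(raw) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return raw


def is_multicast(mac: str) -> bool:
    """True for multicast and broadcast: the group bit of the first octet is set."""
    return bool(parse_mac(mac)[0] & 0x01)


class SourceModeDemux:
    """Hypothetical model of a source mode 'port representor' lookup."""

    def __init__(self) -> None:
        # Source MAC -> name of the representor that owns it.
        self._owner: Dict[bytes, str] = {}

    def pf_notify_vf_mac(self, rep: str, mac: str) -> None:
        """The PF tells the representor which unicast source MACs its VF may use."""
        if is_multicast(mac):
            raise ValueError("a VF source address must be unicast")
        self._owner[parse_mac(mac)] = rep

    def classify_rx(self, src_mac: str) -> Optional[str]:
        """Return the representor a frame originated from, or None if unknown
        (for example a 'trusted' VF with anti-spoof disabled)."""
        return self._owner.get(parse_mac(src_mac))

    @staticmethod
    def rep_may_transmit(dst_mac: str) -> bool:
        """Representors only send unicast; broadcast/multicast is dropped there
        and left to some uplink representor."""
        return not is_multicast(dst_mac)


if __name__ == "__main__":
    demux = SourceModeDemux()
    demux.pf_notify_vf_mac("vf0_rep", "52:54:00:00:00:01")
    print(demux.classify_rx("52:54:00:00:00:01"))
    print(demux.rep_may_transmit("ff:ff:ff:ff:ff:ff"))
```

The `None` result from `classify_rx` corresponds to the case called out above where origin cannot be determined, which is why the approach depends on the representor knowing every MAC address the VF can use.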
* Re: SRIOV switchdev mode BoF minutes 2017-11-14 23:05 ` Alexander Duyck 2017-11-14 23:36 ` Jakub Kicinski @ 2017-11-16 17:41 ` Or Gerlitz 2017-11-16 18:20 ` Alexander Duyck 1 sibling, 1 reply; 36+ messages in thread From: Or Gerlitz @ 2017-11-16 17:41 UTC (permalink / raw) To: Alexander Duyck Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Wed, Nov 15, 2017 at 1:05 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Tue, Nov 14, 2017 at 1:50 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote: >> all dealing with the sriov e-switch as a HW switch which should >> be programmed >> by the host stack according to well known industry models that apply >> on physical switches, e.g >> >> 1. L2 FDB (Linux Bridge) >> 2. L3 FIB (Linux Routers) >> 3. ACLS (Linux TC) >> >> [3] is what implemented by the upstream sriov switchdev drivers, [1] and [2] we >> discussed on netdev, maybe you want to play with [1] for i40e? I had a slide on >> that in the BoF > So for i40e we will probably explore option 1, and possibly option 3 > though as I said we still have to figure out what we can get the > firmware to actually do for us. That ends up being the ultimate > limitation. I think Intel/Linux/sriov wise, it would be good if you put now the focus on that small corner of the universe and show support for the new community lead mode by having one of your current drivers support that. FDB support would be great and it will help transition existing legacy mode users to the switchdev mode, b/c essentially FDBs is what each driver now configures their HW from within, where's if we manage to get a bridge to be offloaded, all what left is systemd script that creates the VF, puts the driver into switchdev mode, creates a bridge with the reps, and that is it!! I have presented a slide in our BoF re what does it take to support FDB, here it is: 1. 
create linux bridge (e.g.1q), assign VF and uplink rep netdevices to the bridge
2. support the switchdev FDB notifications in the HW driver

learning: respond to SWITCHDEV_FDB_ADD_TO_DEVICE events

aging: respond to SWITCHDEV_FDB_DEL_TO_DEVICE events (del FDB from HW); enhance the driver/bridge API to allow drivers to provide last-use indications on FDB entries

STP:
fwd - offload FDBs as explained above
learning - make sure HW flow miss (slow path) goes to CPU
discard - add drop HW rule

flooding: use SW based flooding

^ permalink raw reply [flat|nested] 36+ messages in thread
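As a rough illustration of step 2 in the slide above, here is a minimal toy model of how a driver might mirror bridge FDB notifications into a hardware table. Only the two event names come from the slide. The `ToyEswitch` class, its methods and the "cpu" miss target are hypothetical and are not the kernel's switchdev API; a real driver would register a notifier and program its own hardware.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Event names as listed in the slide; everything else here is hypothetical.
SWITCHDEV_FDB_ADD_TO_DEVICE = "SWITCHDEV_FDB_ADD_TO_DEVICE"
SWITCHDEV_FDB_DEL_TO_DEVICE = "SWITCHDEV_FDB_DEL_TO_DEVICE"

FdbKey = Tuple[str, int]  # (MAC address, VLAN id)


@dataclass
class ToyEswitch:
    """Hypothetical stand-in for a driver's copy of the e-switch FDB."""
    hw_fdb: Dict[FdbKey, str] = field(default_factory=dict)  # key -> rep name

    def notify(self, event: str, mac: str, vid: int, rep: str) -> None:
        """Apply one bridge FDB notification to the toy hardware table."""
        key = (mac.lower(), vid)
        if event == SWITCHDEV_FDB_ADD_TO_DEVICE:
            self.hw_fdb[key] = rep          # learning: offload the entry
        elif event == SWITCHDEV_FDB_DEL_TO_DEVICE:
            self.hw_fdb.pop(key, None)      # aging: delete the entry from HW
        else:
            raise ValueError(f"unhandled event: {event}")

    def forward(self, dst_mac: str, vid: int) -> str:
        """Return the egress rep on a hit, or 'cpu' on a miss (slow path)."""
        return self.hw_fdb.get((dst_mac.lower(), vid), "cpu")


esw = ToyEswitch()
esw.notify(SWITCHDEV_FDB_ADD_TO_DEVICE, "52:54:00:12:34:56", 10, "vf0_rep")
```

After the add event, `esw.forward("52:54:00:12:34:56", 10)` hits the offloaded entry; any other destination misses and goes to the CPU, where software flooding and learning would take over.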
* Re: SRIOV switchdev mode BoF minutes 2017-11-16 17:41 ` Or Gerlitz @ 2017-11-16 18:20 ` Alexander Duyck 0 siblings, 0 replies; 36+ messages in thread From: Alexander Duyck @ 2017-11-16 18:20 UTC (permalink / raw) To: Or Gerlitz Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Thu, Nov 16, 2017 at 9:41 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote: > On Wed, Nov 15, 2017 at 1:05 AM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Tue, Nov 14, 2017 at 1:50 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote: > >>> all dealing with the sriov e-switch as a HW switch which should >>> be programmed >>> by the host stack according to well known industry models that apply >>> on physical switches, e.g >>> >>> 1. L2 FDB (Linux Bridge) >>> 2. L3 FIB (Linux Routers) >>> 3. ACLS (Linux TC) >>> >>> [3] is what implemented by the upstream sriov switchdev drivers, [1] and [2] we >>> discussed on netdev, maybe you want to play with [1] for i40e? I had a slide on >>> that in the BoF > >> So for i40e we will probably explore option 1, and possibly option 3 >> though as I said we still have to figure out what we can get the >> firmware to actually do for us. That ends up being the ultimate >> limitation. > > I think Intel/Linux/sriov wise, it would be good if you put now the > focus on that small > corner of the universe and show support for the new community lead > mode by having > one of your current drivers support that. I am trying to focus on this area. The problem is you keep assuming what we can and can't do in our hardware. I am not certain we can handle the "learning" aspect of things. The biggest issue is that our hardware was designed to be a VEPA with a filter based hairpin. It really wasn't designed to be a switch. My concern is you may have been misinformed about what our hardware can and cannot do. 
In addition changing our firmware for the parts supported by i40e isn't that easy. In addition there is no guarantee that we can do what is being asked per PCIe function, it might be a global impact on the entire device. If that were the case then it isn't an option since we can't have one function breaking another. There are a lot of what-if scenarios that we have to sort out, if we can even get the firmware update for this since it was mostly locked down and in maintenance mode. > FDB support would be great and it will help transition existing legacy > mode users to the switchdev > mode, b/c essentially FDBs is what each driver now configures their HW > from within, where's if > we manage to get a bridge to be offloaded, all what left is systemd > script that creates the VF, > puts the driver into switchdev mode, creates a bridge with the reps, > and that is it!! > > I have presented a slide in our BoF re what does it take to support > FDB, here it is: > > 1. create linux bridge (e.g.1q), assign VF and uplink rep netdevices > to the bridge > 2. support the switchdev FDB notifications in the HW driver This is essentially what I hope to support with source macvlan based port representors. > learning: respond to SWITCHDEV_FDB_ADD_TO_DEVICE events This requires that we see the traffic. We have to figure out if we can actually make the CPU the default target and can then get the traffic out of the uplink interface without horribly breaking things. It will take time to see if we can even do it. The problem is the CPU/PF is only the default target for traffic coming from the uplink on our devices. Anything the VF sends will default to the uplink unless there is a filter for it to route it otherwise. 
> aging: respond to SWITCHDEV_FDB_DEL_TO_DEVICE events (del FDB from HW) > enhance the driver/bridge API to allows drivers provide last-use > indications on FDB entries > > STP: > > fwd - offload FDBs as explained above > learning - make sure HW flow miss (slow path) goes to CPU > discard - add drop HW rule > > flooding: > > use SW based flooding This is much easier said than done when you are working with a device that was architected years before switchdev was a thing. I'll see what we can do, but I cannot make any promises. - Alex ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2017-11-14 20:00 ` Alexander Duyck 2017-11-14 21:50 ` Or Gerlitz @ 2017-11-14 23:32 ` Jakub Kicinski 1 sibling, 0 replies; 36+ messages in thread From: Jakub Kicinski @ 2017-11-14 23:32 UTC (permalink / raw) To: Alexander Duyck Cc: Or Gerlitz, David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, 14 Nov 2017 12:00:32 -0800, Alexander Duyck wrote: > On Tue, Nov 14, 2017 at 8:44 AM, Or Gerlitz wrote: > > On Mon, Nov 13, 2017 at 7:10 PM, Alexander Duyck wrote: > >> On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz wrote: > > Lets focus on this point for a moment before discussing the other points > > you raised. > > When SR-IOV was introduced there were two available modes, Virtual > Ethernet Port Aggregation, aka VEPA, and Virtual Ethernet Bridging, > aka VEB. The fact is SwitchDev is designed specifically for networking > SR-IOV with Virtual Ethernet Bridging, aka VEB. You argue that the > legacy model is bad, but I would argue that is because the legacy > model was really designed to work more for both VEPA than with VEB, > whereas SwitchDev only focuses on VEB. If you take a look in the ixgbe > or i40e drivers you will see that we support configuring both of those > modes via ndo_bridge_setlink since we have customer install bases that > actually prefer VEPA over VEB as they prefer to have their traffic > centrally managed instead of having the local host managing the > traffic. We cannot just arbitrarily tell our customers they are doing > SR-IOV using the "wrong model". Maybe that's an obvious statement, but perhaps the real problem we are grappling with here is that VEPA doesn't really exist as a forwarding model outside of SR-IOV NICs. So we have no software construct that cleanly maps onto it for offload. > I would rather not have SwitchDev become the next SystemD. 
The type > argument you are making is basically dictating to us and our customers > how things are supposed to work based on your view things. We have > different hardware, different customers, and all of our needs aren't > necessarily met by SwitchDev. I would agree that SwitchDev is the > go-to solution for VEB configuration, and we do plan to have future > hardware support it. In addition I would argue that for the sake of > consistency we should make sure that any feature that gets added to > the legacy has to be supported by the SwitchDev model as well before > it could be supported. If anything my hope is to evolve the legacy > model to have much of the same look and feel as SwitchDev, but that > will take time and require changes to the legacy model. To me the whole point of switchdev is to reuse existing ABIs, TC, FDB, bridging etc. We are arguing to stop adding special SR-IOV features, if the general direction of things is to just reflect configuration done with SW ABIs to the hardware. I think saying we need feature parity between the models is missing this crucial point. Also, the more ways there are to configure a single thing, the more confusing it will be to the users. ^ permalink raw reply [flat|nested] 36+ messages in thread
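The VEB/VEPA distinction discussed in this message can be illustrated with a small toy model. It is a sketch only: the `Mode` enum, `first_hop` function and return strings are hypothetical names, not driver code, and it assumes the usual definitions in which a VEB switches local VF-to-VF traffic inside the NIC while a VEPA sends every frame to the adjacent external switch, which hairpins local traffic back.

```python
from enum import Enum


class Mode(Enum):
    VEB = "veb"    # embedded bridge: local traffic is switched inside the NIC
    VEPA = "vepa"  # port aggregator: all traffic goes to the external switch


def first_hop(mode: Mode, dst_is_local_vf: bool) -> str:
    """Where a frame sent by a VF is switched first (toy model)."""
    if mode is Mode.VEB and dst_is_local_vf:
        return "embedded-switch"
    # VEPA always uses the external switch; VEB does so for non-local traffic.
    return "uplink"
```

In this model only `first_hop(Mode.VEB, True)` stays on the host, which is why VEPA suits users who want traffic managed centrally and why it has no obvious software bridge equivalent to offload.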
* Re: SRIOV switchdev mode BoF minutes 2017-11-12 19:49 SRIOV switchdev mode BoF minutes Or Gerlitz 2017-11-12 20:38 ` Alexander Duyck @ 2018-04-12 17:05 ` Samudrala, Sridhar 2018-04-12 20:20 ` Or Gerlitz 1 sibling, 1 reply; 36+ messages in thread From: Samudrala, Sridhar @ 2018-04-12 17:05 UTC (permalink / raw) To: Or Gerlitz, David Miller Cc: Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On 11/12/2017 11:49 AM, Or Gerlitz wrote: > Hi Dave and all, > > During and after the BoF on SRIOV switchdev mode, we came into a > consensus among the developers from four different HW vendors (CC > audience) that a correct thing to do would be to disallow any new > extensions to the legacy mode. > > The idea is to put focus on the new mode and not add new UAPIs and > kernel code which was turned to be a wrong design which does not allow > for properly offloading a kernel switching SW model to e-switch HW. > > We also had a good session the day after regarding alignment for the > representation model of the uplink (physical port) and PF/s. > > The VF representor netdevs exist for all drivers that support the new > mode but the representation for the uplink and PF wasn't the same for > all. The decision was to represent the uplink and PFs vports in the > same manner done for VFs, using rep netdevs. This alignment would > provide a more strict and clear view of the kernel model for e-switch > to users and upper layer control plane SW. > I don't see any changes in the Mellanox/other drivers to move to this new model to enable the uplink and PF port representors, any updates? It would be really nice to highlight the pros and cons of the old versus the new model. We are looking into adding switchdev support for our new 100Gb ice driver and could use some feedback on the direction we should be taking. Thanks Sridhar ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-12 17:05 ` Samudrala, Sridhar @ 2018-04-12 20:20 ` Or Gerlitz 2018-04-12 20:33 ` Samudrala, Sridhar 0 siblings, 1 reply; 36+ messages in thread From: Or Gerlitz @ 2018-04-12 20:20 UTC (permalink / raw) To: Samudrala, Sridhar Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: > On 11/12/2017 11:49 AM, Or Gerlitz wrote: >> >> Hi Dave and all, >> >> During and after the BoF on SRIOV switchdev mode, we came into a >> consensus among the developers from four different HW vendors (CC >> audience) that a correct thing to do would be to disallow any new >> extensions to the legacy mode. >> >> The idea is to put focus on the new mode and not add new UAPIs and >> kernel code which was turned to be a wrong design which does not allow >> for properly offloading a kernel switching SW model to e-switch HW. >> >> We also had a good session the day after regarding alignment for the >> representation model of the uplink (physical port) and PF/s. >> >> The VF representor netdevs exist for all drivers that support the new >> mode but the representation for the uplink and PF wasn't the same for >> all. The decision was to represent the uplink and PFs vports in the >> same manner done for VFs, using rep netdevs. This alignment would >> provide a more strict and clear view of the kernel model for e-switch >> to users and upper layer control plane SW. >> > I don't see any changes in the Mellanox/other drivers to move to this new > model to enable the uplink and PF port representors, any updates? Yeah, I have worked on that but didn't get to finalize the upstreaming so far. 
I have resumed the work and plan for the uplink rep in mlx5 to replace the PF being the uplink rep for 4.18 > It would be really nice to highlight the pros and cons of the old versus the > new model. > > We are looking into adding switchdev support for our new 100Gb ice driver > and could use some feedback on the direction we should be taking. Good news. The uplink rep is clear cut: it needs to be a rep device representing the uplink, just like the vf rep represents the vport toward the vf - please just do it correctly from the beginning I can spare ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-12 20:20 ` Or Gerlitz @ 2018-04-12 20:33 ` Samudrala, Sridhar 2018-04-13 8:56 ` Or Gerlitz 0 siblings, 1 reply; 36+ messages in thread From: Samudrala, Sridhar @ 2018-04-12 20:33 UTC (permalink / raw) To: Or Gerlitz Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On 4/12/2018 1:20 PM, Or Gerlitz wrote: > On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar > <sridhar.samudrala@intel.com> wrote: >> On 11/12/2017 11:49 AM, Or Gerlitz wrote: >>> Hi Dave and all, >>> >>> During and after the BoF on SRIOV switchdev mode, we came into a >>> consensus among the developers from four different HW vendors (CC >>> audience) that a correct thing to do would be to disallow any new >>> extensions to the legacy mode. >>> >>> The idea is to put focus on the new mode and not add new UAPIs and >>> kernel code which was turned to be a wrong design which does not allow >>> for properly offloading a kernel switching SW model to e-switch HW. >>> >>> We also had a good session the day after regarding alignment for the >>> representation model of the uplink (physical port) and PF/s. >>> >>> The VF representor netdevs exist for all drivers that support the new >>> mode but the representation for the uplink and PF wasn't the same for >>> all. The decision was to represent the uplink and PFs vports in the >>> same manner done for VFs, using rep netdevs. This alignment would >>> provide a more strict and clear view of the kernel model for e-switch >>> to users and upper layer control plane SW. >>> >> I don't see any changes in the Mellanox/other drivers to move to this new >> model to enable the uplink and PF port representors, any updates? > Yeah, I am worked on that but didn't get to finalize the upstreaming > so far. 
I have resumed > the work and plan uplink rep in mlx5 to replace the PF being uplink rep for 4.18 > >> It would be really nice to highlight the pros and cons of the old versus the >> new model. >> >> We are looking into adding switchdev support for our new 100Gb ice driver >> and could use some feedback on the direction we should be taking. > good news. > > The uplink rep is clear cut that needs to be a rep device representing > the uplink just like vf > rep represents the vport toward the vf - please just do it correct > from the begining > Having an uplink rep will definitely help implement the slow path with flat/vlan network scenarios by not having to add PF to the bridge. But how do they help with a vxlan overlay scenario? In case of overlays, the slow path has to go via vxlan -> ip stack -> pf? What about pf-rep? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-12 20:33 ` Samudrala, Sridhar @ 2018-04-13 8:56 ` Or Gerlitz 2018-04-13 8:57 ` Or Gerlitz 0 siblings, 1 reply; 36+ messages in thread From: Or Gerlitz @ 2018-04-13 8:56 UTC (permalink / raw) To: Samudrala, Sridhar Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: > On 4/12/2018 1:20 PM, Or Gerlitz wrote: >> >> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar >> <sridhar.samudrala@intel.com> wrote: >>> >>> On 11/12/2017 11:49 AM, Or Gerlitz wrote: >>>> >>>> Hi Dave and all, >>>> >>>> During and after the BoF on SRIOV switchdev mode, we came into a >>>> consensus among the developers from four different HW vendors (CC >>>> audience) that a correct thing to do would be to disallow any new >>>> extensions to the legacy mode. >>>> >>>> The idea is to put focus on the new mode and not add new UAPIs and >>>> kernel code which was turned to be a wrong design which does not allow >>>> for properly offloading a kernel switching SW model to e-switch HW. >>>> >>>> We also had a good session the day after regarding alignment for the >>>> representation model of the uplink (physical port) and PF/s. >>>> >>>> The VF representor netdevs exist for all drivers that support the new >>>> mode but the representation for the uplink and PF wasn't the same for >>>> all. The decision was to represent the uplink and PFs vports in the >>>> same manner done for VFs, using rep netdevs. This alignment would >>>> provide a more strict and clear view of the kernel model for e-switch >>>> to users and upper layer control plane SW. >>>> >>> I don't see any changes in the Mellanox/other drivers to move to this new >>> model to enable the uplink and PF port representors, any updates? 
>> >> Yeah, I am worked on that but didn't get to finalize the upstreaming >> so far. I have resumed >> the work and plan uplink rep in mlx5 to replace the PF being uplink rep >> for 4.18 >> >>> It would be really nice to highlight the pros and cons of the old versus >>> the >>> new model. >>> >>> We are looking into adding switchdev support for our new 100Gb ice driver >>> and could use some feedback on the direction we should be taking. >> >> good news. >> >> The uplink rep is clear cut that needs to be a rep device representing >> the uplink just like vf >> rep represents the vport toward the vf - please just do it correct >> from the begining >> > Having an uplink rep will definitely help implement the slow path with > flat/vlan network > scenarios by not having to add PF to the bridge. > > But how do they help with a vxlan overlay scenario? In case of overlays, the > slow path has to go via vxlan -> ip stack -> pf? in overlay networks scheme, the uplink has the VTEP ip and is not connected to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to ovs and the ip stack routes through the uplink rep > > What about pf-rep? > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-13 8:56 ` Or Gerlitz @ 2018-04-13 8:57 ` Or Gerlitz 2018-04-13 16:49 ` Samudrala, Sridhar 0 siblings, 1 reply; 36+ messages in thread From: Or Gerlitz @ 2018-04-13 8:57 UTC (permalink / raw) To: Samudrala, Sridhar Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Fri, Apr 13, 2018 at 11:56 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote: > On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar > <sridhar.samudrala@intel.com> wrote: >> On 4/12/2018 1:20 PM, Or Gerlitz wrote: >>> >>> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar >>> <sridhar.samudrala@intel.com> wrote: >>>> >>>> On 11/12/2017 11:49 AM, Or Gerlitz wrote: >>>>> >>>>> Hi Dave and all, >>>>> >>>>> During and after the BoF on SRIOV switchdev mode, we came into a >>>>> consensus among the developers from four different HW vendors (CC >>>>> audience) that a correct thing to do would be to disallow any new >>>>> extensions to the legacy mode. >>>>> >>>>> The idea is to put focus on the new mode and not add new UAPIs and >>>>> kernel code which was turned to be a wrong design which does not allow >>>>> for properly offloading a kernel switching SW model to e-switch HW. >>>>> >>>>> We also had a good session the day after regarding alignment for the >>>>> representation model of the uplink (physical port) and PF/s. >>>>> >>>>> The VF representor netdevs exist for all drivers that support the new >>>>> mode but the representation for the uplink and PF wasn't the same for >>>>> all. The decision was to represent the uplink and PFs vports in the >>>>> same manner done for VFs, using rep netdevs. This alignment would >>>>> provide a more strict and clear view of the kernel model for e-switch >>>>> to users and upper layer control plane SW. 
>>>>> >>>> I don't see any changes in the Mellanox/other drivers to move to this new >>>> model to enable the uplink and PF port representors, any updates? >>> >>> Yeah, I am worked on that but didn't get to finalize the upstreaming >>> so far. I have resumed >>> the work and plan uplink rep in mlx5 to replace the PF being uplink rep >>> for 4.18 >>> >>>> It would be really nice to highlight the pros and cons of the old versus >>>> the >>>> new model. >>>> >>>> We are looking into adding switchdev support for our new 100Gb ice driver >>>> and could use some feedback on the direction we should be taking. >>> >>> good news. >>> >>> The uplink rep is clear cut that needs to be a rep device representing >>> the uplink just like vf >>> rep represents the vport toward the vf - please just do it correct >>> from the begining >>> >> Having an uplink rep will definitely help implement the slow path with >> flat/vlan network >> scenarios by not having to add PF to the bridge. >> >> But how do they help with a vxlan overlay scenario? In case of overlays, the >> slow path has to go via vxlan -> ip stack -> pf? > > in overlay networks scheme, the uplink has the VTEP ip and is not connected the uplink rep has the vtep ip > to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to ovs > and the ip stack routes through the uplink rep > >> >> What about pf-rep? >> ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-13 8:57 ` Or Gerlitz @ 2018-04-13 16:49 ` Samudrala, Sridhar 2018-04-13 20:16 ` Or Gerlitz 0 siblings, 1 reply; 36+ messages in thread From: Samudrala, Sridhar @ 2018-04-13 16:49 UTC (permalink / raw) To: Or Gerlitz Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On 4/13/2018 1:57 AM, Or Gerlitz wrote: > On Fri, Apr 13, 2018 at 11:56 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote: >> On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar >> <sridhar.samudrala@intel.com> wrote: >>> On 4/12/2018 1:20 PM, Or Gerlitz wrote: >>>> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar >>>> <sridhar.samudrala@intel.com> wrote: >>>>> On 11/12/2017 11:49 AM, Or Gerlitz wrote: >>>>>> Hi Dave and all, >>>>>> >>>>>> During and after the BoF on SRIOV switchdev mode, we came into a >>>>>> consensus among the developers from four different HW vendors (CC >>>>>> audience) that a correct thing to do would be to disallow any new >>>>>> extensions to the legacy mode. >>>>>> >>>>>> The idea is to put focus on the new mode and not add new UAPIs and >>>>>> kernel code which was turned to be a wrong design which does not allow >>>>>> for properly offloading a kernel switching SW model to e-switch HW. >>>>>> >>>>>> We also had a good session the day after regarding alignment for the >>>>>> representation model of the uplink (physical port) and PF/s. >>>>>> >>>>>> The VF representor netdevs exist for all drivers that support the new >>>>>> mode but the representation for the uplink and PF wasn't the same for >>>>>> all. The decision was to represent the uplink and PFs vports in the >>>>>> same manner done for VFs, using rep netdevs. This alignment would >>>>>> provide a more strict and clear view of the kernel model for e-switch >>>>>> to users and upper layer control plane SW. 
>>>>>> >>>>> I don't see any changes in the Mellanox/other drivers to move to this new >>>>> model to enable the uplink and PF port representors, any updates? >>>> Yeah, I am worked on that but didn't get to finalize the upstreaming >>>> so far. I have resumed >>>> the work and plan uplink rep in mlx5 to replace the PF being uplink rep >>>> for 4.18 >>>> >>>>> It would be really nice to highlight the pros and cons of the old versus >>>>> the >>>>> new model. >>>>> >>>>> We are looking into adding switchdev support for our new 100Gb ice driver >>>>> and could use some feedback on the direction we should be taking. >>>> good news. >>>> >>>> The uplink rep is clear cut that needs to be a rep device representing >>>> the uplink just like vf >>>> rep represents the vport toward the vf - please just do it correct >>>> from the begining >>>> >>> Having an uplink rep will definitely help implement the slow path with >>> flat/vlan network >>> scenarios by not having to add PF to the bridge. >>> >>> But how do they help with a vxlan overlay scenario? In case of overlays, the >>> slow path has to go via vxlan -> ip stack -> pf? >> in overlay networks scheme, the uplink has the VTEP ip and is not connected > the uplink rep has the vtep ip > >> to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to ovs >> and the ip stack routes through the uplink rep This changes the legacy mode behavior of configuring vtep ip on the pf netdev. How is host to host traffic expected to work when vtep ip is moved to uplink rep? >> >>> What about pf-rep? Are you planning to create a pf-rep too? Is pf also treated similarly to vf in switchdev mode? All pf traffic goes to pf-rep and pf-rep traffic goes to pf by default without any rules programmed? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-13 16:49 ` Samudrala, Sridhar @ 2018-04-13 20:16 ` Or Gerlitz 2018-04-13 23:03 ` Samudrala, Sridhar 0 siblings, 1 reply; 36+ messages in thread From: Or Gerlitz @ 2018-04-13 20:16 UTC (permalink / raw) To: Samudrala, Sridhar Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Fri, Apr 13, 2018 at 7:49 PM, Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: > On 4/13/2018 1:57 AM, Or Gerlitz wrote: >>> in overlay networks scheme, the uplink rep has the VTEP ip and is not connected >>> to the bridge, e.g you use ovs you have vf reps and vxlan ports connected >>> to ovs and the ip stack routes through the uplink rep > This changes the legacy mode behavior of configuring vtep ip on the pf > netdev. How does host to host traffic expected to work when vtep ip is moved to uplink rep? What do you mean by host to host traffic, is that two VFs on the same host? Control plane SWs (such as OVS) don't apply encapsulation within the same host. >>>> What about pf-rep? > Are you planning to create a pf-rep too? Is pf also treated similar to vf in > switchdev mode? > All pf traffic goes to pf-rep and pf-rep traffic goes to pf by default > without any rules programmed? @ the sriov switchdev ARCH level, pf/pf-rep would work indeed as you described. We will have pf rep for smartnic schemes where the pf on the host is not the manager of the eswitch but rather the smartnic driver instance. On non-smart envs, there are some challenges to address for the pf nic to be fully functional for the slow path (what you described), we will get there down the road if there is a real need. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-13 20:16 ` Or Gerlitz @ 2018-04-13 23:03 ` Samudrala, Sridhar 2018-04-15 6:01 ` Or Gerlitz 0 siblings, 1 reply; 36+ messages in thread From: Samudrala, Sridhar @ 2018-04-13 23:03 UTC (permalink / raw) To: Or Gerlitz Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On 4/13/2018 1:16 PM, Or Gerlitz wrote: > On Fri, Apr 13, 2018 at 7:49 PM, Samudrala, Sridhar > <sridhar.samudrala@intel.com> wrote: >> On 4/13/2018 1:57 AM, Or Gerlitz wrote: > >>>> in overlay networks scheme, the uplink rep has the VTEP ip and is not connected >>>> to the bridge, e.g you use ovs you have vf reps and vxlan ports connected >>>> to ovs and the ip stack routes through the uplink rep >> This changes the legacy mode behavior of configuring vtep ip on the pf >> netdev. How does host to host traffic expected to work when vtep ip is moved to uplink rep? > What do you mean host to host traffic, is that two VFs on the same host? > control plane SWs (such as OVS) don't apply encapsulation within the same host I meant between PFs on 2 compute nodes. > >>>>> What about pf-rep? >> Are you planning to create a pf-rep too? Is pf also treated similar to vf in >> switchdev mode? >> All pf traffic goes to pf-rep and pf-rep traffic goes to pf by default >> without any rules programmed? > @ the sriov switchdev ARCH level, pf/pf-rep would work indeed as you described. > > We will have pf rep for smartnic schemes where the the pf on the host > is not the manager of the eswitch but rather the smartnic driver instance. > > on non smart env, there are some challenges to address for the pf > nic to be fully functional for the slow path (what you described), we > will get there down the road if there is a real need. So on non-smart env, are you planning to only expose uplink rep and vf reps as netdevs. 
By smartnic env, i guess you are referring to OVS control plane also running on the NIC. I will look forward to your patches. Thanks Sridhar ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-13 23:03 ` Samudrala, Sridhar @ 2018-04-15 6:01 ` Or Gerlitz 2018-04-16 12:39 ` Andy Gospodarek 0 siblings, 1 reply; 36+ messages in thread From: Or Gerlitz @ 2018-04-15 6:01 UTC (permalink / raw) To: Samudrala, Sridhar Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: > I meant between PFs on 2 compute nodes. If the PF serves as uplink rep, it functions as a switch port -- applications don't run on switch ports. One way to get apps to run on the host in switchdev mode is to probe one of the VFs there. [...] > By smartnic env, i guess you are referring to OVS control plane also running > on the NIC. correct > I will look forward to your patches. FWIW, note that my patches don't bring any newz for you.. I am aligning mlx5 with what was agreed on netdev, e.g nfp does it (uplink rep and such) already. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-15 6:01 ` Or Gerlitz @ 2018-04-16 12:39 ` Andy Gospodarek 2018-04-17 2:08 ` Samudrala, Sridhar 0 siblings, 1 reply; 36+ messages in thread From: Andy Gospodarek @ 2018-04-16 12:39 UTC (permalink / raw) To: Or Gerlitz Cc: Samudrala, Sridhar, David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote: > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar > <sridhar.samudrala@intel.com> wrote: > > > I meant between PFs on 2 compute nodes. > > If the PF serves as uplink rep, it functions as a switch port -- applications > don't run on switch ports. One way to get apps to run on the host in switchdev > mode is probe one of the VFs there. > > > [...] > > > By smartnic env, i guess you are referring to OVS control plane also running > > on the NIC. > > correct > Not just OvS, but other applications running on the SmartNIC that use tc for programming hardware can benefit from a design like this. > > I will look forward to your patches. > > FWIW, note that my patches don't bring any newz for you.. I am aligning > mlx5 with what was agreed on netdev, e.g nfp does it (uplink rep and > such) already. Probably not major news from us either since this was discussed at the last NetConf, but we are planning to have this option for SmartNICs or PCI-multihost NICs, too. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-16 12:39 ` Andy Gospodarek @ 2018-04-17 2:08 ` Samudrala, Sridhar 2018-04-17 13:30 ` Andy Gospodarek 0 siblings, 1 reply; 36+ messages in thread From: Samudrala, Sridhar @ 2018-04-17 2:08 UTC (permalink / raw) To: Andy Gospodarek, Or Gerlitz Cc: David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On 4/16/2018 5:39 AM, Andy Gospodarek wrote: > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote: >> On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar >> <sridhar.samudrala@intel.com> wrote: >> >>> I meant between PFs on 2 compute nodes. >> If the PF serves as uplink rep, it functions as a switch port -- applications >> don't run on switch ports. One way to get apps to run on the host in switchdev >> mode is probe one of the VFs there. >> >> >> So once a pci device is configured in 'switchdev' mode, only port representor netdevs are seen on the host, no more PF netdev. Are you going to expose another way to change sriov_num_vfs when the device is in 'switchdev' mode OR do we need to switch to 'legacy' mode to increase/decrease the number of VFs? Even in switchdev mode, i guess it will be possible for host apps to use the IP configured on the uplink rep to talk externally. In case of multiple uplinks, are you exposing one uplink-rep netdev per uplink? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-17 2:08 ` Samudrala, Sridhar @ 2018-04-17 13:30 ` Andy Gospodarek 2018-04-17 13:58 ` Or Gerlitz 0 siblings, 1 reply; 36+ messages in thread From: Andy Gospodarek @ 2018-04-17 13:30 UTC (permalink / raw) To: Samudrala, Sridhar Cc: Andy Gospodarek, Or Gerlitz, David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote: > > On 4/16/2018 5:39 AM, Andy Gospodarek wrote: > > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote: > > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar > > > <sridhar.samudrala@intel.com> wrote: > > > > > > > I meant between PFs on 2 compute nodes. > > > If the PF serves as uplink rep, it functions as a switch port -- applications > > > don't run on switch ports. One way to get apps to run on the host in switchdev > > > mode is probe one of the VFs there. > > > > > > > > > > So once a pci device is configured in 'switchdev' mode, only port representor netdevs are > seen on the host, no more PF netdev. That is not the functionality I would propose. The PF netdev will still be there. > Are you going to expose another way to change sriov_num_vfs when the device is in > 'switchdev' mode OR do we need to switch to 'legacy' mode to increase/decrease the number of > VFs? Since the PF netdev will not disappear, the standard ways to configure number of VF, etc is still available. > Even in switchdev mode, i guess it will be possible for host apps to use the IP configured > on the uplink rep to talk externally. > > In case of multiple uplinks, are you exposing one uplink-rep netdev per uplink? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-17 13:30 ` Andy Gospodarek @ 2018-04-17 13:58 ` Or Gerlitz 2018-04-17 14:47 ` Andy Gospodarek 0 siblings, 1 reply; 36+ messages in thread From: Or Gerlitz @ 2018-04-17 13:58 UTC (permalink / raw) To: Andy Gospodarek Cc: Samudrala, Sridhar, David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, Apr 17, 2018 at 4:30 PM, Andy Gospodarek <andrew.gospodarek@broadcom.com> wrote: > On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote: >> >> On 4/16/2018 5:39 AM, Andy Gospodarek wrote: >> > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote: >> > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar >> > > <sridhar.samudrala@intel.com> wrote: >> > > >> > > > I meant between PFs on 2 compute nodes. >> > > If the PF serves as uplink rep, it functions as a switch port -- applications >> > > don't run on switch ports. One way to get apps to run on the host in switchdev >> > > mode is probe one of the VFs there. >> > > >> > > >> > > >> So once a pci device is configured in 'switchdev' mode, only port representor netdevs are >> seen on the host, no more PF netdev. > > That is not the functionality I would propose. The PF netdev will still be there. Andy, Basically LGTM. So even in smartnic configs, the PF at the host is still privileged to create/destroy VFs or provision MACs for them even if it is not the e-switch manager anymore? Actually AFAIK this can also work somehow otherwise, e.g. a smartnic FW "pushes" the VFs into the host without them being under a host admin directive. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-17 13:58 ` Or Gerlitz @ 2018-04-17 14:47 ` Andy Gospodarek 2018-04-17 16:46 ` Samudrala, Sridhar 2018-04-17 23:19 ` Jakub Kicinski 0 siblings, 2 replies; 36+ messages in thread From: Andy Gospodarek @ 2018-04-17 14:47 UTC (permalink / raw) To: Or Gerlitz Cc: Andy Gospodarek, Samudrala, Sridhar, David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, Apr 17, 2018 at 04:58:05PM +0300, Or Gerlitz wrote: > On Tue, Apr 17, 2018 at 4:30 PM, Andy Gospodarek > <andrew.gospodarek@broadcom.com> wrote: > > On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote: > >> > >> On 4/16/2018 5:39 AM, Andy Gospodarek wrote: > >> > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote: > >> > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar > >> > > <sridhar.samudrala@intel.com> wrote: > >> > > > >> > > > I meant between PFs on 2 compute nodes. > >> > > If the PF serves as uplink rep, it functions as a switch port -- applications > >> > > don't run on switch ports. One way to get apps to run on the host in switchdev > >> > > mode is probe one of the VFs there. > >> > > > >> > > > >> > > > >> So once a pci device is configured in 'switchdev' mode, only port representor netdevs are > >> seen on the host, no more PF netdev. > > > > That is not the functionality I would propose. The PF netdev will still be there. > > Andy, > > Basically LGTM, so even in smartnic configs, the PF @ the host is > still privileged to > create/destroy VFs or provision MACs for them even if it is not the > e-switch manager > anymore? Yes, in a SmartNIC world one config we aim to have is that a host can create and destroy VFs as needed. One of the challenges is how the VF reps are managed by applications in the SmartNIC when the host could make them disappear. 
> Actually AFAIK this can also work somehow otherwise, e.g a smartnic FW > "pushes" the VFs into the host w.o them being under a host admin directive. The model to 'push' VFs to a host is also another option, but I do not like it as much. My general preference is to allow the host to use a SmartNIC as if it were any other standard NIC (we have been using the term 'Performance NIC' to describe what we would call a standard NIC, but the name is not terribly important). There is also a school of thought that the VF reps could be pre-allocated on the SmartNIC so that any application processing that traffic would sit idle when no traffic arrives on the rep, but could process frames that do arrive when the VFs were created on the host. This implementation will depend on how resources are allocated on a given bit of hardware, but can really work well. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-17 14:47 ` Andy Gospodarek @ 2018-04-17 16:46 ` Samudrala, Sridhar 2018-04-17 16:53 ` Andy Gospodarek 2018-04-17 23:19 ` Jakub Kicinski 1 sibling, 1 reply; 36+ messages in thread From: Samudrala, Sridhar @ 2018-04-17 16:46 UTC (permalink / raw) To: Andy Gospodarek, Or Gerlitz Cc: David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On 4/17/2018 7:47 AM, Andy Gospodarek wrote: > On Tue, Apr 17, 2018 at 04:58:05PM +0300, Or Gerlitz wrote: >> On Tue, Apr 17, 2018 at 4:30 PM, Andy Gospodarek >> <andrew.gospodarek@broadcom.com> wrote: >>> On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote: >>>> On 4/16/2018 5:39 AM, Andy Gospodarek wrote: >>>>> On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote: >>>>>> On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar >>>>>> <sridhar.samudrala@intel.com> wrote: >>>>>> >>>>>>> I meant between PFs on 2 compute nodes. >>>>>> If the PF serves as uplink rep, it functions as a switch port -- applications >>>>>> don't run on switch ports. One way to get apps to run on the host in switchdev >>>>>> mode is probe one of the VFs there. >>>>>> >>>>>> >>>>>> >>>> So once a pci device is configured in 'switchdev' mode, only port representor netdevs are >>>> seen on the host, no more PF netdev. >>> That is not the functionality I would propose. The PF netdev will still be there. >> Andy, >> >> Basically LGTM, so even in smartnic configs, the PF @ the host is >> still privileged to >> create/destroy VFs or provision MACs for them even if it is not the >> e-switch manager >> anymore? > Yes, in a SmartNIC world one config we aim to have is that a host can create > and destroy VFs as needed. One of the challenges is how the VF reps are > managed by applications in the SmartNIC when the host could make them > disappear. OK. 
So are we saying that in 'switchdev' mode with 2 VFs and 1 uplink, the host will see a PF netdev, 2 vf-rep netdevs corresponding to the 2 VFs, and 1 uplink-rep netdev? Is the PF netdev used only to control/configure the VFs? If it is used as a datapath, i think we need a pf-rep netdev too. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-17 16:46 ` Samudrala, Sridhar @ 2018-04-17 16:53 ` Andy Gospodarek 0 siblings, 0 replies; 36+ messages in thread From: Andy Gospodarek @ 2018-04-17 16:53 UTC (permalink / raw) To: Samudrala, Sridhar Cc: Andy Gospodarek, Or Gerlitz, David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, Apr 17, 2018 at 09:46:38AM -0700, Samudrala, Sridhar wrote: > On 4/17/2018 7:47 AM, Andy Gospodarek wrote: > > On Tue, Apr 17, 2018 at 04:58:05PM +0300, Or Gerlitz wrote: > > > On Tue, Apr 17, 2018 at 4:30 PM, Andy Gospodarek > > > <andrew.gospodarek@broadcom.com> wrote: > > > > On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote: > > > > > On 4/16/2018 5:39 AM, Andy Gospodarek wrote: > > > > > > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote: > > > > > > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar > > > > > > > <sridhar.samudrala@intel.com> wrote: > > > > > > > > > > > > > > > I meant between PFs on 2 compute nodes. > > > > > > > If the PF serves as uplink rep, it functions as a switch port -- applications > > > > > > > don't run on switch ports. One way to get apps to run on the host in switchdev > > > > > > > mode is probe one of the VFs there. > > > > > > > > > > > > > > > > > > > > > > > > > > So once a pci device is configured in 'switchdev' mode, only port representor netdevs are > > > > > seen on the host, no more PF netdev. > > > > That is not the functionality I would propose. The PF netdev will still be there. > > > Andy, > > > > > > Basically LGTM, so even in smartnic configs, the PF @ the host is > > > still privileged to > > > create/destroy VFs or provision MACs for them even if it is not the > > > e-switch manager > > > anymore? > > Yes, in a SmartNIC world one config we aim to have is that a host can create > > and destroy VFs as needed. 
One of the challenges is how the VF reps are > > managed by applications in the SmartNIC when the host could make them > > disappear. > > OK. So are we saying that in 'switchdev' mode with 2 VFs and 1 uplink, the host will > see PF netdev, 2 vf-rep netdev's corresponding to 2 VFs and 1 uplink-rep netdev. > > Is PF netdev used only for the control/configure of the VFs? If it used as a datapath, > i think we need a pf-rep netdev too. > Yes, that is correct. PF reps could be used for datapath configuration to redirect traffic to a PF. ^ permalink raw reply [flat|nested] 36+ messages in thread
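The netdev inventory agreed on in this exchange can be written down as a small model. This is only an illustrative sketch: the interface names (`pf0`, `pf0vf0_rep`, `uplink0_rep`, `pf0_rep`) are hypothetical and not taken from any driver, and emitting one uplink rep per uplink is an assumption, since that question is left open earlier in the thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Netdev:
    name: str  # hypothetical interface name, for illustration only
    role: str  # "pf", "vf-rep", "uplink-rep", or "pf-rep"


def host_netdevs(num_vfs: int, num_uplinks: int, pf_datapath: bool) -> List[Netdev]:
    """List the netdevs the host would see in switchdev mode under this model."""
    # The PF netdev does not disappear when the device enters switchdev mode.
    devs = [Netdev("pf0", "pf")]
    # One representor per VF.
    devs += [Netdev(f"pf0vf{i}_rep", "vf-rep") for i in range(num_vfs)]
    # Assumption: one uplink representor per uplink.
    devs += [Netdev(f"uplink{i}_rep", "uplink-rep") for i in range(num_uplinks)]
    # A PF rep is only needed when traffic must be redirected to the PF itself.
    if pf_datapath:
        devs.append(Netdev("pf0_rep", "pf-rep"))
    return devs


if __name__ == "__main__":
    for dev in host_netdevs(num_vfs=2, num_uplinks=1, pf_datapath=True):
        print(f"{dev.name:14s} {dev.role}")
```

With `pf_datapath=False` the list is exactly the case described above: the PF netdev, 2 vf-rep netdevs and 1 uplink-rep netdev.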
* Re: SRIOV switchdev mode BoF minutes 2018-04-17 14:47 ` Andy Gospodarek 2018-04-17 16:46 ` Samudrala, Sridhar @ 2018-04-17 23:19 ` Jakub Kicinski 2018-04-18 15:15 ` Andy Gospodarek 1 sibling, 1 reply; 36+ messages in thread From: Jakub Kicinski @ 2018-04-17 23:19 UTC (permalink / raw) To: Andy Gospodarek Cc: Or Gerlitz, Samudrala, Sridhar, David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, 17 Apr 2018 10:47:00 -0400, Andy Gospodarek wrote: > There is also a school of thought that the VF reps could be > pre-allocated on the SmartNIC so that any application processing that > traffic would sit idle when no traffic arrives on the rep, but could > process frames that do arrive when the VFs were created on the host. > This implementation will depend on how resources are allocated on a > given bit of hardware, but can really work well. +1, if there are no FW resource allocation issues IMHO it's okay to just show all reprs for "remote PCIes (PFs and VFs)" on the SmartNIC/controller. The reprs should just show link down, as if the PCIe cable was unplugged, until the host actually enables them. A similar issue exists on multi-host for PFs, right? If one of the hosts is down do we still show their PF repr? IMHO yes. That makes the thing look more like a switch with cables being plugged in and out. ^ permalink raw reply [flat|nested] 36+ messages in thread
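The "switch with cables being plugged in and out" behaviour proposed here can be sketched as a toy state model. Everything below is hypothetical: `Representor`, `Controller` and the function names are illustration only and do not correspond to any kernel or firmware interface. The sketch assumes all representors for remote functions exist from the start and only their link state changes.

```python
from typing import Dict, Iterable, List


class Representor:
    """A representor netdev for one remote PCIe function (PF or VF)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.link_up = False  # pre-allocated reprs start with link down


class Controller:
    """Hypothetical SmartNIC-side view: every possible repr exists up front."""

    def __init__(self, remote_functions: Iterable[str]) -> None:
        self.reprs: Dict[str, Representor] = {
            name: Representor(name) for name in remote_functions
        }

    def host_enabled(self, name: str) -> None:
        # The host created/enabled the function: the "cable" is plugged in.
        self.reprs[name].link_up = True

    def host_disabled(self, name: str) -> None:
        # The host removed the function or went down: link drops, repr stays.
        self.reprs[name].link_up = False

    def links_up(self) -> List[str]:
        return sorted(n for n, r in self.reprs.items() if r.link_up)


if __name__ == "__main__":
    ctl = Controller(["pf0", "pf0vf0", "pf0vf1"])
    ctl.host_enabled("pf0vf1")
    print(ctl.links_up())
```

The point of the design is that an application on the SmartNIC can bind to a representor once and simply see carrier changes, rather than having netdevs appear and disappear underneath it.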
* Re: SRIOV switchdev mode BoF minutes 2018-04-17 23:19 ` Jakub Kicinski @ 2018-04-18 15:15 ` Andy Gospodarek 2018-04-18 16:26 ` Jakub Kicinski 2018-04-18 17:07 ` Parikh, Neerav 0 siblings, 2 replies; 36+ messages in thread From: Andy Gospodarek @ 2018-04-18 15:15 UTC (permalink / raw) To: Jakub Kicinski Cc: Andy Gospodarek, Or Gerlitz, Samudrala, Sridhar, David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Tue, Apr 17, 2018 at 04:19:15PM -0700, Jakub Kicinski wrote: > On Tue, 17 Apr 2018 10:47:00 -0400, Andy Gospodarek wrote: > > There is also a school of thought that the VF reps could be > > pre-allocated on the SmartNIC so that any application processing that > > traffic would sit idle when no traffic arrives on the rep, but could > > process frames that do arrive when the VFs were created on the host. > > This implementation will depend on how resources are allocated on a > > given bit of hardware, but can really work well. > > +1 if there is no FW resource allocation issues IMHO it's okay to > just show all reprs for "remote PCIes (PFs and VFs)" on the SmartNIC/ > controller. The reprs should just show link down as if PCIe cable > was unpluged until host actually enables them. Yes we are on the same page on this. > A similar issue exists on multi-host for PFs, right? If one of the > hosts is down do we still show their PF repr? IMHO yes. I would agree with that as well. With today's model the VF reps are created once a PF is put into switchdev mode, but I'm still working out how we want to consider whether or not a PF rep for the other domains is created locally or not and also how one can determine which domain is in control. Permanent config options (like NVRAM settings) could easily handle which domain is in control, but that still does not mean that PF reps must be created automatically, does it? 
> That makes the thing looks more like a switch with cables being plugged > in and out. Yes, that's exactly how I view it as well. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-18 15:15 ` Andy Gospodarek @ 2018-04-18 16:26 ` Jakub Kicinski 2018-04-18 17:25 ` Andy Gospodarek 2018-04-18 17:07 ` Parikh, Neerav 1 sibling, 1 reply; 36+ messages in thread From: Jakub Kicinski @ 2018-04-18 16:26 UTC (permalink / raw) To: Andy Gospodarek Cc: Or Gerlitz, Samudrala, Sridhar, David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Wed, 18 Apr 2018 11:15:29 -0400, Andy Gospodarek wrote: > > A similar issue exists on multi-host for PFs, right? If one of the > > hosts is down do we still show their PF repr? IMHO yes. > > I would agree with that as well. With today's model the VF reps are > created once a PF is put into switchdev mode, but I'm still working out > how we want to consider whether or not a PF rep for the other domains is > created locally or not and also how one can determine which domain is in > control. > > Permanent config options (like NVRAM settings) could easily handle which > domain is in control, but that still does not mean that PF reps must be > created automatically, does it? The control domain is tricky. I'm not sure I understand how you could not have a PF rep for remote domains, though. How do you configure switching to the PF netdev if there is no rep? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: SRIOV switchdev mode BoF minutes 2018-04-18 16:26 ` Jakub Kicinski @ 2018-04-18 17:25 ` Andy Gospodarek 0 siblings, 0 replies; 36+ messages in thread From: Andy Gospodarek @ 2018-04-18 17:25 UTC (permalink / raw) To: Jakub Kicinski Cc: Andy Gospodarek, Or Gerlitz, Samudrala, Sridhar, David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List On Wed, Apr 18, 2018 at 09:26:34AM -0700, Jakub Kicinski wrote: > On Wed, 18 Apr 2018 11:15:29 -0400, Andy Gospodarek wrote: > > > A similar issue exists on multi-host for PFs, right? If one of the > > > hosts is down do we still show their PF repr? IMHO yes. > > > > I would agree with that as well. With today's model the VF reps are > > created once a PF is put into switchdev mode, but I'm still working out > > how we want to consider whether or not a PF rep for the other domains is > > created locally or not and also how one can determine which domain is in > > control. > > > > Permanent config options (like NVRAM settings) could easily handle which > > domain is in control, but that still does not mean that PF reps must be > > created automatically, does it? > > The control domain is tricky. I'm not sure I understand how you could > not have a PF rep for remote domains, though. How do you configure > switching to the PF netdev if there is no rep? Yes, for complete control of all traffic using standard Linux APIs a PF rep is a requirement. ^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: SRIOV switchdev mode BoF minutes 2018-04-18 15:15 ` Andy Gospodarek 2018-04-18 16:26 ` Jakub Kicinski @ 2018-04-18 17:07 ` Parikh, Neerav 1 sibling, 0 replies; 36+ messages in thread From: Parikh, Neerav @ 2018-04-18 17:07 UTC (permalink / raw) To: Andy Gospodarek, Jakub Kicinski Cc: Or Gerlitz, Samudrala, Sridhar, David Miller, Singhai, Anjali, Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List > -----Original Message----- > From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] > On Behalf Of Andy Gospodarek > Sent: Wednesday, April 18, 2018 8:15 AM > To: Jakub Kicinski <jakub.kicinski@netronome.com> > Cc: Andy Gospodarek <andrew.gospodarek@broadcom.com>; Or Gerlitz > <gerlitz.or@gmail.com>; Samudrala, Sridhar <sridhar.samudrala@intel.com>; > David Miller <davem@davemloft.net>; Singhai, Anjali > <anjali.singhai@intel.com>; Michael Chan <michael.chan@broadcom.com>; > Simon Horman <simon.horman@netronome.com>; John Fastabend > <john.fastabend@gmail.com>; Saeed Mahameed <saeedm@mellanox.com>; > Jiri Pirko <jiri@mellanox.com>; Rony Efraim <ronye@mellanox.com>; Linux > Netdev List <netdev@vger.kernel.org> > Subject: Re: SRIOV switchdev mode BoF minutes > > On Tue, Apr 17, 2018 at 04:19:15PM -0700, Jakub Kicinski wrote: > > On Tue, 17 Apr 2018 10:47:00 -0400, Andy Gospodarek wrote: > > > There is also a school of thought that the VF reps could be > > > pre-allocated on the SmartNIC so that any application processing that > > > traffic would sit idle when no traffic arrives on the rep, but could > > > process frames that do arrive when the VFs were created on the host. > > > This implementation will depend on how resources are allocated on a > > > given bit of hardware, but can really work well. > > > > +1 if there is no FW resource allocation issues IMHO it's okay to > > just show all reprs for "remote PCIes (PFs and VFs)" on the SmartNIC/ > > controller. 
The reprs should just show link down as if PCIe cable > > was unpluged until host actually enables them. > > Yes we are on the same page on this. > > > A similar issue exists on multi-host for PFs, right? If one of the > > hosts is down do we still show their PF repr? IMHO yes. > > I would agree with that as well. With today's model the VF reps are > created once a PF is put into switchdev mode, but I'm still working out > how we want to consider whether or not a PF rep for the other domains is > created locally or not and also how one can determine which domain is in > control. > > Permanent config options (like NVRAM settings) could easily handle which > domain is in control, but that still does not mean that PF reps must be > created automatically, does it? > > > That makes the thing looks more like a switch with cables being plugged > > in and out. > > Yes, that's exactly how I view it as well. If we need to behave like a switch or emulate that mode, has there been any thought around the usability model? Whichever domain is in control, the implication above is that the max number of vports supported (VFs and PFs) will need to be represented regardless of whether they're "Enabled" or not in a given "Switch". By "Enabled" I mean that SRIOV VFs may not have been enabled but the representor still exists. For example, if there are 256 VFs supported on a given PF, when someone switches into switchdev mode there will be ~257 representor netdevs added into the system. And if you have multi-port, multi-function devices, the number of representor netdevs will increase accordingly. While representor netdev naming may help a bit here, a user will still need to differentiate between the sprawl of representor netdevs and data netdevs to identify which ones can be added into an OVS bridge (or vSwitch). And switching to "switchdev mode" becomes a prerequisite for any vSwitch bridge that uses these representor netdevs.
In a SmartNIC, where users are not managing the devices, this may be deployed based on the NIC FW/SW capabilities. But I'm not sure how the same model, applied on a standard host running Linux, will work across devices. ^ permalink raw reply [flat|nested] 36+ messages in thread
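The scaling concern raised in this message is simple arithmetic, sketched below. The function name and the assumption that each PF contributes one representor per supported VF plus one for the PF itself are illustrative; uplink representors are not counted, and the 4-PF figure is a made-up example rather than a number from the thread.

```python
def representor_count(max_vfs_per_pf: int, num_pfs: int = 1) -> int:
    """Representors created if every possible vport gets one up front.

    Assumes one representor per supported VF plus one per PF;
    uplink representors are not counted.
    """
    return num_pfs * (max_vfs_per_pf + 1)


if __name__ == "__main__":
    # Single PF supporting 256 VFs, as in the message above.
    print(representor_count(256))
    # Hypothetical multi-function device with 4 such PFs.
    print(representor_count(256, num_pfs=4))
```

For a single PF with 256 VFs this gives the ~257 representors mentioned above, and the count grows linearly with the number of PFs on the device.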
end of thread

Thread overview: 36+ messages
2017-11-12 19:49 SRIOV switchdev mode BoF minutes Or Gerlitz
2017-11-12 20:38 ` Alexander Duyck
2017-11-13 6:16 ` Or Gerlitz
2017-11-13 17:10 ` Alexander Duyck
2017-11-14 16:44 ` Or Gerlitz
2017-11-14 20:00 ` Alexander Duyck
2017-11-14 21:50 ` Or Gerlitz
2017-11-14 23:05 ` Alexander Duyck
2017-11-14 23:36 ` Jakub Kicinski
2017-11-15 3:04 ` Alexander Duyck
2017-11-15 4:02 ` Jakub Kicinski
2017-11-15 18:25 ` Alexander Duyck
2017-11-16 17:41 ` Or Gerlitz
2017-11-16 18:20 ` Alexander Duyck
2017-11-14 23:32 ` Jakub Kicinski
2018-04-12 17:05 ` Samudrala, Sridhar
2018-04-12 20:20 ` Or Gerlitz
2018-04-12 20:33 ` Samudrala, Sridhar
2018-04-13 8:56 ` Or Gerlitz
2018-04-13 8:57 ` Or Gerlitz
2018-04-13 16:49 ` Samudrala, Sridhar
2018-04-13 20:16 ` Or Gerlitz
2018-04-13 23:03 ` Samudrala, Sridhar
2018-04-15 6:01 ` Or Gerlitz
2018-04-16 12:39 ` Andy Gospodarek
2018-04-17 2:08 ` Samudrala, Sridhar
2018-04-17 13:30 ` Andy Gospodarek
2018-04-17 13:58 ` Or Gerlitz
2018-04-17 14:47 ` Andy Gospodarek
2018-04-17 16:46 ` Samudrala, Sridhar
2018-04-17 16:53 ` Andy Gospodarek
2018-04-17 23:19 ` Jakub Kicinski
2018-04-18 15:15 ` Andy Gospodarek
2018-04-18 16:26 ` Jakub Kicinski
2018-04-18 17:25 ` Andy Gospodarek
2018-04-18 17:07 ` Parikh, Neerav