* Offloading DSA taggers to hardware
@ 2019-11-13 12:40 Vladimir Oltean
2019-11-13 16:53 ` Andrew Lunn
2019-11-13 19:40 ` Florian Fainelli
0 siblings, 2 replies; 7+ messages in thread
From: Vladimir Oltean @ 2019-11-13 12:40 UTC (permalink / raw)
To: netdev
DSA is all about pairing any tagging-capable (or at least VLAN-capable) switch
to any NIC, and the software stack creates N "virtual" net devices, each
representing a switch port, with I/O capabilities based on the metadata present
in the frame. It all looks like an hourglass:
  switch           switch           switch           switch           switch
net_device       net_device       net_device       net_device       net_device
     |                |                |                |                |
     |                |                |                |                |
     |                |                |                |                |
     +----------------+----------------+----------------+----------------+
                                       |
                                       |
                                  DSA master
                                  net_device
                                       |
                                       |
                                  DSA master
                                      NIC
                                       |
                                    switch
                                   CPU port
                                       |
                                       |
     +----------------+----------------+----------------+----------------+
     |                |                |                |                |
     |                |                |                |                |
     |                |                |                |                |
  switch           switch           switch           switch           switch
   port             port             port             port             port
But the process by which the stack:
- Parses the frame on receive, decodes the DSA tag and redirects the frame from
the DSA master net_device to a switch net_device based on the source port,
then removes the DSA tag from the frame and recalculates checksums as
appropriate
- Adds the DSA tag on xmit, then redirects the frame from the "virtual" switch
net_device to the real DSA master net_device
can be optimized if the DSA master NIC supports it. Let's say there is a
fictional NIC that has a programmable hardware parser and the ability to
perform frame manipulation (insert or extract a tag). Such a NIC could be
programmed to do a better job of adding/removing the DSA tag, as well as of
masquerading skb->dev based on the parser metadata. In addition, there
would be a net benefit for QoS, which, as a consequence of the DSA model,
cannot really be end-to-end today: a frame classified to a high-priority
traffic class by the switch may be treated as best-effort by the DSA
master, because the master doesn't actually parse the DSA tag (the traffic
class field, in this case).
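
For reference, the software hotpath being discussed looks roughly like
this. It is a simplified kernel-side sketch, loosely modeled on the
in-tree taggers (tag_brcm.c and friends, which also provide the helpers
used here); the 4-byte tag format, the port encoding and the offset
bookkeeping are invented/simplified for illustration:

#define EXAMPLE_TAG_LEN	4	/* invented: 4-byte tag after the MAC addresses */

static struct sk_buff *example_tag_xmit(struct sk_buff *skb,
					struct net_device *dev)
{
	struct dsa_port *dp = dsa_slave_to_port(dev);
	u8 *tag;

	if (skb_cow_head(skb, EXAMPLE_TAG_LEN) < 0)
		return NULL;

	/* Open a hole after the MAC addresses and move DMAC/SMAC up */
	skb_push(skb, EXAMPLE_TAG_LEN);
	memmove(skb->data, skb->data + EXAMPLE_TAG_LEN, 2 * ETH_ALEN);

	/* Encode the egress switch port in the (invented) tag */
	tag = skb->data + 2 * ETH_ALEN;
	tag[0] = dp->index;

	return skb;
}

static struct sk_buff *example_tag_rcv(struct sk_buff *skb,
				       struct net_device *dev,
				       struct packet_type *pt)
{
	int source_port;
	u8 *tag;

	if (unlikely(!pskb_may_pull(skb, EXAMPLE_TAG_LEN)))
		return NULL;

	/* After eth_type_trans(), the tag starts where the EtherType was */
	tag = skb->data - 2;
	source_port = tag[0];

	/* Redirect from the DSA master to the per-port net_device */
	skb->dev = dsa_master_find_slave(dev, 0, source_port);
	if (!skb->dev)
		return NULL;

	/* Strip the tag, fix up the checksum, and close the hole */
	skb_pull_rcsum(skb, EXAMPLE_TAG_LEN);
	memmove(skb->data - ETH_HLEN,
		skb->data - ETH_HLEN - EXAMPLE_TAG_LEN, 2 * ETH_ALEN);

	return skb;
}

Both memmove() calls and the skb->dev reassignment are exactly the kind
of work a sufficiently smart DSA master NIC could take over.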
I think the DSA hotpath would still need to be involved, but instead of calling
the tagger's xmit/rcv it would need to call a newly introduced ndo that
offloads this operation.
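
Purely as a strawman (these names are invented here and exist nowhere
in mainline), the hook could look something like this:

/* Tag metadata exchanged between the DSA hotpath and the master NIC */
struct dsa_tag_offload_info {
	int	port;	/* switch port the skb belongs to */
	u8	tc;	/* traffic class carried in / parsed from the tag */
};

/* Candidate additions to the DSA master's struct net_device_ops: */
netdev_tx_t	(*ndo_dsa_xmit_offload)(struct sk_buff *skb,
					struct net_device *master,
					const struct dsa_tag_offload_info *info);
struct sk_buff *(*ndo_dsa_rcv_offload)(struct sk_buff *skb,
				       struct net_device *master,
				       struct dsa_tag_offload_info *info);

dsa_slave_xmit() would fill in the info and call ndo_dsa_xmit_offload()
when the master implements it, falling back to the software tagger
otherwise; dsa_switch_rcv() would use the info filled in by
ndo_dsa_rcv_offload() to pick the slave net_device, instead of parsing
the tag in software.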
Is there any hardware out there that can do this? Is it desirable to see
something like this in DSA?
Regards,
-Vladimir

* Re: Offloading DSA taggers to hardware
  2019-11-13 12:40 Offloading DSA taggers to hardware Vladimir Oltean
@ 2019-11-13 16:53 ` Andrew Lunn
  2019-11-13 17:47   ` Vladimir Oltean
  2019-11-13 19:29   ` Florian Fainelli
  2019-11-13 19:40 ` Florian Fainelli
  1 sibling, 2 replies; 7+ messages in thread
From: Andrew Lunn @ 2019-11-13 16:53 UTC (permalink / raw)
To: Vladimir Oltean; +Cc: netdev

Hi Vladimir

I've not seen any hardware that can do this. There is an
Atheros/Qualcomm integrated SoC/switch where the 'header' is actually
just a field in the transmit/receive descriptor. There is an
out-of-tree driver for it, and the tag driver is very minimal. But
clearly this only works for integrated systems.

The other 'smart' feature I've seen in NICs with respect to DSA is
being able to do hardware checksums. The Freescale FEC, for example,
cannot figure out where the IP header is because of the DSA header,
and so cannot calculate IP/TCP/UDP checksums. Marvell, and I expect
some other vendors of both MAC and switch devices, know about these
headers and can do checksumming.

I'm not even sure there are any NICs which can do GSO or LRO when
there is a DSA header involved.

In the direction CPU to switch, I think many of the QoS issues are
higher up the stack. By the time the tagger is involved, all the
queue discipline stuff has been done, and it really is time to send
the frame. In the 'post buffer bloat' world, the NIC's hardware queue
should be small, so QoS is not so relevant once you reach the TX
queue. The real QoS issue, I guess, is that the slave interfaces have
no idea they are sharing resources at the lowest level, so
high-priority frames from slave 1 are not differentiated from
best-effort frames from slave 2. If we were serious about improving
QoS, we would need a meta scheduler across the slaves, feeding the
master interface in a QoS-aware way.

In the other direction, how much is the NIC really looking at QoS
information on the receive path? Are you thinking of RPS? I'm not
sure any of the NICs commonly used today with DSA are actually
multi-queue and do RPS.

Another aspect here might be that what Linux is doing with DSA is
probably well past the silicon vendors' expected use cases. None of
the 'vendor crap' drivers I've seen for these SOHO-class switches
have the level of integration we have in Linux. We are pushing the
limits of the host/switch interfaces much more than vendors do, so
maybe silicon vendors are simply not aware of the limits in these
areas. But DSA is being successful, vendors are taking more notice of
it, and maybe with time the host/switch interface will improve: NICs
might start supporting GSO/LRO when there is a DSA header involved,
multi-queue NICs might become more popular in this class of hardware,
and RPS might learn how to handle DSA headers. But my guess would be
that it will be a Marvell NIC paired with a Marvell switch, a
Broadcom NIC paired with a Broadcom switch, etc. I doubt there will
be cross-vendor support.

	Andrew

* Re: Offloading DSA taggers to hardware
  2019-11-13 16:53 ` Andrew Lunn
@ 2019-11-13 17:47   ` Vladimir Oltean
  2019-11-13 19:29   ` Florian Fainelli
  1 sibling, 0 replies; 7+ messages in thread
From: Vladimir Oltean @ 2019-11-13 17:47 UTC (permalink / raw)
To: Andrew Lunn; +Cc: netdev

Hi Andrew,

On Wed, 13 Nov 2019 at 18:53, Andrew Lunn <andrew@lunn.ch> wrote:
>
> Hi Vladimir
>
> I've not seen any hardware that can do this. There is an
> Atheros/Qualcomm integrated SoC/switch where the 'header' is actually
> just a field in the transmit/receive descriptor. There is an
> out-of-tree driver for it, and the tag driver is very minimal. But
> clearly this only works for integrated systems.
>

What is this Atheros SoC? It is funny that the topic reminded you of
it. Your line of reasoning probably was: "Atheros pushed this idea so
far that they omitted the DSA frame tag altogether for their own CPU
port/DSA master". Which means that even if they tried to use this
"offloaded DSA tagger" abstraction, it would slightly violate the main
idea of an offload, which is that it's optional. What do you think?

> The other 'smart' feature I've seen in NICs with respect to DSA is
> being able to do hardware checksums. The Freescale FEC, for example,
> cannot figure out where the IP header is because of the DSA header,
> and so cannot calculate IP/TCP/UDP checksums. Marvell, and I expect
> some other vendors of both MAC and switch devices, know about these
> headers and can do checksumming.
>

Of course there are many more benefits that derive from more complete
frame parsing as well; for some reason my mind just stopped at QoS
when I wrote this email.

> I'm not even sure there are any NICs which can do GSO or LRO when
> there is a DSA header involved.
>
> In the direction CPU to switch, I think many of the QoS issues are
> higher up the stack. [...] If we were serious about improving QoS,
> we would need a meta scheduler across the slaves, feeding the
> master interface in a QoS-aware way.
>

Qdiscs on the DSA master are a good discussion to be had, but that
wasn't the main thing I wanted to bring up here.

> In the other direction, how much is the NIC really looking at QoS
> information on the receive path? Are you thinking of RPS? I'm not
> sure any of the NICs commonly used today with DSA are actually
> multi-queue and do RPS.
>

Actually, both DSA master drivers I've been using so far (gianfar,
enetc) register a number of RX queues equal to the number of cores.
It is possible to add ethtool --config-nfc rules to steer certain
priority traffic to its own CPU, but the keys need to be masked
according to where the QoS field in the DSA frame tag overlaps with
what the DSA master thinks it's looking at, aka DMAC, SMAC,
EtherType, etc. It's not pretty.

> Another aspect here might be that what Linux is doing with DSA is
> probably well past the silicon vendors' expected use cases. [...]
> But my guess would be that it will be a Marvell NIC paired with a
> Marvell switch, a Broadcom NIC paired with a Broadcom switch, etc.
> I doubt there will be cross-vendor support.

...Atheros with Atheros... :)

Yes, that's kind of the angle I'm coming from: basically trying to
understand what a correct abstraction from Linux's perspective would
look like, and what is considered too much "tribalism". The DSA model
is attractive even for an integrated system because there is more
modularity in the design, but there are some clear optimizations that
can be made when the master+switch recipe is tightly controlled.

>
> 	Andrew

Thanks,
-Vladimir
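
The steering rule described above boils down to an ETHTOOL_SRXCLSRLINS
ioctl against the DSA master. Below is a userspace sketch; the helper
name is made up for illustration, and whether a given master driver
honours masked ETHER_FLOW keys is driver-specific:

#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <netinet/in.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Steer frames whose "EtherType" bytes (which, with an EtherType-less
 * DSA tag, actually alias a field of the tag) match pattern/mask to RX
 * queue 'queue' of the DSA master. 'fd' is any AF_INET datagram
 * socket, e.g. from socket(AF_INET, SOCK_DGRAM, 0).
 */
static int steer_by_aliased_tag_field(int fd, const char *master,
				      __u16 pattern, __u16 mask,
				      __u32 queue)
{
	struct ethtool_rxnfc nfc = {
		.cmd = ETHTOOL_SRXCLSRLINS,
		.fs = {
			.flow_type = ETHER_FLOW,
			.ring_cookie = queue,
			.location = RX_CLS_LOC_ANY,
		},
	};
	struct ifreq ifr = { 0 };

	/* The master believes it is matching on h_proto; really these
	 * bytes overlap the DSA tag's QoS field.
	 */
	nfc.fs.h_u.ether_spec.h_proto = htons(pattern);
	nfc.fs.m_u.ether_spec.h_proto = htons(mask);

	strncpy(ifr.ifr_name, master, IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&nfc;

	return ioctl(fd, SIOCETHTOOL, &ifr);
}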

* Re: Offloading DSA taggers to hardware
  2019-11-13 16:53 ` Andrew Lunn
  2019-11-13 17:47   ` Vladimir Oltean
@ 2019-11-13 19:29   ` Florian Fainelli
  1 sibling, 0 replies; 7+ messages in thread
From: Florian Fainelli @ 2019-11-13 19:29 UTC (permalink / raw)
To: Andrew Lunn, Vladimir Oltean; +Cc: netdev

On 11/13/19 8:53 AM, Andrew Lunn wrote:
> Hi Vladimir
>
> I've not seen any hardware that can do this.

Such hardware exists, and there was a prior attempt at supporting it:

http://linux-kernel.2935.n7.nabble.com/PATCH-net-next-0-3-net-Switch-tag-HW-extraction-insertion-td1162606.html

> There is an Atheros/Qualcomm integrated SoC/switch where the 'header'
> is actually just a field in the transmit/receive descriptor. There is
> an out-of-tree driver for it, and the tag driver is very minimal. But
> clearly this only works for integrated systems.

It can work between discrete components in principle; it is just
unlikely, because of the flexibility of DSA to mix and match MACs and
switches and to have different vendors on either end. Of course, even
within the same vendor, the right hand rarely talks to the left hand,
so it would have to be the work of someone who knows both ends.

> The other 'smart' feature I've seen in NICs with respect to DSA is
> being able to do hardware checksums. The Freescale FEC, for example,
> cannot figure out where the IP header is because of the DSA header,
> and so cannot calculate IP/TCP/UDP checksums. Marvell, and I expect
> some other vendors of both MAC and switch devices, know about these
> headers and can do checksumming.

This is probably to be blamed on the fact that most Ethernet switch
tagging protocols did not assign themselves an EtherType; otherwise
the NICs could just do that checksumming. In fact, this even trips up
controllers that are from the same vendor:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a40061ea2e39494104602b3048751341bda374a1

> I'm not even sure there are any NICs which can do GSO or LRO when
> there is a DSA header involved.

Similarly to VLAN devices, would this not be done at the DSA virtual
device level instead?

> In the direction CPU to switch, I think many of the QoS issues are
> higher up the stack. [...]
>
> In the other direction, how much is the NIC really looking at QoS
> information on the receive path? Are you thinking of RPS? I'm not
> sure any of the NICs commonly used today with DSA are actually
> multi-queue and do RPS.

The same hardware as presented above can deliver frames to the desired
switch output queue:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d156576362c07e954dc36e07b0d7b0733a010f7d

> Another aspect here might be that what Linux is doing with DSA is
> probably well past the silicon vendors' expected use cases. None of
> the 'vendor crap' drivers I've seen for these SOHO-class switches
> have the level of integration we have in Linux. We are pushing the
> limits of the host/switch interfaces much more than vendors do, so
> maybe silicon vendors are simply not aware of the limits in these
> areas.

Maybe, but vendors support many basic things we still don't, like
controlling broadcast storm suppression (a commonly requested
feature), offloading QoS properly, etc. What we have achieved so far
is IMHO a solid framework, but there are still many, many features
unsupported.

> But DSA is being successful, vendors are taking more notice of it,
> and maybe with time the host/switch interface will improve. [...] I
> doubt there will be cross-vendor support.
>
> 	Andrew
--
Florian
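
To visualize the EtherType problem Florian describes, here is an
illustrative-only sketch; actual tag placement and length vary by
vendor, and this assumes a 4-byte tag inserted after the source MAC:

#include <linux/types.h>

/* A NIC parser that expects the EtherType (and hence the IP header)
 * at a fixed offset finds tag bytes there instead, so its L3/L4
 * checksum offload operates on the wrong part of the frame.
 */
struct frame_untagged {
	__u8	dmac[6];
	__u8	smac[6];
	__be16	ethertype;	/* parser keys off this at offset 12 */
	/* IP header follows */
} __attribute__((packed));

struct frame_dsa_tagged {
	__u8	dmac[6];
	__u8	smac[6];
	__u8	dsa_tag[4];	/* no EtherType of its own */
	__be16	ethertype;	/* now at offset 16, not 12 */
	/* IP header follows */
} __attribute__((packed));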

* Re: Offloading DSA taggers to hardware
  2019-11-13 12:40 Offloading DSA taggers to hardware Vladimir Oltean
  2019-11-13 16:53 ` Andrew Lunn
@ 2019-11-13 19:40 ` Florian Fainelli
  2019-11-14 16:40   ` Vladimir Oltean
  1 sibling, 1 reply; 7+ messages in thread
From: Florian Fainelli @ 2019-11-13 19:40 UTC (permalink / raw)
To: Vladimir Oltean, netdev, Andrew Lunn, Vivien Didelot

On 11/13/19 4:40 AM, Vladimir Oltean wrote:
> DSA is all about pairing any tagging-capable (or at least
> VLAN-capable) switch to any NIC, and the software stack creates N
> "virtual" net devices, each representing a switch port, with I/O
> capabilities based on the metadata present in the frame. It all
> looks like an hourglass:
>
> [hourglass diagram snipped]
>
> But the process by which the stack:
> - Parses the frame on receive, decodes the DSA tag and redirects the
>   frame from the DSA master net_device to a switch net_device based
>   on the source port, then removes the DSA tag from the frame and
>   recalculates checksums as appropriate
> - Adds the DSA tag on xmit, then redirects the frame from the
>   "virtual" switch net_device to the real DSA master net_device
> can be optimized if the DSA master NIC supports it. [...] In
> addition, there would be a net benefit for QoS, which, as a
> consequence of the DSA model, cannot really be end-to-end today: a
> frame classified to a high-priority traffic class by the switch may
> be treated as best-effort by the DSA master, because the master
> doesn't actually parse the DSA tag (the traffic class field, in
> this case).

The QoS part can be guaranteed for an integrated design, not so much
if you have discrete/separate NIC and switch vendors and there is no
agreed-upon mechanism to "not lose information" between the two.

> I think the DSA hotpath would still need to be involved, but instead
> of calling the tagger's xmit/rcv it would need to call a newly
> introduced ndo that offloads this operation.
>
> Is there any hardware out there that can do this? Is it desirable to
> see something like this in DSA?

BCM7445 and BCM7278 (and other DSL and cable modem chips, just not
supported upstream) use drivers/net/dsa/bcm_sf2.c along with
drivers/net/ethernet/broadcom/bcmsysport.c. It is possible to offload
the creation and extraction of the Broadcom tag:

http://linux-kernel.2935.n7.nabble.com/PATCH-net-next-0-3-net-Switch-tag-HW-extraction-insertion-td1162606.html

(This was reverted shortly after, because napi_gro_receive() occupies
the full 48 bytes of skb->cb[] space on 64-bit hosts; I now have a
better view of how to solve this though, see below.)

In my experience though, since the data is already hot in the cache in
either direction, a memmove() is not that costly, and it was not
possible to see sizable throughput improvements at 1Gbps or 2Gbps
speeds, because the CPU is more than capable of managing the tag
extraction in software, which is also the most compatible way of
doing it.

To give you some more details, the SYSTEMPORT MAC will prepend an
8-byte Receive Status Block: word 0 contains status/length/error, and
word 1 can contain the full 4-byte Broadcom tag as extracted. Then
there is a (configurable) 2-byte gap to align the IP header, and then
the Ethernet header can be found. This is quite similar to the
NET_DSA_TAG_BRCM_PREPEND case, except for this 2-byte gap, which is
why I am wondering whether I will have to introduce an additional
tagging protocol, NET_DSA_TAG_BRCM_PREPEND_WITH_2B, or whatever
side-band information I can provide in the skb to permit the removal
of these two extraneous bytes.

On transmit, we also have an 8-byte transmit status block which can be
constructed to contain information for the HW to insert a 4-byte
Broadcom tag, along with a VLAN tag, and with the same length/checksum
insertion information. The TX path would then be equivalent to not
doing any tagging, so similarly, it may be desirable to have a
separate NET_DSA_TAG_BRCM_PREPEND value that indicates that nothing
needs to be done except queue the frame for transmission on the
master netdev.

Now from a practical angle, offloading DSA tagging only makes sense if
you happen to have a lot of host-initiated/received traffic, which
would be the case for a streaming device (BCM7445/BCM7278), with its
ports either completely separate (the DSA default) or bridged. Does
that apply in your case?
--
Florian
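
A rough C rendering of the receive layout described above, with field
names invented here (the authoritative bit definitions live in the
SYSTEMPORT driver headers, drivers/net/ethernet/broadcom/bcmsysport.h):

#include <linux/types.h>

/* Sketch of the 8-byte Receive Status Block as described above */
struct systemport_rsb_sketch {
	__le32	status_len;	/* word 0: status / length / error bits */
	__le32	brcm_tag;	/* word 1: the extracted 4-byte Broadcom tag */
};

/*
 * Buffer layout seen by the CPU, per the description above:
 *
 *   [ 8-byte RSB ][ 2-byte gap (IP alignment) ][ DMAC | SMAC | Type | ... ]
 *
 * i.e. almost NET_DSA_TAG_BRCM_PREPEND, plus the 2-byte pad that either
 * a new tagging protocol or skb side-band data would have to strip.
 */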

* Re: Offloading DSA taggers to hardware
  2019-11-13 19:40 ` Florian Fainelli
@ 2019-11-14 16:40   ` Vladimir Oltean
  2019-11-22 17:47     ` Florian Fainelli
  0 siblings, 1 reply; 7+ messages in thread
From: Vladimir Oltean @ 2019-11-14 16:40 UTC (permalink / raw)
To: Florian Fainelli; +Cc: netdev, Andrew Lunn, Vivien Didelot

Hi Florian,

On Wed, 13 Nov 2019 at 21:40, Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> [...]
>
> Now from a practical angle, offloading DSA tagging only makes sense
> if you happen to have a lot of host-initiated/received traffic, which
> would be the case for a streaming device (BCM7445/BCM7278), with its
> ports either completely separate (the DSA default) or bridged. Does
> that apply in your case?

Not at all, I would say. In fact, I was trying to understand what the
chances are of interpreting information from the master's frame
descriptor as the de facto DSA tag in mainline Linux. Your story with
the Starfighter 2 chips seems to indicate that it isn't such a good
idea.

> --
> Florian

Thanks,
-Vladimir

* Re: Offloading DSA taggers to hardware
  2019-11-14 16:40   ` Vladimir Oltean
@ 2019-11-22 17:47     ` Florian Fainelli
  0 siblings, 0 replies; 7+ messages in thread
From: Florian Fainelli @ 2019-11-22 17:47 UTC (permalink / raw)
To: Vladimir Oltean; +Cc: netdev, Andrew Lunn, Vivien Didelot

On 11/14/19 8:40 AM, Vladimir Oltean wrote:
> Hi Florian,
>
> [...]
>
> Not at all, I would say. In fact, I was trying to understand what the
> chances are of interpreting information from the master's frame
> descriptor as the de facto DSA tag in mainline Linux. Your story with
> the Starfighter 2 chips seems to indicate that it isn't such a good
> idea.

I would not say that it is a bad idea, but it may be challenging to
find a driver-agnostic way, on both the DSA master and tagger side, to
provide the switch tag in a way that minimizes the amount of data
manipulation within the packet while preserving possible stack
optimizations such as GRO. Technically, we should probably be doing
the GRO at the DSA slave layer, though; I am fuzzy on the details
here, TBH.

AFAIR, there may have been some efforts by Florian Westphal to allow
nesting of skb->cb[] usage; maybe we could use that.
--
Florian