From: DENG Qingfang <dqfext@gmail.com>
To: Vladimir Oltean <vladimir.oltean@nxp.com>
Cc: netdev@vger.kernel.org, Jakub Kicinski <kuba@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Andrew Lunn <andrew@lunn.ch>,
Florian Fainelli <f.fainelli@gmail.com>,
Vivien Didelot <vivien.didelot@gmail.com>,
Jiri Pirko <jiri@resnulli.us>, Ido Schimmel <idosch@idosch.org>,
Tobias Waldekranz <tobias@waldekranz.com>,
Roopa Prabhu <roopa@nvidia.com>,
Nikolay Aleksandrov <nikolay@nvidia.com>,
Stephen Hemminger <stephen@networkplumber.org>,
bridge@lists.linux-foundation.org,
Alexander Duyck <alexander.duyck@gmail.com>
Subject: Re: [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
Date: Mon, 5 Jul 2021 12:20:18 +0800 [thread overview]
Message-ID: <20210705042018.137390-1-dqfext@gmail.com> (raw)
In-Reply-To: <20210703115705.1034112-1-vladimir.oltean@nxp.com>
Hi Vladimir,
On Sat, Jul 03, 2021 at 02:56:55PM +0300, Vladimir Oltean wrote:
> For this series I have taken Tobias' work from here:
> https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> and made the following changes:
> - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
> feedback on the bridge driver changes. Otherwise, the structure of the
> bridge changes is pretty much the same as Tobias left it.
> - I basically rewrote the DSA infrastructure for the data plane
> forwarding offload, based on the commonalities with another switch
> driver for which I implemented this feature (not submitted here)
> - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
> works but I didn't test that
>
> The data plane of the software bridge can be partially offloaded to
> switchdev, in the sense that we can trust the accelerator to:
> (a) look up its FDB (which is more or less in sync with the software
> bridge FDB) for selecting the destination ports for a packet
> (b) replicate the frame in hardware in case it's a multicast/broadcast,
> instead of the software bridge having to clone it and send the
> clones to each net device one at a time. This reduces the bandwidth
> needed between the CPU and the accelerator, as well as the CPU time
> spent.
Many DSA taggers use port bit field in their TX tags, which allows
replication in hardware. (multiple bits set = send to multiple ports)
I wonder if the tagger API can be updated to support this.
>
> The data path forwarding offload is managed per "hardware domain" - a
> generalization of the "offload_fwd_mark" concept which is being
> introduced in this series. Every packet is delivered only once to each
> hardware domain.
>
> In addition, Tobias said in the original cover letter:
>
> ====================
> ## Overview
>
> vlan1 vlan2
> \ /
> .-----------.
> | br0 |
> '-----------'
> / / \ \
> swp0 swp1 swp2 eth0
> : : :
> (hwdom 1)
>
> Up to this point, switchdevs have been trusted with offloading
> forwarding between bridge ports, e.g. forwarding a unicast from swp0
> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> series extends forward offloading to include some new classes of
> traffic:
>
> - Locally originating flows, i.e. packets that ingress on br0 that are
> to be forwarded to one or several of the ports swp{0,1,2}. Notably
> this also includes routed flows, e.g. a packet ingressing swp0 on
> VLAN 1 which is then routed over to VLAN 2 by the CPU and then
> forwarded to swp1 is "locally originating" from br0's point of view.
>
> - Flows originating from "foreign" interfaces, i.e. an interface that
> is not offloaded by a particular switchdev instance. This includes
> ports belonging to other switchdev instances. A typical example
> would be flows from eth0 towards swp{0,1,2}.
>
> The bridge still looks up its FDB/MDB as usual and then notifies the
> switchdev driver that a particular skb should be offloaded if it
> matches one of the classes above. It does so by using the _accel
> version of dev_queue_xmit, supplying its own netdev as the
> "subordinate" device. The driver can react to the presence of the
> subordinate in its .ndo_select_queue in what ever way it needs to make
> sure to forward the skb in much the same way that it would for packets
> ingressing on regular ports.
>
> Hardware domains to which a particular skb has been forwarded are
> recorded so that duplicates are avoided.
>
> The main performance benefit is thus seen on multicast flows. Imagine
> for example that:
>
> - An IP camera is connected to swp0 (VLAN 1)
>
> - The CPU is acting as a multicast router, routing the group from VLAN
> 1 to VLAN 2.
>
> - There are subscribers for the group in question behind both swp1 and
> swp2 (VLAN 2).
>
> With this offloading in place, the bridge need only send a single skb
> to the driver, which will send it to the hardware marked in such a way
> that the switch will perform the multicast replication according to
> the MDB configuration. Naturally, the number of saved skb_clones
> increase linearly with the number of subscribed ports.
>
> As an extra benefit, on mv88e6xxx, this also allows the switch to
> perform source address learning on these flows, which avoids having to
> sync dynamic FDB entries over slow configuration interfaces like MDIO
> to avoid flows directed towards the CPU being flooded as unknown
> unicast by the switch.
>
>
> ## RFC
>
> - In general, what do you think about this idea?
>
> - hwdom. What do you think about this terminology? Personally I feel
> that we had too many things called offload_fwd_mark, and that as the
> use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
> might be useful to have a separate term for it.
>
> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
> and if so do you have any suggestion/preference on how to signal the
> offloading from the bridge down to the switchdev driver?
>
> - The way that flooding is implemented in br_forward.c (lazily cloning
> skbs) means that you have to mark the forwarding as completed very
> early (right after should_deliver in maybe_deliver) in order to
> avoid duplicates. Is there some way to move this decision point to a
> later stage that I am missing?
>
> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
> compatible with unicast-to-multicast being used on a port. Then
> again, I think that this would also be broken for regular switchdev
> bridge offloading as this flag is not offloaded to the switchdev
> port, so there is no way for the driver to refuse it. Any ideas on
> how to handle this?
>
>
> ## mv88e6xxx Specifics
>
> Since we are now only receiving a single skb for both unicast and
> multicast flows, we can tag the packets with the FORWARD command
> instead of FROM_CPU. The swich(es) will then forward the packet in
> accordance with its ATU, VTU, STU, and PVT configuration - just like
> for packets ingressing on user ports.
>
> Crucially, FROM_CPU is still used for:
>
> - Ports in standalone mode.
>
> - Flows that are trapped to the CPU and software-forwarded by a
> bridge. Note that these flows match neither of the classes discussed
> in the overview.
>
> - Packets that are sent directly to a port netdev without going
> through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
> socket.
>
> We thus have a pretty clean separation where the data plane uses
> FORWARDs and the control plane uses TO_/FROM_CPU.
>
> The barrier between different bridges is enforced by port based VLANs
> on mv88e6xxx, which in essence is a mapping from a source device/port
> pair to an allowed set of egress ports. In order to have a FORWARD
> frame (which carries a _source_ device/port) correctly mapped by the
> PVT, we must use a unique pair for each bridge.
>
> Fortunately, there is typically lots of unused address space in most
> switch trees. When was the last time you saw an mv88e6xxx product
> using more than 4 chips? Even if you found one with 16 (!) devices,
> you would still have room to allocate 16*16 virtual ports to software
> bridges.
>
> Therefore, the mv88e6xxx driver will allocate a virtual device/port
> pair to each bridge that it offloads. All members of the same bridge
> are then configured to allow packets from this virtual port in their
> PVTs.
> ====================
>
> Tobias Waldekranz (5):
> net: dfwd: constrain existing users to macvlan subordinates
> net: bridge: disambiguate offload_fwd_mark
> net: bridge: switchdev: recycle unused hwdoms
> net: bridge: switchdev: allow the data plane forwarding to be
> offloaded
> net: dsa: tag_dsa: offload the bridge forwarding process
>
> Vladimir Oltean (5):
> net: extract helpers for binding a subordinate device to TX queues
> net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
> net: dsa: track the number of switches in a tree
> net: dsa: add support for bridge forwarding offload
> net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
> the PVT
>
> drivers/net/dsa/mv88e6xxx/chip.c | 106 +++++++++++-
> .../net/ethernet/intel/fm10k/fm10k_netdev.c | 3 +
> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +
> include/linux/if_bridge.h | 1 +
> include/linux/netdevice.h | 13 +-
> include/net/dsa.h | 37 ++++
> net/bridge/br_forward.c | 18 +-
> net/bridge/br_if.c | 4 +-
> net/bridge/br_private.h | 49 +++++-
> net/bridge/br_switchdev.c | 163 +++++++++++++++---
> net/bridge/br_vlan.c | 10 +-
> net/core/dev.c | 31 +++-
> net/dsa/dsa2.c | 3 +
> net/dsa/dsa_priv.h | 28 +++
> net/dsa/port.c | 35 ++++
> net/dsa/slave.c | 134 +++++++++++++-
> net/dsa/switch.c | 58 +++++++
> net/dsa/tag_dsa.c | 60 ++++++-
> 19 files changed, 700 insertions(+), 59 deletions(-)
>
> --
> 2.25.1
>
next prev parent reply other threads:[~2021-07-05 4:20 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-03 11:56 [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices Vladimir Oltean
2021-07-03 11:56 ` [RFC PATCH v2 net-next 01/10] net: dfwd: constrain existing users to macvlan subordinates Vladimir Oltean
2021-07-03 11:56 ` [RFC PATCH v2 net-next 02/10] net: bridge: disambiguate offload_fwd_mark Vladimir Oltean
2021-07-09 13:23 ` Grygorii Strashko
2021-07-03 11:56 ` [RFC PATCH v2 net-next 03/10] net: bridge: switchdev: recycle unused hwdoms Vladimir Oltean
2021-07-03 11:56 ` [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded Vladimir Oltean
2021-07-09 13:16 ` Grygorii Strashko
2021-07-09 14:09 ` Vladimir Oltean
2021-07-12 12:28 ` Tobias Waldekranz
2021-07-12 13:03 ` Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 05/10] net: extract helpers for binding a subordinate device to TX queues Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 06/10] net: allow ndo_select_queue to go beyond dev->num_real_tx_queues Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 07/10] net: dsa: track the number of switches in a tree Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 08/10] net: dsa: add support for bridge forwarding offload Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 09/10] net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in the PVT Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 10/10] net: dsa: tag_dsa: offload the bridge forwarding process Vladimir Oltean
2021-07-03 22:04 ` [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices Tobias Waldekranz
2021-07-04 8:11 ` Vladimir Oltean
2021-07-05 8:09 ` Tobias Waldekranz
2021-07-05 8:54 ` Vladimir Oltean
2021-07-05 4:20 ` DENG Qingfang [this message]
2021-07-05 8:32 ` Tobias Waldekranz
2021-07-05 9:57 ` Vladimir Oltean
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210705042018.137390-1-dqfext@gmail.com \
--to=dqfext@gmail.com \
--cc=alexander.duyck@gmail.com \
--cc=andrew@lunn.ch \
--cc=bridge@lists.linux-foundation.org \
--cc=davem@davemloft.net \
--cc=f.fainelli@gmail.com \
--cc=idosch@idosch.org \
--cc=jiri@resnulli.us \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=nikolay@nvidia.com \
--cc=roopa@nvidia.com \
--cc=stephen@networkplumber.org \
--cc=tobias@waldekranz.com \
--cc=vivien.didelot@gmail.com \
--cc=vladimir.oltean@nxp.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).