understanding switchdev notifications

* understanding switchdev notifications
@ 2024-08-08  0:48 Chris Packham
  2024-08-14  6:54 ` Tobias Waldekranz
  0 siblings, 1 reply; 4+ messages in thread
From: Chris Packham @ 2024-08-08  0:48 UTC (permalink / raw)
  To: netdev, linux-kernel@vger.kernel.org

Hi,

I'm trying to get to grips with how the switchdev notifications are 
supposed to be used when developing a switchdev driver.

I have been reading through 
https://www.kernel.org/doc/html/latest/networking/switchdev.html which 
covers a few things but doesn't go into detail around the notifiers that 
one needs to implement for a new switchdev driver (which is probably 
very dependent on what the hardware is capable of).

Specifically right now I'm looking at having a switch port join a vlan 
aware bridge. I have a configuration something like this

     ip link add br0 type bridge vlan_filtering 1
     ip link set sw1p5 master br0
     ip link set sw1p1 master br0
     bridge vlan add vid 2 dev br0 self
     ip link add link br0 br0.2 type vlan id 2
     ip addr add dev br0.2 192.168.2.1/24
     bridge vlan add vid 2 dev sw1p5 pvid untagged
     bridge vlan add vid 2 dev sw1p1
     ip link set sw1p5 up
     ip link set sw1p1 up
     ip link set br0 up
     ip link set br0.2 up

Then I'm testing by sending a ping to a nonexistent host on the 
192.168.2.0/24 subnet and looking at the traffic with tcpdump on another 
device connected to sw1p5.

I'm a bit confused about how I should be calling 
switchdev_bridge_port_offload(). It takes two netdevs (brport_dev and 
dev) but as far as I've been able to see all the callers end up passing 
the same netdev for both of these (some create a driver specific brport 
but this still ends up with brport->dev and dev being the same object).

I've figured out that I need to set tx_fwd_offload=true so that the 
bridge software only sends one packet to the hardware. That makes sense 
as a way of saying that my hardware can take care of sending the packet 
out the right ports.

I do have a problem that what I get from the bridge has a vlan tag 
inserted (which makes sense in sw when the packet goes from br0.2 to 
br0). But I don't actually need it as the hardware will insert a tag for 
me if the port is setup for egress tagging. I can shuffle the Ethernet 
header up but I was wondering if there was a way of telling the bridge 
not to insert the tag?

Finally I'm confused about the atomic_nb/blocking_nb parameters. Some 
drivers just pass NULL and others pass the same notifier blocks that 
they've already registered with 
register_switchdev_notifier()/register_switchdev_blocking_notifier(). If 
notifiers are registered why does switchdev_bridge_port_offload() take 
them as parameters?

Thanks,
Chris


* Re: understanding switchdev notifications
  2024-08-08  0:48 understanding switchdev notifications Chris Packham
@ 2024-08-14  6:54 ` Tobias Waldekranz
  2024-08-14 22:18   ` Chris Packham
  0 siblings, 1 reply; 4+ messages in thread
From: Tobias Waldekranz @ 2024-08-14  6:54 UTC (permalink / raw)
  To: Chris Packham, netdev, linux-kernel@vger.kernel.org

On Thu, Aug 08, 2024 at 12:48, Chris Packham <chris.packham@alliedtelesis.co.nz> wrote:
> Hi,
>
> I'm trying to get to grips with how the switchdev notifications are 
> supposed to be used when developing a switchdev driver.
>
> I have been reading through 
> https://www.kernel.org/doc/html/latest/networking/switchdev.html which 
> covers a few things but doesn't go into detail around the notifiers that 
> one needs to implement for a new switchdev driver (which is probably 
> very dependent on what the hardware is capable of).
>
> Specifically right now I'm looking at having a switch port join a vlan 
> aware bridge. I have a configuration something like this
>
>      ip link add br0 type bridge vlan_filtering 1
>      ip link set sw1p5 master br0
>      ip link set sw1p1 master br0
>      bridge vlan add vid 2 dev br0 self
>      ip link add link br0 br0.2 type vlan id 2
>      ip addr add dev br0.2 192.168.2.1/24
>      bridge vlan add vid 2 dev sw1p5 pvid untagged
>      bridge vlan add vid 2 dev sw1p1
>      ip link set sw1p5 up
>      ip link set sw1p1 up
>      ip link set br0 up
>      ip link set br0.2 up
>
> Then I'm testing by sending a ping to a nonexistent host on the 
> 192.168.2.0/24 subnet and looking at the traffic with tcpdump on another 
> device connected to sw1p5.
>
> I'm a bit confused about how I should be calling 
> switchdev_bridge_port_offload(). It takes two netdevs (brport_dev and 
> dev) but as far as I've been able to see all the callers end up passing 
> the same netdev for both of these (some create a driver specific brport 
> but this still ends up with brport->dev and dev being the same object).

In the simple case when a switchport is directly attached to a bridge,
brport_dev and dev will be the same. If the attachment is indirect, via
a bond for example, they will differ:

       br0
       /
    bond0
   /    \
sw1p1  sw1p5

In the setup above, the bridge has no reference to any sw*p* interfaces,
all generated notifications will reference "bond0". By including the
switchdev port in the message back to the bridge, it can perform
validation on the setup; e.g. that bond0 is not made up of interfaces
from different hardware domains.
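
To make the brport_dev/dev distinction concrete, here is a minimal
userspace sketch. It is a toy model, not kernel code: struct toy_netdev
and toy_brport_dev() are hypothetical names invented for illustration,
and a real driver would learn the topology from the netdev notifier
events it receives rather than from a walk like this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct net_device; NOT the kernel type. */
struct toy_netdev {
	const char *name;
	bool is_bridge;
	struct toy_netdev *master;	/* upper device, or NULL */
};

/*
 * Return the device the bridge sees as its port for @dev: @dev itself
 * when it is attached directly, otherwise the upper (e.g. a bond) that
 * sits directly below the bridge. NULL if @dev is not under a bridge.
 */
struct toy_netdev *toy_brport_dev(struct toy_netdev *dev)
{
	while (dev && dev->master) {
		if (dev->master->is_bridge)
			return dev;
		dev = dev->master;
	}
	return NULL;
}
```

With sw1p1 -> bond0 -> br0 the function returns bond0 for sw1p1, which
is the pair (brport_dev = bond0, dev = sw1p1) from the diagram. With
sw1p5 attached directly to br0 it returns sw1p5 itself, so both
arguments are the same device.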

> I've figured out that I need to set tx_fwd_offload=true so that the 
> bridge software only sends one packet to the hardware. That makes sense 
> as a way of saying that my hardware can take care of sending the packet 
> out the right ports.
>
> I do have a problem that what I get from the bridge has a vlan tag 
> inserted (which makes sense in sw when the packet goes from br0.2 to 
> br0). But I don't actually need it as the hardware will insert a tag for 
> me if the port is setup for egress tagging. I can shuffle the Ethernet 
> header up but I was wondering if there was a way of telling the bridge 
> not to insert the tag?

Signaling tx_fwd_offload=true means assuming responsibility for
delivering each packet to all ports that the bridge would otherwise have
sent individual skbs for.

Let's expand your setup slightly, and see why you need the tag:

   br0.2 br0.3
       \ /
       br0
      / |  \
     /  |   \
sw1p1 sw1p3  sw1p5
(2U)  (3U)  (2T,3T)

sw1p5 is now a trunk. We can trigger an ARP broadcast to be sent out
either via br0.2 or br0.3, depending on the subnet we choose to target.

Your driver will receive a single skb to transmit, and skb->dev can be
set to any of sw1p{1,3,5} depending on config order, FDB entries
(i.e. the order of previously received packets) etc., and is thus
nondeterministic.

So presumably, even though you might need to remove the 802.1Q tag from
the frame, you need some way of tagging the packet with the correct VID
in order for the hardware to do the right thing; possibly via a field in
the vendor's hardware specific tag.
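
The trunk example can be modelled in a few lines of userspace C. This
is only a sketch of the lookup the hardware has to perform for one
offloaded flood. The types and toy_flood_vid() are hypothetical, and
real switch silicon does this in its VLAN table, not in software.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_VLANS 2

/* One VLAN membership on a port; untagged != 0 means "strip on egress". */
struct toy_port_vlan {
	uint16_t vid;
	int untagged;
};

struct toy_port {
	const char *name;
	struct toy_port_vlan vlans[TOY_MAX_VLANS];
	size_t num_vlans;
};

/* Result of a flood lookup: bit N refers to ports[N]. */
struct toy_flood {
	unsigned int egress_mask;
	unsigned int tagged_mask;
};

/* The expanded example: sw1p1 (2U), sw1p3 (3U), sw1p5 (2T,3T). */
const struct toy_port toy_ports[] = {
	{ "sw1p1", { { 2, 1 } }, 1 },
	{ "sw1p3", { { 3, 1 } }, 1 },
	{ "sw1p5", { { 2, 0 }, { 3, 0 } }, 2 },
};

/* What the hardware must work out for one offloaded flood in @vid. */
struct toy_flood toy_flood_vid(const struct toy_port *ports, size_t n,
			       uint16_t vid)
{
	struct toy_flood f = { 0, 0 };
	size_t p, v;

	for (p = 0; p < n; p++) {
		for (v = 0; v < ports[p].num_vlans; v++) {
			if (ports[p].vlans[v].vid != vid)
				continue;
			f.egress_mask |= 1u << p;
			if (!ports[p].vlans[v].untagged)
				f.tagged_mask |= 1u << p;
		}
	}
	return f;
}
```

The only input besides the port table is the VID. A flood in VLAN 2
must leave on sw1p1 (untagged) and sw1p5 (tagged), while a flood in
VLAN 3 must leave on sw1p3 and sw1p5, so the single skb handed to the
driver is not enough without the VID travelling along with it.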

> Finally I'm confused about the atomic_nb/blocking_nb parameters. Some 
> drivers just pass NULL and others pass the same notifier blocks that 
> they've already registered with 
> register_switchdev_notifier()/register_switchdev_blocking_notifier(). If 
> notifiers are registered why does switchdev_bridge_port_offload() take 
> them as parameters?

Because when you add a port to the bridge, lots of stuff that you want
to offload might already have been configured. E.g., imagine that you
were to add vlan 2 to br0 before adding the switchports; then you
probably need those events to be replayed to the new ports in order to
add your CPU-facing switchport to vlan 2. However, we do not want to
bother existing bridge members with duplicated events (and risk messing
up any reference counters they might maintain for these
objects). Therefore we bypass the standard notifier calls and "unicast"
the replay events only to the driver for the port being added.
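
A small userspace model of the difference between publishing on the
chain and replaying to one subscriber may help. It is a sketch under
stated assumptions: struct toy_nb, struct toy_chain and the toy_*
functions are hypothetical stand-ins for the kernel's notifier blocks
and chains, and a single counter stands in for whatever per-object
reference counting a driver keeps.

```c
#include <stddef.h>

#define TOY_MAX_SUBS 4
#define TOY_VLAN_ADD 1UL

struct toy_nb {
	int (*notifier_call)(struct toy_nb *nb, unsigned long event, void *ptr);
	int vlan_refs;	/* this driver's reference count for one VLAN */
};

/* A driver's handler: take one reference per VLAN add it is told about. */
int toy_vlan_event(struct toy_nb *nb, unsigned long event, void *ptr)
{
	(void)ptr;
	if (event == TOY_VLAN_ADD)
		nb->vlan_refs++;
	return 0;
}

/* The publisher only holds an anonymous list of subscribers. */
struct toy_chain {
	struct toy_nb *subs[TOY_MAX_SUBS];
	size_t num;
};

int toy_chain_register(struct toy_chain *chain, struct toy_nb *nb)
{
	if (chain->num >= TOY_MAX_SUBS)
		return -1;
	chain->subs[chain->num++] = nb;
	return 0;
}

/* Normal path: every registered subscriber sees the event. */
void toy_chain_call(struct toy_chain *chain, unsigned long event, void *ptr)
{
	size_t i;

	for (i = 0; i < chain->num; i++)
		chain->subs[i]->notifier_call(chain->subs[i], event, ptr);
}

/* Replay path: only the nb handed in by the joining driver is called. */
void toy_replay(struct toy_nb *nb, unsigned long event, void *ptr)
{
	if (nb)		/* in this model, NULL means "no replay wanted" */
		nb->notifier_call(nb, event, ptr);
}
```

If driver A already holds one reference for VLAN 2 and driver B then
joins, toy_replay(&b, ...) leaves A at one reference and brings B to
one. Pushing the same event through toy_chain_call() instead would
bump A to two, which is the duplicated-event problem described above.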

> Thanks,
> Chris


* Re: understanding switchdev notifications
  2024-08-14  6:54 ` Tobias Waldekranz
@ 2024-08-14 22:18   ` Chris Packham
  2024-08-15  7:02     ` Tobias Waldekranz
  0 siblings, 1 reply; 4+ messages in thread
From: Chris Packham @ 2024-08-14 22:18 UTC (permalink / raw)
  To: Tobias Waldekranz, netdev, linux-kernel@vger.kernel.org

Hi Tobias,

On 14/08/24 18:54, Tobias Waldekranz wrote:
> On Thu, Aug 08, 2024 at 12:48, Chris Packham <chris.packham@alliedtelesis.co.nz> wrote:
>> Hi,
>>
>> I'm trying to get to grips with how the switchdev notifications are
>> supposed to be used when developing a switchdev driver.
>>
>> I have been reading through
>> https://www.kernel.org/doc/html/latest/networking/switchdev.html which
>> covers a few things but doesn't go into detail around the notifiers that
>> one needs to implement for a new switchdev driver (which is probably
>> very dependent on what the hardware is capable of).
>>
>> Specifically right now I'm looking at having a switch port join a vlan
>> aware bridge. I have a configuration something like this
>>
>>       ip link add br0 type bridge vlan_filtering 1
>>       ip link set sw1p5 master br0
>>       ip link set sw1p1 master br0
>>       bridge vlan add vid 2 dev br0 self
>>       ip link add link br0 br0.2 type vlan id 2
>>       ip addr add dev br0.2 192.168.2.1/24
>>       bridge vlan add vid 2 dev sw1p5 pvid untagged
>>       bridge vlan add vid 2 dev sw1p1
>>       ip link set sw1p5 up
>>       ip link set sw1p1 up
>>       ip link set br0 up
>>       ip link set br0.2 up
>>
>> Then I'm testing by sending a ping to a nonexistent host on the
>> 192.168.2.0/24 subnet and looking at the traffic with tcpdump on another
>> device connected to sw1p5.
>>
>> I'm a bit confused about how I should be calling
>> switchdev_bridge_port_offload(). It takes two netdevs (brport_dev and
>> dev) but as far as I've been able to see all the callers end up passing
>> the same netdev for both of these (some create a driver specific brport
>> but this still ends up with brport->dev and dev being the same object).
> In the simple case when a switchport is directly attached to a bridge,
> brport_dev and dev will be the same. If the attachment is indirect, via
> a bond for example, they will differ:
>
>         br0
>         /
>      bond0
>     /    \
> sw1p1  sw1p5
>
> In the setup above, the bridge has no reference to any sw*p* interfaces,
> all generated notifications will reference "bond0". By including the
> switchdev port in the message back to the bridge, it can perform
> validation on the setup; e.g. that bond0 is not made up of interfaces
> from different hardware domains.

Ah that makes sense. I haven't got to bonds yet so I hadn't hit that case.

>> I've figured out that I need to set tx_fwd_offload=true so that the
>> bridge software only sends one packet to the hardware. That makes sense
>> as a way of saying that my hardware can take care of sending the packet
>> out the right ports.
>>
>> I do have a problem that what I get from the bridge has a vlan tag
>> inserted (which makes sense in sw when the packet goes from br0.2 to
>> br0). But I don't actually need it as the hardware will insert a tag for
>> me if the port is setup for egress tagging. I can shuffle the Ethernet
>> header up but I was wondering if there was a way of telling the bridge
>> not to insert the tag?
> Signaling tx_fwd_offload=true means assuming responsibility for
> delivering each packet to all ports that the bridge would otherwise have
> sent individual skbs for.
>
> Let's expand your setup slightly, and see why you need the tag:
>
>     br0.2 br0.3
>         \ /
>         br0
>        / |  \
>       /  |   \
> sw1p1 sw1p3  sw1p5
> (2U)  (3U)  (2T,3T)
>
> sw1p5 is now a trunk. We can trigger an ARP broadcast to be sent out
> either via br0.2 or br0.3, depending on the subnet we choose to target.
>
> Your driver will receive a single skb to transmit, and skb->dev can be
> set to any of sw1p{1,3,5} depending on config order, FDB entries
> (i.e. the order of previously received packets) etc., and is thus
> nondeterministic.
>
> So presumably, even though you might need to remove the 802.1Q tag from
> the frame, you need some way of tagging the packet with the correct VID
> in order for the hardware to do the right thing; possibly via a field in
> the vendor's hardware specific tag.

I did eventually find NETIF_F_HW_VLAN_CTAG_TX which stops the packet 
data coming down to the switch driver with a vlan tag inserted. The 
intended egress vlan is still available via skb_vlan_tag_get_id() so I 
can add it to hardware specific tag (which for me is part of the TX DMA 
descriptor) and I don't need to shuffle any bytes around which is great.
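
The two TX paths can be contrasted with a userspace sketch. It is a toy
model only: struct toy_skb and struct toy_tx_desc are hypothetical, the
descriptor layout is an assumption (the real one is hardware specific),
and the masking in toy_fill_desc() simply mirrors taking the low 12
bits of the TCI as the VID.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN	6
#define TOY_VLAN_HLEN	4
#define TOY_VID_MASK	0x0fff

/* Toy stand-in for an skb: frame bytes plus an out-of-band VLAN tag. */
struct toy_skb {
	uint8_t data[64];
	size_t len;
	int vlan_present;
	uint16_t vlan_tci;
};

/* Hypothetical TX descriptor with a VID field. */
struct toy_tx_desc {
	const uint8_t *buf;
	size_t len;
	uint16_t vid;
};

/* No TX VLAN offload: the 802.1Q header is pushed into the frame bytes. */
int toy_vlan_push_inline(struct toy_skb *skb)
{
	const size_t off = 2 * TOY_ETH_ALEN;	/* after dst + src MAC */

	if (!skb->vlan_present)
		return 0;
	if (skb->len < off || skb->len + TOY_VLAN_HLEN > sizeof(skb->data))
		return -1;

	memmove(skb->data + off + TOY_VLAN_HLEN, skb->data + off,
		skb->len - off);
	skb->data[off] = 0x81;		/* TPID 0x8100 */
	skb->data[off + 1] = 0x00;
	skb->data[off + 2] = (uint8_t)(skb->vlan_tci >> 8);
	skb->data[off + 3] = (uint8_t)(skb->vlan_tci & 0xff);
	skb->len += TOY_VLAN_HLEN;
	skb->vlan_present = 0;
	return 0;
}

/* TX VLAN offload: frame bytes stay put, the VID rides in the descriptor. */
struct toy_tx_desc toy_fill_desc(const struct toy_skb *skb)
{
	struct toy_tx_desc d = { skb->data, skb->len, 0 };

	if (skb->vlan_present)
		d.vid = skb->vlan_tci & TOY_VID_MASK;
	return d;
}
```

In the offload path the frame is handed to the hardware untouched and
only the VID is copied out, which is why no bytes need to be shuffled.
The inline path shows the memmove that the offload flag avoids.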

>> Finally I'm confused about the atomic_nb/blocking_nb parameters. Some
>> drivers just pass NULL and others pass the same notifier blocks that
>> they've already registered with
>> register_switchdev_notifier()/register_switchdev_blocking_notifier(). If
>> notifiers are registered why does switchdev_bridge_port_offload() take
>> them as parameters?
> Because when you add a port to the bridge, lots of stuff that you want
> to offload might already have been configured. E.g., imagine that you
> were to add vlan 2 to br0 before adding the switchports; then you
> probably need those events to be replayed to the new ports in order to
> add your CPU-facing switchport to vlan 2. However, we do not want to
> bother existing bridge members with duplicated events (and risk messing
> up any reference counters they might maintain for these
> objects). Therefore we bypass the standard notifier calls and "unicast"
> the replay events only to the driver for the port being added.

This part I still don't get. I understand that there may be scenarios 
where switchdev decides it needs to unicast events to a specific device. 
But why does the caller of switchdev_bridge_port_offload() need to make 
that distinction?

>> Thanks,
>> Chris


* Re: understanding switchdev notifications
  2024-08-14 22:18   ` Chris Packham
@ 2024-08-15  7:02     ` Tobias Waldekranz
  0 siblings, 0 replies; 4+ messages in thread
From: Tobias Waldekranz @ 2024-08-15  7:02 UTC (permalink / raw)
  To: Chris Packham, netdev, linux-kernel@vger.kernel.org

On Thu, Aug 15, 2024 at 10:18, Chris Packham <chris.packham@alliedtelesis.co.nz> wrote:
> Hi Tobias,

Hi Chris!

> On 14/08/24 18:54, Tobias Waldekranz wrote:
>> On Thu, Aug 08, 2024 at 12:48, Chris Packham <chris.packham@alliedtelesis.co.nz> wrote:
>>> Hi,
>>>
>>> I'm trying to get to grips with how the switchdev notifications are
>>> supposed to be used when developing a switchdev driver.
>>>
>>> I have been reading through
>>> https://www.kernel.org/doc/html/latest/networking/switchdev.html which
>>> covers a few things but doesn't go into detail around the notifiers that
>>> one needs to implement for a new switchdev driver (which is probably
>>> very dependent on what the hardware is capable of).
>>>
>>> Specifically right now I'm looking at having a switch port join a vlan
>>> aware bridge. I have a configuration something like this
>>>
>>>       ip link add br0 type bridge vlan_filtering 1
>>>       ip link set sw1p5 master br0
>>>       ip link set sw1p1 master br0
>>>       bridge vlan add vid 2 dev br0 self
>>>       ip link add link br0 br0.2 type vlan id 2
>>>       ip addr add dev br0.2 192.168.2.1/24
>>>       bridge vlan add vid 2 dev sw1p5 pvid untagged
>>>       bridge vlan add vid 2 dev sw1p1
>>>       ip link set sw1p5 up
>>>       ip link set sw1p1 up
>>>       ip link set br0 up
>>>       ip link set br0.2 up
>>>
>>> Then I'm testing by sending a ping to a nonexistent host on the
>>> 192.168.2.0/24 subnet and looking at the traffic with tcpdump on another
>>> device connected to sw1p5.
>>>
>>> I'm a bit confused about how I should be calling
>>> switchdev_bridge_port_offload(). It takes two netdevs (brport_dev and
>>> dev) but as far as I've been able to see all the callers end up passing
>>> the same netdev for both of these (some create a driver specific brport
>>> but this still ends up with brport->dev and dev being the same object).
>> In the simple case when a switchport is directly attached to a bridge,
>> brport_dev and dev will be the same. If the attachment is indirect, via
>> a bond for example, they will differ:
>>
>>         br0
>>         /
>>      bond0
>>     /    \
>> sw1p1  sw1p5
>>
>> In the setup above, the bridge has no reference to any sw*p* interfaces,
>> all generated notifications will reference "bond0". By including the
>> switchdev port in the message back to the bridge, it can perform
>> validation on the setup; e.g. that bond0 is not made up of interfaces
>> from different hardware domains.
>
> Ah that makes sense. I haven't got to bonds yet so I hadn't hit that case.
>
>>> I've figured out that I need to set tx_fwd_offload=true so that the
>>> bridge software only sends one packet to the hardware. That makes sense
>>> as a way of saying that my hardware can take care of sending the packet
>>> out the right ports.
>>>
>>> I do have a problem that what I get from the bridge has a vlan tag
>>> inserted (which makes sense in sw when the packet goes from br0.2 to
>>> br0). But I don't actually need it as the hardware will insert a tag for
>>> me if the port is setup for egress tagging. I can shuffle the Ethernet
>>> header up but I was wondering if there was a way of telling the bridge
>>> not to insert the tag?
>> Signaling tx_fwd_offload=true means assuming responsibility for
>> delivering each packet to all ports that the bridge would otherwise have
>> sent individual skbs for.
>>
>> Let's expand your setup slightly, and see why you need the tag:
>>
>>     br0.2 br0.3
>>         \ /
>>         br0
>>        / |  \
>>       /  |   \
>> sw1p1 sw1p3  sw1p5
>> (2U)  (3U)  (2T,3T)
>>
>> sw1p5 is now a trunk. We can trigger an ARP broadcast to be sent out
>> either via br0.2 or br0.3, depending on the subnet we choose to target.
>>
>> Your driver will receive a single skb to transmit, and skb->dev can be
>> set to any of sw1p{1,3,5} depending on config order, FDB entries
>> (i.e. the order of previously received packets) etc., and is thus
>> nondeterministic.
>>
>> So presumably, even though you might need to remove the 802.1Q tag from
>> the frame, you need some way of tagging the packet with the correct VID
>> in order for the hardware to do the right thing; possibly via a field in
>> the vendor's hardware specific tag.
>
> I did eventually find NETIF_F_HW_VLAN_CTAG_TX which stops the packet 
> data coming down to the switch driver with a vlan tag inserted. The 
> intended egress vlan is still available via skb_vlan_tag_get_id() so I 
> can add it to hardware specific tag (which for me is part of the TX DMA 
> descriptor) and I don't need to shuffle any bytes around which is great.

I see now that I might have misunderstood your original question. You
just wanted to avoid the VLAN info being moved from skb->vlan_tci into
skb->data - good that you found the right netif flag!

>>> Finally I'm confused about the atomic_nb/blocking_nb parameters. Some
>>> drivers just pass NULL and others pass the same notifier blocks that
>>> they've already registered with
>>> register_switchdev_notifier()/register_switchdev_blocking_notifier(). If
>>> notifiers are registered why does switchdev_bridge_port_offload() take
>>> them as parameters?
>> Because when you add a port to the bridge, lots of stuff that you want
>> to offload might already have been configured. E.g., imagine that you
>> were to add vlan 2 to br0 before adding the switchports; then you
>> probably need those events to be replayed to the new ports in order to
>> add your CPU-facing switchport to vlan 2. However, we do not want to
>> bother existing bridge members with duplicated events (and risk messing
>> up any reference counters they might maintain for these
>> objects). Therefore we bypass the standard notifier calls and "unicast"
>> the replay events only to the driver for the port being added.
>
> This part I still don't get. I understand that there may be scenarios 
> where switchdev decides it needs to unicast events to a specific device. 

That's the thing, switchdev (really: the bridge) can't always decide
when this need arises. Imagine this transition:

      br0                  br0
      /                    /
   bond0      =>        bond0
    / \                /  |  \
swp1  swp2        swp1  swp2  swp3

From the bridge's perspective, the two configurations are the same:
There's one attached port, which is still a member of the same VLANs,
multicast groups etc. that it was before the new port was added to the
bond.

In order to get to the right hardware configuration though, we typically
need to add swp3 to all of those objects that bond0 is a member of. If
it were the bridge's responsibility to handle that, then it would
have to concern itself with the inner workings of all kinds of
interfaces that can be attached to it.

The current "replay request" model instead places that responsibility on
each individual driver, which is the entity that knows about the
particular offloads that it supports.
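
The swp3 case can be sketched in userspace C as follows. This is a toy
model of the "replay request" idea, not the kernel implementation: all
toy_* names are hypothetical, and only VLANs are replayed here, whereas
the real replay also covers other objects such as multicast groups.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_VIDS 8

/* What the bridge tracks: the VLANs of its one port, bond0. */
struct toy_brport {
	const char *name;
	uint16_t vids[TOY_MAX_VIDS];
	size_t num_vids;
};

/* What the driver tracks: VLANs programmed on one physical port. */
struct toy_hw_port {
	const char *name;
	uint16_t vids[TOY_MAX_VIDS];
	size_t num_vids;
};

typedef int (*toy_vlan_cb)(struct toy_hw_port *port, uint16_t vid);

/* Driver callback: program one VLAN on one physical port. */
int toy_hw_vlan_add(struct toy_hw_port *port, uint16_t vid)
{
	if (port->num_vids >= TOY_MAX_VIDS)
		return -1;
	port->vids[port->num_vids++] = vid;
	return 0;
}

/* Bridge side: replay bond0's VLANs to whoever asked, and nobody else. */
int toy_replay_vlans(const struct toy_brport *brport, toy_vlan_cb cb,
		     struct toy_hw_port *port)
{
	size_t i;
	int err;

	for (i = 0; i < brport->num_vids; i++) {
		err = cb(port, brport->vids[i]);
		if (err)
			return err;
	}
	return 0;
}

/* Driver side: swp3 joined bond0, so the driver requests the replay. */
int toy_bond_join(const struct toy_brport *bond, struct toy_hw_port *port)
{
	return toy_replay_vlans(bond, toy_hw_vlan_add, port);
}
```

The bridge's view of bond0 is passed as const and never changes. The
driver is the one that notices the new bond member and asks for the
replay, handing over the callback that should receive it, so swp1 and
swp2 are not touched.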

> But why does the caller of switchdev_bridge_port_offload() need to make 
> that distinction?

AFAIK, the core idea with notifier blocks is to decouple the publisher
from its subscribers. I.e. switchdev does not know anything about the
subscribers listening to the notifications that it generates. It just
publishes information to the chain and checks for any errors
reported. In order to address an individual subscriber, that subscriber
must instead be the initiator and pass along the callbacks.

>>> Thanks,
>>> Chris
