From: Tobias Waldekranz <tobias@waldekranz.com>
To: Vladimir Oltean <olteanv@gmail.com>
Cc: davem@davemloft.net, kuba@kernel.org, atenart@kernel.org,
roopa@nvidia.com, razor@blackwall.org, bridge@lists.linux.dev,
netdev@vger.kernel.org, jiri@resnulli.us, ivecera@redhat.com
Subject: Re: [PATCH v3 net] net: bridge: switchdev: Skip MDB replays of pending events
Date: Tue, 06 Feb 2024 15:54:25 +0100 [thread overview]
Message-ID: <87v871skwu.fsf@waldekranz.com> (raw)
In-Reply-To: <20240205114138.uiwioqstybmzq77b@skbuf>
On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
>> Before this change, generation of the list of events MDB to replay
>
> s/events MDB/MDB events/
>
>> would race against the IGMP/MLD snooping logic, which could concurrently
>
> logic. This could (...)
>
>> enqueue events to the switchdev deferred queue, leading to duplicate
>> events being sent to drivers. As a consequence of this, drivers which
>> reference count memberships (at least DSA), would be left with orphan
>> groups in their hardware database when the bridge was destroyed.
>
> Still missing the user impact description, aka "when would this be
> noticed by, and actively bother an user?". Something that would justify
> handling this via net.git rather than net-next.git.
Other than moving up the example to this point, what do you feel is
missing? Or are you saying that you don't think this belongs on net.git?
>>
>> Avoid this by grabbing the write-side lock of the MDB while generating
>> the replay list, making sure that no deferred version of a replay
>> event is already enqueued to the switchdev deferred queue, before
>> adding it to the replay list.
>
> The description of the solution is actually not very satisfactory to me.
> I would have liked to see a more thorough analysis.
Sorry to disappoint.
> The race has 2 components, one comes from the fact that during replay,
> we iterate using RCU, which does not halt concurrent updates, and the
> other comes from the fact that the MDB addition procedure is non-atomic.
> Elements are first added to the br->mdb_list, but are notified to
> switchdev in a deferred context.
>
> Grabbing the bridge multicast spinlock only solves the first problem: it
> stops new enqueues of deferred events. We also need special handling of
> the pending deferred events. The idea there is that we cannot cancel
> them, since that would lead to complications for other potential
> listeners on the switchdev chain. And we cannot flush them either, since
> that wouldn't actually help: flushing needs sleepable context, which is
> incompatible with holding br->multicast_lock, and without
> br->multicast_lock held, we haven't actually solved anything, since new
> deferred events can still be enqueued at any time.
>
> So the only simple solution is to let the pending deferred events
> execute (eventually), but during event replay on joining port, exclude
> replaying those multicast elements which are in the bridge's multicast
> list, but were not yet added through switchdev. Eventually they will be.
In my opinion, your three paragraphs above say pretty much the same
thing as my single paragraph. But sure, I will try to provide more
details in v4.
> (side note: the handling code for excluding replays on pending event
> deletion seems to not actually help, because)
>
> Event replays on a switchdev port leaving the bridge are also
> problematic, but in a different way. The deletion procedure is also
> non-atomic, they are first removed from br->mdb_list then the switchdev
> notification is deferred. So, the replay procedure cannot enter a
> condition where it replays the deletion twice. But, there is no
> switchdev_deferred_process() call when the switchdev port unoffloads an
> intermediary LAG bridge port, and this can lead to the situation where
> neither the replay, nor the deferred switchdev object, trickle down to a
> call in the switchdev driver. So for event deletion, we need to force a
> synchronous call to switchdev_deferred_process().
Good catch!
What does this mean for v4? Is this still one logical change that
ensures that MDB events are always delivered exactly once, or a series
where one patch fixes the duplicates and one ensures that no events are
missed?
> See how the analysis in the commit message changes the patch?
Suddenly the novice was enlightened.
>>
>> An easy way to reproduce this issue, on an mv88e6xxx system, was to
>
> s/was/is/
>
>> create a snooping bridge, and immediately add a port to it:
>>
>> root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
>> > ip link set dev x3 up master br0
>> root@infix-06-0b-00:~$ ip link del dev br0
>> root@infix-06-0b-00:~$ mvls atu
>> ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a
>> DEV:0 Marvell 88E6393X
>> 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . .
>> 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . .
>> ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a
>> root@infix-06-0b-00:~$
>>
>> The two IPv6 groups remain in the hardware database because the
>> port (x3) is notified of the host's membership twice: once via the
>> original event and once via a replay. Since only a single delete
>> notification is sent, the count remains at 1 when the bridge is
>> destroyed.
>
> These 2 paragraphs, plus the mvls output, should be placed before the
> "Avoid this" paragraph.
>
>>
>> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> ---
next prev parent reply other threads:[~2024-02-06 14:54 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-01 16:10 [PATCH v3 net] net: bridge: switchdev: Skip MDB replays of pending events Tobias Waldekranz
2024-02-01 16:24 ` Jiri Pirko
2024-02-02 7:09 ` Tobias Waldekranz
2024-02-05 11:41 ` Vladimir Oltean
2024-02-06 14:54 ` Tobias Waldekranz [this message]
2024-02-06 19:31 ` Vladimir Oltean
2024-02-06 21:58 ` Tobias Waldekranz
2024-02-07 16:36 ` Vladimir Oltean
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87v871skwu.fsf@waldekranz.com \
--to=tobias@waldekranz.com \
--cc=atenart@kernel.org \
--cc=bridge@lists.linux.dev \
--cc=davem@davemloft.net \
--cc=ivecera@redhat.com \
--cc=jiri@resnulli.us \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=olteanv@gmail.com \
--cc=razor@blackwall.org \
--cc=roopa@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).