From: Vladimir Oltean <olteanv@gmail.com>
To: Tobias Waldekranz <tobias@waldekranz.com>
Cc: davem@davemloft.net, kuba@kernel.org, atenart@kernel.org,
roopa@nvidia.com, razor@blackwall.org, bridge@lists.linux.dev,
netdev@vger.kernel.org, jiri@resnulli.us, ivecera@redhat.com
Subject: Re: [PATCH v3 net] net: bridge: switchdev: Skip MDB replays of pending events
Date: Wed, 7 Feb 2024 18:36:14 +0200 [thread overview]
Message-ID: <20240207163614.irxeoowoapglbnxj@skbuf> (raw)
In-Reply-To: <87sf25s1ad.fsf@waldekranz.com>
On Tue, Feb 06, 2024 at 10:58:18PM +0100, Tobias Waldekranz wrote:
> On tis, feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
> >> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
> >> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
> >> >> Before this change, generation of the list of events MDB to replay
> >> >
> >> > s/events MDB/MDB events/
> >> >
> >> >> would race against the IGMP/MLD snooping logic, which could concurrently
> >> >
> >> > logic. This could (...)
> >> >
> >> >> enqueue events to the switchdev deferred queue, leading to duplicate
> >> >> events being sent to drivers. As a consequence of this, drivers which
> >> >> reference count memberships (at least DSA), would be left with orphan
> >> >> groups in their hardware database when the bridge was destroyed.
> >> >
> >> > Still missing the user impact description, aka "when would this be
> >> > noticed by, and actively bother an user?". Something that would justify
> >> > handling this via net.git rather than net-next.git.
> >>
> >> Other than moving up the example to this point, what do you feel is
> >> missing? Or are you saying that you don't think this belongs on net.git?
> >
> > I just don't have enough data to be able to judge (it's missing from the
> > commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> > as a reference. It lists a series of guidelines from which I gather,
> > generally speaking, that there should exist a user impact on
> > functionality. The "mvls atu" command is a debug program, it just dumps
> > hardware tables. It is enough to point out that multicast entries leak,
> > but not what will be broken as a result of that.
>
> Fair enough.
>
> I originally discovered the issue while developing a kselftest for
> another improvement I want to make to switchdev multicast offloading
> (related to router ports). I thought my new code was causing these
> orphan entries, but then realized that the issue existed on a plain -net
> and -net-next kernel.
>
> I can imagine scenarios in which a user _could_ be impacted by this, but
> they are all quite esoteric. Maybe that is an indication that I should
> just target net-next instead.
Again, I'm not discouraging you to send this patch to net.git. I've been
saying (since v1, apparently: https://lore.kernel.org/netdev/20240131135157.ddrtt4swvz5y3nbz@skbuf/)
that if there is a reason to do it, just say it in the commit message,
so that we're all on the same page.
Be verbose when calculating the cost/benefit ratio calculation
(preferably also in the commit message). I would analyze the complexity
of the change as "medium", since it does change the locking scheme to
fix an underdesigned mechanism. And the fact that mlxsw also uses
replays through a slightly different code path means something you
(probably) can't test, and potentially risky.
> >> > The race has 2 components, one comes from the fact that during replay,
> >> > we iterate using RCU, which does not halt concurrent updates, and the
> >> > other comes from the fact that the MDB addition procedure is non-atomic.
> >> > Elements are first added to the br->mdb_list, but are notified to
> >> > switchdev in a deferred context.
> >> >
> >> > Grabbing the bridge multicast spinlock only solves the first problem: it
> >> > stops new enqueues of deferred events. We also need special handling of
> >> > the pending deferred events. The idea there is that we cannot cancel
> >> > them, since that would lead to complications for other potential
> >> > listeners on the switchdev chain. And we cannot flush them either, since
> >> > that wouldn't actually help: flushing needs sleepable context, which is
> >> > incompatible with holding br->multicast_lock, and without
> >> > br->multicast_lock held, we haven't actually solved anything, since new
> >> > deferred events can still be enqueued at any time.
> >> >
> >> > So the only simple solution is to let the pending deferred events
> >> > execute (eventually), but during event replay on joining port, exclude
> >> > replaying those multicast elements which are in the bridge's multicast
> >> > list, but were not yet added through switchdev. Eventually they will be.
> >>
> >> In my opinion, your three paragraphs above say pretty much the same
> >> thing as my single paragraph. But sure, I will try to provide more
> >> details in v4.
> >
> > It's not about word count, I'm notoriously bad at summarizing. If you
> > could still say in a single paragraph what I did in 3, it would be great.
> >
> > The difference is that the above 3 paragraphs only explore the race on
> > MDB addition events. It is of a different nature that the one on deletion.
> > Your analysis clumps them together, and the code follows suit. There is
> > code to supposedly handle the race between deferred deletion events and
> > replay on port removal, but I don't think it does.
>
> That's the thing though: I did not lump them together, I simply did not
> think about the deletion issue at all. An issue which you yourself state
> should probably be fixed in a separate patch, presumably becuase you
> think of them as two very closely related, but ultimately separate,
> issues. Which is why I think it was needlessly harsh to say that my
> analysis of the duplicate-events-issue was too simplistic, because it
> did not address a related issue that I had not considered.
So in the comments for your v1, one of the things I told you was that
"there's one more race to deal with" on removal.
https://lore.kernel.org/netdev/20240131150642.mxcssv7l5qfiejkl@skbuf/
At the time, the idea had not crystalized in my mind either that it's
not "one more" race on removal, but that on removal it's _the only_
race.
You then submitted v2 and v3, not taking this comment into consideration,
not even to explore it. Exploring it thoroughly would have revealed that
your approach - to apply the algorithm of excluding objects from replay
if events on them are pending in switchdev - is completely unnecessary
when "adding=false".
In light of my comment and the lack of a detailed analysis on your side
on removal, it _appears_, by looking at this change, that the patch
handles a race on removal, but the code actually runs through the
algorithm that handles a race that doesn't exist, and doesn't handle the
race that _does_ exist.
My part of the fault is doing spotty review, giving ideas piecemeal, and
getting frustrated you aren't picking all of them up. It's still going
to be a really busy few weeks/months for me ahead, and there's nothing I
can do about that. I'm faced with the alternatives of either doing a
shit job reviewing and leaving comments/ideas in the little time I have,
or not reviewing at all. I'll think about my options some more.
prev parent reply other threads:[~2024-02-07 16:36 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-01 16:10 [PATCH v3 net] net: bridge: switchdev: Skip MDB replays of pending events Tobias Waldekranz
2024-02-01 16:24 ` Jiri Pirko
2024-02-02 7:09 ` Tobias Waldekranz
2024-02-05 11:41 ` Vladimir Oltean
2024-02-06 14:54 ` Tobias Waldekranz
2024-02-06 19:31 ` Vladimir Oltean
2024-02-06 21:58 ` Tobias Waldekranz
2024-02-07 16:36 ` Vladimir Oltean [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240207163614.irxeoowoapglbnxj@skbuf \
--to=olteanv@gmail.com \
--cc=atenart@kernel.org \
--cc=bridge@lists.linux.dev \
--cc=davem@davemloft.net \
--cc=ivecera@redhat.com \
--cc=jiri@resnulli.us \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=razor@blackwall.org \
--cc=roopa@nvidia.com \
--cc=tobias@waldekranz.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox