From: Tobias Waldekranz <tobias@waldekranz.com>
To: Vladimir Oltean <olteanv@gmail.com>
Cc: davem@davemloft.net, kuba@kernel.org, atenart@kernel.org,
roopa@nvidia.com, razor@blackwall.org, bridge@lists.linux.dev,
netdev@vger.kernel.org, jiri@resnulli.us, ivecera@redhat.com
Subject: Re: [PATCH v3 net] net: bridge: switchdev: Skip MDB replays of pending events
Date: Tue, 06 Feb 2024 22:58:18 +0100 [thread overview]
Message-ID: <87sf25s1ad.fsf@waldekranz.com> (raw)
In-Reply-To: <20240206193111.pm265ffzkmd7jbyo@skbuf>
On tis, feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
>> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@gmail.com> wrote:
>> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
>> >> Before this change, generation of the list of events MDB to replay
>> >
>> > s/events MDB/MDB events/
>> >
>> >> would race against the IGMP/MLD snooping logic, which could concurrently
>> >
>> > logic. This could (...)
>> >
>> >> enqueue events to the switchdev deferred queue, leading to duplicate
>> >> events being sent to drivers. As a consequence of this, drivers which
>> >> reference count memberships (at least DSA), would be left with orphan
>> >> groups in their hardware database when the bridge was destroyed.
>> >
>> > Still missing the user impact description, aka "when would this be
>> > noticed by, and actively bother an user?". Something that would justify
>> > handling this via net.git rather than net-next.git.
>>
>> Other than moving up the example to this point, what do you feel is
>> missing? Or are you saying that you don't think this belongs on net.git?
>
> I just don't have enough data to be able to judge (it's missing from the
> commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> as a reference. It lists a series of guidelines from which I gather,
> generally speaking, that there should exist a user impact on
> functionality. The "mvls atu" command is a debug program, it just dumps
> hardware tables. It is enough to point out that multicast entries leak,
> but not what will be broken as a result of that.
Fair enough.
I originally discovered the issue while developing a kselftest for
another improvement I want to make to switchdev multicast offloading
(related to router ports). I thought my new code was causing these
orphan entries, but then realized that the issue existed on a plain -net
and -net-next kernel.
I can imagine scenarios in which a user _could_ be impacted by this, but
they are all quite esoteric. Maybe that is an indication that I should
just target net-next instead.
>> >> Avoid this by grabbing the write-side lock of the MDB while generating
>> >> the replay list, making sure that no deferred version of a replay
>> >> event is already enqueued to the switchdev deferred queue, before
>> >> adding it to the replay list.
>> >
>> > The description of the solution is actually not very satisfactory to me.
>> > I would have liked to see a more thorough analysis.
>>
>> Sorry to disappoint.
>>
>> > The race has 2 components, one comes from the fact that during replay,
>> > we iterate using RCU, which does not halt concurrent updates, and the
>> > other comes from the fact that the MDB addition procedure is non-atomic.
>> > Elements are first added to the br->mdb_list, but are notified to
>> > switchdev in a deferred context.
>> >
>> > Grabbing the bridge multicast spinlock only solves the first problem: it
>> > stops new enqueues of deferred events. We also need special handling of
>> > the pending deferred events. The idea there is that we cannot cancel
>> > them, since that would lead to complications for other potential
>> > listeners on the switchdev chain. And we cannot flush them either, since
>> > that wouldn't actually help: flushing needs sleepable context, which is
>> > incompatible with holding br->multicast_lock, and without
>> > br->multicast_lock held, we haven't actually solved anything, since new
>> > deferred events can still be enqueued at any time.
>> >
>> > So the only simple solution is to let the pending deferred events
>> > execute (eventually), but during event replay on joining port, exclude
>> > replaying those multicast elements which are in the bridge's multicast
>> > list, but were not yet added through switchdev. Eventually they will be.
>>
>> In my opinion, your three paragraphs above say pretty much the same
>> thing as my single paragraph. But sure, I will try to provide more
>> details in v4.
>
> It's not about word count, I'm notoriously bad at summarizing. If you
> could still say in a single paragraph what I did in 3, it would be great.
>
> The difference is that the above 3 paragraphs only explore the race on
> MDB addition events. It is of a different nature that the one on deletion.
> Your analysis clumps them together, and the code follows suit. There is
> code to supposedly handle the race between deferred deletion events and
> replay on port removal, but I don't think it does.
That's the thing though: I did not lump them together, I simply did not
think about the deletion issue at all. An issue which you yourself state
should probably be fixed in a separate patch, presumably becuase you
think of them as two very closely related, but ultimately separate,
issues. Which is why I think it was needlessly harsh to say that my
analysis of the duplicate-events-issue was too simplistic, because it
did not address a related issue that I had not considered.
>> > (side note: the handling code for excluding replays on pending event
>> > deletion seems to not actually help, because)
>> >
>> > Event replays on a switchdev port leaving the bridge are also
>> > problematic, but in a different way. The deletion procedure is also
>> > non-atomic, they are first removed from br->mdb_list then the switchdev
>> > notification is deferred. So, the replay procedure cannot enter a
>> > condition where it replays the deletion twice. But, there is no
>> > switchdev_deferred_process() call when the switchdev port unoffloads an
>> > intermediary LAG bridge port, and this can lead to the situation where
>> > neither the replay, nor the deferred switchdev object, trickle down to a
>> > call in the switchdev driver. So for event deletion, we need to force a
>> > synchronous call to switchdev_deferred_process().
>>
>> Good catch!
>>
>> What does this mean for v4? Is this still one logical change that
>> ensures that MDB events are always delivered exactly once, or a series
>> where one patch fixes the duplicates and one ensures that no events
>> are missed?
>
> I'm not saying I know how you can satisfy everyone's review requests.
Of course, I was merely trying to avoid an extra round of patches.
> Once it is clear that deferred additions and deferred removals need
> separate treatment, it would probably be preferable to treat them in
> separate patches.
I should have been more clear in my response. When I said "Good catch!",
that was because I had first reproduced the issue, and then verified
that adding a call to switchdev_deferred_process() at the end of
nbp_switchdev_unsync_objs() solves the issue.
>> > See how the analysis in the commit message changes the patch?
>>
>> Suddenly the novice was enlightened.
>
> This, to me, reads like unnecessary sarcasm.
Your question, to me, read like unnecessary condescension.
> I just mean that I don't think you spent enough time to explore the
> problem space in the commit message. It is a very important element of
> a patch, because it is very rare for someone to solve a problem which
> is not even formulated. I actually tried to be helpful by pointing out
> where things might not work out. I'm sorry if it did not appear that way.
I am extremely greatful for your help, truly! Which is why I
complemented you for pointing out the second issue to me. Something
about the phrasing of that question really rubbed me the wrong way
though. I should have just let it go, I'm sorry.
next prev parent reply other threads:[~2024-02-06 21:58 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-01 16:10 [PATCH v3 net] net: bridge: switchdev: Skip MDB replays of pending events Tobias Waldekranz
2024-02-01 16:24 ` Jiri Pirko
2024-02-02 7:09 ` Tobias Waldekranz
2024-02-05 11:41 ` Vladimir Oltean
2024-02-06 14:54 ` Tobias Waldekranz
2024-02-06 19:31 ` Vladimir Oltean
2024-02-06 21:58 ` Tobias Waldekranz [this message]
2024-02-07 16:36 ` Vladimir Oltean
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87sf25s1ad.fsf@waldekranz.com \
--to=tobias@waldekranz.com \
--cc=atenart@kernel.org \
--cc=bridge@lists.linux.dev \
--cc=davem@davemloft.net \
--cc=ivecera@redhat.com \
--cc=jiri@resnulli.us \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=olteanv@gmail.com \
--cc=razor@blackwall.org \
--cc=roopa@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).