All of lore.kernel.org
 help / color / mirror / Atom feed
From: William McVicker <willmcvicker@google.com>
To: Johannes Berg <johannes@sipsolutions.net>
Cc: Jakub Kicinski <kuba@kernel.org>,
	linux-wireless@vger.kernel.org,
	Marek Szyprowski <m.szyprowski@samsung.com>,
	Kalle Valo <kvalo@codeaurora.org>,
	"David S. Miller" <davem@davemloft.net>,
	netdev@vger.kernel.org, Amitkumar Karwar <amitkarwar@gmail.com>,
	Ganapathi Bhat <ganapathi.bhat@nxp.com>,
	Xinming Hu <huxinming820@gmail.com>,
	kernel-team@android.com, Paolo Abeni <pabeni@redhat.com>
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd
Date: Fri, 25 Mar 2022 18:08:20 +0000	[thread overview]
Message-ID: <Yj4FFIXi//ivQC3X@google.com> (raw)
In-Reply-To: <f4f8a27dc07c1adaab470fde302ed841113e6b7f.camel@sipsolutions.net>

On 03/25/2022, Johannes Berg wrote:
> On Fri, 2022-03-25 at 09:49 -0700, Jakub Kicinski wrote:
> > On Fri, 25 Mar 2022 13:04:23 +0100 Johannes Berg wrote:
> > > So we can avoid the potential deadlock in cfg80211 in a few ways:
> > > 
> > >  1) export rtnl_lock_unregistering_all() or maybe a variant after
> > >     refactoring the two versions, to allow cfg80211 to use it, that way
> > >     netdev_run_todo() can never have a non-empty todo list
> > > 
> > >  2) export __rtnl_unlock() so cfg80211 can avoid running
> > >     netdev_run_todo() in the unlock, personally I like this less because
> > >     it might encourage random drivers to use it
> > > 
> > >  3) completely rework cfg80211's locking, adding a separate mutex for
> > >     the wiphy list so we don't need to acquire the RTNL at all here
> > >     (unless the ops need it, but there's no issue if we don't drop it),
> > >     something like https://p.sipsolutions.net/27d08e1f5881a793.txt
> > > 
> > > 
> > > I think I'm happy with 3) now (even if it took a couple of hours), so I
> > > think we can go with it, just need to go through all the possibilities.
> > 
> > I like 3) as well. FWIW a few places (e.g. mlx5, devlink, I think I've
> > seen more) had been converting to xarray for managing the "registered"
> > objects. It may be worth looking into if you're re-doing things, anyway.
> > 
> 
> That's not a bad idea, but I think I wouldn't want to backport that, so
> separately :) I don't think that fundamentally changes the locking
> properties though.
> 
> 
> Couple of more questions I guess: First, are we assuming that the
> cfg80211 code *is* actually broken, even if it looks like nothing can
> cause the situation, due to the empty todo list?

I'm able to reproduce this issue pretty easily with a Pixel 6 when I add
support to allow vendor commands to request for the RTNL. For this case, I just
delay unlocking the RTNL until nl80211_vendor_cmds() at which point I check the
flags to see if I should unlock before calling doit(). That allows me to run my
tests again and hit this issue. I imagine that I could hit this issue without
any changes if I re-work my vendor ops to not need the RTNL.

> 
> Given that we have rtnl_lock_unregistering() (and also
> rtnl_lock_unregistering_all()), it looks like we *do* in fact at least
> not want to make an assumption that no user of __rtnl_unlock() can have
> added a todo item.
> 
> I mean, there's technically yet *another* thing we could do - something
> like this:
> 
> [this doesn't compile, need to suitably make net_todo_list non-static]
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -95,6 +95,7 @@ void __rtnl_unlock(void)
>  
>         defer_kfree_skb_list = NULL;
>  
> +       WARN_ON(!list_empty(&net_todo_list));
>         mutex_unlock(&rtnl_mutex);
>  
>         while (head) {
> 
> and actually that would allow us to get rid of rtnl_lock_unregistering()
> and rtnl_lock_unregistering_all() simply because we'd actually guarantee
> the invariant that when the RTNL is freshly locked, the list is empty
> (by guaranteeing that it's always empty when it's unlocked, since it can
> only be added to under RTNL).
> 
> With some suitable commentary, that might also be a reasonable thing?
> __rtnl_unlock() is actually rather pretty rare, and not exported.
> 
> 
> However, if you don't like that ...
> 
> I've been testing with this patch, to make lockdep complain:
> 
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -9933,6 +9933,11 @@ void netdev_run_todo(void)
>         if (!list_empty(&list))
>                 rcu_barrier();
>  
> +#ifdef CONFIG_LOCKDEP
> +       rtnl_lock();
> +       __rtnl_unlock();
> +#endif
> +
>         while (!list_empty(&list)) {
>                 struct net_device *dev
>                         = list_first_entry(&list, struct net_device, todo_list);
> 
> 
> That causes lockdep to complain for cfg80211 even if the list *is* in
> fact empty.
> 
> Would you be open to adding something like that? Perhaps if I don't just
> do the easy rtnl_lock/unlock, but try to find the corresponding lockdep-
> only things to do there, to cause lockdep to do things without really
> locking? OTOH, the locking overhead of the RTNL we just unlocked is
> probably minimal, vs. the actual work *lockdep* is doing to track all
> this ...
> 
> Thanks,
> johannes

Let me know if you'd like me to test any patches out.

Thanks,
Will

  reply	other threads:[~2022-03-25 19:38 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-17 17:09 [BUG] deadlock in nl80211_vendor_cmd willmcvicker
     [not found] ` <CABYd82Z=YXmZPTQhf0K1M4nS2wk3dPBSqx91D8SoUd59AUzpHg@mail.gmail.com>
2022-03-21 17:00   ` William McVicker
2022-03-21 20:07 ` Johannes Berg
2022-03-22  0:15   ` William McVicker
2022-03-22 21:31   ` Jeff Johnson
2022-03-22 21:58     ` William McVicker
2022-03-23 16:06       ` Jeff Johnson
2022-03-24 21:58         ` William McVicker
2022-03-25 12:04           ` Johannes Berg
2022-03-25 12:06             ` Johannes Berg
2022-03-25 16:49             ` Jakub Kicinski
2022-03-25 17:01               ` Johannes Berg
2022-03-25 18:08                 ` William McVicker [this message]
2022-03-25 20:21                   ` Johannes Berg
2022-03-25 20:36                   ` William McVicker
2022-03-25 21:16                     ` Johannes Berg
2022-03-25 21:54                       ` Johannes Berg
2022-03-25 22:18                       ` Jeff Johnson
2022-03-25 23:57                       ` William McVicker
2022-03-26  0:07                         ` Jakub Kicinski
2022-03-26  0:12                           ` William McVicker
2022-03-25 20:40                 ` Jakub Kicinski
2022-03-25 21:25                   ` Johannes Berg
2022-03-25 21:48                     ` Jakub Kicinski
2022-03-25 21:50                       ` Johannes Berg
2022-03-25 12:09       ` Johannes Berg
2022-03-25 15:59         ` Jeff Johnson
2022-03-25 16:04           ` Johannes Berg
2022-03-25 17:14             ` Jeff Johnson
2022-03-25 12:08     ` Johannes Berg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yj4FFIXi//ivQC3X@google.com \
    --to=willmcvicker@google.com \
    --cc=amitkarwar@gmail.com \
    --cc=davem@davemloft.net \
    --cc=ganapathi.bhat@nxp.com \
    --cc=huxinming820@gmail.com \
    --cc=johannes@sipsolutions.net \
    --cc=kernel-team@android.com \
    --cc=kuba@kernel.org \
    --cc=kvalo@codeaurora.org \
    --cc=linux-wireless@vger.kernel.org \
    --cc=m.szyprowski@samsung.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.