netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: William McVicker <willmcvicker@google.com>
To: Johannes Berg <johannes@sipsolutions.net>
Cc: Jakub Kicinski <kuba@kernel.org>,
	linux-wireless@vger.kernel.org,
	Marek Szyprowski <m.szyprowski@samsung.com>,
	Kalle Valo <kvalo@codeaurora.org>,
	"David S. Miller" <davem@davemloft.net>,
	netdev@vger.kernel.org, Amitkumar Karwar <amitkarwar@gmail.com>,
	Ganapathi Bhat <ganapathi.bhat@nxp.com>,
	Xinming Hu <huxinming820@gmail.com>,
	kernel-team@android.com, Paolo Abeni <pabeni@redhat.com>
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd
Date: Fri, 25 Mar 2022 20:36:05 +0000	[thread overview]
Message-ID: <Yj4ntUejxaPhrM5b@google.com> (raw)
In-Reply-To: <Yj4FFIXi//ivQC3X@google.com>

On 03/25/2022, William McVicker wrote:
> On 03/25/2022, Johannes Berg wrote:
> > On Fri, 2022-03-25 at 09:49 -0700, Jakub Kicinski wrote:
> > > On Fri, 25 Mar 2022 13:04:23 +0100 Johannes Berg wrote:
> > > > So we can avoid the potential deadlock in cfg80211 in a few ways:
> > > > 
> > > >  1) export rtnl_lock_unregistering_all() or maybe a variant after
> > > >     refactoring the two versions, to allow cfg80211 to use it, that way
> > > >     netdev_run_todo() can never have a non-empty todo list
> > > > 
> > > >  2) export __rtnl_unlock() so cfg80211 can avoid running
> > > >     netdev_run_todo() in the unlock, personally I like this less because
> > > >     it might encourage random drivers to use it
> > > > 
> > > >  3) completely rework cfg80211's locking, adding a separate mutex for
> > > >     the wiphy list so we don't need to acquire the RTNL at all here
> > > >     (unless the ops need it, but there's no issue if we don't drop it),
> > > >     something like https://p.sipsolutions.net/27d08e1f5881a793.txt
> > > > 
> > > > 
> > > > I think I'm happy with 3) now (even if it took a couple of hours), so I
> > > > think we can go with it, just need to go through all the possibilities.
> > > 
> > > I like 3) as well. FWIW a few places (e.g. mlx5, devlink, I think I've
> > > seen more) had been converting to xarray for managing the "registered"
> > > objects. It may be worth looking into if you're re-doing things, anyway.
> > > 
> > 
> > That's not a bad idea, but I think I wouldn't want to backport that, so
> > separately :) I don't think that fundamentally changes the locking
> > properties though.
> > 
> > 
> > Couple of more questions I guess: First, are we assuming that the
> > cfg80211 code *is* actually broken, even if it looks like nothing can
> > cause the situation, due to the empty todo list?
> 
> I'm able to reproduce this issue pretty easily with a Pixel 6 when I add
> support to allow vendor commands to request for the RTNL. For this case, I just
> delay unlocking the RTNL until nl80211_vendor_cmds() at which point I check the
> flags to see if I should unlock before calling doit(). That allows me to run my
> tests again and hit this issue. I imagine that I could hit this issue without
> any changes if I re-work my vendor ops to not need the RTNL.
> 
> > 
> > Given that we have rtnl_lock_unregistering() (and also
> > rtnl_lock_unregistering_all()), it looks like we *do* in fact at least
> > not want to make an assumption that no user of __rtnl_unlock() can have
> > added a todo item.
> > 
> > I mean, there's technically yet *another* thing we could do - something
> > like this:
> > 
> > [this doesn't compile, need to suitably make net_todo_list non-static]
> > --- a/net/core/rtnetlink.c
> > +++ b/net/core/rtnetlink.c
> > @@ -95,6 +95,7 @@ void __rtnl_unlock(void)
> >  
> >         defer_kfree_skb_list = NULL;
> >  
> > +       WARN_ON(!list_empty(&net_todo_list));
> >         mutex_unlock(&rtnl_mutex);
> >  
> >         while (head) {
> > 
> > and actually that would allow us to get rid of rtnl_lock_unregistering()
> > and rtnl_lock_unregistering_all() simply because we'd actually guarantee
> > the invariant that when the RTNL is freshly locked, the list is empty
> > (by guaranteeing that it's always empty when it's unlocked, since it can
> > only be added to under RTNL).
> > 
> > With some suitable commentary, that might also be a reasonable thing?
> > __rtnl_unlock() is actually rather pretty rare, and not exported.
> > 
> > 
> > However, if you don't like that ...
> > 
> > I've been testing with this patch, to make lockdep complain:
> > 
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -9933,6 +9933,11 @@ void netdev_run_todo(void)
> >         if (!list_empty(&list))
> >                 rcu_barrier();
> >  
> > +#ifdef CONFIG_LOCKDEP
> > +       rtnl_lock();
> > +       __rtnl_unlock();
> > +#endif
> > +
> >         while (!list_empty(&list)) {
> >                 struct net_device *dev
> >                         = list_first_entry(&list, struct net_device, todo_list);
> > 
> > 
> > That causes lockdep to complain for cfg80211 even if the list *is* in
> > fact empty.
> > 
> > Would you be open to adding something like that? Perhaps if I don't just
> > do the easy rtnl_lock/unlock, but try to find the corresponding lockdep-
> > only things to do there, to cause lockdep to do things without really
> > locking? OTOH, the locking overhead of the RTNL we just unlocked is
> > probably minimal, vs. the actual work *lockdep* is doing to track all
> > this ...
> > 
> > Thanks,
> > johannes
> 
> Let me know if you'd like me to test any patches out.
> 
> Thanks,
> Will

Hi Johannes,

I found that my wlan driver is using the vendor commands to create/delete NAN
interfaces for this Android feature called Wi-Fi aware [1]. Basically, this
features allows users to discover other nearby devices and allows them to
connect directly with one another over a local network. To get my driver
working again, I first had to allow the kernel to let my driver request for the
RTNL lock for these NAN create/delete interface vendor commands. With that
I got the following code path:


Thread 1                         Thread 2
 nl80211_pre_doit():
   rtnl_lock()
   wiphy_lock()                   nl80211_pre_doit():
                                    rtnl_lock() // blocked by Thread 1
 nl80211_vendor_cmd():
   doit()
     cfg80211_unregister_netdevice()
   rtnl_unlock():
     netdev_run_todo():
       __rtnl_unlock()
                                    <got RTNL lock>
                                    wiphy_lock() // blocked by Thread 1
       rtnl_lock(); // DEADLOCK
 nl80211_post_doit():
   wiphy_unlock();


Since I'm unlocking the RTNL inside nl80211_vendor_cmd() after calling doit()
instead of waiting till post_doit(), I get into the situation you mentioned
where the net_todo_list is not empty when calling rtnl_unlock. So I decided to
drop the rtnl_unlock() in nl80211_vendor_cmd() and defer that until
nl80211_post_doit() after calling wiphy_unlock(). With this change, I haven't
been able to reproduce the deadlock. So it's possible that we aren't actually
able to hit this deadlock in nl80211_pre_doit() with the existing code since,
as you mentioned, one wouldn't be able to call unregister_netdevice() without
having the RTNL lock.

Sorry if I sent you down a rabbit hole with the first code path scenario.

Thanks,
Will

[1] https://developer.android.com/guide/topics/connectivity/wifi-aware

  parent reply	other threads:[~2022-03-25 20:36 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <0000000000009e9b7105da6d1779@google.com>
     [not found] ` <CABYd82Z=YXmZPTQhf0K1M4nS2wk3dPBSqx91D8SoUd59AUzpHg@mail.gmail.com>
2022-03-21 17:00   ` [BUG] deadlock in nl80211_vendor_cmd William McVicker
     [not found] ` <99eda6d1dad3ff49435b74e539488091642b10a8.camel@sipsolutions.net>
     [not found]   ` <5d5cf050-7de0-7bad-2407-276970222635@quicinc.com>
     [not found]     ` <YjpGlRvcg72zNo8s@google.com>
     [not found]       ` <dc556455-51a2-06e8-8ec5-b807c2901b7e@quicinc.com>
2022-03-24 21:58         ` William McVicker
2022-03-25 12:04           ` Johannes Berg
2022-03-25 12:06             ` Johannes Berg
2022-03-25 16:49             ` Jakub Kicinski
2022-03-25 17:01               ` Johannes Berg
2022-03-25 18:08                 ` William McVicker
2022-03-25 20:21                   ` Johannes Berg
2022-03-25 20:36                   ` William McVicker [this message]
2022-03-25 21:16                     ` Johannes Berg
2022-03-25 21:54                       ` Johannes Berg
2022-03-25 22:18                       ` Jeff Johnson
2022-03-25 23:57                       ` William McVicker
2022-03-26  0:07                         ` Jakub Kicinski
2022-03-26  0:12                           ` William McVicker
2022-03-25 20:40                 ` Jakub Kicinski
2022-03-25 21:25                   ` Johannes Berg
2022-03-25 21:48                     ` Jakub Kicinski
2022-03-25 21:50                       ` Johannes Berg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yj4ntUejxaPhrM5b@google.com \
    --to=willmcvicker@google.com \
    --cc=amitkarwar@gmail.com \
    --cc=davem@davemloft.net \
    --cc=ganapathi.bhat@nxp.com \
    --cc=huxinming820@gmail.com \
    --cc=johannes@sipsolutions.net \
    --cc=kernel-team@android.com \
    --cc=kuba@kernel.org \
    --cc=kvalo@codeaurora.org \
    --cc=linux-wireless@vger.kernel.org \
    --cc=m.szyprowski@samsung.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).