linux-wireless.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler
@ 2018-01-10 12:30 Xinming Hu
  2018-01-12  2:25 ` Brian Norris
  0 siblings, 1 reply; 4+ messages in thread
From: Xinming Hu @ 2018-01-10 12:30 UTC (permalink / raw)
  To: Brian Norris, Kalle Valo
  Cc: Kalle Valo, Linux Wireless, Dmitry Torokhov, rajatja@google.com,
	Zhiyuan Yang, Tim Song, Cathy Luo, James Cao, Ganapathi Bhat,
	Ellie Reeves, Christoph Hellwig

Hi Brian,

> -----Original Message-----
> From: Brian Norris [mailto:briannorris@chromium.org]
> Sent: 2018年1月10日 6:01
> To: Xinming Hu <huxm@marvell.com>; Kalle Valo <kvalo@codeaurora.org>
> Cc: Kalle Valo <kvalo@codeaurora.org>; Linux Wireless
> <linux-wireless@vger.kernel.org>; Dmitry Torokhov <dtor@google.com>;
> rajatja@google.com; Zhiyuan Yang <yangzy@marvell.com>; Tim Song
> <songtao@marvell.com>; Cathy Luo <cluo@marvell.com>; James Cao
> <jcao@marvell.com>; Ganapathi Bhat <gbhat@marvell.com>; Ellie Reeves
> <ellierevves@gmail.com>; Christoph Hellwig <hch@lst.de>
> Subject: [EXT] Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in
> remove/shutdown handler
> 
> External Email
> 
> ----------------------------------------------------------------------
> + Christopher
> 
> Hi Simon and Kalle,
> 
> On Tue, Jan 09, 2018 at 11:42:21AM +0000, Xinming Hu wrote:
> > Hi,
> >
> > > -----Original Message-----
> > > From: Kalle Valo [mailto:kvalo@codeaurora.org]
> > > Sent: 2018年1月9日 15:40
> > > To: Brian Norris <briannorris@chromium.org>
> > > Cc: Xinming Hu <huxm@marvell.com>; Linux Wireless
> > > <linux-wireless@vger.kernel.org>; Dmitry Torokhov <dtor@google.com>;
> > > rajatja@google.com; Zhiyuan Yang <yangzy@marvell.com>; Tim Song
> > > <songtao@marvell.com>; Cathy Luo <cluo@marvell.com>; James Cao
> > > <jcao@marvell.com>; Ganapathi Bhat <gbhat@marvell.com>; Ellie Reeves
> > > <ellierevves@gmail.com>
> > > Subject: [EXT] Re: [PATCH] mwifiex: cancel pcie/sdio work in
> > > remove/shutdown handler
> > >
> > > External Email
> > >
> > > --------------------------------------------------------------------
> > > -- Brian Norris <briannorris@chromium.org> writes:
> > >
> > > >> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> > > >> @@ -310,6 +310,8 @@ static void mwifiex_pcie_remove(struct
> > > >> pci_dev
> > > *pdev)
> > > >>  		mwifiex_init_shutdown_fw(priv, MWIFIEX_FUNC_SHUTDOWN);
> > > >>  	}
> > > >>
> > > >> +	cancel_work_sync(&card->work);
> > > >> +
> > > >
> > > > Just FYI, this "fix" is not a real fix. It will likely paper over
> > > > some of your bugs (where, e.g., the FW shutdown command times out
> > > > in the previous couple of lines), but this highlights the fact
> > > > that there are other races that could trigger the same behavior.
> > > > You're not fixing those.
> > > >
> > > > For example, what if somebody initiates a scan or other nl80211
> > > > command between the above line and mwifiex_remove_card()? That
> > > command
> > > > could potentially time out too.
> > > >
> >
> > The hardware status have been reset before downloading the last
> > command(FUNC SHUTDOWN), in this way, follow commands  download will
> be
> > ignored and warned.
> 
> Hmm, I suppose that's true. So the race I'm talking about probably can't
> happen usually. What about in manufacturing mode
> or !FIRMWARE_READY_PCIE though? Those cases don't shut down the
> firmware. Can we still have outstanding timeouts in those cases?
> 
> Anyway, I still think there's a problem though, and this patch is just going to
> make things worse. See below.
> 
> > > > The proper fix would be to institute some kind of mutual exclusion
> > > > (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(),
> > > > so that they can't occur at the same time.
> > > >
> >
> > I am not sure whether there is any mutual exclusion protect between
> pcie_reset and pcie_remove in pcie core.
> > But it looks a different race.
> > We still need this fix, right?
> 
> Good point. Previously, there wasn't any such exclusion, and that's why races
> like the above were even more likely. But as of 4.13, now there
> *is* exclusion. See commit b014e96d1abb ("PCI: Protect
> pci_error_handlers->reset_notify() usage with device_lock()"). That incidentally
> means that you're creating a deadlock with this patch! [1]
> 
> If we start a timeout/reset sequence in mwifiex_init_shutdown_fw() (called
> from remove()), then you'll eventually have pci_reset_function() waiting on the
> device lock, but mwifiex_pcie_remove() will be holding the device lock already,
> and now (with your patch), remove() will be waiting on the worker thread to
> finish pci_reset_function()...deadlock!
> 
> I actually think that the above patch (adding device_lock()) resolves most of the
> race but introduces a possible deadlock. I believe an easy solution is just to
> switch to pci_try_reset_function() instead? That will just abort a reset attempt
> if we're in the middle of removing the device. Problem solved? Diff appended,
> but I'll send out a real version if that looks right. Can you test your original
> problem with the above commit from Christopher, as well as the appended
> diff?
> 

Since I don't have the original customer platform, which is 4.1 based.
Just run the stress test on my 4.14, with commands:
while true; do rmmod mwifiex;insmod $mwd/mwifiex.ko; sleep 1; insmod $mwd/mwifiex_pcie.ko; sleep 1; rmmod mwifiex_pcie & echo "1" > /sys/kernel/debug/mwifiex/mlan0/reset;done &

Till now, I didn't hit deadlock with/without below diff, though it looks finally will be reached, according to the above explanation.  
I guess, maybe rmmod operation is slowly then sysfs operation.

> > Regards,
> > Simon
> > > > Unfortunately, I only paid attention to this after Kalle already
> > > > applied this patch. Personally, I'd prefer this patch not get
> > > > applied, since it's a bad solution to an obvious problem, which
> > > > instead leaves a subtle problem that perhaps no one will bother fixing.
> > >
> > > I can revert it, that's not a problem. Can I use the text below as
> > > explanation for the revert?
> > >
> > > --------------------------------------------------------------------
> > > -- Brian Norris <briannorris@chromium.org> says:
> > >
> > > Just FYI, this "fix" is not a real fix. It will likely paper over
> > > some of your bugs (where, e.g., the FW shutdown command times out in
> > > the previous couple of lines), but this highlights the fact that
> > > there are other races that could trigger the same behavior. You're not fixing
> those.
> > >
> > > For example, what if somebody initiates a scan or other nl80211
> > > command between the above line and mwifiex_remove_card()? That
> > > command could potentially time out too.
> > >
> > > The proper fix would be to institute some kind of mutual exclusion
> > > (locking) between mwifiex_shutdown_sw() and mwifiex_remove_card(),
> > > so that they can't occur at the same time.
> > >
> > > --------------------------------------------------------------------
> > > --
> 
> Hi Kalle, thanks for the above. With the new information from above, I think
> that there's a more accurate description, like the following:
> 
> ---
> Brian Norris <briannorris@chromium.org> says:
> 
> The "fix" in question might not actually fix all related problems, and it also looks
> like it can cause a deadlock. Since commit b014e96d1abb ("PCI: Protect
> pci_error_handlers->reset_notify() usage with device_lock()"), the race in
> question is actually resolved (PCIe reset cannot happen at the same time as
> remove()). Instead, this "fix"
> just introduces a deadlock where mwifiex_pcie_card_reset_work() is waiting on
> device_lock, which is held by PCIe device remove(), which is waiting
> on...mwifiex_pcie_card_reset_work().
> 
> The proper thing to do is just to fix the deadlock. Patch for this will come
> separately.
> ---
> 
> Brian
> 
> [1] Technically, the deadlock is already there, since
> mwifiex_remove_card() eventually calls pcie.c's ->cleanup_if(), which also calls
> cancel_work_sync(). But your patch doesn't help...
> 
> Untested diff:
> 
> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c
> b/drivers/net/wireless/marvell/mwifiex/pcie.c
> index cd314946452c..5da3d6ccf5f2 100644
> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> @@ -2781,7 +2781,7 @@ static void mwifiex_pcie_card_reset_work(struct
> mwifiex_adapter *adapter)  {
>  	struct pcie_service_card *card = adapter->card;
> 
> -	pci_reset_function(card->dev);
> +	pci_try_reset_function(card->dev);
>  }
> 
>  static void mwifiex_pcie_work(struct work_struct *work)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler
  2018-01-10 12:30 Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler Xinming Hu
@ 2018-01-12  2:25 ` Brian Norris
  2018-01-12 19:06   ` Brian Norris
  0 siblings, 1 reply; 4+ messages in thread
From: Brian Norris @ 2018-01-12  2:25 UTC (permalink / raw)
  To: Xinming Hu
  Cc: Kalle Valo, Linux Wireless, Dmitry Torokhov, rajatja@google.com,
	Zhiyuan Yang, Tim Song, Cathy Luo, James Cao, Ganapathi Bhat,
	Ellie Reeves, Christoph Hellwig

Hi Simon,

On Wed, Jan 10, 2018 at 12:30:35PM +0000, Xinming Hu wrote:
> > -----Original Message-----
> > From: Brian Norris [mailto:briannorris@chromium.org]
> > On Tue, Jan 09, 2018 at 11:42:21AM +0000, Xinming Hu wrote:
> > > I am not sure whether there is any mutual exclusion protect between
> > pcie_reset and pcie_remove in pcie core.
> > > But it looks a different race.
> > > We still need this fix, right?
> > 
> > Good point. Previously, there wasn't any such exclusion, and that's why races
> > like the above were even more likely. But as of 4.13, now there
> > *is* exclusion. See commit b014e96d1abb ("PCI: Protect
> > pci_error_handlers->reset_notify() usage with device_lock()"). That incidentally
> > means that you're creating a deadlock with this patch! [1]
> > 
> > If we start a timeout/reset sequence in mwifiex_init_shutdown_fw() (called
> > from remove()), then you'll eventually have pci_reset_function() waiting on the
> > device lock, but mwifiex_pcie_remove() will be holding the device lock already,
> > and now (with your patch), remove() will be waiting on the worker thread to
> > finish pci_reset_function()...deadlock!
> > 
> > I actually think that the above patch (adding device_lock()) resolves most of the
> > race but introduces a possible deadlock. I believe an easy solution is just to
> > switch to pci_try_reset_function() instead? That will just abort a reset attempt
> > if we're in the middle of removing the device. Problem solved? Diff appended,
> > but I'll send out a real version if that looks right. Can you test your original
> > problem with the above commit from Christopher, as well as the appended
> > diff?
> > 
> 
> Since I don't have the original customer platform, which is 4.1 based.

You could suggest your customer try the aforementioned commit + my
patch. It will likely get them a more reliable result in the end.

> Just run the stress test on my 4.14, with commands:
> while true; do rmmod mwifiex;insmod $mwd/mwifiex.ko; sleep 1; insmod $mwd/mwifiex_pcie.ko; sleep 1; rmmod mwifiex_pcie & echo "1" > /sys/kernel/debug/mwifiex/mlan0/reset;done &

That's actually not much of a good test, since the mutual exclusion of
reset and remove will mean that 'rmmod' and 'echo 1 > ... /reset' won't
really be doing anything significant in parallel.

What you'd really need to do is fake a timeout from within remove().
e.g., you could do

	adapter->if_ops.card_reset(adapter);

plus some sleep (to let the work event get started) followed by the
cancel_work_sync() of your patch.

> 
> Till now, I didn't hit deadlock with/without below diff, though it
> looks finally will be reached, according to the above explanation.  I
> guess, maybe rmmod operation is slowly then sysfs operation.

Per my above explanation, you won't hit the deadlock like that.

Anyway, I'll do my own testing and then submit my patch properly.

Brian

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler
  2018-01-12  2:25 ` Brian Norris
@ 2018-01-12 19:06   ` Brian Norris
  2018-01-13  9:54     ` Kalle Valo
  0 siblings, 1 reply; 4+ messages in thread
From: Brian Norris @ 2018-01-12 19:06 UTC (permalink / raw)
  To: Xinming Hu
  Cc: Kalle Valo, Linux Wireless, Dmitry Torokhov, rajatja@google.com,
	Zhiyuan Yang, Tim Song, Cathy Luo, James Cao, Ganapathi Bhat,
	Ellie Reeves, Christoph Hellwig

On Thu, Jan 11, 2018 at 06:25:09PM -0800, Brian Norris wrote:
> Anyway, I'll do my own testing and then submit my patch properly.

OK, so I definitely confirmed: if your patch does anything, it
introduces a new deadlock possibility. Just trigger a Wifi timeout or
reset from within remove(), and you'll see the work event get stuck in
pci_reset_function(), while remove() gets stuck at cancel_work_sync().

I also confirmed that my patch resolves this problem.

I'll send the revert + my patch now.

Brian

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler
  2018-01-12 19:06   ` Brian Norris
@ 2018-01-13  9:54     ` Kalle Valo
  0 siblings, 0 replies; 4+ messages in thread
From: Kalle Valo @ 2018-01-13  9:54 UTC (permalink / raw)
  To: Brian Norris
  Cc: Xinming Hu, Linux Wireless, Dmitry Torokhov, rajatja@google.com,
	Zhiyuan Yang, Tim Song, Cathy Luo, James Cao, Ganapathi Bhat,
	Ellie Reeves, Christoph Hellwig

Brian Norris <briannorris@chromium.org> writes:

> On Thu, Jan 11, 2018 at 06:25:09PM -0800, Brian Norris wrote:
>> Anyway, I'll do my own testing and then submit my patch properly.
>
> OK, so I definitely confirmed: if your patch does anything, it
> introduces a new deadlock possibility. Just trigger a Wifi timeout or
> reset from within remove(), and you'll see the work event get stuck in
> pci_reset_function(), while remove() gets stuck at cancel_work_sync().
>
> I also confirmed that my patch resolves this problem.
>
> I'll send the revert + my patch now.

Great, thanks. I didn't had a chance to do the revert yet but I'll now
apply your revert instead.

-- 
Kalle Valo

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-01-13  9:54 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-01-10 12:30 Re: Re: [PATCH] mwifiex: cancel pcie/sdio work in remove/shutdown handler Xinming Hu
2018-01-12  2:25 ` Brian Norris
2018-01-12 19:06   ` Brian Norris
2018-01-13  9:54     ` Kalle Valo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).