On Sun, May 26, 2019 at 08:42:21PM +0200, Arend Van Spriel wrote: > On 5/26/2019 2:21 PM, Brian Masney wrote: > > + Broadcom wireless maintainers > > > > On Fri, May 24, 2019 at 11:49:58AM -0400, Brian Masney wrote: > > > On Fri, May 24, 2019 at 03:17:13PM +0300, Adrian Hunter wrote: > > > > On 24/05/19 2:10 PM, Brian Masney wrote: > > > > > WiFi stopped working on the LG Nexus 5 phone and the issue was bisected > > > > > to the commit c07a48c26519 ("mmc: sdhci: Remove finish_tasklet") that > > > > > moved from using a tasklet to a work queue. That patch also changed > > > > > sdhci_irq() to return IRQ_WAKE_THREAD instead of finishing the work when > > > > > sdhci_defer_done() is true. Change it to queue work to the complete work > > > > > queue if sdhci_defer_done() is true so that the functionality is > > > > > equilivent to what was there when the finish_tasklet was present. This > > > > > corrects the WiFi breakage on the Nexus 5 phone. > > > > > > > > > > Signed-off-by: Brian Masney > > > > > Fixes: c07a48c26519 ("mmc: sdhci: Remove finish_tasklet") > > > > > --- > > > > > [ ... ] > > > > > > > > > > drivers/mmc/host/sdhci.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c > > > > > index 97158344b862..3563c3bc57c9 100644 > > > > > --- a/drivers/mmc/host/sdhci.c > > > > > +++ b/drivers/mmc/host/sdhci.c > > > > > @@ -3115,7 +3115,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id) > > > > > continue; > > > > > if (sdhci_defer_done(host, mrq)) { > > > > > - result = IRQ_WAKE_THREAD; > > > > > + queue_work(host->complete_wq, &host->complete_work); > > > > > > > > The IRQ thread has a lot less latency than the work queue, which is why it > > > > is done that way. > > > > > > > > I am not sure why you say this change is equivalent to what was there > > > > before, nor why it fixes your problem. > > > > > > > > Can you explain some more? > > > > > > [ ... ] > > > > > > drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c calls > > > sdio_claim_host() and it appears to never return. > > > > When the brcmfmac driver is loaded, the firmware is requested from disk, > > and that's when the deadlock occurs in 5.2rc1. Specifically: > > > > 1) brcmf_sdio_download_firmware() in > > drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c calls > > sdio_claim_host() > > > > 2) brcmf_sdio_firmware_callback() is called and brcmf_sdiod_ramrw() > > tries to claim the host, but has to wait since its already claimed > > in #1 and the deadlock occurs. > > This does not make any sense to me. brcmf_sdio_download_firmware() is called > from brcmf_sdio_firmware_callback() so they are in the same context. So #2 > is not waiting for #1, but something else I would say. Also #2 calls > sdio_claim_host() after brcmf_sdio_download_firmware has completed so > definitely not waiting for #1. I attached a patch that shows how I was able to determine what had already claimed the host. It's messy; please don't judge me negatively for this. :) Anyways, sdio_claim_host() is mostly a wrapper for __mmc_claim_host() and there is a mmc_ctx structure that contains a task struct. This context can be NULL. I added a description field to the context structure and put the function name that claimed the host in there. The mmc_host structure already contained a 'claimer' member, so that made it easy. I see the following messages in dmesg that shows what has already claimed the host when loading the brcmfmac module in 5.2rc1: cfg80211: Loading compiled-in X.509 certificates for regulatory database cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7' platform regulatory.0: Direct firmware load for regulatory.db failed with error -2 cfg80211: failed to load regulatory.db brcmfmac: brcmf_fw_alloc_request: using brcm/brcmfmac4339-sdio for chip BCM4339/2 brcmfmac mmc1:0001:1: Direct firmware load for brcm/brcmfmac4339-sdio.lge,hammerhead.txt failed with error -2 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 5 at drivers/mmc/core/core.c:819 __mmc_claim_host+0x28c/0x2c0 Modules linked in: brcmfmac brcmutil cfg80211 dm_mod CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.2.0-rc1-00175-g9899510d2cd1-dirty #420 Hardware name: Generic DT based system Workqueue: events request_firmware_work_func [] (unwind_backtrace) from [] (show_stack+0x10/0x14) [] (show_stack) from [] (dump_stack+0x78/0x8c) [] (dump_stack) from [] (__warn.part.3+0xb8/0xd4) [] (__warn.part.3) from [] (warn_slowpath_null+0x44/0x4c) [] (warn_slowpath_null) from [] (__mmc_claim_host+0x28c/0x2c0) [] (__mmc_claim_host) from [] (brcmf_sdiod_ramrw+0x9c/0x200 [brcmfmac]) [] (brcmf_sdiod_ramrw [brcmfmac]) from [] (brcmf_sdio_firmware_callback+0xe8/0x7b4 [brcmfmac]) [] (brcmf_sdio_firmware_callback [brcmfmac]) from [] (brcmf_fw_request_done+0xf0/0x110 [brcmfmac]) [] (brcmf_fw_request_done [brcmfmac]) from [] (request_firmware_work_func+0x4c/0x88) [] (request_firmware_work_func) from [] (process_one_work+0x1fc/0x564) [] (process_one_work) from [] (worker_thread+0x44/0x584) [] (worker_thread) from [] (kthread+0x148/0x150) [] (kthread) from [] (ret_from_fork+0x14/0x2c) Exception stack(0xee8bdfb0 to 0xee8bdff8) dfa0: 00000000 00000000 00000000 00000000 dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 ---[ end trace 4ab1b01efc876120 ]--- mmc_host mmc1: __mmc_claim_host: FIXME - before schedule() - descr=brcmf_sdiod_ramrw, claimer=brcmf_sdio_download_firmware The 'after schedule()' line is not shown and WiFi doesn't work. > > I tried to release the host before the firmware is requested, however > > parts of brcmf_chip_set_active() needs the host to be claimed, and a > > similar deadlock occurs in brcmf_sdiod_ramrw() if I claim the host > > before calling brcmf_chip_set_active(). > > > > I started to look at moving the sdio_{claim,release}_host() calls out of > > brcmf_sdiod_ramrw() but there's a fair number of callers, so I'd like to > > get feedback about the best course of action here. > > Long ago Franky reworked the sdio critical sections requiring sdio > claim/release and I am pretty sure they are correct. > > Could you try with lockdep kernel and see if that brings any more > information. In the mean time I will update my dev branch to 5.2-rc1 and see > if I can find any clues. My .config has CONFIG_LOCKDEP_SUPPORT enabled. I haven't used lockdep but my understanding is that it should print something in dmesg if a deadlock occurs. I assume it won't pick up cases like this where schedule() is called. Brian