* MMC locking: mmc_request accesses from dw_mmc driver ok?
@ 2013-08-12 23:45 Grant Grundler
  2013-08-21 20:18 ` Grant Grundler
  0 siblings, 1 reply; 4+ messages in thread

From: Grant Grundler @ 2013-08-12 23:45 UTC (permalink / raw)
To: linux-mmc@vger.kernel.org

I've been working on a "task mmcqd/0:84 blocked for more than 120
seconds" panic for the past month or so in the chromeos-3.4 kernel
tree. Stack trace below. Feel free to tell me "fixed in v3.x". :)

After staring at the 14 MMC and DW driver data structures, I now think
the dw_mmc driver is accessing MMC generic data structures
(mmc_request and mmc_queue_req) without grabbing either
mmc_blk_data->lock or mmc_queue->thread_sem, and it needs to. I don't
have a specific stack trace yet where the dw_mmc driver is accessing
MMC generic data without protection. This is where I need some
guidance.

I am confident the dw_mmc driver always acquires dw_mci->lock when
accessing data in the dw_mci structure(s). I don't see any locking
around access into the struct mmc_request via dw_mci_slot[]->mrq,
though - not sure where that belongs.

Two questions:
1) Is there interest in adding "assert_spin_locked()" calls to
document locking dependencies?
2) Does anyone understand this code well enough to confirm I'm on the
right track and which code path I should be looking at?

Back to the bug: mmc_start_req() is sleeping, waiting for the
"previous" (in-flight) "async" IO to complete. (1) This IO never
completes (unlikely), OR (2) it already did (e.g. mmc_host->areq is
stale), OR (3) mmc_host->areq is non-zero garbage. I'll add some code
to confirm (3) is not the case.

I have confirmed that the stress test I'm running (many async IOs in
flight with two antagonist processes that burn CPU cycles) gets about
4 completions per second that are "done" before we call
mmc_start_req(). So I know the race in (2) could happen.

thanks,
grant


INFO: task mmcqd/0:84 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mmcqd/0         D 804f4890     0    84      2 0x00000000
[<804f4890>] (__schedule+0x614/0x780) from [<804f4dc0>] (schedule+0x94/0x98)
[<804f4dc0>] (schedule+0x94/0x98) from [<804f2ae0>] (schedule_timeout+0x38/0x2d0)
[<804f2ae0>] (schedule_timeout+0x38/0x2d0) from [<804f4c14>] (wait_for_common+0x164/0x1a0)
[<804f4c14>] (wait_for_common+0x164/0x1a0) from [<804f4d28>] (wait_for_completion+0x20/0x24)
[<804f4d28>] (wait_for_completion+0x20/0x24) from [<803a39f8>] (mmc_wait_for_req_done+0x2c/0x84)
[<803a39f8>] (mmc_wait_for_req_done+0x2c/0x84) from [<803a4b50>] (mmc_start_req+0x60/0x120)
[<803a4b50>] (mmc_start_req+0x60/0x120) from [<803b09bc>] (mmc_blk_issue_rw_rq+0xa0/0x3a8)
[<803b09bc>] (mmc_blk_issue_rw_rq+0xa0/0x3a8) from [<803b10e8>] (mmc_blk_issue_rq+0x424/0x478)
[<803b10e8>] (mmc_blk_issue_rq+0x424/0x478) from [<803b220c>] (mmc_queue_thread+0xb0/0x118)
[<803b220c>] (mmc_queue_thread+0xb0/0x118) from [<8004d620>] (kthread+0xa8/0xbc)
[<8004d620>] (kthread+0xa8/0xbc) from [<8000f1c8>] (kernel_thread_exit+0x0/0x8)
Kernel panic - not syncing: hung_task: blocked tasks
[<800150a4>] (unwind_backtrace+0x0/0x114) from [<804ee160>] (dump_stack+0x20/0x24)
[<804ee160>] (dump_stack+0x20/0x24) from [<804ee2d0>] (panic+0xa8/0x1f4)
[<804ee2d0>] (panic+0xa8/0x1f4) from [<80086d44>] (watchdog+0x1f4/0x25c)
[<80086d44>] (watchdog+0x1f4/0x25c) from [<8004d620>] (kthread+0xa8/0xbc)
[<8004d620>] (kthread+0xa8/0xbc) from [<8000f1c8>] (kernel_thread_exit+0x0/0x8)

^ permalink raw reply	[flat|nested] 4+ messages in thread
* Re: MMC locking: mmc_request accesses from dw_mmc driver ok?
  2013-08-12 23:45 MMC locking: mmc_request accesses from dw_mmc driver ok? Grant Grundler
@ 2013-08-21 20:18 ` Grant Grundler
  2013-08-23 14:06   ` Jaehoon Chung
  0 siblings, 1 reply; 4+ messages in thread

From: Grant Grundler @ 2013-08-21 20:18 UTC (permalink / raw)
To: Chris Ball, Jaehoon Chung, Doug Anderson; +Cc: linux-mmc@vger.kernel.org

On Mon, Aug 12, 2013 at 4:45 PM, Grant Grundler <grundler@chromium.org> wrote:
> I've been working on a "task mmcqd/0:84 blocked for more than 120
> seconds" panic for the past month or so in the chromeos-3.4 kernel
> tree. Stack trace below. Feel free to tell me "fixed in v3.x". :)

I've added this change:

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 1cf4900..a127ce1 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -135,6 +135,9 @@ void mmc_request_done(struct mmc_host *host, struct mmc_request *mrq)
 	struct mmc_command *cmd = mrq->cmd;
 	int err = cmd->error;

+WARN_ON(!host->claimed);
+WARN_ON(host->claimer != current);
+
 	if (err && cmd->retries && mmc_host_is_spi(host)) {
 		if (cmd->resp[0] & R1_SPI_ILLEGAL_COMMAND)
 			cmd->retries = 0;

and the "WARN_ON(host->claimer != current)" is triggering with this
stack trace (as I suspected):

WARNING: at /mnt/host/source/src/third_party/kernel/files/drivers/mmc/core/core.c:139 mmc_request_done+0x6c/0xf0()
Modules linked in: i2c_dev uinput asix usbnet nf_conntrack_ipv6
nf_defrag_ipv6 uvcvideo videobuf2_vmalloc sbs_battery mwifiex_sdio
mwifiex cfg80211 btmrvl_sdio btmrvl rtc_s3c bluetooth zram(C)
zsmalloc(C) fuse ip6table_filter xt_mark ip6_tables joydev
[<800150a4>] (unwind_backtrace+0x0/0x114) from [<804ee9a0>] (dump_stack+0x20/0x24)
[<804ee9a0>] (dump_stack+0x20/0x24) from [<8002bff8>] (warn_slowpath_null+0x44/0x5c)
[<8002bff8>] (warn_slowpath_null+0x44/0x5c) from [<803a3e74>] (mmc_request_done+0x6c/0xf0)
[<803a3e74>] (mmc_request_done+0x6c/0xf0) from [<803b9c4c>] (dw_mci_request_end+0xc4/0xec)
[<803b9c4c>] (dw_mci_request_end+0xc4/0xec) from [<803ba274>] (dw_mci_tasklet_func+0x354/0x3a8)
[<803ba274>] (dw_mci_tasklet_func+0x354/0x3a8) from [<80034044>] (tasklet_action+0xac/0x12c)
[<80034044>] (tasklet_action+0xac/0x12c) from [<800334f8>] (__do_softirq+0xc4/0x208)
[<800334f8>] (__do_softirq+0xc4/0x208) from [<80033a54>] (irq_exit+0x54/0x94)
[<80033a54>] (irq_exit+0x54/0x94) from [<8000ef68>] (handle_IRQ+0x8c/0xc8)
[<8000ef68>] (handle_IRQ+0x8c/0xc8) from [<800085ec>] (gic_handle_irq+0x4c/0x70)
[<800085ec>] (gic_handle_irq+0x4c/0x70) from [<8000e200>] (__irq_svc+0x40/0x60)

Is this expected behavior?

It feels wrong to me since it means the dw_mmc tasklet and whatever
thread starts the IO can both access the mmc_request data structure.
Do I have this right?

If I have this right, any "obvious" fixes? e.g. add locking to
mmc_start_request() and mmc_request_done()?

thanks,
grant

> After staring at the 14 MMC and DW driver data structures, I now
> think the dw_mmc driver is accessing MMC generic data structures
> (mmc_request and mmc_queue_req) without grabbing either
> mmc_blk_data->lock or mmc_queue->thread_sem, and it needs to. I don't
> have a specific stack trace yet where the dw_mmc driver is accessing
> MMC generic data without protection. This is where I need some
> guidance.
>
> I am confident the dw_mmc driver always acquires dw_mci->lock when
> accessing data in the dw_mci structure(s). I don't see any locking
> around access into the struct mmc_request via dw_mci_slot[]->mrq,
> though - not sure where that belongs.
>
> Two questions:
> 1) Is there interest in adding "assert_spin_locked()" calls to
> document locking dependencies?
> 2) Does anyone understand this code well enough to confirm I'm on the
> right track and which code path I should be looking at?
>
> Back to the bug: mmc_start_req() is sleeping, waiting for the
> "previous" (in-flight) "async" IO to complete. (1) This IO never
> completes (unlikely), OR (2) it already did (e.g. mmc_host->areq is
> stale), OR (3) mmc_host->areq is non-zero garbage. I'll add some code
> to confirm (3) is not the case.
>
> I have confirmed that the stress test I'm running (many async IOs in
> flight with two antagonist processes that burn CPU cycles) gets about
> 4 completions per second that are "done" before we call
> mmc_start_req(). So I know the race in (2) could happen.
>
> thanks,
> grant
>
> INFO: task mmcqd/0:84 blocked for more than 120 seconds.
> [...]

^ permalink raw reply related	[flat|nested] 4+ messages in thread
* Re: MMC locking: mmc_request accesses from dw_mmc driver ok?
  2013-08-21 20:18 ` Grant Grundler
@ 2013-08-23 14:06   ` Jaehoon Chung
  2013-08-28 23:20     ` Grant Grundler
  0 siblings, 1 reply; 4+ messages in thread

From: Jaehoon Chung @ 2013-08-23 14:06 UTC (permalink / raw)
To: Grant Grundler
Cc: Chris Ball, Jaehoon Chung, Doug Anderson, linux-mmc@vger.kernel.org

Hi Grant,

I will check this.

Best Regards,
Jaehoon Chung

On 08/22/2013 05:18 AM, Grant Grundler wrote:
> On Mon, Aug 12, 2013 at 4:45 PM, Grant Grundler <grundler@chromium.org> wrote:
>> I've been working on a "task mmcqd/0:84 blocked for more than 120
>> seconds" panic for the past month or so in the chromeos-3.4 kernel
>> tree. Stack trace below. Feel free to tell me "fixed in v3.x". :)
>
> I've added this change:
> [...]
>
> and the "WARN_ON(host->claimer != current)" is triggering with this
> stack trace (as I suspected):
> [...]
>
> Is this expected behavior?
>
> It feels wrong to me since it means the dw_mmc tasklet and whatever
> thread starts the IO can both access the mmc_request data structure.
> Do I have this right?
>
> If I have this right, any "obvious" fixes? e.g. add locking to
> mmc_start_request() and mmc_request_done()?
>
> thanks,
> grant
>
>> [...]

^ permalink raw reply	[flat|nested] 4+ messages in thread
* Re: MMC locking: mmc_request accesses from dw_mmc driver ok?
  2013-08-23 14:06 ` Jaehoon Chung
@ 2013-08-28 23:20   ` Grant Grundler
  0 siblings, 0 replies; 4+ messages in thread

From: Grant Grundler @ 2013-08-28 23:20 UTC (permalink / raw)
To: Jaehoon Chung
Cc: Grant Grundler, Chris Ball, Doug Anderson, linux-mmc@vger.kernel.org

On Fri, Aug 23, 2013 at 7:06 AM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
> Hi Grant,
>
> I will check this.

Hi Jaehoon! Thanks for looking into this!

Do you have any comments on the problem? My impression is that the
mmcqd thread and the DW tasklet are racing, but I'm not able to pin
down exactly how we end up waiting for an "areq" that might already be
done.

thanks,
grant

> Best Regards,
> Jaehoon Chung
>
> On 08/22/2013 05:18 AM, Grant Grundler wrote:
>> [...]

^ permalink raw reply	[flat|nested] 4+ messages in thread
end of thread, other threads: [~2013-08-28 23:20 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2013-08-12 23:45 MMC locking: mmc_request accesses from dw_mmc driver ok? Grant Grundler
2013-08-21 20:18 ` Grant Grundler
2013-08-23 14:06   ` Jaehoon Chung
2013-08-28 23:20     ` Grant Grundler