From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
To: Ulf Hansson <ulf.hansson@linaro.org>
Cc: "linux-mmc@vger.kernel.org" <linux-mmc@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: Continuous SD IO causes hung task messages
Date: Tue, 27 Aug 2019 16:06:14 +0100 [thread overview]
Message-ID: <20190827150614.GN13294@shell.armlinux.org.uk> (raw)
In-Reply-To: <20190827145216.GM13294@shell.armlinux.org.uk>
On Tue, Aug 27, 2019 at 03:52:17PM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Aug 27, 2019 at 03:36:34PM +0100, Russell King - ARM Linux admin wrote:
> > On Tue, Aug 27, 2019 at 03:55:23PM +0200, Ulf Hansson wrote:
> > > On Tue, 27 Aug 2019 at 15:43, Russell King - ARM Linux admin
> > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > Hi,
> > > >
> > > > While dd'ing the contents of a SD card, I get hung task timeout
> > > > messages as per below. However, the dd is making progress. Any
> > > > ideas?
> > > >
> > > > Presumably, mmc_rescan doesn't get a look-in while IO is progressing
> > > > for the card?
> > >
> > > Is it a regression?
> > >
> > > There not much of recent mmc core and mmc block changes, that I can
> > > think of at this point.
> >
> > No idea - I just repaired the SD socket after the D2 line became
> > disconnected, and decided to run that command as a test.
> >
> > > > ARM64 host, Macchiatobin, uSD card.
> > >
> > > What mmc host driver is it? mmci?
> >
> > sdhci-xenon.
> >
> > I'm just trying with one CPU online, then I'll try with two. My
> > suspicion is that there's a problem in the ARM64 arch code where
> > unlocking a mutex doesn't get noticed on other CPUs.
> >
> > Hmm, I thought I'd try bringing another CPU online, but it seems
> > like the ARM64 CPU hotplug code is broken:
> >
> > [ 3552.029689] CPU1: shutdown
> > [ 3552.031099] psci: CPU1 killed.
> > [ 3949.835212] CPU1: failed to come online
> > [ 3949.837753] CPU1: failed in unknown state : 0x0
> >
> > which means I can only take CPUs down, I can't bring them back
> > online without rebooting.
>
> Okay, running on a single CPU shows no problems.
>
> Running on four CPUs (as originally) shows that the kworker thread
> _never_ gets scheduled, so the warning is not false.
>
> With three CPUs, same problem.
>
> root@arm-d06300000000:~# ps aux | grep ' D '
> root 34 0.0 0.0 0 0 ? D 15:38 0:00 [kworker/1:1+events_freezable]
> root@arm-d06300000000:~# cat /proc/34/sched
> kworker/1:1 (34, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start : 318689.992440
> se.vruntime : 37750.882357
> se.sum_exec_runtime : 9.421240
> se.nr_migrations : 0
> nr_switches : 1174
> nr_voluntary_switches : 1171
> nr_involuntary_switches : 3
> se.load.weight : 1048576
> se.runnable_weight : 1048576
> se.avg.load_sum : 6
> se.avg.runnable_load_sum : 6
> se.avg.util_sum : 5170
> se.avg.load_avg : 0
> se.avg.runnable_load_avg : 0
> se.avg.util_avg : 0
> se.avg.last_update_time : 318689991680
> se.avg.util_est.ewma : 10
> se.avg.util_est.enqueued : 0
> policy : 0
> prio : 120
> clock-delta : 0
>
> The only thing that changes there is "clock-delta". When I kill the
> dd, I get:
>
> root@arm-d06300000000:~# cat /proc/34/sched
> kworker/1:1 (34, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start : 574025.791680
> se.vruntime : 79996.657300
> se.sum_exec_runtime : 10.916400
> se.nr_migrations : 0
> nr_switches : 1403
> nr_voluntary_switches : 1400
> nr_involuntary_switches : 3
> se.load.weight : 1048576
> se.runnable_weight : 1048576
> se.avg.load_sum : 15
> se.avg.runnable_load_sum : 15
> se.avg.util_sum : 15007
> se.avg.load_avg : 0
> se.avg.runnable_load_avg : 0
> se.avg.util_avg : 0
> se.avg.last_update_time : 574025791488
> se.avg.util_est.ewma : 10
> se.avg.util_est.enqueued : 0
> policy : 0
> prio : 120
> clock-delta : 40
>
> so the thread makes forward progress.
>
> Down to two CPUs:
>
> root@arm-d06300000000:~# ps aux | grep ' D '
> root 34 0.0 0.0 0 0 ? D 15:38 0:00 [kworker/1:1+events_freezable]
>
> Same symptoms. dd and md5sum switch between CPU 0 and CPU1.
Hmm.
static blk_status_t mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
mq->in_flight[issue_type] += 1;
get_card = (mmc_tot_in_flight(mq) == 1);
if (get_card)
mmc_get_card(card, &mq->ctx);
mmc_get_card() gets the host lock according to the card.
So, if we always have requests in flight (which is probably the case
here) we never drop the host lock, and mmc_rescan() never gets a look
in - hence blocking the kworker.
So this is a real issue with MMC, and not down to something in the
arch.
I suspect the reason that single-CPU doesn't show it is because it is
unable to keep multiple requests in flight.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
WARNING: multiple messages have this Message-ID (diff)
From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
To: Ulf Hansson <ulf.hansson@linaro.org>
Cc: "linux-mmc@vger.kernel.org" <linux-mmc@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: Continuous SD IO causes hung task messages
Date: Tue, 27 Aug 2019 16:06:14 +0100 [thread overview]
Message-ID: <20190827150614.GN13294@shell.armlinux.org.uk> (raw)
In-Reply-To: <20190827145216.GM13294@shell.armlinux.org.uk>
On Tue, Aug 27, 2019 at 03:52:17PM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Aug 27, 2019 at 03:36:34PM +0100, Russell King - ARM Linux admin wrote:
> > On Tue, Aug 27, 2019 at 03:55:23PM +0200, Ulf Hansson wrote:
> > > On Tue, 27 Aug 2019 at 15:43, Russell King - ARM Linux admin
> > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > Hi,
> > > >
> > > > While dd'ing the contents of a SD card, I get hung task timeout
> > > > messages as per below. However, the dd is making progress. Any
> > > > ideas?
> > > >
> > > > Presumably, mmc_rescan doesn't get a look-in while IO is progressing
> > > > for the card?
> > >
> > > Is it a regression?
> > >
> > > There not much of recent mmc core and mmc block changes, that I can
> > > think of at this point.
> >
> > No idea - I just repaired the SD socket after the D2 line became
> > disconnected, and decided to run that command as a test.
> >
> > > > ARM64 host, Macchiatobin, uSD card.
> > >
> > > What mmc host driver is it? mmci?
> >
> > sdhci-xenon.
> >
> > I'm just trying with one CPU online, then I'll try with two. My
> > suspicion is that there's a problem in the ARM64 arch code where
> > unlocking a mutex doesn't get noticed on other CPUs.
> >
> > Hmm, I thought I'd try bringing another CPU online, but it seems
> > like the ARM64 CPU hotplug code is broken:
> >
> > [ 3552.029689] CPU1: shutdown
> > [ 3552.031099] psci: CPU1 killed.
> > [ 3949.835212] CPU1: failed to come online
> > [ 3949.837753] CPU1: failed in unknown state : 0x0
> >
> > which means I can only take CPUs down, I can't bring them back
> > online without rebooting.
>
> Okay, running on a single CPU shows no problems.
>
> Running on four CPUs (as originally) shows that the kworker thread
> _never_ gets scheduled, so the warning is not false.
>
> With three CPUs, same problem.
>
> root@arm-d06300000000:~# ps aux | grep ' D '
> root 34 0.0 0.0 0 0 ? D 15:38 0:00 [kworker/1:1+events_freezable]
> root@arm-d06300000000:~# cat /proc/34/sched
> kworker/1:1 (34, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start : 318689.992440
> se.vruntime : 37750.882357
> se.sum_exec_runtime : 9.421240
> se.nr_migrations : 0
> nr_switches : 1174
> nr_voluntary_switches : 1171
> nr_involuntary_switches : 3
> se.load.weight : 1048576
> se.runnable_weight : 1048576
> se.avg.load_sum : 6
> se.avg.runnable_load_sum : 6
> se.avg.util_sum : 5170
> se.avg.load_avg : 0
> se.avg.runnable_load_avg : 0
> se.avg.util_avg : 0
> se.avg.last_update_time : 318689991680
> se.avg.util_est.ewma : 10
> se.avg.util_est.enqueued : 0
> policy : 0
> prio : 120
> clock-delta : 0
>
> The only thing that changes there is "clock-delta". When I kill the
> dd, I get:
>
> root@arm-d06300000000:~# cat /proc/34/sched
> kworker/1:1 (34, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start : 574025.791680
> se.vruntime : 79996.657300
> se.sum_exec_runtime : 10.916400
> se.nr_migrations : 0
> nr_switches : 1403
> nr_voluntary_switches : 1400
> nr_involuntary_switches : 3
> se.load.weight : 1048576
> se.runnable_weight : 1048576
> se.avg.load_sum : 15
> se.avg.runnable_load_sum : 15
> se.avg.util_sum : 15007
> se.avg.load_avg : 0
> se.avg.runnable_load_avg : 0
> se.avg.util_avg : 0
> se.avg.last_update_time : 574025791488
> se.avg.util_est.ewma : 10
> se.avg.util_est.enqueued : 0
> policy : 0
> prio : 120
> clock-delta : 40
>
> so the thread makes forward progress.
>
> Down to two CPUs:
>
> root@arm-d06300000000:~# ps aux | grep ' D '
> root 34 0.0 0.0 0 0 ? D 15:38 0:00 [kworker/1:1+events_freezable]
>
> Same symptoms. dd and md5sum switch between CPU 0 and CPU1.
Hmm.
static blk_status_t mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
mq->in_flight[issue_type] += 1;
get_card = (mmc_tot_in_flight(mq) == 1);
if (get_card)
mmc_get_card(card, &mq->ctx);
mmc_get_card() gets the host lock according to the card.
So, if we always have requests in flight (which is probably the case
here) we never drop the host lock, and mmc_rescan() never gets a look
in - hence blocking the kworker.
So this is a real issue with MMC, and not down to something in the
arch.
I suspect the reason that single-CPU doesn't show it is because it is
unable to keep multiple requests in flight.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2019-08-27 15:06 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-27 13:43 Continuous SD IO causes hung task messages Russell King - ARM Linux admin
2019-08-27 13:43 ` Russell King - ARM Linux admin
2019-08-27 13:55 ` Ulf Hansson
2019-08-27 13:55 ` Ulf Hansson
2019-08-27 13:55 ` Ulf Hansson
2019-08-27 14:36 ` Russell King - ARM Linux admin
2019-08-27 14:36 ` Russell King - ARM Linux admin
2019-08-27 14:52 ` Russell King - ARM Linux admin
2019-08-27 14:52 ` Russell King - ARM Linux admin
2019-08-27 15:06 ` Russell King - ARM Linux admin [this message]
2019-08-27 15:06 ` Russell King - ARM Linux admin
2019-08-28 10:31 ` Ulf Hansson
2019-08-28 10:31 ` Ulf Hansson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190827150614.GN13294@shell.armlinux.org.uk \
--to=linux@armlinux.org.uk \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mmc@vger.kernel.org \
--cc=ulf.hansson@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.