From: Hannes Reinecke <hare@suse.de>
To: Kevin Groeneveld <kgroeneveld@lenbrook.com>, JBottomley@odin.com
Cc: linux-scsi@vger.kernel.org, festevam@gmail.com,
	richard.zhu@freescale.com, arnd@arndb.de, linux@arm.linux.org.uk
Subject: Re: [PATCH] scsi: fix hang in scsi error handling
Date: Thu, 16 Jul 2015 13:11:18 +0200
Message-ID: <55A79156.3060307@suse.de>
In-Reply-To: <1436964449-31447-1-git-send-email-kgroeneveld@lenbrook.com>
On 07/15/2015 02:47 PM, Kevin Groeneveld wrote:
> With the following setup/steps I can consistently trigger the scsi host to
> hang requiring a reboot:
> 1. iMX6Q processor with built in AHCI compatible SATA host
> 2. SATA port multiplier in CBS mode connected to iMX6Q
> 3. HDD connected to port multiplier
> 4. CDROM connected to port multiplier
> 5. trigger continuous I/O to HDD
> 6. repeatedly execute CDROM_DRIVE_STATUS ioctl on CDROM with no disc in
> drive
>
> I don't think this issue is iMX6 specific but that is the only platform
> I have duplicated the hang on.
>
> To trigger the issue at least two CPU cores must be enabled and the HDD
> access and CDROM ioctls must be happening concurrently. If I only enable
> one CPU core the hang does not occur.
>
> The following C program can be used to trigger the CDROM ioctl:
>
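> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/ioctl.h>
> #include <linux/cdrom.h>
>
> int main(int argc, char* argv[])
> {
> 	int fd;
>
> 	/* O_NONBLOCK lets the open succeed with no disc in the drive */
> 	fd = open("/dev/cdrom", O_RDONLY | O_NONBLOCK);
> 	if(fd < 0)
> 	{
> 		perror("cannot open /dev/cdrom");
> 		return 1;
> 	}
>
> 	/* query the drive status every 100 ms, forever */
> 	for(;;)
> 	{
> 		ioctl(fd, CDROM_DRIVE_STATUS, 0);
> 		usleep(100 * 1000);
> 	}
> }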
>
> When the hang occurs shost->host_busy == 2 and shost->host_failed == 1 in
> the scsi_eh_wakeup function. However this function only wakes the error
> handler if host_busy == host_failed.
>
Which just means that one command is still outstanding, and we need
to wait for it to complete.
But see below...
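
To make the counter arithmetic concrete, here is a minimal user-space
sketch of that check (not kernel code). struct host_model,
eh_would_wake() and complete_one() are made-up names for illustration;
only the comparison is taken from the unpatched scsi_eh_wakeup(), and
the real host_busy is an atomic_t rather than a plain int.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two Scsi_Host counters discussed here. */
struct host_model {
	int host_busy;   /* outstanding commands, failed ones included */
	int host_failed; /* commands handed over to error handling */
};

/* Same comparison as the unpatched scsi_eh_wakeup(). */
bool eh_would_wake(const struct host_model *h)
{
	return h->host_busy == h->host_failed;
}

/* Hypothetical helper: one in-flight (non-failed) command completes. */
void complete_one(struct host_model *h)
{
	assert(h->host_busy > h->host_failed);
	h->host_busy--;
}
```

Starting from host_busy == 2 and host_failed == 1, eh_would_wake() is
false; after a single complete_one() the counters are equal and it
becomes true.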
> The patch changes the condition to test if host_busy >= host_failed and
> updates the corresponding condition in scsi_error_handler. Without the
> patch I can trigger the hang within seconds. With the patch I have not
> duplicated the hang after hours of testing.
>
> Signed-off-by: Kevin Groeneveld <kgroeneveld@lenbrook.com>
> ---
>  drivers/scsi/scsi_error.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 106884a..853964b 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -61,7 +61,7 @@ static int scsi_try_to_abort_cmd(struct scsi_host_template *,
>  /* called with shost->host_lock held */
>  void scsi_eh_wakeup(struct Scsi_Host *shost)
>  {
> -	if (atomic_read(&shost->host_busy) == shost->host_failed) {
> +	if (atomic_read(&shost->host_busy) >= shost->host_failed) {
>  		trace_scsi_eh_wakeup(shost);
>  		wake_up_process(shost->ehandler);
>  		SCSI_LOG_ERROR_RECOVERY(5, shost_printk(KERN_INFO, shost,
> @@ -2173,7 +2173,7 @@ int scsi_error_handler(void *data)
>  	while (!kthread_should_stop()) {
>  		set_current_state(TASK_INTERRUPTIBLE);
>  		if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) ||
> -		    shost->host_failed != atomic_read(&shost->host_busy)) {
> +		    shost->host_failed > atomic_read(&shost->host_busy)) {
>  			SCSI_LOG_ERROR_RECOVERY(1,
>  				shost_printk(KERN_INFO, shost,
>  					"scsi_eh_%d: sleeping\n",
>
Hmm.
I am really not sure about this.

'host_busy' indicates the number of outstanding commands, and
'host_failed' is the number of commands which have failed (on the
grounds that failed commands are considered outstanding, too).

So the first hunk would change the behaviour from
'start SCSI EH once all commands are completed or failed' to
'start SCSI EH for _any_ command if scsi_eh_wakeup is called'
(note that 'host_failed' might be '0'...).
Which doesn't sound right.
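
As a rough illustration of the difference, the two forms of the wakeup
test can be put side by side in a user-space sketch. The function names
are invented for this example and plain ints stand in for the
atomic_t / unsigned counters of struct Scsi_Host; only the two
comparisons come from the hunk itself.

```c
#include <stdbool.h>

/* Comparison used by scsi_eh_wakeup() before the patch. */
bool wake_before_patch(int host_busy, int host_failed)
{
	return host_busy == host_failed;
}

/* Comparison proposed by the first hunk of the patch. */
bool wake_after_patch(int host_busy, int host_failed)
{
	return host_busy >= host_failed;
}
```

With two commands in flight and none failed (host_busy == 2,
host_failed == 0) the first form is false and the second is true, so
in this model the patched test passes whenever scsi_eh_wakeup is
called while the counters are consistent.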
The second hunk seems to be okay, as in principle 'host_busy' could
have been decreased before the check is done (i.e. someone could have
called ->done on a failed command).
But even so this would point to an invalid command completion; as
soon as a command is marked as 'failed', control is back in the SCSI
midlayer, and no-one else should be tampering with it.
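
For anyone debugging this, a small user-space model of the sleep test
in scsi_error_handler may help when reasoning about the counter
states. It is only a sketch: the function names are hypothetical,
plain ints replace the kernel types, and counters_consistent() is an
assumed debugging aid, not something that exists in the SCSI midlayer.

```c
#include <stdbool.h>

/* Sleep test of scsi_error_handler() before the patch. */
bool eh_sleeps_before_patch(int host_busy, int host_failed,
			    int host_eh_scheduled)
{
	return (host_failed == 0 && host_eh_scheduled == 0) ||
	       host_failed != host_busy;
}

/* Sleep test as changed by the second hunk. */
bool eh_sleeps_after_patch(int host_busy, int host_failed,
			   int host_eh_scheduled)
{
	return (host_failed == 0 && host_eh_scheduled == 0) ||
	       host_failed > host_busy;
}

/*
 * Hypothetical invariant: failed commands still count as outstanding,
 * so host_failed should never exceed host_busy.
 */
bool counters_consistent(int host_busy, int host_failed)
{
	return host_failed >= 0 && host_failed <= host_busy;
}
```

In this model the two tests disagree only when host_busy is larger
than host_failed (for example 2 and 1, the state from the report).
When host_busy has dropped below host_failed, both forms still sleep
and counters_consistent() returns false.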
I guess this needs further debugging to get to the bottom of it.
Sorry, but:
NACK.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)