From: Bernd Schubert <bs@q-leap.de>
To: linux-scsi@vger.kernel.org
Subject: [PATCH] scsi device recovery
Date: Wed, 12 Dec 2007 13:54:14 +0100 [thread overview]
Message-ID: <200712121354.14474.bs@q-leap.de> (raw)
Hi,
below is a patch introducing device recovery, trying to prevent i/o errors
when a DID_NO_CONNECT or SOFT_ERROR does happen.
The patch still needs quite some work:
1.) I still didn't figure out what is the best place to run
sdev->deh.ehandler = kthread_run(scsi_device_error_handler, ...)
2.) As I see it, its not a good idea to run spi_schedule_dv_device() in
scsi_error.c, since spi_schedule_dv_device() is in scsi_transport_spi.c,
which seems to be separated from the core scsi-layer.
So what is another way to initiate a DV in scsi_error.c?
3.) Maybe related to 2), for now I'm calling spi_schedule_dv_device(), but
this is not always doing what I want.
[ 406.785104] sd 5:0:2:0: deh: scheduling domain validation
[ 408.422530] target5:0:2: Beginning Domain Validation
[ 408.466620] target5:0:2: Domain Validation skipping write tests
[ 408.472771] target5:0:2: Ending Domain Validation
Hmm, somehow related to sdev->inquiry_len, but isn't it the task of
spi_schedule_dv_device() and subfunctions to do that properly?
Any comments, hints and help is appreciated.
Signed-of-by: Bernd Schubert <bs@q-leap.de>
Index: linux-2.6.22/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi_error.c 2007-12-12 12:26:20.000000000
+0100
+++ linux-2.6.22/drivers/scsi/scsi_error.c 2007-12-12 13:08:40.000000000 +0100
@@ -33,6 +33,7 @@
#include <scsi/scsi_transport.h>
#include <scsi/scsi_host.h>
#include <scsi/scsi_ioctl.h>
+#include <scsi/scsi_transport_spi.h>
#include "scsi_priv.h"
#include "scsi_logging.h"
@@ -1589,6 +1590,153 @@ int scsi_error_handler(void *data)
return 0;
}
+/**
+ * scsi_unjam_sdev - try to revover a failed scsi-device
+ * @sdev: scsi device we are recovering
+ */
+static int scsi_unjam_sdev(struct scsi_device *sdev)
+{
+ int rtn;
+
+ sdev_printk(KERN_CRIT, sdev, "resetting device\n");
+ rtn = scsi_reset_provider(sdev, SCSI_TRY_RESET_DEVICE);
+ scsi_report_device_reset(sdev->host, sdev->channel, sdev->id);
+ if (rtn == SUCCESS)
+ sdev_printk(KERN_INFO, sdev, "device reset succeeded, "
+ "set device to running state\n");
+ return SUCCESS;
+}
+
+/**
+ * scsi_schedule_deh - schedule EH for SCSI device
+ * @sdev: SCSI device to invoke error handling on.
+ *
+ **/
+void scsi_schedule_deh(struct scsi_device *sdev)
+{
+#if 0
+ if (sdev->deh.error) {
+ /* blocking the device does not work! another recovery was
+ * scheduled, though no i/o should go to the device now! */
+ sdev_printk(KERN_CRIT, sdev,
+ "device already in recovery, but another recovery "
+ "was scheduled\n");
+ dump_stack();
+ }
+#endif
+ if (sdev->deh.error)
+ return; /* recovery already running */
+
+ if (sdev->deh.last_recovery
+ && jiffies < sdev->deh.last_recovery + 300 * HZ)
+ sdev->deh.count++;
+ else
+ sdev->deh.count = 0;
+
+ if (sdev->deh.count >= 10) {
+ sdev_printk(KERN_WARNING, sdev,
+ "too many errors within time limit, setting "
+ "device offline\n");
+ scsi_device_set_state(sdev, SDEV_OFFLINE);
+ return;
+ } else if (sdev->deh.count >= 5) {
+ sdev_printk(KERN_INFO, sdev, "Initiating host recovery\n");
+ scsi_schedule_eh(sdev->host); /* host recovery */
+ return;
+ } else
+ sdev->deh.count++;
+
+ sdev_printk(KERN_INFO, sdev, "n-error: %d\n", sdev->deh.count);
+
+ if (!scsi_internal_device_block(sdev)) {
+ sdev->deh.error = 1;
+ if (sdev->deh.ehandler)
+ wake_up_process(sdev->deh.ehandler);
+ else
+ sdev_printk(KERN_WARNING, sdev,
+ "deh handler missing\n");
+ } else {
+ sdev_printk(KERN_WARNING, sdev,
+ "Couldn't block device, calling host recovery\n");
+ scsi_schedule_eh(sdev->host);
+ }
+}
+EXPORT_SYMBOL_GPL(scsi_schedule_deh);
+
+/**
+ * scsi_device_error_handler - SCSI error handler thread
+ * @data: Device for which we are running.
+ *
+ * Notes:
+ * This is the main device error handling loop. This is run as a kernel
thread
+ * for every SCSI device and handles all device error handling activity.
+ **/
+int scsi_device_error_handler(void *data)
+{
+ struct scsi_device *sdev = data;
+ int sleeptime = 30;
+
+ current->flags |= PF_NOFREEZE;
+
+ /*
+ * We use TASK_INTERRUPTIBLE so that the thread is not
+ * counted against the load average as a running process.
+ * We never actually get interrupted because kthread_run
+ * disables singal delivery for the created thread.
+ */
+ set_current_state(TASK_INTERRUPTIBLE);
+ while (!kthread_should_stop()) {
+ if (sdev->deh.error == 0) {
+ SCSI_LOG_ERROR_RECOVERY(1,
+ printk("Error handler scsi_deh sleeping\n"));
+ schedule();
+ set_current_state(TASK_INTERRUPTIBLE);
+ continue;
+ }
+
+ __set_current_state(TASK_RUNNING);
+ SCSI_LOG_ERROR_RECOVERY(1,
+ printk("Error handler scsi_deh waking up\n"));
+
+ sdev_printk(KERN_CRIT, sdev, "waiting %ds to settle device\n",
+ sleeptime);
+ msleep (sleeptime * 1000);
+
+ if (sdev->deh.count < 2) {
+ sdev_printk(KERN_WARNING, sdev,
+ "First device error, simply recovery\n");
+ goto cont;
+ }
+
+ /*
+ * We have a device that is failing for some reason. Figure out
+ * what we need to do to get it up and online again (if we can).
+ * If we fail, we call host recovery
+ */
+ if (scsi_unjam_sdev(sdev) != SUCCESS) {
+ sdev_printk(KERN_CRIT, sdev, "device recovery failed,"
+ " initiating host recovery\n");
+ scsi_schedule_eh(sdev->host);
+ /* scsi_schedule_eh() doesn't know about deh.error */
+ goto error_cont;
+ }
+cont:
+ if (scsi_internal_device_unblock(sdev))
+ sdev_printk(KERN_WARNING, sdev,
+ "deh: device unblocking failed!\n");
+ spi_schedule_dv_device(sdev);
+error_cont:
+ sdev->deh.error = 0;
+ sdev->deh.last_recovery = jiffies;
+ set_current_state(TASK_INTERRUPTIBLE);
+ }
+ __set_current_state(TASK_RUNNING);
+
+ sdev_printk(KERN_CRIT, sdev, "Error handler scsi_deh exiting\n");
+ sdev->deh.ehandler = NULL;
+ return 0;
+}
+
/*
* Function: scsi_report_bus_reset()
*
Index: linux-2.6.22/include/scsi/scsi_device.h
===================================================================
--- linux-2.6.22.orig/include/scsi/scsi_device.h 2007-12-12 12:26:20.000000000
+0100
+++ linux-2.6.22/include/scsi/scsi_device.h 2007-12-12 12:26:23.000000000
+0100
@@ -145,6 +145,13 @@ struct scsi_device {
enum scsi_device_state sdev_state;
unsigned long sdev_data[0];
+
+ struct device_error_handler {
+ unsigned error;
+ struct task_struct * ehandler; /* Error recovery thread. */
+ time_t last_recovery; /* time on last error recovery */
+ unsigned count; /* error count */
+ } deh;
} __attribute__((aligned(sizeof(unsigned long))));
#define to_scsi_device(d) \
container_of(d, struct scsi_device, sdev_gendev)
Index: linux-2.6.22/drivers/scsi/scsi_scan.c
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi_scan.c 2007-12-12 12:26:20.000000000
+0100
+++ linux-2.6.22/drivers/scsi/scsi_scan.c 2007-12-12 12:26:23.000000000 +0100
@@ -1313,6 +1313,12 @@ static int scsi_report_lun_scan(struct s
return 0;
}
+ if (!sdev->deh.ehandler)
+ sdev->deh.ehandler = kthread_run(scsi_device_error_handler,
+ sdev, "sdeh_%d_%d_%d_%d",
+ shost->host_no, sdev->channel,
+ sdev->id, sdev->lun);
+
sprintf(devname, "host %d channel %d id %d",
shost->host_no, sdev->channel, sdev->id);
@@ -1489,8 +1495,13 @@ struct scsi_device *__scsi_add_device(st
scsi_probe_and_add_lun(starget, lun, NULL, &sdev, 1, hostdata);
mutex_unlock(&shost->scan_mutex);
scsi_target_reap(starget);
- put_device(&starget->dev);
+ if (!sdev->deh.ehandler)
+ sdev->deh.ehandler = kthread_run(scsi_device_error_handler,
+ sdev, "sdeh_%d_%d_%d_%d",
+ shost->host_no, sdev->channel,
+ sdev->id, sdev->lun);
+ put_device(&starget->dev);
return sdev;
}
EXPORT_SYMBOL(__scsi_add_device);
Index: linux-2.6.22/drivers/scsi/scsi_priv.h
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi_priv.h 2007-12-12 12:26:20.000000000
+0100
+++ linux-2.6.22/drivers/scsi/scsi_priv.h 2007-12-12 12:26:23.000000000 +0100
@@ -54,6 +54,7 @@ extern void scsi_add_timer(struct scsi_c
extern int scsi_delete_timer(struct scsi_cmnd *);
extern void scsi_times_out(struct scsi_cmnd *cmd);
extern int scsi_error_handler(void *host);
+extern int scsi_device_error_handler(void *sdev);
extern int scsi_decide_disposition(struct scsi_cmnd *cmd);
extern void scsi_eh_wakeup(struct Scsi_Host *shost);
extern int scsi_eh_scmd_add(struct scsi_cmnd *, int);
Index: linux-2.6.22/drivers/scsi/scsi_sysfs.c
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi_sysfs.c 2007-12-12 12:26:20.000000000
+0100
+++ linux-2.6.22/drivers/scsi/scsi_sysfs.c 2007-12-12 12:26:23.000000000 +0100
@@ -10,6 +10,7 @@
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/device.h>
+#include <linux/kthread.h>
#include <scsi/scsi.h>
#include <scsi/scsi_device.h>
@@ -798,6 +799,9 @@ void __scsi_remove_device(struct scsi_de
if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
return;
+ if (sdev->deh.ehandler)
+ kthread_stop(sdev->deh.ehandler);
+
class_device_unregister(&sdev->sdev_classdev);
transport_remove_device(dev);
device_del(dev);
Index: linux-2.6.22/drivers/scsi/scsi_lib.c
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi_lib.c 2007-12-12 12:26:20.000000000
+0100
+++ linux-2.6.22/drivers/scsi/scsi_lib.c 2007-12-12 12:52:31.000000000 +0100
@@ -28,6 +28,7 @@
#include "scsi_priv.h"
#include "scsi_logging.h"
+#include "scsi_transport_api.h"
#define SG_MEMPOOL_NR ARRAY_SIZE(scsi_sg_pools)
@@ -820,6 +821,7 @@ void scsi_io_completion(struct scsi_cmnd
int this_count = cmd->request_bufflen;
request_queue_t *q = cmd->device->request_queue;
struct request *req = cmd->request;
+ struct scsi_device *sdev = cmd->device;
int clear_errors = 1;
struct scsi_sense_hdr sshdr;
int sense_valid = 0;
@@ -958,13 +960,26 @@ void scsi_io_completion(struct scsi_cmnd
break;
}
}
- if (host_byte(result) == DID_RESET) {
+ switch (host_byte(result)) {
+ case DID_OK:
+ break;
+ case DID_RESET:
/* Third party bus reset or reset for error recovery
* reasons. Just retry the request and see what
* happens.
*/
scsi_requeue_command(q, cmd);
return;
+ case DID_NO_CONNECT:
+ sdev_printk(KERN_CRIT, sdev, "DID_NO_CONNECT\n");
+ scsi_schedule_deh(sdev);
+ scsi_requeue_command(q, cmd);
+ return;
+ case DID_SOFT_ERROR:
+ sdev_printk(KERN_CRIT, sdev, "DID_SOFT_ERROR\n");
+ scsi_schedule_deh(sdev);
+ scsi_requeue_command(q, cmd);
+ return;
}
if (result) {
if (!(req->cmd_flags & REQ_QUIET)) {
@@ -2007,18 +2022,18 @@ scsi_device_set_state(struct scsi_device
goto illegal;
}
break;
-
}
sdev->sdev_state = state;
return 0;
illegal:
- SCSI_LOG_ERROR_RECOVERY(1,
+ SCSI_LOG_ERROR_RECOVERY(1,
sdev_printk(KERN_ERR, sdev,
"Illegal state transition %s->%s\n",
scsi_device_state_name(oldstate),
scsi_device_state_name(state))
);
+ dump_stack();
return -EINVAL;
}
EXPORT_SYMBOL(scsi_device_set_state);
Index: linux-2.6.22/drivers/scsi/scsi_transport_api.h
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi_transport_api.h 2007-12-12
12:26:20.000000000 +0100
+++ linux-2.6.22/drivers/scsi/scsi_transport_api.h 2007-12-12
12:26:23.000000000 +0100
@@ -2,5 +2,6 @@
#define _SCSI_TRANSPORT_API_H
void scsi_schedule_eh(struct Scsi_Host *shost);
+void scsi_schedule_deh(struct scsi_device *sdev);
#endif /* _SCSI_TRANSPORT_API_H */
Index: linux-2.6.22/drivers/scsi/scsi.c
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi.c 2007-12-12 12:26:20.000000000 +0100
+++ linux-2.6.22/drivers/scsi/scsi.c 2007-12-12 12:26:23.000000000 +0100
@@ -494,7 +494,8 @@ int scsi_dispatch_cmd(struct scsi_cmnd *
*/
scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
- SCSI_LOG_MLQUEUE(3, printk("queuecommand : device blocked \n"));
+ SCSI_LOG_MLQUEUE(3, printk("queuecommand : device blocked or "
+ "in recovery\n"));
/*
* NOTE: rtn is still zero here because we don't need the
--
Bernd Schubert
Q-Leap Networks GmbH
next reply other threads:[~2007-12-12 12:54 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-12-12 12:54 Bernd Schubert [this message]
2007-12-12 13:39 ` [PATCH] scsi device recovery Matthew Wilcox
2007-12-12 14:36 ` Bernd Schubert
2007-12-12 15:59 ` James Bottomley
2007-12-12 17:54 ` Bernd Schubert
2007-12-13 14:18 ` James Bottomley
2007-12-14 11:26 ` fusion problem (was Re: [PATCH] scsi device recovery) Bernd Schubert
2007-12-14 12:04 ` [PATCH] scsi device recovery Bernd Schubert
2007-12-14 12:22 ` Matthew Wilcox
2007-12-14 12:28 ` Bernd Schubert
2007-12-14 14:35 ` James Bottomley
2007-12-14 15:26 ` Bernd Schubert
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200712121354.14474.bs@q-leap.de \
--to=bs@q-leap.de \
--cc=linux-scsi@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.