From: Bernd Schubert <bs@q-leap.de>
To: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Matthew Wilcox <matthew@wil.cx>, linux-scsi@vger.kernel.org
Subject: Re: [PATCH] scsi device recovery
Date: Wed, 12 Dec 2007 18:54:42 +0100 [thread overview]
Message-ID: <200712121854.42669.bs@q-leap.de> (raw)
In-Reply-To: <1197475177.4203.29.camel@localhost.localdomain>
[Hmm, resending since mail after more than 30min still not on the ML, maybe
the attachment was too large? I have uploaded the log to
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/scsi/kern.log.1]
On Wednesday 12 December 2007 16:59:36 James Bottomley wrote:
> On Wed, 2007-12-12 at 15:36 +0100, Bernd Schubert wrote:
> > On Wednesday 12 December 2007 14:39:27 Matthew Wilcox wrote:
> > > On Wed, Dec 12, 2007 at 01:54:14PM +0100, Bernd Schubert wrote:
> > > > below is a patch introducing device recovery, trying to prevent i/o
> > > > errors when a DID_NO_CONNECT or SOFT_ERROR does happen.
> > >
> > > Why doesn't the regular scsi_eh do what you need?
> >
> > First of all, it is presently simply not called when the two errors above
> > do happen. This could be changed, of course.
>
> Erm, I think you'll find the error handler does activate on
> DID_SOFT_ERROR. It causes a retry via the eh. DID_NO_CONNECT is an
Dec 7 23:48:45 beo-96 kernel: [94605.297924] sd 2:0:5:0: [sdd] Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Dec 7 23:48:45 beo-96 kernel: [94605.297932] end_request: I/O error, dev sdd,
sector 7706802052
Dec 7 23:48:45 beo-96 kernel: [94605.297937] raid5:md5: read error not
correctable (sector 871932472 on sdd3).
Full log attached.
> immediate error with no eh intervention because it means that the target
> went away. Handling this as a retryable error isn't an option because
> it will interfere with hotplug.
Then we need a sysfs flag one can set to manually enable eh for these devices
on DID_NO_CONNECT.
>
> > Secondly, I think scsi_eh is in most cases doing too much. We are
> > fighting with flaky Infortrend boxes here, and scsi_eh sometimes manages
> > to crash their scsi channels. In most cases it is sufficient to stall any
> > io to the device and then to resume.
>
> But that's basically the default behaviour of the error handler (stall
> then resume).
>
> > For most scsi devices one probably doesn't need a suspend time or it can
> > be very small, this still needs to become configurable via sysfs.
>
> You mean a wait time beyond what the error handler currently does
> (basically it waits for the quiesce, begins error handling and then
> sends a test unit ready when it finishes before restarting).
In deh just waits on the first error and then only does a DV. For
these infortrend devices, thats mostly sufficient.
>
> > Thirdly, scsi_eh doesn't give up, in most cases, when the scsi channel of
> > a Infortrend box crashed, it tried forever to recover.
> > To improve this is still on my todo list.
>
> Could you send traces for this. I thought the error handler had been
> fixed over the last few years always to terminate. If there's a case
> where it doesn't, this needs fixing.
I'm attaching the syslog, this is 2.6.22 + additional printks, dump_stack()'s
and msleep()'s.
At 03:59:36 the system finally went into wait_for_completion(), similar
to the "everything in wait_for_completion, what is my system doing?" thread.
Thanks,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
next prev parent reply other threads:[~2007-12-12 17:54 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-12-12 12:54 [PATCH] scsi device recovery Bernd Schubert
2007-12-12 13:39 ` Matthew Wilcox
2007-12-12 14:36 ` Bernd Schubert
2007-12-12 15:59 ` James Bottomley
2007-12-12 17:54 ` Bernd Schubert [this message]
2007-12-13 14:18 ` James Bottomley
2007-12-14 11:26 ` fusion problem (was Re: [PATCH] scsi device recovery) Bernd Schubert
2007-12-14 12:04 ` [PATCH] scsi device recovery Bernd Schubert
2007-12-14 12:22 ` Matthew Wilcox
2007-12-14 12:28 ` Bernd Schubert
2007-12-14 14:35 ` James Bottomley
2007-12-14 15:26 ` Bernd Schubert
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200712121854.42669.bs@q-leap.de \
--to=bs@q-leap.de \
--cc=James.Bottomley@hansenpartnership.com \
--cc=linux-scsi@vger.kernel.org \
--cc=matthew@wil.cx \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox