public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed
* SCSI error handling -- one error blocks the whole SCSI host
@ 2013-05-23 18:14 Roland Dreier
  2013-05-25 18:07 ` James Smart
  2013-05-26 22:44 ` James Bottomley
  0 siblings, 2 replies; 8+ messages in thread
From: Roland Dreier @ 2013-05-23 18:14 UTC (permalink / raw)
  To: linux-scsi, Hannes Reinecke, Jej B

At LSF this year, we had a discussion about error handling and in
particular the problem that SCSI midlayer error handling waits for the
entire SCSI host (HBA) to quiesce before it starts to abort commands
etc.

James made the suggestion that FC should handle things the way SAS
does, because SAS has a strategy handler that does things the right
way.  However, now that I finally sit down and look at the code, I
don't see how this is the case.  It seems inherent in the way that
scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
particular the strategy handler can't even be called until host_failed
== host_busy; we don't bump host_failed without SHOST_RECOVERY set,
which stops queueing commands to any devices attached to the whole
HBA).

James, am I understanding your suggestion properly?  If so can you
explain what you meant about the libsas code -- I see that it has its
own strategy handler but as I said before we've already stopped every
device attached to the HBA before we ever get there.

To recapitulate the problem here, we might have a whole fabric
attached to an HBA via SAS or FC, and be doing 500K IOPS happily to 50
devices.  Then a single LUN goes wonky and all the IO stops while we
try to recover that single device, which might take minutes.

I know this has been discussed before, but can we find a way forward
here?  Is there some way we can start with per-device error recovery
and avoid disrupting IO that we can see is working fine?

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-05-28 16:23 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-05-23 18:14 SCSI error handling -- one error blocks the whole SCSI host Roland Dreier
2013-05-25 18:07 ` James Smart
2013-05-26 22:44 ` James Bottomley
2013-05-27 14:39   ` Hannes Reinecke
2013-05-27 20:41     ` James Bottomley
2013-05-28  1:32       ` Baruch Even
2013-05-28 14:38         ` Jeremy Linton
2013-05-28 16:22           ` Baruch Even

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox