Re: disk restart failure after suspend

public inbox for linux-scsi@vger.kernel.org

From: Stefan Richter <stefanr@s5r6.in-berlin.de>
To: Alan Stern <stern@rowland.harvard.edu>
Cc: Tino Keitel <tino.keitel@tikei.de>,
	linux1394-user@lists.sourceforge.net, linux-scsi@vger.kernel.org,
	Tejun Heo <tj@kernel.org>
Subject: Re: disk restart failure after suspend
Date: Mon, 19 Oct 2009 22:24:43 +0200
Message-ID: <4ADCCB0B.6060005@s5r6.in-berlin.de>
In-Reply-To: <Pine.LNX.4.44L0.0910181642390.18952-100000@netrider.rowland.org>

Alan Stern wrote:
> On Sun, 18 Oct 2009, Stefan Richter wrote:
>> IEEE 1394 rediscovery and SBP-2 reconnect can become
>> necessary anytime (and they do become necessary at /least/ once during
>> PM resume), in no particular order with respect to SCSI request
>> submission.  Our drivers (firewire-sbp2 mainly) need to be able to
>> handle any order of such events.
> 
> Is it possible to delay returning from the device resume routine until
> the rediscovery/reconnect has completed?  This is more or less how the
> USB stack works.

Hmm.  FireWire isn't deterministic in this regard; it's partly bus,
partly network.  The transport protocol SBP-2 is kind of a network
protocol with remote DMA.  Rediscovery and reconnect at PM resume are
rather stochastic processes
  - if the target went through a low power state too,
  - if other nodes besides the Linux SBP-2 initiator and the SBP-2
    target are on the bus,
  - not to mention if those other nodes went through a low power
    cycle as well.

I could add .suspend and .resume methods to firewire-sbp2's struct
device_driver (or just .resume if the PM core accepts that... I have to
check the API), and the .resume method could contain a
wait_for_completion_timeout which is unblocked when a reconnect
happened.  However, this could still go wrong if for some reason (e.g.
see above) multiple reconnects to the target happen in a row.
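
A rough userspace sketch of that idea, not actual firewire-sbp2 code:
pthreads stand in for the kernel's completion API, and all names
(reconnect_completion, reconnect_done, resume_wait_for_reconnect) are
made up for illustration.  The generation counter is an assumption added
here; it lets the waiter ask for "a reconnect newer than the one I saw".

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for a kernel struct completion. */
struct reconnect_completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	unsigned int generation;	/* bumped on every reconnect */
};

#define RECONNECT_COMPLETION_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Called from the (simulated) reconnect path; wakes all waiters. */
static void reconnect_done(struct reconnect_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->generation++;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/*
 * Model of a .resume method: block until a reconnect newer than
 * seen_generation has happened, or until timeout_ms has elapsed.
 * Returns true if such a reconnect was observed.
 */
static bool resume_wait_for_reconnect(struct reconnect_completion *c,
				      unsigned int seen_generation,
				      long timeout_ms)
{
	struct timespec deadline;
	bool ok;
	int err = 0;

	/* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	/* Loop to absorb spurious wakeups; stop on timeout or any error. */
	while (c->generation == seen_generation && err == 0)
		err = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
	ok = c->generation != seen_generation;
	pthread_mutex_unlock(&c->lock);

	return ok;
}
```

Note that a true return only says that one reconnect happened.  Another
bus reset right afterwards makes the login stale again, which is the
weakness of this approach.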

So I tend to think firewire-sbp2 should learn to resubmit requests that
were queued by SCSI midlayer after the SBP-2 connection broke & before
reconnect happened, i.e. hide all this from SCSI midlayer rather than
completing these requests with DID_BUS_BUSY.
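
To illustrate, here is a minimal userspace model of such a hold-and-resubmit
scheme.  It is only a sketch under assumptions: struct sbp2_model,
model_queuecommand, model_reconnected and the MAX_PENDING limit are all
hypothetical and do not correspond to existing firewire-sbp2 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8	/* arbitrary limit chosen for this sketch */

struct request {
	int id;
	bool sent;	/* true once handed to the target */
};

/* Hypothetical transport state: connection flag plus a hold queue. */
struct sbp2_model {
	bool connected;
	struct request *pending[MAX_PENDING];
	size_t n_pending;
};

enum queue_result { REQ_SENT, REQ_HELD, REQ_BUSY };

/*
 * Model of the transport's queuecommand path: while disconnected, park
 * the request instead of failing it.  Only a full hold queue is
 * reported as busy to the caller.
 */
static enum queue_result model_queuecommand(struct sbp2_model *m,
					    struct request *rq)
{
	if (m->connected) {
		rq->sent = true;
		return REQ_SENT;
	}
	if (m->n_pending == MAX_PENDING)
		return REQ_BUSY;
	m->pending[m->n_pending++] = rq;
	return REQ_HELD;
}

/* Reconnect finished: resubmit held requests in their original order. */
static size_t model_reconnected(struct sbp2_model *m)
{
	size_t i, n = m->n_pending;

	assert(n <= MAX_PENDING);
	m->connected = true;
	for (i = 0; i < n; i++)
		m->pending[i]->sent = true;
	m->n_pending = 0;

	return n;	/* number of requests that were resubmitted */
}
```

In this model the upper layer never sees an error for a request that
arrived during the reconnect window; it only sees the request take
longer to complete.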

>> There are two independent places of the code that could possibly be
>> improved to fix this issue:
>>
>> a.)  sd's PM resume method:
>>
>> 1.a)  sd_resume could gain this retry loop which you implemented.
> 
> This wouldn't be necessary if the transport was working before 
> sd_resume got called.

Technically the transport does "work" at this time:  It might have
blocked the Scsi_Host though, or it might return "bus busy" status for
one request and then block the host.  But apparently that's not liked by
upper layers during resume.

Anyway, I'd say it this way:

This wouldn't be necessary if the transport just hid this reconnection
phase from SCSI core and everything above it.

Then we only need to rely on the reconnect (or possibly series of
reconnects, see above) to finish before the request times out, minus
the time for the actual execution of the request.  That should fit
comfortably into the 30 seconds of SD_TIMEOUT.
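
The time budget can be written down as a small worked example.  The
helper names are hypothetical, and the per-reconnect and execution times
used with them are made-up inputs, not measurements; only the 30 seconds
come from SD_TIMEOUT.

```c
#include <assert.h>
#include <stdbool.h>

#define SD_TIMEOUT_SECS 30	/* the 30 seconds of SD_TIMEOUT */

/* Time left for reconnects once the request's own execution is paid for. */
static int reconnect_budget_secs(int exec_secs)
{
	assert(exec_secs >= 0);
	return exec_secs >= SD_TIMEOUT_SECS ? 0 : SD_TIMEOUT_SECS - exec_secs;
}

/* Does a series of reconnects still fit before the request times out? */
static bool reconnects_fit(int n_reconnects, int secs_each, int exec_secs)
{
	return n_reconnects * secs_each <= reconnect_budget_secs(exec_secs);
}
```

For instance, with an assumed 2 seconds of execution time, three
reconnects of 5 seconds each fit into the remaining 28 seconds, while
six of them would not.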

>> 1.b)  sd_resume (but probably not sd_suspend) could optimistically
>> ignore any error return from sd_start_stop_device.  If the motor cannot
>> be started immediately at resume, the SCSI core would try to start it
>> later on when the disk is normally accessed.
> 
> This is probably a worthwhile idea in any case.
> 
>> My assumption here is that an error return from sd_resume causes the
>> disk to become inaccessible (taken offline?).
> 
> No.  All it does is cause an error message to be printed in the system 
> log.  But it's possible that a failure lower down in the SCSI stack has 
> this effect.

I wonder what this might be.
-- 
Stefan Richter
-=====-==--= =-=- =--==
http://arcgraph.de/sr/
