From mboxrd@z Thu Jan 1 00:00:00 1970
From: Douglas Gilbert
Subject: Re: [PATCH as468] Retry supposedly "unrecoverable" hardware errors
Date: Fri, 18 Feb 2005 10:49:53 +1000
Message-ID: <42153BB1.4050303@torque.net>
References: <42141D3D.9080800@torque.net> <1108653107.5507.3.camel@mulgrave>
Reply-To: dougg@torque.net
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Received: from borg.st.net.au ([65.23.158.22]:32976 "EHLO borg.st.net.au")
	by vger.kernel.org with ESMTP id S261277AbVBRAtS (ORCPT );
	Thu, 17 Feb 2005 19:49:18 -0500
In-Reply-To: <1108653107.5507.3.camel@mulgrave>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: James Bottomley
Cc: Alan Stern, Martin Peschke, Radovan Garabik, SCSI Mailing List

James Bottomley wrote:
> On Thu, 2005-02-17 at 14:27 +1000, Douglas Gilbert wrote:
>
>>Recent SPC-3 and SBC-2 drafts treat the sense keys of
>>MEDIUM ERROR and HARDWARE ERROR in a similar way.
>>Both can return an "info" field which has the same
>>meaning (lba of first failure). The distinction is that
>>MEDIUM ERROR is a little more precise (at least for
>>magnetic rotating media) **. For flash ram the distinction
>>is moot.
>
>
> My copy of SPC-3 (r21d) still defined HARDWARE ERROR in Table 27 as
>
> HARDWARE ERROR: Indicates that the device server detected a
> non-recoverable hardware failure (e.g., controller failure,
> device failure, or parity error) while performing the command
> or during a self test.
>
> which looks pretty non-retryable to me ... where does it say that the
> error might be retryable?

James,
The definition of MEDIUM ERROR from the same table:
"Indicates that the command terminated with a non-recoverable
error condition that may have been caused by a flaw in the
medium or an error in the recorded data.
This sense key may also be returned if the device server is
unable to distinguish between a flaw in the medium and a
specific hardware failure (i.e. sense key 4h)".

Sense key "4h" is HARDWARE ERROR. I interpret that as SPC-3
saying MEDIUM ERROR and HARDWARE ERROR may both report
non-recoverable errors. Also note that MEDIUM ERROR,
HARDWARE ERROR and RECOVERED ERROR can return an
"actual retry count" in their additional sense data.

SBC-2 (rev 16) makes little distinction between the two
sense keys for "unrecovered read errors": table 4 shows
either can be used. It also says on page 19:
"When an unrecovered read error is reported the information
field of the sense data shall contain the LBA of the
unrecovered logical block."

Nothing that I can see in either draft links an "unrecovered
(read) error" with the application client retrying the same
command. If "actual retry count" is > 1 in the sense key
specific field then that implies the device has already
tried several times.

SSC-3 (for tape drives) also allows MEDIUM ERROR or
HARDWARE ERROR to indicate an unrecovered read error
(rev 1c, table 2). For tape drives, retrying the same
command is probably not appropriate. [I note that st and
sg set their 'max_retries' to 0 to inhibit this.]

MMC-5 only mentions the HARDWARE ERROR sense key for
a self-diagnostic failure.

This analysis leads me to question why retries are
instigated from the mid level and not from the sd driver
(and perhaps the sr driver as well). If retries were
instigated from sd, it should not retry when the device
indicates that a reasonable number of retries have already
taken place, unless it can change some other factor or is
instructed to do so by some parameter to sd.

As Alan Stern points out, my patch fails the reality test.
The device in question obviously required a retry when
it returned a HARDWARE ERROR sense key (but perhaps the
reason was not an unrecovered error, or it was not reported
properly).

Doug Gilbert
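
For illustration only, the retry policy argued for above could be
sketched roughly as follows. This is Python, not kernel code, and the
function names, the dictionary layout and the device_retry_limit
parameter are all hypothetical. The byte offsets assume fixed-format
sense data (response code 70h or 71h) from a direct-access device, so
that the information field carries an LBA. Retrying when the device
reports no retry count is also just an assumption of this sketch.

```python
import struct

# Sense key values defined by SPC-3.
RECOVERED_ERROR = 0x1
MEDIUM_ERROR = 0x3
HARDWARE_ERROR = 0x4


def parse_fixed_sense(buf):
    """Decode the fixed-format sense fields discussed in this thread."""
    if len(buf) < 8 or (buf[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    sense = {"key": buf[2] & 0x0F, "lba": None, "actual_retry_count": None}
    # VALID bit (byte 0, bit 7): the information field (bytes 3-6) is usable.
    if buf[0] & 0x80:
        sense["lba"] = struct.unpack(">I", buf[3:7])[0]
    # SKSV bit (byte 15, bit 7): sense-key-specific bytes are valid. For
    # these three keys, bytes 16-17 hold the "actual retry count".
    has_count = sense["key"] in (RECOVERED_ERROR, MEDIUM_ERROR, HARDWARE_ERROR)
    if has_count and len(buf) >= 18 and buf[15] & 0x80:
        sense["actual_retry_count"] = struct.unpack(">H", buf[16:18])[0]
    return sense


def upper_level_should_retry(sense, device_retry_limit=1):
    """Hypothetical policy: skip the retry if the device already retried."""
    if sense["key"] not in (MEDIUM_ERROR, HARDWARE_ERROR):
        return False  # other sense keys are outside this sketch
    count = sense["actual_retry_count"]
    if count is None:
        return True  # device gave no retry count (assumed: retry once)
    return count <= device_retry_limit
```

A caller would pass the raw sense buffer to parse_fixed_sense() and
consult upper_level_should_retry() before reissuing the command. A
driver for tape devices would presumably bypass this altogether, much
as st does by setting 'max_retries' to 0.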