* [PATCH] sd: retry read_capacity on UNIT_ATTENTION @ 2010-04-01 13:44 Hannes Reinecke 2010-04-01 14:30 ` James Bottomley 0 siblings, 1 reply; 4+ messages in thread From: Hannes Reinecke @ 2010-04-01 13:44 UTC (permalink / raw) To: James Bottomley; +Cc: linux-scsi Hazard testing uncovered yet another bug in sd. Under heavy reset activity the retry counter might be exhausted and the command will be returned with sense UNIT_ATTENTION/0x29/00 (POWER ON, RESET, OR BUS DEVICE RESET OCCURRED). In those cases we should just increase the retry counter again, retrying one more to clear up this Unit Attention state. Signed-off-by: Hannes Reinecke <hare@suse.de> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 1962bea..7d75a21 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1454,8 +1454,15 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, if (media_not_present(sdkp, &sshdr)) return -ENODEV; - if (the_result) + if (the_result) { sense_valid = scsi_sense_valid(&sshdr); + if (sense_valid && + sshdr.sense_key == UNIT_ATTENTION && + sshdr.asc = 0x29 && sshdr.asq == 0x00) + /* Device reset might occur several times, + * give it one more chance */ + retries++; + } retries--; } while (the_result && retries); ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] sd: retry read_capacity on UNIT_ATTENTION 2010-04-01 13:44 [PATCH] sd: retry read_capacity on UNIT_ATTENTION Hannes Reinecke @ 2010-04-01 14:30 ` James Bottomley 2010-04-08 7:36 ` Hannes Reinecke 0 siblings, 1 reply; 4+ messages in thread From: James Bottomley @ 2010-04-01 14:30 UTC (permalink / raw) To: Hannes Reinecke; +Cc: linux-scsi On Thu, 2010-04-01 at 15:44 +0200, Hannes Reinecke wrote: > Hazard testing uncovered yet another bug in sd. Under heavy > reset activity the retry counter might be exhausted and > the command will be returned with sense UNIT_ATTENTION/0x29/00 > (POWER ON, RESET, OR BUS DEVICE RESET OCCURRED). In those > cases we should just increase the retry counter again, > retrying one more to clear up this Unit Attention state. > > Signed-off-by: Hannes Reinecke <hare@suse.de> > > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c > index 1962bea..7d75a21 100644 > --- a/drivers/scsi/sd.c > +++ b/drivers/scsi/sd.c > @@ -1454,8 +1454,15 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, > if (media_not_present(sdkp, &sshdr)) > return -ENODEV; > > - if (the_result) > + if (the_result) { > sense_valid = scsi_sense_valid(&sshdr); > + if (sense_valid && > + sshdr.sense_key == UNIT_ATTENTION && > + sshdr.asc = 0x29 && sshdr.asq == 0x00) ^^^^ should be == > + /* Device reset might occur several times, > + * give it one more chance */ > + retries++; > + } Firstly, not even compile checked: drivers/scsi/sd.c: In function ‘read_capacity_10’: drivers/scsi/sd.c:1558: error: ‘struct scsi_sense_hdr’ has no member named ‘asq’ Secondly, we can't quite do this. Some devices (only broken ones in my experience) will reply UNIT_ATTENTION I was RESET forever, leading to a loop here. Additionally, a massive reset storm on a shared bus would DoS the code here, so there must be a give up point after a reasonable number of retries. The third problem is that if this is happening to a large device, we only catch it in RC10 ... so we'll report undersize if the device is > SPC2 How about this instead? James --- diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 7b75c8a..cdb8ed6 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1432,6 +1432,8 @@ static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device *sdp, #error RC16_LEN must not be more than SD_BUF_SIZE #endif +#define READ_CAPACITY_RETRIES_ON_RESET 10 + static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, unsigned char *buffer) { @@ -1439,7 +1441,7 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, struct scsi_sense_hdr sshdr; int sense_valid = 0; int the_result; - int retries = 3; + int retries = 3, reset_retries = READ_CAPACITY_RETRIES_ON_RESET; unsigned int alignment; unsigned long long lba; unsigned sector_size; @@ -1468,6 +1470,13 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, * Invalid Field in CDB, just retry * silently with RC10 */ return -EINVAL; + if (sense_valid && + sshdr.sense_key == UNIT_ATTENTION && + sshdr.asc == 0x29 && sshdr.ascq == 0x00) + /* Device reset might occur several times, + * give it one more chance */ + if (--reset_retries > 0) + continue; } retries--; @@ -1526,7 +1535,7 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, struct scsi_sense_hdr sshdr; int sense_valid = 0; int the_result; - int retries = 3; + int retries = 3, reset_retries = READ_CAPACITY_RETRIES_ON_RESET; sector_t lba; unsigned sector_size; @@ -1542,8 +1551,16 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, if (media_not_present(sdkp, &sshdr)) return -ENODEV; - if (the_result) + if (the_result) { sense_valid = scsi_sense_valid(&sshdr); + if (sense_valid && + sshdr.sense_key == UNIT_ATTENTION && + sshdr.asc == 0x29 && sshdr.ascq == 0x00) + /* Device reset might occur several times, + * give it one more chance */ + if (--reset_retries > 0) + continue; + } retries--; } while (the_result && retries); -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] sd: retry read_capacity on UNIT_ATTENTION 2010-04-01 14:30 ` James Bottomley @ 2010-04-08 7:36 ` Hannes Reinecke 2010-04-08 13:48 ` James Bottomley 0 siblings, 1 reply; 4+ messages in thread From: Hannes Reinecke @ 2010-04-08 7:36 UTC (permalink / raw) To: James Bottomley; +Cc: linux-scsi James Bottomley wrote: > On Thu, 2010-04-01 at 15:44 +0200, Hannes Reinecke wrote: >> Hazard testing uncovered yet another bug in sd. Under heavy >> reset activity the retry counter might be exhausted and >> the command will be returned with sense UNIT_ATTENTION/0x29/00 >> (POWER ON, RESET, OR BUS DEVICE RESET OCCURRED). In those >> cases we should just increase the retry counter again, >> retrying one more to clear up this Unit Attention state. >> >> Signed-off-by: Hannes Reinecke <hare@suse.de> >> >> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c >> index 1962bea..7d75a21 100644 >> --- a/drivers/scsi/sd.c >> +++ b/drivers/scsi/sd.c >> @@ -1454,8 +1454,15 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, >> if (media_not_present(sdkp, &sshdr)) >> return -ENODEV; >> >> - if (the_result) >> + if (the_result) { >> sense_valid = scsi_sense_valid(&sshdr); >> + if (sense_valid && >> + sshdr.sense_key == UNIT_ATTENTION && >> + sshdr.asc = 0x29 && sshdr.asq == 0x00) > ^^^^ > should be == > >> + /* Device reset might occur several times, >> + * give it one more chance */ >> + retries++; >> + } > > Firstly, not even compile checked: > > drivers/scsi/sd.c: In function ‘read_capacity_10’: > drivers/scsi/sd.c:1558: error: ‘struct scsi_sense_hdr’ has no member named ‘asq’ > D'oh. > Secondly, we can't quite do this. Some devices (only broken ones in my > experience) will reply UNIT_ATTENTION I was RESET forever, leading to a > loop here. Additionally, a massive reset storm on a shared bus would > DoS the code here, so there must be a give up point after a reasonable > number of retries. > Hmm. yes. > The third problem is that if this is happening to a large device, we > only catch it in RC10 ... so we'll report undersize if the device is > > SPC2 > Okay. In the best of all worlds we would have a module parameter which would us to adjust this parameter, as I fear the actual number of retries will depend on the number of devices connected. But if you fell that's overkill it's fine by me, too. > How about this instead? > Yes, that's better. Thanks. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] sd: retry read_capacity on UNIT_ATTENTION 2010-04-08 7:36 ` Hannes Reinecke @ 2010-04-08 13:48 ` James Bottomley 0 siblings, 0 replies; 4+ messages in thread From: James Bottomley @ 2010-04-08 13:48 UTC (permalink / raw) To: Hannes Reinecke; +Cc: linux-scsi On Thu, 2010-04-08 at 09:36 +0200, Hannes Reinecke wrote: > James Bottomley wrote: > > On Thu, 2010-04-01 at 15:44 +0200, Hannes Reinecke wrote: > >> Hazard testing uncovered yet another bug in sd. Under heavy > >> reset activity the retry counter might be exhausted and > >> the command will be returned with sense UNIT_ATTENTION/0x29/00 > >> (POWER ON, RESET, OR BUS DEVICE RESET OCCURRED). In those > >> cases we should just increase the retry counter again, > >> retrying one more to clear up this Unit Attention state. > >> > >> Signed-off-by: Hannes Reinecke <hare@suse.de> > >> > >> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c > >> index 1962bea..7d75a21 100644 > >> --- a/drivers/scsi/sd.c > >> +++ b/drivers/scsi/sd.c > >> @@ -1454,8 +1454,15 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, > >> if (media_not_present(sdkp, &sshdr)) > >> return -ENODEV; > >> > >> - if (the_result) > >> + if (the_result) { > >> sense_valid = scsi_sense_valid(&sshdr); > >> + if (sense_valid && > >> + sshdr.sense_key == UNIT_ATTENTION && > >> + sshdr.asc = 0x29 && sshdr.asq == 0x00) > > ^^^^ > > should be == > > > >> + /* Device reset might occur several times, > >> + * give it one more chance */ > >> + retries++; > >> + } > > > > Firstly, not even compile checked: > > > > drivers/scsi/sd.c: In function ‘read_capacity_10’: > > drivers/scsi/sd.c:1558: error: ‘struct scsi_sense_hdr’ has no member named ‘asq’ > > > D'oh. > > > Secondly, we can't quite do this. Some devices (only broken ones in my > > experience) will reply UNIT_ATTENTION I was RESET forever, leading to a > > loop here. Additionally, a massive reset storm on a shared bus would > > DoS the code here, so there must be a give up point after a reasonable > > number of retries. > > > Hmm. yes. > > > The third problem is that if this is happening to a large device, we > > only catch it in RC10 ... so we'll report undersize if the device is > > > SPC2 > > > Okay. In the best of all worlds we would have a module parameter which > would us to adjust this parameter, as I fear the actual number of retries > will depend on the number of devices connected. > > But if you fell that's overkill it's fine by me, too. A module parameter probably is overkill. Once we get beyond 5, that's the usual retry limit for ordinary commands, so even if we survive beyond READ_CAPACITY, we'll begin failing in the actual operational commands like reads and writes (beginning with partition size). > > How about this instead? > > > Yes, that's better. Thanks. OK, will put it in with your ack then. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2010-04-08 13:48 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-04-01 13:44 [PATCH] sd: retry read_capacity on UNIT_ATTENTION Hannes Reinecke 2010-04-01 14:30 ` James Bottomley 2010-04-08 7:36 ` Hannes Reinecke 2010-04-08 13:48 ` James Bottomley
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox