* do Symmetrix multipath-tools defaults need update ? or scsi-to-blk errors management ?
From: christophe.varoqui @ 2009-06-10 21:49 UTC
To: Levy_Jerome; +Cc: device-mapper development, linux-scsi

Hi Jerome,

EMC recently asked a client of mine, who is also one of yours, to activate "queue_if_no_path" on Symmetrix logical units, which is not the current default setting in the upstream multipath-tools package.

I'd like to know if you intend to submit a patch to change the default setting accordingly, or if you'd rather leave the no-queueing default unchanged and work on fixing the root cause of this issue.

::: Background information, root cause :::

The Symmetrix array has been observed to return SCSI I/O errors to submitters in certain circumstances (I was told of errors on the R1+R2 network link). The Linux kernel, lacking finesse in SCSI->DM error reporting, ends up invalidating each path of the multipath in turn before the multipathd daemon gets a chance to revalidate them. With "queue_if_no_path" disabled, the I/O errors end up in the FS layer and in the userspace submitter.

::: error log on a 2.6.9 (rhel 4.7) kernel :::

SCSI error : <h b t l> return code 0x8000002
current sday: sense key Aborted Command
Additional sense: Internal target failure
end_request: I/O error, dev sday, sector XXXXX
device-mapper: dm-multipath: Failing path 67:32.

::: unfortunate side effect of queue_if_no_path :::

Activating "queue_if_no_path" is certainly an efficient work-around for this kind of short-lived, retriable error, but this feature compromises data protection on clusters relying on persistent reservations to fence I/Os from passive nodes. Ironically, the reason is quite similar: SCSI return codes for reservation conflicts also end up invalidating each path of a multipath and, worse, the I/O causing the conflict gets queued and retried until the poor active node drops its reservation, unleashing data-corrupting I/Os from the passive nodes' queues onto the logical unit.

::: error log on a 2.6.29.x kernel for a reservation conflict :::

sd h:b:t:l: reservation conflict
sd h:b:t:l: [sdu] Unhandled error code
sd h:b:t:l: [sdu] Result: hostbyte=DID_OK driver_byte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdu, sector XXXXX
device-mapper: dm-multipath: Failing path 65:64.

::: persistent reservation + queue_if_no_path, possible solution ? :::

It seems to me that scsi_lib.c::scsi_io_completion() should be able to cancel an I/O that hits a reservation conflict and signal blk_end_request() with no error reported. Please comment.

Best regards,
cvaroqui

* Re: do Symmetrix multipath-tools defaults need update ? or scsi-to-blk errors management ?
From: Mike Christie @ 2009-06-10 23:34 UTC
To: device-mapper development; +Cc: Levy_Jerome, linux-scsi

On 06/10/2009 04:49 PM, christophe.varoqui@free.fr wrote:
> Hi Jerome,
>
> EMC recently asked a client of mine, who is also one of yours, to activate "queue_if_no_path" on Symmetrix logical units, which is not the current default setting in the upstream multipath-tools package.
>
> I'd like to know if you intend to submit a patch to change the default setting accordingly, or if you'd rather leave the no-queueing default unchanged and work on fixing the root cause of this issue.
>
> ::: Background information, root cause :::
>
> The Symmetrix array has been observed to return SCSI I/O errors to submitters in certain circumstances (I was told of errors on the R1+R2 network link). The Linux kernel, lacking finesse in SCSI->DM error reporting, ends up invalidating each path of the multipath in turn before the multipathd daemon gets a chance to revalidate them. With "queue_if_no_path" disabled, the I/O errors end up in the FS layer and in the userspace submitter.
>
> ::: error log on a 2.6.9 (rhel 4.7) kernel :::
>

For RH 4.9 I did the attached patch. So this error is not fastfailed (upstream does not fastfail this type of error when using dm-multipath now). So now the scsi layer will retry its normal 5 times, then fail.

> SCSI error :<h b t l> return code 0x8000002
> current sday: sense key Aborted Command
> Additional sense: Internal target failure
> end_request: I/O error, dev sday, sector XXXXX
> device-mapper: dm-multipath: Failing path 67:32.
>
> ::: unfortunate side effect of queue_if_no_path :::
>
> Activating "queue_if_no_path" is certainly an efficient work-around for this kind of short-lived, retriable error, but this feature compromises data protection on clusters relying on persistent reservations to fence I/Os from passive nodes. Ironically, the reason is quite similar: SCSI return codes for reservation conflicts also end up invalidating each path of a multipath and, worse, the I/O causing the conflict gets queued and retried until the poor active node drops its reservation, unleashing data-corrupting I/Os from the passive nodes' queues onto the logical unit.
>
> ::: error log on a 2.6.29.x kernel for a reservation conflict :::
>
> sd h:b:t:l: reservation conflict
> sd h:b:t:l: [sdu] Unhandled error code
> sd h:b:t:l: [sdu] Result: hostbyte=DID_OK driver_byte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdu, sector XXXXX
> device-mapper: dm-multipath: Failing path 65:64.
>
> ::: persistent reservation + queue_if_no_path, possible solution ? :::
>
> It seems to me that scsi_lib.c::scsi_io_completion() should be able to cancel an I/O that hits a reservation conflict and signal blk_end_request() with no error reported.
>

I was just about to post new blkerr patches. For this we just want multipath to fail this IO right away, right? So have scsi return some fatal error, then dm-multipath will see it and not retry that IO?

[-- Attachment #2: dont-failfast-dev-errs.patch --]

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 7309f12..d5a3390 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1390,7 +1390,7 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 	case CHECK_CONDITION:
 		rtn = scsi_check_sense(scmd);
 		if (rtn == NEEDS_RETRY)
-			goto maybe_retry;
+			goto check_retry_count;
 		/* if rtn == FAILED, we have no sense information;
 		 * returning FAILED will wake the error handler thread
 		 * to collect the sense and redo the decide

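
The effect of the one-line patch above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: the struct, the enum and both function bodies are hypothetical and are inferred from the label names `maybe_retry` and `check_retry_count` and from the description in the message ("retry its normal 5 times, then fail"). It also assumes that I/O submitted through dm-multipath carries a failfast flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome names echo the SCSI midlayer's; the values are local to this sketch. */
enum disposition { DISP_NEEDS_RETRY, DISP_SUCCESS };

/* Hypothetical stand-in for the few command fields the decision reads. */
struct cmd_model {
    int retries;    /* attempts already made */
    int allowed;    /* retry budget, 5 per the message above */
    bool failfast;  /* assumed to be set on I/O coming from dm-multipath */
};

/* Assumed old behaviour ("maybe_retry"): a failfast command is never retried. */
static enum disposition maybe_retry(struct cmd_model *c)
{
    assert(c->allowed >= 0);
    if (++c->retries <= c->allowed && !c->failfast)
        return DISP_NEEDS_RETRY;
    return DISP_SUCCESS; /* command completes, the error goes to the upper layer */
}

/* Assumed patched behaviour ("check_retry_count"): only the budget decides. */
static enum disposition check_retry_count(struct cmd_model *c)
{
    assert(c->allowed >= 0);
    if (++c->retries <= c->allowed)
        return DISP_NEEDS_RETRY;
    return DISP_SUCCESS;
}

/* Count how often a persistently failing command would be re-issued. */
static int count_retries(enum disposition (*decide)(struct cmd_model *),
                         bool failfast)
{
    struct cmd_model c = { 0, 5, failfast };
    int n = 0;

    while (decide(&c) == DISP_NEEDS_RETRY)
        n++;
    return n;
}
```

In this model, `count_retries(maybe_retry, true)` yields no retries, so a single "Aborted Command" on a multipath device is passed straight up and the path is failed, while `count_retries(check_retry_count, true)` re-issues the command five times before giving up.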
* Re: [dm-devel] do Symmetrix multipath-tools defaults need update ? or scsi-to-blk errors management ?
From: christophe.varoqui @ 2009-06-11 4:58 UTC
To: device-mapper development; +Cc: Levy Jerome, linux-scsi

>>
>> ::: error log on a 2.6.9 (rhel 4.7) kernel :::
>>
>
> For RH 4.9 I did the attached patch. So this error is not fastfailed
> (upstream does not fastfail this type of error when using dm-multipath
> now). So now the scsi layer will retry its normal 5 times, then fail.
>

Thank you for the information. This is very good news.
Can you also advise about the rhel 5 minimum kernel version ?

>>
>> ::: error log on a 2.6.29.x kernel for a reservation conflict :::
>>
> I was just about to post new blkerr patches. For this we just want
> multipath to fail this IO right away, right? So have scsi return some
> fatal error, then dm-multipath will see it and not retry that IO?
>

Yes, that would do.

Thank you very much for your prompt response and your work on this.

Regards,
cvaroqui

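
The behaviour agreed on above (a reservation conflict reported as a fatal error that dm-multipath neither retries nor queues) can be sketched as a small decision function. Everything here is a hypothetical model: the error classes, the outcome struct and `model_mpath_end_io()` are names invented for illustration and do not correspond to the blkerr patches mentioned in the thread, which are not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical error classes handed from the SCSI layer to the block layer. */
enum blk_err_class {
    ERR_RETRYABLE_PATH, /* path-related: another path may succeed */
    ERR_FATAL_TARGET    /* e.g. reservation conflict: every path fails alike */
};

/* What the multipath layer does with a failed I/O in this model. */
struct mpath_outcome {
    bool path_failed; /* path marked failed until it is revalidated */
    bool io_queued;   /* I/O held back for a later retry */
    bool io_failed;   /* error returned to the submitter */
};

static struct mpath_outcome model_mpath_end_io(enum blk_err_class err,
                                               int other_paths_up,
                                               bool queue_if_no_path)
{
    struct mpath_outcome out = { false, false, false };

    assert(other_paths_up >= 0);

    if (err == ERR_FATAL_TARGET) {
        /* Proposed behaviour: fail this I/O at once and leave the path alone. */
        out.io_failed = true;
        return out;
    }

    /* Path error: fail the path, then retry on another path or queue. */
    out.path_failed = true;
    if (other_paths_up > 0 || queue_if_no_path)
        out.io_queued = true;
    else
        out.io_failed = true;
    return out;
}
```

With a reservation conflict classified as `ERR_RETRYABLE_PATH`, the last path is failed and, with "queue_if_no_path" set, the I/O is queued, which is the fencing hole described in the first message. Classified as `ERR_FATAL_TARGET`, the same I/O is returned to the passive node's submitter with an error and no path is failed.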
* Re: do Symmetrix multipath-tools defaults need update ? or scsi-to-blk errors management ?
From: Mike Christie @ 2009-06-11 5:11 UTC
To: christophe.varoqui; +Cc: device-mapper development, Levy Jerome, linux-scsi

On 06/10/2009 11:58 PM, christophe.varoqui@free.fr wrote:
>>> ::: error log on a 2.6.9 (rhel 4.7) kernel :::
>>>
>> For RH 4.9 I did the attached patch. So this error is not fastfailed
>> (upstream does not fastfail this type of error when using dm-multipath
>> now). So now the scsi layer will retry its normal 5 times, then fail.
>>
> Thank you for the information. This is very good news.
> Can you also advise about the rhel 5 minimum kernel version ?

It should be fixed in rhel 5.3, and the change was merged in the upstream 2.6.28 kernel.