From mboxrd@z Thu Jan 1 00:00:00 1970
From: Lars Marowsky-Bree
Subject: Re: [dm-devel] Re: fastfail operation and retries
Date: Thu, 21 Apr 2005 23:18:51 +0200
Message-ID: <20050421211851.GS17315@marowsky-bree.de>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Received: from gate.in-addr.de ([212.8.193.158]:14520 "EHLO mx.in-addr.de") by vger.kernel.org with ESMTP id S261884AbVDUVS6 (ORCPT ); Thu, 21 Apr 2005 17:18:58 -0400
Content-Disposition: inline
In-Reply-To:
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: device-mapper development , Andreas Herrmann
Cc: Linux SCSI

On 2005-04-21T17:02:44, "goggin, edward" wrote:

> Depending on the "queue_if_no_path" feature has the current undesirable
> side-effect of requiring intervention of the user space multipath components
> to reinstate at least one of the paths to a useable state in the multipath
> target driver. This dependency currently creates the potential for deadlock
> scenarios since the user space multipath components (nor the kernel for that
> matter) are currently architected to avoid them.

multipath-tools is, to a certain degree, architected to avoid them. And
the kernel is meant to be, too - there's bugs and known FIXME's, but
those are just bugs and we're taking patches gladly ;-)

> I think for now it may be better to try to avoid having to fail a path if it
> is possible that an io error is not path related.

No. Basically every time out error creates a "dunno why" error right now
- could be the storage system itself, could be the network in between. A
failover to another path is the obvious remedy; take for example the CX
series where even if it's not the path, it's the SP, and failing over
to the other SP will cure the problem.

If the storage at least rejects the IO with a specific error code, it
can be worked around by a specific hw handler which doesn't fail the
path but just causes the IO to be queued and retried; that's a pretty
simple hardware handler to write.

But quite frankly, storage subsystems which _reject_ all IO for a given
time are just broken for reliable configurations. What good are they in
multipath configurations if they fail _all_ paths at the same time? How
can they even dare claim redundancy? We can build more or less smelly
kludges around them, but it remains a problem to be fixed at the storage
subsystem level IMNSHO.


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
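
Editorial sketch: the policy described in the message (fail over on a timeout whose cause is unknown, but queue and retry when the storage rejects the IO with a specific error code) might look roughly like the following. This is a minimal illustration in plain C, not the dm-multipath hardware handler interface; the enum names and the function `classify_io_error` are hypothetical and were invented for this example.

```c
/*
 * Hypothetical error classes, invented for illustration only.
 * They do not correspond to real kernel or SCSI definitions.
 */
enum io_error {
    IO_ERR_TIMEOUT,         /* no answer: could be the path, the network, or the array */
    IO_ERR_STORAGE_REJECT,  /* array rejected the IO with a specific, recognizable code */
    IO_ERR_OTHER            /* anything else */
};

/* Hypothetical actions a handler could request. */
enum path_action {
    ACTION_FAIL_PATH,       /* mark the path failed and fail over to another one */
    ACTION_QUEUE_RETRY      /* keep the path; queue the IO and retry it later */
};

/*
 * Decide what to do with a failed IO.
 * Only the recognizable "storage rejected this" case avoids failing
 * the path; every other error is treated as a reason to fail over.
 */
enum path_action classify_io_error(enum io_error err)
{
    switch (err) {
    case IO_ERR_STORAGE_REJECT:
        return ACTION_QUEUE_RETRY;
    case IO_ERR_TIMEOUT:
    case IO_ERR_OTHER:
    default:
        return ACTION_FAIL_PATH;
    }
}
```

A real handler would also have to recognize the array's actual error code and bound how long IO stays queued; both are left out here because the message does not specify them.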