From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Bottomley
Subject: Re: [PATCH] sd: sd should not modify read capacity, cache type or write protect flag on rescan when there is a transport error
Date: Mon, 28 Feb 2011 09:34:50 -0600
Message-ID: <1298907291.2487.18.camel@mulgrave.site>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Return-path:
Received: from cantor2.suse.de ([195.135.220.15]:59311 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753899Ab1B1Pe4 (ORCPT ); Mon, 28 Feb 2011 10:34:56 -0500
In-Reply-To:
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Menny_Hamburger@Dell.com
Cc: linux-scsi@vger.kernel.org

On Sun, 2011-02-27 at 14:21 +0000, Menny_Hamburger@Dell.com wrote:
> From: Menny Hamburger
>
> When sd scan fails in apprehending capacity, cache_type or write protect flag
> property from the device, it automatically assigns a default value to the
> failed property. When rescanning, in case of transport/host error, this default
> value is invalid since the problem is with the connection to the device and not in
> the device itself that may (in most cases) still be intact. Applying a default value
> when failing may lead to problems when connection is re-established since the default
> value persists unless an additional rescan is performed.

That's correct.  Zero means we know there's something there but we
couldn't get the necessary information.  A zero size device can't be
read from or written to.

> This problem was witnessed when running in an iSCSI environment under multipath
> (with I/O on the active path). In this case we get a ping-pong effect where
> multipathd switches between alternate paths forever (until rescan) because the
> path checker states that the device is OK, and I/O fails immediately because of
> the 0 capacity (assigned to the device when rescanning while the device was
> disconnected from the storage).
>
> Reproduction over iSCSI environment:
> 1) dd if=/dev/dm-0 of=/dev/zero bs=64 count=10000
> 2) ifdown ethN, ethM, ethK, ... (where ethX is an interface from which the
> machine establishes connection to the storage array).
> 3) iscsiadm -m session -R
> 4) ifup ethN, ethM, ethK, ...

This really doesn't look like a good idea.  It's a layering violation in
that the SCSI mid layer now has to try to determine if certain command
failures are the result of host disruption.

The idea of believing a prior value if a READ_CAPACITY fails also
doesn't look to be such a good one.  This could lead to volume
corruption if the disruption is part of an array configuration.

The correct fix looks to be to initiate a rescan when the host is active
via hotplug, and just teach the path checker about zero size devices?

James
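
For illustration only, here is a minimal userspace sketch of what "teaching the path checker about zero size devices" might look like. It is not the multipath-tools checker API: the names path_state, classify_path and check_path_size are hypothetical, and the sketch assumes the block device size reported by the BLKGETSIZE64 ioctl reflects the zero capacity that sd assigned on the failed rescan.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKGETSIZE64 */

/* Hypothetical result codes; not taken from multipath-tools. */
enum path_state { PATH_STATE_DOWN = 0, PATH_STATE_UP = 1 };

/*
 * Decision step: a path is only usable if the size query succeeded
 * and the device reports a non-zero size.
 */
enum path_state classify_path(int size_query_ok, uint64_t size_bytes)
{
	if (!size_query_ok || size_bytes == 0)
		return PATH_STATE_DOWN;
	return PATH_STATE_UP;
}

/*
 * Open the block device node, ask the kernel for its size in bytes
 * and classify the path.  Any failure to open or query counts as down.
 */
enum path_state check_path_size(const char *devnode)
{
	uint64_t size_bytes = 0;
	int ok;
	int fd = open(devnode, O_RDONLY | O_NONBLOCK);

	if (fd < 0)
		return PATH_STATE_DOWN;

	ok = (ioctl(fd, BLKGETSIZE64, &size_bytes) == 0);
	close(fd);

	return classify_path(ok, size_bytes);
}
```

A checker built this way would report the path as down while the capacity is still zero, so multipathd would not fail back to it until a rescan has restored a real size.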