From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christophe Varoqui
Subject: Re: Problems with multipathing
Date: Tue, 18 Apr 2006 06:47:58 +0200
Message-ID: <44446F7E.2060502@free.fr>
References: <443BF12E.50900@ludd.luth.se> <443BF87A.10804@free.fr> <443C19E6.8000607@ludd.luth.se> <443E870E.1070902@ludd.luth.se> <443EB907.1070600@free.fr> <4442E2ED.7020405@ludd.luth.se> <4443635F.5000401@free.fr> <4443A390.5010005@ludd.luth.se>
Reply-To: device-mapper development
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <4443A390.5010005@ludd.luth.se>
Sender: dm-devel-bounces@redhat.com
Errors-To: dm-devel-bounces@redhat.com
To: device-mapper development
List-Id: dm-devel.ids

Roger Håkansson wrote:
> Christophe Varoqui wrote:
>
>> Do failover device nodes get reassigned during the rescan ?
>> Like, for example, a configured path sda gets removed and a new path sdb
>> appears ?
>>
>
> No, since I don't do a rescan on the bus but just on the target itself.
> When I had the controller in non-hubbed mode and did (when a controller
> has failed) "echo 1> /sys/class/scsi_host/host[1-2]/scan" I got two new
> devices, sde and sdf (I normally have sda, sdb, sdc and sdd)
> But if I instead did "echo 1 >
> /sys/class/scsi_device/[1-2]:0:0:0/device/rescan", I didn't get any new
> devices but the old ones start working again.
> Now, when I have the box in hubbed-mode, I can't seem to get new devices
> even when I do a scsi-host-scan, but just as before, a
> scsi-target-rescan will get my devices back to order again.
>
>
OK. Have you tried sending a START_STOP SCSI command (with sg_start from
sg3_utils) to the affected LUN instead of rescanning the target ?

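Something along these lines, as a rough sketch only. The start_lun wrapper and its DRY_RUN switch are made up for illustration, and I am assuming the --start option spelling of recent sg3_utils; older releases take "sg_start 1 <device>" instead, so check the man page of the version you have installed.

```shell
#!/bin/sh
# Hypothetical helper: send a START STOP UNIT (start) command to one path
# device through sg_start. With DRY_RUN=1 it only prints the command it
# would run, which is handy for checking the device list first.
start_lun() {
    dev="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sg_start --start $dev"
    else
        # Assumption: --start is accepted by the installed sg3_utils.
        sg_start --start "$dev"
    fi
}
```

For example, "DRY_RUN=1 start_lun /dev/sdb" prints the command for the sdb path without touching the device.
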
> Also, I've noticed that it's not only when a controller fails that this
> happens, when a failed controller is "revived" the same thing might happen.
>
> As far as I've been able to tell, the more I/O-transactions at the time
> of the failure, the more likely that the (SCSI) device will be marked as
> "dead".
> If I do "while /bin/true ;do dd if=/dev/zero of=/mnt/test count=20000;
> sleep 1" and fail (or revive) a controller it seems to work in 50% of
> the cases, with 2 sec sleep there is rarely any problem but with no
> sleep at all it fails nearly 100% of the times
> And in all types of tests, if I do a SCSI-target(path) rescan before
> multipath decides both paths are dead, both paths will work again and
> the multipath-device will never fail.
>
>
I see your features still don't include "queue_if_no_path". You seem to
really need it: with it, I/O is queued while no path is usable instead of
being failed up to the filesystem, so ext3 would not abort its journal
before the paths come back.

>> If so, the FC transport class is in charge of the timeout triggering the
>> dead devices removal.
>> A hardware handler wouldn't help here.
>>
>> Can you paste a before/after scsi rescan "multipath -l" output ?
>>
>>
>
> They are identical
>
> [root@asl005 ~]# multipath -l
> mpath1 (3600d0230000000000b01910b4d313400)
> [size=97 GB][features=0][hwhandler=0]
> \_ round-robin 0 [prio=0][active]
>  \_ 1:0:0:0 sdb 8:16 [active][undef]
>  \_ 2:0:0:0 sdc 8:32 [active][undef]
> [root@asl005 ~]# dmesg |tail -20
> SCSI error : <1 0 0 0> return code = 0x20008
> end_request: I/O error, dev sdb, sector 21247352
> end_request: I/O error, dev sdb, sector 21247360
> SCSI error : <1 0 0 0> return code = 0x20008
> end_request: I/O error, dev sdb, sector 21036576
> end_request: I/O error, dev sdb, sector 21036584
> Aborting journal on device dm-5.
> ext3_abort called.
> EXT3-fs error (device dm-5): ext3_journal_start_sb: Detected aborted journal
> Remounting filesystem read-only
> EXT3-fs error (device dm-5) in start_transaction: Journal has aborted
> __journal_remove_journal_head: freeing b_committed_data
> printk: 254766 messages suppressed.
> Buffer I/O error on device dm-5, logical block 2092209
> lost page write due to I/O error on dm-5
> Buffer I/O error on device dm-5, logical block 2093234
> lost page write due to I/O error on dm-5
> printk: 485 messages suppressed.
> Buffer I/O error on device dm-5, logical block 1
> lost page write due to I/O error on dm-5
> [root@asl005 ~]# multipath -l
> mpath1 (3600d0230000000000b01910b4d313400)
> [size=97 GB][features=0][hwhandler=0]
> \_ round-robin 0 [prio=0][enabled]
>  \_ 1:0:0:0 sdb 8:16 [failed][undef]
>  \_ 2:0:0:0 sdc 8:32 [failed][undef]
> [root@asl005 ~]# echo 1 >
> /sys/class/fc_transport/target1\:0\:0/device/1\:0\:0\:0/rescan
> [root@asl005 ~]# multipath -l
> mpath1 (3600d0230000000000b01910b4d313400)
> [size=97 GB][features=0][hwhandler=0]
> \_ round-robin 0 [prio=0][enabled]
>  \_ 1:0:0:0 sdb 8:16 [failed][undef]
>  \_ 2:0:0:0 sdc 8:32 [failed][undef]
> [root@asl005 ~]# multipath
> [root@asl005 ~]# multipath -ll
> mpath1 (3600d0230000000000b01910b4d313400)
> [size=97 GB][features=0][hwhandler=0]
> \_ round-robin 0 [prio=0][active]
>  \_ 1:0:0:0 sdb 8:16 [active][undef]
>  \_ 2:0:0:0 sdc 8:32 [active][undef]
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>
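
The transcript above shows the paths only going back to [active] after a rescan followed by a plain "multipath" run. If you want to script that workaround while testing, a minimal sketch could look like the following. The failed_paths helper name is hypothetical, and the parsing assumes the exact "multipath -l" layout pasted above (path lines of the form "\_ H:C:T:L dev maj:min [state][...]"); other multipath-tools versions may print something different.

```shell
#!/bin/sh
# Hypothetical helper: read "multipath -l" output on stdin and print
# "H:C:T:L devname" for every path line marked [failed].
failed_paths() {
    # On a path line, field 2 is the SCSI address and field 3 the device.
    awk '/\[failed\]/ { print $2, $3 }'
}

# Example use (needs root and real devices, so it is left commented out):
#   multipath -l | failed_paths | while read -r hctl dev; do
#       echo 1 > "/sys/class/scsi_device/$hctl/device/rescan"
#   done
#   multipath    # re-run so the map picks the paths up again
```

This only automates the manual steps from the transcript; it does nothing about I/O that has already been failed up to the filesystem.
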