From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vu Pham
Subject: Re: [PATCH 00/11] First pass at merging Bart's HA work
Date: Fri, 7 Dec 2012 13:47:14 -0800
Message-ID: <50C263E2.1070805@mellanox.com>
References: <1353957308.2681.5.camel@dabdike>
 <1353989041.28917.24.camel@obelisk.thedillows.org>
 <1354242098.3670.3.camel@obelisk.thedillows.org>
 <50BF9760.2080801@acm.org>
 <50C0A76C.20500@acm.org>
 <50C0AB42.8040402@mellanox.com>
 <50C0B407.4010706@acm.org>
 <50C0BFE0.909@mellanox.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <50C0BFE0.909-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Alex Turin
Cc: Bart Van Assche, Or Gerlitz, David Dillow, Roland Dreier,
 James Bottomley,
 "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org",
 linux-scsi,
 "fujita.tomonori-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org",
 "rcj-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org"
List-Id: linux-rdma@vger.kernel.org

Alex Turin wrote:
> On 12/6/2012 5:04 PM, Bart Van Assche wrote:
>
>> On 12/06/12 15:27, Or Gerlitz wrote:
>>
>>> The core problem here seems to be that scsi_remove_host simply never
>>> ends.
>>>
>> Hello Or,
>>
>> The later patches in the srp-ha patch series avoided such behavior by
>> checking whether the connection between SRP initiator and target is
>> unique, and by removing duplicate SCSI hosts for which the transport
>> layer failed. Unfortunately these patches are still under review.
>> Unless someone can come up with a better solution I will post a patch
>> one of the next days that makes ib_srp again fail all commands after
>> host removal started. That will avoid spending a long time doing error
>> recovery.
>>
>> Also, you might have noticed that Hannes Reinecke reported a few days
>> ago that the SCSI error handler may need a lot of time for other
>> transport types - this behavior is not SRP specific.
>>
>> Bart.
>>
>>
> Hello Bart,
>
> In our case we don't have duplicate hosts or targets. We are working
> with a single SCSI disk.
> To make scsi_remove_host hang we simply disabling a IB port and run "dd
> if=/dev/sdb of=/dev/null count=1".
>
>
Hello Bart,

I applied your latest patch, "[PATCH for-next] IB/srp: Make SCSI error
handling finish", and tested it.

Let me capture what I'm seeing:

The host has two paths (scsi_host 7 and 8) to the target through two
physical ports, 1 and 2.

[root@rsws42 ~]# multipath -l
size=50G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 7:0:0:11 sdb 8:16 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 8:0:0:11 sdc 8:32 active undef running

I simulated a cable pull by disabling port 1. I/Os fail over fine; the
problem is cleaning up scsi_host 7 on the failed path.

After the IB RC failure, SCSI error recovery kicks in.
srp_reconnect_target() fails and srp_remove_target() runs to remove
scsi_host 7; however, I think it gets stuck at device_del(dev) inside
__scsi_remove_device(dev).

Error recovery keeps repeating on scsi_host 7 for 9-10 minutes.
scsi_host 7 cannot be cleaned up: its sysfs entry is still there
(/sys/class/scsi_host/host7) and its state is SHOST_CANCEL.

I brought port 1 back online, but scsi_host 7 cannot reconnect to the
target because its state is SRP_TARGET_REMOVED. The scsi_host 7 sysfs
entry does not contain the target login info (ioc_guid, id_ext, dgid...).

I think srp_daemon can reconnect to the target by creating a new path
with a new scsi host; however, I cannot check because I currently don't
have a working srp_daemon. I need to reconnect to the target manually
with an echo command.

Bottom line: I/Os can fail over and fail back; however, the old scsi
hosts cannot be removed (their sysfs entries are still there) and stay
in state SHOST_CANCEL, and error recovery keeps happening on the old
scsi hosts for 10-20 minutes.
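For anyone trying to reproduce this, here is a minimal sketch of how the
lingering hosts could be spotted from user space. It assumes each host
directory under /sys/class/scsi_host exposes a "state" attribute and that
a healthy host reports "running"; the function names and the sysfs_root
parameter are hypothetical and exist only for this illustration.

```python
import os


def list_host_states(sysfs_root="/sys/class/scsi_host"):
    """Return {host_name: state} for every host directory under sysfs_root.

    A host whose "state" attribute cannot be read is mapped to None.
    """
    states = {}
    if not os.path.isdir(sysfs_root):
        return states
    for name in sorted(os.listdir(sysfs_root)):
        state_path = os.path.join(sysfs_root, name, "state")
        try:
            with open(state_path) as f:
                states[name] = f.read().strip()
        except OSError:
            states[name] = None
    return states


def lingering_hosts(states, healthy="running"):
    """Return the sorted host names whose state differs from `healthy`.

    Assumption: "running" is what a healthy host reports.
    """
    return sorted(h for h, s in states.items() if s != healthy)


if __name__ == "__main__":
    current = list_host_states()
    for host in lingering_hosts(current):
        print(host, current[host])
```

Running it in a loop while port 1 is down should show how long host7
stays in a non-running state before (if ever) it disappears.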

thanks,
-vu