* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Xose Vazquez Perez @ 2018-03-08 15:54 UTC
To: Chongyun Wu, Martin Wilck, Benjamin Marzinski, 'Christophe Varoqui', 'Hannes Reinecke'
Cc: Guozhonghua, Changwei Ge, Changlimin, device-mapper development

On 03/08/2018 09:03 AM, Chongyun Wu wrote:

[add dm-devel@redhat.com]

> 360002ac000000000000004f40001e2d7 dm-5 3PARdata,VV
> size=13G features='1 queue_if_no_path' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 3:0:0:3 sdk 8:160 active ready running
>   |- 4:0:0:3 sdn 8:208 active ready running
>   |- 3:0:0:6 sdo 8:224 failed faulty running
>   `- 4:0:0:6 sdp 8:240 failed faulty running

3PAR arrays are able to use ALUA, but with *all ports* *across all
controllers* in a *single Target Port Group*:

31000000000000a000000000000001000 dm-0 3PARdata,VV
size=50G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:0 sda 8:0 active ready running
  |- 0:0:1:0 sdf 8:80 active ready running
  |- 1:0:0:0 sdk 8:160 active ready running
  `- 1:0:1:0 sdp 8:240 active ready running

And it's recommended by the manufacturer.

In the StoreServ Management Console, "Host:" should be changed to
"Generic-ALUA" "Persona 2" "(UARepLun, SESLun, ALUA)".

multipath-tools (*upstream*) is already configured by default to use ALUA
with 3PARdata, since c1b7f7f7:
https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commitdiff;h=c1b7f7f7

Run "multipath -d -v3" to see the config by lun.
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Chongyun Wu @ 2018-03-09 6:11 UTC
To: Xose Vazquez Perez, Martin Wilck, Benjamin Marzinski, 'Christophe Varoqui', 'Hannes Reinecke'
Cc: Guozhonghua, Changwei Ge, Changlimin, device-mapper development

On 2018/3/8 23:54, Xose Vazquez Perez wrote:
> On 03/08/2018 09:03 AM, Chongyun Wu wrote:
>
> [add dm-devel@redhat.com]
>
>> 360002ac000000000000004f40001e2d7 dm-5 3PARdata,VV
>> size=13G features='1 queue_if_no_path' hwhandler='0' wp=rw
>> `-+- policy='round-robin 0' prio=1 status=active
>>   |- 3:0:0:3 sdk 8:160 active ready running
>>   |- 4:0:0:3 sdn 8:208 active ready running
>>   |- 3:0:0:6 sdo 8:224 failed faulty running
>>   `- 4:0:0:6 sdp 8:240 failed faulty running
>
> 3PAR arrays are able to use ALUA, but with *all ports*
> *across all controllers* in *single Target Port Group* :
>
> 31000000000000a000000000000001000 dm-0 3PARdata,VV
> size=50G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> `-+- policy='service-time 0' prio=50 status=active
>   |- 0:0:0:0 sda 8:0 active ready running
>   |- 0:0:1:0 sdf 8:80 active ready running
>   |- 1:0:0:0 sdk 8:160 active ready running
>   `- 1:0:1:0 sdp 8:240 active ready running
>
> And it's recommended by the manufacturer.
>
>
> From the StoreServ Management Console "Host:" should be changed to
> "Generic-ALUA" "Persona 2" "(UARepLun, SESLun, ALUA)".
>
> multipath-tools( *upstream* ) is already configured by default to use ALUA
> with 3PARdata, since c1b7f7f7: https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commitdiff;h=c1b7f7f7
>
> Run "multipath -d -v3" to see the config by lun.
>

Hi Xose Vazquez Perez,
Thanks for the reminder, I will check it. Thanks again~

Regards,
Chongyun
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Chongyun Wu @ 2018-03-09 6:47 UTC
To: Benjamin Marzinski
Cc: 'Xose Vazquez Perez', Guozhonghua, dm-devel@redhat.com, Changwei Ge, Changlimin, Martin Wilck

On 2018/3/8 23:45, Benjamin Marzinski wrote:
> On Thu, Mar 08, 2018 at 08:03:50AM +0000, Chongyun Wu wrote:
>> On 2018/3/7 20:45, Martin Wilck wrote:
>>> On Wed, 2018-03-07 at 01:45 +0000, Chongyun Wu wrote:
>>>>
>>>> Hi Martin,
>>>> Your analysis is correct. Did you have any good idea to deal with
>>>> this issue?
>>>
>>> Could you maybe explain what was causing the issue in the first place?
>>> Did you reconfigure the storage in any particular way?
>>>
>>> If yes, I think "multipathd reconfigure" would be the correct way to
>>> deal with the problem. It re-reads everything, so it should get rid of
>>> the stale paths.
>>>
>>> Regards
>>> Martin
>>>
>>
>> I have used "multipathd reconfigure", but the zombie (or stale) paths are
>> still there; even restarting multipath-tools can't clean up those zombie
>> paths.
>>
>> Steps to reproduce the issue:
>> (1) export the LUN (LUN1) to the server (host1) with LUN value *6* in the
>> storage array;
>> (2) scan for LUN1 on host1 and create the multipath device;
>> (3) delete the multipath device on host1;
>> (4) unexport LUN1 from host1 in the storage array;
>> (5) export the LUN (LUN1) to the server (host1) with LUN value *3* in the
>> storage array;
>> (6) scan for LUN1 on host1 and create the multipath device; the zombie
>> paths show up like below:
>> 360002ac000000000000004f40001e2d7 dm-5 3PARdata,VV
>> size=13G features='1 queue_if_no_path' hwhandler='0' wp=rw
>> `-+- policy='round-robin 0' prio=1 status=active
>>   |- 3:0:0:3 sdk 8:160 active ready running
>>   |- 4:0:0:3 sdn 8:208 active ready running
>>   |- 3:0:0:6 sdo 8:224 failed faulty running
>>   `- 4:0:0:6 sdp 8:240 failed faulty running
>> Those zombie paths are actually caused by cancelling the old export
>> relation in the storage array and changing to a new export relation (given
>> a different LUN value, the kernel will create a new device for it); the old
>> device stays in the system, which I call a zombie path or stale path.
>>
>> I'm sorry that my first description wasn't so clear and may have been
>> misleading. The description *a lun can't be exported from a different
>> lun number to a host at the same time* is actually not the criterion for
>> finding zombie paths. I have tested that the storage has no such
>> restriction: we can export one LUN to a server under different LUN numbers
>> at the same time. But my patch does not care about this scenario, because
>> paths exported several times under different LUN numbers from the storage
>> array at the same time will have the same path status (either failed or
>> active).
>
> If there are multiple routes to the storage, some of them can be down,
> even if everything is fine on the storage. This will cause some paths
> to be up and some to be down, regardless of the state of the LUN. In
> every other multipath case but this one, there is just one LUN, and not
> all the paths have the same state.
>
> Ideally, there would be a way to determine if a path is a zombie, simply
> by looking at it alone. The additional sense code "LOGICAL UNIT NOT
> SUPPORTED" that you posted earlier isn't one that I recall seeing for
> failed multipathd paths. I'll check around more, but a quick look makes
> it appear that this code is only used when you are accessing a LUN that
> really isn't there. It's possible that the TUR checker could return a
> special path state for this, that would cause multipathd to remove the
> device. Also, even if that additional sense code is only supposed to be
> used for this condition, we should still make removing a device that
> returns it configurable, because I can almost guarantee that there will
> be a scsi device that doesn't follow the standard for this.
>

Hi Ben,
You just mentioned *the TUR checker could return a special path state
for this*, what is the special path state? Thanks~

> -Ben
>
>> My previous patch uses three conditions to find those paths:
>> (1) the path status is failed;
>> (2) a path can be found which has the same wwid as the failed path but a
>> different lun number (pp->sg_id.lun);
>> (3) the found path's status is active.
>>
>> Based on your analysis of support for all devices, I want to restrict
>> the cleanup to scsi devices only.
>>
>> Above is my test result and reconsideration after your reply. Thanks a lot~
>>
>> Regards,
>> Chongyun
>
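For readers following the thread, the three conditions quoted at the end of the message above can be sketched roughly as follows. This is only an illustration in Python, not the patch itself: `PathInfo` and `find_zombie_candidates` are made-up names, and the `lun` field merely stands in for `pp->sg_id.lun`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PathInfo:
    """Illustrative stand-in for a multipath path (not the real struct path)."""
    dev: str      # kernel device name, e.g. "sdo"
    wwid: str     # WWID of the LUN behind the path
    lun: int      # SCSI LUN number (plays the role of pp->sg_id.lun)
    failed: bool  # True if the path status is failed


def find_zombie_candidates(paths: List[PathInfo]) -> List[PathInfo]:
    """Return failed paths that share a WWID with an active path on another LUN."""
    zombies = []
    for p in paths:
        if not p.failed:  # condition (1): the path itself must be failed
            continue
        # conditions (2) and (3): same wwid, different LUN number, and active
        has_active_twin = any(
            q.wwid == p.wwid and q.lun != p.lun and not q.failed
            for q in paths
        )
        if has_active_twin:
            zombies.append(p)
    return zombies


# The four paths from the multipath listing quoted in the thread.
WWID = "360002ac000000000000004f40001e2d7"
EXAMPLE_PATHS = [
    PathInfo("sdk", WWID, 3, failed=False),
    PathInfo("sdn", WWID, 3, failed=False),
    PathInfo("sdo", WWID, 6, failed=True),
    PathInfo("sdp", WWID, 6, failed=True),
]

if __name__ == "__main__":
    print([p.dev for p in find_zombie_candidates(EXAMPLE_PATHS)])
```

As Ben's quoted reply points out, paths can also differ in state for transport reasons, so a heuristic of this shape could flag a path that is failed only because one route is down.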
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Xose Vazquez Perez @ 2018-03-09 10:47 UTC
To: Chongyun Wu, Benjamin Marzinski
Cc: Guozhonghua, dm-devel@redhat.com, Changwei Ge, Changlimin, Martin Wilck

On 03/09/2018 07:47 AM, Chongyun Wu wrote:

> You just mentioned *the TUR checker could return a special path state
> for this*, what is the special path state? Thanks~

To follow up on this bug, you should post:

- distribution
- kernel release
- multipath-tools release
- /etc/multipath.conf
- and relevant system logs (multipath -v3 -d, journalctl, dmesg, messages, ...)
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Benjamin Marzinski @ 2018-03-09 16:22 UTC
To: Chongyun Wu
Cc: 'Xose Vazquez Perez', Guozhonghua, dm-devel@redhat.com, Changwei Ge, Changlimin, Martin Wilck

On Fri, Mar 09, 2018 at 06:47:30AM +0000, Chongyun Wu wrote:
> On 2018/3/8 23:45, Benjamin Marzinski wrote:
> > On Thu, Mar 08, 2018 at 08:03:50AM +0000, Chongyun Wu wrote:
> >> On 2018/3/7 20:45, Martin Wilck wrote:
> >>> On Wed, 2018-03-07 at 01:45 +0000, Chongyun Wu wrote:
> >>>>
> >>>> Hi Martin,
> >>>> Your analysis is correct. Did you have any good idea to deal with
> >>>> this issue?
> >>>
> >>> Could you maybe explain what was causing the issue in the first place?
> >>> Did you reconfigure the storage in any particular way?
> >>>
> >>> If yes, I think "multipathd reconfigure" would be the correct way to
> >>> deal with the problem. It re-reads everything, so it should get rid of
> >>> the stale paths.
> >>>
> >>> Regards
> >>> Martin
> >>>
> >>
> >> I have used "multipathd reconfigure", but the zombie (or stale) paths are
> >> still there; even restarting multipath-tools can't clean up those zombie
> >> paths.
> >>
> >> Steps to reproduce the issue:
> >> (1) export the LUN (LUN1) to the server (host1) with LUN value *6* in the
> >> storage array;
> >> (2) scan for LUN1 on host1 and create the multipath device;
> >> (3) delete the multipath device on host1;
> >> (4) unexport LUN1 from host1 in the storage array;
> >> (5) export the LUN (LUN1) to the server (host1) with LUN value *3* in the
> >> storage array;
> >> (6) scan for LUN1 on host1 and create the multipath device; the zombie
> >> paths show up like below:
> >> 360002ac000000000000004f40001e2d7 dm-5 3PARdata,VV
> >> size=13G features='1 queue_if_no_path' hwhandler='0' wp=rw
> >> `-+- policy='round-robin 0' prio=1 status=active
> >>   |- 3:0:0:3 sdk 8:160 active ready running
> >>   |- 4:0:0:3 sdn 8:208 active ready running
> >>   |- 3:0:0:6 sdo 8:224 failed faulty running
> >>   `- 4:0:0:6 sdp 8:240 failed faulty running
> >> Those zombie paths are actually caused by cancelling the old export
> >> relation in the storage array and changing to a new export relation (given
> >> a different LUN value, the kernel will create a new device for it); the old
> >> device stays in the system, which I call a zombie path or stale path.
> >>
> >> I'm sorry that my first description wasn't so clear and may have been
> >> misleading. The description *a lun can't be exported from a different
> >> lun number to a host at the same time* is actually not the criterion for
> >> finding zombie paths. I have tested that the storage has no such
> >> restriction: we can export one LUN to a server under different LUN numbers
> >> at the same time. But my patch does not care about this scenario, because
> >> paths exported several times under different LUN numbers from the storage
> >> array at the same time will have the same path status (either failed or
> >> active).
> >
> > If there are multiple routes to the storage, some of them can be down,
> > even if everything is fine on the storage. This will cause some paths
> > to be up and some to be down, regardless of the state of the LUN. In
> > every other multipath case but this one, there is just one LUN, and not
> > all the paths have the same state.
> >
> > Ideally, there would be a way to determine if a path is a zombie, simply
> > by looking at it alone. The additional sense code "LOGICAL UNIT NOT
> > SUPPORTED" that you posted earlier isn't one that I recall seeing for
> > failed multipathd paths. I'll check around more, but a quick look makes
> > it appear that this code is only used when you are accessing a LUN that
> > really isn't there. It's possible that the TUR checker could return a
> > special path state for this, that would cause multipathd to remove the
> > device. Also, even if that additional sense code is only supposed to be
> > used for this condition, we should still make removing a device that
> > returns it configurable, because I can almost guarantee that there will
> > be a scsi device that doesn't follow the standard for this.
> >
> Hi Ben,
> You just mentioned *the TUR checker could return a special path state
> for this*, what is the special path state? Thanks~
>

We would have to add a new state, like PATH_NOT_SUPPORTED, that the TUR
checker could return in this case. multipathd could be configured to
remove the path if it returned this state. If it wasn't configured to do
so, multipathd would just change the state to PATH_DOWN.

> > -Ben
> >
> >> My previous patch uses three conditions to find those paths:
> >> (1) the path status is failed;
> >> (2) a path can be found which has the same wwid as the failed path but a
> >> different lun number (pp->sg_id.lun);
> >> (3) the found path's status is active.
> >>
> >> Based on your analysis of support for all devices, I want to restrict
> >> the cleanup to scsi devices only.
> >>
> >> Above is my test result and reconsideration after your reply. Thanks a lot~
> >>
> >> Regards,
> >> Chongyun
> >
>
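The state handling proposed in the message above could look roughly like the sketch below. It is a Python illustration under stated assumptions, not multipath-tools code: `PATH_NOT_SUPPORTED` is only a proposal in this thread, the `remove_unsupported` switch is a hypothetical option name, and the checker logic is reduced to the single case under discussion. The sense values used are the standard SCSI ones for "LOGICAL UNIT NOT SUPPORTED" (sense key ILLEGAL REQUEST 0x05, ASC 0x25, ASCQ 0x00).

```python
from enum import Enum
from typing import Tuple

# Standard SCSI sense values for "LOGICAL UNIT NOT SUPPORTED".
SENSE_KEY_ILLEGAL_REQUEST = 0x05
ASC_LU_NOT_SUPPORTED = 0x25
ASCQ_LU_NOT_SUPPORTED = 0x00


class PathState(Enum):
    """Simplified stand-in for the checker path states."""
    PATH_UP = "up"
    PATH_DOWN = "down"
    PATH_NOT_SUPPORTED = "not_supported"  # proposed in the thread, hypothetical


def classify_tur(ok: bool, sense_key: int = 0, asc: int = 0, ascq: int = 0) -> PathState:
    """Map a (heavily simplified) TEST UNIT READY result to a path state."""
    if ok:
        return PathState.PATH_UP
    sense = (sense_key, asc, ascq)
    if sense == (SENSE_KEY_ILLEGAL_REQUEST, ASC_LU_NOT_SUPPORTED, ASCQ_LU_NOT_SUPPORTED):
        return PathState.PATH_NOT_SUPPORTED
    return PathState.PATH_DOWN


def apply_policy(state: PathState, remove_unsupported: bool) -> Tuple[PathState, bool]:
    """Return (effective state, whether the path should be removed).

    remove_unsupported is a hypothetical configuration switch: when it is
    off, PATH_NOT_SUPPORTED simply degrades to PATH_DOWN.
    """
    if state is PathState.PATH_NOT_SUPPORTED:
        if remove_unsupported:
            return state, True
        return PathState.PATH_DOWN, False
    return state, False


if __name__ == "__main__":
    state = classify_tur(False, 0x05, 0x25, 0x00)
    print(apply_policy(state, remove_unsupported=False))
    print(apply_policy(state, remove_unsupported=True))
```

Keeping the removal behind a switch mirrors the concern raised in the thread that some devices may return this sense code in other situations.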
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Martin Wilck @ 2018-03-19 21:42 UTC
To: Benjamin Marzinski, Chongyun Wu
Cc: 'Xose Vazquez Perez', Guozhonghua, dm-devel@redhat.com, Changwei Ge, Changlimin

On Fri, 2018-03-09 at 10:22 -0600, Benjamin Marzinski wrote:
> On Fri, Mar 09, 2018 at 06:47:30AM +0000, Chongyun Wu wrote:
> > On 2018/3/8 23:45, Benjamin Marzinski wrote:
> > >
> > > If there are multiple routes to the storage, some of them can be
> > > down, even if everything is fine on the storage. This will cause
> > > some paths to be up and some to be down, regardless of the state
> > > of the LUN. In every other multipath case but this one, there is
> > > just one LUN, and not all the paths have the same state.
> > >
> > > Ideally, there would be a way to determine if a path is a zombie,
> > > simply by looking at it alone. The additional sense code "LOGICAL
> > > UNIT NOT SUPPORTED" that you posted earlier isn't one that I
> > > recall seeing for failed multipathd paths. I'll check around
> > > more, but a quick look makes it appear that this code is only
> > > used when you are accessing a LUN that really isn't there. It's
> > > possible that the TUR checker could return a special path state
> > > for this, that would cause multipathd to remove the device. Also,
> > > even if that additional sense code is only supposed to be used
> > > for this condition, we should still make removing a device that
> > > returns it configurable, because I can almost guarantee that
> > > there will be a scsi device that doesn't follow the standard for
> > > this.
> > >
> >
> > Hi Ben,
> > You just mentioned *the TUR checker could return a special path
> > state for this*, what is the special path state? Thanks~
> >
>
> We would have to add a new state, like PATH_NOT_SUPPORTED, that the
> TUR checker could return in this case. multipathd could be configured
> to remove the path if it returned this state. If it wasn't configured
> to do so, multipathd would just change the state to PATH_DOWN.

Is it really multipathd's job to remove devices that return "LOGICAL
UNIT NOT SUPPORTED"? To me it sounds like a misconfiguration on the
SCSI/storage level, and I'm unsure if that's a thing multipathd should
mess with.

Martin

--
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Chongyun Wu @ 2018-03-20 3:19 UTC
To: Martin Wilck, Benjamin Marzinski
Cc: 'Xose Vazquez Perez', Guozhonghua, dm-devel@redhat.com, Changwei Ge, Changlimin

On 2018/3/20 5:42, Martin Wilck wrote:
> On Fri, 2018-03-09 at 10:22 -0600, Benjamin Marzinski wrote:
>> On Fri, Mar 09, 2018 at 06:47:30AM +0000, Chongyun Wu wrote:
>>> On 2018/3/8 23:45, Benjamin Marzinski wrote:
>>>>
>>>> If there are multiple routes to the storage, some of them can be
>>>> down, even if everything is fine on the storage. This will cause
>>>> some paths to be up and some to be down, regardless of the state
>>>> of the LUN. In every other multipath case but this one, there is
>>>> just one LUN, and not all the paths have the same state.
>>>>
>>>> Ideally, there would be a way to determine if a path is a zombie,
>>>> simply by looking at it alone. The additional sense code "LOGICAL
>>>> UNIT NOT SUPPORTED" that you posted earlier isn't one that I
>>>> recall seeing for failed multipathd paths. I'll check around
>>>> more, but a quick look makes it appear that this code is only
>>>> used when you are accessing a LUN that really isn't there. It's
>>>> possible that the TUR checker could return a special path state
>>>> for this, that would cause multipathd to remove the device. Also,
>>>> even if that additional sense code is only supposed to be used
>>>> for this condition, we should still make removing a device that
>>>> returns it configurable, because I can almost guarantee that
>>>> there will be a scsi device that doesn't follow the standard for
>>>> this.
>>>>
>>>
>>> Hi Ben,
>>> You just mentioned *the TUR checker could return a special path
>>> state for this*, what is the special path state? Thanks~
>>>
>>
>> We would have to add a new state, like PATH_NOT_SUPPORTED, that the
>> TUR checker could return in this case. multipathd could be configured
>> to remove the path if it returned this state. If it wasn't configured
>> to do so, multipathd would just change the state to PATH_DOWN.
>
> Is it really multipathd's job to remove devices that return "LOGICAL
> UNIT NOT SUPPORTED"? To me it sounds like a misconfiguration on the
> SCSI/storage level, and I'm unsure if that's a thing multipathd should
> mess with.
>
> Martin
>

Actually there are two scenarios:
(1) Export the LUN to a server under different LUN numbers at the same
time.
As you mentioned, this scenario can be considered a misconfiguration,
which we might not care about.
(2) Export the LUN to a server under different LUN numbers, but not at
the same time.
This operation may be legitimate: the customer just wants to reassign
the export relations in the storage. But the former export leaves a
residual device in the system, which is then adopted by the multipath
device of the later export. There are also lots of syslog messages for
the former device, which actually does not exist (at least the customer
doesn't think it exists; the customer wants only the newly exported
device to exist).

Regards,
Chongyun
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Martin Wilck @ 2018-03-20 7:36 UTC
To: Chongyun Wu, Benjamin Marzinski
Cc: 'Xose Vazquez Perez', Guozhonghua, dm-devel@redhat.com, Changwei Ge, Changlimin

On Tue, 2018-03-20 at 03:19 +0000, Chongyun Wu wrote:
> On 2018/3/20 5:42, Martin Wilck wrote:
> > On Fri, 2018-03-09 at 10:22 -0600, Benjamin Marzinski wrote:
> > > On Fri, Mar 09, 2018 at 06:47:30AM +0000, Chongyun Wu wrote:
> > > > On 2018/3/8 23:45, Benjamin Marzinski wrote:
> > > > >
> > > > > If there are multiple routes to the storage, some of them can
> > > > > be down, even if everything is fine on the storage. This will
> > > > > cause some paths to be up and some to be down, regardless of
> > > > > the state of the LUN. In every other multipath case but this
> > > > > one, there is just one LUN, and not all the paths have the
> > > > > same state.
> > > > >
> > > > > Ideally, there would be a way to determine if a path is a
> > > > > zombie, simply by looking at it alone. The additional sense
> > > > > code "LOGICAL UNIT NOT SUPPORTED" that you posted earlier
> > > > > isn't one that I recall seeing for failed multipathd paths.
> > > > > I'll check around more, but a quick look makes it appear that
> > > > > this code is only used when you are accessing a LUN that
> > > > > really isn't there. It's possible that the TUR checker could
> > > > > return a special path state for this, that would cause
> > > > > multipathd to remove the device. Also, even if that
> > > > > additional sense code is only supposed to be used for this
> > > > > condition, we should still make removing a device that
> > > > > returns it configurable, because I can almost guarantee that
> > > > > there will be a scsi device that doesn't follow the standard
> > > > > for this.
> > > > >
> > > >
> > > > Hi Ben,
> > > > You just mentioned *the TUR checker could return a special path
> > > > state for this*, what is the special path state? Thanks~
> > > >
> > >
> > > We would have to add a new state, like PATH_NOT_SUPPORTED, that
> > > the TUR checker could return in this case. multipathd could be
> > > configured to remove the path if it returned this state. If it
> > > wasn't configured to do so, multipathd would just change the
> > > state to PATH_DOWN.
> >
> > Is it really multipathd's job to remove devices that return
> > "LOGICAL UNIT NOT SUPPORTED"? To me it sounds like a
> > misconfiguration on the SCSI/storage level, and I'm unsure if
> > that's a thing multipathd should mess with.
> >
> > Martin
> >
>
> Actually there are two scenarios:
> (1) Export the LUN to a server under different LUN numbers at the same
> time.
> As you mentioned, this scenario can be considered a misconfiguration,
> which we might not care about.
> (2) Export the LUN to a server under different LUN numbers, but not at
> the same time.
> This operation may be legitimate: the customer just wants to reassign
> the export relations in the storage. But the former export leaves a
> residual device in the system, which is then adopted by the multipath
> device of the later export. There are also lots of syslog messages for
> the former device, which actually does not exist (at least the customer
> doesn't think it exists; the customer wants only the newly exported
> device to exist).

I agree that the "residual device" should be removed from the system.
But I don't think that it's multipathd's job to detect and remove
such devices. Well, detect and spit out a message - maybe, but remove -
rather not. multipathd is for managing (dm-)multipath devices, not for
taking care of arbitrary problems on the storage layer.

That said, I'd be OK with a PATH_NOT_SUPPORTED state that would result
in the paths being treated like orphans or blacklisted devices.

Regards,
Martin

--
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Bart Van Assche @ 2018-03-20 14:58 UTC
To: bmarzins@redhat.com, wu.chongyun@h3c.com, mwilck@suse.com
Cc: guozhonghua@h3c.com, dm-devel@redhat.com, changlimin@h3c.com, xose.vazquez@gmail.com, ge.changwei@h3c.com

On Tue, 2018-03-20 at 03:19 +0000, Chongyun Wu wrote:
> Actually there are two scenarios:
> (1) Export the LUN to a server under different LUN numbers at the same
> time.
> As you mentioned, this scenario can be considered a misconfiguration,
> which we might not care about.
> (2) Export the LUN to a server under different LUN numbers, but not at
> the same time.
> This operation may be legitimate: the customer just wants to reassign
> the export relations in the storage. But the former export leaves a
> residual device in the system, which is then adopted by the multipath
> device of the later export. There are also lots of syslog messages for
> the former device, which actually does not exist (at least the customer
> doesn't think it exists; the customer wants only the newly exported
> device to exist).

Hello Chongyun,

It is on purpose that the SCSI core does not remove stale SCSI device nodes.
If you want these stale SCSI device nodes to be removed automatically,
two possible approaches are (there might be other approaches):
* Write a new user space daemon that periodically checks for stale devices
  (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
  grep -v running) and that triggers a SCSI rescan if any stale devices are
  found.
* Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
  and that triggers a SCSI rescan if this event is triggered by the kernel.

Bart.
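The detection half of the first approach in the message above might be sketched as below. This is a minimal Python illustration, not an existing tool: the function name is made up, the sysfs root is a parameter so the logic can be exercised against a scratch directory, and triggering the rescan itself is deliberately left out.

```python
import glob
import os
from typing import List, Tuple


def find_non_running_scsi_devices(
    sysfs_root: str = "/sys/class/scsi_device",
) -> List[Tuple[str, str]]:
    """Return (H:C:T:L, state) for every SCSI device whose state is not "running".

    Mirrors the grep over /sys/class/scsi_device/*/*/state quoted in the
    thread; unreadable state files are skipped.
    """
    result = []
    pattern = os.path.join(sysfs_root, "*", "*", "state")
    for state_file in sorted(glob.glob(pattern)):
        try:
            with open(state_file) as f:
                state = f.read().strip()
        except OSError:
            continue  # the device may have disappeared while scanning
        if state != "running":
            # .../<H:C:T:L>/device/state -> take the H:C:T:L component
            hctl = os.path.basename(os.path.dirname(os.path.dirname(state_file)))
            result.append((hctl, state))
    return result


if __name__ == "__main__":
    for hctl, state in find_non_running_scsi_devices():
        print(f"{hctl}: {state}")
```

Note that the failed paths in the multipath listing earlier in the thread still show `running` in the last column, so a state-only check of this kind might not flag them on its own.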
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Xose Vazquez Perez @ 2018-03-20 15:12 UTC
To: Bart Van Assche, bmarzins@redhat.com, wu.chongyun@h3c.com, mwilck@suse.com
Cc: guozhonghua@h3c.com, dm-devel@redhat.com, changlimin@h3c.com, ge.changwei@h3c.com

On 03/20/2018 03:58 PM, Bart Van Assche wrote:

> It is on purpose that the SCSI core does not remove stale SCSI device nodes.
> If you want these stale SCSI device nodes to be removed automatically,
> two possible approaches are (there might be other approaches):
> * Write a new user space daemon that periodically checks for stale devices
>   (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
>   grep -v running) and that triggers a SCSI rescan if any stale devices are
>   found.
> * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
>   and that triggers a SCSI rescan if this event is triggered by the kernel.

There are some "remove" flags in rescan-scsi-bus.sh:
https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080

-r             enables removing of devices [default: disabled]
--forceremove: Remove stale devices (DANGEROUS)
--forcerescan: Remove and readd existing devices (DANGEROUS)
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Bart Van Assche @ 2018-03-20 15:14 UTC
To: xose.vazquez@gmail.com, bmarzins@redhat.com, wu.chongyun@h3c.com, mwilck@suse.com
Cc: guozhonghua@h3c.com, dm-devel@redhat.com, changlimin@h3c.com, ge.changwei@h3c.com

On Tue, 2018-03-20 at 16:12 +0100, Xose Vazquez Perez wrote:
> On 03/20/2018 03:58 PM, Bart Van Assche wrote:
>
> > It is on purpose that the SCSI core does not remove stale SCSI device nodes.
> > If you want these stale SCSI device nodes to be removed automatically,
> > two possible approaches are (there might be other approaches):
> > * Write a new user space daemon that periodically checks for stale devices
> >   (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
> >   grep -v running) and that triggers a SCSI rescan if any stale devices are
> >   found.
> > * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
> >   and that triggers a SCSI rescan if this event is triggered by the kernel.
>
> There are some "remove" flags in rescan-scsi-bus.sh:
> https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080
>
> -r             enables removing of devices [default: disabled]
> --forceremove: Remove stale devices (DANGEROUS)
> --forcerescan: Remove and readd existing devices (DANGEROUS)

Last time I checked, the rescan-scsi-bus.sh script relied on the SCSI sysfs
delete attribute to remove stale devices. That is the mechanism that can
trigger a deadlock in the kernel.

Bart.
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Martin Wilck @ 2018-03-20 15:19 UTC
To: Bart Van Assche, xose.vazquez@gmail.com, bmarzins@redhat.com, wu.chongyun@h3c.com
Cc: guozhonghua@h3c.com, dm-devel@redhat.com, changlimin@h3c.com, ge.changwei@h3c.com

On Tue, 2018-03-20 at 15:14 +0000, Bart Van Assche wrote:
> On Tue, 2018-03-20 at 16:12 +0100, Xose Vazquez Perez wrote:
> >
> > There are some "remove" flags in rescan-scsi-bus.sh:
> > https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080
> >
> > -r             enables removing of devices [default: disabled]
> > --forceremove: Remove stale devices (DANGEROUS)
> > --forcerescan: Remove and readd existing devices (DANGEROUS)
>
> Last time I checked, the rescan-scsi-bus.sh script relied on the SCSI
> sysfs delete attribute to remove stale devices. That is the mechanism
> that can trigger a deadlock in the kernel.

Is there an alternative the script could use? I'm not aware of any.

Martin

--
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Chongyun Wu @ 2018-03-21 1:54 UTC
To: Bart Van Assche, xose.vazquez@gmail.com, bmarzins@redhat.com, mwilck@suse.com
Cc: Guozhonghua, dm-devel@redhat.com, Changlimin, Changwei Ge

On 2018/3/20 23:14, Bart Van Assche wrote:
> On Tue, 2018-03-20 at 16:12 +0100, Xose Vazquez Perez wrote:
>> On 03/20/2018 03:58 PM, Bart Van Assche wrote:
>>
>>> It is on purpose that the SCSI core does not remove stale SCSI device nodes.
>>> If you want these stale SCSI device nodes to be removed automatically,
>>> two possible approaches are (there might be other approaches):
>>> * Write a new user space daemon that periodically checks for stale devices
>>>   (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
>>>   grep -v running) and that triggers a SCSI rescan if any stale devices are
>>>   found.
>>> * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
>>>   and that triggers a SCSI rescan if this event is triggered by the kernel.
>>
>> There are some "remove" flags in rescan-scsi-bus.sh:
>> https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080
>>
>> -r enables removing of devices [default: disabled]
>> --forceremove: Remove stale devices (DANGEROUS)
>> --forcerescan: Remove and readd existing devices (DANGEROUS)
>
> Last time I checked the rescan-scsi-bus.sh script relied on the SCSI sysfs
> delete attribute to remove stale devices. That is the mechanism that can
> trigger a deadlock in the kernel.
>
> Bart.
>

Hi Bart,

Are there any special operations or conditions needed to reproduce the deadlock?
I have used the SCSI sysfs delete attribute to remove stale devices in my
previous patch and tested it many times, but I haven't encountered any
deadlock problems.

Regards,
Chongyun
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Bart Van Assche @ 2018-03-21 19:56 UTC
To: xose.vazquez@gmail.com, wu.chongyun@h3c.com, bmarzins@redhat.com, mwilck@suse.com
Cc: guozhonghua@h3c.com, dm-devel@redhat.com, changlimin@h3c.com, ge.changwei@h3c.com

On Wed, 2018-03-21 at 01:54 +0000, Chongyun Wu wrote:
> Are there any special operations or conditions needed to reproduce the deadlock?
> I have used the SCSI sysfs delete attribute to remove stale devices in my
> previous patch and tested it many times, but I haven't encountered any
> deadlock problems.

Hello Chongyun,

In my tests I enabled the lock validator (CONFIG_LOCKDEP) in order not only
to obtain detailed information about the cause of actual deadlocks but also
to obtain information about potential deadlocks.

I'm not sure how to make it more likely to trigger this deadlock. But I think
that you should be aware that some SCSI transports remove the SCSI host if a
transport failure is detected and other SCSI transports don't remove the SCSI
host upon a transport failure. I ran my tests with the SRP protocol and the
SRP transport protocol removes the SCSI host if a transport failure persists
long enough. Maybe triggering SCSI host removal often helps to trigger that
deadlock.

Bart.
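[Editorial note] Bart's results depend on the test kernel having the lock validator built in. A minimal sketch, assuming the distribution installs the kernel config as /boot/config-$(uname -r) or exposes it as /proc/config.gz, of checking for that before trying to reproduce the deadlock. The function name kernel_option_enabled is hypothetical and not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Return 0 if OPTION is set to "y" in the kernel config, 1 if it is not,
# 2 if no readable config could be found.
# Usage: kernel_option_enabled OPTION [CONFIG_FILE]
# CONFIG_FILE defaults to /boot/config-$(uname -r); this location is an
# assumption about the distribution, not a guarantee.
kernel_option_enabled() {
    opt="$1"
    cfg="${2:-/boot/config-$(uname -r)}"
    if [ -r "$cfg" ]; then
        grep -q "^${opt}=y\$" "$cfg"
    elif [ -r /proc/config.gz ]; then
        # Only present when the kernel was built with CONFIG_IKCONFIG_PROC.
        zcat /proc/config.gz | grep -q "^${opt}=y\$"
    else
        echo "no kernel config found" >&2
        return 2
    fi
}
```

For example, `kernel_option_enabled CONFIG_LOCKDEP && echo "lockdep is built in"`. CONFIG_LOCKDEP is normally selected by debugging options such as CONFIG_PROVE_LOCKING rather than set by hand, so a production kernel will usually report it as disabled.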
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Chongyun Wu @ 2018-03-22 1:58 UTC
To: Bart Van Assche, xose.vazquez@gmail.com, bmarzins@redhat.com, mwilck@suse.com
Cc: Guozhonghua, dm-devel@redhat.com, Changlimin, Changwei Ge

On 2018/3/22 3:56, Bart Van Assche wrote:
> On Wed, 2018-03-21 at 01:54 +0000, Chongyun Wu wrote:
>> Are there any special operations or conditions needed to reproduce the deadlock?
>> I have used the SCSI sysfs delete attribute to remove stale devices in my
>> previous patch and tested it many times, but I haven't encountered any
>> deadlock problems.
>
> Hello Chongyun,
>
> In my tests I enabled the lock validator (CONFIG_LOCKDEP) in order not only
> to obtain detailed information about the cause of actual deadlocks but also
> to obtain information about potential deadlocks.
>
> I'm not sure how to make it more likely to trigger this deadlock. But I think
> that you should be aware that some SCSI transports remove the SCSI host if a
> transport failure is detected and other SCSI transports don't remove the SCSI
> host upon a transport failure. I ran my tests with the SRP protocol and the
> SRP transport protocol removes the SCSI host if a transport failure persists
> long enough. Maybe triggering SCSI host removal often helps to trigger that
> deadlock.
>
> Bart.
>

Hi Bart,

Thanks. As you mentioned, maybe we use a different SCSI transport, so we
haven't hit the deadlock so far. I hope the issue you mentioned, *Avoid that
SCSI device removal through sysfs triggers a deadlock*, can be fixed at the
root. We often run into problems with residual devices, which cause some
issues, and we want to find a safe and effective method to clean them up.

Regards,
Chongyun
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Chongyun Wu @ 2018-03-22 3:40 UTC
To: Bart Van Assche, xose.vazquez@gmail.com, bmarzins@redhat.com, mwilck@suse.com
Cc: Guozhonghua, dm-devel@redhat.com, Changlimin, Changwei Ge

On 2018/3/20 23:14, Bart Van Assche wrote:
> On Tue, 2018-03-20 at 16:12 +0100, Xose Vazquez Perez wrote:
>> On 03/20/2018 03:58 PM, Bart Van Assche wrote:
>>
>>> It is on purpose that the SCSI core does not remove stale SCSI device nodes.
>>> If you want these stale SCSI device nodes to be removed automatically,
>>> two possible approaches are (there might be other approaches):
>>> * Write a new user space daemon that periodically checks for stale devices
>>>   (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
>>>   grep -v running) and that triggers a SCSI rescan if any stale devices are
>>>   found.
>>> * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
>>>   and that triggers a SCSI rescan if this event is triggered by the kernel.
>>
>> There are some "remove" flags in rescan-scsi-bus.sh:
>> https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080
>>
>> -r enables removing of devices [default: disabled]
>> --forceremove: Remove stale devices (DANGEROUS)
>> --forcerescan: Remove and readd existing devices (DANGEROUS)
>
> Last time I checked the rescan-scsi-bus.sh script relied on the SCSI sysfs
> delete attribute to remove stale devices. That is the mechanism that can
> trigger a deadlock in the kernel.
>
> Bart.
>

Hi Bart,

I did a test. The command below can remove the residual device:
*echo "scsi remove-single-device 3 0 0 3" > /proc/scsi/scsi*
Is it safe?

Regards,
Chongyun
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Bart Van Assche @ 2018-03-22 15:18 UTC
To: xose.vazquez@gmail.com, wu.chongyun@h3c.com, bmarzins@redhat.com, mwilck@suse.com
Cc: guozhonghua@h3c.com, dm-devel@redhat.com, changlimin@h3c.com, ge.changwei@h3c.com

On Thu, 2018-03-22 at 03:40 +0000, Chongyun Wu wrote:
> I did a test. The command below can remove the residual device:
> *echo "scsi remove-single-device 3 0 0 3" > /proc/scsi/scsi*
> Is it safe?

Hello Chongyun,

Are you aware of the linux-scsi mailing list? I think this question would be
more appropriate for that mailing list. Regarding your question, I think that
you should be aware of the following comment above the proc write method that
implements that functionality: "this provides a legacy mechanism to add or
remove devices".

Bart.
* Re: [PATCH] multipathd: check and cleanup zombie paths
From: Chongyun Wu @ 2018-03-21 1:17 UTC
To: Bart Van Assche, bmarzins@redhat.com, mwilck@suse.com
Cc: Guozhonghua, dm-devel@redhat.com, Changlimin, xose.vazquez@gmail.com, Changwei Ge

On 2018/3/20 22:58, Bart Van Assche wrote:
> On Tue, 2018-03-20 at 03:19 +0000, Chongyun Wu wrote:
>> Actually there are two scenarios:
>> (1) Export the LUN to a server at the same time using different LUN numbers.
>> As you mentioned, this scenario can be considered a misconfiguration
>> which we might not care about.
>> (2) Export the LUN to a server, not at the same time, using a different LUN
>> number.
>> This scenario's operation may be valid: the customer just wants to
>> reassign the export relations in the storage.
>> But the former export operation leaves a residual device in the system,
>> which will be adopted by the latter exported device's multipath. Also,
>> there are lots of syslog messages for the former device, which actually does
>> not exist (at least the customer doesn't think it exists; the customer wants
>> only the newly exported device to exist).
>
> Hello Chongyun,
>
> It is on purpose that the SCSI core does not remove stale SCSI device nodes.
> If you want these stale SCSI device nodes to be removed automatically,
> two possible approaches are (there might be other approaches):
> * Write a new user space daemon that periodically checks for stale devices
>   (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
>   grep -v running) and that triggers a SCSI rescan if any stale devices are
>   found.
> * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
>   and that triggers a SCSI rescan if this event is triggered by the kernel.
>
> Bart.

Hi Bart,

Thank you very much for your advice. Both approaches are new ways for me to
clean up stale devices; I will give them a try.

Regards,
Chongyun
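[Editorial note] The first approach Bart describes (poll the sysfs state files, rescan when something is not running) can be sketched as two small shell functions. This is only an illustration under stated assumptions: the function names list_stale_scsi_devices and rescan_scsi_hosts are hypothetical, the optional directory arguments exist only so the functions can be exercised against a scratch directory, and writing "- - -" to a host's scan attribute is the generic wildcard rescan, which may or may not be what a given setup needs.

```shell
#!/bin/sh
# Print "H:C:T:L state" for every SCSI device whose sysfs state is not
# "running". Mirrors the grep pipeline quoted above.
# Usage: list_stale_scsi_devices [SYSFS_DIR]   (default /sys/class/scsi_device)
list_stale_scsi_devices() {
    dev_root="${1:-/sys/class/scsi_device}"
    for f in "$dev_root"/*/device/state; do
        [ -r "$f" ] || continue
        # The device may disappear between the glob and the read.
        state=$(cat "$f") || continue
        [ "$state" = "running" ] && continue
        hctl=${f%/device/state}
        printf '%s %s\n' "${hctl##*/}" "$state"
    done
}

# Ask every SCSI host to rescan all channels, targets and LUNs.
# Usage: rescan_scsi_hosts [SYSFS_DIR]   (default /sys/class/scsi_host)
rescan_scsi_hosts() {
    host_root="${1:-/sys/class/scsi_host}"
    for scan in "$host_root"/host*/scan; do
        [ -w "$scan" ] || continue
        echo "- - -" > "$scan"
    done
}
```

A daemon along the lines Bart suggests would call list_stale_scsi_devices periodically and run rescan_scsi_hosts (as root) only when the list is non-empty. Note that this sketch deliberately does not write to the sysfs delete attribute, which is the mechanism Bart identifies elsewhere in the thread as able to trigger a kernel deadlock.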
Thread overview: 18+ messages
[not found] <CEB9978CF3252343BE3C67AC9F0086A34295C462@H3CMLB14-EX.srv.huawei-3com.com>
[not found] ` <1520325779.4131.4.camel@suse.com>
[not found] ` <CEB9978CF3252343BE3C67AC9F0086A34295C9D0@H3CMLB14-EX.srv.huawei-3com.com>
[not found] ` <1520349519.4131.20.camel@suse.com>
[not found] ` <CEB9978CF3252343BE3C67AC9F0086A34295CA7D@H3CMLB14-EX.srv.huawei-3com.com>
[not found] ` <1520426679.11340.5.camel@suse.com>
[not found] ` <CEB9978CF3252343BE3C67AC9F0086A34295CF62@H3CMLB14-EX.srv.huawei-3com.com>
2018-03-08 15:54 ` [PATCH] multipathd: check and cleanup zombie paths Xose Vazquez Perez
2018-03-09 6:11 ` Chongyun Wu
[not found] ` <20180308154435.GB14513@octiron.msp.redhat.com>
2018-03-09 6:47 ` Chongyun Wu
2018-03-09 10:47 ` Xose Vazquez Perez
2018-03-09 16:22 ` Benjamin Marzinski
2018-03-19 21:42 ` Martin Wilck
2018-03-20 3:19 ` Chongyun Wu
2018-03-20 7:36 ` Martin Wilck
2018-03-20 14:58 ` Bart Van Assche
2018-03-20 15:12 ` Xose Vazquez Perez
2018-03-20 15:14 ` Bart Van Assche
2018-03-20 15:19 ` Martin Wilck
2018-03-21 1:54 ` Chongyun Wu
2018-03-21 19:56 ` Bart Van Assche
2018-03-22 1:58 ` Chongyun Wu
2018-03-22 3:40 ` Chongyun Wu
2018-03-22 15:18 ` Bart Van Assche
2018-03-21 1:17 ` Chongyun Wu