[PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers
@ 2010-07-28 22:58 Moger, Babu
  2010-07-29 21:54 ` Chandra Seetharaman
  0 siblings, 1 reply; 8+ messages in thread
From: Moger, Babu @ 2010-07-28 22:58 UTC (permalink / raw)
  To: device-mapper development, linux-scsi@vger.kernel.org
  Cc: Qi, Yanling, Chauhan, Vijay, Dachepalli, Sudhir, Stankey, Robert

These patches fix the following two cases. 
1. Devices going away while scsi device hander's activate is still in progress.

2. Removal of scsi_dh_data(calling detach handler) when scsi device hander's activate is still in progress.

We have been seeing these problems while running multipath failover tests on LSI storage. These patches fix the problem. We have verified it.

Here is the panic we have been seeing while running failover failback tests.

> 00:40:42:869  COM1 >------------[ cut here ]------------
> 00:40:42:869  COM1 >kernel BUG at /usr/src/packages/BUILD/lsi-
> scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
> 00:40:42:869  COM1 >invalid opcode: 0000 [1] SMP 
> 00:40:42:885  COM1 >last sysfs file: /sys/kernel/uevent_seqnum
> 00:40:42:885  COM1 >CPU 3 
> 00:40:42:885  COM1 >Modules linked in: dm_round_robin dm_multipath 
> nls_utf8 cifs(X) microcode af_packet ipv6 fuse loop dm_mod iTCO_wdt 
> iTCO_vendor_support dcdbas(X) pcspkr rtc_cmos rtc_core serio_raw 
> rtc_lib i5000_edac edac_core bnx2 shpchp sg pci_hotplug button 
> mptctl usbhid hid ff_memless uhci_hcd ehci_hcd usbcore sd_mod 
> crc_t10dif mpt2sas(N) raid_class edd ext3 mbcache jbd fan 
> ide_pci_generic piix ide_core ata_generic ata_piix libata dock 
> mptsas mptscsih mptbase scsi_transport_sas thermal processor 
> thermal_sys hwmon scsi_dh_rdac(X) scsi_dh scsi_mod
> 00:40:42:932  COM1 >Supported: No
> 00:40:42:932  COM1 >Pid: 14044, comm: kmpath_handlerd Tainted: G    
> 2.6.27.39-0.3-default #1
> 00:40:42:932  COM1 >RIP: 0010:  
> rdac_activate+0x257/0x387 [scsi_dh_rdac]
> 00:40:42:947  COM1 >RSP: 0018:ffff880127109dc0  EFLAGS: 00010246
> 00:40:42:947  COM1 >RAX: ffff8800ae02f000 RBX: 0000000000000001 RCX:
> 0000000000000018
> 00:40:42:963  COM1 >RDX: 0000000000001bbc RSI: 0000000000000282 RDI:
> ffff8800c2ccd918
> 00:40:42:963  COM1 >RBP: 00000000fffffffb R08: ffffffff806eaf78 R09:
> ffff880028087720
> 00:40:42:963  COM1 >R10: 0000000000000000 R11: ffffffff80284ebe R12:
> ffffffffa0030fbe
> 00:40:42:978  COM1 >R13: 0000000000000000 R14: 0000000000000282 R15:
> 0000000000000000
> 00:40:42:978  COM1 >FS:  0000000000000000(0000) GS:ffff88012fb81ec0
> (0000) knlGS:0000000000000000
> 00:40:42:978  COM1 >CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> 00:40:42:994  COM1 >CR2: 00000000f7701630 CR3: 0000000101d5e000 CR4:
> 00000000000006e0
> 00:40:42:994  COM1 >DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> 00:40:43:010  COM1 >DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> 00:40:43:010  COM1 >Process kmpath_handlerd (pid: 14044, threadinfo 
> ffff880127108000, task ffff88012710e680)
> 00:40:43:010  COM1 >Stack:  ffff880127109ec0 ffffffff8049c431 
> 0000000000000000 ffff880127109e50
> 00:40:43:025  COM1 > ffff8800ae02f000 ffff8800b5032208 
> ffff8800b5032200 ffff8800ae02f250
> 00:40:43:025  COM1 > ffff8800b5032216 0000000580a33680 
> ffff8800c2ccd6b0 ffff8800ae02f120
> 00:40:43:041  COM1 >Call Trace:
> 00:40:43:041  COM1 > scsi_dh_activate+0x81/0x9b[scsi_dh]
> 00:40:43:041  COM1 > activate_path+0x22/0x46 
> [dm_multipath]
> 00:40:43:041  COM1 > run_workqueue+0x7a/0x100
> 00:40:43:057  COM1 > worker_thread+0xd8/0xe7
> 00:40:43:057  COM1 > kthread+0x47/0x73
> 00:40:43:057  COM1 > child_rip+0xa/0x11
> 00:40:43:057  COM1 >
> 00:40:43:057  COM1 >
> 00:40:43:057  COM1 >Code: 4c 89 ea e8 78 dd 30 e0 4c 89 ef 89 c5 e8 
> db a8 30 e0 85 ed 0f 84 da 00 00 00 48 8b 44 24 20 4c 8b a8 d0 05 00
> 00 4d 85 ed 75 04 <0f> 0b eb fe 48 8b 7c 24 40 48 8d 54 24 60 be 60 
> 00 00 00 e8 ae 
> 00:40:43:072  COM1 >RIP   rdac_activate+0x257/
> 0x387 [scsi_dh_rdac]
> 00:40:43:088  COM1 > RSP <ffff880127109dc0>
> 00:40:43:088  COM1 >---[ end trace 00e89c598c82483b ]---

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers
  2010-07-28 22:58 [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers Moger, Babu
@ 2010-07-29 21:54 ` Chandra Seetharaman
  2010-07-29 22:35   ` Moger, Babu
  0 siblings, 1 reply; 8+ messages in thread
From: Chandra Seetharaman @ 2010-07-29 21:54 UTC (permalink / raw)
  To: Moger, Babu
  Cc: device-mapper development, linux-scsi@vger.kernel.org,
	Qi, Yanling, Chauhan, Vijay, Dachepalli, Sudhir, Stankey, Robert

Babu,

Your main object is to protect scsi_dh_data across scsi_dh_activate() by
way of getting kref around scsi_dh_activate(), right ?

Wouldn't doing what Shyam suggested (doing kref_put() and put_device())
in scsi_activate() make it simpler and code still be readable ? (it
would make all the patches except 2/6 not needed).

Did you hit with any problems doing it that way ?

Also the snippet (in 2/6)
---------------
@@ -228,7 +228,8 @@ store_dh_state(struct device *dev, struc
                         * Activate a device handler
                         */
                        if (scsi_dh->activate)
-                               err = scsi_dh->activate(sdev, NULL, NULL);
+                               err = scsi_dh_activate(sdev->request_queue,
+                                                       NULL, NULL);
                        else
                                err = 0;
                }
------------------
can be made as a patch in itself.

Thanks for fixing the problem.

regards,

chandra
On Wed, 2010-07-28 at 16:58 -0600, Moger, Babu wrote:
> These patches fix the following two cases. 
> 1. Devices going away while scsi device hander's activate is still in progress.
> 
> 2. Removal of scsi_dh_data(calling detach handler) when scsi device hander's activate is still in progress.
> 
> We have been seeing these problems while running multipath failover tests on LSI storage. These patches fix the problem. We have verified it.
> 
> Here is the panic we have been seeing while running failover failback tests.
> 
> > 00:40:42:869  COM1 >------------[ cut here ]------------
> > 00:40:42:869  COM1 >kernel BUG at /usr/src/packages/BUILD/lsi-
> > scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
> > 00:40:42:869  COM1 >invalid opcode: 0000 [1] SMP 
> > 00:40:42:885  COM1 >last sysfs file: /sys/kernel/uevent_seqnum
> > 00:40:42:885  COM1 >CPU 3 
> > 00:40:42:885  COM1 >Modules linked in: dm_round_robin dm_multipath 
> > nls_utf8 cifs(X) microcode af_packet ipv6 fuse loop dm_mod iTCO_wdt 
> > iTCO_vendor_support dcdbas(X) pcspkr rtc_cmos rtc_core serio_raw 
> > rtc_lib i5000_edac edac_core bnx2 shpchp sg pci_hotplug button 
> > mptctl usbhid hid ff_memless uhci_hcd ehci_hcd usbcore sd_mod 
> > crc_t10dif mpt2sas(N) raid_class edd ext3 mbcache jbd fan 
> > ide_pci_generic piix ide_core ata_generic ata_piix libata dock 
> > mptsas mptscsih mptbase scsi_transport_sas thermal processor 
> > thermal_sys hwmon scsi_dh_rdac(X) scsi_dh scsi_mod
> > 00:40:42:932  COM1 >Supported: No
> > 00:40:42:932  COM1 >Pid: 14044, comm: kmpath_handlerd Tainted: G    
> > 2.6.27.39-0.3-default #1
> > 00:40:42:932  COM1 >RIP: 0010:  
> > rdac_activate+0x257/0x387 [scsi_dh_rdac]
> > 00:40:42:947  COM1 >RSP: 0018:ffff880127109dc0  EFLAGS: 00010246
> > 00:40:42:947  COM1 >RAX: ffff8800ae02f000 RBX: 0000000000000001 RCX:
> > 0000000000000018
> > 00:40:42:963  COM1 >RDX: 0000000000001bbc RSI: 0000000000000282 RDI:
> > ffff8800c2ccd918
> > 00:40:42:963  COM1 >RBP: 00000000fffffffb R08: ffffffff806eaf78 R09:
> > ffff880028087720
> > 00:40:42:963  COM1 >R10: 0000000000000000 R11: ffffffff80284ebe R12:
> > ffffffffa0030fbe
> > 00:40:42:978  COM1 >R13: 0000000000000000 R14: 0000000000000282 R15:
> > 0000000000000000
> > 00:40:42:978  COM1 >FS:  0000000000000000(0000) GS:ffff88012fb81ec0
> > (0000) knlGS:0000000000000000
> > 00:40:42:978  COM1 >CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > 00:40:42:994  COM1 >CR2: 00000000f7701630 CR3: 0000000101d5e000 CR4:
> > 00000000000006e0
> > 00:40:42:994  COM1 >DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > 00:40:43:010  COM1 >DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> > 0000000000000400
> > 00:40:43:010  COM1 >Process kmpath_handlerd (pid: 14044, threadinfo 
> > ffff880127108000, task ffff88012710e680)
> > 00:40:43:010  COM1 >Stack:  ffff880127109ec0 ffffffff8049c431 
> > 0000000000000000 ffff880127109e50
> > 00:40:43:025  COM1 > ffff8800ae02f000 ffff8800b5032208 
> > ffff8800b5032200 ffff8800ae02f250
> > 00:40:43:025  COM1 > ffff8800b5032216 0000000580a33680 
> > ffff8800c2ccd6b0 ffff8800ae02f120
> > 00:40:43:041  COM1 >Call Trace:
> > 00:40:43:041  COM1 > scsi_dh_activate+0x81/0x9b[scsi_dh]
> > 00:40:43:041  COM1 > activate_path+0x22/0x46 
> > [dm_multipath]
> > 00:40:43:041  COM1 > run_workqueue+0x7a/0x100
> > 00:40:43:057  COM1 > worker_thread+0xd8/0xe7
> > 00:40:43:057  COM1 > kthread+0x47/0x73
> > 00:40:43:057  COM1 > child_rip+0xa/0x11
> > 00:40:43:057  COM1 >
> > 00:40:43:057  COM1 >
> > 00:40:43:057  COM1 >Code: 4c 89 ea e8 78 dd 30 e0 4c 89 ef 89 c5 e8 
> > db a8 30 e0 85 ed 0f 84 da 00 00 00 48 8b 44 24 20 4c 8b a8 d0 05 00
> > 00 4d 85 ed 75 04 <0f> 0b eb fe 48 8b 7c 24 40 48 8d 54 24 60 be 60 
> > 00 00 00 e8 ae 
> > 00:40:43:072  COM1 >RIP   rdac_activate+0x257/
> > 0x387 [scsi_dh_rdac]
> > 00:40:43:088  COM1 > RSP <ffff880127109dc0>
> > 00:40:43:088  COM1 >---[ end trace 00e89c598c82483b ]---
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers
  2010-07-29 21:54 ` Chandra Seetharaman
@ 2010-07-29 22:35   ` Moger, Babu
  2010-07-30  0:04     ` Chandra Seetharaman
  0 siblings, 1 reply; 8+ messages in thread
From: Moger, Babu @ 2010-07-29 22:35 UTC (permalink / raw)
  To: sekharan@us.ibm.com, device-mapper development,
	linux-scsi@vger.kernel.org,
	"Shyam_Iyer@Dell.com" <Shyam_I>
  Cc: Qi, Yanling, Chauhan, Vijay, Dachepalli, Sudhir, Stankey, Robert

Chandra, Shyam,  
 Thanks for your comments.. Please see my response.

> -----Original Message-----
> From: Chandra Seetharaman [mailto:sekharan@us.ibm.com]
> Sent: Thursday, July 29, 2010 4:54 PM
> To: Moger, Babu
> Cc: device-mapper development; linux-scsi@vger.kernel.org; Qi, Yanling;
> Chauhan, Vijay; Dachepalli, Sudhir; Stankey, Robert
> Subject: Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> handlers
> 
> Babu,
> 
> Your main object is to protect scsi_dh_data across scsi_dh_activate()
> by
> way of getting kref around scsi_dh_activate(), right ?
> 

  Yes, That is correct. 

> Wouldn't doing what Shyam suggested (doing kref_put() and put_device())
> in scsi_activate() make it simpler and code still be readable ? (it
> would make all the patches except 2/6 not needed).
> 
> Did you hit with any problems doing it that way ?
>

  Yes, We can do that.  Problem is I am hitting the issue with BUG_ON in get_rdac_data which is there in the beginning of rdac_activate.
  If I have to go this way, then I need to remove a call get_rdac_data and just validate pointers. Report error(SCSI_DH_IO) if pointer is not valid.
  Then hold the reference counts and continue if everything is alright.  I will send the new patches as soon as I can.  

> Also the snippet (in 2/6)
> ---------------
> @@ -228,7 +228,8 @@ store_dh_state(struct device *dev, struc
>                          * Activate a device handler
>                          */
>                         if (scsi_dh->activate)
> -                               err = scsi_dh->activate(sdev, NULL,
> NULL);
> +                               err = scsi_dh_activate(sdev-
> >request_queue,
> +                                                       NULL, NULL);
>                         else
>                                 err = 0;
>                 }
> ------------------
> can be made as a patch in itself.
> 
> Thanks for fixing the problem.

 Ok.. I will make this as separate patch.. 

>
> regards,
> 
> chandra
> On Wed, 2010-07-28 at 16:58 -0600, Moger, Babu wrote:
> > These patches fix the following two cases.
> > 1. Devices going away while scsi device hander's activate is still in
> progress.
> >
> > 2. Removal of scsi_dh_data(calling detach handler) when scsi device
> hander's activate is still in progress.
> >
> > We have been seeing these problems while running multipath failover
> tests on LSI storage. These patches fix the problem. We have verified
> it.
> >
> > Here is the panic we have been seeing while running failover failback
> tests.
> >
> > > 00:40:42:869  COM1 >------------[ cut here ]------------
> > > 00:40:42:869  COM1 >kernel BUG at /usr/src/packages/BUILD/lsi-
> > > scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
> > > 00:40:42:869  COM1 >invalid opcode: 0000 [1] SMP
> > > 00:40:42:885  COM1 >last sysfs file: /sys/kernel/uevent_seqnum
> > > 00:40:42:885  COM1 >CPU 3
> > > 00:40:42:885  COM1 >Modules linked in: dm_round_robin dm_multipath
> > > nls_utf8 cifs(X) microcode af_packet ipv6 fuse loop dm_mod iTCO_wdt
> > > iTCO_vendor_support dcdbas(X) pcspkr rtc_cmos rtc_core serio_raw
> > > rtc_lib i5000_edac edac_core bnx2 shpchp sg pci_hotplug button
> > > mptctl usbhid hid ff_memless uhci_hcd ehci_hcd usbcore sd_mod
> > > crc_t10dif mpt2sas(N) raid_class edd ext3 mbcache jbd fan
> > > ide_pci_generic piix ide_core ata_generic ata_piix libata dock
> > > mptsas mptscsih mptbase scsi_transport_sas thermal processor
> > > thermal_sys hwmon scsi_dh_rdac(X) scsi_dh scsi_mod
> > > 00:40:42:932  COM1 >Supported: No
> > > 00:40:42:932  COM1 >Pid: 14044, comm: kmpath_handlerd Tainted: G
> > > 2.6.27.39-0.3-default #1
> > > 00:40:42:932  COM1 >RIP: 0010:
> > > rdac_activate+0x257/0x387 [scsi_dh_rdac]
> > > 00:40:42:947  COM1 >RSP: 0018:ffff880127109dc0  EFLAGS: 00010246
> > > 00:40:42:947  COM1 >RAX: ffff8800ae02f000 RBX: 0000000000000001
> RCX:
> > > 0000000000000018
> > > 00:40:42:963  COM1 >RDX: 0000000000001bbc RSI: 0000000000000282
> RDI:
> > > ffff8800c2ccd918
> > > 00:40:42:963  COM1 >RBP: 00000000fffffffb R08: ffffffff806eaf78
> R09:
> > > ffff880028087720
> > > 00:40:42:963  COM1 >R10: 0000000000000000 R11: ffffffff80284ebe
> R12:
> > > ffffffffa0030fbe
> > > 00:40:42:978  COM1 >R13: 0000000000000000 R14: 0000000000000282
> R15:
> > > 0000000000000000
> > > 00:40:42:978  COM1 >FS:  0000000000000000(0000) GS:ffff88012fb81ec0
> > > (0000) knlGS:0000000000000000
> > > 00:40:42:978  COM1 >CS:  0010 DS: 0018 ES: 0018 CR0:
> 000000008005003b
> > > 00:40:42:994  COM1 >CR2: 00000000f7701630 CR3: 0000000101d5e000
> CR4:
> > > 00000000000006e0
> > > 00:40:42:994  COM1 >DR0: 0000000000000000 DR1: 0000000000000000
> DR2:
> > > 0000000000000000
> > > 00:40:43:010  COM1 >DR3: 0000000000000000 DR6: 00000000ffff0ff0
> DR7:
> > > 0000000000000400
> > > 00:40:43:010  COM1 >Process kmpath_handlerd (pid: 14044, threadinfo
> > > ffff880127108000, task ffff88012710e680)
> > > 00:40:43:010  COM1 >Stack:  ffff880127109ec0 ffffffff8049c431
> > > 0000000000000000 ffff880127109e50
> > > 00:40:43:025  COM1 > ffff8800ae02f000 ffff8800b5032208
> > > ffff8800b5032200 ffff8800ae02f250
> > > 00:40:43:025  COM1 > ffff8800b5032216 0000000580a33680
> > > ffff8800c2ccd6b0 ffff8800ae02f120
> > > 00:40:43:041  COM1 >Call Trace:
> > > 00:40:43:041  COM1 > scsi_dh_activate+0x81/0x9b[scsi_dh]
> > > 00:40:43:041  COM1 > activate_path+0x22/0x46
> > > [dm_multipath]
> > > 00:40:43:041  COM1 > run_workqueue+0x7a/0x100
> > > 00:40:43:057  COM1 > worker_thread+0xd8/0xe7
> > > 00:40:43:057  COM1 > kthread+0x47/0x73
> > > 00:40:43:057  COM1 > child_rip+0xa/0x11
> > > 00:40:43:057  COM1 >
> > > 00:40:43:057  COM1 >
> > > 00:40:43:057  COM1 >Code: 4c 89 ea e8 78 dd 30 e0 4c 89 ef 89 c5 e8
> > > db a8 30 e0 85 ed 0f 84 da 00 00 00 48 8b 44 24 20 4c 8b a8 d0 05
> 00
> > > 00 4d 85 ed 75 04 <0f> 0b eb fe 48 8b 7c 24 40 48 8d 54 24 60 be 60
> > > 00 00 00 e8 ae
> > > 00:40:43:072  COM1 >RIP   rdac_activate+0x257/
> > > 0x387 [scsi_dh_rdac]
> > > 00:40:43:088  COM1 > RSP <ffff880127109dc0>
> > > 00:40:43:088  COM1 >---[ end trace 00e89c598c82483b ]---
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi"
> in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers
  2010-07-29 22:35   ` Moger, Babu
@ 2010-07-30  0:04     ` Chandra Seetharaman
  2010-07-30 18:12       ` Moger, Babu
  0 siblings, 1 reply; 8+ messages in thread
From: Chandra Seetharaman @ 2010-07-30  0:04 UTC (permalink / raw)
  To: Moger, Babu
  Cc: device-mapper development, linux-scsi@vger.kernel.org,
	Shyam_Iyer@Dell.com, Qi, Yanling, Chauhan, Vijay,
	Dachepalli, Sudhir, Stankey, Robert

See comment below

On Thu, 2010-07-29 at 16:35 -0600, Moger, Babu wrote:
> Chandra, Shyam,  
>  Thanks for your comments.. Please see my response.
> 
> > -----Original Message-----
> > From: Chandra Seetharaman [mailto:sekharan@us.ibm.com]
> > Sent: Thursday, July 29, 2010 4:54 PM
> > To: Moger, Babu
> > Cc: device-mapper development; linux-scsi@vger.kernel.org; Qi, Yanling;
> > Chauhan, Vijay; Dachepalli, Sudhir; Stankey, Robert
> > Subject: Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> > handlers
> > 
> > Babu,
> > 
> > Your main object is to protect scsi_dh_data across scsi_dh_activate()
> > by
> > way of getting kref around scsi_dh_activate(), right ?
> > 
> 
>   Yes, That is correct. 
> 
> > Wouldn't doing what Shyam suggested (doing kref_put() and put_device())
> > in scsi_activate() make it simpler and code still be readable ? (it
> > would make all the patches except 2/6 not needed).
> > 
> > Did you hit with any problems doing it that way ?
> >
> 
>   Yes, We can do that.  Problem is I am hitting the issue with BUG_ON
>  in get_rdac_data which is there in the beginning of rdac_activate.

I do not understand the problem.

If the BUG_ON on get_rdac_data() is being triggered, that means
sdev->scsi_dh_data is NULL, if that is the case, how can you access
sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
it trip a oops ?


>   If I have to go this way, then I need to remove a call get_rdac_data
>  and just validate pointers. Report error(SCSI_DH_IO) if pointer is not valid.
>   Then hold the reference counts and continue if everything is alright.
>   I will send the new patches as soon as I can.  



^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers
  2010-07-30  0:04     ` Chandra Seetharaman
@ 2010-07-30 18:12       ` Moger, Babu
  2010-07-30 23:56         ` Chandra Seetharaman
  0 siblings, 1 reply; 8+ messages in thread
From: Moger, Babu @ 2010-07-30 18:12 UTC (permalink / raw)
  To: sekharan@us.ibm.com, device-mapper development,
	linux-scsi@vger.kernel.org
  Cc: Shyam_Iyer@Dell.com, Qi, Yanling, Chauhan, Vijay,
	Dachepalli, Sudhir, Stankey, Robert

See my comments below..

> -----Original Message-----
> From: Chandra Seetharaman [mailto:sekharan@us.ibm.com]
> Sent: Thursday, July 29, 2010 7:05 PM
> To: Moger, Babu
> Cc: device-mapper development; linux-scsi@vger.kernel.org;
> Shyam_Iyer@Dell.com; Qi, Yanling; Chauhan, Vijay; Dachepalli, Sudhir;
> Stankey, Robert
> Subject: RE: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> handlers
> 
> See comment below
> 
> On Thu, 2010-07-29 at 16:35 -0600, Moger, Babu wrote:
> > Chandra, Shyam,
> >  Thanks for your comments.. Please see my response.
> >
> > > -----Original Message-----
> > > From: Chandra Seetharaman [mailto:sekharan@us.ibm.com]
> > > Sent: Thursday, July 29, 2010 4:54 PM
> > > To: Moger, Babu
> > > Cc: device-mapper development; linux-scsi@vger.kernel.org; Qi,
> Yanling;
> > > Chauhan, Vijay; Dachepalli, Sudhir; Stankey, Robert
> > > Subject: Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> > > handlers
> > >
> > > Babu,
> > >
> > > Your main object is to protect scsi_dh_data across
> scsi_dh_activate()
> > > by
> > > way of getting kref around scsi_dh_activate(), right ?
> > >
> >
> >   Yes, That is correct.
> >
> > > Wouldn't doing what Shyam suggested (doing kref_put() and
> put_device())
> > > in scsi_activate() make it simpler and code still be readable ? (it
> > > would make all the patches except 2/6 not needed).
> > >
> > > Did you hit with any problems doing it that way ?
> > >
> >
> >   Yes, We can do that.  Problem is I am hitting the issue with BUG_ON
> >  in get_rdac_data which is there in the beginning of rdac_activate.
> 
> I do not understand the problem.
> 
> If the BUG_ON on get_rdac_data() is being triggered, that means
> sdev->scsi_dh_data is NULL, if that is the case, how can you access
> sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
> it trip a oops ?

Test case is deleting both active and passive paths almost together during the multipath testing.  Looks like DM picked up the active path failure first. Then failed the active path and started scheduling activate_path to failover to passive path. Passive path is also about to go down pretty soon. When the control was in scsi_dh_activate, Looks like scsi_dh_data was still valid because I did not see panic here.  But scsi_dh_data became NULL when control went to rdac_activate. That is when I hit the bug on. 

kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
RIP: 0010:  rdac_activate+0x257/0x387 [scsi_dh_rdac]

My understanding is someone triggered scsi_dh->detach for passive path during this small window.  Only way I could see problem go away is holding reference counts between these calls.  Did i miss anything here? See the code snippet below.. 

int scsi_dh_activate(struct request_queue *q, activate_complete fn, void *data)
 {
         int err = 0;
         unsigned long flags;
         struct scsi_device *sdev;
         struct scsi_device_handler *scsi_dh = NULL;
 
         spin_lock_irqsave(q->queue_lock, flags);
         sdev = q->queuedata;
         if (sdev && sdev->scsi_dh_data)
                 scsi_dh = sdev->scsi_dh_data->scsi_dh;
         if (!scsi_dh || !get_device(&sdev->sdev_gendev))
                 err = SCSI_DH_NOSYS;
         spin_unlock_irqrestore(q->queue_lock, flags);
 
         if (err)
                 return err;
 
         if (scsi_dh->activate)
                 err = scsi_dh->activate(sdev, fn, data);
         put_device(&sdev->sdev_gendev);
         return err;
 }
 EXPORT_SYMBOL_GPL(scsi_dh_activate);
 

static int rdac_activate(struct scsi_device *sdev,
                         activate_complete fn, void *data)
 {
         struct rdac_dh_data *h = get_rdac_data(sdev); *******  I hit BUG_ON here *********
         int err = SCSI_DH_OK;
 
         err = check_ownership(sdev, h);
         if (err != SCSI_DH_OK)
                 goto done;
 
         if (h->lun_state == RDAC_LUN_UNOWNED) {
                 err = queue_mode_select(sdev, fn, data);
                 if (err == SCSI_DH_OK)
                         return 0;
         }
 done:
         if (fn)
                 fn(data, err);
         return 0;
 }


> 
> 
> >   If I have to go this way, then I need to remove a call
> get_rdac_data
> >  and just validate pointers. Report error(SCSI_DH_IO) if pointer is
> not valid.
> >   Then hold the reference counts and continue if everything is
> alright.
> >   I will send the new patches as soon as I can.
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers
  2010-07-30 18:12       ` Moger, Babu
@ 2010-07-30 23:56         ` Chandra Seetharaman
  2010-07-31  0:22           ` Shyam Iyer
  0 siblings, 1 reply; 8+ messages in thread
From: Chandra Seetharaman @ 2010-07-30 23:56 UTC (permalink / raw)
  To: Moger, Babu
  Cc: device-mapper development, linux-scsi@vger.kernel.org,
	Shyam_Iyer@Dell.com, Qi, Yanling, Chauhan, Vijay,
	Dachepalli, Sudhir, Stankey, Robert


On Fri, 2010-07-30 at 12:12 -0600, Moger, Babu wrote:

> > >
> > >   Yes, We can do that.  Problem is I am hitting the issue with BUG_ON
> > >  in get_rdac_data which is there in the beginning of rdac_activate.
> > 
> > I do not understand the problem.
> > 
> > If the BUG_ON on get_rdac_data() is being triggered, that means
> > sdev->scsi_dh_data is NULL, if that is the case, how can you access
> > sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
> > it trip a oops ?
> 
> Test case is deleting both active and passive paths almost together during
>  the multipath testing.  Looks like DM picked up the active path failure 
> first. Then failed the active path and started scheduling activate_path to 
> failover to passive path. Passive path is also about to go down pretty soon. 
> When the control was in scsi_dh_activate, Looks like scsi_dh_data was still 
> valid because I did not see panic here.  But scsi_dh_data became NULL when 
> control went to rdac_activate. That is when I hit the bug on. 
> 
> kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
> RIP: 0010:  rdac_activate+0x257/0x387 [scsi_dh_rdac]
> 
> My understanding is someone triggered scsi_dh->detach for passive path during
>  this small window.  Only way I could see problem go away is holding reference 
> counts between these calls.  Did i miss anything here? See the code snippet below.. 

So, basically, the new patch just reduces the window of race. IOW, if it
just spins for few seconds (or preempted) just before calling the
kref_get() in the new patch, it will generate an oops as scsi_dh_data
would be NULL.

Looks like you have to create a lock to protect scsi_dh_data and then
call kref_get() with the protection of lock.

> 
> int scsi_dh_activate(struct request_queue *q, activate_complete fn, void *data)
>  {
>          int err = 0;
>          unsigned long flags;
>          struct scsi_device *sdev;
>          struct scsi_device_handler *scsi_dh = NULL;
>  
>          spin_lock_irqsave(q->queue_lock, flags);
>          sdev = q->queuedata;
>          if (sdev && sdev->scsi_dh_data)
>                  scsi_dh = sdev->scsi_dh_data->scsi_dh;
>          if (!scsi_dh || !get_device(&sdev->sdev_gendev))
>                  err = SCSI_DH_NOSYS;
>          spin_unlock_irqrestore(q->queue_lock, flags);
>  
>          if (err)
>                  return err;
>  
>          if (scsi_dh->activate)
>                  err = scsi_dh->activate(sdev, fn, data);
>          put_device(&sdev->sdev_gendev);
>          return err;
>  }
>  EXPORT_SYMBOL_GPL(scsi_dh_activate);
>  
> 
> static int rdac_activate(struct scsi_device *sdev,
>                          activate_complete fn, void *data)
>  {
>          struct rdac_dh_data *h = get_rdac_data(sdev); *******  I hit BUG_ON here *********
>          int err = SCSI_DH_OK;
>  
>          err = check_ownership(sdev, h);
>          if (err != SCSI_DH_OK)
>                  goto done;
>  
>          if (h->lun_state == RDAC_LUN_UNOWNED) {
>                  err = queue_mode_select(sdev, fn, data);
>                  if (err == SCSI_DH_OK)
>                          return 0;
>          }
>  done:
>          if (fn)
>                  fn(data, err);
>          return 0;
>  }
> 
> 
> > 
> > 
> > >   If I have to go this way, then I need to remove a call
> > get_rdac_data
> > >  and just validate pointers. Report error(SCSI_DH_IO) if pointer is
> > not valid.
> > >   Then hold the reference counts and continue if everything is
> > alright.
> > >   I will send the new patches as soon as I can.
> > 
> 
> NrybXǧv^)޺{.n+{"{ay\x1dʇڙ,j\afhz\x1ew\fj:+vwjm\azZ+ݢj"!


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers
  2010-07-30 23:56         ` Chandra Seetharaman
@ 2010-07-31  0:22           ` Shyam Iyer
  2010-07-31  5:02             ` Moger, Babu
  0 siblings, 1 reply; 8+ messages in thread
From: Shyam Iyer @ 2010-07-31  0:22 UTC (permalink / raw)
  To: sekharan
  Cc: Moger, Babu, device-mapper development,
	linux-scsi@vger.kernel.org, Qi, Yanling, Chauhan, Vijay,
	Dachepalli, Sudhir, Stankey, Robert

On 07/30/2010 07:56 PM, Chandra Seetharaman wrote:
> On Fri, 2010-07-30 at 12:12 -0600, Moger, Babu wrote:
>
>    
>>>>    Yes, We can do that.  Problem is I am hitting the issue with BUG_ON
>>>>   in get_rdac_data which is there in the beginning of rdac_activate.
>>>>          
>>> I do not understand the problem.
>>>
>>> If the BUG_ON on get_rdac_data() is being triggered, that means
>>> sdev->scsi_dh_data is NULL, if that is the case, how can you access
>>> sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
>>> it trip a oops ?
>>>        
>> Test case is deleting both active and passive paths almost together during
>>   the multipath testing.  Looks like DM picked up the active path failure
>> first. Then failed the active path and started scheduling activate_path to
>> failover to passive path. Passive path is also about to go down pretty soon.
>> When the control was in scsi_dh_activate, Looks like scsi_dh_data was still
>> valid because I did not see panic here.  But scsi_dh_data became NULL when
>> control went to rdac_activate. That is when I hit the bug on.
>>
>> kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
>> RIP: 0010:  rdac_activate+0x257/0x387 [scsi_dh_rdac]
>>
>> My understanding is someone triggered scsi_dh->detach for passive path during
>>   this small window.  Only way I could see problem go away is holding reference
>> counts between these calls.  Did i miss anything here? See the code snippet below..
>>      
> So, basically, the new patch just reduces the window of race. IOW, if it
> just spins for few seconds (or preempted) just before calling the
> kref_get() in the new patch, it will generate an oops as scsi_dh_data
> would be NULL.
>
> Looks like you have to create a lock to protect scsi_dh_data and then
> call kref_get() with the protection of lock.
>
>    
Second that.. Sounds like each kref_get/kref_put .. needs a lock protection not just this scenario.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers
  2010-07-31  0:22           ` Shyam Iyer
@ 2010-07-31  5:02             ` Moger, Babu
  0 siblings, 0 replies; 8+ messages in thread
From: Moger, Babu @ 2010-07-31  5:02 UTC (permalink / raw)
  To: Shyam Iyer, sekharan@us.ibm.com
  Cc: device-mapper development, linux-scsi@vger.kernel.org,
	Qi, Yanling, Chauhan, Vijay, Dachepalli, Sudhir, Stankey, Robert



> -----Original Message-----
> From: Shyam Iyer [mailto:shyam_iyer@dell.com]
> Sent: Friday, July 30, 2010 7:22 PM
> To: sekharan@us.ibm.com
> Cc: Moger, Babu; device-mapper development; linux-scsi@vger.kernel.org;
> Qi, Yanling; Chauhan, Vijay; Dachepalli, Sudhir; Stankey, Robert
> Subject: Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> handlers
> 
> On 07/30/2010 07:56 PM, Chandra Seetharaman wrote:
> > On Fri, 2010-07-30 at 12:12 -0600, Moger, Babu wrote:
> >
> >
> >>>>    Yes, We can do that.  Problem is I am hitting the issue with
> BUG_ON
> >>>>   in get_rdac_data which is there in the beginning of rdac_activate.
> >>>>
> >>> I do not understand the problem.
> >>>
> >>> If the BUG_ON on get_rdac_data() is being triggered, that means
> >>> sdev->scsi_dh_data is NULL, if that is the case, how can you access
> >>> sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
> >>> it trip a oops ?
> >>>
> >> Test case is deleting both active and passive paths almost together
> during
> >>   the multipath testing.  Looks like DM picked up the active path
> failure
> >> first. Then failed the active path and started scheduling activate_path
> to
> >> failover to passive path. Passive path is also about to go down pretty
> soon.
> >> When the control was in scsi_dh_activate, Looks like scsi_dh_data was
> still
> >> valid because I did not see panic here.  But scsi_dh_data became NULL
> when
> >> control went to rdac_activate. That is when I hit the bug on.
> >>
> >> kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-
> 01.00/obj/default/scsi_dh_rdac.c:232!
> >> RIP: 0010:  rdac_activate+0x257/0x387 [scsi_dh_rdac]
> >>
> >> My understanding is someone triggered scsi_dh->detach for passive path
> during
> >>   this small window.  Only way I could see problem go away is holding
> reference
> >> counts between these calls.  Did i miss anything here? See the code
> snippet below..
> >>
> > So, basically, the new patch just reduces the window of race. IOW, if it
> > just spins for few seconds (or preempted) just before calling the
> > kref_get() in the new patch, it will generate an oops as scsi_dh_data
> > would be NULL.

  That is correct. 

> >
> > Looks like you have to create a lock to protect scsi_dh_data and then
> > call kref_get() with the protection of lock.
> >
> >
> Second that.. Sounds like each kref_get/kref_put .. needs a lock
> protection not just this scenario.

  I agree. I will attempt one more set of patches. This time I will try to match get and put sequence in a one file. Please feel free to comment.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-07-31  5:02 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-07-28 22:58 [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers Moger, Babu
2010-07-29 21:54 ` Chandra Seetharaman
2010-07-29 22:35   ` Moger, Babu
2010-07-30  0:04     ` Chandra Seetharaman
2010-07-30 18:12       ` Moger, Babu
2010-07-30 23:56         ` Chandra Seetharaman
2010-07-31  0:22           ` Shyam Iyer
2010-07-31  5:02             ` Moger, Babu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).