* further testing w/ multipath ... and bugs
@ 2005-06-13 8:11 Christophe Varoqui
2005-06-13 10:06 ` Christophe Varoqui
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Christophe Varoqui @ 2005-06-13 8:11 UTC (permalink / raw)
To: dm-devel
Hello,
I'm testing Mike Christie's START_STOP hwhandler and discovered a bunch of new, interesting phenomena:
A little context first:
o kernel 2.6.12-rc6 + qlogic discovery patch
o qla2342 (dual 2Gb)
o EVA5000, Solaris-tagged connections
Here is a map created by multipath, fresh from boot:
eva1_lun2 (3600508b400014ba7000120000cf00000)
[size=50 GB][features="1 queue_if_no_path"][hwhandler="1 hp_sw"]
\_ round-robin 0 [active][best]
 \_ 0:0:0:2 sdb 8:16  [ready ][active]
 \_ 1:0:0:2 sdf 8:80  [ready ][active]
\_ round-robin 0 [enabled]
 \_ 0:0:1:2 sdd 8:48  [faulty][active]
 \_ 1:0:1:2 sdh 8:112 [faulty][active]
Start a background streaming read with dd on that map.
Disable the FC switch port connected to HBA 0.
Consistently, at this moment I get the following in the logs:
qla2300 0000:05:0d.0: LOOP DOWN detected.
Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():1
[<c0120a74>] __might_sleep+0xa4/0xc0
[<c026a466>] device_for_each_child+0x26/0x80
[<c02b3180>] target_block+0x0/0x30
[<c02bbdae>] fc_remote_port_block+0x2e/0x60
[<c02bdbf5>] qla2x00_mark_all_devices_lost+0x55/0x60
[<c02c597e>] qla2x00_async_event+0x83e/0xd60
[<c011dd2b>] find_busiest_group+0xbb/0x310
[<c02cdce4>] sd_rw_intr+0x164/0x320
[<c02c4e37>] qla2300_intr_handler+0x77/0x240
[<c0144882>] handle_IRQ_event+0x32/0x70
[<c0144997>] __do_IRQ+0xd7/0x140
[<c0106756>] do_IRQ+0x36/0x70
[<c0104c1e>] common_interrupt+0x1a/0x20
[<c0102030>] default_idle+0x0/0x30
[<c0102053>] default_idle+0x23/0x30
[<c0102104>] cpu_idle+0x64/0x80
If I wait long enough, I then get the following:
rport-0:0-0: blocked FC remote port time out: removing target
rport-0:0-1: blocked FC remote port time out: removing target
... which is rather new to me.
As a side effect, all associated sd devices are removed and uevents are sent signaling that the disks are gone. In the current implementation, this triggers checker removal on the multipathd side.
Then, when the port is re-enabled, the sd devices are registered again with different minors than before. uevent adds get sent, multipath reconfigures the maps and ...
Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
f8b0d29f
*pde = 08e4d001
Oops: 0000 [#1]
SMP
Modules linked in: dm_round_robin dm_hp_sw dm_multipath md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc video button battery ac ohci_hcd tg3 floppy dm_mod qla6312
CPU: 2
EIP: 0060:[<f8b0d29f>] Not tainted VLI
EFLAGS: 00010086 (2.6.12-rc6)
EIP is at rr_select_path+0xf/0x60 [dm_round_robin]
eax: f6a989cc ebx: 00000000 ecx: f6a978c0 edx: f7f1e77c
esi: f7f1e77c edi: 00000000 ebp: 00000001 esp: f65d1f00
ds: 007b es: 007b ss: 0068
Process kmpathd/2 (pid: 4564, threadinfo=f65d0000 task=f6708aa0)
Stack: f6a989c0 f7f1e740 f8ae3bc2 f7f1e740 f7f1e740 f8ae3c90 f7f1e740 f7f1e740
00000000 f7f1e74c f8ae3f9c 00000286 00000000 f7f1e754 f7f1e740 f7f34100
f7f1e790 00000282 c01339a2 00000000 000f42b4 f6cdfe5c f7f34128 f7f34110
Call Trace:
[<f8ae3bc2>] __choose_path_in_pg+0x12/0x40 [dm_multipath]
[<f8ae3c90>] __choose_pgpath+0xa0/0xb0 [dm_multipath]
[<f8ae3f9c>] process_queued_ios+0x7c/0xf0 [dm_multipath]
[<c01339a2>] worker_thread+0x1c2/0x250
[<f8ae3f20>] process_queued_ios+0x0/0xf0 [dm_multipath]
[<c011eaa0>] default_wake_function+0x0/0x10
[<c011eaa0>] default_wake_function+0x0/0x10
[<c01337e0>] worker_thread+0x0/0x250
[<c0137c65>] kthread+0xa5/0xf0
[<c0137bc0>] kthread+0x0/0xf0
[<c0102445>] kernel_thread_helper+0x5/0x10
Code: 42 04 89 10 89 58 04 89 03 31 c0 5b c3 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 90 83 ec 08 89 74 24 04 89 d6 89 1c 24 8b 58 04 <8b> 03 39 d8 74 30 89 c1 8b 50 04 8b 00 85 c9 89 50 04 89 02 8b
Here dd is now stuck in D-state.
I will post more as I continue my hammering.
Regards,
cvaroqui
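A note on the oops above: in the Code: bytes, the instruction just before the fault (8b 58 04) loads ebx from offset 4 of the first argument, and the faulting one (8b 03) dereferences ebx, which the register dump shows as 00000000. That is consistent with the selector being handed a NULL pointer where it expects its private state, though the thread does not establish the root cause. The following is only a userland sketch of a round-robin selector that refuses to dereference missing state; all type and function names are hypothetical and are not taken from dm-round-robin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland model; these are NOT the dm-round-robin types. */
struct path {
    const char *dev;          /* e.g. "8:16" */
    struct path *next;        /* circular list of valid paths */
};

struct selector {
    struct path *current;     /* last path handed out, NULL if no paths */
};

struct path_selector {
    const char *type;         /* e.g. "round-robin" */
    struct selector *context; /* private state; may be NULL after teardown */
};

/* Insert a path into the circular list right after the current one. */
void selector_add_path(struct selector *s, struct path *p)
{
    assert(s != NULL && p != NULL);
    if (s->current == NULL) {
        p->next = p;
        s->current = p;
    } else {
        p->next = s->current->next;
        s->current->next = p;
    }
}

/*
 * Return the next path in round-robin order, or NULL when the selector
 * has no private state or no valid path, instead of dereferencing NULL.
 */
struct path *select_path(struct path_selector *ps)
{
    if (ps == NULL || ps->context == NULL)
        return NULL;

    struct selector *s = ps->context;
    if (s->current == NULL)
        return NULL;

    s->current = s->current->next;
    return s->current;
}
```

A caller would treat a NULL return as "no usable path in this group" and either try another group or keep the I/O queued, which is what queue_if_no_path asks for.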
* Re: further testing w/ multipath ... and bugs
2005-06-13 8:11 further testing w/ multipath ... and bugs Christophe Varoqui
@ 2005-06-13 10:06 ` Christophe Varoqui
2005-06-13 14:37 ` Christophe Varoqui
2005-06-13 18:20 ` Andrew Vasquez
2005-06-13 19:36 ` Mike Christie
2 siblings, 1 reply; 7+ messages in thread
From: Christophe Varoqui @ 2005-06-13 10:06 UTC (permalink / raw)
To: device-mapper development
Here is an additional one:
At the end of the previous scenario, with dd stuck in D-state, I ran "dmsetup remove_all" ... and it actually agreed to remove the maps. Running multipath again gives:
[<c027506c>] end_that_request_last+0xcc/0x100
[<c02b19ed>] scsi_end_request+0x9d/0xe0
[<c02b1d45>] scsi_io_completion+0x155/0x500
[<c0327643>] ip_rcv+0x3a3/0x560
[<c012c1de>] del_timer+0x5e/0x70
[<c02cdce4>] sd_rw_intr+0x164/0x320
[<c0149531>] mempool_free+0x81/0xa0
[<c02c60cd>] qla2x00_process_response_queue+0x14d/0x1d0
[<c02ac946>] scsi_finish_command+0x96/0xe0
[<c033f2e3>] tcp_write_timer+0x73/0xe0
[<c02ac836>] scsi_softirq+0xa6/0xe0
[<c01285b2>] __do_softirq+0x82/0x100
[<c0128665>] do_softirq+0x35/0x40
[<c010675b>] do_IRQ+0x3b/0x70
[<c0104c1e>] common_interrupt+0x1a/0x20
[<c0102030>] default_idle+0x0/0x30
[<c0102053>] default_idle+0x23/0x30
[<c0102104>] cpu_idle+0x64/0x80
[<c0462965>] start_kernel+0x185/0x1d0
[<c0462370>] unknown_bootoption+0x0/0x1e0
Code: 90 80 3e 00 7e f9 fa eb e8 89 d8 8b 74 24 0c 8b 5c 24 08 83 c4 10 c3 c7 04 24 c4 6a 38 c0 8b 44 24 10 89 44 24 04 e8 6d 30 db ff <0f> 0b 95 00 c2 62 38 c0 eb bc 8d 76 00 53 83 ec 08 89 c3 fa 81
<0>Kernel panic - not syncing: Fatal exception in interrupt
Regards,
cvaroqui
* Re: further testing w/ multipath ... and bugs
2005-06-13 10:06 ` Christophe Varoqui
@ 2005-06-13 14:37 ` Christophe Varoqui
0 siblings, 0 replies; 7+ messages in thread
From: Christophe Varoqui @ 2005-06-13 14:37 UTC (permalink / raw)
To: device-mapper development
I also hit a DM path group corruption in this scenario:
Loaded map:
[root@s64p17bibrn cvaroqui]# multipath -ll
eva1_lun2 (3600508b400014ba7000120000cf00000)
[size=50 GB][features="1 queue_if_no_path"][hwhandler="1 hp_sw"]
\_ round-robin 0 [enabled]
 \_ 1:0:0:2 sdf 8:80  [faulty][failed]
\_ round-robin 0 [active][best]
 \_ 1:0:1:2 sdh 8:112 [ready ][active]
\_ round-robin 0 [enabled]
 \_ 0:0:0:2 sdj 8:144 [faulty][failed]
\_ round-robin 0 [enabled]
 \_ 0:0:1:2 sdl 8:176 [ready ][active]
Switch group:
[root@s64p17bibrn cvaroqui]# dmsetup message eva1_lun2 0 switch_group 3
Kernel message:
Jun 13 16:28:15 s64p17bibrn kernel: device-mapper: hp_sw: queueing START_STOP command on 8:112
8:112 is path->dev->name as seen by the hwhandler; it should have been 8:144 (sdj), the path in group 3.
And in fact, no group switch occurred.
Regards,
cvaroqui
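The expectation in the message above can be written down as a small check. This is a sketch only, assuming that the number passed to switch_group counts path groups from 1 in the order multipath -ll lists them, which is what the "should have been 8:144" remark implies; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one path group: only its first path matters here. */
struct path_group {
    const char *first_path;   /* major:minor, as printed by multipath -ll */
};

/*
 * Map a 1-based group number to the first path of that group.
 * Returns NULL when the number is 0 or past the last group.
 */
const char *expected_first_path(const struct path_group *groups,
                                size_t ngroups, unsigned int pgnum)
{
    assert(groups != NULL);
    if (pgnum == 0 || pgnum > ngroups)
        return NULL;
    return groups[pgnum - 1].first_path;
}

/*
 * Return 1 when the device named in the hwhandler log line is the one
 * the requested group should have activated, 0 otherwise.
 */
int log_matches_request(const struct path_group *groups, size_t ngroups,
                        unsigned int pgnum, const char *logged_dev)
{
    const char *want = expected_first_path(groups, ngroups, pgnum);

    return want != NULL && logged_dev != NULL &&
           strcmp(want, logged_dev) == 0;
}
```

With the four groups listed above (8:80, 8:112, 8:144, 8:176), a request for group 3 expects 8:144, so the logged 8:112 fails the check.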
* Re: further testing w/ multipath ... and bugs
2005-06-13 8:11 further testing w/ multipath ... and bugs Christophe Varoqui
2005-06-13 10:06 ` Christophe Varoqui
@ 2005-06-13 18:20 ` Andrew Vasquez
2005-06-13 18:56 ` christophe varoqui
2005-06-13 19:36 ` Mike Christie
2 siblings, 1 reply; 7+ messages in thread
From: Andrew Vasquez @ 2005-06-13 18:20 UTC (permalink / raw)
To: device-mapper development, Christophe Varoqui; +Cc: Linux-SCSI Mailing List
On Mon, 13 Jun 2005, Christophe Varoqui wrote:
>
> I'm testing Mike Christie's START_STOP hwhandler and discovered a bunch of new, interesting phenomena:
>
> A little context first:
> o kernel 2.6.12-rc6 + qlogic discovery patch
> o qla2342 (dual 2Gb)
> o EVA5000, Solaris-tagged connections
>
> Here is a map created by multipath, fresh from boot:
>
> eva1_lun2 (3600508b400014ba7000120000cf00000)
> [size=50 GB][features="1 queue_if_no_path"][hwhandler="1 hp_sw"]
> \_ round-robin 0 [active][best]
> \_ 0:0:0:2 sdb 8:16 [ready ][active]
> \_ 1:0:0:2 sdf 8:80 [ready ][active]
> \_ round-robin 0 [enabled]
> \_ 0:0:1:2 sdd 8:48 [faulty][active]
> \_ 1:0:1:2 sdh 8:112 [faulty][active]
>
> Start a background stream read with dd on that map.
>
> Do a port disable on the FC switch port connected to HBA 0
> Consistently at this moment I get the following in the logs :
>
> qla2300 0000:05:0d.0: LOOP DOWN detected.
> Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
> in_atomic():1, irqs_disabled():1
> [<c0120a74>] __might_sleep+0xa4/0xc0
> [<c026a466>] device_for_each_child+0x26/0x80
> [<c02b3180>] target_block+0x0/0x30
> [<c02bbdae>] fc_remote_port_block+0x2e/0x60
> [<c02bdbf5>] qla2x00_mark_all_devices_lost+0x55/0x60
> [<c02c597e>] qla2x00_async_event+0x83e/0xd60
> [<c011dd2b>] find_busiest_group+0xbb/0x310
> [<c02cdce4>] sd_rw_intr+0x164/0x320
> [<c02c4e37>] qla2300_intr_handler+0x77/0x240
> [<c0144882>] handle_IRQ_event+0x32/0x70
Without wanting to make a number of large changes to the qla2xxx
internals to deal with these pre-qualifications, could you try the
following patch (lightly tested with the latest Linus git tree)?
We'll need to update the fc_remote_port docs in order to account for
this semantic change in device_for_each_child().
--
av
Postpone fc_rport block/unblock to scheduled work.
diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
--- a/drivers/scsi/qla2xxx/qla_def.h
+++ b/drivers/scsi/qla2xxx/qla_def.h
@@ -33,6 +33,7 @@
#include <linux/mempool.h>
#include <linux/spinlock.h>
#include <linux/completion.h>
+#include <linux/workqueue.h>
#include <asm/semaphore.h>
#include <scsi/scsi.h>
@@ -1644,6 +1645,8 @@ typedef struct fc_port {
uint8_t cur_path; /* current path id */
struct fc_rport *rport;
+ struct work_struct block_work;
+ struct work_struct unblock_work;
} fc_port_t;
/*
diff --git a/drivers/scsi/qla2xxx/qla_gbl.h b/drivers/scsi/qla2xxx/qla_gbl.h
--- a/drivers/scsi/qla2xxx/qla_gbl.h
+++ b/drivers/scsi/qla2xxx/qla_gbl.h
@@ -82,6 +82,8 @@ extern void qla2x00_cmd_timeout(srb_t *)
extern void qla2x00_mark_device_lost(scsi_qla_host_t *, fc_port_t *, int);
extern void qla2x00_mark_all_devices_lost(scsi_qla_host_t *);
+extern void qla2x00_block_fcport(void *);
+extern void qla2x00_unblock_fcport(void *);
extern void qla2x00_blink_led(scsi_qla_host_t *);
diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
--- a/drivers/scsi/qla2xxx/qla_init.c
+++ b/drivers/scsi/qla2xxx/qla_init.c
@@ -1534,6 +1534,8 @@ qla2x00_alloc_fcport(scsi_qla_host_t *ha
fcport->iodesc_idx_sent = IODESC_INVALID_INDEX;
atomic_set(&fcport->state, FCS_UNCONFIGURED);
fcport->flags = FCF_RLC_SUPPORT;
+ INIT_WORK(&fcport->block_work, qla2x00_block_fcport, fcport);
+ INIT_WORK(&fcport->unblock_work, qla2x00_unblock_fcport, fcport);
return (fcport);
}
@@ -1899,7 +1901,7 @@ qla2x00_reg_remote_port(scsi_qla_host_t
struct fc_rport *rport;
if (fcport->rport) {
- fc_remote_port_unblock(fcport->rport);
+ schedule_work(&fcport->unblock_work);
return;
}
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1407,6 +1407,8 @@ void qla2x00_remove_one(struct pci_dev *
qla2x00_free_sysfs_attr(ha);
+ flush_scheduled_work();
+
fc_remove_host(ha->host);
scsi_remove_host(ha->host);
@@ -1481,7 +1483,7 @@ void qla2x00_mark_device_lost(scsi_qla_h
int do_login)
{
if (atomic_read(&fcport->state) == FCS_ONLINE && fcport->rport)
- fc_remote_port_block(fcport->rport);
+ schedule_work(&fcport->block_work);
/*
* We may need to retry the login, so don't change the state of the
* port but do the retries.
@@ -1542,11 +1544,25 @@ qla2x00_mark_all_devices_lost(scsi_qla_h
if (atomic_read(&fcport->state) == FCS_DEVICE_DEAD)
continue;
if (atomic_read(&fcport->state) == FCS_ONLINE && fcport->rport)
- fc_remote_port_block(fcport->rport);
+ schedule_work(&fcport->block_work);
atomic_set(&fcport->state, FCS_DEVICE_LOST);
}
}
+void
+qla2x00_block_fcport(void *data)
+{
+ fc_port_t *fcport = (fc_port_t *)data;
+ fc_remote_port_block(fcport->rport);
+}
+
+void
+qla2x00_unblock_fcport(void *data)
+{
+ fc_port_t *fcport = (fc_port_t *)data;
+ fc_remote_port_unblock(fcport->rport);
+}
+
/*
* qla2x00_mem_alloc
* Allocates adapter memory.
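For readers who have not met the pattern, the idea behind the patch is that the interrupt path only records that a block or unblock is needed, and the call that may sleep runs later from a context where sleeping is allowed. The following is a single-threaded userland sketch of that deferral, not kernel code: the queue, its size limit and all names are hypothetical, and it has none of the locking a real work queue needs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8

typedef void (*work_fn)(void *data);

struct work_item {
    work_fn fn;
    void *data;
};

/* Hypothetical fixed-size queue of deferred calls. */
struct work_queue {
    struct work_item items[MAX_PENDING];
    size_t count;
};

/*
 * Safe to call from the "atomic" side: it only records the request and
 * never blocks. Returns 0 on success, -1 when the queue is full.
 */
int queue_work_item(struct work_queue *q, work_fn fn, void *data)
{
    assert(q != NULL && fn != NULL);
    if (q->count == MAX_PENDING)
        return -1;
    q->items[q->count].fn = fn;
    q->items[q->count].data = data;
    q->count++;
    return 0;
}

/*
 * Called later from a context that may sleep. Runs every recorded item
 * in order and returns how many were run.
 */
size_t run_pending_work(struct work_queue *q)
{
    size_t ran = 0;

    assert(q != NULL);
    for (size_t i = 0; i < q->count; i++) {
        q->items[i].fn(q->items[i].data);
        ran++;
    }
    q->count = 0;
    return ran;
}

/* Stand-in for a remote port whose block operation may sleep. */
struct port {
    int blocked;
};

void block_port(void *data)
{
    struct port *p = data;

    p->blocked = 1;
}
```

The port stays unblocked between queue_work_item() and run_pending_work(); that window is the price of moving the sleeping call out of the interrupt handler.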
* Re: further testing w/ multipath ... and bugs
2005-06-13 18:20 ` Andrew Vasquez
@ 2005-06-13 18:56 ` christophe varoqui
0 siblings, 0 replies; 7+ messages in thread
From: christophe varoqui @ 2005-06-13 18:56 UTC (permalink / raw)
To: device-mapper development; +Cc: Linux-SCSI Mailing List
> > qla2300 0000:05:0d.0: LOOP DOWN detected.
> > Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
> > in_atomic():1, irqs_disabled():1
> > [<c0120a74>] __might_sleep+0xa4/0xc0
> > [<c026a466>] device_for_each_child+0x26/0x80
> > [<c02b3180>] target_block+0x0/0x30
> > [<c02bbdae>] fc_remote_port_block+0x2e/0x60
> > [<c02bdbf5>] qla2x00_mark_all_devices_lost+0x55/0x60
> > [<c02c597e>] qla2x00_async_event+0x83e/0xd60
> > [<c011dd2b>] find_busiest_group+0xbb/0x310
> > [<c02cdce4>] sd_rw_intr+0x164/0x320
> > [<c02c4e37>] qla2300_intr_handler+0x77/0x240
> > [<c0144882>] handle_IRQ_event+0x32/0x70
>
> Without wanting to make a number of large changes to the qla2xxx
> internals to deal with these pre-qualifications, could you try the
> following patch (lightly tested with the latest Linus git tree)?
>
> We'll need to update the fc_remote_port docs in order to account for
> this semantic change in device_for_each_child().
>
Indeed, it fixed this bug. Thanks for the prompt fix.
This leaves us with DM-related bugs only :/
Regards,
--
christophe varoqui <christophe.varoqui@free.fr>
* Re: further testing w/ multipath ... and bugs
2005-06-13 8:11 further testing w/ multipath ... and bugs Christophe Varoqui
2005-06-13 10:06 ` Christophe Varoqui
2005-06-13 18:20 ` Andrew Vasquez
@ 2005-06-13 19:36 ` Mike Christie
2005-06-13 21:37 ` christophe varoqui
2 siblings, 1 reply; 7+ messages in thread
From: Mike Christie @ 2005-06-13 19:36 UTC (permalink / raw)
To: device-mapper development
Christophe Varoqui wrote:
>
> rport-0:0-0: blocked FC remote port time out: removing target
> rport-0:0-1: blocked FC remote port time out: removing target
I think if we end up doing this to all of a device's paths we are going to
be screwed. If all the paths suddenly fail and those devices are then
freed, re-adding/creating the devices when the ports come back is
going to take a lot of GFP_KERNEL allocations.
* Re: further testing w/ multipath ... and bugs
2005-06-13 19:36 ` Mike Christie
@ 2005-06-13 21:37 ` christophe varoqui
0 siblings, 0 replies; 7+ messages in thread
From: christophe varoqui @ 2005-06-13 21:37 UTC (permalink / raw)
To: device-mapper development
On Mon, 2005-06-13 at 14:36 -0500, Mike Christie wrote:
> Christophe Varoqui wrote:
> >
> > rport-0:0-0: blocked FC remote port time out: removing target
> > rport-0:0-1: blocked FC remote port time out: removing target
>
> I think if we end up doing this to all of a device's paths we are going to
> be screwed. If all the paths suddenly fail and those devices are then
> freed, re-adding/creating the devices when the ports come back is
> going to take a lot of GFP_KERNEL allocations.
Right.
One thing I don't understand in the current scheme is why I/O is not
rerouted by the DM multipath target between the LOOP_DOWN and the rport
target removal ...
If I/O could be rerouted within a few seconds, we could set a
multi-minute target-removal timer, which should cover most reasonable
SAN failure scenarios.
Regards,
--
christophe varoqui <christophe.varoqui@free.fr>