Fibre-Channel Access : interruptive access

* Fibre-Channel Access : interruptive access
@ 2002-11-25 17:46 Fabien Salvi
  2002-11-25 19:37 ` Steven Dake
  0 siblings, 1 reply; 3+ messages in thread
From: Fabien Salvi @ 2002-11-25 17:46 UTC
  To: Linux SCSI list

Hello,

We have a Fibre-Channel HBA (Qlogic 2200F) attached to a Sanbox2 switch,
which connects to a storage enclosure with a CMD 7240 FC-to-SCSI RAID
controller.

We use the qla2x00 driver (v6.01).

When I reboot the FC switch, access is interrupted for 1 minute.
If a partition on the external enclosure is still mounted during the
reboot, the server ends up in a "semi-crashed" state:

I can still log in to it, but fsync is impossible, the data cannot be
accessed even after the switch reboot, and a clean reboot hangs.
So I have to do a hard reset.

You may say this is not really abnormal, but what can I do to limit the
damage?
Is there a way to block access to the partition while the switch reboots?
When an NFS mount times out, it is still possible to reboot normally, and
the data is available again once NFS is back. Is there a similar solution
for Fibre Channel SCSI?

Here are the logs (I use a ReiserFS filesystem):

Nov 25 16:26:27 d4 kernel: scsi(0): LOOP DOWN detected
Nov 25 16:27:07 d4 kernel: SCSI disk error : host 0 channel 0 id 0 lun 7 return code = 10000
Nov 25 16:27:07 d4 kernel:  I/O error: dev 08:01, sector 50056
Nov 25 16:27:07 d4 kernel: journal-601, buffer write failed
Nov 25 16:27:07 d4 kernel: kernel BUG at prints.c:334!
Nov 25 16:27:07 d4 kernel: invalid operand: 0000
Nov 25 16:27:07 d4 kernel: CPU:    0
Nov 25 16:27:07 d4 kernel: EIP:    0010:[reiserfs_panic+41/96]    Not tainted
Nov 25 16:27:07 d4 kernel: EFLAGS: 00010286
Nov 25 16:27:07 d4 kernel: eax: 00000024   ebx: c02764c0   ecx: c7fb0000   edx: 00000000
Nov 25 16:27:07 d4 kernel: esi: c3470400   edi: 00000000   ebp: c3470400   esp: c7fb1ee4
Nov 25 16:27:07 d4 kernel: ds: 0018   es: 0018   ss: 0018
Nov 25 16:27:07 d4 kernel: Process kupdated (pid: 7, stackpage=c7fb1000)
Nov 25 16:27:07 d4 kernel: Stack: c027495a c031f0c0 c02764c0 c7fb1f08 c888c798 00000003 c01a83cf c3470400
Nov 25 16:27:07 d4 kernel:        c02764c0 00000011 00000012 00000010 00000000 c888c7cc c888c7c0 00000004
Nov 25 16:27:07 d4 kernel:        00000000 00000012 c7a032c0 c01abcfe c3470400 c888c798 00000001 c7fb1fa4
Nov 25 16:27:07 d4 kernel: Call Trace:    [flush_commit_list+687/928] [do_journal_end+1982/2704] [flush_old_commits+287/320] [reiserfs_write_super+21/32] [sync_supers+191/240]
Nov 25 16:27:07 d4 kernel:   [sync_old_buffers+12/64] [kupdate+213/256] [kernel_thread+40/64]
Nov 25 16:27:07 d4 kernel:
Nov 25 16:27:07 d4 kernel: Code: 0f 0b 4e 01 60 49 27 c0 68 c0 f0 31 c0 85 f6 74 16 0f b7 46
Nov 25 16:27:08 d4 kernel:  SCSI disk error : host 0 channel 0 id 0 lun 7 return code = 10000
Nov 25 16:27:08 d4 kernel:  I/O error: dev 08:01, sector 50064
Nov 25 16:27:09 d4 kernel: SCSI disk error : host 0 channel 0 id 0 lun 7 return code = 10000
Nov 25 16:27:09 d4 kernel:  I/O error: dev 08:01, sector 50072
Nov 25 16:27:30 d4 kernel: scsi(0): LOOP UP detected
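
The timestamps in this log give the timing of the outage. The following
minimal Python sketch is not part of the original message; it copies three
lines from the log and computes how long the loop was down and how long it
took for the first I/O error to surface. The year 2002 is taken from the
message date, and the helper names `stamp` and `first` are made up for
illustration.

```python
from datetime import datetime

# Three lines copied from the log above.
LOG = """\
Nov 25 16:26:27 d4 kernel: scsi(0): LOOP DOWN detected
Nov 25 16:27:07 d4 kernel:  I/O error: dev 08:01, sector 50056
Nov 25 16:27:30 d4 kernel: scsi(0): LOOP UP detected
"""


def stamp(line, year=2002):
    """Parse the syslog timestamp (first 15 characters) of a log line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def first(lines, needle):
    """Return the timestamp of the first line that contains `needle`."""
    return next(stamp(line) for line in lines if needle in line)


lines = LOG.splitlines()
down = first(lines, "LOOP DOWN")
error = first(lines, "I/O error")
up = first(lines, "LOOP UP")

seconds_until_first_error = (error - down).total_seconds()
outage_seconds = (up - down).total_seconds()
print(seconds_until_first_error, outage_seconds)
```

With these lines, the first I/O error appears 40 seconds after LOOP DOWN,
and the loop is down for 63 seconds in total, which matches the "1 minute"
interruption described above.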


Thanks a lot for your help!

--
Fabien

* Re: Fibre-Channel Access : interruptive access
  2002-11-25 17:46 Fibre-Channel Access : interruptive access Fabien Salvi
@ 2002-11-25 19:37 ` Steven Dake
  2002-11-26 10:04   ` Fabien Salvi
  0 siblings, 1 reply; 3+ messages in thread
From: Steven Dake @ 2002-11-25 19:37 UTC
  To: Fabien Salvi; +Cc: Linux SCSI list

Fabien,

What you want is hotswap support.  The kernel has basic support for 
hotswap but only if a device is not in use.  Search the archives for 
hotswap.

I'm currently working on forced block device removal, even if the device 
is in use, properly shutting down files in VFS, RAID, and filesystem 
mount layers.  This is what you really need when hotswap happens, but it 
just isn't ready yet.

The correct way to configure your system so that it stays alive during this
type of failure is to have two HBAs and two switches, with each HBA going
through a separate switch.  This way, if a link, HBA, or switch fails,
there is automatic failover.

Then create a RAID 1 array across both HBAs.  In the case of a switch
failure, the RAID subsystem will automatically correct any errors and
rebuild arrays on disk reinsertions.  Or you could use the RAID
multipathing personality to create a multipath across the two HBAs to the
same device.

Hope this helps.
-steve

* Re: Fibre-Channel Access : interruptive access
  2002-11-25 19:37 ` Steven Dake
@ 2002-11-26 10:04   ` Fabien Salvi
  0 siblings, 0 replies; 3+ messages in thread
From: Fabien Salvi @ 2002-11-26 10:04 UTC
  To: Linux SCSI list

Steven Dake wrote:
> 
> Fabien,
> 
> What you want is hotswap support.  The kernel has basic support for
> hotswap but only if a device is not in use.  Search the archives for
> hotswap.
> 
> I'm currently working on forced block device removal, even if the device
> is in use, properly shutting down files in VFS, RAID, and filesystem
> mount layers.  This is what you really need when hotswap happens, but it
> just isn't ready yet.
> 
> The correct way to configure your system so that it stays alive during this
> type of failure is to have two HBAs and two switches, with each HBA going
> through a separate switch.  This way, if a link, HBA, or switch fails,
> there is automatic failover.
> 
> Then create a RAID 1 array across both HBAs.  In the case of a switch
> failure, the RAID subsystem will automatically correct any errors and
> rebuild arrays on disk reinsertions.  Or you could use the RAID
> multipathing personality to create a multipath across the two HBAs to the
> same device.
> 
> Hope this helps.

Yes, thanks!

Our CMD controllers are supposed to support multi-path access, but so far
we haven't tested it much, or under good conditions.
I do plan to test HA using a switch for the active path and a hub for the
passive path (used only if the switch fails or is under maintenance).

The CMD controllers are also supposed to support non-disruptive firmware
upgrades, so that shouldn't be a problem.

But that is the theory. If something hangs, the only way out is a reset,
and that is when the problem occurs...
(yes, it should not :) )

What I had in mind is a mechanism to block use of a device for a few
seconds.
For example, I would write 0 or 1 to a property of the device (in
/proc/....) to block access to it and so avoid the crash.
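
For comparison, later Linux kernels with sysfs (not the kernel discussed in
this thread) expose a per-device switch of roughly this kind under /sys
rather than /proc: the `state` attribute of a SCSI device accepts the values
`offline` and `running`. The Python sketch below assumes such a kernel. The
function name `set_scsi_device_state`, the `sysfs_root` parameter, and the
disk name `sdb` are illustrative assumptions. Note that an offline device
fails I/O instead of holding it, so whether this would avoid the filesystem
panic shown in the log is an open question.

```python
import os

VALID_STATES = ("running", "offline")


def set_scsi_device_state(disk, state, sysfs_root="/sys"):
    """Write `state` to <sysfs_root>/block/<disk>/device/state.

    Assumes a kernel that exposes this sysfs attribute. `sysfs_root` is
    a parameter only so the function can be tried against a scratch
    directory instead of the real /sys.
    """
    if state not in VALID_STATES:
        raise ValueError(f"state must be one of {VALID_STATES}")
    path = os.path.join(sysfs_root, "block", disk, "device", "state")
    with open(path, "w") as attr:
        attr.write(state + "\n")
    return path
```

On a real system this needs root, for example
`set_scsi_device_state("sdb", "offline")` before the switch maintenance and
`set_scsi_device_state("sdb", "running")` afterwards.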

Another thing: is there a way to do a "sync" on one particular device
only?
For example, I could flush the filesystem buffers for the internal devices
only before a hard reset, so that I lose data only on the FC RAID and not
on the local SCSI disks...
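
On current Linux systems, one way to get a per-filesystem "sync" is the
`syncfs(2)` system call, which was added to Linux well after this thread.
The minimal Python sketch below assumes a kernel and C library that provide
it and calls it through `ctypes`; if the symbol is missing, it falls back to
the global `os.sync()`. The function name `sync_filesystem` is hypothetical.

```python
import ctypes
import os


def sync_filesystem(path):
    """Flush dirty data for the filesystem that contains `path`.

    Returns "syncfs" if only that filesystem was flushed, or "sync" if
    the C library has no syncfs() and a global sync was done instead.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        syncfs = getattr(libc, "syncfs", None)
        if syncfs is None:
            os.sync()  # fallback: flushes every mounted filesystem
            return "sync"
        if syncfs(fd) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        return "syncfs"
    finally:
        os.close(fd)
```

For example, `sync_filesystem("/")` would flush only the filesystem that
holds the root directory, without waiting on dirty buffers that belong to
an unreachable FC device.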

Thanks a lot for your help!

-- 
Fabien
