linux-raid.vger.kernel.org archive mirror
* Who do we point to?
@ 2008-08-20 19:11 greg
  2008-08-21  1:06 ` [Scst-devel] " Stanislaw Gruszka
  2008-08-21 12:14 ` Vladislav Bolkhovitin
  0 siblings, 2 replies; 6+ messages in thread
From: greg @ 2008-08-20 19:11 UTC (permalink / raw)
  To: scst-devel, vst; +Cc: linux-driver, linux-scsi, linuxraid, neilb, linux-raid

Good morning hope the day is going well for everyone.

Apologies for the large broadcast domain on this.  I wanted to make
sure everyone who may have an interest in this is involved.

Some feedback on another issue we encountered with Linux in a
production initiator/target environment with SCST.  I'm including logs
below from three separate systems involved in the incident.  I've gone
through them with my team, and we are currently unsure what
triggered all this, hence the mail to everyone who may be involved.

The system involved is SCST 1.0.0.0 running on a Linux 2.6.24.7 target
platform using the qla_isp driver module.  The target machine has two
9650 eight port 3Ware controller cards driving a total of 16 750
gigabyte Seagate NearLine drives.  Firmware on the 3ware and Qlogic
cards should all be current.  There are two identical servers in two
geographically separated data-centers.

The drives on each platform are broken into four 3+1 RAID5 devices
with software RAID.  Each RAID5 volume is a physical volume for an LVM
volume group. There is currently one logical volume exported from each
of four RAID5 volumes as a target device.  A total of four initiators
are thus accessing the target server, each accessing different RAID5
volumes.

The initiators are running a stock 2.6.26.2 kernel with a RHEL5
userspace.  Access to the SAN is via a 2462 dual-port Qlogic card.
The initiators see a block device from each of the two target servers
through separate ports/paths.  The block devices form a software RAID1
device (with bitmaps) which is the physical volume for an LVM volume
group.  The production filesystem is supported by a single logical
volume allocated from that volume group.

A drive failure occurred last Sunday afternoon on one of the RAID5
volumes.  The target kernel recognized the failure, failed the device
and kept going.

Unfortunately three of the four initiators picked up a device failure
which caused the SCST exported volume to be faulted out of the RAID1
device.  One of the initiators noted an incident was occurring, issued
a target reset and continued forward with no issues.

The initiator which got things 'right' was not accessing the RAID5
volume on the target which experienced the error.  Two of the three
initiators which faulted out their volumes were not accessing the
compromised RAID5 volume.  The one initiator which was accessing the
compromised volume also faulted out its device.

In the logs below, the 'init1' initiator was the one which did not fail
its device.  The init2 log is an example from the initiators which
failed out their devices; behavior appeared identical on all the
initiators which faulted their block devices.  The log labelled 'target'
contains the log entries from the event on the SCST server.  All three
servers from which logs were extracted were NTP time-synchronized, so
log timings are directly comparable.

Some items to note:

---
The following log message from the 3Ware driver seems bogus with
respect to the port number.  Doubtful this has anything to do with
the incident, but it may be of interest to the 3Ware people copied on
this note:

Aug 17 17:55:16 scst-target kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x000A): Drive error detected:unit=2, port=-2147483646.

---
The initiators which received I/O errors had the Qlogic driver attempt
a 'DEVICE RESET' which failed and was then retried.  The second reset
attempt succeeded.

The 3Ware driver elected to reset the card at 17:55:32.  A period of
44 seconds elapses from that message until end_request picks up on the
I/O error which causes the RAID5 driver to fault the affected drive.
The initiators which failed their 'DEVICE RESET' issued their failed
requests during this time window.

Of interest to Vlad may be the following log entry:

Aug 17 17:56:07 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET FAILED: Task management failed.

The initiator which had its 'DEVICE RESET' succeed issued the reset
after the above window with a timestamp identical to that of the
end_request I/O error message on the target.

---
Of interest to NeilB and why I copied him as well is the following.

Precisely one minute after the second attempt to reset the target
succeeds, the kernel indicates the involved RAID1 kthread has blocked
for more than 120 seconds.  The call trace indicates the thread was
waiting on a RAID superblock update.

Immediately after the kernel finishes issuing the message and stack
trace the Qlogic driver attempts to abort a SCSI command which results
in end_request getting an I/O error which causes the device to be
faulted from the RAID1 device.

This occurs one full minute AFTER the target RAID5 device has had its
device evicted and is continuing in normal but degraded operation.
---


Empirically it would seem the initiators which were 'unlucky' happened
to issue their 'DEVICE RESET' requests while the SCST service thread
they were assigned to was blocked waiting for the 3Ware card to reset.
What is unclear is why the initiator I/O error was generated after the
reset succeeded the second time, a minute after the incident was
completely over as far as the SCST target server was concerned.

A question for Vlad.  The SCST target server is a dual-processor SMP
box with the default value of two kernel threads active.  Would it be
advantageous to increase this value to avoid situations like this?
Would an appropriate metric be to have the number of active threads
equal to the number of exported volumes or initiators?

I would be interested in any ideas the group may have.  Let me know if
I can provide additional information or documentation on any of this.

Best wishes for a productive remainder of the week.

GW

Logs follow.

init1 (good) --------------------------------------------------------------
Aug 17 17:56:02 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e334 2002.
Aug 17 17:56:08 init1 sshd[14485]: Did not receive identification string from 134.129.111.71
Aug 17 17:56:13 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e334 2002.
Aug 17 17:56:13 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e335 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e336 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e337 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e338 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e339 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33a 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33b 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33c 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33d 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33e 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33f 2002.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET ISSUED.
Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET SUCCEEDED.
---------------------------------------------------------------------------

init2 (bad) ---------------------------------------------------------------
Aug 17 17:55:37 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 3b848 2002.
Aug 17 17:55:47 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 3b848 2002.
Aug 17 17:55:47 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET ISSUED.
Aug 17 17:56:07 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET FAILED: Task management failed.
Aug 17 17:56:07 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): TARGET RESET ISSUED.
Aug 17 17:56:08 init2 sshd[15769]: Did not receive identification string from 134.129.111.71
Aug 17 17:56:16 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): TARGET RESET SUCCEEDED.
Aug 17 17:57:08 init2 sshd[15815]: Did not receive identification string from 134.129.111.71
Aug 17 17:57:16 init2 kernel: INFO: task md7_raid1:739 blocked for more than 120 seconds.
Aug 17 17:57:16 init2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 17 17:57:16 init2 kernel: md7_raid1     D ffff81025e405e00  5792   739      2
Aug 17 17:57:16 init2 kernel:  ffff81025d523c70 0000000000000046 000000000000ffff 000000003e7fff88
Aug 17 17:57:16 init2 kernel:  ffff81025d4cd140 ffff81025fe73000 ffff81025d4cd488 0000000400000000
Aug 17 17:57:16 init2 kernel:  ffff8101f81018c0 0000000000000000 ffff81025fdfaa80 ffffffff802a382c
Aug 17 17:57:16 init2 kernel: Call Trace:
Aug 17 17:57:16 init2 kernel:  [<ffffffff802a382c>] bio_alloc_bioset+0x89/0xd9
Aug 17 17:57:16 init2 kernel:  [<ffffffff803c4d64>] md_super_wait+0xa3/0xb7
Aug 17 17:57:16 init2 kernel:  [<ffffffff8023fbb4>] autoremove_wake_function+0x0/0x2e
Aug 17 17:57:16 init2 kernel:  [<ffffffff803cc017>] write_page+0x1bb/0x2a3
Aug 17 17:57:16 init2 kernel:  [<ffffffff80224c00>] find_busiest_group+0x25e/0x6c6
Aug 17 17:57:16 init2 kernel:  [<ffffffff803c7e9a>] super_90_sync+0x359/0x367
Aug 17 17:57:16 init2 kernel:  [<ffffffff803c5461>] md_update_sb+0x1be/0x291
Aug 17 17:57:16 init2 kernel:  [<ffffffff803c9841>] md_check_recovery+0x185/0x4c0
Aug 17 17:57:16 init2 kernel:  [<ffffffff80242434>] hrtimer_start+0xfc/0x11e
Aug 17 17:57:16 init2 kernel:  [<ffffffffa0044091>] :raid1:raid1d+0x25/0xc68
Aug 17 17:57:16 init2 kernel:  [<ffffffff80459e32>] thread_return+0x63/0xa4
Aug 17 17:57:16 init2 kernel:  [<ffffffff80236d9f>] lock_timer_base+0x26/0x4b
Aug 17 17:57:16 init2 kernel:  [<ffffffff80236e15>] try_to_del_timer_sync+0x51/0x5a
Aug 17 17:57:16 init2 kernel:  [<ffffffff80236e2a>] del_timer_sync+0xc/0x16
Aug 17 17:57:16 init2 kernel:  [<ffffffff8045a0ff>] schedule_timeout+0x92/0xad
Aug 17 17:57:16 init2 kernel:  [<ffffffff803cb2e9>] md_thread+0xd7/0xed
Aug 17 17:57:16 init2 kernel:  [<ffffffff8023fbb4>] autoremove_wake_function+0x0/0x2e
Aug 17 17:57:16 init2 kernel:  [<ffffffff803cb212>] md_thread+0x0/0xed
Aug 17 17:57:16 init2 kernel:  [<ffffffff8023fa95>] kthread+0x47/0x76
Aug 17 17:57:16 init2 kernel:  [<ffffffff8022aa03>] schedule_tail+0x28/0x5d
Aug 17 17:57:16 init2 kernel:  [<ffffffff8020cb78>] child_rip+0xa/0x12
Aug 17 17:57:16 init2 kernel:  [<ffffffff8023fa4e>] kthread+0x0/0x76
Aug 17 17:57:16 init2 kernel:  [<ffffffff8020cb6e>] child_rip+0x0/0x12
Aug 17 17:57:16 init2 kernel: 
Aug 17 17:57:17 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 3b849 2002.
Aug 17 17:57:17 init2 kernel: end_request: I/O error, dev sdc, sector 1048575880
Aug 17 17:57:17 init2 kernel: md: super_written gets error=-5, uptodate=0
Aug 17 17:57:17 init2 kernel: raid1: Disk failure on sdc1, disabling device.
Aug 17 17:57:17 init2 kernel: raid1: Operation continuing on 1 devices.
Aug 17 17:57:17 init2 kernel: RAID1 conf printout:
Aug 17 17:57:17 init2 kernel:  --- wd:1 rd:2
Aug 17 17:57:17 init2 kernel:  disk 0, wo:1, o:0, dev:sdc1
Aug 17 17:57:17 init2 kernel:  disk 1, wo:0, o:1, dev:sdd1
Aug 17 17:57:17 init2 kernel: RAID1 conf printout:
Aug 17 17:57:17 init2 kernel:  --- wd:1 rd:2
Aug 17 17:57:17 init2 kernel:  disk 1, wo:0, o:1, dev:sdd1
---------------------------------------------------------------------------

target --------------------------------------------------------------------
Aug 17 17:55:13 scst-target kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=2.
Aug 17 17:55:16 scst-target kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x000A): Drive error detected:unit=2, port=-2147483646.
Aug 17 17:55:18 scst-target kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=2.
Aug 17 17:55:29 scst-target last message repeated 2 times
Aug 17 17:55:32 scst-target kernel: sd 1:0:2:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.
Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 66331127
Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37108159
Aug 17 17:56:16 scst-target kernel: raid5: Disk failure on sdk1, disabling device. Operation continuing on 3 devices
Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37109439
Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37108799
Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37108847
Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37109191
Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37109183
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37108687
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37108431
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37108423
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37108207
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37106855
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107135
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107391
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107519
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107775
Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107823
Aug 17 17:56:18 scst-target kernel: end_request: I/O error, dev sdk, sector 151465351
Aug 17 17:56:18 scst-target kernel: RAID5 conf printout:
Aug 17 17:56:18 scst-target kernel:  --- rd:4 wd:3
Aug 17 17:56:18 scst-target kernel:  disk 0, o:1, dev:sdi1
Aug 17 17:56:18 scst-target kernel:  disk 1, o:1, dev:sdj1
Aug 17 17:56:18 scst-target kernel:  disk 2, o:0, dev:sdk1
Aug 17 17:56:18 scst-target kernel:  disk 3, o:1, dev:sdl1
Aug 17 17:56:18 scst-target kernel: RAID5 conf printout:
Aug 17 17:56:18 scst-target kernel:  --- rd:4 wd:3
Aug 17 17:56:18 scst-target kernel:  disk 0, o:1, dev:sdi1
Aug 17 17:56:18 scst-target kernel:  disk 1, o:1, dev:sdj1
Aug 17 17:56:18 scst-target kernel:  disk 3, o:1, dev:sdl1
Aug 17 19:37:16 scst-target sshd[4515]: refused connect from mon1 (::ffff:xxx.xx.xxx.xxx)
---------------------------------------------------------------------------

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"This patch causes a CONFIG_PREEMPT=y, CONFIG_PREEMPT_BKL=y,
 CONFIG_DEBUG_PREEMPT=y kernel on a ppc64 G5 to hang immediately after
 displaying the penguins, but apparently not before having set the
 hardware clock backwards 101 years."

"After having carefully reviewed the above description and having
 decided that these effects were not a part of the patch's design
 intent I have temporarily set it aside, thanks."
                                -- Andrew Morton
                                   linux-kernel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Scst-devel] Who do we point to?
  2008-08-20 19:11 Who do we point to? greg
@ 2008-08-21  1:06 ` Stanislaw Gruszka
  2008-08-21 12:17   ` Vladislav Bolkhovitin
  2008-08-21 12:14 ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 6+ messages in thread
From: Stanislaw Gruszka @ 2008-08-21  1:06 UTC (permalink / raw)
  To: scst-devel, greg
  Cc: vst, linux-driver, neilb, linux-raid, linuxraid, linux-scsi

On Wednesday 20 August 2008, greg@enjellic.com wrote:
> Good morning hope the day is going well for everyone.
Hi Greg!

> Apologies for the large broadcast domain on this.  I wanted to make
> sure everyone who may have an interest in this is involved.
>
> Some feedback on another issue we encountered with Linux in a
> production initiator/target environment with SCST.  I'm including logs
> below from three separate systems involved in the incident.  I've gone
> through them with my team and we are currently unsure on what
> triggered all this, hence mail to everyone who may be involved.
>
> The system involved is SCST 1.0.0.0 running on a Linux 2.6.24.7 target
> platform using the qla_isp driver module.  The target machine has two
> 9650 eight port 3Ware controller cards driving a total of 16 750
> gigabyte Seagate NearLine drives.  Firmware on the 3ware and Qlogic
> cards should all be current.  There are two identical servers in two
> geographically separated data-centers.
>
> The drives on each platform are broken into four 3+1 RAID5 devices
> with software RAID.  Each RAID5 volume is a physical volume for an LVM
> volume group. There is currently one logical volume exported from each
> of four RAID5 volumes as a target device.  A total of four initiators
> are thus accessing the target server, each accessing different RAID5
> volumes.
>
> The initiators are running a stock 2.6.26.2 kernel with a RHEL5
> userspace.  Access to the SAN is via a 2462 dual-port Qlogic card.
> The initiators see a block device from each of the two target servers
> through separate ports/paths.  The block devices form a software RAID1
> device (with bitmaps) which is the physical volume for an LVM volume
> group.  The production filesystem is supported by a single logical
> volume allocated from that volume group.
>
> A drive failure occurred last Sunday afternoon on one of the RAID5
> volumes.  The target kernel recognized the failure, failed the device
> and kept going.
>
> Unfortunately three of the four initiators picked up a device failure
> which caused the SCST exported volume to be faulted out of the RAID1
> device.  One of the initiators noted an incident was occurring, issued
> a target reset and continued forward with no issues.
>
> The initiator which got things 'right' was not accessing the RAID5
> volume on the target which experienced the error.  Two of the three
> initiators which faulted out their volumes were not accessing the
> compromised RAID5 volume.  The initiator accessing the volume faulted
> out its device.
For some reason the SCST core needs to wait for the logical unit driver
(aka dev handler) when aborting a command. It is not possible to abort a
command instantly, i.e. mark the command as aborted, return task
management success to the initiator, and then, once the logical unit
driver finishes, simply free the resources of the aborted command (I
don't know why; maybe Vlad can tell more about this). The Qlogic
initiator device just waits for the 3ware card to abort the commands. As
both systems have the same SCSI stack, the same commands time out: the
3ware driver returns an error to RAID5 at roughly the same time the
Qlogic initiator times out. So sometimes the Qlogic initiator sends only
a device reset, and sometimes a target reset as well.

I believe increasing the timeouts in the sd driver on the initiator side
(and maybe decreasing them on the target system) will help. These things
are not run-time configurable, only compile-time. On the initiator
systems I suggest increasing SD_TIMEOUT, and on the target side maybe
decreasing SD_MAX_RETRIES; both values are in drivers/scsi/sd.h. In such
a configuration, when a physical disk fails, 3ware will return an error
while the initiator is still waiting for the command to complete, RAID5
on the target will do the right job, and from the initiator's point of
view the command will finish successfully.

Cheers
Stanislaw Gruszka


* Re: Who do we point to?
  2008-08-20 19:11 Who do we point to? greg
  2008-08-21  1:06 ` [Scst-devel] " Stanislaw Gruszka
@ 2008-08-21 12:14 ` Vladislav Bolkhovitin
  2008-08-21 14:32   ` James Bottomley
  1 sibling, 1 reply; 6+ messages in thread
From: Vladislav Bolkhovitin @ 2008-08-21 12:14 UTC (permalink / raw)
  To: greg
  Cc: scst-devel, linux-driver, linux-scsi, linuxraid, neilb,
	linux-raid, linux-fsdevel

greg@enjellic.com wrote:
> Good morning hope the day is going well for everyone.
> 
> Apologies for the large broadcast domain on this.  I wanted to make
> sure everyone who may have an interest in this is involved.
> 
> Some feedback on another issue we encountered with Linux in a
> production initiator/target environment with SCST.  I'm including logs
> below from three separate systems involved in the incident.  I've gone
> through them with my team and we are currently unsure on what
> triggered all this, hence mail to everyone who may be involved.
> 
> The system involved is SCST 1.0.0.0 running on a Linux 2.6.24.7 target
> platform using the qla_isp driver module.  The target machine has two
> 9650 eight port 3Ware controller cards driving a total of 16 750
> gigabyte Seagate NearLine drives.  Firmware on the 3ware and Qlogic
> cards should all be current.  There are two identical servers in two
> geographically separated data-centers.
> 
> The drives on each platform are broken into four 3+1 RAID5 devices
> with software RAID.  Each RAID5 volume is a physical volume for an LVM
> volume group. There is currently one logical volume exported from each
> of four RAID5 volumes as a target device.  A total of four initiators
> are thus accessing the target server, each accessing different RAID5
> volumes.
> 
> The initiators are running a stock 2.6.26.2 kernel with a RHEL5
> userspace.  Access to the SAN is via a 2462 dual-port Qlogic card.
> The initiators see a block device from each of the two target servers
> through separate ports/paths.  The block devices form a software RAID1
> device (with bitmaps) which is the physical volume for an LVM volume
> group.  The production filesystem is supported by a single logical
> volume allocated from that volume group.
> 
> A drive failure occurred last Sunday afternoon on one of the RAID5
> volumes.  The target kernel recognized the failure, failed the device
> and kept going.
> 
> Unfortunately three of the four initiators picked up a device failure
> which caused the SCST exported volume to be faulted out of the RAID1
> device.  One of the initiators noted an incident was occurring, issued
> a target reset and continued forward with no issues.
> 
> The initiator which got things 'right' was not accessing the RAID5
> volume on the target which experienced the error.  Two of the three
> initiators which faulted out their volumes were not accessing the
> compromised RAID5 volume.  The initiator accessing the volume faulted
> out its device.
> 
> In the logs below the 'init1' initiator was the one which did not fail
> its device.  The init2 log is an example log from the initiators which
> failed out their devices, behavior seemed to be identical on all the
> initiators which faulted their block devices.  The log labelled target
> are the log entries from the event on the SCST server.  All three
> servers from which logs were abstracted were NTP time synchronized so
> log timings are directly correlatable.
> 
> Some items to note:
> 
> ---
> The following log message from the 3Ware driver seems bogus with
> respect to the port number.  Doubtful this has anything to do with
> the incident but may be of interest to the 3Ware people copied on this
> note:
> 
> Aug 17 17:55:16 scst-target kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x000A): Drive error detected:unit=2, port=-2147483646.
> 
> ---
> The initiators which received I/O errors had the Qlogic driver attempt
> a 'DEVICE RESET' which failed and was then retried.  The second reset
> attempt succeeded.
> 
> The 3Ware driver elected to reset the card at 17:55:32.  A period of
> 44 seconds elapses from that message until end_request picks up on the
> I/O error which causes the RAID5 driver to fault the affected drive.
> The initiators which failed their 'DEVICE RESET' issued their failed
> requests during this time window.
> 
> Of interest to Vlad may be the following log entry(s):
> 
> Aug 17 17:56:07 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET FAILED: Task management failed.
> 
> The initiator which had its 'DEVICE RESET' succeed issued the reset
> after the above window with a timestamp identical to that of the
> end_request I/O error message on the target.

It would be good to know the reason for that reset failure. If you had
SCST on the target built in debug mode, we would also have some
interesting information to think over (in this mode all the TM
processing by the SCST core is logged).

But I bet the reason was a timeout; see below.

> ---
> Of interest to NeilB and why I copied him as well is the following.
> 
> Precisely one minute after the second attempt to reset the target
> succeeds the kernel indicates the involved RAID1 kthread has blocked
> for more than 120 seconds.  The call trace indicates the thread was
> waiting on a RAID superblock update.
> 
> Immediately after the kernel finishes issuing the message and stack
> trace the Qlogic driver attempts to abort a SCSI command which results
> in end_request getting an I/O error which causes the device to be
> faulted from the RAID1 device.
> 
> This occurs one full minute AFTER the target RAID5 device has had its
> device evicted and is continuing in normal but degraded operation.
> ---
> 
> 
> Empirically it would seem the initiators which were 'unlucky' happened
> to issue their 'DEVICE RESET' requests while the SCST service thread
> they were assigned to was blocked waiting for the 3Ware card to reset.
> What is unclear is why the initiator I/O error was generated after the
> reset succeeded the second time, a minute after the incident was
> completely over as far as the SCST target server was concerned.
> 
> A question for Vlad.  The SCST target server is a dual-processor SMP
> box with the default value of two kernel threads active.  Would it be
> advantageous to increase this value to avoid situations like this?
> Would an appropriate metric be to have the number of active threads
> equal to the number of exported volumes or initiators?

For BLOCKIO or pass-through modes, increasing the thread count beyond
the default CPU count won't affect anything, because all the processing
is fully asynchronous. For FILEIO you already have a bunch of dedicated
threads per device. All the TM processing is done in a dedicated thread
as well.

> I would be interested in any ideas the group may have.  Let me know if
> I can provide additional information or documentation on any of this.

I agree with Stanislaw Gruszka that it was purely a timeout issue. The
Qlogic driver on the initiator was more impatient than the storage stack
on the target. The failed request, before it actually failed, was
retried many times, each time with some timeout. The sum of those
timeouts was bigger than the corresponding command's timeout on the
target plus the timeout for the reset TM command.

As a solution, I can suggest decreasing the retry count and the command
failure timeout on the target. I recall something like this was once
discussed on linux-scsi; I think it would be worth your while to search
for that thread.

MOANING MODE ON

Testing SCST and target drivers, I often have to deal with various
failures and with how initiators recover from them. And, unfortunately,
my observations on Linux aren't very encouraging. See, for instance, the
http://marc.info/?l=linux-scsi&m=119557128825721&w=2 thread. Receiving
TASK ABORTED status from the target isn't really a failure, it's rather
corner-case behavior, but it leads to immediate file system errors on
the initiator, and then after remount the ext3 journal replay doesn't
completely repair it; only a manual e2fsck helps. Even mounting with
barrier=1 doesn't improve anything. The target can't be blamed for the
failure, because it stayed online, its cache fully healthy, and no
commands were lost. Hence, apparently, the journaling code in ext3 isn't
as reliable in the face of storage corner cases as it's thought to be. I
haven't tried that test since I reported it, but recently I've seen
similar ext3 failures on 2.6.26 in other tests, so I guess the
problem(s) are still there.

A software SCSI target like SCST is beautiful for testing things like
that, because it makes it easy to simulate any possible corner case or
storage failure. Unfortunately, I don't work at the file system level
and can't participate in all that great testing and fixing effort. I can
only help with setup and with assistance in simulating failures.

MOANING MODE OFF

Vlad



* Re: Who do we point to?
  2008-08-21  1:06 ` [Scst-devel] " Stanislaw Gruszka
@ 2008-08-21 12:17   ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 6+ messages in thread
From: Vladislav Bolkhovitin @ 2008-08-21 12:17 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: scst-devel, greg, linux-driver, neilb, linux-raid, linuxraid,
	linux-scsi

Stanislaw Gruszka wrote:
>> Apologies for the large broadcast domain on this.  I wanted to make
>> sure everyone who may have an interest in this is involved.
>>
>> Some feedback on another issue we encountered with Linux in a
>> production initiator/target environment with SCST.  I'm including logs
>> below from three separate systems involved in the incident.  I've gone
>> through them with my team and we are currently unsure on what
>> triggered all this, hence mail to everyone who may be involved.
>>
>> The system involved is SCST 1.0.0.0 running on a Linux 2.6.24.7 target
>> platform using the qla_isp driver module.  The target machine has two
>> 9650 eight port 3Ware controller cards driving a total of 16 750
>> gigabyte Seagate NearLine drives.  Firmware on the 3ware and Qlogic
>> cards should all be current.  There are two identical servers in two
>> geographically separated data-centers.
>>
>> The drives on each platform are broken into four 3+1 RAID5 devices
>> with software RAID.  Each RAID5 volume is a physical volume for an LVM
>> volume group. There is currently one logical volume exported from each
>> of four RAID5 volumes as a target device.  A total of four initiators
>> are thus accessing the target server, each accessing different RAID5
>> volumes.
>>
>> The initiators are running a stock 2.6.26.2 kernel with a RHEL5
>> userspace.  Access to the SAN is via a 2462 dual-port Qlogic card.
>> The initiators see a block device from each of the two target servers
>> through separate ports/paths.  The block devices form a software RAID1
>> device (with bitmaps) which is the physical volume for an LVM volume
>> group.  The production filesystem is supported by a single logical
>> volume allocated from that volume group.
>>
>> A drive failure occurred last Sunday afternoon on one of the RAID5
>> volumes.  The target kernel recognized the failure, failed the device
>> and kept going.
>>
>> Unfortunately three of the four initiators picked up a device failure
>> which caused the SCST exported volume to be faulted out of the RAID1
>> device.  One of the initiators noted an incident was occurring, issued
>> a target reset and continued forward with no issues.
>>
>> The initiator which got things 'right' was not accessing the RAID5
>> volume on the target which experienced the error.  Two of the three
>> initiators which faulted out their volumes were not accessing the
>> compromised RAID5 volume.  The initiator accessing the volume faulted
>> out its device.
> For some reason the SCST core needs to wait for the logical unit driver
> (aka dev handler) when aborting a command. It is not possible to abort a
> command instantly, i.e. mark the command as aborted, return task
> management success to the initiator, and then simply free the aborted
> command's resources once the logical unit driver finishes (I don't know
> why; maybe Vlad can tell us more about this).

That's a SAM requirement. Otherwise, if TM commands were completed 
"instantly", without waiting for all affected commands to finish, it 
would be possible for the aborted command to be executed in one more 
retry *after* the next command the initiator issued following the reset 
had completed. The initiator would think the aborted commands were 
already dead, and such behavior could kill journaled filesystems.

> The Qlogic initiator
> device just waits for the 3ware card to abort the commands. As both
> systems have the same SCSI stack, they use the same command timeouts:
> the 3ware driver will return an error to RAID5 at roughly the same time
> the Qlogic initiator times out. So sometimes Qlogic sends only a device
> reset, and sometimes a target reset too.
> 
> I believe increasing the timeouts in the sd driver on the initiator side
> (and maybe decreasing them on the target system) will help. These values
> are not run-time configurable, only compile time. On the initiator
> systems I suggest increasing SD_TIMEOUT, and on the target side perhaps
> decreasing SD_MAX_RETRIES; both values are in drivers/scsi/sd.h. With
> that configuration, when a physical disk fails, 3ware will return an
> error while the initiator is still waiting for the command to complete,
> RAID5 on the target will do the right job, and from the initiator's
> point of view the command will finish successfully.
> 
> Cheers
> Stanislaw Gruszka
> 
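For reference, the compile-time values Stanislaw mentions look roughly
like this in a 2.6.2x-era drivers/scsi/sd.h (verify against your own
tree; the tuned numbers below are only illustrative, not recommendations):

```c
/* drivers/scsi/sd.h -- stock defaults in the 2.6.2x era */
#define SD_TIMEOUT		(30 * HZ)	/* per-command timeout */
#define SD_MAX_RETRIES		5		/* retries before failing upward */

/*
 * Stanislaw's suggestion, with illustrative values: on the initiators,
 * raise SD_TIMEOUT, e.g. to (120 * HZ), so the initiator's sd layer
 * outlives the target's internal error handling; on the target, lower
 * SD_MAX_RETRIES, e.g. to 2, so 3ware errors surface to MD RAID5 before
 * the initiator gives up on the command.
 */
```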



* Re: Who do we point to?
  2008-08-21 12:14 ` Vladislav Bolkhovitin
@ 2008-08-21 14:32   ` James Bottomley
  2008-08-27 18:17     ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 6+ messages in thread
From: James Bottomley @ 2008-08-21 14:32 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: greg, scst-devel, linux-driver, linux-scsi, linuxraid, neilb,
	linux-raid, linux-fsdevel

On Thu, 2008-08-21 at 16:14 +0400, Vladislav Bolkhovitin wrote:
> MOANING MODE ON
> 
> Testing SCST and target drivers, I often have to deal with various 
> failures and with how initiators recover from them. And, unfortunately, 
> my observations on Linux aren't very encouraging. See, for instance, the 
> http://marc.info/?l=linux-scsi&m=119557128825721&w=2 thread. Receiving 
> a TASK ABORTED status from the target isn't really a failure, it's 
> rather corner-case behavior, but it leads to immediate file system 
> errors on the initiator, and after a remount the ext3 journal replay 
> doesn't completely repair it; only a manual e2fsck helps. Even mounting 
> with barrier=1 doesn't improve anything. The target can't be blamed for 
> the failure, because it stayed online, its cache was fully healthy, and 
> no commands were lost. Hence, apparently, the journaling code in ext3 
> isn't as reliable in the face of storage corner cases as it's thought to 
> be. I haven't rerun that test since I reported it, but recently I've 
> seen similar ext3 failures on 2.6.26 in other tests, so I guess the 
> problem(s) are still there.
> 
> A software SCSI target like SCST is beautiful for testing things like 
> that, because it makes it easy to simulate any possible corner case and 
> storage failure. Unfortunately, I don't work at the file system level 
> and can't participate in all that great testing and fixing effort. I can 
> only help with setup and with assistance in simulating failures.
> 
> MOANING MODE OFF

Well, since I can see you're just so anxious to stop moaning and get
coding, let me help you.

Firstly, from a standards point of view, TASK_ABORTED means that the
target is telling us this particular command was killed by another
initiator (seeing this also requires the TAS bit to be set in the
control mode page, so you can easily fix your current problem by
unsetting it).  This makes TASK_ABORTED an incredibly rare status
condition (hence the problems below).

The way the kernel currently handles it is to return SUCCESS (around
line 1411 in scsi_error.c).  This return actually propagates an I/O
error all the way up the stack.  If the filesystem is the consumer, then
how it handles the error depends on what you have the errors= switch set
to.  If you've got it set to a safety condition like remount-ro or
panic, then the fs should be recoverable on reboot (or unmount recheck).
If you have it set to something unsafe like continue, then yes, you're
asking for trouble and fs corruption ... but it's hardly the OS's fault;
you told it you didn't want to operate safely.
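
As an illustration of the errors= behaviour described above (device
paths here are placeholders, not systems from this thread):

```shell
# Set the persistent default in the ext3 superblock (placeholder device):
tune2fs -e remount-ro /dev/sdX1

# Or override the superblock setting for a single mount:
mount -t ext3 -o errors=remount-ro /dev/sdX1 /mnt/data

# Inspect the current superblock setting:
tune2fs -l /dev/sdX1 | grep 'Errors behavior'
```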

So, given what TASK_ABORTED means, it looks to me like the handling should
go through the maybe_retry path.  I'd say that's about a three line
patch ... and since you have the test bed, you can even try it out.
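
For what it's worth, the change being suggested might look roughly like
this in scsi_decide_disposition() in drivers/scsi/scsi_error.c -- a
hypothetical sketch against 2.6.26, not a tested patch:

```c
/* drivers/scsi/scsi_error.c, scsi_decide_disposition() -- sketch:
 * route TASK_ABORTED through the retry path so the command is retried
 * (and fails cleanly once retries run out) instead of propagating an
 * immediate I/O error up the stack. */
	case TASK_ABORTED:
		goto maybe_retry;	/* was: return SUCCESS; */
```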

James




* Re: Who do we point to?
  2008-08-21 14:32   ` James Bottomley
@ 2008-08-27 18:17     ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 6+ messages in thread
From: Vladislav Bolkhovitin @ 2008-08-27 18:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: greg, scst-devel, linux-driver, linux-scsi, linuxraid, neilb,
	linux-raid, linux-fsdevel

James Bottomley wrote:
> On Thu, 2008-08-21 at 16:14 +0400, Vladislav Bolkhovitin wrote:
>> MOANING MODE ON
>>
>> Testing SCST and target drivers, I often have to deal with various 
>> failures and with how initiators recover from them. And, unfortunately, 
>> my observations on Linux aren't very encouraging. See, for instance, the 
>> http://marc.info/?l=linux-scsi&m=119557128825721&w=2 thread. Receiving 
>> a TASK ABORTED status from the target isn't really a failure, it's 
>> rather corner-case behavior, but it leads to immediate file system 
>> errors on the initiator, and after a remount the ext3 journal replay 
>> doesn't completely repair it; only a manual e2fsck helps. Even mounting 
>> with barrier=1 doesn't improve anything. The target can't be blamed for 
>> the failure, because it stayed online, its cache was fully healthy, and 
>> no commands were lost. Hence, apparently, the journaling code in ext3 
>> isn't as reliable in the face of storage corner cases as it's thought to 
>> be. I haven't rerun that test since I reported it, but recently I've 
>> seen similar ext3 failures on 2.6.26 in other tests, so I guess the 
>> problem(s) are still there.
>>
>> A software SCSI target like SCST is beautiful for testing things like 
>> that, because it makes it easy to simulate any possible corner case and 
>> storage failure. Unfortunately, I don't work at the file system level 
>> and can't participate in all that great testing and fixing effort. I can 
>> only help with setup and with assistance in simulating failures.
>>
>> MOANING MODE OFF
> 
> Well, since I can see you're just so anxious to stop moaning and get
> coding, let me help you.
> 
> Firstly, from a standards point of view, TASK_ABORTED means that the
> target is telling us this particular command was killed by another
> initiator (seeing this also requires the TAS bit to be set in the
> control mode page, so you can easily fix your current problem by
> unsetting it).  This makes TASK_ABORTED an incredibly rare status
> condition (hence the problems below).
> 
> The way the kernel currently handles it is to return SUCCESS (around
> line 1411 in scsi_error.c).  This return actually propagates an I/O
> error all the way up the stack.  If the filesystem is the consumer, then
> how it handles the error depends on what you have the errors= switch set
> to.  If you've got it set to a safety condition like remount-ro or
> panic, then the fs should be recoverable on reboot (or unmount recheck).
> If you have it set to something unsafe like continue, then yes, you're
> asking for trouble and fs corruption ... but it's hardly the OS's fault;
> you told it you didn't want to operate safely.

Yes, we already agreed in the referenced thread that two separate and 
completely unrelated problems were discovered here:

1. Handling of the TASK_ABORTED status is different from handling of the
"Commands cleared by another initiator" Unit Attention.

2. The file system layer doesn't handle something well after receiving 
an I/O error. I use the default mount and format options, so "errors" 
was "remount-ro", but recovery on reboot wasn't sufficient.

We in the SCSI layer can fix (1), but only FS people can fix (2).

> So, given what TASK_ABORTED means, it looks to me like the handling should
> go through the maybe_retry path.  I'd say that's about a three line
> patch ... and since you have the test bed, you can even try it out.

OK, I'll prepare it.

> James
> 
> 
> 




Thread overview: 6+ messages
2008-08-20 19:11 Who do we point to? greg
2008-08-21  1:06 ` [Scst-devel] " Stanislaw Gruszka
2008-08-21 12:17   ` Vladislav Bolkhovitin
2008-08-21 12:14 ` Vladislav Bolkhovitin
2008-08-21 14:32   ` James Bottomley
2008-08-27 18:17     ` Vladislav Bolkhovitin
