* megaraid_mbox: garbage in file
From: Vasily Averin @ 2006-05-04 18:48 UTC
  To: linux-scsi, Neela.Kolli, Atul Mukker, Seokmann.Ju, sreenib,
	James.Bottomley, devel

Hello all,

I've investigated a customer's complaint about instability on their node and found a
strange effect: reading from some files leads to
 "attempt to access beyond end of device" messages.

I've checked the filesystem, the memory on the node and the motherboard BIOS version,
but that did not help: the issue is still reproduced by simply reading a file.

The reproducer is simple:

echo 0xffffffff >/proc/sys/dev/scsi/logging_level ;
cat /vz/private/101/root/etc/ld.so.cache >/tmp/ttt  ;
echo 0 >/proc/sys/dev/scsi/logging_level

It leads to the following messages in dmesg

sd_init_command: disk=sda, block=871769260, count=26
sda : block=871769260
sda : reading 26/26 512 byte blocks.
scsi_add_timer: scmd: f79ed980, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf79ed980                  sd 0:1:0:0:
        command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00
buffer = 0xf7cfb540, bufflen = 13312, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f79ed980, rtn: 1
sd 0:1:0:0: done 0xf79ed980 SUCCESS        0 sd 0:1:0:0:
        command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
26 sectors total, 13312 bytes done.
use_sg is 4
attempt to access beyond end of device
sda6: rw=0, want=1044134458, limit=951401367
Buffer I/O error on device sda6, logical block 522067228
attempt to access beyond end of device
sda6: rw=0, want=1178878530, limit=951401367
Buffer I/O error on device sda6, logical block 589439264
...

As far as I can see, the first read operation finishes without errors, but when we
read the rest of the file we get accesses beyond the end of the device.
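
As a cross-check of the log above, the READ(10) command bytes can be decoded by hand. The sketch below assumes the standard READ(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and 512-byte logical blocks as reported by sd; it is only an illustration, not part of the original report.

```python
# Decode the 10-byte READ(10) CDB printed in the dmesg output.
# Assumed layout: byte 0 = opcode (0x28), bytes 2-5 = LBA (big-endian),
# bytes 7-8 = transfer length in logical blocks (big-endian).
cdb = bytes.fromhex("28 00 33 f6 24 ac 00 00 1a 00")  # fromhex ignores spaces

opcode = cdb[0]
lba = int.from_bytes(cdb[2:6], "big")
nblocks = int.from_bytes(cdb[7:9], "big")
nbytes = nblocks * 512  # 512-byte blocks, per "reading 26/26 512 byte blocks"

print(f"opcode=0x{opcode:02x} lba={lba} blocks={nblocks} bytes={nbytes}")
# prints: opcode=0x28 lba=871769260 blocks=26 bytes=13312
```

The decoded values agree with "block=871769260, count=26" and "bufflen = 13312" in the log, so the command that was sent matches what sd asked for.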

Originally it was found on Virtuozzo kernels (2.6.8.1-based x86 32-bit),
reproduced on RHEL4 kernels 2.6.9-22.EL and 2.6.9-34.EL,
on FC5 (2.6.16-1.2096_FC5) and on vanilla 2.6.16 kernels.

However, when I first read these blocks using dd with bs=512 or bs=1024, it works
without any trouble. After that I can cat the file, copy it, map it and so on, and
get the correct content without any errors. Moreover, the issue can be worked around
by limiting memory: booting with mem=4G on the kernel command line, or using a kernel
without PAE support, helps.

I've noticed that we attempt to access blocks with strange numbers:

522067228 = 0x1f1e1d1c
589439264 = 0x23222120 and so on.
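
The arithmetic behind these numbers can be checked with a short sketch. It assumes the bytes 0x00..0x7f are read as little-endian 32-bit words, and that each "want=" value in the log is the end sector of a one-block request on a filesystem with 1 KB blocks (two 512-byte sectors per block); both are assumptions made for illustration.

```python
import struct

# One period of the repeating byte pattern: 0x00, 0x01, ..., 0x7f.
pattern = bytes(range(128))

# Interpret the 128 bytes as 32 little-endian 32-bit words.
words = struct.unpack("<32I", pattern)

SECTORS_PER_BLOCK = 2  # assumed: 1 KB filesystem blocks, 512-byte sectors

for block in (522067228, 589439264):
    index = words.index(block)                 # position of the word in the pattern
    want = (block + 1) * SECTORS_PER_BLOCK     # end of the request, in sectors
    print(f"word {index}: {block} = {block:#x}, want={want}")
# prints:
# word 7: 522067228 = 0x1f1e1d1c, want=1044134458
# word 8: 589439264 = 0x23222120, want=1178878530
```

Both computed "want" values match the dmesg lines, which is consistent with the block numbers having been taken directly from the byte pattern.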

Then I found that I had read strange garbage from the file:

# hexdump /tmp/ttt
0000000 0100 0302 0504 0706 0908 0b0a 0d0c 0f0e
0000010 1110 1312 1514 1716 1918 1b1a 1d1c 1f1e
0000020 2120 2322 2524 2726 2928 2b2a 2d2c 2f2e
0000030 3130 3332 3534 3736 3938 3b3a 3d3c 3f3e
0000040 4140 4342 4544 4746 4948 4b4a 4d4c 4f4e
0000050 5150 5352 5554 5756 5958 5b5a 5d5c 5f5e
0000060 6160 6362 6564 6766 6968 6b6a 6d6c 6f6e
0000070 7170 7372 7574 7776 7978 7b7a 7d7c 7f7e
0000080 0100 0302 0504 0706 0908 0b0a 0d0c 0f0e
0000090 1110 1312 1514 1716 1918 1b1a 1d1c 1f1e
00000a0 2120 2322 2524 2726 2928 2b2a 2d2c 2f2e
...
00000f0 7170 7372 7574 7776 7978 7b7a 7d7c 7f7e
0000100 0100 0302 0504 0706 0908 0b0a 0d0c 0f0e
...

Then I discovered that the "access beyond end of device" errors occur because the
same garbage is read from the 13th (indirect) block of the file.
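
A rough sketch of what happens when such an indirect block is interpreted: it assumes a 1 KB indirect block holding little-endian 32-bit block pointers (the ext3 on-disk format), filled entirely with the repeating 0x00..0x7f pattern, and it uses the "limit=" value from the log as the partition size in 512-byte sectors. The helper name `beyond_end` is made up for this illustration.

```python
import struct

BLOCK_SIZE = 1024                      # ext3 block size from the report
SECTORS_PER_BLOCK = BLOCK_SIZE // 512
LIMIT_SECTORS = 951401367              # "limit=" for sda6 in the log

# A hypothetical indirect block filled with the garbage pattern.
indirect = bytes(range(128)) * (BLOCK_SIZE // 128)

# 256 block pointers, each a little-endian 32-bit integer.
pointers = struct.unpack(f"<{BLOCK_SIZE // 4}I", indirect)

def beyond_end(block):
    """True if reading this filesystem block would run past the partition."""
    return (block + 1) * SECTORS_PER_BLOCK > LIMIT_SECTORS

period = pointers[:32]                 # the pattern repeats every 32 pointers
bad = [p for p in period if beyond_end(p)]
in_range = [p for p in period if not beyond_end(p)]

print(bad[:2])          # prints: [522067228, 589439264]
print(len(in_range))    # prints: 7
```

Under these assumptions the first two out-of-range pointers are exactly the logical blocks named in the "Buffer I/O error" lines, while the first seven pointers of each period still fall inside the partition and so would not trigger this particular error.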

I've tried to understand where this garbage comes from and found that it is already
present in the data buffers at the level of the megaraid_mbox driver functions.

Could somebody explain to me what this strange garbage (the bytes 0...127, repeated) is?
Seokmann, Atul, could you please tell me whether it is a known issue?
James, from my point of view it does not look like a driver bug, but perhaps I'm
wrong?

I suppose it is a MegaRAID SATA 150-4 firmware issue. I've seen similar firmware
fixes for MegaRAID SATA 300 controllers ("Support PAE mode fixed" and "Fixed the
operating systems using more than 4 gig of memory"). Could the same issues be
present in the SATA 150-4 firmware? Or maybe I have a broken controller?

Hardware Environment:
Tyan B2881
2 x Opteron 246
8G RAM
LSI MegaRAID SATA 150-4
/vz partition formatted as ext3 with 1Kb blocksize

megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03 EST 2005)
megaraid: 2.20.4.7 (Release Date: Mon Nov 14 12:27:22 EST 2005)
megaraid: probe new device 0x1000:0x1960:0x1000:0x4523: bus 1:slot 4:func 0
ACPI: PCI Interrupt 0000:01:04.0[A] -> GSI 29 (level, low) -> IRQ 16
megaraid: fw version:[713N] bios version:[G119]
scsi0 : LSI Logic MegaRAID driver
scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
scsi[0]: scanning scsi channel 1 [virtual] for logical drives
  Vendor: MegaRAID  Model: LD 0 RAID1  476G  Rev: 713N
  Type:   Direct-Access                      ANSI SCSI revision: 02


I would also note that, from my point of view, this issue looks similar to
http://bugzilla.kernel.org/show_bug.cgi?id=6052

It seems to me that both cases may have the same cause.

Thank you,
	Vasily Averin

SWsoft Virtuozzo/OpenVZ Linux kernel team

* RE: [RFC] Megaraid update, submission
From: Ju, Seokmann @ 2006-05-16 19:03 UTC
  To: Andre Hedrick, linux-scsi, Andrew Morton
  Cc: James Bottomley, Christoph Hellwig, Mukker, Atul

Hi,

I cannot agree with the changes in the patch, for the following reasons.

On Tuesday, May 16, 2006 1:44 PM, Andre Hedrick wrote:
> Random (hard to reproduce, without a noise injection into the SATA
> connector or cable) hardware error states which locks the card and in the
> majority of the cases caused the array to be lost.  If the array was not
> lost then a drive was failed but one could not remove/replace w/ a new
> drive.  Thus adding in a pci_master_abort test and clear function proved
> to allow recovery in all cases where the card shutdown communication to
> the host.  This may not address all cases; however, clearly this is a
> missing part of the driver base when entry to eh_scsi_* begins.
If 'raid_dev->hw_error' is non-zero, the controller has gone bad and cannot be recovered without a reboot (nor should recovery be attempted, to avoid further memory corruption).
The overall issue described here is already taken care of by the patch that I've submitted.
The patch has been accepted and should be available in 2.6.17-rc1-mm3, as specified in Andrew Morton's email.
> The compond issue in the failed recovery resulted in a deref NULL pointer
> in the various list_head calls.  After change the individual list_add to
> list_move and such, the NULL point issue has never shown up in the past 6
> weeks of heavy testing.
I'm not sure how these changes help with the issue. Furthermore, I'm not sure what _the NULL point issue_ refers to. If you see the issue with the driver available in 2.6.17-rc1-mm3, please let me know.
The following link leads to further details of the patch.
http://www.kernel.org/git/?p=linux/kernel/git/jejb/scsi-rc-fixes-2.6.git;a=commit;h=c005fb4fb2d23ba29ad21dee5042b2f8451ca8ba
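
For readers unfamiliar with the list_add versus list_move distinction raised in the quoted text, here is a minimal sketch of the general hazard. It models the kernel's circular doubly linked list in Python; the names `ListHead`, `is_consistent`, `transfer`, and the "scb" entry are stand-ins invented for this illustration, and it is not a diagnosis of the megaraid driver code.

```python
class ListHead:
    """Minimal stand-in for a circular doubly linked list node."""
    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self

def list_add(new, head):
    """Insert new right after head, without touching new's old neighbours."""
    new.next = head.next
    new.prev = head
    head.next.prev = new
    head.next = new

def list_move(entry, head):
    """Unlink entry from its current list, then insert it after head."""
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    list_add(entry, head)

def is_consistent(head, limit=16):
    """Walk from head; every next/prev pair must agree and lead back to head."""
    node = head
    for _ in range(limit):
        if node.next.prev is not node:
            return False
        node = node.next
        if node is head:
            return True
    return False

def transfer(move):
    """Put an entry on the free pool, then hand it to the pending list."""
    free_pool, pending = ListHead("free"), ListHead("pending")
    scb = ListHead("scb")  # hypothetical control block
    list_add(scb, free_pool)
    (list_move if move else list_add)(scb, pending)
    return is_consistent(free_pool), is_consistent(pending)

print("list_add :", transfer(move=False))  # prints: list_add : (False, True)
print("list_move:", transfer(move=True))   # prints: list_move: (True, True)
```

With a bare add, the free pool still points at an entry that now belongs to the pending list, so a later walk or removal on the free pool operates on stale links. A move unlinks the entry first and leaves both lists intact.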

Thank you,

Seokmann

> -----Original Message-----
> From: Andre Hedrick [mailto:andre@linux-ide.org] 
> Sent: Tuesday, May 16, 2006 1:44 PM
> To: linux-scsi@vger.kernel.org; Ju, Seokmann; Andrew Morton
> Cc: James Bottomley; Christoph Hellwig; Mukker, Atul
> Subject: [RFC] Megaraid update, submission
> 
> 
> Linux-scsi, et al.
> 
> The follow patch address two major issues found under extensive testing.
> 
> While pounding data io down the card and performing large scale queries to
> the controller about device state and function parameters, the following
> were discovered.
> 
> Random (hard to reproduce, without a noise injection into the SATA
> connector or cable) hardware error states which locks the card and in the
> majority of the cases caused the array to be lost.  If the array was not
> lost then a drive was failed but one could not remove/replace w/ a new
> drive.  Thus adding in a pci_master_abort test and clear function proved
> to allow recovery in all cases where the card shutdown communication to
> the host.  This may not address all cases; however, clearly this is a
> missing part of the driver base when entry to eh_scsi_* begins.
> 
> The compond issue in the failed recovery resulted in a deref NULL pointer
> in the various list_head calls.  After change the individual list_add to
> list_move and such, the NULL point issue has never shown up in the past 6
> weeks of heavy testing.
> 
> In all cases in the past, the baseline for error was 6:1.  Meaning either
> one system in six failed and/or one in six test/stress runs failed.  With
> the attached changes, there have been zero failures in the past three
> weeks.  This sound great, but I wish it would fail to allow some
> statistics of improved error handling.
> 
> Please note the changes to SAS are minor and not tested, but seem correct
> for the entire directory code base.  SAS shares the CMM core with MBOX,
> thus the rational for changes to SAS.
> 
> Please comment and provide suggestions.
> 
> Cheers,
> 
> Andre Hedrick
> LAD Storage Consulting Group
> 
> 
> 
> 

* RE: [RFC] Megaraid update, submission
From: Ju, Seokmann @ 2006-05-16 21:08 UTC
  To: Andre Hedrick
  Cc: linux-scsi, Andrew Morton, James Bottomley, Christoph Hellwig,
	Mukker, Atul

Hi Andre,
On Tuesday, May 16, 2006 4:47 PM, Andre Hedrick wrote:
> Lets move on to the list management issues where timeouts on ioctl calls
> have produced NULL pointers when one performs an add v/s move to transfer
> ownership of a given scb between pools.
> 
> Fixing the list management may mean the pci_master_abort is not needed.
If this issue still exists on the 2.6.17-rc1 kernel, I will definitely work on it.
By my best estimate, the _NULL pointer_ issue should not be there with the patch.
Please let me know if you still see the issue.

Thank you very much for your contribution to the driver's stability.

Regards, 

> -----Original Message-----
> From: Andre Hedrick [mailto:andre@linux-ide.org] 
> Sent: Tuesday, May 16, 2006 4:47 PM
> To: Ju, Seokmann
> Cc: linux-scsi@vger.kernel.org; Andrew Morton; James Bottomley; Christoph Hellwig; Mukker, Atul
> Subject: RE: [RFC] Megaraid update, submission
> 
> 
> Warning OOPS in message, ignore if you hate reading pasted OOPS's
> 
> Seokmann,
> 
> So there should be no (sane) heroic attempts to recover the card state?
> Please look and see the path is only retried and follows the original
> operational path which resulted in setting the 'raid_dev->hw_error' flag.
> If I am reading the code correctly, the *->quiescent flag controls command
> submission to the card.  Thus all commands submitted to the firmware are
> owned by the card, and should be allowed to complete the IO's regardless?
> With as many as 20 requests outstanding (max I have seen to date) and
> termiation of the transactions surely blows apart any filesystem, as I
> have had filesystems and in several cases attached arrays just vaporize if
> forced to reboot when 'hw_error' is set.
> 
> So since the pci_master_abort for the card is being rejected ...
> 
> Lets move on to the list management issues where timeouts on ioctl calls
> have produced NULL pointers when one performs an add v/s move to transfer
> ownership of a given scb between pools.
> 
> Fixing the list management may mean the pci_master_abort is not needed.
> 
> The NULL pointer:
> 
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464723 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464723:40[255:0], fw owner
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464744 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464744:12[255:0], fw owner
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464745 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464745:23[255:0], fw owner
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464746 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464746:0[255:0], fw owner
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464747 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464747[255:0], driver owner
> Mar 29 00:09:53 5000 kernel: megaraid: reseting the host...
> Mar 29 00:09:53 5000 kernel: megaraid: 464723:128[65535:65535], reset from pending list
> Mar 29 00:09:53 5000 kernel: megaraid: 4 outstanding commands. Max wait 180 sec
> Mar 29 00:09:53 5000 kernel: megaraid mbox: Wait for 4 commands to complete:180
> ...
> Mar 29 00:11:54 5000 kernel: megaraid mbox: Wait for 4 commands to complete:60
> Mar 29 00:11:59 5000 kernel: megaraid mbox: Wait for 4 commands to complete:55
> Mar 29 00:12:04 5000 kernel: megaraid mbox: Wait for 4 commands to complete:50
> Mar 29 00:12:08 5000 kernel: megaraid mbox: reset sequence completed sucessfully
> Mar 29 00:12:08 5000 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
> Mar 29 00:12:08 5000 kernel:  printing eip:
> Mar 29 00:12:08 5000 kernel: f881f739
> Mar 29 00:12:08 5000 kernel: *pde = 00000000
> Mar 29 00:12:08 5000 kernel: Oops: 0002 [#1]
> Mar 29 00:12:08 5000 kernel: SMP
> Mar 29 00:12:08 5000 kernel: Modules linked in: xfs md5 ipv6 af_packet button thermal processor fan ac battery tsdev joydev evdev usbkbd usbhid e1000 intel_agp agpgart ehci_hcd uhci_hcd usbcore rtc ext3 jbd sd_mod megaraid_mbox megaraid_mm ata_piix  libata scsi_mod
> Mar 29 00:12:08 5000 kernel: CPU:    0
> Mar 29 00:12:08 5000 kernel: EIP:    0060:[pg0+943802169/1069495296] Tainted: P      VLI
> Mar 29 00:12:08 5000 kernel: EIP:    0060:[<f881f739>]    Tainted: P VLI
> Mar 29 00:12:08 5000 kernel: EFLAGS: 00010046   (2.6.10)
> Mar 29 00:12:08 5000 kernel: EIP is at megaraid_mbox_build_cmd+0x979/0xce0 [megaraid_mbox]
> Mar 29 00:12:08 5000 kernel: eax: 00000000   ebx: 00000000   ecx: 0000000d edx: 79473000
> Mar 29 00:12:08 5000 kernel: esi: c238f780   edi: c23af800   ebp: f7491f10 esp: f7491e98
> Mar 29 00:12:09 5000 kernel: ds: 007b   es: 007b   ss: 0068
> Mar 29 00:12:09 5000 kernel: Process scsi_eh_1 (pid: 885, threadinfo=f7490000 task=f7dde020)
> Mar 29 00:12:09 5000 kernel: Stack: c23e3c00 f7de3000 f7491ebc f66fc2a0 c23e3c00 0000000d c226a42c f7436038
> Mar 29 00:12:09 5000 kernel:        f7436030 f7491ee8 c23b1010 f7491ed0 011d2df4 c226aa34 c226aa2c c226a42c
> Mar 29 00:12:09 5000 kernel:        00000000 000000ff c2268000 6e616373 676e696e 00000000 00000086 70696b73
> Mar 29 00:12:09 5000 kernel: Call Trace:
> Mar 29 00:12:09 5000 kernel:  [show_stack+171/192] show_stack+0xab/0xc0
> Mar 29 00:12:09 5000 kernel:  [<c0103e9b>] show_stack+0xab/0xc0
> Mar 29 00:12:09 5000 kernel:  [show_registers+351/464] show_registers+0x15f/0x1d0
> Mar 29 00:12:09 5000 kernel:  [<c010402f>] show_registers+0x15f/0x1d0
> Mar 29 00:12:09 5000 kernel:  [die+244/400] die+0xf4/0x190
> Mar 29 00:12:09 5000 kernel:  [<c0104244>] die+0xf4/0x190
> Mar 29 00:12:09 5000 kernel:  [do_page_fault+1172/1715] do_page_fault+0x494/0x6b3
> Mar 29 00:12:09 5000 kernel:  [<c0117394>] do_page_fault+0x494/0x6b3
> Mar 29 00:12:09 5000 kernel:  [error_code+43/48] error_code+0x2b/0x30
> Mar 29 00:12:09 5000 kernel:  [<c0103aeb>] error_code+0x2b/0x30
> Mar 29 00:12:09 5000 kernel:  [pg0+943799680/1069495296] megaraid_queue_command+0x50/0x90 [megaraid_mbox]
> Mar 29 00:12:09 5000 kernel:  [<f881ed80>] megaraid_queue_command+0x50/0x90 [megaraid_mbox]
> Mar 29 00:12:09 5000 kernel:  [pg0+943941731/1069495296] scsi_dispatch_cmd+0x173/0x290 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [<f8841863>] scsi_dispatch_cmd+0x173/0x290 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [pg0+943966809/1069495296] scsi_request_fn+0x1e9/0x430 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [blk_run_queue+42/64] blk_run_queue+0x2a/0x40
> Mar 29 00:12:09 5000 kernel:  [<c023aeaa>] blk_run_queue+0x2a/0x40
> Mar 29 00:12:09 5000 kernel:  [pg0+943963243/1069495296] scsi_run_host_queues+0x2b/0x50 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [<f8846c6b>] scsi_run_host_queues+0x2b/0x50 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [pg0+943960213/1069495296] scsi_error_handler+0x85/0x170 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [<f8846095>] scsi_error_handler+0x85/0x170 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10
> Mar 29 00:12:09 5000 kernel:  [<c01012d5>] kernel_thread_helper+0x5/0x10
> Mar 29 00:12:09 5000 kernel: Code: 2c 82 f8 c7 47 20 01 00 00 00 8b 4d 9c 85 c9 74 39 8b 4d 9c 31 db 8d b6 00 00 00 00 8d bf 00 00 00 00 8b 55 a0 8b 42 10 8b 56 08 <89> 14 18 31 d2 89 54 18 04 8b 45 a0 8b 50 10 8b 46 0c 83 c6 10
> Mar 29 00:14:23 5000 kernel:  <4>megaraid cmm: ioctl timed out
> Mar 29 00:14:23 5000 kernel: megaraid cmm: controller cannot accept cmds due to earlier errors
> Mar 29 00:14:24 5000 last message repeated 3 times
> ...
> until reboot
> 
> I know everyone will rant about ... there is a taint, I just do not
> have immediate access to the logs (which) do exist without the taint
> marker set.
> 
> I will post the patch on kernel.org and can be adopted or dumped.
> The posting to the list was to follow the patch submission rules.
> 
> Cheers,
> 
> Andre Hedrick
> LAD Storage Consulting Group
> 


Thread overview: 17+ messages
2006-05-04 18:48 megaraid_mbox: garbage in file Vasily Averin
2006-05-04 22:59 ` James Bottomley
2006-05-05  5:37   ` Vasily Averin
2006-05-05  9:21     ` Vasily Averin
2006-05-16 17:44       ` [RFC] Megaraid update, submission Andre Hedrick
2006-05-16 18:07         ` Jeff Garzik
2006-05-16 18:13           ` Andre Hedrick
2006-05-16 19:44             ` Matthew Wilcox
2006-05-16 20:24               ` Andre Hedrick
2006-05-05 15:59     ` megaraid_mbox: garbage in file James Bottomley
2006-05-05 18:17       ` Vasily Averin
2006-05-05 20:05         ` James Bottomley
2006-05-05 23:43           ` Vasily Averin
2006-05-05 23:43             ` Vasily Averin
  -- strict thread matches above, loose matches on Subject: below --
2006-05-16 19:03 [RFC] Megaraid update, submission Ju, Seokmann
2006-05-16 20:47 ` Andre Hedrick
2006-05-16 21:08 Ju, Seokmann
