megaraid_mbox: garbage in file

* megaraid_mbox: garbage in file
@ 2006-05-04 18:48 Vasily Averin
  2006-05-04 22:59 ` James Bottomley
  0 siblings, 1 reply; 17+ messages in thread
From: Vasily Averin @ 2006-05-04 18:48 UTC
  To: linux-scsi, Neela.Kolli, Atul Mukker, Seokmann.Ju, sreenib,
	James.Bottomley, devel

Hello all,

I've investigated a customer's complaint about unstable operation of their node
and found a strange effect: reading from some files leads to
 "attempt to access beyond end of device" messages.

I've checked the filesystem, the memory on the node and the motherboard BIOS
version, but that did not help: the issue is still reproduced by simply reading
a file.

Reproducer is simple:

echo 0xffffffff >/proc/sys/dev/scsi/logging_level ;
cat /vz/private/101/root/etc/ld.so.cache >/tmp/ttt  ;
echo 0 >/proc/sys/dev/scsi/logging

This produces the following messages in dmesg:

sd_init_command: disk=sda, block=871769260, count=26
sda : block=871769260
sda : reading 26/26 512 byte blocks.
scsi_add_timer: scmd: f79ed980, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf79ed980                  sd 0:1:0:0:
        command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00
buffer = 0xf7cfb540, bufflen = 13312, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f79ed980, rtn: 1
sd 0:1:0:0: done 0xf79ed980 SUCCESS        0 sd 0:1:0:0:
        command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
26 sectors total, 13312 bytes done.
use_sg is 4
attempt to access beyond end of device
sda6: rw=0, want=1044134458, limit=951401367
Buffer I/O error on device sda6, logical block 522067228
attempt to access beyond end of device
sda6: rw=0, want=1178878530, limit=951401367
Buffer I/O error on device sda6, logical block 589439264
...

As far as I can see, the first read operation finished without errors, but when
we read the rest of the file we get an access beyond the end of the device.
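
The logged command bytes can be cross-checked against the block and byte counts
in the same trace. The following is only a sketch: the function name
decode_read10 is made up for illustration, and it relies on the standard
READ(10) layout (opcode 0x28, big-endian LBA in bytes 2..5, big-endian transfer
length in bytes 7..8).

```python
def decode_read10(cdb_hex):
    """Return (lba, sector_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2..5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return lba, count


# CDB copied from the "command: Read (10)" line in the trace above
lba, count = decode_read10("28 00 33 f6 24 ac 00 00 1a 00")
print(lba, count, count * 512)  # 871769260 26 13312
```

The decoded values agree with block=871769260, count=26 and bufflen = 13312 in
the log, so the command that reached the driver matches what sd requested.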

Originally it was found on Virtuozzo kernels (2.6.8.1-based x86 32-bit),
reproduced on RHEL4 kernels 2.6.9-22.EL and 2.6.9-34.EL,
on FC5 (2.6.16-1.2096_FC5) and on vanilla 2.6.16 kernels.

However, when I first read these blocks using dd with bs=512 or 1024, it works
without any trouble. Then I can cat this file, copy it, map it and so on, and
get correct content without any errors. Moreover, the issue can be worked
around by limiting memory: it helps to use mem=4G on the kernel command line,
or to use kernels without PAE support.

I've noticed that we attempt to access blocks with strange numbers:

522067228 = 0x1f1e1d1c
589439264 = 0x23222120 and so on.

Then I found that I had read strange garbage from the file:

# hexdump /tmp/ttt
0000000 0100 0302 0504 0706 0908 0b0a 0d0c 0f0e
0000010 1110 1312 1514 1716 1918 1b1a 1d1c 1f1e
0000020 2120 2322 2524 2726 2928 2b2a 2d2c 2f2e
0000030 3130 3332 3534 3736 3938 3b3a 3d3c 3f3e
0000040 4140 4342 4544 4746 4948 4b4a 4d4c 4f4e
0000050 5150 5352 5554 5756 5958 5b5a 5d5c 5f5e
0000060 6160 6362 6564 6766 6968 6b6a 6d6c 6f6e
0000070 7170 7372 7574 7776 7978 7b7a 7d7c 7f7e
0000080 0100 0302 0504 0706 0908 0b0a 0d0c 0f0e
0000090 1110 1312 1514 1716 1918 1b1a 1d1c 1f1e
00000a0 2120 2322 2524 2726 2928 2b2a 2d2c 2f2e
...
00000f0 7170 7372 7574 7776 7978 7b7a 7d7c 7f7e
0000100 0100 0302 0504 0706 0908 0b0a 0d0c 0f0e
...

Then I discovered that the "access beyond end of device" occurs because the
same garbage is read from the 13th (indirect) block of the file.
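
The arithmetic behind this can be checked with a short sketch. It assumes (this
is not stated in the thread) that an indirect block on this 1 KB-block ext3
filesystem is read as 256 little-endian 32-bit block numbers, and that the
want= value is the end sector of a one-block request. The names below are
illustrative only.

```python
import struct

SECTORS_PER_BLOCK = 2  # a 1 KB filesystem block is two 512-byte sectors
LIMIT = 951401367      # "limit=" from the sda6 messages, in sectors

# One 1 KB block filled with the observed pattern: bytes 0..127 repeated.
garbage = bytes(range(128)) * 8

# Assumption: the indirect block is an array of little-endian u32 block numbers.
entries = struct.unpack("<256I", garbage)


def want(block):
    """End sector of a one-block read of `block` (compare with "want=")."""
    return block * SECTORS_PER_BLOCK + SECTORS_PER_BLOCK


# The pattern repeats every 128 bytes, i.e. every 32 entries.
beyond = [(i, b, want(b)) for i, b in enumerate(entries[:32]) if want(b) > LIMIT]
for i, b, w in beyond[:2]:
    print(i, hex(b), b, w)
# 7 0x1f1e1d1c 522067228 1044134458
# 8 0x23222120 589439264 1178878530
```

Under these assumptions, entries 0..6 of the garbage still point inside the
partition, and entry 7 is the first one past the limit, which is consistent
with the first reported logical block being 522067228.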

I've tried to understand where this garbage comes from and found that it is
already present in the data buffers at the level of the megaraid_mbox driver
functions.

Could somebody explain to me what this strange garbage is: a repeated 0...127?
Seokmann, Atul, could you please tell me whether it is a known issue?
James, from my point of view it does not look like a driver bug, but perhaps
I'm wrong?

I suppose it is a MegaRAID SATA 150-4 firmware issue. I've seen similar firmware
fixes for MegaRAID SATA 300 controllers ("Support PAE mode fixed" and "Fixed the
operating systems using more than 4 gig of memory"). Is it possible that the
same issues are present in the SATA 150-4 firmware? Or maybe I am using a broken
controller?

Hardware Environment:
Tyan B2881
2 x Opteron 246
8G RAM
LSI MegaRAID SATA 150-4
/vz partition formatted as ext3 with 1 KB block size

megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03 EST 2005)
megaraid: 2.20.4.7 (Release Date: Mon Nov 14 12:27:22 EST 2005)
megaraid: probe new device 0x1000:0x1960:0x1000:0x4523: bus 1:slot 4:func 0
ACPI: PCI Interrupt 0000:01:04.0[A] -> GSI 29 (level, low) -> IRQ 16
megaraid: fw version:[713N] bios version:[G119]
scsi0 : LSI Logic MegaRAID driver
scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
scsi[0]: scanning scsi channel 1 [virtual] for logical drives
  Vendor: MegaRAID  Model: LD 0 RAID1  476G  Rev: 713N
  Type:   Direct-Access                      ANSI SCSI revision: 02

I would also note that, from my point of view, this issue looks similar to
http://bugzilla.kernel.org/show_bug.cgi?id=6052

It seems to me that both cases may have the same cause.

Thank you,
	Vasily Averin

SWsoft Virtuozzo/OpenVZ Linux kernel team

* Re: megaraid_mbox: garbage in file
  2006-05-04 18:48 megaraid_mbox: garbage in file Vasily Averin
@ 2006-05-04 22:59 ` James Bottomley
  2006-05-05  5:37   ` Vasily Averin
  0 siblings, 1 reply; 17+ messages in thread
From: James Bottomley @ 2006-05-04 22:59 UTC
  To: Vasily Averin
  Cc: linux-scsi, Neela.Kolli, Atul Mukker, Seokmann.Ju, sreenib, devel

On Thu, 2006-05-04 at 22:48 +0400, Vasily Averin wrote:
> attempt to access beyond end of device
> sda6: rw=0, want=1044134458, limit=951401367
> Buffer I/O error on device sda6, logical block 522067228

That's not a SCSI error.  It's coming from the block layer and it means
that the filesystem tried to access beyond the end of the listed
partition.   Why that happened is anyone's guess.  I suspect the actual
filesystem is corrupt somehow, but how it came to be, I don't know.

James



* Re: megaraid_mbox: garbage in file
  2006-05-04 22:59 ` James Bottomley
@ 2006-05-05  5:37   ` Vasily Averin
  2006-05-05  9:21     ` Vasily Averin
  2006-05-05 15:59     ` megaraid_mbox: garbage in file James Bottomley
  0 siblings, 2 replies; 17+ messages in thread
From: Vasily Averin @ 2006-05-05  5:37 UTC
  To: James Bottomley
  Cc: linux-scsi, Neela.Kolli, Atul Mukker, Seokmann.Ju, sreenib, devel

James Bottomley wrote:
> On Thu, 2006-05-04 at 22:48 +0400, Vasily Averin wrote:
>>attempt to access beyond end of device
>>sda6: rw=0, want=1044134458, limit=951401367
>>Buffer I/O error on device sda6, logical block 522067228
> 
> That's not a SCSI error.  It's coming from the block layer and it means
> that the filesystem tried to access beyond the end of the listed
> partition.   Why that happened is anyone's guess.  I suspect the actual
> filesystem is corrupt somehow, but how it came to be, I don't know.

James,

The issue is that a SCSI read command that completed successfully returns
garbage (repeated 0...127, see the hexdump in my first letter) instead of the
correct file content. The "attempt to access beyond end of device" messages
occur because the same garbage is read from the indirect block. I found this
garbage already present in the data buffers at the level of the megaraid driver
functions.

I would note that if I read the same file using dd with bs=1024 or bs=512,
I get the correct file content.

When I use a kernel with a 4 GB memory limit, the same cat command returns the
correct file content too, without any garbage.

The question is: what is this strange garbage? Have you seen it before?
Is it possible that it is a driver-related issue, or is it broken hardware?
And why can I work around this issue by using only 4 GB of memory?
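
As background for the memory-limit question, the sketch below only computes how
much of the usable RAM reported in the BIOS-e820 lines of the dmesg attached
later in this thread lies above the 4 GiB boundary. Whether buffers above that
boundary are actually involved in the corruption is not established anywhere in
this thread; the helper name bytes_above is hypothetical.

```python
FOUR_GIB = 1 << 32

# "usable" ranges copied from the BIOS-e820 lines of the attached dmesg,
# as half-open [start, end) physical address ranges.
usable = [
    (0x0000000000000000, 0x000000000009FC00),
    (0x0000000000100000, 0x00000000FBFF0000),
    (0x0000000100000000, 0x0000000200000000),
]


def bytes_above(limit, ranges):
    """Total size of the parts of `ranges` located at or above `limit`."""
    return sum(max(0, end - max(start, limit)) for start, end in ranges)


high = bytes_above(FOUR_GIB, usable)
print(high // 2**20, "MiB of usable RAM lie above 4 GiB")  # 4096 MiB
```

So on this node half of the 8G RAM is addressable only with more than 32 bits,
and that half is exactly what mem=4G or a non-PAE kernel leaves unused.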

Thank you,
	Vasily Averin

SWsoft Virtuozzo/OpenVZ Linux kernel team

* Re: megaraid_mbox: garbage in file
  2006-05-05  5:37   ` Vasily Averin
@ 2006-05-05  9:21     ` Vasily Averin
  2006-05-16 17:44       ` [RFC] Megaraid update, submission Andre Hedrick
  2006-05-05 15:59     ` megaraid_mbox: garbage in file James Bottomley
  1 sibling, 1 reply; 17+ messages in thread
From: Vasily Averin @ 2006-05-05  9:21 UTC
  To: James Bottomley
  Cc: linux-scsi, Neela.Kolli, Atul Mukker, Seokmann.Ju, sreenib, devel

[-- Attachment #1: Type: text/plain, Size: 2123 bytes --]

Small update:

When I use
cat /vz/private/101/root/etc/ld.so.cache >/tmp/ttt
I get "access beyond end of device" and garbage in the buffers.

Then I issue the same SCSI read command using the sgp_dd utility:
sgp_dd count=26 if=/dev/sg0 skip=871769260 of=/tmp/ttt.sgp
and get the correct file content without any errors.

The only difference that I see is use_sg=3 for cat and use_sg=1 for dd.
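
A dump such as /tmp/ttt can be checked for the repeated 0...127 pattern
mechanically and compared with /tmp/ttt.sgp. This is a minimal sketch; the
function names are made up for illustration and are not part of any tool
mentioned in this thread.

```python
PATTERN = bytes(range(128))  # 00 01 02 ... 7f


def is_counting_garbage(data):
    """True if data is non-empty and is the 0..127 pattern repeated from offset 0."""
    if not data:
        return False
    repeats = -(-len(data) // len(PATTERN))  # ceiling division
    return data == (PATTERN * repeats)[:len(data)]


def first_garbage_offset(data):
    """Offset of the first full 0..127 run anywhere in data, or None."""
    idx = data.find(PATTERN)
    return idx if idx != -1 else None


# Usage on the two output files (paths taken from the commands above):
#   with open("/tmp/ttt", "rb") as f:
#       print(is_counting_garbage(f.read()))
#   with open("/tmp/ttt.sgp", "rb") as f:
#       print(first_garbage_offset(f.read()))
```

first_garbage_offset is meant for the case where only part of a transfer, for
example a single scatter-gather segment, carries the pattern.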

The dmesg with SCSI debug output and the output files are attached.

The node will be accessible for some time and I can perform some experiments.
If somebody wants, I can ask the customer for access to the node.

Thank you,
	Vasily Averin

SWsoft Virtuozzo/OpenVZ Linux kernel team

Vasily Averin wrote:
> James Bottomley wrote:
>>On Thu, 2006-05-04 at 22:48 +0400, Vasily Averin wrote:
>>>attempt to access beyond end of device
>>>sda6: rw=0, want=1044134458, limit=951401367
>>>Buffer I/O error on device sda6, logical block 522067228
>>That's not a SCSI error.  It's coming from the block layer and it means
>>that the filesystem tried to access beyond the end of the listed
>>partition.   Why that happened is anyone's guess.  I suspect the actual
>>filesystem is corrupt somehow, but how it came to be, I don't know.
> 
> James,
> 
> The issue is that a SCSI read command that completed successfully returns
> garbage (repeated 0...127, see the hexdump in my first letter) instead of the
> correct file content. The "attempt to access beyond end of device" messages
> occur because the same garbage is read from the indirect block. I found this
> garbage already present in the data buffers at the level of the megaraid
> driver functions.
> 
> I would note that if I read the same file using dd with bs=1024 or bs=512,
> I get the correct file content.
> 
> When I use a kernel with a 4 GB memory limit, the same cat command returns the
> correct file content too, without any garbage.
> 
> The question is: what is this strange garbage? Have you seen it before?
> Is it possible that it is a driver-related issue, or is it broken hardware?
> And why can I work around this issue by using only 4 GB of memory?
> 
> Thank you,
> 	Vasily Averin
> 
> SWsoft Virtuozzo/OpenVZ Linux kernel team
> 


[-- Attachment #2: dmesg.out --]
[-- Type: text/plain, Size: 41658 bytes --]

Linux version 2.6.16 (vvs@dhcp0-157) (gcc version 3.3.5 20050117 (prerelease) (SUSE Linux)) #1 SMP Thu May 4 17:49:16 MSD 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000fbff0000 (usable)
 BIOS-e820: 00000000fbff0000 - 00000000fbfff000 (ACPI data)
 BIOS-e820: 00000000fbfff000 - 00000000fc000000 (ACPI NVS)
 BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000200000000 (usable)
7296MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000ff780
NX (Execute Disable) protection: active
On node 0 totalpages: 2097152
  DMA zone: 4096 pages, LIFO batch:0
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 225280 pages, LIFO batch:31
  HighMem zone: 1867776 pages, LIFO batch:31
DMI 2.3 present.
ACPI: RSDP (v002 ACPIAM                                ) @ 0x000f6dd0
ACPI: XSDT (v001 A M I  OEMXSDT  0x12000527 MSFT 0x00000097) @ 0xfbff0100
ACPI: FADT (v001 A M I  OEMFACP  0x12000527 MSFT 0x00000097) @ 0xfbff0281
ACPI: MADT (v001 A M I  OEMAPIC  0x12000527 MSFT 0x00000097) @ 0xfbff0380
ACPI: OEMB (v001 A M I  OEMBIOS  0x12000527 MSFT 0x00000097) @ 0xfbfff040
ACPI: SRAT (v001 A M I  OEMSRAT  0x12000527 MSFT 0x00000097) @ 0xfbff39b0
ACPI: HPET (v001 A M I  OEMHPET  0x12000527 MSFT 0x00000097) @ 0xfbff3ac0
ACPI: ASF! (v001 AMIASF AMDSTRET 0x00000001 INTL 0x02002026) @ 0xfbff3b00
ACPI: DSDT (v001  0AAAA 0AAAA001 0x00000001 INTL 0x02002026) @ 0x00000000
ACPI: PM-Timer IO Port: 0x5008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfebff000] gsi_base[24])
IOAPIC[1]: apic_id 3, version 17, address 0xfebff000, GSI 24-27
ACPI: IOAPIC (id[0x04] address[0xfebfe000] gsi_base[28])
IOAPIC[2]: apic_id 4, version 17, address 0xfebfe000, GSI 28-31
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 3 I/O APICs
ACPI: HPET id: 0x102282a0 base: 0xfec01000
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at fc400000 (gap: fc000000:03780000)
Built 1 zonelists
Kernel command line: ro root=LABEL=/1 debug panic=5
mapped APIC to ffffd000 (fee00000)
mapped IOAPIC to ffffc000 (fec00000)
mapped IOAPIC to ffffb000 (febff000)
mapped IOAPIC to ffffa000 (febfe000)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c0565000 soft=c0545000
PID hash table entries: 4096 (order: 12, 65536 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 8248540k/8388608k available (3118k kernel code, 73068k reserved, 940k data, 288k init, 7405504k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Using HPET for base-timer
Using HPET for gettimeofday
Detected 1990.876 MHz processor.
Using hpet for high-res timesource
Calibrating delay using timer specific routine.. 3987.38 BogoMIPS (lpj=7974771)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 078bfbff e1d3fbff 00000000 00000000 00000000 00000000 00000000
CPU: After vendor identify, caps: 078bfbff e1d3fbff 00000000 00000000 00000000 00000000 00000000
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU: After all inits, caps: 078bfbff e1d3fbff 00000000 00000010 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Checking 'hlt' instruction... OK.
CPU0: AMD Opteron(tm) Processor 246 stepping 0a
Booting processor 1/1 eip 2000
CPU 1 irqstacks, hard=c0566000 soft=c0546000
Initializing CPU#1
Calibrating delay using timer specific routine.. 3981.36 BogoMIPS (lpj=7962728)
CPU: After generic identify, caps: 078bfbff e1d3fbff 00000000 00000000 00000000 00000000 00000000
CPU: After vendor identify, caps: 078bfbff e1d3fbff 00000000 00000000 00000000 00000000 00000000
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU: After all inits, caps: 078bfbff e1d3fbff 00000000 00000010 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: AMD Opteron(tm) Processor 246 stepping 0a
Total of 2 processors activated (7968.74 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=0 pin2=0
checking TSC synchronization across 2 CPUs: passed.
Brought up 2 CPUs
migration_cost=4000
checking if image is initramfs...it isn't (no cpio magic); looks like an initrd
Freeing initrd memory: 589k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xf0031, last bus=3
PCI: Using configuration type 1
ACPI: Subsystem revision 20060127
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
Boot video device is 0000:03:06.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.GOLA._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.GOLB._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 *5 6 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)
SCSI subsystem initialized
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
PCI: Bridge: 0000:00:06.0
  IO window: b000-bfff
  MEM window: fca00000-feafffff
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:0a.0
  IO window: disabled.
  MEM window: fc900000-fc9fffff
  PREFETCH window: ff500000-ff5fffff
PCI: Bridge: 0000:00:0b.0
  IO window: disabled.
  MEM window: fc800000-fc8fffff
  PREFETCH window: ff400000-ff4fffff
highmem bounce pool size: 64 pages
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
PCI: MSI quirk detected. pci_msi_quirk set.
PCI: MSI quirk detected. pci_msi_quirk set.
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
Real Time Clock Driver v1.12ac
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Compaq SMART2 Driver (v 2.6.0)
HP CISS Driver (v 2.6.10)
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD8111: IDE controller at PCI slot 0000:00:07.1
AMD8111: chipset revision 3
AMD8111: not 100% native mode: will probe irqs later
AMD8111: 0000:00:07.1 (rev 03) UDMA133 controller
    ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:pio, hdb:pio
    ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:pio, hdd:pio
Probing IDE interface ide0...
Probing IDE interface ide1...
Probing IDE interface ide0...
Probing IDE interface ide1...
Adaptec aacraid driver (1.1-4 May  4 2006 17:43:05)
Emulex LightPulse Fibre Channel SCSI driver 8.1.1
Copyright(c) 2004-2005 Emulex.  All rights reserved.
megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03 EST 2005)
megaraid: 2.20.4.7 (Release Date: Mon Nov 14 12:27:22 EST 2005)
megaraid: probe new device 0x1000:0x1960:0x1000:0x4523: bus 1:slot 4:func 0
ACPI: PCI Interrupt 0000:01:04.0[A] -> GSI 29 (level, low) -> IRQ 16
megaraid: fw version:[713N] bios version:[G119]
scsi0 : LSI Logic MegaRAID driver
scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
scsi[0]: scanning scsi channel 1 [virtual] for logical drives
  Vendor: MegaRAID  Model: LD 0 RAID1  476G  Rev: 713N
  Type:   Direct-Access                      ANSI SCSI revision: 02
GDT-HA: Storage RAID Controller Driver. Version: 3.04 
GDT-HA: Found 0 PCI Storage RAID Controllers
3ware Storage Controller device driver for Linux v1.26.02.001.
3ware 9000 Storage Controller device driver for Linux v2.26.02.005.
libata version 1.20 loaded.
SCSI device sda: 976773120 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
sda: Mode Sense: 00 00 00 00
sda: asking for cache data failed
sda: assuming drive cache: write through
SCSI device sda: 976773120 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
sda: Mode Sense: 00 00 00 00
sda: asking for cache data failed
sda: assuming drive cache: write through
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
sd 0:1:0:0: Attached scsi disk sda
mice: PS/2 mouse device common for all mice
md: linear personality registered for level -1
md: raid0 personality registered for level 0
md: raid1 personality registered for level 1
md: raid10 personality registered for level 10
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
raid5: automatically using best checksumming function: pIII_sse
   pIII_sse  :  6405.000 MB/sec
raid5: using function: pIII_sse (6405.000 MB/sec)
md: multipath personality registered for level -4
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
device-mapper: 4.5.0-ioctl (2005-10-04) initialised: dm-devel@redhat.com
device-mapper: dm-multipath version 1.0.4 loaded
device-mapper: dm-round-robin version 1.0.0 loaded
device-mapper: dm-emc version 0.0.3 loaded
NET: Registered protocol family 2
IP route cache hash table entries: 524288 (order: 9, 2097152 bytes)
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
TCP reno registered
TCP bic registered
NET: Registered protocol family 1
Starting balanced_irq
Using IPI Shortcut mode
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
RAMDISK: Compressed image found at block 0
VFS: Mounted root (ext2 filesystem).
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
Freeing unused kernel memory: 288k freed
floppy0: no floppy controllers found
tg3.c:v3.49 (Feb 2, 2006)
ACPI: PCI Interrupt 0000:02:09.0[A] -> GSI 24 (level, low) -> IRQ 17
eth0: Tigon3 [partno(BCM95704A7) rev 2003 PHY(5704)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:e0:81:2f:90:96
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] 
eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
ACPI: PCI Interrupt 0000:02:09.1[B] -> GSI 25 (level, low) -> IRQ 18
eth1: Tigon3 [partno(BCM95704A7) rev 2003 PHY(5704)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:e0:81:2f:90:97
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] 
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
shpchp: HPC vendor_id 1022 device_id 7460 ss_vid 0 ss_did 0
shpchp: shpc_init: cannot reserve MMIO region
shpchp: HPC vendor_id 1022 device_id 7450 ss_vid 0 ss_did 0
shpchp: shpc_init: cannot reserve MMIO region
shpchp: HPC vendor_id 1022 device_id 7450 ss_vid 0 ss_did 0
shpchp: shpc_init: cannot reserve MMIO region
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
usbcore: registered new driver usbfs
usbcore: registered new driver hub
ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ACPI: PCI Interrupt 0000:03:00.0[D] -> GSI 19 (level, low) -> IRQ 19
ohci_hcd 0000:03:00.0: OHCI Host Controller
ohci_hcd 0000:03:00.0: new USB bus registered, assigned bus number 1
ohci_hcd 0000:03:00.0: irq 19, io mem 0xfeafc000
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 3 ports detected
ACPI: PCI Interrupt 0000:03:00.1[D] -> GSI 19 (level, low) -> IRQ 19
ohci_hcd 0000:03:00.1: OHCI Host Controller
ohci_hcd 0000:03:00.1: new USB bus registered, assigned bus number 2
ohci_hcd 0000:03:00.1: irq 19, io mem 0xfeafd000
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
ACPI: Power Button (FF) [PWRF]
ACPI: Power Button (CM) [PWRB]
ACPI: Processor [CPU1] (supports 8 throttling states)
EXT3 FS on sda2, internal journal
program dmraid is using a deprecated SCSI ioctl, please convert it to SG_IO
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda6, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 4192924k swap on /dev/sda5.  Priority:-1 extents:1 across:4192924k
NET: Registered protocol family 17
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
lp: driver loaded but no devices found
sd_init_command: disk=sda, block=8596831, count=8
sda : block=8596831
sda : reading 8/8 512 byte blocks.
scsi_add_timer: scmd: f7f30980, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30980                  sd 0:1:0:0: 
        command: Read (10): 28 00 00 83 2d 5f 00 00 08 00
buffer = 0xf7cfb240, bufflen = 4096, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30980, rtn: 1
sd 0:1:0:0: done 0xf7f30980 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 00 83 2d 5f 00 00 08 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
8 sectors total, 4096 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=25372896, count=2
sda : block=25372896
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 01 83 28 e0 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sd_init_command: disk=sda, block=2992845, count=8
sda : block=2992845
sda : writing 8/8 512 byte blocks.
scsi_add_timer: scmd: f7f30500, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30500                  sd 0:1:0:0: 
        command: Write (10): 2a 00 00 2d aa cd 00 00 08 00
buffer = 0xf6bfeb00, bufflen = 4096, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 01 83 28 e0 00 00 02 00
scsi host busy 2 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=446976176, count=2
sda : block=446976176
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 1a a4 50 b0 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30500, rtn: 1
sd 0:1:0:0: done 0xf7f30500 SUCCESS        0 sd 0:1:0:0: 
        command: Write (10): 2a 00 00 2d aa cd 00 00 08 00
scsi host busy 2 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
8 sectors total, 4096 bytes done.
use_sg is 1
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 1a a4 50 b0 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=237405, count=16
sda : block=237405
sda : writing 16/16 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 9f 5d 00 00 10 00
buffer = 0xf6bfebc0, bufflen = 8192, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 9f 5d 00 00 10 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
16 sectors total, 8192 bytes done.
use_sg is 2
sd_init_command: disk=sda, block=446984364, count=2
sda : block=446984364
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 1a a4 70 ac 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 1a a4 70 ac 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=447205552, count=2
sda : block=447205552
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 1a a7 d0 b0 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 1a a7 d0 b0 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=447217836, count=2
sda : block=447217836
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 1a a8 00 ac 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 1a a8 00 ac 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=447205554, count=2
sda : block=447205554
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 1a a7 d0 b2 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 1a a7 d0 b2 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=447218934, count=2
sda : block=447218934
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 1a a8 04 f6 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 1a a8 04 f6 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=447811834, count=2
sda : block=447811834
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 1a b1 10 fa 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 1a b1 10 fa 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=447825518, count=2
sda : block=447825518
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 1a b1 46 6e 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 1a b1 46 6e 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=871764144, count=2
sda : block=871764144
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 33 f6 10 b0 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 33 f6 10 b0 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=871769260, count=26
sda : block=871769260
sda : reading 26/26 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00
buffer = 0xf6bfebc0, bufflen = 13312, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
26 sectors total, 13312 bytes done.
use_sg is 3
attempt to access beyond end of device
sda6: rw=0, want=1044134458, limit=951401367
Buffer I/O error on device sda6, logical block 522067228
attempt to access beyond end of device
sda6: rw=0, want=1178878530, limit=951401367
Buffer I/O error on device sda6, logical block 589439264
attempt to access beyond end of device
sda6: rw=0, want=1313622602, limit=951401367
Buffer I/O error on device sda6, logical block 656811300
attempt to access beyond end of device
sda6: rw=0, want=1448366674, limit=951401367
Buffer I/O error on device sda6, logical block 724183336
attempt to access beyond end of device
sda6: rw=0, want=1583110746, limit=951401367
Buffer I/O error on device sda6, logical block 791555372
attempt to access beyond end of device
sda6: rw=0, want=1717854818, limit=951401367
Buffer I/O error on device sda6, logical block 858927408
attempt to access beyond end of device
sda6: rw=0, want=1852598890, limit=951401367
Buffer I/O error on device sda6, logical block 926299444
attempt to access beyond end of device
sda6: rw=0, want=1987342962, limit=951401367
Buffer I/O error on device sda6, logical block 993671480
attempt to access beyond end of device
sda6: rw=0, want=2122087034, limit=951401367
Buffer I/O error on device sda6, logical block 1061043516
attempt to access beyond end of device
sda6: rw=0, want=2256831106, limit=951401367
Buffer I/O error on device sda6, logical block 1128415552
attempt to access beyond end of device
sda6: rw=0, want=2391575178, limit=951401367
attempt to access beyond end of device
sda6: rw=0, want=2526319250, limit=951401367
attempt to access beyond end of device
sda6: rw=0, want=2661063322, limit=951401367
sd_init_command: disk=sda, block=934757082, count=2
sda : block=934757082
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 37 b7 42 da 00 00 02 00
buffer = 0xf6bfebc0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sd_init_command: disk=sda, block=126292650, count=2
sda : block=126292650
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30500, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30500                  sd 0:1:0:0: 
        command: Read (10): 28 00 07 87 12 aa 00 00 02 00
buffer = 0xf6bfeb00, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sd_init_command: disk=sda, block=261036722, count=2
sda : block=261036722
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30680, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30680                  sd 0:1:0:0: 
        command: Read (10): 28 00 0f 8f 1a b2 00 00 02 00
buffer = 0xf6bfea40, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sd_init_command: disk=sda, block=395780794, count=2
sda : block=395780794
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30800, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30800                  sd 0:1:0:0: 
        command: Read (10): 28 00 17 97 22 ba 00 00 02 00
buffer = 0xf6bfe980, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sd_init_command: disk=sda, block=530524866, count=2
sda : block=530524866
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30e00, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30e00                  sd 0:1:0:0: 
        command: Read (10): 28 00 1f 9f 2a c2 00 00 02 00
buffer = 0xf6bfe8c0, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sd_init_command: disk=sda, block=665268938, count=2
sda : block=665268938
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30c80, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30c80                  sd 0:1:0:0: 
        command: Read (10): 28 00 27 a7 32 ca 00 00 02 00
buffer = 0xf6bfe800, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sd_init_command: disk=sda, block=800013010, count=2
sda : block=800013010
sda : reading 2/2 512 byte blocks.
scsi_add_timer: scmd: f7f30b00, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30b00                  sd 0:1:0:0: 
        command: Read (10): 28 00 2f af 3a d2 00 00 02 00
buffer = 0xf6bfe740, bufflen = 1024, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
attempt to access beyond end of device
sda6: rw=0, want=2795807394, limit=951401367
attempt to access beyond end of device
sda6: rw=0, want=2930551466, limit=951401367
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 37 b7 42 da 00 00 02 00
scsi host busy 7 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
scsi_delete_timer: scmd: f7f30500, rtn: 1
sd 0:1:0:0: done 0xf7f30500 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 07 87 12 aa 00 00 02 00
scsi host busy 6 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
scsi_delete_timer: scmd: f7f30680, rtn: 1
sd 0:1:0:0: done 0xf7f30680 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 0f 8f 1a b2 00 00 02 00
scsi host busy 5 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
scsi_delete_timer: scmd: f7f30800, rtn: 1
sd 0:1:0:0: done 0xf7f30800 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 17 97 22 ba 00 00 02 00
scsi host busy 4 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
scsi_delete_timer: scmd: f7f30e00, rtn: 1
sd 0:1:0:0: done 0xf7f30e00 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 1f 9f 2a c2 00 00 02 00
scsi host busy 3 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
scsi_delete_timer: scmd: f7f30c80, rtn: 1
sd 0:1:0:0: done 0xf7f30c80 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 27 a7 32 ca 00 00 02 00
scsi host busy 2 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
scsi_delete_timer: scmd: f7f30b00, rtn: 1
sd 0:1:0:0: done 0xf7f30b00 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 2f af 3a d2 00 00 02 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
2 sectors total, 1024 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=237421, count=8
sda : block=237421
sda : writing 8/8 512 byte blocks.
scsi_add_timer: scmd: f7f30b00, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30b00                  sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 9f 6d 00 00 08 00
buffer = 0xf6bfe740, bufflen = 4096, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
attempt to access beyond end of device
sda6: rw=0, want=1044134458, limit=951401367
scsi_delete_timer: scmd: f7f30b00, rtn: 1
sd 0:1:0:0: done 0xf7f30b00 SUCCESS        0 sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 9f 6d 00 00 08 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
8 sectors total, 4096 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=8338293, count=8
sda : block=8338293
sda : reading 8/8 512 byte blocks.
scsi_add_timer: scmd: f7f30c80, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30c80                  sd 0:1:0:0: 
        command: Read (10): 28 00 00 7f 3b 75 00 00 08 00
buffer = 0xf6bfebc0, bufflen = 4096, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30c80, rtn: 1
sd 0:1:0:0: done 0xf7f30c80 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 00 7f 3b 75 00 00 08 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
8 sectors total, 4096 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=8382885, count=64
sda : block=8382885
sda : reading 64/64 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 00 7f e9 a5 00 00 40 00
buffer = 0xf6bfeb00, bufflen = 32768, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 00 7f e9 a5 00 00 40 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
64 sectors total, 32768 bytes done.
use_sg is 8
sd_init_command: disk=sda, block=8382949, count=40
sda : block=8382949
sda : reading 40/40 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 00 7f e9 e5 00 00 28 00
buffer = 0xf6bfeb00, bufflen = 20480, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sd_init_command: disk=sda, block=2992901, count=8
sda : block=2992901
sda : writing 8/8 512 byte blocks.
scsi_add_timer: scmd: f7f30c80, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30c80                  sd 0:1:0:0: 
        command: Write (10): 2a 00 00 2d ab 05 00 00 08 00
buffer = 0xf6bfebc0, bufflen = 4096, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 00 7f e9 e5 00 00 28 00
scsi host busy 2 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
40 sectors total, 20480 bytes done.
use_sg is 4
sd_init_command: disk=sda, block=8382989, count=16
sda : block=8382989
sda : reading 16/16 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 00 7f ea 0d 00 00 10 00
buffer = 0xf6bfeb00, bufflen = 8192, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 00 7f ea 0d 00 00 10 00
scsi host busy 2 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
16 sectors total, 8192 bytes done.
use_sg is 2
scsi_delete_timer: scmd: f7f30c80, rtn: 1
sd 0:1:0:0: done 0xf7f30c80 SUCCESS        0 sd 0:1:0:0: 
        command: Write (10): 2a 00 00 2d ab 05 00 00 08 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
8 sectors total, 4096 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=247085, count=16
sda : block=247085
sda : writing 16/16 512 byte blocks.
scsi_add_timer: scmd: f7f30c80, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30c80                  sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 c5 2d 00 00 10 00
buffer = 0xf6bfebc0, bufflen = 8192, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30c80, rtn: 1
sd 0:1:0:0: done 0xf7f30c80 SUCCESS        0 sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 c5 2d 00 00 10 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
16 sectors total, 8192 bytes done.
use_sg is 2
sd 0:1:0:0: Attached scsi generic sg0 type 0
sg_open: dev=0, flags=0x8002
scsi_block_when_processing_errors: rtn: 1
sg_add_sfp: sfp=0xf680e000
sg_build_reserve: req_size=32768
sg_build_indirect: buff_size=32768, blk_size=32768
sg_build_build: k=0, a=0xc16d0900, len=32768
sg_build_indirect: k_use_sg=1, rem_sz=0
sg_add_sfp:   bufflen=32768, k_use_sg=1
sg_ioctl: sg0, cmd=0x2282
sd_init_command: disk=sda, block=2992917, count=8
sda : block=2992917
sda : writing 8/8 512 byte blocks.
scsi_add_timer: scmd: f7f30800, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30800                  sd 0:1:0:0: 
        command: Write (10): 2a 00 00 2d ab 15 00 00 08 00
buffer = 0xf6bfe080, bufflen = 4096, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sg_ioctl: sg0, cmd=0x2275
sg_remove_scat: k_use_sg=1
sg_remove_scat: k=0, a=0xc16d0900, len=32768
sg_build_reserve: req_size=65536
sg_build_indirect: buff_size=65536, blk_size=65536
sg_build_build: k=0, a=0xc16d0900, len=32768
sg_build_build: k=1, a=0xc16d0a00, len=32768
sg_build_indirect: k_use_sg=2, rem_sz=0
sg_ioctl: sg0, cmd=0x227b
sg_ioctl: sg0, cmd=0x2276
sg_write: sg0, count=64
scsi_block_when_processing_errors: rtn: 1
sg_common_write:  scsi opcode=0x28, cmd_size=10
sg_start_req: dxfer_len=13312
sg_link_reserve: size=13312
scsi_add_timer: scmd: f7f30b00, time: 15000, (c02b1420)
sd 0:1:0:0: send 0xf7f30b00                  sd 0:1:0:0: 
        command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00
buffer = 0xf6bfe680, bufflen = 13312, done = 0xc02b4b90, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sg_read: sg0, count=64
scsi_delete_timer: scmd: f7f30800, rtn: 1
sd 0:1:0:0: done 0xf7f30800 SUCCESS        0 sd 0:1:0:0: 
        command: Write (10): 2a 00 00 2d ab 15 00 00 08 00
scsi host busy 2 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
8 sectors total, 4096 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=249717, count=56
sda : block=249717
sda : writing 56/56 512 byte blocks.
scsi_add_timer: scmd: f7f30800, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30800                  sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 cf 75 00 00 38 00
buffer = 0xf6bfe080, bufflen = 28672, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30b00, rtn: 1
sd 0:1:0:0: done 0xf7f30b00 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 33 f6 24 ac 00 00 1a 00
scsi host busy 2 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
26 sectors total, 13312 bytes done.
use_sg is 1
sg_cmd_done: sg0, pack_id=871769260, res=0x0
scsi_delete_timer: scmd: f7f30800, rtn: 1
sd 0:1:0:0: done 0xf7f30800 SUCCESS        0 sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 cf 75 00 00 38 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
56 sectors total, 28672 bytes done.
use_sg is 7
sd_init_command: disk=sda, block=249773, count=8
sda : block=249773
sda : writing 8/8 512 byte blocks.
scsi_add_timer: scmd: f7f30800, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30800                  sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 cf ad 00 00 08 00
buffer = 0xf6bfe080, bufflen = 4096, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sg_read_xfer: num_xfer=13312, iovec_count=0, k_use_sg=1
scsi_delete_timer: scmd: f7f30800, rtn: 1
sd 0:1:0:0: done 0xf7f30800 SUCCESS        0 sd 0:1:0:0: 
        command: Write (10): 2a 00 00 03 cf ad 00 00 08 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
8 sectors total, 4096 bytes done.
use_sg is 1
sg_finish_rem_req: res_used=1
sg_unlink_reserve: req->k_use_sg=1
sd_init_command: disk=sda, block=2992917, count=8
sda : block=2992917
sda : writing 8/8 512 byte blocks.
scsi_add_timer: scmd: f7f30800, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30800                  sd 0:1:0:0: 
        command: Write (10): 2a 00 00 2d ab 15 00 00 08 00
buffer = 0xf6bfe080, bufflen = 4096, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
sd_init_command: disk=sda, block=8807823, count=8
sda : block=8807823
sda : reading 8/8 512 byte blocks.
scsi_add_timer: scmd: f7f30b00, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30b00                  sd 0:1:0:0: 
        command: Read (10): 28 00 00 86 65 8f 00 00 08 00
buffer = 0xf6bfe680, bufflen = 4096, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30800, rtn: 1
sd 0:1:0:0: done 0xf7f30800 SUCCESS        0 sd 0:1:0:0: 
        command: Write (10): 2a 00 00 2d ab 15 00 00 08 00
scsi host busy 2 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
8 sectors total, 4096 bytes done.
use_sg is 1
scsi_delete_timer: scmd: f7f30b00, rtn: 1
sd 0:1:0:0: done 0xf7f30b00 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 00 86 65 8f 00 00 08 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
8 sectors total, 4096 bytes done.
use_sg is 1
sd_init_command: disk=sda, block=4232917, count=64
sda : block=4232917
sda : reading 64/64 512 byte blocks.
scsi_add_timer: scmd: f7f30380, time: 7500, (c02b1420)
sd 0:1:0:0: send 0xf7f30380                  sd 0:1:0:0: 
        command: Read (10): 28 00 00 40 96 d5 00 00 40 00
buffer = 0xf6bfea40, bufflen = 32768, done = 0xc0366b40, queuecommand 0xc0344010
leaving scsi_dispatch_cmnd()
scsi_delete_timer: scmd: f7f30380, rtn: 1
sd 0:1:0:0: done 0xf7f30380 SUCCESS        0 sd 0:1:0:0: 
        command: Read (10): 28 00 00 40 96 d5 00 00 40 00
scsi host busy 1 failed 0
sd 0:1:0:0: Notifying upper driver of completion (result 0)
sd_rw_intr: sda: res=0x0
64 sectors total, 32768 bytes done.
use_sg is 5
sg_release: sg0
sg_fasync: sg0, mode=0
__sg_remove_sfp:    bufflen=65536, k_use_sg=2
sg_remove_scat: k_use_sg=2
sg_remove_scat: k=0, a=0xc16d0900, len=32768
sg_remove_scat: k=1, a=0xc16d0a00, len=32768
__sg_remove_sfp:    sfp=0xf680e000
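
The "command:" lines in the log above are raw 10-byte CDBs. As a reading aid, here is a minimal user-space sketch of how the LBA and block count of a Read (10) or Write (10) can be recovered from those bytes and compared with the block= and count= values printed by sd_init_command. The struct and the decode_rw10() helper are hypothetical names made up for this illustration; they are not part of the sd driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded fields of a 10-byte READ(10)/WRITE(10) CDB. */
struct rw10 {
	uint8_t  opcode;   /* 0x28 = Read (10), 0x2a = Write (10) */
	uint32_t lba;      /* CDB bytes 2..5, big-endian */
	uint16_t nblocks;  /* CDB bytes 7..8, big-endian */
};

/* Hypothetical helper: unpack the big-endian LBA and transfer length. */
static struct rw10 decode_rw10(const uint8_t *cdb)
{
	struct rw10 r;

	assert(cdb != NULL);
	r.opcode  = cdb[0];
	r.lba     = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
		    ((uint32_t)cdb[4] << 8)  |  (uint32_t)cdb[5];
	r.nblocks = (uint16_t)(((unsigned)cdb[7] << 8) | cdb[8]);
	return r;
}
```

Fed the bytes 28 00 33 f6 24 ac 00 00 1a 00 from the log, this yields LBA 871769260 and 26 blocks, matching "block=871769260, count=26" and, at 512 bytes per block, "bufflen = 13312".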

[-- Attachment #3: ttt --]
[-- Type: application/octet-stream, Size: 16384 bytes --]

[-- Attachment #4: ttt.sgp --]
[-- Type: application/octet-stream, Size: 13312 bytes --]


* [RFC] Megaraid update, submission
  2006-05-05  9:21     ` Vasily Averin
@ 2006-05-16 17:44       ` Andre Hedrick
  2006-05-16 18:07         ` Jeff Garzik
  0 siblings, 1 reply; 17+ messages in thread
From: Andre Hedrick @ 2006-05-16 17:44 UTC (permalink / raw)
  To: linux-scsi, Seokmann.Ju, Andrew Morton
  Cc: James Bottomley, Christoph Hellwig, Atul Mukker

[-- Attachment #1: Type: text/plain, Size: 1676 bytes --]


Linux-scsi, et al.

The following patch addresses two major issues found under extensive testing.

While pounding data io down the card and performing large scale queries to
the controller about device state and function parameters, the following
were discovered.

Random hardware error states (hard to reproduce without injecting noise
into the SATA connector or cable) locked the card and, in the majority of
cases, caused the array to be lost.  If the array was not lost, then a
drive was failed, but one could not remove it or replace it with a new
drive.  Adding a pci_master_abort test-and-clear function proved to allow
recovery in all cases where the card shut down communication to the host.
This may not address all cases; however, this is clearly a missing part of
the driver base when entry to eh_scsi_* begins.

The compound issue in the failed recovery resulted in a NULL pointer
dereference in the various list_head calls.  After changing the individual
list_add calls to list_move and such, the NULL pointer issue has never
shown up in the past 6 weeks of heavy testing.
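
To illustrate the difference between the two calls, here is a minimal user-space sketch, assuming the usual circular doubly linked list semantics of the kernel's list.h. All sk_* names are hypothetical stand-ins written for this example, not the kernel implementation, and the sketch does not claim that this is the exact failure mode seen in the driver. It only shows that adding an entry that is still linked elsewhere leaves the old list inconsistent, while a move unlinks the entry first.

```c
#include <assert.h>
#include <stddef.h>

struct sk_list { struct sk_list *next, *prev; };

static void sk_init(struct sk_list *h) { h->next = h; h->prev = h; }

/* Insert n right after head; does NOT look at n's old links. */
static void sk_list_add(struct sk_list *n, struct sk_list *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n from whatever list it is currently on. */
static void sk_list_del(struct sk_list *n)
{
	assert(n->next != NULL && n->prev != NULL);
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Unlink first, then insert: safe for an entry that is still linked. */
static void sk_list_move(struct sk_list *n, struct sk_list *head)
{
	sk_list_del(n);
	sk_list_add(n, head);
}

/* Cheap sanity check on the links adjacent to the list head. */
static int sk_list_ok(const struct sk_list *head)
{
	return head->next->prev == head && head->prev->next == head;
}
```

With an entry n already on list a, sk_list_add(&n, &b) leaves a.next pointing at an entry whose prev is now &b, so sk_list_ok(&a) returns 0 and a walk of a never gets back to its head. sk_list_move(&n, &b) leaves a empty and both lists consistent.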

In all cases in the past, the baseline error rate was one in six: either
one system in six failed and/or one in six test/stress runs failed.  With
the attached changes, there have been zero failures in the past three
weeks.  This sounds great, but I wish it would fail so as to allow some
statistics on improved error handling.

Please note the changes to SAS are minor and not tested, but seem correct
for the entire directory code base.  SAS shares the CMM core with MBOX,
thus the rationale for changes to SAS.

Please comment and provide suggestions.

Cheers,

Andre Hedrick
LAD Storage Consulting Group




[-- Attachment #2: Type: text/plain, Size: 6612 bytes --]

diff -ur linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_mbox.c linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_mbox.c
--- linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_mbox.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_mbox.c	2006-05-16 10:15:11.282454536 -0700
@@ -94,6 +94,8 @@
 static int megaraid_sysfs_alloc_resources(adapter_t *);
 static void megaraid_sysfs_free_resources(adapter_t *);
 
+static int megaraid_pci_master_abort(struct pci_dev *);
+
 static int megaraid_abort_handler(struct scsi_cmnd *);
 static int megaraid_reset_handler(struct scsi_cmnd *);
 
@@ -1276,7 +1278,7 @@
 	// detach scb from free pool
 	spin_lock_irqsave(SCSI_FREE_LIST_LOCK(adapter), flags);
 
-	if (list_empty(head)) {
+	if (list_empty_careful(head)) {
 		spin_unlock_irqrestore(SCSI_FREE_LIST_LOCK(adapter), flags);
 		return NULL;
 	}
@@ -1314,7 +1316,7 @@
 	scb->scp	= NULL;
 	spin_lock_irqsave(SCSI_FREE_LIST_LOCK(adapter), flags);
 
-	list_add(&scb->list, &adapter->kscb_pool);
+	list_move(&scb->list, &adapter->kscb_pool);
 
 	spin_unlock_irqrestore(SCSI_FREE_LIST_LOCK(adapter), flags);
 
@@ -1911,7 +1913,7 @@
 
 	if (scb_q) {
 		scb_q->state = SCB_PENDQ;
-		list_add_tail(&scb_q->list, &adapter->pend_list);
+		list_move_tail(&scb_q->list, &adapter->pend_list);
 	}
 
 	// if the adapter in not in quiescent mode, post the commands to FW
@@ -1920,7 +1922,7 @@
 		return;
 	}
 
-	while (!list_empty(&adapter->pend_list)) {
+	while (!list_empty_careful(&adapter->pend_list)) {
 
 		assert_spin_locked(PENDING_LIST_LOCK(adapter));
 
@@ -1946,7 +1948,7 @@
 
 			scb->state = SCB_PENDQ;
 
-			list_add(&scb->list, &adapter->pend_list);
+			list_move(&scb->list, &adapter->pend_list);
 
 			spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter),
 				flags);
@@ -2148,7 +2150,7 @@
 			}
 
 			scb->status = mbox->status;
-			list_add_tail(&scb->list, &clist);
+			list_move_tail(&scb->list, &clist);
 		}
 
 		// Acknowledge interrupt
@@ -2477,6 +2479,27 @@
 
 
 /**
+ * megaraid_pci_master_abort
+ * @dev                : pci device structure
+ *
+ * Tests for PCI Master Abort on the host adapter and clears state
+ * Returns state of error with inverted logic test to give proper
+ * state of the pci status bit describing master_abort.
+ */
+static int megaraid_pci_master_abort(struct pci_dev* dev)
+{
+	u16 status, error_bits;
+
+	pci_read_config_word(dev, PCI_STATUS, &status);
+	if (error_bits)
+		pci_write_config_word(dev, PCI_STATUS, error_bits);
+	pci_read_config_word(dev, PCI_STATUS, &status);
+	error_bits = (status & PCI_STATUS_REC_MASTER_ABORT);
+	return (!error_bits) ? 0 : 1;
+}
+
+
+/**
  * megaraid_abort_handler - abort the scsi command
  * @scp		: command to be aborted
  *
@@ -2505,9 +2528,13 @@
 
 	// If FW has stopped responding, simply return failure
 	if (raid_dev->hw_error) {
-		con_log(CL_ANN, (KERN_NOTICE
-			"megaraid: hw error, not aborting\n"));
-		return FAILED;
+		// test if adapter is locked up because of pci master abort
+		if (megaraid_pci_master_abort(adapter->pdev)) {
+			con_log(CL_ANN, (KERN_NOTICE
+				"megaraid: hw error, cannot reset\n"));
+			return FAILED;
+		}
+		raid_dev->hw_error = 0;
 	}
 
 	// There might a race here, where the command was completed by the
@@ -2519,14 +2546,13 @@
 	list_for_each_entry_safe(scb, tmp, &adapter->completed_list, list) {
 
 		if (scb->scp == scp) {	// Found command
-
-			list_del_init(&scb->list);	// from completed list
-
 			con_log(CL_ANN, (KERN_WARNING
 			"megaraid: %ld:%d[%d:%d], abort from completed list\n",
 				scp->serial_number, scb->sno,
 				scb->dev_channel, scb->dev_target));
 
+			list_del_init(&scb->list);	// from completed list
+
 			scp->result = (DID_ABORT << 16);
 			scp->scsi_done(scp);
 
@@ -2549,8 +2575,6 @@
 
 		if (scb->scp == scp) {	// Found command
 
-			list_del_init(&scb->list);	// from pending list
-
 			ASSERT(!(scb->state & SCB_ISSUED));
 
 			con_log(CL_ANN, (KERN_WARNING
@@ -2558,6 +2582,8 @@
 				scp->serial_number, scb->dev_channel,
 				scb->dev_target));
 
+			list_del_init(&scb->list);	// from pending list
+
 			scp->result = (DID_ABORT << 16);
 			scp->scsi_done(scp);
 
@@ -3606,7 +3632,7 @@
 	// detach one scb from free pool
 	spin_lock_irqsave(USER_FREE_LIST_LOCK(adapter), flags);
 
-	if (list_empty(head)) {	// should never happen because of CMM
+	if (list_empty_careful(head)) {	// should never happen because of CMM
 
 		con_log(CL_ANN, (KERN_WARNING
 			"megaraid mbox: bug in cmm handler, lost resources\n"));
@@ -3737,7 +3763,7 @@
 
 	spin_lock_irqsave(USER_FREE_LIST_LOCK(adapter), flags);
 
-	list_add(&scb->list, &adapter->uscb_pool);
+	list_move(&scb->list, &adapter->uscb_pool);
 
 	spin_unlock_irqrestore(USER_FREE_LIST_LOCK(adapter), flags);
 
diff -ur linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_mm.c linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_mm.c
--- linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_mm.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_mm.c	2006-05-16 10:16:19.461954269 -0700
@@ -577,7 +577,7 @@
 
 	head = &adp->kioc_pool;
 
-	if (list_empty(head)) {
+	if (list_empty_careful(head)) {
 		up(&adp->kioc_semaphore);
 		spin_unlock_irqrestore(&adp->kioc_pool_lock, flags);
 
@@ -641,7 +641,7 @@
 
 	/* Return the kioc to the free pool */
 	spin_lock_irqsave(&adp->kioc_pool_lock, flags);
-	list_add(&kioc->list, &adp->kioc_pool);
+	list_move(&kioc->list, &adp->kioc_pool);
 	spin_unlock_irqrestore(&adp->kioc_pool_lock, flags);
 
 	/* increment the free kioc count */
diff -ur linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_sas.c linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_sas.c
--- linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_sas.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_sas.c	2006-05-16 10:25:54.562822089 -0700
@@ -95,7 +95,7 @@
 
 	spin_lock_irqsave(&instance->cmd_pool_lock, flags);
 
-	if (!list_empty(&instance->cmd_pool)) {
+	if (!list_empty_careful(&instance->cmd_pool)) {
 		cmd = list_entry((&instance->cmd_pool)->next,
 				 struct megasas_cmd, list);
 		list_del_init(&cmd->list);
@@ -120,7 +120,7 @@
 	spin_lock_irqsave(&instance->cmd_pool_lock, flags);
 
 	cmd->scmd = NULL;
-	list_add_tail(&cmd->list, &instance->cmd_pool);
+	list_move_tail(&cmd->list, &instance->cmd_pool);
 
 	spin_unlock_irqrestore(&instance->cmd_pool_lock, flags);
 }


* Re: [RFC] Megaraid update, submission
  2006-05-16 17:44       ` [RFC] Megaraid update, submission Andre Hedrick
@ 2006-05-16 18:07         ` Jeff Garzik
  2006-05-16 18:13           ` Andre Hedrick
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Garzik @ 2006-05-16 18:07 UTC (permalink / raw)
  To: Andre Hedrick
  Cc: linux-scsi, Seokmann.Ju, Andrew Morton, James Bottomley,
	Christoph Hellwig, Atul Mukker

Andre Hedrick wrote:
> +static int megaraid_pci_master_abort(struct pci_dev* dev)
> +{
> +	u16 status, error_bits;
> +
> +	pci_read_config_word(dev, PCI_STATUS, &status);
> +	if (error_bits)
> +		pci_write_config_word(dev, PCI_STATUS, error_bits);
> +	pci_read_config_word(dev, PCI_STATUS, &status);

error_bits is used before a value is assigned to it.  Presumably you are 
missing a duplicate of a line further down in the function,

	error_bits = (status & PCI_STATUS_REC_MASTER_ABORT);

The list stuff looks OK, to my quick glance.
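
For reference, a user-space sketch of the test-and-clear sequence with the missing assignment restored. The mock_* names and the struct are hypothetical stand-ins for pci_read_config_word()/pci_write_config_word() so that the logic can be run outside the kernel. The mock treats every status bit as write-one-to-clear, which is a simplification; the register offset and bit value are the ones defined in pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STATUS                  0x06    /* 16-bit status register */
#define PCI_STATUS_REC_MASTER_ABORT 0x2000  /* set on master abort */

/* Mock device: only the status register is modelled. */
struct mock_pci_dev { uint16_t status; };

static void mock_read_config_word(struct mock_pci_dev *dev, int where,
				  uint16_t *val)
{
	assert(where == PCI_STATUS);
	*val = dev->status;
}

/* Writing a 1 to a status bit clears it (simplified for all bits). */
static void mock_write_config_word(struct mock_pci_dev *dev, int where,
				   uint16_t val)
{
	assert(where == PCI_STATUS);
	dev->status &= (uint16_t)~val;
}

/* Returns 1 if the master abort bit is still set after the clear. */
static int master_abort_test_and_clear(struct mock_pci_dev *dev)
{
	uint16_t status, error_bits;

	mock_read_config_word(dev, PCI_STATUS, &status);
	/* The assignment missing from the posted patch. */
	error_bits = status & PCI_STATUS_REC_MASTER_ABORT;
	if (error_bits)
		mock_write_config_word(dev, PCI_STATUS, error_bits);
	mock_read_config_word(dev, PCI_STATUS, &status);
	error_bits = status & PCI_STATUS_REC_MASTER_ABORT;
	return error_bits ? 1 : 0;
}
```

Starting from a status word with the master abort bit set, the function clears only that bit and returns 0; other status bits are left alone because only error_bits is written back.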

	Jeff




* Re: [RFC] Megaraid update, submission
  2006-05-16 18:07         ` Jeff Garzik
@ 2006-05-16 18:13           ` Andre Hedrick
  2006-05-16 19:44             ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: Andre Hedrick @ 2006-05-16 18:13 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: linux-scsi, Seokmann.Ju, Andrew Morton, James Bottomley,
	Christoph Hellwig, Atul Mukker

[-- Attachment #1: Type: text/plain, Size: 997 bytes --]


Jeff,

Corrected patch, thanks for seeing I missed a paste when removing all the
ifdefs for testing variations.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

On Tue, 16 May 2006, Jeff Garzik wrote:

> Andre Hedrick wrote:
> > +static int megaraid_pci_master_abort(struct pci_dev* dev)
> > +{
> > +	u16 status, error_bits;
> > +
> > +	pci_read_config_word(dev, PCI_STATUS, &status);
> > +	if (error_bits)
> > +		pci_write_config_word(dev, PCI_STATUS, error_bits);
> > +	pci_read_config_word(dev, PCI_STATUS, &status);
> 
> error_bits is used before a value is assigned to it.  Presumably you are 
> missing a duplicate of a line further down in the function,
> 
> 	error_bits = (status & PCI_STATUS_REC_MASTER_ABORT);
> 
> The list stuff looks OK, to my quick glance.
> 
> 	Jeff
> 
> 

[-- Attachment #2: Type: text/plain, Size: 6668 bytes --]

diff -ur linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_mbox.c linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_mbox.c
--- linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_mbox.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_mbox.c	2006-05-16 11:18:56.263708705 -0700
@@ -94,6 +94,8 @@
 static int megaraid_sysfs_alloc_resources(adapter_t *);
 static void megaraid_sysfs_free_resources(adapter_t *);
 
+static int megaraid_pci_master_abort(struct pci_dev *);
+
 static int megaraid_abort_handler(struct scsi_cmnd *);
 static int megaraid_reset_handler(struct scsi_cmnd *);
 
@@ -1276,7 +1278,7 @@
 	// detach scb from free pool
 	spin_lock_irqsave(SCSI_FREE_LIST_LOCK(adapter), flags);
 
-	if (list_empty(head)) {
+	if (list_empty_careful(head)) {
 		spin_unlock_irqrestore(SCSI_FREE_LIST_LOCK(adapter), flags);
 		return NULL;
 	}
@@ -1314,7 +1316,7 @@
 	scb->scp	= NULL;
 	spin_lock_irqsave(SCSI_FREE_LIST_LOCK(adapter), flags);
 
-	list_add(&scb->list, &adapter->kscb_pool);
+	list_move(&scb->list, &adapter->kscb_pool);
 
 	spin_unlock_irqrestore(SCSI_FREE_LIST_LOCK(adapter), flags);
 
@@ -1911,7 +1913,7 @@
 
 	if (scb_q) {
 		scb_q->state = SCB_PENDQ;
-		list_add_tail(&scb_q->list, &adapter->pend_list);
+		list_move_tail(&scb_q->list, &adapter->pend_list);
 	}
 
 	// if the adapter in not in quiescent mode, post the commands to FW
@@ -1920,7 +1922,7 @@
 		return;
 	}
 
-	while (!list_empty(&adapter->pend_list)) {
+	while (!list_empty_careful(&adapter->pend_list)) {
 
 		assert_spin_locked(PENDING_LIST_LOCK(adapter));
 
@@ -1946,7 +1948,7 @@
 
 			scb->state = SCB_PENDQ;
 
-			list_add(&scb->list, &adapter->pend_list);
+			list_move(&scb->list, &adapter->pend_list);
 
 			spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter),
 				flags);
@@ -2148,7 +2150,7 @@
 			}
 
 			scb->status = mbox->status;
-			list_add_tail(&scb->list, &clist);
+			list_move_tail(&scb->list, &clist);
 		}
 
 		// Acknowledge interrupt
@@ -2477,6 +2479,28 @@
 
 
 /**
+ * megaraid_pci_master_abort
+ * @dev                : pci device structure
+ *
+ * Tests for PCI Master Abort on the host adapter and clears state
+ * Returns state of error with inverted logic test to give proper
+ * state of the pci statuts bit describing master_abort.
+ */
+static int megaraid_pci_master_abort(struct pci_dev* dev)
+{
+	u16 status, error_bits;
+
+	pci_read_config_word(dev, PCI_STATUS, &status);
+	error_bits = (status & PCI_STATUS_REC_MASTER_ABORT);
+	if (error_bits)
+		pci_write_config_word(dev, PCI_STATUS, error_bits);
+	pci_read_config_word(dev, PCI_STATUS, &status);
+	error_bits = (status & PCI_STATUS_REC_MASTER_ABORT);
+	return (!error_bits) ? 0 : 1;
+}
+
+
+/**
  * megaraid_abort_handler - abort the scsi command
  * @scp		: command to be aborted
  *
@@ -2505,9 +2529,13 @@
 
 	// If FW has stopped responding, simply return failure
 	if (raid_dev->hw_error) {
-		con_log(CL_ANN, (KERN_NOTICE
-			"megaraid: hw error, not aborting\n"));
-		return FAILED;
+		// test if adapter is locked up because of pci master abort
+		if (megaraid_pci_master_abort(adapter->pdev)) {
+			con_log(CL_ANN, (KERN_NOTICE
+				"megaraid: hw error, cannot reset\n"));
+			return FAILED;
+		}
+		raid_dev->hw_error = 0;
 	}
 
 	// There might a race here, where the command was completed by the
@@ -2519,14 +2547,13 @@
 	list_for_each_entry_safe(scb, tmp, &adapter->completed_list, list) {
 
 		if (scb->scp == scp) {	// Found command
-
-			list_del_init(&scb->list);	// from completed list
-
 			con_log(CL_ANN, (KERN_WARNING
 			"megaraid: %ld:%d[%d:%d], abort from completed list\n",
 				scp->serial_number, scb->sno,
 				scb->dev_channel, scb->dev_target));
 
+			list_del_init(&scb->list);	// from completed list
+
 			scp->result = (DID_ABORT << 16);
 			scp->scsi_done(scp);
 
@@ -2549,8 +2576,6 @@
 
 		if (scb->scp == scp) {	// Found command
 
-			list_del_init(&scb->list);	// from pending list
-
 			ASSERT(!(scb->state & SCB_ISSUED));
 
 			con_log(CL_ANN, (KERN_WARNING
@@ -2558,6 +2583,8 @@
 				scp->serial_number, scb->dev_channel,
 				scb->dev_target));
 
+			list_del_init(&scb->list);	// from pending list
+
 			scp->result = (DID_ABORT << 16);
 			scp->scsi_done(scp);
 
@@ -3606,7 +3633,7 @@
 	// detach one scb from free pool
 	spin_lock_irqsave(USER_FREE_LIST_LOCK(adapter), flags);
 
-	if (list_empty(head)) {	// should never happen because of CMM
+	if (list_empty_careful(head)) {	// should never happen because of CMM
 
 		con_log(CL_ANN, (KERN_WARNING
 			"megaraid mbox: bug in cmm handler, lost resources\n"));
@@ -3737,7 +3764,7 @@
 
 	spin_lock_irqsave(USER_FREE_LIST_LOCK(adapter), flags);
 
-	list_add(&scb->list, &adapter->uscb_pool);
+	list_move(&scb->list, &adapter->uscb_pool);
 
 	spin_unlock_irqrestore(USER_FREE_LIST_LOCK(adapter), flags);
 
diff -ur linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_mm.c linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_mm.c
--- linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_mm.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_mm.c	2006-05-16 10:16:19.461954269 -0700
@@ -577,7 +577,7 @@
 
 	head = &adp->kioc_pool;
 
-	if (list_empty(head)) {
+	if (list_empty_careful(head)) {
 		up(&adp->kioc_semaphore);
 		spin_unlock_irqrestore(&adp->kioc_pool_lock, flags);
 
@@ -641,7 +641,7 @@
 
 	/* Return the kioc to the free pool */
 	spin_lock_irqsave(&adp->kioc_pool_lock, flags);
-	list_add(&kioc->list, &adp->kioc_pool);
+	list_move(&kioc->list, &adp->kioc_pool);
 	spin_unlock_irqrestore(&adp->kioc_pool_lock, flags);
 
 	/* increment the free kioc count */
diff -ur linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_sas.c linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_sas.c
--- linux-2.6.17-rc4/drivers/scsi/megaraid/megaraid_sas.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4.new/drivers/scsi/megaraid/megaraid_sas.c	2006-05-16 10:25:54.562822089 -0700
@@ -95,7 +95,7 @@
 
 	spin_lock_irqsave(&instance->cmd_pool_lock, flags);
 
-	if (!list_empty(&instance->cmd_pool)) {
+	if (!list_empty_careful(&instance->cmd_pool)) {
 		cmd = list_entry((&instance->cmd_pool)->next,
 				 struct megasas_cmd, list);
 		list_del_init(&cmd->list);
@@ -120,7 +120,7 @@
 	spin_lock_irqsave(&instance->cmd_pool_lock, flags);
 
 	cmd->scmd = NULL;
-	list_add_tail(&cmd->list, &instance->cmd_pool);
+	list_move_tail(&cmd->list, &instance->cmd_pool);
 
 	spin_unlock_irqrestore(&instance->cmd_pool_lock, flags);
 }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Megaraid update, submission
  2006-05-16 18:13           ` Andre Hedrick
@ 2006-05-16 19:44             ` Matthew Wilcox
  2006-05-16 20:24               ` Andre Hedrick
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2006-05-16 19:44 UTC (permalink / raw)
  To: Andre Hedrick
  Cc: Jeff Garzik, linux-scsi, Seokmann.Ju, Andrew Morton,
	James Bottomley, Christoph Hellwig, Atul Mukker

On Tue, May 16, 2006 at 11:13:14AM -0700, Andre Hedrick wrote:
>  /**
> + * megaraid_pci_master_abort
> + * @dev                : pci device structure
> + *
> + * Tests for PCI Master Abort on the host adapter and clears state
> + * Returns state of error with inverted logic test to give proper
> + * state of the pci statuts bit describing master_abort.
> + */
> +static int megaraid_pci_master_abort(struct pci_dev* dev)
> +{
> +	u16 status, error_bits;
> +
> +	pci_read_config_word(dev, PCI_STATUS, &status);
> +	error_bits = (status & PCI_STATUS_REC_MASTER_ABORT);
> +	if (error_bits)
> +		pci_write_config_word(dev, PCI_STATUS, error_bits);
> +	pci_read_config_word(dev, PCI_STATUS, &status);
> +	error_bits = (status & PCI_STATUS_REC_MASTER_ABORT);
> +	return (!error_bits) ? 0 : 1;

A little clunky still.  How about:

	u16 status;
	pci_read_config_word(dev, PCI_STATUS, &status);
	if (!(status & PCI_STATUS_REC_MASTER_ABORT))
		return 0;
	pci_write_config_word(dev, PCI_STATUS, PCI_STATUS_REC_MASTER_ABORT);
	pci_read_config_word(dev, PCI_STATUS, &status);
	return (status & PCI_STATUS_REC_MASTER_ABORT) ? 1 : 0;
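
For readers following along, the read / clear / re-read sequence above can be
tried outside the kernel.  The following is only a userspace sketch: struct
fake_dev, fake_read_status(), fake_write_status() and the "stuck" flag are
hypothetical stand-ins for the real config-space accessors, and the only value
taken from the kernel headers is 0x2000 for PCI_STATUS_REC_MASTER_ABORT.  It
assumes the usual write-1-to-clear behaviour of the PCI status error bits.

```c
#include <assert.h>
#include <stdint.h>

/* Value of PCI_STATUS_REC_MASTER_ABORT in the kernel's pci_regs.h. */
#define REC_MASTER_ABORT 0x2000u

/* Hypothetical stand-in for a device's PCI status register. */
struct fake_dev {
	uint16_t status;	/* current register contents */
	int stuck;		/* non-zero: error bit re-asserts after a clear */
};

static void fake_read_status(const struct fake_dev *dev, uint16_t *val)
{
	assert(dev && val);
	*val = dev->status;
}

/* Write-1-to-clear: every bit set in val is cleared in the register. */
static void fake_write_status(struct fake_dev *dev, uint16_t val)
{
	dev->status &= (uint16_t)~val;
	if (dev->stuck)
		dev->status |= REC_MASTER_ABORT;
}

/* Returns 0 if no master abort remains latched, 1 if it persists. */
static int master_abort_persists(struct fake_dev *dev)
{
	uint16_t status;

	fake_read_status(dev, &status);
	if (!(status & REC_MASTER_ABORT))
		return 0;
	/* Write only the bit we want cleared; other bits stay untouched. */
	fake_write_status(dev, REC_MASTER_ABORT);
	fake_read_status(dev, &status);
	return (status & REC_MASTER_ABORT) ? 1 : 0;
}
```

Writing the single mask bit back, rather than the whole status word, avoids
clearing other latched error bits as a side effect.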

> @@ -2519,14 +2547,13 @@
>  	list_for_each_entry_safe(scb, tmp, &adapter->completed_list, list) {
>  
>  		if (scb->scp == scp) {	// Found command
> -
> -			list_del_init(&scb->list);	// from completed list
> -
>  			con_log(CL_ANN, (KERN_WARNING
>  			"megaraid: %ld:%d[%d:%d], abort from completed list\n",
>  				scp->serial_number, scb->sno,
>  				scb->dev_channel, scb->dev_target));
>  
> +			list_del_init(&scb->list);	// from completed list
> +
>  			scp->result = (DID_ABORT << 16);
>  			scp->scsi_done(scp);
>  

Not quite sure why this change makes any difference

> @@ -2549,8 +2576,6 @@
>  
>  		if (scb->scp == scp) {	// Found command
>  
> -			list_del_init(&scb->list);	// from pending list
> -
>  			ASSERT(!(scb->state & SCB_ISSUED));
>  
>  			con_log(CL_ANN, (KERN_WARNING
> @@ -2558,6 +2583,8 @@
>  				scp->serial_number, scb->dev_channel,
>  				scb->dev_target));
>  
> +			list_del_init(&scb->list);	// from pending list
> +
>  			scp->result = (DID_ABORT << 16);
>  			scp->scsi_done(scp);

ditto


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Megaraid update, submission
  2006-05-16 19:44             ` Matthew Wilcox
@ 2006-05-16 20:24               ` Andre Hedrick
  0 siblings, 0 replies; 17+ messages in thread
From: Andre Hedrick @ 2006-05-16 20:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Garzik, linux-scsi, Seokmann.Ju, Andrew Morton,
	James Bottomley, Christoph Hellwig, Atul Mukker


Matthew,

Thanks for the feedback.  The "list_del_init" call is moved so that you can
read what the scb was before it gets deleted.  This is a housekeeping
change, to see what failed before it is removed from the list.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

On Tue, 16 May 2006, Matthew Wilcox wrote:

> On Tue, May 16, 2006 at 11:13:14AM -0700, Andre Hedrick wrote:
> >  /**
> > + * megaraid_pci_master_abort
> > + * @dev                : pci device structure
> > + *
> > + * Tests for PCI Master Abort on the host adapter and clears state
> > + * Returns state of error with inverted logic test to give proper
> > + * state of the pci statuts bit describing master_abort.
> > + */
> > +static int megaraid_pci_master_abort(struct pci_dev* dev)
> > +{
> > +	u16 status, error_bits;
> > +
> > +	pci_read_config_word(dev, PCI_STATUS, &status);
> > +	error_bits = (status & PCI_STATUS_REC_MASTER_ABORT);
> > +	if (error_bits)
> > +		pci_write_config_word(dev, PCI_STATUS, error_bits);
> > +	pci_read_config_word(dev, PCI_STATUS, &status);
> > +	error_bits = (status & PCI_STATUS_REC_MASTER_ABORT);
> > +	return (!error_bits) ? 0 : 1;
> 
> A little clunky still.  How about:
> 
> 	u16 status;
> 	pci_read_config_word(dev, PCI_STATUS, &status);
> 	if (!(status & PCI_STATUS_REC_MASTER_ABORT))
> 		return 0;
> 	pci_write_config_word(dev, PCI_STATUS, PCI_STATUS_REC_MASTER_ABORT);
> 	pci_read_config_word(dev, PCI_STATUS, &status);
> 	return (status & PCI_STATUS_REC_MASTER_ABORT) ? 1 : 0;
> 
> > @@ -2519,14 +2547,13 @@
> >  	list_for_each_entry_safe(scb, tmp, &adapter->completed_list, list) {
> >  
> >  		if (scb->scp == scp) {	// Found command
> > -
> > -			list_del_init(&scb->list);	// from completed list
> > -
> >  			con_log(CL_ANN, (KERN_WARNING
> >  			"megaraid: %ld:%d[%d:%d], abort from completed list\n",
> >  				scp->serial_number, scb->sno,
> >  				scb->dev_channel, scb->dev_target));
> >  
> > +			list_del_init(&scb->list);	// from completed list
> > +
> >  			scp->result = (DID_ABORT << 16);
> >  			scp->scsi_done(scp);
> >  
> 
> Not quite sure why this change makes any difference
> 
> > @@ -2549,8 +2576,6 @@
> >  
> >  		if (scb->scp == scp) {	// Found command
> >  
> > -			list_del_init(&scb->list);	// from pending list
> > -
> >  			ASSERT(!(scb->state & SCB_ISSUED));
> >  
> >  			con_log(CL_ANN, (KERN_WARNING
> > @@ -2558,6 +2583,8 @@
> >  				scp->serial_number, scb->dev_channel,
> >  				scb->dev_target));
> >  
> > +			list_del_init(&scb->list);	// from pending list
> > +
> >  			scp->result = (DID_ABORT << 16);
> >  			scp->scsi_done(scp);
> 
> ditto
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: megaraid_mbox: garbage in file
  2006-05-05  5:37   ` Vasily Averin
  2006-05-05  9:21     ` Vasily Averin
@ 2006-05-05 15:59     ` James Bottomley
  2006-05-05 18:17       ` Vasily Averin
  1 sibling, 1 reply; 17+ messages in thread
From: James Bottomley @ 2006-05-05 15:59 UTC (permalink / raw)
  To: Vasily Averin
  Cc: linux-scsi, Neela.Kolli, Atul Mukker, Seokmann.Ju, sreenib, devel

On Fri, 2006-05-05 at 09:37 +0400, Vasily Averin wrote:
> The issue is that the correctly finished scsi read command return me garbage
> (repeated 0 ...127 -- see hexdump in my first letter) instead correct file content.
> "attempt to access beyond end of device" messages occurs due the same garbage
> readed from the Indirect block. I found this garbage present in data buffers
> beginning at megaraid driver functions.
> 
> I would note that if I read the same file by using dd with bs=1024 or bs=512 --
> I get correct file content.
> 
> When I use kernel with 4Gb memory limit -- the same cat command return me
> correct file content too, without any garbage.
> 
> Question is what it is the strange garbage? Have you seen it earlier?
> Is it possible that it is some driver-related issue or it is broken hardware?
> And why I can workaround this issue by using only 4Gb memory?

This is really odd ... if the controller can't reach *any* memory above
32 bits, then, on an 8GB machine you'd expect corruption all over the
place since most user pages come from the top of highmem.

The first thing to try, since you have an opteron system, is to get rid
of highmem entirely and use a 64 bit kernel (just to make sure we're not
running into some annoying dma_addr_t conversion problem).  Then, I
suppose if that doesn't work, try printing out the actual contents of
the sg list to see what the physical memory location of the page
containing the corrupt block is.

This could also be a firmware problem, I suppose, but I haven't seen any
similar reports.

James

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: megaraid_mbox: garbage in file
  2006-05-05 15:59     ` megaraid_mbox: garbage in file James Bottomley
@ 2006-05-05 18:17       ` Vasily Averin
  2006-05-05 20:05         ` James Bottomley
  0 siblings, 1 reply; 17+ messages in thread
From: Vasily Averin @ 2006-05-05 18:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-scsi, Neela.Kolli, Atul Mukker, Seokmann.Ju, sreenib, devel,
	Linux Kernel Mailing List

James Bottomley wrote:
> On Fri, 2006-05-05 at 09:37 +0400, Vasily Averin wrote:
>>The issue is that the correctly finished scsi read command return me garbage
>>(repeated 0 ...127 -- see hexdump in my first letter) instead correct file content.
>>"attempt to access beyond end of device" messages occurs due the same garbage
>>readed from the Indirect block. I found this garbage present in data buffers
>>beginning at megaraid driver functions.
>>
>>I would note that if I read the same file by using dd with bs=1024 or bs=512 --
>>I get correct file content.
>>
>>When I use kernel with 4Gb memory limit -- the same cat command return me
>>correct file content too, without any garbage.
>>
>>Question is what it is the strange garbage? Have you seen it earlier?
>>Is it possible that it is some driver-related issue or it is broken hardware?
>>And why I can workaround this issue by using only 4Gb memory?
> 
> This is really odd ... if the controller can't reach *any* memory above
> 32 bits, then, on an 8GB machine you'd expect corruption all over the
> place since most user pages come from the top of highmem.
> 
> The first thing to try, since you have an opteron system, is to get rid
> of highmem entirely and use a 64 bit kernel (just to make sure we're not
> running into some annoying dma_addr_t conversion problem).

Unfortunately it is a customer's node, and I'm not able to reinstall it with a
64-bit distribution in order to load a 64-bit kernel. Of course I'll ask the
customer about this, but that will have to be done later.

> Then, I
> suppose if that doesn't work, try printing out the actual contents of
> the sg list to see what the physical memory location of the page
> containing the corrupt block is.

I've already done such an experiment.  On a 2.6.8-based Virtuozzo kernel I
added the following code to the megaraid_mbox_display_scb function:
  virt = page_address(sg[i].page) + sg[i].offset;
  printk("mbox sg%d: page %p off %d addr %llx len %d "
         "virt %p first %08x page->flags %08x\n",
   i, sg[i].page, sg[i].offset, sg[i].dma_address, sg[i].length,
   virt, virt == NULL ? 0: *(int *)virt, sg[i].page->flags);

and got the following results:
May  4 02:51:38 vpsn002 kernel:
 megaraid mailbox: status:0x0 cmd:0xa7 id:0x25 sec:0x1a
		 lba:0x33f624ac addr:0xffffffff ld:128 sg:4
 scsi cmnd: 0x28 0x00 0x33 0xf6 0x24 0xac 0x00 0x00 0x1a 0x00
 mbox request_buffer eafde340 use_sg 4
 mbox sg0: page 077a0474 off 0 addr 1fd575000 len 4096 virt ff15a000
		 first 03020100 page->flags 40020101
 mbox sg1: page 077b5738 off 0 addr 1fdede000 len 4096 virt ff141000
		 first 03020100 page->flags 40020101
 mbox sg2: page 077ad500 off 0 addr 1fdb40000 len 4096 virt ff056000
		 first 03020100 page->flags 40020101
 mbox sg3: page 030d46e8 off 1024 addr 5e6a400 len 1024 virt 07e6a400
		 first 03020100 page->flags 20001004

"first 03020100" shows that the data in all of the sg buffers is already
corrupted.  I would also note that the page for the last 1KB buffer is not
highmem.

If you want, I can repeat this experiment on a 2.6.16 kernel too.
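
As a cross-check of the numbers in the log above, here is a small userspace
sketch that decodes the logged READ(10) CDB, sums the scatter-gather lengths,
and splits the "first" word into memory-order bytes.  The helper names
(decode_read10, sg_total, word_to_bytes) are made up for illustration and are
not part of the driver; the 512-byte block size is the one reported by sd in
the first message of this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of a SCSI READ(10) command descriptor block (opcode 0x28). */
struct read10 {
	uint32_t lba;		/* bytes 2..5, big-endian */
	uint16_t blocks;	/* bytes 7..8, big-endian */
};

static struct read10 decode_read10(const uint8_t cdb[10])
{
	struct read10 r;

	assert(cdb[0] == 0x28);
	r.lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
		((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
	r.blocks = (uint16_t)((cdb[7] << 8) | cdb[8]);
	return r;
}

/* Sum the byte lengths of a scatter-gather list. */
static uint32_t sg_total(const uint32_t *len, size_t n)
{
	uint32_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += len[i];
	return total;
}

/*
 * Split a 32-bit word, as read through an int pointer on little-endian
 * x86, into the bytes in the order they sit in memory.
 */
static void word_to_bytes(uint32_t word, uint8_t out[4])
{
	int i;

	for (i = 0; i < 4; i++)
		out[i] = (uint8_t)(word >> (8 * i));
}
```

With the logged CDB this gives LBA 0x33f624ac and 26 blocks, and the four sg
lengths (4096 + 4096 + 4096 + 1024) add up to the same 13312 bytes.  The word
03020100 corresponds to the bytes 00 01 02 03 in memory, which matches the
start of the repeating 0...127 pattern described earlier.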

> This could also be a firmware problem, I suppose, but I haven't seen any
> similar reports.

Thank you,
	Vasily Averin

SWsoft Virtuozzo/OpenVZ Linux kernel team

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: megaraid_mbox: garbage in file
  2006-05-05 18:17       ` Vasily Averin
@ 2006-05-05 20:05         ` James Bottomley
  2006-05-05 23:43             ` Vasily Averin
  0 siblings, 1 reply; 17+ messages in thread
From: James Bottomley @ 2006-05-05 20:05 UTC (permalink / raw)
  To: Vasily Averin
  Cc: linux-scsi, Neela.Kolli, Atul Mukker, Seokmann.Ju, sreenib, devel,
	Linux Kernel Mailing List

On Fri, 2006-05-05 at 22:17 +0400, Vasily Averin wrote:
>  megaraid mailbox: status:0x0 cmd:0xa7 id:0x25 sec:0x1a
>                  lba:0x33f624ac addr:0xffffffff ld:128 sg:4
>  scsi cmnd: 0x28 0x00 0x33 0xf6 0x24 0xac 0x00 0x00 0x1a 0x00
>  mbox request_buffer eafde340 use_sg 4
>  mbox sg0: page 077a0474 off 0 addr 1fd575000 len 4096 virt ff15a000
>                  first 03020100 page->flags 40020101
>  mbox sg1: page 077b5738 off 0 addr 1fdede000 len 4096 virt ff141000
>                  first 03020100 page->flags 40020101
>  mbox sg2: page 077ad500 off 0 addr 1fdb40000 len 4096 virt ff056000
>                  first 03020100 page->flags 40020101
>  mbox sg3: page 030d46e8 off 1024 addr 5e6a400 len 1024 virt 07e6a400
>                  first 03020100 page->flags 20001004

The odd thing about this is that the highmem addresses shouldn't have a
virtual mapping (since nothing should have called kmap on them).

But the other oddity tickles a suspicion about the card.  I know of various
LSI chips that don't have a true 64 bit mode, but instead have programmable
windowed mappings in their descriptors (i.e. all SG list elements have
to be in the same xGB region of physical memory), and since the last
descriptor is more than 4GB away from the other three, I wonder whether this
might be the problem here.  Unfortunately, only LSI can tell us this ...
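
To make the suspicion concrete, the check below tests whether every element
of an SG list falls inside one address window.  It is a sketch only: the 4GB
window size (WINDOW_SHIFT) is an assumption, since the real window size, if
the card has one at all, is known only to LSI, and sg_in_one_window() is a
hypothetical helper rather than anything in the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed window size: 4GB, so the upper 32 address bits pick the window. */
#define WINDOW_SHIFT 32

/* Returns 1 if every element lies in the same window, 0 otherwise. */
static int sg_in_one_window(const uint64_t *addr, const uint32_t *len,
			    size_t n)
{
	uint64_t window;
	size_t i;

	assert(n > 0);
	window = addr[0] >> WINDOW_SHIFT;
	for (i = 0; i < n; i++) {
		uint64_t last;

		assert(len[i] > 0);
		last = addr[i] + len[i] - 1;	/* last byte of the element */
		if ((addr[i] >> WINDOW_SHIFT) != window ||
		    (last >> WINDOW_SHIFT) != window)
			return 0;
	}
	return 1;
}
```

Fed the four DMA addresses from the log (1fd575000, 1fdede000, 1fdb40000 and
5e6a400), the check fails because the last element sits below 4GB while the
first three sit above it; the first three on their own pass.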

James

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: megaraid_mbox: garbage in file
  2006-05-05 20:05         ` James Bottomley
@ 2006-05-05 23:43             ` Vasily Averin
  0 siblings, 0 replies; 17+ messages in thread
From: Vasily Averin @ 2006-05-05 23:43 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-scsi, Neela.Kolli, Atul Mukker, Seokmann.Ju, sreenib, devel,
	Linux Kernel Mailing List

James Bottomley wrote:
> On Fri, 2006-05-05 at 22:17 +0400, Vasily Averin wrote:
>> megaraid mailbox: status:0x0 cmd:0xa7 id:0x25 sec:0x1a
>>                 lba:0x33f624ac addr:0xffffffff ld:128 sg:4
>> scsi cmnd: 0x28 0x00 0x33 0xf6 0x24 0xac 0x00 0x00 0x1a 0x00
>> mbox request_buffer eafde340 use_sg 4
>> mbox sg0: page 077a0474 off 0 addr 1fd575000 len 4096 virt ff15a000
>>                 first 03020100 page->flags 40020101
>> mbox sg1: page 077b5738 off 0 addr 1fdede000 len 4096 virt ff141000
>>                 first 03020100 page->flags 40020101
>> mbox sg2: page 077ad500 off 0 addr 1fdb40000 len 4096 virt ff056000
>>                 first 03020100 page->flags 40020101
>> mbox sg3: page 030d46e8 off 1024 addr 5e6a400 len 1024 virt 07e6a400
>>                 first 03020100 page->flags 20001004
> 
> The odd thing about this is that the highmem addresses shouldn't have a
> virtual mapping (since nothing should have called kmap on them).

You are right: in my other experiments highmem pages usually have virt=0, and
I cannot find who kmapped these pages.

I'll investigate this issue later; first of all I'll try to reproduce it
on a 2.6.16 kernel.

Thank you,
	Vasily Averin

SWsoft Virtuozzo/OpenVZ Linux kernel team

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC] Megaraid update, submission
@ 2006-05-16 19:03 Ju, Seokmann
  2006-05-16 20:47 ` Andre Hedrick
  0 siblings, 1 reply; 17+ messages in thread
From: Ju, Seokmann @ 2006-05-16 19:03 UTC (permalink / raw)
  To: Andre Hedrick, linux-scsi, Andrew Morton
  Cc: James Bottomley, Christoph Hellwig, Mukker, Atul

Hi,

I cannot agree with the changes in the patch, for the following reasons.

On Tuesday, May 16, 2006 1:44 PM, Andre Hedrick wrote:
> Random (hard to reproduce, without a noise injection into the SATA
> connector or cable) hardware error states which locks the 
> card and in the
> majority of the cases caused the array to be lost.  If the 
> array was not
> lost then a drive was failed but one could not remove/replace w/ a new
> drive.  Thus adding in a pci_master_abort test and clear 
> function proved
> to allow recovery in all cases where the card shutdown 
> communication to
> the host.  This may not address all cases; however, clearly this is a
> missing part of the driver base when entry to eh_scsi_* begins.
If 'raid_dev->hw_error' is non-zero, the controller has gone bad and cannot recover without a reboot; nor should it be allowed to, in order to avoid further memory corruption.
The overall issue described here has already been taken care of by the patch that I submitted.
That patch has been accepted and should be available in 2.6.17-rc1-mm3, as specified in Andrew Morton's email.
> The compond issue in the failed recovery resulted in a deref 
> NULL pointer
> in the various list_head calls.  After change the individual 
> list_add to
> list_move and such, the NULL point issue has never shown up 
> in the past 6
> weeks of heavy testing.
I'm not sure how these changes help with the issue. Furthermore, I'm not sure what _the NULL point issue_ refers to. If you see the issue with the driver available in 2.6.17-rc1-mm3, please let me know.
The following link leads to further details of the patch.
http://www.kernel.org/git/?p=linux/kernel/git/jejb/scsi-rc-fixes-2.6.git;a=commit;h=c005fb4fb2d23ba29ad21dee5042b2f8451ca8ba

Thank you,

Seokmann

> -----Original Message-----
> From: Andre Hedrick [mailto:andre@linux-ide.org] 
> Sent: Tuesday, May 16, 2006 1:44 PM
> To: linux-scsi@vger.kernel.org; Ju, Seokmann; Andrew Morton
> Cc: James Bottomley; Christoph Hellwig; Mukker, Atul
> Subject: [RFC] Megaraid update, submission
> 
> 
> Linux-scsi, et al.
> 
> The follow patch address two major issues found under 
> extensive testing.
> 
> While pounding data io down the card and performing large 
> scale queries to
> the controller about device state and function parameters, 
> the following
> were discovered.
> 
> Random (hard to reproduce, without a noise injection into the SATA
> connector or cable) hardware error states which locks the 
> card and in the
> majority of the cases caused the array to be lost.  If the 
> array was not
> lost then a drive was failed but one could not remove/replace w/ a new
> drive.  Thus adding in a pci_master_abort test and clear 
> function proved
> to allow recovery in all cases where the card shutdown 
> communication to
> the host.  This may not address all cases; however, clearly this is a
> missing part of the driver base when entry to eh_scsi_* begins.
> 
> The compond issue in the failed recovery resulted in a deref 
> NULL pointer
> in the various list_head calls.  After change the individual 
> list_add to
> list_move and such, the NULL point issue has never shown up 
> in the past 6
> weeks of heavy testing.
> 
> In all cases in the past, the baseline for error was 6:1.  
> Meaning either
> one system in six failed and/or one in six test/stress runs 
> failed.  With
> the attached changes, there have been zero failures in the past three
> weeks.  This sound great, but I wish it would fail to allow some
> statistics of improved error handling.
> 
> Please note the changes to SAS are minor and not tested, but 
> seem correct
> for the entire directory code base.  SAS shares the CMM core 
> with MBOX,
> thus the rational for changes to SAS.
> 
> Please comment and provide suggestions.
> 
> Cheers,
> 
> Andre Hedrick
> LAD Storage Consulting Group
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC] Megaraid update, submission
  2006-05-16 19:03 [RFC] Megaraid update, submission Ju, Seokmann
@ 2006-05-16 20:47 ` Andre Hedrick
  0 siblings, 0 replies; 17+ messages in thread
From: Andre Hedrick @ 2006-05-16 20:47 UTC (permalink / raw)
  To: Ju, Seokmann
  Cc: linux-scsi, Andrew Morton, James Bottomley, Christoph Hellwig,
	Mukker, Atul


Warning: OOPS in message; ignore if you hate reading pasted OOPSes.

Seokmann,

So there should be no (sane) heroic attempts to recover the card state?
Please look again: the path is only retried, and it follows the original
operational path which resulted in setting the 'raid_dev->hw_error' flag.
If I am reading the code correctly, the *->quiescent flag controls command
submission to the card.  Thus all commands already submitted to the firmware
are owned by the card; shouldn't those IOs be allowed to complete regardless?
With as many as 20 requests outstanding (the maximum I have seen to date),
terminating the transactions surely blows apart any filesystem.  I have had
filesystems, and in several cases attached arrays, just vaporize when forced
to reboot with 'hw_error' set.

So since the pci_master_abort for the card is being rejected ...

Let's move on to the list management issues, where timeouts on ioctl calls
have produced NULL pointers when one performs an add instead of a move to
transfer ownership of a given scb between pools.
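
The general hazard behind add versus move can be shown in userspace.  The
block below is a re-creation of the kernel's circular list for illustration
only, not the kernel's list.h itself, and list_count() is a hypothetical
helper.  It does not show that the driver actually reaches this state; it
only shows what happens if an entry that is still linked on one list is
added to another without being unlinked first.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-creation of a circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list(struct list_head *h)
{
	assert(h != NULL);
	h->next = h;
	h->prev = h;
}

/* Link entry right after head; does NOT unlink it from anywhere else. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry from whatever list its neighbours belong to. */
static void list_del_entry(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Unlink first, then link after head. */
static void list_move(struct list_head *entry, struct list_head *head)
{
	list_del_entry(entry);
	list_add(entry, head);
}

/*
 * Walk forward from head and count the entries.  Returns -1 if the walk
 * has not come back to head after more than limit entries.
 */
static int list_count(const struct list_head *head, int limit)
{
	const struct list_head *p;
	int n = 0;

	for (p = head->next; p != head; p = p->next)
		if (++n > limit)
			return -1;
	return n;
}
```

After a list_move() both lists stay consistent.  After a bare list_add() of
a still-linked entry, the old list head keeps pointing at the entry, while
the entry's own pointers lead into the new list, so a walk of the old list
never returns to its head.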

Fixing the list management may mean the pci_master_abort is not needed.

The NULL pointer:

Mar 29 00:09:53 5000 kernel: megaraid: aborting-464723 cmd=2a <c=1 t=0 l=0>
Mar 29 00:09:53 5000 kernel: megaraid abort: 464723:40[255:0], fw owner
Mar 29 00:09:53 5000 kernel: megaraid: aborting-464744 cmd=2a <c=1 t=0 l=0>
Mar 29 00:09:53 5000 kernel: megaraid abort: 464744:12[255:0], fw owner
Mar 29 00:09:53 5000 kernel: megaraid: aborting-464745 cmd=2a <c=1 t=0 l=0>
Mar 29 00:09:53 5000 kernel: megaraid abort: 464745:23[255:0], fw owner
Mar 29 00:09:53 5000 kernel: megaraid: aborting-464746 cmd=2a <c=1 t=0 l=0>
Mar 29 00:09:53 5000 kernel: megaraid abort: 464746:0[255:0], fw owner
Mar 29 00:09:53 5000 kernel: megaraid: aborting-464747 cmd=2a <c=1 t=0 l=0>
Mar 29 00:09:53 5000 kernel: megaraid abort: 464747[255:0], driver owner 
Mar 29 00:09:53 5000 kernel: megaraid: reseting the host...
Mar 29 00:09:53 5000 kernel: megaraid: 464723:128[65535:65535], reset from pending list
Mar 29 00:09:53 5000 kernel: megaraid: 4 outstanding commands. Max wait 180 sec
Mar 29 00:09:53 5000 kernel: megaraid mbox: Wait for 4 commands to complete:180
...
Mar 29 00:11:54 5000 kernel: megaraid mbox: Wait for 4 commands to complete:60
Mar 29 00:11:59 5000 kernel: megaraid mbox: Wait for 4 commands to complete:55
Mar 29 00:12:04 5000 kernel: megaraid mbox: Wait for 4 commands to complete:50
Mar 29 00:12:08 5000 kernel: megaraid mbox: reset sequence completed sucessfully
Mar 29 00:12:08 5000 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Mar 29 00:12:08 5000 kernel:  printing eip:
Mar 29 00:12:08 5000 kernel: f881f739
Mar 29 00:12:08 5000 kernel: *pde = 00000000
Mar 29 00:12:08 5000 kernel: Oops: 0002 [#1]
Mar 29 00:12:08 5000 kernel: SMP
Mar 29 00:12:08 5000 kernel: Modules linked in: xfs md5 ipv6 af_packet button thermal processor fan ac battery tsdev joydev
evdev usbkbd usbhid e1000 intel_agp agpgart ehci_hcd uhci_hcd usbcore rtc ext3 jbd sd_mod megaraid_mbox megaraid_mm ata_piix  libata scsi_mod
Mar 29 00:12:08 5000 kernel: CPU:    0
Mar 29 00:12:08 5000 kernel: EIP:    0060:[pg0+943802169/1069495296] Tainted: P      VLI
Mar 29 00:12:08 5000 kernel: EIP:    0060:[<f881f739>]    Tainted: P VLI
Mar 29 00:12:08 5000 kernel: EFLAGS: 00010046   (2.6.10)
Mar 29 00:12:08 5000 kernel: EIP is at megaraid_mbox_build_cmd+0x979/0xce0 [megaraid_mbox]
Mar 29 00:12:08 5000 kernel: eax: 00000000   ebx: 00000000   ecx: 0000000d edx: 79473000
Mar 29 00:12:08 5000 kernel: esi: c238f780   edi: c23af800   ebp: f7491f10 esp: f7491e98
Mar 29 00:12:09 5000 kernel: ds: 007b   es: 007b   ss: 0068
Mar 29 00:12:09 5000 kernel: Process scsi_eh_1 (pid: 885, threadinfo=f7490000 task=f7dde020)
Mar 29 00:12:09 5000 kernel: Stack: c23e3c00 f7de3000 f7491ebc f66fc2a0 c23e3c00 0000000d c226a42c f7436038
Mar 29 00:12:09 5000 kernel:        f7436030 f7491ee8 c23b1010 f7491ed0 011d2df4 c226aa34 c226aa2c c226a42c
Mar 29 00:12:09 5000 kernel:        00000000 000000ff c2268000 6e616373 676e696e 00000000 00000086 70696b73
Mar 29 00:12:09 5000 kernel: Call Trace:
Mar 29 00:12:09 5000 kernel:  [show_stack+171/192] show_stack+0xab/0xc0
Mar 29 00:12:09 5000 kernel:  [<c0103e9b>] show_stack+0xab/0xc0
Mar 29 00:12:09 5000 kernel:  [show_registers+351/464] show_registers+0x15f/0x1d0
Mar 29 00:12:09 5000 kernel:  [<c010402f>] show_registers+0x15f/0x1d0
Mar 29 00:12:09 5000 kernel:  [die+244/400] die+0xf4/0x190
Mar 29 00:12:09 5000 kernel:  [<c0104244>] die+0xf4/0x190
Mar 29 00:12:09 5000 kernel:  [do_page_fault+1172/1715] do_page_fault+0x494/0x6b3
Mar 29 00:12:09 5000 kernel:  [<c0117394>] do_page_fault+0x494/0x6b3
Mar 29 00:12:09 5000 kernel:  [error_code+43/48] error_code+0x2b/0x30
Mar 29 00:12:09 5000 kernel:  [<c0103aeb>] error_code+0x2b/0x30
Mar 29 00:12:09 5000 kernel:  [pg0+943799680/1069495296] megaraid_queue_command+0x50/0x90 [megaraid_mbox]
Mar 29 00:12:09 5000 kernel:  [<f881ed80>] megaraid_queue_command+0x50/0x90 [megaraid_mbox]
Mar 29 00:12:09 5000 kernel:  [pg0+943941731/1069495296] scsi_dispatch_cmd+0x173/0x290 [scsi_mod]
Mar 29 00:12:09 5000 kernel:  [<f8841863>] scsi_dispatch_cmd+0x173/0x290 [scsi_mod]
Mar 29 00:12:09 5000 kernel:  [pg0+943966809/1069495296] scsi_request_fn+0x1e9/0x430 [scsi_mod]
Mar 29 00:12:09 5000 kernel:  [blk_run_queue+42/64] blk_run_queue+0x2a/0x40
Mar 29 00:12:09 5000 kernel:  [<c023aeaa>] blk_run_queue+0x2a/0x40
Mar 29 00:12:09 5000 kernel:  [pg0+943963243/1069495296] scsi_run_host_queues+0x2b/0x50 [scsi_mod]
Mar 29 00:12:09 5000 kernel:  [<f8846c6b>] scsi_run_host_queues+0x2b/0x50 [scsi_mod]
Mar 29 00:12:09 5000 kernel:  [pg0+943960213/1069495296] scsi_error_handler+0x85/0x170 [scsi_mod]
Mar 29 00:12:09 5000 kernel:  [<f8846095>] scsi_error_handler+0x85/0x170 [scsi_mod]
Mar 29 00:12:09 5000 kernel:  [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10
Mar 29 00:12:09 5000 kernel:  [<c01012d5>] kernel_thread_helper+0x5/0x10
Mar 29 00:12:09 5000 kernel: Code: 2c 82 f8 c7 47 20 01 00 00 00 8b 4d 9c 85 c9 74 39 8b 4d 9c 31 db 8d b6 00 00 00 00 8d bf 00 00 00 00 8b 55 a0 8b 42 10 8b 56 08 <89> 14 18 31 d2 89 54 18 04 8b 45 a0 8b 50 10 8b 46 0c 83 c6 10
Mar 29 00:14:23 5000 kernel:  <4>megaraid cmm: ioctl timed out
Mar 29 00:14:23 5000 kernel: megaraid cmm: controller cannot accept cmds due to earlier errors
Mar 29 00:14:24 5000 last message repeated 3 times
...
until reboot
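
As a reading aid for the oops above: the "Oops: 0002" value is the x86
page-fault error code. The C sketch below decodes its three low bits. The
macro and function names are illustrative only; they are not taken from the
kernel source or from the patch under discussion.

```c
#include <stdio.h>

/* Low bits of the x86 page-fault error code shown in the "Oops:" line. */
#define PF_PROT  0x1u  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2u  /* 0: read access,      1: write access */
#define PF_USER  0x4u  /* 0: kernel mode,      1: user mode */

/*
 * Write a short description of the error code into buf and return buf.
 * Hypothetical helper for reading oops reports, not a kernel API.
 */
static const char *describe_oops_code(unsigned int code, char *buf, size_t len)
{
    snprintf(buf, len, "%s, %s, %s",
             (code & PF_PROT)  ? "protection violation" : "page not present",
             (code & PF_WRITE) ? "write" : "read",
             (code & PF_USER)  ? "user mode" : "kernel mode");
    return buf;
}
```

For this trace, describe_oops_code(0x0002, buf, sizeof buf) gives
"page not present, write, kernel mode". That fits the faulting bytes
"<89> 14 18" (mov %edx,(%eax,%ebx,1)) with eax and ebx both zero, i.e. a
kernel-mode store to address 0.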

I know everyone will rant that there is a taint. Logs without the taint
marker set do exist; I just do not have immediate access to them.

I will post the patch on kernel.org, where it can be adopted or dumped.
The posting to the list was to follow the patch submission rules.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

On Tue, 16 May 2006, Ju, Seokmann wrote:

> Hi,
> 
> I cannot agree with the changes in the patch, for the following reasons.
> 
> On Tuesday, May 16, 2006 1:44 PM, Andre Hedrick wrote:
> > Random (hard to reproduce without noise injection into the SATA
> > connector or cable) hardware error states locked the card and, in the
> > majority of cases, caused the array to be lost.  If the array was not
> > lost, then a drive was failed but could not be removed or replaced with
> > a new drive.  Adding a pci_master_abort test-and-clear function proved
> > to allow recovery in all cases where the card shut down communication
> > with the host.  This may not address all cases; however, this is clearly
> > a missing part of the driver base when entry to eh_scsi_* begins.
> If 'raid_dev->hw_error' is non-zero, the controller has gone bad and
> cannot be recovered without a reboot (nor should it be, to avoid further
> memory corruption).
> The overall issue described here has already been taken care of by the
> patch that I've submitted.
> The patch has been accepted and should be available in 2.6.17-rc1-mm3, as
> specified in Andrew Morton's email.
> > The compound issue in the failed recovery resulted in a NULL pointer
> > dereference in the various list_head calls.  After changing the
> > individual list_add calls to list_move and such, the NULL pointer issue
> > has not shown up in the past 6 weeks of heavy testing.
> I'm not sure how these changes help with the issue. Furthermore, I'm not
> sure what _the NULL pointer issue_ refers to. If you see the issue with
> the driver available in 2.6.17-rc1-mm3, please let me know.
> The following link leads to further details of the patch.
> http://www.kernel.org/git/?p=linux/kernel/git/jejb/scsi-rc-fixes-2.6.git;a=commit;h=c005fb4fb2d23ba29ad21dee5042b2f8451ca8ba
> 
> Thank you,
> 
> Seokmann
> 
> > -----Original Message-----
> > From: Andre Hedrick [mailto:andre@linux-ide.org] 
> > Sent: Tuesday, May 16, 2006 1:44 PM
> > To: linux-scsi@vger.kernel.org; Ju, Seokmann; Andrew Morton
> > Cc: James Bottomley; Christoph Hellwig; Mukker, Atul
> > Subject: [RFC] Megaraid update, submission
> > 
> > 
> > Linux-scsi, et al.
> > 
> > The following patch addresses two major issues found under extensive
> > testing.
> > 
> > While pounding data I/O down the card and performing large-scale
> > queries to the controller about device state and function parameters,
> > the following were discovered.
> > 
> > Random (hard to reproduce without noise injection into the SATA
> > connector or cable) hardware error states locked the card and, in the
> > majority of cases, caused the array to be lost.  If the array was not
> > lost, then a drive was failed but could not be removed or replaced with
> > a new drive.  Adding a pci_master_abort test-and-clear function proved
> > to allow recovery in all cases where the card shut down communication
> > with the host.  This may not address all cases; however, this is clearly
> > a missing part of the driver base when entry to eh_scsi_* begins.
> > 
> > The compound issue in the failed recovery resulted in a NULL pointer
> > dereference in the various list_head calls.  After changing the
> > individual list_add calls to list_move and such, the NULL pointer issue
> > has not shown up in the past 6 weeks of heavy testing.
> > 
> > In all cases in the past, the baseline for error was 6:1, meaning
> > either one system in six failed and/or one in six test/stress runs
> > failed.  With the attached changes, there have been zero failures in the
> > past three weeks.  This sounds great, but I wish it would fail, to allow
> > some statistics on improved error handling.
> > 
> > Please note the changes to SAS are minor and not tested, but seem
> > correct for the entire directory code base.  SAS shares the CMM core
> > with MBOX, thus the rationale for the changes to SAS.
> > 
> > Please comment and provide suggestions.
> > 
> > Cheers,
> > 
> > Andre Hedrick
> > LAD Storage Consulting Group
> > 





* RE: [RFC] Megaraid update, submission
@ 2006-05-16 21:08 Ju, Seokmann
  0 siblings, 0 replies; 17+ messages in thread
From: Ju, Seokmann @ 2006-05-16 21:08 UTC (permalink / raw)
  To: Andre Hedrick
  Cc: linux-scsi, Andrew Morton, James Bottomley, Christoph Hellwig,
	Mukker, Atul

Hi Andre,
On Tuesday, May 16, 2006 4:47 PM, Andre Hedrick wrote:
> Let's move on to the list management issues, where timeouts on ioctl
> calls have produced NULL pointers when one performs an add vs. a move to
> transfer ownership of a given scb between pools.
> 
> Fixing the list management may mean the pci_master_abort is not needed.
If this issue still exists on the 2.6.17-rc1 kernel, I will definitely work on it.
By my best estimate, the _NULL pointer_ issue should not be there with the patch.
Please let me know if you still see the issue.
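
For readers unfamiliar with the add-versus-move distinction quoted above,
here is a minimal userspace sketch. It assumes a simplified stand-in for the
kernel's struct list_head; the helper bodies, struct scb layout, and pool
names are hypothetical and are not the driver's code. It only illustrates
why adding a node that is still linked elsewhere corrupts the old list,
while moving it does not. Whether this is what actually triggered the oops
is not established in this thread.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (simplified). */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link node in at the tail of head. Does NOT unlink node first. */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Unlink node from the list it is on, then add it at the tail of head. */
static void list_move_tail(struct list_head *node, struct list_head *head)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    list_add_tail(node, head);
}

/*
 * Walk at most max_nodes entries, checking that each forward link has a
 * matching back link and that the walk returns to head.
 * Returns 1 if the list is intact, 0 if it is corrupted.
 */
static int list_is_intact(const struct list_head *head, int max_nodes)
{
    const struct list_head *pos = head;
    int i;

    for (i = 0; i <= max_nodes; i++) {
        if (pos->next == NULL || pos->next->prev != pos)
            return 0;
        pos = pos->next;
        if (pos == head)
            return 1;
    }
    return 0;
}

/* Hypothetical command block with a single linkage shared by all pools. */
struct scb {
    int id;
    struct list_head list;
};

/* Transfer ownership with a move: both pools stay intact (returns 1). */
static int transfer_with_move(void)
{
    struct list_head pending, free_pool;
    struct scb cmd = { .id = 1 };

    list_init(&pending);
    list_init(&free_pool);
    list_add_tail(&cmd.list, &pending);
    list_move_tail(&cmd.list, &free_pool);
    return list_is_intact(&pending, 4) && list_is_intact(&free_pool, 4);
}

/* Transfer with a bare add: pending still points at cmd (returns 0). */
static int transfer_with_add(void)
{
    struct list_head pending, free_pool;
    struct scb cmd = { .id = 1 };

    list_init(&pending);
    list_init(&free_pool);
    list_add_tail(&cmd.list, &pending);
    list_add_tail(&cmd.list, &free_pool);  /* still linked on pending */
    return list_is_intact(&pending, 4) && list_is_intact(&free_pool, 4);
}
```

In this sketch the bare add leaves the pending pool's head pointing at a
node whose links now belong to the free pool, so a later walk of the
pending pool follows stale pointers.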

Thank you very much for your contribution to the driver's stability.

Regards, 



end of thread, other threads:[~2006-05-16 21:09 UTC | newest]

Thread overview: 17+ messages
2006-05-04 18:48 megaraid_mbox: garbage in file Vasily Averin
2006-05-04 22:59 ` James Bottomley
2006-05-05  5:37   ` Vasily Averin
2006-05-05  9:21     ` Vasily Averin
2006-05-16 17:44       ` [RFC] Megaraid update, submission Andre Hedrick
2006-05-16 18:07         ` Jeff Garzik
2006-05-16 18:13           ` Andre Hedrick
2006-05-16 19:44             ` Matthew Wilcox
2006-05-16 20:24               ` Andre Hedrick
2006-05-05 15:59     ` megaraid_mbox: garbage in file James Bottomley
2006-05-05 18:17       ` Vasily Averin
2006-05-05 20:05         ` James Bottomley
2006-05-05 23:43           ` Vasily Averin
2006-05-05 23:43             ` Vasily Averin
  -- strict thread matches above, loose matches on Subject: below --
2006-05-16 19:03 [RFC] Megaraid update, submission Ju, Seokmann
2006-05-16 20:47 ` Andre Hedrick
2006-05-16 21:08 Ju, Seokmann
