raid5 - which disk failed ?

linux-raid.vger.kernel.org archive mirror

* raid5 - which disk failed ?
@ 2007-09-23 23:17 Rainer Fuegenstein
  2007-09-24  0:11 ` Richard Scobie
  2007-09-24  2:44 ` Neil Brown
  0 siblings, 2 replies; 4+ messages in thread
From: Rainer Fuegenstein @ 2007-09-23 23:17 UTC (permalink / raw)
  To: linux-raid maillist

[-- Attachment #1: Type: text/plain, Size: 2818 bytes --]


Hi,

I'm using a raid 5 with 4*400 GB PATA disks on a rather old VIA
mainboard, running centos 5.0. a few days ago the server started to
reboot or freeze occasionally, after reboot md always starts a resync
of the raid:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
      1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  resync =  0.9% (3819132/390708736) finish=366.2min speed=17603K/sec

unused devices: <none>

after about an hour, the server freezes again. I figured out that
about this time the following errors are reported in the messages log:

Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
Sep 23 22:23:09 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
Sep 23 22:23:09 alfred kernel: end_request: I/O error, dev hde, sector 254106015
Sep 23 22:23:14 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:14 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106023, high=15, low=2447783, sector=254106023
Sep 23 22:23:14 alfred kernel: end_request: I/O error, dev hde, sector 254106023
Sep 23 22:23:18 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:18 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106031, high=15, low=2447791, sector=254106031
Sep 23 22:23:18 alfred kernel: end_request: I/O error, dev hde, sector 254106031
Sep 23 22:23:23 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:23 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106039, high=15, low=2447799, sector=254106039
Sep 23 22:23:23 alfred kernel: end_request: I/O error, dev hde, sector 254106039
Sep 23 22:23:43 alfred kernel: hde: dma_timer_expiry: dma status == 0x21
Sep 23 22:23:53 alfred kernel: hde: DMA timeout error
Sep 23 22:23:53 alfred kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 23 22:28:40 alfred kernel:     ide2: BM-DMA at 0x7800-0x7807, BIOS settings: hde:DMA, hdf:pio

now there are two things that puzzle me:

1) when md starts a resync of the array, shouldn't one drive be marked
as down [_UUU] in mdstat instead of reporting it as [UUUU] ? or, the
other way round: is hde really the faulty drive ? how can I make sure
I'm removing and replacing the proper drive ?

2) can a faulty drive in a raid5 really crash the whole server ? maybe
it's because of the bug in the onboard promise controller that adds to
this problem (see attachment for dmesg output).
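On question 1, a minimal Python sketch of reading the md0 status lines quoted above. The function name parse_md_status is made up for illustration, and the sketch assumes that position i of the [UUUU] field corresponds to raid slot i (the number in brackets after each member). It only reads text and does not query md.

```python
import re

# The two md0 lines quoted from /proc/mdstat in the message above.
MDSTAT = """\
md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
      1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
"""

def parse_md_status(text):
    """Return (members by slot, expected count, working count, slots shown as '_')."""
    # "hdh1[3]" -> slot 3 is hdh1, and so on for each listed member.
    members = {int(slot): dev for dev, slot in re.findall(r"(\w+)\[(\d+)\]", text)}
    # "[4/4] [UUUU]" -> 4 expected, 4 working, one flag per slot.
    match = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", text)
    if match is None:
        raise ValueError("no [n/m] [UU..] status field found")
    expected, working, flags = int(match.group(1)), int(match.group(2)), match.group(3)
    down_slots = [i for i, flag in enumerate(flags) if flag == "_"]
    return members, expected, working, down_slots

members, expected, working, down_slots = parse_md_status(MDSTAT)
print(members, expected, working, down_slots)
```

With the output above, no slot is shown as down, so md itself still counts all four members as working during the resync. The block count is also consistent with a 4-disk raid5: 3 * 390708736 = 1172126208, i.e. three members' worth of data plus one of parity.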

tia.

[-- Attachment #2: dmesg --]
[-- Type: application/octet-stream, Size: 14474 bytes --]

Linux version 2.6.18-8.1.8.el5xen (mockbuild@builder4.centos.org) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #1 SMP Tue Jul 10 08:51:27 EDT 2007
BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 000000001d7fd000 (usable)
0MB HIGHMEM available.
471MB LOWMEM available.
Using x86 segment limits to approximate NX protection
On node 0 totalpages: 120829
  DMA zone: 120829 pages, LIFO batch:31
DMI 2.3 present.
ACPI: RSDP (v000 ASUS                                  ) @ 0x000f6a90
ACPI: RSDT (v001 ASUS   A7V      0x30303031 MSFT 0x31313031) @ 0x1ffec000
ACPI: FADT (v001 ASUS   A7V      0x30303031 MSFT 0x31313031) @ 0x1ffec080
ACPI: BOOT (v001 ASUS   A7V      0x30303031 MSFT 0x31313031) @ 0x1ffec040
ACPI: DSDT (v001   ASUS A7V      0x00001000 MSFT 0x0100000b) @ 0x00000000
Built 1 zonelists.  Total pages: 120829
Kernel command line: ro root=LABEL=/
Enabling fast FPU save and restore... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c071b000 soft=c06fb000
PID hash table entries: 2048 (order: 11, 8192 bytes)
Xen reported: 807.212 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Software IO TLB enabled: 
 Aperture:     2 megabytes
 Kernel range: 0x00000000c0166000 - 0x00000000c0366000
vmalloc area: de000000-f4ffe000, maxmem 2d7fe000
Memory: 459904k/483316k available (2017k kernel code, 14656k reserved, 824k data, 172k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 2021.38 BogoMIPS (lpj=4042770)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0183d1f1 c1c7f9ff 00000000 00000000 00000000 00000000 00000000
CPU: After vendor identify, caps: 0183d1f1 c1c7f9ff 00000000 00000000 00000000 00000000 00000000
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 64K (64 bytes/line)
CPU: After all inits, caps: 0183d1f1 c1c7f9ff 00000000 00000420 00000000 00000000 00000000
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
Freeing SMP alternatives: 16k freed
ACPI: Core revision 20060707
ACPI: setting ELCR to 0200 (from 1c00)
Brought up 1 CPUs
sizeof(vma)=88 bytes
sizeof(page)=32 bytes
sizeof(inode)=340 bytes
sizeof(dentry)=136 bytes
sizeof(ext3inode)=492 bytes
sizeof(buffer_head)=52 bytes
sizeof(skbuff)=172 bytes
checking if image is initramfs... it is
Freeing initrd memory: 3049k freed
Grant table initialized
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
Setting up standard PCI resources
Allocating PCI resources starting at 30000000 (gap: 20000000:dfff0000)
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 11 *12 14 15)
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
ACPI: Assume root bridge [\_SB_.PCI0] bus is 0
PCI quirk: region e400-e4ff claimed by vt82c586 ACPI
PCI quirk: region e200-e27f claimed by vt82c686 HW-mon
PCI quirk: region e800-e80f claimed by vt82c686 SMB
Boot video device is 0000:01:00.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 13 devices
xen_mem: Initialising balloon driver.
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
pnp: 00:03: ioport range 0xe400-0xe47f could not be reserved
pnp: 00:03: ioport range 0xe800-0xe80f has been reserved
PCI: Bridge: 0000:00:01.0
  IO window: d000-dfff
  MEM window: df000000-dfdfffff
  PREFETCH window: dff00000-e5ffffff
PCI: Setting latency timer of device 0000:00:01.0 to 64
NET: Registered protocol family 2
IP route cache hash table entries: 4096 (order: 2, 16384 bytes)
TCP established hash table entries: 16384 (order: 5, 131072 bytes)
TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
TCP: Hash tables configured (established 16384 bind 8192)
TCP reno registered
Simple Boot Flag at 0x3a set to 0x1
IA-32 Microcode Update Driver: v1.14-xen <tigran@veritas.com>
audit: initializing netlink socket (disabled)
audit(1190581571.184:1): initialized
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
- Added public key 3EFBCAAC52BC4CBF
- User ID: CentOS (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
PCI: Disabling Via external APIC routing
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: CPU0 (power states: C1[C1] C2[C2])
ACPI: Processor [CPU0] (supports 16 throttling states)
Real Time Clock Driver v1.12ac
Non-volatile memory driver v1.2
Linux agpgart interface v0.101 (c) Dave Jones
agpgart: Detected VIA Twister-K/KT133x/KM133 chipset
agpgart: AGP aperture is 32M @ 0xe6000000
RAMDISK driver initialized: 16 RAM disks of 16384K size 4096 blocksize
Xen virtual console successfully installed as ttyS0
Event-channel device installed.
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 0000:00:04.1
VP_IDE: chipset revision 16
VP_IDE: not 100% native mode: will probe irqs later
VP_IDE: VIA vt82c686a (rev 22) IDE UDMA66 controller on pci0000:00:04.1
    ide0: BM-DMA at 0xb800-0xb807, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0xb808-0xb80f, BIOS settings: hdc:pio, hdd:pio
Probing IDE interface ide0...
hda: ST380011A, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
PDC20265: IDE controller at PCI slot 0000:00:11.0
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
PCI: setting IRQ 10 as level-triggered
ACPI: PCI Interrupt 0000:00:11.0[A] -> Link [LNKB] -> GSI 10 (level, low) -> IRQ 10
PDC20265: chipset revision 2
PDC20265: ROM enabled at 0x30020000
PDC20265: 100% native mode on irq 10
PDC20265: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
    ide2: BM-DMA at 0x7800-0x7807, BIOS settings: hde:pio, hdf:pio
    ide3: BM-DMA at 0x7808-0x780f, BIOS settings: hdg:pio, hdh:DMA
Probing IDE interface ide2...
hde: ST3400832A, ATA DISK drive
hdf: ST3400620A, ATA DISK drive
ide2 at 0x9000-0x9007,0x8802 on irq 10
Probing IDE interface ide3...
hdg: ST3400620A, ATA DISK drive
hdh: ST3400620A, ATA DISK drive
ide3 at 0x8400-0x8407,0x8002 on irq 10
Probing IDE interface ide1...
hda: max request size: 512KiB
hda: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=16383/255/63, UDMA(33)
hda: cache flushes supported
 hda: hda1 hda2 hda3
hde: max request size: 128KiB
hde: 781422768 sectors (400088 MB) w/8192KiB Cache, CHS=48641/255/63, UDMA(100)
hde: cache flushes supported
 hde: hde1
hdf: max request size: 128KiB
hdf: 781422768 sectors (400088 MB) w/16384KiB Cache, CHS=48641/255/63, UDMA(100)
hdf: cache flushes supported
 hdf: hdf1
hdg: max request size: 128KiB
hdg: 781422768 sectors (400088 MB) w/16384KiB Cache, CHS=48641/255/63, UDMA(100)
hdg: cache flushes supported
 hdg: hdg1
hdh: max request size: 128KiB
hdh: 781422768 sectors (400088 MB) w/16384KiB Cache, CHS=48641/255/63, UDMA(100)
hdh: cache flushes supported
 hdh: hdh1
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1
PNP: PS/2 controller doesn't have AUX irq; using default 12
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
Using IPI No-Shortcut mode
Freeing unused kernel memory: 172k freed
Write protecting the kernel read-only data: 355k
input: AT Translated Set 2 keyboard as /class/input/input0
USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 12
PCI: setting IRQ 12 as level-triggered
ACPI: PCI Interrupt 0000:00:04.2[D] -> Link [LNKD] -> GSI 12 (level, low) -> IRQ 12
uhci_hcd 0000:00:04.2: UHCI Host Controller
uhci_hcd 0000:00:04.2: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:04.2: irq 12, io base 0x0000b400
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:04.3[D] -> Link [LNKD] -> GSI 12 (level, low) -> IRQ 12
uhci_hcd 0000:00:04.3: UHCI Host Controller
uhci_hcd 0000:00:04.3: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:04.3: irq 12, io base 0x0000b000
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
usb 2-2: new full speed USB device using uhci_hcd and address 2
usb 2-2: configuration #1 chosen from 1 choice
hub 2-2:1.0: USB hub found
hub 2-2:1.0: 4 ports detected
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux:  Disabled at runtime.
SELinux:  Unregistering netfilter hooks
audit(1190581579.620:2): selinux=0 auid=4294967295
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
8250_pnp: Unknown symbol serial8250_unregister_port
8250_pnp: Unknown symbol serial8250_register_port
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
8250_pnp: Unknown symbol serial8250_unregister_port
8250_pnp: Unknown symbol serial8250_register_port
parport_pc: VIA 686A/8231 detected
parport_pc: probing current configuration
parport_pc: Current parallel port base: 0x378
parport0: PC-style at 0x378 (0x778), irq 7 [PCSPP,TRISTATE]
parport_pc: VIA parallel port: io=0x378, irq=7
input: PC Speaker as /class/input/input1
ACPI: PCI Interrupt 0000:00:0b.0[A] -> Link [LNKB] -> GSI 10 (level, low) -> IRQ 10
3c59x: Donald Becker and others. www.scyld.com/network/vortex.html
0000:00:0b.0: 3Com PCI 3c905C Tornado at de002000.
i2c_adapter i2c-9191: sensors disabled - enable with force_addr=0xe200
floppy0: no floppy controllers found
lp0: using parport0 (interrupt-driven).
lp0: console ready
ACPI: Power Button (FF) [PWRF]
ACPI: Power Button (CM) [PWRB]
ibm_acpi: ec object not found
md: Autodetecting RAID arrays.
md: autorun ...
md: considering hdh1 ...
md:  adding hdh1 ...
md:  adding hdg1 ...
md:  adding hdf1 ...
md:  adding hde1 ...
md: created md0
md: bind<hde1>
md: bind<hdf1>
md: bind<hdg1>
md: bind<hdh1>
md: running: <hdh1><hdg1><hdf1><hde1>
md: md0: raid array is not clean -- starting background reconstruction
raid5: measuring checksumming speed
   8regs     :  2713.000 MB/sec
   8regs_prefetch:  2584.000 MB/sec
   32regs    :  1972.000 MB/sec
   32regs_prefetch:  2195.000 MB/sec
   pII_mmx   :  3821.000 MB/sec
   p5_mmx    :  4586.000 MB/sec
raid5: using function: p5_mmx (4586.000 MB/sec)
raid6: int32x1    312 MB/s
raid6: int32x2    351 MB/s
raid6: int32x4    305 MB/s
raid6: int32x8    298 MB/s
raid6: mmxx1      708 MB/s
raid6: mmxx2     1077 MB/s
raid6: sse1x1     647 MB/s
raid6: sse1x2     960 MB/s
raid6: using algorithm sse1x2 (960 MB/s)
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
raid5: device hdh1 operational as raid disk 3
raid5: device hdg1 operational as raid disk 2
raid5: device hdf1 operational as raid disk 1
raid5: device hde1 operational as raid disk 0
raid5: allocated 4204kB for md0
raid5: raid level 5 set md0 active with 4 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:4 fd:0
 disk 0, o:1, dev:hde1
 disk 1, o:1, dev:hdf1
 disk 2, o:1, dev:hdg1
 disk 3, o:1, dev:hdh1
md: ... autorun DONE.
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 390708736 blocks.
md: resuming recovery of md0 from checkpoint.
device-mapper: ioctl: 4.11.0-ioctl (2006-09-14) initialised: dm-devel@redhat.com
EXT3 FS on hda1, internal journal
SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
SGI XFS Quota Management subsystem
Filesystem "md0": Disabling barriers, not supported by the underlying device
XFS mounting filesystem md0
Ending clean XFS mount for filesystem: md0
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 1052248k swap on /dev/hda2.  Priority:-1 extents:1 across:1052248k
eth0:  setting full-duplex.
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
eth0: no IPv6 routers present
Bridge firewalling registered
device vif0.0 entered promiscuous mode
audit(1190574456.580:3): dev=vif0.0 prom=256 old_prom=0 auid=4294967295
xenbr0: port 1(vif0.0) entering learning state
xenbr0: topology change detected, propagating
xenbr0: port 1(vif0.0) entering forwarding state
peth0:  setting full-duplex.
peth0: Setting promiscuous mode.
device peth0 entered promiscuous mode
audit(1190574456.732:4): dev=peth0 prom=256 old_prom=0 auid=4294967295
xenbr0: port 2(peth0) entering learning state
xenbr0: topology change detected, propagating
xenbr0: port 2(peth0) entering forwarding state
hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
hda: drive_cmd: error=0x04 { DriveStatusError }
ide: failed opcode was: 0xb0
eth0: no IPv6 routers present


* Re: raid5 - which disk failed ?
  2007-09-23 23:17 raid5 - which disk failed ? Rainer Fuegenstein
@ 2007-09-24  0:11 ` Richard Scobie
  2007-09-24  2:44 ` Neil Brown
  1 sibling, 0 replies; 4+ messages in thread
From: Richard Scobie @ 2007-09-24  0:11 UTC (permalink / raw)
  To: Linux RAID Mailing List

Rainer Fuegenstein wrote:

> 
> 1) when md starts a resync of the array, shouldn't one drive be marked
> as down [_UUU] in mdstat instead of reporting it as [UUUU] ? or, the
> other way round: is hde really the faulty drive ? how can I make sure
> I'm removing and replacing the proper drive ?

If it is not already, install smartmontools.

It certainly looks like hde is failing, so a smartctl -a /dev/hde should 
give you some idea. You will find it also gives you the serial number of 
the drive, which will be attached to a label on the drive, allowing you 
to locate it.

Regards,

Richard


* Re: raid5 - which disk failed ?
  2007-09-23 23:17 raid5 - which disk failed ? Rainer Fuegenstein
  2007-09-24  0:11 ` Richard Scobie
@ 2007-09-24  2:44 ` Neil Brown
  2007-09-24 23:05   ` Re[2]: " Rainer Fuegenstein
  1 sibling, 1 reply; 4+ messages in thread
From: Neil Brown @ 2007-09-24  2:44 UTC (permalink / raw)
  To: Rainer Fuegenstein; +Cc: linux-raid maillist

On Monday September 24, rfu@kaneda.iguw.tuwien.ac.at wrote:
> 
> Hi,
> 
> I'm using a raid 5 with 4*400 GB PATA disks on a rather old VIA
> mainboard, running centos 5.0. a few days ago the server started to
> reboot or freeze occasionally, after reboot md always starts a resync
> of the raid:
> $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
>       1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>       [>....................]  resync =  0.9% (3819132/390708736) finish=366.2min speed=17603K/sec

This is normal.  If there was any write activity in the few hundred
milliseconds before a crash, you need to resync because the parity of
the stripe being written could be incorrect.
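To illustrate why an interrupted write calls for a resync, here is a toy Python sketch of XOR parity on one stripe. It is not md's code; the chunk values and the helper name xor_parity are invented for the example, and a real stripe is far larger.

```python
from functools import reduce

def xor_parity(chunks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

# Three data chunks plus one parity chunk, as on a 4-disk raid5 stripe.
data = [b"\x0f\x0f", b"\xf0\x01", b"\x33\x10"]
parity = xor_parity(data)

# A consistent stripe XORs to all zeroes across data and parity.
consistent_before = xor_parity(data + [parity]) == bytes(2)

# Simulated crash: one data chunk reached the disk, the parity update did not.
data[1] = b"\xaa\x55"
consistent_after_crash = xor_parity(data + [parity]) == bytes(2)

# What a resync does, in effect: recompute parity from the data chunks.
parity = xor_parity(data)
consistent_after_resync = xor_parity(data + [parity]) == bytes(2)

print(consistent_before, consistent_after_crash, consistent_after_resync)
```

After a crash md cannot tell which stripes were in flight, so it walks the whole array and recomputes parity, which is the resync seen in mdstat.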

> 
> after about an hour, the server freezes again. I figured out that
> about this time the following errors are reported in the messages log:
> 
> Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
> Sep 23 22:23:09 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
> Sep 23 22:23:09 alfred kernel: end_request: I/O error, dev hde, sector 254106015
> Sep 23 22:23:14 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Sep 23 22:23:14 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106023, high=15, low=2447783, sector=254106023
> Sep 23 22:23:14 alfred kernel: end_request: I/O error, dev hde, sector 254106023
> Sep 23 22:23:18 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Sep 23 22:23:18 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106031, high=15, low=2447791, sector=254106031
> Sep 23 22:23:18 alfred kernel: end_request: I/O error, dev hde, sector 254106031
> Sep 23 22:23:23 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Sep 23 22:23:23 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106039, high=15, low=2447799, sector=254106039
> Sep 23 22:23:23 alfred kernel: end_request: I/O error, dev hde, sector 254106039
> Sep 23 22:23:43 alfred kernel: hde: dma_timer_expiry: dma status == 0x21
> Sep 23 22:23:53 alfred kernel: hde: DMA timeout error
> Sep 23 22:23:53 alfred kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
> Sep 23 22:28:40 alfred kernel:     ide2: BM-DMA at 0x7800-0x7807, BIOS settings: hde:DMA, hdf:pio

Something definitely sick there.

> 
> now there are two things that puzzle me:
> 
> 1) when md starts a resync of the array, shouldn't one drive be marked
> as down [_UUU] in mdstat instead of reporting it as [UUUU] ? or, the
> other way round: is hde really the faulty drive ? how can I make sure
> I'm removing and replacing the proper drive ?

When a drive fails, md records that failure in the metadata on the
other devices in the array.
The fact that the drive is not marked as failed after the reboot
suggests that md failed to update the metadata of the good drives.
Maybe it is the controller that is failing rather than a drive, and it
cannot write to anything at this point.
Or maybe the drive is failing, but that is badly confusing the
controller, with the same result.
Is it always hde that is reporting errors?
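One way to check that from the messages log is to tally the "end_request: I/O error" lines per device. The following is a minimal Python sketch; the sample lines are copied from the log quoted above, while the function name io_errors_by_device and the idea of feeding it the log as one string are assumptions made for the example.

```python
import re
from collections import Counter

# A few lines copied from the messages log quoted above.
LOG = """\
Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
Sep 23 22:23:09 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
Sep 23 22:23:09 alfred kernel: end_request: I/O error, dev hde, sector 254106015
Sep 23 22:23:53 alfred kernel: hde: DMA timeout error
"""

IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")

def io_errors_by_device(log_text):
    """Count I/O error lines per device and collect the sectors reported."""
    counts = Counter()
    sectors = {}
    for dev, sector in IO_ERROR.findall(log_text):
        counts[dev] += 1
        sectors.setdefault(dev, []).append(int(sector))
    return counts, sectors

counts, sectors = io_errors_by_device(LOG)
for dev in sorted(counts):
    print(dev, counts[dev], sectors[dev])
```

If more than one device name shows up over time, that points more towards the controller, cabling or driver than towards a single drive.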

With PATA, it is fairly easy to make sure you have removed the correct
drive, and names don't change.  hde is the 'master' on the 3rd
channel.  Presumably the first channel of your controller card.
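The naming rule behind that can be sketched in a few lines of Python. This assumes the legacy IDE driver's scheme, where hda and hdb are master and slave on ide0, hdc and hdd on ide1, and so on; the function name ide_position is made up for the example.

```python
def ide_position(name):
    """Map a legacy IDE device name such as 'hde' to (interface, role)."""
    if len(name) != 3 or not name.startswith("hd") or not "a" <= name[2] <= "z":
        raise ValueError(f"not a legacy IDE device name: {name!r}")
    index = ord(name[2]) - ord("a")   # hda -> 0, hdb -> 1, ...
    interface = f"ide{index // 2}"    # two devices per channel
    role = "master" if index % 2 == 0 else "slave"
    return interface, role

for dev in ("hde", "hdf", "hdg", "hdh"):
    print(dev, *ide_position(dev))
```

This agrees with the dmesg output in the original message, which lists hde and hdf on ide2 and hdg and hdh on ide3.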

Just disconnect the drive you think it is, reboot, and see if hde is
still there.

> 
> 2) can a faulty drive in a raid5 really crash the whole server ? maybe
> it's because of the bug in the onboard promise controller that adds to
> this problem (see attachment for dmesg output).

No, a faulty drive in a raid5 should not crash the whole server.  But
a bad controller card or buggy driver for the controller could.

NeilBrown


* Re[2]: raid5 - which disk failed ?
  2007-09-24  2:44 ` Neil Brown
@ 2007-09-24 23:05   ` Rainer Fuegenstein
  0 siblings, 0 replies; 4+ messages in thread
From: Rainer Fuegenstein @ 2007-09-24 23:05 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid maillist

NB> Or maybe the drive is failing, but that is badly confusing the
NB> controller, with the same result.
NB> Is it always hde that is reporting errors?

for now - yes; but a few months ago for a short period of time hdg and
hdh also have been reported with errors, but this went away quickly
and never occurred again.

NB> With PATA, it is fairly easy to make sure you have removed the correct
NB> drive, and names don't change.  hde is the 'master' on the 3rd
NB> channel.  Presumably the first channel of your controller card.

I know; what I meant was: I'd like to make sure that I remove the one
drive that md thinks is faulty - I want to avoid removing a healthy
drive, leaving md with one broken drive and two healthy ones which
isn't good for a raid5. but in this case, hde rather certainly is the
troublemaker.

NB> No, a faulty drive in a raid5 should not crash the whole server.  But
NB> a bad controller card or buggy driver for the controller could.

this seems to be the case here. guess it's time to shop for a new
server.

tnx.

------------------------------------------------------------------------------
 Rainer Fuegenstein                              rfu@kaneda.iguw.tuwien.ac.at
------------------------------------------------------------------------------
"Why are you looking into the darkness and not into the fire as we do ?", Nell
asked. "Because the darkness is where danger comes from," Peter said, "and
from the fire comes only illusion".                   (from "The Diamond Age")
------------------------------------------------------------------------------

