Linux ATA/IDE development
From: Mark Cooke <mpc@jts.homeip.net>
To: linux-ide@vger.kernel.org
Subject: 2.4.26rc1 / HPT 374 / RAID = data read corruption with disks on primary channels.
Date: Tue, 30 Mar 2004 21:04:18 +0100
Message-ID: <1080677057.11947.83.camel@sage.kitchen>

Hi all,

I've been having some trouble with an Abit IT7 machine.  It is a
Pentium 4 machine with 1GB RAM (passes days of all-test memtest86) and
an 80GB Seagate on the ICH4 as the system disk.  There are also 4 x
160GB Seagates, one on each channel of the HPT374.

hdc: ICH4            80GB ST380021A  FwRev=3.10
hde: HPT Disk 0     160GB ST3160023A FwRev=3.04
hdg: HPT Disk 1     160GB ST3160023A FwRev=3.06
hdi: HPT Disk 2     160GB ST3160023A FwRev=3.06
hdk: HPT Disk 3     160GB ST3160023A FwRev=3.06

The 160G disks are all split into 4 partitions, and a set of 4-disk
RAID-5 partitions created using one partition from each disk.  All are
running ext3.

Checksumming a large (>RAM sized) file on any of the raid-5 devices
gives a different checksum every time, i.e. 'while true; do md5sum
big_file ; done' produces a list of different checksums.  The same file
on the ICH4 disk checksums consistently, as expected.
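
The repeated-checksum test can also be scripted so that it reports how many distinct digests were seen. This is only a minimal sketch: the function names and the default pass count are illustrative, and the file still has to be larger than RAM, otherwise later passes may be served from the page cache rather than the disks.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distinct_digests(path, passes=5):
    """Hash `path` `passes` times and return the set of digests seen.

    A healthy read path yields exactly one digest; more than one means
    at least one pass returned different data.
    """
    return {md5_of_file(path) for _ in range(passes)}


if __name__ == "__main__" and len(sys.argv) > 1:
    seen = distinct_digests(sys.argv[1])
    print(len(seen), "distinct digest(s):", *sorted(seen))
```

Run as 'python script.py big_file' (script name hypothetical); anything other than one distinct digest reproduces the symptom described above.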

Characterising the errors shows random blocks of 4-byte corruption.  I
ran a second test copying file 1 to file 2, and doing a 'cmp -l 1 2',
and on a 1GB file it gave 20 bytes of errors, in 5 groups of 4
contiguous bytes.  The errors do not have an obvious pattern of single
bit errors, nor am I seeing any messages in the system logs relating to
the file copying.  The number and location of the errors vary without
any obvious pattern.

All the disks passed their individual (long offline) SMART self-tests
and a full read with 'dd if=/dev/hdX of=/dev/null'.  All are connected
at udma5 with 80-wire cables.  Drive temperatures all stay under 40C
after an extended period of intensive file checksumming / copying.

See below for details of the further work I did to track this down.  At
this point I believe there is some issue with the primary channel on
each of the two HighPoint controllers that does not affect the two
drives on the secondary channels: arrays built from disks 1+3 work
without errors, whereas any use of disk 0 or disk 2 produces random
read errors.

Questions:

Any known issues with what I'm trying to do?

Any workarounds / suggestions for isolating the problem?

Any recommendations for PCI IDE cards that work correctly?


Thanks for any input!

Mark


Further investigation summary:

After finding the above I moved the data off one of the raid-5
partitions and did some experiments with different disks/raid levels:

1. 4 disk raid-0 stripe.  This gave the 4-byte corruption errors.

2. 2 disk raid-0 stripe, using disk0+2.  This gave the same 4-byte
corruption errors.

3. 2 disk raid-1 mirror, using disk1+3, but with disk-3 failed out of
the array to reduce i/o to a minimal level.  This works without errors.

4. 2 disk raid-1 mirror, using disk1+3.  This works without errors.

5. 2 disk raid-1 mirror, using disk0+2.  This produces errors again.

6. 2 disk raid-1 mirror, using disk0+2, with disk-2 failed out of the
array.  More errors.

7. 2 disk raid-1 mirror, using disk0+2, with disk-0 failed out of the
array.  More errors.
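
The inference from the seven experiments can be written out as a small consistency check. This is a sketch only: the table is transcribed from the list above, the disk numbers are the "HPT Disk" indices, and for the degraded mirrors (3, 6, 7) it assumes that only the disk left in the array is being read.

```python
# (disks actively in use, errors seen) for experiments 1-7 above.
EXPERIMENTS = [
    ({0, 1, 2, 3}, True),   # 1: 4-disk raid-0
    ({0, 2}, True),         # 2: raid-0 on disk0+2
    ({1}, False),           # 3: raid-1 on disk1+3, disk 3 failed out
    ({1, 3}, False),        # 4: raid-1 on disk1+3
    ({0, 2}, True),         # 5: raid-1 on disk0+2
    ({0}, True),            # 6: raid-1 on disk0+2, disk 2 failed out
    ({2}, True),            # 7: raid-1 on disk0+2, disk 0 failed out
]


def suspects(experiments):
    """Disks that never took part in an error-free run."""
    used = set().union(*(disks for disks, _ in experiments))
    clean = set().union(*(disks for disks, bad in experiments if not bad))
    return used - clean


def consistent(experiments, suspect_disks):
    """True if errors occurred exactly when a suspect disk was in use."""
    return all(bool(disks & suspect_disks) == bad for disks, bad in experiments)


if __name__ == "__main__":
    bad_disks = suspects(EXPERIMENTS)
    print(sorted(bad_disks), consistent(EXPERIMENTS, bad_disks))
    # prints: [0, 2] True
```

Disks 0 and 2 are the only ones never seen in a clean run, and every failing configuration includes at least one of them, which matches the primary-channel hypothesis without proving it.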


Example cmp -l output from a 1GB file:

294961149   6 204
294961150   0 317
294961151 223 311
294961152 377  40
434229245 173   3
434229246  16   4
434229247 342  23
434229248 141 210
497602557  65 377
497602558  71 377
497602559 220 377
497602560  42 377
625459197  35  36
625459198 263 151
625459199 252 244
625459200 322 102
634101757 377  17
634101758 377 232
634101759 377 302
634101760 377 234
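
To look for structure in output like this, the differing bytes can be grouped into contiguous runs and each run's position checked against a block size. The sketch below relies on 'cmp -l' printing a 1-based decimal byte number followed by the two byte values in octal; the 4096-byte block size is an assumption (a common page / filesystem block size), not something stated in this report, and the function names are illustrative.

```python
def parse_cmp_l(text):
    """Yield (offset, byte_a, byte_b) for each line of `cmp -l` output.

    Offsets are 1-based and decimal; byte values are printed in octal.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            yield int(fields[0]), int(fields[1], 8), int(fields[2], 8)


def group_runs(offsets):
    """Collapse sorted offsets into (start, length) runs of adjacent bytes."""
    runs = []
    for off in offsets:
        if runs and off == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1
        else:
            runs.append([off, 1])
    return [tuple(run) for run in runs]


def describe(text, block=4096):
    """Return (0-based start, length, start % block) for each run."""
    rows = []
    for start, length in group_runs(off for off, _, _ in parse_cmp_l(text)):
        zero_based = start - 1
        rows.append((zero_based, length, zero_based % block))
    return rows


# First two runs from the listing above.
SAMPLE = """\
294961149   6 204
294961150   0 317
294961151 223 311
294961152 377  40
434229245 173   3
434229246  16   4
434229247 342  23
434229248 141 210
"""

if __name__ == "__main__":
    for row in describe(SAMPLE):
        print(row)
    # prints: (294961148, 4, 4092)
    #         (434229244, 4, 4092)
```

Applied to all five runs listed above, every run is 4 bytes long and starts 4092 bytes into a 4096-byte block of the file, i.e. it covers the last four bytes of a 4 KiB-aligned block.  Five samples from one file are too few to call that a rule, and file offsets need not line up with on-disk or stripe boundaries.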

lspci:

00:00.0 Host bridge: Intel Corp. 82845 845 (Brookdale) Chipset Host Bridge (rev 11)
00:01.0 PCI bridge: Intel Corp. 82845 845 (Brookdale) Chipset AGP Bridge (rev 11)
00:1d.0 USB Controller: Intel Corp. 82801DB USB (Hub #1) (rev 01)
00:1d.1 USB Controller: Intel Corp. 82801DB USB (Hub #2) (rev 01)
00:1d.2 USB Controller: Intel Corp. 82801DB USB (Hub #3) (rev 01)
00:1d.7 USB Controller: Intel Corp. 82801DB USB EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corp. 82801BA/CA/DB PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corp. 82801DB ISA Bridge (LPC) (rev 01)
00:1f.1 IDE interface: Intel Corp. 82801DB ICH4 IDE (rev 01)
00:1f.3 SMBus: Intel Corp. 82801DB SMBus (rev 01)
00:1f.5 Multimedia audio controller: Intel Corp. 82801DB AC'97 Audio (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV11DDR [GeForce2 MX 100 DDR/200 DDR] (rev b2)
02:00.0 Multimedia video controller: Brooktree Corporation Bt848 Video Capture (rev 12)
02:01.0 SCSI storage controller: Tekram Technology Co.,Ltd. TRM-S1040 (rev 01)
02:02.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
02:03.0 Network controller: Harris Semiconductor Prism 2.5 Wavelan chipset (rev 01)
02:04.0 RAID bus controller: Triones Technologies, Inc. HPT374 (rev 07)
02:04.1 RAID bus controller: Triones Technologies, Inc. HPT374 (rev 07)
02:05.0 USB Controller: VIA Technologies, Inc. USB (rev 50)
02:05.1 USB Controller: VIA Technologies, Inc. USB (rev 50)
02:05.2 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 51)
02:06.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
02:07.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)

/proc/interrupts:

           CPU0
  0:     299882    IO-APIC-edge  timer
  1:          5    IO-APIC-edge  keyboard
  2:          0          XT-PIC  cascade
  7:          0          XT-PIC  parport0
  8:          1    IO-APIC-edge  rtc
14:      71251    IO-APIC-edge  ide0
15:     505029    IO-APIC-edge  ide1
16:     222120   IO-APIC-level  usb-uhci, bttv0, nvidia
17:         21   IO-APIC-level  DC395x_TRM, Intel 82801DB-ICH4, ohci1394
18:      76625   IO-APIC-level  usb-uhci, usb-uhci, eth0
19:      24273   IO-APIC-level  usb-uhci, usb-uhci, wifi0
20:    2913200   IO-APIC-level  ide2, ide3, ide4, ide5
21:          0   IO-APIC-level  ehci_hcd
22:       4338   IO-APIC-level  eth1
23:          0   IO-APIC-level  ehci_hcd
NMI:          0
LOC:     299838
ERR:          0
MIS:          0

raidtab:

raiddev /dev/md1
        raid-level              5
        nr-raid-disks           4
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              32

        device          /dev/hde1
        raid-disk       0

        device          /dev/hdg1
        raid-disk       1

        device          /dev/hdi1
        raid-disk       2

        device          /dev/hdk1
        raid-disk       3

raiddev /dev/md2
        raid-level              5
        nr-raid-disks           4
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              32

        device          /dev/hde2
        raid-disk       0

        device          /dev/hdg2
        raid-disk       1

        device          /dev/hdi2
        raid-disk       2

        device          /dev/hdk2
        raid-disk       3

raiddev /dev/md3
        raid-level              5
        nr-raid-disks           4
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              32

        device          /dev/hde3
        raid-disk       0

        device          /dev/hdg3
        raid-disk       1

        device          /dev/hdi3
        raid-disk       2

        device          /dev/hdk3
        raid-disk       3

raiddev /dev/md4
        raid-level              0
        nr-raid-disks           4
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              64

        device          /dev/hde4
        raid-disk       0

        device          /dev/hdg4
        raid-disk       1

        device          /dev/hdi4
        raid-disk       2

        device          /dev/hdk4
        raid-disk       3

dmesg extract:

hda: JLMS XJ-HD165H, ATAPI CD/DVD-ROM drive
hdb: CD-RW CDR-6S52, ATAPI CD/DVD-ROM drive
hdc: ST380021A, ATA DISK drive
hde: ST3160023A, ATA DISK drive
hdg: ST3160023A, ATA DISK drive
hdi: ST3160023A, ATA DISK drive
hdk: ST3160023A, ATA DISK drive
hdc: attached ide-disk driver.
hdc: host protected area => 1
hdc: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=155061/16/63, UDMA(100)
hde: attached ide-disk driver.
hde: host protected area => 1
hde: 312581808 sectors (160042 MB) w/8192KiB Cache, CHS=19457/255/63, UDMA(100)
hdg: attached ide-disk driver.
hdg: host protected area => 1
hdg: 312581808 sectors (160042 MB) w/8192KiB Cache, CHS=19457/255/63, UDMA(100)
hdi: attached ide-disk driver.
hdi: host protected area => 1
hdi: 312581808 sectors (160042 MB) w/8192KiB Cache, CHS=19457/255/63, UDMA(100)
hdk: attached ide-disk driver.
hdk: host protected area => 1
hdk: 312581808 sectors (160042 MB) w/8192KiB Cache, CHS=19457/255/63, UDMA(100)
hdc: hdc1 hdc2 hdc3 hdc4
hde: hde1 hde2 hde3 hde4
hdg: hdg1 hdg2 hdg3 hdg4
hdi: hdi1 hdi2 hdi3 hdi4
hdk: hdk1 hdk2 hdk3 hdk4

/proc/ide/hpt366
 
                             HighPoint HPT366/368/370/372/374
 
Controller: 0
Chipset: HPT374
--------------- Primary Channel --------------- Secondary Channel --------------
Enabled:        yes                             yes
Cable:          ATA-66                          ATA-66
 
--------------- drive0 --------- drive1 ------- drive0 ---------- drive1 -------
DMA capable:    yes              no             yes               no
Mode:           UDMA             off            UDMA              off
 
Controller: 1
Chipset: HPT374
--------------- Primary Channel --------------- Secondary Channel --------------
Enabled:        yes                             yes
Cable:          ATA-66                          ATA-66
 
--------------- drive0 --------- drive1 ------- drive0 ---------- drive1 -------
DMA capable:    yes              no             yes               no
Mode:           UDMA             off            UDMA              off
 

/proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.00GHz
stepping        : 4
cpu MHz         : 2009.991
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 4010.80

Kernel config and drive SMART logs attached.

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 11127 bytes --]

[-- Attachment #3: smart.gz --]
[-- Type: application/x-gzip, Size: 2014 bytes --]
