Promise SATAII150 TX4: strange disk ordering

linux-ide.vger.kernel.org archive mirror

* Promise SATAII150 TX4: strange disk ordering
@ 2005-08-25 22:48 Eyal Lebedinsky
  2005-08-30  4:44 ` Jeff Garzik
  0 siblings, 1 reply; 10+ messages in thread
From: Eyal Lebedinsky @ 2005-08-25 22:48 UTC
  To: jgarzik; +Cc: linux-ide

I needed a 4-port SATA controller and this one was picked. It seems
to work OK, however I find that Linux (2.6.12.5 and 2.6.13-rc7) sees
the disks in a different order than the labelled sockets (which do
match what the BIOS detection lists at bootup).

It is not even the reverse order:
	TX4 socket	sata_promise ata*
	1		4
	2		2
	3		1
	4		3
This order looks stable - I connected a different number of disks
on some ports and this ordering was maintained.
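
To avoid pulling the wrong disk, the observed mapping can be kept as a small lookup table. This is only a sketch: the numbers come from the table above for this one 0x3d18 card, the names SOCKET_TO_ATA and socket_for_ata are made up for illustration, and the ata numbers will shift if another libata controller is probed first.

```python
# TX4 socket label -> ataN port as enumerated by sata_promise,
# copied from the table above (one specific 0x3d18 card; an assumption
# for any other board).
SOCKET_TO_ATA = {1: 4, 2: 2, 3: 1, 4: 3}

# Inverse view: which labelled socket holds the disk behind ataN.
ATA_TO_SOCKET = {ata: socket for socket, ata in SOCKET_TO_ATA.items()}


def socket_for_ata(ata_port):
    """Return the labelled TX4 socket wired to the given ataN port."""
    return ATA_TO_SOCKET[ata_port]
```

For example, a failure reported on ata4 would point at the socket labelled 1, not 4.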

This is the 0x3d18 card.

I saw some mention of this on the list. Was it resolved as "cannot fix
in the driver", with driver options introduced to order the ports manually?

How can I ensure stable device names (/dev/sd*)?

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat


* Re: Promise SATAII150 TX4: strange disk ordering
  2005-08-25 22:48 Promise SATAII150 TX4: strange disk ordering Eyal Lebedinsky
@ 2005-08-30  4:44 ` Jeff Garzik
  2005-08-30 10:31   ` Eyal Lebedinsky
  2005-09-10  1:29   ` Promise SATAII150 TX4 or raidreconf broken Eyal Lebedinsky
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff Garzik @ 2005-08-30  4:44 UTC
  To: Eyal Lebedinsky; +Cc: linux-ide

Eyal Lebedinsky wrote:
> I needed a 4-port SATA controller and this one was picked. It seems
> to work OK, however I find that Linux (2.6.12.5 and 2.6.13-rc7) sees
> the disks in a different order than the labelled sockets (which do
> match what the BIOS detection lists at bootup).
> 
> It is not even the reverse order:
> 	TX4 socket	sata_promise ata*
> 	1		4
> 	2		2
> 	3		1
> 	4		3
> This order looks stable - I connected a different number of disks
> on some ports and this ordering was maintained.

sata_promise driver just presents the devices in the order that the 
board maker has wired each port to the chip.  What may be labelled "port 
3" on the board might be wired to the chip's port-0.  sata_promise just 
presents what it is given.

	Jeff





* Re: Promise SATAII150 TX4: strange disk ordering
  2005-08-30  4:44 ` Jeff Garzik
@ 2005-08-30 10:31   ` Eyal Lebedinsky
  2005-09-10  1:29   ` Promise SATAII150 TX4 or raidreconf broken Eyal Lebedinsky
  1 sibling, 0 replies; 10+ messages in thread
From: Eyal Lebedinsky @ 2005-08-30 10:31 UTC
  To: Jeff Garzik; +Cc: linux-ide

Jeff Garzik wrote:
> Eyal Lebedinsky wrote:
> 
>> I needed a 4-port SATA controller and this one was picked. It seems
>> to work OK, however I find that Linux (2.6.12.5 and 2.6.13-rc7) sees
>> the disks in a different order than the labelled sockets (which do
>> match what the BIOS detection lists at bootup).
>>
>> It is not even the reverse order:
>>     TX4 socket    sata_promise ata*
>>     1        4
>>     2        2
>>     3        1
>>     4        3
>> This order looks stable - I connected a different number of disks
>> on some ports and this ordering was maintained.
> 
> 
> sata_promise driver just presents the devices in the order that the
> board maker has wired each port to the chip.  What may be labelled "port
> 3" on the board might be wired to the chip's port-0.  sata_promise just
> presents what it is given.
> 
>     Jeff

Seeing how people trust these numbers, the confusion is risky. I may
remove the wrong raid disk when it is reported offline and lose the
lot.

If we know the wiring (I assume this is stable for each board) why
not arrange the logical ports accordingly? Much more user friendly.

Thanks

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat


* Promise SATAII150 TX4 or raidreconf broken
  2005-08-30  4:44 ` Jeff Garzik
  2005-08-30 10:31   ` Eyal Lebedinsky
@ 2005-09-10  1:29   ` Eyal Lebedinsky
  2005-09-10 15:02     ` Thorild Selen
  1 sibling, 1 reply; 10+ messages in thread
From: Eyal Lebedinsky @ 2005-09-10  1:29 UTC
  To: Jeff Garzik; +Cc: linux-ide

Jeff Garzik wrote:
> Eyal Lebedinsky wrote:
> 
>> I needed a 4-port SATA controller and this one was picked. It seems
>> to work OK, however I find that Linux (2.6.12.5 and 2.6.13-rc7) sees
>> the disks in a different order than the labelled sockets (which do
>> match what the BIOS detection lists at bootup).
>>
>> It is not even the reverse order:
>>     TX4 socket    sata_promise ata*
>>     1        4
>>     2        2
>>     3        1
>>     4        3
>> This order looks stable - I connected a different number of disks
>> on some ports and this ordering was maintained.
> 
> sata_promise driver just presents the devices in the order that the
> board maker has wired each port to the chip.  What may be labelled "port
> 3" on the board might be wired to the chip's port-0.  sata_promise just
> presents what it is given.
> 
>     Jeff

I am, for now, ignoring the ordering problem and moving on to using the
array.

I spent the last week attempting to build and test the array and I have
a problem: the array is trashed by raidreconf. I need to know if this
is a hardware problem (TX4?), a raidreconf problem or a kernel issue.

It is now becoming urgent for me to sort this out, any hints will be
appreciated.

If this is a TX4 issue, which SATA controllers (4-way) are known to be
supported and good on Linux?

I have a test script that does this:
	build a 3-disk raid-5
	mkfs.ext3
	copy data in
		200+GB
	fsck
		OK
	raidreconf
		3->4 disks
	fsck
		failed

The disks are 320GB SATA "WDC WD3200JD-00K  Rev: 08.0". Kernel 2.6.13
vanilla.

The test takes about 16h to complete.

The rebuild messages:
====================
Sat Sep 10 01:19:07 EST 2005 mdbuild: checking the file system
====================================
/dev/md0: 4136/610560 files (2.7% non-contiguous), 55952642/156285568 blocks
Sat Sep 10 01:25:51 EST 2005 mdbuild: reconfiguring RAID
====================================
Parsing /etc/raidtab.old
Parsing /etc/raidtab.new
Old raid-disk 0 has 1220981 chunks, 312571136 blocks
Old raid-disk 1 has 1220981 chunks, 312571136 blocks
Old raid-disk 2 has 1220981 chunks, 312571136 blocks
New raid-disk 0 has 1220981 chunks, 312571136 blocks
New raid-disk 1 has 1220981 chunks, 312571136 blocks
New raid-disk 2 has 1220981 chunks, 312571136 blocks
New raid-disk 3 has 1220981 chunks, 312571136 blocks
Using 256 Kbyte blocks to move from 256 Kbyte chunks to 256 Kbyte chunks.
Detected 1035336 KB of physical memory in system
A maximum of 1181 outstanding requests is allowed
Working with device /dev/md0
Size of old array: 1875427344 blocks,  Size of new array: 2500569792 blocks
---------------------------------------------------
I will grow your old device /dev/md0 of 2441962 blocks
to a new device /dev/md0 of 3662943 blocks
using a block-size of 256 KB
Is this what you want? (yes/no): yes
Converting 2441962 block device to 3662943 block device
Allocated free block map for 3 disks
4 unique disks detected.
Working (/) [02441962/02441962] [############################################]
Source drained, flushing sink.
Reconfiguration succeeded, will update superblocks...
Maximum friend-freeing depth:         8
Total wishes hooked:            2441962
Maximum wishes hooked:             1181
Total gifts hooked:             2441962
Maximum gifts hooked:               991
Congratulations, your array has been reconfigured,
and no errors seem to have occured.
Updating superblocks...
handling MD device /dev/md0
analyzing super-block
disk 0: /dev/sda, 312571224kB, raid superblock at 312571136kB
disk 1: /dev/sdb, 312571224kB, raid superblock at 312571136kB
disk 2: /dev/sdc, 312571224kB, raid superblock at 312571136kB
disk 3: /dev/sdd, 312571224kB, raid superblock at 312571136kB
Array is updated with kernel.
Disks re-inserted in array... Hold on while starting the array...
Sat Sep 10 10:30:19 EST 2005 mdbuild: checking the file system
====================================
/dev/md0: Inode 129 is in use, but has dtime set.  FIXED.
/dev/md0: Inode 129 has imagic flag set.

/dev/md0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
        (i.e., without -a or -p options)
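
As a sanity check on the raidreconf output above, the reported sizes are consistent with plain RAID-5 arithmetic: each disk holds 312571136 KB in 256 KB chunks, and an n-disk raid-5 exposes (n - 1) disks' worth of data. A minimal sketch follows; the function name data_chunks is hypothetical, and the 4 KB ext3 block size is an assumption inferred from the fsck block count.

```python
CHUNK_KB = 256            # "256 Kbyte chunks" from the log
DISK_KB = 312571136       # usable KB per raid disk, from the log
FS_BLOCK_KB = 4           # assumed ext3 block size (not stated in the log)


def data_chunks(raid_disks, disk_kb=DISK_KB, chunk_kb=CHUNK_KB):
    """Data chunks in an n-disk raid-5: one disk's worth goes to parity."""
    chunks_per_disk = disk_kb // chunk_kb
    return (raid_disks - 1) * chunks_per_disk


old_chunks = data_chunks(3)   # the "old device" size in 256 KB blocks
new_chunks = data_chunks(4)   # the "new device" size in 256 KB blocks

# File system blocks that fit in the old 3-disk array.
old_fs_blocks = (3 - 1) * DISK_KB // FS_BLOCK_KB
```

With the figures from the log this gives 2441962 and 3662943 chunks, matching the sizes raidreconf printed, and 156285568 file system blocks, matching the first fsck line. So the sizes themselves look right; the corruption is not explained by a size miscalculation.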


Inspecting the fs shows real corruption. It does not look like whole
bad blocks; rather, specific entries are bad. Some directories are completely
missing and I (naturally) get errors reading the fs (mounted with errors).

/data3/mythtv/tv_grab_au:
========================
total 2968465682
drwxr-sr-x      2 mythtv     mythtv          8192 Sep  8 04:56 08092005
drwxr-sr-x      2 mythtv     mythtv          8192 Sep  8 04:56 09092005
drwxr-sr-x      2 mythtv     mythtv          8192 Sep  8 04:59 10092005
drwxr-sr-x      2 mythtv     mythtv          8192 Sep  8 04:57 11092005
drwxr-sr-x      2 mythtv     mythtv          8192 Sep  8 04:58 12092005
?--xrws--T  31794 3359396242 982138100 1048034695 Oct  9  1972 13092005
drwxr-sr-x      2 mythtv     mythtv          8192 Sep  8 04:59 14092005
drwxr-sr-x      2 mythtv     mythtv          8192 Sep  9 04:52 15092005
srw-----w-  26765  936675348 473714967 2355711621 Oct  5  1976 16092005
drwxr-sr-x      2 mythtv     mythtv          8192 Aug 26 04:54 26082005
drwxr-sr-x      2 mythtv     mythtv          8192 Aug 26 04:54 27082005
drwxr-sr-x      2 mythtv     mythtv          8192 Aug 26 04:55 28082005
drwxr-sr-x      2 mythtv     mythtv          8192 Aug 26 04:55 29082005
drwxr-sr-x      2 mythtv     mythtv          8192 Aug 24 04:57 30082005
-rw-r--r--      1 mythtv     mythtv        362153 Nov  5  2004 guide.xml

The original has:
================
drwxr-sr-x  2 mythtv mythtv   8192 Sep  8 04:56 09092005
drwxr-sr-x  2 mythtv mythtv   8192 Sep  8 04:57 10092005
drwxr-sr-x  2 mythtv mythtv   8192 Sep  8 04:57 11092005
drwxr-sr-x  2 mythtv mythtv   8192 Sep  8 04:58 12092005
drwxr-sr-x  2 mythtv mythtv   8192 Sep 10 04:58 13092005 <<<<<
drwxr-sr-x  2 mythtv mythtv   8192 Sep  8 04:59 14092005
drwxr-sr-x  2 mythtv mythtv   8192 Sep  9 04:52 15092005
drwxr-sr-x  2 mythtv mythtv   8192 Sep 10 05:00 16092005 <<<<<
drwxr-sr-x  2 mythtv mythtv   8192 Sep 10 05:00 17092005
drwxr-sr-x  2 mythtv mythtv   8192 Aug 26 04:54 26082005
drwxr-sr-x  2 mythtv mythtv   8192 Aug 26 04:54 27082005
drwxr-sr-x  2 mythtv mythtv   8192 Aug 26 04:55 28082005
drwxr-sr-x  2 mythtv mythtv   8192 Aug 26 04:55 29082005
drwxr-sr-x  2 mythtv mythtv   8192 Aug 24 04:57 30082005
-rw-r--r--  1 mythtv mythtv 362153 Nov  5  2004 guide.xml

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat


* Re: Promise SATAII150 TX4 or raidreconf broken
  2005-09-10  1:29   ` Promise SATAII150 TX4 or raidreconf broken Eyal Lebedinsky
@ 2005-09-10 15:02     ` Thorild Selen
  2005-09-11  2:38       ` Eyal Lebedinsky
                         ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Thorild Selen @ 2005-09-10 15:02 UTC
  To: Eyal Lebedinsky; +Cc: linux-ide

If you search a bit in the linux-ide and linux-kernel mailing list
archives, you will find that several people before have had problems
with SATA150-TX4 and SATAII150-TX4 (see for example posts by Jim
Ramsay, Joerg Sommrey and me).

You haven't reported any error messages on the console or in
dmesg -- can you check the dmesg output?

The bug is reported to have been introduced between 2.6.10-1c8 and
2.6.10-ac11. I have seen one report of not easily being able to
reproduce the problem on 2.6.13, though. If your problem is similar,
that would suggest that 2.6.13 does not fix the bug completely.

It appears to primarily affect SMP/SMT(HT) systems (I've asked around
a bit among people reporting similar problems), and is typically
triggered by the intensive simultaneous use of several disks on the
controller. I believe that raidreconf would easily trigger it. See my
earlier posts for more info.

If this is the bug affecting you, you should get error messages in
your log similar to these:

       ata4: status=0x51 { DriveReady SeekComplete Error }
       ata4: error=0x40 { UncorrectableError }
       scsi5: ERROR on channel 0, id 0, lun 0,
               CDB: Read (10) 00 08 38 46 2d 00 00 c8 00
       Current sd08:57: sense key Medium Error
       Additional sense indicates Unrecovered read error -
               auto reallocate failed
        I/O error: dev 08:57, sector 49773056

These errors appear, apparently at random, when the disks are
stressed. Further symptoms are RAID failure (at least with raid5) and
file system corruption.
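
To look for messages of this shape in a saved log, something like the following could be used. It is only a sketch: the regular expression is derived from the two "ataN: status=... { ... }" and "ataN: error=... { ... }" lines quoted above, other kernel versions may format them differently, and the names ATA_LINE and ata_status_lines are made up for illustration.

```python
import re

# Matches lines such as:
#   ata4: status=0x51 { DriveReady SeekComplete Error }
#   ata4: error=0x40 { UncorrectableError }
# The pattern is an assumption based only on the sample lines above.
ATA_LINE = re.compile(
    r"\bata(\d+): (status|error)=(0x[0-9a-fA-F]+) \{ ([^}]*)\}"
)


def ata_status_lines(lines):
    """Return (port, kind, value, flags) for every ata status/error line."""
    hits = []
    for line in lines:
        match = ATA_LINE.search(line)
        if match:
            port, kind, value, flags = match.groups()
            hits.append((int(port), kind, value, flags.split()))
    return hits
```

Feeding it the text of a saved dmesg output, one line at a time (for example open("dmesg.txt") on a hypothetical dump file), lists which ata ports reported errors. An empty result only means no line matched this pattern, not that the controller is healthy.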

The only fix I know of is grabbing the SCSI bits from 2.6.10, which
can be compiled into a later kernel.  2.6.10 doesn't contain the bug,
but (the kernel.org) 2.6.10 should be avoided due to security
issues. There are important fixes in later SCSI code which should
probably be applied if you attempt this.

If you have these problems and have time and opportunity to do some
more testing (such as bit-by-bit applying changes between a late
non-affected kernel version and an early affected kernel version to
see what change introduces the bug), I and others would be very
thankful.  As far as we know, the bug is somewhere in the SCSI bits of
the kernel. For all I know, it might be outside libata though, and
triggered by some peculiar property of the interface or the sata_promise
driver.

In any case, you should probably first check for errors in the dmesg
output. Your problem might be the same, or something entirely
different.

Thorild Selén
Datorföreningen Update / Update Computer Club, Uppsala, SE


* Re: Promise SATAII150 TX4 or raidreconf broken
  2005-09-10 15:02     ` Thorild Selen
@ 2005-09-11  2:38       ` Eyal Lebedinsky
  2005-09-11 15:50       ` Eyal Lebedinsky
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Eyal Lebedinsky @ 2005-09-11  2:38 UTC
  To: Thorild Selen; +Cc: linux-ide

Thorild Selen wrote:
> If you search a bit in the linux-ide and linux-kernel mailing list
> archives, you will find that several people before have had problems
> with SATA150-TX4 and SATAII150-TX4 (see for example posts by Jim
> Ramsay, Joerg Sommrey and me).
> 
> You haven't reported of any error messages on the console or reported
> by dmesg -- can you check dmesg output?

No, no errors reported. This is why I wondered (as the subject says)
if raidreconf itself can be the problem.

I now have a short (26m) test that reproduces the problem. The
corruption is different for each run, so I would not think it is
raidreconf.

Maybe the TX4 has silent corruption? Or maybe raidreconf really has
a non-deterministic bug.

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat


* Re: Promise SATAII150 TX4 or raidreconf broken
  2005-09-10 15:02     ` Thorild Selen
  2005-09-11  2:38       ` Eyal Lebedinsky
@ 2005-09-11 15:50       ` Eyal Lebedinsky
  2005-09-12 22:50       ` Promise SATAII150 TX4 or raidreconf broken - answer Eyal Lebedinsky
  2005-09-18 12:03       ` Promise SATAII150 TX4 ide errors Eyal Lebedinsky
  3 siblings, 0 replies; 10+ messages in thread
From: Eyal Lebedinsky @ 2005-09-11 15:50 UTC
  To: Thorild Selen; +Cc: linux-ide, linux-raid list

Thorild Selen wrote:
> If you search a bit in the linux-ide and linux-kernel mailing list
> archives, you will find that several people before have had problems
> with SATA150-TX4 and SATAII150-TX4 (see for example posts by Jim
> Ramsay, Joerg Sommrey and me).

Following up on this information I did more testing.

I verified that creating a 2-disk raid-5 and extending it to 3 disks
always works. Going from 3 disks to 4 ends up corrupted.

BTW I had to change the check in raidreconf for a minimum of raid5
disks from 3 to 2. It worked just fine.

I then moved the first two disks to the motherboard (sd[cd] left on
the TX4). The situation remained the same (but I did get better
performance).

I am now less inclined to blame the TX4 and lean more towards
raidreconf.

I need to create a final test where I hit the disks concurrently
without raidreconf to see how they fare...

I did some tests and so far failed to provoke any i/o error.
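
A concurrent read test of the kind described could be sketched as below: read every device in parallel, several times over, and flag any device whose checksum changes between passes or which raises an I/O error. This is not the test that was actually run; the function names are hypothetical, and reads of regular files may be served from the page cache, so real use would target the raw devices with data larger than RAM.

```python
import hashlib
import threading


def digest(path, chunk_size=1 << 20):
    """MD5 of the whole file or device, read sequentially."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        while True:
            data = handle.read(chunk_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()


def concurrent_digests(paths):
    """Read all paths at the same time, one thread per path."""
    results = {}

    def worker(path):
        try:
            results[path] = digest(path)
        except OSError as exc:          # an i/o error counts as a failure
            results[path] = "error: %s" % exc

    threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def unstable_paths(paths, passes=2):
    """Paths whose result differs between passes or that hit an error."""
    baseline = concurrent_digests(paths)
    bad = {p for p in paths if baseline[p].startswith("error")}
    for _ in range(passes - 1):
        again = concurrent_digests(paths)
        bad.update(p for p in paths if again[p] != baseline[p])
    return sorted(bad)
```

An empty list from unstable_paths means every pass read the same data without error, which is consistent with (but does not prove) a healthy controller.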

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat


* Re: Promise SATAII150 TX4 or raidreconf broken - answer
  2005-09-10 15:02     ` Thorild Selen
  2005-09-11  2:38       ` Eyal Lebedinsky
  2005-09-11 15:50       ` Eyal Lebedinsky
@ 2005-09-12 22:50       ` Eyal Lebedinsky
  2005-09-18  8:56         ` Tyler
  2005-09-18 12:03       ` Promise SATAII150 TX4 ide errors Eyal Lebedinsky
  3 siblings, 1 reply; 10+ messages in thread
From: Eyal Lebedinsky @ 2005-09-12 22:50 UTC
  To: linux-raid list; +Cc: linux-ide

Executive summary: it is not the TX4. It is not really raidreconf.
You must specify the parity-algorithm in raidtab because the
raidreconf default is not what one expects.

I have now investigated the corrupted 3->4 disk raidreconf and I
can see that there is a pattern to the problem. A similar pattern
is seen with a 4->5 run.

I wrote known values to the raid before the reconf and checked
after. The process is
	create /dev/md0
	write to it
	raidreconf it
	read it and see which blocks show up where
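
The write-then-verify step can be sketched like this: stamp every block with its own block number before the reconf, then read back and report any block that carries a different number. The actual test program is not shown in this message, so the 4096-byte block size, the 8-byte little-endian stamp, and the function names are all assumptions made for illustration.

```python
import struct

BLOCK_SIZE = 4096   # assumed test block size, not taken from the original test


def stamp_blocks(dev, count, block_size=BLOCK_SIZE):
    """Write block number n (8 bytes, zero padded) into block n."""
    for n in range(count):
        dev.seek(n * block_size)
        dev.write(struct.pack("<Q", n).ljust(block_size, b"\0"))


def check_blocks(dev, count, block_size=BLOCK_SIZE):
    """Report every block whose stamp does not match its position."""
    report = []
    for n in range(count):
        dev.seek(n * block_size)
        data = dev.read(block_size)
        (seen,) = struct.unpack("<Q", data[:8])
        if seen != n:
            report.append("bad block %d says it is %d" % (n, seen))
    return report
```

Here dev is any seekable binary file object, for example open("/dev/md0", "r+b") for stamping on a scratch array. Note that stamp_blocks overwrites whatever is on the device.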

What I see is that the 2nd pair of each 6 blocks is swapped. Here
is the error list for a test with 1 cyl (31 blocks) per disk:

bad block 2 says it is 3
bad block 3 says it is 2
bad block 8 says it is 9
bad block 9 says it is 8
bad block 14 says it is 15
bad block 15 says it is 14
bad block 20 says it is 21
bad block 21 says it is 20
bad block 26 says it is 27
bad block 27 says it is 26
bad block 32 says it is 33
bad block 33 says it is 32
bad block 38 says it is 39
bad block 39 says it is 38
bad block 44 says it is 45
bad block 45 says it is 44
bad block 50 says it is 51
bad block 51 says it is 50
bad block 56 says it is 57
bad block 57 says it is 56
20 errors in 62 blocks

At this point I decided that I must take the TX4 out of the equation.
This is just too regular for a hardware problem. I created four
partitions on one disk and repeated the test. It failed just the same.

I was now reasonably convinced that it is raidreconf that gives me
grief. Nevertheless, the pattern is just too regular. Maybe the program
does not agree with md on the parity algorithm? The default is said
to be left-symmetric (see man mdadm; man raidtab does not say), so I
specified this explicitly in the raidtab and it started working.

Good, but I needed to understand this.

Looking at the raidtools code (where raidreconf is built), I think
that it does not default to left-symmetric. It looks to me like the
config struct is malloced and zeroed (with memset) meaning the .layout
member is set to left-asymmetric (see top of parser.c) and I do not
see that it is ever set to any other default (left-symmetric would
be numeric 2).
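
The swap pattern is what a layout mismatch would predict. The sketch below models a 3-disk raid-5 whose data was placed left-symmetric and is then read back as if it were left-asymmetric. It assumes the usual md placement rules for the two layouts (parity disk = data_disks - stripe % raid_disks for both) and is a model of the effect, not a reading of the raidreconf source.

```python
def left_asymmetric(stripe, chunk, raid_disks):
    """Disk holding data chunk `chunk` of `stripe` (layout number 0)."""
    parity = (raid_disks - 1) - stripe % raid_disks
    return chunk if chunk < parity else chunk + 1


def left_symmetric(stripe, chunk, raid_disks):
    """Disk holding data chunk `chunk` of `stripe` (layout number 2)."""
    parity = (raid_disks - 1) - stripe % raid_disks
    return (parity + 1 + chunk) % raid_disks


def misread_blocks(raid_disks, stripes, write_layout, read_layout):
    """List (block, block_actually_found) when the layouts disagree."""
    data_disks = raid_disks - 1
    on_disk = {}
    for block in range(stripes * data_disks):
        stripe, chunk = divmod(block, data_disks)
        on_disk[(stripe, write_layout(stripe, chunk, raid_disks))] = block
    errors = []
    for block in range(stripes * data_disks):
        stripe, chunk = divmod(block, data_disks)
        found = on_disk.get((stripe, read_layout(stripe, chunk, raid_disks)))
        if found != block:
            errors.append((block, found))
    return errors


# 3 disks, 31 stripes per disk, as in the 1-cylinder test above.
errors = misread_blocks(3, 31, left_symmetric, left_asymmetric)
for block, found in errors:
    print("bad block %d says it is %d" % (block, found))
print("%d errors in %d blocks" % (len(errors), 31 * 2))
```

Under these assumptions the two layouts agree on stripes 0 and 2 of every three and differ only on the middle stripe, where the two data chunks trade places. That gives blocks 2/3, 8/9, ... 56/57 swapped, 20 errors in 62 blocks, the same list as reported above.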

--
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat


* Re: Promise SATAII150 TX4 or raidreconf broken - answer
  2005-09-12 22:50       ` Promise SATAII150 TX4 or raidreconf broken - answer Eyal Lebedinsky
@ 2005-09-18  8:56         ` Tyler
  0 siblings, 0 replies; 10+ messages in thread
From: Tyler @ 2005-09-18  8:56 UTC
  To: Eyal Lebedinsky; +Cc: linux-raid list, linux-ide

Eyal Lebedinsky wrote:

>Executive summary: it is not the TX4. It is not really raidreconf.
>You must specify the parity-algorithm in raidtab because the
>raidreconf default is not what one expects.
>
>[trim]
>
Nice work Eyal :)

Now all we need is a patch for raid-reconf to fix default behaviour? :D

Regards,
Tyler.




* Re: Promise SATAII150 TX4 ide errors
  2005-09-10 15:02     ` Thorild Selen
                         ` (2 preceding siblings ...)
  2005-09-12 22:50       ` Promise SATAII150 TX4 or raidreconf broken - answer Eyal Lebedinsky
@ 2005-09-18 12:03       ` Eyal Lebedinsky
  3 siblings, 0 replies; 10+ messages in thread
From: Eyal Lebedinsky @ 2005-09-18 12:03 UTC
  To: Thorild Selen; +Cc: linux-ide, Jeff Garzik

Thorild Selen wrote:
> If you search a bit in the linux-ide and linux-kernel mailing list
> archives, you will find that several people before have had problems
> with SATA150-TX4 and SATAII150-TX4 (see for example posts by Jim
> Ramsay, Joerg Sommrey and me).

Having sorted out the raidreconf I now have a working raid.

Well, sort of working: I have now started seeing i/o errors that cause
the raid to stop and require a re-assemble. It looks very much
like the other reports, except that my details vary a bit.
Here is what I get:

Sep 18 10:11:10 eyal kernel: [4294756.244000] irq 16: nobody cared (try booting with the "irqpoll" option)
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [__report_bad_irq+42/160] __report_bad_irq+0x2a/0xa0
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [handle_IRQ_event+48/112] handle_IRQ_event+0x30/0x70
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [note_interrupt+128/240] note_interrupt+0x80/0xf0
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [__do_IRQ+283/288] __do_IRQ+0x11b/0x120
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [do_IRQ+70/112] do_IRQ+0x46/0x70
Sep 18 10:11:10 eyal kernel: [4294756.244000]  =======================
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [common_interrupt+26/32] common_interrupt+0x1a/0x20
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [default_idle+0/48] default_idle+0x0/0x30
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [default_idle+35/48] default_idle+0x23/0x30
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [cpu_idle+112/128] cpu_idle+0x70/0x80
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [start_kernel+366/400] start_kernel+0x16e/0x190
Sep 18 10:11:10 eyal kernel: [4294756.244000]  [unknown_bootoption+0/480] unknown_bootoption+0x0/0x1e0
Sep 18 10:11:10 eyal kernel: [4294756.244000] handlers:
Sep 18 10:11:10 eyal kernel: [4294756.244000] [pg0+945037568/1068864512] (usb_hcd_irq+0x0/0x70 [usbcore])
Sep 18 10:11:10 eyal kernel: [4294756.244000] [pg0+945383264/1068864512] (ata_interrupt+0x0/0x120 [libata])
Sep 18 10:11:10 eyal kernel: [4294756.244000] [pg0+945587408/1068864512] (pdc_interrupt+0x0/0x1c0 [sata_promise])
Sep 18 10:11:10 eyal kernel: [4294756.244000] [pg0+946374608/1068864512] (dc395x_interrupt+0x0/0x90 [dc395x])
Sep 18 10:11:10 eyal kernel: [4294756.244000] [pg0+946191904/1068864512] (e1000_intr+0x0/0x100 [e1000])
Sep 18 10:11:10 eyal kernel: [4294756.244000] Disabling IRQ #16

Sep 18 10:11:39 eyal kernel: [4294785.495000] ata4: command timeout
Sep 18 10:11:39 eyal kernel: [4294785.520000] ATA: abnormal status 0xFF on port 0xF8A4829C
Sep 18 10:11:39 eyal kernel: [4294785.538000] ata4: status=0xff { Busy }
Sep 18 10:11:39 eyal kernel: [4294785.551000] SCSI error : <3 0 0 0> return code = 0x8000002
Sep 18 10:11:39 eyal kernel: [4294785.569000] sdd: Current: sense key: Aborted Command
Sep 18 10:11:39 eyal kernel: [4294785.586000]     Additional sense: Scsi parity error
Sep 18 10:11:39 eyal kernel: [4294785.603000] end_request: I/O error, dev sdd, sector 625137215
Sep 18 10:11:39 eyal kernel: [4294785.622000] raid5: Disk failure on sdd1, disabling device. Operation continuing on 3 devices

Then /dev/sdc goes down the same way (it is on the TX4 too) and the raid5
goes offline.

[trim]

> The only fix I know of is grabbing the SCSI bits from 2.6.10, which
> can be compiled into a later kernel.  2.6.10 doesn't contain the bug,
> but (the kernel.org) 2.6.10 should be avoided due to security
> issues. There are important fixes in later SCSI code which should
> probably be applied if you attempt this.

Looking at sata_promise.c, comparing 2.6.10 to 2.6.13 shows very
little change:
	- a few new devices added to the table
	- a new device type 'board_20619'
	- new code for the new type in pdc_ata_init_one() which is in
	fact identical to the older 'board_20319'
	- handling of pci_dev_busy condition
	- one line added in the interrupt handler:
		writel(mask, mmio_base + PDC_INT_SEQMASK);

There were no further changes in the -ac series.

The maintainer may be able to judge if any of these could lead to
these errors.

So this may be due to changes outside the device driver, e.g. the
ide/scsi layers.

I cannot reliably reproduce the failure, but at least once it happened
very soon after a boot when there was not much activity on the raid.

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat
