Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23

linux-ide.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
@ 2007-11-10 12:43 I Stratford
  2007-11-12  4:12 ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: I Stratford @ 2007-11-10 12:43 UTC (permalink / raw)
  To: linux-ide

[-- Attachment #1: Type: text/plain, Size: 5289 bytes --]

Hi there. I have read the archives quite a bit, but this is my first
post to linux-ide or any kernel-related mailing list. I therefore
apologize in advance if any aspect of the form or content of my mail
is in any way inappropriate for the list! :)

I have a desktop with the following (admittedly somewhat insane) configuration :

- ASUS K8NE-Deluxe Motherboard (2x vanilla SATA and 4x sil3112
onboard, NForce3 PCI)
- 2  Promise TX4 150 pci cards, connected to 8 250gb SATA seagate
drives (ST3250823AS)
- 2  Promise TX4 300 pci cards, connected to 8 500gb SATA hitachi
drives (HDT725050VLA360)

This provides me with a total of 22 SATA ports, 16 of which use the
sata_promise driver. Prior to the most recent upgrade which added the
2 TX4 300s, I have been using the 2 promise TX4 150s with the on-board
SATA ports as the underlying drives for linux software raid 5 as linux
mds. This configuration has been running fine in linux kernels
throughout the 2.6 series, and works in all cases if they are the only
powered drives in the machine.

I began to experience trouble when I added the 2 Promise TX4 300 cards
with the hitachi drives attached. I experienced "port slow to respond"
resets on the TX4 300 connected drives, especially of a specific port
(sdu), while using a (fedora-patched) 2.6.22 kernel. I debugged all of
the possible hardware related causes, specifically ruling out cabling,
SATA port and hard drive failure by swapping each component and
experiencing the same timeout behavior with each.

If I had all 8 drives assembled into a RAID, the timeouts would occur
before or shortly after the re-sync completed. Strangely, the timeouts
did not seem to happen when I had only 7 of the 8 drives assembled
into the RAID5, even if all 8 were physically connected. The timeouts
appeared to be unrelated to reads or writes, because they would happen
even when the RAID5 was synced and not experiencing any usage.

In most cases, it was "sdu", aka "ata22" which "failed" :
"
Oct 29 00:08:01 rice kernel: ata22.00: exception Emask 0x0 SAct 0x0
SErr 0x1380000 action 0x2 frozen
Oct 29 00:08:01 rice kernel: ata22.00: cmd
25/00:78:bf:c5:b7/00:00:32:00:00/e0 tag 0 cdb 0x0 data 61440 in
Oct 29 00:09:12 rice kernel: res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Oct 29 00:09:12 rice kernel: ata22: port is slow to respond, please be
patient (Status 0xff)
Oct 29 00:09:12 rice kernel: ata22: device not ready (errno=-16),
forcing hardreset
Oct 29 00:09:13 rice kernel: ata22: hard resetting port
Oct 29 00:09:13 rice kernel: ata22: port is slow to respond, please be
patient (Status 0xff)
Oct 29 00:09:13 rice kernel: ata22: COMRESET failed (errno=-16)
"

If the system remained running in this state, other ports would
sometimes time out, even including ports on the original TX150s. This
was the only case in which I saw ports on the TX150s appear to time
out. I chalk this up to linux not liking it when any drive is hung for
an extended period of time, and only mention it in case the
information might be in some way useful.

After experiencing these problems for a while and running out of
obvious hardware-based explanations, I started searching the linux-ide
archives and found this post :

http://www.spinics.net/lists/linux-ide/msg14089.html

In which Mikael Pettersson suggests using one of his sata_promise
hacks to step the TX4 300 to 1.5Gbps mode by forcing SControl and
Peter Favrholdt suggests that this worked for him in 2.6.21.
Follow-ups to the thread indicated that this problem was largely fixed
in 2.6.23 and, as 2.6.23 had been packaged for FC7 in the interim, I
installed it with high hopes.

Unfortunately, 2.6.23 (2.6.23.1-21.fc7) kernel panics when interacting
with the 8 drive array on the new TX4 300s. Strangely, the panic
seemed to be in md2_raid5 and associated raid system calls ("release
stripe" iirc). With reiserfs on the raid, it would happen at mount
time. It also kernel panicked when I tried to mkfs.xfs. I also tried
Mikael Pettersson's "force to 1.5Gbps" patches for 2.6.23 patched into
2.6.23.1-21.fc7, with the same kernel-panic result.

I then decided to take a shot at 2.6.22 (2.6.22.9-91.fc7) with the
1.5GBps patch, and have been running with no problems on the RAID ever
since. This is too sweet, so I have been keeping my fingers crossed.
Without the patch, problems appear within minutes or hours. With the
patch, it has been running for three days with no problems whatsoever,
even with heavy usage.

The purpose of the mail is to document and share my experience in the
hope that someone might find it useful, either for debugging their own
TX4 300-centric system issues or figuring out what is up with
sata_promise and the TX4 300 in 3Gbps mode. I also wish to offer my
somewhat unique promise-based system as a test environment for either
the timeout or kernel panic issues. I obviously have some basic need
for data integrity of the RAID5, but this system is not in production
and is therefore more available for testing purposes than the average
machine with 22 Promise SATA ports..  :)

Thank you all very much for your dedicated work on controller support
in linux, and please let me know if you need any further information
or if I can help in any way.

___ids
PS - attached is a copy of /proc/interrupts, fyi.

[-- Attachment #2: interrupts.txt --]
[-- Type: text/plain, Size: 875 bytes --]

           CPU0
  0:        180   IO-APIC-edge      timer
  1:          2   IO-APIC-edge      i8042
  5:          0   IO-APIC-edge      MPU401 UART
  6:          6   IO-APIC-edge      floppy
  8:          1   IO-APIC-edge      rtc
  9:          0   IO-APIC-fasteoi   acpi
 12:        104   IO-APIC-edge      i8042
 14:     587763   IO-APIC-edge      libata
 15:          0   IO-APIC-edge      libata
 16:      42164   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
 17:   41558670   IO-APIC-fasteoi   ohci_hcd:usb2, eth0
 18:    7132775   IO-APIC-fasteoi   ehci_hcd:usb3, NVidia CK8S
 19:   18977630   IO-APIC-fasteoi   sata_sil, sata_promise
 20:  187959252   IO-APIC-fasteoi   sata_promise, eth1
 21:     250150   IO-APIC-fasteoi   sata_promise
 22:   14762663   IO-APIC-fasteoi   sata_promise
NMI:          0
LOC:   18662678
ERR:          1
MIS:          0

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
  2007-11-10 12:43 Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23 I Stratford
@ 2007-11-12  4:12 ` Tejun Heo
  2007-11-12  8:45   ` Patric Karlsson
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2007-11-12  4:12 UTC (permalink / raw)
  To: I Stratford; +Cc: linux-ide, Mikael Pettersson

Hello,

I Stratford wrote:
> The purpose of the mail is to document and share my experience in the
> hope that someone might find it useful, either for debugging their own
> TX4 300-centric system issues or figuring out what is up with
> sata_promise and the TX4 300 in 3Gbps mode. I also wish to offer my
> somewhat unique promise-based system as a test environment for either
> the timeout or kernel panic issues. I obviously have some basic need
> for data integrity of the RAID5, but this system is not in production
> and is therefore more available for testing purposes than the average
> machine with 22 Promise SATA ports..  :)

[cc'ing Mikael Pettersson]

It seems those 3Gbps promise controllers have hard time getting out of
transmission errors.  Is it because hardreset doesn't work?  Can we fix it?

Also, if 3Gbps can't be made reliable on those controllers, how about
limiting it to 1.5Gbps by default with appropriate warning messages?
Without PMP, it's not like we're gonna earn anything by driving the
thing at 3Gbps.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
  2007-11-12  4:12 ` Tejun Heo
@ 2007-11-12  8:45   ` Patric Karlsson
  2007-11-12  8:58     ` Tejun Heo
  2007-11-12 19:59     ` Peter Favrholdt
  0 siblings, 2 replies; 10+ messages in thread
From: Patric Karlsson @ 2007-11-12  8:45 UTC (permalink / raw)
  To: linux-ide

Tejun Heo wrote:
> Hello,
> 
> I Stratford wrote:
>> The purpose of the mail is to document and share my experience in the
>> hope that someone might find it useful, either for debugging their own
>> TX4 300-centric system issues or figuring out what is up with
>> sata_promise and the TX4 300 in 3Gbps mode. I also wish to offer my
>> somewhat unique promise-based system as a test environment for either
>> the timeout or kernel panic issues. I obviously have some basic need
>> for data integrity of the RAID5, but this system is not in production
>> and is therefore more available for testing purposes than the average
>> machine with 22 Promise SATA ports..  :)
> 
> [cc'ing Mikael Pettersson]
> 
> It seems those 3Gbps promise controllers have hard time getting out of
> transmission errors.  Is it because hardreset doesn't work?  Can we fix it?
> 
> Also, if 3Gbps can't be made reliable on those controllers, how about
> limiting it to 1.5Gbps by default with appropriate warning messages?
> Without PMP, it's not like we're gonna earn anything by driving the
> thing at 3Gbps.
> 
> Thanks.
> 

I thought this was supposed to be fixed in 2.6.24-RC2 ?

Although I'm currently running a TX4 myself in 3Gbit mode with 2.6.23.1, 
  I'm waiting for 2.6.24 to reach stable until I try it out myself.
2.6.23.1 seemed to fix the hard-lock when it tried to reset the card 
tho, so now I just get a few errors about "soft resetting" in the logs. 
No data loss.

/Patric


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
  2007-11-12  8:45   ` Patric Karlsson
@ 2007-11-12  8:58     ` Tejun Heo
  2007-11-12 19:59     ` Peter Favrholdt
  1 sibling, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2007-11-12  8:58 UTC (permalink / raw)
  To: Patric Karlsson; +Cc: linux-ide

Patric Karlsson wrote:
> Tejun Heo wrote:
>> It seems those 3Gbps promise controllers have hard time getting out of
>> transmission errors.  Is it because hardreset doesn't work?  Can we
>> fix it?
>>
>> Also, if 3Gbps can't be made reliable on those controllers, how about
>> limiting it to 1.5Gbps by default with appropriate warning messages?
>> Without PMP, it's not like we're gonna earn anything by driving the
>> thing at 3Gbps.
> 
> I thought this was supposed to be fixed in 2.6.24-RC2 ?

Ah.... good news.

> Although I'm currently running a TX4 myself in 3Gbit mode with 2.6.23.1,
>  I'm waiting for 2.6.24 to reach stable until I try it out myself.
> 2.6.23.1 seemed to fix the hard-lock when it tried to reset the card
> tho, so now I just get a few errors about "soft resetting" in the logs.
> No data loss.

That's great too.

I Stratford, can you please give 2.6.24-rc2 a shot?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
  2007-11-12  8:45   ` Patric Karlsson
  2007-11-12  8:58     ` Tejun Heo
@ 2007-11-12 19:59     ` Peter Favrholdt
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Favrholdt @ 2007-11-12 19:59 UTC (permalink / raw)
  To: Patric Karlsson; +Cc: linux-ide

Hi Patric & list,

Just for the record:

A while ago I wrote SATA TX4 300 with 2.6.23.1 was stable but recently I 
had to correct it to "a lot more stable but not perfect", see 
http://marc.info/?l=linux-ide&m=119465285225569&w=2

(yes it still locks up hard but not very often).

I look forward to testing 2.6.24 with the workaround for the Asic 
hardware issue :-)


Best regards,

Peter

Patric Karlsson wrote:
> Although I'm currently running a TX4 myself in 3Gbit mode with 2.6.23.1, 
>  I'm waiting for 2.6.24 to reach stable until I try it out myself.
> 2.6.23.1 seemed to fix the hard-lock when it tried to reset the card 
> tho, so now I just get a few errors about "soft resetting" in the logs. 
> No data loss.
> 
> /Patric


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
@ 2007-11-12 10:25 Mikael Pettersson
  2007-11-12 12:01 ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: Mikael Pettersson @ 2007-11-12 10:25 UTC (permalink / raw)
  To: htejun, i.d.stratford; +Cc: linux-ide, mikpe

On Mon, 12 Nov 2007 13:12:19 +0900, Tejun Heo wrote:
>I Stratford wrote:
>> The purpose of the mail is to document and share my experience in the
>> hope that someone might find it useful, either for debugging their own
>> TX4 300-centric system issues or figuring out what is up with
>> sata_promise and the TX4 300 in 3Gbps mode. I also wish to offer my
>> somewhat unique promise-based system as a test environment for either
>> the timeout or kernel panic issues. I obviously have some basic need
>> for data integrity of the RAID5, but this system is not in production
>> and is therefore more available for testing purposes than the average
>> machine with 22 Promise SATA ports..  :)
>
>[cc'ing Mikael Pettersson]
>
>It seems those 3Gbps promise controllers have hard time getting out of
>transmission errors.  Is it because hardreset doesn't work?  Can we fix it?
>
>Also, if 3Gbps can't be made reliable on those controllers, how about
>limiting it to 1.5Gbps by default with appropriate warning messages?
>Without PMP, it's not like we're gonna earn anything by driving the
>thing at 3Gbps.

There are two things going on here:

First, a workaround for a HW erratum affecting 2nd-generation
chips like the SATA300 TX4 was included in kernel 2.6.24-rc2.
Outstanding bug reports for 2nd-generation chips in older kernels
are not unlikely to be related to this erratum, so we should not
butcher the driver because of issues reported against old kernels.

Secondly, Stratford's system is seriously overloaded:
- a desktop mainboard
- worked with 6 mainboard and 8 Promise 150 TX4 ports
- problems began when two Promise 300 TX4 cards and
  more disks were added
On several occasions we've traced people's problems to
overtaxed system components (cooling, PSU, PCI busses).
OTOH, it may be that Stratford's problem is directly related
to the HW erratum, in which case 2.6.24-rc2 should solve it.

/Mikael

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
  2007-11-12 10:25 Mikael Pettersson
@ 2007-11-12 12:01 ` Tejun Heo
  2007-11-14  8:33   ` I Stratford
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2007-11-12 12:01 UTC (permalink / raw)
  To: Mikael Pettersson; +Cc: i.d.stratford, linux-ide

Hello,

Mikael Pettersson wrote:
>> Also, if 3Gbps can't be made reliable on those controllers, how about
>> limiting it to 1.5Gbps by default with appropriate warning messages?
>> Without PMP, it's not like we're gonna earn anything by driving the
>> thing at 3Gbps.
> 
> There are two things going on here:
> 
> First, a workaround for a HW erratum affecting 2nd-generation
> chips like the SATA300 TX4 was included in kernel 2.6.24-rc2.
> Outstanding bug reports for 2nd-generation chips in older kernels
> are not unlikely to be related to this erratum, so we should not
> butcher the driver because of issues reported against old kernels.

Alright, if it's fixable, no problem.  I just wanted to remind that
running the link at 3Gbps isn't worth if it continues to cause problems.

> Secondly, Stratford's system is seriously overloaded:
> - a desktop mainboard
> - worked with 6 mainboard and 8 Promise 150 TX4 ports
> - problems began when two Promise 300 TX4 cards and
>   more disks were added
> On several occasions we've traced people's problems to
> overtaxed system components (cooling, PSU, PCI busses).

Agreed, I've seen my share of those issues.  Especially, SATA links seem
very dependent on power quality and very weird things happen when the
power isn't good enough.  Easy way to debug this is connect half of the
drives to a separate PSU and see what happens.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
  2007-11-12 12:01 ` Tejun Heo
@ 2007-11-14  8:33   ` I Stratford
  2007-11-14  9:38     ` Patric Karlsson
  2007-11-15  1:06     ` Tejun Heo
  0 siblings, 2 replies; 10+ messages in thread
From: I Stratford @ 2007-11-14  8:33 UTC (permalink / raw)
  To: Tejun Heo, linux-ide, Mikael Pettersson

On Nov 12, 2007 4:01 AM, Tejun Heo <htejun@gmail.com> wrote:

> Mikael Pettersson wrote:
> > First, a workaround for a HW erratum affecting 2nd-generation
> > chips like the SATA300 TX4 was included in kernel 2.6.24-rc2.
> ...
> Alright, if it's fixable, no problem.  I just wanted to remind that
> running the link at 3Gbps isn't worth if it continues to cause problems.

I appreciate the replies and ensuing discussion. I will test
2.6.24-rc2 as soon as possible and let you know the results. At that
time I'll also have more runtime on the 1.5Gbps forced 2.622 and will
be able to follow-up. Would you (Tejun, Mikael) prefer that I mail
linux-ide or you directly? I checked for a linux-ide FAQ and didn't
find one.. :)

Mikael :
> > Secondly, Stratford's system is seriously overloaded:
> > ...
> > - problems began when two Promise 300 TX4 cards and
> >   more disks were added
> > On several occasions we've traced people's problems to
> > overtaxed system components (cooling, PSU, PCI busses).

Tejun:
> Agreed, I've seen my share of those issues.  Especially, SATA links seem
> very dependent on power quality and very weird things happen when the
> power isn't good enough.  Easy way to debug this is connect half of the
> drives to a separate PSU and see what happens.

While I agree that the configuration is "seriously overloaded" (I
believe I described it as "admittedly somewhat insane" ;D) I haven't
experienced any port-resets or timeouts on my new TX4 300s, coming up
on a week of runtime with the 1.5Gbps-only 2.6.22 patched kernel.
Also, the problems did not generally extend to the two pre-existing
TX4 150s on the same PCI bus, even when the TX4 300s were having
problems. If hardware overheating/PCI overload/PSU problems were the
cause, it seems like a very lucky coincidence that stepping the TX4
300s to 1.5Gbps mode also resolves it.  :D

The system's 23 drives are spread across 3 good quality power
supplies. As indicated in my initial mail, I have swapped the PSU on
the new drive with a new one, specifically a 430 watt cooler master
PSU which by my kill-a-watt gives me ~250 watts of headroom even
during spin-up. While my building power is notoriously lousy, I find a
building-power or PSU-power-quality explanation somewhat unlikely,
especially in light of the consistent performance of the two TX4 150s
and the night-and-day performance of 1.5Gbps patched 2.6.22 vice
unpatched 2.6.22 on the two TX4 300s.

Of course, when you're dealing with 23 hard drives in a desktop.. who
knows! Thanks for the replies! :D

___ids

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
  2007-11-14  8:33   ` I Stratford
@ 2007-11-14  9:38     ` Patric Karlsson
  2007-11-15  1:06     ` Tejun Heo
  1 sibling, 0 replies; 10+ messages in thread
From: Patric Karlsson @ 2007-11-14  9:38 UTC (permalink / raw)
  To: linux-ide

I Stratford wrote:
> On Nov 12, 2007 4:01 AM, Tejun Heo <htejun@gmail.com> wrote:
> 
>> Mikael Pettersson wrote:
>>> First, a workaround for a HW erratum affecting 2nd-generation
>>> chips like the SATA300 TX4 was included in kernel 2.6.24-rc2.
>> ...
>> Alright, if it's fixable, no problem.  I just wanted to remind that
>> running the link at 3Gbps isn't worth if it continues to cause problems.
> 
> I appreciate the replies and ensuing discussion. I will test
> 2.6.24-rc2 as soon as possible and let you know the results. At that
> time I'll also have more runtime on the 1.5Gbps forced 2.622 and will
> be able to follow-up. Would you (Tejun, Mikael) prefer that I mail
> linux-ide or you directly? I checked for a linux-ide FAQ and didn't
> find one.. :)
>

You should CC linux-ide, so that the rest of us can monitor what's going on.

> Mikael :
>>> Secondly, Stratford's system is seriously overloaded:
>>> ...
>>> - problems began when two Promise 300 TX4 cards and
>>>   more disks were added
>>> On several occasions we've traced people's problems to
>>> overtaxed system components (cooling, PSU, PCI busses).
> 
> Tejun:
>> Agreed, I've seen my share of those issues.  Especially, SATA links seem
>> very dependent on power quality and very weird things happen when the
>> power isn't good enough.  Easy way to debug this is connect half of the
>> drives to a separate PSU and see what happens.
> 
> While I agree that the configuration is "seriously overloaded" (I
> believe I described it as "admittedly somewhat insane" ;D) I haven't
> experienced any port-resets or timeouts on my new TX4 300s, coming up
> on a week of runtime with the 1.5Gbps-only 2.6.22 patched kernel.
> Also, the problems did not generally extend to the two pre-existing
> TX4 150s on the same PCI bus, even when the TX4 300s were having
> problems. If hardware overheating/PCI overload/PSU problems were the
> cause, it seems like a very lucky coincidence that stepping the TX4
> 300s to 1.5Gbps mode also resolves it.  :D
> 
> The system's 23 drives are spread across 3 good quality power
> supplies. As indicated in my initial mail, I have swapped the PSU on
> the new drive with a new one, specifically a 430 watt cooler master
> PSU which by my kill-a-watt gives me ~250 watts of headroom even
> during spin-up. While my building power is notoriously lousy, I find a
> building-power or PSU-power-quality explanation somewhat unlikely,
> especially in light of the consistent performance of the two TX4 150s
> and the night-and-day performance of 1.5Gbps patched 2.6.22 vice
> unpatched 2.6.22 on the two TX4 300s.
> 
> Of course, when you're dealing with 23 hard drives in a desktop.. who
> knows! Thanks for the replies! :D
> 
> ___ids
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23
  2007-11-14  8:33   ` I Stratford
  2007-11-14  9:38     ` Patric Karlsson
@ 2007-11-15  1:06     ` Tejun Heo
  1 sibling, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2007-11-15  1:06 UTC (permalink / raw)
  To: I Stratford; +Cc: linux-ide, Mikael Pettersson

I Stratford wrote:
> On Nov 12, 2007 4:01 AM, Tejun Heo <htejun@gmail.com> wrote:
> 
>> Mikael Pettersson wrote:
>>> First, a workaround for a HW erratum affecting 2nd-generation
>>> chips like the SATA300 TX4 was included in kernel 2.6.24-rc2.
>> ...
>> Alright, if it's fixable, no problem.  I just wanted to remind that
>> running the link at 3Gbps isn't worth if it continues to cause problems.
> 
> I appreciate the replies and ensuing discussion. I will test
> 2.6.24-rc2 as soon as possible and let you know the results. At that
> time I'll also have more runtime on the 1.5Gbps forced 2.622 and will
> be able to follow-up. Would you (Tejun, Mikael) prefer that I mail
> linux-ide or you directly? I checked for a linux-ide FAQ and didn't
> find one.. :)

Please cc all involved including linux-ide.

> Mikael :
>>> Secondly, Stratford's system is seriously overloaded:
>>> ...
>>> - problems began when two Promise 300 TX4 cards and
>>>   more disks were added
>>> On several occasions we've traced people's problems to
>>> overtaxed system components (cooling, PSU, PCI busses).
> 
> Tejun:
>> Agreed, I've seen my share of those issues.  Especially, SATA links seem
>> very dependent on power quality and very weird things happen when the
>> power isn't good enough.  Easy way to debug this is connect half of the
>> drives to a separate PSU and see what happens.
> 
> While I agree that the configuration is "seriously overloaded" (I
> believe I described it as "admittedly somewhat insane" ;D) I haven't
> experienced any port-resets or timeouts on my new TX4 300s, coming up
> on a week of runtime with the 1.5Gbps-only 2.6.22 patched kernel.
> Also, the problems did not generally extend to the two pre-existing
> TX4 150s on the same PCI bus, even when the TX4 300s were having
> problems. If hardware overheating/PCI overload/PSU problems were the
> cause, it seems like a very lucky coincidence that stepping the TX4
> 300s to 1.5Gbps mode also resolves it.  :D

One thing I can tell you is power problem shows itself in highly diverse
ways.  Failing 3Gbps while 1.5 works fine, some subset of disks /
controllers work fine while others don't.  You name it.

> The system's 23 drives are spread across 3 good quality power
> supplies. As indicated in my initial mail, I have swapped the PSU on
> the new drive with a new one, specifically a 430 watt cooler master
> PSU which by my kill-a-watt gives me ~250 watts of headroom even
> during spin-up. While my building power is notoriously lousy, I find a
> building-power or PSU-power-quality explanation somewhat unlikely,
> especially in light of the consistent performance of the two TX4 150s
> and the night-and-day performance of 1.5Gbps patched 2.6.22 vice
> unpatched 2.6.22 on the two TX4 300s.

That said, using one or more PSUs and swapping them is the best way to
rule those problems out.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2007-11-15  1:06 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-11-10 12:43 Promise SATA TX4 300 port timeout with sata_promise in 2.6.22, kernel panic in 2.6.23 I Stratford
2007-11-12  4:12 ` Tejun Heo
2007-11-12  8:45   ` Patric Karlsson
2007-11-12  8:58     ` Tejun Heo
2007-11-12 19:59     ` Peter Favrholdt
  -- strict thread matches above, loose matches on Subject: below --
2007-11-12 10:25 Mikael Pettersson
2007-11-12 12:01 ` Tejun Heo
2007-11-14  8:33   ` I Stratford
2007-11-14  9:38     ` Patric Karlsson
2007-11-15  1:06     ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).