SATA150TX4 atat1:command timeout

linux-ide.vger.kernel.org archive mirror

* SATA150TX4 atat1:command timeout
@ 2005-02-14 21:41 Francois Payette
  2005-02-14 22:35 ` Jeff Garzik
  0 siblings, 1 reply; 15+ messages in thread
From: Francois Payette @ 2005-02-14 21:41 UTC (permalink / raw)
  To: linux-ide

Hi,

We reported a strange bug earlier at bugzilla.kernel.org (#4106 
<http://bugzilla.kernel.org/show_bug.cgi?id=4106>): in our setup with a 
20318 (the SATA150 TX4, not the FastTrak one) we systematically get 
"ata1: command timeout" after copying between 200 and 600GB of data 
through the controller. Our setup has 4 Maxtor 6Y200M0 drives, 2 of 
them in RAID 0, and the other 2 in an LV group over a RAID 0 md array. 
When copying from one array to the other repeatedly, the machine 
freezes about once in every 2 copies. We changed the drive order, but we 
still got the "ata1: command timeout" message. We swapped the order of 
the cables, and still got it. We got a few kernel panics 
with spin locks, but since finding this forum we added the line

writel(mask, mmio_base + PDC_INT_SEQMASK);

to pdc_interrupt, and those panics went away.

We have kernel 2.6.10-753 (fc3) with all relevant patches to the sata 
stuff, the last of which is the one Bartlomiej Zolnierkiewicz posted on 
06/02/2005. 
http://marc.theaimsgroup.com/?l=linux-ide&m=110769875419863&w=2 
<http://marc.theaimsgroup.com/?l=linux-ide&m=110769875419863&w=2>

After commenting out the readl and writel lines below
    /* reduce TBG clock to 133 Mhz. */
    /*tmp = readl(mmio + PDC_TBG_MODE); */
    tmp &= ~0x30000; /* clear bit 17, 16*/
    tmp |= 0x10000;  /* set bit 17:16 = 0:1 */
    /*writel(tmp, mmio + PDC_TBG_MODE); */

in pdc_host_init (total shot in the dark) the setup seems more stable, 
we have now gone through 3 cycles of stress test (600GB of copying) and 
have not seen the crash.
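
For readers following along, the two mask operations in that snippet can be tried in user space. The sketch below is only an illustration: tbg_select_133mhz is a made-up helper name, it works on a plain integer rather than on the real PDC_TBG_MODE register, and it assumes nothing about what the other bits of that register mean.

```c
#include <stdint.h>

/* Hypothetical helper (not part of sata_promise.c): applies the same two
 * mask operations as the snippet above to a plain 32-bit value. */
uint32_t tbg_select_133mhz(uint32_t tmp)
{
    tmp &= ~UINT32_C(0x30000); /* clear bits 17 and 16 */
    tmp |= UINT32_C(0x10000);  /* set bits 17:16 to 0:1 */
    return tmp;
}
```

Only bits 17 and 16 change; every other bit of the input is passed through untouched, which is why the driver reads the register first and writes the modified value back.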

Earlier we tried the same stress test with ATA_DEBUG and 
ATA_VERBOSE_DEBUG defined, and the error did not occur (maybe because 
it was slowed down by all the output?).

Later we tried commenting out the lines that set bmr burst 
(PDC_FLASH_CTL) and slew rate (PDC_SLEW_CTL) in pdc_host_init, and that 
slowed the setup to half its original speed, but in that case the 
problem did not show up.

We printed the command that timed out: it is 0x35 (ATA_CMD_WRITE_EXT) 
and the protocol is 4 (ATA_PROT_DMA).
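
As a rough aid for reading such dumps, a small lookup like the one below turns the opcode into a name. This is a sketch, not driver code: ata_dma_cmd_name is a hypothetical helper, and the table covers only the four DMA read/write opcodes of the ATA command set, including the 0x35 reported here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: names for the DMA read/write opcodes of the ATA
 * command set. Returns NULL for anything outside this small table. */
const char *ata_dma_cmd_name(uint8_t cmd)
{
    switch (cmd) {
    case 0xC8: return "READ DMA";
    case 0xCA: return "WRITE DMA";
    case 0x25: return "READ DMA EXT";
    case 0x35: return "WRITE DMA EXT"; /* ATA_CMD_WRITE_EXT in libata */
    default:   return NULL;
    }
}
```

So the command that timed out is a 48-bit LBA DMA write, which fits a workload that copies large amounts of data onto the array.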

Any ideas?
TIA,
Francois Payette


* Re: SATA150TX4 atat1:command timeout
  2005-02-14 21:41 SATA150TX4 atat1:command timeout Francois Payette
@ 2005-02-14 22:35 ` Jeff Garzik
  2005-02-16 15:04   ` Francois Payette
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Garzik @ 2005-02-14 22:35 UTC (permalink / raw)
  To: francoisp; +Cc: linux-ide

Francois Payette wrote:
> Hi,
> 
> We have reported earlier a strange bug at bugzilla.kernel.org (#4106 
> <http://bugzilla.kernel.org/show_bug.cgi?id=4106>): in our setup of a 
> 20318 (the SATA150 TX4, not the fastrack one) we are systematically 
> getting ata1: command timeout after copying between 200 and 600GB of 
> data through the controller. Our setup is with 4 maxtor 6Y200M0, 2 of 
> them in raid 0, and the other 2 in a LV group over a raid 0 md array. 
> When copying from one array to the other one repeatedly, the machine 
> freezes about once in every 2 copies. We changed the drive order, but we 
> still got the msg ata1 command timeout. We swapped the order of the 
> cables, and still got ata1 command timeout. We got a few kernel panics 
> with spin locks, but since finding this forum we added the line
> 
> writel(mask, mmio_base + PDC_INT_SEQMASK);
> 
> to pdc_interrupt, and that one was gone.

The latest kernel (2.6.11-rc4) includes this code change.


> We have kernel 2.6.10-753 (fc3) with all relevant patches to the sata 
> stuff, the last of which is the one Bartlomiej Zolnierkiewicz posted on 
> 06/02/2005. 
> http://marc.theaimsgroup.com/?l=linux-ide&m=110769875419863&w=2 
> <http://marc.theaimsgroup.com/?l=linux-ide&m=110769875419863&w=2>
> 
> After commenting out the line
>    /* reduce TBG clock to 133 Mhz. */
>    /*tmp = readl(mmio + PDC_TBG_MODE); */
>    tmp &= ~0x30000; /* clear bit 17, 16*/
>    tmp |= 0x10000;  /* set bit 17:16 = 0:1 */
>    /*writel(tmp, mmio + PDC_TBG_MODE); */
> 
> in pdc_host_init (total shot in the dark) the setup seems more stable, 
> we have now gone through 3 cycles of stress test (600GB of copying) and 
> have not seen the crash.
> 
> Earlier we tried the same stress test with ATA_DEBUG and 
> ATA_VERBOSE_DEBUG defined, and the error did not occur (maybe because 
> it was slowed down by all the output?).

Correct, all that debug output introduces delays.  Introducing delays 
often "band-aids" a problem enough that it appears to work.

IOW, you can decrease performance to the point where bugs stop 
appearing, even though they still exist.


> Later we tried commenting out the lines that set bmr burst 
> (PDC_FLASH_CTL) and slew rate (PDC_SLEW_CTL) in pdc_host_init, and that 
> slowed the setup to half its original speed, but in that case the 
> problem did not show up.

Any chance you can test 2.6.11-rc4, either vanilla or only with your 
changes to sata_promise.c, and report the results?

	Jeff




* Re: SATA150TX4 atat1:command timeout
  2005-02-14 22:35 ` Jeff Garzik
@ 2005-02-16 15:04   ` Francois Payette
  2005-02-18 16:40     ` Francois Payette
  0 siblings, 1 reply; 15+ messages in thread
From: Francois Payette @ 2005-02-16 15:04 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-ide

With plain vanilla 2.6.11-rc4 the same bug appears after about 250GB 
(avg of 2 trials). With the TBG clock setting line omitted it still 
happens, but after about 1 1 TB (avg of 2 trials, takes about 6hrs per 
trial). Interestingly enough, this change does not slow down the setup, 
it even seems a little faster.

I was mistaken earlier: the 4 drives are not exactly the same; there are 
two 6B200M0, one 6B200S0 and one 6Y200M0. This should be irrelevant, as I 
have swapped disks and wires and the problem happens anyway. One 
interesting thing: in init 1 the timeout seems to appear sooner, after 
about 200GB in the case with the omission. I would be inclined to think 
this is some sort of deadlock or race condition: the kernel does not 
dump or panic, it just hangs in pdc_eng_timeout. When we dumped the 
stack in that function, all we had was pdc_eng_timeout, as there seems 
to be a separate thread per disk that gets woken up for error handling.

Any ideas on how we can catch this one?
TIA,
Francois


* Re: SATA150TX4 atat1:command timeout
  2005-02-16 15:04   ` Francois Payette
@ 2005-02-18 16:40     ` Francois Payette
  2005-09-30 10:40       ` Robin Bowes
  0 siblings, 1 reply; 15+ messages in thread
From: Francois Payette @ 2005-02-18 16:40 UTC (permalink / raw)
  Cc: Jeff Garzik, linux-ide, Eric Mudama

It seems that the locking problem we were experiencing was related to 
the mix of different drives in a raid 0 array; when we replaced the 
6Y200M0 with another 6B200S0 the problem never reappeared even after 1.8 
TB of I/O. Thanks a bunch to Eric for pointing out the differences; the 
Promise card and/or driver must have a problem with the bridge chip on 
that drive when interfacing with another drive that does not have that 
chip. We also tested for performance improvements with

writel(tmp, mmio + PDC_TBG_MODE);

commented out from pdc_host_init, but it does not cause a significant 
difference when benchmarked with bonnie++.

Thanks for your help,
Francois


* Re: SATA150TX4 atat1:command timeout
  2005-02-18 16:40     ` Francois Payette
@ 2005-09-30 10:40       ` Robin Bowes
  2005-10-06  9:55         ` Ian Oliver
  0 siblings, 1 reply; 15+ messages in thread
From: Robin Bowes @ 2005-09-30 10:40 UTC (permalink / raw)
  To: Francois Payette; +Cc: Jeff Garzik, linux-ide, Eric Mudama

Hi,

I hope it's OK resurrecting an old thread, but I'm seeing similar problems.

My setup is as follows:

Epox EP-D3VA
Dual PIII 1GHz processors
1.5GB RAM
Two Promise SATA150 TX4 controllers
Six Maxtor 250GB SATA drives (7Y250M0) - three per controller

Running Fedora Core 4 with "stock" FC4 kernel (kernel-smp-2.6.12-1.1456_FC4)

I have four md arrays as follows:

/dev/md0 RAID1 /dev/sd[ad]
/dev/md1 RAID1 /dev/sd[be]1
/dev/md2 RAID1 /dev/sd[cf]1
/dev/md5 RAID5 /dev/sd[abcef]2 (/dev/sdd2 is a hot spare)

md[0-2] are 1.5 MB areas
/dev/md0 is /
/dev/md1 is swap
/dev/md2 is currently not used

md5 is 929GB and I have used lvm to create:

/home home_lv audio_vg -wi-ao 914.38G
/usr  usr_lv  audio_vg -wi-ao  10.00G
/var  var_lv  audio_vg -wi-ao   5.00G
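
As a quick sanity check on those numbers: the three logical volumes add up to 914.38 + 10.00 + 5.00 = 929.38G, which matches the stated size of md5, and a RAID5 with five active members keeps four members' worth of data. The sketch below does that arithmetic; the function names are made up for illustration, and it assumes equal-sized members and ignores md and LVM metadata overhead.

```c
/* RAID5 spends one member's worth of space on parity, so n members
 * give n - 1 members of usable space. Returns 0 for fewer than 2. */
double raid5_usable(int members, double member_size)
{
    if (members < 2)
        return 0.0;
    return (members - 1) * member_size;
}

/* Inverse: the member size implied by a known usable size. */
double raid5_member_size(int members, double usable)
{
    if (members < 2)
        return 0.0;
    return usable / (members - 1);
}
```

Under those assumptions, raid5_member_size(5, 929.38) comes out at roughly 232.3G per member, which is plausible for the second partition of a 250GB drive.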

Ok, onto the problem...

After a couple of power outages I recently got myself a UPS but 
(typically) didn't get round to installing it before another outage (doh!).

The server came back up OK with /dev/md5 dirty and needing to resync.

However, during the re-sync, one or more of the disks clunked and I saw 
an "ATAn Timeout" message on the console and the system froze. (n 
varied, e.g. ATA2, ATA1, ATA4, etc.)  This seemed to be triggered by 
doing something that caused disk activity during the resync.

I've seen this before and done a hard-reset to start again - eventually 
the resync has completed and everything's back to normal.

However, this time, I had to drop to single-user mode and reduce the 
RAID sync speed (echo 5000 > /proc/sys/dev/raid/speed_limit_max) to get 
the resync to complete.
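
The same limit can be set from a program. The following is a minimal sketch, assuming the value is a per-device rate in KB/s as described in the kernel's md documentation; set_md_speed_limit is a hypothetical helper, and on a real system it would need root and the path /proc/sys/dev/raid/speed_limit_max.

```c
#include <stdio.h>

/* Hypothetical helper: writes a KB/s limit to the given file.
 * Returns 0 on success, -1 on any failure. */
int set_md_speed_limit(const char *path, long kb_per_sec)
{
    FILE *f;
    int ok;

    if (kb_per_sec <= 0)
        return -1;              /* refuse meaningless limits */
    f = fopen(path, "w");
    if (f == NULL)
        return -1;              /* missing file or no permission */
    ok = fprintf(f, "%ld\n", kb_per_sec) > 0;
    if (fclose(f) != 0)
        ok = 0;                 /* the write may only fail at close */
    return ok ? 0 : -1;
}
```

Calling set_md_speed_limit("/proc/sys/dev/raid/speed_limit_max", 5000) would be the equivalent of the echo command above. The setting does not survive a reboot.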

Can anyone tell me if this is a bug somewhere or might it be a hardware 
limitation, i.e. saturating the PCI bus when resyncing? Is there 
anything I can do to prevent it from happening?

I'm not too bothered about RAID performance - I mainly use it to store 
.flac audio files, which don't need great throughput to stream off the disk.

Any suggestions (or fixes!) appreciated.

Thanks,

R.
-- 
http://robinbowes.com

If a man speaks in a forest,
and his wife's not there,
is he still wrong?


* Re: SATA150TX4 atat1:command timeout
  2005-09-30 10:40       ` Robin Bowes
@ 2005-10-06  9:55         ` Ian Oliver
  2005-10-06 10:44           ` Erik Slagter
  0 siblings, 1 reply; 15+ messages in thread
From: Ian Oliver @ 2005-10-06  9:55 UTC (permalink / raw)
  To: linux-ide

In article  <433D1626.2000909@robinbowes.com>, Robin Bowes wrote:
> However, during the re-sync, one or more of the disks clunked and I saw 
> an "ATAn Timeout" message on the console and the system froze. (n 
> varied, e.g. ATA2, ATA1, ATA4, etc.)

I had this when hammering a software raid that was on four Promise TX2 
cards. I pulled two of the cards and instead used a four-port SiL3114 and 
the errors went away.

Others have reported similar problems.  I haven't seen anything about a fix.

Ian





* Re: SATA150TX4 atat1:command timeout
  2005-10-06  9:55         ` Ian Oliver
@ 2005-10-06 10:44           ` Erik Slagter
  2005-10-13 20:37             ` Ian Oliver
  0 siblings, 1 reply; 15+ messages in thread
From: Erik Slagter @ 2005-10-06 10:44 UTC (permalink / raw)
  To: linux-ide


On Thu, 2005-10-06 at 10:55 +0100, Ian Oliver wrote:
> In article  <433D1626.2000909@robinbowes.com>, Robin Bowes wrote:
> > However, during the re-sync, one or more of the disks clunked and I saw 
> > an "ATAn Timeout" message on the console and the system froze. (n 
> > varied, e.g. ATA2, ATA1, ATA4, etc.)
> 
> I had this when hammering a software raid that was on four Promise TX2 
> cards. I pulled two of the cards and instead used a four-port SiL3114 and 
> the errors went away.
> 
> Others have reported similar problems.  I haven't seen anything about a fix.

That's not really a solution ;-) It looks like you're also a victim of the
Promise problem.



* Re: SATA150TX4 atat1:command timeout
  2005-10-06 10:44           ` Erik Slagter
@ 2005-10-13 20:37             ` Ian Oliver
  2005-10-13 21:04               ` Mark Lord
  2005-10-14 10:00               ` Erik Slagter
  0 siblings, 2 replies; 15+ messages in thread
From: Ian Oliver @ 2005-10-13 20:37 UTC (permalink / raw)
  To: linux-ide

In article <1128595485.5964.41.camel@localhost.localdomain>, Erik 
Slagter wrote:
> That's not really a solution ;-) It looks like you're also a victim of the
> Promise problem.

For sale, 4x Promise TX2 (20375 based) cards, some light bullet damage 
that should polish out. :-)

To elaborate, my fix of pulling two of the cards just made the problem 
occur less often. But when the machine did eventually lock solid again, 
it then needed to resync the raid 5 on reboot. As soon as the SMART 
daemon kicked in with this level of activity, the machine locked solid. 
I rebooted with the raid arrays pulled, disabled all things "smart", 
and then rebooted.

I could then boot, enjoy a working server, and open some wine. Quite 
close to the bottom of the bottle, the sodding thing locked again. It's 
now running with only one raid array, which is now re-cabled to the SiL 
3114 and is rebuilding rather nicely (and far faster than on the 
Promise cards)

Assuming that all is sweetness and light, I will, 1) Order another SiL 
card, 2) Subject the Promise cards to physical abuse, 3) Move on to 
whisky, in no particular order.

For the record, this is with Ubuntu 5.04 (Hoary) with all repo updates 
as of last weekend.

Regards

Ian


* Re: SATA150TX4 atat1:command timeout
  2005-10-13 20:37             ` Ian Oliver
@ 2005-10-13 21:04               ` Mark Lord
  2005-10-14 10:17                 ` Ian Oliver
  2005-10-14 10:00               ` Erik Slagter
  1 sibling, 1 reply; 15+ messages in thread
From: Mark Lord @ 2005-10-13 21:04 UTC (permalink / raw)
  To: linux-ide

Ian Oliver wrote:
>
> For the record, this is with Ubuntu 5.04 (Hoary) with all repo updates 
> as of last weekend.

The Ubuntu-5.04 kernels have the libata EH bug, whereby the entire
machine locks up at random once in a blue moon when EH is running.

On the machines here, this bug is most often triggered by simply
having an empty CD/DVD reader that is managed by libata -- polling for
disc insertion triggers the EH code every couple of seconds, leading
to random lockups.

Bug no longer present in newer kernels, but the Ubuntu-5.04 stock
kernels still have it.

Cheers


* Re: SATA150TX4 atat1:command timeout
  2005-10-13 20:37             ` Ian Oliver
  2005-10-13 21:04               ` Mark Lord
@ 2005-10-14 10:00               ` Erik Slagter
  2005-10-14 16:00                 ` Ian Oliver
  1 sibling, 1 reply; 15+ messages in thread
From: Erik Slagter @ 2005-10-14 10:00 UTC (permalink / raw)
  To: linux-ide


On Thu, 2005-10-13 at 21:37 +0100, Ian Oliver wrote:

> > That's not really a solution ;-) It looks like you're also a victim of the
> > Promise problem.
> 
> For sale, 4x Promise TX2 (20375 based) cards, some light bullet damage 
> that should polish out. :-)

Don't give up yet. I received my new PSU two days ago. The new PSU has
150% more headroom and is said to be of decent quality, unlike the
former one, which was unbranded.

Also I put one of the two harddisks (the one that was running warmest)
in a harddisk cooler. 

Here is the interesting part: although the hard disk still reports a
high temperature (44C), I haven't had any problem since then.

So either the PSU has solved my problem, or the disk cooler cools some
critical component on the disk better.



* Re: SATA150TX4 atat1:command timeout
  2005-10-13 21:04               ` Mark Lord
@ 2005-10-14 10:17                 ` Ian Oliver
  2005-10-14 13:07                   ` Mark Lord
  0 siblings, 1 reply; 15+ messages in thread
From: Ian Oliver @ 2005-10-14 10:17 UTC (permalink / raw)
  To: linux-ide

In article <434ECBC3.3020802@rtr.ca>, Mark Lord wrote:
> The Ubuntu-5.04 kernels have the libata EH bug, whereby the entire
> machine locks up at random once in a blue-moon when EH is running.

What about 5.10? I've just upgraded a couple of boxes, and will tackle 
this problem server if necessary. Though switching away from Promise is 
still very much the plan.

I'm more than happy to donate the 4x Promise cards to any 
developer/tester that wants them.

Regards

Ian


* Re: SATA150TX4 atat1:command timeout
  2005-10-14 10:17                 ` Ian Oliver
@ 2005-10-14 13:07                   ` Mark Lord
  0 siblings, 0 replies; 15+ messages in thread
From: Mark Lord @ 2005-10-14 13:07 UTC (permalink / raw)
  To: linux-ide

Ian Oliver wrote:
> In article <434ECBC3.3020802@rtr.ca>, Mark Lord wrote:
> 
>>The Ubuntu-5.04 kernels have the libata EH bug, whereby the entire
>>machine locks up at random once in a blue-moon when EH is running.
>
> What about 5.10?

Breezy (5.10) should be fine.


* Re: SATA150TX4 atat1:command timeout
  2005-10-14 10:00               ` Erik Slagter
@ 2005-10-14 16:00                 ` Ian Oliver
  2005-10-17 10:08                   ` Erik Slagter
  0 siblings, 1 reply; 15+ messages in thread
From: Ian Oliver @ 2005-10-14 16:00 UTC (permalink / raw)
  To: linux-ide

In article <1129284030.30961.4.camel@localhost.localdomain>, Erik 
Slagter wrote:
> So either the PSU has solved my problem, or the disk cooler cools some
> critical component on the disk better.

My machine has two PSUs and the disks are shared between them. I doubt 
that power is an issue.

Will try an upgrade to Ubuntu Breezy (5.10) when I'm feeling brave!

Regards

Ian





* Re: SATA150TX4 atat1:command timeout
  2005-10-14 16:00                 ` Ian Oliver
@ 2005-10-17 10:08                   ` Erik Slagter
  2005-10-17 13:01                     ` Ian Oliver
  0 siblings, 1 reply; 15+ messages in thread
From: Erik Slagter @ 2005-10-17 10:08 UTC (permalink / raw)
  To: linux-ide


On Fri, 2005-10-14 at 17:00 +0100, Ian Oliver wrote:
> In article <1129284030.30961.4.camel@localhost.localdomain>, Erik 
> Slagter wrote:
> > So either the PSU has solved my problem, or the disk cooler cools some
> > critical component on the disk better.
> 
> My machine has two PSUs and the disks are shared between them. I doubt 
> that power is an issue.
> 
> Will try an upgrade to Ubuntu Breezy (5.10) when I'm feeling brave!

Of course you could also try to compile your kernel yourself (like I do
and therefore I am not affected by bugs already solved)



* Re: SATA150TX4 atat1:command timeout
  2005-10-17 10:08                   ` Erik Slagter
@ 2005-10-17 13:01                     ` Ian Oliver
  0 siblings, 0 replies; 15+ messages in thread
From: Ian Oliver @ 2005-10-17 13:01 UTC (permalink / raw)
  To: linux-ide

In article <1129543730.5945.17.camel@localhost.localdomain>, Erik Slagter 
wrote:
> > Will try an upgrade to Ubuntu Breezy (5.10) when I'm feeling brave!
> 
> Of course you could also try to compile your kernel yourself (like I do
> and therefore I am not affected by bugs already solved)

I've done this a few times, but I wasn't quite sure which of the bugs was 
biting me.  Anyway, Breezy is downloaded and ready to go.

Regards

Ian




