* SATA150TX4 atat1:command timeout
@ 2005-02-14 21:41 Francois Payette
2005-02-14 22:35 ` Jeff Garzik
0 siblings, 1 reply; 15+ messages in thread
From: Francois Payette @ 2005-02-14 21:41 UTC
To: linux-ide
Hi,
We reported a strange bug earlier at bugzilla.kernel.org (#4106
<http://bugzilla.kernel.org/show_bug.cgi?id=4106>): with a 20318 (the
SATA150 TX4, not the FastTrak one) we systematically get "ata1: command
timeout" after copying between 200 and 600GB of data through the
controller. Our setup has 4 Maxtor 6Y200M0 drives, 2 of them in RAID 0
and the other 2 in an LV group over a RAID 0 md array. When copying
repeatedly from one array to the other, the machine freezes about once
in every 2 copies. We changed the drive order and swapped the cables,
but still got "ata1: command timeout". We also got a few kernel panics
involving spinlocks, but since finding this list we added the line
writel(mask, mmio_base + PDC_INT_SEQMASK);
to pdc_interrupt, and that one was gone.
We have kernel 2.6.10-753 (fc3) with all relevant patches to the sata
stuff, the last of which is the one Bartlomiej Zolnierkiewicz posted on
06/02/2005.
http://marc.theaimsgroup.com/?l=linux-ide&m=110769875419863&w=2
After commenting out the PDC_TBG_MODE read and write lines, as shown here,
/* reduce TBG clock to 133 Mhz. */
/* tmp = readl(mmio + PDC_TBG_MODE); */
tmp &= ~0x30000; /* clear bit 17, 16 */
tmp |= 0x10000; /* set bit 17:16 = 0:1 */
/* writel(tmp, mmio + PDC_TBG_MODE); */
in pdc_host_init (a total shot in the dark) the setup seems more stable:
we have now gone through 3 cycles of the stress test (600GB of copying)
and have not seen the crash.
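For anyone following the bit manipulation, here is a small user-space sketch of what the two mask operations do to bits 17:16 of a register value. The helper name tbg_select_133 and the macro names are made up for illustration; nothing here touches hardware, and the meaning of the field is taken only from the driver comment quoted above.

```c
#include <stdint.h>

/* Bits 17:16 of the value read from PDC_TBG_MODE (per the driver comment). */
#define TBG_CLK_MASK 0x30000u
#define TBG_CLK_01   0x10000u   /* bits 17:16 = 0:1 */

/* Hypothetical helper: the value the driver would write back. */
uint32_t tbg_select_133(uint32_t tmp)
{
    tmp &= ~TBG_CLK_MASK;   /* clear bit 17, 16 */
    tmp |= TBG_CLK_01;      /* set bit 17:16 = 0:1 */
    return tmp;
}
```

All other bits pass through unchanged. With both the readl and the writel commented out, the register is never written, so it simply keeps whatever value it had before pdc_host_init ran.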
Earlier we tried the same stress test with ATA_DEBUG and
ATA_VERBOSE_DEBUG defined, and the error did not occur (maybe because
everything was slowed down by all the output?).
Later we tried commenting out the code that sets the BMR burst
(PDC_FLASH_CTL) and slew rate (PDC_SLEW_CTL) in pdc_host_init. That
slowed the setup to half its original speed, but in that case the
problem did not show up.
We printed the command that timed out: it is 0x35 (ATA_CMD_WRITE_EXT),
and the protocol is 4 (ATA_PROT_DMA).
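To make the opcode easier to read in logs, a tiny lookup like the following could be used. This is only a sketch: the table covers just the four DMA read/write opcodes, and ata_cmd_name is a hypothetical helper, not something from libata.

```c
#include <stddef.h>
#include <stdint.h>

/* A few ATA DMA opcodes and their command-set names. */
static const struct {
    uint8_t op;
    const char *name;
} ata_cmds[] = {
    { 0xC8, "READ DMA" },
    { 0xCA, "WRITE DMA" },
    { 0x25, "READ DMA EXT" },
    { 0x35, "WRITE DMA EXT" },  /* ATA_CMD_WRITE_EXT, the one timing out here */
};

/* Hypothetical helper: map an opcode to a name, if it is in the table. */
const char *ata_cmd_name(uint8_t op)
{
    for (size_t i = 0; i < sizeof(ata_cmds) / sizeof(ata_cmds[0]); i++)
        if (ata_cmds[i].op == op)
            return ata_cmds[i].name;
    return "(not in table)";
}
```

So the stuck command is a 48-bit LBA DMA write, which fits a failure that shows up while copying onto the array.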
Any ideas?
TIA,
Francois Payette
* Re: SATA150TX4 atat1:command timeout
2005-02-14 21:41 SATA150TX4 atat1:command timeout Francois Payette
@ 2005-02-14 22:35 ` Jeff Garzik
2005-02-16 15:04 ` Francois Payette
0 siblings, 1 reply; 15+ messages in thread
From: Jeff Garzik @ 2005-02-14 22:35 UTC
To: francoisp; +Cc: linux-ide
Francois Payette wrote:
> Hi,
>
> We reported a strange bug earlier at bugzilla.kernel.org (#4106
> <http://bugzilla.kernel.org/show_bug.cgi?id=4106>): with a 20318 (the
> SATA150 TX4, not the FastTrak one) we systematically get "ata1: command
> timeout" after copying between 200 and 600GB of data through the
> controller. Our setup has 4 Maxtor 6Y200M0 drives, 2 of them in RAID 0
> and the other 2 in an LV group over a RAID 0 md array. When copying
> repeatedly from one array to the other, the machine freezes about once
> in every 2 copies. We changed the drive order and swapped the cables,
> but still got "ata1: command timeout". We also got a few kernel panics
> involving spinlocks, but since finding this list we added the line
>
> writel(mask, mmio_base + PDC_INT_SEQMASK);
>
> to pdc_interrupt, and that one was gone.
The latest kernel (2.6.11-rc4) includes this code change.
> We have kernel 2.6.10-753 (fc3) with all relevant patches to the sata
> stuff, the last of which is the one Bartlomiej Zolnierkiewicz posted on
> 06/02/2005.
> http://marc.theaimsgroup.com/?l=linux-ide&m=110769875419863&w=2
>
> After commenting out the PDC_TBG_MODE read and write lines, as shown here,
> /* reduce TBG clock to 133 Mhz. */
> /* tmp = readl(mmio + PDC_TBG_MODE); */
> tmp &= ~0x30000; /* clear bit 17, 16 */
> tmp |= 0x10000; /* set bit 17:16 = 0:1 */
> /* writel(tmp, mmio + PDC_TBG_MODE); */
>
> in pdc_host_init (a total shot in the dark) the setup seems more stable:
> we have now gone through 3 cycles of the stress test (600GB of copying)
> and have not seen the crash.
>
> Earlier we tried the same stress test with ATA_DEBUG and
> ATA_VERBOSE_DEBUG defined, and the error did not occur (maybe because
> everything was slowed down by all the output?).
Correct, all that debug output introduces delays. Introducing delays
often "band-aids" a problem enough that it appears to work.
IOW, you can decrease performance to the point where bugs stop
appearing, even though they still exist.
> Later we tried commenting out the code that sets the BMR burst
> (PDC_FLASH_CTL) and slew rate (PDC_SLEW_CTL) in pdc_host_init. That
> slowed the setup to half its original speed, but in that case the
> problem did not show up.
Any chance you can test 2.6.11-rc4, either vanilla or only with your
changes to sata_promise.c, and report the results?
Jeff
* Re: SATA150TX4 atat1:command timeout
2005-02-14 22:35 ` Jeff Garzik
@ 2005-02-16 15:04 ` Francois Payette
2005-02-18 16:40 ` Francois Payette
0 siblings, 1 reply; 15+ messages in thread
From: Francois Payette @ 2005-02-16 15:04 UTC
To: Jeff Garzik; +Cc: linux-ide
With plain vanilla 2.6.11-rc4 the same bug appears after about 250GB
(avg of 2 trials). With the TBG clock setting line omitted it still
happens, but after about 1.1 TB (avg of 2 trials, about 6 hrs per
trial). Interestingly enough, this change does not slow down the setup;
it even seems a little faster.
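As a rough sanity check on those figures, roughly 1 TB copied per 6 hour trial corresponds to an average of about 46 MB/s. The helper below is just that arithmetic; the name avg_mb_per_s is made up, and decimal units (1 TB = 1e12 bytes) are an assumption.

```c
/* Average throughput in MB/s (decimal megabytes) over a whole run. */
double avg_mb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1e6;
}
```

For example, avg_mb_per_s(1e12, 6.0 * 3600.0) gives the per-trial average; if the rate were similar in the vanilla runs, the 250GB failure point would be reached after about an hour and a half.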
I was mistaken earlier: the 4 drives are not exactly the same. There are
two 6B200M0, one 6B200S0 and one 6Y200M0. This should be irrelevant, as I
have swapped disks and cables and the problem happens anyway. One
interesting thing: in init 1 the timeout seems to appear sooner, after
about 200GB in the case with the omission. I am inclined to think
this is some sort of deadlock or race condition: the kernel does not
dump or panic, it just hangs in pdc_eng_timeout. When we dumped the
stack in that function, all we saw was pdc_eng_timeout, as there seems
to be a separate thread per disk that gets woken up for error handling.
Any ideas on how we can catch this one?
TIA,
Francois
* Re: SATA150TX4 atat1:command timeout
2005-02-16 15:04 ` Francois Payette
@ 2005-02-18 16:40 ` Francois Payette
2005-09-30 10:40 ` Robin Bowes
0 siblings, 1 reply; 15+ messages in thread
From: Francois Payette @ 2005-02-18 16:40 UTC
Cc: Jeff Garzik, linux-ide, Eric Mudama
It seems that the locking problem we were experiencing was related to
mixing different drive models in a RAID 0 array: when we replaced the
6Y200M0 with another 6B200S0 the problem never reappeared, even after
1.8 TB of I/O. Thanks a bunch to Eric for pointing out the differences;
the Promise card and/or driver must have a problem with the bridge chip
on that drive when it is paired with another drive that does not have
that chip. We also tested for performance improvements with
writel(tmp, mmio + PDC_TBG_MODE);
commented out of pdc_host_init, but it does not make a significant
difference when benchmarked with bonnie++.
Thanks for your help,
Francois
* Re: SATA150TX4 atat1:command timeout
2005-02-18 16:40 ` Francois Payette
@ 2005-09-30 10:40 ` Robin Bowes
2005-10-06 9:55 ` Ian Oliver
0 siblings, 1 reply; 15+ messages in thread
From: Robin Bowes @ 2005-09-30 10:40 UTC
To: Francois Payette; +Cc: Jeff Garzik, linux-ide, Eric Mudama
Hi,
I hope it's OK resurrecting an old thread, but I'm seeing similar problems.
My setup is as follows:
Epox EP-D3VA
Dual PIII 1GHz processors
1.5GB RAM
Two of Promise SATA150 TX4 controllers
Six of Maxtor 250GB SATA drives (7Y250M0) - three per controller
Running Fedora Core 4 with "stock" FC4 kernel (kernel-smp-2.6.12-1.1456_FC4)
I have four md arrays as follows:
/dev/md0 RAID1 /dev/sd[ad]
/dev/md1 RAID1 /dev/sd[be]1
/dev/md2 RAID1 /dev/sd[cf]1
/dev/md5 RAID5 /dev/sd[abcef]2 (/dev/sdd2 is a hot spare)
md[0-2] are 1.5 MB areas
/dev/md0 is /
/dev/md1 is swap
/dev/md2 is currently not used
md5 is 929GB and I have used lvm to create:
/home  home_lv  audio_vg  -wi-ao  914.38G
/usr   usr_lv   audio_vg  -wi-ao   10.00G
/var   var_lv   audio_vg  -wi-ao    5.00G
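As a quick consistency check on these sizes: RAID5 gives (n - 1) members' worth of usable space, so with the five active members listed for md5 (the hot spare adds no capacity), 929GB implies roughly 232GB per partition. The three LVs also add up to 929.38G, in line with the array size. The function names below are illustrative only, and equal-sized members are assumed.

```c
/* RAID5 usable capacity: one member's worth of space goes to parity. */
double raid5_usable_gb(int active_members, double member_gb)
{
    return (active_members - 1) * member_gb;
}

/* Inverse: per-member size implied by a given usable capacity. */
double raid5_member_gb(int active_members, double usable_gb)
{
    return usable_gb / (active_members - 1);
}
```

Here raid5_member_gb(5, 929.0) returns the implied partition size for this array.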
Ok, onto the problem...
After a couple of power outages I recently got myself a UPS but
(typically) didn't get round to installing it before another outage (doh!).
The server came back up OK with /dev/md5 dirty and needing to resync.
However, during the re-sync, one or more of the disks clunked and I saw
an "ATAn Timeout" message on the console and the system froze. (n
varied, e.g. ATA2, ATA1, ATA4, etc.) This seemed to be triggered by
doing something that caused disk activity during the resync.
I've seen this before and done a hard-reset to start again - eventually
the resync has completed and everything's back to normal.
However, this time, I had to drop to single-user mode and reduce the
RAID sync speed (echo 5000 > /proc/sys/dev/raid/speed_limit_max) to get
the resync to complete.
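The same throttle can be applied from a program instead of the shell. This is only a sketch: write_limit is a made-up helper, it needs root to write under /proc, and md treats the value as a rate in KB/s.

```c
#include <stdio.h>

/* Write a decimal value to a sysctl-style file; returns 0 on success. */
int write_limit(const char *path, long kb_per_sec)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fprintf(f, "%ld\n", kb_per_sec) > 0;

    /* fclose flushes the buffer, so a failed write can surface here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling write_limit("/proc/sys/dev/raid/speed_limit_max", 5000) would be the equivalent of the echo command above. The setting does not survive a reboot.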
Can anyone tell me if this is a bug somewhere or might it be a hardware
limitation, i.e. saturating the PCI bus when resyncing? Is there
anything I can do to prevent it from happening?
I'm not too bothered about RAID performance - I mainly use it to store
.flac audio files which don't need great throughput to stream off the disk.
Any suggestions (or fixes!) appreciated.
Thanks,
R.
--
http://robinbowes.com
If a man speaks in a forest,
and his wife's not there,
is he still wrong?
* Re: SATA150TX4 atat1:command timeout
2005-09-30 10:40 ` Robin Bowes
@ 2005-10-06 9:55 ` Ian Oliver
2005-10-06 10:44 ` Erik Slagter
0 siblings, 1 reply; 15+ messages in thread
From: Ian Oliver @ 2005-10-06 9:55 UTC
To: linux-ide
In article <433D1626.2000909@robinbowes.com>, Robin Bowes wrote:
> However, during the re-sync, one or more of the disks clunked and I saw
> an "ATAn Timeout" message on the console and the system froze. (n
> varied, e.g. ATA2, ATA1, ATA4, etc.)
I had this when hammering a software raid that was on four Promise TX2
cards. I pulled two of the cards and instead used a four-port SiL3114 and
the errors went away.
Others have reported similar. I haven't seen anything about a fix.
Ian
* Re: SATA150TX4 atat1:command timeout
2005-10-06 9:55 ` Ian Oliver
@ 2005-10-06 10:44 ` Erik Slagter
2005-10-13 20:37 ` Ian Oliver
0 siblings, 1 reply; 15+ messages in thread
From: Erik Slagter @ 2005-10-06 10:44 UTC
To: linux-ide
On Thu, 2005-10-06 at 10:55 +0100, Ian Oliver wrote:
> In article <433D1626.2000909@robinbowes.com>, Robin Bowes wrote:
> > However, during the re-sync, one or more of the disks clunked and I saw
> > an "ATAn Timeout" message on the console and the system froze. (n
> > varied, e.g. ATA2, ATA1, ATA4, etc.)
>
> I had this when hammering a software raid that was on four Promise TX2
> cards. I pulled two of the cards and instead used a four-port SiL3114 and
> the errors went away.
>
> Others have reported similar. I haven't seen anything about a fix.
That's not really a solution ;-) It looks like you're also a victim of the
Promise problem.
* Re: SATA150TX4 atat1:command timeout
2005-10-06 10:44 ` Erik Slagter
@ 2005-10-13 20:37 ` Ian Oliver
2005-10-13 21:04 ` Mark Lord
2005-10-14 10:00 ` Erik Slagter
0 siblings, 2 replies; 15+ messages in thread
From: Ian Oliver @ 2005-10-13 20:37 UTC
To: linux-ide
In article <1128595485.5964.41.camel@localhost.localdomain>, Erik
Slagter wrote:
> That's not really a solution ;-) It looks like you're also a victim of
> the Promise problem.
For sale, 4x Promise TX2 (20375 based) cards, some light bullet damage
that should polish out. :-)
To elaborate, my fix of pulling two of the cards just made the problem
occur less often. But when the machine did eventually lock solid again,
it then needed to resync the raid 5 on reboot. As soon as the SMART
daemon kicked in with this level of activity, the machine locked solid.
I rebooted with the raid arrays pulled, disabled all things "smart",
and then rebooted.
I could then boot, enjoy a working server, and open some wine. Quite
close to the bottom of the bottle, the sodding thing locked again. It's
now running with only one raid array, which is now re-cabled to the SiL
3114 and is rebuilding rather nicely (and far faster than on the
Promise cards)
Assuming that all is sweetness and light, I will, 1) Order another SiL
card, 2) Subject the Promise cards to physical abuse, 3) Move on to
whisky, in no particular order.
For the record, this is with Ubuntu 5.04 (Hoary) with all repo updates
as of last weekend.
Regards
Ian
* Re: SATA150TX4 atat1:command timeout
2005-10-13 20:37 ` Ian Oliver
@ 2005-10-13 21:04 ` Mark Lord
2005-10-14 10:17 ` Ian Oliver
2005-10-14 10:00 ` Erik Slagter
1 sibling, 1 reply; 15+ messages in thread
From: Mark Lord @ 2005-10-13 21:04 UTC
To: linux-ide
Ian Oliver wrote:
>
> For the record, this is with Ubuntu 5.04 (Hoary) with all repo updates
> as of last weekend.
The Ubuntu-5.04 kernels have the libata EH bug, whereby the entire
machine locks up at random once in a blue moon when EH is running.
On the machines here, this bug is most often triggered by simply
having an empty CD/DVD reader that is managed by libata -- polling for
disc insertion triggers the EH code every couple of seconds, leading
to random lockups.
Bug no longer present in newer kernels, but the Ubuntu-5.04 stock
kernels still have it.
Cheers
* Re: SATA150TX4 atat1:command timeout
2005-10-13 21:04 ` Mark Lord
@ 2005-10-14 10:17 ` Ian Oliver
2005-10-14 13:07 ` Mark Lord
0 siblings, 1 reply; 15+ messages in thread
From: Ian Oliver @ 2005-10-14 10:17 UTC
To: linux-ide
In article <434ECBC3.3020802@rtr.ca>, Mark Lord wrote:
> The Ubuntu-5.04 kernels have the libata EH bug, whereby the entire
> machine locks up at random once in a blue moon when EH is running.
What about 5.10? I've just upgraded a couple of boxes, and will tackle
this problem server if necessary. Though switching away from Promise is
still very much the plan.
I'm more than happy to donate the 4x Promise cards to any
developer/tester that wants them.
Regards
Ian
* Re: SATA150TX4 atat1:command timeout
2005-10-13 20:37 ` Ian Oliver
2005-10-13 21:04 ` Mark Lord
@ 2005-10-14 10:00 ` Erik Slagter
2005-10-14 16:00 ` Ian Oliver
1 sibling, 1 reply; 15+ messages in thread
From: Erik Slagter @ 2005-10-14 10:00 UTC
To: linux-ide
On Thu, 2005-10-13 at 21:37 +0100, Ian Oliver wrote:
> > That's not really a solution ;-) It looks like you're also a victim of
> > the Promise problem.
>
> For sale, 4x Promise TX2 (20375 based) cards, some light bullet damage
> that should polish out. :-)
Don't give up yet. I received my new PSU two days ago. The new PSU has
150% more headroom and is said to be of decent quality, unlike the
former one, which is unbranded.
Also I put one of the two harddisks (the one that was running warmest)
in a harddisk cooler.
Here is the interesting part: although the hard disk still reports a
high temperature (44C), I haven't had any problem since then.
So either the PSU has solved my problem, or the disk cooler cools some
critical component on the disk better.
* Re: SATA150TX4 atat1:command timeout
2005-10-14 10:00 ` Erik Slagter
@ 2005-10-14 16:00 ` Ian Oliver
2005-10-17 10:08 ` Erik Slagter
0 siblings, 1 reply; 15+ messages in thread
From: Ian Oliver @ 2005-10-14 16:00 UTC
To: linux-ide
In article <1129284030.30961.4.camel@localhost.localdomain>, Erik
Slagter wrote:
> So either the PSU has solved my problem, or the disk cooler cools some
> critical component on the disk better.
My machine has two PSUs and the disks are shared between them. I doubt
that power is an issue.
Will try an upgrade to Ubuntu Breezy (5.10) when I'm feeling brave!
Regards
Ian
* Re: SATA150TX4 atat1:command timeout
2005-10-14 16:00 ` Ian Oliver
@ 2005-10-17 10:08 ` Erik Slagter
2005-10-17 13:01 ` Ian Oliver
0 siblings, 1 reply; 15+ messages in thread
From: Erik Slagter @ 2005-10-17 10:08 UTC
To: linux-ide
On Fri, 2005-10-14 at 17:00 +0100, Ian Oliver wrote:
> In article <1129284030.30961.4.camel@localhost.localdomain>, Erik
> Slagter wrote:
> > So either the PSU has solved my problem, or the disk cooler cools some
> > critical component on the disk better.
>
> My machine has two PSUs and the disks are shared between them. I doubt
> that power is an issue.
>
> Will try an upgrade to Ubuntu Breezy (5.10) when I'm feeling brave!
Of course you could also try to compile your kernel yourself (like I do
and therefore I am not affected by bugs already solved)
* Re: SATA150TX4 atat1:command timeout
2005-10-17 10:08 ` Erik Slagter
@ 2005-10-17 13:01 ` Ian Oliver
0 siblings, 0 replies; 15+ messages in thread
From: Ian Oliver @ 2005-10-17 13:01 UTC
To: linux-ide
In article <1129543730.5945.17.camel@localhost.localdomain>, Erik Slagter
wrote:
> > Will try an upgrade to Ubuntu Breezy (5.10) when I'm feeling brave!
>
> Of course you could also try to compile your kernel yourself (like I do
> and therefore I am not affected by bugs already solved)
I've done this a few times, but I wasn't quite sure which of the bugs was
biting me. Anyway, Breezy is downloaded and ready to go.
Regards
Ian