From: Tejun Heo <htejun@gmail.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mark Hatle <fray@gate.crashing.org>, linux-ide@vger.kernel.org
Subject: Re: sata_nv failure on 2.4.18 (Fedora Core 6)
Date: Wed, 25 Oct 2006 14:46:08 +0900
Message-ID: <453EFA20.9050208@gmail.com>
In-Reply-To: <1161754028.22582.47.camel@localhost.localdomain>

Benjamin Herrenschmidt wrote:
> On Wed, 2006-10-25 at 00:03 -0500, Mark Hatle wrote:
>> Just an FYI.  I rebooted the machine w/ "noapic" (same kernel).  During 
>> the RAID rebuild I got the following.. (but it did not stop processing)
>>
>> ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x1950000 action 0x2 frozen
>> ata1.00: tag 0 cmd 0xea Emask 0x14 stat 0x40 err 0x0 (ATA bus error)
>> ata1: soft resetting port
>> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ata1.00: configured for UDMA/133
>> ata1: EH complete
>> SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
>> sda: Write Protect is off
>> sda: Mode Sense: 00 3a 00 00
>> SCSI device sda: drive cache: write back
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: (BMDMA stat 0x21)
>> ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
>> ata4.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x2 frozen
>> ata4.00: tag 0 cmd 0x60 Emask 0x4 stat 0x40 err 0x0 (timeout)
>> ata4.00: tag 1 cmd 0x60 Emask 0x4 stat 0x40 err 0x0 (timeout)
>> ata1: soft resetting port
>> ata4: soft resetting port
>> ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>> ata4.00: configured for UDMA/100
>> ata4: EH complete
>> SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
>> sdb: Write Protect is off
>> sdb: Mode Sense: 00 3a 00 00
>> SCSI device sdb: drive cache: write back
>> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ata1.00: configured for UDMA/133
>> ata1: EH complete
>> SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
>> sda: Write Protect is off
>> sda: Mode Sense: 00 3a 00 00
>> SCSI device sda: drive cache: write back
>>
>> Since both ata1 (sata_nv) and ata4 (sata_sil24) got this, could it be an 
>> interrupt problem?  Yuck

I dunno.  The weird IRQ problem theory was primarily based on the 
assumption that the problem occurs only on 2.6.18, so it's a bit flimsy 
at this point.  Can you verify that the system behaves differently with 
apic turned on and off?  You'll probably need to test several times to 
find the pattern.

It's very weird that both controllers timed out at the same moment.  Also, 
in the above log, sata_nv reports that the DMA engine was still running 
when it timed out, which suggests a transmission error.  So, it might be 
that your system has some problem which causes transient transmission 
errors, and sata_nv chokes on such events.  sata_sil24 should be able 
to handle all of those pretty well, though.
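
For reference, the "DMA engine still running" reading comes from the 
"BMDMA stat 0x21" line in the log.  Below is a minimal sketch of decoding 
that value, assuming the standard bus-master IDE status register layout 
(bit 0 active, bit 1 error, bit 2 interrupt, bits 5/6 drive DMA capable, 
bit 7 simplex).  The function and table names are made up for 
illustration; this is not code from libata.

```python
# Bit layout of the bus-master IDE (BMDMA) status register.
# Assumption: the controller follows the standard layout.
BMDMA_BITS = {
    0: "active (DMA engine still running)",
    1: "error",
    2: "interrupt",
    5: "drive 0 DMA capable",
    6: "drive 1 DMA capable",
    7: "simplex only",
}


def decode_bmdma_status(stat):
    """Return the names of the bits set in a BMDMA status byte."""
    return [name for bit, name in sorted(BMDMA_BITS.items())
            if stat & (1 << bit)]


# Value taken from the log above: "ata1.00: (BMDMA stat 0x21)"
for name in decode_bmdma_status(0x21):
    print(name)
```

With 0x21 the active bit is set while the error and interrupt bits are 
clear, which is consistent with a transfer that never completed.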

In the past, we had several similar cases.  Some of them turned out to 
be power problems (an insufficient and/or faulty PSU) and some others a 
faulty drive.  I think performing the regular hw debugging steps would 
be useful.

* Swap drives / cables / connected ports.  See whether the errors follow 
the controllers, the drives or the cables.

* If available, use a separate PSU to power the hard drives.  Just put 
another machine close by, take its SATA power cables and connect them to 
the drives.  SATA signals don't need a common ground, so there's nothing 
to worry about.
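
To keep track of which port the errors follow across swaps, something 
like the rough sketch below could tally the per-command error lines from 
a saved kernel log after each test run.  The regex only targets the 
"ataN.MM: tag ... (reason)" lines in the format shown in the log above; 
the helper name and the sample text are just for illustration.

```python
import re
from collections import Counter

# Matches per-command error lines such as:
#   ata4.00: tag 0 cmd 0x60 Emask 0x4 stat 0x40 err 0x0 (timeout)
CMD_LINE = re.compile(
    r"(ata\d+)\.\d+: tag \d+ cmd 0x[0-9a-f]+ .*\((.+)\)\s*$"
)


def tally_errors(log_text):
    """Count failed commands per (port, reason) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CMD_LINE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


# Lines copied from the log quoted above.
SAMPLE = """\
ata1.00: tag 0 cmd 0xea Emask 0x14 stat 0x40 err 0x0 (ATA bus error)
ata1.00: (BMDMA stat 0x21)
ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
ata4.00: tag 0 cmd 0x60 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata4.00: tag 1 cmd 0x60 Emask 0x4 stat 0x40 err 0x0 (timeout)
"""

for (port, reason), n in sorted(tally_errors(SAMPLE).items()):
    print(port, reason, n)
```

Comparing the tallies before and after swapping a drive or cable shows 
whether the errors stay with the port or move with the hardware.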

> If that is the case, then we should probably move the discussion to
> lkml...

I think diagnosing a bit more here can be helpful.

Thanks.

-- 
tejun