Re: help with PMP failures

From: Marc MERLIN <marc@merlins.org>
To: Tejun Heo <tj@kernel.org>
Cc: Tejun Heo <htejun@gmail.com>, linux-ide@vger.kernel.org
Subject: Re: help with PMP failures
Date: Tue, 17 Nov 2009 09:39:55 -0800
Message-ID: <20091117173955.GA19029@merlins.org>
In-Reply-To: <4B0238EC.6060803@kernel.org>

On Tue, Nov 17, 2009 at 02:47:24PM +0900, Tejun Heo wrote:
> Hello,
> 
> Can you please cc linux-ide@vger.kernel.org?

Absolutely, didn't know it was good for PMP too. Done.

> > Nov  2 17:03:17 gargamel kernel: ata6.15: exception Emask 0x100 SAct 0x0 SErr 0x200000 action 0x6 frozen
> > Nov  2 17:03:17 gargamel kernel: ata6.15: irq_stat 0x02060002, PMP DMA CS errata
> 
> Command execution error reported.
> 
> Sil3124/32 has an errata which worsens PMP error handling quite a bit.
> Its DMA context gets corrupt if a failure occurs when commands are in
> flight to 3 or more commands, so the driver has to abort all commands
> immediately.

gotcha

> This is the actual failure.  Your 6.02 drive reported media error
> which combined with the controller errata caused port wide failure.
 
Ah, I see, so that's the one I should focus on.
If it hadn't had an error, everything else wouldn't have gone down the
toilet afterwards, right?

scsi 6:2:0:0: Direct-Access     ATA      Hitachi HDS72101 GKAO PQ: 0 ANSI: 5
sd 6:2:0:0: [sdj] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)

If it's a media error, shouldn't it show up in the SMART counters?
=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K1000
Device Model:     Hitachi HDS721010KLA330
Serial Number:    GTJ000PAG2JLKC
Firmware Version: GKAOA70F
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Tue Nov 17 09:32:47 2009 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       150
  3 Spin_Up_Time            0x0007   105   105   024    Pre-fail  Always       -       662 (Average 662)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       179
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       33
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       18566
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       92
192 Power-Off_Retract_Count 0x0032   061   061   000    Old_age   Always       -       47436
193 Load_Cycle_Count        0x0012   061   061   000    Old_age   Always       -       47436
194 Temperature_Celsius     0x0002   125   125   000    Old_age   Always       -       48 (Lifetime Min/Max 20/63)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       359
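
To make the counters easier to eyeball, here is a minimal Python sketch that pulls the raw values out of an attribute table like the one above and lists the ones that are non-zero. The choice of which attributes to watch (WATCHED) and the function names are assumptions made for this illustration; smartctl itself defines neither, and the regular expression only assumes the column layout shown above.

```python
import re

# Attributes picked for this sketch as hints of media or link trouble.
# This list is an assumption of the example, not something smartctl defines.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)

# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute name -> leading integer of its RAW_VALUE column."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group(2)] = int(m.group(3))
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) != 0}


# A few rows copied from the table above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       359
"""

if __name__ == "__main__":
    for name, raw in nonzero_watched(parse_attributes(SAMPLE)).items():
        print(f"{name}: {raw}")
```

On the rows above this reports one reallocated sector, one reallocation event and 359 UDMA CRC errors, while Current_Pending_Sector and Offline_Uncorrectable are both 0.
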

> The device gets kicked out of the system so the errors follow.  I have
> no idea why ata6.00 decided to stop responding.  It might be a
> firmware bug or the PMP is malfunctioning.  If this happens again, you
> can verify that by detaching the offending drive from the PMP without
> disconnecting power (the drive stays powered up) and then connect it
> in a different port and see whether it works.  If it doesn't, it means
> the firmware on the drive is firmly hung and will require power cycle
> to get working again.  Earlier SATA drives and a few of the recent ones
> sometimes do this after certain failures.
 
I can't really move it to another PMP port but I have indeed had failures
that required not just a reboot of my server but an actual power cycle
of the drive.

> Anyways, if my guess is right, the sequence of events is first the
> drive with bad sector led to EH kicking in abruptly due to controller
> errata, which in turn caused another drive to lock up due to its
> firmware problem.

Ok, so this all sounds like it's a bit fragile due to hardware issues :)

I now have to figure out if /dev/sdj has a bad sector or not.

Last time this happened, though, I did run
dd if=/dev/drive of=/dev/null bs=1M
on my 5 drives, and it ran clean.

If I had a bad sector, shouldn't it show up in Current_Pending_Sector
and shouldn't reading the entire drive with dd fail?
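
For reference, the full-read check can also be sketched in Python so that it records where reads fail instead of stopping at the first error. This is only an illustration of the idea, not what dd does internally: the function name and the skip-one-block-on-error policy are assumptions of the sketch, and reads go through the page cache here, so recently cached blocks may be served without touching the disk.

```python
import os


def scan_for_read_errors(path, block_size=1024 * 1024):
    """Read `path` sequentially in block_size chunks.

    Returns (bytes_read, failed_offsets). A failed read is recorded by its
    starting offset and the scan skips ahead one block, so one bad area
    does not end the run.
    """
    failed_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and, on
        # Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                chunk = os.pread(fd, want, offset)
            except OSError:
                failed_offsets.append(offset)
                offset += want
                continue
            if not chunk:
                break  # unexpected end of data
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, failed_offsets


if __name__ == "__main__":
    import sys

    total, bad = scan_for_read_errors(sys.argv[1])
    print(f"read {total} bytes, {len(bad)} failed block(s)")
    for off in bad:
        print(f"  read error at byte offset {off}")
```

Run it as root against the device node, the same path that was given to dd as if=.
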

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems & security ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
