* Re: help with PMP failures
From: Marc MERLIN @ 2009-11-17 17:39 UTC
To: Tejun Heo; +Cc: Tejun Heo, linux-ide
On Tue, Nov 17, 2009 at 02:47:24PM +0900, Tejun Heo wrote:
> Hello,
>
> Can you please cc linux-ide@vger.kernel.org?
Absolutely, didn't know it was good for PMP too. Done.
> > Nov 2 17:03:17 gargamel kernel: ata6.15: exception Emask 0x100 SAct 0x0 SErr 0x200000 action 0x6 frozen
> > Nov 2 17:03:17 gargamel kernel: ata6.15: irq_stat 0x02060002, PMP DMA CS errata
>
> Command execution error reported.
>
> Sil3124/32 has an errata which worsens PMP error handling quite a bit.
> Its DMA context gets corrupted if a failure occurs while commands are in
> flight to 3 or more devices, so the driver has to abort all commands
> immediately.
gotcha
> This is the actual failure. Your 6.02 drive reported a media error,
> which combined with the controller errata caused a port-wide failure.
Ah, I see, so it should be the one for me to focus on.
If it hadn't had that error, everything else wouldn't have gone down the
toilet next, right?
scsi 6:2:0:0: Direct-Access ATA Hitachi HDS72101 GKAO PQ: 0 ANSI: 5
sd 6:2:0:0: [sdj] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
If it's a media error, shouldn't it show up in the smart counters?
=== START OF INFORMATION SECTION ===
Model Family: Hitachi Deskstar 7K1000
Device Model: Hitachi HDS721010KLA330
Serial Number: GTJ000PAG2JLKC
Firmware Version: GKAOA70F
User Capacity: 1,000,204,886,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Tue Nov 17 09:32:47 2009 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 130 130 054 Pre-fail Offline - 150
3 Spin_Up_Time 0x0007 105 105 024 Pre-fail Always - 662 (Average 662)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 179
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 132 132 020 Pre-fail Offline - 33
9 Power_On_Hours 0x0012 098 098 000 Old_age Always - 18566
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 92
192 Power-Off_Retract_Count 0x0032 061 061 000 Old_age Always - 47436
193 Load_Cycle_Count 0x0012 061 061 000 Old_age Always - 47436
194 Temperature_Celsius 0x0002 125 125 000 Old_age Always - 48 (Lifetime Min/Max 20/63)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 359
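(For reference, here is roughly how I pull out just the attributes that
normally track media problems; the /dev/sdj path is an assumption for the
drive in question:)
# sketch: show only the media-error-related counters from the dump above
smartctl -A /dev/sdj | grep -E 'Reallocated|Pending|Uncorrectable|UDMA_CRC'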
> The device gets kicked out of the system, so the errors follow. I have
> no idea why ata6.00 decided to stop responding. It might be a
> firmware bug, or the PMP might be malfunctioning. If this happens again,
> you can verify that by detaching the offending drive from the PMP without
> disconnecting power (the drive stays powered up), connecting it to a
> different port, and seeing whether it works. If it doesn't, it means
> the firmware on the drive is firmly hung and will require a power cycle
> to get working again. Earlier SATA drives, and a few recent ones,
> sometimes do this after certain failures.
I can't really move it to another PMP port but I have indeed had failures
that required not just a reboot of my server but an actual power cycle
of the drive.
> Anyway, if my guess is right, the sequence of events is: first, the
> drive with the bad sector led to EH kicking in abruptly due to the
> controller errata, which in turn caused another drive to lock up due to
> its firmware problem.
Ok, so this all sounds like it's a bit fragile due to hardware issues :)
I now have to figure out if /dev/sdj has a bad sector or not.
Last time I had this happen, though, I did run
dd if=/dev/drive of=/dev/null bs=1M
for my 5 drives, and it ran clean.
If I had a bad sector, shouldn't it show up in Current_Pending_Sector
and shouldn't reading the entire drive with dd fail?
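(A sketch of the check I'd redo to answer that; the dd line is the same one
as above, while the drive list and the smartctl re-check are assumptions:)
# re-read each array member end to end, keep dd's summary lines, then
# re-check the sector counters afterwards
for d in sdf sdg sdh sdi sdj; do
    echo "=== /dev/$d ==="
    dd if=/dev/$d of=/dev/null bs=1M 2>&1 | tail -3
    smartctl -A /dev/$d | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
done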
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems & security ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: help with PMP failures
From: Tejun Heo @ 2009-11-18 4:03 UTC
To: Marc MERLIN; +Cc: Tejun Heo, linux-ide
Hello,
11/18/2009 02:39 AM, Marc MERLIN wrote:
>> This is the actual failure. Your 6.02 drive reported a media error,
>> which combined with the controller errata caused a port-wide failure.
>
> Ah, I see, so it should be the one for me to focus on.
> If it hadn't had that error, everything else wouldn't have gone down the
> toilet next, right?
Yes, that's my guess.
> scsi 6:2:0:0: Direct-Access ATA Hitachi HDS72101 GKAO PQ: 0 ANSI: 5
> sd 6:2:0:0: [sdj] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
>
> If it's a media error, shouldn't it show up in the smart counters?
Does the smartctl -a output show any logged errors?
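(smartctl -a includes the drive's error log; if the full dump is too long,
something like this should show just the error log and self-test results,
with the device path being an assumption:)
smartctl -l error /dev/sdj      # ATA error log only
smartctl -l selftest /dev/sdj   # self-test results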
> I can't really move it to another PMP port but I have indeed had failures
> that required not just a reboot of my server but an actual power cycle
> of the drive.
Yeah, some old drives do that after being abruptly aborted while executing
commands. :-(
> Ok, so this all sounds like it's a bit fragile due to hardware issues :)
>
> I now have to figure out if /dev/sdj has a bad sector or not.
>
> Last time I had this happen, though, I did run
> dd if=/dev/drive of=/dev/null bs=1M
> for my 5 drives, and it ran clean.
>
> If I had a bad sector, shouldn't it show up in Current_Pending_Sector
> and shouldn't reading the entire drive with dd fail?
I'm not sure which SMART counter would be affected. It also depends
on the firmware implementation: a read error might happen on one trial
but not on the next (if the drive for some reason didn't relocate the
failed sector), or maybe the drive is continuously developing bad
sectors.
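(If you want the drive itself to re-scan the whole media, an extended
self-test is one way to do it; just a sketch, device path assumed:)
smartctl -t long /dev/sdj       # start an extended offline self-test
# once it finishes (smartctl prints a duration estimate), check the result:
smartctl -l selftest /dev/sdj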
Thanks.
--
tejun
* Re: help with PMP failures
From: Marc MERLIN @ 2009-11-18 7:41 UTC
To: Tejun Heo; +Cc: Tejun Heo, linux-ide
On Wed, Nov 18, 2009 at 01:03:00PM +0900, Tejun Heo wrote:
> Hello,
>
> 11/18/2009 02:39 AM, Marc MERLIN wrote:
> >> This is the actual failure. Your 6.02 drive reported a media error,
> >> which combined with the controller errata caused a port-wide failure.
> >
> > Ah, I see, so it should be the one for me to focus on.
> > If it hadn't had that error, everything else wouldn't have gone down the
> > toilet next, right?
>
> Yes, that's my guess.
>
> > scsi 6:2:0:0: Direct-Access ATA Hitachi HDS72101 GKAO PQ: 0 ANSI: 5
> > sd 6:2:0:0: [sdj] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
> >
> > If it's a media error, shouldn't it show up in the smart counters?
>
> Does the smartctl -a output show any logged errors?
Didn't know about -a, good call.
Yep, they do, and the times are roughly consistent with the raid going down.
Not sure if the last 5 errors are enough to give a good clue, or not.
If I can't quite figure it out, I'll just pop in a 16-port Adaptec SATA
board I recently picked up. It'll take the PMP out of the equation if I truly
have drives returning read/write errors that don't quite seem to show up
in SMART yet.
9 Power_On_Hours 0x0012 098 098 000 Old_age Always - 18580
SMART Error Log Version: 1
ATA Error Count: 47 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 47 occurred at disk power-on lifetime: 18547 hours (772 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 46 59 70 44
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 00 3f 59 70 40 08 13d+19:41:56.500 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 08 13d+19:41:56.500 FLUSH CACHE EXT
ea 00 00 00 00 00 a0 08 13d+19:41:56.200 FLUSH CACHE EXT
61 08 00 3f 59 70 40 08 13d+19:41:56.200 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 08 13d+19:41:56.200 FLUSH CACHE EXT
Error 46 occurred at disk power-on lifetime: 18546 hours (772 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 46 59 70 44
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 00 3f 59 70 40 08 13d+19:27:26.400 WRITE FPDMA QUEUED
27 00 00 00 00 00 e0 08 13d+19:27:26.400 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 08 13d+19:27:26.400 IDENTIFY DEVICE
ef 03 45 00 00 00 a0 08 13d+19:27:26.400 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 08 13d+19:27:26.400 READ NATIVE MAX ADDRESS EXT
Error 45 occurred at disk power-on lifetime: 18546 hours (772 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 46 59 70 44
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 00 3f 59 70 40 08 13d+19:27:26.100 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 08 13d+19:27:26.100 FLUSH CACHE EXT
ea 00 00 00 00 00 a0 08 13d+19:27:25.800 FLUSH CACHE EXT
61 08 00 3f 59 70 40 08 13d+19:27:25.800 WRITE FPDMA QUEUED
27 00 00 00 00 00 e0 08 13d+19:27:25.800 READ NATIVE MAX ADDRESS EXT
Error 44 occurred at disk power-on lifetime: 18546 hours (772 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 46 59 70 44
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 00 3f 59 70 40 08 13d+19:27:24.900 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 08 13d+19:27:24.900 FLUSH CACHE EXT
b0 d0 01 00 4f c2 00 08 13d+19:20:22.300 SMART READ DATA
b0 d8 00 00 4f c2 00 08 13d+19:20:22.100 SMART ENABLE OPERATIONS
e5 00 00 00 00 00 00 08 13d+19:20:22.100 CHECK POWER MODE
Error 43 occurred at disk power-on lifetime: 18546 hours (772 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 46 59 70 44
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 00 3f 59 70 40 08 13d+19:12:24.500 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 08 13d+19:12:24.500 FLUSH CACHE EXT
ea 00 00 00 00 00 a0 08 13d+19:12:23.600 FLUSH CACHE EXT
61 08 00 3f 59 70 40 08 13d+19:12:23.600 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 08 13d+19:12:23.600 FLUSH CACHE EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 18565 -
# 2 Short offline Completed without error 00% 18535 -
# 3 Short offline Completed without error 00% 18511 -
# 4 Extended offline Completed without error 00% 18492 -
# 5 Short offline Completed without error 00% 18487 -
# 6 Short offline Completed without error 00% 18463 -
# 7 Short offline Completed without error 00% 18439 -
# 8 Short offline Completed without error 00% 18415 -
# 9 Short offline Completed without error 00% 18391 -
#10 Short offline Completed without error 00% 18366 -
#11 Short offline Completed without error 00% 18343 -
#12 Extended offline Completed without error 00% 18324 -
#13 Short offline Completed without error 00% 18319 -
#14 Short offline Completed without error 00% 18295 -
#15 Short offline Completed without error 00% 18271 -
#16 Short offline Completed without error 00% 18247 -
#17 Short offline Completed without error 00% 18223 -
#18 Short offline Completed without error 00% 18199 -
#19 Short offline Completed without error 00% 18178 -
#20 Short offline Completed without error 00% 18178 -
#21 Short offline Completed without error 00% 18178 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Thanks for looking,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems & security ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: help with PMP failures
From: Tejun Heo @ 2009-11-18 8:33 UTC
To: Marc MERLIN; +Cc: Tejun Heo, linux-ide
Hello,
11/18/2009 04:41 PM, Marc MERLIN wrote:
> Didn't know about -a, good call.
> Yep, they do, and the times are roughly consistent with the raid going down.
>
> Not sure if the last 5 errors are enough to give a good clue, or not.
>
> If I can't quite figure it out, I'll just pop in a 16-port Adaptec SATA
> board I recently picked up. It'll take the PMP out of the equation if I truly
> have drives returning read/write errors that don't quite seem to show up
> in SMART yet.
>
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 84 51 00 46 59 70 44
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
> -- -- -- -- -- -- -- -- ---------------- --------------------
> 61 08 00 3f 59 70 40 08 13d+19:41:56.500 WRITE FPDMA QUEUED
All the logged commands are writes, but the one which triggered the
failure was a read. The error value of 0x84 indicates ICRC and ABORT, so
all the logged commands failed due to a transmission failure from the
host. Hmmm....
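(That decode is just two bits of the ATA error register: ICRC is bit 7 and
ABRT is bit 2, so for example in a shell:)
printf '0x%02x\n' $(( 0x80 | 0x04 ))    # ICRC | ABRT = 0x84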
--
tejun
* Re: help with PMP failures
From: Marc MERLIN @ 2009-11-18 18:29 UTC
To: Tejun Heo; +Cc: Tejun Heo, linux-ide
On Wed, Nov 18, 2009 at 05:33:23PM +0900, Tejun Heo wrote:
> All the logged commands are writes, but the one which triggered the
> failure was a read. The error value of 0x84 indicates ICRC and ABORT, so
> all the logged commands failed due to a transmission failure from the
> host. Hmmm....
In case it helps, I dug up my 3rd such failure in my logs (from about a month
ago now).
This should let you confirm or contradict your earlier suspicions.
Funny how it also starts with a CRC error on the PMP.
For that matter, it's a similar failure string to the previous
one I posted.
However, this trace doesn't show any media error.
Comparing this with the last one you looked at, does it help pinpoint where
the fault might be?
Oct 17 21:18:25 gargamel kernel: ata6.00: failed to read SCR 1 (Emask=0x40)
Oct 17 21:18:25 gargamel kernel: ata6.01: failed to read SCR 1 (Emask=0x40)
Oct 17 21:18:25 gargamel kernel: ata6.02: failed to read SCR 1 (Emask=0x40)
Oct 17 21:18:25 gargamel kernel: ata6.03: failed to read SCR 1 (Emask=0x40)
Oct 17 21:18:25 gargamel kernel: ata6.04: failed to read SCR 1 (Emask=0x40)
Oct 17 21:18:25 gargamel kernel: ata6.05: failed to read SCR 1 (Emask=0x40)
Oct 17 21:18:25 gargamel kernel: ata6.15: exception Emask 0x100 SAct 0x0 SErr 0x200000 action 0x6 frozen
Oct 17 21:18:25 gargamel kernel: ata6.15: irq_stat 0x02060002, PMP DMA CS errata
Oct 17 21:18:25 gargamel kernel: ata6.15: SError: { BadCRC }
Oct 17 21:18:25 gargamel kernel: ata6.00: exception Emask 0x100 SAct 0xa SErr 0x0 action 0x6 frozen
Oct 17 21:18:25 gargamel kernel: ata6.00: cmd 60/80:08:bf:67:cc/00:00:2b:00:00/40 tag 1 ncq 65536 in
Oct 17 21:18:25 gargamel kernel: res 3c/36:00:00:00:00/cd:00:40:10:3c/00 Emask 0x2 (HSM violation)
Oct 17 21:18:25 gargamel kernel: ata6.00: status: { DF DRQ }
Oct 17 21:18:25 gargamel kernel: ata6.00: error: { IDNF ABRT }
Oct 17 21:18:25 gargamel kernel: ata6.00: cmd 60/10:18:3f:68:cc/00:00:2b:00:00/40 tag 3 ncq 8192 in
Oct 17 21:18:25 gargamel kernel: res 60/10:18:3f:68:cc/00:00:2b:00:00/40 Emask 0x81 (invalid argument)
Oct 17 21:18:25 gargamel kernel: ata6.00: status: { DRDY DF }
Oct 17 21:18:25 gargamel kernel: ata6.00: error: { IDNF }
Oct 17 21:18:25 gargamel kernel: ata6.01: exception Emask 0x100 SAct 0x885 SErr 0x0 action 0x6 frozen
Oct 17 21:18:25 gargamel kernel: ata6.01: cmd 60/70:00:cf:66:cc/00:00:2b:00:00/40 tag 0 ncq 57344 in
Oct 17 21:18:25 gargamel kernel: res 3c/36:00:00:00:00/cd:00:00:00:3c/00 Emask 0x2 (HSM violation)
Oct 17 21:18:25 gargamel kernel: ata6.01: status: { DF DRQ }
Oct 17 21:18:25 gargamel kernel: ata6.01: error: { IDNF ABRT }
Oct 17 21:18:25 gargamel kernel: ata6.01: cmd 60/10:10:bf:66:cc/00:00:2b:00:00/40 tag 2 ncq 8192 in
Oct 17 21:18:25 gargamel kernel: res 3c/36:00:00:00:00/00:00:00:20:3c/00 Emask 0x2 (HSM violation)
Oct 17 21:18:25 gargamel kernel: ata6.01: status: { DF DRQ }
Oct 17 21:18:25 gargamel kernel: ata6.01: error: { IDNF ABRT }
Oct 17 21:18:25 gargamel kernel: ata6.01: cmd 60/80:38:3f:67:cc/00:00:2b:00:00/40 tag 7 ncq 65536 in
Oct 17 21:18:25 gargamel kernel: res 3c/36:00:00:00:00/00:00:00:70:3c/00 Emask 0x2 (HSM violation)
Oct 17 21:18:25 gargamel kernel: ata6.01: status: { DF DRQ }
Oct 17 21:18:25 gargamel kernel: ata6.01: error: { IDNF ABRT }
Oct 17 21:18:25 gargamel kernel: ata6.01: cmd 60/10:58:bf:67:cc/00:00:2b:00:00/40 tag 11 ncq 8192 in
Oct 17 21:18:25 gargamel kernel: res 3c/36:00:00:00:00/00:00:00:b0:3c/00 Emask 0x2 (HSM violation)
Oct 17 21:18:25 gargamel kernel: ata6.01: status: { DF DRQ }
Oct 17 21:18:25 gargamel kernel: ata6.01: error: { IDNF ABRT }
Oct 17 21:18:26 gargamel kernel: ata6.02: exception Emask 0x1 SAct 0x1100 SErr 0x0 action 0x6 frozen
Oct 17 21:18:26 gargamel kernel: ata6.02: irq_stat 0x02060002, device error via SDB FIS
Oct 17 21:18:26 gargamel kernel: ata6.02: cmd 60/70:40:cf:66:cc/00:00:2b:00:00/40 tag 8 ncq 57344 in
Oct 17 21:18:26 gargamel kernel: res 3c/36:00:00:00:00/00:00:80:80:3c/00 Emask 0x3 (HSM violation)
Oct 17 21:18:26 gargamel kernel: ata6.02: status: { DF DRQ }
Oct 17 21:18:26 gargamel kernel: ata6.02: error: { IDNF ABRT }
Oct 17 21:18:26 gargamel kernel: ata6.02: cmd 60/80:60:3f:67:cc/00:00:2b:00:00/40 tag 12 ncq 65536 in
Oct 17 21:18:26 gargamel kernel: res 60/80:60:3f:67:cc/00:00:2b:00:00/40 Emask 0x10 (ATA bus error)
Oct 17 21:18:26 gargamel kernel: ata6.02: status: { DRDY DF }
Oct 17 21:18:26 gargamel kernel: ata6.02: error: { ICRC }
Oct 17 21:18:26 gargamel kernel: ata6.03: exception Emask 0x100 SAct 0x2000 SErr 0x0 action 0x6 frozen
Oct 17 21:18:26 gargamel kernel: ata6.03: cmd 60/80:68:bf:67:cc/00:00:2b:00:00/40 tag 13 ncq 65536 in
Oct 17 21:18:26 gargamel kernel: res 3c/36:00:00:00:00/00:00:c0:d0:3c/00 Emask 0x2 (HSM violation)
Oct 17 21:18:26 gargamel kernel: ata6.03: status: { DF DRQ }
Oct 17 21:18:26 gargamel kernel: ata6.03: error: { IDNF ABRT }
Oct 17 21:18:26 gargamel kernel: ata6.04: exception Emask 0x100 SAct 0x40 SErr 0x0 action 0x6 frozen
Oct 17 21:18:26 gargamel kernel: ata6.04: cmd 60/70:30:4f:67:cc/00:00:2b:00:00/40 tag 6 ncq 57344 in
Oct 17 21:18:26 gargamel kernel: res 3c/36:00:00:00:00/00:00:40:60:3c/00 Emask 0x2 (HSM violation)
Oct 17 21:18:26 gargamel kernel: ata6.04: status: { DF DRQ }
Oct 17 21:18:26 gargamel kernel: ata6.04: error: { IDNF ABRT }
Oct 17 21:18:26 gargamel kernel: ata6.05: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 17 21:18:26 gargamel kernel: ata6.15: hard resetting link
Oct 17 21:18:26 gargamel kernel: ata6: controller in dubious state, performing PORT_RST
Oct 17 21:18:28 gargamel kernel: ata6.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Oct 17 21:18:28 gargamel kernel: ata6.00: hard resetting link
Oct 17 21:18:28 gargamel kernel: ata6.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct 17 21:18:28 gargamel kernel: ata6.01: hard resetting link
Oct 17 21:18:28 gargamel kernel: ata6.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:28 gargamel kernel: ata6.02: hard resetting link
Oct 17 21:18:29 gargamel kernel: ata6.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:29 gargamel kernel: ata6.03: hard resetting link
Oct 17 21:18:29 gargamel kernel: ata6.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:29 gargamel kernel: ata6.04: hard resetting link
Oct 17 21:18:29 gargamel kernel: ata6.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:29 gargamel kernel: ata6.05: hard resetting link
Oct 17 21:18:30 gargamel kernel: ata6.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
Oct 17 21:18:30 gargamel kernel: ata6.00: configured for UDMA/100
Oct 17 21:18:35 gargamel kernel: ata6.01: qc timeout (cmd 0xec)
Oct 17 21:18:35 gargamel kernel: ata6.01: failed to IDENTIFY (I/O error, err_mask=0x5)
Oct 17 21:18:35 gargamel kernel: ata6.01: revalidation failed (errno=-5)
Oct 17 21:18:35 gargamel kernel: ata6.15: hard resetting link
Oct 17 21:18:37 gargamel kernel: ata6.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Oct 17 21:18:37 gargamel kernel: ata6.00: hard resetting link
Oct 17 21:18:37 gargamel kernel: ata6.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct 17 21:18:37 gargamel kernel: ata6.01: hard resetting link
Oct 17 21:18:37 gargamel kernel: ata6.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:37 gargamel kernel: ata6.02: hard resetting link
Oct 17 21:18:38 gargamel kernel: ata6.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:38 gargamel kernel: ata6.03: hard resetting link
Oct 17 21:18:38 gargamel kernel: ata6.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:38 gargamel kernel: ata6.04: hard resetting link
Oct 17 21:18:38 gargamel kernel: ata6.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:38 gargamel kernel: ata6.05: hard resetting link
Oct 17 21:18:39 gargamel kernel: ata6.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
Oct 17 21:18:39 gargamel kernel: ata6.00: configured for UDMA/100
Oct 17 21:18:49 gargamel kernel: ata6.01: qc timeout (cmd 0xec)
Oct 17 21:18:49 gargamel kernel: ata6.01: failed to IDENTIFY (I/O error, err_mask=0x5)
Oct 17 21:18:49 gargamel kernel: ata6.01: revalidation failed (errno=-5)
Oct 17 21:18:49 gargamel kernel: ata6.15: hard resetting link
Oct 17 21:18:51 gargamel kernel: ata6.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Oct 17 21:18:51 gargamel kernel: ata6.00: hard resetting link
Oct 17 21:18:51 gargamel kernel: ata6.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct 17 21:18:51 gargamel kernel: ata6.01: hard resetting link
Oct 17 21:18:51 gargamel kernel: ata6.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:51 gargamel kernel: ata6.02: hard resetting link
Oct 17 21:18:52 gargamel kernel: ata6.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:52 gargamel kernel: ata6.03: hard resetting link
Oct 17 21:18:52 gargamel kernel: ata6.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:52 gargamel kernel: ata6.04: hard resetting link
Oct 17 21:18:52 gargamel kernel: ata6.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:18:52 gargamel kernel: ata6.05: hard resetting link
Oct 17 21:18:53 gargamel kernel: ata6.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
Oct 17 21:18:53 gargamel kernel: ata6.00: configured for UDMA/100
Oct 17 21:19:23 gargamel kernel: ata6.01: qc timeout (cmd 0xec)
Oct 17 21:19:23 gargamel kernel: ata6.01: failed to IDENTIFY (I/O error, err_mask=0x5)
Oct 17 21:19:23 gargamel kernel: ata6.01: revalidation failed (errno=-5)
Oct 17 21:19:23 gargamel kernel: ata6.01: failed to recover link after 3 tries, disabling
Oct 17 21:19:23 gargamel kernel: ata6.01: disabled
Oct 17 21:19:23 gargamel kernel: ata6.15: hard resetting link
Oct 17 21:19:25 gargamel kernel: ata6.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Oct 17 21:19:25 gargamel kernel: ata6.00: hard resetting link
Oct 17 21:19:25 gargamel kernel: ata6.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct 17 21:19:25 gargamel kernel: ata6.02: hard resetting link
Oct 17 21:19:26 gargamel kernel: ata6.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:19:26 gargamel kernel: ata6.03: hard resetting link
Oct 17 21:19:26 gargamel kernel: ata6.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:19:26 gargamel kernel: ata6.04: hard resetting link
Oct 17 21:19:26 gargamel kernel: ata6.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 17 21:19:26 gargamel kernel: ata6.05: hard resetting link
Oct 17 21:19:27 gargamel kernel: ata6.05: SATA link up 1.5 Gbps (SStatus 113 SControl 320)
Oct 17 21:19:27 gargamel kernel: ata6.00: configured for UDMA/100
Oct 17 21:19:27 gargamel kernel: ata6.02: configured for UDMA/100
Oct 17 21:19:27 gargamel kernel: ata6.03: configured for UDMA/100
Oct 17 21:19:27 gargamel kernel: ata6.04: configured for UDMA/100
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Unhandled sense code
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Sense Key : Hardware Error [current] [descriptor]
Oct 17 21:19:27 gargamel kernel: Descriptor sense data with sense descriptors (in hex):
Oct 17 21:19:27 gargamel kernel: 72 04 00 00 00 00 00 0c 00 0a 80 00 00 00 3c 00
Oct 17 21:19:27 gargamel kernel: 00 00 00 00
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Add. Sense: No additional sense information
Oct 17 21:19:27 gargamel kernel: end_request: I/O error, dev sdi, sector 734815951
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: rejecting I/O to offline device
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Unhandled error code
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Oct 17 21:19:27 gargamel kernel: end_request: I/O error, dev sdi, sector 734816207
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Unhandled sense code
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Sense Key : Hardware Error [current] [descriptor]
Oct 17 21:19:27 gargamel kernel: Descriptor sense data with sense descriptors (in hex):
Oct 17 21:19:27 gargamel kernel: 72 04 00 00 00 00 00 0c 00 0a 80 00 00 00 3c 20
Oct 17 21:19:27 gargamel kernel: 00 00 00 00
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Add. Sense: No additional sense information
Oct 17 21:19:27 gargamel kernel: end_request: I/O error, dev sdi, sector 734815935
Oct 17 21:19:27 gargamel kernel: sd 6:0:0:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 17 21:19:27 gargamel kernel: sd 6:0:0:0: [sdh] Sense Key : Aborted Command [current] [descriptor]
Oct 17 21:19:27 gargamel kernel: Descriptor sense data with sense descriptors (in hex):
Oct 17 21:19:27 gargamel kernel: 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Oct 17 21:19:27 gargamel kernel: 2b cc 68 3f
Oct 17 21:19:27 gargamel kernel: sd 6:0:0:0: [sdh] Add. Sense: Recorded entity not found
Oct 17 21:19:27 gargamel kernel: end_request: I/O error, dev sdh, sector 734816319
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Unhandled sense code
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: rejecting I/O to offline device
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Sense Key : Hardware Error [current] [descriptor]
Oct 17 21:19:27 gargamel kernel: Descriptor sense data with sense descriptors (in hex):
Oct 17 21:19:27 gargamel kernel: 72 04 00 00 00 00 00 0c 00 0a 80 00 00 00 3c 70
Oct 17 21:19:27 gargamel kernel: 00 00 00 00
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Add. Sense: No additional sense information
Oct 17 21:19:27 gargamel kernel: end_request: I/O error, dev sdi, sector 734816063
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Unhandled sense code
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Sense Key : Hardware Error [current] [descriptor]
Oct 17 21:19:27 gargamel kernel: Descriptor sense data with sense descriptors (in hex):
Oct 17 21:19:27 gargamel kernel: 72 04 00 00 00 00 00 0c 00 0a 80 00 00 00 3c b0
Oct 17 21:19:27 gargamel kernel: 00 00 00 00
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Add. Sense: No additional sense information
Oct 17 21:19:27 gargamel kernel: end_request: I/O error, dev sdi, sector 734816191
Oct 17 21:19:27 gargamel kernel: ata6: EH complete
Oct 17 21:19:27 gargamel kernel: ata6.01: detaching (SCSI 6:1:0:0)
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Synchronizing SCSI cache
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Stopping disk
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] START_STOP FAILED
Oct 17 21:19:27 gargamel kernel: sd 6:1:0:0: [sdi] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct 17 21:19:27 gargamel kernel: raid5: Disk failure on sdi1, disabling device.
Oct 17 21:19:27 gargamel kernel: raid5: Operation continuing on 4 devices.
Oct 17 21:19:27 gargamel kernel: raid5: Disk failure on sdh1, disabling device.
Oct 17 21:19:27 gargamel kernel: raid5: Operation continuing on 3 devices.
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems & security ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/