HDD Unrecovered readerror issue

linux-scsi.vger.kernel.org archive mirror

* HDD Unrecovered readerror issue
@ 2016-07-20  9:01 Dmitry Monakhov
  2016-07-22 13:51 ` Jeff Moyer
  0 siblings, 1 reply; 3+ messages in thread
From: Dmitry Monakhov @ 2016-07-20  9:01 UTC
  To: linux-scsi


Drive: WDC WD1003FZEX-00MK2A0
I got this in the logs:

ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/a0:a0:f0:c0:c5/00:00:04:00:00/40 tag 20 ncq 81920 in res 41/40:00:88:c1:c5/00:00:04:00:00/00 Emask 0x409 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] Sense Key : Medium Error[current] [descriptor]
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
sd 0:0:0:0: [sda] CDB: Read(10) 28 00 04 c5 c0 f0 00 00 a0 00
blk_update_request: I/O error, dev sda, sector 80069000
ata1: EH complete
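
The numbers in this log are consistent with each other, which can be checked by decoding the Read(10) CDB by hand. The following is a minimal sketch in POSIX shell, assuming the standard Read(10) layout (big-endian start LBA in bytes 2-5, transfer length in bytes 7-8, counting from 0) and 512-byte logical sectors. The function name `decode_read10` is made up for illustration.

```shell
#!/bin/sh
# Decode a SCSI Read(10) CDB given as ten space-separated hex bytes.
# Prints "<start LBA> <length in blocks>" in decimal.
decode_read10() {
    # Split the string into positional parameters:
    # $1=opcode $2=flags $3..$6=LBA $7=group $8,$9=length
    set -- $1
    lba=$(( 0x$3$4$5$6 ))
    len=$(( 0x$8$9 ))
    echo "$lba $len"
}

decode_read10 "28 00 04 c5 c0 f0 00 00 a0 00"   # prints: 80068848 160
```

For the CDB above this gives a start LBA of 80068848 and a length of 160 sectors (160 * 512 = 81920 bytes, matching "ncq 81920"), so the failing sector 80069000 lies inside the request.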

I can reproduce this easily:
#xfs_io -c "pread $((80069000/2))k 4k" -d  /dev/sda
pread64: Input/output error
## Got EIO
## smartctl also detects this
#smartctl -t short /dev/sda
#smartctl -l selftest /dev/sda
....
Short offline       Completed: read failure       90%      4682 80069000
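
The offset passed to xfs_io comes from converting the kernel's sector number to KiB (sector / 2, since the kernel reports 512-byte sectors). A minimal sketch of the same arithmetic in bytes, plus a way to build a read of exactly one sector with GNU dd, is shown here. The helper names are hypothetical, and the dd command is only printed, not executed.

```shell
#!/bin/sh
# Assumption: 512-byte sectors, as used in the kernel's
# "I/O error, dev sda, sector N" messages.

# Byte offset of a sector on the block device.
sector_to_bytes() {
    echo $(( $1 * 512 ))
}

# Print (do not run) a direct-I/O dd command that reads one sector.
read_one_sector_cmd() {
    echo "dd if=$1 of=/dev/null bs=512 skip=$2 count=1 iflag=direct"
}

sector_to_bytes 80069000               # prints: 40995328000
read_one_sector_cmd /dev/sda 80069000
```

Working in bytes or whole sectors avoids the rounding that "sector / 2" introduces for odd sector numbers, although a 4k read starting there would still cover the sector in question.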

But once I rewrite this block, the problem goes away.
#xfs_io -c "pwrite -S 0x0 $((80069000/2))k 4k" -d  /dev/sda

Now I can read it without any errors and smartctl is happy
#smartctl -t short /dev/sda
#smartctl -l selftest /dev/sda
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4683 -

So my disk is not dead, right? Why did the HDD fail the read in the first place?
Is this because the HDD firmware detected an internal checksum (CRC) corruption?
How can this happen? Is it because of a power failure?
AFAIK the standard guarantees that a sector will be updated atomically.
But it happens! Please advise how to fix such problems in general.



* Re: HDD Unrecovered readerror issue
  2016-07-20  9:01 HDD Unrecovered readerror issue Dmitry Monakhov
@ 2016-07-22 13:51 ` Jeff Moyer
  2016-07-22 14:33   ` James Bottomley
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff Moyer @ 2016-07-22 13:51 UTC
  To: Dmitry Monakhov; +Cc: linux-scsi

Dmitry Monakhov <dmonakhov@openvz.org> writes:

> But once I rewrite this block, problem goes away.
> #xfs_io -c "pwrite -S 0x0 $((80069000/2))k 4k" -d  /dev/sda
>
> Now I can read it w/o any errors and smartctl is happy
> #smartctl -t short /dev/sda
> #smartctl -l selftest /dev/sda
> Num  Test_Description    Status                  Remaining
> LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%      4683 -
>
> So my disk is not dead right?

Correct.

> Why the hell HDD fail read from very beginning
> Is this because HDD firmware detect internal crcXX sum corruption?

Yes.

> How this can happen? Is this because of power failure?

Could be.  If power was cut in the middle of a write, this can happen.
There are other causes, though (bit rot, for example).

> AFAIK standard guarantees that sector will be updated atomically.

No, the SCSI and ATA standards most certainly do not guarantee that!
NVMe is the only standard I know of that requires Atomic Write Unit
Power Fail to be at least one sector.

> But it happens! Please guide me how to fix such problems in general.

You fixed it.  Overwriting the sector will clear the error.

Cheers,
Jeff


* Re: HDD Unrecovered readerror issue
  2016-07-22 13:51 ` Jeff Moyer
@ 2016-07-22 14:33   ` James Bottomley
  0 siblings, 0 replies; 3+ messages in thread
From: James Bottomley @ 2016-07-22 14:33 UTC
  To: Jeff Moyer, Dmitry Monakhov; +Cc: linux-scsi

On Fri, 2016-07-22 at 09:51 -0400, Jeff Moyer wrote:
> Dmitry Monakhov <dmonakhov@openvz.org> writes:
> 
> > But once I rewrite this block, problem goes away.
> > #xfs_io -c "pwrite -S 0x0 $((80069000/2))k 4k" -d  /dev/sda
> > 
> > Now I can read it w/o any errors and smartctl is happy
> > #smartctl -t short /dev/sda
> > #smartctl -l selftest /dev/sda
> > Num  Test_Description    Status                  Remaining
> > LifeTime(hours)  LBA_of_first_error
> > # 1  Short offline       Completed without error       00%     
> >  4683 -
> > 
> > So my disk is not dead right?
> 
> Correct.
> 
> > Why the hell HDD fail read from very beginning
> > Is this because HDD firmware detect internal crcXX sum corruption?
> 
> Yes.
> 
> > How this can happen? Is this because of power failure?
> 
> Could be.  If power was cut in the middle of a write, this can 
> happen. There are other causes, though (bit rot, for example).
> 
> > AFAIK standard guarantees that sector will be updated atomically.
> 
> No, the SCSI and ATA standards most certainly do not guarantee that!
> NVMe is the only standard I know of that requires Atomic Write Unit
> Power Fail to be at least one sector.

The mechanics of the drive mostly ensure atomic updates on the physical
block level.  You definitely get either the old data, the new data or
an unreadable sector.  The latter is a pretty rare event because
surviving power usually ensures the writes complete, but it's not
guaranteed. 

> > But it happens! Please guide me how to fix such problems in
> > general.
> 
> You fixed it.  Overwriting the sector will clear the error.

Actually only "may clear the error", depending on what happened.  If the
Hamming codes on the sector itself just failed (because of a torn write
due to power fail) then a rewrite simply re-fixes the sector in situ.
Sometimes the magnetic substrate of the track is worn (so the sector
is permanently damaged) and the rewrite forces a reallocation.  If
that's happening to your disk then eventually it will fail
irrecoverably when the reallocation table is full.

You can monitor this with the SMART Reallocated_Event_Count attribute.
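
One possible way to track that attribute over time is to pull its raw value out of `smartctl -A` output. This is a rough sketch, assuming the usual ten-column attribute table where the name is in column 2 and RAW_VALUE in column 10. The function name `smart_raw_value` is hypothetical, the sample line is illustrative and not taken from the disk in this thread, and attribute names can differ between vendors.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the raw value
# (column 10) of the attribute whose name (column 2) matches $1.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Illustrative input only; these values are made up.
printf '%s\n' \
  '196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0' |
  smart_raw_value Reallocated_Event_Count

# Real usage (needs root and smartmontools):
#   smartctl -A /dev/sda | smart_raw_value Reallocated_Event_Count
```

A raw value that keeps growing between checks would suggest the reallocation case described above rather than a one-off torn write.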

James

