Re: HDD Unrecovered readerror issue

From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Jeff Moyer <jmoyer@redhat.com>, Dmitry Monakhov <dmonakhov@openvz.org>
Cc: linux-scsi@vger.kernel.org
Subject: Re: HDD Unrecovered readerror issue
Date: Fri, 22 Jul 2016 07:33:59 -0700
Message-ID: <1469198039.2382.11.camel@HansenPartnership.com>
In-Reply-To: <x494m7hhjbu.fsf@segfault.boston.devel.redhat.com>

On Fri, 2016-07-22 at 09:51 -0400, Jeff Moyer wrote:
> Dmitry Monakhov <dmonakhov@openvz.org> writes:
> 
> > But once I rewrite this block, problem goes away.
> > #xfs_io -c "pwrite -S 0x0 $((80069000/2))k 4k" -d  /dev/sda
> > 
> > Now I can read it w/o any errors and smartctl is happy
> > #smartctl -t short /dev/sda
> > #smartctl -l selftest /dev/sda
> > Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Short offline       Completed without error       00%       4683         -
> > 
> > So my disk is not dead right?
> 
> Correct.
> 
> > Why the hell HDD fail read from very beginning
> > Is this because HDD firmware detect internal crcXX sum corruption?
> 
> Yes.
> 
> > How this can happen? Is this because of power failure?
> 
> Could be.  If power was cut in the middle of a write, this can 
> happen. There are other causes, though (bit rot, for example).
> 
> > AFAIK standard guarantees that sector will be updated atomically.
> 
> No, the SCSI and ATA standards most certainly do not guarantee that!
> NVMe is the only standard I know of that requires Atomic Write Unit
> Power Fail to be at least one sector.

The mechanics of the drive mostly ensure atomic updates at the physical
block level.  You definitely get either the old data, the new data or
an unreadable sector.  The latter is a pretty rare event because the
power that survives in the drive usually lets in-flight writes
complete, but it's not guaranteed.
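
As a rough illustration of the third outcome: an unreadable sector
normally surfaces to userspace as an I/O error (EIO) on read.  The
sketch below assumes a hypothetical helper name, probe_block, and a
4 KiB probe size.  It reads through the page cache, so on a real disk
you would drop caches first or use direct I/O as the xfs_io -d command
quoted above does.  It cannot tell old data from new data.

```python
import errno
import os


def probe_block(path, offset, length=4096):
    """Return "ok" if the range can be read, "unreadable" on EIO.

    probe_block is a hypothetical helper for illustration only.
    Errors other than EIO (permissions, bad path) are re-raised.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # A medium error on the underlying sector is reported as EIO.
        os.pread(fd, length, offset)
        return "ok"
    except OSError as exc:
        if exc.errno == errno.EIO:
            return "unreadable"
        raise
    finally:
        os.close(fd)
```

For example, probe_block("/dev/sda", 40034500 * 1024) would probe the
block that was rewritten earlier in this thread (run as root).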

> > But it happens! Please guide me how to fix such problems in
> > general.
> 
> You fixed it.  Overwriting the sector will clear the error.

Actually it only "may clear the error", depending on what happened.  If
the hamming codes on the sector itself just failed (because of a torn
write due to power failure) then a rewrite simply re-fixes the sector
in situ.  Sometimes the magnetic substrate of the track is worn (so the
sector is permanently damaged) and the rewrite forces a reallocation.
If that's happening to your disk then eventually it will fail
irrecoverably when the reallocation table is full.
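
For the rewrite itself, the offset arithmetic in the quoted xfs_io
command can be sketched as follows.  This assumes that 80069000 is the
failing LBA and that the drive reports 512-byte logical sectors
(neither is stated in this message); the function names are
hypothetical.  Note that the rewrite replaces whatever was stored in
the whole 4 KiB block.

```python
SECTOR_SIZE = 512  # assumption: logical sector size in bytes


def lba_to_kib_offset(lba, sector_size=SECTOR_SIZE):
    """Offset of an LBA in KiB (rounded down), as used by "pwrite ... <N>k"."""
    return lba * sector_size // 1024


def aligned_rewrite_range(lba, block_size=4096, sector_size=SECTOR_SIZE):
    """Return (byte_offset, length) of the aligned block containing the LBA.

    Direct I/O needs aligned offsets and lengths, so the offset is
    rounded down to a multiple of block_size.
    """
    byte_offset = lba * sector_size
    start = byte_offset - byte_offset % block_size
    return start, block_size
```

With 512-byte sectors, lba_to_kib_offset(80069000) gives the same value
as the shell expression $((80069000/2)).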

You can monitor this with the SMART attribute Reallocated_Event_Count
(for example, in the output of smartctl -A).
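
A minimal sketch of such monitoring, assuming the usual ten-column
attribute table printed by smartctl -A (ID#, ATTRIBUTE_NAME, FLAG,
VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE).  The
function names and the sample numbers are made up for illustration,
and not every drive reports every attribute.

```python
def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from "smartctl -A" output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row and banner lines are skipped by this check.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def reallocation_growth(before, after,
                        names=("Reallocated_Event_Count",
                               "Reallocated_Sector_Ct")):
    """Return {name: increase} for reallocation counters that went up."""
    growth = {}
    for name in names:
        delta = after.get(name, 0) - before.get(name, 0)
        if delta > 0:
            growth[name] = delta
    return growth


# Hypothetical sample output, not taken from the disk in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       1
"""
```

Saving the parsed values periodically and comparing them with
reallocation_growth shows whether rewrites keep forcing reallocations.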

James

Thread overview:
2016-07-20  9:01 HDD Unrecovered readerror issue Dmitry Monakhov
2016-07-22 13:51 ` Jeff Moyer
2016-07-22 14:33   ` James Bottomley [this message]