maintain badblocks list on the fly


* maintain badblocks list on the fly
@ 2014-01-05 10:02 Oleksij Rempel
  2014-01-06  1:27 ` Theodore Ts'o
  0 siblings, 1 reply; 3+ messages in thread
From: Oleksij Rempel @ 2014-01-05 10:02 UTC
  To: linux-ext4

Hello all,

After some googling I didn't find an answer to my question, so I'm
asking it directly here: does it make sense, and is it possible, to
maintain the bad block list of ext4 on the fly? I mean, if ext4 gets an
error, for example from the ata subsystem, could it mark the block as
bad, or maybe better as "probably bad"?

Since ext4 does journal recovery anyway, it could do some sort of
auto-repair too.

The reason I ask is the story of a laptop which I got for repair. After
an update, the system failed to boot. One of the system-relevant files
had been placed on a bad block which the kernel had already detected one
month(!) earlier (it can be found in syslog). After the reboot, the
system was unusable for an average user. It is a single bad block on a
500GB drive, so this is not a case for replacement.

If ext4 maintained bad blocks on the fly, it would keep the system
working. High-level tools should be responsible for notifying the user
about FS and device degradation.

PS: thanks for metadata_csum :)
-- 
Regards,
Oleksij


* Re: maintain badblocks list on the fly
  2014-01-05 10:02 maintain badblocks list on the fly Oleksij Rempel
@ 2014-01-06  1:27 ` Theodore Ts'o
  2014-01-16  9:50   ` Oleksij Rempel
  0 siblings, 1 reply; 3+ messages in thread
From: Theodore Ts'o @ 2014-01-06  1:27 UTC
  To: Oleksij Rempel; +Cc: linux-ext4

On Sun, Jan 05, 2014 at 11:02:49AM +0100, Oleksij Rempel wrote:
> 
> After some googling I didn't find an answer to my question, so I'm
> asking it directly here: does it make sense, and is it possible, to
> maintain the bad block list of ext4 on the fly? I mean, if ext4 gets an
> error, for example from the ata subsystem, could it mark the block as
> bad, or maybe better as "probably bad"?

Figuring out what to do in case of an error is tricky.  Sometimes
errors are transient, for example when a connection to a disk attached
via Fibre Channel is lost (perhaps briefly).

Also, with most hard drives, if you rewrite a block which has reported
a read error, the hard drive will usually remap the block to one of
the blocks in the spare pool.  So one strategy when you get a read
error is not to avoid using the block forever, but simply to write all
zeros to the block, and then see if the block is now valid.  But now
combine this with the "some errors are transient" problem: if you
do a forced rewrite, you might lose data that you could get back if you
tried rereading the block later.  So it's a rare file system author who
is willing to do an automated forced rewrite when getting a read
error.
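
To make the "transient versus persistent" distinction concrete, here is
a minimal Python sketch that rereads a block a few times before calling
the error persistent.  The names classify_read_error and read_block are
hypothetical and not part of any ext4 or kernel interface; read_block
stands in for whatever performs the actual read.

```python
from typing import Callable


def classify_read_error(read_block: Callable[[int], bytes],
                        block: int, retries: int = 3) -> str:
    """Reread a block up to `retries` times and classify the outcome.

    Returns "ok" if the first read succeeds, "transient" if a later
    read succeeds after at least one failure, and "persistent" if
    every attempt fails.  Nothing is ever written to the block, so no
    recoverable data is destroyed by the check itself.
    """
    failures = 0
    for _ in range(retries):
        try:
            read_block(block)
        except OSError:
            # Assumption: the reader signals an I/O error as OSError (EIO).
            failures += 1
            continue
        return "ok" if failures == 0 else "transient"
    return "persistent"
```

Only a block classified as "persistent" would even be a candidate for a
forced rewrite, and whether to rewrite it is still a policy decision.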

For a write error, it's safer to try rewriting the block, but most of
the time the hard drive will have tried rewriting the block already,
unless the error is due to a connection problem between the file system
and the storage device.  For example, suppose the file system is
accessing an iSCSI block device where the transport layer between the
computer and the storage device is a TCP connection...

So the problem with automated error recovery is that it's highly
dependent on the storage device (is it a RAID, a hard drive, an iSCSI
device, etc.) and on the application, i.e. what you are storing.

For example, if the file system is on a directly connected HDD serving
as the back end for a cluster file system such as hadoopfs or the Google
File System, where the cluster file system stores every chunk of its
files replicated on multiple file servers, and/or uses some kind of
Reed-Solomon encoding, then when you detect a read error on a data
block, the best thing to do might be to delete the file (relying on the
fact that the next time you write to the bad block, the HDD will remap
the block to one of the blocks in the spare pool), and then inform the
cluster file system that it should do a Reed-Solomon reconstruction or
otherwise reshard that portion of the file.
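
As a rough illustration of how the right reaction depends on both the
device and the application, the cases above could be written as a small
policy function.  This is only a sketch: the Action names, the device
strings and choose_action are all made up for the example, and a real
policy would need far more inputs.

```python
from enum import Enum


class Action(Enum):
    RETRY_LATER = "retry later"                # error may be transient
    REWRITE_BLOCK = "rewrite block"            # let the drive remap it
    DELETE_AND_RECONSTRUCT = "delete file, reconstruct from replicas"
    REPORT_ONLY = "report to user, do not touch the block"


def choose_action(device: str, error: str, replicated: bool) -> Action:
    """Pick a reaction to an I/O error (hypothetical policy).

    device:     "hdd", "iscsi" or "fibre_channel" (illustrative labels)
    error:      "read" or "write"
    replicated: True if the data also exists elsewhere, e.g. in a
                cluster file system with replicas or Reed-Solomon coding
    """
    if device in ("iscsi", "fibre_channel"):
        # The transport may have dropped briefly; do not condemn the block.
        return Action.RETRY_LATER
    if error == "write":
        # Rewriting after a write error does not destroy readable data.
        return Action.REWRITE_BLOCK
    if replicated:
        # A copy exists elsewhere, so the local copy can be given up.
        return Action.DELETE_AND_RECONSTRUCT
    # Read error on the only copy: a forced rewrite could lose data.
    return Action.REPORT_ONLY
```

For the laptop case from the original mail, choose_action("hdd",
"read", False) yields Action.REPORT_ONLY, which is exactly why an
automatic fix is hard there.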

At one point I toyed with trying to get something upstream where the
bad block notification would get sent via a netlink channel.  That way
userspace can do something appropriate, instead of trying to encode
what can potentially be extremely complicated policy decisions into the
kernel.  I never had the time to get the design and interface clean
enough for upstream, though.
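
To show what the userspace side of such a notification scheme might
look like, here is a purely hypothetical Python sketch.  No such
interface exists upstream; the message layout (device major/minor,
block number, read-or-write flag) and the names BadBlockEvent,
parse_event and handle_event are all invented for illustration, and the
netlink socket handling itself is left out.

```python
import struct
from dataclasses import dataclass
from typing import Callable

# Hypothetical payload layout, little-endian, no padding:
#   u32 major, u32 minor, u64 block number, u8 is_write
_EVENT = struct.Struct("<IIQB")


@dataclass
class BadBlockEvent:
    major: int
    minor: int
    block: int
    is_write: bool


def parse_event(payload: bytes) -> BadBlockEvent:
    """Decode one notification payload; raises struct.error if too short."""
    major, minor, block, is_write = _EVENT.unpack(payload[:_EVENT.size])
    return BadBlockEvent(major, minor, block, bool(is_write))


def handle_event(payload: bytes,
                 policy: Callable[[BadBlockEvent], str]) -> str:
    """Decode a payload and let a user-supplied policy decide what to do.

    The policy lives entirely in userspace, so it can be as simple or
    as complicated as the device and application require.
    """
    return policy(parse_event(payload))
```

A desktop policy might only log the event and warn the user, while a
cluster file system daemon could trigger reconstruction of the affected
chunk.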

						- Ted


* Re: maintain badblocks list on the fly
  2014-01-06  1:27 ` Theodore Ts'o
@ 2014-01-16  9:50   ` Oleksij Rempel
  0 siblings, 0 replies; 3+ messages in thread
From: Oleksij Rempel @ 2014-01-16  9:50 UTC
  To: Theodore Ts'o; +Cc: linux-ext4

Good point.
Back to this drive: it appears to be one of those drives which report
errors but do not remember them in SMART. One day of testing caused
SMART to report 2 pending corrupt blocks. After a reboot, there were 0
pending and 0 relocated blocks. This means there is no way to detect
hardware degradation with SMART on this drive.
It reports no errors on write, only sometimes on read, and there is no
guarantee that the data read is not corrupt.
If I see it correctly, the devices mostly affected are consumer devices
like laptops and PCs. They do not have RAID and mostly have no backups,
and even if they had some primitive backup option, it wouldn't really
help to find and restore corrupt files.

-- 
Regards,
Oleksij

