single disk reed solomon codes

public inbox for linux-btrfs@vger.kernel.org

* single disk reed solomon codes
From: Ahmed Kamal @ 2008-07-19 12:21 UTC
  To: linux-btrfs

Hi,
Since btrfs is someday going to be the default FS for Linux, and will
be on so many single disk PCs and laptops, I was thinking it should be
a good idea to insert some redundancy in single disk deployments. Of
course it can't help with disk failures, since it's obviously a "single"
disk, but it can help with bit-rot, and with hardware sector read
errors. To get that we'd need to implement some kind of forward error
correction, possibly Reed-Solomon codes. I am not sure why no
filesystem seems to implement such a scheme, although I believe such
schemes are being used at the hardware level (so the idea is
applicable)?
Not that I am an expert on such matters, but I thought I'd drop that
suggestion here; maybe at least I'll learn why no one else seems to do
that.
Regards


* Re: single disk reed solomon codes
From: Gerald Nowitzky @ 2008-07-19 15:18 UTC
  To: linux-btrfs

ECC codes like Reed-Solomon are very useful to recognize and locate random 
bit-errors. On a HDD as a unit, as it is seen from OS level, I don't think 
it will be of any help: When a HDD drive reads a sector from disk, it does a 
whole bunch of error recognition and correction measures. Usually there are, 
at least, two layers of error correction with different bit spreads on it. 
*If* this still isn't enough, it is very likely that the whole sector will 
come back completely spoiled, or, much more likely, won't come back at all 
and the drive will report a read error.

So, the only thing that could be done is to add some redundancy at the
sector level with something like an "Intra-Disk-RAID5", by adding a number of
parity sectors. Simple parity will be enough, as error detection and
location can be done by sector checksums. However, there will be a *huge*
performance penalty, as every sector write will cause an additional seek and
write for the parity sector.
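
As a rough illustration of the parity idea above (a toy sketch in Python, not btrfs code; the 512-byte sector size, the use of CRC32 as the sector checksum, and all function names are assumptions made for the example), per-sector checksums locate the bad sector and a single XOR parity sector rebuilds it:

```python
import zlib
from functools import reduce

SECTOR_SIZE = 512  # bytes; assumed sector size for this illustration


def xor_sectors(sectors):
    """Return the byte-wise XOR of equally sized sectors."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))


def make_stripe(data_sectors):
    """Build a stripe: data sectors, one parity sector, one CRC32 per sector."""
    parity = xor_sectors(data_sectors)
    checksums = [zlib.crc32(s) for s in data_sectors]
    return list(data_sectors), parity, checksums


def repair_stripe(sectors, parity, checksums):
    """Locate a single bad sector by checksum and rebuild it from parity."""
    bad = [i for i, s in enumerate(sectors) if zlib.crc32(s) != checksums[i]]
    if not bad:
        return sectors
    if len(bad) > 1:
        raise ValueError("simple parity can rebuild only one sector per stripe")
    i = bad[0]
    others = [s for j, s in enumerate(sectors) if j != i]
    rebuilt = xor_sectors(others + [parity])
    if zlib.crc32(rebuilt) != checksums[i]:
        raise ValueError("parity sector is damaged as well")
    return sectors[:i] + [rebuilt] + sectors[i + 1:]


data = [bytes([n]) * SECTOR_SIZE for n in range(4)]
sectors, parity, sums = make_stripe(data)
sectors[2] = b"\xff" * SECTOR_SIZE   # simulate a spoiled sector
assert repair_stripe(sectors, parity, sums) == data
```

The sketch also shows where the write penalty comes from: changing any one data sector means the parity sector has to be recomputed and rewritten as well.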

In the end, you would add very little security at the price of -at least-
halving your write performance. Thus, I don't think there is any point
in adding redundancy to single disk systems.

Just my 2 cents...
(Gerald)



* Re: single disk reed solomon codes
From: Joe Peterson @ 2008-07-19 22:15 UTC
  To: Gerald Nowitzky; +Cc: linux-btrfs

Gerald Nowitzky wrote:
> When a HDD drive reads a sector from disk, it does a
> whole bunch of error recognition and correction measures. Usually there are, 
> at least, two layers of error correction with different bit spreads on it. 
> *If* this still isn't enough, it is very likely that the whole sector will 
> come back completely spoiled, or, much more likely, won't come back at all 
> and the drive will report a read error.

With larger and larger disks, it is increasingly likely we will see
undetected/uncorrected errors (the drive bit error rates are not
improving - 1 in 10^17 is typical).  It is clear we cannot rely
completely on the hardware to catch everything.  Also, errors that
happen in the hardware between the drive and the CPU can be caused by
bad cables, interfaces, etc.
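
To put the quoted error rate in perspective, here is a small back-of-the-envelope sketch (Python; the 1.5 TB disk size and the Poisson approximation are assumptions of the example, and the rate is simply the "1 in 10^17" figure from above, not a measured value):

```python
import math

BITS_PER_BYTE = 8


def expected_bit_errors(bytes_read, error_rate):
    """Expected number of bad bits when reading bytes_read bytes."""
    return bytes_read * BITS_PER_BYTE * error_rate


def probability_of_any_error(bytes_read, error_rate):
    """Poisson approximation of the chance of at least one bad bit."""
    return -math.expm1(-expected_bit_errors(bytes_read, error_rate))


disk_bytes = 1.5e12   # assumed 1.5 TB disk
rate = 1e-17          # "1 in 10^17" bit error rate quoted above

print(expected_bit_errors(disk_bytes, rate))
print(probability_of_any_error(disk_bytes, rate))
```

Under these assumptions one full read of the disk gives roughly 1.2e-4 expected bad bits. The number grows linearly with capacity and with the amount of data read over the life of the drive, and it does not include errors introduced by cables or interfaces.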

For even single disk systems (even without mirroring), it is still valid
to have some means of verifying integrity.  It is far better to know an
error occurred and which files are affected than to have it happen
silently.  If caught, undetected errors will be less likely to migrate
onto backups over time and slowly corrupt data there too, making
eventual recovery impossible.  That's why btrfs's checksums are so cool!

See my blog for my personal experiences with silent hard disk errors:

	http://planet.gentoo.org/developers/lavajoe/

					-Joe


* Re: single disk reed solomon codes
From: Bron Gondwana @ 2008-07-20  1:21 UTC
  To: Joe Peterson; +Cc: Gerald Nowitzky, linux-btrfs

On Sat, Jul 19, 2008 at 04:15:43PM -0600, Joe Peterson wrote:
> Gerald Nowitzky wrote:
> > When a HDD drive reads a sector from disk, it does a
> > whole bunch of error recognition and correction measures. Usually there are, 
> > at least, two layers of error correction with different bit spreads on it. 
> > *If* this still isn't enough, it is very likely that the whole sector will 
> > come back completely spoiled, or, much more likely, won't come back at all 
> > and the drive will report a read error.
> 
> With larger and larger disks, it is increasingly likely we will see
> undetected/uncorrected errors (the drive bit error rates are not
> improving - 1 in 10^17 is typical).  It is clear we cannot rely
> completely on the hardware to catch everything.  Also, errors that
> happen in the hardware between the drive and the CPU can be caused by
> bad cables, interfaces, etc.
> 
> For even single disk systems (even without mirroring), it is still valid
> to have some means of verifying integrity.  It is far better to know an
> error occurred and which files are affected than to have it happen
> silently.  If caught, undetected errors will be less likely to migrate
> onto backups over time and slowly corrupt data there too, making
> eventual recovery impossible.  That's why btrfs's checksums are so cool!
> 
> See my blog for my personal experiences with silent hard disk errors:
> 
> 	http://planet.gentoo.org/developers/lavajoe/

I've seen an interesting discussion elsewhere about this very issue, in
the context of retrofitting some sort of checksumming support to FFS.

The suggestion was to make every 128th block a checksum block for the previous
127 blocks (scale to your liking).  Without changing the filesystem
format _at_all_ you could still checksum so long as you read in 128
blocks at a time.  This isn't a major problem, since you'll probably
want that sort of readahead anyway.

Of course - that's just error detection, not error correction.
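
A minimal sketch of that layout (Python; the 512-byte block size, CRC32 as the checksum, and the function names are assumptions for illustration, not taken from the FFS discussion). Conveniently, 127 four-byte CRC32 values need 508 bytes, so they fit in one 512-byte block:

```python
import struct
import zlib

BLOCK_SIZE = 512            # assumed block size
GROUP = 128                 # 127 data blocks followed by 1 checksum block
DATA_PER_GROUP = GROUP - 1


def build_checksum_block(data_blocks):
    """Pack one CRC32 per data block into a single zero-padded block."""
    if len(data_blocks) != DATA_PER_GROUP:
        raise ValueError("expected exactly %d data blocks" % DATA_PER_GROUP)
    packed = b"".join(struct.pack("<I", zlib.crc32(b)) for b in data_blocks)
    return packed.ljust(BLOCK_SIZE, b"\0")


def verify_group(blocks):
    """Given 128 blocks read together, return indices of failing data blocks."""
    data, checksum_block = blocks[:DATA_PER_GROUP], blocks[DATA_PER_GROUP]
    stored = struct.unpack("<%dI" % DATA_PER_GROUP,
                           checksum_block[:4 * DATA_PER_GROUP])
    return [i for i, b in enumerate(data) if zlib.crc32(b) != stored[i]]


data = [bytes([i]) * BLOCK_SIZE for i in range(DATA_PER_GROUP)]
group = data + [build_checksum_block(data)]
group[5] = b"\xaa" * BLOCK_SIZE     # simulate silent corruption of block 5
print(verify_group(group))
```

As noted above, this only reports which blocks are bad; it has nothing to rebuild them from.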

Bron.


* Re: single disk reed solomon codes
From: Tomasz Torcz @ 2008-07-21  6:48 UTC
  To: linux-btrfs

On Sat, 2008-07-19 at 17:18 +0200, Gerald Nowitzky wrote:

> In the end, you would add very little security at the price of -at least-
> halving your write performance. Thus, I don't think there is any point
> in adding redundancy to single disk systems.

  ZFS can store multiple copies of a data block on one disk. Using
your words, it's like "Intra-Disk-RAID1". After reading data, when the
checksum shows it's corrupted, another copy (hopefully correct) is read
from another location on the disk.
  This adds security at the price of half the storage capacity, which
seems like a fair trade, given today's 1.5 TB HDDs.
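
The read path described here can be sketched roughly as follows (a toy Python model, not ZFS or btrfs code; the class name, the in-memory dictionaries standing in for disk locations, and CRC32 as the checksum are all assumptions of the example):

```python
import zlib


class DupStore:
    """Toy single-disk store keeping two copies of every block plus a CRC32."""

    def __init__(self):
        self.copies = {}      # block number -> [copy_a, copy_b]
        self.checksums = {}   # block number -> CRC32 of the intended contents

    def write(self, blockno, data):
        self.copies[blockno] = [bytes(data), bytes(data)]
        self.checksums[blockno] = zlib.crc32(data)

    def read(self, blockno):
        """Return the first copy whose checksum matches, repairing bad copies."""
        expected = self.checksums[blockno]
        copies = self.copies[blockno]
        for good in copies:
            if zlib.crc32(good) == expected:
                for i in range(len(copies)):
                    if zlib.crc32(copies[i]) != expected:
                        copies[i] = good   # rewrite the damaged copy
                return good
        raise IOError("all copies of block %d are corrupt" % blockno)


store = DupStore()
store.write(7, b"important data")
store.copies[7][0] = b"garbage"          # simulate corruption of the first copy
assert store.read(7) == b"important data"
```

The checksum is what makes the second copy useful: without it there is no way to tell which of two differing copies is the correct one.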

-- 
Tomasz Torcz


* Re: single disk reed solomon codes
From: Ahmed Kamal @ 2008-07-21  7:40 UTC
  To: Tomasz Torcz; +Cc: linux-btrfs

I definitely hope btrfs has this per-object "copies" property too.
However, simply replicating the whole contents of a directory wastes
too much disk space compared to RS codes.



* Re: single disk reed solomon codes
From: Chris Mason @ 2008-07-21 13:03 UTC
  To: Ahmed Kamal; +Cc: Tomasz Torcz, linux-btrfs

On Mon, 2008-07-21 at 10:40 +0300, Ahmed Kamal wrote:
> I definitely hope btrfs has this per-object "copies" property too.
> However, simply replicating the whole contents of a directory, wastes
> too much disk space, as opposed to RS codes
> 

Btrfs already has a raid level where things are duplicated on a single
spindle, and it is on by default for metadata.  mkfs isn't currently
set up to use this on data blocks, but it is certainly possible (look for
BTRFS_BLOCK_GROUP_DUP).  This is definitely less reliable than two
physical devices, and I worry that such a feature would give people the
impression that single drive raid is a good idea.

As others have already said, the drives do have considerable error
detection and correction already.  One of the main benefits of the
checksums is being able to tell which copy of the data from a group of
drives is correct.

In terms of detecting errors, the data checksums will do that.

-chris


* Re: single disk reed solomon codes
From: Dongjun Shin @ 2008-07-21 15:03 UTC
  To: Ahmed Kamal; +Cc: Tomasz Torcz, linux-btrfs

On Mon, Jul 21, 2008 at 4:40 PM, Ahmed Kamal
<email.ahmedkamal@googlemail.com> wrote:
> I definitely hope btrfs has this per-object "copies" property too.
> However, simply replicating the whole contents of a directory, wastes
> too much disk space, as opposed to RS codes
>

Although adding a redundancy mechanism will help increase the integrity of data,
I'm not sure whether repeating the same kind of mechanism twice will help.
(AFAIK, RS is common in HDD and BCH is common in flash due to their own
physical characteristics)

I think it is better to have another redundancy mechanism (like RAID1)
which is independent of the algorithm used by the underlying storage.

-- 
Dongjun


* Re: single disk reed solomon codes
From: Ahmed Kamal @ 2008-08-04  6:52 UTC
  To: linux-btrfs

An experiment of applying RS codes for protecting data, worth a look
http://ttsiodras.googlepages.com/rsbep.html

He overwrites a series of 127 sectors and still manages to correctly
recover his data. We all know disks give us unreadable sectors every
now and then, so at least on workstations/laptops this could really be
useful ?

Advantage over single-disk-raid1 is storage efficiency (4.2MB becomes
5.2MB), that means we get 80% of useable disk space, instead of 50% if
I decide to raid1 everything ?
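
The arithmetic behind that comparison, as a small Python check (the 4.2 MB and 5.2 MB figures are the ones quoted above; the function name is made up for the example):

```python
def usable_fraction(data_size, stored_size):
    """Fraction of the raw space that ends up holding user data."""
    return data_size / stored_size


rs_fraction = usable_fraction(4.2, 5.2)      # RS-protected file from the experiment
raid1_fraction = usable_fraction(1.0, 2.0)   # every block stored twice
rs_overhead = (5.2 - 4.2) / 4.2              # extra space relative to the data

print("RS usable:    %.1f%%" % (100 * rs_fraction))
print("RAID1 usable: %.1f%%" % (100 * raid1_fraction))
print("RS overhead:  %.1f%%" % (100 * rs_overhead))
```

This gives about 80.8% usable space for the RS-encoded file against 50% for two full copies, at an overhead of about 23.8% of the original data size.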



* Re: single disk reed solomon codes
From: Ric Wheeler @ 2008-08-04 11:31 UTC
  To: Ahmed Kamal; +Cc: linux-btrfs

Ahmed Kamal wrote:
> An experiment of applying RS codes for protecting data, worth a look
> http://ttsiodras.googlepages.com/rsbep.html
>
> He overwrites a series of 127 sectors and still manages to correctly
> recover his data. We all know disks give us unreadable sectors every
> now and then, so at least on workstations/laptops this could really be
> useful ?
>
> Advantage over single-disk-raid1 is storage efficiency (4.2MB becomes
> 5.2MB), that means we get 80% of useable disk space, instead of 50% if
> I decide to raid1 everything ?
>
>   

This is an interesting idea and could help recover from some types of 
failures (for example, single head failures) or localized bad sectors 
(think of dust or junk on the platter).  This is almost certainly a big 
win for single disk systems. 

You would probably still need RAID (or other protection schemes)
to get enterprise class data availability since you clearly cannot
handle a full drive failure whenever you have multiple drives in a system.

Thanks!

ric


* Re: single disk reed solomon codes
From: David Woodhouse @ 2008-07-19 16:50 UTC
  To: Ahmed Kamal; +Cc: linux-btrfs

On Sat, 2008-07-19 at 15:21 +0300, Ahmed Kamal wrote:
> Hi,
> Since btrfs is someday going to be the default FS for Linux, and will
> be on so many single disk PCs and laptops, I was thinking it should be
> a good idea to insert some redundancy in single disk deployments. Of
> course it can't help with disk failures, since it's obviously a "single"
> disk, but it can help with bit-rot, and with hardware sector read
> errors. To get that we'd need to implement some kind of forward error
> correction, possibly reed solomon code. I am not sure why no
> filesystem seems to implement such scheme, although I believe at the
> hardware level, such schemes are being used (so the idea is
> applicable) ?

We have implementations of such schemes in lib/reed_solomon/ in the
kernel already.

I'm quite interested in using btrfs on flash (I mean _real_ flash, not
SSDs where they have their own internal pseudo-fs pretending to be a
disk). For that, we'd probably want to use precisely this kind of error
correction, although it's normal to do it at the block level rather than
the filesystem object level.

I don't know if the failure modes on real disks are likely to be helped
by this kind of scheme or not. After all, the disks already do a similar
RS-based error correction for themselves. If we're unlucky in our choice
of error correction, it might even be possible to end up in a situation
where the only errors we'd _see_ are the ones which were uncorrectable.

-- 
dwmw2


* Re: single disk reed solomon codes
From: Ahmed Kamal @ 2008-07-19 16:53 UTC
  To: David Woodhouse; +Cc: linux-btrfs

> RS-based error correction for themselves. If we're unlucky in our choice
> of error correction, it might even be possible to end up in a situation
> where the only errors we'd _see_ are the ones which were uncorrectable.
>

But since the FS-level redundancy would be stored in a different
place than the hardware-level redundancy, such errors might still be
correctable.


* Re: single disk reed solomon codes
From: Chris Mason @ 2008-07-21 13:05 UTC
  To: David Woodhouse; +Cc: Ahmed Kamal, linux-btrfs

On Sat, 2008-07-19 at 09:50 -0700, David Woodhouse wrote:
> On Sat, 2008-07-19 at 15:21 +0300, Ahmed Kamal wrote:
> > Hi,
> > Since btrfs is someday going to be the default FS for Linux, and will
> > be on so many single disk PCs and laptops, I was thinking it should be
> > a good idea to insert some redundancy in single disk deployments. Of
> > course it can't help with disk failures, since it's obviously a "single"
> > disk, but it can help with bit-rot, and with hardware sector read
> > errors. To get that we'd need to implement some kind of forward error
> > correction, possibly reed solomon code. I am not sure why no
> > filesystem seems to implement such scheme, although I believe at the
> > hardware level, such schemes are being used (so the idea is
> > applicable) ?
> 
> We have implementations of such schemes in lib/reed_solomon/ in the
> kernel already.
> 
> I'm quite interested in using btrfs on flash (I mean  _real_ flash not
> SSDs where they have their own internal pseudo-fs pretending to be a
> disk). For that, we'd probably want to use precisely this kind of error
> correction. Although it's normal to do it at the block level rather than
> the filesystem object level;
> 

The long term goal is to have the checksum algorithm selectable between
a number of choices.  For metadata, you have 256 bits to use and for
data you can use anything that will fit in a btree block.

So, the way to do this for real flash would be to implement the
selectable checksum, and then store the sum + whatever error recovery
code you want in the checksum item.
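
One way to picture "sum + error recovery code" sharing a fixed-size field is sketched below (Python; this is a purely hypothetical layout, not the btrfs on-disk format: the split into a 4-byte CRC32 followed by 28 opaque ECC bytes, the names, and the little-endian packing are all assumptions, and the ECC bytes would come from whatever encoder is chosen):

```python
import struct
import zlib

CSUM_FIELD_BYTES = 32                   # the 256 bits mentioned above
ECC_BYTES = CSUM_FIELD_BYTES - 4        # space left after a 4-byte CRC32
_LAYOUT = struct.Struct("<I%ds" % ECC_BYTES)   # hypothetical field layout


def pack_checksum_item(data, ecc):
    """Pack CRC32(data) and opaque ECC bytes into one 32-byte field."""
    if len(ecc) != ECC_BYTES:
        raise ValueError("ecc must be exactly %d bytes" % ECC_BYTES)
    return _LAYOUT.pack(zlib.crc32(data), ecc)


def unpack_checksum_item(item):
    """Split a 32-byte field back into (crc32, ecc_bytes)."""
    return _LAYOUT.unpack(item)


block = b"some metadata block"
ecc = bytes(ECC_BYTES)                  # placeholder; a real encoder goes here
item = pack_checksum_item(block, ecc)
crc, stored_ecc = unpack_checksum_item(item)
assert len(item) == CSUM_FIELD_BYTES and crc == zlib.crc32(block)
```

On a read, the CRC would be checked first and the ECC bytes consulted only when it fails, so the common case costs no more than a plain checksum.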

-chris



Thread overview: 13 messages
2008-07-19 12:21 single disk reed solomon codes Ahmed Kamal
2008-07-19 15:18 ` Gerald Nowitzky
2008-07-19 22:15   ` Joe Peterson
2008-07-20  1:21     ` Bron Gondwana
2008-07-21  6:48   ` Tomasz Torcz
2008-07-21  7:40     ` Ahmed Kamal
2008-07-21 13:03       ` Chris Mason
2008-07-21 15:03       ` Dongjun Shin
2008-08-04  6:52         ` Ahmed Kamal
2008-08-04 11:31           ` Ric Wheeler
2008-07-19 16:50 ` David Woodhouse
2008-07-19 16:53   ` Ahmed Kamal
2008-07-21 13:05   ` Chris Mason
