avoiding the initial resync on --create

* avoiding the initial resync on --create
@ 2006-10-09 12:57 martin f krafft
  2006-10-09 13:49 ` Erik Mouw
  0 siblings, 1 reply; 16+ messages in thread
From: martin f krafft @ 2006-10-09 12:57 UTC (permalink / raw)
  To: linux-raid mailing list

Hi all,

I am looking at http://bugs.debian.org/251898 and wondering whether
it is safe to use --assume-clean (which prevents the initial resync)
when creating RAID arrays from the Debian installer.

Please also see the following discussion on IRC:

< madduck> yeah, i am not sure --assume-clean is a good idea.
< peterS> madduck: why not?  I've tried to think of a reason it
  would fail for months, and so far I'm too stupid to think of one
< madduck> even then
< madduck> peterS: because it then assumes that it
< madduck> it's clean, period.
< peterS> yeah, so?
< peterS> the blocks you have not written will have unreliable
  contents
< madduck> in reality, the three components are not properly XORed
< peterS> but why would you care about that?
< madduck> hm. kinda true.
< peterS> the blocks you _do_ write will be correct
< peterS> even an uninitialised raid5 or raid6 seems like it would
  work perfectly well with --assume-clean

Do you have any thoughts on the issue? If Debian were to --create
its arrays with --assume-clean just before slapping a filesystem on
them and installing the system, do you see any potential problems?

-- 
martin;              (greetings from the heart of the sun.)
  \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck

spamtraps: madduck.bogus@madduck.net

"sometimes we sit and read other people's interpretations of our
 lyrics and think, 'hey, that's pretty good.' if we liked it, we would
 keep our mouths shut and just accept the credit as if it was what we
 meant all along."
                                                        -- john lennon

* Re: avoiding the initial resync on --create
  2006-10-09 12:57 avoiding the initial resync on --create martin f krafft
@ 2006-10-09 13:49 ` Erik Mouw
  2006-10-09 16:32   ` Doug Ledford
  0 siblings, 1 reply; 16+ messages in thread
From: Erik Mouw @ 2006-10-09 13:49 UTC (permalink / raw)
  To: linux-raid mailing list

On Mon, Oct 09, 2006 at 02:57:00PM +0200, martin f krafft wrote:
> I am looking at http://bugs.debian.org/251898 and wondering whether
> it is safe to use --assume-clean (which prevents the initial resync)
> when creating RAID arrays from the Debian installer.
> 
> Please also see the following discussion on IRC:
> 
> < madduck> yeah, i am not sure --assume-clean is a good idea.
> < peterS> madduck: why not?  I've tried to think of a reason it
>   would fail for months, and so far I'm too stupid to think of one
> < madduck> even then
> < madduck> peterS: because it then assumes that it
> < madduck> it's clean, period.
> < peterS> yeah, so?
> < peterS> the blocks you have not written will have unreliable
>   contents
> < madduck> in reality, the three components are not properly XORed
> < peterS> but why would you care about that?
> < madduck> hm. kinda true.
> < peterS> the blocks you _do_ write will be correct
> < peterS> even an uninitialised raid5 or raid6 seems like it would
>   work perfectly well with --assume-clean

There is no way to figure out what exactly is correct data and what is
not. It might work right after creation and during the initial install,
but after the next reboot there is no way to figure out what blocks to
believe.

> Do you have any thoughts on the issue? If Debian were to --create
> its arrays with --assume-clean just before slapping a filesystem on
> them and installing the system, do you see any potential problems?

If you want to speed up the initial install, I'd say it's better to
create the array with one missing drive, install the system and let it
resync upon the next boot. Be sure to tell the user about that, though.
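
As a rough sketch of the degraded-create approach: mdadm accepts the word
"missing" in place of a device on --create, so the array can start with an
empty slot. The snippet below only builds the command lines and does not
run them. The device names, the array name and both helper functions are
placeholders invented for illustration.

```python
def degraded_create_cmd(md_dev, level, present, n_devices):
    """Return an mdadm argv that creates md_dev with absent slots given as 'missing'."""
    if not 0 < len(present) < n_devices:
        raise ValueError("need at least one present device and one missing slot")
    missing = ["missing"] * (n_devices - len(present))
    return ["mdadm", "--create", md_dev, f"--level={level}",
            f"--raid-devices={n_devices}", *present, *missing]


def add_device_cmd(md_dev, new_dev):
    """Return an mdadm argv that adds new_dev later, which starts recovery onto it."""
    return ["mdadm", "--manage", md_dev, "--add", new_dev]


# Hypothetical two-disk RAID1: install onto /dev/sda1 alone, add /dev/sdb1 afterwards.
create = degraded_create_cmd("/dev/md0", 1, ["/dev/sda1"], 2)
add = add_device_cmd("/dev/md0", "/dev/sdb1")
print(" ".join(create))
print(" ".join(add))
```

Until the second device has been added and recovery has finished, the
array has no redundancy, which is why the user should be told.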


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

* Re: avoiding the initial resync on --create
  2006-10-09 13:49 ` Erik Mouw
@ 2006-10-09 16:32   ` Doug Ledford
  2006-10-09 19:10     ` Rob Bray
  2006-10-10  9:55     ` Gabor Gombas
  0 siblings, 2 replies; 16+ messages in thread
From: Doug Ledford @ 2006-10-09 16:32 UTC (permalink / raw)
  To: Erik Mouw; +Cc: linux-raid mailing list

On Mon, 2006-10-09 at 15:49 +0200, Erik Mouw wrote:

> There is no way to figure out what exactly is correct data and what is
> not. It might work right after creation and during the initial install,
> but after the next reboot there is no way to figure out what blocks to
> believe.

You don't really need to.  After a clean install, the operating system
has no business reading any block it didn't write to during the install
unless you are just reading disk blocks for the fun of it.  And any
program that depends on data that hasn't first been written to disk is
just wrong and stupid anyway.

> > Do you have any thoughts on the issue? If Debian were to --create
> > its arrays with --assume-clean just before slapping a filesystem on
> > them and installing the system, do you see any potential problems?
> 
> If you want to speed up the initial install, I'd say it's better to
> create the array with one missing drive, install the system and let it
> resync upon the next boot. Be sure to tell the user about that, though.
> 
> 
> Erik
> 
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

* Re: avoiding the initial resync on --create
  2006-10-09 16:32   ` Doug Ledford
@ 2006-10-09 19:10     ` Rob Bray
  2006-10-09 19:45       ` Doug Ledford
  2006-10-10  9:55     ` Gabor Gombas
  1 sibling, 1 reply; 16+ messages in thread
From: Rob Bray @ 2006-10-09 19:10 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-raid mailing list

> On Mon, 2006-10-09 at 15:49 +0200, Erik Mouw wrote:
>
>> There is no way to figure out what exactly is correct data and what is
>> not. It might work right after creation and during the initial install,
>> but after the next reboot there is no way to figure out what blocks to
>> believe.
>
> You don't really need to.  After a clean install, the operating system
> has no business reading any block it didn't write to during the install
> unless you are just reading disk blocks for the fun of it.  And any
> program that depends on data that hasn't first been written to disk is
> just wrong and stupid anyway.

I suppose a partial-stripe write would read back junk data on the other
disks, xor with your write, and update the parity block.

If you benchmark the disk, you're going to be reading blocks you didn't
necessarily write, which could kick out consistency errors.

A whole-array consistency check would puke on the out-of-whack parity data.


* Re: avoiding the initial resync on --create
  2006-10-09 19:10     ` Rob Bray
@ 2006-10-09 19:45       ` Doug Ledford
  2006-10-09 21:33         ` Neil Brown
  2006-10-11 21:24         ` Michael Tokarev
  0 siblings, 2 replies; 16+ messages in thread
From: Doug Ledford @ 2006-10-09 19:45 UTC (permalink / raw)
  To: Rob Bray; +Cc: linux-raid mailing list

On Mon, 2006-10-09 at 15:10 -0400, Rob Bray wrote:
> > On Mon, 2006-10-09 at 15:49 +0200, Erik Mouw wrote:
> >
> >> There is no way to figure out what exactly is correct data and what is
> >> not. It might work right after creation and during the initial install,
> >> but after the next reboot there is no way to figure out what blocks to
> >> believe.
> >
> > You don't really need to.  After a clean install, the operating system
> > has no business reading any block it didn't write to during the install
> > unless you are just reading disk blocks for the fun of it.  And any
> > program that depends on data that hasn't first been written to disk is
> > just wrong and stupid anyway.
> 
> I suppose a partial-stripe write would read back junk data on the other
> disks, xor with your write, and update the parity block.

The original email was about raid1 and the fact that reads from
different disks could return different data.  For that scenario, my
comments are accurate.  For the parity based raids, you never have two
disks with the same block, so you would only ever get different results
if you had a disk fail and the parity was never initialized.  For that
situation, you would need to init the parity on any stripe that has been
even partially written to.  Totally unwritten stripes could have any
parity you want since the data is undefined anyway, so who cares if it
changes when a disk fails and you are reconstructing from parity.

> If you benchmark the disk, you're going to be reading blocks you didn't
> necessarily write, which could kick out consistency errors.

The only benchmarks I know of that give a rat's ass about the data
integrity are ones that write a pattern first and then read it back.  In
that case, parity would have been init'ed during the write.

> A whole-array consistency check would puke on the out-of-whack parity data.

Or a whole array consistency check on an array that hasn't had a whole
array parity init makes no sense.  You could create the array without
touching the parity, update parity on all stripes that are written,
leave a flag in the superblock indicating the array has never been
init'ed, and in the event of failure you can use the parity safe in the
knowledge that all stripes that have been written to have valid parity
and all other stripes we don't care about.  The main problem here is
that if we *did* need a consistency check, we couldn't tell errors from
uninit'ed stripes.  You could also make it so that the first time you
run a consistency check with the uninit'ed flag in the superblock set,
you calculate all parity and then clear the flag in the superblock and
on all subsequent runs you would then know when you have an error as
opposed to an uninit'ed block.
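
The "first check initialises, later checks verify" idea above could look
roughly like this. It is a toy model in Python, not md code: the
dictionary layout, the `initialized` key and the `check` function are all
hypothetical, and single XOR parity stands in for whatever the array
level really uses.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def check(array):
    """Return the number of bad stripes, or None if this pass only initialised parity."""
    if not array["initialized"]:
        # First pass on a never-synced array: compute parity, report nothing.
        for stripe in array["stripes"]:
            stripe["parity"] = xor_blocks(stripe["data"])
        array["initialized"] = True
        return None
    return sum(1 for s in array["stripes"]
               if s["parity"] != xor_blocks(s["data"]))


array = {
    "initialized": False,
    "stripes": [
        {"data": [b"\x01\x02", b"\x03\x04"], "parity": b"\xde\xad"},  # garbage parity
        {"data": [b"\x10\x20", b"\x30\x40"], "parity": b"\xbe\xef"},  # garbage parity
    ],
}
first = check(array)    # initialising pass
second = check(array)   # real check, nothing wrong
array["stripes"][0]["parity"] = b"\x00\x00"   # simulate later corruption
third = check(array)    # real check, one bad stripe
```

Only once the flag is cleared does a mismatch mean an actual error rather
than a stripe that was never written.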

Probably the best thing to do would be on create of the array, setup a
large all 0 block of mem and repeatedly write that to all blocks in the
array devices except parity blocks and use a large all 1 block for that.
Then you could just write the entire array at blinding speed.  You could
call that the "quick-init" option or something.  You wouldn't be able to
use the array until it was done, but it would be quick.  If you wanted
to be *really* fast, at least for SCSI drives you could write one large
chunk of 0's and one large chunk of 1's at the first parity block, then
use the SCSI COPY command to copy the 0 chunk everywhere it needs to go,
and likewise for the parity chunk, and avoid transferring the data over
the SCSI bus more than once.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

* Re: avoiding the initial resync on --create
  2006-10-09 19:45       ` Doug Ledford
@ 2006-10-09 21:33         ` Neil Brown
  2006-10-09 21:45           ` Doug Ledford
  2006-10-11 21:24         ` Michael Tokarev
  1 sibling, 1 reply; 16+ messages in thread
From: Neil Brown @ 2006-10-09 21:33 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Rob Bray, martin f krafft, linux-raid mailing list

On Monday October 9, dledford@redhat.com wrote:
> 
> The original email was about raid1 and the fact that reads from
> different disks could return different data.

To be fair, the original mail didn't mention "raid1" at all.  It did
mention raid5 and raid6 as a possible contrast so you could reasonably
get the impression that it was talking about raid1.  But that wasn't
stated.

Otherwise I agree.  There is no real need to perform the sync of a
raid1 at creation.
However it seems to be a good idea to regularly 'check' an array to
make sure that all blocks on all disks get read to find sleeping bad
blocks early.  If you didn't sync first, then every check will find
lots of errors.  Of course you could 'repair' instead of 'check'.  Or
do that once.  Or something.
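
A small model of why an unsynced raid1 makes every 'check' noisy until
one 'repair' has run. This is a sketch only: the disks are Python lists,
and copying from the first mirror during repair is an arbitrary choice
made for the example, not a statement about what md does.

```python
def write(mirrors, block, value):
    """A raid1 write goes to every mirror."""
    for m in mirrors:
        m[block] = value


def check(mirrors):
    """Count block positions where the mirrors disagree; changes nothing."""
    return sum(1 for copies in zip(*mirrors) if len(set(copies)) > 1)


def repair(mirrors):
    """Where mirrors disagree, copy the first mirror's block over the others."""
    fixed = 0
    for i in range(len(mirrors[0])):
        if any(m[i] != mirrors[0][i] for m in mirrors[1:]):
            for m in mirrors[1:]:
                m[i] = mirrors[0][i]
            fixed += 1
    return fixed


# Two never-synced disks holding different leftover contents.
mirrors = [[0xAA] * 8, [0x55] * 8]
for blk in range(3):          # the install writes blocks 0..2 only
    write(mirrors, blk, blk)

before = check(mirrors)       # the 5 unwritten blocks mismatch
fixed = repair(mirrors)
after = check(mirrors)        # clean from now on
```

After the single repair pass, later checks report only real problems.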

For raid6 it is also safe to not sync first, though with the same
caveat as raid1.  Raid6 always updates parity by reading all blocks in
the stripe that aren't known and calculating P and Q.  So the first
write to a stripe will make P and Q correct for that stripe.
This is current behaviour.  I don't think I can guarantee it will
never change.
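
The reason a full recalculation is safe on an unsynced stripe can be
shown with a toy model. This sketch handles only the XOR parity P; the Q
syndrome of a real raid6 is left out for brevity, and the stripe layout
and function names are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def reconstruct_write(stripe, index, new_block):
    """Store new_block, then recompute P from every data block in the stripe."""
    stripe["data"][index] = new_block
    stripe["p"] = xor_blocks(stripe["data"])   # old P is never consulted


# Never-synced stripe: leftover data blocks and a garbage P.
stripe = {"data": [b"\x10\x20", b"\x0f\xf0", b"\x99\x66"], "p": b"\xde\xad"}
reconstruct_write(stripe, 1, b"\xab\xcd")

p_consistent = stripe["p"] == xor_blocks(stripe["data"])
# Lose disk 1 and rebuild it from the surviving data blocks plus P.
rebuilt = xor_blocks([stripe["data"][0], stripe["data"][2], stripe["p"]])
```

Because the old parity is discarded rather than updated, its initial
value does not matter, and the first write fixes the whole stripe.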

For raid5 it is NOT safe to skip the initial sync.  It is possible for
all updates to be "read-modify-write" updates which assume the parity
is correct.  If it is wrong, it stays wrong.  Then when you lose a
drive, the parity blocks are wrong so the data you recover using them
is wrong.
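
A worked example of the raid5 failure described above, using two data
disks and one parity disk. It is a sketch with made-up block contents,
not md code; the point is only that read-modify-write carries any
initial parity error forward.

```python
def xor(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_write(data, parity, index, new_block):
    """Read-modify-write: new_parity = old_parity ^ old_data ^ new_data."""
    new_parity = xor(xor(parity, data[index]), new_block)
    data[index] = new_block
    return new_parity


def rebuild(data, parity, lost):
    """Recover the block of a failed disk from parity and the surviving data."""
    out = parity
    for i, block in enumerate(data):
        if i != lost:
            out = xor(out, block)
    return out


# Unsynced array: parity is garbage (the correct value would be 0x33 0x33).
data = [b"\x11\x11", b"\x22\x22"]
parity = rmw_write(data, b"\xff\xff", 0, b"\xaa\xaa")
recovered_unsynced = rebuild(data, parity, lost=0)

# Same write on a synced array: parity starts out correct.
data2 = [b"\x11\x11", b"\x22\x22"]
parity2 = rmw_write(data2, b"\x33\x33", 0, b"\xaa\xaa")
recovered_synced = rebuild(data2, parity2, lost=0)
```

In the unsynced case the block that was explicitly written comes back
corrupted after the disk loss, so "only written blocks matter" does not
save raid5.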

In summary, it is safe to use --assume-clean on a raid1 or raid1o,
though I would recommend a "repair" before too long.  For other raid
levels it is best avoided.

> 
> Probably the best thing to do would be on create of the array, setup a
> large all 0 block of mem and repeatedly write that to all blocks in the
> array devices except parity blocks and use a large all 1 block for that.

No, you would want 0 for the parity block too.  0 + 0 = 0.
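
The arithmetic, spelled out as a tiny sketch (XOR parity over byte
blocks; the helper name is invented for the example):

```python
def xor_parity(blocks):
    """XOR parity over equal-length byte blocks."""
    out = bytes(len(blocks[0]))
    for block in blocks:
        out = bytes(x ^ y for x, y in zip(out, block))
    return out


zeros = bytes(4)
ones = b"\xff" * 4

parity = xor_parity([zeros, zeros, zeros])       # parity of all-zero data
# With an all-ones parity block, rebuilding a lost zero block goes wrong:
bad_rebuild = xor_parity([zeros, zeros, ones])
```

With XOR parity, an all-zero stripe is only consistent if its parity
block is all zero as well.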

> Then you could just write the entire array at blinding speed.  You could
> call that the "quick-init" option or something.  You wouldn't be able to
> use the array until it was done, but it would be quick. 

I doubt you would notice it being faster than the current
resync/recovery that happens on creation.  We go at device-speed -
either the bus device or the storage device depending on which is
slower.

>                                                          If you wanted
> to be *really* fast, at least for SCSI drives you could write one large
> chunk of 0's and one large chunk of 1's at the first parity block, then
> use the SCSI COPY command to copy the 0 chunk everywhere it needs to go,
> and likewise for the parity chunk, and avoid transferring the data over
> the SCSI bus more than once.

Yes, that might be measurably faster.  It is the sort of thing you might
do in a "hardware" RAID controller but I doubt it would ever get done
in md (there is a price for being very general).

NeilBrown

* Re: avoiding the initial resync on --create
  2006-10-09 21:33         ` Neil Brown
@ 2006-10-09 21:45           ` Doug Ledford
  2006-10-09 23:14             ` Neil Brown
  0 siblings, 1 reply; 16+ messages in thread
From: Doug Ledford @ 2006-10-09 21:45 UTC (permalink / raw)
  To: Neil Brown; +Cc: Rob Bray, martin f krafft, linux-raid mailing list

On Tue, 2006-10-10 at 07:33 +1000, Neil Brown wrote:
> On Monday October 9, dledford@redhat.com wrote:
> > 
> > The original email was about raid1 and the fact that reads from
> > different disks could return different data.
> 
> To be fair, the original mail didn't mention "raid1" at all.  It did
> mention raid5 and raid6 as a possible contrast so you could reasonably
> get the impression that it was talking about raid1.  But that wasn't
> stated.

OK, well I got that impression from the contrast ;-)

> Otherwise I agree.  There is no real need to perform the sync of a
> raid1 at creation.
> However it seems to be a good idea to regularly 'check' an array to
> make sure that all blocks on all disks get read to find sleeping bad
> blocks early.  If you didn't sync first, then every check will find
> lots of errors.  Of course you could 'repair' instead of 'check'.  Or
> do that once.  Or something.
> 
> For raid6 it is also safe to not sync first, though with the same
> caveat as raid1.  Raid6 always updates parity by reading all blocks in
> the stripe that aren't known and calculating P and Q.  So the first
> write to a stripe will make P and Q correct for that stripe.
> This is current behaviour.  I don't think I can guarantee it will
> never change.
> 
> For raid5 it is NOT safe to skip the initial sync.  It is possible for
> all updates to be "read-modify-write" updates which assume the parity
> is correct.  If it is wrong, it stays wrong.  Then when you lose a
> drive, the parity blocks are wrong so the data you recover using them
> is wrong.

If superblock->init_flag == FALSE, then make all writes parity-generating
rather than parity-updating writes (less efficient, so you would want to
resync the array and clear this up soon, but possible).
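
A sketch of how such a flag might steer the write path. Everything here
is hypothetical: `init_flag` mirrors the name used in this mail, the
dictionary layout and `write_block` are invented, and nothing in the
thread says md has such a flag.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def write_block(array, stripe_no, index, new_block):
    """Write one data block, choosing the parity strategy from init_flag."""
    stripe = array["stripes"][stripe_no]
    if not array["init_flag"]:
        # Parity-generating write: recompute from all data, ignore old parity.
        stripe["data"][index] = new_block
        stripe["parity"] = xor_blocks(stripe["data"])
    else:
        # Parity-updating (read-modify-write): trusts the old parity.
        stripe["parity"] = xor_blocks(
            [stripe["parity"], stripe["data"][index], new_block])
        stripe["data"][index] = new_block


array = {"init_flag": False,
         "stripes": [{"data": [b"\x11", b"\x22"], "parity": b"\xff"}]}  # garbage parity
write_block(array, 0, 0, b"\xaa")
stripe = array["stripes"][0]
ok_uninit = stripe["parity"] == xor_blocks(stripe["data"])

array["init_flag"] = True          # pretend a resync has finished
write_block(array, 0, 1, b"\xbb")
ok_init = stripe["parity"] == xor_blocks(stripe["data"])
```

The generating path costs extra reads on every write, which is the
inefficiency the mail refers to.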

> In summary, it is safe to use --assume-clean on a raid1 or raid10,
> though I would recommend a "repair" before too long.  For other raid
> levels it is best avoided.
> 
> > 
> > Probably the best thing to do would be on create of the array, setup a
> > large all 0 block of mem and repeatedly write that to all blocks in the
> > array devices except parity blocks and use a large all 1 block for that.
> 
> No, you would want 0 for the parity block too.  0 + 0 = 0.

Sorry, I was thinking odd parity.

> > Then you could just write the entire array at blinding speed.  You could
> > call that the "quick-init" option or something.  You wouldn't be able to
> > use the array until it was done, but it would be quick. 
> 
> I doubt you would notice it being faster than the current
> resync/recovery that happens on creation.  We go at device-speed -
> either the bus device or the storage device depending on which is
> slower.

There's memory overhead though.  That can impact other operations the
cpu might do while in the process of recovering.

> 
> >                                                          If you wanted
> > to be *really* fast, at least for SCSI drives you could write one large
> > chunk of 0's and one large chunk of 1's at the first parity block, then
> > use the SCSI COPY command to copy the 0 chunk everywhere it needs to go,
> > and likewise for the parity chunk, and avoid transferring the data over
> > the SCSI bus more than once.
> 
> Yes, that might be measurably faster.  It is the sort of thing you might
> do in a "hardware" RAID controller but I doubt it would ever get done
> in md (there is a price for being very general).

Bleh...sometimes I really dislike always making things cater to the
lowest common denominator...you're never as good as you could be and you
are always as bad as the worst case...

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

* Re: avoiding the initial resync on --create
  2006-10-09 21:45           ` Doug Ledford
@ 2006-10-09 23:14             ` Neil Brown
  0 siblings, 0 replies; 16+ messages in thread
From: Neil Brown @ 2006-10-09 23:14 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Rob Bray, martin f krafft, linux-raid mailing list

On Monday October 9, dledford@redhat.com wrote:
> 
> If superblock->init_flag == FALSE, then make all writes parity-generating
> rather than parity-updating writes (less efficient, so you would want to
> resync the array and clear this up soon, but possible).

Yeh, that would work.  I wonder if it is worth the effort though.

> 
> Bleh...sometimes I really dislike always making things cater to the
> lowest common denominator...you're never as good as you could be and you
> are always as bad as the worst case...

damn those two-edged swords!

NeilBrown

* Re: avoiding the initial resync on --create
  2006-10-09 16:32   ` Doug Ledford
  2006-10-09 19:10     ` Rob Bray
@ 2006-10-10  9:55     ` Gabor Gombas
  2006-10-10 17:47       ` Doug Ledford
  1 sibling, 1 reply; 16+ messages in thread
From: Gabor Gombas @ 2006-10-10  9:55 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Erik Mouw, linux-raid mailing list

On Mon, Oct 09, 2006 at 12:32:00PM -0400, Doug Ledford wrote:

> You don't really need to.  After a clean install, the operating system
> has no business reading any block it didn't write to during the install
> unless you are just reading disk blocks for the fun of it.

What happens if you have a crash, and fsck for some reason tries to read
into that uninitialized area? This may happen even years after the
install if the array was never resynced and the filesystem was never
100% full... What happens if fsck tries to read the same area twice but
gets different data, because the second time the read went to a
different disk?
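
The scenario can be reproduced in a toy model. This is a sketch only:
the strict round-robin read policy is an assumption made to keep the
example deterministic and is not how md actually balances reads, and
the class is invented for illustration.

```python
from itertools import cycle


class Raid1:
    """Toy mirror set: writes go to all disks, reads alternate between them."""

    def __init__(self, disks):
        self.disks = disks
        self._next = cycle(range(len(disks)))

    def write(self, block, value):
        for disk in self.disks:
            disk[block] = value

    def read(self, block):
        return self.disks[next(self._next)][block]


# Two never-synced disks with different leftover contents.
md = Raid1([[0xAA] * 4, [0x55] * 4])
md.write(0, 0x01)

written = (md.read(0), md.read(0))      # same value from either disk
unwritten = (md.read(3), md.read(3))    # differs depending on the disk hit
```

Blocks written through the array always read back the same; only
never-written blocks can change between two reads.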

And yes, fsck is exactly an application that reads blocks just "for the
fun of it" when it tries to find all the pieces of the filesystem, esp.
for filesystems that (unlike e.g. ext3) do not keep metadata at fixed
locations.

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

* Re: avoiding the initial resync on --create
  2006-10-10  9:55     ` Gabor Gombas
@ 2006-10-10 17:47       ` Doug Ledford
  2006-10-10 19:18         ` Sergey Vlasov
  2006-10-10 20:37         ` Gabor Gombas
  0 siblings, 2 replies; 16+ messages in thread
From: Doug Ledford @ 2006-10-10 17:47 UTC (permalink / raw)
  To: Gabor Gombas; +Cc: Erik Mouw, linux-raid mailing list

On Tue, 2006-10-10 at 11:55 +0200, Gabor Gombas wrote:
> On Mon, Oct 09, 2006 at 12:32:00PM -0400, Doug Ledford wrote:
> 
> > You don't really need to.  After a clean install, the operating system
> > has no business reading any block it didn't write to during the install
> > unless you are just reading disk blocks for the fun of it.
> 
> What happens if you have a crash, and fsck for some reason tries to read
> into that uninitialized area? This may happen even years after the
> install if the array was never resynced and the filesystem was never
> 100% full... What happens if fsck tries to read the same area twice but
> gets different data, because the second time the read went to a
> different disk?
> 
> And yes, fsck is exactly an application that reads blocks just "for the
> fun of it" when it tries to find all the pieces of the filesystem, esp.
> for filesystems that (unlike e.g. ext3) do not keep metadata at fixed
> locations.

Not at all true.  Every filesystem, no matter where it stores its
metadata blocks, still writes to every single metadata block it
allocates to initialize that metadata block.  The same is true for
directory blocks...they are created with a . and .. entry and nothing
else.  What exactly do you think mke2fs is doing when it's writing out
the inode groups, block groups, bitmaps, etc.?  Every metadata block
needed by fsck is written either during mkfs or during use as the
filesystem data is grown.

So, like my original email said, fsck has no business reading any block
that hasn't been written to either by the install or since the install
when the filesystem was filled up more.  It certainly does *not* read
blocks just for the fun of it, nor does it rely on anything the
filesystem didn't specifically write.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

* Re: avoiding the initial resync on --create
  2006-10-10 17:47       ` Doug Ledford
@ 2006-10-10 19:18         ` Sergey Vlasov
  2006-10-10 20:38           ` Doug Ledford
  2006-10-10 20:37         ` Gabor Gombas
  1 sibling, 1 reply; 16+ messages in thread
From: Sergey Vlasov @ 2006-10-10 19:18 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Gabor Gombas, Erik Mouw, linux-raid mailing list

On Tue, 10 Oct 2006 13:47:56 -0400 Doug Ledford wrote:

[...]
> So, like my original email said, fsck has no business reading any block
> that hasn't been written to either by the install or since the install
> when the filesystem was filled up more.  It certainly does *not* read
> blocks just for the fun of it, nor does it rely on anything the
> filesystem didn't specifically write.

There are fsck implementations which read potentially unwritten
blocks.  E.g., reiserfsck --rebuild-tree reads every block on the
device, finds anything which looks like a tree block and tries to
do something with it.  This procedure sometimes recovers files
which were deleted, and if an uncompressed image of a reiserfs v3
filesystem was stored in a file on reiserfs, it can confuse
reiserfsck badly...

* Re: avoiding the initial resync on --create
  2006-10-10 17:47       ` Doug Ledford
  2006-10-10 19:18         ` Sergey Vlasov
@ 2006-10-10 20:37         ` Gabor Gombas
  2006-10-10 21:26           ` Doug Ledford
  1 sibling, 1 reply; 16+ messages in thread
From: Gabor Gombas @ 2006-10-10 20:37 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Erik Mouw, linux-raid mailing list

On Tue, Oct 10, 2006 at 01:47:56PM -0400, Doug Ledford wrote:

> Not at all true.  Every filesystem, no matter where it stores its
> metadata blocks, still writes to every single metadata block it
> allocates to initialize that metadata block.  The same is true for
> directory blocks...they are created with a . and .. entry and nothing
> else.  What exactly do you think mke2fs is doing when it's writing out
> the inode groups, block groups, bitmaps, etc.?  Every metadata block
> needed by fsck is written either during mkfs or during use as the
> filesystem data is grown.

You don't get my point. I'm not talking about normal operation, but
about the case when the filesystem becomes corrupt, and fsck has to glue
together the pieces. Consider reiserfs: it stores metadata in a single
tree. If an internal node of the tree gets corrupted, reiserfsck has
absolutely no information where the child nodes are. So it must scan the
whole device, and perform a "does this block look like reiserfs
metadata?" test for every single block.

Btw. that's the reason why you can't store reiserfs3 file system images
on a reiserfs3 file system - reiserfsck simply can't tell if a block
that looks like metadata is really part of the filesystem or is just
part of a regular file. AFAIK this design flaw is only fixed in reiser4.

> So, like my original email said, fsck has no business reading any block
> that hasn't been written to either by the install or since the install
> when the filesystem was filled up more.

But fsck has _ZERO_ information about what blocks were written since the
filesystem was created, because that information is part of the metadata
that got corrupted. If you could trust the metadata, you'd not need
fsck.

> It certainly does *not* read
> blocks just for the fun of it, nor does it rely on anything the
> filesystem didn't specifically write.

That's only true for "traditional" UNIX file systems like ext2/3. But
there are many other filesystems out there...

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

* Re: avoiding the initial resync on --create
  2006-10-10 19:18         ` Sergey Vlasov
@ 2006-10-10 20:38           ` Doug Ledford
  0 siblings, 0 replies; 16+ messages in thread
From: Doug Ledford @ 2006-10-10 20:38 UTC (permalink / raw)
  To: Sergey Vlasov; +Cc: Gabor Gombas, Erik Mouw, linux-raid mailing list

On Tue, 2006-10-10 at 23:18 +0400, Sergey Vlasov wrote:
> On Tue, 10 Oct 2006 13:47:56 -0400 Doug Ledford wrote:
> 
> [...]
> > So, like my original email said, fsck has no business reading any block
> > that hasn't been written to either by the install or since the install
> > when the filesystem was filled up more.  It certainly does *not* read
> > blocks just for the fun of it, nor does it rely on anything the
> > filesystem didn't specifically write.
> 
> There are fsck implementations which read potentially unwritten
> blocks.  E.g., reiserfsck --rebuild-tree reads every block on the
> device, finds anything which looks like a tree block and tries to
> do something with it.  This procedure sometimes recovers files
> which were deleted, and if an uncompressed image of a reiserfs v3
> filesystem was stored in a file on reiserfs, it can confuse
> reiserfsck badly...

Or if the disk was pulled from another machine that had a reiserfs on it
and then reformatted in the new machine with reiserfs, it could find old
blocks from the previous filesystem that point to possibly overwritten
data and give corrupted files.  Like I said, no program has any business
reading from unwritten blocks.  The "scan the whole disk looking for
something that might be something" heuristic is an easily breakable one
that I don't really give a rat's ass about.  Besides, even in this
scenario, if it were *truly* a deleted file, then it *would* be in sync
between the disks and the point is moot.  It's only if it's random
garbage that might get interpreted as a deleted tree block that we have
the issue of whether it reads the data from one raid1 disk or another
and gets different results, and in that case we don't care, it's
garbage.  In fact, reiserfsck *could* read the block from multiple
constituent devices of a raid1 to check for any inconsistency, and
should it be spotted, use that knowledge to clue us in to the fact that
it's not a valid block for recovery.  And if we *did* do an initial sync
of the raid1 devices and the device we are syncing from is the one with
the old reiserfs on it, then we have now copied that bogus garbage to
all the disks and eliminated this possible clue as to the veracity of
the blocks we find.  I'd rather 0 the blocks out than copy them across
for this reason personally, but I think the option to zero the device
should be just that, an option and not the default, to avoid accidental
loss of data in cases where someone is converting a normal disk to a
raid1 disk and wants to sync the correct source to the destination(s).

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

* Re: avoiding the initial resync on --create
  2006-10-10 20:37         ` Gabor Gombas
@ 2006-10-10 21:26           ` Doug Ledford
  2006-10-10 22:14             ` Rev. Jeffrey Paul
  0 siblings, 1 reply; 16+ messages in thread
From: Doug Ledford @ 2006-10-10 21:26 UTC (permalink / raw)
  To: Gabor Gombas; +Cc: Erik Mouw, linux-raid mailing list

On Tue, 2006-10-10 at 22:37 +0200, Gabor Gombas wrote:
> You don't get my point. I'm not talking about normal operation, but
> about the case when the filesystem becomes corrupt, and fsck has to glue
> together the pieces. Consider reiserfs:

See my other on-list mail about the fallacy of the idea that consistency
of garbage data blocks is any better than inconsistency.  As I mentioned
in it, even if it's a deleted file, a lost metadata block, etc., it will
always be consistent if it's a valid block to consider during rebuild
because *at some point in time* since the filesystem was created, it
will have been written.  Reiserfsck is just as susceptible to random
garbage on a single disk not part of any raid array as it is to
inconsistent blocks in a raid1 as it is to a fully synced raid1 array
with garbage that looks like a reiserfs.  That's a shortcoming of that
filesystem and there is no one to blame but Hans Reiser for that.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

* Re: avoiding the initial resync on --create
  2006-10-10 21:26           ` Doug Ledford
@ 2006-10-10 22:14             ` Rev. Jeffrey Paul
  0 siblings, 0 replies; 16+ messages in thread
From: Rev. Jeffrey Paul @ 2006-10-10 22:14 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Gabor Gombas, Erik Mouw, linux-raid mailing list

On Tue, Oct 10, 2006 at 05:26:25PM -0400, Doug Ledford wrote:
> with garbage that looks like a reiserfs.  That's a shortcoming of that
> filesystem and there is no one to blame but Hans Reiser for that.

What do filesystem shortcomings and dead wives have in common?

http://abclocal.go.com/kgo/story?section=local&id=4646839

-j

-- 
--------------------------------------------------------
 Rev. Jeffrey Paul    -datavibe-     sneak@datavibe.net
  aim:x736e65616b   pgp:0xD9B3C17D   phone:877-748-3467
   9440 0C7F C598 01CA 2F17  D098 0A3A 4B8F D9B3 C17D
--------------------------------------------------------

* Re: avoiding the initial resync on --create
  2006-10-09 19:45       ` Doug Ledford
  2006-10-09 21:33         ` Neil Brown
@ 2006-10-11 21:24         ` Michael Tokarev
  1 sibling, 0 replies; 16+ messages in thread
From: Michael Tokarev @ 2006-10-11 21:24 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Rob Bray, linux-raid mailing list

Doug Ledford wrote:
> On Mon, 2006-10-09 at 15:10 -0400, Rob Bray wrote:
[]
> Probably the best thing to do would be on create of the array, setup a
> large all 0 block of mem and repeatedly write that to all blocks in the
> array devices except parity blocks and use a large all 1 block for that.
> Then you could just write the entire array at blinding speed.  You could
> call that the "quick-init" option or something.  You wouldn't be able to
> use the array until it was done, but it would be quick.  If you wanted
> to be *really* fast, at least for SCSI drives you could write one large
> chunk of 0's and one large chunk of 1's at the first parity block, then
> use the SCSI COPY command to copy the 0 chunk everywhere it needs to go,
> and likewise for the parity chunk, and avoid transferring the data over
> the SCSI bus more than once.

Some notes.

First, a raid array sometimes gets created in order to repair a broken
array.  I.e., you had an array, you lost it for whatever reason, and you
re-create it, avoiding the initial resync (--assume-clean option), in the
hope that your data is still there.  For that, you don't want to
zero-fill your drives, for sure! :)

And second, at least SCSI drives have a FORMAT UNIT command, which takes
a range argument (from-sector and to-sector) and, if memory serves me
right, a "filler" argument as well (the data, a 512-byte block, to
write to all the sectors in the range).  (Well, it was long ago when
I looked at that stuff, so it might be some other command, but it's
there anyway.)  I'm not sure it's used/available in the block device
layer (most probably it isn't).  But this is the fastest way to fill
(parts of) your drives with whatever repeated pattern of bytes you want,
including this initial zero-filling.

But either way, you don't really need to do that in kernel space; a
userspace solution will work too.  OK, if the kernel does it after
array creation, the array is available immediately for other use,
which is a plus.
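
For illustration, a minimal sketch of such a userspace zero-fill, assuming
the target (a component device or an ordinary file) can be opened for
writing and that its length is known to the caller.  The function name
zero_fill and the 1 MiB default chunk size are hypothetical choices, not
anything mdadm provides.

```python
import os


def zero_fill(path, length, chunk_size=1024 * 1024):
    """Overwrite the first `length` bytes of `path` with zeros.

    One zero-filled buffer is allocated and written repeatedly, so memory
    use stays at `chunk_size` however large the target is.
    """
    zeros = bytes(chunk_size)
    written = 0
    # "r+b" opens an existing target for update without truncating it.
    with open(path, "r+b") as dev:
        while written < length:
            n = min(chunk_size, length - written)
            dev.write(zeros[:n])
            written += n
        dev.flush()
        os.fsync(dev.fileno())  # push the data out before returning
    return written
```

Note that this goes through the normal write path, so unlike a
drive-side fill command every byte still crosses the bus once per
device.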

And yes, I'm not sure implementing it is worth the effort.  Unless
you're re-creating your multi-terabyte array several times a day ;)

/mjt

end of thread, other threads:[~2006-10-11 21:24 UTC | newest]

Thread overview: 16+ messages
2006-10-09 12:57 avoiding the initial resync on --create martin f krafft
2006-10-09 13:49 ` Erik Mouw
2006-10-09 16:32   ` Doug Ledford
2006-10-09 19:10     ` Rob Bray
2006-10-09 19:45       ` Doug Ledford
2006-10-09 21:33         ` Neil Brown
2006-10-09 21:45           ` Doug Ledford
2006-10-09 23:14             ` Neil Brown
2006-10-11 21:24         ` Michael Tokarev
2006-10-10  9:55     ` Gabor Gombas
2006-10-10 17:47       ` Doug Ledford
2006-10-10 19:18         ` Sergey Vlasov
2006-10-10 20:38           ` Doug Ledford
2006-10-10 20:37         ` Gabor Gombas
2006-10-10 21:26           ` Doug Ledford
2006-10-10 22:14             ` Rev. Jeffrey Paul
