* --assume-clean on raid5/6
@ 2010-08-06  1:19 brian.foster
  2010-08-07 12:28 ` Stefan /*St0fF*/ Hübner
  0 siblings, 1 reply; 4+ messages in thread
From: brian.foster @ 2010-08-06  1:19 UTC (permalink / raw)
  To: linux-raid

Hi all,

I've read in the list archives that using --assume-clean on raid5
(raid6?) is not safe if the member drives are not in sync, but it's
not clear to me why. I can see the content of a written raid5
array change if I fail a drive out of the array (created w/
--assume-clean), but data that I write prior to failing a drive remains
intact. Perhaps I'm missing something. Could somebody elaborate on the
danger/risk of using --assume-clean? Thanks in advance.

Brian


* Re: --assume-clean on raid5/6
  2010-08-06  1:19 --assume-clean on raid5/6 brian.foster
@ 2010-08-07 12:28 ` Stefan /*St0fF*/ Hübner
  2010-08-08  8:56   ` Neil Brown
  0 siblings, 1 reply; 4+ messages in thread
From: Stefan /*St0fF*/ Hübner @ 2010-08-07 12:28 UTC (permalink / raw)
  To: brian.foster; +Cc: linux-raid

Hi Brian,

--assume-clean skips the initial resync, which is a time-saving idea if
you will create a filesystem right after creating the array.
But keep in mind: even if the disks are brand new, they do not
necessarily contain only zeros, so the parity will probably not be
consistent.  Reading from such an array would therefore be a bad idea.
But if the next thing you do is create LVM/a filesystem etc., then every
bit read from the array will have been written before (and is therefore
in sync).

Stefan

On 06.08.2010 03:19, brian.foster@emc.com wrote:
> Hi all,
> 
> I've read in the list archives that using --assume-clean on raid5
> (raid6?) is not safe if the member drives are not in sync, but it's
> not clear to me why. I can see the content of a written raid5
> array change if I fail a drive out of the array (created w/
> --assume-clean), but data that I write prior to failing a drive remains
> intact. Perhaps I'm missing something. Could somebody elaborate on the
> danger/risk of using --assume-clean? Thanks in advance.
> 
> Brian



* Re: --assume-clean on raid5/6
  2010-08-07 12:28 ` Stefan /*St0fF*/ Hübner
@ 2010-08-08  8:56   ` Neil Brown
  2010-08-08 14:17     ` brian.foster
  0 siblings, 1 reply; 4+ messages in thread
From: Neil Brown @ 2010-08-08  8:56 UTC (permalink / raw)
  To: st0ff; +Cc: stefan.huebner, brian.foster, linux-raid

On Sat, 07 Aug 2010 14:28:55 +0200
Stefan /*St0fF*/ Hübner <stefan.huebner@stud.tu-ilmenau.de> wrote:

> Hi Brian,
> 
> --assume-clean skips the initial resync, which is a time-saving idea if
> you will create a filesystem right after creating the array.
> But keep in mind: even if the disks are brand new, they do not
> necessarily contain only zeros, so the parity will probably not be
> consistent.  Reading from such an array would therefore be a bad idea.
> But if the next thing you do is create LVM/a filesystem etc., then every
> bit read from the array will have been written before (and is therefore
> in sync).

There is an important point that this misses.

When md updates a block on a RAID5 it will sometimes use a read-modify-write
cycle, which reads the old block and old parity, subtracts the old block from
the parity block, and then adds the new block to the parity block.  Then it
writes the new data block and the new parity block.

If the old parity was correct for the old stripe, then the new parity will be
correct for the new stripe.  But if the old parity was wrong, the new parity
will be wrong too.

So if you use assume-clean then the parity may well be wrong and could remain
wrong even when you write new data.  If you then lose a device, the data for
that device will be computed using wrong parity and you will get wrong data -
hence data corruption.
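
A toy illustration of that arithmetic, with bash integer XOR standing
in for the per-byte parity maths (all numbers invented):

  old_data=170    # 0xaa: old contents of the data block
  old_parity=15   # 0x0f: stale parity; correct would be 170 here
  new_data=85     # 0x55: the block we now write

  # read-modify-write: subtract the old data, add the new (both XOR)
  new_parity=$(( old_parity ^ old_data ^ new_data ))   # 240, should be 85

  # if the device holding that data block fails, md reconstructs it
  # from parity and the remaining (here: all-zero) data blocks
  echo $(( new_parity ^ 0 ))   # prints 240, not 85 -> corruption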

So you should only use --assume-clean if you know the array really is
'clean'.

RAID1/RAID10 cannot suffer from this, so --assume-clean is quite safe with
those array types.
The current implementation of RAID6 never does read-modify-write, so
--assume-clean is currently safe with RAID6 too.  However, I do not promise
that RAID6 will never switch to read-modify-write cycles in some future
implementation, so I would not recommend using --assume-clean on RAID6 just
to avoid the resync cost.

NeilBrown

> 
> Stefan
> 
> On 06.08.2010 03:19, brian.foster@emc.com wrote:
> > Hi all,
> > 
> > I've read in the list archives that using --assume-clean on raid5
> > (raid6?) is not safe if the member drives are not in sync, but it's
> > not clear to me why. I can see the content of a written raid5
> > array change if I fail a drive out of the array (created w/
> > --assume-clean), but data that I write prior to failing a drive remains
> > intact. Perhaps I'm missing something. Could somebody elaborate on the
> > danger/risk of using --assume-clean? Thanks in advance.
> > 
> > Brian



* RE: --assume-clean on raid5/6
  2010-08-08  8:56   ` Neil Brown
@ 2010-08-08 14:17     ` brian.foster
  0 siblings, 0 replies; 4+ messages in thread
From: brian.foster @ 2010-08-08 14:17 UTC (permalink / raw)
  To: neilb, st0ff; +Cc: stefan.huebner, linux-raid

> -----Original Message-----
> From: Neil Brown [mailto:neilb@suse.de]
> Sent: Sunday, August 08, 2010 4:56 AM
> To: st0ff@npl.de
> Cc: stefan.huebner@stud.tu-ilmenau.de; Foster, Brian; linux-
> raid@vger.kernel.org
> Subject: Re: --assume-clean on raid5/6
> 
> On Sat, 07 Aug 2010 14:28:55 +0200
> Stefan /*St0fF*/ Hübner <stefan.huebner@stud.tu-ilmenau.de> wrote:
> 
> > Hi Brian,
> >
> > --assume-clean skips the initial resync, which is a time-saving idea
> > if you will create a filesystem right after creating the array.
> > But keep in mind: even if the disks are brand new, they do not
> > necessarily contain only zeros, so the parity will probably not be
> > consistent.  Reading from such an array would therefore be a bad idea.
> > But if the next thing you do is create LVM/a filesystem etc., then
> > every bit read from the array will have been written before (and is
> > therefore in sync).
> 
> There is an important point that this misses.
> 
> When md updates a block on a RAID5 it will sometimes use a
> read-modify-write cycle, which reads the old block and old parity,
> subtracts the old block from the parity block, and then adds the new
> block to the parity block.  Then it writes the new data block and the
> new parity block.
> 
> If the old parity was correct for the old stripe, then the new parity
> will be correct for the new stripe.  But if the old parity was wrong,
> the new parity will be wrong too.
> 
> So if you use assume-clean then the parity may well be wrong and could
> remain wrong even when you write new data.  If you then lose a device,
> the data for that device will be computed using wrong parity and you
> will get wrong data - hence data corruption.
> 
> So you should only use --assume-clean if you know the array really is
> 'clean'.
> 

Thanks for the information, guys. I was actually attempting to test whether this could occur with a high-level sequence similar to the following (sketched as a script after the list):

- dd /dev/urandom data to 4 small partitions (~10MB each).
- Create a raid5 with --assume-clean on said partitions.
- Write a small bit of data (32 bytes) to the beginning of the md, capture an image of the md to a file.
- Fail/remove a drive from the md, capture a second md file image.
- cmp the file images to see what changed, and read back the first 32 bytes of data.
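
Roughly, as a script -- the device names and sizes are made up, and it
wipes those partitions:

  devs="/dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1"
  for d in $devs; do dd if=/dev/urandom of=$d bs=1M count=10; done

  mdadm --create /dev/md0 --level=5 --raid-devices=4 --assume-clean $devs

  printf '%032d' 1 | dd of=/dev/md0 bs=32 count=1   # 32 bytes of data
  dd if=/dev/md0 of=/tmp/md.before bs=1M

  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
  dd if=/dev/md0 of=/tmp/md.after bs=1M

  cmp -l /tmp/md.before /tmp/md.after | head        # what changed?
  dd if=/dev/md0 bs=32 count=1 2>/dev/null | od -c  # data still intact?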

In this scenario I do observe differences in the file images, but my data remains intact. I ran this sequence multiple times, failing a different drive in the array each time, and also tried stopping/restarting the array (with a drop_caches in between) before the drive-failure step. This leads to my question: is there a write test that can reproduce data corruption under this scenario, or is the rmw cycle an optimization that is not so deterministic?

Also out of curiosity, would --assume-clean be safe on a raid5 if the drives were explicitly zeroed beforehand? Thanks again.
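
For reference, one way I've been thinking of verifying the parity state,
assuming the md sysfs scrub interface:

  # ask md to scrub the array and count parity mismatches; a nonzero
  # mismatch_cnt means the parity really is inconsistent
  echo check > /sys/block/md0/md/sync_action
  while [ "$(cat /sys/block/md0/md/sync_action)" != idle ]; do sleep 1; done
  cat /sys/block/md0/md/mismatch_cnt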

Brian

> RAID1/RAID10 cannot suffer from this, so --assume-clean is quite safe
> with those array types.
> The current implementation of RAID6 never does read-modify-write, so
> --assume-clean is currently safe with RAID6 too.  However, I do not
> promise that RAID6 will never switch to read-modify-write cycles in
> some future implementation, so I would not recommend using
> --assume-clean on RAID6 just to avoid the resync cost.
> 
> NeilBrown
> 
> >
> > Stefan
> >
> > On 06.08.2010 03:19, brian.foster@emc.com wrote:
> > > Hi all,
> > >
> > > I've read in the list archives that using --assume-clean on raid5
> > > (raid6?) is not safe if the member drives are not in sync, but it's
> > > not clear to me why. I can see the content of a written raid5
> > > array change if I fail a drive out of the array (created w/
> > > --assume-clean), but data that I write prior to failing a drive
> > > remains intact. Perhaps I'm missing something. Could somebody
> > > elaborate on the danger/risk of using --assume-clean? Thanks in
> > > advance.
> > >
> > > Brian
> 


