* modifying degraded raid 1 then re-adding other members is bad
From: Alexandre Oliva @ 2006-08-08  9:38 UTC
To: linux-raid, linux-kernel

Assume I have a fully-functional raid 1 between two disks, one
hot-pluggable and the other fixed.

If I unplug the hot-pluggable disk and reboot, the array will come up
degraded, as intended.

If I then modify a lot of the data in the raid device (say it's my
root fs and I'm running daily Fedora development updates :-), which
modifies only the fixed disk, and then plug the hot-pluggable disk in
and re-add its members, it appears that it comes up without resyncing
and, well, major filesystem corruption ensues.

Is this a known issue, or should I try to gather more info about it?

This happened with 2.6.18-rc3-git3, -git6 or -git7 (not sure which),
plus Fedora development patches.
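
In case it helps, the sequence boils down to roughly the following
(a sketch only - the device names are illustrative, not my actual
setup):

  # create a two-disk raid1 and let the initial sync finish
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

  # "unplug" one disk and reboot; the array assembles degraded
  mdadm --stop /dev/md0
  mdadm --assemble --run /dev/md0 /dev/sda1

  # modify a lot of data on /dev/md0 (only /dev/sda1 sees the writes),
  # then let the array go idle so it is marked 'clean' again

  # plug the other disk back in and re-add it
  mdadm /dev/md0 --add /dev/sdb1

  # expected: a full recovery onto /dev/sdb1 (visible in /proc/mdstat)
  # observed: the stale disk is included with no resync
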
--
Alexandre Oliva http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America http://www.fsfla.org/
Red Hat Compiler Engineer aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist oliva@{lsd.ic.unicamp.br, gnu.org}

* Re: modifying degraded raid 1 then re-adding other members is bad
From: Neil Brown @ 2006-08-08 11:12 UTC
To: Alexandre Oliva; +Cc: linux-raid, linux-kernel

On Tuesday August 8, aoliva@redhat.com wrote:
> Assume I have a fully-functional raid 1 between two disks, one
> hot-pluggable and the other fixed.
>
> If I unplug the hot-pluggable disk and reboot, the array will come up
> degraded, as intended.
>
> If I then modify a lot of the data in the raid device (say it's my
> root fs and I'm running daily Fedora development updates :-), which
> modifies only the fixed disk, and then plug the hot-pluggable disk in
> and re-add its members, it appears that it comes up without resyncing
> and, well, major filesystem corruption ensues.
>
> Is this a known issue, or should I try to gather more info about it?

Looks a lot like
http://bugzilla.kernel.org/show_bug.cgi?id=6965

Two patches are below: one against -mm and one against -linus.
Please confirm whether the appropriate one helps.

NeilBrown

(-mm)

Avoid backward event updates in md superblock when degraded.

If we
 - shut down a clean array,
 - restart with one (or more) drive(s) missing,
 - make some changes,
 - pause, so that the array gets marked 'clean',
the event count on the superblock of the included drives
will be the same as that of the removed drives.
So adding a removed drive back in will cause it
to be included with no resync.

To avoid this, we only update the event count backwards when the array
is not degraded. In this case there can (should) be no non-connected
drives that we can get confused with, and this is the particular case
where updating backwards is valuable.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2006-08-03 11:42:48.000000000 +1000
+++ ./drivers/md/md.c	2006-08-07 08:57:10.000000000 +1000
@@ -1609,6 +1609,17 @@ repeat:
 		nospares = 1;
 	if (force_change)
 		nospares = 0;
+	if (mddev->degraded)
+		/* If the array is degraded, then skipping spares is both
+		 * dangerous and fairly pointless.
+		 * Dangerous because a device that was removed from the array
+		 * might have an event_count that still looks up-to-date,
+		 * so it can be re-added without a resync.
+		 * Pointless because if there are any spares to skip,
+		 * then a recovery will happen and soon the array won't
+		 * be degraded any more and the spare can go back to sleep.
+		 */
+		nospares = 0;

 	sync_req = mddev->in_sync;
 	mddev->utime = get_seconds();

---------------------------------------

(-linus)

Avoid backward event updates in md superblock when degraded.

If we
 - shut down a clean array,
 - restart with one (or more) drive(s) missing,
 - make some changes,
 - pause, so that the array gets marked 'clean',
the event count on the superblock of the included drives
will be the same as that of the removed drives.
So adding a removed drive back in will cause it
to be included with no resync.

To avoid this, we only update the event count backwards when the array
is not degraded. In this case there can (should) be no non-connected
drives that we can get confused with, and this is the particular case
where updating backwards is valuable.
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2006-08-08 09:00:44.000000000 +1000
+++ ./drivers/md/md.c	2006-08-08 09:04:04.000000000 +1000
@@ -1597,6 +1597,19 @@ void md_update_sb(mddev_t * mddev)

 repeat:
 	spin_lock_irq(&mddev->write_lock);
+
+	if (mddev->degraded && mddev->sb_dirty == 3)
+		/* If the array is degraded, then skipping spares is both
+		 * dangerous and fairly pointless.
+		 * Dangerous because a device that was removed from the array
+		 * might have an event_count that still looks up-to-date,
+		 * so it can be re-added without a resync.
+		 * Pointless because if there are any spares to skip,
+		 * then a recovery will happen and soon the array won't
+		 * be degraded any more and the spare can go back to sleep.
+		 */
+		mddev->sb_dirty = 1;
+
 	sync_req = mddev->in_sync;
 	mddev->utime = get_seconds();
 	if (mddev->sb_dirty == 3)
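
To see the event counts these patches are about, the per-device
superblocks can be inspected directly - for example (device names
illustrative):

  mdadm --examine /dev/sda1 | grep -i events
  mdadm --examine /dev/sdb1 | grep -i events

Without the patch, a pause while the array is degraded can bring the
active drive's count back level with the removed drive's, which is
what allows the no-resync re-add.
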
* Re: modifying degraded raid 1 then re-adding other members is bad
From: Michael Tokarev @ 2006-08-08 11:19 UTC
To: Neil Brown; +Cc: Alexandre Oliva, linux-raid, linux-kernel

Neil Brown wrote:
> On Tuesday August 8, aoliva@redhat.com wrote:
>> Assume I have a fully-functional raid 1 between two disks, one
>> hot-pluggable and the other fixed.
>>
>> If I unplug the hot-pluggable disk and reboot, the array will come up
>> degraded, as intended.
>>
>> If I then modify a lot of the data in the raid device (say it's my
>> root fs and I'm running daily Fedora development updates :-), which
>> modifies only the fixed disk, and then plug the hot-pluggable disk in
>> and re-add its members, it appears that it comes up without resyncing
>> and, well, major filesystem corruption ensues.
>>
>> Is this a known issue, or should I try to gather more info about it?
>
> Looks a lot like
> http://bugzilla.kernel.org/show_bug.cgi?id=6965
>
> Two patches are below: one against -mm and one against -linus.
> Please confirm whether the appropriate one helps.
>
> NeilBrown
>
> (-mm)
>
> Avoid backward event updates in md superblock when degraded.
>
> If we
>  - shut down a clean array,
>  - restart with one (or more) drive(s) missing,
>  - make some changes,
>  - pause, so that the array gets marked 'clean',
> the event count on the superblock of the included drives
> will be the same as that of the removed drives.
> So adding a removed drive back in will cause it
> to be included with no resync.
>
> To avoid this, we only update the event count backwards when the array
> is not degraded. In this case there can (should) be no non-connected
> drives that we can get confused with, and this is the particular case
> where updating backwards is valuable.

Why are we updating it BACKWARD in the first place?

Also, why is the event counter checked when we add something to the
array -- shouldn't it resync regardless?

Thanks.

/mjt

* Re: modifying degraded raid 1 then re-adding other members is bad
From: Krzysztof Halasa @ 2006-08-08 19:17 UTC
To: Michael Tokarev; +Cc: Neil Brown, Alexandre Oliva, linux-raid, linux-kernel

Michael Tokarev <mjt@tls.msk.ru> writes:

> Why are we updating it BACKWARD in the first place?

Another scenario: 1 disk (of 2) is removed, another is added, the
RAID-1 is rebuilt, then the disk added last is removed and replaced by
the disk which was removed first. Would it trigger this problem?

> Also, why is the event counter checked when we add something to the
> array -- shouldn't it resync regardless?

I think it's a full start, not a hot add. For a hot add, the contents
of the new disk should be ignored.
--
Krzysztof Halasa

* Re: modifying degraded raid 1 then re-adding other members is bad
From: Neil Brown @ 2006-08-08 22:33 UTC
To: Krzysztof Halasa
Cc: Michael Tokarev, Alexandre Oliva, linux-raid, linux-kernel

On Tuesday August 8, khc@pm.waw.pl wrote:
> Michael Tokarev <mjt@tls.msk.ru> writes:
>
>> Why are we updating it BACKWARD in the first place?
>
> Another scenario: 1 disk (of 2) is removed, another is added, the
> RAID-1 is rebuilt, then the disk added last is removed and replaced by
> the disk which was removed first. Would it trigger this problem?

No. The removal and the addition each move the event count clearly
forward, so the removed drive will have an old event count and will
not be considered for easy inclusion.

>> Also, why is the event counter checked when we add something to the
>> array -- shouldn't it resync regardless?
>
> I think it's a full start, not a hot add. For a hot add, the contents
> of the new disk should be ignored.

See my other post for why I sometimes want to avoid a recovery on a
hot-add.

NeilBrown

* Re: modifying degraded raid 1 then re-adding other members is bad
From: Neil Brown @ 2006-08-08 22:30 UTC
To: Michael Tokarev; +Cc: Alexandre Oliva, linux-raid, linux-kernel

On Tuesday August 8, mjt@tls.msk.ru wrote:
>
> Why are we updating it BACKWARD in the first place?

To avoid writing to the spares when it isn't needed - some people want
their spare drives to go to sleep. If we incremented the event count
without writing to the spares, the spares would quickly get left
behind and would not be included the next time the array is assembled.

So on superblock updates that are purely for setting/clearing the
'dirty' bit, we rock back and forward between X and X+1, while leaving
the spares at X. A difference of 1 isn't enough to leave a drive out
of an array, so the spares stay part of the array. 'X' means clean and
'X+1' means dirty, so if there is any inconsistency at startup,
'dirty' wins, which is proper.

Any other superblock change, like drives failing or being added,
causes a normal forward change of 'events', and the spares get written
to as well.

> Also, why is the event counter checked when we add something to the
> array -- shouldn't it resync regardless?

If we know it to be in sync, why should we resync it?

This is part of a longer-term strategy to play nicely with hotplug.
What I would like is that whenever hotplug finds a device, the hotplug
system can call

  mdadm --hot-plug-this-new-drive-somewhere-useful /dev/newdisk

(or something like that) and the drive will be added to an appropriate
array (if there is one).

So now you have the question: when do you actually activate an array?
Do I wait until there are just enough drives to start it degraded, or
do I wait until all drives are present? The latter might never happen;
the former might cause lots of unnecessary resyncs.

With the above feature (a hot add of a current drive doesn't cause a
resync), I can activate the array as soon as there are enough drives
for it to work at all. It can then be read from even though it isn't
complete. Once the first write happens, we commit to the current
layout and a new drive will have to be resynced; but if the array
becomes complete before the first write, no resync will be needed.
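
In pseudo-code, the rocking looks something like this - a simplified
sketch only, not the actual md.c logic, and the names here are
invented for illustration:

/* Sketch of the clean/dirty event-count "rocking" described above.
 * Illustrative only; the real code works in terms of superblock
 * writes, not these helpers. */
struct array_state {
	unsigned long long events;	/* event count in each member's superblock */
	int degraded;			/* non-zero if a member is missing */
};

/* clean -> dirty: the first write after the array was marked clean */
static void mark_dirty(struct array_state *a)
{
	a->events++;	/* X -> X+1, written to active members only;
			 * spares stay at X and keep sleeping */
}

/* dirty -> clean: the array has gone idle again */
static void mark_clean(struct array_state *a)
{
	if (!a->degraded)
		a->events--;	/* X+1 -> X: spares left at X stay current */
	else
		a->events++;	/* the fix: while degraded, never step
				 * backwards, or a removed member still
				 * holding X would look up-to-date and
				 * be re-added without a resync */
}

Hope that makes it a bit clearer.

NeilBrown
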
* Re: modifying degraded raid 1 then re-adding other members is bad
From: Jan Engelhardt @ 2006-08-09 6:35 UTC
To: Neil Brown; +Cc: Michael Tokarev, Alexandre Oliva, linux-raid, linux-kernel

>> Why are we updating it BACKWARD in the first place?
>
> To avoid writing to the spares when it isn't needed - some people want
> their spare drives to go to sleep.

That sounds a little dangerous. What if it decrements below 0?

Jan Engelhardt
--

* Re: modifying degraded raid 1 then re-adding other members is bad
From: Neil Brown @ 2006-08-09 23:18 UTC
To: Jan Engelhardt; +Cc: Michael Tokarev, Alexandre Oliva, linux-raid, linux-kernel

On Wednesday August 9, jengelh@linux01.gwdg.de wrote:
>>> Why are we updating it BACKWARD in the first place?
>>
>> To avoid writing to the spares when it isn't needed - some people want
>> their spare drives to go to sleep.
>
> That sounds a little dangerous. What if it decrements below 0?

It cannot. md decrements the event count only on a dirty->clean
transition, and only if it had previously incremented the count on a
clean->dirty transition. So it can never go below what it was when the
array was assembled.

NeilBrown

* Re: modifying degraded raid 1 then re-adding other members is bad
From: Helge Hafting @ 2006-08-09 9:01 UTC
To: Michael Tokarev; +Cc: Neil Brown, Alexandre Oliva, linux-raid, linux-kernel

Michael Tokarev wrote:
> Why are we updating it BACKWARD in the first place?

Don't know this one...

> Also, why is the event counter checked when we add something to the
> array -- shouldn't it resync regardless?

If you remove a drive and then add it back with no changes in the
meantime, you don't want a resync to happen. Some people reboot their
machine every day (too much noise, heat or electricity at night); a
daily resync would be excessive.

And which drive would you consider the "master copy" anyway, if the
event counts match?

Helge Hafting