RAID6 Reshape Gone Awry

* RAID6 Reshape Gone Awry
From: Flynn @ 2012-08-03  5:27 UTC
  To: linux-raid

Apologies in advance if this is the wrong place for this...

I'd been running a RAID6 with 5 1.5TB drives on CentOS 5.ancient for quite 
a while.  Last week, I wanted to add a drive, and promptly ran into issues 
with my CentOS mdadm being unable to do the obvious thing with mdadm 
--grow, so I upgraded to Ubuntu 12.04 LTS.

All was well, briefly.

My RAID6 is actually a little bit odd in that the drives are split into 10 
partitions.  All the partition 5's are a RAID6; all the partition 6's are a 
RAID6; etc.  There's an LVM layer that sits on top.  This turned out to be 
handy when I changed the size of the drives in the RAID, so I stuck with it.

This means I have to actually do 10 mdadm --grow commands.  My original 
cunning plan was to issue one, wait for that partition to reshape, issue 
another, etc.  I scripted this -- and made a mistake, so the 'wait' step 
didn't happen.  I ended up with all ten partitions grown to 6 drives, and 
most of them marked pending reshape.
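
For reference, the intended one-at-a-time procedure might look roughly like the sketch below. It is not the script that was actually run. The md5 to md14 names match the layout described in this thread, but the NEW_DISK placeholder, the backup-file path, and the dry-run switch are assumptions made for illustration. With DRY_RUN=1 (the default) it only prints the commands.

```shell
#!/bin/sh
# Sketch: grow the arrays one at a time, waiting for each reshape to finish.
# Assumptions (not from the original post): NEW_DISK is a placeholder for the
# newly added drive, and the backup-file path is hypothetical and must live
# on storage outside the arrays being reshaped.
NEW_DISK=${NEW_DISK:-/dev/sdX}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

grow_one() {
    n=$1
    # Add the new partition as a spare, then reshape onto it.
    run mdadm /dev/md"$n" --add "${NEW_DISK}${n}" || return 1
    run mdadm --grow /dev/md"$n" --raid-devices=6 \
        --backup-file=/root/md"$n"-grow.backup || return 1
    # Block until the reshape is done. mdadm --wait exits non-zero when it
    # found nothing to wait for, so its status is deliberately ignored.
    run mdadm --wait /dev/md"$n" || true
}

grow_all() {
    for n in 5 6 7 8 9 10 11 12 13 14; do
        grow_one "$n" || { echo "md$n failed; stopping" >&2; return 1; }
    done
}
```

Calling grow_all with DRY_RUN=1 lists the thirty commands in order, which makes it easy to check that a wait follows every grow before running anything for real.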

Again, all was well.

But you can guess what happened next, can't you?  That's right, the machine 
crashed.  On reboot, the reshape that had been underway at the time 
(partition 7) picked up and carried on just fine.  But partition 8 didn't. 
Nor anything after.

So at this point I have partitions 5, 6, and 7 happy; 8 - 14 are marked 
inactive.  The initial mdadm --grow reported that it passed the critical 
section long before the machine crashed, for all partitions.  mdadm 
--examine on the individual drives shows that each of these partitions 
believes that it is part of a RAID6 with 6 drives, correct checksums 
everywhere, event counters the same, but:

1)  Trying e.g.

    sudo mdadm --assemble --force /dev/md8 /dev/sd[bdefgh]8

says

mdadm: Failed to restore critical section for reshape, sorry.
      Possibly you needed to specify the --backup-file

Given that I didn't specify --backup-file to the initial mdadm --grow, this 
seems... perhaps not entirely helpful.

2)  In a working partition, I always see the 'this' entry in mdadm 
--examine's output matching up with the drive being read (e.g. /dev/sde5 
will say 'this' is /dev/sde5).  In a _non_-working partition, that's not 
the case (e.g. /dev/sdb7 says 'this' is /dev/sdg7).

3)  Finally, all the working partitions show that their superblocks are 
version 0.90.00, but all the non-working partitions show 0.91.00.
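
A small helper can put the three observations above side by side for every member of an array. This is a sketch that assumes the "Version", "Events" and "this" lines look the way mdadm prints them for 0.90-format superblocks; the function name is made up for illustration. As far as I know, a 0.91 minor version is how 0.90-format metadata marks a reshape in progress, which would fit with only the unfinished arrays showing it.

```shell
# Sketch: condense `mdadm --examine` output (0.90-format superblocks) into one
# line per member: device, superblock version, event count, and the device
# named on the "this" row. The field patterns are assumptions about the
# output layout; adjust them if your mdadm prints something different.
summarize_examine() {
    awk '
        /^\/dev\/.*:$/ { sub(/:$/, ""); dev = $0; order[++n] = dev }  # "/dev/sdb8:" header
        $1 == "Version" && $2 == ":" { ver[dev] = $3 }
        $1 == "Events"  && $2 == ":" { ev[dev] = $3 }
        $1 == "this" { self[dev] = $NF }   # last column is the device path
        END {
            for (i = 1; i <= n; i++) {
                d = order[i]
                flag = (self[d] == d) ? "" : "  <-- this != device"
                printf "%s version=%s events=%s this=%s%s\n", d, ver[d], ev[d], self[d], flag
            }
        }
    '
}
```

Usage would be along the lines of `sudo mdadm --examine /dev/sd[bdefgh]8 | summarize_examine`, which flags any member whose 'this' entry names a different device.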

I've been beating my head on this for a while, Googling around, learning a 
fair amount but not getting very far.  In theory there's nothing on this 
array that's irreplaceable (it's meant as a backup, not a primary store) 
but, well, it'd be nice to repair it rather than blowing it away.

This is mdadm 3.2.3.  Suggestions very welcome.  I can provide the output 
of whatever people would like to see, of course, but figured I'd wait for 
requests...

Thanks!

 -- Flynn

--
Never let your sense of morals get in the way of doing what's right.
                                                            (Isaac Asimov)

* Re: RAID6 Reshape Gone Awry
From: Stan Hoeppner @ 2012-08-03 11:18 UTC
  To: Flynn; +Cc: Linux RAID

On 8/3/2012 12:27 AM, Flynn wrote:

> My RAID6 is actually a little bit odd in that the drives are split into
> 10 partitions.  All the partition 5's are a RAID6; all the partition 6's
> are a RAID6; etc.

md offered the ability, so you _could_ create such a monstrosity.  But
you never bothered to consider if you _should_.

The primary function of RAID is to protect your data in the event of a
_disk_ failure.  Creating multiple arrays from _partitions_ on the same
set of physical disks does nothing to protect one from disk failure.

What it can do is cause massive problems for the elevator when you try
to reshape 10 arrays simultaneously, which just happen to reside on the
same set of disks.  By doing this you force the heads on the drives into
a massive random seek pattern, bumping all over the platters, top to
bottom.  This is likely what caused, or is directly related to, your crash.

> Suggestions very welcome.

Back up what you need to external storage.  Blow the entire mess away.
Start over from scratch, and build a single RAID6 array, as you should
have in the first place.

md allows the use of partitions, but not so you can create 50 arrays on
the same set of disks, shooting yourself in the foot.  Similarly, most
cars can travel at velocities over 120 mph, but most people have enough
sense not to attempt driving that fast.

Learn the difference between "Can I?" and "Should I?".  You never
bothered to consider the latter when you built this.  Please consider it
now, for your sake.

-- 
Stan

* Re: RAID6 Reshape Gone Awry
From: David Brown @ 2012-08-03 13:37 UTC
  To: stan; +Cc: Flynn, Linux RAID

On 03/08/2012 13:18, Stan Hoeppner wrote:
> On 8/3/2012 12:27 AM, Flynn wrote:
>
>> My RAID6 is actually a little bit odd in that the drives are split into
>> 10 partitions.  All the partition 5's are a RAID6; all the partition 6's
>> are a RAID6; etc.
>
> md offered the ability, so you _could_ create such a monstrosity.  But
> you never bothered to consider if you _should_.
>
> The primary function of RAID is to protect your data in the event of a
> _disk_ failure.  Creating multiple arrays from _partitions_ on the same
> set of physical disks does nothing to protect one from disk failure.
>

That's not how I understand the disk layout - if I'm right, it is still 
a monstrosity, but one that will offer protection on disk failure.

As I read it, he has this (prior to adding the new disk):

md0 = raid6(sda5, sdb5, sdc5, sdd5, sde5)
md1 = raid6(sda6, sdb6, sdc6, sdd6, sde6)
...
md9 = raid6(sda14, sdb14, sdc14, sdd14, sde14)

If that's the case, then it will be an administrative mess (as the OP is 
now experiencing), but it will protect the data, and if the LVM is a 
linear concatenation of these then performance normally will be okay. 
Of course, if the LVM tries to use a stripe of these arrays, it will be 
terrible - and rebuild/reshape will involve massively inefficient head 
movement (as you noted).

> What it can do is cause massive problems for the elevator when you try
> to reshape 10 arrays simultaneously, which just happen to reside on the
> same set of disks.  By doing this you force the heads on the drives into
> a massive random seek pattern, bumping all over the platters, top to
> bottom.  This is likely what caused, or is directly related to, your crash.
>
>> Suggestions very welcome.
>
> Back up what you need to external storage.  Blow the entire mess away.
> Start over from scratch, and build a single RAID6 array, as you should
> have in the first place.

If the OP can manage it, then I agree.

>
> md allows the use of partitions, but not so you can create 50 arrays on
> the same set of disks, shooting yourself in the foot.  Similarly, most
> cars can travel at velocities over 120 mph, but most people have enough
> sense not to attempt driving that fast.

I have sometimes used multiple arrays like this:

md0 = raid1,n4(sda1, sdb1, sdc1, sdd1) for /boot (makes grub happy)
md1 = raid5(sda2, sdb2, sdc2, sdd2) for everything else

But this particular setup seems very odd to me - I would love to know 
the reasoning behind it.

>
> Learn the difference between "Can I?" and "Should I?".  You never
> bothered to consider the latter when you built this.  Please consider it
> now, for your sake.
>


* Re: RAID6 Reshape Gone Awry
From: Roman Mamedov @ 2012-08-03 14:53 UTC
  To: David Brown; +Cc: stan, Flynn, Linux RAID

On Fri, 03 Aug 2012 15:37:36 +0200
David Brown <david.brown@hesbynett.no> wrote:

> That's not how I understand the disk layout - if I'm right, it is still 
> a monstrosity, but one that will offer protection on disk failure.
> 
> As I read it, he has this (prior to adding the new disk):
> 
> md0 = raid6(sda5, sdb5, sdc5, sdd5, sde5)
> md1 = raid6(sda6, sdb6, sdc6, sdd6, sde6)
> ...
> md9 = raid6(sda14, sdb14, sdc14, sdd14, sde14)
> 
> If that's the case, then it will be an administrative mess (as the OP is 
> now experiencing), but it will protect the data, and if the LVM is a 
> linear concatenation of these then performance normally will be okay. 

If you want the RAID5/6 write performance to be okay, you will want to
increase stripe_cache_size to a good value [1] -- and that's per array, and the
RAM consumption increases linearly with disk count - so on 10 five-member
arrays you won't have anywhere near enough RAM to have a sufficient
stripe_cache on all of them.
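
To put rough numbers on that, here is a sketch using the commonly cited estimate of page size times member count times stripe_cache_size. The 4096-byte page size and the value 8192 are example assumptions, not figures taken from this thread.

```shell
# Sketch: estimate RAM used by the md stripe cache, in MiB.
# Estimate used: memory = page_size * nr_disks * stripe_cache_size per array.
# The default page size and the example values below are assumptions.
stripe_cache_mib() {
    entries=$1       # value written to /sys/block/mdN/md/stripe_cache_size
    disks=$2         # member devices in each array
    arrays=${3:-1}   # number of arrays configured the same way
    page=${4:-4096}  # bytes per page
    echo $(( entries * disks * page * arrays / 1024 / 1024 ))
}

stripe_cache_mib 8192 5 10   # ten 5-member arrays at 8192 entries: prints 1600
```

The same setting on a single 5-member array comes to 160 MiB, so the cost of the multi-array layout is simply that figure multiplied by the number of arrays.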

In other words, one more aspect in which this multi-array configuration is
highly suboptimal. :)

[1]
http://peterkieser.com/2009/11/29/raid-mdraid-stripe_cache_size-vs-write-transfer/

-- 
With respect,
Roman

~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."

* Re: RAID6 Reshape Gone Awry
From: Flynn @ 2012-08-03 15:44 UTC
  To: David Brown; +Cc: stan, Linux RAID

David writes:
> As I read it, he has this (prior to adding the new disk):
>
> md0 = raid6(sda5, sdb5, sdc5, sdd5, sde5)
> md1 = raid6(sda6, sdb6, sdc6, sdd6, sde6)
> ...
> md9 = raid6(sda14, sdb14, sdc14, sdd14, sde14)

That's correct (although it's md5 - md14, to match the partition numbers). 
You're also correct that the LVM is a concatenation rather than striped. 
It performs just fine for its use case: occasional large writes (mostly 
with scp), lots of reading.

David continues:
> I have sometimes used multiple arrays like this:
>
> md0 = raid1,n4(sda1, sdb1, sdc1, sdd1) for /boot (makes grub happy)
> md1 = raid5(sda2, sdb2, sdc2, sdd2) for everything else
>
> But this particular setup seems very odd to me - I would love to know the
> reasoning behind it.

In fact, there is also a RAID1 md0 for grub's sake as well, but it's not 
relevant to the problem.

I first built this array about four years ago, when CentOS 5.2 was current. 
It started life as a RAID5 (not 6) of 4 500GB drives, and I knew when I 
created it that I'd need to grow it over time by adding drives.

At that time, though, mdadm as shipped with CentOS 5.2 couldn't reshape a 
RAID5 -- IIRC, the most recent version of mdadm at the time listed it as an 
experimental feature that would eat your data and give you bad breath.  But 
LVM + md + multiple partitions makes it possible, as long as you hold some 
space in reserve (a good idea for snapshot support anyway).  Use pvmove to 
clear a given md device, pull the md out of the LVM, disassemble it, 
reassemble it in whatever new configuration you need, and then put it back 
into the LVM.
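
Spelled out as commands, that cycle might look something like the sketch below for a single md device. It only prints a plan and runs nothing. The volume group name, the device names, and the RAID parameters in the example call are hypothetical.

```shell
# Sketch of the pvmove-and-rebuild cycle for one md device. It only prints
# the commands. The VG name, devices and RAID level passed in are hypothetical.
rebuild_plan() {
    vg=$1; md=$2; level=$3; shift 3   # remaining arguments: member partitions
    echo "pvmove $md"                 # migrate extents to other PVs in the VG
    echo "vgreduce $vg $md"           # drop the now-empty PV from the VG
    echo "pvremove $md"               # wipe the LVM label from the md device
    echo "mdadm --stop $md"           # disassemble the old array
    echo "mdadm --create $md --level=$level --raid-devices=$# $*"
    echo "pvcreate $md"               # put the new array back under LVM
    echo "vgextend $vg $md"
}

rebuild_plan backupvg /dev/md8 6 /dev/sda8 /dev/sdb8 /dev/sdc8 /dev/sdd8 /dev/sde8
```

The pvmove step is the reason for holding space in reserve: the other physical volumes need enough free extents to absorb everything on the device being rebuilt.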

Yes, it is an administrative mess.  But it was a powerful administrative 
mess.  [ :) ]  This array has gone from a 4x500GB RAID5 to a 4x1500GB RAID5 
to a 5x1500GB RAID6, without ever running anything in degraded mode, or 
taking the array as a whole offline for any significant time.

Of course, the downside is that pvmove + recreating the array spends a lot 
of time hammering the drives: for 5x1500 RAID6 to 6x1500 RAID6, it was 
looking like a few weeks.  Since mdadm _can_ reshape RAID6 now, and it was 
past time to get off CentOS anyway, spending a few weeks beating on the 
disk drives didn't much appeal to me.

To preempt a few other obvious questions: CentOS was a plus because I 
worked at a shop that made heavy use of RHEL at the time.  Getting CentOS 
to boot off RAID sucked, though; that plus my tendency towards sysadmin by 
not screwing with a working system made me disinclined, for a long time, to 
go to a newer OS or mdadm.  And it's a rather stripped-down system, to make 
security simpler to manage.

At this point, the system boots Ubuntu off CF, sidestepping the whole 
booting-off-RAID issue completely.

Stan notes:
> What it can do is cause massive problems for the elevator when you try
> to reshape 10 arrays simultaneously...

Note, though, that mdadm _did not_ try to reshape ten arrays 
simultaneously.  It marked all but one as "pending" and then started into 
reshaping the one, which isn't any more abuse of the elevator algorithm 
than it normally gets...

Stan also suggests:
> Back up what you need to external storage [and] [s]tart over from
> scratch...

to which David concurs:
> If the OP can manage it, then I agree.

Nope, the OP cannot, especially not with arrays that can't be started.
[ :) ]  As noted, in theory it's all replaceable data anyway, but it would
be much more pleasant to not have to make the experiment.

<deep breath>  OK.  All that being said, can we perhaps take the honor of 
the list as upheld, and return to the question of recovery?  Is there a way 
to recover a RAID6 where the event counters and checksums and all that are 
consistent, but where the superblock is marked as version 0.91.00, and 
where it complains about failing to restore the critical section, even 
though it said it got past the critical section before?

Thanks much!

 -- Flynn

--
Never let your sense of morals get in the way of doing what's right.
                                                            (Isaac Asimov)

* Re: RAID6 Reshape Gone Awry
From: Stan Hoeppner @ 2012-08-03 18:25 UTC
  To: David Brown; +Cc: Flynn, Linux RAID

On 8/3/2012 8:37 AM, David Brown wrote:
> On 03/08/2012 13:18, Stan Hoeppner wrote:

>> The primary function of RAID is to protect your data in the event of a
>> _disk_ failure.  Creating multiple arrays from _partitions_ on the same
>> set of physical disks does nothing to protect one from disk failure.
>>
> 
> That's not how I understand the disk layout - if I'm right, it is still
> a monstrosity, but one that will offer protection on disk failure.

I didn't state this setup would not protect data from disk failure.  I
stated that a single array would have done that.  Making 9 more simply
causes all kinds of problems without any additional benefits.

-- 
Stan

