Linux RAID subsystem development


* Re: SRaid with 13 Disks crashed
From: Phil Turmel @ 2011-06-08 14:39 UTC (permalink / raw)
  To: Dragon; +Cc: linux-raid
In-Reply-To: <20110608142440.139240@gmx.net>

Hi Dragon,

On 06/08/2011 10:24 AM, Dragon wrote:
> SRaid with 13 Disks crashed
> Hello,
> 
> 
> This seems to be my last chance to get back all of my data from a software RAID5 (sw-raid5) with 12-13 disks.
> I use Debian (2.6.32-bpo.5-amd64), and most recently I wanted to grow the RAID from 12 to 13 disks, for a total size of 18 TB. After running mke2fs I found that the tool allows a maximum size of 16 TB on ext4. After that I wanted to shrink the array back to 12 disks, and now the RAID is gone.

Did you actually mean "mke2fs" ?  It destroys existing data.  I hope you meant "resize2fs".

> i tried some assemble and examine things but without success.
> 
> here some information:
>  cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : inactive sdh[0](S) sda[13](S) sdg[12](S) sdf[11](S) sde[10](S) sdd[9](S) sdc[8](S) sdb[6](S) sdm[5](S) sdl[4](S) sdj[3](S) sdi[2](S)
>       17581661952 blocks
> 
> unused devices: <none>
> 
> mdadm --detail /dev/md0
> mdadm: md device /dev/md0 does not appear to be active.
> 
>  mdadm --assemble --force -v /dev/md0 /dev/sdh /dev/sda /dev/sdg /dev/sdf /dev/sde /dev/sdd /dev/sdc /dev/sdb /dev/sdm /dev/sdl /dev/sdj /dev/sdi --update=super-minor /dev/sdh

Was /dev/sdk supposed to be in this list?

> mdadm: looking for devices for /dev/md0
> mdadm: updating superblock of /dev/sdh with minor number 0
> mdadm: /dev/sdh is identified as a member of /dev/md0, slot 0.
> mdadm: updating superblock of /dev/sda with minor number 0
> mdadm: /dev/sda is identified as a member of /dev/md0, slot 13.

This is suspicious.  Looks like sda was added as a spare?

> mdadm: updating superblock of /dev/sdg with minor number 0
> mdadm: /dev/sdg is identified as a member of /dev/md0, slot 12.
> mdadm: updating superblock of /dev/sdf with minor number 0
> mdadm: /dev/sdf is identified as a member of /dev/md0, slot 11.
> mdadm: updating superblock of /dev/sde with minor number 0
> mdadm: /dev/sde is identified as a member of /dev/md0, slot 10.
> mdadm: updating superblock of /dev/sdd with minor number 0
> mdadm: /dev/sdd is identified as a member of /dev/md0, slot 9.
> mdadm: updating superblock of /dev/sdc with minor number 0
> mdadm: /dev/sdc is identified as a member of /dev/md0, slot 8.
> mdadm: updating superblock of /dev/sdb with minor number 0
> mdadm: /dev/sdb is identified as a member of /dev/md0, slot 6.
> mdadm: updating superblock of /dev/sdm with minor number 0
> mdadm: /dev/sdm is identified as a member of /dev/md0, slot 5.
> mdadm: updating superblock of /dev/sdl with minor number 0
> mdadm: /dev/sdl is identified as a member of /dev/md0, slot 4.
> mdadm: updating superblock of /dev/sdj with minor number 0
> mdadm: /dev/sdj is identified as a member of /dev/md0, slot 3.
> mdadm: updating superblock of /dev/sdi with minor number 0
> mdadm: /dev/sdi is identified as a member of /dev/md0, slot 2.
> mdadm: updating superblock of /dev/sdh with minor number 0
> mdadm: /dev/sdh is identified as a member of /dev/md0, slot 0.
> mdadm: no uptodate device for slot 1 of /dev/md0
> mdadm: added /dev/sdi to /dev/md0 as 2
> mdadm: added /dev/sdj to /dev/md0 as 3
> mdadm: added /dev/sdl to /dev/md0 as 4
> mdadm: added /dev/sdm to /dev/md0 as 5
> mdadm: added /dev/sdb to /dev/md0 as 6
> mdadm: no uptodate device for slot 7 of /dev/md0
> mdadm: added /dev/sdc to /dev/md0 as 8
> mdadm: added /dev/sdd to /dev/md0 as 9
> mdadm: added /dev/sde to /dev/md0 as 10
> mdadm: added /dev/sdf to /dev/md0 as 11
> mdadm: added /dev/sdg to /dev/md0 as 12
> mdadm: added /dev/sda to /dev/md0 as 13
> mdadm: added /dev/sdh to /dev/md0 as 0
> mdadm: /dev/md0 assembled from 11 drives and 1 spare - not enough to start the array.

Indeed.  Your problem is likely to be /dev/sda.

> mdadm.conf
> #old=ARRAY /dev/md0 level=raid5 num-devices=13 metadata=0.90 UUID=975d6eb2:285eed11:021df236:c2d05073
> ARRAY /dev/md0 UUID=975d6eb2:285eed11:021df236:c2d05073
> 
> Hope some can help. Thx

Please share the output of "mdadm -E /dev/sd[abcdefghijklm]"

Phil

^ permalink raw reply

* Re: [PATCH/RFC] md/raid10: optimize read_balance() for 'far copies' arrays
From: Namhyung Kim @ 2011-06-08 14:39 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: NeilBrown, linux-raid
In-Reply-To: <20110608114924.GA10134@www2.open-std.org>

Keld Jørn Simonsen <keld@keldix.com> writes:
> On Wed, Jun 08, 2011 at 04:42:27PM +0900, Namhyung Kim wrote:
>> Still can't understand why we choose the closest-to-the-start disk in
>> case we could have possible sequential access on other disk. Probably
>> because of the lack of my understanding how md/disk works :(
>
> the nearest position was the case for the initial implementation of
> raid10-far.  But this had bad performance for an array with disks of
> varying specifications. And also it led to not using the faster
> outer sectors. Using the closest-to-beginning gave a speed-up of about
> 50% in some cases.
>

Hi Keld,

Thanks for the explanation. That means lower sectors reside on the outer
tracks/cylinders in the disk, right? The 50% seems a huge improvement I
couldn't stand against. Although my patch tried to choose
closest-to-current-head disk if the disk head is in the lowest stripe -
in the (similar) hope that it'd be on the outer tracks - I don't have
the numbers, so I'll just give up on it.

Besides, I just noticed that the rationale behind read_balance()
presumed that all components of the array are traditional disks. If we
could detect all/some of them are not (i.e. SSD, etc.), it would be
better off using some other criteria for the read balancing IMHO,
something like nr_pending?

-- 
Regards,
Namhyung Kim

^ permalink raw reply

* (unknown)
From: Dragon @ 2011-06-08 14:24 UTC (permalink / raw)
  To: linux-raid

SRaid with 13 Disks crashed
Hello,

This seems to be my last chance to get back all of my data from a software RAID5 (sw-raid5) with 12-13 disks.
I use Debian (2.6.32-bpo.5-amd64), and most recently I wanted to grow the RAID from 12 to 13 disks, for a total size of 18 TB. After running mke2fs I found that the tool allows a maximum size of 16 TB on ext4. After that I wanted to shrink the array back to 12 disks, and now the RAID is gone.

i tried some assemble and examine things but without success.

here some information:
 cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdh[0](S) sda[13](S) sdg[12](S) sdf[11](S) sde[10](S) sdd[9](S) sdc[8](S) sdb[6](S) sdm[5](S) sdl[4](S) sdj[3](S) sdi[2](S)
      17581661952 blocks

unused devices: <none>

mdadm --detail /dev/md0
mdadm: md device /dev/md0 does not appear to be active.

 mdadm --assemble --force -v /dev/md0 /dev/sdh /dev/sda /dev/sdg /dev/sdf /dev/sde /dev/sdd /dev/sdc /dev/sdb /dev/sdm /dev/sdl /dev/sdj /dev/sdi --update=super-minor /dev/sdh
mdadm: looking for devices for /dev/md0
mdadm: updating superblock of /dev/sdh with minor number 0
mdadm: /dev/sdh is identified as a member of /dev/md0, slot 0.
mdadm: updating superblock of /dev/sda with minor number 0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 13.
mdadm: updating superblock of /dev/sdg with minor number 0
mdadm: /dev/sdg is identified as a member of /dev/md0, slot 12.
mdadm: updating superblock of /dev/sdf with minor number 0
mdadm: /dev/sdf is identified as a member of /dev/md0, slot 11.
mdadm: updating superblock of /dev/sde with minor number 0
mdadm: /dev/sde is identified as a member of /dev/md0, slot 10.
mdadm: updating superblock of /dev/sdd with minor number 0
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 9.
mdadm: updating superblock of /dev/sdc with minor number 0
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 8.
mdadm: updating superblock of /dev/sdb with minor number 0
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 6.
mdadm: updating superblock of /dev/sdm with minor number 0
mdadm: /dev/sdm is identified as a member of /dev/md0, slot 5.
mdadm: updating superblock of /dev/sdl with minor number 0
mdadm: /dev/sdl is identified as a member of /dev/md0, slot 4.
mdadm: updating superblock of /dev/sdj with minor number 0
mdadm: /dev/sdj is identified as a member of /dev/md0, slot 3.
mdadm: updating superblock of /dev/sdi with minor number 0
mdadm: /dev/sdi is identified as a member of /dev/md0, slot 2.
mdadm: updating superblock of /dev/sdh with minor number 0
mdadm: /dev/sdh is identified as a member of /dev/md0, slot 0.
mdadm: no uptodate device for slot 1 of /dev/md0
mdadm: added /dev/sdi to /dev/md0 as 2
mdadm: added /dev/sdj to /dev/md0 as 3
mdadm: added /dev/sdl to /dev/md0 as 4
mdadm: added /dev/sdm to /dev/md0 as 5
mdadm: added /dev/sdb to /dev/md0 as 6
mdadm: no uptodate device for slot 7 of /dev/md0
mdadm: added /dev/sdc to /dev/md0 as 8
mdadm: added /dev/sdd to /dev/md0 as 9
mdadm: added /dev/sde to /dev/md0 as 10
mdadm: added /dev/sdf to /dev/md0 as 11
mdadm: added /dev/sdg to /dev/md0 as 12
mdadm: added /dev/sda to /dev/md0 as 13
mdadm: added /dev/sdh to /dev/md0 as 0
mdadm: /dev/md0 assembled from 11 drives and 1 spare - not enough to start the array.

mdadm.conf
#old=ARRAY /dev/md0 level=raid5 num-devices=13 metadata=0.90 UUID=975d6eb2:285eed11:021df236:c2d05073
ARRAY /dev/md0 UUID=975d6eb2:285eed11:021df236:c2d05073

Hope some can help. Thx

^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: Phil Turmel @ 2011-06-08 14:20 UTC (permalink / raw)
  To: David Brown
  Cc: John Robinson, linux-raid@vger.kernel.org, Stefan G. Weichinger,
	Maurice Hilarius, Thomas Harold
In-Reply-To: <isnj71$rap$1@dough.gmane.org>

Hi All,

On 06/08/2011 06:33 AM, David Brown wrote:
> On 08/06/2011 12:11, John Robinson wrote:
>> On 08/06/2011 10:38, David Brown wrote:
>>> On 08/06/2011 01:59, Thomas Harold wrote:
>>>> On 6/7/2011 4:07 PM, Maurice Hilarius wrote:
>>>>> On 6/7/2011 12:12 PM, Stefan G. Weichinger wrote:
>>>>>> Greetings, could you please advise me how to proceed?
>>>>>>
>>>>>> On a server I have 2 RAID1-arrays, each consisting of 2 TB-drives:
>>>>>>
>>>>>> ..
>>>>>>
>>>>>> Now I would like to move things to a more reliable RAID6 consisting of
>>>>>> all the four TB-drives ...
>>>>>>
>>>>>> How to do that with minimum risk?
>>>>>>
>>>>>> ..
>>>>>> Maybe I overlook a clever alternative?
>>>>>
>>>>> RAID 10 is as secure, and risk free, and much faster.
>>>>> And will cause much less CPU load.
>>>>>
>>>>
>>>> Well, with both a pair of RAID1 arrays and a pair of RAID-10 arrays, you
>>>> can lose 2 disks without losing data, but only if the right 2 disks
>>>> fail.
>>>>
>>>> With RAID6, any two of the four can fail without data loss.
>>>>
>>>
>>> It /sounds/ like RAID6 is more reliable here because it can always
>>> survive a second disk failure, while with RAID10 you have only a 66%
>>> chance of surviving a second disk failure.
>>>
>>> However, how often does a disk fail? What is the chance of a random disk
>>> failure in a given space of time? And how long will it go between one
>>> disk failing, and it being replaced and the array rebuilt? If you figure
>>> out these numbers, you'll have the probability of losing your RAID10
>>> array due to the second critical disk failing.
>>>
>>> To pick some rough numbers - say you've got low reliability, cheap disks
>>> with a 500,000 hour MTBF. If it takes you 3 days to replace a disk (over
>>> the weekend), and 8 hours to rebuild, you have a risk period of 80
>>> hours. That gives you a 0.016% chance of having the second disk failing.
>>> Even if you consider that a rebuild is quite stressful on the critical
>>> disk, it's not a big risk.
>>
>> It's not so much that the mirror disc might fail that I'd be worried
>> about, it's that you might find the odd sector failure during the
>> rebuild - this is the reason why RAID5 is now so disliked, and the
>> reasons apply similarly to RAID1 and RAID10 too, even if you're only
>> relying on one disc ('s worth of data) being perfect rather than two or
>> more.
> 
> I can see that problem, but it again boils down to probabilities.  The chances of seeing an unrecoverable read error are very low, just as with other disk errors.

The chances of any given unrecoverable read error are low, but during the rebuild, you are going to read every sector of the remaining drive in a mirror pair, or every sector of every remaining drive in a degraded raid5.  On large drives, you suddenly have a probability of uncorrectable error during rebuild that is orders of magnitude larger than the risk of a generic drive failure (in the rebuild window).
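
A rough sketch of that rebuild arithmetic, in Python. The error rate of one unrecoverable read error per 10^14 bits is an assumption (a figure often quoted on consumer drive data sheets, not a number from this thread), and errors are treated as independent, which real drives do not guarantee. The function name is hypothetical.

```python
import math

def p_read_error(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error while
    reading bytes_read bytes, assuming independent per-bit errors."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 1e12  # decimal terabyte, as used on drive labels

# Rebuilding one half of a 1 TB mirror reads the whole surviving drive.
print(f"1 TB mirror rebuild:     {p_read_error(1 * TB):.1%}")
# Rebuilding a degraded ten-disk RAID5 of 1 TB drives reads nine drives.
print(f"9 x 1 TB RAID5 rebuild:  {p_read_error(9 * TB):.1%}")
```

Under these assumptions the per-rebuild read-error risk comes out in the percent range and above, far larger than the whole-drive failure risk estimated for the same window elsewhere in this thread.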

Since Stefan reported that he does backups to this array, I suspect the performance is less important than the redundancy.  The difference in redundancy is *very* significant.

Here's some stats on disk failures themselves:
http://www.storagemojo.com/2007/02/19/googles-disk-failure-experience/

Here's some stats on read errors during rebuild:
http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019/

If I recall correctly, Google switched to exclusive use of triple-disk mirrors on its production servers for this very reason.  (I can't find a link at the moment....)

> The issue with RAID5 is that people often had large arrays with multiple disks, and on a rebuild /every/ sector had to be read.  So if you have a ten disk RAID5 and are rebuilding, you are reading from all other 9 disks - you have 9 times as high a chance of having an unrecoverable read error ruin your day.
> 
> I look forward to the day bad block lists and hot replace are ready in mdraid - it will give us close to another disk's worth of redundancy without the cost.  For example, if one half of your raid1 mirror fails but is not totally dead (such as by having too many bad blocks), during rebuild you can keep both the good and bad halves in place.  Then if there is a read failure on the "good" half, you can probably still get the data from the "bad" half.

I don't see where either of these actually help the "rebuild after disk failure" situation?

Phil

^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: Stefan G. Weichinger @ 2011-06-08 12:31 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <isnga2$a7g$1@dough.gmane.org>

On 08.06.2011 11:43, David Brown wrote:

> This may be stating the obvious, but you do realise that converting to a
> four-disk RAID6 will not give you any more space?

Yes, I know. It is the improved redundancy I aim for.

S


^ permalink raw reply

* Re: [PATCH/RFC] md/raid10: optimize read_balance() for 'far copies' arrays
From: Keld Jørn Simonsen @ 2011-06-08 11:49 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: NeilBrown, linux-raid
In-Reply-To: <877h8w93bw.fsf@gmail.com>

On Wed, Jun 08, 2011 at 04:42:27PM +0900, Namhyung Kim wrote:
> NeilBrown <neilb@suse.de> writes:
> 
> > On Wed,  8 Jun 2011 16:00:45 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
> >
> >> If @conf->far_offset > 0, there is only 1 stripe so that we can treat
> >> the array same as 'near' arrays. Furthermore we could calculate new
> >> distance from the previous position even for the real 'far' array
> >> cases if the position of given disk is already in the lowest stripe.
> >> 
> > I agree that it still makes sense to do balancing if far_offset != 0.
> > However  there is absolutely no point in your change to the calculation of
> > new_distance.
> > You only want new_distance to contain a distance from head position if we
> > want to choose the device with the 'closest' head.  But we don't.  We want to
> > choose the device where the data is closest to the start of the device.  So
> > the current value for new_distance is correct.
> >
> 
> Still can't understand why we choose the closest-to-the-start disk in
> case we could have possible sequential access on other disk. Probably
> because of the lack of my understanding how md/disk works :(

the nearest position was the case for the initial implementation of
raid10-far.  But this had bad performance for an array with disks of
varying specifications. And also it led to not using the faster
outer sectors. Using the closest-to-beginning gave a speed-up of about
50% in some cases.

best regards
keld

^ permalink raw reply

* [PATCH] md/raid10: get rid of duplicated conditional expression
From: Namhyung Kim @ 2011-06-08 11:35 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Variable 'first' is initialized to zero and updated to @rdev->raid_disk
only if it is greater than or equal to 0. Thus 'first' is never negative,
so condition '>= first' always implies '>= 0' and the latter is not needed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/raid10.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index fc56bdd8c3fb..fcb86e86bc31 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1093,8 +1093,7 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
 	if (rdev->raid_disk >= 0)
 		first = last = rdev->raid_disk;
 
-	if (rdev->saved_raid_disk >= 0 &&
-	    rdev->saved_raid_disk >= first &&
+	if (rdev->saved_raid_disk >= first &&
 	    conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
 		mirror = rdev->saved_raid_disk;
 	else
-- 
1.7.5.2
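
The equivalence the commit message relies on can be checked with a small brute-force sketch. This is Python rather than kernel C, the function names are hypothetical, and it models only the two comparisons touched by the patch, not the rest of raid10_add_disk().

```python
def old_condition(saved_raid_disk: int, first: int) -> bool:
    """The two comparisons before the patch."""
    return saved_raid_disk >= 0 and saved_raid_disk >= first

def new_condition(saved_raid_disk: int, first: int) -> bool:
    """The single comparison after the patch."""
    return saved_raid_disk >= first

# 'first' starts at 0 and is only overwritten by a non-negative raid_disk,
# so only first >= 0 needs to be covered; saved_raid_disk may be negative.
mismatches = [
    (saved, first)
    for first in range(0, 16)
    for saved in range(-4, 16)
    if old_condition(saved, first) != new_condition(saved, first)
]
print(mismatches)
```

The list stays empty only because 'first' cannot be negative. If it could be, a negative saved_raid_disk would pass the new test and then be used as an index into conf->mirrors[].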


^ permalink raw reply related

* Re: from 2x RAID1 to 1x RAID6 ?
From: David Brown @ 2011-06-08 10:33 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <4DEF4AC5.1090003@anonymous.org.uk>

On 08/06/2011 12:11, John Robinson wrote:
> On 08/06/2011 10:38, David Brown wrote:
>> On 08/06/2011 01:59, Thomas Harold wrote:
>>> On 6/7/2011 4:07 PM, Maurice Hilarius wrote:
>>>> On 6/7/2011 12:12 PM, Stefan G. Weichinger wrote:
>>>>> Greetings, could you please advise me how to proceed?
>>>>>
>>>>> On a server I have 2 RAID1-arrays, each consisting of 2 TB-drives:
>>>>>
>>>>> ..
>>>>>
>>>>> Now I would like to move things to a more reliable RAID6 consisting of
>>>>> all the four TB-drives ...
>>>>>
>>>>> How to do that with minimum risk?
>>>>>
>>>>> ..
>>>>> Maybe I overlook a clever alternative?
>>>>
>>>> RAID 10 is as secure, and risk free, and much faster.
>>>> And will cause much less CPU load.
>>>>
>>>
>>> Well, with both a pair of RAID1 arrays and a pair of RAID-10 arrays, you
>>> can lose 2 disks without losing data, but only if the right 2 disks
>>> fail.
>>>
>>> With RAID6, any two of the four can fail without data loss.
>>>
>>
>> It /sounds/ like RAID6 is more reliable here because it can always
>> survive a second disk failure, while with RAID10 you have only a 66%
>> chance of surviving a second disk failure.
>>
>> However, how often does a disk fail? What is the chance of a random disk
>> failure in a given space of time? And how long will it go between one
>> disk failing, and it being replaced and the array rebuilt? If you figure
>> out these numbers, you'll have the probability of losing your RAID10
>> array due to the second critical disk failing.
>>
>> To pick some rough numbers - say you've got low reliability, cheap disks
>> with a 500,000 hour MTBF. If it takes you 3 days to replace a disk (over
>> the weekend), and 8 hours to rebuild, you have a risk period of 80
>> hours. That gives you a 0.016% chance of having the second disk failing.
>> Even if you consider that a rebuild is quite stressful on the critical
>> disk, it's not a big risk.
>
> It's not so much that the mirror disc might fail that I'd be worried
> about, it's that you might find the odd sector failure during the
> rebuild - this is the reason why RAID5 is now so disliked, and the
> reasons apply similarly to RAID1 and RAID10 too, even if you're only
> relying on one disc ('s worth of data) being perfect rather than two or
> more.

I can see that problem, but it again boils down to probabilities.  The 
chances of seeing an unrecoverable read error are very low, just as with 
other disk errors.

The issue with RAID5 is that people often had large arrays with multiple 
disks, and on a rebuild /every/ sector had to be read.  So if you have a 
ten disk RAID5 and are rebuilding, you are reading from all other 9 
disks - you have 9 times as high a chance of having an unrecoverable 
read error ruin your day.

I look forward to the day bad block lists and hot replace are ready in 
mdraid - it will give us close to another disk's worth of redundancy 
without the cost.  For example, if one half of your raid1 mirror fails 
but is not totally dead (such as by having too many bad blocks), during 
rebuild you can keep both the good and bad halves in place.  Then if 
there is a read failure on the "good" half, you can probably still get 
the data from the "bad" half.

>
> Still, I don't have any stats to back this up...
>

Statistics on these things are pretty much worthless unless you have 
hundreds of systems deployed - either your array dies, or it does not. 
It's like lottery tickets, but in reverse - no matter how many tickets 
you buy, you can be confident that you won't win, despite statistics 
that prove that /somebody/ wins each draw.

So you install your RAID10 (or RAID6, if you prefer) system, and make 
sure you keep backups.  And if you /do/ get hit by a double disk failure 
in the wrong place, you spend the day restoring everything from the 
backups.  When management complain that a 24 hour downtime doesn't fit 
with their 99.99% uptime expectations, you remind them that this is 
amortized over the next 27 years...
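
The 27-year figure can be reproduced with a short sketch. The function name is hypothetical and a year is taken as 365.25 days.

```python
HOURS_PER_YEAR = 365.25 * 24

def years_to_amortize(downtime_hours: float, uptime_target: float) -> float:
    """Years of otherwise uninterrupted service needed for one outage
    of downtime_hours to fit inside the uptime target."""
    allowed_fraction = 1.0 - uptime_target
    return downtime_hours / allowed_fraction / HOURS_PER_YEAR

# One 24 hour restore against a 99.99% uptime expectation.
print(f"{years_to_amortize(24, 0.9999):.1f} years")
```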

^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: John Robinson @ 2011-06-08 10:11 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid
In-Reply-To: <isng0m$8c9$1@dough.gmane.org>

On 08/06/2011 10:38, David Brown wrote:
> On 08/06/2011 01:59, Thomas Harold wrote:
>> On 6/7/2011 4:07 PM, Maurice Hilarius wrote:
>>> On 6/7/2011 12:12 PM, Stefan G. Weichinger wrote:
>>>> Greetings, could you please advise me how to proceed?
>>>>
>>>> On a server I have 2 RAID1-arrays, each consisting of 2 TB-drives:
>>>>
>>>> ..
>>>>
>>>> Now I would like to move things to a more reliable RAID6 consisting of
>>>> all the four TB-drives ...
>>>>
>>>> How to do that with minimum risk?
>>>>
>>>> ..
>>>> Maybe I overlook a clever alternative?
>>>
>>> RAID 10 is as secure, and risk free, and much faster.
>>> And will cause much less CPU load.
>>>
>>
>> Well, with both a pair of RAID1 arrays and a pair of RAID-10 arrays, you
>> can lose 2 disks without losing data, but only if the right 2 disks fail.
>>
>> With RAID6, any two of the four can fail without data loss.
>>
>
> It /sounds/ like RAID6 is more reliable here because it can always
> survive a second disk failure, while with RAID10 you have only a 66%
> chance of surviving a second disk failure.
>
> However, how often does a disk fail? What is the chance of a random disk
> failure in a given space of time? And how long will it go between one
> disk failing, and it being replaced and the array rebuilt? If you figure
> out these numbers, you'll have the probability of losing your RAID10
> array due to the second critical disk failing.
>
> To pick some rough numbers - say you've got low reliability, cheap disks
> with a 500,000 hour MTBF. If it takes you 3 days to replace a disk (over
> the weekend), and 8 hours to rebuild, you have a risk period of 80
> hours. That gives you a 0.016% chance of having the second disk failing.
> Even if you consider that a rebuild is quite stressful on the critical
> disk, it's not a big risk.

It's not so much that the mirror disc might fail that I'd be worried 
about, it's that you might find the odd sector failure during the 
rebuild - this is the reason why RAID5 is now so disliked, and the 
reasons apply similarly to RAID1 and RAID10 too, even if you're only 
relying on one disc ('s worth of data) being perfect rather than two or 
more.

Still, I don't have any stats to back this up...

Cheers,

John.


^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: David Brown @ 2011-06-08  9:43 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <4DEE6A11.1030205@xunil.at>

On 07/06/2011 20:12, Stefan G. Weichinger wrote:
>
> Greetings, could you please advise me how to proceed?
>
> On a server I have 2 RAID1-arrays, each consisting of 2 TB-drives:
>
> md5 : active raid1 sde1[0] sdf1[1]
>        976759936 blocks [2/2] [UU]
>
> md6 : active raid1 sdh1[1] sdg1[0]
>        976759936 blocks [2/2] [UU]
>
>
> md5 and md6 are right now physical volumes (PVs) in an LVM-volume-group.
> Nearly all the space is used right now (1.7 TB out of the ~2 TB).
>
> Now I would like to move things to a more reliable RAID6 consisting of
> all the four TB-drives ...
>
> How to do that with minimum risk?
>
> For sure it would be best to move all data aside, stop the arrays and
> build a new one ... etc
>
> Failing two drives and remove them from the RAID1s to build a new
> degraded RAID6 seems dangerous to me?
>
> Maybe I overlook a clever alternative?
>
> Suggestions welcome, thanks in advance.
>

This may be stating the obvious, but you do realise that converting to a 
four-disk RAID6 will not give you any more space?

You might want to consider replacing the drives when you do your 
re-shaping or rebuilding (whether you go for RAID6 or RAID10,far).




^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: David Brown @ 2011-06-08  9:38 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <4DEEBB66.7080802@nybeta.com>

On 08/06/2011 01:59, Thomas Harold wrote:
> On 6/7/2011 4:07 PM, Maurice Hilarius wrote:
>> On 6/7/2011 12:12 PM, Stefan G. Weichinger wrote:
>>> Greetings, could you please advise me how to proceed?
>>>
>>> On a server I have 2 RAID1-arrays, each consisting of 2 TB-drives:
>>>
>>> ..
>>>
>>> Now I would like to move things to a more reliable RAID6 consisting of
>>> all the four TB-drives ...
>>>
>>> How to do that with minimum risk?
>>>
>>> ..
>>> Maybe I overlook a clever alternative?
>>
>> RAID 10 is as secure, and risk free, and much faster.
>> And will cause much less CPU load.
>>
>
> Well, with both a pair of RAID1 arrays and a pair of RAID-10 arrays, you
> can lose 2 disks without losing data, but only if the right 2 disks fail.
>
> With RAID6, any two of the four can fail without data loss.
>

It /sounds/ like RAID6 is more reliable here because it can always 
survive a second disk failure, while with RAID10 you have only a 66% 
chance of surviving a second disk failure.
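
The 66% figure can be checked by enumeration. This is a minimal sketch that assumes a four-disk array made of two independent mirror pairs; the disk numbering and names are hypothetical, and other RAID10 layouts may pair disks differently.

```python
from itertools import permutations

# Hypothetical 4-disk RAID10: disks 0 and 1 mirror each other, as do 2 and 3.
MIRROR_OF = {0: 1, 1: 0, 2: 3, 3: 2}

def survival_fraction() -> float:
    """Fraction of ordered two-disk failures that leave the array intact."""
    cases = list(permutations(MIRROR_OF, 2))
    # The array is lost only when the second failure hits the mirror
    # partner of the first failed disk.
    survived = [c for c in cases if MIRROR_OF[c[0]] != c[1]]
    return len(survived) / len(cases)

print(f"{survival_fraction():.1%}")
```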

However, how often does a disk fail?  What is the chance of a random 
disk failure in a given space of time?  And how long will it go between 
one disk failing, and it being replaced and the array rebuilt?  If you 
figure out these numbers, you'll have the probability of losing your 
RAID10 array due to the second critical disk failing.

To pick some rough numbers - say you've got low reliability, cheap disks 
with a 500,000 hour MTBF.  If it takes you 3 days to replace a disk 
(over the weekend), and 8 hours to rebuild, you have a risk period of 80 
hours.  That gives you a 0.016% chance of having the second disk 
failing.  Even if you consider that a rebuild is quite stressful on the 
critical disk, it's not a big risk.
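
The rough numbers above work out as follows. The sketch assumes a constant failure rate (an exponential lifetime model), which is the usual way to read an MTBF figure but is an assumption here; the function name is hypothetical.

```python
import math

def p_failure_in_window(window_hours: float, mtbf_hours: float) -> float:
    """Chance that one disk fails within window_hours, assuming a
    constant failure rate of 1 / mtbf_hours."""
    return -math.expm1(-window_hours / mtbf_hours)

window = 3 * 24 + 8   # three days to replace the disk plus eight hours to rebuild
p = p_failure_in_window(window, 500_000)
print(f"{window} hour window: {p:.3%}")
```

This covers only a whole-disk failure of the one critical disk during the window; it says nothing about individual sector read errors during the rebuild.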

Compare that to the chance of losing data through other causes (fire, 
theft, user-error, motherboard failure, power supply problems, etc., 
etc.) and in reality the "higher risk" of RAID10 compared to RAID6 is a 
drop in the ocean.  RAID10 is /far/ from being the weak point in a 
typical server.

And you can also take into account that the disk usage patterns on RAID6 
are a lot more intensive and stressful on the disk than RAID10 - I would 
expect the lifetime of a RAID10 member disk to be much higher than that 
of a RAID6 member disk.

I don't have the statistics to prove it, but I am certainly happy to use 
RAID10 rather than RAID6 for our company servers.

Of course, I also have two backup servers on two different sites...

> (I still prefer RAID-10 over RAID-6 unless space is at an absolute
> premium. But for a four-disk setup, net disk space is the same and it's
> just a question of whether you want the speed of RAID-10 or the
> reliability of RAID-6.)

^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: Stefan G. Weichinger @ 2011-06-08  8:16 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org
In-Reply-To: <4DEECD61.4030909@anonymous.org.uk>

On 08.06.2011 03:16, John Robinson wrote:

> There may be a clever alternative, retaining single redundancy, if
> you don't mind buying one more disc, which I'm guessing you might do
> soon anyway as you're already 85% full. Or if not, it won't do too
> much harm to have a spare drive sitting on a shelf.

In fact I already have one ... but I can't use it as the 8 bays of that
server are already fully used. I could only attach that drive
temporarily with USB or so ...

> You can convert a 2-drive RAID1 to a 2-drive RAID5, then add the new 
> drive to double the size of the array, resize the PV, then move the
> PEs over from the other RAID1, then tear down that PV and RAID1, add
> one or both of those drives into the RAID5 and grow it to a RAID6.
> The only step at which you have a little less redundancy is while
> you're running the 3-drive RAID5 (well, it's still 1 drive but
> against 2 drives, instead of 1:1).

Clever idea, yes ... but a rather long way somehow ...

> On the other hand it might be easier to take a backup, which you 
> probably ought to do anyway!

Yep, I assume it will be that way: mv data aside, new array, data back
in ...

Thanks anyway, Stefan

^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: Stefan G. Weichinger @ 2011-06-08  8:06 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <4DEEBB66.7080802@nybeta.com>

On 08.06.2011 01:59, Thomas Harold wrote:

> Well, with both a pair of RAID1 arrays and a pair of RAID-10 arrays, you
> can lose 2 disks without losing data, but only if the right 2 disks fail.
> 
> With RAID6, any two of the four can fail without data loss.

Yes, that was my initial reason to try that.

> (I still prefer RAID-10 over RAID-6 unless space is at an absolute
> premium.  But for a four-disk setup, net disk space is the same and it's
> just a question of whether you want the speed of RAID-10 or the
> reliability of RAID-6.)

Reliability. There are backups done to that array.

^ permalink raw reply

* [PATCH] md/raid10: optimize read_balance() for 'far offset' arrays
From: Namhyung Kim @ 2011-06-08  7:57 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

If @conf->far_offset > 0, there is only 1 stripe so that we can treat
the array same as 'near' arrays.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/raid10.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6e846688962f..fc56bdd8c3fb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -531,7 +531,7 @@ retry:
 			break;
 
 		/* for far > 1 always use the lowest address */
-		if (conf->far_copies > 1)
+		if (conf->far_copies > 1 && conf->far_offset == 0)
 			new_distance = r10_bio->devs[slot].addr;
 		else
 			new_distance = abs(r10_bio->devs[slot].addr -
-- 
1.7.5.2


^ permalink raw reply related

* Re: Maximizing failed disk replacement on a RAID5 array
From: Brad Campbell @ 2011-06-08  7:57 UTC (permalink / raw)
  To: Durval Menezes; +Cc: Brad Campbell, linux-raid, Drew
In-Reply-To: <BANLkTi=4uHhNS7+WfcmvVSY11kDWe-e7ZQ@mail.gmail.com>

On 08/06/11 15:47, Durval Menezes wrote:

> I'm sorry if I did not make myself clear; I've already run both a
> "repair" on the RAID  (see above) and a "smartctl -t long" on the
> particular disk... I had about 40 bad sectors before, and now have
> just 4, but these 4 sectors persist as being marked in error... I
> think the "RAID repair" didn't touch them.

Apologies, I obviously missed that fact.

I think your best course of action in this case is to test both the 
other drives with SMART long checks and fail/replace the faulty one.

I've never had md not report a repaired sector when performing a repair 
operation.

I'll just re-iterate, if you take the bad sectors away without a good 
copy of the data on them, md won't know it is supposed to reconstruct 
those missing sectors.

Hrm.. *or*, and this is a big *or*, you could use hdparm to create 
correctable bad sectors on the copy at the appropriate LBAs, and md 
should do the right thing, as it will get read errors from those, which 
will go away when they are rewritten.

I'd not thought of that before, but it should do the trick.

^ permalink raw reply

* Re: Maximizing failed disk replacement on a RAID5 array
From: Durval Menezes @ 2011-06-08  7:47 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-raid, Drew
In-Reply-To: <4DEF258A.8090600@fnarfbargle.com>

Hello Brad,

On Wed, Jun 8, 2011 at 4:32 AM, Brad Campbell <brad@fnarfbargle.com> wrote:
> On 08/06/11 14:58, Durval Menezes wrote:
>
>> 1) can I simply skip over these sectors (using dd_rescue or multiple
>> dd invocations) when off-line copying the old disk to the new one,
>> trusting the RAID5 to reconstruct the data correctly from the other 2
>
> Noooooooooooo. As we stated early on, if you do that md will have no idea
> that the data missing is actually missing as the drive won't return a read
> error.

Even if a "repair" (echo "repair" >/sys/block/md1/md/sync_action,
checking progress with "cat /proc/mdstat" and completion with "tail -f
/var/log/messages | grep md") finishes with no errors?

> does a repair take long on your machine? I find that a few repair runs
> generally gets me enough re-writes to clear the dud sectors and allow an
> offline clone.

I'm sorry if I did not make myself clear; I've already run both a
"repair" on the RAID (see above) and a "smartctl -t long" on the
particular disk... I had about 40 bad sectors before, and now have
just 4, but these 4 sectors persist as being marked in error... I
think the "RAID repair" didn't touch them.

Cheers,
-- 
  Durval.

^ permalink raw reply

* Re: [PATCH/RFC] md/raid10: optimize read_balance() for 'far copies' arrays
From: Namhyung Kim @ 2011-06-08  7:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <20110608172157.4d6ac2a8@notabene.brown>

NeilBrown <neilb@suse.de> writes:

> On Wed,  8 Jun 2011 16:00:45 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
>
>> If @conf->far_offset > 0, there is only 1 stripe so that we can treat
>> the array same as 'near' arrays. Furthermore we could calculate new
>> distance from the previous position even for the real 'far' array
>> cases if the position of given disk is already in the lowest stripe.
>> 
> I agree that it still makes sense to do balancing if far_offset != 0.
> However, there is absolutely no point in your change to the calculation of
> new_distance.
> You only want new_distance to contain a distance from head position if we
> want to choose the device with the 'closest' head.  But we don't.  We want to
> choose the device where the data is closest to the start of the device.  So
> the current value for new_distance is correct.
>

I still can't understand why we choose the closest-to-the-start disk
when another disk might allow sequential access. Probably because of
my lack of understanding of how md/disks work :(


> If you would like to resubmit with just the first change I'll happily apply
> the patch.
>

OK. Will do that soon.


> If you have performed some tests and can demonstrate some cases where this
> makes something faster, and can show us the results of those tests, I would
> be even more happy!!!
>

I wish I could. :) However, unfortunately, I don't have such a real system
to test on.

Thanks.


^ permalink raw reply

* RE: [PATCH 00/22] IMSM checkpointing implementation
From: Wojcik, Krzysztof @ 2011-06-08  7:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid@vger.kernel.org, Neubauer, Wojciech, Kwolek, Adam,
	Williams, Dan J, Ciechanowski, Ed
In-Reply-To: <20110608172349.5ee242c9@notabene.brown>

Thanks Neil!

We will send a set of bug fixes today.

Regards
Krzysztof

> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de]
> Sent: Wednesday, June 08, 2011 9:24 AM
> To: Wojcik, Krzysztof
> Cc: linux-raid@vger.kernel.org; Neubauer, Wojciech; Kwolek, Adam;
> Williams, Dan J; Ciechanowski, Ed
> Subject: Re: [PATCH 00/22] IMSM checkpointing implementation
> 
> On Thu, 02 Jun 2011 16:48:08 +0200 Krzysztof Wojcik
> <krzysztof.wojcik@intel.com> wrote:
> 
> > IMSM for securing reshape process uses special disk area outside metadata
> > for reshaped area backup purposes. If just reshaped array area requires
> > backup, bunch of array stripes prepared for reshape is stored into
> > Migration Copy Area. In case of reshape interruption, Option ROM during
> > restart or mdadm during reshape restart (when no reboot occurs) will
> > restore Migration Copy Area to destination array.
> > Reshape can be continued from a stable array state.
> >
> > The following series implements IMSM checkpointing procedure.
> 
> I have applied most of these patches - some with some minor fixes.  The major
> changes I have already mentioned.
> 
> I will wait for your responses to those changes, and the other bug fixes you
> mentioned before considering a release of 3.2.2, but I would like to make
> that release in the next week or two.
> 
> Thanks,
> NeilBrown
> 
> >
> > ---
> >
> > Adam Kwolek, Krzysztof Wojcik (22):
> >       imsm: Add migration record to intel_super
> >       Support restore_stripes() from the given buffer
> >       Define dummy functions to mdmon.c
> >       imsm: Add support for copy area and backup operations
> >       imsm: check migration compatibility
> >       FIX: Initialize reshape structure
> >       imsm: Add wait_for_reshape_imsm() implementation
> >       imsm: Implement imsm_manage_reshape(), reshape workhorse
> >       imsm: Check if array degradation has been changed
> >       imsm: Clear migration record when no migration in progress
> >       imsm: Add information about migration record to mdadm '-E' option
> >       imsm: update blocks_per_migr_unit() to support migration record
> >       Add reshape restart support for external metadata
> >       imsm: Implement recover_backup_imsm() for imsm metadata
> >       imsm: Disable checkpoint updating by mdmon for general migration
> >       imsm: Add metadata update type for general migration check-pointing
> >       imsm: Prepare checkpoint update for general migration
> >       imsm: Apply checkpoint metadata update for general migration
> >       FIX: Enable metadata updates for raid0
> >       Do not use backup file for external metadata
> >       imsm: Remove user warning before reshape start
> >       imsm: Unit Tests - remove backup-file during grow command
> >
> >
> >  Assemble.c               |   10
> >  Grow.c                   |   50 +-
> >  mdadm.h                  |    7
> >  mdmon.c                  |   23 +
> >  restripe.c               |  101 +++-
> >  super-intel.c            | 1241 ++++++++++++++++++++++++++++++++++++++++++++--
> >  tests/imsm-grow-template |    5
> >  7 files changed, 1322 insertions(+), 115 deletions(-)
> >


^ permalink raw reply

* Re: Maximizing failed disk replacement on a RAID5 array
From: Brad Campbell @ 2011-06-08  7:32 UTC (permalink / raw)
  To: Durval Menezes; +Cc: linux-raid, Drew
In-Reply-To: <BANLkTimxsT+htp82Us9uVgSdFNgb0m4vkQ@mail.gmail.com>

On 08/06/11 14:58, Durval Menezes wrote:

> 1) can I simply skip over these sectors (using dd_rescue or multiple
> dd invocations) when off-line copying the old disk to the new one,
> trusting the RAID5 to reconstruct the data correctly from the other 2

Noooooooooooo. As we stated early on, if you do that md will have no 
idea that the missing data is actually missing, as the drive won't return 
a read error.

Does a repair take long on your machine? I find that a few repair runs 
generally get me enough re-writes to clear the dud sectors and allow an 
offline clone.

If your dd of the old disk to the new disk aborts with an error, do 
_not_ under any circumstances (well, unless you have really good 
backups) do a dd_rescue and just swap the disks.

^ permalink raw reply

* Re: Nested RAID and booting
From: Roman Mamedov @ 2011-06-08  7:27 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid
In-Reply-To: <D1.7A.00666.3AE0FED4@cdptpa-omtalb.mail.rr.com>

[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

On Wed, 8 Jun 2011 00:54:44 -0500
"Leslie Rhorer" <lrhorer@satx.rr.com> wrote:

> Will simply putting them in the mdadm.conf file prior to the RAID6 array do
> the trick?

Seems to work just fine for me.

Well, I also have the member array explicitly listed in a DEVICE clause,
together with physical drives. So I am not really sure if the default
"DEVICE partitions" would also include MD devices.

So in my mdadm.conf I have, roughly: 

  DEVICE /dev/disk/by-id/ata-WDC_WD20EADS-*-part*
  DEVICE /dev/disk/by-id/ata-WDC_WD15EADS-*-part*
  DEVICE /dev/disk/by-id/ata-Hitachi_HDS*-part*
  DEVICE /dev/md1
  ARRAY /dev/md1 UUID=...................................
  ARRAY /dev/md0 UUID=...................................

...where md1 is a member of md0.

-- 
With respect,
Roman

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH 00/22] IMSM checkpointing implementation
From: NeilBrown @ 2011-06-08  7:23 UTC (permalink / raw)
  To: Krzysztof Wojcik
  Cc: linux-raid, wojciech.neubauer, adam.kwolek, dan.j.williams,
	ed.ciechanowski
In-Reply-To: <20110602144212.27355.3706.stgit@gklab-128-111.igk.intel.com>

On Thu, 02 Jun 2011 16:48:08 +0200 Krzysztof Wojcik
<krzysztof.wojcik@intel.com> wrote:

> IMSM for securing reshape process uses special disk area outside metadata
> for reshaped area backup purposes. If just reshaped array area requires
> backup, bunch of array stripes prepared for reshape is stored into
> Migration Copy Area. In case of reshape interruption, Option ROM during
> restart or mdadm during reshape restart (when no reboot occurs) will
> restore Migration Copy Area to destination array.
> Reshape can be continued from a stable array state.
> 
> The following series implements IMSM checkpointing procedure.

I have applied most of these patches - some with some minor fixes.  The major
changes I have already mentioned.

I will wait for your responses to those changes, and the other bug fixes you
mentioned before considering a release of 3.2.2, but I would like to make
that release in the next week or two.

Thanks,
NeilBrown

> 
> ---
> 
> Adam Kwolek, Krzysztof Wojcik (22):
>       imsm: Add migration record to intel_super
>       Support restore_stripes() from the given buffer
>       Define dummy functions to mdmon.c
>       imsm: Add support for copy area and backup operations
>       imsm: check migration compatibility
>       FIX: Initialize reshape structure
>       imsm: Add wait_for_reshape_imsm() implementation
>       imsm: Implement imsm_manage_reshape(), reshape workhorse
>       imsm: Check if array degradation has been changed
>       imsm: Clear migration record when no migration in progress
>       imsm: Add information about migration record to mdadm '-E' option
>       imsm: update blocks_per_migr_unit() to support migration record
>       Add reshape restart support for external metadata
>       imsm: Implement recover_backup_imsm() for imsm metadata
>       imsm: Disable checkpoint updating by mdmon for general migration
>       imsm: Add metadata update type for general migration check-pointing
>       imsm: Prepare checkpoint update for general migration
>       imsm: Apply checkpoint metadata update for general migration
>       FIX: Enable metadata updates for raid0
>       Do not use backup file for external metadata
>       imsm: Remove user warning before reshape start
>       imsm: Unit Tests - remove backup-file during grow command
> 
> 
>  Assemble.c               |   10 
>  Grow.c                   |   50 +-
>  mdadm.h                  |    7 
>  mdmon.c                  |   23 +
>  restripe.c               |  101 +++-
>  super-intel.c            | 1241 ++++++++++++++++++++++++++++++++++++++++++++--
>  tests/imsm-grow-template |    5 
>  7 files changed, 1322 insertions(+), 115 deletions(-)
> 


^ permalink raw reply

* Re: [PATCH/RFC] md/raid10: optimize read_balance() for 'far copies' arrays
From: NeilBrown @ 2011-06-08  7:21 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: linux-raid
In-Reply-To: <1307516445-3208-1-git-send-email-namhyung@gmail.com>

On Wed,  8 Jun 2011 16:00:45 +0900 Namhyung Kim <namhyung@gmail.com> wrote:

> If @conf->far_offset > 0, there is only 1 stripe so that we can treat
> the array same as 'near' arrays. Furthermore we could calculate new
> distance from the previous position even for the real 'far' array
> cases if the position of given disk is already in the lowest stripe.
> 
> Signed-off-by: Namhyung Kim <namhyung@gmail.com>
> ---
>  drivers/md/raid10.c |   14 +++++++++++---
>  1 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 6e846688962f..9ec4c5f8cd48 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -531,11 +531,19 @@ retry:
>  			break;
>  
>  		/* for far > 1 always use the lowest address */
> -		if (conf->far_copies > 1)
> -			new_distance = r10_bio->devs[slot].addr;
> -		else
> +		if (conf->far_copies > 1 && conf->far_offset == 0) {
> +			if (conf->mirrors[disk].head_position < conf->stride &&
> +			    r10_bio->devs[slot].addr < conf->stride)
> +				/* already in the lowest stripe */
> +				new_distance = abs(r10_bio->devs[slot].addr -
> +						   conf->mirrors[disk].head_position);
> +			else
> +				new_distance = r10_bio->devs[slot].addr;
> +		} else {
>  			new_distance = abs(r10_bio->devs[slot].addr -
>  					   conf->mirrors[disk].head_position);
> +		}
> +
>  		if (new_distance < best_dist) {
>  			best_dist = new_distance;
>  			best_slot = slot;


I agree that it still makes sense to do balancing if far_offset != 0.
However, there is absolutely no point in your change to the calculation of
new_distance.
You only want new_distance to contain a distance from head position if we
want to choose the device with the 'closest' head.  But we don't.  We want to
choose the device where the data is closest to the start of the device.  So
the current value for new_distance is correct.

If you would like to resubmit with just the first change I'll happily apply
the patch.

If you have performed some tests and can demonstrate some cases where this
makes something faster, and can show us the results of those tests, I would
be even more happy!!!

Thanks,
NeilBrown
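
The part of the change that Neil is willing to apply can be modelled outside the kernel. The sketch below is a simplified Python model, not the kernel code: it covers only the new_distance branch and the "smallest distance wins" loop, and it leaves out everything else read_balance() does. The function names and the list-of-tuples input are made up for illustration.

```python
def new_distance(addr, head_position, far_copies, far_offset):
    """Distance metric used to rank one candidate device.

    True 'far' layouts (far_copies > 1, far_offset == 0) rank by the
    address itself, so the copy nearest the start of a device wins.
    'Near' and 'offset' layouts rank by distance from the last known
    head position.
    """
    if far_copies > 1 and far_offset == 0:
        return addr
    return abs(addr - head_position)


def pick_slot(candidates, far_copies, far_offset):
    """Return the index of the best candidate.

    candidates is a list of (addr, head_position) tuples, one per slot.
    Ties go to the earliest slot because the comparison is strict.
    """
    best_slot = -1
    best_dist = float("inf")
    for slot, (addr, head_position) in enumerate(candidates):
        dist = new_distance(addr, head_position, far_copies, far_offset)
        if dist < best_dist:
            best_dist = dist
            best_slot = slot
    return best_slot


if __name__ == "__main__":
    candidates = [(1000, 990), (50, 900)]
    print("far:   ", pick_slot(candidates, far_copies=2, far_offset=0))
    print("offset:", pick_slot(candidates, far_copies=2, far_offset=1))
```

With these sample values the 'far' case picks the slot whose data sits at the lower address, while the 'offset' case picks the slot whose head is already close to the requested address.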

^ permalink raw reply

* Re: [PATCH 07/22] imsm: Add wait_for_reshape_imsm() implementation
From: NeilBrown @ 2011-06-08  7:07 UTC (permalink / raw)
  To: Krzysztof Wojcik
  Cc: linux-raid, wojciech.neubauer, adam.kwolek, dan.j.williams,
	ed.ciechanowski
In-Reply-To: <20110602144908.27355.89349.stgit@gklab-128-111.igk.intel.com>

On Thu, 02 Jun 2011 16:49:08 +0200 Krzysztof Wojcik
<krzysztof.wojcik@intel.com> wrote:

> From: Adam Kwolek <adam.kwolek@intel.com>
> 
> After each checkpoint mdadm should set new reshaped area and wait
> until md finishes reshape. Function wait_for_reshape_imsm() sets
> new reshape range and waits for job completion.
> 
> Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
> Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
> ---
>  super-intel.c |   61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 61 insertions(+), 0 deletions(-)
> 
> diff --git a/super-intel.c b/super-intel.c
> index 31fae1e..c395a48 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -8248,6 +8248,67 @@ exit_imsm_reshape_super:
>  	return ret_val;
>  }
>  
> +/*******************************************************************************
> + * Function:	wait_for_reshape_imsm
> + * Description:	Function writes new sync_max value and waits until
> + *		reshape process reach new position
> + * Parameters:
> + *	sra		: general array info
> + *	to_complete	: new sync_max position
> + *	ndata		: number of disks in new array's layout
> + * Returns:
> + *	 0 : success,
> + *	 1 : there is no reshape in progress,
> + *	-1 : fail
> + ******************************************************************************/
> +int wait_for_reshape_imsm(struct mdinfo *sra, unsigned long long to_complete,
> +			  int ndata)
> +{
> +	int fd = sysfs_get_fd(sra, NULL, "reshape_position");
> +	unsigned long long completed;
> +
> +	struct timeval timeout;
> +
> +	if (fd < 0)
> +		return 1;
> +
> +	sysfs_fd_get_ll(fd, &completed);
> +
> +	if (to_complete == 0) {/* reshape till the end of array */
> +		sysfs_set_str(sra, NULL, "sync_max", "max");
> +		to_complete = MaxSector;
> +	} else {
> +		if (completed > to_complete)
> +			return -1;
> +		if (sysfs_set_num(sra, NULL, "sync_max",
> +				  to_complete / ndata) != 0) {
> +			close(fd);
> +			return -1;
> +		}
> +	}
> +
> +	timeout.tv_sec = 0;
> +	timeout.tv_usec = 500000;

Having a 1/2 second timeout is wrong.  You shouldn't need a timeout at all.
If you do, there is a bug somewhere.

I changed this to 30 seconds.
> +	do {
> +		char action[20];
> +		fd_set rfds;
> +		FD_ZERO(&rfds);
> +		FD_SET(fd, &rfds);
> +		select(fd+1, NULL, NULL, &rfds, &timeout);
> +		if (sysfs_fd_get_ll(fd, &completed) < 0) {
> +			close(fd);
> +			return 1;
> +		}
> +		if (sysfs_get_str(sra, NULL, "sync_action",
> +			    action, 20) > 0 &&
> +			    strncmp(action, "reshape", 7) != 0)
> +			continue;

And if 'sync_action' is not 'reshape' any more then something must have
aborted and just 'continue'ing is wrong.  I have changed this to 'break', but
maybe you want to return an error.

NeilBrown


> +	} while (completed < to_complete);
> +	close(fd);
> +	return 0;
> +
> +}
> +
>  static int imsm_manage_reshape(
>  	int afd, struct mdinfo *sra, struct reshape *reshape,
>  	struct supertype *st, unsigned long stripes,
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
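
The two review comments (no short polling timeout, and stop when sync_action is no longer "reshape") can be illustrated with a small model of the wait loop. This is a sketch in Python rather than mdadm's C, and it takes the suggested "return an error" route instead of a bare break. The three callables are hypothetical stand-ins for reading reshape_position, reading sync_action, and blocking until sysfs reports a change.

```python
def wait_for_reshape(read_completed, read_action, wait_for_change, to_complete):
    """Wait until the reshape position reaches to_complete.

    Returns 0 once the target position is reached, or -1 if the array
    stops reshaping before getting there.
    """
    completed = read_completed()
    while completed < to_complete:
        # Block until the kernel signals a change instead of polling
        # on a short timer.
        wait_for_change()
        completed = read_completed()
        if read_action() != "reshape":
            # Something aborted the reshape; looping again would spin.
            break
    return 0 if completed >= to_complete else -1


if __name__ == "__main__":
    positions = iter([0, 40, 100])
    result = wait_for_reshape(lambda: next(positions),
                              lambda: "reshape",
                              lambda: None,
                              to_complete=100)
    print("result:", result)
```

Passing the readers in as callables keeps the loop logic testable without a real md device.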


^ permalink raw reply

* [PATCH/RFC] md/raid10: optimize read_balance() for 'far copies' arrays
From: Namhyung Kim @ 2011-06-08  7:00 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

If @conf->far_offset > 0, there is only 1 stripe so that we can treat
the array same as 'near' arrays. Furthermore we could calculate new
distance from the previous position even for the real 'far' array
cases if the position of given disk is already in the lowest stripe.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/raid10.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6e846688962f..9ec4c5f8cd48 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -531,11 +531,19 @@ retry:
 			break;
 
 		/* for far > 1 always use the lowest address */
-		if (conf->far_copies > 1)
-			new_distance = r10_bio->devs[slot].addr;
-		else
+		if (conf->far_copies > 1 && conf->far_offset == 0) {
+			if (conf->mirrors[disk].head_position < conf->stride &&
+			    r10_bio->devs[slot].addr < conf->stride)
+				/* already in the lowest stripe */
+				new_distance = abs(r10_bio->devs[slot].addr -
+						   conf->mirrors[disk].head_position);
+			else
+				new_distance = r10_bio->devs[slot].addr;
+		} else {
 			new_distance = abs(r10_bio->devs[slot].addr -
 					   conf->mirrors[disk].head_position);
+		}
+
 		if (new_distance < best_dist) {
 			best_dist = new_distance;
 			best_slot = slot;
-- 
1.7.5.2


^ permalink raw reply related

* Re: Maximizing failed disk replacement on a RAID5 array
From: Durval Menezes @ 2011-06-08  6:58 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-raid, Drew
In-Reply-To: <4DEDB8B7.2070708@fnarfbargle.com>

Hello,

On Tue, Jun 7, 2011 at 2:35 AM, Brad Campbell <brad@fnarfbargle.com> wrote:
> On 07/06/11 13:03, Durval Menezes wrote:
>>
>> Hello Folks,
>>
>> Just finished the "repair". It completed OK, and over SMART the HD now
>> shows a "Reallocated_Sector_Ct" of 291 (which shows that many bad
>> sectors have been remapped), but it's also still reporting 4
>> "Current_Pending_Sector" and 4 "Offline_Uncorrectable"... which I
>> think means exactly the same thing, ie, that there are 4 "active"
>> (from the HD perspective) sectors on the drive still detected as bad
>> and not remapped.
>>
>> I've been thinking about exactly what that means, and I think that
>> these 4 sectors are either A) outside the RAID partition (not very
>> probable as this partition occupies more than 99.99% of the disk,
>> leaving just a small, less than 105MB area at the beginning), or B)
>> some kind of metadata or unused space that hasn't been read and
>> rewritten by the "repair" I've just completed. I've just done a "dd
>> bs=1024k count=105</dev/DISK>/dev/null" to account for the
>> hypothesis A), and come out empty: no errors, and the drive still
>> shows 4 bad, unmapped sectors on SMART.
>>
>> So, by elimination, it must be either case B) above, or a bug in the
>> linux md code (which prevents it from hitting every needed block on
>> the disk), or a bug in SMART (which makes it report inexistent bad
>>
> Try running a SMART long test smartctl -t long and it will tell you whether
> the sectors are really bad or not.
> I've had instances where the firmware still thought that some previously
> pending sectors were still pending until I forced a test, at which time the
> drive came to its senses and they went away.
>
> I believe if you wait until the drive gets around to doing its periodic
> offline data collection you'll see the same thing, but a long test is nice
> as it will give you an actual block number for the first failure (if you
> have one)

I did it (smartctl -t long) and it completed (registering an error at
the very end of the disk):

     SMART Self-test log structure revision number 1
     Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
     # 1  Extended offline    Completed: read failure       10%      9942         2930273794

The SMART Attributes table still shows 4 pending/uncorrectable sectors:

    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       4
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       4
           4

Converting the above LBA to a block number, I find 2930273794/2=
1465136897; as this is a 1.5TB HD,
this first error (there are possibly 3 more) is right at the final
35GB of the media, so it's inside (near the
end) of the RAID partition:

     fdisk -l /dev/sdc
         Disk /dev/sdc: 1500.3 GB, 1500301910016 bytes
          255 heads, 63 sectors/track, 182401 cylinders
          Units = cylinders of 16065 * 512 = 8225280 bytes
          Sector size (logical/physical): 512 bytes / 512 bytes
          I/O size (minimum/optimal): 512 bytes / 512 bytes
          Disk identifier: 0x6be6057c
             Device Boot      Start         End      Blocks   Id  System
          /dev/sdc1               1           1        8001    4  FAT16 <32M
          /dev/sdc2   *           2          14      104422+  83  Linux
          /dev/sdc3              15      182401  1465023577+  fd  Linux raid autodetect
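
The position arithmetic can be double-checked with a few lines of Python. This is only a sketch: it assumes the fdisk Start/End columns are 1-based cylinder numbers with cylinder-aligned partition boundaries (which is consistent with the Blocks column above), and the helper names are made up for illustration. Under those assumptions the reported LBA is about 1.7 MB from the end of the disk and slightly past the computed end of /dev/sdc3, so it is worth confirming against a sector-based listing (for example "fdisk -lu") before drawing conclusions.

```python
SECTOR_BYTES = 512
SECTORS_PER_CYLINDER = 255 * 63            # 16065, from the fdisk header
DISK_BYTES = 1500301910016                 # from the fdisk header
ERROR_LBA = 2930273794                     # LBA_of_first_error from SMART
PART_START_CYL, PART_END_CYL = 15, 182401  # /dev/sdc3


def dd_block(lba, bs=1024):
    """Convert a 512-byte LBA into a dd block index for block size bs."""
    return lba * SECTOR_BYTES // bs


def partition_sector_range(start_cyl, end_cyl):
    """Return (first sector, one past the last sector) of a partition.

    Assumes 1-based cylinder numbers and cylinder-aligned boundaries.
    """
    first = (start_cyl - 1) * SECTORS_PER_CYLINDER
    end = end_cyl * SECTORS_PER_CYLINDER
    return first, end


def bytes_from_end(lba):
    """Distance in bytes from the start of this sector to the end of the disk."""
    return DISK_BYTES - lba * SECTOR_BYTES


if __name__ == "__main__":
    first, end = partition_sector_range(PART_START_CYL, PART_END_CYL)
    print("dd block (bs=1024):     ", dd_block(ERROR_LBA))
    print("sdc3 sectors:           ", first, "to", end - 1)
    print("sdc3 size in 1K blocks: ", (end - first) // 2)
    print("bytes from end of disk: ", bytes_from_end(ERROR_LBA))
    print("inside sdc3:            ", first <= ERROR_LBA < end)
```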

Confirming that this block is indeed returning read errors:

    dd count=1 bs=1024 skip=1465136897 if=/dev/sdc of=/dev/null
        [long delay]
        dd: reading `/dev/sdc': Input/output error
        0+0 records in
        0+0 records out
        0 bytes (0 B) copied, 45.1076 s, 0.0 kB/s

Examining one sector before:

    dd count=1 bs=1024 skip=146513686 if=/dev/sdc | hexdump -C
        00000000  92 e1 b4 d4 c6 cd 0f 33  db 7c ff a9 be c1 c1 8e  |.......3.|......|
        00000010  71 35 fc 55 16 c4 36 ef  59 10 db 20 22 f4 57 99  |q5.U..6.Y.. ".W.|
        00000020  31 61 2b 24 e0 98 3c 94  4b 8a 17 93 23 aa e9 96  |1a+$..<.K...#...|
        00000030  b0 47 7b 8f 12 c6 52 42  99 0d 72 b4 51 02 5a 8e  |.G{...RB..r.Q.Z.|
        00000040  c6 5a ac 86 0b a5 74 9b  13 e7 87 7a db 94 e2 7f  |.Z....t....z....|
        00000050  c6 42 75 ba 53 bf 7f 20  fc 9c ad 4b 8f 3c 85 64  |.Bu.S.. ...K.<.d|
        00000060  3a b0 ac 41 6e 41 fb 95  03 70 24 7e 2e d5 df 8a  |:..AnA...p$~....|
        00000070  f9 dc d1 7d 4a 1e e1 93  9d 39 18 83 6c 9f 9f 79  |...}J....9..l..y|
        00000080  53 a3 d1 fb 7f c6 bd 44  8d 0c 40 06 0a 92 f9 7e  |S......D..@....~|
        00000090  0c 0e 87 43 66 9d fc 12  2b 0d 7a 34 ba 84 cb 73  |...Cf...+.z4...s|
        000000a0  47 3b a4 fa c9 50 d9 96  f9 50 a2 60 17 eb 7c c8  |G;...P...P.`..|.|
        000000b0  42 76 59 d0 1e 06 10 a8  3b 89 74 8d b4 04 83 88  |BvY.....;.t.....|
        000000c0  d7 9d 3c 82 cf 8f 7d 6e  a2 b6 bf 56 06 c0 aa 7c  |..<...}n...V...||
        000000d0  7d 39 ae 0a 67 48 28 b5  07 fd fc ae 49 e4 7a 08  |}9..gH(.....I.z.|
        000000e0  8a 37 94 e0 d3 d7 f0 f4  4c 49 3a ed b7 f4 84 95  |.7......LI:.....|
        000000f0  3f 0a 4f 6c 47 62 1a f4  70 ca 14 8a 52 6d 4c 1e  |?.OlGb..p...RmL.|
        00000100  da 0c 29 17 c1 a4 e1 5c  cb 43 e0 01 45 9c 72 7f  |..)....\.C..E.r.|
        00000110  78 b8 19 3f dd 35 c5 50  ff 9b 42 fb 0b d8 61 5a  |x..?.5.P..B...aZ|
        00000120  24 2b ae c9 45 e6 e5 e9  04 00 93 bb 53 c0 fd d6  |$+..E.......S...|
        00000130  9c ab 69 98 50 f0 5e 98  0d 0b b3 dc cb cb d0 7d  |..i.P.^........}|
        00000140  21 70 68 e8 fb 3c 55 fd  2d c6 6c 25 86 dd 9a 4a  |!ph..<U.-.l%...J|
        00000150  fc e2 24 a9 fb 9a 6b be  d5 e2 3b e9 a0 b1 61 ad  |..$...k...;...a.|
        00000160  1f 9a c8 31 86 91 c6 1f  86 9e 17 35 25 7e 77 42  |...1.......5%~wB|
        00000170  37 86 b2 17 08 8e c4 cf  4e e2 64 7d 83 11 05 1e  |7.......N.d}....|
        00000180  6b c1 e7 5d 0f e2 c9 f9  0a 0a b1 2b 83 a1 2a a4  |k..].......+..*.|
        00000190  1d f8 a6 13 2f e9 45 bb  b7 e2 71 e9 69 ad 3c 47  |..../.E...q.i.<G|
        000001a0  3f fa 39 7f 1e 93 0e d2  89 09 dc d2 b3 3b f8 6f  |?.9..........;.o|
        000001b0  21 21 72 b6 9e 9d 42 79  fb 78 3c 02 85 7b 1f 4f  |!!r...By.x<..{.O|
        000001c0  8b 3c 26 62 8a 58 38 a7  48 31 b9 e2 0c 0d 41 d6  |.<&b.X8.H1....A.|
        000001d0  8f 43 95 f0 1f 52 3e 0e  55 8d c0 93 f7 e3 c8 79  |.C...R>.U......y|
        000001e0  a2 bc 51 72 87 3c 16 c3  d0 f3 57 a8 e4 48 51 32  |..Qr.<....W..HQ2|
        000001f0  00 99 3e 0e 88 a3 fa e3  00 a4 c2 cb 28 7a a1 00  |..>.........(z..|
        00000200  a0 b4 1b 6d c4 2a 15 75  a3 f0 24 47 5a d6 54 74  |...m.*.u..$GZ.Tt|
        00000210  d0 ad e4 92 b1 99 5d 7a  62 47 b9 54 8f 9e 15 ca  |......]zbG.T....|
        00000220  65 09 9e d0 d3 61 51 93  88 4a 46 1e 5c 15 07 ef  |e....aQ..JF.\...|
        00000230  b0 92 fa a7 e7 3d e5 36  20 67 d2 24 b7 59 ae f4  |.....=.6 g.$.Y..|
        00000240  7c 26 57 90 e1 69 b5 f3  b4 1b 8e e6 07 2e 46 84  ||&W..i........F.|
        1+0 records in
        1+0 records out
        1024 bytes (1.0 kB) copied, 5.0224e-05 s, 20.4 MB/s

Looking at one sector after the error returns similar results.

So, I don't know about you, but the above seems pretty much like data
to me (although it could also be parity).

So I have two questions:

1) can I simply skip over these sectors (using dd_rescue or multiple
dd invocations) when off-line copying the old disk to the new one,
trusting the RAID5 to reconstruct the data correctly from the other 2
disks? Or is it better to simply do the recover the "traditional" way
(ie, "fail" the old disk, "add" the new one, and run the risk of a
possible bad sector on one of the two remaining old disks ruining the
show completely and forcing me to recover from backups [I *do* have
up-to-date backups on this array])?

2) Is there a formula, a program or anything that can tell me exactly
what is located at the above sector (ie, whether it's RAID parity or a
data sector)?

Thanks,
-- 
   Durval Menezes.




Ditto, one sector after:



So, when I "dd" this partition to a new one, I think



>
>

^ permalink raw reply
