Linux RAID subsystem development


* Re: Triple-parity raid6
From: David Brown @ 2011-06-09 22:42 UTC
  To: linux-raid
In-Reply-To: <isp2g2$rf$1@dough.gmane.org>

On 09/06/11 02:01, David Brown wrote:
> Has anyone considered triple-parity raid6 ? As far as I can see, it
> should not be significantly harder than normal raid6 - either to
> implement, or for the processor at run-time. Once you have the GF(2⁸)
> field arithmetic in place for raid6, it's just a matter of making
> another parity block in the same way but using a different generator:
>
> P = D_0 + D_1 + D_2 + .. + D_(n-1)
> Q = D_0 + g.D_1 + g².D_2 + .. + g^(n-1).D_(n-1)
> R = D_0 + h.D_1 + h².D_2 + .. + h^(n-1).D_(n-1)
>
> The raid6 implementation in mdraid uses g = 0x02 to generate the second
> parity (based on "The mathematics of RAID-6" - I haven't checked the
> source code). You can make a third parity using h = 0x04 and then get a
> redundancy of 3 disks. (Note - I haven't yet confirmed that this is
> valid for more than 100 data disks - I need to make my checker program
> more efficient first.)
>
> Rebuilding a disk, or running in degraded mode, is just an obvious
> extension to the current raid6 algorithms. If you are missing three data
> blocks, the maths looks hard to start with - but if you express the
> equations as a set of linear equations and use standard matrix inversion
> techniques, it should not be hard to implement. You only need to do this
> inversion once when you find that one or more disks have failed - then
> you pre-compute the multiplication tables in the same way as is done for
> raid6 today.
>
> In normal use, calculating the R parity is no more demanding than
> calculating the Q parity. And most rebuilds or degraded situations will
> only involve a single disk, and the data can thus be re-constructed
> using the P parity just like raid5 or two-parity raid6.
>
>
> I'm sure there are situations where triple-parity raid6 would be
> appealing - it has already been implemented in ZFS, and it is only a
> matter of time before two-parity raid6 has a real probability of hitting
> an unrecoverable read error during a rebuild.
>
>
> And of course, there is no particular reason to stop at three parity
> blocks - the maths can easily be generalised. 1, 2, 4 and 8 can be used
> as generators for quad-parity (checked up to 60 disks), and adding 16
> gives you quintuple parity (checked up to 30 disks) - but that's maybe
> getting a bit paranoid.
>
>
> ref.:
>
> <http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
> <http://blogs.oracle.com/ahl/entry/acm_triple_parity_raid>
> <http://queue.acm.org/detail.cfm?id=1670144>
> <http://blogs.oracle.com/ahl/entry/triple_parity_raid_z>
>
>
> mvh.,
>
> David
>

Just to follow up on my numbers here - I've now checked the validity of 
triple-parity using generators 1, 2 and 4 for up to 254 data disks 
(i.e., 257 disks altogether).  I've checked the validity of quad-parity 
up to 120 disks - checking the full 253 disks will probably take the 
machine most of the night.  I'm sure there is some mathematical way to 
prove this, and it could certainly be checked more efficiently than with 
a Python program - but my computer has more spare time than me!
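
A minimal sketch of such a check in Python, for readers who want to repeat it on a small scale. It is not the program referred to above. It assumes the field polynomial 0x11d from "The mathematics of RAID-6", and the helper names (gf_mul, nonsingular, parity_valid) are made up for this illustration. A set of generators is taken to be valid when every square submatrix formed from k parity rows and k data columns is invertible, which is the condition the matrix-inversion rebuild relies on.

```python
from itertools import combinations

POLY = 0x11D  # assumed field polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a, b):
    """Multiply two elements of GF(2^8) by shift-and-add, reducing by POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r


def gf_pow(a, n):
    """Raise a to the n-th power by repeated multiplication."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r


def gf_inv(a):
    """Inverse of a non-zero element: a^254, because a^255 == 1."""
    return gf_pow(a, 254)


def nonsingular(m):
    """Gaussian elimination over GF(2^8); True if the square matrix is invertible."""
    m = [row[:] for row in m]
    n = len(m)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            return False
        m[col], m[pivot] = m[pivot], m[col]
        inv = gf_inv(m[col][col])
        for r in range(col + 1, n):
            f = gf_mul(m[r][col], inv)
            for c in range(col, n):
                m[r][c] ^= gf_mul(f, m[col][c])
    return True


def parity_valid(generators, n_data):
    """Check every k x k submatrix: k surviving parity rows, k lost data columns."""
    rows = [[gf_pow(g, i) for i in range(n_data)] for g in generators]
    for k in range(1, len(generators) + 1):
        for rsel in combinations(rows, k):
            for csel in combinations(range(n_data), k):
                if not nonsingular([[row[c] for c in csel] for row in rsel]):
                    return False
    return True


print(parity_valid([1, 2, 4], 16))  # triple parity, 16 data disks
```

The number of three-disk combinations grows with the cube of n_data, so a brute-force check like this slows down quickly as the disk count rises.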




* RE: Backup Server RAID Array Event Notification
From: Leslie Rhorer @ 2011-06-10  1:10 UTC
  To: 'NeilBrown'; +Cc: 'Roman Mamedov', linux-raid
In-Reply-To: <20110610071441.796e5f78@notabene.brown>

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of NeilBrown
> Sent: Thursday, June 09, 2011 4:15 PM
> To: lrhorer@satx.rr.com
> Cc: 'Roman Mamedov'; linux-raid@vger.kernel.org
> Subject: Re: Backup Server RAID Array Event Notification
> 
> On Thu, 9 Jun 2011 14:01:34 -0500 "Leslie Rhorer" <lrhorer@satx.rr.com>
> wrote:
> 
> > > -----Original Message-----
> > > From: Roman Mamedov [mailto:rm@romanrm.ru]
> > > Sent: Thursday, June 09, 2011 1:47 PM
> > > To: lrhorer@satx.rr.com
> > > Cc: linux-raid@vger.kernel.org
> > > Subject: Re: Backup Server RAID Array Event Notification
> > >
> > > On Thu, 9 Jun 2011 13:35:59 -0500
> > > "Leslie Rhorer" <lrhorer@satx.rr.com> wrote:
> > >
> > > >
> > > > 	After I created a pair of two member RAID1 arrays and then added
> > > > them as members to a RAID6 array, I am now getting messages similar to the
> > > > following, complaining of "Wrong-level" issues.  When I check the RAID6
> > > > array, however, it is clean and both RAID1 members are still there.  When I
> > > > check both RAID1 arrays, they show clean with no events.  I am running a
> > > > compare between all the data on this machine and its mirror (this is a
> > > > backup machine).  So far everything looks good.  What does this imply?  Is
> > > > there something about which I should be worried?
> > >
> > > You said RAID1 twice, and your mdadm --detail doesn't agree with you and says
> > > "raid0" twice. Maybe you mistakenly used RAID1 instead of RAID0 somewhere
> > > else as well, and the WrongLevel message is trying to tell you that?
> >
> > 	No, that was just a typo.  (OK, three typos) I meant "RAID0".  The
> > RAID0 members are all 1T drives.  The RAID6 array is made of 1.5T members.
> > In order to use the 1T drives on the RAID6 array, I have to combine them
> > into 2T arrays, which then can be used as members of the RAID6 array.  If
> > md10 and md11 were RAID1 arrays, they would only be 1T in extent, and could
> > not be members of md0.
> >
> 
> "mdadm --monitor" does not monitor RAID0 or Linear arrays.  There is nothing
> to see.  Nothing can fail, they don't rebuild, they are really just AID, not
> RAID.

	Well, OK.  So why does it report anything at all?
> 
> So if it thinks that it was asked to monitor a RAID0 it pretends that it has
> disappeared with reason "Wrong Level".
> So if you explicitly ask it to monitor a RAID0, it won't and it will tell you
> why.

	That would make sense if I had started the monitor daemon and it had
sent the e-mail, but the monitor has been running for nearly two days, since
the system was rebooted.  Why send the message nearly a day after the daemon
is started, and why send it more than once (for each array)?

	By the same token, why did it wait nearly 8 hours and then again
more than a day and a half after the array was created to send the messages,
instead of immediately after it was created?

	This suggests I am going to be treated to a pair of spurious e-mails
every day or so telling me the device has disappeared, when it is perfectly
good.  After a few months of that, what happens when one of the devices
really does disappear?  We all know what happens to the system that cries,
"Wolf!" all the time.

> If you only implicitly ask with e.g. "mdadm --monitor --scan" with a RAID0
> listing in mdadm.conf it probably shouldn't give the message as it might
> be confusing... but it does.

	I'm not sure I follow.

> Or maybe the message is just confusing and I should change it.
> 
> Or something.

	Well that's definite.  :-)



* Re: Triple-parity raid6
From: Namhyung Kim @ 2011-06-10  3:22 UTC
  To: NeilBrown; +Cc: David Brown, linux-raid
In-Reply-To: <20110609220438.26336b27@notabene.brown>

NeilBrown <neilb@suse.de> writes:
> On Thu, 09 Jun 2011 13:32:59 +0200 David Brown <david@westcontrol.com> wrote:
>
> You can see the current kernel code at:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD
>
>
> int.uc is the generic C code which 'unroll.awk' processes to make various
> versions that unroll the loops different amounts to work with CPUs with
> different numbers of registers.
> Then there is sse1, sse2, altivec which provide the same functionality in
> assembler which is optimised for various processors.
>
> And 'recov' has the smarts for doing the reverse calculation when 2 data
> blocks, or 1 data and P are missing.
>
> Even if you don't feel up to implementing everything, a start might be
> useful.  You never know when someone might jump up and offer to help.
>

Maybe I could help David to some extent. :)

I'm gonna read the raid6 code next week and hope that there is room for me
to help, FWIW.

Thanks.


-- 
Regards,
Namhyung Kim


* [PATCH/RFC] md/raid10: spread read for subordinate r10bios during recovery
From: Namhyung Kim @ 2011-06-10  3:31 UTC
  To: Neil Brown; +Cc: linux-raid

In the current scheme, multiple read requests could be directed to
the first active disk during recovery if several disks fail at the
same time. Spreading those requests across the other in-sync disks
might be helpful.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/raid10.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index dea73bdb99b8..d0188e49f881 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1832,6 +1832,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
 	if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
 		/* recovery... the complicated one */
 		int j, k;
+		int last_read = -1;
 		r10_bio = NULL;
 
 		for (i=0 ; i<conf->raid_disks; i++) {
@@ -1891,7 +1892,9 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
 						      &sync_blocks, still_degraded);
 
 			for (j=0; j<conf->copies;j++) {
-				int d = r10_bio->devs[j].devnum;
+				int c = (last_read + j + 1) % conf->copies;
+				int d = r10_bio->devs[c].devnum;
+
 				if (!conf->mirrors[d].rdev ||
 				    !test_bit(In_sync, &conf->mirrors[d].rdev->flags))
 					continue;
@@ -1902,13 +1905,14 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
 				bio->bi_private = r10_bio;
 				bio->bi_end_io = end_sync_read;
 				bio->bi_rw = READ;
-				bio->bi_sector = r10_bio->devs[j].addr +
+				bio->bi_sector = r10_bio->devs[c].addr +
 					conf->mirrors[d].rdev->data_offset;
 				bio->bi_bdev = conf->mirrors[d].rdev->bdev;
 				atomic_inc(&conf->mirrors[d].rdev->nr_pending);
 				atomic_inc(&r10_bio->remaining);
-				/* and we write to 'i' */
+				last_read = c;
 
+				/* and we write to 'i' */
 				for (k=0; k<conf->copies; k++)
 					if (r10_bio->devs[k].devnum == i)
 						break;
-- 
1.7.5.2
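
The rotation that the patch above introduces with (last_read + j + 1) % conf->copies can be illustrated outside the kernel with a short Python sketch. This is a simplified model only: pick_read_copy, devnums and in_sync are hypothetical names, and the same copy layout is reused for every recovery, whereas in raid10.c each r10_bio carries its own devs[] array.

```python
def pick_read_copy(devnums, in_sync, last_read):
    """Return the copy index to read from, or None if no copy is in sync.

    devnums[c] is the device that holds copy c, and in_sync[d] says whether
    device d is usable.  The scan starts just after `last_read`, so
    consecutive recoveries do not all land on the first in-sync copy.
    """
    copies = len(devnums)
    for j in range(copies):
        c = (last_read + j + 1) % copies
        if in_sync[devnums[c]]:
            return c
    return None


# Hypothetical layout: three copies on devices 0, 1 and 2, all in sync.
devnums = [0, 1, 2]
in_sync = {0: True, 1: True, 2: True}

last_read = -1        # same starting value as in the patch
chosen = []
for _ in range(4):    # four recoveries in a row
    last_read = pick_read_copy(devnums, in_sync, last_read)
    chosen.append(devnums[last_read])

print(chosen)  # [0, 1, 2, 0]
```

With last_read starting at -1 the first pick is copy 0, which matches the behaviour before the patch. Only the later picks move on to other copies.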



* Re: Why move all map_sg/unmap_sg for slave channel to its client?
From: viresh kumar @ 2011-06-10  3:41 UTC
  To: Dan Williams
  Cc: Linus Walleij, Koul, Vinod, linux-kernel@vger.kernel.org,
	anemo@mba.ocn.ne.jp, Shiraz HASHIM, Armando VISCONTI,
	Bhupesh SHARMA, linux-raid
In-Reply-To: <BANLkTikzDEx2netonGyhskYFtiAEAxq5XQ@mail.gmail.com>

On 06/09/2011 11:58 PM, Dan Williams wrote:
> On Thu, Jun 9, 2011 at 2:38 AM, Linus Walleij <linus.walleij@linaro.org> wrote:
>> On Thu, Jun 9, 2011 at 8:54 AM, viresh kumar <viresh.kumar@st.com> wrote:
>>
>>> I thought map_sg/unmap_sg for slave channels would be handled according
>>> to the flags passed in prep_slave_sg(). But then I found the following patch:
>>> (...)
>>> I don't have much knowledge about that discussion, but I think this should be left
>>> configurable.
>>> If the client wants to control map/unmap then it can simply pass
>>> DMA_COMPL_SKIP_DEST_UNMAP | DMA_COMPL_SKIP_SRC_UNMAP in flags. I didn't want to
>>> skip this in my driver and so I don't pass them.
>>
>> What if the same driver is used on many different platforms like say
>> drivers/tty/serial/amba-pl011.c, and some of the platforms using it
>> have DMA engines that do not implement mapping/unmapping of
>> the passed sglist?
>>
>> In that case I think you have to modify all drivers in drivers/dma/*
>> to do this mapping, and then you could just make it a required behaviour
>> and skip the flags altogether.
>>
>> But apparently that approach was blocked at one point so let's see
>> what the others say.
> 
> My problem with automatic unmapping support is that the dma-driver
> really does not have a chance to get it right except for the trivially
> straightforward cases.  One need only look at the current bustage of
> raid5 acceleration with respect to overlapping mappings and arm v6.
> The dma-driver just knows how to perform "this" operation on "this"
> dma address.  It does not know the lifetime of the mapping, or even if
> it has the actual dma handle for unmapping versus an offset.
> 
> For the raid case I've currently convinced myself that the raid client
> needs to get directly involved in dma mapping management, rather than
> teach all dma drivers a language of how to unmap and when.  Not only
> will this fix the overlapping, but it also eliminates the need to map
> and remap because the raid client knows the lifetime of a stripe_head
> while the driver only knows the lifetime of a given stripe operation.
> 
> For slave-dma maybe there is a lot of common un-mapping logic that can
> be reused, but I think that comes from a separate smart library that
> understands the dma mapping lifetimes of a given class of clients.
> Leave the dma-drivers to just be dumb operators on anonymous dma
> addresses.
> 

Linus, Dan,

Got it. Thanks for your replies.

-- 
viresh


* Re: SRaid with 13 Disks crashed
From: Dragon @ 2011-06-10  7:52 UTC
  To: philip; +Cc: linux-raid

I have not limited anything. That is all I get from bash after executing the script. I am not aware of using a fast-boot kernel, but I saw that it comes with kernel version 2.6.28. I think I upgraded to the latest kernel version after the raid crashed, in order to have the newer mdadm, ext and whatever other tools to recover the raid.
Am I right that to use the --backup-file option I must have a backup file ;)? I didn't have a file.

As far as I understand your advice, I should recreate the raid with the disks in the right order. You chose the order from the output of mdadm -E, which gives the actual number of each disk, and the variation between disks sdd and sdn is because both show the same number, right? I think I understand that. But why is disk "j" the last in the order? My output says that disk sdd is number 13, but as spare for sda....?
Here is the state after rebooting this morning:
fdisk  -l|grep sd
Disk /dev/sdb: 1500.3 GB, 1500301910016 bytes
Disk /dev/sda: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdc: 20.4 GB, 20409532416 bytes
/dev/sdc1   *           1        2372    19053058+  83  Linux
/dev/sdc2            2373        2481      875542+   5  Extended
/dev/sdc5            2373        2481      875511   82  Linux swap / Solaris
Disk /dev/sdd: 1500.3 GB, 1500301910016 bytes
Disk /dev/sde: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdf: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdg: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdh: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdi: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdj: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdk: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdl: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdm: 1500.3 GB, 1500301910016 bytes
Disk /dev/sdn: 1500.3 GB, 1500301910016 bytes

mdadm -E /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee4232 - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     4       8      176        4      active sync   /dev/sdl

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda
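
Since the question is about reading the disk order out of these dumps, here is a small, hedged Python sketch that pulls the "this" row and the event count from one `mdadm -E` dump. It assumes the 0.90 superblock layout exactly as pasted in this mail. parse_examine and the sample text are illustrative only and are not part of mdadm. In the dump above, the superblock read from /dev/sda records the name /dev/sdl, so the recorded name evidently can differ from the name the disk has after a reboot; the RaidDevice column is the slot number.

```python
import re

# Matches the "this" row of a 0.90-superblock dump, for example:
# "this     4       8      176        4      active sync   /dev/sdl"
THIS_ROW = re.compile(
    r"^this\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.+?)\s+(/dev/\S+)\s*$",
    re.MULTILINE,
)
EVENTS = re.compile(r"^\s*Events\s*:\s*(\d+)\s*$", re.MULTILINE)


def parse_examine(text):
    """Return slot, state, recorded name and event count from one dump."""
    row = THIS_ROW.search(text)
    events = EVENTS.search(text)
    if row is None or events is None:
        raise ValueError("not a recognisable 0.90 'mdadm -E' dump")
    _number, major, minor, raid_device, state, recorded = row.groups()
    return {
        "raid_device": int(raid_device),        # slot in the array
        "major_minor": (int(major), int(minor)),
        "state": state,
        "recorded_name": recorded,              # name stored in the superblock
        "events": int(events.group(1)),
    }


# Shortened sample taken from the /dev/sda dump in this mail.
SAMPLE = """\
/dev/sda:
         Events : 156864

      Number   Major   Minor   RaidDevice State
this     4       8      176        4      active sync   /dev/sdl
"""

info = parse_examine(SAMPLE)
print(info["raid_device"], info["state"], info["recorded_name"])
```

Running the same function over the dump of every member and sorting by raid_device would give one candidate slot order, under the assumption that the superblocks are still trustworthy.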

mdadm -E /dev/sdb
/dev/sdb:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee4244 - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5       8      192        5      active sync   /dev/sdm

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee418e - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this    13       8        0       13      spare   /dev/sda

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sde
/dev/sde:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee4196 - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     6       8       16        6      active sync   /dev/sdb

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee41aa - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     8       8       32        8      active sync   /dev/sdc

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdg
/dev/sdg:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee41bc - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     9       8       48        9      active sync   /dev/sdd

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdh
/dev/sdh:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee41ce - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this    10       8       64       10      active sync   /dev/sde

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdi
/dev/sdi:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee41e0 - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this    11       8       80       11      active sync   /dev/sdf

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdj
/dev/sdj:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee41f2 - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this    12       8       96       12      active sync   /dev/sdg

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdk
/dev/sdk:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee41ea - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8      112        0      active sync   /dev/sdh

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdl
/dev/sdl:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee41fe - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8      128        2      active sync   /dev/sdi

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdm
/dev/sdm:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 23:47:53 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee4210 - correct
         Events : 156864

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8      144        3      active sync   /dev/sdj

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8        0       13      spare   /dev/sda

mdadm -E /dev/sdn
/dev/sdn:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 975d6eb2:285eed11:021df236:c2d05073
  Creation Time : Tue Oct 13 23:26:17 2009
     Raid Level : raid5
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
   Raid Devices : 13
  Total Devices : 12
Preferred Minor : 0

    Update Time : Fri Jun  3 22:49:22 2011
          State : clean
 Active Devices : 11
Working Devices : 12
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 1dee3313 - correct
         Events : 156606

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this    13       8      160       13      spare   /dev/sdk

   0     0       8      112        0      active sync   /dev/sdh
   1     1       0        0        1      faulty removed
   2     2       8      128        2      active sync   /dev/sdi
   3     3       8      144        3      active sync   /dev/sdj
   4     4       8      176        4      active sync   /dev/sdl
   5     5       8      192        5      active sync   /dev/sdm
   6     6       8       16        6      active sync   /dev/sdb
   7     7       0        0        7      faulty removed
   8     8       8       32        8      active sync   /dev/sdc
   9     9       8       48        9      active sync   /dev/sdd
  10    10       8       64       10      active sync   /dev/sde
  11    11       8       80       11      active sync   /dev/sdf
  12    12       8       96       12      active sync   /dev/sdg
  13    13       8      160       13      spare   /dev/sdk
------
in short:
/dev/sda:
this     4       8      176        4      active sync   /dev/sdl

/dev/sdb:
this     5       8      192        5      active sync   /dev/sdm

/dev/sdd:
this    13       8        0       13      spare   /dev/sda

/dev/sde:
this     6       8       16        6      active sync   /dev/sdb

/dev/sdf:
this     8       8       32        8      active sync   /dev/sdc

/dev/sdg:
this     9       8       48        9      active sync   /dev/sdd

/dev/sdh:
this    10       8       64       10      active sync   /dev/sde

/dev/sdi:
this    11       8       80       11      active sync   /dev/sdf

/dev/sdj:
this    12       8       96       12      active sync   /dev/sdg

/dev/sdk:
this     0       8      112        0      active sync   /dev/sdh

/dev/sdl:     
this     2       8      128        2      active sync   /dev/sdi

/dev/sdm:         
this     3       8      144        3      active sync   /dev/sdj

/dev/sdn:      
this    13       8      160       13      spare   /dev/sdk
----
After that I would think the order is confused, because no disk has the right name in the first line for its position. But I would do this:

mdadm -C /dev/md0 -l 5 -n 13 -e 0.90 -c 64 --assume-clean /dev/sd{k,l,m,a,b,e,?d?,f,g,h,i,j,n} 
or
mdadm -C /dev/md0 -l 5 -n 13 -e 0.90 -c 64 --assume-clean /dev/sd{k,l,m,a,b,e,?n?,f,g,h,i,j,d} 

Or what do you think? How must I handle the two spare disks, sda and sdk?

^ permalink raw reply

* Re: Triple-parity raid6
From: David Brown @ 2011-06-10  8:45 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <87aadq5q1l.fsf@gmail.com>

On 10/06/2011 05:22, Namhyung Kim wrote:
> NeilBrown<neilb@suse.de>  writes:
>> On Thu, 09 Jun 2011 13:32:59 +0200 David Brown<david@westcontrol.com>  wrote:
>>
>> You can see the current kernel code at:
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD
>>
>>
>> int.uc is the generic C code which 'unroll.awk' processes to make various
>> versions that unroll the loops different amounts to work with CPUs with
>> different numbers of registers.
>> Then there is sse1, sse2, altivec which provide the same functionality in
>> assembler which is optimised for various processors.
>>
>> And 'recov' has the smarts for doing the reverse calculation when 2 data
>> blocks, or 1 data and P are missing.
>>
>> Even if you don't feel up to implementing everything, a start might be
>> useful.  You never know when someone might jump up and offer to help.
>>
>
> Maybe I could help David to some extent. :)
>
> I'm gonna read the raid6 code next week and hope that there is a room I
> can help with, FWIW.
>

No matter how far I manage to get, I'm going to need help from someone 
who knows the system and the development process, how the new functions 
would fit with the rest of the system, and not least people who can 
check and test the code.

Making multiple parity syndromes is easy enough mathematically:

For each parity block P_j, you have a generator g_j and calculate for d_i 
running over all data disks:

	P_j = sum((g_j ^ i) . d_i)

Raid5 parity uses g_0 = 1, so it is just the xor.
Raid6 uses g_0 = 1, g_1 = 2.
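
As a minimal sketch of that formula (Python, one byte per disk): this assumes the GF(2^8) field polynomial 0x11d from the raid6.pdf paper, and the names gf_mul, gf_pow and syndromes are made up for the illustration rather than taken from lib/raid6.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the RAID-6 polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r


def gf_pow(g, n):
    """Raise g to the n-th power by repeated multiplication."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, g)
    return r


def syndromes(gens, data):
    """Return one parity byte per generator: P_j = sum((g_j ^ i) . d_i)."""
    out = []
    for g in gens:
        acc = 0
        for i, d in enumerate(data):
            acc ^= gf_mul(gf_pow(g, i), d)   # addition in GF(2^8) is xor
        out.append(acc)
    return out


# Three data disks each holding the byte 0x01, generators 1, 2 and 4.
print(syndromes([1, 2, 4], [1, 1, 1]))   # [1, 7, 21]
```

With generator 1 every power is 1, so the first entry is the plain xor used by raid5.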

Any independent generators could be used, of course.  I am not sure how 
to prove that a set of generators is independent (except in the easy 
case of 2 generators, as shown in the raid6.pdf paper) - but brute force 
testing over all choices of dead disks is easy enough.
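
The brute-force test could look something like the following Python sketch. It assumes the same field polynomial 0x11d, and treats a set of generators as good if every k-by-k submatrix of the encoding matrix (k surviving parity rows against k lost data disks) is invertible, for every k up to the number of parities. The helper names are purely illustrative, and the search is far too slow for large arrays.

```python
from itertools import combinations


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the RAID-6 polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r


def gf_pow(g, n):
    """Raise g to the n-th power by repeated multiplication."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, g)
    return r


def is_singular(matrix):
    """Forward elimination over GF(2^8); True if no pivot exists in some column."""
    m = [row[:] for row in matrix]
    n = len(m)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            return True
        m[col], m[pivot] = m[pivot], m[col]
        inv = gf_pow(m[col][col], 254)       # a^254 is 1/a because a^255 == 1
        for r in range(col + 1, n):
            f = gf_mul(m[r][col], inv)
            for c in range(col, n):
                m[r][c] ^= gf_mul(f, m[col][c])
    return False


def generators_ok(gens, ndata):
    """Check that every loss of up to len(gens) disks leaves a solvable system."""
    powers = [[gf_pow(g, i) for i in range(ndata)] for g in gens]
    for k in range(1, len(gens) + 1):
        for rows in combinations(range(len(gens)), k):      # surviving parities
            for cols in combinations(range(ndata), k):      # lost data disks
                sub = [[powers[r][c] for c in cols] for r in rows]
                if is_singular(sub):
                    return False
    return True


print(generators_ok([1, 2, 4], 16))   # True
```

Choosing k parity rows also covers the mixed cases: losing k data disks together with some parity disks just means solving with whichever k parities are left.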

For Raid7, the obvious choice is g_2 = 4 - then we can re-use existing 
macros and optimisations.

I've had a brief look at int.uc - the gen_syndrome function is nice and 
small, but that's before awk gets at it.  I haven't yet tried building 
any of this, but extending raid6_int to a third syndrome is, I hope, 
straightforward:

static void raid7_int$#_gen_syndrome(int disks, size_t bytes, void **ptrs)
{
         u8 **dptr = (u8 **)ptrs;
         u8 *p, *q, *r;
         int d, z, z0;

         unative_t wd$$, wq$$, wp$$, w1$$, w2$$, wr$$, w3$$, w4$$;

         z0 = disks - 4;         /* Highest data disk */
         p = dptr[z0+1];         /* XOR parity */
         q = dptr[z0+2];         /* RS syndrome */
         r = dptr[z0+3];         /* RS syndrome 2 */

         for ( d = 0 ; d < bytes ; d += NSIZE*$# ) {
                 wr$$ = wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE];
                 for ( z = z0-1 ; z >= 0 ; z-- ) {
                         wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE];
                         wp$$ ^= wd$$;
                         w2$$ = MASK(wq$$);
                         w1$$ = SHLBYTE(wq$$);
                         w2$$ &= NBYTES(0x1d);
                         w1$$ ^= w2$$;
                         wq$$ = w1$$ ^ wd$$;

                         w4$$ = MASK(wr$$);
                         w3$$ = SHLBYTE(wr$$);
                         w4$$ &= NBYTES(0x1d);
                         w3$$ ^= w4$$;
                         w4$$ = MASK(w3$$);
                         w3$$ = SHLBYTE(w3$$);
                         w4$$ &= NBYTES(0x1d);
                         w3$$ ^= w4$$;
                         wr$$ = w3$$ ^ wd$$;
                 }
                 *(unative_t *)&p[d+NSIZE*$$] = wp$$;
                 *(unative_t *)&q[d+NSIZE*$$] = wq$$;
                 *(unative_t *)&r[d+NSIZE*$$] = wr$$;
         }
}
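
For checking the unrolled code against something simpler, here is a byte-at-a-time Python model of the same loop. It assumes that MASK() yields 0xff in every byte whose top bit is set and that SHLBYTE() shifts every byte left by one (that is how int.uc appears to define them); mul2 and gen_syndrome are illustrative names only.

```python
def mul2(b):
    """Multiply one byte by 2 in GF(2^8), mirroring MASK/SHLBYTE/NBYTES(0x1d)."""
    mask = 0xff if b & 0x80 else 0x00        # MASK(): all ones if top bit set
    shifted = (b << 1) & 0xff                # SHLBYTE(): shift, drop the carry
    return shifted ^ (mask & 0x1d)           # fold the carry back in


def gen_syndrome(data):
    """data: list of equal-length bytes objects, one per data disk.

    Returns (p, q, r) computed by Horner's rule from the highest data disk
    down, as the C loop does: q = 2*q + d and r = 4*r + d at each step.
    """
    z0 = len(data) - 1
    p = bytearray(data[z0])
    q = bytearray(data[z0])
    r = bytearray(data[z0])
    for z in range(z0 - 1, -1, -1):
        for i, wd in enumerate(data[z]):
            p[i] ^= wd
            q[i] = mul2(q[i]) ^ wd
            r[i] = mul2(mul2(r[i])) ^ wd     # two doublings = multiply by 4
    return bytes(p), bytes(q), bytes(r)


p, q, r = gen_syndrome([b"\x01", b"\x01", b"\x01"])
print(p.hex(), q.hex(), r.hex())   # 01 07 15
```

Running from the highest disk downwards is what makes disk i end up multiplied by g^i without ever computing a power explicitly.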

I wrote the wr$$ calculations using a second set of working variables. 
I don't know (yet) what compiler options are used to generate the target 
code, nor what the awk'ed code looks like.  If the compiler can handle 
the scheduling to interlace the Q and R calculations and reduce pipeline 
delays, then that's great.  If not, then they can be manually interlaced 
if it helps the code.  But as I'm writing this blind (on a Windows 
machine, no less), I don't know if it makes a difference.

As a general point regarding optimisations, is there a minimum level of 
gcc we can expect to have here?  And are there rules about compiler 
flags?  Later versions of gcc are getting very smart about 
vectorisation, and generating multiple versions of code that are 
selected automatically at run-time depending on the capabilities of the 
processor.  My hope is that the code can be greatly simplified by 
writing a single clear version in C that runs fast and takes advantage 
of the cpu, without needing to make explicit SSE, AltiVec, etc., 
versions.  Are there rules about which gcc attributes or extensions we 
can use?

Another question - what should we call this?  Raid 7?  Raid 6.3?  Raid 
7.3?  While we are in the mood for triple-parity Raid, we can easily 
extend it to quad-parity or more - the name should probably be flexible 
enough to allow that.

^ permalink raw reply

* Re: Triple-parity raid6
From: David Brown @ 2011-06-10  9:03 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20110609220438.26336b27@notabene.brown>

On 09/06/2011 14:04, NeilBrown wrote:
> On Thu, 09 Jun 2011 13:32:59 +0200 David Brown<david@westcontrol.com>  wrote:
>
>> On 09/06/2011 03:49, NeilBrown wrote:
>>> On Thu, 09 Jun 2011 02:01:06 +0200 David Brown<david.brown@hesbynett.no>
>>> wrote:
>>>
>>>> Has anyone considered triple-parity raid6 ?  As far as I can see, it
>>>> should not be significantly harder than normal raid6 - either  to
>>>> implement, or for the processor at run-time.  Once you have the GF(2⁸)
>>>> field arithmetic in place for raid6, it's just a matter of making
>>>> another parity block in the same way but using a different generator:
>>>>
>>>> P = D_0 + D_1 + D_2 + .. + D_(n-1)
>>>> Q = D_0 + g.D_1 + g².D_2 + .. + g^(n-1).D_(n-1)
>>>> R = D_0 + h.D_1 + h².D_2 + .. + h^(n-1).D_(n-1)
>>>>
>>>> The raid6 implementation in mdraid uses g = 0x02 to generate the second
>>>> parity (based on "The mathematics of RAID-6" - I haven't checked the
>>>> source code).  You can make a third parity using h = 0x04 and then get a
>>>> redundancy of 3 disks.  (Note - I haven't yet confirmed that this is
>>>> valid for more than 100 data disks - I need to make my checker program
>>>> more efficient first.)
>>>>
>>>> Rebuilding a disk, or running in degraded mode, is just an obvious
>>>> extension to the current raid6 algorithms.  If you are missing three
>>>> data blocks, the maths looks hard to start with - but if you express the
>>>> equations as a set of linear equations and use standard matrix inversion
>>>> techniques, it should not be hard to implement.  You only need to do
>>>> this inversion once when you find that one or more disks have failed -
>>>> then you pre-compute the multiplication tables in the same way as is
>>>> done for raid6 today.
>>>>
>>>> In normal use, calculating the R parity is no more demanding than
>>>> calculating the Q parity.  And most rebuilds or degraded situations will
>>>> only involve a single disk, and the data can thus be re-constructed
>>>> using the P parity just like raid5 or two-parity raid6.
>>>>
>>>>
>>>> I'm sure there are situations where triple-parity raid6 would be
>>>> appealing - it has already been implemented in ZFS, and it is only a
>>>> matter of time before two-parity raid6 has a real probability of hitting
>>>> an unrecoverable read error during a rebuild.
>>>>
>>>>
>>>> And of course, there is no particular reason to stop at three parity
>>>> blocks - the maths can easily be generalised.  1, 2, 4 and 8 can be used
>>>> as generators for quad-parity (checked up to 60 disks), and adding 16
>>>> gives you quintuple parity (checked up to 30 disks) - but that's maybe
>>>> getting a bit paranoid.
>>>>
>>>>
>>>> ref.:
>>>>
>>>> <http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
>>>> <http://blogs.oracle.com/ahl/entry/acm_triple_parity_raid>
>>>> <http://queue.acm.org/detail.cfm?id=1670144>
>>>> <http://blogs.oracle.com/ahl/entry/triple_parity_raid_z>
>>>>
>>>
>>>    -ENOPATCH  :-)
>>>
>>> I have a series of patches nearly ready which removes a lot of the remaining
>>> duplication in raid5.c between raid5 and raid6 paths.  So there will be
>>> relative few places where RAID5 and RAID6 do different things - only the
>>> places where they *must* do different things.
>>> After that, adding a new level or layout which has 'max_degraded == 3' would
>>> be quite easy.
>>> The most difficult part would be the enhancements to libraid6 to generate the
>>> new 'syndrome', and to handle the different recovery possibilities.
>>>
>>> So if you're not otherwise busy this weekend, a patch would be nice :-)
>>>
>>
>> I'm not going to promise any patches, but maybe I can help with the
>> maths.  You say the difficult part is the syndrome calculations and
>> recovery - I've got these bits figured out on paper and some
>> quick-and-dirty python test code.  On the other hand, I don't really
>> want to get into the md kernel code, or the mdadm code - I haven't done
>> Linux kernel development before (I mostly program 8-bit microcontrollers
>> - when I code on Linux, I use Python), and I fear it would take me a
>> long time to get up to speed.
>>
>> However, if the parity generation and recovery is neatly separated into
>> a libraid6 library, the whole thing becomes much more tractable from my
>> viewpoint.  Since I am new to this, can you tell me where I should get
>> the current libraid6 code?  I'm sure google will find some sources for
>> me, but I'd like to make sure I start with whatever version /you/ have.
>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> You can see the current kernel code at:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD
>
>
> int.uc is the generic C code which 'unroll.awk' processes to make various
> versions that unroll the loops different amounts to work with CPUs with
> different numbers of registers.
> Then there is sse1, sse2, altivec which provide the same functionality in
> assembler which is optimised for various processors.
>
> And 'recov' has the smarts for doing the reverse calculation when 2 data
> blocks, or 1 data and P are missing.
>
> Even if you don't feel up to implementing everything, a start might be
> useful.  You never know when someone might jump up and offer to help.
>
> NeilBrown


When looking at recov.c, I see in the "raid6_dual_recov" function there 
is no code for testing data+Q failure as it is equivalent to raid5 
recovery.  Should this not still be implemented here so that testing can 
be more complete?

Is there a general entry point for the recovery routines, which then 
decides which of raid6_2data_recov, raid6_datap_recov, or 
raid6_dual_recov is called?  With triple-parity raid, there are many 
more combinations - it would make sense for the library to have a single 
function like :

void raid7_3_recov(int disks, size_t bytes, int noOfFails,
		int *pFails, void **ptrs);

or even (to cover quad parity and more) :

void raid7_n_recov(int disks, int noOfParities, size_t bytes,
	int noOfFails, int *pFails, void **ptrs);
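
To make the idea concrete, here is a rough Python model of what such a single entry point would have to compute, for one byte per disk: build the linear equations for the lost data bytes from the surviving parities and solve them by Gauss-Jordan elimination in GF(2^8). The polynomial 0x11d and the generators 1, 2, 4 are assumed as in the rest of the thread; recover, solve and encode are hypothetical names, and nothing here reflects how recov.c is organised.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the RAID-6 polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r


def gf_pow(g, n):
    """Raise g to the n-th power by repeated multiplication."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, g)
    return r


def encode(gens, data):
    """One parity byte per generator: P_j = sum((g_j ^ i) . d_i)."""
    parity = []
    for g in gens:
        acc = 0
        for i, d in enumerate(data):
            acc ^= gf_mul(gf_pow(g, i), d)
        parity.append(acc)
    return parity


def solve(a, b):
    """Solve a*x = b over GF(2^8) by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            raise ValueError("singular matrix: pattern not recoverable")
        m[col], m[pivot] = m[pivot], m[col]
        inv = gf_pow(m[col][col], 254)           # 1/a, since a^255 == 1
        m[col] = [gf_mul(inv, v) for v in m[col]]
        for r in range(n):
            if r != col and m[r][col]:
                f = m[r][col]
                m[r] = [v ^ gf_mul(f, w) for v, w in zip(m[r], m[col])]
    return [row[n] for row in m]


def recover(gens, data, parity):
    """data and parity use None for lost disks; returns the full data list."""
    lost = [i for i, d in enumerate(data) if d is None]
    rows = [j for j, p in enumerate(parity) if p is not None][:len(lost)]
    if len(rows) < len(lost):
        raise ValueError("more failures than surviving parities")
    a = [[gf_pow(gens[j], i) for i in lost] for j in rows]
    b = []
    for j in rows:
        acc = parity[j]
        for i, d in enumerate(data):
            if d is not None:                    # move known terms to the rhs
                acc ^= gf_mul(gf_pow(gens[j], i), d)
        b.append(acc)
    fixed = list(data)
    for i, x in zip(lost, solve(a, b)):
        fixed[i] = x
    return fixed


gens = [1, 2, 4]
good = [0x11, 0x22, 0x33, 0x44, 0x55]
par = encode(gens, good)
print(recover(gens, [None, 0x22, None, 0x44, None], par) == good)   # True
```

In a real implementation the inverted matrix would be computed once per failure pattern and then applied to every byte through lookup tables, rather than solved per byte as here.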







^ permalink raw reply

* Re: Maximizing failed disk replacement on a RAID5 array
From: John Robinson @ 2011-06-10 10:25 UTC (permalink / raw)
  To: Linux RAID
In-Reply-To: <4DEDE6E7.40301@anonymous.org.uk>

On 07/06/2011 09:52, John Robinson wrote:
> On 06/06/2011 19:06, Durval Menezes wrote:
> [...]
>> It would be great to have a
>> "duplicate-this-bad-old-disk-into-this-shiny-new-disk" functionality,
>> as it would enable an almost-no-downtime disk replacement with
>> minimum risk, but it seems we can't have everything... :-0 Maybe it's
>> something for the wishlist?
>
> It's already on the wishlist, described as a hot replace.

Actually I've been thinking about this. I think I'd rather the hot 
replace functionality did a normal rebuild from the still-good drives, 
and only if it came across a read error from those would it attempt to 
refer to the contents of the known-to-be-failing drive (and then also 
attempt to repair the read error on the supposedly-still-good drive that 
gave a read error, as already happens).
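
A sketch of that policy for a single chunk, in Python. It assumes a single-parity (RAID5-style) stripe in which any member is the xor of all the others, models a read error as None, and uses the made-up name rebuild_chunk; it is not md code.

```python
def rebuild_chunk(good_chunks, read_failing):
    """Rebuild one chunk for the replacement drive.

    good_chunks:  chunks read from the still-good drives at this offset,
                  with None standing for a read error.
    read_failing: zero-argument callable that reads the same offset from the
                  known-to-be-failing drive (None on error). It is only
                  called when a good drive could not be read.
    Returns (new_chunk, repaired), where repaired maps the index of a good
    drive to the value that should be rewritten onto it.
    """
    if None not in good_chunks:
        new = 0
        for c in good_chunks:
            new ^= c                  # normal rebuild: xor of the other members
        return new, {}

    new = read_failing()              # second source, consulted only on error
    if new is None:
        raise OSError("chunk unreadable from both sources")

    bad = [i for i, c in enumerate(good_chunks) if c is None]
    repaired = {}
    if len(bad) == 1:                 # enough redundancy to fix the good drive
        fix = new
        for c in good_chunks:
            if c is not None:
                fix ^= c
        repaired[bad[0]] = fix
    return new, repaired


# Stripe members 5, 9, 15 on the good drives; the failing drive holds 3.
print(rebuild_chunk([5, 9, 15], lambda: 3))      # (3, {})
print(rebuild_chunk([5, None, 15], lambda: 3))   # (3, {1: 9})
```

The point of passing a callable is that the failing drive is never touched for stripes where the good drives read cleanly.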

My rationale for this is as follows: if we want to hot-replace a drive 
that's known to be failing, we should trust it less than the remaining 
still-good drives, and treat it with kid gloves. It may be suffering 
from bit-rot. We'd rather not hit all the bad sectors on the failing 
drive, because each time we do that we send the drive into 7 seconds (or 
more, for cheap drives without TLER) of re-reading, plus any Linux-level 
re-reading there might be. Further, making the known-to-be-failing drive 
work extra hard (doing the equivalent of dd'ing from it while also still 
using it to serve its contents as an array member) might make it die 
completely before we've finished.

What will this do for rebuild time? Well, I don't think it'll be any 
slower. On the one hand, you'd think that copying from one drive to 
another would be faster than a rebuild, because you're only reading 1 
drive instead of N-1, but on the other, your array is going to run 
slowly (pretty much degraded speed) anyway because you're keeping one 
drive in constant use reading from it, and you risk it becoming much, 
much slower if you do run into hundreds or thousands of read errors on 
the failing drive.

So overall I think hot-replace should be a normal replace with a 
possible second source of data/parity.

Thoughts?

Yes, I know, -ENOPATCH

Cheers,

John.

^ permalink raw reply

* [PATCH/RFC] md/multipath: implement I/O balancing
From: Namhyung Kim @ 2011-06-10 11:32 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Implement basic I/O balancing code (for read/write) for multipath
personality. The code is based on RAID1 implementation.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/multipath.c |   70 ++++++++++++++++++++++++++++++++++++++---------
 drivers/md/multipath.h |    1 +
 2 files changed, 57 insertions(+), 14 deletions(-)

diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 3535c23af288..83c4f5105705 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -30,29 +30,58 @@
 
 #define	NR_RESERVED_BUFS	32
 
-
-static int multipath_map (multipath_conf_t *conf)
+/*
+ * This routine returns the disk from which the requested read should
+ * be done. There is a per-array 'next expected sequential IO' sector
+ * number - if this matches on the next IO then we use the last disk.
+ * There is also a per-disk 'last known head position' sector that is
+ * maintained from IRQ contexts, IO completion handlers update this
+ * position correctly. We pick the disk whose head is closest.
+ *
+ * Note that 'sector' argument is for original bio whereas 'head_position'
+ * is maintained for each rdev so we should take it into account when
+ * calculating the distance.
+ */
+static int multipath_map(multipath_conf_t *conf, sector_t sector)
 {
 	int i, disks = conf->raid_disks;
-
-	/*
-	 * Later we do read balancing on the read side 
-	 * now we use the first available disk.
-	 */
+	int best_disk;
+	sector_t best_dist;
 
 	rcu_read_lock();
+retry:
+	best_disk = -1;
+	best_dist = MaxSector;
+
 	for (i = 0; i < disks; i++) {
+		int dist;
 		mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev);
+		sector_t this_sector = sector;
+
 		if (rdev && test_bit(In_sync, &rdev->flags)) {
-			atomic_inc(&rdev->nr_pending);
-			rcu_read_unlock();
-			return i;
+			this_sector += rdev->data_offset;
+			dist = abs(this_sector - conf->multipaths[i].head_position);
+			if (dist < best_dist) {
+				best_dist = dist;
+				best_disk = i;
+			}
 		}
 	}
+
+	if (best_disk == -1) {
+		printk(KERN_ERR "multipath_map(): no more operational IO paths?\n");
+	} else {
+		mdk_rdev_t *rdev;
+
+		rdev = rcu_dereference(conf->multipaths[best_disk].rdev);
+		if (!rdev || !test_bit(In_sync, &rdev->flags))
+			goto retry;
+
+		atomic_inc(&rdev->nr_pending);
+	}
 	rcu_read_unlock();
 
-	printk(KERN_ERR "multipath_map(): no more operational IO paths?\n");
-	return (-1);
+	return best_disk;
 }
 
 static void multipath_reschedule_retry (struct multipath_bh *mp_bh)
@@ -82,6 +111,17 @@ static void multipath_end_bh_io (struct multipath_bh *mp_bh, int err)
 	mempool_free(mp_bh, conf->pool);
 }
 
+/*
+ * Update disk head position estimator based on IRQ completion info.
+ */
+static inline void update_head_pos(int disk, struct multipath_bh *mp_bh)
+{
+	multipath_conf_t *conf = mp_bh->mddev->private;
+
+	conf->multipaths[disk].head_position =
+		mp_bh->bio.bi_sector + (mp_bh->bio.bi_size >> 9);
+}
+
 static void multipath_end_request(struct bio *bio, int error)
 {
 	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -89,6 +129,8 @@ static void multipath_end_request(struct bio *bio, int error)
 	multipath_conf_t *conf = mp_bh->mddev->private;
 	mdk_rdev_t *rdev = conf->multipaths[mp_bh->path].rdev;
 
+	update_head_pos(mp_bh->path, mp_bh);
+
 	if (uptodate)
 		multipath_end_bh_io(mp_bh, 0);
 	else if (!(bio->bi_rw & REQ_RAHEAD)) {
@@ -122,7 +164,7 @@ static int multipath_make_request(mddev_t *mddev, struct bio * bio)
 	mp_bh->master_bio = bio;
 	mp_bh->mddev = mddev;
 
-	mp_bh->path = multipath_map(conf);
+	mp_bh->path = multipath_map(conf, bio->bi_sector);
 	if (mp_bh->path < 0) {
 		bio_endio(bio, -EIO);
 		mempool_free(mp_bh, conf->pool);
@@ -356,7 +398,7 @@ static void multipathd (mddev_t *mddev)
 		bio = &mp_bh->bio;
 		bio->bi_sector = mp_bh->master_bio->bi_sector;
 		
-		if ((mp_bh->path = multipath_map (conf))<0) {
+		if ((mp_bh->path = multipath_map(conf, bio->bi_sector)) < 0) {
 			printk(KERN_ALERT "multipath: %s: unrecoverable IO read"
 				" error for block %llu\n",
 				bdevname(bio->bi_bdev,b),
diff --git a/drivers/md/multipath.h b/drivers/md/multipath.h
index 3c5a45eb5f8a..060fe2aabd97 100644
--- a/drivers/md/multipath.h
+++ b/drivers/md/multipath.h
@@ -3,6 +3,7 @@
 
 struct multipath_info {
 	mdk_rdev_t	*rdev;
+	sector_t	head_position;
 };
 
 struct multipath_private_data {
-- 
1.7.5.2
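
For readers who want the selection rule without the kernel plumbing, here is a small Python model of the loop in multipath_map() as changed by this patch. The dictionary keys mirror the fields the patch uses (In_sync, data_offset, head_position), but pick_path itself is a hypothetical name, and the locking, RCU and retry handling are deliberately left out.

```python
def pick_path(paths, sector):
    """Return the index of the in-sync path whose head is closest, or -1.

    paths:  list of dicts with keys 'in_sync', 'data_offset' and
            'head_position' (all illustrative stand-ins for the rdev fields).
    sector: start sector of the original bio, before data_offset is added.
    """
    best_disk = -1
    best_dist = None
    for i, path in enumerate(paths):
        if not path["in_sync"]:
            continue
        this_sector = sector + path["data_offset"]
        dist = abs(this_sector - path["head_position"])
        if best_dist is None or dist < best_dist:   # first path wins a tie
            best_dist = dist
            best_disk = i
    return best_disk


paths = [
    {"in_sync": True,  "data_offset": 0, "head_position": 1000},
    {"in_sync": True,  "data_offset": 0, "head_position": 5000},
    {"in_sync": False, "data_offset": 0, "head_position": 4000},
]
print(pick_path(paths, 4100))   # 1
```

The third path is nearest to sector 4100 but is skipped because it is not in sync.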


^ permalink raw reply related

* Re: SRaid with 13 Disks crashed
From: Phil Turmel @ 2011-06-10 11:48 UTC (permalink / raw)
  To: Dragon; +Cc: linux-raid
In-Reply-To: <20110610075257.298520@gmx.net>

Hi Dragon,

On 06/10/2011 03:52 AM, Dragon wrote:
> I have not truncated anything. That is all I get from bash after executing the script. I am not aware of using a fast-boot kernel, but I saw that it comes with kernel version 2.6.28. I think I upgraded to the latest kernel version after the RAID crashed, in order to have newer mdadm, ext, and other tools to recover the RAID.

Newer kernels have the option to parallel probe.  You got a newer kernel.

> Am I right that to use the --backup-file option, I must have a backup file ;)? I didn't have one.

Your attempt to shrink the array from 13 back to 12 would have needed the file, and you didn't supply one.  mdadm almost certainly told you it wouldn't shrink the array.

> As far as I understand your advice, I should recreate the RAID with the disks in the right order. You chose the order from the output of mdadm -E, which gives the actual number of each disk, and the variation at disks sdd and sdn is because both show the same number, right? I think I understand that. But why is disk "j" the last in the order? My output says that disk sdd is number 13, but as a spare for sda....?
> Here is the state after rebooting this morning:

Yes, I'm using the "RaidDevice" column to determine which drive is which.  The drive names in the far right column are the original names, before your kernel changed.  They aren't important.  To reconstruct your 13-disk array, the devices from 0 to 12 must be listed in the create command in the correct order.  Two of your devices think they are spares, reporting "RaidDevice" > 12.  They must fit into the two slots that are reported as "faulty removed".

> fdisk  -l|grep sd
> Disk /dev/sdb: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sda: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdc: 20.4 GB, 20409532416 bytes
> /dev/sdc1   *           1        2372    19053058+  83  Linux
> /dev/sdc2            2373        2481      875542+   5  Extended
> /dev/sdc5            2373        2481      875511   82  Linux swap / Solaris
> Disk /dev/sdd: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sde: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdf: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdg: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdh: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdi: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdj: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdk: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdl: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdm: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdn: 1500.3 GB, 1500301910016 bytes

[...]

> ------
> in short:
> /dev/sda:
> this     4       8      176        4      active sync   /dev/sdl
> 
> /dev/sdb:
> this     5       8      192        5      active sync   /dev/sdm
> 
> /dev/sdd:
> this    13       8        0       13      spare   /dev/sda
> 
> /dev/sde:
> this     6       8       16        6      active sync   /dev/sdb
> 
> /dev/sdf:
> this     8       8       32        8      active sync   /dev/sdc
> 
> /dev/sdg:
> this     9       8       48        9      active sync   /dev/sdd
> 
> /dev/sdh:
> this    10       8       64       10      active sync   /dev/sde
> 
> /dev/sdi:
> this    11       8       80       11      active sync   /dev/sdf
> 
> /dev/sdj:
> this    12       8       96       12      active sync   /dev/sdg
> 
> /dev/sdk:
> this     0       8      112        0      active sync   /dev/sdh
> 
> /dev/sdl:     
> this     2       8      128        2      active sync   /dev/sdi
> 
> /dev/sdm:         
> this     3       8      144        3      active sync   /dev/sdj
> 
> /dev/sdn:      
> this    13       8      160       13      spare   /dev/sdk

Good to summarize, but ignore the names on the right.

> After that I would think the order is confused, because no disk has the right name in the first line for its position. But I would do this:
> 
> mdadm -C /dev/md0 -l 5 -n 13 -e 0.90 -c 64 --assume-clean /dev/sd{k,l,m,a,b,e,?d?,f,g,h,i,j,n} 
> or
> mdadm -C /dev/md0 -l 5 -n 13 -e 0.90 -c 64 --assume-clean /dev/sd{k,l,m,a,b,e,?n?,f,g,h,i,j,d} 

No.  The devices that know where they belong are 0,2,3,4,5,6,8,9,10,11,12.  That would be /dev/sd{k,*,l,m,a,b,e,*,f,g,h,i,j}.  The devices that report 13, /dev/sdd & /dev/sdn, must fit into positions 1 & 7.  Two possibilities:

/dev/sd{k,d,l,m,a,b,e,n,f,g,h,i,j} or /dev/sd{k,n,l,m,a,b,e,d,f,g,h,i,j}
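
The slot-filling above can be written out as a short Python sketch. The roles table is copied from the "this" lines in the summary (current device name to RaidDevice number); candidate_orders is just an illustrative name, and each resulting order still has to be tried and checked by hand.

```python
from itertools import permutations

# Current device name -> RaidDevice number reported by "mdadm -E".
roles = {
    "sda": 4, "sdb": 5, "sdd": 13, "sde": 6, "sdf": 8, "sdg": 9, "sdh": 10,
    "sdi": 11, "sdj": 12, "sdk": 0, "sdl": 2, "sdm": 3, "sdn": 13,
}


def candidate_orders(roles, raid_devices):
    """Place devices by role, then try the leftover devices in each empty slot."""
    slots = [None] * raid_devices
    leftovers = []
    for name, role in sorted(roles.items()):
        if role < raid_devices:
            slots[role] = name            # device knows where it belongs
        else:
            leftovers.append(name)        # reports a spare role (> last slot)
    holes = [i for i, s in enumerate(slots) if s is None]
    orders = []
    for perm in permutations(leftovers, len(holes)):
        order = slots[:]
        for hole, name in zip(holes, perm):
            order[hole] = name
        orders.append(order)
    return orders


for order in candidate_orders(roles, 13):
    print("/dev/sd{" + ",".join(name[2:] for name in order) + "}")
# /dev/sd{k,d,l,m,a,b,e,n,f,g,h,i,j}
# /dev/sd{k,n,l,m,a,b,e,d,f,g,h,i,j}
```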

HTH,

Phil


^ permalink raw reply

* Re: Triple-parity raid6
From: Christoph Dittmann @ 2011-06-10 12:20 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <isslla$o2i$1@dough.gmane.org>

On 06/10/2011 10:45 AM, David Brown wrote:
> Making multiple parity syndromes is easy enough mathematically:

Adam Leventhal, who wrote the double parity and triple parity code for 
ZFS, mentioned on his blog [1] that going beyond triple parity poses 
significant challenges if write performance should not suffer.

In particular, it looks like a relevant math paper (originally) had a 
flaw in the claim that quad parity and above can be implemented just 
like triple parity.

I don't know the implementation you were going to use, neither am I 
knowledgeable about multi-parity in general. I only thought it might be 
relevant to add to the current discussion that other people had issues 
with implementing N-parity for N > 3.

Christoph

[1] http://blogs.oracle.com/ahl/entry/triple_parity_raid_z

^ permalink raw reply

* [PATCH] imsm: FIX: Raid5 data corruption data recovering from backup
From: Adam Kwolek @ 2011-06-10 13:00 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer

Sporadically, when Raid5 data is restored from the backup area,
corruption occurs.
It doesn't happen if the reshape process is beyond the critical section.

The root cause of the problem is passing the wrong starting point to
restore_stripes(). It was hard coded to 0 so far.
As a result, the parity disk position in the first stripe was always set
to the last raid disk. This position should depend on the data position in the array.

The proper start position is now set, and the pointer for restoring data
(copy area address) is adjusted to the passed start parameter.

Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
---

 super-intel.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 5e8b834..075385a 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -7718,6 +7718,8 @@ int save_backup_imsm(struct supertype *st,
 	int new_disks = map_dest->num_members;
 	int dest_layout = 0;
 	int dest_chunk;
+	unsigned long long start;
+	int data_disks = imsm_num_data_members(dev, 0);
 
 	targets = malloc(new_disks * sizeof(int));
 	if (!targets)
@@ -7727,10 +7729,15 @@ int save_backup_imsm(struct supertype *st,
 	if (!target_offsets)
 		goto abort;
 
+	start = info->reshape_progress * 512;
 	for (i = 0; i < new_disks; i++) {
 		targets[i] = -1;
 		target_offsets[i] = (unsigned long long)
 		  __le32_to_cpu(super->migr_rec->ckpt_area_pba) * 512;
+		/* move back copy area address, it will be moved forward
+		 * in restore_stripes() using start input variable
+		 */
+		target_offsets[i] -= start/data_disks;
 	}
 
 	if (open_backup_targets(info, new_disks, targets))
@@ -7748,7 +7755,7 @@ int save_backup_imsm(struct supertype *st,
 			    -1,    /* source backup file descriptor */
 			    0,     /* input buf offset
 				    * always 0 buf is already offseted */
-			    0,
+			    start,
 			    length,
 			    buf) != 0) {
 		fprintf(stderr, Name ": Error restoring stripes\n");


^ permalink raw reply related

* (unknown)
From: Dragon @ 2011-06-10 13:06 UTC (permalink / raw)
  To: philip; +Cc: linux-raid

You are right: the array starts at position 0, so positions 1 and 7 are the right ones. The second try was perfect. fsck shows this:

fsck -n /dev/md0
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
/dev/md0 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
dd/dev/md0: 266872/1007288320 files (15.4% non-contiguous), 3769576927/4029130864 blocks

and:
mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Fri Jun 10 14:19:24 2011
     Raid Level : raid5
     Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
  Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
   Raid Devices : 13
  Total Devices : 13
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Jun 10 14:19:24 2011
          State : clean
 Active Devices : 13
Working Devices : 13
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 8c4d8438:42aa49f9:a6d866f6:b6ea6b93 (local to host nassrv01)
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8      160        0      active sync   /dev/sdk
       1       8      208        1      active sync   /dev/sdn
       2       8      176        2      active sync   /dev/sdl
       3       8      192        3      active sync   /dev/sdm
       4       8        0        4      active sync   /dev/sda
       5       8       16        5      active sync   /dev/sdb
       6       8       64        6      active sync   /dev/sde
       7       8       48        7      active sync   /dev/sdd
       8       8       80        8      active sync   /dev/sdf
       9       8       96        9      active sync   /dev/sdg
      10       8      112       10      active sync   /dev/sdh
      11       8      128       11      active sync   /dev/sdi
      12       8      144       12      active sync   /dev/sdj

Normally I use fsck.ext4 or fsck.ext4dev. Is that a problem? What does the 15.4% figure mean? Is it the proportion of lost data? After that, do I shrink like this?

mdadm  /dev/md0 --fail /dev/sdj
mdadm /dev/md0 --remove /dev/sdj
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

Is that the right way? I assume the disk I take out of the raid is not necessarily the one I added last, so I have to read out the serial number to find it among the hard drives?
Many thanks so far.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Why move all map_sg/unmap_sg for slave channel to its client?
From: Atsushi Nemoto @ 2011-06-10 13:19 UTC (permalink / raw)
  To: viresh.kumar
  Cc: dan.j.williams, linus.walleij, vinod.koul, linux-kernel,
	shiraz.hashim, armando.visconti, bhupesh.sharma, linux-raid
In-Reply-To: <4DF19251.1090100@st.com>

On Fri, 10 Jun 2011 09:11:05 +0530, viresh kumar <viresh.kumar@st.com> wrote:
> >>> I thought map_sg/unmap_sg for slave channels will be handled according
> >>> to the flags passed in prep_slave_sg(). But then i found following patch:
...
> Linus, Dan,
> 
> Got it. Thanks for your replies.

JFYI, the old discussion was here:

https://lkml.org/lkml/2009/7/24/114

---
Atsushi Nemoto

^ permalink raw reply

* Re: Triple-parity raid6
From: Bill Davidsen @ 2011-06-10 13:56 UTC (permalink / raw)
  To: NeilBrown; +Cc: David Brown, linux-raid
In-Reply-To: <20110609220438.26336b27@notabene.brown>

NeilBrown wrote:
> On Thu, 09 Jun 2011 13:32:59 +0200 David Brown<david@westcontrol.com>  wrote:
>
>    
>> On 09/06/2011 03:49, NeilBrown wrote:
>>      
>>>
>>>    -ENOPATCH  :-)
>>>
>>> I have a series of patches nearly ready which removes a lot of the remaining
>>> duplication in raid5.c between raid5 and raid6 paths.  So there will be
>>> relative few places where RAID5 and RAID6 do different things - only the
>>> places where they *must* do different things.
>>>        
>>> After that, adding a new level or layout which has 'max_degraded == 3' would
>>> be quite easy.
>>> The most difficult part would be the enhancements to libraid6 to generate the
>>> new 'syndrome', and to handle the different recovery possibilities.
>>>
>>> So if you're not otherwise busy this weekend, a patch would be nice :-)
>>>
>>>        
>> I'm not going to promise any patches, but maybe I can help with the
>> maths.  You say the difficult part is the syndrome calculations and
>> recovery - I've got these bits figured out on paper and some
>> quick-and-dirty python test code.  On the other hand, I don't really
>> want to get into the md kernel code, or the mdadm code - I haven't done
>> Linux kernel development before (I mostly program 8-bit microcontrollers
>> - when I code on Linux, I use Python), and I fear it would take me a
>> long time to get up to speed.
>>
>> However, if the parity generation and recovery is neatly separated into
>> a libraid6 library, the whole thing becomes much more tractable from my
>> viewpoint.  Since I am new to this, can you tell me where I should get
>> the current libraid6 code?  I'm sure google will find some sources for
>> me, but I'd like to make sure I start with whatever version /you/ have.
>>      
> You can see the current kernel code at:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD
>
>
> int.uc is the generic C code which 'unroll.awk' processes to make various
> versions that unroll the loops different amounts to work with CPUs with
> different numbers of registers.
> Then there is sse1, sse2, altivec which provide the same functionality in
> assembler which is optimised for various processors.
>
>    
And at some point I'm sure one of the video card vendors will provide a 
hack to do it in the GPU in massively parallel fashion.

> And 'recov' has the smarts for doing the reverse calculation when 2 data
> blocks, or 1 data and P are missing.
>
> Even if you don't feel up to implementing everything, a start might be
> useful.  You never know when someone might jump up and offer to help.
>
>
>    


-- 
Bill Davidsen<davidsen@tmr.com>
   We are not out of the woods yet, but we know the direction and have
taken the first step. The steps are many, but finite in number, and if
we persevere we will reach our destination.  -me, 2010




^ permalink raw reply

* Re: SRaid with 13 Disks crashed
From: Phil Turmel @ 2011-06-10 14:01 UTC (permalink / raw)
  To: Dragon; +Cc: linux-raid
In-Reply-To: <20110610130652.298530@gmx.net>

On 06/10/2011 09:06 AM, Dragon wrote:
> You are right, the array starts at pos 0 and so pos 1 and 7 are the right pos. the 2. try was perfect. fsck shows this:

Yay!

> fsck -n /dev/md0
> fsck from util-linux-ng 2.17.2
> e2fsck 1.41.12 (17-May-2010)
> /dev/md0 was not cleanly unmounted, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> dd/dev/md0: 266872/1007288320 files (15.4% non-contiguous), 3769576927/4029130864 blocks
> 
> and:
> mdadm --detail /dev/md0
> /dev/md0:
>         Version : 0.90
>   Creation Time : Fri Jun 10 14:19:24 2011
>      Raid Level : raid5
>      Array Size : 17581661952 (16767.18 GiB 18003.62 GB)
>   Used Dev Size : 1465138496 (1397.26 GiB 1500.30 GB)
>    Raid Devices : 13
>   Total Devices : 13
> Preferred Minor : 0
>     Persistence : Superblock is persistent
> 
>     Update Time : Fri Jun 10 14:19:24 2011
>           State : clean
>  Active Devices : 13
> Working Devices : 13
>  Failed Devices : 0
>   Spare Devices : 0
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>            UUID : 8c4d8438:42aa49f9:a6d866f6:b6ea6b93 (local to host nassrv01)
>          Events : 0.1
> 
>     Number   Major   Minor   RaidDevice State
>        0       8      160        0      active sync   /dev/sdk
>        1       8      208        1      active sync   /dev/sdn
>        2       8      176        2      active sync   /dev/sdl
>        3       8      192        3      active sync   /dev/sdm
>        4       8        0        4      active sync   /dev/sda
>        5       8       16        5      active sync   /dev/sdb
>        6       8       64        6      active sync   /dev/sde
>        7       8       48        7      active sync   /dev/sdd
>        8       8       80        8      active sync   /dev/sdf
>        9       8       96        9      active sync   /dev/sdg
>       10       8      112       10      active sync   /dev/sdh
>       11       8      128       11      active sync   /dev/sdi
>       12       8      144       12      active sync   /dev/sdj
> 
> normaly i use fsck.ext4 e.a. fsck.ext4dev. problem? what means 15,4% not related? the quote of lost data? after that i shrink like this:?

fsck automatically calls fsck.ext4 when it sees an ext4 filesystem.  15.4% non-contiguous == 15.4% fragmented.  No lost data.

Now that you have a good filesystem, mounting it and taking a backup would be a good idea.  Or at least retrieve any files that are very important to you.

> mdadm  /dev/md0 --fail /dev/sdj
> mdadm /dev/md0 --remove /dev/sdj

NO! You must use "mdadm --grow".  Yes, "--grow" also does "shrink".  Your fsck shows that the ext4 filesystem is still sized for the original 12-disk setup, so you don't have to shrink the filesystem.  You do have to shrink the raid:

Step 1a: Tell mdadm the final size you are aiming for.  MD will emulate this while you test that the new size works:
mdadm /dev/md0 --grow --array-size=16116523456k

(Please show "mdadm -D /dev/md0" at this point.)

Step 1b: Verify data integrity with another fsck -n

Step 2:  Tell mdadm to really reshape to the 12-disk raid5
mdadm /dev/md0 --grow -n 12 --backup-file=/reshape.bak

When the reshape/shrink is done, "mdadm -D /dev/md0" will report "Raid Devices : 12" and "Spare Devices : 1", and one of them, almost certainly /dev/sdj, will be marked "spare".

At this point, I recommend converting to raid6, consuming the spare.

mdadm /dev/md0 --grow -n 13 -l 6 --backup-file=/reshape.bak

It might be possible to go directly to this layout (in place of step 2 above).  It would save a lot of time.  Maybe someone else on the list can answer that.  Or you can just try it.  I'm sure mdadm will complain if it's not possible ;).

> mdadm --detail --scan >> /etc/mdadm/mdadm.conf

Yes.  Make sure you edit it afterwards to remove the old array's information.

> right way? i assume that the disk that i take off the raid is not the same like i added at last? so i have to read out the serial to find it under the harddrives?

Yes, use lsdrv or "ls -l /dev/disk/by-id/" to make sure you remove the spare.  Of course, if you convert to raid6, it won't be a spare :).

> many thx so far

You are welcome.

Phil

^ permalink raw reply

* Disk upgrade
From: Bill Davidsen @ 2011-06-10 14:09 UTC (permalink / raw)
  To: Linux RAID

I'm running out of room in a box to add drives, so I want to go to 
larger drives. Unfortunately I have only one bay left. What I would like 
to do is put in a single drive, create a raid-10f2 array with a missing 
device, and copy all of the data off the existing arrays onto the 
raid-10 array. Then I would diddle the boot so I can come up off the new 
drive, remove the old drives, add another new drive, and add that to the 
array, along with a spare, perhaps.

Any particular problems with that plan? It leaves the existing drives 
intact, and critical data is backed up via rsync both to a removable 
device and to network storage.

-- 
Bill Davidsen<davidsen@tmr.com>
   We are not out of the woods yet, but we know the direction and have
taken the first step. The steps are many, but finite in number, and if
we persevere we will reach our destination.  -me, 2010

^ permalink raw reply

* Re: Triple-parity raid6
From: David Brown @ 2011-06-10 14:28 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <4DF20C18.3030604@christoph-d.de>

On 10/06/2011 14:20, Christoph Dittmann wrote:
> On 06/10/2011 10:45 AM, David Brown wrote:
>> Making multiple parity syndromes is easy enough mathematically:
>
> Adam Leventhal, who wrote the double parity and triple parity code for
> ZFS, mentioned on his blog [1] that going beyond triple parity poses
> significant challenges if write performance should not suffer.
>
> In particular, it looks like a relevant math paper (originally) had a
> flaw in the claim that quad parity and above can be implemented just
> like triple parity.
>
> I don't know the implementation you were going to use, neither am I
> knowledgeable about multi-parity in general. I only thought it might be
> relevant to add to the current discussion that other people had issues
> with implementing N-parity for N > 3.
>
>

I've looked at Adam's blog article - in fact, stumbling over it when 
wandering around the 'net was one of my inspirations to think about 
multi-parity raid again.

But there are a few key differences between the description in the James 
Plank papers referenced, and the implementation I've looked at.

One point is that he is looking at GF(2^4), which quickly becomes a 
limiting factor.  It's hard to get enough independent parity syndromes, 
and you can easily start to think that things won't work for larger parity.

The other is the choice of factors for the syndromes.  Assuming I 
understand the papers correctly, the first paper is suggesting this 
arrangement (all arithmetic done in GF()):

P_0 = 1^0 . d_0  +  2^0 . d_1  +  3^0 . d_2  +  4^0 . d_3
P_1 = 1^1 . d_0  +  2^1 . d_1  +  3^1 . d_2  +  4^1 . d_3
P_2 = 1^2 . d_0  +  2^2 . d_1  +  3^2 . d_2  +  4^2 . d_3
P_3 = 1^3 . d_0  +  2^3 . d_1  +  3^3 . d_2  +  4^3 . d_3

The second paper changes it to:

P_0 = 1^0 . d_0  +  1^1 . d_1  +  1^2 . d_2  +  1^3 . d_3
P_1 = 2^0 . d_0  +  2^1 . d_1  +  2^2 . d_2  +  2^3 . d_3
P_2 = 3^0 . d_0  +  3^1 . d_1  +  3^2 . d_2  +  3^3 . d_3
P_3 = 4^0 . d_0  +  4^1 . d_1  +  4^2 . d_2  +  4^3 . d_3

For the first two parity blocks, this is the same as for Linux raid6:

P = d_0 + d_1 + d_2 + d_3
Q = g^0 . d_0  +  g^1 . d_1  +  g^2 . d_2  +  g^3 . d_3
(where g = 2)

But the third (and later) lines are not good - the restoration equations 
on multiple disk failure are not independent for all combinations of 
failed disks.  For example, if you have 20 data disks and 3 parity 
disks, there are 1140 different combinations of three data-disk 
failures.  Of these, 92 combinations lead to non-independent equations 
when you try to restore the data.

The equations I am using are:

P_0 = 1^0 . d_0  +  1^1 . d_1  +  1^2 . d_2  +  1^3 . d_3
P_1 = 2^0 . d_0  +  2^1 . d_1  +  2^2 . d_2  +  2^3 . d_3
P_2 = 4^0 . d_0  +  4^1 . d_1  +  4^2 . d_2  +  4^3 . d_3
P_3 = 8^0 . d_0  +  8^1 . d_1  +  8^2 . d_2  +  8^3 . d_3

P_4 would use powers of 16, etc.
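
As an illustration of the equations above, here is a minimal Python sketch 
that computes the syndromes for a single byte position across the data 
disks.  It assumes the field polynomial 0x11d for GF(2^8) (the one used in 
"The mathematics of RAID-6"); the names gf_mul, gf_pow and syndromes are 
hypothetical and chosen only for this example.  It is not taken from any 
existing raid6 code.

```python
def gf_mul(a, b, poly=0x11d):
    """Multiply two elements of GF(2^8) (shift-and-add, reduced by poly)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(base, exponent):
    """Raise base to a non-negative integer power in GF(2^8)."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, base)
    return result


def syndromes(data, n_parity):
    """Return [P_0, ..., P_(n_parity-1)] for one byte from each data disk.

    P_k uses the generator 2^k, i.e. 1, 2, 4, 8, ...
    """
    out = []
    for k in range(n_parity):
        g = gf_pow(2, k)
        s = 0
        for i, d in enumerate(data):
            s ^= gf_mul(gf_pow(g, i), d)  # addition in GF(2^8) is XOR
        out.append(s)
    return out


print(syndromes([1, 2, 3, 4], 3))
```

P_0 comes out as the plain XOR of the data bytes, which is the raid5 
parity; P_1 corresponds to the raid6 Q syndrome with g = 2.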

I have checked that the restoration equations are solvable for all 
combinations of 3 disk failures for triple parity, and all combinations 
of 4 disk failures for quad parity.  The restoration equations can be 
written in matrix form and are dependent on the indexes of the failed 
disks.  To solve them, you simply invert the matrix and this gives you a 
linear formula for re-calculating each missing data block.  So to check 
that the data can be restored, you need to check that the determinant of 
this matrix is non-zero for all combinations of n-disk failures.
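
The check described above can be sketched roughly as follows.  This is 
not the original checker program, only a small illustration under the same 
assumptions: GF(2^8) with polynomial 0x11d, data-disk failures only, and 
one matrix row per parity generator.  All function names are hypothetical.

```python
from itertools import combinations


def gf_mul(a, b, poly=0x11d):
    """Multiply two elements of GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(base, exponent):
    """Raise base to a non-negative integer power in GF(2^8)."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, base)
    return result


def gf_inv(a):
    """Multiplicative inverse: a^254 == a^-1 for non-zero a in GF(2^8)."""
    return gf_pow(a, 254)


def gf_det(matrix):
    """Determinant over GF(2^8) by Gaussian elimination.

    In characteristic 2, row swaps do not change the sign.
    """
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            return 0
        m[col], m[pivot] = m[pivot], m[col]
        det = gf_mul(det, m[col][col])
        inv = gf_inv(m[col][col])
        for r in range(col + 1, n):
            factor = gf_mul(m[r][col], inv)
            if factor:
                for c in range(col, n):
                    m[r][c] ^= gf_mul(factor, m[col][c])
    return det


def count_singular(generators, n_data):
    """Count failed-disk combinations whose restoration matrix is singular.

    Returns (singular, total) over all choices of len(generators) data disks.
    """
    singular = total = 0
    for failed in combinations(range(n_data), len(generators)):
        matrix = [[gf_pow(g, i) for i in failed] for g in generators]
        total += 1
        if gf_det(matrix) == 0:
            singular += 1
    return singular, total


print(count_singular((1, 2, 4), 20))
```

Calling count_singular((1, 2, 3), 20) instead runs the same test for the 
generators 1, 2, 3 discussed earlier.  Pure Python like this is slow, so it 
is only practical for small disk counts.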

I /believe/ these should work fine for at least up to P_7 (i.e., 8 
parity disks), but I haven't yet checked more than a few larger parity 
modes.  Checking all 4-disk failures from 253 disks took my inefficient 
python program about 45 minutes - to check for 8-disk failures would 
mean finding all combinations of 8 choices for up to 249 disks.  If my 
maths serves me right, that's 316,528,258,780,856 combinations - and for 
each one, I'd have to find the determinant of an 8x8 matrix over 
GF(2^8).  Fortunately, I don't think there'll be much call for 
octal-parity RAID in the near future :-)

Of course, all this assumes that my maths is correct!

^ permalink raw reply

* Re: [PATCH/RFC] md/raid10: optimize read_balance() for 'far copies' arrays
From: Bill Davidsen @ 2011-06-10 14:29 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: NeilBrown, linux-raid
In-Reply-To: <877h8w93bw.fsf@gmail.com>

Namhyung Kim wrote:
> NeilBrown<neilb@suse.de>  writes:
>
>    
>> On Wed,  8 Jun 2011 16:00:45 +0900 Namhyung Kim<namhyung@gmail.com>  wrote:
>>
>>      
>>> If @conf->far_offset>  0, there is only 1 stripe so that we can treat
>>> the array same as 'near' arrays. Furthermore we could calculate new
>>> distance from the previous position even for the real 'far' array
>>> cases if the position of given disk is already in the lowest stripe.
>>>
>>>        
>> I agree that it still makes sense to do balancing if far_offset != 0.
>> However, there is absolutely no point in your change to the calculation of
>> new_distance.
>> You only want new_distance to contain a distance from head position if we
>> want to choose the device with the 'closest' head.  But we don't.  We want to
>> choose the device where the data is closest to the start of the device.  So
>> the current value for new_distance is correct.
>>
>>      
> Still can't understand why we choose the closest-to-the-start disk in
> case we could have possible sequential access on another disk. Probably
> because of my lack of understanding of how md/disk works :(
>    

This code is all based on traditional drives, where the seek time, 
rotational latency, and position on the platter are all factors which 
affect performance in some way. Devices like SSDs don't have these 
factors (i.e. they are constants) and someday it may make sense to 
rethink this code again.

Also note that "close to current" optimizes seek time, while "close to 
beginning" optimizes transfer rate. Note the total lack of parameters to 
tune "what you want" for a given device.

-- 
Bill Davidsen<davidsen@tmr.com>
   We are not out of the woods yet, but we know the direction and have
taken the first step. The steps are many, but finite in number, and if
we persevere we will reach our destination.  -me, 2010




^ permalink raw reply

* [PATCH] mdadm: Linux 3.x version change
From: Namhyung Kim @ 2011-06-10 14:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

As Linux 3.x changes its versioning scheme, we have to deal with
the 2-digit version number also.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 util.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/util.c b/util.c
index e92be4f..9db9ee6 100644
--- a/util.c
+++ b/util.c
@@ -154,8 +154,15 @@ int get_linux_version()
 	a = strtoul(cp, &cp, 10);
 	if (*cp != '.') return -1;
 	b = strtoul(cp+1, &cp, 10);
-	if (*cp != '.') return -1;
-	c = strtoul(cp+1, NULL, 10);
+	/* deal with 3.x version change */
+	if (*cp != '.') {
+		if (a >= 3)
+			c = 0;
+		else
+			return -1;
+	} else {
+		c = strtoul(cp+1, NULL, 10);
+	}
 
 	return (a*1000000)+(b*1000)+c;
 }
-- 
1.7.5.2
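
For readers who do not want to trace the C, the parsing rule the patch 
introduces can be sketched in Python as below.  This is only an 
illustration of the logic, not mdadm code: the function name linux_version 
and the use of a regular expression are assumptions of this example, and 
edge cases may differ from the strtoul()-based original.

```python
import re


def linux_version(release):
    """Encode a kernel release string as a*1000000 + b*1000 + c, or -1."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        return -1
    a, b = int(m.group(1)), int(m.group(2))
    if m.group(3) is not None:
        c = int(m.group(3))
    elif a >= 3:
        c = 0        # 3.x releases may omit the third component
    else:
        return -1    # a 2.x release without a third component is rejected
    return a * 1000000 + b * 1000 + c


print(linux_version("2.6.39"))
print(linux_version("3.0"))
```

A two-component release such as "3.0" is accepted with the third 
component taken as 0, while "2.6" alone is still treated as an error.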


^ permalink raw reply related

* [PATCH] imsm: FIX: Migration Raid0->Raid5 cannot be restarted correctly
From: Adam Kwolek @ 2011-06-10 14:44 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer

When a raid0 array is migrated to raid5, the reshape cannot be continued
correctly because of wrong array parameter settings:
the number of raid disks is set too high.

During raid0->raid5 migration there is no need to increase
info->array.raid_disks; it is already set to its final value using
designation map information.

Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
---

 super-intel.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 075385a..40fd940 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -2133,7 +2133,6 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 				/* conversion is happening as RAID5 */
 				info->array.level = 5;
 				info->array.layout = ALGORITHM_PARITY_N;
-				info->array.raid_disks += 1;
 				info->delta_disks -= 1;
 				break;
 			default:


^ permalink raw reply related

* Re: SRaid with 13 Disks crashed
From: Phil Turmel @ 2011-06-10 14:49 UTC (permalink / raw)
  To: Dragon; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <20110610144429.298520@gmx.net>

On 06/10/2011 10:44 AM, Dragon wrote:
> "fsck automatically calls fsck.ext4 when it sees an ext4 filesystem.  15.4% Not contiguous == 15.4 fragmented.  No lost data."
> -> ok, phew
> 
> Now that you have a good filesystem, mounting it and taking a backup would be a good idea.  Or at least retrieve any files that are very important to you.
> -> oops
> 
> NO! You must use "mdadm --grow".  Yes, "--grow" also does "shrink".  Your fsck shows that the ext4 filesystem is still sized for the original 12-disk setup, so you don't have to shrink the filesystem.  You do have to shrink the raid:
> 
> -> mount was successful and I can see all of my data ;) I'm so happy, but I still have no possibility to make a backup

Yay! :) and Oooo! :(

> Step 1a: Tell mdadm the final size you are aiming for.  MD will emulate this while you test that the new size works:
> mdadm /dev/md0 --grow --array-size=16116523456k
> -> shows
> mdadm /dev/md0 --grow --array-size=16116523456k
> mdadm: invalid array size: 16116523456k
> 
> might be: 16896000k better? 1500*1024*11

No, it must be "Used Device Size" * 11 = 16116523456.  Try it without the 'k'.
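
The arithmetic behind that number can be checked with a short Python 
sketch.  The two input values are the ones reported by mdadm --detail and 
fsck earlier in this thread; the 4 KiB filesystem block size is an 
assumption (it is not shown in the output), and the function name is 
hypothetical.

```python
USED_DEV_SIZE_KIB = 1465138496  # "Used Dev Size" from mdadm --detail
FS_BLOCKS = 4029130864          # total block count from the fsck summary
FS_BLOCK_SIZE_KIB = 4           # assumption: ext4 with 4 KiB blocks


def raid5_array_size_kib(raid_devices, used_dev_size_kib):
    """Usable raid5 size: one device's worth of space goes to parity."""
    return (raid_devices - 1) * used_dev_size_kib


current = raid5_array_size_kib(13, USED_DEV_SIZE_KIB)
target = raid5_array_size_kib(12, USED_DEV_SIZE_KIB)
fs_size = FS_BLOCKS * FS_BLOCK_SIZE_KIB

print(current)            # 17581661952, the "Array Size" reported for 13 disks
print(target)             # 16116523456, the value for --array-size
print(fs_size <= target)  # True if the filesystem fits in the 12-disk array
```

Under the 4 KiB assumption the filesystem size equals the 12-disk array 
size exactly, which is consistent with the filesystem never having been 
grown.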

Phil

^ permalink raw reply

* raid5 wont start
From: Liam Kurmos @ 2011-06-10 15:24 UTC (permalink / raw)
  To: linux-raid

Hi All,

I'm currently experiencing a nightmare scenario, and hoping someone
here might be able to help me.

I have a deadline on my PhD at the end of the weekend. I was just
running some calculations when an unexpected power failure took my
system down.
On restart, fsck tried to run but failed. After rebooting again, the
system drive starts but is unable to mount my /home, which is on a
separate /md1 raid5.

I am able to get to a console where i'm trying to fix the problem.

my devices are
md0 level=10 devices /dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1
md1 level=5 devices /dev/sda5,/dev/sdb3, /dev/sdc3,/dev/sdd3

doing cat /proc/mdstat
shows md1: inactive sdc3[2] sdb3[1] sdd3[3]

and md0: active raid10 sdd1[3] sdc1[2] sdb3[1]

this makes me wonder if there was a temporary problem with sda after
the powercut (it appears to be fine now)

Examining sda1, it appears to be fine (similar to sdb1 etc.):
Array state is AAAA
yet md0 appears to be active without sda1 according to /proc/mdstat.

Examining sda5, there is an anomaly where it lists the other drives in
the array.

The right-hand two columns read:

active sync  /dev/sda5
active sync  /dev/sdc3
active sync  /dev/sdd3
active sync

so there is no mention of sdb3 and the bottom device has no name...

Examining the other partitions (sdb3, sdc3, sdd3), the right-hand side
of the bottom 4 lines is identical:

removed
active sync  /dev/sdb3
active sync  /dev/sdc3
active sync  /dev/sdd3

So those partitions suggest it is sda5 which is removed.

I would be tempted to try re-adding sda5 to md1, and sda1 to md0,
but the anomaly above makes me concerned this could make things worse,
as I really can't afford to lose data on md1 now. I would have thought
the raid5 would still be able to run with 1 drive removed, as the
raid10 appears to be...

Finally, I should mention that when attempting to run /dev/md1 I get errors:

cannot start dirty degraded array
failed to run raid set
md: pers->run() failed...

can anyone suggest how i should proceed?

Liam

^ permalink raw reply

* [PATCH 0/2] IMSM Checkpointing Bug Fix Series (3)
From: Adam Kwolek @ 2011-06-10 15:56 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer

The following series implements fixes for potential problems found
using Klocwork.

For a complete solution, all 7 sent patches need to be applied:
  1. imsm: FIX: Cannot create volume
  2. FIX: Cannot create volume
  3. imsm: FIX: Use function to obtain array layout
  4. imsm: FIX: Disable automatic metadata rollback for broken reshape
  5. imsm: FIX: Raid5 data corruption data recovering from backup
  6. imsm: Fix: klocwork: targets variable can be used uninitialized
  7. imsm: FIX: klocwork: passed dev pointer to is_gen_migration() can be NULL

IMSM Checkpointing Status: All unit tests passed

BR
Adam

---

Adam Kwolek (2):
      imsm: FIX: klocwork: passed dev pointer to is_gen_migration() can be NULL
      imsm: Fix: klocwork: targets variable can be used uninitialized

 super-intel.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

-- 
Signature

^ permalink raw reply
