Linux RAID subsystem development

* Re: Triple-parity raid6
From: David Brown @ 2011-06-09 22:42 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <isp2g2$rf$1@dough.gmane.org>

On 09/06/11 02:01, David Brown wrote:
> Has anyone considered triple-parity raid6 ? As far as I can see, it
> should not be significantly harder than normal raid6 - either to
> implement, or for the processor at run-time. Once you have the GF(2⁸)
> field arithmetic in place for raid6, it's just a matter of making
> another parity block in the same way but using a different generator:
>
> P = D_0 + D_1 + D_2 + .. + D_(n-1)
> Q = D_0 + g.D_1 + g².D_2 + .. + g^(n-1).D_(n-1)
> R = D_0 + h.D_1 + h².D_2 + .. + h^(n-1).D_(n-1)
>
> The raid6 implementation in mdraid uses g = 0x02 to generate the second
> parity (based on "The mathematics of RAID-6" - I haven't checked the
> source code). You can make a third parity using h = 0x04 and then get a
> redundancy of 3 disks. (Note - I haven't yet confirmed that this is
> valid for more than 100 data disks - I need to make my checker program
> more efficient first.)
>
> Rebuilding a disk, or running in degraded mode, is just an obvious
> extension to the current raid6 algorithms. If you are missing three data
> blocks, the maths looks hard to start with - but if you express the
> equations as a set of linear equations and use standard matrix inversion
> techniques, it should not be hard to implement. You only need to do this
> inversion once when you find that one or more disks have failed - then
> you pre-compute the multiplication tables in the same way as is done for
> raid6 today.
>
> In normal use, calculating the R parity is no more demanding than
> calculating the Q parity. And most rebuilds or degraded situations will
> only involve a single disk, and the data can thus be re-constructed
> using the P parity just like raid5 or two-parity raid6.
>
>
> I'm sure there are situations where triple-parity raid6 would be
> appealing - it has already been implemented in ZFS, and it is only a
> matter of time before two-parity raid6 has a real probability of hitting
> an unrecoverable read error during a rebuild.
>
>
> And of course, there is no particular reason to stop at three parity
> blocks - the maths can easily be generalised. 1, 2, 4 and 8 can be used
> as generators for quad-parity (checked up to 60 disks), and adding 16
> gives you quintuple parity (checked up to 30 disks) - but that's maybe
> getting a bit paranoid.
>
>
> ref.:
>
> <http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
> <http://blogs.oracle.com/ahl/entry/acm_triple_parity_raid>
> <http://queue.acm.org/detail.cfm?id=1670144>
> <http://blogs.oracle.com/ahl/entry/triple_parity_raid_z>
>
>
> mvh.,
>
> David
>
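
To make the P/Q/R equations and the matrix-inversion recovery described in the quoted message concrete, here is a minimal Python sketch. It is not libraid6 code. The helper names (gf_mul, syndromes, invert, recover) are made up for this illustration, and it assumes the field polynomial 0x11d from "The mathematics of RAID-6" with generators 1, 2 and 4. It handles only the case where exactly as many data blocks are lost as there are parity blocks.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed, as in the RAID-6 paper)


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result


def gf_pow(a, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    """a^254 is the inverse of a non-zero element, because a^255 == 1."""
    return gf_pow(a, 254)


def syndromes(data, generators=(1, 2, 4)):
    """One parity block per generator g: XOR over i of g^i * D_i."""
    size = len(data[0])
    out = []
    for g in generators:
        block = bytearray(size)
        for i, d in enumerate(data):
            coeff = gf_pow(g, i)
            for k in range(size):
                block[k] ^= gf_mul(coeff, d[k])
        out.append(bytes(block))
    return out


def invert(m):
    """Gauss-Jordan inversion over GF(2^8); raises StopIteration if singular."""
    n = len(m)
    aug = [list(row) + [int(r == c) for c in range(n)]
           for r, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, x) for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [x ^ gf_mul(f, y) for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def recover(data, parity, lost, generators=(1, 2, 4)):
    """Rebuild data blocks at indices `lost` (len(lost) == len(generators)).

    Entries of `data` at lost indices are ignored and may be None.
    """
    size = len(parity[0])
    known = [bytes(size) if i in lost else d for i, d in enumerate(data)]
    # Remove the surviving disks' contribution from each parity block.
    partial = syndromes(known, generators)
    rhs = [bytes(a ^ b for a, b in zip(p, q)) for p, q in zip(parity, partial)]
    # Coefficient matrix of the lost blocks; it is inverted once per failure set.
    inv = invert([[gf_pow(g, i) for i in lost] for g in generators])
    rebuilt = {}
    for row, i in zip(inv, lost):
        block = bytearray(size)
        for coeff, s in zip(row, rhs):
            for k in range(size):
                block[k] ^= gf_mul(coeff, s[k])
        rebuilt[i] = bytes(block)
    return rebuilt
```

The inverted matrix depends only on which disks are missing, so in this sketch it would be computed once per failure set and then applied to every stripe, which mirrors the "invert once, then pre-compute tables" idea in the quoted text.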

Just to follow up on my numbers here - I've now checked the validity of 
triple-parity using generators 1, 2 and 4 for up to 254 data disks 
(i.e., 257 disks altogether).  I've checked the validity of quad-parity 
up to 120 disks - checking the full 253 disks will probably take the 
machine most of the night.  I'm sure there is some mathematical way to 
prove this, and it could certainly be checked more efficiently than with 
a Python program - but my computer has more spare time than me!
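
For readers who want to try this kind of check themselves, the following is a minimal sketch of one way it could be written. It is not the checker program mentioned above, whose source was not posted. The function names are invented for this example, and the field polynomial 0x11d is assumed from "The mathematics of RAID-6". The idea is that a set of generators tolerates k failures only if every k-by-k submatrix of the coefficient matrix (rows are parities, columns are data disks) is non-singular, for every k up to the number of parities.

```python
from itertools import combinations

POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed RAID-6 field polynomial)


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result


def gf_pow(a, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    """a^254 is the inverse of a non-zero element, because a^255 == 1."""
    return gf_pow(a, 254)


def is_singular(m):
    """Gaussian elimination over GF(2^8); True if the square matrix is singular."""
    m = [row[:] for row in m]
    n = len(m)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            return True
        m[col], m[pivot] = m[pivot], m[col]
        inv = gf_inv(m[col][col])
        for r in range(col + 1, n):
            factor = gf_mul(m[r][col], inv)
            if factor:
                for c in range(col, n):
                    m[r][c] ^= gf_mul(factor, m[col][c])
    return False


def parity_scheme_is_valid(generators, n_data):
    """Brute-force check that every failure pattern is recoverable."""
    coeff = [[gf_pow(g, i) for i in range(n_data)] for g in generators]
    for k in range(1, len(generators) + 1):
        for rows in combinations(range(len(generators)), k):
            for cols in combinations(range(n_data), k):
                sub = [[coeff[r][c] for c in cols] for r in rows]
                if is_singular(sub):
                    return False
    return True
```

Calling parity_scheme_is_valid((1, 2, 4), n) for increasing n repeats the triple-parity check on a small scale. The cost grows roughly with the cube of n for three parities, so a brute-force run over a couple of hundred disks takes a long time in pure Python.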


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Backup Server RAID Array Event Notification
From: NeilBrown @ 2011-06-09 21:14 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Roman Mamedov', linux-raid
In-Reply-To: <12.BD.20202.F8811FD4@cdptpa-omtalb.mail.rr.com>

On Thu, 9 Jun 2011 14:01:34 -0500 "Leslie Rhorer" <lrhorer@satx.rr.com> wrote:

> > -----Original Message-----
> > From: Roman Mamedov [mailto:rm@romanrm.ru]
> > Sent: Thursday, June 09, 2011 1:47 PM
> > To: lrhorer@satx.rr.com
> > Cc: linux-raid@vger.kernel.org
> > Subject: Re: Backup Server RAID Array Event Notification
> > 
> > On Thu, 9 Jun 2011 13:35:59 -0500
> > "Leslie Rhorer" <lrhorer@satx.rr.com> wrote:
> > 
> > >
> > > 	After I created a pair of two member RAID1 arrays and then added
> > > them as members to a RAID6 array, I am now getting messages similar to
> > > the following, complaining of "Wrong-level" issues.  When I check the
> > > RAID6 array, however, it is clean and both RAID1 members are still
> > > there.  When I check both RAID1 arrays, they show clean with no events.
> > > I am running a compare between all the data on this machine and its
> > > mirror (this is a backup machine).  So far everything looks good.  What
> > > does this imply?  Is there something about which I should be worried?
> > 
> > You said RAID1 twice, and your mdadm --detail doesn't agree with you and
> > says "raid0" twice. Maybe you mistakenly used RAID1 instead of RAID0
> > somewhere else as well, and the WrongLevel message is trying to tell you
> > that?
> 
> 	No, that was just a typo.  (OK, three typos) I meant "RAID0".  The
> RAID0 members are all 1T drives.  The RAID6 array is made of 1.5T members.
> In order to use the 1T drives on the RAID6 array, I have to combine them
> into 2T arrays, which then can be used as members of the RAID6 array.  If
> md10 and md11 were RAID1 arrays, they would only be 1T in extent, and could
> not be members of md0.
> 

"mdadm --monitor" does not monitor RAID0 or Linear arrays.  There is nothing
to see.  Nothing can fail, they don't rebuild, they are really just AID, not
RAID.

So if it thinks that it was asked to monitor a RAID0 it pretends that it has
disappeared with reason "Wrong Level".
So if you explicitly ask it to monitor a RAID0, it won't and it will tell you
why.

If you only implicitly ask with e.g. "mdadm --monitor --scan" with a RAID0
listing in mdadm.conf it probably shouldn't give the message as it might be
confusing... but it does.
Or maybe the message is just confusing and I should change it.

Or something.

NeilBrown


* Re: Triple-parity raid6
From: David Brown @ 2011-06-09 19:19 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20110609220438.26336b27@notabene.brown>

On 09/06/11 14:04, NeilBrown wrote:
> On Thu, 09 Jun 2011 13:32:59 +0200 David Brown<david@westcontrol.com>  wrote:
>
>> On 09/06/2011 03:49, NeilBrown wrote:
>>> On Thu, 09 Jun 2011 02:01:06 +0200 David Brown<david.brown@hesbynett.no>
>>> wrote:
>>>
>>>> Has anyone considered triple-parity raid6 ?  As far as I can see, it
>>>> should not be significantly harder than normal raid6 - either  to
>>>> implement, or for the processor at run-time.  Once you have the GF(2⁸)
>>>> field arithmetic in place for raid6, it's just a matter of making
>>>> another parity block in the same way but using a different generator:
>>>>
>>>> P = D_0 + D_1 + D_2 + .. + D_(n-1)
>>>> Q = D_0 + g.D_1 + g².D_2 + .. + g^(n-1).D_(n-1)
>>>> R = D_0 + h.D_1 + h².D_2 + .. + h^(n-1).D_(n-1)
>>>>
>>>> The raid6 implementation in mdraid uses g = 0x02 to generate the second
>>>> parity (based on "The mathematics of RAID-6" - I haven't checked the
>>>> source code).  You can make a third parity using h = 0x04 and then get a
>>>> redundancy of 3 disks.  (Note - I haven't yet confirmed that this is
>>>> valid for more than 100 data disks - I need to make my checker program
>>>> more efficient first.)
>>>>
>>>> Rebuilding a disk, or running in degraded mode, is just an obvious
>>>> extension to the current raid6 algorithms.  If you are missing three
>>>> data blocks, the maths looks hard to start with - but if you express the
>>>> equations as a set of linear equations and use standard matrix inversion
>>>> techniques, it should not be hard to implement.  You only need to do
>>>> this inversion once when you find that one or more disks have failed -
>>>> then you pre-compute the multiplication tables in the same way as is
>>>> done for raid6 today.
>>>>
>>>> In normal use, calculating the R parity is no more demanding than
>>>> calculating the Q parity.  And most rebuilds or degraded situations will
>>>> only involve a single disk, and the data can thus be re-constructed
>>>> using the P parity just like raid5 or two-parity raid6.
>>>>
>>>>
>>>> I'm sure there are situations where triple-parity raid6 would be
>>>> appealing - it has already been implemented in ZFS, and it is only a
>>>> matter of time before two-parity raid6 has a real probability of hitting
>>>> an unrecoverable read error during a rebuild.
>>>>
>>>>
>>>> And of course, there is no particular reason to stop at three parity
>>>> blocks - the maths can easily be generalised.  1, 2, 4 and 8 can be used
>>>> as generators for quad-parity (checked up to 60 disks), and adding 16
>>>> gives you quintuple parity (checked up to 30 disks) - but that's maybe
>>>> getting a bit paranoid.
>>>>
>>>>
>>>> ref.:
>>>>
>>>> <http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
>>>> <http://blogs.oracle.com/ahl/entry/acm_triple_parity_raid>
>>>> <http://queue.acm.org/detail.cfm?id=1670144>
>>>> <http://blogs.oracle.com/ahl/entry/triple_parity_raid_z>
>>>>
>>>
>>>    -ENOPATCH  :-)
>>>
>>> I have a series of patches nearly ready which removes a lot of the remaining
>>> duplication in raid5.c between raid5 and raid6 paths.  So there will be
>>> relatively few places where RAID5 and RAID6 do different things - only the
>>> places where they *must* do different things.
>>> After that, adding a new level or layout which has 'max_degraded == 3' would
>>> be quite easy.
>>> The most difficult part would be the enhancements to libraid6 to generate the
>>> new 'syndrome', and to handle the different recovery possibilities.
>>>
>>> So if you're not otherwise busy this weekend, a patch would be nice :-)
>>>
>>
>> I'm not going to promise any patches, but maybe I can help with the
>> maths.  You say the difficult part is the syndrome calculations and
>> recovery - I've got these bits figured out on paper and some
>> quick-and-dirty python test code.  On the other hand, I don't really
>> want to get into the md kernel code, or the mdadm code - I haven't done
>> Linux kernel development before (I mostly program 8-bit microcontrollers
>> - when I code on Linux, I use Python), and I fear it would take me a
>> long time to get up to speed.
>>
>> However, if the parity generation and recovery is neatly separated into
>> a libraid6 library, the whole thing becomes much more tractable from my
>> viewpoint.  Since I am new to this, can you tell me where I should get
>> the current libraid6 code?  I'm sure google will find some sources for
>> me, but I'd like to make sure I start with whatever version /you/ have.
>>
>
> You can see the current kernel code at:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD
>
>
> int.uc is the generic C code which 'unroll.awk' processes to make various
> versions that unroll the loops different amounts to work with CPUs with
> different numbers of registers.
> Then there is sse1, sse2, altivec which provide the same functionality in
> assembler which is optimised for various processors.
>
> And 'recov' has the smarts for doing the reverse calculation when 2 data
> blocks, or 1 data and P are missing.
>
> Even if you don't feel up to implementing everything, a start might be
> useful.  You never know when someone might jump up and offer to help.
>
> NeilBrown

Monday is a holiday here in Norway, so I've got a long weekend.  I 
should get at least /some/ time to have a look at libraid6!

mvh.,

David



* Re: Possible to use multiple disk to bypass I/O wait?
From: Steve Thompson @ 2011-06-09 19:06 UTC (permalink / raw)
  To: CentOS mailing list; +Cc: linux-raid
In-Reply-To: <BANLkTimFOaJoMnwid1F+ghVwkBgJi2FymQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, 9 Jun 2011, Emmanuel Noobadmin wrote:

> I'm trying to resolve an I/O problem on a CentOS 5.6 server. The
> process basically scans through Maildirs, checking for space usage and
> quota. Because there are hundred odd user folders and several 10s of
> thousands of small files, this sends the I/O wait % way high. The
> server hits a very high load level and stops responding to other
> requests until the crawl is done.

If the server is reduced to a crawl, it's possible that you are hitting 
the dirty_ratio limit due to writes and the server has entered synchronous 
I/O mode. As others have mentioned, setting noatime could have a 
significant effect, especially if there are many files and the server 
doesn't have much memory. You can try increasing dirty_ratio to see if it 
has an effect, eg:

 	# sysctl vm.dirty_ratio
 	# sysctl -w vm.dirty_ratio=50
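
If you would rather inspect the value from a script, a minimal Python sketch follows. It assumes the usual /proc/sys/vm layout on Linux. The function names are made up for this example, and the byte figure is only a rough estimate, because the kernel applies the percentage to available (free plus reclaimable) memory rather than to installed RAM.

```python
from pathlib import Path


def read_vm_setting(name, root="/proc/sys/vm"):
    """Return an integer vm.* sysctl value, e.g. name="dirty_ratio"."""
    return int(Path(root, name).read_text().strip())


def dirty_limit_bytes(available_bytes, dirty_ratio_percent):
    """Rough amount of dirty data at which writers are forced to write out."""
    return available_bytes * dirty_ratio_percent // 100
```

For example, read_vm_setting("dirty_ratio") returns the same number that "sysctl vm.dirty_ratio" prints, and dirty_limit_bytes(2 * 1024**3, 50) estimates the threshold for roughly 2 GiB of available memory.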

Steve


* RE: Backup Server RAID Array Event Notification
From: Leslie Rhorer @ 2011-06-09 19:01 UTC (permalink / raw)
  To: 'Roman Mamedov'; +Cc: linux-raid
In-Reply-To: <20110610004655.66e259ad@natsu>

> -----Original Message-----
> From: Roman Mamedov [mailto:rm@romanrm.ru]
> Sent: Thursday, June 09, 2011 1:47 PM
> To: lrhorer@satx.rr.com
> Cc: linux-raid@vger.kernel.org
> Subject: Re: Backup Server RAID Array Event Notification
> 
> On Thu, 9 Jun 2011 13:35:59 -0500
> "Leslie Rhorer" <lrhorer@satx.rr.com> wrote:
> 
> >
> > 	After I created a pair of two member RAID1 arrays and then added
> > them as members to a RAID6 array, I am now getting messages similar to
> > the following, complaining of "Wrong-level" issues.  When I check the
> > RAID6 array, however, it is clean and both RAID1 members are still
> > there.  When I check both RAID1 arrays, they show clean with no events.
> > I am running a compare between all the data on this machine and its
> > mirror (this is a backup machine).  So far everything looks good.  What
> > does this imply?  Is there something about which I should be worried?
> 
> You said RAID1 twice, and your mdadm --detail doesn't agree with you and
> says "raid0" twice. Maybe you mistakenly used RAID1 instead of RAID0
> somewhere else as well, and the WrongLevel message is trying to tell you
> that?

	No, that was just a typo.  (OK, three typos) I meant "RAID0".  The
RAID0 members are all 1T drives.  The RAID6 array is made of 1.5T members.
In order to use the 1T drives on the RAID6 array, I have to combine them
into 2T arrays, which then can be used as members of the RAID6 array.  If
md10 and md11 were RAID1 arrays, they would only be 1T in extent, and could
not be members of md0.
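
The size argument can be checked with a little arithmetic. The sketch below is only an illustration: the function names are invented, sizes are in KiB as printed by "mdadm -D", and the size of a single 1T drive is assumed to be half of the reported RAID0 array size (ignoring metadata and chunk rounding).

```python
RAID6_USED_DEV_SIZE = 1465137152   # "Used Dev Size" of md0, in KiB
RAID0_ARRAY_SIZE = 1953521664      # "Array Size" of md10 and md11, in KiB
ONE_TB_DRIVE = RAID0_ARRAY_SIZE // 2   # assumption: two equal members


def raid0_capacity(member_sizes):
    """Striping adds the members together (rounding ignored)."""
    return sum(member_sizes)


def raid1_capacity(member_sizes):
    """A mirror is only as large as its smallest member."""
    return min(member_sizes)


def raid6_capacity(n_devices, used_dev_size):
    """Two devices' worth of space is consumed by P and Q."""
    return (n_devices - 2) * used_dev_size


def can_be_member(candidate_size, used_dev_size):
    """A device must be at least as large as the array's per-device size."""
    return candidate_size >= used_dev_size
```

With these numbers a RAID0 pair of 1T drives is large enough to join md0, a RAID1 pair is not, and raid6_capacity(12, RAID6_USED_DEV_SIZE) gives 14651371520 KiB, which matches the "Array Size" that mdadm reports for md0.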



* Re: Backup Server RAID Array Event Notification
From: Roman Mamedov @ 2011-06-09 18:46 UTC (permalink / raw)
  To: lrhorer; +Cc: linux-raid
In-Reply-To: <9C.33.00666.09211FD4@cdptpa-omtalb.mail.rr.com>


On Thu, 9 Jun 2011 13:35:59 -0500
"Leslie Rhorer" <lrhorer@satx.rr.com> wrote:

> 
> 	After I created a pair of two member RAID1 arrays and then added
> them as members to a RAID6 array, I am now getting messages similar to the
> following, complaining of "Wrong-level" issues.  When I check the RAID6
> array, however, it is clean and both RAID1 members are still there.  When I
> check both RAID1 arrays, they show clean with no events.  I am running a
> compare between all the data on this machine and its mirror (this is a
> backup machine).  So far everything looks good.  What does this imply?  Is
> there something about which I should be worried?

You said RAID1 twice, and your mdadm --detail doesn't agree with you and says
"raid0" twice. Maybe you mistakenly used RAID1 instead of RAID0 somewhere
else as well, and the WrongLevel message is trying to tell you that?

-- 
With respect,
Roman


* FW: Backup Server RAID Array Event Notification
From: Leslie Rhorer @ 2011-06-09 18:35 UTC (permalink / raw)
  To: linux-raid


	After I created a pair of two member RAID1 arrays and then added
them as members to a RAID6 array, I am now getting messages similar to the
following, complaining of "Wrong-level" issues.  When I check the RAID6
array, however, it is clean and both RAID1 members are still there.  When I
check both RAID1 arrays, they show clean with no events.  I am running a
compare between all the data on this machine and its mirror (this is a
backup machine).  So far everything looks good.  What does this imply?  Is
there something about which I should be worried?

-----Original Message-----
From: mdadm_monitor@satx.rr.com [mailto:mdadm_monitor@satx.rr.com] 
Sent: Thursday, June 09, 2011 8:04 AM
To: leslie.rhorer@twtelecom.com; lrhorer@satx.rr.com
Subject: Backup Server RAID Array Event Notification

DeviceDisappeared /dev/md10 Wrong-Level
<message ends>

From mdadm:
Backup:~# mdadm -D /dev/md0

/dev/md0:
        Version : 1.2
  Creation Time : Mon May 31 16:23:10 2010
     Raid Level : raid6
     Array Size : 14651371520 (13972.64 GiB 15003.00 GB)
  Used Dev Size : 1465137152 (1397.26 GiB 1500.30 GB)
   Raid Devices : 12
  Total Devices : 12
    Persistence : Superblock is persistent

    Update Time : Thu Jun  9 13:30:54 2011
          State : active
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 1024K

           Name : Backup:0  (local to host Backup)
           UUID : 431244d6:45d9635a:e88b3de5:92f30255
         Events : 436289

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf
       6       8       96        6      active sync   /dev/sdg
       7       8      112        7      active sync   /dev/sdh
       8       8      128        8      active sync   /dev/sdi
      10       8      144        9      active sync   /dev/sdj
      12       9       11       10      active sync   /dev/md11
      11       9       10       11      active sync   /dev/md10
Backup:~# mdadm -D /dev/md10
/dev/md10:
        Version : 1.2
  Creation Time : Wed Jun  8 00:08:16 2011
     Raid Level : raid0
     Array Size : 1953521664 (1863.02 GiB 2000.41 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Jun  8 00:08:16 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 1024K

           Name : Backup:10  (local to host Backup)
           UUID : fa1ed617:d80525c4:1df692e8:0116406d
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8      192        0      active sync   /dev/sdm
       1       8      208        1      active sync   /dev/sdn
Backup:~# mdadm -D /dev/md11
/dev/md11:
        Version : 1.2
  Creation Time : Wed Jun  8 00:08:38 2011
     Raid Level : raid0
     Array Size : 1953521664 (1863.02 GiB 2000.41 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Jun  8 00:08:38 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 1024K

           Name : Backup:11  (local to host Backup)
           UUID : 1ac704ee:8f501b33:4caee409:e384eeec
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8      224        0      active sync   /dev/sdo
       1       8      240        1      active sync   /dev/sdp




* Re: Why move all map_sg/unmap_sg for slave channel to its client?
From: Dan Williams @ 2011-06-09 18:28 UTC (permalink / raw)
  To: Linus Walleij
  Cc: viresh kumar, Koul, Vinod, linux-kernel@vger.kernel.org, anemo,
	Shiraz HASHIM, Armando VISCONTI, Bhupesh SHARMA, linux-raid
In-Reply-To: <BANLkTik9kZjJs5H2kvM+xAtd_CR5NA0dDw@mail.gmail.com>

On Thu, Jun 9, 2011 at 2:38 AM, Linus Walleij <linus.walleij@linaro.org> wrote:
> On Thu, Jun 9, 2011 at 8:54 AM, viresh kumar <viresh.kumar@st.com> wrote:
>
>> I thought map_sg/unmap_sg for slave channels will be handled according
>> to the flags passed in prep_slave_sg(). But then i found following patch:
>> (...)
>> I don't have much knowledge about that discussion, but i think this should be left
>> configurable.
>> If the client wants to control map/unmap then it can simply pass
>> DMA_COMPL_SKIP_DEST_UNMAP | DMA_COMPL_SKIP_SRC_UNMAP in flags. I didn't wanted to
>> skip this in my driver and so i don't pass them.
>
> What if the same driver is used on many different platforms like say
> drivers/tty/serial/amba-pl011.c, and some of the platforms using it
> has DMA engines that does not implement mapping/unmapping of
> the passed sglist?
>
> In that case I think you have to modify all drivers in drivers/dma/*
> to do this mapping, and then you could just make it a required behaviour
> and skip the flags altogether.
>
> But apparently that approach was blocked at one point so let's see
> what the others say.

My problem with automatic unmapping support is that the dma-driver
really does not have a chance to get it right except for the trivially
straightforward cases.  One need only look at the current bustage of
raid5 acceleration with respect to overlapping mappings and arm v6.
The dma-driver just knows how to perform "this" operation on "this"
dma address.  It does not know the lifetime of the mapping, or even if
it has the actual dma handle for unmapping versus an offset

For the raid case I've currently convinced myself that the raid client
needs to get directly involved in dma mapping management, rather than
teach all dma drivers a language of how to unmap and when.  Not only
will this fix the overlapping, but it also eliminates the need to map
and remap because the raid client knows the lifetime of a stripe_head
while the driver only knows the lifetime of a given stripe operation.

For slave-dma maybe there is a lot of common un-mapping logic that can
be reused, but I think that comes from a separate smart library that
understands the dma mapping lifetimes of a given class of clients.
Leave the dma-drivers to just be dumb operators on anonymous dma
addresses.

--
Dan


* [PATCH v2] md/raid10: share pages between read and write bio's during recovery
From: Namhyung Kim @ 2011-06-09 17:31 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

When performing a recovery, only the first 2 slots in r10_bio are in use,
for read and write respectively. However, the pages allocated for the write
bio are never used; they are simply replaced with the read bio's pages when
the read completes.

Get rid of those unused pages and share read pages properly.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/raid10.c |   24 +++++++++++++-----------
 1 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a53779ffdf89..dea73bdb99b8 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -123,7 +123,15 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 	for (j = 0 ; j < nalloc; j++) {
 		bio = r10_bio->devs[j].bio;
 		for (i = 0; i < RESYNC_PAGES; i++) {
-			page = alloc_page(gfp_flags);
+			if (j == 1 && !test_bit(MD_RECOVERY_SYNC,
+						&conf->mddev->recovery)) {
+				/* we can share bv_page's during recovery */
+				struct bio *rbio = r10_bio->devs[0].bio;
+				page = rbio->bi_io_vec[i].bv_page;
+				get_page(page);
+			} else {
+				page = alloc_page(gfp_flags);
+			}
 			if (unlikely(!page))
 				goto out_free_pages;
 
@@ -1360,20 +1368,14 @@ done:
 static void recovery_request_write(mddev_t *mddev, r10bio_t *r10_bio)
 {
 	conf_t *conf = mddev->private;
-	int i, d;
-	struct bio *bio, *wbio;
-
+	int d;
+	struct bio *wbio;
 
-	/* move the pages across to the second bio
+	/*
+	 * share the pages with the first bio
 	 * and submit the write request
 	 */
-	bio = r10_bio->devs[0].bio;
 	wbio = r10_bio->devs[1].bio;
-	for (i=0; i < wbio->bi_vcnt; i++) {
-		struct page *p = bio->bi_io_vec[i].bv_page;
-		bio->bi_io_vec[i].bv_page = wbio->bi_io_vec[i].bv_page;
-		wbio->bi_io_vec[i].bv_page = p;
-	}
 	d = r10_bio->devs[1].devnum;
 
 	atomic_inc(&conf->mirrors[d].rdev->nr_pending);
-- 
1.7.5.2



* [PATCH 4/4] imsm: FIX: Disable automatic metadata rollback for broken reshape
From: Adam Kwolek @ 2011-06-09 16:29 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer
In-Reply-To: <20110609162050.690.87261.stgit@gklab-128-013.igk.intel.com>

mdmon cannot roll back metadata changes automatically.
Doing so can break the reshape process: if a reshape is interrupted, the
user will not be able to deal with the broken reshape because the
information about the reshape geometry is lost.

mdadm (the process that invokes the reshape) doesn't perform any rollback,
to allow for user action. mdmon should not do this either unless it knows
for sure it is safe. Such knowledge is not available for automatic rollback.

Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
---

 super-intel.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 8bfe40a..5e8b834 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -5865,14 +5865,17 @@ static int imsm_set_array_state(struct active_array *a, int consistent)
 		} else {
 			if (a->last_checkpoint == 0 && a->prev_action == reshape) {
 				/* for some reason we aborted the reshape.
-				 * Better clean up
-				 */
+				 *
+				 * disable automatic metadata rollback
+				 * user action is required to recover process
+				 *
 				struct imsm_map *map2 = get_imsm_map(dev, 1);
 				dev->vol.migr_state = 0;
 				dev->vol.migr_type = 0;
 				dev->vol.curr_migr_unit = 0;
 				memcpy(map, map2, sizeof_imsm_map(map2));
 				super->updates_pending++;
+				*/
 			}
 			if (a->last_checkpoint >= a->info.component_size) {
 				unsigned long long array_blocks;



* [PATCH 3/4] imsm: FIX: Use function to obtain array layout
From: Adam Kwolek @ 2011-06-09 16:29 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer
In-Reply-To: <20110609162050.690.87261.stgit@gklab-128-013.igk.intel.com>

Function imsm_level_to_layout() should be used to get the array layout.

Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
---

 super-intel.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index e094b85..8bfe40a 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -7733,8 +7733,7 @@ int save_backup_imsm(struct supertype *st,
 	if (open_backup_targets(info, new_disks, targets))
 		goto abort;
 
-	if (map_dest->raid_level != 0)
-		dest_layout = ALGORITHM_LEFT_ASYMMETRIC;
+	dest_layout = imsm_level_to_layout(map_dest->raid_level);
 	dest_chunk = __le16_to_cpu(map_dest->blocks_per_strip) * 512;
 
 	if (restore_stripes(targets, /* list of dest devices */
@@ -8781,8 +8780,7 @@ static int imsm_manage_reshape(
 	}
 
 	max_position = sra->component_size * ndata;
-	if (map_src->raid_level != 0)
-		source_layout = ALGORITHM_LEFT_ASYMMETRIC;
+	source_layout = imsm_level_to_layout(map_src->raid_level);
 
 	while (__le32_to_cpu(migr_rec->curr_migr_unit) <
 	       __le32_to_cpu(migr_rec->num_migr_units)) {



* [PATCH 2/4] FIX: Cannot create volume
From: Adam Kwolek @ 2011-06-09 16:29 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer
In-Reply-To: <20110609162050.690.87261.stgit@gklab-128-013.igk.intel.com>

getinfo_super() can clear the entire 'inf' structure before filling it with
new information. The disk number, which is required later, is lost.

Restore the disk number information after the getinfo_super() call.

Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
---

 Create.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Create.c b/Create.c
index 7b4d0fe..d01dea7 100644
--- a/Create.c
+++ b/Create.c
@@ -805,7 +805,6 @@ int Create(struct supertype *st, char *mddev,
 			switch(pass) {
 			case 1:
 				*inf = info;
-
 				inf->disk.number = dnum;
 				inf->disk.raid_disk = dnum;
 				if (inf->disk.raid_disk < raiddisks)
@@ -856,12 +855,13 @@ int Create(struct supertype *st, char *mddev,
 					/* getinfo_super might have lost these ... */
 					inf->disk.major = major(stb.st_rdev);
 					inf->disk.minor = minor(stb.st_rdev);
+					inf->disk.number = dnum;
+					inf->disk.raid_disk = dnum;
 				}
 				break;
 			case 2:
 				inf->errors = 0;
 				rv = 0;
-
 				rv = add_disk(mdfd, st, &info, inf);
 
 				if (rv) {



* [PATCH 1/4] imsm: FIX: Cannot create volume
From: Adam Kwolek @ 2011-06-09 16:29 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer
In-Reply-To: <20110609162050.690.87261.stgit@gklab-128-013.igk.intel.com>

getinfo_super_imsm_volume() clears the entire 'info' structure before filling
it with new information. The disk number and raid_disk can be required later
by the caller, but they are lost.

Restore the disk number information in the getinfo_super_imsm_volume() call.

Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
---

 super-intel.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 5c840ec..e094b85 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -2075,11 +2075,18 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	unsigned int component_size_alligment;
 	int map_disks = info->array.raid_disks;
 
-	memset(info, 0, sizeof(*info));
 	if (prev_map)
 		map_to_analyse = prev_map;
 
 	dl = super->disks;
+	while (dl) {
+		if (dl->index == info->disk.number)
+			break;
+		dl = dl->next;
+	}
+	if (!dl)
+		dl = super->disks;
+	memset(info, 0, sizeof(*info));
 
 	info->container_member	  = super->current_vol;
 	info->array.raid_disks    = map->num_members;
@@ -2147,6 +2154,8 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	if (dl) {
 		info->disk.major = dl->major;
 		info->disk.minor = dl->minor;
+		info->disk.number = dl->index;
+		info->disk.raid_disk = dl->index;
 	}
 
 	info->data_offset	  = __le32_to_cpu(map_to_analyse->pba_of_lba0);



* [PATCH 0/4] IMSM Checkpointing Bug Fix Series (2)
From: Adam Kwolek @ 2011-06-09 16:29 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer

The following series contains 2 fixes for IMSM Checkpointing:
        1. imsm: FIX: Disable automatic metadata rollback for broken reshape
        2. imsm: FIX: Use function to obtain array layout

It also enables IMSM array creation:
        1. FIX: Cannot create volume
        2. imsm: FIX: Cannot create volume
This fix consists of 2 patches, but either one of them alone helps.
I think both should be applied to mdadm.

Checkpointing status:
	- Raid0: OK
	- Raid5: Data corruption problem when stripes are restored from backup (work in progress)


BR
Adam

---

Adam Kwolek (4):
      imsm: FIX: Disable automatic metadata rollback for broken reshape
      imsm: FIX: Use function to obtain array layout
      FIX: Cannot create volume
      imsm: FIX: Cannot create volume


 Create.c      |    4 ++--
 super-intel.c |   24 +++++++++++++++++-------
 2 files changed, 19 insertions(+), 9 deletions(-)

-- 
Signature

^ permalink raw reply

* Re: Possible to use multiple disk to bypass I/O wait?
From: Emmanuel Noobadmin @ 2011-06-09 16:15 UTC (permalink / raw)
  To: Mathias Burén; +Cc: CentOS mailing list, linux-raid
In-Reply-To: <BANLkTikd-LEtAt_3n6bC7nTd2BZqZfNQLA@mail.gmail.com>

On 6/9/11, Mathias Burén <mathias.buren@gmail.com> wrote:
> The first thing that comes to my mind: Have you tried another IO scheduler?

and the first thing that came to this noob's mind was: Wait, you mean
there's actually more than one? AND I get to choose?

I'll probably be experimenting with deadline and anticipatory, since
the I/O wait seems to be due to the disk running back and forth trying
to serve the file scan as well as legitimate read requests, so having
that small wait for reads in the same area sounds like it would help.
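
For anyone wanting to check which scheduler is currently active before experimenting, here is a minimal Python sketch. It assumes the usual sysfs layout, where /sys/block/<dev>/queue/scheduler lists the available schedulers with the active one in square brackets; the device name "sda" and both function names are only illustrative.

```python
def parse_schedulers(text):
    """Split a line such as 'noop anticipatory deadline [cfq]'.

    Returns (active, available), where active is the bracketed name
    (or None if nothing is bracketed) and available lists every name.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_schedulers(device="sda"):
    """Read the scheduler list for one block device (device name is an assumption)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_schedulers(f.read())


if __name__ == "__main__":
    # Parsing a sample line, so this runs even without the sysfs file.
    print(parse_schedulers("noop anticipatory deadline [cfq]"))
```

Writing one of the listed names back to the same file (as root) switches the scheduler for that device only, so it can be tried on the busy disk without touching the others.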
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [BUG?] RAID not properly assembled w/ kernel 2.6.39.x
From: Phil Turmel @ 2011-06-09 14:25 UTC (permalink / raw)
  To: Matthias Dahl; +Cc: linux-raid
In-Reply-To: <201106091521.54769.ml_linux_raid@mortal-soul.de>

Hi Matthias,

On 06/09/2011 09:21 AM, Matthias Dahl wrote:
> Hi all.
> 
> I tried updating to 2.6.39.1 from 2.6.38.2 and failed due to some md raid
> issues I wasn't able to solve even after hours.   I hope someone can help
> me out- I'd really appreciate it.

[...]

> I'm sorry for this chaotic explanation but as you can see it's quite hard
> to explain. :-(

Well, you are swimming against the tide:  the kernel has been steadily losing consistent device names, for what the kernel devs feel are good and proper reasons.  The recommended alternative is to use filesystem labels, or even better, UUIDs in your fstab.  Then you can go back to no mdadm.conf file in your initramfs, and carry on.

(*I'm* not going to command the tide to stop.)

> Again, this works fine with any kernel prior to 2.6.39 and adding a  conf
> to the initramfs fixes  mostly anything but the additional partition prob
> which has me worrying that something else might be wrong.
> 
> Like said earlier, I'd really appreciate anyone shedding  some  light  on
> this. Thanks a lot in advance...

HTH,

Phil

^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: David Brown @ 2011-06-09 13:42 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <4DF0C81E.4020903@oldum.net>

On 09/06/2011 15:18, Nikolay Kichukov wrote:
> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>
>
>
> On 06/08/2011 01:33 PM, David Brown wrote:
>
>> So you install your RAID10 (or RAID6, if you prefer) system, and
>> make sure you keep backups.  And if you /do/ get hit by a double
>> disk failure in the wrong place, you spend the day restoring
>> everything from the backups.  When management complain that a 24
>> hour downtime doesn't fit with their 99.99% uptime expectations,
>> you remind them that this is amortized over the next 27 years...
>
> Hi David,
>
> nice one ;-) Did you actually calculate 24 hours for those 99.99%
> within 27 years? ;-)
>

27.4 years is 10,000 days - so you can have 99.99% uptime with a 24-hour 
failure if you run for the rest of the 27.4 years without a hitch.  Of 
course, by the same logic you can claim 6 nine's uptime with a week's 
failure - as long as there are no more problems for the next 20,000 years...

:-)
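
The arithmetic behind those two figures can be checked with a short Python sketch. The function name and the 365.25-day year are assumptions made for illustration only.

```python
DAYS_PER_YEAR = 365.25  # assumed average year length


def run_days(downtime_days, allowed_downtime_fraction):
    """Total run length (in days) over which the given downtime
    still stays within the allowed downtime fraction."""
    return downtime_days / allowed_downtime_fraction


# 99.99% uptime allows a downtime fraction of 1e-4; six nines allows 1e-6.
four_nines_days = run_days(1, 1e-4)   # one day (24 hours) of downtime
six_nines_days = run_days(7, 1e-6)    # one week of downtime

print(round(four_nines_days), "days =",
      round(four_nines_days / DAYS_PER_YEAR, 1), "years")
print(round(six_nines_days / DAYS_PER_YEAR), "years")
```

The first line gives the 10,000 days (27.4 years) quoted above; the second comes out at a little over 19,000 years, which is where the "next 20,000 years" comes from.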


> Cheers, - -Nik


^ permalink raw reply

* Re:
From: Phil Turmel @ 2011-06-09 13:39 UTC (permalink / raw)
  To: Dragon; +Cc: linux-raid
In-Reply-To: <20110609121641.298530@gmx.net>

On 06/09/2011 08:16 AM, Dragon wrote:
> Yes if all things get back to normal i will change to raid6. that was my idea for the future too.
> here the result of the script:
> 
> ./lsdrv
> **Warning** The following utility(ies) failed to execute:
>   pvs
>   lvs
> Some information may be missing.
> 
> PCI [pata_atiixp] 00:14.1 IDE interface: ATI Technologies Inc SB700/SB800 IDE Controller
>  ├─scsi 0:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1WZ401747}
>  │  └─sda: [8:0] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  │     └─md0: [9:0] Empty/Unknown 0.00k
>  ├─scsi 0:0:1:0 ATA SAMSUNG HD154UI {S1XWJ1WZ405098}
>  │  └─sdb: [8:16] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  └─scsi 1:0:0:0 ATA SAMSUNG SV2044D {0244J1BN626842}
>     └─sdc: [8:32] Partitioned (dos) 19.01g
>        ├─sdc1: [8:33] (ext3) 18.17g {6858fc38-9fee-4ab5-8135-029f305b9198}
>        │  └─Mounted as /dev/disk/by-uuid/6858fc38-9fee-4ab5-8135-029f305b9198 @ /
>        ├─sdc2: [8:34] Partitioned (dos) 1.00k
>        └─sdc5: [8:37] (swap) 854.99m {f67c7f23-e5ac-4c05-992c-a9a494687026}
> PCI [sata_mv] 02:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
>  ├─scsi 2:0:0:0 ATA SAMSUNG HD154UI {S1XWJD2Z907626}
>  │  └─sdd: [8:48] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  ├─scsi 4:0:0:0 ATA SAMSUNG HD154UI {S1XWJ90ZA03442}
>  │  └─sde: [8:64] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  ├─scsi 6:0:0:0 ATA SAMSUNG HD154UI {S1XWJ9AB200390}
>  │  └─sdf: [8:80] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  └─scsi 8:0:0:0 ATA SAMSUNG HD154UI {61833B761A63RP}
>     └─sdg: [8:96] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
> PCI [sata_promise] 04:02.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)
>  ├─scsi 3:0:0:0 ATA SAMSUNG HD154UI {S1XWJD5B201174}
>  │  └─sdh: [8:112] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  ├─scsi 5:0:0:0 ATA SAMSUNG HD154UI {S1XWJ9CB201815}
>  │  └─sdi: [8:128] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  ├─scsi 7:x:x:x [Empty]
>  └─scsi 9:0:0:0 ATA SAMSUNG HD154UI {A6311B761A3XPB}
>     └─sdj: [8:144] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
> PCI [ahci] 00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [IDE mode]
>  ├─scsi 10:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1KS915803}
>  │  └─sdk: [8:160] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  ├─scsi 11:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1KS915802}
>  │  └─sdl: [8:176] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  ├─scsi 12:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1KSC08024}
>  │  └─sdm: [8:192] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
>  └─scsi 13:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1KS915804}
>     └─sdn: [8:208] MD raid5 (13) 1.36t inactive {975d6eb2-285e-ed11-021d-f236c2d05073}
> 

Very interesting.  You've exposed a limitation of my script.  I'll have to reconsider how I extract information from members of a partially started array.

It's also clear that you are using a fast-boot kernel with parallel probing of your scsi hosts.  That's why your device names sometimes change.

/dev/sdn is definitely the holdout, though.  Notice the "(13)" where the others are "(none/13)".

Before continuing: I've assumed that "mdadm --grow -n 12" was the last major operation attempted, and that this is what put you in your current predicament.  If so, and you interrupted it, did you try to assemble the array with the --backup-file option from the shrink operation?  If you didn't, please stop the array, and retry the assemble (with all 13 devices) and the --backup-file option.  Try twice, if needed, adding "--force" the second time.

If that works, sit tight until the reshape is complete.

If that was already tried, or doesn't change the situation, here's what I recommend:

Stop the array: "mdadm -S /dev/md0"

Recreate the array "mdadm -C /dev/md0 -l 5 -n 13 -e 0.90 -c 64 --assume-clean /dev/sd{k,d,l,m,a,b,e,n,f,g,h,i,j}"

The order in {} matters! The option "--assume-clean" is vital!

You will be warned that the members appear to be part of another array.  Continue.

Do *NOT* mount the array!

Try a non-destructive fsck: "fsck -n /dev/md0"

If that has a huge number of errors, stop the array, and recreate again, swapping /dev/sdd and /dev/sdn, then repeat the fsck:

"mdadm -C /dev/md0 -l 5 -n 13 -e 0.90 -c 64 --assume-clean /dev/sd{k,n,l,m,a,b,e,d,f,g,h,i,j}"

If you get a good, or mostly good fsck, you've found the right combination, and you can try the shrink operations again.

Phil

^ permalink raw reply

* [BUG?] RAID not properly assembled w/ kernel 2.6.39.x
From: Matthias Dahl @ 2011-06-09 13:21 UTC (permalink / raw)
  To: linux-raid

Hi all.

I tried updating to 2.6.39.1 from 2.6.38.2 and failed due to some md raid
issues I wasn't able to solve even after hours.   I hope someone can help
me out- I'd really appreciate it.

I'm running two RAID1s (boot, swap) and one partitioned RAID5 (3 partitions):

dreamgate ~ # mdadm --misc -D /dev/md{0,1,_d0}
/dev/md0:
        Version : 0.90
  Creation Time : Sat Aug  8 03:09:54 2009
     Raid Level : raid1
     Array Size : 779008 (760.88 MiB 797.70 MB)
  Used Dev Size : 779008 (760.88 MiB 797.70 MB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Jun  9 14:49:31 2011
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

           UUID : c6c29c17:d2088f05:bfe78010:bc810f04
         Events : 0.74

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8        1        1      active sync   /dev/sda1
       2       8       49        2      active sync   /dev/sdd1
/dev/md1:
        Version : 0.90
  Creation Time : Sat Aug  8 03:10:33 2009
     Raid Level : raid1
     Array Size : 4000064 (3.81 GiB 4.10 GB)
  Used Dev Size : 4000064 (3.81 GiB 4.10 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Thu Jun  9 14:49:31 2011
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

           UUID : f4f3dfc8:812e35f6:bfe78010:bc810f04
         Events : 0.30

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdc2
       1       8        2        1      active sync   /dev/sda2
       2       8       50        2      active sync   /dev/sdd2
/dev/md_d0:
        Version : 1.0
  Creation Time : Sat Aug  8 03:46:31 2009
     Raid Level : raid5
     Array Size : 1943961088 (1853.91 GiB 1990.62 GB)
  Used Dev Size : 971980544 (926.95 GiB 995.31 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Thu Jun  9 14:49:32 2011
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : localhost.localdomain:d0
           UUID : fbbdaaca:b89e291b:784b7787:a4b65ebd
         Events : 5080856

    Number   Major   Minor   RaidDevice State
       0       8       35        0      active sync   /dev/sdc3
       1       8        3        1      active sync   /dev/sda3
       3       8       51        2      active sync   /dev/sdd3

This worked fine since day one actually. I'm using Gentoo x86_64 w/ mdadm
3.1.5 and an initramfs created by genkernel to start the raids and  mount
root which is on one of those RAID5 partitions.

The  initramfs  simply does a "mdadm --assemble --scan" w/o a mdadm.conf.
This worked fine with any kernel prior to 2.6.39.x. Now whatever I do, it
does not create the devices as usual like: /dev/md0 /dev/md1 /dev/md_d0pX
where X is {1..3}. It suffixes the unpartitioned devices w/ "_0" which is
due to the homehost not matching (which was never a problem) and has even
more problems with the partitioned RAID5. Those end up as md127pX in /dev
and localhost.localdomain:d0 in /dev/md.    And depending on what you do,
you even get an additional partition  p4  like  when  simply  adding  the
following mdadm.conf to the initramfs  so  there  is  no   auto-detection
necessary:

ARRAY /dev/md0 level=raid1 num-devices=3 UUID=c6c29c17:d2088f05:bfe78010:bc810f04
ARRAY /dev/md1 level=raid1 num-devices=3 UUID=f4f3dfc8:812e35f6:bfe78010:bc810f04
ARRAY /dev/md_d0 level=raid5 metadata=1.0 auto=mdp num-devices=3 UUID=fbbdaaca:b89e291b:784b7787:a4b65ebd name=localhost.localdomain:d0

I've experimented with homehost but never got to a point where everything
was simply right again.   Setting homehost to "<ignore>" changed nothing.
Changing it to "localhost.localdomain" fixed the problem with the  prefix
and the suffixes but the numbering (md127 instead of md_d0) wasn't  still
right, no partitions were created in /dev/md/ (but in /dev) and  I  ended
up with one additional partition (4).

I'm sorry for this chaotic explanation but as you can see it's quite hard
to explain. :-(

Again, this works fine with any kernel prior to 2.6.39 and adding a  conf
to the initramfs fixes  mostly anything but the additional partition prob
which has me worrying that something else might be wrong.

Like said earlier, I'd really appreciate anyone shedding  some  light  on
this. Thanks a lot in advance...

So long,
matthias.

^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: Nikolay Kichukov @ 2011-06-09 13:18 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid
In-Reply-To: <isnj71$rap$1@dough.gmane.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



On 06/08/2011 01:33 PM, David Brown wrote:

> So you install your RAID10 (or RAID6, if you prefer) system, and make sure you keep backups.  And if you /do/ get hit by
> a double disk failure in the wrong place, you spend the day restoring everything from the backups.  When management
> complain that a 24 hour downtime doesn't fit with their 99.99% uptime expectations, you remind them that this is
> amortized over the next 27 years...

Hi David,

nice one ;-) Did you actually calculate 24 hours for those 99.99% within 27 years? ;-)

Cheers,
- -Nik
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJN8MgdAAoJEDFLYVOGGjgXIiYH/0qMOkHCKTV5WeBhlPdGOpjr
RzniFUYxpVYLvHAna7DWmrUaYqGMgZWadljt2GZB90NLqhDQX0OgIKm5thGRwaLD
09x2h2zpT4XV8a78VRU63blS2jHBygCxqVkUnagCHlYVZ63Jm4qZZH0jeHJkWzPV
YjQXhGILzx8H02P1G8WDCnzg32+k8XNleatV2+441OUidnYV1019SyYDX6/5/UDh
88VMIiWOMA0RvJP4b9yGw9vV/pEx2LReAahfhRAZ3iu9sOc5kUtCjiHzghE8n2nW
oF9t4i5raS4q54tz2WGs/iDiV20gO8lsNtIjReIAAEnMFlpIZapelVyxj9HjNe0=
=Y5nu
-----END PGP SIGNATURE-----

^ permalink raw reply

* (unknown)
From: Dragon @ 2011-06-09 12:16 UTC (permalink / raw)
  To: philip; +Cc: linux-raid

Yes if all things get back to normal i will change to raid6. that was my idea for the future too.
here the result of the script:

./lsdrv
**Warning** The following utility(ies) failed to execute:
  pvs
  lvs
Some information may be missing.

PCI [pata_atiixp] 00:14.1 IDE interface: ATI Technologies Inc SB700/SB800 IDE Controller
 ├─scsi 0:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1WZ401747}
 │  └─sda: [8:0] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 │     └─md0: [9:0] Empty/Unknown 0.00k
 ├─scsi 0:0:1:0 ATA SAMSUNG HD154UI {S1XWJ1WZ405098}
 │  └─sdb: [8:16] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 └─scsi 1:0:0:0 ATA SAMSUNG SV2044D {0244J1BN626842}
    └─sdc: [8:32] Partitioned (dos) 19.01g
       ├─sdc1: [8:33] (ext3) 18.17g {6858fc38-9fee-4ab5-8135-029f305b9198}
       │  └─Mounted as /dev/disk/by-uuid/6858fc38-9fee-4ab5-8135-029f305b9198 @ /
       ├─sdc2: [8:34] Partitioned (dos) 1.00k
       └─sdc5: [8:37] (swap) 854.99m {f67c7f23-e5ac-4c05-992c-a9a494687026}
PCI [sata_mv] 02:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
 ├─scsi 2:0:0:0 ATA SAMSUNG HD154UI {S1XWJD2Z907626}
 │  └─sdd: [8:48] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 ├─scsi 4:0:0:0 ATA SAMSUNG HD154UI {S1XWJ90ZA03442}
 │  └─sde: [8:64] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 ├─scsi 6:0:0:0 ATA SAMSUNG HD154UI {S1XWJ9AB200390}
 │  └─sdf: [8:80] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 └─scsi 8:0:0:0 ATA SAMSUNG HD154UI {61833B761A63RP}
    └─sdg: [8:96] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
PCI [sata_promise] 04:02.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)
 ├─scsi 3:0:0:0 ATA SAMSUNG HD154UI {S1XWJD5B201174}
 │  └─sdh: [8:112] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 ├─scsi 5:0:0:0 ATA SAMSUNG HD154UI {S1XWJ9CB201815}
 │  └─sdi: [8:128] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 ├─scsi 7:x:x:x [Empty]
 └─scsi 9:0:0:0 ATA SAMSUNG HD154UI {A6311B761A3XPB}
    └─sdj: [8:144] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
PCI [ahci] 00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [IDE mode]
 ├─scsi 10:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1KS915803}
 │  └─sdk: [8:160] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 ├─scsi 11:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1KS915802}
 │  └─sdl: [8:176] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 ├─scsi 12:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1KSC08024}
 │  └─sdm: [8:192] MD raid5 (none/13) 1.36t md0 inactive spare {975d6eb2-285e-ed11-021d-f236c2d05073}
 └─scsi 13:0:0:0 ATA SAMSUNG HD154UI {S1XWJ1KS915804}
    └─sdn: [8:208] MD raid5 (13) 1.36t inactive {975d6eb2-285e-ed11-021d-f236c2d05073}


^ permalink raw reply

* Re: Possible to use multiple disk to bypass I/O wait?
From: Nagilum @ 2011-06-09 12:06 UTC (permalink / raw)
  To: Emmanuel Noobadmin; +Cc: CentOS mailing list, linux-raid
In-Reply-To: <BANLkTimFOaJoMnwid1F+ghVwkBgJi2FymQ@mail.gmail.com>

----- Message from centos.admin@gmail.com ---------
     Date: Thu, 9 Jun 2011 17:24:23 +0800
     From: Emmanuel Noobadmin <centos.admin@gmail.com>
  Subject: Possible to use multiple disk to bypass I/O wait?
       To: CentOS mailing list <centos@centos.org>, linux-raid  
<linux-raid@vger.kernel.org>


> I'm trying to resolve an I/O problem on a CentOS 5.6 server. The
> process basically scans through Maildirs, checking for space usage and
> quota. Because there are hundred odd user folders and several 10s of
> thousands of small files, this sends the I/O wait % way high. The
> server hits a very high load level and stops responding to other
> requests until the crawl is done.
>
> I am wondering if I add another disk and symlink the sub-directories
> to that, would that free up the server to respond to other requests
> despite the wait on that disk?

Have you tried using ionice -c 3  on the process?


----- End message from centos.admin@gmail.com -----



========================================================================
#    _  __          _ __     http://www.nagilum.org/ \n icq://69646724 #
#   / |/ /__ ____ _(_) /_ ____ _  nagilum@nagilum.org \n +491776461165 #
#  /    / _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
#           /___/     x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
========================================================================


----------------------------------------------------------------
cakebox.homeunix.net - all the machine one needs..

^ permalink raw reply

* Re: Triple-parity raid6
From: NeilBrown @ 2011-06-09 12:04 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid
In-Reply-To: <isqb2o$g0s$1@dough.gmane.org>

On Thu, 09 Jun 2011 13:32:59 +0200 David Brown <david@westcontrol.com> wrote:

> On 09/06/2011 03:49, NeilBrown wrote:
> > On Thu, 09 Jun 2011 02:01:06 +0200 David Brown<david.brown@hesbynett.no>
> > wrote:
> >
> >> Has anyone considered triple-parity raid6 ?  As far as I can see, it
> >> should not be significantly harder than normal raid6 - either  to
> >> implement, or for the processor at run-time.  Once you have the GF(2⁸)
> >> field arithmetic in place for raid6, it's just a matter of making
> >> another parity block in the same way but using a different generator:
> >>
> >> P = D_0 + D_1 + D_2 + .. + D_(n.1)
> >> Q = D_0 + g.D_1 + g².D_2 + .. + g^(n-1).D_(n.1)
> >> R = D_0 + h.D_1 + h².D_2 + .. + h^(n-1).D_(n.1)
> >>
> >> The raid6 implementation in mdraid uses g = 0x02 to generate the second
> >> parity (based on "The mathematics of RAID-6" - I haven't checked the
> >> source code).  You can make a third parity using h = 0x04 and then get a
> >> redundancy of 3 disks.  (Note - I haven't yet confirmed that this is
> >> valid for more than 100 data disks - I need to make my checker program
> >> more efficient first.)
> >>
> >> Rebuilding a disk, or running in degraded mode, is just an obvious
> >> extension to the current raid6 algorithms.  If you are missing three
> >> data blocks, the maths looks hard to start with - but if you express the
> >> equations as a set of linear equations and use standard matrix inversion
> >> techniques, it should not be hard to implement.  You only need to do
> >> this inversion once when you find that one or more disks have failed -
> >> then you pre-compute the multiplication tables in the same way as is
> >> done for raid6 today.
> >>
> >> In normal use, calculating the R parity is no more demanding than
> >> calculating the Q parity.  And most rebuilds or degraded situations will
> >> only involve a single disk, and the data can thus be re-constructed
> >> using the P parity just like raid5 or two-parity raid6.
> >>
> >>
> >> I'm sure there are situations where triple-parity raid6 would be
> >> appealing - it has already been implemented in ZFS, and it is only a
> >> matter of time before two-parity raid6 has a real probability of hitting
> >> an unrecoverable read error during a rebuild.
> >>
> >>
> >> And of course, there is no particular reason to stop at three parity
> >> blocks - the maths can easily be generalised.  1, 2, 4 and 8 can be used
> >> as generators for quad-parity (checked up to 60 disks), and adding 16
> >> gives you quintuple parity (checked up to 30 disks) - but that's maybe
> >> getting a bit paranoid.
> >>
> >>
> >> ref.:
> >>
> >> <http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
> >> <http://blogs.oracle.com/ahl/entry/acm_triple_parity_raid>
> >> <http://queue.acm.org/detail.cfm?id=1670144>
> >> <http://blogs.oracle.com/ahl/entry/triple_parity_raid_z>
> >>
> >
> >   -ENOPATCH  :-)
> >
> > I have a series of patches nearly ready which removes a lot of the remaining
> > duplication in raid5.c between raid5 and raid6 paths.  So there will be
> > relative few places where RAID5 and RAID6 do different things - only the
> > places where they *must* do different things.
> > After that, adding a new level or layout which has 'max_degraded == 3' would
> > be quite easy.
> > The most difficult part would be the enhancements to libraid6 to generate the
> > new 'syndrome', and to handle the different recovery possibilities.
> >
> > So if you're not otherwise busy this weekend, a patch would be nice :-)
> >
> 
> I'm not going to promise any patches, but maybe I can help with the 
> maths.  You say the difficult part is the syndrome calculations and 
> recovery - I've got these bits figured out on paper and some 
> quick-and-dirty python test code.  On the other hand, I don't really 
> want to get into the md kernel code, or the mdadm code - I haven't done 
> Linux kernel development before (I mostly program 8-bit microcontrollers 
> - when I code on Linux, I use Python), and I fear it would take me a 
> long time to get up to speed.
> 
> However, if the parity generation and recovery is neatly separated into 
> a libraid6 library, the whole thing becomes much more tractable from my 
> viewpoint.  Since I am new to this, can you tell me where I should get 
> the current libraid6 code?  I'm sure google will find some sources for 
> me, but I'd like to make sure I start with whatever version /you/ have.
> 
> 
> 
> 

You can see the current kernel code at:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD


int.uc is the generic C code which 'unroll.awk' processes to make various
versions that unroll the loops different amounts to work with CPUs with
different numbers of registers.
Then there is sse1, sse2, altivec which provide the same functionality in
assembler which is optimised for various processors.

And 'recov' has the smarts for doing the reverse calculation when 2 data
blocks, or 1 data and P are missing.
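
As a starting point for anyone following the thread, the P, Q and R formulas quoted earlier in this message can be sketched in a few lines of Python. This is only an illustration, not the libraid6 implementation: it works on single bytes rather than whole blocks, the function names are made up, and it assumes the field polynomial 0x11d described in "The mathematics of RAID-6" together with the generators g = 0x02 and h = 0x04 proposed above.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8), reducing by the given polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:            # overflowed 8 bits: reduce
            a ^= poly
        b >>= 1
    return result


def gf_pow(base, exponent):
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, base)
    return result


def syndromes(data, g=0x02, h=0x04):
    """Return (P, Q, R) for one byte taken from each data disk.

    P = sum of D_i, Q = sum of g^i * D_i, R = sum of h^i * D_i,
    with all arithmetic done in GF(2^8).
    """
    p = q = r = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(g, i), d)
        r ^= gf_mul(gf_pow(h, i), d)
    return p, q, r


if __name__ == "__main__":
    print(syndromes([0x01, 0x01, 0x01]))   # (1, 7, 21)
```

A real implementation would precompute the powers of g and h and process many bytes per step, which is what the unrolled and SIMD variants mentioned above do for P and Q.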

Even if you don't feel up to implementing everything, a start might be
useful.  You never know when someone might jump up and offer to help.

NeilBrown

^ permalink raw reply

* Re:
From: Phil Turmel @ 2011-06-09 12:01 UTC (permalink / raw)
  To: Dragon; +Cc: linux-raid
In-Reply-To: <20110609065059.298530@gmx.net>

Hi Dragon,

[Fixed subject line]

On 06/09/2011 02:50 AM, Dragon wrote:
> Hi Phil,
> i know that there is something odd with the raid, thats why i need help.
> No i didnt scamble the report. thats what the system output. Sorry for confusing with sdo, this is my usb disk and doesnt belong to the raid. because of the size i didnt have any backup ;(

Well, we don't know yet if your data is intact.  You might get lucky.  For what it's worth now, you should know that raid5 isn't considered safe for arrays this size.  When the array is running 12 disks again, you might want to consider using the 13th to change your array to raid6.

> I do not let the system run 24/7 and as i started at in the morning the sequence has changed.

The SCSI driver stack in linux doesn't guarantee the order the drives get named.  And custom udev scripts could massage the names further.

>  fdisk -l |grep sd
> Disk /dev/sda: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdc: 20.4 GB, 20409532416 bytes
> /dev/sdc1   *           1        2372    19053058+  83  Linux
> /dev/sdc2            2373        2481      875542+   5  Extended
> /dev/sdc5            2373        2481      875511   82  Linux swap / Solaris
> Disk /dev/sdd: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sde: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdg: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdf: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdh: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdi: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdj: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdk: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdl: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdm: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdn: 1500.3 GB, 1500301910016 bytes
> Disk /dev/sdb: 1500.3 GB, 1500301910016 bytes
> Yesterday was the system on disk sdk. now its on sdc?! the system is now and up to the evening online.
> here the actual data of the drives again:
[...]
> 
> as far as i can see, now there is no error with a missing superblock of one disk.

Well, the superblocks indicate that the array is still configured for 13 drives, two of which are missing.  One of the missing drives has been misidentified as a spare, and the other missing drive also thinks it is a spare, but has not been attached.  With your most recent listing, they are /dev/sdd and /dev/sdn.

> 
> how can i download lsdrv with "wget"? Yes the way backwards by shrinking lead to the actual problem.

wget https://github.com/pturmel/lsdrv/raw/master/lsdrv
chmod +x lsdrv
./lsdrv

Phil

^ permalink raw reply

* Re: Unstable speed or correct?
From: Pol Hallen @ 2011-06-09 11:52 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <201106091325.54782.raid1@fuckaround.org>

init 1 raid 6 performance:

http://fuckaround.org/nuvola/?p=17

iostat md0 -x 1:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

:-(((

Pol

^ permalink raw reply
