linux-raid.vger.kernel.org archive mirror
* RAID5 to RAID6 reshape?
@ 2008-02-17  3:58 Beolach
  2008-02-17 11:50 ` Peter Grandi
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Beolach @ 2008-02-17  3:58 UTC (permalink / raw)
  To: linux-raid

Hi list,

I'm a newbie to RAID, planning a home fileserver that will be pretty
much my first real experience with RAID.  What I think I'd like to do is
start w/ 3 drives in RAID5, and add drives as I run low on free space,
eventually to a total of 14 drives (the max the case can fit).  But
when I add the 5th or 6th drive, I'd like to switch from RAID5 to
RAID6 for the extra redundancy.  As I've been researching RAID
options, I've seen that RAID5 to RAID6 migration is a planned feature,
but AFAIK it isn't implemented yet, and the most recent mention I
found was a few months old.  Is it likely that RAID5 to RAID6
reshaping will be implemented in the next 12 to 18 months (my rough
guesstimate as to when I might want to migrate from RAID5 to RAID6)?
Or would I be better off starting w/ 4 drives in RAID6?

I'm also interested in hearing people's opinions about LVM / EVMS.
I'm currently planning on just using RAID w/out the higher level
volume management, as from my reading I don't think they're worth the
performance penalty, but if anyone thinks that's a horrible mistake
I'd like to know sooner rather than later.

And if anyone has comments on good hardware to consider or bad
hardware to avoid, here's what I'm currently planning on getting:
<http://secure.newegg.com/NewVersion/wishlist/PublicWishDetail.asp?WishListNumber=6134331>


TIA,
Conway S. Smith

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17  3:58 RAID5 to RAID6 reshape? Beolach
@ 2008-02-17 11:50 ` Peter Grandi
  2008-02-17 14:45   ` Conway S. Smith
  2008-02-17 13:31 ` Janek Kozicki
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2008-02-17 11:50 UTC (permalink / raw)
  To: Linux RAID

>>> On Sat, 16 Feb 2008 20:58:07 -0700, Beolach
>>> <beolach@gmail.com> said:

beolach> [ ... ] start w/ 3 drives in RAID5, and add drives as I
beolach> run low on free space, eventually to a total of 14
beolach> drives (the max the case can fit).

Like for so many other posts to this list, all that is
"syntactically" valid is not necessarily the same thing as that
which is wise.

beolach> But when I add the 5th or 6th drive, I'd like to switch
beolach> from RAID5 to RAID6 for the extra redundancy.

Again, what may be possible is not necessarily what may be wise.

In particular it seems difficult to discern what use such
arrays would be put to. There might be a bit of difference
between a giant FAT32 volume containing song lyrics files and an
XFS filesystem with a collection of 500GB tomography scans
cached from a large tape backup system.

beolach> I'm also interested in hearing people's opinions about
beolach> LVM / EVMS.

They are yellow, and taste of vanilla :-). To say something more
specific is difficult without knowing what kind of requirement
they may be expected to satisfy.

beolach> I'm currently planning on just using RAID w/out the
beolach> higher level volume management, as from my reading I
beolach> don't think they're worth the performance penalty, [
beolach> ... ]

Very amusing that someone who is planning to grow a 3 drive
RAID5 into a 14 drive RAID6 worries about the DM "performance
penalty".

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17  3:58 RAID5 to RAID6 reshape? Beolach
  2008-02-17 11:50 ` Peter Grandi
@ 2008-02-17 13:31 ` Janek Kozicki
  2008-02-17 16:18   ` Conway S. Smith
  2008-02-17 22:40   ` Mark Hahn
  2008-02-17 14:06 ` Janek Kozicki
  2008-02-18  3:43 ` Neil Brown
  3 siblings, 2 replies; 42+ messages in thread
From: Janek Kozicki @ 2008-02-17 13:31 UTC (permalink / raw)
  To: linux-raid

Beolach said:     (by the date of Sat, 16 Feb 2008 20:58:07 -0700)

> I'm also interested in hearing people's opinions about LVM / EVMS.

With LVM it will be possible for you to have several raid5 and raid6
arrays: e.g. 5 HDDs (raid6), 5 HDDs (raid6) and 4 HDDs (raid5). Here you
would have 14 HDDs, with five of them being extra - for safety/redundancy
purposes.

LVM allows you to "join" several block devices and create one huge
partition on top of them. Without LVM you will end up with raid6 on
14 HDDs thus having only 2 drives used for redundancy. Quite risky
IMHO.
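
With three underlying arrays that would look roughly like this (only a
sketch - device and volume group names are made up):

# pvcreate /dev/md0 /dev/md1 /dev/md2
# vgcreate vg_storage /dev/md0 /dev/md1 /dev/md2
# lvcreate -n storage -l 100%FREE vg_storage
# mkfs.ext3 /dev/vg_storage/storage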

It is quite often that a *whole* IO controller dies and takes all 4
drives with it. So when you connect your drives, always make sure
that you are totally safe if any of your IO controllers dies (taking
down 4 HDDs with it). With 5 redundant discs this may be possible to
solve. Of course when you replace the controller the discs are up
again, and only need to resync (which is done automatically).

LVM can be grown on-line (without rebooting the computer) to "join"
new block devices. And after that you only run `resize2fs /dev/...` and
your partition is bigger. Also in such a configuration I suggest you
use ext3, because no other fs (XFS, JFS, whatever) has had as much
testing as the ext* filesystems have.
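
Once the new array exists, the growing step then looks roughly like this
(again just a sketch with made-up names; recent kernels can even resize
a mounted ext3 on-line):

# pvcreate /dev/md3
# vgextend vg_storage /dev/md3
# lvextend -l +100%FREE /dev/vg_storage/storage
# resize2fs /dev/vg_storage/storage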


Question to other people here - what is the maximum partition size
that ext3 can handle, am I correct that it is 4 TB?

And to go above 4 TB we need to use ext4dev, right?

best regards
-- 
Janek Kozicki                                                         |

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17  3:58 RAID5 to RAID6 reshape? Beolach
  2008-02-17 11:50 ` Peter Grandi
  2008-02-17 13:31 ` Janek Kozicki
@ 2008-02-17 14:06 ` Janek Kozicki
  2008-02-17 23:54   ` cat
  2008-02-18  3:43 ` Neil Brown
  3 siblings, 1 reply; 42+ messages in thread
From: Janek Kozicki @ 2008-02-17 14:06 UTC (permalink / raw)
  Cc: linux-raid

Beolach said:     (by the date of Sat, 16 Feb 2008 20:58:07 -0700)


> Or would I be better off starting w/ 4 drives in RAID6?

oh, right - Sevrin Robstad has a good idea to solve your problem -
create a raid6 with one missing member. And add this member when you
have it, next year or so.
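
Roughly (a sketch only, device names made up): create the array with the
keyword "missing" in place of the drive you don't have yet, and add the
real drive later:

# mdadm --create /dev/md0 --level=6 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 missing
  (later, when the fourth drive arrives)
# mdadm /dev/md0 --add /dev/sdd1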

-- 
Janek Kozicki                                                         |

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17 11:50 ` Peter Grandi
@ 2008-02-17 14:45   ` Conway S. Smith
  2008-02-18  5:26     ` Janek Kozicki
  2008-02-18 19:05     ` RAID5 to RAID6 reshape? Peter Grandi
  0 siblings, 2 replies; 42+ messages in thread
From: Conway S. Smith @ 2008-02-17 14:45 UTC (permalink / raw)
  To: Linux RAID

On Sun, 17 Feb 2008 11:50:25 +0000
pg_lxra@lxra.to.sabi.co.UK (Peter Grandi) wrote:
> >>> On Sat, 16 Feb 2008 20:58:07 -0700, Beolach
> >>> <beolach@gmail.com> said:
> 
> beolach> [ ... ] start w/ 3 drives in RAID5, and add drives as I
> beolach> run low on free space, eventually to a total of 14
> beolach> drives (the max the case can fit).
> 
> Like for so many other posts to this list, all that is
> "syntactically" valid is not necessarily the same thing as that
> which is wise. 
> 

Which part isn't wise?  Starting w/ a few drives w/ the intention of
growing; or ending w/ a large array (IOW, are 14 drives more than I
should put in 1 array & expect to be "safe" from data loss)?

> beolach> But when I add the 5th or 6th drive, I'd like to switch
> beolach> from RAID5 to RAID6 for the extra redundancy.
> 
> Again, what may be possible is not necessarily what may be wise.
> 
> In particular it seems difficult to discern what use such
> arrays would be put to. There might be a bit of difference
> between a giant FAT32 volume containing song lyrics files and an
> XFS filesystem with a collection of 500GB tomography scans
> cached from a large tape backup system.
> 

Sorry for not mentioning it: I am planning on using XFS.  Its intended
usage is general home use; probably most of the space will end up
being used by media files that would typically be accessed over the
network by MythTV boxes.  I'll also be using it as a sandbox
database/web/mail server.  Everything will just be personal stuff, so
if I did lose it all I would be very depressed, but I hopefully
will have all the most important stuff backed up, and I won't lose my
job or anything too horrible.  The main reason I'm concerned about
performance is that for some time after I buy it, it will be the
highest-specced of my boxes, and so I will also be using it for some
gaming, which is where I expect performance to be most noticeable.

> beolach> I'm also interested in hearing people's opinions about
> beolach> LVM / EVMS.
> 
> They are yellow, and taste of vanilla :-). To say something more
> specific is difficult without knowing what kind of requirement
> they may be expected to satisfy.
> 
> beolach> I'm currently planning on just using RAID w/out the
> beolach> higher level volume management, as from my reading I
> beolach> don't think they're worth the performance penalty, [
> beolach> ... ]
> 
> Very amusing that someone who is planning to grow a 3 drive
> RAID5 into a 14 drive RAID6 worries about the DM "performance
> penalty".
> 

Well, I was reading that LVM2 had a 20%-50% performance penalty,
which in my mind is a really big penalty.  But I think those numbers
were from some time ago; has the situation improved?  And is a 14
drive RAID6 going to already have enough overhead that the additional
overhead isn't very significant?  I'm not sure why you say it's
amusing.

The other reason I wasn't planning on using LVM was because I was
planning on keeping all the drives in the one RAID.  If I decide a 14
drive array is too risky, and I go w/ 2 or 3 arrays then LVM would
appear much more useful to me.


Thanks for the response,
Conway S. Smith

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17 13:31 ` Janek Kozicki
@ 2008-02-17 16:18   ` Conway S. Smith
  2008-02-18  3:48     ` Neil Brown
  2008-02-17 22:40   ` Mark Hahn
  1 sibling, 1 reply; 42+ messages in thread
From: Conway S. Smith @ 2008-02-17 16:18 UTC (permalink / raw)
  To: linux-raid

On Sun, 17 Feb 2008 14:31:22 +0100
Janek Kozicki <janek_listy@wp.pl> wrote:
> Beolach said:     (by the date of Sat, 16 Feb 2008 20:58:07 -0700)
> 
> > I'm also interested in hearing people's opinions about LVM / EVMS.
> 
> With LVM it will be possible for you to have several raid5 and
> raid6 arrays: e.g. 5 HDDs (raid6), 5 HDDs (raid6) and 4 HDDs (raid5).
> Here you would have 14 HDDs, with five of them being extra - for
> safety/redundancy purposes.
> 
> LVM allows you to "join" several block devices and create one huge
> partition on top of them. Without LVM you will end up with raid6 on
> 14 HDDs thus having only 2 drives used for redundancy. Quite risky
> IMHO.
> 

I guess I'm just too reckless a guy.  I don't like having "wasted"
space, even though I know redundancy is by no means a waste.  And
part of me keeps thinking that the vast majority of my drives have
never failed (although a few have, including one just recently, which
is a large part of my motivation for this fileserver).  So I was
thinking RAID6, possibly w/ a hot spare or 2, would be safe enough.

Speaking of hot spares, how well would cheap external USB drives work
as hot spares?  Is that a pretty silly idea?

> It is quite often that a *whole* IO controller dies and takes all 4
> drives with it. So when you connect your drives, always make sure
> that you are totally safe if any of your IO controllers dies (taking
> down 4 HDDs with it). With 5 redundant discs this may be possible to
> solve. Of course when you replace the controller the discs are up
> again, and only need to resync (which is done automatically).
> 

That sounds scary.  Does a controller failure often cause data loss
on the disks?  My understanding was that one of the advantages of
Linux's SW RAID was that if a controller failed you could swap in
another controller, not even the same model or brand, and Linux would
reassemble the RAID.  But if a controller failure typically takes all
the data w/ it, then the portability isn't as awesome an advantage.
Is your last sentence about replacing the controller applicable to
most controller failures, or just w/ more redundant discs?  In my
situation downtime is only mildly annoying, data loss would be much
worse.

> LVM can be grown on-line (without rebooting the computer) to "join"
> new block devices. And after that you only run `resize2fs /dev/...` and
> your partition is bigger. Also in such a configuration I suggest you
> use ext3, because no other fs (XFS, JFS, whatever) has had as
> much testing as the ext* filesystems have.
> 
> 

Plain RAID5 & RAID6 are also capable of growing on-line, although I
expect it's a much more complex & time-consuming process than with LVM.  I
had been planning on using XFS, but I could rethink that.  Have there
been many horror stories about XFS?
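
For reference, my understanding is that growing an md raid5 by one disk
goes roughly like this (a sketch with made-up device names and paths; a
--backup-file is only needed if no spare device can hold the
critical-section backup):

# mdadm /dev/md0 --add /dev/sde1
# mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/root/md0-grow.bak
# xfs_growfs /srv/storage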

> Question to other people here - what is the maximum partition size
> that ext3 can handle, am I correct that it is 4 TB?
> 
> And to go above 4 TB we need to use ext4dev, right?
> 

I thought it depended on CPU architecture & kernel version, w/ recent
kernels on 64-bit archs being capable of 32 TiB.  If it is only 4
TiB, I would go w/ XFS.

> oh, right - Sevrin Robstad has a good idea to solve your problem -
> create a raid6 with one missing member. And add this member when you
> have it, next year or so.
> 

I thought I read that would involve a huge performance hit, since
then everything would require parity calculations.  Or would that
just be w/ 2 missing drives?


Thanks,
Conway S. Smith

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17 13:31 ` Janek Kozicki
  2008-02-17 16:18   ` Conway S. Smith
@ 2008-02-17 22:40   ` Mark Hahn
  2008-02-17 23:54     ` Janek Kozicki
  2008-02-18 12:46     ` Andre Noll
  1 sibling, 2 replies; 42+ messages in thread
From: Mark Hahn @ 2008-02-17 22:40 UTC (permalink / raw)
  To: linux-raid

>> I'm also interested in hearing people's opinions about LVM / EVMS.
>
> With LVM it will be possible for you to have several raid5 and raid6
> arrays: e.g. 5 HDDs (raid6), 5 HDDs (raid6) and 4 HDDs (raid5). Here you
> would have 14 HDDs, with five of them being extra - for safety/redundancy
> purposes.

that's a very high price to pay.

> partition on top of them. Without LVM you will end up with raid6 on
> 14 HDDs thus having only 2 drives used for redundancy. Quite risky
> IMHO.

your risk model is quite strange - 5/14 redundancy means that either 
you expect a LOT of failures, or you put a huge premium on availability.
the latter is odd because normally, HA people go for replication of 
more components, not just controllers (ie, whole servers).

> It is quite often that a *whole* IO controller dies and takes all 4

you appear to be using very flakey IO controllers.  are you specifically
talking about very cheap ones, or in hostile environments?

> drives with it. So when you connect your drives, always make sure
> that you are totally safe if any of your IO controllers dies (taking

IO controllers are not a common failure mode, in my experience.
when it happens, it usually indicates an environmental problem
(heat, bad power, bad hotplug, etc).

> Question to other people here - what is the maximum partition size
> that ext3 can handle, am I correct that it is 4 TB?

8 TB.  people who want to push this are probably using ext4 already.

> And to go above 4 TB we need to use ext4dev, right?

or patches (which have been around and even in some production use 
for a long while.)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17 14:06 ` Janek Kozicki
@ 2008-02-17 23:54   ` cat
  0 siblings, 0 replies; 42+ messages in thread
From: cat @ 2008-02-17 23:54 UTC (permalink / raw)
  To: Janek Kozicki; +Cc: linux-raid

On Sun, Feb 17, 2008 at 03:06:53PM +0100, Janek Kozicki wrote:
> > Or would I be better off starting w/ 4 drives in RAID6?
> 
> oh, right - Sevrin Robstad has a good idea to solve your problem -
> create a raid6 with one missing member. And add this member when you
> have it, next year or so.

That's a most cunning plan. Would there be any downsides to this?



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17 22:40   ` Mark Hahn
@ 2008-02-17 23:54     ` Janek Kozicki
  2008-02-18 12:46     ` Andre Noll
  1 sibling, 0 replies; 42+ messages in thread
From: Janek Kozicki @ 2008-02-17 23:54 UTC (permalink / raw)
  Cc: linux-raid

Mark Hahn said:     (by the date of Sun, 17 Feb 2008 17:40:12 -0500 (EST))

> >> I'm also interested in hearing people's opinions about LVM / EVMS.
> >
> > With LVM it will be possible for you to have several raid5 and raid6
> > arrays: e.g. 5 HDDs (raid6), 5 HDDs (raid6) and 4 HDDs (raid5). Here you
> > would have 14 HDDs, with five of them being extra - for safety/redundancy
> > purposes.
> 
> that's a very high price to pay.
> 
> > partition on top of them. Without LVM you will end up with raid6 on
> > 14 HDDs thus having only 2 drives used for redundancy. Quite risky
> > IMHO.
> 
> your risk model is quite strange - 5/14 redundancy means that either 

yeah, sorry. I went too far.

I haven't had an IO controller failure so far. But I've read about one
on this list, where all data was lost.

You're right, it's better to duplicate a server with a backup copy, so it
is independent of the original one.

-- 
Janek Kozicki                                                         |

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17  3:58 RAID5 to RAID6 reshape? Beolach
                   ` (2 preceding siblings ...)
  2008-02-17 14:06 ` Janek Kozicki
@ 2008-02-18  3:43 ` Neil Brown
  3 siblings, 0 replies; 42+ messages in thread
From: Neil Brown @ 2008-02-18  3:43 UTC (permalink / raw)
  To: Beolach; +Cc: linux-raid

On Saturday February 16, beolach@gmail.com wrote:
> found was a few months old.  Is it likely that RAID5 to RAID6
> reshaping will be implemented in the next 12 to 18 months (my rough

Certainly possible.

I won't say it is "likely" until it is actually done.  And by then it
will be definite :-)

i.e. no concrete plans.
It is always best to base your decisions on what is available today.


NeilBrown

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17 16:18   ` Conway S. Smith
@ 2008-02-18  3:48     ` Neil Brown
  0 siblings, 0 replies; 42+ messages in thread
From: Neil Brown @ 2008-02-18  3:48 UTC (permalink / raw)
  To: Conway S. Smith; +Cc: linux-raid

On Sunday February 17, beolach@gmail.com wrote:
> On Sun, 17 Feb 2008 14:31:22 +0100
> Janek Kozicki <janek_listy@wp.pl> wrote:
> 
> > oh, right - Sevrin Robstad has a good idea to solve your problem -
> > create a raid6 with one missing member. And add this member when you
> > have it, next year or so.
> > 
> 
> I thought I read that would involve a huge performance hit, since
> then everything would require parity calculations.  Or would that
> just be w/ 2 missing drives?

A raid6 with one missing drive would have a little bit of a
performance hit over raid5.

Partly there is a CPU hit to calculate the Q block which is slower
than calculating normal parity.

Partly there is the fact that raid6 never does "read-modify-write"
cycles, so to update one block in a stripe, it has to read all the
other data blocks.

But the worst aspect of doing this is that if you have a system crash,
you could get hidden data corruption.
After a system crash you cannot trust parity data (as it may have been
in the process of being updated) so you have to regenerate it from
known good data.  But if your array is degraded, you don't have all
the known good data, so you lose.

It is really best to avoid degraded raid4/5/6 arrays when at all
possible.

NeilBrown

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17 14:45   ` Conway S. Smith
@ 2008-02-18  5:26     ` Janek Kozicki
  2008-02-18 12:38       ` Beolach
  2008-02-18 19:05     ` RAID5 to RAID6 reshape? Peter Grandi
  1 sibling, 1 reply; 42+ messages in thread
From: Janek Kozicki @ 2008-02-18  5:26 UTC (permalink / raw)
  Cc: Linux RAID

Conway S. Smith said:     (by the date of Sun, 17 Feb 2008 07:45:26 -0700)

> Well, I was reading that LVM2 had a 20%-50% performance penalty,

huh? Make a benchmark. Do you really think that anyone would be using
it if there was any penalty bigger than 1-2% ? (random access, r/w).

I have no idea what the penalty is, but I'm totally sure I didn't
notice it.

-- 
Janek Kozicki                                                         |

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-18  5:26     ` Janek Kozicki
@ 2008-02-18 12:38       ` Beolach
  2008-02-18 14:42         ` Janek Kozicki
  0 siblings, 1 reply; 42+ messages in thread
From: Beolach @ 2008-02-18 12:38 UTC (permalink / raw)
  To: linux-raid

On Feb 17, 2008 10:26 PM, Janek Kozicki <janek_listy@wp.pl> wrote:
> Conway S. Smith said:     (by the date of Sun, 17 Feb 2008 07:45:26 -0700)
>
> > Well, I was reading that LVM2 had a 20%-50% performance penalty,
>
> huh? Make a benchmark. Do you really think that anyone would be using
> it if there was any penalty bigger than 1-2% ? (random access, r/w).
>
> I have no idea what the penalty is, but I'm totally sure I didn't
> notice it.
>

(Oops, replied straight to Janek, rather than the list.  Sorry.)

I saw those numbers in a few places; the only one I can remember off
the top of my head was the Gentoo-Wiki:
<http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID_mirror_and_LVM2_on_top_of_RAID>.
Looking at its history, that warning was added back on 23 Dec. 2006,
so it could very well be out-of-date.  Good to hear you don't notice
any performance drop.  I think I will try to run some benchmarks.
What do you guys recommend using for benchmarking?  Plain dd,
bonnie++?
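
For example, something along these lines (paths and sizes are just
placeholders; the test file should be a few times larger than RAM so
caching doesn't skew the numbers):

# dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=8192 conv=fdatasync
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/mnt/test/bigfile of=/dev/null bs=1M
# bonnie++ -d /mnt/test -s 8g -u nobody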


Conway S. Smith

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17 22:40   ` Mark Hahn
  2008-02-17 23:54     ` Janek Kozicki
@ 2008-02-18 12:46     ` Andre Noll
  2008-02-18 18:23       ` Mark Hahn
  1 sibling, 1 reply; 42+ messages in thread
From: Andre Noll @ 2008-02-18 12:46 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 748 bytes --]

On 17:40, Mark Hahn wrote:
> >Question to other people here - what is the maximum partition size
> >that ext3 can handle, am I correct that it is 4 TB?
> 
> 8 TB.  people who want to push this are probably using ext4 already.

ext3 has supported up to 16T for quite some time. It works fine for me:

root@ume:~ # mount |grep sda; df /dev/sda; uname -a; uptime
/dev/sda on /media/bia type ext3 (rw)
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda               15T  7.8T  7.0T  53% /media/bia
Linux ume 2.6.20.12 #3 SMP Tue Jun 5 14:33:44 CEST 2007 x86_64 GNU/Linux
 13:44:29 up 236 days, 15:12,  9 users,  load average: 10.47, 10.28, 10.17

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-18 12:38       ` Beolach
@ 2008-02-18 14:42         ` Janek Kozicki
  2008-02-19 19:41           ` LVM performance (was: Re: RAID5 to RAID6 reshape?) Oliver Martin
  0 siblings, 1 reply; 42+ messages in thread
From: Janek Kozicki @ 2008-02-18 14:42 UTC (permalink / raw)
  Cc: linux-raid

Beolach said:     (by the date of Mon, 18 Feb 2008 05:38:15 -0700)

> On Feb 17, 2008 10:26 PM, Janek Kozicki <janek_listy@wp.pl> wrote:
> > Conway S. Smith said:     (by the date of Sun, 17 Feb 2008 07:45:26 -0700)
> >
> > > Well, I was reading that LVM2 had a 20%-50% performance penalty,
> <http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID_mirror_and_LVM2_on_top_of_RAID>.

hold on. This might be related to raid chunk positioning with respect
to LVM chunk positioning. If they interfere there indeed may be some
performance drop. Best to make sure that those chunks are aligned together.

-- 
Janek Kozicki                                                         |

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-18 12:46     ` Andre Noll
@ 2008-02-18 18:23       ` Mark Hahn
  0 siblings, 0 replies; 42+ messages in thread
From: Mark Hahn @ 2008-02-18 18:23 UTC (permalink / raw)
  To: Andre Noll; +Cc: linux-raid

>> 8 TB.  people who want to push this are probably using ext4 already.
>
> ext3 has supported up to 16T for quite some time. It works fine for me:

thanks.  16 makes sense (2^32 * 4k blocks).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-17 14:45   ` Conway S. Smith
  2008-02-18  5:26     ` Janek Kozicki
@ 2008-02-18 19:05     ` Peter Grandi
  2008-02-20  6:39       ` Alexander Kühn
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2008-02-18 19:05 UTC (permalink / raw)
  To: Linux RAID

>>> On Sun, 17 Feb 2008 07:45:26 -0700, "Conway S. Smith"
>>> <beolach@gmail.com> said:

[ ... ]

beolach> Which part isn't wise? Starting w/ a few drives w/ the
beolach> intention of growing; or ending w/ a large array (IOW,
beolach> are 14 drives more than I should put in 1 array & expect
beolach> to be "safe" from data loss)?

Well, that rather depends on what your intended data setup and
access patterns are, but the above are all things that may be unwise
in many cases. The intended use mentioned below does not require
a single array, for example.

However while doing the above may make sense in *some* situation,
I reckon that the number of those situations is rather small.

Consider for example the answers to these questions:

* Suppose you have a 2+1 array which is full. Now you add a disk
  and that means that almost all free space is on a single disk.
  The MD subsystem has two options as to where to add that lump
  of space, consider why neither is very pleasant.

* How fast is doing unaligned writes with a 13+1 or a 12+2
  stripe? How often is that going to happen, especially on an
  array that started as a 2+1?

* How long does it take to rebuild parity with a 13+1 array or a
  12+2 array in case of a single disk failure? What happens if a
  disk fails during rebuild?

* When you have 13 drives and you add the 14th, how long does
  that take? What happens if a disk fails during rebuild??

The points made by http://WWW.BAARF.com/ apply too.

beolach> [ ... ] media files that would typically be accessed
beolach> over the network by MythTV boxes.  I'll also be using
beolach> it as a sandbox database/web/mail server. [ ... ] most
beolach> important stuff backed up, [ ... ] some gaming, which
beolach> is where I expect performance to be most noticeable.

To me that sounds like something that could well be split across
multiple arrays, rather than risking repeatedly extending a
single array, and then risking a single large array.

beolach> Well, I was reading that LVM2 had a 20%-50% performance
beolach> penalty, which in my mind is a really big penalty. But I
beolach> think those numbers were from some time ago; has the
beolach> situation improved?

LVM2 relies on DM, which is not much slower than say 'loop', so
it is almost insignificant for most people.

But even if the overhead may be very very low, DM/LVM2/EVMS seem
to me to have very limited usefulness (e.g. Oracle tablespaces,
and there are contrary opinions as to that too). In your stated
applications it is hard to see why you'd want to split your
arrays into very many block devices or why you'd want to resize
them.

beolach> And is a 14 drive RAID6 going to already have enough
beolach> overhead that the additional overhead isn't very
beolach> significant? I'm not sure why you say it's amusing.

Consider the questions above. Parity RAID has issues, extending
an array has issues, and the idea of extending a parity RAID both
massively and in several steps looks very amusing to me.

beolach> [ ... ] The other reason I wasn't planning on using LVM
beolach> was because I was planning on keeping all the drives in
beolach> the one RAID. [... ]

Good luck :-).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* LVM performance (was: Re: RAID5 to RAID6 reshape?)
  2008-02-18 14:42         ` Janek Kozicki
@ 2008-02-19 19:41           ` Oliver Martin
  2008-02-19 19:52             ` Jon Nelson
                               ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Oliver Martin @ 2008-02-19 19:41 UTC (permalink / raw)
  To: Janek Kozicki; +Cc: linux-raid

Janek Kozicki schrieb:
> hold on. This might be related to raid chunk positioning with respect
> to LVM chunk positioning. If they interfere there indeed may be some
> performance drop. Best to make sure that those chunks are aligned together.

Interesting. I'm seeing a 20% performance drop too, with default RAID 
and LVM chunk sizes of 64K and 4M, respectively. Since 64K divides 4M 
evenly, I'd think there shouldn't be such a big performance penalty.
It's not like I care that much, I only have 100 Mbps ethernet anyway. 
I'm just wondering...

$ hdparm -t /dev/md0

/dev/md0:
  Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec

$ hdparm -t /dev/dm-0

/dev/dm-0:
  Timing buffered disk reads:  116 MB in  3.04 seconds =  38.20 MB/sec

dm doesn't do anything fancy to justify the drop (encryption etc). In 
fact, it doesn't do much at all yet - I intend to use it to join 
multiple arrays in the future when I have drives of different sizes, but 
right now, I only have 500GB drives. So it's just one LV in one VG on 
one PV.

Here's some more info:

$ mdadm -D /dev/md0
/dev/md0:
         Version : 00.90.03
   Creation Time : Sat Nov 24 12:15:48 2007
      Raid Level : raid5
      Array Size : 976767872 (931.52 GiB 1000.21 GB)
   Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
    Raid Devices : 3
   Total Devices : 3
Preferred Minor : 0
     Persistence : Superblock is persistent

     Update Time : Tue Feb 19 01:18:26 2008
           State : clean
  Active Devices : 3
Working Devices : 3
  Failed Devices : 0
   Spare Devices : 0

          Layout : left-symmetric
      Chunk Size : 64K

            UUID : d41fe8a6:84b0f97a:8ac8b21a:819833c6 (local to host 
quassel)
          Events : 0.330016

     Number   Major   Minor   RaidDevice State
        0       8       17        0      active sync   /dev/sdb1
        1       8       33        1      active sync   /dev/sdc1
        2       8       81        2      active sync   /dev/sdf1

$ pvdisplay
   --- Physical volume ---
   PV Name               /dev/md0
   VG Name               raid
   PV Size               931,52 GB / not usable 2,69 MB
   Allocatable           yes (but full)
   PE Size (KByte)       4096
   Total PE              238468
   Free PE               0
   Allocated PE          238468
   PV UUID               KadH5k-9Cie-dn5Y-eNow-g4It-lfuI-XqNIet

$ vgdisplay
   --- Volume group ---
   VG Name               raid
   System ID
   Format                lvm2
   Metadata Areas        1
   Metadata Sequence No  4
   VG Access             read/write
   VG Status             resizable
   MAX LV                0
   Cur LV                1
   Open LV               1
   Max PV                0
   Cur PV                1
   Act PV                1
   VG Size               931,52 GB
   PE Size               4,00 MB
   Total PE              238468
   Alloc PE / Size       238468 / 931,52 GB
   Free  PE / Size       0 / 0
   VG UUID               AW9yaV-B3EM-pRLN-RTIK-LEOd-bfae-3Vx3BC

$ lvdisplay
   --- Logical volume ---
   LV Name                /dev/raid/raid
   VG Name                raid
   LV UUID                eWIRs8-SFyv-lnix-Gk72-Lu9E-Ku7j-iMIv92
   LV Write Access        read/write
   LV Status              available
   # open                 1
   LV Size                931,52 GB
   Current LE             238468
   Segments               1
   Allocation             inherit
   Read ahead sectors     auto
   - currently set to     256
   Block device           253:0

-- 
Oliver

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)
  2008-02-19 19:41           ` LVM performance (was: Re: RAID5 to RAID6 reshape?) Oliver Martin
@ 2008-02-19 19:52             ` Jon Nelson
  2008-02-19 20:00               ` Iustin Pop
  2008-02-19 23:19             ` LVM performance Peter Rabbitson
  2008-02-20 12:19             ` LVM performance (was: Re: RAID5 to RAID6 reshape?) Peter Grandi
  2 siblings, 1 reply; 42+ messages in thread
From: Jon Nelson @ 2008-02-19 19:52 UTC (permalink / raw)
  To: Oliver Martin; +Cc: Janek Kozicki, linux-raid

On Feb 19, 2008 1:41 PM, Oliver Martin
<oliver.martin@student.tuwien.ac.at> wrote:
> Janek Kozicki schrieb:
> > hold on. This might be related to raid chunk positioning with respect
> > to LVM chunk positioning. If they interfere there indeed may be some
> > performance drop. Best to make sure that those chunks are aligned together.
>
> Interesting. I'm seeing a 20% performance drop too, with default RAID
> and LVM chunk sizes of 64K and 4M, respectively. Since 64K divides 4M
> evenly, I'd think there shouldn't be such a big performance penalty.
> It's not like I care that much, I only have 100 Mbps ethernet anyway.
> I'm just wondering...
>
> $ hdparm -t /dev/md0
>
> /dev/md0:
>   Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec
>
> $ hdparm -t /dev/dm-0
>
> /dev/dm-0:
>   Timing buffered disk reads:  116 MB in  3.04 seconds =  38.20 MB/sec

I'm getting better performance on a LV than on the underlying MD:

# hdparm -t /dev/md0

/dev/md0:
 Timing buffered disk reads:  408 MB in  3.01 seconds = 135.63 MB/sec
# hdparm -t /dev/raid/multimedia

/dev/raid/multimedia:
 Timing buffered disk reads:  434 MB in  3.01 seconds = 144.04 MB/sec
#

md0 is a 3-disk raid5 (64k chunk, alg. 2, using a bitmap), comprised of
7200rpm sata drives from several manufacturers.



-- 
Jon

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)
  2008-02-19 19:52             ` Jon Nelson
@ 2008-02-19 20:00               ` Iustin Pop
  0 siblings, 0 replies; 42+ messages in thread
From: Iustin Pop @ 2008-02-19 20:00 UTC (permalink / raw)
  To: Jon Nelson; +Cc: Oliver Martin, Janek Kozicki, linux-raid

On Tue, Feb 19, 2008 at 01:52:21PM -0600, Jon Nelson wrote:
> On Feb 19, 2008 1:41 PM, Oliver Martin
> <oliver.martin@student.tuwien.ac.at> wrote:
> > Janek Kozicki schrieb:
> >
> > $ hdparm -t /dev/md0
> >
> > /dev/md0:
> >   Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec
> >
> > $ hdparm -t /dev/dm-0
> >
> > /dev/dm-0:
> >   Timing buffered disk reads:  116 MB in  3.04 seconds =  38.20 MB/sec
> 
> I'm getting better performance on a LV than on the underlying MD:
> 
> # hdparm -t /dev/md0
> 
> /dev/md0:
>  Timing buffered disk reads:  408 MB in  3.01 seconds = 135.63 MB/sec
> # hdparm -t /dev/raid/multimedia
> 
> /dev/raid/multimedia:
>  Timing buffered disk reads:  434 MB in  3.01 seconds = 144.04 MB/sec
> #

As people are trying to point out in many lists and docs: hdparm is
*not* a benchmark tool. So its numbers, while interesting, should not be
regarded as a valid comparison.

Just my opinion.

regards,
iustin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-02-19 19:41           ` LVM performance (was: Re: RAID5 to RAID6 reshape?) Oliver Martin
  2008-02-19 19:52             ` Jon Nelson
@ 2008-02-19 23:19             ` Peter Rabbitson
  2008-02-20 12:19             ` LVM performance (was: Re: RAID5 to RAID6 reshape?) Peter Grandi
  2 siblings, 0 replies; 42+ messages in thread
From: Peter Rabbitson @ 2008-02-19 23:19 UTC (permalink / raw)
  To: Oliver Martin; +Cc: Janek Kozicki, linux-raid

Oliver Martin wrote:
> Interesting. I'm seeing a 20% performance drop too, with default RAID 
> and LVM chunk sizes of 64K and 4M, respectively. Since 64K divides 4M 
> evenly, I'd think there shouldn't be such a big performance penalty.

I am no expert, but as far as I have read you must not only have compatible
chunk sizes (which is easy and most often the case). You must also stripe
align the LVM chunks, so every chunk spans a whole number of raid stripes
(not raid chunks). Check the output of `dmsetup table`. The last number is
the offset into the underlying block device at which the LVM data portion
starts. It must be divisible by the raid stripe length (the length varies
for different raid types).

Currently LVM does not offer an easy way to do such alignment; you have to do 
it manually when executing pvcreate. By using the option --metadatasize one 
can specify the size of the area between the LVM header (64KiB) and the start 
of the data area. So one would supply STRIPE_SIZE - 64 for metadatasize[*], 
and the result will be a stripe aligned LVM.
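
As a worked example of that recipe (unverified, numbers purely
illustrative): a 4-disk raid5 with 64KiB chunks has a 3 x 64KiB = 192KiB
data stripe, so:

# pvcreate --metadatasize 128k /dev/md0
  (64KiB header + 128KiB metadata = 192KiB, one full stripe)
# dmsetup table
  (after creating the LV: the last field, in 512-byte sectors, should be
   a multiple of 384, i.e. 192KiB)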

This information is unverified; I just compiled it from different list threads 
and whatnot. I did this to my own arrays/volumes and I get near 100% raw 
speed. If someone else can confirm the validity of this - it would be great.

Peter

* The supplied number is always rounded up to be divisible by 64KiB, so the 
smallest total LVM header is at least 128KiB

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-18 19:05     ` RAID5 to RAID6 reshape? Peter Grandi
@ 2008-02-20  6:39       ` Alexander Kühn
  2008-02-22  8:13         ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: Alexander Kühn @ 2008-02-20  6:39 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 3228 bytes --]

----- Message from pg_lxra@lxra.to.sabi.co.UK ---------
     Date: Mon, 18 Feb 2008 19:05:02 +0000
     From: Peter Grandi <pg_lxra@lxra.to.sabi.co.UK>
Reply-To: Peter Grandi <pg_lxra@lxra.to.sabi.co.UK>
  Subject: Re: RAID5 to RAID6 reshape?
       To: Linux RAID <linux-raid@vger.kernel.org>


>>>> On Sun, 17 Feb 2008 07:45:26 -0700, "Conway S. Smith"
>>>> <beolach@gmail.com> said:

> Consider for example the answers to these questions:
>
> * Suppose you have a 2+1 array which is full. Now you add a disk
>   and that means that almost all free space is on a single disk.
>   The MD subsystem has two options as to where to add that lump
>   of space, consider why neither is very pleasant.

No, only one: at the end of the md device, and the "free space" will be 
evenly distributed among the drives.

> * How fast is doing unaligned writes with a 13+1 or a 12+2
>   stripe? How often is that going to happen, especially on an
>   array that started as a 2+1?

They are all the same speed with raid5 no matter what you started  
with. You read two blocks and you write two blocks. (not even chunks  
mind you)

> * How long does it take to rebuild parity with a 13+1 array or a
>   12+2 array in case of a single disk failure? What happens if a
>   disk fails during rebuild?

Depends on how much data the controllers can push. But at least with  
my hpt2320 the limiting factor is the disk speed and that doesn't  
change whether I have 2 disks or 12.

> * When you have 13 drives and you add the 14th, how long does
>   that take? What happens if a disk fails during rebuild??

..again pretty much the same as adding a fourth drive to a three-drive raid5.
It will continue to be degraded..nothing special.

> beolach> Well, I was reading that LVM2 had a 20%-50% performance
> beolach> penalty, which in my mind is a really big penalty. But I
> beolach> think those numbers were from some time ago; has the
> beolach> situation improved?
>
> LVM2 relies on DM, which is not much slower than say 'loop', so
> it is almost insignificant for most people.

I agree.

> But even if the overhead may be very very low, DM/LVM2/EVMS seem
> to me to have very limited usefulness (e.g. Oracle tablespaces,
> and there are contrary opinions as to that too). In your stated
> applications it is hard to see why you'd want to split your
> arrays into very many block devices or why you'd want to resize
> them.

I think the idea is to be able to have more than just one device to  
put a filesystem on. For example a / filesystem, swap and maybe  
something like /storage come to mind. Yes, one could do that with 
partitioning but lvm was made for this so why not use it.

The situation looks different with raid6: there the write penalty 
becomes higher with more disks, but not with raid5.
Regards,
Alex.

----- End message from pg_lxra@lxra.to.sabi.co.UK -----




--
Alexander Kuehn

Cell phone: +49 (0)177 6461165
Cell fax:   +49 (0)177 6468001
Tel @Home:  +49 (0)711 6336140
Mail mailto:Alexander.Kuehn@nagilum.de


----------------------------------------------------------------
cakebox.homeunix.net - all the machine one needs..


[-- Attachment #2: PGP Digital Signature --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)
  2008-02-19 19:41           ` LVM performance (was: Re: RAID5 to RAID6 reshape?) Oliver Martin
  2008-02-19 19:52             ` Jon Nelson
  2008-02-19 23:19             ` LVM performance Peter Rabbitson
@ 2008-02-20 12:19             ` Peter Grandi
  2008-02-22 13:41               ` LVM performance Oliver Martin
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2008-02-20 12:19 UTC (permalink / raw)
  To: Linux RAID

>> This might be related to raid chunk positioning with respect
>> to LVM chunk positioning. If they interfere there indeed may
>> be some performance drop. Best to make sure that those chunks
>> are aligned together.

> Interesting. I'm seeing a 20% performance drop too, with default
> RAID and LVM chunk sizes of 64K and 4M, respectively. Since 64K
> divides 4M evenly, I'd think there shouldn't be such a big
> performance penalty. [ ... ]

Those are as such not very meaningful. What matters most is
whether the starting physical address of each logical volume
extent is stripe aligned (and whether the filesystem makes use
of that) and then the stripe size of the parity RAID set, not
the chunk sizes in themselves.

I am often surprised by how many people who use parity RAID
don't seem to realize the crucial importance of physical stripe
alignment, but I am getting used to it.

Because of stripe alignment it is usually better to build parity
arrays on top of partitions or volumes than vice versa, as it is
often more difficult to align the start of a partition or volume
to the underlying stripes than the reverse.

But then those who understand the vital importance of stripe
aligned writes for parity RAID often avoid using parity RAID
:-).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-20  6:39       ` Alexander Kühn
@ 2008-02-22  8:13         ` Peter Grandi
  2008-02-23 20:40           ` Nagilum
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2008-02-22  8:13 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>> * Suppose you have a 2+1 array which is full. Now you add a
>> disk and that means that almost all free space is on a single
>> disk. The MD subsystem has two options as to where to add
>> that lump of space, consider why neither is very pleasant.

> No, only one: at the end of the md device, and the "free space"
> will be evenly distributed among the drives.

Not necessarily, however let's assume that happens.

Since the free space will have a different distribution,
the used space will also, so that the physical layout will
evolve like this when creating a 3+1 from a 2+1+1:

   2+1+1       3+1
  a b c d    a b c d
  -------    -------
  0 1 P F    0 1 2 Q    P: old parity
  P 2 3 F    Q 3 4 5    F: free block
  4 P 5 F    6 Q 7 8    Q: new parity
  .......    .......
             F F F F

How will the free space become evenly distributed among the
drives? Well, it sounds like 3 drives will be read (2 if not
checking parity) and 4 drives written; while on a 3+1 a mere
parity rebuild only writes to 1 at a time, even if it reads from
3, and a recovery reads from 3 and writes to 2 drives.

Is that a pleasant option? To me it looks like begging for
trouble. For one thing the highest likelihood of failure is
when a lot of disks start running together doing much the same
things. RAID is based on the idea of uncorrelated failures...

  An aside: in my innocence I realized only recently that online
  redundancy and uncorrelated failures are somewhat contradictory.

Never mind that since one is changing the layout an interruption
in the process may leave the array unusable, even if with no
loss of data, even if recent MD versions mostly cope; from a
recent 'man' page for 'mdadm':

 «Increasing the number of active devices in a RAID5 is much
  more effort.  Every block in the array will need to be read
  and written back to a new location.»

  From 2.6.17, the Linux Kernel is able to do this safely,
  including restart and interrupted "reshape".

  When relocating the first few stripes on a raid5, it is not
  possible to keep the data on disk completely consistent and
  crash-proof. To provide the required safety, mdadm disables
  writes to the array while this "critical section" is reshaped,
  and takes a backup of the data that is in that section.

  This backup is normally stored in any spare devices that the
  array has, however it can also be stored in a separate file
  specified with the --backup-file option.»

Since the reshape reads from N drives *and then writes* to N+1 drives
at almost the same time, things are going to be a bit slower than a
mere rebuild or recover: each stripe will be read from the N
existing drives and then written back to N+1 *while the next
stripe is being read from N* (or not...).

>> * How fast is doing unaligned writes with a 13+1 or a 12+2
>> stripe? How often is that going to happen, especially on an
>> array that started as a 2+1?

> They are all the same speed with raid5 no matter what you
> started with.

But I asked two questions that are not "how does the
speed differ". The two answers to the questions I asked are very
different from "the same speed" (they are "very slow" and
"rather often"):

* Doing unaligned writes on a 13+1 or 12+2 is catastrophically
  slow because of the RMW cycle. This is of course independent
  of how one got to something like a 13+1 or a 12+2.

* Unfortunately the frequency of unaligned writes *does* usually
  depend on how dementedly one got to the 13+1 or 12+2 case:
  because a filesystem that lays out files so that misalignment
  is minimised with a 2+1 stripe just about guarantees that when
  one switches to a 3+1 stripe all previously written data is
  misaligned, and so on -- and never mind that every time one
  adds a disk a reshape is done that shuffles stuff around.

There is a saving grace as to the latter point: many programs
don't overwrite files in place but truncate and recreate them
(which is not so good but for this case).

> You read two blocks and you write two blocks. (not even chunks
> mind you)

But we are talking about a *reshape* here and to a RAID5. If you
add a drive to a RAID5 and redistribute in the obvious way then
existing stripes have to be rewritten as the periodicity of the
parity changes from every N to every N+1.

>> * How long does it take to rebuild parity with a 13+1 array
>> or a 12+2 array in case of a single disk failure? What happens
>> if a disk fails during rebuild?

> Depends on how much data the controllers can push. But at
> least with my hpt2320 the limiting factor is the disk speed

But here we are on the Linux RAID mailing list and we are
talking about software RAID. With software RAID a reshape with
14 disks needs to shuffle around the *host bus* (not merely the
host adapter as with hw RAID) almost 5 times as much data as
with 3 (say 14x80MB/s ~= 1GB/s sustained in both directions at
the outer tracks). The host adapter also has to be able to run
14 operations in parallel.

It can be done -- it is just somewhat expensive, but then what's
the point of a 14 wide RAID if the host bus and host adapter
cannot handle the full parallel bandwidth of 14 drives?

Yet in some cases RAID sets are built for capacity more than
speed, and with cheap hw it may not be possible to read or write
14 drives in parallel, but only something like 3-4. Then look at the
alternatives:

* Grow from a 2+1 to a 13+1 a drive at a time: every time the
  whole array is both read and written, and if the host cannot
  handle more than say 4 drives at once, the array will be
  reshaping for 3-4 times longer towards the end than at the
  beginning (something like 8 hours instead of 2).

* Grow from 2+1 by adding say another 2+1 and two 3+1s: every
  time that involves just a few drives, existing drives are not
  touched, and a drive failure during building a new array is
  not an issue because if the build fails there is no data on
  the failed array; indeed the previously built arrays just
  continue to work.

At this point some very clever readers will shake their heads, count
the 1 drive wasted for resiliency in one case, and 4 in the
other and realize smugly how much more cost effective their
scheme is. Good luck! :-)

> and that doesn't change whether I have 2 disks or 12.

Not quite, but another thing that changes is the probability of
a disk failure during a reshape.

Neil Brown wrote recently in this list (Feb 17th) this very wise
bit of advice:

 «It is really best to avoid degraded raid4/5/6 arrays when at all
  possible. NeilBrown»

Repeatedly expanding an array means deliberately doing something
similar...

One amusing detail is the number of companies advertising disk
recovery services for RAID sets. They have RAID5 to thank for a
lot of their business, but array reshapes may well help too :-).

[ ... ]

>> [ ... ] In your stated applications it is hard to see why
>> you'd want to split your arrays into very many block devices
>> or why you'd want to resize them.

> I think the idea is to be able to have more than just one
> device to put a filesystem on. For example a / filesystem,
> swap and maybe something like /storage comes to mind.

Well, for a small number of volumes like that a reasonable
strategy is to partition the disks and then RAID those
partitions. This can be done on a few disks at a time.
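
For instance (only a sketch, device names made up, assuming every disk
carries the same partition layout):

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
# mdadm --create /dev/md1 --level=5 --raid-devices=3 \
        /dev/sdb2 /dev/sdc2 /dev/sdd2
  (a small raid1 for /, a raid5 for the bulk storage)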

For archiving stuff as it accumulates ("digital attic") just
adding disks and creating a large single partition on each disk
seems simplest and easiest.

Even RAID is not that useful there (because RAID, especially
parity RAID, is not a substitute for backups). But a few small
(2+1, 3+1, in a desperate case even 4+1) mostly read-only RAID5s
may be reasonable for that (as long as there are backups
anyhow).

> Yes, one could to that with partitioning but lvm was made for
> this so why not use it.

The problem with LVM is that it adds an extra layer of
complications and dependencies to things like booting and system
management. It can be fully automated, but then the list of things
that can go wrong increases.

BTW, good news: DM/LVM2 are largely no longer necessary: one can
achieve the same effect, including much the same performance, by
using the loop device on large files on a good filesystem that
supports extents, like JFS or XFS.
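
A sketch of that approach (paths made up; the backing file can be sparse
and lives on, say, an XFS filesystem mounted at /data):

# dd if=/dev/zero of=/data/volumes/vol0.img bs=1M count=0 seek=102400
# losetup /dev/loop0 /data/volumes/vol0.img
# mkfs.ext3 /dev/loop0
# mount /dev/loop0 /mnt/vol0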

To the point that in a (slightly dubious) test some guy got
better performance out of Oracle tablespaces as large files
than with the usually recommended raw volumes/partitions...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-02-20 12:19             ` LVM performance (was: Re: RAID5 to RAID6 reshape?) Peter Grandi
@ 2008-02-22 13:41               ` Oliver Martin
  2008-03-07  8:14                 ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: Oliver Martin @ 2008-02-22 13:41 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

Peter Grandi schrieb:
> Those are as such not very meaningful. What matters most is
> whether the starting physical address of each logical volume
> extent is stripe aligned (and whether the filesystem makes use
> of that) and then the stripe size of the parity RAID set, not
> the chunk sizes in themselves.
> 
> I am often surprised by how many people who use parity RAID
> don't seem to realize the crucial importance of physical stripe
> alignment, but I am getting used to it.

Am I right to assume that stripe alignment matters because of the 
read-modify-write cycle needed for unaligned writes? If so, how come a 
pure read benchmark (hdparm -t or plain dd) is slower on the LVM device 
than on the md device?


Oliver

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-22  8:13         ` Peter Grandi
@ 2008-02-23 20:40           ` Nagilum
  2008-02-25  0:10             ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: Nagilum @ 2008-02-23 20:40 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 10441 bytes --]

----- Message from pg_lxra@lxfra.for.sabi.co.UK ---------
     Date: Fri, 22 Feb 2008 08:13:05 +0000
     From: Peter Grandi <pg_lxra@lxfra.for.sabi.co.UK>
Reply-To: Peter Grandi <pg_lxra@lxfra.for.sabi.co.UK>
  Subject: Re: RAID5 to RAID6 reshape?
       To: Linux RAID <linux-raid@vger.kernel.org>


> [ ... ]
>
>>> * Suppose you have a 2+1 array which is full. Now you add a
>>> disk and that means that almost all free space is on a single
>>> disk. The MD subsystem has two options as to where to add
>>> that lump of space, consider why neither is very pleasant.
>
>> No, only one: at the end of the md device, and the "free space"
>> will be evenly distributed among the drives.
>
> Not necessarily, however let's assume that happens.
>
> Since the free space will have a different distribution,
> the used space will also, so that the physical layout will
> evolve like this when creating a 3+1 from a 2+1+1:
>
>    2+1+1       3+1
>   a b c d    a b c d
>   -------    -------
>   0 1 P F    0 1 2 Q    P: old parity
>   P 2 3 F    Q 3 4 5    F: free block
>   4 P 5 F    6 Q 7 8    Q: new parity
>   .......    .......
>              F F F F
                ^^^^^^^
...evenly distributed. Thanks for the picture. I don't know why you  
are still asking after that?

> How will the free space become evenly distributed among the
> drives? Well, it sounds like 3 drives will be read (2 if not
> checking parity) and 4 drives written; while on a 3+1 a mere
> parity rebuild only writes to 1 at a time, even if it reads from
> 3, and a recovery reads from 3 and writes to 2 drives.
>
> Is that a pleasant option? To me it looks like begging for
> trouble. For one thing the highest likelihood of failure is
> when a lot of disks start running together doing much the same
> things. RAID is based on the idea of uncorrelated failures...

A forced sync before a reshape is advised.
As usual a single disk failure during reshape is not a bigger problem  
than when it happens at another time.

>   An aside: in my innocence I realized only recently that online
>   redundancy and uncorrelated failures are somewhat contradictory.
>
> Never mind that since one is changing the layout an interruption
> in the process may leave the array unusable, even if with no
> loss of data, even if recent MD versions mostly cope; from a
> recent 'man' page for 'mdadm':
>
>  «Increasing the number of active devices in a RAID5 is much
>   more effort.  Every block in the array will need to be read
>   and written back to a new location.»
>
>   From 2.6.17, the Linux Kernel is able to do this safely,
>   including restart and interrupted "reshape".
>
>   When relocating the first few stripes on a raid5, it is not
>   possible to keep the data on disk completely consistent and
>   crash-proof. To provide the required safety, mdadm disables
>   writes to the array while this "critical section" is reshaped,
>   and takes a backup of the data that is in that section.
>
>   This backup is normally stored in any spare devices that the
>   array has, however it can also be stored in a separate file
>   specified with the --backup-file option.»
>
> Since the reshape reads N *and then writes* to N+1 the drives at
> almost the same time things are going to be a bit slower than a
> mere rebuild or recover: each stripe will be read from the N
> existing drives and then written back to N+1 *while the next
> stripe is being read from N* (or not...).

Yes, it will be slower but probably still faster than getting the data
off and back on again. And of course you don't need the storage for
the backup.
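For reference, the kind of grow that needs that critical-section backup
looks roughly like this (a sketch; device names and the backup path are
placeholders):

	# add the new disk as a spare, then grow from 3 to 4 active devices
	mdadm --add /dev/md0 /dev/sde1
	mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/root/md0-grow.bak
	# the reshape progress shows up in /proc/mdstat
	cat /proc/mdstat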

>>> * How fast is doing unaligned writes with a 13+1 or a 12+2
>>> stripe? How often is that going to happen, especially on an
>>> array that started as a 2+1?
>
>> They are all the same speed with raid5 no matter what you
>> started with.
>
> But I asked two questions that are not "how does the
> speed differ". The two answers to the questions I asked are very
> different from "the same speed" (they are "very slow" and
> "rather often"):

And this is where you're wrong.

> * Doing unaligned writes on a 13+1 or 12+2 is catastrophically
>   slow because of the RMW cycle. This is of course independent
>   of how one got to the something like 13+1 or a 12+2.

Changing a single byte in a 2+1 raid5 or a 13+1 raid5 requires exactly  
two 512byte blocks to be read and written from two different disks.
Changing two bytes which are unaligned (the last and first byte of two  
consecutive stripes) doubles those figures, but more disks are involved.

> * Unfortunately the frequency of unaligned writes *does* usually
>   depend on how dementedly one got to the 13+1 or 12+2 case:
>   because a filesystem that lays out files so that misalignment
>   is minimised with a 2+1 stripe just about guarantees that when
>   one switches to a 3+1 stripe all previously written data is
>   misaligned, and so on -- and never mind that every time one
>   adds a disk a reshape is done that shuffles stuff around.

One can usually do away with specifying 2*Chunksize.

>> You read two blocks and you write two blocks. (not even chunks
>> mind you)
>
> But we are talking about a *reshape* here and to a RAID5. If you
> add a drive to a RAID5 and redistribute in the obvious way then
> existing stripes have to be rewritten as the periodicity of the
> parity changes from every N to every N+1.

Yes, once, during the reshape.

>>> * How long does it take to rebuild parity with a 13+1 array
>>> or a 12+2 array in case of single disk failure? What happens
>>> if a disk fails during rebuild?
>
>> Depends on how much data the controllers can push. But at
>> least with my hpt2320 the limiting factor is the disk speed
>
> But here we are on the Linux RAID mailing list and we are
> talking about software RAID. With software RAID a reshape with
> 14 disks needs to shuffle around the *host bus* (not merely the
> host adapter as with hw RAID) almost 5 times as much data as
> with 3 (say 14x80MB/s ~= 1GB/s sustained in both directions at
> the outer tracks). The host adapter also has to be able to run
> 14 operations in parallel.

I'm also talking about software raid. I'm not claiming that my hpt232x  
can push that much but then again it handles only 8 drives anyway.

> It can be done -- it is just somewhat expensive, but then what's
> the point of a 14 wide RAID if the host bus and host adapter
> cannot handle the full parallel bandwidth of 14 drives?

In most uses you are not going to exhaust the maximum transfer rate
of the disks. So I guess one would do it for the (cheap) space?

>> and that doesn't change whether I have 2 disks or 12.
>
> Not quite

See above.

> , but another thing that changes is the probability of
> a disk failure during a reshape.
>
> Neil Brown wrote recently in this list (Feb 17th) this very wise
> bit of advice:
>
>  «It is really best to avoid degraded raid4/5/6 arrays when at all
>   possible. NeilBrown»
>
> Repeatedly expanding an array means deliberately doing something
> similar...

It's not quite that bad. You still have redundancy when doing reshape.

> One amusing detail is the number of companies advertising disk
> recovery services for RAID sets. They have RAID5 to thank for a
> lot of their business, but array reshapes may well help too :-).

Yeah, reshaping is putting a strain on the array and one should take  
some precautions.

>>> [ ... ] In your stated applications it is hard to see why
>>> you'd want to split your arrays into very many block devices
>>> or why you'd want to resize them.
>
>> I think the idea is to be able to have more than just one
>> device to put a filesystem on. For example a / filesystem,
>> swap and maybe something like /storage comes to mind.
>
> Well, for a small number of volumes like that a reasonable
> strategy is to partition the disks and then RAID those
> partitions. This can be done on a few disks at a time.

True, but you lose flexibility. And how do you plan on increasing the
size of any of those volumes if you only want to add one disk and keep
the redundancy?
Ok, you could buy a disk which is only as large as the raid-devs that
make up the volume in question, but I find it a much cleaner setup to
have a bunch of identically sized disks in one big array.

> For archiving stuff as it accumulates (''digital attic'') just
> adding disks and creating a large single partition on each disk
> seems simplest and easiest.

I think this is what we're talking about here. But with your proposal
you have no redundancy.

>> Yes, one could to that with partitioning but lvm was made for
>> this so why not use it.
>
> The problem with LVM is that it adds an extra layer of
> complications and dependencies to things like booting and system
> management. Can be fully automated, but then the list of things
> that go wrong increases.

Never had any problems with it.

> BTW, good news: DM/LVM2 are largely no longer necessary: one can
> achieve the same effect, including much the same performance, by
> using the loop device on large files on a good filesystem that
> supports extents, like JFS or XFS.

*yeeks* no thanks, I'd rather use what has been made for it.
No need for another bikeshed.

> To the point that in a (slightly dubious) test some guy got
> better performance out of Oracle tablespaces as large files
> than with the usually recommended raw volumes/partitions...

Should not happen but who knows what Oracle does when it accesses  
block devices...

----- End message from pg_lxra@lxfra.for.sabi.co.UK -----



========================================================================
#    _  __          _ __     http://www.nagilum.org/ \n icq://69646724 #
#   / |/ /__ ____ _(_) /_ ____ _  nagilum@nagilum.org \n +491776461165 #
#  /    / _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
#           /___/     x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
========================================================================


----------------------------------------------------------------
cakebox.homeunix.net - all the machine one needs..


[-- Attachment #2: PGP Digital Signature --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-23 20:40           ` Nagilum
@ 2008-02-25  0:10             ` Peter Grandi
  2008-02-25 16:31               ` Nagilum
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2008-02-25  0:10 UTC (permalink / raw)
  To: Linux RAID

>>> On Sat, 23 Feb 2008 21:40:08 +0100, Nagilum
>>> <nagilum@nagilum.org> said:

[ ... ]

>> * Doing unaligned writes on a 13+1 or 12+2 is catastrophically
>> slow because of the RMW cycle. This is of course independent
>> of how one got to the something like 13+1 or a 12+2.

nagilum> Changing a single byte in a 2+1 raid5 or a 13+1 raid5
nagilum> requires exactly two 512byte blocks to be read and
nagilum> written from two different disks. Changing two bytes
nagilum> which are unaligned (the last and first byte of two
nagilum> consecutive stripes) doubles those figures, but more
nagilum> disks are involved.

Here you are using the astute misdirection of talking about
unaligned *byte* *updates* when the issue is unaligned
*stripe* *writes*.

If one used your scheme to write a 13+1 stripe one block at a
time, it would take 26R+26W operations (about half of which could
be cached) instead of the 14W that are required when doing
aligned stripe writes, which is what good file systems try to
achieve.

Well, 26R+26W may be a caricature, but the problem is that even
if one bunches updates of N blocks into a single read-N-blocks-plus-
parity, write-N-blocks-plus-parity operation, it is still RMW, just
a smaller RMW than a full stripe RMW.

And reading before writing can kill write performance, because
it is a two-pass algorithm and a two-pass algorithm is pretty
bad news for disk work, and even more so, given most OS and disk
elevator algorithms, for one pass of reads and one of writes
dependent on the reads.

But enough of talking about absurd cases, let's do a good clear
example of why a 13+1 is bad bad bad when doing unaligned writes.

Consider writing to a 2+1 and an 13+1 just 15 blocks in 4+4+4+3
bunches, starting with block 0 (so aligned start, unaligned
bunch length, unaligned total length), a random case but quite
illustrative:

  2+1:
	00 01 P1 03 04 P2 06 07 P3 09 10 P4
        00 01    02 03    04 05    06 07   
        ------**-------** ------**-------**
        12 13 P5 15 16 P6 18 19 P7 21 22 P8
        08 09    10 11    12 13    14
        ------**-------** ------**---    **

	write D00 D01 DP1
	write D03 D04 DP2

	write D06 D07 DP3
	write D09 D10 DP4

	write D12 D13 DP5
	write D15 D16 DP6

	write D18 D19 DP7
	read  D21 DP8
	write D21 DP8

        Total:
	  IOP: 01 reads, 08 writes
	  BLK: 02 reads, 23 writes
	  XOR: 28 reads, 15 writes

 13+1:
	00 01 02 03 04 05 06 07 08 09 10 11 12 P1
        00 01 02 03 04 05 06 07 08 09 10 11 12
        ----------- ----------- ----------- -- **
	
        14 15 16 17 18 19 20 21 22 23 24 25 26 P2
	13 14
	-----                                  **

	read  D00 D01 D02 D03 DP1
	write D00 D01 D02 D03 DP1

	read  D04 D05 D06 D07 DP1
	write D04 D05 D06 D07 DP1

	read  D08 D09 D10 D11 DP1
	write D08 D09 D10 D11 DP1

	read  D12 DP1 D14 D15 DP2
	write D12 DP1 D14 D15 DP2

        Total:
	  IOP: 04 reads, 04 writes
	  BLK: 20 reads, 20 writes
	  XOR: 34 reads, 10 writes

The short stripe size means that one does not need to RMW in
many cases, just W, and this despite the much higher redundancy
of 2+1. It also means that there are lots of parity blocks to
compute and write. With a 4 block operation length a 3+1, or even
more a 4+1, would be flattered here, but I wanted to exemplify
two extremes.

The narrow parallelism, and thus short stripe length, of 2+1 means
that far fewer blocks get transferred because there is almost no RM,
but it does 9 IOPs while 13+1 does one less at 8 (wider
parallelism); but then the 2+1 IOPs are mostly in back-to-back
write pairs, while the 13+1 ones are in read-rewrite pairs, which is
a significant disadvantage (often greatly underestimated).

Never mind that the number of IOPs is almost the same despite
the large difference in width, and that with the same disks as a
13+1 one can build something like 4 2+1/3+1 arrays, thus gaining a
lot of parallelism across threads, if there is such to be
obtained. And if one really wants to write long stripes, one
should use RAID10 of course, not long stripes with a single (or
two) parity blocks.

In the above example the length of the transfer is not aligned
with either the 2+1 or 13+1 stripe length; if the starting block
is unaligned too, then things look worse for 2+1, but that is a
pathologically bad case (and at the same time a pathologically
good case for 13+1):

  2+1:
	00 01 P1|03 04 P2|06 07 P3|09 10 P4|12
           00   |01 02   |03 04   |05 06   |07
           ---**|------**|-- ---**|------**|--
        13 P5|15 16 P6|18 19 P7|21 22 P8
        08   |09 10   |11 12   |13 14
        ---**|------**|-- ---**|------**

	read  D01 DP1
	read  D06 DP3
	write D01 DP1
	write D03 D04 DP2
        write D06 DP3

	read  D07 DP3
	read  D12 DP5
	write D07 DP3
	write D09 D10 DP4
	write D12 DP5

	read  D13 DP5
	read  D18 DP7
	write D13 DP5
	write D15 D16 DP6
	write D18 DP7

	read  D19 DP7
	write D19 DP7
	write D15 D16 DP6

        Total:
	  IOP: 07 reads, 11 writes
	  BLK: 14 reads, 26 writes
	  XOR: 36 reads, 18 writes

 13+1:
	00 01 02 03 04 05 06 07 08 09 10 11 12 P1|
           00 01 02 03 04 05 06 07 08 09 10 11   |
           ----------- ----------- ----------- **|
	
        14 15 16 17 18 19 20 21 22 23 24 25 26 P2
	12 13 14
	--------                               **

	read  D01 D02 D03 D04 DP1
	write D01 D02 D03 D04 DP1

	read  D05 D06 D07 D08 DP1
	write D05 D06 D07 D08 DP1

	read  D09 D10 D11 D12 DP1
	write D09 D10 D11 D12 DP1

	read  D14 D15 D16 DP2
	write D14 D15 D16 DP2

	Total:
	  IOP: 04 reads, 04 writes
	  BLK: 18 reads, 18 writes
	  XOR: 38 reads, 08 writes

Here 2+1 does only a bit over twice as many IOPs as 13+1, even
if the latter has much wider potential parallelism, because the
latter cannot take advantage of that. However in both cases the
cost of RMW is large.

Never mind that the chances of finding in the IO request
stream a set of back-to-back logical writes to 13 contiguous
blocks, starting aligned on a 13 block multiple, are bound to be
lower than those of getting a set of 2 or 3 blocks, and even
worse with a filesystem mostly built for the wrong stripe
alignment.

>> * Unfortunately the frequency of unaligned writes *does*
>>   usually depend on how dementedly one got to the 13+1 or
>>   12+2 case: because a filesystem that lays out files so that
>>   misalignment is minimised with a 2+1 stripe just about
>>   guarantees that when one switches to a 3+1 stripe all
>>   previously written data is misaligned, and so on -- and
>>   never mind that every time one adds a disk a reshape is
>>   done that shuffles stuff around.

nagilum> One can usually do away with specifying 2*Chunksize.

Following the same logic to the extreme one can use a linear
concatenation to avoid the problem, where stripes are written
consecutively on each disk and then the following disk. This
avoids any problems with unaligned stripe writes :-).

In general large chunksizes are not such a brilliant idea, even
if ill-considered benchmarks may show some small advantage with
somewhat larger chunksizes.


My general conclusion is that reshapes are a risky, bad for
performance, expensive operation that is available, like RAID5
in general (and especially RAID5 above 2+1 or in a pinch 3+1)
only for special cases when one cannot do otherwise and knows
exactly what the downside is (which seems somewhat rare).

I think that defending the concept of growing a 2+1 into a 13+1
via as many as 11 successive reshapes is quite ridiculous, even
more so when using fatuous arguments about 1 or 2 byte updates.

It is even worse than coming up with that idea itself, which is
itself worse than that of building a 13+1 to start with.

But hey, lots of people know better -- do you feel lucky? :-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: RAID5 to RAID6 reshape?
  2008-02-25  0:10             ` Peter Grandi
@ 2008-02-25 16:31               ` Nagilum
  0 siblings, 0 replies; 42+ messages in thread
From: Nagilum @ 2008-02-25 16:31 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

[-- Attachment #1: Type: text/plain, Size: 8276 bytes --]

----- Message from pg_lxra@lxra.for.sabi.co.UK ---------
     Date: Mon, 25 Feb 2008 00:10:07 +0000
     From: Peter Grandi <pg_lxra@lxra.for.sabi.co.UK>
Reply-To: Peter Grandi <pg_lxra@lxra.for.sabi.co.UK>
  Subject: Re: RAID5 to RAID6 reshape?
       To: Linux RAID <linux-raid@vger.kernel.org>


>>>> On Sat, 23 Feb 2008 21:40:08 +0100, Nagilum
>>>> <nagilum@nagilum.org> said:
>
> [ ... ]
>
>>> * Doing unaligned writes on a 13+1 or 12+2 is catastrophically
>>> slow because of the RMW cycle. This is of course independent
>>> of how one got to the something like 13+1 or a 12+2.
>
> nagilum> Changing a single byte in a 2+1 raid5 or a 13+1 raid5
> nagilum> requires exactly two 512byte blocks to be read and
> nagilum> written from two different disks. Changing two bytes
> nagilum> which are unaligned (the last and first byte of two
> nagilum> consecutive stripes) doubles those figures, but more
> nagilum> disks are involved.
>
> Here you are using the astute misdirection of talking about
> unaligned *byte* *updates* when the issue is unaligned
> *stripe* *writes*.

Which are (imho) much less likely to occur than minor changes in a  
block. (think touch, mv, chown, chmod, etc.)

> If one used your scheme to write a 13+1 stripe one block at a
> time would take 26R+26W operations (about half of which could be
> cached) instead of 14W which are what is required when doing
> aligned stripe writes, which is what good file systems try to
> achieve.
> ....
> But enough of talking about absurd cases, let's do a good clear
> example of why a 13+1 is bad bad bad when doing unaligned writes.
>
> Consider writing to a 2+1 and an 13+1 just 15 blocks in 4+4+4+3
> bunches, starting with block 0 (so aligned start, unaligned
> bunch length, unaligned total length), a random case but quite
> illustrative:
>
>   2+1:
> 	00 01 P1 03 04 P2 06 07 P3 09 10 P4
>         00 01    02 03    04 05    06 07
>         ------**-------** ------**-------**
>         12 13 P5 15 16 P6 18 19 P7 21 22 P8
>         08 09    10 11    12 13    14
>         ------**-------** ------**---    **
>
> 	write D00 D01 DP1
> 	write D03 D04 DP2
>
> 	write D06 D07 DP3
> 	write D09 D10 DP4
>
> 	write D12 D13 DP5
> 	write D15 D16 DP6
>
> 	write D18 D19 DP7
> 	read  D21 DP8
> 	write D21 DP8
>
>         Total:
> 	  IOP: 01 reads, 08 writes
> 	  BLK: 02 reads, 23 writes
> 	  XOR: 28 reads, 15 writes
>
>  13+1:
> 	00 01 02 03 04 05 06 07 08 09 10 11 12 P1
>         00 01 02 03 04 05 06 07 08 09 10 11 12
>         ----------- ----------- ----------- -- **
>
>         14 15 16 17 18 19 20 21 22 23 24 25 26 P2
> 	13 14
> 	-----                                  **
>
> 	read  D00 D01 D02 D03 DP1
> 	write D00 D01 D02 D03 DP1
>
> 	read  D04 D05 D06 D07 DP1
> 	write D04 D05 D06 D07 DP1
>
> 	read  D08 D09 D10 D11 DP1
> 	write D08 D09 D10 D11 DP1
>
> 	read  D12 DP1 D14 D15 DP2
> 	write D12 DP1 D14 D15 DP2
>
>         Total:
> 	  IOP: 04 reads, 04 writes
> 	  BLK: 20 reads, 20 writes
> 	  XOR: 34 reads, 10 writes

and now the same with cache:

	write D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 DP1
  	read  D14 D15 DP2
  	write D14 D15 DP2
         Total:
	  IOP: 01 reads, 02 writes
	  BLK: 03 reads, 18 writes
	  XOR: not sure what you're calculating here, but it's mostly  
irrelevant anyway, even my old Athlon500MHz can XOR >2.6GB/s iirc.

> The short stripe size means that one does not need to RMW in
> many cases, just W; and this despite that much higher redundancy
> of 2+1. it also means that there are lots of parity blocks to
> compute and write. With a 4 block operation length a 3+1 or even
> more a 4+1 would be flattered here, but I wanted to exemplify
> two extremes.

With a write cache the picture looks a bit better. If the writes
happen close enough together (temporally) they will be joined. If they
are further apart, chances are the write speed is not that critical
anyway.

> The narrow parallelism thus short stripe length of 2+1 means
> that a lot less blocks get transferred because of almost no RM,
> but it does 9 IOPs and 13+1 does one less at 8 (wider
> parallelism); but then the 2+1 IOPs are mostly in back-to-back
> write pairs, while the 13+1 are in read-rewrite pairs, which is
> a significant disadvantage (often greatly underestimated).
>
> Never mind that the number of IOPs is almost the same despite
> the large difference in width, and that can do with the same
> disks as a 13+1 something like 4 2+1/3+1 arrays, thus gaining a
> lot of parallelism across threads, if there is such to be
> obtained. And if one really wants to write long stripes, one
> should use RAID10 of course, not long stripes with a single (or
> two) parity blocks.
>

> Never mind that finding the chances of putting in the IO request
> stream a set of back-to-back logical writes to 13 contiguous
> blocks aligned starting on a 13 block multiple are bound to be
> lower than those of get a set of of 2 or 3 blocks, and even
> worse with a filesystem mostly built for the wrong stripe
> alignment.

I have yet to be convinced this difference is that significant.
I think most changes are updates of file attributes (e.g. atime).
File reads will perform better when spread over more disks.
File writes usually write the whole file, so it directly depends on
your file sizes, most of which are usually <1k. If this is for a digital
attic the media files will be in the many MB range. Both are equally  
good or bad for the described scenarios.
The advantage is limited to a certain window of file writes.
The size of that window depends on the number of disks just as much as  
it depends on the chunk size.
Depending on the individual usage scenario one or the other window is  
better suited.

>>> * Unfortunately the frequency of unaligned writes *does*
>>>   usually depend on how dementedly one got to the 13+1 or
>>>   12+2 case: because a filesystem that lays out files so that
>>>   misalignment is minimised with a 2+1 stripe just about
>>>   guarantees that when one switches to a 3+1 stripe all
>>>   previously written data is misaligned, and so on -- and
>>>   never mind that every time one adds a disk a reshape is
>>>   done that shuffles stuff around.
>
> In general large chunksizes are not such a brilliant idea, even
> if ill-considered benchmarks may show some small advantage with
> somewhat larger chunksizes.

Yeah.

> My general conclusion is that reshapes are a risky, bad for
> performance, expensive operation that is available, like RAID5
> in general (and especially RAID5 above 2+1 or in a pinch 3+1)
> only for special cases when one cannot do otherwise and knows
> exactly what the downside is (which seems somewhat rare).

Agreed, but performance is still acceptable albeit not optimal.

> I think that defending the concept of growing a 2+1 into a 13+1
> via as many as 11 successive reshapes is quite ridiculous, even
> more so when using fatuous arguments about 1 or 2 byte updates.

I don't know why you don't like the example. How many bytes change for  
an atime update?

> It is even worse than coming up with that idea itself, which is
> itself worse than that of building a 13+1 to start with.

The advantage is economic. One buys a few disks now and continues
to stack up over the course of the years as storage need increases.
But I wouldn't voluntarily do a raid5 with more than 8 disks either.
Kind regards,

----- End message from pg_lxra@lxra.for.sabi.co.UK -----



========================================================================
#    _  __          _ __     http://www.nagilum.org/ \n icq://69646724 #
#   / |/ /__ ____ _(_) /_ ____ _  nagilum@nagilum.org \n +491776461165 #
#  /    / _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
#           /___/     x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
========================================================================


----------------------------------------------------------------
cakebox.homeunix.net - all the machine one needs..


[-- Attachment #2: PGP Digital Signature --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-02-22 13:41               ` LVM performance Oliver Martin
@ 2008-03-07  8:14                 ` Peter Grandi
  2008-03-09 19:56                   ` Oliver Martin
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2008-03-07  8:14 UTC (permalink / raw)
  To: Linux RAID

[ .... ]

Sorry for the long delay in replying...

om> $ hdparm -t /dev/md0

om> /dev/md0:
om>   Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec

om> $ hdparm -t /dev/dm-0

om> /dev/dm-0:
om>   Timing buffered disk reads:  116 MB in  3.04 seconds = 38.20 MB/sec

om> [ ... ] but right now, I only have 500GB drives. [ ... ]

pg> Those are as such not very meaningful. What matters most is
pg> whether the starting physical address of each logical volume
pg> extent is stripe aligned (and whether the filesystem makes use
pg> of that) and then the stripe size of the parity RAID set, not
pg> the chunk sizes in themselves. [ ... ]

om> Am I right to assume that stripe alignment matters because
om> of the read-modify-write cycle needed for unaligned writes?

Sure, if you are writing as you say later. Note also that I was
commenting on the points made about chunk size and alignment:

  jk> [ ... ] This might be related to raid chunk positioning with
  jk> respect to LVM chunk positioning. If they interfere there
  jk> indeed may be some performance drop. Best to make sure that
  jk> those chunks are aligned together. [ ... ]

  om> I'm seeing a 20% performance drop too, with default RAID
  om> and LVM chunk sizes of 64K and 4M, respectively. Since 64K
  om> divides 4M evenly, I'd think there shouldn't be such a big
  om> performance penalty.

As I said, if there is an issue with "interference", it is about
stripes, not chunks, and both alignment and size, not just size.

But in your case, as you point out, the issue is not with that,
because when reading a RAID5 behaves like a slightly smaller
RAID0, so the cause is different:

om> If so, how come a pure read benchmark (hdparm -t or plain
om> dd) is slower on the LVM device than on the md device?

Ahhh because the benchmark you are doing is not very meaningful
either, not just the speculation about chunk sizes.

Reading from the outer tracks of a RAID5 2+1 on contemporary
500GB drives should give you at least 100-120MB/s (as if it were
a 2x RAID0), and the numbers that you are reporting above seem
meaningless for a comparison between MD and DM, because there
must be something else that makes them both perform very badly.

Odds are that your test was afflicted by the page cache
read-ahead horror that several people have reported, and that I
have investigated in detail in a recent posting to this list,
with the conclusion that it is a particularly grave flaw in the
design and implementation of Linux IO.

Since the horror comes from poor scheduling of streaming read
sequences, there is wide variability among tests using the same
setup, and most likely DM and MD have a slightly different
interaction with the page cache.

PS: maybe you are getting 40-50MB/s only because of some other
    reason, e.g. a slow host adapter or host bus, but whatever
    it is, it results in an improper comparison between DM and
    MD.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-07  8:14                 ` Peter Grandi
@ 2008-03-09 19:56                   ` Oliver Martin
  2008-03-09 21:13                     ` Michael Guntsche
  0 siblings, 1 reply; 42+ messages in thread
From: Oliver Martin @ 2008-03-09 19:56 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

Peter Grandi schrieb:
> pg> Those are as such not very meaningful. What matters most is
> pg> whether the starting physical address of each logical volume
> pg> extent is stripe aligned (and whether the filesystem makes use
> pg> of that) and then the stripe size of the parity RAID set, not
> pg> the chunk sizes in themselves. [ ... ]
> 
> om> Am I right to assume that stripe alignment matters because
> om> of the read-modify-write cycle needed for unaligned writes?
> 
> Sure, if you are writing as you say later. Note also that I was
> commenting on the points made about chunk size and alignment:
> 
>   jk> [ ... ] This might be related to raid chunk positioning with
>   jk> respect to LVM chunk positioning. If they interfere there
>   jk> indeed may be some performance drop. Best to make sure that
>   jk> those chunks are aligned together. [ ... ]
> 
>   om> I'm seeing a 20% performance drop too, with default RAID
>   om> and LVM chunk sizes of 64K and 4M, respectively. Since 64K
>   om> divides 4M evenly, I'd think there shouldn't be such a big
>   om> performance penalty.
> 
> As I said, if there is an issue with "interference", it is about
> stripes, not chunks, and both alignment and size, not just size.
> 

Thanks for explaining this. I think I finally got it ;-).
I will probably recreate the array anyway, so I might as well do it 
right this time. I currently have three drives, but when I run out of 
space, I will add a fourth. So the setup should be prepared for a reshape.

Based on what I understand, the things to look out for are:

  * LVM/md first extent stripe alignment: when creating the PV, specify 
a --metadatasize that is divisible by all anticipated stripe sizes, 
i.e., by their least common multiple. For example, to accommodate 3, 4 
or 5 drive configurations with 64KB chunk size, that would be 768KB.

  * Alignment of other extents: for the initial array creation with 3 
drives the default 4MB extent size is fine. When I add a fourth drive, I 
can resize the extents with vgchange - though I'm a bit hesitant as the 
manpage doesn't explicitly say that this doesn't destroy any data. The 
bigger problem is that the extent size must be a power of two, so the 
maximum I can use with 192KB stripe size is 64KB. I'll see if that hurts 
performance. The vgchange manpage says it doesn't...

  * Telling the file system that the underlying device is striped. ext3 
has the stride parameter, and changing it doesn't seem to be possible. 
XFS might be better, as the swidth/sunit options can be set at 
mount-time. This would speed up writes, while reads of existing data 
wouldn't be affected too much by the misalignment anyway. Right?
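To double-check the array geometry those calculations depend on, something 
like this should do (md0 is a placeholder):

	mdadm --detail /dev/md0 | egrep 'Chunk Size|Raid Devices'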

> Reading from the outer tracks of a RAID5 2+1 on contemporary
> 500GB drives should give you at least 100-120MB/s (as if it were
> a 2x RAID0), and the numbers that you are reporting above seem
> meaningless for a comparison between MD and DM, because there
> must be something else that makes them both perform very badly.

The general slowness is due to the fact that I'm using external drives, 
two USB ones and one Firewire. To add insult to injury, the two USB 
drives share one port with a hub, so it's obviously not going to be very 
fast. Also, it's probably not the best idea since it adds another 
potential single point of failure...
That said, the machine is currently down for hardware troubleshooting 
(see my other thread "RAID-5 data corruption" for that) and if it turns 
out the USB controller is indeed flaky, I might end up replacing the 
whole thing with a more sensible configuration.

> 
> Odds are that your test was afflicted by the page cache
> read-ahead horror that several people have reported, and that I
> have investigated in detail in a recent posting to this list,
> with the conclusion that it is a particularly grave flaw in the
> design and implementation of Linux IO.

Do you mean the "slow raid5 performance" thread from October where you 
pointed out that the page cache is rather CPU-intensive?
Also, it might be related to read-ahead: It was 128 for md0 and 256 for 
dm-0, and after I set it to 3072 for both, I got about the same 
sequential read performance (~50MB/s) for both.
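One way to set it, in case anyone wants to reproduce this (blockdev takes 
the read-ahead value in 512-byte sectors; device names as above):

	blockdev --getra /dev/md0
	blockdev --setra 3072 /dev/md0
	blockdev --setra 3072 /dev/dm-0
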
> 
> Since the horror comes from poor scheduling of streaming read
> sequences, there is wide variability among tests using the same
> setup, and most likely DM and MD have a slightly different
> interaction with the page cache.
> 
> PS: maybe you are getting 40-50MB/s only because of some other
>     reason, e.g. a slow host adapter or host bus, but whatever
>     it is, it results in an improper comparison between DM and
>     MD.

Okay, I'll shut up. :-)

-- 
Oliver

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-09 19:56                   ` Oliver Martin
@ 2008-03-09 21:13                     ` Michael Guntsche
  2008-03-09 23:27                       ` Oliver Martin
  0 siblings, 1 reply; 42+ messages in thread
From: Michael Guntsche @ 2008-03-09 21:13 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1879 bytes --]


On Mar 9, 2008, at 20:56, Oliver Martin wrote:

> * LVM/md first extent stripe alignment: when creating the PV,  
> specify a --metadatasize that is divisible by all anticipated stripe  
> sizes, i.e., the least common multiple. For example, to accommodate  
> for 3, 4 or 5 drive configurations with 64KB chunk size, that would  
> be 768KB.
>

Aligning it on the chunk size should be enough, so in your case 64KB.
Personally I did a lot of tests during the last few weeks and this
seemed to not make that big of a difference.


> * Alignment of other extents: for the initial array creation with 3  
> drives the default 4MB extent size is fine. When I add a fourth  
> drive, I can resize the extents with vgchange - though I'm a bit  
> hesitant as the manpage doesn't explicitly say that this doesn't  
> destroy any data. The bigger problem is that the extent size must be  
> a power of two, so the maximum I can use with 192KB stripe size is  
> 64KB. I'll see if that hurts performance. The vgchange manpage says  
> it doesn't...
>

Why make the extents so small? You do not normally increase your LVs
by 4MB. I use 256MB or 512MB extents.

> * Telling the file system that the underlying device is striped.  
> ext3 has the stride parameter, and changing it doesn't seem to be  
> possible. XFS might be better, as the swidth/sunit options can be  
> set at mount-time. This would speed up writes, while reads of  
> existing data wouldn't be affected too much by the misalignment  
> anyway. Right?
>

You can change the stride parameter of ext3 with tune2fs; take a look
at the -E switch, even after you created the filesystem.
That said, bonnie++ results showed that while setting a correct stride
for EXT3 sped up the creation and deletion of files, the big
sequential read and write tests suffered. But this is bonnie++......
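As a sketch, assuming 64KB raid chunks and a 4KB ext3 block size
(stride = 64/4 = 16; the device name is just a placeholder):

	tune2fs -E stride=16 /dev/vg0/storage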

Hope that helps,
Michael



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2417 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-09 21:13                     ` Michael Guntsche
@ 2008-03-09 23:27                       ` Oliver Martin
  2008-03-09 23:53                         ` Michael Guntsche
  2008-03-10  0:32                         ` Richard Scobie
  0 siblings, 2 replies; 42+ messages in thread
From: Oliver Martin @ 2008-03-09 23:27 UTC (permalink / raw)
  To: Michael Guntsche; +Cc: linux-raid, Oliver Martin

Michael Guntsche schrieb:
> 
> On Mar 9, 2008, at 20:56, Oliver Martin wrote:
> 
>> * LVM/md first extent stripe alignment: when creating the PV, specify 
>> a --metadatasize that is divisible by all anticipated stripe sizes, 
>> i.e., the least common multiple. For example, to accommodate for 3, 4 
>> or 5 drive configurations with 64KB chunk size, that would be 768KB.
>>
> 
> Aligning it on chunk-size should be enough so in your case 64KB. 
> Personally I did a lot of tests during  the last few weeks and this 
> seemed to make not that big of a difference.

Hmm. Stripe alignment of the beginning of a file system would seem to 
make sense. Otherwise, even if I tell the file system the stripe size, 
how should it know where it's best to start writes? If my stripes are 
128KB, and I tell the fs, it can make an effort to write 128KB blocks 
whenever possible. But if the fs starts 64KB into a 128KB stripe, every 
128KB write will cause two RMW cycles.
At least, that's how I understand it. Maybe there's something else 
involved and it really doesn't make a difference?
> 
> 
>> * Alignment of other extents: for the initial array creation with 3 
>> drives the default 4MB extent size is fine. When I add a fourth drive, 
>> I can resize the extents with vgchange - though I'm a bit hesitant as 
>> the manpage doesn't explicitly say that this doesn't destroy any data. 
>> The bigger problem is that the extent size must be a power of two, so 
>> the maximum I can use with 192KB stripe size is 64KB. I'll see if that 
>> hurts performance. The vgchange manpage says it doesn't...
>>
> 
> Why make the extents so small? You do not normally increase your LVs by 
> 4MB. I use 256MB or 512MB extends.

I was under the impression that aligning the LVM extents to RAID stripes 
was crucial ("What matters most is whether the starting physical address 
of each logical volume extent is stripe aligned"). If the LVM extents 
have nothing to do with how much is read/written at once, but rather 
only define the granularity with which LVs can be created, aligning the 
first extent could be enough.
Of course, I don't extend my LVs by 4MB, much less 64KB. The only reason 
I use LVM at all is because I might one day add larger drives to the 
array. Suppose I have 3 500GB drives and 2 750GB ones. In this 
configuration I would use a 5-drive array with 500GB from each, and a 
2-drive array with the rest on the larger ones. These two arrays would 
then be joined by LVM to form one file system.
Thinking it all through again, I see that trying to align things to 
stripes is utterly pointless as soon as I join arrays with different 
stripe sizes together. Maybe I should revise my plan, or just accept 
that I won't be getting optimal performance.

> 
>> * Telling the file system that the underlying device is striped. ext3 
>> has the stride parameter, and changing it doesn't seem to be possible. 
>> XFS might be better, as the swidth/sunit options can be set at 
>> mount-time. This would speed up writes, while reads of existing data 
>> wouldn't be affected too much by the misalignment anyway. Right?
>>
> 
> You can change the stride parameter of ext3 with tune2fs take a look at 
> the -E switch, even after you created the filesystem.

Ah, I see that's a very recent addition to 1.40.7. Thanks for pointing 
that out!

> That said bonnie++ results showed that while setting a correct stride 
> for EXT3 increased the creation and deletion of files, the big 
> sequential read and write tests suffered. But this is bonnie++......

Sorry, no idea why.

-- 
Oliver

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-09 23:27                       ` Oliver Martin
@ 2008-03-09 23:53                         ` Michael Guntsche
  2008-03-10  8:54                           ` Oliver Martin
  2008-03-10  0:32                         ` Richard Scobie
  1 sibling, 1 reply; 42+ messages in thread
From: Michael Guntsche @ 2008-03-09 23:53 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1103 bytes --]


On Mar 10, 2008, at 0:27, Oliver Martin wrote:
>
> Hmm. Stripe alignment of the beginning of a file system would seem  
> to make sense. Otherwise, even if I tell the file system the stripe  
> size, how should it know where it's best to start writes? If my  
> stripes are 128KB, and I tell the fs, it can make an effort to write  
> 128KB blocks whenever possible. But if the fs starts 64KB into a  
> 128KB stripe, every 128KB write will cause two RMW cycles.
> At least, that's how I understand it. Maybe there's something else  
> involved and it really doesn't make a difference?

That's exactly what I meant, sorry for not being clear enough.
If you have a chunk size of 128KB it makes sense to align the
beginning of the PV to this as well. Since the header is 64KB itself
you should use a --metadatasize of 64K so it becomes 128K.
I used a chunk size of 256K so I set the metadatasize to 192K (256-64).
You can also use that if you want.
That said, I did not see a big difference when running the benchmarks.
A good way to see where the PE starts is

	pvs -o+pe_start
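So for my 256K chunk case that is (a sketch; /dev/md1 is a placeholder):

	pvcreate --metadatasize 192k /dev/md1
	# per the 64K-header rule above, pe_start should now report 256K
	pvs -o+pe_start /dev/md1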

Kind regards,
Michael 

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2417 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-09 23:27                       ` Oliver Martin
  2008-03-09 23:53                         ` Michael Guntsche
@ 2008-03-10  0:32                         ` Richard Scobie
  2008-03-10  0:53                           ` Michael Guntsche
  1 sibling, 1 reply; 42+ messages in thread
From: Richard Scobie @ 2008-03-10  0:32 UTC (permalink / raw)
  To: Linux RAID Mailing List

Oliver Martin wrote:
> Michael Guntsche schrieb:

> whenever possible. But if the fs starts 64KB into a 128KB stripe, every 
> 128KB write will cause two RMW cycles.
> At least, that's how I understand it. Maybe there's something else 
> involved and it really doesn't make a difference?

As I understand it, XFS is smart enough to work with md RAID and 
automagically set the correct swidth and sunit sizes to suit the array.
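If in doubt, the sunit/swidth values a filesystem ended up with can be 
checked afterwards (the mount point is a placeholder):

	xfs_info /mnt/array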

Regards,

Richard

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-10  0:32                         ` Richard Scobie
@ 2008-03-10  0:53                           ` Michael Guntsche
  2008-03-10  0:59                             ` Richard Scobie
  0 siblings, 1 reply; 42+ messages in thread
From: Michael Guntsche @ 2008-03-10  0:53 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 701 bytes --]

	
On Mar 10, 2008, at 1:32, Richard Scobie wrote:

> Oliver Martin wrote:
>> Michael Guntsche schrieb:
>
>> whenever possible. But if the fs starts 64KB into a 128KB stripe,  
>> every 128KB write will cause two RMW cycles.
>> At least, that's how I understand it. Maybe there's something else  
>> involved and it really doesn't make a difference?
>
> As I understand it, XFS is smart enough to work with md RAID and  
> automagically set the correct swidth and sunit sizes to suit the  
> array.

That's true, but only if you create an XFS filesystem on the md device
directly. If LVM is in between, mkfs.xfs cannot figure it out and you
have to specify the values yourself.
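E.g. for a 3-disk RAID5 with 64KB chunks (2 data disks), something along
these lines -- just a sketch, the LV path is a placeholder:

	mkfs.xfs -d su=64k,sw=2 /dev/vg0/storage
	# or, for an existing filesystem, at mount time (values in 512-byte sectors)
	mount -o sunit=128,swidth=256 /dev/vg0/storage /storage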


Kind regards,
Michael

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2417 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-10  0:53                           ` Michael Guntsche
@ 2008-03-10  0:59                             ` Richard Scobie
  2008-03-10  1:21                               ` Michael Guntsche
  0 siblings, 1 reply; 42+ messages in thread
From: Richard Scobie @ 2008-03-10  0:59 UTC (permalink / raw)
  To: Linux RAID Mailing List

Michael Guntsche wrote:

> That's true but only if you create a XFS filesystem on the md device  
> directly. If LVM is in between mkfs.xfs cannot figure it out and you  
> have to specify the values yourself.

I wondered about that after I wrote and found this in the mkfs.xfs man page:

"sw=value

suboption is an alternative to using swidth.  The sw suboption is used 
to specify the  stripe  width for  a  RAID  device or striped logical 
volume. The value is expressed as a multiplier of the stripe unit, 
usually the same as the number of stripe members in the logical volume 
configuration, or  data disks in a RAID device.

When a filesystem is created on a logical volume device, mkfs.xfs will 
automatically query the logical volume for appropriate sunit and swidth 
values."

So perhaps it'll do the same on LVM?

Regards,

Richard

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-10  0:59                             ` Richard Scobie
@ 2008-03-10  1:21                               ` Michael Guntsche
  0 siblings, 0 replies; 42+ messages in thread
From: Michael Guntsche @ 2008-03-10  1:21 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 544 bytes --]


On Mar 10, 2008, at 1:59, Richard Scobie wrote:

>
> When a filesystem is created on a logical volume device, mkfs.xfs  
> will automatically query the logical volume for appropriate sunit  
> and swidth values."


If it is a "striped" logical volume it will do that, but since you are
creating the LVM on ONE PV, namely the md device itself, there will be
no way for the LVM to stripe it.
Of course you could create two MDs and create an LVM stripe across
those too, but I do not think there is any benefit in that.

Kind regards,
Michael

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2417 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-09 23:53                         ` Michael Guntsche
@ 2008-03-10  8:54                           ` Oliver Martin
  2008-03-10 21:04                             ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: Oliver Martin @ 2008-03-10  8:54 UTC (permalink / raw)
  To: Michael Guntsche; +Cc: linux-raid

Michael Guntsche schrieb:
> That's exactly what I meant, sorry for not being clear enough.
> If you have a chunk size of 128KB it makes sense to align the beginning 
> of the PV to this as well.

I was talking about stripe size, not chunk size. That 128KB stripe size 
is made up of n-1 chunks of an n-disk raid-5. In this case, 3 disks and 
64KB chunk size result in 128KB stripe size.

I assume if you tell the file system about this stripe size (or it 
figures it out itself, as xfs does), it tries to align its structures 
such that whole-stripe writes are more likely than partial writes. This 
means that md only has to write 3*64KB (2x data + parity).

If a write touches both (data_chunk_1 + offset) and (data_chunk_2 + 
offset), you can calculate (parity_chunk + offset) without reading 
anything. If it doesn't change all data chunks, then to calculate parity 
you have to read either
* the current parity plus the data chunk(s) to be changed, or
* all the other data chunks.

So, if this 128KB write is offset by half a stripe, md has to read one 
of the chunks from each stripe prior to writing so it can update parity. 
Also, there are two parity chunks to write. So that's 2*64KB read + 
4*64KB write.

That's what I meant with stripe-aligning the PV (and thus the LV and 
thus the file system).

-- 
Oliver

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-10  8:54                           ` Oliver Martin
@ 2008-03-10 21:04                             ` Peter Grandi
  2008-03-12 14:03                               ` Michael Guntsche
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2008-03-10 21:04 UTC (permalink / raw)
  To: Linux RAID

>>> On Mon, 10 Mar 2008 09:54:07 +0100, Oliver Martin
>>> <oliver.martin@student.tuwien.ac.at> said:

[ ... ]

> I was talking about stripe size, not chunk size. That 128KB
> stripe size is made up of n-1 chunks of an n-disk raid-5. In
> this case, 3 disks and 64KB chunk size result in 128KB stripe
> size.

Uhm, usually I would say that in such a case the stripe size is
192KiB, of which 128KiB are the data capacity/payload.

I usually think of the stripe as it is recorded on the array,
from the point of view of the RAID software. As you say here:

> I assume if you tell the file system about this stripe size
> (or it figures it out itself, as xfs does), it tries to align
> its structures such that whole-stripe writes are more likely
> than partial writes. This means that md only has to write
> 3*64KB (2x data + parity).

Indeed, the application above the filesystem has to write
carefully in 128KiB long, 128KiB aligned (to the start of the
array, not the start of the overlying volume, as you point out)
transactions to avoid the high costs you describe here and
elsewhere.

As I was arguing in a recent post (with very explicit examples),
the wider the array (and the larger the chunk size) the worse
the cost, and the lower the chances that the application (usually
the file system) manages to put together properly sized and aligned
transactions. XFS delayed writes are useful as to that.

But then one of the many advantages of RAID10 is that all these
complications are largely irrelevant with it...

[ ... ]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-10 21:04                             ` Peter Grandi
@ 2008-03-12 14:03                               ` Michael Guntsche
  2008-03-12 19:54                                 ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: Michael Guntsche @ 2008-03-12 14:03 UTC (permalink / raw)
  To: Linux RAID



On Mon, 10 Mar 2008 21:04:56 +0000, pg_lxra@lxra.for.sabi.co.UK (Peter
Grandi) wrote:
<snip>
> Uhm, usually I would say that in such a case the stripe size is
> 192KiB, of which 128KiB are the data capacity/payload.
> 
> I usually think of the stripe as it is recorded on the array,
> from the point of view of the RAID software. As you say here:
> 
>> I assume if you tell the file system about this stripe size
>> (or it figures it out itself, as xfs does), it tries to align
>> its structures such that whole-stripe writes are more likely
>> than partial writes. This means that md only has to write
>> 3*64KB (2x data + parity).
> 
> Indeed, indeed the application above the filesystem has to write
> carefully in 128KiB long, 128KiB aligned (to the start of the
> array, not the start of the overlaying volume, as you point out)
> transactions to avoid the high costs you describe here and
> elsewhere.

Ok, by now the horse of this thread has been beaten to death several
times.
But there has to be a logical reason why my RAID-5, which has been running
in test mode for the last weeks, has not been put into production yet.
Thinking about all the alignment talk I read about on this list, I did one
final test.

Right now I have my trusty 4 DISK RAID-5 with a CHUNKSIZE of 256KB, thus
having a stripe-size of 1MB.

Below you will find the bonnie results. 
The first one is plain XFS on the MD device, do not ask me why the numbers
are so low for the file tests but right now I do not care. :)
The next entry is with a CHUNKSIZE aligned LVM volume.

I did:

   pvcreate --metadatasize 192k /dev/md1 <--- 192=256-64 where 256 is the
chunksize and 64 is the PV header

The final entry is a STRIPESIZE aligned LVM volume

  pvcreate --metadatasize 960K /dev/md1 <-- 960=1024-64 where 1024 is the
stripesize and ......

Version  1.03c      ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xfs              8G           49398  43 26252  21           116564  45 177.8   2
lvm-chunkaligned 8G           45937  42 23711  24           102184  50 154.3   2
lvm-stripealigne 8G           49271  43 24401  25           116136  50 167.9   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
xfs 16:100000:16/64   196  13   211  10  1531  49   205  13    45   2  1532  46
lvm 16:100000:16/64   634  25   389  25  2307  56   695  26    74   4   514  34
lvm 16:100000:16/64   712  27   383  25  2346  52   769  27    59   3  1303  46

As you can see it apparently does make a difference if you stripe align or
not, like everyone else said.
My main mistake was that I always confused CHUNK and STRIPE size when
talking and testing.

Hopefully this will help someone who is searching the archives for some
answers.

Kind regards,
Michael

PS: This list has given me a lot of valuable information and I want to
thank everyone for their support, especially the guys who never got tired
answering my sometimes stupid questions during the last weeks.







^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-12 14:03                               ` Michael Guntsche
@ 2008-03-12 19:54                                 ` Peter Grandi
  2008-03-12 20:11                                   ` Guntsche Michael
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2008-03-12 19:54 UTC (permalink / raw)
  To: Linux RAID

[ ... poor dead horse of alignment  :-) ... ]

mike> Right now I have my trusty 4 DISK RAID-5 with a CHUNKSIZE
mike> of 256KB, thus having a stripe-size of 1MB.

A largish chunk size with a largish stripe size may not be a
particularly good idea for sequential IO; it is more suited to
multithreaded access or perhaps random access.

mike> Version  1.03c      ------Sequential Output------ [ ... ]
mike>                     -Per Chr- --Block-- -Rewrite- [ ... ]
mike> Machine        Size K/sec %CP K/sec %CP K/sec %CP [ ... ]
mike> xfs              8G           49398  43 26252  21 [ ... ]
mike> lvm-chunkaligned 8G           45937  42 23711  24 [ ... ]
mike> lvm-stripealigne 8G           49271  43 24401  25 [ ... ]

mike> As you can see it apparently does make a difference if you
mike> stripe align or not, [ ... ]

But it should make a *much* bigger difference, and a 3+1 RAID5
should perform *a lot* better. As in 100-150MB/s (factor of 2-3
over a single disk) reading and (if aligned) writing.

Perhaps there is some problem with your IO subsystem (USB drives?
4 ATA drives attached to only 2 channels?  5-10 year old disks?).

BTW, Bonnie 1.03 or Bonnie++ or Iozone are not necessarily the
best way to check things; I recommend Bonnie 1.14 with the
options '-u -y -o_direct'.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: LVM performance
  2008-03-12 19:54                                 ` Peter Grandi
@ 2008-03-12 20:11                                   ` Guntsche Michael
  0 siblings, 0 replies; 42+ messages in thread
From: Guntsche Michael @ 2008-03-12 20:11 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1346 bytes --]


On Mar 12, 2008, at 20:54, Peter Grandi wrote:

> A largish chunk size with a largish stripe size may not be a
> particularly good idea for sequential IO, more for multithreaded
> access or for random access perhaps.
>
> mike> Version  1.03c      ------Sequential Output------ [ ... ]
> mike>                     -Per Chr- --Block-- -Rewrite- [ ... ]
> mike> Machine        Size K/sec %CP K/sec %CP K/sec %CP [ ... ]
> mike> xfs              8G           49398  43 26252  21 [ ... ]
> mike> lvm-chunkaligned 8G           45937  42 23711  24 [ ... ]
> mike> lvm-stripealigne 8G           49271  43 24401  25 [ ... ]
>
>
> But it should make a *much* bigger difference, and a 3+1 RAID5
> should perform *a lot* better. As in 100-150MB/s (factor of 2-3
> over a single disk) reading and (if aligned) writing.

I think for my setup here the numbers are okay.
All four disks are attached to a 4-port SATA PCI card. Yes, no typo,
there is no "e" at the end. :)
Thus ~100MB/s looks like the optimum I will get out of this setup for
reading, regardless of the number of disks.
If I had the money, I would definitely go with something faster,
attached to PCIe, and a RAID-10.
But for now this rig has to do the job.

All I am trying to do right now is keeping the difference between
MD+XFS and MD+LVM+XFS as small as possible.

Kind regards,
Michael 

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2417 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2008-03-12 20:11 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-02-17  3:58 RAID5 to RAID6 reshape? Beolach
2008-02-17 11:50 ` Peter Grandi
2008-02-17 14:45   ` Conway S. Smith
2008-02-18  5:26     ` Janek Kozicki
2008-02-18 12:38       ` Beolach
2008-02-18 14:42         ` Janek Kozicki
2008-02-19 19:41           ` LVM performance (was: Re: RAID5 to RAID6 reshape?) Oliver Martin
2008-02-19 19:52             ` Jon Nelson
2008-02-19 20:00               ` Iustin Pop
2008-02-19 23:19             ` LVM performance Peter Rabbitson
2008-02-20 12:19             ` LVM performance (was: Re: RAID5 to RAID6 reshape?) Peter Grandi
2008-02-22 13:41               ` LVM performance Oliver Martin
2008-03-07  8:14                 ` Peter Grandi
2008-03-09 19:56                   ` Oliver Martin
2008-03-09 21:13                     ` Michael Guntsche
2008-03-09 23:27                       ` Oliver Martin
2008-03-09 23:53                         ` Michael Guntsche
2008-03-10  8:54                           ` Oliver Martin
2008-03-10 21:04                             ` Peter Grandi
2008-03-12 14:03                               ` Michael Guntsche
2008-03-12 19:54                                 ` Peter Grandi
2008-03-12 20:11                                   ` Guntsche Michael
2008-03-10  0:32                         ` Richard Scobie
2008-03-10  0:53                           ` Michael Guntsche
2008-03-10  0:59                             ` Richard Scobie
2008-03-10  1:21                               ` Michael Guntsche
2008-02-18 19:05     ` RAID5 to RAID6 reshape? Peter Grandi
2008-02-20  6:39       ` Alexander Kühn
2008-02-22  8:13         ` Peter Grandi
2008-02-23 20:40           ` Nagilum
2008-02-25  0:10             ` Peter Grandi
2008-02-25 16:31               ` Nagilum
2008-02-17 13:31 ` Janek Kozicki
2008-02-17 16:18   ` Conway S. Smith
2008-02-18  3:48     ` Neil Brown
2008-02-17 22:40   ` Mark Hahn
2008-02-17 23:54     ` Janek Kozicki
2008-02-18 12:46     ` Andre Noll
2008-02-18 18:23       ` Mark Hahn
2008-02-17 14:06 ` Janek Kozicki
2008-02-17 23:54   ` cat
2008-02-18  3:43 ` Neil Brown
