* RAID-10 explicitly defined drive pairs?
From: Jan Kasprzak @ 2011-12-12 11:54 UTC (permalink / raw)
To: linux-raid
Hello, Linux RAID gurus,
I have a new server with two identical external disk shelves (22 drives each),
which will be connected to the server with a pair of SAS cables.
I want to use RAID-10 on these disks, but I want it to be configured
so that the data will always be mirrored between the shelves.
I.e. I want to be able to overcome complete failure of a single shelf.
Is there any way to tell mdadm explicitly how to set up
the pairs of mirrored drives inside a RAID-10 volume?
Thanks,
-Yenya
--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
Please don't top post and in particular don't attach entire digests to your
mail or we'll all soon be using bittorrent to read the list. --Alan Cox
* Re: RAID-10 explicitly defined drive pairs?
From: John Robinson @ 2011-12-12 15:33 UTC (permalink / raw)
To: Jan Kasprzak; +Cc: linux-raid
On 12/12/2011 11:54, Jan Kasprzak wrote:
> Hello, Linux RAID gurus,
>
> I have a new server with two identical external disk shelves (22 drives each),
> which will be connected to the server with a pair of SAS cables.
> I want to use RAID-10 on these disks, but I want it to be configured
> so that the data will always be mirrored between the shelves.
> I.e. I want to be able to overcome complete failure of a single shelf.
>
> Is there any way to tell mdadm explicitly how to set up
> the pairs of mirrored drives inside a RAID-10 volume?
If you're using RAID10,n2 (the default layout) then adjacent pairs of
drives in the create command will be mirrors, so your command line
should be something like:
# mdadm --create /dev/mdX -l10 -pn2 -n44 /dev/shelf1drive1
/dev/shelf2drive1 /dev/shelf1drive2 ...
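Typing the interleaved 44-device list by hand is error-prone, so here is a
sketch of building it in the shell, assuming hypothetical persistent names
like /dev/disk/by-path/shelf1-slot$i (substitute whatever names udev actually
gives the drives in your two enclosures):

DEVICES=""
for i in $(seq 1 22); do
    # one drive from each shelf per iteration, so adjacent devices are mirrors
    DEVICES="$DEVICES /dev/disk/by-path/shelf1-slot$i /dev/disk/by-path/shelf2-slot$i"
done
mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=44 $DEVICES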
Having said that, if you think there's a real chance of a shelf failing,
you probably ought to think about adding more redundancy within the
shelves so that you can survive another drive failure or two while
you're running on just one shelf.
If you are sticking with RAID10, you can potentially get double the read
performance using the far layout - -pf2 - and with the same order of
drives you can still survive a shelf failure, though your use of port
multipliers may well limit your performance anyway.
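For reference, the far-layout variant of the same create, with the same
hypothetical device list as above (only the layout changes), would be:

mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=44 $DEVICES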
Hope this helps!
Cheers,
John.
* Re: RAID-10 explicitly defined drive pairs?
From: Jan Kasprzak @ 2012-01-06 15:08 UTC (permalink / raw)
To: linux-raid; +Cc: John Robinson
John Robinson wrote:
: On 12/12/2011 11:54, Jan Kasprzak wrote:
: > Is there any way to tell mdadm explicitly how to set up
: >the pairs of mirrored drives inside a RAID-10 volume?
:
: If you're using RAID10,n2 (the default layout) then adjacent pairs
: of drives in the create command will be mirrors, so your command
: line should be something like:
:
: # mdadm --create /dev/mdX -l10 -pn2 -n44 /dev/shelf1drive1
: /dev/shelf2drive1 /dev/shelf1drive2 ...
OK, this works, thanks!
: Having said that, if you think there's a real chance of a shelf
: failing, you probably ought to think about adding more redundancy
: within the shelves so that you can survive another drive failure or
: two while you're running on just one shelf.
I am aware of that. I don't think the whole shelf will fail,
but who knows :-)
: If you are sticking with RAID10, you can potentially get double the
: read performance using the far layout - -pf2 - and with the same
: order of drives you can still survive a shelf failure, though your
: use of port multipliers may well limit your performance anyway.
On the older hardware I have a majority of writes, so the far
layout is probably not good for me (reads can be cached pretty well
at the OS level).
After some experiments with my new hardware, I have discovered
one more serious problem: I have simulated an enclosure failure,
so half of the disks forming the RAID-10 volume disappeared.
After removing them using mdadm --remove, and adding them back,
iostat reports that they are resynced one disk at a time, not all
just-added disks in parallel.
Is there any way of adding more than one disk to the degraded
RAID-10 volume, and get the volume restored as fast as the hardware permits?
Otherwise, it would be better for us to discard RAID-10 altogether,
and use several independent RAID-1 volumes joined together using LVM
(which we will probably use on top of the RAID-10 volume anyway).
I have tried mdadm --add /dev/mdN /dev/sd.. /dev/sd.. /dev/sd..,
but it behaves the same way as issuing mdadm --add one drive at a time.
Thanks,
-Yenya
* Re: RAID-10 explicitly defined drive pairs?
From: Peter Grandi @ 2012-01-06 16:39 UTC (permalink / raw)
To: Linux RAID
>>> Is there any way to tell mdadm explicitly how to set up
>>> the pairs of mirrored drives inside a RAID-10 volume?
>> If you're using RAID10,n2 (the default layout) then adjacent
>> pairs of drives in the create command will be mirrors, [ ... ]
I did that once with a pair of MD1000 shelves from Dell and that
worked pretty well (except it was very painful to configure the
shelves with each disk as separate volume).
> half of the disks forming the RAID-10 volume disappeared.
> After removing them using mdadm --remove, and adding them
> back, iostat reports that they are resynced one disk at a time,
> not all just-added disks in parallel.
That's very interesting news. Thanks for reporting this though,
it is something to keep in mind.
> [ ... ] Otherwise it would be better for us to discard RAID-10
> altogether, and use several independent RAID-1 volumes joined
> together
I suspect that MD runs one recovery per array at a time,
and 'raid10' arrays are a single array.
Which would be interesting to know in general, for example how
many drives would be rebuilt at the same time in a 2-drive
failure on a RAID6.
You might try a two-layer arrangement, as a 'raid0' of 'raid1'
pairs, instead of a 'raid10'. The two things with MD are not the
same, for example you can do layouts like a 3-drive 'raid10'.
> using LVM (which we will probably use on top of the RAID-10
> volume anyway).
Oh no! LVM is nowhere near as nice as MD for RAIDing and is otherwise
largely useless (except regrettably for snapshots) and has some
annoying limitations.
* Re: RAID-10 explicitly defined drive pairs?
From: Stan Hoeppner @ 2012-01-06 19:16 UTC (permalink / raw)
To: Linux RAID
I often forget the vger lists don't provide a List-Post: header (while
'everyone' else does). Thus my apologies for this reply going initially
only to Peter and not the list.
On 1/6/2012 10:39 AM, Peter Grandi wrote:
> You might try a two-layer arrangement, as a 'raid0' of 'raid1'
> pairs, instead of a 'raid10'. The two things with MD are not the
> same, for example you can do layouts like a 3-drive 'raid10'.
The MD 3-drive 'RAID 10' layout is similar or equivalent to SNIA RAID
1E, IIRC.
--
Stan
* Re: RAID-10 explicitly defined drive pairs?
From: Jan Kasprzak @ 2012-01-06 20:11 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID
Peter Grandi wrote:
: > half of the disks forming the RAID-10 volume disappeared.
: > After removing them using mdadm --remove, and adding them
: > back, iostat reports that they are resynced one disk at a time,
: > not all just-added disks in parallel.
:
: That's very interesting news. Thanks for reporting this though,
: it is something to keep in mind.
Yes. My HBA is able to do 4 GByte/s bursts according to the
documentation, and I am able to get 2.4 GByte/s sustained. So getting
only about 120-150 MByte/s for RAID-10 resync is really disappointing.
: > [ ... ] Otherwise it would be better for us to discard RAID-10
: > altogether, and use several independent RAID-1 volumes joined
: > together
:
: I suspect that MD runs one recovery per array at a time,
: and 'raid10' arrays are a single array.
Yes, but when the array is being assembled initially
(without --assume-clean), the MD RAID-10 can resync all pairs
of disks at once. It is still limited to two threads (mdX_resync
and mdX_raid10), so for a widely-interleaved RAID-10 the CPU can still
be a bottleneck (see my post in this thread from last May or April).
But it is still much better than 120-150 MByte/s.
: You might try a two-layer arrangement, as a 'raid0' of 'raid1'
: pairs, instead of a 'raid10'. The two things with MD are not the
: same, for example you can do layouts like a 3-drive 'raid10'.
:
: > using LVM (which we will probably use on top of the RAID-10
: > volume anyway).
:
: Oh no! LVM is nowhere as nice as MD for RAIDing and is otherwise
: largely useless (except regrettably for snapshots) and has some
: annoying limitations.
I think LVM on top of RAID-10 (or more RAID-1 volumes)
is actually pretty nice. With RAID-10 it is a bit easier to handle,
because the upper layer (LVM) does not need to know about proper
interleaving of lower layers. And I suspect that XFS swidth/sunit
settings will still work with RAID-10 parameters even over plain
LVM logical volume on top of that RAID 10, while the settings would
be more tricky when used with interleaved LVM logical volume on top
of several RAID-1 pairs (LVM interleaving uses LE/PE-sized stripes, IIRC).
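As a sketch only (assuming a hypothetical 512 KiB chunk on the 44-drive
near-2 array, i.e. 22 data-bearing chunks per full stripe, and a placeholder
LV name), the geometry can be passed to mkfs.xfs explicitly instead of
relying on autodetection:

mkfs.xfs -d su=512k,sw=22 /dev/vg0/data

where su is the MD chunk size and sw is the number of chunks per full stripe.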
-Yenya
* Re: RAID-10 explicitly defined drive pairs?
From: NeilBrown @ 2012-01-06 20:55 UTC (permalink / raw)
To: Jan Kasprzak; +Cc: linux-raid, John Robinson
On Fri, 6 Jan 2012 16:08:23 +0100 Jan Kasprzak <kas@fi.muni.cz> wrote:
> John Robinson wrote:
> : On 12/12/2011 11:54, Jan Kasprzak wrote:
> : > Is there any way to tell mdadm explicitly how to set up
> : >the pairs of mirrored drives inside a RAID-10 volume?
> :
> : If you're using RAID10,n2 (the default layout) then adjacent pairs
> : of drives in the create command will be mirrors, so your command
> : line should be something like:
> :
> : # mdadm --create /dev/mdX -l10 -pn2 -n44 /dev/shelf1drive1
> : /dev/shelf2drive1 /dev/shelf1drive2 ...
>
> OK, this works, thanks!
>
> : Having said that, if you think there's a real chance of a shelf
> : failing, you probably ought to think about adding more redundancy
> : within the shelves so that you can survive another drive failure or
> : two while you're running on just one shelf.
>
> I am aware of that. I don't think the whole shelf will fail,
> but who knows :-)
>
> : If you are sticking with RAID10, you can potentially get double the
> : read performance using the far layout - -pf2 - and with the same
> : order of drives you can still survive a shelf failure, though your
> : use of port multipliers may well limit your performance anyway.
>
> On the older hardware I have a majority of writes, so the far
> layout is probably not good for me (reads can be cached pretty well
> at the OS level).
>
> After some experiments with my new hardware, I have discovered
> one more serious problem: I have simulated an enclosure failure,
> so half of the disks forming the RAID-10 volume disappeared.
> After removing them using mdadm --remove, and adding them back,
> iostat reports that they are resynced one disk at a time, not all
> just-added disks in parallel.
>
> Is there any way of adding more than one disk to the degraded
> RAID-10 volume, and get the volume restored as fast as the hardware permits?
> Otherwise, it would be better for us to discard RAID-10 altogether,
> and use several independent RAID-1 volumes joined together using LVM
> (which we will probably use on top of the RAID-10 volume anyway).
>
> I have tried mdadm --add /dev/mdN /dev/sd.. /dev/sd.. /dev/sd..,
> but it behaves the same way as issuing mdadm --add one drive at a time.
I would expect that to first recover just the first device added, then
recover all the rest at once.
If you:
echo frozen > /sys/block/mdN/md/sync_action
mdadm --add /dev/mdN /dev......
echo recover > /sys/block/mdN/md/sync_action
it should do them all at once.
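For example, with a hypothetical /dev/md0 and three replacement drives
(all names below are placeholders):

echo frozen > /sys/block/md0/md/sync_action
mdadm --add /dev/md0 /dev/sdq /dev/sdr /dev/sds
echo recover > /sys/block/md0/md/sync_action
cat /proc/mdstat    # all three newly added devices should rebuild together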
I should teach mdadm about this..
NeilBrown
* Re: RAID-10 explicitly defined drive pairs?
From: Jan Kasprzak @ 2012-01-06 21:02 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, John Robinson
NeilBrown wrote:
: > I have tried mdadm --add /dev/mdN /dev/sd.. /dev/sd.. /dev/sd..,
: > but it behaves the same way as issuing mdadm --add one drive at a time.
:
: I would expect that to first recover just the first device added, then
: recover all the rest at once.
:
: If you:
: echo frozen > /sys/block/mdN/md/sync_action
: mdadm --add /dev/mdN /dev......
: echo recover > /sys/block/mdN/md/sync_action
:
: it should do them all at once.
Wow, it works! Thanks!
: I should teach mdadm about this..
It would be nice if mdadm --add /dev/mdN <multiple devices>
did this.
-Yenya
* Re: RAID-10 explicitly defined drive pairs?
From: Stan Hoeppner @ 2012-01-06 22:55 UTC (permalink / raw)
To: Jan Kasprzak; +Cc: Peter Grandi, Linux RAID
On 1/6/2012 2:11 PM, Jan Kasprzak wrote:
> And I suspect that XFS swidth/sunit
> settings will still work with RAID-10 parameters even over plain
> LVM logical volume on top of that RAID 10, while the settings would
> be more tricky when used with interleaved LVM logical volume on top
> of several RAID-1 pairs (LVM interleaving uses LE/PE-sized stripes, IIRC).
If one is using many RAID1 pairs s/he probably isn't after single large
file performance anyway, or s/he would just use RAID10. Thus
sunit/swidth settings aren't tricky in this case. One would use a
linear concatenation and drive parallelism with XFS allocation groups,
i.e. for a 24-drive chassis you'd set up an mdraid or lvm linear array of
12 RAID1 pairs and format with something like:
$ mkfs.xfs -d agcount=24 [device]
As long as one's workload writes files relatively evenly across 24 or
more directories, one receives fantastic concurrency/parallelism, in
this case 24 concurrent transactions, 2 to each mirror pair. In the
case of 15K SAS drives this is far more than sufficient to saturate the
seek bandwidth of the drives. One may need more AGs to achieve the
concurrency necessary to saturate good SSDs.
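A sketch of building that whole stack, assuming hypothetical drive names
sda..sdx for the 24-drive chassis (adjust to your enclosure):

# create 12 RAID1 pairs from adjacent drives: sda+sdb, sdc+sdd, ...
PAIRS=""
i=1
for d1 in a c e g i k m o q s u w; do
    d2=$(echo $d1 | tr 'acegikmoqsuw' 'bdfhjlnprtvx')
    mdadm --create /dev/md$i --level=1 --raid-devices=2 /dev/sd$d1 /dev/sd$d2
    PAIRS="$PAIRS /dev/md$i"
    i=$((i+1))
done
# join the pairs end-to-end (no striping) and give XFS two AGs per pair
mdadm --create /dev/md100 --level=linear --raid-devices=12 $PAIRS
mkfs.xfs -d agcount=24 /dev/md100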
--
Stan
* Re: RAID-10 explicitly defined drive pairs?
From: Peter Grandi @ 2012-01-07 14:25 UTC (permalink / raw)
To: Linux RAID
>> And I suspect that XFS swidth/sunit settings will still work
>> with RAID-10 parameters even over plain LVM logical volume on
>> top of that RAID 10, while the settings would be more tricky
>> when used with interleaved LVM logical volume on top of
>> several RAID-1 pairs (LVM interleaving uses LE/PE-sized
>> stripes, IIRC).
Stripe alignment is only relevant for parity RAID types, as it
is meant to minimize read-modify-write. There is no RMW problem
with RAID0, RAID1 or combinations. But there is a case for
'sunit'/'swidth' with single flash based SSDs as they do have a
RMW-like issue with erase blocks. In other cases whether they
are of benefit is rather questionable.
> One would use a linear concatenation and drive parallelism
> with XFS allocation groups, i.e. for a 24 drive chassis you'd
> setup an mdraid or lvm linear array of 12 RAID1 pairs and
> format with something like: $ mkfs.xfs -d agcount=24 [device]
> As long as one's workload writes files relatively evenly
> across 24 or more directories, one receives fantastic
> concurrency/parallelism, in this case 24 concurrent
> transactions, 2 to each mirror pair.
That to me sounds a bit too fragile; RAID0 is almost always
preferable to "concat", even with AG multiplication, and I would
be avoiding LVM more than avoiding MD.
* Re: RAID-10 explicitly defined drive pairs?
From: Stan Hoeppner @ 2012-01-07 16:25 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID
On 1/7/2012 8:25 AM, Peter Grandi wrote:
Jan Kasprzak wrote:
>>> And I suspect that XFS swidth/sunit settings will still work
>>> with RAID-10 parameters even over plain LVM logical volume on
>>> top of that RAID 10, while the settings would be more tricky
>>> when used with interleaved LVM logical volume on top of
>>> several RAID-1 pairs (LVM interleaving uses LE/PE-sized
>>> stripes, IIRC).
> Stripe alignment is only relevant for parity RAID types, as it
> is meant to minimize read-modify-write.
The benefits aren't limited to parity arrays. Tuning the stripe
parameters yields benefits on RAID0/10 arrays as well, mainly by packing
a full stripe of data when possible, avoiding many partial stripe width
writes in the non aligned case. Granted the gains are workload
dependent, but overall you get a bump from aligned writes.
> There is no RMW problem
> with RAID0, RAID1 or combinations.
Which is one of the reasons the linear concat over RAID1 pairs works
very well for some workloads.
> But there is a case for
> 'sunit'/'swidth' with single flash based SSDs as they do have a
> RMW-like issue with erase blocks. In other cases whether they
> are of benefit is rather questionable.
I'd love to see some documentation supporting this sunit/swidth with a
single SSD device theory.
I wrote:
>> One would use a linear concatenation and drive parallelism
>> with XFS allocation groups, i.e. for a 24 drive chassis you'd
>> setup an mdraid or lvm linear array of 12 RAID1 pairs and
>> format with something like: $ mkfs.xfs -d agcount=24 [device]
>> As long as one's workload writes files relatively evenly
>> across 24 or more directories, one receives fantastic
>> concurrency/parallelism, in this case 24 concurrent
>> transactions, 2 to each mirror pair.
> That to me sounds a bit too fragile ; RAID0 is almost always
> preferable to "concat", even with AG multiplication, and I would
> be avoiding LVM more than avoiding MD.
This wholly depends on the workload. For something like maildir RAID0
would give you no benefit as the mail files are going to be smaller than
a sane MDRAID chunk size for such an array, so you get no striping
performance benefit.
And RAID0 is far more fragile here than a concat. If you lose both
drives in a mirror pair, say to controller, backplane, cable, etc
failure, you've lost your entire array, and your XFS filesystem. With a
concat you can lose a mirror pair, run an xfs_repair and very likely
end up with a functioning filesystem, sans the directories and files
that resided on that pair. With RAID0 you're totally hosed. With a
concat you're probably mostly still in business.
--
Stan
* Re: RAID-10 explicitly defined drive pairs?
From: Peter Grandi @ 2012-01-09 13:46 UTC (permalink / raw)
To: Linux RAID
[ ... ]
>> Stripe alignment is only relevant for parity RAID types, as it
>> is meant to minimize read-modify-write.
> The benefits aren't limited to parity arrays. Tuning the
> stripe parameters yields benefits on RAID0/10 arrays as well,
> mainly by packing a full stripe of data when possible,
> avoiding many partial stripe width writes in the non aligned
> case.
This seems like handwaving gibberish to me, or (being very
generous) a misunderestimating of the general notion that
larger (as opposed to *aligned*) transactions are (sometimes)
of greater benefit than smaller ones.
Note: There is with 'ext' style filesystems the 'stride' which
is designed to interleave data and metadata so they are likely
to be on different disks, but that is in some ways the opposite
to 'sunit'/'swidth' style address/length alignment, and is
rather more similar to multiple AGs rather than aligning IO on
RMW-free boundaries.
How can «packing a full stripe of data» by itself be of benefit
on RAID0/RAID1/RAID10, if that is in any way different from just
doing larger transactions, or if it is different from an
argument about chunk size vs. transaction size?
A single N-wide write (or even a close sequence of N 1-wide
writes) on a RAID0/1/10 will result in optimal N concurrent
writes if that is possible, whether it is address/length aligned
or not. Why would «avoiding many partial stripe width writes»
have a significant effect in the RAID0 or RAID1 case, given that
there is no RMW problem?
> Granted the gains are workload dependent, but overall you get
> a bump from aligned writes.
Perhaps in a small way because of buffering effects or RAM or
cache alignment effects, but that would be unrelated to the
storage geometry.
>> There is no RMW problem with RAID0, RAID1 or combinations.
> Which is one of the reasons the linear concat over RAID1 pairs
> works very well for some workloads.
But the two are completely unrelated. Your argument was that
'concat' plus AGs works well if the workload is distributed over
different directories in a number similar to the drives. Concat
plus AGs may work well for special workloads, but RAID0 plus AGs
might work better.
To me 'concat' is just like RAID0 but sillier, regardless of
special cases. It is largely pointless. Please show how 'concat'
is indeed preferable to RAID0 in the general case or any
significant special case.
>> But there is a case for 'sunit'/'swidth' with single flash
>> based SSDs as they do have a RMW-like issue with erase
>> blocks. In other cases whether they are of benefit is rather
>> questionable.
> I'd love to see some documentation supporting this sunit/swidth
> with a single SSD device theory.
You have already read it above: internally SSDs have a big RMW
problem because of (erase) ''flash blocks'' being much larger
(around 512KiB/1MiB) than (''write''/read) ''flash pages'' which
are anyhow rather larger (usually 4KiB/8KiB) than logical 512B
sectors.
RMW avoidance is all that there is to address/length alignment.
It has nothing to do with RAIDness per se and indeed in a
different domain address/length aligned writes work very well
with RAM because it too has a big RMW problem.
Note: the case for RMW address/length aligned writes on single
SSDs is not clear only because FTL firmware simulates a
non-RMW device by using something (quite) similar to a
small-granule log-structured filesystem on top of the flash
storage and this might "waste" the extra alignment by the
filesystem.
Same for example as partition alignment: you can easily find on
the web documentation that explains in accessible terms that
having ''parity block'' aligned partitions is good for parity
RAID, and other documentation that explains that ''erase block''
aligned partitions are good for SSDs too, and in both cases the
reason is RMW, whether the reason for RMW is parity or erasing.
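For example, a partition aligned to 1MiB (a multiple of the common erase
block sizes) can be created with something like the following, where the
device name is a placeholder and a partition label is assumed to exist:

parted -a optimal /dev/sdb mkpart primary 1MiB 100%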
Those able to do a web search with the relevant keywords and
read documentation can find some mentions of single SSD RMW and
address/length alignment, for example here:
http://research.cs.wisc.edu/adsl/Publications/ssd-usenix08.pdf
http://research.microsoft.com/en-us/projects/flashlight/winhec08-ssd.pptx
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-2.pdf
Mentioned in passing as something pretty obvious, and there are
other similar mentions that come up in web searches because it
is a pretty natural application of thinking about RMW issues.
Now I eagerly await your explanation of the amazing "Hoeppner
effect" by which address/length aligned writes on RAID0/1/10
have significant benefits and of the audacious "Hoeppner
principle" by which 'concat' is as good as RAID0 over the same
disks.
* Re: RAID-10 explicitly defined drive pairs?
From: Stan Hoeppner @ 2012-01-10 3:54 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID
On 1/9/2012 7:46 AM, Peter Grandi wrote:
> Those able to do a web search with the relevant keywords and
> read documentation can find some mentions of single SSD RMW and
> address/length alignment, for example here:
>
> http://research.cs.wisc.edu/adsl/Publications/ssd-usenix08.pdf
> http://research.microsoft.com/en-us/projects/flashlight/winhec08-ssd.pptx
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-2.pdf
>
> Mentioned in passing as something pretty obvious, and there are
> other similar mentions that come up in web searches because it
> is a pretty natural application of thinking about RMW issues.
Yes, I've read such things. I was alluding to the fact that there are at
least a half dozen different erase block sizes and algorithms in use by
different SSD manufacturers. There is no standard. And not all of them
are published. There is no reliable way to do such optimization
generically.
> Now I eagerly await your explanation of the amazing "Hoeppner
> effect" by which address/length aligned writes on RAID0/1/10
> have significant benefits and of the audacious "Hoeppner
> principle" by which 'concat' is as good as RAID0 over the same
> disks.
IIRC from a previous discussion I had with Neil Brown on this list,
mdraid0, as with all the striped array code, runs as a single kernel
thread, limiting its performance to that of a single CPU. A linear
concatenation does not run as a single kernel thread, but is simply an
offset calculation routine that, IIRC, executes on the same CPU as the
caller. Thus one can theoretically achieve near 100% CPU scalability
when using concat instead of mdraid0. So the issue isn't partial stripe
writes at the media level, but the CPU overhead caused by millions of
the little bastards with heavy random IOPS workloads, along with
increased numbers of smaller IOs through the SCSI/SATA interface,
causing more interrupts thus more CPU time, etc.
I've not run into this single stripe thread limitation myself, but have
read multiple cases where OPs can't get maximum performance from their
storage hardware because their top level mdraid stripe thread is peaking
a single CPU in their X-way system. Moving from RAID10 to a linear
concat gets around this limitation for small file random IOPS workloads.
Only when using XFS and a proper AG configuration, obviously. This is
my recollection of Neil's description of the code behavior. I could
very well have misunderstood, and I'm sure he'll correct me if that's
the case, or you, or both. ;)
Dave Chinner had some input WRT XFS on concat for this type of workload,
stating it's a little better than RAID10 (ambiguous as to hard/soft).
Did you read that thread Peter? I know you're on the XFS list as well.
I can't exactly recall at this time Dave's specific reasoning, I'll try
to dig it up. I'm thinking it had to do with the different distribution
of metadata IOs between the two AG layouts, and the amount of total head
seeking required for the workload being somewhat higher for RAID10 than
for the concat of RAID1 pairs. Again, I could be wrong on that, but it
seems familiar. That discussion was many months ago.
--
Stan
* Re: RAID-10 explicitly defined drive pairs?
From: NeilBrown @ 2012-01-10 4:13 UTC (permalink / raw)
To: stan; +Cc: Peter Grandi, Linux RAID
On Mon, 09 Jan 2012 21:54:56 -0600 Stan Hoeppner <stan@hardwarefreak.com>
wrote:
> On 1/9/2012 7:46 AM, Peter Grandi wrote:
>
> > Those able to do a web search with the relevant keywords and
> > read documentation can find some mentions of single SSD RMW and
> > address/length alignment, for example here:
> >
> > http://research.cs.wisc.edu/adsl/Publications/ssd-usenix08.pdf
> > http://research.microsoft.com/en-us/projects/flashlight/winhec08-ssd.pptx
> > http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-2.pdf
> >
> > Mentioned in passing as something pretty obvious, and there are
> > other similar mentions that come up in web searches because it
> > is a pretty natural application of thinking about RMW issues.
>
> Yes, I've read such things. I was alluding to the fact that there are at
> least a half dozen different erase block sizes and algorithms in use by
> different SSD manufacturers. There is no standard. And not all of them
> are published. There is no reliable way to do such optimization
> generically.
>
> > Now I eagerly await your explanation of the amazing "Hoeppner
> > effect" by which address/length aligned writes on RAID0/1/10
> > have significant benefits and of the audacious "Hoeppner
> > principle" by which 'concat' is as good as RAID0 over the same
> > disks.
>
> IIRC from a previous discussion I had with Neil Brown on this list,
> mdraid0, as with all the striped array code, runs as a single kernel
> thread, limiting its performance to that of a single CPU. A linear
> concatenation does not run as a single kernel thread, but is simply an
> offset calculation routine that, IIRC, executes on the same CPU as the
> caller. Thus one can theoretically achieve near 100% CPU scalability
> when using concat instead of mdraid0. So the issue isn't partial stripe
> writes at the media level, but the CPU overhead caused by millions of
> the little bastards with heavy random IOPS workloads, along with
> increased numbers of smaller IOs through the SCSI/SATA interface,
> causing more interrupts thus more CPU time, etc.
>
> I've not run into this single stripe thread limitation myself, but have
> read multiple cases where OPs can't get maximum performance from their
> storage hardware because their top level mdraid stripe thread is peaking
> a single CPU in their X-way system. Moving from RAID10 to a linear
> concat gets around this limitation for small file random IOPS workloads.
> Only when using XFS and a proper AG configuration, obviously. This is
> my recollection of Neil's description of the code behavior. I could
> very well have misunderstood, and I'm sure he'll correct me if that's
> the case, or you, or both. ;)
(oh dear, someone is Wrong on the Internet! Quick, duck into the telephone
booth and pop out as ....)
Hi Stan,
I think you must be misremembering.
Neither RAID0 or Linear have any threads involved. They just redirect the
request to the appropriate devices. Multiple threads can submit multiple
requests down through RAID0 and Linear concurrently.
RAID1, RAID10, and RAID5/6 are different. For reads they normally have
no contention with other requests, but for writes things do get
single-threaded at some point.
Hm... your text above sometimes talks about RAID0 vs Linear, and sometimes
about RAID10 vs Linear. So maybe you are remembering correctly, but
presenting incorrectly in part ....
NeilBrown
>
> Dave Chinner had some input WRT XFS on concat for this type of workload,
> stating it's a little better than RAID10 (ambiguous as to hard/soft).
> Did you read that thread Peter? I know you're on the XFS list as well.
> I can't exactly recall at this time Dave's specific reasoning, I'll try
> to dig it up. I'm thinking it had to do with the different distribution
> of metadata IOs between the two AG layouts, and the amount of total head
> seeking required for the workload being somewhat higher for RAID10 than
> for the concat of RAID1 pairs. Again, I could be wrong on that, but it
> seems familiar. That discussion was many months ago.
>
* Re: RAID-10 explicitly defined drive pairs?
From: Stan Hoeppner @ 2012-01-10 16:25 UTC (permalink / raw)
To: NeilBrown; +Cc: Peter Grandi, Linux RAID
On 1/9/2012 10:13 PM, NeilBrown wrote:
>> IIRC from a previous discussion I had with Neil Brown on this list,
>> mdraid0, as with all the striped array code, runs as a single kernel
>> thread, limiting its performance to that of a single CPU. A linear
>> concatenation does not run as a single kernel thread, but is simply an
>> offset calculation routine that, IIRC, executes on the same CPU as the
>> caller. Thus one can theoretically achieve near 100% CPU scalability
>> when using concat instead of mdraid0. So the issue isn't partial stripe
>> writes at the media level, but the CPU overhead caused by millions of
>> the little bastards with heavy random IOPS workloads, along with
>> increased numbers of smaller IOs through the SCSI/SATA interface,
>> causing more interrupts thus more CPU time, etc.
>>
>> I've not run into this single stripe thread limitation myself, but have
>> read multiple cases where OPs can't get maximum performance from their
>> storage hardware because their top level mdraid stripe thread is peaking
>> a single CPU in their X-way system. Moving from RAID10 to a linear
>> concat gets around this limitation for small file random IOPS workloads.
>> Only when using XFS and a proper AG configuration, obviously. This is
>> my recollection of Neil's description of the code behavior. I could
>> very well have misunderstood, and I'm sure he'll correct me if that's
>> the case, or you, or both. ;)
>
> (oh dear, someone is Wrong on the Internet! Quick, duck into the telephone
> booth and pop out as ....)
>
> Hi Stan,
> I think you must be misremembering.
> Neither RAID0 or Linear have any threads involved. They just redirect the
> request to the appropriate devices. Multiple threads can submit multiple
> requests down through RAID0 and Linear concurrently.
>
> RAID1, RAID10, and RAID5/6 are different. For reads they normally have
> no contention with other requests, but for writes things do get
> single-threaded at some point.
>
> Hm... your text above sometimes talks about RAID0 vs Linear, and sometimes
> about RAID10 vs Linear. So maybe you are remembering correctly, but
> presenting incorrectly in part ....
Yes, I believe that's where we are. My apologies for allowing myself to
become slightly confused. I'm sure I'm the only human being working
with Linux to ever become so. ;)
Peter kept referencing RAID0 after I'd explicitly referenced RAID10 in
my statement. I guess I assumed he was simply referring to the striped
component of RAID10, which apparently wasn't the case.
So I did recall correctly that mdraid10 does have some threading
limitations. So what needs clarification at this point is whether those
limitations are greater than any such limitations with the concatenated
RAID1 pair case using XFS AGs to drive the parallelism.
Thanks for your input Neil, and for your clarifications thus far.
--
Stan
* Re: RAID-10 explicitly defined drive pairs?
From: Peter Grandi @ 2012-01-12 11:58 UTC (permalink / raw)
To: Linux RAID
I have pulled bits of the original posts to give some context.
[ .... ]
>>>> Stripe alignment is only relevant for parity RAID types, as
>>>> it is meant to minimize read-modify-write. There is no RMW
>>>> problem with RAID0, RAID1 or combinations.
[ ... ]
>>> The benefits aren't limited to parity arrays. Tuning the
>>> stripe parameters yields benefits on RAID0/10 arrays as
>>> well, mainly by packing a full stripe of data when possible,
>>> avoiding many partial stripe width writes in the non aligned
>>> case. Granted the gains are workload dependent, but overall
>>> you get a bump from aligned writes.
[ ... ]
>>>> But there is a case for 'sunit'/'swidth' with single flash
>>>> based SSDs as they do have a RMW-like issue with erase
>>>> blocks. In other cases whether they are of benefit is
>>>> rather questionable.
>>> I'd love to see some documentation supporting this
>>> sunit/swidth with a single SSD device theory.
[ ... ]
> Yes, I've read such things. I was alluding to the fact that
> there are at least a half dozen different erase block sizes
> and algorithms in use by different SSD manufacturers. There is
> no standard. And not all of them are published. There is no
> reliable way to do such optimization generically.
Well, at least in some cases there are some details on erase
block sizes for some devices, and most contemporary devices seem
to max at 8KiB "flash pages" and 1MiB "flash blocks" (most
contemporary low cost flash SSDs are RAID0-like interleavings of
chips with those parameters).
There is (hopefully) little cost in further alignment so I think
that 16KiB as the 'sunit' and 2MiB as the 'swidth' on single
SSD should cover some further tightening of the requirements.
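As a sketch, on a hypothetical single-SSD partition /dev/sdb1 that would be
something like:

mkfs.xfs -d su=16k,sw=128 /dev/sdb1

(su=16k times sw=128 gives the 2MiB width mentioned above; whether the FTL
actually rewards this is, as discussed just below, not clear).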
But as I wrote previously, the biggest issue with the expectation
that address/length alignment matters with flash SSDs is the
Flash Translation Layer firmware they use, which may make
attempts to perform higher-level geometry adaptation not so
relevant.
While there isn't any good argument that address/length
alignment matters other than for RMW storage devices, I must say
that because of my estimate that address/length alignment is not
costly, and my intuition that it might help, I specify
address/length alignments on *everything* (even non parity RAID
on non-RMW storage, even single disks on non-RMW storage).
One of the guesses that I have as to why is that it might help
keep free space more contiguous, and thus in general may lead to
lower fragmentation of allocated files (which does not matter a
lot for flash SSDs, but then perhaps the RMW issue matters),
probably because it leads to allocations being done in bigger
and more aligned chunks than otherwise.
That is a *static* effect at the file system level rather than
the dynamic effects at the array level mentioned in your
(euphemism alert) rather weak arguments as to multithreading or
better scheduling of IO operations at the array level.
The cost is mostly that when there is little free space
the remaining free space is probably more fragmented than it
would otherwise have been. But I try to keep at least 15-20%
free space available regardless.
* Re: RAID-10 explicitly defined drive pairs?
From: Peter Grandi @ 2012-01-12 12:47 UTC (permalink / raw)
To: Linux RAID
[ ... ]
>> That to me sounds a bit too fragile ; RAID0 is almost always
>> preferable to "concat", even with AG multiplication, and I
>> would be avoiding LVM more than avoiding MD.
> This wholly depends on the workload. For something like
> maildir RAID0 would give you no benefit as the mail files are
> going to be smaller than a sane MDRAID chunk size for such an
> array, so you get no striping performance benefit.
That seems to me an unfortunate argument and example:
* As an example, putting a mail archive on a RAID0 or 'concat'
seems a bit at odds with the usual expectations of availability
for them, unless it is a RAID0 or 'concat' over RAID1. And anyhow
a 'maildir' mail archive is a horribly bad idea regardless,
because it maps very badly onto current storage technology.
* The issue of chunk size is one of my pet peeves, as there is
very little case for it being larger than the file system block
size. Sure, there are many "benchmarks" that show that larger
chunk sizes correspond to higher transfer rates, but that is
because of unrealistic transaction size effects, which don't
matter for a mostly random-access shared mail archive, never
mind a maildir one.
* Regardless, an argument that there is no striping benefit in
that case is not an argument that 'concat' is better. I'd still
default to RAID0.
* Consider the dubious joys of an 'fsck' or 'rsync' (and other
bulk maintenance operations, like indexing the archive), and
how RAID0 may help (even if not a lot) the scanning of metadata
with respect to 'concat' (unless one relies totally on
parallelism across multiple AGs).
Perhaps one could make a case that 'concat' is no worse than
'RAID0' if one has a very special case that is equivalent to
painting oneself in a corner, but it is not a very interesting
case.
> And RAID0 is far more fragile here than a concat. If you lose
> both drives in a mirror pair, say to controller, backplane,
> cable, etc failure, you've lost your entire array, and your
> XFS filesystem.
Uhm, sometimes it is not a good idea to structure mirror pairs so
that they have blatant common modes of failure. But then most
arrays I have seen were built out of drives of the same make and
model and taken out of the same carton....
> With a concat you can lose a mirror pair, run an xfs_repair and
> very likely end up with a functioning filesystem, sans the
> directories and files that resided on that pair. With RAID0
> you're totally hosed. With a concat you're probably mostly
> still in business.
That sounds (euphemism alert) rather optimistic to me, because it
is based on the expectation that files, and files within the same
directory, tend to be allocated entirely within a single segment
of a 'concat'. Even with distributing AGs around for file system
types that support that, that's a bit wistful (as is the
expectation that AGs are indeed wholly contained in specific
segments of a 'concat').
Usually if there is a case for a 'concat' there is a rather
better case for separate, smaller filesystems mounted under a
common location, as an alternative to RAID0.
It is often a better case because data is often partitionable,
there is no large advantage to a single free space pool as most
files are not that large, and one can do fully independent and
parallel 'fsck', 'rsync' and other bulk maintenance operations
(including restores).
Then we might as well get into distributed partitioned file
systems with a single namespace like Lustre or DPM.
But your (euphemism alert) edgy recovery example above triggers a
couple of my long standing pet peeves:
* The correct response to a damaged (in the sense of data loss)
storage system is not to ignore the hole, patch up the filetree
in it, and restart it, but to restore the filetree from backups.
Because in any case one would have to run a verification pass
against backups to see what has been lost and whether any
partial file losses have happened.
* If availability requirements are so exigent that a restore from
backup is not acceptable to the customer, and random data loss
is better accepted, we have a strange situation. Which is that
the customer really wants a Very Large DataBase (a database so
large that it cannot be taken offline for maintenance, such as
backups or recovery) style storage system, but they don't want
to pay for it. A sysadm may then look good by playing to these
politics by pretending they have done one on the cheap, by
tacitly dropping data integrity, but these are scary politics.
[ ... ]
* Re: RAID-10 explicitly defined drive pairs?
From: Stan Hoeppner @ 2012-01-12 21:24 UTC (permalink / raw)
To: Linux RAID
On 1/12/2012 6:47 AM, Peter Grandi wrote:
> [ ... ]
>
>>> That to me sounds a bit too fragile ; RAID0 is almost always
>>> preferable to "concat", even with AG multiplication, and I
>>> would be avoiding LVM more than avoiding MD.
>
>> This wholly depends on the workload. For something like
>> maildir RAID0 would give you no benefit as the mail files are
>> going to be smaller than a sane MDRAID chunk size for such an
>> array, so you get no striping performance benefit.
>
> That seems to me unfortunate argument and example:
>
> * As an example, putting a mail archive on a RAID0 or 'concat'
> seems a bit at odd with the usual expectations of availability
> for them. Unless a RAID0 or 'concat' over RAID1. Because anyhow
> 'maildir' mail archive is a horribly bad idea regardless
> because it maps very badly on current storage technology.
WRT availability both are identical to textbook RAID10--half the drives
can fail as long as no two are in the same mirror pair. In essence
RAID0 over mirrors _is_ RAID10. I totally agree with your maildir
sentiments--far too much physical IO (metadata) needed for the same job
as mbox, Dovecot's mdbox, etc. But, maildir is still extremely popular
and in wide use, and will be for quite some time.
> * The issue if chunk size is one of my pet peeves, as there is
> very little case for it being larger than file system block
> size. Sure there are many "benchmarks" that show that larger
> chunk sizes correspond to higher transfer rates, but that is
> because of unrealistic transaction size effects. Which don't
> matter for a mostly random-access share mail archive, never
> mind a maildir one.
I absolutely agree. Which is why the concat makes sense for such
workloads. From a 10,000 ft view, it is little different than having a
group of mirror pairs, putting a filesystem on each, and manually
spreading one's user mailboxen over those filesystems. The XFS over
concat simply takes the manual spreading aspect out of this, and yields
pretty good transaction load distribution.
> * Regardless, an argument that there is no striping benefit in
> that case is not an argument that 'concat' is better. I'd still
> default to RAID0.
The issue with RAID0 or RAID10 here is tuning XFS. With a striped array
XFS works best if sunit/swidth match the stripe block characteristics of
the array, as it attempts to pack a full stripe worth of writes before
pushing down the stack. This works well with large files, but can be an
impediment to performance with lots of small writes. Free space
fragmentation becomes a problem as XFS attempts to stripe align all
writes. So with maildir you often end up with lots of partial stripe
writes, each being a different stripe. Once an XFS filesystem ages
sufficiently (i.e. fills up) more head seeking is required to write
files into the fragmented free space. At least, this is my
understanding from Dave's previous explanation.
Additionally, when using a striped array, all XFS AGs are striped down
the virtual cylinder that is the array. So when searching a large
directory btree, you may generate seeks across all drives in the array
to find a single entry. With a properly created XFS on concat, AGs are
aligned and wholly contained within a mirror pair. And since all files
created within an AG have their metadata also within that AG, any btree
walking is done only on an AG within that mirror pair, reducing head
seeking to one drive vs all drives in a striped array.
The concat setup also has the advantage that per-drive read-ahead is
more likely to cache blocks that actually will be needed shortly, i.e.
the next file in an inbox, whereas with a striped array it's very likely
that the next few blocks contain a different user's email, a user who
may not even be logged in.
> * Consider the dubious joys of an 'fsck' or 'rsync' (and other
> bulk maintenance operations, like indexing the archive), and
> how RAID0 may help (even if not a lot) the scanning of metadata
> with respect to 'concat' (unless one relies totally on
> parallelism across multiple AGs).
This concat setup is specific to XFS and only XFS. It is useless with
any other (Linux anyway) filesystem because no others use an allocation
group design nor can derive meaningful parallelism in the absence of
striping.
> Perhaps one could make a case that 'concat' is no worse than
> 'RAID0' if one has a very special case that is equivalent to
> painting oneself in a corner, but it is not a very interesting
> case.
It is better than a RAID0/10 stripe for small file random IO workloads.
See reasons above.
>> And RAID0 is far more fragile here than a concat. If you lose
>> both drives in a mirror pair, say to controller, backplane,
>> cable, etc failure, you've lost your entire array, and your
>> XFS filesystem.
>
> Uhm, sometimes it is not a good idea to structure mirror pairs so
> that they have blatant common modes of failure. But then most
> arrays I have seen were built out of drives of the same make and
> model and taken out of the same carton....
I was demonstrating the worst case scenario that could take down both
array types, and the fact that when using XFS on both, you lose
everything with RAID0, but can likely recover to a large degree with the
concat specifically because of the allocation group design and how the
AGs are physically laid down on the concat disks.
>> With a concat you can lose a mirror pair, run an xfs_repair and
>> very likely end up with a functioning filesystem, sans the
>> directories and files that resided on that pair. With RAID0
>> you're totally hosed. With a concat you're probably mostly
>> still in business.
>
> That sounds (euphemism alert) rather optimistic to me, because it
> is based on the expectation that files, and files within the same
> directory, tend to be allocated entirely within a single segment
> of a 'concat'.
This is exactly the case. With 16x1TB drives in an mdraid linear concat
with XFS and 16 AGs, you get exactly 1 AG on each drive. In practice in
this case one would probably want 2 AGs per drive, as files are
clustered around the directories. With the small file random IO
workload this decreases head seeking between the directory write op and
the file write op, which typically occur in rapid succession.
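As a sketch, for that hypothetical 16-drive concat the only change is at
mkfs time (device name is a placeholder):

mkfs.xfs -d agcount=32 /dev/md100    # two AGs per 1TB member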
> Even with distributing AGs around for file system
> types that support that, that's a bit wistful (as is the
> expectation that AGs are indeed wholly contained in specific
> segments of a 'concat').
No, it is not, and yes, they are.
> Usually if there is a case for a 'concat' there is a rather
> better case for separate, smaller filesystems mounted under a
> common location, as an alternative to RAID0.
Absolutely agreed, for the most part. If the application itself has the
ability to spread the file transaction load across multiple directories
this is often better than relying on the filesystem to do it
automagically. And if you lose one filesystem for any reason you've
only lost access to a portion of data, not all of it. The minor
downside is managing multiple filesystems instead of one, but not a big
deal really, given the extra safety margin.
In the case of the maildir workload, Dovecot, for instance, allows
specifying a mailbox location on a per user basis. I recall one Dovecot
OP who is doing this with 16 mirror pairs with 16 EXTx filesystems atop.
IIRC he was bitten more than once by single large hardware RAID setups
going down--I don't recall the specifics. Check the Dovecot list archives.
> It is often a better case because data is often partitionable,
> there is no large advantage to a single free space pool as most
> files are not that large, and one can do fully independent and
> parallel 'fsck', 'rsync' and other bulk maintenance operations
> (including restores).
Agreed. If the data set can be partitioned, and if your application
permits doing so. Some do not.
> Then we might as well get into distributed partitioned file
> systems with a single namespace like Lustre or DPM.
Lustre wasn't designed for, nor is suitable for, high IOPS low latency,
small file workloads, which is, or at least was, the topic we are
discussing. I'm not familiar with DPM. Most distributed filesystems
aren't suitable for this type of workload due to multiple types of latency.
> But your (euphemism alert) edgy recovery example above triggers a
> couple of my long standing pet peeves:
>
> * The correct response to a damaged (in the sense of data loss)
> storage system is not to ignore the hole, patch up the filetree
> in it, and restart it, but to restore the filetree from backups.
> Because in any case one would have to run a verification pass
> aganst backups to see what has been lost and whether any
> partial file losses have happened.
I believe you missed the point, and are making some incorrect
assumptions WRT SOP in this field, and the wherewithal of your
colleagues. In my concat example you can likely be back up and running
"right now" with some loss _while_ you troubleshoot/fix/restore. In the
RAID0 scenario, you're completely down _until_ you
troubleshoot/fix/restore. Nobody is going to slap a bandaid on and
"ignore the hole". I never stated nor implied that. I operate on the
assumption my colleagues here know what they're doing for the most part,
so I don't expend extra unnecessary paragraphs on SOP minutia.
[snipped]
--
Stan
* Re: RAID-10 explicitly defined drive pairs?
From: Alexander Lyakas @ 2012-03-22 10:01 UTC (permalink / raw)
To: NeilBrown; +Cc: Jan Kasprzak, linux-raid, John Robinson
Neil,
> echo frozen > /sys/block/mdN/md/sync_action
> mdadm --add /dev/mdN /dev......
> echo recover > /sys/block/mdN/md/sync_action
>
> it should do them all at once.
>
> I should teach mdadm about this..
What is required to do that from mdadm? I don't see any other place
where MD_RECOVERY_FROZEN is set, except via sysfs. So do you suggest
that mdadm use sysfs for that?
Also, what should be done if mdadm manages to "freeze" the array, but
then fails to "unfreeze" it for some reason?
Thanks,
Alex.
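For reference, the whole sequence Neil suggests, written out against a
hypothetical array /dev/md0 with placeholder member devices:

# Freeze resync/recovery so several drives can be added in one batch.
echo frozen > /sys/block/md0/md/sync_action

# Add the replacement drives while the array is frozen.
mdadm --add /dev/md0 /dev/sdx
mdadm --add /dev/md0 /dev/sdy

# Unfreeze; a single recovery pass then rebuilds onto all added drives.
echo recover > /sys/block/md0/md/sync_action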
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs?
2012-03-22 10:01 ` Alexander Lyakas
@ 2012-03-22 10:31 ` NeilBrown
2012-03-25 9:30 ` Alexander Lyakas
0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2012-03-22 10:31 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: Jan Kasprzak, linux-raid, John Robinson
On Thu, 22 Mar 2012 12:01:48 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Neil,
>
> > echo frozen > /sys/block/mdN/md/sync_action
> > mdadm --add /dev/mdN /dev......
> > echo recover > /sys/block/mdN/md/sync_action
> >
> > it should do them all at once.
> >
> > I should teach mdadm about this..
>
> What is required to do that from mdadm? I don't see any other place
> where MD_RECOVERY_FROZEN is set, except via sysfs. So do you suggest
> that mdadm use sysfs for that?
Yes.
http://neil.brown.name/git?p=mdadm;a=commitdiff;h=9f58469128c99c0d7f434d28657f86789334f253
> Also, what should be done if mdadm manages to "freeze" the array, but
> then fails to "unfreeze" it for some reason?
What could possibly cause that?
I guess if someone kills mdadm while it was running..
NeilBrown
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs?
2012-03-22 10:31 ` NeilBrown
@ 2012-03-25 9:30 ` Alexander Lyakas
2012-04-04 16:56 ` Alexander Lyakas
0 siblings, 1 reply; 25+ messages in thread
From: Alexander Lyakas @ 2012-03-25 9:30 UTC (permalink / raw)
To: NeilBrown; +Cc: Jan Kasprzak, linux-raid, John Robinson
Thanks, Neil!
I will merge this in and test.
On Thu, Mar 22, 2012 at 12:31 PM, NeilBrown <neilb@suse.de> wrote:
> On Thu, 22 Mar 2012 12:01:48 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Neil,
>>
>> > echo frozen > /sys/block/mdN/md/sync_action
>> > mdadm --add /dev/mdN /dev......
>> > echo recover > /sys/block/mdN/md/sync_action
>> >
>> > it should do them all at once.
>> >
>> > I should teach mdadm about this..
>>
>> What is required to do that from mdadm? I don't see any other place
>> where MD_RECOVERY_FROZEN is set, except via sysfs. So do you suggest
>> that mdadm use sysfs for that?
>
> Yes.
>
> http://neil.brown.name/git?p=mdadm;a=commitdiff;h=9f58469128c99c0d7f434d28657f86789334f253
>
>> Also, what should be done if mdadm manages to "freeze" the array, but
>> then fails to "unfreeze" it for some reason?
>
> What could possibly cause that?
> I guess if someone kills mdadm while it was running..
>
>
> NeilBrown
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs?
2012-03-25 9:30 ` Alexander Lyakas
@ 2012-04-04 16:56 ` Alexander Lyakas
2014-06-09 14:26 ` Alexander Lyakas
0 siblings, 1 reply; 25+ messages in thread
From: Alexander Lyakas @ 2012-04-04 16:56 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Hi Neil,
I fetched this commit, and also the "fixed sysfs_freeze_array array to
work properly with Manage_subdevs" commit
fd324b08dbfa8404558534dd0a2321213ffb7257, which looks relevant.
I experimented with this, and I have some questions:
Is there any special purpose for calling
sysfs_attribute_available("sync_action")? If this call fails, the code
returns 1, which will make the caller attempt to un-freeze. Without
this call, however, sysfs_get_str("sync_action") would fail and return
0, preventing the caller from unfreezing, which looks better.
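In shell terms, the availability check being discussed amounts to
roughly the following (md0 is a placeholder for the real array):

# Rough equivalent of the sysfs_attribute_available() check: only
# attempt the freeze if the sync_action attribute actually exists.
if [ -e /sys/block/md0/md/sync_action ]; then
    echo frozen > /sys/block/md0/md/sync_action
else
    echo "sync_action not available; not freezing" >&2
fi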
Why do you refuse to freeze the array if it's not "idle"? What will
happen is that the current recover/resync will abort, drives will be
added, and on unfreezing, the array will resume (restart?) recovery with
all drives. If the array was resyncing, however, it will start recovering
the newly added drives, because the kernel prefers recovery over resync
(as we discussed earlier).
Are there any other caveats of this freezing/unfreezing that you can
think of?
Thanks,
Alex.
On Sun, Mar 25, 2012 at 12:30 PM, Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:
> Thanks, Neil!
> I will merge this in and test.
>
>
> On Thu, Mar 22, 2012 at 12:31 PM, NeilBrown <neilb@suse.de> wrote:
>> On Thu, 22 Mar 2012 12:01:48 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
>> wrote:
>>
>>> Neil,
>>>
>>> > echo frozen > /sys/block/mdN/md/sync_action
>>> > mdadm --add /dev/mdN /dev......
>>> > echo recover > /sys/block/mdN/md/sync_action
>>> >
>>> > it should do them all at once.
>>> >
>>> > I should teach mdadm about this..
>>>
>>> What is required to do that from mdadm? I don't see any other place
>>> where MD_RECOVERY_FROZEN is set, except via sysfs. So do you suggest
>>> that mdadm use sysfs for that?
>>
>> Yes.
>>
>> http://neil.brown.name/git?p=mdadm;a=commitdiff;h=9f58469128c99c0d7f434d28657f86789334f253
>>
>>> Also, what should be done if mdadm manages to "freeze" the array, but
>>> then fails to "unfreeze" it for some reason?
>>
>> What could possibly cause that?
>> I guess if someone kills mdadm while it was running..
>>
>>
>> NeilBrown
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs?
2012-04-04 16:56 ` Alexander Lyakas
@ 2014-06-09 14:26 ` Alexander Lyakas
2014-06-10 0:11 ` NeilBrown
0 siblings, 1 reply; 25+ messages in thread
From: Alexander Lyakas @ 2014-06-09 14:26 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
> Why do you refuse to freeze the array if it's not "idle"? What will
> happen is that the current recover/resync will abort, drives will be
> added, and on unfreezing, the array will resume (restart?) recovery with
> all drives. If the array was resyncing, however, it will start recovering
> the newly added drives, because the kernel prefers recovery over resync
> (as we discussed earlier).
Indeed, since dea3786ae2cf74ecb0087d1bea1aa04e9091ad5c, I see that you
agree to freeze the array also when it is recovering.
Alex.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs?
2014-06-09 14:26 ` Alexander Lyakas
@ 2014-06-10 0:11 ` NeilBrown
2014-06-11 16:05 ` Alexander Lyakas
0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2014-06-10 0:11 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid
On Mon, 9 Jun 2014 17:26:38 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> > Why do you refuse to freeze the array if it's not "idle"? What will
> > happen is that the current recover/resync will abort, drives will be
> > added, and on unfreezing, the array will resume (restart?) recovery with
> > all drives. If the array was resyncing, however, it will start recovering
> > the newly added drives, because the kernel prefers recovery over resync
> > (as we discussed earlier).
> Indeed, since dea3786ae2cf74ecb0087d1bea1aa04e9091ad5c, I see that you
> agree to freeze the array also when it is recovering.
I guess I did..... though I don't remember seeing the email that you have
quoted. I can see it in my inbox, but it seems that I never replied. Maybe
I was too busy that day :-(
If there are other outstanding issues, feel free to resend.
(If I don't reply it is more likely to be careless than deliberate, so in
general you should feel free to resend if I don't respond in a week or so).
NeilBrown
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RAID-10 explicitly defined drive pairs?
2014-06-10 0:11 ` NeilBrown
@ 2014-06-11 16:05 ` Alexander Lyakas
0 siblings, 0 replies; 25+ messages in thread
From: Alexander Lyakas @ 2014-06-11 16:05 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Thanks, Neil.
What we do now:
- we refuse to freeze/re-add drives if the array is neither idle nor
recovering (e.g., if it is resyncing)
- otherwise, we freeze, add/re-add the drives, and unfreeze
Thanks!
Alex.
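A minimal shell sketch of that policy, with a hypothetical array and
placeholder drive names (a real implementation, e.g. inside mdadm,
drives the same sysfs attribute):

#!/bin/sh
# Freeze, add/re-add drives, unfreeze -- but only when the array is
# idle or already recovering; refuse while it is, say, resyncing.
md=/dev/md0
sa=/sys/block/md0/md/sync_action

state=$(cat "$sa")
case "$state" in
    idle|recover)
        echo frozen > "$sa"
        mdadm --add "$md" /dev/sdx
        mdadm --add "$md" /dev/sdy
        echo recover > "$sa"
        ;;
    *)
        # e.g. "resync": adding drives now would abort the resync in
        # favour of recovery, so let the current operation finish first.
        echo "array is $state; refusing to add drives" >&2
        exit 1
        ;;
esac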
On Tue, Jun 10, 2014 at 3:11 AM, NeilBrown <neilb@suse.de> wrote:
> On Mon, 9 Jun 2014 17:26:38 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> > Why do you refuse to freeze the array if it's not "idle"? What will
>> > happen is that the current recover/resync will abort, drives will be
>> > added, and on unfreezing, the array will resume (restart?) recovery with
>> > all drives. If the array was resyncing, however, it will start recovering
>> > the newly added drives, because the kernel prefers recovery over resync
>> > (as we discussed earlier).
>> Indeed, since dea3786ae2cf74ecb0087d1bea1aa04e9091ad5c, I see that you
>> agree to freeze the array also when it is recovering.
>
> I guess I did..... though I don't remember seeing the email that you have
> quoted. I can see it in my inbox, but it seems that I never replied. Maybe
> I was too busy that day :-(
>
> If there are other outstanding issues, feel free to resend.
> (If I don't reply it is more likely to be careless than deliberate, so in
> general you should feel free to resend if I don't respond in a week or so).
>
> NeilBrown
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread
Thread overview: 25+ messages
2011-12-12 11:54 RAID-10 explicitly defined drive pairs? Jan Kasprzak
2011-12-12 15:33 ` John Robinson
2012-01-06 15:08 ` Jan Kasprzak
2012-01-06 16:39 ` Peter Grandi
2012-01-06 19:16 ` Stan Hoeppner
2012-01-06 20:11 ` Jan Kasprzak
2012-01-06 22:55 ` Stan Hoeppner
2012-01-07 14:25 ` Peter Grandi
2012-01-07 16:25 ` Stan Hoeppner
2012-01-09 13:46 ` Peter Grandi
2012-01-10 3:54 ` Stan Hoeppner
2012-01-10 4:13 ` NeilBrown
2012-01-10 16:25 ` Stan Hoeppner
2012-01-12 11:58 ` Peter Grandi
2012-01-12 12:47 ` Peter Grandi
2012-01-12 21:24 ` Stan Hoeppner
2012-01-06 20:55 ` NeilBrown
2012-01-06 21:02 ` Jan Kasprzak
2012-03-22 10:01 ` Alexander Lyakas
2012-03-22 10:31 ` NeilBrown
2012-03-25 9:30 ` Alexander Lyakas
2012-04-04 16:56 ` Alexander Lyakas
2014-06-09 14:26 ` Alexander Lyakas
2014-06-10 0:11 ` NeilBrown
2014-06-11 16:05 ` Alexander Lyakas