* RAID10: how much does chunk size matter? Can partial chunks be written?
@ 2013-01-04 17:54 Andras Korn
From: Andras Korn @ 2013-01-04 17:54 UTC
To: linux-raid
Hi,
I have a RAID10 array with the default chunksize of 512k:
md124 : active raid10 sdg3[6] sdd3[0] sda3[7] sdh3[3] sdf3[2] sde3[1]
1914200064 blocks super 1.2 512K chunks 3 far-copies [6/6] [UUUUUU]
bitmap: 4/15 pages [16KB], 65536KB chunk
I have an application on top of it that writes blocks of 128k or less, using
multiple threads, pretty randomly (but reads dominate, hence far-copies;
large sequential reads are relatively frequent).
I wonder whether re-creating the array with a chunksize of 128k (or maybe
even just 64k) could be expected to improve write performance. I assume the
RAID10 implementation doesn't read-modify-write if writes are not aligned to
chunk boundaries, does it? In that case, reducing the chunk size would just
increase the likelihood of more than one disk (per copy) being necessary to
service each request, and thus decrease performance, right?
I understand that small chunksizes favour single-threaded sequential
workloads (because all disks can read/write simultaneously, thus adding
their bandwidth together), whereas large(r) chunksizes favour multi-threaded
random access (because a single disk may be enough to serve each request,
while the other disks serve other requests).
So: can RAID10 issue writes that start at some offset from a chunk boundary?
Thanks.
--
Andras Korn <korn at elan.rulez.org>
Visit the Soviet Union before it visits you.
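
A rough way to put numbers on the question being asked here: for a given
chunk size, how often does a 128 KiB write straddle a chunk boundary at all,
and how many chunks can it touch? The Python sketch below assumes a plain
striped layout and uniformly distributed, 4 KiB-aligned write offsets; it is
an illustration of the question, not a statement about how md's raid10 far
layout actually places data or whether it read-modify-writes.

  # How often does a 128 KiB write cross a chunk boundary, and how many
  # chunks can it touch? Plain striping, 4 KiB-aligned offsets assumed.
  WRITE = 128 * 1024
  ALIGN = 4 * 1024

  for chunk in (64 * 1024, 128 * 1024, 512 * 1024):
      starts = range(0, chunk, ALIGN)          # possible offsets within a chunk
      crossing = sum(1 for s in starts if s + WRITE > chunk)
      worst = (WRITE - 1) // chunk + 2 if crossing else 1  # max chunks touched
      print(f"chunk {chunk // 1024:3d} KiB: "
            f"{100 * crossing / len(starts):5.1f}% of writes cross a boundary, "
            f"touching up to {worst} chunks")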
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Stan Hoeppner @ 2013-01-04 22:51 UTC
To: Andras Korn; +Cc: linux-raid

On 1/4/2013 11:54 AM, Andras Korn wrote:

> I have a RAID10 array with the default chunksize of 512k:
>
> md124 : active raid10 sdg3[6] sdd3[0] sda3[7] sdh3[3] sdf3[2] sde3[1]
>       1914200064 blocks super 1.2 512K chunks 3 far-copies [6/6] [UUUUUU]
>       bitmap: 4/15 pages [16KB], 65536KB chunk
>
> I have an application on top of it that writes blocks of 128k or less, using
> multiple threads, pretty randomly (but reads dominate, hence far-copies;
> large sequential reads are relatively frequent).

You've left out critical details:

1. What filesystem are you using?
2. Does that app write to multiple files with these multiple threads, or the
   same file?
3. Is it creating new files or modifying/appending existing files? I.e. is
   it metadata intensive?
4. Why are you doing large sequential reads of files that are written 128KB
   at a time?

We need much more detail, because those details dictate how you need to
configure your 6 disk RAID10 for optimal performance.

> I wonder whether re-creating the array with a chunksize of 128k (or maybe
> even just 64k) could be expected to improve write performance. I assume the
> RAID10 implementation doesn't read-modify-write if writes are not aligned to
> chunk boundaries, does it? In that case, reducing the chunk size would just
> increase the likelihood of more than one disk (per copy) being necessary to
> service each request, and thus decrease performance, right?

RAID10 doesn't RMW because there is no parity, making chunk size less
critical than with RAID5/6 arrays. To optimize this array you really need to
capture the IO traffic pattern of the application. If your chunk size is too
large you may be creating IO hotspots on individual disks, with the others
idling to a degree. It's actually really difficult to pick a chunk size so
"small" that it decreases performance.

> I understand that small chunksizes favour single-threaded sequential
> workloads (because all disks can read/write simultaneously, thus adding
> their bandwidth together),

This simply isn't true. Small chunk sizes are preferable for almost all
workloads. Large chunks are only optimal for single thread long duration
streaming writes/reads.

> whereas large(r) chunksizes favour multi-threaded
> random access (because a single disk may be enough to serve each request,
> while the other disks serve other requests).

Again, not true. Everything depends on your workload: the file sizes
involved, and the read/write patterns from/to those files. For instance,
with your current 512KB chunk, if you're writing/reading 128KB files, you're
putting 4 files on disk1, then 4 files on disk2, then 4 files on disk3. If
you read them back in write order, even with 4 threads, you'll read the
first 4 files from disk1 while disks 2/3 sit idle. If you use a 128KB chunk,
each file gets written to a different disk. So when your 4 threads read them
back, each thread is accessing a different disk, all 3 disks working in
parallel.

Now, this ignores metadata writes/reads to the filesystem journal.
With a large chunk of 512KB, it's likely that most/all of your journal
writes will go to the first disk in the array. If this is the case you've
doubled (or more) the IO load on disk1, such that file IO performance will
be half that of each of the other drives. And, this is exactly why we
recommend nothing larger than a 32KB chunk size for XFS, and this is why the
md metadata 1.2 default chunk of 512KB is insane. Using a "small" chunk size
spreads both metadata and file IO more evenly across the spindles and yields
more predictable performance.

> So: can RAID10 issue writes that start at some offset from a chunk boundary?

The filesystem dictates where files are written, not md/RAID. If you have a
512KB chunk and you're writing 128KB or smaller files, 3 of your 4 file
writes will not start on a chunk boundary. If you use a 128KB chunk and all
your files are exactly 128KB then each will start on a chunk boundary.

That said, I wouldn't use a 128KB chunk. I'd use a 32KB chunk. And unless
your application is doing some really funky stuff, going above that most
likely isn't going to give you any benefit, especially if each of these
128KB writes is an individual file. In that case you definitely want a small
chunk due to the metadata write load.

--
Stan
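
The placement effect Stan describes can be checked with a few lines of
Python. This is a minimal sketch that follows his framing of the array as a
3-way stripe holding contiguously allocated 128 KiB files; the mirror copies
and md's far-layout arithmetic are ignored, so it illustrates the argument
rather than md's exact behaviour.

  # Which stripe member holds each of 12 consecutive 128 KiB files,
  # for a 512 KiB vs. a 128 KiB chunk? (Simplified 3-disk stripe.)
  DATA_DISKS = 3
  FILE_SIZE = 128 * 1024

  def disk_for(offset, chunk):
      """Member disk holding the chunk that contains this byte offset."""
      return (offset // chunk) % DATA_DISKS

  for chunk in (512 * 1024, 128 * 1024):
      placement = [disk_for(i * FILE_SIZE, chunk) for i in range(12)]
      print(f"chunk {chunk // 1024:3d} KiB: files 0-11 start on disks {placement}")

  # chunk 512 KiB: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]  (runs of 4 per disk)
  # chunk 128 KiB: [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]  (round-robin)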
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Andras Korn @ 2013-01-04 23:41 UTC
To: linux-raid

On Fri, Jan 04, 2013 at 04:51:14PM -0600, Stan Hoeppner wrote:

Hi,

> > I have a RAID10 array with the default chunksize of 512k:
> >
> > md124 : active raid10 sdg3[6] sdd3[0] sda3[7] sdh3[3] sdf3[2] sde3[1]
> >       1914200064 blocks super 1.2 512K chunks 3 far-copies [6/6] [UUUUUU]
> >       bitmap: 4/15 pages [16KB], 65536KB chunk
> >
> > I have an application on top of it that writes blocks of 128k or less, using
> > multiple threads, pretty randomly (but reads dominate, hence far-copies;
> > large sequential reads are relatively frequent).
>
> You've left out critical details:
>
> 1. What filesystem are you using?

The filesystem is the "application": it's zfsonlinux. I'm putting it on
RAID10 instead of using the disks natively because I want to encrypt it
using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
only have 3 cores).

I realize that I forsake some of the advantages of zfs by putting it on an
mdraid array.

I think this answers your other questions.

> > I wonder whether re-creating the array with a chunksize of 128k (or
> > maybe even just 64k) could be expected to improve write performance. I
> > assume the RAID10 implementation doesn't read-modify-write if writes are
> > not aligned to chunk boundaries, does it? In that case, reducing the
> > chunk size would just increase the likelihood of more than one disk (per
> > copy) being necessary to service each request, and thus decrease
> > performance, right?
>
> RAID10 doesn't RMW because there is no parity, making chunk size less
> critical than with RAID5/6 arrays. To optimize this array you really
> need to capture the IO traffic pattern of the application. If your
> chunk size is too large you may be creating IO hotspots on individual
> disks, with the others idling to a degree. It's actually really
> difficult to pick a chunk size so "small" that it decreases performance.

I'm not sure I follow, as far as the last sentence is concerned.

If, ad absurdum, you use a chunksize of 512 bytes, then several disks will
need to operate in lock-step to service any reasonably sized read request.
If, on the other hand, the chunk size is large, there is a good chance that
all of the data you're trying to read is in a single chunk, and therefore on
a single disk. This leaves the other spindles free to seek elsewhere
(servicing other requests). The drawback of the large chunksize is that, as
one thread only reads from one disk at a time, the read bandwidth of any one
thread is limited to the throughput of a single disk.

So, for reads, large chunk sizes favour multi-threaded random access,
whereas small chunk sizes favour single-threaded sequential throughput. If
there's a flaw in this chain of thought, please point it out. :)

Writes are not much different: all copies of a chunk must be updated when it
is written. If chunks are small, they are spread across more disks; thus a
single write causes more disks to seek.

> > I understand that small chunksizes favour single-threaded sequential
> > workloads (because all disks can read/write simultaneously, thus adding
> > their bandwidth together),
>
> This simply isn't true. Small chunk sizes are preferable for almost all
> workloads.
> Large chunks are only optimal for single thread long duration streaming
> writes/reads.

I think you have this backwards; see above.

Imagine, for the sake of simplicity, a RAID0 array with a chunksize of 1
bit. For single thread sequential reads, the bandwidth of all disks is added
together because they can all read at the same time. With a large chunksize,
you only get this if you also read ahead aggressively.

> > whereas large(r) chunksizes favour multi-threaded
> > random access (because a single disk may be enough to serve each request,
> > while the other disks serve other requests).
>
> Again, not true. Everything depends on your workload: the file sizes
> involved, and the read/write patterns from/to those files. For
> instance, with your current 512KB chunk, if you're writing/reading 128KB
> files, you're putting 4 files on disk1, then 4 files on disk2, then 4
> files on disk3. If you read them back in write order, even with 4
> threads, you'll read the first 4 files from disk1 while disks 2/3 sit
> idle. If you use a 128KB chunk, each file gets written to a different
> disk. So when your 4 threads read them back, each thread is accessing a
> different disk, all 3 disks working in parallel.

This is a specific bad case but not the average. Of course, if you know the
access pattern with sufficient specificity, then you can optimise for it,
but I don't. Many different applications will run on top of ZFS, with
occasional peaks in utilisation. This includes a mailserver that normally
has very little traffic, but with some mailing lists that have low mail
rates and many subscribers; there is a mediawiki instance with mysql that
has seasonal as well as trending changes in its traffic, etc. I can't
anticipate the exact access pattern; I'm looking for something that'll work
well in some abstract average sense.

> And, this is exactly why we recommend nothing larger than a 32KB chunk
> size for XFS, and this is why the md metadata 1.2 default chunk of 512KB
> is insane.

I think the problem of the xfs journal is much smaller since delaylog became
the default; for highly metadata intensive workloads I'd recommend an
external journal (it doesn't even need to be on an SSD because it's written
sequentially).

> Using a "small" chunk size spreads both metadata and file IO more evenly
> across the spindles and yields more predictable performance.

... but sucks for multithreaded random reads.

> > So: can RAID10 issue writes that start at some offset from a chunk boundary?
>
> The filesystem dictates where files are written, not md/RAID.

Of course. What I meant was: "If the filesystem issues a write request that
is not chunk-aligned, will RAID10 resort to read-modify-write, or just
perform the write at the requested offset within the chunk?"

> That said, I wouldn't use a 128KB chunk. I'd use a 32KB chunk. And
> unless your application is doing some really funky stuff, going above
> that most likely isn't going to give you any benefit, especially if
> each of these 128KB writes is an individual file. In that case you
> definitely want a small chunk due to the metadata write load.

I still believe going much below 128k would require more seeking and thus
hurt multithreaded random access performance.

If I use 32k chunks, every 128k block zfs writes will be spread across 4
disks (and that's assuming the 128k write was 32k-aligned); as I'm using 6
disks with 3 copies, every disk will end up holding 2 chunks of a 128k
write, a bit like this:

disk1 | disk2 | disk3 | disk4 | disk5 | disk6
  1   |   2   |   3   |   4   |   1   |   2
  3   |   4   |   1   |   2   |   3   |   4

Thus all disks will have to seek twice to serve this write, and four disks
will have to seek to read this 128k block.

With 128k chunks, it'd look like this:

disk1 | disk2 | disk3 | disk4 | disk5 | disk6
  1   |       |   1   |       |   1   |

Three disks would have to seek to serve the write (meanwhile, the other
three can serve other writes), and any one of three can serve a read,
leaving the others to seek in order to serve other requests.

How is this reasoning flawed?

Andras

--
Andras Korn <korn at elan.rulez.org>
Getting information from the Internet is like taking a drink from a hydrant.
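
The counting behind the two diagrams can be reproduced with a short Python
sketch, under the same simplification the diagrams use: copy c of stripe
chunk n is assumed to land on member disk (n + 2c) mod 6, and the 128k write
is assumed to be chunk-aligned. This is a schematic placement for counting
seeks, not md's documented far-layout arithmetic.

  # Chunk-copies written and disks needed to read back one 128 KiB block,
  # for 32 KiB vs. 128 KiB chunks on 6 disks with 3 copies.
  DISKS, COPIES, WRITE = 6, 3, 128 * 1024

  for chunk_kib in (32, 128):
      n_chunks = WRITE // (chunk_kib * 1024)
      # every copy of every chunk is one piece some disk has to write
      pieces = [(n + 2 * c) % DISKS for n in range(n_chunks) for c in range(COPIES)]
      read_disks = {n % DISKS for n in range(n_chunks)}   # one copy per chunk
      print(f"{chunk_kib:3d}k chunks: {len(pieces)} chunk-copies to write, "
            f"spread over {len(set(pieces))} disks; "
            f"{len(read_disks)} disk(s) needed to read the block back")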
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Chris Murphy @ 2013-01-05 0:27 UTC
To: linux-raid@vger.kernel.org Raid

On Jan 4, 2013, at 4:41 PM, Andras Korn <korn@raidlist.elan.rulez.org> wrote:

> The filesystem is the "application": it's zfsonlinux. I'm putting it on
> RAID10 instead of using the disks natively because I want to encrypt it
> using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> only have 3 cores).
>
> I realize that I forsake some of the advantages of zfs by putting it on an
> mdraid array.

I would not do this, you eliminate not just some of the advantages, but all
of the major ones including self-healing. Your choices are:

dmcrypt/LUKS (ZFS on encrypted logical device)
ecryptfs (encrypted fs on top of ZFS)
Nearline (or enterprise) drives that have self-encryption

The only way ZFS can self-heal is if it directly manages its own mirrored
copies or its own parity. To use ZFS in the fashion you're suggesting I
think is pointless, so skip using md or LVM. And consider the list in
reverse order as best performing, with your idea off the list entirely.

Three cores? Does it have AES-NI? If it does, it adds maybe 2% overhead for
encryption, although I can't tell you off hand if that's per disk.

Chris Murphy
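
For what it's worth, the AES-NI question is easy to answer on a running
Linux box: the "aes" flag in /proc/cpuinfo indicates the hardware AES
support that dm-crypt/LUKS can take advantage of. The snippet below is just
a convenience sketch; cryptsetup benchmark gives actual throughput numbers.

  # Does this CPU advertise AES-NI?
  with open("/proc/cpuinfo") as f:
      cpu_flags = set()
      for line in f:
          if line.startswith("flags"):
              cpu_flags = set(line.split(":", 1)[1].split())
              break
  print("AES-NI available:", "aes" in cpu_flags)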
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Stan Hoeppner @ 2013-01-05 0:46 UTC
To: Andras Korn; +Cc: linux-raid

On 1/4/2013 5:41 PM, Andras Korn wrote:
> On Fri, Jan 04, 2013 at 04:51:14PM -0600, Stan Hoeppner wrote:
>> You've left out critical details:
>>
>> 1. What filesystem are you using?
>
> The filesystem is the "application": it's zfsonlinux. I'm putting it on
> RAID10 instead of using the disks natively because I want to encrypt it
> using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> only have 3 cores).

I just love when the other shoe drops and it turns out to be a size 13
boot... filled with lead.

Any reason why you intentionally omitted these critical details from your
initial post?

--
Stan
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Andras Korn @ 2013-01-05 7:44 UTC
To: linux-raid

On Fri, Jan 04, 2013 at 06:46:20PM -0600, Stan Hoeppner wrote:

> >> 1. What filesystem are you using?
> >
> > The filesystem is the "application": it's zfsonlinux. I'm putting it on
> > RAID10 instead of using the disks natively because I want to encrypt it
> > using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> > only have 3 cores).
>
> I just love when the other shoe drops and it turns out to be a size 13
> boot... filled with lead.
>
> Any reason why you intentionally omitted these critical details from
> your initial post?

Yes: I thought I was asking a theoretical question, not for advice on tuning
my specific setup, and thought - apparently correctly, I might add :) - that
including these details would only get everyone sidetracked into trying to
optimise for a specific application.

I don't have a specific application with a specific access pattern because I
have to run all sorts of applications on this box, on top of this array,
simultaneously.

But meanwhile I think I have received an answer: the fact that zfs uses
blocks of 128k or less does not automatically mean that write performance
would suffer on a 512k-chunk RAID10 array, because the Linux RAID10
implementation, very sensibly, doesn't insist on writes being aligned to
chunk boundaries.

So thanks.

--
Andras Korn <korn at elan.rulez.org>
Reality is merely an illusion, albeit a very persistent one.
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Chris Murphy @ 2013-01-05 1:30 UTC
To: linux-raid@vger.kernel.org Raid

On Jan 4, 2013, at 4:41 PM, Andras Korn <korn@raidlist.elan.rulez.org> wrote:

> The filesystem is the "application": it's zfsonlinux. I'm putting it on
> RAID10 instead of using the disks natively because I want to encrypt it
> using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> only have 3 cores).

Also, it's worth reading this, to hopefully ensure the backup system isn't
also experimental.

http://confessionsofalinuxpenguin.blogspot.com/2012/09/btrfs-vs-zfsonlinux-how-do-they-compare.html

I mean, think about it another way. You value the data, apparently, enough
to encrypt it. But then you're willing to basically f around with the data
by using a "nailing jello to a tree" approach for a file system. Quite
honestly you should consider doing this on FreeBSD or OpenIndiana where
there's native support for encryption, and for ZFS, no nail and jello
required. People who care about their data, and need/want a resilient file
system, do it on one of those two OSs.

If FreeBSD/OpenIndiana are no ops, the way to do it on Linux is, XFS on
nearline SATA or SAS SEDs, which have an order magnitude (at least) lower
UER than consumer crap, and hence less of a reason why you need to 2nd guess
the disks with a resilient file system.

But even though also experimental, I'd still use Btrfs before I'd use ZFS on
LUKS on Linux, just saying.

Chris Murphy
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Andras Korn @ 2013-01-05 8:15 UTC
To: linux-raid

Replying to Chris Murphy:

> > The filesystem is the "application": it's zfsonlinux. I'm putting it on
> > RAID10 instead of using the disks natively because I want to encrypt it
> > using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> > only have 3 cores).
> >
> > I realize that I forsake some of the advantages of zfs by putting it on an
> > mdraid array.
>
> I would not do this, you eliminate not just some of the advantages, but
> all of the major ones including self-healing.

I know; however, I get to use compression, convenient management, fast
snapshots etc. If I later add an SSD I can use it as L2ARC.

I have considered the benefits and disadvantages and I think my choice was
the right one.

> dmcrypt/LUKS (ZFS on encrypted logical device)
> ecryptfs (encrypted fs on top of ZFS)

I know of ecryptfs but I don't know how mature it is or how well it would
work on top of zfs (which it certainly hasn't been tested with). I have a
fair amount of experience with LUKS though. I considered and rejected
ecryptfs due to my lack of experience with it. Perhaps I'll get the chance
to play with it sometime so I can deploy it with confidence later.

> Nearline (or enterprise) drives that have self-encryption

Alas, too expensive. I built this server for a hobby/charity project, from
disks I had lying around; buying enterprise grade hardware is out of the
question.

> The only way ZFS can self-heal is if it directly manages its own mirrored
> copies or its own parity. To use ZFS in the fashion you're suggesting I
> think is pointless, so skip using md or LVM. And consider the list in
> reverse order as best performing, with your idea off the list entirely.

It's not pointless (see above), just sub-optimal.

> Three cores? Does it have AES-NI?

No. It's a Phenom II X3 705e.

> If it does, it adds maybe 2% overhead for encryption, although I can't
> tell you off hand if that's per disk.

Per encrypted device. If I had encrypted the six disks separately, I'd be
running six encryption threads, and encrypting each piece of data in
triplicate. And without AES-NI, it's more like 10% when there are many
writes.

> Also, it's worth reading this, to hopefully ensure the backup system isn't
> also experimental.
>
> http://confessionsofalinuxpenguin.blogspot.com/2012/09/btrfs-vs-zfsonlinux-how-do-they-compare.html

Thanks, I've read it.

I actually did try FreeBSD first, but it kept locking up if it had more than
one CPU AND there was a lot of I/O going on. My idea was to build a FreeBSD
based storage appliance in a VM (because I can't run all my stuff on FreeBSD
directly), and export the VM's zfs to Linux (maybe in another VM, maybe the
host), but it just didn't work. OpenIndiana failed in a similar but not
identical way. I don't know enough about either system to be able to
troubleshoot them effectively, and no time, right now, to learn.

The article, btw, doesn't mention some of the other differences between
btrfs and zfs: for example, afaik, with btrfs the mount hierarchy has to
mirror the pool hierarchy, whereas with zfs you can mount every fs anywhere.

On the whole, btrfs "feels" a lot more experimental to me than zfsonlinux,
which is actually pretty stable (I've been using it for more than a year
now). There are occasional problems, to be sure, but it's getting better at
a steady pace. I guess I like to live on the edge.

> I mean, think about it another way. You value the data, apparently, enough
> to encrypt it. But then you're willing to basically f around with the data
> by using a "nailing jello to a tree" approach for a file system. Quite
> honestly you should consider doing this on FreeBSD or OpenIndiana where
> there's native support for encryption, and for ZFS, no nail and jello
> required. People who care about their data, and need/want a resilient file
> system, do it on one of those two OSs.

You're conflating two distinct meanings of "value". I encrypt my data for
reasons of privacy, not confidentiality: I don't want other people to
automatically have access to it if they have my disks - for example, because
the server is stolen, or because I've disposed of a defective disk without
securely erasing it first. OTOH, the data does not have particularly high
business value. Losing it would be inconvenient, but not a big deal.

> If FreeBSD/OpenIndiana are no ops, the way to do it on Linux is, XFS on
> nearline SATA or SAS SEDs, which have an order magnitude (at least) lower
> UER than consumer crap, and hence less of a reason why you need to 2nd guess
> the disks with a resilient file system.

Zfs doesn't appeal to me (only) because of its resilience. I benefit a lot
from compression and snapshotting, somewhat from deduplication, somewhat
from zfs send/receive, a lot from the flexible "volume management" etc. I
will also later benefit from the ability to use an SSD as cache.

> But even though also experimental, I'd still use Btrfs before I'd use ZFS
> on LUKS on Linux, just saying.

Perhaps you'd like to read https://lwn.net/Articles/506487/ and the
admittedly somewhat outdated
http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-is-better-than-btrfs
for some reasons people might want to prefer zfs to btrfs.

--
Andras Korn <korn at elan.rulez.org>
Remember: Rape and pillage, and THEN burn!
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Chris Murphy @ 2013-01-05 18:57 UTC
To: linux-raid@vger.kernel.org Raid

On Jan 5, 2013, at 1:15 AM, Andras Korn <korn@raidlist.elan.rulez.org> wrote:

> Replying to Chris Murphy:
>
>> I would not do this, you eliminate not just some of the advantages, but
>> all of the major ones including self-healing.
>
> I know; however, I get to use compression, convenient management, fast
> snapshots etc. If I later add an SSD I can use it as L2ARC.

You've definitely exchanged performance and resilience, for maybe possibly
not sure about adding an SSD. An SSD that you're more likely to need because
of the extra layers you're forcing this setup to use.

Btrfs offers all of the features you list, except the SSD that you haven't
even committed to, and you'd regain resilience and drop at least two layers
that will negatively impact performance.

> Alas, too expensive. I built this server for a hobby/charity project, from
> disks I had lying around; buying enterprise grade hardware is out of the
> question.

All the more reason why simpler is better, and this is distinctly not
simple. It's a FrankenNAS. You might consider arbitrarily yanking one of the
disks, and seeing how the restore process works out for you.

>> The only way ZFS can self-heal is if it directly manages its own mirrored
>> copies or its own parity. To use ZFS in the fashion you're suggesting I
>> think is pointless, so skip using md or LVM. And consider the list in
>> reverse order as best performing, with your idea off the list entirely.
>
> It's not pointless (see above), just sub-optimal.

Pointless. You're going to take the COW and data checksumming performance
hit for no reason. If you care so little about that, at least with Btrfs you
can turn both of those off.

>> If it does, it adds maybe 2% overhead for encryption, although I can't
>> tell you off hand if that's per disk.
>
> Per encrypted device.

Really? You're sure? Explain to me the difference between six kworker
threads each encrypting 100-150MB/s, and funneling 600MB/s - 1GB/s through
one kworker thread. It seems you have a fixed amount of data per unit time
that must be encrypted.

> The article, btw, doesn't mention some of the other differences between
> btrfs and zfs: for example, afaik, with btrfs the mount hierarchy has to
> mirror the pool hierarchy, whereas with zfs you can mount every fs anywhere.

And the use case for this is? You might consider esoteric and minor
differences like this to be a good exchange f

> On the whole, btrfs "feels" a lot more experimental to me than zfsonlinux,
> which is actually pretty stable (I've been using it for more than a year
> now). There are occasional problems, to be sure, but it's getting better at
> a steady pace. I guess I like to live on the edge.

I have heard of exactly no one doing what you're doing, and I'd say that
makes it far more experimental than Btrfs. If by "feels" experimental, you
mean many commits to new kernels and few backports, OK. I suggest you run on
a UPS in either case, especially if you don't have the time to test your
rebuild process.

>> If FreeBSD/OpenIndiana are no ops, the way to do it on Linux is, XFS on
>> nearline SATA or SAS SEDs, which have an order magnitude (at least) lower
>> UER than consumer crap, and hence less of a reason why you need to 2nd guess
>> the disks with a resilient file system.
>
> Zfs doesn't appeal to me (only) because of its resilience. I benefit a lot
> from compression and snapshotting, somewhat from deduplication, somewhat
> from zfs send/receive, a lot from the flexible "volume management" etc. I
> will also later benefit from the ability to use an SSD as cache.

I'm glad you're demoting the importance of resilience since the way you're
going to use it totally obviates its resilience to that of any other fs.

You don't get dedup without an SSD, it's way too slow to be useable at all,
and you need a large SSD to do a meaningful amount of dedup with ZFS and
also have enough for caching. Discount send/receive because Btrfs has that,
and I don't know what you mean by flexible volume management.

>> But even though also experimental, I'd still use Btrfs before I'd use ZFS
>> on LUKS on Linux, just saying.
>
> Perhaps you'd like to read https://lwn.net/Articles/506487/ and the
> admittedly somewhat outdated

The singular thing here is the SSD as ZIL or L2ARC, and that's something
being worked on in the Linux VFS rather than make it a file system specific
feature. If you look at all the zfsonlinux benchmarks, even SSD isn't enough
to help ZFS depending on the task. So long as you've done your homework on
the read/write patterns and made sure it's compatible with the capabilities
of what you're designing, great. Otherwise it's pure speculation what on
paper features (which you're not using anyway) even matter.

> http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-is-better-than-btrfs
> for some reasons people might want to prefer zfs to btrfs.

Except that article is ideological b.s. that gets rather fundamental facts
wrong. You can organize subvolumes however you want. You can rename them.
You can move them. You can boot from them. That has always been the case, so
the age of the article doesn't even matter.

It mostly sounds like you like features that you're not even going to use
from the outset, and won't use, but you want them anyway. Which is not the
way to design storage. You design it for a task. If you really need the
features you're talking about, you'd actually spend the time to sort out
your FreeBSD/OpenIndiana problems.

Chris Murphy
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Andras Korn @ 2013-01-05 21:01 UTC
To: linux-raid@vger.kernel.org Raid

On Sat, Jan 05, 2013 at 11:57:53AM -0700, Chris Murphy wrote:

> >> I would not do this, you eliminate not just some of the advantages, but
> >> all of the major ones including self-healing.
> >
> > I know; however, I get to use compression, convenient management, fast
> > snapshots etc. If I later add an SSD I can use it as L2ARC.
>
> You've definitely exchanged performance and resilience, for maybe possibly
> not sure about adding an SSD.

Oh, it's quite sure. I've ordered it already but it hasn't arrived yet. The
only question is when I'll have time to install it in the server.

> An SSD that you're more likely to need because of the extra layers you're
> forcing this setup to use.

Come on, that's FUD. I need the SSD because I have an I/O workload with many
random reads, too slow disks, too few spindles and too little RAM. It has
little to do with the number of layers. (Also, I arguably don't "need" the
SSD, but it will do a lot to improve interactive performance.)

> > Alas, too expensive. I built this server for a hobby/charity project, from
> > disks I had lying around; buying enterprise grade hardware is out of the
> > question.
>
> All the more reason why simpler is better, and this is distinctly not
> simple. It's a FrankenNAS.

It may not be simple, but it's something I have experience with, so it's
relatively simple for me to maintain and set up. With btrfs there'd be a
learning curve, in addition to less flexibility in choosing mountpoints, for
example. Also, btrfs is completely new, whereas zfs is only new on Linux. I
don't trust btrfs yet (and reading kernel changelogs keeps me cautious).

The setup where a virtualised FreeBSD or OpenIndiana would've been a NAS for
the host, OTOH... that would have been positively Frankensteinian.

BTW, just to make you wince: two of the first "production" boxes I deployed
zfsonlinux on actually have mdraid-LUKS-LVM-zfs (and it works and is even
fast enough).

> You might consider arbitrarily yanking one of the disks, and seeing how
> the restore process works out for you.

I did that several times, worked fine (but took a while, of course).

> >> The only way ZFS can self-heal is if it directly manages its own mirrored
> >> copies or its own parity. To use ZFS in the fashion you're suggesting I
> >> think is pointless, so skip using md or LVM. And consider the list in
> >> reverse order as best performing, with your idea off the list entirely.
> >
> > It's not pointless (see above), just sub-optimal.
>
> Pointless. You're going to take the COW and data checksumming performance
> hit for no reason.

Not for no reason: I get cheap snapshots out of the COW thing, and dedup out
of checksumming (for a select few filesystems).

> If you care so little about that, at least with Btrfs
> you can turn both of those off.

For the record, I could turn off checksumming on zfs too. That's actually
not a bad idea, come to think of it, because most of my datasets really
don't benefit from it. Thanks.

> > Per encrypted device.
>
> Really? You're sure? Explain to me the difference between six kworker
> threads each encrypting 100-150MB/s, and funneling 600MB/s - 1GB/s through
> one kworker thread. It seems you have a fixed amount of data per unit time
> that must be encrypted.

No, because if I encrypt each disk separately, I need to encrypt the same
piece of data 3 times (because I store 3 copies of everything). In the
current setup, replication occurs below the encryption layer.

> > The article, btw, doesn't mention some of the other differences between
> > btrfs and zfs: for example, afaik, with btrfs the mount hierarchy has to
> > mirror the pool hierarchy, whereas with zfs you can mount every fs anywhere.
>
> And the use case for this is?

For example: I have two zpools, one encrypted and one not. Both contain
filesystems that get mounted under /srv. Of course this would be possible
with btrfs using workarounds like bind mounts and symlinks, but why should I
need a workaround?

Or how about this? I want some subdirectories of /usr and /var to be on zfs,
in the same pool, with the rest being on xfs. (This might be possible with
btrfs; I don't know.)

In another case, I had a backup of a vserver (as in linux-vserver.org); the
host with the live instance failed and I had to create clones of some of the
backup snapshots, then mount them in various locations to be able to start
the vserver on the backup host. This was possible even though all were part
of the 'backup' pool.

Flexibility is almost always a good thing.

> > On the whole, btrfs "feels" a lot more experimental to me than zfsonlinux,
> > which is actually pretty stable (I've been using it for more than a year
> > now). There are occasional problems, to be sure, but it's getting better at
> > a steady pace. I guess I like to live on the edge.
>
> I have heard of exactly no one doing what you're doing, and I'd say that
> makes it far more experimental than Btrfs.

That's only if you work from the premise that there are magical interactions
between the various layers, which, while conceivable, my experience so far
doesn't confirm. (If you consider enough specifics, _every_ setup is
"experimental": at least the serial number of the hardware components likely
differs from the previous similar setup.)

> If by "feels" experimental, you mean many commits to new kernels and few
> backports, OK. I suggest you run on a UPS in either case, especially if
> you don't have the time to test your rebuild process.

Alas, no UPS. I make do with turning the write cache of my drives off. (But
the box survived numerous crashes caused by the first power supply being
just short of sufficient, which makes me relatively confident of the
resilience of the storage subsystem.)

> >> If FreeBSD/OpenIndiana are no ops, the way to do it on Linux is, XFS on
> >> nearline SATA or SAS SEDs, which have an order magnitude (at least) lower
> >> UER than consumer crap, and hence less of a reason why you need to 2nd guess
> >> the disks with a resilient file system.
> >
> > Zfs doesn't appeal to me (only) because of its resilience. I benefit a lot
> > from compression and snapshotting, somewhat from deduplication, somewhat
> > from zfs send/receive, a lot from the flexible "volume management" etc. I
> > will also later benefit from the ability to use an SSD as cache.
>
> I'm glad you're demoting the importance of resilience since the way you're
> going to use it totally obviates its resilience to that of any other fs.

I know.

> You don't get dedup without an SSD, it's way too slow to be useable at
> all,

That entirely depends. I have a few hundred MB of storage that I can dedup
very efficiently (ratio of maybe 5:1). While the space savings are
insignificant, the dedup table is also small and it fits into ARC easily.
I don't dedup it for the space savings on disk, but for the space savings in
cache.

> and you need a large SSD to do a meaningful amount of dedup with ZFS
> and also have enough for caching. Discount send/receive because Btrfs has
> that, and I don't know what you mean by flexible volume management.

The above wasn't about zfs vs. btrfs, it was about zfs vs. xfs on allegedly
better quality drives.

> >> But even though also experimental, I'd still use Btrfs before I'd use ZFS
> >> on LUKS on Linux, just saying.
> >
> > Perhaps you'd like to read https://lwn.net/Articles/506487/ and the
> > admittedly somewhat outdated
>
> The singular thing here is the SSD as ZIL or L2ARC, and that's something
> being worked on in the Linux VFS rather than make it a file system
> specific feature. If you look at all the zfsonlinux benchmarks, even SSD
> isn't enough to help ZFS depending on the task. So long as you've done
> your homework on the read/write patterns and made sure it's compatible
> with the capabilities of what you're designing, great. Otherwise it's pure
> speculation what on paper features (which you're not using anyway) even
> matter.

I have a fair amount of experience with zfsonlinux, both with and without
SSDs, and similar (but not identical) workloads. I have reason to believe
the SSD will help. (FWIW, it will also allow me to use an external bitmap
for my mdraid, which will also help.)

> > http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-is-better-than-btrfs
> > for some reasons people might want to prefer zfs to btrfs.
>
> It mostly sounds like you like features that you're not even going to use
> from the outset, and won't use, but you want them anyway.

I have no idea what you mean. Maybe you're conflating our theoretical
argument over whether zfsonlinux might be preferable to btrfs at all, and
the one about whether I made the right choice in this specific instance?

Hmmm... hasn't this gotten somewhat off-topic? (And see how right I was in
not mentioning zfsonlinux when all I wanted to know was whether Linux RAID10
insisted on chunk-aligned writes? :)

--
Andras Korn <korn at elan.rulez.org>
I tried sniffing Coke once, but the ice cubes got stuck in my nose.
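
The encryption-work point Andras makes above (replication below versus above
the crypto layer) is simple arithmetic; the sketch below only makes it
explicit. The write rate is a made-up placeholder, not a measurement of this
system.

  # CPU-side encryption work for the same application writes, depending on
  # where dm-crypt sits relative to a 3-copy md array.
  COPIES = 3
  app_writes_mb_s = 100                      # hypothetical application write rate

  crypt_above_md = app_writes_mb_s           # LUKS on top of md: encrypt once,
                                             # md replicates the ciphertext
  crypt_per_disk = app_writes_mb_s * COPIES  # LUKS per disk under md: every copy
                                             # is encrypted separately

  print(f"dm-crypt above md: {crypt_above_md} MB/s through the cipher")
  print(f"dm-crypt per disk: {crypt_per_disk} MB/s through the cipher "
        f"({COPIES}x the CPU work)")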
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Peter Grandi @ 2013-01-04 23:43 UTC
To: Linux RAID

[ ... ]

> md124 : active raid10 sdg3[6] sdd3[0] sda3[7] sdh3[3] sdf3[2] sde3[1]
>       1914200064 blocks super 1.2 512K chunks 3 far-copies [6/6] [UUUUUU]
>       bitmap: 4/15 pages [16KB], 65536KB chunk

> [ ... ] writes blocks of 128k or less, using multiple threads,
> pretty randomly (but reads dominate, hence far-copies; large
> sequential reads are relatively frequent). I wonder whether
> re-creating the array with a chunksize of 128k (or maybe even
> just 64k) could be expected to improve write performance.

This is an amazingly underreported context in which to ask such a loaded yet
vague question. It is one of the usual absurdist "contributions" to this
mailing list, even if not as comically misconceived as many others.

You have a complicated workload, with entirely different access profiles at
different times, on who-knows-what disks, host adapter, memory, CPU, and you
are not even reporting what is the current "performance" that should be
improved. Nor even saying what "performance" matters (bandwidth or latency
or transactions/s? average or maximum? caches enabled? ...).

Never mind that rotating disks have rather different transfer rates on
outer/inner tracks, something that can mean very different tradeoffs
depending on where the data lands. What about the filesystem and the chances
that large transactions happen on logically contiguous block device sectors?

Just asking such a question is ridiculous. The only way you are going to get
something vaguely sensible is to run the same load on two identical
configurations with different chunk sizes, and even that won't help you a
lot if your workload is rather variable, as it is very likely to be.

> I assume the RAID10 implementation doesn't read-modify-write
> if writes are not aligned to chunk boundaries, does it?

If you have to assume this, instead of knowing it, you shouldn't ask absurd
questions about complicated workloads and the fine points of chunk sizes.

> I understand that small chunksizes favour single-threaded
> sequential workloads (because all disks can read/write
> simultaneously, thus adding their bandwidth together), whereas
> large(r) chunksizes favour multi-threaded random access
> (because a single disk may be enough to serve each request,
> while the other disks serve other requests).

That's a very peculiar way of looking at it. Assuming that we are not
talking about hw synchronized drives, as in RAID3 variants, the main issue
with chunk size is that every IO large enough to involve more than one disk
can incur massive latency from the disks not being at the same rotational
position at the same time, and the worst case is that the IO takes a full
disk rotation to complete, if completion matters.

For example, on a RAID0 of 2 disks capable of 7200RPM or 120RPS, a transfer
of 2x512B sectors can take 8 milliseconds, delivering around 120KiB/s of
throughput, thanks to an 8ms latency every 1KiB.

In the case of _streaming_, with read ahead or write behind, and with very
good elevator algorithms, the latency due to the disks rotating
independently can be somewhat hidden, and a large chunk size obviously
helps.

Conversely, with a mostly random workload a larger chunk size can restrict
the maximum amount of parallelism, as the reduced interleaving *may* result
in a higher chance that accesses from two threads will hit the same disk and
thus the same disk arm.

Also note that chunk size does not matter for each RAID1 "pair" in the
RAID10, as there are no chunks, and for _reading_ the two disks are fully
uncoupled, while for writing they are fully coupled. The chunk size that
matters is solely that of the RAID0 "on top" of the RAID1 pairs, and that
perhaps in many situations does not matter a lot. It is really
complicated...

My usual refrain is to tell people who don't get most of the subtle details
of storage to just use RAID10 and not to worry too much, because RAID0 works
well in almost every case, and if they have "performance" problems to add
more disks; usually as more pairs (rather than turning 2-disk RAID1s into
3-disk RAID1s, which can also be a good idea in several cases), as the
number of arms and thus IOPS per TB of *used* capacity often is a big issue,
as most storage "experts" eventually should figure out.

In this case the only vague details as to "performance" goals are for writes
of largish-block random multithreaded access, and more pairs seems the first
thing to consider, as the setup has only got 3 pairs totaling something
around 300-400 random IOPS, or with 128kiB writes probably overall write
rates of 30-40MB/s, even not considering the sequential read interference.
Doubling the number of pairs is probably something to start with.

So choosing RAID10 was a good idea; asking yourself whether vaguely
undefined "performance" can be positively affected by some subtle detail of
tuning is not.

Note: Long ago I used to prefer small chunk sizes (like 16KiB), but the
sequential speed of disks has improved a lot in the meantime, while the
rotational one(s) have been pretty much constant for decades, once 3600RPM
disks stopped being designed.

To make a very crude argument, assuming a common 7200RPM disk with a full
rotation every 8ms, and presumably an average offset among the disks of half
a rotation or 4ms, ideally the amount read or written in one go should be at
least as much as can be read or written in 4ms across all disks. On a 10MB/s
disk of several years ago one half rotation means 40KB; on a 100MB/s disk
that becomes 400KB. The goal is ideally to minimize the time spent waiting
for all the disks to complete their work, which begins those same 4ms apart,
so larger chunk sizes probably help, even if they may reduce the degree of
multithreading available. Also, the contemporary disks that can do 100MB/s
tend to be much bigger than the old disks that could do 10MB/s, but they
still have only one arm.

But with a small chunk size largish IO transactions will as a rule involve
contiguous chunks (if the filesystem is suitable), and suitable elevators
can still turn a 1MiB IO to 4 disks into much the same pattern of transfers
whether the chunk size is 16KiB or 128KiB, as in the end each disk has to
transfer 256KiB of physically contiguous sectors.

Yes, it is complicated.
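
Plugging numbers into the rotational argument above shows where the
120 KiB/s, 40 KB and 400 KB figures come from; the half rotation is rounded
to 4 ms as in the text. This is only the back-of-envelope calculation, not a
model of real disk behaviour.

  # 7200 RPM back-of-envelope from the preceding paragraphs.
  rpm = 7200
  rotation_ms = 60_000 / rpm                 # ~8.3 ms per full rotation

  # Worst case for a tiny striped IO: a full rotation of waiting per 1 KiB.
  print(f"full rotation: {rotation_ms:.1f} ms -> "
        f"~{1000 / rotation_ms:.0f} KiB/s if every 1 KiB costs one rotation")

  # Data a member can move during the ~half-rotation skew between disks.
  half_rotation_s = 0.004
  for mb_per_s in (10, 100):
      kb = mb_per_s * 1000 * half_rotation_s
      print(f"{mb_per_s:3d} MB/s disk: ~{kb:.0f} KB per half rotation")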