* RAID10: how much does chunk size matter? Can partial chunks be written?
@ 2013-01-04 17:54 Andras Korn
From: Andras Korn @ 2013-01-04 17:54 UTC
To: linux-raid
Hi,
I have a RAID10 array with the default chunksize of 512k:
md124 : active raid10 sdg3[6] sdd3[0] sda3[7] sdh3[3] sdf3[2] sde3[1]
1914200064 blocks super 1.2 512K chunks 3 far-copies [6/6] [UUUUUU]
bitmap: 4/15 pages [16KB], 65536KB chunk
I have an application on top of it that writes blocks of 128k or less, using
multiple threads, pretty randomly (but reads dominate, hence far-copies;
large sequential reads are relatively frequent).
I wonder whether re-creating the array with a chunksize of 128k (or maybe
even just 64k) could be expected to improve write performance. I assume the
RAID10 implementation doesn't read-modify-write if writes are not aligned to
chunk boundaries, does it? In that case, reducing the chunk size would just
increase the likelihood of more than one disk (per copy) being necessary to
service each request, and thus decrease performance, right?
I understand that small chunksizes favour single-threaded sequential
workloads (because all disks can read/write simultaneously, thus adding
their bandwidth together), whereas large(r) chunksizes favour multi-threaded
random access (because a single disk may be enough to serve each request,
while the other disks serve other requests).
So: can RAID10 issue writes that start at some offset from a chunk boundary?
Thanks.
--
Andras Korn <korn at elan.rulez.org>
Visit the Soviet Union before it visits you.
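
A rough way to put numbers on the question being asked here: for a given
chunk size, how often does a 128 KiB write straddle a chunk boundary at all,
and how many chunks can it touch? The Python sketch below assumes a plain
striped layout and uniformly distributed, 4 KiB-aligned write offsets; it is
an illustration of the question, not a statement about how md's raid10 far
layout actually places data or whether it read-modify-writes.

  # How often does a 128 KiB write cross a chunk boundary, and how many
  # chunks can it touch? Plain striping, 4 KiB-aligned offsets assumed.
  WRITE = 128 * 1024
  ALIGN = 4 * 1024

  for chunk in (64 * 1024, 128 * 1024, 512 * 1024):
      starts = range(0, chunk, ALIGN)          # possible offsets within a chunk
      crossing = sum(1 for s in starts if s + WRITE > chunk)
      worst = (WRITE - 1) // chunk + 2 if crossing else 1  # max chunks touched
      print(f"chunk {chunk // 1024:3d} KiB: "
            f"{100 * crossing / len(starts):5.1f}% of writes cross a boundary, "
            f"touching up to {worst} chunks")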
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Stan Hoeppner @ 2013-01-04 22:51 UTC
To: Andras Korn; +Cc: linux-raid

On 1/4/2013 11:54 AM, Andras Korn wrote:

> I have a RAID10 array with the default chunksize of 512k:
>
> md124 : active raid10 sdg3[6] sdd3[0] sda3[7] sdh3[3] sdf3[2] sde3[1]
>       1914200064 blocks super 1.2 512K chunks 3 far-copies [6/6] [UUUUUU]
>       bitmap: 4/15 pages [16KB], 65536KB chunk
>
> I have an application on top of it that writes blocks of 128k or less, using
> multiple threads, pretty randomly (but reads dominate, hence far-copies;
> large sequential reads are relatively frequent).

You've left out critical details:

1. What filesystem are you using?
2. Does that app write to multiple files with these multiple threads, or the
   same file?
3. Is it creating new files or modifying/appending existing files? I.e. is
   it metadata intensive?
4. Why are you doing large sequential reads of files that are written 128KB
   at a time?

We need much more detail, because those details dictate how you need to
configure your 6 disk RAID10 for optimal performance.

> I wonder whether re-creating the array with a chunksize of 128k (or maybe
> even just 64k) could be expected to improve write performance. I assume the
> RAID10 implementation doesn't read-modify-write if writes are not aligned to
> chunk boundaries, does it? In that case, reducing the chunk size would just
> increase the likelihood of more than one disk (per copy) being necessary to
> service each request, and thus decrease performance, right?

RAID10 doesn't RMW because there is no parity, making chunk size less
critical than with RAID5/6 arrays. To optimize this array you really need to
capture the IO traffic pattern of the application. If your chunk size is too
large you may be creating IO hotspots on individual disks, with the others
idling to a degree. It's actually really difficult to pick a chunk size so
"small" that it decreases performance.

> I understand that small chunksizes favour single-threaded sequential
> workloads (because all disks can read/write simultaneously, thus adding
> their bandwidth together),

This simply isn't true. Small chunk sizes are preferable for almost all
workloads. Large chunks are only optimal for single thread long duration
streaming writes/reads.

> whereas large(r) chunksizes favour multi-threaded
> random access (because a single disk may be enough to serve each request,
> while the other disks serve other requests).

Again, not true. Everything depends on your workload: the file sizes
involved, and the read/write patterns from/to those files. For instance,
with your current 512KB chunk, if you're writing/reading 128KB files, you're
putting 4 files on disk1, then 4 files on disk2, then 4 files on disk3. If
you read them back in write order, even with 4 threads, you'll read the
first 4 files from disk1 while disks 2/3 sit idle. If you use a 128KB chunk,
each file gets written to a different disk. So when your 4 threads read them
back, each thread is accessing a different disk, all 3 disks working in
parallel.

Now, this ignores metadata writes/reads to the filesystem journal.
With a large chunk of 512KB, it's likely that most/all of your journal
writes will go to the first disk in the array. If this is the case you've
doubled (or more) the IO load on disk1, such that file IO performance will
be half that of each of the other drives. And, this is exactly why we
recommend nothing larger than a 32KB chunk size for XFS, and this is why the
md metadata 1.2 default chunk of 512KB is insane. Using a "small" chunk size
spreads both metadata and file IO more evenly across the spindles and yields
more predictable performance.

> So: can RAID10 issue writes that start at some offset from a chunk boundary?

The filesystem dictates where files are written, not md/RAID. If you have a
512KB chunk and you're writing 128KB or smaller files, 3 of your 4 file
writes will not start on a chunk boundary. If you use a 128KB chunk and all
your files are exactly 128KB then each will start on a chunk boundary.

That said, I wouldn't use a 128KB chunk. I'd use a 32KB chunk. And unless
your application is doing some really funky stuff, going above that most
likely isn't going to give you any benefit, especially if each of these
128KB writes is an individual file. In that case you definitely want a small
chunk due to the metadata write load.

--
Stan
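
The placement effect Stan describes can be checked with a few lines of
Python. This is a minimal sketch that follows his framing of the array as a
3-way stripe holding contiguously allocated 128 KiB files; the mirror copies
and md's far-layout arithmetic are ignored, so it illustrates the argument
rather than md's exact behaviour.

  # Which stripe member holds each of 12 consecutive 128 KiB files,
  # for a 512 KiB vs. a 128 KiB chunk? (Simplified 3-disk stripe.)
  DATA_DISKS = 3
  FILE_SIZE = 128 * 1024

  def disk_for(offset, chunk):
      """Member disk holding the chunk that contains this byte offset."""
      return (offset // chunk) % DATA_DISKS

  for chunk in (512 * 1024, 128 * 1024):
      placement = [disk_for(i * FILE_SIZE, chunk) for i in range(12)]
      print(f"chunk {chunk // 1024:3d} KiB: files 0-11 start on disks {placement}")

  # chunk 512 KiB: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]  (runs of 4 per disk)
  # chunk 128 KiB: [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]  (round-robin)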
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Andras Korn @ 2013-01-04 23:41 UTC
To: linux-raid

On Fri, Jan 04, 2013 at 04:51:14PM -0600, Stan Hoeppner wrote:

Hi,

> > I have a RAID10 array with the default chunksize of 512k:
> >
> > md124 : active raid10 sdg3[6] sdd3[0] sda3[7] sdh3[3] sdf3[2] sde3[1]
> >       1914200064 blocks super 1.2 512K chunks 3 far-copies [6/6] [UUUUUU]
> >       bitmap: 4/15 pages [16KB], 65536KB chunk
> >
> > I have an application on top of it that writes blocks of 128k or less, using
> > multiple threads, pretty randomly (but reads dominate, hence far-copies;
> > large sequential reads are relatively frequent).
>
> You've left out critical details:
>
> 1. What filesystem are you using?

The filesystem is the "application": it's zfsonlinux. I'm putting it on
RAID10 instead of using the disks natively because I want to encrypt it
using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
only have 3 cores).

I realize that I forsake some of the advantages of zfs by putting it on an
mdraid array.

I think this answers your other questions.

> > I wonder whether re-creating the array with a chunksize of 128k (or
> > maybe even just 64k) could be expected to improve write performance. I
> > assume the RAID10 implementation doesn't read-modify-write if writes are
> > not aligned to chunk boundaries, does it? In that case, reducing the
> > chunk size would just increase the likelihood of more than one disk (per
> > copy) being necessary to service each request, and thus decrease
> > performance, right?
>
> RAID10 doesn't RMW because there is no parity, making chunk size less
> critical than with RAID5/6 arrays. To optimize this array you really
> need to capture the IO traffic pattern of the application. If your
> chunk size is too large you may be creating IO hotspots on individual
> disks, with the others idling to a degree. It's actually really
> difficult to pick a chunk size so "small" that it decreases performance.

I'm not sure I follow, as far as the last sentence is concerned.

If, ad absurdum, you use a chunksize of 512 bytes, then several disks will
need to operate in lock-step to service any reasonably sized read request.
If, on the other hand, the chunk size is large, there is a good chance that
all of the data you're trying to read is in a single chunk, and therefore on
a single disk. This leaves the other spindles free to seek elsewhere
(servicing other requests). The drawback of the large chunksize is that, as
one thread only reads from one disk at a time, the read bandwidth of any one
thread is limited to the throughput of a single disk.

So, for reads, large chunk sizes favour multi-threaded random access,
whereas small chunk sizes favour single-threaded sequential throughput. If
there's a flaw in this chain of thought, please point it out. :)

Writes are not much different: all copies of a chunk must be updated when it
is written. If chunks are small, they are spread across more disks; thus a
single write causes more disks to seek.

> > I understand that small chunksizes favour single-threaded sequential
> > workloads (because all disks can read/write simultaneously, thus adding
> > their bandwidth together),
>
> This simply isn't true. Small chunk sizes are preferable for almost all
> workloads.
> Large chunks are only optimal for single thread long duration streaming
> writes/reads.

I think you have this backwards; see above.

Imagine, for the sake of simplicity, a RAID0 array with a chunksize of 1
bit. For single thread sequential reads, the bandwidth of all disks is added
together because they can all read at the same time. With a large chunksize,
you only get this if you also read ahead aggressively.

> > whereas large(r) chunksizes favour multi-threaded
> > random access (because a single disk may be enough to serve each request,
> > while the other disks serve other requests).
>
> Again, not true. Everything depends on your workload: the file sizes
> involved, and the read/write patterns from/to those files. For
> instance, with your current 512KB chunk, if you're writing/reading 128KB
> files, you're putting 4 files on disk1, then 4 files on disk2, then 4
> files on disk3. If you read them back in write order, even with 4
> threads, you'll read the first 4 files from disk1 while disks 2/3 sit
> idle. If you use a 128KB chunk, each file gets written to a different
> disk. So when your 4 threads read them back, each thread is accessing a
> different disk, all 3 disks working in parallel.

This is a specific bad case but not the average. Of course, if you know the
access pattern with sufficient specificity, then you can optimise for it,
but I don't. Many different applications will run on top of ZFS, with
occasional peaks in utilisation. This includes a mailserver that normally
has very little traffic, but with some mailing lists that have low mail
rates and many subscribers; there is a mediawiki instance with mysql that
has seasonal as well as trending changes in its traffic, etc. I can't
anticipate the exact access pattern; I'm looking for something that'll work
well in some abstract average sense.

> And, this is exactly why we recommend nothing larger than a 32KB chunk
> size for XFS, and this is why the md metadata 1.2 default chunk of 512KB
> is insane.

I think the problem of the xfs journal is much smaller since delaylog became
the default; for highly metadata intensive workloads I'd recommend an
external journal (it doesn't even need to be on an SSD because it's written
sequentially).

> Using a "small" chunk size spreads both metadata and file IO more evenly
> across the spindles and yields more predictable performance.

... but sucks for multithreaded random reads.

> > So: can RAID10 issue writes that start at some offset from a chunk boundary?
>
> The filesystem dictates where files are written, not md/RAID.

Of course. What I meant was: "If the filesystem issues a write request that
is not chunk-aligned, will RAID10 resort to read-modify-write, or just
perform the write at the requested offset within the chunk?"

> That said, I wouldn't use a 128KB chunk. I'd use a 32KB chunk. And
> unless your application is doing some really funky stuff, going above
> that most likely isn't going to give you any benefit, especially if
> each of these 128KB writes is an individual file. In that case you
> definitely want a small chunk due to the metadata write load.

I still believe going much below 128k would require more seeking and thus
hurt multithreaded random access performance.

If I use 32k chunks, every 128k block zfs writes will be spread across 4
disks (and that's assuming the 128k write was 32k-aligned); as I'm using 6
disks with 3 copies, every disk will end up holding 2 chunks of a 128k
write, a bit like this:

disk1 | disk2 | disk3 | disk4 | disk5 | disk6
  1   |   2   |   3   |   4   |   1   |   2
  3   |   4   |   1   |   2   |   3   |   4

Thus all disks will have to seek twice to serve this write, and four disks
will have to seek to read this 128k block.

With 128k chunks, it'd look like this:

disk1 | disk2 | disk3 | disk4 | disk5 | disk6
  1   |       |   1   |       |   1   |

Three disks would have to seek to serve the write (meanwhile, the other
three can serve other writes), and any one of three can serve a read,
leaving the others to seek in order to serve other requests.

How is this reasoning flawed?

Andras

--
Andras Korn <korn at elan.rulez.org>
Getting information from the Internet is like taking a drink from a hydrant.
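
The counting behind the two diagrams can be reproduced with a short Python
sketch, under the same simplification the diagrams use: copy c of stripe
chunk n is assumed to land on member disk (n + 2c) mod 6, and the 128k write
is assumed to be chunk-aligned. This is a schematic placement for counting
seeks, not md's documented far-layout arithmetic.

  # Chunk-copies written and disks needed to read back one 128 KiB block,
  # for 32 KiB vs. 128 KiB chunks on 6 disks with 3 copies.
  DISKS, COPIES, WRITE = 6, 3, 128 * 1024

  for chunk_kib in (32, 128):
      n_chunks = WRITE // (chunk_kib * 1024)
      # every copy of every chunk is one piece some disk has to write
      pieces = [(n + 2 * c) % DISKS for n in range(n_chunks) for c in range(COPIES)]
      read_disks = {n % DISKS for n in range(n_chunks)}   # one copy per chunk
      print(f"{chunk_kib:3d}k chunks: {len(pieces)} chunk-copies to write, "
            f"spread over {len(set(pieces))} disks; "
            f"{len(read_disks)} disk(s) needed to read the block back")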
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Chris Murphy @ 2013-01-05 0:27 UTC
To: linux-raid@vger.kernel.org Raid

On Jan 4, 2013, at 4:41 PM, Andras Korn <korn@raidlist.elan.rulez.org> wrote:

> The filesystem is the "application": it's zfsonlinux. I'm putting it on
> RAID10 instead of using the disks natively because I want to encrypt it
> using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> only have 3 cores).
>
> I realize that I forsake some of the advantages of zfs by putting it on an
> mdraid array.

I would not do this, you eliminate not just some of the advantages, but all
of the major ones including self-healing. Your choices are:

dmcrypt/LUKS (ZFS on encrypted logical device)
ecryptfs (encrypted fs on top of ZFS)
Nearline (or enterprise) drives that have self-encryption

The only way ZFS can self-heal is if it directly manages its own mirrored
copies or its own parity. To use ZFS in the fashion you're suggesting I
think is pointless, so skip using md or LVM. And consider the list in
reverse order as best performing, with your idea off the list entirely.

Three cores? Does it have AES-NI? If it does, it adds maybe 2% overhead for
encryption, although I can't tell you off hand if that's per disk.

Chris Murphy
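
For what it's worth, the AES-NI question is easy to answer on a running
Linux box: the "aes" flag in /proc/cpuinfo indicates the hardware AES
support that dm-crypt/LUKS can take advantage of. The snippet below is just
a convenience sketch; cryptsetup benchmark gives actual throughput numbers.

  # Does this CPU advertise AES-NI?
  with open("/proc/cpuinfo") as f:
      cpu_flags = set()
      for line in f:
          if line.startswith("flags"):
              cpu_flags = set(line.split(":", 1)[1].split())
              break
  print("AES-NI available:", "aes" in cpu_flags)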
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Stan Hoeppner @ 2013-01-05 0:46 UTC
To: Andras Korn; +Cc: linux-raid

On 1/4/2013 5:41 PM, Andras Korn wrote:
> On Fri, Jan 04, 2013 at 04:51:14PM -0600, Stan Hoeppner wrote:
>> You've left out critical details:
>>
>> 1. What filesystem are you using?
>
> The filesystem is the "application": it's zfsonlinux. I'm putting it on
> RAID10 instead of using the disks natively because I want to encrypt it
> using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> only have 3 cores).

I just love when the other shoe drops and it turns out to be a size 13
boot... filled with lead.

Any reason why you intentionally omitted these critical details from your
initial post?

--
Stan
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Andras Korn @ 2013-01-05 7:44 UTC
To: linux-raid

On Fri, Jan 04, 2013 at 06:46:20PM -0600, Stan Hoeppner wrote:

> >> 1. What filesystem are you using?
> >
> > The filesystem is the "application": it's zfsonlinux. I'm putting it on
> > RAID10 instead of using the disks natively because I want to encrypt it
> > using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> > only have 3 cores).
>
> I just love when the other shoe drops and it turns out to be a size 13
> boot... filled with lead.
>
> Any reason why you intentionally omitted these critical details from
> your initial post?

Yes: I thought I was asking a theoretical question, not for advice on tuning
my specific setup, and thought - apparently correctly, I might add :) - that
including these details would only get everyone sidetracked into trying to
optimise for a specific application.

I don't have a specific application with a specific access pattern because I
have to run all sorts of applications on this box, on top of this array,
simultaneously.

But meanwhile I think I have received an answer: the fact that zfs uses
blocks of 128k or less does not automatically mean that write performance
would suffer on a 512k-chunk RAID10 array, because the Linux RAID10
implementation, very sensibly, doesn't insist on writes being aligned to
chunk boundaries.

So thanks.

--
Andras Korn <korn at elan.rulez.org>
Reality is merely an illusion, albeit a very persistent one.
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Chris Murphy @ 2013-01-05 1:30 UTC
To: linux-raid@vger.kernel.org Raid

On Jan 4, 2013, at 4:41 PM, Andras Korn <korn@raidlist.elan.rulez.org> wrote:

> The filesystem is the "application": it's zfsonlinux. I'm putting it on
> RAID10 instead of using the disks natively because I want to encrypt it
> using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> only have 3 cores).

Also, it's worth reading this, to hopefully ensure the backup system isn't
also experimental.

http://confessionsofalinuxpenguin.blogspot.com/2012/09/btrfs-vs-zfsonlinux-how-do-they-compare.html

I mean, think about it another way. You value the data, apparently, enough
to encrypt it. But then you're willing to basically f around with the data
by using a "nailing jello to a tree" approach for a file system. Quite
honestly you should consider doing this on FreeBSD or OpenIndiana where
there's native support for encryption, and for ZFS, no nail and jello
required. People who care about their data, and need/want a resilient file
system, do it on one of those two OSs.

If FreeBSD/OpenIndiana are no ops, the way to do it on Linux is, XFS on
nearline SATA or SAS SEDs, which have an order magnitude (at least) lower
UER than consumer crap, and hence less of a reason why you need to 2nd guess
the disks with a resilient file system.

But even though also experimental, I'd still use Btrfs before I'd use ZFS on
LUKS on Linux, just saying.

Chris Murphy
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Andras Korn @ 2013-01-05 8:15 UTC
To: linux-raid

Replying to Chris Murphy:

> > The filesystem is the "application": it's zfsonlinux. I'm putting it on
> > RAID10 instead of using the disks natively because I want to encrypt it
> > using LUKS, and encrypting each disk separately seemed wasteful of CPU (I
> > only have 3 cores).
> >
> > I realize that I forsake some of the advantages of zfs by putting it on an
> > mdraid array.
>
> I would not do this, you eliminate not just some of the advantages, but
> all of the major ones including self-healing.

I know; however, I get to use compression, convenient management, fast
snapshots etc. If I later add an SSD I can use it as L2ARC.

I have considered the benefits and disadvantages and I think my choice was
the right one.

> dmcrypt/LUKS (ZFS on encrypted logical device)
> ecryptfs (encrypted fs on top of ZFS)

I know of ecryptfs but I don't know how mature it is or how well it would
work on top of zfs (which it certainly hasn't been tested with). I have a
fair amount of experience with LUKS though. I considered and rejected
ecryptfs due to my lack of experience with it. Perhaps I'll get the chance
to play with it sometime so I can deploy it with confidence later.

> Nearline (or enterprise) drives that have self-encryption

Alas, too expensive. I built this server for a hobby/charity project, from
disks I had lying around; buying enterprise grade hardware is out of the
question.

> The only way ZFS can self-heal is if it directly manages its own mirrored
> copies or its own parity. To use ZFS in the fashion you're suggesting I
> think is pointless, so skip using md or LVM. And consider the list in
> reverse order as best performing, with your idea off the list entirely.

It's not pointless (see above), just sub-optimal.

> Three cores? Does it have AES-NI?

No. It's a Phenom II X3 705e.

> If it does, it adds maybe 2% overhead for encryption, although I can't
> tell you off hand if that's per disk.

Per encrypted device. If I had encrypted the six disks separately, I'd be
running six encryption threads, and encrypting each piece of data in
triplicate. And without AES-NI, it's more like 10% when there are many
writes.

> Also, it's worth reading this, to hopefully ensure the backup system isn't
> also experimental.
>
> http://confessionsofalinuxpenguin.blogspot.com/2012/09/btrfs-vs-zfsonlinux-how-do-they-compare.html

Thanks, I've read it.

I actually did try FreeBSD first, but it kept locking up if it had more than
one CPU AND there was a lot of I/O going on. My idea was to build a FreeBSD
based storage appliance in a VM (because I can't run all my stuff on FreeBSD
directly), and export the VM's zfs to Linux (maybe in another VM, maybe the
host), but it just didn't work. OpenIndiana failed in a similar but not
identical way. I don't know enough about either system to be able to
troubleshoot them effectively, and no time, right now, to learn.

The article, btw, doesn't mention some of the other differences between
btrfs and zfs: for example, afaik, with btrfs the mount hierarchy has to
mirror the pool hierarchy, whereas with zfs you can mount every fs anywhere.

On the whole, btrfs "feels" a lot more experimental to me than zfsonlinux,
which is actually pretty stable (I've been using it for more than a year
now). There are occasional problems, to be sure, but it's getting better at
a steady pace. I guess I like to live on the edge.

> I mean, think about it another way. You value the data, apparently, enough
> to encrypt it. But then you're willing to basically f around with the data
> by using a "nailing jello to a tree" approach for a file system. Quite
> honestly you should consider doing this on FreeBSD or OpenIndiana where
> there's native support for encryption, and for ZFS, no nail and jello
> required. People who care about their data, and need/want a resilient file
> system, do it on one of those two OSs.

You're conflating two distinct meanings of "value". I encrypt my data for
reasons of privacy, not confidentiality: I don't want other people to
automatically have access to it if they have my disks - for example, because
the server is stolen, or because I've disposed of a defective disk without
securely erasing it first. OTOH, the data does not have particularly high
business value. Losing it would be inconvenient, but not a big deal.

> If FreeBSD/OpenIndiana are no ops, the way to do it on Linux is, XFS on
> nearline SATA or SAS SEDs, which have an order magnitude (at least) lower
> UER than consumer crap, and hence less of a reason why you need to 2nd guess
> the disks with a resilient file system.

Zfs doesn't appeal to me (only) because of its resilience. I benefit a lot
from compression and snapshotting, somewhat from deduplication, somewhat
from zfs send/receive, a lot from the flexible "volume management" etc. I
will also later benefit from the ability to use an SSD as cache.

> But even though also experimental, I'd still use Btrfs before I'd use ZFS
> on LUKS on Linux, just saying.

Perhaps you'd like to read https://lwn.net/Articles/506487/ and the
admittedly somewhat outdated
http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-is-better-than-btrfs
for some reasons people might want to prefer zfs to btrfs.

--
Andras Korn <korn at elan.rulez.org>
Remember: Rape and pillage, and THEN burn!
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Chris Murphy @ 2013-01-05 18:57 UTC
To: linux-raid@vger.kernel.org Raid

On Jan 5, 2013, at 1:15 AM, Andras Korn <korn@raidlist.elan.rulez.org> wrote:

> Replying to Chris Murphy:
>
>> I would not do this, you eliminate not just some of the advantages, but
>> all of the major ones including self-healing.
>
> I know; however, I get to use compression, convenient management, fast
> snapshots etc. If I later add an SSD I can use it as L2ARC.

You've definitely exchanged performance and resilience, for maybe possibly
not sure about adding an SSD. An SSD that you're more likely to need because
of the extra layers you're forcing this setup to use.

Btrfs offers all of the features you list, except the SSD that you haven't
even committed to, and you'd regain resilience and drop at least two layers
that will negatively impact performance.

> Alas, too expensive. I built this server for a hobby/charity project, from
> disks I had lying around; buying enterprise grade hardware is out of the
> question.

All the more reason why simpler is better, and this is distinctly not
simple. It's a FrankenNAS. You might consider arbitrarily yanking one of the
disks, and seeing how the restore process works out for you.

>> The only way ZFS can self-heal is if it directly manages its own mirrored
>> copies or its own parity. To use ZFS in the fashion you're suggesting I
>> think is pointless, so skip using md or LVM. And consider the list in
>> reverse order as best performing, with your idea off the list entirely.
>
> It's not pointless (see above), just sub-optimal.

Pointless. You're going to take the COW and data checksumming performance
hit for no reason. If you care so little about that, at least with Btrfs you
can turn both of those off.

>> If it does, it adds maybe 2% overhead for encryption, although I can't
>> tell you off hand if that's per disk.
>
> Per encrypted device.

Really? You're sure? Explain to me the difference between six kworker
threads each encrypting 100-150MB/s, and funneling 600MB/s - 1GB/s through
one kworker thread. It seems you have a fixed amount of data per unit time
that must be encrypted.

> The article, btw, doesn't mention some of the other differences between
> btrfs and zfs: for example, afaik, with btrfs the mount hierarchy has to
> mirror the pool hierarchy, whereas with zfs you can mount every fs anywhere.

And the use case for this is? You might consider esoteric and minor
differences like this to be a good exchange f

> On the whole, btrfs "feels" a lot more experimental to me than zfsonlinux,
> which is actually pretty stable (I've been using it for more than a year
> now). There are occasional problems, to be sure, but it's getting better at
> a steady pace. I guess I like to live on the edge.

I have heard of exactly no one doing what you're doing, and I'd say that
makes it far more experimental than Btrfs. If by "feels" experimental, you
mean many commits to new kernels and few backports, OK. I suggest you run on
a UPS in either case, especially if you don't have the time to test your
rebuild process.

>> If FreeBSD/OpenIndiana are no ops, the way to do it on Linux is, XFS on
>> nearline SATA or SAS SEDs, which have an order magnitude (at least) lower
>> UER than consumer crap, and hence less of a reason why you need to 2nd guess
>> the disks with a resilient file system.
>
> Zfs doesn't appeal to me (only) because of its resilience. I benefit a lot
> from compression and snapshotting, somewhat from deduplication, somewhat
> from zfs send/receive, a lot from the flexible "volume management" etc. I
> will also later benefit from the ability to use an SSD as cache.

I'm glad you're demoting the importance of resilience since the way you're
going to use it totally obviates its resilience to that of any other fs.

You don't get dedup without an SSD, it's way too slow to be useable at all,
and you need a large SSD to do a meaningful amount of dedup with ZFS and
also have enough for caching. Discount send/receive because Btrfs has that,
and I don't know what you mean by flexible volume management.

>> But even though also experimental, I'd still use Btrfs before I'd use ZFS
>> on LUKS on Linux, just saying.
>
> Perhaps you'd like to read https://lwn.net/Articles/506487/ and the
> admittedly somewhat outdated

The singular thing here is the SSD as ZIL or L2ARC, and that's something
being worked on in the Linux VFS rather than make it a file system specific
feature. If you look at all the zfsonlinux benchmarks, even SSD isn't enough
to help ZFS depending on the task. So long as you've done your homework on
the read/write patterns and made sure it's compatible with the capabilities
of what you're designing, great. Otherwise it's pure speculation what on
paper features (which you're not using anyway) even matter.

> http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-is-better-than-btrfs
> for some reasons people might want to prefer zfs to btrfs.

Except that article is ideological b.s. that gets rather fundamental facts
wrong. You can organize subvolumes however you want. You can rename them.
You can move them. You can boot from them. That has always been the case, so
the age of the article doesn't even matter.

It mostly sounds like you like features that you're not even going to use
from the outset, and won't use, but you want them anyway. Which is not the
way to design storage. You design it for a task. If you really need the
features you're talking about, you'd actually spend the time to sort out
your FreeBSD/OpenIndiana problems.

Chris Murphy
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Andras Korn @ 2013-01-05 21:01 UTC
To: linux-raid@vger.kernel.org Raid

On Sat, Jan 05, 2013 at 11:57:53AM -0700, Chris Murphy wrote:

> >> I would not do this, you eliminate not just some of the advantages, but
> >> all of the major ones including self-healing.
> >
> > I know; however, I get to use compression, convenient management, fast
> > snapshots etc. If I later add an SSD I can use it as L2ARC.
>
> You've definitely exchanged performance and resilience, for maybe possibly
> not sure about adding an SSD.

Oh, it's quite sure. I've ordered it already but it hasn't arrived yet. The
only question is when I'll have time to install it in the server.

> An SSD that you're more likely to need because of the extra layers you're
> forcing this setup to use.

Come on, that's FUD. I need the SSD because I have an I/O workload with many
random reads, too slow disks, too few spindles and too little RAM. It has
little to do with the number of layers. (Also, I arguably don't "need" the
SSD, but it will do a lot to improve interactive performance.)

> > Alas, too expensive. I built this server for a hobby/charity project, from
> > disks I had lying around; buying enterprise grade hardware is out of the
> > question.
>
> All the more reason why simpler is better, and this is distinctly not
> simple. It's a FrankenNAS.

It may not be simple, but it's something I have experience with, so it's
relatively simple for me to maintain and set up. With btrfs there'd be a
learning curve, in addition to less flexibility in choosing mountpoints, for
example. Also, btrfs is completely new, whereas zfs is only new on Linux. I
don't trust btrfs yet (and reading kernel changelogs keeps me cautious).

The setup where a virtualised FreeBSD or OpenIndiana would've been a NAS for
the host, OTOH... that would have been positively Frankensteinian.

BTW, just to make you wince: two of the first "production" boxes I deployed
zfsonlinux on actually have mdraid-LUKS-LVM-zfs (and it works and is even
fast enough).

> You might consider arbitrarily yanking one of the disks, and seeing how
> the restore process works out for you.

I did that several times, worked fine (but took a while, of course).

> >> The only way ZFS can self-heal is if it directly manages its own mirrored
> >> copies or its own parity. To use ZFS in the fashion you're suggesting I
> >> think is pointless, so skip using md or LVM. And consider the list in
> >> reverse order as best performing, with your idea off the list entirely.
> >
> > It's not pointless (see above), just sub-optimal.
>
> Pointless. You're going to take the COW and data checksumming performance
> hit for no reason.

Not for no reason: I get cheap snapshots out of the COW thing, and dedup out
of checksumming (for a select few filesystems).

> If you care so little about that, at least with Btrfs
> you can turn both of those off.

For the record, I could turn off checksumming on zfs too. That's actually
not a bad idea, come to think of it, because most of my datasets really
don't benefit from it. Thanks.

> > Per encrypted device.
>
> Really? You're sure? Explain to me the difference between six kworker
> threads each encrypting 100-150MB/s, and funneling 600MB/s - 1GB/s through
> one kworker thread. It seems you have a fixed amount of data per unit time
> that must be encrypted.

No, because if I encrypt each disk separately, I need to encrypt the same
piece of data 3 times (because I store 3 copies of everything). In the
current setup, replication occurs below the encryption layer.

> > The article, btw, doesn't mention some of the other differences between
> > btrfs and zfs: for example, afaik, with btrfs the mount hierarchy has to
> > mirror the pool hierarchy, whereas with zfs you can mount every fs anywhere.
>
> And the use case for this is?

For example: I have two zpools, one encrypted and one not. Both contain
filesystems that get mounted under /srv. Of course this would be possible
with btrfs using workarounds like bind mounts and symlinks, but why should I
need a workaround?

Or how about this? I want some subdirectories of /usr and /var to be on zfs,
in the same pool, with the rest being on xfs. (This might be possible with
btrfs; I don't know.)

In another case, I had a backup of a vserver (as in linux-vserver.org); the
host with the live instance failed and I had to create clones of some of the
backup snapshots, then mount them in various locations to be able to start
the vserver on the backup host. This was possible even though all were part
of the 'backup' pool.

Flexibility is almost always a good thing.

> > On the whole, btrfs "feels" a lot more experimental to me than zfsonlinux,
> > which is actually pretty stable (I've been using it for more than a year
> > now). There are occasional problems, to be sure, but it's getting better at
> > a steady pace. I guess I like to live on the edge.
>
> I have heard of exactly no one doing what you're doing, and I'd say that
> makes it far more experimental than Btrfs.

That's only if you work from the premise that there are magical interactions
between the various layers, which, while conceivable, my experience so far
doesn't confirm. (If you consider enough specifics, _every_ setup is
"experimental": at least the serial number of the hardware components likely
differs from the previous similar setup.)

> If by "feels" experimental, you mean many commits to new kernels and few
> backports, OK. I suggest you run on a UPS in either case, especially if
> you don't have the time to test your rebuild process.

Alas, no UPS. I make do with turning the write cache of my drives off. (But
the box survived numerous crashes caused by the first power supply being
just short of sufficient, which makes me relatively confident of the
resilience of the storage subsystem.)

> >> If FreeBSD/OpenIndiana are no ops, the way to do it on Linux is, XFS on
> >> nearline SATA or SAS SEDs, which have an order magnitude (at least) lower
> >> UER than consumer crap, and hence less of a reason why you need to 2nd guess
> >> the disks with a resilient file system.
> >
> > Zfs doesn't appeal to me (only) because of its resilience. I benefit a lot
> > from compression and snapshotting, somewhat from deduplication, somewhat
> > from zfs send/receive, a lot from the flexible "volume management" etc. I
> > will also later benefit from the ability to use an SSD as cache.
>
> I'm glad you're demoting the importance of resilience since the way you're
> going to use it totally obviates its resilience to that of any other fs.

I know.

> You don't get dedup without an SSD, it's way too slow to be useable at
> all,

That entirely depends. I have a few hundred MB of storage that I can dedup
very efficiently (ratio of maybe 5:1). While the space savings are
insignificant, the dedup table is also small and it fits into ARC easily.
I don't dedup it for the space savings on disk, but for the space savings in
cache.

> and you need a large SSD to do a meaningful amount of dedup with ZFS
> and also have enough for caching. Discount send/receive because Btrfs has
> that, and I don't know what you mean by flexible volume management.

The above wasn't about zfs vs. btrfs, it was about zfs vs. xfs on allegedly
better quality drives.

> >> But even though also experimental, I'd still use Btrfs before I'd use ZFS
> >> on LUKS on Linux, just saying.
> >
> > Perhaps you'd like to read https://lwn.net/Articles/506487/ and the
> > admittedly somewhat outdated
>
> The singular thing here is the SSD as ZIL or L2ARC, and that's something
> being worked on in the Linux VFS rather than make it a file system
> specific feature. If you look at all the zfsonlinux benchmarks, even SSD
> isn't enough to help ZFS depending on the task. So long as you've done
> your homework on the read/write patterns and made sure it's compatible
> with the capabilities of what you're designing, great. Otherwise it's pure
> speculation what on paper features (which you're not using anyway) even
> matter.

I have a fair amount of experience with zfsonlinux, both with and without
SSDs, and similar (but not identical) workloads. I have reason to believe
the SSD will help. (FWIW, it will also allow me to use an external bitmap
for my mdraid, which will also help.)

> > http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-is-better-than-btrfs
> > for some reasons people might want to prefer zfs to btrfs.
>
> It mostly sounds like you like features that you're not even going to use
> from the outset, and won't use, but you want them anyway.

I have no idea what you mean. Maybe you're conflating our theoretical
argument over whether zfsonlinux might be preferable to btrfs at all, and
the one about whether I made the right choice in this specific instance?

Hmmm... hasn't this gotten somewhat off-topic? (And see how right I was in
not mentioning zfsonlinux when all I wanted to know was whether Linux RAID10
insisted on chunk-aligned writes? :)

--
Andras Korn <korn at elan.rulez.org>
I tried sniffing Coke once, but the ice cubes got stuck in my nose.
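
The encryption-work point Andras makes above (replication below versus above
the crypto layer) is simple arithmetic; the sketch below only makes it
explicit. The write rate is a made-up placeholder, not a measurement of this
system.

  # CPU-side encryption work for the same application writes, depending on
  # where dm-crypt sits relative to a 3-copy md array.
  COPIES = 3
  app_writes_mb_s = 100                      # hypothetical application write rate

  crypt_above_md = app_writes_mb_s           # LUKS on top of md: encrypt once,
                                             # md replicates the ciphertext
  crypt_per_disk = app_writes_mb_s * COPIES  # LUKS per disk under md: every copy
                                             # is encrypted separately

  print(f"dm-crypt above md: {crypt_above_md} MB/s through the cipher")
  print(f"dm-crypt per disk: {crypt_per_disk} MB/s through the cipher "
        f"({COPIES}x the CPU work)")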
* Re: RAID10: how much does chunk size matter? Can partial chunks be written?
From: Peter Grandi @ 2013-01-04 23:43 UTC
To: Linux RAID

[ ... ]

> md124 : active raid10 sdg3[6] sdd3[0] sda3[7] sdh3[3] sdf3[2] sde3[1]
>       1914200064 blocks super 1.2 512K chunks 3 far-copies [6/6] [UUUUUU]
>       bitmap: 4/15 pages [16KB], 65536KB chunk

> [ ... ] writes blocks of 128k or less, using multiple threads,
> pretty randomly (but reads dominate, hence far-copies; large
> sequential reads are relatively frequent). I wonder whether
> re-creating the array with a chunksize of 128k (or maybe even
> just 64k) could be expected to improve write performance.

This is an amazingly underreported context in which to ask such a loaded yet
vague question. It is one of the usual absurdist "contributions" to this
mailing list, even if not as comically misconceived as many others.

You have a complicated workload, with entirely different access profiles at
different times, on who-knows-what disks, host adapter, memory, CPU, and you
are not even reporting what is the current "performance" that should be
improved. Nor even saying what "performance" matters (bandwidth or latency
or transactions/s? average or maximum? caches enabled? ...).

Never mind that rotating disks have rather different transfer rates on
outer/inner tracks, something that can mean very different tradeoffs
depending on where the data lands. What about the filesystem and the chances
that large transactions happen on logically contiguous block device sectors?

Just asking such a question is ridiculous. The only way you are going to get
something vaguely sensible is to run the same load on two identical
configurations with different chunk sizes, and even that won't help you a
lot if your workload is rather variable, as it is very likely to be.

> I assume the RAID10 implementation doesn't read-modify-write
> if writes are not aligned to chunk boundaries, does it?

If you have to assume this, instead of knowing it, you shouldn't ask absurd
questions about complicated workloads and the fine points of chunk sizes.

> I understand that small chunksizes favour single-threaded
> sequential workloads (because all disks can read/write
> simultaneously, thus adding their bandwidth together), whereas
> large(r) chunksizes favour multi-threaded random access
> (because a single disk may be enough to serve each request,
> while the other disks serve other requests).

That's a very peculiar way of looking at it. Assuming that we are not
talking about hw synchronized drives, as in RAID3 variants, the main issue
with chunk size is that every IO large enough to involve more than one disk
can incur massive latency from the disks not being at the same rotational
position at the same time, and the worst case is that the IO takes a full
disk rotation to complete, if completion matters.

For example, on a RAID0 of 2 disks capable of 7200RPM or 120RPS, a transfer
of 2x512B sectors can take 8 milliseconds, delivering around 120KiB/s of
throughput, thanks to an 8ms latency every 1KiB.

In the case of _streaming_, with read ahead or write behind, and with very
good elevator algorithms, the latency due to the disks rotating
independently can be somewhat hidden, and a large chunk size obviously
helps.

Conversely, with a mostly random workload a larger chunk size can restrict
the maximum amount of parallelism, as the reduced interleaving *may* result
in a higher chance that accesses from two threads will hit the same disk and
thus the same disk arm.

Also note that chunk size does not matter for each RAID1 "pair" in the
RAID10, as there are no chunks, and for _reading_ the two disks are fully
uncoupled, while for writing they are fully coupled. The chunk size that
matters is solely that of the RAID0 "on top" of the RAID1 pairs, and that
perhaps in many situations does not matter a lot. It is really
complicated...

My usual refrain is to tell people who don't get most of the subtle details
of storage to just use RAID10 and not to worry too much, because RAID0 works
well in almost every case, and if they have "performance" problems to add
more disks; usually as more pairs (rather than turning 2-disk RAID1s into
3-disk RAID1s, which can also be a good idea in several cases), as the
number of arms and thus IOPS per TB of *used* capacity often is a big issue,
as most storage "experts" eventually should figure out.

In this case the only vague details as to "performance" goals are for writes
of largish-block random multithreaded access, and more pairs seems the first
thing to consider, as the setup has only got 3 pairs totaling something
around 300-400 random IOPS, or with 128kiB writes probably overall write
rates of 30-40MB/s, even not considering the sequential read interference.
Doubling the number of pairs is probably something to start with.

So choosing RAID10 was a good idea; asking yourself whether vaguely
undefined "performance" can be positively affected by some subtle detail of
tuning is not.

Note: Long ago I used to prefer small chunk sizes (like 16KiB), but the
sequential speed of disks has improved a lot in the meantime, while the
rotational one(s) have been pretty much constant for decades, once 3600RPM
disks stopped being designed.

To make a very crude argument, assuming a common 7200RPM disk with a full
rotation every 8ms, and presumably an average offset among the disks of half
a rotation or 4ms, ideally the amount read or written in one go should be at
least as much as can be read or written in 4ms across all disks. On a 10MB/s
disk of several years ago one half rotation means 40KB; on a 100MB/s disk
that becomes 400KB. The goal is ideally to minimize the time spent waiting
for all the disks to complete their work, which begins those same 4ms apart,
so larger chunk sizes probably help, even if they may reduce the degree of
multithreading available. Also, the contemporary disks that can do 100MB/s
tend to be much bigger than the old disks that could do 10MB/s, but they
still have only one arm.

But with a small chunk size largish IO transactions will as a rule involve
contiguous chunks (if the filesystem is suitable), and suitable elevators
can still turn a 1MiB IO to 4 disks into much the same pattern of transfers
whether the chunk size is 16KiB or 128KiB, as in the end each disk has to
transfer 256KiB of physically contiguous sectors.

Yes, it is complicated.
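
Plugging numbers into the rotational argument above shows where the
120 KiB/s, 40 KB and 400 KB figures come from; the half rotation is rounded
to 4 ms as in the text. This is only the back-of-envelope calculation, not a
model of real disk behaviour.

  # 7200 RPM back-of-envelope from the preceding paragraphs.
  rpm = 7200
  rotation_ms = 60_000 / rpm                 # ~8.3 ms per full rotation

  # Worst case for a tiny striped IO: a full rotation of waiting per 1 KiB.
  print(f"full rotation: {rotation_ms:.1f} ms -> "
        f"~{1000 / rotation_ms:.0f} KiB/s if every 1 KiB costs one rotation")

  # Data a member can move during the ~half-rotation skew between disks.
  half_rotation_s = 0.004
  for mb_per_s in (10, 100):
      kb = mb_per_s * 1000 * half_rotation_s
      print(f"{mb_per_s:3d} MB/s disk: ~{kb:.0f} KB per half rotation")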