* [linux-lvm] Volume alignment over RAID
  From: Linda A. Walsh @ 2010-05-20 21:24 UTC
  To: LVM general discussion and development

I'm a bit unclear as to where some units are applied in my RAID setup, and was
wondering how LVM interacts with it, or could or should be set up, so that
created volumes are aligned properly on top of a RAID disk.

I'm using a RAID 'chunk' size of 64k, as suggested by the RAID documentation,
and am using 6 disks to create a RAID6, giving 4 units of data per stripe.
Does this mean my logical volumes need to be aligned on a 64K boundary, or a
256K boundary?  I.e. does 64k usually specify chunk/unit, or chunk/stripe?

What do I need to do to make sure my logical volumes always line up on RAID
stripe boundaries?  I've been using default logical volume parameters, which I
think use an allocation size measured in megabytes, so does that imply I'm
automatically aligned (as 64k and 256k both divide into 1 Meg)?  Or is some
offset involved?

Thanks!
Linda Walsh
* Re: [linux-lvm] Volume alignment over RAID
  From: Luca Berra @ 2010-05-21  5:10 UTC
  To: linux-lvm

On Thu, May 20, 2010 at 02:24:19PM -0700, Linda A. Walsh wrote:
> I'm a bit unclear as to where some units are applied in my RAID setup, and
> was wondering how LVM interacts with it, or could or should be set up, so
> that created volumes are aligned properly on top of a RAID disk.

Note that if your rig uses fairly recent software, data alignment should
happen automagically.

> I'm using a RAID 'chunk' size of 64k, as suggested by the RAID
> documentation, and am using 6 disks to create a RAID6, giving 4 units of
> data per stripe.

I suppose by RAID you mean md, so I wonder what documentation you were
looking at?

I think 64k might be small as a chunk size; depending on your array size you
probably want a bigger one.

Then, since with a six-drive RAID6 the stripe size is always a power of 2,
the answers are easy :)

> Does this mean my logical volumes need to be aligned on a 64K boundary, or
> a 256K boundary?  I.e. does 64k usually specify chunk/unit, or chunk/stripe?

Align to stripe size.

> What do I need to do to make sure my logical volumes always line up on RAID
> stripe boundaries?

Make the volume group with a PE size that is a multiple of the stripe size.

> I've been using default logical volume parameters, which I think use an
> allocation size measured in megabytes, so does that imply I'm automatically
> aligned (as 64k and 256k both divide into 1 Meg)?  Or is some offset
> involved?

Run:
  pvs -o pv_name,pe_start

-- 
Luca Berra -- bluca@comedia.it
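[A minimal sketch of those two suggestions in practice, assuming the array
appears as /dev/md0 and the data stripe is 4 x 64K = 256K; the device and VG
names here are placeholders, not from the thread:

  # where does LVM start data on each PV?  (should be a multiple of the stripe size)
  pvs -o pv_name,pe_start

  # extent size a multiple of the 256K data stripe; the 4M default already qualifies
  vgcreate -s 4M vg_raid /dev/md0
]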
* Re: [linux-lvm] Volume alignment over RAID
  From: Linda A. Walsh @ 2010-05-21  6:48 UTC
  To: LVM general discussion and development

Luca Berra wrote:
> Note that if your rig uses fairly recent software, data alignment should
> happen automagically.
---
So I'm told, but I like to verify; I'm paranoid :-)

> I suppose by RAID you mean md, so I wonder what documentation you were
> looking at?
---
Well, the docs for 2 different RAID controllers, LSI and RocketRAID, both
suggest 64K as a unit size (I forget their exact term).

> I think 64k might be small as a chunk size; depending on your array size
> you probably want a bigger one.
---
Really?  What are the trade-offs?  Array size: well, 6 disks, 4 of them data.

> Then, since with a six-drive RAID6 the stripe size is always a power of 2,
> the answers are easy :)
---
I like easy... figure 4 should divide into most things.

> Align to stripe size.
>
> Make the volume group with a PE size that is a multiple of the stripe size.

>> I've been using default logical volume parameters, which I think use an
>> allocation size measured in megabytes, so does that imply I'm automatically
>> aligned (as 64k and 256k both divide into 1 Meg)?
---
It's 4.0M, so not a problem... but this next one is:

>> Or is some offset involved?
> Run:
>   pvs -o pv_name,pe_start

192.00K is listed as the start of each!  GRR... why would that be a default?
I suppose it works for someone, but it's NOT a power of 2!  Hmph!  So each
start is messed up.  Is there a way I can change the default on a per-volume
basis?  ...(yes: --dataalignment; if I'd just read the manpage before
shooting!)

(Off to read the manpage... thanks for the help.  This is most unpleasant,
given I'd just copied 1.3T (6 hours' worth) of data onto this thing already.
Looks like I was a bit too eager, but I was out of disk space on the old
partition.  Oh well.)  *sigh*
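[A sketch of the fix found in the manpage, assuming the PV is the array
device, here called /dev/md0, and a 256K data stripe.  Note that pvcreate
relabels the device, so this has to happen before the VG is created and data
is copied on, which is exactly why the 1.3T already copied is a problem:

  pvcreate --dataalignment 256k /dev/md0
  pvs -o pv_name,pe_start     # pe_start should now be a multiple of 256K
]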
* Re: [linux-lvm] Volume alignment over RAID
  From: Lyn Rees @ 2010-05-21  7:19 UTC
  To: LVM general discussion and development

> 192.00K is listed as the start of each!  GRR... why would that be a
> default?  I suppose it works for someone, but it's NOT a power of 2!
> Hmph!

192 is a multiple of 64... so it's aligned -- assuming you used the whole
disk as a PV (you didn't partition the thing first).

--------------------------------------------------
Mr Lyn Rees
Senior Engineer, UIG
Information Services Computing Centre
Cardiff University, 40-41 Park Place, Cardiff, CF10 3BB.
Email: rees@cardiff.ac.uk    Web: www.cardiff.ac.uk
--------------------------------------------------
* Re: [linux-lvm] Volume alignment over RAID
  From: Linda A. Walsh @ 2010-05-21 18:50 UTC
  To: LVM general discussion and development

Lyn Rees wrote:
>> 192.00K is listed as the start of each!  GRR... why would that be a
>> default?  I suppose it works for someone, but it's NOT a power of 2!
>
> 192 is a multiple of 64... so it's aligned -- assuming you used the whole
> disk as a PV (you didn't partition the thing first).
---
Isn't 64K the amount written per disk, so the stripe size is 256K?  Wouldn't
that make each stripe have 1 64K chunk written odd, and the next 3 written in
the next 'row'?  I suppose maybe it doesn't matter... but when you break the
PV up into VGs and LVs, somehow it seems odd to have them all skewed by 64K.
But I haven't worked with RAIDs that much, so it's probably just a conceptual
thing in my head.

Anyway... I wanted to redo the array anyway.  I didn't like the performance I
was getting, so I thought I'd try RAID50.  I was only getting 150-300 on
writes/reads on the RAID60, which seemed a bit low; I get more than that on a
4-data-disk RAID5 (200/400).  It's a bit of a pain to do all this
reconfiguring now, but better now than when the disks are all full!  It was a
mistake to do RAID60, though I don't know if the performance on a
10-data-disk RAID6 would be any better for writes... it still has to do a lot
of XORing, even with a hardware card.

I had 2x6 and am going to try 4x3 disks, so... hmmm, I guess now that I think
about it my stripe width was really 8 chunks, not 4, since I had 2 of them.
But I'll still have a stripe width of 8 with 4x3-disk RAID5s.  I don't know if
it will be much faster or not... but I guess I'll see.
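[For comparison, the 4x3-disk RAID50 layout described above would look
something like this if built with md rather than a hardware controller -- a
sketch only; the device names and the 64K chunk are assumptions, not the
actual configuration:

  # four 3-disk RAID5 sets, 2 data disks each
  mdadm --create /dev/md1 --level=5 --raid-devices=3 --chunk=64 /dev/sd[b-d]
  mdadm --create /dev/md2 --level=5 --raid-devices=3 --chunk=64 /dev/sd[e-g]
  mdadm --create /dev/md3 --level=5 --raid-devices=3 --chunk=64 /dev/sd[h-j]
  mdadm --create /dev/md4 --level=5 --raid-devices=3 --chunk=64 /dev/sd[k-m]
  # striped together; a 128K chunk matches each set's 2 x 64K data stripe
  mdadm --create /dev/md10 --level=0 --raid-devices=4 --chunk=128 /dev/md[1-4]
]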
* Re: [linux-lvm] Volume alignment over RAID
  From: Luca Berra @ 2010-05-22  7:36 UTC
  To: linux-lvm

On Fri, May 21, 2010 at 11:50:54AM -0700, Linda A. Walsh wrote:
> Lyn Rees wrote:
>> 192 is a multiple of 64... so it's aligned -- assuming you used the whole
>> disk as a PV (you didn't partition the thing first).

It is chunk-aligned, not stripe-aligned; reads would be OK, but writes...

> Isn't 64K the amount written per disk, so the stripe size is 256K?
> Wouldn't that make each stripe have 1 64K chunk written odd, and the next
> 3 written in the next 'row'?  I suppose maybe it doesn't matter... but when
> you break the PV up into VGs and LVs, somehow it seems odd to have them all
> skewed by 64K.

It will cause multiple read-modify-write cycles for writes that cross a
stripe boundary -- not good.

> Anyway... I wanted to redo the array anyway.  I didn't like the performance
> I was getting, so I thought I'd try RAID50.  It was a mistake to do RAID60,
> though I don't know if the performance on a 10-data-disk RAID6 would be any
> better for writes... it still has to do a lot of XORing, even with a
> hardware card.

The choice between RAID5 and RAID6 has a lot to do with data safety; other
constraints would also mandate the use of spare drives in the RAID5 case.
Personally, I prefer striping smaller redundant sets for critical data.
Not to mention that 10 is not a power of 2, so aligning LVM on top of it
becomes interesting.

> I had 2x6 and am going to try 4x3 disks, so... hmmm, I guess now that I
> think about it my stripe width was really 8 chunks, not 4, since I had 2 of
> them.

Yes, it was 8.

> But I'll still have a stripe width of 8 with 4x3-disk RAID5s.  I don't know
> if it will be much faster or not... but I guess I'll see.

-- 
Luca Berra -- bluca@comedia.it
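[To put numbers on that skew, using the 64K-chunk, 4-data-disk figures from
this thread, i.e. a 256K data stripe:

  pe_start = 192K             -> LV offset 0 lands at array data offset 192K
  stripe boundaries            at 0K, 256K, 512K, ...
  a 256K write at LV offset 0 -> array 192K..448K: the tail of one stripe plus
                                 the head of the next, i.e. two partial-stripe
                                 read-modify-write cycles
  pe_start = 256K             -> the same write covers exactly one full stripe
]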
* Re: [linux-lvm] Volume alignment over RAID
  From: Luca Berra @ 2010-05-22  7:23 UTC
  To: linux-lvm

On Thu, May 20, 2010 at 11:48:31PM -0700, Linda A. Walsh wrote:
> Luca Berra wrote:
>> I suppose by RAID you mean md, so I wonder what documentation you were
>> looking at?
>
> Well, the docs for 2 different RAID controllers, LSI and RocketRAID, both
> suggest 64K as a unit size (I forget their exact term).
>
>> I think 64k might be small as a chunk size; depending on your array size
>> you probably want a bigger one.
>
> Really?  What are the trade-offs?  Array size: well, 6 disks, 4 of them
> data.

OK, I threw the stone...  First we have to consider usage scenarios, i.e.
average read and average write size: large reads benefit from larger chunks,
while small writes with too large a chunk would still result in a
whole-stripe read-modify-write.

There were people on the linux-raid ML doing benchmarks, and IIRC using
chunks between 256k and 1M gave better average results.

-- 
Luca Berra -- bluca@comedia.it
* Re: [linux-lvm] Volume alignment over RAID
  From: Doug Ledford @ 2010-05-27 16:40 UTC
  To: linux-lvm

On 05/22/2010 03:23 AM, Luca Berra wrote:
> On Thu, May 20, 2010 at 11:48:31PM -0700, Linda A. Walsh wrote:
>> Well, the docs for 2 different RAID controllers, LSI and RocketRAID, both
>> suggest 64K as a unit size (I forget their exact term).

Hardware RAID and software RAID are two entirely different things when it
comes to optimization.

>>> I think 64k might be small as a chunk size; depending on your array size
>>> you probably want a bigger one.
>> Really?  What are the trade-offs?  Array size: well, 6 disks, 4 of them
>> data.
>
> First we have to consider usage scenarios, i.e. average read and average
> write size: large reads benefit from larger chunks, while small writes with
> too large a chunk would still result in a whole-stripe read-modify-write.
>
> There were people on the linux-raid ML doing benchmarks, and IIRC using
> chunks between 256k and 1M gave better average results.

That was me.  The best results are with 256 or 512k chunk sizes.  Above 512k
you don't get any more benefit.

-- 
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
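[For a software (md) array, that recommendation could be applied at creation
time roughly like this -- a sketch; the device names are placeholders, not
from the thread:

  # 6-disk RAID6 with a 256K chunk => 4 x 256K = 1024K of data per stripe
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=256 /dev/sd[b-g]
]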
* Re: [linux-lvm] RAID chunk size & LVM 'offset' affecting RAID stripe alignment
  From: Linda A. Walsh @ 2010-06-21  4:26 UTC
  To: LVM general discussion and development

Revisiting an older topic (I got sidetracked w/other issues, as usual;
fortunately email usually waits...).

About a month ago, I'd mentioned that the docs for 2 HW RAID cards (LSI &
RocketRAID) both suggested 64K as a RAID chunk size.

Two responses came up.  Doug Ledford said:

  Hardware RAID and software RAID are two entirely different things when it
  comes to optimization.

And Luca Berra said:

  I think 64k might be small as a chunk size; depending on your array size
  you probably want a bigger one.

(I asked why and Luca continued:)

  First we have to consider usage scenarios, i.e. average read and average
  write size: large reads benefit from larger chunks, while small writes with
  too large a chunk would still result in a whole-stripe read-modify-write.

  There were people on the linux-raid ML doing benchmarks, and IIRC using
  chunks between 256k and 1M gave better average results...

(Doug seconded this, as he was the benchmarker:)

  That was me.  The best results are with 256 or 512k chunk sizes.  Above
  512k you don't get any more benefit.

------

My questions at this point -- why are SW and HW RAID so different?  Aren't
they doing the same algorithms on the same media?  SW might be a bit slower
at some things (or it might be faster, if it's good SW and the HW doesn't
clearly make it faster).

Secondly, how would array size affect the choice of chunk size?  Wouldn't
chunk size be based on your average update size, trading off against the
larger chunk size benefitting reads more than writes?  I.e. if you read 10
times as much as you write, then maybe faster reads provide a clear win, but
if you update nearly as much as you read, then a stripe size closer to your
average update size would be preferable.

Concerning the benefit of a larger chunk size for reads -- would that benefit
be less if one were also using read-ahead on the array?

>-----------------------<

In another note, Luca Berra commented, in response to my observation that my
256K-data-wide stripes (4x64K chunks) would be skewed by a chunk size on my
PVs that defaulted to starting data at offset 192K:

LB> It will cause multiple R-M-W cycles for writes that cross a stripe
LB> boundary -- not good.

I don't see how it would make a measurable difference.  If it did, wouldn't
we also have to account for the parity disks so that they are aligned as
well -- as they also have to be written during a stripe write?  I.e. if it is
a requirement that they be aligned, it seems that the LVM alignment has to be:

  (total disks) x (chunk-size)

not

  (data-disks) x (chunk-size)

as I *think* we were both thinking when we earlier discussed this.

Either way, I don't know how much of an effect there would be if, when
updating a stripe, some of the disks read/write chunk "N" while the other
disks use chunk "N-1"...  They would all be writing 1 chunk per stripe
update, no?  The only conceivable impact on performance would be at some
'boundary' point -- if your volume contained multiple physical partitions --
but those would be few and far between, with large areas where it should (?)
make no difference.  Eh?

Linda
* Re: [linux-lvm] RAID chunk size & LVM 'offset' affecting RAID stripe alignment
  From: Doug Ledford @ 2010-06-23 18:59 UTC
  To: LVM general discussion and development

On Jun 21, 2010, at 12:26 AM, Linda A. Walsh wrote:
> And Luca Berra said:
>   I think 64k might be small as a chunk size; depending on your array size
>   you probably want a bigger one.
> (I asked why and Luca continued:)
>   First we have to consider usage scenarios, i.e. average read and average
>   write size: large reads benefit from larger chunks,

Correction: all reads benefit from larger chunks nowadays.  The only reason
to use smaller chunks in the past was to try and get all of your drives
streaming data to you simultaneously, which effectively made the total
aggregate throughput of those reads equal to the throughput of one data disk
times the number of data disks in the array.  With modern drives able to put
out 100MB/s sustained by themselves, we don't really need to do this any
more, and if we aren't attempting to get this particular optimization (which
really only existed when you were doing single-threaded sequential I/O
anyway, which happens to be rare on real servers), then larger chunk sizes
benefit reads because they help to ensure that reads will, as much as
possible, only hit one disk.  If you can manage to make every read you
service hit one disk only, you maximize the random I/O ops per second that
your array can handle.

>   while small writes with too large a chunk would still result in a
>   whole-stripe read-modify-write.

There is a very limited set of applications where the benefit of streaming
writes versus a read-modify-write cycle is worth the trade-off it requires.
Specifically, only if you are going to be doing more writing to your array
than reading, or maybe if you are doing at least 33% of all commands as
writes, should you worry about this.  By far and away the vast majority of
usage scenarios involve far more reads than writes, and in those cases you
always optimize for reads.

However, even if you are optimizing for writes, what I wrote above about
trying to make your writes fall on only one disk (excepting the fact that
parity also needs updating) still holds true, unless you can make your writes
*reliably* take up the entire stripe.  The absolute worst thing you could do
is use a small chunk size thinking that it will cause your writes to skip the
read-modify-write cycle and instead do a complete stripe write, then have
your writes reliably do only half-stripe writes instead of full-stripe
writes.  A half-stripe write is worse than a full-stripe write, and is worse
than a single-chunk write.  It is the worst-case scenario.

>   There were people on the linux-raid ML doing benchmarks, and IIRC using
>   chunks between 256k and 1M gave better average results...
> (Doug seconded this, as he was the benchmarker:)
>   That was me.  The best results are with 256 or 512k chunk sizes.  Above
>   512k you don't get any more benefit.
>
> My questions at this point -- why are SW and HW RAID so different?  Aren't
> they doing the same algorithms on the same media?

Yes and no.  Hardware RAID implementations provide a pseudo device to the
operating system, and implement their own caching subsystem and command
elevator algorithm on the card for both the pseudo device and the underlying
physical drives.  Linux likewise has its own elevator and caching subsystems
that work on the logical drive.  So, in the case of software RAID the stack
usually looks something like this:

  filesystem -> caching layer -> block device layer with elevator for logical
  device -> raid layer -> block device layer with noop elevator for physical
  device -> scsi device layer -> physical drive

In the case of a hardware RAID controller, it's like this:

  filesystem -> caching layer -> block device layer with elevator for logical
  device -> scsi layer -> raid controller driver -> hardware raid controller
  caching layer and elevator -> hardware raid controller raid stack ->
  hardware raid controller physical drive driver layer -> physical drive

So, while at a glance it might seem that they are implementing the same
algorithms on the same devices, the details of how they do so are drastically
different, and hence the differences in optimal numbers.  FWIW, we don't
generally have access to the RAID stack on those hardware RAID controllers to
answer the question of why they perform best with certain block sizes, but my
guess is that they have built-in assumptions in the caching layer related to
those block sizes, which result in them being hamstrung at other block sizes.

> SW might be a bit slower at some things (or it might be faster, if it's
> good SW and the HW doesn't clearly make it faster).
>
> Secondly, how would array size affect the choice of chunk size?

Array size doesn't affect optimal chunk size.

> Wouldn't chunk size be based on your average update size, trading off
> against the larger chunk size benefitting reads more than writes?  I.e. if
> you read 10 times as much as you write, then maybe faster reads provide a
> clear win, but if you update nearly as much as you read, then a stripe size
> closer to your average update size would be preferable.

See my comments above, but in general you can always play it safe with writes
and use a large chunk size so that writes are generally single-chunk writes.
If you do that, you get reasonably good writes and optimal reads.  Unless you
have very strict control of the writes to your device, it's almost impossible
to get optimal full-stripe writes, and if you aim for that, you have a large
chance of failure.  So my advice is to not even try to go down that path.

> Concerning the benefit of a larger chunk size for reads -- would that
> benefit be less if one were also using read-ahead on the array?

The benefit of a large chunk size for reads is that it keeps the read on a
single device as frequently as possible.  Because readahead doesn't kick in
immediately, it doesn't negate that benefit on random I/O, and on truly
sequential I/O it turns out to still help things, as it will start the
process of reading from the next disk ahead of time, but usually only after
we've determined we truly are going to need to do exactly that.

> In another note, Luca Berra commented, in response to my observation that
> my 256K-data-wide stripes (4x64K chunks) would be skewed by a chunk size on
> my PVs that defaulted to starting data at offset 192K:
>
> LB> It will cause multiple R-M-W cycles for writes that cross a stripe
> LB> boundary -- not good.
>
> I don't see how it would make a measurable difference.

Alignment of the lvm device on top of the raid device most certainly will
make a measurable difference.

> If it did, wouldn't we also have to account for the parity disks so that
> they are aligned as well -- as they also have to be written during a stripe
> write?  I.e. if it is a requirement that they be aligned, it seems that the
> LVM alignment has to be:
>
>   (total disks) x (chunk-size)
>
> not
>
>   (data-disks) x (chunk-size)

No.  If you're putting lvm on top of a raid array, and the raid array is a pv
to the lvm device, then the lvm device will only see
(data-disks) x (chunk-size) of space in each stripe.  The parity block is
internal to the raid and never exposed to the lvm layer.

> as I *think* we were both thinking when we earlier discussed this.
>
> Either way, I don't know how much of an effect there would be if, when
> updating a stripe, some of the disks read/write chunk "N" while the other
> disks use chunk "N-1"...  They would all be writing 1 chunk per stripe
> update, no?

No.  This goes to the heart of a full-stripe write versus a partial-stripe
write.  If your pv is properly aligned on the raid array, then a single
stripe write of the lvm subsystem will be exactly and optimally aligned to
write to a single stripe of the raid array.

So, let's say you have a 5-disk raid5 array, so 4 data disks and 1 parity
disk.  And let's assume a chunk size of 256K.  That gives a total stripe
width of 1024K.  So, telling the lvm subsystem to align the start of the data
on a 1024K offset will optimally align the lv on the pv.  If you then create
an ext4 filesystem on the lv, and tell the ext4 filesystem that you have a
chunk size of 256k and a stripe width of 1024k, the ext4 filesystem will be
properly aligned on the underlying raid device.  And because you've told the
ext4 filesystem about the raid device layout, it will attempt to optimize
access patterns for the raid device.

That all being said, here's an example of a non-optimal access pattern.
Let's assume you have a 1024k write, and the ext4 filesystem knows you have a
1024k stripe width.  The filesystem will attempt to align that write on a
1024k stripe boundary so that you get a full-stripe write.  That means that
the raid layer will ignore the parity already on disk, will simply calculate
new parity by doing an xor on the 1024k of data, and will then simply write
all 4 256k chunks and the 256k parity block out to disk.  That's optimal.

If the alignment is skewed by the lvm layer, though, what happens is that the
ext4 filesystem tries to lay the write out on the start of a stripe but
fails; instead of the write causing a very fast parity generation and write
to a single stripe, the write gets split between two different stripes, and
since neither stripe gets a full-stripe write, we do one of two things: a
read-modify-write cycle or a read-calculate-write cycle.  In either of those
cases, it is a requirement that we read something off of disk and use it in
the calculation of what needs to be written out to disk.  So, we end up
touching two stripes instead of one, and we have to read stuff in,
introducing a latency delay, before we can write our data out.

So, it's highly important that, insofar as some layers are aware of raid
device layouts, those layers be *properly* aligned on our raid device, or the
result is not only suboptimal but likely pathological.

> The only conceivable impact on performance would be at some 'boundary'
> point -- if your volume contained multiple physical partitions -- but those
> would be few and far between, with large areas where it should (?) make no
> difference.  Eh?
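[Expressed as commands, Doug's 5-disk, 256K-chunk example would look roughly
like this -- a sketch only; the device, VG/LV names and the LV size are
placeholders, and ext4's stride/stripe-width values are counted in 4K
filesystem blocks:

  pvcreate --dataalignment 1024k /dev/md0   # data starts on a full-stripe boundary
  vgcreate -s 4M vg0 /dev/md0               # 4M extents are a multiple of the 1024K stripe
  lvcreate -L 100G -n lv0 vg0
  # stride = 256K/4K = 64 blocks; stripe-width = 4 data disks x 64 = 256 blocks
  mkfs.ext4 -E stride=64,stripe-width=256 /dev/vg0/lv0
]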
* Re: [linux-lvm] RAID chunk size & LVM 'offset' affecting RAID stripe alignment
  From: Linda A. Walsh @ 2010-06-25  8:36 UTC
  To: LVM general discussion and development

Doug Ledford wrote:
> Correction: all reads benefit from larger chunks nowadays.  The only reason
> to use smaller chunks in the past was to try and get all of your drives
> streaming data to you simultaneously, which effectively made the total
> aggregate throughput of those reads equal to the throughput of one data
> disk times the number of data disks in the array.  With modern drives able
> to put out 100MB/s sustained by themselves, we don't really need to do this
> any more, ...
---
I would regard 100MB/s as moderately slow.  For files in my server cache, my
Win7 machine reads at 110MB/s over the network, so, as much as file I/O slows
down network response, 100MB/s would be on the slow side.  I hope for at
least 2-3 times that with software RAID, but with hardware RAID 5-6x that is
common.  Write speeds run maybe 50-100MB/s slower?

> and if we aren't attempting to get this particular optimization (which
> really only existed when you were doing single-threaded sequential I/O
> anyway, which happens to be rare on real servers), then larger chunk sizes
> benefit reads because they help to ensure that reads will, as much as
> possible, only hit one disk.  If you can manage to make every read you
> service hit one disk only, you maximize the random I/O ops per second that
> your array can handle.
---
I was under the impression that the rule of thumb was that the IOPS of a RAID
array are generally equal to those of 1 member disk, because normally they
operate as 1 spindle.  It seems like in your case, you are only using the
RAID component for the redundancy rather than the speedup.  If you want to
increase IOPS above the single-spindle rate, then I had the impression that
using a multi-level RAID would accomplish that -- like RAID50 or 60?  I.e. a
RAID0 of 3 RAID5s would give you 3x the IOPS (because, as in your example,
any read would likely only use a fraction of a stripe), but you would still
benefit from using multiple devices for a read/write to get speed.

I seem to remember something about multiprocessor checksumming going into
some recent kernels that could allow practical multi-level RAID in software.

>> in response to my observation that my 256K-data-wide stripes (4x64K
>> chunks) would be skewed by a chunk size on my PVs that defaulted to
>> starting data at offset 192K
> ....
> So, we end up touching two stripes instead of one, and we have to read
> stuff in, introducing a latency delay, before we can write our data out.
---
Duh... missing the obvious, I am!  Sigh.  I think I got it write... oi veh!
If not, well... dumping and restoring that much data just takes WAY too long.
(Beginning to think 500-600MB/s reads/writes are too slow... actually, for
dump/restore I'm lucky when I get an eighth of that.)
* Re: [linux-lvm] RAID chunk size & LVM 'offset' affecting RAID stripe alignment
  From: Doug Ledford @ 2010-06-26  1:50 UTC
  To: LVM general discussion and development

On Jun 25, 2010, at 4:36 AM, Linda A. Walsh wrote:
> I would regard 100MB/s as moderately slow.  For files in my server cache,
> my Win7 machine reads at 110MB/s over the network, so, as much as file I/O
> slows down network response, 100MB/s would be on the slow side.  I hope for
> at least 2-3 times that with software RAID, but with hardware RAID 5-6x
> that is common.  Write speeds run maybe 50-100MB/s slower?

In practice you get better results than that.  Maybe not a fully linear
scale-up, but it goes way up.  My test system was getting 400-500MB/s under
the right conditions.

> I was under the impression that the rule of thumb was that the IOPS of a
> RAID array are generally equal to those of 1 member disk, because normally
> they operate as 1 spindle.

With a small chunk size, this is the case, yes.

> It seems like in your case, you are only using the RAID component for the
> redundancy rather than the speedup.

No, I'm trading off some speed-up in sequential throughput for a speed-up in
IOPS.

> If you want to increase IOPS above the single-spindle rate, then I had the
> impression that using a multi-level RAID would accomplish that -- like
> RAID50 or 60?  I.e. a RAID0 of 3 RAID5s would give you 3x the IOPS
> (because, as in your example, any read would likely only use a fraction of
> a stripe), but you would still benefit from using multiple devices for a
> read/write to get speed.

In truth, whether you use a large chunk size, or smaller chunk sizes and
stacked arrays, the net result is the same: you make the average request
involve fewer disks, trading off maximum single-stream throughput for IOPS.
My argument in all of this is that single-threaded streaming performance is
such a total "who cares" number that you are silly to ever chase that
particular beast.  Almost nothing in the real world that is doing I/O at
speeds we even remotely care about is doing that I/O in a single stream.
Instead, it's various different threads of I/O to different places in the
array, and what we care about is that the array be able to handle enough IOPS
to stay ahead of the load.  An exception to this rule might be something like
the data acquisition equipment at CERN's hadron collider.  That stuff dumps
data in a continuous stream so fast that it makes my mind hurt.

> I seem to remember something about multiprocessor checksumming going into
> some recent kernels that could allow practical multi-level RAID in
> software.

Red herring.  You can do multi-level RAID without this feature, and this
feature is currently broken, so I wouldn't recommend using it.
* Re: [linux-lvm] RAID chunk size & LVM 'offset' affecting RAID stripe alignment
  From: Charles Marcus @ 2010-06-28 18:56 UTC
  To: linux-lvm

On 2010-06-25 4:36 AM, Linda A. Walsh wrote:
> I would regard 100MB/s as moderately slow.  For files in my server cache,
> my Win7 machine reads at 110MB/s over the network,

My understanding is that gigabit ethernet is only capable of topping out at
about 30MB/s, so I'm curious what kind of network you have?  10GbE?  Fiber?

-- 
Best regards,
Charles
* Re: [linux-lvm] RAID chunk size & LVM 'offset' affecting RAID stripe alignment
  From: Linda A. Walsh @ 2010-06-29 21:33 UTC
  To: LVM general discussion and development

Charles Marcus wrote:
> On 2010-06-25 4:36 AM, Linda A. Walsh wrote:
>> I would regard 100MB/s as moderately slow.  For files in my server cache,
>> my Win7 machine reads at 110MB/s over the network,
>
> My understanding is that gigabit ethernet is only capable of topping out at
> about 30MB/s, so I'm curious what kind of network you have?  10GbE?  Fiber?
----
Why would gigabit ethernet top out at less than 1/4th of its theoretical
speed?  What would possibly cause such poor performance?  Are you using xfs
as a file system?  It's the optimal file system for high performance with
large files.

Gigabit ethernet should have a theoretical maximum somewhere around 120MB/s.
If there were no overhead, it would be 125MB/s, so 120MB/s allows for 4%
overhead.

My tests used 'samba3' to transfer files.  Both the server and the Win7 box
use Intel gigabit PCIe cards bought off Amazon.  My local net uses a
9000-byte MTU (9014 frame size).  The tests had a Win7-64 client talking to a
SuSE 11.2 (x86-64) server with a 2.6.34 vanilla kernel.  The file system is
xfs over LVM2.

Linear writes are measurable at 115MB/s.  Writes to disk are the same, since
my local disk does ~670MB/s writes, which can easily handle network bandwidth
(670MB/s is direct; through the buffer cache I get about two-thirds of that:
448MB/s).  Win7 reading a 4GB file from the server's cache gets 110MB/s.
From disk it's about 13-14% slower, even though the disk's read speed (for a
48G file) is 826MB/s.  The disk used for the testing is a RAID50 based on
7200RPM SATA disks.

1. Read (file in memory on server):
   /l> dd if=test1 of=/dev/null bs=256M count=16
   16+0 records in
   16+0 records out
   4294967296 bytes (4.3 GB) copied, 39.024 s, 110 MB/s

2. Read (file NOT in memory on server):
   /t/test> dd if=file2 of=/dev/null bs=1G count=4 oflag=direct
   4+0 records in
   4+0 records out
   4294967296 bytes (4.3 GB) copied, 44.955 s, 95.5 MB/s

3. Write (file written to server memory buffers):
   /l> dd of=test1 if=/dev/zero bs=256M count=16 conv=notrunc oflag=direct
   16+0 records in
   16+0 records out
   4294967296 bytes (4.3 GB) copied, 37.37 s, 115 MB/s

4. Write (with 'file+metadata sync'):
   /t/test> dd of=file2 if=/dev/zero bs=1G count=2 oflag=direct conv=nocreat,fsync
   2+0 records in
   2+0 records out
   2147483648 bytes (2.1 GB) copied, 18.765 s, 114 MB/s

5. Write (to verify write speed, including the write to disk, this next test
   writes out twice the amount of memory the server has):
   /t/test> dd of=file2 if=/dev/zero bs=1G count=48 oflag=direct conv=nocreat,fsync
   48+0 records in
   48+0 records out
   51539607552 bytes (52 GB) copied, 449.427 s, 115 MB/s

Writing to disk has no effect on network write speed -- as expected.  Reads
have some effect, causing about a 13-14% slowdown, to 95.5MB/s.

In both cases, running 'xosview' showed the expected network bandwidth being
used.  Also, FWIW -- my music only hiccuped occasionally during the write
activity.  Oddly enough, it didn't hiccup at all during the read test (I was
listening to flacs from the server while doing the I/O tests).  xosview was
also displaying from the server over the net -- so there was entirely 'zero'
background network traffic.