From: Zdenek Kabelac <zkabelac@redhat.com>
To: LVM general discussion and development <linux-lvm@redhat.com>,
Dale Stephenson <dalestephenson@mac.com>
Subject: Re: [linux-lvm] Performance penalty for 4k requests on thin provisioned volume
Date: Thu, 14 Sep 2017 11:00:46 +0200 [thread overview]
Message-ID: <bb529703-849c-93ff-f40c-1de8a6f49ce6@redhat.com> (raw)
In-Reply-To: <42E7ED35-B32E-4C02-976A-7A9E5380EEA8@mac.com>
On 14.9.2017 at 00:39, Dale Stephenson wrote:
>
>> On Sep 13, 2017, at 4:19 PM, Zdenek Kabelac <zkabelac@redhat.com> wrote:
>>
>> On 13.9.2017 at 17:33, Dale Stephenson wrote:
>>> Distribution: centos-release-7-3.1611.el7.centos.x86_64
>>> Kernel: Linux 3.10.0-514.26.2.el7.x86_64
>>> LVM: 2.02.166(2)-RHEL7 (2016-11-16)
>>> Volume group consisted of an 8-drive array of SSDs (500G drives), plus an additional SSD of the same size. The array had 64k stripes.
>>> Thin pool had the -Zn option and a 512k chunksize (full stripe), size 3T with a 16G metadata volume. Data was entirely on the 8-drive RAID, metadata was entirely on the 9th drive.
>>> Virtual volume “thin” was 300 GB. I also filled it with dd so that it would be fully provisioned before the test.
>>> Volume “thick” was also 300GB, just an ordinary volume also entirely on the 8-drive array.
>>> Four tests were run directly against each volume using fio-2.2.8: random read, random write, sequential read, and sequential write. Single thread, 4k block size, 90 s run time.
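(For reference, one of the four jobs described above might look roughly like the fio invocation below; the ioengine and direct-I/O settings are assumptions, as they are not stated in the original description:)
  fio --name=randread-4k --filename=/dev/volgr0/thin \
      --rw=randread --bs=4k --numjobs=1 --runtime=90 --time_based \
      --ioengine=libaio --direct=1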
>>
>> Hi
>>
>> Can you please provide output of:
>>
>> lvs -a -o+stripes,stripesize,seg_pe_ranges
>>
>> so we can see how your stripes are placed on the devices?
>
> Sure, thank you for your help:
> # lvs -a -o+stripes,stripesize,seg_pe_ranges
> LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert #Str Stripe PE Ranges
> [lvol0_pmspare] volgr0 ewi------- 16.00g 1 0 /dev/md127:867328-871423
> thick volgr0 -wi-a----- 300.00g 1 0 /dev/md127:790528-867327
> thin volgr0 Vwi-a-t--- 300.00g thinpool 100.00 0 0
> thinpool volgr0 twi-aot--- 3.00t 9.77 0.13 1 0 thinpool_tdata:0-786431
> [thinpool_tdata] volgr0 Twi-ao---- 3.00t 1 0 /dev/md127:0-786431
> [thinpool_tmeta] volgr0 ewi-ao---- 16.00g 1 0 /dev/sdb4:0-4095
>
> md127 is an 8-drive RAID 0
>
> As you can see, there’s no lvm striping; I rely on the software RAID underneath for that. Both thick and thin lvols are on the same PV.
>>
>> SSDs typically perform best with 512K write chunks.
>
> I could create the md to use 512k chunks for RAID 0, but I wouldn’t expect that to have any impact on a single-threaded test using a 4k request size. Is there a hidden relationship that I’m unaware of?
Yep - though it seems the setup in this case is likely not the best fit.
If you can reevaluate different setups you may possibly get much higher
throughput.
My guess would be - the best-performing layout is probably striping across no
more than 2-3 disks with a bigger stripe chunk.
And then just 'join' the 'smaller' arrays together in lvm2 into 1 big LV.
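A rough sketch of that kind of layout (device names, sizes and the 256K chunk are only illustrative assumptions):
  # several small RAID0 arrays with a bigger chunk (256 KiB here)
  mdadm --create /dev/md10 --level=0 --raid-devices=2 --chunk=256 /dev/sdc /dev/sdd
  mdadm --create /dev/md11 --level=0 --raid-devices=2 --chunk=256 /dev/sde /dev/sdf
  # then join them as linear (concatenated) space in one VG/LV
  vgcreate volgr1 /dev/md10 /dev/md11
  lvcreate -L 800G -n biglv volgr1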
>
>> (something like 'lvcreate -LXXX -i8 -I512k vgname’)
>>
> Would making lvm stripe on top of an md that already stripes confer any performance benefit in general, or for small (4k) requests in particular?
Rule #1 - try to avoid 'over-combining' things together.
- measure performance from the 'bottom' upward in your device stack.
If the underlying devices give poor speed - you can't make it better by any
super-smart disk layout on top of them.
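For example, run the same read-only job against each layer, bottom up (the fio parameters here are only an illustration, reusing the 4k/90s settings from the test description):
  # raw MD array first...
  fio --name=md-raw --filename=/dev/md127 --rw=randread --bs=4k \
      --runtime=90 --time_based --direct=1 --ioengine=libaio
  # ...then the thin LV stacked on top of it
  fio --name=lv-thin --filename=/dev/volgr0/thin --rw=randread --bs=4k \
      --runtime=90 --time_based --direct=1 --ioengine=libaio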
>
>> Wouldn't it be 'faster' to just concatenate 8 disks together instead of striping - or stripe only across 2 disks - and then concatenate 4 such striped areas…
>>
> For sustained throughput I would expect striping of 8 disks to blow away concatenation; however, for small requests I wouldn’t expect any advantage. On a non-redundant array, I would expect a single-threaded test using 4k requests to end up reading/writing data from exactly one disk regardless of whether the underlying drives are concatenated or striped.
It always depends on which kind of load you expect the most.
I suspect spreading 4K blocks across 8 SSDs is likely very far from an ideal
layout.
Any SSD is typically very bad with 4K blocks - if you want to 'spread' the
load over more SSDs, do not use less than 64K stripe chunks per SSD - this
gives you a (8*64) 512K stripe size.
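You can check what chunk size the current array actually uses, e.g.:
  mdadm --detail /dev/md127 | grep -i chunk
  cat /proc/mdstat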
As for the thin-pool chunk size - if you plan to use lots of snapshots, keep the
value as low as possible - a 64K or 128K thin-pool chunk size.
But I'd still suggest reevaluating/benchmarking a setup where you use a much
lower number of SSDs for load spreading - and use bigger stripe chunks per
device. This should nicely improve performance in the case of 'bigger' writes
and not slow things down that much with 4K loads....
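A minimal sketch of creating such a pool with a small chunk size (sizes and names are taken from your listing; explicit PV placement for data/metadata is omitted here for brevity):
  lvcreate -L 3T -Zn --chunksize 64K --poolmetadatasize 16G --thinpool thinpool volgr0
  lvcreate -V 300G --thin -n thin volgr0/thinpool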
> What is the best choice for handling 4k request sizes?
Possibly NVMe can do a better job here.
Regards
Zdenek
Thread overview: 10+ messages
2017-09-13 15:33 [linux-lvm] Performance penalty for 4k requests on thin provisioned volume Dale Stephenson
2017-09-13 20:19 ` Zdenek Kabelac
2017-09-13 22:39 ` Dale Stephenson
2017-09-14 9:00 ` Zdenek Kabelac [this message]
2017-09-14 9:37 ` Zdenek Kabelac
2017-09-14 10:52 ` Gionatan Danti
2017-09-14 10:57 ` Gionatan Danti
2017-09-14 11:13 ` Zdenek Kabelac
2017-09-14 14:32 ` Dale Stephenson
2017-09-14 15:25 ` Dale Stephenson