* [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Ugis @ 2013-10-16 14:46 UTC
To: linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

Hello ceph & LVM communities!

I noticed very slow reads from an xfs mount on a ceph client
(rbd + gpt partition + LVM PV + xfs on the LV).
To find the cause I created another rbd in the same pool, formatted it
directly with xfs and mounted it.

Write performance for both xfs mounts is similar, ~12MB/s.

Reads with "dd if=/mnt/somefile bs=1M | pv | dd of=/dev/null" are as follows:
with LVM  ~4MB/s
pure xfs  ~30MB/s

I watched performance during the reads with atop. In the LVM case atop
shows the LV overloaded:
LVM | s-LV_backups | busy 95% | read 21515 | write 0 | KiB/r 4 |
    | KiB/w 0 | MBr/s 4.20 | MBw/s 0.00 | avq 1.00 | avio 0.85 ms |

client kernel 3.10.10
ceph version 0.67.4

My considerations:
I have expanded the rbd under LVM a couple of times (expanding the gpt
partition, PV, VG, LV and xfs accordingly), but that should have no impact
on performance (I tested a clean rbd+LVM and got the same read performance
as for the expanded one).

As with device-mapper in general, once LVM is initialized it is just a
small table with the LE->PE mapping, which should stay in a nearby CPU
cache. I am guessing this could be related to the old CPU used; probably
caching near the CPU does not work well (I also tested local HDDs
with/without LVM and got read speeds of ~13MB/s vs ~46MB/s, with atop
showing the same overload in the LVM case).

What could make such a great difference when LVM is used, and what/how
should I tune? As write performance does not differ, the DM extent lookup
should not be lagging; where is the trick?

CPU used:
# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
microcode       : 0x2
cpu MHz         : 3200.077
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6400.15
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

Br,
Ugis

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Sage Weil @ 2013-10-16 16:16 UTC
To: Ugis
Cc: ceph-devel@vger.kernel.org, ceph-users@ceph.com, linux-lvm

Hi,

On Wed, 16 Oct 2013, Ugis wrote:
> Hello ceph & LVM communities!
>
> I noticed very slow reads from an xfs mount on a ceph client
> (rbd + gpt partition + LVM PV + xfs on the LV).
> To find the cause I created another rbd in the same pool, formatted it
> directly with xfs and mounted it.
>
> Write performance for both xfs mounts is similar, ~12MB/s.
>
> Reads with "dd if=/mnt/somefile bs=1M | pv | dd of=/dev/null" are as follows:
> with LVM  ~4MB/s
> pure xfs  ~30MB/s
>
> [...]
>
> What could make such a great difference when LVM is used, and what/how
> should I tune? As write performance does not differ, the DM extent lookup
> should not be lagging; where is the trick?

My first guess is that LVM is shifting the content of the device such that
it no longer aligns well with the RBD striping (by default, 4MB). The
non-aligned reads/writes would need to touch two objects instead of one,
and dd is generally doing these synchronously (i.e., lots of waiting).

I'm not sure what options LVM provides for aligning things to the
underlying storage...

sage

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: David McBride @ 2013-10-17 9:06 UTC
To: Sage Weil
Cc: ugis22, ceph-devel@vger.kernel.org, ceph-users@ceph.com, linux-lvm

On 16/10/2013 17:16, Sage Weil wrote:

> I'm not sure what options LVM provides for aligning things to the
> underlying storage...

There is a generic kernel ABI for exposing the performance properties of
block devices to higher layers, so that they can automatically tune
themselves according to those properties and report their own properties
to users further up the stack.

LVM supports both reading this data from underlying physical devices and
configuring itself appropriately, as well as reporting this data to users
of LVs so that they can do the same. (For example, mkfs.xfs uses libblkid
to automatically select the optimal stripe size, stride width, etc. of an
LVM volume sitting on top of an MD disk array.)

A good starting point appears to be:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c72758f33784e5e2a1a4bb9421ef3e6de8f9fcf3

If Ceph RBD block devices don't currently expose this information, adding
it should be a relatively simple change that will result in all higher
layers, whether LVM or a native filesystem, automatically tuning
themselves at creation time for the RBD's performance characteristics.

(As an aside, it's possible that OSD journalling performance could also be
improved by teaching it to heed this topology information. I can imagine
that when writing directly to block devices it may be possible to improve
performance, such as when using LVM on an SSD, or a DOS partition on a
4k-sector SATA disk.)

~ ~ ~

In the meantime, the documentation I found for LVM2 suggests that the
`pvcreate` command supports the "--dataalignment" and
"--dataalignmentoffset" flags. The former should be the RBD object size,
e.g. 4MB by default. In this case, you'll also need to set the latter to
compensate for the offset introduced by the GPT place-holder partition
table at the start of the device, so that LVM data extents begin on an
object boundary.

Cheers,
David
--
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Computing Service

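A minimal sketch of the pvcreate invocation described above, using the rbd
device and GPT partition from this thread; the offset value is a
placeholder that would have to be derived from the actual partition start:

  # Partition start, in 512-byte sectors:
  parted /dev/rbd1 unit s print
  # If the partition starts at byte offset S, choose
  # offset = (4MiB - (S mod 4MiB)) mod 4MiB so extents land on object boundaries.
  pvcreate --dataalignment 4m --dataalignmentoffset <offset> /dev/rbd1p1
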
* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Mike Snitzer @ 2013-10-17 15:18 UTC
To: Sage Weil
Cc: ceph-devel@vger.kernel.org, Ugis, linux-lvm, ceph-users@ceph.com

On Wed, Oct 16 2013 at 12:16pm -0400,
Sage Weil <sage@inktank.com> wrote:

> Hi,
>
> On Wed, 16 Oct 2013, Ugis wrote:
> >
> > What could make such a great difference when LVM is used, and what/how
> > should I tune? As write performance does not differ, the DM extent lookup
> > should not be lagging; where is the trick?
>
> My first guess is that LVM is shifting the content of the device such that
> it no longer aligns well with the RBD striping (by default, 4MB). The
> non-aligned reads/writes would need to touch two objects instead of
> one, and dd is generally doing these synchronously (i.e., lots of
> waiting).
>
> I'm not sure what options LVM provides for aligning things to the
> underlying storage...

LVM will consume the underlying storage's device limits. So if rbd
establishes minimum_io_size and optimal_io_size values that reflect the
striping config, LVM will pick them up -- provided
'data_alignment_detection' is enabled in lvm.conf (which it is by default).

Ugis, please provide the output of:

RBD_DEVICE=<rbd device name>
pvs -o pe_start $RBD_DEVICE
cat /sys/block/$RBD_DEVICE/queue/minimum_io_size
cat /sys/block/$RBD_DEVICE/queue/optimal_io_size

The 'pvs' command will tell you where LVM aligned the start of the data
area (which follows the LVM metadata area). Hopefully it reflects what
was published in sysfs for rbd's striping.

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Ugis @ 2013-10-18 7:56 UTC
To: Mike Snitzer
Cc: ceph-devel@vger.kernel.org, ceph-users@ceph.com, Sage Weil, linux-lvm

> Ugis, please provide the output of:
>
> RBD_DEVICE=<rbd device name>
> pvs -o pe_start $RBD_DEVICE
> cat /sys/block/$RBD_DEVICE/queue/minimum_io_size
> cat /sys/block/$RBD_DEVICE/queue/optimal_io_size
>
> The 'pvs' command will tell you where LVM aligned the start of the data
> area (which follows the LVM metadata area). Hopefully it reflects what
> was published in sysfs for rbd's striping.

Output follows:

# pvs -o pe_start /dev/rbd1p1
  1st PE
    4.00m
# cat /sys/block/rbd1/queue/minimum_io_size
4194304
# cat /sys/block/rbd1/queue/optimal_io_size
4194304

Seems correct in terms of ceph-LVM io parameter negotiation? I wondered
about the gpt header + PV metadata - they introduce some shift from the
beginning of ceph's first block. Does this mean that all following LVM 4m
data blocks are shifted by that amount and span 2 ceph objects?
If so, performance will be affected.

Ugis

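A rough way to check that concern, assuming the same devices as above: the
pe_start that pvs reports is relative to the partition, so what matters is
the partition start plus pe_start modulo the 4MiB object size. A sketch:

  # Partition start in 512-byte sectors (GPT data partitions often start at 2048):
  cat /sys/block/rbd1/rbd1p1/start
  # PE start relative to the partition, in sectors:
  pvs --noheadings --units s -o pe_start /dev/rbd1p1
  # The data area is object-aligned if ((part_start + pe_start) * 512) % 4194304 == 0.
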
* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Sage Weil @ 2013-10-19 0:01 UTC
To: Ugis
Cc: ceph-devel@vger.kernel.org, ceph-users@ceph.com, Mike Snitzer, linux-lvm

On Fri, 18 Oct 2013, Ugis wrote:
> > Ugis, please provide the output of:
> >
> > RBD_DEVICE=<rbd device name>
> > pvs -o pe_start $RBD_DEVICE
> > cat /sys/block/$RBD_DEVICE/queue/minimum_io_size
> > cat /sys/block/$RBD_DEVICE/queue/optimal_io_size
> >
> > The 'pvs' command will tell you where LVM aligned the start of the data
> > area (which follows the LVM metadata area). Hopefully it reflects what
> > was published in sysfs for rbd's striping.
>
> Output follows:
>
> # pvs -o pe_start /dev/rbd1p1
>   1st PE
>     4.00m
> # cat /sys/block/rbd1/queue/minimum_io_size
> 4194304
> # cat /sys/block/rbd1/queue/optimal_io_size
> 4194304

Well, the parameters are being set at least. Mike, is it possible that
having minimum_io_size set to 4m is causing some read amplification
in LVM, translating a small read into a complete fetch of the PE (or
something along those lines)?

Ugis, if your cluster is on the small side, it might be interesting to see
what requests the client is generating in the LVM and non-LVM case by
setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
'--debug-ms 1') and then looking at the osd_op messages that appear in
/var/log/ceph/ceph-osd*.log. It may be obvious that the IO pattern is
different.

> Seems correct in terms of ceph-LVM io parameter negotiation? I wondered
> about the gpt header + PV metadata - they introduce some shift from the
> beginning of ceph's first block. Does this mean that all following LVM 4m
> data blocks are shifted by that amount and span 2 ceph objects?
> If so, performance will be affected.

I'm no LVM expert, but I would guess that LVM is aligning things properly
based on the above device properties...

sage

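A sketch of the capture suggested above, assuming default ceph log paths
(adjust to taste, and remember to lower the debug level again afterwards):

  ceph tell osd.* injectargs '--debug-ms 1'
  dd if=/mnt/somefile bs=1M of=/dev/null        # repeat on the LVM and non-LVM mounts
  grep -h 'osd_op(' /var/log/ceph/ceph-osd*.log | grep '\[read'
  ceph tell osd.* injectargs '--debug-ms 0'
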
* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Ugis @ 2013-10-20 15:18 UTC
To: Sage Weil
Cc: ceph-devel@vger.kernel.org, ceph-users@ceph.com, Mike Snitzer, linux-lvm

>> Output follows:
>>
>> # pvs -o pe_start /dev/rbd1p1
>>   1st PE
>>     4.00m
>> # cat /sys/block/rbd1/queue/minimum_io_size
>> 4194304
>> # cat /sys/block/rbd1/queue/optimal_io_size
>> 4194304
>
> Well, the parameters are being set at least. Mike, is it possible that
> having minimum_io_size set to 4m is causing some read amplification
> in LVM, translating a small read into a complete fetch of the PE (or
> something along those lines)?
>
> Ugis, if your cluster is on the small side, it might be interesting to see
> what requests the client is generating in the LVM and non-LVM case by
> setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
> '--debug-ms 1') and then looking at the osd_op messages that appear in
> /var/log/ceph/ceph-osd*.log. It may be obvious that the IO pattern is
> different.

Sage, here follows the debug output. I am no pro at reading this, but it
seems the read block sizes differ (or what is that number following the ~
sign)?

OSD.2 read with LVM:
2013-10-20 16:59:05.307159 7f95acfa5700 1 -- x.x.x.x:6804/1944 --> x.x.x.y:0/269199468 -- osd_op_reply(176566434 rbd_data.3ad974b0dc51.0000000000007cef [read 4083712~4096] ondisk = 0) v4 -- ?+0 0xdc35c00 con 0xd9e4840
2013-10-20 16:59:05.307655 7f95b27b0700 1 -- x.x.x.x:6804/1944 <== client.38069 x.x.x.y:0/269199468 5548 ==== osd_op(client.38069.1:176566435 rbd_data.3ad974b0dc51.0000000000007cef [read 4087808~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (1554835253 0 0) 0x12593d80 con 0xd9e4840
2013-10-20 16:59:05.307824 7f95ac7a4700 1 -- x.x.x.x:6804/1944 --> x.x.x.y:0/269199468 -- osd_op_reply(176566435 rbd_data.3ad974b0dc51.0000000000007cef [read 4087808~4096] ondisk = 0) v4 -- ?+0 0xe24fc00 con 0xd9e4840
2013-10-20 16:59:05.308316 7f95b27b0700 1 -- x.x.x.x:6804/1944 <== client.38069 x.x.x.y:0/269199468 5549 ==== osd_op(client.38069.1:176566436 rbd_data.3ad974b0dc51.0000000000007cef [read 4091904~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3467296840 0 0) 0xe28f6c0 con 0xd9e4840
2013-10-20 16:59:05.308499 7f95acfa5700 1 -- x.x.x.x:6804/1944 --> x.x.x.y:0/269199468 -- osd_op_reply(176566436 rbd_data.3ad974b0dc51.0000000000007cef [read 4091904~4096] ondisk = 0) v4 -- ?+0 0xdc35a00 con 0xd9e4840
2013-10-20 16:59:05.308985 7f95b27b0700 1 -- x.x.x.x:6804/1944 <== client.38069 x.x.x.y:0/269199468 5550 ==== osd_op(client.38069.1:176566437 rbd_data.3ad974b0dc51.0000000000007cef [read 4096000~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3104591620 0 0) 0xe0b46c0 con 0xd9e4840

OSD.2 read without LVM:
2013-10-20 17:03:13.730881 7f95ac7a4700 1 -- x.x.x.x:6804/1944 --> x.x.x.y:0/269199468 -- osd_op_reply(176708854 rb.0.967b.238e1f29.000000000071 [read 2359296~131072] ondisk = 0) v4 -- ?+0 0x1019d200 con 0xd9e4840
2013-10-20 17:03:13.731318 7f95b27b0700 1 -- x.x.x.x:6804/1944 <== client.38069 x.x.x.y:0/269199468 18232 ==== osd_op(client.38069.1:176708855 rb.0.967b.238e1f29.000000000071 [read 2490368~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (1987168552 0 0) 0x171a7480 con 0xd9e4840
2013-10-20 17:03:13.731664 7f95acfa5700 1 -- x.x.x.x:6804/1944 --> x.x.x.y:0/269199468 -- osd_op_reply(176708855 rb.0.967b.238e1f29.000000000071 [read 2490368~131072] ondisk = 0) v4 -- ?+0 0x12b81200 con 0xd9e4840
2013-10-20 17:03:13.733112 7f95b27b0700 1 -- x.x.x.x:6804/1944 <== client.38069 x.x.x.y:0/269199468 18233 ==== osd_op(client.38069.1:176708856 rb.0.967b.238e1f29.000000000071 [read 2621440~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (527551382 0 0) 0x12593d80 con 0xd9e4840
2013-10-20 17:03:13.733393 7f95ac7a4700 1 -- x.x.x.x:6804/1944 --> x.x.x.y:0/269199468 -- osd_op_reply(176708856 rb.0.967b.238e1f29.000000000071 [read 2621440~131072] ondisk = 0) v4 -- ?+0 0xeba9000 con 0xd9e4840
2013-10-20 17:03:13.733741 7f95b27b0700 1 -- x.x.x.x:6804/1944 <== client.38069 x.x.x.y:0/269199468 18234 ==== osd_op(client.38069.1:176708857 rb.0.967b.238e1f29.000000000071 [read 2752512~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (178955972 0 0) 0xe0b4d80 con 0xd9e4840

How should I proceed with tuning read performance on LVM? Is some change
needed in the code of ceph/LVM, or does my config need to be tuned?
If what is shown in the logs means a 4k read block in the LVM case, then it
seems I need to tell LVM (or does xfs on top of LVM dictate the read block
size?) that the io block should rather be 4m?

Ugis

* Re: [linux-lvm] [ceph-users] poor read performance on rbd+LVM, LVM overload
From: Josh Durgin @ 2013-10-20 18:21 UTC
To: Ugis, Sage Weil
Cc: ceph-devel@vger.kernel.org, ceph-users@ceph.com, Mike Snitzer, linux-lvm

On 10/20/2013 08:18 AM, Ugis wrote:
>>> Output follows:
>>>
>>> # pvs -o pe_start /dev/rbd1p1
>>>   1st PE
>>>     4.00m
>>> # cat /sys/block/rbd1/queue/minimum_io_size
>>> 4194304
>>> # cat /sys/block/rbd1/queue/optimal_io_size
>>> 4194304
>>
>> Well, the parameters are being set at least. Mike, is it possible that
>> having minimum_io_size set to 4m is causing some read amplification
>> in LVM, translating a small read into a complete fetch of the PE (or
>> something along those lines)?
>>
>> Ugis, if your cluster is on the small side, it might be interesting to see
>> what requests the client is generating in the LVM and non-LVM case by
>> setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
>> '--debug-ms 1') and then looking at the osd_op messages that appear in
>> /var/log/ceph/ceph-osd*.log. It may be obvious that the IO pattern is
>> different.
>
> Sage, here follows the debug output. I am no pro at reading this, but it
> seems the read block sizes differ (or what is that number following the ~
> sign)?

Yes, that's the I/O length. LVM is sending requests for 4k at a time,
while plain kernel rbd is sending 128k.

<request logs showing this>

> How should I proceed with tuning read performance on LVM? Is some change
> needed in the code of ceph/LVM, or does my config need to be tuned?
> If what is shown in the logs means a 4k read block in the LVM case, then it
> seems I need to tell LVM (or does xfs on top of LVM dictate the read block
> size?) that the io block should rather be 4m?

It's a client-side issue of sending much smaller requests than it needs
to. Check the queue minimum and optimal sizes for the lvm device - it
sounds like they might be getting set to 4k for some reason.

Josh

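A sketch of the check Josh suggests, with placeholder VG/LV names (the LV
resolves to a dm-N node whose queue limits can be read directly):

  DM_DEV=$(basename "$(readlink -f /dev/<vg>/<lv>)")   # e.g. dm-2
  cat /sys/block/$DM_DEV/queue/minimum_io_size
  cat /sys/block/$DM_DEV/queue/optimal_io_size
  cat /sys/block/$DM_DEV/queue/read_ahead_kb
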
* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Sage Weil @ 2013-10-21 3:58 UTC
To: Ugis
Cc: elder, ceph-devel@vger.kernel.org, ceph-users@ceph.com, Mike Snitzer, linux-lvm

On Sun, 20 Oct 2013, Ugis wrote:
> >> Output follows:
> >>
> >> # pvs -o pe_start /dev/rbd1p1
> >>   1st PE
> >>     4.00m
> >> # cat /sys/block/rbd1/queue/minimum_io_size
> >> 4194304
> >> # cat /sys/block/rbd1/queue/optimal_io_size
> >> 4194304
> >
> > Well, the parameters are being set at least. Mike, is it possible that
> > having minimum_io_size set to 4m is causing some read amplification
> > in LVM, translating a small read into a complete fetch of the PE (or
> > something along those lines)?
> >
> > Ugis, if your cluster is on the small side, it might be interesting to see
> > what requests the client is generating in the LVM and non-LVM case by
> > setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
> > '--debug-ms 1') and then looking at the osd_op messages that appear in
> > /var/log/ceph/ceph-osd*.log. It may be obvious that the IO pattern is
> > different.
>
> Sage, here follows the debug output. I am no pro at reading this, but it
> seems the read block sizes differ (or what is that number following the ~
> sign)?

Yep, it's offset~length. It looks like without LVM we're getting 128KB
requests (which IIRC is typical), but with LVM it's only 4KB. Unfortunately
my memory is a bit fuzzy here, but I seem to recall a property on the
request_queue or device that affected this. RBD is currently doing

	segment_size = rbd_obj_bytes(&rbd_dev->header);
	blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
	blk_queue_max_segment_size(q, segment_size);
	blk_queue_io_min(q, segment_size);
	blk_queue_io_opt(q, segment_size);

where segment_size is 4MB (so, much more than 128KB); maybe it has
something to do with how many smaller ios get coalesced into larger
requests?

In any case, something appears to be lost due to the pass through LVM, but
I'm not very familiar with the block layer code at all... :/

sage

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Christoph Hellwig @ 2013-10-21 14:11 UTC
To: Sage Weil
Cc: elder, Mike Snitzer, Ugis, linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> It looks like without LVM we're getting 128KB requests (which IIRC is
> typical), but with LVM it's only 4KB. Unfortunately my memory is a bit
> fuzzy here, but I seem to recall a property on the request_queue or device
> that affected this. RBD is currently doing

Unfortunately most device mapper modules still split all I/O into 4k
chunks before handling them. They rely on the elevator to merge them
back together down the line, which isn't overly efficient but should at
least provide larger segments for the common cases.

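One way to see how much of that merging actually happens during the read
test (a sketch; the device names stand in for the rbd device and the LV's
dm node on this host):

  # rrqm/s = read requests merged per second, avgrq-sz = average request size in sectors
  iostat -x 1 rbd1 dm-2
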
* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Mike Snitzer @ 2013-10-21 15:01 UTC
To: Christoph Hellwig
Cc: elder, Sage Weil, Ugis, linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

On Mon, Oct 21 2013 at 10:11am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > It looks like without LVM we're getting 128KB requests (which IIRC is
> > typical), but with LVM it's only 4KB. Unfortunately my memory is a bit
> > fuzzy here, but I seem to recall a property on the request_queue or device
> > that affected this. RBD is currently doing
>
> Unfortunately most device mapper modules still split all I/O into 4k
> chunks before handling them. They rely on the elevator to merge them
> back together down the line, which isn't overly efficient but should at
> least provide larger segments for the common cases.

It isn't DM that splits the IO into 4K chunks; it is the VM subsystem,
no? Unless care is taken to assemble larger bios (higher up the IO
stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
in $PAGE_SIZE granularity.

I would expect direct IO to before better here because it will make use
of bio_add_page to build up larger IOs.

Taking a step back, the rbd driver is exposing both the minimum_io_size
and optimal_io_size as 4M. This symmetry will cause XFS to _not_ detect
the exposed limits as striping. Therefore, AFAIK, XFS won't take steps
to respect the limits when it assembles its bios (via bio_add_page).

Sage, any reason why you don't use traditional raid geometry based IO
limits? e.g.:

minimum_io_size = raid chunk size
optimal_io_size = raid chunk size * N stripes (aka full stripe)

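If one wanted to test the direct IO point above, a rough sketch using the
file from the original report (dropping the page cache between runs so
both actually hit the device):

  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/somefile of=/dev/null bs=4M count=256                # buffered
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/somefile of=/dev/null bs=4M count=256 iflag=direct   # O_DIRECT, bios built via bio_add_page
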
* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Mike Snitzer @ 2013-10-21 15:06 UTC
To: Christoph Hellwig
Cc: elder, Sage Weil, Ugis, linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

On Mon, Oct 21 2013 at 11:01am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> I would expect direct IO to before better here because it will make use
> of bio_add_page to build up larger IOs.

s/before/perform/ ;)

> [...]

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Sage Weil @ 2013-10-21 16:02 UTC
To: Mike Snitzer
Cc: elder, Christoph Hellwig, Ugis, linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

On Mon, 21 Oct 2013, Mike Snitzer wrote:
> It isn't DM that splits the IO into 4K chunks; it is the VM subsystem,
> no? Unless care is taken to assemble larger bios (higher up the IO
> stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
> in $PAGE_SIZE granularity.
>
> I would expect direct IO to before better here because it will make use
> of bio_add_page to build up larger IOs.

I do know that we regularly see 128 KB requests when we put XFS (or
whatever else) directly on top of /dev/rbd*.

> Taking a step back, the rbd driver is exposing both the minimum_io_size
> and optimal_io_size as 4M. This symmetry will cause XFS to _not_ detect
> the exposed limits as striping. Therefore, AFAIK, XFS won't take steps
> to respect the limits when it assembles its bios (via bio_add_page).
>
> Sage, any reason why you don't use traditional raid geometry based IO
> limits? e.g.:
>
> minimum_io_size = raid chunk size
> optimal_io_size = raid chunk size * N stripes (aka full stripe)

We are... by default we stripe 4M chunks across 4M objects. You're
suggesting it would actually help to advertise a smaller minimum_io_size
(say, 1MB)? This could easily be made tunable.

sage

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Mike Snitzer @ 2013-10-21 17:48 UTC
To: Sage Weil
Cc: elder, Christoph Hellwig, Ugis, linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

On Mon, Oct 21 2013 at 12:02pm -0400,
Sage Weil <sage@inktank.com> wrote:

> On Mon, 21 Oct 2013, Mike Snitzer wrote:
> > I would expect direct IO to before better here because it will make use
> > of bio_add_page to build up larger IOs.
>
> I do know that we regularly see 128 KB requests when we put XFS (or
> whatever else) directly on top of /dev/rbd*.

Should be pretty straight-forward to identify any limits that are
different by walking sysfs/queue, e.g.:

grep -r . /sys/block/rbdXXX/queue
vs
grep -r . /sys/block/dm-X/queue

Could be there is an unexpected difference. For instance, there was
this fix recently: http://patchwork.usersys.redhat.com/patch/69661/

> > Taking a step back, the rbd driver is exposing both the minimum_io_size
> > and optimal_io_size as 4M. This symmetry will cause XFS to _not_ detect
> > the exposed limits as striping. Therefore, AFAIK, XFS won't take steps
> > to respect the limits when it assembles its bios (via bio_add_page).
> >
> > Sage, any reason why you don't use traditional raid geometry based IO
> > limits? e.g.:
> >
> > minimum_io_size = raid chunk size
> > optimal_io_size = raid chunk size * N stripes (aka full stripe)
>
> We are... by default we stripe 4M chunks across 4M objects. You're
> suggesting it would actually help to advertise a smaller minimum_io_size
> (say, 1MB)? This could easily be made tunable.

You're striping 4MB chunks across 4 million stripes?

So the full stripe size in bytes is 17592186044416 (or 16TB)? Yeah,
cannot see how XFS could make use of that ;)

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Sage Weil @ 2013-10-21 18:05 UTC
To: Mike Snitzer
Cc: elder, Christoph Hellwig, Ugis, linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

On Mon, 21 Oct 2013, Mike Snitzer wrote:
> On Mon, Oct 21 2013 at 12:02pm -0400,
> Sage Weil <sage@inktank.com> wrote:
>
> > > Sage, any reason why you don't use traditional raid geometry based IO
> > > limits? e.g.:
> > >
> > > minimum_io_size = raid chunk size
> > > optimal_io_size = raid chunk size * N stripes (aka full stripe)
> >
> > We are... by default we stripe 4M chunks across 4M objects. You're
> > suggesting it would actually help to advertise a smaller minimum_io_size
> > (say, 1MB)? This could easily be made tunable.
>
> You're striping 4MB chunks across 4 million stripes?
>
> So the full stripe size in bytes is 17592186044416 (or 16TB)? Yeah,
> cannot see how XFS could make use of that ;)

Sorry, I mean the stripe count is effectively 1. Each 4MB gets mapped to a
new 4MB object (for a total of image_size / 4MB objects). So I think
minimum_io_size and optimal_io_size are technically correct in this case.

sage

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Christoph Hellwig @ 2013-10-21 18:06 UTC
To: Mike Snitzer
Cc: elder, Sage Weil, Christoph Hellwig, Ugis, linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

On Mon, Oct 21, 2013 at 11:01:29AM -0400, Mike Snitzer wrote:
> It isn't DM that splits the IO into 4K chunks; it is the VM subsystem,
> no?

Well, it's the block layer based on what DM tells it. Take a look at
dm_merge_bvec.

From dm_merge_bvec:

	/*
	 * If the target doesn't support merge method and some of the devices
	 * provided their merge_bvec method (we know this by looking at
	 * queue_max_hw_sectors), then we can't allow bios with multiple vector
	 * entries.  So always set max_size to 0, and the code below allows
	 * just one page.
	 */

Although it's not the general case, just if the driver has a merge_bvec
method. But this happens if you're using DM on top of MD, where I saw it
as well as on rbd, which is why it's correct in this context, too.

Sorry for over-generalizing a bit.

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Mike Snitzer @ 2013-10-21 18:27 UTC
To: Christoph Hellwig
Cc: elder, Sage Weil, Ugis, linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

On Mon, Oct 21 2013 at 2:06pm -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Oct 21, 2013 at 11:01:29AM -0400, Mike Snitzer wrote:
> > It isn't DM that splits the IO into 4K chunks; it is the VM subsystem,
> > no?
>
> Well, it's the block layer based on what DM tells it. Take a look at
> dm_merge_bvec.
>
> From dm_merge_bvec:
>
> 	/*
> 	 * If the target doesn't support merge method and some of the devices
> 	 * provided their merge_bvec method (we know this by looking at
> 	 * queue_max_hw_sectors), then we can't allow bios with multiple vector
> 	 * entries.  So always set max_size to 0, and the code below allows
> 	 * just one page.
> 	 */
>
> Although it's not the general case, just if the driver has a merge_bvec
> method. But this happens if you're using DM on top of MD, where I saw it
> as well as on rbd, which is why it's correct in this context, too.

Right, but only if the DM target that is being used doesn't have a .merge
method. I don't think it was ever shared which DM target is in use here...
but both the linear and stripe DM targets provide a .merge method.

> Sorry for over-generalizing a bit.

No problem.

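The open question of which DM target the LV maps to can be answered with
dmsetup (VG/LV names are placeholders; the third field of each table line
is the target type, e.g. "linear" or "striped"):

  dmsetup table /dev/<vg>/<lv>
  # or, for every dm device on the host:
  dmsetup table
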
* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
From: Ugis @ 2013-10-30 14:53 UTC
To: Mike Snitzer
Cc: Alex Elder, Sage Weil, Christoph Hellwig, linux-lvm, ceph-devel@vger.kernel.org, ceph-users@ceph.com

Hi, I'm back from my trip, sorry for the pause in the thread; I wanted to
wrap this up.

I reread the thread, but I actually do not see what could be done from the
admin side to tune LVM for better read performance on ceph (parts of my
LVM config are included below), at least for an already deployed LVM
setup. There seems to be no clear agreement on why the io is lost, so it
seems that LVM is not recommended on ceph rbd currently.

In case there is still hope for tuning, here follows the info.

Mike wrote:
"Should be pretty straight-forward to identify any limits that are
different by walking sysfs/queue, e.g.:

grep -r . /sys/block/rbdXXX/queue
vs
grep -r . /sys/block/dm-X/queue
"

Here it is:

# grep -r . /sys/block/rbd2/queue/
/sys/block/rbd2/queue/nomerges:0
/sys/block/rbd2/queue/logical_block_size:512
/sys/block/rbd2/queue/rq_affinity:1
/sys/block/rbd2/queue/discard_zeroes_data:0
/sys/block/rbd2/queue/max_segments:128
/sys/block/rbd2/queue/max_segment_size:4194304
/sys/block/rbd2/queue/rotational:1
/sys/block/rbd2/queue/scheduler:noop [deadline] cfq
/sys/block/rbd2/queue/read_ahead_kb:128
/sys/block/rbd2/queue/max_hw_sectors_kb:4096
/sys/block/rbd2/queue/discard_granularity:0
/sys/block/rbd2/queue/discard_max_bytes:0
/sys/block/rbd2/queue/write_same_max_bytes:0
/sys/block/rbd2/queue/max_integrity_segments:0
/sys/block/rbd2/queue/max_sectors_kb:512
/sys/block/rbd2/queue/physical_block_size:512
/sys/block/rbd2/queue/add_random:1
/sys/block/rbd2/queue/nr_requests:128
/sys/block/rbd2/queue/minimum_io_size:4194304
/sys/block/rbd2/queue/hw_sector_size:512
/sys/block/rbd2/queue/optimal_io_size:4194304
/sys/block/rbd2/queue/iosched/read_expire:500
/sys/block/rbd2/queue/iosched/write_expire:5000
/sys/block/rbd2/queue/iosched/fifo_batch:16
/sys/block/rbd2/queue/iosched/front_merges:1
/sys/block/rbd2/queue/iosched/writes_starved:2
/sys/block/rbd2/queue/iostats:1

# grep -r . /sys/block/dm-2/queue/
/sys/block/dm-2/queue/nomerges:0
/sys/block/dm-2/queue/logical_block_size:512
/sys/block/dm-2/queue/rq_affinity:0
/sys/block/dm-2/queue/discard_zeroes_data:0
/sys/block/dm-2/queue/max_segments:128
/sys/block/dm-2/queue/max_segment_size:65536
/sys/block/dm-2/queue/rotational:1
/sys/block/dm-2/queue/scheduler:none
/sys/block/dm-2/queue/read_ahead_kb:0
/sys/block/dm-2/queue/max_hw_sectors_kb:4096
/sys/block/dm-2/queue/discard_granularity:0
/sys/block/dm-2/queue/discard_max_bytes:0
/sys/block/dm-2/queue/write_same_max_bytes:0
/sys/block/dm-2/queue/max_integrity_segments:0
/sys/block/dm-2/queue/max_sectors_kb:512
/sys/block/dm-2/queue/physical_block_size:512
/sys/block/dm-2/queue/add_random:0
/sys/block/dm-2/queue/nr_requests:128
/sys/block/dm-2/queue/minimum_io_size:4194304
/sys/block/dm-2/queue/hw_sector_size:512
/sys/block/dm-2/queue/optimal_io_size:4194304
/sys/block/dm-2/queue/iostats:0

Chunks of /etc/lvm/lvm.conf, if this helps:

devices {
    dir = "/dev"
    scan = [ "/dev/rbd", "/dev" ]
    preferred_names = [ ]
    filter = [ "a/.*/" ]
    cache_dir = "/etc/lvm/cache"
    cache_file_prefix = ""
    write_cache_state = 0
    types = [ "rbd", 250 ]
    sysfs_scan = 1
    md_component_detection = 1
    md_chunk_alignment = 1
    data_alignment_detection = 1
    data_alignment = 0
    data_alignment_offset_detection = 1
    ignore_suspended_devices = 0
}
...
activation {
    udev_sync = 1
    udev_rules = 1
    missing_stripe_filler = "error"
    reserved_stack = 256
    reserved_memory = 8192
    process_priority = -18
    mirror_region_size = 512
    readahead = "none"
    mirror_log_fault_policy = "allocate"
    mirror_image_fault_policy = "remove"
    use_mlockall = 0
    monitoring = 1
    polling_interval = 15
}

Hope something can still be done, or I will have to move several TB off
the LVM :) Anyway, it does not feel like the cause of the problem is
clear. Maybe I need to file a bug if that is relevant, but where?

Ugis

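One difference visible in the dumps above is readahead: the LV has
read_ahead_kb 0 (lvm.conf sets readahead = "none") while the rbd device
has 128. Whether this explains the small reads here is untested, but it is
a cheap experiment (VG/LV names are placeholders; values are in 512-byte
sectors):

  blockdev --getra /dev/<vg>/<lv>        # current readahead
  blockdev --setra 8192 /dev/<vg>/<lv>   # try e.g. 4MiB and rerun the dd read test
  # or persistently through LVM:
  lvchange -r 8192 <vg>/<lv>
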