From mboxrd@z Thu Jan 1 00:00:00 1970
From: Roger Pau Monné
Subject: Re: IO speed limited by size of IO request (for RBD driver)
Date: Wed, 8 May 2013 12:45:07 +0200
Message-ID: <518A2CB3.7090106@citrix.com>
References: <51768FA5.6090609@crc.id.au> <5176957E.6010306@citrix.com>
 <51769B9D.4000708@crc.id.au> <51769CFD.7020907@citrix.com>
 <51769E1E.6040902@crc.id.au> <5176A19A.2010802@citrix.com>
 <5176A440.8040303@crc.id.au> <5176A520.5030503@citrix.com>
 <5176A61F.6050607@crc.id.au> <5176A6DD.5000404@citrix.com>
 <5176AFF9.4020003@crc.id.au> <5176B237.8020803@citrix.com>
 <5176C073.3050409@crc.id.au> <5176CF56.8000505@citrix.com>
 <5176DB88.1070200@crc.id.au> <517A89DA.3030804@citrix.com>
 <517A8C44.5020103@crc.id.au> <517B3088.7070809@crc.id.au>
 <517B790A.3020009@citrix.com> <517B838C.9040607@crc.id.au>
 <517B8DE3.90306@crc.id.au> <517E3195.8090204@citrix.com>
 <517EC975.7030807@crc.id.au> <517ECE64.6000503@crc.id.au>
 <9F2C4E7DFB7839489C89757A66C5AD620E57EA@LONPEX01CL03.citrite.net>
 <518A0AB8.90506@crc.id.au> <518A0DC8.4080501@citrix.com>
 <518A29DA.3080501@crc.id.au>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <518A29DA.3080501@crc.id.au>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Steven Haigh
Cc: Felipe Franciosi, "xen-devel@lists.xen.org"
List-Id: xen-devel@lists.xenproject.org

On 08/05/13 12:32, Steven Haigh wrote:
> On 8/05/2013 6:33 PM, Roger Pau Monné wrote:
>> On 08/05/13 10:20, Steven Haigh wrote:
>>> On 30/04/2013 8:07 PM, Felipe Franciosi wrote:
>>>> I noticed you copied your results from "dd", but I didn't see any conclusions drawn from experiment.
>>>>
>>>> Did I understand it wrong or now you have comparable performance on dom0 and domU when using DIRECT?
>>>>
>>>> domU:
>>>> # dd if=/dev/zero of=output.zero bs=1M count=2048 oflag=direct
>>>> 2048+0 records in
>>>> 2048+0 records out
>>>> 2147483648 bytes (2.1 GB) copied, 25.4705 s, 84.3 MB/s
>>>>
>>>> dom0:
>>>> # dd if=/dev/zero of=output.zero bs=1M count=2048 oflag=direct
>>>> 2048+0 records in
>>>> 2048+0 records out
>>>> 2147483648 bytes (2.1 GB) copied, 24.8914 s, 86.3 MB/s
>>>>
>>>>
>>>> I think that if the performance differs when NOT using DIRECT, the issue must be related to the way your guest is flushing the cache. This must be generating a workload that doesn't perform well on Xen's PV protocol.
>>>
>>> Just wondering if there is any further input on this... While DIRECT
>>> writes are as good as can be expected, NON-DIRECT writes in certain
>>> cases (specifically with a mdadm raid in the Dom0) are affected by about
>>> a 50% loss in throughput...
>>>
>>> The hard part is that this is the default mode of writing!
>>
>> As another test with indirect descriptors, could you change
>> xen_blkif_max_segments in xen-blkfront.c to 128 (it is 32 by default),
>> recompile the DomU kernel and see if that helps?
>
> Ok, here we go.... compiled as 3.8.0-2 with the above change. 3.8.0-2 is
> running on both the Dom0 and DomU.
>
> # dd if=/dev/zero of=output.zero bs=1M count=2048
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 22.1703 s, 96.9 MB/s
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.34    0.00   17.10    0.00    0.23   82.33
>
> Device:   rrqm/s   wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await svctm  %util
> sdd       980.97 11936.47  53.11  429.78   4.00  48.77   223.81    12.75  26.10  2.11 101.79
> sdc       872.71 11957.87  45.98  435.67   3.55  49.30   224.71    13.77  28.43  2.11 101.49
> sde       949.26 11981.88  51.30  429.33   3.91  48.90   225.03    21.29  43.91  2.27 109.08
> sdf       915.52 11968.52  48.58  428.88   3.73  48.92   225.84    21.44  44.68  2.27 108.56
> md2         0.00     0.00   0.00 1155.61   0.00  97.51   172.80     0.00   0.00  0.00   0.00
>
> # dd if=/dev/zero of=output.zero bs=1M count=2048 oflag=direct
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 25.3708 s, 84.6 MB/s
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.11    0.00   13.92    0.00    0.22   85.75
>
> Device:   rrqm/s   wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await svctm  %util
> sdd         0.00 13986.08   0.00  263.20   0.00  55.76   433.87     0.43   1.63  1.07  28.27
> sdc       202.10 13741.55   6.52  256.57   0.81  54.77   432.65     0.50   1.88  1.25  32.78
> sde        47.96 11437.57   1.55  261.77   0.19  45.79   357.63     0.80   3.02  1.85  48.60
> sdf      2233.37 11756.13  71.93  191.38   8.99  46.80   433.90     1.49   5.66  3.27  86.15
> md2         0.00     0.00   0.00  731.93   0.00  91.49   256.00     0.00   0.00  0.00   0.00
>
> Now this is pretty much exactly what I would expect the system to do....
> ~96MB/sec buffered, and 85MB/sec direct.

I'm sorry to be such a PITA, but could you also try with 64? If we have
to increase the maximum number of indirect descriptors I would like to
set it to the lowest value that provides good performance to prevent
using too much memory.

> So - it turns out that xen_blkif_max_segments at 32 is a killer in the
> DomU.
> Now it makes me wonder what we can do about this in kernels that
> don't have your series of patches against it? And also about the backend
> stuff in 3.8.x etc?

There isn't much we can do regarding kernels without indirect
descriptors, as there's no easy way to increase the number of segments
in a request.
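
For reference, the segment counts mentioned in this thread can be turned
into per-request sizes with a little arithmetic. The following is only a
rough sketch, assuming 4 KiB pages with one page per blkif segment and
iostat's avgrq-sz being reported in 512-byte sectors. The helper names
are made up for illustration and do not come from xen-blkfront.c. The
11-segment entry is the per-request limit of the blkif protocol without
indirect descriptors.

```python
# Assumption: 4 KiB pages, and each blkif segment covers at most one page.
PAGE_SIZE = 4096
# Assumption: iostat reports avgrq-sz in 512-byte sectors.
SECTOR_SIZE = 512


def max_request_kib(segments, page_size=PAGE_SIZE):
    """Upper bound on the size of one block request, in KiB (hypothetical helper)."""
    return segments * page_size // 1024


def avgrq_sz_to_kib(avgrq_sz_sectors, sector_size=SECTOR_SIZE):
    """Convert an iostat avgrq-sz value (in sectors) to KiB (hypothetical helper)."""
    return avgrq_sz_sectors * sector_size / 1024


if __name__ == "__main__":
    # 11: limit without indirect descriptors; 32, 64, 128: values from this thread.
    for segments in (11, 32, 64, 128):
        print(f"{segments:4d} segments -> at most {max_request_kib(segments)} KiB per request")

    # avgrq-sz values taken from the iostat output quoted above.
    for device, sectors in (("sdd buffered", 223.81), ("sdd direct", 433.87), ("md2 direct", 256.00)):
        print(f"{device}: avgrq-sz {sectors} sectors = {avgrq_sz_to_kib(sectors):.1f} KiB")
```

Under these assumptions, 32 segments cap a request at 128 KiB, 64 at
256 KiB and 128 at 512 KiB, which gives a rough idea of the extra memory
each step up in xen_blkif_max_segments can pin per in-flight request.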