From: Konrad Rzeszutek Wilk
Subject: Re: IO speed limited by size of IO request (for RBD driver)
Date: Fri, 24 May 2013 10:29:38 -0400
Message-ID: <20130524142938.GI3900@phenom.dumpdata.com>
In-Reply-To: <44C85321-F0BF-45DF-AD44-F02EE9A2391B@citrix.com>
References: <517EC975.7030807@crc.id.au> <517ECE64.6000503@crc.id.au>
 <9F2C4E7DFB7839489C89757A66C5AD620E57EA@LONPEX01CL03.citrite.net>
 <518A0AB8.90506@crc.id.au> <518A0DC8.4080501@citrix.com>
 <518A29DA.3080501@crc.id.au> <518A2CB3.7090106@citrix.com>
 <9F2C4E7DFB7839489C89757A66C5AD620F1713@LONPEX01CL03.citrite.net>
 <20130522201308.GB12372@phenom.dumpdata.com>
 <44C85321-F0BF-45DF-AD44-F02EE9A2391B@citrix.com>
To: Felipe Franciosi
Cc: Roger Pau Monne, Steven Haigh, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On Thu, May 23, 2013 at 07:22:27AM +0000, Felipe Franciosi wrote:
> 
> On 22 May 2013, at 21:13, "Konrad Rzeszutek Wilk" wrote:
> 
> > On Wed, May 08, 2013 at 11:14:26AM +0000, Felipe Franciosi wrote:
> >> Although we didn't "prove" it properly, I think it is worth mentioning that this boils down to what we originally thought it was:
> >> Steven's environment is writing to a filesystem in the guest. On top of that, it's using the guest's buffer cache to do the writes.
> > 
> > If he is using O_DIRECT it bypasses the cache in the guest.
> 
> Certainly, but the issues were when _not_ using O_DIRECT.

I am confused. Does feature-indirect-descriptor make things worse or better
when not using O_DIRECT? Or is there no difference when combining !O_DIRECT
with feature-indirect-descriptor?

> 
> F
> 
> > 
> >> This means that we cannot (easily?) control how the cache and the fs are flushing these writes through blkfront/blkback.

echo 3 > /proc/..something/drop_cache does it?

> >> 
> >> In other words, it's very likely that it generates a workload that simply doesn't perform well on the "stock" PV protocol.

'fio' is an excellent tool to run the tests without using the cache.

> >> This is a good example of how indirect descriptors help (remembering Roger and I were struggling to find use cases where indirect descriptors showed a substantial gain).

You mean using O_DIRECT? Yes, all tests that involve any I/O should use O_DIRECT.
Otherwise they are misleading. And my understanding from this thread is that
Steven did that and found:

 a) without feature-indirect-descriptor, the I/O was sucky;
 b) with the initial feature-indirect-descriptor, the I/O was less sucky;
 c) with feature-indirect-descriptor plus a tweak to the frontend for how many
    segments to use, the I/O was the same as on bare metal.

Sorry about being so verbose here - I feel that I am missing something and I
am not exactly sure what it is. Could you please enlighten me?
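For reference, a minimal sketch of the kind of test meant here: flush the guest page cache via /proc/sys/vm/drop_caches and then do a direct sequential write with fio. The job name, target path and sizes below are placeholders, not taken from Steven's setup:

    # drop the guest page cache so buffered results are not skewed by earlier runs
    sync
    echo 3 > /proc/sys/vm/drop_caches

    # sequential 1M writes with O_DIRECT, bypassing the guest page cache entirely
    fio --name=seqwrite --filename=/mnt/test/fio.dat --rw=write \
        --bs=1M --size=2G --direct=1 --ioengine=libaio --iodepth=16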
> >> 
> >> Cheers,
> >> Felipe
> >> 
> >> -----Original Message-----
> >> From: Roger Pau Monne
> >> Sent: 08 May 2013 11:45
> >> To: Steven Haigh
> >> Cc: Felipe Franciosi; xen-devel@lists.xen.org
> >> Subject: Re: IO speed limited by size of IO request (for RBD driver)
> >> 
> >> On 08/05/13 12:32, Steven Haigh wrote:
> >>> On 8/05/2013 6:33 PM, Roger Pau Monné wrote:
> >>>> On 08/05/13 10:20, Steven Haigh wrote:
> >>>>> On 30/04/2013 8:07 PM, Felipe Franciosi wrote:
> >>>>>> I noticed you copied your results from "dd", but I didn't see any conclusions drawn from the experiment.
> >>>>>> 
> >>>>>> Did I understand it wrong, or do you now have comparable performance on dom0 and domU when using DIRECT?
> >>>>>> 
> >>>>>> domU:
> >>>>>> # dd if=/dev/zero of=output.zero bs=1M count=2048 oflag=direct
> >>>>>> 2048+0 records in
> >>>>>> 2048+0 records out
> >>>>>> 2147483648 bytes (2.1 GB) copied, 25.4705 s, 84.3 MB/s
> >>>>>> 
> >>>>>> dom0:
> >>>>>> # dd if=/dev/zero of=output.zero bs=1M count=2048 oflag=direct
> >>>>>> 2048+0 records in
> >>>>>> 2048+0 records out
> >>>>>> 2147483648 bytes (2.1 GB) copied, 24.8914 s, 86.3 MB/s
> >>>>>> 
> >>>>>> 
> >>>>>> I think that if the performance differs when NOT using DIRECT, the issue must be related to the way your guest is flushing the cache. This must be generating a workload that doesn't perform well on Xen's PV protocol.
> >>>>> 
> >>>>> Just wondering if there is any further input on this... While DIRECT
> >>>>> writes are as good as can be expected, NON-DIRECT writes in certain
> >>>>> cases (specifically with an mdadm RAID in the Dom0) suffer roughly
> >>>>> a 50% loss in throughput...
> >>>>> 
> >>>>> The hard part is that this is the default mode of writing!
> >>>> 
> >>>> As another test with indirect descriptors, could you change
> >>>> xen_blkif_max_segments in xen-blkfront.c to 128 (it is 32 by
> >>>> default), recompile the DomU kernel and see if that helps?
> >>> 
> >>> Ok, here we go.... compiled as 3.8.0-2 with the above change. 3.8.0-2
> >>> is running on both the Dom0 and DomU.
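For anyone reproducing this, the change being tested is roughly the following one-liner against the patched frontend, followed by a domU kernel rebuild. This assumes the constant is declared as "xen_blkif_max_segments = 32;" as in the later mainline xen-blkfront.c; adjust to your tree:

    # bump the frontend segment limit from 32 to 128 in the indirect-descriptor series
    sed -i 's/xen_blkif_max_segments = 32;/xen_blkif_max_segments = 128;/' \
        drivers/block/xen-blkfront.c
    make -j"$(nproc)" bzImage modules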
> >>> 
> >>> # dd if=/dev/zero of=output.zero bs=1M count=2048
> >>> 2048+0 records in
> >>> 2048+0 records out
> >>> 2147483648 bytes (2.1 GB) copied, 22.1703 s, 96.9 MB/s
> >>> 
> >>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >>>            0.34    0.00   17.10    0.00    0.23   82.33
> >>> 
> >>> Device:  rrqm/s    wrqm/s    r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm   %util
> >>> sdd      980.97  11936.47  53.11   429.78   4.00  48.77    223.81     12.75  26.10   2.11  101.79
> >>> sdc      872.71  11957.87  45.98   435.67   3.55  49.30    224.71     13.77  28.43   2.11  101.49
> >>> sde      949.26  11981.88  51.30   429.33   3.91  48.90    225.03     21.29  43.91   2.27  109.08
> >>> sdf      915.52  11968.52  48.58   428.88   3.73  48.92    225.84     21.44  44.68   2.27  108.56
> >>> md2        0.00      0.00   0.00  1155.61   0.00  97.51    172.80      0.00   0.00   0.00    0.00
> >>> 
> >>> # dd if=/dev/zero of=output.zero bs=1M count=2048 oflag=direct
> >>> 2048+0 records in
> >>> 2048+0 records out
> >>> 2147483648 bytes (2.1 GB) copied, 25.3708 s, 84.6 MB/s
> >>> 
> >>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >>>            0.11    0.00   13.92    0.00    0.22   85.75
> >>> 
> >>> Device:  rrqm/s    wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> >>> sdd        0.00  13986.08   0.00  263.20   0.00  55.76    433.87      0.43   1.63   1.07  28.27
> >>> sdc      202.10  13741.55   6.52  256.57   0.81  54.77    432.65      0.50   1.88   1.25  32.78
> >>> sde       47.96  11437.57   1.55  261.77   0.19  45.79    357.63      0.80   3.02   1.85  48.60
> >>> sdf     2233.37  11756.13  71.93  191.38   8.99  46.80    433.90      1.49   5.66   3.27  86.15
> >>> md2        0.00      0.00   0.00  731.93   0.00  91.49    256.00      0.00   0.00   0.00   0.00
> >>> 
> >>> Now this is pretty much exactly what I would expect the system to do...
> >>> ~96MB/sec buffered, and 85MB/sec direct.
> >> 
> >> I'm sorry to be such a PITA, but could you also try with 64? If we have to increase the maximum number of indirect descriptors, I would like to set it to the lowest value that provides good performance, to prevent using too much memory.
> >> 
> >>> So - it turns out that xen_blkif_max_segments at 32 is a killer in the
> >>> DomU. Now it makes me wonder what we can do about this in kernels that
> >>> don't have your series of patches against them - and also about the
> >>> backend stuff in 3.8.x etc.
> >> 
> >> There isn't much we can do regarding kernels without indirect descriptors; there's no easy way to increase the number of segments in a request.
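In case it helps with further testing: with the indirect-descriptor patches applied, the backend advertises its segment limit in xenstore, so one way to check what a given dom0/domU pair actually negotiated is something like the following. The backend domain is assumed to be 0, <domid> and <devid> are placeholders, and the exact key name may differ between revisions of the patch series:

    # list the backend vbd node for the guest's disk and look for the indirect-segment keys
    xenstore-ls /local/domain/0/backend/vbd/<domid>/<devid> | grep -i indirect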