From: Steven Haigh <netwiz@crc.id.au>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: roger.pau@citrix.com, xen-devel@lists.xen.org
Subject: Re: 4.2.1: Poor write performance for DomU.
Date: Sat, 07 Sep 2013 09:06:41 +1000
Message-ID: <522A6001.4070207@crc.id.au>
In-Reply-To: <20130906133325.GJ2590@phenom.dumpdata.com>


On 06/09/13 23:33, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 05, 2013 at 06:28:25PM +1000, Steven Haigh wrote:
>> On 21/08/13 02:48, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Mar 25, 2013 at 01:21:09PM +1100, Steven Haigh wrote:
>>>> So, based on my tests yesterday, I decided to break the RAID6 and
>>>> pull a drive out of it to test directly on the 2TB drives in
>>>> question.
>>>>
>>>> The array in question:
>>>> # cat /proc/mdstat
>>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>>> md2 : active raid6 sdd[4] sdc[0] sde[1] sdf[5]
>>>>       3907026688 blocks super 1.2 level 6, 128k chunk, algorithm 2
>>>> [4/4] [UUUU]
>>>>
>>>> # mdadm /dev/md2 --fail /dev/sdf
>>>> mdadm: set /dev/sdf faulty in /dev/md2
>>>> # mdadm /dev/md2 --remove /dev/sdf
>>>> mdadm: hot removed /dev/sdf from /dev/md2
>>>>
>>>> So, all tests are to be done on /dev/sdf.
>>>> Model Family:     Seagate SV35
>>>> Device Model:     ST2000VX000-9YW164
>>>> Serial Number:    Z1E17C3X
>>>> LU WWN Device Id: 5 000c50 04e1bc6f0
>>>> Firmware Version: CV13
>>>> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>
>>>> From the Dom0:
>>>> # dd if=/dev/zero of=/dev/sdf bs=1M count=4096 oflag=direct
>>>> 4096+0 records in
>>>> 4096+0 records out
>>>> 4294967296 bytes (4.3 GB) copied, 30.7691 s, 140 MB/s
>>>>
>>>> Create a single partition on the drive, and format it with ext4:
>>>> Disk /dev/sdf: 2000.4 GB, 2000398934016 bytes
>>>> 255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 4096 bytes
>>>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>>> Disk identifier: 0x98d8baaf
>>>>
>>>>    Device Boot      Start         End      Blocks   Id  System
>>>> /dev/sdf1            2048  3907029167  1953513560   83  Linux
>>>>
>>>> Command (m for help): w
>>>>
>>>> # mkfs.ext4 -j /dev/sdf1
>>>> ......
>>>> Writing inode tables: done
>>>> Creating journal (32768 blocks): done
>>>> Writing superblocks and filesystem accounting information: done
>>>>
>>>> Mount it on the Dom0:
>>>> # mount /dev/sdf1 /mnt/esata/
>>>> # cd /mnt/esata/
>>>> # bonnie++ -d . -u 0:0
>>>> ....
>>>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>>>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>>>> xenhost.lan.crc. 2G   425  94 133607  24 60544  12   973  95 209114  17 296.4   6
>>>> Latency             70971us     190ms     221ms   40369us   17657us     164ms
>>>>
>>>> So from the Dom0: 133MB/sec write, 209MB/sec read.
>>>>
>>>> Now, I'll attach the full disk to a DomU:
>>>> # xm block-attach zeus.vm phy:/dev/sdf xvdc w
>>>>
>>>> And we'll test from the DomU.
>>>>
>>>> # dd if=/dev/zero of=/dev/xvdc bs=1M count=4096 oflag=direct
>>>> 4096+0 records in
>>>> 4096+0 records out
>>>> 4294967296 bytes (4.3 GB) copied, 32.318 s, 133 MB/s
>>>>
>>>> Partition the same as in the Dom0 and create an ext4 filesystem on it:
>>>>
>>>> I notice something interesting here. In the Dom0, the device is seen as:
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 4096 bytes
>>>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>>>
>>>> In the DomU, it is seen as:
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>>>
>>>> Not sure if this could be related - but continuing testing:
>>>>     Device Boot      Start         End      Blocks   Id  System
>>>> /dev/xvdc1            2048  3907029167  1953513560   83  Linux
>>>>
>>>> # mkfs.ext4 -j /dev/xvdc1
>>>> ....
>>>> Allocating group tables: done
>>>> Writing inode tables: done
>>>> Creating journal (32768 blocks): done
>>>> Writing superblocks and filesystem accounting information: done
>>>>
>>>> # mount /dev/xvdc1 /mnt/esata/
>>>> # cd /mnt/esata/
>>>> # bonnie++ -d . -u 0:0
>>>> ....
>>>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>>>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>>>> zeus.crc.id.au   2G   396  99 116530  23 50451  15  1035  99 176407  23 313.4   9
>>>> Latency             34615us     130ms     128ms   33316us   74401us     130ms
>>>>
>>>> So still... 116MB/sec write, 176MB/sec read to the physical device
>>>> from the DomU. More than acceptable.
>>>>
>>>> It leaves me to wonder.... Could the Dom0 seeing the drives as
>>>> having 4096 byte sectors, while the DomU sees them as having
>>>> 512 byte sectors, be causing an issue?
>>>
>>> There is certain overhead in it. I still have this in my mailbox
>>> so I am not sure whether this issue ever got resolved. I know that the
>>> indirect patches in Xen blkback and xen blkfront are meant to resolve
>>> some of these issues - by being able to carry a bigger payload.
>>>
>>> Did you ever try v3.11 kernel in both dom0 and domU? Thanks.
>>
>> Ok, so I finally got around to building kernel 3.11 RPMs today for
>> testing. I upgraded both the Dom0 and DomU to the same kernel:
> 
> Woohoo!
>>
>> DomU:
>> # dmesg | grep blkfront
>> blkfront: xvda: flush diskcache: enabled; persistent grants: enabled;
>> indirect descriptors: enabled;
>> blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled;
>> indirect descriptors: enabled;
>>
>> Looks good.
>>
>> Transfer tests using bonnie++ as per before:
>> # bonnie -d . -u 0:0
>> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
>> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>> zeus.crc.id.au   2G   603  92 58250   9 62248  14   886  99 295757  30 492.3  13
>> Latency             27305us     124ms     158ms   34222us   16865us     374ms
>> Version  1.96       ------Sequential Create------ --------Random Create--------
>> zeus.crc.id.au      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>                  16 10048  22 +++++ +++ 17849  29 11109  25 +++++ +++ 18389  31
>> Latency             17775us     154us     180us   16008us      38us      58us
>>
>> Still seems to be a massive discrepancy between Dom0 and DomU write
>> speeds. Interestingly, sequential block reads are nearly 300MB/sec,
>> yet sequential writes were only ~58MB/sec.
> 
> OK, so the other thing that people were pointing out is that you
> can use the xen-blkfront.max parameter. By default it is 32, but try 8.
> Or 64. Or 256.

Ahh - interesting.

I used the following:
Kernel command line: ro root=/dev/xvda rd_NO_LUKS rd_NO_DM
LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us
crashkernel=auto console=hvc0 xen-blkfront.max=X
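
To double-check which value a given boot actually carried, something
like the following could be used on the command-line text. This is only
a minimal sketch: the helper name cmdline_param is hypothetical, the
sample string is a shortened stand-in for the command line above, and it
only parses text rather than asking the driver what it is using.

```python
def cmdline_param(cmdline, name):
    """Return the value of a name=value token on a kernel command line.

    The kernel treats '-' and '_' in parameter names as equivalent, so both
    sides are normalised before comparing. Returns None if the parameter is
    absent. If it appears more than once, the last occurrence is returned,
    on the assumption that a later setting overrides an earlier one.
    """
    wanted = name.replace("-", "_")
    found = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key.replace("-", "_") == wanted:
            found = value
    return found


# Shortened example command line; on a live DomU the text would come
# from /proc/cmdline instead.
sample = "ro root=/dev/xvda console=hvc0 xen-blkfront.max=128"
print(cmdline_param(sample, "xen-blkfront.max"))
```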

8:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zeus.crc.id.au   2G   696  92 50906   7 46102  11  1013  97 256784  27 496.5  10
Latency             24374us     199ms     117ms   30855us   38008us   85175us

16:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zeus.crc.id.au   2G   675  92 58078   8 57585  13  1005  97 262735  25 505.6  10
Latency             24412us     187ms     183ms   23661us   53850us     232ms

32:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zeus.crc.id.au   2G   698  92 57416   8 63328  13  1063  97 267154  24 498.2  12
Latency             24264us     199ms   81362us   33144us   22526us     237ms

64:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zeus.crc.id.au   2G   574  86 88447  13 68988  17   897  97 265128  27 493.7  13

128:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zeus.crc.id.au   2G   702  97 107638  14 70158  15  1045  97 255596  24 491.0  12
Latency             27279us   17553us     134ms   29771us   38392us   65761us

256:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zeus.crc.id.au   2G   689  91 102554  14 67337  15  1012  97 262475  24 484.4  12
Latency             20642us     104ms     189ms   36624us   45286us   80023us

So, as a nice summary of sequential block write speed for each max value:
8: 50MB/sec
16: 58MB/sec
32: 57MB/sec
64: 88MB/sec
128: 107MB/sec
256: 102MB/sec
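
For anyone wanting to compare the settings against the default, the
ratios can be worked out directly from the bonnie++ block write column.
A rough sketch follows; the K/sec figures are copied from the runs
above, while the variable and function names are just illustrative.

```python
# Sequential block write rates (K/sec) from the bonnie++ runs above,
# keyed by the xen-blkfront.max value used for that boot.
block_write_kps = {
    8: 50906,
    16: 58078,
    32: 57416,
    64: 88447,
    128: 107638,
    256: 102554,
}

DEFAULT_MAX = 32  # the default value named earlier in the thread


def relative_to_default(results, default=DEFAULT_MAX):
    """Map each setting to its write rate divided by the default's rate."""
    base = results[default]
    return {setting: rate / base for setting, rate in results.items()}


ratios = relative_to_default(block_write_kps)
for setting in sorted(block_write_kps):
    print(f"{setting:>3}: {block_write_kps[setting]:>6} K/sec  "
          f"{ratios[setting]:.2f}x of default")
```

On these numbers 128 works out to roughly 1.87 times the rate measured
at the default of 32. Each figure is a single run, so small differences
such as 16 versus 32 are probably within noise.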

So, maybe it's coincidence, maybe it isn't - but the best (factoring in
margin of error) seems to be 128 - which happens to be the chunk size
(128k) of the underlying RAID6 array on the Dom0.

# cat /proc/mdstat
md2 : active raid6 sdd[5] sdc[4] sdf[1] sde[0]
      3906766592 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/4]
[UUUU]
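
One way to sanity-check the coincidence is to work out how much data a
single request can carry at each setting and compare that with the
array geometry. The sketch below rests on an assumption that is not
stated anywhere in this thread: that each segment counted by
xen-blkfront.max is one 4 KiB page. The chunk size and disk count come
from the mdstat output above; the names are illustrative only.

```python
SEGMENT_KIB = 4    # ASSUMPTION: one 4 KiB page per segment
CHUNK_KIB = 128    # chunk size from /proc/mdstat
RAID_DISKS = 4     # member count from /proc/mdstat ([4/4])
PARITY_DISKS = 2   # RAID6 stores two parity chunks per stripe

# Data held by one full stripe, excluding the parity chunks.
full_stripe_kib = CHUNK_KIB * (RAID_DISKS - PARITY_DISKS)


def request_kib(max_segments):
    """Largest request payload in KiB for a given segment limit."""
    return max_segments * SEGMENT_KIB


for max_segments in (8, 16, 32, 64, 128, 256):
    size = request_kib(max_segments)
    print(f"max={max_segments:>3}: {size:>4} KiB per request, "
          f"{size / full_stripe_kib:.2f} full stripes")
```

Under that assumption it is the default of 32 that gives 128 KiB per
request, matching one chunk, and 64 is the first setting large enough
to cover a full 256 KiB stripe. Whether that is what drives the jump
between 32 and 64 is not something these numbers can prove.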

> The indirect descriptor allows us to put more I/Os on the ring - and
> I am hoping that will:
>  a) solve your problem

Well, it looks like this solves the issue - raising the max from the
default of 32 to 128 almost doubles the write speed (57MB/sec to
107MB/sec) - with no change to read speeds (within margin of error).

>  b) not solve your problem, but demonstrate that the issue is not with
>     the ring, but with something else making your writes slower.
> 
> Hmm, are you by any chance using O_DIRECT when running bonnie++ in
> dom0? The xen-blkback tacks on O_DIRECT to all write requests. This is
> done to not use the dom0 page cache - otherwise you end up with
> a double buffer where the writes are insane speed - but with absolutely
> no safety.
> 
> If you want to try disabling that (so no O_DIRECT), I would do this
> little change:
> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index bf4b9d2..823b629 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1139,7 +1139,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>                 break;
>         case BLKIF_OP_WRITE:
>                 blkif->st_wr_req++;
> -               operation = WRITE_ODIRECT;
> +               operation = WRITE;
>                 break;
>         case BLKIF_OP_WRITE_BARRIER:
>                 drain = true;

With the above results, is this still useful?

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299


