Re: [RFC] Block IO Controller V2 - some results

From: "Alan D. Brunelle" <Alan.Brunelle@hp.com>
To: linux-kernel@vger.kernel.org
Cc: vgoyal@redhat.com, jens.axboe@oracle.com
Subject: Re: [RFC] Block IO Controller V2 - some results
Date: Mon, 16 Nov 2009 15:51:00 -0500
Message-ID: <1258404660.3533.150.camel@cail>

Hi Vivek: 

I'm finding some things that don't quite seem right - executive
summary: 

o  I think the apportionment algorithm doesn't work consistently well
for writes.

o  I think there are problems with significant performance loss when
doing random I/Os.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Test configuration: HP dl585 (32-way quad-core AMD Opteron processors +
128GB RAM + 4 FC HBAs + 4 MSA 1000's (each exporting 3 multi-disk
striped LUNs)). Running with Jens Axboe's remotes/origin/for-2.6.33
branch (at commit 8721c81f6480e2c9acbf92078383953f825d1057) w/out and
w/ your V2 patch.

The test: 12 Ext3 file systems (1 per disk), each file system has eight
8GB files on it. Doing simple fio runs in various modes and I/O
directions: random or sequential, read or write or read/write (80%/20%).
Using 2, 4 or 8 processes per file system (each process working on a
different file). Here is a sample fio command file:

[global]
ioengine=sync
size=8g
overwrite=0
runtime=120
bs=256k
readwrite=write
[/mnt/sdl/data.7]
filename=/mnt/sdl/data.7

I'm then using cgroups that have IO weights as follows:

/cgroup/test0/blkio.weight 100
/cgroup/test1/blkio.weight 200
/cgroup/test2/blkio.weight 300
/cgroup/test3/blkio.weight 400
/cgroup/test4/blkio.weight 500
/cgroup/test5/blkio.weight 600
/cgroup/test6/blkio.weight 700
/cgroup/test7/blkio.weight 800

There were 12 x N total processes running in the system for each test,
and each file system would have N processes, each working on a different
file in that file system. The N processes would be assigned to increasing test
groups: process 0 will be in test0's group and working on file 0 in a
file system; process 1 will be in test1's group and working on file 1 in
a file system; and so on.
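
As a rough illustration of the placement described above, the following sketch maps each of the N processes on a file system to its group and file, and computes the share of a disk's bandwidth each group would get if the weights were honored exactly. The "weight divided by sum of weights" model is an assumption about what ideal apportionment means here, and placement() and expected_shares() are hypothetical helpers:

```python
# Weights from the blkio.weight settings listed above (test0=100 ... test7=800).
WEIGHTS = {f"test{i}": 100 * (i + 1) for i in range(8)}


def placement(n):
    """Return (process index, cgroup name, file index) for one file system."""
    return [(i, f"test{i}", i) for i in range(n)]


def expected_shares(n):
    """Ideal bandwidth fraction per group, assuming share = weight / sum(weights)."""
    groups = [group for _, group, _ in placement(n)]
    total = sum(WEIGHTS[g] for g in groups)
    return {g: WEIGHTS[g] / total for g in groups}


if __name__ == "__main__":
    for n in (2, 4, 8):
        shares = expected_shares(n)
        print(n, {g: round(s, 3) for g, s in shares.items()})
```

Under this model the N=2 case should split 1/3 to 2/3 between test0 and test1, which is the 2-to-1 ratio referred to later in this mail.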

Before each test I drop caches & umount/mount the filesystem anew.

In the following tables:

'base' - means a kernel generated from Jens' branch (-no- patching)

'ioc off' - means a kernel generated w/ your patches added but -no-
other settings (no CGROUP stuff mounted or enabled)

'ioc no idle' - means the ioc kernel w/ CGROUP stuff enabled
-but- /sys/block/sd*/queue/iosched/cgroup_idle = 0

'ioc idle' - means the ioc kernel w/ CGROUP stuff enabled
-and- /sys/block/sd*/queue/iosched/cgroup_idle = 1

Modes: random or sequential

RdWr: rd==read, wr==write, rdwr==80%read & 20%write

N: Number of processes per disk

testX: Processes sharing a task group (when enabled)

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The first thing to do is to check for correctness: when the I/O
controller is enabled do we see correctly apportioned I/O?

At the tail end of the e-mail I've placed three (3) tables showing the
state where -no- differences should be seen between the various "task"
groups in terms of performance ("level playing field"), and sure enough
no differences were seen. These were done basically as a "control" set
of tests - the script being used didn't have any inherent biases in
it.[1]

This table shows the cases where we should see a difference based upon
weights:

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
       Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
   ioc idle  rnd   rd 2   2.8   6.3 
   ioc idle  rnd   rd 4   0.7   1.5   2.5   3.5 
   ioc idle  rnd   rd 8   0.2   0.4   0.5   0.8   0.9   1.2   1.4   1.7 

   ioc idle  rnd   wr 2  38.2 192.7 
   ioc idle  rnd   wr 4   1.0  17.7  38.1 204.5 
   ioc idle  rnd   wr 8   0.3   0.6   0.9   1.5   2.2  16.3  16.6 208.3 

   ioc idle  rnd rdwr 2   4.9  11.3 
   ioc idle  rnd rdwr 4   0.9   2.4   4.3   6.2 
   ioc idle  rnd rdwr 8   0.2   0.5   0.8   1.1   1.4   1.8   2.2   2.7 

   ioc idle  seq   rd 2 221.0 386.4 
   ioc idle  seq   rd 4  69.8 128.1 183.2 226.8 
   ioc idle  seq   rd 8  21.4  40.0  55.6  70.8  85.2  98.3 111.6 121.9 

   ioc idle  seq   wr 2 398.6 391.6 
   ioc idle  seq   wr 4 219.0 214.5 214.1 214.5 
   ioc idle  seq   wr 8 107.6 106.8 104.7 102.5  99.5  99.5 100.5 100.8 

   ioc idle  seq rdwr 2 196.8 340.9 
   ioc idle  seq rdwr 4  64.0 109.6 148.7 183.5 
   ioc idle  seq rdwr 8  22.6  36.6  48.8  61.1  70.3  78.5  84.9  94.3 

In general, we do see weights associated in correctly increasing order,
but I don't think the proportions are done correctly in all cases.

In the random tests, for example, the read distribution looks pretty
decent, but random writes are all off - for some reason the
highest-priority (most heavily weighted) group is getting a
disproportionately large percentage of the I/O bandwidth.

For the sequential loads, the reads look "OK" - not quite proportionally
fair when we have 8 processes running against the devices, but on the
whole things look ok. Sequential writes are not working well at all:
the distribution is relatively flat.

I _think_ this is pointing to some real problems in the write cases
for both random & sequential I/Os.
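
One way to quantify "proportions done correctly" is to compare each group's observed fraction of the total against the fraction its weight would imply. The sketch below does that for four rows copied from the 'ioc idle' table above. It assumes the ideal share is weight / sum(weights), and max_share_error() is a hypothetical metric chosen for illustration:

```python
# Per-group throughput copied from the 'ioc idle' table (test0, test1, ...).
IOC_IDLE = {
    ("rnd", "rd", 4): [0.7, 1.5, 2.5, 3.5],
    ("rnd", "wr", 2): [38.2, 192.7],
    ("seq", "rd", 4): [69.8, 128.1, 183.2, 226.8],
    ("seq", "wr", 2): [398.6, 391.6],
}


def observed_shares(values):
    """Fraction of the row's total throughput that each group received."""
    total = sum(values)
    return [v / total for v in values]


def expected_shares(n):
    """Ideal fractions for weights 100, 200, ..., n*100 (assumed model)."""
    total = n * (n + 1) / 2
    return [(i + 1) / total for i in range(n)]


def max_share_error(values):
    """Largest absolute gap between observed and expected share in a row."""
    obs = observed_shares(values)
    exp = expected_shares(len(values))
    return max(abs(o - e) for o, e in zip(obs, exp))


if __name__ == "__main__":
    for key, values in IOC_IDLE.items():
        print(key, round(max_share_error(values), 3))
```

With these rows the two read cases stay within a few percentage points of the ideal split, while both N=2 write cases miss it by well over ten points, in opposite directions.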

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The next thing to look at is to see what the "penalty" is for the
additional code: see how much bandwidth we lose for the capability
added. Here we see the sum of the system's throughput for the various
tests:

---- ---- - ----------- ----------- ----------- ----------- 
Mode RdWr N    base       ioc off   ioc no idle  ioc idle   
---- ---- - ----------- ----------- ----------- ----------- 
 rnd   rd 2        17.3        17.1         9.4         9.1 
 rnd   rd 4        27.1        27.1         8.1         8.2 
 rnd   rd 8        37.1        37.1         6.8         7.1 

 rnd   wr 2       296.5       243.7       290.2       230.9 
 rnd   wr 4       287.3       280.7       270.4       261.3 
 rnd   wr 8       272.5       273.1       237.7       246.5 

 rnd rdwr 2        27.4        27.7        16.1        16.2 
 rnd rdwr 4        38.3        39.3        13.5        13.9 
 rnd rdwr 8        62.0        61.5        10.0        10.7 

 seq   rd 2       610.2       608.1       610.7       607.4 
 seq   rd 4       608.4       601.5       609.3       608.0 
 seq   rd 8       605.7       603.7       605.0       604.8 

 seq   wr 2       840.3       850.2       836.8       790.2 
 seq   wr 4       886.8       891.6       868.2       862.2 
 seq   wr 8       865.1       887.1       832.1       822.0 

 seq rdwr 2       536.2       550.0       538.1       537.7 
 seq rdwr 4       595.3       605.7       512.9       505.8 
 seq rdwr 8       617.3       628.5       526.6       497.1

The sequential runs look good on the whole - not much variance, apart
from seq rdwr at N=4 and N=8, where the two ioc-enabled columns come in
roughly 14-20% below base.

The random results look horrible, especially when reads are involved:
The first two columns (base & ioc off) are very similar, however note
the significant drop in overall system performance once the
io-controller CGROUP stuff gets involved - the more processes involved
the more performance is lost. 
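
To put numbers on the loss, the sketch below computes the percentage change of each ioc column relative to 'base' for a few rows copied from the table above. pct_change() and changes() are hypothetical helpers, and the values are used as given, without assuming units:

```python
COLUMNS = ("base", "ioc off", "ioc no idle", "ioc idle")

# Summed system throughput copied from the table above.
TOTALS = {
    ("rnd", "rd", 2): (17.3, 17.1, 9.4, 9.1),
    ("rnd", "rd", 4): (27.1, 27.1, 8.1, 8.2),
    ("rnd", "rd", 8): (37.1, 37.1, 6.8, 7.1),
    ("seq", "rd", 2): (610.2, 608.1, 610.7, 607.4),
    ("seq", "rdwr", 8): (617.3, 628.5, 526.6, 497.1),
}


def pct_change(base, value):
    """Percentage change of value relative to base (negative means a loss)."""
    return (value - base) / base * 100.0


def changes(row):
    """Map each non-base column name to its percentage change vs. base."""
    base = row[0]
    return {name: pct_change(base, v) for name, v in zip(COLUMNS[1:], row[1:])}


if __name__ == "__main__":
    for key, row in TOTALS.items():
        print(key, {name: round(c, 1) for name, c in changes(row).items()})
```

For random reads the 'ioc idle' loss relative to base grows from about 47% at N=2 to about 81% at N=8, which matches the trend described above.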

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

I'm going to spend some time drilling down into three specific tests:

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
       Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
   ioc idle  rnd   wr 2  38.2 192.7 
   ioc idle  seq   wr 2 398.6 391.6 

These two tests I can use to see why random writes are so
disproportionately apportioned - it should be 2-to-1 but we are seeing
something like 5-to-1 (192.7 vs. 38.2). And then I can look at why
sequential writes are flat.

and:

---- ---- - ----------- ----------- ----------- ----------- 
Mode RdWr N    base       ioc off   ioc no idle  ioc idle   
---- ---- - ----------- ----------- ----------- ----------- 
 rnd   rd 2        17.3        17.1         9.4         9.1 

I will try to find out why we are seeing such a loss in system
performance...

Regards,
Alan D. Brunelle
Hewlett-Packard / Linux Kernel Technology Team

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[1] Three tables showing the I/O load distribution when there was no
I/O controller code, when it was turned off, or when cgroup_idle was
turned off. All looks sane - with the possible exception of the
ioc-enabled kernel with no-idle set: for random writes there appear to
be some differences, but not an appreciable amount?

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
       Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
       base  rnd   rd 2   8.6   8.6 
       base  rnd   rd 4   6.8   6.8   6.8   6.7 
       base  rnd   rd 8   4.7   4.6   4.6   4.6   4.6   4.6   4.6   4.6 

       base  rnd   wr 2 150.4 146.1 
       base  rnd   wr 4  75.2  74.8  68.1  69.2 
       base  rnd   wr 8  36.2  39.3  29.6  35.9  32.9  37.0  29.6  32.2 

       base  rnd rdwr 2  13.7  13.7 
       base  rnd rdwr 4   9.6   9.6   9.6   9.6 
       base  rnd rdwr 8   7.8   7.8   7.7   7.8   7.8   7.7   7.7   7.8 

       base  seq   rd 2 306.2 304.0 
       base  seq   rd 4 150.1 152.4 151.9 154.0 
       base  seq   rd 8  77.2  75.9  75.9  73.9  77.0  75.7  75.0  74.9 

       base  seq   wr 2 420.2 420.1 
       base  seq   wr 4 220.5 222.5 221.9 221.9 
       base  seq   wr 8 108.2 108.8 107.8 107.7 108.7 108.5 108.1 107.2 

       base  seq rdwr 2 268.4 267.8 
       base  seq rdwr 4 148.9 150.6 147.8 148.0 
       base  seq rdwr 8  78.0  77.7  76.3  76.0  79.1  77.9  74.3  77.9 

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
       Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
    ioc off  rnd   rd 2   8.6   8.6 
    ioc off  rnd   rd 4   6.8   6.8   6.7   6.7 
    ioc off  rnd   rd 8   4.7   4.6   4.6   4.7   4.6   4.6   4.6   4.6 

    ioc off  rnd   wr 2 112.6 131.1 
    ioc off  rnd   wr 4  64.9  67.8  79.9  68.1 
    ioc off  rnd   wr 8  35.1  39.5  31.5  32.0  36.1  34.5  30.8  33.5 

    ioc off  rnd rdwr 2  13.8  13.8 
    ioc off  rnd rdwr 4   9.8   9.8   9.9   9.8 
    ioc off  rnd rdwr 8   7.7   7.7   7.7   7.7   7.7   7.7   7.7   7.7 

    ioc off  seq   rd 2 303.1 305.0 
    ioc off  seq   rd 4 150.8 151.6 149.0 150.2 
    ioc off  seq   rd 8  77.0  76.3  74.5  74.0  77.9  75.5  74.0  74.6 

    ioc off  seq   wr 2 424.6 425.5 
    ioc off  seq   wr 4 223.0 222.4 223.9 222.3 
    ioc off  seq   wr 8 110.8 112.0 111.3 109.6 111.7 111.3 110.8 109.7 

    ioc off  seq rdwr 2 274.3 275.8 
    ioc off  seq rdwr 4 151.3 154.8 149.0 150.6 
    ioc off  seq rdwr 8  81.1  80.6  77.8  74.8  81.0  78.5  77.0  77.7

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
       Test Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7 
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- ----- 
ioc no idle  rnd   rd 2   4.7   4.7 
ioc no idle  rnd   rd 4   2.0   2.0   2.0   2.0 
ioc no idle  rnd   rd 8   0.9   0.9   0.8   0.8   0.8   0.8   0.9   0.9 

ioc no idle  rnd   wr 2 144.8 145.4 
ioc no idle  rnd   wr 4  73.2  65.9  65.5  65.8 
ioc no idle  rnd   wr 8  35.5  52.5  26.2  31.0  25.5  19.3  25.1  22.6 

ioc no idle  rnd rdwr 2   8.1   8.1 
ioc no idle  rnd rdwr 4   3.4   3.4   3.4   3.4 
ioc no idle  rnd rdwr 8   1.3   1.3   1.3   1.2   1.2   1.3   1.2   1.3 

ioc no idle  seq   rd 2 304.1 306.6 
ioc no idle  seq   rd 4 152.1 154.5 149.8 153.0 
ioc no idle  seq   rd 8  75.8  75.8  75.2  75.1  75.5  75.3  75.7  76.5 

ioc no idle  seq   wr 2 418.6 418.2 
ioc no idle  seq   wr 4 217.7 217.7 215.4 217.4 
ioc no idle  seq   wr 8 105.5 105.8 105.8 103.4 102.9 103.1 102.7 102.8 

ioc no idle  seq rdwr 2 269.2 269.0 
ioc no idle  seq rdwr 4 130.0 126.4 127.8 128.6 
ioc no idle  seq rdwr 8  67.2  66.6  65.4  65.0  65.3  64.8  65.7  66.5
