From: "Alan D. Brunelle" <Alan.Brunelle@hp.com>
To: linux-kernel@vger.kernel.org
Cc: vgoyal@redhat.com, jens.axboe@oracle.com
Subject: Re: [RFC] Block IO Controller V2 - some results
Date: Mon, 16 Nov 2009 15:51:00 -0500
Message-ID: <1258404660.3533.150.camel@cail>
Hi Vivek:
I'm finding some things that don't quite seem right - executive
summary:
o I think the apportionment algorithm doesn't work consistently well
for writes.
o I think there are problems with significant performance loss when
doing random I/Os.
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Test configuration: HP dl585 (32-way quad-core AMD Opteron processors +
128GB RAM + 4 FC HBAs + 4 MSA 1000's (each exporting 3 multi-disk
striped LUNs)). Running with Jens Axboe's remotes/origin/for-2.6.33
branch (at commit 8721c81f6480e2c9acbf92078383953f825d1057) w/out and
w/ your V2 patch.
The test: 12 Ext3 file systems (1 per disk), each file system has eight
8GB files on it. Doing simple fio runs in various modes and I/O
directions: random or sequential, read or write or read/write (80%/20%).
Using 2, 4 or 8 processes per file system (each process working on a
different file). Here is a sample fio command file:
[global]
ioengine=sync
size=8g
overwrite=0
runtime=120
bs=256k
readwrite=write
[/mnt/sdl/data.7]
filename=/mnt/sdl/data.7
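A job file like the sample above could be generated per process with a small helper. This is only a sketch, not the script used for these runs: the function name `fio_job`, the mapping from mode and direction to fio's `readwrite=` values, and the use of `rwmixread=80` for the 80%/20% mix are assumptions based on standard fio options.

```python
# Hypothetical helper: builds a fio job file like the sample above.
# The mapping from (mode, direction) to fio's readwrite= value is an
# assumption; the original test script is not shown in this mail.
RW_MAP = {
    ("seq", "rd"): "read",
    ("seq", "wr"): "write",
    ("seq", "rdwr"): "rw",
    ("rnd", "rd"): "randread",
    ("rnd", "wr"): "randwrite",
    ("rnd", "rdwr"): "randrw",
}


def fio_job(mount, index, mode, rdwr):
    """Return the text of a fio job file for one process on one file."""
    path = f"{mount}/data.{index}"
    lines = [
        "[global]",
        "ioengine=sync",
        "size=8g",
        "overwrite=0",
        "runtime=120",
        "bs=256k",
        f"readwrite={RW_MAP[(mode, rdwr)]}",
    ]
    if rdwr == "rdwr":
        lines.append("rwmixread=80")  # assumed: 80% reads, 20% writes
    lines += [f"[{path}]", f"filename={path}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the sample job: sequential writes to /mnt/sdl/data.7.
    print(fio_job("/mnt/sdl", 7, "seq", "wr"), end="")
```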
I'm then using cgroups that have IO weights as follows:
/cgroup/test0/blkio.weight 100
/cgroup/test1/blkio.weight 200
/cgroup/test2/blkio.weight 300
/cgroup/test3/blkio.weight 400
/cgroup/test4/blkio.weight 500
/cgroup/test5/blkio.weight 600
/cgroup/test6/blkio.weight 700
/cgroup/test7/blkio.weight 800
There were 12 x N total processes running in the system for each test,
and each file system would have N processes, each working on a different
file in that file system. The N processes would be assigned to increasing test
groups: process 0 will be in test0's group and working on file 0 in a
file system; process 1 will be in test1's group and working on file 1 in
a file system; and so on.
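The assignment scheme can be sketched as below. This is an illustration only: the function name `assignments`, the example mount points and the `data.<i>` file naming (taken from the sample fio job) are assumptions, not the actual test script.

```python
# Weights as listed above: test0..test7 -> 100..800.
WEIGHTS = {f"test{i}": 100 * (i + 1) for i in range(8)}


def assignments(filesystems, n):
    """Yield (cgroup, data file) for each of the len(filesystems) * n processes."""
    for fs in filesystems:
        for i in range(n):
            # Process i on every file system joins group test<i> and
            # works on file i of that file system. The data.<i> naming
            # follows the sample fio job; the real script may differ.
            yield f"test{i}", f"{fs}/data.{i}"


if __name__ == "__main__":
    # Two hypothetical mount points, N = 2 processes per file system.
    for group, path in assignments(["/mnt/sda", "/mnt/sdb"], 2):
        print(group, WEIGHTS[group], path)
```

With 12 file systems and N = 8 this yields 96 processes, 12 in each of the eight groups.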
Before each test I drop caches & umount/mount the filesystem anew.
In the following tables:
'base' - means a kernel generated from Jens' branch (-no- patching)
'ioc off' - means a kernel generated w/ your patches added but -no-
other settings (no CGROUP stuff mounted or enabled)
'ioc no idle' - means the ioc kernel w/ CGROUP stuff enabled
-but- /sys/block/sd*/queue/iosched/cgroup_idle = 0
'ioc idle' - means the ioc kernel w/ CGROUP stuff enabled
-and- /sys/block/sd*/queue/iosched/cgroup_idle = 1
Modes: random or sequential
RdWr: rd==read, wr==write, rdwr==80%read & 20%write
N: Number of processes per disk
testX: Processes sharing a task group (when enabled)
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
The first thing to do is to check for correctness: when the I/O
controller is enabled do we see correctly apportioned I/O?
At the tail end of the e-mail I've placed three (3) tables showing the
state where -no- differences should be seen between the various "task"
groups in terms of performance ("level playing field"), and sure enough
no differences were seen. These were done basically as a "control" set
of tests - the script being used didn't have any inherent biases in
it.[1]
This table shows the cases where we should see a difference based upon
weights:
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc idle    rnd  rd   2   2.8   6.3
ioc idle    rnd  rd   4   0.7   1.5   2.5   3.5
ioc idle    rnd  rd   8   0.2   0.4   0.5   0.8   0.9   1.2   1.4   1.7
ioc idle    rnd  wr   2  38.2 192.7
ioc idle    rnd  wr   4   1.0  17.7  38.1 204.5
ioc idle    rnd  wr   8   0.3   0.6   0.9   1.5   2.2  16.3  16.6 208.3
ioc idle    rnd  rdwr 2   4.9  11.3
ioc idle    rnd  rdwr 4   0.9   2.4   4.3   6.2
ioc idle    rnd  rdwr 8   0.2   0.5   0.8   1.1   1.4   1.8   2.2   2.7
ioc idle    seq  rd   2 221.0 386.4
ioc idle    seq  rd   4  69.8 128.1 183.2 226.8
ioc idle    seq  rd   8  21.4  40.0  55.6  70.8  85.2  98.3 111.6 121.9
ioc idle    seq  wr   2 398.6 391.6
ioc idle    seq  wr   4 219.0 214.5 214.1 214.5
ioc idle    seq  wr   8 107.6 106.8 104.7 102.5  99.5  99.5 100.5 100.8
ioc idle    seq  rdwr 2 196.8 340.9
ioc idle    seq  rdwr 4  64.0 109.6 148.7 183.5
ioc idle    seq  rdwr 8  22.6  36.6  48.8  61.1  70.3  78.5  84.9  94.3
In general, we do see weights associated in correctly increasing order,
but I don't think the proportions are done correctly in all cases.
In the random tests, for example, the read distribution looks pretty
decent, but random writes are all off - for some reason the highest
priority (most heavily weighted) group is getting a disproportionately
large percentage of the I/O bandwidth.
For the sequential loads, the reads look "OK" - not quite fair when we
have 8 processes running against the devices, but on the whole things
look ok. Sequential writes are not working well at all: a relatively
flat distribution.
I _think_ this points to some real problems in the write cases for both
random & sequential I/Os.
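To put numbers on this, the share each group's weight implies can be set against the share it actually got. The sketch below assumes the ideal is throughput strictly proportional to blkio.weight (which may not be exactly what the controller aims for); the helper names are made up for illustration and the rows are copied from the N=4 lines of the table above.

```python
def shares(values):
    """Return each value as a percentage of the total."""
    total = sum(values)
    return [100.0 * v / total for v in values]


def compare(weights, observed):
    """Pair the weight-implied share with the measured share per group."""
    return list(zip(shares(weights), shares(observed)))


if __name__ == "__main__":
    weights = [100, 200, 300, 400]  # test0..test3
    rows = {
        "rnd rd 4": [0.7, 1.5, 2.5, 3.5],
        "rnd wr 4": [1.0, 17.7, 38.1, 204.5],
        "seq wr 4": [219.0, 214.5, 214.1, 214.5],
    }
    for name, observed in rows.items():
        print(name)
        for i, (want, got) in enumerate(compare(weights, observed)):
            print(f"  test{i}: expected {want:5.1f}%  observed {got:5.1f}%")
```

For random writes with N=4, test3 ends up with roughly 78% of the bandwidth against a weight-implied 40%, while for sequential writes every group sits near 25%.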
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
The next thing to look at is to see what the "penalty" is for the
additional code: see how much bandwidth we lose for the capability
added. Here we see the sum of the system's throughput for the various
tests:
---- ---- - ----------- ----------- ----------- -----------
Mode RdWr N        base     ioc off ioc no idle    ioc idle
---- ---- - ----------- ----------- ----------- -----------
rnd  rd   2        17.3        17.1         9.4         9.1
rnd  rd   4        27.1        27.1         8.1         8.2
rnd  rd   8        37.1        37.1         6.8         7.1
rnd  wr   2       296.5       243.7       290.2       230.9
rnd  wr   4       287.3       280.7       270.4       261.3
rnd  wr   8       272.5       273.1       237.7       246.5
rnd  rdwr 2        27.4        27.7        16.1        16.2
rnd  rdwr 4        38.3        39.3        13.5        13.9
rnd  rdwr 8        62.0        61.5        10.0        10.7
seq  rd   2       610.2       608.1       610.7       607.4
seq  rd   4       608.4       601.5       609.3       608.0
seq  rd   8       605.7       603.7       605.0       604.8
seq  wr   2       840.3       850.2       836.8       790.2
seq  wr   4       886.8       891.6       868.2       862.2
seq  wr   8       865.1       887.1       832.1       822.0
seq  rdwr 2       536.2       550.0       538.1       537.7
seq  rdwr 4       595.3       605.7       512.9       505.8
seq  rdwr 8       617.3       628.5       526.6       497.1
The sequential runs look very good - not much variance across the board.
The random results look horrible, especially when reads are involved:
the first two columns (base & ioc off) are very similar, but note the
significant drop in overall system performance once the io-controller
CGROUP stuff gets involved - the more processes involved, the more
performance is lost.
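The size of the drop can be expressed as a percentage of the base kernel's throughput. A minimal sketch, with a made-up helper name and a few (base, ioc idle) pairs copied from the table above:

```python
def loss_pct(base, value):
    """Percentage of baseline throughput lost (negative means a gain)."""
    return 100.0 * (base - value) / base


if __name__ == "__main__":
    # (mode, rdwr, N): (base, ioc idle), copied from the table above.
    rows = {
        ("rnd", "rd", 2): (17.3, 9.1),
        ("rnd", "rd", 4): (27.1, 8.2),
        ("rnd", "rd", 8): (37.1, 7.1),
        ("seq", "rd", 8): (605.7, 604.8),
    }
    for (mode, rdwr, n), (base, idle) in rows.items():
        print(f"{mode} {rdwr} N={n}: {loss_pct(base, idle):.1f}% lost")
```

By this measure random reads lose roughly 47%, 70% and 81% at N=2, 4 and 8, while sequential reads at N=8 are essentially unchanged.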
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
I'm going to spend some time drilling down into three specific tests:
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc idle    rnd  wr   2  38.2 192.7
ioc idle    seq  wr   2 398.6 391.6
These two tests I can use to see why random writes are so
disproportionately apportioned - it should be 2-to-1 but we are seeing
something like 5-to-1 (192.7 vs 38.2) - and why sequential writes are flat.
and:
---- ---- - ----------- ----------- ----------- -----------
Mode RdWr N        base     ioc off ioc no idle    ioc idle
---- ---- - ----------- ----------- ----------- -----------
rnd  rd   2        17.3        17.1         9.4         9.1
I will try to find out why we are seeing such a loss in system
performance...
Regards,
Alan D. Brunelle
Hewlett-Packard / Linux Kernel Technology Team
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
[1] Three tables showing the I/O load distribution when there was no
I/O controller code, when it was turned off, or when cgroup_idle was
turned off. All looks sane - with the exception of the ioc-enabled
kernel with no-idle set - for random writes there appear to be some
differences, but not an appreciable amount?
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
base        rnd  rd   2   8.6   8.6
base        rnd  rd   4   6.8   6.8   6.8   6.7
base        rnd  rd   8   4.7   4.6   4.6   4.6   4.6   4.6   4.6   4.6
base        rnd  wr   2 150.4 146.1
base        rnd  wr   4  75.2  74.8  68.1  69.2
base        rnd  wr   8  36.2  39.3  29.6  35.9  32.9  37.0  29.6  32.2
base        rnd  rdwr 2  13.7  13.7
base        rnd  rdwr 4   9.6   9.6   9.6   9.6
base        rnd  rdwr 8   7.8   7.8   7.7   7.8   7.8   7.7   7.7   7.8
base        seq  rd   2 306.2 304.0
base        seq  rd   4 150.1 152.4 151.9 154.0
base        seq  rd   8  77.2  75.9  75.9  73.9  77.0  75.7  75.0  74.9
base        seq  wr   2 420.2 420.1
base        seq  wr   4 220.5 222.5 221.9 221.9
base        seq  wr   8 108.2 108.8 107.8 107.7 108.7 108.5 108.1 107.2
base        seq  rdwr 2 268.4 267.8
base        seq  rdwr 4 148.9 150.6 147.8 148.0
base        seq  rdwr 8  78.0  77.7  76.3  76.0  79.1  77.9  74.3  77.9
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc off     rnd  rd   2   8.6   8.6
ioc off     rnd  rd   4   6.8   6.8   6.7   6.7
ioc off     rnd  rd   8   4.7   4.6   4.6   4.7   4.6   4.6   4.6   4.6
ioc off     rnd  wr   2 112.6 131.1
ioc off     rnd  wr   4  64.9  67.8  79.9  68.1
ioc off     rnd  wr   8  35.1  39.5  31.5  32.0  36.1  34.5  30.8  33.5
ioc off     rnd  rdwr 2  13.8  13.8
ioc off     rnd  rdwr 4   9.8   9.8   9.9   9.8
ioc off     rnd  rdwr 8   7.7   7.7   7.7   7.7   7.7   7.7   7.7   7.7
ioc off     seq  rd   2 303.1 305.0
ioc off     seq  rd   4 150.8 151.6 149.0 150.2
ioc off     seq  rd   8  77.0  76.3  74.5  74.0  77.9  75.5  74.0  74.6
ioc off     seq  wr   2 424.6 425.5
ioc off     seq  wr   4 223.0 222.4 223.9 222.3
ioc off     seq  wr   8 110.8 112.0 111.3 109.6 111.7 111.3 110.8 109.7
ioc off     seq  rdwr 2 274.3 275.8
ioc off     seq  rdwr 4 151.3 154.8 149.0 150.6
ioc off     seq  rdwr 8  81.1  80.6  77.8  74.8  81.0  78.5  77.0  77.7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc no idle rnd  rd   2   4.7   4.7
ioc no idle rnd  rd   4   2.0   2.0   2.0   2.0
ioc no idle rnd  rd   8   0.9   0.9   0.8   0.8   0.8   0.8   0.9   0.9
ioc no idle rnd  wr   2 144.8 145.4
ioc no idle rnd  wr   4  73.2  65.9  65.5  65.8
ioc no idle rnd  wr   8  35.5  52.5  26.2  31.0  25.5  19.3  25.1  22.6
ioc no idle rnd  rdwr 2   8.1   8.1
ioc no idle rnd  rdwr 4   3.4   3.4   3.4   3.4
ioc no idle rnd  rdwr 8   1.3   1.3   1.3   1.2   1.2   1.3   1.2   1.3
ioc no idle seq  rd   2 304.1 306.6
ioc no idle seq  rd   4 152.1 154.5 149.8 153.0
ioc no idle seq  rd   8  75.8  75.8  75.2  75.1  75.5  75.3  75.7  76.5
ioc no idle seq  wr   2 418.6 418.2
ioc no idle seq  wr   4 217.7 217.7 215.4 217.4
ioc no idle seq  wr   8 105.5 105.8 105.8 103.4 102.9 103.1 102.7 102.8
ioc no idle seq  rdwr 2 269.2 269.0
ioc no idle seq  rdwr 4 130.0 126.4 127.8 128.6
ioc no idle seq  rdwr 8  67.2  66.6  65.4  65.0  65.3  64.8  65.7  66.5
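One rough way to judge how "level" these control rows are is the ratio of the best-served to the worst-served group. The sketch below uses a made-up helper, `spread`, on the three N=8 random-write rows copied from the tables above; max/min is a crude measure and says nothing about run-to-run noise.

```python
def spread(values):
    """Return max/min, a crude measure of how uneven a row is."""
    return max(values) / min(values)


if __name__ == "__main__":
    rows = {
        "base rnd wr 8": [36.2, 39.3, 29.6, 35.9, 32.9, 37.0, 29.6, 32.2],
        "ioc off rnd wr 8": [35.1, 39.5, 31.5, 32.0, 36.1, 34.5, 30.8, 33.5],
        "ioc no idle rnd wr 8": [35.5, 52.5, 26.2, 31.0, 25.5, 19.3, 25.1, 22.6],
    }
    for name, values in rows.items():
        print(f"{name}: max/min = {spread(values):.2f}")
```

A perfectly flat row gives 1.0; the "ioc no idle" random-write row at N=8 comes out around 2.7, against roughly 1.3 for the other two.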