From: "Satoshi UCHIDA" <s-uchida@ap.jp.nec.com>
To: linux-kernel@vger.kernel.org,
containers@lists.linux-foundation.org,
virtualization@lists.linux-foundation.org, jens.axboe@oracle.com,
'Ryo Tsuruta' <ryov@valin>
Cc: 'Hirokazu Takahashi' <taka@valinux.co.jp>,
'Andrew Morton' <akpm@linux-foundation.org>,
SUGAWARA Tomoyoshi <tom-sugawara@ap.jp.nec.com>,
menage@google.com, balbir@linux.vnet.ibm.com
Subject: [PATCH][RFC][12+2][v3] A expanded CFQ scheduler for cgroups
Date: Wed, 12 Nov 2008 17:15:06 +0900
Message-ID: <000c01c9449e$c5bcdc20$51369460$@jp.nec.com>
This patchset expands the traditional CFQ scheduler to support cgroups,
and improves on the previous version.
The improvements are as follows.
* Modularizing our new CFQ scheduler.
The expanded CFQ scheduler is registered/unregistered as a new I/O
elevator scheduler called "cfq-cgroups". As a result, the traditional CFQ
scheduler, which does not handle cgroups, and our new CFQ scheduler, which
handles cgroups, can be used at the same time for different devices.
* Allowing parameters to be set per device.
The expanded CFQ scheduler allows users to set parameters per device.
As a result, users can decide the share (priority) per device.
--- Optional functions ---
* Adding a validation flag for 'think time'. (Opt-1 patch)
CFQ shows poor scalability. One of the causes is the think time.
The think time is used to improve I/O performance by handling queues
with poor I/O as IDLE class. However, when many tasks have I/O requests,
the think time of those tasks becomes long and all queues are then handled
as IDLE class. As a result, the dispatching of I/O requests is dispersed and
I/O performance falls. The think time valid flag controls the think time
judgment.
* Adding ioprio class for cgroups. (Opt-2 patch)
The previous expanded CFQ scheduler could not handle ioprio class.
This optional patch implements a prototype of it. The patch provides basic
service tree control for the ioprio class of cgroups; it does not yet provide
the preempt function, the completed function and so on.
1. Introduction.
This patchset introduces "Yet Another" I/O bandwidth controlling
subsystem for cgroups based on CFQ (called 2 layer CFQ).
The idea of 2 layer CFQ is to build per-group fairness control on top of
the existing CFQ control.
We added a new data structure called CFQ driver data on top of
cfqd in order to control I/O bandwidth for cgroups.
CFQ driver data controls the cfq_data instances with a service tree (rb-tree)
and the CFQ algorithm for synchronous I/O.
The active cfqd controls its cfq queues with a service tree.
In other words, the CFQ driver data controls the traditional CFQ data,
and the CFQ data runs conventionally.
         cfqdd      cfqdd      (cfqdd = cfq driver data)
           |          |
  cfqc -- cfqd ----- cfqd      (cfqd = cfq data,
           |          |         cfqc = cfq cgroup data)
  cfqc --[cfqd]----- cfqd
           ^
           |
 conventional control.
This patchset is against 2.6.28-rc2
2. Build
i. Apply this patchset (series 01 - 12) to kernel 2.6.28-rc2.
If you want to use the optional functions, also apply the opt-1/opt-2
patches to kernel 2.6.28-rc2.
ii. Build the kernel with the IOSCHED_CFQ_CGROUP=y option.
iii. Reboot into the new kernel.
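Before rebooting, it may help to confirm that the option really ended up in
the build configuration. The following is only a minimal sketch. It assumes
that the option appears in the generated .config with the usual Kconfig
prefix, i.e. as CONFIG_IOSCHED_CFQ_CGROUP (this mail only names
IOSCHED_CFQ_CGROUP), and the function name has_cfq_cgroups is hypothetical.

```shell
#!/bin/sh
# Return 0 if the given kernel config file enables the cfq-cgroups scheduler.
# Assumption: the symbol is written as CONFIG_IOSCHED_CFQ_CGROUP=y in .config.
# The function name is hypothetical and not part of the patchset.
has_cfq_cgroups() {
    config_file=${1:-.config}
    grep -q '^CONFIG_IOSCHED_CFQ_CGROUP=y$' "$config_file"
}
```

From the top of the kernel tree, "has_cfq_cgroups .config && echo enabled"
would then print "enabled" only when the option is set to y.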
3. Usage of 2 layer CFQ
* Preparation for using 2 layer CFQ
i. Mount the cfq_cgroup special device on a device directory.
ex.
mkdir /dev/cgroup
mount -t cgroup -o cfq cfq /dev/cgroup
ii. Change the elevator scheduler for the device to "cfq-cgroups".
ex.
echo cfq-cgroups > /sys/block/sda/queue/scheduler
* Usage of grouping control.
- Create a new group.
Make a new directory under /dev/cgroup.
For example, the following command creates a 'test1' group.
mkdir /dev/cgroup/test1
- Insert a task into a group.
Write the process id (pid) to the "tasks" entry of the corresponding group.
For example, the following command puts the task with pid 1100 into the
test1 group.
echo 1100 > /dev/cgroup/test1/tasks
New child tasks of this task are also inserted into the test1 group.
- Change I/O priorities of a group.
Write the priority to the "cfq.ioprio" entry of the corresponding group.
For example, the following command sets the priority of the 'test1'
group to 2.
echo 2 > /dev/cgroup/test1/cfq.ioprio
The I/O priority for cgroups takes a value from 0 to 7, the same as
the existing per-task CFQ priority.
If you want to change the I/O priority only for a specific device and group,
add the device name as a second parameter.
For example, the following command sets the priority of the 'test1'
group to 2 for the 'sda' device.
echo 2 sda > /dev/cgroup/test1/cfq.ioprio
You can also change the I/O priority of a specific device and group via
sysfs. In that case, add the cgroup path as a second parameter.
For example, the following command sets the priority of the 'test1'
group to 2 for the 'sda' device via sysfs.
echo 2 /test1 > /sys/block/sda/queue/iosched/ioprio
The parameters of cfq_data (slice_sync, back_seek_penalty and so on) can
also be changed for a specific device and group via sysfs.
If you write only one parameter via sysfs, the setting applies to all
groups.
When you set the elevator scheduler of a device to cfq-cgroups, the I/O
priority of each group for that new device is set to the group's default
priority. If you want to change this default priority, write the priority
and "default" as a second parameter to the "cfq.ioprio" entry of the
corresponding group.
For example,
echo 2 default > /dev/cgroup/test1/cfq.ioprio
- Change the I/O priority of a task.
Use the existing "ionice" command.
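The cfq.ioprio writes described above can be wrapped in a small helper that
rejects values outside the documented 0 to 7 range before touching cgroupfs.
This is only a sketch under stated assumptions: the function name
set_group_ioprio and the CGROUP_ROOT variable are hypothetical and are not
part of the patchset. Only the cfq.ioprio entry, its optional second
parameter (a device name or "default") and the /dev/cgroup mount point come
from this mail.

```shell
#!/bin/sh
# Hypothetical helper around the cfq.ioprio entry described in this mail.
# CGROUP_ROOT defaults to the mount point used in the examples.
CGROUP_ROOT=${CGROUP_ROOT:-/dev/cgroup}

# set_group_ioprio GROUP PRIO [TARGET]
#   TARGET is optional: a device name such as "sda", or the word "default".
set_group_ioprio() {
    group=$1
    prio=$2
    target=${3:-}

    # Valid priorities are the single digits 0 to 7.
    case $prio in
        [0-7]) ;;
        *) echo "ioprio must be between 0 and 7: $prio" >&2; return 1 ;;
    esac

    # Refuse to write into a group directory that does not exist.
    if [ ! -d "$CGROUP_ROOT/$group" ]; then
        echo "no such group: $group" >&2
        return 1
    fi

    entry="$CGROUP_ROOT/$group/cfq.ioprio"
    if [ -n "$target" ]; then
        echo "$prio $target" > "$entry"
    else
        echo "$prio" > "$entry"
    fi
}
```

With this, "set_group_ioprio test1 2 sda" performs the same write as
"echo 2 sda > /dev/cgroup/test1/cfq.ioprio", and "set_group_ioprio test1 8"
fails without writing anything.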
4. Usage of Optional Functions.
i. Usage of a validation flag for 'think time'
This parameter can be used via sysfs in the same way as the other cfq data
parameters.
Its entry name is 'ttime_valid'.
This flag decides whether the think time is checked.
With the value 0, queues are always handled as idle class.
In practice, the idle_window flag is cleared.
With the value 1, queues are handled in the same way as in traditional CFQ.
The value 2 makes the think time invalid.
ii. Usage of ioprio class for cgroups.
The ioprio class is used via cgroupfs in the same way as ioprio.
Its entry name is 'cfq.ioprio_class'.
The values of the ioprio class are the same as the I/O classes of
traditional CFQ.
0: IOPRIO_CLASS_NONE (is equal to IOPRIO_CLASS_BE)
1: IOPRIO_CLASS_RT
2: IOPRIO_CLASS_BE
3: IOPRIO_CLASS_IDLE
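Since the class is written as a bare number, a name-to-number mapping can
make scripts easier to read. The following is a sketch only: both function
names, the short class names (none, rt, be, idle) and the CGROUP_ROOT
variable are hypothetical. The numeric values and the cfq.ioprio_class entry
are the ones listed above.

```shell
#!/bin/sh
# Map a short class name to the numeric value listed in this mail.
# The short names are an assumption made for this example.
ioprio_class_number() {
    case $1 in
        none) echo 0 ;;   # IOPRIO_CLASS_NONE (is equal to IOPRIO_CLASS_BE)
        rt)   echo 1 ;;   # IOPRIO_CLASS_RT
        be)   echo 2 ;;   # IOPRIO_CLASS_BE
        idle) echo 3 ;;   # IOPRIO_CLASS_IDLE
        *)    echo "unknown ioprio class: $1" >&2; return 1 ;;
    esac
}

# set_group_ioprio_class GROUP CLASS_NAME
# Writes the numeric class to the group's cfq.ioprio_class entry.
# CGROUP_ROOT is hypothetical and defaults to /dev/cgroup.
set_group_ioprio_class() {
    num=$(ioprio_class_number "$2") || return 1
    echo "$num" > "${CGROUP_ROOT:-/dev/cgroup}/$1/cfq.ioprio_class"
}
```

For example, "set_group_ioprio_class test1 idle" would write 3 to
/dev/cgroup/test1/cfq.ioprio_class.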
5. Future work.
We must implement the following.
* Handle buffered I/O.