The Linux Kernel Mailing List
From: "Satoshi UCHIDA" <s-uchida@ap.jp.nec.com>
To: <linux-kernel@vger.kernel.org>,
	<containers@lists.linux-foundation.org>,
	<virtualization@lists.linux-foundation.org>,
	<jens.axboe@oracle.com>, "'Ryo Tsuruta'" <ryov@valinux.co.jp>,
	"'Andrea Righi'" <righi.andrea@gmail.com>, <ngupta@google.com>,
	<fernando@oss.ntt.co.jp>, <vtaras@openvz.org>
Cc: "'Hirokazu Takahashi'" <taka@valinux.co.jp>,
	<balbir@linux.vnet.ibm.com>,
	"'Andrew Morton'" <akpm@linux-foundation.org>,
	<menage@google.com>,
	"SUGAWARA Tomoyoshi" <tom-sugawara@ap.jp.nec.com>
Subject: [PATCH][RFC][12+2][v3] A expanded CFQ scheduler for cgroups
Date: Wed, 12 Nov 2008 17:15:06 +0900
Message-ID: <000c01c9449e$c5bcdc20$51369460$@jp.nec.com>


This patchset extends the traditional CFQ scheduler to support cgroups,
and improves on the previous version of the patchset.

The improvements are as follows.

 * Modularizing our new CFQ scheduler.
      The expanded CFQ scheduler is registered/unregistered as a new I/O
    elevator scheduler called "cfq-cgroups".  As a result, the traditional
    CFQ scheduler, which does not handle cgroups, and our new CFQ scheduler,
    which handles cgroups, can be used at the same time for different devices.

 * Allowing parameters to be set per device.
      The expanded CFQ scheduler allows users to set parameters per device.
    As a result, users can decide the share (priority) per device.

--- Optional functions ---

 * Adding a validation flag for 'think time'. (Opt-1 patch)
      CFQ shows poor scalability.  One of the causes is the think time.
    The think time is used to improve I/O performance by handling queues
    with poor I/O as IDLE class.  However, when many tasks have I/O requests,
    the think time of those tasks becomes long and all queues end up being
    handled as IDLE class.  As a result, the dispatching of I/O requests is
    dispersed and I/O performance falls.  The think time valid flag controls
    the think time judgment.

 * Adding ioprio class for cgroups. (Opt-2 patch)
      The previous version of the expanded CFQ scheduler did not implement
    ioprio class.  This optional patch implements a prototype of it.  The
    patch provides basic service tree control for the ioprio class of cgroups,
    but does not yet provide the preempt function, the completed function and
    so on.



1.  Introduction.

This patchset introduces "Yet Another" I/O bandwidth controlling
subsystem for cgroups based on CFQ (called 2 layer CFQ).

The idea of 2 layer CFQ is to build per-group fairness control on top of
the existing CFQ control.
We added a new data structure, called CFQ driver data, on top of
cfqd in order to control I/O bandwidth for cgroups.
The CFQ driver data controls the cfq_datas through a service tree (rb-tree)
and applies the CFQ algorithm for synchronous I/O.
The active cfqd controls its cfq queues through a service tree.
In other words, the CFQ driver data controls the traditional CFQ data, and
the CFQ data runs conventionally.

           cfqdd     cfqdd     (cfqdd = cfq driver data)
            |          |
  cfqc  -- cfqd ----- cfqd     (cfqd = cfq data,
            |          |        cfqc = cfq cgroup data)
  cfqc  --[cfqd]----- cfqd
            ^
            |
        conventional control.

This patchset is against 2.6.28-rc2


2. Build 

 i. Apply this patchset (series 01 - 12) to kernel 2.6.28-rc2.

     If you want to use the optional functions, also apply the opt-1/opt-2
   patches to kernel 2.6.28-rc2.

 ii. Build the kernel with the IOSCHED_CFQ_CGROUP=y option.

 iii. Reboot into the new kernel.


3. Usage of 2 layer CFQ

* Preparation for using 2 layer CFQ

 i. Mount the cfq_cgroup special device on a directory under /dev.
    ex.
      mkdir /dev/cgroup
      mount -t cgroup -o cfq cfq /dev/cgroup

 ii. Change the elevator scheduler of the device to "cfq-cgroups".
    ex.
      echo cfq-cgroups > /sys/block/sda/queue/scheduler
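
Before switching, it can help to confirm that the running kernel actually
offers the new elevator.  The following is only a sketch: it assumes the usual
layout of the sysfs scheduler file, where the available elevators are listed
on one line and the active one is shown in square brackets.  The helper names
has_elevator and active_elevator are hypothetical and are not part of this
patchset.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this patchset).
# Assumed input format: one line of elevator names, with the active one in
# square brackets, e.g. "noop deadline cfq [cfq-cgroups]".

# has_elevator LINE NAME: succeed if NAME is listed in LINE (active or not).
has_elevator() {
    case " $1 " in
        *" $2 "*|*" [$2] "*) return 0 ;;
        *) return 1 ;;
    esac
}

# active_elevator LINE: print the bracketed (active) elevator name.
active_elevator() {
    case "$1" in
        *\[*\]*) ;;
        *) return 1 ;;
    esac
    rest=${1#*\[}                  # drop everything up to the first "["
    printf '%s\n' "${rest%%\]*}"   # drop the first "]" and what follows
}

# Example use on a real device (needs root and the patched kernel):
#   line=$(cat /sys/block/sda/queue/scheduler)
#   has_elevator "$line" cfq-cgroups &&
#       echo cfq-cgroups > /sys/block/sda/queue/scheduler
```

The functions only look at the string they are given, so they can be tried
on any machine without touching a block device.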


* Usage of grouping control.
 - Create a new group.
      Make a new directory under /dev/cgroup.
      For example, the following command generates a 'test1' group.
          mkdir /dev/cgroup/test1

 - Insert a task into a group.
      Write the process ID (pid) to the "tasks" entry of the corresponding
     group.
      For example, the following command puts the task with pid 1100 into the
     test1 group.
         echo 1100 > /dev/cgroup/test1/tasks
   
      New child tasks of this task are also inserted into the test1 group.

 - Change the I/O priority of a group.
      Write the priority to the "cfq.ioprio" entry of the corresponding group.
      For example, the following command sets the priority of the 'test1'
     group to rank 2.

          echo 2 > /dev/cgroup/test1/cfq.ioprio

      The I/O priority for cgroups takes a value from 0 to 7, the same as in
     the existing per-task CFQ.

      If you want to change the I/O priority only for a specific device and
     group, add the device name as a second parameter.
      For example, the following command sets the priority of the 'test1'
     group to rank 2 for the 'sda' device.

          echo 2 sda > /dev/cgroup/test1/cfq.ioprio

      
      You can also change the I/O priority of a specific device and group
     via sysfs.  In that case, add the cgroup path as a second parameter.
      For example, the following command sets the priority of the 'test1'
     group to rank 2 for the 'sda' device via sysfs.

          echo 2 /test1 > /sys/block/sda/queue/iosched/ioprio

       In the same way, you can change the parameters of cfq_data (slice_sync,
      back_seek_penalty and so on) for a specific device and group.
       If you write only one parameter via sysfs, the setting applies to all
      groups.

       When you set the elevator scheduler of a device to cfq-cgroups, the
      I/O priority of each group for that new device is set to the group's
      default priority.  If you want to change this default priority, write
      the priority and "default" as a second parameter to the "cfq.ioprio"
      entry of the corresponding group.
       For example,

          echo 2 default > /dev/cgroup/test1/cfq.ioprio
      
 - Change the I/O priority of a task.
     Use the existing "ionice" command.
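
The group operations above (create a group, insert a task, write
"cfq.ioprio") can be wrapped in a few small shell functions.  This is a
minimal sketch, not part of the patchset: the function names and the CGROOT
variable are hypothetical, and the mount point is assumed to be /dev/cgroup
as in the preparation step.  The priority is checked against the 0 to 7
range before anything is written.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this patchset).  CGROOT is the cgroup
# mount point; it is a variable so the functions can first be tried against
# a scratch directory.
CGROOT=${CGROOT:-/dev/cgroup}

# ioprio_arg PRIO [TARGET]: validate PRIO (0-7) and print the string to
# write to cfq.ioprio.  TARGET is a device name such as "sda", or "default".
ioprio_arg() {
    case "${1:-}" in
        [0-7]) ;;
        *) echo "priority must be 0-7: '${1:-}'" >&2; return 1 ;;
    esac
    if [ -n "${2:-}" ]; then
        printf '%s %s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# group_add_task GROUP PID: create GROUP if needed and move PID into it.
group_add_task() {
    mkdir -p "$CGROOT/$1" || return 1
    echo "$2" > "$CGROOT/$1/tasks"
}

# group_set_ioprio GROUP PRIO [TARGET]: set the I/O priority of GROUP.
group_set_ioprio() {
    arg=$(ioprio_arg "$2" "${3:-}") || return 1
    echo "$arg" > "$CGROOT/$1/cfq.ioprio"
}

# Example (needs root and a mounted cfq cgroup hierarchy):
#   group_add_task test1 1100
#   group_set_ioprio test1 2 sda
#   group_set_ioprio test1 2 default
```

Rejecting an out-of-range value in the script is only a convenience; how the
kernel side reacts to such a write is not covered here.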


4. Usage of Optional Functions.

 i. Usage of a validation flag for 'think time'
    
   This parameter can be used via sysfs in the same way as the other cfq data
   parameters.  Its entry name is 'ttime_valid'.

   This flag decides how the think time is checked.
     With the value 0, queues are always handled as idle class.
        In practice, the idle_window flag is cleared.
     With the value 1, queues are handled in the same way as in traditional
        CFQ.
     The value 2 makes the think time invalid.
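
As a rough illustration of the three values, the sketch below checks a
value before writing it.  It assumes that 'ttime_valid' sits next to the
other iosched entries, i.e. under /sys/block/<dev>/queue/iosched/; that path,
the SYSBLOCK variable and the function names are assumptions for this example
and are not taken from the patch.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this patchset).
# SYSBLOCK is a variable so the functions can be tried on a scratch directory.
SYSBLOCK=${SYSBLOCK:-/sys/block}

# ttime_valid_desc VALUE: print the meaning of VALUE as listed above.
ttime_valid_desc() {
    case "${1:-}" in
        0) echo "always idle class (idle_window cleared)" ;;
        1) echo "same as traditional CFQ" ;;
        2) echo "think time invalid" ;;
        *) echo "unknown ttime_valid value: '${1:-}'" >&2; return 1 ;;
    esac
}

# set_ttime_valid DEV VALUE: write VALUE to the assumed sysfs entry of DEV.
set_ttime_valid() {
    ttime_valid_desc "$2" >/dev/null || return 1
    echo "$2" > "$SYSBLOCK/$1/queue/iosched/ttime_valid"
}

# Example (needs root, the Opt-1 patch and cfq-cgroups selected on sda):
#   set_ttime_valid sda 2
```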


 ii. Usage of ioprio class for cgroups. 

   The ioprio class can be used via cgroupfs in the same way as ioprio.
   Its entry name is 'cfq.ioprio_class'.

   The values of ioprio class are the same as the I/O classes of traditional
   CFQ.
     0: IOPRIO_CLASS_NONE (equivalent to IOPRIO_CLASS_BE)
     1: IOPRIO_CLASS_RT   
     2: IOPRIO_CLASS_BE
     3: IOPRIO_CLASS_IDLE
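
Since the entry takes the numeric value, a small name-to-number mapping can
make scripts easier to read.  The sketch below is only an illustration: the
short names (none, rt, be, idle), the function names and the CGROOT variable
are hypothetical, and the mount point is assumed to be /dev/cgroup as in
section 3.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this patchset).
CGROOT=${CGROOT:-/dev/cgroup}

# ioprio_class_num NAME: map a short class name to the number listed above.
ioprio_class_num() {
    case "${1:-}" in
        none) echo 0 ;;   # IOPRIO_CLASS_NONE
        rt)   echo 1 ;;   # IOPRIO_CLASS_RT
        be)   echo 2 ;;   # IOPRIO_CLASS_BE
        idle) echo 3 ;;   # IOPRIO_CLASS_IDLE
        *) echo "unknown ioprio class: '${1:-}'" >&2; return 1 ;;
    esac
}

# group_set_ioprio_class GROUP NAME: write the class number for GROUP.
group_set_ioprio_class() {
    num=$(ioprio_class_num "$2") || return 1
    echo "$num" > "$CGROOT/$1/cfq.ioprio_class"
}

# Example (needs root, the Opt-2 patch and an existing 'test1' group):
#   group_set_ioprio_class test1 idle
```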


5. Future work.
 We still need to implement the following.
 * Handle buffered I/O.


