Linux Container Development
  • * [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
    @ 2008-08-27 16:07 Andrea Righi
      0 siblings, 0 replies; 15+ messages in thread
    From: Andrea Righi @ 2008-08-27 16:07 UTC (permalink / raw)
      To: Balbir Singh, Paul Menage
      Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA, Carl Henrik Lunde,
    	Divyesh Shah, eric.rannaud-Re5JQEeQqe8AvxtiuMwx3w,
    	fernando-gVGce1chcLdL9jVzuh4AOg,
    	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, agk-9JcytcrH/bA+uJoB2kUjGw,
    	subrata-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    	axboe-tSWWG44O7X1aa/9Udqfwiw, Marco Innocenti,
    	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    	dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    	matt-cT2on/YLNlBWk0Htik3J/w, roberto-5KDOxZqKugI,
    	ngupta-hpIqsD4AKlfQT0dZR+AlfA
    
    
    The objective of the i/o controller is to improve i/o performance
    predictability of different cgroups sharing the same block devices.
    
    Compared with other priority/weight-based solutions, the approach used by this
    controller is to explicitly choke applications' requests that directly (or
    indirectly) generate i/o activity in the system.
    
    The direct bandwidth and/or iops limiting method has the advantage of improving
    the performance predictability at the cost of reducing, in general, the overall
    performance of the system (in terms of throughput).
    
    Detailed information about the design, its goals and usage is given in the
    documentation.
    
    Patchset against 2.6.27-rc1-mm1.
    
    The all-in-one patch (and previous versions) can be found at:
    http://download.systemimager.org/~arighi/linux/patches/io-throttle/
    
    This patchset is an experimental implementation; it includes functional
    differences with respect to the previous versions (see the changelog below),
    and I haven't done much testing yet. So, comments are really welcome.
    
    Changelog: (v8 -> v9)
    
    * introduce struct res_counter_ratelimit as a generic structure to implement
      throttling-based cgroup subsystems
    * remove the throttling hooks from the page cache (set_page_dirty): set a
      single throttling hook in submit_bio() for both read and write operations; a
      generic process that is dirtying pages on a limited block device (for the
      cgroup it belongs to) is forced to flush the same amount of pages back to the
      block device (in this way write operations are forced to occur in the same IO
      context as the process that actually generated the IO)
    * collect per cgroup, block device and task throttling statistics (throttle
      counter and total time slept for throttling) and export them to userspace
      through blockio.throttlcnt (in the cgroup filesystem) and
      /proc/PID/io-throttle-stat (per-task statistics)
    * fair throttling: simple attempt to distribute the sleeps equally among all
      the tasks belonging to the same cgroup; instead of imposing a sleep on the
      first task that exceeds the IO limits, the time to sleep is divided by the
      number of tasks present in the same cgroup
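
    As a rough userspace illustration of the fair throttling idea above (not the
    code from the patchset): the helper names, the millisecond units and the way
    the total sleep is derived from a bandwidth limit are all assumptions made
    for this sketch. Only the final division by the number of tasks comes from
    the changelog entry.

```c
#include <stdint.h>

/*
 * Hypothetical helper (assumption, not from the patchset): time in ms that
 * a cgroup would have to wait so that `bytes_done` bytes transferred in
 * `elapsed_ms` do not exceed `limit_bytes_per_sec`.
 * Note: bytes_done * 1000 may overflow for very large counters; a real
 * implementation would have to guard against that.
 */
static uint64_t total_sleep_ms(uint64_t bytes_done, uint64_t elapsed_ms,
                               uint64_t limit_bytes_per_sec)
{
    uint64_t allowed_ms;

    if (limit_bytes_per_sec == 0)
        return 0; /* this sketch treats 0 as "no limit" */

    /* Time the transfer should have taken at the configured rate. */
    allowed_ms = bytes_done * 1000 / limit_bytes_per_sec;
    return allowed_ms > elapsed_ms ? allowed_ms - elapsed_ms : 0;
}

/*
 * Fair throttling as described in the changelog: the cgroup's sleep is
 * divided by the number of tasks in the cgroup instead of being charged
 * entirely to the first task that exceeds the limit.
 */
static uint64_t per_task_sleep_ms(uint64_t cgroup_sleep_ms,
                                  unsigned int nr_tasks)
{
    if (nr_tasks == 0)
        nr_tasks = 1; /* avoid dividing by zero for an empty cgroup */
    return cgroup_sleep_ms / nr_tasks;
}
```

    With these assumptions, a cgroup that pushed 10 MiB in one second against a
    1 MiB/s limit owes 9 seconds of sleep in total; with three tasks in the
    cgroup each one would sleep 3 seconds.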
    
    TODO:
    
    * Try to push down the throttling and implement it directly in the I/O
      schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
      to keep track of the right cgroup context. This approach could lead to more
      memory consumption and increase the number of dirty pages (hard/slow to
      reclaim pages) in the system, since the dirty-page ratio in memory is not
      limited. This could even lead to potential OOM conditions, but these problems
      can be resolved directly in the memory cgroup subsystem
    
    * Handle I/O generated by kswapd: at the moment there's no control on the I/O
      generated by kswapd; try to use the page_cgroup functionality of the memory
      cgroup controller to track this kind of I/O and charge the right cgroup when
      pages are swapped in/out
    
    * Improve fair throttling: distribute the time to sleep among all the tasks of
      a cgroup that exceeded the I/O limits, depending on the amount of IO activity
      generated in the past by each task (see task_io_accounting)
    
    * Try to reduce the cost of calling cgroup_io_throttle() on every submit_bio();
      this is not too expensive, but the call to task_subsys_state() surely has a
      cost. A possible solution could be to temporarily account I/O in the
      current task_struct and call cgroup_io_throttle() only every X MB of I/O,
      or every Y I/O requests. Ideally both X and Y could be tuned at runtime
      by a userspace tool
    
    * Think about an alternative design for general purpose usage; special purpose
      usage right now is restricted to improving I/O performance predictability and
      evaluating more precise response timings for applications doing I/O. To a
      large degree the block I/O bandwidth controller should implement a more
      complex logic to better evaluate the real cost of I/O operations, depending
      also on the particular block device profile (e.g. USB stick, optical drive,
      hard disk, etc.). This would also allow I/O cost to be accounted
      appropriately for seeky workloads, as opposed to large streaming workloads.
      Instead of looking at the request stream and trying to predict how expensive
      the I/O will be, a totally different approach could be to collect request
      timings (start time / elapsed time) and, based on the collected information,
      try to estimate the I/O cost and usage
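
    The batching idea from the TODO item about the cost of cgroup_io_throttle()
    could look roughly like the following userspace sketch. The structure and
    function names are hypothetical and nothing here is taken from the patchset;
    it only shows accounting I/O locally and reaching the expensive throttling
    path once a byte threshold (X) or a request-count threshold (Y) is hit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task accumulator (would live in task_struct). */
struct io_batch {
    uint64_t bytes;        /* bytes accounted since the last flush */
    unsigned int requests; /* requests accounted since the last flush */
};

/* Hypothetical runtime tunables standing in for X and Y. */
struct io_batch_limits {
    uint64_t max_bytes;        /* X: flush after this many bytes */
    unsigned int max_requests; /* Y: flush after this many requests */
};

/*
 * Account one request locally. Returns true when the accumulated I/O
 * should be handed to the throttling path; in that case *flushed_bytes
 * holds the amount to charge and the batch is reset to zero.
 */
static bool io_batch_account(struct io_batch *b,
                             const struct io_batch_limits *lim,
                             uint64_t bytes, uint64_t *flushed_bytes)
{
    b->bytes += bytes;
    b->requests++;

    /* Below both thresholds: keep accumulating, skip the slow path. */
    if (b->bytes < lim->max_bytes && b->requests < lim->max_requests)
        return false;

    *flushed_bytes = b->bytes;
    b->bytes = 0;
    b->requests = 0;
    return true;
}
```

    In this sketch a caller would invoke the throttling function with
    *flushed_bytes only when io_batch_account() returns true. The trade-off is
    that limits are enforced with a granularity of up to X bytes or Y requests
    per task.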
    
    -Andrea
    

