* [patch 1/4] io controller: documentation
[not found] <20081106153022.215696930@redhat.com>
From: vgoyal-H+wXaHxf7aLQT0dZR+AlfA @ 2008-11-06 15:30 UTC
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Hirokazu Takahashi
Cc: Rik van Riel, fernando-gVGce1chcLdL9jVzuh4AOg, Jeff Moyer, menage-hpIqsD4AKlfQT0dZR+AlfA, ngupta-hpIqsD4AKlfQT0dZR+AlfA, Andrew Morton, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

[-- Attachment #1: bio-group-documentation.patch --]
[-- Type: text/plain, Size: 7741 bytes --]

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Index: linux17/Documentation/controllers/io-controller.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux17/Documentation/controllers/io-controller.txt	2008-11-06 09:12:44.000000000 -0500
@@ -0,0 +1,172 @@
+	IO Controller
+	=============
+
+Design
+======
+This patchset implements a basic version of a proportional weight IO controller.
+It is heavily derived from the dm-ioband IO controller, with one key difference:
+there is no separate device mapper driver, and there is no
+need to create a dm-ioband device on top of every block device which needs
+to do the IO control. In this implementation, all the control logic has
+been internalized and made per request queue. Enabling or disabling
+IO control on a block device is just a matter of writing a 1 or 0 to the
+appropriate sysfs file.
+
+This is a proportional weight controller, which means that various cgroups
+are assigned shares and tasks in those cgroups get to dispatch bios
+in proportion to their cgroup share.
+
+All the contending cgroups are assigned tokens proportionate to their
+weights. One token is charged for one sector of IO. Once all the contending
+cgroups have consumed their tokens, fresh token allocation takes place, and
+this is how disk bandwidth allocation in proportion to weight is achieved.
+
+The bigger picture is that all the bios being submitted to a block device
+are first inspected by the IO controller logic (bio_group_controller()), but
+only if the IO controller has been enabled on that device. The cgroup of the
+bio is determined and the controller checks if this cgroup has sufficient
+tokens to dispatch the bio. If it does, the bio submitting thread
+continues to dispatch the bio through the normal path; otherwise the IO
+controller buffers the bio and the submitting thread returns. These buffered
+bios are dispatched to lower layers later, once the associated group (bio
+group) has sufficient tokens to dispatch them. This delayed dispatching is
+done with the help of a worker thread (biogroup).
+
+IO control can be enabled/disabled dynamically on any block device
+through the sysfs file system. For example, to enable IO control on a device,
+do the following.
+
+echo 1 > /sys/block/sda/biogroup
+
+To disable IO control, write 0.
+
+echo 0 > /sys/block/sda/biogroup
+
+This should be doable for any block device in the stack. Currently this
+patch places the hooks only for the device mapper driver; md still needs to
+be tweaked.
+
+For example, assume there are two cgroups A and B with weights 1024 and 2048
+in the system. Tasks in the two cgroups A and B are doing IO to two disks, sda
+and sdb. A user has enabled IO control on both sda and sdb.
+Now on both sda and sdb, tasks in cgroup B will get to use 2/3 of the disk
+bandwidth and tasks in cgroup A will get to use 1/3, but only in case of
+contention. If tasks in one of the groups stop doing IO to a particular disk,
+tasks in the other group get the full disk bandwidth for that duration.
+
+
+HOWTO
+=====
+- Enable cgroup, memory controller and block IO controller in the kernel
+  config file.
+
+- Boot into the kernel and mount the IO controller.
+
+	mount -t cgroup -o bio none /cgroup/bio/
+
+- Create two cgroups, test1 and test2.
+
+	cd /cgroup/bio
+	mkdir test1 test2
+
+- Allocate weight 4096 to test1 and weight 2048 to test2.
+
+	echo 4096 > /cgroup/bio/test1/bio.shares
+	echo 2048 > /cgroup/bio/test2/bio.shares
+
+- Launch "dd" operations in cgroups test1 and test2.
+
+	echo $$ > /cgroup/bio/test1/tasks
+	dd if=/somefile1 of=/dev/null
+	echo $$ > /cgroup/bio/test2/tasks
+	dd if=/somefile2 of=/dev/null
+
+The job in cgroup test1 should finish before the job in cgroup test2. To verify
+that "dd" in cgroup test1 got to dispatch more bios than "dd" in
+test2, look at "bio.aggregate_tokens" in both cgroups (at the same time). At
+any point of time when both dd's are running, aggregate_tokens in cgroup
+test1 should be approximately double the aggregate_tokens in cgroup test2
+(because the weight of cgroup test1 is double the weight of cgroup test2).
+
+Some Tunables
+=============
+Some tunables appear in the cgroup file system and in sysfs for the respective
+device, for debugging and for configuration. Here is a brief description.
+
+Cgroup Files
+============
+bio.shares
+	Specifies the weight of the cgroup.
+
+bio.aggregate_tokens
+	Specifies the total number of tokens dispatched by this cgroup. One
+	token represents one sector of IO.
+
+bio.jiffies
+	The jiffies value when the last bio from this cgroup was released.
+
+bio.nr_token_slices
+	How many times this cgroup got its token allocation done from a token
+	slice.
+	Each contending cgroup gets a piece of the token slice based on its share.
+
+bio.nr_off_the_tree
+	How many times this bio group went off the list of contending groups.
+	We maintain an rb-tree of bio groups contending for IO, and token
+	allocation takes place to these groups regularly. If a group stops
+	doing IO, it is considered idle and is removed from the tree,
+	then added back later when the group has IO to perform. This file just
+	counts how many times this bio group went off the tree.
+
+Sysfs Tunables
+==============
+/sys/block/{device name}/biogroup
+	Whether the IO controller (bio groups) is active on this device or not.
+
+/sys/block/{device name}/deftoken
+	Default number of tokens which would be given to a bio group upon start
+	if all the bio groups were of the same weight. The token slice is of
+	dynamic length. So if there are 3 cgroups contending and deftoken is 100,
+	the token slice length will be 100*3 = 300, and out of this slice the
+	three groups will get tokens based on their weights.
+
+/sys/block/{device name}/idletime
+	The time after which, if a bio group does not generate a bio, it is
+	considered idle and removed from the rb-tree. Currently the default
+	is 8ms.
+
+/sys/block/{device name}/newslice_count
+	How many times new token allocation took place on this queue.
+
+TODO
+====
+- Do extensive testing in various scenarios, do performance optimization
+  and fix things where broken.
+
+- IO schedulers derive context information from "current". This assumption
+  will be broken if bios are being submitted by a worker thread (biogroup).
+  Probably we need to put an io context pointer in the bio itself to get rid
+  of this dependency.
+
+- Allocating tokens per sector of IO is a crude approximation and will lead
+  to unfair bandwidth allocation in case a task in one cgroup is doing
+  sequential IO and a task in another group is doing random IO.
+  Rik van Riel suggested switching to a time based scheme: keep track of the
+  average time it takes to complete IO from a cgroup and allocate accordingly.
+
+- Currently this controller is dependent on the memory controller being enabled.
+  Try to reduce this coupling.
+
+ISSUES
+======
+- The IO controller can buffer bios if sufficient tokens were not available
+  at the time of bio submission. Once the tokens are available, these bios
+  are dispatched to the elevator/lower layers in first come, first served manner.
+  This has the potential to break CFQ, where an RT task should be able to
+  dispatch its bios first, or a high priority task should be able to release
+  more bios than a low priority task in the same cgroup.
+
+  Not sure how to fix it. Maybe we need to maintain another rb-tree,
+  keep track of RT tasks and task priorities, and dispatch accordingly. This
+  is equivalent to duplicating lots of CFQ logic, and I am not sure how it
+  would impact AS behaviour.
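
As an aside on the token scheme described in the documentation above, here is a minimal userspace sketch of how one token slice could be split among contending groups. It follows the deftoken description (slice length = deftoken * number of contending groups, divided by weight). The function name split_token_slice() and the integer rounding are assumptions made for illustration; they are not taken from the patch, and the kernel code may well compute this differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patchset: split one token slice
 * among nr_groups contending groups in proportion to their shares.
 *
 * The slice length follows the deftoken description in the documentation:
 * deftoken multiplied by the number of contending groups. The integer
 * division (rounding down) is an assumption of this sketch.
 */
void split_token_slice(const unsigned long *shares, size_t nr_groups,
		       unsigned long deftoken, unsigned long *tokens)
{
	unsigned long total_shares = 0;
	unsigned long slice;
	size_t i;

	assert(nr_groups > 0);

	for (i = 0; i < nr_groups; i++)
		total_shares += shares[i];
	assert(total_shares > 0);

	/* Dynamic slice length: grows with the number of contenders. */
	slice = deftoken * nr_groups;

	/* Each group gets its weighted fraction of the slice. */
	for (i = 0; i < nr_groups; i++)
		tokens[i] = slice * shares[i] / total_shares;
}
```

With shares 1024 and 2048 and deftoken 100, the slice is 200 tokens and the two groups receive 66 and 133 tokens, which is roughly the 1/3 and 2/3 split from the cgroup A/B example. Since one token is charged per sector, a group that has used up its tokens would have its bios buffered until the next slice is allocated.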
* [patch 2/4] io controller: biocgroup implementation
From: vgoyal-H+wXaHxf7aLQT0dZR+AlfA @ 2008-11-06 15:30 UTC
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Hirokazu Takahashi
Cc: Rik van Riel, fernando-gVGce1chcLdL9jVzuh4AOg, Jeff Moyer, menage-hpIqsD4AKlfQT0dZR+AlfA, ngupta-hpIqsD4AKlfQT0dZR+AlfA, Andrew Morton, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

[-- Attachment #1: bio-cgroup-implementation --]
[-- Type: text/plain, Size: 24609 bytes --]

o biocgroup functionality.
o Implemented new controller "bio"
o Most of it picked from dm-ioband biocgroup implementation patches.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Index: linux17/include/linux/cgroup_subsys.h
===================================================================
--- linux17.orig/include/linux/cgroup_subsys.h	2008-10-09 18:13:53.000000000 -0400
+++ linux17/include/linux/cgroup_subsys.h	2008-11-05 18:12:32.000000000 -0500
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BIO
+SUBSYS(bio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
Index: linux17/init/Kconfig
===================================================================
--- linux17.orig/init/Kconfig	2008-10-09 18:13:53.000000000 -0400
+++ linux17/init/Kconfig	2008-11-05 18:12:32.000000000 -0500
@@ -408,6 +408,13 @@ config CGROUP_MEM_RES_CTLR
 	  This config option also selects MM_OWNER config option, which
 	  could in turn add some fork/exit overhead.
 
+config CGROUP_BIO
+	bool "Block I/O cgroup subsystem"
+	depends on CGROUP_MEM_RES_CTLR
+	select MM_OWNER
+	help
+	  A generic proportional weight IO controller.
+
 config SYSFS_DEPRECATED
 	bool
 
Index: linux17/mm/biocontrol.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux17/mm/biocontrol.c	2008-11-05 18:12:44.000000000 -0500
@@ -0,0 +1,409 @@
+/* biocontrol.c - Block I/O Controller
+ *
+ * Copyright IBM Corporation, 2007
+ * Author Balbir Singh <balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
+ *
+ * Copyright 2007 OpenVZ SWsoft Inc
+ * Author: Pavel Emelianov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
+ *
+ * Copyright VA Linux Systems Japan, 2008
+ * Author Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
+ *
+ * Copyright RedHat Inc, 2008
+ * Author Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/blkdev.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/idr.h>
+#include <linux/err.h>
+#include <linux/biocontrol.h>
+
+
+/* return corresponding bio_cgroup object of a cgroup */
+static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, bio_cgroup_subsys_id),
+			    struct bio_cgroup, css);
+}
+
+static inline void bio_list_add_head(struct bio_list *bl, struct bio *bio)
+{
+	bio->bi_next = NULL;
+
+	if (bl->head)
+		bio->bi_next = bl->head;
+	else
+		bl->tail = bio;
+
+	bl->head = bio;
+}
+
+void __bio_group_queue_bio_head(struct bio_group *biog, struct bio *bio)
+{
+	bio_list_add_head(&biog->bio_queue, bio);
+}
+
+void bio_group_queue_bio_head(struct bio_group *biog, struct bio *bio)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&biog->bio_group_lock, flags);
+	__bio_group_queue_bio_head(biog, bio);
+	spin_unlock_irqrestore(&biog->bio_group_lock, flags);
+}
+
+void __bio_group_queue_bio_tail(struct bio_group *biog, struct bio *bio)
+{
+	bio_list_add(&biog->bio_queue, bio);
+}
+
+void bio_group_queue_bio_tail(struct bio_group *biog, struct bio *bio)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&biog->bio_group_lock, flags);
+	__bio_group_queue_bio_tail(biog, bio);
+	spin_unlock_irqrestore(&biog->bio_group_lock, flags);
+}
+
+/* Removes first request from the bio-cgroup request list */
+struct bio* __bio_group_dequeue_bio(struct bio_group *biog)
+{
+	struct bio *bio = NULL;
+
+	if (bio_list_empty(&biog->bio_queue))
+		return NULL;
+	bio = bio_list_pop(&biog->bio_queue);
+	return bio;
+}
+
+struct bio* bio_group_dequeue_bio(struct bio_group *biog)
+{
+	unsigned long flags;
+	struct bio *bio;
+	spin_lock_irqsave(&biog->bio_group_lock, flags);
+	bio = __bio_group_dequeue_bio(biog);
+	spin_unlock_irqrestore(&biog->bio_group_lock, flags);
+	return bio;
+}
+
+/* Traverse through all the
+ * active bio_group list of this cgroup and see
+ * if there is an active bio_group for the request queue. */
+struct bio_group* bio_group_from_cgroup(struct bio_cgroup *biocg,
+						struct request_queue *q)
+{
+	unsigned long flags;
+	struct bio_group *biog = NULL;
+
+	spin_lock_irqsave(&biocg->biog_list_lock, flags);
+	if (list_empty(&biocg->bio_group_list))
+		goto out;
+	list_for_each_entry(biog, &biocg->bio_group_list, next) {
+		if (biog->q == q) {
+			bio_group_get(biog);
+			goto out;
+		}
+	}
+
+	/* did not find biog */
+	spin_unlock_irqrestore(&biocg->biog_list_lock, flags);
+	return NULL;
+out:
+	spin_unlock_irqrestore(&biocg->biog_list_lock, flags);
+	return biog;
+}
+
+struct bio_cgroup *bio_cgroup_from_bio(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct bio_cgroup *biocg = NULL;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
+	if (pc)
+		biocg = pc->bio_cgroup;
+	if (!biocg)
+		biocg = bio_cgroup_from_task(rcu_dereference(init_mm.owner));
+	unlock_page_cgroup(page);
+	return biocg;
+}
+
+static struct cgroup_subsys_state * bio_cgroup_create(struct cgroup_subsys *ss,
+						struct cgroup *cgrp)
+{
+	struct bio_cgroup *biocg;
+	int error;
+
+	if (!cgrp->parent) {
+		static struct bio_cgroup default_bio_cgroup;
+
+		biocg = &default_bio_cgroup;
+	} else {
+		biocg = kzalloc(sizeof(*biocg), GFP_KERNEL);
+		if (!biocg) {
+			error = -ENOMEM;
+			goto out;
+		}
+	}
+
+	/* Bind the cgroup to bio_cgroup object we just created */
+	biocg->css.cgroup = cgrp;
+	spin_lock_init(&biocg->biog_list_lock);
+	spin_lock_init(&biocg->page_list_lock);
+	/* Assign default shares */
+	biocg->shares = 1024;
+	INIT_LIST_HEAD(&biocg->bio_group_list);
+	INIT_LIST_HEAD(&biocg->page_list);
+
+	return &biocg->css;
+out:
+	kfree(biocg);
+	return ERR_PTR(error);
+}
+
+void free_biog_elements(struct bio_cgroup *biocg)
+{
+	unsigned long flags, flags1;
+	struct bio_group *biog = NULL;
+
+	spin_lock_irqsave(&biocg->biog_list_lock, flags);
+	while (1) {
+		if (list_empty(&biocg->bio_group_list))
+			goto out;
+
+		list_for_each_entry(biog, &biocg->bio_group_list, next) {
+			spin_lock_irqsave(&biog->bio_group_lock, flags1);
+			if (!atomic_read(&biog->refcnt)) {
+				list_del(&biog->next);
+				BUG_ON(bio_group_on_queue(biog));
+				spin_unlock_irqrestore(&biog->bio_group_lock,
+							flags1);
+				kfree(biog);
+				break;
+			} else {
+				/* Drop the locks and schedule out. */
+				spin_unlock_irqrestore(&biog->bio_group_lock,
+							flags1);
+				spin_unlock_irqrestore(&biocg->biog_list_lock,
+							flags);
+				msleep(1);
+
+				/* Re-acquire the lock */
+				spin_lock_irqsave(&biocg->biog_list_lock,
+							flags);
+				break;
+			}
+		}
+	}
+
+out:
+	spin_unlock_irqrestore(&biocg->biog_list_lock, flags);
+	return;
+}
+
+void free_bio_cgroup(struct bio_cgroup *biocg)
+{
+	free_biog_elements(biocg);
+}
+
+static void __clear_bio_cgroup(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biocg = pc->bio_cgroup;
+	pc->bio_cgroup = NULL;
+	/* Respective bio group got deleted hence reference to
+	 * bio cgroup removed from page during force empty. But page
+	 * is being freed now. Ignore it.
+	 */
+	if (!biocg)
+		return;
+	put_bio_cgroup(biocg);
+}
+
+void clear_bio_cgroup(struct page_cgroup *pc)
+{
+	__clear_bio_cgroup(pc);
+}
+
+#define FORCE_UNCHARGE_BATCH	(128)
+void bio_cgroup_force_empty(struct bio_cgroup *biocg)
+{
+	struct page_cgroup *pc;
+	struct page *page;
+	int count = FORCE_UNCHARGE_BATCH;
+	struct list_head *list = &biocg->page_list;
+	unsigned long flags;
+
+	spin_lock_irqsave(&biocg->page_list_lock, flags);
+	while (!list_empty(list)) {
+		pc = list_entry(list->prev, struct page_cgroup, blist);
+		page = pc->page;
+		get_page(page);
+		__bio_cgroup_remove_page(pc);
+		__clear_bio_cgroup(pc);
+		spin_unlock_irqrestore(&biocg->page_list_lock, flags);
+		put_page(page);
+		if (--count <= 0) {
+			count = FORCE_UNCHARGE_BATCH;
+			cond_resched();
+		}
+		spin_lock_irqsave(&biocg->page_list_lock, flags);
+	}
+	spin_unlock_irqrestore(&biocg->page_list_lock, flags);
+	/* Now free up all the bio groups related to cgroup */
+	free_bio_cgroup(biocg);
+	return;
+}
+
+static void bio_cgroup_pre_destroy(struct cgroup_subsys *ss,
+					struct cgroup *cgrp)
+{
+	struct bio_cgroup *biocg = cgroup_bio(cgrp);
+	bio_cgroup_force_empty(biocg);
+}
+
+static void bio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct bio_cgroup *biocg = cgroup_bio(cgrp);
+	kfree(biocg);
+}
+
+static u64 bio_shares_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct bio_cgroup *biog = cgroup_bio(cgrp);
+
+	return (u64) biog->shares;
+}
+
+static int bio_shares_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct bio_cgroup *biog = cgroup_bio(cgrp);
+
+	biog->shares = val;
+	return 0;
+}
+
+static u64 bio_aggregate_tokens_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct bio_cgroup *biocg = cgroup_bio(cgrp);
+
+	return (u64) biocg->aggregate_tokens;
+}
+
+static int bio_aggregate_tokens_write(struct cgroup *cgrp, struct cftype *cft,
+						u64 val)
+{
+	struct bio_cgroup *biocg = cgroup_bio(cgrp);
+
+	biocg->aggregate_tokens = val;
+	return 0;
+}
+
+static u64 bio_jiffies_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct bio_cgroup *biocg = cgroup_bio(cgrp);
+
+	return (u64) biocg->jiffies;
+}
+
+static u64 bio_nr_off_the_tree_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct bio_cgroup *biocg = cgroup_bio(cgrp);
+
+	return (u64) biocg->nr_off_the_tree;
+}
+
+static int bio_nr_off_the_tree_write(struct cgroup *cgrp, struct cftype *cft,
+						u64 val)
+{
+	struct bio_cgroup *biocg = cgroup_bio(cgrp);
+
+	biocg->nr_off_the_tree = val;
+	return 0;
+}
+
+static u64 bio_nr_token_slices_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct bio_cgroup *biocg = cgroup_bio(cgrp);
+
+	return (u64) biocg->nr_token_slices;
+}
+
+static int bio_nr_token_slices_write(struct cgroup *cgrp,
+					struct cftype *cft, u64 val)
+{
+	struct bio_cgroup *biocg = cgroup_bio(cgrp);
+
+	biocg->nr_token_slices = val;
+	return 0;
+}
+
+
+
+static struct cftype bio_files[] = {
+	{
+		.name = "shares",
+		.read_u64 = bio_shares_read,
+		.write_u64 = bio_shares_write,
+	},
+	{
+		.name = "aggregate_tokens",
+		.read_u64 = bio_aggregate_tokens_read,
+		.write_u64 = bio_aggregate_tokens_write,
+	},
+	{
+		.name = "jiffies",
+		.read_u64 = bio_jiffies_read,
+	},
+	{
+		.name = "nr_off_the_tree",
+		.read_u64 = bio_nr_off_the_tree_read,
+		.write_u64 = bio_nr_off_the_tree_write,
+	},
+	{
+		.name = "nr_token_slices",
+		.read_u64 = bio_nr_token_slices_read,
+		.write_u64 = bio_nr_token_slices_write,
+	},
+};
+
+static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	if (bio_cgroup_disabled())
+		return 0;
+	return cgroup_add_files(cont, ss, bio_files, ARRAY_SIZE(bio_files));
+}
+
+static void bio_cgroup_move_task(struct cgroup_subsys *ss,
+				struct cgroup *cont,
+				struct cgroup *old_cont,
+				struct task_struct *p)
+{
+	/* do nothing */
+}
+
+
+struct cgroup_subsys bio_cgroup_subsys = {
+	.name		= "bio",
+	.subsys_id	= bio_cgroup_subsys_id,
+	.create		= bio_cgroup_create,
+	.destroy	= bio_cgroup_destroy,
+	.pre_destroy	=
+			bio_cgroup_pre_destroy,
+	.populate	= bio_cgroup_populate,
+	.attach		= bio_cgroup_move_task,
+	.early_init	= 0,
+};
Index: linux17/include/linux/biocontrol.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux17/include/linux/biocontrol.h	2008-11-05 18:12:44.000000000 -0500
@@ -0,0 +1,174 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/memcontrol.h>
+#include <linux/blkdev.h>
+#include "../../drivers/md/dm-bio-list.h"
+
+#ifndef _LINUX_BIOCONTROL_H
+#define _LINUX_BIOCONTROL_H
+
+#ifdef CONFIG_CGROUP_BIO
+
+struct io_context;
+struct block_device;
+
+struct bio_cgroup {
+	struct cgroup_subsys_state css;
+	/* Share/weight of the cgroup */
+	unsigned long		shares;
+
+	/* list of bio-groups associated with this cgroup. */
+	struct list_head	bio_group_list;
+	spinlock_t		biog_list_lock;
+
+	/* list of pages associated with this bio cgroup */
+	spinlock_t		page_list_lock;
+	struct list_head	page_list;
+
+	/* Debug Aid */
+	unsigned long		aggregate_tokens;
+	unsigned long		jiffies;
+	unsigned long		nr_off_the_tree;
+	unsigned long		nr_token_slices;
+};
+
+static inline int bio_cgroup_disabled(void)
+{
+	return bio_cgroup_subsys.disabled;
+}
+
+static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
+				struct bio_cgroup, css);
+}
+
+static inline void get_bio_cgroup(struct bio_cgroup *biocg)
+{
+	css_get(&biocg->css);
+}
+
+static inline void put_bio_cgroup(struct bio_cgroup *biocg)
+{
+	css_put(&biocg->css);
+}
+
+static inline void set_bio_cgroup(struct page_cgroup *pc,
+					struct bio_cgroup *biog)
+{
+	pc->bio_cgroup = biog;
+}
+
+static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	get_bio_cgroup(biog);
+	return biog;
+}
+
+/* This should be called in an RCU-protected section.
+ */
+static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
+{
+	struct bio_cgroup *biog;
+	biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
+	get_bio_cgroup(biog);
+	return biog;
+}
+
+static inline void __bio_cgroup_add_page(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biocg = pc->bio_cgroup;
+	list_add(&pc->blist, &biocg->page_list);
+}
+
+static inline void bio_cgroup_add_page(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biocg = pc->bio_cgroup;
+	unsigned long flags;
+	spin_lock_irqsave(&biocg->page_list_lock, flags);
+	__bio_cgroup_add_page(pc);
+	spin_unlock_irqrestore(&biocg->page_list_lock, flags);
+}
+
+static inline void __bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+	list_del_init(&pc->blist);
+}
+
+static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biocg = pc->bio_cgroup;
+	unsigned long flags;
+
+	/* Respective bio group got deleted hence reference to
+	 * bio cgroup removed from page during force empty. But page
+	 * is being freed now. Ignore it.
+	 */
+	if (!biocg)
+		return;
+	spin_lock_irqsave(&biocg->page_list_lock, flags);
+	__bio_cgroup_remove_page(pc);
+	spin_unlock_irqrestore(&biocg->page_list_lock, flags);
+}
+
+extern void clear_bio_cgroup(struct page_cgroup *pc);
+
+extern int bio_group_controller(struct request_queue *q, struct bio *bio);
+extern void blk_biogroup_work(struct work_struct *work);
+#else	/* CONFIG_CGROUP_BIO */
+
+struct bio_cgroup;
+
+static inline int bio_cgroup_disabled(void)
+{
+	return 1;
+}
+
+static inline void get_bio_cgroup(struct bio_cgroup *biocg)
+{
+}
+
+static inline void put_bio_cgroup(struct bio_cgroup *biocg)
+{
+}
+
+static inline void set_bio_cgroup(struct page_cgroup *pc,
+					struct bio_cgroup *biog)
+{
+}
+
+static inline void clear_bio_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
+{
+	return NULL;
+}
+
+static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
+{
+	return NULL;
+}
+
+static inline void bio_cgroup_add_page(struct page_cgroup *pc)
+{
+	return;
+}
+
+static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+	return;
+}
+
+static inline int bio_group_controller(struct request_queue *q, struct bio *bio)
+{
+	return 0;
+}
+static inline void blk_biogroup_work(struct work_struct *work)
+{
+}
+
+
+#endif	/* CONFIG_CGROUP_BIO */
+
+#endif /* _LINUX_BIOCONTROL_H */
Index: linux17/mm/Makefile
===================================================================
--- linux17.orig/mm/Makefile	2008-10-09 18:13:53.000000000 -0400
+++ linux17/mm/Makefile	2008-11-05 18:12:32.000000000 -0500
@@ -34,4 +34,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_BIO) += biocontrol.o
 
Index: linux17/mm/memcontrol.c
===================================================================
--- linux17.orig/mm/memcontrol.c	2008-10-09 18:13:53.000000000 -0400
+++ linux17/mm/memcontrol.c	2008-11-05 18:12:32.000000000 -0500
@@ -32,6 +32,7 @@
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 #include <linux/vmalloc.h>
+#include <linux/biocontrol.h>
 
 #include <asm/uaccess.h>
 
@@ -144,30 +145,6 @@ struct mem_cgroup {
 };
 static struct mem_cgroup init_mem_cgroup;
 
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT	0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK	(1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK	0x0
-#endif
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
-	struct list_head lru;		/* per cgroup LRU list */
-	struct page *page;
-	struct mem_cgroup *mem_cgroup;
-	int flags;
-};
 #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE	(0x2)	/* page is active in this cgroup */
 
@@ -278,21 +255,6 @@ struct page_cgroup *page_get_page_cgroup
 	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
 }
 
-static void lock_page_cgroup(struct page *page)
-{
-	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
-	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
-	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
 			struct page_cgroup *pc)
 {
@@ -535,14 +497,15 @@ unsigned long mem_cgroup_isolate_pages(u
  * < 0 if the cgroup is over its limit
  */
 static int
 mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype,
-				struct mem_cgroup *memcg)
+				gfp_t gfp_mask, enum charge_type ctype,
+				struct mem_cgroup *memcg, struct bio_cgroup *biocg)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 	unsigned long flags;
 	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup_per_zone *mz;
+	struct bio_cgroup *biocg_temp;
 
 	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
 	if (unlikely(pc == NULL))
@@ -572,6 +535,10 @@ static int mem_cgroup_charge_common(stru
 		css_get(&memcg->css);
 	}
 
+	rcu_read_lock();
+	biocg_temp = biocg ? biocg : mm_get_bio_cgroup(mm);
+	rcu_read_unlock();
+
 	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
 		if (!(gfp_mask & __GFP_WAIT))
 			goto out;
@@ -597,6 +564,7 @@ static int mem_cgroup_charge_common(stru
 
 	pc->mem_cgroup = mem;
 	pc->page = page;
+	set_bio_cgroup(pc, biocg_temp);
 	/*
 	 * If a page is accounted as a page cache, insert to inactive list.
 	 * If anon, insert to active list.
@@ -611,21 +579,22 @@ static int mem_cgroup_charge_common(stru
 		unlock_page_cgroup(page);
 		res_counter_uncharge(&mem->res, PAGE_SIZE);
 		css_put(&mem->css);
+		clear_bio_cgroup(pc);
 		kmem_cache_free(page_cgroup_cache, pc);
 		goto done;
 	}
 	page_assign_page_cgroup(page, pc);
-
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_add_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-
+	bio_cgroup_add_page(pc);
 	unlock_page_cgroup(page);
 done:
 	return 0;
 out:
 	css_put(&mem->css);
+	put_bio_cgroup(biocg_temp);
 	kmem_cache_free(page_cgroup_cache, pc);
 err:
 	return -ENOMEM;
@@ -648,7 +617,7 @@ int mem_cgroup_charge(struct page *page,
 	if (unlikely(!mm))
 		mm = &init_mm;
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL, NULL);
 }
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -684,7 +653,7 @@ int mem_cgroup_cache_charge(struct page
 		mm = &init_mm;
 
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL, NULL);
 }
 
 /*
@@ -720,14 +689,14 @@ __mem_cgroup_uncharge_common(struct page
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_remove_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-
+	bio_cgroup_remove_page(pc);
 	page_assign_page_cgroup(page, NULL);
 	unlock_page_cgroup(page);
 
 	mem = pc->mem_cgroup;
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	css_put(&mem->css);
-
+	clear_bio_cgroup(pc);
 	kmem_cache_free(page_cgroup_cache, pc);
 	return;
 unlock:
@@ -754,6 +723,7 @@ int mem_cgroup_prepare_migration(struct
 	struct mem_cgroup *mem = NULL;
 	enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
 	int ret = 0;
+	struct bio_cgroup *biocg = NULL;
 
 	if (mem_cgroup_subsys.disabled)
 		return 0;
@@ -765,12 +735,15 @@ int mem_cgroup_prepare_migration(struct
 		css_get(&mem->css);
 		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
 			ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+		biocg =
+			get_bio_page_cgroup(pc);
 	}
 	unlock_page_cgroup(page);
 	if (mem) {
 		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
-			ctype, mem);
+			ctype, mem, biocg);
 		css_put(&mem->css);
+		if (biocg)
+			put_bio_cgroup(biocg);
 	}
 	return ret;
 }
Index: linux17/include/linux/memcontrol.h
===================================================================
--- linux17.orig/include/linux/memcontrol.h	2008-10-09 18:13:53.000000000 -0400
+++ linux17/include/linux/memcontrol.h	2008-11-05 18:12:32.000000000 -0500
@@ -17,16 +17,47 @@
  * GNU General Public License for more details.
  */
 
+#include <linux/bit_spinlock.h>
+#include <linux/mm_types.h>
+
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
 
 struct mem_cgroup;
-struct page_cgroup;
 struct page;
 struct mm_struct;
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
+/*
+ * We use the lower bit of the page->page_cgroup pointer as a bit spin
+ * lock. We need to ensure that page->page_cgroup is at least two
+ * byte aligned (based on comments from Nick Piggin). But since
+ * bit_spin_lock doesn't actually set that lock bit in a non-debug
+ * uniprocessor kernel, we should avoid setting it here too.
+ */
+#define PAGE_CGROUP_LOCK_BIT	0x0
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define PAGE_CGROUP_LOCK	(1 << PAGE_CGROUP_LOCK_BIT)
+#else
+#define PAGE_CGROUP_LOCK	0x0
+#endif
+
+/*
+ * A page_cgroup page is associated with every page descriptor.
The + * page_cgroup helps us identify information about the cgroup + */ +struct page_cgroup { + struct list_head lru; /* per cgroup LRU list */ + struct page *page; + struct mem_cgroup *mem_cgroup; + int flags; +#ifdef CONFIG_CGROUP_BIO + struct list_head blist; /* for bio_cgroup page list */ + struct bio_cgroup *bio_cgroup; +#endif +}; + #define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0) extern struct page_cgroup *page_get_page_cgroup(struct page *page); @@ -74,6 +105,20 @@ extern long mem_cgroup_calc_reclaim_acti extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem, struct zone *zone, int priority); +static inline void lock_page_cgroup(struct page *page) +{ + bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup); +} + +static inline int try_lock_page_cgroup(struct page *page) +{ + return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup); +} + +static inline void unlock_page_cgroup(struct page *page) +{ + bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup); +} #else /* CONFIG_CGROUP_MEM_RES_CTLR */ static inline void page_reset_bad_cgroup(struct page *page) { -- ^ permalink raw reply [flat|nested] 92+ messages in thread
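A note on the PAGE_CGROUP_LOCK_BIT comment in the memcontrol.h hunk above: because a struct page_cgroup is at least two-byte aligned, bit 0 of any valid pointer to it is always zero, so that bit is free to carry a lock flag. The following is a minimal userspace sketch of that pointer-tagging idea only. It is not the kernel's bit_spin_lock() implementation, it performs no atomic operations, and struct pc_demo and the tagged_ptr_* helpers are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for PAGE_CGROUP_LOCK_BIT / PAGE_CGROUP_LOCK. */
#define DEMO_LOCK_BIT   0
#define DEMO_LOCK_MASK  ((uintptr_t)1 << DEMO_LOCK_BIT)

/* Hypothetical, minimal replacement for struct page_cgroup. */
struct pc_demo {
	int flags;
};

/* Store a pointer in the word while preserving the current lock bit. */
static uintptr_t tagged_ptr_assign(uintptr_t word, struct pc_demo *pc)
{
	/* The scheme only works if the low bit of the pointer is unused. */
	assert(((uintptr_t)pc & DEMO_LOCK_MASK) == 0);
	return (uintptr_t)pc | (word & DEMO_LOCK_MASK);
}

/* Recover the real pointer by masking the lock bit off. */
static struct pc_demo *tagged_ptr_get(uintptr_t word)
{
	return (struct pc_demo *)(word & ~DEMO_LOCK_MASK);
}

/* Non-atomic set/clear/test of the lock bit (illustration only). */
static uintptr_t tagged_ptr_lock(uintptr_t word)
{
	return word | DEMO_LOCK_MASK;
}

static uintptr_t tagged_ptr_unlock(uintptr_t word)
{
	return word & ~DEMO_LOCK_MASK;
}

static int tagged_ptr_is_locked(uintptr_t word)
{
	return (word & DEMO_LOCK_MASK) != 0;
}
```

The point of the sketch is that the pointer and the lock state share one word, which is why the patch's comment insists on the alignment guarantee. A real implementation would need atomic bit operations, as bit_spin_lock() provides.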
* [patch 3/4] io controller: Core IO controller implementation logic [not found] <20081106153022.215696930@redhat.com> 2008-11-06 15:30 ` [patch 1/4] io controller: documentation vgoyal-H+wXaHxf7aLQT0dZR+AlfA 2008-11-06 15:30 ` [patch 2/4] io controller: biocgroup implementation vgoyal-H+wXaHxf7aLQT0dZR+AlfA @ 2008-11-06 15:30 ` vgoyal-H+wXaHxf7aLQT0dZR+AlfA 2008-11-06 15:30 ` [patch 4/4] io controller: Put IO controller to use in device mapper and standard make_request() function vgoyal-H+wXaHxf7aLQT0dZR+AlfA ` (7 subsequent siblings) 10 siblings, 0 replies; 92+ messages in thread From: vgoyal-H+wXaHxf7aLQT0dZR+AlfA @ 2008-11-06 15:30 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Hirokazu Takahashi Cc: Rik van Riel, fernando-gVGce1chcLdL9jVzuh4AOg, Jeff Moyer, menage-hpIqsD4AKlfQT0dZR+AlfA, ngupta-hpIqsD4AKlfQT0dZR+AlfA, Andrew Morton, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 [-- Attachment #1: bio-group-core-implementation.patch --] [-- Type: text/plain, Size: 30342 bytes --] o Core IO controller implementation Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Index: linux2/mm/biocontrol.c =================================================================== --- linux2.orig/mm/biocontrol.c 2008-11-06 05:27:36.000000000 -0500 +++ linux2/mm/biocontrol.c 2008-11-06 05:33:27.000000000 -0500 @@ -33,6 +33,7 @@ #include <linux/err.h> #include <linux/biocontrol.h> +void bio_group_inactive_timeout(unsigned long data); /* return corresponding bio_cgroup object of a cgroup */ static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp) @@ -407,3 +408,706 @@ struct cgroup_subsys bio_cgroup_subsys = .attach = bio_cgroup_move_task, .early_init = 0, }; + +struct bio_group* create_bio_group(struct bio_cgroup *biocg, + struct request_queue *q) +{ + unsigned long flags; + struct bio_group 
*biog = NULL; + + biog = kzalloc(sizeof(struct bio_group), GFP_ATOMIC); + if (!biog) + return biog; + + spin_lock_init(&biog->bio_group_lock); + biog->q = q; + biog->biocg = biocg; + INIT_LIST_HEAD(&biog->next); + biog->biog_inactive_timer.function = bio_group_inactive_timeout; + biog->biog_inactive_timer.data = (unsigned long)biog; + init_timer(&biog->biog_inactive_timer); + atomic_set(&biog->refcnt, 0); + spin_lock_irqsave(&biocg->biog_list_lock, flags); + list_add(&biog->next, &biocg->bio_group_list); + bio_group_get(biog); + spin_unlock_irqrestore(&biocg->biog_list_lock, flags); + return biog; +} + +void* alloc_biog_io(void) +{ + return kzalloc(sizeof(struct biog_io), GFP_ATOMIC); +} + +void free_biog_io(struct biog_io *biog_io) +{ + kfree(biog_io); +} + +/* + * Upon successful completion of bio, this function starts the inactive timer + * so that if a bio group stops contending for disk bandwidth, it is removed + * from the token allocation race. + */ +void biog_io_end(struct bio *bio, int error) +{ + struct biog_io *biog_io; + struct bio_group *biog; + unsigned long flags; + struct request_queue *q; + + biog_io = bio->bi_private; + biog = biog_io->biog; + BUG_ON(!biog); + + spin_lock_irqsave(&biog->bio_group_lock, flags); + q = biog->q; + BUG_ON(!q); + + /* Restore the original bio fields */ + bio->bi_end_io = biog_io->bi_end_io; + bio->bi_private = biog_io->bi_private; + + /* If bio group is still empty, then start the inactive timer */ + if (bio_group_on_queue(biog) && bio_group_empty(biog)) { + mod_timer(&biog->biog_inactive_timer, + jiffies + msecs_to_jiffies(q->biogroup_idletime)); + bio_group_flag_set(BIOG_FLAG_TIMER_ACTIVE, biog); + } + + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + free_biog_io(biog_io); + bio_group_put(biog); + bio_endio(bio, error); +} + +/* Calculate how many tokens should be allocated to new group based on + * the number of share/weight of this group and the number of tokens and + * load which is already present on the 
queue. + */ +unsigned long calculate_nr_tokens(struct bio_group *biog, + struct request_queue *q) +{ + unsigned long nr_tokens, total_slice; + + total_slice = q->biogroup_deftoken * q->nr_biog; + nr_tokens = total_slice * biog->biocg->shares/q->total_weight; + + BUG_ON(!nr_tokens); + return nr_tokens; +} + +unsigned long alloc_bio_group_key(struct request_queue *q) +{ + unsigned long key = 0; + + if (!q->bio_groups.rb.rb_node) + return key; + + /* Insert element at the end of tree */ + key = q->max_key + 1; + return key; +} + +/* + * The below is leftmost cache rbtree addon + */ +struct bio_group *bio_group_rb_first(struct group_rb_root *root) +{ + if (!root->left) + root->left = rb_first(&root->rb); + + if (root->left) + return rb_entry(root->left, struct bio_group, rb_node); + + return NULL; +} + +void remove_bio_group_from_rbtree(struct bio_group *biog, + struct request_queue *q) +{ + struct group_rb_root *root; + struct rb_node *n; + + root = &q->bio_groups; + n = &biog->rb_node; + + if (root->left == n) + root->left = NULL; + + rb_erase(n, &root->rb); + RB_CLEAR_NODE(n); + + if (bio_group_blocked(biog)) + q->nr_biog_blocked--; + + q->nr_biog--; + q->total_weight -= biog->biocg->shares; + + if (!q->total_weight) + q->max_key = 0; +} + + +void insert_bio_group_into_rbtree(struct bio_group *biog, + struct request_queue *q) +{ + struct rb_node **p; + struct rb_node *parent = NULL; + struct bio_group *__biog; + int leftmost = 1; + + /* Check if any element being inserted has key less than max key */ + if (biog->key < q->max_key) + BUG(); + + p = &q->bio_groups.rb.rb_node; + while (*p) { + parent = *p; + __biog = rb_entry(parent, struct bio_group, rb_node); + + /* Should equal key case be a warning? 
*/ + if (biog->key < __biog->key) + p = &(*p)->rb_left; + else { + p = &(*p)->rb_right; + leftmost = 0; + } + } + + /* Cache the leftmost element */ + if (leftmost) + q->bio_groups.left = &biog->rb_node; + + rb_link_node(&biog->rb_node, parent, p); + rb_insert_color(&biog->rb_node, &q->bio_groups.rb); + + /* Update the tokens and weight in request_queue */ + q->nr_biog++; + q->total_weight += biog->biocg->shares; + q->max_key = biog->key; + if (bio_group_blocked(biog)) + q->nr_biog_blocked++; +} + +void queue_bio_group(struct bio_group *biog, struct request_queue *q) +{ + biog->key = alloc_bio_group_key(q); + /* Take another reference on biog. will be decremented once biog + * is off the tree */ + bio_group_get(biog); + insert_bio_group_into_rbtree(biog, q); + bio_group_flag_set(BIOG_FLAG_ON_QUEUE, biog); + bio_group_flag_clear(BIOG_FLAG_BLOCKED, biog); + biog->slice_stamp = q->current_slice; +} + +void start_new_token_slice(struct request_queue *q) +{ + struct rb_node *n; + struct bio_group *biog = NULL; + struct group_rb_root *root; + unsigned long flags; + + q->current_slice++; + + /* Traverse the tree and reset the blocked count to zero of all the + * biogs */ + + root = &q->bio_groups; + + if (!root->left) + root->left = rb_first(&root->rb); + + if (root->left) + biog = rb_entry(root->left, struct bio_group, rb_node); + + if (!biog) + return; + + n = &biog->rb_node; + + /* Reset blocked count */ + q->nr_biog_blocked = 0; + q->newslice_count++; + + do { + biog = rb_entry(n, struct bio_group, rb_node); + spin_lock_irqsave(&biog->bio_group_lock, flags); + bio_group_flag_clear(BIOG_FLAG_BLOCKED, biog); + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + n = rb_next(n); + } while (n); + +} + +int should_start_new_token_slice(struct request_queue *q) +{ + /* + * if all the biog on the queue are blocked, then start a new + * token slice + */ + if (q->nr_biog_blocked == q->nr_biog) + return 1; + return 0; +} + +int is_bio_group_blocked(struct bio_group *biog) +{ 
+ unsigned long flags, status = 0; + + /* Do I really need to lock bio group */ + spin_lock_irqsave(&biog->bio_group_lock, flags); + if (bio_group_blocked(biog)) + status = 1; + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + return status; +} + +int can_bio_group_dispatch(struct bio_group *biog, struct bio *bio) +{ + unsigned long temp = 0, flags; + struct request_queue *q; + long nr_sectors; + int can_dispatch = 0; + + BUG_ON(!biog); + BUG_ON(!bio); + + spin_lock_irqsave(&biog->bio_group_lock, flags); + nr_sectors = bio_sectors(bio); + q = biog->q; + + if (time_after(q->current_slice, biog->slice_stamp)) { + temp = calculate_nr_tokens(biog, q); + biog->credit_tokens += temp; + biog->slice_stamp = q->current_slice; + biog->biocg->nr_token_slices++; + } + + if ((biog->credit_tokens > 0) && (biog->credit_tokens > nr_sectors)) { + if (bio_group_flag_test_and_clear(BIOG_FLAG_BLOCKED, biog)) + q->nr_biog_blocked--; + can_dispatch = 1; + goto out; + } + + if (!bio_group_flag_test_and_set(BIOG_FLAG_BLOCKED, biog)) + q->nr_biog_blocked++; + +out: + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + return can_dispatch; +} + +/* Should be called without queue lock held */ +void bio_group_deactivate_timer(struct bio_group *biog) +{ + unsigned long flags; + + spin_lock_irqsave(&biog->bio_group_lock, flags); + if (bio_group_flag_test_and_clear(BIOG_FLAG_TIMER_ACTIVE, biog)) { + /* Drop the bio group lock so that timer routine could + * finish in case it fires */ + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + del_timer_sync(&biog->biog_inactive_timer); + return; + } + spin_unlock_irqrestore(&biog->bio_group_lock, flags); +} + +int attach_bio_group_io(struct bio_group *biog, struct bio *bio) +{ + int err = 0; + struct biog_io *biog_io; + + biog_io = alloc_biog_io(); + if (!biog_io) { + err = -ENOMEM; + goto out; + } + + /* I already have a valid pointer to biog. So it should be ok + * to get a reference to it. 
*/ + bio_group_get(biog); + biog_io->biog = biog; + biog_io->bi_end_io = bio->bi_end_io; + biog_io->bi_private = bio->bi_private; + + bio->bi_end_io = biog_io_end; + bio->bi_private = biog_io; +out: + return err; +} + +int account_bio_to_bio_group(struct bio_group *biog, struct bio *bio) +{ + int err = 0; + unsigned long flags; + struct request_queue *q; + + spin_lock_irqsave(&biog->bio_group_lock, flags); + err = attach_bio_group_io(biog, bio); + if (err) + goto out; + + biog->nr_bio++; + q = biog->q; + if (!bio_group_on_queue(biog)) + queue_bio_group(biog, q); + +out: + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + return err; +} + +int add_bio_to_bio_group_queue(struct bio_group *biog, struct bio *bio) +{ + unsigned long flags; + struct request_queue *q; + + spin_lock_irqsave(&biog->bio_group_lock, flags); + __bio_group_queue_bio_tail(biog, bio); + q = biog->q; + q->nr_queued_bio++; + queue_delayed_work(q->biogroup_workqueue, &q->biogroup_work, 0); + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + return 0; +} + +/* + * It determines if the thread submitting the bio can itself continue to + * submit the bio or this bio needs to be buffered for later submission + */ +int can_biog_do_direct_dispatch(struct bio_group *biog) +{ + unsigned long flags, dispatch = 1; + + spin_lock_irqsave(&biog->bio_group_lock, flags); + if (bio_group_blocked(biog)) { + dispatch = 0; + goto out; + } + + /* Make sure there are not other queued bios on the biog. These + * queued bios should get a chance to dispatch first */ + if (!bio_group_queued_empty(biog)) + dispatch = 0; +out: + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + return dispatch; +} + +void charge_bio_group_for_tokens(struct bio_group *biog, struct bio *bio) +{ + unsigned long flags; + long dispatched_tokens; + + spin_lock_irqsave(&biog->bio_group_lock, flags); + dispatched_tokens = bio_sectors(bio); + biog->nr_bio--; + + biog->credit_tokens -= dispatched_tokens; + + /* debug aid. 
also update aggregate tokens and jiffies in biocg */ + biog->biocg->aggregate_tokens += dispatched_tokens; + biog->biocg->jiffies = jiffies; + + spin_unlock_irqrestore(&biog->bio_group_lock, flags); +} + +unsigned long __bio_group_try_to_dispatch(struct bio_group *biog, + struct bio *bio) +{ + struct request_queue *q; + int dispatched = 0; + + BUG_ON(!biog); + BUG_ON(!bio); + + q = biog->q; + BUG_ON(!q); +retry: + if (!can_bio_group_dispatch(biog, bio)) { + if (should_start_new_token_slice(q)) { + start_new_token_slice(q); + goto retry; + } + goto out; + } + + charge_bio_group_for_tokens(biog, bio); + dispatched = 1; +out: + return dispatched; +} + +unsigned long bio_group_try_to_dispatch(struct bio_group *biog, struct bio *bio) +{ + struct request_queue *q; + int dispatched = 0; + unsigned long flags; + + q = biog->q; + BUG_ON(!q); + + spin_lock_irqsave(q->queue_lock, flags); + dispatched = __bio_group_try_to_dispatch(biog, bio); + spin_unlock_irqrestore(q->queue_lock, flags); + + return dispatched; +} + +/* Should be called with queue lock and bio group lock held */ +void requeue_bio_group(struct request_queue *q, struct bio_group *biog) +{ + remove_bio_group_from_rbtree(biog, q); + biog->key = alloc_bio_group_key(q); + insert_bio_group_into_rbtree(biog, q); +} + +/* Make a list of queued bios in this bio group which can be dispatched. 
*/ +void make_release_bio_list(struct bio_group *biog, + struct bio_list *release_list) +{ + unsigned long flags, dispatched = 0; + struct bio *bio; + struct request_queue *q; + + spin_lock_irqsave(&biog->bio_group_lock, flags); + + while (1) { + if (bio_group_queued_empty(biog)) + goto out; + + if (bio_group_blocked(biog)) + goto out; + + /* Dequeue one bio from bio group */ + bio = __bio_group_dequeue_bio(biog); + BUG_ON(!bio); + q = biog->q; + q->nr_queued_bio--; + + /* Releasing lock as try to dispatch will acquire it again */ + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + dispatched = __bio_group_try_to_dispatch(biog, bio); + spin_lock_irqsave(&biog->bio_group_lock, flags); + + if (dispatched) { + /* Add the bio to release list */ + bio_list_add(release_list, bio); + continue; + } else { + /* Put the bio back into biog */ + __bio_group_queue_bio_head(biog, bio); + q->nr_queued_bio++; + goto out; + } + } +out: + spin_unlock_irqrestore(&biog->bio_group_lock, flags); + return; +} + +/* + * If a bio group is inactive for q->inactive_timeout, then this group is + * considered to be no more contending for the disk bandwidth and removed + * from the tree. + */ +void bio_group_inactive_timeout(unsigned long data) +{ + struct bio_group *biog = (struct bio_group *)data; + unsigned long flags, flags1; + struct request_queue *q; + + q = biog->q; + BUG_ON(!q); + + spin_lock_irqsave(q->queue_lock, flags); + spin_lock_irqsave(&biog->bio_group_lock, flags1); + + BUG_ON(!bio_group_on_queue(biog)); + BUG_ON(biog->nr_bio); + + BUG_ON((biog->bio_group_flags > 7)); + /* Remove biog from tree */ + biog->biocg->nr_off_the_tree++; + remove_bio_group_from_rbtree(biog, q); + bio_group_flag_clear(BIOG_FLAG_ON_QUEUE, biog); + bio_group_flag_clear(BIOG_FLAG_BLOCKED, biog); + bio_group_flag_clear(BIOG_FLAG_TIMER_ACTIVE, biog); + + /* dm_start_new_slice() takes bio_group_lock. 
Release it now */ + spin_unlock_irqrestore(&biog->bio_group_lock, flags1); + + /* Also check if new slice should be started */ + if ((q->nr_biog) && should_start_new_token_slice(q)) + start_new_token_slice(q); + + spin_unlock_irqrestore(q->queue_lock, flags); + /* Drop the reference to biog */ + bio_group_put(biog); + return; +} + +/* + * It is called through worker thread and it takes care of releasing queued + * bios to underlying layer + */ +void bio_group_dispatch_queued_bio(struct request_queue *q) +{ + struct bio_group *biog; + unsigned long biog_scanned = 0; + unsigned long flags, flags1; + struct bio *bio = NULL; + int ret; + struct bio_list release_list; + + bio_list_init(&release_list); + + spin_lock_irqsave(q->queue_lock, flags); + + while (1) { + + if (!q->nr_biog) + goto out; + + if (!q->nr_queued_bio) + goto out; + + if (biog_scanned == q->nr_biog) { + /* Scanned the whole tree. No eligible biog found */ + if (q->nr_queued_bio) { + queue_delayed_work(q->biogroup_workqueue, + &q->biogroup_work, 1); + } + goto out; + } + + biog = bio_group_rb_first(&q->bio_groups); + BUG_ON(!biog); + + make_release_bio_list(biog, &release_list); + + /* If there are bios to dispatch, release these */ + if (!bio_list_empty(&release_list)) { + if (q->nr_queued_bio) + queue_delayed_work(q->biogroup_workqueue, + &q->biogroup_work, 0); + goto dispatch_bio; + } else { + spin_lock_irqsave(&biog->bio_group_lock, flags1); + requeue_bio_group(q, biog); + biog_scanned++; + spin_unlock_irqrestore(&biog->bio_group_lock, flags1); + continue; + } + } + +dispatch_bio: + spin_unlock_irqrestore(q->queue_lock, flags); + bio = bio_list_pop(&release_list); + BUG_ON(!bio); + + do { + /* Taint the bio with pass through flag */ + bio->bi_flags |= (1UL << BIO_NOBIOGROUP); + do { + ret = q->make_request_fn(q, bio); + } while (ret); + bio = bio_list_pop(&release_list); + } while (bio); + + return; +out: + spin_unlock_irqrestore(q->queue_lock, flags); + return; +} + +void blk_biogroup_work(struct 
work_struct *work) +{ + struct delayed_work *dw = container_of(work, struct delayed_work, work); + struct request_queue *q = + container_of(dw, struct request_queue, biogroup_work); + + bio_group_dispatch_queued_bio(q); +} + +/* + * This is core IO controller function which tries to dispatch bios to + * underlying layers based on cgroup weights. + * + * If the cgroup bio belongs to has got sufficient tokens, submitting + * task/thread is allowed to continue to submit the bio otherwise, bio + * is buffered here and submitting thread returns. This buffered bio will + * be dispatched to lower layers when cgroup has sufficient tokens. + * + * Return code: + * 0 --> continue submit the bio + * 1---> bio buffered by bio group layer. return + */ +int bio_group_controller(struct request_queue *q, struct bio *bio) +{ + + struct bio_group *biog; + struct bio_cgroup *biocg; + int err = 0; + unsigned long flags, dispatched = 0; + + /* This bio has already been subjected to resource constraints. + * Let it pass through unconditionally. 
*/ + if (bio_flagged(bio, BIO_NOBIOGROUP)) { + bio->bi_flags &= ~(1UL << BIO_NOBIOGROUP); + return 0; + } + + spin_lock_irqsave(q->queue_lock, flags); + biocg = bio_cgroup_from_bio(bio); + BUG_ON(!biocg); + + /* If a biog is found, we also take a reference to it */ + biog = bio_group_from_cgroup(biocg, q); + if (!biog) { + /* In case of success, returns with reference to biog */ + biog = create_bio_group(biocg, q); + if (!biog) { + err = -ENOMEM; + goto end_io; + } + } + + spin_unlock_irqrestore(q->queue_lock, flags); + bio_group_deactivate_timer(biog); + spin_lock_irqsave(q->queue_lock, flags); + + err = account_bio_to_bio_group(biog, bio); + if (err) + goto end_io; + + if (!can_biog_do_direct_dispatch(biog)) { + add_bio_to_bio_group_queue(biog, bio); + goto buffered; + } + + dispatched = __bio_group_try_to_dispatch(biog, bio); + + if (!dispatched) { + add_bio_to_bio_group_queue(biog, bio); + goto buffered; + } + + bio_group_put(biog); + spin_unlock_irqrestore(q->queue_lock, flags); + return 0; + +buffered: + bio_group_put(biog); + spin_unlock_irqrestore(q->queue_lock, flags); + return 1; +end_io: + bio_group_put(biog); + spin_unlock_irqrestore(q->queue_lock, flags); + bio_endio(bio, err); + return 1; +} Index: linux2/include/linux/bio.h =================================================================== --- linux2.orig/include/linux/bio.h 2008-11-06 05:27:05.000000000 -0500 +++ linux2/include/linux/bio.h 2008-11-06 05:27:37.000000000 -0500 @@ -131,6 +131,7 @@ struct bio { #define BIO_BOUNCED 5 /* bio is a bounce bio */ #define BIO_USER_MAPPED 6 /* contains user pages */ #define BIO_EOPNOTSUPP 7 /* not supported */ +#define BIO_NOBIOGROUP 8 /* Don't do bio group control on this bio */ #define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag))) /* Index: linux2/block/genhd.c =================================================================== --- linux2.orig/block/genhd.c 2008-11-06 05:27:05.000000000 -0500 +++ linux2/block/genhd.c 2008-11-06 05:27:37.000000000 
-0500 @@ -440,6 +440,120 @@ static ssize_t disk_removable_show(struc (disk->flags & GENHD_FL_REMOVABLE ? 1 : 0)); } +static ssize_t disk_biogroup_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct gendisk *disk = dev_to_disk(dev); + struct request_queue *q = disk->queue; + + return sprintf(buf, "%d\n", blk_queue_bio_group_enabled(q)); +} + +static ssize_t disk_biogroup_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct gendisk *disk = dev_to_disk(dev); + struct request_queue *q = disk->queue; + int i = 0; + + if (count > 0 && sscanf(buf, "%d", &i) > 0) { + spin_lock_irq(q->queue_lock); + if (i) + queue_flag_set(QUEUE_FLAG_BIOG_ENABLED, q); + else + queue_flag_clear(QUEUE_FLAG_BIOG_ENABLED, q); + + spin_unlock_irq(q->queue_lock); + } + return count; +} + +static ssize_t disk_newslice_count_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct gendisk *disk = dev_to_disk(dev); + struct request_queue *q = disk->queue; + + return sprintf(buf, "%lu\n", q->newslice_count); +} + +static ssize_t disk_newslice_count_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct gendisk *disk = dev_to_disk(dev); + struct request_queue *q = disk->queue; + unsigned long flags; + int i = 0; + + if (count > 0 && sscanf(buf, "%d", &i) > 0) { + spin_lock_irqsave(q->queue_lock, flags); + q->newslice_count = i; + spin_unlock_irqrestore(q->queue_lock, flags); + } + return count; +} + +static ssize_t disk_idletime_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct gendisk *disk = dev_to_disk(dev); + struct request_queue *q = disk->queue; + + return sprintf(buf, "%lu\n", q->biogroup_idletime); +} + +static ssize_t disk_idletime_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct gendisk *disk = dev_to_disk(dev); + struct request_queue *q = disk->queue; + 
int i = 0; + + if (count > 0 && sscanf(buf, "%d", &i) > 0) { + spin_lock_irq(q->queue_lock); + if (i) + q->biogroup_idletime = i; + else + q->biogroup_idletime = 0; + + spin_unlock_irq(q->queue_lock); + } + return count; +} + +static ssize_t disk_deftoken_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct gendisk *disk = dev_to_disk(dev); + struct request_queue *q = disk->queue; + + return sprintf(buf, "%lu\n", q->biogroup_deftoken); +} + +static ssize_t disk_deftoken_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct gendisk *disk = dev_to_disk(dev); + struct request_queue *q = disk->queue; + int i = 0; + + if (count > 0 && sscanf(buf, "%d", &i) > 0) { + spin_lock_irq(q->queue_lock); + if (i) { + if (i > 0x30) + q->biogroup_deftoken = i; + } else + q->biogroup_deftoken = 0; + + spin_unlock_irq(q->queue_lock); + } + return count; +} + static ssize_t disk_ro_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -524,6 +638,10 @@ static DEVICE_ATTR(ro, S_IRUGO, disk_ro_ static DEVICE_ATTR(size, S_IRUGO, disk_size_show, NULL); static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL); static DEVICE_ATTR(stat, S_IRUGO, disk_stat_show, NULL); +static DEVICE_ATTR(biogroup, S_IRUGO | S_IWUSR, disk_biogroup_show, disk_biogroup_store); +static DEVICE_ATTR(idletime, S_IRUGO | S_IWUSR, disk_idletime_show, disk_idletime_store); +static DEVICE_ATTR(deftoken, S_IRUGO | S_IWUSR, disk_deftoken_show, disk_deftoken_store); +static DEVICE_ATTR(newslice_count, S_IRUGO | S_IWUSR, disk_newslice_count_show, disk_newslice_count_store); #ifdef CONFIG_FAIL_MAKE_REQUEST static struct device_attribute dev_attr_fail = __ATTR(make-it-fail, S_IRUGO|S_IWUSR, disk_fail_show, disk_fail_store); @@ -539,6 +657,10 @@ static struct attribute *disk_attrs[] = #ifdef CONFIG_FAIL_MAKE_REQUEST &dev_attr_fail.attr, #endif + &dev_attr_biogroup.attr, + &dev_attr_idletime.attr, + 
&dev_attr_deftoken.attr, + &dev_attr_newslice_count.attr, NULL }; Index: linux2/include/linux/blkdev.h =================================================================== --- linux2.orig/include/linux/blkdev.h 2008-11-06 05:27:05.000000000 -0500 +++ linux2/include/linux/blkdev.h 2008-11-06 05:29:51.000000000 -0500 @@ -289,6 +289,11 @@ struct blk_cmd_filter { struct kobject kobj; }; +struct group_rb_root { + struct rb_root rb; + struct rb_node *left; +}; + struct request_queue { /* @@ -298,6 +303,33 @@ struct request_queue struct request *last_merge; elevator_t *elevator; + /* rb-tree which contains all the contending bio groups */ + struct group_rb_root bio_groups; + + /* Total number of bio_group currently on the request queue */ + unsigned long nr_biog; + unsigned long current_slice; + + struct workqueue_struct *biogroup_workqueue; + struct delayed_work biogroup_work; + unsigned long nr_queued_bio; + + /* What's the idletime after which a bio group is considered idle and + * considered no more contending for the bandwidth. 
*/ + unsigned long biogroup_idletime; + unsigned long biogroup_deftoken; + + /* Number of biog which can't issue IO because they don't have + * sufficient tokens */ + unsigned long nr_biog_blocked; + + /* Sum of weight of all the cgroups present on this queue */ + unsigned long total_weight; + + /* Debug Aid */ + unsigned long max_key; + unsigned long newslice_count; + /* * the queue request freelist, one for reads and one for writes */ @@ -421,6 +453,7 @@ struct request_queue #define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */ #define QUEUE_FLAG_BIDI 9 /* queue supports bidi requests */ #define QUEUE_FLAG_NOMERGES 10 /* disable merge attempts */ +#define QUEUE_FLAG_BIOG_ENABLED 11 /* bio group enabled */ static inline int queue_is_locked(struct request_queue *q) { @@ -527,6 +560,7 @@ enum { #define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags) #define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags) #define blk_queue_flushing(q) ((q)->ordseq) +#define blk_queue_bio_group_enabled(q) test_bit(QUEUE_FLAG_BIOG_ENABLED, &(q)->queue_flags) #define blk_fs_request(rq) ((rq)->cmd_type == REQ_TYPE_FS) #define blk_pc_request(rq) ((rq)->cmd_type == REQ_TYPE_BLOCK_PC) Index: linux2/block/blk-core.c =================================================================== --- linux2.orig/block/blk-core.c 2008-11-06 05:27:05.000000000 -0500 +++ linux2/block/blk-core.c 2008-11-06 05:27:40.000000000 -0500 @@ -30,6 +30,7 @@ #include <linux/cpu.h> #include <linux/blktrace_api.h> #include <linux/fault-inject.h> +#include <linux/biocontrol.h> #include "blk.h" @@ -502,6 +503,20 @@ struct request_queue *blk_alloc_queue_no mutex_init(&q->sysfs_lock); spin_lock_init(&q->__queue_lock); +#ifdef CONFIG_CGROUP_BIO + /* Initialize default idle time */ + q->biogroup_idletime = DEFAULT_IDLE_PERIOD; + q->biogroup_deftoken = DEFAULT_NR_TOKENS; + + /* Also create biogroup worker threads. 
It needs to be conditional */ + if (!bio_cgroup_disabled()) { + q->biogroup_workqueue = create_workqueue("biogroup"); + if (!q->biogroup_workqueue) + panic("Failed to create biogroup\n"); + } + INIT_DELAYED_WORK(&q->biogroup_work, blk_biogroup_work); +#endif + return q; } EXPORT_SYMBOL(blk_alloc_queue_node); Index: linux2/include/linux/biocontrol.h =================================================================== --- linux2.orig/include/linux/biocontrol.h 2008-11-06 05:27:36.000000000 -0500 +++ linux2/include/linux/biocontrol.h 2008-11-06 05:27:37.000000000 -0500 @@ -12,6 +12,17 @@ struct io_context; struct block_device; +/* what's a good value. starting with 8 ms */ +#define DEFAULT_IDLE_PERIOD 8 +/* what's a good value. starting with 2000 */ +#define DEFAULT_NR_TOKENS 2000 + +struct biog_io { + struct bio_group *biog; + bio_end_io_t *bi_end_io; + void *bi_private; +}; + struct bio_cgroup { struct cgroup_subsys_state css; /* Share/weight of the cgroup */ @@ -32,6 +43,46 @@ struct bio_cgroup { unsigned long nr_token_slices; }; +/* + * This object keeps track of a group of bios on a particular request queue. + * A cgroup will have one bio_group on each block device request queue it + * is doing IO to. + */ +struct bio_group { + spinlock_t bio_group_lock; + + unsigned long bio_group_flags; + + /* reference counting. use bio_group_get() and bio_group_put() */ + atomic_t refcnt; + + /* Pointer to the request queue this bio-group is currently associated + * with */ + struct request_queue *q; + + /* Pointer to parent bio_cgroup */ + struct bio_cgroup *biocg; + + /* bio_groups are connected through a linked list in parent cgroup */ + struct list_head next; + + long credit_tokens; + + /* Node which hangs in per request queue rb tree */ + struct rb_node rb_node; + + /* Key to index inside rb-tree rooted at devices's request_queue. 
*/ + unsigned long key; + + unsigned long slice_stamp; + + struct timer_list biog_inactive_timer; + unsigned long nr_bio; + + /* List where buffered bios are queued */ + struct bio_list bio_queue; +}; + static inline int bio_cgroup_disabled(void) { return bio_cgroup_subsys.disabled; @@ -110,6 +161,69 @@ static inline void bio_cgroup_remove_pag spin_unlock_irqrestore(&biocg->page_list_lock, flags); } +static inline void bio_group_get(struct bio_group *biog) +{ + atomic_inc(&biog->refcnt); +} + +static inline void bio_group_put(struct bio_group *biog) +{ + atomic_dec(&biog->refcnt); +} + +#define BIOG_FLAG_TIMER_ACTIVE 0 /* Inactive timer armed status */ +#define BIOG_FLAG_ON_QUEUE 1 /* If biog is on request queue */ +#define BIOG_FLAG_BLOCKED 2 /* bio group is blocked */ + +#define bio_group_timer_active(biog) test_bit(BIOG_FLAG_TIMER_ACTIVE, &(biog)->bio_group_flags) +#define bio_group_on_queue(biog) test_bit(BIOG_FLAG_ON_QUEUE, &(biog)->bio_group_flags) +#define bio_group_blocked(biog) test_bit(BIOG_FLAG_BLOCKED, &(biog)->bio_group_flags) + +static inline void bio_group_flag_set(unsigned int flag, struct bio_group *biog) +{ + __set_bit(flag, &biog->bio_group_flags); +} + +static inline void bio_group_flag_clear(unsigned int flag, + struct bio_group *biog) +{ + __clear_bit(flag, &biog->bio_group_flags); +} + +static inline int bio_group_flag_test_and_clear(unsigned int flag, + struct bio_group *biog) +{ + if (test_bit(flag, &biog->bio_group_flags)) { + __clear_bit(flag, &biog->bio_group_flags); + return 1; + } + + return 0; +} + +static inline int bio_group_flag_test_and_set(unsigned int flag, + struct bio_group *biog) +{ + if (!test_bit(flag, &biog->bio_group_flags)) { + __set_bit(flag, &biog->bio_group_flags); + return 0; + } + + return 1; +} + +static inline int bio_group_empty(struct bio_group *biog) +{ + return !biog->nr_bio; +} + +static inline int bio_group_queued_empty(struct bio_group *biog) +{ + if (bio_list_empty(&biog->bio_queue)) + return 1; + return 
0; +} + extern void clear_bio_cgroup(struct page_cgroup *pc); extern int bio_group_controller(struct request_queue *q, struct bio *bio); --
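To make the token arithmetic in calculate_nr_tokens() and can_bio_group_dispatch() above easier to follow, here is a small standalone sketch. It mirrors only the arithmetic of the patch: the per-slice allocation is (deftoken * nr_biog) * shares / total_weight, and a bio may go out only when the group's credit is positive and larger than the bio's sector count. The struct demo_queue and struct demo_group types and the demo_* functions are hypothetical userspace stand-ins, not kernel code, and locking, the rb-tree and the blocked-group accounting are all left out.

```c
#include <assert.h>

/* Hypothetical subset of the request_queue fields used by the patch. */
struct demo_queue {
	unsigned long deftoken;     /* like q->biogroup_deftoken */
	unsigned long nr_biog;      /* number of contending bio groups */
	unsigned long total_weight; /* sum of shares of those groups */
};

/* Hypothetical subset of bio_group plus its cgroup's shares. */
struct demo_group {
	unsigned long shares;       /* like biog->biocg->shares */
	long credit_tokens;         /* like biog->credit_tokens */
};

/* Same arithmetic as calculate_nr_tokens(): a weighted split of the slice. */
static unsigned long demo_calculate_nr_tokens(const struct demo_group *g,
					      const struct demo_queue *q)
{
	unsigned long total_slice;

	/* Assumption of this sketch: at least one group is on the queue. */
	assert(q->total_weight != 0);

	total_slice = q->deftoken * q->nr_biog;
	return total_slice * g->shares / q->total_weight;
}

/*
 * Same condition as can_bio_group_dispatch(), followed by the charge done
 * in charge_bio_group_for_tokens(): one token per sector.
 * Returns 1 if the bio may be dispatched, 0 if the group is out of credit.
 */
static int demo_try_dispatch(struct demo_group *g, long nr_sectors)
{
	if (g->credit_tokens > 0 && g->credit_tokens > nr_sectors) {
		g->credit_tokens -= nr_sectors;
		return 1;
	}
	return 0;
}
```

With the weights 1024 and 2048 from the documentation patch and the default of 2000 tokens, the slice is 4000 tokens and the integer division gives 1333 tokens to the first group and 2666 to the second. In the patch the new allocation is added to any leftover credit with +=, so unused or overdrawn credit carries into the next slice; the sketch leaves that refill step to the caller.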
* [patch 4/4] io controller: Put IO controller to use in device mapper and standard make_request() function [not found] <20081106153022.215696930@redhat.com> ` (2 preceding siblings ...) 2008-11-06 15:30 ` [patch 3/4] io controller: Core IO controller implementation logic vgoyal-H+wXaHxf7aLQT0dZR+AlfA @ 2008-11-06 15:30 ` vgoyal-H+wXaHxf7aLQT0dZR+AlfA [not found] ` <1225986593.7803.4688.camel@twins> ` (6 subsequent siblings) 10 siblings, 0 replies; 92+ messages in thread From: vgoyal-H+wXaHxf7aLQT0dZR+AlfA @ 2008-11-06 15:30 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Hirokazu Takahashi Cc: Rik van Riel, fernando-gVGce1chcLdL9jVzuh4AOg, Jeff Moyer, menage-hpIqsD4AKlfQT0dZR+AlfA, ngupta-hpIqsD4AKlfQT0dZR+AlfA, Andrew Morton, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 [-- Attachment #1: bio-cgroup-tweak-make-request-functions.patch --] [-- Type: text/plain, Size: 2406 bytes --] o Tweak standard make_request() function and also the device mapper make request functions to enable use of IO controller. 
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Index: linux17/drivers/md/dm.c
===================================================================
--- linux17.orig/drivers/md/dm.c	2008-11-05 18:12:42.000000000 -0500
+++ linux17/drivers/md/dm.c	2008-11-06 09:16:32.000000000 -0500
@@ -22,6 +22,7 @@
 #include <linux/hdreg.h>
 #include <linux/blktrace_api.h>
 #include <linux/smp_lock.h>
+#include <linux/biocontrol.h>
 
 #define DM_MSG_PREFIX "core"
 
@@ -885,6 +886,7 @@ static int dm_request(struct request_que
 	int r = -EIO;
 	int rw = bio_data_dir(bio);
 	struct mapped_device *md = q->queuedata;
+	int ret;
 
 	/*
 	 * There is no use in forwarding any barrier request since we can't
@@ -895,6 +897,13 @@ static int dm_request(struct request_que
 		return 0;
 	}
 
+	if (!bio_cgroup_disabled() && blk_queue_bio_group_enabled(q)) {
+		ret = bio_group_controller(q, bio);
+		if (ret)
+			/* Either bio got buffered or bio_endio() done */
+			return 0;
+	}
+
 	down_read(&md->io_lock);
 
 	disk_stat_inc(dm_disk(md), ios[rw]);
@@ -1081,6 +1090,10 @@ static struct mapped_device *alloc_dev(i
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
 
+	/* Initialize queue spin lock */
+	md->queue->queue_lock = &md->queue->__queue_lock;
+	spin_lock_init(md->queue->queue_lock);
+
 	md->io_pool = mempool_create_slab_pool(MIN_IOS, _io_cache);
 	if (!md->io_pool)
 		goto bad_io_pool;
Index: linux17/block/blk-core.c
===================================================================
--- linux17.orig/block/blk-core.c	2008-11-06 09:14:20.000000000 -0500
+++ linux17/block/blk-core.c	2008-11-06 09:16:32.000000000 -0500
@@ -1117,10 +1117,18 @@ static int __make_request(struct request
 	int el_ret, nr_sectors, barrier, err;
 	const unsigned short prio = bio_prio(bio);
 	const int sync = bio_sync(bio);
-	int rw_flags;
+	int rw_flags, ret;
 
 	nr_sectors = bio_sectors(bio);
 
+	if (!bio_cgroup_disabled() && blk_queue_bio_group_enabled(q)) {
+		ret = bio_group_controller(q, bio);
+		if (ret) {
+			/* Either bio got buffered or bio_endio() done */
+			return 0;
+		}
+	}
+
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
-- 
^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <1225986593.7803.4688.camel@twins>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <1225986593.7803.4688.camel@twins> @ 2008-11-06 16:01 ` Vivek Goyal [not found] ` <20081106160154.GA7461@redhat.com> 1 sibling, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-06 16:01 UTC (permalink / raw) To: Peter Zijlstra Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Thu, Nov 06, 2008 at 04:49:53PM +0100, Peter Zijlstra wrote: > On Thu, 2008-11-06 at 10:30 -0500, vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: > > Hi, > > > > If you are not already tired of so many io controller implementations, here > > is another one. > > > > This is a very eary very crude implementation to get early feedback to see > > if this approach makes any sense or not. > > > > This controller is a proportional weight IO controller primarily > > based on/inspired by dm-ioband. One of the things I personally found little > > odd about dm-ioband was need of a dm-ioband device for every device we want > > to control. I thought that probably we can make this control per request > > queue and get rid of device mapper driver. This should make configuration > > aspect easy. > > > > I have picked up quite some amount of code from dm-ioband especially for > > biocgroup implementation. > > > > I have done very basic testing and that is running 2-3 dd commands in different > > cgroups on x86_64. Wanted to throw out the code early to get some feedback. > > > > More details about the design and how to are in documentation patch. > > > > Your comments are welcome. 
> > please include > > QUILT_REFRESH_ARGS="--diffstat --strip-trailing-whitespace" > > in your environment or .quiltrc > Sure, will do. First time user of quilt. :-) > I would expect all those bio* files to be placed in block/ not mm/ > Thinking more about it, block/ will probably be a more appropriate place. I will do that. > Does this still require I use dm, or does it also work on regular block > devices? Patch 4/4 isn't quite clear on this. No. You don't have to use dm. It will simply work on regular devices. We shall have to add a few lines of code for it to work on devices which don't use the standard __make_request() function and provide their own make_request function. Hence, for example, I have added those few lines of code so that it can work with a dm device. I shall have to do something similar for md too. Though, I am not very sure why I need to do IO control on higher level devices. Will it not be sufficient to control only the bottom-most physical block devices? Anyway, this approach should work at any level. Thanks Vivek ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081106160154.GA7461@redhat.com>]
[parent not found: <20081106160154.GA7461-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081106160154.GA7461-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-06 16:16 ` Peter Zijlstra 0 siblings, 0 replies; 92+ messages in thread From: Peter Zijlstra @ 2008-11-06 16:16 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote: > > Does this still require I use dm, or does it also work on regular block > > devices? Patch 4/4 isn't quite clear on this. > > No. You don't have to use dm. It will simply work on regular devices. We > shall have to put few lines of code for it to work on devices which don't > make use of standard __make_request() function and provide their own > make_request function. > > Hence for example, I have put that few lines of code so that it can work > with dm device. I shall have to do something similar for md too. > > Though, I am not very sure why do I need to do IO control on higher level > devices. Will it be sufficient if we just control only bottom most > physical block devices? > > Anyway, this approach should work at any level. Nice, although I would think only doing the higher level devices makes more sense than only doing the leafs. Is there any reason we cannot merge this with the regular io-scheduler interface? afaik the only problem with doing group scheduling in the io-schedulers is the stacked devices issue. Could we make the io-schedulers aware of this hierarchy? ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <1225988173.7803.4723.camel@twins>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <1225988173.7803.4723.camel@twins> @ 2008-11-06 16:39 ` Vivek Goyal 2008-11-06 16:47 ` Rik van Riel [not found] ` <20081106163957.GB7461@redhat.com> 2 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-06 16:39 UTC (permalink / raw) To: Peter Zijlstra Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote: > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote: > > > > Does this still require I use dm, or does it also work on regular block > > > devices? Patch 4/4 isn't quite clear on this. > > > > No. You don't have to use dm. It will simply work on regular devices. We > > shall have to put few lines of code for it to work on devices which don't > > make use of standard __make_request() function and provide their own > > make_request function. > > > > Hence for example, I have put that few lines of code so that it can work > > with dm device. I shall have to do something similar for md too. > > > > Though, I am not very sure why do I need to do IO control on higher level > > devices. Will it be sufficient if we just control only bottom most > > physical block devices? > > > > Anyway, this approach should work at any level. > > Nice, although I would think only doing the higher level devices makes > more sense than only doing the leafs. 
> I thought that we should be doing any kind of resource management only at the level where there is actual contention for the resources. So in this case it looks like only the bottom-most devices are slow and don't have infinite bandwidth, hence the contention. (I am not taking into account contention at the bus level or at the interconnect level for external storage, assuming the interconnect is not the bottleneck.) For example, let's say there is one linear device mapper device dm-0 on top of physical devices sda and sdb. Assume two tasks in two different cgroups are reading two different files from device dm-0. Now if these files both fall on the same physical device (either sda or sdb), then they will be contending for resources. But if the files being read are on different physical devices then practically there is no device contention (even though on the surface it might look like dm-0 is being contended for). So if the files are on different physical devices, an IO controller sitting on dm-0 will not know it. It will simply dispatch one group at a time and the other device might remain idle. Keeping that in mind I thought we will be able to make use of the full available bandwidth if we do IO control only at the bottom-most device. Doing it at a higher layer has the potential of not making use of the full available bandwidth. > Is there any reason we cannot merge this with the regular io-scheduler > interface? afaik the only problem with doing group scheduling in the > io-schedulers is the stacked devices issue. I think we should be able to merge it with regular io schedulers. Apart from the stacked device issue, people also mentioned that it is so closely tied to IO schedulers that we will end up doing four implementations for four schedulers and that is not very good from a maintenance perspective. But I will spend more time in finding out if there is common ground between schedulers so that a lot of common IO control code can be used in all the schedulers. > > Could we make the io-schedulers aware of this hierarchy?
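The dm-0 example above can be made concrete with a small userspace sketch of how a linear (concatenated) volume maps a logical sector onto one of its underlying devices. This is not the device mapper code: `struct leg` and `linear_map()` are hypothetical names and the sizes are made up. The only point is that two requests to the same dm-0 device may land on different physical disks, which a controller sitting on top of dm-0 cannot see.

```c
#include <assert.h>
#include <stddef.h>

/* One leg of a hypothetical linear (concatenated) volume. */
struct leg {
	const char *name;            /* e.g. "sda" */
	unsigned long long sectors;  /* size of this leg in sectors */
};

/*
 * Map a logical sector of the concatenated device to the index of the
 * leg that holds it and store the offset within that leg in *offset.
 * Returns -1 if the sector lies beyond the end of the volume.
 */
static int linear_map(const struct leg *legs, size_t nr_legs,
		      unsigned long long sector,
		      unsigned long long *offset)
{
	size_t i;

	assert(legs != NULL && offset != NULL);

	for (i = 0; i < nr_legs; i++) {
		if (sector < legs[i].sectors) {
			*offset = sector;
			return (int)i;
		}
		/* Not in this leg: skip past it and try the next one. */
		sector -= legs[i].sectors;
	}
	return -1;
}
```

With legs of 1000 and 2000 sectors standing in for sda and sdb, sector 999 maps to the first leg and sector 1000 to the second, so two readers of dm-0 may not be contending for the same disk at all.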
You mean IO schedulers knowing that there is somebody above them doing proportional weight dispatching of bios? If yes, how would that help? Thanks Vivek ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <1225988173.7803.4723.camel@twins> 2008-11-06 16:39 ` Vivek Goyal @ 2008-11-06 16:47 ` Rik van Riel [not found] ` <20081106163957.GB7461@redhat.com> 2 siblings, 0 replies; 92+ messages in thread From: Rik van Riel @ 2008-11-06 16:47 UTC (permalink / raw) To: Peter Zijlstra Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi Peter Zijlstra wrote: > Nice, although I would think only doing the higher level devices makes > more sense than only doing the leafs. I'm not convinced. Say that you have two resource groups on a bunch of LVM volumes across two disks. If one of the resource groups only sends requests to one of the disks, the other resource group should be able to get all of its requests through immediately at the other disk. Holding up the second resource group's requests could result in a disk being idle. Worse, once that cgroup's requests finally make it through, the other cgroup might also want to use the disk and they both get slowed down. When a resource is uncontended, should a potential user be made to wait? -- All rights reversed. ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081106163957.GB7461@redhat.com>]
[parent not found: <20081106163957.GB7461-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081106163957.GB7461-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-06 16:52 ` Peter Zijlstra 0 siblings, 0 replies; 92+ messages in thread From: Peter Zijlstra @ 2008-11-06 16:52 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote: > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote: > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote: > > > > > > Does this still require I use dm, or does it also work on regular block > > > > devices? Patch 4/4 isn't quite clear on this. > > > > > > No. You don't have to use dm. It will simply work on regular devices. We > > > shall have to put few lines of code for it to work on devices which don't > > > make use of standard __make_request() function and provide their own > > > make_request function. > > > > > > Hence for example, I have put that few lines of code so that it can work > > > with dm device. I shall have to do something similar for md too. > > > > > > Though, I am not very sure why do I need to do IO control on higher level > > > devices. Will it be sufficient if we just control only bottom most > > > physical block devices? > > > > > > Anyway, this approach should work at any level. > > > > Nice, although I would think only doing the higher level devices makes > > more sense than only doing the leafs. 
> > > > I thought that we should be doing any kind of resource management only at > the level where there is actual contention for the resources.So in this case > looks like only bottom most devices are slow and don't have infinite bandwidth > hence the contention.(I am not taking into account the contention at > bus level or contention at interconnect level for external storage, > assuming interconnect is not the bottleneck). > > For example, lets say there is one linear device mapper device dm-0 on > top of physical devices sda and sdb. Assuming two tasks in two different > cgroups are reading two different files from deivce dm-0. Now if these > files both fall on same physical device (either sda or sdb), then they > will be contending for resources. But if files being read are on different > physical deivces then practically there is no device contention (Even on > the surface it might look like that dm-0 is being contended for). So if > files are on different physical devices, IO controller will not know it. > He will simply dispatch one group at a time and other device might remain > idle. > > Keeping that in mind I thought we will be able to make use of full > available bandwidth if we do IO control only at bottom most device. Doing > it at higher layer has potential of not making use of full available bandwidth. > > > Is there any reason we cannot merge this with the regular io-scheduler > > interface? afaik the only problem with doing group scheduling in the > > io-schedulers is the stacked devices issue. > > I think we should be able to merge it with regular io schedulers. Apart > from stacked device issue, people also mentioned that it is so closely > tied to IO schedulers that we will end up doing four implementations for > four schedulers and that is not very good from maintenance perspective. 
> > But I will spend more time in finding out if there is a common ground > between schedulers so that a lot of common IO control code can be used > in all the schedulers. > > > > > Could we make the io-schedulers aware of this hierarchy? > > You mean IO schedulers knowing that there is somebody above them doing > proportional weight dispatching of bios? If yes, how would that help? Well, take the slightly more elaborate example of a raid[56] setup. This will need to sometimes issue multiple leaf level ios to satisfy one top level io. How are you going to attribute this fairly? I don't think the issue of bandwidth availability like above will really be an issue, if your stripe is set up symmetrically, the contention should average out to both (all) disks in equal measures. The only real issue I can see is with linear volumes, but those are stupid anyway - none of the gains but all the risks. ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <1225990327.7803.4776.camel@twins>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <1225990327.7803.4776.camel@twins> @ 2008-11-06 16:57 ` Rik van Riel 2008-11-06 17:08 ` Vivek Goyal ` (2 subsequent siblings) 3 siblings, 0 replies; 92+ messages in thread From: Rik van Riel @ 2008-11-06 16:57 UTC (permalink / raw) To: Peter Zijlstra Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi Peter Zijlstra wrote: > The only real issue I can see is with linear volumes, but those are > stupid anyway - non of the gains but all the risks. Linear volumes may well be the most common ones. People start out with the filesystems at a certain size, increasing onto a second (new) disk later, when more space is required. -- All rights reversed. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <1225990327.7803.4776.camel@twins> 2008-11-06 16:57 ` Rik van Riel @ 2008-11-06 17:08 ` Vivek Goyal [not found] ` <491321ED.5010103@redhat.com> [not found] ` <20081106170830.GD7461@redhat.com> 3 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-06 17:08 UTC (permalink / raw) To: Peter Zijlstra Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote: > On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote: > > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote: > > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote: > > > > > > > > Does this still require I use dm, or does it also work on regular block > > > > > devices? Patch 4/4 isn't quite clear on this. > > > > > > > > No. You don't have to use dm. It will simply work on regular devices. We > > > > shall have to put few lines of code for it to work on devices which don't > > > > make use of standard __make_request() function and provide their own > > > > make_request function. > > > > > > > > Hence for example, I have put that few lines of code so that it can work > > > > with dm device. I shall have to do something similar for md too. > > > > > > > > Though, I am not very sure why do I need to do IO control on higher level > > > > devices. Will it be sufficient if we just control only bottom most > > > > physical block devices? > > > > > > > > Anyway, this approach should work at any level. > > > > > > Nice, although I would think only doing the higher level devices makes > > > more sense than only doing the leafs. 
> > > > > > > I thought that we should be doing any kind of resource management only at > > the level where there is actual contention for the resources.So in this case > > looks like only bottom most devices are slow and don't have infinite bandwidth > > hence the contention.(I am not taking into account the contention at > > bus level or contention at interconnect level for external storage, > > assuming interconnect is not the bottleneck). > > > > For example, lets say there is one linear device mapper device dm-0 on > > top of physical devices sda and sdb. Assuming two tasks in two different > > cgroups are reading two different files from deivce dm-0. Now if these > > files both fall on same physical device (either sda or sdb), then they > > will be contending for resources. But if files being read are on different > > physical deivces then practically there is no device contention (Even on > > the surface it might look like that dm-0 is being contended for). So if > > files are on different physical devices, IO controller will not know it. > > He will simply dispatch one group at a time and other device might remain > > idle. > > > > Keeping that in mind I thought we will be able to make use of full > > available bandwidth if we do IO control only at bottom most device. Doing > > it at higher layer has potential of not making use of full available bandwidth. > > > > > Is there any reason we cannot merge this with the regular io-scheduler > > > interface? afaik the only problem with doing group scheduling in the > > > io-schedulers is the stacked devices issue. > > > > I think we should be able to merge it with regular io schedulers. Apart > > from stacked device issue, people also mentioned that it is so closely > > tied to IO schedulers that we will end up doing four implementations for > > four schedulers and that is not very good from maintenance perspective. 
> > > > But I will spend more time in finding out if there is a common ground > > between schedulers so that a lot of common IO control code can be used > > in all the schedulers. > > > > > > > > Could we make the io-schedulers aware of this hierarchy? > > > > You mean IO schedulers knowing that there is somebody above them doing > > proportional weight dispatching of bios? If yes, how would that help? > > Well, take the slightly more elaborate example of a raid[56] setup. This > will need to sometimes issue multiple leaf level ios to satisfy one top > level io. > > How are you going to attribute this fairly? > I think in this case, the definition of fair allocation will be a little different. We will do fair allocation only at the leaf nodes where there is actual contention, irrespective of the higher level setup. So if a higher level block device issues multiple ios to satisfy one top level io, we will actually do the bandwidth allocation only on those multiple ios because that's the real IO contending for disk bandwidth. And if these multiple ios are going to different physical devices, then contention management will take place on those devices. IOW, we will not worry about providing fairness for bios submitted to higher level devices. We will just pitch in for contention management only when requests from various cgroups are contending for a physical device at the bottom-most layer. Isn't that fair? Thanks Vivek > I don't think the issue of bandwidth availability like above will really > be an issue, if your stripe is set up symmetrically, the contention > should average out to both (all) disks in equal measures. > > The only real issue I can see is with linear volumes, but those are > stupid anyway - none of the gains but all the risks. ^ permalink raw reply [flat|nested] 92+ messages in thread
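A rough sketch of the leaf-level accounting described above, assuming the token scheme from the documentation patch (one token per sector, tokens handed out in proportion to cgroup weight). The names `struct leaf_group`, `refill_tokens()` and `try_dispatch()` are invented for illustration, and the real bio_group_controller() logic may well differ. Each physical queue would keep its own set of groups, so contention is only ever arbitrated per device.

```c
#include <assert.h>
#include <stddef.h>

/* Per-cgroup state on one physical (leaf) request queue; hypothetical. */
struct leaf_group {
	unsigned int weight;         /* cgroup share, e.g. 1024 or 2048 */
	unsigned long long tokens;   /* sectors this group may still dispatch */
};

/*
 * Hand out a fresh budget of tokens in proportion to each group's weight.
 * Integer division may leave a small remainder of the budget unallocated.
 */
static void refill_tokens(struct leaf_group *groups, size_t nr,
			  unsigned long long budget)
{
	unsigned long long total_weight = 0;
	size_t i;

	assert(groups != NULL);
	for (i = 0; i < nr; i++)
		total_weight += groups[i].weight;
	assert(total_weight > 0);

	for (i = 0; i < nr; i++)
		groups[i].tokens = budget * groups[i].weight / total_weight;
}

/*
 * Charge one token per sector. Returns 1 if the IO may be dispatched now,
 * 0 if it would have to be buffered until the next refill.
 */
static int try_dispatch(struct leaf_group *g, unsigned long long sectors)
{
	assert(g != NULL);
	if (g->tokens < sectors)
		return 0;
	g->tokens -= sectors;
	return 1;
}
```

With weights 1024 and 2048 and a budget of 3072 sectors, the groups receive 1024 and 2048 tokens respectively. If a raid[56] device on top splits one bio into several leaf ios, each leaf queue simply charges whatever actually reaches it.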
[parent not found: <491321ED.5010103@redhat.com>]
[parent not found: <491321ED.5010103-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <491321ED.5010103-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-06 17:11 ` Peter Zijlstra 0 siblings, 0 replies; 92+ messages in thread From: Peter Zijlstra @ 2008-11-06 17:11 UTC (permalink / raw) To: Rik van Riel Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote: > Peter Zijlstra wrote: > > > The only real issue I can see is with linear volumes, but those are > > stupid anyway - non of the gains but all the risks. > > Linear volumes may well be the most common ones. > > People start out with the filesystems at a certain size, > increasing onto a second (new) disk later, when more space > is required. Are they aware of how risky linear volumes are? I would discourage anyone from using them. ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <1225991487.7803.4801.camel@twins>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <1225991487.7803.4801.camel@twins> @ 2008-11-07 0:41 ` Dave Chinner [not found] ` <20081107004131.GD2373@disturbed> 1 sibling, 0 replies; 92+ messages in thread From: Dave Chinner @ 2008-11-07 0:41 UTC (permalink / raw) To: Peter Zijlstra Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote: > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote: > > Peter Zijlstra wrote: > > > > > The only real issue I can see is with linear volumes, but those are > > > stupid anyway - non of the gains but all the risks. > > > > Linear volumes may well be the most common ones. > > > > People start out with the filesystems at a certain size, > > increasing onto a second (new) disk later, when more space > > is required. > > Are they aware of how risky linear volumes are? I would discourage > anyone from using them. In what way are they risky? Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081107004131.GD2373@disturbed>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081107004131.GD2373@disturbed> @ 2008-11-07 10:31 ` Peter Zijlstra [not found] ` <1226053904.7803.5856.camel@twins> 1 sibling, 0 replies; 92+ messages in thread From: Peter Zijlstra @ 2008-11-07 10:31 UTC (permalink / raw) To: Dave Chinner Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Fri, 2008-11-07 at 11:41 +1100, Dave Chinner wrote: > On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote: > > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote: > > > Peter Zijlstra wrote: > > > > > > > The only real issue I can see is with linear volumes, but those are > > > > stupid anyway - none of the gains but all the risks. > > > > > > Linear volumes may well be the most common ones. > > > > > > People start out with the filesystems at a certain size, > > > increasing onto a second (new) disk later, when more space > > > is required. > > > > Are they aware of how risky linear volumes are? I would discourage > > anyone from using them. > > In what way are they risky? You lose all your data when one disk dies, so your mtbf decreases with the number of disks in your linear span. And you get none of the benefits from having multiple disks, like extra speed from striping, or redundancy from raid. Therefore I say that linear volumes are the absolute worst choice. ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <1226053904.7803.5856.camel@twins>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <1226053904.7803.5856.camel@twins> @ 2008-11-09 9:40 ` Dave Chinner 0 siblings, 0 replies; 92+ messages in thread From: Dave Chinner @ 2008-11-09 9:40 UTC (permalink / raw) To: Peter Zijlstra Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Fri, Nov 07, 2008 at 11:31:44AM +0100, Peter Zijlstra wrote: > On Fri, 2008-11-07 at 11:41 +1100, Dave Chinner wrote: > > On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote: > > > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote: > > > > Peter Zijlstra wrote: > > > > > > > > > The only real issue I can see is with linear volumes, but > > > > > those are stupid anyway - non of the gains but all the > > > > > risks. > > > > > > > > Linear volumes may well be the most common ones. > > > > > > > > People start out with the filesystems at a certain size, > > > > increasing onto a second (new) disk later, when more space > > > > is required. > > > > > > Are they aware of how risky linear volumes are? I would > > > discourage anyone from using them. > > > > In what way are they risky? > > You loose all your data when one disk dies, so your mtbf decreases > with the number of disks in your linear span. And you get non of > the benefits from having multiple disks, like extra speed from > striping, or redundancy from raid. Fmeh. Step back and think for a moment. How does every major distro build redundant root drives? Yeah, they build a mirror and then put LVM on top of the mirror to partition it. 
Each partition is a *linear volume*, but no single disk failure is going
to lose data because it's been put on top of a mirror.

IOWs, reliability of linear volumes is only an issue if you don't build
redundancy into your storage stack. Just like RAID0, a single disk
failure will lose data. So, most people use linear volumes on top of
RAID1 or RAID5 to avoid such a single disk failure problem. People do
the same thing with RAID0 - it's what RAID10 and RAID50 do....

Also, linear volume performance scalability is on a different axis to
striping. Striping improves bandwidth, but each disk in a stripe tends
to make the same head movements. Hence striping improves sequential
throughput but only provides limited iops scalability. Effectively,
striping only improves throughput while the disks are not seeking a
lot. Add a few parallel I/O streams, and a stripe will start to slow
down as each disk seeks between streams. i.e. disks in stripes cannot
be considered to be able to operate independently.

Linear volumes create independent regions within the address space -
the regions can seek independently when under concurrent I/O and hence
iops scalability is much greater. Aggregate bandwidth is the same as
striping, it's just that a single stream is limited in throughput. If
you want to improve single stream throughput, you stripe before you
concatenate.

That's why people create layered storage systems like this:

linear volume
    |->stripe
        |-> md RAID5
            |-> disk
            |-> disk
            |-> disk
            |-> disk
            |-> disk
        |-> md RAID5
            |-> disk
            |-> disk
            |-> disk
            |-> disk
            |-> disk
    |->stripe
        |-> md RAID5
        ......
    |->stripe
        ......

What you then need is a filesystem that can spread the load over such a
layout. Let's use, for argument's sake, XFS and tell it the geometry of
the RAID5 luns that make up the volume so that its allocation is all
nicely aligned. Then we match the allocation group size to the size of
each independent part of the linear volume.
Now when XFS spreads its inodes and data over multiple AGs, it's
spreading the load across disks that can operate concurrently....

Effectively, linear volumes are about as dangerous as striping. If you
don't build in redundancy at a level below the linear volume or stripe,
then you lose when something fails.

Cheers,

Dave.
--
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply [flat|nested] 92+ messages in thread
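The difference between the two address mappings discussed above can be sketched in a few lines. This is only an illustration under simplified assumptions: block numbers are abstract, all disks are the same size, and the function names, disk size and chunk size are made up for the example rather than taken from dm or md.

```python
def linear_map(lba, disk_size, n_disks):
    """Concatenation: disk k holds blocks [k*disk_size, (k+1)*disk_size)."""
    disk, offset = divmod(lba, disk_size)
    if disk >= n_disks:
        raise ValueError("block beyond end of volume")
    return disk, offset


def stripe_map(lba, chunk, n_disks):
    """Striping: consecutive chunks rotate across the disks."""
    chunk_no, within = divmod(lba, chunk)
    disk = chunk_no % n_disks
    offset = (chunk_no // n_disks) * chunk + within
    return disk, offset


def disks_touched(mapper, start, length, *geometry):
    """Set of disks a sequential read of `length` blocks lands on."""
    return {mapper(lba, *geometry)[0] for lba in range(start, start + length)}


if __name__ == "__main__":
    # Two sequential streams reading from different regions of the volume.
    for start in (0, 2000):
        print("linear", start, disks_touched(linear_map, start, 128, 1000, 4))
        print("stripe", start, disks_touched(stripe_map, start, 128, 16, 4))
```

With these illustrative numbers each stream stays on one disk of the linear volume, so the two streams never compete for a head, while on the stripe both streams touch all four disks and every disk has to seek between them.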
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081106170830.GD7461-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-06 23:07 ` Nauman Rafique 0 siblings, 0 replies; 92+ messages in thread
From: Nauman Rafique @ 2008-11-06 23:07 UTC (permalink / raw)
To: Vivek Goyal
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi

It seems that approaches with two level scheduling (DM-IOBand or this
patch set on top and another scheduler at elevator) will have the
possibility of undesirable interactions (see "issues" listed at the
end of the second patch). For example, a request submitted as RT might
get delayed at higher layers, even if cfq at elevator level is doing
the right thing.

Moreover, if the requests in the higher level scheduler are dispatched
as soon as they come, there would be no queuing at the higher layers,
unless the request queue at the lower level fills up and causes a
backlog. And in the absence of queuing, any work-conserving scheduler
would behave as a no-op scheduler.

These issues motivate taking a second look at two level scheduling.
The main motivations for two level scheduling seem to be:
(1) Support bandwidth division across multiple devices for RAID and LVMs.
(2) Divide bandwidth between different cgroups without modifying each
of the existing schedulers (and without replicating the code).

One possible approach to handle (1) is to keep track of bandwidth
utilized by each cgroup in a per cgroup data structure (instead of a
per cgroup per device data structure) and use that information to make
scheduling decisions within the elevator level schedulers. Such a
patch can be made flag-disabled if co-ordination across different
device schedulers is not required.

And (2) can probably be handled by having one scheduler support
different modes. For example, one possible mode is "proportional
division between cgroups + no-op between threads of a cgroup" or "cfq
between cgroups + cfq between threads of a cgroup". That would also
help avoid combinations which might not work, e.g. the RT request issue
mentioned earlier in this email. And this unified scheduler can re-use
code from all the existing patches.

Thanks.
--
Nauman

On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
>> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
>> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
>> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
>> > >
>> > > > > Does this still require I use dm, or does it also work on regular block
>> > > > > devices? Patch 4/4 isn't quite clear on this.
>> > > >
>> > > > No. You don't have to use dm. It will simply work on regular devices. We
>> > > > shall have to put few lines of code for it to work on devices which don't
>> > > > make use of standard __make_request() function and provide their own
>> > > > make_request function.
>> > > >
>> > > > Hence for example, I have put that few lines of code so that it can work
>> > > > with dm device. I shall have to do something similar for md too.
>> > > >
>> > > > Though, I am not very sure why do I need to do IO control on higher level
>> > > > devices. Will it be sufficient if we just control only bottom most
>> > > > physical block devices?
>> > > >
>> > > > Anyway, this approach should work at any level.
>> > >
>> > > Nice, although I would think only doing the higher level devices makes
>> > > more sense than only doing the leafs.
>> > >
>> >
>> > I thought that we should be doing any kind of resource management only at
>> > the level where there is actual contention for the resources. So in this case
>> > looks like only bottom most devices are slow and don't have infinite bandwidth
>> > hence the contention. (I am not taking into account the contention at
>> > bus level or contention at interconnect level for external storage,
>> > assuming interconnect is not the bottleneck).
>> >
>> > For example, let's say there is one linear device mapper device dm-0 on
>> > top of physical devices sda and sdb. Assuming two tasks in two different
>> > cgroups are reading two different files from device dm-0. Now if these
>> > files both fall on same physical device (either sda or sdb), then they
>> > will be contending for resources. But if files being read are on different
>> > physical devices then practically there is no device contention (Even on
>> > the surface it might look like that dm-0 is being contended for). So if
>> > files are on different physical devices, IO controller will not know it.
>> > He will simply dispatch one group at a time and other device might remain
>> > idle.
>> >
>> > Keeping that in mind I thought we will be able to make use of full
>> > available bandwidth if we do IO control only at bottom most device. Doing
>> > it at higher layer has potential of not making use of full available bandwidth.
>> >
>> > > Is there any reason we cannot merge this with the regular io-scheduler
>> > > interface? afaik the only problem with doing group scheduling in the
>> > > io-schedulers is the stacked devices issue.
>> >
>> > I think we should be able to merge it with regular io schedulers. Apart
>> > from stacked device issue, people also mentioned that it is so closely
>> > tied to IO schedulers that we will end up doing four implementations for
>> > four schedulers and that is not very good from maintenance perspective.
>> >
>> > But I will spend more time in finding out if there is a common ground
>> > between schedulers so that a lot of common IO control code can be used
>> > in all the schedulers.
>> >
>> > >
>> > > Could we make the io-schedulers aware of this hierarchy?
>> >
>> > You mean IO schedulers knowing that there is somebody above them doing
>> > proportional weight dispatching of bios? If yes, how would that help?
>>
>> Well, take the slightly more elaborate example of a raid[56] setup. This
>> will need to sometimes issue multiple leaf level ios to satisfy one top
>> level io.
>>
>> How are you going to attribute this fairly?
>>
>
> I think in this case, definition of fair allocation will be little
> different. We will do fair allocation only at the leaf nodes where
> there is actual contention, irrespective of higher level setup.
>
> So if higher level block device issues multiple ios to satisfy one top
> level io, we will actually do the bandwidth allocation only on
> those multiple ios because that's the real IO contending for disk
> bandwidth. And if these multiple ios are going to different physical
> devices, then contention management will take place on those devices.
>
> IOW, we will not worry about providing fairness at bios submitted to
> higher level devices. We will just pitch in for contention management
> only when request from various cgroups are contending for physical
> device at bottom most layers. Isn't it fair?
>
> Thanks
> Vivek
>
>> I don't think the issue of bandwidth availability like above will really
>> be an issue, if your stripe is set up symmetrically, the contention
>> should average out to both (all) disks in equal measures.
>>
>> The only real issue I can see is with linear volumes, but those are
>> stupid anyway - none of the gains but all the risks.

^ permalink raw reply [flat|nested] 92+ messages in thread
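The per-cgroup accounting idea proposed for (1) in the message above could look roughly like the sketch below. Everything in it is hypothetical: the class name, the `coordinate` flag (standing in for the "flag-disabled" switch), charging in sectors, and the rule of picking the backlogged cgroup with the lowest weight-normalized usage are assumptions made for illustration, not code from any of the posted patches. The weights 1024 and 2048 are the ones used in the documentation example of this patch set.

```python
from collections import defaultdict


class CgroupUsage:
    """Hypothetical bookkeeping shared by several elevator-level schedulers."""

    def __init__(self, weights, coordinate=True):
        self.weights = dict(weights)          # cgroup -> weight
        self.coordinate = coordinate          # False: per-device view only
        self.per_device = defaultdict(int)    # (cgroup, device) -> sectors
        self.total = defaultdict(int)         # cgroup -> sectors, all devices

    def charge(self, cgroup, device, sectors):
        """Record IO done by `cgroup` on `device`."""
        self.per_device[(cgroup, device)] += sectors
        self.total[cgroup] += sectors

    def normalized(self, cgroup, device):
        """Usage divided by weight; lower means the cgroup is owed more."""
        if self.coordinate:
            used = self.total[cgroup]
        else:
            used = self.per_device[(cgroup, device)]
        return used / self.weights[cgroup]

    def pick(self, device, backlogged):
        """Choose which backlogged cgroup the scheduler of `device` serves next."""
        return min(sorted(backlogged), key=lambda cg: self.normalized(cg, device))


if __name__ == "__main__":
    for coordinate in (True, False):
        usage = CgroupUsage({"A": 1024, "B": 2048}, coordinate=coordinate)
        usage.charge("A", "sda", 1024)        # A has already done IO on sda
        print(coordinate, usage.pick("sdb", ["A", "B"]))
```

With co-ordination enabled, IO that cgroup A did on sda counts against it when the sdb scheduler chooses, so B is served first; with the flag off, each device sees only its own history and the two cgroups tie on sdb.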
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <e98e18940811061507t3e19183byf2b8b291458ba81b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-07 14:19 ` Vivek Goyal [not found] ` <20081107141943.GC21884-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> [not found] ` <e98e18940811071336n58a073d8w2cbaeddd5657d1e9@mail.gmail.com> 0 siblings, 2 replies; 92+ messages in thread
From: Vivek Goyal @ 2008-11-07 14:19 UTC (permalink / raw)
To: Nauman Rafique
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi

On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
> It seems that approaches with two level scheduling (DM-IOBand or this
> patch set on top and another scheduler at elevator) will have the
> possibility of undesirable interactions (see "issues" listed at the
> end of the second patch). For example, a request submitted as RT might
> get delayed at higher layers, even if cfq at elevator level is doing
> the right thing.
>

Yep. Buffering of bios at higher layer can break underlying elevator's
assumptions.

What if we start keeping track of task priorities and RT tasks in higher
level schedulers and dispatch the bios accordingly? Will it break the
underlying noop, deadline or AS?

> Moreover, if the requests in the higher level scheduler are dispatched
> as soon as they come, there would be no queuing at the higher layers,
> unless the request queue at the lower level fills up and causes a
> backlog. And in the absence of queuing, any work-conserving scheduler
> would behave as a no-op scheduler.
>
> These issues motivate taking a second look at two level scheduling.
> The main motivations for two level scheduling seem to be:
> (1) Support bandwidth division across multiple devices for RAID and LVMs.

Nauman, can you give an example where we really need bandwidth division
for higher level devices?

I am beginning to think that real contention is at leaf level physical
devices and not at higher level logical devices, hence we should be doing
any resource management only at leaf level and not worry about higher
level logical devices.

If this requirement goes away, then the case for a two level scheduler
weakens and one needs to think about making changes in the leaf level IO
schedulers.

> (2) Divide bandwidth between different cgroups without modifying each
> of the existing schedulers (and without replicating the code).
>
> One possible approach to handle (1) is to keep track of bandwidth
> utilized by each cgroup in a per cgroup data structure (instead of a
> per cgroup per device data structure) and use that information to make
> scheduling decisions within the elevator level schedulers. Such a
> patch can be made flag-disabled if co-ordination across different
> device schedulers is not required.
>

Can you give more details about it? I am not sure I understand it. Exactly
what information should be stored in each cgroup?

I think per cgroup per device data structures are good so that a scheduler
will not worry about other devices present in the system and will just try
to arbitrate between the various cgroups contending for that device. This
goes back to the same issue of getting rid of requirement (1) from the io
controller.

> And (2) can probably be handled by having one scheduler support
> different modes. For example, one possible mode is "proportional
> division between cgroups + no-op between threads of a cgroup" or "cfq
> between cgroups + cfq between threads of a cgroup". That would also
> help avoid combinations which might not work, e.g. the RT request issue
> mentioned earlier in this email. And this unified scheduler can re-use
> code from all the existing patches.
>

IIUC, you are suggesting some kind of unification between four IO
schedulers so that proportional weight code is not replicated and user can
switch mode on the fly based on tunables?

Thanks
Vivek

^ permalink raw reply [flat|nested] 92+ messages in thread
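As a rough illustration of arbitrating only between the cgroups that actually contend for one leaf device, the sketch below simulates the token scheme described in the documentation patch of this series: tokens are handed out in proportion to weight, one token is charged per sector, and a fresh allocation happens once all contending cgroups have run out. The function name, the fixed bio size, the always-backlogged cgroups and the carry-over of leftover tokens on refill are assumptions of this example, not the patch's implementation.

```python
def leaf_dispatch(weights, contending, bio_sectors, n_bios):
    """Simulate one leaf device; returns sectors dispatched per cgroup.

    weights     -- cgroup -> weight (used as the token allocation per round)
    contending  -- list of cgroups with IO pending on *this* device
    bio_sectors -- size of every bio in sectors (one token per sector)
    n_bios      -- how many bios the device dispatches in total
    """
    if not contending or any(weights[cg] <= 0 for cg in contending):
        raise ValueError("need at least one cgroup with a positive weight")
    tokens = {cg: weights[cg] for cg in contending}
    dispatched = {cg: 0 for cg in contending}
    sent = 0
    while sent < n_bios:
        progressed = False
        for cg in contending:
            if sent < n_bios and tokens[cg] >= bio_sectors:
                tokens[cg] -= bio_sectors
                dispatched[cg] += bio_sectors
                sent += 1
                progressed = True
        if not progressed:
            # All contending cgroups are out of tokens: fresh allocation.
            for cg in contending:
                tokens[cg] += weights[cg]
    return dispatched


if __name__ == "__main__":
    weights = {"A": 1024, "B": 2048}
    print("sda:", leaf_dispatch(weights, ["A", "B"], 8, 384))  # both contend
    print("sdb:", leaf_dispatch(weights, ["B"], 8, 384))       # only B has IO
```

On the device where both cgroups have IO pending the split follows the 1:2 weights, while on the device that only one cgroup uses, that cgroup gets the whole device because there is nobody to arbitrate against.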
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081107141943.GC21884-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-07 21:36 ` Nauman Rafique 0 siblings, 0 replies; 92+ messages in thread
From: Nauman Rafique @ 2008-11-07 21:36 UTC (permalink / raw)
To: Vivek Goyal
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi

On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
>> It seems that approaches with two level scheduling (DM-IOBand or this
>> patch set on top and another scheduler at elevator) will have the
>> possibility of undesirable interactions (see "issues" listed at the
>> end of the second patch). For example, a request submitted as RT might
>> get delayed at higher layers, even if cfq at elevator level is doing
>> the right thing.
>>
>
> Yep. Buffering of bios at higher layer can break underlying elevator's
> assumptions.
>
> What if we start keeping track of task priorities and RT tasks in higher
> level schedulers and dispatch the bios accordingly? Will it break the
> underlying noop, deadline or AS?

It will probably not. But then we have a cfq-like scheduler at higher
level and we can agree that the combinations "cfq(higher
level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
would probably work. But if we implement one high level cfq-like
scheduler at a higher level, we would not take care of somebody who
wants noop-noop or proportional-noop.
The point I am trying to make is that there is probably no single
one-size-fits-all solution for a higher level scheduler. And we should
limit the arbitrary mixing and matching of higher level schedulers and
elevator schedulers. That being said, the existence of a higher level
scheduler is still a point of debate I guess, see my comments below.

>
>> Moreover, if the requests in the higher level scheduler are dispatched
>> as soon as they come, there would be no queuing at the higher layers,
>> unless the request queue at the lower level fills up and causes a
>> backlog. And in the absence of queuing, any work-conserving scheduler
>> would behave as a no-op scheduler.
>>
>> These issues motivate taking a second look at two level scheduling.
>> The main motivations for two level scheduling seem to be:
>> (1) Support bandwidth division across multiple devices for RAID and LVMs.
>
> Nauman, can you give an example where we really need bandwidth division
> for higher level devices?
>
> I am beginning to think that real contention is at leaf level physical
> devices and not at higher level logical devices, hence we should be doing
> any resource management only at leaf level and not worry about higher
> level logical devices.
>
> If this requirement goes away, then the case for a two level scheduler
> weakens and one needs to think about making changes in the leaf level IO
> schedulers.

I could not agree with you more that there is only contention at the
leaf level physical devices and that bandwidth should be managed only
there. But having seen earlier posts on this list, I feel some folks
might not agree with us. For example, if we have RAID-0 striping, we
might want to schedule requests based on the cumulative bandwidth used
over all devices. Again, I myself don't agree with moving scheduling to
a higher level just to support that.

>
>> (2) Divide bandwidth between different cgroups without modifying each
>> of the existing schedulers (and without replicating the code).
>>
>> One possible approach to handle (1) is to keep track of bandwidth
>> utilized by each cgroup in a per cgroup data structure (instead of a
>> per cgroup per device data structure) and use that information to make
>> scheduling decisions within the elevator level schedulers. Such a
>> patch can be made flag-disabled if co-ordination across different
>> device schedulers is not required.
>>
>
> Can you give more details about it? I am not sure I understand it. Exactly
> what information should be stored in each cgroup?
>
> I think per cgroup per device data structures are good so that a scheduler
> will not worry about other devices present in the system and will just try
> to arbitrate between the various cgroups contending for that device. This
> goes back to the same issue of getting rid of requirement (1) from the io
> controller.

I was thinking that we can keep track of disk time used at each
device, and keep the cumulative number in a per cgroup data structure.
But that is only if we want to support bandwidth division across
devices. You and I both agree that we probably do not need to do that.

>
>> And (2) can probably be handled by having one scheduler support
>> different modes. For example, one possible mode is "proportional
>> division between cgroups + no-op between threads of a cgroup" or "cfq
>> between cgroups + cfq between threads of a cgroup". That would also
>> help avoid combinations which might not work, e.g. the RT request issue
>> mentioned earlier in this email. And this unified scheduler can re-use
>> code from all the existing patches.
>>
>
> IIUC, you are suggesting some kind of unification between four IO
> schedulers so that proportional weight code is not replicated and user can
> switch mode on the fly based on tunables?

Yes, that seems to be a solution to avoid replication of code. But we
should also look at any other solutions that avoid replication of
code, and also avoid scheduling in two different layers.
In my opinion, scheduling at two different layers is problematic because (a) Any buffering done at a higher level will be artificial, unless the queues at lower levels are completely full. And if there is no buffering at a higher level, any scheduling scheme would be ineffective. (b) We cannot have an arbitrary mixing and matching of higher and lower level schedulers. (a) would exist in any solution in which requests are queued at multiple levels. Can you please comment on this with respect to the patch that you have posted? Thanks. -- Nauman > > Thanks > Vivek > >> Thanks. >> -- >> Nauman >> >> On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> > On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote: >> >> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote: >> >> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote: >> >> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote: >> >> > > >> >> > > > > Does this still require I use dm, or does it also work on regular block >> >> > > > > devices? Patch 4/4 isn't quite clear on this. >> >> > > > >> >> > > > No. You don't have to use dm. It will simply work on regular devices. We >> >> > > > shall have to put few lines of code for it to work on devices which don't >> >> > > > make use of standard __make_request() function and provide their own >> >> > > > make_request function. >> >> > > > >> >> > > > Hence for example, I have put that few lines of code so that it can work >> >> > > > with dm device. I shall have to do something similar for md too. >> >> > > > >> >> > > > Though, I am not very sure why do I need to do IO control on higher level >> >> > > > devices. Will it be sufficient if we just control only bottom most >> >> > > > physical block devices? >> >> > > > >> >> > > > Anyway, this approach should work at any level. 
>> >> > >
>> >> > > Nice, although I would think only doing the higher level devices makes
>> >> > > more sense than only doing the leafs.
>> >> > >
>> >> >
>> >> > I thought that we should be doing any kind of resource management only at
>> >> > the level where there is actual contention for the resources. So in this case
>> >> > looks like only bottom most devices are slow and don't have infinite bandwidth
>> >> > hence the contention. (I am not taking into account the contention at
>> >> > bus level or contention at interconnect level for external storage,
>> >> > assuming interconnect is not the bottleneck).
>> >> >
>> >> > For example, let's say there is one linear device mapper device dm-0 on
>> >> > top of physical devices sda and sdb. Assuming two tasks in two different
>> >> > cgroups are reading two different files from device dm-0. Now if these
>> >> > files both fall on same physical device (either sda or sdb), then they
>> >> > will be contending for resources. But if files being read are on different
>> >> > physical devices then practically there is no device contention (Even on
>> >> > the surface it might look like that dm-0 is being contended for). So if
>> >> > files are on different physical devices, IO controller will not know it.
>> >> > It will simply dispatch one group at a time and other device might remain
>> >> > idle.
>> >> >
>> >> > Keeping that in mind I thought we will be able to make use of full
>> >> > available bandwidth if we do IO control only at bottom most device. Doing
>> >> > it at higher layer has potential of not making use of full available bandwidth.
>> >> >
>> >> > > Is there any reason we cannot merge this with the regular io-scheduler
>> >> > > interface? afaik the only problem with doing group scheduling in the
>> >> > > io-schedulers is the stacked devices issue.
>> >> >
>> >> > I think we should be able to merge it with regular io schedulers. Apart
>> >> > from stacked device issue, people also mentioned that it is so closely
>> >> > tied to IO schedulers that we will end up doing four implementations for
>> >> > four schedulers and that is not very good from maintenance perspective.
>> >> >
>> >> > But I will spend more time in finding out if there is a common ground
>> >> > between schedulers so that a lot of common IO control code can be used
>> >> > in all the schedulers.
>> >> >
>> >> > >
>> >> > > Could we make the io-schedulers aware of this hierarchy?
>> >> >
>> >> > You mean IO schedulers knowing that there is somebody above them doing
>> >> > proportional weight dispatching of bios? If yes, how would that help?
>> >>
>> >> Well, take the slightly more elaborate example of a raid[56] setup. This
>> >> will need to sometimes issue multiple leaf level ios to satisfy one top
>> >> level io.
>> >>
>> >> How are you going to attribute this fairly?
>> >>
>> >
>> > I think in this case, definition of fair allocation will be little
>> > different. We will do fair allocation only at the leaf nodes where
>> > there is actual contention, irrespective of higher level setup.
>> >
>> > So if higher level block device issues multiple ios to satisfy one top
>> > level io, we will actually do the bandwidth allocation only on
>> > those multiple ios because that's the real IO contending for disk
>> > bandwidth. And if these multiple ios are going to different physical
>> > devices, then contention management will take place on those devices.
>> >
>> > IOW, we will not worry about providing fairness at bios submitted to
>> > higher level devices. We will just pitch in for contention management
>> > only when request from various cgroups are contending for physical
>> > device at bottom most layers. Isn't it fair?
>> >
>> > Thanks
>> > Vivek
>> >
>> >> I don't think the issue of bandwidth availability like above will really
>> >> be an issue, if your stripe is set up symmetrically, the contention
>> >> should average out to both (all) disks in equal measures.
>> >>
>> >> The only real issue I can see is with linear volumes, but those are
>> >> stupid anyway - none of the gains but all the risks.
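The linear-volume example quoted above can be made concrete with a small sketch. This is only an illustration of the argument, not device-mapper code: the class `LinearVolume`, the helper `contend()`, and the 1000-sector device sizes are hypothetical names and numbers chosen for the example.

```python
from bisect import bisect_right


class LinearVolume:
    """Toy model of a linear volume (like dm-0) concatenating leaf devices."""

    def __init__(self, leaves):
        # leaves: list of (device name, size in sectors), in concatenation order
        self.names = [name for name, _ in leaves]
        self.starts = []
        offset = 0
        for _, size in leaves:
            self.starts.append(offset)
            offset += size
        self.size = offset

    def leaf_for(self, sector):
        """Map a sector on the volume to (leaf device, sector on that leaf)."""
        if not 0 <= sector < self.size:
            raise ValueError("sector outside the volume")
        idx = bisect_right(self.starts, sector) - 1
        return self.names[idx], sector - self.starts[idx]


def contend(volume, sector_a, sector_b):
    """Two IOs compete for bandwidth only if they land on the same leaf."""
    return volume.leaf_for(sector_a)[0] == volume.leaf_for(sector_b)[0]


dm0 = LinearVolume([("sda", 1000), ("sdb", 1000)])
print(contend(dm0, 10, 20))     # both on sda: real contention
print(contend(dm0, 10, 1500))   # sda vs sdb: no contention at the leaves
```

A controller sitting on dm-0 sees both cases as contention for the same device, which is the point made in the quoted text: only the leaf level can tell them apart.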
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Vivek Goyal @ 2008-11-10 14:11 UTC
To: Nauman Rafique
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer,
    dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
    jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi

On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
> On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
> >> It seems that approaches with two level scheduling (DM-IOBand or this
> >> patch set on top and another scheduler at elevator) will have the
> >> possibility of undesirable interactions (see "issues" listed at the
> >> end of the second patch). For example, a request submitted as RT might
> >> get delayed at higher layers, even if cfq at elevator level is doing
> >> the right thing.
> >>
> >
> > Yep. Buffering of bios at higher layer can break underlying elevator's
> > assumptions.
> >
> > What if we start keeping track of task priorities and RT tasks in higher
> > level schedulers and dispatch the bios accordingly. Will it break the
> > underlying noop, deadline or AS?
>
> It will probably not. But then we have a cfq-like scheduler at higher
> level and we can agree that the combinations "cfq(higher
> level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
> would probably work. But if we implement one high level cfq-like
> scheduler at a higher level, we would not take care of somebody who
> wants noop-noop or proportional-noop. The point I am trying to make is
> that there is probably no single one-size-fits-all solution for a
> higher level scheduler. And we should limit the arbitrary mixing and
> matching of higher level schedulers and elevator schedulers. That
> being said, the existence of a higher level scheduler is still a point
> of debate I guess, see my comments below.
>

Ya, implementing a CFQ-like thing in a higher level scheduler will make
things complex.

>
> >
> >> Moreover, if the requests in the higher level scheduler are dispatched
> >> as soon as they come, there would be no queuing at the higher layers,
> >> unless the request queue at the lower level fills up and causes a
> >> backlog. And in the absence of queuing, any work-conserving scheduler
> >> would behave as a no-op scheduler.
> >>
> >> These issues motivate to take a second look into two level scheduling.
> >> The main motivations for two level scheduling seem to be:
> >> (1) Support bandwidth division across multiple devices for RAID and LVMs.
> >
> > Nauman, can you give an example where we really need bandwidth division
> > for higher level devices.
> >
> > I am beginning to think that real contention is at leaf level physical
> > devices and not at higher level logical devices hence we should be doing
> > any resource management only at leaf level and not worry about higher
> > level logical devices.
> >
> > If this requirement goes away, then case of two level scheduler weakens
> > and one needs to think about doing changes at leaf level IO schedulers.
>
> I cannot agree with you more on this that there is only contention at
> the leaf level physical devices and bandwidth should be managed only
> there. But having seen earlier posts on this list, I feel some folks
> might not agree with us. For example, if we have RAID-0 striping, we
> might want to schedule requests based on accumulative bandwidth used
> over all devices. Again, I myself don't agree with moving scheduling
> at a higher level just to support that.
>

Hmm, I am not very convinced that we need to do resource management at
the RAID0 device. The common case for resource management is ensuring
that a higher priority task group is not deprived of resources because
of a lower priority task group. So if there is no contention between two
task groups (at the leaf node), then I might as well give them full
access to the RAID 0 logical device without any control.

I hope people who require control at higher level devices can pitch in
now and share their perspective.

> >
> >> (2) Divide bandwidth between different cgroups without modifying each
> >> of the existing schedulers (and without replicating the code).
> >>
> >> One possible approach to handle (1) is to keep track of bandwidth
> >> utilized by each cgroup in a per cgroup data structure (instead of a
> >> per cgroup per device data structure) and use that information to make
> >> scheduling decisions within the elevator level schedulers. Such a
> >> patch can be made flag-disabled if co-ordination across different
> >> device schedulers is not required.
> >>
> >
> > Can you give more details about it. I am not sure I understand it. Exactly
> > what information should be stored in each cgroup.
> >
> > I think per cgroup per device data structures are good so that a scheduler
> > will not worry about other devices present in the system and will just try
> > to arbitrate between various cgroup contending for that device. This goes
> > back to same issue of getting rid of requirement (1) from io controller.
>
> I was thinking that we can keep track of disk time used at each
> device, and keep the cumulative number in a per cgroup data structure.
> But that is only if we want to support bandwidth division across
> devices. You and me both agree that we probably do not need to do
> that.
>
> >
> >> And (2) can probably be handled by having one scheduler support
> >> different modes. For example, one possible mode is "proportional
> >> division between cgroups + no-op between threads of a cgroup" or "cfq
> >> between cgroups + cfq between threads of a cgroup". That would also
> >> help avoid combinations which might not work e.g RT request issue
> >> mentioned earlier in this email. And this unified scheduler can re-use
> >> code from all the existing patches.
> >>
> >
> > IIUC, you are suggesting some kind of unification between four IO
> > schedulers so that proportional weight code is not replicated and user can
> > switch mode on the fly based on tunables?
>
> Yes, that seems to be a solution to avoid replication of code. But we
> should also look at any other solutions that avoid replication of
> code, and also avoid scheduling in two different layers.
> In my opinion, scheduling at two different layers is problematic because
> (a) Any buffering done at a higher level will be artificial, unless
> the queues at lower levels are completely full. And if there is no
> buffering at a higher level, any scheduling scheme would be
> ineffective.
> (b) We cannot have an arbitrary mixing and matching of higher and
> lower level schedulers.
>
> (a) would exist in any solution in which requests are queued at
> multiple levels. Can you please comment on this with respect to the
> patch that you have posted?
>

I am not very sure about the question, but in my patch, buffering at the
higher layer is irrespective of the status of the underlying queue. We
try our best to fill the underlying queue with requests, subject only to
the criterion of proportional bandwidth.

So, suppose there are two cgroups A and B and we allocate them 2000
tokens each to begin with. If A consumes all its tokens quickly and B
does not, then we will stop A from dispatching more requests and wait
for B to either issue more IO and consume its tokens or get out of
contention. This can leave the disk idle for some time. We can probably
do some optimizations here.

Thanks
Vivek
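The idle-disk scenario described in this message can be reproduced with a toy model. The sketch below is not the patch's code: the class `TokenController`, its methods, the 8-sector bio size, and the rule that a group counts as "out of tokens" once it holds a bio it cannot pay for are all assumptions made for illustration. It only follows the stated design of one token per sector, with a fresh allocation once every contending group has used up its tokens.

```python
from collections import deque


class TokenController:
    """Toy model of per-device token accounting (one token per sector)."""

    def __init__(self, weights):
        self.weights = dict(weights)                    # tokens granted per round
        self.tokens = dict(weights)                     # tokens left this round
        self.buffered = {g: deque() for g in weights}   # held-back bio sizes
        self.dispatched = {g: 0 for g in weights}       # sectors sent to disk

    def _drain(self, group):
        # Dispatch buffered bios in order while the group can pay for them.
        queue = self.buffered[group]
        while queue and self.tokens[group] >= queue[0]:
            sectors = queue.popleft()
            self.tokens[group] -= sectors
            self.dispatched[group] += sectors

    def submit(self, group, sectors):
        self.buffered[group].append(sectors)
        self._drain(group)
        # Assumption: a new round starts only when every group is holding
        # a bio it cannot pay for.
        if all(self.buffered[g] for g in self.weights):
            self.tokens = dict(self.weights)
            for g in self.weights:
                self._drain(g)


ctl = TokenController({"A": 2000, "B": 2000})
for _ in range(300):      # A wants 2400 sectors
    ctl.submit("A", 8)
for _ in range(10):       # B wants only 80 sectors
    ctl.submit("B", 8)
print(ctl.dispatched, len(ctl.buffered["A"]))
```

In this run A dispatches 2000 sectors and is then left with 50 buffered bios, while B has used only 80 of its tokens. Until B consumes the rest or leaves contention, nothing from A reaches the disk, even if the disk has nothing else to do.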
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Nauman Rafique @ 2008-11-11 19:55 UTC
To: Vivek Goyal
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer,
    dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
    jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi

On Mon, Nov 10, 2008 at 6:11 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
>> In my opinion, scheduling at two different layers is problematic because
>> (a) Any buffering done at a higher level will be artificial, unless
>> the queues at lower levels are completely full. And if there is no
>> buffering at a higher level, any scheduling scheme would be
>> ineffective.
>> (b) We cannot have an arbitrary mixing and matching of higher and
>> lower level schedulers.
>>
>> (a) would exist in any solution in which requests are queued at
>> multiple levels. Can you please comment on this with respect to the
>> patch that you have posted?
>>
>
> I am not very sure about the question, but in my patch, buffering at
> higher layer is irrespective of the status of underlying queue. We
> try our best to fill underlying queue with request, only subject to the
> criteria of proportional bandwidth.
>
> So, if there are two cgroups A and B and we allocate two cgroups 2000
> tokens each to begin with. If A has consumed all the tokens soon and B
> has not, then we will stop A from dispatching more requests and wait for
> B to either issue more IO and consume tokens or get out of contention.
> This can leave disk idle for sometime. We can probably do some
> optimizations here.

What do you think about elevator-based solutions like the 2-level cfq
patches submitted by Satoshi and Vasily earlier? CFQ can be trivially
modified to do proportional division (i.e. give time slices in
proportion to weight instead of priority). Such a solution would avoid
idleness problems like the one you mentioned above, and can also avoid
the burstiness issues of token-based schemes (see the smoothing patches,
v1.2.0 and v1.3.0, of dm-ioband).

Also, doing time-based token allocation (as you mentioned in the TODO
list) sounds very interesting. Can we look at the disk time taken by
each bio and use that to account for tokens? The problem is that the
time taken is not available when the requests are sent to disk, but we
can do delayed token charging (i.e. deduct tokens after the request is
completed). It seems that such an approach should work. What do you
think?
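The delayed token charging idea raised above can be sketched as follows. This is a hypothetical model, not code from any of the patches discussed: the class `TimeChargedGroup`, its method names, the microsecond unit, and the policy of carrying debt into the next allocation while discarding unused credit are all assumptions made for the example.

```python
class TimeChargedGroup:
    """Toy model: tokens are disk time, charged when a request completes."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.balance_us = 0     # remaining disk-time budget, may go negative
        self.in_flight = 0      # requests dispatched but not yet completed

    def refill(self, slice_us_per_weight):
        # Assumption: debt from overshooting is carried over, while any
        # unused positive balance is dropped at the start of a new round.
        self.balance_us = min(self.balance_us, 0) + self.weight * slice_us_per_weight

    def may_dispatch(self):
        # The cost of a request is unknown at dispatch time, so the only
        # check possible here is whether any budget is left.
        return self.balance_us > 0

    def on_dispatch(self):
        self.in_flight += 1

    def on_complete(self, service_time_us):
        # Delayed charging: deduct the measured disk time after completion.
        self.in_flight -= 1
        self.balance_us -= service_time_us


group = TimeChargedGroup("A", weight=2)
group.refill(1000)              # 2000 us of disk time for this round
group.on_dispatch()
group.on_complete(1500)         # 500 us left, may still dispatch
group.on_dispatch()
group.on_complete(1500)         # overshoots the budget by 1000 us
print(group.balance_us, group.may_dispatch())
```

Because the charge arrives late, a group can overshoot its budget by the cost of whatever it had in flight. Carrying the debt into the next round is one way to keep the long-run shares proportional to weight.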
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Vivek Goyal @ 2008-11-11 22:30 UTC
To: Nauman Rafique
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer,
    dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
    jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi

On Tue, Nov 11, 2008 at 11:55:53AM -0800, Nauman Rafique wrote:
> On Mon, Nov 10, 2008 at 6:11 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
> >> On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
> >> >> It seems that approaches with two level scheduling (DM-IOBand or this
> >> >> patch set on top and another scheduler at elevator) will have the
> >> >> possibility of undesirable interactions (see "issues" listed at the
> >> >> end of the second patch). For example, a request submitted as RT might
> >> >> get delayed at higher layers, even if cfq at elevator level is doing
> >> >> the right thing.
> >> >>
> >> >
> >> > Yep. Buffering of bios at higher layer can break underlying elevator's
> >> > assumptions.
> >> >
> >> > What if we start keeping track of task priorities and RT tasks in higher
> >> > level schedulers and dispatch the bios accordingly. Will it break the
> >> > underlying noop, deadline or AS?
> >>
> >> It will probably not. But then we have a cfq-like scheduler at higher
> >> level and we can agree that the combinations "cfq(higher
> >> level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
> >> would probably work. But if we implement one high level cfq-like
> >> scheduler at a higher level, we would not take care of somebody who
> >> wants noop-noop or proportional-noop. The point I am trying to make is
> >> that there is probably no single one-size-fits-all solution for a
> >> higher level scheduler. And we should limit the arbitrary mixing and
> >> matching of higher level schedulers and elevator schedulers. That
> >> being said, the existence of a higher level scheduler is still a point
> >> of debate I guess, see my comments below.
> >>
> >
> > Ya, implementing CFQ like thing in higher level scheduler will make things
> > complex.
> >
> >>
> >> >
> >> >> Moreover, if the requests in the higher level scheduler are dispatched
> >> >> as soon as they come, there would be no queuing at the higher layers,
> >> >> unless the request queue at the lower level fills up and causes a
> >> >> backlog. And in the absence of queuing, any work-conserving scheduler
> >> >> would behave as a no-op scheduler.
> >> >>
> >> >> These issues motivate to take a second look into two level scheduling.
> >> >> The main motivations for two level scheduling seem to be:
> >> >> (1) Support bandwidth division across multiple devices for RAID and LVMs.
> >> >
> >> > Nauman, can you give an example where we really need bandwidth division
> >> > for higher level devices.
> >> >
> >> > I am beginning to think that real contention is at leaf level physical
> >> > devices and not at higher level logical devices hence we should be doing
> >> > any resource management only at leaf level and not worry about higher
> >> > level logical devices.
> >> >
> >> > If this requirement goes away, then case of two level scheduler weakens
> >> > and one needs to think about doing changes at leaf level IO schedulers.
> >>
> >> I cannot agree with you more on this that there is only contention at
> >> the leaf level physical devices and bandwidth should be managed only
> >> there. But having seen earlier posts on this list, I feel some folks
> >> might not agree with us. For example, if we have RAID-0 striping, we
> >> might want to schedule requests based on accumulative bandwidth used
> >> over all devices. Again, I myself don't agree with moving scheduling
> >> at a higher level just to support that.
> >>
> >
> > Hmm.., I am not very convinced that we need to do resource management
> > at RAID0 device. The common case of resource management is that a higher
> > priority task group is not deprived of resources because of lower priority
> > task group. So if there is no contention between two task groups (At leaf
> > node), then I might as well give them full access to RAID 0
> > logical device without any control.
> >
> > Hope people who have requirement of control at higher level devices can
> > pitch in now and share their perspective.
> >
> >> >
> >> >> (2) Divide bandwidth between different cgroups without modifying each
> >> >> of the existing schedulers (and without replicating the code).
> >> >>
> >> >> One possible approach to handle (1) is to keep track of bandwidth
> >> >> utilized by each cgroup in a per cgroup data structure (instead of a
> >> >> per cgroup per device data structure) and use that information to make
> >> >> scheduling decisions within the elevator level schedulers. Such a
> >> >> patch can be made flag-disabled if co-ordination across different
> >> >> device schedulers is not required.
> >> >>
> >> >
> >> > Can you give more details about it. I am not sure I understand it. Exactly
> >> > what information should be stored in each cgroup.
> >> > > >> > I think per cgroup per device data structures are good so that an scheduer > >> > will not worry about other devices present in the system and will just try > >> > to arbitrate between various cgroup contending for that device. This goes > >> > back to same issue of getting rid of requirement (1) from io controller. > >> > >> I was thinking that we can keep track of disk time used at each > >> device, and keep the cumulative number in a per cgroup data structure. > >> But that is only if we want to support bandwidth division across > >> devices. You and me both agree that we probably do not need to do > >> that. > >> > >> > > >> >> And (2) can probably be handled by having one scheduler support > >> >> different modes. For example, one possible mode is "propotional > >> >> division between crgroups + no-op between threads of a cgroup" or "cfq > >> >> between cgroups + cfq between threads of a cgroup". That would also > >> >> help avoid combinations which might not work e.g RT request issue > >> >> mentioned earlier in this email. And this unified scheduler can re-use > >> >> code from all the existing patches. > >> >> > >> > > >> > IIUC, you are suggesting some kind of unification between four IO > >> > schedulers so that proportional weight code is not replicated and user can > >> > switch mode on the fly based on tunables? > >> > >> Yes, that seems to be a solution to avoid replication of code. But we > >> should also look at any other solutions that avoid replication of > >> code, and also avoid scheduling in two different layers. > >> In my opinion, scheduling at two different layers is problematic because > >> (a) Any buffering done at a higher level will be artificial, unless > >> the queues at lower levels are completely full. And if there is no > >> buffering at a higher level, any scheduling scheme would be > >> ineffective. > >> (b) We cannot have an arbitrary mixing and matching of higher and > >> lower level schedulers. 
> >>
> >> (a) would exist in any solution in which requests are queued at
> >> multiple levels. Can you please comment on this with respect to the
> >> patch that you have posted?
> >>
> >
> > I am not very sure about the queustion, but in my patch, buffering at
> > at higher layer is irrespective of the status of underlying queue. We
> > try our best to fill underlying queue with request, only subject to the
> > criteria of proportional bandwidth.
> >
> > So, if there are two cgroups A and B and we allocate two cgroups 2000
> > tokens each to begin with. If A has consumed all the tokens soon and B
> > has not, then we will stop A from dispatching more requests and wait for
> > B to either issue more IO and consume tokens or get out of contention.
> > This can leave disk idle for sometime. We can probably do some
> > optimizations here.
>
> What do you think about elevator based solutions like 2 level cfq
> patches submitted by Satoshi and Vasily earlier?

I have had a very high-level look at Satoshi's patch. I will go into
the details soon. My impression was that this patch solves the problem
only for CFQ. Can we create a common layer which can be shared by all
four IO schedulers?

This one common layer can take care of all the management w.r.t.
per-device, per-cgroup data structures, track all the groups and their
limits (using either a token-based or a time-based scheme), and control
the dispatch of requests.

This way we can enable the IO controller not only for CFQ but for all
the IO schedulers without duplicating too much code.

This is what I am playing around with currently. At this point I am
not sure how much common ground I can find between all the IO
schedulers.

> CFQ can be trivially
> modified to do proportional division (i.e give time slices in
> proportion to weight instead of priority).
> And such a solution would
> avoid idleness problem like the one you mentioned above.

Can you elaborate a little on how you get around the idleness problem?
If you don't create idleness, then when two tasks in two cgroups are
doing sequential IO, they might simply get into lockstep and we will
not achieve any differentiated service proportionate to their weights.

> and can also
> avoid burstiness issues (see smoothing patches -- v1.2.0 and v1.3.0 --
> of dm-ioband) in token based schemes.
>
> Also doing time based token allocation (as you mentioned in TODO list)
> sounds very interesting. Can we look at the disk time taken by each
> bio and use that to account for tokens? The problem is that the time
> taken is not available when the requests are sent to disk, but we can
> do delayed token charging (i.e deduct tokens after the request is
> completed?). It seems that such an approach should work. What do you
> think?

This is a good idea. Charging the cgroup based on the time actually
consumed should be doable. I will look into it. I think somebody asked
in the past how to account for the seek time incurred by switching
between cgroups. Maybe an average time per cgroup can help here a bit.

This is more about refining the dispatch algorithm once we have agreed
upon the other semantics, such as the 2-level scheduler and whether we
can come up with a common layer which can be shared by all four IO
schedulers. Once a common layer is possible, we can always change its
algorithm from token based to time based to achieve better accuracy.

Thanks
Vivek
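The token scheme discussed in this message (tokens handed out per epoch in proportion to weight, one token charged per sector, a fresh allocation once every contending group has run out) can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: the names `tok_group`, `tok_refill`, `tok_epoch_done`, `tok_try_dispatch` and the `TOKENS_PER_WEIGHT` scale are hypothetical and do not come from the patch, a group is allowed to overdraw on its last bio, and any debt is dropped at refill time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names come from the patch. */
struct tok_group {
	unsigned int weight;	/* relative share of the group */
	long tokens;		/* tokens left in the current epoch */
	int contending;		/* non-zero while the group has IO to issue */
};

#define TOKENS_PER_WEIGHT 2	/* assumed scale: tokens granted per unit of weight */

/* Start a new epoch: each contending group gets tokens in proportion to its weight. */
static void tok_refill(struct tok_group *g, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (g[i].contending)
			g[i].tokens = (long)g[i].weight * TOKENS_PER_WEIGHT;
}

/* Return 1 when no contending group has tokens left, i.e. the epoch is over. */
static int tok_epoch_done(const struct tok_group *g, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (g[i].contending && g[i].tokens > 0)
			return 0;
	return 1;
}

/*
 * Try to dispatch a bio of `sectors` sectors for group `idx`.
 * Returns 1 if the bio may go down now, 0 if it has to be buffered.
 * Assumption: a group with any tokens left may overdraw on its last bio.
 */
static int tok_try_dispatch(struct tok_group *g, size_t n, size_t idx, long sectors)
{
	assert(idx < n && sectors > 0);
	if (tok_epoch_done(g, n))
		tok_refill(g, n);
	if (g[idx].tokens <= 0)
		return 0;
	g[idx].tokens -= sectors;	/* one token per sector */
	return 1;
}
```

With two groups of weight 1000 (2000 tokens each, as in the A/B example above), A is refused as soon as its tokens are gone and stays refused until B has either used its own tokens or had `contending` cleared. That refusal window is the idle-disk case mentioned in the quoted text.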
[parent not found: <20081111223024.GA31527@redhat.com>]
[parent not found: <20081111223024.GA31527-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081111223024.GA31527-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-12 21:20 ` Nauman Rafique 0 siblings, 0 replies; 92+ messages in thread From: Nauman Rafique @ 2008-11-12 21:20 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Tue, Nov 11, 2008 at 2:30 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Tue, Nov 11, 2008 at 11:55:53AM -0800, Nauman Rafique wrote: >> On Mon, Nov 10, 2008 at 6:11 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> > On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote: >> >> On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> >> > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote: >> >> >> It seems that approaches with two level scheduling (DM-IOBand or this >> >> >> patch set on top and another scheduler at elevator) will have the >> >> >> possibility of undesirable interactions (see "issues" listed at the >> >> >> end of the second patch). For example, a request submitted as RT might >> >> >> get delayed at higher layers, even if cfq at elevator level is doing >> >> >> the right thing. >> >> >> >> >> > >> >> > Yep. Buffering of bios at higher layer can break underlying elevator's >> >> > assumptions. >> >> > >> >> > What if we start keeping track of task priorities and RT tasks in higher >> >> > level schedulers and dispatch the bios accordingly. Will it break the >> >> > underlying noop, deadline or AS? 
>> >> >> >> It will probably not. But then we have a cfq-like scheduler at higher >> >> level and we can agree that the combinations "cfq(higher >> >> level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq" >> >> would probably work. But if we implement one high level cfq-like >> >> scheduler at a higher level, we would not take care of somebody who >> >> wants noop-noop or propotional-noop. The point I am trying to make is >> >> that there is probably no single one-size-fits-all solution for a >> >> higher level scheduler. And we should limit the arbitrary mixing and >> >> matching of higher level schedulers and elevator schedulers. That >> >> being said, the existence of a higher level scheduler is still a point >> >> of debate I guess, see my comments below. >> >> >> > >> > Ya, implemeting CFQ like thing in higher level scheduler will make things >> > complex. >> > >> >> >> >> > >> >> >> Moreover, if the requests in the higher level scheduler are dispatched >> >> >> as soon as they come, there would be no queuing at the higher layers, >> >> >> unless the request queue at the lower level fills up and causes a >> >> >> backlog. And in the absence of queuing, any work-conserving scheduler >> >> >> would behave as a no-op scheduler. >> >> >> >> >> >> These issues motivate to take a second look into two level scheduling. >> >> >> The main motivations for two level scheduling seem to be: >> >> >> (1) Support bandwidth division across multiple devices for RAID and LVMs. >> >> > >> >> > Nauman, can you give an example where we really need bandwidth division >> >> > for higher level devices. >> >> > >> >> > I am beginning to think that real contention is at leaf level physical >> >> > devices and not at higher level logical devices hence we should be doing >> >> > any resource management only at leaf level and not worry about higher >> >> > level logical devices. 
>> >> > >> >> > If this requirement goes away, then case of two level scheduler weakens >> >> > and one needs to think about doing changes at leaf level IO schedulers. >> >> >> >> I cannot agree with you more on this that there is only contention at >> >> the leaf level physical devices and bandwidth should be managed only >> >> there. But having seen earlier posts on this list, i feel some folks >> >> might not agree with us. For example, if we have RAID-0 striping, we >> >> might want to schedule requests based on accumulative bandwidth used >> >> over all devices. Again, I myself don't agree with moving scheduling >> >> at a higher level just to support that. >> >> >> > >> > Hmm.., I am not very convinced that we need to do resource management >> > at RAID0 device. The common case of resource management is that a higher >> > priority task group is not deprived of resources because of lower priority >> > task group. So if there is no contention between two task groups (At leaf >> > node), then I might as well let them give them full access to RAID 0 >> > logical device without any control. >> > >> > Hope people who have requirement of control at higher level devices can >> > pitch in now and share their perspective. >> > >> >> > >> >> >> (2) Divide bandwidth between different cgroups without modifying each >> >> >> of the existing schedulers (and without replicating the code). >> >> >> >> >> >> One possible approach to handle (1) is to keep track of bandwidth >> >> >> utilized by each cgroup in a per cgroup data structure (instead of a >> >> >> per cgroup per device data structure) and use that information to make >> >> >> scheduling decisions within the elevator level schedulers. Such a >> >> >> patch can be made flag-disabled if co-ordination across different >> >> >> device schedulers is not required. >> >> >> >> >> > >> >> > Can you give more details about it. I am not sure I understand it. Exactly >> >> > what information should be stored in each cgroup. 
>> >> > >> >> > I think per cgroup per device data structures are good so that an scheduer >> >> > will not worry about other devices present in the system and will just try >> >> > to arbitrate between various cgroup contending for that device. This goes >> >> > back to same issue of getting rid of requirement (1) from io controller. >> >> >> >> I was thinking that we can keep track of disk time used at each >> >> device, and keep the cumulative number in a per cgroup data structure. >> >> But that is only if we want to support bandwidth division across >> >> devices. You and me both agree that we probably do not need to do >> >> that. >> >> >> >> > >> >> >> And (2) can probably be handled by having one scheduler support >> >> >> different modes. For example, one possible mode is "propotional >> >> >> division between crgroups + no-op between threads of a cgroup" or "cfq >> >> >> between cgroups + cfq between threads of a cgroup". That would also >> >> >> help avoid combinations which might not work e.g RT request issue >> >> >> mentioned earlier in this email. And this unified scheduler can re-use >> >> >> code from all the existing patches. >> >> >> >> >> > >> >> > IIUC, you are suggesting some kind of unification between four IO >> >> > schedulers so that proportional weight code is not replicated and user can >> >> > switch mode on the fly based on tunables? >> >> >> >> Yes, that seems to be a solution to avoid replication of code. But we >> >> should also look at any other solutions that avoid replication of >> >> code, and also avoid scheduling in two different layers. >> >> In my opinion, scheduling at two different layers is problematic because >> >> (a) Any buffering done at a higher level will be artificial, unless >> >> the queues at lower levels are completely full. And if there is no >> >> buffering at a higher level, any scheduling scheme would be >> >> ineffective. 
>> >> (b) We cannot have an arbitrary mixing and matching of higher and >> >> lower level schedulers. >> >> >> >> (a) would exist in any solution in which requests are queued at >> >> multiple levels. Can you please comment on this with respect to the >> >> patch that you have posted? >> >> >> > >> > I am not very sure about the queustion, but in my patch, buffering at >> > at higher layer is irrespective of the status of underlying queue. We >> > try our best to fill underlying queue with request, only subject to the >> > criteria of proportional bandwidth. >> > >> > So, if there are two cgroups A and B and we allocate two cgroups 2000 >> > tokens each to begin with. If A has consumed all the tokens soon and B >> > has not, then we will stop A from dispatching more requests and wait for >> > B to either issue more IO and consume tokens or get out of contention. >> > This can leave disk idle for sometime. We can probably do some >> > optimizations here. >> >> What do you think about elevator based solutions like 2 level cfq >> patches submitted by Satoshi and Vasily earlier? > > I have had a very high level look at Satoshi's patch. I will go into > details soon. I was thinking that this patch solves the problem only > for CFQ. Can we create a common layer which can be shared by all > the four IO schedulers. > > So this one common layer can take care of all the management w.r.t > per device per cgroup data structures and track all the groups, their > limits (either token based or time based scheme), and control the > dispatch of requests. > > This way we can enable IO controller not only for CFQ but for all the > IO schedulers without duplicating too much of code. > > This is what I am playing around with currently. At this point I am > not sure, how much of common ground I can have between all the IO > schedulers. I see your point. 
But having some common code in different schedulers is not worse than
what we have today (cfq, as, and deadline all have some common code).
Besides, each lower level (elevator level) scheduler might impose
certain requirements on higher level schedulers (e.g. the RT requests
for cfq that we talked about earlier).

>
>> CFQ can be trivially
>> modified to do proportional division (i.e give time slices in
>> proportion to weight instead of priority).
>> And such a solution would
>> avoid idleness problem like the one you mentioned above.
>
> Can you just elaborate a little on how do you get around idleness problem?
> If you don't create idleness than if two tasks in two cgroups are doing
> sequential IO, they might simply get into lockstep and we will not achieve
> any differentiated service proportionate to their weight.

I was thinking of a more cfq-like solution for proportional division
at the elevator level (i.e. not a token-based solution). There are two
options for proportional bandwidth division at the elevator level:
1) change the size of the time slice in proportion to the weights, or
2) allocate an equal time slice each time but allocate more slices to
   the cgroup with more weight.
For (2), we can actually keep track of the time taken to serve requests
and allocate time slices in such a way that the actual disk time is
proportional to the weight. We can adopt a fair-queuing
(http://lkml.org/lkml/2008/4/1/234) like approach for this if we want
to go that way.

I am not sure whether the solutions mentioned above will have the
lockstep problem you described. Since we are allocating time slices,
and would have anticipation built in (just like cfq), we would have
some level of idleness. But this idleness can be predicted based on a
thread's behavior. Can we use an AS-like algorithm for predicting idle
time before starting a new epoch in your token-based patch?
> >> and can also >> avoid burstiness issues (see smoothing patches -- v1.2.0 and v1.3.0 -- >> of dm-ioband) in token based schemes. >> >> Also doing time based token allocation (as you mentioned in TODO list) >> sounds very interesting. Can we look at the disk time taken by each >> bio and use that to account for tokens? The problem is that the time >> taken is not available when the requests are sent to disk, but we can >> do delayed token charging (i.e deduct tokens after the request is >> completed?). It seems that such an approach should work. What do you >> think? > > This is a good idea. Charging the cgroup based on time actually consumed > should be doable. I will look into it. I think in the past somebody > mentioned that how do you account for the seek time taken because of > switchover between cgroups? May be average time per cgroup can help here > a bit. > > This is more about refining the dispatch algorightm once we have agreed > upon other semantics like 2 level scheduler and can we come up with a common > layer which can be shared by all four IO schedulers. Once common layer is > possible, we can always change the common layer algorithm from token based > to time based to achive better accuracy. > > Thanks > Vivek > ^ permalink raw reply [flat|nested] 92+ messages in thread
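Option (2) from this message, equal slices handed out more often to heavier cgroups with the actual disk time charged afterwards, can be sketched as a small fair-queuing model. Everything here is an assumption made for illustration: `fq_group`, `fq_pick`, `fq_charge` and `VT_SCALE` are hypothetical names, time is an abstract integer unit, and every group is taken to be continuously backlogged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VT_SCALE 1024u	/* assumed fixed-point scale for virtual time */

/* Hypothetical per-cgroup accounting; not taken from any posted patch. */
struct fq_group {
	unsigned int weight;	/* must be > 0 */
	uint64_t vtime;		/* disk time consumed, scaled down by weight */
	uint64_t disk_time;	/* actual disk time consumed so far */
};

/* Pick the group that is furthest behind in weighted disk time. */
static size_t fq_pick(const struct fq_group *g, size_t n)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (g[i].vtime < g[best].vtime)
			best = i;
	return best;
}

/* Charge the time a slice really took, once the requests have completed. */
static void fq_charge(struct fq_group *grp, uint64_t used)
{
	assert(grp->weight > 0);
	grp->disk_time += used;
	grp->vtime += used * VT_SCALE / grp->weight;
}
```

Calling `fq_pick()` and then `fq_charge()` in a loop with a fixed slice gives a weight-2048 group twice as many slices as a weight-1024 group. Because the charge is applied after the fact, a slice that ran long or short is accounted at its real cost, which is the delayed charging idea raised earlier in the thread.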
[parent not found: <e98e18940811121320w5f321302n13b526887cbb4012@mail.gmail.com>]
[parent not found: <e98e18940811121320w5f321302n13b526887cbb4012-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <e98e18940811121320w5f321302n13b526887cbb4012-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-13 13:49 ` Fabio Checconi 0 siblings, 0 replies; 92+ messages in thread
From: Fabio Checconi @ 2008-11-13 13:49 UTC (permalink / raw)
To: Nauman Rafique
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi

Hi,

> From: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Date: Wed, Nov 12, 2008 01:20:13PM -0800
>
...
> >> CFQ can be trivially
> >> modified to do proportional division (i.e give time slices in
> >> proportion to weight instead of priority).
> >> And such a solution would
> >> avoid idleness problem like the one you mentioned above.
> >
> > Can you just elaborate a little on how do you get around idleness problem?
> > If you don't create idleness than if two tasks in two cgroups are doing
> > sequential IO, they might simply get into lockstep and we will not achieve
> > any differentiated service proportionate to their weight.
>
> I was thinking of a more cfq-like solution for proportional division
> at the elevator level (i.e. not a token based solution). There are two
> options for proportional bandwidth division at elevator level: 1)
> change the size of the time slice in proportion to the weights or 2)
> allocate equal time slice each time but allocate more slices to cgroup
> with more weight. For (2), we can actually keep track of time taken to
> serve requests and allocate time slices in such a way that the actual
> disk time is proportional to the weight. We can adopt a fair-queuing
> (http://lkml.org/lkml/2008/4/1/234) like approach for this if we want
> to go that way.
>
> I am not sure if the solutions mentioned above will have the lockstep
> problem you mentioned above or not. Since we are allocating time
> slices, and would have anticipation built in (just like cfq), we would
> have some level of idleness. But this idleness can be predicted based
> on a thread behavior.

If I understand that correctly, the problem may arise whenever you
have to deal with *synchronous* I/O, where you may not see the streams
of requests generated by tasks as continuously backlogged (and the
algorithm used to distribute bandwidth makes the implicit assumption
that they are, as in the cfq case).

A cfq-like solution with idling enabled AFAIK should not suffer from
this problem, as it creates backlog for the process being anticipated.
But anticipation is not always used, and cfq currently disables it for
SSDs and in other cases where it may hurt performance (e.g., NCQ drives
in the presence of seeky loads). So, in these cases, something still
needs to be done if we want a proportional bandwidth distribution and
don't want to pay the extra cost of idling when it's not strictly
necessary.
[parent not found: <20081106153135.790621895@redhat.com>]
[parent not found: <20081106153135.790621895-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 2/4] io controller: biocgroup implementation [not found] ` <20081106153135.790621895-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-07 2:50 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-07 2:50 UTC (permalink / raw)
To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, Hirokazu-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrea-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Righi

On Thu, 06 Nov 2008 10:30:24 -0500
vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote:

>
> o biocgroup functionality.
> o Implemented new controller "bio"
> o Most of it picked from dm-ioband biocgroup implementation patches.
>

The page_cgroup implementation has changed, and most of this patch
needs rework. Please see the latest one. (I think most of its new
characteristics will be useful for you.)

One comment from me is
==
> +struct page_cgroup {
> +	struct list_head lru; /* per cgroup LRU list */
> +	struct page *page;
> +	struct mem_cgroup *mem_cgroup;
> +	int flags;
> +#ifdef CONFIG_CGROUP_BIO
> +	struct list_head blist; /* for bio_cgroup page list */
> +	struct bio_cgroup *bio_cgroup;
> +#endif
> +};
==

This blist is too bad. Please keep this object small...

Maybe the dm-ioband people will post their own new version. Just making
use of that is one idea.

Thanks,
-Kame
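To put a rough number on the "keep this object small" comment, the two layouts of `struct page_cgroup` from the quoted hunk can be compared in user space. This is only a sketch: `struct list_head` is re-declared locally as a stand-in, the struct names `page_cgroup_base` and `page_cgroup_bio` and the helper `bio_fields_overhead()` are hypothetical, and the assumption that one such object exists per tracked page is my reading of the discussion rather than something stated in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles outside the kernel. */
struct list_head { struct list_head *next, *prev; };
struct page;
struct mem_cgroup;
struct bio_cgroup;

/* Layout of the quoted struct without CONFIG_CGROUP_BIO. */
struct page_cgroup_base {
	struct list_head lru;
	struct page *page;
	struct mem_cgroup *mem_cgroup;
	int flags;
};

/* Layout of the quoted struct with CONFIG_CGROUP_BIO set. */
struct page_cgroup_bio {
	struct list_head lru;
	struct page *page;
	struct mem_cgroup *mem_cgroup;
	int flags;
	struct list_head blist;		/* the field objected to in the review */
	struct bio_cgroup *bio_cgroup;
};

/* Extra bytes needed to track npages pages once the bio fields are added. */
static size_t bio_fields_overhead(size_t npages)
{
	assert(sizeof(struct page_cgroup_bio) > sizeof(struct page_cgroup_base));
	return npages *
	       (sizeof(struct page_cgroup_bio) - sizeof(struct page_cgroup_base));
}
```

On a typical LP64 build the difference should come to 24 bytes per object, 16 of them for `blist` alone, so the cost grows linearly with the number of pages being tracked.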
[parent not found: <20081107115030.7ccf3f07.kamezawa.hiroyu@jp.fujitsu.com>]
[parent not found: <20081107115030.7ccf3f07.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>]
* Re: [patch 2/4] io controller: biocgroup implementation [not found] ` <20081107115030.7ccf3f07.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> @ 2008-11-07 4:19 ` Hirokazu Takahashi 2008-11-07 14:44 ` Vivek Goyal 1 sibling, 0 replies; 92+ messages in thread
From: Hirokazu Takahashi @ 2008-11-07 4:19 UTC (permalink / raw)
To: kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi,

I'm going to release a new version of bio_cgroup soon, which doesn't
have "struct list_head blist" anymore and whose overhead is minimized.

> > o biocgroup functionality.
> > o Implemented new controller "bio"
> > o Most of it picked from dm-ioband biocgroup implementation patches.
> >
> page_cgroup implementation is changed and most of this patch needs rework.
> please see the latest one. (I think most of new characteristics are useful
> for you.)
>
> One comment from me is
> ==
> > +struct page_cgroup {
> > +	struct list_head lru; /* per cgroup LRU list */
> > +	struct page *page;
> > +	struct mem_cgroup *mem_cgroup;
> > +	int flags;
> > +#ifdef CONFIG_CGROUP_BIO
> > +	struct list_head blist; /* for bio_cgroup page list */
> > +	struct bio_cgroup *bio_cgroup;
> > +#endif
> > +};
> ==
>
> this blist is too bad. please keep this object small...
>
> Maybe dm-ioband people will post his own new one. just making use of it is an idea.
>
> Thanks,
> -Kame
>
* Re: [patch 2/4] io controller: biocgroup implementation [not found] ` <20081107115030.7ccf3f07.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> 2008-11-07 4:19 ` Hirokazu Takahashi @ 2008-11-07 14:44 ` Vivek Goyal 1 sibling, 0 replies; 92+ messages in thread
From: Vivek Goyal @ 2008-11-07 14:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi

On Fri, Nov 07, 2008 at 11:50:30AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 06 Nov 2008 10:30:24 -0500
> vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote:
>
> >
> > o biocgroup functionality.
> > o Implemented new controller "bio"
> > o Most of it picked from dm-ioband biocgroup implementation patches.
> >
> page_cgroup implementation is changed and most of this patch needs rework.
> please see the latest one. (I think most of new characteristics are useful
> for you.)
>

Sure, I will have a look.

> One comment from me is
> ==
> > +struct page_cgroup {
> > +	struct list_head lru; /* per cgroup LRU list */
> > +	struct page *page;
> > +	struct mem_cgroup *mem_cgroup;
> > +	int flags;
> > +#ifdef CONFIG_CGROUP_BIO
> > +	struct list_head blist; /* for bio_cgroup page list */
> > +	struct bio_cgroup *bio_cgroup;
> > +#endif
> > +};
> ==
>
> this blist is too bad. please keep this object small...
>

This is just another linking element so that a page_cgroup can be on
a second list as well. It is useful for making sure that IO on all the
pages of a bio group has completed before that bio cgroup is deleted.

> Maybe dm-ioband people will post his own new one. just making use of it is an idea.

Sure, when the dm-ioband people post a new version of the patch I will
have a look at it and at how they have optimized it further.

Thanks
Vivek
[parent not found: <20081106153135.869625751@redhat.com>]
[parent not found: <20081106153135.869625751-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 3/4] io controller: Core IO controller implementation logic [not found] ` <20081106153135.869625751-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-07 3:21 ` KAMEZAWA Hiroyuki 2008-11-11 8:50 ` Gui Jianfeng 1 sibling, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-07 3:21 UTC (permalink / raw)
To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, Hirokazu-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrea-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Righi

On Thu, 06 Nov 2008 10:30:25 -0500
vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote:

>
> o Core IO controller implementation
>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>

Two comments after a quick look.

- I don't recommend using the generic work queue. Adding more stacked
  dependencies between "work" items is not good. (I think disk drivers
  use "work" for their own jobs.)

- It seems this bio-cgroup can queue bios without any limit, so a
  process can keep submitting IO until it causes an OOM.
  (IIUC, the dirty bit of the page is cleared when the I/O is submitted,
  so dirty_ratio can't help us.)
  Please add "wait for congestion by sleeping" code in bio-cgroup.

Thanks,
-Kame
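The second review point, bounding how many bios a group may buffer and putting the submitter to sleep beyond that, might look roughly like the model below. It is a single-threaded user-space sketch with hypothetical names (`bio_group_q`, `bgq_try_queue`, `bgq_dispatch_one`, `max_queued`); it only tracks the counters and reports when a submitter would have to sleep or be woken. In the kernel the actual blocking would presumably be done with a wait queue, which is not shown here.

```c
#include <assert.h>

/* Hypothetical per-group queue accounting; not taken from the patch. */
struct bio_group_q {
	unsigned int nr_queued;		/* bios currently buffered for the group */
	unsigned int max_queued;	/* assumed per-group limit */
	unsigned int nr_sleepers;	/* submitters waiting for room */
};

/* Returns 1 if the bio was buffered, 0 if the submitter has to sleep. */
static int bgq_try_queue(struct bio_group_q *q)
{
	if (q->nr_queued >= q->max_queued) {
		q->nr_sleepers++;
		return 0;
	}
	q->nr_queued++;
	return 1;
}

/*
 * Called when one buffered bio is dispatched to the lower layer.
 * Returns 1 if a sleeping submitter should be woken to retry.
 */
static int bgq_dispatch_one(struct bio_group_q *q)
{
	assert(q->nr_queued > 0);
	q->nr_queued--;
	if (q->nr_sleepers > 0) {
		q->nr_sleepers--;
		return 1;
	}
	return 0;
}
```

Because the buffered count can never exceed `max_queued`, memory held by queued bios stays bounded per group, which addresses the OOM concern without relying on dirty_ratio.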
* Re: [patch 3/4] io controller: Core IO controller implementation logic [not found] ` <20081106153135.869625751-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2008-11-07 3:21 ` [patch 3/4] io controller: Core IO controller implementation logic KAMEZAWA Hiroyuki @ 2008-11-11 8:50 ` Gui Jianfeng 1 sibling, 0 replies; 92+ messages in thread From: Gui Jianfeng @ 2008-11-11 8:50 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: Hi vivek, I think bio_group_controller() need to be exported by EXPORT_SYMBOL() -- Regards Gui Jianfeng ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081107122145.69500cd3.kamezawa.hiroyu@jp.fujitsu.com>]
[parent not found: <20081107122145.69500cd3.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>]
* Re: [patch 3/4] io controller: Core IO controller implementation logic [not found] ` <20081107122145.69500cd3.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> @ 2008-11-07 14:50 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-07 14:50 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Fri, Nov 07, 2008 at 12:21:45PM +0900, KAMEZAWA Hiroyuki wrote: > On Thu, 06 Nov 2008 10:30:25 -0500 > vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: > > > > > o Core IO controller implementation > > > > Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > 2 comments after a quick look. > > - I don't recommend generic work queue. More stacked dependency between "work" > is not good. (I think disk-driver uses "work" for their jobs.) Sorry, I did not get this. Are you recommending that don't create a new work queue, instead use existing work queue (say kblockd) to submit the bios here? I will look into it. I was little worried about a kblockd being overworked in case of too many logical devices enabling IO controller. > > - It seems this bio-cgroup can queue the bio to infinite. Then, a process can submit > io unitl cause OOM. > (IIUC, Dirty bit of the page is cleared at submitting I/O. > Then dirty_ratio can't help us.) > please add "wait for congestion by sleeping" code in bio-cgroup. Yes, you are right. I need to put some kind of control on max number of bios I can queue on a cgroup and after crossing the limit, I should put the submitting task to sleep. 
(Something like the request-descriptor style of flow control implemented by elevators.) Thanks Vivek ^ permalink raw reply [flat|nested] 92+ messages in thread
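The flow control agreed on above (cap the number of bios buffered per cgroup and put the submitter to sleep past the cap) can be sketched in user space as follows. This is only an illustration under stated assumptions: struct bio_group_sim, its fields and both function names are hypothetical and do not come from the patch, and a kernel version would presumably sleep on a wait queue rather than return a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cgroup accounting; names are illustrative only. */
struct bio_group_sim {
	unsigned int nr_queued;		/* bios currently buffered for this group */
	unsigned int max_queued;	/* cap after which submitters must wait */
};

/*
 * Try to buffer one bio for the group.
 * Returns true if it was buffered, false if the submitting task
 * should sleep until a buffered bio has been dispatched.
 */
bool bio_group_try_queue(struct bio_group_sim *bg)
{
	if (bg->nr_queued >= bg->max_queued)
		return false;
	bg->nr_queued++;
	return true;
}

/*
 * Account for one buffered bio being dispatched to the lower layers.
 * Returns true if the group was at its cap before the dispatch, which
 * is the point where a sleeping submitter would be woken.
 */
bool bio_group_dispatch_one(struct bio_group_sim *bg)
{
	bool was_full;

	if (bg->nr_queued == 0)
		return false;
	was_full = bg->nr_queued >= bg->max_queued;
	bg->nr_queued--;
	return was_full;
}
```

With a cap of 2, the third submission is refused until one buffered bio has been dispatched, which bounds the memory pinned by a single cgroup.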
[parent not found: <20081107145036.GF21884@redhat.com>]
[parent not found: <20081107145036.GF21884-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 3/4] io controller: Core IO controller implementation logic [not found] ` <20081107145036.GF21884-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-08 2:35 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 92+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-11-08 2:35 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi Vivek Goyal said: > On Fri, Nov 07, 2008 at 12:21:45PM +0900, KAMEZAWA Hiroyuki wrote: >> On Thu, 06 Nov 2008 10:30:25 -0500 >> vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: >> >> > >> > o Core IO controller implementation >> > >> > Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> > >> >> 2 comments after a quick look. >> >> - I don't recommend generic work queue. More stacked dependency between >> "work" >> is not good. (I think disk-driver uses "work" for their jobs.) > > Sorry, I did not get this. Are you recommending that don't create a new > work queue, instead use existing work queue (say kblockd) to submit the > bios > here? > Ah, no, I am recommending a new workqueue of its own. I'm sorry, it seems I missed something when reading your patch. (Other people may have other opinions here ;) > I will look into it. I was little worried about a kblockd being overworked > in case of too many logical devices enabling IO controller. > Thanks, -Kame ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <4913A9C2.8060904@cn.fujitsu.com>]
[parent not found: <4913A9C2.8060904-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <4913A9C2.8060904-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> @ 2008-11-07 13:38 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-07 13:38 UTC (permalink / raw) To: Gui Jianfeng Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Fri, Nov 07, 2008 at 10:36:50AM +0800, Gui Jianfeng wrote: > vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: > > Hi, > > > > If you are not already tired of so many io controller implementations, here > > is another one. > > > > This is a very eary very crude implementation to get early feedback to see > > if this approach makes any sense or not. > > > > This controller is a proportional weight IO controller primarily > > based on/inspired by dm-ioband. One of the things I personally found little > > odd about dm-ioband was need of a dm-ioband device for every device we want > > to control. I thought that probably we can make this control per request > > queue and get rid of device mapper driver. This should make configuration > > aspect easy. > > > > I have picked up quite some amount of code from dm-ioband especially for > > biocgroup implementation. > > > > I have done very basic testing and that is running 2-3 dd commands in different > > cgroups on x86_64. Wanted to throw out the code early to get some feedback. > > > > More details about the design and how to are in documentation patch. > > > > Your comments are welcome. > > Which kernel version is this patch set based on? > 2.6.27 Thanks Vivek ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081106153135.743458085@redhat.com>]
[parent not found: <20081107113209.a6011c67.kamezawa.hiroyu@jp.fujitsu.com>]
[parent not found: <20081107113209.a6011c67.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>]
* Re: [patch 1/4] io controller: documentation [not found] ` <20081107113209.a6011c67.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> @ 2008-11-07 14:27 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-07 14:27 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Fri, Nov 07, 2008 at 11:32:09AM +0900, KAMEZAWA Hiroyuki wrote: > On Thu, 06 Nov 2008 10:30:23 -0500 > vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: > > +ISSUES > > +====== > > +- IO controller can buffer the bios if suffcient tokens were not available > > + at the time of bio submission. Once the tokens are available, these bios > > + are dispatched to elevator/lower layers in first come first serve manner. > > + And this has potential to break CFQ where a RT tasks should be able to > > + dispatch the bio first or a high priority task should be able to release > > + more bio as compared to low priority task in same cgroup. > > + > > + Not sure how to fix it. May be we need to maintain another rb-tree and > > + keep track of RT tasks and tasks priorities and dispatch accordingly. This > > + is equivalent of duplicating lots of CFQ logic and not sure how would it > > + impact AS behaviour. > > > Why you don't isolate RT tasks into other cgroup ? > /cgroup/bio-cgroup/group_for_usual/...usual tasks. > /group_for_RT/ ...RT tasks. you can use high-speed path. > > How about adding RT flag to bio-cgroup and skip buffering at bio-cgroup if RT > flag is set ? I think handling an usual process and RT process in "a" cgroup > just makes the code complex. 
> > Looking into a cpu-scheduler, which is the first module handling RT, it has > some tweaks to handle RT in the system. > - special RT scheduler. > - isolated RT domain > - maximum execution time allowed to RT > .... > > Maybe handling RT in following way is usual way...(if we do something in this layer) > > - Allow RT-bio-cgroup to skip limit check. > - But RT-bio-cgroup calculates io-throughput, execution time, statistics... > - When RT tasks in RT-bio-cgroup does excessive I/O which starves the whole system > too long, raise safeguard-limiter. and tell users Warning or kill it. > > Hmm ? Hi Kame, Looking at CFQ, there are two issues. - RT tasks (and RT priorities within that) - Best Effort class tasks (and priorities within that). To make sure we don't break the underlying CFQ elevator, we need to take care of both in the higher level scheduler. In practice this means we will end up copying code from CFQ into the higher level scheduler. Even if I do that (keep track of RT tasks and Best Effort class tasks and their priorities and dispatch accordingly), I am not sure whether it will have a negative impact when a user is using the AS IO scheduler on the leaf node. That makes me think that if we can let go of the idea of doing proportionate bandwidth allocation at the higher level logical device and just do it for leaf nodes, then we can drop the two level scheduler and try to unify the four IO schedulers so that, with the least code copying, they can also support proportional weight policies. Thanks Vivek ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081106153135.743458085-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 1/4] io controller: documentation [not found] ` <20081106153135.743458085-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-07 2:32 ` KAMEZAWA Hiroyuki 2008-11-07 3:46 ` KAMEZAWA Hiroyuki 2008-11-10 2:48 ` Li Zefan 2 siblings, 0 replies; 92+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-11-07 2:32 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, Hirokazu-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrea-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Righi On Thu, 06 Nov 2008 10:30:23 -0500 vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: > +ISSUES > +====== > +- IO controller can buffer the bios if suffcient tokens were not available > + at the time of bio submission. Once the tokens are available, these bios > + are dispatched to elevator/lower layers in first come first serve manner. > + And this has potential to break CFQ where a RT tasks should be able to > + dispatch the bio first or a high priority task should be able to release > + more bio as compared to low priority task in same cgroup. > + > + Not sure how to fix it. May be we need to maintain another rb-tree and > + keep track of RT tasks and tasks priorities and dispatch accordingly. This > + is equivalent of duplicating lots of CFQ logic and not sure how would it > + impact AS behaviour. > Why you don't isolate RT tasks into other cgroup ? /cgroup/bio-cgroup/group_for_usual/...usual tasks. /group_for_RT/ ...RT tasks. you can use high-speed path. How about adding RT flag to bio-cgroup and skip buffering at bio-cgroup if RT flag is set ? 
I think handling a usual process and an RT process in "a" cgroup just makes the code complex. Looking at the cpu-scheduler, which is the first module handling RT, it has some tweaks to handle RT in the system. - special RT scheduler. - isolated RT domain - maximum execution time allowed to RT .... Maybe handling RT in the following way is the usual way... (if we do something in this layer) - Allow RT-bio-cgroup to skip the limit check. - But RT-bio-cgroup still calculates io-throughput, execution time, statistics... - When RT tasks in RT-bio-cgroup do excessive I/O which starves the whole system for too long, raise a safeguard-limiter and warn the user or kill it. Hmm ? Thanks, -Kame ^ permalink raw reply [flat|nested] 92+ messages in thread
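The RT proposal in the message above (an RT group skips the token check, is still accounted, and trips a safeguard when it does too much I/O) could look roughly like the following user-space sketch. Every name here is hypothetical: the enum, the struct, its fields and bio_group_decide() are assumptions made for illustration, and the sector-count safeguard is just one possible stand-in for "starves the whole system too long".

```c
#include <assert.h>
#include <stdbool.h>

/* Possible outcomes for a submitted bio (illustrative names). */
enum bio_verdict { BIO_DISPATCH, BIO_BUFFER, BIO_DISPATCH_WARN };

struct bio_group_rt_sim {
	bool rt;			/* hypothetical RT flag on the group */
	unsigned long tokens;		/* tokens left in the current slice */
	unsigned long sectors_done;	/* statistics, kept for RT groups too */
	unsigned long rt_safeguard;	/* sectors after which an RT group is flagged */
};

/* Decide what to do with a bio of the given size, one token per sector. */
enum bio_verdict bio_group_decide(struct bio_group_rt_sim *bg,
				  unsigned long sectors)
{
	if (bg->rt) {
		/* RT groups skip the limit check but are still accounted. */
		bg->sectors_done += sectors;
		if (bg->sectors_done > bg->rt_safeguard)
			return BIO_DISPATCH_WARN;
		return BIO_DISPATCH;
	}

	/* Ordinary groups are buffered once their tokens run out. */
	if (bg->tokens < sectors)
		return BIO_BUFFER;
	bg->tokens -= sectors;
	bg->sectors_done += sectors;
	return BIO_DISPATCH;
}
```

What a caller does on BIO_DISPATCH_WARN (warn the user, throttle, or kill the task) is left open here, as it is in the message.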
* Re: [patch 1/4] io controller: documentation [not found] ` <20081106153135.743458085-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2008-11-07 2:32 ` KAMEZAWA Hiroyuki @ 2008-11-07 3:46 ` KAMEZAWA Hiroyuki 2008-11-10 2:48 ` Li Zefan 2 siblings, 0 replies; 92+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-11-07 3:46 UTC (permalink / raw) To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA I forget to say that I like your new design in general ;) Thanks, -Kame On Thu, 06 Nov 2008 10:30:23 -0500 vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: > > Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > Index: linux17/Documentation/controllers/io-controller.txt > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux17/Documentation/controllers/io-controller.txt 2008-11-06 09:12:44.000000000 -0500 > @@ -0,0 +1,172 @@ > + IO Controller > + ============ > + > +Design > +===== > +This patchset implements a basic version of proportional weight IO controller. > +It is heavily derived from dm-ioband IO controller with one key difference > +and that is, there is no separate device mapper driver and there is no > +need to create a dm-ioband device on top of every block device which needs > +to do the IO control. In this implementation, all the control logic has > +been internalized and has been made per request queue. Enabling or disabling > +IO control on a block device is just a matter of writing a 0 or 1 in > +appropriate sysfs file. > + > +This is a proportional weight controller and that means various cgroups > +are assigned shares and tasks in those cgroups get to dispatch the bio > +in proportion to their cgroup share. > + > +All the contending cgroups are assigned tokens proportionate to their > +weights. One token is charged for one sector of IO. 
Once all the contending > +cgroups have consumed their tokens, fresh token allocation takes place and > +this is how disk bandwidth allocation proportion to weight is achieved. > + > +The bigger picture is that all the bios being submitted to a block device > +are first inspected by IO controller logic (bio_group_controller()), only if > +IO controller has been enabled on that device. The cgroup of the bio is > +determined and controller checks if this cgroup has sufficient tokens to > +dispatch the bio. If sufficient tokens are there, bio submitting thread > +continues to dispatch the bio through normal path otherwise IO controller > +buffers the bio and submitting thread returns back. These buffered bios > +are dispatched to lower layers later once the associate group (bio group) > +has sufficient tokens to dispatch the bios. This delayed dispatching is > +done with the help of a worker thread (biogroup). > + > +IO control can be enabled/disabled dynamically on any of the block device > +through sysfs file system. For example, to enable IO control on a device > +do following. > + > +echo 1 > /sys/block/sda/biogroup > + > +To disable IO control write 0. > + > +echo 0 > /sys/block/sda/biogroup > + > +This should be doable for any of the block device in the stack. Currently this > +patch places the hooks only for device mapper driver and still need to tweak > +md. > + > +For example, assume there are two cgroups A and B with weights 1024 and 2048 > +in the system. Tasks in two cgroups A and B are doing IO to two disks sda and > +sdb in the system. A user has enabled IO control on both sda and sdb. Now on > +both sda and sdb, tasks in cgroup B will get to use 2/3 of disk BW and > +tasks in cgroup A will get to use 1/3 of disk bandwidth, only in case of > +contention. If tasks in any of the groups stop doing IO to a particular disk, > +task in other group will get to use full disk BW for that duration. 
> + > + > +HOWTO > +==== > +- Enable cgroup, memory controller and block IO controller in kernel config > + file. > + > +- Boot into the kernel and mount io controller. > + > + mount -t cgroup -o bio none /cgroup/bio/ > + > +- Create two cgroups test1 and test2 > + > + cd /cgroup/bio > + mkdir test1 test2 > + > +- Allocate weight 4096 to test1 and weight 2048 to test2 > + > + echo 4096 > /cgroup/bio/test1/bio.shares > + echo 2048 > /cgroup/bio/test2/bio.shares > + > +- Launch "dd" operations in cgroup test1 and test2. > + > + echo $$ > /cgroup/bio/test1/tasks > + dd if=/somefile1 of=/dev/null > + echo $$ > /cgroup/bio/test2/tasks > + dd if=/somefile2 of=/dev/null > + > +Job in cgroup test1 should finish before job in cgroup test2. To verify > +that "dd" in cgroup test1 got to dispatch more bio as compared to "dd" in > +test2, look at "bio.aggregate_tokens" in both the cgroup (At same time). At > +any point of time when both the dd's are running, aggregate_tokens in cgroup > +test1 should be approximately double of aggregate_tokens in cgroup test2 > +(Because weight of cgroup test1 is double of weight of cgroup test2). > + > +Some Tunables > +============= > +Some tunables appear in cgroup file system and in sysfs for respective > +device for debug and for configuration. Here is a brief description. > + > +Cgroup Files > +============ > +bio.shares > + Specifies the weight of the cgroup. > + > +bio.aggregate_tokens > + Specifies total number of tokens dispatched by this cgroup. One token > + represents one sector of IO. > + > +bio.jiffies > + What was the jiffies values when last bio from this cgroup was released. > + > +bio.nr_token_slices > + How many times this cgroup got the token allocation done from token > + slice. We kind of create a token slice and every contending cgroup > + gets the pie out of the slice based on the share. > + > +bio.nr_off_the_tree > + How many times this bio group went off the list of contending groups.
> + We maintain an rb-tree of biogroups contending for IO and token > + allocation takes place to these groups regularly. If some group stops > + doing IO then it is considered to be idle and removed from the tree > + and added back later when group has IO to perform. This file just > + counts how many times this bio group went off the tree. > + > +Sysfs Tunables > +============== > +/sys/block/{device name}/biogroup > + Whether IO controller (bio groups) are active on this device or not. > + > +/sys/block/{device name}/deftoken > + Default number of tokens which are given to a bio group upon start > + if all the bio groups were of same weight. token slice is of dynamic > + length. So if there are 3 cgroups contending and deftoken is 100 then > + token slice length will be 100*3 = 300 and now out of this slice > + three groups will get the tokens based on their weights. > + > +/sys/block/{device name}/idletime > + The time after which if a bio group does not generate the bio, it is > + considered idle and removed from the rb-tree. Currently by default it > + is 8ms. > + > +/sys/block/{device name}/newslice_count > + How many times new token allocation took place on this queue. > + > +TODO > +==== > +- Do extensive testing in various scenarios and do performance optimization > + and fix the things where broken. > + > +- IO schedulers derive context information from "current". This assumption > + will be broken if bios are being submitted by a worker thread (biogroup). > + Probably we need to put io context pointer in bio itself to get rid of > + this dependency. > + > +- Allocating tokens for per sector of IO is crude approximation and will lead > + to unfair bandwidth allocation in case task in cgroup is doing sequential IO > + and task in other group is doing random IO. Rik Van Riel, suggested that > + probably we should switch to time based scheme. Keep a track of average time > + it takes to complete IO from a cgroup and do the allocation accordingly.
> + > +- Currently this controller is dependent on memory controller being enabled. > + Try to reduce this coupling. > + > +ISSUES > +====== > +- IO controller can buffer the bios if suffcient tokens were not available > + at the time of bio submission. Once the tokens are available, these bios > + are dispatched to elevator/lower layers in first come first serve manner. > + And this has potential to break CFQ where a RT tasks should be able to > + dispatch the bio first or a high priority task should be able to release > + more bio as compared to low priority task in same cgroup. > + > + Not sure how to fix it. May be we need to maintain another rb-tree and > + keep track of RT tasks and tasks priorities and dispatch accordingly. This > + is equivalent of duplicating lots of CFQ logic and not sure how would it > + impact AS behaviour. > > -- > > _______________________________________________ > Containers mailing list > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org > https://lists.linux-foundation.org/mailman/listinfo/containers > ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 1/4] io controller: documentation [not found] ` <20081106153135.743458085-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2008-11-07 2:32 ` KAMEZAWA Hiroyuki 2008-11-07 3:46 ` KAMEZAWA Hiroyuki @ 2008-11-10 2:48 ` Li Zefan 2 siblings, 0 replies; 92+ messages in thread From: Li Zefan @ 2008-11-10 2:48 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi Hi, Vivek Thanks for your work. :) A question below. > +For example, assume there are two cgroups A and B with weights 1024 and 2048 > +in the system. Tasks in two cgroups A and B are doing IO to two disks sda and > +sdb in the system. A user has enabled IO control on both sda and sdb. Now on > +both sda and sdb, tasks in cgroup B will get to use 2/3 of disk BW and > +tasks in cgroup A will get to use 1/3 of disk bandwidth, only in case of > +contention. If tasks in any of the groups stop doing IO to a particular disk, > +task in other group will get to use full disk BW for that duration. > + So in this example, I can't assign 1/3 of sda's disk BW while 2/3 of sdb's disk BW to A. Am I right? ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <4917A116.7040603@cn.fujitsu.com>]
[parent not found: <4917A116.7040603-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>]
* Re: [patch 1/4] io controller: documentation [not found] ` <4917A116.7040603-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> @ 2008-11-10 13:44 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-10 13:44 UTC (permalink / raw) To: Li Zefan Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Mon, Nov 10, 2008 at 10:48:54AM +0800, Li Zefan wrote: > Hi, Vivek > > Thanks for your work. :) > > A question below. > > > +For example, assume there are two cgroups A and B with weights 1024 and 2048 > > +in the system. Tasks in two cgroups A and B are doing IO to two disks sda and > > +sdb in the system. A user has enabled IO control on both sda and sdb. Now on > > +both sda and sdb, tasks in cgroup B will get to use 2/3 of disk BW and > > +tasks in cgroup A will get to use 1/3 of disk bandwidth, only in case of > > +contention. If tasks in any of the groups stop doing IO to a particular disk, > > +task in other group will get to use full disk BW for that duration. > > + > > So in this example, I can't assign 1/3 of sda's disk BW while 2/3 of sdb's > disk BW to A. Am I right? Hi, Yes, you are right. Currently policies are global and not per disk. So in the above example, cgroup A will get 1/3 of the disk BW on both sda and sdb, and one cannot configure things so that A gets 1/3 of the BW on sda and 2/3 of the BW on sdb. I think assigning cgroup weights per disk should be doable, but personally I think it makes configuration complex and I am not sure it is really a very useful feature. Thanks Vivek ^ permalink raw reply [flat|nested] 92+ messages in thread
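To make the "weights are global" answer above concrete, here is a small user-space sketch of how one token slice might be split. It assumes the slice is deftoken times the number of contending groups, as the documentation patch describes, and that it is divided strictly in proportion to weight with integer division; the actual patch may round or schedule differently. The function name token_slice_split() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split one token slice among n contending groups by weight.
 * The slice is deftoken * n tokens; group i receives
 * slice * weights[i] / sum(weights), rounded down.
 */
void token_slice_split(const unsigned long *weights, size_t n,
		       unsigned long deftoken, unsigned long *tokens)
{
	unsigned long total_weight = 0;
	unsigned long slice = deftoken * n;
	size_t i;

	for (i = 0; i < n; i++)
		total_weight += weights[i];

	for (i = 0; i < n; i++)
		tokens[i] = total_weight ? slice * weights[i] / total_weight : 0;
}
```

Because the same weights array (1024 for A, 2048 for B) would be used for every request queue, running this once for sda and once for sdb gives A one third of the slice on both disks, which is the limitation being asked about.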
[parent not found: <20081106153022.215696930-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081106153022.215696930-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-06 15:49 ` Peter Zijlstra 2008-11-07 2:36 ` Gui Jianfeng 2008-11-13 9:05 ` Ryo Tsuruta 2 siblings, 0 replies; 92+ messages in thread From: Peter Zijlstra @ 2008-11-06 15:49 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi On Thu, 2008-11-06 at 10:30 -0500, vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: > Hi, > > If you are not already tired of so many io controller implementations, here > is another one. > > This is a very eary very crude implementation to get early feedback to see > if this approach makes any sense or not. > > This controller is a proportional weight IO controller primarily > based on/inspired by dm-ioband. One of the things I personally found little > odd about dm-ioband was need of a dm-ioband device for every device we want > to control. I thought that probably we can make this control per request > queue and get rid of device mapper driver. This should make configuration > aspect easy. > > I have picked up quite some amount of code from dm-ioband especially for > biocgroup implementation. > > I have done very basic testing and that is running 2-3 dd commands in different > cgroups on x86_64. Wanted to throw out the code early to get some feedback. > > More details about the design and how to are in documentation patch. > > Your comments are welcome. 
please include QUILT_REFRESH_ARGS="--diffstat --strip-trailing-whitespace" in your environment or .quiltrc I would expect all those bio* files to be placed in block/ not mm/ Does this still require I use dm, or does it also work on regular block devices? Patch 4/4 isn't quite clear on this. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081106153022.215696930-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2008-11-06 15:49 ` [patch 0/4] [RFC] Another proportional weight IO controller Peter Zijlstra @ 2008-11-07 2:36 ` Gui Jianfeng 2008-11-13 9:05 ` Ryo Tsuruta 2 siblings, 0 replies; 92+ messages in thread From: Gui Jianfeng @ 2008-11-07 2:36 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Rik van Riel, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jeff Moyer, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Andrea Righi vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote: > Hi, > > If you are not already tired of so many io controller implementations, here > is another one. > > This is a very eary very crude implementation to get early feedback to see > if this approach makes any sense or not. > > This controller is a proportional weight IO controller primarily > based on/inspired by dm-ioband. One of the things I personally found little > odd about dm-ioband was need of a dm-ioband device for every device we want > to control. I thought that probably we can make this control per request > queue and get rid of device mapper driver. This should make configuration > aspect easy. > > I have picked up quite some amount of code from dm-ioband especially for > biocgroup implementation. > > I have done very basic testing and that is running 2-3 dd commands in different > cgroups on x86_64. Wanted to throw out the code early to get some feedback. > > More details about the design and how to are in documentation patch. > > Your comments are welcome. Which kernel version is this patch set based on? 
> > Thanks > Vivek > -- Regards Gui Jianfeng ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081106153022.215696930-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2008-11-06 15:49 ` [patch 0/4] [RFC] Another proportional weight IO controller Peter Zijlstra 2008-11-07 2:36 ` Gui Jianfeng @ 2008-11-13 9:05 ` Ryo Tsuruta 2 siblings, 0 replies; 92+ messages in thread From: Ryo Tsuruta @ 2008-11-13 9:05 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w Hi, From: vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org Subject: [patch 0/4] [RFC] Another proportional weight IO controller Date: Thu, 06 Nov 2008 10:30:22 -0500 > Hi, > > If you are not already tired of so many io controller implementations, here > is another one. > > This is a very eary very crude implementation to get early feedback to see > if this approach makes any sense or not. > > This controller is a proportional weight IO controller primarily > based on/inspired by dm-ioband. One of the things I personally found little > odd about dm-ioband was need of a dm-ioband device for every device we want > to control. I thought that probably we can make this control per request > queue and get rid of device mapper driver. This should make configuration > aspect easy. > > I have picked up quite some amount of code from dm-ioband especially for > biocgroup implementation. > > I have done very basic testing and that is running 2-3 dd commands in different > cgroups on x86_64. Wanted to throw out the code early to get some feedback. 
>
> More details about the design and how to are in documentation patch.
>
> Your comments are welcome.

Do you have any benchmark results? I'm especially interested in the
following:

- A comparison of disk performance with and without the I/O controller
  patch.
- Uneven I/O loads: processes that belong to a cgroup with a smaller
  weight than another cgroup generate the heavier I/O load, like the
  following.

  echo 1024 > /cgroup/bio/test1/bio.shares
  echo 8192 > /cgroup/bio/test2/bio.shares

  echo $$ > /cgroup/bio/test1/tasks
  dd if=/somefile1-1 of=/dev/null &
  dd if=/somefile1-2 of=/dev/null &
  ...
  dd if=/somefile1-100 of=/dev/null

  echo $$ > /cgroup/bio/test2/tasks
  dd if=/somefile2-1 of=/dev/null &
  dd if=/somefile2-2 of=/dev/null &
  ...
  dd if=/somefile2-10 of=/dev/null &

Thanks,
Ryo Tsuruta

^ permalink raw reply [flat|nested] 92+ messages in thread
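For the uneven-load test above, a small helper for checking results may be useful. The sketch below assumes that, while both cgroups stay backlogged, an ideal proportional weight controller would split bandwidth in the ratio of the bio.shares values. The function names and the measured byte counts are hypothetical; they are not benchmark results from this thread.

```python
from fractions import Fraction


def expected_shares(weights):
    """Ideal bandwidth fraction per cgroup, assuming all cgroups are backlogged."""
    total = sum(weights.values())
    return {cg: Fraction(w, total) for cg, w in weights.items()}


def share_deviation(weights, measured_bytes):
    """Measured share minus ideal share per cgroup (0 means perfectly proportional)."""
    ideal = expected_shares(weights)
    total = sum(measured_bytes.values())
    return {cg: Fraction(measured_bytes[cg], total) - ideal[cg] for cg in weights}


if __name__ == "__main__":
    # Weights taken from the echo commands above.
    weights = {"test1": 1024, "test2": 8192}
    print(expected_shares(weights))
    # Hypothetical byte counts, for illustration only.
    print(share_deviation(weights, {"test1": 30, "test2": 70}))
```

A positive deviation for test1 would mean the lower-weight cgroup received more than its share, which is the failure mode the 100-versus-10 dd setup is meant to expose.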
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <af41c7c40811131457w472e4a86tb5344cc1d3d366fb-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-14 16:05 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-14 16:05 UTC (permalink / raw) To: Divyesh Shah Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Fabio Checconi, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Thu, Nov 13, 2008 at 02:57:29PM -0800, Divyesh Shah wrote: [..] > > > > Ryo, do you still want to stick to two level scheduling? Given the problem > > > > of it breaking down underlying scheduler's assumptions, probably it makes > > > > more sense to the IO control at each individual IO scheduler. > > > > > > Vivek, > > > I agree with you that 2 layer scheduler *might* invalidate some > > > IO scheduler assumptions (though some testing might help here to > > > confirm that). However, one big concern I have with proportional > > > division at the IO scheduler level is that there is no means of doing > > > admission control at the request queue for the device. What we need is > > > request queue partitioning per cgroup. > > > Consider that I want to divide my disk's bandwidth among 3 > > > cgroups(A, B and C) equally. But say some tasks in the cgroup A flood > > > the disk with IO requests and completely use up all of the requests in > > > the rq resulting in the following IOs to be blocked on a slot getting > > > empty in the rq thus affecting their overall latency. 
One might argue > > > that over the long term though we'll get equal bandwidth division > > > between these cgroups. But now consider that cgroup A has tasks that > > > always storm the disk with large number of IOs which can be a problem > > > for other cgroups. > > > This actually becomes an even larger problem when we want to > > > support high priority requests as they may get blocked behind other > > > lower priority requests which have used up all the available requests > > > in the rq. With request queue division we can achieve this easily by > > > having tasks requiring high priority IO belong to a different cgroup. > > > dm-ioband and any other 2-level scheduler can do this easily. > > > > > > > Hi Divyesh, > > > > I understand that request descriptors can be a bottleneck here. But that > > should be an issue even today with CFQ where a low priority process > > consume lots of request descriptors and prevent higher priority process > > from submitting the request. > > Yes that is true and that is one of the main reasons why I would lean > towards 2-level scheduler coz you get request queue division as well. > > I think you already said it and I just > > reiterated it. > > > > I think in that case we need to do something about request descriptor > > allocation instead of relying on 2nd level of IO scheduler. > > At this point I am not sure what to do. May be we can take feedback from the > > respective queue (like cfqq) of submitting application and if it is already > > backlogged beyond a certain limit, then we can put that application to sleep > > and stop it from consuming excessive amount of request descriptors > > (despite the fact that we have free request descriptors). > > This should be done per-cgroup rather than per-process. > Yep, per cgroup limit will make more sense. get_request() already calls elv_may_queue() to get a feedback from IO scheduler. 
Maybe here the IO scheduler can check how many request descriptors are
already allocated to this cgroup and, if the queue is congested, deny the
fresh request allocation.

Thanks
Vivek

^ permalink raw reply [flat|nested] 92+ messages in thread
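The per-cgroup limit discussed above can be modelled in user space roughly as follows. This is only an illustrative sketch, not kernel code: the class `RequestQueueModel`, its methods, the 75% congestion threshold, and the weight-proportional quota are all assumptions made for the example. In the kernel the decision would sit behind elv_may_queue(), whose real interface is not reproduced here.

```python
class RequestQueueModel:
    """Toy model of a request queue with a fixed pool of request descriptors."""

    def __init__(self, nr_requests, weights, congestion_threshold=0.75):
        self.nr_requests = nr_requests
        self.weights = dict(weights)
        self.allocated = {cg: 0 for cg in weights}
        # Hypothetical policy knob: fraction of the pool at which we call
        # the queue "congested".
        self.congestion_threshold = congestion_threshold

    def total_allocated(self):
        return sum(self.allocated.values())

    def is_congested(self):
        return self.total_allocated() >= self.congestion_threshold * self.nr_requests

    def quota(self, cgroup):
        """Descriptors a cgroup may hold once the queue is congested."""
        total_weight = sum(self.weights.values())
        return self.nr_requests * self.weights[cgroup] // total_weight

    def may_queue(self, cgroup):
        """Stand-in for the scheduler feedback hook: True if allocation is allowed."""
        if self.total_allocated() >= self.nr_requests:
            return False  # pool exhausted, nobody can allocate
        if self.is_congested() and self.allocated[cgroup] >= self.quota(cgroup):
            return False  # congested and this cgroup already holds its quota
        return True

    def get_request(self, cgroup):
        if not self.may_queue(cgroup):
            return False
        self.allocated[cgroup] += 1
        return True

    def put_request(self, cgroup):
        assert self.allocated[cgroup] > 0
        self.allocated[cgroup] -= 1


if __name__ == "__main__":
    rq = RequestQueueModel(nr_requests=128, weights={"A": 1, "B": 1, "C": 1})
    granted_a = sum(rq.get_request("A") for _ in range(200))  # A floods the queue
    granted_b = sum(rq.get_request("B") for _ in range(10))   # B arrives later
    print(granted_a, granted_b)
```

In this model a flooding cgroup can use free descriptors while the queue is uncongested, but it is cut off once congestion sets in, so descriptors remain available to cgroups that are still below their quota.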
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081114160525.GE24624-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-14 22:44 ` Nauman Rafique 0 siblings, 0 replies; 92+ messages in thread From: Nauman Rafique @ 2008-11-14 22:44 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Fabio Checconi, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

In an attempt to make sure that this discussion leads to
something useful, we have summarized the points raised in this
discussion and have come up with a strategy for the future.
The goal of this is to find common ground between all the approaches
proposed on this mailing list.

1 Start with Satoshi's latest patches.
2 Do the following to support proportional division:
  a) Give time slices in proportion to weights (configurable
  option). We can support both priorities and weights by doing
  proportional division between requests with the same priorities.
3 Schedule time slices using WF2Q+ instead of round robin.
  Test the performance impact (both throughput and jitter in latency).
4 Do the following to support the goals of 2 level schedulers:
  a) Limit the request descriptors allocated to each cgroup by adding
  functionality to elv_may_queue()
  b) Add support for putting an absolute limit on IO consumed by a
  cgroup. Such support exists in dm-ioband and is provided by Andrea
  Righi's patches too.
c) Add support (configurable option) to keep track of total disk time/sectors/count consumed at each device, and factor that into scheduling decision (more discussion needed here) 5 Support multiple layers of cgroups to align IO controller behavior with CPU scheduling behavior (more discussion?) 6 Incorporate an IO tracking approach which re-uses memory resource controller code but is not dependent on it (may be biocgroup patches from dm-ioband can be used here directly) 7 Start an offline email thread to keep track of progress on the above goals. Please feel free to add/modify items to the list when you respond back. Any comments/suggestions are more than welcome. Thanks. Divyesh & Nauman On Fri, Nov 14, 2008 at 8:05 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Thu, Nov 13, 2008 at 02:57:29PM -0800, Divyesh Shah wrote: > > [..] >> > > > Ryo, do you still want to stick to two level scheduling? Given the problem >> > > > of it breaking down underlying scheduler's assumptions, probably it makes >> > > > more sense to the IO control at each individual IO scheduler. >> > > >> > > Vivek, >> > > I agree with you that 2 layer scheduler *might* invalidate some >> > > IO scheduler assumptions (though some testing might help here to >> > > confirm that). However, one big concern I have with proportional >> > > division at the IO scheduler level is that there is no means of doing >> > > admission control at the request queue for the device. What we need is >> > > request queue partitioning per cgroup. >> > > Consider that I want to divide my disk's bandwidth among 3 >> > > cgroups(A, B and C) equally. But say some tasks in the cgroup A flood >> > > the disk with IO requests and completely use up all of the requests in >> > > the rq resulting in the following IOs to be blocked on a slot getting >> > > empty in the rq thus affecting their overall latency. 
One might argue >> > > that over the long term though we'll get equal bandwidth division >> > > between these cgroups. But now consider that cgroup A has tasks that >> > > always storm the disk with large number of IOs which can be a problem >> > > for other cgroups. >> > > This actually becomes an even larger problem when we want to >> > > support high priority requests as they may get blocked behind other >> > > lower priority requests which have used up all the available requests >> > > in the rq. With request queue division we can achieve this easily by >> > > having tasks requiring high priority IO belong to a different cgroup. >> > > dm-ioband and any other 2-level scheduler can do this easily. >> > > >> > >> > Hi Divyesh, >> > >> > I understand that request descriptors can be a bottleneck here. But that >> > should be an issue even today with CFQ where a low priority process >> > consume lots of request descriptors and prevent higher priority process >> > from submitting the request. >> >> Yes that is true and that is one of the main reasons why I would lean >> towards 2-level scheduler coz you get request queue division as well. >> >> I think you already said it and I just >> > reiterated it. >> > >> > I think in that case we need to do something about request descriptor >> > allocation instead of relying on 2nd level of IO scheduler. >> > At this point I am not sure what to do. May be we can take feedback from the >> > respective queue (like cfqq) of submitting application and if it is already >> > backlogged beyond a certain limit, then we can put that application to sleep >> > and stop it from consuming excessive amount of request descriptors >> > (despite the fact that we have free request descriptors). >> >> This should be done per-cgroup rather than per-process. >> > > Yep, per cgroup limit will make more sense. get_request() already calls > elv_may_queue() to get a feedback from IO scheduler. 
May be here IO > scheduler can make a decision how many request descriptors are already > allocated to this cgroup. And if the queue is congested, then IO scheduler > can deny the fresh request allocation. > > Thanks > Vivek > ^ permalink raw reply [flat|nested] 92+ messages in thread
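Item 3 of the plan above (scheduling time slices with WF2Q+ instead of round robin) can be illustrated with a heavily simplified simulation. The sketch assumes every cgroup is continuously backlogged and that all slices have the same length; the function name `wf2q_plus_order` and the tie-break on group name are choices made for this example. It is not taken from Satoshi's patches or from bfq.

```python
from fractions import Fraction


def wf2q_plus_order(weights, n_slices, slice_len=1):
    """Return the order in which groups receive slices under a WF2Q+-style rule.

    Each group has a virtual start and finish time. Only groups whose start
    time is not ahead of the system virtual time are eligible, and among
    those the one with the smallest finish time is served next.
    """
    w = {g: Fraction(x) for g, x in weights.items()}
    total = sum(w.values())
    vtime = Fraction(0)
    start = {g: Fraction(0) for g in w}
    finish = {g: Fraction(slice_len) / w[g] for g in w}
    order = []
    for _ in range(n_slices):
        # Virtual time never lags behind the smallest start time, so at
        # least one group is always eligible.
        vtime = max(vtime, min(start.values()))
        eligible = [g for g in w if start[g] <= vtime]
        chosen = min(eligible, key=lambda g: (finish[g], g))
        order.append(chosen)
        # Advance virtual time by the service given, normalised by total weight.
        vtime += Fraction(slice_len) / total
        # The served group is still backlogged: stamp its next slice.
        start[chosen] = finish[chosen]
        finish[chosen] = start[chosen] + Fraction(slice_len) / w[chosen]
    return order


if __name__ == "__main__":
    print(wf2q_plus_order({"A": 1, "B": 2}, 6))
```

With weights 1 and 2 the heavier group gets two slices out of every three, and the eligibility test keeps its slices interleaved with the lighter group's rather than served in long bursts, which is the jitter property the plan proposes to test.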
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <e98e18940811141444u5947b806v27fac453ed1e8a5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-17 14:23 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-17 14:23 UTC (permalink / raw) To: Nauman Rafique Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Fabio Checconi, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote: > In an attempt to make sure that this discussion leads to > something useful, we have summarized the points raised in this > discussion and have come up with a strategy for future. > The goal of this is to find common ground between all the approaches > proposed on this mailing list. > > 1 Start with Satoshi's latest patches. I have had a brief look at both Satoshi's patch and bfq. I kind of like bfq's patches for keeping track of per cgroup, per queue data structures. May be we can look there also. > 2 Do the following to support propotional division: > a) Give time slices in proportion to weights (configurable > option). We can support both priorities and weights by doing > propotional division between requests with same priorities. > 3 Schedule time slices using WF2Q+ instead of round robin. > Test the performance impact (both throughput and jitter in latency). 
> 4 Do the following to support the goals of 2 level schedulers:
> a) Limit the request descriptors allocated to each cgroup by adding
> functionality to elv_may_queue()
> b) Add support for putting an absolute limit on IO consumed by a
> cgroup. Such support exists in dm-ioband and is provided by Andrea
> Righi's patches too.

Does dm-ioband support an absolute limit? I think that up to the last
version it did not. I have not checked the latest version though.

> c) Add support (configurable option) to keep track of total disk
> time/sectors/count
> consumed at each device, and factor that into scheduling decision
> (more discussion needed here)
> 5 Support multiple layers of cgroups to align IO controller behavior
> with CPU scheduling behavior (more discussion?)
> 6 Incorporate an IO tracking approach which re-uses memory resource
> controller code but is not dependent on it (may be biocgroup patches from
> dm-ioband can be used here directly)
> 7 Start an offline email thread to keep track of progress on the above
> goals.
>
> Please feel free to add/modify items to the list
> when you respond back. Any comments/suggestions are more than welcome.
>

Thanks
Vivek

> Thanks.
> Divyesh & Nauman
>
> On Fri, Nov 14, 2008 at 8:05 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Thu, Nov 13, 2008 at 02:57:29PM -0800, Divyesh Shah wrote:
> >
> > [..]
> >> > > > Ryo, do you still want to stick to two level scheduling? Given the problem
> >> > > > of it breaking down underlying scheduler's assumptions, probably it makes
> >> > > > more sense to the IO control at each individual IO scheduler.
> >> > >
> >> > > Vivek,
> >> > > I agree with you that 2 layer scheduler *might* invalidate some
> >> > > IO scheduler assumptions (though some testing might help here to
> >> > > confirm that).
However, one big concern I have with proportional > >> > > division at the IO scheduler level is that there is no means of doing > >> > > admission control at the request queue for the device. What we need is > >> > > request queue partitioning per cgroup. > >> > > Consider that I want to divide my disk's bandwidth among 3 > >> > > cgroups(A, B and C) equally. But say some tasks in the cgroup A flood > >> > > the disk with IO requests and completely use up all of the requests in > >> > > the rq resulting in the following IOs to be blocked on a slot getting > >> > > empty in the rq thus affecting their overall latency. One might argue > >> > > that over the long term though we'll get equal bandwidth division > >> > > between these cgroups. But now consider that cgroup A has tasks that > >> > > always storm the disk with large number of IOs which can be a problem > >> > > for other cgroups. > >> > > This actually becomes an even larger problem when we want to > >> > > support high priority requests as they may get blocked behind other > >> > > lower priority requests which have used up all the available requests > >> > > in the rq. With request queue division we can achieve this easily by > >> > > having tasks requiring high priority IO belong to a different cgroup. > >> > > dm-ioband and any other 2-level scheduler can do this easily. > >> > > > >> > > >> > Hi Divyesh, > >> > > >> > I understand that request descriptors can be a bottleneck here. But that > >> > should be an issue even today with CFQ where a low priority process > >> > consume lots of request descriptors and prevent higher priority process > >> > from submitting the request. > >> > >> Yes that is true and that is one of the main reasons why I would lean > >> towards 2-level scheduler coz you get request queue division as well. > >> > >> I think you already said it and I just > >> > reiterated it. 
> >> > > >> > I think in that case we need to do something about request descriptor > >> > allocation instead of relying on 2nd level of IO scheduler. > >> > At this point I am not sure what to do. May be we can take feedback from the > >> > respective queue (like cfqq) of submitting application and if it is already > >> > backlogged beyond a certain limit, then we can put that application to sleep > >> > and stop it from consuming excessive amount of request descriptors > >> > (despite the fact that we have free request descriptors). > >> > >> This should be done per-cgroup rather than per-process. > >> > > > > Yep, per cgroup limit will make more sense. get_request() already calls > > elv_may_queue() to get a feedback from IO scheduler. May be here IO > > scheduler can make a decision how many request descriptors are already > > allocated to this cgroup. And if the queue is congested, then IO scheduler > > can deny the fresh request allocation. > > > > Thanks > > Vivek > > ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081117142309.GA15564-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-18 2:02 ` Li Zefan 0 siblings, 0 replies; 92+ messages in thread From: Li Zefan @ 2008-11-18 2:02 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Fabio Checconi, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w Vivek Goyal wrote: > On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote: >> In an attempt to make sure that this discussion leads to >> something useful, we have summarized the points raised in this >> discussion and have come up with a strategy for future. >> The goal of this is to find common ground between all the approaches >> proposed on this mailing list. >> >> 1 Start with Satoshi's latest patches. > > I have had a brief look at both Satoshi's patch and bfq. I kind of like > bfq's patches for keeping track of per cgroup, per queue data structures. > May be we can look there also. > >> 2 Do the following to support propotional division: >> a) Give time slices in proportion to weights (configurable >> option). We can support both priorities and weights by doing >> propotional division between requests with same priorities. >> 3 Schedule time slices using WF2Q+ instead of round robin. >> Test the performance impact (both throughput and jitter in latency). 
>> 4 Do the following to support the goals of 2 level schedulers: >> a) Limit the request descriptors allocated to each cgroup by adding >> functionality to elv_may_queue() >> b) Add support for putting an absolute limit on IO consumed by a >> cgroup. Such support exists in dm-ioband and is provided by Andrea >> Righi's patches too. > > Does dm-iobnd support abosolute limit? I think till last version they did > not. I have not check the latest version though. > No, dm-ioband still provides weight/share control only. Only Andrea Righi's patches support absolute limit. >> c) Add support (configurable option) to keep track of total disk >> time/sectors/count >> consumed at each device, and factor that into scheduling decision >> (more discussion needed here) >> 5 Support multiple layers of cgroups to align IO controller behavior >> with CPU scheduling behavior (more discussion?) >> 6 Incorporate an IO tracking approach which re-uses memory resource >> controller code but is not dependent on it (may be biocgroup patches from >> dm-ioband can be used here directly) >> 7 Start an offline email thread to keep track of progress on the above >> goals. >> >> Please feel free to add/modify items to the list >> when you respond back. Any comments/suggestions are more than welcome. >> ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <4922224A.5030502-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> @ 2008-11-18 5:01 ` Nauman Rafique 0 siblings, 0 replies; 92+ messages in thread From: Nauman Rafique @ 2008-11-18 5:01 UTC (permalink / raw) To: Li Zefan Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Fabio Checconi, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

If we start with the bfq patches, this is how the plan would look:

1 Start with BFQ take 2.
2 Do the following to support proportional division:
  a) Expose the per device weight interface to the user, instead of
  calculating it from priority.
  b) Add support for disk time budgets, besides the sector budget that is
  currently available (configurable option). (Fabio: Do you think we can
  just emulate that using the existing code?). Another approach would be
  to give time slices just like CFQ (discussing?)
4 Do the following to support the goals of 2 level schedulers:
  a) Limit the request descriptors allocated to each cgroup by adding
  functionality to elv_may_queue()
  b) Add support for putting an absolute limit on IO consumed by a
  cgroup. Such support is provided by Andrea Righi's patches too.
c) Add support (configurable option) to keep track of total disk time/sectors/count consumed at each device, and factor that into scheduling decision (more discussion needed here) 6 Incorporate an IO tracking approach which re-uses memory resource controller code but is not dependent on it (may be biocgroup patches from dm-ioband can be used here directly) 7 Start an offline email thread to keep track of progress on the above goals. BFQ's support for hierarchy of cgroups means that its close to where we want to get. Any comments on what approach looks better? On Mon, Nov 17, 2008 at 6:02 PM, Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote: > Vivek Goyal wrote: >> On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote: >>> In an attempt to make sure that this discussion leads to >>> something useful, we have summarized the points raised in this >>> discussion and have come up with a strategy for future. >>> The goal of this is to find common ground between all the approaches >>> proposed on this mailing list. >>> >>> 1 Start with Satoshi's latest patches. >> >> I have had a brief look at both Satoshi's patch and bfq. I kind of like >> bfq's patches for keeping track of per cgroup, per queue data structures. >> May be we can look there also. >> >>> 2 Do the following to support propotional division: >>> a) Give time slices in proportion to weights (configurable >>> option). We can support both priorities and weights by doing >>> propotional division between requests with same priorities. >>> 3 Schedule time slices using WF2Q+ instead of round robin. >>> Test the performance impact (both throughput and jitter in latency). >>> 4 Do the following to support the goals of 2 level schedulers: >>> a) Limit the request descriptors allocated to each cgroup by adding >>> functionality to elv_may_queue() >>> b) Add support for putting an absolute limit on IO consumed by a >>> cgroup. 
Such support exists in dm-ioband and is provided by Andrea >>> Righi's patches too. >> >> Does dm-iobnd support abosolute limit? I think till last version they did >> not. I have not check the latest version though. >> > > No, dm-ioband still provides weight/share control only. Only Andrea Righi's > patches support absolute limit. Thanks for the correction. > >>> c) Add support (configurable option) to keep track of total disk >>> time/sectors/count >>> consumed at each device, and factor that into scheduling decision >>> (more discussion needed here) >>> 5 Support multiple layers of cgroups to align IO controller behavior >>> with CPU scheduling behavior (more discussion?) >>> 6 Incorporate an IO tracking approach which re-uses memory resource >>> controller code but is not dependent on it (may be biocgroup patches from >>> dm-ioband can be used here directly) >>> 7 Start an offline email thread to keep track of progress on the above >>> goals. >>> >>> Please feel free to add/modify items to the list >>> when you respond back. Any comments/suggestions are more than welcome. >>> > > ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <e98e18940811172101na345b6bh5c73f9e657aac5a7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-18 7:42 ` Li Zefan 2008-11-18 12:05 ` Fabio Checconi 1 sibling, 0 replies; 92+ messages in thread From: Li Zefan @ 2008-11-18 7:42 UTC (permalink / raw) To: Nauman Rafique Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Fabio Checconi, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w Nauman Rafique wrote: > If we start with bfq patches, this is how plan would look like: > > 1 Start with BFQ take 2. > 2 Do the following to support proportional division: > a) Expose the per device weight interface to user, instead of calculating > from priority. > b) Add support for disk time budgets, besides sector budget that is currently > available (configurable option). (Fabio: Do you think we can just emulate > that using the existing code?). Another approach would be to give time slices > just like CFQ (discussing?) > 4 Do the following to support the goals of 2 level schedulers: > a) Limit the request descriptors allocated to each cgroup by adding > functionality to elv_may_queue() > b) Add support for putting an absolute limit on IO consumed by a > cgroup. Such support is provided by Andrea > Righi's patches too. 
> c) Add support (configurable option) to keep track of total disk
> time/sectors/count
> consumed at each device, and factor that into scheduling decision
> (more discussion needed here)
> 6 Incorporate an IO tracking approach which re-uses memory resource
> controller code but is not dependent on it (may be biocgroup patches from
> dm-ioband can be used here directly)

I think the newest bio_cgroup doesn't use much memcg code. The older
biocgroup tracked IO using mem_cgroup_charge(), which remembers which
cgroup owns a struct page. But biocgroup now puts hooks directly in
__set_page_dirty() and some other places to track pages.

> 7 Start an offline email thread to keep track of progress on the above
> goals.
>
> BFQ's support for hierarchy of cgroups means that its close to where
> we want to get. Any comments on what approach looks better?
>

Looks like a sane way :) . We are also trying to keep track of the
discussion and development of the IO controller. I'll start having a
look into BFQ.

> On Mon, Nov 17, 2008 at 6:02 PM, Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
>> Vivek Goyal wrote:
>>> On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote:
>>>> In an attempt to make sure that this discussion leads to
>>>> something useful, we have summarized the points raised in this
>>>> discussion and have come up with a strategy for future.
>>>> The goal of this is to find common ground between all the approaches
>>>> proposed on this mailing list.
>>>>
>>>> 1 Start with Satoshi's latest patches.
>>> I have had a brief look at both Satoshi's patch and bfq. I kind of like
>>> bfq's patches for keeping track of per cgroup, per queue data structures.
>>> May be we can look there also.
>>>
>>>> 2 Do the following to support propotional division:
>>>> a) Give time slices in proportion to weights (configurable
>>>> option).
We can support both priorities and weights by doing >>>> propotional division between requests with same priorities. >>>> 3 Schedule time slices using WF2Q+ instead of round robin. >>>> Test the performance impact (both throughput and jitter in latency). >>>> 4 Do the following to support the goals of 2 level schedulers: >>>> a) Limit the request descriptors allocated to each cgroup by adding >>>> functionality to elv_may_queue() >>>> b) Add support for putting an absolute limit on IO consumed by a >>>> cgroup. Such support exists in dm-ioband and is provided by Andrea >>>> Righi's patches too. >>> Does dm-iobnd support abosolute limit? I think till last version they did >>> not. I have not check the latest version though. >>> >> No, dm-ioband still provides weight/share control only. Only Andrea Righi's >> patches support absolute limit. > > Thanks for the correction. > >>>> c) Add support (configurable option) to keep track of total disk >>>> time/sectors/count >>>> consumed at each device, and factor that into scheduling decision >>>> (more discussion needed here) >>>> 5 Support multiple layers of cgroups to align IO controller behavior >>>> with CPU scheduling behavior (more discussion?) >>>> 6 Incorporate an IO tracking approach which re-uses memory resource >>>> controller code but is not dependent on it (may be biocgroup patches from >>>> dm-ioband can be used here directly) >>>> 7 Start an offline email thread to keep track of progress on the above >>>> goals. >>>> >>>> Please feel free to add/modify items to the list >>>> when you respond back. Any comments/suggestions are more than welcome. >>>> >> > ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
       [not found] ` <e98e18940811172101na345b6bh5c73f9e657aac5a7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2008-11-18  7:42   ` Li Zefan
@ 2008-11-18 12:05   ` Fabio Checconi
  1 sibling, 0 replies; 92+ messages in thread
From: Fabio Checconi @ 2008-11-18 12:05 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi,

> From: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Date: Mon, Nov 17, 2008 09:01:48PM -0800
>
> If we start with bfq patches, this is how plan would look like:
>
> 1 Start with BFQ take 2.
> 2 Do the following to support proportional division:
>  a) Expose the per device weight interface to user, instead of calculating
>  from priority.
>  b) Add support for disk time budgets, besides sector budget that is currently
>  available (configurable option). (Fabio: Do you think we can just emulate
>  that using the existing code?). Another approach would be to give time slices
>  just like CFQ (discussing?)

It should be possible without altering the code.  The slices can be
assigned in the time domain using big values for max_budget.  The logic
is: each process is assigned a budget (in the range [max_budget/2, max_budget],
chosen by the feedback mechanism in __bfq_bfqq_recalc_budget()),
and if it does not complete it in timeout_sync milliseconds, it is
charged a fixed amount of sectors of service.

Using big values for max_budget (where big means greater than two
times the number of sectors the hard drive can transfer in timeout_sync
milliseconds) makes the budgets always time out, so the disk time
is scheduled in slices of timeout_sync.

However, this is just a temporary workaround to do some basic testing.

Modifying the scheduler to support time slices instead of sector
budgets would indeed simplify the code; the drawback, I think,
would be too much unfairness in the service domain.  Of course we
have to consider how important it is to be fair in the service
domain, and how much added complexity/new code we can accept for it.

[ Better service domain fairness is one of the main reasons why
  we started working on bfq, so, speaking for Paolo and myself, it _is_
  important :) ]

I have to think a little bit about how it would be possible to support
an option for time-only budgets, coexisting with the current behavior,
but I think it can be done.

> 4 Do the following to support the goals of 2 level schedulers:
>  a) Limit the request descriptors allocated to each cgroup by adding
>  functionality to elv_may_queue()
>  b) Add support for putting an absolute limit on IO consumed by a
>  cgroup. Such support is provided by Andrea
>  Righi's patches too.
>  c) Add support (configurable option) to keep track of total disk
>  time/sectors/count
>  consumed at each device, and factor that into scheduling decision
>  (more discussion needed here)
> 6 Incorporate an IO tracking approach which re-uses memory resource
> controller code but is not dependent on it (may be biocgroup patches from
> dm-ioband can be used here directly)
> 7 Start an offline email thread to keep track of progress on the above
> goals.
>
> BFQ's support for hierarchy of cgroups means that its close to where
> we want to get. Any comments on what approach looks better?
>

The main problems with this approach (as with the cfq-based ones), in
my opinion, are:
  - the request descriptor allocation problem Divyesh talked about,
  - the impossibility of respecting different weights, resulting from
    the interlock problem with synchronous requests Vivek talked about
    [ in cfq/bfq this can happen when idling is disabled, e.g., for
      SSDs, or when using NCQ ],

but I think that correctly addressing your points 4.a) and 4.b) should
solve them.
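The max_budget workaround described in this message can be checked with a little arithmetic. Budgets are drawn from [max_budget/2, max_budget], so every budget times out only if even the smallest one, max_budget/2, exceeds what the drive can transfer in timeout_sync milliseconds; that appears to be where the factor of two comes from. The Python sketch below is only an illustration: the function names, the 100 MB/s drive, the 512-byte sector size and the 125 ms timeout are assumptions, not values taken from the BFQ patches.

```python
def sectors_per_timeout(rate_sectors_per_sec, timeout_sync_ms):
    """Sectors the drive can transfer before a budget times out."""
    return rate_sectors_per_sec * timeout_sync_ms // 1000


def budgets_always_time_out(max_budget, rate_sectors_per_sec, timeout_sync_ms):
    """True if even the smallest budget (max_budget / 2) cannot be
    consumed within timeout_sync, i.e. max_budget > 2 * cap."""
    cap = sectors_per_timeout(rate_sectors_per_sec, timeout_sync_ms)
    return max_budget > 2 * cap


# Hypothetical drive: 100 MB/s with 512-byte sectors, 125 ms timeout.
RATE = 100 * 1024 * 1024 // 512
TIMEOUT_SYNC_MS = 125

cap = sectors_per_timeout(RATE, TIMEOUT_SYNC_MS)
print("sectors per timeout:", cap)
print("max_budget = 2*cap     ->", budgets_always_time_out(2 * cap, RATE, TIMEOUT_SYNC_MS))
print("max_budget = 2*cap + 1 ->", budgets_always_time_out(2 * cap + 1, RATE, TIMEOUT_SYNC_MS))
```

Under these assumed numbers, any max_budget strictly above twice the per-timeout capacity turns every budget into a slice of timeout_sync milliseconds.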
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081118120508.GD15268-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org> @ 2008-11-18 14:07 ` Vivek Goyal 2008-11-18 22:33 ` Nauman Rafique 1 sibling, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-18 14:07 UTC (permalink / raw) To: Fabio Checconi Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: > Hi, > > > From: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > > Date: Mon, Nov 17, 2008 09:01:48PM -0800 > > > > If we start with bfq patches, this is how plan would look like: > > > > 1 Start with BFQ take 2. > > 2 Do the following to support proportional division: > > a) Expose the per device weight interface to user, instead of calculating > > from priority. > > b) Add support for disk time budgets, besides sector budget that is currently > > available (configurable option). (Fabio: Do you think we can just emulate > > that using the existing code?). Another approach would be to give time slices > > just like CFQ (discussing?) > > it should be possible without altering the code. The slices can be > assigned in the time domain using big values for max_budget. The logic > is: each process is assigned a budget (in the range [max_budget/2, max_budget], > choosen from the feedback mechanism, driven in __bfq_bfqq_recalc_budget()), > and if it does not complete it in timeout_sync milliseconds, it is > charged a fixed amount of sectors of service. 
>
> Using big values for max_budget (where big means greater than two
> times the number of sectors the hard drive can transfer in timeout_sync
> milliseconds) makes the budgets always to time out, so the disk time
> is scheduled in slices of timeout_sync.
>
> However this is just a temporary workaround to do some basic testing.
>
> Modifying the scheduler to support time slices instead of sector
> budgets would indeed simplify the code; I think that the drawback
> would be being too unfair in the service domain. Of course we
> have to consider how much is important to be fair in the service
> domain, and how much added complexity/new code can we accept for it.
>
> [ Better service domain fairness is one of the main reasons why
> we started working on bfq, so, talking for me and Paolo it _is_
> important :) ]
>
> I have to think a little bit on how it would be possible to support
> an option for time-only budgets, coexisting with the current behavior,
> but I think it can be done.
>

IIUC, bfq and cfq differ in the following ways.

a. BFQ employs WF2Q+ for fairness and CFQ employs weighted round robin.
b. BFQ uses the budget (sector count) as its notion of service and CFQ uses
   time slices.
c. BFQ supports hierarchical fair queuing and CFQ does not.

We are looking forward to an implementation of point (c). Fabio seems to be
thinking of supporting time slices as the notion of service (b). That looks
like a convergence of CFQ and BFQ, except for point (a) (WF2Q+ vs weighted
round robin).

It looks like WF2Q+ provides a tighter service bound, and the bfq guys mention
that they have been able to maintain throughput while ensuring tighter
bounds. If that's the case, does that mean BFQ is a replacement for CFQ
down the line?

Thanks
Vivek
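To make point (a) more concrete, here is a minimal Python sketch of a textbook-style WF2Q+ selection rule. It is not BFQ's code: it assumes every flow is always backlogged, all requests have the same length, and the function name and data layout are invented for illustration. Each flow carries a virtual start and finish time; only flows whose start time is not ahead of the system virtual time are eligible, and among those the one with the smallest finish time is served.

```python
from fractions import Fraction


def wf2q_plus_order(weights, n_services, length=1):
    """Return the service order for always-backlogged flows.

    weights: dict mapping flow name -> integer weight (assumed layout).
    """
    total_weight = sum(weights.values())
    vtime = Fraction(0)
    start = {f: Fraction(0) for f in weights}
    finish = {f: Fraction(length, w) for f, w in weights.items()}
    order = []
    for _ in range(n_services):
        # Eligibility: the flow must not be ahead of the system virtual time.
        eligible = [f for f in weights if start[f] <= vtime]
        # Among eligible flows, serve the smallest finish time (name breaks ties).
        chosen = min(eligible, key=lambda f: (finish[f], f))
        order.append(chosen)
        # The flow stays backlogged, so its next request starts where the
        # previous one finished.
        start[chosen] = finish[chosen]
        finish[chosen] = start[chosen] + Fraction(length, weights[chosen])
        # Advance virtual time, never letting it fall behind the smallest
        # start time among backlogged flows.
        vtime = max(vtime + Fraction(length, total_weight), min(start.values()))
    return order


order = wf2q_plus_order({"A": 1, "B": 2}, 300)
print("".join(order[:6]), order.count("A"), order.count("B"))
```

With weights 1 and 2 the heavier flow receives twice the service, and its turns are interleaved with the lighter flow's rather than issued back to back, which is the "tighter service bound" property being discussed.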
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081118120508.GD15268-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org> 2008-11-18 14:07 ` Vivek Goyal @ 2008-11-18 22:33 ` Nauman Rafique 1 sibling, 0 replies; 92+ messages in thread From: Nauman Rafique @ 2008-11-18 22:33 UTC (permalink / raw) To: Fabio Checconi Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Tue, Nov 18, 2008 at 4:05 AM, Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > Hi, > >> From: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> >> Date: Mon, Nov 17, 2008 09:01:48PM -0800 >> >> If we start with bfq patches, this is how plan would look like: >> >> 1 Start with BFQ take 2. >> 2 Do the following to support proportional division: >> a) Expose the per device weight interface to user, instead of calculating >> from priority. >> b) Add support for disk time budgets, besides sector budget that is currently >> available (configurable option). (Fabio: Do you think we can just emulate >> that using the existing code?). Another approach would be to give time slices >> just like CFQ (discussing?) > > it should be possible without altering the code. The slices can be > assigned in the time domain using big values for max_budget. 
The logic > is: each process is assigned a budget (in the range [max_budget/2, max_budget], > choosen from the feedback mechanism, driven in __bfq_bfqq_recalc_budget()), > and if it does not complete it in timeout_sync milliseconds, it is > charged a fixed amount of sectors of service. > > Using big values for max_budget (where big means greater than two > times the number of sectors the hard drive can transfer in timeout_sync > milliseconds) makes the budgets always to time out, so the disk time > is scheduled in slices of timeout_sync. > > However this is just a temporary workaround to do some basic testing. > > Modifying the scheduler to support time slices instead of sector > budgets would indeed simplify the code; I think that the drawback > would be being too unfair in the service domain. Of course we > have to consider how much is important to be fair in the service > domain, and how much added complexity/new code can we accept for it. > > [ Better service domain fairness is one of the main reasons why > we started working on bfq, so, talking for me and Paolo it _is_ > important :) ] > > I have to think a little bit on how it would be possible to support > an option for time-only budgets, coexisting with the current behavior, > but I think it can be done. I think "time only budget" vs "sector budget" is dependent on the definition of fairness: do you want to be fair in the time that is given to each cgroup or fair in total number of sectors transferred. And the appropriate definition of fairness depends on how/where the IO scheduler is used. Do you think the work-around that you mentioned would have a significant performance difference compared to direct built-in support? > > >> 4 Do the following to support the goals of 2 level schedulers: >> a) Limit the request descriptors allocated to each cgroup by adding >> functionality to elv_may_queue() >> b) Add support for putting an absolute limit on IO consumed by a >> cgroup. 
Such support is provided by Andrea >> Righi's patches too. >> c) Add support (configurable option) to keep track of total disk >> time/sectors/count >> consumed at each device, and factor that into scheduling decision >> (more discussion needed here) >> 6 Incorporate an IO tracking approach which re-uses memory resource >> controller code but is not dependent on it (may be biocgroup patches from >> dm-ioband can be used here directly) >> 7 Start an offline email thread to keep track of progress on the above >> goals. >> >> BFQ's support for hierarchy of cgroups means that its close to where >> we want to get. Any comments on what approach looks better? >> > > The main problems with this approach (as with the cfq-based ones) in > my opinion are: > - the request descriptor allocation problem Divyesh talked about, > - the impossibility of respecting different weights, resulting from > the interlock problem with synchronous requests Vivek talked about > [ in cfq/bfq this can happen when idling is disabled, e.g., for > SSDs, or when using NCQ ], > > but I think that correctly addressing your points 4.a) and 4.b) should > solve them. > ^ permalink raw reply [flat|nested] 92+ messages in thread
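The difference between the two definitions of fairness raised in this message can be shown with a small calculation. The Python sketch below compares equal disk time with equal sector counts for two queues with very different transfer rates. The rates (a sequential reader at 200 sectors/ms, a seeky one at 2 sectors/ms), the one-second period and the function names are illustrative assumptions, and switching overhead between queues is ignored.

```python
from fractions import Fraction


def equal_time(rates, period_ms):
    """Time fairness: every queue gets the same slice of disk time.
    Returns the sectors each queue transfers in the period."""
    slice_ms = Fraction(period_ms, len(rates))
    return {q: r * slice_ms for q, r in rates.items()}


def equal_sectors(rates, period_ms):
    """Service fairness: every queue transfers the same number of sectors.
    Returns the disk time (ms) each queue consumes in the period."""
    sectors_each = Fraction(period_ms) / sum(Fraction(1, r) for r in rates.values())
    return {q: sectors_each / r for q, r in rates.items()}


# Hypothetical rates in sectors per millisecond.
rates = {"sequential": 200, "seeky": 2}

for q, s in equal_time(rates, 1000).items():
    print(f"equal time:    {q:10s} transfers {float(s):9.1f} sectors")
for q, t in equal_sectors(rates, 1000).items():
    print(f"equal sectors: {q:10s} uses      {float(t):9.1f} ms")
```

Under these assumptions, equal time gives the sequential queue a hundred times more sectors, while equal sectors hands almost the whole period to the seeky queue; which outcome counts as fair depends on where the scheduler is used.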
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081118140751.GA4283-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-18 14:41 ` Fabio Checconi 0 siblings, 0 replies; 92+ messages in thread From: Fabio Checconi @ 2008-11-18 14:41 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Date: Tue, Nov 18, 2008 09:07:51AM -0500 > > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: ... > > I have to think a little bit on how it would be possible to support > > an option for time-only budgets, coexisting with the current behavior, > > but I think it can be done. > > > > IIUC, bfq and cfq are different in following manner. > > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. > b. BFQ uses the budget (sector count) as notion of service and CFQ uses > time slices. > c. BFQ supports hierarchical fair queuing and CFQ does not. > > We are looking forward for implementation of point C. Fabio seems to > thinking of supporting time slice as a service (B). It seems like > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round > robin). > > It looks like WF2Q+ provides tighter service bound and bfq guys mention > that they have been able to ensure throughput while ensuring tighter > bounds. If that's the case, does that mean BFQ is a replacement for CFQ > down the line? 
>

BFQ started from CFQ, extending it in the way you correctly describe,
so it is indeed very similar.  There are also some minor changes to
locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.

The two schedulers share similar goals, and in my opinion BFQ can be
considered, in the long term, a CFQ replacement; *but* before talking
about replacing CFQ we have to consider that:

  - it *needs* review and testing; we've done our best, but for sure
    it's not enough; review and testing are never enough;
  - the service domain fairness, which was one of our objectives, requires
    some extra complexity; the mechanisms we used and the design choices
    we've made may not fit all the needs, or may not be as generic as
    CFQ's simpler ones;
  - CFQ has years of history behind it and has been tuned for a wider
    variety of environments than the ones we've been able to test.

If time-based fairness is considered more robust and the loss of
service-domain fairness is not a problem, then the two schedulers can
be made even more similar.
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081118191208.GJ26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> @ 2008-11-18 19:47 ` Vivek Goyal 2008-11-18 21:14 ` Fabio Checconi 2008-11-18 23:07 ` Nauman Rafique 2 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-18 19:47 UTC (permalink / raw) To: Jens Axboe Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Tue, Nov 18, 2008 at 08:12:08PM +0100, Jens Axboe wrote: > On Tue, Nov 18 2008, Fabio Checconi wrote: > > > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > Date: Tue, Nov 18, 2008 09:07:51AM -0500 > > > > > > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: > > ... > > > > I have to think a little bit on how it would be possible to support > > > > an option for time-only budgets, coexisting with the current behavior, > > > > but I think it can be done. > > > > > > > > > > IIUC, bfq and cfq are different in following manner. > > > > > > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. > > > b. BFQ uses the budget (sector count) as notion of service and CFQ uses > > > time slices. > > > c. BFQ supports hierarchical fair queuing and CFQ does not. > > > > > > We are looking forward for implementation of point C. Fabio seems to > > > thinking of supporting time slice as a service (B). It seems like > > > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round > > > robin). 
> > > > > > It looks like WF2Q+ provides tighter service bound and bfq guys mention > > > that they have been able to ensure throughput while ensuring tighter > > > bounds. If that's the case, does that mean BFQ is a replacement for CFQ > > > down the line? > > > > > > > BFQ started from CFQ, extending it in the way you correctly describe, > > so it is indeed very similar. There are also some minor changes to > > locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. > > > > The two schedulers share similar goals, and in my opinion BFQ can be > > considered, in the long term, a CFQ replacement; *but* before talking > > about replacing CFQ we have to consider that: > > > > - it *needs* review and testing; we've done our best, but for sure > > it's not enough; review and testing are never enough; > > - the service domain fairness, which was one of our objectives, requires > > some extra complexity; the mechanisms we used and the design choices > > we've made may not fit all the needs, or may not be as generic as the > > simpler CFQ's ones; > > - CFQ has years of history behind and has been tuned for a wider > > variety of environments than the ones we've been able to test. > > > > If time-based fairness is considered more robust and the loss of > > service-domain fairness is not a problem, then the two schedulers can > > be made even more similar. > > My preferred approach here would be, in order or TODO: > > - Create and test the smallish patches for seekiness, hw_tag checking, > and so on for CFQ. > - Create and test a WF2Q+ service dispatching patch for CFQ. > Hi Jens, What do you think about "hierarchical" and cgroup part of BFQ patch? Do you intend to incorporate/include that piece also or do you think that's not the way to go for IO controller stuff. Thanks Vivek > and if there are leftovers after that, we could even conditionally > enable some of those if appropriate. 
I think the WF2Q+ is quite cool and > could be easily usable as the default, so it's definitely a viable > alternative. > > My main goal here is basically avoiding addition of Yet Another IO > scheduler, especially one that is so closely tied to CFQ already. > > I'll start things off by splitting cfq into a few files similar to what > bfq has done, as I think it makes a lot of sense. Fabio, if you could > create patches for the small behavioural changes you made, we can > discuss and hopefully merge those next. > > -- > Jens Axboe ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081118191208.GJ26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2008-11-18 19:47 ` Vivek Goyal @ 2008-11-18 21:14 ` Fabio Checconi 2008-11-18 23:07 ` Nauman Rafique 2 siblings, 0 replies; 92+ messages in thread From: Fabio Checconi @ 2008-11-18 21:14 UTC (permalink / raw) To: Jens Axboe Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w > From: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> > Date: Tue, Nov 18, 2008 08:12:08PM +0100 > > On Tue, Nov 18 2008, Fabio Checconi wrote: > > BFQ started from CFQ, extending it in the way you correctly describe, > > so it is indeed very similar. There are also some minor changes to > > locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. > > ... > My preferred approach here would be, in order or TODO: > > - Create and test the smallish patches for seekiness, hw_tag checking, > and so on for CFQ. > - Create and test a WF2Q+ service dispatching patch for CFQ. > > and if there are leftovers after that, we could even conditionally > enable some of those if appropriate. I think the WF2Q+ is quite cool and > could be easily usable as the default, so it's definitely a viable > alternative. > > My main goal here is basically avoiding addition of Yet Another IO > scheduler, especially one that is so closely tied to CFQ already. > > I'll start things off by splitting cfq into a few files similar to what > bfq has done, as I think it makes a lot of sense. 
Fabio, if you could
> create patches for the small behavioural changes you made, we can
> discuss and hopefully merge those next.
>

Ok, I can do that; I just need a little bit of time to organize the work.

About these small (some of them are really small) changes, here is a mixed
list of things that they will touch and/or things that I'd like to have clear
before starting to write the patches (maybe we can start another thread
for them):

  - In cfq_exit_single_io_context() and in changed_ioprio(), cic->key
    is dereferenced without holding any lock.  As I reported in [1]
    this seems to be a problem when an exit() races with a
    cfq_exit_queue() and in a few other cases.  In BFQ we used a
    somewhat involved mechanism to avoid that, abusing rcu (of course
    we'll have to wait for the patch to talk about it :) ), but given my
    lack of understanding of some parts of the block layer, I'd be
    interested in knowing if the race is possible and/or if there is
    something more involved going on that can cause the same effects.

  - set_task_ioprio() in fs/ioprio.c doesn't seem to have a write
    memory barrier to pair with the dependent read one in
    cfq_get_io_context().

  - CFQ_MIN_TT is 2ms.  Depending on the value of HZ, this can result
    in timeouts of one jiffy, which may expire too early, so we just
    waste time and do not actually wait for the task to present
    its new request.  Dealing with seeky traffic we've seen a lot of
    early timeouts due to one-jiffy timers expiring too early; is
    it worth fixing or can we live with that?

  - To detect hw tagging in BFQ we consider a sample valid iff the
    number of requests that the scheduler could have dispatched (given
    by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still in
    the scheduler plus the ones in the driver) is higher than the
    CFQ_HW_QUEUE_MIN threshold.  This obviously caused no problems
    during testing, but the approach CFQ uses now seems a little bit
    strange.

  - Initially, cic->last_request_pos is zero, so the sdist charged
    to a task for its first seek depends on the position on the disk
    that is accessed first, independently of its seekiness.  Even
    if there is a cap on that value, we chose not to charge the first
    seek to processes; that resulted in fewer wrong predictions for
    purely sequential loads.

  - From my understanding, with shared I/O contexts, two different
    tasks may concurrently look up a cfqd in the same ioc.  This
    may result in cfq_drop_dead_cic() being called twice for the
    same cic.  Am I missing something that prevents that from happening?

Regarding the code splitup, do you think you'll go for the CFS(BFQ) way,
using a single compilation unit and including the .c files, or a layout
with different compilation units (like the ll_rw_blk.c splitup)?

[1]: http://lkml.org/lkml/2008/8/18/119
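The CFQ_MIN_TT point in this message is easier to see with numbers. The Python sketch below converts a 2 ms timeout to jiffies for a few common HZ values and gives the range of real time such a timer can wait. It assumes the conversion rounds up and that a timer set for n jiffies fires on the n-th following tick, so the actual wait depends on where inside the current tick it was armed; the function names are made up for illustration and do not correspond to kernel functions.

```python
from fractions import Fraction


def ms_to_jiffies(ms, hz):
    """Convert milliseconds to jiffies, rounding up (assumed behaviour)."""
    return (ms * hz + 999) // 1000


def wait_bounds_ms(n_jiffies, hz):
    """Shortest and longest real wait of a timer set n jiffies ahead.

    Armed just before a tick it waits almost (n - 1) ticks; armed just
    after a tick it waits almost n ticks.
    """
    tick_ms = Fraction(1000, hz)
    return (n_jiffies - 1) * tick_ms, n_jiffies * tick_ms


for hz in (100, 250, 1000):
    n = ms_to_jiffies(2, hz)
    lo, hi = wait_bounds_ms(n, hz)
    print(f"HZ={hz:4d}: 2 ms -> {n} jiffy(ies), waits between {float(lo)} and {float(hi)} ms")
```

Under these assumptions a one-jiffy timer at HZ=100 or HZ=250 can fire almost immediately, which matches the early expirations reported for seeky traffic.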
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081118191208.GJ26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2008-11-18 19:47 ` Vivek Goyal 2008-11-18 21:14 ` Fabio Checconi @ 2008-11-18 23:07 ` Nauman Rafique 2 siblings, 0 replies; 92+ messages in thread From: Nauman Rafique @ 2008-11-18 23:07 UTC (permalink / raw) To: Jens Axboe Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, paolo.valente-rcYM44yAMweonA0d6jMUrA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > On Tue, Nov 18 2008, Fabio Checconi wrote: >> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500 >> > >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: >> ... >> > > I have to think a little bit on how it would be possible to support >> > > an option for time-only budgets, coexisting with the current behavior, >> > > but I think it can be done. >> > > >> > >> > IIUC, bfq and cfq are different in following manner. >> > >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses >> > time slices. >> > c. BFQ supports hierarchical fair queuing and CFQ does not. >> > >> > We are looking forward for implementation of point C. Fabio seems to >> > thinking of supporting time slice as a service (B). It seems like >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round >> > robin). 
>> > >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention >> > that they have been able to ensure throughput while ensuring tighter >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ >> > down the line? >> > >> >> BFQ started from CFQ, extending it in the way you correctly describe, >> so it is indeed very similar. There are also some minor changes to >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. >> >> The two schedulers share similar goals, and in my opinion BFQ can be >> considered, in the long term, a CFQ replacement; *but* before talking >> about replacing CFQ we have to consider that: >> >> - it *needs* review and testing; we've done our best, but for sure >> it's not enough; review and testing are never enough; >> - the service domain fairness, which was one of our objectives, requires >> some extra complexity; the mechanisms we used and the design choices >> we've made may not fit all the needs, or may not be as generic as the >> simpler CFQ's ones; >> - CFQ has years of history behind and has been tuned for a wider >> variety of environments than the ones we've been able to test. >> >> If time-based fairness is considered more robust and the loss of >> service-domain fairness is not a problem, then the two schedulers can >> be made even more similar. > > My preferred approach here would be, in order or TODO: > > - Create and test the smallish patches for seekiness, hw_tag checking, > and so on for CFQ. > - Create and test a WF2Q+ service dispatching patch for CFQ. > > and if there are leftovers after that, we could even conditionally > enable some of those if appropriate. I think the WF2Q+ is quite cool and > could be easily usable as the default, so it's definitely a viable > alternative. 1 Merge BFQ into CFQ (Jens and Fabio). 
I am assuming that this would
result in time slices being scheduled using WF2Q+
2 Do the following to support proportional division:
 a) Expose the per device weight interface to user, instead of calculating
 from priority.
 b) Add support for scheduling bandwidth among a hierarchy of cgroups
 (besides threads)
3 Do the following to support the goals of 2 level schedulers:
 a) Limit the request descriptors allocated to each cgroup by adding
 functionality to elv_may_queue()
 b) Add support for putting an absolute limit on IO consumed by a
 cgroup. Such support is provided by Andrea
 Righi's patches too.
 c) Add support (configurable option) to keep track of total disk
 time/sectors/count
 consumed at each device, and factor that into scheduling decision
 (more discussion needed here)
6 Incorporate an IO tracking approach which can allow tracking cgroups
for asynchronous reads/writes.
7 Start an offline email thread to keep track of progress on the above
goals.

Jens, what is your opinion on everything beyond (1) in the above list?

It would be great if work on (1) and (2)-(7) can happen in parallel so
that we can see "proportional division of IO bandwidth to cgroups" in
tree sooner rather than later.

>
> My main goal here is basically avoiding addition of Yet Another IO
> scheduler, especially one that is so closely tied to CFQ already.
>
> I'll start things off by splitting cfq into a few files similar to what
> bfq has done, as I think it makes a lot of sense. Fabio, if you could
> create patches for the small behavioural changes you made, we can
> discuss and hopefully merge those next.
>
> --
> Jens Axboe
>
>
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <e98e18940811181507t6b1473act2efa23df21dab270-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-19 14:24 ` Jens Axboe 0 siblings, 0 replies; 92+ messages in thread From: Jens Axboe @ 2008-11-19 14:24 UTC (permalink / raw) To: Nauman Rafique Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, paolo.valente-rcYM44yAMweonA0d6jMUrA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Tue, Nov 18 2008, Nauman Rafique wrote: > On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > On Tue, Nov 18 2008, Fabio Checconi wrote: > >> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500 > >> > > >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: > >> ... > >> > > I have to think a little bit on how it would be possible to support > >> > > an option for time-only budgets, coexisting with the current behavior, > >> > > but I think it can be done. > >> > > > >> > > >> > IIUC, bfq and cfq are different in following manner. > >> > > >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. > >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses > >> > time slices. > >> > c. BFQ supports hierarchical fair queuing and CFQ does not. > >> > > >> > We are looking forward for implementation of point C. Fabio seems to > >> > thinking of supporting time slice as a service (B). It seems like > >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round > >> > robin). 
> >> > > >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention > >> > that they have been able to ensure throughput while ensuring tighter > >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ > >> > down the line? > >> > > >> > >> BFQ started from CFQ, extending it in the way you correctly describe, > >> so it is indeed very similar. There are also some minor changes to > >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. > >> > >> The two schedulers share similar goals, and in my opinion BFQ can be > >> considered, in the long term, a CFQ replacement; *but* before talking > >> about replacing CFQ we have to consider that: > >> > >> - it *needs* review and testing; we've done our best, but for sure > >> it's not enough; review and testing are never enough; > >> - the service domain fairness, which was one of our objectives, requires > >> some extra complexity; the mechanisms we used and the design choices > >> we've made may not fit all the needs, or may not be as generic as the > >> simpler CFQ's ones; > >> - CFQ has years of history behind and has been tuned for a wider > >> variety of environments than the ones we've been able to test. > >> > >> If time-based fairness is considered more robust and the loss of > >> service-domain fairness is not a problem, then the two schedulers can > >> be made even more similar. > > > > My preferred approach here would be, in order or TODO: > > > > - Create and test the smallish patches for seekiness, hw_tag checking, > > and so on for CFQ. > > - Create and test a WF2Q+ service dispatching patch for CFQ. > > > > and if there are leftovers after that, we could even conditionally > > enable some of those if appropriate. I think the WF2Q+ is quite cool and > > could be easily usable as the default, so it's definitely a viable > > alternative. > > 1 Merge BFQ into CFQ (Jens and Fabio). 
I am assuming that this would
> result in time slices being scheduled using WF2Q+

Yep, at least that is my preference.

> 2 Do the following to support proportional division:
> a) Expose the per device weight interface to user, instead of calculating
> from priority.
> b) Add support for scheduling bandwidth among a hierarchy of cgroups
> (besides threads)
> 3 Do the following to support the goals of 2 level schedulers:
> a) Limit the request descriptors allocated to each cgroup by adding
> functionality to elv_may_queue()
> b) Add support for putting an absolute limit on IO consumed by a
> cgroup. Such support is provided by Andrea
> Righi's patches too.
> c) Add support (configurable option) to keep track of total disk
> time/sectors/count
> consumed at each device, and factor that into scheduling decision
> (more discussion needed here)
> 6 Incorporate an IO tracking approach which can allow tracking cgroups
> for asynchronous reads/writes.
> 7 Start an offline email thread to keep track of progress on the above
> goals.
>
> Jens, what is your opinion everything beyond (1) in the above list?
>
> It would be great if work on (1) and (2)-(7) can happen in parallel so
> that we can see "proportional division of IO bandwidth to cgroups" in
> tree sooner than later.

Sounds feasible, I'd like to see the cgroups approach get more traction.
My primary concern is just that I don't want to merge it into specific
IO schedulers. As you mention, we can hook into the may queue logic for
that subset of the problem, that avoids touching the io scheduler. If we
can get this supported 'generically', then I'd be happy to help out.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 92+ messages in thread
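A toy sketch of item 3a from the quoted list (capping request descriptors per cgroup at the may-queue hook) may help make the idea concrete. It is Python for readability, not kernel code. The class `MayQueueLimiter`, its method names, and the weight-proportional split of the descriptor pool are all assumptions made for illustration; none of them is proposed in the thread or taken from elv_may_queue() itself.

```python
from collections import defaultdict


class MayQueueLimiter:
    """Toy model of a per-cgroup cap on in-flight request descriptors.

    Hypothetical: the thread only says that elv_may_queue() could be
    extended; the proportional split below is an assumed policy.
    """

    def __init__(self, total_descriptors, weights):
        total_weight = sum(weights.values())
        # Each cgroup may hold its weight share of the pool, at least 1.
        self.limit = {
            cg: max(1, total_descriptors * w // total_weight)
            for cg, w in weights.items()
        }
        self.in_flight = defaultdict(int)

    def may_queue(self, cgroup):
        """Return True and account a descriptor if the cgroup is under its cap."""
        if self.in_flight[cgroup] >= self.limit[cgroup]:
            return False
        self.in_flight[cgroup] += 1
        return True

    def complete(self, cgroup):
        """Release one descriptor when a request from this cgroup finishes."""
        if self.in_flight[cgroup] > 0:
            self.in_flight[cgroup] -= 1


limiter = MayQueueLimiter(12, {"A": 1024, "B": 2048})
granted_to_a = sum(limiter.may_queue("A") for _ in range(10))
```

With 12 descriptors and weights 1024 and 2048, cgroup A is capped at 4 descriptors and B at 8. Further requests from A are refused until one of its requests completes, and the IO scheduler underneath is not touched.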
[parent not found: <20081119142446.GH26308@kernel.dk>]
[parent not found: <20081119142446.GH26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081119142446.GH26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> @ 2008-11-20 0:12 ` Divyesh Shah 0 siblings, 0 replies; 92+ messages in thread From: Divyesh Shah @ 2008-11-20 0:12 UTC (permalink / raw) To: Jens Axboe Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > On Tue, Nov 18 2008, Nauman Rafique wrote: >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: >> > On Tue, Nov 18 2008, Fabio Checconi wrote: >> >> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500 >> >> > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: >> >> ... >> >> > > I have to think a little bit on how it would be possible to support >> >> > > an option for time-only budgets, coexisting with the current behavior, >> >> > > but I think it can be done. >> >> > > >> >> > >> >> > IIUC, bfq and cfq are different in following manner. >> >> > >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses >> >> > time slices. >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not. >> >> > >> >> > We are looking forward for implementation of point C. Fabio seems to >> >> > thinking of supporting time slice as a service (B). 
It seems like >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round >> >> > robin). >> >> > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention >> >> > that they have been able to ensure throughput while ensuring tighter >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ >> >> > down the line? >> >> > >> >> >> >> BFQ started from CFQ, extending it in the way you correctly describe, >> >> so it is indeed very similar. There are also some minor changes to >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. >> >> >> >> The two schedulers share similar goals, and in my opinion BFQ can be >> >> considered, in the long term, a CFQ replacement; *but* before talking >> >> about replacing CFQ we have to consider that: >> >> >> >> - it *needs* review and testing; we've done our best, but for sure >> >> it's not enough; review and testing are never enough; >> >> - the service domain fairness, which was one of our objectives, requires >> >> some extra complexity; the mechanisms we used and the design choices >> >> we've made may not fit all the needs, or may not be as generic as the >> >> simpler CFQ's ones; >> >> - CFQ has years of history behind and has been tuned for a wider >> >> variety of environments than the ones we've been able to test. >> >> >> >> If time-based fairness is considered more robust and the loss of >> >> service-domain fairness is not a problem, then the two schedulers can >> >> be made even more similar. >> > >> > My preferred approach here would be, in order or TODO: >> > >> > - Create and test the smallish patches for seekiness, hw_tag checking, >> > and so on for CFQ. >> > - Create and test a WF2Q+ service dispatching patch for CFQ. >> > >> > and if there are leftovers after that, we could even conditionally >> > enable some of those if appropriate. 
I think the WF2Q+ is quite cool and >> > could be easily usable as the default, so it's definitely a viable >> > alternative. >> >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would >> result in time slices being scheduled using WF2Q+ > > Yep, at least that is my preference. > >> 2 Do the following to support proportional division: >> a) Expose the per device weight interface to user, instead of calculating >> from priority. >> b) Add support for scheduling bandwidth among a hierarchy of cgroups >> (besides threads) >> 3 Do the following to support the goals of 2 level schedulers: >> a) Limit the request descriptors allocated to each cgroup by adding >> functionality to elv_may_queue() >> b) Add support for putting an absolute limit on IO consumed by a >> cgroup. Such support is provided by Andrea >> Righi's patches too. >> c) Add support (configurable option) to keep track of total disk >> time/sectors/count >> consumed at each device, and factor that into scheduling decision >> (more discussion needed here) >> 6 Incorporate an IO tracking approach which can allow tracking cgroups >> for asynchronous reads/writes. >> 7 Start an offline email thread to keep track of progress on the above >> goals. >> >> Jens, what is your opinion everything beyond (1) in the above list? >> >> It would be great if work on (1) and (2)-(7) can happen in parallel so >> that we can see "proportional division of IO bandwidth to cgroups" in >> tree sooner than later. > > Sounds feasible, I'd like to see the cgroups approach get more traction. > My primary concern is just that I don't want to merge it into specific > IO schedulers. Jens, So are you saying you don't prefer cgroups based proportional IO division solutions in the IO scheduler but at a layer above so it can be shared with all IO schedulers? If yes, then in that case, what do you think about Vivek Goyal's patch or dm-ioband that achieve that. 
Of course, neither solution meets all the requirements in the list above, but we can work on that once we know which direction we should be heading in. In fact, it would help if you could express your reservations (if you have any) about these approaches. That would help in coming up with a plan that everyone agrees on.

Thanks,
Divyesh

> As you mention, we can hook into the may queue logic for
> that subset of the problem, that avoids touching the io scheduler. If we
> can get this supported 'generically', then I'd be happy to help out.
>
> --
> Jens Axboe
>

^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <af41c7c40811191612v5db13ae7n3cfe537beb6a157c@mail.gmail.com>]
[parent not found: <af41c7c40811191612v5db13ae7n3cfe537beb6a157c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <af41c7c40811191612v5db13ae7n3cfe537beb6a157c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-20 8:16 ` Jens Axboe 0 siblings, 0 replies; 92+ messages in thread From: Jens Axboe @ 2008-11-20 8:16 UTC (permalink / raw) To: Divyesh Shah Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Wed, Nov 19 2008, Divyesh Shah wrote: > On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > On Tue, Nov 18 2008, Nauman Rafique wrote: > >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > >> > On Tue, Nov 18 2008, Fabio Checconi wrote: > >> >> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500 > >> >> > > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: > >> >> ... > >> >> > > I have to think a little bit on how it would be possible to support > >> >> > > an option for time-only budgets, coexisting with the current behavior, > >> >> > > but I think it can be done. > >> >> > > > >> >> > > >> >> > IIUC, bfq and cfq are different in following manner. > >> >> > > >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. > >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses > >> >> > time slices. > >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not. 
> >> >> > > >> >> > We are looking forward for implementation of point C. Fabio seems to > >> >> > thinking of supporting time slice as a service (B). It seems like > >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round > >> >> > robin). > >> >> > > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention > >> >> > that they have been able to ensure throughput while ensuring tighter > >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ > >> >> > down the line? > >> >> > > >> >> > >> >> BFQ started from CFQ, extending it in the way you correctly describe, > >> >> so it is indeed very similar. There are also some minor changes to > >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. > >> >> > >> >> The two schedulers share similar goals, and in my opinion BFQ can be > >> >> considered, in the long term, a CFQ replacement; *but* before talking > >> >> about replacing CFQ we have to consider that: > >> >> > >> >> - it *needs* review and testing; we've done our best, but for sure > >> >> it's not enough; review and testing are never enough; > >> >> - the service domain fairness, which was one of our objectives, requires > >> >> some extra complexity; the mechanisms we used and the design choices > >> >> we've made may not fit all the needs, or may not be as generic as the > >> >> simpler CFQ's ones; > >> >> - CFQ has years of history behind and has been tuned for a wider > >> >> variety of environments than the ones we've been able to test. > >> >> > >> >> If time-based fairness is considered more robust and the loss of > >> >> service-domain fairness is not a problem, then the two schedulers can > >> >> be made even more similar. > >> > > >> > My preferred approach here would be, in order or TODO: > >> > > >> > - Create and test the smallish patches for seekiness, hw_tag checking, > >> > and so on for CFQ. 
> >> > - Create and test a WF2Q+ service dispatching patch for CFQ. > >> > > >> > and if there are leftovers after that, we could even conditionally > >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and > >> > could be easily usable as the default, so it's definitely a viable > >> > alternative. > >> > >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would > >> result in time slices being scheduled using WF2Q+ > > > > Yep, at least that is my preference. > > > >> 2 Do the following to support proportional division: > >> a) Expose the per device weight interface to user, instead of calculating > >> from priority. > >> b) Add support for scheduling bandwidth among a hierarchy of cgroups > >> (besides threads) > >> 3 Do the following to support the goals of 2 level schedulers: > >> a) Limit the request descriptors allocated to each cgroup by adding > >> functionality to elv_may_queue() > >> b) Add support for putting an absolute limit on IO consumed by a > >> cgroup. Such support is provided by Andrea > >> Righi's patches too. > >> c) Add support (configurable option) to keep track of total disk > >> time/sectors/count > >> consumed at each device, and factor that into scheduling decision > >> (more discussion needed here) > >> 6 Incorporate an IO tracking approach which can allow tracking cgroups > >> for asynchronous reads/writes. > >> 7 Start an offline email thread to keep track of progress on the above > >> goals. > >> > >> Jens, what is your opinion everything beyond (1) in the above list? > >> > >> It would be great if work on (1) and (2)-(7) can happen in parallel so > >> that we can see "proportional division of IO bandwidth to cgroups" in > >> tree sooner than later. > > > > Sounds feasible, I'd like to see the cgroups approach get more traction. > > My primary concern is just that I don't want to merge it into specific > > IO schedulers. 
>
> Jens,
> So are you saying you don't prefer cgroups based proportional IO
> division solutions in the IO scheduler but at a layer above so it can
> be shared with all IO schedulers?
>
> If yes, then in that case, what do you think about Vivek Goyal's
> patch or dm-ioband that achieve that. Of course, both solutions don't
> meet all the requirements in the list above, but we can work on that
> once we know which direction we should be heading in. In fact, it
> would help if you could express the reservations (if you have any)
> about these approaches. That would help in coming up with a plan that
> everyone agrees on.

The dm approach has some merits, the major one being that it'll fit
directly into existing setups that use dm and can be controlled with
familiar tools. That is a bonus. The drawback is partially the same:
it'll require dm. So it's still not a fit-all approach, unfortunately.

So I'd prefer an approach that doesn't force you to use dm.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081120081640.GE26308@kernel.dk>]
[parent not found: <20081120081640.GE26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081120081640.GE26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> @ 2008-11-20 13:40 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-20 13:40 UTC (permalink / raw) To: Jens Axboe Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Thu, Nov 20, 2008 at 09:16:41AM +0100, Jens Axboe wrote: > On Wed, Nov 19 2008, Divyesh Shah wrote: > > On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > On Tue, Nov 18 2008, Nauman Rafique wrote: > > >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > >> > On Tue, Nov 18 2008, Fabio Checconi wrote: > > >> >> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500 > > >> >> > > > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: > > >> >> ... > > >> >> > > I have to think a little bit on how it would be possible to support > > >> >> > > an option for time-only budgets, coexisting with the current behavior, > > >> >> > > but I think it can be done. > > >> >> > > > > >> >> > > > >> >> > IIUC, bfq and cfq are different in following manner. > > >> >> > > > >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. > > >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses > > >> >> > time slices. > > >> >> > c. 
BFQ supports hierarchical fair queuing and CFQ does not. > > >> >> > > > >> >> > We are looking forward for implementation of point C. Fabio seems to > > >> >> > thinking of supporting time slice as a service (B). It seems like > > >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round > > >> >> > robin). > > >> >> > > > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention > > >> >> > that they have been able to ensure throughput while ensuring tighter > > >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ > > >> >> > down the line? > > >> >> > > > >> >> > > >> >> BFQ started from CFQ, extending it in the way you correctly describe, > > >> >> so it is indeed very similar. There are also some minor changes to > > >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. > > >> >> > > >> >> The two schedulers share similar goals, and in my opinion BFQ can be > > >> >> considered, in the long term, a CFQ replacement; *but* before talking > > >> >> about replacing CFQ we have to consider that: > > >> >> > > >> >> - it *needs* review and testing; we've done our best, but for sure > > >> >> it's not enough; review and testing are never enough; > > >> >> - the service domain fairness, which was one of our objectives, requires > > >> >> some extra complexity; the mechanisms we used and the design choices > > >> >> we've made may not fit all the needs, or may not be as generic as the > > >> >> simpler CFQ's ones; > > >> >> - CFQ has years of history behind and has been tuned for a wider > > >> >> variety of environments than the ones we've been able to test. > > >> >> > > >> >> If time-based fairness is considered more robust and the loss of > > >> >> service-domain fairness is not a problem, then the two schedulers can > > >> >> be made even more similar. 
> > >> > > > >> > My preferred approach here would be, in order or TODO: > > >> > > > >> > - Create and test the smallish patches for seekiness, hw_tag checking, > > >> > and so on for CFQ. > > >> > - Create and test a WF2Q+ service dispatching patch for CFQ. > > >> > > > >> > and if there are leftovers after that, we could even conditionally > > >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and > > >> > could be easily usable as the default, so it's definitely a viable > > >> > alternative. > > >> > > >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would > > >> result in time slices being scheduled using WF2Q+ > > > > > > Yep, at least that is my preference. > > > > > >> 2 Do the following to support proportional division: > > >> a) Expose the per device weight interface to user, instead of calculating > > >> from priority. > > >> b) Add support for scheduling bandwidth among a hierarchy of cgroups > > >> (besides threads) > > >> 3 Do the following to support the goals of 2 level schedulers: > > >> a) Limit the request descriptors allocated to each cgroup by adding > > >> functionality to elv_may_queue() > > >> b) Add support for putting an absolute limit on IO consumed by a > > >> cgroup. Such support is provided by Andrea > > >> Righi's patches too. > > >> c) Add support (configurable option) to keep track of total disk > > >> time/sectors/count > > >> consumed at each device, and factor that into scheduling decision > > >> (more discussion needed here) > > >> 6 Incorporate an IO tracking approach which can allow tracking cgroups > > >> for asynchronous reads/writes. > > >> 7 Start an offline email thread to keep track of progress on the above > > >> goals. > > >> > > >> Jens, what is your opinion everything beyond (1) in the above list? 
> > >>
> > >> It would be great if work on (1) and (2)-(7) can happen in parallel so
> > >> that we can see "proportional division of IO bandwidth to cgroups" in
> > >> tree sooner than later.
> > >
> > > Sounds feasible, I'd like to see the cgroups approach get more traction.
> > > My primary concern is just that I don't want to merge it into specific
> > > IO schedulers.
> >
> > Jens,
> > So are you saying you don't prefer cgroups based proportional IO
> > division solutions in the IO scheduler but at a layer above so it can
> > be shared with all IO schedulers?
> >
> > If yes, then in that case, what do you think about Vivek Goyal's
> > patch or dm-ioband that achieve that. Of course, both solutions don't
> > meet all the requirements in the list above, but we can work on that
> > once we know which direction we should be heading in. In fact, it
> > would help if you could express the reservations (if you have any)
> > about these approaches. That would help in coming up with a plan that
> > everyone agrees on.
>
> The dm approach has some merits, the major one being that it'll fit
> directly into existing setups that use dm and can be controlled with
> familiar tools. That is a bonus. The drawback is partially the same -
> it'll require dm. So it's still not a fit-all approach, unfortunately.
>
> So I'd prefer an approach that doesn't force you to use dm.

Hi Jens,

My patches met the goal of not using dm for every device one wants
to control.

Having said that, a few things come to mind.

- In what cases do we need to control the higher level logical devices
  like dm? It looks like the real contention for resources is at leaf nodes.
  Hence any kind of resource management/fair queueing should probably be
  done at leaf nodes and not at higher level logical nodes.

  If that makes sense, then we probably don't need to control dm devices
  and we don't need such higher level solutions.

- Any kind of 2 level scheduler solution has the potential to break the
  underlying IO scheduler. A higher level solution requires buffering of
  bios and controlled release of bios to lower layers. This control breaks
  the assumptions of the lower layer IO scheduler, which knows in what order
  bios should be dispatched to the device to meet the semantics exported by
  the IO scheduler.

- A 2nd level scheduler does not keep track of tasks, only of task groups,
  and lets every group dispatch its fair share. This has a small semantic
  problem in the sense that tasks and groups in the root cgroup will not be
  considered at the same level. "root" will be considered one group at the
  same level as all child groups, hence competing with them for resources.

  This looks a little odd. Considering tasks and groups at the same level
  makes more sense. The cpu scheduler also considers tasks and groups at the
  same level, and deviating from that is probably not very good.

  Considering tasks and groups at the same level will matter only if the IO
  scheduler maintains a separate queue per task, like CFQ, because in that
  case the IO scheduler tries to provide fairness among the various task
  queues. Some schedulers like noop don't have any notion of separate
  task queues and fairness among them. In that case we probably don't
  have a choice but to assume the root group competes with child groups.

Keeping the above points in mind, two level scheduling is probably not a
very good idea. If putting the code in a particular IO scheduler is a
concern, we can probably explore how to maximize the sharing of cgroup
code among IO schedulers.

Thanks
Vivek

^ permalink raw reply [flat|nested] 92+ messages in thread
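The "root as one group" versus "tasks and groups at the same level" distinction raised in this message can be shown with a small share calculation. This is only an arithmetic sketch under assumed conditions: the function names, the example weights, and the idea that shares are plain weight fractions are all hypothetical and not taken from any of the patches discussed.

```python
from fractions import Fraction


def shares_same_level(task_weights, group_weights):
    """Root-level tasks and child groups compete as peers (CFQ/cpu-scheduler style)."""
    entities = {**task_weights, **group_weights}
    total = sum(entities.values())
    return {name: Fraction(w, total) for name, w in entities.items()}


def shares_root_as_group(task_weights, group_weights, root_weight):
    """Root-level tasks are lumped into one 'root' group that competes with child groups."""
    total = root_weight + sum(group_weights.values())
    shares = {name: Fraction(w, total) for name, w in group_weights.items()}
    root_share = Fraction(root_weight, total)
    task_total = sum(task_weights.values())
    # Tasks in the root group split whatever share the root group obtained.
    for name, w in task_weights.items():
        shares[name] = root_share * Fraction(w, task_total)
    return shares


tasks = {"T1": 1, "T2": 1}    # tasks living directly in the root cgroup
groups = {"G1": 1, "G2": 1}   # child cgroups

flat = shares_same_level(tasks, groups)
lumped = shares_root_as_group(tasks, groups, root_weight=1)
```

With equal weights, the same-level model gives T1, T2, G1 and G2 one quarter of the disk each. Treating root as a single group gives G1 and G2 one third each and leaves T1 and T2 with one sixth each, which is the asymmetry the message calls odd.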
[parent not found: <20081120134058.GA29306@redhat.com>]
[parent not found: <e98e18940811201154l6fb0499x24da39812fb2aa7e@mail.gmail.com>]
[parent not found: <e98e18940811201154l6fb0499x24da39812fb2aa7e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <e98e18940811201154l6fb0499x24da39812fb2aa7e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-20 21:15 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-20 21:15 UTC (permalink / raw) To: Nauman Rafique Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, Jens Axboe, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Thu, Nov 20, 2008 at 11:54:14AM -0800, Nauman Rafique wrote: > On Thu, Nov 20, 2008 at 5:40 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > On Thu, Nov 20, 2008 at 09:16:41AM +0100, Jens Axboe wrote: > >> On Wed, Nov 19 2008, Divyesh Shah wrote: > >> > On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > >> > > On Tue, Nov 18 2008, Nauman Rafique wrote: > >> > >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > >> > >> > On Tue, Nov 18 2008, Fabio Checconi wrote: > >> > >> >> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >> > >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500 > >> > >> >> > > >> > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: > >> > >> >> ... > >> > >> >> > > I have to think a little bit on how it would be possible to support > >> > >> >> > > an option for time-only budgets, coexisting with the current behavior, > >> > >> >> > > but I think it can be done. 
> >> > >> >> > > > >> > >> >> > > >> > >> >> > IIUC, bfq and cfq are different in following manner. > >> > >> >> > > >> > >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. > >> > >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses > >> > >> >> > time slices. > >> > >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not. > >> > >> >> > > >> > >> >> > We are looking forward for implementation of point C. Fabio seems to > >> > >> >> > thinking of supporting time slice as a service (B). It seems like > >> > >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round > >> > >> >> > robin). > >> > >> >> > > >> > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention > >> > >> >> > that they have been able to ensure throughput while ensuring tighter > >> > >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ > >> > >> >> > down the line? > >> > >> >> > > >> > >> >> > >> > >> >> BFQ started from CFQ, extending it in the way you correctly describe, > >> > >> >> so it is indeed very similar. There are also some minor changes to > >> > >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. 
> >> > >> >> > >> > >> >> The two schedulers share similar goals, and in my opinion BFQ can be > >> > >> >> considered, in the long term, a CFQ replacement; *but* before talking > >> > >> >> about replacing CFQ we have to consider that: > >> > >> >> > >> > >> >> - it *needs* review and testing; we've done our best, but for sure > >> > >> >> it's not enough; review and testing are never enough; > >> > >> >> - the service domain fairness, which was one of our objectives, requires > >> > >> >> some extra complexity; the mechanisms we used and the design choices > >> > >> >> we've made may not fit all the needs, or may not be as generic as the > >> > >> >> simpler CFQ's ones; > >> > >> >> - CFQ has years of history behind and has been tuned for a wider > >> > >> >> variety of environments than the ones we've been able to test. > >> > >> >> > >> > >> >> If time-based fairness is considered more robust and the loss of > >> > >> >> service-domain fairness is not a problem, then the two schedulers can > >> > >> >> be made even more similar. > >> > >> > > >> > >> > My preferred approach here would be, in order or TODO: > >> > >> > > >> > >> > - Create and test the smallish patches for seekiness, hw_tag checking, > >> > >> > and so on for CFQ. > >> > >> > - Create and test a WF2Q+ service dispatching patch for CFQ. > >> > >> > > >> > >> > and if there are leftovers after that, we could even conditionally > >> > >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and > >> > >> > could be easily usable as the default, so it's definitely a viable > >> > >> > alternative. > >> > >> > >> > >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would > >> > >> result in time slices being scheduled using WF2Q+ > >> > > > >> > > Yep, at least that is my preference. > >> > > > >> > >> 2 Do the following to support proportional division: > >> > >> a) Expose the per device weight interface to user, instead of calculating > >> > >> from priority. 
> >> > >> b) Add support for scheduling bandwidth among a hierarchy of cgroups > >> > >> (besides threads) > >> > >> 3 Do the following to support the goals of 2 level schedulers: > >> > >> a) Limit the request descriptors allocated to each cgroup by adding > >> > >> functionality to elv_may_queue() > >> > >> b) Add support for putting an absolute limit on IO consumed by a > >> > >> cgroup. Such support is provided by Andrea > >> > >> Righi's patches too. > >> > >> c) Add support (configurable option) to keep track of total disk > >> > >> time/sectors/count > >> > >> consumed at each device, and factor that into scheduling decision > >> > >> (more discussion needed here) > >> > >> 6 Incorporate an IO tracking approach which can allow tracking cgroups > >> > >> for asynchronous reads/writes. > >> > >> 7 Start an offline email thread to keep track of progress on the above > >> > >> goals. > >> > >> > >> > >> Jens, what is your opinion everything beyond (1) in the above list? > >> > >> > >> > >> It would be great if work on (1) and (2)-(7) can happen in parallel so > >> > >> that we can see "proportional division of IO bandwidth to cgroups" in > >> > >> tree sooner than later. > >> > > > >> > > Sounds feasible, I'd like to see the cgroups approach get more traction. > >> > > My primary concern is just that I don't want to merge it into specific > >> > > IO schedulers. > >> > > >> > Jens, > >> > So are you saying you don't prefer cgroups based proportional IO > >> > division solutions in the IO scheduler but at a layer above so it can > >> > be shared with all IO schedulers? > >> > > >> > If yes, then in that case, what do you think about Vivek Goyal's > >> > patch or dm-ioband that achieve that. Of course, both solutions don't > >> > meet all the requirements in the list above, but we can work on that > >> > once we know which direction we should be heading in. 
In fact, it > >> > would help if you could express the reservations (if you have any) > >> > about these approaches. That would help in coming up with a plan that > >> > everyone agrees on. > >> > >> The dm approach has some merrits, the major one being that it'll fit > >> directly into existing setups that use dm and can be controlled with > >> familiar tools. That is a bonus. The draw back is partially the same - > >> it'll require dm. So it's still not a fit-all approach, unfortunately. > >> > >> So I'd prefer an approach that doesn't force you to use dm. > > > > Hi Jens, > > > > My patches met the goal of not using the dm for every device one wants > > to control. > > > > Having said that, few things come to mind. > > > > - In what cases do we need to control the higher level logical devices > > like dm. It looks like real contention for resources is at leaf nodes. > > Hence any kind of resource management/fair queueing should probably be > > done at leaf nodes and not at higher level logical nodes. > > > > If that makes sense, then probably we don't need to control dm device > > and we don't need such higher level solutions. > > > > > > - Any kind of 2 level scheduler solution has the potential to break the > > underlying IO scheduler. Higher level solution requires buffering of > > bios and controlled release of bios to lower layers. This control breaks > > the assumptions of lower layer IO scheduler which knows in what order > > bios should be dispatched to device to meet the semantics exported by > > the IO scheduler. > > > > - 2nd level scheduler does not keep track of tasks but task groups lets > > every group dispatch fair share. This has got little semantic problem in > > the sense that tasks and groups in root cgroup will not be considered at > > same level. "root" will be considered one group at same level with all > > child group hence competing with them for resources. > > > > This looks little odd. 
Considering tasks and groups same level kind of > > makes more sense. cpu scheduler also consideres tasks and groups at same > > level and deviation from that probably is not very good. > > > > Considering tasks and groups at same level will matter only if IO > > scheduler maintains separate queue for the task, like CFQ. Because > > in that case IO scheduler tries to provide fairness among various task > > queues. Some schedulers like noop don't have any notion of separate > > task queues and fairness among them. In that case probably we don't > > have a choice but to assume root group competing with child groups. > > > > Keeping above points in mind, probably two level scheduling is not a > > very good idea. If putting the code in a particular IO scheduler is a > > concern we can probably explore ways regarding how we can maximize the > > sharing of cgroup code among IO schedulers. > > > > Thanks > > Vivek > > > > It seems that we have a solution if we can figure out a way to share > cgroup code between different schedulers. I am thinking how other > schedulers (AS, Deadline, No-op) would use cgroups. Will they have > proportional division between requests from different cgroups? And use > their own policy (e.g deadline scheduling) within a cgroup? How about > if we have both threads and cgroups at a particular level? I think > putting all threads in a default cgroup seems like a reasonable choice > in this case. > > Here is a high level design that comes to mind. > > Put proportional division code and state in common code. Each level of > the hierarchy which has more than one cgroup would have some state > maintained in common code. At leaf level of hiearchy, we can have a > cgroup specific scheduler (created when a cgroup is created). We can > choose a different scheduler for each cgroup (we can have a no-op for > one cgroup while cfq for another). I am not sure that I understand the different scheduler for each cgroup aspect of it. What's the need? 
It makes things even more complicated, I think.

But moving the proportional division code out of a particular scheduler
and making it common makes sense.

Looking at BFQ, I was thinking that we can keep a large part of the
code. This common code can think of everything as a scheduling entity.
This scheduling entity (SE) will be defined by the underlying scheduler,
depending on how queue management is done by that scheduler. So for CFQ,
at each level, an SE can be either a task or a group. For the schedulers
which don't maintain separate queues for tasks, it will simply be a
group at all levels.

We can probably employ B-WF2Q+ to provide hierarchical fairness between
the scheduling entities of this tree. The common layer will do the
scheduling of entities (without knowing what is contained inside) and
the underlying scheduler will take care of dispatching the requests from
the scheduled entity. (It could be a task queue for CFQ or a group queue
for other schedulers.)

The tricky part would be how to abstract it in a clean way. It should
lead to reduced code in CFQ/BFQ because the B-WF2Q+ logic will, for the
most part, be put into a common layer.

>
> The scheduler gets a callback (just like it does right now from
> driver) when the common code schedules a time slice (or budget) from
> that cgroup. No buffering/queuing is done in the common code, so it
> will still be a single level scheduler. When a request arrives, it is
> routed to its scheduler's queues based on its cgroup.
>
> The common code proportional scheduler uses specified weights to
> schedule time slices. Please let me know if it makes any sense. And
> then we can start talking about lower level details.

We can use either time slices or budgets (maybe configurable), depending
on which gives better results.

Thanks
Vivek
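To make the scheduling-entity tree described above a little more
concrete, here is a minimal userspace sketch in Python. It is only a
model, not kernel code: the Entity class, the shares() helper, the tree
layouts and all weights are hypothetical and do not come from any of the
patches discussed. It assumes every entity stays backlogged and shows how
the tree differs between a scheduler with per-task queues (CFQ-like) and
one without (noop-like).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    """A scheduling entity (SE): a task queue or a cgroup (hypothetical)."""
    name: str
    weight: int
    children: List["Entity"] = field(default_factory=list)


def shares(entity: Entity, share: float = 1.0) -> Dict[str, float]:
    """Return the fraction of disk service each leaf entity receives.

    At every level, service is split among sibling entities in
    proportion to their weights, assuming all of them stay backlogged.
    """
    if not entity.children:
        return {entity.name: share}
    total = sum(child.weight for child in entity.children)
    result: Dict[str, float] = {}
    for child in entity.children:
        result.update(shares(child, share * child.weight / total))
    return result


# CFQ-like scheduler: tasks of the root cgroup sit at the same level as
# the child group, so each root task competes directly with groupA.
cfq_tree = Entity("root", 1, [
    Entity("task1", 1),
    Entity("task2", 1),
    Entity("groupA", 2, [Entity("task3", 1), Entity("task4", 1)]),
])

# noop-like scheduler: no per-task queues, so every entity is a group and
# the root's own tasks are lumped into one implicit group.
noop_tree = Entity("root", 1, [
    Entity("root_tasks", 1),
    Entity("groupA", 2),
])

print(shares(cfq_tree))
print(shares(noop_tree))
```

With these made-up weights, the CFQ-like tree gives each of the four
tasks an equal share, while the noop-like tree only divides service
between the implicit root group and groupA.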
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081120211536.GG29306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-20 22:42 ` Nauman Rafique 0 siblings, 0 replies; 92+ messages in thread From: Nauman Rafique @ 2008-11-20 22:42 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, Jens Axboe, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Thu, Nov 20, 2008 at 1:15 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Thu, Nov 20, 2008 at 11:54:14AM -0800, Nauman Rafique wrote: >> On Thu, Nov 20, 2008 at 5:40 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> > On Thu, Nov 20, 2008 at 09:16:41AM +0100, Jens Axboe wrote: >> >> On Wed, Nov 19 2008, Divyesh Shah wrote: >> >> > On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: >> >> > > On Tue, Nov 18 2008, Nauman Rafique wrote: >> >> > >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: >> >> > >> > On Tue, Nov 18 2008, Fabio Checconi wrote: >> >> > >> >> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> >> > >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500 >> >> > >> >> > >> >> > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: >> >> > >> >> ... 
>> >> > >> >> > > I have to think a little bit on how it would be possible to support >> >> > >> >> > > an option for time-only budgets, coexisting with the current behavior, >> >> > >> >> > > but I think it can be done. >> >> > >> >> > > >> >> > >> >> > >> >> > >> >> > IIUC, bfq and cfq are different in following manner. >> >> > >> >> > >> >> > >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. >> >> > >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses >> >> > >> >> > time slices. >> >> > >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not. >> >> > >> >> > >> >> > >> >> > We are looking forward for implementation of point C. Fabio seems to >> >> > >> >> > thinking of supporting time slice as a service (B). It seems like >> >> > >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round >> >> > >> >> > robin). >> >> > >> >> > >> >> > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention >> >> > >> >> > that they have been able to ensure throughput while ensuring tighter >> >> > >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ >> >> > >> >> > down the line? >> >> > >> >> > >> >> > >> >> >> >> > >> >> BFQ started from CFQ, extending it in the way you correctly describe, >> >> > >> >> so it is indeed very similar. There are also some minor changes to >> >> > >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. 
>> >> > >> >> >> >> > >> >> The two schedulers share similar goals, and in my opinion BFQ can be >> >> > >> >> considered, in the long term, a CFQ replacement; *but* before talking >> >> > >> >> about replacing CFQ we have to consider that: >> >> > >> >> >> >> > >> >> - it *needs* review and testing; we've done our best, but for sure >> >> > >> >> it's not enough; review and testing are never enough; >> >> > >> >> - the service domain fairness, which was one of our objectives, requires >> >> > >> >> some extra complexity; the mechanisms we used and the design choices >> >> > >> >> we've made may not fit all the needs, or may not be as generic as the >> >> > >> >> simpler CFQ's ones; >> >> > >> >> - CFQ has years of history behind and has been tuned for a wider >> >> > >> >> variety of environments than the ones we've been able to test. >> >> > >> >> >> >> > >> >> If time-based fairness is considered more robust and the loss of >> >> > >> >> service-domain fairness is not a problem, then the two schedulers can >> >> > >> >> be made even more similar. >> >> > >> > >> >> > >> > My preferred approach here would be, in order or TODO: >> >> > >> > >> >> > >> > - Create and test the smallish patches for seekiness, hw_tag checking, >> >> > >> > and so on for CFQ. >> >> > >> > - Create and test a WF2Q+ service dispatching patch for CFQ. >> >> > >> > >> >> > >> > and if there are leftovers after that, we could even conditionally >> >> > >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and >> >> > >> > could be easily usable as the default, so it's definitely a viable >> >> > >> > alternative. >> >> > >> >> >> > >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would >> >> > >> result in time slices being scheduled using WF2Q+ >> >> > > >> >> > > Yep, at least that is my preference. 
>> >> > > >> >> > >> 2 Do the following to support proportional division: >> >> > >> a) Expose the per device weight interface to user, instead of calculating >> >> > >> from priority. >> >> > >> b) Add support for scheduling bandwidth among a hierarchy of cgroups >> >> > >> (besides threads) >> >> > >> 3 Do the following to support the goals of 2 level schedulers: >> >> > >> a) Limit the request descriptors allocated to each cgroup by adding >> >> > >> functionality to elv_may_queue() >> >> > >> b) Add support for putting an absolute limit on IO consumed by a >> >> > >> cgroup. Such support is provided by Andrea >> >> > >> Righi's patches too. >> >> > >> c) Add support (configurable option) to keep track of total disk >> >> > >> time/sectors/count >> >> > >> consumed at each device, and factor that into scheduling decision >> >> > >> (more discussion needed here) >> >> > >> 6 Incorporate an IO tracking approach which can allow tracking cgroups >> >> > >> for asynchronous reads/writes. >> >> > >> 7 Start an offline email thread to keep track of progress on the above >> >> > >> goals. >> >> > >> >> >> > >> Jens, what is your opinion everything beyond (1) in the above list? >> >> > >> >> >> > >> It would be great if work on (1) and (2)-(7) can happen in parallel so >> >> > >> that we can see "proportional division of IO bandwidth to cgroups" in >> >> > >> tree sooner than later. >> >> > > >> >> > > Sounds feasible, I'd like to see the cgroups approach get more traction. >> >> > > My primary concern is just that I don't want to merge it into specific >> >> > > IO schedulers. >> >> > >> >> > Jens, >> >> > So are you saying you don't prefer cgroups based proportional IO >> >> > division solutions in the IO scheduler but at a layer above so it can >> >> > be shared with all IO schedulers? >> >> > >> >> > If yes, then in that case, what do you think about Vivek Goyal's >> >> > patch or dm-ioband that achieve that. 
Of course, both solutions don't >> >> > meet all the requirements in the list above, but we can work on that >> >> > once we know which direction we should be heading in. In fact, it >> >> > would help if you could express the reservations (if you have any) >> >> > about these approaches. That would help in coming up with a plan that >> >> > everyone agrees on. >> >> >> >> The dm approach has some merrits, the major one being that it'll fit >> >> directly into existing setups that use dm and can be controlled with >> >> familiar tools. That is a bonus. The draw back is partially the same - >> >> it'll require dm. So it's still not a fit-all approach, unfortunately. >> >> >> >> So I'd prefer an approach that doesn't force you to use dm. >> > >> > Hi Jens, >> > >> > My patches met the goal of not using the dm for every device one wants >> > to control. >> > >> > Having said that, few things come to mind. >> > >> > - In what cases do we need to control the higher level logical devices >> > like dm. It looks like real contention for resources is at leaf nodes. >> > Hence any kind of resource management/fair queueing should probably be >> > done at leaf nodes and not at higher level logical nodes. >> > >> > If that makes sense, then probably we don't need to control dm device >> > and we don't need such higher level solutions. >> > >> > >> > - Any kind of 2 level scheduler solution has the potential to break the >> > underlying IO scheduler. Higher level solution requires buffering of >> > bios and controlled release of bios to lower layers. This control breaks >> > the assumptions of lower layer IO scheduler which knows in what order >> > bios should be dispatched to device to meet the semantics exported by >> > the IO scheduler. >> > >> > - 2nd level scheduler does not keep track of tasks but task groups lets >> > every group dispatch fair share. 
This has got little semantic problem in >> > the sense that tasks and groups in root cgroup will not be considered at >> > same level. "root" will be considered one group at same level with all >> > child group hence competing with them for resources. >> > >> > This looks little odd. Considering tasks and groups same level kind of >> > makes more sense. cpu scheduler also consideres tasks and groups at same >> > level and deviation from that probably is not very good. >> > >> > Considering tasks and groups at same level will matter only if IO >> > scheduler maintains separate queue for the task, like CFQ. Because >> > in that case IO scheduler tries to provide fairness among various task >> > queues. Some schedulers like noop don't have any notion of separate >> > task queues and fairness among them. In that case probably we don't >> > have a choice but to assume root group competing with child groups. >> > >> > Keeping above points in mind, probably two level scheduling is not a >> > very good idea. If putting the code in a particular IO scheduler is a >> > concern we can probably explore ways regarding how we can maximize the >> > sharing of cgroup code among IO schedulers. >> > >> > Thanks >> > Vivek >> > >> >> It seems that we have a solution if we can figure out a way to share >> cgroup code between different schedulers. I am thinking how other >> schedulers (AS, Deadline, No-op) would use cgroups. Will they have >> proportional division between requests from different cgroups? And use >> their own policy (e.g deadline scheduling) within a cgroup? How about >> if we have both threads and cgroups at a particular level? I think >> putting all threads in a default cgroup seems like a reasonable choice >> in this case. >> >> Here is a high level design that comes to mind. >> >> Put proportional division code and state in common code. Each level of >> the hierarchy which has more than one cgroup would have some state >> maintained in common code. 
At leaf level of hiearchy, we can have a >> cgroup specific scheduler (created when a cgroup is created). We can >> choose a different scheduler for each cgroup (we can have a no-op for >> one cgroup while cfq for another). > > I am not sure that I understand the different scheduler for each cgroup > aspect of it. What's the need? It makes things even more complicated I > think. With the design I had in my mind, it seemed like that would come for free. But if it does not, I completely agree with you that its not as important. > > But moving proportional division code out of particular scheduler and make > it common makes sense. > > Looking at BFQ, I was thinking that we can just keep large part of the > code. This common code can think of everything as scheduling entity. This > scheduling entity (SE) will be defined by underlying scheduler depending on > how queue management is done by underlying scheduler. So for CFQ, at > each level, an SE can be either task or group. For the schedulers which > don't maintain separate queues for tasks, it will simply be group at all > levels. So the structure of hierarchy would be dependent on the underlying scheduler? > > We probably can employ B-WFQ2+ to provide hierarchical fairness between > secheduling entities of this tree. Common layer will do the scheduling of > entities (without knowing what is contained inside) and underlying scheduler > will take care of dispatching the requests from the scheduled entity. > (It could be a task queue for CFQ or a group queue for other schedulers). > > The tricky part would be how to abstract it in a clean way. It should lead > to reduced code in CFQ/BFQ because B-WFQ2+ logic will be put into a > common layer (for large part). How about this plan: 1 Start with CFQ patched with some BFQ like patches (This is what we will have if Jens takes some of Fabio's patches). This will have no cgroup related logic (correct me if I am wrong). 
2 Repeat the proportional scheduling logic for cgroups in the common
layer, without touching the code produced in step 1. That means that we
will have WF2Q+ used for scheduling cgroup time slices proportional to
weight in the common code. If CFQ (step 1 output) is used as the
scheduler, WF2Q+ would be used there too, but to schedule time slices
(in proportion to priorities?) between different threads. The common
code logic will be completely oblivious of the actual scheduler used
(patched CFQ, Deadline, AS, etc.).

cgroup tracking has to be implemented as part of step 2. The good thing
is that step 2 can proceed independently of step 1, as the output of
step 1 will have the same interface as the existing CFQ scheduler.

>
>>
>> The scheduler gets a callback (just like it does right now from
>> driver) when the common code schedules a time slice (or budget) from
>> that cgroup. No buffering/queuing is done in the common code, so it
>> will still be a single level scheduler. When a request arrives, it is
>> routed to its scheduler's queues based on its cgroup.
>>
>> The common code proportional scheduler uses specified weights to
>> schedule time slices. Please let me know if it makes any sense. And
>> then we can start talking about lower level details.
>
> We can use either time slices or budgets (may be configurable) depending
> on which gives better results.
>
> Thanks
> Vivek
>
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <e98e18940811201442s787a346em4ada30bcb1badfe6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-21 15:22 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-21 15:22 UTC (permalink / raw) To: Nauman Rafique Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, Jens Axboe, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Thu, Nov 20, 2008 at 02:42:38PM -0800, Nauman Rafique wrote: [..] > >> It seems that we have a solution if we can figure out a way to share > >> cgroup code between different schedulers. I am thinking how other > >> schedulers (AS, Deadline, No-op) would use cgroups. Will they have > >> proportional division between requests from different cgroups? And use > >> their own policy (e.g deadline scheduling) within a cgroup? How about > >> if we have both threads and cgroups at a particular level? I think > >> putting all threads in a default cgroup seems like a reasonable choice > >> in this case. > >> > >> Here is a high level design that comes to mind. > >> > >> Put proportional division code and state in common code. Each level of > >> the hierarchy which has more than one cgroup would have some state > >> maintained in common code. At leaf level of hiearchy, we can have a > >> cgroup specific scheduler (created when a cgroup is created). We can > >> choose a different scheduler for each cgroup (we can have a no-op for > >> one cgroup while cfq for another). 
> >
> > I am not sure that I understand the different scheduler for each cgroup
> > aspect of it. What's the need? It makes things even more complicated I
> > think.
>
> With the design I had in my mind, it seemed like that would come for
> free. But if it does not, I completely agree with you that its not as
> important.
> >
> > But moving proportional division code out of particular scheduler and make
> > it common makes sense.
> >
> > Looking at BFQ, I was thinking that we can just keep large part of the
> > code. This common code can think of everything as scheduling entity. This
> > scheduling entity (SE) will be defined by underlying scheduler depending on
> > how queue management is done by underlying scheduler. So for CFQ, at
> > each level, an SE can be either task or group. For the schedulers which
> > don't maintain separate queues for tasks, it will simply be group at all
> > levels.
>
> So the structure of hierarchy would be dependent on the underlying scheduler?
>

Kind of. In fact it will depend on both the cgroup hierarchy and the
underlying scheduler.

> >
> > We probably can employ B-WF2Q+ to provide hierarchical fairness between
> > scheduling entities of this tree. Common layer will do the scheduling of
> > entities (without knowing what is contained inside) and underlying scheduler
> > will take care of dispatching the requests from the scheduled entity.
> > (It could be a task queue for CFQ or a group queue for other schedulers).
> >
> > The tricky part would be how to abstract it in a clean way. It should lead
> > to reduced code in CFQ/BFQ because B-WF2Q+ logic will be put into a
> > common layer (for large part).
>
> How about this plan:
> 1 Start with CFQ patched with some BFQ like patches (This is what we
> will have if Jens takes some of Fabio's patches). This will have no
> cgroup related logic (correct me if I am wrong).
> 2 Repeat proportional scheduling logic for cgroups in the common
> layer, without touching the code produced in step 1. That means that
> we will have WF2Q+ used for scheduling cgroup time slices proportional
> to weight in the common code. If CFQ (step 1 output) is used as
> scheduler, WF2Q+ would be used there too, but to schedule time slices
> (in proportion to priorities?) between different threads. Common code
> logic will be completely oblivious of the actual scheduler used
> (patched CFQ, Deadline, AS etc).

I think once you start using WF2Q+ in the common layer, CFQ will have to
get rid of that code. (Remember that in the case of CFQ, we will have a
tree which has both tasks and groups as scheduling entities.) So the
common layer code can select the next entity to be dispatched based on
WF2Q+, and then CFQ will decide which request to dispatch within that
scheduling entity.

So maybe we can start with BFQ and try to break the code into two
pieces: common code and scheduler-specific code. Then we can try to make
use of the common code in deadline or anticipatory to see if things work
fine. If that works, then we can get to CFQ and make it use the common
code.

By that time CFQ should have Fabio's changes. I think that will include
the WF2Q+ algorithm also (at least to provide fairness among tasks, and
not the hierarchical thing). Once the common layer WF2Q+ works well, we
can get rid of WF2Q+ from CFQ and try to complete the picture.

> cgroup tracking has to be implemented as part of step 2. The good
> thing is that step 2 can proceed independent of step 1, as the output
> of step 1 will have the same interface as the existing CFQ scheduler.
>

Agreed. Any kind of tracking based on the bio and not the task context
will have to be done later, once we have come up with the common layer
code.

These are very vague high-level ideas. The devil lies in the details. :-)
I will get started to see how feasible the common layer code idea is.

Thanks
Vivek
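The entity selection step described above (the common layer picks the
next scheduling entity, the underlying scheduler then picks a request
inside it) can be sketched roughly as follows. This is a simplified
userspace model in the spirit of WF2Q+, not BFQ's actual code: it assumes
every entity is always backlogged and every slice has the same size, and
the Entity and CommonLayer names are hypothetical. The weights 1024 and
2048 are borrowed from the two-cgroup example in the patch documentation.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Iterable, List


@dataclass
class Entity:
    """A task queue or group, opaque to the common layer (hypothetical)."""
    name: str
    weight: int
    start: Fraction = Fraction(0)    # virtual start time of the next slice
    finish: Fraction = Fraction(0)   # virtual finish time of the next slice


class CommonLayer:
    """Picks the next entity to serve; knows nothing about its contents."""

    def __init__(self, entities: Iterable[Entity], slice_size: int = 1) -> None:
        self.entities: List[Entity] = list(entities)
        self.slice_size = slice_size
        self.vtime = Fraction(0)     # system virtual time
        self.total_weight = sum(e.weight for e in self.entities)
        for e in self.entities:
            e.start = Fraction(0)
            e.finish = Fraction(slice_size, e.weight)

    def pick_next(self) -> Entity:
        # Virtual time never lags behind the smallest start time, so at
        # least one entity is always eligible.
        self.vtime = max(self.vtime, min(e.start for e in self.entities))
        # Eligible entities are those whose slice has virtually started.
        eligible = [e for e in self.entities if e.start <= self.vtime]
        # Among eligible entities, serve the smallest virtual finish time.
        chosen = min(eligible, key=lambda e: e.finish)
        # Charge one slice: advance virtual time and the entity's stamps.
        self.vtime += Fraction(self.slice_size, self.total_weight)
        chosen.start = chosen.finish
        chosen.finish = chosen.start + Fraction(self.slice_size, chosen.weight)
        return chosen


layer = CommonLayer([Entity("A", 1024), Entity("B", 2048)])
picks = [layer.pick_next().name for _ in range(6)]
print(picks)
```

In a real design the scheduler-specific code (CFQ, deadline, ...) would
be called with the chosen entity and would decide which request to
dispatch from it; that part is deliberately left out here. Fractions are
used instead of floats so the eligibility comparison is exact.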
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081120134058.GA29306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-20 19:54 ` Nauman Rafique 2008-11-26 6:40 ` Fernando Luis Vázquez Cao 1 sibling, 0 replies; 92+ messages in thread From: Nauman Rafique @ 2008-11-20 19:54 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, Jens Axboe, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Thu, Nov 20, 2008 at 5:40 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Thu, Nov 20, 2008 at 09:16:41AM +0100, Jens Axboe wrote: >> On Wed, Nov 19 2008, Divyesh Shah wrote: >> > On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: >> > > On Tue, Nov 18 2008, Nauman Rafique wrote: >> > >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: >> > >> > On Tue, Nov 18 2008, Fabio Checconi wrote: >> > >> >> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> > >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500 >> > >> >> > >> > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote: >> > >> >> ... >> > >> >> > > I have to think a little bit on how it would be possible to support >> > >> >> > > an option for time-only budgets, coexisting with the current behavior, >> > >> >> > > but I think it can be done. >> > >> >> > > >> > >> >> > >> > >> >> > IIUC, bfq and cfq are different in following manner. >> > >> >> > >> > >> >> > a. 
BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin. >> > >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses >> > >> >> > time slices. >> > >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not. >> > >> >> > >> > >> >> > We are looking forward for implementation of point C. Fabio seems to >> > >> >> > thinking of supporting time slice as a service (B). It seems like >> > >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round >> > >> >> > robin). >> > >> >> > >> > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention >> > >> >> > that they have been able to ensure throughput while ensuring tighter >> > >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ >> > >> >> > down the line? >> > >> >> > >> > >> >> >> > >> >> BFQ started from CFQ, extending it in the way you correctly describe, >> > >> >> so it is indeed very similar. There are also some minor changes to >> > >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic. >> > >> >> >> > >> >> The two schedulers share similar goals, and in my opinion BFQ can be >> > >> >> considered, in the long term, a CFQ replacement; *but* before talking >> > >> >> about replacing CFQ we have to consider that: >> > >> >> >> > >> >> - it *needs* review and testing; we've done our best, but for sure >> > >> >> it's not enough; review and testing are never enough; >> > >> >> - the service domain fairness, which was one of our objectives, requires >> > >> >> some extra complexity; the mechanisms we used and the design choices >> > >> >> we've made may not fit all the needs, or may not be as generic as the >> > >> >> simpler CFQ's ones; >> > >> >> - CFQ has years of history behind and has been tuned for a wider >> > >> >> variety of environments than the ones we've been able to test. 
>> > >> >> >> > >> >> If time-based fairness is considered more robust and the loss of >> > >> >> service-domain fairness is not a problem, then the two schedulers can >> > >> >> be made even more similar. >> > >> > >> > >> > My preferred approach here would be, in order or TODO: >> > >> > >> > >> > - Create and test the smallish patches for seekiness, hw_tag checking, >> > >> > and so on for CFQ. >> > >> > - Create and test a WF2Q+ service dispatching patch for CFQ. >> > >> > >> > >> > and if there are leftovers after that, we could even conditionally >> > >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and >> > >> > could be easily usable as the default, so it's definitely a viable >> > >> > alternative. >> > >> >> > >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would >> > >> result in time slices being scheduled using WF2Q+ >> > > >> > > Yep, at least that is my preference. >> > > >> > >> 2 Do the following to support proportional division: >> > >> a) Expose the per device weight interface to user, instead of calculating >> > >> from priority. >> > >> b) Add support for scheduling bandwidth among a hierarchy of cgroups >> > >> (besides threads) >> > >> 3 Do the following to support the goals of 2 level schedulers: >> > >> a) Limit the request descriptors allocated to each cgroup by adding >> > >> functionality to elv_may_queue() >> > >> b) Add support for putting an absolute limit on IO consumed by a >> > >> cgroup. Such support is provided by Andrea >> > >> Righi's patches too. >> > >> c) Add support (configurable option) to keep track of total disk >> > >> time/sectors/count >> > >> consumed at each device, and factor that into scheduling decision >> > >> (more discussion needed here) >> > >> 6 Incorporate an IO tracking approach which can allow tracking cgroups >> > >> for asynchronous reads/writes. >> > >> 7 Start an offline email thread to keep track of progress on the above >> > >> goals. 
>> > >> >> > >> Jens, what is your opinion everything beyond (1) in the above list? >> > >> >> > >> It would be great if work on (1) and (2)-(7) can happen in parallel so >> > >> that we can see "proportional division of IO bandwidth to cgroups" in >> > >> tree sooner than later. >> > > >> > > Sounds feasible, I'd like to see the cgroups approach get more traction. >> > > My primary concern is just that I don't want to merge it into specific >> > > IO schedulers. >> > >> > Jens, >> > So are you saying you don't prefer cgroups based proportional IO >> > division solutions in the IO scheduler but at a layer above so it can >> > be shared with all IO schedulers? >> > >> > If yes, then in that case, what do you think about Vivek Goyal's >> > patch or dm-ioband that achieve that. Of course, both solutions don't >> > meet all the requirements in the list above, but we can work on that >> > once we know which direction we should be heading in. In fact, it >> > would help if you could express the reservations (if you have any) >> > about these approaches. That would help in coming up with a plan that >> > everyone agrees on. >> >> The dm approach has some merrits, the major one being that it'll fit >> directly into existing setups that use dm and can be controlled with >> familiar tools. That is a bonus. The draw back is partially the same - >> it'll require dm. So it's still not a fit-all approach, unfortunately. >> >> So I'd prefer an approach that doesn't force you to use dm. > > Hi Jens, > > My patches met the goal of not using the dm for every device one wants > to control. > > Having said that, few things come to mind. > > - In what cases do we need to control the higher level logical devices > like dm. It looks like real contention for resources is at leaf nodes. > Hence any kind of resource management/fair queueing should probably be > done at leaf nodes and not at higher level logical nodes. 
> > If that makes sense, then probably we don't need to control dm device > and we don't need such higher level solutions. > > > - Any kind of 2 level scheduler solution has the potential to break the > underlying IO scheduler. Higher level solution requires buffering of > bios and controlled release of bios to lower layers. This control breaks > the assumptions of lower layer IO scheduler which knows in what order > bios should be dispatched to device to meet the semantics exported by > the IO scheduler. > > - 2nd level scheduler does not keep track of tasks but task groups lets > every group dispatch fair share. This has got little semantic problem in > the sense that tasks and groups in root cgroup will not be considered at > same level. "root" will be considered one group at same level with all > child group hence competing with them for resources. > > This looks little odd. Considering tasks and groups same level kind of > makes more sense. cpu scheduler also consideres tasks and groups at same > level and deviation from that probably is not very good. > > Considering tasks and groups at same level will matter only if IO > scheduler maintains separate queue for the task, like CFQ. Because > in that case IO scheduler tries to provide fairness among various task > queues. Some schedulers like noop don't have any notion of separate > task queues and fairness among them. In that case probably we don't > have a choice but to assume root group competing with child groups. > > Keeping above points in mind, probably two level scheduling is not a > very good idea. If putting the code in a particular IO scheduler is a > concern we can probably explore ways regarding how we can maximize the > sharing of cgroup code among IO schedulers. > > Thanks > Vivek > It seems that we have a solution if we can figure out a way to share cgroup code between different schedulers. I am thinking how other schedulers (AS, Deadline, No-op) would use cgroups. 
Will they have proportional division between requests from different
cgroups, and use their own policy (e.g. deadline scheduling) within a
cgroup? What if we have both threads and cgroups at a particular level?
I think putting all threads in a default cgroup is a reasonable choice
in that case.

Here is a high level design that comes to mind.

Put the proportional division code and state in common code. Each level
of the hierarchy which has more than one cgroup would have some state
maintained in common code. At the leaf level of the hierarchy, we can
have a cgroup specific scheduler (created when the cgroup is created).
We can choose a different scheduler for each cgroup (no-op for one
cgroup and cfq for another, for example). The scheduler gets a callback
(just like it does right now from the driver) when the common code
schedules a time slice (or budget) for that cgroup. No buffering/queuing
is done in the common code, so it will still be a single level
scheduler. When a request arrives, it is routed to its scheduler's
queues based on its cgroup. The common code proportional scheduler uses
the specified weights to schedule time slices.

Please let me know if this makes sense; then we can start talking about
lower level details.

^ permalink raw reply [flat|nested] 92+ messages in thread
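A minimal sketch of the common-code idea described above, under stated assumptions. Groups are picked by smallest virtual time (a stride-scheduling style rule chosen here purely for illustration; the mail does not say how slices are ordered), and each group's own scheduler is stood in for by a callback. The names io_group, run_slice, count_slice, pick_next_group and schedule_slices are hypothetical and do not come from any posted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-point scale so that the per-slice charge is an exact integer. */
#define VTIME_SCALE (1u << 20)

struct io_group;
typedef void (*slice_fn)(struct io_group *grp);

/* Hypothetical per-cgroup state kept by the common code. */
struct io_group {
	unsigned int weight;	/* proportional weight, must be > 0 */
	uint64_t vtime;		/* virtual time consumed so far */
	unsigned long slices;	/* slices handed to this group (for the demo) */
	slice_fn run_slice;	/* callback into the group's own scheduler */
};

/* Stand-in for a per-cgroup scheduler: it only counts its slices. */
static void count_slice(struct io_group *grp)
{
	grp->slices++;
}

/* Pick the group with the smallest virtual time; ties go to the first. */
static struct io_group *pick_next_group(struct io_group *groups, size_t n)
{
	struct io_group *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!best || groups[i].vtime < best->vtime)
			best = &groups[i];
	}
	return best;
}

/* Hand out 'count' slices; the common code itself queues nothing. */
static void schedule_slices(struct io_group *groups, size_t n,
			    unsigned long count)
{
	for (unsigned long s = 0; s < count; s++) {
		struct io_group *grp = pick_next_group(groups, n);

		assert(grp != NULL && grp->weight > 0);
		grp->run_slice(grp);
		/* Heavier groups are charged less virtual time per slice. */
		grp->vtime += VTIME_SCALE / grp->weight;
	}
}
```

With two groups of weight 2048 and 1024, 3000 slices are split 2000 to 1000 in this sketch. Requests would still sit only in the per-cgroup scheduler's queues, which is what keeps this a single level scheduler.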
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081120134058.GA29306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2008-11-20 19:54 ` Nauman Rafique @ 2008-11-26 6:40 ` Fernando Luis Vázquez Cao 1 sibling, 0 replies; 92+ messages in thread From: Fernando Luis Vázquez Cao @ 2008-11-26 6:40 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jens Axboe, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Thu, 2008-11-20 at 08:40 -0500, Vivek Goyal wrote: > > The dm approach has some merrits, the major one being that it'll fit > > directly into existing setups that use dm and can be controlled with > > familiar tools. That is a bonus. The draw back is partially the same - > > it'll require dm. So it's still not a fit-all approach, unfortunately. > > > > So I'd prefer an approach that doesn't force you to use dm. > > Hi Jens, > > My patches met the goal of not using the dm for every device one wants > to control. > > Having said that, few things come to mind. > > - In what cases do we need to control the higher level logical devices > like dm. It looks like real contention for resources is at leaf nodes. > Hence any kind of resource management/fair queueing should probably be > done at leaf nodes and not at higher level logical nodes. The problem with stacking devices is that we do not know how the IO going through the leaf nodes contributes to the aggregate throughput seen by the application/cgroup that generated it, which is what end users care about. 
The block device could be a plain old sata device, a loop device, a stacking device, a SSD, you name it, but their topologies and the fact that some of them do not even use an elevator should be transparent to the user. If you wanted to do resource management at the leaf nodes some kind of topology information should be passed down to the elevators controlling the underlying devices, which in turn would need to work cooperatively. > If that makes sense, then probably we don't need to control dm device > and we don't need such higher level solutions. For the reasons stated above the two level scheduling approach seems cleaner to me. > - Any kind of 2 level scheduler solution has the potential to break the > underlying IO scheduler. Higher level solution requires buffering of > bios and controlled release of bios to lower layers. This control breaks > the assumptions of lower layer IO scheduler which knows in what order > bios should be dispatched to device to meet the semantics exported by > the IO scheduler. Please notice that the such an IO controller would only get in the way of the elevator in case of contention for the device. What is more, depending on the workload it turns out that buffering at higher layers in a per-cgroup or per-task basis, like dm-band does, may actually increase the aggregate throughput (I think that the dm-band team observed this behavior too). The reason seems to be that bios buffered in such way tend to be highly correlated and thus very likely to get merged when released to the elevator. > - 2nd level scheduler does not keep track of tasks but task groups lets > every group dispatch fair share. This has got little semantic problem in > the sense that tasks and groups in root cgroup will not be considered at > same level. "root" will be considered one group at same level with all > child group hence competing with them for resources. > > This looks little odd. Considering tasks and groups same level kind of > makes more sense. 
cpu scheduler also consideres tasks and groups at same > level and deviation from that probably is not very good. > > Considering tasks and groups at same level will matter only if IO > scheduler maintains separate queue for the task, like CFQ. Because > in that case IO scheduler tries to provide fairness among various task > queues. Some schedulers like noop don't have any notion of separate > task queues and fairness among them. In that case probably we don't > have a choice but to assume root group competing with child groups. If deemed necessary this case could be handled too, but it does not look like a show-stopper. > Keeping above points in mind, probably two level scheduling is not a > very good idea. If putting the code in a particular IO scheduler is a > concern we can probably explore ways regarding how we can maximize the > sharing of cgroup code among IO schedulers. As discussed above, I still think that the two level scheduling approach makes more sense. Regarding the sharing of cgroup code among IO schedulers I am all for it. If we consider that elevators should only care about maximizing usage of the underlying devices, implementing other non-hardware-dependent scheduling disciplines (that prioritize according to the task or cgroup that generated the IO, for example) at higher layers so that we can reuse code makes a lot of sense. Thanks, Fernando ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <1227681618.12997.163.camel-xpvPi5bcW5X5OjGIXfuPlhrrLbDL3r4M6qtp775pBPw@public.gmane.org> @ 2008-11-26 15:18 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-26 15:18 UTC (permalink / raw) To: Fernando Luis Vázquez Cao Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jens Axboe, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Wed, Nov 26, 2008 at 03:40:18PM +0900, Fernando Luis Vázquez Cao wrote: > On Thu, 2008-11-20 at 08:40 -0500, Vivek Goyal wrote: > > > The dm approach has some merrits, the major one being that it'll fit > > > directly into existing setups that use dm and can be controlled with > > > familiar tools. That is a bonus. The draw back is partially the same - > > > it'll require dm. So it's still not a fit-all approach, unfortunately. > > > > > > So I'd prefer an approach that doesn't force you to use dm. > > > > Hi Jens, > > > > My patches met the goal of not using the dm for every device one wants > > to control. > > > > Having said that, few things come to mind. > > > > - In what cases do we need to control the higher level logical devices > > like dm. It looks like real contention for resources is at leaf nodes. > > Hence any kind of resource management/fair queueing should probably be > > done at leaf nodes and not at higher level logical nodes. 
>
> The problem with stacking devices is that we do not know how the IO
> going through the leaf nodes contributes to the aggregate throughput
> seen by the application/cgroup that generated it, which is what end
> users care about.
>

If we keep track of cgroup information in the bio and don't lose it
while the bio traverses the stack of devices, then the leaf node can
still do the proportional fair share allocation among contending cgroups
on that device.

I think end users care about getting their fair share if there is
contention anywhere along the IO path. Real contention is at the leaf
nodes. However complex the logical device topology is, if two
applications are not contending for a disk at the lowest level, there is
no point in doing any kind of resource management among them. Though the
applications might seem to be contending for a higher level logical
device, at the leaf nodes their IOs might be going to different disks
altogether, and in practice there is no contention.

> The block device could be a plain old sata device, a loop device, a
> stacking device, a SSD, you name it, but their topologies and the fact
> that some of them do not even use an elevator should be transparent to
> the user.

Are there some devices which don't use elevators at leaf nodes? If not,
then it's not an issue.

>
> If you wanted to do resource management at the leaf nodes some kind of
> topology information should be passed down to the elevators controlling
> the underlying devices, which in turn would need to work cooperatively.
>

I am not able to understand why some kind of topology information needs
to be passed to the underlying elevators. As long as the end device can
map a bio correctly to the right cgroup (irrespective of complex
topology) and the end device steps into resource management only if
there is contention for resources among cgroups on that device, things
are fine. We don't have to worry about the intermediate complex
topology.

I will take one hypothetical example.
Let's assume there are two cgroups A and B with weights 2048 and 1024
respectively. To me this information means that if A and B really
contend for the resources somewhere, then make sure A gets 2/3 of the
resources and B gets 1/3. Now if tasks in these two groups happen to
contend for the same disk at the lowest level, we do resource
management; otherwise we don't. Why do I need to worry about
intermediate logical devices in the IO path?

Maybe I am missing something. A detailed example will help here...

> > If that makes sense, then probably we don't need to control dm device
> > and we don't need such higher level solutions.
>
> For the reasons stated above the two level scheduling approach seems
> cleaner to me.
>
> > - Any kind of 2 level scheduler solution has the potential to break the
> > underlying IO scheduler. Higher level solution requires buffering of
> > bios and controlled release of bios to lower layers. This control breaks
> > the assumptions of lower layer IO scheduler which knows in what order
> > bios should be dispatched to device to meet the semantics exported by
> > the IO scheduler.
>
> Please notice that such an IO controller would only get in the way
> of the elevator in case of contention for the device.

True. So are we saying that a user can get the expected CFQ or AS
behavior only if there is no contention? If there is contention, then we
don't guarantee anything?

> What is more,
> depending on the workload it turns out that buffering at higher layers
> on a per-cgroup or per-task basis, like dm-band does, may actually
> increase the aggregate throughput (I think that the dm-band team
> observed this behavior too). The reason seems to be that bios buffered
> in such way tend to be highly correlated and thus very likely to get
> merged when released to the elevator.

The goal here is not to increase throughput by doing buffering at a
higher layer. This is what the IO scheduler currently does.
It tries to buffer bios and select them appropriately to boost
throughput. If one needs to focus on increasing throughput, it should be
done at the IO scheduler level and not by introducing one more buffering
layer in between.

>
> > - 2nd level scheduler does not keep track of tasks but task groups lets
> > every group dispatch fair share. This has got little semantic problem in
> > the sense that tasks and groups in root cgroup will not be considered at
> > same level. "root" will be considered one group at same level with all
> > child group hence competing with them for resources.
> >
> > This looks little odd. Considering tasks and groups same level kind of
> > makes more sense. cpu scheduler also considers tasks and groups at same
> > level and deviation from that probably is not very good.
> >
> > Considering tasks and groups at same level will matter only if IO
> > scheduler maintains separate queue for the task, like CFQ. Because
> > in that case IO scheduler tries to provide fairness among various task
> > queues. Some schedulers like noop don't have any notion of separate
> > task queues and fairness among them. In that case probably we don't
> > have a choice but to assume root group competing with child groups.
>
> If deemed necessary this case could be handled too, but it does not look
> like a show-stopper.
>

It is not a show stopper for sure. But it can be a genuine concern, at
least in the case of CFQ, which tries to provide fairness among tasks.

Think of the following scenario (diagram taken from peterz's mail).

			root
			/ | \
		       1  2  A
			    / \
			   B   3

Assume that task 1, task 2 and group A belong to the best effort class
and they all have the same priority. If we go for two level scheduling,
then disk BW will be divided in the ratio of 25%, 25% and 50% between
task 1, task 2 and group A. I think it should instead be 33% each. That
comes back to the idea of treating 1, 2 and A at the same level.
So this is not a show stopper, but once you go for one approach,
switching to another will become really hard, as it might require close
interaction with the underlying scheduler, and fundamentally a 2 level
scheduler will find it very hard to communicate with the IO scheduler.

> > Keeping above points in mind, probably two level scheduling is not a
> > very good idea. If putting the code in a particular IO scheduler is a
> > concern we can probably explore ways regarding how we can maximize the
> > sharing of cgroup code among IO schedulers.
>
> As discussed above, I still think that the two level scheduling approach
> makes more sense.

IMHO, the two level scheduling approach makes a case only if resource
management at leaf nodes does not meet the requirements. So far we have
not got a concrete example where resource management at intermediate
logical devices is needed and resource management at leaf nodes is not
sufficient.

Thanks
Vivek

> Regarding the sharing of cgroup code among IO
> schedulers I am all for it. If we consider that elevators should only
> care about maximizing usage of the underlying devices, implementing
> other non-hardware-dependent scheduling disciplines (that prioritize
> according to the task or cgroup that generated the IO, for example) at
> higher layers so that we can reuse code makes a lot of sense.
>
> Thanks,
>
> Fernando

^ permalink raw reply [flat|nested] 92+ messages in thread
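The two numeric examples in this message can be checked with a few lines of arithmetic. This is only a sketch: share_permille and nested_permille are hypothetical helper names, shares are rounded down to parts per thousand, and equal weights are assumed for task 1, task 2 and group A, as in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Share of entity i among n siblings, in parts per thousand (rounded down). */
static unsigned int share_permille(const unsigned int *weights, size_t n,
				   size_t i)
{
	unsigned long total = 0;

	assert(weights != NULL && i < n);
	for (size_t k = 0; k < n; k++)
		total += weights[k];
	assert(total > 0);
	return (unsigned int)(1000UL * weights[i] / total);
}

/* Share of a child whose parent itself owns only part of the device. */
static unsigned int nested_permille(unsigned int parent, unsigned int child)
{
	return parent * child / 1000;
}
```

For weights 2048 and 1024 this gives 666 and 333 per thousand, the 2/3 and 1/3 split. For the tree example, treating 1, 2 and A as three equal siblings gives 333 each, while treating root's tasks as one group next to A gives 500 for A and 500 * 500 / 1000 = 250 for each of tasks 1 and 2.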
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <4923716A.5090104-M3ycANVxPotyL3EAZA59ERCuuivNXqWP@public.gmane.org> @ 2008-11-19 10:17 ` Fabio Checconi 0 siblings, 0 replies; 92+ messages in thread From: Fabio Checconi @ 2008-11-19 10:17 UTC (permalink / raw) To: Aaron Carroll Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jens Axboe, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w > From: Aaron Carroll <aaronc-M3ycANVxPotyL3EAZA59ERCuuivNXqWP@public.gmane.org> > Date: Wed, Nov 19, 2008 12:52:42PM +1100 > > Fabio Checconi wrote: > > - To detect hw tagging in BFQ we consider a sample valid iff the > > number of requests that the scheduler could have dispatched (given > > by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into > > the scheduler plus the ones into the driver) is higher than the > > CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems > > during testing, but the way CFQ uses now seems a little bit > > strange. > > BFQ's tag detection logic is broken in the same way that CFQ's used to > be. Explanation is in this patch: > If you look at bfq_update_hw_tag(), the logic introduced by the patch you mention is still there; BFQ starts with ->hw_tag = 1, and updates it every 32 valid samples. 
What changed WRT your patch, apart from the number of samples, is that
the condition for a sample to be valid is:

  bfqd->rq_in_driver + bfqd->queued >= 5

while in your patch it is:

  cfqd->rq_queued > 5 || cfqd->rq_in_driver > 5

We preferred the first one because that sum better reflects the number
of requests that could have been dispatched, and I don't think that this
is wrong.

There is a problem, but it's not within the tag detection logic itself.
From some quick experiments, what happens is that when a process starts,
CFQ considers it seeky (*), BFQ doesn't. As a side effect BFQ does not
always dispatch enough requests to correctly detect tagging.

At the first seek you cannot tell if the process is going to be seeky
or not, and we have chosen to consider it sequential because it improved
fairness in some sequential workloads (the CIC_SEEKY heuristic is used
also to determine the idle_window length in [bc]fq_arm_slice_timer()).

Anyway, we're dealing with heuristics, and they tend to favor some
workloads over others. If recovering this throughput loss is more
important than a transient unfairness due to short idling windows
assigned to sequential processes when they start, I've no problems in
switching the CIC_SEEKY logic to consider a process seeky when it starts.

Thank you for testing and for pointing out this issue, we missed it
in our testing.


(*) to be correct, the initial classification depends on the position
    of the first accessed sector.

^ permalink raw reply [flat|nested] 92+ messages in thread
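The difference between the two validity conditions is easiest to see side by side. A sketch only: the struct and function names below are hypothetical, and the two expressions are copied from the text of this message rather than from either scheduler's source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Threshold used by both conditions as quoted in the mail. */
#define HW_QUEUE_MIN 5

/* Hypothetical snapshot of the two counters the conditions look at. */
struct sched_counters {
	int rq_in_driver;	/* requests already handed to the driver */
	int queued;		/* requests still held by the scheduler */
};

/* BFQ-style test: everything that could have been dispatched, summed. */
static bool sample_valid_sum(const struct sched_counters *c)
{
	assert(c != NULL);
	return c->rq_in_driver + c->queued >= HW_QUEUE_MIN;
}

/* CFQ-style test: either counter must exceed the threshold on its own. */
static bool sample_valid_either(const struct sched_counters *c)
{
	assert(c != NULL);
	return c->queued > HW_QUEUE_MIN || c->rq_in_driver > HW_QUEUE_MIN;
}
```

With 2 requests in the driver and 4 queued, the sum-based test accepts the sample and the either-counter test rejects it; with 6 in the driver both accept it.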
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081119101701.GA20915-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org> @ 2008-11-19 11:06 ` Fabio Checconi [not found] ` <20081119110655.GC20915-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org> [not found] ` <4924EB4E.7050600@gelato.unsw.edu.au> 0 siblings, 2 replies; 92+ messages in thread From: Fabio Checconi @ 2008-11-19 11:06 UTC (permalink / raw) To: Aaron Carroll Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jens Axboe, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w > From: Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> > Date: Wed, Nov 19, 2008 11:17:01AM +0100 > > > From: Aaron Carroll <aaronc-M3ycANVxPotyL3EAZA59ERCuuivNXqWP@public.gmane.org> > > Date: Wed, Nov 19, 2008 12:52:42PM +1100 > > > > Fabio Checconi wrote: > > > - To detect hw tagging in BFQ we consider a sample valid iff the > > > number of requests that the scheduler could have dispatched (given > > > by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into > > > the scheduler plus the ones into the driver) is higher than the > > > CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems > > > during testing, but the way CFQ uses now seems a little bit > > > strange. > > > > BFQ's tag detection logic is broken in the same way that CFQ's used to > > be. Explanation is in this patch: > > > > If you look at bfq_update_hw_tag(), the logic introduced by the patch > you mention is still there; BFQ starts with ->hw_tag = 1, and updates it > every 32 valid samples. 
What changed WRT your patch, apart from the > number of samples, is that the condition for a sample to be valid is: > > bfqd->rq_in_driver + bfqd->queued >= 5 > > while in your patch it is: > > cfqd->rq_queued > 5 || cfqd->rq_in_driver > 5 > > We preferred the first one because that sum better reflects the number > of requests that could have been dispatched, and I don't think that this > is wrong. > > There is a problem, but it's not within the tag detection logic itself. > From some quick experiments, what happens is that when a process starts, > CFQ considers it seeky (*), BFQ doesn't. As a side effect BFQ does not > always dispatch enough requests to correctly detect tagging. > > At the first seek you cannot tell if the process is going to bee seeky > or not, and we have chosen to consider it sequential because it improved > fairness in some sequential workloads (the CIC_SEEKY heuristic is used > also to determine the idle_window length in [bc]fq_arm_slice_timer()). > > Anyway, we're dealing with heuristics, and they tend to favor some > workload over other ones. If recovering this thoughput loss is more > important than a transient unfairness due to short idling windows assigned > to sequential processes when they start, I've no problems in switching > the CIC_SEEKY logic to consider a process seeky when it starts. > > Thank you for testing and for pointing out this issue, we missed it > in our testing. > > > (*) to be correct, the initial classification depends on the position > of the first accessed sector. Sorry, I forgot the patch... This seems to solve the problem with your workload here, does it work for you? [ The magic number would not appear in a definitive fix... 
]

---
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 83e90e9..e9b010f 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1322,10 +1322,12 @@ static void bfq_update_io_seektime(struct bfq_data *bfqd,
 
 	/*
 	 * Don't allow the seek distance to get too large from the
-	 * odd fragment, pagein, etc.
+	 * odd fragment, pagein, etc.  The first request is not
+	 * really a seek, but we consider a cic seeky on creation
+	 * to make the hw_tag detection logic work better.
 	 */
-	if (cic->seek_samples == 0) /* first request, not really a seek */
-		sdist = 0;
+	if (cic->seek_samples == 0)
+		sdist = 8 * 1024 + 1;
 	else if (cic->seek_samples <= 60) /* second&third seek */
 		sdist = min(sdist, (cic->seek_mean * 4) + 2*1024*1024);
 	else

^ permalink raw reply related [flat|nested] 92+ messages in thread
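A rough sketch of why the magic number 8 * 1024 + 1 is enough to make a new context look seeky. Several things here are assumptions and not part of the patch: the struct and function names are hypothetical, the seeky threshold of 8 * 1024 is inferred from the magic number, and the decayed-average update (7/8 old, 1/8 new, scaled by 256) is modelled on how CFQ kept its seek statistics at the time. The clamp for later samples is not shown in the patch context, so it is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed seeky threshold, inferred from the 8 * 1024 + 1 in the patch. */
#define SEEKY_THRESHOLD ((uint64_t)8 * 1024)

/* Hypothetical stand-in for the per-context seek statistics. */
struct io_ctx {
	unsigned int seek_samples;
	uint64_t seek_total;
	uint64_t seek_mean;
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Feed one seek distance into the running mean. */
static void update_seek_mean(struct io_ctx *ctx, uint64_t sdist,
			     bool seeky_on_creation)
{
	assert(ctx != NULL);

	if (ctx->seek_samples == 0)
		/* old behaviour: 0; patched behaviour: just over the threshold */
		sdist = seeky_on_creation ? SEEKY_THRESHOLD + 1 : 0;
	else if (ctx->seek_samples <= 60)
		sdist = min_u64(sdist, ctx->seek_mean * 4 + 2 * 1024 * 1024);
	/* Later samples: the patch context stops here, no clamp is modelled. */

	/* Assumed decayed average, modelled on CFQ's seek statistics. */
	ctx->seek_samples = (7 * ctx->seek_samples + 256) / 8;
	ctx->seek_total = (7 * ctx->seek_total + 256 * sdist) / 8;
	ctx->seek_mean = (ctx->seek_total + ctx->seek_samples / 2) /
			 ctx->seek_samples;
}

/* Assumed classification rule: mean seek distance above the threshold. */
static bool is_seeky(const struct io_ctx *ctx)
{
	assert(ctx != NULL);
	return ctx->seek_mean > SEEKY_THRESHOLD;
}
```

Under these assumptions the first sample alone sets the mean to 8193, one above the threshold, so the context is classified seeky on creation; with the old value of 0 the mean stays at 0 and the context starts out as sequential.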
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081119110655.GC20915-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org> @ 2008-11-20 4:45 ` Aaron Carroll 0 siblings, 0 replies; 92+ messages in thread From: Aaron Carroll @ 2008-11-20 4:45 UTC (permalink / raw) To: Fabio Checconi, Jens Axboe Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w Fabio Checconi wrote: >>> Fabio Checconi wrote: >>>> - To detect hw tagging in BFQ we consider a sample valid iff the >>>> number of requests that the scheduler could have dispatched (given >>>> by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into >>>> the scheduler plus the ones into the driver) is higher than the >>>> CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems >>>> during testing, but the way CFQ uses now seems a little bit >>>> strange. >>> BFQ's tag detection logic is broken in the same way that CFQ's used to >>> be. Explanation is in this patch: >>> >> If you look at bfq_update_hw_tag(), the logic introduced by the patch >> you mention is still there; BFQ starts with ->hw_tag = 1, and updates it Yes, I missed that. So which part of CFQ's hw_tag detection is strange? >> every 32 valid samples. 
What changed WRT your patch, apart from the >> number of samples, is that the condition for a sample to be valid is: >> >> bfqd->rq_in_driver + bfqd->queued >= 5 >> >> while in your patch it is: >> >> cfqd->rq_queued > 5 || cfqd->rq_in_driver > 5 >> >> We preferred the first one because that sum better reflects the number >> of requests that could have been dispatched, and I don't think that this >> is wrong. I think it's fine too. CFQ's condition accounts for a few rare situations, such as the device stalling or hw_tag being updated right after a bunch of requests are queued. They are probably irrelevant, but can't hurt. >> There is a problem, but it's not within the tag detection logic itself. >> From some quick experiments, what happens is that when a process starts, >> CFQ considers it seeky (*), BFQ doesn't. As a side effect BFQ does not >> always dispatch enough requests to correctly detect tagging. >> >> At the first seek you cannot tell if the process is going to bee seeky >> or not, and we have chosen to consider it sequential because it improved >> fairness in some sequential workloads (the CIC_SEEKY heuristic is used >> also to determine the idle_window length in [bc]fq_arm_slice_timer()). >> >> Anyway, we're dealing with heuristics, and they tend to favor some >> workload over other ones. If recovering this thoughput loss is more >> important than a transient unfairness due to short idling windows assigned >> to sequential processes when they start, I've no problems in switching >> the CIC_SEEKY logic to consider a process seeky when it starts. >> >> Thank you for testing and for pointing out this issue, we missed it >> in our testing. >> >> >> (*) to be correct, the initial classification depends on the position >> of the first accessed sector. > > Sorry, I forgot the patch... This seems to solve the problem with > your workload here, does it work for you? 
Yes, it works fine now :)

However, hw_tag detection (in CFQ and BFQ) is still broken in a few ways:
  * If you go from queue_depth=1 to queue_depth=large, it's possible that
    the detection logic fails.  This could happen when queue_depth is set
    to a larger value at boot, which seems a reasonable situation.
  * It depends too much on the hardware.  If you have a seeky load on a
    fast disk with a unit queue depth, idling sucks for performance (I
    imagine this is particularly bad on SSDs).  If you have any disk with
    a deep queue, not idling sucks for fairness.

I suppose CFQ's slice_resid is supposed to help here, but as far as I can
tell, it doesn't do a thing.

    -- Aaron

^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Fabio Checconi @ 2008-11-20 6:56 UTC
To: Aaron Carroll
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    paolo.valente-rcYM44yAMweonA0d6jMUrA,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jens Axboe,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

> From: Aaron Carroll <aaronc-M3ycANVxPotyL3EAZA59ERCuuivNXqWP@public.gmane.org>
> Date: Thu, Nov 20, 2008 03:45:02PM +1100
>
> Fabio Checconi wrote:
> >>> Fabio Checconi wrote:
> >>>>   - To detect hw tagging in BFQ we consider a sample valid iff the
> >>>>     number of requests that the scheduler could have dispatched (given
> >>>>     by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
> >>>>     the scheduler plus the ones into the driver) is higher than the
> >>>>     CFQ_HW_QUEUE_MIN threshold.  This obviously caused no problems
> >>>>     during testing, but the way CFQ uses now seems a little bit
> >>>>     strange.
> >>> BFQ's tag detection logic is broken in the same way that CFQ's used to
> >>> be.  Explanation is in this patch:
> >>>
> >> If you look at bfq_update_hw_tag(), the logic introduced by the patch
> >> you mention is still there; BFQ starts with ->hw_tag = 1, and updates it
>
> Yes, I missed that.  So which part of CFQ's hw_tag detection is strange?
>

I just think it is rather counterintuitive to consider a sample invalid
when you have, say, rq_in_driver = 1, 2, 3 or 4 and 4 other queued requests.
Considering the actual number of requests that could have been dispatched
seemed more straightforward than considering the two values separately.

Anyway I think the validity of the samples is a minor issue, while the
throughput loss you experienced was a more serious one.

> >> every 32 valid samples.  What changed WRT your patch, apart from the
> >> number of samples, is that the condition for a sample to be valid is:
> >>
> >>   bfqd->rq_in_driver + bfqd->queued >= 5
> >>
> >> while in your patch it is:
> >>
> >>   cfqd->rq_queued > 5 || cfqd->rq_in_driver > 5
> >>
> >> We preferred the first one because that sum better reflects the number
> >> of requests that could have been dispatched, and I don't think that this
> >> is wrong.
>
> I think it's fine too.  CFQ's condition accounts for a few rare situations,
> such as the device stalling or hw_tag being updated right after a bunch of
> requests are queued.  They are probably irrelevant, but can't hurt.
>
> >> There is a problem, but it's not within the tag detection logic itself.
> >> From some quick experiments, what happens is that when a process starts,
> >> CFQ considers it seeky (*), BFQ doesn't.  As a side effect BFQ does not
> >> always dispatch enough requests to correctly detect tagging.
> >>
> >> At the first seek you cannot tell if the process is going to be seeky
> >> or not, and we have chosen to consider it sequential because it improved
> >> fairness in some sequential workloads (the CIC_SEEKY heuristic is used
> >> also to determine the idle_window length in [bc]fq_arm_slice_timer()).
> >>
> >> Anyway, we're dealing with heuristics, and they tend to favor some
> >> workload over other ones.  If recovering this throughput loss is more
> >> important than a transient unfairness due to short idling windows assigned
> >> to sequential processes when they start, I've no problems in switching
> >> the CIC_SEEKY logic to consider a process seeky when it starts.
> >>
> >> Thank you for testing and for pointing out this issue, we missed it
> >> in our testing.
> >>
> >>
> >> (*) to be correct, the initial classification depends on the position
> >>     of the first accessed sector.
> >
> > Sorry, I forgot the patch...  This seems to solve the problem with
> > your workload here, does it work for you?
>
> Yes, it works fine now :)
>

Thank you very much for trying it.

> However, hw_tag detection (in CFQ and BFQ) is still broken in a few ways:
>   * If you go from queue_depth=1 to queue_depth=large, it's possible that
>     the detection logic fails.  This could happen if setting queue_depth
>     to a larger value at boot, which seems a reasonable situation.

I think that the transition of hw_tag from 1 to 0 can be quite easy, and
may depend only on the workload, while getting back to 1 is more difficult,
because when hw_tag is 0 there may be too few dispatches to detect queueing...

>   * It depends too much on the hardware.  If you have a seeky load on a
>     fast disk with a unit queue depth, idling sucks for performance (I
>     imagine this is particularly bad on SSDs).  If you have any disk with
>     a deep queue, not idling sucks for fairness.

Agreed.  This fairness vs. throughput conflict is very workload dependent too.

> I suppose CFQ's slice_resid is supposed to help here, but as far as I can
> tell, it doesn't do a thing.
>
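The two sample-validity conditions quoted in this message can be put side by side. The following is only a minimal sketch in plain C, not code from either scheduler: the function names bfq_sample_valid() and cfq_sample_valid() are hypothetical, and the threshold of 5 is simply the value quoted in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold as quoted in the thread; assumed here for illustration. */
#define HW_QUEUE_MIN 5

/* BFQ-style check (hypothetical name): count every request that could
 * have been dispatched, whether still queued or already in the driver. */
bool bfq_sample_valid(int rq_in_driver, int queued)
{
    assert(rq_in_driver >= 0 && queued >= 0);
    return rq_in_driver + queued >= HW_QUEUE_MIN;
}

/* CFQ-style check (hypothetical name), as quoted from the patch:
 * one of the two values must exceed the threshold on its own. */
bool cfq_sample_valid(int rq_in_driver, int rq_queued)
{
    assert(rq_in_driver >= 0 && rq_queued >= 0);
    return rq_queued > HW_QUEUE_MIN || rq_in_driver > HW_QUEUE_MIN;
}
```

With 4 requests in the driver and 4 more queued, which is the case described above as counterintuitive, bfq_sample_valid(4, 4) accepts the sample while cfq_sample_valid(4, 4) rejects it.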
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Aaron Carroll @ 2008-11-19 1:52 UTC
To: Fabio Checconi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    paolo.valente-rcYM44yAMweonA0d6jMUrA,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jens Axboe,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Fabio Checconi wrote:
>   - To detect hw tagging in BFQ we consider a sample valid iff the
>     number of requests that the scheduler could have dispatched (given
>     by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
>     the scheduler plus the ones into the driver) is higher than the
>     CFQ_HW_QUEUE_MIN threshold.  This obviously caused no problems
>     during testing, but the way CFQ uses now seems a little bit
>     strange.

BFQ's tag detection logic is broken in the same way that CFQ's used to
be.  Explanation is in this patch:

============================x8============================
commit 45333d5a31296d0af886d94f1d08f128231cab8e
Author: Aaron Carroll <aaronc-M3ycANVxPotyL3EAZA59ERCuuivNXqWP@public.gmane.org>
Date:   Tue Aug 26 15:52:36 2008 +0200

    cfq-iosched: fix queue depth detection

    CFQ's detection of queueing devices assumes a non-queuing device and
    detects if the queue depth reaches a certain threshold.  Under some
    workloads (e.g. synchronous reads), CFQ effectively forces a unit
    queue depth, thus defeating the detection logic.  This leads to poor
    performance on queuing hardware, since the idle window remains enabled.

    This patch inverts the sense of the logic: assume a queuing-capable
    device, and detect if the depth does not exceed the threshold.
============================x8=============================

BFQ seems better than CFQ at avoiding this problem though.  Using the
following fio job, I can routinely trigger it for 10s or so before BFQ
detects queuing.

============================x8=============================
[global]
direct=1
ioengine=sync
norandommap
randrepeat=0
filename=/dev/sdb
bs=16k
runtime=200
time_based

[reader]
rw=randread
numjobs=128
============================x8=============================
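The inverted detection logic described in the commit message above can be sketched outside the kernel as follows. This is an illustration under stated assumptions, not the CFQ or BFQ source: the struct and function names are hypothetical, the threshold of 5 and the window of 32 valid samples are the values mentioned elsewhere in this thread, and the sample-validity test is the CFQ-style condition quoted there.

```c
#include <assert.h>
#include <stdbool.h>

#define HW_QUEUE_MIN   5   /* depth threshold; value taken from the thread */
#define HW_TAG_SAMPLES 32  /* valid samples per decision; value from the thread */

/* Hypothetical detector state, for illustration only. */
struct tag_detect {
    bool hw_tag;          /* current belief: does the device queue requests? */
    int  samples;         /* valid samples collected in this window */
    int  in_driver_peak;  /* deepest driver queue seen in this window */
};

void tag_detect_init(struct tag_detect *td)
{
    td->hw_tag = true;    /* inverted logic: start by assuming a queuing device */
    td->samples = 0;
    td->in_driver_peak = 0;
}

void tag_detect_update(struct tag_detect *td, int rq_queued, int rq_in_driver)
{
    assert(rq_queued >= 0 && rq_in_driver >= 0);

    if (rq_in_driver > td->in_driver_peak)
        td->in_driver_peak = rq_in_driver;

    /* Skip samples where too few requests existed to fill a device queue. */
    if (rq_queued <= HW_QUEUE_MIN && rq_in_driver <= HW_QUEUE_MIN)
        return;

    if (++td->samples < HW_TAG_SAMPLES)
        return;

    /* End of window: keep hw_tag only if the depth reached the threshold. */
    td->hw_tag = td->in_driver_peak >= HW_QUEUE_MIN;
    td->samples = 0;
    td->in_driver_peak = 0;
}
```

Because the detector starts from hw_tag = true and only counts samples in which enough requests were pending, a workload that never has more than a few requests outstanding produces no valid samples and leaves the initial assumption untouched.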
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Jens Axboe @ 2008-11-19 14:30 UTC
To: Fabio Checconi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    paolo.valente-rcYM44yAMweonA0d6jMUrA,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Tue, Nov 18 2008, Fabio Checconi wrote:
> > From: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > Date: Tue, Nov 18, 2008 08:12:08PM +0100
> >
> > On Tue, Nov 18 2008, Fabio Checconi wrote:
> > > BFQ started from CFQ, extending it in the way you correctly describe,
> > > so it is indeed very similar.  There are also some minor changes to
> > > locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
> > >
> ...
> > My preferred approach here would be, in order of TODO:
> >
> > - Create and test the smallish patches for seekiness, hw_tag checking,
> >   and so on for CFQ.
> > - Create and test a WF2Q+ service dispatching patch for CFQ.
> >
> > and if there are leftovers after that, we could even conditionally
> > enable some of those if appropriate. I think the WF2Q+ is quite cool and
> > could be easily usable as the default, so it's definitely a viable
> > alternative.
> >
> > My main goal here is basically avoiding addition of Yet Another IO
> > scheduler, especially one that is so closely tied to CFQ already.
> >
> > I'll start things off by splitting cfq into a few files similar to what
> > bfq has done, as I think it makes a lot of sense. Fabio, if you could
> > create patches for the small behavioural changes you made, we can
> > discuss and hopefully merge those next.
> >
>
> Ok, I can do that, I need just a little bit of time to organize
> the work.
>
> About these small (some of them are really small) changes, a mixed list
> of things that they will touch and/or things that I'd like to have clear
> before starting to write the patches (maybe we can start another thread
> for them):
>
>   - In cfq_exit_single_io_context() and in changed_ioprio(), cic->key
>     is dereferenced without holding any lock.  As I reported in [1]
>     this seems to be a problem when an exit() races with a cfq_exit_queue()
>     and in a few other cases.  In BFQ we used a somehow involved
>     mechanism to avoid that, abusing rcu (of course we'll have to wait
>     the patch to talk about it :) ), but given my lack of understanding
>     of some parts of the block layer, I'd be interested in knowing if
>     the race is possible and/or if there is something more involved
>     going on that can cause the same effects.

OK, I'm assuming this is where Nikanth got his idea for the patch from?
It does seem racy in spots, we can definitely proceed on getting that
tightened up some more.

>   - set_task_ioprio() in fs/ioprio.c doesn't seem to have a write
>     memory barrier to pair with the dependent read one in
>     cfq_get_io_context().

Agree, that needs fixing.

>   - CFQ_MIN_TT is 2ms, this can result, depending on the value of
>     HZ in timeouts of one jiffy, that may expire too early, so we are
>     just wasting time and do not actually wait for the task to present
>     its new request.  Dealing with seeky traffic we've seen a lot of
>     early timeouts due to one jiffy timers expiring too early, is
>     it worth fixing or can we live with that?

We probably just need to enforce a '2 jiffies minimum' rule for that.

>   - To detect hw tagging in BFQ we consider a sample valid iff the
>     number of requests that the scheduler could have dispatched (given
>     by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
>     the scheduler plus the ones into the driver) is higher than the
>     CFQ_HW_QUEUE_MIN threshold.  This obviously caused no problems
>     during testing, but the way CFQ uses now seems a little bit
>     strange.

Not sure this matters a whole lot, but your approach makes sense. Have
you seen the later change to the CFQ logic from Aaron?

>   - Initially, cic->last_request_pos is zero, so the sdist charged
>     to a task for its first seek depends on the position on the disk
>     that is accessed first, independently from its seekiness.  Even
>     if there is a cap on that value, we choose to not charge the first
>     seek to processes; that resulted in less wrong predictions for
>     purely sequential loads.

Agreed, that's definitely off.

>   - From my understanding, with shared I/O contexts, two different
>     tasks may concurrently lookup for a cfqd into the same ioc.
>     This may result in cfq_drop_dead_cic() being called two times
>     for the same cic.  Am I missing something that prevents that from
>     happening?

That also looks problematic. I guess we need to recheck that under the
lock when in cfq_drop_dead_cic().

> Regarding the code splitup, do you think you'll go for the CFS(BFQ) way,
> using a single compilation unit and including the .c files, or a layout
> with different compilation units (like the ll_rw_blk.c splitup)?

Different compilation units would be my preferred choice.

-- 
Jens Axboe
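The one-jiffy problem raised in this message comes down to simple arithmetic: a 2 ms timeout is less than one tick period whenever HZ is below 500, and a timer armed for a single tick can fire almost immediately if it is set late in the current tick. A minimal sketch of the '2 jiffies minimum' idea in plain C follows; the function names are hypothetical, the conversion is assumed to round up, and none of this is the kernel's own timer code.

```c
#include <assert.h>

/* Convert milliseconds to timer ticks for a given tick rate (hz),
 * rounding up so a non-zero interval never becomes zero ticks. */
unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
    assert(hz > 0);
    return (ms * hz + 999) / 1000;
}

/* Idle timeout in ticks, never shorter than two ticks, so that at least
 * one full tick period elapses before the timer can expire. */
unsigned long min_idle_ticks(unsigned long ms, unsigned long hz)
{
    unsigned long ticks = ms_to_ticks(ms, hz);

    return ticks < 2 ? 2 : ticks;
}
```

With these definitions, ms_to_ticks(2, 1000) gives 2 ticks, while ms_to_ticks(2, 250) and ms_to_ticks(2, 100) both give 1; min_idle_ticks() raises the latter two to 2 and leaves longer timeouts unchanged.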
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Fabio Checconi @ 2008-11-19 15:52 UTC
To: Jens Axboe
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    paolo.valente-rcYM44yAMweonA0d6jMUrA,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

> From: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> Date: Wed, Nov 19, 2008 03:30:07PM +0100
>
> On Tue, Nov 18 2008, Fabio Checconi wrote:
...
> >   - In cfq_exit_single_io_context() and in changed_ioprio(), cic->key
> >     is dereferenced without holding any lock.  As I reported in [1]
> >     this seems to be a problem when an exit() races with a cfq_exit_queue()
> >     and in a few other cases.  In BFQ we used a somehow involved
> >     mechanism to avoid that, abusing rcu (of course we'll have to wait
> >     the patch to talk about it :) ), but given my lack of understanding
> >     of some parts of the block layer, I'd be interested in knowing if
> >     the race is possible and/or if there is something more involved
> >     going on that can cause the same effects.
>
> OK, I'm assuming this is where Nikanth got his idea for the patch from?

I think so.

> It does seem racy in spots, we can definitely proceed on getting that
> tightened up some more.
>
> >   - set_task_ioprio() in fs/ioprio.c doesn't seem to have a write
> >     memory barrier to pair with the dependent read one in
> >     cfq_get_io_context().
>
> Agree, that needs fixing.
>
> >   - CFQ_MIN_TT is 2ms, this can result, depending on the value of
> >     HZ in timeouts of one jiffy, that may expire too early, so we are
> >     just wasting time and do not actually wait for the task to present
> >     its new request.  Dealing with seeky traffic we've seen a lot of
> >     early timeouts due to one jiffy timers expiring too early, is
> >     it worth fixing or can we live with that?
>
> We probably just need to enforce a '2 jiffies minimum' rule for that.
>
> >   - To detect hw tagging in BFQ we consider a sample valid iff the
> >     number of requests that the scheduler could have dispatched (given
> >     by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
> >     the scheduler plus the ones into the driver) is higher than the
> >     CFQ_HW_QUEUE_MIN threshold.  This obviously caused no problems
> >     during testing, but the way CFQ uses now seems a little bit
> >     strange.
>
> Not sure this matters a whole lot, but your approach makes sense. Have
> you seen the later change to the CFQ logic from Aaron?
>

Yes, we started from his code.  As Aaron reported, on BFQ our change to
the CIC_SEEKY logic has a bad interaction with the hw tag detection on
some workloads, but that problem should be easy to solve (test patch
posted in http://lkml.org/lkml/2008/11/19/100).

> >   - Initially, cic->last_request_pos is zero, so the sdist charged
> >     to a task for its first seek depends on the position on the disk
> >     that is accessed first, independently from its seekiness.  Even
> >     if there is a cap on that value, we choose to not charge the first
> >     seek to processes; that resulted in less wrong predictions for
> >     purely sequential loads.
>
> Agreed, that's definitely off.
>
> >   - From my understanding, with shared I/O contexts, two different
> >     tasks may concurrently lookup for a cfqd into the same ioc.
> >     This may result in cfq_drop_dead_cic() being called two times
> >     for the same cic.  Am I missing something that prevents that from
> >     happening?
>
> That also looks problematic. I guess we need to recheck that under the
> lock when in cfq_drop_dead_cic().
>
> > Regarding the code splitup, do you think you'll go for the CFS(BFQ) way,
> > using a single compilation unit and including the .c files, or a layout
> > with different compilation units (like the ll_rw_blk.c splitup)?
>
> Different compilation units would be my preferred choice.
>

Ok, thank you, I'll try to put together and test some patches, and to
post them for discussion in the next few days.
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Jens Axboe @ 2008-11-18 19:12 UTC
To: Fabio Checconi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    paolo.valente-rcYM44yAMweonA0d6jMUrA,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Tue, Nov 18 2008, Fabio Checconi wrote:
> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
> >
> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> ...
> > > I have to think a little bit on how it would be possible to support
> > > an option for time-only budgets, coexisting with the current behavior,
> > > but I think it can be done.
> > >
> >
> > IIUC, bfq and cfq are different in following manner.
> >
> > a. BFQ employs WF2Q+ for fairness and CFQ employs weighted round robin.
> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> >    time slices.
> > c. BFQ supports hierarchical fair queuing and CFQ does not.
> >
> > We are looking forward to implementation of point C. Fabio seems to be
> > thinking of supporting time slice as a service (B). It seems like
> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> > robin).
> >
> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
> > that they have been able to ensure throughput while ensuring tighter
> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> > down the line?
> >
>
> BFQ started from CFQ, extending it in the way you correctly describe,
> so it is indeed very similar.  There are also some minor changes to
> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
>
> The two schedulers share similar goals, and in my opinion BFQ can be
> considered, in the long term, a CFQ replacement; *but* before talking
> about replacing CFQ we have to consider that:
>
>   - it *needs* review and testing; we've done our best, but for sure
>     it's not enough; review and testing are never enough;
>   - the service domain fairness, which was one of our objectives, requires
>     some extra complexity; the mechanisms we used and the design choices
>     we've made may not fit all the needs, or may not be as generic as the
>     simpler CFQ's ones;
>   - CFQ has years of history behind and has been tuned for a wider
>     variety of environments than the ones we've been able to test.
>
> If time-based fairness is considered more robust and the loss of
> service-domain fairness is not a problem, then the two schedulers can
> be made even more similar.

My preferred approach here would be, in order of TODO:

- Create and test the smallish patches for seekiness, hw_tag checking,
  and so on for CFQ.
- Create and test a WF2Q+ service dispatching patch for CFQ.

and if there are leftovers after that, we could even conditionally
enable some of those if appropriate. I think the WF2Q+ is quite cool and
could be easily usable as the default, so it's definitely a viable
alternative.

My main goal here is basically avoiding addition of Yet Another IO
scheduler, especially one that is so closely tied to CFQ already.

I'll start things off by splitting cfq into a few files similar to what
bfq has done, as I think it makes a lot of sense. Fabio, if you could
create patches for the small behavioural changes you made, we can
discuss and hopefully merge those next.

-- 
Jens Axboe
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Vivek Goyal @ 2008-11-20 21:31 UTC
To: Fabio Checconi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    paolo.valente-rcYM44yAMweonA0d6jMUrA,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Tue, Nov 18, 2008 at 03:41:39PM +0100, Fabio Checconi wrote:
> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
> >
> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> ...
> > > I have to think a little bit on how it would be possible to support
> > > an option for time-only budgets, coexisting with the current behavior,
> > > but I think it can be done.
> > >
> >
> > IIUC, bfq and cfq are different in following manner.
> >
> > a. BFQ employs WF2Q+ for fairness and CFQ employs weighted round robin.
> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> >    time slices.
> > c. BFQ supports hierarchical fair queuing and CFQ does not.
> >
> > We are looking forward to implementation of point C. Fabio seems to be
> > thinking of supporting time slice as a service (B). It seems like
> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> > robin).
> >
> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
> > that they have been able to ensure throughput while ensuring tighter
> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> > down the line?
> >
>
> BFQ started from CFQ, extending it in the way you correctly describe,
> so it is indeed very similar.  There are also some minor changes to
> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
>
> The two schedulers share similar goals, and in my opinion BFQ can be
> considered, in the long term, a CFQ replacement; *but* before talking
> about replacing CFQ we have to consider that:
>
>   - it *needs* review and testing; we've done our best, but for sure
>     it's not enough; review and testing are never enough;
>   - the service domain fairness, which was one of our objectives, requires
>     some extra complexity; the mechanisms we used and the design choices
>     we've made may not fit all the needs, or may not be as generic as the
>     simpler CFQ's ones;
>   - CFQ has years of history behind and has been tuned for a wider
>     variety of environments than the ones we've been able to test.
>
> If time-based fairness is considered more robust and the loss of
> service-domain fairness is not a problem, then the two schedulers can
> be made even more similar.

Hi Fabio,

I thought I would give bfq a try. I get the following when I put my current
shell into a newly created cgroup and then try to do "ls".

Thanks
Vivek

[ 1246.498412] BUG: unable to handle kernel NULL pointer dereference at 000000bc
[ 1246.498674] IP: [<c034210b>] __bfq_cic_change_cgroup+0x148/0x239
[ 1246.498674] *pde = 00000000
[ 1246.498674] Oops: 0002 [#1] SMP
[ 1246.498674] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:1/0:0:1:0/block/sdb/queue/scheduler
[ 1246.498674] Modules linked in:
[ 1246.498674]
[ 1246.498674] Pid: 2352, comm: dd Not tainted (2.6.28-rc4-bfq #2)
[ 1246.498674] EIP: 0060:[<c034210b>] EFLAGS: 00200046 CPU: 0
[ 1246.498674] EIP is at __bfq_cic_change_cgroup+0x148/0x239
[ 1246.498674] EAX: df0e50ac EBX: df0e5000 ECX: 00200046 EDX: df32f300
[ 1246.498674] ESI: dece6ee0 EDI: df0e5000 EBP: df37fc14 ESP: df37fbdc
[ 1246.498674]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 1246.498674] Process dd (pid: 2352, ti=df37e000 task=dfb01e00 task.ti=df37e000)
[ 1246.498674] Stack:
[ 1246.498674]  decc9780 dfa98948 df32f300 00000000 00000000 00200046 dece6ef0 00000000
[ 1246.498674]  df0e5000 00000000 df0f8014 df32f300 dfa98948 dec0c548 df37fc54 c034351b
[ 1246.498674]  00000010 dfabe6c0 dec0c548 00080000 df32f300 00000001 00200246 dfa98988
[ 1246.498674] Call Trace:
[ 1246.498674]  [<c034351b>] ? bfq_set_request+0x1f5/0x291
[ 1246.498674]  [<c0343326>] ? bfq_set_request+0x0/0x291
[ 1246.498674]  [<c0333ffe>] ? elv_set_request+0x17/0x26
[ 1246.498674]  [<c03365ad>] ? get_request+0x15e/0x1e7
[ 1246.498674]  [<c0336af5>] ? get_request_wait+0x22/0xd8
[ 1246.498674]  [<c04943b9>] ? dm_merge_bvec+0x88/0xb5
[ 1246.498674]  [<c0336f31>] ? __make_request+0x25e/0x310
[ 1246.498674]  [<c0494c02>] ? dm_request+0x137/0x150
[ 1246.498674]  [<c0335ecf>] ? generic_make_request+0x1e9/0x21f
[ 1246.498674]  [<c033708b>] ? submit_bio+0xa8/0xb1
[ 1246.498674]  [<c0264e49>] ? get_page+0x8/0xe
[ 1246.498674]  [<c0265157>] ? __lru_cache_add+0x27/0x43
[ 1246.498674]  [<c029fea2>] ? mpage_end_io_read+0x0/0x70
[ 1246.498674]  [<c029f453>] ? mpage_bio_submit+0x1c/0x21
[ 1246.498674]  [<c029ffc3>] ? mpage_readpages+0xb1/0xbe
[ 1246.498674]  [<c02c04d6>] ? ext3_readpages+0x0/0x16
[ 1246.498674]  [<c02c04ea>] ? ext3_readpages+0x14/0x16
[ 1246.498674]  [<c02c0f4a>] ? ext3_get_block+0x0/0xd4
[ 1246.498674]  [<c02649ee>] ? __do_page_cache_readahead+0xde/0x15b
[ 1246.498674]  [<c0264cab>] ? ondemand_readahead+0xf9/0x107
[ 1246.498674]  [<c0264d1e>] ? page_cache_sync_readahead+0x16/0x1c
[ 1246.498674]  [<c02600b2>] ? generic_file_aio_read+0x1ad/0x463
[ 1246.498674]  [<c02811cb>] ? do_sync_read+0xab/0xe9
[ 1246.498674]  [<c0235fe4>] ? autoremove_wake_function+0x0/0x33
[ 1246.498674]  [<c0268f15>] ? __inc_zone_page_state+0x12/0x15
[ 1246.498674]  [<c026c1a9>] ? handle_mm_fault+0x5a0/0x5b5
[ 1246.498674]  [<c0314bcc>] ? security_file_permission+0xf/0x11
[ 1246.498674]  [<c0281949>] ? vfs_read+0x80/0xda
[ 1246.498674]  [<c0281120>] ? do_sync_read+0x0/0xe9
[ 1246.498674]  [<c0281bab>] ? sys_read+0x3b/0x5d
[ 1246.498674]  [<c0203a3d>] ? sysenter_do_call+0x12/0x21
[ 1246.498674] Code: 55 e4 8b 55 d0 89 f0 e8 72 ea ff ff 85 c0 74 04 0f 0b eb fe 8d 46 10 89 45 e0 e8 57 a5 28 00 89 45 dc 8b 55 d0 8d 83 ac 00 00 00 <89> 15 bc 00 00 00 8d 56 14 e8 18 e9 ff ff 8b 75 d0 8d 93 b4 00
[ 1246.498674] EIP: [<c034210b>] __bfq_cic_change_cgroup+0x148/0x239 SS:ESP 0068:df37fbdc
[ 1246.498674] ---[ end trace 6bd1df99b7a9cb00 ]---
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Fabio Checconi @ 2008-11-21 3:05 UTC
To: Vivek Goyal
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    paolo.valente-rcYM44yAMweonA0d6jMUrA,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

> From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Date: Thu, Nov 20, 2008 04:31:55PM -0500
>
...
> Hi Fabio,
>
> I thought I would give bfq a try. I get the following when I put my current
> shell into a newly created cgroup and then try to do "ls".
>

The posted patch cannot work as it is, I'm sorry for that ugly bug.
Do you still have problems with this one applied?

---
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index efb03fc..ed8c597 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -168,7 +168,7 @@ static void bfq_group_chain_link(struct bfq_data *bfqd, struct cgroup *cgroup,
 
 		spin_lock_irqsave(&bgrp->lock, flags);
 
-		rcu_assign_pointer(bfqg->bfqd, bfqd);
+		rcu_assign_pointer(leaf->bfqd, bfqd);
 		hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
 		hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
From: Vivek Goyal @ 2008-11-21 14:58 UTC
To: Fabio Checconi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    paolo.valente-rcYM44yAMweonA0d6jMUrA,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, Nov 21, 2008 at 04:05:33AM +0100, Fabio Checconi wrote:
> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Date: Thu, Nov 20, 2008 04:31:55PM -0500
> >
> ...
> > Hi Fabio,
> >
> > I thought I would give bfq a try. I get the following when I put my current
> > shell into a newly created cgroup and then try to do "ls".
> >
>
> The posted patch cannot work as it is, I'm sorry for that ugly bug.
> Do you still have problems with this one applied?
>
> ---
> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> index efb03fc..ed8c597 100644
> --- a/block/bfq-cgroup.c
> +++ b/block/bfq-cgroup.c
> @@ -168,7 +168,7 @@ static void bfq_group_chain_link(struct bfq_data *bfqd, struct cgroup *cgroup,
>
>  		spin_lock_irqsave(&bgrp->lock, flags);
>
> -		rcu_assign_pointer(bfqg->bfqd, bfqd);
> +		rcu_assign_pointer(leaf->bfqd, bfqd);
>  		hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
>  		hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);

Thanks Fabio. This fix solves the issue for me.

I did some quick testing and I can see the differential service if I create
two cgroups of different priority. How do I map ioprio to shares? I
mean, let's say one cgroup has ioprio 4 and the other has ioprio 7, then
what's the respective share (%) of each cgroup?

Thanks
Vivek
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081121145823.GD3111-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-21 15:21 ` Fabio Checconi 0 siblings, 0 replies; 92+ messages in thread From: Fabio Checconi @ 2008-11-21 15:21 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Date: Fri, Nov 21, 2008 09:58:23AM -0500 > > On Fri, Nov 21, 2008 at 04:05:33AM +0100, Fabio Checconi wrote: > > > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > Date: Thu, Nov 20, 2008 04:31:55PM -0500 > > > > > ... > > > Hi Fabio, > > > > > > I though will give bfq a try. I get following when I put my current shell > > > into a newly created cgroup and then try to do "ls". > > > > > > > The posted patch cannot work as it is, I'm sorry for that ugly bug. > > Do you still have problems with this one applied? > > > > --- > > diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c > > index efb03fc..ed8c597 100644 > > --- a/block/bfq-cgroup.c > > +++ b/block/bfq-cgroup.c > > @@ -168,7 +168,7 @@ static void bfq_group_chain_link(struct bfq_data *bfqd, struct cgroup *cgroup, > > > > spin_lock_irqsave(&bgrp->lock, flags); > > > > - rcu_assign_pointer(bfqg->bfqd, bfqd); > > + rcu_assign_pointer(leaf->bfqd, bfqd); > > hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data); > > hlist_add_head(&leaf->bfqd_node, &bfqd->group_list); > > Thanks Fabio. 
> This fix solves the issue for me.
>

Ok thank you.

> I did a quick testing and I can see the differential service if I create
> two cgroups of different priority. How do I map ioprio to shares? I
> mean lets say one cgroup has ioprio 4 and other has got ioprio 7, then
> what's the respective share(%) of each cgroup?
>

I thought I wrote it somewhere, but maybe I missed that; weights are
mapped linearly, in decreasing order of priority:

    weight = 8 - ioprio

[ the calculation is done in bfq_weight_t bfq_ioprio_to_weight() ]

So, with ioprio 4 you have weight 4, and with ioprio 7 you have weight 1.
The shares, as long as the two tasks/groups are active on the disk, are
4/5 and 1/5 respectively.

This interface is really ugly, but it allows compatible uses of ioprios
with the two schedulers.

^ permalink raw reply [flat|nested] 92+ messages in thread
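As a rough illustration of the mapping described in the message above, the
weights and the resulting shares can be worked out as follows. This is only
a sketch: the sketch_* names and the constant are made up for the example
and are not taken from the BFQ patches, where the real calculation lives in
bfq_ioprio_to_weight().

```c
#include <assert.h>

/* Assumption: eight ioprio levels (0..7), as implied by "weight = 8 - ioprio". */
#define SKETCH_IOPRIO_LEVELS 8

/* Hypothetical helper: linear mapping, lower ioprio value means higher weight. */
unsigned int sketch_ioprio_to_weight(unsigned int ioprio)
{
    assert(ioprio < SKETCH_IOPRIO_LEVELS);
    return SKETCH_IOPRIO_LEVELS - ioprio;
}

/*
 * Hypothetical helper: share of the disk, in percent, for an entity of
 * weight w while the weights of all active entities sum to total.
 * Integer division, so the result is rounded down.
 */
unsigned int sketch_share_percent(unsigned int w, unsigned int total)
{
    assert(total > 0);
    return (100 * w) / total;
}
```

With ioprio 4 and ioprio 7 the weights are 4 and 1, the total is 5, and the
shares come out at 80% and 20%, matching the 4/5 and 1/5 quoted above. The
shares only hold while both groups are active on the disk.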
[parent not found: <e98e18940811181433o4bb5a147i1e0b9c1baf495ae2@mail.gmail.com>]
[parent not found: <e98e18940811181433o4bb5a147i1e0b9c1baf495ae2-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <e98e18940811181433o4bb5a147i1e0b9c1baf495ae2-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-11-18 23:44 ` Fabio Checconi 2008-11-19 7:09 ` Paolo Valente 1 sibling, 0 replies; 92+ messages in thread From: Fabio Checconi @ 2008-11-18 23:44 UTC (permalink / raw) To: Nauman Rafique Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w > From: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Date: Tue, Nov 18, 2008 02:33:19PM -0800 > > On Tue, Nov 18, 2008 at 4:05 AM, Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: ... > > it should be possible without altering the code. The slices can be > > assigned in the time domain using big values for max_budget. The logic > > is: each process is assigned a budget (in the range [max_budget/2, max_budget], > > choosen from the feedback mechanism, driven in __bfq_bfqq_recalc_budget()), > > and if it does not complete it in timeout_sync milliseconds, it is > > charged a fixed amount of sectors of service. > > > > Using big values for max_budget (where big means greater than two > > times the number of sectors the hard drive can transfer in timeout_sync > > milliseconds) makes the budgets always to time out, so the disk time > > is scheduled in slices of timeout_sync. > > > > However this is just a temporary workaround to do some basic testing. 
> >
> > Modifying the scheduler to support time slices instead of sector
> > budgets would indeed simplify the code; I think that the drawback
> > would be being too unfair in the service domain.  Of course we
> > have to consider how much is important to be fair in the service
> > domain, and how much added complexity/new code can we accept for it.
> >
> > [ Better service domain fairness is one of the main reasons why
> >   we started working on bfq, so, talking for me and Paolo it _is_
> >   important :) ]
> >
> > I have to think a little bit on how it would be possible to support
> > an option for time-only budgets, coexisting with the current behavior,
> > but I think it can be done.
>
> I think "time only budget" vs "sector budget" is dependent on the
> definition of fairness: do you want to be fair in the time that is
> given to each cgroup or fair in total number of sectors transferred.
> And the appropriate definition of fairness depends on how/where the IO
> scheduler is used. Do you think the work-around that you mentioned
> would have a significant performance difference compared to direct
> built-in support?
>

In terms of throughput, it should not have any influence, since tasks
would always receive a full timeslice.  In terms of latency it would
bypass completely the feedback mechanism, and that would have a negative
impact (basically the scheduler would not be able to differentiate
between tasks with the same weight but with different interactivity
needs).

In terms of service fairness it is a little bit hard to say, but I
would not expect anything near to what can be done with a service domain
approach, independently from the scheduler used.

^ permalink raw reply [flat|nested] 92+ messages in thread
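The "big max_budget" condition quoted in the message above (more than two
times the number of sectors the drive can transfer in timeout_sync
milliseconds) can be checked with a few lines of arithmetic. The sketch
below is illustrative only: the function names are hypothetical, and the
peak transfer rate is something the reader has to measure or assume, it is
not exported by the scheduler.

```c
#include <assert.h>
#include <stdint.h>

/* Sectors the device can move in timeout_ms at a peak rate of sectors_per_sec. */
uint64_t sketch_sectors_in_timeout(uint64_t sectors_per_sec, uint64_t timeout_ms)
{
    return sectors_per_sec * timeout_ms / 1000;
}

/*
 * Returns 1 when max_budget is "big" in the sense used above: budgets are
 * chosen in [max_budget/2, max_budget], so if max_budget exceeds twice the
 * sectors transferable in one timeout, every budget times out first and
 * the disk is effectively scheduled in slices of timeout_ms.
 */
int sketch_budgets_always_time_out(uint64_t max_budget,
                                   uint64_t sectors_per_sec,
                                   uint64_t timeout_ms)
{
    return max_budget > 2 * sketch_sectors_in_timeout(sectors_per_sec, timeout_ms);
}
```

For example, assuming a drive that sustains 80 MB/s (163840 sectors of 512
bytes per second) and a timeout of 100 ms, the drive moves 16384 sectors per
timeout, so max_budget would have to be larger than 32768 sectors. Both
figures are assumptions chosen to keep the numbers round.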
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <e98e18940811181433o4bb5a147i1e0b9c1baf495ae2-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2008-11-18 23:44 ` Fabio Checconi @ 2008-11-19 7:09 ` Paolo Valente 1 sibling, 0 replies; 92+ messages in thread
From: Paolo Valente @ 2008-11-19 7:09 UTC (permalink / raw)
To: Nauman Rafique
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Fabio Checconi, Divyesh Shah, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Nauman Rafique wrote:
>
>
> I think "time only budget" vs "sector budget" is dependent on the
> definition of fairness: do you want to be fair in the time that is
> given to each cgroup or fair in total number of sectors transferred.
> And the appropriate definition of fairness depends on how/where the IO
> scheduler is used. ...
>
>

Just a general note: as Fabio already said, switching back to time
budgets in BFQ would be (conceptually) straightforward.  However, we
will never get fairness in bandwidth distribution if we work (only) in
the time domain.

--
-----------------------------------------------------------
| Paolo Valente              |                            |
| Algogroup                  |                            |
| Dip. Ing. Informazione     | tel:   +39 059 2056318     |
| Via Vignolese 905/b        | fax:   +39 059 2056199     |
| 41100 Modena               |                            |
|     home:  http://algo.ing.unimo.it/people/paolo/       |
-----------------------------------------------------------

^ permalink raw reply [flat|nested] 92+ messages in thread
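The point made in the note above, that equal time does not mean equal
bandwidth, can be seen with a small worked example. The sketch below is not
taken from BFQ or CFQ; the helper names and the throughput figures are
assumptions picked only to show the effect for two tasks of the same weight
that receive the same time slice.

```c
#include <assert.h>
#include <stdint.h>

/* Sectors served to a task that owns the disk for slice_ms at sectors_per_ms. */
uint64_t sketch_sectors_served(uint64_t sectors_per_ms, uint64_t slice_ms)
{
    return sectors_per_ms * slice_ms;
}

/* Percentage of all transferred sectors that went to the first task. */
unsigned int sketch_bandwidth_share_percent(uint64_t first, uint64_t second)
{
    assert(first + second > 0);
    return (unsigned int)(100 * first / (first + second));
}
```

Take a sequential reader that moves 160 sectors per millisecond and a seeky
reader that moves 40 sectors per millisecond. If both get a 100 ms slice,
they transfer 16000 and 4000 sectors, that is 80% and 20% of the bandwidth,
although the disk time was split evenly. A sector (service domain) budget
would instead let both transfer the same number of sectors, at the cost of
giving the seeky reader more disk time.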
[parent not found: <492271EF.4050002@cn.fujitsu.com>]
[parent not found: <492271EF.4050002-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <492271EF.4050002-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> @ 2008-11-18 22:23 ` Nauman Rafique 0 siblings, 0 replies; 92+ messages in thread From: Nauman Rafique @ 2008-11-18 22:23 UTC (permalink / raw) To: Li Zefan Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Fabio Checconi, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Divyesh Shah, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Mon, Nov 17, 2008 at 11:42 PM, Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote: > Nauman Rafique wrote: >> If we start with bfq patches, this is how plan would look like: >> >> 1 Start with BFQ take 2. >> 2 Do the following to support proportional division: >> a) Expose the per device weight interface to user, instead of calculating >> from priority. >> b) Add support for disk time budgets, besides sector budget that is currently >> available (configurable option). (Fabio: Do you think we can just emulate >> that using the existing code?). Another approach would be to give time slices >> just like CFQ (discussing?) >> 4 Do the following to support the goals of 2 level schedulers: >> a) Limit the request descriptors allocated to each cgroup by adding >> functionality to elv_may_queue() >> b) Add support for putting an absolute limit on IO consumed by a >> cgroup. Such support is provided by Andrea >> Righi's patches too. 
>> c) Add support (configurable option) to keep track of total disk >> time/sectors/count >> consumed at each device, and factor that into scheduling decision >> (more discussion needed here) >> 6 Incorporate an IO tracking approach which re-uses memory resource >> controller code but is not dependent on it (may be biocgroup patches from >> dm-ioband can be used here directly) > > The newest bio_cgroup doesn't use much memcg code I think. The older biocgroup > tracks IO using mem_cgroup_charge(), and mem_cgroup_charge() remembers a struct page > owns by which cgroup. But now biocgroup changes to directly put some hooks in > __set_page_dirty() and some other places to track pages. I did not look into latest biocgroup patches, so may be you are right. Nevertheless, bfq currently gets cgroup info out of io context and so would handle only synchronous reads. For this action item, we have to make the latest biocgroup patches to work with bfq. > >> 7 Start an offline email thread to keep track of progress on the above >> goals. >> >> BFQ's support for hierarchy of cgroups means that its close to where >> we want to get. Any comments on what approach looks better? >> > > Looks like a sane way :) . We are also trying to keep track of the discussion and > development of IO controller. I'll start to have a look into BFQ. > >> On Mon, Nov 17, 2008 at 6:02 PM, Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote: >>> Vivek Goyal wrote: >>>> On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote: >>>>> In an attempt to make sure that this discussion leads to >>>>> something useful, we have summarized the points raised in this >>>>> discussion and have come up with a strategy for future. >>>>> The goal of this is to find common ground between all the approaches >>>>> proposed on this mailing list. >>>>> >>>>> 1 Start with Satoshi's latest patches. >>>> I have had a brief look at both Satoshi's patch and bfq. 
I kind of like >>>> bfq's patches for keeping track of per cgroup, per queue data structures. >>>> May be we can look there also. >>>> >>>>> 2 Do the following to support propotional division: >>>>> a) Give time slices in proportion to weights (configurable >>>>> option). We can support both priorities and weights by doing >>>>> propotional division between requests with same priorities. >>>>> 3 Schedule time slices using WF2Q+ instead of round robin. >>>>> Test the performance impact (both throughput and jitter in latency). >>>>> 4 Do the following to support the goals of 2 level schedulers: >>>>> a) Limit the request descriptors allocated to each cgroup by adding >>>>> functionality to elv_may_queue() >>>>> b) Add support for putting an absolute limit on IO consumed by a >>>>> cgroup. Such support exists in dm-ioband and is provided by Andrea >>>>> Righi's patches too. >>>> Does dm-iobnd support abosolute limit? I think till last version they did >>>> not. I have not check the latest version though. >>>> >>> No, dm-ioband still provides weight/share control only. Only Andrea Righi's >>> patches support absolute limit. >> >> Thanks for the correction. >> >>>>> c) Add support (configurable option) to keep track of total disk >>>>> time/sectors/count >>>>> consumed at each device, and factor that into scheduling decision >>>>> (more discussion needed here) >>>>> 5 Support multiple layers of cgroups to align IO controller behavior >>>>> with CPU scheduling behavior (more discussion?) >>>>> 6 Incorporate an IO tracking approach which re-uses memory resource >>>>> controller code but is not dependent on it (may be biocgroup patches from >>>>> dm-ioband can be used here directly) >>>>> 7 Start an offline email thread to keep track of progress on the above >>>>> goals. >>>>> >>>>> Please feel free to add/modify items to the list >>>>> when you respond back. Any comments/suggestions are more than welcome. 
>>>>> >>> >> > ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081113221304.GH7542@redhat.com>]
[parent not found: <20081113221304.GH7542-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081113221304.GH7542-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-20 9:20 ` Ryo Tsuruta 0 siblings, 0 replies; 92+ messages in thread From: Ryo Tsuruta @ 2008-11-20 9:20 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w Hi Vivek, Sorry for late reply. > > > Do you have any benchmark results? > > > I'm especially interested in the followings: > > > - Comparison of disk performance with and without the I/O controller patch. > > > > If I dynamically disable the bio control, then I did not observe any > > impact on performance. Because in that case practically it boils down > > to just an additional variable check in __make_request(). > > > > Oh.., I understood your question wrong. You are looking for what's the > performance penalty if I enable the IO controller on a device. Yes, that is what I want to know. > I have not done any extensive benchmarking. If I run two dd commands > without controller, I get 80MB/s from disk (roughly 40 MB for each task). > With bio group enabled (default token=2000), I was getting total BW of > roughly 68 MB/s. > > I have not done any performance analysis or optimizations at this point of > time. I plan to do that once we have some sort of common understanding about > a particular approach. There are so many IO controllers floating, right now > I am more concerned if we can all come to a common platform. 
I understood the reason of posting the patch well.

> Ryo, do you still want to stick to two level scheduling? Given the problem
> of it breaking down underlying scheduler's assumptions, probably it makes
> more sense to the IO control at each individual IO scheduler.

I don't want to stick to it. I'm considering implementing dm-ioband's
algorithm into the block I/O layer experimentally.

Thanks,
Ryo Tsuruta

^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081120.182053.220301508585579959.ryov@valinux.co.jp>]
[parent not found: <20081120.182053.220301508585579959.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081120.182053.220301508585579959.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> @ 2008-11-20 13:47 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-20 13:47 UTC (permalink / raw) To: Ryo Tsuruta Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Thu, Nov 20, 2008 at 06:20:53PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > Sorry for late reply. > > > > > Do you have any benchmark results? > > > > I'm especially interested in the followings: > > > > - Comparison of disk performance with and without the I/O controller patch. > > > > > > If I dynamically disable the bio control, then I did not observe any > > > impact on performance. Because in that case practically it boils down > > > to just an additional variable check in __make_request(). > > > > > > > Oh.., I understood your question wrong. You are looking for what's the > > performance penalty if I enable the IO controller on a device. > > Yes, that is what I want to know. > > > I have not done any extensive benchmarking. If I run two dd commands > > without controller, I get 80MB/s from disk (roughly 40 MB for each task). > > With bio group enabled (default token=2000), I was getting total BW of > > roughly 68 MB/s. > > > > I have not done any performance analysis or optimizations at this point of > > time. I plan to do that once we have some sort of common understanding about > > a particular approach. 
> > There are so many IO controllers floating, right now
> > I am more concerned if we can all come to a common platform.
>
> I understood the reason of posting the patch well.
>
> > Ryo, do you still want to stick to two level scheduling? Given the problem
> > of it breaking down underlying scheduler's assumptions, probably it makes
> > more sense to the IO control at each individual IO scheduler.
>
> I don't want to stick to it. I'm considering implementing dm-ioband's
> algorithm into the block I/O layer experimentally.

Thanks Ryo. Implementing a control at block layer sounds like another
2 level scheduling. We will still have the issue of breaking underlying
CFQ and other schedulers. How do you plan to resolve that conflict?

What do you think about the solution at IO scheduler level (like BFQ) or
maybe a little above that, where one can try some code sharing among IO
schedulers?

Thanks
Vivek

^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081120134701.GB29306@redhat.com>]
[parent not found: <20081120134701.GB29306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081120134701.GB29306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-25 2:33 ` Ryo Tsuruta 0 siblings, 0 replies; 92+ messages in thread From: Ryo Tsuruta @ 2008-11-25 2:33 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w Hi Vivek, > > > Ryo, do you still want to stick to two level scheduling? Given the problem > > > of it breaking down underlying scheduler's assumptions, probably it makes > > > more sense to the IO control at each individual IO scheduler. > > > > I don't want to stick to it. I'm considering implementing dm-ioband's > > algorithm into the block I/O layer experimentally. > > Thanks Ryo. Implementing a control at block layer sounds like another > 2 level scheduling. We will still have the issue of breaking underlying > CFQ and other schedulers. How to plan to resolve that conflict. I think there is no conflict against I/O schedulers. Could you expain to me about the conflict? > What do you think about the solution at IO scheduler level (like BFQ) or > may be little above that where one can try some code sharing among IO > schedulers? I would like to support any type of block device even if I/Os issued to the underlying device doesn't go through IO scheduler. Dm-ioband can be made use of for the devices such as loop device. Thanks, Ryo Tsuruta ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <20081125.113359.623571555980951312.ryov@valinux.co.jp>]
[parent not found: <20081125.113359.623571555980951312.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081125.113359.623571555980951312.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> @ 2008-11-25 16:27 ` Vivek Goyal 0 siblings, 0 replies; 92+ messages in thread From: Vivek Goyal @ 2008-11-25 16:27 UTC (permalink / raw) To: Ryo Tsuruta Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > > > > Ryo, do you still want to stick to two level scheduling? Given the problem > > > > of it breaking down underlying scheduler's assumptions, probably it makes > > > > more sense to the IO control at each individual IO scheduler. > > > > > > I don't want to stick to it. I'm considering implementing dm-ioband's > > > algorithm into the block I/O layer experimentally. > > > > Thanks Ryo. Implementing a control at block layer sounds like another > > 2 level scheduling. We will still have the issue of breaking underlying > > CFQ and other schedulers. How to plan to resolve that conflict. > > I think there is no conflict against I/O schedulers. > Could you expain to me about the conflict? Because we do the buffering at higher level scheduler and mostly release the buffered bios in the FIFO order, it might break the underlying IO schedulers. Generally it is the decision of IO scheduler to determine in what order to release buffered bios. For example, If there is one task of io priority 0 in a cgroup and rest of the tasks are of io prio 7. 
All the tasks belong to the best effort class. If
tasks of lower priority (7) do a lot of IO, then due to buffering there is
a chance that IO from the lower prio tasks is seen by CFQ first and IO from
the higher prio task is not seen by CFQ for quite some time, hence that task
does not get its fair share within the cgroup. Similar situations can
arise with RT tasks also.

>
> > What do you think about the solution at IO scheduler level (like BFQ) or
> > may be little above that where one can try some code sharing among IO
> > schedulers?
>
> I would like to support any type of block device even if I/Os issued
> to the underlying device doesn't go through IO scheduler. Dm-ioband
> can be made use of for the devices such as loop device.
>

What do you mean by IO issued to the underlying device not going
through the IO scheduler? A loop device will be associated with a file, and
IO will ultimately go to the IO scheduler which is serving those file
blocks?

What's the use case scenario of doing IO control at the loop device?
Ultimately the resource contention will take place on the actual underlying
physical device where the file blocks are. Will doing the resource control
there not solve the issue for you?

Thanks
Vivek

^ permalink raw reply [flat|nested] 92+ messages in thread
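The priority inversion described in the message above can be pictured with a
toy model. The sketch below is not dm-ioband or CFQ code: the structure, the
function name and the idea that the upper layer releases its buffer in
fixed-size batches are all assumptions made for illustration. The thread
only says that buffered bios are mostly released in FIFO order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffered bio; only the submitter's ioprio matters here. */
struct sketch_bio {
    int ioprio;
};

/*
 * The upper layer holds n bios in FIFO order and hands them to the IO
 * scheduler batch bios at a time. The IO scheduler can only prioritise
 * what it has already been given, so this returns how many whole batches
 * it processes before it first sees a bio of the given ioprio.
 * Returns -1 if no such bio is buffered or batch is zero.
 */
long sketch_batches_before_visible(const struct sketch_bio *fifo, size_t n,
                                   size_t batch, int ioprio)
{
    size_t i;

    if (batch == 0)
        return -1;
    for (i = 0; i < n; i++) {
        if (fifo[i].ioprio == ioprio)
            return (long)(i / batch);
    }
    return -1;
}
```

With eight prio 7 bios buffered ahead of a single prio 0 bio and a batch size
of four, the lower scheduler works through two full batches of low priority
IO before the prio 0 bio even becomes visible to it, whatever policy it
would otherwise apply.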
[parent not found: <20081125162720.GH341@redhat.com>]
[parent not found: <20081125162720.GH341-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081125162720.GH341-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2008-11-25 22:38 ` Nauman Rafique 2008-11-26 11:55 ` Fernando Luis Vázquez Cao 2008-11-26 12:47 ` Ryo Tsuruta 2 siblings, 0 replies; 92+ messages in thread From: Nauman Rafique @ 2008-11-25 22:38 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Tue, Nov 25, 2008 at 8:27 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote: >> Hi Vivek, >> >> > > > Ryo, do you still want to stick to two level scheduling? Given the problem >> > > > of it breaking down underlying scheduler's assumptions, probably it makes >> > > > more sense to the IO control at each individual IO scheduler. >> > > >> > > I don't want to stick to it. I'm considering implementing dm-ioband's >> > > algorithm into the block I/O layer experimentally. >> > >> > Thanks Ryo. Implementing a control at block layer sounds like another >> > 2 level scheduling. We will still have the issue of breaking underlying >> > CFQ and other schedulers. How to plan to resolve that conflict. >> >> I think there is no conflict against I/O schedulers. >> Could you expain to me about the conflict? > > Because we do the buffering at higher level scheduler and mostly release > the buffered bios in the FIFO order, it might break the underlying IO > schedulers. 
Generally it is the decision of IO scheduler to determine in > what order to release buffered bios. > > For example, If there is one task of io priority 0 in a cgroup and rest of > the tasks are of io prio 7. All the tasks belong to best effort class. If > tasks of lower priority (7) do lot of IO, then due to buffering there is > a chance that IO from lower prio tasks is seen by CFQ first and io from > higher prio task is not seen by cfq for quite some time hence that task > not getting it fair share with in the cgroup. Similiar situations can > arise with RT tasks also. Wouldn't even anticipation algorithms break if buffering is done at higher level? Our anticipation algorithms are tuned to model task's behavior. If IOs get buffer at a higher layer, all bets are off about anticipation. > >> >> > What do you think about the solution at IO scheduler level (like BFQ) or >> > may be little above that where one can try some code sharing among IO >> > schedulers? >> >> I would like to support any type of block device even if I/Os issued >> to the underlying device doesn't go through IO scheduler. Dm-ioband >> can be made use of for the devices such as loop device. >> > > What do you mean by that IO issued to underlying device does not go > through IO scheduler? loop device will be associated with a file and > IO will ultimately go to the IO scheduler which is serving those file > blocks? > > What's the use case scenario of doing IO control at loop device? > Ultimately the resource contention will take place on actual underlying > physical device where the file blocks are. Will doing the resource control > there not solve the issue for you? 
>
> Thanks
> Vivek

^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller [not found] ` <20081125162720.GH341-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2008-11-25 22:38 ` Nauman Rafique @ 2008-11-26 11:55 ` Fernando Luis Vázquez Cao 2008-11-26 12:47 ` Ryo Tsuruta 2 siblings, 0 replies; 92+ messages in thread From: Fernando Luis Vázquez Cao @ 2008-11-26 11:55 UTC (permalink / raw) To: Vivek Goyal Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Tue, 2008-11-25 at 11:27 -0500, Vivek Goyal wrote: > On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote: > > Hi Vivek, > > > > > > > Ryo, do you still want to stick to two level scheduling? Given the problem > > > > > of it breaking down underlying scheduler's assumptions, probably it makes > > > > > more sense to the IO control at each individual IO scheduler. > > > > > > > > I don't want to stick to it. I'm considering implementing dm-ioband's > > > > algorithm into the block I/O layer experimentally. > > > > > > Thanks Ryo. Implementing a control at block layer sounds like another > > > 2 level scheduling. We will still have the issue of breaking underlying > > > CFQ and other schedulers. How to plan to resolve that conflict. > > > > I think there is no conflict against I/O schedulers. > > Could you expain to me about the conflict? > > Because we do the buffering at higher level scheduler and mostly release > the buffered bios in the FIFO order, it might break the underlying IO > schedulers. 
> Generally it is the decision of IO scheduler to determine in
> what order to release buffered bios.

It could be argued that the IO scheduler's primary goal is to maximize
usage of the underlying device according to its physical
characteristics. For hard disks this may imply minimizing time wasted by
seeks; other types of devices, such as SSDs, may impose different
requirements. This is something that clearly belongs in the elevator. On
the other hand, it could be argued that other non-hardware-related
scheduling disciplines would fit better in higher layers.

That said, as you pointed out such separation could impact performance,
so we will probably need to implement a feedback mechanism between the
elevator, which could collect statistics and provide hints, and the
upper layers. The elevator API looks like a good candidate for this,
though new functions might be needed.

> For example, If there is one task of io priority 0 in a cgroup and rest of
> the tasks are of io prio 7. All the tasks belong to best effort class. If
> tasks of lower priority (7) do lot of IO, then due to buffering there is
> a chance that IO from lower prio tasks is seen by CFQ first and io from
> higher prio task is not seen by cfq for quite some time hence that task
> not getting it fair share with in the cgroup. Similiar situations can
> arise with RT tasks also.

Well, this issue is not intrinsic to dm-band and similar solutions. In
the scenario you point out the problem is that the elevator and the IO
controller are not cooperating. The same could happen even if we
implemented everything at the elevator layer (or a little above): get
hierarchical scheduling wrong and you are likely to have a rough ride.

BFQ deals with hierarchical scheduling at just one layer, which makes
things easier. BFQ chose the elevator layer, but a similar scheduling
discipline could be implemented higher in the block layer too. The
HW-specific bits we cannot take out of the elevator, but when it comes to
task/cgroup based scheduling there are more possibilities, which include
the middle-way approach we are discussing: two level scheduling.

The two level model is not bad per se, we just need to get the two
levels to work in unison and for that we will certainly need to make
changes to the existing elevators.

> > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > may be little above that where one can try some code sharing among IO
> > > schedulers?
> >
> > I would like to support any type of block device even if I/Os issued
> > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > can be made use of for the devices such as loop device.
>
> What do you mean by that IO issued to underlying device does not go
> through IO scheduler? loop device will be associated with a file and
> IO will ultimately go to the IO scheduler which is serving those file
> blocks?

I think that Tsuruta-san's point is that the loop device driver uses its
own make_request_fn, which means that bios entering a loop device do not
necessarily go through an IO scheduler after that.

We will always find ourselves in this situation when trying to manage
devices that provide their own make_request_fn, the reason being that
its behavior is driver and configuration dependent: in the loop device
case whether we go through an IO scheduler or not depends on what has
been attached to it; in stacking device configurations the effect that
the IO scheduling at one of the devices that constitute the multi-device
will have in the aggregate throughput depends on the topology.

The only way I can think of to address all cases in a sane way is
controlling the entry point to the block layer, which is precisely what
dm-band does. The problem with dm-band is that it relies on the dm
infrastructure. In my opinion, if we could remove that dependency it
would be a huge step in the right direction.
> What's the use case scenario of doing IO control at loop device? My guess is virtualized machines using images exported as loop devices à la Xen's blktap (blktap's implementation is quite different from Linux' loop device, though). Thanks, Fernando _______________________________________________ Containers mailing list Containers@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
       [not found] ` <20081125162720.GH341-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2008-11-25 22:38   ` Nauman Rafique
  2008-11-26 11:55   ` Fernando Luis Vázquez Cao
@ 2008-11-26 12:47   ` Ryo Tsuruta
  2 siblings, 0 replies; 92+ messages in thread
From: Ryo Tsuruta @ 2008-11-26 12:47 UTC
To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi Vivek,

From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: [patch 0/4] [RFC] Another proportional weight IO controller
Date: Tue, 25 Nov 2008 11:27:20 -0500

> On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > > > > Ryo, do you still want to stick to two level scheduling? Given the problem
> > > > > of it breaking down underlying scheduler's assumptions, probably it makes
> > > > > more sense to the IO control at each individual IO scheduler.
> > > >
> > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > algorithm into the block I/O layer experimentally.
> > >
> > > Thanks Ryo. Implementing a control at block layer sounds like another
> > > 2 level scheduling. We will still have the issue of breaking underlying
> > > CFQ and other schedulers. How to plan to resolve that conflict.
> >
> > I think there is no conflict against I/O schedulers.
> > Could you explain to me about the conflict?
>
> Because we do the buffering at higher level scheduler and mostly release
> the buffered bios in the FIFO order, it might break the underlying IO
> schedulers. Generally it is the decision of IO scheduler to determine in
> what order to release buffered bios.
>
> For example, If there is one task of io priority 0 in a cgroup and rest of
> the tasks are of io prio 7. All the tasks belong to best effort class. If
> tasks of lower priority (7) do lot of IO, then due to buffering there is
> a chance that IO from lower prio tasks is seen by CFQ first and io from
> higher prio task is not seen by cfq for quite some time hence that task
> not getting its fair share within the cgroup. Similar situations can
> arise with RT tasks also.

Thanks for your explanation.
I think that the same thing occurs without the higher level scheduler,
because all the tasks issuing I/Os are blocked while the underlying
device's request queue is full before those I/Os are sent to the I/O
scheduler.

> > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > may be little above that where one can try some code sharing among IO
> > > schedulers?
> >
> > I would like to support any type of block device even if I/Os issued
> > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > can be made use of for the devices such as loop device.
> >
>
> What do you mean by that IO issued to underlying device does not go
> through IO scheduler? loop device will be associated with a file and
> IO will ultimately go to the IO scheduler which is serving those file
> blocks?

How about if the file is on an NFS-mounted file system?

> What's the use case scenario of doing IO control at loop device?
> Ultimately the resource contention will take place on actual underlying
> physical device where the file blocks are. Will doing the resource control
> there not solve the issue for you?

I can't come up with any use case, but I would like to make the
resource controller more flexible. Actually, a certain block device
that I'm using does not use the I/O scheduler.

Thanks,
Ryo Tsuruta
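The ordering problem Vivek describes in the quoted text above can be illustrated with a toy model. This is only a sketch under stated assumptions: the bio tuples, the batch size of four and both function names are hypothetical, and neither function models dm-ioband, CFQ or any real kernel code path. It only contrasts releasing buffered bios in arrival order with releasing them by io priority.

```python
from collections import deque

def fifo_release(buffered, n):
    """Release the first n buffered bios in arrival order."""
    queue = deque(buffered)
    return [queue.popleft() for _ in range(min(n, len(queue)))]

def prio_release(buffered, n):
    """Release n bios, lowest ioprio value first.

    sorted() is stable, so bios with equal priority keep arrival order.
    """
    return sorted(buffered, key=lambda bio: bio[1])[:n]

# Hypothetical workload: six bios from prio-7 tasks arrive before
# a single bio from the prio-0 task. Each bio is (task, ioprio).
buffered = [(f"low{i}", 7) for i in range(6)] + [("high", 0)]

first_batch_fifo = fifo_release(buffered, 4)  # what a FIFO buffer passes down
first_batch_prio = prio_release(buffered, 4)  # what a priority-aware release passes down

print(first_batch_fifo)
print(first_batch_prio)
```

In the FIFO case the prio-0 bio is not in the first batch at all, so the lower layer cannot favour it no matter how it schedules; that is the sense in which buffering above the elevator can hide work from it.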
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
       [not found] ` <e98e18940811251438v245f79aegfdc92bee737af64c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-11-26 14:06   ` Paolo Valente
  0 siblings, 0 replies; 92+ messages in thread
From: Paolo Valente @ 2008-11-26 14:06 UTC
To: Nauman Rafique
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Fabio and I are a little bit worried about the fact that the problem
of working in the time domain instead of the service domain is not
being properly dealt with. Probably we did not express ourselves very
clearly, so we will try to put it in more practical terms. Using B-WF2Q+
in the time domain instead of using CFQ (Round-Robin) means introducing
higher complexity than CFQ to get almost the same service properties
as CFQ. With regard to fairness (long term) B-WF2Q+ in the time domain
has exactly the same (un)fairness problems as CFQ. As far as bandwidth
differentiation is concerned, it can be obtained with CFQ by just
increasing the time slice (e.g., double weight => double slice). This
has no impact on long term guarantees and certainly does not decrease
the throughput.

With regard to short term guarantees (request completion time), one of
the properties of the reference ideal system of WF2Q+ is that, assuming
for simplicity that all the queues have the same weight, as the ideal
system serves each queue at the same speed, shorter budgets are completed
in shorter time intervals than longer budgets. B-WF2Q+ guarantees
O(1) deviation from this ideal service. Hence, the tight delay/jitter
measured in our experiments with BFQ is a consequence of the simple (and
probably still improvable) budget assignment mechanism of (the overall)
BFQ. In contrast, if all the budgets are equal, as it happens if we use
time slices, the resulting scheduler is exactly a Round-Robin, again
as in CFQ (see [1]).

Finally, with regard to completion time delay differentiation through
weight differentiation, this is probably the only case in which B-WF2Q+
would perform better than CFQ, because, in case of CFQ, reducing the
time slices may reduce the throughput, whereas increasing the time slice
would increase the worst-case delay/jitter.

In the end, BFQ succeeds in guaranteeing fairness (or in general the
desired bandwidth distribution) because it works in the service domain
(and this is probably the only way to achieve this goal), not because
it uses WF2Q+ instead of Round-Robin. Similarly, it provides tight
delay/jitter only because B-WF2Q+ is used in combination with a simple
budget assignment (differentiation) mechanism (again in the service
domain).

[1] http://feanor.sssup.it/~fabio/linux/bfq/results.php

-- 
-----------------------------------------------------------
| Paolo Valente              |                            |
| Algogroup                  |                            |
| Dip. Ing. Informazione     | tel:   +39 059 2056318     |
| Via Vignolese 905/b        | fax:   +39 059 2056199     |
| 41100 Modena               |                            |
|     home:  http://algo.ing.unimo.it/people/paolo/       |
-----------------------------------------------------------
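The distinction between the time domain and the service domain can be made concrete with a small calculation. The sketch below is an illustration only, not BFQ, B-WF2Q+ or CFQ: the two queues, their throughput figures (sectors per millisecond), the 100 ms slice and the 1000-sector budget are all assumed numbers, and both helpers are hypothetical. It shows two equally weighted queues whose achieved throughput differs, for example a sequential and a random reader.

```python
def time_domain(rates, slice_ms, rounds):
    """Round-robin with equal time slices: service received = rate * time."""
    return {q: rate * slice_ms * rounds for q, rate in rates.items()}

def service_domain(rates, budget, rounds):
    """Equal sector budgets: service is equal, the time consumed differs."""
    service = {q: budget * rounds for q in rates}
    time_ms = {q: budget * rounds / rate for q, rate in rates.items()}
    return service, time_ms

# Assumed throughput in sectors/ms while each queue is being served.
rates = {"sequential": 100, "random": 10}

by_time = time_domain(rates, slice_ms=100, rounds=10)
by_service, time_used = service_domain(rates, budget=1000, rounds=10)

print(by_time)     # sectors served per queue with equal time slices
print(by_service)  # sectors served per queue with equal budgets
print(time_used)   # disk time each queue consumed with equal budgets
```

With equal time slices the sequential queue receives ten times the sectors of the random one despite equal weights; with equal budgets both receive the same number of sectors and the difference shows up in disk time instead.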
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
       [not found] ` <492D57E1.5090608-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2008-11-26 19:41   ` Nauman Rafique
  0 siblings, 0 replies; 92+ messages in thread
From: Nauman Rafique @ 2008-11-26 19:41 UTC
To: Paolo Valente
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, Nov 26, 2008 at 6:06 AM, Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org> wrote:
> Fabio and I are a little bit worried about the fact that the problem
> of working in the time domain instead of the service domain is not
> being properly dealt with. Probably we did not express ourselves very
> clearly, so we will try to put it in more practical terms. Using B-WF2Q+
> in the time domain instead of using CFQ (Round-Robin) means introducing
> higher complexity than CFQ to get almost the same service properties
> as CFQ. With regard to fairness (long term) B-WF2Q+ in the time domain

Are we talking about a case where all the contenders have equal
weights and are continuously backlogged? That seems to be the only
case when B-WF2Q+ would behave like Round-Robin. Am I missing
something here?

I can see that the only direct advantage of using WF2Q+ scheduling is
reduced jitter or latency in certain cases. But under heavy loads,
that might result in request latencies seen by RT threads being
reduced from a few seconds to a few msec.

> has exactly the same (un)fairness problems as CFQ. As far as bandwidth
> differentiation is concerned, it can be obtained with CFQ by just
> increasing the time slice (e.g., double weight => double slice). This
> has no impact on long term guarantees and certainly does not decrease
> the throughput.
>
> With regard to short term guarantees (request completion time), one of
> the properties of the reference ideal system of WF2Q+ is that, assuming
> for simplicity that all the queues have the same weight, as the ideal
> system serves each queue at the same speed, shorter budgets are completed
> in shorter time intervals than longer budgets. B-WF2Q+ guarantees
> O(1) deviation from this ideal service. Hence, the tight delay/jitter
> measured in our experiments with BFQ is a consequence of the simple (and
> probably still improvable) budget assignment mechanism of (the overall)
> BFQ. In contrast, if all the budgets are equal, as it happens if we use
> time slices, the resulting scheduler is exactly a Round-Robin, again
> as in CFQ (see [1]).

Can the budget assignment mechanism of BFQ be converted to a time slice
assignment mechanism? What I am trying to say here is that we can have
variable time slices, just like we have variable budgets.

>
> Finally, with regard to completion time delay differentiation through
> weight differentiation, this is probably the only case in which B-WF2Q+
> would perform better than CFQ, because, in case of CFQ, reducing the
> time slices may reduce the throughput, whereas increasing the time slice
> would increase the worst-case delay/jitter.
>
> In the end, BFQ succeeds in guaranteeing fairness (or in general the
> desired bandwidth distribution) because it works in the service domain
> (and this is probably the only way to achieve this goal), not because
> it uses WF2Q+ instead of Round-Robin. Similarly, it provides tight
> delay/jitter only because B-WF2Q+ is used in combination with a simple
> budget assignment (differentiation) mechanism (again in the service
> domain).
>
> [1] http://feanor.sssup.it/~fabio/linux/bfq/results.php
>
> --
> -----------------------------------------------------------
> | Paolo Valente              |                            |
> | Algogroup                  |                            |
> | Dip. Ing. Informazione     | tel:   +39 059 2056318     |
> | Via Vignolese 905/b        | fax:   +39 059 2056199     |
> | 41100 Modena               |                            |
> |     home:  http://algo.ing.unimo.it/people/paolo/       |
> -----------------------------------------------------------
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
       [not found] ` <e98e18940811261141x307cf06fldd5e481e85da5c2d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-11-26 22:21   ` Fabio Checconi
  0 siblings, 0 replies; 92+ messages in thread
From: Fabio Checconi @ 2008-11-26 22:21 UTC
To: Nauman Rafique
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, Paolo Valente, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, riel-H+wXaHxf7aLQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

> From: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Date: Wed, Nov 26, 2008 11:41:46AM -0800
>
> On Wed, Nov 26, 2008 at 6:06 AM, Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org> wrote:
> > Fabio and I are a little bit worried about the fact that the problem
> > of working in the time domain instead of the service domain is not
> > being properly dealt with. Probably we did not express ourselves very
> > clearly, so we will try to put it in more practical terms. Using B-WF2Q+
> > in the time domain instead of using CFQ (Round-Robin) means introducing
> > higher complexity than CFQ to get almost the same service properties
> > as CFQ. With regard to fairness (long term) B-WF2Q+ in the time domain
>
> Are we talking about a case where all the contenders have equal
> weights and are continuously backlogged? That seems to be the only
> case when B-WF2Q+ would behave like Round-Robin. Am I missing
> something here?
>

It is the case with equal weights, but it is really a common one.

> I can see that the only direct advantage of using WF2Q+ scheduling is
> reduced jitter or latency in certain cases. But under heavy loads,
> that might result in request latencies seen by RT threads being
> reduced from a few seconds to a few msec.
>
> > has exactly the same (un)fairness problems as CFQ. As far as bandwidth
> > differentiation is concerned, it can be obtained with CFQ by just
> > increasing the time slice (e.g., double weight => double slice). This
> > has no impact on long term guarantees and certainly does not decrease
> > the throughput.
> >
> > With regard to short term guarantees (request completion time), one of
> > the properties of the reference ideal system of WF2Q+ is that, assuming
> > for simplicity that all the queues have the same weight, as the ideal
> > system serves each queue at the same speed, shorter budgets are completed
> > in shorter time intervals than longer budgets. B-WF2Q+ guarantees
> > O(1) deviation from this ideal service. Hence, the tight delay/jitter
> > measured in our experiments with BFQ is a consequence of the simple (and
> > probably still improvable) budget assignment mechanism of (the overall)
> > BFQ. In contrast, if all the budgets are equal, as it happens if we use
> > time slices, the resulting scheduler is exactly a Round-Robin, again
> > as in CFQ (see [1]).
>
> Can the budget assignment mechanism of BFQ be converted to a time slice
> assignment mechanism? What I am trying to say here is that we can have
> variable time slices, just like we have variable budgets.
>

Yes, it could be converted, and it would do in the time domain the
same differentiation it does now in the service domain. What we would
lose in the process is the fairness in the service domain. The service
properties/guarantees of the resulting scheduler would _not_ be the same
as the BFQ ones. Both long term and short term guarantees would be
affected by the unfairness caused by the different service rates
experienced by the scheduled entities.

> >
> > Finally, with regard to completion time delay differentiation through
> > weight differentiation, this is probably the only case in which B-WF2Q+
> > would perform better than CFQ, because, in case of CFQ, reducing the
> > time slices may reduce the throughput, whereas increasing the time slice
> > would increase the worst-case delay/jitter.
> >
> > In the end, BFQ succeeds in guaranteeing fairness (or in general the
> > desired bandwidth distribution) because it works in the service domain
> > (and this is probably the only way to achieve this goal), not because
> > it uses WF2Q+ instead of Round-Robin. Similarly, it provides tight
> > delay/jitter only because B-WF2Q+ is used in combination with a simple
> > budget assignment (differentiation) mechanism (again in the service
> > domain).
> >
> > [1] http://feanor.sssup.it/~fabio/linux/bfq/results.php
> >
> > --
> > -----------------------------------------------------------
> > | Paolo Valente              |                            |
> > | Algogroup                  |                            |
> > | Dip. Ing. Informazione     | tel:   +39 059 2056318     |
> > | Via Vignolese 905/b        | fax:   +39 059 2056199     |
> > | 41100 Modena               |                            |
> > |     home:  http://algo.ing.unimo.it/people/paolo/       |
> > -----------------------------------------------------------
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
       [not found] ` <20081126.214707.653026525707335397.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2008-11-26 16:08   ` Vivek Goyal
  0 siblings, 0 replies; 92+ messages in thread
From: Vivek Goyal @ 2008-11-26 16:08 UTC
To: Ryo Tsuruta
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, Nov 26, 2008 at 09:47:07PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Subject: Re: [patch 0/4] [RFC] Another proportional weight IO controller
> Date: Tue, 25 Nov 2008 11:27:20 -0500
>
> > On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > > Hi Vivek,
> > >
> > > > > > Ryo, do you still want to stick to two level scheduling? Given the problem
> > > > > > of it breaking down underlying scheduler's assumptions, probably it makes
> > > > > > more sense to the IO control at each individual IO scheduler.
> > > > >
> > > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > > algorithm into the block I/O layer experimentally.
> > > >
> > > > Thanks Ryo. Implementing a control at block layer sounds like another
> > > > 2 level scheduling. We will still have the issue of breaking underlying
> > > > CFQ and other schedulers. How to plan to resolve that conflict.
> > >
> > > I think there is no conflict against I/O schedulers.
> > > Could you explain to me about the conflict?
> >
> > Because we do the buffering at higher level scheduler and mostly release
> > the buffered bios in the FIFO order, it might break the underlying IO
> > schedulers. Generally it is the decision of IO scheduler to determine in
> > what order to release buffered bios.
> >
> > For example, If there is one task of io priority 0 in a cgroup and rest of
> > the tasks are of io prio 7. All the tasks belong to best effort class. If
> > tasks of lower priority (7) do lot of IO, then due to buffering there is
> > a chance that IO from lower prio tasks is seen by CFQ first and io from
> > higher prio task is not seen by cfq for quite some time hence that task
> > not getting its fair share within the cgroup. Similar situations can
> > arise with RT tasks also.
>
> Thanks for your explanation.
> I think that the same thing occurs without the higher level scheduler,
> because all the tasks issuing I/Os are blocked while the underlying
> device's request queue is full before those I/Os are sent to the I/O
> scheduler.
>

True, and this issue was pointed out by Divyesh. I think we shall have to
fix this by allocating the request descriptors in proportion to their
share. One possible way is to make use of elv_may_queue() to determine
if we can allocate further request descriptors or not.

> > > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > > may be little above that where one can try some code sharing among IO
> > > > schedulers?
> > >
> > > I would like to support any type of block device even if I/Os issued
> > > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > > can be made use of for the devices such as loop device.
> > >
> >
> > What do you mean by that IO issued to underlying device does not go
> > through IO scheduler? loop device will be associated with a file and
> > IO will ultimately go to the IO scheduler which is serving those file
> > blocks?
>
> How about if the file is on an NFS-mounted file system?
>

Interesting. So on the surface it looks like contention for disk, but it
is more the contention for network and contention for disk on the NFS
server.

True that leaf node IO control will not help here as IO is not going to
the leaf node at all. We can make the situation better by doing resource
control on network IO though.

> > What's the use case scenario of doing IO control at loop device?
> > Ultimately the resource contention will take place on actual underlying
> > physical device where the file blocks are. Will doing the resource control
> > there not solve the issue for you?
>
> I can't come up with any use case, but I would like to make the
> resource controller more flexible. Actually, a certain block device
> that I'm using does not use the I/O scheduler.

Isn't it equivalent to using No-op? If yes, then it should not be an
issue?

Thanks
Vivek
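Vivek's idea of handing out request descriptors in proportion to each cgroup's share can be sketched in user space as follows. This is a hypothetical model, not the kernel's elv_may_queue() hook or any patch from this thread: the DescriptorPool class, its method names and the pool size of 128 are assumptions made for illustration, and the weights 1024 and 2048 are borrowed from the example in the documentation patch.

```python
class DescriptorPool:
    """Toy model of a request queue whose descriptors are split by cgroup weight."""

    def __init__(self, nr_requests, weights):
        total = sum(weights.values())
        # Integer share of the pool per cgroup, at least one descriptor each.
        self.limit = {cg: max(1, nr_requests * w // total)
                      for cg, w in weights.items()}
        self.in_use = {cg: 0 for cg in weights}

    def may_queue(self, cgroup):
        """True if this cgroup is still below its proportional limit."""
        return self.in_use[cgroup] < self.limit[cgroup]

    def allocate(self, cgroup):
        """Take one descriptor if allowed; the caller would otherwise have to wait."""
        if not self.may_queue(cgroup):
            return False
        self.in_use[cgroup] += 1
        return True

    def complete(self, cgroup):
        """Return one descriptor when a request finishes."""
        self.in_use[cgroup] -= 1

pool = DescriptorPool(nr_requests=128, weights={"A": 1024, "B": 2048})

# Cgroup A tries to grab 100 descriptors; it is capped at its share.
granted_a = sum(pool.allocate("A") for _ in range(100))
print(pool.limit, granted_a, pool.may_queue("B"))
```

A heavy submitter in cgroup A can no longer fill the whole queue, so cgroup B still finds free descriptors. The sketch says nothing about which waiter to wake on completion or how to identify the submitting task inside the hook.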
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
       [not found] ` <1227775382.7443.43.camel-xpvPi5bcW5X5OjGIXfuPlhrrLbDL3r4M6qtp775pBPw@public.gmane.org>
@ 2008-11-28  3:09   ` Ryo Tsuruta
  0 siblings, 0 replies; 92+ messages in thread
From: Ryo Tsuruta @ 2008-11-28 3:09 UTC
To: fernando-gVGce1chcLdL9jVzuh4AOg
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi,

> > > I can't come up with any use case, but I would like to make the
> > > resource controller more flexible. Actually, a certain block device
> > > that I'm using does not use the I/O scheduler.
> >
> > Isn't it equivalent to using No-op? If yes, then it should not be an
> > issue?
>
> No, it is not equivalent. When using device drivers that provide their
> own make_request_fn() (check for devices that invoke
> blk_queue_make_request() at initialization time) bios entering the block
> layer can go directly to the device driver and from there to the device.

As Fernando said, that device driver invokes blk_queue_make_request().

Thanks,
Ryo Tsuruta
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
       [not found] ` <20081126160805.GE27826-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2008-11-27  8:43   ` Fernando Luis Vázquez Cao
  2008-11-28 13:33   ` Ryo Tsuruta
  1 sibling, 0 replies; 92+ messages in thread
From: Fernando Luis Vázquez Cao @ 2008-11-27 8:43 UTC
To: Vivek Goyal
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, 2008-11-26 at 11:08 -0500, Vivek Goyal wrote:
> > > > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > > > may be little above that where one can try some code sharing among IO
> > > > > schedulers?
> > > >
> > > > I would like to support any type of block device even if I/Os issued
> > > > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > > > can be made use of for the devices such as loop device.
> > > >
> > >
> > > What do you mean by that IO issued to underlying device does not go
> > > through IO scheduler? loop device will be associated with a file and
> > > IO will ultimately go to the IO scheduler which is serving those file
> > > blocks?
> >
> > How about if the file is on an NFS-mounted file system?
> >
>
> Interesting. So on the surface it looks like contention for disk, but it
> is more the contention for network and contention for disk on the NFS
> server.
>
> True that leaf node IO control will not help here as IO is not going to
> the leaf node at all. We can make the situation better by doing resource
> control on network IO though.

On the client side NFS does not go through the block layer, so no control
is possible there. As Vivek pointed out, this could be tackled at the
network layer. Though I guess we could make do with a solution that
controls just the number of dirty pages (this would work for NFS writes
since the NFS superblock has a backing_dev_info structure associated
with it).

> > > What's the use case scenario of doing IO control at loop device?
> > > Ultimately the resource contention will take place on actual underlying
> > > physical device where the file blocks are. Will doing the resource control
> > > there not solve the issue for you?
> >
> > I can't come up with any use case, but I would like to make the
> > resource controller more flexible. Actually, a certain block device
> > that I'm using does not use the I/O scheduler.
>
> Isn't it equivalent to using No-op? If yes, then it should not be an
> issue?

No, it is not equivalent. When using device drivers that provide their
own make_request_fn() (check for devices that invoke
blk_queue_make_request() at initialization time) bios entering the block
layer can go directly to the device driver and from there to the device.

Regards,

Fernando
* Re: [patch 0/4] [RFC] Another proportional weight IO controller
       [not found] ` <20081126160805.GE27826-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2008-11-27  8:43   ` Fernando Luis Vázquez Cao
@ 2008-11-28 13:33   ` Ryo Tsuruta
  1 sibling, 0 replies; 92+ messages in thread
From: Ryo Tsuruta @ 2008-11-28 13:33 UTC
To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, paolo.valente-rcYM44yAMweonA0d6jMUrA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, ngupta-hpIqsD4AKlfQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi Vivek,

> > Thanks for your explanation.
> > I think that the same thing occurs without the higher level scheduler,
> > because all the tasks issuing I/Os are blocked while the underlying
> > device's request queue is full before those I/Os are sent to the I/O
> > scheduler.
> >
>
> True, and this issue was pointed out by Divyesh. I think we shall have to
> fix this by allocating the request descriptors in proportion to their
> share. One possible way is to make use of elv_may_queue() to determine
> if we can allocate further request descriptors or not.

At first glance, elv_may_queue() seemed useful for the purpose you
mentioned, but I noticed some problems after investigating the code
further.

1. Every I/O controller must have its own algorithm to decide which
   I/O requests should be blocked, and that algorithm will be similar
   to dm-ioband's. It would be a hassle to implement it in all the
   I/O controllers.

2. When an I/O completes, one of the slots in the request queue becomes
   available, and one of the blocked processes is woken up in FIFO
   order. In most cases this won't be the best process to wake, so you
   have to put it back to sleep and you may want to wake up another
   one. That is inefficient.

3. In elv_may_queue(), we can't determine which process issued an I/O.
   You have no choice but to make any kind of process sleep, even if
   it is a kernel thread such as kswapd or pdflush. What do you think
   is going to happen after that? It may be possible to modify the
   code so that kernel threads are not blocked, but I don't think you
   can control delayed-write I/Os.

If you want to solve these problems, I think you will end up
implementing an algorithm there whose code is very similar to
dm-ioband's.

Thanks,
Ryo Tsuruta
* [patch 0/4] [RFC] Another proportional weight IO controller
@ 2008-11-06 15:30 vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  0 siblings, 0 replies; 92+ messages in thread
From: vgoyal-H+wXaHxf7aLQT0dZR+AlfA @ 2008-11-06 15:30 UTC
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Hirokazu Takahashi
Cc: Rik van Riel, fernando-gVGce1chcLdL9jVzuh4AOg, Jeff Moyer, menage-hpIqsD4AKlfQT0dZR+AlfA, ngupta-hpIqsD4AKlfQT0dZR+AlfA, Andrew Morton, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Hi,

If you are not already tired of so many io controller implementations, here
is another one.

This is a very early, very crude implementation, posted to get early
feedback on whether this approach makes any sense or not.

This controller is a proportional weight IO controller primarily
based on/inspired by dm-ioband. One of the things I personally found a
little odd about dm-ioband was the need for a dm-ioband device for every
device we want to control. I thought that probably we can make this control
per request queue and get rid of the device mapper driver. This should make
the configuration aspect easy.

I have picked up quite some amount of code from dm-ioband, especially for
the biocgroup implementation.

I have done very basic testing, namely running 2-3 dd commands in
different cgroups on x86_64. Wanted to throw out the code early to get some
feedback.

More details about the design and a how-to are in the documentation patch.

Your comments are welcome.

Thanks
Vivek

--
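The documentation patch describes the proportional weight scheme as tokens assigned in proportion to cgroup weights, with one token charged per sector. A minimal sketch of that accounting is shown below. It is not the patch's code: allocate_tokens(), try_dispatch(), the 3072 tokens per round and the 8-sector bio size are assumptions chosen for illustration, and only the weights 1024 and 2048 come from the documentation's own example.

```python
def allocate_tokens(weights, tokens_per_round):
    """Split a round's tokens among cgroups in proportion to their weights."""
    total = sum(weights.values())
    return {cg: tokens_per_round * w // total for cg, w in weights.items()}

def try_dispatch(tokens, cgroup, sectors):
    """Charge one token per sector.

    Returns True if the bio may be dispatched now, False if it would
    have to be buffered until the next token allocation.
    """
    if tokens[cgroup] >= sectors:
        tokens[cgroup] -= sectors
        return True
    return False

weights = {"A": 1024, "B": 2048}
tokens = allocate_tokens(weights, tokens_per_round=3072)

# Both cgroups submit 300 bios of 8 sectors each within one round.
dispatched_a = sum(try_dispatch(tokens, "A", 8) for _ in range(300))
dispatched_b = sum(try_dispatch(tokens, "B", 8) for _ in range(300))

print(dispatched_a, dispatched_b, tokens)
```

Within one round cgroup B dispatches twice as many bios as cgroup A, matching the 1:2 weight ratio; the remaining bios from each cgroup are the ones that would be buffered until fresh tokens are handed out.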
Thread overview: 92+ messages, ending 2008-11-28 13:33 UTC
[not found] <20081106153022.215696930@redhat.com>
2008-11-06 15:30 ` [patch 1/4] io controller: documentation vgoyal-H+wXaHxf7aLQT0dZR+AlfA
2008-11-06 15:30 ` [patch 2/4] io controller: biocgroup implementation vgoyal-H+wXaHxf7aLQT0dZR+AlfA
2008-11-06 15:30 ` [patch 3/4] io controller: Core IO controller implementation logic vgoyal-H+wXaHxf7aLQT0dZR+AlfA
2008-11-06 15:30 ` [patch 4/4] io controller: Put IO controller to use in device mapper and standard make_request() function vgoyal-H+wXaHxf7aLQT0dZR+AlfA
[not found] ` <1225986593.7803.4688.camel@twins>
2008-11-06 16:01 ` [patch 0/4] [RFC] Another proportional weight IO controller Vivek Goyal
[not found] ` <20081106160154.GA7461@redhat.com>
[not found] ` <20081106160154.GA7461-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-06 16:16 ` Peter Zijlstra
[not found] ` <1225988173.7803.4723.camel@twins>
2008-11-06 16:39 ` Vivek Goyal
2008-11-06 16:47 ` Rik van Riel
[not found] ` <20081106163957.GB7461@redhat.com>
[not found] ` <20081106163957.GB7461-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-06 16:52 ` Peter Zijlstra
[not found] ` <1225990327.7803.4776.camel@twins>
2008-11-06 16:57 ` Rik van Riel
2008-11-06 17:08 ` Vivek Goyal
[not found] ` <491321ED.5010103@redhat.com>
[not found] ` <491321ED.5010103-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-06 17:11 ` Peter Zijlstra
[not found] ` <1225991487.7803.4801.camel@twins>
2008-11-07 0:41 ` Dave Chinner
[not found] ` <20081107004131.GD2373@disturbed>
2008-11-07 10:31 ` Peter Zijlstra
[not found] ` <1226053904.7803.5856.camel@twins>
2008-11-09 9:40 ` Dave Chinner
[not found] ` <20081106170830.GD7461@redhat.com>
[not found] ` <20081106170830.GD7461-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-06 23:07 ` Nauman Rafique
[not found] ` <e98e18940811061507t3e19183byf2b8b291458ba81b@mail.gmail.com>
[not found] ` <e98e18940811061507t3e19183byf2b8b291458ba81b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-07 14:19 ` Vivek Goyal
[not found] ` <20081107141943.GC21884-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-07 21:36 ` Nauman Rafique
[not found] ` <e98e18940811071336n58a073d8w2cbaeddd5657d1e9@mail.gmail.com>
[not found] ` <e98e18940811071336n58a073d8w2cbaeddd5657d1e9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-10 14:11 ` Vivek Goyal
[not found] ` <20081110141143.GC26956@redhat.com>
[not found] ` <20081110141143.GC26956-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-11 19:55 ` Nauman Rafique
[not found] ` <e98e18940811111155q4bd73480pebe088fa1adbe2e4@mail.gmail.com>
[not found] ` <e98e18940811111155q4bd73480pebe088fa1adbe2e4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-11 22:30 ` Vivek Goyal
[not found] ` <20081111223024.GA31527@redhat.com>
[not found] ` <20081111223024.GA31527-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-12 21:20 ` Nauman Rafique
[not found] ` <e98e18940811121320w5f321302n13b526887cbb4012@mail.gmail.com>
[not found] ` <e98e18940811121320w5f321302n13b526887cbb4012-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-13 13:49 ` Fabio Checconi
[not found] ` <20081106153135.790621895@redhat.com>
[not found] ` <20081106153135.790621895-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-07 2:50 ` [patch 2/4] io controller: biocgroup implementation KAMEZAWA Hiroyuki
[not found] ` <20081107115030.7ccf3f07.kamezawa.hiroyu@jp.fujitsu.com>
[not found] ` <20081107115030.7ccf3f07.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2008-11-07 4:19 ` Hirokazu Takahashi
2008-11-07 14:44 ` Vivek Goyal
[not found] ` <20081106153135.869625751@redhat.com>
[not found] ` <20081106153135.869625751-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-07 3:21 ` [patch 3/4] io controller: Core IO controller implementation logic KAMEZAWA Hiroyuki
2008-11-11 8:50 ` Gui Jianfeng
[not found] ` <20081107122145.69500cd3.kamezawa.hiroyu@jp.fujitsu.com>
[not found] ` <20081107122145.69500cd3.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2008-11-07 14:50 ` Vivek Goyal
[not found] ` <20081107145036.GF21884@redhat.com>
[not found] ` <20081107145036.GF21884-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-08 2:35 ` [patch 3/4] io controller: Core IO controller implementationlogic KAMEZAWA Hiroyuki
[not found] ` <4913A9C2.8060904@cn.fujitsu.com>
[not found] ` <4913A9C2.8060904-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2008-11-07 13:38 ` [patch 0/4] [RFC] Another proportional weight IO controller Vivek Goyal
[not found] ` <20081106153135.743458085@redhat.com>
[not found] ` <20081107113209.a6011c67.kamezawa.hiroyu@jp.fujitsu.com>
[not found] ` <20081107113209.a6011c67.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2008-11-07 14:27 ` [patch 1/4] io controller: documentation Vivek Goyal
[not found] ` <20081106153135.743458085-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-07 2:32 ` KAMEZAWA Hiroyuki
2008-11-07 3:46 ` KAMEZAWA Hiroyuki
2008-11-10 2:48 ` Li Zefan
[not found] ` <4917A116.7040603@cn.fujitsu.com>
[not found] ` <4917A116.7040603-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2008-11-10 13:44 ` Vivek Goyal
[not found] ` <20081106153022.215696930-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-06 15:49 ` [patch 0/4] [RFC] Another proportional weight IO controller Peter Zijlstra
2008-11-07 2:36 ` Gui Jianfeng
2008-11-13 9:05 ` Ryo Tsuruta
[not found] ` <20081113.180558.519459540419535699.ryov@valinux.co.jp>
[not found] ` <20081113155834.GE7542@redhat.com>
[not found] ` <af41c7c40811131041t1b8491b6la5574ebe75f89000@mail.gmail.com>
[not found] ` <20081113214642.GG7542@redhat.com>
[not found] ` <af41c7c40811131457w472e4a86tb5344cc1d3d366fb@mail.gmail.com>
[not found] ` <af41c7c40811131457w472e4a86tb5344cc1d3d366fb-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-14 16:05 ` Vivek Goyal
[not found] ` <20081114160525.GE24624@redhat.com>
[not found] ` <20081114160525.GE24624-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-14 22:44 ` Nauman Rafique
[not found] ` <e98e18940811141444u5947b806v27fac453ed1e8a5@mail.gmail.com>
[not found] ` <e98e18940811141444u5947b806v27fac453ed1e8a5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-17 14:23 ` Vivek Goyal
[not found] ` <20081117142309.GA15564@redhat.com>
[not found] ` <20081117142309.GA15564-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-18 2:02 ` Li Zefan
[not found] ` <4922224A.5030502@cn.fujitsu.com>
[not found] ` <4922224A.5030502-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2008-11-18 5:01 ` Nauman Rafique
[not found] ` <e98e18940811172101na345b6bh5c73f9e657aac5a7@mail.gmail.com>
[not found] ` <e98e18940811172101na345b6bh5c73f9e657aac5a7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-18 7:42 ` Li Zefan
2008-11-18 12:05 ` Fabio Checconi
[not found] ` <20081118120508.GD15268@gandalf.sssup.it>
[not found] ` <20081118120508.GD15268-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2008-11-18 14:07 ` Vivek Goyal
2008-11-18 22:33 ` Nauman Rafique
[not found] ` <20081118140751.GA4283@redhat.com>
[not found] ` <20081118140751.GA4283-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-18 14:41 ` Fabio Checconi
[not found] ` <20081118144139.GE15268@gandalf.sssup.it>
[not found] ` <20081118191208.GJ26308@kernel.dk>
[not found] ` <20081118191208.GJ26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2008-11-18 19:47 ` Vivek Goyal
2008-11-18 21:14 ` Fabio Checconi
2008-11-18 23:07 ` Nauman Rafique
[not found] ` <e98e18940811181507t6b1473act2efa23df21dab270@mail.gmail.com>
[not found] ` <e98e18940811181507t6b1473act2efa23df21dab270-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-19 14:24 ` Jens Axboe
[not found] ` <20081119142446.GH26308@kernel.dk>
[not found] ` <20081119142446.GH26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2008-11-20 0:12 ` Divyesh Shah
[not found] ` <af41c7c40811191612v5db13ae7n3cfe537beb6a157c@mail.gmail.com>
[not found] ` <af41c7c40811191612v5db13ae7n3cfe537beb6a157c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-20 8:16 ` Jens Axboe
[not found] ` <20081120081640.GE26308@kernel.dk>
[not found] ` <20081120081640.GE26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2008-11-20 13:40 ` Vivek Goyal
[not found] ` <20081120134058.GA29306@redhat.com>
[not found] ` <e98e18940811201154l6fb0499x24da39812fb2aa7e@mail.gmail.com>
[not found] ` <e98e18940811201154l6fb0499x24da39812fb2aa7e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-20 21:15 ` Vivek Goyal
[not found] ` <20081120211536.GG29306@redhat.com>
[not found] ` <20081120211536.GG29306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-20 22:42 ` Nauman Rafique
[not found] ` <e98e18940811201442s787a346em4ada30bcb1badfe6@mail.gmail.com>
[not found] ` <e98e18940811201442s787a346em4ada30bcb1badfe6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-21 15:22 ` Vivek Goyal
[not found] ` <20081120134058.GA29306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-20 19:54 ` Nauman Rafique
2008-11-26 6:40 ` Fernando Luis Vázquez Cao
[not found] ` <1227681618.12997.163.camel@sebastian.kern.oss.ntt.co.jp>
[not found] ` <1227681618.12997.163.camel-xpvPi5bcW5X5OjGIXfuPlhrrLbDL3r4M6qtp775pBPw@public.gmane.org>
2008-11-26 15:18 ` Vivek Goyal
[not found] ` <20081118211442.GG15268@gandalf.sssup.it>
[not found] ` <4923716A.5090104@gelato.unsw.edu.au>
[not found] ` <4923716A.5090104-M3ycANVxPotyL3EAZA59ERCuuivNXqWP@public.gmane.org>
2008-11-19 10:17 ` Fabio Checconi
[not found] ` <20081119101701.GA20915@gandalf.sssup.it>
[not found] ` <20081119101701.GA20915-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2008-11-19 11:06 ` Fabio Checconi
[not found] ` <20081119110655.GC20915-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2008-11-20 4:45 ` Aaron Carroll
[not found] ` <4924EB4E.7050600@gelato.unsw.edu.au>
[not found] ` <4924EB4E.7050600-M3ycANVxPotyL3EAZA59ERCuuivNXqWP@public.gmane.org>
2008-11-20 6:56 ` Fabio Checconi
[not found] ` <20081118211442.GG15268-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2008-11-19 1:52 ` Aaron Carroll
2008-11-19 14:30 ` Jens Axboe
[not found] ` <20081119143006.GI26308@kernel.dk>
[not found] ` <20081119143006.GI26308-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2008-11-19 15:52 ` Fabio Checconi
[not found] ` <20081118144139.GE15268-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2008-11-18 19:12 ` Jens Axboe
2008-11-20 21:31 ` Vivek Goyal
[not found] ` <20081120213155.GI29306@redhat.com>
[not found] ` <20081120213155.GI29306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-21 3:05 ` Fabio Checconi
[not found] ` <20081121030533.GA30883@gandalf.sssup.it>
[not found] ` <20081121030533.GA30883-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2008-11-21 14:58 ` Vivek Goyal
[not found] ` <20081121145823.GD3111@redhat.com>
[not found] ` <20081121145823.GD3111-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-21 15:21 ` Fabio Checconi
[not found] ` <e98e18940811181433o4bb5a147i1e0b9c1baf495ae2@mail.gmail.com>
[not found] ` <e98e18940811181433o4bb5a147i1e0b9c1baf495ae2-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-18 23:44 ` Fabio Checconi
2008-11-19 7:09 ` Paolo Valente
[not found] ` <492271EF.4050002@cn.fujitsu.com>
[not found] ` <492271EF.4050002-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2008-11-18 22:23 ` Nauman Rafique
[not found] ` <20081113221304.GH7542@redhat.com>
[not found] ` <20081113221304.GH7542-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-20 9:20 ` Ryo Tsuruta
[not found] ` <20081120.182053.220301508585579959.ryov@valinux.co.jp>
[not found] ` <20081120.182053.220301508585579959.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2008-11-20 13:47 ` Vivek Goyal
[not found] ` <20081120134701.GB29306@redhat.com>
[not found] ` <20081120134701.GB29306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-25 2:33 ` Ryo Tsuruta
[not found] ` <20081125.113359.623571555980951312.ryov@valinux.co.jp>
[not found] ` <20081125.113359.623571555980951312.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2008-11-25 16:27 ` Vivek Goyal
[not found] ` <20081125162720.GH341@redhat.com>
[not found] ` <20081125162720.GH341-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-25 22:38 ` Nauman Rafique
2008-11-26 11:55 ` Fernando Luis Vázquez Cao
2008-11-26 12:47 ` Ryo Tsuruta
[not found] ` <e98e18940811251438v245f79aegfdc92bee737af64c@mail.gmail.com>
[not found] ` <e98e18940811251438v245f79aegfdc92bee737af64c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-26 14:06 ` Paolo Valente
[not found] ` <492D57E1.5090608@unimore.it>
[not found] ` <492D57E1.5090608-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2008-11-26 19:41 ` Nauman Rafique
[not found] ` <e98e18940811261141x307cf06fldd5e481e85da5c2d@mail.gmail.com>
[not found] ` <e98e18940811261141x307cf06fldd5e481e85da5c2d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-11-26 22:21 ` Fabio Checconi
[not found] ` <20081126.214707.653026525707335397.ryov@valinux.co.jp>
[not found] ` <20081126.214707.653026525707335397.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2008-11-26 16:08 ` Vivek Goyal
[not found] ` <20081126160805.GE27826@redhat.com>
[not found] ` <1227775382.7443.43.camel@sebastian.kern.oss.ntt.co.jp>
[not found] ` <1227775382.7443.43.camel-xpvPi5bcW5X5OjGIXfuPlhrrLbDL3r4M6qtp775pBPw@public.gmane.org>
2008-11-28 3:09 ` Ryo Tsuruta
[not found] ` <20081126160805.GE27826-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-11-27 8:43 ` Fernando Luis Vázquez Cao
2008-11-28 13:33 ` Ryo Tsuruta
2008-11-06 15:30 vgoyal-H+wXaHxf7aLQT0dZR+AlfA