* [PATCH 0/3] blk-throttle: async write throttling
@ 2011-02-28 10:15 Andrea Righi
  2011-02-28 10:15 ` [PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio Andrea Righi
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Andrea Righi @ 2011-02-28 10:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen,
      Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
      Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm,
      linux-kernel

Overview
========

Currently the blkio.throttle controller only supports synchronous IO requests.
This means that we always look at the current task to identify the "owner" of
each IO request.

However, dirty pages in the page cache can be written to disk asynchronously
by the per-bdi flusher kernel threads or by any other thread in the system,
according to the writeback policy.

For this reason the real writes to the underlying block devices may occur in
a different IO context than the task that originally generated the dirty
pages involved in the IO operation. This makes the tracking and throttling of
writeback IO more complicated than that of synchronous IO from the blkio
controller's perspective.

Proposed solution
=================

In the previous patch set (http://lwn.net/Articles/429292/) I proposed to
resolve the problem of the buffered write limitation by tracking the
ownership of all the dirty pages in the system.

This would allow us to always identify the owner of each IO operation at the
block layer and apply the appropriate throttling policy implemented by the
blkio.throttle controller.

This solution makes the blkio.throttle controller work as expected also for
writeback IO, but it does not resolve the problem of faster cgroups getting
blocked by slower cgroups (which would expose a potential way to create a DoS
in the system).

In fact, at the moment critical IO requests (those that have dependencies on
other IO requests made by other cgroups) and non-critical requests are mixed
together at the filesystem layer in such a way that throttling a single write
request may also stall other requests in the system, and at the block layer
it is not possible to retrieve such information to make the right decision.

A simple solution to this problem could be to just limit the rate of async
writes at the time a task is generating dirty pages in the page cache. The
big advantage of this approach is that it does not need the overhead of
tracking the ownership of the dirty pages, because in this way, from the
blkio controller's perspective, all the IO operations happen from the process
context: writes in memory and synchronous reads from the block device.

The drawback of this approach is that the blkio.throttle controller becomes a
little bit leaky, because with this solution the controller is still affected
by the IO spikes during the writeback of dirty pages executed by the kernel
threads.

Probably an even better approach would be to introduce the tracking of dirty
page ownership to properly account the cost of each IO operation at the block
layer, and apply the throttling of async writes in memory only when IO limits
are exceeded.
To summarize, we can identify three possible solutions to properly throttle
buffered writes:

1) account & throttle everything at the block IO layer (bad for "priority
   inversion" problems, needs page tracking for blkio)

2) account at the block IO layer and throttle in memory (needs page tracking
   for blkio)

3) account & throttle in memory (affected by IO spikes, depending on the
   dirty_ratio / dirty_background_ratio settings)

For now we start with solution 3), which seems to be the simplest way to
proceed.

Testcase
========
- create a cgroup with a 4MiB/s write limit:
  # mount -t cgroup -o blkio none /mnt/cgroup
  # mkdir /mnt/cgroup/foo
  # echo 8:0 $((4 * 1024 * 1024)) > /mnt/cgroup/foo/blkio.throttle.write_bps_device

  NOTE: async io is still limited per-device, as well as sync io

- move a task into the cgroup and run a dd to generate some writeback IO

Results:

- 2.6.38-rc6 vanilla:

  $ cat /proc/$$/cgroup
  1:blkio:/foo
  $ dd if=/dev/zero of=zero bs=1M count=128 &
  $ dstat -df
  --dsk/sda--
   read  writ
     0     0
   ...
     0     0
     0    22M   <--- writeback starts here and is not limited at all
     0    43M
     0    45M
     0    18M
   ...

- 2.6.38-rc6 + async write throttling:

  $ cat /proc/$$/cgroup
  1:blkio:/foo
  $ dd if=/dev/zero of=zero bs=1M count=128 &
  $ dstat -df
  --dsk/sda--
   read  writ
     0     0
     0     0
     0     0
     0     0
     0    22M   <--- we have some IO spikes, but the overall writeback IO
     0     0         is controlled according to the blkio write limit
     0     0
     0     0
     0     0
     0    29M
     0     0
     0     0
     0     0
     0     0
     0    26M
     0     0
     0     0
     0     0
     0     0
     0    30M
     0     0
     0     0
     0     0
     0     0
     0    20M

TODO
~~~~
- Consider adding the following new files to the blkio controller to allow
  the user to explicitly limit async writes as well as sync writes:

    blkio.throttle.async.write_bps_limit
    blkio.throttle.async.write_iops_limit

Any feedback is welcome.
-Andrea

[PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio
[PATCH 2/3] blkio-throttle: infrastructure to throttle async io
[PATCH 3/3] blkio-throttle: async write io instrumentation

 block/blk-throttle.c      |  106 ++++++++++++++++++++++++++++++---------------
 fs/direct-io.c            |    1 +
 include/linux/blk_types.h |    2 +
 include/linux/blkdev.h    |    6 +++
 mm/page-writeback.c       |   17 +++++++
 5 files changed, 97 insertions(+), 35 deletions(-)
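For readers who want to see the mechanism of solution 3) in isolation, here is
a small self-contained user-space sketch: every chunk of dirtied bytes is
charged against a per-group bytes-per-second budget and the writer sleeps off
any deficit, which is roughly what the blk_throtl_async() added in patch 2/3
does via schedule_timeout_killable(). The struct and function names below are
invented for the example; only the charge-then-sleep logic follows the series.

/*
 * Minimal user-space sketch of "account & throttle in memory" (solution 3).
 * A writer charges every chunk it dirties against a bytes-per-second budget
 * and sleeps for the deficit. All names here are made up for the example.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

struct throttle_group {
        double bps;             /* configured write limit, bytes/sec */
        double bytes_disp;      /* bytes charged so far */
        struct timespec start;  /* beginning of the accounting window */
};

static double elapsed_sec(const struct timespec *since)
{
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - since->tv_sec) +
               (now.tv_nsec - since->tv_nsec) / 1e9;
}

/* Charge 'size' bytes; sleep if the group is over its budget. */
static void charge_and_throttle(struct throttle_group *tg, size_t size)
{
        double allowed = tg->bps * elapsed_sec(&tg->start);
        double extra = tg->bytes_disp + size - allowed;

        if (extra > 0)
                usleep((useconds_t)(extra / tg->bps * 1e6));
        tg->bytes_disp += size;
}

int main(void)
{
        struct throttle_group tg = { .bps = 4.0 * 1024 * 1024 };
        int i;

        clock_gettime(CLOCK_MONOTONIC, &tg.start);
        /* Pretend to dirty 128 chunks of 1 MiB, like the dd testcase. */
        for (i = 0; i < 128; i++)
                charge_and_throttle(&tg, 1024 * 1024);
        printf("dirtied 128 MiB in %.1f s (expected ~32 s at 4 MiB/s)\n",
               elapsed_sec(&tg.start));
        return 0;
}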
* [PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio
  2011-02-28 10:15 [PATCH 0/3] blk-throttle: async write throttling Andrea Righi
@ 2011-02-28 10:15 ` Andrea Righi
  2011-02-28 10:15 ` [PATCH 2/3] blkio-throttle: infrastructure to throttle async io Andrea Righi
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Andrea Righi @ 2011-02-28 10:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen,
      Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
      Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm,
      linux-kernel, Andrea Righi

Introduce a new flag to identify whether a bio has been generated for a
direct IO operation. This flag is used by the blkio controller to decide
whether a write IO request has been issued by the current task and can be
limited directly, or whether it has been generated from another IO context as
the result of a buffered IO operation.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 fs/direct-io.c            |    1 +
 include/linux/blk_types.h |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b044705..fe364a4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -361,6 +361,7 @@ static void dio_bio_submit(struct dio *dio)
 	unsigned long flags;

 	bio->bi_private = dio;
+	bio->bi_rw |= REQ_DIRECT;
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	dio->refcount++;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46ad519..2f98c03 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -130,6 +130,7 @@ enum rq_flag_bits {
 	/* bio only flags */
 	__REQ_UNPLUG,		/* unplug the immediately after submission */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
+	__REQ_DIRECT,		/* direct io request */
 	__REQ_THROTTLED,	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
@@ -173,6 +174,7 @@ enum rq_flag_bits {
 #define REQ_UNPLUG		(1 << __REQ_UNPLUG)
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
 #define REQ_THROTTLED		(1 << __REQ_THROTTLED)
+#define REQ_DIRECT		(1 << __REQ_DIRECT)
 #define REQ_SORTED		(1 << __REQ_SORTED)
 #define REQ_SOFTBARRIER	(1 << __REQ_SOFTBARRIER)
--
1.7.1
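For context, the consumer of this flag appears in patch 2/3: block-layer
throttling is skipped for writes that do not carry REQ_DIRECT, because those
buffered writes are throttled later, when the pages are dirtied. A small
stand-alone sketch of that decision follows; the EX_* flag values and the
helper name are invented for the illustration, only the policy itself is
taken from the series.

/*
 * Stand-alone illustration of the throttling policy introduced by this
 * series: reads and direct writes are throttled at the block layer, while
 * buffered (non-REQ_DIRECT) writes are left alone there and throttled when
 * the pages are dirtied. Flag values and helper name are made up.
 */
#include <stdbool.h>
#include <stdio.h>

#define EX_WRITE        (1u << 0)       /* stand-in for a write request */
#define EX_DIRECT       (1u << 1)       /* stand-in for REQ_DIRECT      */

static bool throttled_at_block_layer(unsigned int rw_flags)
{
        /* Buffered writes: handled later, when the task dirties pages. */
        if ((rw_flags & EX_WRITE) && !(rw_flags & EX_DIRECT))
                return false;
        /* Reads and direct IO writes: throttled directly on submission. */
        return true;
}

int main(void)
{
        printf("buffered write: %d\n", throttled_at_block_layer(EX_WRITE));
        printf("direct write:   %d\n",
               throttled_at_block_layer(EX_WRITE | EX_DIRECT));
        printf("read:           %d\n", throttled_at_block_layer(0));
        return 0;
}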
* [PATCH 2/3] blkio-throttle: infrastructure to throttle async io 2011-02-28 10:15 [PATCH 0/3] blk-throttle: async write throttling Andrea Righi 2011-02-28 10:15 ` [PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio Andrea Righi @ 2011-02-28 10:15 ` Andrea Righi 2011-03-01 16:06 ` Vivek Goyal 2011-02-28 10:15 ` [PATCH 3/3] blkio-throttle: async write io instrumentation Andrea Righi ` (2 subsequent siblings) 4 siblings, 1 reply; 19+ messages in thread From: Andrea Righi @ 2011-02-28 10:15 UTC (permalink / raw) To: Vivek Goyal Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel, Andrea Righi Add blk_throtl_async() to throttle async io writes. The idea is to stop applications before they can generate too much dirty pages in the page cache. Each time a chunk of pages is wrote in memory we charge the current task / cgroup for the size of the pages dirtied. If blkio limit are exceeded we also enforce throttling at this layer, instead of throttling writes at the block IO layer, where it's not possible to associate the pages with the process that dirtied them. In this case throttling can be simply enforced via a schedule_timeout_killable(). Also change some internal blkio function to make them more generic, and allow to get the size and type of the IO operation without extracting these information directly from a struct bio. In this way the same functions can be used also in the contextes where a struct bio is not yet created (e.g., writes in the page cache). Signed-off-by: Andrea Righi <arighi@develer.com> --- block/blk-throttle.c | 106 ++++++++++++++++++++++++++++++++---------------- include/linux/blkdev.h | 6 +++ 2 files changed, 77 insertions(+), 35 deletions(-) diff --git a/block/blk-throttle.c b/block/blk-throttle.c index a89043a..07ef429 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -449,9 +449,8 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw) } static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg, - struct bio *bio, unsigned long *wait) + bool rw, size_t size, unsigned long *wait) { - bool rw = bio_data_dir(bio); unsigned int io_allowed; unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; u64 tmp; @@ -499,9 +498,8 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg, } static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg, - struct bio *bio, unsigned long *wait) + bool rw, size_t size, unsigned long *wait) { - bool rw = bio_data_dir(bio); u64 bytes_allowed, extra_bytes, tmp; unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; @@ -517,14 +515,14 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg, do_div(tmp, HZ); bytes_allowed = tmp; - if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) { + if (tg->bytes_disp[rw] + size <= bytes_allowed) { if (wait) *wait = 0; return 1; } /* Calc approx time to dispatch */ - extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed; + extra_bytes = tg->bytes_disp[rw] + size - bytes_allowed; jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]); if (!jiffy_wait) @@ -540,24 +538,11 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg, return 0; } -/* - * Returns whether one can dispatch a bio or not. 
Also returns approx number - * of jiffies to wait before this bio is with-in IO rate and can be dispatched - */ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg, - struct bio *bio, unsigned long *wait) + bool rw, size_t size, unsigned long *wait) { - bool rw = bio_data_dir(bio); unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0; - /* - * Currently whole state machine of group depends on first bio - * queued in the group bio list. So one should not be calling - * this function with a different bio if there are other bios - * queued. - */ - BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw])); - /* If tg->bps = -1, then BW is unlimited */ if (tg->bps[rw] == -1 && tg->iops[rw] == -1) { if (wait) @@ -577,8 +562,8 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg, throtl_extend_slice(td, tg, rw, jiffies + throtl_slice); } - if (tg_with_in_bps_limit(td, tg, bio, &bps_wait) - && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) { + if (tg_with_in_bps_limit(td, tg, rw, size, &bps_wait) && + tg_with_in_iops_limit(td, tg, rw, size, &iops_wait)) { if (wait) *wait = 0; return 1; @@ -595,20 +580,37 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg, return 0; } -static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio) +/* + * Returns whether one can dispatch a bio or not. Also returns approx number + * of jiffies to wait before this bio is with-in IO rate and can be dispatched + */ +static bool tg_may_dispatch_bio(struct throtl_data *td, struct throtl_grp *tg, + struct bio *bio, unsigned long *wait) { bool rw = bio_data_dir(bio); - bool sync = bio->bi_rw & REQ_SYNC; + /* + * Currently whole state machine of group depends on first bio queued + * in the group bio list. So one should not be calling this function + * with a different bio if there are other bios queued. + */ + BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw])); + + return tg_may_dispatch(td, tg, rw, bio->bi_size, wait); +} + +static void +throtl_charge_io(struct throtl_grp *tg, bool rw, bool sync, size_t size) +{ /* Charge the bio to the group */ - tg->bytes_disp[rw] += bio->bi_size; + tg->bytes_disp[rw] += size; tg->io_disp[rw]++; /* * TODO: This will take blkg->stats_lock. Figure out a way * to avoid this cost. 
*/ - blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync); + blkiocg_update_dispatch_stats(&tg->blkg, size, rw, sync); } static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg, @@ -630,10 +632,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg) struct bio *bio; if ((bio = bio_list_peek(&tg->bio_lists[READ]))) - tg_may_dispatch(td, tg, bio, &read_wait); + tg_may_dispatch_bio(td, tg, bio, &read_wait); if ((bio = bio_list_peek(&tg->bio_lists[WRITE]))) - tg_may_dispatch(td, tg, bio, &write_wait); + tg_may_dispatch_bio(td, tg, bio, &write_wait); min_wait = min(read_wait, write_wait); disptime = jiffies + min_wait; @@ -657,7 +659,8 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg, BUG_ON(td->nr_queued[rw] <= 0); td->nr_queued[rw]--; - throtl_charge_bio(tg, bio); + throtl_charge_io(tg, bio_data_dir(bio), + bio->bi_rw & REQ_SYNC, bio->bi_size); bio_list_add(bl, bio); bio->bi_rw |= REQ_THROTTLED; @@ -674,8 +677,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg, /* Try to dispatch 75% READS and 25% WRITES */ - while ((bio = bio_list_peek(&tg->bio_lists[READ])) - && tg_may_dispatch(td, tg, bio, NULL)) { + while ((bio = bio_list_peek(&tg->bio_lists[READ])) && + tg_may_dispatch_bio(td, tg, bio, NULL)) { tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl); nr_reads++; @@ -684,8 +687,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg, break; } - while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) - && tg_may_dispatch(td, tg, bio, NULL)) { + while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) && + tg_may_dispatch_bio(td, tg, bio, NULL)) { tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl); nr_writes++; @@ -998,6 +1001,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop) bio->bi_rw &= ~REQ_THROTTLED; return 0; } + /* Async writes are ratelimited in blk_throtl_async() */ + if (rw == WRITE && !(bio->bi_rw & REQ_DIRECT)) + return 0; spin_lock_irq(q->queue_lock); tg = throtl_get_tg(td); @@ -1018,8 +1024,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop) } /* Bio is with-in rate limit of group */ - if (tg_may_dispatch(td, tg, bio, NULL)) { - throtl_charge_bio(tg, bio); + if (tg_may_dispatch_bio(td, tg, bio, NULL)) { + throtl_charge_io(tg, bio_data_dir(bio), + bio->bi_rw & REQ_SYNC, bio->bi_size); goto out; } @@ -1044,6 +1051,35 @@ out: return 0; } +/* + * Enforce throttling on async i/o writes + */ +int blk_throtl_async(struct request_queue *q, size_t size) +{ + struct throtl_data *td = q->td; + struct throtl_grp *tg; + bool rw = 1; + unsigned long wait = 0; + + spin_lock_irq(q->queue_lock); + tg = throtl_get_tg(td); + if (tg_may_dispatch(td, tg, rw, size, &wait)) + throtl_charge_io(tg, rw, false, size); + else + throtl_log_tg(td, tg, "[%c] async. bdisp=%u sz=%u bps=%llu" + " iodisp=%u iops=%u queued=%d/%d", + rw == READ ? 
'R' : 'W', + tg->bytes_disp[rw], size, tg->bps[rw], + tg->io_disp[rw], tg->iops[rw], + tg->nr_queued[READ], tg->nr_queued[WRITE]); + spin_unlock_irq(q->queue_lock); + + if (wait >= throtl_slice) + schedule_timeout_killable(wait); + + return 0; +} + int blk_throtl_init(struct request_queue *q) { struct throtl_data *td; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 4d18ff3..01c8241 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1136,6 +1136,7 @@ static inline uint64_t rq_io_start_time_ns(struct request *req) extern int blk_throtl_init(struct request_queue *q); extern void blk_throtl_exit(struct request_queue *q); extern int blk_throtl_bio(struct request_queue *q, struct bio **bio); +extern int blk_throtl_async(struct request_queue *q, size_t size); extern void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay); extern void throtl_shutdown_timer_wq(struct request_queue *q); #else /* CONFIG_BLK_DEV_THROTTLING */ @@ -1144,6 +1145,11 @@ static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio) return 0; } +static inline int blk_throtl_async(struct request_queue *q, size_t size) +{ + return 0; +} + static inline int blk_throtl_init(struct request_queue *q) { return 0; } static inline int blk_throtl_exit(struct request_queue *q) { return 0; } static inline void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay) {} -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 19+ messages in thread
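The byte-budget arithmetic that the reworked tg_with_in_bps_limit() /
tg_may_dispatch() helpers now apply to both bios and page-cache writes can be
exercised in isolation. The following user-space rendition uses only the
formula from the patch (bytes_allowed grows with the elapsed slice time, then
the wait is the time needed to pay off the extra bytes); the HZ value and the
function name here are assumptions for the example, not kernel code.

/*
 * User-space rendition of the tg_with_in_bps_limit() arithmetic: given the
 * bytes already dispatched in the current slice, how long must an IO of
 * 'size' bytes wait to stay within 'bps'? Only the formula follows the patch.
 */
#include <stdio.h>
#include <stdint.h>

#define HZ 1000 /* jiffies per second, assumed for the example */

static unsigned long bps_wait_jiffies(uint64_t bps, uint64_t bytes_disp,
                                      uint64_t size, unsigned long jiffy_elapsed)
{
        uint64_t bytes_allowed = bps * jiffy_elapsed / HZ;
        uint64_t extra_bytes;
        unsigned long jiffy_wait;

        if (bytes_disp + size <= bytes_allowed)
                return 0;                       /* within the rate, no wait */

        extra_bytes = bytes_disp + size - bytes_allowed;
        jiffy_wait = (unsigned long)(extra_bytes * HZ / bps);
        return jiffy_wait ? jiffy_wait : 1;
}

int main(void)
{
        /* 4 MiB/s limit, 1 MiB already dispatched, another 1 MiB arriving
         * 100 ms into the slice: ~410 KiB allowed, so ~400 ms of wait. */
        uint64_t bps = 4ULL * 1024 * 1024;
        unsigned long wait = bps_wait_jiffies(bps, 1 << 20, 1 << 20, HZ / 10);

        printf("wait: %lu jiffies (%.0f ms)\n", wait, wait * 1000.0 / HZ);
        return 0;
}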
* Re: [PATCH 2/3] blkio-throttle: infrastructure to throttle async io 2011-02-28 10:15 ` [PATCH 2/3] blkio-throttle: infrastructure to throttle async io Andrea Righi @ 2011-03-01 16:06 ` Vivek Goyal 2011-03-02 13:31 ` Andrea Righi 0 siblings, 1 reply; 19+ messages in thread From: Vivek Goyal @ 2011-03-01 16:06 UTC (permalink / raw) To: Andrea Righi Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Mon, Feb 28, 2011 at 11:15:04AM +0100, Andrea Righi wrote: > Add blk_throtl_async() to throttle async io writes. > > The idea is to stop applications before they can generate too much dirty > pages in the page cache. Each time a chunk of pages is wrote in memory > we charge the current task / cgroup for the size of the pages dirtied. > > If blkio limit are exceeded we also enforce throttling at this layer, > instead of throttling writes at the block IO layer, where it's not > possible to associate the pages with the process that dirtied them. > > In this case throttling can be simply enforced via a > schedule_timeout_killable(). > > Also change some internal blkio function to make them more generic, and > allow to get the size and type of the IO operation without extracting > these information directly from a struct bio. In this way the same > functions can be used also in the contextes where a struct bio is not > yet created (e.g., writes in the page cache). > > Signed-off-by: Andrea Righi <arighi@develer.com> > --- > block/blk-throttle.c | 106 ++++++++++++++++++++++++++++++++---------------- > include/linux/blkdev.h | 6 +++ > 2 files changed, 77 insertions(+), 35 deletions(-) > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c > index a89043a..07ef429 100644 > --- a/block/blk-throttle.c > +++ b/block/blk-throttle.c > @@ -449,9 +449,8 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw) > } > > static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg, > - struct bio *bio, unsigned long *wait) > + bool rw, size_t size, unsigned long *wait) > { > - bool rw = bio_data_dir(bio); > unsigned int io_allowed; > unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; > u64 tmp; > @@ -499,9 +498,8 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg, > } > > static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg, > - struct bio *bio, unsigned long *wait) > + bool rw, size_t size, unsigned long *wait) > { > - bool rw = bio_data_dir(bio); > u64 bytes_allowed, extra_bytes, tmp; > unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; > > @@ -517,14 +515,14 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg, > do_div(tmp, HZ); > bytes_allowed = tmp; > > - if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) { > + if (tg->bytes_disp[rw] + size <= bytes_allowed) { > if (wait) > *wait = 0; > return 1; > } > > /* Calc approx time to dispatch */ > - extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed; > + extra_bytes = tg->bytes_disp[rw] + size - bytes_allowed; > jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]); > > if (!jiffy_wait) > @@ -540,24 +538,11 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg, > return 0; > } > > -/* > - * Returns whether one can dispatch a bio or not. 
Also returns approx number > - * of jiffies to wait before this bio is with-in IO rate and can be dispatched > - */ > static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg, > - struct bio *bio, unsigned long *wait) > + bool rw, size_t size, unsigned long *wait) > { > - bool rw = bio_data_dir(bio); > unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0; > > - /* > - * Currently whole state machine of group depends on first bio > - * queued in the group bio list. So one should not be calling > - * this function with a different bio if there are other bios > - * queued. > - */ > - BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw])); > - > /* If tg->bps = -1, then BW is unlimited */ > if (tg->bps[rw] == -1 && tg->iops[rw] == -1) { > if (wait) > @@ -577,8 +562,8 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg, > throtl_extend_slice(td, tg, rw, jiffies + throtl_slice); > } > > - if (tg_with_in_bps_limit(td, tg, bio, &bps_wait) > - && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) { > + if (tg_with_in_bps_limit(td, tg, rw, size, &bps_wait) && > + tg_with_in_iops_limit(td, tg, rw, size, &iops_wait)) { > if (wait) > *wait = 0; > return 1; > @@ -595,20 +580,37 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg, > return 0; > } > > -static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio) > +/* > + * Returns whether one can dispatch a bio or not. Also returns approx number > + * of jiffies to wait before this bio is with-in IO rate and can be dispatched > + */ > +static bool tg_may_dispatch_bio(struct throtl_data *td, struct throtl_grp *tg, > + struct bio *bio, unsigned long *wait) > { > bool rw = bio_data_dir(bio); > - bool sync = bio->bi_rw & REQ_SYNC; > > + /* > + * Currently whole state machine of group depends on first bio queued > + * in the group bio list. So one should not be calling this function > + * with a different bio if there are other bios queued. > + */ > + BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw])); > + > + return tg_may_dispatch(td, tg, rw, bio->bi_size, wait); > +} > + > +static void > +throtl_charge_io(struct throtl_grp *tg, bool rw, bool sync, size_t size) > +{ > /* Charge the bio to the group */ > - tg->bytes_disp[rw] += bio->bi_size; > + tg->bytes_disp[rw] += size; > tg->io_disp[rw]++; > > /* > * TODO: This will take blkg->stats_lock. Figure out a way > * to avoid this cost. 
> */ > - blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync); > + blkiocg_update_dispatch_stats(&tg->blkg, size, rw, sync); > } > > static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg, > @@ -630,10 +632,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg) > struct bio *bio; > > if ((bio = bio_list_peek(&tg->bio_lists[READ]))) > - tg_may_dispatch(td, tg, bio, &read_wait); > + tg_may_dispatch_bio(td, tg, bio, &read_wait); > > if ((bio = bio_list_peek(&tg->bio_lists[WRITE]))) > - tg_may_dispatch(td, tg, bio, &write_wait); > + tg_may_dispatch_bio(td, tg, bio, &write_wait); > > min_wait = min(read_wait, write_wait); > disptime = jiffies + min_wait; > @@ -657,7 +659,8 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg, > BUG_ON(td->nr_queued[rw] <= 0); > td->nr_queued[rw]--; > > - throtl_charge_bio(tg, bio); > + throtl_charge_io(tg, bio_data_dir(bio), > + bio->bi_rw & REQ_SYNC, bio->bi_size); > bio_list_add(bl, bio); > bio->bi_rw |= REQ_THROTTLED; > > @@ -674,8 +677,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg, > > /* Try to dispatch 75% READS and 25% WRITES */ > > - while ((bio = bio_list_peek(&tg->bio_lists[READ])) > - && tg_may_dispatch(td, tg, bio, NULL)) { > + while ((bio = bio_list_peek(&tg->bio_lists[READ])) && > + tg_may_dispatch_bio(td, tg, bio, NULL)) { > > tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl); > nr_reads++; > @@ -684,8 +687,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg, > break; > } > > - while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) > - && tg_may_dispatch(td, tg, bio, NULL)) { > + while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) && > + tg_may_dispatch_bio(td, tg, bio, NULL)) { > > tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl); > nr_writes++; > @@ -998,6 +1001,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop) > bio->bi_rw &= ~REQ_THROTTLED; > return 0; > } > + /* Async writes are ratelimited in blk_throtl_async() */ > + if (rw == WRITE && !(bio->bi_rw & REQ_DIRECT)) > + return 0; > > spin_lock_irq(q->queue_lock); > tg = throtl_get_tg(td); > @@ -1018,8 +1024,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop) > } > > /* Bio is with-in rate limit of group */ > - if (tg_may_dispatch(td, tg, bio, NULL)) { > - throtl_charge_bio(tg, bio); > + if (tg_may_dispatch_bio(td, tg, bio, NULL)) { > + throtl_charge_io(tg, bio_data_dir(bio), > + bio->bi_rw & REQ_SYNC, bio->bi_size); > goto out; > } > > @@ -1044,6 +1051,35 @@ out: > return 0; > } > > +/* > + * Enforce throttling on async i/o writes > + */ > +int blk_throtl_async(struct request_queue *q, size_t size) > +{ > + struct throtl_data *td = q->td; > + struct throtl_grp *tg; > + bool rw = 1; > + unsigned long wait = 0; > + > + spin_lock_irq(q->queue_lock); > + tg = throtl_get_tg(td); > + if (tg_may_dispatch(td, tg, rw, size, &wait)) > + throtl_charge_io(tg, rw, false, size); > + else > + throtl_log_tg(td, tg, "[%c] async. bdisp=%u sz=%u bps=%llu" > + " iodisp=%u iops=%u queued=%d/%d", > + rw == READ ? 'R' : 'W', > + tg->bytes_disp[rw], size, tg->bps[rw], > + tg->io_disp[rw], tg->iops[rw], > + tg->nr_queued[READ], tg->nr_queued[WRITE]); > + spin_unlock_irq(q->queue_lock); > + > + if (wait >= throtl_slice) > + schedule_timeout_killable(wait); > + > + return 0; > +} I think above is not correct. 
We have this notion of stretching/renewing the throtl_slice if a bio is big
and cannot be dispatched in one slice, and once a bio has been dispatched we
trim the slice.

So first of all, the above code does not take care of other IO going on in
the same group. We might be extending a slice for a bigger queued bio, say
1MB, and suddenly a write happens saying "oh, I can dispatch", starving the
dispatch of that bio.

Similarly, if there is more than one task in a group doing writes, they will
all wait for a certain time and after that time start dispatching. I think we
might have to have a wait queue per group, put async tasks there, start the
time slice on behalf of the task at the front of the queue, and wake it up
once that task meets its quota.

By keeping track of which bio the current slice is operating for, we reduce
the number of wakeups for the throtl work. The work is woken up only when a
bio is ready to be dispatched. So if a bio is queued, that means a slice of
that direction is already on, and any new IO in that group is simply queued
instead of trying to check again whether we can dispatch it or not.

Thanks
Vivek

> +
>  int blk_throtl_init(struct request_queue *q)
>  {
>  	struct throtl_data *td;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 4d18ff3..01c8241 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1136,6 +1136,7 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
>  extern int blk_throtl_init(struct request_queue *q);
>  extern void blk_throtl_exit(struct request_queue *q);
>  extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
> +extern int blk_throtl_async(struct request_queue *q, size_t size);
>  extern void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay);
>  extern void throtl_shutdown_timer_wq(struct request_queue *q);
>  #else /* CONFIG_BLK_DEV_THROTTLING */
> @@ -1144,6 +1145,11 @@ static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
>  	return 0;
>  }
>
> +static inline int blk_throtl_async(struct request_queue *q, size_t size)
> +{
> +	return 0;
> +}
> +
>  static inline int blk_throtl_init(struct request_queue *q) { return 0; }
>  static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
>  static inline void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay) {}
> --
> 1.7.1
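To make the per-group FIFO suggestion above a bit more concrete, here is a
rough user-space simulation: queued async writes are granted dispatch times
strictly in arrival order out of the group's byte budget, so a later small
write can no longer jump ahead of a large IO already at the head of the
queue. This only illustrates the scheduling order; none of the structures
below correspond to the kernel implementation.

/*
 * Rough simulation of the per-group FIFO idea suggested above: queued async
 * writes are granted dispatch times strictly in arrival order, based on the
 * group's bytes-per-second budget, so a small write arriving later cannot
 * starve a large one already at the head of the queue. Purely illustrative.
 */
#include <stdio.h>

struct async_req {
        const char *task;
        double arrival;         /* seconds */
        double size;            /* bytes */
};

int main(void)
{
        const double bps = 4.0 * 1024 * 1024;   /* group write limit */
        struct async_req q[] = {                /* FIFO arrival order */
                { "task-A", 0.0, 1024 * 1024 },
                { "task-B", 0.1, 64 * 1024 },   /* small, but must wait */
                { "task-A", 0.2, 1024 * 1024 },
        };
        double next_free = 0.0;                 /* when the budget frees up */
        unsigned int i;

        for (i = 0; i < sizeof(q) / sizeof(q[0]); i++) {
                /* Head of the queue owns the slice until its quota is met. */
                double start = q[i].arrival > next_free ? q[i].arrival
                                                        : next_free;
                double done = start + q[i].size / bps;

                printf("%s: %8.0f bytes dispatched at t=%.3fs\n",
                       q[i].task, q[i].size, done);
                next_free = done;
        }
        return 0;
}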
* Re: [PATCH 2/3] blkio-throttle: infrastructure to throttle async io 2011-03-01 16:06 ` Vivek Goyal @ 2011-03-02 13:31 ` Andrea Righi 0 siblings, 0 replies; 19+ messages in thread From: Andrea Righi @ 2011-03-02 13:31 UTC (permalink / raw) To: Vivek Goyal Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Tue, Mar 01, 2011 at 11:06:27AM -0500, Vivek Goyal wrote: > On Mon, Feb 28, 2011 at 11:15:04AM +0100, Andrea Righi wrote: > > Add blk_throtl_async() to throttle async io writes. > > > > The idea is to stop applications before they can generate too much dirty > > pages in the page cache. Each time a chunk of pages is wrote in memory > > we charge the current task / cgroup for the size of the pages dirtied. > > > > If blkio limit are exceeded we also enforce throttling at this layer, > > instead of throttling writes at the block IO layer, where it's not > > possible to associate the pages with the process that dirtied them. > > > > In this case throttling can be simply enforced via a > > schedule_timeout_killable(). > > > > Also change some internal blkio function to make them more generic, and > > allow to get the size and type of the IO operation without extracting > > these information directly from a struct bio. In this way the same > > functions can be used also in the contextes where a struct bio is not > > yet created (e.g., writes in the page cache). > > > > Signed-off-by: Andrea Righi <arighi@develer.com> > > --- > > block/blk-throttle.c | 106 ++++++++++++++++++++++++++++++++---------------- > > include/linux/blkdev.h | 6 +++ > > 2 files changed, 77 insertions(+), 35 deletions(-) > > > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c > > index a89043a..07ef429 100644 > > --- a/block/blk-throttle.c > > +++ b/block/blk-throttle.c > > @@ -449,9 +449,8 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw) > > } > > > > static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg, > > - struct bio *bio, unsigned long *wait) > > + bool rw, size_t size, unsigned long *wait) > > { > > - bool rw = bio_data_dir(bio); > > unsigned int io_allowed; > > unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; > > u64 tmp; > > @@ -499,9 +498,8 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg, > > } > > > > static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg, > > - struct bio *bio, unsigned long *wait) > > + bool rw, size_t size, unsigned long *wait) > > { > > - bool rw = bio_data_dir(bio); > > u64 bytes_allowed, extra_bytes, tmp; > > unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; > > > > @@ -517,14 +515,14 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg, > > do_div(tmp, HZ); > > bytes_allowed = tmp; > > > > - if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) { > > + if (tg->bytes_disp[rw] + size <= bytes_allowed) { > > if (wait) > > *wait = 0; > > return 1; > > } > > > > /* Calc approx time to dispatch */ > > - extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed; > > + extra_bytes = tg->bytes_disp[rw] + size - bytes_allowed; > > jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]); > > > > if (!jiffy_wait) > > @@ -540,24 +538,11 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg, > > return 0; > > } > > > > -/* > > - * Returns 
whether one can dispatch a bio or not. Also returns approx number > > - * of jiffies to wait before this bio is with-in IO rate and can be dispatched > > - */ > > static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg, > > - struct bio *bio, unsigned long *wait) > > + bool rw, size_t size, unsigned long *wait) > > { > > - bool rw = bio_data_dir(bio); > > unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0; > > > > - /* > > - * Currently whole state machine of group depends on first bio > > - * queued in the group bio list. So one should not be calling > > - * this function with a different bio if there are other bios > > - * queued. > > - */ > > - BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw])); > > - > > /* If tg->bps = -1, then BW is unlimited */ > > if (tg->bps[rw] == -1 && tg->iops[rw] == -1) { > > if (wait) > > @@ -577,8 +562,8 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg, > > throtl_extend_slice(td, tg, rw, jiffies + throtl_slice); > > } > > > > - if (tg_with_in_bps_limit(td, tg, bio, &bps_wait) > > - && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) { > > + if (tg_with_in_bps_limit(td, tg, rw, size, &bps_wait) && > > + tg_with_in_iops_limit(td, tg, rw, size, &iops_wait)) { > > if (wait) > > *wait = 0; > > return 1; > > @@ -595,20 +580,37 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg, > > return 0; > > } > > > > -static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio) > > +/* > > + * Returns whether one can dispatch a bio or not. Also returns approx number > > + * of jiffies to wait before this bio is with-in IO rate and can be dispatched > > + */ > > +static bool tg_may_dispatch_bio(struct throtl_data *td, struct throtl_grp *tg, > > + struct bio *bio, unsigned long *wait) > > { > > bool rw = bio_data_dir(bio); > > - bool sync = bio->bi_rw & REQ_SYNC; > > > > + /* > > + * Currently whole state machine of group depends on first bio queued > > + * in the group bio list. So one should not be calling this function > > + * with a different bio if there are other bios queued. > > + */ > > + BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw])); > > + > > + return tg_may_dispatch(td, tg, rw, bio->bi_size, wait); > > +} > > + > > +static void > > +throtl_charge_io(struct throtl_grp *tg, bool rw, bool sync, size_t size) > > +{ > > /* Charge the bio to the group */ > > - tg->bytes_disp[rw] += bio->bi_size; > > + tg->bytes_disp[rw] += size; > > tg->io_disp[rw]++; > > > > /* > > * TODO: This will take blkg->stats_lock. Figure out a way > > * to avoid this cost. 
> > */ > > - blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync); > > + blkiocg_update_dispatch_stats(&tg->blkg, size, rw, sync); > > } > > > > static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg, > > @@ -630,10 +632,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg) > > struct bio *bio; > > > > if ((bio = bio_list_peek(&tg->bio_lists[READ]))) > > - tg_may_dispatch(td, tg, bio, &read_wait); > > + tg_may_dispatch_bio(td, tg, bio, &read_wait); > > > > if ((bio = bio_list_peek(&tg->bio_lists[WRITE]))) > > - tg_may_dispatch(td, tg, bio, &write_wait); > > + tg_may_dispatch_bio(td, tg, bio, &write_wait); > > > > min_wait = min(read_wait, write_wait); > > disptime = jiffies + min_wait; > > @@ -657,7 +659,8 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg, > > BUG_ON(td->nr_queued[rw] <= 0); > > td->nr_queued[rw]--; > > > > - throtl_charge_bio(tg, bio); > > + throtl_charge_io(tg, bio_data_dir(bio), > > + bio->bi_rw & REQ_SYNC, bio->bi_size); > > bio_list_add(bl, bio); > > bio->bi_rw |= REQ_THROTTLED; > > > > @@ -674,8 +677,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg, > > > > /* Try to dispatch 75% READS and 25% WRITES */ > > > > - while ((bio = bio_list_peek(&tg->bio_lists[READ])) > > - && tg_may_dispatch(td, tg, bio, NULL)) { > > + while ((bio = bio_list_peek(&tg->bio_lists[READ])) && > > + tg_may_dispatch_bio(td, tg, bio, NULL)) { > > > > tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl); > > nr_reads++; > > @@ -684,8 +687,8 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg, > > break; > > } > > > > - while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) > > - && tg_may_dispatch(td, tg, bio, NULL)) { > > + while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) && > > + tg_may_dispatch_bio(td, tg, bio, NULL)) { > > > > tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl); > > nr_writes++; > > @@ -998,6 +1001,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop) > > bio->bi_rw &= ~REQ_THROTTLED; > > return 0; > > } > > + /* Async writes are ratelimited in blk_throtl_async() */ > > + if (rw == WRITE && !(bio->bi_rw & REQ_DIRECT)) > > + return 0; > > > > spin_lock_irq(q->queue_lock); > > tg = throtl_get_tg(td); > > @@ -1018,8 +1024,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop) > > } > > > > /* Bio is with-in rate limit of group */ > > - if (tg_may_dispatch(td, tg, bio, NULL)) { > > - throtl_charge_bio(tg, bio); > > + if (tg_may_dispatch_bio(td, tg, bio, NULL)) { > > + throtl_charge_io(tg, bio_data_dir(bio), > > + bio->bi_rw & REQ_SYNC, bio->bi_size); > > goto out; > > } > > > > @@ -1044,6 +1051,35 @@ out: > > return 0; > > } > > > > +/* > > + * Enforce throttling on async i/o writes > > + */ > > +int blk_throtl_async(struct request_queue *q, size_t size) > > +{ > > + struct throtl_data *td = q->td; > > + struct throtl_grp *tg; > > + bool rw = 1; > > + unsigned long wait = 0; > > + > > + spin_lock_irq(q->queue_lock); > > + tg = throtl_get_tg(td); > > + if (tg_may_dispatch(td, tg, rw, size, &wait)) > > + throtl_charge_io(tg, rw, false, size); > > + else > > + throtl_log_tg(td, tg, "[%c] async. bdisp=%u sz=%u bps=%llu" > > + " iodisp=%u iops=%u queued=%d/%d", > > + rw == READ ? 
'R' : 'W', > > + tg->bytes_disp[rw], size, tg->bps[rw], > > + tg->io_disp[rw], tg->iops[rw], > > + tg->nr_queued[READ], tg->nr_queued[WRITE]); > > + spin_unlock_irq(q->queue_lock); > > + > > + if (wait >= throtl_slice) > > + schedule_timeout_killable(wait); > > + > > + return 0; > > +} > > I think above is not correct. > > We have this notion of stretching/renewing the throtl_slice if bio > is big and can not be dispatched in one slice and one bio has been > dispatched, we trim the slice. > > So first of all, above code does not take care of other IO going on > in same group. We might be extending a slice for a bigger queued bio > say 1MB, and suddenly a write happens saying oh, i can dispatch > and startving dispatch of that bio. > > Similiarly, if there are more than 1 tasks in a group doing write, they > all will wait for certain time and after that time start dispatching. > I think we might have to have a wait queue per group and put async > tasks there and start the time slice on behalf of the task at the > front of the queue and wake it up once that task meets its quota. > > Keeping track of for which bio the current slice is operating, we > reduce the number of wakeups for throtl work. A work is woken up > only when bio is ready to be dispatched. So if a bio is queued > that means a slice of that direction is already on, any new IO > in that group is simply queued instead of trying to check again > whether we can dispatch it or not. OK, I understand. I'll try to apply this logic in the next version of the patch set. Thanks for the clarification and the review. -Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/3] blkio-throttle: async write io instrumentation
  2011-02-28 10:15 [PATCH 0/3] blk-throttle: async write throttling Andrea Righi
  2011-02-28 10:15 ` [PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio Andrea Righi
  2011-02-28 10:15 ` [PATCH 2/3] blkio-throttle: infrastructure to throttle async io Andrea Righi
@ 2011-02-28 10:15 ` Andrea Righi
  2011-02-28 23:01 ` [PATCH 0/3] blk-throttle: async write throttling Vivek Goyal
  2011-03-01 16:27 ` Vivek Goyal
  4 siblings, 0 replies; 19+ messages in thread
From: Andrea Righi @ 2011-02-28 10:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen,
      Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
      Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm,
      linux-kernel, Andrea Righi

Enforce IO throttling policy to asynchronous IO writes at the time tasks
write pages in the page cache.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 mm/page-writeback.c |   17 +++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2cb01f6..e3f5f4f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -607,6 +607,19 @@ void set_page_dirty_balance(struct page *page, int page_mkwrite)
 	}
 }

+/*
+ * Get a request_queue of the underlying superblock device from an
+ * address_space.
+ */
+static struct request_queue *as_to_rq(struct address_space *mapping)
+{
+	struct block_device *bdev;
+
+	bdev = (mapping->host && mapping->host->i_sb->s_bdev) ?
+			mapping->host->i_sb->s_bdev : NULL;
+	return bdev ? bdev_get_queue(bdev) : NULL;
+}
+
 static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;

 /**
@@ -628,6 +641,7 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 {
 	unsigned long ratelimit;
 	unsigned long *p;
+	struct request_queue *q;

 	ratelimit = ratelimit_pages;
 	if (mapping->backing_dev_info->dirty_exceeded)
@@ -644,6 +658,9 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 		ratelimit = sync_writeback_pages(*p);
 		*p = 0;
 		preempt_enable();
+		q = as_to_rq(mapping);
+		if (q)
+			blk_throtl_async(q, ratelimit << PAGE_SHIFT);
 		balance_dirty_pages(mapping, ratelimit);
 		return;
 	}
--
1.7.1
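One property of this hook worth keeping in mind: balance_dirty_pages_ratelimited_nr()
is only entered every ratelimit_pages dirtied pages per CPU, so blk_throtl_async()
is charged in chunks of ratelimit << PAGE_SHIFT bytes rather than per page. A
tiny stand-alone calculation of that granularity follows; the ratelimit value
used here is only an assumed example (the real value comes from
ratelimit_pages in mm/page-writeback.c).

/*
 * The hook above charges blk_throtl_async() in chunks of
 * "ratelimit << PAGE_SHIFT" bytes, i.e. once every ratelimit_pages dirtied
 * pages per CPU, not per page. Quick calculation of that charge granularity;
 * the ratelimit value is only an assumed example.
 */
#include <stdio.h>

#define PAGE_SHIFT      12              /* 4 KiB pages */

int main(void)
{
        unsigned long ratelimit = 32;   /* assumed ratelimit_pages value */
        unsigned long chunk = ratelimit << PAGE_SHIFT;

        printf("charge granularity: %lu pages = %lu KiB per call\n",
               ratelimit, chunk / 1024);
        printf("calls needed for a 128 MiB dd: %lu\n",
               (128UL << 20) / chunk);
        return 0;
}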
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-02-28 10:15 [PATCH 0/3] blk-throttle: async write throttling Andrea Righi ` (2 preceding siblings ...) 2011-02-28 10:15 ` [PATCH 3/3] blkio-throttle: async write io instrumentation Andrea Righi @ 2011-02-28 23:01 ` Vivek Goyal 2011-03-02 13:28 ` Andrea Righi 2011-03-01 16:27 ` Vivek Goyal 4 siblings, 1 reply; 19+ messages in thread From: Vivek Goyal @ 2011-02-28 23:01 UTC (permalink / raw) To: Andrea Righi Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: > Overview > ======== > Currently the blkio.throttle controller only support synchronous IO requests. > This means that we always look at the current task to identify the "owner" of > each IO request. > > However dirty pages in the page cache can be wrote to disk asynchronously by > the per-bdi flusher kernel threads or by any other thread in the system, > according to the writeback policy. > > For this reason the real writes to the underlying block devices may > occur in a different IO context respect to the task that originally > generated the dirty pages involved in the IO operation. This makes the > tracking and throttling of writeback IO more complicate respect to the > synchronous IO from the blkio controller's perspective. > > Proposed solution > ================= > In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve > the problem of the buffered writes limitation by tracking the ownership of all > the dirty pages in the system. > > This would allow to always identify the owner of each IO operation at the block > layer and apply the appropriate throttling policy implemented by the > blkio.throttle controller. > > This solution makes the blkio.throttle controller to work as expected also for > writeback IO, but it does not resolve the problem of faster cgroups getting > blocked by slower cgroups (that would expose a potential way to create DoS in > the system). > > In fact, at the moment critical IO requests (that have dependency with other IO > requests made by other cgroups) and non-critical requests are mixed together at > the filesystem layer in a way that throttling a single write request may stop > also other requests in the system, and at the block layer it's not possible to > retrieve such informations to make the right decision. > > A simple solution to this problem could be to just limit the rate of async > writes at the time a task is generating dirty pages in the page cache. The > big advantage of this approach is that it does not need the overhead of > tracking the ownership of the dirty pages, because in this way from the blkio > controller perspective all the IO operations will happen from the process > context: writes in memory and synchronous reads from the block device. > > The drawback of this approach is that the blkio.throttle controller becomes a > little bit leaky, because with this solution the controller is still affected > by the IO spikes during the writeback of dirty pages executed by the kernel > threads. > > Probably an even better approach would be to introduce the tracking of the > dirty page ownership to properly account the cost of each IO operation at the > block layer and apply the throttling of async writes in memory only when IO > limits are exceeded. 
Andrea, I am curious to know more about it third option. Can you give more details about accouting in block layer but throttling in memory. So say a process starts IO, then it will still be in throttle limits at block layer (because no writeback has started), then the process will write bunch of pages in cache. By the time throttle limits are crossed at block layer, we already have lots of dirty data in page cache and throttling process now is already late? > > To summarize, we can identify three possible solutions to properly throttle the > buffered writes: > > 1) account & throttle everything at block IO layer (bad for "priority > inversion" problems, needs page tracking for blkio) > > 2) account at block IO layer and throttle in memory (needs page tracking for > blkio) > > 3) account & throttle in memory (affected by IO spikes, depending on > dirty_ratio / dirty_background_ratio settings) > > For now we start with the solution 3) that seems to be the simplest way to > proceed. Yes, IO spikes is the weakness of this 3rd solution. But it should be simple too. Also as you said problem can be reduced up to some extent by changing reducing dirty_ratio and background dirty ratio But that will have other trade offs, I guess. Thanks Vivek > > Testcase > ======== > - create a cgroup with 4MiB/s write limit: > # mount -t cgroup -o blkio none /mnt/cgroup > # mkdir /mnt/cgroup/foo > # echo 8:0 $((4 * 1024 * 1024)) > /mnt/cgroup/foo/blkio.throttle.write_bps_device > > NOTE: async io is still limited per-device, as well as sync io > > - move a task into the cgroup and run a dd to generate some writeback IO > > Results: > > - 2.6.38-rc6 vanilla: > > $ cat /proc/$$/cgroup > 1:blkio:/foo > $ dd if=/dev/zero of=zero bs=1M count=128 & > $ dstat -df > --dsk/sda-- > read writ > 0 0 > ... > 0 0 > 0 22M <--- writeback starts here and is not limited at all > 0 43M > 0 45M > 0 18M > ... > > - 2.6.38-rc6 + async write throttling: > > $ cat /proc/$$/cgroup > 1:blkio:/foo > $ dd if=/dev/zero of=zero bs=1M count=128 & > $ dstat -df > --dsk/sda-- > read writ > 0 0 > 0 0 > 0 0 > 0 0 > 0 22M <--- we have some IO spikes but the overall writeback IO > 0 0 is controlled according to the blkio write limit > 0 0 > 0 0 > 0 0 > 0 29M > 0 0 > 0 0 > 0 0 > 0 0 > 0 26M > 0 0 > 0 0 > 0 0 > 0 0 > 0 30M > 0 0 > 0 0 > 0 0 > 0 0 > 0 20M > > TODO > ~~~~ > - Consider to add the following new files in the blkio controller to allow the > user to explicitly limit async writes as well as sync writes: > > blkio.throttle.async.write_bps_limit > blkio.throttle.async.write_iops_limit > > Any feedback is welcome. > -Andrea > > [PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio > [PATCH 2/3] blkio-throttle: infrastructure to throttle async io > [PATCH 3/3] blkio-throttle: async write io instrumentation > > block/blk-throttle.c | 106 ++++++++++++++++++++++++++++++--------------- > fs/direct-io.c | 1 + > include/linux/blk_types.h | 2 + > include/linux/blkdev.h | 6 +++ > mm/page-writeback.c | 17 +++++++ > 5 files changed, 97 insertions(+), 35 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-02-28 23:01 ` [PATCH 0/3] blk-throttle: async write throttling Vivek Goyal @ 2011-03-02 13:28 ` Andrea Righi 2011-03-02 21:47 ` Vivek Goyal 2011-03-02 22:00 ` Greg Thelen 0 siblings, 2 replies; 19+ messages in thread From: Andrea Righi @ 2011-03-02 13:28 UTC (permalink / raw) To: Vivek Goyal Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: > On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: > > Overview > > ======== > > Currently the blkio.throttle controller only support synchronous IO requests. > > This means that we always look at the current task to identify the "owner" of > > each IO request. > > > > However dirty pages in the page cache can be wrote to disk asynchronously by > > the per-bdi flusher kernel threads or by any other thread in the system, > > according to the writeback policy. > > > > For this reason the real writes to the underlying block devices may > > occur in a different IO context respect to the task that originally > > generated the dirty pages involved in the IO operation. This makes the > > tracking and throttling of writeback IO more complicate respect to the > > synchronous IO from the blkio controller's perspective. > > > > Proposed solution > > ================= > > In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve > > the problem of the buffered writes limitation by tracking the ownership of all > > the dirty pages in the system. > > > > This would allow to always identify the owner of each IO operation at the block > > layer and apply the appropriate throttling policy implemented by the > > blkio.throttle controller. > > > > This solution makes the blkio.throttle controller to work as expected also for > > writeback IO, but it does not resolve the problem of faster cgroups getting > > blocked by slower cgroups (that would expose a potential way to create DoS in > > the system). > > > > In fact, at the moment critical IO requests (that have dependency with other IO > > requests made by other cgroups) and non-critical requests are mixed together at > > the filesystem layer in a way that throttling a single write request may stop > > also other requests in the system, and at the block layer it's not possible to > > retrieve such informations to make the right decision. > > > > A simple solution to this problem could be to just limit the rate of async > > writes at the time a task is generating dirty pages in the page cache. The > > big advantage of this approach is that it does not need the overhead of > > tracking the ownership of the dirty pages, because in this way from the blkio > > controller perspective all the IO operations will happen from the process > > context: writes in memory and synchronous reads from the block device. > > > > The drawback of this approach is that the blkio.throttle controller becomes a > > little bit leaky, because with this solution the controller is still affected > > by the IO spikes during the writeback of dirty pages executed by the kernel > > threads. 
> > > > Probably an even better approach would be to introduce the tracking of the > > dirty page ownership to properly account the cost of each IO operation at the > > block layer and apply the throttling of async writes in memory only when IO > > limits are exceeded. > > Andrea, I am curious to know more about it third option. Can you give more > details about accouting in block layer but throttling in memory. So say > a process starts IO, then it will still be in throttle limits at block > layer (because no writeback has started), then the process will write > bunch of pages in cache. By the time throttle limits are crossed at > block layer, we already have lots of dirty data in page cache and > throttling process now is already late? Charging the cost of each IO operation at the block layer would allow tasks to write in memory at the maximum speed. Instead, with the 3rd approach, tasks are forced to write in memory at the rate defined by the blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). When we'll have the per-cgroup dirty memory accounting and limiting feature, with this approach each cgroup could write to its dirty memory quota at the maximum rate. BTW, another thing that we probably need is that any cgroup should be forced to write their own inodes when the limit defined by dirty_ratio is exceeded. I mean, avoid to select inodes from the list of dirty inodes in a FIFO way, but provides a better logic to "assign" the ownership of each inode to a cgroup (maybe that one that had generated most of the dirty pages in the inodes) and use for example a list of dirty inodes per cgroup or something similar. In this way we should really be able to provide a good quality of service for the most part of the cases IMHO. I also plan to write down another patch set to implement this logic. > > > > > To summarize, we can identify three possible solutions to properly throttle the > > buffered writes: > > > > 1) account & throttle everything at block IO layer (bad for "priority > > inversion" problems, needs page tracking for blkio) > > > > 2) account at block IO layer and throttle in memory (needs page tracking for > > blkio) > > > > 3) account & throttle in memory (affected by IO spikes, depending on > > dirty_ratio / dirty_background_ratio settings) > > > > For now we start with the solution 3) that seems to be the simplest way to > > proceed. > > Yes, IO spikes is the weakness of this 3rd solution. But it should be > simple too. Also as you said problem can be reduced up to some extent > by changing reducing dirty_ratio and background dirty ratio But that > will have other trade offs, I guess. Agreed. -Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
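One possible shape for the per-cgroup dirty inode idea sketched above, shown
in user space: each inode keeps a count of pages dirtied per cgroup,
ownership goes to the biggest dirtier, and the owner keeps the inode on its
own dirty list so writeback can target the cgroup that crossed its dirty
limit. Every structure and name below is invented for the illustration; this
is not a proposal for the actual kernel data structures.

/*
 * User-space sketch of the "per-cgroup dirty inode list" idea discussed
 * above: each inode tracks how many pages every cgroup dirtied in it, the
 * biggest dirtier becomes the owner, and owners keep their inodes on private
 * lists so writeback can target the cgroup that exceeded its dirty limit.
 */
#include <stdio.h>

#define NR_CGROUPS      3

struct ex_inode {
        const char *name;
        unsigned long dirtied[NR_CGROUPS];      /* pages dirtied per cgroup */
};

/* Pick the cgroup that dirtied most of the inode's pages. */
static int inode_owner(const struct ex_inode *inode)
{
        int cg, owner = 0;

        for (cg = 1; cg < NR_CGROUPS; cg++)
                if (inode->dirtied[cg] > inode->dirtied[owner])
                        owner = cg;
        return owner;
}

int main(void)
{
        struct ex_inode inodes[] = {
                { "log.txt", { 120,  3,  0 } },
                { "db.img",  {   0, 80, 10 } },
                { "zero",    {   5,  0, 90 } },
        };
        unsigned int i;

        /* Build the per-cgroup dirty inode "lists" (just printed here). */
        for (i = 0; i < sizeof(inodes) / sizeof(inodes[0]); i++)
                printf("inode %-8s -> dirty list of cgroup %d\n",
                       inodes[i].name, inode_owner(&inodes[i]));
        return 0;
}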
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-02 13:28 ` Andrea Righi @ 2011-03-02 21:47 ` Vivek Goyal 2011-03-06 15:52 ` Andrea Righi 2011-03-02 22:00 ` Greg Thelen 1 sibling, 1 reply; 19+ messages in thread From: Vivek Goyal @ 2011-03-02 21:47 UTC (permalink / raw) To: Andrea Righi Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote: > On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: > > On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: > > > Overview > > > ======== > > > Currently the blkio.throttle controller only support synchronous IO requests. > > > This means that we always look at the current task to identify the "owner" of > > > each IO request. > > > > > > However dirty pages in the page cache can be wrote to disk asynchronously by > > > the per-bdi flusher kernel threads or by any other thread in the system, > > > according to the writeback policy. > > > > > > For this reason the real writes to the underlying block devices may > > > occur in a different IO context respect to the task that originally > > > generated the dirty pages involved in the IO operation. This makes the > > > tracking and throttling of writeback IO more complicate respect to the > > > synchronous IO from the blkio controller's perspective. > > > > > > Proposed solution > > > ================= > > > In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve > > > the problem of the buffered writes limitation by tracking the ownership of all > > > the dirty pages in the system. > > > > > > This would allow to always identify the owner of each IO operation at the block > > > layer and apply the appropriate throttling policy implemented by the > > > blkio.throttle controller. > > > > > > This solution makes the blkio.throttle controller to work as expected also for > > > writeback IO, but it does not resolve the problem of faster cgroups getting > > > blocked by slower cgroups (that would expose a potential way to create DoS in > > > the system). > > > > > > In fact, at the moment critical IO requests (that have dependency with other IO > > > requests made by other cgroups) and non-critical requests are mixed together at > > > the filesystem layer in a way that throttling a single write request may stop > > > also other requests in the system, and at the block layer it's not possible to > > > retrieve such informations to make the right decision. > > > > > > A simple solution to this problem could be to just limit the rate of async > > > writes at the time a task is generating dirty pages in the page cache. The > > > big advantage of this approach is that it does not need the overhead of > > > tracking the ownership of the dirty pages, because in this way from the blkio > > > controller perspective all the IO operations will happen from the process > > > context: writes in memory and synchronous reads from the block device. > > > > > > The drawback of this approach is that the blkio.throttle controller becomes a > > > little bit leaky, because with this solution the controller is still affected > > > by the IO spikes during the writeback of dirty pages executed by the kernel > > > threads. 
> > > > > > Probably an even better approach would be to introduce the tracking of the > > > dirty page ownership to properly account the cost of each IO operation at the > > > block layer and apply the throttling of async writes in memory only when IO > > > limits are exceeded. > > > > Andrea, I am curious to know more about it third option. Can you give more > > details about accouting in block layer but throttling in memory. So say > > a process starts IO, then it will still be in throttle limits at block > > layer (because no writeback has started), then the process will write > > bunch of pages in cache. By the time throttle limits are crossed at > > block layer, we already have lots of dirty data in page cache and > > throttling process now is already late? > > Charging the cost of each IO operation at the block layer would allow > tasks to write in memory at the maximum speed. Instead, with the 3rd > approach, tasks are forced to write in memory at the rate defined by the > blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). > > When we'll have the per-cgroup dirty memory accounting and limiting > feature, with this approach each cgroup could write to its dirty memory > quota at the maximum rate. Ok, so this is option 3 which you have already implemented in this patchset. I guess then I am confused with option 2. Can you elaborate a little more there. > > BTW, another thing that we probably need is that any cgroup should be > forced to write their own inodes when the limit defined by dirty_ratio > is exceeded. I mean, avoid to select inodes from the list of dirty > inodes in a FIFO way, but provides a better logic to "assign" the > ownership of each inode to a cgroup (maybe that one that had generated > most of the dirty pages in the inodes) and use for example a list of > dirty inodes per cgroup or something similar. > > In this way we should really be able to provide a good quality of > service for the most part of the cases IMHO. > > I also plan to write down another patch set to implement this logic. Yes, triggering writeout of inodes of a cgroup once it crosses dirty ratio is desirable. A patch like that can go on top of Greg's per cgroup dirty ratio patacehes. Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-02 21:47 ` Vivek Goyal @ 2011-03-06 15:52 ` Andrea Righi 2011-03-07 7:31 ` Gui Jianfeng 2011-03-07 15:22 ` Vivek Goyal 0 siblings, 2 replies; 19+ messages in thread From: Andrea Righi @ 2011-03-06 15:52 UTC (permalink / raw) To: Vivek Goyal Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote: > On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote: > > On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: > > > On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: > > > > Overview > > > > ======== > > > > Currently the blkio.throttle controller only support synchronous IO requests. > > > > This means that we always look at the current task to identify the "owner" of > > > > each IO request. > > > > > > > > However dirty pages in the page cache can be wrote to disk asynchronously by > > > > the per-bdi flusher kernel threads or by any other thread in the system, > > > > according to the writeback policy. > > > > > > > > For this reason the real writes to the underlying block devices may > > > > occur in a different IO context respect to the task that originally > > > > generated the dirty pages involved in the IO operation. This makes the > > > > tracking and throttling of writeback IO more complicate respect to the > > > > synchronous IO from the blkio controller's perspective. > > > > > > > > Proposed solution > > > > ================= > > > > In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve > > > > the problem of the buffered writes limitation by tracking the ownership of all > > > > the dirty pages in the system. > > > > > > > > This would allow to always identify the owner of each IO operation at the block > > > > layer and apply the appropriate throttling policy implemented by the > > > > blkio.throttle controller. > > > > > > > > This solution makes the blkio.throttle controller to work as expected also for > > > > writeback IO, but it does not resolve the problem of faster cgroups getting > > > > blocked by slower cgroups (that would expose a potential way to create DoS in > > > > the system). > > > > > > > > In fact, at the moment critical IO requests (that have dependency with other IO > > > > requests made by other cgroups) and non-critical requests are mixed together at > > > > the filesystem layer in a way that throttling a single write request may stop > > > > also other requests in the system, and at the block layer it's not possible to > > > > retrieve such informations to make the right decision. > > > > > > > > A simple solution to this problem could be to just limit the rate of async > > > > writes at the time a task is generating dirty pages in the page cache. The > > > > big advantage of this approach is that it does not need the overhead of > > > > tracking the ownership of the dirty pages, because in this way from the blkio > > > > controller perspective all the IO operations will happen from the process > > > > context: writes in memory and synchronous reads from the block device. 
> > > > > > > > The drawback of this approach is that the blkio.throttle controller becomes a > > > > little bit leaky, because with this solution the controller is still affected > > > > by the IO spikes during the writeback of dirty pages executed by the kernel > > > > threads. > > > > > > > > Probably an even better approach would be to introduce the tracking of the > > > > dirty page ownership to properly account the cost of each IO operation at the > > > > block layer and apply the throttling of async writes in memory only when IO > > > > limits are exceeded. > > > > > > Andrea, I am curious to know more about it third option. Can you give more > > > details about accouting in block layer but throttling in memory. So say > > > a process starts IO, then it will still be in throttle limits at block > > > layer (because no writeback has started), then the process will write > > > bunch of pages in cache. By the time throttle limits are crossed at > > > block layer, we already have lots of dirty data in page cache and > > > throttling process now is already late? > > > > Charging the cost of each IO operation at the block layer would allow > > tasks to write in memory at the maximum speed. Instead, with the 3rd > > approach, tasks are forced to write in memory at the rate defined by the > > blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). > > > > When we'll have the per-cgroup dirty memory accounting and limiting > > feature, with this approach each cgroup could write to its dirty memory > > quota at the maximum rate. > > Ok, so this is option 3 which you have already implemented in this > patchset. > > I guess then I am confused with option 2. Can you elaborate a little > more there. With option 3, we can just limit the rate at which dirty pages are generated in memory. And this can be done introducing the files blkio.throttle.async.write_bps/iops_device. At the moment in blk_throtl_bio() we charge the dispatched bytes/iops _and_ we check if the bio can be dispatched. These two distinct operations are now done by the same function. With option 2, I'm proposing to split these two operations and place throtl_charge_io() at the block layer in __generic_make_request() and an equivalent of tg_may_dispatch_bio() (maybe a better name would be blk_is_throttled()) at the page cache layer, in balance_dirty_pages_ratelimited_nr(): A prototype for blk_is_throttled() could be the following: bool blk_is_throttled(void); This means in balance_dirty_pages_ratelimited_nr() we won't charge any bytes/iops to the cgroup, but we'll just check if the limits are exceeded. And stop it in that case, so that no more dirty pages can be generated by this cgroup. Instead at the block layer WRITEs will be always dispatched in blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but the throtl_charge_io() would charge the cost of the IO operation to the right cgroup. To summarize: __generic_make_request(): blk_throtl_bio(q, &bio); balance_dirty_pages_ratelimited_nr(): if (blk_is_throttled()) // add the current task into a per-group wait queue and // wake up once this cgroup meets its quota What do you think? Thanks, -Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
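
For illustration, a minimal C sketch of the option 2 split described above. blk_is_throttled() is the prototype proposed in the message; struct throtl_grp is used loosely as a stand-in for the real (private) structure in blk-throttle.c, and every other identifier and signature here is an assumption, not code from the patch set:

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/types.h>
    #include <linux/wait.h>

    /*
     * Hypothetical sketch of the option 2 split: the block layer only
     * charges the cost of each bio to the submitter's group and never
     * blocks async WRITEs, while the writer is stopped in the page cache
     * path when the group's limits are exceeded.
     */

    /* minimal stand-in for the real (private) struct throtl_grp */
    struct throtl_grp {
    	wait_queue_head_t dirty_waitq;	/* hypothetical addition */
    	/* ... bytes/iops bookkeeping as in blk-throttle.c ... */
    };

    extern bool blk_is_throttled(void);	/* prototype proposed in the message */
    extern struct throtl_grp *throtl_get_tg_current(struct request_queue *q);	/* hypothetical */
    extern void throtl_charge_io(struct throtl_grp *tg, int rw, unsigned int bytes);	/* assumed signature */

    /* __generic_make_request() path: always dispatch, only charge */
    static void blk_throtl_charge_only(struct request_queue *q, struct bio *bio)
    {
    	struct throtl_grp *tg = throtl_get_tg_current(q);

    	throtl_charge_io(tg, bio_data_dir(bio), bio->bi_size);
    }

    /* balance_dirty_pages_ratelimited_nr() path: stop the dirtier if needed */
    static void blk_throtl_dirtier(struct throtl_grp *tg)
    {
    	/*
    	 * Park the task on a per-group wait queue until the group falls
    	 * back under its configured rate (woken by a timer or by the
    	 * charging side in a real implementation).
    	 */
    	if (blk_is_throttled())
    		wait_event(tg->dirty_waitq, !blk_is_throttled());
    }

The key property of this split is that WRITE bios never sleep at the block layer, so writeback done by the flusher threads cannot be blocked behind a slow cgroup; only the task that is dirtying pages sleeps.
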
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-06 15:52 ` Andrea Righi @ 2011-03-07 7:31 ` Gui Jianfeng 2011-03-07 11:34 ` Andrea Righi 2011-03-07 15:22 ` Vivek Goyal 1 sibling, 1 reply; 19+ messages in thread From: Gui Jianfeng @ 2011-03-07 7:31 UTC (permalink / raw) To: Andrea Righi Cc: Vivek Goyal, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel Andrea Righi wrote: > On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote: >> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote: >>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: >>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: >>>>> Overview >>>>> ======== >>>>> Currently the blkio.throttle controller only support synchronous IO requests. >>>>> This means that we always look at the current task to identify the "owner" of >>>>> each IO request. >>>>> >>>>> However dirty pages in the page cache can be wrote to disk asynchronously by >>>>> the per-bdi flusher kernel threads or by any other thread in the system, >>>>> according to the writeback policy. >>>>> >>>>> For this reason the real writes to the underlying block devices may >>>>> occur in a different IO context respect to the task that originally >>>>> generated the dirty pages involved in the IO operation. This makes the >>>>> tracking and throttling of writeback IO more complicate respect to the >>>>> synchronous IO from the blkio controller's perspective. >>>>> >>>>> Proposed solution >>>>> ================= >>>>> In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve >>>>> the problem of the buffered writes limitation by tracking the ownership of all >>>>> the dirty pages in the system. >>>>> >>>>> This would allow to always identify the owner of each IO operation at the block >>>>> layer and apply the appropriate throttling policy implemented by the >>>>> blkio.throttle controller. >>>>> >>>>> This solution makes the blkio.throttle controller to work as expected also for >>>>> writeback IO, but it does not resolve the problem of faster cgroups getting >>>>> blocked by slower cgroups (that would expose a potential way to create DoS in >>>>> the system). >>>>> >>>>> In fact, at the moment critical IO requests (that have dependency with other IO >>>>> requests made by other cgroups) and non-critical requests are mixed together at >>>>> the filesystem layer in a way that throttling a single write request may stop >>>>> also other requests in the system, and at the block layer it's not possible to >>>>> retrieve such informations to make the right decision. >>>>> >>>>> A simple solution to this problem could be to just limit the rate of async >>>>> writes at the time a task is generating dirty pages in the page cache. The >>>>> big advantage of this approach is that it does not need the overhead of >>>>> tracking the ownership of the dirty pages, because in this way from the blkio >>>>> controller perspective all the IO operations will happen from the process >>>>> context: writes in memory and synchronous reads from the block device. >>>>> >>>>> The drawback of this approach is that the blkio.throttle controller becomes a >>>>> little bit leaky, because with this solution the controller is still affected >>>>> by the IO spikes during the writeback of dirty pages executed by the kernel >>>>> threads. 
>>>>> >>>>> Probably an even better approach would be to introduce the tracking of the >>>>> dirty page ownership to properly account the cost of each IO operation at the >>>>> block layer and apply the throttling of async writes in memory only when IO >>>>> limits are exceeded. >>>> Andrea, I am curious to know more about it third option. Can you give more >>>> details about accouting in block layer but throttling in memory. So say >>>> a process starts IO, then it will still be in throttle limits at block >>>> layer (because no writeback has started), then the process will write >>>> bunch of pages in cache. By the time throttle limits are crossed at >>>> block layer, we already have lots of dirty data in page cache and >>>> throttling process now is already late? >>> Charging the cost of each IO operation at the block layer would allow >>> tasks to write in memory at the maximum speed. Instead, with the 3rd >>> approach, tasks are forced to write in memory at the rate defined by the >>> blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). >>> >>> When we'll have the per-cgroup dirty memory accounting and limiting >>> feature, with this approach each cgroup could write to its dirty memory >>> quota at the maximum rate. >> Ok, so this is option 3 which you have already implemented in this >> patchset. >> >> I guess then I am confused with option 2. Can you elaborate a little >> more there. > > With option 3, we can just limit the rate at which dirty pages are > generated in memory. And this can be done introducing the files > blkio.throttle.async.write_bps/iops_device. > > At the moment in blk_throtl_bio() we charge the dispatched bytes/iops > _and_ we check if the bio can be dispatched. These two distinct > operations are now done by the same function. > > With option 2, I'm proposing to split these two operations and place > throtl_charge_io() at the block layer in __generic_make_request() and an > equivalent of tg_may_dispatch_bio() (maybe a better name would be > blk_is_throttled()) at the page cache layer, in > balance_dirty_pages_ratelimited_nr(): > > A prototype for blk_is_throttled() could be the following: > > bool blk_is_throttled(void); > > This means in balance_dirty_pages_ratelimited_nr() we won't charge any > bytes/iops to the cgroup, but we'll just check if the limits are > exceeded. And stop it in that case, so that no more dirty pages can be > generated by this cgroup. > > Instead at the block layer WRITEs will be always dispatched in > blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but > the throtl_charge_io() would charge the cost of the IO operation to the > right cgroup. > > To summarize: > > __generic_make_request(): > blk_throtl_bio(q, &bio); > > balance_dirty_pages_ratelimited_nr(): > if (blk_is_throttled()) > // add the current task into a per-group wait queue and > // wake up once this cgroup meets its quota > > What do you think? Hi Andrea, This means when you throttle writes, the reads issued by this task are also throttled? Thanks, Gui > > Thanks, > -Andrea > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-07 7:31 ` Gui Jianfeng @ 2011-03-07 11:34 ` Andrea Righi 2011-03-07 11:44 ` Gui Jianfeng 0 siblings, 1 reply; 19+ messages in thread From: Andrea Righi @ 2011-03-07 11:34 UTC (permalink / raw) To: Gui Jianfeng Cc: Vivek Goyal, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Mon, Mar 07, 2011 at 03:31:11PM +0800, Gui Jianfeng wrote: > Andrea Righi wrote: > > On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote: > >> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote: > >>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: > >>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: > >>>>> Overview > >>>>> ======== > >>>>> Currently the blkio.throttle controller only support synchronous IO requests. > >>>>> This means that we always look at the current task to identify the "owner" of > >>>>> each IO request. > >>>>> > >>>>> However dirty pages in the page cache can be wrote to disk asynchronously by > >>>>> the per-bdi flusher kernel threads or by any other thread in the system, > >>>>> according to the writeback policy. > >>>>> > >>>>> For this reason the real writes to the underlying block devices may > >>>>> occur in a different IO context respect to the task that originally > >>>>> generated the dirty pages involved in the IO operation. This makes the > >>>>> tracking and throttling of writeback IO more complicate respect to the > >>>>> synchronous IO from the blkio controller's perspective. > >>>>> > >>>>> Proposed solution > >>>>> ================= > >>>>> In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve > >>>>> the problem of the buffered writes limitation by tracking the ownership of all > >>>>> the dirty pages in the system. > >>>>> > >>>>> This would allow to always identify the owner of each IO operation at the block > >>>>> layer and apply the appropriate throttling policy implemented by the > >>>>> blkio.throttle controller. > >>>>> > >>>>> This solution makes the blkio.throttle controller to work as expected also for > >>>>> writeback IO, but it does not resolve the problem of faster cgroups getting > >>>>> blocked by slower cgroups (that would expose a potential way to create DoS in > >>>>> the system). > >>>>> > >>>>> In fact, at the moment critical IO requests (that have dependency with other IO > >>>>> requests made by other cgroups) and non-critical requests are mixed together at > >>>>> the filesystem layer in a way that throttling a single write request may stop > >>>>> also other requests in the system, and at the block layer it's not possible to > >>>>> retrieve such informations to make the right decision. > >>>>> > >>>>> A simple solution to this problem could be to just limit the rate of async > >>>>> writes at the time a task is generating dirty pages in the page cache. The > >>>>> big advantage of this approach is that it does not need the overhead of > >>>>> tracking the ownership of the dirty pages, because in this way from the blkio > >>>>> controller perspective all the IO operations will happen from the process > >>>>> context: writes in memory and synchronous reads from the block device. 
> >>>>> > >>>>> The drawback of this approach is that the blkio.throttle controller becomes a > >>>>> little bit leaky, because with this solution the controller is still affected > >>>>> by the IO spikes during the writeback of dirty pages executed by the kernel > >>>>> threads. > >>>>> > >>>>> Probably an even better approach would be to introduce the tracking of the > >>>>> dirty page ownership to properly account the cost of each IO operation at the > >>>>> block layer and apply the throttling of async writes in memory only when IO > >>>>> limits are exceeded. > >>>> Andrea, I am curious to know more about it third option. Can you give more > >>>> details about accouting in block layer but throttling in memory. So say > >>>> a process starts IO, then it will still be in throttle limits at block > >>>> layer (because no writeback has started), then the process will write > >>>> bunch of pages in cache. By the time throttle limits are crossed at > >>>> block layer, we already have lots of dirty data in page cache and > >>>> throttling process now is already late? > >>> Charging the cost of each IO operation at the block layer would allow > >>> tasks to write in memory at the maximum speed. Instead, with the 3rd > >>> approach, tasks are forced to write in memory at the rate defined by the > >>> blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). > >>> > >>> When we'll have the per-cgroup dirty memory accounting and limiting > >>> feature, with this approach each cgroup could write to its dirty memory > >>> quota at the maximum rate. > >> Ok, so this is option 3 which you have already implemented in this > >> patchset. > >> > >> I guess then I am confused with option 2. Can you elaborate a little > >> more there. > > > > With option 3, we can just limit the rate at which dirty pages are > > generated in memory. And this can be done introducing the files > > blkio.throttle.async.write_bps/iops_device. > > > > At the moment in blk_throtl_bio() we charge the dispatched bytes/iops > > _and_ we check if the bio can be dispatched. These two distinct > > operations are now done by the same function. > > > > With option 2, I'm proposing to split these two operations and place > > throtl_charge_io() at the block layer in __generic_make_request() and an > > equivalent of tg_may_dispatch_bio() (maybe a better name would be > > blk_is_throttled()) at the page cache layer, in > > balance_dirty_pages_ratelimited_nr(): > > > > A prototype for blk_is_throttled() could be the following: > > > > bool blk_is_throttled(void); > > > > This means in balance_dirty_pages_ratelimited_nr() we won't charge any > > bytes/iops to the cgroup, but we'll just check if the limits are > > exceeded. And stop it in that case, so that no more dirty pages can be > > generated by this cgroup. > > > > Instead at the block layer WRITEs will be always dispatched in > > blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but > > the throtl_charge_io() would charge the cost of the IO operation to the > > right cgroup. > > > > To summarize: > > > > __generic_make_request(): > > blk_throtl_bio(q, &bio); > > > > balance_dirty_pages_ratelimited_nr(): > > if (blk_is_throttled()) > > // add the current task into a per-group wait queue and > > // wake up once this cgroup meets its quota > > > > What do you think? > > Hi Andrea, > > This means when you throttle writes, the reads issued by this task are also throttled? 
> > Thanks, > Gui Exactly, we're treating the throttling of READs and WRITEs in two different ways. READs are always throttled synchronously in the __generic_make_request() -> blk_throtl_bio() path. -Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
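
For illustration, the READ/WRITE asymmetry described here, as a hypothetical C sketch of the dispatch decision. tg_may_dispatch_bio() is the name used earlier in the thread; its signature and the read-side check are assumptions, not the actual patch:

    #include <linux/bio.h>
    #include <linux/fs.h>
    #include <linux/types.h>

    struct throtl_grp;
    extern bool tg_within_read_limits(struct throtl_grp *tg, struct bio *bio);	/* hypothetical */

    /*
     * Hypothetical sketch: READ bios are still throttled synchronously at
     * the block layer, while async WRITE bios always pass through and the
     * writer is throttled when it dirties pages in the page cache.
     */
    static bool tg_may_dispatch_bio(struct throtl_grp *tg, struct bio *bio)
    {
    	if (bio_data_dir(bio) == READ)
    		return tg_within_read_limits(tg, bio);

    	return true;	/* WRITEs are charged elsewhere and never blocked here */
    }
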
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-07 11:34 ` Andrea Righi @ 2011-03-07 11:44 ` Gui Jianfeng 2011-03-07 11:59 ` Andrea Righi 0 siblings, 1 reply; 19+ messages in thread From: Gui Jianfeng @ 2011-03-07 11:44 UTC (permalink / raw) To: Andrea Righi Cc: Vivek Goyal, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel Andrea Righi wrote: > On Mon, Mar 07, 2011 at 03:31:11PM +0800, Gui Jianfeng wrote: >> Andrea Righi wrote: >>> On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote: >>>> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote: >>>>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: >>>>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: >>>>>>> Overview >>>>>>> ======== >>>>>>> Currently the blkio.throttle controller only support synchronous IO requests. >>>>>>> This means that we always look at the current task to identify the "owner" of >>>>>>> each IO request. >>>>>>> >>>>>>> However dirty pages in the page cache can be wrote to disk asynchronously by >>>>>>> the per-bdi flusher kernel threads or by any other thread in the system, >>>>>>> according to the writeback policy. >>>>>>> >>>>>>> For this reason the real writes to the underlying block devices may >>>>>>> occur in a different IO context respect to the task that originally >>>>>>> generated the dirty pages involved in the IO operation. This makes the >>>>>>> tracking and throttling of writeback IO more complicate respect to the >>>>>>> synchronous IO from the blkio controller's perspective. >>>>>>> >>>>>>> Proposed solution >>>>>>> ================= >>>>>>> In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve >>>>>>> the problem of the buffered writes limitation by tracking the ownership of all >>>>>>> the dirty pages in the system. >>>>>>> >>>>>>> This would allow to always identify the owner of each IO operation at the block >>>>>>> layer and apply the appropriate throttling policy implemented by the >>>>>>> blkio.throttle controller. >>>>>>> >>>>>>> This solution makes the blkio.throttle controller to work as expected also for >>>>>>> writeback IO, but it does not resolve the problem of faster cgroups getting >>>>>>> blocked by slower cgroups (that would expose a potential way to create DoS in >>>>>>> the system). >>>>>>> >>>>>>> In fact, at the moment critical IO requests (that have dependency with other IO >>>>>>> requests made by other cgroups) and non-critical requests are mixed together at >>>>>>> the filesystem layer in a way that throttling a single write request may stop >>>>>>> also other requests in the system, and at the block layer it's not possible to >>>>>>> retrieve such informations to make the right decision. >>>>>>> >>>>>>> A simple solution to this problem could be to just limit the rate of async >>>>>>> writes at the time a task is generating dirty pages in the page cache. The >>>>>>> big advantage of this approach is that it does not need the overhead of >>>>>>> tracking the ownership of the dirty pages, because in this way from the blkio >>>>>>> controller perspective all the IO operations will happen from the process >>>>>>> context: writes in memory and synchronous reads from the block device. 
>>>>>>> >>>>>>> The drawback of this approach is that the blkio.throttle controller becomes a >>>>>>> little bit leaky, because with this solution the controller is still affected >>>>>>> by the IO spikes during the writeback of dirty pages executed by the kernel >>>>>>> threads. >>>>>>> >>>>>>> Probably an even better approach would be to introduce the tracking of the >>>>>>> dirty page ownership to properly account the cost of each IO operation at the >>>>>>> block layer and apply the throttling of async writes in memory only when IO >>>>>>> limits are exceeded. >>>>>> Andrea, I am curious to know more about it third option. Can you give more >>>>>> details about accouting in block layer but throttling in memory. So say >>>>>> a process starts IO, then it will still be in throttle limits at block >>>>>> layer (because no writeback has started), then the process will write >>>>>> bunch of pages in cache. By the time throttle limits are crossed at >>>>>> block layer, we already have lots of dirty data in page cache and >>>>>> throttling process now is already late? >>>>> Charging the cost of each IO operation at the block layer would allow >>>>> tasks to write in memory at the maximum speed. Instead, with the 3rd >>>>> approach, tasks are forced to write in memory at the rate defined by the >>>>> blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). >>>>> >>>>> When we'll have the per-cgroup dirty memory accounting and limiting >>>>> feature, with this approach each cgroup could write to its dirty memory >>>>> quota at the maximum rate. >>>> Ok, so this is option 3 which you have already implemented in this >>>> patchset. >>>> >>>> I guess then I am confused with option 2. Can you elaborate a little >>>> more there. >>> With option 3, we can just limit the rate at which dirty pages are >>> generated in memory. And this can be done introducing the files >>> blkio.throttle.async.write_bps/iops_device. >>> >>> At the moment in blk_throtl_bio() we charge the dispatched bytes/iops >>> _and_ we check if the bio can be dispatched. These two distinct >>> operations are now done by the same function. >>> >>> With option 2, I'm proposing to split these two operations and place >>> throtl_charge_io() at the block layer in __generic_make_request() and an >>> equivalent of tg_may_dispatch_bio() (maybe a better name would be >>> blk_is_throttled()) at the page cache layer, in >>> balance_dirty_pages_ratelimited_nr(): >>> >>> A prototype for blk_is_throttled() could be the following: >>> >>> bool blk_is_throttled(void); >>> >>> This means in balance_dirty_pages_ratelimited_nr() we won't charge any >>> bytes/iops to the cgroup, but we'll just check if the limits are >>> exceeded. And stop it in that case, so that no more dirty pages can be >>> generated by this cgroup. >>> >>> Instead at the block layer WRITEs will be always dispatched in >>> blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but >>> the throtl_charge_io() would charge the cost of the IO operation to the >>> right cgroup. >>> >>> To summarize: >>> >>> __generic_make_request(): >>> blk_throtl_bio(q, &bio); >>> >>> balance_dirty_pages_ratelimited_nr(): >>> if (blk_is_throttled()) >>> // add the current task into a per-group wait queue and >>> // wake up once this cgroup meets its quota >>> >>> What do you think? >> Hi Andrea, >> >> This means when you throttle writes, the reads issued by this task are also throttled? 
>> >> Thanks, >> Gui > > Exactly, we're treating the throttling of READs and WRITEs in two > different ways. > > READs are always throttled synchronously in the > __generic_make_request() -> blk_throtl_bio() path. Andrea, I mean, if the task exceeds its write limit, it will be put to sleep, right? So it doesn't get a chance to issue read requests. Gui > > -Andrea > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- Regards Gui Jianfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-07 11:44 ` Gui Jianfeng @ 2011-03-07 11:59 ` Andrea Righi 2011-03-07 12:15 ` Gui Jianfeng 0 siblings, 1 reply; 19+ messages in thread From: Andrea Righi @ 2011-03-07 11:59 UTC (permalink / raw) To: Gui Jianfeng Cc: Vivek Goyal, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Mon, Mar 07, 2011 at 07:44:49PM +0800, Gui Jianfeng wrote: > Andrea Righi wrote: > > On Mon, Mar 07, 2011 at 03:31:11PM +0800, Gui Jianfeng wrote: > >> Andrea Righi wrote: > >>> On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote: > >>>> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote: > >>>>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: > >>>>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: > >>>>>>> Overview > >>>>>>> ======== > >>>>>>> Currently the blkio.throttle controller only support synchronous IO requests. > >>>>>>> This means that we always look at the current task to identify the "owner" of > >>>>>>> each IO request. > >>>>>>> > >>>>>>> However dirty pages in the page cache can be wrote to disk asynchronously by > >>>>>>> the per-bdi flusher kernel threads or by any other thread in the system, > >>>>>>> according to the writeback policy. > >>>>>>> > >>>>>>> For this reason the real writes to the underlying block devices may > >>>>>>> occur in a different IO context respect to the task that originally > >>>>>>> generated the dirty pages involved in the IO operation. This makes the > >>>>>>> tracking and throttling of writeback IO more complicate respect to the > >>>>>>> synchronous IO from the blkio controller's perspective. > >>>>>>> > >>>>>>> Proposed solution > >>>>>>> ================= > >>>>>>> In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve > >>>>>>> the problem of the buffered writes limitation by tracking the ownership of all > >>>>>>> the dirty pages in the system. > >>>>>>> > >>>>>>> This would allow to always identify the owner of each IO operation at the block > >>>>>>> layer and apply the appropriate throttling policy implemented by the > >>>>>>> blkio.throttle controller. > >>>>>>> > >>>>>>> This solution makes the blkio.throttle controller to work as expected also for > >>>>>>> writeback IO, but it does not resolve the problem of faster cgroups getting > >>>>>>> blocked by slower cgroups (that would expose a potential way to create DoS in > >>>>>>> the system). > >>>>>>> > >>>>>>> In fact, at the moment critical IO requests (that have dependency with other IO > >>>>>>> requests made by other cgroups) and non-critical requests are mixed together at > >>>>>>> the filesystem layer in a way that throttling a single write request may stop > >>>>>>> also other requests in the system, and at the block layer it's not possible to > >>>>>>> retrieve such informations to make the right decision. > >>>>>>> > >>>>>>> A simple solution to this problem could be to just limit the rate of async > >>>>>>> writes at the time a task is generating dirty pages in the page cache. The > >>>>>>> big advantage of this approach is that it does not need the overhead of > >>>>>>> tracking the ownership of the dirty pages, because in this way from the blkio > >>>>>>> controller perspective all the IO operations will happen from the process > >>>>>>> context: writes in memory and synchronous reads from the block device. 
> >>>>>>> > >>>>>>> The drawback of this approach is that the blkio.throttle controller becomes a > >>>>>>> little bit leaky, because with this solution the controller is still affected > >>>>>>> by the IO spikes during the writeback of dirty pages executed by the kernel > >>>>>>> threads. > >>>>>>> > >>>>>>> Probably an even better approach would be to introduce the tracking of the > >>>>>>> dirty page ownership to properly account the cost of each IO operation at the > >>>>>>> block layer and apply the throttling of async writes in memory only when IO > >>>>>>> limits are exceeded. > >>>>>> Andrea, I am curious to know more about it third option. Can you give more > >>>>>> details about accouting in block layer but throttling in memory. So say > >>>>>> a process starts IO, then it will still be in throttle limits at block > >>>>>> layer (because no writeback has started), then the process will write > >>>>>> bunch of pages in cache. By the time throttle limits are crossed at > >>>>>> block layer, we already have lots of dirty data in page cache and > >>>>>> throttling process now is already late? > >>>>> Charging the cost of each IO operation at the block layer would allow > >>>>> tasks to write in memory at the maximum speed. Instead, with the 3rd > >>>>> approach, tasks are forced to write in memory at the rate defined by the > >>>>> blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). > >>>>> > >>>>> When we'll have the per-cgroup dirty memory accounting and limiting > >>>>> feature, with this approach each cgroup could write to its dirty memory > >>>>> quota at the maximum rate. > >>>> Ok, so this is option 3 which you have already implemented in this > >>>> patchset. > >>>> > >>>> I guess then I am confused with option 2. Can you elaborate a little > >>>> more there. > >>> With option 3, we can just limit the rate at which dirty pages are > >>> generated in memory. And this can be done introducing the files > >>> blkio.throttle.async.write_bps/iops_device. > >>> > >>> At the moment in blk_throtl_bio() we charge the dispatched bytes/iops > >>> _and_ we check if the bio can be dispatched. These two distinct > >>> operations are now done by the same function. > >>> > >>> With option 2, I'm proposing to split these two operations and place > >>> throtl_charge_io() at the block layer in __generic_make_request() and an > >>> equivalent of tg_may_dispatch_bio() (maybe a better name would be > >>> blk_is_throttled()) at the page cache layer, in > >>> balance_dirty_pages_ratelimited_nr(): > >>> > >>> A prototype for blk_is_throttled() could be the following: > >>> > >>> bool blk_is_throttled(void); > >>> > >>> This means in balance_dirty_pages_ratelimited_nr() we won't charge any > >>> bytes/iops to the cgroup, but we'll just check if the limits are > >>> exceeded. And stop it in that case, so that no more dirty pages can be > >>> generated by this cgroup. > >>> > >>> Instead at the block layer WRITEs will be always dispatched in > >>> blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but > >>> the throtl_charge_io() would charge the cost of the IO operation to the > >>> right cgroup. > >>> > >>> To summarize: > >>> > >>> __generic_make_request(): > >>> blk_throtl_bio(q, &bio); > >>> > >>> balance_dirty_pages_ratelimited_nr(): > >>> if (blk_is_throttled()) > >>> // add the current task into a per-group wait queue and > >>> // wake up once this cgroup meets its quota > >>> > >>> What do you think? 
> >> Hi Andrea, > >> > >> This means when you throttle writes, the reads issued by this task are also throttled? > >> > >> Thanks, > >> Gui > > > > Exactly, we're treating the throttling of READs and WRITEs in two > > different ways. > > > > READs are always throttled synchronously in the > > __generic_make_request() -> blk_throtl_bio() path. > > Andrea, > > I mean, if the task exceeds its write limit, it will be put to sleep, right? > So it doesn't get a chance to issue read requests. Oh yes, you're right. This could be a problem. OTOH I wouldn't like to introduce an additional queue to submit the write requests in the page cache and dispatch them asynchronously. mmh... ideas? -Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-07 11:59 ` Andrea Righi @ 2011-03-07 12:15 ` Gui Jianfeng 0 siblings, 0 replies; 19+ messages in thread From: Gui Jianfeng @ 2011-03-07 12:15 UTC (permalink / raw) To: Andrea Righi Cc: Vivek Goyal, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel Andrea Righi wrote: > On Mon, Mar 07, 2011 at 07:44:49PM +0800, Gui Jianfeng wrote: >> Andrea Righi wrote: >>> On Mon, Mar 07, 2011 at 03:31:11PM +0800, Gui Jianfeng wrote: >>>> Andrea Righi wrote: >>>>> On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote: >>>>>> On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote: >>>>>>> On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: >>>>>>>> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: >>>>>>>>> Overview >>>>>>>>> ======== >>>>>>>>> Currently the blkio.throttle controller only support synchronous IO requests. >>>>>>>>> This means that we always look at the current task to identify the "owner" of >>>>>>>>> each IO request. >>>>>>>>> >>>>>>>>> However dirty pages in the page cache can be wrote to disk asynchronously by >>>>>>>>> the per-bdi flusher kernel threads or by any other thread in the system, >>>>>>>>> according to the writeback policy. >>>>>>>>> >>>>>>>>> For this reason the real writes to the underlying block devices may >>>>>>>>> occur in a different IO context respect to the task that originally >>>>>>>>> generated the dirty pages involved in the IO operation. This makes the >>>>>>>>> tracking and throttling of writeback IO more complicate respect to the >>>>>>>>> synchronous IO from the blkio controller's perspective. >>>>>>>>> >>>>>>>>> Proposed solution >>>>>>>>> ================= >>>>>>>>> In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve >>>>>>>>> the problem of the buffered writes limitation by tracking the ownership of all >>>>>>>>> the dirty pages in the system. >>>>>>>>> >>>>>>>>> This would allow to always identify the owner of each IO operation at the block >>>>>>>>> layer and apply the appropriate throttling policy implemented by the >>>>>>>>> blkio.throttle controller. >>>>>>>>> >>>>>>>>> This solution makes the blkio.throttle controller to work as expected also for >>>>>>>>> writeback IO, but it does not resolve the problem of faster cgroups getting >>>>>>>>> blocked by slower cgroups (that would expose a potential way to create DoS in >>>>>>>>> the system). >>>>>>>>> >>>>>>>>> In fact, at the moment critical IO requests (that have dependency with other IO >>>>>>>>> requests made by other cgroups) and non-critical requests are mixed together at >>>>>>>>> the filesystem layer in a way that throttling a single write request may stop >>>>>>>>> also other requests in the system, and at the block layer it's not possible to >>>>>>>>> retrieve such informations to make the right decision. >>>>>>>>> >>>>>>>>> A simple solution to this problem could be to just limit the rate of async >>>>>>>>> writes at the time a task is generating dirty pages in the page cache. The >>>>>>>>> big advantage of this approach is that it does not need the overhead of >>>>>>>>> tracking the ownership of the dirty pages, because in this way from the blkio >>>>>>>>> controller perspective all the IO operations will happen from the process >>>>>>>>> context: writes in memory and synchronous reads from the block device. 
>>>>>>>>> >>>>>>>>> The drawback of this approach is that the blkio.throttle controller becomes a >>>>>>>>> little bit leaky, because with this solution the controller is still affected >>>>>>>>> by the IO spikes during the writeback of dirty pages executed by the kernel >>>>>>>>> threads. >>>>>>>>> >>>>>>>>> Probably an even better approach would be to introduce the tracking of the >>>>>>>>> dirty page ownership to properly account the cost of each IO operation at the >>>>>>>>> block layer and apply the throttling of async writes in memory only when IO >>>>>>>>> limits are exceeded. >>>>>>>> Andrea, I am curious to know more about it third option. Can you give more >>>>>>>> details about accouting in block layer but throttling in memory. So say >>>>>>>> a process starts IO, then it will still be in throttle limits at block >>>>>>>> layer (because no writeback has started), then the process will write >>>>>>>> bunch of pages in cache. By the time throttle limits are crossed at >>>>>>>> block layer, we already have lots of dirty data in page cache and >>>>>>>> throttling process now is already late? >>>>>>> Charging the cost of each IO operation at the block layer would allow >>>>>>> tasks to write in memory at the maximum speed. Instead, with the 3rd >>>>>>> approach, tasks are forced to write in memory at the rate defined by the >>>>>>> blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). >>>>>>> >>>>>>> When we'll have the per-cgroup dirty memory accounting and limiting >>>>>>> feature, with this approach each cgroup could write to its dirty memory >>>>>>> quota at the maximum rate. >>>>>> Ok, so this is option 3 which you have already implemented in this >>>>>> patchset. >>>>>> >>>>>> I guess then I am confused with option 2. Can you elaborate a little >>>>>> more there. >>>>> With option 3, we can just limit the rate at which dirty pages are >>>>> generated in memory. And this can be done introducing the files >>>>> blkio.throttle.async.write_bps/iops_device. >>>>> >>>>> At the moment in blk_throtl_bio() we charge the dispatched bytes/iops >>>>> _and_ we check if the bio can be dispatched. These two distinct >>>>> operations are now done by the same function. >>>>> >>>>> With option 2, I'm proposing to split these two operations and place >>>>> throtl_charge_io() at the block layer in __generic_make_request() and an >>>>> equivalent of tg_may_dispatch_bio() (maybe a better name would be >>>>> blk_is_throttled()) at the page cache layer, in >>>>> balance_dirty_pages_ratelimited_nr(): >>>>> >>>>> A prototype for blk_is_throttled() could be the following: >>>>> >>>>> bool blk_is_throttled(void); >>>>> >>>>> This means in balance_dirty_pages_ratelimited_nr() we won't charge any >>>>> bytes/iops to the cgroup, but we'll just check if the limits are >>>>> exceeded. And stop it in that case, so that no more dirty pages can be >>>>> generated by this cgroup. >>>>> >>>>> Instead at the block layer WRITEs will be always dispatched in >>>>> blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but >>>>> the throtl_charge_io() would charge the cost of the IO operation to the >>>>> right cgroup. >>>>> >>>>> To summarize: >>>>> >>>>> __generic_make_request(): >>>>> blk_throtl_bio(q, &bio); >>>>> >>>>> balance_dirty_pages_ratelimited_nr(): >>>>> if (blk_is_throttled()) >>>>> // add the current task into a per-group wait queue and >>>>> // wake up once this cgroup meets its quota >>>>> >>>>> What do you think? 
>>>> Hi Andrea, >>>> >>>> This means when you throttle writes, the reads issued by this task are also throttled? >>>> >>>> Thanks, >>>> Gui >>> Exactly, we're treating the throttling of READs and WRITEs in two >>> different ways. >>> >>> READs are always throttled synchronously in the >>> __generic_make_request() -> blk_throtl_bio() path. >> Andrea, >> >> I mean, if the task exceeds its write limit, it will be put to sleep, right? >> So it doesn't get a chance to issue read requests. > > Oh yes, you're right. This could be a problem. OTOH I wouldn't like to > introduce an additional queue to submit the write requests in the page > cache and dispatch them asynchronously. > > mmh... ideas? > hmm, dispatching asynchronously would make things more complicated. But letting writes block reads goes against the idea of the page cache. I'm not sure how to solve this... Gui > -Andrea > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-06 15:52 ` Andrea Righi 2011-03-07 7:31 ` Gui Jianfeng @ 2011-03-07 15:22 ` Vivek Goyal 1 sibling, 0 replies; 19+ messages in thread From: Vivek Goyal @ 2011-03-07 15:22 UTC (permalink / raw) To: Andrea Righi Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Sun, Mar 06, 2011 at 04:52:47PM +0100, Andrea Righi wrote: > On Wed, Mar 02, 2011 at 04:47:05PM -0500, Vivek Goyal wrote: > > On Wed, Mar 02, 2011 at 02:28:30PM +0100, Andrea Righi wrote: > > > On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: > > > > On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: > > > > > Overview > > > > > ======== > > > > > Currently the blkio.throttle controller only support synchronous IO requests. > > > > > This means that we always look at the current task to identify the "owner" of > > > > > each IO request. > > > > > > > > > > However dirty pages in the page cache can be wrote to disk asynchronously by > > > > > the per-bdi flusher kernel threads or by any other thread in the system, > > > > > according to the writeback policy. > > > > > > > > > > For this reason the real writes to the underlying block devices may > > > > > occur in a different IO context respect to the task that originally > > > > > generated the dirty pages involved in the IO operation. This makes the > > > > > tracking and throttling of writeback IO more complicate respect to the > > > > > synchronous IO from the blkio controller's perspective. > > > > > > > > > > Proposed solution > > > > > ================= > > > > > In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve > > > > > the problem of the buffered writes limitation by tracking the ownership of all > > > > > the dirty pages in the system. > > > > > > > > > > This would allow to always identify the owner of each IO operation at the block > > > > > layer and apply the appropriate throttling policy implemented by the > > > > > blkio.throttle controller. > > > > > > > > > > This solution makes the blkio.throttle controller to work as expected also for > > > > > writeback IO, but it does not resolve the problem of faster cgroups getting > > > > > blocked by slower cgroups (that would expose a potential way to create DoS in > > > > > the system). > > > > > > > > > > In fact, at the moment critical IO requests (that have dependency with other IO > > > > > requests made by other cgroups) and non-critical requests are mixed together at > > > > > the filesystem layer in a way that throttling a single write request may stop > > > > > also other requests in the system, and at the block layer it's not possible to > > > > > retrieve such informations to make the right decision. > > > > > > > > > > A simple solution to this problem could be to just limit the rate of async > > > > > writes at the time a task is generating dirty pages in the page cache. The > > > > > big advantage of this approach is that it does not need the overhead of > > > > > tracking the ownership of the dirty pages, because in this way from the blkio > > > > > controller perspective all the IO operations will happen from the process > > > > > context: writes in memory and synchronous reads from the block device. 
> > > > > > > > > > The drawback of this approach is that the blkio.throttle controller becomes a > > > > > little bit leaky, because with this solution the controller is still affected > > > > > by the IO spikes during the writeback of dirty pages executed by the kernel > > > > > threads. > > > > > > > > > > Probably an even better approach would be to introduce the tracking of the > > > > > dirty page ownership to properly account the cost of each IO operation at the > > > > > block layer and apply the throttling of async writes in memory only when IO > > > > > limits are exceeded. > > > > > > > > Andrea, I am curious to know more about it third option. Can you give more > > > > details about accouting in block layer but throttling in memory. So say > > > > a process starts IO, then it will still be in throttle limits at block > > > > layer (because no writeback has started), then the process will write > > > > bunch of pages in cache. By the time throttle limits are crossed at > > > > block layer, we already have lots of dirty data in page cache and > > > > throttling process now is already late? > > > > > > Charging the cost of each IO operation at the block layer would allow > > > tasks to write in memory at the maximum speed. Instead, with the 3rd > > > approach, tasks are forced to write in memory at the rate defined by the > > > blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). > > > > > > When we'll have the per-cgroup dirty memory accounting and limiting > > > feature, with this approach each cgroup could write to its dirty memory > > > quota at the maximum rate. > > > > Ok, so this is option 3 which you have already implemented in this > > patchset. > > > > I guess then I am confused with option 2. Can you elaborate a little > > more there. > > With option 3, we can just limit the rate at which dirty pages are > generated in memory. And this can be done introducing the files > blkio.throttle.async.write_bps/iops_device. > > At the moment in blk_throtl_bio() we charge the dispatched bytes/iops > _and_ we check if the bio can be dispatched. These two distinct > operations are now done by the same function. > > With option 2, I'm proposing to split these two operations and place > throtl_charge_io() at the block layer in __generic_make_request() and an > equivalent of tg_may_dispatch_bio() (maybe a better name would be > blk_is_throttled()) at the page cache layer, in > balance_dirty_pages_ratelimited_nr(): > > A prototype for blk_is_throttled() could be the following: > > bool blk_is_throttled(void); > > This means in balance_dirty_pages_ratelimited_nr() we won't charge any > bytes/iops to the cgroup, but we'll just check if the limits are > exceeded. And stop it in that case, so that no more dirty pages can be > generated by this cgroup. > > Instead at the block layer WRITEs will be always dispatched in > blk_throtl_bio() (tg_may_dispatch_bio() will always return true), but > the throtl_charge_io() would charge the cost of the IO operation to the > right cgroup. > > To summarize: > > __generic_make_request(): > blk_throtl_bio(q, &bio); > > balance_dirty_pages_ratelimited_nr(): > if (blk_is_throttled()) > // add the current task into a per-group wait queue and > // wake up once this cgroup meets its quota > > What do you think? Ok, so an IO is charged only when it hits the block layer. Because a group is not throttled till actually some IO happens, group will not be throttled. - To me, this still does not solve the problem of IO spikes. 
When a process starts writing out, IO will go into the cache, the group is not blocked, and the process can dump lots of WRITES in cache; by the time writeout starts and we throttle the submitter, it's too late. All the WRITES in cache will show up at the block device in bulk and impact read latencies. Having said that, I guess that will be the case initially and then the process will be throttled. So every time a process writes, it can dump a bunch of IO in cache and remain unthrottled for some time. So to me it might work well for the cases where long writeouts happen continuously. But if some process dumps a bunch of MBs, then goes on to do other things for a few seconds and comes back again to dump a bunch of MBs, in these cases this scheme will not be effective. - Will it handle the case of fsync? If a bunch of pages are in cache and the process calls fsync, then nothing will be throttled, which will also negatively impact read latencies, and we are back to IO spikes? Will it make sense to try to make throttled writeback almost synchronous? In other words, say we implement option 3, and once we realize that we are doing IO in a cgroup where writes are throttled, we wake up the flusher threads and possibly mark the inodes which we want to be selected for IO? That way IO to an inode will become more or less synchronous and the effect of IO spikes will come down. Well, thinking more about it, I guess something similar could be done by implementing a per-cgroup dirty ratio and setting dirty_ratio to zero or a very low value? So a user could co-mount the memory and IO controllers and reduce the impact of IO spikes from the WRITES of a cgroup by controlling the dirty ratio of that cgroup? Thanks Vivek ^ permalink raw reply [flat|nested] 19+ messages in thread
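As an illustration of the option-2 split sketched in Andrea's reply quoted above, here is a small self-contained C model (plain userspace code, compilable on its own): the block-layer side only charges the group for dispatched bytes, while the page-dirtying side only checks whether the group is over budget and waits. The structure, the single accounting window and the helper names are stand-ins for illustration; this is not the actual kernel or patch code.

/*
 * Simplified userspace model of the "option 2" split discussed above:
 * the block layer only *charges* the group when a bio is dispatched,
 * while the page-dirtying path only *checks* whether the group is over
 * its budget and, if so, makes the task wait.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

struct throtl_group {
	uint64_t write_bps;	/* configured limit, bytes/sec */
	uint64_t bytes_disp;	/* bytes actually dispatched to disk */
	time_t   slice_start;	/* start of the accounting window */
};

/* Block-layer side (think blk_throtl_bio()): always dispatch, just account. */
static void throtl_charge_io(struct throtl_group *tg, uint64_t bytes)
{
	tg->bytes_disp += bytes;
}

/* Page-cache side: is the group over the budget accumulated so far? */
static bool blk_is_throttled(const struct throtl_group *tg)
{
	uint64_t allowed = (uint64_t)(time(NULL) - tg->slice_start) * tg->write_bps;

	return tg->bytes_disp > allowed;
}

int main(void)
{
	struct throtl_group tg = {
		.write_bps = 4 << 20,		/* 4 MiB/s, as in the testcase */
		.slice_start = time(NULL),
	};

	/* Writeback dispatches 16 MiB worth of dirty pages in one burst. */
	throtl_charge_io(&tg, 16 << 20);

	/*
	 * The dirtier (think balance_dirty_pages_ratelimited_nr()) now sees
	 * the group over budget and would be put to sleep until the
	 * accumulated allowance catches up (about 4 seconds here).
	 */
	while (blk_is_throttled(&tg)) {
		puts("dirtier throttled, waiting...");
		sleep(1);
	}
	puts("group back under its budget, dirtying may continue");
	return 0;
}

The model also shows the limitation Vivek points out above: nothing blocks the dirtier until enough bytes have actually been dispatched, so an initial burst can land in the page cache unthrottled.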
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-02 13:28 ` Andrea Righi 2011-03-02 21:47 ` Vivek Goyal @ 2011-03-02 22:00 ` Greg Thelen 1 sibling, 0 replies; 19+ messages in thread From: Greg Thelen @ 2011-03-02 22:00 UTC (permalink / raw) To: Andrea Righi Cc: Vivek Goyal, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Wed, Mar 2, 2011 at 5:28 AM, Andrea Righi <arighi@develer.com> wrote: > On Mon, Feb 28, 2011 at 06:01:14PM -0500, Vivek Goyal wrote: >> On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: >> > Overview >> > ======== >> > Currently the blkio.throttle controller only support synchronous IO requests. >> > This means that we always look at the current task to identify the "owner" of >> > each IO request. >> > >> > However dirty pages in the page cache can be wrote to disk asynchronously by >> > the per-bdi flusher kernel threads or by any other thread in the system, >> > according to the writeback policy. >> > >> > For this reason the real writes to the underlying block devices may >> > occur in a different IO context respect to the task that originally >> > generated the dirty pages involved in the IO operation. This makes the >> > tracking and throttling of writeback IO more complicate respect to the >> > synchronous IO from the blkio controller's perspective. >> > >> > Proposed solution >> > ================= >> > In the previous patch set http://lwn.net/Articles/429292/ I proposed to resolve >> > the problem of the buffered writes limitation by tracking the ownership of all >> > the dirty pages in the system. >> > >> > This would allow to always identify the owner of each IO operation at the block >> > layer and apply the appropriate throttling policy implemented by the >> > blkio.throttle controller. >> > >> > This solution makes the blkio.throttle controller to work as expected also for >> > writeback IO, but it does not resolve the problem of faster cgroups getting >> > blocked by slower cgroups (that would expose a potential way to create DoS in >> > the system). >> > >> > In fact, at the moment critical IO requests (that have dependency with other IO >> > requests made by other cgroups) and non-critical requests are mixed together at >> > the filesystem layer in a way that throttling a single write request may stop >> > also other requests in the system, and at the block layer it's not possible to >> > retrieve such informations to make the right decision. >> > >> > A simple solution to this problem could be to just limit the rate of async >> > writes at the time a task is generating dirty pages in the page cache. The >> > big advantage of this approach is that it does not need the overhead of >> > tracking the ownership of the dirty pages, because in this way from the blkio >> > controller perspective all the IO operations will happen from the process >> > context: writes in memory and synchronous reads from the block device. >> > >> > The drawback of this approach is that the blkio.throttle controller becomes a >> > little bit leaky, because with this solution the controller is still affected >> > by the IO spikes during the writeback of dirty pages executed by the kernel >> > threads. 
>> > >> > Probably an even better approach would be to introduce the tracking of the >> > dirty page ownership to properly account the cost of each IO operation at the >> > block layer and apply the throttling of async writes in memory only when IO >> > limits are exceeded. >> >> Andrea, I am curious to know more about it third option. Can you give more >> details about accouting in block layer but throttling in memory. So say >> a process starts IO, then it will still be in throttle limits at block >> layer (because no writeback has started), then the process will write >> bunch of pages in cache. By the time throttle limits are crossed at >> block layer, we already have lots of dirty data in page cache and >> throttling process now is already late? > > Charging the cost of each IO operation at the block layer would allow > tasks to write in memory at the maximum speed. Instead, with the 3rd > approach, tasks are forced to write in memory at the rate defined by the > blkio.throttle.write_*_device (or blkio.throttle.async.write_*_device). > > When we'll have the per-cgroup dirty memory accounting and limiting > feature, with this approach each cgroup could write to its dirty memory > quota at the maximum rate. > > BTW, another thing that we probably need is that any cgroup should be > forced to write their own inodes when the limit defined by dirty_ratio > is exceeded. I mean, avoid to select inodes from the list of dirty > inodes in a FIFO way, but provides a better logic to "assign" the > ownership of each inode to a cgroup (maybe that one that had generated > most of the dirty pages in the inodes) and use for example a list of > dirty inodes per cgroup or something similar. > > In this way we should really be able to provide a good quality of > service for the most part of the cases IMHO. > > I also plan to write down another patch set to implement this logic. I have also been thinking about this problem. I had assumed that the per-bdi inode list would be retained, but a memcg_inode filter function would be used to determine which dirty inodes contribute any dirty pages to the over-dirty-limit memcg. To efficiently determine if an inode contributes dirty pages to a memcg, I was thinking about implementing a shallow memcg identifier cache within the inode's address space. Here is a summary of what I am thinking about: This approach adds an N deep cache of dirtying cgroups into struct address_space. N would likely be 1. Each address_space would have an array of N as_memcg identifiers and an as_memcg_overflow bit. The memcg identifier may be a memcg pointer or a small integer (maybe css_id). Unset cache entries would be NULL if using pointers, 0 if using css_id (an illegal css_id value). The memcg identifier would not be dereferenced - it would only be used as a fuzzy comparison value. So a reference to the memcg is not needed. Modifications to as_memcg[] and as_memcg_overflow are protected by the mapping tree_lock. The kernel integration with this approach involves: 1. When an address_space created in inode_init_always(): 1.1 Set as_memcg[*]=NULL and as_memcg_overflow=0. 2. When a mapping page is dirtied in __set_page_dirty(): 2.1 The tree lock is already held 2.2 If any of as_memcg[*] == current_memcg, then do nothing. The dirtying memcg is already associated with the address space. 2.3 If any of as_memcg[x] is NULL, then set as_memcg[x] = current_memcg. This associates the dirtying memcg with the address space. 
2.4 Otherwise, set as_memcg_overflow=1 indicating that more than N memcg have dirtied the address space since it was last cleaned. 3. When an inode is completely clean - this is a case within writeback_single_inode(): 3.1 Grab the tree_lock 3.2 Set as_memcg[*]=NULL and as_memcg_overflow=0. These are the same steps as inode creation. 3.3 Release the tree_lock 4. When per-cgroup inode writeback is performed, walk the list of dirty inodes only matching inodes contributing dirty pages to cmp_memcg. For each dirty inode: 4.1 Tree locking is not needed here. Races are allowed. 4.2 If as_memcg_overflow is set, then schedule inode for writeback. Assume it contains dirty pages for cmp_memcg. This is the case where there are more than N memcg that have dirtied the address space since it was completely clean. 4.3 If any of as_memcg[*] == cmp_memcg, then schedule the inode for writeback. 4.4 Otherwise, do not schedule the inode for writeback because the current memcg has not contributed dirty pages to the inode. Weaknesses: 1. If the N deep cache is not sufficiently large then inodes will be written back more than is ideal. Example: if more than N cgroups have dirtied an inode, then as_memcg_overflow is set and remains set until the inode is fully clean. Until the inode is fully cleaned, any per-memcg writeback will write back the inode. This will unfairly affect writeback of cgroups that have never touched or dirtied the overflowing inode. If this is a significant problem, then increasing the value of N will solve the problem. Theoretically each inode, or each filesystem, could have a different value of N. But the preferred implementation has a compile-time constant value of N. 2. If any cgroup continually dirties an inode, then no other cgroups that have ever dirtied the inode will ever be expired from the N deep as_memcg[] cache, because the "when an inode is completely clean" logic is never run. If the past set of dirtying cgroups is greater than N, then every cgroup writeback will write back the inode. If this is a problem, then a dirty_page counter could be added to each as_memcg[] record to keep track of the number of dirty pages each memcg contributes to an inode. When the per-memcg count reaches zero then the particular as_memcg entry could be removed. This would allow for memcg expiration from as_memcg[] before the inode becomes completely clean. 3. There is no way to cheaply identify the set of inodes contributing dirty pages to a memcg that has exceeded its dirty memory limit. When a memcg blows its limit, what list of dirty inodes is traversed? The bdi that was last being written to, which, in the worst case, will have very little dirty data from the current memcg. If this is an issue, we could walk all bdis looking for inodes that contribute dirty pages to the over-limit memcg. >> > To summarize, we can identify three possible solutions to properly throttle the >> > buffered writes: >> > >> > 1) account & throttle everything at block IO layer (bad for "priority >> > inversion" problems, needs page tracking for blkio) >> > >> > 2) account at block IO layer and throttle in memory (needs page tracking for >> > blkio) >> > >> > 3) account & throttle in memory (affected by IO spikes, depending on >> > dirty_ratio / dirty_background_ratio settings) >> > >> > For now we start with the solution 3) that seems to be the simplest way to >> > proceed. >> >> Yes, IO spikes is the weakness of this 3rd solution. But it should be
simple too. Also as you said the problem can be reduced to some extent >> by reducing dirty_ratio and background dirty ratio. But that >> will have other trade offs, I guess. > > Agreed. > > -Andrea > ^ permalink raw reply [flat|nested] 19+ messages in thread
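To make Greg's per-address_space cache concrete, here is a toy self-contained C model of the N-deep dirtying-cgroup cache and the writeback filter described in steps 2-4 above. The memcg identifier is reduced to an integer, the tree_lock is left out, and every structure and function name is invented for illustration; this is not kernel code.

/*
 * Toy model of the N-deep cache of dirtying cgroups in an address space.
 * Step numbers in the comments refer to Greg's summary above.
 */
#include <stdbool.h>
#include <stdio.h>

#define AS_MEMCG_CACHE_SIZE 1			/* "N would likely be 1" */

struct address_space_model {
	int  as_memcg[AS_MEMCG_CACHE_SIZE];	/* 0 == unset slot */
	bool as_memcg_overflow;
};

/* Step 2: called when a page of this mapping is dirtied. */
static void as_track_dirtier(struct address_space_model *m, int memcg)
{
	for (int i = 0; i < AS_MEMCG_CACHE_SIZE; i++) {
		if (m->as_memcg[i] == memcg)
			return;			/* 2.2: already cached */
		if (m->as_memcg[i] == 0) {
			m->as_memcg[i] = memcg;	/* 2.3: take a free slot */
			return;
		}
	}
	m->as_memcg_overflow = true;		/* 2.4: cache is full */
}

/* Step 3: called when the inode becomes completely clean. */
static void as_reset_dirtiers(struct address_space_model *m)
{
	for (int i = 0; i < AS_MEMCG_CACHE_SIZE; i++)
		m->as_memcg[i] = 0;
	m->as_memcg_overflow = false;
}

/* Step 4: should per-memcg writeback pick this inode? */
static bool as_dirtied_by(const struct address_space_model *m, int memcg)
{
	if (m->as_memcg_overflow)
		return true;			/* 4.2: assume it did */
	for (int i = 0; i < AS_MEMCG_CACHE_SIZE; i++)
		if (m->as_memcg[i] == memcg)
			return true;		/* 4.3: known dirtier */
	return false;				/* 4.4: skip the inode */
}

int main(void)
{
	struct address_space_model m = { { 0 }, false };

	as_track_dirtier(&m, 7);	/* cgroup 7 dirties a page */
	as_track_dirtier(&m, 9);	/* a second cgroup overflows the cache */
	printf("writeback for 7: %d, for 9: %d, for 3: %d\n",
	       as_dirtied_by(&m, 7), as_dirtied_by(&m, 9), as_dirtied_by(&m, 3));

	as_reset_dirtiers(&m);		/* inode written fully clean */
	printf("after clean, writeback for 7: %d\n", as_dirtied_by(&m, 7));
	return 0;
}

With N = 1 the example also reproduces weakness 1 from the list above: once two cgroups have dirtied the mapping, even cgroup 3, which never touched it, gets the inode selected for writeback until the inode is fully cleaned.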
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-02-28 10:15 [PATCH 0/3] blk-throttle: async write throttling Andrea Righi ` (3 preceding siblings ...) 2011-02-28 23:01 ` [PATCH 0/3] blk-throttle: async write throttling Vivek Goyal @ 2011-03-01 16:27 ` Vivek Goyal 2011-03-01 21:40 ` Vivek Goyal 4 siblings, 1 reply; 19+ messages in thread From: Vivek Goyal @ 2011-03-01 16:27 UTC (permalink / raw) To: Andrea Righi Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: [..] > TODO > ~~~~ > - Consider to add the following new files in the blkio controller to allow the > user to explicitly limit async writes as well as sync writes: > > blkio.throttle.async.write_bps_limit > blkio.throttle.async.write_iops_limit I am kind of split on this. - One way of thinking is that blkio.throttle.read/write_limits represent the limits on the request queue of the IO which is actually passing through the queue now. So we should not mix the two and keep async limits separately. This will also tell the customer explicitly that async throttling does not mean the same thing, as throttling happens before entering the page cache and there can be/will be IO spikes later at the request queue. Also creating the separate files leaves the door open for future extension of implementing async control when async IO is actually submitted to the request queue. (Though I think that will be hard, as it means making sure all the filesystems, writeback logic and device mapper drivers are aware of throttling and will take steps to ensure faster groups are not stuck behind slower groups.) So keeping async accounting separate will help make clear that async control is not the same as sync control. There are fundamental differences. - On the other hand, it makes life a bit simpler for the user, as they don't have to specify the async limits separately and there is one aggregate limit for sync and async (assuming we fix the throttling state issues so that the throttling logic can handle both bio and task throttling out of a single limit). Any thoughts? Thanks Vivek ^ permalink raw reply [flat|nested] 19+ messages in thread
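As a sketch of the "separate files, separate state" direction weighed here, the following self-contained C fragment keeps two independent budgets in one group: one charged when bios reach the request queue (sync) and one charged when a task dirties page cache (async). The field names, the simplified slice handling and the helper are assumptions made for illustration; they are not the existing blk-throttle data structures.

/*
 * One group, two independent budgets: sync writes charged at bio
 * dispatch time, async writes charged when pages are dirtied.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct tg_budget {
	uint64_t bps_limit;	/* 0 == unlimited */
	uint64_t bytes_used;	/* consumed in the current slice */
	uint64_t slice_bytes;	/* allowance for the current slice */
};

struct throtl_group_model {
	struct tg_budget sync_write;	/* blkio.throttle.write_bps_device */
	struct tg_budget async_write;	/* blkio.throttle.async.write_bps_limit (proposed) */
};

/* Charge 'bytes' if they fit the budget; otherwise report "must wait". */
static bool tg_budget_charge(struct tg_budget *b, uint64_t bytes)
{
	if (!b->bps_limit)
		return true;			/* unlimited */
	if (b->bytes_used + bytes > b->slice_bytes)
		return false;			/* caller waits for a new slice */
	b->bytes_used += bytes;
	return true;
}

int main(void)
{
	struct throtl_group_model tg = {
		.sync_write  = { .bps_limit = 8 << 20, .slice_bytes = 8 << 20 },
		.async_write = { .bps_limit = 4 << 20, .slice_bytes = 4 << 20 },
	};

	/* A 6 MiB direct write fits the sync budget for this slice... */
	printf("sync 6MiB: %s\n",
	       tg_budget_charge(&tg.sync_write, 6 << 20) ? "dispatch" : "throttle");
	/* ...while dirtying 6 MiB of page cache exceeds the async budget. */
	printf("async 6MiB: %s\n",
	       tg_budget_charge(&tg.async_write, 6 << 20) ? "dirty" : "throttle");
	return 0;
}

A single aggregate limit would instead charge both paths against one budget, which is exactly the trade-off between user simplicity and keeping the two mechanisms visibly distinct.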
* Re: [PATCH 0/3] blk-throttle: async write throttling 2011-03-01 16:27 ` Vivek Goyal @ 2011-03-01 21:40 ` Vivek Goyal 0 siblings, 0 replies; 19+ messages in thread From: Vivek Goyal @ 2011-03-01 21:40 UTC (permalink / raw) To: Andrea Righi Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Jonathan Corbet, Andrew Morton, containers, linux-mm, linux-kernel, Moyer Jeff Moyer On Tue, Mar 01, 2011 at 11:27:53AM -0500, Vivek Goyal wrote: > On Mon, Feb 28, 2011 at 11:15:02AM +0100, Andrea Righi wrote: > > [..] > > TODO > > ~~~~ > > - Consider to add the following new files in the blkio controller to allow the > > user to explicitly limit async writes as well as sync writes: > > > > blkio.throttle.async.write_bps_limit > > blkio.throttle.async.write_iops_limit > > I am kind of split on this. > > - One way of thinking is that blkio.throttle.read/write_limits represent > the limits on request queue of the IO which is actually passing > through queue now. So we should not mix the two and keep async limits > separately. This will also tell the customer explicitly that async > throttling does not mean the same thing as throttling happens before > entering the page cache and there can be/will be IO spikes later > at the request queue. > > Also creating the separate files leaves the door open for future > extension of implementing async control when async IO is actually > submitted to request queue. (Though I think that will be hard as > making sure all the filesystems, writeback logic, device mapper > drivers are aware of throttling and will take steps to ensure faster > groups are not stuck behind slower groups). > > So keeping async accounting separate will help differentiating that > async control is not same as sync control. There are fundamental > differences. > > > - On the other hand, it makes life a bit simple for user as they don't > have to specify the async limits separately and there is one aggregate > limit for sync and async (assuming we fix the throttling state issues > so that throttling logic can handle both bio and task throttling out > of single limit). > > Any thoughts? Thinking more about it: I was chatting with Jeff Moyer about this, and he mentioned that applications might not like this frequent blocking in the buffered write path. That is a valid concern, but I don't know how to address it. dirty_ratio, dirty_bytes and dirty_background_ratio will not be enough to throttle the overall WRITE rate from the application. I am now a little inclined towards creating separate files for async writeout and not mixing this with the throttling done at the device level, for the following reasons. - It is a different throttling functionality at a different layer, and it is not the same as throttling sync IO at the device level when IO is submitted to the device. We are just making use of the existing throttling infrastructure to account for writing rates in the page cache. So having a separate file and separate async throttle state per group in blkio makes sense to me. - There will be these unintuitive side effects of frequent blocking, and keeping it separate at least allows us to disable this functionality. - And we might come up with throttling async IO at the device level, and keeping it separate helps us at that point. So to me throttling the async rate while writing to the page cache is a different functionality; let it be controlled by separate blkio files.
The bigger question is in what cases this kind of functionality is really useful, given that it can still produce IO spikes and given that the application will be put to sleep frequently, depending on the configured throttling rate. Thanks Vivek ^ permalink raw reply [flat|nested] 19+ messages in thread
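To put a rough number on the frequent-blocking concern raised above, here is a small stand-alone C calculation. It assumes, purely for illustration, that the dirtier is checked every 32 dirtied pages (in the spirit of the ratelimiting already done around balance_dirty_pages_ratelimited_nr()) and sleeps just long enough to hold the configured async rate; the constants are not taken from the patches.

/*
 * Back-of-the-envelope model: how often and how long is a streaming
 * writer put to sleep when dirtying is held to an async bps limit?
 */
#include <stdint.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096ULL

/* Sleep needed after dirtying 'ratelimit_pages' pages at 'async_write_bps'. */
static double async_sleep_seconds(uint64_t async_write_bps, unsigned ratelimit_pages)
{
	return (double)(ratelimit_pages * MODEL_PAGE_SIZE) / (double)async_write_bps;
}

int main(void)
{
	uint64_t limit = 4ULL << 20;	/* 4 MiB/s async limit, as in the testcase */
	unsigned ratelimit = 32;	/* check every 32 dirtied pages (128 KiB) */
	double sleep_s = async_sleep_seconds(limit, ratelimit);

	printf("sleep per check: %.0f ms -> about %.0f wakeups/s for a streaming writer\n",
	       sleep_s * 1000.0, 1.0 / sleep_s);
	return 0;
}

At a 4 MiB/s limit the writer is put to sleep roughly 32 times a second for about 31 ms each; the aggregate rate is met, but the application sees the blocking, which is the concern Jeff Moyer raised.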
end of thread, other threads:[~2011-03-07 15:23 UTC | newest]

Thread overview: 19+ messages
2011-02-28 10:15 [PATCH 0/3] blk-throttle: async write throttling Andrea Righi
2011-02-28 10:15 ` [PATCH 1/3] block: introduce REQ_DIRECT to track direct io bio Andrea Righi
2011-02-28 10:15 ` [PATCH 2/3] blkio-throttle: infrastructure to throttle async io Andrea Righi
2011-03-01 16:06 ` Vivek Goyal
2011-03-02 13:31 ` Andrea Righi
2011-02-28 10:15 ` [PATCH 3/3] blkio-throttle: async write io instrumentation Andrea Righi
2011-02-28 23:01 ` [PATCH 0/3] blk-throttle: async write throttling Vivek Goyal
2011-03-02 13:28 ` Andrea Righi
2011-03-02 21:47 ` Vivek Goyal
2011-03-06 15:52 ` Andrea Righi
2011-03-07 7:31 ` Gui Jianfeng
2011-03-07 11:34 ` Andrea Righi
2011-03-07 11:44 ` Gui Jianfeng
2011-03-07 11:59 ` Andrea Righi
2011-03-07 12:15 ` Gui Jianfeng
2011-03-07 15:22 ` Vivek Goyal
2011-03-02 22:00 ` Greg Thelen
2011-03-01 16:27 ` Vivek Goyal
2011-03-01 21:40 ` Vivek Goyal