* performance drop after using blkcg
From: Zhao Shuai @ 2012-12-11 7:00 UTC
To: linux-kernel, cgroups

Hi,

I plan to use blkcg (proportional BW) in my system, but I encounter a
great performance drop after enabling blkcg. The testing tool is fio
(version 2.0.7) and both the BW and IOPS fields are recorded. Two
instances of the fio program are run simultaneously, each operating
on a separate disk file (say /data/testfile1, /data/testfile2).

System environment:
kernel: 3.7.0-rc5
CFQ's slice_idle is disabled (slice_idle=0) while group_idle is
enabled (group_idle=8).

FIO configuration (e.g. "read") for the first fio program (say FIO1):

[global]
description=Emulation of Intel IOmeter File Server Access Pattern

[iometer]
bssplit=4k/30:8k/40:16k/30
rw=read
direct=1
time_based
runtime=180s
ioengine=sync
filename=/data/testfile1
numjobs=32
group_reporting

Result before using blkcg (BW values are in KB/s):

            FIO1 BW/IOPS      FIO2 BW/IOPS
---------------------------------------
read        26799/2911        25861/2810
write       138618/15071      138578/15069
rw          72159/7838(r)     71851/7811(r)
            72171/7840(w)     71799/7805(w)
randread    4982/543          5370/585
randwrite   5192/566          6010/654
randrw      2369/258(r)       3027/330(r)
            2369/258(w)       3016/328(w)

Result after using blkcg (create two blkio cgroups with the default
blkio.weight (500) and put FIO1 and FIO2 into these cgroups
respectively):

            FIO1 BW/IOPS      FIO2 BW/IOPS
---------------------------------------
read        36651/3985        36470/3943
write       75738/8229        75641/8221
rw          49169/5342(r)     49168/5346(r)
            49200/5348(w)     49140/5341(w)
randread    4876/532          4905/534
randwrite   5535/603          5497/599
randrw      2521/274(r)       2527/275(r)
            2510/273(w)       2532/274(w)

Comparing these results, we found a great performance drop (30%-40%)
in some test cases (especially the "write" and "rw" cases). Is it
normal to see write/rw bandwidth decrease by 40% after using
blkio-cgroup? If not, is there any way to improve or tune the
performance?

Thanks.

-- 
Regards,
Zhao Shuai
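[Illustration -- not from the original mail. The cgroup setup
described above can be scripted; the minimal user-space C sketch
below shows one way, assuming a cgroup-v1 blkio hierarchy mounted at
/sys/fs/cgroup/blkio and /dev/sdb as the test disk. Both paths and
the "fio1"/"fio2" group names are assumptions for illustration.]

/*
 * Sketch of the test setup: two blkio cgroups with the default
 * weight (500), CFQ tuned with slice_idle=0 and group_idle=8.
 * Paths are assumptions for a typical cgroup-v1 layout.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/* CFQ tunables used in the test */
	write_str("/sys/block/sdb/queue/iosched/slice_idle", "0");
	write_str("/sys/block/sdb/queue/iosched/group_idle", "8");

	/* two groups with the default weight each */
	mkdir("/sys/fs/cgroup/blkio/fio1", 0755);
	mkdir("/sys/fs/cgroup/blkio/fio2", 0755);
	write_str("/sys/fs/cgroup/blkio/fio1/blkio.weight", "500");
	write_str("/sys/fs/cgroup/blkio/fio2/blkio.weight", "500");

	/*
	 * Each fio instance would then be attached by writing its PID
	 * into the matching .../tasks file before the run starts.
	 */
	return 0;
}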
* Re: performance drop after using blkcg
From: joeytao @ 2013-08-29 3:10 UTC
To: linux-kernel

Hello,

I also ran these tests and found the same results. IMO, on faster
storage with a deep queue depth, if the device is asking for more
requests but our workload can't send enough of them, we have to idle
to provide service differentiation. We'll see a performance drop
whenever applications can't drive enough IO to keep the disk busy.
Especially for writes, with the effect of the disk cache and the deep
queue depth, we'll often see a performance drop.

So I came up with an approach called self-adaptive blkcg: if the
average total service time for a request is very small, we don't
idle; otherwise, we idle to wait for the next request. The patch is
below. After extensive testing, the new scheduler provides service
differentiation in most cases. When the application can't drive
enough requests and the mean total service time is very small, we
don't idle. In most cases, performance doesn't drop after using blkcg
and the service differentiation is good.

From 50705c8d4e456d3286e76bed7281796b1e915e0e Mon Sep 17 00:00:00 2001
From: Joeytao <husttsq@gmail.com>
Date: Mon, 26 Aug 2013 15:40:39 +0800
Subject: [PATCH] Self-adaption blkcg

---
 block/cfq-iosched.c       |   41 ++++++++++++++++++++++++++++++++++++++---
 include/linux/iocontext.h |    5 +++++
 2 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 23500ac..79296de 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -288,6 +288,8 @@ struct cfq_data {
 	unsigned int cfq_group_idle;
 	unsigned int cfq_latency;
+	unsigned int cfq_target_latency;
+	unsigned int cfq_write_isolation;
 
 	unsigned int cic_index;
 	struct list_head cic_list;
@@ -589,7 +591,7 @@ cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
 	struct cfq_rb_root *st = &cfqd->grp_service_tree;
 
-	return cfq_target_latency * cfqg->weight / st->total_weight;
+	return cfqd->cfq_target_latency * cfqg->weight / st->total_weight;
 }
 
 static inline unsigned
@@ -2028,6 +2031,14 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 			     cic->ttime_mean);
 		return;
 	}
+
+	/*
+	 * added by joeytao,
+	 * If our average await_time is 0, then don't idle. This is for
+	 * write requests, because if the disk cache is on, there is no
+	 * need to wait.
+	 */
+	if (!cfqd->cfq_write_isolation && sample_valid(cic->awtime_samples) &&
+	    (cic->awtime_mean == 0))
+		return;
 
 	/* There are other queues in the group, don't do group idle */
 	if (group_idle && cfqq->cfqg->nr_cfqq > 1)
@@ -2243,7 +2254,7 @@ new_workload:
 	 * to have higher weight. A more accurate thing would be to
 	 * calculate system wide asnc/sync ratio.
 	 */
-	tmp = cfq_target_latency * cfqg_busy_async_queues(cfqd, cfqg);
+	tmp = cfqd->cfq_target_latency * cfqg_busy_async_queues(cfqd, cfqg);
 	tmp = tmp/cfqd->busy_queues;
 	slice = min_t(unsigned, slice, tmp);
 
@@ -3228,10 +3239,21 @@ err:
 }
 
 static void
+cfq_update_io_awaittime(struct cfq_data *cfqd, struct cfq_io_context *cic)
+{
+	unsigned long elapsed = jiffies - cic->last_end_request;
+	unsigned long awtime = min(elapsed, 2UL * 16);
+
+	cic->awtime_samples = (7*cic->awtime_samples + 256) / 8;
+	cic->awtime_total = (7*cic->awtime_total + 256*awtime) / 8;
+	cic->awtime_mean = (cic->awtime_total + 128) / cic->awtime_samples;
+}
+
+static void
 cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
 {
 	unsigned long elapsed = jiffies - cic->last_end_request;
-	unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
+	unsigned long ttime = min(elapsed, 2UL * 8);
 
 	cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
 	cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
@@ -3573,6 +3595,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
 
 	if (sync) {
+		cfq_update_io_awaittime(cfqd, RQ_CIC(rq)); /* added by joeytao, 2013.8.27 */
 		RQ_CIC(rq)->last_end_request = now;
 		if (!time_after(rq->start_time + cfqd->cfq_fifo_expire[1], now))
 			cfqd->last_delayed_sync = now;
@@ -4075,6 +4098,12 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_back_penalty = cfq_back_penalty;
 	cfqd->cfq_slice[0] = cfq_slice_async;
 	cfqd->cfq_slice[1] = cfq_slice_sync;
+	cfqd->cfq_target_latency = cfq_target_latency; /* added by joeytao, 2013.8.5 */
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+	cfqd->cfq_write_isolation = 0; /* added by joeytao, 2013.8.16 */
+#else
+	cfqd->cfq_write_isolation = 1; /* added by joeytao, 2013.8.21 */
+#endif
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->cfq_group_idle = cfq_group_idle;
@@ -4154,6 +4183,8 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 SHOW_FUNCTION(cfq_low_latency_show, cfqd->cfq_latency, 0);
+SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
+SHOW_FUNCTION(cfq_write_isolation_show, cfqd->cfq_write_isolation, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -4187,6 +4218,8 @@ STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
+STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, UINT_MAX, 1);
+STORE_FUNCTION(cfq_write_isolation_store, &cfqd->cfq_write_isolation, 0, UINT_MAX, 0);
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -4204,6 +4237,8 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(slice_idle),
 	CFQ_ATTR(group_idle),
 	CFQ_ATTR(low_latency),
+	CFQ_ATTR(target_latency),
+	CFQ_ATTR(write_isolation),
 	__ATTR_NULL
 };
 
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index b2eee89..0c45b09 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -18,6 +18,11 @@ struct cfq_io_context {
 	unsigned long ttime_samples;
 	unsigned long ttime_mean;
 
+	/* added by joeytao */
+	unsigned long awtime_total;
+	unsigned long awtime_samples;
+	unsigned long awtime_mean;
+
 	struct list_head queue_list;
 	struct hlist_node cic_list;
-- 
1.7.1

--
View this message in context: http://linux-kernel.2935.n7.nabble.com/performance-drop-after-using-blkcg-tp567957p710883.html
Sent from the Linux Kernel mailing list archive at Nabble.com.
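[Illustration -- not from the original mail. The patch's awtime_mean
is a fixed-point exponentially weighted moving average, the same
(7*old + new)/8 scheme CFQ already uses for think time, scaled by 256
to keep fractional precision in integer arithmetic. The stand-alone
C program below replays that arithmetic in user space; the sample
values are made up.]

#include <stdio.h>

struct awtime {
	unsigned long samples;	/* EWMA of the sample count, scaled by 256 */
	unsigned long total;	/* EWMA of await time, scaled by 256 */
	unsigned long mean;
};

static void update(struct awtime *a, unsigned long awtime)
{
	a->samples = (7 * a->samples + 256) / 8;	/* decays toward 256 */
	a->total = (7 * a->total + 256 * awtime) / 8;
	a->mean = (a->total + 128) / a->samples;	/* rounded mean */
}

int main(void)
{
	struct awtime a = { 0, 0, 0 };
	/*
	 * Hypothetical per-request await times in jiffies: mostly 0,
	 * like a write-caching disk completing requests near-instantly.
	 */
	unsigned long v[] = { 0, 0, 1, 0, 0, 0, 0, 0 };
	unsigned long i;

	for (i = 0; i < sizeof(v) / sizeof(v[0]); i++) {
		update(&a, v[i]);
		printf("sample=%lu mean=%lu\n", v[i], a.mean);
	}
	return 0;
}

[In the patch, sample_valid() additionally requires enough samples
before awtime_mean == 0 is trusted as a sign that the device, e.g.
its write cache, completes requests essentially instantly, in which
case idling buys nothing.]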
* Re: performance drop after using blkcg
From: Vivek Goyal @ 2012-12-11 14:25 UTC
To: Zhao Shuai
Cc: tj, axboe, ctalbott, rni, linux-kernel, cgroups, containers

On Mon, Dec 10, 2012 at 08:28:54PM +0800, Zhao Shuai wrote:
> Hi,
>
> I plan to use blkcg (proportional BW) in my system, but I encounter a
> great performance drop after enabling blkcg.

[..]

> Result after using blkcg (create two blkio cgroups with the default
> blkio.weight (500) and put FIO1 and FIO2 into these cgroups
> respectively):

These results are with slice_idle=0?

> Comparing these results, we found a great performance drop (30%-40%)
> in some test cases (especially the "write" and "rw" cases). Is it
> normal to see write/rw bandwidth decrease by 40% after using
> blkio-cgroup? If not, is there any way to improve or tune the
> performance?

What's the storage you are using? Looking at the speed of IO I would
guess it is not one of those rotational disks.

blkcg does cause a drop in performance (due to idling at the group
level). The faster the storage, or the greater the number of cgroups,
the more visible the drop becomes. The only optimization I could think
of was disabling slice_idle, and you have already done that.

There might be some opportunities to cut down on group idling in some
cases and lose on fairness, but we will have to identify those and
modify the code.

In general, do not use blkcg on faster storage. In its current form it
is at best suitable for a single rotational SATA/SAS disk. I have not
been able to figure out how to provide fairness without group idling.

Thanks
Vivek
* Re: performance drop after using blkcg
From: Tejun Heo @ 2012-12-11 14:27 UTC
To: Vivek Goyal
Cc: Zhao Shuai, axboe, ctalbott, rni, linux-kernel, cgroups, containers

On Tue, Dec 11, 2012 at 09:25:18AM -0500, Vivek Goyal wrote:
> In general, do not use blkcg on faster storage. In its current form it
> is at best suitable for a single rotational SATA/SAS disk. I have not
> been able to figure out how to provide fairness without group idling.

I think cfq is just the wrong approach for faster non-rotational
devices. We should be allocating iops instead of time slices.

Thanks.

-- 
tejun
* Re: performance drop after using blkcg
From: Vivek Goyal @ 2012-12-11 14:43 UTC
To: Tejun Heo
Cc: Zhao Shuai, axboe, ctalbott, rni, linux-kernel, cgroups, containers

On Tue, Dec 11, 2012 at 06:27:42AM -0800, Tejun Heo wrote:
> I think cfq is just the wrong approach for faster non-rotational
> devices. We should be allocating iops instead of time slices.

I think if one sets slice_idle=0 and group_idle=0 in CFQ, for all
practical purposes it becomes IOPS-based group scheduling. For group
accounting, CFQ then uses the number of requests dispatched from each
cgroup and uses that information to schedule groups.

I have not been able to figure out the practical benefits of that
approach, at least not for the simple workloads I played with. This
approach will not work for simple things like trying to improve
dependent read latencies in the presence of heavy writers. That's the
single biggest use case CFQ solves, IMO. And that happens because we
stop writes and don't let them go to the device, so the device is
primarily dealing with reads.

If some process is doing dependent reads and we want to improve read
latencies, then either we need to stop the flow of writes, or the
devices are good and always prioritize READs over WRITEs. If the
devices are good, then we probably don't even need blkcg.

So yes, an iops-based approach is fine; it's just that the number of
cases where you will see any service differentiation should be
significantly smaller.

Thanks
Vivek
* Re: performance drop after using blkcg
From: Tejun Heo @ 2012-12-11 14:47 UTC
To: Vivek Goyal
Cc: Zhao Shuai, axboe, ctalbott, rni, linux-kernel, cgroups, containers

Hello,

On Tue, Dec 11, 2012 at 09:43:36AM -0500, Vivek Goyal wrote:
> I think if one sets slice_idle=0 and group_idle=0 in CFQ, for all
> practical purposes it becomes IOPS-based group scheduling.

No, I don't think it is. You can't achieve isolation without idling
between group switches. We're measuring slices in terms of iops, but
what cfq actually schedules are still time slices, not IOs.

> For group accounting, CFQ then uses the number of requests dispatched
> from each cgroup and uses that information to schedule groups.
>
> I have not been able to figure out the practical benefits of that
> approach. [..] This approach will not work for simple things like
> trying to improve dependent read latencies in the presence of heavy
> writers. That's the single biggest use case CFQ solves, IMO.

As I wrote above, it's not about accounting. It's about the
scheduling unit.

> So yes, an iops-based approach is fine; it's just that the number of
> cases where you will see any service differentiation should be
> significantly smaller.

No, using iops to schedule time slices would lead to that. We just
need to be allocating and scheduling iops, and I don't think we should
be doing that from cfq.

Thanks.

-- 
tejun
* Re: performance drop after using blkcg
From: Vivek Goyal @ 2012-12-11 15:02 UTC
To: Tejun Heo
Cc: Zhao Shuai, axboe, ctalbott, rni, linux-kernel, cgroups, containers

On Tue, Dec 11, 2012 at 06:47:18AM -0800, Tejun Heo wrote:
> No, I don't think it is. You can't achieve isolation without idling
> between group switches. We're measuring slices in terms of iops, but
> what cfq actually schedules are still time slices, not IOs.

I think I have not been able to understand your proposal. Can you
explain a bit more?

This is what CFQ does in iops_mode(): it will calculate the number of
requests dispatched from a group, scale that number based on weight,
and put the group back on the service tree. So if you have not got
your fair share in terms of the number of requests dispatched to the
device, you will be put ahead in the queue and given a chance to
dispatch requests first.

Now, a couple of things:

- There is no idling here. If the device is asking for more requests
  (deep queue depth) then this group will be removed from the service
  tree and CFQ will move on to serve other queued groups. So if there
  is a dependent reader, it will lose its share. If we try to idle
  here, then we have solved nothing in terms of performance problems:
  the device is faster, but your workload can't cope with it, so you
  are artificially slowing down the device.

- But if all the contending workloads/groups are throwing enough IO
  traffic at the device and don't get expired, they should be able to
  dispatch requests to the device in proportion to their weight.

So this effectively keeps track of the number of requests dispatched
from the group, instead of the time slice consumed by the group, and
then does the scheduling:

cfq_group_served() {
	if (iops_mode(cfqd))
		charge = cfqq->slice_dispatch;
	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
}

Isn't that effectively IOPS scheduling? One should get an IOPS rate
in proportion to their weight (as long as they can throw enough
traffic at the device to keep it busy). If not, can you please give
more details about your proposal?

Thanks
Vivek
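[Illustration -- not from the original mail. cfq_scale_slice()
converts the service a group just received into virtual time by
scaling it inversely with the group's weight, so heavier groups
accumulate vdisktime more slowly and get picked again sooner. The
sketch below approximates the mainline math of that era; the shift
and default-weight constants are taken to mirror the kernel's, but
the program itself is only a model.]

#include <stdint.h>
#include <stdio.h>

#define CFQ_SERVICE_SHIFT	12
#define BLKIO_WEIGHT_DEFAULT	500

struct group {
	uint64_t vdisktime;	/* virtual service received so far */
	unsigned int weight;
};

static uint64_t scale_charge(uint64_t charge, unsigned int weight)
{
	/* charge is requests dispatched in iops_mode(), else slice time */
	return (charge << CFQ_SERVICE_SHIFT) * BLKIO_WEIGHT_DEFAULT / weight;
}

int main(void)
{
	struct group a = { 0, 500 }, b = { 0, 1000 };

	/* both groups dispatch 100 requests */
	a.vdisktime += scale_charge(100, a.weight);
	b.vdisktime += scale_charge(100, b.weight);

	/*
	 * The group with the smaller vdisktime is served next; b, with
	 * double the weight, lags behind a and therefore gets to
	 * dispatch roughly twice as often in the long run.
	 */
	printf("a=%llu b=%llu\n",
	       (unsigned long long)a.vdisktime,
	       (unsigned long long)b.vdisktime);
	return 0;
}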
* Re: performance drop after using blkcg
From: Tejun Heo @ 2012-12-11 15:14 UTC
To: Vivek Goyal
Cc: Zhao Shuai, axboe, ctalbott, rni, linux-kernel, cgroups, containers

Hello, Vivek.

On Tue, Dec 11, 2012 at 10:02:34AM -0500, Vivek Goyal wrote:
> cfq_group_served() {
> 	if (iops_mode(cfqd))
> 		charge = cfqq->slice_dispatch;
> 	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
> }
>
> Isn't that effectively IOPS scheduling? One should get an IOPS rate
> in proportion to their weight (as long as they can throw enough
> traffic at the device to keep it busy). If not, can you please give
> more details about your proposal?

The problem is that we lose a lot of isolation without idling between
queues or groups. This is because we switch between slices, and while
a slice is in progress only IOs belonging to that slice can be issued.
That is, higher-priority cfqgs/cfqqs, after dispatching the IOs they
have ready, lose their slice immediately. A lower-priority slice takes
over, and when the higher-priority ones get ready, they have to wait
for the lower-priority one before submitting their new IOs. In many
cases, they end up unable to generate IOs any faster than the ones in
lower-priority cfqqs/cfqgs. This is because we switch slices rather
than iops.

We can make cfq essentially switch iops by implementing very
aggressive preemption, but I really don't see much point in that. cfq
is way too heavy and ill-suited for high-speed non-rotational devices,
which are becoming more and more consistent in terms of the iops they
can handle.

I think we need something better suited for the maturing
non-rotational devices. They're becoming very different from what cfq
was built for, and we really shouldn't be maintaining several rbtrees
which need full synchronization for each IO. We're doing way too much
and it just isn't scalable.

Thanks.

-- 
tejun
* Re: performance drop after using blkcg
From: Vivek Goyal @ 2012-12-11 15:37 UTC
To: Tejun Heo
Cc: Zhao Shuai, axboe, ctalbott, rni, linux-kernel, cgroups, containers

On Tue, Dec 11, 2012 at 07:14:12AM -0800, Tejun Heo wrote:
> The problem is that we lose a lot of isolation without idling between
> queues or groups. This is because we switch between slices, and while
> a slice is in progress only IOs belonging to that slice can be
> issued.
[..]
> This is because we switch slices rather than iops.

I am not sure how any of the above problems will go away if we start
scheduling iops.

> We can make cfq essentially switch iops by implementing very
> aggressive preemption, but I really don't see much point in that.

Yes, this should be easily doable. Once a queue/group is being removed
and losing its share, just keep track of its last vdisktime. When more
IO comes in for this group, the current group is preempted if its
vdisktime is greater than that of the group being queued, and the new
group is queued at the front.

I have experimented with schemes like that but did not see very
promising results. Assume the device supports a queue depth of 128,
and there is one dependent reader and one writer. If the reader goes
away and comes back and preempts the low-priority writer, in that
small time window the writer has dispatched enough requests to
introduce read delays. So preemption helps only so much.

I am curious to know how an iops-based scheduler solves these issues.
The only way to provide effective isolation seemed to be idling, and
the moment we idle we kill the performance. It does not matter whether
we are scheduling time or iops.

> cfq is way too heavy and ill-suited for high-speed non-rotational
> devices, which are becoming more and more consistent in terms of the
> iops they can handle.
>
> I think we need something better suited for the maturing
> non-rotational devices. [..] We're doing way too much and it just
> isn't scalable.

I am fine with doing things differently in a different scheduler. But
what I am arguing here is that at least with CFQ we should be able to
experiment and figure out what works. In CFQ all the code is there,
and if this iops-based scheduling has merit, one should be able to
quickly experiment and demonstrate how one would do things
differently.

To me, it is still not clear what iops-based scheduling would do
differently. Will we idle there or not? If we idle, we again have
performance problems.

So doing things outside of CFQ is fine. I am only after understanding
the technical idea which will solve the problem of providing isolation
as well as fairness without losing throughput. And I have not been
able to get a hang of it yet.

Thanks
Vivek
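[Illustration -- not from the original mail. The preemption
experiment Vivek describes boils down to: remember a group's
vdisktime when it goes idle, and when it becomes backlogged again,
preempt the running group if the newcomer is behind in virtual time.
A sketch of that decision with invented names; this is not the actual
CFQ code.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct io_group {
	uint64_t vdisktime;	/* kept across idle periods */
};

/* called when new IO arrives for a previously idle group */
static bool should_preempt(const struct io_group *waking,
			   const struct io_group *active)
{
	/*
	 * If the active group has been charged past the waking group's
	 * saved vdisktime, the waking group is owed service: queue it
	 * at the front and preempt.
	 */
	return active->vdisktime > waking->vdisktime;
}

int main(void)
{
	struct io_group reader = { 1000 };	/* was idle, now woken */
	struct io_group writer = { 4000 };	/* currently dispatching */

	printf("preempt writer: %s\n",
	       should_preempt(&reader, &writer) ? "yes" : "no");
	return 0;
}

[As Vivek notes, even with this in place the writer can stuff a deep
device queue during the reader's think time, so preemption alone does
not restore read latency.]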
* Re: performance drop after using blkcg
From: Tejun Heo @ 2012-12-11 16:01 UTC
To: Vivek Goyal
Cc: Zhao Shuai, axboe, ctalbott, rni, linux-kernel, cgroups, containers

Hello, Vivek.

On Tue, Dec 11, 2012 at 10:37:25AM -0500, Vivek Goyal wrote:
> I have experimented with schemes like that but did not see very
> promising results. [..] So preemption helps only so much. I am
> curious to know how an iops-based scheduler solves these issues.
>
> The only way to provide effective isolation seemed to be idling, and
> the moment we idle we kill the performance. It does not matter
> whether we are scheduling time or iops.

If the completion latency of IOs fluctuates heavily depending on queue
depth, the queue depth needs to be throttled so that a lower-priority
queue can't overwhelm the device queue while prospective
higher-priority accessors exist. Another aspect is that devices are
getting a lot more consistent in terms of latency.

While idling would also solve the isolation issue with an unordered
deep device queue, it really is a solution for a rotational device
with a large seek penalty, where the time lost while idling can
sometimes be made up by the savings from fewer seeks. For
non-rotational devices with a deep queue, the right thing to do would
be to control the queue depth or to propagate priority to the device
queue (from what I hear, people are working on it; dunno how well it
will turn out though).

> I am fine with doing things differently in a different scheduler. But
> what I am arguing here is that at least with CFQ we should be able to
> experiment and figure out what works. [..]
>
> To me, it is still not clear what iops-based scheduling would do
> differently. Will we idle there or not? If we idle, we again have
> performance problems.

When the device can do tens of thousands of ios per sec, I don't think
it makes much sense to idle the device. You just lose too much.

> So doing things outside of CFQ is fine. I am only after understanding
> the technical idea which will solve the problem of providing
> isolation as well as fairness without losing throughput. And I have
> not been able to get a hang of it yet.

I think it already has some aspect of it. It has the half-iops mode
for a reason, right? It just is very inefficient and way more complex
than it needs to be.

Thanks.

-- 
tejun
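[Illustration -- not from the original mail. One way to picture the
queue-depth control Tejun alludes to: watch completion latency and
shrink the permitted device queue depth multiplicatively when latency
overshoots a target, growing it back additively when latency is
healthy. The AIMD parameters and names below are invented; no such
code exists in CFQ.]

#include <stdio.h>

struct depth_governor {
	unsigned int depth;		/* currently permitted queue depth */
	unsigned int min_depth;
	unsigned int max_depth;
	unsigned long target_lat_us;	/* latency goal */
};

static void on_completion(struct depth_governor *g, unsigned long lat_us)
{
	if (lat_us > g->target_lat_us) {
		g->depth /= 2;			/* back off hard */
		if (g->depth < g->min_depth)
			g->depth = g->min_depth;
	} else if (g->depth < g->max_depth) {
		g->depth++;			/* probe for parallelism */
	}
}

int main(void)
{
	struct depth_governor g = { 128, 1, 128, 2000 };
	/* made-up completion latencies, in microseconds */
	unsigned long lats[] = { 500, 800, 9000, 7000, 600, 700, 650 };
	unsigned int i;

	for (i = 0; i < sizeof(lats) / sizeof(lats[0]); i++) {
		on_completion(&g, lats[i]);
		printf("lat=%luus depth=%u\n", lats[i], g.depth);
	}
	return 0;
}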
* Re: performance drop after using blkcg
From: Vivek Goyal @ 2012-12-11 16:18 UTC
To: Tejun Heo
Cc: Zhao Shuai, axboe, ctalbott, rni, linux-kernel, cgroups, containers

On Tue, Dec 11, 2012 at 08:01:37AM -0800, Tejun Heo wrote:

[..]

> If the completion latency of IOs fluctuates heavily depending on
> queue depth, the queue depth needs to be throttled so that a
> lower-priority queue can't overwhelm the device queue while
> prospective higher-priority accessors exist. [..] For non-rotational
> devices with a deep queue, the right thing to do would be to control
> the queue depth or to propagate priority to the device queue (from
> what I hear, people are working on it; dunno how well it will turn
> out though).

- Controlling the device queue depth should bring down throughput too,
  as it reduces the level of parallelism at the device. Also, asking
  the user to tune the device queue depth seems like a bad interface;
  how would a user know the right queue depth? Maybe software can try
  to be intelligent about it, and if IO latencies cross a threshold,
  try to decrease the queue depth. (We do things like that in CFQ.)

- Passing prio to the device sounds new and promising. If they can do
  a good job at it, why not. I think at minimum they need to make sure
  READs are prioritized over WRITEs by default, and maybe provide a
  way to signal important writes which need to go to the disk now.

  If READs are prioritized in the device, that takes care of one very
  important use case. Then we just have to worry about the other case
  of fairness between different readers or between different writers,
  and there we do not idle and try our best to give a fair share. In
  case a group is not backlogged, it is bound to lose some share.

> When the device can do tens of thousands of ios per sec, I don't
> think it makes much sense to idle the device. You just lose too much.

Agreed. Idling starts showing up soon even on fast SATA rotational
devices, so idling on faster devices will lead to bad results for most
workloads.

> > So doing things outside of CFQ is fine. I am only after
> > understanding the technical idea which will solve the problem of
> > providing isolation as well as fairness without losing throughput.
> > And I have not been able to get a hang of it yet.
>
> I think it already has some aspect of it. It has the half-iops mode
> for a reason, right? It just is very inefficient and way more complex
> than it needs to be.

I introduced this iops_mode() in an attempt to provide a fair disk
share in terms of iops instead of disk slices. It might not be the
most efficient one, but at least it can provide answers as to whether
it is useful at all, and for which workloads and devices iops-based
scheduling is useful.

So if somebody wants to experiment, just tweak the code a bit to allow
preemption when a queue which lost its share gets backlogged, and you
practically have a prototype of iops-based group scheduling.

Thanks
Vivek
* Re: performance drop after using blkcg
From: Tejun Heo @ 2012-12-11 16:27 UTC
To: Vivek Goyal
Cc: Zhao Shuai, axboe, ctalbott, rni, linux-kernel, cgroups, containers

Hello, Vivek.

On Tue, Dec 11, 2012 at 11:18:20AM -0500, Vivek Goyal wrote:
> - Controlling the device queue depth should bring down throughput
>   too, as it reduces the level of parallelism at the device. Also,
>   asking the user to tune the device queue depth seems like a bad
>   interface; how would a user know the right queue depth? Maybe
>   software can try to be intelligent about it, and if IO latencies
>   cross a threshold, try to decrease the queue depth. (We do things
>   like that in CFQ.)

Yeah, it should definitely be something automatic. Command completion
latencies are visible to the iosched, so it should be doable.

> - Passing prio to the device sounds new and promising. If they can do
>   a good job at it, why not. I think at minimum they need to make
>   sure READs are prioritized over WRITEs by default, and maybe
>   provide a way to signal important writes which need to go to the
>   disk now.
>
>   If READs are prioritized in the device, that takes care of one very
>   important use case. Then we just have to worry about the other case
>   of fairness between different readers or between different writers,
>   and there we do not idle and try our best to give a fair share. In
>   case a group is not backlogged, it is bound to lose some share.

I think it can be good enough if we have a queue-at-the-head / tail
choice. No idea how it'll actually fan out though.

Thanks.

-- 
tejun
* Re: performance drop after using blkcg
From: Zhao Shuai @ 2012-12-12 7:29 UTC
To: Vivek Goyal
Cc: tj, axboe, ctalbott, rni, linux-kernel, cgroups, containers

2012/12/11 Vivek Goyal <vgoyal@redhat.com>:
> These results are with slice_idle=0?

Yes, slice_idle is disabled.

> What's the storage you are using? Looking at the speed of IO I would
> guess it is not one of those rotational disks.

I have done the same test on 3 different types of boxes, and all of
them show a performance drop (30%-40%) after using blkcg. Though they
have different types of disk, all the storage they use consists of
traditional rotational devices (e.g. "HP EG0146FAWHU", "IBM-ESXS").

> So if somebody wants to experiment, just tweak the code a bit to
> allow preemption when a queue which lost its share gets backlogged,
> and you practically have a prototype of iops-based group scheduling.

Could you please explain more on this? How should the code be
adjusted? I have tested the following code piece, and the result is
that we lose group differentiation:

cfq_group_served() {
	if (iops_mode(cfqd))
		charge = cfqq->slice_dispatch;
	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
}

-- 
Regards,
Zhao Shuai
* Re: performance drop after using blkcg
From: Zhu Yanhai @ 2012-12-16 4:38 UTC
To: Zhao Shuai
Cc: Vivek Goyal, tj, axboe, ctalbott, rni, linux-kernel, cgroups, containers

2012/12/12 Zhao Shuai <zhaoshuai@freebsd.org>:
> I have done the same test on 3 different types of boxes, and all of
> them show a performance drop (30%-40%) after using blkcg. Though they
> have different types of disk, all the storage they use consists of
> traditional rotational devices (e.g. "HP EG0146FAWHU", "IBM-ESXS").

You may also want to try IO throttling (i.e.
blkio.throttle.read_iops_device and blkio.throttle.write_iops_device)
instead of proportional blkcg. We use it as a compromise between
performance and bandwidth-allocation fairness on some clusters whose
storage backend is ioDrive from Fusion-io, which is also a really fast
device.

CFQ/blkcg is based on time-sharing the storage device (allocation in
IOPS mode just converts IOPS to virtual time; it's still time-sharing
in fact), so the device services only a single group in any one slice.
Since many modern devices require a fair degree of parallelism to
reach their full capability, the device can't run at full speed if no
single group can give it enough pressure, even though all the groups
together could. That's why you get good scores if you run the same
workloads under the deadline scheduler.

-- 
Regards,
Zhu Yanhai
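[Illustration -- not from the original mail. The throttle knobs Zhu
mentions take "major:minor iops" tuples per device. A minimal sketch
of setting such a cap from user space; the cgroup path assumes a v1
blkio mount, and the device numbers and limit are made-up examples.]

#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/fs/cgroup/blkio/fio1/blkio.throttle.read_iops_device";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	/* "major:minor iops" -- 8:16 would be e.g. /dev/sdb */
	fprintf(f, "8:16 2000\n");
	fclose(f);
	return 0;
}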
Thread overview: 15+ messages (links below jump to the message on this page):
[not found] <CAFVn34SxqAJe_4P-WT8MOiG-kmKKD7ge96zoHXQuGqHWPgAt+A@mail.gmail.com>
2012-12-11  7:00 ` performance drop after using blkcg Zhao Shuai
2013-08-29  3:10   ` joeytao
2013-08-29  3:20   ` joeytao
2012-12-11 14:25 ` Vivek Goyal
2012-12-11 14:27   ` Tejun Heo
2012-12-11 14:43     ` Vivek Goyal
2012-12-11 14:47       ` Tejun Heo
2012-12-11 15:02         ` Vivek Goyal
2012-12-11 15:14           ` Tejun Heo
2012-12-11 15:37             ` Vivek Goyal
2012-12-11 16:01               ` Tejun Heo
2012-12-11 16:18                 ` Vivek Goyal
2012-12-11 16:27                   ` Tejun Heo
2012-12-12  7:29   ` Zhao Shuai
2012-12-16  4:38     ` Zhu Yanhai