* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write thoughput caused by writeback cgroup and dirty thottle
From: Miao Xie @ 2016-05-12  1:11 UTC
To: Fengguang Wu, Tejun Heo; +Cc: linux-kernel

Cc linux-kernel mail list

on 2016/5/11 at 22:15, Miao Xie wrote:
> Hi, Tejun and Fengguang
>
> I found that buffered write throughput is dragged down by writeback cgroup and dirty throttle on
> the 4.6-rc7 kernel. If I ran the benchmark in the top block cgroup, the throughput was more than 1500MB/s.
> If I ran the benchmark in a new block cgroup, the throughput was down to 4MB/s.
>
> Steps to reproduce:
> # mount -t cgroup2 cgroup <cgrp_mnt>
> # echo "+io +memory" > <cgrp_mnt>/cgroup.subtree_control
> # mkdir <cgrp_mnt>/aaa
> # echo $$ > <cgrp_mnt>/aaa/cgroup.procs
> # fio test.config
> job0: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> fio-2.2.8
> Starting 1 thread
> Jobs: 1 (f=1): [W(1)] [3.7% done] [0KB/4000KB/0KB /s] [0/1000/0 iops] [eta 04m:50s]
>
> Fio configuration is:
> [global]
> bs=4K
> direct=0
> ioengine=psync
> iodepth=1
> directory=/mnt/ext4/tstdir0
> time_based
> runtime=300
> group_reporting
> size=16G
> sync=0
> max_latency=120000000
> thread
>
> [job0]
> numjobs=1
> rw=write
>
> My box has 48 cores and 188GB memory, but I set
> vm.dirty_background_bytes = 268435456
> vm.dirty_bytes = 536870912
>
> If I set vm.dirty_background_bytes and vm.dirty_bytes to large numbers (vm.dirty_background_bytes = 3GB,
> vm.dirty_bytes = 4GB), then fio throughput is more than 1500MB/s. If I then reset them to the original
> values (the ones above), the throughput drops to 500MB/s.
>
> According to my debugging, fio slept for 1ms every time we dirtied a page (balance dirty pages) when
> the throughput was down to 4MB/s. I think it might be a bug in dirty throttle when writeback cgroup is enabled.
>
> Tejun and Fengguang, please let me know what you think about this issue, and if you have
> any suggestions for possible solutions. Any input is greatly appreciated!
>
> Thanks
> Miao
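The two observations in the report fit together: fio shows 1000 IOPS at bs=4K, which is what one 1ms pause per dirtied 4KB page would produce. A minimal sketch of that arithmetic, assuming the pause dominates the cost of each write and that fio's "KB" means 1024 bytes (both are assumptions, and the helper names are made up for illustration):

```c
/*
 * Writes per second when every write is followed by a pause of
 * pause_us microseconds.  The time spent copying data into the page
 * cache is ignored (an assumption of this sketch).
 */
static unsigned long throttled_iops(unsigned long pause_us)
{
	return 1000000UL / pause_us;
}

/*
 * Resulting throughput in KB/s (1 KB = 1024 bytes, assumed to match
 * fio's reporting) for writes of bs_bytes each.
 */
static unsigned long throttled_kb_per_sec(unsigned long bs_bytes,
					  unsigned long pause_us)
{
	return throttled_iops(pause_us) * bs_bytes / 1024UL;
}
```

Calling `throttled_kb_per_sec(4096, 1000)` gives 4000, the same figure as the `4000KB` in the fio status line above.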
* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write thoughput caused by writeback cgroup and dirty thottle
From: Tejun Heo @ 2016-05-12 15:32 UTC
To: Miao Xie; +Cc: Fengguang Wu, linux-kernel

Hello,

On Thu, May 12, 2016 at 09:11:33AM +0800, Miao Xie wrote:
> >My box has 48 cores and 188GB memory, but I set
> >vm.dirty_background_bytes = 268435456
> >vm.dirty_bytes = 536870912
> >
> >If I set vm.dirty_background_bytes and vm.dirty_bytes to large numbers (vm.dirty_background_bytes = 3GB,
> >vm.dirty_bytes = 4GB), then fio throughput is more than 1500MB/s. If I then reset them to the original
> >values (the ones above), the throughput drops to 500MB/s.
> >
> >According to my debugging, fio slept for 1ms every time we dirtied a page (balance dirty pages) when
> >the throughput was down to 4MB/s. I think it might be a bug in dirty throttle when writeback cgroup is enabled.

Heh, so, for cgroups, the absolute byte limits can't be applied directly
and are converted to percentage values before being applied.  You're
specifying 0.27% for threshold.  Unfortunately, the ratio is
translated into an integer percentage and 0.27% becomes 0, so your
cgroups are always over limit and being throttled.

Can you please see whether the following patch fixes the issue?

Thanks.

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 999792d..a455a21 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -369,8 +369,9 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
 	unsigned long bytes = vm_dirty_bytes;
 	unsigned long bg_bytes = dirty_background_bytes;
-	unsigned long ratio = vm_dirty_ratio;
-	unsigned long bg_ratio = dirty_background_ratio;
+	/* convert ratios to per-PAGE_SIZE for higher precision */
+	unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
+	unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
 	unsigned long thresh;
 	unsigned long bg_thresh;
 	struct task_struct *tsk;
@@ -382,26 +383,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 		/*
 		 * The byte settings can't be applied directly to memcg
 		 * domains.  Convert them to ratios by scaling against
-		 * globally available memory.
+		 * globally available memory.  As the ratios are in
+		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
+		 * pages.
 		 */
 		if (bytes)
-			ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
-				    global_avail, 100UL);
+			ratio = min(DIV_ROUND_UP(bytes, global_avail),
+				    PAGE_SIZE);
 		if (bg_bytes)
-			bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
-				       global_avail, 100UL);
+			bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
+				       PAGE_SIZE);
 		bytes = bg_bytes = 0;
 	}
 
 	if (bytes)
 		thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
 	else
-		thresh = (ratio * available_memory) / 100;
+		thresh = (ratio * available_memory) / PAGE_SIZE;
 
 	if (bg_bytes)
 		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
 	else
-		bg_thresh = (bg_ratio * available_memory) / 100;
+		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
 
 	if (bg_thresh >= thresh)
 		bg_thresh = thresh / 2;
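The truncation described in this message can be reproduced in userspace. The sketch below mirrors the two ratio expressions from the diff (before and after); it is not the kernel function itself. It assumes 4096-byte pages and, as a simplification, treats the whole 188GB of RAM as globally available memory (49283072 pages), which overstates what the kernel would really count. `SKETCH_PAGE_SIZE`, `min_ul` and the two function names are invented for the example.

```c
#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Pre-patch expression: the ratio is an integer percentage (0..100). */
static unsigned long ratio_percent(unsigned long bytes,
				   unsigned long global_avail_pages)
{
	return min_ul(DIV_ROUND_UP(bytes, SKETCH_PAGE_SIZE) * 100 /
		      global_avail_pages, 100UL);
}

/* Post-patch expression: the ratio is in units of 1/PAGE_SIZE. */
static unsigned long ratio_per_page_size(unsigned long bytes,
					 unsigned long global_avail_pages)
{
	return min_ul(DIV_ROUND_UP(bytes, global_avail_pages),
		      SKETCH_PAGE_SIZE);
}
```

Under those assumptions, `ratio_percent(536870912, 49283072)` evaluates to 0, while `ratio_per_page_size(536870912, 49283072)` evaluates to 11, i.e. 11/4096 or about 0.27%, which matches the figure quoted in the message.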
* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write thoughput caused by writeback cgroup and dirty thottle
From: Miao Xie @ 2016-05-13  6:11 UTC
To: Tejun Heo; +Cc: Fengguang Wu, linux-kernel

on 2016/5/12 at 23:32, Tejun Heo wrote:
> On Thu, May 12, 2016 at 09:11:33AM +0800, Miao Xie wrote:
>>> My box has 48 cores and 188GB memory, but I set
>>> vm.dirty_background_bytes = 268435456
>>> vm.dirty_bytes = 536870912
>>>
>>> If I set vm.dirty_background_bytes and vm.dirty_bytes to large numbers (vm.dirty_background_bytes = 3GB,
>>> vm.dirty_bytes = 4GB), then fio throughput is more than 1500MB/s. If I then reset them to the original
>>> values (the ones above), the throughput drops to 500MB/s.
>>>
>>> According to my debugging, fio slept for 1ms every time we dirtied a page (balance dirty pages) when
>>> the throughput was down to 4MB/s. I think it might be a bug in dirty throttle when writeback cgroup is enabled.
>
> Heh, so, for cgroups, the absolute byte limits can't be applied directly
> and are converted to percentage values before being applied.  You're
> specifying 0.27% for threshold.  Unfortunately, the ratio is
> translated into an integer percentage and 0.27% becomes 0, so your
> cgroups are always over limit and being throttled.
>
> Can you please see whether the following patch fixes the issue?

It is better than the kernel without the patch. Now the benchmark reaches the device bandwidth after 5-8 seconds.
But at the beginning it was still very slow: its throughput was only 4MB/s for ~4 seconds, and then it
went up within 1~3 seconds.

Thanks
Miao

> Thanks.
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 999792d..a455a21 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -369,8 +369,9 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>  	struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
>  	unsigned long bytes = vm_dirty_bytes;
>  	unsigned long bg_bytes = dirty_background_bytes;
> -	unsigned long ratio = vm_dirty_ratio;
> -	unsigned long bg_ratio = dirty_background_ratio;
> +	/* convert ratios to per-PAGE_SIZE for higher precision */
> +	unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
> +	unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
>  	unsigned long thresh;
>  	unsigned long bg_thresh;
>  	struct task_struct *tsk;
> @@ -382,26 +383,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>  		/*
>  		 * The byte settings can't be applied directly to memcg
>  		 * domains.  Convert them to ratios by scaling against
> -		 * globally available memory.
> +		 * globally available memory.  As the ratios are in
> +		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
> +		 * pages.
>  		 */
>  		if (bytes)
> -			ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
> -				    global_avail, 100UL);
> +			ratio = min(DIV_ROUND_UP(bytes, global_avail),
> +				    PAGE_SIZE);
>  		if (bg_bytes)
> -			bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
> -				       global_avail, 100UL);
> +			bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
> +				       PAGE_SIZE);
>  		bytes = bg_bytes = 0;
>  	}
>
>  	if (bytes)
>  		thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
>  	else
> -		thresh = (ratio * available_memory) / 100;
> +		thresh = (ratio * available_memory) / PAGE_SIZE;
>
>  	if (bg_bytes)
>  		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
>  	else
> -		bg_thresh = (bg_ratio * available_memory) / 100;
> +		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
>
>  	if (bg_thresh >= thresh)
>  		bg_thresh = thresh / 2;
* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write thoughput caused by writeback cgroup and dirty thottle
From: Tejun Heo @ 2016-05-27 18:24 UTC
To: Miao Xie; +Cc: Fengguang Wu, linux-kernel

Hello,

Sorry about the delay.  I forgot about this thread.

On Fri, May 13, 2016 at 02:11:53PM +0800, Miao Xie wrote:
> It is better than the kernel without the patch. Now the benchmark reaches
> the device bandwidth after 5-8 seconds. But at the beginning it
> was still very slow: its throughput was only 4MB/s for ~4
> seconds, and then it went up within 1~3 seconds.

I see.  As this fix is needed anyway, I'll send it up.  As for the
ramp-up, it could be normal.  There are estimators which take running
averages and modulate the threshold accordingly, and their starting
values are conservative, so a short ramp-up time can come from
that.

Thanks.

-- 
tejun
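To illustrate why a running average with a conservative starting value produces a ramp-up, here is a generic exponential moving average in integer arithmetic. It is only an illustration of the idea mentioned above, not the kernel's bandwidth or threshold estimator: the 1/8 weight, the function names and the numbers are all assumptions chosen for the example.

```c
/*
 * One update of a running average: move avg one eighth of the way
 * towards the new sample.  The 1/8 weight is an assumption.
 */
static unsigned long ema_step(unsigned long avg, unsigned long sample)
{
	if (sample >= avg)
		return avg + (sample - avg) / 8;
	return avg - (avg - sample) / 8;
}

/*
 * Number of update periods needed for the average to climb from a
 * conservative start value to at least goal while every sample equals
 * `sample`.  Capped at 1000 periods because integer rounding can stop
 * the average just short of the sample value.
 */
static int periods_to_reach(unsigned long start, unsigned long sample,
			    unsigned long goal)
{
	unsigned long avg = start;
	int periods = 0;

	while (avg < goal && periods < 1000) {
		avg = ema_step(avg, sample);
		periods++;
	}
	return periods;
}
```

Starting from 4 with samples of 1500 (loosely modelled on the MB/s figures in this thread), the first update only reaches 191, and `periods_to_reach(4, 1500, 1350)` needs 18 updates to get within 10% of the sample value. How long that takes in wall-clock time depends on the update interval, which this sketch does not model.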
* [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits()
From: Tejun Heo @ 2016-05-27 18:34 UTC
To: Jens Axboe, Jan Kara; +Cc: Fengguang Wu, linux-kernel, Miao Xie, kernel-team

As vm.dirty_[background_]bytes can't be applied verbatim to multiple
cgroup writeback domains, they get converted to percentages in
domain_dirty_limits() and applied the same way as
vm.dirty_[background_]ratio.  However, if the specified bytes is lower
than 1% of available memory, the calculated ratios become zero and the
writeback domain gets throttled constantly.

Fix it by using per-PAGE_SIZE instead of percentage for ratio
calculations.  Also, the updated DIV_ROUND_UP() usages now should
yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
bytes are above zero.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Miao Xie <miaoxie@huawei.com>
Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
Cc: stable@vger.kernel.org # v4.2+
Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()")
---
 mm/page-writeback.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b9956fd..9f914e9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -373,8 +373,9 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
 	unsigned long bytes = vm_dirty_bytes;
 	unsigned long bg_bytes = dirty_background_bytes;
-	unsigned long ratio = vm_dirty_ratio;
-	unsigned long bg_ratio = dirty_background_ratio;
+	/* convert ratios to per-PAGE_SIZE for higher precision */
+	unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
+	unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
 	unsigned long thresh;
 	unsigned long bg_thresh;
 	struct task_struct *tsk;
@@ -386,26 +387,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 		/*
 		 * The byte settings can't be applied directly to memcg
 		 * domains.  Convert them to ratios by scaling against
-		 * globally available memory.
+		 * globally available memory.  As the ratios are in
+		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
+		 * pages.
 		 */
 		if (bytes)
-			ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
-				    global_avail, 100UL);
+			ratio = min(DIV_ROUND_UP(bytes, global_avail),
+				    PAGE_SIZE);
 		if (bg_bytes)
-			bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
-				       global_avail, 100UL);
+			bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
+				       PAGE_SIZE);
 		bytes = bg_bytes = 0;
 	}
 
 	if (bytes)
 		thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
 	else
-		thresh = (ratio * available_memory) / 100;
+		thresh = (ratio * available_memory) / PAGE_SIZE;
 
 	if (bg_bytes)
 		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
 	else
-		bg_thresh = (bg_ratio * available_memory) / 100;
+		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
 
 	if (bg_thresh >= thresh)
 		bg_thresh = thresh / 2;
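The minimum-ratio claim in the changelog and the resulting per-domain threshold can be checked with a small userspace model of the patched memcg path. This is a sketch under stated assumptions, not the kernel code: it assumes 4096-byte pages and a 64-bit `unsigned long`, and `SKETCH_PAGE_SIZE` and `memcg_thresh_pages` are names invented for the example.

```c
#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Model of the patched path for a memcg domain: convert the byte limit
 * to a ratio in units of 1/PAGE_SIZE against globally available pages,
 * clamp it to 100% (PAGE_SIZE/PAGE_SIZE), then scale the domain's own
 * available pages by that ratio.  Returns the threshold in pages.
 */
static unsigned long memcg_thresh_pages(unsigned long bytes,
					unsigned long global_avail_pages,
					unsigned long domain_avail_pages)
{
	unsigned long ratio = DIV_ROUND_UP(bytes, global_avail_pages);

	if (ratio > SKETCH_PAGE_SIZE)
		ratio = SKETCH_PAGE_SIZE;

	return ratio * domain_avail_pages / SKETCH_PAGE_SIZE;
}
```

With 49283072 globally available pages (188GB taken entirely as available, which is a simplification), even `bytes = 1` rounds up to a ratio of 1/4096, so a domain that sees all of that memory gets a threshold of 12032 pages instead of zero. For the reporter's vm.dirty_bytes of 536870912, a hypothetical domain with 262144 available pages (1GB) would get 11 * 262144 / 4096 = 704 pages.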
* Re: [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits()
From: Jan Kara @ 2016-05-30  8:05 UTC
To: Tejun Heo
Cc: Jens Axboe, Jan Kara, Fengguang Wu, linux-kernel, Miao Xie, kernel-team

On Fri 27-05-16 14:34:46, Tejun Heo wrote:
> As vm.dirty_[background_]bytes can't be applied verbatim to multiple
> cgroup writeback domains, they get converted to percentages in
> domain_dirty_limits() and applied the same way as
> vm.dirty_[background_]ratio.  However, if the specified bytes is lower
> than 1% of available memory, the calculated ratios become zero and the
> writeback domain gets throttled constantly.
>
> Fix it by using per-PAGE_SIZE instead of percentage for ratio
> calculations.  Also, the updated DIV_ROUND_UP() usages now should
> yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
> bytes are above zero.

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Just one nit below:

> @@ -386,26 +387,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>  		/*
>  		 * The byte settings can't be applied directly to memcg
>  		 * domains.  Convert them to ratios by scaling against
> -		 * globally available memory.
> +		 * globally available memory.  As the ratios are in
> +		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
> +		 * pages.

The comment would be more comprehensible to me if the last sentence was
"... by dividing bytes by number of pages".

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
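The reason "bytes divided by number of pages" already yields a per-PAGE_SIZE ratio can be written out explicitly. The lines below are a worked restatement of the expressions in the patch, ignoring the DIV_ROUND_UP() rounding and the clamp to PAGE_SIZE; global_avail and available_memory are both counted in pages.

```latex
\begin{align*}
\text{ratio}
  &= \frac{\text{bytes} / \text{PAGE\_SIZE}}{\text{global\_avail}}
     \times \text{PAGE\_SIZE}
   = \frac{\text{bytes}}{\text{global\_avail}} \\
\text{thresh}
  &= \frac{\text{ratio} \times \text{available\_memory}}{\text{PAGE\_SIZE}}
\end{align*}
```

The first line takes the byte limit as a fraction of global memory (pages over pages) and scales it by PAGE_SIZE so that the result is expressed in 4096ths rather than hundredths; the two PAGE_SIZE factors cancel, leaving bytes over pages.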
* Re: [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits()
From: Jens Axboe @ 2016-05-30 14:55 UTC
To: Tejun Heo, Jan Kara; +Cc: Fengguang Wu, linux-kernel, Miao Xie, kernel-team

On 05/27/2016 12:34 PM, Tejun Heo wrote:
> As vm.dirty_[background_]bytes can't be applied verbatim to multiple
> cgroup writeback domains, they get converted to percentages in
> domain_dirty_limits() and applied the same way as
> vm.dirty_[background_]ratio.  However, if the specified bytes is lower
> than 1% of available memory, the calculated ratios become zero and the
> writeback domain gets throttled constantly.
>
> Fix it by using per-PAGE_SIZE instead of percentage for ratio
> calculations.  Also, the updated DIV_ROUND_UP() usages now should
> yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
> bytes are above zero.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Miao Xie <miaoxie@huawei.com>
> Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
> Cc: stable@vger.kernel.org # v4.2+
> Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()")

Queued up for this series, with the minor comment tweak that Jan suggested.

-- 
Jens Axboe