* [ATTEND] [LSF/MM TOPIC] Buffered writes throttling
  From: Suresh Jayaraman @ 2012-03-02 7:18 UTC
  To: lsf-pc; +Cc: Vivek Goyal, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara

Committee members,

Please consider inviting me to the Storage, Filesystem, & MM Summit. I am working in one of the kernel teams in SUSE Labs, focusing on network filesystems and the block layer.

Recently, I have been trying to solve the problem of "throttling buffered writes" to make per-cgroup throttling of IO to the device possible. Currently the block IO controller does not throttle buffered writes: by the time they reach the block IO layer, the writes have lost the submitter's context (the I/O comes in the flusher thread's context). I looked at the past work; many folks have attempted to solve this problem over the years, but it remains unsolved so far.

First, Andrea Righi tried to solve this by limiting the rate of async writes at the time a task is generating dirty pages in the page cache.

Next, Vivek Goyal tried to solve this by throttling writes at the time they are entering the page cache.

Both these approaches have limitations and were not considered for merging.

I have looked at the possibility of solving this at the filesystem level, but the problem with ext* filesystems is that a commit will commit the whole transaction at once (which may contain writes from processes belonging to more than one cgroup). Making filesystems cgroup aware would need a redesign of the journalling layer itself.

Dave Chinner thinks this problem should be solved, and is being solved, in a different manner by making the bdi-flusher writeback cgroup aware.

Greg Thelen's memcg writeback patchset (already proposed for the LSF/MM summit this year) adds cgroup awareness to writeback. Some aspects of this patchset could be borrowed for solving the problem of throttling buffered writes.

As I understand it, the topic was discussed during the last Kernel Summit as well, and the idea is to get the IO-less throttling patchset into the kernel, then do per-memcg dirty memory limiting and add some memcg awareness to writeback (Greg Thelen), and then, when these things settle down, think about how to solve this problem, since no one really seems to have a good answer to it.

Having worked in the Linux filesystem/storage area for a few years now, and having spent time understanding the various approaches tried and looking at other feasible ways of solving this problem, I look forward to participating in the summit and discussions.

So, the topic I would like to discuss is solving the problem of "throttling buffered writes". This could be considered for discussion with the memcg writeback session if that topic has been allocated a slot.

I'm aware that this is a late submission and my apologies for not making it earlier. But I want to take my chances and see if it is still possible.

Thanks
Suresh
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-02 7:18 [ATTEND] [LSF/MM TOPIC] Buffered writes throttling Suresh Jayaraman @ 2012-03-02 15:33 ` Vivek Goyal 2012-03-05 19:22 ` Fengguang Wu 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 0 siblings, 2 replies; 21+ messages in thread From: Vivek Goyal @ 2012-03-02 15:33 UTC (permalink / raw) To: Suresh Jayaraman; +Cc: lsf-pc, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote: > Committee members, > > Please consider inviting me to the Storage, Filesystem, & MM Summit. I > am working for one of the kernel teams in SUSE Labs focusing on Network > filesystems and block layer. > > Recently, I have been trying to solve the problem of "throttling > buffered writes" to make per-cgroup throttling of IO to the device > possible. Currently the block IO controller does not throttle buffered > writes. The writes would have lost the submitter's context (I/O comes in > flusher thread's context) when they are at the block IO layer. I looked > at the past work and many folks have attempted to solve this problem in > the past years but this problem remains unsolved so far. > > First, Andrea Righi tried to solve this by limiting the rate of async > writes at the time a task is generating dirty pages in the page cache. > > Next, Vivek Goyal tried to solve this by throttling writes at the time > they are entering the page cache. > > Both these approches have limitations and not considered for merging. > > I have looked at the possibility of solving this at the filesystem level > but the problem with ext* filesystems is that a commit will commit the > whole transaction at once (which may contain writes from > processes belonging to more than one cgroup). Making filesystems cgroup > aware would need redesign of journalling layer itself. > > Dave Chinner thinks this problem should be solved and being solved in a > different manner by making the bdi-flusher writeback cgroup aware. > > Greg Thelen's memcg writeback patchset (already been proposed for LSF/MM > summit this year) adds cgroup awareness to writeback. Some aspects of > this patchset could be borrowed for solving the problem of throttling > buffered writes. > > As I understand the topic was discussed during last Kernel Summit as > well and the idea is to get the IO-less throttling patchset into the > kernel, then do per-memcg dirty memory limiting and add some memcg > awareness to writeback Greg Thelen and then when these things settle > down, think how to solve this problem since noone really seem to have a > good answer to it. > > Having worked on linux filesystem/storage area for a few years now and > having spent time understanding the various approaches tried and looked > at other feasible way of solving this problem, I look forward to > participate in the summit and discussions. > > So, the topic I would like to discuss is solving the problem of > "throttling buffered writes". This could considered for discussion with > memcg writeback session if that topic has been allocated a slot. > > I'm aware that this is a late submission and my apologies for not making > it earlier. But, I want to take chances and see if it is possible still.. This is an interesting and complicated topic. As you mentioned we have had tried to solve it but nothing has been merged yet. Personally, I am still interested in having a discussion and see if we can come up with a way forward. 
Personally, I am still interested in having a discussion and seeing if we can come up with a way forward.

Because filesystems are not cgroup aware, throttling IO below the filesystem has the danger of IO from faster cgroups being throttled behind a slower cgroup (journalling was one example and there could be others). Hence, I personally think that this problem should be solved at a higher layer, that is, when we are actually writing to the cache. That has the disadvantage of still seeing IO spikes at the device, but I guess we live with that. Doing it at a higher layer also allows the same logic to be used for NFS too; otherwise NFS buffered writes will continue to be a problem.

In the case of the memory controller it just becomes a write-to-memory issue, and I am not sure if the notion of dirty_ratio and dirty_bytes is enough or whether we need to rate limit the write to memory.

Anyway, ideas to have better control of write rates are welcome. We have seen issues where a virtual machine cloning operation is going on, we also want a small direct write to be on disk, and it can take a long time with deadline. CFQ should still be fine as direct IO is synchronous, but deadline treats all WRITEs the same way.

Maybe deadline should be modified to differentiate between SYNC and ASYNC IO instead of READ/WRITE. Jens?

Thanks
Vivek
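[To make the "throttle when the task writes to the cache" idea above concrete, here is a minimal sketch of a per-cgroup token bucket that a function like balance_dirty_pages() could consult before letting a task dirty more pages. It is illustrative only and not taken from any posted patch; the structure and function names are hypothetical, and rate_bps is assumed to be configured to a non-zero value.]

	#include <linux/jiffies.h>
	#include <linux/math64.h>
	#include <linux/spinlock.h>

	/* Hypothetical per-cgroup dirty-rate state, not an existing interface. */
	struct dirty_throttle {
		u64		rate_bps;	/* configured buffered-write limit */
		u64		tokens;		/* bytes the group may still dirty */
		unsigned long	last_refill;	/* jiffies of the last refill */
		spinlock_t	lock;
	};

	/*
	 * Refill the bucket for the elapsed time, then try to charge 'bytes'.
	 * Returns 0 if the caller may proceed, or the number of jiffies it
	 * should sleep before retrying.
	 */
	static unsigned long dirty_throttle_charge(struct dirty_throttle *dt,
						   unsigned long bytes)
	{
		unsigned long now = jiffies, sleep = 0;

		spin_lock(&dt->lock);
		dt->tokens += div_u64(dt->rate_bps * (now - dt->last_refill), HZ);
		dt->tokens = min(dt->tokens, dt->rate_bps);	/* cap the burst at ~1s */
		dt->last_refill = now;

		if (dt->tokens >= bytes)
			dt->tokens -= bytes;
		else	/* not enough budget: tell the caller how long to wait */
			sleep = div64_u64((bytes - dt->tokens) * HZ, dt->rate_bps);
		spin_unlock(&dt->lock);

		return sleep;
	}

[Because the check happens before pages enter writeback, the same mechanism would apply to NFS and other non-block filesystems, which is the main attraction of throttling at this level.]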
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-02 15:33 ` Vivek Goyal @ 2012-03-05 19:22 ` Fengguang Wu 2012-03-05 21:11 ` Vivek Goyal 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 1 sibling, 1 reply; 21+ messages in thread From: Fengguang Wu @ 2012-03-05 19:22 UTC (permalink / raw) To: Vivek Goyal Cc: Suresh Jayaraman, lsf-pc, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Fri, Mar 02, 2012 at 10:33:23AM -0500, Vivek Goyal wrote: > On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote: > > Committee members, > > > > Please consider inviting me to the Storage, Filesystem, & MM Summit. I > > am working for one of the kernel teams in SUSE Labs focusing on Network > > filesystems and block layer. > > > > Recently, I have been trying to solve the problem of "throttling > > buffered writes" to make per-cgroup throttling of IO to the device > > possible. Currently the block IO controller does not throttle buffered > > writes. The writes would have lost the submitter's context (I/O comes in > > flusher thread's context) when they are at the block IO layer. I looked > > at the past work and many folks have attempted to solve this problem in > > the past years but this problem remains unsolved so far. > > > > First, Andrea Righi tried to solve this by limiting the rate of async > > writes at the time a task is generating dirty pages in the page cache. > > > > Next, Vivek Goyal tried to solve this by throttling writes at the time > > they are entering the page cache. > > > > Both these approches have limitations and not considered for merging. > > > > I have looked at the possibility of solving this at the filesystem level > > but the problem with ext* filesystems is that a commit will commit the > > whole transaction at once (which may contain writes from > > processes belonging to more than one cgroup). Making filesystems cgroup > > aware would need redesign of journalling layer itself. > > > > Dave Chinner thinks this problem should be solved and being solved in a > > different manner by making the bdi-flusher writeback cgroup aware. > > > > Greg Thelen's memcg writeback patchset (already been proposed for LSF/MM > > summit this year) adds cgroup awareness to writeback. Some aspects of > > this patchset could be borrowed for solving the problem of throttling > > buffered writes. > > > > As I understand the topic was discussed during last Kernel Summit as > > well and the idea is to get the IO-less throttling patchset into the > > kernel, then do per-memcg dirty memory limiting and add some memcg > > awareness to writeback Greg Thelen and then when these things settle > > down, think how to solve this problem since noone really seem to have a > > good answer to it. > > > > Having worked on linux filesystem/storage area for a few years now and > > having spent time understanding the various approaches tried and looked > > at other feasible way of solving this problem, I look forward to > > participate in the summit and discussions. > > > > So, the topic I would like to discuss is solving the problem of > > "throttling buffered writes". This could considered for discussion with > > memcg writeback session if that topic has been allocated a slot. > > > > I'm aware that this is a late submission and my apologies for not making > > it earlier. But, I want to take chances and see if it is possible still.. > > This is an interesting and complicated topic. As you mentioned we have had > tried to solve it but nothing has been merged yet. 
> Personally, I am still interested in having a discussion and seeing if
> we can come up with a way forward.

I'm interested, too. Here is my attempt on the problem a year ago:

blk-cgroup: async write IO controller ("buffered write" would be more precise)
https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d
https://lkml.org/lkml/2011/4/4/205

> Because filesystems are not cgroup aware, throttling IO below the
> filesystem has the danger of IO from faster cgroups being throttled
> behind a slower cgroup (journalling was one example and there could be
> others). Hence, I personally think that this problem should be solved at
> a higher layer, that is, when we are actually writing to the cache. That
> has the disadvantage of still seeing IO spikes at the device, but I guess
> we live with that. Doing it at a higher layer also allows the same logic
> to be used for NFS too; otherwise NFS buffered writes will continue to be
> a problem.

Totally agreed.

> In the case of the memory controller it just becomes a write-to-memory
> issue, and I am not sure if the notion of dirty_ratio and dirty_bytes is
> enough or whether we need to rate limit the write to memory.

In a perfect world, the dirty size and the dirty rate would each be balanced around their targets. Ideally we could independently limit the dirty size in the memcg context and the dirty rate in blkcg. If the user wants to control both size and rate, he may put tasks into a memcg as well as a blkcg.

In reality the dirty size limit will impact the dirty rate, because the memcg needs to adjust its tasks' balanced dirty rate to drive the memcg dirty size to the target, and so does the global dirty target.

Compared to the global dirty size balancing, memcg suffers from a unique problem: given N memcgs each running a dd task, each memcg's dirty size will drop suddenly every (N/2) seconds, because the flusher writes out the inodes in a coarse, time-split, round-robin fashion, with chunks of up to (bdi->write_bandwidth/2). That sudden drop of memcg dirty pages may drive the dirty size far from the target; as a result the dirty rate will need to be adjusted heavily in order to drive the dirty size back to the target. So the memcg dirty size balancing may create large fluctuations in the dirty rates, and even long stall times for the memcg tasks.

What's more, due to the uncontrollable way the flusher walks through the dirty pages, and to how the dirty pages are distributed among the dirty inodes and memcgs, the dirty rate will be impacted heavily by the workload and the behavior of the flusher when enforcing the dirty size target. There is no satisfactory solution to this so far.

Currently I'm trying to shy away from this and look into improving page reclaim so that it can work well with LRU lists in which half the pages are dirty/writeback. Then the 20% global dirty limit should be enough to serve most memcg tasks well, taking into account the unevenly distributed dirty pages among the different memcgs and NUMA zones/nodes. There may still be a few memcgs that need further dirty throttling, but they likely consist mainly of heavy dirtiers and can afford less smoothness and longer delays.

In comparison, the dirty rate limit for buffered writes seems less convoluted to me. It sure has its own problems, so we see several solutions in circulation, each with its unique trade-offs. But at least we have relatively simple solutions that work to their design goals.

> Anyway, ideas to have better control of write rates are welcome.
> We have seen issues where a virtual machine cloning operation is going
> on, we also want a small direct write to be on disk, and it can take a
> long time with deadline. CFQ should still be fine as direct IO is
> synchronous, but deadline treats all WRITEs the same way.
>
> Maybe deadline should be modified to differentiate between SYNC and ASYNC
> IO instead of READ/WRITE. Jens?

In general users definitely need higher priorities for SYNC writes. It will also enable the "buffered write I/O controller" and the "direct write I/O controller" to co-exist well and operate independently this way: the direct writes always enjoy higher priority than the flusher, but will be rate limited by the already upstreamed blk-cgroup I/O controller. The remaining disk bandwidth will be split among the buffered write tasks by another I/O controller operating at the balance_dirty_pages() level.

Thanks,
Fengguang
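[For reference, the SYNC/ASYNC idea for deadline could look roughly like the sketch below: classify requests by rq_is_sync() instead of by data direction when picking a FIFO. This is only an illustration of the suggestion, not a tested patch; it assumes the two-queue layout of the deadline scheduler of this era.]

	#include <linux/blkdev.h>

	static inline int dd_queue_index(struct request *rq)
	{
		/*
		 * Reads and synchronous writes (e.g. O_DIRECT) share the
		 * low-latency queue; only async (flusher) writeback goes to
		 * the second queue.
		 */
		return rq_is_sync(rq) ? 0 : 1;
	}

	/*
	 * deadline_add_request() would then index dd->fifo_list[] and the
	 * per-queue expire times with dd_queue_index(rq) instead of
	 * rq_data_dir(rq).
	 */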
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling
  From: Vivek Goyal @ 2012-03-05 21:11 UTC
  To: Fengguang Wu
  Cc: Suresh Jayaraman, lsf-pc, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen

On Mon, Mar 05, 2012 at 11:22:26AM -0800, Fengguang Wu wrote:

[..]
> > This is an interesting and complicated topic. As you mentioned we have
> > tried to solve it but nothing has been merged yet. Personally, I am
> > still interested in having a discussion and seeing if we can come up
> > with a way forward.
>
> I'm interested, too. Here is my attempt on the problem a year ago:
>
> blk-cgroup: async write IO controller ("buffered write" would be more precise)
> https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d
> https://lkml.org/lkml/2011/4/4/205

That was a proof of concept. Now we will need to provide actual user-visible knobs and integrate with one of the existing controllers (memcg or blkcg).

[..]
> > Anyway, ideas to have better control of write rates are welcome. We
> > have seen issues where a virtual machine cloning operation is going on,
> > we also want a small direct write to be on disk, and it can take a long
> > time with deadline. CFQ should still be fine as direct IO is
> > synchronous, but deadline treats all WRITEs the same way.
> >
> > Maybe deadline should be modified to differentiate between SYNC and
> > ASYNC IO instead of READ/WRITE. Jens?
>
> In general users definitely need higher priorities for SYNC writes. It
> will also enable the "buffered write I/O controller" and the "direct
> write I/O controller" to co-exist well and operate independently this
> way: the direct writes always enjoy higher priority than the flusher,
> but will be rate limited by the already upstreamed blk-cgroup I/O
> controller. The remaining disk bandwidth will be split among the
> buffered write tasks by another I/O controller operating at the
> balance_dirty_pages() level.

OK, so differentiating IO between SYNC and ASYNC makes sense, and it probably will make sense in the case of deadline too (unless there is a reason to keep it the existing way).

I am a little wary of keeping the "dirty rate limit" separate from the rest of the limits, as the configuration of groups becomes even harder. Once you put a workload in a cgroup, you now need to configure multiple rate limits: a "reads and direct writes" limit plus a "buffered write rate limit". To add to the confusion, it is not just a direct write limit; it also is a limit on writethrough writes, where fsync writes show up in the context of the writing thread.

But it looks like we don't have much choice. As buffered writes can be controlled at two levels, we probably need two knobs. Also, controlling writes while they enter the cache will be a global limit and not per device (unlike the current per-device limit in the blkio controller). Having a separate control for the "dirty rate limit" leaves scope for implementing write control at the device level in the future (as some people prefer that). Possibly the two solutions can co-exist in the future.

Assuming this means that we both agree that there should be some sort of knob to control the "dirty rate", the question is where it should be: in memcg or blkcg.
Given the fact that we are controlling the write to memory, and that we are already planning to have per-memcg dirty ratio and dirty bytes, to me it makes more sense to integrate this new limit with memcg instead of blkcg. The block layer does not even come into the picture at that level, hence implementing something in blkcg would be a little out of place?

Thanks
Vivek
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 21:11 ` Vivek Goyal @ 2012-03-05 22:30 ` Fengguang Wu 2012-03-05 23:19 ` Andrea Righi 2012-03-05 22:58 ` Andrea Righi 1 sibling, 1 reply; 21+ messages in thread From: Fengguang Wu @ 2012-03-05 22:30 UTC (permalink / raw) To: Vivek Goyal Cc: Suresh Jayaraman, lsf-pc, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote: > On Mon, Mar 05, 2012 at 11:22:26AM -0800, Fengguang Wu wrote: > > [..] > > > This is an interesting and complicated topic. As you mentioned we have had > > > tried to solve it but nothing has been merged yet. Personally, I am still > > > interested in having a discussion and see if we can come up with a way > > > forward. > > > > I'm interested, too. Here is my attempt on the problem a year ago: > > > > blk-cgroup: async write IO controller ("buffered write" would be more precise) > > https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d > > https://lkml.org/lkml/2011/4/4/205 > > That was a proof of concept. Now we will need to provide actual user > visibale knobs and integrate with one of the existing controller (memcg > or blkcg). The next commit adds the interface to blkcg: throttle.async_write_bps Note that it's simply exporting the knobs via blkcg and does not really depend on the blkcg functionalities. > [..] > > > Anyway, ideas to have better control of write rates are welcome. We have > > > seen issues wheren a virtual machine cloning operation is going on and > > > we also want a small direct write to be on disk and it can take a long > > > time with deadline. CFQ should still be fine as direct IO is synchronous > > > but deadline treats all WRITEs the same way. > > > > > > May be deadline should be modified to differentiate between SYNC and ASYNC > > > IO instead of READ/WRITE. Jens? > > > > In general users definitely need higher priorities for SYNC writes. It > > will also enable the "buffered write I/O controller" and "direct write > > I/O controller" to co-exist well and operate independently this way: > > the direct writes always enjoy higher priority than the flusher, but > > will be rate limited by the already upstreamed blk-cgroup I/O > > controller. The remaining disk bandwidth will be split among the > > buffered write tasks by another I/O controller operating at the > > balance_dirty_pages() level. > > Ok, so differentiating IO among SYNC/ASYNC makes sense and it probably > will make sense in case of deadline too. (Until and unless there is a > reason to keep it existing way). Agreed. But note that the deadline I/O scheduler has nothing to do with the I/O controllers. > I am little vary of keeping "dirty rate limit" separate from rest of the > limits as configuration of groups becomes even harder. Once you put a > workload in a cgroup, now you need to configure multiple rate limits. > "reads and direct writes" limit + "buffered write rate limit". Good point. If we really want it, it's technically possible to provide one single write rate limit to the user. The way is to account the current DIRECT write bandwidth. Subtract it from the general write rate limit, we get the limit available for buffered writes. Thus we'll be providing some "throttle.total_write_bps" rather than "throttle.async_write_bps". Oh it may be difficult to implement total_write_bps for direct writes, which is implemented at the device level. 
But still, if it's the right interface to have, we can make it happen by calling into balance_dirty_pages() (or some algorithm abstracted from it) at the end of each direct write and letting it handle the global throttling.

> To add to the confusion, it is not just a direct write limit; it also is
> a limit on writethrough writes, where fsync writes show up in the context
> of the writing thread.

Sorry, I'm not sure I caught the words. Is it that O_SYNC writes can and would be (confusingly) rate limited at both levels?

> But it looks like we don't have much choice. As buffered writes can be
> controlled at two levels, we probably need two knobs. Also, controlling
> writes while they enter the cache will be a global limit and not per
> device (unlike the current per-device limit in the blkio controller).
> Having a separate control for the "dirty rate limit" leaves scope for
> implementing write control at the device level in the future (as some
> people prefer that). Possibly the two solutions can co-exist in the
> future.

Good point. balance_dirty_pages() has no idea about the devices at all. So the rate limit for buffered writes can hardly be unified with the per-device rate limit for direct writes.

BTW, it may have technical merits to enforce per-bdi buffered write rate limits for each cgroup: imagine it's writing concurrently to a 10MB/s USB key and a 100MB/s disk. But perhaps there are also demerits when all the user wants is the gross write rate, rather than having to care about the unnecessary partitioning between sda1 and sda2.

> Assuming this means that we both agree that there should be some sort of
> knob to control the "dirty rate", the question is where it should be: in
> memcg or blkcg. Given the fact that we are controlling the write to
> memory, and that we are already planning to have per-memcg dirty ratio
> and dirty bytes, to me it makes more sense to integrate this new limit
> with memcg instead of blkcg. The block layer does not even come into the
> picture at that level, hence implementing something in blkcg would be a
> little out of place?

I personally prefer memcg for dirty sizes and blkcg for dirty rates.

Thanks,
Fengguang
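[A rough sketch of the single-knob variant discussed above, one total write limit with buffered writes getting whatever bandwidth direct writes leave over, might look like the following. The throttle.total_write_bps knob, the structure, the fields and the helper are all hypothetical.]

	#include <linux/jiffies.h>
	#include <linux/math64.h>

	struct cgroup_write_limit {
		u64		total_write_bps;	/* the single user-visible knob */
		u64		direct_bytes;		/* direct IO bytes in the current window */
		unsigned long	window_start;		/* jiffies when the window began */
	};

	/* How many bytes/sec may buffered writeback still use for this cgroup? */
	static u64 buffered_write_budget(struct cgroup_write_limit *wl)
	{
		unsigned long elapsed = jiffies - wl->window_start;
		u64 direct_bps;

		if (!elapsed)
			return wl->total_write_bps;

		/* Measured direct-write bandwidth over the window. */
		direct_bps = div64_u64(wl->direct_bytes * HZ, elapsed);
		return wl->total_write_bps > direct_bps ?
		       wl->total_write_bps - direct_bps : 0;
	}

[The direct-write side would keep charging direct_bytes from the existing device-level throttling path; whether that coupling is worth the complexity is exactly the trade-off discussed above.]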
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling
  From: Andrea Righi @ 2012-03-05 23:19 UTC
  To: Fengguang Wu
  Cc: Vivek Goyal, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen

On Mon, Mar 05, 2012 at 02:30:29PM -0800, Fengguang Wu wrote:
> On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote:
...
> > But it looks like we don't have much choice. As buffered writes can be
> > controlled at two levels, we probably need two knobs. Also, controlling
> > writes while they enter the cache will be a global limit and not per
> > device (unlike the current per-device limit in the blkio controller).
> > Having a separate control for the "dirty rate limit" leaves scope for
> > implementing write control at the device level in the future (as some
> > people prefer that). Possibly the two solutions can co-exist in the
> > future.
>
> Good point. balance_dirty_pages() has no idea about the devices at
> all. So the rate limit for buffered writes can hardly be unified with
> the per-device rate limit for direct writes.

I think balance_dirty_pages() can have an idea about devices. We can get a reference to the right block device / request queue from the address_space:

	bdev = mapping->host->i_sb->s_bdev;
	q = bdev_get_queue(bdev);

(NULL pointer dereferences apart).

-Andrea
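[Spelled out with the NULL checks Andrea alludes to, the lookup might look like the sketch below. The helper name is made up; note that s_bdev is NULL for non-block filesystems such as NFS, which is one reason the thread keeps returning to throttling above the block layer.]

	#include <linux/blkdev.h>
	#include <linux/fs.h>

	static struct request_queue *mapping_to_queue(struct address_space *mapping)
	{
		struct inode *inode = mapping->host;
		struct block_device *bdev;

		if (!inode || !inode->i_sb)
			return NULL;

		bdev = inode->i_sb->s_bdev;	/* NULL for NFS, tmpfs, ... */
		if (!bdev)
			return NULL;

		return bdev_get_queue(bdev);
	}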
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 23:19 ` Andrea Righi @ 2012-03-05 23:51 ` Fengguang Wu 2012-03-06 0:46 ` Andrea Righi 0 siblings, 1 reply; 21+ messages in thread From: Fengguang Wu @ 2012-03-05 23:51 UTC (permalink / raw) To: Andrea Righi Cc: Vivek Goyal, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Tue, Mar 06, 2012 at 12:19:30AM +0100, Andrea Righi wrote: > On Mon, Mar 05, 2012 at 02:30:29PM -0800, Fengguang Wu wrote: > > On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote: > ... > > > But looks like we don't much choice. As buffered writes can be controlled > > > at two levels, we probably need two knobs. Also controlling writes while > > > entring cache limits will be global and not per device (unlinke currnet > > > per device limit in blkio controller). Having separate control for "dirty > > > rate limit" leaves the scope for implementing write control at device > > > level in the future (As some people prefer that). In possibly two > > > solutions can co-exist in future. > > > > Good point. balance_dirty_pages() has no idea about the devices at > > all. So the rate limit for buffered writes can hardly be unified with > > the per-device rate limit for direct writes. > > > > I think balance_dirty_pages() can have an idea about devices. We can get > a reference to the right block device / request queue from the > address_space: > > bdev = mapping->host->i_sb->s_bdev; > q = bdev_get_queue(bdev); > > (NULL pointer dereferences apart). Problem is, there is no general 1:1 mapping between bdev and disks. For the single disk multpile partitions (sda1, sda2...) case, the above scheme is fine and makes the throttle happen at sda granularity. However for md/dm etc. there is no way (or need?) to reach the exact disk that current blkcg is operating on. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 23:51 ` Fengguang Wu @ 2012-03-06 0:46 ` Andrea Righi 2012-03-07 20:26 ` Vivek Goyal 0 siblings, 1 reply; 21+ messages in thread From: Andrea Righi @ 2012-03-06 0:46 UTC (permalink / raw) To: Fengguang Wu Cc: Vivek Goyal, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Mon, Mar 05, 2012 at 03:51:32PM -0800, Fengguang Wu wrote: > On Tue, Mar 06, 2012 at 12:19:30AM +0100, Andrea Righi wrote: > > On Mon, Mar 05, 2012 at 02:30:29PM -0800, Fengguang Wu wrote: > > > On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote: > > ... > > > > But looks like we don't much choice. As buffered writes can be controlled > > > > at two levels, we probably need two knobs. Also controlling writes while > > > > entring cache limits will be global and not per device (unlinke currnet > > > > per device limit in blkio controller). Having separate control for "dirty > > > > rate limit" leaves the scope for implementing write control at device > > > > level in the future (As some people prefer that). In possibly two > > > > solutions can co-exist in future. > > > > > > Good point. balance_dirty_pages() has no idea about the devices at > > > all. So the rate limit for buffered writes can hardly be unified with > > > the per-device rate limit for direct writes. > > > > > > > I think balance_dirty_pages() can have an idea about devices. We can get > > a reference to the right block device / request queue from the > > address_space: > > > > bdev = mapping->host->i_sb->s_bdev; > > q = bdev_get_queue(bdev); > > > > (NULL pointer dereferences apart). > > Problem is, there is no general 1:1 mapping between bdev and disks. > For the single disk multpile partitions (sda1, sda2...) case, the > above scheme is fine and makes the throttle happen at sda granularity. > > However for md/dm etc. there is no way (or need?) to reach the exact > disk that current blkcg is operating on. > > Thanks, > Fengguang Oh I see, the problem is with stacked block devices. Right, if we set a limit for sda and a stacked block device is defined over sda, we'd get only the bdev at the top of the stack at balance_dirty_pages() and the limits configured for the underlying block devices will be ignored. However, maybe for the 90% of the cases this is fine, I can't see a real world scenario where we may want to limit only part or indirectly a stacked block device... Thanks, -Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-06 0:46 ` Andrea Righi @ 2012-03-07 20:26 ` Vivek Goyal 0 siblings, 0 replies; 21+ messages in thread From: Vivek Goyal @ 2012-03-07 20:26 UTC (permalink / raw) To: Andrea Righi Cc: Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Tue, Mar 06, 2012 at 01:46:02AM +0100, Andrea Righi wrote: [..] > > > > Good point. balance_dirty_pages() has no idea about the devices at > > > > all. So the rate limit for buffered writes can hardly be unified with > > > > the per-device rate limit for direct writes. > > > > > > > > > > I think balance_dirty_pages() can have an idea about devices. We can get > > > a reference to the right block device / request queue from the > > > address_space: > > > > > > bdev = mapping->host->i_sb->s_bdev; > > > q = bdev_get_queue(bdev); > > > > > > (NULL pointer dereferences apart). > > > > Problem is, there is no general 1:1 mapping between bdev and disks. > > For the single disk multpile partitions (sda1, sda2...) case, the > > above scheme is fine and makes the throttle happen at sda granularity. > > > > However for md/dm etc. there is no way (or need?) to reach the exact > > disk that current blkcg is operating on. > > > > Thanks, > > Fengguang > > Oh I see, the problem is with stacked block devices. Right, if we set a > limit for sda and a stacked block device is defined over sda, we'd get > only the bdev at the top of the stack at balance_dirty_pages() and the > limits configured for the underlying block devices will be ignored. > > However, maybe for the 90% of the cases this is fine, I can't see a real > world scenario where we may want to limit only part or indirectly a > stacked block device... I agree that throttling will make most sense on the top most device in the stack. If we try to do anything on the intermediate device, it might not make much sense and we will most likely lose context also. Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 21:11 ` Vivek Goyal 2012-03-05 22:30 ` Fengguang Wu @ 2012-03-05 22:58 ` Andrea Righi 2012-03-07 20:52 ` Vivek Goyal 1 sibling, 1 reply; 21+ messages in thread From: Andrea Righi @ 2012-03-05 22:58 UTC (permalink / raw) To: Vivek Goyal Cc: Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote: > On Mon, Mar 05, 2012 at 11:22:26AM -0800, Fengguang Wu wrote: > > [..] > > > This is an interesting and complicated topic. As you mentioned we have had > > > tried to solve it but nothing has been merged yet. Personally, I am still > > > interested in having a discussion and see if we can come up with a way > > > forward. > > > > I'm interested, too. Here is my attempt on the problem a year ago: > > > > blk-cgroup: async write IO controller ("buffered write" would be more precise) > > https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d > > https://lkml.org/lkml/2011/4/4/205 > > That was a proof of concept. Now we will need to provide actual user > visibale knobs and integrate with one of the existing controller (memcg > or blkcg). > > [..] > > > Anyway, ideas to have better control of write rates are welcome. We have > > > seen issues wheren a virtual machine cloning operation is going on and > > > we also want a small direct write to be on disk and it can take a long > > > time with deadline. CFQ should still be fine as direct IO is synchronous > > > but deadline treats all WRITEs the same way. > > > > > > May be deadline should be modified to differentiate between SYNC and ASYNC > > > IO instead of READ/WRITE. Jens? > > > > In general users definitely need higher priorities for SYNC writes. It > > will also enable the "buffered write I/O controller" and "direct write > > I/O controller" to co-exist well and operate independently this way: > > the direct writes always enjoy higher priority than the flusher, but > > will be rate limited by the already upstreamed blk-cgroup I/O > > controller. The remaining disk bandwidth will be split among the > > buffered write tasks by another I/O controller operating at the > > balance_dirty_pages() level. > > Ok, so differentiating IO among SYNC/ASYNC makes sense and it probably > will make sense in case of deadline too. (Until and unless there is a > reason to keep it existing way). > > I am little vary of keeping "dirty rate limit" separate from rest of the > limits as configuration of groups becomes even harder. Once you put a > workload in a cgroup, now you need to configure multiple rate limits. > "reads and direct writes" limit + "buffered write rate limit". To add > to the confusion, it is not just direct write limit, it also is a limit > on writethrough writes where fsync writes will show up in the context > of writing thread. > > But looks like we don't much choice. As buffered writes can be controlled > at two levels, we probably need two knobs. Also controlling writes while > entring cache limits will be global and not per device (unlinke currnet > per device limit in blkio controller). Having separate control for "dirty > rate limit" leaves the scope for implementing write control at device > level in the future (As some people prefer that). In possibly two > solutions can co-exist in future. > > Assuming this means that we both agree that three should be some sort of > knob to control "dirty rate", question is where should it be. In memcg > or blkcg. 
> Given the fact that we are controlling the write to memory, and that we
> are already planning to have per-memcg dirty ratio and dirty bytes, to me
> it makes more sense to integrate this new limit with memcg instead of
> blkcg. The block layer does not even come into the picture at that level,
> hence implementing something in blkcg would be a little out of place?
>
> Thanks
> Vivek

What about this scenario? (Sorry, I've not followed some of the recent discussions on this topic, so I'm sure I'm oversimplifying a bit or ignoring some details):

 - track inodes per-memcg for writeback IO (provided by Greg's patch)

 - provide a per-memcg dirty limit (global, not per-device); when this limit is exceeded, flusher threads are awakened and all tasks that continue to generate new dirty pages inside the memcg are put to sleep

 - flusher threads start to write some dirty inodes of this memcg (using the inode tracking feature); let's say they start with a chunk of N pages of the first dirty inode

 - flusher threads can't flush in this way more than N pages / sec (where N * PAGE_SIZE / sec is the blkcg "buffered write rate limit" on the inode's block device); if a flusher thread exceeds this limit it won't be blocked directly, it just stops flushing pages for this memcg after the first chunk and it can continue to flush dirty pages of a different memcg.

In this way tasks are actively limited at the memcg layer and the writeback rate is limited by the blkcg layer. The missing piece (that has not been proposed yet) is to plug into the flusher threads the logic "I can flush your memcg dirty pages only if your blkcg rate is OK, otherwise let's see if someone else needs to flush some dirty pages".

Thanks,
-Andrea
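[The "missing piece" Andrea describes could be sketched in pseudocode as below. Every helper here is hypothetical; the point is only to show where the blkcg budget check would sit in the flusher's loop over memcgs.]

	/* Pseudocode only: all iterators and helpers are hypothetical. */
	static void flush_dirty_memcgs(struct backing_dev_info *bdi)
	{
		struct mem_cgroup *memcg;

		for_each_dirty_memcg(bdi, memcg) {	/* hypothetical iterator */
			unsigned long chunk = writeback_chunk_pages(bdi);	/* ~bandwidth/2 */

			/*
			 * Over its blkcg buffered-write rate?  Skip it for now:
			 * its dirtiers keep sleeping at the memcg dirty limit,
			 * and we try again on a later pass.
			 */
			if (!blkcg_writeback_budget_ok(memcg, bdi, chunk))
				continue;

			writeback_memcg_inodes(memcg, bdi, chunk);	/* hypothetical */
			blkcg_writeback_charge(memcg, bdi, chunk);	/* hypothetical */
		}
	}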
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling
  From: Vivek Goyal @ 2012-03-07 20:52 UTC
  To: Andrea Righi
  Cc: Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen

On Mon, Mar 05, 2012 at 11:58:01PM +0100, Andrea Righi wrote:

[..]
> What about this scenario? (Sorry, I've not followed some of the recent
> discussions on this topic, so I'm sure I'm oversimplifying a bit or
> ignoring some details):
>
> - track inodes per-memcg for writeback IO (provided by Greg's patch)
> - provide a per-memcg dirty limit (global, not per-device); when this
>   limit is exceeded, flusher threads are awakened and all tasks that
>   continue to generate new dirty pages inside the memcg are put to
>   sleep
> - flusher threads start to write some dirty inodes of this memcg (using
>   the inode tracking feature); let's say they start with a chunk of N
>   pages of the first dirty inode
> - flusher threads can't flush in this way more than N pages / sec
>   (where N * PAGE_SIZE / sec is the blkcg "buffered write rate limit"
>   on the inode's block device); if a flusher thread exceeds this limit
>   it won't be blocked directly, it just stops flushing pages for this
>   memcg after the first chunk and it can continue to flush dirty pages
>   of a different memcg.

So, IIUC, the only thing that is a little different here is that the throttling is implemented by the flusher thread. But it is still per device, per cgroup. I think it is just an implementation detail whether we implement it in the block layer, in writeback, or somewhere else. We could very well implement it in the block layer and provide a per-bdi, per-group congestion flag so that the flusher will stop pushing more IO if a group on a bdi is congested (because its IO is throttled).

I think the first important thing is to figure out the minimal set of requirements (as Jan said in another mail) which will solve a wide variety of cases. I am trying to list some of the points.

- Throttling for buffered writes

  - Do we want per-device throttling limits or global throttling limits?

  - Existing direct write limits are per device and implemented in the
    block layer.

  - I personally think that both kinds of limits might make sense, but a
    global limit for async writes might make more sense, at least for
    workloads like backup which can run at a throttled speed.

  - Absolute throttling of IO will make the most sense on the top-level
    device in the IO stack.

  - For per-device rate throttling, do we want a common limit for direct
    writes and buffered writes, or a separate limit just for buffered
    writes?

- Proportional IO for async writes

  - Will probably make the most sense on the bottom-most devices in the
    IO stack (if we are able to somehow retain the submitter's context).

  - Logically it would make sense to keep sync and async writes in the
    same group and try to provide a fair share of the disk between
    groups. Technically CFQ can do that, but in practice I think it will
    be problematic: writes of one group will take precedence over reads
    of another group. Currently any read is prioritized over buffered
    writes, so by splitting buffered writes into their own cgroups, they
    can severely impact the latency of reads in another group. Not sure
    how many people really want to do that in practice.

  - Do we really need proportional IO for async writes? CFQ had tried
    implementing ioprio for async writes but it does not work.
    Should we just care about groups of sync IO, let all the async IO on
    a device go into a single queue, and make sure it is not starved
    while sync IO is going on?

  - I thought that most people cared about not impacting sync latencies
    badly while buffered writes are happening. Not many complained that
    buffered writes of one application should happen faster than those of
    another application.

  - If we agree that not many people require service differentiation
    between buffered writes, then we probably don't have to do anything
    in this space and we can keep things simple. I personally prefer this
    option. Trying to provide proportional IO for async writes will make
    things complicated and we might not achieve much.

  - CFQ already does a very good job of prioritizing sync over async (at
    the cost of reduced throughput on fast devices). So what's the use
    case for proportional IO for async writes?

Once we figure out what the requirements are, we can discuss the implementation details.

Thanks
Vivek
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-07 20:52 ` Vivek Goyal @ 2012-03-07 22:04 ` Jeff Moyer 2012-03-08 8:08 ` Greg Thelen 1 sibling, 0 replies; 21+ messages in thread From: Jeff Moyer @ 2012-03-07 22:04 UTC (permalink / raw) To: Vivek Goyal Cc: Andrea Righi, Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen Vivek Goyal <vgoyal@redhat.com> writes: > On Mon, Mar 05, 2012 at 11:58:01PM +0100, Andrea Righi wrote: > > [..] >> What about this scenario? (Sorry, I've not followed some of the recent >> discussions on this topic, so I'm sure I'm oversimplifying a bit or >> ignoring some details): >> >> - track inodes per-memcg for writeback IO (provided Greg's patch) >> - provide per-memcg dirty limit (global, not per-device); when this >> limit is exceeded flusher threads are awekened and all tasks that >> continue to generate new dirty pages inside the memcg are put to >> sleep >> - flusher threads start to write some dirty inodes of this memcg (using >> the inode tracking feature), let say they start with a chunk of N >> pages of the first dirty inode >> - flusher threads can't flush in this way more than N pages / sec >> (where N * PAGE_SIZE / sec is the blkcg "buffered write rate limit" >> on the inode's block device); if a flusher thread exceeds this limit >> it won't be blocked directly, it just stops flushing pages for this >> memcg after the first chunk and it can continue to flush dirty pages >> of a different memcg. >> > > So, IIUC, the only thing little different here is that throttling is > implemented by flusher thread. But it is still per device per cgroup. I > think that is just a implementation detail whether we implement it > in block layer, or in writeback or somewhere else. We can very well > implement it in block layer and provide per bdi/per_group congestion > flag in bdi so that flusher will stop pushing more IO if group on > a bdi is congested (because IO is throttled). > > I think first important thing is to figure out what is minimal set of > requirement (As jan said in another mail), which will solve wide > variety of cases. I am trying to list some of points. > > > - Throttling for buffered writes > - Do we want per device throttling limits or global throttling > limtis. You can implement global (perhaps in userspace utilities) if you have the per-device mechanism in the kernel. So I'd say start with per-device. > - Exising direct write limtis are per device and implemented in > block layer. > > - I personally think that both kind of limits might make sense. > But a global limit for async write might make more sense at > least for the workloads like backup which can run on a throttled > speed. When you say global, do you mean total bandwidth across all devices, or a maximum bandwidth applied to each device? > - Absolute throttling IO will make most sense on top level device > in the IO stack. I'm not sure why you used the word absolute. I do agree that throttling at the top-most device in a stack makes the most sense. > - For per device rate throttling, do we want a common limit for > direct write and buffered write or a separate limit just for > buffered writes. That depends, what's the goal? Direct writes can drive very deep queue depths, just as buffered writes can. > - Proportional IO for async writes > - Will probably make most sense on bottom most devices in the IO > stack (If we are able to somehow retain the submitter's context). Why does it make sense to have it at the bottom? 
Just because that's where it's implemented today? Writeback happens to the top-most device, and that device can have different properties than each of its components. So, why don't you think applying policy at the top is the right thing to do? > - Logically it will make sense to keep sync and async writes in > same group and try to provide fair share of disk between groups. > Technically CFQ can do that but in practice I think it will be > problematic. Writes of one group will take precedence of reads > of another group. Currently any read is prioritized over > buffered writes. So by splitting buffered writes in their own > cgroups, they can serverly impact the latency of reads in > another group. Not sure how many people really want to do > that in practice. > > - Do we really need proportional IO for async writes. CFQ had > tried implementing ioprio for async writes but it does not > work. Should we just care about groups of sync IO and let > all the async IO on device go in a single queue and lets > make suere it is not starved while sync IO is going on. If we get accounting of writeback I/O right, then I think it might make sense to enforce the proportional I/O policy on aysnc writes. But, I guess this also depends on what happens with the mem policy, right? > - I thought that most of the people cared about not impacting > sync latencies badly while buffered writes are happening. Not > many complained that buffered writes of one application should > happen faster than other application. Until you are forced to reclaim pages.... > - If we agree that not many people require service differentation > between buffered writes, then we probably don't have to do > anything in this space and we can keep things simple. I > personally prefer this option. Trying to provide proportional > IO for async writes will make things complicated and we might > not achieve much. Again, I think that, in order to consider this, we'd also have to lay out a plan for how it interacts with the memory cgroup policies. > - CFQ already does a very good job of prioritizing sync over async > (at the cost of reduced throuhgput on fast devices). So what's > the use case of proportion IO for async writes. > > Once we figure out what are the requirements, we can discuss the > implementation details. Nice write-up, Vivek. Cheers, Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-07 20:52 ` Vivek Goyal 2012-03-07 22:04 ` Jeff Moyer @ 2012-03-08 8:08 ` Greg Thelen 1 sibling, 0 replies; 21+ messages in thread From: Greg Thelen @ 2012-03-08 8:08 UTC (permalink / raw) To: Vivek Goyal Cc: Andrea Righi, Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara Vivek Goyal <vgoyal@redhat.com> writes: > So, IIUC, the only thing little different here is that throttling is > implemented by flusher thread. But it is still per device per cgroup. I > think that is just a implementation detail whether we implement it > in block layer, or in writeback or somewhere else. We can very well > implement it in block layer and provide per bdi/per_group congestion > flag in bdi so that flusher will stop pushing more IO if group on > a bdi is congested (because IO is throttled). > > I think first important thing is to figure out what is minimal set of > requirement (As jan said in another mail), which will solve wide > variety of cases. I am trying to list some of points. > > > - Throttling for buffered writes > - Do we want per device throttling limits or global throttling > limtis. > > - Exising direct write limtis are per device and implemented in > block layer. > > - I personally think that both kind of limits might make sense. > But a global limit for async write might make more sense at > least for the workloads like backup which can run on a throttled > speed. > > - Absolute throttling IO will make most sense on top level device > in the IO stack. > > - For per device rate throttling, do we want a common limit for > direct write and buffered write or a separate limit just for > buffered writes. Another aspect to this problem is 'dirty memory limiting'. First a quick refresher on memory.soft_limit_in_bytes... In memcg the soft_limit_in_bytes can be used as a way to overcommit a machine's memory. The idea is that the memory.limit_in_bytes (aka hard limit) specified a absolute maximum amount of memory a memcg can use, while the soft_limit_in_bytes indicates the working set of the container. The simplified equation is that if the sum(*/memory.soft_limit_in_bytes) < MemTotal, then all containers should be guaranteed their working set. Jobs are allowed to allocate more than soft_limit_in_bytes so long as they fit within limit_in_bytes. This attempts to provide a min and max amount of memory for a cgroup. The soft_limit_in_bytes is related to this discussion because it is desirable if all container memory above soft_limit_in_bytes is reclaimable (i.e. clean file cache). Using previously posted memcg dirty limiting and memcg writeback logic we have been able to set a container's dirty_limit to its soft_limit. While not perfect, this approximates the goal of providing min guaranteed memory while allowing for usage of best effort memory, so long as that best effort memory can be quickly reclaimed to satisfy another container's min guarantee. > - Proportional IO for async writes > - Will probably make most sense on bottom most devices in the IO > stack (If we are able to somehow retain the submitter's context). > > - Logically it will make sense to keep sync and async writes in > same group and try to provide fair share of disk between groups. > Technically CFQ can do that but in practice I think it will be > problematic. Writes of one group will take precedence of reads > of another group. Currently any read is prioritized over > buffered writes. 
So by splitting buffered writes in their own > cgroups, they can serverly impact the latency of reads in > another group. Not sure how many people really want to do > that in practice. > > - Do we really need proportional IO for async writes. CFQ had > tried implementing ioprio for async writes but it does not > work. Should we just care about groups of sync IO and let > all the async IO on device go in a single queue and lets > make suere it is not starved while sync IO is going on. > > > - I thought that most of the people cared about not impacting > sync latencies badly while buffered writes are happening. Not > many complained that buffered writes of one application should > happen faster than other application. > > - If we agree that not many people require service differentation > between buffered writes, then we probably don't have to do > anything in this space and we can keep things simple. I > personally prefer this option. Trying to provide proportional > IO for async writes will make things complicated and we might > not achieve much. > > - CFQ already does a very good job of prioritizing sync over async > (at the cost of reduced throuhgput on fast devices). So what's > the use case of proportion IO for async writes. > > Once we figure out what are the requirements, we can discuss the > implementation details. > > Thanks > Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-02 15:33 ` Vivek Goyal 2012-03-05 19:22 ` Fengguang Wu @ 2012-03-05 20:23 ` Jan Kara 2012-03-05 21:41 ` Vivek Goyal ` (2 more replies) 1 sibling, 3 replies; 21+ messages in thread From: Jan Kara @ 2012-03-05 20:23 UTC (permalink / raw) To: Vivek Goyal Cc: Suresh Jayaraman, linux-fsdevel, linux-mm, lsf-pc, Jan Kara, Andrea Righi On Fri 02-03-12 10:33:23, Vivek Goyal wrote: > On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote: > > Committee members, > > > > Please consider inviting me to the Storage, Filesystem, & MM Summit. I > > am working for one of the kernel teams in SUSE Labs focusing on Network > > filesystems and block layer. > > > > Recently, I have been trying to solve the problem of "throttling > > buffered writes" to make per-cgroup throttling of IO to the device > > possible. Currently the block IO controller does not throttle buffered > > writes. The writes would have lost the submitter's context (I/O comes in > > flusher thread's context) when they are at the block IO layer. I looked > > at the past work and many folks have attempted to solve this problem in > > the past years but this problem remains unsolved so far. > > > > First, Andrea Righi tried to solve this by limiting the rate of async > > writes at the time a task is generating dirty pages in the page cache. > > > > Next, Vivek Goyal tried to solve this by throttling writes at the time > > they are entering the page cache. > > > > Both these approches have limitations and not considered for merging. > > > > I have looked at the possibility of solving this at the filesystem level > > but the problem with ext* filesystems is that a commit will commit the > > whole transaction at once (which may contain writes from > > processes belonging to more than one cgroup). Making filesystems cgroup > > aware would need redesign of journalling layer itself. > > > > Dave Chinner thinks this problem should be solved and being solved in a > > different manner by making the bdi-flusher writeback cgroup aware. > > > > Greg Thelen's memcg writeback patchset (already been proposed for LSF/MM > > summit this year) adds cgroup awareness to writeback. Some aspects of > > this patchset could be borrowed for solving the problem of throttling > > buffered writes. > > > > As I understand the topic was discussed during last Kernel Summit as > > well and the idea is to get the IO-less throttling patchset into the > > kernel, then do per-memcg dirty memory limiting and add some memcg > > awareness to writeback Greg Thelen and then when these things settle > > down, think how to solve this problem since noone really seem to have a > > good answer to it. > > > > Having worked on linux filesystem/storage area for a few years now and > > having spent time understanding the various approaches tried and looked > > at other feasible way of solving this problem, I look forward to > > participate in the summit and discussions. > > > > So, the topic I would like to discuss is solving the problem of > > "throttling buffered writes". This could considered for discussion with > > memcg writeback session if that topic has been allocated a slot. > > > > I'm aware that this is a late submission and my apologies for not making > > it earlier. But, I want to take chances and see if it is possible still.. > > This is an interesting and complicated topic. As you mentioned we have had > tried to solve it but nothing has been merged yet. 
Personally, I am still > interested in having a discussion and see if we can come up with a way > forward. > > Because filesystems are not cgroup aware, throtting IO below filesystem > has dangers of IO of faster cgroups being throttled behind slower cgroup > (journalling was one example and there could be others). Hence, I personally > think that this problem should be solved at higher layer and that is when > we are actually writting to the cache. That has the disadvantage of still > seeing IO spikes at the device but I guess we live with that. Doing it > at higher layer also allows to use the same logic for NFS too otherwise > NFS buffered write will continue to be a problem. Well, I agree limiting of memory dirty rate has a value but if I look at a natural use case where I have several cgroups and I want to make sure disk time is fairly divided among them, then limiting dirty rate doesn't quite do what I need. Because I'm interested in time it takes disk to process the combination of reads, direct IO, and buffered writes the cgroup generates. Having the limits for dirty rate and other IO separate means I have to be rather pesimistic in setting the bounds so that combination of dirty rate + other IO limit doesn't exceed the desired bound but this is usually unnecessarily harsh... We agree though (as we spoke together last year) that throttling at block layer isn't really an option at least for some filesystems such as ext3/4. But what seemed like a plausible idea to me was that we'd account all IO including buffered writes at block layer (there we'd need at least approximate tracking of originator of the IO - tracking inodes as Greg did in his patch set seemed OK) but throttle only direct IO & reads. Limitting of buffered writes would then be achieved by a) having flusher thread choose inodes to write depending on how much available disk time cgroup has and b) throttling buffered writers when cgroup has too many dirty pages. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
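Jan's two-part proposal (the flusher picks inodes only from cgroups that still have disk time available, and a writer is throttled once its cgroup holds too many dirty pages) can be reduced to a small user-space model. This is only a sketch of the decision logic with invented structures and numbers; in a real kernel the two checks would sit in the writeback path and in balance_dirty_pages() respectively, while reads and direct IO would still be throttled at submission time.

#include <stdbool.h>
#include <stdio.h>

struct cgroup_model {
	const char *name;
	long disk_time_budget;	/* disk time slices left in this period */
	long dirty_pages;	/* pages dirtied but not yet written    */
	long dirty_limit;	/* per-cgroup cap on dirty pages        */
};

struct inode_model {
	const char *path;
	struct cgroup_model *owner;	/* approximate originator, tracked per inode */
	long dirty_pages;
};

/* (a) flusher side: only service inodes whose cgroup has disk time left */
static bool flusher_should_write(const struct inode_model *inode)
{
	return inode->owner->disk_time_budget > 0;
}

/* (b) writer side: the task must wait until the flusher cleans some pages */
static bool writer_must_wait(const struct cgroup_model *cg)
{
	return cg->dirty_pages >= cg->dirty_limit;
}

int main(void)
{
	struct cgroup_model fast = { "fast", 80, 100, 4096 };
	struct cgroup_model slow = { "slow", 0, 5000, 4096 };
	struct inode_model inodes[] = {
		{ "/data/a.log", &fast, 100 },
		{ "/data/b.log", &slow, 5000 },
	};
	size_t i;

	for (i = 0; i < sizeof(inodes) / sizeof(inodes[0]); i++)
		printf("%s: flusher %s pick it now\n", inodes[i].path,
		       flusher_should_write(&inodes[i]) ? "would" : "would not");

	printf("a writer in 'slow' %s\n",
	       writer_must_wait(&slow) ? "blocks until some pages are cleaned"
				       : "keeps dirtying");
	return 0;
}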
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara @ 2012-03-05 21:41 ` Vivek Goyal 2012-03-07 17:24 ` Jan Kara 2012-03-05 22:18 ` Vivek Goyal 2012-03-07 6:31 ` Fengguang Wu 2 siblings, 1 reply; 21+ messages in thread From: Vivek Goyal @ 2012-03-05 21:41 UTC (permalink / raw) To: Jan Kara; +Cc: Suresh Jayaraman, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: [..] > > Because filesystems are not cgroup aware, throtting IO below filesystem > > has dangers of IO of faster cgroups being throttled behind slower cgroup > > (journalling was one example and there could be others). Hence, I personally > > think that this problem should be solved at higher layer and that is when > > we are actually writting to the cache. That has the disadvantage of still > > seeing IO spikes at the device but I guess we live with that. Doing it > > at higher layer also allows to use the same logic for NFS too otherwise > > NFS buffered write will continue to be a problem. > Well, I agree limiting of memory dirty rate has a value but if I look at > a natural use case where I have several cgroups and I want to make sure > disk time is fairly divided among them, then limiting dirty rate doesn't > quite do what I need. Actually "proportional IO control" generally addresses the use case of disk time being fairly divided among cgroups. The "throttling/upper limit" I think is more targeted towards the cases where you have bandwidth but you don't want to give it to user as user has not paid for that kind of service. Though it could be used for other things like monitoring the system dynamically and throttling rates of a particular cgroup if admin thinks that particular cgroup is doing too much of IO. Or for things like, start a backup operation with an upper limit of say 50MB/s so that it does not affect other system activities too much. > Because I'm interested in time it takes disk to > process the combination of reads, direct IO, and buffered writes the cgroup > generates. Having the limits for dirty rate and other IO separate means I > have to be rather pesimistic in setting the bounds so that combination of > dirty rate + other IO limit doesn't exceed the desired bound but this is > usually unnecessarily harsh... Yes, seprating out the throttling limits for "reads + direct writes + certain wriththrough writes" and "buffered writes" is not ideal. But it might still have some value for specific use cases (writes over NFS, backup application, throttling a specific disk hog workload etc). > > We agree though (as we spoke together last year) that throttling at block > layer isn't really an option at least for some filesystems such as ext3/4. Yes, because of jorunalling issues and ensuring serialization, throttling/upper limit at block/device level becomes less attractive. > But what seemed like a plausible idea to me was that we'd account all IO > including buffered writes at block layer (there we'd need at least > approximate tracking of originator of the IO - tracking inodes as Greg did > in his patch set seemed OK) but throttle only direct IO & reads. Limitting > of buffered writes would then be achieved by > a) having flusher thread choose inodes to write depending on how much > available disk time cgroup has and > b) throttling buffered writers when cgroup has too many dirty pages. I am trying to remember what we had discussed. There have been so many ideas floated in this area, that now I get confused. 
So let's take throttling/upper limits out of the picture for a moment and just focus on the use case of proportional IO (fair share of the disk among cgroups). - In that case yes, we probably can come up with some IO tracking mechanism so that IO can be accounted to the right cgroup (the IO originator's cgroup) at the block layer. We could either store some info in "struct page" or do some approximation as you mentioned, like the inode owner. - With buffered IO accounted to the right cgroup, CFQ should automatically start providing the cgroup its fair share (well, small changes will be required). But there are still two more issues. - The issue of making writeback cgroup aware. I am assuming that this work will be taken forward by Greg. - Breaking down request descriptors into some kind of per cgroup notion so that one cgroup is not stuck behind another. (Or come up with a different mechanism for per cgroup congestion). That way, if a cgroup is congested at CFQ, the flusher should stop submitting more IO for it; that will lead to increased dirty pages in memcg and that should throttle the application. So all of the above seems to be proportional IO (fair share of the disk). This should still co-exist with "throttling/upper limit" implementation/knobs and one is not necessarily a replacement for the other? Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
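The per-cgroup request descriptor point is easiest to see as a toy allocator: each group gets its own descriptor quota, exhausting it marks only that group congested, and that per-group congestion is the signal the flusher would use to stop submitting for the group. The quota value and all names below are invented for the sketch.

#include <stdbool.h>
#include <stdio.h>

#define PER_GROUP_NR_REQUESTS 128	/* made-up per-group quota */

struct group_rq_pool {
	const char *cgroup;
	int allocated;
};

static bool group_congested(const struct group_rq_pool *p)
{
	return p->allocated >= PER_GROUP_NR_REQUESTS;
}

/* returns false when the group must back off (flusher stops submitting) */
static bool group_alloc_request(struct group_rq_pool *p)
{
	if (group_congested(p))
		return false;
	p->allocated++;
	return true;
}

static void group_free_request(struct group_rq_pool *p)
{
	if (p->allocated > 0)
		p->allocated--;
}

int main(void)
{
	struct group_rq_pool heavy = { "backup", 0 };
	struct group_rq_pool light = { "database", 0 };
	int i;

	for (i = 0; i < 200; i++)
		group_alloc_request(&heavy);	/* heavy group hits its quota */

	printf("backup congested: %s, database congested: %s\n",
	       group_congested(&heavy) ? "yes" : "no",
	       group_congested(&light) ? "yes" : "no");

	group_free_request(&heavy);		/* an IO completion frees a slot */
	printf("backup can submit again: %s\n",
	       group_alloc_request(&heavy) ? "yes" : "no");
	return 0;
}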
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 21:41 ` Vivek Goyal @ 2012-03-07 17:24 ` Jan Kara 2012-03-07 21:29 ` Vivek Goyal 0 siblings, 1 reply; 21+ messages in thread From: Jan Kara @ 2012-03-07 17:24 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi, Suresh Jayaraman On Mon 05-03-12 16:41:30, Vivek Goyal wrote: > On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > [..] > > > Because filesystems are not cgroup aware, throtting IO below filesystem > > > has dangers of IO of faster cgroups being throttled behind slower cgroup > > > (journalling was one example and there could be others). Hence, I personally > > > think that this problem should be solved at higher layer and that is when > > > we are actually writting to the cache. That has the disadvantage of still > > > seeing IO spikes at the device but I guess we live with that. Doing it > > > at higher layer also allows to use the same logic for NFS too otherwise > > > NFS buffered write will continue to be a problem. > > Well, I agree limiting of memory dirty rate has a value but if I look at > > a natural use case where I have several cgroups and I want to make sure > > disk time is fairly divided among them, then limiting dirty rate doesn't > > quite do what I need. > > Actually "proportional IO control" generally addresses the use case of > disk time being fairly divided among cgroups. The "throttling/upper limit" > I think is more targeted towards the cases where you have bandwidth but > you don't want to give it to user as user has not paid for that kind > of service. Though it could be used for other things like monitoring the > system dynamically and throttling rates of a particular cgroup if admin > thinks that particular cgroup is doing too much of IO. Or for things like, > start a backup operation with an upper limit of say 50MB/s so that it > does not affect other system activities too much. Well, I was always slightly sceptical that these absolute bandwidth limits are that great thing. If some cgroup beats your storage with 10 MB/s of random tiny writes, then it uses more of your resources than an streaming 50 MB/s write. So although admins might be tempted to use throughput limits at the first moment because they are easier to understand, they might later find it's not quite what they wanted. Specifically for the imagined use case where a customer pays just for a given bandwidth, you can achieve similar (and IMHO more reliable) results using proportional control. Say you have available 100 MB/s sequential IO bandwidth and you would like to limit cgroup to 10 MB/s. Then you just give it weight 10. Another cgroup paying for 20 MB/s would get weight 20 and so on. If you are a clever provider and pack your load so that machine is well utilized, cgroups will get limited roughly at given bounds... > > Because I'm interested in time it takes disk to > > process the combination of reads, direct IO, and buffered writes the cgroup > > generates. Having the limits for dirty rate and other IO separate means I > > have to be rather pesimistic in setting the bounds so that combination of > > dirty rate + other IO limit doesn't exceed the desired bound but this is > > usually unnecessarily harsh... > > Yes, seprating out the throttling limits for "reads + direct writes + > certain wriththrough writes" and "buffered writes" is not ideal. 
But > it might still have some value for specific use cases (writes over NFS, > backup application, throttling a specific disk hog workload etc). > > > > > We agree though (as we spoke together last year) that throttling at block > > layer isn't really an option at least for some filesystems such as ext3/4. > > Yes, because of jorunalling issues and ensuring serialization, > throttling/upper limit at block/device level becomes less attractive. > > > But what seemed like a plausible idea to me was that we'd account all IO > > including buffered writes at block layer (there we'd need at least > > approximate tracking of originator of the IO - tracking inodes as Greg did > > in his patch set seemed OK) but throttle only direct IO & reads. Limitting > > of buffered writes would then be achieved by > > a) having flusher thread choose inodes to write depending on how much > > available disk time cgroup has and > > b) throttling buffered writers when cgroup has too many dirty pages. > > I am trying to remember what we had discussed. There have been so many > ideas floated in this area, that now I get confused. > > So lets take throttling/upper limit out of the picture for a moment and just > focus on the use case of proportional IO (fare share of disk among cgroups). > > - In that case yes, we probably can come up with some IO tracking > mechanism so that IO can be accounted to right cgroup (IO originator's > cgroup) at block layer. We could either store some info in "struct > page" or do some approximation as you mentioned like inode owner. > > - With buffered IO accounted to right cgroup, CFQ should automatically > start providing cgroup its fair share (Well little changes will be > required). But there are still two more issues. > > - Issue of making writeback cgroup aware. I am assuming that > this work will be taken forward by Greg. > > - Breaking down request descriptors into some kind of per cgroup > notion so that one cgroup is not stuck behind other. (Or come > up with a different mechanism for per cgroup congestion). > > That way, if a cgroup is congested at CFQ, flusher should stop submitting > more IO for it, that will lead to increased dirty pages in memcg and that > should throttle the application. > > So all of the aove seems to be proportional IO (fair shrae of disk). This > should still be co-exist with "throttling/upper limit" implementation/knobs > and one is not necessarily replacement for other? Well, I don't see a strict reason why the above won't work for "upper limit" knobs. After all, these knobs just mean you don't want to submit more that X MB of IO in 1 second. So you just need flusher thread to check against such limit as well. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
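Jan's weight example is plain arithmetic: weights are relative, so they translate into absolute bandwidth only when the device is saturated. In the sketch below the weights 10 and 20 are the ones from the mail; the best-effort group with weight 70 is an assumption added so that the weights total 100, which is what makes weight 10 come out near 10 MB/s on a 100 MB/s disk.

#include <stdio.h>

struct grp {
	const char *name;
	unsigned int weight;
};

int main(void)
{
	const double device_bw = 100.0;		/* MB/s when saturated */
	struct grp groups[] = {
		{ "paid-10MBps", 10 },
		{ "paid-20MBps", 20 },
		{ "best-effort", 70 },		/* assumed filler load */
	};
	size_t i;
	unsigned int total = 0;

	for (i = 0; i < sizeof(groups) / sizeof(groups[0]); i++)
		total += groups[i].weight;

	for (i = 0; i < sizeof(groups) / sizeof(groups[0]); i++)
		printf("%-12s weight %3u -> ~%.1f MB/s of a saturated disk\n",
		       groups[i].name, groups[i].weight,
		       device_bw * groups[i].weight / total);
	return 0;
}

Because proportional control is work conserving, a group can exceed its nominal figure whenever other groups are idle; the bound only emerges when the provider keeps the device busy, which is the packing argument above.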
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-07 17:24 ` Jan Kara @ 2012-03-07 21:29 ` Vivek Goyal 0 siblings, 0 replies; 21+ messages in thread From: Vivek Goyal @ 2012-03-07 21:29 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, linux-mm, lsf-pc, Andrea Righi, Suresh Jayaraman On Wed, Mar 07, 2012 at 06:24:53PM +0100, Jan Kara wrote: > On Mon 05-03-12 16:41:30, Vivek Goyal wrote: > > On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > > [..] > > > > Because filesystems are not cgroup aware, throtting IO below filesystem > > > > has dangers of IO of faster cgroups being throttled behind slower cgroup > > > > (journalling was one example and there could be others). Hence, I personally > > > > think that this problem should be solved at higher layer and that is when > > > > we are actually writting to the cache. That has the disadvantage of still > > > > seeing IO spikes at the device but I guess we live with that. Doing it > > > > at higher layer also allows to use the same logic for NFS too otherwise > > > > NFS buffered write will continue to be a problem. > > > Well, I agree limiting of memory dirty rate has a value but if I look at > > > a natural use case where I have several cgroups and I want to make sure > > > disk time is fairly divided among them, then limiting dirty rate doesn't > > > quite do what I need. > > > > Actually "proportional IO control" generally addresses the use case of > > disk time being fairly divided among cgroups. The "throttling/upper limit" > > I think is more targeted towards the cases where you have bandwidth but > > you don't want to give it to user as user has not paid for that kind > > of service. Though it could be used for other things like monitoring the > > system dynamically and throttling rates of a particular cgroup if admin > > thinks that particular cgroup is doing too much of IO. Or for things like, > > start a backup operation with an upper limit of say 50MB/s so that it > > does not affect other system activities too much. > Well, I was always slightly sceptical that these absolute bandwidth > limits are that great thing. If some cgroup beats your storage with 10 MB/s > of random tiny writes, then it uses more of your resources than an > streaming 50 MB/s write. So although admins might be tempted to use > throughput limits at the first moment because they are easier to > understand, they might later find it's not quite what they wanted. Well, you have iops limits too and one can specify bps as well as iops limits and blkcg will do iops_limt AND bps_limit. So for large sequential IO one can speicfy 50MB/s upper limit but at the same time might want to specify some iops limit to cover for the case of small random IO. But I agree that configuring these limits might not be easy. One need to know capacity of the system and provision things accoridingly. As capacity of system is more or less workload depednent, its hard to predict. I personally thought that some kind of dynamic monitoring application can help which can dynamically monitor which cgroup is imacting system badly and go and change its upper limits. > > Specifically for the imagined use case where a customer pays just for a > given bandwidth, you can achieve similar (and IMHO more reliable) results > using proportional control. Say you have available 100 MB/s sequential IO > bandwidth and you would like to limit cgroup to 10 MB/s. Then you just > give it weight 10. Another cgroup paying for 20 MB/s would get weight 20 > and so on. 
If you are a clever provider and pack your load so that machine > is well utilized, cgroups will get limited roughly at given bounds... Well, if multiple virtual machines are running, you just don't know who is doing how much of IO at a given point of time. So a virtual machine might experience a very different IO bandwidth based on how many other virtual machines are doing IO at that point of time. Once Chris Wright mentioned that upper limits might be useful in providing a more consistent IO bandwidth experience to virtual machines as they might be migrated from one host to other and these hosts might have different IO bandwidth altogether. > > > > Because I'm interested in time it takes disk to > > > process the combination of reads, direct IO, and buffered writes the cgroup > > > generates. Having the limits for dirty rate and other IO separate means I > > > have to be rather pesimistic in setting the bounds so that combination of > > > dirty rate + other IO limit doesn't exceed the desired bound but this is > > > usually unnecessarily harsh... > > > > Yes, seprating out the throttling limits for "reads + direct writes + > > certain wriththrough writes" and "buffered writes" is not ideal. But > > it might still have some value for specific use cases (writes over NFS, > > backup application, throttling a specific disk hog workload etc). > > > > > > > > We agree though (as we spoke together last year) that throttling at block > > > layer isn't really an option at least for some filesystems such as ext3/4. > > > > Yes, because of jorunalling issues and ensuring serialization, > > throttling/upper limit at block/device level becomes less attractive. > > > > > But what seemed like a plausible idea to me was that we'd account all IO > > > including buffered writes at block layer (there we'd need at least > > > approximate tracking of originator of the IO - tracking inodes as Greg did > > > in his patch set seemed OK) but throttle only direct IO & reads. Limitting > > > of buffered writes would then be achieved by > > > a) having flusher thread choose inodes to write depending on how much > > > available disk time cgroup has and > > > b) throttling buffered writers when cgroup has too many dirty pages. > > > > I am trying to remember what we had discussed. There have been so many > > ideas floated in this area, that now I get confused. > > > > So lets take throttling/upper limit out of the picture for a moment and just > > focus on the use case of proportional IO (fare share of disk among cgroups). > > > > - In that case yes, we probably can come up with some IO tracking > > mechanism so that IO can be accounted to right cgroup (IO originator's > > cgroup) at block layer. We could either store some info in "struct > > page" or do some approximation as you mentioned like inode owner. > > > > - With buffered IO accounted to right cgroup, CFQ should automatically > > start providing cgroup its fair share (Well little changes will be > > required). But there are still two more issues. > > > > - Issue of making writeback cgroup aware. I am assuming that > > this work will be taken forward by Greg. > > > > - Breaking down request descriptors into some kind of per cgroup > > notion so that one cgroup is not stuck behind other. (Or come > > up with a different mechanism for per cgroup congestion). > > > > That way, if a cgroup is congested at CFQ, flusher should stop submitting > > more IO for it, that will lead to increased dirty pages in memcg and that > > should throttle the application. 
> > > > So all of the above seems to be proportional IO (fair share of the disk). This > > should still co-exist with "throttling/upper limit" implementation/knobs > > and one is not necessarily a replacement for the other? > Well, I don't see a strict reason why the above won't work for "upper > limit" knobs. After all, these knobs just mean you don't want to submit > more than X MB of IO in 1 second. So you just need the flusher thread to check > against such a limit as well. Well, yes, the flusher thread can check for that, or the block layer can implement the logic and the flusher thread can just check whether a group is congested or not. I was just trying to differentiate between "throttling/upper limit" which is non-work conserving and "proportional IO" which is work conserving. Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
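The bps and iops knobs Vivek mentions already existed in blk-throttle at the time, so the earlier example of running a backup at roughly 50MB/s would be configured as in the sketch below. The cgroup name, the 8:16 device number and the 2000 iops figure are assumptions for illustration; both control files take one "major:minor value" line per device.

#include <stdio.h>

static int set_limit(const char *knob, const char *dev, long long val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/blkio/backup/%s", knob);	/* assumed cgroup */
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s %lld\n", dev, val);	/* "major:minor value" format */
	return fclose(f);
}

int main(void)
{
	/* cap throughput at about 50 MB/s on device 8:16 ... */
	set_limit("blkio.throttle.write_bps_device", "8:16", 50LL << 20);
	/* ... and also cap iops so small random writes cannot dodge the limit */
	set_limit("blkio.throttle.write_iops_device", "8:16", 2000);
	return 0;
}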
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 2012-03-05 21:41 ` Vivek Goyal @ 2012-03-05 22:18 ` Vivek Goyal 2012-03-05 22:36 ` Jan Kara 2012-03-07 6:31 ` Fengguang Wu 2 siblings, 1 reply; 21+ messages in thread From: Vivek Goyal @ 2012-03-05 22:18 UTC (permalink / raw) To: Jan Kara; +Cc: Andrea Righi, Suresh Jayaraman, linux-mm, linux-fsdevel, lsf-pc On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: [..] > Having the limits for dirty rate and other IO separate means I > have to be rather pessimistic in setting the bounds so that the combination of > dirty rate + other IO limit doesn't exceed the desired bound but this is > usually unnecessarily harsh... We had solved this issue in my previous posting. https://lkml.org/lkml/2011/6/28/243 I was accounting the buffered writes to the associated block group in balance_dirty_pages() and throttling them if the group was exceeding its upper limit. This had a common limit for all kinds of writes (direct + buffered + sync etc). But it also had its share of issues. - Control was per device (not global) and was not applicable to NFS. - Will not prevent IO spikes at the device (caused by flusher threads). Dave Chinner preferred to throttle IO at the device, much later in the stack. I personally think that "dirty rate limit" does not solve all problems but has some value and it will be interesting to merge any one implementation and see if it solves a real world problem. It does not block any other idea of buffered write proportional control or even implementing upper limits in blkcg. We could put "dirty rate limit" in memcg and develop the rest of the ideas in blkcg, writeback etc. Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
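A rough user-space model of what that patch set did is sketched below: pace the dirtying task itself, at the balance_dirty_pages() point, whenever its group runs ahead of a configured buffered-write limit. The structure names and numbers are invented; the point is only that the sleep is charged to the task generating dirty pages rather than to the flusher.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

struct group_dirty_rate {
	long long bps_limit;	/* configured upper limit on the dirty rate */
	long long dirtied;	/* bytes dirtied since the period started   */
	time_t start;
};

/*
 * Called after a task has dirtied 'len' bytes: if the group is running
 * ahead of its allowed rate, make the dirtier sleep until it is back on
 * schedule, the same feedback point balance_dirty_pages() uses.
 */
static void group_dirty_throttle(struct group_dirty_rate *g, long long len)
{
	double elapsed, should_have_taken;

	g->dirtied += len;
	elapsed = difftime(time(NULL), g->start);
	should_have_taken = (double)g->dirtied / g->bps_limit;
	if (should_have_taken > elapsed)
		sleep((unsigned int)(should_have_taken - elapsed));
}

int main(void)
{
	struct group_dirty_rate g = { 10LL << 20, 0, time(NULL) };	/* 10 MB/s cap */
	int i;

	for (i = 0; i < 30; i++)
		group_dirty_throttle(&g, 1 << 20);	/* the task dirties 1 MB */

	printf("dirtied 30 MB at no more than roughly 10 MB/s\n");
	return 0;
}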
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 22:18 ` Vivek Goyal @ 2012-03-05 22:36 ` Jan Kara 2012-03-07 6:42 ` Fengguang Wu 0 siblings, 1 reply; 21+ messages in thread From: Jan Kara @ 2012-03-05 22:36 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi, Suresh Jayaraman On Mon 05-03-12 17:18:43, Vivek Goyal wrote: > On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > > [..] > > Having the limits for dirty rate and other IO separate means I > > have to be rather pesimistic in setting the bounds so that combination of > > dirty rate + other IO limit doesn't exceed the desired bound but this is > > usually unnecessarily harsh... > > We had solved this issue in my previous posting. > > https://lkml.org/lkml/2011/6/28/243 > > I was accounting the buffered writes to associated block group in > balance dirty pages and throttling it if group was exceeding upper > limit. This had common limit for all kind of writes (direct + buffered + > sync etc). Ah, I didn't know that. > But it also had its share of issues. > > - Control was per device (not global) and was not applicable to NFS. > - Will not prevent IO spikes at devices (caused by flusher threads). > > Dave Chinner preferred to throttle IO at devices much later. > > I personally think that "dirty rate limit" does not solve all problems > but has some value and it will be interesting to merge any one > implementation and see if it solves a real world problem. It rather works the other way around - you first have to show enough users are interested in the particular feature you want to merge and then the feature can get merged. Once the feature is merged we are stuck supporting it forever so we have to be very cautious in what we merge... > It does not block any other idea of buffered write proportional control > or even implementing upper limit in blkcg. We could put "dirty rate > limit" in memcg and develop rest of the ideas in blkcg, writeback etc. Yes, it doesn't block them but OTOH we should have as few features as possible because otherwise it's a configuration and maintenance nightmare (both from admin and kernel POV). So we should think twice what set of features we choose to satisfy user demand. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 22:36 ` Jan Kara @ 2012-03-07 6:42 ` Fengguang Wu 0 siblings, 0 replies; 21+ messages in thread From: Fengguang Wu @ 2012-03-07 6:42 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi, Suresh Jayaraman On Mon, Mar 05, 2012 at 11:36:37PM +0100, Jan Kara wrote: > On Mon 05-03-12 17:18:43, Vivek Goyal wrote: > > On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > > > > [..] > > > Having the limits for dirty rate and other IO separate means I > > > have to be rather pesimistic in setting the bounds so that combination of > > > dirty rate + other IO limit doesn't exceed the desired bound but this is > > > usually unnecessarily harsh... > > > > We had solved this issue in my previous posting. > > > > https://lkml.org/lkml/2011/6/28/243 > > > > I was accounting the buffered writes to associated block group in > > balance dirty pages and throttling it if group was exceeding upper > > limit. This had common limit for all kind of writes (direct + buffered + > > sync etc). > Ah, I didn't know that. > > > But it also had its share of issues. > > > > - Control was per device (not global) and was not applicable to NFS. > > - Will not prevent IO spikes at devices (caused by flusher threads). > > > > Dave Chinner preferred to throttle IO at devices much later. > > > > I personally think that "dirty rate limit" does not solve all problems > > but has some value and it will be interesting to merge any one > > implementation and see if it solves a real world problem. > It rather works the other way around - you first have to show enough > users are interested in the particular feature you want to merge and then the > feature can get merged. Once the feature is merged we are stuck supporting > it forever so we have to be very cautious in what we merge... Agreed. > > It does not block any other idea of buffered write proportional control > > or even implementing upper limit in blkcg. We could put "dirty rate > > limit" in memcg and develop rest of the ideas in blkcg, writeback etc. > Yes, it doesn't block them but OTOH we should have as few features as > possible because otherwise it's a configuration and maintenance nightmare > (both from admin and kernel POV). So we should think twice what set of > features we choose to satisfy user demand. Yeah it's a good idea to first figure out the ideal set of user interfaces that are simple, natural, flexible and extensible. Then look into the implementations and see how can we provide interfaces closest to the ideal ones (if not 100% feasible). Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 2012-03-05 21:41 ` Vivek Goyal 2012-03-05 22:18 ` Vivek Goyal @ 2012-03-07 6:31 ` Fengguang Wu 2 siblings, 0 replies; 21+ messages in thread From: Fengguang Wu @ 2012-03-07 6:31 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, Suresh Jayaraman, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > On Fri 02-03-12 10:33:23, Vivek Goyal wrote: > > On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote: > > > Committee members, > > > > > > Please consider inviting me to the Storage, Filesystem, & MM Summit. I > > > am working for one of the kernel teams in SUSE Labs focusing on Network > > > filesystems and block layer. > > > > > > Recently, I have been trying to solve the problem of "throttling > > > buffered writes" to make per-cgroup throttling of IO to the device > > > possible. Currently the block IO controller does not throttle buffered > > > writes. The writes would have lost the submitter's context (I/O comes in > > > flusher thread's context) when they are at the block IO layer. I looked > > > at the past work and many folks have attempted to solve this problem in > > > the past years but this problem remains unsolved so far. > > > > > > First, Andrea Righi tried to solve this by limiting the rate of async > > > writes at the time a task is generating dirty pages in the page cache. > > > > > > Next, Vivek Goyal tried to solve this by throttling writes at the time > > > they are entering the page cache. > > > > > > Both these approches have limitations and not considered for merging. > > > > > > I have looked at the possibility of solving this at the filesystem level > > > but the problem with ext* filesystems is that a commit will commit the > > > whole transaction at once (which may contain writes from > > > processes belonging to more than one cgroup). Making filesystems cgroup > > > aware would need redesign of journalling layer itself. > > > > > > Dave Chinner thinks this problem should be solved and being solved in a > > > different manner by making the bdi-flusher writeback cgroup aware. > > > > > > Greg Thelen's memcg writeback patchset (already been proposed for LSF/MM > > > summit this year) adds cgroup awareness to writeback. Some aspects of > > > this patchset could be borrowed for solving the problem of throttling > > > buffered writes. > > > > > > As I understand the topic was discussed during last Kernel Summit as > > > well and the idea is to get the IO-less throttling patchset into the > > > kernel, then do per-memcg dirty memory limiting and add some memcg > > > awareness to writeback Greg Thelen and then when these things settle > > > down, think how to solve this problem since noone really seem to have a > > > good answer to it. > > > > > > Having worked on linux filesystem/storage area for a few years now and > > > having spent time understanding the various approaches tried and looked > > > at other feasible way of solving this problem, I look forward to > > > participate in the summit and discussions. > > > > > > So, the topic I would like to discuss is solving the problem of > > > "throttling buffered writes". This could considered for discussion with > > > memcg writeback session if that topic has been allocated a slot. > > > > > > I'm aware that this is a late submission and my apologies for not making > > > it earlier. But, I want to take chances and see if it is possible still.. 
> > > > This is an interesting and complicated topic. As you mentioned we have had > > tried to solve it but nothing has been merged yet. Personally, I am still > > interested in having a discussion and see if we can come up with a way > > forward. > > > > Because filesystems are not cgroup aware, throtting IO below filesystem > > has dangers of IO of faster cgroups being throttled behind slower cgroup > > (journalling was one example and there could be others). Hence, I personally > > think that this problem should be solved at higher layer and that is when > > we are actually writting to the cache. That has the disadvantage of still > > seeing IO spikes at the device but I guess we live with that. Doing it > > at higher layer also allows to use the same logic for NFS too otherwise > > NFS buffered write will continue to be a problem. > Well, I agree limiting of memory dirty rate has a value but if I look at > a natural use case where I have several cgroups and I want to make sure > disk time is fairly divided among them, then limiting dirty rate doesn't > quite do what I need. Because I'm interested in time it takes disk to > process the combination of reads, direct IO, and buffered writes the cgroup > generates. Having the limits for dirty rate and other IO separate means I > have to be rather pesimistic in setting the bounds so that combination of > dirty rate + other IO limit doesn't exceed the desired bound but this is > usually unnecessarily harsh... Yeah it's quite possible some use cases may need to control read/write respectively and others may want to simply limit the overall r/w throughput or disk utilization. It seems more a matter of interface rather than implementation. If we have code to limit the buffered/direct write bandwidth respectively, it should also be able to limit the overall buffered+direct write bandwidth or even read+write bandwidth. However for the "overall" r+w limit interface to work, some implicit rule of precedences or weight will be necessary, eg. read > DIRECT write > buffered write, or read:DIRECT write:buffered write=10:10:1 or whatever. Which the users may not totally agree. In the end it looks there are always the distinguish of the main SYNC/ASYNC and read/write I/O types and no chance to hide them from the I/O controller interfaces. Then we might export interfaces to allow the users to specify the overall I/O rate limit, the weights for each type of I/O, the individual rate limits for each type of I/O, etc. to the users' heart content. > We agree though (as we spoke together last year) that throttling at block > layer isn't really an option at least for some filesystems such as ext3/4. > But what seemed like a plausible idea to me was that we'd account all IO > including buffered writes at block layer (there we'd need at least Account buffered write I/O when they reach the block layer? It sounds too late. > approximate tracking of originator of the IO - tracking inodes as Greg did > in his patch set seemed OK) but throttle only direct IO & reads. Limitting > of buffered writes would then be achieved by > a) having flusher thread choose inodes to write depending on how much > available disk time cgroup has and The flusher is fundamentally - coarsely controllable due to the large write chunk size - not controllable in the case of shared inodes so any dirty size/rate limiting scheme based on controlling the flusher behavior is not going to be an exact/reliable solution... > b) throttling buffered writers when cgroup has too many dirty pages. 
That still looks like throttling at the balance_dirty_pages() level? Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
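One possible reading of Fengguang's interface idea (an overall limit plus per-type weights) is to derive per-type caps from the weights, as in the sketch below. The read : direct write : buffered write ratio of 10:10:1 is the one from the mail; the 100 MB/s overall figure and the straight proportional split are assumptions, since the mail leaves open whether the rule should be a strict precedence or a weighting.

#include <stdio.h>

int main(void)
{
	const char *types[] = { "read", "direct write", "buffered write" };
	const unsigned int w[] = { 10, 10, 1 };		/* example weights */
	const double overall = 100.0;			/* MB/s, assumed   */
	unsigned int i, total = 0;

	for (i = 0; i < 3; i++)
		total += w[i];

	for (i = 0; i < 3; i++)
		printf("%-15s weight %2u -> up to %.1f MB/s\n",
		       types[i], w[i], overall * w[i] / total);
	return 0;
}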
end of thread, other threads:[~2012-03-08 8:08 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-03-02 7:18 [ATTEND] [LSF/MM TOPIC] Buffered writes throttling Suresh Jayaraman 2012-03-02 15:33 ` Vivek Goyal 2012-03-05 19:22 ` Fengguang Wu 2012-03-05 21:11 ` Vivek Goyal 2012-03-05 22:30 ` Fengguang Wu 2012-03-05 23:19 ` Andrea Righi 2012-03-05 23:51 ` Fengguang Wu 2012-03-06 0:46 ` Andrea Righi 2012-03-07 20:26 ` Vivek Goyal 2012-03-05 22:58 ` Andrea Righi 2012-03-07 20:52 ` Vivek Goyal 2012-03-07 22:04 ` Jeff Moyer 2012-03-08 8:08 ` Greg Thelen 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 2012-03-05 21:41 ` Vivek Goyal 2012-03-07 17:24 ` Jan Kara 2012-03-07 21:29 ` Vivek Goyal 2012-03-05 22:18 ` Vivek Goyal 2012-03-05 22:36 ` Jan Kara 2012-03-07 6:42 ` Fengguang Wu 2012-03-07 6:31 ` Fengguang Wu