* [ATTEND] [LSF/MM TOPIC] Buffered writes throttling
  From: Suresh Jayaraman @ 2012-03-02 7:18 UTC
  To: lsf-pc; +Cc: Vivek Goyal, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara

Committee members,

Please consider inviting me to the Storage, Filesystem, & MM Summit. I am working in one of the kernel teams in SUSE Labs, focusing on network filesystems and the block layer.

Recently, I have been trying to solve the problem of "throttling buffered writes" to make per-cgroup throttling of IO to the device possible. Currently the block IO controller does not throttle buffered writes: by the time they reach the block IO layer, the writes have lost the submitter's context (the I/O comes in the flusher thread's context). I looked at the past work; many folks have attempted to solve this problem over the years, but it remains unsolved so far.

First, Andrea Righi tried to solve this by limiting the rate of async writes at the time a task is generating dirty pages in the page cache.

Next, Vivek Goyal tried to solve this by throttling writes at the time they are entering the page cache.

Both these approaches have limitations and were not considered for merging.

I have looked at the possibility of solving this at the filesystem level, but the problem with ext* filesystems is that a commit will commit the whole transaction at once (which may contain writes from processes belonging to more than one cgroup). Making filesystems cgroup aware would need a redesign of the journalling layer itself.

Dave Chinner thinks this problem should be solved, and is being solved, in a different manner by making the bdi-flusher writeback cgroup aware.

Greg Thelen's memcg writeback patchset (already proposed for the LSF/MM summit this year) adds cgroup awareness to writeback. Some aspects of this patchset could be borrowed for solving the problem of throttling buffered writes.

As I understand it, the topic was discussed during the last Kernel Summit as well, and the idea is to get the IO-less throttling patchset into the kernel, then do per-memcg dirty memory limiting and add some memcg awareness to writeback (Greg Thelen), and then, when these things settle down, think about how to solve this problem, since no one really seems to have a good answer to it.

Having worked in the Linux filesystem/storage area for a few years now, and having spent time understanding the various approaches tried and looking at other feasible ways of solving this problem, I look forward to participating in the summit and discussions.

So, the topic I would like to discuss is solving the problem of "throttling buffered writes". This could be considered for discussion with the memcg writeback session if that topic has been allocated a slot.

I'm aware that this is a late submission and my apologies for not making it earlier. But I want to take my chances and see if it is still possible.

Thanks
Suresh
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-02 7:18 [ATTEND] [LSF/MM TOPIC] Buffered writes throttling Suresh Jayaraman @ 2012-03-02 15:33 ` Vivek Goyal 2012-03-05 19:22 ` Fengguang Wu 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 0 siblings, 2 replies; 21+ messages in thread From: Vivek Goyal @ 2012-03-02 15:33 UTC (permalink / raw) To: Suresh Jayaraman; +Cc: lsf-pc, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote: > Committee members, > > Please consider inviting me to the Storage, Filesystem, & MM Summit. I > am working for one of the kernel teams in SUSE Labs focusing on Network > filesystems and block layer. > > Recently, I have been trying to solve the problem of "throttling > buffered writes" to make per-cgroup throttling of IO to the device > possible. Currently the block IO controller does not throttle buffered > writes. The writes would have lost the submitter's context (I/O comes in > flusher thread's context) when they are at the block IO layer. I looked > at the past work and many folks have attempted to solve this problem in > the past years but this problem remains unsolved so far. > > First, Andrea Righi tried to solve this by limiting the rate of async > writes at the time a task is generating dirty pages in the page cache. > > Next, Vivek Goyal tried to solve this by throttling writes at the time > they are entering the page cache. > > Both these approches have limitations and not considered for merging. > > I have looked at the possibility of solving this at the filesystem level > but the problem with ext* filesystems is that a commit will commit the > whole transaction at once (which may contain writes from > processes belonging to more than one cgroup). Making filesystems cgroup > aware would need redesign of journalling layer itself. > > Dave Chinner thinks this problem should be solved and being solved in a > different manner by making the bdi-flusher writeback cgroup aware. > > Greg Thelen's memcg writeback patchset (already been proposed for LSF/MM > summit this year) adds cgroup awareness to writeback. Some aspects of > this patchset could be borrowed for solving the problem of throttling > buffered writes. > > As I understand the topic was discussed during last Kernel Summit as > well and the idea is to get the IO-less throttling patchset into the > kernel, then do per-memcg dirty memory limiting and add some memcg > awareness to writeback Greg Thelen and then when these things settle > down, think how to solve this problem since noone really seem to have a > good answer to it. > > Having worked on linux filesystem/storage area for a few years now and > having spent time understanding the various approaches tried and looked > at other feasible way of solving this problem, I look forward to > participate in the summit and discussions. > > So, the topic I would like to discuss is solving the problem of > "throttling buffered writes". This could considered for discussion with > memcg writeback session if that topic has been allocated a slot. > > I'm aware that this is a late submission and my apologies for not making > it earlier. But, I want to take chances and see if it is possible still.. This is an interesting and complicated topic. As you mentioned we have had tried to solve it but nothing has been merged yet. Personally, I am still interested in having a discussion and see if we can come up with a way forward. 
Personally, I am still interested in having a discussion and seeing if we can come up with a way forward.

Because filesystems are not cgroup aware, throttling IO below the filesystem has the danger of IO from faster cgroups being throttled behind a slower cgroup (journalling was one example and there could be others). Hence, I personally think that this problem should be solved at a higher layer, that is, when we are actually writing to the cache. That has the disadvantage of still seeing IO spikes at the device, but I guess we live with that. Doing it at a higher layer also allows the same logic to be used for NFS too; otherwise NFS buffered writes will continue to be a problem.

In the case of the memory controller it just becomes a write-to-memory issue, and I am not sure if the notion of dirty_ratio and dirty_bytes is enough or whether we need to rate limit the write to memory.

Anyway, ideas to have better control of write rates are welcome. We have seen issues where a virtual machine cloning operation is going on, we also want a small direct write to be on disk, and it can take a long time with deadline. CFQ should still be fine as direct IO is synchronous, but deadline treats all WRITEs the same way.

Maybe deadline should be modified to differentiate between SYNC and ASYNC IO instead of READ/WRITE. Jens?

Thanks
Vivek
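[To make the "throttle when the task writes to the cache" idea above concrete, here is a minimal sketch of a per-cgroup token bucket that a function like balance_dirty_pages() could consult before letting a task dirty more pages. It is illustrative only and not taken from any posted patch; the structure and function names are hypothetical, and rate_bps is assumed to be configured to a non-zero value.]

	#include <linux/jiffies.h>
	#include <linux/math64.h>
	#include <linux/spinlock.h>

	/* Hypothetical per-cgroup dirty-rate state, not an existing interface. */
	struct dirty_throttle {
		u64		rate_bps;	/* configured buffered-write limit */
		u64		tokens;		/* bytes the group may still dirty */
		unsigned long	last_refill;	/* jiffies of the last refill */
		spinlock_t	lock;
	};

	/*
	 * Refill the bucket for the elapsed time, then try to charge 'bytes'.
	 * Returns 0 if the caller may proceed, or the number of jiffies it
	 * should sleep before retrying.
	 */
	static unsigned long dirty_throttle_charge(struct dirty_throttle *dt,
						   unsigned long bytes)
	{
		unsigned long now = jiffies, sleep = 0;

		spin_lock(&dt->lock);
		dt->tokens += div_u64(dt->rate_bps * (now - dt->last_refill), HZ);
		dt->tokens = min(dt->tokens, dt->rate_bps);	/* cap the burst at ~1s */
		dt->last_refill = now;

		if (dt->tokens >= bytes)
			dt->tokens -= bytes;
		else	/* not enough budget: tell the caller how long to wait */
			sleep = div64_u64((bytes - dt->tokens) * HZ, dt->rate_bps);
		spin_unlock(&dt->lock);

		return sleep;
	}

[Because the check happens before pages enter writeback, the same mechanism would apply to NFS and other non-block filesystems, which is the main attraction of throttling at this level.]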
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-02 15:33 ` Vivek Goyal @ 2012-03-05 19:22 ` Fengguang Wu 2012-03-05 21:11 ` Vivek Goyal 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 1 sibling, 1 reply; 21+ messages in thread From: Fengguang Wu @ 2012-03-05 19:22 UTC (permalink / raw) To: Vivek Goyal Cc: Suresh Jayaraman, lsf-pc, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Fri, Mar 02, 2012 at 10:33:23AM -0500, Vivek Goyal wrote: > On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote: > > Committee members, > > > > Please consider inviting me to the Storage, Filesystem, & MM Summit. I > > am working for one of the kernel teams in SUSE Labs focusing on Network > > filesystems and block layer. > > > > Recently, I have been trying to solve the problem of "throttling > > buffered writes" to make per-cgroup throttling of IO to the device > > possible. Currently the block IO controller does not throttle buffered > > writes. The writes would have lost the submitter's context (I/O comes in > > flusher thread's context) when they are at the block IO layer. I looked > > at the past work and many folks have attempted to solve this problem in > > the past years but this problem remains unsolved so far. > > > > First, Andrea Righi tried to solve this by limiting the rate of async > > writes at the time a task is generating dirty pages in the page cache. > > > > Next, Vivek Goyal tried to solve this by throttling writes at the time > > they are entering the page cache. > > > > Both these approches have limitations and not considered for merging. > > > > I have looked at the possibility of solving this at the filesystem level > > but the problem with ext* filesystems is that a commit will commit the > > whole transaction at once (which may contain writes from > > processes belonging to more than one cgroup). Making filesystems cgroup > > aware would need redesign of journalling layer itself. > > > > Dave Chinner thinks this problem should be solved and being solved in a > > different manner by making the bdi-flusher writeback cgroup aware. > > > > Greg Thelen's memcg writeback patchset (already been proposed for LSF/MM > > summit this year) adds cgroup awareness to writeback. Some aspects of > > this patchset could be borrowed for solving the problem of throttling > > buffered writes. > > > > As I understand the topic was discussed during last Kernel Summit as > > well and the idea is to get the IO-less throttling patchset into the > > kernel, then do per-memcg dirty memory limiting and add some memcg > > awareness to writeback Greg Thelen and then when these things settle > > down, think how to solve this problem since noone really seem to have a > > good answer to it. > > > > Having worked on linux filesystem/storage area for a few years now and > > having spent time understanding the various approaches tried and looked > > at other feasible way of solving this problem, I look forward to > > participate in the summit and discussions. > > > > So, the topic I would like to discuss is solving the problem of > > "throttling buffered writes". This could considered for discussion with > > memcg writeback session if that topic has been allocated a slot. > > > > I'm aware that this is a late submission and my apologies for not making > > it earlier. But, I want to take chances and see if it is possible still.. > > This is an interesting and complicated topic. As you mentioned we have had > tried to solve it but nothing has been merged yet. 
> Personally, I am still interested in having a discussion and seeing if
> we can come up with a way forward.

I'm interested, too. Here is my attempt on the problem a year ago:

blk-cgroup: async write IO controller ("buffered write" would be more precise)
https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d
https://lkml.org/lkml/2011/4/4/205

> Because filesystems are not cgroup aware, throttling IO below the
> filesystem has the danger of IO from faster cgroups being throttled
> behind a slower cgroup (journalling was one example and there could be
> others). Hence, I personally think that this problem should be solved at
> a higher layer, that is, when we are actually writing to the cache. That
> has the disadvantage of still seeing IO spikes at the device, but I guess
> we live with that. Doing it at a higher layer also allows the same logic
> to be used for NFS too; otherwise NFS buffered writes will continue to be
> a problem.

Totally agreed.

> In the case of the memory controller it just becomes a write-to-memory
> issue, and I am not sure if the notion of dirty_ratio and dirty_bytes is
> enough or whether we need to rate limit the write to memory.

In a perfect world, the dirty size and the dirty rate would each be balanced around their targets. Ideally we could independently limit the dirty size in the memcg context and the dirty rate in blkcg. If the user wants to control both size and rate, he may put tasks into a memcg as well as a blkcg.

In reality the dirty size limit will impact the dirty rate, because the memcg needs to adjust its tasks' balanced dirty rate to drive the memcg dirty size to the target, and so does the global dirty target.

Compared to the global dirty size balancing, memcg suffers from a unique problem: given N memcgs each running a dd task, each memcg's dirty size will drop suddenly every (N/2) seconds, because the flusher writes out the inodes in a coarse, time-split, round-robin fashion, with chunks of up to (bdi->write_bandwidth/2). That sudden drop of memcg dirty pages may drive the dirty size far from the target; as a result the dirty rate will need to be adjusted heavily in order to drive the dirty size back to the target. So the memcg dirty size balancing may create large fluctuations in the dirty rates, and even long stall times for the memcg tasks.

What's more, due to the uncontrollable way the flusher walks through the dirty pages, and to how the dirty pages are distributed among the dirty inodes and memcgs, the dirty rate will be impacted heavily by the workload and the behavior of the flusher when enforcing the dirty size target. There is no satisfactory solution to this so far.

Currently I'm trying to shy away from this and look into improving page reclaim so that it can work well with LRU lists in which half the pages are dirty/writeback. Then the 20% global dirty limit should be enough to serve most memcg tasks well, taking into account the unevenly distributed dirty pages among the different memcgs and NUMA zones/nodes. There may still be a few memcgs that need further dirty throttling, but they likely consist mainly of heavy dirtiers and can afford less smoothness and longer delays.

In comparison, the dirty rate limit for buffered writes seems less convoluted to me. It sure has its own problems, so we see several solutions in circulation, each with its unique trade-offs. But at least we have relatively simple solutions that work to their design goals.

> Anyway, ideas to have better control of write rates are welcome.
> We have seen issues where a virtual machine cloning operation is going
> on, we also want a small direct write to be on disk, and it can take a
> long time with deadline. CFQ should still be fine as direct IO is
> synchronous, but deadline treats all WRITEs the same way.
>
> Maybe deadline should be modified to differentiate between SYNC and ASYNC
> IO instead of READ/WRITE. Jens?

In general users definitely need higher priorities for SYNC writes. It will also enable the "buffered write I/O controller" and the "direct write I/O controller" to co-exist well and operate independently this way: the direct writes always enjoy higher priority than the flusher, but will be rate limited by the already upstreamed blk-cgroup I/O controller. The remaining disk bandwidth will be split among the buffered write tasks by another I/O controller operating at the balance_dirty_pages() level.

Thanks,
Fengguang
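[For reference, the SYNC/ASYNC idea for deadline could look roughly like the sketch below: classify requests by rq_is_sync() instead of by data direction when picking a FIFO. This is only an illustration of the suggestion, not a tested patch; it assumes the two-queue layout of the deadline scheduler of this era.]

	#include <linux/blkdev.h>

	static inline int dd_queue_index(struct request *rq)
	{
		/*
		 * Reads and synchronous writes (e.g. O_DIRECT) share the
		 * low-latency queue; only async (flusher) writeback goes to
		 * the second queue.
		 */
		return rq_is_sync(rq) ? 0 : 1;
	}

	/*
	 * deadline_add_request() would then index dd->fifo_list[] and the
	 * per-queue expire times with dd_queue_index(rq) instead of
	 * rq_data_dir(rq).
	 */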
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling
  From: Vivek Goyal @ 2012-03-05 21:11 UTC
  To: Fengguang Wu
  Cc: Suresh Jayaraman, lsf-pc, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen

On Mon, Mar 05, 2012 at 11:22:26AM -0800, Fengguang Wu wrote:

[..]
> > This is an interesting and complicated topic. As you mentioned we have
> > tried to solve it but nothing has been merged yet. Personally, I am
> > still interested in having a discussion and seeing if we can come up
> > with a way forward.
>
> I'm interested, too. Here is my attempt on the problem a year ago:
>
> blk-cgroup: async write IO controller ("buffered write" would be more precise)
> https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d
> https://lkml.org/lkml/2011/4/4/205

That was a proof of concept. Now we will need to provide actual user-visible knobs and integrate with one of the existing controllers (memcg or blkcg).

[..]
> > Anyway, ideas to have better control of write rates are welcome. We
> > have seen issues where a virtual machine cloning operation is going on,
> > we also want a small direct write to be on disk, and it can take a long
> > time with deadline. CFQ should still be fine as direct IO is
> > synchronous, but deadline treats all WRITEs the same way.
> >
> > Maybe deadline should be modified to differentiate between SYNC and
> > ASYNC IO instead of READ/WRITE. Jens?
>
> In general users definitely need higher priorities for SYNC writes. It
> will also enable the "buffered write I/O controller" and the "direct
> write I/O controller" to co-exist well and operate independently this
> way: the direct writes always enjoy higher priority than the flusher,
> but will be rate limited by the already upstreamed blk-cgroup I/O
> controller. The remaining disk bandwidth will be split among the
> buffered write tasks by another I/O controller operating at the
> balance_dirty_pages() level.

OK, so differentiating IO between SYNC and ASYNC makes sense, and it probably will make sense in the case of deadline too (unless there is a reason to keep it the existing way).

I am a little wary of keeping the "dirty rate limit" separate from the rest of the limits, as the configuration of groups becomes even harder. Once you put a workload in a cgroup, you now need to configure multiple rate limits: a "reads and direct writes" limit plus a "buffered write rate limit". To add to the confusion, it is not just a direct write limit; it also is a limit on writethrough writes, where fsync writes show up in the context of the writing thread.

But it looks like we don't have much choice. As buffered writes can be controlled at two levels, we probably need two knobs. Also, controlling writes while they enter the cache will be a global limit and not per device (unlike the current per-device limit in the blkio controller). Having a separate control for the "dirty rate limit" leaves scope for implementing write control at the device level in the future (as some people prefer that). Possibly the two solutions can co-exist in the future.

Assuming this means that we both agree that there should be some sort of knob to control the "dirty rate", the question is where it should be: in memcg or blkcg.
Given the fact that we are controlling the write to memory, and that we are already planning to have per-memcg dirty ratio and dirty bytes, to me it makes more sense to integrate this new limit with memcg instead of blkcg. The block layer does not even come into the picture at that level, hence implementing something in blkcg would be a little out of place?

Thanks
Vivek
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 21:11 ` Vivek Goyal @ 2012-03-05 22:30 ` Fengguang Wu 2012-03-05 23:19 ` Andrea Righi 2012-03-05 22:58 ` Andrea Righi 1 sibling, 1 reply; 21+ messages in thread From: Fengguang Wu @ 2012-03-05 22:30 UTC (permalink / raw) To: Vivek Goyal Cc: Suresh Jayaraman, lsf-pc, Andrea Righi, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote: > On Mon, Mar 05, 2012 at 11:22:26AM -0800, Fengguang Wu wrote: > > [..] > > > This is an interesting and complicated topic. As you mentioned we have had > > > tried to solve it but nothing has been merged yet. Personally, I am still > > > interested in having a discussion and see if we can come up with a way > > > forward. > > > > I'm interested, too. Here is my attempt on the problem a year ago: > > > > blk-cgroup: async write IO controller ("buffered write" would be more precise) > > https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d > > https://lkml.org/lkml/2011/4/4/205 > > That was a proof of concept. Now we will need to provide actual user > visibale knobs and integrate with one of the existing controller (memcg > or blkcg). The next commit adds the interface to blkcg: throttle.async_write_bps Note that it's simply exporting the knobs via blkcg and does not really depend on the blkcg functionalities. > [..] > > > Anyway, ideas to have better control of write rates are welcome. We have > > > seen issues wheren a virtual machine cloning operation is going on and > > > we also want a small direct write to be on disk and it can take a long > > > time with deadline. CFQ should still be fine as direct IO is synchronous > > > but deadline treats all WRITEs the same way. > > > > > > May be deadline should be modified to differentiate between SYNC and ASYNC > > > IO instead of READ/WRITE. Jens? > > > > In general users definitely need higher priorities for SYNC writes. It > > will also enable the "buffered write I/O controller" and "direct write > > I/O controller" to co-exist well and operate independently this way: > > the direct writes always enjoy higher priority than the flusher, but > > will be rate limited by the already upstreamed blk-cgroup I/O > > controller. The remaining disk bandwidth will be split among the > > buffered write tasks by another I/O controller operating at the > > balance_dirty_pages() level. > > Ok, so differentiating IO among SYNC/ASYNC makes sense and it probably > will make sense in case of deadline too. (Until and unless there is a > reason to keep it existing way). Agreed. But note that the deadline I/O scheduler has nothing to do with the I/O controllers. > I am little vary of keeping "dirty rate limit" separate from rest of the > limits as configuration of groups becomes even harder. Once you put a > workload in a cgroup, now you need to configure multiple rate limits. > "reads and direct writes" limit + "buffered write rate limit". Good point. If we really want it, it's technically possible to provide one single write rate limit to the user. The way is to account the current DIRECT write bandwidth. Subtract it from the general write rate limit, we get the limit available for buffered writes. Thus we'll be providing some "throttle.total_write_bps" rather than "throttle.async_write_bps". Oh it may be difficult to implement total_write_bps for direct writes, which is implemented at the device level. 
But still, if it's the right interface to have, we can make it happen by calling into balance_dirty_pages() (or some algorithm abstracted from it) at the end of each direct write and letting it handle the global throttling.

> To add to the confusion, it is not just a direct write limit; it also is
> a limit on writethrough writes, where fsync writes show up in the context
> of the writing thread.

Sorry, I'm not sure I caught the words. Is it that O_SYNC writes can and would be (confusingly) rate limited at both levels?

> But it looks like we don't have much choice. As buffered writes can be
> controlled at two levels, we probably need two knobs. Also, controlling
> writes while they enter the cache will be a global limit and not per
> device (unlike the current per-device limit in the blkio controller).
> Having a separate control for the "dirty rate limit" leaves scope for
> implementing write control at the device level in the future (as some
> people prefer that). Possibly the two solutions can co-exist in the
> future.

Good point. balance_dirty_pages() has no idea about the devices at all. So the rate limit for buffered writes can hardly be unified with the per-device rate limit for direct writes.

BTW, it may have technical merits to enforce per-bdi buffered write rate limits for each cgroup: imagine it's writing concurrently to a 10MB/s USB key and a 100MB/s disk. But perhaps there are also demerits when all the user wants is the gross write rate, rather than having to care about the unnecessary partitioning between sda1 and sda2.

> Assuming this means that we both agree that there should be some sort of
> knob to control the "dirty rate", the question is where it should be: in
> memcg or blkcg. Given the fact that we are controlling the write to
> memory, and that we are already planning to have per-memcg dirty ratio
> and dirty bytes, to me it makes more sense to integrate this new limit
> with memcg instead of blkcg. The block layer does not even come into the
> picture at that level, hence implementing something in blkcg would be a
> little out of place?

I personally prefer memcg for dirty sizes and blkcg for dirty rates.

Thanks,
Fengguang
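[A rough sketch of the single-knob variant discussed above, one total write limit with buffered writes getting whatever bandwidth direct writes leave over, might look like the following. The throttle.total_write_bps knob, the structure, the fields and the helper are all hypothetical.]

	#include <linux/jiffies.h>
	#include <linux/math64.h>

	struct cgroup_write_limit {
		u64		total_write_bps;	/* the single user-visible knob */
		u64		direct_bytes;		/* direct IO bytes in the current window */
		unsigned long	window_start;		/* jiffies when the window began */
	};

	/* How many bytes/sec may buffered writeback still use for this cgroup? */
	static u64 buffered_write_budget(struct cgroup_write_limit *wl)
	{
		unsigned long elapsed = jiffies - wl->window_start;
		u64 direct_bps;

		if (!elapsed)
			return wl->total_write_bps;

		/* Measured direct-write bandwidth over the window. */
		direct_bps = div64_u64(wl->direct_bytes * HZ, elapsed);
		return wl->total_write_bps > direct_bps ?
		       wl->total_write_bps - direct_bps : 0;
	}

[The direct-write side would keep charging direct_bytes from the existing device-level throttling path; whether that coupling is worth the complexity is exactly the trade-off discussed above.]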
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling
  From: Andrea Righi @ 2012-03-05 23:19 UTC
  To: Fengguang Wu
  Cc: Vivek Goyal, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen

On Mon, Mar 05, 2012 at 02:30:29PM -0800, Fengguang Wu wrote:
> On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote:
...
> > But it looks like we don't have much choice. As buffered writes can be
> > controlled at two levels, we probably need two knobs. Also, controlling
> > writes while they enter the cache will be a global limit and not per
> > device (unlike the current per-device limit in the blkio controller).
> > Having a separate control for the "dirty rate limit" leaves scope for
> > implementing write control at the device level in the future (as some
> > people prefer that). Possibly the two solutions can co-exist in the
> > future.
>
> Good point. balance_dirty_pages() has no idea about the devices at
> all. So the rate limit for buffered writes can hardly be unified with
> the per-device rate limit for direct writes.

I think balance_dirty_pages() can have an idea about devices. We can get a reference to the right block device / request queue from the address_space:

	bdev = mapping->host->i_sb->s_bdev;
	q = bdev_get_queue(bdev);

(NULL pointer dereferences apart).

-Andrea
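[Spelled out with the NULL checks Andrea alludes to, the lookup might look like the sketch below. The helper name is made up; note that s_bdev is NULL for non-block filesystems such as NFS, which is one reason the thread keeps returning to throttling above the block layer.]

	#include <linux/blkdev.h>
	#include <linux/fs.h>

	static struct request_queue *mapping_to_queue(struct address_space *mapping)
	{
		struct inode *inode = mapping->host;
		struct block_device *bdev;

		if (!inode || !inode->i_sb)
			return NULL;

		bdev = inode->i_sb->s_bdev;	/* NULL for NFS, tmpfs, ... */
		if (!bdev)
			return NULL;

		return bdev_get_queue(bdev);
	}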
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 23:19 ` Andrea Righi @ 2012-03-05 23:51 ` Fengguang Wu 2012-03-06 0:46 ` Andrea Righi 0 siblings, 1 reply; 21+ messages in thread From: Fengguang Wu @ 2012-03-05 23:51 UTC (permalink / raw) To: Andrea Righi Cc: Vivek Goyal, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Tue, Mar 06, 2012 at 12:19:30AM +0100, Andrea Righi wrote: > On Mon, Mar 05, 2012 at 02:30:29PM -0800, Fengguang Wu wrote: > > On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote: > ... > > > But looks like we don't much choice. As buffered writes can be controlled > > > at two levels, we probably need two knobs. Also controlling writes while > > > entring cache limits will be global and not per device (unlinke currnet > > > per device limit in blkio controller). Having separate control for "dirty > > > rate limit" leaves the scope for implementing write control at device > > > level in the future (As some people prefer that). In possibly two > > > solutions can co-exist in future. > > > > Good point. balance_dirty_pages() has no idea about the devices at > > all. So the rate limit for buffered writes can hardly be unified with > > the per-device rate limit for direct writes. > > > > I think balance_dirty_pages() can have an idea about devices. We can get > a reference to the right block device / request queue from the > address_space: > > bdev = mapping->host->i_sb->s_bdev; > q = bdev_get_queue(bdev); > > (NULL pointer dereferences apart). Problem is, there is no general 1:1 mapping between bdev and disks. For the single disk multpile partitions (sda1, sda2...) case, the above scheme is fine and makes the throttle happen at sda granularity. However for md/dm etc. there is no way (or need?) to reach the exact disk that current blkcg is operating on. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 23:51 ` Fengguang Wu @ 2012-03-06 0:46 ` Andrea Righi 2012-03-07 20:26 ` Vivek Goyal 0 siblings, 1 reply; 21+ messages in thread From: Andrea Righi @ 2012-03-06 0:46 UTC (permalink / raw) To: Fengguang Wu Cc: Vivek Goyal, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Mon, Mar 05, 2012 at 03:51:32PM -0800, Fengguang Wu wrote: > On Tue, Mar 06, 2012 at 12:19:30AM +0100, Andrea Righi wrote: > > On Mon, Mar 05, 2012 at 02:30:29PM -0800, Fengguang Wu wrote: > > > On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote: > > ... > > > > But looks like we don't much choice. As buffered writes can be controlled > > > > at two levels, we probably need two knobs. Also controlling writes while > > > > entring cache limits will be global and not per device (unlinke currnet > > > > per device limit in blkio controller). Having separate control for "dirty > > > > rate limit" leaves the scope for implementing write control at device > > > > level in the future (As some people prefer that). In possibly two > > > > solutions can co-exist in future. > > > > > > Good point. balance_dirty_pages() has no idea about the devices at > > > all. So the rate limit for buffered writes can hardly be unified with > > > the per-device rate limit for direct writes. > > > > > > > I think balance_dirty_pages() can have an idea about devices. We can get > > a reference to the right block device / request queue from the > > address_space: > > > > bdev = mapping->host->i_sb->s_bdev; > > q = bdev_get_queue(bdev); > > > > (NULL pointer dereferences apart). > > Problem is, there is no general 1:1 mapping between bdev and disks. > For the single disk multpile partitions (sda1, sda2...) case, the > above scheme is fine and makes the throttle happen at sda granularity. > > However for md/dm etc. there is no way (or need?) to reach the exact > disk that current blkcg is operating on. > > Thanks, > Fengguang Oh I see, the problem is with stacked block devices. Right, if we set a limit for sda and a stacked block device is defined over sda, we'd get only the bdev at the top of the stack at balance_dirty_pages() and the limits configured for the underlying block devices will be ignored. However, maybe for the 90% of the cases this is fine, I can't see a real world scenario where we may want to limit only part or indirectly a stacked block device... Thanks, -Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-06 0:46 ` Andrea Righi @ 2012-03-07 20:26 ` Vivek Goyal 0 siblings, 0 replies; 21+ messages in thread From: Vivek Goyal @ 2012-03-07 20:26 UTC (permalink / raw) To: Andrea Righi Cc: Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Tue, Mar 06, 2012 at 01:46:02AM +0100, Andrea Righi wrote: [..] > > > > Good point. balance_dirty_pages() has no idea about the devices at > > > > all. So the rate limit for buffered writes can hardly be unified with > > > > the per-device rate limit for direct writes. > > > > > > > > > > I think balance_dirty_pages() can have an idea about devices. We can get > > > a reference to the right block device / request queue from the > > > address_space: > > > > > > bdev = mapping->host->i_sb->s_bdev; > > > q = bdev_get_queue(bdev); > > > > > > (NULL pointer dereferences apart). > > > > Problem is, there is no general 1:1 mapping between bdev and disks. > > For the single disk multpile partitions (sda1, sda2...) case, the > > above scheme is fine and makes the throttle happen at sda granularity. > > > > However for md/dm etc. there is no way (or need?) to reach the exact > > disk that current blkcg is operating on. > > > > Thanks, > > Fengguang > > Oh I see, the problem is with stacked block devices. Right, if we set a > limit for sda and a stacked block device is defined over sda, we'd get > only the bdev at the top of the stack at balance_dirty_pages() and the > limits configured for the underlying block devices will be ignored. > > However, maybe for the 90% of the cases this is fine, I can't see a real > world scenario where we may want to limit only part or indirectly a > stacked block device... I agree that throttling will make most sense on the top most device in the stack. If we try to do anything on the intermediate device, it might not make much sense and we will most likely lose context also. Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 21:11 ` Vivek Goyal 2012-03-05 22:30 ` Fengguang Wu @ 2012-03-05 22:58 ` Andrea Righi 2012-03-07 20:52 ` Vivek Goyal 1 sibling, 1 reply; 21+ messages in thread From: Andrea Righi @ 2012-03-05 22:58 UTC (permalink / raw) To: Vivek Goyal Cc: Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen On Mon, Mar 05, 2012 at 04:11:15PM -0500, Vivek Goyal wrote: > On Mon, Mar 05, 2012 at 11:22:26AM -0800, Fengguang Wu wrote: > > [..] > > > This is an interesting and complicated topic. As you mentioned we have had > > > tried to solve it but nothing has been merged yet. Personally, I am still > > > interested in having a discussion and see if we can come up with a way > > > forward. > > > > I'm interested, too. Here is my attempt on the problem a year ago: > > > > blk-cgroup: async write IO controller ("buffered write" would be more precise) > > https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d > > https://lkml.org/lkml/2011/4/4/205 > > That was a proof of concept. Now we will need to provide actual user > visibale knobs and integrate with one of the existing controller (memcg > or blkcg). > > [..] > > > Anyway, ideas to have better control of write rates are welcome. We have > > > seen issues wheren a virtual machine cloning operation is going on and > > > we also want a small direct write to be on disk and it can take a long > > > time with deadline. CFQ should still be fine as direct IO is synchronous > > > but deadline treats all WRITEs the same way. > > > > > > May be deadline should be modified to differentiate between SYNC and ASYNC > > > IO instead of READ/WRITE. Jens? > > > > In general users definitely need higher priorities for SYNC writes. It > > will also enable the "buffered write I/O controller" and "direct write > > I/O controller" to co-exist well and operate independently this way: > > the direct writes always enjoy higher priority than the flusher, but > > will be rate limited by the already upstreamed blk-cgroup I/O > > controller. The remaining disk bandwidth will be split among the > > buffered write tasks by another I/O controller operating at the > > balance_dirty_pages() level. > > Ok, so differentiating IO among SYNC/ASYNC makes sense and it probably > will make sense in case of deadline too. (Until and unless there is a > reason to keep it existing way). > > I am little vary of keeping "dirty rate limit" separate from rest of the > limits as configuration of groups becomes even harder. Once you put a > workload in a cgroup, now you need to configure multiple rate limits. > "reads and direct writes" limit + "buffered write rate limit". To add > to the confusion, it is not just direct write limit, it also is a limit > on writethrough writes where fsync writes will show up in the context > of writing thread. > > But looks like we don't much choice. As buffered writes can be controlled > at two levels, we probably need two knobs. Also controlling writes while > entring cache limits will be global and not per device (unlinke currnet > per device limit in blkio controller). Having separate control for "dirty > rate limit" leaves the scope for implementing write control at device > level in the future (As some people prefer that). In possibly two > solutions can co-exist in future. > > Assuming this means that we both agree that three should be some sort of > knob to control "dirty rate", question is where should it be. In memcg > or blkcg. 
> Given the fact that we are controlling the write to memory, and that we
> are already planning to have per-memcg dirty ratio and dirty bytes, to me
> it makes more sense to integrate this new limit with memcg instead of
> blkcg. The block layer does not even come into the picture at that level,
> hence implementing something in blkcg would be a little out of place?
>
> Thanks
> Vivek

What about this scenario? (Sorry, I've not followed some of the recent discussions on this topic, so I'm sure I'm oversimplifying a bit or ignoring some details):

 - track inodes per-memcg for writeback IO (provided by Greg's patch)

 - provide a per-memcg dirty limit (global, not per-device); when this limit is exceeded, flusher threads are awakened and all tasks that continue to generate new dirty pages inside the memcg are put to sleep

 - flusher threads start to write some dirty inodes of this memcg (using the inode tracking feature); let's say they start with a chunk of N pages of the first dirty inode

 - flusher threads can't flush in this way more than N pages / sec (where N * PAGE_SIZE / sec is the blkcg "buffered write rate limit" on the inode's block device); if a flusher thread exceeds this limit it won't be blocked directly, it just stops flushing pages for this memcg after the first chunk and it can continue to flush dirty pages of a different memcg.

In this way tasks are actively limited at the memcg layer and the writeback rate is limited by the blkcg layer. The missing piece (that has not been proposed yet) is to plug into the flusher threads the logic "I can flush your memcg dirty pages only if your blkcg rate is OK, otherwise let's see if someone else needs to flush some dirty pages".

Thanks,
-Andrea
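[The "missing piece" Andrea describes could be sketched in pseudocode as below. Every helper here is hypothetical; the point is only to show where the blkcg budget check would sit in the flusher's loop over memcgs.]

	/* Pseudocode only: all iterators and helpers are hypothetical. */
	static void flush_dirty_memcgs(struct backing_dev_info *bdi)
	{
		struct mem_cgroup *memcg;

		for_each_dirty_memcg(bdi, memcg) {	/* hypothetical iterator */
			unsigned long chunk = writeback_chunk_pages(bdi);	/* ~bandwidth/2 */

			/*
			 * Over its blkcg buffered-write rate?  Skip it for now:
			 * its dirtiers keep sleeping at the memcg dirty limit,
			 * and we try again on a later pass.
			 */
			if (!blkcg_writeback_budget_ok(memcg, bdi, chunk))
				continue;

			writeback_memcg_inodes(memcg, bdi, chunk);	/* hypothetical */
			blkcg_writeback_charge(memcg, bdi, chunk);	/* hypothetical */
		}
	}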
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling
  From: Vivek Goyal @ 2012-03-07 20:52 UTC
  To: Andrea Righi
  Cc: Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen

On Mon, Mar 05, 2012 at 11:58:01PM +0100, Andrea Righi wrote:

[..]
> What about this scenario? (Sorry, I've not followed some of the recent
> discussions on this topic, so I'm sure I'm oversimplifying a bit or
> ignoring some details):
>
> - track inodes per-memcg for writeback IO (provided by Greg's patch)
> - provide a per-memcg dirty limit (global, not per-device); when this
>   limit is exceeded, flusher threads are awakened and all tasks that
>   continue to generate new dirty pages inside the memcg are put to
>   sleep
> - flusher threads start to write some dirty inodes of this memcg (using
>   the inode tracking feature); let's say they start with a chunk of N
>   pages of the first dirty inode
> - flusher threads can't flush in this way more than N pages / sec
>   (where N * PAGE_SIZE / sec is the blkcg "buffered write rate limit"
>   on the inode's block device); if a flusher thread exceeds this limit
>   it won't be blocked directly, it just stops flushing pages for this
>   memcg after the first chunk and it can continue to flush dirty pages
>   of a different memcg.

So, IIUC, the only thing that is a little different here is that the throttling is implemented by the flusher thread. But it is still per device, per cgroup. I think it is just an implementation detail whether we implement it in the block layer, in writeback, or somewhere else. We could very well implement it in the block layer and provide a per-bdi, per-group congestion flag so that the flusher will stop pushing more IO if a group on a bdi is congested (because its IO is throttled).

I think the first important thing is to figure out the minimal set of requirements (as Jan said in another mail) which will solve a wide variety of cases. I am trying to list some of the points.

- Throttling for buffered writes

  - Do we want per-device throttling limits or global throttling limits?

  - Existing direct write limits are per device and implemented in the
    block layer.

  - I personally think that both kinds of limits might make sense, but a
    global limit for async writes might make more sense, at least for
    workloads like backup which can run at a throttled speed.

  - Absolute throttling of IO will make the most sense on the top-level
    device in the IO stack.

  - For per-device rate throttling, do we want a common limit for direct
    writes and buffered writes, or a separate limit just for buffered
    writes?

- Proportional IO for async writes

  - Will probably make the most sense on the bottom-most devices in the
    IO stack (if we are able to somehow retain the submitter's context).

  - Logically it would make sense to keep sync and async writes in the
    same group and try to provide a fair share of the disk between
    groups. Technically CFQ can do that, but in practice I think it will
    be problematic: writes of one group will take precedence over reads
    of another group. Currently any read is prioritized over buffered
    writes, so by splitting buffered writes into their own cgroups, they
    can severely impact the latency of reads in another group. Not sure
    how many people really want to do that in practice.

  - Do we really need proportional IO for async writes? CFQ had tried
    implementing ioprio for async writes but it does not work.
    Should we just care about groups of sync IO, let all the async IO on
    a device go into a single queue, and make sure it is not starved
    while sync IO is going on?

  - I thought that most people cared about not impacting sync latencies
    badly while buffered writes are happening. Not many complained that
    buffered writes of one application should happen faster than those of
    another application.

  - If we agree that not many people require service differentiation
    between buffered writes, then we probably don't have to do anything
    in this space and we can keep things simple. I personally prefer this
    option. Trying to provide proportional IO for async writes will make
    things complicated and we might not achieve much.

  - CFQ already does a very good job of prioritizing sync over async (at
    the cost of reduced throughput on fast devices). So what's the use
    case for proportional IO for async writes?

Once we figure out what the requirements are, we can discuss the implementation details.

Thanks
Vivek
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-07 20:52 ` Vivek Goyal @ 2012-03-07 22:04 ` Jeff Moyer 2012-03-08 8:08 ` Greg Thelen 1 sibling, 0 replies; 21+ messages in thread From: Jeff Moyer @ 2012-03-07 22:04 UTC (permalink / raw) To: Vivek Goyal Cc: Andrea Righi, Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara, Greg Thelen Vivek Goyal <vgoyal@redhat.com> writes: > On Mon, Mar 05, 2012 at 11:58:01PM +0100, Andrea Righi wrote: > > [..] >> What about this scenario? (Sorry, I've not followed some of the recent >> discussions on this topic, so I'm sure I'm oversimplifying a bit or >> ignoring some details): >> >> - track inodes per-memcg for writeback IO (provided Greg's patch) >> - provide per-memcg dirty limit (global, not per-device); when this >> limit is exceeded flusher threads are awekened and all tasks that >> continue to generate new dirty pages inside the memcg are put to >> sleep >> - flusher threads start to write some dirty inodes of this memcg (using >> the inode tracking feature), let say they start with a chunk of N >> pages of the first dirty inode >> - flusher threads can't flush in this way more than N pages / sec >> (where N * PAGE_SIZE / sec is the blkcg "buffered write rate limit" >> on the inode's block device); if a flusher thread exceeds this limit >> it won't be blocked directly, it just stops flushing pages for this >> memcg after the first chunk and it can continue to flush dirty pages >> of a different memcg. >> > > So, IIUC, the only thing little different here is that throttling is > implemented by flusher thread. But it is still per device per cgroup. I > think that is just a implementation detail whether we implement it > in block layer, or in writeback or somewhere else. We can very well > implement it in block layer and provide per bdi/per_group congestion > flag in bdi so that flusher will stop pushing more IO if group on > a bdi is congested (because IO is throttled). > > I think first important thing is to figure out what is minimal set of > requirement (As jan said in another mail), which will solve wide > variety of cases. I am trying to list some of points. > > > - Throttling for buffered writes > - Do we want per device throttling limits or global throttling > limtis. You can implement global (perhaps in userspace utilities) if you have the per-device mechanism in the kernel. So I'd say start with per-device. > - Exising direct write limtis are per device and implemented in > block layer. > > - I personally think that both kind of limits might make sense. > But a global limit for async write might make more sense at > least for the workloads like backup which can run on a throttled > speed. When you say global, do you mean total bandwidth across all devices, or a maximum bandwidth applied to each device? > - Absolute throttling IO will make most sense on top level device > in the IO stack. I'm not sure why you used the word absolute. I do agree that throttling at the top-most device in a stack makes the most sense. > - For per device rate throttling, do we want a common limit for > direct write and buffered write or a separate limit just for > buffered writes. That depends, what's the goal? Direct writes can drive very deep queue depths, just as buffered writes can. > - Proportional IO for async writes > - Will probably make most sense on bottom most devices in the IO > stack (If we are able to somehow retain the submitter's context). Why does it make sense to have it at the bottom? 
Just because that's where it's implemented today? Writeback happens to the top-most device, and that device can have different properties than each of its components. So, why don't you think applying policy at the top is the right thing to do? > - Logically it will make sense to keep sync and async writes in > same group and try to provide fair share of disk between groups. > Technically CFQ can do that but in practice I think it will be > problematic. Writes of one group will take precedence of reads > of another group. Currently any read is prioritized over > buffered writes. So by splitting buffered writes in their own > cgroups, they can serverly impact the latency of reads in > another group. Not sure how many people really want to do > that in practice. > > - Do we really need proportional IO for async writes. CFQ had > tried implementing ioprio for async writes but it does not > work. Should we just care about groups of sync IO and let > all the async IO on device go in a single queue and lets > make suere it is not starved while sync IO is going on. If we get accounting of writeback I/O right, then I think it might make sense to enforce the proportional I/O policy on aysnc writes. But, I guess this also depends on what happens with the mem policy, right? > - I thought that most of the people cared about not impacting > sync latencies badly while buffered writes are happening. Not > many complained that buffered writes of one application should > happen faster than other application. Until you are forced to reclaim pages.... > - If we agree that not many people require service differentation > between buffered writes, then we probably don't have to do > anything in this space and we can keep things simple. I > personally prefer this option. Trying to provide proportional > IO for async writes will make things complicated and we might > not achieve much. Again, I think that, in order to consider this, we'd also have to lay out a plan for how it interacts with the memory cgroup policies. > - CFQ already does a very good job of prioritizing sync over async > (at the cost of reduced throuhgput on fast devices). So what's > the use case of proportion IO for async writes. > > Once we figure out what are the requirements, we can discuss the > implementation details. Nice write-up, Vivek. Cheers, Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-07 20:52 ` Vivek Goyal 2012-03-07 22:04 ` Jeff Moyer @ 2012-03-08 8:08 ` Greg Thelen 1 sibling, 0 replies; 21+ messages in thread From: Greg Thelen @ 2012-03-08 8:08 UTC (permalink / raw) To: Vivek Goyal Cc: Andrea Righi, Fengguang Wu, Suresh Jayaraman, lsf-pc, linux-mm, linux-fsdevel, Jan Kara Vivek Goyal <vgoyal@redhat.com> writes: > So, IIUC, the only thing little different here is that throttling is > implemented by flusher thread. But it is still per device per cgroup. I > think that is just a implementation detail whether we implement it > in block layer, or in writeback or somewhere else. We can very well > implement it in block layer and provide per bdi/per_group congestion > flag in bdi so that flusher will stop pushing more IO if group on > a bdi is congested (because IO is throttled). > > I think first important thing is to figure out what is minimal set of > requirement (As jan said in another mail), which will solve wide > variety of cases. I am trying to list some of points. > > > - Throttling for buffered writes > - Do we want per device throttling limits or global throttling > limtis. > > - Exising direct write limtis are per device and implemented in > block layer. > > - I personally think that both kind of limits might make sense. > But a global limit for async write might make more sense at > least for the workloads like backup which can run on a throttled > speed. > > - Absolute throttling IO will make most sense on top level device > in the IO stack. > > - For per device rate throttling, do we want a common limit for > direct write and buffered write or a separate limit just for > buffered writes. Another aspect to this problem is 'dirty memory limiting'. First a quick refresher on memory.soft_limit_in_bytes... In memcg the soft_limit_in_bytes can be used as a way to overcommit a machine's memory. The idea is that the memory.limit_in_bytes (aka hard limit) specified a absolute maximum amount of memory a memcg can use, while the soft_limit_in_bytes indicates the working set of the container. The simplified equation is that if the sum(*/memory.soft_limit_in_bytes) < MemTotal, then all containers should be guaranteed their working set. Jobs are allowed to allocate more than soft_limit_in_bytes so long as they fit within limit_in_bytes. This attempts to provide a min and max amount of memory for a cgroup. The soft_limit_in_bytes is related to this discussion because it is desirable if all container memory above soft_limit_in_bytes is reclaimable (i.e. clean file cache). Using previously posted memcg dirty limiting and memcg writeback logic we have been able to set a container's dirty_limit to its soft_limit. While not perfect, this approximates the goal of providing min guaranteed memory while allowing for usage of best effort memory, so long as that best effort memory can be quickly reclaimed to satisfy another container's min guarantee. > - Proportional IO for async writes > - Will probably make most sense on bottom most devices in the IO > stack (If we are able to somehow retain the submitter's context). > > - Logically it will make sense to keep sync and async writes in > same group and try to provide fair share of disk between groups. > Technically CFQ can do that but in practice I think it will be > problematic. Writes of one group will take precedence of reads > of another group. Currently any read is prioritized over > buffered writes. 
So by splitting buffered writes in their own > cgroups, they can serverly impact the latency of reads in > another group. Not sure how many people really want to do > that in practice. > > - Do we really need proportional IO for async writes. CFQ had > tried implementing ioprio for async writes but it does not > work. Should we just care about groups of sync IO and let > all the async IO on device go in a single queue and lets > make suere it is not starved while sync IO is going on. > > > - I thought that most of the people cared about not impacting > sync latencies badly while buffered writes are happening. Not > many complained that buffered writes of one application should > happen faster than other application. > > - If we agree that not many people require service differentation > between buffered writes, then we probably don't have to do > anything in this space and we can keep things simple. I > personally prefer this option. Trying to provide proportional > IO for async writes will make things complicated and we might > not achieve much. > > - CFQ already does a very good job of prioritizing sync over async > (at the cost of reduced throuhgput on fast devices). So what's > the use case of proportion IO for async writes. > > Once we figure out what are the requirements, we can discuss the > implementation details. > > Thanks > Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-02 15:33 ` Vivek Goyal 2012-03-05 19:22 ` Fengguang Wu @ 2012-03-05 20:23 ` Jan Kara 2012-03-05 21:41 ` Vivek Goyal ` (2 more replies) 1 sibling, 3 replies; 21+ messages in thread From: Jan Kara @ 2012-03-05 20:23 UTC (permalink / raw) To: Vivek Goyal Cc: Suresh Jayaraman, linux-fsdevel, linux-mm, lsf-pc, Jan Kara, Andrea Righi On Fri 02-03-12 10:33:23, Vivek Goyal wrote: > On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote: > > Committee members, > > > > Please consider inviting me to the Storage, Filesystem, & MM Summit. I > > am working for one of the kernel teams in SUSE Labs focusing on Network > > filesystems and block layer. > > > > Recently, I have been trying to solve the problem of "throttling > > buffered writes" to make per-cgroup throttling of IO to the device > > possible. Currently the block IO controller does not throttle buffered > > writes. The writes would have lost the submitter's context (I/O comes in > > flusher thread's context) when they are at the block IO layer. I looked > > at the past work and many folks have attempted to solve this problem in > > the past years but this problem remains unsolved so far. > > > > First, Andrea Righi tried to solve this by limiting the rate of async > > writes at the time a task is generating dirty pages in the page cache. > > > > Next, Vivek Goyal tried to solve this by throttling writes at the time > > they are entering the page cache. > > > > Both these approches have limitations and not considered for merging. > > > > I have looked at the possibility of solving this at the filesystem level > > but the problem with ext* filesystems is that a commit will commit the > > whole transaction at once (which may contain writes from > > processes belonging to more than one cgroup). Making filesystems cgroup > > aware would need redesign of journalling layer itself. > > > > Dave Chinner thinks this problem should be solved and being solved in a > > different manner by making the bdi-flusher writeback cgroup aware. > > > > Greg Thelen's memcg writeback patchset (already been proposed for LSF/MM > > summit this year) adds cgroup awareness to writeback. Some aspects of > > this patchset could be borrowed for solving the problem of throttling > > buffered writes. > > > > As I understand the topic was discussed during last Kernel Summit as > > well and the idea is to get the IO-less throttling patchset into the > > kernel, then do per-memcg dirty memory limiting and add some memcg > > awareness to writeback Greg Thelen and then when these things settle > > down, think how to solve this problem since noone really seem to have a > > good answer to it. > > > > Having worked on linux filesystem/storage area for a few years now and > > having spent time understanding the various approaches tried and looked > > at other feasible way of solving this problem, I look forward to > > participate in the summit and discussions. > > > > So, the topic I would like to discuss is solving the problem of > > "throttling buffered writes". This could considered for discussion with > > memcg writeback session if that topic has been allocated a slot. > > > > I'm aware that this is a late submission and my apologies for not making > > it earlier. But, I want to take chances and see if it is possible still.. > > This is an interesting and complicated topic. As you mentioned we have had > tried to solve it but nothing has been merged yet. 
Personally, I am still > interested in having a discussion and see if we can come up with a way > forward. > > Because filesystems are not cgroup aware, throtting IO below filesystem > has dangers of IO of faster cgroups being throttled behind slower cgroup > (journalling was one example and there could be others). Hence, I personally > think that this problem should be solved at higher layer and that is when > we are actually writting to the cache. That has the disadvantage of still > seeing IO spikes at the device but I guess we live with that. Doing it > at higher layer also allows to use the same logic for NFS too otherwise > NFS buffered write will continue to be a problem. Well, I agree limiting of memory dirty rate has a value but if I look at a natural use case where I have several cgroups and I want to make sure disk time is fairly divided among them, then limiting dirty rate doesn't quite do what I need. Because I'm interested in time it takes disk to process the combination of reads, direct IO, and buffered writes the cgroup generates. Having the limits for dirty rate and other IO separate means I have to be rather pesimistic in setting the bounds so that combination of dirty rate + other IO limit doesn't exceed the desired bound but this is usually unnecessarily harsh... We agree though (as we spoke together last year) that throttling at block layer isn't really an option at least for some filesystems such as ext3/4. But what seemed like a plausible idea to me was that we'd account all IO including buffered writes at block layer (there we'd need at least approximate tracking of originator of the IO - tracking inodes as Greg did in his patch set seemed OK) but throttle only direct IO & reads. Limitting of buffered writes would then be achieved by a) having flusher thread choose inodes to write depending on how much available disk time cgroup has and b) throttling buffered writers when cgroup has too many dirty pages. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
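Jan's two-part proposal (the flusher picks inodes only from cgroups that still have disk time available, and a writer is throttled once its cgroup holds too many dirty pages) can be reduced to a small user-space model. This is only a sketch of the decision logic with invented structures and numbers; in a real kernel the two checks would sit in the writeback path and in balance_dirty_pages() respectively, while reads and direct IO would still be throttled at submission time.

#include <stdbool.h>
#include <stdio.h>

struct cgroup_model {
	const char *name;
	long disk_time_budget;	/* disk time slices left in this period */
	long dirty_pages;	/* pages dirtied but not yet written    */
	long dirty_limit;	/* per-cgroup cap on dirty pages        */
};

struct inode_model {
	const char *path;
	struct cgroup_model *owner;	/* approximate originator, tracked per inode */
	long dirty_pages;
};

/* (a) flusher side: only service inodes whose cgroup has disk time left */
static bool flusher_should_write(const struct inode_model *inode)
{
	return inode->owner->disk_time_budget > 0;
}

/* (b) writer side: the task must wait until the flusher cleans some pages */
static bool writer_must_wait(const struct cgroup_model *cg)
{
	return cg->dirty_pages >= cg->dirty_limit;
}

int main(void)
{
	struct cgroup_model fast = { "fast", 80, 100, 4096 };
	struct cgroup_model slow = { "slow", 0, 5000, 4096 };
	struct inode_model inodes[] = {
		{ "/data/a.log", &fast, 100 },
		{ "/data/b.log", &slow, 5000 },
	};
	size_t i;

	for (i = 0; i < sizeof(inodes) / sizeof(inodes[0]); i++)
		printf("%s: flusher %s pick it now\n", inodes[i].path,
		       flusher_should_write(&inodes[i]) ? "would" : "would not");

	printf("a writer in 'slow' %s\n",
	       writer_must_wait(&slow) ? "blocks until some pages are cleaned"
				       : "keeps dirtying");
	return 0;
}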
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara @ 2012-03-05 21:41 ` Vivek Goyal 2012-03-07 17:24 ` Jan Kara 2012-03-05 22:18 ` Vivek Goyal 2012-03-07 6:31 ` Fengguang Wu 2 siblings, 1 reply; 21+ messages in thread From: Vivek Goyal @ 2012-03-05 21:41 UTC (permalink / raw) To: Jan Kara; +Cc: Suresh Jayaraman, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: [..] > > Because filesystems are not cgroup aware, throtting IO below filesystem > > has dangers of IO of faster cgroups being throttled behind slower cgroup > > (journalling was one example and there could be others). Hence, I personally > > think that this problem should be solved at higher layer and that is when > > we are actually writting to the cache. That has the disadvantage of still > > seeing IO spikes at the device but I guess we live with that. Doing it > > at higher layer also allows to use the same logic for NFS too otherwise > > NFS buffered write will continue to be a problem. > Well, I agree limiting of memory dirty rate has a value but if I look at > a natural use case where I have several cgroups and I want to make sure > disk time is fairly divided among them, then limiting dirty rate doesn't > quite do what I need. Actually "proportional IO control" generally addresses the use case of disk time being fairly divided among cgroups. The "throttling/upper limit" I think is more targeted towards the cases where you have bandwidth but you don't want to give it to user as user has not paid for that kind of service. Though it could be used for other things like monitoring the system dynamically and throttling rates of a particular cgroup if admin thinks that particular cgroup is doing too much of IO. Or for things like, start a backup operation with an upper limit of say 50MB/s so that it does not affect other system activities too much. > Because I'm interested in time it takes disk to > process the combination of reads, direct IO, and buffered writes the cgroup > generates. Having the limits for dirty rate and other IO separate means I > have to be rather pesimistic in setting the bounds so that combination of > dirty rate + other IO limit doesn't exceed the desired bound but this is > usually unnecessarily harsh... Yes, seprating out the throttling limits for "reads + direct writes + certain wriththrough writes" and "buffered writes" is not ideal. But it might still have some value for specific use cases (writes over NFS, backup application, throttling a specific disk hog workload etc). > > We agree though (as we spoke together last year) that throttling at block > layer isn't really an option at least for some filesystems such as ext3/4. Yes, because of jorunalling issues and ensuring serialization, throttling/upper limit at block/device level becomes less attractive. > But what seemed like a plausible idea to me was that we'd account all IO > including buffered writes at block layer (there we'd need at least > approximate tracking of originator of the IO - tracking inodes as Greg did > in his patch set seemed OK) but throttle only direct IO & reads. Limitting > of buffered writes would then be achieved by > a) having flusher thread choose inodes to write depending on how much > available disk time cgroup has and > b) throttling buffered writers when cgroup has too many dirty pages. I am trying to remember what we had discussed. There have been so many ideas floated in this area, that now I get confused. 
So let's take throttling/upper limits out of the picture for a moment and just focus on the use case of proportional IO (fair share of the disk among cgroups). - In that case yes, we probably can come up with some IO tracking mechanism so that IO can be accounted to the right cgroup (the IO originator's cgroup) at the block layer. We could either store some info in "struct page" or do some approximation as you mentioned, like the inode owner. - With buffered IO accounted to the right cgroup, CFQ should automatically start providing the cgroup its fair share (well, small changes will be required). But there are still two more issues. - The issue of making writeback cgroup aware. I am assuming that this work will be taken forward by Greg. - Breaking down request descriptors into some kind of per cgroup notion so that one cgroup is not stuck behind another. (Or come up with a different mechanism for per cgroup congestion). That way, if a cgroup is congested at CFQ, the flusher should stop submitting more IO for it; that will lead to increased dirty pages in memcg and that should throttle the application. So all of the above seems to be proportional IO (fair share of the disk). This should still co-exist with "throttling/upper limit" implementation/knobs and one is not necessarily a replacement for the other? Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
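The per-cgroup request descriptor point is easiest to see as a toy allocator: each group gets its own descriptor quota, exhausting it marks only that group congested, and that per-group congestion is the signal the flusher would use to stop submitting for the group. The quota value and all names below are invented for the sketch.

#include <stdbool.h>
#include <stdio.h>

#define PER_GROUP_NR_REQUESTS 128	/* made-up per-group quota */

struct group_rq_pool {
	const char *cgroup;
	int allocated;
};

static bool group_congested(const struct group_rq_pool *p)
{
	return p->allocated >= PER_GROUP_NR_REQUESTS;
}

/* returns false when the group must back off (flusher stops submitting) */
static bool group_alloc_request(struct group_rq_pool *p)
{
	if (group_congested(p))
		return false;
	p->allocated++;
	return true;
}

static void group_free_request(struct group_rq_pool *p)
{
	if (p->allocated > 0)
		p->allocated--;
}

int main(void)
{
	struct group_rq_pool heavy = { "backup", 0 };
	struct group_rq_pool light = { "database", 0 };
	int i;

	for (i = 0; i < 200; i++)
		group_alloc_request(&heavy);	/* heavy group hits its quota */

	printf("backup congested: %s, database congested: %s\n",
	       group_congested(&heavy) ? "yes" : "no",
	       group_congested(&light) ? "yes" : "no");

	group_free_request(&heavy);		/* an IO completion frees a slot */
	printf("backup can submit again: %s\n",
	       group_alloc_request(&heavy) ? "yes" : "no");
	return 0;
}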
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 21:41 ` Vivek Goyal @ 2012-03-07 17:24 ` Jan Kara 2012-03-07 21:29 ` Vivek Goyal 0 siblings, 1 reply; 21+ messages in thread From: Jan Kara @ 2012-03-07 17:24 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi, Suresh Jayaraman On Mon 05-03-12 16:41:30, Vivek Goyal wrote: > On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > [..] > > > Because filesystems are not cgroup aware, throtting IO below filesystem > > > has dangers of IO of faster cgroups being throttled behind slower cgroup > > > (journalling was one example and there could be others). Hence, I personally > > > think that this problem should be solved at higher layer and that is when > > > we are actually writting to the cache. That has the disadvantage of still > > > seeing IO spikes at the device but I guess we live with that. Doing it > > > at higher layer also allows to use the same logic for NFS too otherwise > > > NFS buffered write will continue to be a problem. > > Well, I agree limiting of memory dirty rate has a value but if I look at > > a natural use case where I have several cgroups and I want to make sure > > disk time is fairly divided among them, then limiting dirty rate doesn't > > quite do what I need. > > Actually "proportional IO control" generally addresses the use case of > disk time being fairly divided among cgroups. The "throttling/upper limit" > I think is more targeted towards the cases where you have bandwidth but > you don't want to give it to user as user has not paid for that kind > of service. Though it could be used for other things like monitoring the > system dynamically and throttling rates of a particular cgroup if admin > thinks that particular cgroup is doing too much of IO. Or for things like, > start a backup operation with an upper limit of say 50MB/s so that it > does not affect other system activities too much. Well, I was always slightly sceptical that these absolute bandwidth limits are that great thing. If some cgroup beats your storage with 10 MB/s of random tiny writes, then it uses more of your resources than an streaming 50 MB/s write. So although admins might be tempted to use throughput limits at the first moment because they are easier to understand, they might later find it's not quite what they wanted. Specifically for the imagined use case where a customer pays just for a given bandwidth, you can achieve similar (and IMHO more reliable) results using proportional control. Say you have available 100 MB/s sequential IO bandwidth and you would like to limit cgroup to 10 MB/s. Then you just give it weight 10. Another cgroup paying for 20 MB/s would get weight 20 and so on. If you are a clever provider and pack your load so that machine is well utilized, cgroups will get limited roughly at given bounds... > > Because I'm interested in time it takes disk to > > process the combination of reads, direct IO, and buffered writes the cgroup > > generates. Having the limits for dirty rate and other IO separate means I > > have to be rather pesimistic in setting the bounds so that combination of > > dirty rate + other IO limit doesn't exceed the desired bound but this is > > usually unnecessarily harsh... > > Yes, seprating out the throttling limits for "reads + direct writes + > certain wriththrough writes" and "buffered writes" is not ideal. 
But > it might still have some value for specific use cases (writes over NFS, > backup application, throttling a specific disk hog workload etc). > > > > > We agree though (as we spoke together last year) that throttling at block > > layer isn't really an option at least for some filesystems such as ext3/4. > > Yes, because of jorunalling issues and ensuring serialization, > throttling/upper limit at block/device level becomes less attractive. > > > But what seemed like a plausible idea to me was that we'd account all IO > > including buffered writes at block layer (there we'd need at least > > approximate tracking of originator of the IO - tracking inodes as Greg did > > in his patch set seemed OK) but throttle only direct IO & reads. Limitting > > of buffered writes would then be achieved by > > a) having flusher thread choose inodes to write depending on how much > > available disk time cgroup has and > > b) throttling buffered writers when cgroup has too many dirty pages. > > I am trying to remember what we had discussed. There have been so many > ideas floated in this area, that now I get confused. > > So lets take throttling/upper limit out of the picture for a moment and just > focus on the use case of proportional IO (fare share of disk among cgroups). > > - In that case yes, we probably can come up with some IO tracking > mechanism so that IO can be accounted to right cgroup (IO originator's > cgroup) at block layer. We could either store some info in "struct > page" or do some approximation as you mentioned like inode owner. > > - With buffered IO accounted to right cgroup, CFQ should automatically > start providing cgroup its fair share (Well little changes will be > required). But there are still two more issues. > > - Issue of making writeback cgroup aware. I am assuming that > this work will be taken forward by Greg. > > - Breaking down request descriptors into some kind of per cgroup > notion so that one cgroup is not stuck behind other. (Or come > up with a different mechanism for per cgroup congestion). > > That way, if a cgroup is congested at CFQ, flusher should stop submitting > more IO for it, that will lead to increased dirty pages in memcg and that > should throttle the application. > > So all of the aove seems to be proportional IO (fair shrae of disk). This > should still be co-exist with "throttling/upper limit" implementation/knobs > and one is not necessarily replacement for other? Well, I don't see a strict reason why the above won't work for "upper limit" knobs. After all, these knobs just mean you don't want to submit more that X MB of IO in 1 second. So you just need flusher thread to check against such limit as well. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
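Jan's weight example is plain arithmetic: weights are relative, so they translate into absolute bandwidth only when the device is saturated. In the sketch below the weights 10 and 20 are the ones from the mail; the best-effort group with weight 70 is an assumption added so that the weights total 100, which is what makes weight 10 come out near 10 MB/s on a 100 MB/s disk.

#include <stdio.h>

struct grp {
	const char *name;
	unsigned int weight;
};

int main(void)
{
	const double device_bw = 100.0;		/* MB/s when saturated */
	struct grp groups[] = {
		{ "paid-10MBps", 10 },
		{ "paid-20MBps", 20 },
		{ "best-effort", 70 },		/* assumed filler load */
	};
	size_t i;
	unsigned int total = 0;

	for (i = 0; i < sizeof(groups) / sizeof(groups[0]); i++)
		total += groups[i].weight;

	for (i = 0; i < sizeof(groups) / sizeof(groups[0]); i++)
		printf("%-12s weight %3u -> ~%.1f MB/s of a saturated disk\n",
		       groups[i].name, groups[i].weight,
		       device_bw * groups[i].weight / total);
	return 0;
}

Because proportional control is work conserving, a group can exceed its nominal figure whenever other groups are idle; the bound only emerges when the provider keeps the device busy, which is the packing argument above.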
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-07 17:24 ` Jan Kara @ 2012-03-07 21:29 ` Vivek Goyal 0 siblings, 0 replies; 21+ messages in thread From: Vivek Goyal @ 2012-03-07 21:29 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, linux-mm, lsf-pc, Andrea Righi, Suresh Jayaraman On Wed, Mar 07, 2012 at 06:24:53PM +0100, Jan Kara wrote: > On Mon 05-03-12 16:41:30, Vivek Goyal wrote: > > On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > > [..] > > > > Because filesystems are not cgroup aware, throtting IO below filesystem > > > > has dangers of IO of faster cgroups being throttled behind slower cgroup > > > > (journalling was one example and there could be others). Hence, I personally > > > > think that this problem should be solved at higher layer and that is when > > > > we are actually writting to the cache. That has the disadvantage of still > > > > seeing IO spikes at the device but I guess we live with that. Doing it > > > > at higher layer also allows to use the same logic for NFS too otherwise > > > > NFS buffered write will continue to be a problem. > > > Well, I agree limiting of memory dirty rate has a value but if I look at > > > a natural use case where I have several cgroups and I want to make sure > > > disk time is fairly divided among them, then limiting dirty rate doesn't > > > quite do what I need. > > > > Actually "proportional IO control" generally addresses the use case of > > disk time being fairly divided among cgroups. The "throttling/upper limit" > > I think is more targeted towards the cases where you have bandwidth but > > you don't want to give it to user as user has not paid for that kind > > of service. Though it could be used for other things like monitoring the > > system dynamically and throttling rates of a particular cgroup if admin > > thinks that particular cgroup is doing too much of IO. Or for things like, > > start a backup operation with an upper limit of say 50MB/s so that it > > does not affect other system activities too much. > Well, I was always slightly sceptical that these absolute bandwidth > limits are that great thing. If some cgroup beats your storage with 10 MB/s > of random tiny writes, then it uses more of your resources than an > streaming 50 MB/s write. So although admins might be tempted to use > throughput limits at the first moment because they are easier to > understand, they might later find it's not quite what they wanted. Well, you have iops limits too and one can specify bps as well as iops limits and blkcg will do iops_limt AND bps_limit. So for large sequential IO one can speicfy 50MB/s upper limit but at the same time might want to specify some iops limit to cover for the case of small random IO. But I agree that configuring these limits might not be easy. One need to know capacity of the system and provision things accoridingly. As capacity of system is more or less workload depednent, its hard to predict. I personally thought that some kind of dynamic monitoring application can help which can dynamically monitor which cgroup is imacting system badly and go and change its upper limits. > > Specifically for the imagined use case where a customer pays just for a > given bandwidth, you can achieve similar (and IMHO more reliable) results > using proportional control. Say you have available 100 MB/s sequential IO > bandwidth and you would like to limit cgroup to 10 MB/s. Then you just > give it weight 10. Another cgroup paying for 20 MB/s would get weight 20 > and so on. 
If you are a clever provider and pack your load so that machine > is well utilized, cgroups will get limited roughly at given bounds... Well, if multiple virtual machines are running, you just don't know who is doing how much of IO at a given point of time. So a virtual machine might experience a very different IO bandwidth based on how many other virtual machines are doing IO at that point of time. Once Chris Wright mentioned that upper limits might be useful in providing a more consistent IO bandwidth experience to virtual machines as they might be migrated from one host to other and these hosts might have different IO bandwidth altogether. > > > > Because I'm interested in time it takes disk to > > > process the combination of reads, direct IO, and buffered writes the cgroup > > > generates. Having the limits for dirty rate and other IO separate means I > > > have to be rather pesimistic in setting the bounds so that combination of > > > dirty rate + other IO limit doesn't exceed the desired bound but this is > > > usually unnecessarily harsh... > > > > Yes, seprating out the throttling limits for "reads + direct writes + > > certain wriththrough writes" and "buffered writes" is not ideal. But > > it might still have some value for specific use cases (writes over NFS, > > backup application, throttling a specific disk hog workload etc). > > > > > > > > We agree though (as we spoke together last year) that throttling at block > > > layer isn't really an option at least for some filesystems such as ext3/4. > > > > Yes, because of jorunalling issues and ensuring serialization, > > throttling/upper limit at block/device level becomes less attractive. > > > > > But what seemed like a plausible idea to me was that we'd account all IO > > > including buffered writes at block layer (there we'd need at least > > > approximate tracking of originator of the IO - tracking inodes as Greg did > > > in his patch set seemed OK) but throttle only direct IO & reads. Limitting > > > of buffered writes would then be achieved by > > > a) having flusher thread choose inodes to write depending on how much > > > available disk time cgroup has and > > > b) throttling buffered writers when cgroup has too many dirty pages. > > > > I am trying to remember what we had discussed. There have been so many > > ideas floated in this area, that now I get confused. > > > > So lets take throttling/upper limit out of the picture for a moment and just > > focus on the use case of proportional IO (fare share of disk among cgroups). > > > > - In that case yes, we probably can come up with some IO tracking > > mechanism so that IO can be accounted to right cgroup (IO originator's > > cgroup) at block layer. We could either store some info in "struct > > page" or do some approximation as you mentioned like inode owner. > > > > - With buffered IO accounted to right cgroup, CFQ should automatically > > start providing cgroup its fair share (Well little changes will be > > required). But there are still two more issues. > > > > - Issue of making writeback cgroup aware. I am assuming that > > this work will be taken forward by Greg. > > > > - Breaking down request descriptors into some kind of per cgroup > > notion so that one cgroup is not stuck behind other. (Or come > > up with a different mechanism for per cgroup congestion). > > > > That way, if a cgroup is congested at CFQ, flusher should stop submitting > > more IO for it, that will lead to increased dirty pages in memcg and that > > should throttle the application. 
> > > > So all of the above seems to be proportional IO (fair share of the disk). This > > should still co-exist with "throttling/upper limit" implementation/knobs > > and one is not necessarily a replacement for the other? > Well, I don't see a strict reason why the above won't work for "upper > limit" knobs. After all, these knobs just mean you don't want to submit > more than X MB of IO in 1 second. So you just need the flusher thread to check > against such a limit as well. Well, yes, the flusher thread can check for that, or the block layer can implement the logic and the flusher thread can just check whether a group is congested or not. I was just trying to differentiate between "throttling/upper limit" which is non-work conserving and "proportional IO" which is work conserving. Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
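The bps and iops knobs Vivek mentions already existed in blk-throttle at the time, so the earlier example of running a backup at roughly 50MB/s would be configured as in the sketch below. The cgroup name, the 8:16 device number and the 2000 iops figure are assumptions for illustration; both control files take one "major:minor value" line per device.

#include <stdio.h>

static int set_limit(const char *knob, const char *dev, long long val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/blkio/backup/%s", knob);	/* assumed cgroup */
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s %lld\n", dev, val);	/* "major:minor value" format */
	return fclose(f);
}

int main(void)
{
	/* cap throughput at about 50 MB/s on device 8:16 ... */
	set_limit("blkio.throttle.write_bps_device", "8:16", 50LL << 20);
	/* ... and also cap iops so small random writes cannot dodge the limit */
	set_limit("blkio.throttle.write_iops_device", "8:16", 2000);
	return 0;
}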
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 2012-03-05 21:41 ` Vivek Goyal @ 2012-03-05 22:18 ` Vivek Goyal 2012-03-05 22:36 ` Jan Kara 2012-03-07 6:31 ` Fengguang Wu 2 siblings, 1 reply; 21+ messages in thread From: Vivek Goyal @ 2012-03-05 22:18 UTC (permalink / raw) To: Jan Kara; +Cc: Andrea Righi, Suresh Jayaraman, linux-mm, linux-fsdevel, lsf-pc On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: [..] > Having the limits for dirty rate and other IO separate means I > have to be rather pessimistic in setting the bounds so that the combination of > dirty rate + other IO limit doesn't exceed the desired bound but this is > usually unnecessarily harsh... We had solved this issue in my previous posting. https://lkml.org/lkml/2011/6/28/243 I was accounting the buffered writes to the associated block group in balance_dirty_pages() and throttling them if the group was exceeding its upper limit. This had a common limit for all kinds of writes (direct + buffered + sync etc). But it also had its share of issues. - Control was per device (not global) and was not applicable to NFS. - Will not prevent IO spikes at the device (caused by flusher threads). Dave Chinner preferred to throttle IO at the device, much later in the stack. I personally think that "dirty rate limit" does not solve all problems but has some value and it will be interesting to merge any one implementation and see if it solves a real world problem. It does not block any other idea of buffered write proportional control or even implementing upper limits in blkcg. We could put "dirty rate limit" in memcg and develop the rest of the ideas in blkcg, writeback etc. Thanks Vivek -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
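A rough user-space model of what that patch set did is sketched below: pace the dirtying task itself, at the balance_dirty_pages() point, whenever its group runs ahead of a configured buffered-write limit. The structure names and numbers are invented; the point is only that the sleep is charged to the task generating dirty pages rather than to the flusher.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

struct group_dirty_rate {
	long long bps_limit;	/* configured upper limit on the dirty rate */
	long long dirtied;	/* bytes dirtied since the period started   */
	time_t start;
};

/*
 * Called after a task has dirtied 'len' bytes: if the group is running
 * ahead of its allowed rate, make the dirtier sleep until it is back on
 * schedule, the same feedback point balance_dirty_pages() uses.
 */
static void group_dirty_throttle(struct group_dirty_rate *g, long long len)
{
	double elapsed, should_have_taken;

	g->dirtied += len;
	elapsed = difftime(time(NULL), g->start);
	should_have_taken = (double)g->dirtied / g->bps_limit;
	if (should_have_taken > elapsed)
		sleep((unsigned int)(should_have_taken - elapsed));
}

int main(void)
{
	struct group_dirty_rate g = { 10LL << 20, 0, time(NULL) };	/* 10 MB/s cap */
	int i;

	for (i = 0; i < 30; i++)
		group_dirty_throttle(&g, 1 << 20);	/* the task dirties 1 MB */

	printf("dirtied 30 MB at no more than roughly 10 MB/s\n");
	return 0;
}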
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 22:18 ` Vivek Goyal @ 2012-03-05 22:36 ` Jan Kara 2012-03-07 6:42 ` Fengguang Wu 0 siblings, 1 reply; 21+ messages in thread From: Jan Kara @ 2012-03-05 22:36 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi, Suresh Jayaraman On Mon 05-03-12 17:18:43, Vivek Goyal wrote: > On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > > [..] > > Having the limits for dirty rate and other IO separate means I > > have to be rather pesimistic in setting the bounds so that combination of > > dirty rate + other IO limit doesn't exceed the desired bound but this is > > usually unnecessarily harsh... > > We had solved this issue in my previous posting. > > https://lkml.org/lkml/2011/6/28/243 > > I was accounting the buffered writes to associated block group in > balance dirty pages and throttling it if group was exceeding upper > limit. This had common limit for all kind of writes (direct + buffered + > sync etc). Ah, I didn't know that. > But it also had its share of issues. > > - Control was per device (not global) and was not applicable to NFS. > - Will not prevent IO spikes at devices (caused by flusher threads). > > Dave Chinner preferred to throttle IO at devices much later. > > I personally think that "dirty rate limit" does not solve all problems > but has some value and it will be interesting to merge any one > implementation and see if it solves a real world problem. It rather works the other way around - you first have to show enough users are interested in the particular feature you want to merge and then the feature can get merged. Once the feature is merged we are stuck supporting it forever so we have to be very cautious in what we merge... > It does not block any other idea of buffered write proportional control > or even implementing upper limit in blkcg. We could put "dirty rate > limit" in memcg and develop rest of the ideas in blkcg, writeback etc. Yes, it doesn't block them but OTOH we should have as few features as possible because otherwise it's a configuration and maintenance nightmare (both from admin and kernel POV). So we should think twice what set of features we choose to satisfy user demand. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 22:36 ` Jan Kara @ 2012-03-07 6:42 ` Fengguang Wu 0 siblings, 0 replies; 21+ messages in thread From: Fengguang Wu @ 2012-03-07 6:42 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi, Suresh Jayaraman On Mon, Mar 05, 2012 at 11:36:37PM +0100, Jan Kara wrote: > On Mon 05-03-12 17:18:43, Vivek Goyal wrote: > > On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > > > > [..] > > > Having the limits for dirty rate and other IO separate means I > > > have to be rather pesimistic in setting the bounds so that combination of > > > dirty rate + other IO limit doesn't exceed the desired bound but this is > > > usually unnecessarily harsh... > > > > We had solved this issue in my previous posting. > > > > https://lkml.org/lkml/2011/6/28/243 > > > > I was accounting the buffered writes to associated block group in > > balance dirty pages and throttling it if group was exceeding upper > > limit. This had common limit for all kind of writes (direct + buffered + > > sync etc). > Ah, I didn't know that. > > > But it also had its share of issues. > > > > - Control was per device (not global) and was not applicable to NFS. > > - Will not prevent IO spikes at devices (caused by flusher threads). > > > > Dave Chinner preferred to throttle IO at devices much later. > > > > I personally think that "dirty rate limit" does not solve all problems > > but has some value and it will be interesting to merge any one > > implementation and see if it solves a real world problem. > It rather works the other way around - you first have to show enough > users are interested in the particular feature you want to merge and then the > feature can get merged. Once the feature is merged we are stuck supporting > it forever so we have to be very cautious in what we merge... Agreed. > > It does not block any other idea of buffered write proportional control > > or even implementing upper limit in blkcg. We could put "dirty rate > > limit" in memcg and develop rest of the ideas in blkcg, writeback etc. > Yes, it doesn't block them but OTOH we should have as few features as > possible because otherwise it's a configuration and maintenance nightmare > (both from admin and kernel POV). So we should think twice what set of > features we choose to satisfy user demand. Yeah it's a good idea to first figure out the ideal set of user interfaces that are simple, natural, flexible and extensible. Then look into the implementations and see how can we provide interfaces closest to the ideal ones (if not 100% feasible). Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 2012-03-05 21:41 ` Vivek Goyal 2012-03-05 22:18 ` Vivek Goyal @ 2012-03-07 6:31 ` Fengguang Wu 2 siblings, 0 replies; 21+ messages in thread From: Fengguang Wu @ 2012-03-07 6:31 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, Suresh Jayaraman, linux-fsdevel, linux-mm, lsf-pc, Andrea Righi On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote: > On Fri 02-03-12 10:33:23, Vivek Goyal wrote: > > On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote: > > > Committee members, > > > > > > Please consider inviting me to the Storage, Filesystem, & MM Summit. I > > > am working for one of the kernel teams in SUSE Labs focusing on Network > > > filesystems and block layer. > > > > > > Recently, I have been trying to solve the problem of "throttling > > > buffered writes" to make per-cgroup throttling of IO to the device > > > possible. Currently the block IO controller does not throttle buffered > > > writes. The writes would have lost the submitter's context (I/O comes in > > > flusher thread's context) when they are at the block IO layer. I looked > > > at the past work and many folks have attempted to solve this problem in > > > the past years but this problem remains unsolved so far. > > > > > > First, Andrea Righi tried to solve this by limiting the rate of async > > > writes at the time a task is generating dirty pages in the page cache. > > > > > > Next, Vivek Goyal tried to solve this by throttling writes at the time > > > they are entering the page cache. > > > > > > Both these approches have limitations and not considered for merging. > > > > > > I have looked at the possibility of solving this at the filesystem level > > > but the problem with ext* filesystems is that a commit will commit the > > > whole transaction at once (which may contain writes from > > > processes belonging to more than one cgroup). Making filesystems cgroup > > > aware would need redesign of journalling layer itself. > > > > > > Dave Chinner thinks this problem should be solved and being solved in a > > > different manner by making the bdi-flusher writeback cgroup aware. > > > > > > Greg Thelen's memcg writeback patchset (already been proposed for LSF/MM > > > summit this year) adds cgroup awareness to writeback. Some aspects of > > > this patchset could be borrowed for solving the problem of throttling > > > buffered writes. > > > > > > As I understand the topic was discussed during last Kernel Summit as > > > well and the idea is to get the IO-less throttling patchset into the > > > kernel, then do per-memcg dirty memory limiting and add some memcg > > > awareness to writeback Greg Thelen and then when these things settle > > > down, think how to solve this problem since noone really seem to have a > > > good answer to it. > > > > > > Having worked on linux filesystem/storage area for a few years now and > > > having spent time understanding the various approaches tried and looked > > > at other feasible way of solving this problem, I look forward to > > > participate in the summit and discussions. > > > > > > So, the topic I would like to discuss is solving the problem of > > > "throttling buffered writes". This could considered for discussion with > > > memcg writeback session if that topic has been allocated a slot. > > > > > > I'm aware that this is a late submission and my apologies for not making > > > it earlier. But, I want to take chances and see if it is possible still.. 
> > > > This is an interesting and complicated topic. As you mentioned we have had > > tried to solve it but nothing has been merged yet. Personally, I am still > > interested in having a discussion and see if we can come up with a way > > forward. > > > > Because filesystems are not cgroup aware, throtting IO below filesystem > > has dangers of IO of faster cgroups being throttled behind slower cgroup > > (journalling was one example and there could be others). Hence, I personally > > think that this problem should be solved at higher layer and that is when > > we are actually writting to the cache. That has the disadvantage of still > > seeing IO spikes at the device but I guess we live with that. Doing it > > at higher layer also allows to use the same logic for NFS too otherwise > > NFS buffered write will continue to be a problem. > Well, I agree limiting of memory dirty rate has a value but if I look at > a natural use case where I have several cgroups and I want to make sure > disk time is fairly divided among them, then limiting dirty rate doesn't > quite do what I need. Because I'm interested in time it takes disk to > process the combination of reads, direct IO, and buffered writes the cgroup > generates. Having the limits for dirty rate and other IO separate means I > have to be rather pesimistic in setting the bounds so that combination of > dirty rate + other IO limit doesn't exceed the desired bound but this is > usually unnecessarily harsh... Yeah it's quite possible some use cases may need to control read/write respectively and others may want to simply limit the overall r/w throughput or disk utilization. It seems more a matter of interface rather than implementation. If we have code to limit the buffered/direct write bandwidth respectively, it should also be able to limit the overall buffered+direct write bandwidth or even read+write bandwidth. However for the "overall" r+w limit interface to work, some implicit rule of precedences or weight will be necessary, eg. read > DIRECT write > buffered write, or read:DIRECT write:buffered write=10:10:1 or whatever. Which the users may not totally agree. In the end it looks there are always the distinguish of the main SYNC/ASYNC and read/write I/O types and no chance to hide them from the I/O controller interfaces. Then we might export interfaces to allow the users to specify the overall I/O rate limit, the weights for each type of I/O, the individual rate limits for each type of I/O, etc. to the users' heart content. > We agree though (as we spoke together last year) that throttling at block > layer isn't really an option at least for some filesystems such as ext3/4. > But what seemed like a plausible idea to me was that we'd account all IO > including buffered writes at block layer (there we'd need at least Account buffered write I/O when they reach the block layer? It sounds too late. > approximate tracking of originator of the IO - tracking inodes as Greg did > in his patch set seemed OK) but throttle only direct IO & reads. Limitting > of buffered writes would then be achieved by > a) having flusher thread choose inodes to write depending on how much > available disk time cgroup has and The flusher is fundamentally - coarsely controllable due to the large write chunk size - not controllable in the case of shared inodes so any dirty size/rate limiting scheme based on controlling the flusher behavior is not going to be an exact/reliable solution... > b) throttling buffered writers when cgroup has too many dirty pages. 
That still looks like throttling at the balance_dirty_pages() level? Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
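One possible reading of Fengguang's interface idea (an overall limit plus per-type weights) is to derive per-type caps from the weights, as in the sketch below. The read : direct write : buffered write ratio of 10:10:1 is the one from the mail; the 100 MB/s overall figure and the straight proportional split are assumptions, since the mail leaves open whether the rule should be a strict precedence or a weighting.

#include <stdio.h>

int main(void)
{
	const char *types[] = { "read", "direct write", "buffered write" };
	const unsigned int w[] = { 10, 10, 1 };		/* example weights */
	const double overall = 100.0;			/* MB/s, assumed   */
	unsigned int i, total = 0;

	for (i = 0; i < 3; i++)
		total += w[i];

	for (i = 0; i < 3; i++)
		printf("%-15s weight %2u -> up to %.1f MB/s\n",
		       types[i], w[i], overall * w[i] / total);
	return 0;
}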
end of thread, other threads:[~2012-03-08 8:08 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-03-02 7:18 [ATTEND] [LSF/MM TOPIC] Buffered writes throttling Suresh Jayaraman 2012-03-02 15:33 ` Vivek Goyal 2012-03-05 19:22 ` Fengguang Wu 2012-03-05 21:11 ` Vivek Goyal 2012-03-05 22:30 ` Fengguang Wu 2012-03-05 23:19 ` Andrea Righi 2012-03-05 23:51 ` Fengguang Wu 2012-03-06 0:46 ` Andrea Righi 2012-03-07 20:26 ` Vivek Goyal 2012-03-05 22:58 ` Andrea Righi 2012-03-07 20:52 ` Vivek Goyal 2012-03-07 22:04 ` Jeff Moyer 2012-03-08 8:08 ` Greg Thelen 2012-03-05 20:23 ` [Lsf-pc] " Jan Kara 2012-03-05 21:41 ` Vivek Goyal 2012-03-07 17:24 ` Jan Kara 2012-03-07 21:29 ` Vivek Goyal 2012-03-05 22:18 ` Vivek Goyal 2012-03-05 22:36 ` Jan Kara 2012-03-07 6:42 ` Fengguang Wu 2012-03-07 6:31 ` Fengguang Wu