* Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
       [not found] ` <20110222224141.GA23723@linux.develer.com>
@ 2011-02-23  0:03 ` Vivek Goyal
  2011-02-23  8:32   ` Andrea Righi
  0 siblings, 1 reply; 9+ messages in thread

From: Vivek Goyal @ 2011-02-23  0:03 UTC (permalink / raw)
To: Andrea Righi
Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen,
    Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
    Jens Axboe, Andrew Morton, containers, linux-mm, linux-kernel,
    linux-fsdevel

On Tue, Feb 22, 2011 at 11:41:41PM +0100, Andrea Righi wrote:
> On Tue, Feb 22, 2011 at 02:34:03PM -0500, Vivek Goyal wrote:
> > On Tue, Feb 22, 2011 at 06:12:51PM +0100, Andrea Righi wrote:
> > > Currently the blkio.throttle controller only supports synchronous
> > > IO requests. This means that we always look at the current task to
> > > identify the "owner" of each IO request.
> > >
> > > However, dirty pages in the page cache can be written to disk
> > > asynchronously by the per-bdi flusher kernel threads, or by any
> > > other thread in the system, according to the writeback policy.
> > >
> > > For this reason the real writes to the underlying block devices may
> > > occur in a different IO context than the task that originally
> > > generated the dirty pages involved in the IO operation. This makes
> > > the tracking and throttling of writeback IO more complicated than
> > > synchronous IO, from the blkio controller's perspective.
> > >
> > > The same applies to anonymous pages involved in IO operations
> > > (swap).
> > >
> > > This patch set allows tracking the cgroup that originally dirtied
> > > each page-cache page and each anonymous page, and passes this
> > > information to the blk-throttle controller. This information can be
> > > used to provide better service-level differentiation of buffered
> > > writes and swap IO between different cgroups.
> >
> > Hi Andrea,
> >
> > Thanks for the patches. Before I look deeper into the patches, I had
> > a few general queries/thoughts.
> >
> > - So this requires the memory controller to be enabled. Does it also
> >   require the two to be co-mounted?
>
> No and no. The blkio controller enables and uses the page_cgroup
> functionality, but it doesn't depend on the memory controller. It
> automatically selects CONFIG_MM_OWNER and CONFIG_PAGE_TRACKING (the
> latter added in PATCH 3/5), and this is sufficient to make page_cgroup
> usable from any generic controller.
>
> > - Currently in throttling there is no limit on the number of bios
> >   queued per group. I think this is not necessarily a very good idea,
> >   because if throttling limits are low, we will build very long bio
> >   queues. So some AIO process can queue up lots of bios and consume
> >   lots of memory without getting blocked. I am sure there will be
> >   other side effects too. One of the side effects I noticed is that
> >   if an AIO process queues up too much IO, and I then want to kill
> >   it, it just hangs there for a really long time (waiting for all the
> >   throttled IO to complete).
> >
> >   So I was thinking of implementing either a per-group limit or a
> >   per-io-context limit, after which the process will be put to sleep
> >   (something like the request descriptor mechanism).
>
> An io context limit seems a better solution for now. We can also expect
> some help from the memory controller: if we get a per-cgroup dirty
> memory limit in the future, the maximum number of queued bios will be
> automatically bounded by that functionality.
>
> > If that's the case, then comes the question of what to do about
> > kernel threads. Should they be blocked or not? If they are blocked,
> > then a fast group will also be indirectly throttled behind a slow
> > group. If they are not, then we still have the problem of too many
> > bios queued in the throttling layer.
>
> I think kernel threads should never be forced to sleep, to avoid the
> classic "priority inversion" problem and potential DoS in the system.
>
> Also for this part a per-cgroup dirty memory limit could help a lot,
> because a cgroup would never exceed its "quota" of dirty memory, so it
> would not be able to submit more than a certain amount of bios
> (corresponding to the dirty memory limit).

A per-memory-cgroup dirty ratio should help a bit. But with intentional
throttling we always run the risk of faster groups getting stuck behind
slower groups.

Even in the case of buffered WRITEs, are you able to run two buffered
WRITE streams in two groups and throttle them to their respective
rates? It might be interesting to run that and see what happens.
Practically, I feel we shall have to run this with a per-cgroup memory
dirty ratio, and hence co-mount with the memory controller.

> > - What to do about other kernel threads like kjournald, which is
> >   doing IO on behalf of all the filesystem users? If data is also
> >   journalled, then I think again everything gets serialized and a
> >   faster group gets backlogged behind a slower one.
>
> This is the most critical issue IMHO.
>
> The blkio controller needs some help from the filesystems to
> understand which IO requests can be throttled and which cannot. At the
> moment critical IO requests (critical meaning that other requests
> depend on them) and non-critical requests are mixed together in such a
> way that throttling a single request may stop a lot of other requests
> in the system, and at the block layer it's not possible to retrieve
> this information.
>
> I don't have a solution for this right now, except looking at each
> filesystem implementation and trying to understand how to pass this
> information down to the block layer.

True. This is a very important issue which needs to be sorted out,
because if, due to journalling, file IO gets serialized behind a really
slow, artificially throttled group, it might be very bad for overall
filesystem performance.

I am CCing linux-fsdevel, and hopefully the filesystem people there can
give us some ideas on how this can be handled.

> > - Two processes doing IO to the same file: a slower group will
> >   throttle IO for the faster group also (flushing is per inode).
>
> I think we should accept inode granularity. We could redesign the
> writeback code to work per-cgroup / per-page, etc., but that would add
> a huge overhead. The limit of inode granularity could be an acceptable
> tradeoff; cgroups are usually expected to work on different files...
> well, except when databases come into play (ouch!).

Agreed. Per-inode granularity might be acceptable in many cases. Again,
I am worried about a faster group getting stuck behind a slower group.

I am wondering if we are trying to solve the problem of ASYNC write
throttling at the wrong layer. Should ASYNC IO be throttled before we
allow the task to write to the page cache? The way we throttle a
process based on the dirty ratio, can we just check the throttle limits
there too, or something like that? (I think that's what you had done in
your initial throttling controller implementation?)

Thanks
Vivek
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
  2011-02-23  0:03 ` [PATCH 0/5] blk-throttle: writeback and swap IO control Vivek Goyal
@ 2011-02-23  8:32 ` Andrea Righi
  2011-02-23 15:23   ` Vivek Goyal
  0 siblings, 1 reply; 9+ messages in thread

From: Andrea Righi @ 2011-02-23  8:32 UTC (permalink / raw)
To: Vivek Goyal
Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen,
    Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
    Jens Axboe, Andrew Morton, containers, linux-mm, linux-kernel,
    linux-fsdevel

On Tue, Feb 22, 2011 at 07:03:58PM -0500, Vivek Goyal wrote:
> > I think we should accept inode granularity. We could redesign the
> > writeback code to work per-cgroup / per-page, etc., but that would
> > add a huge overhead. The limit of inode granularity could be an
> > acceptable tradeoff; cgroups are usually expected to work on
> > different files... well, except when databases come into play
> > (ouch!).
>
> Agreed. Per-inode granularity might be acceptable in many cases.
> Again, I am worried about a faster group getting stuck behind a
> slower group.
>
> I am wondering if we are trying to solve the problem of ASYNC write
> throttling at the wrong layer. Should ASYNC IO be throttled before we
> allow the task to write to the page cache? The way we throttle a
> process based on the dirty ratio, can we just check the throttle
> limits there too, or something like that? (I think that's what you had
> done in your initial throttling controller implementation?)

Right. This is exactly the same approach I used in my old throttling
controller: throttle sync READs and WRITEs at the block layer, and
async WRITEs when the task is dirtying memory pages.

This is probably the simplest way to resolve the problem of a faster
group getting blocked by a slower group, but the controller will be a
little bit more leaky, because writeback IO will never be throttled and
we'll see some limited IO spikes during writeback. However, this is
still a better solution IMHO than the current implementation, which is
affected by that kind of priority inversion problem.

I can try to add this logic to the current blk-throttle controller if
you think it is worth testing.

-Andrea
* Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
  2011-02-23  8:32 ` Andrea Righi
@ 2011-02-23 15:23 ` Vivek Goyal
  2011-02-23 23:14   ` Andrea Righi
  0 siblings, 1 reply; 9+ messages in thread

From: Vivek Goyal @ 2011-02-23 15:23 UTC (permalink / raw)
To: Andrea Righi
Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen,
    Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
    Jens Axboe, Andrew Morton, containers, linux-mm, linux-kernel,
    linux-fsdevel

> > Agreed. Per-inode granularity might be acceptable in many cases.
> > Again, I am worried about a faster group getting stuck behind a
> > slower group.
> >
> > I am wondering if we are trying to solve the problem of ASYNC write
> > throttling at the wrong layer. Should ASYNC IO be throttled before
> > we allow the task to write to the page cache? The way we throttle a
> > process based on the dirty ratio, can we just check the throttle
> > limits there too, or something like that? (I think that's what you
> > had done in your initial throttling controller implementation?)
>
> Right. This is exactly the same approach I used in my old throttling
> controller: throttle sync READs and WRITEs at the block layer, and
> async WRITEs when the task is dirtying memory pages.
>
> This is probably the simplest way to resolve the problem of a faster
> group getting blocked by a slower group, but the controller will be a
> little bit more leaky, because writeback IO will never be throttled
> and we'll see some limited IO spikes during writeback.

Yes, writeback will not be throttled. Not sure how big a problem that
is.

- We have controlled the input rate, so that should help a bit.
- Maybe one can put some high limit on the root cgroup in the blkio
  throttle controller to limit the overall WRITE rate of the system.
- For SATA disks, try to use CFQ, which can try to minimize the impact
  of WRITEs. It will at least provide a consistent bandwidth experience
  to the application.

> However, this is still a better solution IMHO than the current
> implementation, which is affected by that kind of priority inversion
> problem.
>
> I can try to add this logic to the current blk-throttle controller if
> you think it is worth testing.

At this point of time I have a few concerns with this approach.

- Configuration issues. Asking the user to plan for SYNC and ASYNC IO
  separately is inconvenient. One has to know the nature of the
  workload.

- Most likely we will come up with global limits (at least to begin
  with), and not per-device limits. That can lead to contention on one
  single lock, and scalability issues on big systems.

Having said that, this approach should reduce the kernel complexity a
lot. So if we can do some intelligent locking to limit the overhead,
then it will boil down to reduced complexity in the kernel vs. ease of
use for the user. I guess at this point of time I am inclined towards
keeping it simple in the kernel.

A couple of people have asked me about the case where backup jobs run
at night and we want to reduce the IO bandwidth of those jobs to limit
the impact on the latency of other jobs; I guess this approach will
definitely solve that issue.

IMHO, it might be worth trying this approach and seeing how well it
works. It might not solve all the problems, but it can be helpful in
many situations.

I feel that for proportional bandwidth division, implementing ASYNC
control in CFQ makes sense, because even if things get serialized in
higher layers the consequences are not very bad, as it is a
work-conserving algorithm. But for throttling, serialization will lead
to bad consequences.

Maybe one can think of new files in the blkio controller to limit async
IO per group at page dirty time:

  blkio.throttle.async.write_bps_limit
  blkio.throttle.async.write_iops_limit

Thanks
Vivek
* Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
  2011-02-23 15:23 ` Vivek Goyal
@ 2011-02-23 23:14 ` Andrea Righi
  2011-02-24  0:10   ` Vivek Goyal
  0 siblings, 1 reply; 9+ messages in thread

From: Andrea Righi @ 2011-02-23 23:14 UTC (permalink / raw)
To: Vivek Goyal
Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen,
    Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
    Jens Axboe, Andrew Morton, Jonathan Corbet, containers, linux-mm,
    linux-kernel, linux-fsdevel

On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
> > Right. This is exactly the same approach I used in my old throttling
> > controller: throttle sync READs and WRITEs at the block layer, and
> > async WRITEs when the task is dirtying memory pages.
> >
> > This is probably the simplest way to resolve the problem of a faster
> > group getting blocked by a slower group, but the controller will be
> > a little bit more leaky, because writeback IO will never be
> > throttled and we'll see some limited IO spikes during writeback.
>
> Yes, writeback will not be throttled. Not sure how big a problem that
> is.
>
> - We have controlled the input rate, so that should help a bit.
> - Maybe one can put some high limit on the root cgroup in the blkio
>   throttle controller to limit the overall WRITE rate of the system.
> - For SATA disks, try to use CFQ, which can try to minimize the
>   impact of WRITEs. It will at least provide a consistent bandwidth
>   experience to the application.

Right.

> > However, this is still a better solution IMHO than the current
> > implementation, which is affected by that kind of priority
> > inversion problem.
> >
> > I can try to add this logic to the current blk-throttle controller
> > if you think it is worth testing.
>
> At this point of time I have a few concerns with this approach.
>
> - Configuration issues. Asking the user to plan for SYNC and ASYNC IO
>   separately is inconvenient. One has to know the nature of the
>   workload.
>
> - Most likely we will come up with global limits (at least to begin
>   with), and not per-device limits. That can lead to contention on
>   one single lock, and scalability issues on big systems.
>
> Having said that, this approach should reduce the kernel complexity a
> lot. So if we can do some intelligent locking to limit the overhead,
> then it will boil down to reduced complexity in the kernel vs. ease
> of use for the user. I guess at this point of time I am inclined
> towards keeping it simple in the kernel.

BTW, with this approach we can probably even get rid of the page
tracking stuff for now. If we don't consider swap IO, any other IO
operation, from our point of view, will happen directly from process
context (writes in memory + sync reads from the block device).

However, I'm sure we'll need the page tracking for the blkio controller
sooner or later. This is important information, and the proportional
bandwidth controller can also take advantage of it.

> A couple of people have asked me about the case where backup jobs run
> at night and we want to reduce the IO bandwidth of those jobs to
> limit the impact on the latency of other jobs; I guess this approach
> will definitely solve that issue.
>
> IMHO, it might be worth trying this approach and seeing how well it
> works. It might not solve all the problems, but it can be helpful in
> many situations.

Agreed. This could be a good tradeoff for a lot of common cases.

> I feel that for proportional bandwidth division, implementing ASYNC
> control in CFQ makes sense, because even if things get serialized in
> higher layers the consequences are not very bad, as it is a
> work-conserving algorithm. But for throttling, serialization will
> lead to bad consequences.

Agreed.

> Maybe one can think of new files in the blkio controller to limit
> async IO per group at page dirty time:
>
>   blkio.throttle.async.write_bps_limit
>   blkio.throttle.async.write_iops_limit

OK, I'll try to add the async throttling logic and use this interface.

-Andrea
* Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
  2011-02-23 23:14 ` Andrea Righi
@ 2011-02-24  0:10 ` Vivek Goyal
  2011-02-24  0:40   ` KAMEZAWA Hiroyuki
  2011-02-25  0:54   ` Andrea Righi
  0 siblings, 2 replies; 9+ messages in thread

From: Vivek Goyal @ 2011-02-24  0:10 UTC (permalink / raw)
To: Andrea Righi
Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen,
    Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
    Jens Axboe, Andrew Morton, Jonathan Corbet, containers, linux-mm,
    linux-kernel, linux-fsdevel

On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote:
> On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
> > At this point of time I have a few concerns with this approach.
> >
> > - Configuration issues. Asking the user to plan for SYNC and ASYNC
> >   IO separately is inconvenient. One has to know the nature of the
> >   workload.
> >
> > - Most likely we will come up with global limits (at least to begin
> >   with), and not per-device limits. That can lead to contention on
> >   one single lock, and scalability issues on big systems.
> >
> > Having said that, this approach should reduce the kernel complexity
> > a lot. So if we can do some intelligent locking to limit the
> > overhead, then it will boil down to reduced complexity in the
> > kernel vs. ease of use for the user. I guess at this point of time
> > I am inclined towards keeping it simple in the kernel.
>
> BTW, with this approach we can probably even get rid of the page
> tracking stuff for now.

Agreed.

> If we don't consider swap IO, any other IO operation, from our point
> of view, will happen directly from process context (writes in memory
> + sync reads from the block device).

Why do we need to account for swap IO? The application never asked for
swap IO. It is the kernel's decision to move some pages to swap to free
up memory. What's the point in charging those pages to the
application's group and throttling accordingly?

> However, I'm sure we'll need the page tracking for the blkio
> controller sooner or later. This is important information, and the
> proportional bandwidth controller can also take advantage of it.

Yes, page tracking will be needed for CFQ proportional bandwidth ASYNC
write support. But until we implement a memory cgroup dirty ratio and
figure out a way to make the writeback logic cgroup aware, I think the
page tracking stuff is not really useful.

> > Maybe one can think of new files in the blkio controller to limit
> > async IO per group at page dirty time:
> >
> >   blkio.throttle.async.write_bps_limit
> >   blkio.throttle.async.write_iops_limit
>
> OK, I'll try to add the async throttling logic and use this interface.

Cool, I would like to play with it a bit once the patches are ready.

Thanks
Vivek
* Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
  2011-02-24  0:10 ` Vivek Goyal
@ 2011-02-24  0:40 ` KAMEZAWA Hiroyuki
  2011-02-24  2:01   ` Greg Thelen
  2011-02-24 16:18   ` Vivek Goyal
  1 sibling, 2 replies; 9+ messages in thread

From: KAMEZAWA Hiroyuki @ 2011-02-24  0:40 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrea Righi, Balbir Singh, Daisuke Nishimura, Greg Thelen,
    Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi,
    Jens Axboe, Andrew Morton, Jonathan Corbet, containers, linux-mm,
    linux-kernel, linux-fsdevel

On Wed, 23 Feb 2011 19:10:33 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote:
> > BTW, with this approach we can probably even get rid of the page
> > tracking stuff for now.
>
> Agreed.
>
> > If we don't consider swap IO, any other IO operation, from our
> > point of view, will happen directly from process context (writes in
> > memory + sync reads from the block device).
>
> Why do we need to account for swap IO? The application never asked
> for swap IO. It is the kernel's decision to move some pages to swap
> to free up memory. What's the point in charging those pages to the
> application's group and throttling accordingly?

I think swap I/O should be controlled by memcg's dirty_ratio.
But, IIRC, an NEC guy had a requirement for this...

I think some enterprise customers may want to throttle the whole speed
of swapout I/O (not swapin)... so they may be glad if they can throttle
the I/O against a disk partition, or all I/O tagged as 'swapio', rather
than by cgroup name.

But I'm afraid slow swapout may consume much of the dirty_ratio and
make things worse ;)

> > However, I'm sure we'll need the page tracking for the blkio
> > controller sooner or later. This is important information, and the
> > proportional bandwidth controller can also take advantage of it.
>
> Yes, page tracking will be needed for CFQ proportional bandwidth
> ASYNC write support. But until we implement a memory cgroup dirty
> ratio and figure out a way to make the writeback logic cgroup aware,
> I think the page tracking stuff is not really useful.

I think Greg Thelen is now preparing patches for dirty_ratio.

Thanks,
-Kame
* Re: [PATCH 0/5] blk-throttle: writeback and swap IO control 2011-02-24 0:40 ` KAMEZAWA Hiroyuki @ 2011-02-24 2:01 ` Greg Thelen 2011-02-24 16:18 ` Vivek Goyal 1 sibling, 0 replies; 9+ messages in thread From: Greg Thelen @ 2011-02-24 2:01 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Vivek Goyal, Andrea Righi, Balbir Singh, Daisuke Nishimura, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Andrew Morton, Jonathan Corbet, containers, linux-mm, linux-kernel, linux-fsdevel On Wed, Feb 23, 2011 at 4:40 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > On Wed, 23 Feb 2011 19:10:33 -0500 > Vivek Goyal <vgoyal@redhat.com> wrote: > >> On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote: >> > On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote: >> > > > > Agreed. Granularity of per inode level might be accetable in many >> > > > > cases. Again, I am worried faster group getting stuck behind slower >> > > > > group. >> > > > > >> > > > > I am wondering if we are trying to solve the problem of ASYNC write throttling >> > > > > at wrong layer. Should ASYNC IO be throttled before we allow task to write to >> > > > > page cache. The way we throttle the process based on dirty ratio, can we >> > > > > just check for throttle limits also there or something like that.(I think >> > > > > that's what you had done in your initial throttling controller implementation?) >> > > > >> > > > Right. This is exactly the same approach I've used in my old throttling >> > > > controller: throttle sync READs and WRITEs at the block layer and async >> > > > WRITEs when the task is dirtying memory pages. >> > > > >> > > > This is probably the simplest way to resolve the problem of faster group >> > > > getting blocked by slower group, but the controller will be a little bit >> > > > more leaky, because the writeback IO will be never throttled and we'll >> > > > see some limited IO spikes during the writeback. 
>> > > >> > > Yes writeback will not be throttled. Not sure how big a problem that is. >> > > >> > > - We have controlled the input rate. So that should help a bit. >> > > - Maybe one can put some high limit on the root cgroup in the blkio throttle >> > > controller to limit the overall WRITE rate of the system. >> > > - For SATA disks, try to use CFQ which can try to minimize the impact of >> > > WRITE. >> > > >> > > It will at least provide a consistent bandwidth experience to the application. >> > >> > Right. >> > >> > > >> > > >However, this is always >> > > > a better solution IMHO with respect to the current implementation that is >> > > > affected by that kind of priority inversion problem. >> > > > >> > > > I can try to add this logic to the current blk-throttle controller if >> > > > you think it is worth testing. >> > > >> > > At this point of time I have a few concerns with this approach. >> > > >> > > - Configuration issues. Asking the user to plan for SYNC and ASYNC IO >> > > separately is inconvenient. One has to know the nature of the workload. >> > > >> > > - Most likely we will come up with global limits (at least to begin with), >> > > and not per device limits. That can lead to contention on one single >> > > lock and scalability issues on big systems. >> > > >> > > Having said that, this approach should reduce the kernel complexity a lot. >> > > So if we can do some intelligent locking to limit the overhead then it >> > > will boil down to reduced complexity in kernel vs ease of use to the user. I >> > > guess at this point of time I am inclined towards keeping it simple in >> > > the kernel. >> > > >> > >> > BTW, with this approach probably we can even get rid of the page >> > tracking stuff for now. >> >> Agreed. >> >> > If we don't consider the swap IO, any other IO >> > operation from our point of view will happen directly from process >> > context (writes in memory + sync reads from the block device). >> >> Why do we need to account for swap IO? 
Application never asked for swap >> IO. It is the kernel's decision to move some pages to swap to free up some >> memory. What's the point in charging those pages to the application group >> and throttling accordingly? >> > > I think swap I/O should be controlled by memcg's dirty_ratio. > But, IIRC, NEC guy had a requirement for this... > > I think some enterprise customer may want to throttle the whole speed of > swapout I/O (not swapin)...so, they may be glad if they can limit/throttle > the I/O against a disk partition or all I/O tagged as 'swapio' rather than > some cgroup name. > > But I'm afraid slow swapout may consume much dirty_ratio and make things > worse ;) > > > >> > >> > However, I'm sure we'll need the page tracking also for the blkio >> > controller sooner or later. This is important information and also the >> > proportional bandwidth controller can take advantage of it. >> >> Yes page tracking will be needed for CFQ proportional bandwidth ASYNC >> write support. But until and unless we implement memory cgroup dirty >> ratio and figure out a way to make writeback logic cgroup aware, till >> then I think the page tracking stuff is not really useful. >> > > I think Greg Thelen is now preparing patches for dirty_ratio. > > Thanks, > -Kame > > Correct. I am working on the memcg dirty_ratio patches with the latest mmotm memcg. I am running some test cases which should be complete tomorrow. Once testing is complete, I will send the patches for review. ^ permalink raw reply [flat|nested] 9+ messages in thread
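The per-cgroup dirty_ratio mechanism under discussion amounts to giving each cgroup its own dirty-page budget and throttling the task at page-dirtying time once its group is over budget. A minimal userspace sketch of that idea follows; every name here (`Cgroup`, `dirty_page`, `writeback`) is illustrative and not kernel API.

```python
# Illustrative sketch of a per-cgroup dirty limit: a task may dirty pages
# only while its group is under its own budget. All names are hypothetical.

class Cgroup:
    def __init__(self, limit_pages):
        self.limit_pages = limit_pages  # per-group dirty limit
        self.dirty_pages = 0            # pages currently dirty in this group

def dirty_page(cg):
    """balance_dirty_pages()-style check: succeed only while the group is
    under its own dirty limit; over the limit the caller would sleep
    until writeback cleans some of the group's pages."""
    if cg.dirty_pages >= cg.limit_pages:
        return False  # task must wait for writeback to make progress
    cg.dirty_pages += 1
    return True

def writeback(cg, npages):
    """Writeback completion replenishes the group's dirty budget."""
    cg.dirty_pages = max(0, cg.dirty_pages - npages)
```

With such a check in place, a slow group fills its own budget and blocks only its own dirtiers, which is the property the thread is after.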
* Re: [PATCH 0/5] blk-throttle: writeback and swap IO control 2011-02-24 0:40 ` KAMEZAWA Hiroyuki 2011-02-24 2:01 ` Greg Thelen @ 2011-02-24 16:18 ` Vivek Goyal 1 sibling, 0 replies; 9+ messages in thread From: Vivek Goyal @ 2011-02-24 16:18 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Andrea Righi, Balbir Singh, Daisuke Nishimura, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Andrew Morton, Jonathan Corbet, containers, linux-mm, linux-kernel, linux-fsdevel On Thu, Feb 24, 2011 at 09:40:39AM +0900, KAMEZAWA Hiroyuki wrote: [..] > > > If we don't consider the swap IO, any other IO > > > operation from our point of view will happen directly from process > > > context (writes in memory + sync reads from the block device). > > > > Why do we need to account for swap IO? Application never asked for swap > > IO. It is the kernel's decision to move some pages to swap to free up some > > memory. What's the point in charging those pages to the application group > > and throttling accordingly? > > > > I think swap I/O should be controlled by memcg's dirty_ratio. > But, IIRC, NEC guy had a requirement for this... > > I think some enterprise customer may want to throttle the whole speed of > swapout I/O (not swapin)...so, they may be glad if they can limit/throttle > the I/O against a disk partition or all I/O tagged as 'swapio' rather than > some cgroup name. If swap is on a separate disk, then one can put write throttling rules on system-wide swapout. Though I still don't understand how that can help. > > But I'm afraid slow swapout may consume much dirty_ratio and make things > worse ;) Exactly. So I think the focus should be on controlling things earlier and stopping applications early, before they can write too much data into the page cache. Thanks Vivek ^ permalink raw reply [flat|nested] 9+ messages in thread
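If swap does live on its own device, the system-wide swapout throttling mentioned above can be expressed with the existing blkio.throttle interface. This is a sketch only: it assumes the blkio cgroup is mounted at /sys/fs/cgroup/blkio and that swap sits on a dedicated device; the device numbers and rate are examples.

```shell
# Assumption: swap lives on /dev/sdb (major:minor 8:16) and the blkio
# controller is mounted at /sys/fs/cgroup/blkio. Cap writes to the swap
# device at 10 MB/s for the root group, which also covers kernel-initiated
# swapout IO. The format is "<major>:<minor> <bytes_per_second>".
echo "8:16 10485760" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
```

Because the rule is attached to the root group and a whole device, it throttles all swapout IO regardless of which cgroup's pages are being swapped, which matches the "limit a disk partition rather than a cgroup" requirement quoted earlier.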
* Re: [PATCH 0/5] blk-throttle: writeback and swap IO control 2011-02-24 0:10 ` Vivek Goyal 2011-02-24 0:40 ` KAMEZAWA Hiroyuki @ 2011-02-25 0:54 ` Andrea Righi 1 sibling, 0 replies; 9+ messages in thread From: Andrea Righi @ 2011-02-25 0:54 UTC (permalink / raw) To: Vivek Goyal Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Gui Jianfeng, Ryo Tsuruta, Hirokazu Takahashi, Jens Axboe, Andrew Morton, Jonathan Corbet, containers, linux-mm, linux-kernel, linux-fsdevel On Wed, Feb 23, 2011 at 07:10:33PM -0500, Vivek Goyal wrote: > On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote: > > On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote: > > > > > Agreed. Granularity of per inode level might be acceptable in many > > > > > cases. Again, I am worried about a faster group getting stuck behind a slower > > > > > group. > > > > > > > > > > I am wondering if we are trying to solve the problem of ASYNC write throttling > > > > > at the wrong layer. Should ASYNC IO be throttled before we allow the task to write to > > > > > the page cache? The way we throttle the process based on dirty ratio, can we > > > > > just check for throttle limits there too, or something like that? (I think > > > > > that's what you had done in your initial throttling controller implementation?) > > > > > > > > Right. This is exactly the same approach I've used in my old throttling > > > > controller: throttle sync READs and WRITEs at the block layer and async > > > > WRITEs when the task is dirtying memory pages. > > > > > > > > This is probably the simplest way to resolve the problem of a faster group > > > > getting blocked by a slower group, but the controller will be a little bit > > > > more leaky, because the writeback IO will never be throttled and we'll > > > > see some limited IO spikes during the writeback. > > > > > > Yes writeback will not be throttled. Not sure how big a problem that is. > > > > > > - We have controlled the input rate. So that should help a bit. 
> > > - Maybe one can put some high limit on the root cgroup in the blkio throttle > > > controller to limit the overall WRITE rate of the system. > > > - For SATA disks, try to use CFQ which can try to minimize the impact of > > > WRITE. > > > > > > It will at least provide a consistent bandwidth experience to the application. > > > > Right. > > > > > > > > >However, this is always > > > > a better solution IMHO with respect to the current implementation that is > > > > affected by that kind of priority inversion problem. > > > > > > > > I can try to add this logic to the current blk-throttle controller if > > > > you think it is worth testing. > > > > > > At this point of time I have a few concerns with this approach. > > > > > > - Configuration issues. Asking the user to plan for SYNC and ASYNC IO > > > separately is inconvenient. One has to know the nature of the workload. > > > > > > - Most likely we will come up with global limits (at least to begin with), > > > and not per device limits. That can lead to contention on one single > > > lock and scalability issues on big systems. > > > > > > Having said that, this approach should reduce the kernel complexity a lot. > > > So if we can do some intelligent locking to limit the overhead then it > > > will boil down to reduced complexity in kernel vs ease of use to the user. I > > > guess at this point of time I am inclined towards keeping it simple in > > > the kernel. > > > > > > > BTW, with this approach probably we can even get rid of the page > > tracking stuff for now. > > Agreed. > > > If we don't consider the swap IO, any other IO > > operation from our point of view will happen directly from process > > context (writes in memory + sync reads from the block device). > > Why do we need to account for swap IO? Application never asked for swap > IO. It is the kernel's decision to move some pages to swap to free up some > memory. What's the point in charging those pages to the application group > and throttling accordingly? 
OK, I think swap IO control is not a very important feature for now. However, without swap IO control an application could always blow away any QoS provided by the blkio controller simply by allocating a lot of memory and waiting for the kernel to swap those memory pages. Probably in that case it would be better to slow down the swap IO and wait for the oom-killer to kill the application, instead of aggressively swapping out pages. > > > > > However, I'm sure we'll need the page tracking also for the blkio > > controller sooner or later. This is important information and also the > > proportional bandwidth controller can take advantage of it. > > Yes page tracking will be needed for CFQ proportional bandwidth ASYNC > write support. But until and unless we implement memory cgroup dirty > ratio and figure out a way to make writeback logic cgroup aware, till > then I think the page tracking stuff is not really useful. OK. -Andrea ^ permalink raw reply [flat|nested] 9+ messages in thread
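The "throttle async WRITEs when the task is dirtying memory pages" approach quoted through this thread boils down to rate-limiting the dirtier, for which a token bucket is the usual building block. The sketch below is written under that assumption; the class, method names, and numbers are illustrative, not blk-throttle's actual code.

```python
# Rough token-bucket sketch of throttling a task at page-dirtying time
# instead of at the block layer. All names and numbers are illustrative.

class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = float(rate_bps)      # allowed bytes per second
        self.burst = float(burst_bytes)  # max unthrottled burst
        self.tokens = float(burst_bytes)
        self.last = 0.0

    def delay_for(self, nbytes, now):
        """Seconds the dirtying task should sleep before writing nbytes."""
        # Refill tokens for the time elapsed since the last charge.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0                   # within budget: no throttling
        return -self.tokens / self.rate  # sleep until the debt is paid
```

This illustrates the trade-off discussed above: the input rate into the page cache is controlled per group, while writeback itself runs unthrottled, so short IO spikes during writeback remain possible.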
end of thread, other threads:[~2011-02-25 0:54 UTC | newest] Thread overview: 9+ messages -- [not found] <1298394776-9957-1-git-send-email-arighi@develer.com> [not found] ` <20110222193403.GG28269@redhat.com> [not found] ` <20110222224141.GA23723@linux.develer.com> 2011-02-23 0:03 ` [PATCH 0/5] blk-throttle: writeback and swap IO control Vivek Goyal 2011-02-23 8:32 ` Andrea Righi 2011-02-23 15:23 ` Vivek Goyal 2011-02-23 23:14 ` Andrea Righi 2011-02-24 0:10 ` Vivek Goyal 2011-02-24 0:40 ` KAMEZAWA Hiroyuki 2011-02-24 2:01 ` Greg Thelen 2011-02-24 16:18 ` Vivek Goyal 2011-02-25 0:54 ` Andrea Righi