[LSF/MM TOPIC] [ATTEND] Throttling I/O


* [LSF/MM TOPIC] [ATTEND] Throttling I/O
@ 2013-01-25 13:19 Suresh Jayaraman
  2013-01-25 16:34 ` Vivek Goyal
  2013-01-25 17:57 ` Tejun Heo
  0 siblings, 2 replies; 9+ messages in thread
From: Suresh Jayaraman @ 2013-01-25 13:19 UTC (permalink / raw)
  To: lsf-pc, linux-fsdevel
  Cc: Tejun Heo, Fengguang Wu, Andrea Righi, Vivek Goyal, Jan Kara,
	Moyer Jeff Moyer

Hello,

I'd like to discuss again[1] the problem of throttling buffered writes
and a throttle mechanism that works for all kinds of I/O.

Some background information.

During last year's LSF/MM, Fengguang discussed his proportional I/O
controller patches as part of the writeback session. The limitations
seen in his approach were that it a) does not handle bursty I/O
submission in the flusher thread, b) shares config variables among
different policies, and c) violates layering and lacks a long-term
design. Tejun proposed a back-pressure approach to the problem,
i.e. apply pressure where the problem is (block layer) and propagate
it upwards.

The general opinion at that time was that we needed more
inputs/consensus on a natural, flexible, extensible
"interface". The discussion thread that Vivek started[2] to collect
inputs on the "interface" resulted in a good collection of inputs, but
I am not sure whether it represents inputs from all the interested parties.

At Kernel Summit last year, I learned from LWN[3] that the topic was
discussed again. Tejun apparently proposed a solution that splits up
the global async CFQ queue by cgroup, so that the CFQ scheduler can
easily schedule the per-cgroup sync/async queues according to the
per-cgroup I/O weights. Fengguang proposed a solution by supporting the
per-cgroup buffered write weights in balance_dirty_pages() and running a
user-space daemon that updates the CFQ/BDP weights every second. There
doesn't seem to be consensus towards either of the proposed approaches.

Looking at the possibility of prototyping Tejun's proposed idea led to
many questions (but my understanding may not be complete here as it is
based only on LWN's mem-cg mini-summit coverage, so please correct me if
wrong).

- Making cfq schedule the per cgroup sync/async queues according to I/O
  weights would mean that we'll need to use per cgroup cfqq's instead
  of per process? What will be the impact on sync latencies if, for example,
  we have many sync only tasks in one cgroup and many async tasks in
  another?  What if BLK_CGROUP is not configured, what would be the
  fallback behavior?

- Suppose we have 100 cgroups and we are to have one cfqq per
  priority per cgroup, this would mean we'll be requiring 100 x 3 x 8 =
  2400 cfqq's (3 classes and 8 priorities) in the worst case (as
  opposed to the current 24 cfqqs)? This may not be as drastic as it sounds
  as we create cfqq's only on demand and we normally won't have tasks
  with every priority and every class?
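
A rough sketch of the arithmetic in the second question, taking the 3
classes and 8 priorities from the estimate above at face value. The
function names and the on-demand example are illustrative assumptions,
not kernel code.

```python
# Back-of-the-envelope count of async cfqqs, using the figures quoted above.
# Assumption: one async cfqq per (class, priority) pair, as in the estimate.
NUM_CLASSES = 3      # I/O scheduling classes, per the estimate above
NUM_PRIORITIES = 8   # priority levels per class, per the estimate above


def worst_case_async_cfqqs(num_cgroups):
    """Upper bound if every cgroup used every class and every priority."""
    return num_cgroups * NUM_CLASSES * NUM_PRIORITIES


def on_demand_async_cfqqs(active_pairs_per_cgroup):
    """Count only the (class, priority) pairs each cgroup actually uses.

    active_pairs_per_cgroup: list of sets of (class, priority) tuples,
    one set per cgroup (hypothetical input for illustration).
    """
    return sum(len(pairs) for pairs in active_pairs_per_cgroup)


print(worst_case_async_cfqqs(1))    # the single global set today
print(worst_case_async_cfqqs(100))  # one set per cgroup, 100 cgroups
```

With on-demand creation the count follows the pairs actually in use,
e.g. on_demand_async_cfqqs([{(1, 4)}, {(1, 4), (2, 0)}]) counts three
queues for two cgroups.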

I'm primarily interested in having the ability to limit/throttle
buffered I/O on a multiuser system where one heavy I/O user shouldn't be
impacting others and everyone should be getting their allocated share. I
understand, though, that there are different possible use-cases and the
agreed approach should not be limiting any potential use-case, and hence
having a consensus is quite important. So, I think a discussion on the
topic might help.

I would also be interested in other Network filesystem topics that have
already been proposed, including NFS Ganesha, readdirplus syscall etc. I
have been working on Network filesystems for many years now and recently
started looking into the block layer side of things too.

[1] http://comments.gmane.org/gmane.linux.kernel.mm/74805 (Last year's
proposal)
[2] http://www.spinics.net/lists/linux-fsdevel/msg53171.html
[3] http://lwn.net/Articles/516540/

Thanks

-- 
Suresh Jayaraman


* Re: [LSF/MM TOPIC] [ATTEND] Throttling I/O
  2013-01-25 13:19 [LSF/MM TOPIC] [ATTEND] Throttling I/O Suresh Jayaraman
@ 2013-01-25 16:34 ` Vivek Goyal
  2013-01-25 17:52   ` Tejun Heo
  2013-01-28 11:16   ` Suresh Jayaraman
  2013-01-25 17:57 ` Tejun Heo
  1 sibling, 2 replies; 9+ messages in thread
From: Vivek Goyal @ 2013-01-25 16:34 UTC (permalink / raw)
  To: Suresh Jayaraman
  Cc: lsf-pc, linux-fsdevel, Tejun Heo, Fengguang Wu, Andrea Righi,
	Jan Kara, Moyer Jeff Moyer

On Fri, Jan 25, 2013 at 06:49:34PM +0530, Suresh Jayaraman wrote:
> Hello,
> 
> I'd like to discuss again[1] the problem of throttling buffered writes
> and a throttle mechanism that works for all kinds of I/O.
> 
> Some background information.
> 
> During last year's LSF/MM, Fengguang discussed his proportional I/O
> controller patches as part of the writeback session. The limitations
> that were seen of his approach were a) non-handling of bursty IO
> submission in the flusher thread b) sharing config variables among
> different policies c) and that it violates layering and lacking
> long-term design. Tejun proposed back-pressure approach to the problem
> i.e. apply pressure where the problem is (block layer) and propagate
> upwards.
> 
> The general opinion at that time was that we needed more
> inputs/consensus needed on the natural, flexible, extensible
> "interface". The discussion thread that Vivek started[2] to collect the
> inputs on "interface", though resulted in good collection of inputs,
> not sure whether it represents inputs from all the interested parties.
> 
> At Kernel Summit last year, I learned from LWN[3] that the topic was
> discussed again. Tejun, apparently proposed a solution that splits up
> the global async CFQ queue by cgroup, so that the CFQ scheduler can
> easily schedule the per-cgroup sync/async queues according to the
> per-cgroup I/O weights. Fengguang proposed a solution by supporting the
> per-cgroup buffered write weights in balance_dirty_pages() and running a
> user-space daemon that updates the CFQ/BDP weights every second. There
> doesn't seem to be consensus towards either of the proposed approaches.
> 

Moving async queues into their respective cgroups is the easy part. Also,
for throttling, you don't need CFQ. So CFQ and IO throttling are somewhat
orthogonal. (I am assuming that by throttling you mean upper limiting IO.)

And I think Tejun wanted to implement throttling at the block layer and
wanted the vm to adjust/respond to per-group IO backlog when it comes
to writing to dirty data/inodes.

Once we have taken care of the writeback problem, then comes the issue
of being able to associate a dirty inode/page with a cgroup. Not sure
if something has happened on that front or not. In the past the thinking
was to keep it simple: one inode belongs to one IO cgroup.

Also, seriously, in CFQ the group idling performance penalty is too
high and might start showing up easily even on a single-spindle SATA
disk, especially given the fact that people will come up with hybrid
SATA drives with some caching internally. So SATA drives will not
be as slow.

So proportional group scheduling of CFQ is limited to the specific
corner case of a slow SATA drive. I am not sure how many people really
use it.

To me, we first need to have some ideas here on how to implement low
cost proportional group scheduling (either in CFQ or in another scheduler).
Till then we can develop a lot of infrastructure but its usability will
be very limited.

Thanks
Vivek


* Re: [LSF/MM TOPIC] [ATTEND] Throttling I/O
  2013-01-25 16:34 ` Vivek Goyal
@ 2013-01-25 17:52   ` Tejun Heo
  2013-01-25 18:26     ` Vivek Goyal
  2013-01-28 11:16   ` Suresh Jayaraman
  1 sibling, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2013-01-25 17:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Suresh Jayaraman, lsf-pc, linux-fsdevel, Fengguang Wu,
	Andrea Righi, Jan Kara, Moyer Jeff Moyer

Hey, guys.

On Fri, Jan 25, 2013 at 11:34:08AM -0500, Vivek Goyal wrote:
> And I think tejun wanted to implement throttling at block layer and
> wanted vm to adjust/respond to per group IO backlog when it comes
> to writing to dirty data/inodes.
> 
> Once we have take care of writeback problem then comes the issue
> of being able to associate a dirty inode/page to a cgroup. Not sure
> if something has happened on that front or not. In the past it was
> thought to be simple that one inode belongs to one IO cgroup.

Yeap, the above two sum it up pretty good.

> Also seriously, in CFQ, group idling performance penalty is too
> high and might start showing up easily on a single spindle sata disk
> also. Especially given the fact that people will come up with hybrid
> SATA drives with some caching internally. So SATA drive will not
> be as slow.
> 
> So proportional group scheduling of CFQ is limited to such a specific
> corner case of slow SATA drive. I am not sure how many people really
> use it.

I don't think so.  If you mean personal usage, sure, it's not very useful
but then again proportional IO control itself isn't all that useful
for personal use, but if you go to backend infrastructure requiring a
lot of capacity, spindled drives still rule the roost and large
deployment of on-device flash cache is not as immediate, if it ever
happens, that is.  Spindle drives may not be in your desk/laptops, but
they continue to be deployed massively in the backend.

For example, google has been using half-hacky hierarchical writeback
support in cfq for quite some time now and they'll switch to upstream
implementation once we get it working, so I don't think it's a wasted
effort.

Thanks.

-- 
tejun


* Re: [LSF/MM TOPIC] [ATTEND] Throttling I/O
  2013-01-25 13:19 [LSF/MM TOPIC] [ATTEND] Throttling I/O Suresh Jayaraman
  2013-01-25 16:34 ` Vivek Goyal
@ 2013-01-25 17:57 ` Tejun Heo
  2013-01-28 11:46   ` Suresh Jayaraman
  1 sibling, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2013-01-25 17:57 UTC (permalink / raw)
  To: Suresh Jayaraman
  Cc: lsf-pc, linux-fsdevel, Fengguang Wu, Andrea Righi, Vivek Goyal,
	Jan Kara, Moyer Jeff Moyer

Hey, Suresh.

On Fri, Jan 25, 2013 at 06:49:34PM +0530, Suresh Jayaraman wrote:
> - Making cfq schedule the per cgroup sync/async queues according to I/O
>   weights would mean that we'll need to use per cgroup cfqq's instead
>   of per process? What will the impact on sync latencies if for example
>   we have many sync only tasks in one cgroup and many async tasks in
>   another?  What if BLK_CGROUP is not configured, what would be the
>   fallback behavior?

So, we currently have sync cfqqs in cgroup cfqgs and shared cfqqs in
the root cfqg.  The end result would be splitting shared cfqqs into
cgroup cfqgs.  We may have to change how cfqgs are chosen depending on
whether it only has async IOs pending.  Not sure.
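
A toy model of the layout change described above, only to make the
before/after concrete. The function, its parameters, and the "root"
label are illustrative assumptions and not CFQ's actual structures.

```python
# Toy model only: names are illustrative, not CFQ's real data structures.
ROOT = "root"


def queue_owner(cgroup, is_sync, split_async):
    """Return the group (cfqg) whose queue would hold this I/O.

    cgroup      -- name of the cgroup the I/O is attributed to
    is_sync     -- True for sync I/O, False for async (buffered writeback)
    split_async -- False models today's shared async queues in the root
                   group; True models the proposed per-cgroup split
    """
    if is_sync:
        # Sync queues already live in the issuing cgroup's group.
        return cgroup
    # Async queues: shared in the root group unless they are split.
    return cgroup if split_async else ROOT


print(queue_owner("grpA", is_sync=False, split_async=False))  # shared in root
print(queue_owner("grpA", is_sync=False, split_async=True))   # charged to grpA
```

How groups holding only async I/O get chosen for service is left out
here, since that part is still open.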

> - Suppose if we have 100 cgroups and we are to have one cfqq per
>   priority per cgroup, this would mean we'll be requiring 100 x 3 x 8 =
>   2400 cfqq's (3 classes and 8 priorities) in the worst case (as
>   opposed to current 24 cfqqs)? This may not be as drastic as it sounds
>   as we create cfqq's only on demand and we normally won't have tasks
>   with every priority and every class?

I don't think that's a problem.  We already have a cfqq per active IO
context which can go way beyond 10k depending on work load.

Thanks.

-- 
tejun


* Re: [LSF/MM TOPIC] [ATTEND] Throttling I/O
  2013-01-25 17:52   ` Tejun Heo
@ 2013-01-25 18:26     ` Vivek Goyal
  2013-01-25 18:33       ` Tejun Heo
  0 siblings, 1 reply; 9+ messages in thread
From: Vivek Goyal @ 2013-01-25 18:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Suresh Jayaraman, lsf-pc, linux-fsdevel, Fengguang Wu,
	Andrea Righi, Jan Kara, Moyer Jeff Moyer

On Fri, Jan 25, 2013 at 09:52:33AM -0800, Tejun Heo wrote:
> Hey, guys.
> 
> On Fri, Jan 25, 2013 at 11:34:08AM -0500, Vivek Goyal wrote:
> > And I think tejun wanted to implement throttling at block layer and
> > wanted vm to adjust/respond to per group IO backlog when it comes
> > to writing to dirty data/inodes.
> > 
> > Once we have take care of writeback problem then comes the issue
> > of being able to associate a dirty inode/page to a cgroup. Not sure
> > if something has happened on that front or not. In the past it was
> > thought to be simple that one inode belongs to one IO cgroup.
> 
> Yeap, the above two sum it up pretty good.
> 
> > Also seriously, in CFQ, group idling performance penalty is too
> > high and might start showing up easily on a single spindle sata disk
> > also. Especially given the fact that people will come up with hybrid
> > SATA drives with some caching internally. So SATA drive will not
> > be as slow.
> > 
> > So proportional group scheduling of CFQ is limited to such a specific
> > corner case of slow SATA drive. I am not sure how many people really
> > use it.
> 
> I don't think so.  If you personal usages, sure, it's not very useful
> but then again proportional IO control itself isn't all that useful
> for personal use, but if you go to backend infrastructure requiring a
> lot of capacity, spindled drives still rule the roost and large
> deployment of on-device flash cache is not as immediate,

Hi Tejun,

How many of these spindle drives are not behind some kind of hardware
RAID or on a SAN? Because any aggregation of spindle drives by a
hardware/external entity makes group scheduling not worth it very soon.

> 
> For example, google has been using half-hacky hierarchical writeback
> support in cfq for quite some time now and they'll switch to upstream
> implementation once we get it working, so I don't think it's a wasted
> effort.

I guess apart from Google I have not heard of anybody else using it
successfully, and that's why I get skeptical about it. Maybe once the
support for buffered write control is in, things will be better, because
that's the biggest offending workload people want to protect against.

Thanks
Vivek


* Re: [LSF/MM TOPIC] [ATTEND] Throttling I/O
  2013-01-25 18:26     ` Vivek Goyal
@ 2013-01-25 18:33       ` Tejun Heo
  0 siblings, 0 replies; 9+ messages in thread
From: Tejun Heo @ 2013-01-25 18:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Suresh Jayaraman, lsf-pc, linux-fsdevel, Fengguang Wu,
	Andrea Righi, Jan Kara, Moyer Jeff Moyer

Hey,

On Fri, Jan 25, 2013 at 01:26:39PM -0500, Vivek Goyal wrote:
> How many of these spindle drives are not behind some kind of hardware
> raid or on SAN network. Because any aggregation of spindle drives by
> hardware/external entity makes group scheduling not worth it very soon.

Hmmm?  Sure, enterprise storage is usually behind some SAN silliness,
but if you go larger scale, e.g. the "cloud" stuff, people usually go
for the most economic solution - ie. some spindles per machine.  At
least the ones I know do.

> > For example, google has been using half-hacky hierarchical writeback
> > support in cfq for quite some time now and they'll switch to upstream
> > implementation once we get it working, so I don't think it's a wasted
> > effort.
> 
> I guess apart from google I have not heard anybody else using it
> successfully and that's when I get skeptic about it. May be once the
> support for buffered write control is in, things will be better. Because,
> that's the biggest offending workload people want to protect against.

Yeah, I think it's mostly because we don't support it and there aren't
too many organizations with enough muscle to implement and maintain
their own hierarchical iosched and mutilations to writeback path.
Given the workloads these servers have to run, some form of io
resource control is just necessary.

Thanks.

-- 
tejun


* Re: [LSF/MM TOPIC] [ATTEND] Throttling I/O
  2013-01-25 16:34 ` Vivek Goyal
  2013-01-25 17:52   ` Tejun Heo
@ 2013-01-28 11:16   ` Suresh Jayaraman
  2013-01-28 19:24     ` Tejun Heo
  1 sibling, 1 reply; 9+ messages in thread
From: Suresh Jayaraman @ 2013-01-28 11:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lsf-pc, linux-fsdevel, Tejun Heo, Fengguang Wu, Andrea Righi,
	Jan Kara, Moyer Jeff Moyer

On 01/25/2013 10:04 PM, Vivek Goyal wrote:
> On Fri, Jan 25, 2013 at 06:49:34PM +0530, Suresh Jayaraman wrote:
>> Hello,
>>
>> I'd like to discuss again[1] the problem of throttling buffered writes
>> and a throttle mechanism that works for all kinds of I/O.
>>
>> Some background information.
>>
>> During last year's LSF/MM, Fengguang discussed his proportional I/O
>> controller patches as part of the writeback session. The limitations
>> that were seen of his approach were a) non-handling of bursty IO
>> submission in the flusher thread b) sharing config variables among
>> different policies c) and that it violates layering and lacking
>> long-term design. Tejun proposed back-pressure approach to the problem
>> i.e. apply pressure where the problem is (block layer) and propagate
>> upwards.
>>
>> The general opinion at that time was that we needed more
>> inputs/consensus needed on the natural, flexible, extensible
>> "interface". The discussion thread that Vivek started[2] to collect the
>> inputs on "interface", though resulted in good collection of inputs,
>> not sure whether it represents inputs from all the interested parties.
>>
>> At Kernel Summit last year, I learned from LWN[3] that the topic was
>> discussed again. Tejun, apparently proposed a solution that splits up
>> the global async CFQ queue by cgroup, so that the CFQ scheduler can
>> easily schedule the per-cgroup sync/async queues according to the
>> per-cgroup I/O weights. Fengguang proposed a solution by supporting the
>> per-cgroup buffered write weights in balance_dirty_pages() and running a
>> user-space daemon that updates the CFQ/BDP weights every second. There
>> doesn't seem to be consensus towards either of the proposed approaches.
>>
> 
> Moving async queues in respective cgroup is easy part. Also for
> throttling, you don't need CFQ. So CFQ and IO throttling are little
> orthogonal. (I am assuming by throttling you mean upper limiting IO).

I meant being able to limit I/O. I didn't mean to strictly distinguish
between "Upper limit" and "Proportional I/O control" in this context
because my understanding is that limiting I/O for a customer based on how
much he is paying could be achieved using proportional control as well
(by doing a little math with the weights). Perhaps, the topic name might
have suggested that the discussion is only about "Upper limit" and not
so much about "Proportional I/O control". But, my intent is to arrive at
an acceptable mechanism that will allow limiting I/O.
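
As a rough illustration of the "little math with the weights": under
proportional control, a group's share is its weight divided by the sum
of the weights of the groups competing with it. The sketch below assumes
every group is continuously backlogged and that the device sustains a
known throughput; the group names, weights, and the 200 MB/s figure are
made up for illustration.

```python
def proportional_shares(weights, device_mbps):
    """Map each group to its bandwidth share in MB/s.

    weights     -- dict of group name -> weight (hypothetical values)
    device_mbps -- assumed sustained device throughput in MB/s
    """
    total = sum(weights.values())
    return {name: device_mbps * w / total for name, w in weights.items()}


# Three hypothetical customers on a device assumed to sustain 200 MB/s.
shares = proportional_shares({"gold": 500, "silver": 300, "bronze": 200}, 200.0)
for name, mbps in shares.items():
    print(name, mbps)
```

Unlike a hard upper limit, the effective cap here moves: when a group
goes idle its weight drops out of the sum and the remaining groups get
more.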

> And I think tejun wanted to implement throttling at block layer and
> wanted vm to adjust/respond to per group IO backlog when it comes
> to writing to dirty data/inodes.
> 
> Once we have take care of writeback problem then comes the issue
> of being able to associate a dirty inode/page to a cgroup. Not sure
> if something has happened on that front or not. In the past it was
> thought to be simple that one inode belongs to one IO cgroup.

Yes, this was discussed last year. But not much has happened since, AFAIK.


Thanks

-- 
Suresh Jayaraman


* Re: [LSF/MM TOPIC] [ATTEND] Throttling I/O
  2013-01-25 17:57 ` Tejun Heo
@ 2013-01-28 11:46   ` Suresh Jayaraman
  0 siblings, 0 replies; 9+ messages in thread
From: Suresh Jayaraman @ 2013-01-28 11:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lsf-pc, linux-fsdevel, Fengguang Wu, Andrea Righi, Vivek Goyal,
	Jan Kara, Moyer Jeff Moyer

On 01/25/2013 11:27 PM, Tejun Heo wrote:
> Hey, Suresh.
> 
> On Fri, Jan 25, 2013 at 06:49:34PM +0530, Suresh Jayaraman wrote:
>> - Making cfq schedule the per cgroup sync/async queues according to I/O
>>   weights would mean that we'll need to use per cgroup cfqq's instead
>>   of per process? What will the impact on sync latencies if for example
>>   we have many sync only tasks in one cgroup and many async tasks in
>>   another?  What if BLK_CGROUP is not configured, what would be the
>>   fallback behavior?
> 
> So, we currently have sync cfqqs in cgroup cfqgs and shared cfqqs in
> the root cfqg.  The end result would be splitting shared cfqqs into
> cgroup cfqgs.  We may have to change how cfqgs are chosen depending on
> whether it only has async IOs pending.  Not sure.

Ah, ok. Even if we have a way to check whether all I/O in a particular
cgroup is async or not, I have a feeling that sync latencies might still
get impacted, e.g. if we have very few sync tasks plus many async tasks
in one cgroup competing with all sync tasks in another group, or some
other combination, no?


Thanks

-- 
Suresh Jayaraman


* Re: [LSF/MM TOPIC] [ATTEND] Throttling I/O
  2013-01-28 11:16   ` Suresh Jayaraman
@ 2013-01-28 19:24     ` Tejun Heo
  0 siblings, 0 replies; 9+ messages in thread
From: Tejun Heo @ 2013-01-28 19:24 UTC (permalink / raw)
  To: Suresh Jayaraman
  Cc: Vivek Goyal, lsf-pc, linux-fsdevel, Fengguang Wu, Andrea Righi,
	Jan Kara, Moyer Jeff Moyer

Hey, Suresh.

On Mon, Jan 28, 2013 at 04:46:43PM +0530, Suresh Jayaraman wrote:
> > And I think tejun wanted to implement throttling at block layer and
> > wanted vm to adjust/respond to per group IO backlog when it comes
> > to writing to dirty data/inodes.
> > 
> > Once we have take care of writeback problem then comes the issue
> > of being able to associate a dirty inode/page to a cgroup. Not sure
> > if something has happened on that front or not. In the past it was
> > thought to be simple that one inode belongs to one IO cgroup.
> 
> Yes, this was discussed last year. But, not so much happened AFAIK.

Yeah, mostly because there were many more pressing issues around
cgroup and blkcg.  Hierarchical support for cfq is now pending and a
lot of the foundation work for unified hierarchy (which IMHO is
essential for sane interaction between memcg and blkcg for writeback)
has been done, so it's getting there.

Thanks.

-- 
tejun

