Re: [patch 0/7] cpuset writeback throttling

Linux Container Development

* Re: [patch 0/7] cpuset writeback throttling
       [not found]       ` <1225833710.7803.1993.camel@twins>
@ 2008-11-04 21:50         ` Andrew Morton
  2008-11-04 22:17           ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-04 21:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rientjes, cl, npiggin, menage, dfults, linux-kernel, containers

On Tue, 04 Nov 2008 22:21:50 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2008-11-04 at 13:16 -0800, Andrew Morton wrote:
> > On Tue, 04 Nov 2008 21:53:08 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Tue, 2008-11-04 at 12:47 -0800, Andrew Morton wrote:
> > > > On Thu, 30 Oct 2008 12:23:10 -0700 (PDT)
> > > > David Rientjes <rientjes@google.com> wrote:
> > > > 
> > > > > This is the revised cpuset writeback throttling patchset
> > > > 
> > > > I'm all confused about why this is a cpuset thing rather than a cgroups
> > > > thing.  What are the relationships here?
> > > > 
> > > > I mean, writeback throttling _should_ operate upon a control group
> > > > (more specifically: a memcg), yes?  I guess you're assuming a 1:1
> > > > relationship here?
> > > 
> > > I think the main reason is that we have per-node vmstats so the cpuset
> > > extension is relatively easy. Whereas we do not currently maintain
> > > vmstats on a cgroup level - although I imagine that could be remedied.
> > 
> > It didn't look easy to me - it added a lot more code in places which are
> > already wicked complex.
> > 
> > I'm trying to understand where this is all coming from and what fits
> > into where.  Fiddling with a cpuset's mems_allowed for purposes of
> > memory partitioning is all nasty 2007 technology, isn't it?  Does a raw
> > cpuset-based control such as this have a future?
> 
> Yes, cpusets are making a come-back on the embedded multi-core Real-Time
> side. Folks love to isolate stuff..
> 
> Not saying I really like it...
> 
> Also, there seems to be talk about node aware pdflush from the
> filesystems folks, not sure we need cpusets for that, but this does seem
> to add some node information into it.

Sorry, but I'm not seeing enough solid justification here for merging a
fairly large amount of fairly tricksy code into core kernel.  Code
which, afaict, is heading in the opposite direction from where we've
all been going for a year or two.

What are the alternatives here?  What do we need to do to make
throttling a per-memcg thing?


The patchset is badly misnamed, btw.  It doesn't throttle writeback -
in fact several people are working on IO bandwidth controllers and
calling this thing "writeback throttling" risks confusion.

What we're in fact throttling is rate-of-memory-dirtying.  The last
thing we want to throttle is writeback - we want it to go as fast as
possible!

Only I can't think of a suitable handy-dandy moniker for this concept.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 21:50         ` [patch 0/7] cpuset writeback throttling Andrew Morton
@ 2008-11-04 22:17           ` Christoph Lameter
  2008-11-04 22:35             ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2008-11-04 22:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Tue, 4 Nov 2008, Andrew Morton wrote:

> What are the alternatives here?  What do we need to do to make
> throttling a per-memcg thing?

Add statistics to the memcg lru and then you need some kind of sets of 
memcgs that are represented by bitmaps or so attached to an inode.

> The patchset is badly misnamed, btw.  It doesn't throttle writeback -
> in fact several people are working on IO bandwidth controllers and
> calling this thing "writeback throttling" risks confusion.

It is limiting dirty pages and throttling the dirty rate of applications 
in a  NUMA system (same procedure as we do in non NUMA). The excessive 
dirtying without this patchset can cause OOMs to occur on NUMA systems.

> What we're in fact throttling is rate-of-memory-dirtying.  The last
> thing we want to throttle is writeback - we want it to go as fast as
> possible!

We want to limit the amount of dirty pages such that I/O is progressing 
with optimal speeds for an application that significantly dirties memory. 
Other processes need to be still able to do their work without being 
swapped out because of excessive memory use for dirty pages by the first 
process.

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 22:17           ` Christoph Lameter
@ 2008-11-04 22:35             ` Andrew Morton
  2008-11-04 22:52               ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-04 22:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Tue, 4 Nov 2008 16:17:52 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008, Andrew Morton wrote:
> 
> > What are the alternatives here?  What do we need to do to make
> > throttling a per-memcg thing?
> 
> Add statistics to the memcg lru and then you need some kind of sets of 
> memcgs that are represented by bitmaps or so attached to an inode.
> 
> > The patchset is badly misnamed, btw.  It doesn't throttle writeback -
> > in fact several people are working on IO bandwidth controllers and
> > calling this thing "writeback throttling" risks confusion.
> 
> It is limiting dirty pages and throttling the dirty rate of applications 
> in a  NUMA system (same procedure as we do in non NUMA). The excessive 
> dirtying without this patchset can cause OOMs to occur on NUMA systems.

yup.

To fix this with a memcg-based throttling, the operator would need to
be able to create memcg's which have pages only from particular nodes. 
(That's a bit indirect relative to what they want to do, but is
presumably workable).

But do we even have that capability now?

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 22:35             ` Andrew Morton
@ 2008-11-04 22:52               ` Christoph Lameter
  2008-11-04 23:36                 ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2008-11-04 22:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Tue, 4 Nov 2008, Andrew Morton wrote:

> To fix this with a memcg-based throttling, the operator would need to
> be able to create memcg's which have pages only from particular nodes.
> (That's a bit indirect relative to what they want to do, but is
> presumably workable).

The system would need to have the capability to find the memcg groups that 
have dirty pages for a certain inode. Files are not constrained to nodes 
or memcg groups.

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 22:52               ` Christoph Lameter
@ 2008-11-04 23:36                 ` Andrew Morton
  2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
  2008-11-05  2:45                   ` Christoph Lameter
  0 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2008-11-04 23:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Tue, 4 Nov 2008 16:52:48 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008, Andrew Morton wrote:
> 
> > To fix this with a memcg-based throttling, the operator would need to
> > be able to create memcg's which have pages only from particular nodes.
> > (That's a bit indirect relative to what they want to do, but is
> > presumably workable).
> 
> The system would need to have the capability to find the memcg groups that 
> have dirty pages for a certain inode. Files are not constrained to nodes 
> or memcg groups.

Ah, we're talking about different things.

In a memcg implementation what we would implement is "throttle
page-dirtying tasks in this memcg when the memcg's dirty memory reaches
40% of its total".

But that doesn't solve the problem which this patchset is trying to
solve, which is "don't let all the memory in all this group of nodes
get dirty".

Yes?  Someone help me out here.  I don't yet have my head around the
overlaps and incompatibilities here.  Perhaps the containers guys will
wake up and put their thinking caps on?

What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 23:36                 ` Andrew Morton
@ 2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
  2008-11-05  3:09                     ` Andrew Morton
  2008-11-05  2:45                   ` Christoph Lameter
  1 sibling, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  1:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, npiggin, dfults, linux-kernel, rientjes,
	containers, menage

On Tue, 4 Nov 2008 15:36:10 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008 16:52:48 -0600 (CST)
> Christoph Lameter <cl@linux-foundation.org> wrote:
> 
> > On Tue, 4 Nov 2008, Andrew Morton wrote:
> > 
> > > To fix this with a memcg-based throttling, the operator would need to
> > > be able to create memcg's which have pages only from particular nodes.
> > > (That's a bit indirect relative to what they want to do, but is
> > > presumably workable).
> > 
> > The system would need to have the capability to find the memcg groups that 
> > have dirty pages for a certain inode. Files are not constrained to nodes 
> > or memcg groups.
> 
> Ah, we're talking about different things.
> 
> In a memcg implementation what we would implement is "throttle
> page-dirtying tasks in this memcg when the memcg's dirty memory reaches
> 40% of its total".
> 
yes. Andrea posted that.


> But that doesn't solve the problem which this patchset is trying to
> solve, which is "don't let all the memory in all this group of nodes
> get dirty".
> 
yes. but this patch doesn't help the case you mentioned below.

> 
> Yes?  Someone help me out here.  I don't yet have my head around the
> overlaps and incompatibilities here.  Perhaps the containers guys will
> wake up and put their thinking caps on?
> 
> 
> 
> What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
> uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?
> 
To help with this, per-node dirty-ratio throttling is necessary.

Shouldn't we just have a new parameter such as /proc/sys/vm/dirty_ratio_per_node?

/proc/sys/vm/dirty_ratio works for throttling the whole system dirty pages.
/proc/sys/vm/dirty_ratio_per_node works for throttling dirty pages in a node.

Implementation will not be difficult, and it should work well enough against OOM.
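
As a rough sketch of how the two knobs could combine (an illustration
only, not code from any posted patch): a task dirtying pages on a node is
throttled when either the system-wide ratio or that node's own ratio is
reached. The struct and function names below are hypothetical, and the
per-node counters stand in for the kernel's per-node vmstats.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node counters, standing in for per-node vmstats. */
struct node_stats {
    unsigned long dirtyable_pages; /* pages that may hold dirty data */
    unsigned long dirty_pages;     /* pages currently dirty */
};

/* True when dirty/total has reached pct percent (integer arithmetic only). */
static bool over_ratio(unsigned long dirty, unsigned long total,
                       unsigned int pct)
{
    return total != 0 && dirty * 100UL >= total * (unsigned long)pct;
}

/*
 * Should a task dirtying pages on `node` be throttled?
 * dirty_ratio limits the whole system; dirty_ratio_per_node (the
 * parameter proposed above) limits each node separately.
 */
static bool should_throttle(const struct node_stats *nodes, size_t nr_nodes,
                            size_t node, unsigned int dirty_ratio,
                            unsigned int dirty_ratio_per_node)
{
    unsigned long dirty = 0, total = 0;

    for (size_t i = 0; i < nr_nodes; i++) {
        dirty += nodes[i].dirty_pages;
        total += nodes[i].dirtyable_pages;
    }
    if (over_ratio(dirty, total, dirty_ratio))
        return true;
    return over_ratio(nodes[node].dirty_pages,
                      nodes[node].dirtyable_pages, dirty_ratio_per_node);
}
```

With four equal nodes and only node 0 half dirty, the system-wide ratio
is 12.5% and would never trip a 40% limit, but the per-node check does.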

Thanks,
-Kame

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-04 23:36                 ` Andrew Morton
  2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
@ 2008-11-05  2:45                   ` Christoph Lameter
  2008-11-05  3:05                     ` Andrew Morton
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2008-11-05  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Tue, 4 Nov 2008, Andrew Morton wrote:

> In a memcg implementation what we would implement is "throttle
> page-dirtying tasks in this memcg when the memcg's dirty memory reaches
> 40% of its total".

Right that is similar to what this patch does for cpusets. A memcg 
implementation would need to figure out if we are currently part of a 
memcg and then determine the percentage of memory that is dirty.

That is one aspect. When performing writeback then we need to figure out 
which inodes have dirty pages in the memcg and we need to start writeout 
on those inodes and not on others that have their dirty pages elsewhere. 
There are two components of this that are in this patch and that would 
also have to be implemented for a memcg.

> But that doesn't solve the problem which this patchset is trying to
> solve, which is "don't let all the memory in all this group of nodes
> get dirty".

This patch would solve the problem if the calculation of the dirty pages 
would consider the active memcg and be able to determine the amount of 
dirty pages (through some sort of additional memcg counters). That is just 
the first part though. The second part of finding the inodes that have 
dirty pages for writeback would require an association between memcgs and 
inodes.

> What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
> uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?

Yes if the activities of cpuset A cause all pages to be dirtied in cpuset 
B and then cpuset B attempts to do writeback. This will fail to acquire 
enough memory for writeback and make reclaim impossible.

Typically cpusets are not overlapped like that but used to segment the 
system.

The system would work correctly if the dirty ratio calculation would be 
done on all overlapping cpusets/memcg groups that contain nodes from 
which allocations are permitted.
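
To make the overlapping case concrete, here is a small sketch (not taken
from the patchset) that computes the dirty percentage over just the nodes
a cpuset may allocate from. The 32-bit mask is a simplified stand-in for
a nodemask, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 16

/* Hypothetical per-node counters, standing in for per-node vmstats. */
struct node_stats {
    unsigned long dirtyable_pages;
    unsigned long dirty_pages;
};

/*
 * Percentage of dirty pages over the nodes selected by `allowed`
 * (bit n set means node n is in the cpuset's allowed memory nodes).
 */
static unsigned int dirty_pct_for_nodes(const struct node_stats *nodes,
                                        uint32_t allowed)
{
    unsigned long dirty = 0, total = 0;

    for (size_t n = 0; n < MAX_NODES; n++) {
        if (!(allowed & (UINT32_C(1) << n)))
            continue;
        dirty += nodes[n].dirty_pages;
        total += nodes[n].dirtyable_pages;
    }
    return total ? (unsigned int)(dirty * 100UL / total) : 0;
}
```

If cpuset A (nodes 0-9, mask 0x3FF) dirties every page on nodes 0 and 1,
A sees 20% dirty and stays under a 40% limit, while cpuset B (nodes 0-1,
mask 0x3) sees 100% dirty. That is the situation where B cannot make
progress unless the calculation is done per overlapping set of nodes.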

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05  2:45                   ` Christoph Lameter
@ 2008-11-05  3:05                     ` Andrew Morton
       [not found]                       ` <20081104190505.769b93ec.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2008-11-05 13:52                       ` Christoph Lameter
  0 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2008-11-05  3:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Tue, 4 Nov 2008 20:45:17 -0600 (CST) Christoph Lameter <cl@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008, Andrew Morton wrote:
> 
> > In a memcg implementation what we would implement is "throttle
> > page-dirtying tasks in this memcg when the memcg's dirty memory reaches
> > 40% of its total".
> 
> Right that is similar to what this patch does for cpusets. A memcg 
> implementation would need to figure out if we are currently part of a 
> memcg and then determine the percentage of memory that is dirty.
> 
> That is one aspect. When performing writeback then we need to figure out 
> which inodes have dirty pages in the memcg and we need to start writeout 
> on those inodes and not on others that have their dirty pages elsewhere. 
> There are two components of this that are in this patch and that would 
> also have to be implemented for a memcg.

Doable.  lru->page->mapping->host is a good start.

> > But that doesn't solve the problem which this patchset is trying to
> > solve, which is "don't let all the memory in all this group of nodes
> > get dirty".
> 
> This patch would solve the problem if the calculation of the dirty pages 
> would consider the active memcg and be able to determine the amount of 
> dirty pages (through some sort of additional memcg counters). That is just 
> the first part though. The second part of finding the inodes that have 
> dirty pages for writeback would require an association between memcgs and 
> inodes.

We presently have that via the LRU.  It has holes, but so does this per-cpuset
scheme.

> > What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
> > uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?
> 
> Yes if the activities of cpuset A cause all pages to be dirtied in cpuset 
> B and then cpuset B attempts to do writeback. This will fail to acquire 
> enough memory for writeback and make reclaim impossible.
> 
> Typically cpusets are not overlapped like that but used to segment the 
> system.
> 
> The system would work correctly if the dirty ratio calculation would be 
> done on all overlapping cpusets/memcg groups that contain nodes from 
> which allocations are permitted.

That.


Generally, I worry that this is a specific fix to a specific problem
encountered on specific machines with specific setups and specific
workloads, and that it's just all too low-level and myopic.

And now we're back in the usual position where there's existing code and
everyone says it's terribly wonderful and everyone is reluctant to step
back and look at the big picture.  Am I wrong?


Plus: we need per-memcg dirty-memory throttling, and this is more
important than per-cpuset, I suspect.  How will the (already rather
buggy) code look once we've stuffed both of them in there?


I agree that there's a problem here, although given the amount of time
that it's been there, I suspect that it is a very small problem. 
Someone please convince me that in three years time we will agree that
merging this fix to that problem was a correct decision?

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
@ 2008-11-05  3:09                     ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2008-11-05  3:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Lameter, npiggin, dfults, linux-kernel, rientjes,
	containers, menage

On Wed, 5 Nov 2008 10:31:23 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > 
> > Yes?  Someone help me out here.  I don't yet have my head around the
> > overlaps and incompatibilities here.  Perhaps the containers guys will
> > wake up and put their thinking caps on?
> > 
> > 
> > 
> > What happens if cpuset A uses nodes 0,1,2,3,4,5,6,7,8,9 and cpuset B
> > uses nodes 0,1?  Can activity in cpuset A cause ooms in cpuset B?
> > 
> To help with this, per-node dirty-ratio throttling is necessary.
> 
> Shouldn't we just have a new parameter such as /proc/sys/vm/dirty_ratio_per_node?

I guess that would work.  But it is a general solution and will be less
efficient for the particular setups which are triggering this problem.

> /proc/sys/vm/dirty_ratio works for throttling the whole system dirty pages.
> /proc/sys/vm/dirty_ratio_per_node works for throttling dirty pages in a node.
> 
> Implementation will not be difficult, and it should work well enough against OOM.

Yup.  Just track per-node dirtiness and walk the LRU when it is over
threshold.

* Re: [patch 0/7] cpuset writeback throttling
       [not found]                       ` <20081104190505.769b93ec.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2008-11-05  4:31                         ` KAMEZAWA Hiroyuki
  2008-11-10  9:02                           ` Andrea Righi
  0 siblings, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: npiggin, Christoph Lameter, dfults, linux-kernel, rientjes,
	containers, menage, righi.andrea

On Tue, 4 Nov 2008 19:05:05 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:
> Generally, I worry that this is a specific fix to a specific problem
> encountered on specific machines with specific setups and specific
> workloads, and that it's just all too low-level and myopic.
> 
> And now we're back in the usual position where there's existing code and
> everyone says it's terribly wonderful and everyone is reluctant to step
> back and look at the big picture.  Am I wrong?
> 
> 
> Plus: we need per-memcg dirty-memory throttling, and this is more
> important than per-cpuset, I suspect.  How will the (already rather
> buggy) code look once we've stuffed both of them in there?
> 
> 
IIUC, Andrea Righi posted 2 patches around dirty_ratio in early October
(added him to CC:).

  (1) patch for adding dirty_ratio_pcm. (1/100000)
  (2) per-memcg dirty ratio. (maybe this..http://lkml.org/lkml/2008/9/12/121)
 
(1) should be just posted again.

Because we have changed page_cgroup implementation, (2) should be reworked.
"rework" itself will not be very difficult.
(.... we tend to get stuck on the "what interface is the best" discussion ;) 

But memcg itself is not so weak against dirty pages, because we call
try_to_free_pages() not because of memory shortage but because of the memory limit.

BTW, in my current stack, the following are queued.
   a. handle SwapCache in proper way in memcg.
   b. handle swap_cgroup (if configured)
   c. make LRU handling easier

For making per-memcg dirty_ratio sane, (a) should go ahead. I do (a) now.
If Andrea seems to be too busy, I'll schedule dirty_ratio-for-memcg as my work.

Thanks,
-Kame

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05  3:05                     ` Andrew Morton
       [not found]                       ` <20081104190505.769b93ec.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2008-11-05 13:52                       ` Christoph Lameter
  2008-11-05 18:41                         ` Andrew Morton
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2008-11-05 13:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Tue, 4 Nov 2008, Andrew Morton wrote:

>> That is one aspect. When performing writeback then we need to figure out
>> which inodes have dirty pages in the memcg and we need to start writeout
>> on those inodes and not on others that have their dirty pages elsewhere.
>> There are two components of this that are in this patch and that would
>> also have to be implemented for a memcg.
>
> Doable.  lru->page->mapping->host is a good start.

The block layer has a list of inodes that are dirty. From that we need to 
select ones that will improve the situation from the cpuset/memcg. How 
does the LRU come into this?

>> This patch would solve the problem if the calculation of the dirty pages
>> would consider the active memcg and be able to determine the amount of
>> dirty pages (through some sort of additional memcg counters). That is just
>> the first part though. The second part of finding the inodes that have
>> dirty pages for writeback would require an association between memcgs and
>> inodes.
>
> We presently have that via the LRU.  It has holes, but so does this per-cpuset
> scheme.

How do I get to the LRU from the dirtied list of inodes?

> Generally, I worry that this is a specific fix to a specific problem
> encountered on specific machines with specific setups and specific
> workloads, and that it's just all too low-level and myopic.
>
> And now we're back in the usual position where there's existing code and
> everyone says it's terribly wonderful and everyone is reluctant to step
> back and look at the big picture.  Am I wrong?

Well everyone is just reluctant to do work it seems. Thus they fall back 
to a solution that I provided when memcg groups were not yet available. It 
would be best if someone could find a general scheme or generalize this 
patchset.

> Plus: we need per-memcg dirty-memory throttling, and this is more
> important than per-cpuset, I suspect.  How will the (already rather
> buggy) code look once we've stuffed both of them in there?

The basics will still be the same

1. One needs to establish the dirty ratio of memcgs and monitor them.
2. There needs to be a mechanism to perform writeout on the right inodes.
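
Step 2 can be pictured with a small sketch, assuming each dirty inode
carries a bitmap of the nodes on which it has dirty pages (for memcgs the
bits would identify groups instead). This is an illustration with
hypothetical names, not the code in the patchset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inode stand-in: bit n of dirty_nodes is set when the
 * inode has at least one dirty page on node n. */
struct sketch_inode {
    unsigned long ino;
    uint32_t dirty_nodes;
};

/* Writing this inode back only helps if it has dirty pages on a node
 * that the throttled cpuset is allowed to allocate from. */
static bool inode_helps(const struct sketch_inode *inode,
                        uint32_t allowed_nodes)
{
    return (inode->dirty_nodes & allowed_nodes) != 0;
}

/* Collect pointers to the inodes worth writing back; returns the count.
 * `out` must have room for n entries. */
static size_t select_inodes(const struct sketch_inode *dirty_list, size_t n,
                            uint32_t allowed_nodes,
                            const struct sketch_inode **out)
{
    size_t picked = 0;

    for (size_t i = 0; i < n; i++) {
        if (inode_helps(&dirty_list[i], allowed_nodes))
            out[picked++] = &dirty_list[i];
    }
    return picked;
}
```

Inodes whose dirty pages all live elsewhere are skipped, so writeout
effort goes to the inodes that can actually lower the dirty ratio being
monitored in step 1.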

> I agree that there's a problem here, although given the amount of time
> that it's been there, I suspect that it is a very small problem.

It used to be only a problem for NUMA systems. Now it's also a problem for 
memcgs.

> Someone please convince me that in three years time we will agree that
> merging this fix to that problem was a correct decision?

At the minimum: It provides a basis on top of which memcg support 
can be developed. There are likely major modifications needed to VM 
statistics to get there for memcg groups.

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 13:52                       ` Christoph Lameter
@ 2008-11-05 18:41                         ` Andrew Morton
  2008-11-05 20:21                           ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-05 18:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Wed, 5 Nov 2008 07:52:44 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Tue, 4 Nov 2008, Andrew Morton wrote:
> 
> >> That is one aspect. When performing writeback then we need to figure out
> >> which inodes have dirty pages in the memcg and we need to start writeout
> >> on those inodes and not on others that have their dirty pages elsewhere.
> >> There are two components of this that are in this patch and that would
> >> also have to be implemented for a memcg.
> >
> > Doable.  lru->page->mapping->host is a good start.
> 
> The block layer has a list of inodes that are dirty. From that we need to 
> select ones that will improve the situation from the cpuset/memcg. How 
> does the LRU come into this?

In the simplest case, dirty-memory throttling can just walk the LRU
writing back pages in the same way that kswapd does.

There would probably be performance benefits in doing
address_space-ordered writeback, so the dirty-memory throttling could
pick a dirty page off the LRU, go find its inode and then feed that
into __sync_single_inode().
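
Roughly, the lru->page->mapping->host walk looks like the sketch below.
The structs are simplified stand-ins rather than the kernel's own
definitions, and locking, reference counting and the actual call into
writeback are all left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures named above. */
struct sketch_inode {
    unsigned long ino;
};

struct sketch_mapping {
    struct sketch_inode *host;      /* inode owning this address space */
};

struct sketch_page {
    struct sketch_page *lru_next;   /* next page on the LRU list */
    struct sketch_mapping *mapping; /* NULL for anonymous pages */
    bool dirty;
};

/*
 * Walk the LRU from `lru_head` and return the inode behind the first
 * dirty, file-backed page, or NULL if there is none. The caller would
 * then write that inode back in address_space order.
 */
static struct sketch_inode *first_dirty_inode(struct sketch_page *lru_head)
{
    for (struct sketch_page *p = lru_head; p != NULL; p = p->lru_next) {
        if (p->dirty && p->mapping != NULL && p->mapping->host != NULL)
            return p->mapping->host;
    }
    return NULL;
}
```

The result is only as accurate as the LRU's view of which pages are
dirty, so it is an approximation rather than an exact inode list.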

> >> This patch would solve the problem if the calculation of the dirty pages
> >> would consider the active memcg and be able to determine the amount of
> >> dirty pages (through some sort of additional memcg counters). That is just
> >> the first part though. The second part of finding the inodes that have
> >> dirty pages for writeback would require an association between memcgs and
> >> inodes.
> >
> > We presently have that via the LRU.  It has holes, but so does this per-cpuset
> > scheme.
> 
> How do I get to the LRU from the dirtied list of inodes?

Don't need it.

It'll be approximate and has obvious scenarios of great inaccuracy
but it'll suffice for the workloads which this patchset addresses.



It sounds like any memcg-based approach just won't be suitable for the
people who are hitting this problem.

But _are_ people hitting this problem?  I haven't seen any real-looking
reports in ages.  Is there some workaround?  If so, what is it?  How
serious is this problem now?

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 18:41                         ` Andrew Morton
@ 2008-11-05 20:21                           ` Christoph Lameter
       [not found]                             ` <Pine.LNX.4.64.0811051415360.31450-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2008-11-05 20:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Wed, 5 Nov 2008, Andrew Morton wrote:

> > > Doable.  lru->page->mapping->host is a good start.
> >
> > The block layer has a list of inodes that are dirty. From that we need to
> > select ones that will improve the situation from the cpuset/memcg. How
> > does the LRU come into this?
>
> In the simplest case, dirty-memory throttling can just walk the LRU
> writing back pages in the same way that kswapd does.

That means running reclaim. But we are only interested in getting rid of
dirty pages. Plus the filesystem guys have repeatedly pointed out that
page sized I/O to random places in a file is not a good thing to do. There
was actually talk of stopping kswapd from writing out pages!

> There would probably be performance benefits in doing
> address_space-ordered writeback, so the dirty-memory throttling could
> pick a dirty page off the LRU, go find its inode and then feed that
> into __sync_single_inode().

We cannot call into the writeback functions for an inode from a reclaim
context. We can write back single pages but not a range of pages from an
inode due to various locking issues (see discussion on slab defrag
patchset).

> > How do I get to the LRU from the dirtied list of inodes?
>
> Don't need it.
>
> It'll be approximate and has obvious scenarios of great inaccuracy
> but it'll suffice for the workloads which this patchset addresses.

Sounds like a wild hack that runs against known limitations in terms
of locking etc.

> It sounds like any memcg-based approach just won't be suitable for the
> people who are hitting this problem.

Why not? If you can determine which memcgs an inode has dirty pages on
then the same scheme as proposed here will work.

> But _are_ people hitting this problem?  I haven't seen any real-looking
> reports in ages.  Is there some workaround?  If so, what is it?  How
> serious is this problem now?

Are there people who are actually having memcg based solutions deployed?
No enterprise release includes it yet so I guess that there is not much of
a use yet.

* Re: [patch 0/7] cpuset writeback throttling
       [not found]                             ` <Pine.LNX.4.64.0811051415360.31450-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>
@ 2008-11-05 20:31                               ` Andrew Morton
  2008-11-05 20:40                                 ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-05 20:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: npiggin, dfults, linux-kernel, rientjes, containers, menage

On Wed, 5 Nov 2008 14:21:47 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Wed, 5 Nov 2008, Andrew Morton wrote:
> 
> > > > Doable.  lru->page->mapping->host is a good start.
> > >
> > > The block layer has a list of inodes that are dirty. From that we need to
> > > select ones that will improve the situation from the cpuset/memcg. How
> > > does the LRU come into this?
> >
> > In the simplest case, dirty-memory throttling can just walk the LRU
> > writing back pages in the same way that kswapd does.
> 
> That means running reclaim. But we are only interested in getting rid of
> dirty pages. Plus the filesystem guys have repeatedly pointed out that
> page sized I/O to random places in a file is not a good thing to do. There
> was actually talk of stopping kswapd from writing out pages!

They don't have to be reclaimed.

> > There would probably be performance benefits in doing
> > address_space-ordered writeback, so the dirty-memory throttling could
> > pick a dirty page off the LRU, go find its inode and then feed that
> > into __sync_single_inode().
> 
> We cannot call into the writeback functions for an inode from a reclaim
> context. We can write back single pages but not a range of pages from an
> inode due to various locking issues (see discussion on slab defrag
> patchset).

We're not in a reclaim context.  We're in sys_write() context.

> > But _are_ people hitting this problem?  I haven't seen any real-looking
> > reports in ages.  Is there some workaround?  If so, what is it?  How
> > serious is this problem now?
> 
> Are there people who are actually having memcg based solutions deployed?
> No enterprise release includes it yet so I guess that there is not much of
> a use yet.

If you know the answer then please provide it.  If you don't, please
say "I don't know".

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:31                               ` Andrew Morton
@ 2008-11-05 20:40                                 ` Christoph Lameter
  2008-11-05 20:56                                   ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2008-11-05 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Wed, 5 Nov 2008, Andrew Morton wrote:

> > That means running reclaim. But we are only interested in getting rid of
> > dirty pages. Plus the filesystem guys have repeatedly pointed out that
> > page sized I/O to random places in a file is not a good thing to do. There
> > was actually talk of stopping kswapd from writing out pages!
>
> They don't have to be reclaimed.

Well, the LRU is used for reclaim. If you step over it then it's using the
existing reclaim logic in vmscan.c, right?

> > > There would probably be performance benefits in doing
> > > address_space-ordered writeback, so the dirty-memory throttling could
> > > pick a dirty page off the LRU, go find its inode and then feed that
> > > into __sync_single_inode().
> >
> > We cannot call into the writeback functions for an inode from a reclaim
> > context. We can write back single pages but not a range of pages from an
> > inode due to various locking issues (see discussion on slab defrag
> > patchset).
>
> We're not in a reclaim context.  We're in sys_write() context.

Dirtying a page can occur from a variety of kernel contexts.

> > > But _are_ people hitting this problem?  I haven't seen any real-looking
> > > reports in ages.  Is there some workaround?  If so, what is it?  How
> > > serious is this problem now?
> >
> > Are there people who are actually having memcg based solutions deployed?
> > No enterprise release includes it yet so I guess that there is not much of
> > a use yet.
>
> If you know the answer then please provide it.  If you don't, please
> say "I don't know".

I thought we were talking about memcg related reports. I have dealt with
scores of the cpuset related ones in my prior job.

Workarounds are:

1. Reduce the global dirty ratios so that the number of dirty pages in a
cpuset cannot become too high.

2. Do not create small cpusets where the system can dirty all pages.

3. Find other ways to limit the dirty pages (run sync once in a while or
so).

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:40                                 ` Christoph Lameter
@ 2008-11-05 20:56                                   ` Andrew Morton
  2008-11-05 21:28                                     ` Christoph Lameter
                                                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Andrew Morton @ 2008-11-05 20:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers

On Wed, 5 Nov 2008 14:40:05 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Wed, 5 Nov 2008, Andrew Morton wrote:
> 
> > > That means running reclaim. But we are only interested in getting rid of
> > > dirty pages. Plus the filesystem guys have repeatedly pointed out that
> > > page sized I/O to random places in a file is not a good thing to do. There
> > > was actually talk of stopping kswapd from writing out pages!
> >
> > They don't have to be reclaimed.
> 
> Well, the LRU is used for reclaim. If you step over it then it's using the
> existing reclaim logic in vmscan.c, right?

Only if you use it that way.

I imagine that a suitable implementation would start IO on the page
then move it to the other end of the LRU.  ie: treat it as referenced. 
Pretty simple stuff.

If we were to do writeout on the page's inode instead then we'd need
to move the page out of the way somehow, presumably by rotating it.
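A toy sketch of the first scheme (start IO on a dirty page, then rotate it to the other end of the LRU) might look like the following. Everything here is hypothetical and for illustration only: `struct toy_page`, `struct toy_lru` and `toy_writeback_from_lru()` are not kernel interfaces, setting the `writeback` flag stands in for actually submitting IO, and locking is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and an LRU list. */
struct toy_page {
	bool dirty;
	bool writeback;			/* set once "IO" has been started */
	struct toy_page *prev, *next;
};

struct toy_lru {
	struct toy_page *cold;		/* end that scanning starts from */
	struct toy_page *hot;		/* the other end */
};

/* Append a page at the hot end of the list. */
static void toy_lru_add_hot(struct toy_lru *lru, struct toy_page *p)
{
	p->next = NULL;
	p->prev = lru->hot;
	if (lru->hot)
		lru->hot->next = p;
	else
		lru->cold = p;
	lru->hot = p;
}

/* Unlink a page from wherever it sits in the list. */
static void toy_lru_del(struct toy_lru *lru, struct toy_page *p)
{
	if (p->prev)
		p->prev->next = p->next;
	else
		lru->cold = p->next;
	if (p->next)
		p->next->prev = p->prev;
	else
		lru->hot = p->prev;
	p->prev = p->next = NULL;
}

/*
 * Walk from the cold end, start IO on up to nr_to_write dirty pages and
 * rotate each of them to the hot end, i.e. treat them as referenced.
 * Clean pages stay where they are.  Returns the number of pages started.
 */
static unsigned int toy_writeback_from_lru(struct toy_lru *lru,
					   unsigned int nr_to_write)
{
	unsigned int started = 0;
	struct toy_page *p = lru->cold;

	while (p && started < nr_to_write) {
		struct toy_page *next = p->next;	/* saved before rotating */

		if (p->dirty && !p->writeback) {
			p->dirty = false;
			p->writeback = true;	/* stand-in for submitting IO */
			toy_lru_del(lru, p);
			toy_lru_add_hot(lru, p);
			started++;
		}
		p = next;
	}
	return started;
}
```

Nothing is reclaimed in this walk: pages are only written and moved out of the scanner's way.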

It's all workable outable.

> > > > There would probably be performance benefits in doing
> > > > address_space-ordered writeback, so the dirty-memory throttling could
> > > > pick a dirty page off the LRU, go find its inode and then feed that
> > > > into __sync_single_inode().
> > >
> > > We cannot call into the writeback functions for an inode from a reclaim
> > > context. We can write back single pages but not a range of pages from an
> > > inode due to various locking issues (see discussion on slab defrag
> > > patchset).
> >
> > We're not in a reclaim context.  We're in sys_write() context.
> 
> Dirtying a page can occur from a variety of kernel contexts.

This writeback will occur from one quite specific place:
balance_dirty_pages().  That's called from sys_write() and pagefaults. 
Other scruffy places like splice too.

But none of that matters - the fact is that we're _already_ doing
writeback from balance_dirty_pages().  All we're talking about here is
alternative schemes for looking up the pages to write.

> > > > But _are_ people hitting this problem?  I haven't seen any real-looking
> > > > reports in ages.  Is there some workaround?  If so, what is it?  How
> > > > serious is this problem now?
> > >
> > > Are there people who are actually having memcg based solutions deployed?
> > > No enterprise release includes it yet so I guess that there is not much of
> > > a use yet.
> >
> > If you know the answer then please provide it.  If you don't, please
> > say "I don't know".
> 
> I thought we were talking about memcg related reports. I have dealt with
> scores of the cpuset related ones in my prior job.
> 
> Workarounds are:
> 
> 1. Reduce the global dirty ratios so that the number of dirty pages in a
> cpuset cannot become too high.

That would be less than the smallest node's memory capacity, I guess.

> 2. Do not create small cpusets where the system can dirty all pages.
> 
> 3. Find other ways to limit the dirty pages (run sync once in a while or
> so).

hm, OK.

See, here's my problem: we have a pile of new code which fixes some
problem.  But the problem seems to be fairly small - it only affects a
small number of sophisticated users and they already have workarounds
in place.

So the world wouldn't end if we just didn't merge it.  Those users
stick with their workarounds and the kernel remains simpler and
smaller.

How do we work out which is the best choice here?  I don't have enough
information to do this.

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:56                                   ` Andrew Morton
@ 2008-11-05 21:28                                     ` Christoph Lameter
  2008-11-05 21:55                                     ` Paul Menage
  2008-11-05 22:04                                     ` David Rientjes
  2 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-11-05 21:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, rientjes, npiggin, menage, dfults, linux-kernel,
	containers, travis, steiner

On Wed, 5 Nov 2008, Andrew Morton wrote:

> See, here's my problem: we have a pile of new code which fixes some
> problem.  But the problem seems to be fairly small - it only affects a
> small number of sophisticated users and they already have workarounds
> in place.

Well yes... Great situation with those workarounds.

> So the world wouldn't end if we just didn't merge it.  Those users
> stick with their workarounds and the kernel remains simpler and
> smaller.
>
> How do we work out which is the best choice here?  I don't have enough
> information to do this.

Not sure how to treat this. I am not involved with large NUMA at this
point. So the people who are interested need to speak up if they want
this.

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:56                                   ` Andrew Morton
  2008-11-05 21:28                                     ` Christoph Lameter
@ 2008-11-05 21:55                                     ` Paul Menage
  2008-11-05 22:04                                     ` David Rientjes
  2 siblings, 0 replies; 23+ messages in thread
From: Paul Menage @ 2008-11-05 21:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, peterz, rientjes, npiggin, dfults,
	linux-kernel, containers

On Wed, Nov 5, 2008 at 12:56 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>>
>> 1. Reduce the global dirty ratios so that the number of dirty pages in a
>> cpuset cannot become too high.
>
> That would be less than the smallest node's memory capacity, I guess.

Even that doesn't work - if there's a single global limit on dirty
pages, then any cpuset/cgroup with access to enough memory can exhaust
that limit and cause other processes to block when they try to write
to disk. You need independent dirty counts to avoid that, whether it
be per-node or per-cgroup.
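
A toy model of that difference, as a sketch only: `struct dirty_acct` and both `must_block_*()` helpers are made up for illustration and do not correspond to existing kernel code, and the limits are arbitrary page counts.

```c
#include <assert.h>
#include <stdbool.h>

#define NGROUPS 2	/* hypothetical number of cpusets/cgroups */

struct dirty_acct {
	unsigned long global_limit;		/* single shared dirty limit */
	unsigned long group_limit[NGROUPS];	/* independent per-group limits */
	unsigned long group_dirty[NGROUPS];	/* dirty pages charged per group */
};

static unsigned long total_dirty(const struct dirty_acct *a)
{
	unsigned long sum = 0;

	for (int i = 0; i < NGROUPS; i++)
		sum += a->group_dirty[i];
	return sum;
}

/*
 * Global-only policy: a writer blocks as soon as the system-wide total
 * reaches the limit, regardless of which group dirtied the pages.
 */
static bool must_block_global(const struct dirty_acct *a, int group)
{
	(void)group;	/* the writer's own group is irrelevant here */
	return total_dirty(a) >= a->global_limit;
}

/* Independent counts: a writer is only held to its own group's limit. */
static bool must_block_per_group(const struct dirty_acct *a, int group)
{
	assert(group >= 0 && group < NGROUPS);
	return a->group_dirty[group] >= a->group_limit[group];
}
```

With group 0 having dirtied the whole global allowance, a writer in group 1 that has dirtied nothing blocks under the first policy but not under the second.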

Paul

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 20:56                                   ` Andrew Morton
  2008-11-05 21:28                                     ` Christoph Lameter
  2008-11-05 21:55                                     ` Paul Menage
@ 2008-11-05 22:04                                     ` David Rientjes
  2008-11-06  1:34                                       ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 23+ messages in thread
From: David Rientjes @ 2008-11-05 22:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, peterz, npiggin, Paul Menage, dfults,
	linux-kernel, containers, Andrea Righi

On Wed, 5 Nov 2008, Andrew Morton wrote:

> See, here's my problem: we have a pile of new code which fixes some
> problem.  But the problem seems to be fairly small - it only affects a
> small number of sophisticated users and they already have workarounds
> in place.
> 

The workarounds, while restrictive of how you configure your cpusets, are 
indeed effective.

> So the world wouldn't end if we just didn't merge it.  Those users
> stick with their workarounds and the kernel remains simpler and
> smaller.
> 

Agreed.  This patchset is admittedly from a different time when cpusets 
was the only relevant extension that needed to be done.

> How do we work out which is the best choice here?  I don't have enough
> information to do this.
> 

If we are to support memcg-specific dirty ratios, that requires the 
aforementioned statistics to be collected so that the calculation is even 
possible.  The series at 

	http://marc.info/?l=linux-kernel&m=122123225006571
	http://marc.info/?l=linux-kernel&m=122123241106902

is a step in that direction, although I'd prefer to see NR_UNSTABLE_NFS to 
be extracted separately from MEM_CGROUP_STAT_FILE_DIRTY so 
throttle_vm_writeout() can also use the new statistics.

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05 22:04                                     ` David Rientjes
@ 2008-11-06  1:34                                       ` KAMEZAWA Hiroyuki
  2008-11-06 20:35                                         ` David Rientjes
  0 siblings, 1 reply; 23+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-06  1:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, npiggin, Christoph Lameter, dfults, linux-kernel,
	containers, Paul Menage, Andrea Righi

On Wed, 5 Nov 2008 14:04:42 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> > So the world wouldn't end if we just didn't merge it.  Those users
> > stick with their workarounds and the kernel remains simpler and
> > smaller.
> > 
> 
> Agreed.  This patchset is admittedly from a different time when cpusets 
> was the only relevant extension that needed to be done.
> 
BTW, what is the problem this patch wants to fix?
  1. avoid slow-down of memory allocation by triggering write-out earlier.
  2. avoid OOM by throttling dirty pages.

About 1, memcg's dirty_ratio can help (once implemented) if mounted as
   mount -t cgroup none /somewhere/  -o cpuset,memory
(if the user can accept the overhead of memcg).

About 2, A Google guy posted OOM handler cgroup to linux-mm.

> > How do we work out which is the best choice here?  I don't have enough
> > information to do this.
> > 
> 
> If we are to support memcg-specific dirty ratios, that requires the 
> aforementioned statistics to be collected so that the calculation is even 
> possible.  The series at 
> 
> 	http://marc.info/?l=linux-kernel&m=122123225006571
> 	http://marc.info/?l=linux-kernel&m=122123241106902
> 
Yes, we (memcg) need this kind of thing.

> is a step in that direction, although I'd prefer to see NR_UNSTABLE_NFS to 
> be extracted separately from MEM_CGROUP_STAT_FILE_DIRTY so 
> throttle_vm_writeout() can also use the new statistics.
> 
Thank you for the input.

Thanks,
-Kame

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-06  1:34                                       ` KAMEZAWA Hiroyuki
@ 2008-11-06 20:35                                         ` David Rientjes
  0 siblings, 0 replies; 23+ messages in thread
From: David Rientjes @ 2008-11-06 20:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, npiggin, Christoph Lameter, dfults, linux-kernel,
	containers, Paul Menage, Andrea Righi

On Thu, 6 Nov 2008, KAMEZAWA Hiroyuki wrote:

> > Agreed.  This patchset is admittedly from a different time when cpusets 
> > was the only relevant extension that needed to be done.
> > 
> BTW, what is the problem this patch wants to fix?
>   1. avoid slow-down of memory allocation by triggering write-out earlier.
>   2. avoid OOM by throttling dirty pages.
> 
> About 1, memcg's dirty_ratio can help (once implemented) if mounted as
>    mount -t cgroup none /somewhere/  -o cpuset,memory
> (if the user can accept the overhead of memcg).
> 

Yeah, it needs to be generalized to its own cgroup so that it doesn't 
depend on either CONFIG_CPUSETS or CONFIG_CGROUP_MEM_RES_CTLR.  If we get 
the dirty and writeback page statistics added to memcg, this becomes much 
simpler.

> About 2, A Google guy posted OOM handler cgroup to linux-mm.
> 

Yeah, this could enable one of the workarounds that Christoph earlier 
described: the oom handler has the ability to notify userspace and allows 
it to defer invoking the oom killer if there's an alternative way to 
remedy the situation.  So the oom handler posted to linux-mm could work by 
doing a sync anytime it ran low on memory, but the objective of this 
patchset is different.

The idea here is to implement per-cpuset (and now per-memcg) dirty and 
background dirty ratios to avoid using the global sysctls.  This is 
currently problematic for users of cpusets who divide their machine for 
batches of tasks, usually for NUMA optimizations: a cpuset, for example, 
can represent 40% of the system's memory and if the global dirty ratio is 
set to 50%, we still won't begin writeback even if all the memory in the 
cpuset is dirty.
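
The arithmetic can be sketched as below, under simplifying assumptions: thresholds are taken as a plain percentage of total pages (the kernel's real calculation is based on dirtyable memory), and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: number of dirty pages at which writeback starts,
 * for a memory size in pages and a ratio in percent.
 */
static unsigned long dirty_threshold(unsigned long pages,
				     unsigned int ratio_pct)
{
	assert(ratio_pct <= 100);
	return (unsigned long)((unsigned long long)pages * ratio_pct / 100);
}

/*
 * Can a cpuset that owns cpuset_pages reach the global threshold by
 * dirtying only its own memory?
 */
static bool cpuset_can_reach_global_threshold(unsigned long total_pages,
					      unsigned long cpuset_pages,
					      unsigned int global_ratio_pct)
{
	return cpuset_pages >= dirty_threshold(total_pages, global_ratio_pct);
}
```

For a machine with 1,000,000 pages and a cpuset owning 400,000 of them, a 50% global ratio gives a threshold of 500,000 pages, which the cpuset cannot reach on its own. The same 50% applied per cpuset would give a threshold of 200,000 pages.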

> > If we are to support memcg-specific dirty ratios, that requires the 
> > aforementioned statistics to be collected so that the calculation is even 
> > possible.  The series at 
> > 
> > 	http://marc.info/?l=linux-kernel&m=122123225006571
> > 	http://marc.info/?l=linux-kernel&m=122123241106902
> > 
> Yes, we (memcg) need this kind of thing.
> 

Andrea, what's the status of the patch to add dirty and writeback 
statistics to memcg?  I don't see it in the October 30 mmotm or any 
followup discussion on it.

> > is a step in that direction, although I'd prefer to see NR_UNSTABLE_NFS to 
> > be extracted separately from MEM_CGROUP_STAT_FILE_DIRTY so 
> > throttle_vm_writeout() can also use the new statistics.
> > 

Is this possible in a second version?

* Re: [patch 0/7] cpuset writeback throttling
  2008-11-05  4:31                         ` KAMEZAWA Hiroyuki
@ 2008-11-10  9:02                           ` Andrea Righi
       [not found]                             ` <4917F895.109-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2008-11-10  9:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, rientjes
  Cc: Andrew Morton, Christoph Lameter, peterz, npiggin, menage, dfults,
	linux-kernel, containers

On 2008-11-05 05:31, KAMEZAWA Hiroyuki wrote:
> On Tue, 4 Nov 2008 19:05:05 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
>> Generally, I worry that this is a specific fix to a specific problem
>> encountered on specific machines with specific setups and specific
>> workloads, and that it's just all too low-level and myopic.
>>
>> And now we're back in the usual position where there's existing code and
>> everyone says it's terribly wonderful and everyone is reluctant to step
>> back and look at the big picture.  Am I wrong?
>>
>>
>> Plus: we need per-memcg dirty-memory throttling, and this is more
>> important than per-cpuset, I suspect.  How will the (already rather
>> buggy) code look once we've stuffed both of them in there?
>>
>>
> IIUC, Andrea Righi posted 2 patches around dirty_ratio in early October
> (added him to CC:).
> 
>   (1) patch for adding dirty_ratio_pcm. (1/100000)
>   (2) per-memcg dirty ratio. (maybe this: http://lkml.org/lkml/2008/9/12/121)
>  
> (1) should just be posted again.
> 
> Because we have changed the page_cgroup implementation, (2) should be reworked.
> The "rework" itself will not be very difficult.
> (.... we tend to get stuck in "what interface is the best" discussions ;) 
> 
> But memcg itself is not so weak against dirty pages, because we call
> try_to_free_pages() not because of a memory shortage but because of the
> memory limit.
> 
> BTW, in my current stack, the following are queued:
>    a. handle SwapCache in a proper way in memcg.
>    b. handle swap_cgroup (if configured)
>    c. make LRU handling easier
> 
> For making per-memcg dirty_ratio sane, (a) should go first. I am doing (a) now.
> If Andrea seems to be too busy, I'll schedule dirty_ratio-for-memcg as my work.
> 

Hi Kame,

Sorry for the late reply. If it's not too late tonight I'll rebase (1)
onto 2.6.28-rc2-mm1, test it, and start reworking (2), also taking
David's suggestion into account (split NR_UNSTABLE_NFS from NR_FILE_DIRTY).

-Andrea

* Re: [patch 0/7] cpuset writeback throttling
       [not found]                             ` <4917F895.109-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2008-11-10 10:02                               ` David Rientjes
  0 siblings, 0 replies; 23+ messages in thread
From: David Rientjes @ 2008-11-10 10:02 UTC (permalink / raw)
  To: Andrea Righi
  Cc: npiggin-l3A5Bk7waGM, Christoph Lameter, dfults-sJ/iWh9BUns,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-qjLDD68F18O7TbgM5vRIOg, Andrew Morton,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Mon, 10 Nov 2008, Andrea Righi wrote:

> > IIUC, Andrea Righi posted 2 patches around dirty_ratio in early October
> > (added him to CC:).
> > 
> >   (1) patch for adding dirty_ratio_pcm. (1/100000)
> >   (2) per-memcg dirty ratio. (maybe this: http://lkml.org/lkml/2008/9/12/121)
> >  
> > (1) should just be posted again.
> > 
> > Because we have changed the page_cgroup implementation, (2) should be reworked.
> > The "rework" itself will not be very difficult.
> > (.... we tend to get stuck in "what interface is the best" discussions ;) 
> > 
> > But memcg itself is not so weak against dirty pages, because we call
> > try_to_free_pages() not because of a memory shortage but because of the
> > memory limit.
> > 
> > BTW, in my current stack, the following are queued:
> >    a. handle SwapCache in a proper way in memcg.
> >    b. handle swap_cgroup (if configured)
> >    c. make LRU handling easier
> > 
> > For making per-memcg dirty_ratio sane, (a) should go first. I am doing (a) now.
> > If Andrea seems to be too busy, I'll schedule dirty_ratio-for-memcg as my work.
> > 
> 
> Hi Kame,
> 
> Sorry for the late reply. If it's not too late tonight I'll rebase (1)
> onto 2.6.28-rc2-mm1, test it, and start reworking (2), also taking
> David's suggestion into account (split NR_UNSTABLE_NFS from NR_FILE_DIRTY).
> 

The dirty throttling change only depends on patch 1/2 from (2) above, 
which adds the necessary statistics to the memcg for calculating dirty 
ratios.  Patch 2/2 from that series will need to be moved out to a 
separate cgroup, which the general consensus from this discussion 
indicates is necessary.

If it's possible to get a rebased patch 1/2 to add the statistics to the 
memory controller, I'll take it and move all the dirty throttling to a 
separate cgroup so it supports both cpusets and memcg.  Thanks!

end of thread, other threads:[~2008-11-10 10:02 UTC | newest]

Thread overview: 23+ messages
     [not found] <alpine.DEB.2.00.0810292337170.23858@chino.kir.corp.google.com>
     [not found] ` <20081104124753.fb1dde5a.akpm@linux-foundation.org>
     [not found]   ` <1225831988.7803.1939.camel@twins>
     [not found]     ` <20081104131637.68fbe055.akpm@linux-foundation.org>
     [not found]       ` <1225833710.7803.1993.camel@twins>
2008-11-04 21:50         ` [patch 0/7] cpuset writeback throttling Andrew Morton
2008-11-04 22:17           ` Christoph Lameter
2008-11-04 22:35             ` Andrew Morton
2008-11-04 22:52               ` Christoph Lameter
2008-11-04 23:36                 ` Andrew Morton
2008-11-05  1:31                   ` KAMEZAWA Hiroyuki
2008-11-05  3:09                     ` Andrew Morton
2008-11-05  2:45                   ` Christoph Lameter
2008-11-05  3:05                     ` Andrew Morton
     [not found]                       ` <20081104190505.769b93ec.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2008-11-05  4:31                         ` KAMEZAWA Hiroyuki
2008-11-10  9:02                           ` Andrea Righi
     [not found]                             ` <4917F895.109-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2008-11-10 10:02                               ` David Rientjes
2008-11-05 13:52                       ` Christoph Lameter
2008-11-05 18:41                         ` Andrew Morton
2008-11-05 20:21                           ` Christoph Lameter
     [not found]                             ` <Pine.LNX.4.64.0811051415360.31450-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>
2008-11-05 20:31                               ` Andrew Morton
2008-11-05 20:40                                 ` Christoph Lameter
2008-11-05 20:56                                   ` Andrew Morton
2008-11-05 21:28                                     ` Christoph Lameter
2008-11-05 21:55                                     ` Paul Menage
2008-11-05 22:04                                     ` David Rientjes
2008-11-06  1:34                                       ` KAMEZAWA Hiroyuki
2008-11-06 20:35                                         ` David Rientjes
