* [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
@ 2012-08-16 20:53 Ying Han
2012-08-16 21:10 ` Rik van Riel
0 siblings, 1 reply; 12+ messages in thread
From: Ying Han @ 2012-08-16 20:53 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
Rik van Riel, Greg Thelen, Christoph Lameter, KOSAKI Motohiro,
Glauber Costa
Cc: linux-mm
This patchset adds the ability to isolate vfs slab objects per-memcg under
reclaim. The feature is a *must-have* once kernel slab memory accounting
starts charging slab objects to individual memcgs: the existing per-superblock
shrinker doesn't work for this, since it ends up reclaiming slabs charged to
other memcgs.
The patches were last rebased to v3.3, and that kernel is up and running in our
environment. For this posting I rebased them on top of v3.5 with few conflicts;
this post is mainly an RFC for the design.
This patchset has a functional dependency on slab accounting, which it uses to
query the owner of a slab object. I left that call commented out so that the
kernel at least compiles for now. Regarding the two implementations of kernel
slab accounting, Google's vs upstream's, they share a lot of similarities; the
main difference is how reparenting works in mem_cgroup_destroy(). At Google, we
reparent the kmem_cache, as well as the dentry objects, to root, so further
pressure applied at root ends up reclaiming those objects as well. Since the
kernel slab accounting feature is still under discussion, I will leave that
aside for this RFC and assume that reparenting to root still holds.
For now the patch only handles the dentry cache, given that dentries pin
inodes. Based on the data we've collected, the dentry cache is the main
contributor of reclaimable slab objects. We could also build a generic
infrastructure for all the shrinkers (if needed), but as we discussed during
the last KS, making dentries work is a good start. Eventually, that might be
the only thing we care about.
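To sketch the direction (a rough illustration only, not the actual patch code;
the shrink_control field names follow the patch titles below, and
mem_cgroup_iter() is simply used as the hierarchy walker here), target reclaim
would pass the memcg context down through shrink_control and walk the
hierarchy below the target:

/*
 * Rough sketch only -- illustrates the idea, not the patches themselves.
 */
struct shrink_control {
	gfp_t			gfp_mask;
	unsigned long		nr_to_scan;
	struct mem_cgroup	*target_mem_cgroup;	/* root of this reclaim */
	struct mem_cgroup	*mem_cgroup;		/* memcg being scanned now */
};

static unsigned long shrink_slab_hierarchy(struct shrink_control *sc,
					   unsigned long nr_pages_scanned,
					   unsigned long lru_pages)
{
	struct mem_cgroup *memcg;
	unsigned long freed = 0;

	/* visit the target memcg and every child below it */
	for (memcg = mem_cgroup_iter(sc->target_mem_cgroup, NULL, NULL);
	     memcg;
	     memcg = mem_cgroup_iter(sc->target_mem_cgroup, memcg, NULL)) {
		sc->mem_cgroup = memcg;
		freed += shrink_slab(sc, nr_pages_scanned, lru_pages);
	}
	return freed;
}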
Before settling on this implementation, we did consider other options:
1. Keep the global list but filter while scanning. The performance was really
bad in our tests.
2. Make per-superblock per-memcg lru lists. The implementation would be very
complicated considering all the race conditions.
The work was started by Andrew Bresticker (a former intern) and was also
greatly inspired by the slab accounting work of Nikhil Rao <ncrao@google.com>,
Greg Thelen <gthelen@google.com> and Suleiman Souhlal <suleiman@google.com>.
Ying Han (6):
mm: pass priority to prune_icache_sb()
mm: memcg add target_mem_cgroup, mem_cgroup fields to shrink_control
mm: memcg restructure shrink_slab to walk memory cgroup hierarchy
mm: shrink slab with memcg context
mm: move dcache slabs to root lru when memcg exits
mm: shrink slab during memcg reclaim
fs/dcache.c | 214 ++++++++++++++++++++++++++++++++++++++++---
fs/inode.c | 40 ++++++++-
fs/super.c | 30 +++++--
include/linux/dcache.h | 8 ++
include/linux/fs.h | 34 ++++++-
include/linux/memcontrol.h | 8 ++
include/linux/shrinker.h | 12 +++
include/linux/slab_def.h | 5 +
mm/memcontrol.c | 49 ++++++++++
mm/slab.c | 8 ++
mm/vmscan.c | 70 +++++++++++----
11 files changed, 432 insertions(+), 46 deletions(-)
--
1.7.7.3
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-16 20:53 [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup Ying Han
@ 2012-08-16 21:10 ` Rik van Riel
2012-08-16 23:41 ` Dave Chinner
0 siblings, 1 reply; 12+ messages in thread
From: Rik van Riel @ 2012-08-16 21:10 UTC (permalink / raw)
To: Ying Han
Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
Greg Thelen, Christoph Lameter, KOSAKI Motohiro, Glauber Costa,
linux-mm, dchinner
On 08/16/2012 04:53 PM, Ying Han wrote:
> The patchset adds the functionality of isolating the vfs slab objects per-memcg
> under reclaim. This feature is a *must-have* after the kernel slab memory
> accounting which starts charging the slab objects into individual memcgs. The
> existing per-superblock shrinker doesn't work since it will end up reclaiming
> slabs being charged to other memcgs.
> The patch now is only handling dentry cache by given the nature dentry pinned
> inode. Based on the data we've collected, that contributes the main factor of
> the reclaimable slab objects. We also could make a generic infrastructure for
> all the shrinkers (if needed).
Dave Chinner has some prototype code for that.
As an aside, the slab LRUs can also keep
recent_scanned, recent_rotated and recent_pressure
statistics, so we can balance pressure between the
normal page LRUs and the slab LRUs in the exact
same way my patch series balances pressure between
cgroups.
This could be important, because the slab LRUs
span multiple memory zones, while the normal page
LRUs only live in one memory zone each.
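Roughly, the kind of feedback I mean (a sketch only; the statistics mirror
what page reclaim already keeps per LRU list, none of this is existing slab
LRU code):

/*
 * Sketch: scale scan pressure on an LRU by how often recently scanned
 * items were rotated (re-referenced) instead of being reclaimed.
 */
struct lru_pressure_stats {
	unsigned long	recent_scanned;
	unsigned long	recent_rotated;
	unsigned long	recent_pressure;
};

static unsigned long lru_scan_share(struct lru_pressure_stats *st,
				    unsigned long nr_items)
{
	unsigned long scanned = st->recent_scanned + 1;
	unsigned long rotated = st->recent_rotated + 1;

	/* many rotations -> the working set is hot -> scan this LRU less */
	return nr_items * scanned / (scanned + rotated);
}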
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-16 21:10 ` Rik van Riel
@ 2012-08-16 23:41 ` Dave Chinner
2012-08-17 5:15 ` Glauber Costa
0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2012-08-16 23:41 UTC (permalink / raw)
To: Rik van Riel
Cc: Ying Han, Michal Hocko, Johannes Weiner, Mel Gorman,
KAMEZAWA Hiroyuki, Greg Thelen, Christoph Lameter,
KOSAKI Motohiro, Glauber Costa, linux-mm
On Thu, Aug 16, 2012 at 05:10:57PM -0400, Rik van Riel wrote:
> On 08/16/2012 04:53 PM, Ying Han wrote:
> >The patchset adds the functionality of isolating the vfs slab objects per-memcg
> >under reclaim. This feature is a *must-have* after the kernel slab memory
> >accounting which starts charging the slab objects into individual memcgs. The
> >existing per-superblock shrinker doesn't work since it will end up reclaiming
> >slabs being charged to other memcgs.
What list was this posted to?
The per-sb shrinkers are not intended for memcg granularity - they
are for scalability in that they allow the removal of the global
inode and dcache LRU locks and allow significant flexibility in
cache reclaim strategies for filesystems. Hint: reclaiming
the VFS inode cache doesn't free any memory on an XFS filesystem -
it's the XFS inode cache shrinker that is integrated into the per-sb
shrinker infrastructure that frees all the memory. It doesn't work
without the per-sb shrinker functionality and it's an extremely
performance critical balancing act. Hence any changes to this
shrinker infrastructure need a lot of consideration and testing,
most especially to ensure that the balance of the system has not
been disturbed.
Also, how do you propose to solve the problem of inodes and dentries
shared across multiple memcgs? They can only be tracked in one LRU,
but the caches are global and are globally accessed. Having mem
pressure in a single memcg that causes globally accessed dentries
and inodes to be tossed from memory will simply cause cache
thrashing and performance across the system will tank.
> >The patch now is only handling dentry cache by given the nature dentry pinned
> >inode. Based on the data we've collected, that contributes the main factor of
> >the reclaimable slab objects. We also could make a generic infrastructure for
> >all the shrinkers (if needed).
>
> Dave Chinner has some prototype code for that.
The patchset I have makes the dcache lru locks per-sb as the first
step to introducing generic per-sb LRU lists, and then builds on
that to provide generic kernel-wide LRU lists with integrated
shrinkers, and builds on that to introduce node-awareness (i.e. NUMA
scalability) into the LRU list so everyone gets scalable shrinkers.
I've looked at memcg awareness in the past, but the problem is the
overhead - the explosion of LRUs because of the per-sb X per-node X
per-memcg object tracking matrix. It's a huge amount of overhead
and complexity, and unless there's a way of efficiently tracking
objects both per-node and per-memcg simultaneously, then I'm of the
opinion that memcg awareness is simply too much trouble, complexity
and overhead to bother with.
So, convince me you can solve the various problems. ;)
Cheers,
Dave.
--
Dave Chinner
dchinner@redhat.com
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-16 23:41 ` Dave Chinner
@ 2012-08-17 5:15 ` Glauber Costa
2012-08-17 5:40 ` Ying Han
2012-08-17 7:54 ` Dave Chinner
0 siblings, 2 replies; 12+ messages in thread
From: Glauber Costa @ 2012-08-17 5:15 UTC (permalink / raw)
To: Dave Chinner
Cc: Rik van Riel, Ying Han, Michal Hocko, Johannes Weiner, Mel Gorman,
KAMEZAWA Hiroyuki, Greg Thelen, Christoph Lameter,
KOSAKI Motohiro, linux-mm
On 08/17/2012 03:41 AM, Dave Chinner wrote:
> On Thu, Aug 16, 2012 at 05:10:57PM -0400, Rik van Riel wrote:
>> On 08/16/2012 04:53 PM, Ying Han wrote:
>>> The patchset adds the functionality of isolating the vfs slab objects per-memcg
>>> under reclaim. This feature is a *must-have* after the kernel slab memory
>>> accounting which starts charging the slab objects into individual memcgs. The
>>> existing per-superblock shrinker doesn't work since it will end up reclaiming
>>> slabs being charged to other memcgs.
>
> What list was this posted to?
This what? The per-memcg slab accounting? It went to linux-mm and cgroups, and
at least once to lkml.
You can also find the up-to-date version in my git tree:
git://github.com/glommer/linux.git memcg-3.5/kmemcg-slab
But then you mainly lose the discussion. You can find the thread at
http://lwn.net/Articles/508087/, and if you scan recent messages to
linux-mm, there is a lot there too.
> The per-sb shrinkers are not intended for memcg granularity - they
> are for scalability in that they allow the removal of the global
> inode and dcache LRU locks and allow significant flexibility in
> cache relcaim strategies for filesystems. Hint: reclaiming
> the VFS inode cache doesn't free any memory on an XFS filesystem -
> it's the XFS inode cache shrinker that is integrated into the per-sb
> shrinker infrastructure that frees all the memory. It doesn't work
> without the per-sb shrinker functionality and it's an extremely
> performance critical balancing act. Hence any changes to this
> shrinker infrastructure need a lot of consideration and testing,
> most especially to ensure that the balance of the system has not
> been disturbed.
>
I was actually wondering where the balance would stand between hooking
this into the current shrinking mechanism, and having something totally
separate for memcg. It is tempting to believe that we could get away
with something that works well for memcg-only, but this already proved
not to be true for the user pages lru list...
> Also how do yo propose to solve the problem of inodes and dentries
> shared across multiple memcgs? They can only be tracked in one LRU,
> but the caches are global and are globally accessed.
I think the proposal is to not solve this problem. Because at first it
sounds a bit weird, let me explain myself:
1) Not all processes in the system will sit on a memcg.
Technically they will, but the root cgroup is never accounted, so a big
part of the workload can be considered "global" and will have no
attached memcg information whatsoever.
2) Not all child memcgs will have associated vfs objects, or kernel
objects at all, for that matter. This happens only when specifically
requested by the user.
Due to that, I believe that although sharing is obviously a reality within the
VFS, the workloads associated with it will tend to be fairly local. When
sharing does happen, we currently account the object to the first process that
ever touches it. This is also how memcg treats shared memory users for
userspace pages, and it has been working well so far. It doesn't *always* give
you good behavior, but I guess those cases fall in the list of "workloads memcg
is not good for".
Do we want to extend this list of use cases? Sure. There is also discussion
going on about how to improve this in the future. That would allow a policy to
specify which memcg is to be "responsible" for the shared objects, be they
kernel memory or shared memory regions. Even then, we'll always have one of two
scenarios:
1) There is a memcg that is responsible for accounting that object, and then it
is clear we should reclaim from that memcg.
2) There is no memcg associated with the object, and then we should not bother
with that object at all.
I fully understand your concern, specifically because we talked about that in
detail in the past. But I believe most of the cases that would justify it would
fall in 2).
Another thing to keep in mind is that we don't actually track objects.
We track pages, and try to make sure that objects in the same page
belong to the same memcg. (That could be important for your analysis or
not...)
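Put differently, finding the owner of an object reduces to finding the owner
of its page, along the lines of this sketch (the page-level accessor is
whatever the slab accounting patches end up providing; the names here are
illustrative):

/*
 * Sketch only: objects are not tracked individually, the memcg is derived
 * from the page the object lives in.  page_memcg() stands in for the real
 * accessor provided by the slab accounting patches.
 */
static struct mem_cgroup *memcg_from_object(void *obj)
{
	struct page *page = virt_to_head_page(obj);

	return page_memcg(page);	/* NULL for unaccounted (root) pages */
}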
> Having mem
> pressure in a single memcg that causes globally accessed dentries
> and inodes to be tossed from memory will simply cause cache
> thrashing and performance across the system will tank.
>
As said above, I don't consider globally accessed dentries to be
representative of the current use cases for memcg.
>>> The patch now is only handling dentry cache by given the nature dentry pinned
>>> inode. Based on the data we've collected, that contributes the main factor of
>>> the reclaimable slab objects. We also could make a generic infrastructure for
>>> all the shrinkers (if needed).
>>
>> Dave Chinner has some prototype code for that.
>
> The patchset I have makes the dcache lru locks per-sb as the first
> step to introducing generic per-sb LRU lists, and then builds on
> that to provide generic kernel-wide LRU lists with integrated
> shrinkers, and builds on that to introduce node-awareness (i.e. NUMA
> scalability) into the LRU list so everyone gets scalable shrinkers.
>
If you are building a generic infrastructure for shrinkers, what is the big
point about per-sb? I'll give you that most of the memory will come from the
VFS, but other objects that bear no relationship with the vfs are shrinkable
too.
> I've looked at memcg awareness in the past, but the problem is the
> overhead - the explosion of LRUs because of the per-sb X per-node X
> per-memcg object tracking matrix. It's a huge amount of overhead
> and complexity, and unless there's a way of efficiently tracking
> objects both per-node and per-memcg simulatneously then I'm of the
> opinion that memcg awareness is simply too much trouble, complexity
> and overhead to bother with.
>
> So, convince me you can solve the various problems. ;)
>
I believe we are open minded regarding a solution for that, and your input is
obviously very welcome. So let me take a step back and restate the problem:
1) Some memcgs, not all, will have memory pressure regardless of the memory
pressure in the rest of the system.
2) That memory pressure may or may not involve kernel objects.
3) If kernel objects are involved, we can assume the level of sharing is low.
4) We then need to shrink memory from that memcg, affecting the others as
little as we can.
Do you have any proposals for that, in any shape?
One thing that crossed my mind was that instead of having per-sb x per-node
objects, we could have per-"group" x per-node objects. The group would then be
either a memcg or a sb. Objects that don't belong to a memcg - where we expect
most of the globally accessed ones to fall - would be tied to the sb. Global
shrinkers, when called, would of course scan all groups. Shrinking could also
be triggered for a single group. An object would of course only live in one of
them at a time (roughly as sketched below).
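Roughly, the ownership could be expressed as something like this (illustrative
only, nothing here is implemented):

/*
 * Illustrative only: a "group" owns an LRU, and is either a memcg
 * (accounted objects) or the superblock (global objects).  Each group
 * keeps per-node lists.
 */
struct slab_lru_group {
	struct mem_cgroup	*memcg;	/* NULL for the global/sb-owned case */
	struct super_block	*sb;
	struct list_head	lru[MAX_NUMNODES];
	long			nr_items;
};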
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-17 5:15 ` Glauber Costa
@ 2012-08-17 5:40 ` Ying Han
2012-08-17 5:42 ` Glauber Costa
2012-08-19 3:41 ` Andi Kleen
2012-08-17 7:54 ` Dave Chinner
1 sibling, 2 replies; 12+ messages in thread
From: Ying Han @ 2012-08-17 5:40 UTC (permalink / raw)
To: Glauber Costa
Cc: Dave Chinner, Rik van Riel, Michal Hocko, Johannes Weiner,
Mel Gorman, KAMEZAWA Hiroyuki, Greg Thelen, Christoph Lameter,
KOSAKI Motohiro, linux-mm
On Thu, Aug 16, 2012 at 10:15 PM, Glauber Costa <glommer@parallels.com> wrote:
> On 08/17/2012 03:41 AM, Dave Chinner wrote:
>> On Thu, Aug 16, 2012 at 05:10:57PM -0400, Rik van Riel wrote:
>>> On 08/16/2012 04:53 PM, Ying Han wrote:
>>>> The patchset adds the functionality of isolating the vfs slab objects per-memcg
>>>> under reclaim. This feature is a *must-have* after the kernel slab memory
>>>> accounting which starts charging the slab objects into individual memcgs. The
>>>> existing per-superblock shrinker doesn't work since it will end up reclaiming
>>>> slabs being charged to other memcgs.
>>
>> What list was this posted to?
>
> This what? per-memcg slab accounting ? linux-mm and cgroups, and at
> least once to lkml.
>
> You can also find the up2date version in my git tree:
>
> git://github.com/glommer/linux.git memcg-3.5/kmemcg-slab
>
> But then you mainly lose the discussion. You can find the thread at
> http://lwn.net/Articles/508087/, and if you scan recent messages to
> linux-mm, there is a lot there too.
>
>> The per-sb shrinkers are not intended for memcg granularity - they
>> are for scalability in that they allow the removal of the global
>> inode and dcache LRU locks and allow significant flexibility in
>> cache relcaim strategies for filesystems. Hint: reclaiming
>> the VFS inode cache doesn't free any memory on an XFS filesystem -
>> it's the XFS inode cache shrinker that is integrated into the per-sb
>> shrinker infrastructure that frees all the memory. It doesn't work
>> without the per-sb shrinker functionality and it's an extremely
>> performance critical balancing act. Hence any changes to this
>> shrinker infrastructure need a lot of consideration and testing,
>> most especially to ensure that the balance of the system has not
>> been disturbed.
>>
>
> I was actually wondering where the balance would stand between hooking
> this into the current shrinking mechanism, and having something totally
> separate for memcg. It is tempting to believe that we could get away
> with something that works well for memcg-only, but this already proved
> to be not true for the user pages lru list...
>
>
>> Also how do yo propose to solve the problem of inodes and dentries
>> shared across multiple memcgs? They can only be tracked in one LRU,
>> but the caches are global and are globally accessed.
>
> I think the proposal is to not solve this problem. Because at first it
> sounds a bit weird, let me explain myself:
>
> 1) Not all processes in the system will sit on a memcg.
> Technically they will, but the root cgroup is never accounted, so a big
> part of the workload can be considered "global" and will have no
> attached memcg information whatsoever.
>
> 2) Not all child memcgs will have associated vfs objects, or kernel
> objects at all, for that matter. This happens only when specifically
> requested by the user.
>
> Due to that, I believe that although sharing is obviously a reality
> within the VFS, but the workloads associated to this will tend to be
> fairly local. When sharing does happen, we currently account to the
> first process to ever touch the object. This is also how memcg treats
> shared memory users for userspace pages and it is working well so far.
> It doesn't *always* give you good behavior, but I guess those fall in
> the list of "workloads memcg is not good for".
>
> Do we want to extend this list of use cases? Sure. There is also
> discussion going on about how to improve this in the future. That would
> allow a policy to specify which memcg is to be "responsible" for the
> shared objects, be them kernel memory or shared memory regions. Even
> then, we'll always have one of the two scenarios:
>
> 1) There is a memcg that is responsible for accounting that object, and
> then is clear we should reclaim from that memcg.
>
> 2) There is no memcg associated with the object, and then we should not
> bother with that object at all.
In the patch I have, all objects are associated with *a* memcg. Objects that
are charged to root or reparented to root get associated with root, and further
memory pressure on root (global reclaim) will be applied to those objects.
>
> I fully understand your concern, specifically because we talked about
> that in details in the past. But I believe most of the cases that would
> justify it would fall in 2).
>
> Another thing to keep in mind is that we don't actually track objects.
> We track pages, and try to make sure that objects in the same page
> belong to the same memcg. (That could be important for your analysis or
> not...)
>
>> Having mem
>> pressure in a single memcg that causes globally accessed dentries
>> and inodes to be tossed from memory will simply cause cache
>> thrashing and performance across the system will tank.
>>
Not sure that is the case after this patch. The global LRU is split per-memcg,
and each dentry is linked to its per-memcg list. So under target reclaim of
memcg A, it will only reclaim from the hashtable bucket indexed by A, not the
others.
> As said above. I don't consider global accessed dentries to be
> representative of the current use cases for memcg.
>
>>>> The patch now is only handling dentry cache by given the nature dentry pinned
>>>> inode. Based on the data we've collected, that contributes the main factor of
>>>> the reclaimable slab objects. We also could make a generic infrastructure for
>>>> all the shrinkers (if needed).
>>>
>>> Dave Chinner has some prototype code for that.
>>
>> The patchset I have makes the dcache lru locks per-sb as the first
>> step to introducing generic per-sb LRU lists, and then builds on
>> that to provide generic kernel-wide LRU lists with integrated
>> shrinkers, and builds on that to introduce node-awareness (i.e. NUMA
>> scalability) into the LRU list so everyone gets scalable shrinkers.
>>
>
> If you are building a generic infrastructure for shrinkers, what is the
> big point about per-sb? I'll give you that most of the memory will come
> from the VFS, but other objects are shrinkable too, that bears no
> relationship with the vfs.
The patchset is trying to solve a very simple problem: allowing shrink_slab()
to locate the *right* dentry objects to reclaim given the memcg context.
I haven't thought about NUMA and node awareness for the shrinkers; that sounds
like something beyond the problem I am trying to solve here. I might need to
think a bit more about how that fits into the problem you described.
>
>> I've looked at memcg awareness in the past, but the problem is the
>> overhead - the explosion of LRUs because of the per-sb X per-node X
>> per-memcg object tracking matrix. It's a huge amount of overhead
>> and complexity, and unless there's a way of efficiently tracking
>> objects both per-node and per-memcg simulatneously then I'm of the
>> opinion that memcg awareness is simply too much trouble, complexity
>> and overhead to bother with.
>>
>> So, convince me you can solve the various problems. ;)
>>
>
> I believe we are open minded regarding a solution for that, and your
> input is obviously top. So let me take a step back and restate the problem:
>
> 1) Some memcgs, not all, will have memory pressure regardless of the
> memory pressure in the rest of the system
> 2) that memory pressure may or may not involve kernel objects.
> 3) if kernel objects are involved, we can assume the level of sharing is
> low.
> 4) We then need to shrink memory from that memcg, affecting the others
> the least we can.
>
> Do you have any proposals for that, in any shape?
>
> One thing that crossed my mind, was instead of having per-sb x per-node
> objects, we could have per-"group" x per-node objects. The group would
> then be either a memcg or a sb. Objects that doesn't belong to a memcg -
> where we expect most of the globally accessed to fall, would be tied to
> the sb. Global shrinkers, when called, would of course scan all groups.
> Shrinking could also be triggered for the group. An object would of
> course only live in one of them at a time.
Not sure I understand this. Will think a bit more tomorrow morning
when my brain works better :)
--Ying
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-17 5:40 ` Ying Han
@ 2012-08-17 5:42 ` Glauber Costa
2012-08-17 7:56 ` Dave Chinner
2012-08-19 3:41 ` Andi Kleen
1 sibling, 1 reply; 12+ messages in thread
From: Glauber Costa @ 2012-08-17 5:42 UTC (permalink / raw)
To: Ying Han
Cc: Dave Chinner, Rik van Riel, Michal Hocko, Johannes Weiner,
Mel Gorman, KAMEZAWA Hiroyuki, Greg Thelen, Christoph Lameter,
KOSAKI Motohiro, linux-mm
On 08/17/2012 09:40 AM, Ying Han wrote:
>> > 2) There is no memcg associated with the object, and then we should not
>> > bother with that object at all.
> In the patch I have, all objects are associated with *a* memcg. For
> those objects are charged to root or reparented to root,
> they do get associated with root and further memory pressure on root (
> global reclaim ) will be applied on those objects.
>
For the practical purposes of what Dave is concerned about, "no memcg" equals
"root memcg", right? It still holds that we would expect globally accessed
dentries to belong to root/no-memcg, and that per-group pressure would not get
to them.
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-17 5:15 ` Glauber Costa
2012-08-17 5:40 ` Ying Han
@ 2012-08-17 7:54 ` Dave Chinner
2012-08-17 10:00 ` Glauber Costa
2012-08-17 14:44 ` Rik van Riel
1 sibling, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2012-08-17 7:54 UTC (permalink / raw)
To: Glauber Costa
Cc: Rik van Riel, Ying Han, Michal Hocko, Johannes Weiner, Mel Gorman,
KAMEZAWA Hiroyuki, Greg Thelen, Christoph Lameter,
KOSAKI Motohiro, linux-mm
On Fri, Aug 17, 2012 at 09:15:11AM +0400, Glauber Costa wrote:
> On 08/17/2012 03:41 AM, Dave Chinner wrote:
> > On Thu, Aug 16, 2012 at 05:10:57PM -0400, Rik van Riel wrote:
> >> On 08/16/2012 04:53 PM, Ying Han wrote:
> >>> The patchset adds the functionality of isolating the vfs slab objects per-memcg
> >>> under reclaim. This feature is a *must-have* after the kernel slab memory
> >>> accounting which starts charging the slab objects into individual memcgs. The
> >>> existing per-superblock shrinker doesn't work since it will end up reclaiming
> >>> slabs being charged to other memcgs.
> >
> > What list was this posted to?
>
> This what? per-memcg slab accounting ? linux-mm and cgroups, and at
> least once to lkml.
Hi Glauber, I must have lost it in the noise of lkml because
anything that doesn't hit my procmail filters generally gets deleted
without being read.
> You can also find the up2date version in my git tree:
>
> git://github.com/glommer/linux.git memcg-3.5/kmemcg-slab
>
> But then you mainly lose the discussion. You can find the thread at
> http://lwn.net/Articles/508087/, and if you scan recent messages to
> linux-mm, there is a lot there too.
If I'm lucky, I'll get to looking at that sometime next week.
> > The per-sb shrinkers are not intended for memcg granularity - they
> > are for scalability in that they allow the removal of the global
> > inode and dcache LRU locks and allow significant flexibility in
> > cache relcaim strategies for filesystems. Hint: reclaiming
> > the VFS inode cache doesn't free any memory on an XFS filesystem -
> > it's the XFS inode cache shrinker that is integrated into the per-sb
> > shrinker infrastructure that frees all the memory. It doesn't work
> > without the per-sb shrinker functionality and it's an extremely
> > performance critical balancing act. Hence any changes to this
> > shrinker infrastructure need a lot of consideration and testing,
> > most especially to ensure that the balance of the system has not
> > been disturbed.
> >
>
> I was actually wondering where the balance would stand between hooking
> this into the current shrinking mechanism, and having something totally
> separate for memcg. It is tempting to believe that we could get away
> with something that works well for memcg-only, but this already proved
> to be not true for the user pages lru list...
Learn from past mistakes. ;P
> > Also how do yo propose to solve the problem of inodes and dentries
> > shared across multiple memcgs? They can only be tracked in one LRU,
> > but the caches are global and are globally accessed.
>
> I think the proposal is to not solve this problem. Because at first it
> sounds a bit weird, let me explain myself:
>
> 1) Not all processes in the system will sit on a memcg.
> Technically they will, but the root cgroup is never accounted, so a big
> part of the workload can be considered "global" and will have no
> attached memcg information whatsoever.
>
> 2) Not all child memcgs will have associated vfs objects, or kernel
> objects at all, for that matter. This happens only when specifically
> requested by the user.
>
> Due to that, I believe that although sharing is obviously a reality
> within the VFS, but the workloads associated to this will tend to be
> fairly local.
I have my doubts about that - I've heard it said many times but no
data has been provided to prove the assertion....
> When sharing does happen, we currently account to the
> first process to ever touch the object. This is also how memcg treats
> shared memory users for userspace pages and it is working well so far.
> It doesn't *always* give you good behavior, but I guess those fall in
> the list of "workloads memcg is not good for".
And that list contains?
> Do we want to extend this list of use cases? Sure. There is also
> discussion going on about how to improve this in the future. That would
> allow a policy to specify which memcg is to be "responsible" for the
> shared objects, be them kernel memory or shared memory regions. Even
> then, we'll always have one of the two scenarios:
>
> 1) There is a memcg that is responsible for accounting that object, and
> then is clear we should reclaim from that memcg.
>
> 2) There is no memcg associated with the object, and then we should not
> bother with that object at all.
>
> I fully understand your concern, specifically because we talked about
> that in details in the past. But I believe most of the cases that would
> justify it would fall in 2).
Which then leads to this: the no-memcg object case needs to scale.
> Another thing to keep in mind is that we don't actually track objects.
> We track pages, and try to make sure that objects in the same page
> belong to the same memcg. (That could be important for your analysis or
> not...)
Hmmmm. So you're basically using the characteristics of internal
slab fragmentation to keep objects allocated to different memcg's
apart? That's .... devious. :)
> > Having mem
> > pressure in a single memcg that causes globally accessed dentries
> > and inodes to be tossed from memory will simply cause cache
> > thrashing and performance across the system will tank.
> >
>
> As said above. I don't consider global accessed dentries to be
> representative of the current use cases for memcg.
But they have to co-exist, and I think that's our big problem. If
you have a workload in a memcg, and the underlying directory
structure is exported via NFS or CIFS, then there is still global
access to that "memcg local" dentry structure.
> >>> The patch now is only handling dentry cache by given the nature dentry pinned
> >>> inode. Based on the data we've collected, that contributes the main factor of
> >>> the reclaimable slab objects. We also could make a generic infrastructure for
> >>> all the shrinkers (if needed).
> >>
> >> Dave Chinner has some prototype code for that.
> >
> > The patchset I have makes the dcache lru locks per-sb as the first
> > step to introducing generic per-sb LRU lists, and then builds on
> > that to provide generic kernel-wide LRU lists with integrated
> > shrinkers, and builds on that to introduce node-awareness (i.e. NUMA
> > scalability) into the LRU list so everyone gets scalable shrinkers.
> >
>
> If you are building a generic infrastructure for shrinkers, what is the
> big point about per-sb? I'll give you that most of the memory will come
> from the VFS, but other objects are shrinkable too, that bears no
> relationship with the vfs.
Without any more information, it's hard to understand what I'm
doing. The shrinker itself cannot lock or determine if an object is
reclaimable - that involves reference counts, status flags, whether
it is currently being freed, etc - so the generic shrinker has to
use callbacks for the objects to be freed. A generic shrinker looks
like this:
struct shrinker_lru_node {
	spinlock_t		lock;
	long			lru_items;
	struct list_head	lru;
} ____cacheline_aligned_in_smp;

struct shrinker_lru {
	struct shrinker_lru_node	node[MAX_NUMNODES];
	struct shrinker_lru_node	expedited_reclaim;
	nodemask_t			active_nodes;
	int (*isolate_item)(struct list_head *, spinlock_t *, struct list_head *);
	int (*dispose)(struct list_head *);
};
and when you want to shrink the LRU you call shrinker_lru_shrink().
This walks the items on the LRU and calls the .isolate_item method
for the subsystem to try to remove the item from the LRU under the
LRU lock passed to the callback. The callback can drop the LRU lock
to free the item, or it can move it to the supplied dispose list
without dropping the lock. If the LRU lock is dropped, the scan has
to restart from the start of the list, so there are some interesting
issues with return values here.
Once all the items are scanned and moved to the dispose list, the
.dispose method is called to free all the items on the list.
This is basically a generic encoding of the methods used by both the
inode and dentry caches for optimised, low overhead, large-scale
batch freeing of objects. The overall structure of the caches,
locking and LRU management is completely unchanged. It's just that
all the LRU code is generic....
IOWs, the context surrounding the shrinker and LRU doesn't change.
The LRU is still embedded in some structure somewhere, and it still
has the same relationships to other caches as it had before. e.g.
setting up the superblock shrinker:
	s->s_shrink.seeks = DEFAULT_SEEKS;
	s->s_shrink.shrink = prune_super;
	s->s_shrink.count_objects = prune_super_count;
	s->s_shrink.batch = 1024;

	shrinker_lru_init(&s->s_inode_lru);
	shrinker_lru_init(&s->s_dentry_lru);
	s->s_dentry_lru.isolate_item = dentry_isolate_one;
	s->s_dentry_lru.dispose = dispose_dentries;
	s->s_inode_lru.isolate_item = inode_isolate_one;
	s->s_inode_lru.dispose = dispose_inodes;
shows that we now also have to set up the LRUs appropriately for
the different objects that each LRU contains.
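For illustration, the two dentry callbacks assigned above might look roughly
like this (a simplified sketch, not my actual code; trylock failure handling
and scan restarts are elided):

/*
 * Sketch of the per-object callbacks for the dentry LRU.
 */
static int dentry_isolate_one(struct list_head *item, spinlock_t *lru_lock,
			      struct list_head *dispose)
{
	struct dentry *dentry = container_of(item, struct dentry, d_lru);

	if (!spin_trylock(&dentry->d_lock))
		return -EAGAIN;		/* tell the walker to skip/retry */

	if (dentry->d_count) {
		/* still referenced: leave it on the LRU */
		spin_unlock(&dentry->d_lock);
		return 0;
	}

	/* unused: move it to the dispose list without dropping lru_lock */
	list_move(item, dispose);
	spin_unlock(&dentry->d_lock);
	return 1;
}

static int dispose_dentries(struct list_head *dispose)
{
	shrink_dentry_list(dispose);	/* batch-free everything isolated */
	return 0;
}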
Then there is LRU object counting, as called by shrink_slab()
instead of the nasty nr_to_scan = 0 hack:
static long
prune_super_count(
	struct shrinker		*shrink)
{
	struct super_block	*sb;
	long			total_objects = 0;

	sb = container_of(shrink, struct super_block, s_shrink);
	if (!grab_super_passive(sb))
		return -1;

	if (sb->s_op && sb->s_op->nr_cached_objects)
		total_objects = sb->s_op->nr_cached_objects(sb);

	total_objects += shrinker_lru_count(&sb->s_dentry_lru);
	total_objects += shrinker_lru_count(&sb->s_inode_lru);

	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
	drop_super(sb);
	return total_objects;
}
And the guts of prune_super(), which is later called by shrink_slab() once it
has worked out how much to scan:
	dentries = shrinker_lru_count(&sb->s_dentry_lru);
	inodes = shrinker_lru_count(&sb->s_inode_lru);
	if (sb->s_op && sb->s_op->nr_cached_objects)
		fs_objects = sb->s_op->nr_cached_objects(sb);
	total_objects = dentries + inodes + fs_objects + 1;

	/* proportion the scan between the caches */
	dentries = (sc->nr_to_scan * dentries) / total_objects;
	inodes = (sc->nr_to_scan * inodes) / total_objects;
	if (fs_objects)
		fs_objects = (sc->nr_to_scan * fs_objects) / total_objects;

	/*
	 * prune the dcache first as the icache is pinned by it, then
	 * prune the icache, followed by the filesystem specific caches
	 */
	sc->nr_to_scan = dentries;
	nr = shrinker_lru_shrink(&sb->s_dentry_lru, sc);
	sc->nr_to_scan = inodes;
	nr += shrinker_lru_shrink(&sb->s_inode_lru, sc);

	if (fs_objects && sb->s_op->free_cached_objects) {
		sc->nr_to_scan = fs_objects;
		nr += sb->s_op->free_cached_objects(sb, sc);
	}

	drop_super(sb);
	return nr;
}
IOWs, it's still the same count/scan shrinker interface, just with
all the LRU and shrinker bits abstracted and implemented in common
code. The generic LRU abstraction is that it only knows about the
list-head in the structure that is passed to it, and it passes that
listhead to the per-object callbacks for the subsystem to do the
specific work that is needed. The LRU/shrinker doesn't need to know
anything about the object being added or freed, so it only needs to
concern itself with lists of opaque objects. Hence we can change the
LRU internals without having to change any of the per-subsystem
code.
But it does not alter the fact that the subsystem is ultimately
responsible for how the shrinker is controlled and interoperates
with other subsystems...
> > I've looked at memcg awareness in the past, but the problem is the
> > overhead - the explosion of LRUs because of the per-sb X per-node X
> > per-memcg object tracking matrix. It's a huge amount of overhead
> > and complexity, and unless there's a way of efficiently tracking
> > objects both per-node and per-memcg simulatneously then I'm of the
> > opinion that memcg awareness is simply too much trouble, complexity
> > and overhead to bother with.
> >
> > So, convince me you can solve the various problems. ;)
>
> I believe we are open minded regarding a solution for that, and your
> input is obviously top. So let me take a step back and restate the problem:
>
> 1) Some memcgs, not all, will have memory pressure regardless of the
> memory pressure in the rest of the system
> 2) that memory pressure may or may not involve kernel objects.
> 3) if kernel objects are involved, we can assume the level of sharing is
> low.
I don't think you can make this assumption. You could simply have a
circle-thrash of a shared object where the first memcg reads it,
caches it, then reclaims it, then the second does the same thing,
then the third, and so on around the circle....
> 4) We then need to shrink memory from that memcg, affecting the
> others the least we can.
>
> Do you have any proposals for that, in any shape?
>
> One thing that crossed my mind, was instead of having per-sb x
> per-node objects, we could have per-"group" x per-node objects.
> The group would then be either a memcg or a sb.
Perhaps we've all been looking at the problem the wrong way.
As I was writing this, it came to me that the problem is not that
"the object is owned either per-sb or per-memcg". The issue is how
to track the objects in a given context. The overall LRU manipulations
and ownership of the object is identical in both the global and
memcg cases - it's the LRU that the object is placed on that
matters! With a generic LRU+shrinker implementation, this detail is
completely confined to the internals of the LRU+shrinker subsystem.
IOWs, if you are tagging the object with memcg info at a slab page
level, the LRU and shrinker need to operate at the same level, not
at the per-object level. The LRU implementation I have currently
selects the internal LRU list according to the node the object was
allocated on. i.e. by looking at the page:
int
shrinker_lru_add(
	struct shrinker_lru	*lru,
	struct list_head	*item)
{
>>>>>	int node_id = page_to_nid(virt_to_page(item));	  <<<<<<<<<
	struct shrinker_lru_node *nlru = &lru->node[node_id];

	spin_lock(&nlru->lock);
	lru_list_check(lru, nlru, node_id);
	if (list_empty(item)) {
		list_add(item, &nlru->lru);
		if (nlru->lru_items++ == 0)
			node_set(node_id, lru->active_nodes);
		BUG_ON(nlru->lru_items < 1);
		spin_unlock(&nlru->lock);
		pr_info("shrinker_lru_add: item %p, node %d %p, items 0x%lx",
			item, node_id, nlru, nlru->lru_items);
		return 1;
	}
	lru_list_check(lru, nlru, node_id);
	spin_unlock(&nlru->lock);
	return 0;
}
EXPORT_SYMBOL(shrinker_lru_add);
There is no reason why we couldn't determine if an object was being
tracked by a memcg in the same way. Do we have a page_to_memcg()
function? If we've got that, then all we need to add to the struct
shrinker_lru is a method of dynamically adding and looking up the
memcg to get the appropriate struct shrinker_lru_node from the
memcg. The memcg would need a struct shrinker_lru_node per generic
LRU in use, and this probably means we need to uniquely identify
each struct shrinker_lru instance so the memcg code can keep a
dynamic list of them.
With that, we could track objects per memcg or globally on the
per-node lists. If we then add a memcg id to the struct
scan_control, the shrinker can then walk the related memcg LRU
rather than the per-node LRU. That means that general per-node
reclaim won't find memcg related objects, and memcg related reclaim
won't scan the global per-node objects. This could be changed as
needed, though.
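For example, the list selection could become something like this sketch
(page_to_memcg() and the memcg-side lookup are the assumed pieces discussed
above, not existing interfaces):

/*
 * Sketch only: pick the LRU list for an object either from its memcg or
 * from the node it was allocated on.  page_to_memcg() and
 * memcg_shrinker_lru_node() are assumed helpers.
 */
static struct shrinker_lru_node *
shrinker_lru_node_for(struct shrinker_lru *lru, struct list_head *item)
{
	struct page *page = virt_to_page(item);
	struct mem_cgroup *memcg = page_to_memcg(page);

	if (memcg)
		/* per-memcg list, looked up by this lru's unique id */
		return memcg_shrinker_lru_node(memcg, lru);

	/* untracked (global) objects stay on the per-node lists */
	return &lru->node[page_to_nid(page)];
}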
What it does, though, is preserve the correct balance of related
caches in the memcg because they use exactly the same subsystem code
that defines the relationship for the global cache. It also
preserves the scalability of the non-memcg based processes, and
allows us to tune the global vs memcg LRU reclaim algorithm in a
single place.
That, to me, sounds almost ideal - memcg tracking and reclaim works
with very little added complexity, it has no additional memory
overhead, and scalability is not compromised. What have I missed? :P
Cheers,
Dave.
--
Dave Chinner
dchinner@redhat.com
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-17 5:42 ` Glauber Costa
@ 2012-08-17 7:56 ` Dave Chinner
0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2012-08-17 7:56 UTC (permalink / raw)
To: Glauber Costa
Cc: Ying Han, Rik van Riel, Michal Hocko, Johannes Weiner, Mel Gorman,
KAMEZAWA Hiroyuki, Greg Thelen, Christoph Lameter,
KOSAKI Motohiro, linux-mm
On Fri, Aug 17, 2012 at 09:42:15AM +0400, Glauber Costa wrote:
> On 08/17/2012 09:40 AM, Ying Han wrote:
> >> > 2) There is no memcg associated with the object, and then we should not
> >> > bother with that object at all.
> > In the patch I have, all objects are associated with *a* memcg. For
> > those objects are charged to root or reparented to root,
> > they do get associated with root and further memory pressure on root (
> > global reclaim ) will be applied on those objects.
> >
> For the practical purposes of what Dave is concerned about, "no memcg"
> equals "root memcg", right? It still holds we would expect globally
> accessed dentries to belong to root/no-memcg, and per-group pressure
> would not get to them.
Exactly.
Cheers,
Dave.
--
Dave Chinner
dchinner@redhat.com
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-17 7:54 ` Dave Chinner
@ 2012-08-17 10:00 ` Glauber Costa
2012-08-17 19:18 ` Ying Han
2012-08-17 14:44 ` Rik van Riel
1 sibling, 1 reply; 12+ messages in thread
From: Glauber Costa @ 2012-08-17 10:00 UTC (permalink / raw)
To: Dave Chinner
Cc: Rik van Riel, Ying Han, Michal Hocko, Johannes Weiner, Mel Gorman,
KAMEZAWA Hiroyuki, Greg Thelen, Christoph Lameter,
KOSAKI Motohiro, linux-mm
[-- Attachment #1: Type: text/plain, Size: 12349 bytes --]
>>> Also how do yo propose to solve the problem of inodes and dentries
>>> shared across multiple memcgs? They can only be tracked in one LRU,
>>> but the caches are global and are globally accessed.
>>
>> I think the proposal is to not solve this problem. Because at first it
>> sounds a bit weird, let me explain myself:
>>
>> 1) Not all processes in the system will sit on a memcg.
>> Technically they will, but the root cgroup is never accounted, so a big
>> part of the workload can be considered "global" and will have no
>> attached memcg information whatsoever.
>>
>> 2) Not all child memcgs will have associated vfs objects, or kernel
>> objects at all, for that matter. This happens only when specifically
>> requested by the user.
>>
>> Due to that, I believe that although sharing is obviously a reality
>> within the VFS, but the workloads associated to this will tend to be
>> fairly local.
>
> I have my doubts about that - I've heard it said many times but no
> data has been provided to prove the assertion....
>
First let me stress this point again, which is never enough: Sitting on
a memcg will not automatically start tracking kmem for you. So it is
possible to have the craziest possible memcg use case, and still have
kmem not tracked. Since kmem is naturally a shared resource, one can
only expect that people turning this on will be aware of that.
It is of course hard for me to argue, or come up with data, about that
for every possible use of memcg. If you have anything specific in mind,
I can get a test running. But I can talk about the things we do.
Since we run a full-userspace container, our main use case is to mount
a directory structure somewhere - so this already gets the whole chain
cached, then we chroot to it, and move the tasks to the cgroup. There
seems to be nothing unreasonable about assuming that the vast majority
of the dentries accessed from that point on will tend to be fairly local.
Even if you have a remote NFS mount that is only yours, that is also
fine, because you get accounted as you touch the dentries. The
complications arise when more than one group is accessing it. But then,
as I said, there is WIP to predictably determine which group you will
end up in, and at that point, it becomes policy.
Maybe Ying Han can tell us more about what they are doing to add to the
pool, but I can only assume that if sharing was a deal-breaker for them,
they would not be pursuing this path at all.
>> When sharing does happen, we currently account to the
>> first process to ever touch the object. This is also how memcg treats
>> shared memory users for userspace pages and it is working well so far.
>> It doesn't *always* give you good behavior, but I guess those fall in
>> the list of "workloads memcg is not good for".
>
> And that list contains?
I would say anything that is, first, not logically groupable (the general case
for cgroups), and for which it is hard to come up with an immediate upper limit
on memory.
When kernel memory comes into play, you have to consider that you are
accounting a shared resource, so you should have reasonable expectations
to not be sharing those resources with everybody.
We are also never implying that no sharing will happen, just that we
expect it to happen at a low rate.
>
>> Do we want to extend this list of use cases? Sure. There is also
>> discussion going on about how to improve this in the future. That would
>> allow a policy to specify which memcg is to be "responsible" for the
>> shared objects, be them kernel memory or shared memory regions. Even
>> then, we'll always have one of the two scenarios:
>>
>> 1) There is a memcg that is responsible for accounting that object, and
>> then is clear we should reclaim from that memcg.
>>
>> 2) There is no memcg associated with the object, and then we should not
>> bother with that object at all.
>>
>> I fully understand your concern, specifically because we talked about
>> that in details in the past. But I believe most of the cases that would
>> justify it would fall in 2).
>
> Which then leads to this: the no-memcg object case needs to scale.
>
Yes, and I trust you to do it! =)
>> Another thing to keep in mind is that we don't actually track objects.
>> We track pages, and try to make sure that objects in the same page
>> belong to the same memcg. (That could be important for your analysis or
>> not...)
>
> Hmmmm. So you're basically using the characteristics of internal
> slab fragmentation to keep objects allocated to different memcg's
> apart? That's .... devious. :)
My first approach was to get at the pages themselves, move pages around after
cgroup destruction, etc. But that touched the slab allocators heavily, and
since we have 3 of them, and writing more seems to be a joyful hobby some
people pursue, I am now actually duplicating the metadata and creating
per-cgroup caches. It is about the same, but with a bit more wasted space,
which we're happy to pay as the price for added simplicity. Instead of
directing the allocation to a page, which requires knowledge of how the various
slabs operate, we direct allocations to a different cache altogether, which
hides that detail.
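Schematically, the redirection looks like this (helper names are purely
illustrative, not the actual patch interfaces):

/*
 * Sketch: allocations from a task in an accounted memcg are redirected
 * to a lazily created per-memcg copy of the cache.  current_memcg() and
 * memcg_cache_dup() are illustrative names only.
 */
void *memcg_kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
	struct mem_cgroup *memcg = current_memcg();

	if (memcg) {
		struct kmem_cache *mc = memcg_cache_dup(memcg, cachep);

		if (mc)
			cachep = mc;	/* charged to memcg as it grows */
	}
	return kmem_cache_alloc(cachep, flags);
}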
>
>>> Having mem
>>> pressure in a single memcg that causes globally accessed dentries
>>> and inodes to be tossed from memory will simply cause cache
>>> thrashing and performance across the system will tank.
>>>
>>
>> As said above. I don't consider global accessed dentries to be
>> representative of the current use cases for memcg.
>
> But they have to co-exist, and I think that's our big problem. If
> you have a workload in a memcg, and the underlying directory
> structure is exported via NFS or CIFS, then there is still global
> access to that "memcg local" dentry structure.
>
I am all for co-existing.
>>>>> The patch now is only handling dentry cache by given the nature dentry pinned
>>>>> inode. Based on the data we've collected, that contributes the main factor of
>>>>> the reclaimable slab objects. We also could make a generic infrastructure for
>>>>> all the shrinkers (if needed).
>>>>
>>>> Dave Chinner has some prototype code for that.
>>>
>>> The patchset I have makes the dcache lru locks per-sb as the first
>>> step to introducing generic per-sb LRU lists, and then builds on
>>> that to provide generic kernel-wide LRU lists with integrated
>>> shrinkers, and builds on that to introduce node-awareness (i.e. NUMA
>>> scalability) into the LRU list so everyone gets scalable shrinkers.
>>>
>>
>> If you are building a generic infrastructure for shrinkers, what is the
>> big point about per-sb? I'll give you that most of the memory will come
>> from the VFS, but other objects are shrinkable too, that bears no
>> relationship with the vfs.
>
> Without any more information, it's hard to understand what I'm
> doing.
With more information it's hard as well. =p
>
>>> I've looked at memcg awareness in the past, but the problem is the
>>> overhead - the explosion of LRUs because of the per-sb X per-node X
>>> per-memcg object tracking matrix. It's a huge amount of overhead
>>> and complexity, and unless there's a way of efficiently tracking
>>> objects both per-node and per-memcg simulatneously then I'm of the
>>> opinion that memcg awareness is simply too much trouble, complexity
>>> and overhead to bother with.
>>>
>>> So, convince me you can solve the various problems. ;)
>>
>> I believe we are open minded regarding a solution for that, and your
>> input is obviously top. So let me take a step back and restate the problem:
>>
>> 1) Some memcgs, not all, will have memory pressure regardless of the
>> memory pressure in the rest of the system
>> 2) that memory pressure may or may not involve kernel objects.
>> 3) if kernel objects are involved, we can assume the level of sharing is
>> low.
>
> I don't think you can make this assumption. You could simply have a
> circle-thrash of a shared object where the first memcg reads it,
> caches it, then reclaims it, then the second does the same thing,
> then the third, and so on around the circle....
>
I think the best we can do here is a trade-off. We won't make shared resources
become exclusive. What I tried to do, by requiring this to be explicitly turned
on, is confine its usage to scenarios where this assumption will mostly hold.
"Mostly hold" basically means we will still have problems like the one you
describe, but we expect them to be rare enough not to bother us. It is just
like the case for processor caches...
>> 4) We then need to shrink memory from that memcg, affecting the
>> others the least we can.
>>
>> Do you have any proposals for that, in any shape?
>>
>> One thing that crossed my mind, was instead of having per-sb x
>> per-node objects, we could have per-"group" x per-node objects.
>> The group would then be either a memcg or a sb.
>
> Perhaps we've all been looking at the problem the wrong way.
>
> As I was writing this, it came to me that the problem is not that
> "the object is owned either per-sb or per-memcg". The issue is how
> to track the objects in a given context.
Which is basically just a generalization of "per-group" as I said. The
group can be anything, and I believe your generic LRU interface makes it
quite clear.
> The overall LRU manipulations
> and ownership of the object is identical in both the global and
> memcg cases - it's the LRU that the object is placed on that
> matters! With a generic LRU+shrinker implementation, this detail is
> completely confined to the internals of the LRU+shrinker subsystem.
>
Indeed.
> IOWs, if you are tagging the object with memcg info at a slab page
> level, the LRU and shrinker need to operate at the same level, not
> at the per-object level.
Not sure I fully parse. You do still track objects in the LRU, right?
> The LRU implementation I have currently
> selects the internal LRU list according to the node the object was
> allocated on. i.e. by looking at the page:
>
[...]
> There is no reason why we couldn't determine if an object was being
> tracked by a memcg in the same way. Do we have a page_to_memcg()
> function?
Yes. We don't currently have one, because we have no users.
But that can easily be provided, and Ying Han's patches actually
provide one.
> If we've got that, then all we need to add to the struct
> shrinker_lru is a method of dynamically adding and looking up the
> memcg to get the appropriate struct shrinker_lru_node from the
> memcg. The memcg would need a struct shrinker_lru_node per generic
> LRU in use, and this probably means we need to uniquely identify
> each struct shrinker_lru instance so the memcg code can kept a
> dynamic list of them.
Sounds about right.
>
> With that, we could track objects per memcg or globally on the
> per-node lists.
For userspace pages, we do per-memcg-per-zone. Can't we do the same
here? Note that simple "per-memcg" is already a big thing for us, but
you're not alone in your fight for scalability.
> If we then add a memcg id to the struct
> scan_control, the shrinker can then walk the related memcg LRU
> rather than the per-node LRU. That means that general per-node
> reclaim won't find memcg related objects, and memcg related reclaim
> won't scan the global per-node objects. This could be changed as
> needed, though.
>
> What it does, though, is preserve the correct balance of related
> caches in the memcg because they use exactly the same subsystem code
> that defines the relationship for the global cache. It also
> preserves the scalabilty of the non-memcg based processes, and
> allows us to tune the global vs memcg LRU reclaim algorithm in a
> single place.
>
> That, to me, sounds almost ideal - memcg tracking and reclaim works
> with very little added complexity, it has no additional memory
> overhead, and scalability is not compromised. What have I missed? :P
>
So before I got to your e-mail, I actually coded a prototype for it.
Due to the lack of funding for my crystal balls department, I didn't
make use of your generic LRU, so a lot of it is still hard coded in
place. I am attaching the ugly patch here so you can take a look.
Note that I am basically using the same prune_dcache function, just with
a different list. I am also not bothering with inodes, etc.
Take a look. How do you think this would integrate with your idea?
[-- Attachment #2: example.patch --]
[-- Type: text/x-patch, Size: 9079 bytes --]
diff --git a/fs/dcache.c b/fs/dcache.c
index 4046904..c8d6f08 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -308,10 +308,17 @@ static void dentry_unlink_inode(struct dentry * dentry)
*/
static void dentry_lru_add(struct dentry *dentry)
{
+
if (list_empty(&dentry->d_lru)) {
+ struct mem_cgroup *memcg;
+ memcg = memcg_from_object(dentry);
spin_lock(&dcache_lru_lock);
- list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
- dentry->d_sb->s_nr_dentry_unused++;
+ if (!memcg) {
+ list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
+ dentry->d_sb->s_nr_dentry_unused++;
+ } else {
+ memcg_add_dentry_lru(memcg, dentry);
+ }
dentry_stat.nr_unused++;
spin_unlock(&dcache_lru_lock);
}
@@ -319,9 +326,18 @@ static void dentry_lru_add(struct dentry *dentry)
static void __dentry_lru_del(struct dentry *dentry)
{
+
+ struct mem_cgroup *memcg;
+ memcg = memcg_from_object(dentry);
+
list_del_init(&dentry->d_lru);
dentry->d_flags &= ~DCACHE_SHRINK_LIST;
- dentry->d_sb->s_nr_dentry_unused--;
+
+ if (!memcg)
+ dentry->d_sb->s_nr_dentry_unused--;
+ else
+ memcg_del_dentry_lru(memcg, dentry);
+
dentry_stat.nr_unused--;
}
@@ -847,19 +863,7 @@ static void shrink_dentry_list(struct list_head *list)
rcu_read_unlock();
}
-/**
- * prune_dcache_sb - shrink the dcache
- * @sb: superblock
- * @count: number of entries to try to free
- *
- * Attempt to shrink the superblock dcache LRU by @count entries. This is
- * done when we need more memory an called from the superblock shrinker
- * function.
- *
- * This function may fail to free any resources if all the dentries are in
- * use.
- */
-void prune_dcache_sb(struct super_block *sb, int count)
+void prune_dcache_list(struct list_head *dentry_list, int count)
{
struct dentry *dentry;
LIST_HEAD(referenced);
@@ -867,10 +871,9 @@ void prune_dcache_sb(struct super_block *sb, int count)
relock:
spin_lock(&dcache_lru_lock);
- while (!list_empty(&sb->s_dentry_lru)) {
- dentry = list_entry(sb->s_dentry_lru.prev,
+ while (!list_empty(dentry_list)) {
+ dentry = list_entry(dentry_list->prev,
struct dentry, d_lru);
- BUG_ON(dentry->d_sb != sb);
if (!spin_trylock(&dentry->d_lock)) {
spin_unlock(&dcache_lru_lock);
@@ -892,18 +895,37 @@ relock:
cond_resched_lock(&dcache_lru_lock);
}
if (!list_empty(&referenced))
- list_splice(&referenced, &sb->s_dentry_lru);
+ list_splice(&referenced, dentry_list);
spin_unlock(&dcache_lru_lock);
shrink_dentry_list(&tmp);
}
/**
+ * prune_dcache_sb - shrink the dcache
+ * @sb: superblock
+ * @count: number of entries to try to free
+ *
+ * Attempt to shrink the superblock dcache LRU by @count entries. This is
+ * done when we need more memory an called from the superblock shrinker
+ * function.
+ *
+ * This function may fail to free any resources if all the dentries are in
+ * use.
+ */
+void prune_dcache_sb(struct super_block *sb, int count)
+{
+ prune_dcache_list(&sb->s_dentry_lru, count);
+}
+
+/**
* shrink_dcache_sb - shrink dcache for a superblock
* @sb: superblock
*
* Shrink the dcache for the specified super block. This is used to free
* the dcache before unmounting a file system.
+ *
+ * FIXME: This may be a problem if the lists are separate, because we need to get to all sb objects
*/
void shrink_dcache_sb(struct super_block *sb)
{
diff --git a/fs/super.c b/fs/super.c
index 5af6817..0180cc0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -52,6 +52,9 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
int fs_objects = 0;
int total_objects;
+ if (sc->memcg)
+ return -1;
+
sb = container_of(shrink, struct super_block, s_shrink);
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6a1f97f..d4d3eb9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1555,6 +1555,8 @@ struct super_block {
/* superblock cache pruning functions */
extern void prune_icache_sb(struct super_block *sb, int nr_to_scan);
extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
+extern void prune_dcache_list(struct list_head *dentry_list, int nr_to_scan);
+extern void prune_icache_list(struct list_head *inode_list, int nr_to_scan);
extern struct timespec current_fs_time(struct super_block *sb);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a3e462a..90b587d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -567,5 +567,7 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
return __memcg_kmem_get_cache(cachep, gfp);
}
+extern void memcg_add_dentry_lru(struct mem_cgroup *memcg, struct dentry *dentry);
+extern void memcg_del_dentry_lru(struct mem_cgroup *memcg, struct dentry *dentry);
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index ac6b8ee..a829570 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -10,6 +10,7 @@ struct shrink_control {
/* How many slab objects shrinker() should scan and try to reclaim */
unsigned long nr_to_scan;
+ struct mem_cgroup *memcg;
};
/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 765e12c..0d833fe 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -351,6 +351,18 @@ extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
#ifdef CONFIG_MEMCG_KMEM
#define MAX_KMEM_CACHE_TYPES 400
+extern struct kmem_cache *virt_to_cache(const void *x);
+
+static inline struct mem_cgroup *memcg_from_object(const void *x)
+{
+ struct kmem_cache *s = virt_to_cache(x);
+ return s->memcg_params.memcg;
+}
+#else
+static inline struct mem_cgroup *memcg_from_object(const void *x)
+{
+ return NULL;
+}
#endif
#ifdef CONFIG_NUMA
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 26834d1..4dac864 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -347,6 +347,9 @@ struct mem_cgroup {
#ifdef CONFIG_MEMCG_KMEM
/* Slab accounting */
struct kmem_cache *slabs[MAX_KMEM_CACHE_TYPES];
+ unsigned long nr_dentry_unused;
+ struct list_head dentry_lru_list;
+ struct shrinker vfs_shrink;
#endif
};
@@ -413,6 +416,50 @@ int memcg_css_id(struct mem_cgroup *memcg)
{
return css_id(&memcg->css);
}
+
+void memcg_add_dentry_lru(struct mem_cgroup *memcg, struct dentry *dentry)
+{
+ list_add(&dentry->d_lru, &memcg->dentry_lru_list);
+ memcg->nr_dentry_unused++;
+}
+
+void memcg_del_dentry_lru(struct mem_cgroup *memcg, struct dentry *dentry)
+{
+ memcg->nr_dentry_unused--;
+}
+
+static int vfs_shrink(struct shrinker *shrink, struct shrink_control *sc)
+{
+ struct mem_cgroup *memcg;
+
+ memcg = container_of(shrink, struct mem_cgroup, vfs_shrink);
+ if (memcg != sc->memcg)
+ return -1;
+
+ printk("Called vfs_shrink, memcg %p, nr_to_scan %lu\n", memcg, sc->nr_to_scan);
+ printk("Unused dentries: %lu\n", memcg->nr_dentry_unused);
+
+ if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS)) {
+ printk("out\n");
+ return -1;
+ }
+
+ if (sc->nr_to_scan) {
+ prune_dcache_list(&memcg->dentry_lru_list, sc->nr_to_scan);
+ printk("Remaining Unused dentries: %lu\n", memcg->nr_dentry_unused);
+ }
+ return memcg->nr_dentry_unused;
+}
+#else
+void memcg_add_dentry_lru(struct mem_cgroup *memcg, struct dentry *dentry)
+{
+ BUG();
+}
+
+void memcg_del_dentry_lru(struct mem_cgroup *memcg, struct dentry *dentry)
+{
+ BUG();
+}
#endif /* CONFIG_MEMCG_KMEM */
/* Stuffs for move charges at task migration. */
@@ -4631,6 +4678,14 @@ static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
mutex_lock(&set_limit_mutex);
if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
+ INIT_LIST_HEAD(&memcg->dentry_lru_list);
+ memcg->vfs_shrink.seeks = DEFAULT_SEEKS;
+ memcg->vfs_shrink.shrink = vfs_shrink;
+ memcg->vfs_shrink.batch = 1024;
+
+ register_shrinker(&memcg->vfs_shrink);
+
+
/*
* Once enabled, can't be disabled. We could in theory disable
* it if we haven't yet created any caches, or if we can shrink
@@ -5605,6 +5660,7 @@ static void free_work(struct work_struct *work)
* the cgroup_lock.
*/
disarm_static_keys(memcg);
+ unregister_shrinker(&memcg->vfs_shrink);
if (size < PAGE_SIZE)
kfree(memcg);
else
diff --git a/mm/slab.c b/mm/slab.c
index e4de1fa..e736e01 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -522,7 +522,7 @@ static inline struct kmem_cache *page_get_cache(struct page *page)
return page->slab_cache;
}
-static inline struct kmem_cache *virt_to_cache(const void *obj)
+struct kmem_cache *virt_to_cache(const void *obj)
{
struct page *page = virt_to_head_page(obj);
return page->slab_cache;
diff --git a/mm/slub.c b/mm/slub.c
index 4e1f470..33c9a6d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2623,6 +2623,12 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
}
EXPORT_SYMBOL(kmem_cache_free);
+struct kmem_cache *virt_to_cache(const void *obj)
+{
+ struct page *page = virt_to_head_page(obj);
+ return page->slab;
+}
+
/*
* Object placement in a slab is made very easy because we always start at
* offset 0. If we tune the size of the object to the alignment then we can
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-17 7:54 ` Dave Chinner
2012-08-17 10:00 ` Glauber Costa
@ 2012-08-17 14:44 ` Rik van Riel
1 sibling, 0 replies; 12+ messages in thread
From: Rik van Riel @ 2012-08-17 14:44 UTC (permalink / raw)
To: Dave Chinner
Cc: Glauber Costa, Ying Han, Michal Hocko, Johannes Weiner,
Mel Gorman, KAMEZAWA Hiroyuki, Greg Thelen, Christoph Lameter,
KOSAKI Motohiro, linux-mm
On 08/17/2012 03:54 AM, Dave Chinner wrote:
> IOWs, it's still the same count/scan shrinker interface, just with
> all the LRU and shrinker bits abstracted and implemented in common
> code. The generic LRU abstraction is that it only knows about the
> list-head in the structure that is passed to it, and it passes that
> listhead to the per-object callbacks for the subsystem to do the
> specific work that is needed.
This will make it very easy to iterate over the slab object
LRUs in my "reclaim from the highest score LRU" patch set.
That in turn will allow us to properly balance pressure between
cgroup and non-cgroup object LRUs, between the LRUs of various
superblocks, etc...
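A toy sketch of the kind of selection loop that balancing implies; the
scoring formula below is invented purely for illustration and is not taken
from that patch set:

struct scored_lru {
	struct list_head	link;		/* on the global list of LRUs */
	unsigned long		nr_unused;	/* freeable objects on this LRU */
	unsigned int		seeks;		/* recreate cost, as in struct shrinker */
	void			(*shrink)(struct scored_lru *lru, unsigned long nr);
};

/* Shrink whichever LRU currently looks cheapest to reclaim from. */
static void shrink_highest_score(struct list_head *lrus, unsigned long nr_to_scan)
{
	struct scored_lru *lru, *victim = NULL;
	unsigned long best = 0;

	list_for_each_entry(lru, lrus, link) {
		unsigned long score = lru->nr_unused / (lru->seeks ?: 1);

		if (score > best) {
			best = score;
			victim = lru;
		}
	}
	if (victim)
		victim->shrink(victim, nr_to_scan);
}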
--
All rights reversed
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-17 10:00 ` Glauber Costa
@ 2012-08-17 19:18 ` Ying Han
0 siblings, 0 replies; 12+ messages in thread
From: Ying Han @ 2012-08-17 19:18 UTC (permalink / raw)
To: Glauber Costa
Cc: Dave Chinner, Rik van Riel, Michal Hocko, Johannes Weiner,
Mel Gorman, KAMEZAWA Hiroyuki, Greg Thelen, Christoph Lameter,
KOSAKI Motohiro, linux-mm
On Fri, Aug 17, 2012 at 3:00 AM, Glauber Costa <glommer@parallels.com> wrote:
>
>>>> Also how do you propose to solve the problem of inodes and dentries
>>>> shared across multiple memcgs? They can only be tracked in one LRU,
>>>> but the caches are global and are globally accessed.
>>>
>>> I think the proposal is to not solve this problem. Because at first it
>>> sounds a bit weird, let me explain myself:
>>>
>>> 1) Not all processes in the system will sit on a memcg.
>>> Technically they will, but the root cgroup is never accounted, so a big
>>> part of the workload can be considered "global" and will have no
>>> attached memcg information whatsoever.
>>>
>>> 2) Not all child memcgs will have associated vfs objects, or kernel
>>> objects at all, for that matter. This happens only when specifically
>>> requested by the user.
>>>
>>> Due to that, I believe that although sharing is obviously a reality
>>> within the VFS, the workloads associated with it will tend to be
>>> fairly local.
>>
>> I have my doubts about that - I've heard it said many times but no
>> data has been provided to prove the assertion....
>>
>
> First let me stress this point again, which is never enough: Sitting on
> a memcg will not automatically start tracking kmem for you. So it is
> possible to have the craziest possible memcg use case, and still have
> kmem not tracked. Since kmem is naturally a shared resource, one can
> only expect that people turning this on will be aware of that.
>
> It is of course hard for me to argue or come up with data about that
> for every possible use of memcg. If you have anything specific in mind,
> I can get a test running. But I can talk about the things we do.
>
> Since we run a full-userspace container, our main use case is to mount
> a directory structure somewhere - so this already gets the whole chain
> cached, then we chroot to it, and move the tasks to the cgroup. There
> seems to be nothing unreasonable about assuming that the vast majority
> of the dentries accessed from that point on will tend to be fairly local.
>
> Even if you have a remote NFS mount that is only yours, that is also
> fine, because you get accounted as you touch the dentries. The
> complications arise when more than one group is accessing it. But then,
> as I said, there is WIP to predictably determine which group you will
> end up at, and at this point, it becomes policy.
>
> Maybe Ying Han can tell us more about what they are doing to add to the
> pool, but I can only assume that if sharing was a deal-breaker for them,
> they would not be pursuing this path at all.
A similar discussion happened on the "per-memcg dirty limit & writeback"
patch, where we faced a similar issue of how to tag an inode that is
shared between two memcgs. The answer remains that it is a rare case in
our production environment.
>
>>> When sharing does happen, we currently account to the
>>> first process to ever touch the object. This is also how memcg treats
>>> shared memory users for userspace pages and it is working well so far.
>>> It doesn't *always* give you good behavior, but I guess those fall in
>>> the list of "workloads memcg is not good for".
>>
>> And that list contains?
>
> I would say anything that is, first, not logically groupable
> (the general case for cgroups), and for which it is hard to come up with
> an immediate upper limit of memory.
>
> When kernel memory comes into play, you have to consider that you are
> accounting a shared resource, so you should have reasonable expectations
> to not be sharing those resources with everybody.
>
> We are also never implying that no sharing will happen. Just that we
> expect it to be in a low rate.
>
>>
>>> Do we want to extend this list of use cases? Sure. There is also
>>> discussion going on about how to improve this in the future. That would
>>> allow a policy to specify which memcg is to be "responsible" for the
>>> shared objects, be them kernel memory or shared memory regions. Even
>>> then, we'll always have one of the two scenarios:
>>>
>>> 1) There is a memcg that is responsible for accounting that object, and
>>> then is clear we should reclaim from that memcg.
>>>
>>> 2) There is no memcg associated with the object, and then we should not
>>> bother with that object at all.
>>>
>>> I fully understand your concern, specifically because we talked about
>>> that in details in the past. But I believe most of the cases that would
>>> justify it would fall in 2).
>>
>> Which then leads to this: the no-memcg object case needs to scale.
>>
>
> Yes, and I trust you to do it! =)
>
>
>>> Another thing to keep in mind is that we don't actually track objects.
>>> We track pages, and try to make sure that objects in the same page
>>> belong to the same memcg. (That could be important for your analysis or
>>> not...)
>>
>> Hmmmm. So you're basically using the characteristics of internal
>> slab fragmentation to keep objects allocated to different memcg's
>> apart? That's .... devious. :)
>
> My first approach was to get the pages themselves, move pages around
> after cgroup destruction, etc. But that touched the slab heavily, and
> since we have 3 of them, and writing more seems to be a joyful hobby some
> people pursue, I am now actually reproducing the metadata and creating
> per-cgroup caches. It is about the same, but with a bit more wasted
> space that we're happily paying as the price for added simplicity.
>
> Instead of directing the allocation to a page, which requires knowledge
> of how the various slabs operate, we direct allocations to a different
> cache altogether, that hides it.
>
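A minimal sketch of that redirection, assuming a memcg_kmem_get_cache()-style
hook as in the memcontrol.h context of the patch above; the wrapper name and
the call site are illustrative only:

static void *kmem_cache_alloc_accounted(struct kmem_cache *cachep, gfp_t gfp)
{
	/*
	 * Swap the requested cache for the current task's per-memcg replica,
	 * so the object (and the slab page backing it) belongs to that memcg
	 * from the moment it is allocated.
	 */
	cachep = memcg_kmem_get_cache(cachep, gfp);
	return kmem_cache_alloc(cachep, gfp);
}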
>
>>
>>>> Having mem
>>>> pressure in a single memcg that causes globally accessed dentries
>>>> and inodes to be tossed from memory will simply cause cache
>>>> thrashing and performance across the system will tank.
>>>>
>>>
>>> As said above. I don't consider global accessed dentries to be
>>> representative of the current use cases for memcg.
>>
>> But they have to co-exist, and I think that's our big problem. If
>> you have a workload in a memcg, and the underlying directory
>> structure is exported via NFS or CIFS, then there is still global
>> access to that "memcg local" dentry structure.
>>
>
> I am all for co-existing.
>
>>>>>> The patch now is only handling dentry cache by given the nature dentry pinned
>>>>>> inode. Based on the data we've collected, that contributes the main factor of
>>>>>> the reclaimable slab objects. We also could make a generic infrastructure for
>>>>>> all the shrinkers (if needed).
>>>>>
>>>>> Dave Chinner has some prototype code for that.
>>>>
>>>> The patchset I have makes the dcache lru locks per-sb as the first
>>>> step to introducing generic per-sb LRU lists, and then builds on
>>>> that to provide generic kernel-wide LRU lists with integrated
>>>> shrinkers, and builds on that to introduce node-awareness (i.e. NUMA
>>>> scalability) into the LRU list so everyone gets scalable shrinkers.
>>>>
>>>
>>> If you are building a generic infrastructure for shrinkers, what is the
>>> big point about per-sb? I'll give you that most of the memory will come
>>> from the VFS, but other objects that bear no relationship with the vfs
>>> are shrinkable too.
>>
>> Without any more information, it's hard to understand what I'm
>> doing.
>
> With more information it's hard as well. =p
>
>>
>>>> I've looked at memcg awareness in the past, but the problem is the
>>>> overhead - the explosion of LRUs because of the per-sb X per-node X
>>>> per-memcg object tracking matrix. It's a huge amount of overhead
>>>> and complexity, and unless there's a way of efficiently tracking
>>>> objects both per-node and per-memcg simultaneously then I'm of the
>>>> opinion that memcg awareness is simply too much trouble, complexity
>>>> and overhead to bother with.
>>>>
>>>> So, convince me you can solve the various problems. ;)
>>>
>>> I believe we are open minded regarding a solution for that, and your
>>> input is obviously top. So let me take a step back and restate the problem:
>>>
>>> 1) Some memcgs, not all, will have memory pressure regardless of the
>>> memory pressure in the rest of the system
>>> 2) that memory pressure may or may not involve kernel objects.
>>> 3) if kernel objects are involved, we can assume the level of sharing is
>>> low.
>>
>> I don't think you can make this assumption. You could simply have a
>> circle-thrash of a shared object where the first memcg reads it,
>> caches it, then reclaims it, then the second does the same thing,
>> then the third, and so on around the circle....
>>
>
> I think the best we can do here is a trade-off. We won't make shared
> resources become exclusive. What I tried to do, by requiring someone to
> explicitly turn this on, is confine its usage to scenarios where this
> assumption will mostly hold.
>
> "Mostly hold" basically means we still have problems like the one you
> describe, but expect them to occur in quantities small enough not to
> bother us. It is just like the case for processor caches...
>
>>> 4) We then need to shrink memory from that memcg, affecting the
>>> others the least we can.
>>>
>>> Do you have any proposals for that, in any shape?
>>>
>>> One thing that crossed my mind, was instead of having per-sb x
>>> per-node objects, we could have per-"group" x per-node objects.
>>> The group would then be either a memcg or a sb.
>>
>> Perhaps we've all been looking at the problem the wrong way.
>>
>> As I was writing this, it came to me that the problem is not that
>> "the object is owned either per-sb or per-memcg". The issue is how
>> to track the objects in a given context.
>
> Which is basically just a generalization of "per-group" as I said. The
> group can be anything, and I believe your generic LRU interface makes it
> quite clear.
>
>> The overall LRU manipulations
>> and ownership of the object is identical in both the global and
>> memcg cases - it's the LRU that the object is placed on that
>> matters! With a generic LRU+shrinker implementation, this detail is
>> completely confined to the internals of the LRU+shrinker subsystem.
>>
>
> Indeed.
>
>> IOWs, if you are tagging the object with memcg info at a slab page
>> level, the LRU and shrinker need to operate at the same level, not
>> at the per-object level.
>
> Not sure I fully parse. You do still track objects in the LRU, right?
>
>> The LRU implementation I have currently
>> selects the internal LRU list according to the node the object was
>> allocated on. i.e. by looking at the page:
>>
> [...]
>
>> There is no reason why we couldn't determine if an object was being
>> tracked by a memcg in the same way. Do we have a page_to_memcg()
>> function?
>
> Yes, in principle. We don't currently have one, because we have no users.
> But that can easily be provided, and Ying Han's patches actually
> provide one.
>
>> If we've got that, then all we need to add to the struct
>> shrinker_lru is a method of dynamically adding and looking up the
>> memcg to get the appropriate struct shrinker_lru_node from the
>> memcg. The memcg would need a struct shrinker_lru_node per generic
>> LRU in use, and this probably means we need to uniquely identify
>> each struct shrinker_lru instance so the memcg code can keep a
>> dynamic list of them.
>
> Sounds about right.
>
>>
>> With that, we could track objects per memcg or globally on the
>> per-node lists.
>
> For userspace pages, we do per-memcg-per-zone. Can't we do the same
> here? Note that simple "per-memcg" is already a big thing for us, but
> you're not alone in your fight for scalability.
>
>> If we then add a memcg id to the struct
>> scan_control, the shrinker can then walk the related memcg LRU
>> rather than the per-node LRU. That means that general per-node
>> reclaim won't find memcg related objects, and memcg related reclaim
>> won't scan the global per-node objects. This could be changed as
>> needed, though.
>>
>> What it does, though, is preserve the correct balance of related
>> caches in the memcg because they use exactly the same subsystem code
>> that defines the relationship for the global cache. It also
>> preserves the scalability of the non-memcg based processes, and
>> allows us to tune the global vs memcg LRU reclaim algorithm in a
>> single place.
>>
>> That, to me, sounds almost ideal - memcg tracking and reclaim works
>> with very little added complexity, it has no additional memory
>> overhead, and scalability is not compromised. What have I missed? :P
If I understand it, we are creating a per-memcg per-zone LRU for vfs
slabs? That sounds similar to how user pages are handled today in the
lruvec. As I mentioned in the commit description, we thought about this
at the beginning, but the complication of the code might not gain us
much. The complication mainly comes from the races where memcgs come
and go, and we need to be careful when handling the separate lists.
The patch I have here works fine for the problem we are trying to
solve. That said, I need to read a bit more to understand what the
generic shrinker buys us.
Thanks
--Ying
>
> So before I got to your e-mail, I actually coded a prototype for it.
> Due to the lack of funding for my crystal balls department, I didn't
> make use of your generic LRU, so a lot of it is still hard coded in
> place. I am attaching the ugly patch here so you can take a look.
>
> Note that I am basically using the same prune_dcache function, just with
> a different list. I am also not bothering with inodes, etc.
>
> Take a look. How do you think this would integrate with your idea?
>
>
* Re: [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup
2012-08-17 5:40 ` Ying Han
2012-08-17 5:42 ` Glauber Costa
@ 2012-08-19 3:41 ` Andi Kleen
1 sibling, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2012-08-19 3:41 UTC (permalink / raw)
To: Ying Han
Cc: Glauber Costa, Dave Chinner, Rik van Riel, Michal Hocko,
Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Greg Thelen,
Christoph Lameter, KOSAKI Motohiro, linux-mm
Ying Han <yinghan@google.com> writes:
>
> I haven't thought about the NUMA and node awareness for the shrinkers,
> and that sounds like something beyond the problem I am trying to solve
> here. I might need to think a bit more about how that fits into the
> problem you described.
The memory failure code would also benefit from a more directed slab
(especially d/icache) freeing method. Right now, if it wants to hard- or
soft-offline a slab page, it has to take the big hammer and free as much
as it can, just in the hope of freeing that one page.
-Andi
--
ak@linux.intel.com -- Speaking for myself only
Thread overview: 12+ messages
2012-08-16 20:53 [RFC PATCH 0/6] memcg: vfs isolation in memory cgroup Ying Han
2012-08-16 21:10 ` Rik van Riel
2012-08-16 23:41 ` Dave Chinner
2012-08-17 5:15 ` Glauber Costa
2012-08-17 5:40 ` Ying Han
2012-08-17 5:42 ` Glauber Costa
2012-08-17 7:56 ` Dave Chinner
2012-08-19 3:41 ` Andi Kleen
2012-08-17 7:54 ` Dave Chinner
2012-08-17 10:00 ` Glauber Costa
2012-08-17 19:18 ` Ying Han
2012-08-17 14:44 ` Rik van Riel