Some thoughts on memory policies

* Some thoughts on memory policies
@ 2007-06-18 20:22 ` Christoph Lameter
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2007-06-18 20:22 UTC (permalink / raw)
  To: linux-mm, wli, lee.schermerhorn; +Cc: linux-kernel

I think we are getting into more and more of a mess with the existing 
memory policies. The refcount issue with shmem is just one bad symptom of 
it. Memory policies were intended to be process based and 
taking them out of that context causes various issues.

I have thought for a long time that we need something to replace memory 
policies especially since the requirements on memory policies go far 
beyond just being process based. So some requirements and ideas about 
memory policies.

1. Memory policies must be attachable to a variety of objects

- Device drivers may need to control memory allocations in
  devices either because their DMA engines can only reach a
  subsection of the system or because memory transfer
  performance is superior in certain nodes of the system.

- Process. This is the classic usage scenario

- File / Socket. One may have particular reasons to place
  objects on a set of nodes because of how the threads of
  an application are spread out in the system.

- Cpuset / Container. Some simple support is there with
  memory spreading today. That could be made more universal.

- System policies. The system policy is currently not
  modifiable. It may be useful to be able to set this.
  Small NUMA systems may want to run with interleave by default.

- Address range. For the virtual memory address range
  this is included in today's functionality but one may also
  want to control the physical address range to make sure
  f.e. that memory is allocated in an area where a device
  can reach it.

- Memory policies need to be attachable to types of pages.
  F.e. executable pages of a threaded application are best
  spread (or replicated) whereas the stack and the data may
  best be allocated in a node local way.
  Useful categories that I can think of:
  Stack, Data, Filebacked pages, Anonymous Memory,
  Shared memory, Page tables, Slabs, Mlocked pages and
  huge pages.

  Maybe a set of global policies would be useful for these
  categories. Andi hacked subsystem memory policies into
  shmem and it seems that we are now trying to do the same
  for hugepages. Maybe we could get to a consistent scheme
  here?

2. Memory policies need to support additional constraints

- Restriction to a set of nodes. That is what we have today.

- Restriction to a container or cpuset. Maybe restriction
  to a set of containers?

- Strict vs. non-strict allocations. A strict allocation needs
  to fail if the constraints cannot be met. A non-strict
  allocation can fall back.

- Order of allocation. Higher order pages may require
  different allocation constraints? This is like a
  generalization of huge page policies.

- Locality placement. These are node local, interleave etc.

3. Additional flags

- Automigrate flag so that memory touched by a process
  is moved to a memory location that has best performance.

- Page order flag that determines the preferred allocation
  order. Maybe useful in connection with the large blocksize
  patch to control anonymous memory orders.

- Replicate flags so that memory is replicated.

4. Policy combinations

We need some way to combine policies in a systematic way. The current
hierarchy of System->cpuset->process->memory range no longer
works if a process can use policies set up in shmem or huge pages.
Some consistent scheme to combine memory policies would also need
to be able to synthesize different policies. E.g. automigrate
can be combined with node local or interleave and a cpuset constraint.
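
A minimal sketch of one possible combination scheme, as a user-space toy
model rather than kernel code. The rules used here are assumptions chosen
only for illustration: the most specific level that sets a mode wins, node
restrictions intersect, flags accumulate, and the cpuset mask bounds the
result. All names (comb_policy, comb_combine, COMB_*) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical locality modes; COMB_UNSET defers to a less specific level. */
enum comb_mode { COMB_UNSET = 0, COMB_LOCAL, COMB_INTERLEAVE };

/* Hypothetical flag bit standing in for an "automigrate" request. */
#define COMB_F_AUTOMIGRATE 0x1u

struct comb_policy {
    enum comb_mode mode;   /* COMB_UNSET: no mode set at this level */
    unsigned long nodes;   /* one bit per node; 0 means no restriction here */
    unsigned int flags;
};

/*
 * Combine policies ordered from least to most specific (for example
 * system, process, memory range), then apply the cpuset constraint.
 * Assumed rules, for illustration only:
 *   - the most specific level that sets a mode wins;
 *   - node restrictions intersect;
 *   - flags accumulate;
 *   - the cpuset mask always bounds the resulting node set.
 */
static struct comb_policy comb_combine(const struct comb_policy *levels,
                                       size_t n, unsigned long cpuset_nodes)
{
    struct comb_policy out = { COMB_LOCAL, cpuset_nodes, 0 };

    for (size_t i = 0; i < n; i++) {
        if (levels[i].mode != COMB_UNSET)
            out.mode = levels[i].mode;
        if (levels[i].nodes != 0)
            out.nodes &= levels[i].nodes;
        out.flags |= levels[i].flags;
    }
    return out;
}
```

With a system level that sets interleave, a process level that only adds
the automigrate flag, and a range level restricted to nodes 1-2 inside a
cpuset of nodes 0-1, this yields interleave plus automigrate on node 1.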

5. Management tools

If we make the policies more versatile then we need the proper
management tools in user space to set and display these policies
in such a way that they can be managed by the end user. The esoteric
nature of memory policy semantics makes them difficult to comprehend.

6. GFP_xx flags may actually be considered as a form of policy

E.g. GFP_THISNODE is essentially a one-node cpuset.

GFP_DMA and GFP_DMA32 are physical address range constraints.

GFP_HARDWALL is a strict vs. nonstrict distinction.
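
The mapping above can be sketched as a translation from flags to an
explicit constraint record. This is a user-space toy: the TOY_GFP_* bits
are hypothetical stand-ins and not the kernel's gfp_t values, and the
address limits (16 MiB for DMA, 4 GiB for DMA32) assume x86 conventions;
other platforms differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the kernel's gfp_t bit values. */
#define TOY_GFP_THISNODE 0x1u
#define TOY_GFP_DMA      0x2u
#define TOY_GFP_DMA32    0x4u
#define TOY_GFP_HARDWALL 0x8u

struct toy_constraints {
    unsigned long nodes;  /* allowed nodes, one bit per node */
    uint64_t max_phys;    /* highest usable physical address */
    bool strict;          /* fail instead of falling back */
};

/* Translate toy allocation flags into the policy-like constraints they imply. */
static struct toy_constraints toy_gfp_to_constraints(unsigned int gfp,
                                                     int local_node,
                                                     unsigned long all_nodes)
{
    struct toy_constraints c = { all_nodes, UINT64_MAX, false };

    if (gfp & TOY_GFP_THISNODE)
        c.nodes = 1UL << local_node;            /* behaves like a one-node cpuset */
    if (gfp & TOY_GFP_DMA32)
        c.max_phys = (UINT64_C(1) << 32) - 1;   /* assumed x86-64 limit: below 4 GiB */
    if (gfp & TOY_GFP_DMA)
        c.max_phys = (UINT64_C(1) << 24) - 1;   /* assumed x86 limit: below 16 MiB */
    if (gfp & TOY_GFP_HARDWALL)
        c.strict = true;                        /* no fallback outside the constraint */
    return c;
}
```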

7. Allocators must change

Right now the policy is set by the process context which is bad because
one cannot specify a memory policy for an allocation. It must be possible
to pass a memory policy to the allocators and then get the memory 
requested.

I wish we could come up with some universal scheme that encompasses all
of the functionality we want and that makes memory more manageable....

* Re: Some thoughts on memory policies
  2007-06-18 20:22 ` Christoph Lameter
@ 2007-06-19 20:24   ` Lee Schermerhorn
  -1 siblings, 0 replies; 16+ messages in thread
From: Lee Schermerhorn @ 2007-06-19 20:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, wli, linux-kernel

On Mon, 2007-06-18 at 13:22 -0700, Christoph Lameter wrote:
> I think we are getting into more and more of a mess with the existing 
> memory policies. The refcount issue with shmem is just one bad symptom of 
> it. Memory policies were intended to be process based and 
> taking them out of that context causes various issues.

I don't think memory policies are in as much of a mess as Christoph
seems to.  Perhaps this is my ignorance showing.  Certainly, there are
issues to be addressed--especially in the interaction of memory policies
with containers, such as cpusets.  The shmem refcount issue may be one
of these issues--not sure how "bad" it is.  

I agree that the "process memory policy"--i.e., the one set by
set_mempolicy()--is "process based", but I don't see the system default
policy as process based.  The system default policy is, currently, the
policy of last resort for any allocation.  And, [as I've discussed with
Christoph], I view policies applied via mbind() as applying to [some
range of] the "memory object" mapped at a specific address range.  I
admit that this view is somewhat muddied by the fact that [private]
anonymous segments don't have any actual kernel structure to
represent them outside of a process's various anonymous VMAs and page
table [and sometimes the swap cache]; and by the fact that the kernel
currently ignores policy that one attempts to place on shared regular
file mappings.  However, I think "object-based policy" is a natural
extension of the current API and easily implemented with the current
infrastructure.  

> 
> I have thought for a long time that we need something to replace memory 
> policies especially since the requirements on memory policies go far 
> beyond just being process based. So some requirements and ideas about 
> memory policies.

Listing the requirements is a great idea.  But I won't go so far as to
agree that we need to "replace memory policies" so much as rationalize
them for all the desired uses/contexts...

> 
> 1. Memory policies must be attachable to a variety of objects
> 
> - Device drivers may need to control memory allocations in
>   devices either because their DMA engines can only reach a
>   subsection of the system or because memory transfer
>   performance is superior in certain nodes of the system.
> 
> - Process. This is the classic usage scenario
> 
> - File / Socket. One may have particular reasons to place
>   objects on a set of nodes because of how the threads of
>   an application are spread out in the system.

...how the tasks/threads are spread out and how the application accesses
the pages of the objects.  Some accesses--e.g., to unmapped pages of
files--are implicit or transparent to the task.  I guess any pages
associated with a socket would be transparent to the application as
well?

> 
> - Cpuset / Container. Some simple support is there with
>   memory spreading today. That could be made more universal.

I've said before that I viewed cpusets as administrative constraints on
applications, whereas policies are something that can be controlled by
the application or a non-privileged user.  As cpusets evolve into more
general "containers", I think they'll become less visible to the
applications running within them.  The application will see the
container as "the system"--at least, the set of system resources to
which the application has access.  

The current memory policy APIs can work in such a "containerized"
environment if we can reconcile the policy APIs' notion of nodes with
the set of nodes that container allows.  Perhaps we need to revisit the
"cpumemset" proposal that provides a separate node id namespace in each
container/cpuset.  As a minimum, I think a task should be able to query
the set of nodes that it can use and/or have the system "do the right
thing" if the application specifies "all possible nodes" for, say, an
interleave policy.
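
A minimal sketch of that "do the right thing" behavior, as a user-space
toy model: the requested node set is intersected with the set the
container allows, and interleave then cycles over what remains. The
toy_nodemask type and both function names are hypothetical and are not the
kernel's nodemask_t API.

```c
/* Toy node mask: bit n set means node n is included. */
typedef unsigned long toy_nodemask;

/* Restrict the nodes an application asked for to those its container allows. */
static toy_nodemask toy_effective_nodes(toy_nodemask requested,
                                        toy_nodemask allowed)
{
    return requested & allowed;
}

/*
 * Pick the node for the page at index idx under interleave: the
 * (idx mod count)-th set bit of nodes, counting from the lowest bit.
 * Returns -1 if the mask is empty.
 */
static int toy_interleave_node(toy_nodemask nodes, unsigned long idx)
{
    int count = 0;

    /* Count set bits by clearing the lowest one each round. */
    for (toy_nodemask m = nodes; m; m &= m - 1)
        count++;
    if (count == 0)
        return -1;

    unsigned long target = idx % (unsigned long)count;
    for (int node = 0; node < (int)(sizeof(nodes) * 8); node++) {
        if (nodes & (1UL << node)) {
            if (target == 0)
                return node;
            target--;
        }
    }
    return -1; /* not reached for a non-empty mask */
}
```

An application that asks for every node while running in a container
limited to nodes 2 and 3 would then see its pages alternate between those
two nodes only.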

> 
> - System policies. The system policy is currently not
>   modifiable. It may be useful to be able to set this.
>   Small NUMA systems may want to run with interleave by default 

Agreed.  And, on our platforms, it would be useful to have a separately
specifiable system-wide [or container-wide] default page cache policy.

> 
> - Address range. For the virtual memory address range
>   this is included in todays functionality but one may also
>   want to control the physical address range to make sure
>   f.e. that memory is allocated in an area where a device
>   can reach it.

For application usage?  Does this mean something like an MPOL_MF_DMA
flag?  

One way to handle this w/o an explicit DMA flag for user space APIs is
to mmap() the device that would use the memory and allow the device
driver to allocate the memory internally with the appropriate DMA/32
flags and map that memory into the task's address space.  I think that
works today.

What other usage scenarios are you thinking of?

> 
> - Memory policies need to be attachable to types of pages.
>   F.e. executable pages of a threaded application are best
>   spread (or replicated) whereas the stack and the data may
>   best be allocated in a node local way.
>   Useful categories that I can think of
>   Stack, Data, Filebacked pages, Anonymous Memory,
>   Shared memory, Page tables, Slabs, Mlocked pages and
>   huge pages.

Rather, I would say, to "types of objects".   I think all of the "types
of pages" you mention [except, maybe, mlocked?] can be correlated to
some structure/object to which policy can be attached.  Regarding
"Mlocked pages"--are you suggesting that you might want to specify that
mlocked pages have a different policy/locality than other pages in the
same object?

Stack and data/heap can easily be handled by always defaulting the
process policy to node local [or perhaps interleaved across the nodes in
the container, if node local results in hot spots or other problems],
and explicitly binding other objects of interest, if performance
considerations warrant, using the mbind() API or by using fixed or
heuristic defaults.

> 
>   Maybe a set of global policies would be useful for these
>   categories. Andi hacked subsystem memory policies into
>   shmem and it seems that we are now trying to do the same
>   for hugepages. Maybe we could get to a consistent scheme
>   here?

Christoph, I wish you wouldn't characterize Andi's shared policy
infrastructure as a hack.  I think it provides an excellent base
implementation for [shared] object-based policies.  It extends easily to
any object that can be addressed by offset [page offset, hugepage
offset, ...].  The main issue is the generic one of memory policy on
objects that can be shared by processes running in separate cpusets,
whether the sharing is intentional or not.  

> 
> 2. Memory policies need to support additional constraints
> 
> - Restriction to a set of nodes. That is what we have today.

See "locality placement" below.

> 
> - Restriction to a container or cpuset. Maybe restriction
>   to a set of containers?

I don't know about a "set of containers", but perhaps you are referring
to sharing of objects between applications running in different
containers with potentially disjoint memory resources?  That is
problematic.  We need to enumerate the use cases for this and what the
desired behavior should be.

Christoph and I discussed one scenario:  backup running in a separate
cpuset, disjoint from an application that mmap()s a file shared and
installs a shared policy on it [my "mapped file policy" patches would
enable this].  If the application's cpuset contains sufficient memory
for the application's working set, but NOT enough to hold the entire
file, the backup running in another cpuset reading the entire file may
push out pages of the application from its cpuset because the object
policy constrains the pages to be located in the application's cpuset.  

> 
> - Strict vs no strict allocations. A strict allocation needs
>   to fail if the constraints cannot be met. A non strict
>   allocation can fall back.

Agreed.  And I think this needs to be explicit in the allocation
request.  Callers requesting strict allocation [including "no wait"]
should be prepared to handle failure of the allocation.
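
A minimal sketch of strictness as an explicit part of the request, using a
user-space toy model with per-node free page counters. The structure and
function names are hypothetical; the point is only that the caller states
strictness up front and must check for failure.

```c
#include <stdbool.h>

#define TOY_MAX_NODES 4

/* Free pages per node in this toy model; not a kernel structure. */
struct toy_system {
    unsigned long free_pages[TOY_MAX_NODES];
};

/*
 * Allocate one page. Nodes in allowed are tried first, lowest id first.
 * If none of them has a free page, a strict request fails, while a
 * non-strict request falls back to any node with free memory.
 * Returns the node the page came from, or -1 on failure.
 */
static int toy_alloc_page(struct toy_system *sys, unsigned long allowed,
                          bool strict)
{
    for (int node = 0; node < TOY_MAX_NODES; node++) {
        if ((allowed & (1UL << node)) && sys->free_pages[node] > 0) {
            sys->free_pages[node]--;
            return node;
        }
    }
    if (strict)
        return -1;  /* constraints cannot be met: the caller must handle this */

    for (int node = 0; node < TOY_MAX_NODES; node++) {
        if (sys->free_pages[node] > 0) {
            sys->free_pages[node]--;
            return node;
        }
    }
    return -1;      /* the whole system is out of pages */
}
```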

> 
> - Order of allocation. Higher order pages may require
>   different allocation constraints? This is like a
>   generalization of huge page policies.

Agreed.  On our platform, I'd like to keep default huge page allocations
and interleave requests off the "hardware interleaved pseudo-node" as
that is "special" memory.  I'd like to reserve it for access only by
explicit request.  The current model doesn't support this, but I think
it could, with a few "small" enhancements. [TODO]

> 
> - Locality placement. These are node local, interleave etc.

How is this different from "restriction to a set of nodes" in the
context of memory policies [1st bullet in section 2]?  I tend to think
of memory policies--whether default or explicit--as "locality placement"
and cpusets as "constraints" or restrictions on what policies can do.

> 
> 3. Additional flags
> 
> - Automigrate flag so that memory touched by a process
>   is moved to a memory location that has best performance.

Automigration can be turned on/off in the environment--e.g., per
container/cpuset, but perhaps there is a use case for more explicit
control over automigration of specific pages of an object?

"Lazy migration" or "migrate on fault" is fairly easy to achieve atop
the existing migration infrastructure.  However, it requires a fault to
trigger the migration.  One can arrange for these faults to occur
explicitly--e.g., via a straightforward extension to mbind() with
MPOL_MF_MOVE and a new MPOL_MF_LAZY flag to remove the page translations
from all page tables resulting in a fault, and possible migration, on
next touch.  Or, one can arrange to automatically "unmap" [remove ptes
referencing] selected types of pages when the load balancer moves a task
to a new node.
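
The two steps of "migrate on fault" can be sketched as a user-space toy
model: first the translation is removed without moving the page, then the
next touch faults and the page is moved if it is misplaced. The sketch
assumes a node-local policy (the target is the node of the touching task),
and the names toy_page, toy_mark_lazy and toy_touch are hypothetical.

```c
#include <stdbool.h>

struct toy_page {
    int node;     /* node currently holding the page */
    bool mapped;  /* true if a translation exists, so a touch does not fault */
};

/* Step 1: drop the translation so the next touch faults. The page stays put. */
static void toy_mark_lazy(struct toy_page *pg)
{
    pg->mapped = false;
}

/*
 * Step 2: a touch from a task running on task_node.
 * Without a fault nothing is re-evaluated. On a fault the page is moved
 * next to the task if it is misplaced, then mapped again.
 * Returns true if the page was migrated.
 */
static bool toy_touch(struct toy_page *pg, int task_node)
{
    bool migrated = false;

    if (pg->mapped)
        return false;
    if (pg->node != task_node) {
        pg->node = task_node;  /* stands in for the actual page copy */
        migrated = true;
    }
    pg->mapped = true;
    return migrated;
}
```

Pages that are never touched again are never copied, which is the
difference from migrating the whole range eagerly.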

I've seen fairly dramatic reductions in real, user and system time in,
e.g., kernel builds on a heavily loaded [STREAMS benchmark running] NUMA
platform with automatic/lazy migration patches:   ~14% real, ~4.7% user
and ~22% system time reductions.

> 
> - Page order flag that determines the preferred allocation
>   order. Maybe useful in connection with the large blocksize
>   patch to control anonymous memory orders.

Agreed.  "requested page order" could be a component of policy, along
with locality.

> 
> - Replicate flags so that memory is replicated.

This could be a different policy mode, MPOL_REPLICATE.  Or, as with
Nick's prototype, it could be the default behavior for read-only access
to page cache pages when no explicit policy exists on the object [file].

For "automatic, lazy" replication, one would also need a fault to trigger
the replication.  This could be achieved by removing the pte from only
the calling task's page table via mbind(MOVE+LAZY) or automatically on
inter-node task migration.  The resulting fault, when that corresponding
virtual address is touched, would cause Nick's page cache replication
infrastructure to create/use a local copy of the page.  It's "on my
list" ...
> 
> 4. Policy combinations
> 
> We need some way to combine policies in a systematic way. The current
> hierarchy of System->cpuset->process->memory range no longer
> works if a process can use policies set up in shmem or huge pages.
> Some consistent scheme to combine memory policies would also need
> to be able to synthesize different policies. I.e. automigrate
> can be combined with node local or interleave and a cpuset constraint.

The big issue, here, for me, is the interaction of policy on shared
objects [shmem and shared regular file mappings] referenced from
different containers/cpusets.   Given that we want to allow this--almost
can't prevent it in the case of regular file access--we need to specify
the use cases, what the desired behavior is for each such case, and
which scenarios to optimize for.

> 
> 5. Management tools
> 
> If we make the policies more versatile then we need the proper
> management tools in user space to set and display these policies
> in such a way that they can be managed by the end user. The esoteric
> nature of memory policy semantics makes them difficult to comprehend.

/proc/<pid>/numa_maps works well [with my patches] for objects mapped
into a task's address space.  What it doesn't work so well for are:
1) shared policy on currently unattached shmem segments and 2) shared
policy on unmapped regular files, should my patches be accepted.  [Note,
however, we need not retain shared policy on regular files after the
last shared mapping is removed--my recommended persistence model.]

> 6. GFP_xx flags may actually be considered as a form of policy

Agreed.  For kernel internal allocation requests...

> 
> i.e. GFP_THISNODE is essentially a one node cpuset.

sort of behaves like one, I agree.  Or like an explicit MPOL_BIND with a
single node.

> 
> GFP_DMA and GFP_DMA32 are physical address range constraints.

with platform specific locality implications...

> 
> GFP_HARDWALL is a strict vs. nonstrict distinction.
> 
> 
> 7. Allocators must change
> 
> Right now the policy is set by the process context which is bad because
> one cannot specify a memory policy for an allocation. It must be possible
> to pass a memory policy to the allocators and then get the memory 
> requested.

Agreed.  In my shared/mapped file policy patches, I have factored an
alloc_page_pol() function out of alloc_page_vma().  The modified
alloc_page_vma() calls get_vma_policy() [as does the current version] to
obtain the policy at the specified address in the calling task's virtual
address space or some default policy, and then calls alloc_page_pol() to
allocate a page based on that policy.  I can then use the same
alloc_page_pol() function to allocate page cache pages after looking up
a shared policy on a mapped file or using the default policy for page
cache allocations [currently process->system default].  Perhaps others of
the page allocators could use alloc_page_pol() as well?
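
The shape of that factoring can be sketched in a user-space toy model:
policy lookup and policy-driven allocation are separate functions, so both
the VMA path and the page cache path can share the second one. The toy_*
names and structures below are hypothetical stand-ins and do not reproduce
the kernel functions or the patches; "allocation" just returns the node
that would be chosen.

```c
#include <stddef.h>

enum toy_pol_mode { TOY_POL_LOCAL, TOY_POL_BIND };

struct toy_pol {
    enum toy_pol_mode mode;
    int node;                    /* target node, used by TOY_POL_BIND only */
};

struct toy_vma {
    unsigned long start, end;    /* covers [start, end) */
    const struct toy_pol *pol;   /* NULL: no policy on this range */
};

/* Fallback when nothing more specific applies. */
static const struct toy_pol toy_default_pol = { TOY_POL_LOCAL, 0 };

/* Lookup only: policy for an address, or the default. */
static const struct toy_pol *toy_get_vma_policy(const struct toy_vma *vma,
                                                unsigned long addr)
{
    if (vma && addr >= vma->start && addr < vma->end && vma->pol)
        return vma->pol;
    return &toy_default_pol;
}

/* Allocation only: choose a node from an explicitly passed policy. */
static int toy_alloc_page_pol(const struct toy_pol *pol, int local_node)
{
    return pol->mode == TOY_POL_BIND ? pol->node : local_node;
}

/* VMA path: look the policy up, then hand it to the shared allocator. */
static int toy_alloc_page_vma(const struct toy_vma *vma, unsigned long addr,
                              int local_node)
{
    return toy_alloc_page_pol(toy_get_vma_policy(vma, addr), local_node);
}

/* Page cache path: use a shared policy found on the file, else the default. */
static int toy_alloc_page_cache(const struct toy_pol *file_pol, int local_node)
{
    return toy_alloc_page_pol(file_pol ? file_pol : &toy_default_pol,
                              local_node);
}
```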

> 
> 
> I wish we could come up with some universal scheme that encompasses all
> of the functionality we want and that makes memory more manageable....

I think it's possible and that the current mempolicy support can be
evolved with not too much effort.  Again, the biggest issue for me is
the reconciliation of the policies with the administrative constraints
imposed by subsetting the system via containers/cpusets--especially for
objects that can be referenced from more than one container.  I think
that any reasonable, let alone "correct", solution would be
workload/application dependent.

Lee

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Some thoughts on memory policies
@ 2007-06-19 20:24   ` Lee Schermerhorn
  0 siblings, 0 replies; 16+ messages in thread
From: Lee Schermerhorn @ 2007-06-19 20:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, wli, linux-kernel

On Mon, 2007-06-18 at 13:22 -0700, Christoph Lameter wrote:
> I think we are getting into more and more of a mess with the existing 
> memory policies. The refcount issue with shmem is just one bad symptom of 
> it. Memory policies were intended to be process based and 
> taking them out of that context causes various issues.

I don't think memory policies are in as much of a mess as Christoph
seems to.  Perhaps this is my ignorance showing.  Certainly, there are
issues to be addressed--especially in the interaction of memory policies
with containers, such as cpusets.  The shmem refcount issue may be one
of these issues--not sure how "bad" it is.  

I agree that the "process memory policy"--i.e., the one set by
set_mempolicy()--is "process based", but I don't see the system default
policy as process based.  The system default policy is, currently, the
policy of last resort for any allocation.  And, [as I've discussed with
Christoph], I view policies applied via mbind() as applying to [some
range of] the "memory object" mapped at a specific address range.  I
admit that this view is somewhat muddied by the fact that [private]
anonymous segments don't actually have any actual kernel structure to
represent them outside of a process's various anonymous VMAs and page
table [and sometimes the swap cache]; and by the fact that the kernel
currently ignores policy that one attempts to place on shared regular
file mappings.  However, I think "object-based policy" is a natural
extension of the current API and easily implemented with the current
infrastructure.  

> 
> I have thought for a long time that we need something to replace memory 
> policies especially since the requirements on memory policies go far 
> beyond just being process based. So some requirements and ideas about 
> memory policies.

Listing the requirements is a great idea.  But I won't go so far as to
agree that we need to "replace memory policies" so much as rationalize
them for all the desired uses/contexts...

> 
> 1. Memory policies must be attachable to a variety of objects
> 
> - Device drivers may need to control memory allocations in
>   devices either because their DMA engines can only reach a
>   subsection of the system or because memory transfer
>   performance is superior in certain nodes of the system.
> 
> - Process. This is the classic usage scenario
> 
> - File / Socket. One may have particular reasons to place
>   objects on a set of nodes because of how the threads of
>   an application are spread out in the system.

...how the tasks/threads are spread out and how the application accesses
the pages of the objects.  Some accesses--e.g., unmapped pages of
files--are implicit or transparent to task.  I guess any pages
associated with a socket would also be transparent to the application as
well?

> 
> - Cpuset / Container. Some simple support is there with
>   memory spreading today. That could be made more universal.

I've said before that I viewed cpusets as administrative contraints on
applications, where as policies are something that can be controlled by
the application or a non-privileged user.  As cpusets evolve into more
general "containers", I think they'll become less visible to the
applications running within them.  The application will see the
container as "the system"--at least, the set of system resources to
which the application has access.  

The current memory policy APIs can work in such a "containerized"
environment if we can reconcile the policy APIs' notion of nodes with
the set of nodes that container allows.  Perhaps we need to revisit the
"cpumemset" proposal that provides a separate node id namespace in each
container/cpuset.  As a minimum, I think a task should be able to query
the set of nodes that it can use and/or have the system "do the right
thing" if the application specifies "all possible nodes" for, say, and
interleave policy.

> 
> - System policies. The system policy is currently not
>   modifiable. It may be useful to be able to set this.
>   Small NUMA systems may want to run with interleave by default 

Agreed.  And, on our platforms, it would be useful to have a separately
specifiable system-wide [or container-wide] default page cache policy.

> 
> - Address range. For the virtual memory address range
>   this is included in todays functionality but one may also
>   want to control the physical address range to make sure
>   f.e. that memory is allocated in an area where a device
>   can reach it.

For application usage?  Does this mean something like an MPOL_MF_DMA
flag?  

One way to handle this w/o an explicit 'DMA flag for use space APIs is
to mmap() the device that would use the memory and allow the device
driver to allocate the memory internally with the appropriate DMA/32
flags and map that memory into the task's address space.  I think that
works today.

What other usage scenarios are you thinking of?

> 
> - Memory policies need to be attachable to types of pages.
>   F.e. executable pages of a threaded application are best
>   spread (or replicated) whereas the stack and the data may
>   best be allocated in a node local way.
>   Useful categories that I can think of
>   Stack, Data, Filebacked pages, Anonymous Memory,
>   Shared memory, Page tables, Slabs, Mlocked pages and
>   huge pages.

Rather, I would say, to "types of objects".   I think all of the "types
of pages" you mention [except, maybe, mlocked?] can be correlated to
some structure/object to which policy can be attached.  Regarding
"Mlocked pages"--are you suggesting that you might want to specify that
mlocked pages have a different policy/locality than other pages in the
same object?

Stack and data/heap can easily be handled by always defaulting the
process policy to node local [or perhaps interleaved across the nodes in
the container, if node local results in hot spots or other problems],
and explicitly binding other objects of interest, if performance
considerations warrant, using the mbind() API or by using fixed or
heuristic defaults.

> 
>   Maybe a set of global policies would be useful for these
>   categories. Andy hacked subsystem memory policies into
>   shmem and it seems that we are now trying to do the same
>   for hugepages. Maybe we could get to a consistent scheme
>   here?

Christoph, I wish you wouldn't characterize Andi's shared policy
infrastructure as a hack.  I think it provides an excellent base
implementation for [shared] object-based policies.  It extends easily to
any object that can be addressed by offset [page offset, hugepage
offset, ...].  The main issue is the generic one of memory policy on
object that can be shared by processes running in separate cpusets,
whether the sharing is intentional or not.  

> 
> 2. Memory policies need to support additional constraints
> 
> - Restriction to a set of nodes. That is what we have today.

See "locality placement" below.

> 
> - Restriction to a container or cpuset. Maybe restriction
>   to a set of containers?

I don't know about a "set of containers", but perhaps you are referring
to sharing of objects between applications running in different
containers with potentially disjoint memory resources?  That is
problematic.  We need to enumerate the use cases for this and what the
desired behavior should be.

Christoph and I discussed one scenario:  backup running in a separate
cpuset, disjoint from an application that mmap()s a file shared and
installs a shared policy on it [my "mapped file policy" patches would
enable this].  If the application's cpuset contains sufficient memory
for the application's working set, but NOT enough to hold the entire
file, the backup running in another cpuset reading the entire file may
push out pages of the application from its cpuset because the object
policy constrains the pages to be located in the application's cpuset.  

> 
> - Strict vs no strict allocations. A strict allocation needs
>   to fail if the constraints cannot be met. A non strict
>   allocation can fall back.

Agreed.  And I think this needs to be explicit in the allocation
request.  Callers requesting strict allocation [including "no wait"]
should be prepared to handle failure of the allocation.
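
To make the strict/non-strict distinction concrete, here is a minimal
user-space sketch.  It is not kernel code: node_free[], alloc_on_nodes()
and ALLOC_STRICT are hypothetical names invented for this illustration,
and the "allocator" only decrements a per-node counter.  The point is
just that strictness is passed explicitly with the request and that a
strict caller sees the failure.

```c
#include <assert.h>

#define MAX_NODES 4

/* Hypothetical flag: fail rather than fall back outside the allowed set. */
#define ALLOC_STRICT 0x1u

/* Toy model: free pages remaining on each node. */
static unsigned long node_free[MAX_NODES];

/*
 * Try to take one page from a node in 'allowed' (bit n set => node n is
 * permitted).  Returns the node id used, or -1 on failure.
 */
static int alloc_on_nodes(unsigned int allowed, unsigned int flags)
{
	int n;

	assert(allowed != 0);	/* an empty constraint is a caller bug */

	for (n = 0; n < MAX_NODES; n++) {
		if ((allowed & (1u << n)) && node_free[n] > 0) {
			node_free[n]--;
			return n;
		}
	}

	if (flags & ALLOC_STRICT)
		return -1;	/* strict: the caller must handle this */

	/* Non-strict: fall back to any node that still has memory. */
	for (n = 0; n < MAX_NODES; n++) {
		if (node_free[n] > 0) {
			node_free[n]--;
			return n;
		}
	}
	return -1;
}
```

With only node 2 holding free memory, a strict request for nodes {0,1}
returns -1 and leaves node 2 untouched, while the same request without
ALLOC_STRICT is satisfied from node 2.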

> 
> - Order of allocation. Higher order pages may require
>   different allocation constraints? This is like a
>   generalization of huge page policies.

Agreed.  On our platform, I'd like to keep default huge page allocations
and interleave requests off the "hardware interleaved pseudo-node" as
that is "special" memory.  I'd like to reserve it for access only by
explicit request.  The current model doesn't support this, but I think
it could, with a few "small" enhancements. [TODO]

> 
> - Locality placement. These are node local, interleave etc.

How is this different from "restriction to a set of nodes" in the
context of memory policies [1st bullet in section 2]?  I tend to think
of memory policies--whether default or explicit--as "locality placement"
and cpusets as "constraints" or restrictions on what policies can do.

> 
> 3. Additional flags
> 
> - Automigrate flag so that memory touched by a process
>   is moved to a memory location that has best performance.

Automigration can be turned on/off in the environment--e.g., per
container/cpuset, but perhaps there is a use case for more explicit
control over automigration of specific pages of an object?

"Lazy migration" or "migrate on fault" is fairly easy to achieve atop
the existing migration infrastructure.  However, it requires a fault to
trigger the migration.  One can arrange for these faults to occur
explicitly--e.g., via a straightforward extension to mbind() with
MPOL_MF_MOVE and a new MPOL_MF_LAZY flag to remove the page translations
from all page tables resulting in a fault, and possible migration, on
next touch.  Or, one can arrange to automatically "unmap" [remove ptes
referencing] selected types of pages when the load balancer moves a task
to a new node.
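
The migrate-on-fault idea can be modelled in a few lines of user-space
C.  This is only a sketch under stated assumptions: struct page_model,
lazy_unmap() and touch_page() are hypothetical stand-ins, "unmapping" is
a boolean rather than a real pte operation, and "migration" just
rewrites the node id.  It shows the two halves described above: the
unmap step copies nothing, and the work happens on the next touch.

```c
#include <assert.h>
#include <stdbool.h>

struct page_model {
	int  node;	/* node currently holding the page */
	bool mapped;	/* is there a translation pointing at it? */
};

/* The lazy step: only drop the translations, move nothing yet. */
static void lazy_unmap(struct page_model *pages, int npages)
{
	int i;

	assert(npages >= 0);
	for (i = 0; i < npages; i++)
		pages[i].mapped = false;
}

/*
 * Touch a page from a task running on 'task_node'.  Returns true if the
 * touch faulted and the page was migrated to the task's node.
 */
static bool touch_page(struct page_model *pg, int task_node)
{
	bool migrated = false;

	if (pg->mapped)
		return false;		/* no fault, nothing to do */

	if (pg->node != task_node) {	/* misplaced: migrate on fault */
		pg->node = task_node;
		migrated = true;
	}
	pg->mapped = true;		/* re-establish the translation */
	return migrated;
}
```

Pages that are never touched again after lazy_unmap() are never moved,
which is where the saving over eager migration would come from in this
model.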

I've seen fairly dramatic reductions in real, user and system time in,
e.g., kernel builds on a heavily loaded [STREAMS benchmark running] NUMA
platform with automatic/lazy migration patches:   ~14% real, ~4.7% user
and ~22% system time reductions.

> 
> - Page order flag that determines the preferred allocation
>   order. Maybe useful in connection with the large blocksize
>   patch to control anonymous memory orders.

Agreed.  "requested page order" could be a component of policy, along
with locality.

> 
> - Replicate flags so that memory is replicated.

This could be a different policy mode, MPOL_REPLICATE.  Or, as with
Nick's prototype, it could be the default behavior for read-only access
to page cache pages when no explicit policy exists on the object [file].

For automatic, lazy replication, one would also need a fault to trigger
the replication.  This could be achieved by removing the pte from only
the calling task's page table via mbind(MOVE+LAZY) or automatically on
inter-node task migration.  The resulting fault, when that corresponding
virtual address is touched, would cause Nick's page cache replication
infrastructure to create/use a local copy of the page.  It's "on my
list" ...

>
> 4. Policy combinations
> 
> We need some way to combine policies in a systematic way. The current
> hieracy from System->cpuset->proces->memory range does not longer
> work if a process can use policies set up in shmem or huge pages.
> Some consistent scheme to combine memory policies would also need
> to be able to synthesize different policies. I.e. automigrate
> can be combined with node local or interleave and a cpuset constraint.

The big issue, here, for me, is the interaction of policy on shared
objects [shmem and shared regular file mappings] referenced from
different containers/cpusets.   Given that we want to allow this--almost
can't prevent it in the case of regular file access--we need to specify
the use cases, what the desired behavior is for each such case, and
which scenarios to optimize for.

> 
> 5. Management tools
> 
> If we make the policies more versatile then we need the proper
> management tools in user space to set and display these policies
> in such a way that they can be managed by the end user. The esoteric
> nature of memory policy semantics makes them difficult to comprehend.

/proc/<pid>/numa_maps works well [with my patches] for objects mapped
into a task's address space.  Where it doesn't work so well is for:
1) shared policy on currently unattached shmem segments and 2) shared
policy on unmapped regular files, should my patches be accepted.  [Note,
however, we need not retain shared policy on regular files after the
last shared mapping is removed--my recommended persistence model.]

> 6. GFP_xx flags may actually be considered as a form of policy

Agreed.  For kernel internal allocation requests...

> 
> i.e. GFP_THISNODE is essentially a one node cpuset.

sort of behaves like one, I agree.  Or like an explicit MPOL_BIND with a
single node.

> 
> GFP_DMA and GFP_DMA32 are physical address range constraints.

with platform specific locality implications...

> 
> GFP_HARDWALL is a strict vs. nonstrict distinction.
> 
> 
> 7. Allocators must change
> 
> Right now the policy is set by the process context which is bad because
> one cannot specify a memory policy for an allocation. It must be possible
> to pass a memory policy to the allocators and then get the memory 
> requested.

Agreed.  In my shared/mapped file policy patches, I have factored an
alloc_page_pol() function out of alloc_page_vma().  The modified
alloc_page_vma() calls get_vma_policy() [as does the current version] to
obtain the policy at the specified address in the calling task's virtual
address space or some default policy, and then calls alloc_page_pol() to
allocate a page based on that policy.  I can then use the same
alloc_page_pol() function to allocate page cache pages after looking up
a shared policy on a mapped file or using the default policy for page
cache allocations [currently process->system default].  Perhaps some of
the other page allocators could use alloc_page_pol() as well?
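
The factoring described above can be sketched as a user-space model.
Everything here is an assumption made for illustration, not the code
from the patches: the struct layouts, the _sketch function names and
their signatures are invented, and "allocating a page" is reduced to
choosing a node.  What it shows is the shape of the split: each front
end looks up a policy in its own way, then hands it to one common back
end.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8

enum pol_mode { POL_LOCAL, POL_INTERLEAVE };

struct policy {
	enum pol_mode mode;
	unsigned int nodes;	/* candidate nodes, used by POL_INTERLEAVE */
	unsigned int next;	/* interleave cursor */
};

/* Stand-ins for the objects a policy can hang off. */
struct vma  { struct policy *pol; };
struct file { struct policy *shared_pol; };

static struct policy default_policy = { POL_LOCAL, 0, 0 };

/* Common back end: pick a node purely from the policy handed in. */
static int alloc_page_pol_sketch(struct policy *pol, int local_node)
{
	unsigned int i;

	assert(pol != NULL);
	if (pol->mode == POL_LOCAL || pol->nodes == 0)
		return local_node;

	/* Round-robin over the set bits of pol->nodes. */
	for (i = 0; i < MAX_NODES; i++) {
		unsigned int n = (pol->next + i) % MAX_NODES;

		if (pol->nodes & (1u << n)) {
			pol->next = n + 1;
			return (int)n;
		}
	}
	return local_node;	/* only if no bit below MAX_NODES is set */
}

/* Front end 1: mapped memory, policy looked up via the vma. */
static int alloc_page_vma_sketch(struct vma *vma, int local_node)
{
	struct policy *pol = vma->pol ? vma->pol : &default_policy;

	return alloc_page_pol_sketch(pol, local_node);
}

/* Front end 2: page cache, policy looked up on the file. */
static int alloc_pagecache_sketch(struct file *f, int local_node)
{
	struct policy *pol = f->shared_pol ? f->shared_pol : &default_policy;

	return alloc_page_pol_sketch(pol, local_node);
}
```

In this model a slab or vmalloc front end would be one more small
wrapper that finds its policy and calls the same back end.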

> 
> 
> I wish we could come up with some universal scheme that encompasses all
> of the functionality we want and that makes memory more manageable....

I think it's possible and that the current mempolicy support can be
evolved with not too much effort.  Again, the biggest issue for me is
the reconciliation of the policies with the administrative constraints
imposed by subsetting the system via containers/cpusets--especially for
objects that can be referenced from more than one container.  I think
that any reasonable, let alone "correct", solution would be
workload/application dependent.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Some thoughts on memory policies
  2007-06-19 20:24   ` Lee Schermerhorn
@ 2007-06-19 21:23     ` Paul Jackson
  -1 siblings, 0 replies; 16+ messages in thread
From: Paul Jackson @ 2007-06-19 21:23 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: clameter, linux-mm, wli, linux-kernel

> The current memory policy APIs can work in such a "containerized"
> environment if we can reconcile the policy APIs' notion of nodes with
> the set of nodes that container allows.  Perhaps we need to revisit the
> "cpumemset" proposal that provides a separate node id namespace in each
> container/cpuset.

Currently, we (SGI) do this for our systems using user level library
code.

Even though that library code is LGPL licensed, it's still far less
widely distributed than the Linux kernel.  Container relative numbering
support directly in the kernel might make sense; though it would be
very challenging to provide that without breaking any existing API's
such as sched_setaffinity, mbind, set_mempolicy and various /proc
files that provide only system-wide numbering.

The advantage I had doing cpuset relative cpu and mem numbering in a
user library was that I could invent new API's that were numbered
relatively from day one.

So ... I'd likely be supportive of cpuset (or container) relative
numbering support in the kernel ... if someone can figure out how to do
it without breaking existing API's left and right.
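
For readers unfamiliar with relative numbering, the translation itself
is small; the hard part is the API compatibility discussed above.  A
minimal sketch, assuming a cpuset's memory nodes fit in a 64-bit mask
and that relative node k simply means "the k-th allowed node".  The
function names are hypothetical and are not taken from the SGI library.

```c
#include <assert.h>

#define MAX_NODES 64

/*
 * Translate a cpuset-relative node number to a system-wide one:
 * relative node k is the k-th set bit (counting from 0) in 'allowed'.
 * Returns -1 if the cpuset has fewer than k+1 nodes.
 */
static int rel_to_sys_node(unsigned long long allowed, int rel)
{
	int n;

	assert(rel >= 0);

	for (n = 0; n < MAX_NODES; n++) {
		if (!(allowed & (1ULL << n)))
			continue;
		if (rel-- == 0)
			return n;
	}
	return -1;
}

/* Inverse mapping; returns -1 if 'sys' is outside the cpuset. */
static int sys_to_rel_node(unsigned long long allowed, int sys)
{
	int n, rel = 0;

	if (sys < 0 || sys >= MAX_NODES || !(allowed & (1ULL << sys)))
		return -1;

	for (n = 0; n < sys; n++)
		if (allowed & (1ULL << n))
			rel++;
	return rel;
}
```

For a cpuset allowing system nodes 2, 4 and 5 (mask 0x34), relative
node 1 maps to system node 4, and system node 3 has no relative number.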

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Some thoughts on memory policies
  2007-06-19 20:24   ` Lee Schermerhorn
@ 2007-06-19 22:30     ` Christoph Lameter
  -1 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2007-06-19 22:30 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, wli, linux-kernel

On Tue, 19 Jun 2007, Lee Schermerhorn wrote:

> > - File / Socket. One may have particular reasons to place
> >   objects on a set of nodes because of how the threads of
> >   an application are spread out in the system.
> 
> ...how the tasks/threads are spread out and how the application accesses
> the pages of the objects.  Some accesses--e.g., unmapped pages of
> files--are implicit or transparent to task.  I guess any pages
> associated with a socket would also be transparent to the application as
> well?

Not sure about the exact semantics that we should have.

> > - Cpuset / Container. Some simple support is there with
> >   memory spreading today. That could be made more universal.
> 
> I've said before that I viewed cpusets as administrative contraints on
> applications, where as policies are something that can be controlled by
> the application or a non-privileged user.  As cpusets evolve into more
> general "containers", I think they'll become less visible to the
> applications running within them.  The application will see the
> container as "the system"--at least, the set of system resources to
> which the application has access.  

An application may want to access memory from various pools of memory that 
may be different containers? The containers can then be dynamically sized by 
system administrators.

> The current memory policy APIs can work in such a "containerized"
> environment if we can reconcile the policy APIs' notion of nodes with
> the set of nodes that container allows.  Perhaps we need to revisit the
> "cpumemset" proposal that provides a separate node id namespace in each
> container/cpuset.  As a minimum, I think a task should be able to query

Right.

> the set of nodes that it can use and/or have the system "do the right
> thing" if the application specifies "all possible nodes" for, say, and
> interleave policy.

I agree.
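
One possible reading of "do the right thing" is to clamp the requested
nodes to the cpuset instead of rejecting the request.  The sketch below
is only that reading, written as a user-space model: effective_nodes()
is a hypothetical helper, and the fallback to the whole allowed set
when nothing overlaps is an assumption, not existing kernel behavior.

```c
#include <assert.h>

/*
 * Clamp the nodes an application asked for to what its cpuset allows.
 * If nothing overlaps, fall back to the full allowed set instead of
 * failing (an assumed choice for this sketch).
 */
static unsigned long effective_nodes(unsigned long requested,
				     unsigned long allowed)
{
	unsigned long eff;

	assert(allowed != 0);	/* a cpuset with no memory nodes is invalid */

	eff = requested & allowed;
	return eff ? eff : allowed;
}
```

An application that passes "all possible nodes" (all bits set) for an
interleave policy would then interleave over exactly the nodes of its
cpuset.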

> > - Address range. For the virtual memory address range
> >   this is included in todays functionality but one may also
> >   want to control the physical address range to make sure
> >   f.e. that memory is allocated in an area where a device
> >   can reach it.
> 
> For application usage?  Does this mean something like an MPOL_MF_DMA
> flag?  

Mostly useful for memory policies attached to devices I think.

> > - Memory policies need to be attachable to types of pages.
> >   F.e. executable pages of a threaded application are best
> >   spread (or replicated) whereas the stack and the data may
> >   best be allocated in a node local way.
> >   Useful categories that I can think of
> >   Stack, Data, Filebacked pages, Anonymous Memory,
> >   Shared memory, Page tables, Slabs, Mlocked pages and
> >   huge pages.
> 
> Rather, I would say, to "types of objects".   I think all of the "types
> of pages" you mention [except, maybe, mlocked?] can be correlated to
> some structure/object to which policy can be attached.  Regarding
> "Mlocked pages"--are you suggesting that you might want to specify that
> mlocked pages have a different policy/locality than other pages in the
> same object?

One may not want mlocked pages to contaminate certain nodes?
 
> Christoph, I wish you wouldn't characterize Andi's shared policy
> infrastructure as a hack.  I think it provides an excellent base
> implementation for [shared] object-based policies.  It extends easily to
> any object that can be addressed by offset [page offset, hugepage
> offset, ...].  The main issue is the generic one of memory policy on
> object that can be shared by processes running in separate cpusets,
> whether the sharing is intentional or not.  

The refcount issues and the creation of vmas on the stack do suggest that 
this is not a clean implementation.

> > 4. Policy combinations
> > 
> > We need some way to combine policies in a systematic way. The current
> > hieracy from System->cpuset->proces->memory range does not longer
> > work if a process can use policies set up in shmem or huge pages.
> > Some consistent scheme to combine memory policies would also need
> > to be able to synthesize different policies. I.e. automigrate
> > can be combined with node local or interleave and a cpuset constraint.
> 
> The big issue, here, for me, is the interaction of policy on shared
> objects [shmem and shared regular file mappings] referenced from
> different containers/cpusets.   Given that we want to allow this--almost
> can't prevent it in the case of regular file access--we need to specify
> the use cases, what the desired behavior is for each such case, and
> which scenarios to optimize for.

Right, and we need some form of permissions management for policies.

> > 7. Allocators must change
> > 
> > Right now the policy is set by the process context which is bad because
> > one cannot specify a memory policy for an allocation. It must be possible
> > to pass a memory policy to the allocators and then get the memory 
> > requested.
> 
> Agreed.  In my shared/mapped file policy patches, I have factored an
> "allocate_page_pol() function out of alloc_page_vma().  The modified
> alloc_page_vma() calls get_vma_policy() [as does the current version] to
> obtain the policy at the specified address in the calling task's virtual
> address space or some default policy, and then calls alloc_page_pol() to
> allocate a page based on that policy.  I can then use the same
> alloc_page_pol() function to allocate page cache pages after looking up
> a shared policy on a mapped file or using the default policy for page
> cache allocations [currently process->system default].  Perhaps other of
> the page allocators could use alloc_page_pol() as well?

Think about how the slab allocators, uncached allocator and vmalloc could 
support policies. Somehow this needs to work in a consistent way.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Some thoughts on memory policies
  2007-06-18 20:22 ` Christoph Lameter
@ 2007-06-20  4:01   ` Paul Mundt
  -1 siblings, 0 replies; 16+ messages in thread
From: Paul Mundt @ 2007-06-20  4:01 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, wli, lee.schermerhorn, linux-kernel

On Mon, Jun 18, 2007 at 01:22:08PM -0700, Christoph Lameter wrote:
> 1. Memory policies must be attachable to a variety of objects
> 
> - System policies. The system policy is currently not
>   modifiable. It may be useful to be able to set this.
>   Small NUMA systems may want to run with interleave by default 
> 
For small systems there are a number of things that could be done for
this. With the interleave map for system init dynamically created, we can
make a reasonable guess about whether we want to use interleave as a
default policy or not if the node map is considerably different from
the online map (or the node_memory_map in -mm).

If the system policy only makes sense as interleave or default, it might
make sense simply to have a sysctl for this (the sysctl handler could
rebalance the interleave map when switching to handle offline nodes
coming online later, for example).

> - Memory policies need to be attachable to types of pages.
>   F.e. executable pages of a threaded application are best
>   spread (or replicated) whereas the stack and the data may
>   best be allocated in a node local way.

That would be nice, but one would also have to be able to restrict
the range of nodes to replicate across when applications know their
worst-case locality. Perhaps some of the cpuset work could be generalized
for this?

> 2. Memory policies need to support additional constraints
> 
> - Restriction to a set of nodes. That is what we have today.
> 
> - Restriction to a container or cpuset. Maybe restriction
>   to a set of containers?
> 
Having memory policies per container or cpuset would be nice to have,
but this seems like it would get pretty messy with nested cpusets that
contain overlapping memory nodes?

The other question is whether tasks residing under a cpuset with an
established memory policy are allowed to mbind() outside of the cpuset
policy constraints. Spreading of page and slab cache pages seems to
already side-step constraints.

> 7. Allocators must change
> 
> Right now the policy is set by the process context which is bad because
> one cannot specify a memory policy for an allocation. It must be possible
> to pass a memory policy to the allocators and then get the memory 
> requested.
> 
Some policy hints can already be determined from the gfpflags, perhaps
it's worth expanding on this? If these sorts of things have to be handled
by devices, one has to assume that the device may not always be running
in the same configuration or system, so an explicit policy would simply
cause more trouble.

> I wish we could come up with some universal scheme that encompasses all
> of the functionality we want and that makes memory more manageable....
> 
There's quite a bit of room for improving and extending the existing
code, and those options should likely be exhausted first.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Some thoughts on memory policies
  2007-06-20  4:01   ` Paul Mundt
@ 2007-06-20  5:08     ` Christoph Lameter
  -1 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2007-06-20  5:08 UTC (permalink / raw)
  To: Paul Mundt; +Cc: linux-mm, wli, lee.schermerhorn, linux-kernel

On Wed, 20 Jun 2007, Paul Mundt wrote:

> There's quite a bit of room for improving and extending the existing
> code, and those options should likely be exhausted first.

There is a confusing maze of special rules if one goes beyond the simple 
process address space case. There are no clean rules on how to combine 
memory policies. Refcounting / updating becomes a problem because policies 
are intended to be updated only from the process that set them up. Look at 
the gimmicks that Paul needed to do to update memory policies when a 
process is migrated, and at the vmas on the stack for shmem, etc.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Some thoughts on memory policies
  2007-06-18 20:22 ` Christoph Lameter
@ 2007-06-20 12:30   ` Andi Kleen
  -1 siblings, 0 replies; 16+ messages in thread
From: Andi Kleen @ 2007-06-20 12:30 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, wli, lee.schermerhorn, linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> I think we are getting into more and more of a mess with the existing 
> memory policies. The refcount issue with shmem is just one bad symptom of 
> it. 

That's easy to fix by just freeing the mpol via RCU
and not using a reference count for this window. I'll
send a patch soon.

> Memory policies were intended to be process based

Not true, e.g. shmem is a good counterexample. The kernel
also has its own policies.

> and 
> taking them out of that context causes various issues.

My primary concern is if there is a good user interface
that you can actually explain to a normal sysadmin and can
be used relatively race free. Many of the proposals I've
seen earlier failed these tests.

> - Device drivers may need to control memory allocations in
>   devices either because their DMA engines can only reach a
>   subsection of the system

On what system does this happen? Not sure you can call that
a "coherent system"

> or because memory transfer
>   performance is superior in certain nodes of the system.

Most architectures already give a sensible default for the coherent DMA
mappings (the node the device is attached to). For the others it is
really not under driver control.

> - File / Socket. One may have particular reasons to place
>   objects on a set of nodes because of how the threads of
>   an application are spread out in the system.

The default right now seems reasonable to me. e.g. network
devices typically allocate on the node their interrupt
is assigned to. If you bind the application to that node
then you'll have everything local.

Now figuring out how to do this automatically without
explicit configuration would be great, but I don't
know of a really general solution (especially when
you consider MSI-X hash based load balancing). Ok, the
scheduler already does a little work in this direction by
nudging processes towards the CPU that their wakeups come
from; perhaps this could be made a little stronger.

Arguably irqbalanced needs to be more NUMA aware, 
but that's not really a kernel issue.

But frankly I wouldn't see the value of more explicit
configuration here.

> - Cpuset / Container. Some simple support is there with
>   memory spreading today. That could be made more universal.

Agreed.
> 
> - System policies. The system policy is currently not
>   modifiable. It may be useful to be able to set this.
>   Small NUMA systems may want to run with interleave by default 

Yes we need page cache policy. That's easy to do though.

> - Address range. For the virtual memory address range
>   this is included in todays functionality but one may also
>   want to control the physical address range to make sure
>   f.e. that memory is allocated in an area where a device
>   can reach it.

Why?  Where do we have such broken devices that cannot
DMA everywhere?  If they're really that broken they
probably deserve to be slow (or rather use double buffering,
not DMA)

Also, controlling from the device where the submitted
data is placed is difficult unless you bind processes. If you do,
it just works, but if you don't want to (for most cases
explicit binding is bad) it is hard.

I would be definitely opposed to anything that exposes
addresses as user interface.

> - Memory policies need to be attachable to types of pages.
>   F.e. executable pages of a threaded application are best
>   spread (or replicated) 

There are some experimental patches for text replication.
I used to think they were probably not needed, but there
are now some benchmark results that show they're a good
idea for some workloads.

This should probably be investigated. I think Nick P. was looking
at it.

>   whereas the stack and the data may
>   best be allocated in a node local way.
>   Useful categories that I can think of
>   Stack, Data, Filebacked pages, Anonymous Memory,
>   Shared memory, Page tables, Slabs, Mlocked pages and
>   huge pages.

My experience so far with user feedback is that most
users only use the barest basics of NUMA policy and they
rarely use anything more advanced. For anything complicated
you need a very very good justification. 

> 
>   Maybe a set of global policies would be useful for these
>   categories. Andy hacked subsystem memory policies into
>   shmem and it seems that we are now trying to do the same
>   for hugepages.

It's already there for huge pages if you look at the code 
(I was confused earlier when I claimed it wasn't) 

For page cache that is not mmapped I agree it's useful.
But I suspect a couple of sysctls would do fine here
(SLES9 had something like this for page cache as a sysctl) 

> 2. Memory policies need to support additional constraints
> 
> - Restriction to a set of nodes. That is what we have today.
> 
> - Restriction to a container or cpuset. Maybe restriction
>   to a set of containers?

Why?

> 
> - Strict vs no strict allocations. A strict allocation needs
>   to fail if the constraints cannot be met. A non strict
>   allocation can fall back.

That's already there -- that's the difference between PREFERRED
and BIND.
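
To make the distinction concrete, here is a toy userspace model
(not kernel code) of the two behaviours. In the kernel the modes are
MPOL_PREFERRED and MPOL_BIND; the node count, the free-page numbers
and the names take_page(), alloc_preferred() and alloc_bind() below
are made up for illustration only.

```c
#define NR_NODES 4

/* Toy model: free pages remaining on each node (hypothetical numbers). */
static int free_pages[NR_NODES] = { 0, 2, 1, 0 };

/* Take one page from a node; returns the node id, or -1 if none is left. */
static int take_page(int node)
{
	if (node < 0 || node >= NR_NODES || free_pages[node] == 0)
		return -1;
	free_pages[node]--;
	return node;
}

/* Preferred-style (non strict): try one node first, then fall back. */
static int alloc_preferred(int preferred)
{
	int node = take_page(preferred);

	for (int i = 0; node < 0 && i < NR_NODES; i++)
		node = take_page(i);
	return node;
}

/* Bind-style (strict): only nodes in the mask may be used, else fail. */
static int alloc_bind(unsigned int nodemask)
{
	for (int i = 0; i < NR_NODES; i++) {
		if (nodemask & (1u << i)) {
			int node = take_page(i);

			if (node >= 0)
				return node;
		}
	}
	return -1;
}
```

With node 0 empty, alloc_preferred(0) quietly lands on another node,
while alloc_bind(1u << 0) fails. That is the strict vs. non strict
difference in this simplified model.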

> 
> - Order of allocation. Higher order pages may require

What higher order pages?  Right now they're only 
in hugetlbfs.

Regarding your page cache proposal: I think it's a bad
idea, larger soft page sizes would be better.

> - Automigrate flag so that memory touched by a process
>   is moved to a memory location that has best performance.

Hmm, possible. Do we actually have users for this though? 

> - Page order flag that determines the preferred allocation
>   order. Maybe useful in connection with the large blocksize
>   patch to control anonymous memory orders.

Not sure I see the point of this.

> 4. Policy combinations
> 
> We need some way to combine policies in a systematic way. The current
> hierarchy from System->cpuset->process->memory range no longer
> work if a process can use policies set up in shmem or huge pages.
> Some consistent scheme to combine memory policies would also need
> to be able to synthesize different policies. I.e. automigrate
> can be combined with node local or interleave and a cpuset constraint.

Maybe.
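
As a way to picture what a "systematic" combination could look like,
here is a small sketch of one possible rule. It is purely an
assumption for discussion, not what the kernel does: the most
specific level that sets a mode wins, and the node masks of all
levels are intersected as constraints. All type and function names
are hypothetical.

```c
#include <stddef.h>

enum mode { MODE_DEFAULT, MODE_NODE_LOCAL, MODE_INTERLEAVE };

struct policy {
	enum mode mode;      /* MODE_DEFAULT: no mode set at this level */
	unsigned int nodes;  /* bitmask of nodes this level allows */
};

/* Levels ordered from least to most specific. */
enum level { LVL_SYSTEM, LVL_CPUSET, LVL_PROCESS, LVL_RANGE, NR_LEVELS };

/*
 * Combine the per-level policies into one effective policy.
 * A NULL entry means that level has no policy attached.
 */
static struct policy effective_policy(const struct policy *levels[NR_LEVELS])
{
	struct policy eff = { MODE_NODE_LOCAL, ~0u };

	for (int i = 0; i < NR_LEVELS; i++) {
		if (!levels[i])
			continue;
		if (levels[i]->mode != MODE_DEFAULT)
			eff.mode = levels[i]->mode;  /* more specific overrides */
		eff.nodes &= levels[i]->nodes;       /* constraints accumulate */
	}
	return eff;
}
```

Under this rule an interleave policy on a memory range still ends up
restricted to the nodes of the enclosing cpuset. Whether such a rule
can be explained to users is a separate question.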

> The esoteric
> nature of memory policy semantics makes them difficult to comprehend.

Exactly.  It doesn't make sense to implement if you can't
give it a good interface.

> 7. Allocators must change
> 
> Right now the policy is set by the process context which is bad because
> one cannot specify a memory policy for an allocation. It must be possible
> to pass a memory policy to the allocators and then get the memory 
> requested.

We already can allocate on a node. If there is really demand
we could also expose interleaved allocations, but again
we would need a good user.

Not sure it is useful for sl[aou]b.

-Andi

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Some thoughts on memory policies
  2007-06-20 12:30   ` Andi Kleen
@ 2007-06-20 16:51     ` Christoph Lameter
  -1 siblings, 0 replies; 16+ messages in thread
From: Christoph Lameter @ 2007-06-20 16:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, wli, lee.schermerhorn, linux-kernel

On Wed, 20 Jun 2007, Andi Kleen wrote:

> > - Device drivers may need to control memory allocations in
> >   devices either because their DMA engines can only reach a
> >   subsection of the system
> 
> On what system does this happen? Not sure you can call that
> a "coherent system"

This is an issue, for example, on DSM systems where memory is virtualized by 
transferring pages of memory on request. We are also getting into embedded 
systems using NUMA for a variety of reasons. Devices that can only reach
a subset of memory may come about there.

> > or because memory transfer
> >   performance is superior in certain nodes of the system.
> 
> Most architectures already give a sensible default for the coherent DMA
> mappings (node the device is attached to) For the others it is
> really not under driver control.

Yes I did some of that work. But that is still from the perspective of the 
system as a whole. A device may have unique locality requirements. F.e. it 
may be able to interweave memory transfers from multiple nodes for optimal 
performance.

> Also controlling from the device where the submitted
> data is difficult unless you bind processes. If you do 
> it just works, but if you don't want to (for most cases
> explicit binding is bad) it is hard.

It won't be difficult if the device has

1. A node number
2. An allocation policy

Then the allocation must be done as if we were on that node.
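
A rough sketch of what I mean, as a userspace toy rather than kernel
code. The struct, its fields and dev_pick_node() are hypothetical
names for illustration. The point is only that the node choice comes
from the device context and never looks at the node of the calling
process.

```c
enum dev_policy { DEV_LOCAL, DEV_INTERLEAVE };

/* Hypothetical per-device allocation context. */
struct dev_mem_ctx {
	int node;                /* 1. node the device is attached to */
	enum dev_policy policy;  /* 2. allocation policy of the device */
	unsigned int nodes;      /* nodes the device can reach (bitmask) */
	unsigned int next;       /* interleave cursor */
};

/* Pick the node for the next allocation made on behalf of the device. */
static int dev_pick_node(struct dev_mem_ctx *ctx, int nr_nodes)
{
	if (ctx->policy == DEV_LOCAL)
		return ctx->node;

	/* Interleave: round robin over the nodes the device can reach. */
	for (int tries = 0; tries < nr_nodes; tries++) {
		int n = ctx->next++ % nr_nodes;

		if (ctx->nodes & (1u << n))
			return n;
	}
	return ctx->node;  /* empty mask: fall back to the device node */
}
```

A device that can only reach nodes 0 and 2 would then alternate
between those two nodes regardless of where the submitting process
happens to run.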

> I would be definitely opposed to anything that exposes
> addresses as user interface.

Well, it's more the device driver telling the system where the stuff ought 
to be best located.

> > 2. Memory policies need to support additional constraints
> > 
> > - Restriction to a set of nodes. That is what we have today.
> > 
> > - Restriction to a container or cpuset. Maybe restriction
> >   to a set of containers?
> 
> Why?

Because the sysadmin can set the containers up in a flexible way. Maybe we 
want to segment a node into a couple of 100MB chunks and give various apps
access to it?

> > - Strict vs no strict allocations. A strict allocation needs
> >   to fail if the constraints cannot be met. A non strict
> >   allocation can fall back.
> 
> > That's already there -- that's the difference between PREFERRED
> > and BIND.

But it's not available for interleave, f.e.

> Regarding your page cache proposal: I think it's a bad
> idea, larger soft page sizes would be better.

I am not sure what you are talking about.

> > The esoteric
> > nature of memory policy semantics makes them difficult to comprehend.
> 
> Exactly.  It doesn't make sense to implement if you can't
> give it a good interface.

Right, we need a clean interface and something that works in such a way 
that people can understand it. The challenge is to boil down something 
complex to a few simple mechanisms.

> > 7. Allocators must change
> > 
> > Right now the policy is set by the process context which is bad because
> > one cannot specify a memory policy for an allocation. It must be possible
> > to pass a memory policy to the allocators and then get the memory 
> > requested.
> 
> We already can allocate on a node. If there is really demand
> we could also expose interleaved allocations, but again
> we would need a good user.

We have these bad hacks for shmem and for hugetlb where we have to set 
policies in the context by creating a fake vma in order to get policy 
applied.

If we want to allocate for a device then the device is the context and not 
the process, same thing for shmem and hugetlb.

> Not sure it is useful for sl[aou]b.

If we do this then it needs to be consistently supported by the 
allocators. Meaning the slab allocators would have to support a call where 
you can pass a policy in and then objects need to be served in conformity 
with that policy.
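
To illustrate the shape of such a call, here is a minimal userspace
sketch. It is not a proposal for the actual slab API: the names
alloc_policy, object and obj_alloc_policy() are invented, and
malloc() merely stands in for a per-node object cache. What matters
is that the policy is an explicit argument instead of being taken
from the process context.

```c
#include <stdlib.h>

#define NR_NODES 4

enum pol_mode { POL_BIND, POL_INTERLEAVE };

/* Hypothetical policy object handed to the allocator by the caller. */
struct alloc_policy {
	enum pol_mode mode;
	unsigned int nodes;  /* allowed nodes (bitmask) */
	unsigned int next;   /* interleave cursor */
};

struct object {
	int node;            /* node this object was "served" from */
	char payload[32];
};

/* Serve one object in conformity with the policy passed in. */
static struct object *obj_alloc_policy(struct alloc_policy *pol)
{
	struct object *obj;
	int node = -1;

	for (int i = 0; i < NR_NODES; i++) {
		/* Interleave starts at the cursor, bind at node 0. */
		int n = (pol->mode == POL_INTERLEAVE) ?
			(int)((pol->next + i) % NR_NODES) : i;

		if (pol->nodes & (1u << n)) {
			node = n;
			break;
		}
	}
	if (node < 0)
		return NULL;  /* policy cannot be satisfied */
	if (pol->mode == POL_INTERLEAVE)
		pol->next = node + 1;

	obj = malloc(sizeof(*obj));
	if (obj)
		obj->node = node;
	return obj;
}
```

The same call would then serve shmem, hugetlb or a device without
having to fake up a vma to carry the policy.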

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2007-06-20 16:51 UTC | newest]

Thread overview: 16+ messages
2007-06-18 20:22 Some thoughts on memory policies Christoph Lameter
2007-06-18 20:22 ` Christoph Lameter
2007-06-19 20:24 ` Lee Schermerhorn
2007-06-19 20:24   ` Lee Schermerhorn
2007-06-19 21:23   ` Paul Jackson
2007-06-19 21:23     ` Paul Jackson
2007-06-19 22:30   ` Christoph Lameter
2007-06-19 22:30     ` Christoph Lameter
2007-06-20  4:01 ` Paul Mundt
2007-06-20  4:01   ` Paul Mundt
2007-06-20  5:08   ` Christoph Lameter
2007-06-20  5:08     ` Christoph Lameter
2007-06-20 12:30 ` Andi Kleen
2007-06-20 12:30   ` Andi Kleen
2007-06-20 16:51   ` Christoph Lameter
2007-06-20 16:51     ` Christoph Lameter