* Some thoughts on memory policies
From: Christoph Lameter @ 2007-06-18 20:22 UTC
To: linux-mm, wli, lee.schermerhorn; +Cc: linux-kernel

I think we are getting into more and more of a mess with the existing
memory policies. The refcount issue with shmem is just one bad symptom of
it. Memory policies were intended to be process based and
taking them out of that context causes various issues.

I have thought for a long time that we need something to replace memory
policies especially since the requirements on memory policies go far
beyond just being process based. So some requirements and ideas about
memory policies.

1. Memory policies must be attachable to a variety of objects

 - Device drivers may need to control memory allocations in
   devices either because their DMA engines can only reach a
   subsection of the system or because memory transfer
   performance is superior in certain nodes of the system.

 - Process. This is the classic usage scenario.

 - File / Socket. One may have particular reasons to place
   objects on a set of nodes because of how the threads of
   an application are spread out in the system.

 - Cpuset / Container. Some simple support is there with
   memory spreading today. That could be made more universal.

 - System policies. The system policy is currently not
   modifiable. It may be useful to be able to set this.
   Small NUMA systems may want to run with interleave by default.

 - Address range. For the virtual memory address range
   this is included in today's functionality but one may also
   want to control the physical address range to make sure,
   e.g., that memory is allocated in an area where a device
   can reach it.

 - Memory policies need to be attachable to types of pages.
   E.g., executable pages of a threaded application are best
   spread (or replicated) whereas the stack and the data may
   best be allocated in a node-local way.
   Useful categories that I can think of:
   Stack, Data, Filebacked pages, Anonymous Memory,
   Shared memory, Page tables, Slabs, Mlocked pages and
   huge pages.

   Maybe a set of global policies would be useful for these
   categories. Andi hacked subsystem memory policies into
   shmem and it seems that we are now trying to do the same
   for hugepages. Maybe we could get to a consistent scheme
   here?

2. Memory policies need to support additional constraints

 - Restriction to a set of nodes. That is what we have today.

 - Restriction to a container or cpuset. Maybe restriction
   to a set of containers?

 - Strict vs. non-strict allocations. A strict allocation needs
   to fail if the constraints cannot be met. A non-strict
   allocation can fall back.

 - Order of allocation. Higher order pages may require
   different allocation constraints? This is like a
   generalization of huge page policies.

 - Locality placement. These are node local, interleave etc.

3. Additional flags

 - Automigrate flag so that memory touched by a process
   is moved to a memory location that has best performance.

 - Page order flag that determines the preferred allocation
   order. Maybe useful in connection with the large blocksize
   patch to control anonymous memory orders.

 - Replicate flags so that memory is replicated.

4. Policy combinations

We need some way to combine policies in a systematic way. The current
hierarchy from System->cpuset->process->memory range no longer
works if a process can use policies set up in shmem or huge pages.
Some consistent scheme to combine memory policies would also need
to be able to synthesize different policies. I.e. automigrate
can be combined with node local or interleave and a cpuset constraint.

5. Management tools

If we make the policies more versatile then we need the proper
management tools in user space to set and display these policies
in such a way that they can be managed by the end user. The esoteric
nature of memory policy semantics makes them difficult to comprehend.

6. GFP_xx flags may actually be considered as a form of policy

i.e. GFP_THISNODE is essentially a one node cpuset.

GFP_DMA and GFP_DMA32 are physical address range constraints.

GFP_HARDWALL is a strict vs. non-strict distinction.

7. Allocators must change

Right now the policy is set by the process context which is bad because
one cannot specify a memory policy for an allocation. It must be possible
to pass a memory policy to the allocators and then get the memory
requested.

I wish we could come up with some universal scheme that encompasses all
of the functionality we want and that makes memory more manageable....
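
The strict vs. non-strict distinction from sections 2 and 6 can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: toy_system, toy_constraint, toy_take_page() and toy_alloc_page() are hypothetical names invented for the example, and "allocation" just decrements a per-node counter.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_NODES 8

/* Hypothetical toy model: number of free pages on each NUMA node. */
struct toy_system {
    unsigned int nr_nodes;
    unsigned long free_pages[TOY_MAX_NODES];
};

/* Hypothetical constraint: a node mask plus a strict/non-strict flag. */
struct toy_constraint {
    unsigned long allowed_nodes; /* bit n set => node n is permitted */
    bool strict;                 /* true: fail rather than fall back */
};

/* Take one page from the lowest-numbered node in 'mask' that still has
 * free pages. Returns the node id, or -1 if no node in the mask can
 * satisfy the request. */
int toy_take_page(struct toy_system *sys, unsigned long mask)
{
    for (unsigned int n = 0; n < sys->nr_nodes; n++) {
        if ((mask & (1UL << n)) && sys->free_pages[n] > 0) {
            sys->free_pages[n]--;
            return (int)n;
        }
    }
    return -1;
}

/* Allocate one page under a constraint. A strict request fails (-1)
 * when the permitted nodes are exhausted; a non-strict request falls
 * back to any node outside the mask. */
int toy_alloc_page(struct toy_system *sys, const struct toy_constraint *c)
{
    assert(sys->nr_nodes <= TOY_MAX_NODES);

    int node = toy_take_page(sys, c->allowed_nodes);
    if (node >= 0 || c->strict)
        return node;

    return toy_take_page(sys, ~c->allowed_nodes);
}
```

In this model a GFP_THISNODE-like request would correspond to a single-bit mask with strict set, so the caller has to be prepared for a -1 result.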
* Re: Some thoughts on memory policies
From: Lee Schermerhorn @ 2007-06-19 20:24 UTC
To: Christoph Lameter; +Cc: linux-mm, wli, linux-kernel

On Mon, 2007-06-18 at 13:22 -0700, Christoph Lameter wrote:
> I think we are getting into more and more of a mess with the existing
> memory policies. The refcount issue with shmem is just one bad symptom of
> it. Memory policies were intended to be process based and
> taking them out of that context causes various issues.

I don't think memory policies are in as much of a mess as Christoph seems
to. Perhaps this is my ignorance showing. Certainly, there are issues to
be addressed--especially in the interaction of memory policies with
containers, such as cpusets. The shmem refcount issue may be one of
these issues--not sure how "bad" it is.

I agree that the "process memory policy"--i.e., the one set by
set_mempolicy()--is "process based", but I don't see the system default
policy as process based. The system default policy is, currently, the
policy of last resort for any allocation. And, [as I've discussed with
Christoph], I view policies applied via mbind() as applying to [some
range of] the "memory object" mapped at a specific address range. I
admit that this view is somewhat muddied by the fact that [private]
anonymous segments don't have any actual kernel structure to represent
them outside of a process's various anonymous VMAs and page table [and
sometimes the swap cache]; and by the fact that the kernel currently
ignores policy that one attempts to place on shared regular file
mappings. However, I think "object-based policy" is a natural extension
of the current API and easily implemented with the current
infrastructure.
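
The three levels described above (a policy bound to an address range via mbind(), the process policy set by set_mempolicy(), and the system default as the policy of last resort) can be pictured as a simple fallback lookup. What follows is a minimal user-space sketch under that reading, not the kernel's implementation; toy_policy, toy_range, toy_task and toy_lookup_policy() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum toy_mode { TOY_DEFAULT, TOY_PREFERRED, TOY_BIND, TOY_INTERLEAVE };

/* Hypothetical policy: a mode plus a node mask. */
struct toy_policy {
    enum toy_mode mode;
    unsigned long nodes;
};

/* Hypothetical stand-in for a policy bound to [start, end) via mbind(). */
struct toy_range {
    unsigned long start, end;
    const struct toy_policy *policy;
};

/* Hypothetical task: an optional process policy (as if set by
 * set_mempolicy()) and a table of range policies. */
struct toy_task {
    const struct toy_policy *task_policy; /* NULL if never set */
    const struct toy_range *ranges;
    size_t nr_ranges;
};

/* Policy of last resort for any allocation in this model. */
static const struct toy_policy toy_system_default = { TOY_DEFAULT, 0 };

/* Resolve the policy for 'addr': range policy first, then the process
 * policy, then the system default. */
const struct toy_policy *toy_lookup_policy(const struct toy_task *task,
                                           unsigned long addr)
{
    assert(task != NULL);

    for (size_t i = 0; i < task->nr_ranges; i++) {
        const struct toy_range *r = &task->ranges[i];

        if (addr >= r->start && addr < r->end && r->policy)
            return r->policy;
    }
    if (task->task_policy)
        return task->task_policy;
    return &toy_system_default;
}
```

The model deliberately leaves out cpuset constraints and shared (object-based) policies, which are exactly the cases the thread argues do not fit a simple linear lookup.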

>
> I have thought for a long time that we need something to replace memory
> policies especially since the requirements on memory policies go far
> beyond just being process based. So some requirements and ideas about
> memory policies.

Listing the requirements is a great idea. But I won't go so far as to
agree that we need to "replace memory policies"; rather, I think we need
to rationalize them for all the desired uses/contexts...

>
> 1. Memory policies must be attachable to a variety of objects
>
>  - Device drivers may need to control memory allocations in
>    devices either because their DMA engines can only reach a
>    subsection of the system or because memory transfer
>    performance is superior in certain nodes of the system.
>
>  - Process. This is the classic usage scenario
>
>  - File / Socket. One may have particular reasons to place
>    objects on a set of nodes because of how the threads of
>    an application are spread out in the system.

...how the tasks/threads are spread out and how the application accesses
the pages of the objects. Some accesses--e.g., unmapped pages of
files--are implicit or transparent to the task. I guess any pages
associated with a socket would also be transparent to the application?

>
>  - Cpuset / Container. Some simple support is there with
>    memory spreading today. That could be made more universal.

I've said before that I viewed cpusets as administrative constraints on
applications, whereas policies are something that can be controlled by
the application or a non-privileged user. As cpusets evolve into more
general "containers", I think they'll become less visible to the
applications running within them. The application will see the
container as "the system"--at least, the set of system resources to
which the application has access.

The current memory policy APIs can work in such a "containerized"
environment if we can reconcile the policy APIs' notion of nodes with
the set of nodes that the container allows. Perhaps we need to revisit
the "cpumemset" proposal that provides a separate node id namespace in
each container/cpuset. As a minimum, I think a task should be able to
query the set of nodes that it can use and/or have the system "do the
right thing" if the application specifies "all possible nodes" for, say,
an interleave policy.

>
>  - System policies. The system policy is currently not
>    modifiable. It may be useful to be able to set this.
>    Small NUMA systems may want to run with interleave by default

Agreed. And, on our platforms, it would be useful to have a separately
specifiable system-wide [or container-wide] default page cache policy.

>
>  - Address range. For the virtual memory address range
>    this is included in today's functionality but one may also
>    want to control the physical address range to make sure
>    e.g. that memory is allocated in an area where a device
>    can reach it.

For application usage? Does this mean something like an MPOL_MF_DMA
flag?

One way to handle this w/o an explicit "DMA" flag for user-space APIs is
to mmap() the device that would use the memory and allow the device
driver to allocate the memory internally with the appropriate DMA/32
flags and map that memory into the task's address space. I think that
works today.

What other usage scenarios are you thinking of?

>
>  - Memory policies need to be attachable to types of pages.
>    E.g. executable pages of a threaded application are best
>    spread (or replicated) whereas the stack and the data may
>    best be allocated in a node local way.
>    Useful categories that I can think of
>    Stack, Data, Filebacked pages, Anonymous Memory,
>    Shared memory, Page tables, Slabs, Mlocked pages and
>    huge pages.

Rather, I would say, to "types of objects". I think all of the "types
of pages" you mention [except, maybe, mlocked?] can be correlated to
some structure/object to which policy can be attached.
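
One possible reading of having the system "do the right thing" when a task requests interleave over "all possible nodes" inside a container is to interleave over the intersection of the requested nodes and the nodes the container allows. The sketch below models only that reading. It is an assumption for illustration, not how cpusets or the mempolicy code behave; toy_interleave_next() and TOY_MAX_NODE_IDS are hypothetical names.

```c
#include <assert.h>

#define TOY_MAX_NODE_IDS 64

/* Pick the next interleave node after 'prev' (pass -1 to start) from
 * the nodes that are both requested by the policy and allowed by the
 * container. Wraps around after the highest node. Returns -1 if the
 * two masks have no node in common. */
int toy_interleave_next(unsigned long long requested,
                        unsigned long long allowed, int prev)
{
    assert(prev >= -1 && prev < TOY_MAX_NODE_IDS);

    unsigned long long effective = requested & allowed;
    if (effective == 0)
        return -1;

    /* Walk at most one full circle, starting just after 'prev'. */
    for (int step = 1; step <= TOY_MAX_NODE_IDS; step++) {
        int node = (prev + step) % TOY_MAX_NODE_IDS;

        if (effective & (1ULL << node))
            return node;
    }
    return -1; /* not reached: 'effective' is non-zero */
}
```

With this approach a task that asks for every node simply cycles through the nodes its container owns, without needing to know the container's node ids. A per-container node id namespace, as in the "cpumemset" proposal mentioned above, would be a different design that this sketch does not attempt to model.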

Regarding "Mlocked pages"--are you suggesting that you might want to
specify that mlocked pages have a different policy/locality than other
pages in the same object?

Stack and data/heap can easily be handled by always defaulting the
process policy to node local [or perhaps interleaved across the nodes in
the container, if node local results in hot spots or other problems],
and explicitly binding other objects of interest, if performance
considerations warrant, using the mbind() API or by using fixed or
heuristic defaults.

>
>    Maybe a set of global policies would be useful for these
>    categories. Andi hacked subsystem memory policies into
>    shmem and it seems that we are now trying to do the same
>    for hugepages. Maybe we could get to a consistent scheme
>    here?

Christoph, I wish you wouldn't characterize Andi's shared policy
infrastructure as a hack. I think it provides an excellent base
implementation for [shared] object-based policies. It extends easily to
any object that can be addressed by offset [page offset, hugepage
offset, ...]. The main issue is the generic one of memory policy on
objects that can be shared by processes running in separate cpusets,
whether the sharing is intentional or not.

>
> 2. Memory policies need to support additional constraints
>
>  - Restriction to a set of nodes. That is what we have today.

See "locality placement" below.

>
>  - Restriction to a container or cpuset. Maybe restriction
>    to a set of containers?

I don't know about a "set of containers", but perhaps you are referring
to sharing of objects between applications running in different
containers with potentially disjoint memory resources? That is
problematic. We need to enumerate the use cases for this and what the
desired behavior should be.

Christoph and I discussed one scenario: backup running in a separate
cpuset, disjoint from an application that mmap()s a file shared and
installs a shared policy on it [my "mapped file policy" patches would
enable this]. If the application's cpuset contains sufficient memory
for the application's working set, but NOT enough to hold the entire
file, the backup running in another cpuset reading the entire file may
push out pages of the application from its cpuset because the object
policy constrains the pages to be located in the application's cpuset.

>
>  - Strict vs. non-strict allocations. A strict allocation needs
>    to fail if the constraints cannot be met. A non-strict
>    allocation can fall back.

Agreed. And I think this needs to be explicit in the allocation
request. Callers requesting strict allocation [including "no wait"]
should be prepared to handle failure of the allocation.

>
>  - Order of allocation. Higher order pages may require
>    different allocation constraints? This is like a
>    generalization of huge page policies.

Agreed. On our platform, I'd like to keep default huge page allocations
and interleave requests off the "hardware interleaved pseudo-node" as
that is "special" memory. I'd like to reserve it for access only by
explicit request. The current model doesn't support this, but I think
it could, with a few "small" enhancements. [TODO]

>
>  - Locality placement. These are node local, interleave etc.

How is this different from "restriction to a set of nodes" in the
context of memory policies [1st bullet in section 2]? I tend to think
of memory policies--whether default or explicit--as "locality placement"
and cpusets as "constraints" or restrictions on what policies can do.

>
> 3. Additional flags
>
>  - Automigrate flag so that memory touched by a process
>    is moved to a memory location that has best performance.

Automigration can be turned on/off in the environment--e.g., per
container/cpuset, but perhaps there is a use case for more explicit
control over automigration of specific pages of an object?

"Lazy migration" or "migrate on fault" is fairly easy to achieve atop
the existing migration infrastructure. However, it requires a fault to
trigger the migration.
One can arrange for these faults to occur explicitly--e.g., via a
straightforward extension to mbind() with MPOL_MF_MOVE and a new
MPOL_MF_LAZY flag to remove the page translations from all page tables,
resulting in a fault, and possible migration, on next touch. Or, one
can arrange to automatically "unmap" [remove ptes referencing] selected
types of pages when the load balancer moves a task to a new node. I've
seen fairly dramatic reductions in real, user and system time in, e.g.,
kernel builds on a heavily loaded [STREAMS benchmark running] NUMA
platform with automatic/lazy migration patches: ~14% real, ~4.7% user
and ~22% system time reductions.

>
>  - Page order flag that determines the preferred allocation
>    order. Maybe useful in connection with the large blocksize
>    patch to control anonymous memory orders.

Agreed. "Requested page order" could be a component of policy, along
with locality.

>
>  - Replicate flags so that memory is replicated.

This could be a different policy mode, MPOL_REPLICATE. Or, as with
Nick's prototype, it could be the default behavior for read-only access
to page cache pages when no explicit policy exists on the object [file].

For automatic, lazy replication, one would also need a fault to trigger
the replication. This could be achieved by removing the pte from only
the calling task's page table via mbind(MOVE+LAZY) or automatically on
inter-node task migration. The resulting fault, when that corresponding
virtual address is touched, would cause Nick's page cache replication
infrastructure to create/use a local copy of the page. It's "on my
list" ...

>
> 4. Policy combinations
>
> We need some way to combine policies in a systematic way. The current
> hierarchy from System->cpuset->process->memory range no longer
> works if a process can use policies set up in shmem or huge pages.
> Some consistent scheme to combine memory policies would also need
> to be able to synthesize different policies. I.e. automigrate
> can be combined with node local or interleave and a cpuset constraint.

The big issue, here, for me, is the interaction of policy on shared
objects [shmem and shared regular file mappings] referenced from
different containers/cpusets. Given that we want to allow this--almost
can't prevent it in the case of regular file access--we need to specify
the use cases, what the desired behavior is for each such case, and
which scenarios to optimize for.

>
> 5. Management tools
>
> If we make the policies more versatile then we need the proper
> management tools in user space to set and display these policies
> in such a way that they can be managed by the end user. The esoteric
> nature of memory policy semantics makes them difficult to comprehend.

/proc/<pid>/numa_maps works well [with my patches] for objects mapped
into a task's address space. Where it doesn't work so well: 1) shared
policy on currently unattached shmem segments and 2) shared policy on
unmapped regular files, should my patches be accepted. [Note, however,
we need not retain shared policy on regular files after the last shared
mapping is removed--my recommended persistence model.]

> 6. GFP_xx flags may actually be considered as a form of policy

Agreed. For kernel internal allocation requests...

>
> i.e. GFP_THISNODE is essentially a one node cpuset.

Sort of behaves like one, I agree. Or like an explicit MPOL_BIND with a
single node.

>
> GFP_DMA and GFP_DMA32 are physical address range constraints.

With platform specific locality implications...

>
> GFP_HARDWALL is a strict vs. non-strict distinction.
>
>
> 7. Allocators must change
>
> Right now the policy is set by the process context which is bad because
> one cannot specify a memory policy for an allocation. It must be possible
> to pass a memory policy to the allocators and then get the memory
> requested.

Agreed. In my shared/mapped file policy patches, I have factored an
alloc_page_pol() function out of alloc_page_vma(). The modified
alloc_page_vma() calls get_vma_policy() [as does the current version] to
obtain the policy at the specified address in the calling task's virtual
address space or some default policy, and then calls alloc_page_pol() to
allocate a page based on that policy. I can then use the same
alloc_page_pol() function to allocate page cache pages after looking up
a shared policy on a mapped file or using the default policy for page
cache allocations [currently process->system default].

Perhaps others of the page allocators could use alloc_page_pol() as
well?

>
>
> I wish we could come up with some universal scheme that encompasses all
> of the functionality we want and that makes memory more manageable....

I think it's possible and that the current mempolicy support can be
evolved with not too much effort. Again, the biggest issue for me is
the reconciliation of the policies with the administrative constraints
imposed by subsetting the system via containers/cpusets--especially for
objects that can be referenced from more than one container. I think
that any reasonable, let alone "correct", solution would be
workload/application dependent.

Lee
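
The factoring described for section 7 (look the policy up first, then hand it to an allocator that takes the policy as an explicit argument) can be sketched as a user-space toy. This is not the code from the patches: every toy_* name below is hypothetical, the real functions take different arguments, and "allocation" here only returns the node that would be chosen.

```c
#include <assert.h>
#include <stddef.h>

enum toy_pol_mode { TOY_POL_LOCAL, TOY_POL_PREFERRED, TOY_POL_INTERLEAVE };

/* Hypothetical policy object passed explicitly to the allocator. */
struct toy_pol {
    enum toy_pol_mode mode;
    int preferred_node;    /* used by TOY_POL_PREFERRED */
    unsigned int nr_nodes; /* used by TOY_POL_INTERLEAVE */
    unsigned int next;     /* interleave cursor */
};

/* Core allocator: chooses a node from the policy it is given and does
 * not consult any process context. */
int toy_alloc_page_pol(struct toy_pol *pol, int local_node)
{
    assert(pol != NULL);

    switch (pol->mode) {
    case TOY_POL_PREFERRED:
        return pol->preferred_node;
    case TOY_POL_INTERLEAVE: {
        assert(pol->nr_nodes > 0);
        int node = (int)(pol->next % pol->nr_nodes);

        pol->next++;
        return node;
    }
    case TOY_POL_LOCAL:
    default:
        return local_node;
    }
}

struct toy_vma  { struct toy_pol *policy; };        /* NULL if unset */
struct toy_file { struct toy_pol *shared_policy; }; /* NULL if unset */

/* VMA path: resolve the policy for the mapping, then delegate. */
int toy_alloc_page_vma(struct toy_vma *vma, struct toy_pol *fallback,
                       int local_node)
{
    struct toy_pol *pol = vma->policy ? vma->policy : fallback;

    return toy_alloc_page_pol(pol, local_node);
}

/* Page cache path: resolve the file's shared policy, then delegate to
 * the same core allocator. */
int toy_alloc_pagecache_page(struct toy_file *file, struct toy_pol *fallback,
                             int local_node)
{
    struct toy_pol *pol = file->shared_policy ? file->shared_policy : fallback;

    return toy_alloc_page_pol(pol, local_node);
}
```

The point of the split is that both callers differ only in how they find a policy; once a policy is in hand, the placement decision is shared.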
* Re: Some thoughts on memory policies @ 2007-06-19 20:24 ` Lee Schermerhorn 0 siblings, 0 replies; 16+ messages in thread From: Lee Schermerhorn @ 2007-06-19 20:24 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-mm, wli, linux-kernel On Mon, 2007-06-18 at 13:22 -0700, Christoph Lameter wrote: > I think we are getting into more and more of a mess with the existing > memory policies. The refcount issue with shmem is just one bad symptom of > it. Memory policies were intended to be process based and > taking them out of that context causes various issues. I don't think memory policies are in as much of a mess as Christoph seems to. Perhaps this is my ignorance showing. Certainly, there are issues to be addressed--especially in the interaction of memory policies with containers, such as cpusets. The shmem refcount issue may be one of these issues--not sure how "bad" it is. I agree that the "process memory policy"--i.e., the one set by set_mempolicy()--is "process based", but I don't see the system default policy as process based. The system default policy is, currently, the policy of last resort for any allocation. And, [as I've discussed with Christoph], I view policies applied via mbind() as applying to [some range of] the "memory object" mapped at a specific address range. I admit that this view is somewhat muddied by the fact that [private] anonymous segments don't actually have any actual kernel structure to represent them outside of a process's various anonymous VMAs and page table [and sometimes the swap cache]; and by the fact that the kernel currently ignores policy that one attempts to place on shared regular file mappings. However, I think "object-based policy" is a natural extension of the current API and easily implemented with the current infrastructure. > > I have thought for a long time that we need something to replace memory > policies especially since the requirements on memory policies go far > beyond just being process based. 
So some requirements and ideas about > memory policies. Listing the requirements is a great idea. But I won't go so far as to agree that we need to "replace memory policies" so much as rationalize them for all the desired uses/contexts... > > 1. Memory policies must be attachable to a variety of objects > > - Device drivers may need to control memory allocations in > devices either because their DMA engines can only reach a > subsection of the system or because memory transfer > performance is superior in certain nodes of the system. > > - Process. This is the classic usage scenario > > - File / Socket. One may have particular reasons to place > objects on a set of nodes because of how the threads of > an application are spread out in the system. ...how the tasks/threads are spread out and how the application accesses the pages of the objects. Some accesses--e.g., unmapped pages of files--are implicit or transparent to task. I guess any pages associated with a socket would also be transparent to the application as well? > > - Cpuset / Container. Some simple support is there with > memory spreading today. That could be made more universal. I've said before that I viewed cpusets as administrative contraints on applications, where as policies are something that can be controlled by the application or a non-privileged user. As cpusets evolve into more general "containers", I think they'll become less visible to the applications running within them. The application will see the container as "the system"--at least, the set of system resources to which the application has access. The current memory policy APIs can work in such a "containerized" environment if we can reconcile the policy APIs' notion of nodes with the set of nodes that container allows. Perhaps we need to revisit the "cpumemset" proposal that provides a separate node id namespace in each container/cpuset. 
As a minimum, I think a task should be able to query the set of nodes that it can use and/or have the system "do the right thing" if the application specifies "all possible nodes" for, say, and interleave policy. > > - System policies. The system policy is currently not > modifiable. It may be useful to be able to set this. > Small NUMA systems may want to run with interleave by default Agreed. And, on our platforms, it would be useful to have a separately specifiable system-wide [or container-wide] default page cache policy. > > - Address range. For the virtual memory address range > this is included in todays functionality but one may also > want to control the physical address range to make sure > f.e. that memory is allocated in an area where a device > can reach it. For application usage? Does this mean something like an MPOL_MF_DMA flag? One way to handle this w/o an explicit 'DMA flag for use space APIs is to mmap() the device that would use the memory and allow the device driver to allocate the memory internally with the appropriate DMA/32 flags and map that memory into the task's address space. I think that works today. What other usage scenarios are you thinking of? > > - Memory policies need to be attachable to types of pages. > F.e. executable pages of a threaded application are best > spread (or replicated) whereas the stack and the data may > best be allocated in a node local way. > Useful categories that I can think of > Stack, Data, Filebacked pages, Anonymous Memory, > Shared memory, Page tables, Slabs, Mlocked pages and > huge pages. Rather, I would say, to "types of objects". I think all of the "types of pages" you mention [except, maybe, mlocked?] can be correlated to some structure/object to which policy can be attached. Regarding "Mlocked pages"--are you suggesting that you might want to specify that mlocked pages have a different policy/locality than other pages in the same object? 
Stack and data/heap can easily be handled by always defaulting the process policy to node local [or perhaps interleaved across the nodes in the container, if node local results in hot spots or other problems], and explicitly binding other objects of interest, if performance considerations warrant, using the mbind() API or by using fixed or heuristic defaults. > > Maybe a set of global policies would be useful for these > categories. Andy hacked subsystem memory policies into > shmem and it seems that we are now trying to do the same > for hugepages. Maybe we could get to a consistent scheme > here? Christoph, I wish you wouldn't characterize Andi's shared policy infrastructure as a hack. I think it provides an excellent base implementation for [shared] object-based policies. It extends easily to any object that can be addressed by offset [page offset, hugepage offset, ...]. The main issue is the generic one of memory policy on object that can be shared by processes running in separate cpusets, whether the sharing is intentional or not. > > 2. Memory policies need to support additional constraints > > - Restriction to a set of nodes. That is what we have today. See "locality placement" below. > > - Restriction to a container or cpuset. Maybe restriction > to a set of containers? I don't know about a "set of containers", but perhaps you are referring to sharing of objects between applications running in different containers with potentially disjoint memory resources? That is problematic. We need to enumerate the use cases for this and what the desired behavior should be. Christoph and I discussed one scenario: backup running in a separate cpuset, disjoint from an application that mmap()s a file shared and installs a shared policy on it [my "mapped file policy" patches would enable this]. 
If the application's cpuset contains sufficient memory for the application's working set, but NOT enough to hold the entire file, the backup running in another cpuset reading the entire file may push out pages of the application from it's cpuset because the object policy constrains the pages to be located in the application's cpuset. > > - Strict vs no strict allocations. A strict allocation needs > to fail if the constraints cannot be met. A non strict > allocation can fall back. Agreed. And I think this needs to be explicit in the allocation request. Callers requesting strict allocation [including "no wait"] should be prepared to handle failure of the allocation. > > - Order of allocation. Higher order pages may require > different allocation constraints? This is like a > generalization of huge page policies. Agreed. On our platform, I'd like to keep default huge page allocations and interleave requests off the "hardware interleaved pseudo-node" as that is "special" memory. I'd like to reserve it for access only by explicit request. The current model doesn't support this, but I think it could, with a few "small" enhancements. [TODO] > > - Locality placement. These are node local, interleave etc. How is this different from "restriction to a set of nodes" in the context of memory policies [1st bullet in section 2]? I tend to think of memory policies--whether default or explicit--as "locality placement" and cpusets as "constraints" or restrictions on what policies can do. > > 3. Additional flags > > - Automigrate flag so that memory touched by a process > is moved to a memory location that has best performance. Automigration can be turned on/off in the environment--e.g., per container/cpuset, but perhaps there is a use case for more explicit control over automigration of specific pages of an object? "Lazy migration" or "migrate on fault" is fairly easy to achieve atop the existing migration infrastructure. However, it requires a fault to trigger the migration. 
One can arrange for these faults to occur explicitly--e.g., via a straightforward extension to mbind() with MPOL_MF_MOVE and a new MPOL_MF_LAZY flag to remove the page translations from all page tables resulting in a fault, and possible migration, on next touch. Or, one can arrange to automatically "unmap" [remove ptes referencing] selected types of pages when the load balancer moves a task to a new node. I've seen fairly dramatic reductions in real, user and system time in, e.g., kernel builds on a heavily loaded [STREAMS benchmark running] NUMA platform with automatic/lazy migration patches: ~14% real, ~4.7% user and ~22% system time reductions.

> - Page order flag that determines the preferred allocation order. Maybe useful in connection with the large blocksize patch to control anonymous memory orders.

Agreed. "requested page order" could be a component of policy, along with locality.

> - Replicate flags so that memory is replicated.

This could be a different policy mode, MPOL_REPLICATE. Or, as with Nick's prototype, it could be the default behavior for read-only access to page cache pages when no explicit policy exists on the object [file].

For "automatic, lazy" replication, one would also need a fault to trigger the replication. This could be achieved by removing the pte from only the calling task's page table via mbind(MOVE+LAZY) or automatically on inter-node task migration. The resulting fault, when that corresponding virtual address is touched, would cause Nick's page cache replication infrastructure to create/use a local copy of the page. It's "on my list" ...

> 4. Policy combinations
>
> We need some way to combine policies in a systematic way. The current hieracy from System->cpuset->proces->memory range does not longer work if a process can use policies set up in shmem or huge pages. Some consistent scheme to combine memory policies would also need to be able to synthesize different policies. I.e.
> automigrate can be combined with node local or interleave and a cpuset constraint.

The big issue, here, for me, is the interaction of policy on shared objects [shmem and shared regular file mappings] referenced from different containers/cpusets. Given that we want to allow this--almost can't prevent it in the case of regular file access--we need to specify the use cases, what the desired behavior is for each such case, and which scenarios to optimize for.

> 5. Management tools
>
> If we make the policies more versatile then we need the proper management tools in user space to set and display these policies in such a way that they can be managed by the end user. The esoteric nature of memory policy semantics makes them difficult to comprehend.

/proc/<pid>/numa_maps works well [with my patches] for objects mapped into a task's address space. What it doesn't work so well for are: 1) shared policy on currently unattached shmem segments and 2) shared policy on unmapped regular files, should my patches be accepted. [Note, however, we need not retain shared policy on regular files after the last shared mapping is removed--my recommended persistence model.]

> 6. GFP_xx flags may actually be considered as a form of policy

Agreed. For kernel internal allocation requests...

> i.e. GFP_THISNODE is essentially a one node cpuset.

sort of behaves like one, I agree. Or like an explicit MPOL_BIND with a single node.

> GFP_DMA and GFP_DMA32 are physical address range constraints.

with platform specific locality implications...

> GFP_HARDWALL is a strict vs. nonstrict distinction.
>
> 7. Allocators must change
>
> Right now the policy is set by the process context which is bad because one cannot specify a memory policy for an allocation. It must be possible to pass a memory policy to the allocators and then get the memory requested.

Agreed. In my shared/mapped file policy patches, I have factored an alloc_page_pol() function out of alloc_page_vma().
The modified alloc_page_vma() calls get_vma_policy() [as does the current version] to obtain the policy at the specified address in the calling task's virtual address space or some default policy, and then calls alloc_page_pol() to allocate a page based on that policy. I can then use the same alloc_page_pol() function to allocate page cache pages after looking up a shared policy on a mapped file or using the default policy for page cache allocations [currently process->system default]. Perhaps some of the other page allocators could use alloc_page_pol() as well?

> I wish we could come up with some universal scheme that encompasses all of the functionality we want and that makes memory more manageable....

I think it's possible and that the current mempolicy support can be evolved with not too much effort. Again, the biggest issue for me is the reconciliation of the policies with the administrative constraints imposed by subsetting the system via containers/cpusets--especially for objects that can be referenced from more than one container. I think that any reasonable, let alone "correct", solution would be workload/application dependent.

Lee
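The factoring Lee describes can be pictured with a toy user-space model. This is only a sketch under stated assumptions, not the code from his patches: every toy_* name, the two policy modes, and the struct layouts are hypothetical, and a "page allocation" is reduced to returning the node that would be chosen. The point it tries to show is that both the VMA path and the page cache path do their own policy lookup and then share one policy-driven allocation function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified policy: either node local or one preferred node. */
enum toy_mode { TOY_LOCAL, TOY_PREFERRED };

struct toy_policy { enum toy_mode mode; int node; };
struct toy_vma  { const struct toy_policy *policy; };        /* NULL: no policy */
struct toy_file { const struct toy_policy *shared_policy; }; /* NULL: no policy */
struct toy_task { const struct toy_policy *policy; int local_node; };

static const struct toy_policy toy_system_default = { TOY_LOCAL, -1 };

/* Fallback used when the object itself carries no policy: task, then system. */
static const struct toy_policy *toy_task_or_default(const struct toy_task *t)
{
    return t->policy ? t->policy : &toy_system_default;
}

/* The single place where a policy becomes a placement decision.
 * Returns the node the page would come from. */
int toy_alloc_page_pol(const struct toy_policy *pol, const struct toy_task *t)
{
    assert(pol != NULL && t != NULL);
    return pol->mode == TOY_PREFERRED ? pol->node : t->local_node;
}

/* VMA path: look up the policy for the mapping, then allocate by policy. */
int toy_alloc_page_vma(const struct toy_vma *vma, const struct toy_task *t)
{
    const struct toy_policy *pol =
        vma->policy ? vma->policy : toy_task_or_default(t);
    return toy_alloc_page_pol(pol, t);
}

/* Page cache path: shared policy on the file if present, otherwise the
 * process -> system default chain, then the same allocation function. */
int toy_alloc_page_cache(const struct toy_file *f, const struct toy_task *t)
{
    const struct toy_policy *pol =
        f->shared_policy ? f->shared_policy : toy_task_or_default(t);
    return toy_alloc_page_pol(pol, t);
}
```

In this model a task on node 1 with no policies gets node 1 from the VMA path, while a file carrying a "prefer node 3" shared policy yields node 3 from the page cache path regardless of which task faults the page in.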
* Re: Some thoughts on memory policies
  2007-06-19 20:24 ` Lee Schermerhorn
@ 2007-06-19 21:23 ` Paul Jackson
From: Paul Jackson @ 2007-06-19 21:23 UTC
To: Lee Schermerhorn; +Cc: clameter, linux-mm, wli, linux-kernel

> The current memory policy APIs can work in such a "containerized" environment if we can reconcile the policy APIs' notion of nodes with the set of nodes that container allows. Perhaps we need to revisit the "cpumemset" proposal that provides a separate node id namespace in each container/cpuset.

Currently, we (SGI) do this for our systems using user level library code.

Even though that library code is LGPL licensed, it's still far less widely distributed than the Linux kernel.

Container relative numbering support directly in the kernel might make sense; though it would be very challenging to provide that without breaking any existing APIs such as sched_setaffinity, mbind, set_mempolicy and various /proc files that provide only system-wide numbering.

The advantage I had doing cpuset relative cpu and mem numbering in a user library was that I could invent new APIs that were numbered relatively from day one.

So ... I'd likely be supportive of cpuset (or container) relative numbering support in the kernel ... if someone can figure out how to do it without breaking existing APIs left and right.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
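As a rough illustration of what container-relative numbering means, here is a minimal sketch in C. It is not SGI's library code and not a kernel interface: the function names are hypothetical, and a container's memory nodes are assumed to fit in a 64-bit mask where bit n means "system node n belongs to the container". Relative node k is then simply the k-th set bit of that mask.

```c
#include <assert.h>
#include <stdint.h>

/* Map a container-relative node number to a system-wide node id.
 * Returns -1 if the container has no node with that relative number. */
int rel_to_sys_node(uint64_t allowed, int relative)
{
    if (relative < 0)
        return -1;
    for (int node = 0; node < 64; node++) {
        if (!(allowed & (UINT64_C(1) << node)))
            continue;               /* node is outside the container */
        if (relative-- == 0)
            return node;
    }
    return -1;
}

/* Inverse mapping: system-wide node id to container-relative number.
 * Returns -1 if the node is not part of the container. */
int sys_to_rel_node(uint64_t allowed, int node)
{
    if (node < 0 || node >= 64 || !(allowed & (UINT64_C(1) << node)))
        return -1;
    int rel = 0;
    for (int n = 0; n < node; n++)
        if (allowed & (UINT64_C(1) << n))
            rel++;                  /* count container nodes below this one */
    return rel;
}

/* Translate a whole relative node mask into a system-wide mask.
 * Relative bits beyond the container's node count are ignored. */
uint64_t rel_mask_to_sys(uint64_t allowed, uint64_t rel_mask)
{
    uint64_t sys_mask = 0;
    for (int rel = 0; rel < 64; rel++) {
        if (!(rel_mask & (UINT64_C(1) << rel)))
            continue;
        int node = rel_to_sys_node(allowed, rel);
        if (node >= 0)
            sys_mask |= UINT64_C(1) << node;
    }
    assert((sys_mask & ~allowed) == 0); /* result never leaves the container */
    return sys_mask;
}
```

With a container holding system nodes 4, 5 and 7 (mask 0xB0), relative node 2 maps to system node 7, and an "all nodes" relative mask collapses to exactly the container's own mask. The compatibility problem raised above is that existing calls already take system-wide numbers, so a translation layer like this cannot be slipped underneath them without changing what their arguments mean.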
* Re: Some thoughts on memory policies
  2007-06-19 20:24 ` Lee Schermerhorn
@ 2007-06-19 22:30 ` Christoph Lameter
From: Christoph Lameter @ 2007-06-19 22:30 UTC
To: Lee Schermerhorn; +Cc: linux-mm, wli, linux-kernel

On Tue, 19 Jun 2007, Lee Schermerhorn wrote:

> > - File / Socket. One may have particular reasons to place objects on a set of nodes because of how the threads of an application are spread out in the system.
>
> ...how the tasks/threads are spread out and how the application accesses the pages of the objects. Some accesses--e.g., unmapped pages of files--are implicit or transparent to task. I guess any pages associated with a socket would also be transparent to the application as well?

Not sure about the exact semantics that we should have.

> > - Cpuset / Container. Some simple support is there with memory spreading today. That could be made more universal.
>
> I've said before that I viewed cpusets as administrative contraints on applications, where as policies are something that can be controlled by the application or a non-privileged user. As cpusets evolve into more general "containers", I think they'll become less visible to the applications running within them. The application will see the container as "the system"--at least, the set of system resources to which the application has access.

An application may want to access memory from various pools of memory that may be different containers? The containers can then be dynamically sized by system administrators.

> The current memory policy APIs can work in such a "containerized" environment if we can reconcile the policy APIs' notion of nodes with the set of nodes that container allows. Perhaps we need to revisit the "cpumemset" proposal that provides a separate node id namespace in each container/cpuset. As a minimum, I think a task should be able to query

Right.
> the set of nodes that it can use and/or have the system "do the right thing" if the application specifies "all possible nodes" for, say, and interleave policy.

I agree.

> > - Address range. For the virtual memory address range this is included in todays functionality but one may also want to control the physical address range to make sure f.e. that memory is allocated in an area where a device can reach it.
>
> For application usage? Does this mean something like an MPOL_MF_DMA flag?

Mostly useful for memory policies attached to devices I think.

> > - Memory policies need to be attachable to types of pages. F.e. executable pages of a threaded application are best spread (or replicated) whereas the stack and the data may best be allocated in a node local way.
> > Useful categories that I can think of
> > Stack, Data, Filebacked pages, Anonymous Memory, Shared memory, Page tables, Slabs, Mlocked pages and huge pages.
>
> Rather, I would say, to "types of objects". I think all of the "types of pages" you mention [except, maybe, mlocked?] can be correlated to some structure/object to which policy can be attached. Regarding "Mlocked pages"--are you suggesting that you might want to specify that mlocked pages have a different policy/locality than other pages in the same object?

One may not want mlocked pages to contaminate certain nodes?

> Christoph, I wish you wouldn't characterize Andi's shared policy infrastructure as a hack. I think it provides an excellent base implementation for [shared] object-based policies. It extends easily to any object that can be addressed by offset [page offset, hugepage offset, ...]. The main issue is the generic one of memory policy on object that can be shared by processes running in separate cpusets, whether the sharing is intentional or not.

The refcount issues and the creation of vmas on the stack do suggest that this is not a clean implementation.

> > 4.
> > Policy combinations
> >
> > We need some way to combine policies in a systematic way. The current hieracy from System->cpuset->proces->memory range does not longer work if a process can use policies set up in shmem or huge pages. Some consistent scheme to combine memory policies would also need to be able to synthesize different policies. I.e. automigrate can be combined with node local or interleave and a cpuset constraint.
>
> The big issue, here, for me, is the interaction of policy on shared objects [shmem and shared regular file mappings] referenced from different containers/cpusets. Given that we want to allow this--almost can't prevent it in the case of regular file access--we need to specify the use cases, what the desired behavior is for each such case, and which scenarios to optimize for.

Right and we need some form of permissions management for policies.

> > 7. Allocators must change
> >
> > Right now the policy is set by the process context which is bad because one cannot specify a memory policy for an allocation. It must be possible to pass a memory policy to the allocators and then get the memory requested.
>
> Agreed. In my shared/mapped file policy patches, I have factored an "allocate_page_pol() function out of alloc_page_vma(). The modified alloc_page_vma() calls get_vma_policy() [as does the current version] to obtain the policy at the specified address in the calling task's virtual address space or some default policy, and then calls alloc_page_pol() to allocate a page based on that policy. I can then use the same alloc_page_pol() function to allocate page cache pages after looking up a shared policy on a mapped file or using the default policy for page cache allocations [currently process->system default]. Perhaps other of the page allocators could use alloc_page_pol() as well?

Think about how the slab allocators, uncached allocator and vmalloc could support policies.
Somehow this needs to work in a consistent way.
* Re: Some thoughts on memory policies
  2007-06-18 20:22 ` Christoph Lameter
@ 2007-06-20 4:01 ` Paul Mundt
From: Paul Mundt @ 2007-06-20 4:01 UTC
To: Christoph Lameter; +Cc: linux-mm, wli, lee.schermerhorn, linux-kernel

On Mon, Jun 18, 2007 at 01:22:08PM -0700, Christoph Lameter wrote:

> 1. Memory policies must be attachable to a variety of objects
>
> - System policies. The system policy is currently not modifiable. It may be useful to be able to set this. Small NUMA systems may want to run with interleave by default

For small systems there are a number of things that could be done for this. With the interleave map for system init dynamically created, we can make a reasonable guess about whether we want to use interleave as a default policy or not if the node map is considerably different from the online map (or the node_memory_map in -mm).

If the system policy only makes sense as interleave or default, it might make sense simply to have a sysctl for this (the sysctl handler could rebalance the interleave map when switching to handle offline nodes coming online later, for example).

> - Memory policies need to be attachable to types of pages. F.e. executable pages of a threaded application are best spread (or replicated) whereas the stack and the data may best be allocated in a node local way.

That would be nice, but one would also have to be able to restrict the range of nodes to replicate across when applications know their worst-case locality. Perhaps some of the cpuset work could be generalized for this?

> 2. Memory policies need to support additional constraints
>
> - Restriction to a set of nodes. That is what we have today.
>
> - Restriction to a container or cpuset. Maybe restriction to a set of containers?

Having memory policies per container or cpuset would be nice to have, but this seems like it would get pretty messy with nested cpusets that contain overlapping memory nodes?

The other question is whether tasks residing under a cpuset with an established memory policy are allowed to mbind() outside of the cpuset policy constraints. Spreading of page and slab cache pages seems to already side-step constraints.

> 7. Allocators must change
>
> Right now the policy is set by the process context which is bad because one cannot specify a memory policy for an allocation. It must be possible to pass a memory policy to the allocators and then get the memory requested.

Some policy hints can already be determined from the gfpflags, perhaps it's worth expanding on this? If these sorts of things have to be handled by devices, one has to assume that the device may not always be running in the same configuration or system, so an explicit policy would simply cause more trouble.

> I wish we could come up with some universal scheme that encompasses all of the functionality we want and that makes memory more manageable....

There's quite a bit of room for improving and extending the existing code, and those options should likely be exhausted first.
* Re: Some thoughts on memory policies
  2007-06-20 4:01 ` Paul Mundt
@ 2007-06-20 5:08 ` Christoph Lameter
From: Christoph Lameter @ 2007-06-20 5:08 UTC
To: Paul Mundt; +Cc: linux-mm, wli, lee.schermerhorn, linux-kernel

On Wed, 20 Jun 2007, Paul Mundt wrote:

> There's quite a bit of room for improving and extending the existing code, and those options should likely be exhausted first.

There is a confusing maze of special rules if one goes beyond the simple process address space case. There are no clean rules on how to combine memory policies. Refcounting / updating becomes a problem because policies are intended to be only updated from the process that set them up. Look at the gimmicks that Paul needed to do to update memory policies when a process is migrated and the vmas on the stack for shmem etc etc.
* Re: Some thoughts on memory policies
  2007-06-18 20:22 ` Christoph Lameter
@ 2007-06-20 12:30 ` Andi Kleen
From: Andi Kleen @ 2007-06-20 12:30 UTC
To: Christoph Lameter; +Cc: linux-mm, wli, lee.schermerhorn, linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> I think we are getting into more and more of a mess with the existing memory policies. The refcount issue with shmem is just one bad symptom of it.

That's easy to fix by just making the mpol freeing RCU and not use a reference count for this window. I'll send a patch soon.

> Memory policies were intended to be process based

Not true, e.g. shmem is a good counter example. Also kernel has its own policies too.

> and taking them out of that context causes various issues.

My primary concern is if there is a good user interface that you can actually explain to a normal sysadmin and can be used relatively race free. Many of the proposals I've seen earlier failed these tests.

> - Device drivers may need to control memory allocations in devices either because their DMA engines can only reach a subsection of the system

On what system does this happen? Not sure you can call that a "coherent system".

> or because memory transfer performance is superior in certain nodes of the system.

Most architectures already give a sensible default for the coherent DMA mappings (node the device is attached to). For the others it is really not under driver control.

> - File / Socket. One may have particular reasons to place objects on a set of nodes because of how the threads of an application are spread out in the system.

The default right now seems reasonable to me. e.g. network devices typically allocate on the node their interrupt is assigned to. If you bind the application to that node then you'll have everything local.
Now figuring out how to do this automatically without explicit configuration would be great, but I don't know of a really general solution (especially when you consider MSI-X hash based load balancing). Ok, the scheduler does a little work in this direction by nudging processes towards the CPU they get their wakeups from; perhaps this could be made a little stronger.

Arguably irqbalanced needs to be more NUMA aware, but that's not really a kernel issue.

But frankly I wouldn't see the value of more explicit configuration here.

> - Cpuset / Container. Some simple support is there with memory spreading today. That could be made more universal.

Agreed.

> - System policies. The system policy is currently not modifiable. It may be useful to be able to set this. Small NUMA systems may want to run with interleave by default

Yes we need page cache policy. That's easy to do though.

> - Address range. For the virtual memory address range this is included in todays functionality but one may also want to control the physical address range to make sure f.e. that memory is allocated in an area where a device can reach it.

Why? Where do we have such broken devices that cannot DMA everywhere? If they're really that broken they probably deserve to be slow (or rather use double buffering, not DMA).

Also controlling from the device where the submitted data is located is difficult unless you bind processes. If you do, it just works, but if you don't want to (for most cases explicit binding is bad) it is hard.

I would be definitely opposed to anything that exposes addresses as user interface.

> - Memory policies need to be attachable to types of pages. F.e. executable pages of a threaded application are best spread (or replicated)

There are some experimental patches for text replication. I used to think they were probably not needed, but there are now some benchmark results that show they're a good idea for some workloads. This should be probably investigated.
I think Nick P. was looking at it. > whereas the stack and the data may > best be allocated in a node local way. > Useful categories that I can think of > Stack, Data, Filebacked pages, Anonymous Memory, > Shared memory, Page tables, Slabs, Mlocked pages and > huge pages. My experience so far with user feedback is that most users only use the barest basics of NUMA policy and they rarely use anything more advanced. For anything complicated you need a very very good justification. > > Maybe a set of global policies would be useful for these > categories. Andy hacked subsystem memory policies into > shmem and it seems that we are now trying to do the same > for hugepages. It's already there for huge pages if you look at the code (I was confused earlier when I claimed it wasn't) For page cache that is not mmaped I agree it's useful. But I suspect a couple of sysctls would do fine here (SLES9 had something like this for page cache as a sysctl) > 2. Memory policies need to support additional constraints > > - Restriction to a set of nodes. That is what we have today. > > - Restriction to a container or cpuset. Maybe restriction > to a set of containers? Why? > > - Strict vs no strict allocations. A strict allocation needs > to fail if the constraints cannot be met. A non strict > allocation can fall back. That's already there -- that's the difference between PREFERED and BIND. > > - Order of allocation. Higher order pages may require What higher order pages? Right now they're only in hugetlbfs. Regarding your page cache proposal: I think it's a bad idea, larger soft page sizes would be better. > - Automigrate flag so that memory touched by a process > is moved to a memory location that has best performance. Hmm, possible. Do we actually have users for this though? > - Page order flag that determines the preferred allocation > order. Maybe useful in connection with the large blocksize > patch to control anonymous memory orders. Not sure I see the point of this. > 4. 
Policy combinations > > We need some way to combine policies in a systematic way. The current > hieracy from System->cpuset->proces->memory range does not longer > work if a process can use policies set up in shmem or huge pages. > Some consistent scheme to combine memory policies would also need > to be able to synthesize different policies. I.e. automigrate > can be combined with node local or interleave and a cpuset constraint. Maybe. > The esoteric > nature of memory policy semantics makes them difficult to comprehend. Exactly. It doesn't make sense to implement if you can't give it a good interface. > 7. Allocators must change > > Right now the policy is set by the process context which is bad because > one cannot specify a memory policy for an allocation. It must be possible > to pass a memory policy to the allocators and then get the memory > requested. We already can allocate on a node. If there is really demand we could also expose interleaved allocations, but again we would need a good user. Not sure it is useful for sl[aou]b. -Andi ^ permalink raw reply [flat|nested] 16+ messages in thread
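The strict versus non-strict distinction raised in this message (BIND fails when its node set cannot satisfy the request, PREFERRED falls back) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the names `toy_policy`, `toy_alloc_page` and `TOY_NR_NODES` are hypothetical, the per-node free page counters stand in for real zones, and the fallback scans nodes in index order rather than by NUMA distance as a real allocator would.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_NODES 4  /* hypothetical node count for this model */

/* Toy stand-ins for the two policy modes; not the kernel's definitions. */
enum toy_mode { TOY_PREFERRED, TOY_BIND };

struct toy_policy {
    enum toy_mode mode;
    unsigned int nodemask;  /* bit n set => node n is allowed (used by BIND) */
    int preferred_node;     /* first choice (used by PREFERRED) */
};

/* Take one page from a node if it has any left; returns 1 on success. */
static int toy_take_page(unsigned long free_pages[TOY_NR_NODES], int node)
{
    if (free_pages[node] == 0)
        return 0;
    free_pages[node]--;
    return 1;
}

/*
 * Returns the node the page was taken from, or -1 if the policy
 * could not be satisfied.
 */
int toy_alloc_page(const struct toy_policy *pol,
                   unsigned long free_pages[TOY_NR_NODES])
{
    int node;

    assert(pol != NULL);

    if (pol->mode == TOY_PREFERRED) {
        assert(pol->preferred_node >= 0 &&
               pol->preferred_node < TOY_NR_NODES);
        if (toy_take_page(free_pages, pol->preferred_node))
            return pol->preferred_node;
        /* Non-strict: fall back to any node that still has memory. */
        for (node = 0; node < TOY_NR_NODES; node++)
            if (toy_take_page(free_pages, node))
                return node;
        return -1;
    }

    /* Strict: only nodes in the mask are eligible, otherwise fail. */
    for (node = 0; node < TOY_NR_NODES; node++)
        if ((pol->nodemask & (1u << node)) &&
            toy_take_page(free_pages, node))
            return node;
    return -1;
}
```

With only node 2 holding free pages, a PREFERRED policy for node 1 still gets a page (from node 2), while a BIND policy restricted to node 1 fails and leaves the counters untouched.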
* Re: Some thoughts on memory policies
  2007-06-20 12:30 ` Andi Kleen
@ 2007-06-20 16:51 ` Christoph Lameter
From: Christoph Lameter @ 2007-06-20 16:51 UTC
To: Andi Kleen; +Cc: linux-mm, wli, lee.schermerhorn, linux-kernel

On Wed, 20 Jun 2007, Andi Kleen wrote:

> > - Device drivers may need to control memory allocations in
> >   devices either because their DMA engines can only reach a
> >   subsection of the system
>
> On what system does this happen? Not sure you can call that
> a "coherent system"

This is an issue for example on DSM systems where memory is virtualized by
transferring pages of memory on request. We are also getting into embedded
systems using NUMA for a variety of reasons. Devices that can only reach a
subset of memory may come about there.

> >   or because memory transfer
> >   performance is superior in certain nodes of the system.
>
> Most architectures already give a sensible default for the coherent DMA
> mappings (node the device is attached to) For the others it is
> really not under driver control.

Yes I did some of that work. But that is still from the perspective of the
system as a whole. A device may have unique locality requirements. F.e. it
may be able to interweave memory transfers from multiple nodes for optimal
performance.

> Also controlling from the device where the submitted
> data is difficult unless you bind processes. If you do
> it just works, but if you don't want to (for most cases
> explicit binding is bad) it is hard.

It won't be difficult if the device has

1. A node number

2. An allocation policy

Then the allocation must be done as if we were on that node.

> I would be definitely opposed to anything that exposes
> addresses as user interface.

Well it's more the device driver telling the system where the stuff ought
to be best located.

> > 2. Memory policies need to support additional constraints
> >
> > - Restriction to a set of nodes. That is what we have today.
> >
> > - Restriction to a container or cpuset. Maybe restriction
> >   to a set of containers?
>
> Why?

Because the sysadmin can set the containers up in a flexible way. Maybe we
want to segment a node into a couple of 100MB chunks and give various apps
access to it?

> > - Strict vs no strict allocations. A strict allocation needs
> >   to fail if the constraints cannot be met. A non strict
> >   allocation can fall back.
>
> That's already there -- that's the difference between PREFERRED
> and BIND.

But it's not available for interleave f.e.

> Regarding your page cache proposal: I think it's a bad
> idea, larger soft page sizes would be better.

I am not sure what you are talking about.

> > The esoteric
> > nature of memory policy semantics makes them difficult to comprehend.
>
> Exactly. It doesn't make sense to implement if you can't
> give it a good interface.

Right, we need a clean interface and something that works in such a way
that people can understand it. The challenge is to boil down something
complex to a few simple mechanisms.

> > 7. Allocators must change
> >
> > Right now the policy is set by the process context which is bad because
> > one cannot specify a memory policy for an allocation. It must be possible
> > to pass a memory policy to the allocators and then get the memory
> > requested.
>
> We already can allocate on a node. If there is really demand
> we could also expose interleaved allocations, but again
> we would need a good user.

We have these bad hacks for shmem and for hugetlb where we have to set
policies in the context by creating a fake vma in order to get policy
applied.

If we want to allocate for a device then the device is the context and not
the process; the same goes for shmem and hugetlb.

> Not sure it is useful for sl[aou]b.

If we do this then it needs to be consistently supported by the
allocators. Meaning the slab allocators would have to support a call where
you can pass a policy in and then objects need to be served in conformity
with that policy.
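The two ideas in this reply, an allocator call that takes the policy as an argument and a device that carries its own node number and policy, might look roughly like the following userspace model. Everything here is an assumption for illustration: `sk_policy`, `sk_device`, `sk_alloc_page_policy` and `sk_alloc_for_device` are hypothetical names, not existing kernel interfaces, and the `strict` flag on an interleave policy models the combination that the message says is not available today.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NR_NODES 4  /* hypothetical node count for this model */

/* Hypothetical interleave policy with an optional strict flag. */
struct sk_policy {
    unsigned int nodemask;  /* nodes to interleave over */
    int strict;             /* nonzero: fail rather than fall back */
    unsigned int next;      /* round-robin cursor */
};

/* Hypothetical device context: a home node plus its own policy. */
struct sk_device {
    int node;
    struct sk_policy policy;
};

/* First node in the mask at or after 'start', wrapping around. */
static int sk_next_node(unsigned int nodemask, unsigned int start)
{
    unsigned int i;

    for (i = 0; i < SK_NR_NODES; i++) {
        unsigned int node = (start + i) % SK_NR_NODES;
        if (nodemask & (1u << node))
            return (int)node;
    }
    return -1;
}

/* The policy is an explicit argument, not taken from process context. */
int sk_alloc_page_policy(struct sk_policy *pol,
                         unsigned long free_pages[SK_NR_NODES])
{
    unsigned int i;
    int target;

    assert(pol != NULL);
    target = sk_next_node(pol->nodemask, pol->next);
    if (target < 0)
        return -1;
    pol->next = ((unsigned int)target + 1) % SK_NR_NODES;

    if (free_pages[target] > 0) {
        free_pages[target]--;
        return target;
    }
    if (pol->strict)
        return -1;  /* strict interleave: no fallback */

    /* Non-strict: fall back to the next node with free memory. */
    for (i = 1; i < SK_NR_NODES; i++) {
        unsigned int node = ((unsigned int)target + i) % SK_NR_NODES;
        if (free_pages[node] > 0) {
            free_pages[node]--;
            return (int)node;
        }
    }
    return -1;
}

/* The device is the context: an empty mask means "local to my node". */
int sk_alloc_for_device(struct sk_device *dev,
                        unsigned long free_pages[SK_NR_NODES])
{
    assert(dev != NULL && dev->node >= 0 && dev->node < SK_NR_NODES);
    if (dev->policy.nodemask == 0) {
        struct sk_policy local = { 1u << dev->node, dev->policy.strict,
                                   (unsigned int)dev->node };
        return sk_alloc_page_policy(&local, free_pages);
    }
    return sk_alloc_page_policy(&dev->policy, free_pages);
}
```

In this model a strict interleave over nodes 0 and 2 returns pages alternately from those nodes and fails once the next target is empty, whereas the non-strict variant spills onto a neighbouring node. Passing the policy explicitly is what lets a slab-style allocator or a driver use the same call without faking a process context.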
end of thread, other threads:[~2007-06-20 16:51 UTC | newest]

Thread overview: 16+ messages
2007-06-18 20:22 Some thoughts on memory policies Christoph Lameter
2007-06-19 20:24 ` Lee Schermerhorn
2007-06-19 21:23 ` Paul Jackson
2007-06-19 22:30 ` Christoph Lameter
2007-06-20  4:01 ` Paul Mundt
2007-06-20  5:08 ` Christoph Lameter
2007-06-20 12:30 ` Andi Kleen
2007-06-20 16:51 ` Christoph Lameter