* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
@ 2012-11-09 18:14 Srinivas Pandruvada
  0 siblings, 0 replies; 17+ messages in thread
From: Srinivas Pandruvada @ 2012-11-09 18:14 UTC (permalink / raw)
  To: linux-mm

I like this implementation and think it is valuable. I am experimenting with it on one of our HW platforms. This type of partitioning does help in saving power. Our calculations show significant power savings per DIMM with the help of some HW/BIOS changes. We are only talking about content-preserving memory, so we don't have to be 100% correct.

In my experiments, I tried two methods:

- Similar to the approach suggested by Mel Gorman: I have a special sticky migrate type like CMA.

- Buddy buckets: buddies are organized into memory-region-aware buckets. During allocation, higher-order buckets are preferred. I made sure that my change has no effect if there are no power-saving memory DIMMs. The advantage of this bucketing is that I can keep memory in close proximity for related task groups by hashing directly to a bucket. The free list is organized as a two-dimensional array indexed by bucket and migrate type, for each order.

In both methods, reclaim is currently targeted via a sysfs interface (similar to the one used for memory compaction on a node), allowing user space to initiate reclaim.

Thanks,
Srinivas Pandruvada
Open Source Technology Center, Intel Corp.
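[Editorial note: a minimal sketch of how such a bucketed, two-dimensional free list might be laid out. The structure and names here are purely illustrative and are not the code used in the experiments described above.]

/*
 * Illustrative sketch only: free lists organized per order as a
 * two-dimensional array indexed by (bucket, migratetype), where a
 * bucket maps to a power-manageable memory region.
 */
#define NR_MEM_BUCKETS	16	/* hypothetical bucket count */

struct bucketed_free_area {
	struct list_head free_list[NR_MEM_BUCKETS][MIGRATE_TYPES];
	unsigned long nr_free;
};

/* Keep a task group's pages together by hashing it to a bucket */
static inline unsigned int task_group_bucket(unsigned long tg_id)
{
	return tg_id % NR_MEM_BUCKETS;
}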
* [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
@ 2012-11-06 19:52 Srivatsa S. Bhat
  2012-11-08 18:02 ` Mel Gorman
  2012-12-04 10:51 ` wujianguo
  0 siblings, 2 replies; 17+ messages in thread
From: Srivatsa S. Bhat @ 2012-11-06 19:52 UTC (permalink / raw)
  To: akpm, mgorman, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, srivatsa.bhat, linux-pm, linux-mm, linux-kernel

Hi,

This is an alternative design for Memory Power Management, developed based on some of the suggestions[1] received during the review of the earlier patchset ("Hierarchy" design) on Memory Power Management[2]. This design alters the buddy lists to keep them region-sorted, and is hence identified as the "Sorted-buddy" design.

One of the key aspects of this design is that it avoids the zone-fragmentation problem that was present in the earlier design[3].

Quick overview of Memory Power Management and Memory Regions:
--------------------------------------------------------------

Today's memory subsystems offer a wide range of capabilities for managing memory power consumption. As a quick example, if a block of memory is not referenced for a threshold amount of time, the memory controller can decide to put that chunk into a low-power content-preserving state, and the next reference to that memory chunk would bring it back to full power for read/write. With this capability in place, it becomes important for the OS to understand the boundaries of such power-manageable chunks of memory and to ensure that references are consolidated to a minimum number of such memory power management domains.

ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that the firmware can expose information regarding the boundaries of such memory power management domains to the OS in a standard way.

How can Linux VM help memory power savings?

o Consolidate memory allocations and/or references such that they are not spread across the entire memory address space. Basically, an area of memory that is not being referenced can reside in a low power state.

o Support targeted memory reclaim, where certain areas of memory that can be easily freed can be offlined, allowing those areas of memory to be put into lower power states.

Memory Regions:
---------------

"Memory Regions" is a way of capturing the boundaries of power-manageable chunks of memory, within the MM subsystem.

Short description of the "Sorted-buddy" design:
-----------------------------------------------

In this design, the memory region boundaries are captured in a parallel data-structure instead of fitting regions between nodes and zones in the hierarchy. Further, the buddy allocator is altered, such that we maintain the zones' freelists in region-sorted-order and thus do page allocation in the order of increasing memory regions. (The freelists need not be fully address-sorted, they just need to be region-sorted. Patch 6 explains this in more detail.)

The idea is to do page allocation in increasing order of memory regions (within a zone) and perform page reclaim in the reverse order, as illustrated below.

---------------------------- Increasing region number---------------------->

Direction of allocation--->                      <---Direction of reclaim

The sorting logic (to maintain freelist pageblocks in region-sorted-order) lies in the page-free path and not the page-allocation path, and hence the critical page allocation paths remain fast. (An illustrative sketch of this free-path insertion follows below.)
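[Editorial note: a simplified sketch of the free-path insertion idea, assuming a hypothetical page_zone_region_id() helper. The actual patches keep per-region pointers into each freelist and therefore avoid the linear walk shown here.]

/*
 * Simplified sketch of region-sorted insertion on the page-free path.
 * page_zone_region_id() is a hypothetical helper returning the memory
 * region a page belongs to; the posted series maintains per-region
 * pointers into each freelist so that no linear search is needed.
 */
static void add_to_freelist_region_sorted(struct page *page,
					  struct list_head *list)
{
	int region = page_zone_region_id(page);
	struct page *cur;

	list_for_each_entry(cur, list, lru) {
		if (page_zone_region_id(cur) > region) {
			/* Insert just before the first higher-region page */
			list_add_tail(&page->lru, &cur->lru);
			return;
		}
	}
	/* No higher region found: this page goes at the tail */
	list_add_tail(&page->lru, list);
}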
Moreover, the heart of the page allocation algorithm itself remains largely unchanged, and the region-related data-structures are optimized to avoid unnecessary updates during the page-allocator's runtime.

Advantages of this design:
--------------------------

1. No zone-fragmentation (IOW, we don't create more zones than necessary) and hence we avoid its associated problems (like too many zones, extra page reclaim threads, the question of choosing watermarks, etc).
   [This is an advantage over the "Hierarchy" design.]

2. Performance overhead is expected to be low: since we retain the simplicity of the algorithm in the page allocation path, page allocation can potentially remain as fast as it would be without memory regions. The overhead is pushed to the page-freeing paths, which are not that critical.

Results:
=======

Test setup:
-----------
This patchset applies cleanly on top of 3.7-rc3.

x86 dual-socket quad core HT-enabled machine booted with mem=8G
Memory region size = 512 MB

Functional testing:
-------------------

Ran pagetest, a simple C program that allocates and touches a required number of pages.

Below are the statistics from the regions within ZONE_NORMAL, at various sizes of allocations from pagetest.

              Present pages |      Free pages at various allocations     |
                            |  start  |  512 MB | 1024 MB | 2048 MB |
   Region 0         16      |       0 |       0 |       0 |       0 |
   Region 1     131072      |   87219 |    8066 |    7892 |    7387 |
   Region 2     131072      |  131072 |   79036 |       0 |       0 |
   Region 3     131072      |  131072 |  131072 |   79061 |       0 |
   Region 4     131072      |  131072 |  131072 |  131072 |       0 |
   Region 5     131072      |  131072 |  131072 |  131072 |   79051 |
   Region 6     131072      |  131072 |  131072 |  131072 |  131072 |
   Region 7     131072      |  131072 |  131072 |  131072 |  131072 |
   Region 8     131056      |  105475 |  105472 |  105472 |  105472 |

This shows that page allocation occurs in the order of increasing region numbers, as intended in this design.

Performance impact:
-------------------

Kernbench results didn't show much of a difference between the performance of vanilla 3.7-rc3 and this patchset.

Todos:
=====

1. Memory-region aware page-reclamation:
----------------------------------------

We would like to do page reclaim in the reverse order of page allocation within a zone, i.e., in the order of decreasing region numbers. To achieve that, while scanning lru pages to reclaim, we could potentially look for pages belonging to higher regions (considering region boundaries) or perhaps simply prefer pages of higher pfns (and skip lower pfns) as reclaim candidates.

2. Compile-time exclusion of Memory Power Management, and extending the support to also work with other features such as Mem cgroups, kexec etc.

References:
----------

[1]. Review comments suggesting modifying the buddy allocator to be aware of memory regions:
     http://article.gmane.org/gmane.linux.power-management.general/24862
     http://article.gmane.org/gmane.linux.power-management.general/25061
     http://article.gmane.org/gmane.linux.kernel.mm/64689

[2]. Patch series that implemented the node-region-zone hierarchy design:
     http://lwn.net/Articles/445045/
     http://thread.gmane.org/gmane.linux.kernel.mm/63840

     Summary of the discussion on that patchset:
     http://article.gmane.org/gmane.linux.power-management.general/25061

     Forward-port of that patchset to 3.7-rc3 (minimal x86 config):
     http://thread.gmane.org/gmane.linux.kernel.mm/89202

[3]. Disadvantages of having memory regions in the hierarchy between nodes and zones:
     http://article.gmane.org/gmane.linux.kernel.mm/63849
[4]. Estimate of potential power savings on Samsung exynos board:
     http://article.gmane.org/gmane.linux.kernel.mm/65935

[5]. ACPI 5.0 and MPST support:
     http://www.acpi.info/spec.htm
     Section 5.2.21 Memory Power State Table (MPST)

Srivatsa S. Bhat (8):
  mm: Introduce memory regions data-structure to capture region boundaries within node
  mm: Initialize node memory regions during boot
  mm: Introduce and initialize zone memory regions
  mm: Add helpers to retrieve node region and zone region for a given page
  mm: Add data-structures to describe memory regions within the zones' freelists
  mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
  mm: Add an optimized version of del_from_freelist to keep page allocation fast
  mm: Print memory region statistics to understand the buddy allocator behavior

 include/linux/mm.h     |   38 +++++++
 include/linux/mmzone.h |   52 +++++++++
 mm/compaction.c        |    8 +
 mm/page_alloc.c        |  263 ++++++++++++++++++++++++++++++++++++++++++++----
 mm/vmstat.c            |   59 ++++++++++-
 5 files changed, 390 insertions(+), 30 deletions(-)

Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center
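[Editorial note: to make the patch titles above a little more concrete, the kind of per-node region bookkeeping they describe might look roughly like the sketch below. The field names are hypothetical and do not match the actual patches, which define their own structures in mm.h/mmzone.h.]

/*
 * Rough, illustrative sketch of a per-node memory region descriptor.
 * Field names are hypothetical and are not taken from the patchset.
 */
struct node_mem_region {
	unsigned long start_pfn;	/* first pfn of the region */
	unsigned long end_pfn;		/* one past the last pfn */
	unsigned long present_pages;	/* pages actually present */
	int region_id;			/* index of this region within the node */
	struct pglist_data *pgdat;	/* node that owns this region */
};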
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-06 19:52 Srivatsa S. Bhat @ 2012-11-08 18:02 ` Mel Gorman 2012-11-08 19:38 ` Srivatsa S. Bhat ` (2 more replies) 2012-12-04 10:51 ` wujianguo 1 sibling, 3 replies; 17+ messages in thread From: Mel Gorman @ 2012-11-08 18:02 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: akpm, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote: > ------------------------------------------------------------ > > Today memory subsystems are offer a wide range of capabilities for managing > memory power consumption. As a quick example, if a block of memory is not > referenced for a threshold amount of time, the memory controller can decide to > put that chunk into a low-power content-preserving state. And the next > reference to that memory chunk would bring it back to full power for read/write. > With this capability in place, it becomes important for the OS to understand > the boundaries of such power-manageable chunks of memory and to ensure that > references are consolidated to a minimum number of such memory power management > domains. > How much power is saved? > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that > the firmware can expose information regarding the boundaries of such memory > power management domains to the OS in a standard way. > I'm not familiar with the ACPI spec but is there support for parsing of MPST and interpreting the associated ACPI events? For example, if ACPI fires an event indicating that a memory power node is to enter a low state then presumably the OS should actively migrate pages away -- even if it's going into a state where the contents are still refreshed as exiting that state could take a long time. I did not look closely at the patchset at all because it looked like the actual support to use it and measure the benefit is missing. > How can Linux VM help memory power savings? > > o Consolidate memory allocations and/or references such that they are > not spread across the entire memory address space. Basically area of memory > that is not being referenced, can reside in low power state. > Which the series does not appear to do. > o Support targeted memory reclaim, where certain areas of memory that can be > easily freed can be offlined, allowing those areas of memory to be put into > lower power states. > Which the series does not appear to do judging from this; include/linux/mm.h | 38 +++++++ include/linux/mmzone.h | 52 +++++++++ mm/compaction.c | 8 + mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++---- mm/vmstat.c | 59 ++++++++++- This does not appear to be doing anything with reclaim and not enough with compaction to indicate that the series actively manages memory placement in response to ACPI events. Further in section 5.2.21.4 the spec says that power node regions can overlap (but are not hierarchal for some reason) but have no gaps yet the structure you use to represent is assumes there can be gaps and there are no overlaps. Again, this is just glancing at the spec and a quick skim of the patches so maybe I missed something that explains why this structure is suitable. 
It seems to me that superficially the VM implementation for the support would have a) Involved a tree that managed the overlapping regions (even if it's not hierarchal it feels more sensible) and picked the highest-power-state common denominator in the tree. This would only be allocated if support for MPST is available. b) Leave memory allocations and reclaim as they are in the active state. c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower power but still usable with a latency penalty. This might be a single migrate type but could also be a parallel set of free_area called free_area_lowpower that is only used when free_area is depleted and in the very slow path of the allocator. d) Use memory hot-remove for power states where the refresh rates were not constant and only did anything expensive in response to an ACPI event -- none of the fast paths should be touched. When transitioning to the low power state, memory should be migrated in a vaguely similar fashion to what CMA does. For low-power, migration failure is acceptable. If contents are not preserved, ACPI needs to know if the migration failed because it cannot enter that power state. For any of this to be worthwhile, low power states would need to be achieved for long periods of time because that migration is not free. > Memory Regions: > --------------- > > "Memory Regions" is a way of capturing the boundaries of power-managable > chunks of memory, within the MM subsystem. > > Short description of the "Sorted-buddy" design: > ----------------------------------------------- > > In this design, the memory region boundaries are captured in a parallel > data-structure instead of fitting regions between nodes and zones in the > hierarchy. Further, the buddy allocator is altered, such that we maintain the > zones' freelists in region-sorted-order and thus do page allocation in the > order of increasing memory regions. Implying that this sorting has to happen in the either the alloc or free fast path. > (The freelists need not be fully > address-sorted, they just need to be region-sorted. Patch 6 explains this > in more detail). > > The idea is to do page allocation in increasing order of memory regions > (within a zone) and perform page reclaim in the reverse order, as illustrated > below. > > ---------------------------- Increasing region number----------------------> > > Direction of allocation---> <---Direction of reclaim > Compaction will work against this because it uses a PFN walker to isolate free pages and will ignore memory regions. If pageblocks were used, it could take that into account at least. > The sorting logic (to maintain freelist pageblocks in region-sorted-order) > lies in the page-free path and not the page-allocation path and hence the > critical page allocation paths remain fast. Page free can be a critical path for application performance as well. Think network buffer heavy alloc and freeing of buffers. However, migratetype information is already looked up for THP so ideally power awareness would piggyback on it. > Moreover, the heart of the page > allocation algorithm itself remains largely unchanged, and the region-related > data-structures are optimized to avoid unnecessary updates during the > page-allocator's runtime. > > Advantages of this design: > -------------------------- > 1. 
No zone-fragmentation (IOW, we don't create more zones than necessary) and > hence we avoid its associated problems (like too many zones, extra page > reclaim threads, question of choosing watermarks etc). > [This is an advantage over the "Hierarchy" design] > > 2. Performance overhead is expected to be low: Since we retain the simplicity > of the algorithm in the page allocation path, page allocation can > potentially remain as fast as it would be without memory regions. The > overhead is pushed to the page-freeing paths which are not that critical. > > > Results: > ======= > > Test setup: > ----------- > This patchset applies cleanly on top of 3.7-rc3. > > x86 dual-socket quad core HT-enabled machine booted with mem=8G > Memory region size = 512 MB > > Functional testing: > ------------------- > > Ran pagetest, a simple C program that allocates and touches a required number > of pages. > > Below is the statistics from the regions within ZONE_NORMAL, at various sizes > of allocations from pagetest. > > Present pages | Free pages at various allocations | > | start | 512 MB | 1024 MB | 2048 MB | > Region 0 16 | 0 | 0 | 0 | 0 | > Region 1 131072 | 87219 | 8066 | 7892 | 7387 | > Region 2 131072 | 131072 | 79036 | 0 | 0 | > Region 3 131072 | 131072 | 131072 | 79061 | 0 | > Region 4 131072 | 131072 | 131072 | 131072 | 0 | > Region 5 131072 | 131072 | 131072 | 131072 | 79051 | > Region 6 131072 | 131072 | 131072 | 131072 | 131072 | > Region 7 131072 | 131072 | 131072 | 131072 | 131072 | > Region 8 131056 | 105475 | 105472 | 105472 | 105472 | > > This shows that page allocation occurs in the order of increasing region > numbers, as intended in this design. > > Performance impact: > ------------------- > > Kernbench results didn't show much of a difference between the performance > of vanilla 3.7-rc3 and this patchset. > > > Todos: > ===== > > 1. Memory-region aware page-reclamation: > ---------------------------------------- > > We would like to do page reclaim in the reverse order of page allocation > within a zone, ie., in the order of decreasing region numbers. > To achieve that, while scanning lru pages to reclaim, we could potentially > look for pages belonging to higher regions (considering region boundaries) > or perhaps simply prefer pages of higher pfns (and skip lower pfns) as > reclaim candidates. > This would disrupting LRU ordering and if those pages were recently allocated and you force a situation where swap has to be used then any saving in low memory will be lost by having to access the disk instead. > 2. Compile-time exclusion of Memory Power Management, and extending the > support to also work with other features such as Mem cgroups, kexec etc. > Compile-time exclusion is pointless because it'll be always activated by distribution configs. Support for MPST should be detected at runtime and 3. ACPI support to actually use this thing and validate the design is compatible with the spec and actually works in hardware -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
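[Editorial note: one way to read suggestion (c) above is a parallel set of free areas that the allocator consults only in its slow path, once the normal free_area lists are depleted. The rough sketch below is entirely hypothetical; zone->lowpower and these helpers do not exist in the posted series, and higher-order splitting is deliberately omitted.]

/*
 * Hypothetical sketch of suggestion (c): low-power free areas used only
 * in the allocator slow path when the regular free areas are depleted.
 */
struct zone_lowpower {
	struct free_area free_area_lowpower[MAX_ORDER];
};

static struct page *rmqueue_lowpower(struct zone *zone, unsigned int order)
{
	/* zone->lowpower is a hypothetical field holding the parallel lists */
	struct free_area *area = &zone->lowpower->free_area_lowpower[order];
	struct page *page;

	if (list_empty(&area->free_list[MIGRATE_MOVABLE]))
		return NULL;	/* no splitting of higher orders in this sketch */

	page = list_first_entry(&area->free_list[MIGRATE_MOVABLE],
				struct page, lru);
	list_del(&page->lru);
	area->nr_free--;
	/* Bringing the region back to full power would be handled here */
	return page;
}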
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-08 18:02 ` Mel Gorman @ 2012-11-08 19:38 ` Srivatsa S. Bhat 2012-11-09 5:14 ` Vaidyanathan Srinivasan [not found] ` <loom.20121109T172910-394@post.gmane.org> 2 siblings, 0 replies; 17+ messages in thread From: Srivatsa S. Bhat @ 2012-11-08 19:38 UTC (permalink / raw) To: Mel Gorman Cc: akpm, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On 11/08/2012 11:32 PM, Mel Gorman wrote: > On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote: >> ------------------------------------------------------------ >> >> Today memory subsystems are offer a wide range of capabilities for managing >> memory power consumption. As a quick example, if a block of memory is not >> referenced for a threshold amount of time, the memory controller can decide to >> put that chunk into a low-power content-preserving state. And the next >> reference to that memory chunk would bring it back to full power for read/write. >> With this capability in place, it becomes important for the OS to understand >> the boundaries of such power-manageable chunks of memory and to ensure that >> references are consolidated to a minimum number of such memory power management >> domains. >> > > How much power is saved? Last year, Amit had evaluated the "Hierarchy" patchset on a Samsung Exynos (ARM) board and reported that it could save up to 6.3% relative to total system power. (This was when he allowed only 1 GB out of the total 2 GB RAM to enter low power states). Below is the link to his post, as mentioned in the references section in the cover letter. http://article.gmane.org/gmane.linux.kernel.mm/65935 Of course, the power savings depends on the characteristics of the particular hardware memory subsystem used, and the amount of memory present in the system. > >> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that >> the firmware can expose information regarding the boundaries of such memory >> power management domains to the OS in a standard way. >> > > I'm not familiar with the ACPI spec but is there support for parsing of > MPST and interpreting the associated ACPI events? Sorry I should have been clearer when I mentioned ACPI 5.0. I mentioned ACPI 5.0 just to make a point that support for getting the memory power management boundaries from the firmware is not far away. I didn't mean to say that that's the only target for memory power management. Like I mentioned above, last year the power-savings benefit was measured on ARM boards. The aim of this patchset is to propose and evaluate some of the core VM algorithms that we will need to efficiently exploit the power management features offered by the memory subsystems. IOW, info regarding memory power domain boundaries made available by ACPI 5.0 or even just with some help from the bootloader on some platforms is only the input to the VM subsystem to understand at what granularity it should manage things. *How* it manages is the choice of the algorithm/design at the VM level, which is what this patchset is trying to propose, by exploring several different designs of doing it and its costs/benefits. That's the reason I just hard-coded mem region size to 512 MB in this patchset and focussed on the VM algorithm to explore what we can do, once we have that size/boundary info. 
> For example, if ACPI > fires an event indicating that a memory power node is to enter a low > state then presumably the OS should actively migrate pages away -- even > if it's going into a state where the contents are still refreshed > as exiting that state could take a long time. > We are not really looking at ACPI event notifications here. All we expect from the firmware (at a first level) is info regarding the boundaries, so that the VM can be intelligent about how it consolidates references. Many of the memory subsystems can do power-management automatically - like for example, if a particular chunk of memory is not referenced for a given threshold time, it can put it into low-power (content preserving) state without the OS telling it to do it. > I did not look closely at the patchset at all because it looked like the > actual support to use it and measure the benefit is missing. > Right, we are focussing on the core VM algorithms for now. The input (ACPI or other methods) can come later and then we can measure the numbers. >> How can Linux VM help memory power savings? >> >> o Consolidate memory allocations and/or references such that they are >> not spread across the entire memory address space. Basically area of memory >> that is not being referenced, can reside in low power state. >> > > Which the series does not appear to do. > Well, it influences page-allocation to be memory-region aware. So it does an attempt to consolidate allocations (and thereby references). As I mentioned, hardware transition to low-power state can be automatic. The VM must be intelligent enough to help with that (or atleast smart enough not to disrupt that!), by avoiding spreading across allocations everywhere. >> o Support targeted memory reclaim, where certain areas of memory that can be >> easily freed can be offlined, allowing those areas of memory to be put into >> lower power states. >> > > Which the series does not appear to do judging from this; > Yes, that is one of the items in the TODO list. > include/linux/mm.h | 38 +++++++ > include/linux/mmzone.h | 52 +++++++++ > mm/compaction.c | 8 + > mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++---- > mm/vmstat.c | 59 ++++++++++- > > This does not appear to be doing anything with reclaim and not enough with > compaction to indicate that the series actively manages memory placement > in response to ACPI events. > > Further in section 5.2.21.4 the spec says that power node regions can > overlap (but are not hierarchal for some reason) but have no gaps yet the > structure you use to represent is assumes there can be gaps and there are > no overlaps. Again, this is just glancing at the spec and a quick skim of > the patches so maybe I missed something that explains why this structure > is suitable. > Right, we might need a better way to handle the various possibilities of the layout of memory regions in the hardware. But this initial RFC tried to focus on what do we do with that info, inside the VM to aid with power management. > It seems to me that superficially the VM implementation for the support > would have > > a) Involved a tree that managed the overlapping regions (even if it's > not hierarchal it feels more sensible) and picked the highest-power-state > common denominator in the tree. This would only be allocated if support > for MPST is available. > b) Leave memory allocations and reclaim as they are in the active state. 
> c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower > power but still usable with a latency penalty. This might be a single > migrate type but could also be a parallel set of free_area called > free_area_lowpower that is only used when free_area is depleted and in > the very slow path of the allocator. > d) Use memory hot-remove for power states where the refresh rates were > not constant > > and only did anything expensive in response to an ACPI event -- none of > the fast paths should be touched. > > When transitioning to the low power state, memory should be migrated in > a vaguely similar fashion to what CMA does. For low-power, migration > failure is acceptable. If contents are not preserved, ACPI needs to know > if the migration failed because it cannot enter that power state. > As I mentioned, we are not really talking about reacting to ACPI events here. The idea behind this patchset is to have efficient VM algorithms that can shape memory references depending on power-management boundaries exposed by the firmware. With that as the goal, I feel we should not even consider migration as a first step - we should rather consider how to shape allocations such that we can remain power-efficient right from the beginning and throughout the runtime, without needing to migrate if possible. And this patchset implements one of the designs that achieves that. > For any of this to be worthwhile, low power states would need to be achieved > for long periods of time because that migration is not free. > Best to avoid migration as far as possible in the first place :-) >> Memory Regions: >> --------------- >> >> "Memory Regions" is a way of capturing the boundaries of power-managable >> chunks of memory, within the MM subsystem. >> >> Short description of the "Sorted-buddy" design: >> ----------------------------------------------- >> >> In this design, the memory region boundaries are captured in a parallel >> data-structure instead of fitting regions between nodes and zones in the >> hierarchy. Further, the buddy allocator is altered, such that we maintain the >> zones' freelists in region-sorted-order and thus do page allocation in the >> order of increasing memory regions. > > Implying that this sorting has to happen in the either the alloc or free > fast path. > Yep, I have moved it to the free path. The alloc path remains fast. >> (The freelists need not be fully >> address-sorted, they just need to be region-sorted. Patch 6 explains this >> in more detail). >> >> The idea is to do page allocation in increasing order of memory regions >> (within a zone) and perform page reclaim in the reverse order, as illustrated >> below. >> >> ---------------------------- Increasing region number----------------------> >> >> Direction of allocation---> <---Direction of reclaim >> > > Compaction will work against this because it uses a PFN walker to isolate > free pages and will ignore memory regions. If pageblocks were used, it > could take that into account at least. > >> The sorting logic (to maintain freelist pageblocks in region-sorted-order) >> lies in the page-free path and not the page-allocation path and hence the >> critical page allocation paths remain fast. > > Page free can be a critical path for application performance as well. > Think network buffer heavy alloc and freeing of buffers. > > However, migratetype information is already looked up for THP so ideally > power awareness would piggyback on it. 
> >> Moreover, the heart of the page >> allocation algorithm itself remains largely unchanged, and the region-related >> data-structures are optimized to avoid unnecessary updates during the >> page-allocator's runtime. >> >> Advantages of this design: >> -------------------------- >> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and >> hence we avoid its associated problems (like too many zones, extra page >> reclaim threads, question of choosing watermarks etc). >> [This is an advantage over the "Hierarchy" design] >> >> 2. Performance overhead is expected to be low: Since we retain the simplicity >> of the algorithm in the page allocation path, page allocation can >> potentially remain as fast as it would be without memory regions. The >> overhead is pushed to the page-freeing paths which are not that critical. >> >> >> Results: >> ======= >> >> Test setup: >> ----------- >> This patchset applies cleanly on top of 3.7-rc3. >> >> x86 dual-socket quad core HT-enabled machine booted with mem=8G >> Memory region size = 512 MB >> >> Functional testing: >> ------------------- >> >> Ran pagetest, a simple C program that allocates and touches a required number >> of pages. >> >> Below is the statistics from the regions within ZONE_NORMAL, at various sizes >> of allocations from pagetest. >> >> Present pages | Free pages at various allocations | >> | start | 512 MB | 1024 MB | 2048 MB | >> Region 0 16 | 0 | 0 | 0 | 0 | >> Region 1 131072 | 87219 | 8066 | 7892 | 7387 | >> Region 2 131072 | 131072 | 79036 | 0 | 0 | >> Region 3 131072 | 131072 | 131072 | 79061 | 0 | >> Region 4 131072 | 131072 | 131072 | 131072 | 0 | >> Region 5 131072 | 131072 | 131072 | 131072 | 79051 | >> Region 6 131072 | 131072 | 131072 | 131072 | 131072 | >> Region 7 131072 | 131072 | 131072 | 131072 | 131072 | >> Region 8 131056 | 105475 | 105472 | 105472 | 105472 | >> >> This shows that page allocation occurs in the order of increasing region >> numbers, as intended in this design. >> >> Performance impact: >> ------------------- >> >> Kernbench results didn't show much of a difference between the performance >> of vanilla 3.7-rc3 and this patchset. >> >> >> Todos: >> ===== >> >> 1. Memory-region aware page-reclamation: >> ---------------------------------------- >> >> We would like to do page reclaim in the reverse order of page allocation >> within a zone, ie., in the order of decreasing region numbers. >> To achieve that, while scanning lru pages to reclaim, we could potentially >> look for pages belonging to higher regions (considering region boundaries) >> or perhaps simply prefer pages of higher pfns (and skip lower pfns) as >> reclaim candidates. >> > > This would disrupting LRU ordering and if those pages were recently > allocated and you force a situation where swap has to be used then any > saving in low memory will be lost by having to access the disk instead. > Right, we need to do it in a way that doesn't hurt performance or power-savings. I definitely need to think more on this.. Any suggestions? >> 2. Compile-time exclusion of Memory Power Management, and extending the >> support to also work with other features such as Mem cgroups, kexec etc. >> > > Compile-time exclusion is pointless because it'll be always activated by > distribution configs. Support for MPST should be detected at runtime and > > 3. 
ACPI support to actually use this thing and validate the design is
> compatible with the spec and actually works in hardware
>

ACPI is not the only way to exploit this; other platforms (like ARM, for example) can expose this info today with some help from the bootloader, and as mentioned, Amit already did a quick evaluation last year. So it's not like we are totally blocked on ACPI support in order to design the VM algorithms to manage memory power-efficiently.

Thanks a lot for taking a look and for your invaluable feedback!

Regards,
Srivatsa S. Bhat
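[Editorial note: as a purely hypothetical illustration of that point, the platform-specific piece could be as small as a boot-time hook handing region boundaries (from MPST, device tree, or the bootloader) to the VM. Neither function below exists in the posted series; vm_add_mem_region() is an assumed interface.]

/*
 * Hypothetical platform hook handing memory region boundaries to the VM.
 * vm_add_mem_region() is assumed to record the boundary for use by the
 * sorted-buddy allocator; it is not part of the posted series.
 */
struct mem_region_desc {
	unsigned long start_pfn;
	unsigned long nr_pages;
};

static int __init platform_register_mem_regions(const struct mem_region_desc *regions,
						int count)
{
	int i, ret;

	for (i = 0; i < count; i++) {
		ret = vm_add_mem_region(regions[i].start_pfn,
					regions[i].nr_pages);
		if (ret)
			return ret;
	}
	return 0;
}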
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-08 18:02 ` Mel Gorman 2012-11-08 19:38 ` Srivatsa S. Bhat @ 2012-11-09 5:14 ` Vaidyanathan Srinivasan 2012-11-09 9:00 ` Mel Gorman 2012-11-09 15:34 ` Arjan van de Ven [not found] ` <loom.20121109T172910-394@post.gmane.org> 2 siblings, 2 replies; 17+ messages in thread From: Vaidyanathan Srinivasan @ 2012-11-09 5:14 UTC (permalink / raw) To: Mel Gorman Cc: Srivatsa S. Bhat, akpm, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel * Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]: > On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote: > > ------------------------------------------------------------ Hi Mel, Thanks for detailed review and comments. The goal of this patch series is to brainstorm on ideas that enable Linux VM to record and exploit memory region boundaries. The first approach that we had last year (hierarchy) has more runtime overhead. This approach of sorted-buddy was one of the alternative discussed earlier and we are trying to find out if simple requirements of biasing memory allocations can be achieved with this approach. Smart reclaim based on this approach is a key piece we still need to design. Ideas from compaction will certainly help. > > Today memory subsystems are offer a wide range of capabilities for managing > > memory power consumption. As a quick example, if a block of memory is not > > referenced for a threshold amount of time, the memory controller can decide to > > put that chunk into a low-power content-preserving state. And the next > > reference to that memory chunk would bring it back to full power for read/write. > > With this capability in place, it becomes important for the OS to understand > > the boundaries of such power-manageable chunks of memory and to ensure that > > references are consolidated to a minimum number of such memory power management > > domains. > > > > How much power is saved? On embedded platform the savings could be around 5% as discussed in the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935 On larger servers with large amounts of memory the savings could be more. We do not yet have all the pieces together to evaluate. > > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that > > the firmware can expose information regarding the boundaries of such memory > > power management domains to the OS in a standard way. > > > > I'm not familiar with the ACPI spec but is there support for parsing of > MPST and interpreting the associated ACPI events? For example, if ACPI > fires an event indicating that a memory power node is to enter a low > state then presumably the OS should actively migrate pages away -- even > if it's going into a state where the contents are still refreshed > as exiting that state could take a long time. > > I did not look closely at the patchset at all because it looked like the > actual support to use it and measure the benefit is missing. Correct. The platform interface part is not included in this patch set mainly because there is not much design required there. Each platform can have code to collect the memory region boundaries from BIOS/firmware and load it into the Linux VM. The goal of this patch is to brainstorm on the idea of hos core VM should used the region information. 
> > How can Linux VM help memory power savings? > > > > o Consolidate memory allocations and/or references such that they are > > not spread across the entire memory address space. Basically area of memory > > that is not being referenced, can reside in low power state. > > > > Which the series does not appear to do. Correct. We need to design the correct reclaim strategy for this to work. However having buddy list sorted by region address could get us one step closer to shaping the allocations. > > o Support targeted memory reclaim, where certain areas of memory that can be > > easily freed can be offlined, allowing those areas of memory to be put into > > lower power states. > > > > Which the series does not appear to do judging from this; > > include/linux/mm.h | 38 +++++++ > include/linux/mmzone.h | 52 +++++++++ > mm/compaction.c | 8 + > mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++---- > mm/vmstat.c | 59 ++++++++++- > > This does not appear to be doing anything with reclaim and not enough with > compaction to indicate that the series actively manages memory placement > in response to ACPI events. Correct. Evaluating different ideas for reclaim will be next step before getting into the platform interface parts. > Further in section 5.2.21.4 the spec says that power node regions can > overlap (but are not hierarchal for some reason) but have no gaps yet the > structure you use to represent is assumes there can be gaps and there are > no overlaps. Again, this is just glancing at the spec and a quick skim of > the patches so maybe I missed something that explains why this structure > is suitable. This patch is roughly based on the idea that ACPI MPST will give us memory region boundaries. It is not designed to implement all options defined in the spec. We have taken a general case of regions do not overlap while memory addresses itself can be discontinuous. > It seems to me that superficially the VM implementation for the support > would have > > a) Involved a tree that managed the overlapping regions (even if it's > not hierarchal it feels more sensible) and picked the highest-power-state > common denominator in the tree. This would only be allocated if support > for MPST is available. > b) Leave memory allocations and reclaim as they are in the active state. > c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower > power but still usable with a latency penalty. This might be a single > migrate type but could also be a parallel set of free_area called > free_area_lowpower that is only used when free_area is depleted and in > the very slow path of the allocator. > d) Use memory hot-remove for power states where the refresh rates were > not constant > > and only did anything expensive in response to an ACPI event -- none of > the fast paths should be touched. > > When transitioning to the low power state, memory should be migrated in > a vaguely similar fashion to what CMA does. For low-power, migration > failure is acceptable. If contents are not preserved, ACPI needs to know > if the migration failed because it cannot enter that power state. > > For any of this to be worthwhile, low power states would need to be achieved > for long periods of time because that migration is not free. In this patch series we are assuming the simple case of hardware managing the actual power states and OS facilitates them by keeping the allocations in less number of memory regions. As we keep allocations and references low to a regions, it becomes case (c) above. 
We are addressing only a small subset of the above list. > > Memory Regions: > > --------------- > > > > "Memory Regions" is a way of capturing the boundaries of power-managable > > chunks of memory, within the MM subsystem. > > > > Short description of the "Sorted-buddy" design: > > ----------------------------------------------- > > > > In this design, the memory region boundaries are captured in a parallel > > data-structure instead of fitting regions between nodes and zones in the > > hierarchy. Further, the buddy allocator is altered, such that we maintain the > > zones' freelists in region-sorted-order and thus do page allocation in the > > order of increasing memory regions. > > Implying that this sorting has to happen in the either the alloc or free > fast path. Yes, in the free path. This optimization can be actually be delayed in the free fast path and completely avoided if our memory is full and we are doing direct reclaim during allocations. > > (The freelists need not be fully > > address-sorted, they just need to be region-sorted. Patch 6 explains this > > in more detail). > > > > The idea is to do page allocation in increasing order of memory regions > > (within a zone) and perform page reclaim in the reverse order, as illustrated > > below. > > > > ---------------------------- Increasing region number----------------------> > > > > Direction of allocation---> <---Direction of reclaim > > > > Compaction will work against this because it uses a PFN walker to isolate > free pages and will ignore memory regions. If pageblocks were used, it > could take that into account at least. > > > The sorting logic (to maintain freelist pageblocks in region-sorted-order) > > lies in the page-free path and not the page-allocation path and hence the > > critical page allocation paths remain fast. > > Page free can be a critical path for application performance as well. > Think network buffer heavy alloc and freeing of buffers. > > However, migratetype information is already looked up for THP so ideally > power awareness would piggyback on it. > > > Moreover, the heart of the page > > allocation algorithm itself remains largely unchanged, and the region-related > > data-structures are optimized to avoid unnecessary updates during the > > page-allocator's runtime. > > > > Advantages of this design: > > -------------------------- > > 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and > > hence we avoid its associated problems (like too many zones, extra page > > reclaim threads, question of choosing watermarks etc). > > [This is an advantage over the "Hierarchy" design] > > > > 2. Performance overhead is expected to be low: Since we retain the simplicity > > of the algorithm in the page allocation path, page allocation can > > potentially remain as fast as it would be without memory regions. The > > overhead is pushed to the page-freeing paths which are not that critical. > > > > > > Results: > > ======= > > > > Test setup: > > ----------- > > This patchset applies cleanly on top of 3.7-rc3. > > > > x86 dual-socket quad core HT-enabled machine booted with mem=8G > > Memory region size = 512 MB > > > > Functional testing: > > ------------------- > > > > Ran pagetest, a simple C program that allocates and touches a required number > > of pages. > > > > Below is the statistics from the regions within ZONE_NORMAL, at various sizes > > of allocations from pagetest. 
> > > > Present pages | Free pages at various allocations | > > | start | 512 MB | 1024 MB | 2048 MB | > > Region 0 16 | 0 | 0 | 0 | 0 | > > Region 1 131072 | 87219 | 8066 | 7892 | 7387 | > > Region 2 131072 | 131072 | 79036 | 0 | 0 | > > Region 3 131072 | 131072 | 131072 | 79061 | 0 | > > Region 4 131072 | 131072 | 131072 | 131072 | 0 | > > Region 5 131072 | 131072 | 131072 | 131072 | 79051 | > > Region 6 131072 | 131072 | 131072 | 131072 | 131072 | > > Region 7 131072 | 131072 | 131072 | 131072 | 131072 | > > Region 8 131056 | 105475 | 105472 | 105472 | 105472 | > > > > This shows that page allocation occurs in the order of increasing region > > numbers, as intended in this design. > > > > Performance impact: > > ------------------- > > > > Kernbench results didn't show much of a difference between the performance > > of vanilla 3.7-rc3 and this patchset. > > > > > > Todos: > > ===== > > > > 1. Memory-region aware page-reclamation: > > ---------------------------------------- > > > > We would like to do page reclaim in the reverse order of page allocation > > within a zone, ie., in the order of decreasing region numbers. > > To achieve that, while scanning lru pages to reclaim, we could potentially > > look for pages belonging to higher regions (considering region boundaries) > > or perhaps simply prefer pages of higher pfns (and skip lower pfns) as > > reclaim candidates. > > > > This would disrupting LRU ordering and if those pages were recently > allocated and you force a situation where swap has to be used then any > saving in low memory will be lost by having to access the disk instead. > > > 2. Compile-time exclusion of Memory Power Management, and extending the > > support to also work with other features such as Mem cgroups, kexec etc. > > > > Compile-time exclusion is pointless because it'll be always activated by > distribution configs. Support for MPST should be detected at runtime and > > 3. ACPI support to actually use this thing and validate the design is > compatible with the spec and actually works in hardware This is required to actually evaluate power saving benefit once we have candidate implementations in the VM. At this point we want to look at overheads of having region infrastructure in VM and how does that trade off in terms of requirements that we can meet. The first goal is to have memory allocations fill as few regions as possible when system's memory usage is significantly lower. Next we would like VM to actively move pages around to cooperate with platform memory power saving features like notifications or policy changes. --Vaidy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
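[Editorial note: for reference, the "prefer higher pfns" idea from the TODO quoted above could be as simple as a filter applied while isolating LRU pages, as sketched below. The helper is hypothetical and not part of the posted series; skipping low-pfn pages this way is exactly the LRU-ordering disruption raised in the review.]

/*
 * Illustrative sketch of biasing reclaim towards higher regions by
 * skipping pages below a boundary pfn while scanning the LRU.
 * boundary_pfn would be derived from the region being evacuated.
 */
static bool region_reclaim_candidate(struct page *page,
				     unsigned long boundary_pfn)
{
	return page_to_pfn(page) >= boundary_pfn;
}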
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-09 5:14 ` Vaidyanathan Srinivasan @ 2012-11-09 9:00 ` Mel Gorman 2012-11-09 14:51 ` Srivatsa S. Bhat 2012-11-09 15:34 ` Arjan van de Ven 1 sibling, 1 reply; 17+ messages in thread From: Mel Gorman @ 2012-11-09 9:00 UTC (permalink / raw) To: Vaidyanathan Srinivasan Cc: Srivatsa S. Bhat, akpm, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote: > * Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]: > > > On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote: > > > ------------------------------------------------------------ > > Hi Mel, > > Thanks for detailed review and comments. The goal of this patch > series is to brainstorm on ideas that enable Linux VM to record and > exploit memory region boundaries. > I see. > The first approach that we had last year (hierarchy) has more runtime > overhead. This approach of sorted-buddy was one of the alternative > discussed earlier and we are trying to find out if simple requirements > of biasing memory allocations can be achieved with this approach. > > Smart reclaim based on this approach is a key piece we still need to > design. Ideas from compaction will certainly help. > > > > Today memory subsystems are offer a wide range of capabilities for managing > > > memory power consumption. As a quick example, if a block of memory is not > > > referenced for a threshold amount of time, the memory controller can decide to > > > put that chunk into a low-power content-preserving state. And the next > > > reference to that memory chunk would bring it back to full power for read/write. > > > With this capability in place, it becomes important for the OS to understand > > > the boundaries of such power-manageable chunks of memory and to ensure that > > > references are consolidated to a minimum number of such memory power management > > > domains. > > > > > > > How much power is saved? > > On embedded platform the savings could be around 5% as discussed in > the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935 > > On larger servers with large amounts of memory the savings could be > more. We do not yet have all the pieces together to evaluate. > Ok, it's something to keep an eye on because if memory power savings require large amounts of CPU (for smart placement or migration) or more disk accesses (due to reclaim) then the savings will be offset by increased power usage elsehwere. > > > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that > > > the firmware can expose information regarding the boundaries of such memory > > > power management domains to the OS in a standard way. > > > > > > > I'm not familiar with the ACPI spec but is there support for parsing of > > MPST and interpreting the associated ACPI events? For example, if ACPI > > fires an event indicating that a memory power node is to enter a low > > state then presumably the OS should actively migrate pages away -- even > > if it's going into a state where the contents are still refreshed > > as exiting that state could take a long time. > > > > I did not look closely at the patchset at all because it looked like the > > actual support to use it and measure the benefit is missing. > > Correct. 
The platform interface part is not included in this patch > set mainly because there is not much design required there. Each > platform can have code to collect the memory region boundaries from > BIOS/firmware and load it into the Linux VM. The goal of this patch > is to brainstorm on the idea of hos core VM should used the region > information. > Ok. It does mean that the patches should not be merged until there is some platform support that can take advantage of them. > > > How can Linux VM help memory power savings? > > > > > > o Consolidate memory allocations and/or references such that they are > > > not spread across the entire memory address space. Basically area of memory > > > that is not being referenced, can reside in low power state. > > > > > > > Which the series does not appear to do. > > Correct. We need to design the correct reclaim strategy for this to > work. However having buddy list sorted by region address could get us > one step closer to shaping the allocations. > If you reclaim, it means that the information is going to disk and will have to be refaulted in sooner rather than later. If you concentrate on reclaiming low memory regions and memory is almost full, it will lead to a situation where you almost always reclaim newer pages and increase faulting. You will save a few milliwatts on memory and lose way more than that on increase disk traffic and CPU usage. > > > o Support targeted memory reclaim, where certain areas of memory that can be > > > easily freed can be offlined, allowing those areas of memory to be put into > > > lower power states. > > > > > > > Which the series does not appear to do judging from this; > > > > include/linux/mm.h | 38 +++++++ > > include/linux/mmzone.h | 52 +++++++++ > > mm/compaction.c | 8 + > > mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++---- > > mm/vmstat.c | 59 ++++++++++- > > > > This does not appear to be doing anything with reclaim and not enough with > > compaction to indicate that the series actively manages memory placement > > in response to ACPI events. > > Correct. Evaluating different ideas for reclaim will be next step > before getting into the platform interface parts. > > > Further in section 5.2.21.4 the spec says that power node regions can > > overlap (but are not hierarchal for some reason) but have no gaps yet the > > structure you use to represent is assumes there can be gaps and there are > > no overlaps. Again, this is just glancing at the spec and a quick skim of > > the patches so maybe I missed something that explains why this structure > > is suitable. > > This patch is roughly based on the idea that ACPI MPST will give us > memory region boundaries. It is not designed to implement all options > defined in the spec. Ok, but as it is the only potential consumer of this interface that you mentioned then it should at least be able to handle it. The spec talks about overlapping memory regions where the regions potentially have differnet power states. This is pretty damn remarkable and hard to see how it could be interpreted in a sensible way but it forces your implementation to take it into account. > We have taken a general case of regions do not > overlap while memory addresses itself can be discontinuous. > Why is the general case? You referred to the ACPI spec where it is not the case and no other examples. 
> > It seems to me that superficially the VM implementation for the support > > would have > > > > a) Involved a tree that managed the overlapping regions (even if it's > > not hierarchal it feels more sensible) and picked the highest-power-state > > common denominator in the tree. This would only be allocated if support > > for MPST is available. > > b) Leave memory allocations and reclaim as they are in the active state. > > c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower > > power but still usable with a latency penalty. This might be a single > > migrate type but could also be a parallel set of free_area called > > free_area_lowpower that is only used when free_area is depleted and in > > the very slow path of the allocator. > > d) Use memory hot-remove for power states where the refresh rates were > > not constant > > > > and only did anything expensive in response to an ACPI event -- none of > > the fast paths should be touched. > > > > When transitioning to the low power state, memory should be migrated in > > a vaguely similar fashion to what CMA does. For low-power, migration > > failure is acceptable. If contents are not preserved, ACPI needs to know > > if the migration failed because it cannot enter that power state. > > > > For any of this to be worthwhile, low power states would need to be achieved > > for long periods of time because that migration is not free. > > In this patch series we are assuming the simple case of hardware > managing the actual power states and OS facilitates them by keeping > the allocations in less number of memory regions. As we keep > allocations and references low to a regions, it becomes case (c) > above. We are addressing only a small subset of the above list. > > > > Memory Regions: > > > --------------- > > > > > > "Memory Regions" is a way of capturing the boundaries of power-managable > > > chunks of memory, within the MM subsystem. > > > > > > Short description of the "Sorted-buddy" design: > > > ----------------------------------------------- > > > > > > In this design, the memory region boundaries are captured in a parallel > > > data-structure instead of fitting regions between nodes and zones in the > > > hierarchy. Further, the buddy allocator is altered, such that we maintain the > > > zones' freelists in region-sorted-order and thus do page allocation in the > > > order of increasing memory regions. > > > > Implying that this sorting has to happen in the either the alloc or free > > fast path. > > Yes, in the free path. This optimization can be actually be delayed in > the free fast path and completely avoided if our memory is full and we > are doing direct reclaim during allocations. > Hurting the free fast path is a bad idea as there are workloads that depend on it (buffer allocation and free) even though many workloads do *not* notice it because the bulk of the cost is incurred at exit time. As memory low power usage has many caveats (may be impossible if a page table is allocated in the region for example) but CPU usage has less restrictions it is more important that the CPU usage be kept low. That means, little or no modification to the fastpath. Sorting or linear searches should be minimised or avoided. > > > <SNIPPED where I pointed out that compaction will bust sorting> > > > > Compile-time exclusion is pointless because it'll be always activated by > > distribution configs. Support for MPST should be detected at runtime and > > > > 3. 
ACPI support to actually use this thing and validate the design is > > compatible with the spec and actually works in hardware > > This is required to actually evaluate power saving benefit once we > have candidate implementations in the VM. > > At this point we want to look at overheads of having region > infrastructure in VM and how does that trade off in terms of > requirements that we can meet. > > The first goal is to have memory allocations fill as few regions as > possible when the system's memory usage is significantly lower. While it's a reasonable starting objective, the fast path overhead is very unfortunate and such a strategy can be easily defeated by running something metadata intensive (like find over the entire system) while a large memory user starts at the same time to spread kernel and user space allocations throughout the address space. This will spread the allocations throughout the address space and persist even after the two processes exit due to the page cache usage from the metadata intensive workload. Basically, it'll only work as long as the system is idle or never uses much memory during the lifetime of the system. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
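As a rough, purely illustrative user-space model of option (c) from Mel's list above -- a "sticky" low-power list that the allocator only falls back to in the slow path once the normal free list is empty -- consider the sketch below. Every name in it is invented for the example; it is not code from this series or from any kernel tree.

#include <stdio.h>
#include <stddef.h>

/* Model of option (c): a parallel low-power free list that is touched
 * only when the normal free list is exhausted. */
struct fake_page {
	struct fake_page *next;
	unsigned long pfn;
};

static struct fake_page *free_list;		/* normal, full-power memory */
static struct fake_page *free_list_lowpower;	/* memory currently in a low-power state */

static struct fake_page *pop(struct fake_page **list)
{
	struct fake_page *page = *list;

	if (page)
		*list = page->next;
	return page;
}

/* Fast path: normal list only.  Slow path: dip into low-power memory,
 * which is where the latency penalty of waking a region would be paid. */
static struct fake_page *alloc_fake_page(void)
{
	struct fake_page *page = pop(&free_list);

	if (!page)
		page = pop(&free_list_lowpower);
	return page;
}

int main(void)
{
	static struct fake_page a = { NULL, 1 }, b = { NULL, 2 };

	free_list = &a;
	free_list_lowpower = &b;

	printf("first alloc: pfn %lu\n", alloc_fake_page()->pfn);	/* normal list */
	printf("second alloc: pfn %lu\n", alloc_fake_page()->pfn);	/* low-power fallback */
	return 0;
}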
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-09 9:00 ` Mel Gorman @ 2012-11-09 14:51 ` Srivatsa S. Bhat 2012-11-09 15:23 ` Srivatsa S. Bhat 0 siblings, 1 reply; 17+ messages in thread From: Srivatsa S. Bhat @ 2012-11-09 14:51 UTC (permalink / raw) To: Mel Gorman Cc: Vaidyanathan Srinivasan, akpm, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On 11/09/2012 02:30 PM, Mel Gorman wrote: > On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote: >> * Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]: >> [...] >>> How much power is saved? >> >> On embedded platform the savings could be around 5% as discussed in >> the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935 >> >> On larger servers with large amounts of memory the savings could be >> more. We do not yet have all the pieces together to evaluate. >> > > Ok, it's something to keep an eye on because if memory power savings > require large amounts of CPU (for smart placement or migration) or more > disk accesses (due to reclaim) then the savings will be offset by > increased power usage elsewhere. > True. >>>> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that >>>> the firmware can expose information regarding the boundaries of such memory >>>> power management domains to the OS in a standard way. >>>> >>> >>> I'm not familiar with the ACPI spec but is there support for parsing of >>> MPST and interpreting the associated ACPI events? For example, if ACPI >>> fires an event indicating that a memory power node is to enter a low >>> state then presumably the OS should actively migrate pages away -- even >>> if it's going into a state where the contents are still refreshed >>> as exiting that state could take a long time. >>> >>> I did not look closely at the patchset at all because it looked like the >>> actual support to use it and measure the benefit is missing. >> >> Correct. The platform interface part is not included in this patch >> set mainly because there is not much design required there. Each >> platform can have code to collect the memory region boundaries from >> BIOS/firmware and load it into the Linux VM. The goal of this patch >> is to brainstorm on the idea of how the core VM should use the region >> information. >> > > Ok. It does mean that the patches should not be merged until there is > some platform support that can take advantage of them. > That's right, but the development of the VM algorithms and the platform support for different platforms can go on in parallel. And once we have all the pieces designed, we can fit them together and merge them. >>>> How can Linux VM help memory power savings? >>>> >>>> o Consolidate memory allocations and/or references such that they are >>>> not spread across the entire memory address space. Basically area of memory >>>> that is not being referenced, can reside in low power state. >>>> >>> >>> Which the series does not appear to do. >> >> Correct. We need to design the correct reclaim strategy for this to >> work. However, having the buddy list sorted by region address could get us >> one step closer to shaping the allocations. >> > > If you reclaim, it means that the information is going to disk and will > have to be refaulted in sooner rather than later.
If you concentrate on > reclaiming low memory regions and memory is almost full, it will lead to > a situation where you almost always reclaim newer pages and increase > faulting. You will save a few milliwatts on memory and lose way more > than that on increased disk traffic and CPU usage. > Yes, we should ensure that our reclaim strategy won't back-fire like that. We definitely need to depend on LRU ordering for reclaim for the most part, but try to opportunistically reclaim from within the required region boundaries while doing that. We definitely need to think more about this... But the point of making the free lists sorted region-wise in this patchset was to exploit the shaping of page allocations the way we want (i.e., constrained to a smaller number of regions). >>>> o Support targeted memory reclaim, where certain areas of memory that can be >>>> easily freed can be offlined, allowing those areas of memory to be put into >>>> lower power states. >>>> >>> >>> Which the series does not appear to do judging from this; >>> >>> include/linux/mm.h | 38 +++++++ >>> include/linux/mmzone.h | 52 +++++++++ >>> mm/compaction.c | 8 + >>> mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++---- >>> mm/vmstat.c | 59 ++++++++++- >>> >>> This does not appear to be doing anything with reclaim and not enough with >>> compaction to indicate that the series actively manages memory placement >>> in response to ACPI events. >> >> Correct. Evaluating different ideas for reclaim will be the next step >> before getting into the platform interface parts. >> [...] >> >> This patch is roughly based on the idea that ACPI MPST will give us >> memory region boundaries. It is not designed to implement all options >> defined in the spec. > > Ok, but as it is the only potential consumer of this interface that you > mentioned then it should at least be able to handle it. The spec talks about > overlapping memory regions where the regions potentially have different > power states. This is pretty damn remarkable and hard to see how it could > be interpreted in a sensible way but it forces your implementation to take > it into account. > Well, sorry for not mentioning it in the cover-letter, but the VM algorithms for memory power management could benefit other platforms too, like ARM, not just ACPI-based systems. Last year, Amit had evaluated them on Samsung boards with a simplistic layout for memory regions, based on the Samsung exynos board's configuration. http://article.gmane.org/gmane.linux.kernel.mm/65935 >> We have taken a general case where regions do not >> overlap while memory addresses themselves can be discontinuous. >> > > Why is that the general case? You referred to the ACPI spec where it is not > the case and no other examples. > ARM is another example, where we could describe the memory regions in a simple manner with respect to the Samsung exynos board. So the idea behind this patchset was to start by assuming a simplistic layout for memory regions and focussing on the design of the VM algorithms, and evaluating how this "sorted-buddy" design would perform in comparison to the previous "hierarchy" design that was explored last year. But of course, you are absolutely right in pointing out that, to make all this consumable, we need to revisit this with a focus on the layout of memory regions themselves, so that all interested platforms can make use of it effectively. [...]
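One possible reading of the "opportunistically reclaim from within the required region boundaries" idea above, sketched in user-space C purely for illustration (the names are made up, and there is no claim that the patchset implements it this way): scan candidates in LRU order as usual, but prefer victims that happen to sit in higher-numbered regions, and fall back to plain LRU order if that does not yield enough pages.

#include <stdio.h>

/* Toy model: candidates[] is already in LRU order (oldest first) and
 * region_of[] gives the memory region each candidate lives in. */
#define NR_CANDIDATES 6

static int candidates[NR_CANDIDATES] = { 10, 11, 12, 13, 14, 15 };	/* page ids, oldest first */
static int region_of[NR_CANDIDATES]  = {  0,  3,  1,  3,  2,  3 };

/* Pick up to 'want' victims: the first pass takes old pages from regions
 * >= preferred_region, the second pass falls back to plain LRU order. */
static int pick_victims(int want, int preferred_region, int *out)
{
	int taken[NR_CANDIDATES] = { 0 };
	int n = 0, i, pass;

	for (pass = 0; pass < 2 && n < want; pass++) {
		for (i = 0; i < NR_CANDIDATES && n < want; i++) {
			if (taken[i])
				continue;
			if (pass == 0 && region_of[i] < preferred_region)
				continue;	/* first pass: high regions only */
			taken[i] = 1;
			out[n++] = candidates[i];
		}
	}
	return n;
}

int main(void)
{
	int victims[NR_CANDIDATES];
	int n = pick_victims(3, 3, victims), i;

	for (i = 0; i < n; i++)
		printf("reclaim page %d\n", victims[i]);
	return 0;
}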
>>>> Short description of the "Sorted-buddy" design: >>>> ----------------------------------------------- >>>> >>>> In this design, the memory region boundaries are captured in a parallel >>>> data-structure instead of fitting regions between nodes and zones in the >>>> hierarchy. Further, the buddy allocator is altered, such that we maintain the >>>> zones' freelists in region-sorted-order and thus do page allocation in the >>>> order of increasing memory regions. >>> >>> Implying that this sorting has to happen in either the alloc or free >>> fast path. >> >> Yes, in the free path. This optimization can actually be delayed in >> the free fast path and completely avoided if our memory is full and we >> are doing direct reclaim during allocations. >> > > Hurting the free fast path is a bad idea as there are workloads that depend > on it (buffer allocation and free) even though many workloads do *not* > notice it because the bulk of the cost is incurred at exit time. As > memory low power usage has many caveats (may be impossible if a page > table is allocated in the region for example) but CPU usage has fewer > restrictions it is more important that the CPU usage be kept low. > > That means, little or no modification to the fastpath. Sorting or linear > searches should be minimised or avoided. > Right. For example, in the previous "hierarchy" design[1], there was no overhead in any of the fast paths, because it split up the zones themselves so that they fit on memory region boundaries. But that design had other problems, like zone fragmentation (too many zones), which kind of outweighed the benefit obtained from zero overhead in the fast-paths. So one of the suggested alternatives during that review[2], was to explore modifying the buddy allocator to be aware of memory region boundaries, which this "sorted-buddy" design implements. [1]. http://lwn.net/Articles/445045/ http://thread.gmane.org/gmane.linux.kernel.mm/63840 http://thread.gmane.org/gmane.linux.kernel.mm/89202 [2]. http://article.gmane.org/gmane.linux.power-management.general/24862 http://article.gmane.org/gmane.linux.power-management.general/25061 http://article.gmane.org/gmane.linux.kernel.mm/64689 In this patchset, I have tried to minimize the overhead on the fastpaths. For example, I have used a special 'next_region' data-structure to keep the alloc path fast. Also, in the free path, we don't need to keep the free lists fully address sorted; having them region-sorted is sufficient. Of course we could explore more ways of avoiding overhead in the fast paths, or even a different design that promises to be much better overall. I'm all ears for any suggestions :-) >> At this point we want to look at overheads of having region >> infrastructure in VM and how does that trade off in terms of >> requirements that we can meet. >> >> The first goal is to have memory allocations fill as few regions as >> possible when the system's memory usage is significantly lower. > > While it's a reasonable starting objective, the fast path overhead is very > unfortunate and such a strategy can be easily defeated by running something > metadata intensive (like find over the entire system) while a large memory > user starts at the same time to spread kernel and user space allocations > throughout the address space. This will spread the allocations throughout > the address space and persist even after the two processes exit due to > the page cache usage from the metadata intensive workload.
> > Basically, it'll only work as long as the system is idle or never uses > much memory during the lifetime of the system. > Well, page cache usage could definitely get in the way of memory power management. Probably having a separate driver shrink the page cache (depending on how aggressive we want to get with respect to power-management) is the way to go? Regards, Srivatsa S. Bhat -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
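For readers skimming the thread, a self-contained sketch of the free-path sorting discussed in the message above may help: freed blocks are kept region-sorted (not fully address-sorted), and the allocation side simply takes the head of the list, so lower-numbered regions are consumed first. The structure and function names below are invented for the example and do not correspond to the patchset's actual code.

#include <stdio.h>

/* Sketch of a region-sorted free list: freed blocks are inserted so that
 * blocks from lower-numbered regions come first; within a region the order
 * is arbitrary (no full address sort).  Allocation pops the head, so it
 * naturally consumes lower regions before higher ones. */
struct free_block {
	struct free_block *next;
	int region;		/* memory region this block belongs to */
	unsigned long pfn;
};

static struct free_block *freelist;

/* Free path: walk until the first block of a higher region and insert
 * before it.  A per-region tail hint could bound this walk by the number
 * of regions; the plain linear walk keeps the sketch short. */
static void free_block_sorted(struct free_block *blk)
{
	struct free_block **pos = &freelist;

	while (*pos && (*pos)->region <= blk->region)
		pos = &(*pos)->next;
	blk->next = *pos;
	*pos = blk;
}

/* Alloc path stays trivial: take the head (lowest region available). */
static struct free_block *alloc_block(void)
{
	struct free_block *blk = freelist;

	if (blk)
		freelist = blk->next;
	return blk;
}

int main(void)
{
	struct free_block blocks[] = {
		{ NULL, 2, 0x40000 }, { NULL, 0, 0x01000 }, { NULL, 1, 0x20000 },
	};
	int i;

	for (i = 0; i < 3; i++)
		free_block_sorted(&blocks[i]);	/* free in arbitrary region order */

	for (i = 0; i < 3; i++)
		printf("alloc from region %d\n", alloc_block()->region);	/* prints 0, 1, 2 */
	return 0;
}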
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-09 14:51 ` Srivatsa S. Bhat @ 2012-11-09 15:23 ` Srivatsa S. Bhat 2012-11-09 16:13 ` Dave Hansen 0 siblings, 1 reply; 17+ messages in thread From: Srivatsa S. Bhat @ 2012-11-09 15:23 UTC (permalink / raw) To: Mel Gorman Cc: Vaidyanathan Srinivasan, akpm, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On 11/09/2012 08:21 PM, Srivatsa S. Bhat wrote: > On 11/09/2012 02:30 PM, Mel Gorman wrote: >> On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote: >>> * Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]: [...] >>>>> Short description of the "Sorted-buddy" design: >>>>> ----------------------------------------------- >>>>> >>>>> In this design, the memory region boundaries are captured in a parallel >>>>> data-structure instead of fitting regions between nodes and zones in the >>>>> hierarchy. Further, the buddy allocator is altered, such that we maintain the >>>>> zones' freelists in region-sorted-order and thus do page allocation in the >>>>> order of increasing memory regions. >>>> >>>> Implying that this sorting has to happen in the either the alloc or free >>>> fast path. >>> >>> Yes, in the free path. This optimization can be actually be delayed in >>> the free fast path and completely avoided if our memory is full and we >>> are doing direct reclaim during allocations. >>> >> >> Hurting the free fast path is a bad idea as there are workloads that depend >> on it (buffer allocation and free) even though many workloads do *not* >> notice it because the bulk of the cost is incurred at exit time. As >> memory low power usage has many caveats (may be impossible if a page >> table is allocated in the region for example) but CPU usage has less >> restrictions it is more important that the CPU usage be kept low. >> >> That means, little or no modification to the fastpath. Sorting or linear >> searches should be minimised or avoided. >> > > Right. For example, in the previous "hierarchy" design[1], there was no overhead > in any of the fast paths. Because it split up the zones themselves, so that > they fit on memory region boundaries. But that design had other problems, like > zone fragmentation (too many zones).. which kind of out-weighed the benefit > obtained from zero overhead in the fast-paths. So one of the suggested > alternatives during that review[2], was to explore modifying the buddy allocator > to be aware of memory region boundaries, which this "sorted-buddy" design > implements. > > [1]. http://lwn.net/Articles/445045/ > http://thread.gmane.org/gmane.linux.kernel.mm/63840 > http://thread.gmane.org/gmane.linux.kernel.mm/89202 > > [2]. http://article.gmane.org/gmane.linux.power-management.general/24862 > http://article.gmane.org/gmane.linux.power-management.general/25061 > http://article.gmane.org/gmane.linux.kernel.mm/64689 > > In this patchset, I have tried to minimize the overhead on the fastpaths. > For example, I have used a special 'next_region' data-structure to keep the > alloc path fast. Also, in the free path, we don't need to keep the free > lists fully address sorted; having them region-sorted is sufficient. Of course > we could explore more ways of avoiding overhead in the fast paths, or even a > different design that promises to be much better overall. 
I'm all ears for > any suggestions :-) > FWIW, kernbench is actually (and surprisingly) showing a slight performance *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in my other email to Dave. https://lkml.org/lkml/2012/11/7/428 I don't think I can dismiss it as an experimental error, because I am seeing those results consistently.. I'm trying to find out what's behind that. Regards, Srivatsa S. Bhat -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-09 15:23 ` Srivatsa S. Bhat @ 2012-11-09 16:13 ` Dave Hansen 2012-11-09 16:34 ` Srivatsa S. Bhat 0 siblings, 1 reply; 17+ messages in thread From: Dave Hansen @ 2012-11-09 16:13 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Mel Gorman, Vaidyanathan Srinivasan, akpm, mjg59, paulmck, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote: > FWIW, kernbench is actually (and surprisingly) showing a slight performance > *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in > my other email to Dave. > > https://lkml.org/lkml/2012/11/7/428 > > I don't think I can dismiss it as an experimental error, because I am seeing > those results consistently.. I'm trying to find out what's behind that. The only numbers in that link are in the date. :) Let's see the numbers, please. If you really have performance improvement to the memory allocator (or something else) here, then surely it can be pared out of your patches and merged quickly by itself. Those kinds of optimizations are hard to come by! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-09 16:13 ` Dave Hansen @ 2012-11-09 16:34 ` Srivatsa S. Bhat 2012-11-09 16:43 ` Srivatsa S. Bhat 0 siblings, 1 reply; 17+ messages in thread From: Srivatsa S. Bhat @ 2012-11-09 16:34 UTC (permalink / raw) To: Dave Hansen Cc: Mel Gorman, Vaidyanathan Srinivasan, akpm, mjg59, paulmck, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On 11/09/2012 09:43 PM, Dave Hansen wrote: > On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote: >> FWIW, kernbench is actually (and surprisingly) showing a slight performance >> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in >> my other email to Dave. >> >> https://lkml.org/lkml/2012/11/7/428 >> >> I don't think I can dismiss it as an experimental error, because I am seeing >> those results consistently.. I'm trying to find out what's behind that. > > The only numbers in that link are in the date. :) Let's see the > numbers, please. > Sure :) The reason I didn't post the numbers very eagerly was that I didn't want it to look ridiculous if it later turned out to be really an error in the experiment ;) But since I have seen it happening consistently I think I can post the numbers here with some non-zero confidence. > If you really have performance improvement to the memory allocator (or > something else) here, then surely it can be pared out of your patches > and merged quickly by itself. Those kinds of optimizations are hard to > come by! > :-) Anyway, here it goes: Test setup: ---------- x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my patchset might not handle NUMA properly). Mem region size = 512 MB. Kernbench log for Vanilla 3.7-rc3 ================================= Kernel: 3.7.0-rc3-vanilla-default Average Optimal load -j 32 Run (std deviation): Elapsed Time 650.742 (2.49774) User Time 8213.08 (17.6347) System Time 1273.91 (6.00643) Percent CPU 1457.4 (3.64692) Context Switches 2250203 (3846.61) Sleeps 1.8781e+06 (5310.33) Kernbench log for this sorted-buddy patchset ============================================ Kernel: 3.7.0-rc3-sorted-buddy-default Average Optimal load -j 32 Run (std deviation): Elapsed Time 591.696 (0.660969) User Time 7511.97 (1.08313) System Time 1062.99 (1.1109) Percent CPU 1448.6 (1.94936) Context Switches 2.1496e+06 (3507.12) Sleeps 1.84305e+06 (3092.67) Regards, Srivatsa S. Bhat -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-09 16:34 ` Srivatsa S. Bhat @ 2012-11-09 16:43 ` Srivatsa S. Bhat 2012-11-09 16:52 ` Srivatsa S. Bhat 0 siblings, 1 reply; 17+ messages in thread From: Srivatsa S. Bhat @ 2012-11-09 16:43 UTC (permalink / raw) To: Dave Hansen Cc: Mel Gorman, Vaidyanathan Srinivasan, akpm, mjg59, paulmck, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote: > On 11/09/2012 09:43 PM, Dave Hansen wrote: >> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote: >>> FWIW, kernbench is actually (and surprisingly) showing a slight performance >>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in >>> my other email to Dave. >>> >>> https://lkml.org/lkml/2012/11/7/428 >>> >>> I don't think I can dismiss it as an experimental error, because I am seeing >>> those results consistently.. I'm trying to find out what's behind that. >> >> The only numbers in that link are in the date. :) Let's see the >> numbers, please. >> > > Sure :) The reason I didn't post the numbers very eagerly was that I didn't > want it to look ridiculous if it later turned out to be really an error in the > experiment ;) But since I have seen it happening consistently I think I can > post the numbers here with some non-zero confidence. > >> If you really have performance improvement to the memory allocator (or >> something else) here, then surely it can be pared out of your patches >> and merged quickly by itself. Those kinds of optimizations are hard to >> come by! >> > > :-) > > Anyway, here it goes: > > Test setup: > ---------- > x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my > patchset might not handle NUMA properly). Mem region size = 512 MB. > For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels was much lesser, but nevertheless, this patchset performed better. I wouldn't vouch that my patchset handles NUMA correctly, but here are the numbers from that run anyway (at least to show that I really found the results to be repeatable): Kernbench log for Vanilla 3.7-rc3 ================================= Kernel: 3.7.0-rc3-vanilla-numa-default Average Optimal load -j 32 Run (std deviation): Elapsed Time 589.058 (0.596171) User Time 7461.26 (1.69702) System Time 1072.03 (1.54704) Percent CPU 1448.2 (1.30384) Context Switches 2.14322e+06 (4042.97) Sleeps 1847230 (2614.96) Kernbench log for Vanilla 3.7-rc3 ================================= Kernel: 3.7.0-rc3-sorted-buddy-numa-default Average Optimal load -j 32 Run (std deviation): Elapsed Time 577.182 (0.713772) User Time 7315.43 (3.87226) System Time 1043 (1.12855) Percent CPU 1447.6 (2.19089) Context Switches 2117022 (3810.15) Sleeps 1.82966e+06 (4149.82) Regards, Srivatsa S. 
Bhat > Kernbench log for Vanilla 3.7-rc3 > ================================= > > Kernel: 3.7.0-rc3-vanilla-default > Average Optimal load -j 32 Run (std deviation): > Elapsed Time 650.742 (2.49774) > User Time 8213.08 (17.6347) > System Time 1273.91 (6.00643) > Percent CPU 1457.4 (3.64692) > Context Switches 2250203 (3846.61) > Sleeps 1.8781e+06 (5310.33) > > Kernbench log for this sorted-buddy patchset > ============================================ > > Kernel: 3.7.0-rc3-sorted-buddy-default > Average Optimal load -j 32 Run (std deviation): > Elapsed Time 591.696 (0.660969) > User Time 7511.97 (1.08313) > System Time 1062.99 (1.1109) > Percent CPU 1448.6 (1.94936) > Context Switches 2.1496e+06 (3507.12) > Sleeps 1.84305e+06 (3092.67) > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-09 16:43 ` Srivatsa S. Bhat @ 2012-11-09 16:52 ` Srivatsa S. Bhat 2012-11-16 18:32 ` Srivatsa S. Bhat 0 siblings, 1 reply; 17+ messages in thread From: Srivatsa S. Bhat @ 2012-11-09 16:52 UTC (permalink / raw) To: Dave Hansen Cc: Mel Gorman, Vaidyanathan Srinivasan, akpm, mjg59, paulmck, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On 11/09/2012 10:13 PM, Srivatsa S. Bhat wrote: > On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote: >> On 11/09/2012 09:43 PM, Dave Hansen wrote: >>> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote: >>>> FWIW, kernbench is actually (and surprisingly) showing a slight performance >>>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in >>>> my other email to Dave. >>>> >>>> https://lkml.org/lkml/2012/11/7/428 >>>> >>>> I don't think I can dismiss it as an experimental error, because I am seeing >>>> those results consistently.. I'm trying to find out what's behind that. >>> >>> The only numbers in that link are in the date. :) Let's see the >>> numbers, please. >>> >> >> Sure :) The reason I didn't post the numbers very eagerly was that I didn't >> want it to look ridiculous if it later turned out to be really an error in the >> experiment ;) But since I have seen it happening consistently I think I can >> post the numbers here with some non-zero confidence. >> >>> If you really have performance improvement to the memory allocator (or >>> something else) here, then surely it can be pared out of your patches >>> and merged quickly by itself. Those kinds of optimizations are hard to >>> come by! >>> >> >> :-) >> >> Anyway, here it goes: >> >> Test setup: >> ---------- >> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my >> patchset might not handle NUMA properly). Mem region size = 512 MB. >> > > For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels > was much lesser, but nevertheless, this patchset performed better. I wouldn't > vouch that my patchset handles NUMA correctly, but here are the numbers from > that run anyway (at least to show that I really found the results to be > repeatable): > > Kernbench log for Vanilla 3.7-rc3 > ================================= > Kernel: 3.7.0-rc3-vanilla-numa-default > Average Optimal load -j 32 Run (std deviation): > Elapsed Time 589.058 (0.596171) > User Time 7461.26 (1.69702) > System Time 1072.03 (1.54704) > Percent CPU 1448.2 (1.30384) > Context Switches 2.14322e+06 (4042.97) > Sleeps 1847230 (2614.96) > > Kernbench log for Vanilla 3.7-rc3 > ================================= Oops, that title must have been "for sorted-buddy patchset" of course.. > Kernel: 3.7.0-rc3-sorted-buddy-numa-default > Average Optimal load -j 32 Run (std deviation): > Elapsed Time 577.182 (0.713772) > User Time 7315.43 (3.87226) > System Time 1043 (1.12855) > Percent CPU 1447.6 (2.19089) > Context Switches 2117022 (3810.15) > Sleeps 1.82966e+06 (4149.82) > > Regards, Srivatsa S. Bhat -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-09 16:52 ` Srivatsa S. Bhat @ 2012-11-16 18:32 ` Srivatsa S. Bhat 0 siblings, 0 replies; 17+ messages in thread From: Srivatsa S. Bhat @ 2012-11-16 18:32 UTC (permalink / raw) To: Dave Hansen Cc: Mel Gorman, Vaidyanathan Srinivasan, akpm, mjg59, paulmck, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel, andi, SrinivasPandruvada On 11/09/2012 10:22 PM, Srivatsa S. Bhat wrote: > On 11/09/2012 10:13 PM, Srivatsa S. Bhat wrote: >> On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote: >>> On 11/09/2012 09:43 PM, Dave Hansen wrote: >>>> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote: >>>>> FWIW, kernbench is actually (and surprisingly) showing a slight performance >>>>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in >>>>> my other email to Dave. >>>>> >>>>> https://lkml.org/lkml/2012/11/7/428 >>>>> >>>>> I don't think I can dismiss it as an experimental error, because I am seeing >>>>> those results consistently.. I'm trying to find out what's behind that. >>>> >>>> The only numbers in that link are in the date. :) Let's see the >>>> numbers, please. >>>> >>> >>> Sure :) The reason I didn't post the numbers very eagerly was that I didn't >>> want it to look ridiculous if it later turned out to be really an error in the >>> experiment ;) But since I have seen it happening consistently I think I can >>> post the numbers here with some non-zero confidence. >>> >>>> If you really have performance improvement to the memory allocator (or >>>> something else) here, then surely it can be pared out of your patches >>>> and merged quickly by itself. Those kinds of optimizations are hard to >>>> come by! >>>> >>> >>> :-) >>> >>> Anyway, here it goes: >>> >>> Test setup: >>> ---------- >>> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my >>> patchset might not handle NUMA properly). Mem region size = 512 MB. >>> >> >> For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels >> was much lesser, but nevertheless, this patchset performed better. I wouldn't >> vouch that my patchset handles NUMA correctly, but here are the numbers from >> that run anyway (at least to show that I really found the results to be >> repeatable): >> I fixed up the NUMA case (I'll post the updated patch for that soon) and ran a fresh set of kernbench runs. The difference between mainline and this patchset is quite tiny; so we can't really say that this patchset shows a performance improvement over mainline. However, I can safely conclude that this patchset doesn't show any performance _degradation_ w.r.t mainline in kernbench. 
Results from one of the recent kernbench runs: --------------------------------------------- Kernbench log for Vanilla 3.7-rc3 ================================= Kernel: 3.7.0-rc3 Average Optimal load -j 32 Run (std deviation): Elapsed Time 330.39 (0.746257) User Time 4283.63 (3.39617) System Time 604.783 (2.72629) Percent CPU 1479 (3.60555) Context Switches 845634 (6031.22) Sleeps 833655 (6652.17) Kernbench log for Sorted-buddy ============================== Kernel: 3.7.0-rc3-sorted-buddy Average Optimal load -j 32 Run (std deviation): Elapsed Time 329.967 (2.76789) User Time 4230.02 (2.15324) System Time 599.793 (1.09988) Percent CPU 1463.33 (11.3725) Context Switches 840530 (1646.75) Sleeps 833732 (2227.68) Regards, Srivatsa S. Bhat -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-09 5:14 ` Vaidyanathan Srinivasan 2012-11-09 9:00 ` Mel Gorman @ 2012-11-09 15:34 ` Arjan van de Ven 1 sibling, 0 replies; 17+ messages in thread From: Arjan van de Ven @ 2012-11-09 15:34 UTC (permalink / raw) To: svaidy Cc: Mel Gorman, Srivatsa S. Bhat, akpm, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel On 11/8/2012 9:14 PM, Vaidyanathan Srinivasan wrote: > * Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]: > >> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote: >>> ------------------------------------------------------------ > > Hi Mel, > > Thanks for the detailed review and comments. The goal of this patch > series is to brainstorm on ideas that enable Linux VM to record and > exploit memory region boundaries. > > The first approach that we had last year (hierarchy) has more runtime > overhead. This approach of sorted-buddy was one of the alternatives > discussed earlier and we are trying to find out if simple requirements > of biasing memory allocations can be achieved with this approach. > > Smart reclaim based on this approach is a key piece we still need to > design. Ideas from compaction will certainly help. Reclaim may be needed for the embedded use case but at least we are also looking at memory power savings that come from content-preserving power states. For that, Linux should *statistically* not be actively using (e.g. reading from or writing to it) a percentage of memory... and statistically clustering is quite sufficient for that. (for example, if you don't use a DIMM for a certain amount of time, the link and other pieces can go to a lower power state, even on today's server systems. In a many-dimm system, if each app is, on a per app basis, preferring one dimm for its allocations, the process scheduler will help us naturally keep the other dimms "dark") If you have to actually free the memory, it is a much much harder problem, increasingly so if the region you MUST free is quite large. If one solution can solve both cases, great, but let's not make both not happen because one of the cases is hard... (and please let's not use moving or freeing of pages as a solution for at least the content preserving case) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <loom.20121109T172910-394@post.gmane.org>]
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management [not found] ` <loom.20121109T172910-394@post.gmane.org> @ 2012-11-12 16:14 ` Srivatsa S. Bhat 0 siblings, 0 replies; 17+ messages in thread From: Srivatsa S. Bhat @ 2012-11-12 16:14 UTC (permalink / raw) To: SrinivasPandruvada, akpm@linux-foundation.org, Mel Gorman, mjg59, Paul E. McKenney, Dave Hansen, maxime.coquelin, loic.pallardy, Arjan van de Ven, kmpark, kamezawa.hiroyu, Len Brown, Rafael J. Wysocki Cc: linux-pm, Ankita Garg, amit.kachhap, Vaidyanathan Srinivasan, thomas.abraham, Santosh Shilimkar, Srivatsa S. Bhat, linux-mm, linux-kernel@vger.kernel.org, andi Hi Srinivas, It looks like your email did not get delivered to the mailing lists (and the people in the CC list) properly. So I'm quoting your entire mail as-is here. And thanks a lot for taking a look at this patchset! Regards, Srivatsa S. Bhat On 11/09/2012 10:18 PM, SrinivasPandruvada wrote: > I did like this implementation and think it is valuable. > I am experimenting with one of our HW. This type of partition does help in > saving power. We believe we can save up to 1W of power per DIMM with the help > of some HW/BIOS changes. We are only talking about content preserving memory, > so we don't have to be 100% correct. > In my experiments, I tried two methods: > - Similar to the approach suggested by Mel Gorman. I have a special sticky > migrate type like CMA. > - Buddy buckets: Buddies are organized into memory region aware buckets. > During allocation it prefers higher order buckets. I made sure that there is > no effect from my change if there are no power saving memory DIMMs. The advantage > of this bucket is that I can keep the memory in close proximity for related > task groups by direct hashing to a bucket. The free list is organized as a two > dimensional array with bucket and migrate type for each order. > > In both methods, currently reclaim is targeted to be done by a sysfs interface > similar to memory compaction for a node allowing user space to initiate reclaim. > > Thanks, > Srinivas Pandruvada > Open Source Technology Center, > Intel Corp. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
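The "buddy buckets" layout Srinivas describes -- per allocation order, a two-dimensional array of free lists indexed by bucket and migrate type, with related task groups hashed to the same bucket -- might be organised roughly along these lines. This is only one reading of his description; the constants and names below are invented for the sketch and are not from his implementation.

#include <stdio.h>

/* A reading of the "buddy buckets" description: for each order, free lists
 * form a 2-D array indexed by [bucket][migratetype], where a bucket groups
 * pageblocks in a memory-region-aware way, and related task groups hash to
 * the same bucket so their memory stays in close proximity. */
#define MAX_ORDER	11
#define NR_BUCKETS	4	/* e.g. one per power-manageable region group */
#define NR_MIGRATETYPES	3	/* UNMOVABLE, MOVABLE, RECLAIMABLE (simplified) */

struct list_head {
	struct list_head *next, *prev;
};

struct bucket_free_area {
	struct list_head free_list[NR_BUCKETS][NR_MIGRATETYPES];
	unsigned long nr_free;
};

static struct bucket_free_area free_area[MAX_ORDER];

/* A trivial hash from a task-group id to a bucket stands in for the
 * "direct hashing to a bucket" mentioned in the description. */
static int task_group_to_bucket(unsigned int tg_id)
{
	return tg_id % NR_BUCKETS;
}

int main(void)
{
	printf("task group 7 allocates from bucket %d\n", task_group_to_bucket(7));
	printf("free lists per order: %d buckets x %d migrate types\n",
	       NR_BUCKETS, NR_MIGRATETYPES);
	printf("order-0 nr_free starts at %lu\n", free_area[0].nr_free);
	return 0;
}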
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-11-06 19:52 Srivatsa S. Bhat 2012-11-08 18:02 ` Mel Gorman @ 2012-12-04 10:51 ` wujianguo 2012-12-06 6:32 ` Srivatsa S. Bhat 1 sibling, 1 reply; 17+ messages in thread From: wujianguo @ 2012-12-04 10:51 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: akpm, mgorman, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel Hi Srivatsa, I applied this patchset, and run genload(from LTP) test: numactl --membind=1 ./genload -m 100, then got a "general protection fault", and system was going to reboot. If I revert [RFC PATCH 7/8], and run this test again, genload will be killed due to OOM, but the system is OK, no coredump. ps: node1 has 8G memory. [ 3647.020666] general protection fault: 0000 [#1] SMP [ 3647.026232] Modules linked in: edd cpufreq_conservative cpufreq_userspace cpu freq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_ intel ixgbe ipv6 i7core_edac igb iTCO_wdt i2c_i801 iTCO_vendor_support ioatdma e dac_core tpm_tis joydev lpc_ich i2c_core microcode mfd_core rtc_cmos pcspkr sr_m od tpm sg dca hid_generic mdio tpm_bios cdrom button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hw mon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_ piix libata megaraid_sas scsi_mod [ 3647.084565] CPU 19 [ 3647.086709] Pid: 33708, comm: genload Not tainted 3.7.0-rc7-mem-region+ #11 Q CI QSSC-S4R/QSSC-S4R [ 3647.096799] RIP: 0010:[<ffffffff8110979c>] [<ffffffff8110979c>] add_to_freel ist+0x8c/0x100 [ 3647.106125] RSP: 0000:ffff880a7f6c3e58 EFLAGS: 00010086 [ 3647.112042] RAX: dead000000200200 RBX: 0000000000000001 RCX: 0000000000000000 [ 3647.119990] RDX: ffffea001211a3a0 RSI: ffffea001211ffa0 RDI: 0000000000000001 [ 3647.127936] RBP: ffff880a7f6c3e58 R08: ffff88067ff6d240 R09: ffff88067ff6b180 [ 3647.135884] R10: 0000000000000002 R11: 0000000000000001 R12: 00000000000007fe [ 3647.143831] R13: 0000000000000001 R14: 0000000000000001 R15: ffffea001211ff80 [ 3647.151778] FS: 00007f0b2a674700(0000) GS:ffff880a7f6c0000(0000) knlGS:00000 00000000000 [ 3647.160790] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 3647.167188] CR2: 00007f0b1a000000 CR3: 0000000484723000 CR4: 00000000000007e0 [ 3647.175136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3647.183083] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 3647.191030] Process genload (pid: 33708, threadinfo ffff8806852bc000, task ff ff880688288000) [ 3647.200428] Stack: [ 3647.202667] ffff880a7f6c3f08 ffffffff8110e9c0 ffff88067ff66100 0000000000000 7fe [ 3647.210954] ffff880a7f6d5bb0 0000000000000030 0000000000002030 ffff88067ff66 168 [ 3647.219244] 0000000000000002 ffff880a7f6d5b78 0000000e88288000 ffff88067ff66 100 [ 3647.227530] Call Trace: [ 3647.230252] <IRQ> [ 3647.232394] [<ffffffff8110e9c0>] free_pcppages_bulk+0x350/0x450 [ 3647.239297] [<ffffffff8110f0d0>] ? 
drain_pages+0xd0/0xd0 [ 3647.245313] [<ffffffff8110f0c3>] drain_pages+0xc3/0xd0 [ 3647.251135] [<ffffffff8110f0e6>] drain_local_pages+0x16/0x20 [ 3647.257540] [<ffffffff810a3bce>] generic_smp_call_function_interrupt+0xae/0x 260 [ 3647.265783] [<ffffffff810282c7>] smp_call_function_interrupt+0x27/0x40 [ 3647.273156] [<ffffffff8147f272>] call_function_interrupt+0x72/0x80 [ 3647.280136] <EOI> [ 3647.282278] [<ffffffff81077936>] ? mutex_spin_on_owner+0x76/0xa0 [ 3647.289292] [<ffffffff81473116>] __mutex_lock_slowpath+0x66/0x180 [ 3647.296181] [<ffffffff8113afe7>] ? try_to_unmap_one+0x277/0x440 [ 3647.302872] [<ffffffff81472b93>] mutex_lock+0x23/0x40 [ 3647.308595] [<ffffffff8113b657>] rmap_walk+0x137/0x240 [ 3647.314417] [<ffffffff8115c230>] ? get_page+0x40/0x40 [ 3647.320133] [<ffffffff8115d036>] move_to_new_page+0xb6/0x110 [ 3647.326526] [<ffffffff8115d452>] __unmap_and_move+0x192/0x230 [ 3647.333023] [<ffffffff8115d612>] unmap_and_move+0x122/0x140 [ 3647.339328] [<ffffffff8115d6c9>] migrate_pages+0x99/0x150 [ 3647.345433] [<ffffffff81129f10>] ? isolate_freepages+0x220/0x220 [ 3647.352220] [<ffffffff8112ace2>] compact_zone+0x2f2/0x5d0 [ 3647.358332] [<ffffffff8112b4a0>] try_to_compact_pages+0x180/0x240 [ 3647.365218] [<ffffffff8110f1e7>] __alloc_pages_direct_compact+0x97/0x200 [ 3647.372780] [<ffffffff810a45a3>] ? on_each_cpu_mask+0x63/0xb0 [ 3647.379279] [<ffffffff8110f84f>] __alloc_pages_slowpath+0x4ff/0x780 [ 3647.386349] [<ffffffff8110fbf1>] __alloc_pages_nodemask+0x121/0x180 [ 3647.393430] [<ffffffff811500d6>] alloc_pages_vma+0xd6/0x170 [ 3647.399737] [<ffffffff81162198>] do_huge_pmd_anonymous_page+0x148/0x210 [ 3647.407203] [<ffffffff81132f6b>] handle_mm_fault+0x33b/0x340 [ 3647.413609] [<ffffffff814799d3>] __do_page_fault+0x2a3/0x4e0 [ 3647.420017] [<ffffffff8126316a>] ? trace_hardirqs_off_thunk+0x3a/0x6c [ 3647.427290] [<ffffffff81479c1e>] do_page_fault+0xe/0x10 [ 3647.433208] [<ffffffff81475f68>] page_fault+0x28/0x30 [ 3647.438921] Code: 8d 78 01 48 89 f8 48 c1 e0 04 49 8d 04 00 48 8b 50 08 48 83 40 10 01 48 85 d2 74 1b 48 8b 42 08 48 89 72 08 48 89 16 48 89 46 08 <48> 89 30 c9 c3 0f 1f 80 00 00 00 00 4d 3b 00 74 4b 83 e9 01 79 [ 3647.460607] RIP [<ffffffff8110979c>] add_to_freelist+0x8c/0x100 [ 3647.467308] RSP <ffff880a7f6c3e58> [ 0.000000] Linux version 3.7.0-rc7-mem-region+ (root@linux-intel) (gcc versi on 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #11 SMP Tue Dec 4 15:23 :15 CST 2012 . Thanks, Jianguo Wu On 2012-11-7 3:52, Srivatsa S. Bhat wrote: > Hi, > > This is an alternative design for Memory Power Management, developed based on > some of the suggestions[1] received during the review of the earlier patchset > ("Hierarchy" design) on Memory Power Management[2]. This alters the buddy-lists > to keep them region-sorted, and is hence identified as the "Sorted-buddy" design. > > One of the key aspects of this design is that it avoids the zone-fragmentation > problem that was present in the earlier design[3]. > > > Quick overview of Memory Power Management and Memory Regions: > ------------------------------------------------------------ > > Today memory subsystems are offer a wide range of capabilities for managing > memory power consumption. As a quick example, if a block of memory is not > referenced for a threshold amount of time, the memory controller can decide to > put that chunk into a low-power content-preserving state. And the next > reference to that memory chunk would bring it back to full power for read/write. 
> With this capability in place, it becomes important for the OS to understand > the boundaries of such power-manageable chunks of memory and to ensure that > references are consolidated to a minimum number of such memory power management > domains. > > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that > the firmware can expose information regarding the boundaries of such memory > power management domains to the OS in a standard way. > > How can Linux VM help memory power savings? > > o Consolidate memory allocations and/or references such that they are > not spread across the entire memory address space. Basically area of memory > that is not being referenced, can reside in low power state. > > o Support targeted memory reclaim, where certain areas of memory that can be > easily freed can be offlined, allowing those areas of memory to be put into > lower power states. > > Memory Regions: > --------------- > > "Memory Regions" is a way of capturing the boundaries of power-managable > chunks of memory, within the MM subsystem. > > > Short description of the "Sorted-buddy" design: > ----------------------------------------------- > > In this design, the memory region boundaries are captured in a parallel > data-structure instead of fitting regions between nodes and zones in the > hierarchy. Further, the buddy allocator is altered, such that we maintain the > zones' freelists in region-sorted-order and thus do page allocation in the > order of increasing memory regions. (The freelists need not be fully > address-sorted, they just need to be region-sorted. Patch 6 explains this > in more detail). > > The idea is to do page allocation in increasing order of memory regions > (within a zone) and perform page reclaim in the reverse order, as illustrated > below. > > ---------------------------- Increasing region number----------------------> > > Direction of allocation---> <---Direction of reclaim > > > The sorting logic (to maintain freelist pageblocks in region-sorted-order) > lies in the page-free path and not the page-allocation path and hence the > critical page allocation paths remain fast. Moreover, the heart of the page > allocation algorithm itself remains largely unchanged, and the region-related > data-structures are optimized to avoid unnecessary updates during the > page-allocator's runtime. > > Advantages of this design: > -------------------------- > 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and > hence we avoid its associated problems (like too many zones, extra page > reclaim threads, question of choosing watermarks etc). > [This is an advantage over the "Hierarchy" design] > > 2. Performance overhead is expected to be low: Since we retain the simplicity > of the algorithm in the page allocation path, page allocation can > potentially remain as fast as it would be without memory regions. The > overhead is pushed to the page-freeing paths which are not that critical. > > > Results: > ======= > > Test setup: > ----------- > This patchset applies cleanly on top of 3.7-rc3. > > x86 dual-socket quad core HT-enabled machine booted with mem=8G > Memory region size = 512 MB > > Functional testing: > ------------------- > > Ran pagetest, a simple C program that allocates and touches a required number > of pages. > > Below is the statistics from the regions within ZONE_NORMAL, at various sizes > of allocations from pagetest. 
> > Present pages | Free pages at various allocations | > | start | 512 MB | 1024 MB | 2048 MB | > Region 0 16 | 0 | 0 | 0 | 0 | > Region 1 131072 | 87219 | 8066 | 7892 | 7387 | > Region 2 131072 | 131072 | 79036 | 0 | 0 | > Region 3 131072 | 131072 | 131072 | 79061 | 0 | > Region 4 131072 | 131072 | 131072 | 131072 | 0 | > Region 5 131072 | 131072 | 131072 | 131072 | 79051 | > Region 6 131072 | 131072 | 131072 | 131072 | 131072 | > Region 7 131072 | 131072 | 131072 | 131072 | 131072 | > Region 8 131056 | 105475 | 105472 | 105472 | 105472 | > > This shows that page allocation occurs in the order of increasing region > numbers, as intended in this design. > > Performance impact: > ------------------- > > Kernbench results didn't show much of a difference between the performance > of vanilla 3.7-rc3 and this patchset. > > > Todos: > ===== > > 1. Memory-region aware page-reclamation: > ---------------------------------------- > > We would like to do page reclaim in the reverse order of page allocation > within a zone, ie., in the order of decreasing region numbers. > To achieve that, while scanning lru pages to reclaim, we could potentially > look for pages belonging to higher regions (considering region boundaries) > or perhaps simply prefer pages of higher pfns (and skip lower pfns) as > reclaim candidates. > > 2. Compile-time exclusion of Memory Power Management, and extending the > support to also work with other features such as Mem cgroups, kexec etc. > > References: > ---------- > > [1]. Review comments suggesting modifying the buddy allocator to be aware of > memory regions: > http://article.gmane.org/gmane.linux.power-management.general/24862 > http://article.gmane.org/gmane.linux.power-management.general/25061 > http://article.gmane.org/gmane.linux.kernel.mm/64689 > > [2]. Patch series that implemented the node-region-zone hierarchy design: > http://lwn.net/Articles/445045/ > http://thread.gmane.org/gmane.linux.kernel.mm/63840 > > Summary of the discussion on that patchset: > http://article.gmane.org/gmane.linux.power-management.general/25061 > > Forward-port of that patchset to 3.7-rc3 (minimal x86 config) > http://thread.gmane.org/gmane.linux.kernel.mm/89202 > > [3]. Disadvantages of having memory regions in the hierarchy between nodes and > zones: > http://article.gmane.org/gmane.linux.kernel.mm/63849 > > [4]. Estimate of potential power savings on Samsung exynos board > http://article.gmane.org/gmane.linux.kernel.mm/65935 > > [5]. ACPI 5.0 and MPST support > http://www.acpi.info/spec.htm > Section 5.2.21 Memory Power State Table (MPST) > > Srivatsa S. Bhat (8): > mm: Introduce memory regions data-structure to capture region boundaries within node > mm: Initialize node memory regions during boot > mm: Introduce and initialize zone memory regions > mm: Add helpers to retrieve node region and zone region for a given page > mm: Add data-structures to describe memory regions within the zones' freelists > mm: Demarcate and maintain pageblocks in region-order in the zones' freelists > mm: Add an optimized version of del_from_freelist to keep page allocation fast > mm: Print memory region statistics to understand the buddy allocator behavior > > > include/linux/mm.h | 38 +++++++ > include/linux/mmzone.h | 52 +++++++++ > mm/compaction.c | 8 + > mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++---- > mm/vmstat.c | 59 ++++++++++- > 5 files changed, 390 insertions(+), 30 deletions(-) > > > Thanks, > Srivatsa S. 
Bhat > IBM Linux Technology Center > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management 2012-12-04 10:51 ` wujianguo @ 2012-12-06 6:32 ` Srivatsa S. Bhat 0 siblings, 0 replies; 17+ messages in thread From: Srivatsa S. Bhat @ 2012-12-06 6:32 UTC (permalink / raw) To: wujianguo Cc: akpm, mgorman, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm, linux-kernel Hi Jianguo, On 12/04/2012 04:21 PM, wujianguo wrote: > Hi Srivatsa, > > I applied this patchset, and run genload(from LTP) test: numactl --membind=1 ./genload -m 100, > then got a "general protection fault", and system was going to reboot. > > If I revert [RFC PATCH 7/8], and run this test again, genload will be killed due to OOM, > but the system is OK, no coredump. > Sorry for the delay in replying. Thanks a lot for testing and for the bug-report! I could recreate the issue in one of my machines using the LTP test you mentioned. I'll try to dig and find out what is going wrong. Regards, Srivatsa S. Bhat > ps: node1 has 8G memory. > > [ 3647.020666] general protection fault: 0000 [#1] SMP > [ 3647.026232] Modules linked in: edd cpufreq_conservative cpufreq_userspace cpu > freq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_ > intel ixgbe ipv6 i7core_edac igb iTCO_wdt i2c_i801 iTCO_vendor_support ioatdma e > dac_core tpm_tis joydev lpc_ich i2c_core microcode mfd_core rtc_cmos pcspkr sr_m > od tpm sg dca hid_generic mdio tpm_bios cdrom button ext3 jbd mbcache usbhid hid > uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hw > mon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_ > piix libata megaraid_sas scsi_mod > [ 3647.084565] CPU 19 > [ 3647.086709] Pid: 33708, comm: genload Not tainted 3.7.0-rc7-mem-region+ #11 Q > CI QSSC-S4R/QSSC-S4R > [ 3647.096799] RIP: 0010:[<ffffffff8110979c>] [<ffffffff8110979c>] add_to_freel > ist+0x8c/0x100 > [ 3647.106125] RSP: 0000:ffff880a7f6c3e58 EFLAGS: 00010086 > [ 3647.112042] RAX: dead000000200200 RBX: 0000000000000001 RCX: 0000000000000000 > > [ 3647.119990] RDX: ffffea001211a3a0 RSI: ffffea001211ffa0 RDI: 0000000000000001 > > [ 3647.127936] RBP: ffff880a7f6c3e58 R08: ffff88067ff6d240 R09: ffff88067ff6b180 > > [ 3647.135884] R10: 0000000000000002 R11: 0000000000000001 R12: 00000000000007fe > > [ 3647.143831] R13: 0000000000000001 R14: 0000000000000001 R15: ffffea001211ff80 > > [ 3647.151778] FS: 00007f0b2a674700(0000) GS:ffff880a7f6c0000(0000) knlGS:00000 > 00000000000 > [ 3647.160790] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 3647.167188] CR2: 00007f0b1a000000 CR3: 0000000484723000 CR4: 00000000000007e0 > > [ 3647.175136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [ 3647.183083] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > [ 3647.191030] Process genload (pid: 33708, threadinfo ffff8806852bc000, task ff > ff880688288000) > [ 3647.200428] Stack: > [ 3647.202667] ffff880a7f6c3f08 ffffffff8110e9c0 ffff88067ff66100 0000000000000 > 7fe > [ 3647.210954] ffff880a7f6d5bb0 0000000000000030 0000000000002030 ffff88067ff66 > 168 > [ 3647.219244] 0000000000000002 ffff880a7f6d5b78 0000000e88288000 ffff88067ff66 > 100 > [ 3647.227530] Call Trace: > [ 3647.230252] <IRQ> > [ 3647.232394] [<ffffffff8110e9c0>] free_pcppages_bulk+0x350/0x450 > [ 3647.239297] [<ffffffff8110f0d0>] ? 
drain_pages+0xd0/0xd0 > [ 3647.245313] [<ffffffff8110f0c3>] drain_pages+0xc3/0xd0 > [ 3647.251135] [<ffffffff8110f0e6>] drain_local_pages+0x16/0x20 > [ 3647.257540] [<ffffffff810a3bce>] generic_smp_call_function_interrupt+0xae/0x > 260 > [ 3647.265783] [<ffffffff810282c7>] smp_call_function_interrupt+0x27/0x40 > [ 3647.273156] [<ffffffff8147f272>] call_function_interrupt+0x72/0x80 > [ 3647.280136] <EOI> > [ 3647.282278] [<ffffffff81077936>] ? mutex_spin_on_owner+0x76/0xa0 > [ 3647.289292] [<ffffffff81473116>] __mutex_lock_slowpath+0x66/0x180 > [ 3647.296181] [<ffffffff8113afe7>] ? try_to_unmap_one+0x277/0x440 > [ 3647.302872] [<ffffffff81472b93>] mutex_lock+0x23/0x40 > [ 3647.308595] [<ffffffff8113b657>] rmap_walk+0x137/0x240 > [ 3647.314417] [<ffffffff8115c230>] ? get_page+0x40/0x40 > [ 3647.320133] [<ffffffff8115d036>] move_to_new_page+0xb6/0x110 > [ 3647.326526] [<ffffffff8115d452>] __unmap_and_move+0x192/0x230 > [ 3647.333023] [<ffffffff8115d612>] unmap_and_move+0x122/0x140 > [ 3647.339328] [<ffffffff8115d6c9>] migrate_pages+0x99/0x150 > [ 3647.345433] [<ffffffff81129f10>] ? isolate_freepages+0x220/0x220 > [ 3647.352220] [<ffffffff8112ace2>] compact_zone+0x2f2/0x5d0 > [ 3647.358332] [<ffffffff8112b4a0>] try_to_compact_pages+0x180/0x240 > [ 3647.365218] [<ffffffff8110f1e7>] __alloc_pages_direct_compact+0x97/0x200 > [ 3647.372780] [<ffffffff810a45a3>] ? on_each_cpu_mask+0x63/0xb0 > [ 3647.379279] [<ffffffff8110f84f>] __alloc_pages_slowpath+0x4ff/0x780 > [ 3647.386349] [<ffffffff8110fbf1>] __alloc_pages_nodemask+0x121/0x180 > [ 3647.393430] [<ffffffff811500d6>] alloc_pages_vma+0xd6/0x170 > [ 3647.399737] [<ffffffff81162198>] do_huge_pmd_anonymous_page+0x148/0x210 > [ 3647.407203] [<ffffffff81132f6b>] handle_mm_fault+0x33b/0x340 > [ 3647.413609] [<ffffffff814799d3>] __do_page_fault+0x2a3/0x4e0 > [ 3647.420017] [<ffffffff8126316a>] ? trace_hardirqs_off_thunk+0x3a/0x6c > [ 3647.427290] [<ffffffff81479c1e>] do_page_fault+0xe/0x10 > [ 3647.433208] [<ffffffff81475f68>] page_fault+0x28/0x30 > [ 3647.438921] Code: 8d 78 01 48 89 f8 48 c1 e0 04 49 8d 04 00 48 8b 50 08 48 83 > 40 10 01 48 85 d2 74 1b 48 8b 42 08 48 89 72 08 48 89 16 48 89 46 08 <48> 89 30 > c9 c3 0f 1f 80 00 00 00 00 4d 3b 00 74 4b 83 e9 01 79 > [ 3647.460607] RIP [<ffffffff8110979c>] add_to_freelist+0x8c/0x100 > [ 3647.467308] RSP <ffff880a7f6c3e58> > [ 0.000000] Linux version 3.7.0-rc7-mem-region+ (root@linux-intel) (gcc versi > on 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #11 SMP Tue Dec 4 15:23 > :15 CST 2012 > . > > Thanks, > Jianguo Wu > > On 2012-11-7 3:52, Srivatsa S. Bhat wrote: >> Hi, >> >> This is an alternative design for Memory Power Management, developed based on >> some of the suggestions[1] received during the review of the earlier patchset >> ("Hierarchy" design) on Memory Power Management[2]. This alters the buddy-lists >> to keep them region-sorted, and is hence identified as the "Sorted-buddy" design. >> >> One of the key aspects of this design is that it avoids the zone-fragmentation >> problem that was present in the earlier design[3]. >> >> >> Quick overview of Memory Power Management and Memory Regions: >> ------------------------------------------------------------ >> >> Today memory subsystems are offer a wide range of capabilities for managing >> memory power consumption. As a quick example, if a block of memory is not >> referenced for a threshold amount of time, the memory controller can decide to >> put that chunk into a low-power content-preserving state. 
>> And the next reference to that memory chunk would bring it back to full power
>> for read/write. With this capability in place, it becomes important for the OS
>> to understand the boundaries of such power-manageable chunks of memory and to
>> ensure that references are consolidated to a minimum number of such memory
>> power management domains.
>>
>> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
>> the firmware can expose information regarding the boundaries of such memory
>> power management domains to the OS in a standard way.
>>
>> How can Linux VM help memory power savings?
>>
>> o Consolidate memory allocations and/or references such that they are not
>>   spread across the entire memory address space. Basically, an area of memory
>>   that is not being referenced can reside in a low power state.
>>
>> o Support targeted memory reclaim, where certain areas of memory that can be
>>   easily freed can be offlined, allowing those areas of memory to be put into
>>   lower power states.
>>
>> Memory Regions:
>> ---------------
>>
>> "Memory Regions" is a way of capturing the boundaries of power-manageable
>> chunks of memory, within the MM subsystem.
>>
>> Short description of the "Sorted-buddy" design:
>> -----------------------------------------------
>>
>> In this design, the memory region boundaries are captured in a parallel
>> data-structure instead of fitting regions between nodes and zones in the
>> hierarchy. Further, the buddy allocator is altered, such that we maintain the
>> zones' freelists in region-sorted order and thus do page allocation in the
>> order of increasing memory regions. (The freelists need not be fully
>> address-sorted, they just need to be region-sorted. Patch 6 explains this in
>> more detail.)
>>
>> The idea is to do page allocation in increasing order of memory regions
>> (within a zone) and perform page reclaim in the reverse order, as illustrated
>> below.
>>
>> ---------------------------- Increasing region number---------------------->
>>
>> Direction of allocation--->                       <---Direction of reclaim
>>
>> The sorting logic (to maintain freelist pageblocks in region-sorted order)
>> lies in the page-free path and not the page-allocation path, and hence the
>> critical page allocation paths remain fast. Moreover, the heart of the page
>> allocation algorithm itself remains largely unchanged, and the region-related
>> data-structures are optimized to avoid unnecessary updates during the
>> page-allocator's runtime.
>>
>> Advantages of this design:
>> --------------------------
>> 1. No zone-fragmentation (IOW, we don't create more zones than necessary),
>>    and hence we avoid its associated problems (like too many zones, extra page
>>    reclaim threads, the question of choosing watermarks, etc.).
>>    [This is an advantage over the "Hierarchy" design]
>>
>> 2. Performance overhead is expected to be low: since we retain the simplicity
>>    of the algorithm in the page allocation path, page allocation can
>>    potentially remain as fast as it would be without memory regions. The
>>    overhead is pushed to the page-freeing paths, which are not that critical.
>>
>> Results:
>> =======
>>
>> Test setup:
>> -----------
>> This patchset applies cleanly on top of 3.7-rc3.
>>
>> x86 dual-socket quad-core HT-enabled machine booted with mem=8G
>> Memory region size = 512 MB
>>
>> Functional testing:
>> -------------------
>>
>> Ran pagetest, a simple C program that allocates and touches a required
>> number of pages.
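[ The pagetest program itself is not posted anywhere in this thread. Purely as
an illustration of the kind of test being described -- allocate a requested
number of pages and touch each one so that the buddy allocator actually hands
out page frames -- a minimal sketch might look like the following. The program
name, argument handling, and use of an anonymous mmap() here are assumptions
for illustration, not the actual test source. ]

/* pagetest-like sketch: allocate and touch a given number of pages.
 * Hypothetical usage: ./pagetest <num-pages>
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	long i, npages, pagesize = sysconf(_SC_PAGESIZE);
	char *buf;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <num-pages>\n", argv[0]);
		return 1;
	}
	npages = atol(argv[1]);

	/* Anonymous private mapping: page frames are only allocated when touched */
	buf = mmap(NULL, npages * pagesize, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch one byte per page to force actual page allocation */
	for (i = 0; i < npages; i++)
		buf[i * pagesize] = 1;

	pause();	/* hold the pages while per-region statistics are read */
	return 0;
}

[ Run at the 512 MB / 1024 MB / 2048 MB sizes shown in the table below, a test
of this kind makes it easy to check which regions the allocations were served
from, using the per-region statistics from patch 8. ]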
>>
>> Below are the statistics from the regions within ZONE_NORMAL, at various
>> sizes of allocations from pagetest.
>>
>>            Present pages |       Free pages at various allocations       |
>>                          |  start  | 512 MB  | 1024 MB | 2048 MB |
>> Region 0          16     |       0 |       0 |       0 |       0 |
>> Region 1      131072     |   87219 |    8066 |    7892 |    7387 |
>> Region 2      131072     |  131072 |   79036 |       0 |       0 |
>> Region 3      131072     |  131072 |  131072 |   79061 |       0 |
>> Region 4      131072     |  131072 |  131072 |  131072 |       0 |
>> Region 5      131072     |  131072 |  131072 |  131072 |   79051 |
>> Region 6      131072     |  131072 |  131072 |  131072 |  131072 |
>> Region 7      131072     |  131072 |  131072 |  131072 |  131072 |
>> Region 8      131056     |  105475 |  105472 |  105472 |  105472 |
>>
>> This shows that page allocation occurs in the order of increasing region
>> numbers, as intended in this design.
>>
>> Performance impact:
>> -------------------
>>
>> Kernbench results didn't show much of a difference between vanilla 3.7-rc3
>> and this patchset.
>>
>> Todos:
>> =====
>>
>> 1. Memory-region aware page-reclamation:
>> ----------------------------------------
>>
>> We would like to do page reclaim in the reverse order of page allocation
>> within a zone, i.e., in the order of decreasing region numbers. To achieve
>> that, while scanning LRU pages to reclaim, we could potentially look for
>> pages belonging to higher regions (considering region boundaries), or perhaps
>> simply prefer pages of higher pfns (and skip lower pfns) as reclaim candidates.
>>
>> 2. Compile-time exclusion of Memory Power Management, and extending the
>>    support to also work with other features such as mem cgroups, kexec etc.
>>
>> References:
>> ----------
>>
>> [1]. Review comments suggesting modifying the buddy allocator to be aware of
>>      memory regions:
>>      http://article.gmane.org/gmane.linux.power-management.general/24862
>>      http://article.gmane.org/gmane.linux.power-management.general/25061
>>      http://article.gmane.org/gmane.linux.kernel.mm/64689
>>
>> [2]. Patch series that implemented the node-region-zone hierarchy design:
>>      http://lwn.net/Articles/445045/
>>      http://thread.gmane.org/gmane.linux.kernel.mm/63840
>>
>>      Summary of the discussion on that patchset:
>>      http://article.gmane.org/gmane.linux.power-management.general/25061
>>
>>      Forward-port of that patchset to 3.7-rc3 (minimal x86 config):
>>      http://thread.gmane.org/gmane.linux.kernel.mm/89202
>>
>> [3]. Disadvantages of having memory regions in the hierarchy between nodes
>>      and zones:
>>      http://article.gmane.org/gmane.linux.kernel.mm/63849
>>
>> [4]. Estimate of potential power savings on Samsung exynos board:
>>      http://article.gmane.org/gmane.linux.kernel.mm/65935
>>
>> [5]. ACPI 5.0 and MPST support:
>>      http://www.acpi.info/spec.htm
>>      Section 5.2.21 Memory Power State Table (MPST)
>>
>> Srivatsa S. Bhat (8):
>>   mm: Introduce memory regions data-structure to capture region boundaries within node
>>   mm: Initialize node memory regions during boot
>>   mm: Introduce and initialize zone memory regions
>>   mm: Add helpers to retrieve node region and zone region for a given page
>>   mm: Add data-structures to describe memory regions within the zones' freelists
>>   mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
>>   mm: Add an optimized version of del_from_freelist to keep page allocation fast
>>   mm: Print memory region statistics to understand the buddy allocator behavior
>>
>>  include/linux/mm.h     |  38 +++++++
>>  include/linux/mmzone.h |  52 +++++++++
>>  mm/compaction.c        |   8 +
>>  mm/page_alloc.c        | 263 ++++++++++++++++++++++++++++++++++++++++++++----
>>  mm/vmstat.c            |  59 ++++++++-
>>  5 files changed, 390 insertions(+), 30 deletions(-)
>>
>> Thanks,
>> Srivatsa S. Bhat
>> IBM Linux Technology Center
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread
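[ One observation on the oops quoted above -- an editorial note, not something
discussed in the thread: the faulting instruction in add_to_freelist() (the
<48> 89 30 in the Code: line) is a store through RAX, and RAX holds
dead000000200200. On x86_64 that is the kernel's LIST_POISON2 value, which
list_del() writes into an entry's ->prev pointer as the entry is unlinked. So
the crash pattern is consistent with the freelist insertion reading and then
dereferencing the poisoned ->prev of a list entry that has already been
deleted -- i.e. some form of freelist corruption or a delete/re-link race in
the region-sorted free lists -- rather than a random wild pointer. For
reference, the poison values as defined in include/linux/poison.h of kernels
of that era are roughly: ]

/* include/linux/poison.h (3.7-era); on x86_64, CONFIG_ILLEGAL_POINTER_VALUE
 * is 0xdead000000000000, which gives the dead0000... pattern seen in RAX.
 */
#ifdef CONFIG_ILLEGAL_POINTER_VALUE
# define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
#else
# define POISON_POINTER_DELTA 0
#endif

/* Non-NULL pointers that fault if dereferenced; list_del() stores them into
 * the unlinked entry so that any later use of that entry is caught.
 */
#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)  /* ->next: 0xdead000000100100 */
#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)  /* ->prev: 0xdead000000200200 */

[ Matching a faulting register against these poison constants is often a quick
first step when a list-manipulation helper such as add_to_freelist() crashes. ]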
end of thread, other threads:[~2012-12-06  6:34 UTC | newest]

Thread overview: 17+ messages -- links below jump to the message on this page --
2012-11-09 18:14 [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management Srinivas Pandruvada
  -- strict thread matches above, loose matches on Subject: below --
2012-11-06 19:52 Srivatsa S. Bhat
2012-11-08 18:02 ` Mel Gorman
2012-11-08 19:38 ` Srivatsa S. Bhat
2012-11-09  5:14 ` Vaidyanathan Srinivasan
2012-11-09  9:00 ` Mel Gorman
2012-11-09 14:51 ` Srivatsa S. Bhat
2012-11-09 15:23 ` Srivatsa S. Bhat
2012-11-09 16:13 ` Dave Hansen
2012-11-09 16:34 ` Srivatsa S. Bhat
2012-11-09 16:43 ` Srivatsa S. Bhat
2012-11-09 16:52 ` Srivatsa S. Bhat
2012-11-16 18:32 ` Srivatsa S. Bhat
2012-11-09 15:34 ` Arjan van de Ven
     [not found] ` <loom.20121109T172910-394@post.gmane.org>
2012-11-12 16:14 ` Srivatsa S. Bhat
2012-12-04 10:51 ` wujianguo
2012-12-06  6:32 ` Srivatsa S. Bhat