* [RFC PATCH v4 00/40] mm: Memory Power Management
@ 2013-09-25 23:13 Srivatsa S. Bhat
  2013-09-25 23:13 ` [RFC PATCH v4 01/40] mm: Introduce memory regions data-structure to capture region boundaries within nodes Srivatsa S. Bhat
                   ` (40 more replies)
  0 siblings, 41 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:13 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Hi,
Here is version 4 of the Memory Power Management patchset, which includes the
targeted compaction mechanism (which was temporarily removed in v3). Now that
it includes all the major features and changes to the Linux MM intended to aid
memory power management, it gives us a better picture of how much better than
mainline this patchset performs in enabling memory power savings.
Role of the Linux MM in influencing Memory Power Management:
-----------------------------------------------------------
Modern memory hardware such as DDR3 supports a number of power management
capabilities - for instance, the memory controller can automatically put
memory DIMMs/banks into content-preserving low-power states, if it detects
that the *entire* memory DIMM/bank has not been referenced for a threshold
amount of time. This in turn reduces the energy consumption of the memory
hardware. We term these power-manageable chunks of memory as "Memory Regions".
To increase the power savings we need to enhance the Linux MM to understand
the granularity at which RAM modules can be power-managed, and keep the
memory allocations and references consolidated to a minimum no. of these
memory regions.
Thus, we can summarize the goals for the Linux MM as follows:
o Consolidate memory allocations and/or references such that they are not
spread across the entire memory address space, because the area of memory
that is not being referenced can reside in low power state.
o Support light-weight targeted memory compaction/reclaim, to evacuate
lightly-filled memory regions. This helps avoid memory references to
those regions, thereby allowing them to reside in low power states.
Brief overview of the design/approach used in this patchset:
-----------------------------------------------------------
The strategy used in this patchset is to do page allocation in increasing order
of memory regions (within a zone) and perform region-compaction in the reverse
order, as illustrated below.
---------------------------- Increasing region number---------------------->
Direction of allocation--->               <---Direction of region-compaction
We achieve this by making 3 major design changes to the Linux kernel memory
manager, as outlined below.
1. Sorted-buddy design of buddy freelists
   To allocate pages in increasing order of memory regions, we first capture
   the memory region boundaries in suitable zone-level data-structures, and
   modify the buddy allocator so as to maintain the buddy freelists in
   region-sorted-order. This automatically ensures that page allocation occurs
   in the order of increasing memory regions.
2. Split-allocator design: Page-Allocator as front-end; Region-Allocator as
   back-end
   Mixing of movable and unmovable pages can disrupt opportunities for
   consolidating allocations. In order to separate such pages at a memory-region
   granularity, a "Region-Allocator" is introduced which allocates entire memory
   regions. The Page-Allocator is then modified to get its memory from the
   Region-Allocator and hand out pages to requesting applications in
   page-sized chunks. This design is showing significant improvements in the
   effectiveness of this patchset in consolidating allocations to a minimum no.
   of memory regions.
3. Targeted region compaction/evacuation
   Over time, due to multiple alloc()s and free()s in random order, memory gets
   fragmented, which means the memory allocations will no longer be consolidated
   to a minimum no. of memory regions. In such cases we need a light-weight
   mechanism to opportunistically compact memory to evacuate lightly-filled
   memory regions, thereby enhancing the power-savings.
   Noting that CMA (Contiguous Memory Allocator) does targeted compaction to
   achieve its goals, this patchset generalizes the targeted compaction code
   and reuses it to evacuate memory regions. A dedicated per-node "kmempowerd"
   kthread is employed to perform this region evacuation.
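To make the two directions concrete, here is a toy user-space model of the
policy described above. It is only a sketch under simplifying assumptions:
each region is reduced to a count of used pages, and every name in it
(sketch_mem, sketch_alloc_page, sketch_evacuate) is made up for illustration
and does not appear in the patchset.

```c
#include <assert.h>

#define SKETCH_NR_REGIONS   4
#define SKETCH_REGION_PAGES 8

/* used[r] = number of allocated pages in region r. */
struct sketch_mem {
    int used[SKETCH_NR_REGIONS];
};

/* Allocate one page from the lowest-numbered region that has room. */
static int sketch_alloc_page(struct sketch_mem *m)
{
    int r;

    assert(m != 0);
    for (r = 0; r < SKETCH_NR_REGIONS; r++) {
        if (m->used[r] < SKETCH_REGION_PAGES) {
            m->used[r]++;
            return r;
        }
    }
    return -1;  /* out of memory */
}

/*
 * Evacuate the highest-numbered non-empty region by moving its pages
 * into lower regions. Returns the region emptied, or -1 if nothing
 * could be fully evacuated.
 */
static int sketch_evacuate(struct sketch_mem *m)
{
    int src, dst, room = 0;

    /* Scan downwards for the highest region that still holds pages. */
    for (src = SKETCH_NR_REGIONS - 1; src > 0 && m->used[src] == 0; src--)
        ;
    if (src == 0)
        return -1;      /* only region 0 (or nothing) is in use */

    for (dst = 0; dst < src; dst++)
        room += SKETCH_REGION_PAGES - m->used[dst];
    if (room < m->used[src])
        return -1;      /* lower regions cannot absorb everything */

    /* Fill the lowest regions first, mirroring the allocation order. */
    for (dst = 0; dst < src && m->used[src] > 0; dst++) {
        while (m->used[dst] < SKETCH_REGION_PAGES && m->used[src] > 0) {
            m->used[dst]++;
            m->used[src]--;
        }
    }
    return src;
}
```

The real allocator works on buddy freelists and pageblocks rather than simple
counters, but the invariant is the same: allocations fill from the low end,
evacuation empties from the high end.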
Assumptions and goals of this patchset:
--------------------------------------
In this patchset, we don't handle the part of getting the region boundary info
from the firmware/bootloader and populating it in the kernel data-structures.
The aim of this patchset is to propose and brainstorm on a power-aware design
of the Linux MM which can *use* the region boundary info to influence the MM
at various places such as page allocation, reclamation/compaction etc, thereby
contributing to memory power savings. So, in this patchset, we assume a simple
model in which each 512MB chunk of memory can be independently power-managed,
and hard-code this in the kernel.
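As a quick illustration of what the 512MB model means in page terms, the
arithmetic can be sketched as below. This assumes 4 KiB pages (PAGE_SHIFT of
12), which is architecture-specific; the SKETCH_* names and the helper are
made up for this example.

```c
#include <assert.h>

/* Assumption: 4 KiB pages; the real PAGE_SHIFT is architecture-specific. */
#define SKETCH_PAGE_SHIFT   12
/* 512 MB = 2^29 bytes, so a region spans 2^(29 - PAGE_SHIFT) pages. */
#define SKETCH_REGION_SHIFT (29 - SKETCH_PAGE_SHIFT)
#define SKETCH_REGION_PAGES (1UL << SKETCH_REGION_SHIFT)

/* Index of the region containing pfn, counted from the start of the node. */
static unsigned long sketch_region_of_pfn(unsigned long pfn,
                                          unsigned long node_start_pfn)
{
    assert(pfn >= node_start_pfn);
    return (pfn - node_start_pfn) >> SKETCH_REGION_SHIFT;
}
```

With these assumptions a region spans 131072 pages, so the page at offset
131072 from the start of the node is the first page of region 1.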
However, it's not very far-fetched to try this out with actual region boundary
info to get the real power savings numbers. For example, on ARM platforms, we
can make the bootloader export this info to the OS via device-tree and then run
this patchset. (This was the method used to get the power-numbers in [4]). But
even without doing that, we can very well evaluate the effectiveness of this
patchset in contributing to power-savings, by analyzing the free page statistics
per-memory-region; and we can observe the performance impact by running
benchmarks - this is the approach currently used to evaluate this patchset.
Experimental Results:
====================
In a nutshell, here are the results (higher is better):
                  Free regions at test-start   Free regions after test-run
Without patchset               214                          8
With patchset                  210                        202
This shows that this patchset performs far better than mainline in keeping
allocations consolidated to a minimum no. of regions: 202 regions remain free
after the test run, versus 8 without the patchset.
I'll include the detailed results as a reply to this cover-letter, since it
can benefit from a dedicated discussion.
This patchset has been hosted in the below git tree. It applies cleanly on
v3.12-rc2.
git://github.com/srivatsabhat/linux.git mem-power-mgmt-v4
Changes in v4:
=============
* Revived and redesigned the targeted region compaction code. Added a dedicated
  per-node kthread to perform the evacuation, instead of the workqueue worker
  used in the previous design.
* Redesigned the locking scheme in the targeted evacuation code to be much
  simpler and more elegant.
* Fixed a bug pointed out by Yasuaki Ishimatsu.
* Got much better results (consolidation ratio) than v3, due to the addition of
  the targeted compaction logic. [ v3 used to get us to around 120, whereas
  this v4 is going up to 202! :-) ].
Some important TODOs:
====================
1. Add optimizations to improve the performance and reduce the overhead in
   the MM hot paths.
2. Add support for making this patchset work with sparsemem, THP, memcg etc.
References:
----------
[1]. LWN article that explains the goals and the design of my Memory Power
     Management patchset:
     http://lwn.net/Articles/547439/
[2]. v3 of the Memory Power Management patchset, with a new split-allocator
     design:
     http://lwn.net/Articles/565371/
[3]. v2 of the "Sorted-buddy" patchset with support for targeted memory
     region compaction:
     http://lwn.net/Articles/546696/
     LWN article describing this design: http://lwn.net/Articles/547439/
     v1 of the patchset:
     http://thread.gmane.org/gmane.linux.power-management.general/28498
[4]. Estimate of potential power savings on Samsung exynos board
     http://article.gmane.org/gmane.linux.kernel.mm/65935
[5]. C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and Tom Keller.
     Energy management for commercial servers. In IEEE Computer, pages 39–48,
     Dec 2003.
     Link: researcher.ibm.com/files/us-lefurgy/computer2003.pdf
[6]. ACPI 5.0 and MPST support
     http://www.acpi.info/spec.htm
     Section 5.2.21 Memory Power State Table (MPST)
[7]. Prototype implementation of parsing of ACPI 5.0 MPST tables, by Srinivas
     Pandruvada.
     https://lkml.org/lkml/2013/4/18/349
 Srivatsa S. Bhat (40):
      mm: Introduce memory regions data-structure to capture region boundaries within nodes
      mm: Initialize node memory regions during boot
      mm: Introduce and initialize zone memory regions
      mm: Add helpers to retrieve node region and zone region for a given page
      mm: Add data-structures to describe memory regions within the zones' freelists
      mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
      mm: Track the freepage migratetype of pages accurately
      mm: Use the correct migratetype during buddy merging
      mm: Add an optimized version of del_from_freelist to keep page allocation fast
      bitops: Document the difference in indexing between fls() and __fls()
      mm: A new optimized O(log n) sorting algo to speed up buddy-sorting
      mm: Add support to accurately track per-memory-region allocation
      mm: Print memory region statistics to understand the buddy allocator behavior
      mm: Enable per-memory-region fragmentation stats in pagetypeinfo
      mm: Add aggressive bias to prefer lower regions during page allocation
      mm: Introduce a "Region Allocator" to manage entire memory regions
      mm: Add a mechanism to add pages to buddy freelists in bulk
      mm: Provide a mechanism to delete pages from buddy freelists in bulk
      mm: Provide a mechanism to release free memory to the region allocator
      mm: Provide a mechanism to request free memory from the region allocator
      mm: Maintain the counter for freepages in the region allocator
      mm: Propagate the sorted-buddy bias for picking free regions, to region allocator
      mm: Fix vmstat to also account for freepages in the region allocator
      mm: Drop some very expensive sorted-buddy related checks under DEBUG_PAGEALLOC
      mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA => RA flow
      mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA <= RA flow
      mm: Update the freepage migratetype of pages during region allocation
      mm: Provide a mechanism to check if a given page is in the region allocator
      mm: Add a way to request pages of a particular region from the region allocator
      mm: Modify move_freepages() to handle pages in the region allocator properly
      mm: Never change migratetypes of pageblocks during freepage stealing
      mm: Set pageblock migratetype when allocating regions from region allocator
      mm: Use a cache between page-allocator and region-allocator
      mm: Restructure the compaction part of CMA for wider use
      mm: Add infrastructure to evacuate memory regions using compaction
      kthread: Split out kthread-worker bits to avoid circular header-file dependency
      mm: Add a kthread to perform targeted compaction for memory power management
      mm: Add a mechanism to queue work to the kmempowerd kthread
      mm: Add intelligence in kmempowerd to ignore regions unsuitable for evacuation
      mm: Add triggers in the page-allocator to kick off region evacuation
 arch/x86/include/asm/bitops.h      |    4 
 include/asm-generic/bitops/__fls.h |    5 
 include/linux/compaction.h         |    7 
 include/linux/gfp.h                |    2 
 include/linux/kthread-work.h       |   92 +++
 include/linux/kthread.h            |   85 ---
 include/linux/migrate.h            |    3 
 include/linux/mm.h                 |   43 ++
 include/linux/mmzone.h             |   87 +++
 include/trace/events/migrate.h     |    3 
 mm/compaction.c                    |  309 +++++++++++
 mm/internal.h                      |   45 ++
 mm/page_alloc.c                    | 1018 ++++++++++++++++++++++++++++++++----
 mm/vmstat.c                        |  130 ++++-
 14 files changed, 1637 insertions(+), 196 deletions(-)
 create mode 100644 include/linux/kthread-work.h
Regards,
Srivatsa S. Bhat
IBM Linux Technology Center
* [RFC PATCH v4 01/40] mm: Introduce memory regions data-structure to capture region boundaries within nodes
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
@ 2013-09-25 23:13 ` Srivatsa S. Bhat
  2013-10-23  9:54   ` Johannes Weiner
  2013-09-25 23:14 ` [RFC PATCH v4 02/40] mm: Initialize node memory regions during boot Srivatsa S. Bhat
                   ` (39 subsequent siblings)
  40 siblings, 1 reply; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:13 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
The memory within a node can be divided into regions of memory that can be
independently power-managed. That is, chunks of memory can be transitioned
(manually or automatically) to low-power states based on the frequency of
references to that region. For example, if a memory chunk is not referenced
for a given threshold amount of time, the hardware (memory controller) can
decide to put that piece of memory into a content-preserving low-power state.
And of course, on the next reference to that chunk of memory, it will be
transitioned back to full-power for read/write operations.
So, the Linux MM can take advantage of this feature by managing the available
memory with an eye towards power-savings - i.e., by keeping the memory
allocations/references consolidated to a minimum no. of such power-manageable
memory regions. In order to do so, the first step is to teach the MM about
the boundaries of these regions - and to capture that info, we introduce a new
data-structure called "Memory Regions".
[Also, the concept of memory regions could potentially be extended to work
with different classes of memory like PCM (Phase Change Memory) etc and
hence, it is not limited to just power management alone].
We already sub-divide a node's memory into zones, based on some well-known
constraints. So the question is: where do memory regions fit in this
hierarchy? Instead of artificially trying to fit them into the hierarchy one
way or the other, we choose to simply capture the region boundaries in a
parallel data-structure, since most likely the region boundaries won't
naturally fit inside the zone boundaries or vice-versa.
But of course, memory regions are sub-divisions *within* a node, so it makes
sense to keep the data-structures in the node's struct pglist_data. (Thus
this placement makes memory regions parallel to zones in that node).
Once we capture the region boundaries in the memory regions data-structure,
we can influence MM decisions at various places, such as page allocation,
reclamation etc, in order to perform power-aware memory management.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |   12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd791e4..d3288b0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -35,6 +35,8 @@
  */
 #define PAGE_ALLOC_COSTLY_ORDER 3
 
+#define MAX_NR_NODE_REGIONS	512
+
 enum {
 	MIGRATE_UNMOVABLE,
 	MIGRATE_RECLAIMABLE,
@@ -708,6 +710,14 @@ struct node_active_region {
 extern struct page *mem_map;
 #endif
 
+struct node_mem_region {
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	unsigned long present_pages;
+	unsigned long spanned_pages;
+	struct pglist_data *pgdat;
+};
+
 /*
  * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
  * (mostly NUMA machines?) to denote a higher-level memory zone than the
@@ -724,6 +734,8 @@ typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
 	struct zonelist node_zonelists[MAX_ZONELISTS];
 	int nr_zones;
+	struct node_mem_region node_regions[MAX_NR_NODE_REGIONS];
+	int nr_node_regions;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
 #ifdef CONFIG_MEMCG
* [RFC PATCH v4 02/40] mm: Initialize node memory regions during boot
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
  2013-09-25 23:13 ` [RFC PATCH v4 01/40] mm: Introduce memory regions data-structure to capture region boundaries within nodes Srivatsa S. Bhat
@ 2013-09-25 23:14 ` Srivatsa S. Bhat
  2013-09-25 23:14 ` [RFC PATCH v4 03/40] mm: Introduce and initialize zone memory regions Srivatsa S. Bhat
                   ` (38 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:14 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Initialize the node's memory-regions structures with the information about
the region-boundaries, at boot time.
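The splitting step can be sketched in user space as follows. This is a
simplified model, not the kernel code: it leaves out the present/absent page
accounting, assumes 4 KiB pages, and uses made-up names (sketch_region,
sketch_split_node).

```c
#include <assert.h>

#define SKETCH_REGION_PAGES (1UL << 17)   /* 512 MB of 4 KiB pages (assumed) */
#define SKETCH_MAX_REGIONS  512

struct sketch_region {
    unsigned long start_pfn;
    unsigned long end_pfn;      /* exclusive */
    unsigned long spanned_pages;
};

/*
 * Split [start_pfn, end_pfn) into consecutive regions of at most
 * SKETCH_REGION_PAGES pages each; returns the number of regions filled in.
 */
static int sketch_split_node(unsigned long start_pfn, unsigned long end_pfn,
                             struct sketch_region *regions)
{
    unsigned long pfn = start_pfn;
    int idx = 0;

    while (pfn < end_pfn) {
        unsigned long left = end_pfn - pfn;
        struct sketch_region *r;

        /* Guard against overflowing the fixed-size region array. */
        assert(idx < SKETCH_MAX_REGIONS);
        r = &regions[idx++];
        r->start_pfn = pfn;
        /* The last region may be shorter than a full 512 MB. */
        r->spanned_pages = left < SKETCH_REGION_PAGES ?
                           left : SKETCH_REGION_PAGES;
        r->end_pfn = pfn + r->spanned_pages;
        pfn = r->end_pfn;
    }
    return idx;
}
```

The bounds assertion is an addition made in this sketch only. With 512
regions of 512MB each, the fixed-size array covers up to 256GB per node.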
Based-on-patch-by: Ankita Garg <gargankita@gmail.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mm.h |    4 ++++
 mm/page_alloc.c    |   28 ++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..223be46 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -620,6 +620,10 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
+/* Hard-code memory region size to be 512 MB for now. */
+#define MEM_REGION_SHIFT	(29 - PAGE_SHIFT)
+#define MEM_REGION_SIZE		(1UL << MEM_REGION_SHIFT)
+
 static inline enum zone_type page_zonenum(const struct page *page)
 {
 	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ee638f..26835c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4858,6 +4858,33 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
 #endif /* CONFIG_FLAT_NODE_MEM_MAP */
 }
 
+static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
+{
+	int nid = pgdat->node_id;
+	unsigned long start_pfn = pgdat->node_start_pfn;
+	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
+	struct node_mem_region *region;
+	unsigned long i, absent;
+	int idx;
+
+	for (i = start_pfn, idx = 0; i < end_pfn;
+				i += region->spanned_pages, idx++) {
+
+		region = &pgdat->node_regions[idx];
+		region->pgdat = pgdat;
+		region->start_pfn = i;
+		region->spanned_pages = min(MEM_REGION_SIZE, end_pfn - i);
+		region->end_pfn = region->start_pfn + region->spanned_pages;
+
+		absent = __absent_pages_in_range(nid, region->start_pfn,
+						 region->end_pfn);
+
+		region->present_pages = region->spanned_pages - absent;
+	}
+
+	pgdat->nr_node_regions = idx;
+}
+
 void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 		unsigned long node_start_pfn, unsigned long *zholes_size)
 {
@@ -4886,6 +4913,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 
 	free_area_init_core(pgdat, start_pfn, end_pfn,
 			    zones_size, zholes_size);
+	init_node_memory_regions(pgdat);
 }
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
* [RFC PATCH v4 03/40] mm: Introduce and initialize zone memory regions
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
  2013-09-25 23:13 ` [RFC PATCH v4 01/40] mm: Introduce memory regions data-structure to capture region boundaries within nodes Srivatsa S. Bhat
  2013-09-25 23:14 ` [RFC PATCH v4 02/40] mm: Initialize node memory regions during boot Srivatsa S. Bhat
@ 2013-09-25 23:14 ` Srivatsa S. Bhat
  2013-09-25 23:14 ` [RFC PATCH v4 04/40] mm: Add helpers to retrieve node region and zone region for a given page Srivatsa S. Bhat
                   ` (37 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:14 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Memory region boundaries don't necessarily fit on zone boundaries. So we need
to maintain a zone-level mapping of the absolute memory region boundaries.
"Node Memory Regions" will be used to capture the absolute region boundaries.
Add "Zone Memory Regions" to track the subsets of the absolute memory regions
that fall within the zone boundaries.
Eg:
	|<----------------------Node---------------------->|
	 __________________________________________________
	|      Node mem reg 0 	 |      Node mem reg 1     |  (Absolute region
	|________________________|_________________________|   boundaries)
	 __________________________________________________
	|    ZONE_DMA   |	    ZONE_NORMAL		   |
	|               |                                  |
	|<--- ZMR 0 --->|<-ZMR0->|<-------- ZMR 1 -------->|
	|_______________|________|_________________________|
In the above figure,
ZONE_DMA will have only 1 zone memory region (say, Zone mem reg 0) which is a
subset of Node mem reg 0 (i.e., the portion of Node mem reg 0 that intersects
with ZONE_DMA).
ZONE_NORMAL will have 2 zone memory regions (say, Zone mem reg 0 and
Zone mem reg 1) which are subsets of Node mem reg 0 and Node mem reg 1
respectively, that intersect with ZONE_NORMAL's range.
Most of the MM algorithms (like page allocation etc) work within a zone,
hence such a zone-level mapping of the absolute region boundaries will be
very useful in influencing the MM decisions at those places.
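The intersection illustrated in the figure can be sketched as a small
stand-alone function. This is a simplified model handling one zone at a time,
with made-up names (sketch_range, sketch_zone_regions); it assumes the node
regions are sorted by pfn and do not overlap.

```c
#include <assert.h>

struct sketch_range {
    unsigned long start_pfn;
    unsigned long end_pfn;   /* exclusive */
};

/*
 * Intersect one zone with an ascending array of node regions.
 * Writes the overlapping pieces to out[] and returns how many there are.
 */
static int sketch_zone_regions(struct sketch_range zone,
                               const struct sketch_range *node_regions,
                               int nr_node_regions,
                               struct sketch_range *out)
{
    int i, n = 0;

    assert(zone.start_pfn <= zone.end_pfn);
    for (i = 0; i < nr_node_regions; i++) {
        unsigned long s = node_regions[i].start_pfn;
        unsigned long e = node_regions[i].end_pfn;

        if (e <= zone.start_pfn)
            continue;   /* region lies entirely below the zone */
        if (s >= zone.end_pfn)
            break;      /* regions are sorted: nothing further overlaps */

        /* Clamp the node region to the zone's boundaries. */
        out[n].start_pfn = s > zone.start_pfn ? s : zone.start_pfn;
        out[n].end_pfn = e < zone.end_pfn ? e : zone.end_pfn;
        n++;
    }
    return n;
}
```

For the figure above, a ZONE_DMA that ends partway through node region 0
yields one piece, and a ZONE_NORMAL covering the rest of the node yields two.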
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |   11 +++++++++
 mm/page_alloc.c        |   62 +++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d3288b0..3c1dc97 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -36,6 +36,7 @@
 #define PAGE_ALLOC_COSTLY_ORDER 3
 
 #define MAX_NR_NODE_REGIONS	512
+#define MAX_NR_ZONE_REGIONS	MAX_NR_NODE_REGIONS
 
 enum {
 	MIGRATE_UNMOVABLE,
@@ -313,6 +314,13 @@ enum zone_type {
 
 #ifndef __GENERATING_BOUNDS_H
 
+struct zone_mem_region {
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	unsigned long present_pages;
+	unsigned long spanned_pages;
+};
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 
@@ -369,6 +377,9 @@ struct zone {
 #endif
 	struct free_area	free_area[MAX_ORDER];
 
+	struct zone_mem_region	zone_regions[MAX_NR_ZONE_REGIONS];
+	int 			nr_zone_regions;
+
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 26835c4..10a1cc8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4885,6 +4885,66 @@ static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
 	pgdat->nr_node_regions = idx;
 }
 
+static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
+{
+	unsigned long start_pfn, end_pfn, absent;
+	unsigned long z_start_pfn, z_end_pfn;
+	int i, j, idx, nid = pgdat->node_id;
+	struct node_mem_region *node_region;
+	struct zone_mem_region *zone_region;
+	struct zone *z;
+
+	for (i = 0, j = 0; i < pgdat->nr_zones; i++) {
+		z = &pgdat->node_zones[i];
+		z_start_pfn = z->zone_start_pfn;
+		z_end_pfn = z->zone_start_pfn + z->spanned_pages;
+		idx = 0;
+
+		for ( ; j < pgdat->nr_node_regions; j++) {
+			node_region = &pgdat->node_regions[j];
+
+			/*
+			 * Skip node memory regions that don't intersect with
+			 * this zone.
+			 */
+			if (node_region->end_pfn <= z_start_pfn)
+				continue; /* Move to next higher node region */
+
+			if (node_region->start_pfn >= z_end_pfn)
+				break; /* Move to next higher zone */
+
+			start_pfn = max(z_start_pfn, node_region->start_pfn);
+			end_pfn = min(z_end_pfn, node_region->end_pfn);
+
+			zone_region = &z->zone_regions[idx];
+			zone_region->start_pfn = start_pfn;
+			zone_region->end_pfn = end_pfn;
+			zone_region->spanned_pages = end_pfn - start_pfn;
+
+			absent = __absent_pages_in_range(nid, start_pfn,
+						         end_pfn);
+			zone_region->present_pages =
+					zone_region->spanned_pages - absent;
+
+			idx++;
+		}
+
+		z->nr_zone_regions = idx;
+
+		/*
+		 * Revisit the last visited node memory region, in case it
+		 * spans multiple zones.
+		 */
+		j--;
+	}
+}
+
+static void __meminit init_memory_regions(struct pglist_data *pgdat)
+{
+	init_node_memory_regions(pgdat);
+	init_zone_memory_regions(pgdat);
+}
+
 void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 		unsigned long node_start_pfn, unsigned long *zholes_size)
 {
@@ -4913,7 +4973,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 
 	free_area_init_core(pgdat, start_pfn, end_pfn,
 			    zones_size, zholes_size);
-	init_node_memory_regions(pgdat);
+	init_memory_regions(pgdat);
 }
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
* [RFC PATCH v4 04/40] mm: Add helpers to retrieve node region and zone region for a given page
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (2 preceding siblings ...)
  2013-09-25 23:14 ` [RFC PATCH v4 03/40] mm: Introduce and initialize zone memory regions Srivatsa S. Bhat
@ 2013-09-25 23:14 ` Srivatsa S. Bhat
  2013-09-25 23:14 ` [RFC PATCH v4 05/40] mm: Add data-structures to describe memory regions within the zones' freelists Srivatsa S. Bhat
                   ` (36 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:14 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Given a page, we would like to have an efficient mechanism to find out
the node memory region and the zone memory region to which it belongs.
Since the node is assumed to be divided into equal-sized node memory
regions, the node memory region can be obtained by simply right-shifting
the page's pfn by 'MEM_REGION_SHIFT'.
But finding the corresponding zone memory region's index in the zone is
not that straightforward. To have an O(1) algorithm to find it out, define a
zone_region_idx[] array to store the zone memory region indices for every
node memory region.
To illustrate, consider the following example:
	|<----------------------Node---------------------->|
	 __________________________________________________
	|      Node mem reg 0 	 |      Node mem reg 1     |  (Absolute region
	|________________________|_________________________|   boundaries)
	 __________________________________________________
	|    ZONE_DMA   |	    ZONE_NORMAL		   |
	|               |                                  |
	|<--- ZMR 0 --->|<-ZMR0->|<-------- ZMR 1 -------->|
	|_______________|________|_________________________|
In the above figure,
Node mem region 0:
------------------
This region corresponds to the first zone mem region in ZONE_DMA and also
the first zone mem region in ZONE_NORMAL. Hence its index array would look
like this:
    node_regions[0].zone_region_idx[ZONE_DMA]     == 0
    node_regions[0].zone_region_idx[ZONE_NORMAL]  == 0
Node mem region 1:
------------------
This region corresponds to the second zone mem region in ZONE_NORMAL. Hence
its index array would look like this:
    node_regions[1].zone_region_idx[ZONE_NORMAL]  == 1
Using this index array, we can quickly obtain the zone memory region to
which a given page belongs.
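The two-step lookup can be modelled in user space like this. It is a sketch
with only two zones and made-up names (sketch_node_region,
sketch_zone_region_id), assuming 4 KiB pages so that a 512MB region spans
2^17 pages.

```c
#include <assert.h>

#define SKETCH_NR_ZONES     2
#define SKETCH_ZONE_DMA     0
#define SKETCH_ZONE_NORMAL  1
#define SKETCH_REGION_SHIFT 17   /* 512 MB regions with 4 KiB pages (assumed) */

struct sketch_node_region {
    int zone_region_idx[SKETCH_NR_ZONES];  /* -1: no overlap with that zone */
};

/* O(1) lookup: one shift to find the node region, then one table read. */
static int sketch_zone_region_id(const struct sketch_node_region *regions,
                                 unsigned long node_start_pfn,
                                 unsigned long pfn, int zone)
{
    unsigned long node_region = (pfn - node_start_pfn) >> SKETCH_REGION_SHIFT;

    assert(zone >= 0 && zone < SKETCH_NR_ZONES);
    return regions[node_region].zone_region_idx[zone];
}
```

Filling the table with the values from the figure (region 0 maps to index 0
in both zones, region 1 maps to index 1 in ZONE_NORMAL only) gives the
expected answers without any searching.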
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mm.h     |   24 ++++++++++++++++++++++++
 include/linux/mmzone.h |    7 +++++++
 mm/page_alloc.c        |   22 ++++++++++++++++++++++
 3 files changed, 53 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 223be46..307f375 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -716,6 +716,30 @@ static inline struct zone *page_zone(const struct page *page)
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
 }
 
+static inline int page_node_region_id(const struct page *page,
+				      const pg_data_t *pgdat)
+{
+	return (page_to_pfn(page) - pgdat->node_start_pfn) >> MEM_REGION_SHIFT;
+}
+
+/**
+ * Return the index of the zone memory region to which the page belongs.
+ *
+ * Given a page, find the absolute (node) memory region as well as the zone to
+ * which it belongs. Then find the region within the zone that corresponds to
+ * that node memory region, and return its index.
+ */
+static inline int page_zone_region_id(const struct page *page)
+{
+	pg_data_t *pgdat = NODE_DATA(page_to_nid(page));
+	enum zone_type z_num = page_zonenum(page);
+	unsigned long node_region_idx;
+
+	node_region_idx = page_node_region_id(page, pgdat);
+
+	return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
+}
+
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3c1dc97..a22358c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -726,6 +726,13 @@ struct node_mem_region {
 	unsigned long end_pfn;
 	unsigned long present_pages;
 	unsigned long spanned_pages;
+
+	/*
+	 * A physical (node) region could be split across multiple zones.
+	 * Store the indices of the corresponding regions of each such
+	 * zone for this physical (node) region.
+	 */
+	int zone_region_idx[MAX_NR_ZONES];
 	struct pglist_data *pgdat;
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 10a1cc8..d747f92 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4885,6 +4885,24 @@ static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
 	pgdat->nr_node_regions = idx;
 }
 
+/*
+ * Zone-region indices are used to map node-memory-regions to
+ * zone-memory-regions. Initialize all of them to an invalid value (-1),
+ * to make way for the correct mapping to be set up subsequently.
+ */
+static void __meminit init_zone_region_indices(struct pglist_data *pgdat)
+{
+	struct node_mem_region *node_region;
+	int i, j;
+
+	for (i = 0; i < pgdat->nr_node_regions; i++) {
+		node_region = &pgdat->node_regions[i];
+
+		for (j = 0; j < pgdat->nr_zones; j++)
+			node_region->zone_region_idx[j] = -1;
+	}
+}
+
 static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 {
 	unsigned long start_pfn, end_pfn, absent;
@@ -4894,6 +4912,9 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 	struct zone_mem_region *zone_region;
 	struct zone *z;
 
+	/* Set all the zone-region indices to -1 */
+	init_zone_region_indices(pgdat);
+
 	for (i = 0, j = 0; i < pgdat->nr_zones; i++) {
 		z = &pgdat->node_zones[i];
 		z_start_pfn = z->zone_start_pfn;
@@ -4926,6 +4947,7 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 			zone_region->present_pages =
 					zone_region->spanned_pages - absent;
 
+			node_region->zone_region_idx[zone_idx(z)] = idx;
 			idx++;
 		}
 
* [RFC PATCH v4 05/40] mm: Add data-structures to describe memory regions within the zones' freelists
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (3 preceding siblings ...)
  2013-09-25 23:14 ` [RFC PATCH v4 04/40] mm: Add helpers to retrieve node region and zone region for a given page Srivatsa S. Bhat
@ 2013-09-25 23:14 ` Srivatsa S. Bhat
  2013-09-25 23:14 ` [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in " Srivatsa S. Bhat
                   ` (35 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:14 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
In order to influence page allocation decisions (i.e., to make page allocation
region-aware), we need to be able to distinguish pageblocks belonging to
different zone memory regions within the zones' (buddy) freelists.

So, within every freelist in a zone, provide pointers to describe the
boundaries of zone memory regions and counters to track the number of free
pageblocks within each region.

Also, fix up the references to the freelist's list_head inside struct free_area,
since each freelist is now a struct free_list that embeds the list_head.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |   17 ++++++++++++++++-
 mm/compaction.c        |    2 +-
 mm/page_alloc.c        |   23 ++++++++++++-----------
 mm/vmstat.c            |    2 +-
 4 files changed, 30 insertions(+), 14 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a22358c..2ac8025 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -83,8 +83,23 @@ static inline int get_pageblock_migratetype(struct page *page)
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
+struct mem_region_list {
+	struct list_head	*page_block;
+	unsigned long		nr_free;
+};
+
+struct free_list {
+	struct list_head	list;
+
+	/*
+	 * Demarcates pageblocks belonging to different regions within
+	 * this freelist.
+	 */
+	struct mem_region_list	mr_list[MAX_NR_ZONE_REGIONS];
+};
+
 struct free_area {
-	struct list_head	free_list[MIGRATE_TYPES];
+	struct free_list	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
 };
 
diff --git a/mm/compaction.c b/mm/compaction.c
index c437893..511b191 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -858,7 +858,7 @@ static int compact_finished(struct zone *zone,
 		struct free_area *area = &zone->free_area[order];
 
 		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&area->free_list[cc->migratetype]))
+		if (!list_empty(&area->free_list[cc->migratetype].list))
 			return COMPACT_PARTIAL;
 
 		/* Job done if allocation would set block type */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d747f92..e9d8082 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -606,12 +606,13 @@ static inline void __free_one_page(struct page *page,
 		higher_buddy = higher_page + (buddy_idx - combined_idx);
 		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
 			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
+				&zone->free_area[order].free_list[migratetype].list);
 			goto out;
 		}
 	}
 
-	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+	list_add(&page->lru,
+		&zone->free_area[order].free_list[migratetype].list);
 out:
 	zone->free_area[order].nr_free++;
 }
@@ -832,7 +833,7 @@ static inline void expand(struct zone *zone, struct page *page,
 			continue;
 		}
 #endif
-		list_add(&page[size].lru, &area->free_list[migratetype]);
+		list_add(&page[size].lru, &area->free_list[migratetype].list);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -894,10 +895,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
 		area = &(zone->free_area[current_order]);
-		if (list_empty(&area->free_list[migratetype]))
+		if (list_empty(&area->free_list[migratetype].list))
 			continue;
 
-		page = list_entry(area->free_list[migratetype].next,
+		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
 		list_del(&page->lru);
 		rmv_page_order(page);
@@ -969,7 +970,7 @@ int move_freepages(struct zone *zone,
 
 		order = page_order(page);
 		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype]);
+			  &zone->free_area[order].free_list[migratetype].list);
 		set_freepage_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
@@ -1076,10 +1077,10 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 				break;
 
 			area = &(zone->free_area[current_order]);
-			if (list_empty(&area->free_list[migratetype]))
+			if (list_empty(&area->free_list[migratetype].list))
 				continue;
 
-			page = list_entry(area->free_list[migratetype].next,
+			page = list_entry(area->free_list[migratetype].list.next,
 					struct page, lru);
 			area->nr_free--;
 
@@ -1323,7 +1324,7 @@ void mark_free_pages(struct zone *zone)
 		}
 
 	for_each_migratetype_order(order, t) {
-		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+		list_for_each(curr, &zone->free_area[order].free_list[t].list) {
 			unsigned long i;
 
 			pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -3193,7 +3194,7 @@ void show_free_areas(unsigned int filter)
 
 			types[order] = 0;
 			for (type = 0; type < MIGRATE_TYPES; type++) {
-				if (!list_empty(&area->free_list[type]))
+				if (!list_empty(&area->free_list[type].list))
 					types[order] |= 1 << type;
 			}
 		}
@@ -4049,7 +4050,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 {
 	int order, t;
 	for_each_migratetype_order(order, t) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+		INIT_LIST_HEAD(&zone->free_area[order].free_list[t].list);
 		zone->free_area[order].nr_free = 0;
 	}
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9bb3145..c967043 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,7 +901,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 
 			area = &(zone->free_area[order]);
 
-			list_for_each(curr, &area->free_list[mtype])
+			list_for_each(curr, &area->free_list[mtype].list)
 				freecount++;
 			seq_printf(m, "%6lu ", freecount);
 		}
* [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (4 preceding siblings ...)
  2013-09-25 23:14 ` [RFC PATCH v4 05/40] mm: Add data-structures to describe memory regions within the zones' freelists Srivatsa S. Bhat
@ 2013-09-25 23:14 ` Srivatsa S. Bhat
  2013-09-26 22:16   ` Dave Hansen
  2013-10-23 10:17   ` Johannes Weiner
  2013-09-25 23:15 ` [RFC PATCH v4 07/40] mm: Track the freepage migratetype of pages accurately Srivatsa S. Bhat
                   ` (34 subsequent siblings)
  40 siblings, 2 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:14 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
The zones' freelists need to be made region-aware, in order to influence
page allocation and freeing algorithms. So in every free list in the zone, we
would like to demarcate the pageblocks belonging to different memory regions
(we can do this using a set of pointers, and thus avoid splitting up the
freelists).

Also, we would like to keep the pageblocks in the freelists sorted in
region-order. That is, pageblocks belonging to region-0 would come first,
followed by pageblocks belonging to region-1 and so on, within a given
freelist. Of course, the pageblocks belonging to the same region need
not be sorted among themselves; it is sufficient to maintain the pageblocks
in region-sorted order, rather than a full address-sorted order.

For each freelist within the zone, we maintain a set of pointers to
pageblocks belonging to the various memory regions in that zone.

E.g.:
    |<---Region0--->|   |<---Region1--->|   |<-------Region2--------->|
     ____      ____      ____      ____      ____      ____      ____
--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->
                 ^                  ^                              ^
                 |                  |                              |
                Reg0               Reg1                          Reg2
Page allocation will proceed as usual - pick the first item on the free list.
But we don't want to keep updating these region pointers every time we allocate
a pageblock from the freelist. So, instead of pointing to the *first* pageblock
of that region, we maintain the region pointers such that they point to the
*last* pageblock in that region, as shown in the figure above. That way, as
long as there is more than one pageblock of that region in that freelist, that
region pointer doesn't need to be updated.
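The effect of pointing at the *last* pageblock can be illustrated with a small
userspace model. This is only a sketch, not the kernel code: struct block,
struct tail_model and the helper names are hypothetical, and a singly linked
list stands in for the kernel's list_head. It rebuilds the list from the figure
(2, 2 and 3 pageblocks in regions 0, 1 and 2) and counts how often a region
pointer has to change while blocks are popped from the head.

```c
#include <assert.h>
#include <stddef.h>

#define NR_REGIONS 3

/* Hypothetical stand-in for a free pageblock on a freelist. */
struct block {
	struct block *next;
	int region;
};

struct tail_model {
	struct block *first;
	struct block *last[NR_REGIONS];	/* last block of each region, or NULL */
	int pointer_updates;		/* how often a region pointer changed */
};

/* Pop the first block, as page allocation does. */
struct block *pop_first(struct tail_model *m)
{
	struct block *b = m->first;

	if (!b)
		return NULL;
	m->first = b->next;
	/* Only the removal of a region's last block touches its pointer. */
	if (m->last[b->region] == b) {
		m->last[b->region] = NULL;
		m->pointer_updates++;
	}
	return b;
}

/* Build the list from the figure: regions 0, 0, 1, 1, 2, 2, 2. */
void build_example(struct tail_model *m, struct block blocks[7])
{
	static const int regions[7] = { 0, 0, 1, 1, 2, 2, 2 };

	m->first = &blocks[0];
	m->pointer_updates = 0;
	for (int i = 0; i < NR_REGIONS; i++)
		m->last[i] = NULL;
	for (int i = 0; i < 7; i++) {
		blocks[i].region = regions[i];
		blocks[i].next = (i < 6) ? &blocks[i + 1] : NULL;
		m->last[regions[i]] = &blocks[i];	/* ends at the last block */
	}
}

/* Pop 'n' blocks and return how many region pointers had to change. */
int updates_after_pops(int n)
{
	struct block blocks[7];
	struct tail_model m;

	build_example(&m, blocks);
	for (int i = 0; i < n; i++)
		pop_first(&m);
	return m.pointer_updates;
}
```

In this model, popping all seven blocks changes a region pointer only three
times (once per region). Pointers to the *first* block of each region would
have to move on every pop.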
Page allocation algorithm:
-------------------------
The heart of the page allocation algorithm remains as it is - pick the first
item on the appropriate freelist and return it.
Arrangement of pageblocks in the zone freelists:
-----------------------------------------------
This is the main change - we keep the pageblocks in region-sorted order,
where pageblocks belonging to region-0 come first, followed by those belonging
to region-1 and so on. But the pageblocks within a given region need *not* be
sorted, since we need them to be only region-sorted and not fully
address-sorted.

This sorting is performed when adding pages back to the freelists, thus
avoiding any region-related overhead in the critical page allocation
paths.
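A minimal userspace sketch of such a region-sorted insert is given here,
assuming a circular doubly linked list similar in spirit to the kernel's
list_head. All names (struct link, struct sorted_list, sorted_add and so on)
are hypothetical and exist only for this illustration. A block either slots in
just before the last block of its region, or, if its region is not yet on the
list, goes right after the last block of the nearest lower-numbered region.

```c
#include <assert.h>
#include <stddef.h>

#define SORT_NR_REGIONS 4

struct link {
	struct link *prev, *next;
	int region;	/* region id; not meaningful for the list head */
};

struct sorted_list {
	struct link head;			/* sentinel */
	struct link *last[SORT_NR_REGIONS];	/* last block per region */
};

void sorted_init(struct sorted_list *sl)
{
	sl->head.prev = sl->head.next = &sl->head;
	for (int i = 0; i < SORT_NR_REGIONS; i++)
		sl->last[i] = NULL;
}

/* Link 'n' in between two adjacent links. */
void link_between(struct link *n, struct link *prev, struct link *next)
{
	n->prev = prev;
	n->next = next;
	prev->next = n;
	next->prev = n;
}

/* Free a block: insert it so that the list stays sorted by region id. */
void sorted_add(struct sorted_list *sl, struct link *n, int region)
{
	struct link *after = &sl->head;

	n->region = region;
	if (sl->last[region]) {
		/* Region already present: slot in just before its last block. */
		link_between(n, sl->last[region]->prev, sl->last[region]);
		return;
	}
	/* First block of this region: go after the nearest lower region. */
	for (int i = region - 1; i >= 0; i--) {
		if (sl->last[i]) {
			after = sl->last[i];
			break;
		}
	}
	link_between(n, after, after->next);
	sl->last[region] = n;
}

/* Return 1 if region ids never decrease from head to tail. */
int is_region_sorted(const struct sorted_list *sl)
{
	const struct link *p;

	for (p = sl->head.next; p != &sl->head; p = p->next)
		if (p->next != &sl->head && p->region > p->next->region)
			return 0;
	return 1;
}
```

Freeing blocks of regions 2, 0, 3, 0, 2, 1 in that order leaves the list as
0, 0, 1, 2, 2, 3. All of the ordering work happens on the free side; taking the
first element needs no search at all.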
Strategy to consolidate allocations to a minimum no. of regions:
---------------------------------------------------------------
Page allocation happens in the order of increasing region number. We would
like to do light-weight page reclaim or compaction (for the purpose of memory
power management) in the reverse order, to keep the allocated pages within
a minimum number of regions (approximately). The latter part is implemented
in subsequent patches.
---------------------------- Increasing region number---------------------->
Direction of allocation--->                <---Direction of reclaim/compaction
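The two opposing directions can be sketched as follows. This is an
illustration only: the reclaim/compaction side is implemented in later patches
of the series, and the per-region counter arrays and function names used here
are hypothetical, not taken from those patches.

```c
#include <assert.h>

#define DIR_NR_REGIONS 4

/* Lowest-numbered region with a free block: where allocation lands next. */
int next_alloc_region(const unsigned long nr_free[DIR_NR_REGIONS])
{
	for (int i = 0; i < DIR_NR_REGIONS; i++)
		if (nr_free[i])
			return i;
	return -1;	/* nothing free */
}

/* Highest-numbered region still in use: the first candidate to evacuate. */
int next_evacuation_region(const unsigned long nr_used[DIR_NR_REGIONS])
{
	for (int i = DIR_NR_REGIONS - 1; i >= 0; i--)
		if (nr_used[i])
			return i;
	return -1;	/* nothing allocated */
}
```

With free counts {0, 3, 5, 8} and used counts {8, 5, 3, 0}, allocation
continues in region 1 while evacuation would start from region 2, so the
allocated pages drift towards the low-numbered regions.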
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |  154 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 138 insertions(+), 16 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e9d8082..d48eb04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -517,6 +517,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+static void add_to_freelist(struct page *page, struct free_list *free_list)
+{
+	struct list_head *prev_region_list, *lru;
+	struct mem_region_list *region;
+	int region_id, i;
+
+	lru = &page->lru;
+	region_id = page_zone_region_id(page);
+
+	region = &free_list->mr_list[region_id];
+	region->nr_free++;
+
+	if (region->page_block) {
+		list_add_tail(lru, region->page_block);
+		return;
+	}
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
+#endif
+
+	if (!list_empty(&free_list->list)) {
+		for (i = region_id - 1; i >= 0; i--) {
+			if (free_list->mr_list[i].page_block) {
+				prev_region_list =
+					free_list->mr_list[i].page_block;
+				goto out;
+			}
+		}
+	}
+
+	/* This is the first region, so add to the head of the list */
+	prev_region_list = &free_list->list;
+
+out:
+	list_add(lru, prev_region_list);
+
+	/* Save pointer to page block of this region */
+	region->page_block = lru;
+}
+
+static void del_from_freelist(struct page *page, struct free_list *free_list)
+{
+	struct list_head *prev_page_lru, *lru, *p;
+	struct mem_region_list *region;
+	int region_id;
+
+	lru = &page->lru;
+	region_id = page_zone_region_id(page);
+	region = &free_list->mr_list[region_id];
+	region->nr_free--;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
+
+	/* Verify whether this page indeed belongs to this free list! */
+
+	list_for_each(p, &free_list->list) {
+		if (p == lru)
+			goto page_found;
+	}
+
+	WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__);
+
+page_found:
+#endif
+
+	/*
+	 * If we are not deleting the last pageblock in this region (i.e.,
+	 * farthest from list head, but not necessarily the last numerically),
+	 * then we need not update the region->page_block pointer.
+	 */
+	if (lru != region->page_block) {
+		list_del(lru);
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
+#endif
+		return;
+	}
+
+	prev_page_lru = lru->prev;
+	list_del(lru);
+
+	if (region->nr_free == 0) {
+		region->page_block = NULL;
+	} else {
+		region->page_block = prev_page_lru;
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(prev_page_lru == &free_list->list,
+			"%s: region->page_block points to list head\n",
+								__func__);
+#endif
+	}
+}
+
+/**
+ * Move a given page from one freelist to another.
+ */
+static void move_page_freelist(struct page *page, struct free_list *old_list,
+			       struct free_list *new_list)
+{
+	del_from_freelist(page, old_list);
+	add_to_freelist(page, new_list);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -550,6 +655,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_idx;
 	unsigned long uninitialized_var(buddy_idx);
 	struct page *buddy;
+	struct free_area *area;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 
@@ -579,8 +685,9 @@ static inline void __free_one_page(struct page *page,
 			__mod_zone_freepage_state(zone, 1 << order,
 						  migratetype);
 		} else {
-			list_del(&buddy->lru);
-			zone->free_area[order].nr_free--;
+			area = &zone->free_area[order];
+			del_from_freelist(buddy, &area->free_list[migratetype]);
+			area->nr_free--;
 			rmv_page_order(buddy);
 		}
 		combined_idx = buddy_idx & page_idx;
@@ -589,6 +696,7 @@ static inline void __free_one_page(struct page *page,
 		order++;
 	}
 	set_page_order(page, order);
+	area = &zone->free_area[order];
 
 	/*
 	 * If this is not the largest possible page, check if the buddy
@@ -605,16 +713,22 @@ static inline void __free_one_page(struct page *page,
 		buddy_idx = __find_buddy_index(combined_idx, order + 1);
 		higher_buddy = higher_page + (buddy_idx - combined_idx);
 		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype].list);
+
+			/*
+			 * Implementing an add_to_freelist_tail() won't be
+			 * very useful because both of them (almost) add to
+			 * the tail within the region. So we could potentially
+			 * switch off this entire "is next-higher buddy free?"
+			 * logic when memory regions are used.
+			 */
+			add_to_freelist(page, &area->free_list[migratetype]);
 			goto out;
 		}
 	}
 
-	list_add(&page->lru,
-		&zone->free_area[order].free_list[migratetype].list);
+	add_to_freelist(page, &area->free_list[migratetype]);
 out:
-	zone->free_area[order].nr_free++;
+	area->nr_free++;
 }
 
 static inline int free_pages_check(struct page *page)
@@ -833,7 +947,7 @@ static inline void expand(struct zone *zone, struct page *page,
 			continue;
 		}
 #endif
-		list_add(&page[size].lru, &area->free_list[migratetype].list);
+		add_to_freelist(&page[size], &area->free_list[migratetype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -900,7 +1014,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 
 		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
-		list_del(&page->lru);
+		del_from_freelist(page, &area->free_list[migratetype]);
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
@@ -941,7 +1055,8 @@ int move_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long order;
-	int pages_moved = 0;
+	struct free_area *area;
+	int pages_moved = 0, old_mt;
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -969,8 +1084,10 @@ int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype].list);
+		old_mt = get_freepage_migratetype(page);
+		area = &zone->free_area[order];
+		move_page_freelist(page, &area->free_list[old_mt],
+				    &area->free_list[migratetype]);
 		set_freepage_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
@@ -1064,7 +1181,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	struct free_area *area;
 	int current_order;
 	struct page *page;
-	int migratetype, new_type, i;
+	int migratetype, new_type, i, mt;
 
 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1; current_order >= order;
@@ -1089,7 +1206,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 							  migratetype);
 
 			/* Remove the page from the freelists */
-			list_del(&page->lru);
+			mt = get_freepage_migratetype(page);
+			del_from_freelist(page, &area->free_list[mt]);
 			rmv_page_order(page);
 
 			/*
@@ -1449,7 +1567,8 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
-	list_del(&page->lru);
+	mt = get_freepage_migratetype(page);
+	del_from_freelist(page, &zone->free_area[order].free_list[mt]);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
 
@@ -6442,6 +6561,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	int order, i;
 	unsigned long pfn;
 	unsigned long flags;
+	int mt;
+
 	/* find the first valid pfn */
 	for (pfn = start_pfn; pfn < end_pfn; pfn++)
 		if (pfn_valid(pfn))
@@ -6474,7 +6595,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		printk(KERN_INFO "remove from free list %lx %d %lx\n",
 		       pfn, 1 << order, end_pfn);
 #endif
-		list_del(&page->lru);
+		mt = get_freepage_migratetype(page);
+		del_from_freelist(page, &zone->free_area[order].free_list[mt]);
 		rmv_page_order(page);
 		zone->free_area[order].nr_free--;
 #ifdef CONFIG_HIGHMEM
* [RFC PATCH v4 07/40] mm: Track the freepage migratetype of pages accurately
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (5 preceding siblings ...)
  2013-09-25 23:14 ` [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in " Srivatsa S. Bhat
@ 2013-09-25 23:15 ` Srivatsa S. Bhat
  2013-09-25 23:15 ` [RFC PATCH v4 08/40] mm: Use the correct migratetype during buddy merging Srivatsa S. Bhat
                   ` (33 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:15 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Due to the region-wise ordering of the pages in the buddy allocator's
free lists, whenever we want to delete a free pageblock from a free list
(for example, when moving blocks of pages from one list to another), we need
to be able to tell the buddy allocator exactly which migratetype it belongs
to. For that purpose, we can use the page's freepage migratetype (which is
maintained in the page's ->index field).

So, while splitting up higher order pages into smaller ones as part of buddy
operations, keep the new head pages updated with the correct freepage
migratetype information (because we depend on tracking this info accurately,
as outlined above).
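As a rough userspace model of the split (not the kernel's expand() itself;
struct model_page, MT_UNSET and model_expand() are hypothetical names), each
upper half that is split off becomes the head of a new free block and must
carry its own migratetype tag:

```c
#include <assert.h>

#define MODEL_PAGES 16
#define MT_UNSET (-1)

struct model_page {
	int order;		/* order of the free block headed here, or -1 */
	int freepage_mt;	/* stand-in for the value kept in page->index */
};

/*
 * Split the block of order 'high' that starts at pages[0] until a block of
 * order 'low' remains. Every upper half that is split off goes back to the
 * freelist of 'migratetype', so its head page records that migratetype.
 */
void model_expand(struct model_page *pages, int low, int high, int migratetype)
{
	unsigned long size = 1UL << high;

	while (high > low) {
		high--;
		size >>= 1;
		pages[size].order = high;
		pages[size].freepage_mt = migratetype;	/* tag the new head */
	}
}
```

Splitting an order-4 block down to order 0 tags the new heads at indices 8, 4,
2 and 1; pages that are not block heads (index 3, for instance) are left
untouched.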
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |    7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d48eb04..e31daf4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -950,6 +950,13 @@ static inline void expand(struct zone *zone, struct page *page,
 		add_to_freelist(&page[size], &area->free_list[migratetype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
+
+		/*
+		 * Freepage migratetype is tracked using the index field of the
+		 * first page of the block. So we need to update the new first
+		 * page, when changing the page order.
+		 */
+		set_freepage_migratetype(&page[size], migratetype);
 	}
 }
 
* [RFC PATCH v4 08/40] mm: Use the correct migratetype during buddy merging
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (6 preceding siblings ...)
  2013-09-25 23:15 ` [RFC PATCH v4 07/40] mm: Track the freepage migratetype of pages accurately Srivatsa S. Bhat
@ 2013-09-25 23:15 ` Srivatsa S. Bhat
  2013-09-25 23:15 ` [RFC PATCH v4 09/40] mm: Add an optimized version of del_from_freelist to keep page allocation fast Srivatsa S. Bhat
                   ` (32 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:15 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
While merging buddy free pages of a given order to make a higher order page,
the buddy allocator might coalesce pages belonging to *two* *different*
migratetypes of that order!

So, don't assume that both the buddies come from the same freelist;
instead, explicitly find out the migratetype info of the buddy page and use
it while merging the buddies.

Also, set the freepage migratetype of the buddy to that of the page being
freed, since the merged block goes onto that migratetype's freelist.
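A small model of why the buddy's own migratetype matters, assuming per-list
counters in place of real freelists (struct mt_page, struct mt_counts and
merge_buddy() are hypothetical names used only for this sketch):

```c
#include <assert.h>

enum { MT_UNMOVABLE, MT_MOVABLE, MT_TYPES };

struct mt_page {
	int freepage_mt;	/* freelist this free page currently sits on */
};

struct mt_counts {
	long nr_free[MT_TYPES];	/* free blocks on each migratetype's list */
};

/*
 * Merge 'buddy' into a block that is being freed as 'migratetype'. The buddy
 * is taken off the list recorded in its own tag, not the list of the page
 * being freed, and is then retagged to the migratetype of the merged block.
 */
void merge_buddy(struct mt_counts *c, struct mt_page *buddy, int migratetype)
{
	int mt = buddy->freepage_mt;

	c->nr_free[mt]--;
	buddy->freepage_mt = migratetype;
}
```

If the buddy sits on the MT_MOVABLE list while the page is freed as
MT_UNMOVABLE, decrementing the MT_UNMOVABLE count instead would leave both
per-list counts wrong, which is the kind of mismatch the per-region accounting
cannot tolerate.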
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e31daf4..c40715c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -685,10 +685,14 @@ static inline void __free_one_page(struct page *page,
 			__mod_zone_freepage_state(zone, 1 << order,
 						  migratetype);
 		} else {
+			int mt;
+
 			area = &zone->free_area[order];
-			del_from_freelist(buddy, &area->free_list[migratetype]);
+			mt = get_freepage_migratetype(buddy);
+			del_from_freelist(buddy, &area->free_list[mt]);
 			area->nr_free--;
 			rmv_page_order(buddy);
+			set_freepage_migratetype(buddy, migratetype);
 		}
 		combined_idx = buddy_idx & page_idx;
 		page = page + (combined_idx - page_idx);
* [RFC PATCH v4 09/40] mm: Add an optimized version of del_from_freelist to keep page allocation fast
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (7 preceding siblings ...)
  2013-09-25 23:15 ` [RFC PATCH v4 08/40] mm: Use the correct migratetype during buddy merging Srivatsa S. Bhat
@ 2013-09-25 23:15 ` Srivatsa S. Bhat
  2013-09-25 23:15 ` [RFC PATCH v4 10/40] bitops: Document the difference in indexing between fls() and __fls() Srivatsa S. Bhat
                   ` (31 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:15 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
One of the main advantages of this design of memory regions is that page
allocations can potentially be extremely fast - with almost no extra
overhead from memory regions.

To exploit that, introduce an optimized version of del_from_freelist(), which
utilizes the fact that we always delete items from the head of the list
during page allocation.

Basically, we want to keep a note of the region from which we are allocating
in a given freelist, to avoid having to compute the page-to-zone-region mapping
for every page allocation. So introduce a 'next_region' pointer in every
freelist to achieve that, and use it to keep the fastpath of page allocation
almost as fast as it would have been without memory regions.
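The fastpath/slowpath split can be modelled in userspace as follows. This is a
simplified sketch: the names are hypothetical, and the slowpath finds the new
first region by scanning the per-region counters (valid because the list is
region-sorted) rather than by looking up the region of the first page as the
patch does.

```c
#include <assert.h>
#include <stddef.h>

#define FAST_NR_REGIONS 3

struct fast_region {
	unsigned long nr_free;
};

struct fast_list {
	struct fast_region regions[FAST_NR_REGIONS];
	struct fast_region *next_region;	/* region of the first block */
	int lookups;				/* slowpath invocations */
};

/*
 * Slowpath helper: because the list is kept region-sorted, the first block
 * belongs to the lowest-numbered non-empty region.
 */
void fast_set_next_region(struct fast_list *fl)
{
	fl->lookups++;
	fl->next_region = NULL;
	for (int i = 0; i < FAST_NR_REGIONS; i++) {
		if (fl->regions[i].nr_free) {
			fl->next_region = &fl->regions[i];
			return;
		}
	}
}

/* Remove the block at the head of the list; returns 0 if the list is empty. */
int fast_del_head(struct fast_list *fl)
{
	if (!fl->next_region)
		return 0;
	if (--fl->next_region->nr_free)
		return 1;		/* fastpath: region still has blocks */
	fast_set_next_region(fl);	/* slowpath: region just drained */
	return 1;
}
```

With 2, 0 and 3 free blocks in regions 0, 1 and 2, five consecutive removals
hit the slowpath only twice: once when region 0 drains and once when region 2
drains.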
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mm.h     |   14 +++++++++++
 include/linux/mmzone.h |    6 +++++
 mm/page_alloc.c        |   62 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 81 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 307f375..4286a75 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -740,6 +740,20 @@ static inline int page_zone_region_id(const struct page *page)
 	return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
 }
 
+static inline void set_next_region_in_freelist(struct free_list *free_list)
+{
+	struct page *page;
+	int region_id;
+
+	if (unlikely(list_empty(&free_list->list))) {
+		free_list->next_region = NULL;
+	} else {
+		page = list_entry(free_list->list.next, struct page, lru);
+		region_id = page_zone_region_id(page);
+		free_list->next_region = &free_list->mr_list[region_id];
+	}
+}
+
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2ac8025..4721a22 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -92,6 +92,12 @@ struct free_list {
 	struct list_head	list;
 
 	/*
+	 * Pointer to the region from which the next allocation will be
+	 * satisfied. (Same as the freelist's first pageblock's region.)
+	 */
+	struct mem_region_list	*next_region; /* for fast page allocation */
+
+	/*
 	 * Demarcates pageblocks belonging to different regions within
 	 * this freelist.
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c40715c..fe812e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -551,6 +551,15 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)
 	/* This is the first region, so add to the head of the list */
 	prev_region_list = &free_list->list;
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN((list_empty(&free_list->list) && free_list->next_region != NULL),
+					"%s: next_region not NULL\n", __func__);
+#endif
+	/*
+	 * Set 'next_region' to this region, since this is the first region now
+	 */
+	free_list->next_region = region;
+
 out:
 	list_add(lru, prev_region_list);
 
@@ -558,6 +567,47 @@ out:
 	region->page_block = lru;
 }
 
+/**
+ * __rmqueue_smallest() *always* deletes elements from the head of the
+ * list. Use this knowledge to keep page allocation fast, despite being
+ * region-aware.
+ *
+ * Do *NOT* call this function if you are deleting from somewhere deep
+ * inside the freelist.
+ */
+static void rmqueue_del_from_freelist(struct page *page,
+				      struct free_list *free_list)
+{
+	struct list_head *lru = &page->lru;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN((free_list->list.next != lru),
+				"%s: page not at head of list", __func__);
+#endif
+
+	list_del(lru);
+
+	/* Fastpath */
+	if (--(free_list->next_region->nr_free)) {
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(free_list->next_region->nr_free < 0,
+				"%s: nr_free is negative\n", __func__);
+#endif
+		return;
+	}
+
+	/*
+	 * Slowpath, when this is the last pageblock of this region
+	 * in this freelist.
+	 */
+	free_list->next_region->page_block = NULL;
+
+	/* Set 'next_region' to the new first region in the freelist. */
+	set_next_region_in_freelist(free_list);
+}
+
+/* Generic delete function for region-aware buddy allocator. */
 static void del_from_freelist(struct page *page, struct free_list *free_list)
 {
 	struct list_head *prev_page_lru, *lru, *p;
@@ -565,6 +615,11 @@ static void del_from_freelist(struct page *page, struct free_list *free_list)
 	int region_id;
 
 	lru = &page->lru;
+
+	/* Try to fastpath, if deleting from the head of the list */
+	if (lru == free_list->list.next)
+		return rmqueue_del_from_freelist(page, free_list);
+
 	region_id = page_zone_region_id(page);
 	region = &free_list->mr_list[region_id];
 	region->nr_free--;
@@ -600,6 +655,11 @@ page_found:
 	prev_page_lru = lru->prev;
 	list_del(lru);
 
+	/*
+	 * Since we are not deleting from the head of the freelist, the
+	 * 'next_region' pointer doesn't have to change.
+	 */
+
 	if (region->nr_free == 0) {
 		region->page_block = NULL;
 	} else {
@@ -1025,7 +1085,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 
 		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
-		del_from_freelist(page, &area->free_list[migratetype]);
+		rmqueue_del_from_freelist(page, &area->free_list[migratetype]);
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
* [RFC PATCH v4 10/40] bitops: Document the difference in indexing between fls() and __fls()
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (8 preceding siblings ...)
  2013-09-25 23:15 ` [RFC PATCH v4 09/40] mm: Add an optimized version of del_from_freelist to keep page allocation fast Srivatsa S. Bhat
@ 2013-09-25 23:15 ` Srivatsa S. Bhat
  2013-09-25 23:16 ` [RFC PATCH v4 11/40] mm: A new optimized O(log n) sorting algo to speed up buddy-sorting Srivatsa S. Bhat
                   ` (30 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:15 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
fls() indexes the bits starting with 1 (and returns 0 when no bit is set),
whereas __fls() uses a zero-based indexing scheme (0 to BITS_PER_LONG - 1).

Add comments to document this important difference.
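The off-by-one relationship can be checked with userspace stand-ins. These are
not the kernel implementations: one_based_fls() and zero_based_fls() are
hypothetical names, and the sketch assumes a GCC- or Clang-compatible compiler
for the __builtin_clz() and __builtin_clzl() builtins.

```c
#include <assert.h>
#include <limits.h>

/*
 * Zero-based index of the most significant set bit, like __fls().
 * The result is undefined for word == 0, so callers must check first.
 */
unsigned long zero_based_fls(unsigned long word)
{
	int bits = (int)(sizeof(word) * CHAR_BIT);

	return (unsigned long)(bits - 1 - __builtin_clzl(word));
}

/*
 * One-based index of the most significant set bit, like fls().
 * Returns 0 when no bit is set.
 */
int one_based_fls(unsigned int x)
{
	return x ? (int)(sizeof(x) * CHAR_BIT) - __builtin_clz(x) : 0;
}
```

For any non-zero value that fits in an unsigned int, zero_based_fls(x) equals
one_based_fls(x) - 1, which mirrors the note added to the comments in the
patch.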
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 arch/x86/include/asm/bitops.h      |    4 ++++
 include/asm-generic/bitops/__fls.h |    5 +++++
 2 files changed, 9 insertions(+)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 41639ce..9186e4a 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -388,6 +388,10 @@ static inline unsigned long ffz(unsigned long word)
  * @word: The word to search
  *
  * Undefined if no set bit exists, so code should check against 0 first.
+ *
+ * Note: __fls(x) is equivalent to fls(x) - 1. That is, __fls() uses
+ * a zero-based indexing scheme (0 to BITS_PER_LONG - 1), where
+ * __fls(1) = 0, __fls(2) = 1, and so on.
  */
 static inline unsigned long __fls(unsigned long word)
 {
diff --git a/include/asm-generic/bitops/__fls.h b/include/asm-generic/bitops/__fls.h
index a60a7cc..ae908a5 100644
--- a/include/asm-generic/bitops/__fls.h
+++ b/include/asm-generic/bitops/__fls.h
@@ -8,6 +8,11 @@
  * @word: the word to search
  *
  * Undefined if no set bit exists, so code should check against 0 first.
+ *
+ * Note: __fls(x) is equivalent to fls(x) - 1. That is, __fls() uses
+ * a zero-based indexing scheme (0 to BITS_PER_LONG - 1), where
+ * __fls(1) = 0, __fls(2) = 1, and so on.
+ *
  */
 static __always_inline unsigned long __fls(unsigned long word)
 {
* [RFC PATCH v4 11/40] mm: A new optimized O(log n) sorting algo to speed up buddy-sorting
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (9 preceding siblings ...)
  2013-09-25 23:15 ` [RFC PATCH v4 10/40] bitops: Document the difference in indexing between fls() and __fls() Srivatsa S. Bhat
@ 2013-09-25 23:16 ` Srivatsa S. Bhat
  2013-09-25 23:16 ` [RFC PATCH v4 12/40] mm: Add support to accurately track per-memory-region allocation Srivatsa S. Bhat
                   ` (29 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:16 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
The sorted-buddy design for memory power management depends on
keeping the buddy freelists region-sorted. And this sorting operation
has been pushed to the free() logic, keeping the alloc() path fast.
However, we would like to also keep the free() path as fast as possible,
since it holds the zone->lock, which will indirectly affect alloc() also.
So replace the existing O(n) sorting logic used in the free-path, with
a new special-case sorting algorithm of time complexity O(log n), in order
to optimize the free() path further. This algorithm uses a bitmap-based
radix tree to help speed up the sorting.
One of the other main advantages of this O(log n) design is that it can
support large amounts of RAM (up to 2 TB and beyond) quite effortlessly.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
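As a rough illustration of the two-level lookup described above, here is
a minimal userspace sketch. It is not the patch code: struct region_index,
NR_REGIONS, top_bit() and the region_*() helpers are hypothetical names,
the bit operations are non-atomic, and top_bit() is a plain loop standing
in for __fls(). It assumes NR_REGIONS does not exceed
BITS_PER_LONG * BITS_PER_LONG, so that a single root word suffices.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG	((int)(sizeof(unsigned long) * CHAR_BIT))
#define NR_REGIONS	256	/* assumed value, for illustration only */
#define NR_LEAF_WORDS	((NR_REGIONS + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct region_index {
	unsigned long root;			/* bit w set <=> leaf[w] != 0 */
	unsigned long leaf[NR_LEAF_WORDS];	/* bit r set <=> region r present */
};

/* Zero-based index of the highest set bit; word must be non-zero. */
static int top_bit(unsigned long word)
{
	int idx = 0;

	assert(word != 0);
	while (word >>= 1)
		idx++;
	return idx;
}

void region_set(struct region_index *ri, int id)
{
	ri->leaf[id / BITS_PER_LONG] |= 1UL << (id % BITS_PER_LONG);
	ri->root |= 1UL << (id / BITS_PER_LONG);
}

void region_clear(struct region_index *ri, int id)
{
	int w = id / BITS_PER_LONG;

	ri->leaf[w] &= ~(1UL << (id % BITS_PER_LONG));
	if (!ri->leaf[w])
		ri->root &= ~(1UL << w);	/* leaf word is now empty */
}

/* Highest populated region strictly below id, or -1 if there is none. */
int region_find_prev(const struct region_index *ri, int id)
{
	int w = id / BITS_PER_LONG;
	unsigned long below;

	/* First look at the bits below id within its own leaf word. */
	below = ri->leaf[w] & ((1UL << (id % BITS_PER_LONG)) - 1);
	if (below)
		return w * BITS_PER_LONG + top_bit(below);

	/* Otherwise use the root to find the nearest lower non-empty leaf. */
	below = ri->root & ((1UL << w) - 1);
	if (!below)
		return -1;

	w = top_bit(below);
	return w * BITS_PER_LONG + top_bit(ri->leaf[w]);
}
```

Each lookup inspects at most one root word and two leaf words, whatever
the number of populated regions, which is what keeps the free() path
short when the find-last-set step is a single instruction.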
 include/linux/mmzone.h |    2 +
 mm/page_alloc.c        |  142 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 137 insertions(+), 7 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4721a22..472c76a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -102,6 +102,8 @@ struct free_list {
 	 * this freelist.
 	 */
 	struct mem_region_list	mr_list[MAX_NR_ZONE_REGIONS];
+	DECLARE_BITMAP(region_root_mask, BITS_PER_LONG);
+	DECLARE_BITMAP(region_leaf_mask, MAX_NR_ZONE_REGIONS);
 };
 
 struct free_area {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe812e0..daac5fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -517,11 +517,129 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+/**
+ *
+ * An example should help illustrate the bitmap representation of memory
+ * regions easily. So consider the following scenario:
+ *
+ * MAX_NR_ZONE_REGIONS = 256
+ * DECLARE_BITMAP(region_leaf_mask, MAX_NR_ZONE_REGIONS);
+ * DECLARE_BITMAP(region_root_mask, BITS_PER_LONG);
+ *
+ * Here region_leaf_mask is an array of unsigned longs. And region_root_mask
+ * is a single unsigned long. The tree notion is constructed like this:
+ * Each bit in the region_root_mask will correspond to an array element of
+ * region_leaf_mask, as shown below. (The elements of the region_leaf_mask
+ * array are shown as being discontiguous, only to help illustrate the
+ * concept easily).
+ *
+ *                    Region Root Mask
+ *                   ___________________
+ *                  |____|____|____|____|
+ *                    /    |     \     \
+ *                   /     |      \     \
+ *             ________    |   ________  \
+ *            |________|   |  |________|  \
+ *                         |               \
+ *                      ________        ________
+ *                     |________|      |________|   <--- Region Leaf Mask
+ *                                                         array elements
+ *
+ * If an array element in the leaf mask is non-zero, the corresponding bit
+ * for that array element will be set in the root mask. Every bit in the
+ * region_leaf_mask will correspond to a memory region; it is set if that
+ * region is present in that free list, cleared otherwise.
+ *
+ * This arrangement helps us find the previous set bit in region_leaf_mask
+ * using at most 2 bitmask-searches (each bitmask of size BITS_PER_LONG),
+ * one at the root-level, and one at the leaf level. Thus, this design of
+ * an optimized access structure reduces the search-complexity when dealing
+ * with large amounts of memory. The worst-case time-complexity of buddy
+ * sorting comes to O(log n) using this algorithm, where 'n' is the no. of
+ * memory regions in the zone.
+ *
+ * For example, with MEM_REGION_SIZE = 512 MB, on 64-bit machines, we can
+ * deal with up to 2TB of RAM (MAX_NR_ZONE_REGIONS = 4096) efficiently (just
+ * 12 ops in the worst case, as opposed to 4096 ops in an O(n) algorithm)
+ * with such an arrangement, without even needing to extend this 2-level
+ * hierarchy any further.
+ */
+
+static void set_region_bit(int region_id, struct free_list *free_list)
+{
+	set_bit(region_id, free_list->region_leaf_mask);
+	set_bit(BIT_WORD(region_id), free_list->region_root_mask);
+}
+
+static void clear_region_bit(int region_id, struct free_list *free_list)
+{
+	clear_bit(region_id, free_list->region_leaf_mask);
+
+	if (!(free_list->region_leaf_mask[BIT_WORD(region_id)]))
+		clear_bit(BIT_WORD(region_id), free_list->region_root_mask);
+
+}
+
+/* Note that Region 0 corresponds to bit 0 (mask 0x1), and so on */
+static int find_prev_region(int region_id, struct free_list *free_list)
+{
+	int leaf_word, prev_region_id;
+	unsigned long *region_root_mask, *region_leaf_mask;
+	unsigned long tmp_root_mask, tmp_leaf_mask;
+
+	if (!region_id)
+		return -1; /* No previous region */
+
+	leaf_word = BIT_WORD(region_id);
+
+	region_root_mask = free_list->region_root_mask;
+	region_leaf_mask = free_list->region_leaf_mask;
+
+	/* Note that region_id itself has NOT been set in the bitmasks yet. */
+
+	/* Try to get the prev region id without going to the root mask. */
+	if (region_leaf_mask[leaf_word]) {
+		tmp_leaf_mask = region_leaf_mask[leaf_word] &
+							(BIT_MASK(region_id) - 1);
+
+		if (tmp_leaf_mask) {
+			/* Prev region is in this leaf mask itself. Find it. */
+			prev_region_id = leaf_word * BITS_PER_LONG +
+							__fls(tmp_leaf_mask);
+			goto out;
+		}
+	}
+
+	/* Search the root mask for the leaf mask having prev region */
+	tmp_root_mask = *region_root_mask & (BIT(leaf_word) - 1);
+	if (tmp_root_mask) {
+		leaf_word = __fls(tmp_root_mask);
+
+		/* Get the prev region id from the leaf mask */
+		prev_region_id = leaf_word * BITS_PER_LONG +
+					__fls(region_leaf_mask[leaf_word]);
+	} else {
+		/*
+		 * This itself is the first populated region in this
+		 * freelist, so previous region doesn't exist.
+		 */
+		prev_region_id = -1;
+	}
+
+out:
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(prev_region_id >= region_id, "%s: bitmap logic messed up\n",
+								__func__);
+#endif
+	return prev_region_id;
+}
+
 static void add_to_freelist(struct page *page, struct free_list *free_list)
 {
 	struct list_head *prev_region_list, *lru;
 	struct mem_region_list *region;
-	int region_id, i;
+	int region_id, prev_region_id;
 
 	lru = &page->lru;
 	region_id = page_zone_region_id(page);
@@ -539,12 +657,17 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)
 #endif
 
 	if (!list_empty(&free_list->list)) {
-		for (i = region_id - 1; i >= 0; i--) {
-			if (free_list->mr_list[i].page_block) {
-				prev_region_list =
-					free_list->mr_list[i].page_block;
-				goto out;
-			}
+		prev_region_id = find_prev_region(region_id, free_list);
+		if (prev_region_id >= 0) {
+			prev_region_list =
+				free_list->mr_list[prev_region_id].page_block;
+#ifdef CONFIG_DEBUG_PAGEALLOC
+			WARN(prev_region_list == NULL,
+				"%s: prev_region_list is NULL\n"
+				"region_id=%d, prev_region_id=%d\n", __func__,
+				 region_id, prev_region_id);
+#endif
+			goto out;
 		}
 	}
 
@@ -565,6 +688,7 @@ out:
 
 	/* Save pointer to page block of this region */
 	region->page_block = lru;
+	set_region_bit(region_id, free_list);
 }
 
 /**
@@ -579,6 +703,7 @@ static void rmqueue_del_from_freelist(struct page *page,
 				      struct free_list *free_list)
 {
 	struct list_head *lru = &page->lru;
+	int region_id;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	WARN((free_list->list.next != lru),
@@ -602,6 +727,8 @@ static void rmqueue_del_from_freelist(struct page *page,
 	 * in this freelist.
 	 */
 	free_list->next_region->page_block = NULL;
+	region_id = free_list->next_region - free_list->mr_list;
+	clear_region_bit(region_id, free_list);
 
 	/* Set 'next_region' to the new first region in the freelist. */
 	set_next_region_in_freelist(free_list);
@@ -662,6 +789,7 @@ page_found:
 
 	if (region->nr_free == 0) {
 		region->page_block = NULL;
+		clear_region_bit(region_id, free_list);
 	} else {
 		region->page_block = prev_page_lru;
 #ifdef CONFIG_DEBUG_PAGEALLOC
* [RFC PATCH v4 12/40] mm: Add support to accurately track per-memory-region allocation
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (10 preceding siblings ...)
  2013-09-25 23:16 ` [RFC PATCH v4 11/40] mm: A new optimized O(log n) sorting algo to speed up buddy-sorting Srivatsa S. Bhat
@ 2013-09-25 23:16 ` Srivatsa S. Bhat
  2013-09-25 23:16 ` [RFC PATCH v4 13/40] mm: Print memory region statistics to understand the buddy allocator behavior Srivatsa S. Bhat
                   ` (28 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:16 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
The page allocator can make smarter decisions to influence memory power
management, if we track the per-region memory allocations closely.
So add the necessary support to accurately track allocations on a per-region
basis.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
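The accounting added here keeps two different counters, which is easy to
mix up: the per-freelist nr_free counts free blocks of one order, whereas
the per-zone-region nr_free counts free pages. The following userspace
sketch shows the relation between them; the struct and function names
are hypothetical simplifications and not the kernel definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-zone-region totals. */
struct region_totals {
	unsigned long nr_free;		/* free pages in the region */
};

/* Hypothetical stand-in for one region's slot in one freelist. */
struct region_slot {
	unsigned long nr_free;		/* free blocks of this order */
	struct region_totals *totals;	/* back-pointer to region totals */
};

/* A block of 2^order pages joins the freelist. */
void account_add(struct region_slot *slot, int order)
{
	slot->nr_free++;
	slot->totals->nr_free += 1UL << order;
}

/* A block of 2^order pages leaves the freelist. */
void account_del(struct region_slot *slot, int order)
{
	assert(slot->nr_free > 0);
	slot->nr_free--;
	slot->totals->nr_free -= 1UL << order;
}
```

Because every add and delete passes the block order, the region total
stays consistent even when a block moves between freelists of the same
order, as in move_page_freelist().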
 include/linux/mmzone.h |    2 +
 mm/page_alloc.c        |   65 +++++++++++++++++++++++++++++++++++-------------
 2 files changed, 50 insertions(+), 17 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 472c76a..155c1a1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -86,6 +86,7 @@ static inline int get_pageblock_migratetype(struct page *page)
 struct mem_region_list {
 	struct list_head	*page_block;
 	unsigned long		nr_free;
+	struct zone_mem_region	*zone_region;
 };
 
 struct free_list {
@@ -342,6 +343,7 @@ struct zone_mem_region {
 	unsigned long end_pfn;
 	unsigned long present_pages;
 	unsigned long spanned_pages;
+	unsigned long nr_free;
 };
 
 struct zone {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index daac5fd..fbaa2dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -635,7 +635,8 @@ out:
 	return prev_region_id;
 }
 
-static void add_to_freelist(struct page *page, struct free_list *free_list)
+static void add_to_freelist(struct page *page, struct free_list *free_list,
+			    int order)
 {
 	struct list_head *prev_region_list, *lru;
 	struct mem_region_list *region;
@@ -646,6 +647,7 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)
 
 	region = &free_list->mr_list[region_id];
 	region->nr_free++;
+	region->zone_region->nr_free += 1 << order;
 
 	if (region->page_block) {
 		list_add_tail(lru, region->page_block);
@@ -700,9 +702,10 @@ out:
  * inside the freelist.
  */
 static void rmqueue_del_from_freelist(struct page *page,
-				      struct free_list *free_list)
+				      struct free_list *free_list, int order)
 {
 	struct list_head *lru = &page->lru;
+	struct mem_region_list *mr_list;
 	int region_id;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
@@ -712,8 +715,11 @@ static void rmqueue_del_from_freelist(struct page *page,
 
 	list_del(lru);
 
+	mr_list = free_list->next_region;
+	mr_list->zone_region->nr_free -= 1 << order;
+
 	/* Fastpath */
-	if (--(free_list->next_region->nr_free)) {
+	if (--(mr_list->nr_free)) {
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 		WARN(free_list->next_region->nr_free < 0,
@@ -735,7 +741,8 @@ static void rmqueue_del_from_freelist(struct page *page,
 }
 
 /* Generic delete function for region-aware buddy allocator. */
-static void del_from_freelist(struct page *page, struct free_list *free_list)
+static void del_from_freelist(struct page *page, struct free_list *free_list,
+			      int order)
 {
 	struct list_head *prev_page_lru, *lru, *p;
 	struct mem_region_list *region;
@@ -745,11 +752,12 @@ static void del_from_freelist(struct page *page, struct free_list *free_list)
 
 	/* Try to fastpath, if deleting from the head of the list */
 	if (lru == free_list->list.next)
-		return rmqueue_del_from_freelist(page, free_list);
+		return rmqueue_del_from_freelist(page, free_list, order);
 
 	region_id = page_zone_region_id(page);
 	region = &free_list->mr_list[region_id];
 	region->nr_free--;
+	region->zone_region->nr_free -= 1 << order;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
@@ -804,10 +812,10 @@ page_found:
  * Move a given page from one freelist to another.
  */
 static void move_page_freelist(struct page *page, struct free_list *old_list,
-			       struct free_list *new_list)
+			       struct free_list *new_list, int order)
 {
-	del_from_freelist(page, old_list);
-	add_to_freelist(page, new_list);
+	del_from_freelist(page, old_list, order);
+	add_to_freelist(page, new_list, order);
 }
 
 /*
@@ -877,7 +885,7 @@ static inline void __free_one_page(struct page *page,
 
 			area = &zone->free_area[order];
 			mt = get_freepage_migratetype(buddy);
-			del_from_freelist(buddy, &area->free_list[mt]);
+			del_from_freelist(buddy, &area->free_list[mt], order);
 			area->nr_free--;
 			rmv_page_order(buddy);
 			set_freepage_migratetype(buddy, migratetype);
@@ -913,12 +921,13 @@ static inline void __free_one_page(struct page *page,
 			 * switch off this entire "is next-higher buddy free?"
 			 * logic when memory regions are used.
 			 */
-			add_to_freelist(page, &area->free_list[migratetype]);
+			add_to_freelist(page, &area->free_list[migratetype],
+					order);
 			goto out;
 		}
 	}
 
-	add_to_freelist(page, &area->free_list[migratetype]);
+	add_to_freelist(page, &area->free_list[migratetype], order);
 out:
 	area->nr_free++;
 }
@@ -1139,7 +1148,8 @@ static inline void expand(struct zone *zone, struct page *page,
 			continue;
 		}
 #endif
-		add_to_freelist(&page[size], &area->free_list[migratetype]);
+		add_to_freelist(&page[size], &area->free_list[migratetype],
+				high);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 
@@ -1213,7 +1223,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 
 		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
-		rmqueue_del_from_freelist(page, &area->free_list[migratetype]);
+		rmqueue_del_from_freelist(page, &area->free_list[migratetype],
+					  current_order);
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
@@ -1286,7 +1297,7 @@ int move_freepages(struct zone *zone,
 		old_mt = get_freepage_migratetype(page);
 		area = &zone->free_area[order];
 		move_page_freelist(page, &area->free_list[old_mt],
-				    &area->free_list[migratetype]);
+				    &area->free_list[migratetype], order);
 		set_freepage_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
@@ -1406,7 +1417,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 
 			/* Remove the page from the freelists */
 			mt = get_freepage_migratetype(page);
-			del_from_freelist(page, &area->free_list[mt]);
+			del_from_freelist(page, &area->free_list[mt],
+					  current_order);
 			rmv_page_order(page);
 
 			/*
@@ -1767,7 +1779,7 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 
 	/* Remove page from free list */
 	mt = get_freepage_migratetype(page);
-	del_from_freelist(page, &zone->free_area[order].free_list[mt]);
+	del_from_freelist(page, &zone->free_area[order].free_list[mt], order);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
 
@@ -5204,6 +5216,22 @@ static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
 	pgdat->nr_node_regions = idx;
 }
 
+static void __meminit zone_init_free_lists_late(struct zone *zone)
+{
+	struct mem_region_list *mr_list;
+	int order, t, i;
+
+	for_each_migratetype_order(order, t) {
+		for (i = 0; i < zone->nr_zone_regions; i++) {
+			mr_list =
+				&zone->free_area[order].free_list[t].mr_list[i];
+
+			mr_list->nr_free = 0;
+			mr_list->zone_region = &zone->zone_regions[i];
+		}
+	}
+}
+
 /*
  * Zone-region indices are used to map node-memory-regions to
  * zone-memory-regions. Initialize all of them to an invalid value (-1),
@@ -5272,6 +5300,8 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 
 		z->nr_zone_regions = idx;
 
+		zone_init_free_lists_late(z);
+
 		/*
 		 * Revisit the last visited node memory region, in case it
 		 * spans multiple zones.
@@ -6795,7 +6825,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		       pfn, 1 << order, end_pfn);
 #endif
 		mt = get_freepage_migratetype(page);
-		del_from_freelist(page, &zone->free_area[order].free_list[mt]);
+		del_from_freelist(page, &zone->free_area[order].free_list[mt],
+				  order);
 		rmv_page_order(page);
 		zone->free_area[order].nr_free--;
 #ifdef CONFIG_HIGHMEM
* [RFC PATCH v4 13/40] mm: Print memory region statistics to understand the buddy allocator behavior
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (11 preceding siblings ...)
  2013-09-25 23:16 ` [RFC PATCH v4 12/40] mm: Add support to accurately track per-memory-region allocation Srivatsa S. Bhat
@ 2013-09-25 23:16 ` Srivatsa S. Bhat
  2013-09-25 23:17 ` [RFC PATCH v4 14/40] mm: Enable per-memory-region fragmentation stats in pagetypeinfo Srivatsa S. Bhat
                   ` (27 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:16 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
In order to observe the behavior of the region-aware buddy allocator, modify
vmstat.c to also print memory-region-related statistics. In particular, enable
memory-region-related info in /proc/zoneinfo and /proc/buddyinfo, since they
help us observe, at least roughly, how the new buddy allocator is behaving.
For now, the region statistics correspond to the zone memory regions and not
the (absolute) node memory regions, and some of the statistics (especially the
no. of present pages) might not be very accurate. But since we account for
and print the free page statistics for every zone memory region accurately, we
should be able to observe the new page allocator behavior to a reasonable
degree of accuracy.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/vmstat.c |   34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c967043..8e8c8bd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -866,11 +866,28 @@ const char * const vmstat_text[] = {
 static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 						struct zone *zone)
 {
-	int order;
+	int i, order, t;
+	struct free_area *area;
 
-	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-	for (order = 0; order < MAX_ORDER; ++order)
-		seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+	seq_printf(m, "Node %d, zone %8s \n", pgdat->node_id, zone->name);
+
+	for (i = 0; i < zone->nr_zone_regions; i++) {
+
+		seq_printf(m, "\t\t Region %6d ", i);
+
+		for (order = 0; order < MAX_ORDER; ++order) {
+			unsigned long nr_free = 0;
+
+			area = &zone->free_area[order];
+
+			for (t = 0; t < MIGRATE_TYPES; t++) {
+				nr_free +=
+					area->free_list[t].mr_list[i].nr_free;
+			}
+			seq_printf(m, "%6lu ", nr_free);
+		}
+		seq_putc(m, '\n');
+	}
 	seq_putc(m, '\n');
 }
 
@@ -1057,6 +1074,15 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   zone->present_pages,
 		   zone->managed_pages);
 
+	seq_printf(m, "\n\nPer-region page stats\t present\t free\n\n");
+	for (i = 0; i < zone->nr_zone_regions; i++) {
+		struct zone_mem_region *region;
+
+		region = &zone->zone_regions[i];
+		seq_printf(m, "\tRegion %6d \t %6lu \t %6lu\n", i,
+				region->present_pages, region->nr_free);
+	}
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
 				zone_page_state(zone, i));
* [RFC PATCH v4 14/40] mm: Enable per-memory-region fragmentation stats in pagetypeinfo
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (12 preceding siblings ...)
  2013-09-25 23:16 ` [RFC PATCH v4 13/40] mm: Print memory region statistics to understand the buddy allocator behavior Srivatsa S. Bhat
@ 2013-09-25 23:17 ` Srivatsa S. Bhat
  2013-09-25 23:17 ` [RFC PATCH v4 15/40] mm: Add aggressive bias to prefer lower regions during page allocation Srivatsa S. Bhat
                   ` (26 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:17 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Pagetypeinfo is invaluable in observing the fragmentation of memory
into different migratetypes. Modify this code to also print out the
fragmentation statistics at a per-zone-memory-region granularity
(along with the existing per-zone reporting).
This helps us observe the effects of influencing memory allocation
decisions at the page-allocator level and understand the extent to
which they help in consolidation.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/vmstat.c |   86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 2 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8e8c8bd..bb44d30 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -926,6 +926,35 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 	}
 }
 
+static void pagetypeinfo_showfree_region_print(struct seq_file *m,
+					       pg_data_t *pgdat,
+					       struct zone *zone)
+{
+	int order, mtype, i;
+
+	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
+
+		for (i = 0; i < zone->nr_zone_regions; i++) {
+			seq_printf(m, "Node %4d, zone %8s, R%3d %12s ",
+						pgdat->node_id,
+						zone->name,
+						i,
+						migratetype_names[mtype]);
+
+			for (order = 0; order < MAX_ORDER; ++order) {
+				struct free_area *area;
+
+				area = &(zone->free_area[order]);
+
+				seq_printf(m, "%6lu ",
+				   area->free_list[mtype].mr_list[i].nr_free);
+			}
+			seq_putc(m, '\n');
+		}
+
+	}
+}
+
 /* Print out the free pages at each order for each migatetype */
 static int pagetypeinfo_showfree(struct seq_file *m, void *arg)
 {
@@ -940,6 +969,11 @@ static int pagetypeinfo_showfree(struct seq_file *m, void *arg)
 
 	walk_zones_in_node(m, pgdat, pagetypeinfo_showfree_print);
 
+	seq_putc(m, '\n');
+
+	/* Print the free pages at each migratetype, per memory region */
+	walk_zones_in_node(m, pgdat, pagetypeinfo_showfree_region_print);
+
 	return 0;
 }
 
@@ -971,24 +1005,72 @@ static void pagetypeinfo_showblockcount_print(struct seq_file *m,
 	}
 
 	/* Print counts */
-	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+	seq_printf(m, "Node %d, zone %8s      ", pgdat->node_id, zone->name);
 	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
 		seq_printf(m, "%12lu ", count[mtype]);
 	seq_putc(m, '\n');
 }
 
+static void pagetypeinfo_showblockcount_region_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	int mtype, i;
+	unsigned long pfn;
+	unsigned long start_pfn, end_pfn;
+	unsigned long count[MIGRATE_TYPES] = { 0, };
+
+	for (i = 0; i < zone->nr_zone_regions; i++) {
+		start_pfn = zone->zone_regions[i].start_pfn;
+		end_pfn = zone->zone_regions[i].end_pfn;
+
+		for (pfn = start_pfn; pfn < end_pfn;
+						pfn += pageblock_nr_pages) {
+			struct page *page;
+
+			if (!pfn_valid(pfn))
+				continue;
+
+			page = pfn_to_page(pfn);
+
+			/* Watch for unexpected holes punched in the memmap */
+			if (!memmap_valid_within(pfn, page, zone))
+				continue;
+
+			mtype = get_pageblock_migratetype(page);
+
+			if (mtype < MIGRATE_TYPES)
+				count[mtype]++;
+		}
+
+		/* Print counts */
+		seq_printf(m, "Node %d, zone %8s R%3d ", pgdat->node_id,
+			   zone->name, i);
+		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+			seq_printf(m, "%12lu ", count[mtype]);
+		seq_putc(m, '\n');
+
+		/* Reset the counters */
+		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+			count[mtype] = 0;
+	}
+}
+
 /* Print out the free pages at each order for each migratetype */
 static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg)
 {
 	int mtype;
 	pg_data_t *pgdat = (pg_data_t *)arg;
 
-	seq_printf(m, "\n%-23s", "Number of blocks type ");
+	seq_printf(m, "\n%-23s", "Number of blocks type      ");
 	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
 		seq_printf(m, "%12s ", migratetype_names[mtype]);
 	seq_putc(m, '\n');
 	walk_zones_in_node(m, pgdat, pagetypeinfo_showblockcount_print);
 
+	/* Print out the pageblock info per memory region */
+	seq_putc(m, '\n');
+	walk_zones_in_node(m, pgdat, pagetypeinfo_showblockcount_region_print);
+
 	return 0;
 }
 
* [RFC PATCH v4 15/40] mm: Add aggressive bias to prefer lower regions during page allocation
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (13 preceding siblings ...)
  2013-09-25 23:17 ` [RFC PATCH v4 14/40] mm: Enable per-memory-region fragmentation stats in pagetypeinfo Srivatsa S. Bhat
@ 2013-09-25 23:17 ` Srivatsa S. Bhat
  2013-09-25 23:17 ` [RFC PATCH v4 16/40] mm: Introduce a "Region Allocator" to manage entire memory regions Srivatsa S. Bhat
                   ` (25 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:17 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
While allocating pages from buddy freelists, there could be situations
in which we have a ready freepage of the required order in a *higher*
numbered memory region, and there also exists a freepage of a higher
page order in a *lower* numbered memory region.
To make the consolidation logic more aggressive, try to split up the
higher order buddy page of a lower numbered region and allocate it,
rather than allocating pages from a higher numbered region.
This ensures that we spill over to a new region only when we truly
don't have enough contiguous memory in any lower numbered region to
satisfy that allocation request.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
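The selection policy described above can be sketched in isolation as
follows. This is a condensed, single-loop userspace variant rather than
the patch code: pick_alloc_order() is a hypothetical helper, MAX_ORDER is
assumed to be 11, and first_region[o] is assumed to hold the lowest
region id present in the order-o freelist (what next_region points to),
or -1 if that freelist is empty.

```c
#include <assert.h>

#define MAX_ORDER 11	/* assumed value, for illustration only */

/*
 * Return the order to allocate from for a request of the given order,
 * or -1 if no freelist at that order or above has a free block.
 * The smallest sufficient order is the default choice; a higher order
 * replaces it only if its first block lies in a strictly lower region.
 */
int pick_alloc_order(const int first_region[MAX_ORDER], int order)
{
	int cur, alloc_order = -1, alloc_region = -1;

	assert(order >= 0 && order < MAX_ORDER);

	for (cur = order; cur < MAX_ORDER; cur++) {
		if (first_region[cur] < 0)
			continue;	/* freelist of this order is empty */

		if (alloc_order < 0 || first_region[cur] < alloc_region) {
			alloc_order = cur;
			alloc_region = first_region[cur];
		}
	}
	return alloc_order;
}
```

The strict comparison means that ties go to the smaller order, so a
larger block is split only when doing so actually moves the allocation
into a lower-numbered region.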
 mm/page_alloc.c |   44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fbaa2dc..dc02a80 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1211,8 +1211,9 @@ static inline
 struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 						int migratetype)
 {
-	unsigned int current_order;
-	struct free_area *area;
+	unsigned int current_order, alloc_order;
+	struct free_area *area, *other_area;
+	int alloc_region, other_region;
 	struct page *page;
 
 	/* Find a page of the appropriate size in the preferred list */
@@ -1221,17 +1222,40 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		if (list_empty(&area->free_list[migratetype].list))
 			continue;
 
-		page = list_entry(area->free_list[migratetype].list.next,
-							struct page, lru);
-		rmqueue_del_from_freelist(page, &area->free_list[migratetype],
-					  current_order);
-		rmv_page_order(page);
-		area->nr_free--;
-		expand(zone, page, order, current_order, area, migratetype);
-		return page;
+		alloc_order = current_order;
+		alloc_region = area->free_list[migratetype].next_region -
+				area->free_list[migratetype].mr_list;
+		current_order++;
+		goto try_others;
 	}
 
 	return NULL;
+
+try_others:
+	/* Try to aggressively prefer lower numbered regions for allocations */
+	for ( ; current_order < MAX_ORDER; ++current_order) {
+		other_area = &(zone->free_area[current_order]);
+		if (list_empty(&other_area->free_list[migratetype].list))
+			continue;
+
+		other_region = other_area->free_list[migratetype].next_region -
+				other_area->free_list[migratetype].mr_list;
+
+		if (other_region < alloc_region) {
+			alloc_region = other_region;
+			alloc_order = current_order;
+		}
+	}
+
+	area = &(zone->free_area[alloc_order]);
+	page = list_entry(area->free_list[migratetype].list.next, struct page,
+			  lru);
+	rmqueue_del_from_freelist(page, &area->free_list[migratetype],
+				  alloc_order);
+	rmv_page_order(page);
+	area->nr_free--;
+	expand(zone, page, order, alloc_order, area, migratetype);
+	return page;
 }
 
 
* [RFC PATCH v4 16/40] mm: Introduce a "Region Allocator" to manage entire memory regions
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (14 preceding siblings ...)
  2013-09-25 23:17 ` [RFC PATCH v4 15/40] mm: Add aggressive bias to prefer lower regions during page allocation Srivatsa S. Bhat
@ 2013-09-25 23:17 ` Srivatsa S. Bhat
  2013-10-23 10:10   ` Johannes Weiner
  2013-09-25 23:17 ` [RFC PATCH v4 17/40] mm: Add a mechanism to add pages to buddy freelists in bulk Srivatsa S. Bhat
                   ` (24 subsequent siblings)
  40 siblings, 1 reply; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:17 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Today, the MM subsystem uses the buddy 'Page Allocator' to manage memory
at a 'page' granularity. But this allocator has no notion of the physical
topology of the underlying memory hardware, and hence it is hard to
make memory allocation decisions that take the platform constraints into
account.
So we need to augment the page-allocator with a new entity to manage
memory (at a much larger granularity) keeping the underlying platform
characteristics and the memory hardware topology in mind.
To that end, introduce a "Memory Region Allocator" as a backend to the
existing "Page Allocator".
Splitting the memory allocator into a Page-Allocator front-end and a
Region-Allocator backend:
                 Page Allocator          |      Memory Region Allocator
                                         -
           __    __    __                |    ________    ________
          |__|--|__|--|__|-- ...         -   |        |  |        |
           ____    ____    ____          |   |        |  |        |
          |____|--|____|--|____|-- ...   -   |        |--|        |-- ...
                                         |   |        |  |        |
                                         -   |________|  |________|
                                         |
                                         -
             Manages pages using         |     Manages memory regions
              buddy freelists            -  (allocates and frees entire
                                         |   memory regions, i.e., at a
                                         -   memory-region granularity)
The flow of memory allocations/frees between entities requesting memory
(applications/kernel) and the MM subsystem:
                  pages               regions
  Applications <========>   Page    <========>  Memory Region
   and Kernel             Allocator               Allocator
Because the region allocator functions as a backend to the page allocator,
we implement it on a per-zone basis (the page allocator is also per-zone).
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |   17 +++++++++++++++++
 mm/page_alloc.c        |   19 +++++++++++++++++++
 2 files changed, 36 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 155c1a1..7c87518 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,21 @@ struct free_area {
 	unsigned long		nr_free;
 };
 
+/* A simplified free_area for managing entire memory regions */
+struct free_area_region {
+	struct list_head	list;
+	unsigned long		nr_free;
+};
+
+struct mem_region {
+	struct free_area_region	region_area[MAX_ORDER];
+};
+
+struct region_allocator {
+	struct mem_region	region[MAX_NR_ZONE_REGIONS];
+	int			next_region;
+};
+
 struct pglist_data;
 
 /*
@@ -405,6 +420,8 @@ struct zone {
 	struct zone_mem_region	zone_regions[MAX_NR_ZONE_REGIONS];
 	int 			nr_zone_regions;
 
+	struct region_allocator	region_allocator;
+
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dc02a80..876c231 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5240,6 +5240,23 @@ static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
 	pgdat->nr_node_regions = idx;
 }
 
+static void __meminit init_zone_region_allocator(struct zone *zone)
+{
+	struct free_area_region *area;
+	int i, j;
+
+	for (i = 0; i < zone->nr_zone_regions; i++) {
+		area = zone->region_allocator.region[i].region_area;
+
+		for (j = 0; j < MAX_ORDER; j++) {
+			INIT_LIST_HEAD(&area[j].list);
+			area[j].nr_free = 0;
+		}
+	}
+
+	zone->region_allocator.next_region = -1;
+}
+
 static void __meminit zone_init_free_lists_late(struct zone *zone)
 {
 	struct mem_region_list *mr_list;
@@ -5326,6 +5343,8 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 
 		zone_init_free_lists_late(z);
 
+		init_zone_region_allocator(z);
+
 		/*
 		 * Revisit the last visited node memory region, in case it
 		 * spans multiple zones.
* [RFC PATCH v4 17/40] mm: Add a mechanism to add pages to buddy freelists in bulk
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (15 preceding siblings ...)
  2013-09-25 23:17 ` [RFC PATCH v4 16/40] mm: Introduce a "Region Allocator" to manage entire memory regions Srivatsa S. Bhat
@ 2013-09-25 23:17 ` Srivatsa S. Bhat
  2013-09-25 23:18 ` [RFC PATCH v4 18/40] mm: Provide a mechanism to delete pages from " Srivatsa S. Bhat
                   ` (23 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:17 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
When the buddy page allocator requests memory from the region allocator,
it gets all the freepages belonging to an entire region at once. So, in
order to make this efficient, we need a way to add all those pages to the
buddy freelists in one shot. Add this support, and also take care to
update the nr-free statistics properly.
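As a rough userspace illustration of the "one shot" idea (not the kernel
code in this patch), the sketch below splices a whole source list in front
of a given position with a constant number of pointer updates, and walks the
list only to count the entries for the statistics. All names here
(list_node, splice_before_counted, splice_demo) are hypothetical. Unlike the
list_splice_tail() call used in the patch, this sketch also reinitializes
the source list, similar to what the kernel's list_splice_tail_init() does.

```c
#include <assert.h>

/* Minimal circular doubly-linked list, modelled on the kernel's list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static int list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

/* Insert 'node' just before 'pos' (at the tail if 'pos' is the list head). */
static void list_insert_before(struct list_node *node, struct list_node *pos)
{
	node->prev = pos->prev;
	node->next = pos;
	pos->prev->next = node;
	pos->prev = node;
}

/*
 * Move every node of 'src' to just before 'pos' and leave 'src' empty.
 * The pointer surgery is O(1); only the counting walk is O(n).
 * Returns the number of nodes moved.
 */
static unsigned long splice_before_counted(struct list_node *src,
					   struct list_node *pos)
{
	struct list_node *first, *last, *cur;
	unsigned long n = 0;

	if (list_is_empty(src))
		return 0;

	for (cur = src->next; cur != src; cur = cur->next)
		n++;

	first = src->next;
	last = src->prev;

	first->prev = pos->prev;
	pos->prev->next = first;
	last->next = pos;
	pos->prev = last;

	list_init(src);
	return n;
}

/* Self-check: splice three nodes in front of an existing node. */
static unsigned long splice_demo(void)
{
	struct list_node dst, src, anchor, n[3];
	unsigned long moved;
	int i;

	list_init(&dst);
	list_init(&src);
	list_insert_before(&anchor, &dst);
	for (i = 0; i < 3; i++)
		list_insert_before(&n[i], &src);

	moved = splice_before_counted(&src, &anchor);

	assert(list_is_empty(&src));
	assert(dst.next == &n[0] && n[2].next == &anchor);
	return moved;
}
```

Calling splice_demo() from a small main() exercises the splice; the returned
count corresponds to the value that would be added to the nr_free counters.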
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 876c231..c3a2cda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -693,6 +693,52 @@ out:
 	set_region_bit(region_id, free_list);
 }
 
+/*
+ * Add all the freepages contained in 'list' to the buddy freelist
+ * 'free_list'. Using suitable list-manipulation tricks, we move the
+ * pages between the lists in one shot.
+ */
+static void add_to_freelist_bulk(struct list_head *list,
+				 struct free_list *free_list, int order,
+				 int region_id)
+{
+	struct list_head *cur, *position;
+	struct mem_region_list *region;
+	unsigned long nr_pages = 0;
+	struct free_area *area;
+	struct page *page;
+
+	if (list_empty(list))
+		return;
+
+	page = list_first_entry(list, struct page, lru);
+	list_del(&page->lru);
+
+	/*
+	 * Add one page using add_to_freelist() so that it sets up the
+	 * region related data-structures of the freelist properly.
+	 */
+	add_to_freelist(page, free_list, order);
+
+	/* Now add the rest of the pages in bulk */
+	list_for_each(cur, list)
+		nr_pages++;
+
+	position = free_list->mr_list[region_id].page_block;
+	list_splice_tail(list, position);
+
+
+	/* Update the statistics */
+	region = &free_list->mr_list[region_id];
+	region->nr_free += nr_pages;
+
+	area = &(page_zone(page)->free_area[order]);
+	area->nr_free += nr_pages + 1;
+
+	/* Fix up the zone region stats, since add_to_freelist() altered it */
+	region->zone_region->nr_free -= 1 << order;
+}
+
 /**
  * __rmqueue_smallest() *always* deletes elements from the head of the
  * list. Use this knowledge to keep page allocation fast, despite being
* [RFC PATCH v4 18/40] mm: Provide a mechanism to delete pages from buddy freelists in bulk
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (16 preceding siblings ...)
  2013-09-25 23:17 ` [RFC PATCH v4 17/40] mm: Add a mechanism to add pages to buddy freelists in bulk Srivatsa S. Bhat
@ 2013-09-25 23:18 ` Srivatsa S. Bhat
  2013-09-25 23:18 ` [RFC PATCH v4 19/40] mm: Provide a mechanism to release free memory to the region allocator Srivatsa S. Bhat
                   ` (22 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:18 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
When the buddy allocator releases excess free memory to the region
allocator, it does so at a region granularity - that is, it releases
all the freepages of that region to the region allocator, at once.
So, in order to make this efficient, we need a way to delete all those
pages from the buddy freelists in one shot. Add this support, and also
take care to update the nr-free statistics properly.
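The bulk delete can be pictured as cutting a contiguous run of entries out
of a sorted list. The following is a simplified userspace sketch under that
assumption, not the patch's implementation: it ignores the special handling
of the page that region->page_block points to, and the names (list_node,
cut_run, cut_demo) are made up for illustration.

```c
#include <assert.h>

struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static void list_append(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Detach the run of nodes (after, last] from its list and make it the
 * sole contents of 'out'. 'after' may be the list head, in which case the
 * run starts at the first element. Returns the number of nodes moved.
 */
static unsigned long cut_run(struct list_node *out, struct list_node *after,
			     struct list_node *last)
{
	struct list_node *first = after->next, *cur;
	unsigned long n = 0;

	list_init(out);
	if (last == after)
		return 0;	/* empty run */

	/* Unhook the run from the source list. */
	after->next = last->next;
	last->next->prev = after;

	/* Hook it onto 'out'. */
	out->next = first;
	first->prev = out;
	out->prev = last;
	last->next = out;

	/* Count what was moved, for the nr_free style statistics. */
	for (cur = out->next; cur != out; cur = cur->next)
		n++;
	return n;
}

/* Five pages on one freelist, grouped by region as [p0 p1] [p2 p3 p4]. */
static unsigned long cut_demo(void)
{
	struct list_node freelist, out, p[5];
	unsigned long moved;
	int i;

	list_init(&freelist);
	for (i = 0; i < 5; i++)
		list_append(&p[i], &freelist);

	/* Cut the second region: everything after p1, up to and including p4. */
	moved = cut_run(&out, &p[1], &p[4]);

	/* The first region must still be intact on the freelist. */
	assert(freelist.next == &p[0] && p[1].next == &freelist);
	return moved;
}
```

Because the freelist is kept sorted by region, the pages of one region form
a single run, which is what makes a cut like this possible without visiting
pages of other regions.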
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3a2cda..d96746e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -854,6 +854,61 @@ page_found:
 	}
 }
 
+/*
+ * Delete all freepages belonging to the region 'region_id' from 'free_list'
+ * and move them to 'list'. Using suitable list-manipulation tricks, we move
+ * the pages between the lists in one shot.
+ */
+static void del_from_freelist_bulk(struct list_head *list,
+				   struct free_list *free_list, int order,
+				   int region_id)
+{
+	struct mem_region_list *region, *prev_region;
+	unsigned long nr_pages = 0;
+	struct free_area *area;
+	struct list_head *cur;
+	struct page *page;
+	int prev_region_id;
+
+	region = &free_list->mr_list[region_id];
+
+	/*
+	 * Perform bulk movement of all pages of the region to the new list,
+	 * except the page pointed to by region->page_block.
+	 */
+	prev_region_id = find_prev_region(region_id, free_list);
+	if (prev_region_id < 0) {
+		/* This is the first region on the list */
+		list_cut_position(list, &free_list->list,
+				  region->page_block->prev);
+	} else {
+		prev_region = &free_list->mr_list[prev_region_id];
+		list_cut_position(list, prev_region->page_block,
+				  region->page_block->prev);
+	}
+
+	list_for_each(cur, list)
+		nr_pages++;
+
+	region->nr_free -= nr_pages;
+
+	/*
+	 * Now delete the page pointed to by region->page_block using
+	 * del_from_freelist(), so that it sets up the region related
+	 * data-structures of the freelist properly.
+	 */
+	page = list_entry(region->page_block, struct page, lru);
+	del_from_freelist(page, free_list, order);
+
+	list_add_tail(&page->lru, list);
+
+	area = &(page_zone(page)->free_area[order]);
+	area->nr_free -= nr_pages + 1;
+
+	/* Fix up the zone region stats, since del_from_freelist() altered it */
+	region->zone_region->nr_free += 1 << order;
+}
+
 /**
  * Move a given page from one freelist to another.
  */
* [RFC PATCH v4 19/40] mm: Provide a mechanism to release free memory to the region allocator
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (17 preceding siblings ...)
  2013-09-25 23:18 ` [RFC PATCH v4 18/40] mm: Provide a mechanism to delete pages from " Srivatsa S. Bhat
@ 2013-09-25 23:18 ` Srivatsa S. Bhat
  2013-09-25 23:18 ` [RFC PATCH v4 20/40] mm: Provide a mechanism to request free memory from " Srivatsa S. Bhat
                   ` (21 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:18 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Implement helper functions to release freepages from the buddy freelists to
the region allocator.
For simplicity, all operations related to the region allocator are performed
at the granularity of entire memory regions. That is, when we release freepages
to the region allocator, we free all the pages belonging to that region.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d96746e..c727bba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -919,6 +919,26 @@ static void move_page_freelist(struct page *page, struct free_list *old_list,
 	add_to_freelist(page, new_list, order);
 }
 
+/* Add pages from the given buddy freelist to the region allocator */
+static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
+				    int region_id)
+{
+	struct region_allocator *reg_alloc;
+	struct list_head *ralloc_list;
+	int order;
+
+	if (WARN_ON(list_empty(&free_list->list)))
+		return;
+
+	order = page_order(list_first_entry(&free_list->list,
+					    struct page, lru));
+
+	reg_alloc = &z->region_allocator;
+	ralloc_list = &reg_alloc->region[region_id].region_area[order].list;
+
+	del_from_freelist_bulk(ralloc_list, free_list, order, region_id);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
* [RFC PATCH v4 20/40] mm: Provide a mechanism to request free memory from the region allocator
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (18 preceding siblings ...)
  2013-09-25 23:18 ` [RFC PATCH v4 19/40] mm: Provide a mechanism to release free memory to the region allocator Srivatsa S. Bhat
@ 2013-09-25 23:18 ` Srivatsa S. Bhat
  2013-09-25 23:18 ` [RFC PATCH v4 21/40] mm: Maintain the counter for freepages in " Srivatsa S. Bhat
                   ` (20 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:18 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Implement helper functions to request freepages from the region allocator
in order to add them to the buddy freelists.
For simplicity, all operations related to the region allocator are performed
at the granularity of entire memory regions. That is, when the buddy
allocator requests freepages from the region allocator, the latter picks a
free region and always allocates all the freepages belonging to that entire
region.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c727bba..d71d671 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -939,6 +939,29 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 	del_from_freelist_bulk(ralloc_list, free_list, order, region_id);
 }
 
+/* Delete freepages from the region allocator and add them to buddy freelists */
+static int del_from_region_allocator(struct zone *zone, unsigned int order,
+				     int migratetype)
+{
+	struct region_allocator *reg_alloc;
+	struct list_head *ralloc_list;
+	struct free_list *free_list;
+	int next_region;
+
+	reg_alloc = &zone->region_allocator;
+
+	next_region = reg_alloc->next_region;
+	if (next_region < 0)
+		return -ENOMEM;
+
+	ralloc_list = &reg_alloc->region[next_region].region_area[order].list;
+	free_list = &zone->free_area[order].free_list[migratetype];
+
+	add_to_freelist_bulk(ralloc_list, free_list, order, next_region);
+
+	return 0;
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
* [RFC PATCH v4 21/40] mm: Maintain the counter for freepages in the region allocator
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (19 preceding siblings ...)
  2013-09-25 23:18 ` [RFC PATCH v4 20/40] mm: Provide a mechanism to request free memory from " Srivatsa S. Bhat
@ 2013-09-25 23:18 ` Srivatsa S. Bhat
  2013-09-25 23:18 ` [RFC PATCH v4 22/40] mm: Propagate the sorted-buddy bias for picking free regions, to " Srivatsa S. Bhat
                   ` (19 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:18 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
We have a field named 'nr_free' for every memory-region in the region
allocator. Keep it updated with the count of freepages in that region.
We already run a loop while moving freepages in bulk between the buddy
allocator and the region allocator. Reuse that to update the freepages
count as well.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   45 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 11 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d71d671..ee6c098 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,10 +697,12 @@ out:
  * Add all the freepages contained in 'list' to the buddy freelist
  * 'free_list'. Using suitable list-manipulation tricks, we move the
  * pages between the lists in one shot.
+ *
+ * Returns the number of pages moved.
  */
-static void add_to_freelist_bulk(struct list_head *list,
-				 struct free_list *free_list, int order,
-				 int region_id)
+static unsigned long
+add_to_freelist_bulk(struct list_head *list, struct free_list *free_list,
+		     int order, int region_id)
 {
 	struct list_head *cur, *position;
 	struct mem_region_list *region;
@@ -709,7 +711,7 @@ static void add_to_freelist_bulk(struct list_head *list,
 	struct page *page;
 
 	if (list_empty(list))
-		return;
+		return 0;
 
 	page = list_first_entry(list, struct page, lru);
 	list_del(&page->lru);
@@ -737,6 +739,8 @@ static void add_to_freelist_bulk(struct list_head *list,
 
 	/* Fix up the zone region stats, since add_to_freelist() altered it */
 	region->zone_region->nr_free -= 1 << order;
+
+	return nr_pages + 1;
 }
 
 /**
@@ -858,10 +862,12 @@ page_found:
  * Delete all freepages belonging to the region 'region_id' from 'free_list'
  * and move them to 'list'. Using suitable list-manipulation tricks, we move
  * the pages between the lists in one shot.
+ *
+ * Returns the number of pages moved.
  */
-static void del_from_freelist_bulk(struct list_head *list,
-				   struct free_list *free_list, int order,
-				   int region_id)
+static unsigned long
+del_from_freelist_bulk(struct list_head *list, struct free_list *free_list,
+		       int order, int region_id)
 {
 	struct mem_region_list *region, *prev_region;
 	unsigned long nr_pages = 0;
@@ -907,6 +913,8 @@ static void del_from_freelist_bulk(struct list_head *list,
 
 	/* Fix up the zone region stats, since del_from_freelist() altered it */
 	region->zone_region->nr_free += 1 << order;
+
+	return nr_pages + 1;
 }
 
 /**
@@ -924,7 +932,9 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 				    int region_id)
 {
 	struct region_allocator *reg_alloc;
+	struct free_area_region *reg_area;
 	struct list_head *ralloc_list;
+	unsigned long nr_pages;
 	int order;
 
 	if (WARN_ON(list_empty(&free_list->list)))
@@ -934,9 +944,14 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 					    struct page, lru));
 
 	reg_alloc = &z->region_allocator;
-	ralloc_list = &reg_alloc->region[region_id].region_area[order].list;
+	reg_area = &reg_alloc->region[region_id].region_area[order];
+	ralloc_list = &reg_area->list;
+
+	nr_pages = del_from_freelist_bulk(ralloc_list, free_list, order,
+					  region_id);
 
-	del_from_freelist_bulk(ralloc_list, free_list, order, region_id);
+	WARN_ON(reg_area->nr_free != 0);
+	reg_area->nr_free += nr_pages;
 }
 
 /* Delete freepages from the region allocator and add them to buddy freelists */
@@ -944,8 +959,10 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 				     int migratetype)
 {
 	struct region_allocator *reg_alloc;
+	struct free_area_region *reg_area;
 	struct list_head *ralloc_list;
 	struct free_list *free_list;
+	unsigned long nr_pages;
 	int next_region;
 
 	reg_alloc = &zone->region_allocator;
@@ -954,10 +971,16 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 	if (next_region < 0)
 		return -ENOMEM;
 
-	ralloc_list = &reg_alloc->region[next_region].region_area[order].list;
+	reg_area = &reg_alloc->region[next_region].region_area[order];
+	ralloc_list = &reg_area->list;
+
 	free_list = &zone->free_area[order].free_list[migratetype];
 
-	add_to_freelist_bulk(ralloc_list, free_list, order, next_region);
+	nr_pages = add_to_freelist_bulk(ralloc_list, free_list, order,
+					next_region);
+
+	reg_area->nr_free -= nr_pages;
+	WARN_ON(reg_area->nr_free != 0);
 
 	return 0;
 }
* [RFC PATCH v4 22/40] mm: Propagate the sorted-buddy bias for picking free regions, to region allocator
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (20 preceding siblings ...)
  2013-09-25 23:18 ` [RFC PATCH v4 21/40] mm: Maintain the counter for freepages in " Srivatsa S. Bhat
@ 2013-09-25 23:18 ` Srivatsa S. Bhat
  2013-09-25 23:19 ` [RFC PATCH v4 23/40] mm: Fix vmstat to also account for freepages in the " Srivatsa S. Bhat
                   ` (18 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:18 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
The sorted-buddy page allocator keeps the buddy freelists sorted region-wise,
and tries to pick lower numbered regions while allocating pages. The idea is
to allocate regions in the increasing order of region number.
Propagate the same bias to the region allocator as well. That is, make it
favor lower numbered regions while allocating regions to the page allocator.
To do this efficiently, add a bitmap to represent the regions in the region
allocator, and use bitmap operations to manage these regions and to pick the
lowest numbered free region.
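A minimal userspace sketch of this bookkeeping, assuming a fixed number of
regions, is shown below. It is only a model: NR_REGIONS stands in for
MAX_NR_ZONE_REGIONS, and region_mask, region_put(), region_get() and
lowest_set_bit() are hypothetical names; the kernel code uses set_bit(),
clear_bit() and find_first_bit() instead of the open-coded loops here.

```c
#include <assert.h>
#include <limits.h>

#define NR_REGIONS	256	/* hypothetical stand-in for MAX_NR_ZONE_REGIONS */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS	((NR_REGIONS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct region_mask {
	unsigned long bits[NR_WORDS];	/* bit set => region is free */
	int next_region;		/* lowest free region, or -1 if none */
};

static void mask_init(struct region_mask *m)
{
	unsigned int i;

	for (i = 0; i < NR_WORDS; i++)
		m->bits[i] = 0;
	m->next_region = -1;
}

/* Return the index of the lowest set bit, or -1 if the mask is empty. */
static int lowest_set_bit(const struct region_mask *m)
{
	unsigned int w, b;

	for (w = 0; w < NR_WORDS; w++) {
		if (!m->bits[w])
			continue;
		for (b = 0; b < BITS_PER_WORD; b++)
			if (m->bits[w] & (1UL << b))
				return (int)(w * BITS_PER_WORD + b);
	}
	return -1;
}

/* Hand a free region to the allocator; remember it if it is the lowest. */
static void region_put(struct region_mask *m, int id)
{
	m->bits[id / BITS_PER_WORD] |= 1UL << (id % BITS_PER_WORD);
	if (m->next_region < 0 || id < m->next_region)
		m->next_region = id;
}

/* Take the lowest numbered free region, or return -1 if there is none. */
static int region_get(struct region_mask *m)
{
	int id = m->next_region;

	if (id < 0)
		return -1;

	m->bits[id / BITS_PER_WORD] &= ~(1UL << (id % BITS_PER_WORD));
	m->next_region = lowest_set_bit(m);
	return id;
}

/* Regions come back in increasing order regardless of insertion order. */
static int region_demo(void)
{
	struct region_mask m;

	mask_init(&m);
	region_put(&m, 7);
	region_put(&m, 3);
	region_put(&m, 200);

	assert(region_get(&m) == 3);
	assert(region_get(&m) == 7);
	assert(region_get(&m) == 200);
	return region_get(&m);	/* nothing left */
}
```

Caching the lowest free region in next_region keeps the common "take a
region" path cheap; the bitmap scan is only needed after a region has been
handed out.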
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |    1 +
 mm/page_alloc.c        |   19 ++++++++++++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7c87518..49c8926 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -125,6 +125,7 @@ struct mem_region {
 struct region_allocator {
 	struct mem_region	region[MAX_NR_ZONE_REGIONS];
 	int			next_region;
+	DECLARE_BITMAP(ralloc_mask, MAX_NR_ZONE_REGIONS);
 };
 
 struct pglist_data;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee6c098..d5acea7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -935,7 +935,7 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 	struct free_area_region *reg_area;
 	struct list_head *ralloc_list;
 	unsigned long nr_pages;
-	int order;
+	int order, *next_region;
 
 	if (WARN_ON(list_empty(&free_list->list)))
 		return;
@@ -952,6 +952,13 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 
 	WARN_ON(reg_area->nr_free != 0);
 	reg_area->nr_free += nr_pages;
+
+	set_bit(region_id, reg_alloc->ralloc_mask);
+	next_region = &reg_alloc->next_region;
+
+	if ((*next_region < 0) ||
+			(*next_region > 0 && region_id < *next_region))
+		*next_region = region_id;
 }
 
 /* Delete freepages from the region allocator and add them to buddy freelists */
@@ -982,6 +989,16 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 	reg_area->nr_free -= nr_pages;
 	WARN_ON(reg_area->nr_free != 0);
 
+	/* Pick a new next_region */
+	clear_bit(next_region, reg_alloc->ralloc_mask);
+	next_region = find_first_bit(reg_alloc->ralloc_mask,
+				     MAX_NR_ZONE_REGIONS);
+
+	if (next_region >= MAX_NR_ZONE_REGIONS)
+		next_region = -1; /* No free regions available */
+
+	reg_alloc->next_region = next_region;
+
 	return 0;
 }
 
* [RFC PATCH v4 23/40] mm: Fix vmstat to also account for freepages in the region allocator
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (21 preceding siblings ...)
  2013-09-25 23:18 ` [RFC PATCH v4 22/40] mm: Propagate the sorted-buddy bias for picking free regions, to " Srivatsa S. Bhat
@ 2013-09-25 23:19 ` Srivatsa S. Bhat
  2013-09-25 23:19 ` [RFC PATCH v4 24/40] mm: Drop some very expensive sorted-buddy related checks under DEBUG_PAGEALLOC Srivatsa S. Bhat
                   ` (17 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:19 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Currently vmstat considers only the freepages present in the buddy freelists
of the page allocator. But with the newly introduced region allocator in
place, freepages could be present in the region allocator as well. So teach
vmstat to take them into consideration when reporting free memory.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/vmstat.c |    8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bb44d30..4dc103e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -868,6 +868,8 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 {
 	int i, order, t;
 	struct free_area *area;
+	struct free_area_region *reg_area;
+	struct region_allocator *reg_alloc;
 
 	seq_printf(m, "Node %d, zone %8s \n", pgdat->node_id, zone->name);
 
@@ -884,6 +886,12 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 				nr_free +=
 					area->free_list[t].mr_list[i].nr_free;
 			}
+
+			/* Add up freepages in the region allocator as well */
+			reg_alloc = &zone->region_allocator;
+			reg_area = &reg_alloc->region[i].region_area[order];
+			nr_free += reg_area->nr_free;
+
 			seq_printf(m, "%6lu ", nr_free);
 		}
 		seq_putc(m, '\n');
* [RFC PATCH v4 24/40] mm: Drop some very expensive sorted-buddy related checks under DEBUG_PAGEALLOC
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (22 preceding siblings ...)
  2013-09-25 23:19 ` [RFC PATCH v4 23/40] mm: Fix vmstat to also account for freepages in the " Srivatsa S. Bhat
@ 2013-09-25 23:19 ` Srivatsa S. Bhat
  2013-09-25 23:19 ` [RFC PATCH v4 25/40] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA => RA flow Srivatsa S. Bhat
                   ` (16 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:19 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Under CONFIG_DEBUG_PAGEALLOC, we have numerous checks and balances to verify
the correctness of various sorted-buddy operations. But some of them are very
expensive and hence can't be enabled while benchmarking the code.
(They should be used only to verify that the code is working correctly, as a
precursor to benchmarking the performance).
The check that a page passed to del_from_freelist() indeed belongs to that
freelist is one such very expensive check. Disable it by compiling it out
with #if 0.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |    2 ++
 1 file changed, 2 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d5acea7..178f210 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -812,6 +812,7 @@ static void del_from_freelist(struct page *page, struct free_list *free_list,
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
 
+#if 0
 	/* Verify whether this page indeed belongs to this free list! */
 
 	list_for_each(p, &free_list->list) {
@@ -820,6 +821,7 @@ static void del_from_freelist(struct page *page, struct free_list *free_list,
 	}
 
 	WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__);
+#endif
 
 page_found:
 #endif
* [RFC PATCH v4 25/40] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA => RA flow
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (23 preceding siblings ...)
  2013-09-25 23:19 ` [RFC PATCH v4 24/40] mm: Drop some very expensive sorted-buddy related checks under DEBUG_PAGEALLOC Srivatsa S. Bhat
@ 2013-09-25 23:19 ` Srivatsa S. Bhat
  2013-09-25 23:19 ` [RFC PATCH v4 26/40] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA <= " Srivatsa S. Bhat
                   ` (15 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:19 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Now that we have built up an infrastructure that forms a "Memory Region
Allocator", connect it with the page allocator. To entities requesting
memory, the page allocator will function as a front-end, whereas the
region allocator will act as a back-end to the page allocator.
(Analogy: page allocator is like free cash, whereas region allocator
is like a bank).
Implement the flow of freepages from the page allocator to the region
allocator. When the buddy freelists notice that they have all the freepages
forming a memory region, they give it back to the region allocator.
Simplification: We assume that the freepages of a memory region can be
completely represented by a set of MAX_ORDER-1 pages. That is, we only
need to consider the buddy freelists corresponding to MAX_ORDER-1, while
interacting with the region allocator. Furthermore, we assume that
pageblock_order == MAX_ORDER-1.
(These assumptions are used to ease the implementation, so that one can
quickly evaluate the benefits of the overall design without getting
bogged down by too many corner cases and constraints. Of course future
implementations will handle more scenarios and will have reduced dependence
on such simplifying assumptions.)
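Under these simplifying assumptions, the decision to hand a region back
reduces to three comparisons. The userspace sketch below models that check;
the struct and function names, and the value chosen for SKETCH_MAX_ORDER,
are assumptions made for illustration and are not taken from the patch.

```c
#include <assert.h>

#define SKETCH_MAX_ORDER 11	/* assumed value, for illustration only */

struct region_stats {
	unsigned long present_pages;	/* pages that belong to the region */
	unsigned long nr_free_pages;	/* free pages of the region, all orders */
	unsigned long nr_free_top;	/* free blocks on the top-order freelist */
};

/*
 * A region may be returned only if every page in it is free and all of
 * that free memory sits on the freelist of order SKETCH_MAX_ORDER - 1.
 */
static int region_can_return(const struct region_stats *r, int order)
{
	if (r->nr_free_pages != r->present_pages)
		return 0;	/* some pages are still allocated */

	if (order != SKETCH_MAX_ORDER - 1)
		return 0;	/* only top-order blocks are considered */

	/* Do the top-order blocks account for every free page? */
	return r->nr_free_top * (1UL << order) == r->nr_free_pages;
}

static int return_demo(void)
{
	/* A region made of four top-order blocks (4 * 1024 pages). */
	struct region_stats r = { 4096, 4096, 4 };
	int top = SKETCH_MAX_ORDER - 1;
	int ok;

	/* Whole region free as four top-order blocks: eligible. */
	ok = region_can_return(&r, top);

	/* All pages free, but one block is still split into lower orders. */
	r.nr_free_top = 3;
	assert(!region_can_return(&r, top));

	/* Some pages are still in use. */
	r.nr_free_pages = 3072;
	assert(!region_can_return(&r, top));

	return ok;
}
```

The first comparison is the one that fails almost every time, which is why
putting it first keeps the common page-free path cheap.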
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 178f210..d08bc91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -635,6 +635,37 @@ out:
 	return prev_region_id;
 }
 
+
+static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
+				    int region_id);
+
+
+static inline int can_return_region(struct mem_region_list *region, int order)
+{
+	struct zone_mem_region *zone_region;
+
+	zone_region = region->zone_region;
+
+	if (likely(zone_region->nr_free != zone_region->present_pages))
+		return 0;
+
+	/*
+	 * Don't release freepages to the region allocator if some other
+	 * buddy pages can potentially merge with our freepages to form
+	 * higher order pages.
+	 *
+	 * Hack: Don't return the region unless all the freepages are of
+	 * order MAX_ORDER-1.
+	 */
+	if (likely(order != MAX_ORDER-1))
+		return 0;
+
+	if (region->nr_free * (1 << order) != zone_region->nr_free)
+		return 0;
+
+	return 1;
+}
+
 static void add_to_freelist(struct page *page, struct free_list *free_list,
 			    int order)
 {
@@ -651,7 +682,7 @@ static void add_to_freelist(struct page *page, struct free_list *free_list,
 
 	if (region->page_block) {
 		list_add_tail(lru, region->page_block);
-		return;
+		goto try_return_region;
 	}
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
@@ -691,6 +722,15 @@ out:
 	/* Save pointer to page block of this region */
 	region->page_block = lru;
 	set_region_bit(region_id, free_list);
+
+try_return_region:
+
+	/*
+	 * Try to return the freepages of a memory region to the region
+	 * allocator, if possible.
+	 */
+	if (can_return_region(region, order))
+		add_to_region_allocator(page_zone(page), free_list, region_id);
 }
 
 /*
^ permalink raw reply related	[flat|nested] 72+ messages in thread
* [RFC PATCH v4 26/40] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA <= RA flow
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (24 preceding siblings ...)
  2013-09-25 23:19 ` [RFC PATCH v4 25/40] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA => RA flow Srivatsa S. Bhat
@ 2013-09-25 23:19 ` Srivatsa S. Bhat
  2013-09-25 23:19 ` [RFC PATCH v4 27/40] mm: Update the freepage migratetype of pages during region allocation Srivatsa S. Bhat
                   ` (14 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:19 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Now that we have built up an infrastructure that forms a "Memory Region
Allocator", connect it with the page allocator. To entities requesting
memory, the page allocator will function as a front-end, whereas the
region allocator will act as a back-end to the page allocator.
(Analogy: page allocator is like free cash, whereas region allocator
is like a bank).
Implement the flow of freepages from the region allocator to the page
allocator. When __rmqueue_smallest() comes up empty-handed, try to get
freepages from the region allocator. Only if that fails, fall back
to an allocation from a different migratetype. This helps significantly
in avoiding mixing of allocations of different migratetypes in a single
region. Thus it helps in keeping entire memory regions homogeneous with
respect to the type of allocations.
Simplification: We assume that the freepages of a memory region can be
completely represented by a set of MAX_ORDER-1 pages. That is, we only
need to consider the buddy freelists corresponding to MAX_ORDER-1, while
interacting with the region allocator. Furthermore, we assume that
pageblock_order == MAX_ORDER-1.
(These assumptions are used to ease the implementation, so that one can
quickly evaluate the benefits of the overall design without getting
bogged down by too many corner cases and constraints. Of course future
implementations will handle more scenarios and will have reduced dependence
on such simplifying assumptions.)
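The resulting order of allocation sources can be sketched as a small
userspace model. This is illustrative only: struct zone_model,
rmqueue_model() and the sketch_source values are hypothetical names, and
the counters stand in for the real freelists.

```c
#include <assert.h>

enum sketch_source {
	FROM_FREELIST,		/* the requested migratetype's own freelist */
	FROM_REGION_ALLOC,	/* freelist refilled from the region allocator */
	FROM_FALLBACK,		/* a different migratetype's freelist */
	FROM_RESERVE,		/* the reserve freelist */
	ALLOC_FAILED
};

/* Hypothetical counters standing in for the real per-zone state. */
struct zone_model {
	int own_free;		/* blocks on the requested freelist */
	int ra_regions;		/* whole regions held by the region allocator */
	int blocks_per_region;	/* blocks gained when one region is released */
	int fallback_free;	/* blocks on other migratetypes' freelists */
	int reserve_free;	/* blocks on the reserve freelist */
};

enum sketch_source rmqueue_model(struct zone_model *z)
{
	int refilled = 0;

	assert(z->blocks_per_region > 0);

	for (;;) {
		if (z->own_free > 0) {
			z->own_free--;
			return refilled ? FROM_REGION_ALLOC : FROM_FREELIST;
		}
		if (z->ra_regions == 0)
			break;
		/* Models a successful del_from_region_allocator(). */
		z->ra_regions--;
		z->own_free += z->blocks_per_region;
		refilled = 1;	/* then "goto retry" */
	}

	if (z->fallback_free > 0) {
		z->fallback_free--;
		return FROM_FALLBACK;
	}
	if (z->reserve_free > 0) {
		z->reserve_free--;
		return FROM_RESERVE;
	}
	return ALLOC_FAILED;
}
```

The point of the ordering is that a different migratetype's freelist is
touched only after the region allocator has nothing left to hand out.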
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d08bc91..0d73134 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1703,10 +1703,18 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
 {
 	struct page *page;
 
-retry_reserve:
+retry:
 	page = __rmqueue_smallest(zone, order, migratetype);
 
 	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
+
+		/*
+		 * Try to get a region from the region allocator before falling
+		 * back to an allocation from a different migratetype.
+		 */
+		if (!del_from_region_allocator(zone, MAX_ORDER-1, migratetype))
+			goto retry;
+
 		page = __rmqueue_fallback(zone, order, migratetype);
 
 		/*
@@ -1716,7 +1724,7 @@ retry_reserve:
 		 */
 		if (!page) {
 			migratetype = MIGRATE_RESERVE;
-			goto retry_reserve;
+			goto retry;
 		}
 	}
 
^ permalink raw reply related	[flat|nested] 72+ messages in thread
* [RFC PATCH v4 27/40] mm: Update the freepage migratetype of pages during region allocation
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (25 preceding siblings ...)
  2013-09-25 23:19 ` [RFC PATCH v4 26/40] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA <= " Srivatsa S. Bhat
@ 2013-09-25 23:19 ` Srivatsa S. Bhat
  2013-09-25 23:20 ` [RFC PATCH v4 28/40] mm: Provide a mechanism to check if a given page is in the region allocator Srivatsa S. Bhat
                   ` (13 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:19 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
The freepage migratetype is used to determine which freelist a given
page should be added to, upon getting freed. To ensure that the page
goes to the right freelist, set the freepage migratetype of all
the pages of a region, when allocating freepages from the region allocator.
This helps ensure that upon freeing the pages or during buddy expansion,
the pages are added back to the freelists of the migratetype for which
the pages were originally requested from the region allocator.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |    3 +++
 1 file changed, 3 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d73134..ca7b959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1023,6 +1023,9 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 	reg_area = &reg_alloc->region[next_region].region_area[order];
 	ralloc_list = &reg_area->list;
 
+	list_for_each_entry(page, ralloc_list, lru)
+		set_freepage_migratetype(page, migratetype);
+
 	free_list = &zone->free_area[order].free_list[migratetype];
 
 	nr_pages = add_to_freelist_bulk(ralloc_list, free_list, order,
^ permalink raw reply related	[flat|nested] 72+ messages in thread
* [RFC PATCH v4 28/40] mm: Provide a mechanism to check if a given page is in the region allocator
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (26 preceding siblings ...)
  2013-09-25 23:19 ` [RFC PATCH v4 27/40] mm: Update the freepage migratetype of pages during region allocation Srivatsa S. Bhat
@ 2013-09-25 23:20 ` Srivatsa S. Bhat
  2013-09-25 23:20 ` [RFC PATCH v4 29/40] mm: Add a way to request pages of a particular region from " Srivatsa S. Bhat
                   ` (12 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:20 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
With the introduction of the region allocator, a freepage can be either
in one of the buddy freelists or in the region allocator. In cases where we
want to move freepages to a given migratetype's freelists, we will need to
know where they were originally located. So provide a helper to distinguish
whether the freepage resides in the region allocator or the buddy freelists.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca7b959..ac04b45 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1048,6 +1048,37 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 }
 
 /*
+ * Return 1 if the page is in the region allocator, else return 0
+ * (which usually means that the page is in the buddy freelists).
+ */
+static int page_in_region_allocator(struct page *page)
+{
+	struct region_allocator *reg_alloc;
+	struct free_area_region *reg_area;
+	int order, region_id;
+
+	/* We keep only MAX_ORDER-1 pages in the region allocator */
+	order = page_order(page);
+	if (order != MAX_ORDER-1)
+		return 0;
+
+	/*
+	 * It is sufficient to check if (any of) the pages belonging to
+	 * that region are in the region allocator, because a page resides
+	 * in the region allocator if and only if all the pages of that
+	 * region are also in the region allocator.
+	 */
+	region_id = page_zone_region_id(page);
+	reg_alloc = &page_zone(page)->region_allocator;
+	reg_area = &reg_alloc->region[region_id].region_area[order];
+
+	if (reg_area->nr_free)
+		return 1;
+
+	return 0;
+}
+
+/*
  * Freeing function for a buddy system allocator.
  *
  * The concept of a buddy system is to maintain direct-mapped table
^ permalink raw reply related	[flat|nested] 72+ messages in thread
* [RFC PATCH v4 29/40] mm: Add a way to request pages of a particular region from the region allocator
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (27 preceding siblings ...)
  2013-09-25 23:20 ` [RFC PATCH v4 28/40] mm: Provide a mechanism to check if a given page is in the region allocator Srivatsa S. Bhat
@ 2013-09-25 23:20 ` Srivatsa S. Bhat
  2013-09-25 23:20 ` [RFC PATCH v4 30/40] mm: Modify move_freepages() to handle pages in the region allocator properly Srivatsa S. Bhat
                   ` (11 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:20 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
When moving freepages from one migratetype to another (using move_freepages()
or equivalent), we might encounter situations in which we would like to move
pages that are in the region allocator. In such cases, we need a way to
request pages of a particular region from the region allocator.
We already have the code to perform the heavy-lifting of actually moving the
pages of a region from the region allocator to a requested freelist or
migratetype. So just reorganize that code in such a way that we can also
pin-point a region and specify that we want the region allocator to allocate
pages from that particular region.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac04b45..ed5298c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1003,24 +1003,18 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 		*next_region = region_id;
 }
 
-/* Delete freepages from the region allocator and add them to buddy freelists */
-static int del_from_region_allocator(struct zone *zone, unsigned int order,
-				     int migratetype)
+static void __del_from_region_allocator(struct zone *zone, unsigned int order,
+					int migratetype, int region_id)
 {
 	struct region_allocator *reg_alloc;
 	struct free_area_region *reg_area;
 	struct list_head *ralloc_list;
 	struct free_list *free_list;
 	unsigned long nr_pages;
-	int next_region;
+	struct page *page;
 
 	reg_alloc = &zone->region_allocator;
-
-	next_region = reg_alloc->next_region;
-	if (next_region < 0)
-		return -ENOMEM;
-
-	reg_area = &reg_alloc->region[next_region].region_area[order];
+	reg_area = &reg_alloc->region[region_id].region_area[order];
 	ralloc_list = &reg_area->list;
 
 	list_for_each_entry(page, ralloc_list, lru)
@@ -1029,20 +1023,34 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 	free_list = &zone->free_area[order].free_list[migratetype];
 
 	nr_pages = add_to_freelist_bulk(ralloc_list, free_list, order,
-					next_region);
+					region_id);
 
 	reg_area->nr_free -= nr_pages;
 	WARN_ON(reg_area->nr_free != 0);
 
 	/* Pick a new next_region */
-	clear_bit(next_region, reg_alloc->ralloc_mask);
-	next_region = find_first_bit(reg_alloc->ralloc_mask,
+	clear_bit(region_id, reg_alloc->ralloc_mask);
+	region_id = find_first_bit(reg_alloc->ralloc_mask,
 				     MAX_NR_ZONE_REGIONS);
 
-	if (next_region >= MAX_NR_ZONE_REGIONS)
-		next_region = -1; /* No free regions available */
+	if (region_id >= MAX_NR_ZONE_REGIONS)
+		region_id = -1; /* No free regions available */
+
+	reg_alloc->next_region = region_id;
+}
+
+/* Delete freepages from the region allocator and add them to buddy freelists */
+static int del_from_region_allocator(struct zone *zone, unsigned int order,
+				     int migratetype)
+{
+	int next_region;
+
+	next_region = zone->region_allocator.next_region;
+
+	if (next_region < 0)
+		return -ENOMEM;
 
-	reg_alloc->next_region = next_region;
+	__del_from_region_allocator(zone, order, migratetype, next_region);
 
 	return 0;
 }
^ permalink raw reply related	[flat|nested] 72+ messages in thread
* [RFC PATCH v4 30/40] mm: Modify move_freepages() to handle pages in the region allocator properly
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (28 preceding siblings ...)
  2013-09-25 23:20 ` [RFC PATCH v4 29/40] mm: Add a way to request pages of a particular region from " Srivatsa S. Bhat
@ 2013-09-25 23:20 ` Srivatsa S. Bhat
  2013-09-25 23:20 ` [RFC PATCH v4 31/40] mm: Never change migratetypes of pageblocks during freepage stealing Srivatsa S. Bhat
                   ` (10 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:20 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
There are situations in which the memory management subsystem needs to move
pages from one migratetype to another, such as when setting up the per-zone
migrate reserves (where freepages are moved from MIGRATE_MOVABLE to
MIGRATE_RESERVE freelists).
But the existing code that does freepage movement is unaware of the region
allocator. In other words, it assumes that the freepages it is moving are
always in the buddy page allocator's freelists. But with the
introduction of the region allocator, the freepages could reside
in the region allocator instead. So teach move_freepages() to check whether
the pages are in the buddy page allocator's freelists or the region
allocator and handle the two cases appropriately.
The region allocator is designed in such a way that it always allocates
or receives entire memory regions as a single unit. To retain these
semantics during freepage movement, we first move all the pages of that
region from the region allocator to the MIGRATE_MOVABLE buddy freelist
and then move the requested page(s) from MIGRATE_MOVABLE to the required
migratetype.
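The two-step movement can be sketched in userspace as follows. This is a
toy model under stated assumptions: struct region_blocks, move_block() and
the fixed BLOCKS_PER_REGION are hypothetical, and each array slot stands
for one MAX_ORDER-1 block of the region.

```c
#include <assert.h>

enum sketch_mt { MT_UNMOVABLE, MT_MOVABLE, MT_RESERVE };

#define IN_REGION_ALLOCATOR	(-1)
#define BLOCKS_PER_REGION	4	/* assumed size, for illustration */

/* owner[i] is the freelist holding block i, or IN_REGION_ALLOCATOR. */
struct region_blocks {
	int owner[BLOCKS_PER_REGION];
};

/* Step 1: the whole region leaves the region allocator as one unit. */
void release_region_to(struct region_blocks *r, int mt)
{
	for (int i = 0; i < BLOCKS_PER_REGION; i++)
		r->owner[i] = mt;
}

/* Step 2: only the requested block moves on to the target freelist. */
void move_block(struct region_blocks *r, int idx, int target)
{
	assert(idx >= 0 && idx < BLOCKS_PER_REGION);

	if (r->owner[idx] == IN_REGION_ALLOCATOR)
		release_region_to(r, MT_MOVABLE);

	r->owner[idx] = target;
}
```

After moving one block of a region held by the region allocator, the
remaining blocks end up on the movable freelist rather than staying behind
as a partial region in the region allocator.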
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed5298c..939f378 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1558,7 +1558,7 @@ int move_freepages(struct zone *zone,
 	struct page *page;
 	unsigned long order;
 	struct free_area *area;
-	int pages_moved = 0, old_mt;
+	int pages_moved = 0, old_mt, region_id;
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -1585,7 +1585,23 @@ int move_freepages(struct zone *zone,
 			continue;
 		}
 
+		/*
+		 * If the page is in the region allocator, we first move the
+		 * region to the MIGRATE_MOVABLE buddy freelists and then move
+		 * that page to the freelist of the requested migratetype.
+		 * This is because the region allocator operates on whole region-
+		 * sized chunks, whereas here we want to move pages in much
+		 * smaller chunks.
+		 */
 		order = page_order(page);
+		if (page_in_region_allocator(page)) {
+			region_id = page_zone_region_id(page);
+			__del_from_region_allocator(zone, order, MIGRATE_MOVABLE,
+						    region_id);
+
+			continue; /* Try this page again from the buddy-list */
+		}
+
 		old_mt = get_freepage_migratetype(page);
 		area = &zone->free_area[order];
 		move_page_freelist(page, &area->free_list[old_mt],
^ permalink raw reply related	[flat|nested] 72+ messages in thread
* [RFC PATCH v4 31/40] mm: Never change migratetypes of pageblocks during freepage stealing
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (29 preceding siblings ...)
  2013-09-25 23:20 ` [RFC PATCH v4 30/40] mm: Modify move_freepages() to handle pages in the region allocator properly Srivatsa S. Bhat
@ 2013-09-25 23:20 ` Srivatsa S. Bhat
  2013-09-25 23:20 ` [RFC PATCH v4 32/40] mm: Set pageblock migratetype when allocating regions from region allocator Srivatsa S. Bhat
                   ` (9 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:20 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
We would like to keep large chunks of memory (of the size of memory regions)
populated by allocations of a single migratetype. This helps in influencing
allocation/reclaim decisions on a per-migratetype basis, which would also
automatically respect memory region boundaries.
For example, if a region is known to contain only MIGRATE_UNMOVABLE pages,
we can skip trying targeted compaction on that region. Similarly, if a region
has only MIGRATE_MOVABLE pages, then the likelihood of successful targeted
evacuation of that region is higher, as opposed to having a few unmovable
pages embedded in a region otherwise containing mostly movable allocations.
Thus, it is beneficial to try and keep memory allocations homogeneous (in
terms of the migratetype) in region-sized chunks of memory.
Changing the migratetypes of pageblocks during freepage stealing gets in the
way of this effort, since it fragments the ownership of memory segments.
So never change the ownership of pageblocks during freepage stealing.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   36 ++++++++++--------------------------
 1 file changed, 10 insertions(+), 26 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 939f378..fd32533 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1649,14 +1649,16 @@ static void change_pageblock_range(struct page *pageblock_page,
 /*
  * If breaking a large block of pages, move all free pages to the preferred
  * allocation list. If falling back for a reclaimable kernel allocation, be
- * more aggressive about taking ownership of free pages.
+ * more aggressive about borrowing the free pages.
  *
- * On the other hand, never change migration type of MIGRATE_CMA pageblocks
- * nor move CMA pages to different free lists. We don't want unmovable pages
- * to be allocated from MIGRATE_CMA areas.
+ * On the other hand, never move CMA pages to different free lists. We don't
+ * want unmovable pages to be allocated from MIGRATE_CMA areas.
  *
- * Returns the new migratetype of the pageblock (or the same old migratetype
- * if it was unchanged).
+ * Also, we *NEVER* change the pageblock migratetype of any block of memory.
+ * (IOW, we only try to _loan_ the freepages from a fallback list, but never
+ * try to _own_ them.)
+ *
+ * Returns the migratetype of the fallback list.
  */
 static int try_to_steal_freepages(struct zone *zone, struct page *page,
 				  int start_type, int fallback_type)
@@ -1666,28 +1668,10 @@ static int try_to_steal_freepages(struct zone *zone, struct page *page,
 	if (is_migrate_cma(fallback_type))
 		return fallback_type;
 
-	/* Take ownership for orders >= pageblock_order */
-	if (current_order >= pageblock_order) {
-		change_pageblock_range(page, current_order, start_type);
-		return start_type;
-	}
-
 	if (current_order >= pageblock_order / 2 ||
 	    start_type == MIGRATE_RECLAIMABLE ||
-	    page_group_by_mobility_disabled) {
-		int pages;
-
-		pages = move_freepages_block(zone, page, start_type);
-
-		/* Claim the whole block if over half of it is free */
-		if (pages >= (1 << (pageblock_order-1)) ||
-				page_group_by_mobility_disabled) {
-
-			set_pageblock_migratetype(page, start_type);
-			return start_type;
-		}
-
-	}
+	    page_group_by_mobility_disabled)
+		move_freepages_block(zone, page, start_type);
 
 	return fallback_type;
 }
^ permalink raw reply related	[flat|nested] 72+ messages in thread
* [RFC PATCH v4 32/40] mm: Set pageblock migratetype when allocating regions from region allocator
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (30 preceding siblings ...)
  2013-09-25 23:20 ` [RFC PATCH v4 31/40] mm: Never change migratetypes of pageblocks during freepage stealing Srivatsa S. Bhat
@ 2013-09-25 23:20 ` Srivatsa S. Bhat
  2013-09-25 23:21 ` [RFC PATCH v4 33/40] mm: Use a cache between page-allocator and region-allocator Srivatsa S. Bhat
                   ` (8 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:20 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
We would like to maintain memory regions such that all memory pertaining to
a given memory region serves allocations of a single migratetype. IOW, we
don't want to permanently mix allocations of different migratetypes within
the same region.
So, when allocating a region from the region allocator to the page allocator,
set the pageblock migratetype of all that memory to the migratetype for which
the page allocator requested memory.
Note that this still allows temporary sharing of pages between different
migratetypes; it just ensures that there is no *permanent* mixing of
migratetypes within a given memory region.
An important advantage to be noted here is that the region allocator doesn't
have to manage memory at a granularity finer than a memory region, in *any*
situation. This is because the freepage migratetype and the fallback mechanism
allow temporary sharing of free memory between different migratetypes when
the system is short on memory, but eventually all the memory gets freed to
the original migratetype (because we set the pageblock migratetype of all the
freepages appropriately when allocating regions).
This greatly simplifies the design of the region allocator, since it doesn't
have to keep track of memory in smaller chunks than a memory region.
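The difference between temporary sharing and permanent ownership can be
sketched with two tags per page. This is a userspace illustration with
hypothetical names (struct page_model, lend_to(), free_page_model()); it
assumes, as the description above implies, that a freed page returns to the
freelist named by its pageblock migratetype.

```c
#include <assert.h>

enum sketch_mt { MT_UNMOVABLE, MT_RECLAIMABLE, MT_MOVABLE };

struct page_model {
	enum sketch_mt pageblock_mt;	/* owner, set at region allocation */
	enum sketch_mt freepage_mt;	/* freelist the page currently sits on */
};

/* Region allocator hands the page to migratetype mt: both tags are set. */
void allocate_from_region(struct page_model *p, enum sketch_mt mt)
{
	p->pageblock_mt = mt;
	p->freepage_mt = mt;
}

/* Fallback path: the page is lent out, but the owner is left untouched. */
void lend_to(struct page_model *p, enum sketch_mt borrower)
{
	p->freepage_mt = borrower;
}

/* Freeing: the page goes back to its owner's freelist. */
enum sketch_mt free_page_model(struct page_model *p)
{
	p->freepage_mt = p->pageblock_mt;
	return p->freepage_mt;
}
```

Because lend_to() never changes pageblock_mt, a page borrowed by another
migratetype still finds its way back to the original owner once freed.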
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd32533..c4cbd80 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1017,8 +1017,10 @@ static void __del_from_region_allocator(struct zone *zone, unsigned int order,
 	reg_area = &reg_alloc->region[region_id].region_area[order];
 	ralloc_list = &reg_area->list;
 
-	list_for_each_entry(page, ralloc_list, lru)
+	list_for_each_entry(page, ralloc_list, lru) {
 		set_freepage_migratetype(page, migratetype);
+		set_pageblock_migratetype(page, migratetype);
+	}
 
 	free_list = &zone->free_area[order].free_list[migratetype];
 
^ permalink raw reply related	[flat|nested] 72+ messages in thread
* [RFC PATCH v4 33/40] mm: Use a cache between page-allocator and region-allocator
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (31 preceding siblings ...)
  2013-09-25 23:20 ` [RFC PATCH v4 32/40] mm: Set pageblock migratetype when allocating regions from region allocator Srivatsa S. Bhat
@ 2013-09-25 23:21 ` Srivatsa S. Bhat
  2013-09-25 23:21 ` [RFC PATCH v4 34/40] mm: Restructure the compaction part of CMA for wider use Srivatsa S. Bhat
                   ` (7 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:21 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Currently, whenever the page allocator notices that it has all the freepages
of a given memory region, it attempts to return them to the region
allocator. This strategy is needlessly aggressive and can cause a lot of back
and forth between the page-allocator and the region-allocator.
More importantly, it can defeat the purpose of having a region allocator
in the first place: if the buddy allocator immediately returns the freepages
of a memory region to the region allocator, they go back to the generic pool
of pages. So, in the future, depending on when the next allocation
request arrives for this particular migratetype, the region allocator might not
have any free regions to hand out, and hence we might end up falling back
to freepages of other migratetypes. Instead, if the page allocator retains
a few regions as a cache for every migratetype, we will have higher chances
of avoiding fallbacks to other migratetypes.
So, don't return all free memory regions (in the page allocator) to the
region allocator. Keep at least one region as a cache, for future use.
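The "is this the last region on the freelist?" test can be sketched in
userspace as below. It rests on an assumption taken from earlier patches in
this series, namely that freelist entries are kept grouped by region id, so
comparing the first and last entries is enough. The array representation
and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * region_ids[] models a freelist, one entry per free block, with entries
 * of the same region adjacent to each other (assumed invariant).
 */
bool freelist_spans_multiple_regions(const int *region_ids, size_t n)
{
	assert(region_ids != NULL || n == 0);

	if (n == 0)
		return false;

	/* Same region at both ends means only one region is present. */
	return region_ids[0] != region_ids[n - 1];
}
```

A region would be handed back only when this returns true, which leaves at
least one region behind on the freelist as the cache.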
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c4cbd80..a15ac96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -640,9 +640,11 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 				    int region_id);
 
 
-static inline int can_return_region(struct mem_region_list *region, int order)
+static inline int can_return_region(struct mem_region_list *region, int order,
+				    struct free_list *free_list)
 {
 	struct zone_mem_region *zone_region;
+	struct page *prev_page, *next_page;
 
 	zone_region = region->zone_region;
 
@@ -660,6 +662,16 @@ static inline int can_return_region(struct mem_region_list *region, int order)
 	if (likely(order != MAX_ORDER-1))
 		return 0;
 
+	/*
+	 * Don't return all the regions; retain at least one region as a
+	 * cache for future use.
+	 */
+	prev_page = container_of(free_list->list.prev , struct page, lru);
+	next_page = container_of(free_list->list.next , struct page, lru);
+
+	if (page_zone_region_id(prev_page) == page_zone_region_id(next_page))
+		return 0; /* There is only one region in this freelist */
+
 	if (region->nr_free * (1 << order) != zone_region->nr_free)
 		return 0;
 
@@ -729,7 +741,7 @@ try_return_region:
 	 * Try to return the freepages of a memory region to the region
 	 * allocator, if possible.
 	 */
-	if (can_return_region(region, order))
+	if (can_return_region(region, order, free_list))
 		add_to_region_allocator(page_zone(page), free_list, region_id);
 }
 
^ permalink raw reply related	[flat|nested] 72+ messages in thread
* [RFC PATCH v4 34/40] mm: Restructure the compaction part of CMA for wider use
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (32 preceding siblings ...)
  2013-09-25 23:21 ` [RFC PATCH v4 33/40] mm: Use a cache between page-allocator and region-allocator Srivatsa S. Bhat
@ 2013-09-25 23:21 ` Srivatsa S. Bhat
  2013-09-25 23:21 ` [RFC PATCH v4 35/40] mm: Add infrastructure to evacuate memory regions using compaction Srivatsa S. Bhat
                   ` (6 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:21 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
CMA uses bits and pieces of the memory compaction algorithms to perform
large contiguous allocations. Those algorithms would be useful for
memory power management too, to evacuate entire regions of memory.
So rewrite the code in a way that helps us to easily reuse the code for
both use-cases.
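The general shape of such a reusable routine, with caller-supplied
callbacks and a bounded retry policy, can be sketched in userspace as
follows. This is not the kernel code: struct target_control, struct
retry_policy, migrate_with_retries() and take_from_budget() are all
hypothetical, and a void pointer is used for the private data purely for
portability of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Caller-supplied way of obtaining a migration target. */
struct target_control {
	int (*alloc_target)(void *data);	/* returns 0 on success */
	void *alloc_data;			/* private data for the callback */
};

/* Caller-supplied limit on how hard to try. */
struct retry_policy {
	int max_tries;
};

/* Try to move `pending` items; returns how many could not be moved. */
int migrate_with_retries(int pending, const struct target_control *tc,
			 const struct retry_policy *policy)
{
	assert(tc != NULL && tc->alloc_target != NULL && policy != NULL);

	for (int tries = 0; tries < policy->max_tries && pending > 0; tries++) {
		int left = 0;

		for (int i = 0; i < pending; i++)
			if (tc->alloc_target(tc->alloc_data) != 0)
				left++;	/* no target available; retry later */

		pending = left;
	}

	return pending;
}

/* Example callback: targets come out of a fixed budget. */
int take_from_budget(void *data)
{
	int *budget = data;

	if (*budget <= 0)
		return -1;

	(*budget)--;
	return 0;
}
```

Two callers with different needs can then share the loop and differ only
in the callback and the policy they pass in.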
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/compaction.c |   81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h   |   40 +++++++++++++++++++++++++++
 mm/page_alloc.c |   51 +++++++++--------------------------
 3 files changed, 134 insertions(+), 38 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 511b191..c775066 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -816,6 +816,87 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return ISOLATE_SUCCESS;
 }
 
+/*
+ * Make free pages available within the given range, using compaction to
+ * migrate used pages elsewhere.
+ *
+ * [start, end) must belong to a single zone.
+ *
+ * This function is roughly based on the logic inside compact_zone().
+ */
+int compact_range(struct compact_control *cc, struct aggression_control *ac,
+		  struct free_page_control *fc, unsigned long start,
+		  unsigned long end)
+{
+	unsigned long pfn = start;
+	int ret = 0, tries, migrate_mode;
+
+	if (ac->prep_all)
+		migrate_prep();
+	else
+		migrate_prep_local();
+
+	while (pfn < end || !list_empty(&cc->migratepages)) {
+		if (list_empty(&cc->migratepages)) {
+			cc->nr_migratepages = 0;
+			pfn = isolate_migratepages_range(cc->zone, cc,
+					pfn, end, ac->isolate_unevictable);
+
+			if (!pfn) {
+				ret = -EINTR;
+				break;
+			}
+		}
+
+		for (tries = 0; tries < ac->max_tries; tries++) {
+			unsigned long nr_migrate, nr_remaining;
+
+			if (fatal_signal_pending(current)){
+				ret = -EINTR;
+				goto out;
+			}
+
+			if (ac->reclaim_clean) {
+				int nr_reclaimed;
+
+				nr_reclaimed =
+					reclaim_clean_pages_from_list(cc->zone,
+							&cc->migratepages);
+
+				cc->nr_migratepages -= nr_reclaimed;
+			}
+
+			migrate_mode = cc->sync ? MIGRATE_SYNC : MIGRATE_ASYNC;
+			nr_migrate = cc->nr_migratepages;
+			ret = migrate_pages(&cc->migratepages,
+					    fc->free_page_alloc, fc->alloc_data,
+					    migrate_mode, ac->reason);
+
+			update_nr_listpages(cc);
+			nr_remaining = cc->nr_migratepages;
+			trace_mm_compaction_migratepages(
+				nr_migrate - nr_remaining, nr_remaining);
+		}
+
+		if (!list_empty(&cc->migratepages)) {
+			ret = ret < 0 ? ret : -EBUSY;
+			break;
+		}
+	}
+
+out:
+	if (ret < 0)
+		putback_movable_pages(&cc->migratepages);
+
+	/* Release free pages and check accounting */
+	if (fc->release_freepages)
+		cc->nr_freepages -= fc->release_freepages(fc->free_data);
+
+	VM_BUG_ON(cc->nr_freepages != 0);
+
+	return ret;
+}
+
 static int compact_finished(struct zone *zone,
 			    struct compact_control *cc)
 {
diff --git a/mm/internal.h b/mm/internal.h
index 684f7aa..acb50f8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -107,6 +107,42 @@ extern bool is_free_buddy_page(struct page *page);
 /*
  * in mm/compaction.c
  */
+
+struct free_page_control {
+
+	/* Function used to allocate free pages as target of migration. */
+	struct page * (*free_page_alloc)(struct page *migratepage,
+					 unsigned long data,
+					 int **result);
+
+	unsigned long alloc_data;	/* Private data for free_page_alloc() */
+
+	/*
+	 * Function to release the accumulated free pages after the compaction
+	 * run.
+	 */
+	unsigned long (*release_freepages)(unsigned long info);
+	unsigned long free_data;	/* Private data for release_freepages() */
+};
+
+/*
+ * aggression_control gives us fine-grained control to specify how aggressively
+ * we want to compact memory.
+ */
+struct aggression_control {
+	bool isolate_unevictable;	/* Isolate unevictable pages too */
+	bool prep_all;			/* Use migrate_prep() instead of
+					 * migrate_prep_local().
+					 */
+	bool reclaim_clean;		/* Reclaim clean page-cache pages */
+	int max_tries;			/* No. of tries to migrate the
+					 * isolated pages before giving up.
+					 */
+	int reason;			/* Reason for compaction, passed on
+					 * as reason for migrate_pages().
+					 */
+};
+
 /*
  * compact_control is used to track pages being migrated and the free pages
  * they are being migrated to during memory compaction. The free_pfn starts
@@ -141,6 +177,10 @@ unsigned long
 isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	unsigned long low_pfn, unsigned long end_pfn, bool unevictable);
 
+int compact_range(struct compact_control *cc, struct aggression_control *ac,
+		  struct free_page_control *fc, unsigned long start,
+		  unsigned long end);
+
 #endif
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a15ac96..70c3d7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6893,46 +6893,21 @@ static unsigned long pfn_max_align_up(unsigned long pfn)
 static int __alloc_contig_migrate_range(struct compact_control *cc,
 					unsigned long start, unsigned long end)
 {
-	/* This function is based on compact_zone() from compaction.c. */
-	unsigned long nr_reclaimed;
-	unsigned long pfn = start;
-	unsigned int tries = 0;
-	int ret = 0;
-
-	migrate_prep();
-
-	while (pfn < end || !list_empty(&cc->migratepages)) {
-		if (fatal_signal_pending(current)) {
-			ret = -EINTR;
-			break;
-		}
-
-		if (list_empty(&cc->migratepages)) {
-			cc->nr_migratepages = 0;
-			pfn = isolate_migratepages_range(cc->zone, cc,
-							 pfn, end, true);
-			if (!pfn) {
-				ret = -EINTR;
-				break;
-			}
-			tries = 0;
-		} else if (++tries == 5) {
-			ret = ret < 0 ? ret : -EBUSY;
-			break;
-		}
+	struct aggression_control ac = {
+		.isolate_unevictable = true,
+		.prep_all = true,
+		.reclaim_clean = true,
+		.max_tries = 5,
+		.reason = MR_CMA,
+	};
 
-		nr_reclaimed = reclaim_clean_pages_from_list(cc->zone,
-							&cc->migratepages);
-		cc->nr_migratepages -= nr_reclaimed;
+	struct free_page_control fc = {
+		.free_page_alloc = alloc_migrate_target,
+		.alloc_data = 0,
+		.release_freepages = NULL,
+	};
 
-		ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
-				    0, MIGRATE_SYNC, MR_CMA);
-	}
-	if (ret < 0) {
-		putback_movable_pages(&cc->migratepages);
-		return ret;
-	}
-	return 0;
+	return compact_range(cc, &ac, &fc, start, end);
 }
 
 /**
* [RFC PATCH v4 35/40] mm: Add infrastructure to evacuate memory regions using compaction
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (33 preceding siblings ...)
  2013-09-25 23:21 ` [RFC PATCH v4 34/40] mm: Restructure the compaction part of CMA for wider use Srivatsa S. Bhat
@ 2013-09-25 23:21 ` Srivatsa S. Bhat
  2013-09-25 23:21 ` [RFC PATCH v4 36/40] kthread: Split out kthread-worker bits to avoid circular header-file dependency Srivatsa S. Bhat
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:21 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
To enhance memory power-savings, we need to be able to completely evacuate
lightly allocated regions and move their used pages to lower regions, which
helps consolidate all the allocations into a minimum number of regions.
This can be done using some of the memory compaction and reclaim algorithms.
Develop such an infrastructure to evacuate memory regions completely.
The traditional compaction algorithm uses a pfn walker to find free pages
for compaction, but that would be far too costly for us. So we do a pfn walk
only to isolate the used pages; to get free pages, we rely on the fast buddy
allocator itself. However, we are careful to abort the compaction run when
the buddy allocator starts handing out free pages from this region itself or
from higher regions, because proceeding in that case would defeat the purpose
of the entire effort.
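The abort rule described above can be modelled in user space. The following
is only a sketch under stated assumptions, not the kernel code: toy_evacuate()
and its parameters are hypothetical names, each free page is reduced to the id
of the region it came from, and the array order stands in for the order in
which the buddy allocator hands pages out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (hypothetical, not kernel code): free_regions[i] is the region id
 * of the i-th free page handed out by the allocator. We try to move nr_used
 * pages out of src_region and stop at the first free page that does not come
 * from a strictly lower region.
 *
 * Returns the number of pages that could be moved before aborting.
 */
static size_t toy_evacuate(const int *free_regions, size_t nr_free,
			   int src_region, size_t nr_used)
{
	size_t moved = 0;

	assert(free_regions != NULL || nr_free == 0);

	while (moved < nr_used && moved < nr_free) {
		/* Target is in the same or a higher region: abort the run. */
		if (free_regions[moved] >= src_region)
			break;
		moved++;
	}

	return moved;
}
```

With free pages coming from regions 0, 0, 1, 3, 4 and a source region of 3,
this model moves three pages and then stops, since the fourth target would
come from region 3 itself.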
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/compaction.h     |    7 +++
 include/linux/gfp.h            |    2 +
 include/linux/migrate.h        |    3 +
 include/linux/mm.h             |    1 
 include/trace/events/migrate.h |    3 +
 mm/compaction.c                |   99 ++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c                |   23 +++++++--
 7 files changed, 130 insertions(+), 8 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 091d72e..6be2b08 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -26,6 +26,7 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
+extern int evacuate_mem_region(struct zone *z, struct zone_mem_region *zmr);
 
 /* Do not skip compaction more than 64 times */
 #define COMPACT_MAX_DEFER_SHIFT 6
@@ -102,6 +103,12 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 	return true;
 }
 
+static inline int evacuate_mem_region(struct zone *z,
+				      struct zone_mem_region *zmr)
+{
+	return 0;
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd49..dab3c78 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -351,6 +351,8 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
 
+int rmqueue_bulk(struct zone *zone, unsigned int order, unsigned long count,
+		 struct list_head *list, int migratetype, int cold);
 void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
 void free_pages_exact(void *virt, size_t size);
 /* This is different from alloc_pages_exact_node !!! */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 8d3c57f..5ab1d48 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -30,7 +30,8 @@ enum migrate_reason {
 	MR_SYSCALL,		/* also applies to cpusets */
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
-	MR_CMA
+	MR_CMA,
+	MR_PWR_MGMT
 };
 
 #ifdef CONFIG_MIGRATION
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4286a75..f49acb0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -470,6 +470,7 @@ void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
 int split_free_page(struct page *page);
+void __split_free_page(struct page *page, unsigned int order);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index ec2a6cc..e6892c0 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -15,7 +15,8 @@
 	{MR_MEMORY_HOTPLUG,	"memory_hotplug"},		\
 	{MR_SYSCALL,		"syscall_or_cpuset"},		\
 	{MR_MEMPOLICY_MBIND,	"mempolicy_mbind"},		\
-	{MR_CMA,		"cma"}
+	{MR_CMA,		"cma"},				\
+	{MR_PWR_MGMT,		"power_management"}
 
 TRACE_EVENT(mm_migrate_pages,
 
diff --git a/mm/compaction.c b/mm/compaction.c
index c775066..9449b7f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1168,6 +1168,105 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return rc;
 }
 
+static struct page *power_mgmt_alloc(struct page *migratepage,
+				     unsigned long data, int **result)
+{
+	struct compact_control *cc = (struct compact_control *)data;
+	struct page *freepage;
+
+	/*
+	 * Try to allocate pages from lower memory regions. If it fails,
+	 * abort.
+	 */
+	if (list_empty(&cc->freepages)) {
+		struct zone *z = page_zone(migratepage);
+		unsigned int i, count, order = 0;
+		struct page *page, *tmp;
+		LIST_HEAD(list);
+
+		/* Get a bunch of order-0 pages from the buddy freelists */
+		count = rmqueue_bulk(z, order, cc->nr_migratepages, &list,
+				     MIGRATE_MOVABLE, 1);
+
+		cc->nr_freepages = count * (1ULL << order);
+
+		if (list_empty(&list))
+			return NULL;
+
+		list_for_each_entry_safe(page, tmp, &list, lru) {
+			__split_free_page(page, order);
+
+			list_move_tail(&page->lru, &cc->freepages);
+
+			/*
+			 * Now add all the order-0 subdivisions of this page
+			 * to the freelist as well.
+			 */
+			for (i = 1; i < (1ULL << order); i++) {
+				page++;
+				list_add(&page->lru, &cc->freepages);
+			}
+
+		}
+
+		VM_BUG_ON(!list_empty(&list));
+
+		/* Now map all the order-0 pages on the freelist. */
+		map_pages(&cc->freepages);
+	}
+
+	freepage = list_entry(cc->freepages.next, struct page, lru);
+
+	if (page_zone_region_id(freepage) >= page_zone_region_id(migratepage))
+		return NULL; /* Freepage is not from lower region, so abort */
+
+	list_del(&freepage->lru);
+	cc->nr_freepages--;
+
+	return freepage;
+}
+
+static unsigned long power_mgmt_release_freepages(unsigned long info)
+{
+	struct compact_control *cc = (struct compact_control *)info;
+
+	return release_freepages(&cc->freepages);
+}
+
+int evacuate_mem_region(struct zone *z, struct zone_mem_region *zmr)
+{
+	unsigned long start_pfn = zmr->start_pfn;
+	unsigned long end_pfn = zmr->end_pfn;
+
+	struct compact_control cc = {
+		.nr_migratepages = 0,
+		.order = -1,
+		.zone = page_zone(pfn_to_page(start_pfn)),
+		.sync = false,  /* Async migration */
+		.ignore_skip_hint = true,
+	};
+
+	struct aggression_control ac = {
+		.isolate_unevictable = false,
+		.prep_all = false,
+		.reclaim_clean = true,
+		.max_tries = 1,
+		.reason = MR_PWR_MGMT,
+	};
+
+	struct free_page_control fc = {
+		.free_page_alloc = power_mgmt_alloc,
+		.alloc_data = (unsigned long)&cc,
+		.release_freepages = power_mgmt_release_freepages,
+		.free_data = (unsigned long)&cc,
+	};
+
+	INIT_LIST_HEAD(&cc.migratepages);
+	INIT_LIST_HEAD(&cc.freepages);
+
+	return compact_range(&cc, &ac, &fc, start_pfn, end_pfn);
+}
+
 
 /* Compact all zones within a node */
 static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 70c3d7a..4571d30 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1793,9 +1793,8 @@ retry:
  * a single hold of the lock, for efficiency.  Add them to the supplied list.
  * Returns the number of new pages which were placed at *list.
  */
-static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list,
-			int migratetype, int cold)
+int rmqueue_bulk(struct zone *zone, unsigned int order, unsigned long count,
+		 struct list_head *list, int migratetype, int cold)
 {
 	int mt = migratetype, i;
 
@@ -2111,6 +2110,20 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 	return 1UL << order;
 }
 
+
+/*
+ * The page is already free and isolated (removed) from the buddy system.
+ * Set up the refcounts appropriately. Note that we can't use page_order()
+ * here, since the buddy system would have invoked rmv_page_order() before
+ * giving the page.
+ */
+void __split_free_page(struct page *page, unsigned int order)
+{
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+}
+
 /*
  * Similar to split_page except the page is already free. As this is only
  * being used for migration, the migratetype of the block also changes.
@@ -2132,9 +2145,7 @@ int split_free_page(struct page *page)
 	if (!nr_pages)
 		return 0;
 
-	/* Split into individual pages */
-	set_page_refcounted(page);
-	split_page(page, order);
+	__split_free_page(page, order);
 	return nr_pages;
 }
 
* [RFC PATCH v4 36/40] kthread: Split out kthread-worker bits to avoid circular header-file dependency
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (34 preceding siblings ...)
  2013-09-25 23:21 ` [RFC PATCH v4 35/40] mm: Add infrastructure to evacuate memory regions using compaction Srivatsa S. Bhat
@ 2013-09-25 23:21 ` Srivatsa S. Bhat
  2013-09-25 23:22 ` [RFC PATCH v4 37/40] mm: Add a kthread to perform targeted compaction for memory power management Srivatsa S. Bhat
                   ` (4 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:21 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
In subsequent patches, we will want to declare variables of the kthread-work
and kthread-worker structures within mmzone.h. But including kthread.h inside
mmzone.h to get the structure definitions leads to the following circular
header-file dependency:
     mmzone.h -> kthread.h -> sched.h -> gfp.h -> mmzone.h
We can avoid this by not including sched.h in kthread.h. But sched.h is quite
handy for call-sites which use the core kthread start/stop infrastructure such
as kthread-create-on-cpu/node etc. However, the kthread-work/worker framework
doesn't actually depend on sched.h.
So extract the definitions related to kthread-work/worker from kthread.h into
a new header-file named kthread-work.h (which doesn't include sched.h), so that
it can be easily included inside mmzone.h when required.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/kthread-work.h |   92 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/kthread.h      |   85 ---------------------------------------
 2 files changed, 93 insertions(+), 84 deletions(-)
 create mode 100644 include/linux/kthread-work.h
diff --git a/include/linux/kthread-work.h b/include/linux/kthread-work.h
new file mode 100644
index 0000000..6dae2ab
--- /dev/null
+++ b/include/linux/kthread-work.h
@@ -0,0 +1,92 @@
+#ifndef _LINUX_KTHREAD_WORK_H
+#define _LINUX_KTHREAD_WORK_H
+#include <linux/err.h>
+
+__printf(4, 5)
+
+/*
+ * Simple work processor based on kthread.
+ *
+ * This provides easier way to make use of kthreads.  A kthread_work
+ * can be queued and flushed using queue/flush_kthread_work()
+ * respectively.  Queued kthread_works are processed by a kthread
+ * running kthread_worker_fn().
+ */
+struct kthread_work;
+typedef void (*kthread_work_func_t)(struct kthread_work *work);
+
+struct kthread_worker {
+	spinlock_t		lock;
+	struct list_head	work_list;
+	struct task_struct	*task;
+	struct kthread_work	*current_work;
+};
+
+struct kthread_work {
+	struct list_head	node;
+	kthread_work_func_t	func;
+	wait_queue_head_t	done;
+	struct kthread_worker	*worker;
+};
+
+#define KTHREAD_WORKER_INIT(worker)	{				\
+	.lock = __SPIN_LOCK_UNLOCKED((worker).lock),			\
+	.work_list = LIST_HEAD_INIT((worker).work_list),		\
+	}
+
+#define KTHREAD_WORK_INIT(work, fn)	{				\
+	.node = LIST_HEAD_INIT((work).node),				\
+	.func = (fn),							\
+	.done = __WAIT_QUEUE_HEAD_INITIALIZER((work).done),		\
+	}
+
+#define DEFINE_KTHREAD_WORKER(worker)					\
+	struct kthread_worker worker = KTHREAD_WORKER_INIT(worker)
+
+#define DEFINE_KTHREAD_WORK(work, fn)					\
+	struct kthread_work work = KTHREAD_WORK_INIT(work, fn)
+
+/*
+ * kthread_worker.lock and kthread_work.done need their own lockdep class
+ * keys if they are defined on stack with lockdep enabled.  Use the
+ * following macros when defining them on stack.
+ */
+#ifdef CONFIG_LOCKDEP
+# define KTHREAD_WORKER_INIT_ONSTACK(worker)				\
+	({ init_kthread_worker(&worker); worker; })
+# define DEFINE_KTHREAD_WORKER_ONSTACK(worker)				\
+	struct kthread_worker worker = KTHREAD_WORKER_INIT_ONSTACK(worker)
+# define KTHREAD_WORK_INIT_ONSTACK(work, fn)				\
+	({ init_kthread_work((&work), fn); work; })
+# define DEFINE_KTHREAD_WORK_ONSTACK(work, fn)				\
+	struct kthread_work work = KTHREAD_WORK_INIT_ONSTACK(work, fn)
+#else
+# define DEFINE_KTHREAD_WORKER_ONSTACK(worker) DEFINE_KTHREAD_WORKER(worker)
+# define DEFINE_KTHREAD_WORK_ONSTACK(work, fn) DEFINE_KTHREAD_WORK(work, fn)
+#endif
+
+extern void __init_kthread_worker(struct kthread_worker *worker,
+			const char *name, struct lock_class_key *key);
+
+#define init_kthread_worker(worker)					\
+	do {								\
+		static struct lock_class_key __key;			\
+		__init_kthread_worker((worker), "("#worker")->lock", &__key); \
+	} while (0)
+
+#define init_kthread_work(work, fn)					\
+	do {								\
+		memset((work), 0, sizeof(struct kthread_work));		\
+		INIT_LIST_HEAD(&(work)->node);				\
+		(work)->func = (fn);					\
+		init_waitqueue_head(&(work)->done);			\
+	} while (0)
+
+int kthread_worker_fn(void *worker_ptr);
+
+bool queue_kthread_work(struct kthread_worker *worker,
+			struct kthread_work *work);
+void flush_kthread_work(struct kthread_work *work);
+void flush_kthread_worker(struct kthread_worker *worker);
+
+#endif /* _LINUX_KTHREAD_WORK_H */
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 7dcef33..cbefb16 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -52,89 +52,6 @@ int kthreadd(void *unused);
 extern struct task_struct *kthreadd_task;
 extern int tsk_fork_get_node(struct task_struct *tsk);
 
-/*
- * Simple work processor based on kthread.
- *
- * This provides easier way to make use of kthreads.  A kthread_work
- * can be queued and flushed using queue/flush_kthread_work()
- * respectively.  Queued kthread_works are processed by a kthread
- * running kthread_worker_fn().
- */
-struct kthread_work;
-typedef void (*kthread_work_func_t)(struct kthread_work *work);
-
-struct kthread_worker {
-	spinlock_t		lock;
-	struct list_head	work_list;
-	struct task_struct	*task;
-	struct kthread_work	*current_work;
-};
-
-struct kthread_work {
-	struct list_head	node;
-	kthread_work_func_t	func;
-	wait_queue_head_t	done;
-	struct kthread_worker	*worker;
-};
-
-#define KTHREAD_WORKER_INIT(worker)	{				\
-	.lock = __SPIN_LOCK_UNLOCKED((worker).lock),			\
-	.work_list = LIST_HEAD_INIT((worker).work_list),		\
-	}
-
-#define KTHREAD_WORK_INIT(work, fn)	{				\
-	.node = LIST_HEAD_INIT((work).node),				\
-	.func = (fn),							\
-	.done = __WAIT_QUEUE_HEAD_INITIALIZER((work).done),		\
-	}
-
-#define DEFINE_KTHREAD_WORKER(worker)					\
-	struct kthread_worker worker = KTHREAD_WORKER_INIT(worker)
-
-#define DEFINE_KTHREAD_WORK(work, fn)					\
-	struct kthread_work work = KTHREAD_WORK_INIT(work, fn)
-
-/*
- * kthread_worker.lock and kthread_work.done need their own lockdep class
- * keys if they are defined on stack with lockdep enabled.  Use the
- * following macros when defining them on stack.
- */
-#ifdef CONFIG_LOCKDEP
-# define KTHREAD_WORKER_INIT_ONSTACK(worker)				\
-	({ init_kthread_worker(&worker); worker; })
-# define DEFINE_KTHREAD_WORKER_ONSTACK(worker)				\
-	struct kthread_worker worker = KTHREAD_WORKER_INIT_ONSTACK(worker)
-# define KTHREAD_WORK_INIT_ONSTACK(work, fn)				\
-	({ init_kthread_work((&work), fn); work; })
-# define DEFINE_KTHREAD_WORK_ONSTACK(work, fn)				\
-	struct kthread_work work = KTHREAD_WORK_INIT_ONSTACK(work, fn)
-#else
-# define DEFINE_KTHREAD_WORKER_ONSTACK(worker) DEFINE_KTHREAD_WORKER(worker)
-# define DEFINE_KTHREAD_WORK_ONSTACK(work, fn) DEFINE_KTHREAD_WORK(work, fn)
-#endif
-
-extern void __init_kthread_worker(struct kthread_worker *worker,
-			const char *name, struct lock_class_key *key);
-
-#define init_kthread_worker(worker)					\
-	do {								\
-		static struct lock_class_key __key;			\
-		__init_kthread_worker((worker), "("#worker")->lock", &__key); \
-	} while (0)
-
-#define init_kthread_work(work, fn)					\
-	do {								\
-		memset((work), 0, sizeof(struct kthread_work));		\
-		INIT_LIST_HEAD(&(work)->node);				\
-		(work)->func = (fn);					\
-		init_waitqueue_head(&(work)->done);			\
-	} while (0)
-
-int kthread_worker_fn(void *worker_ptr);
-
-bool queue_kthread_work(struct kthread_worker *worker,
-			struct kthread_work *work);
-void flush_kthread_work(struct kthread_work *work);
-void flush_kthread_worker(struct kthread_worker *worker);
+#include <linux/kthread-work.h>
 
 #endif /* _LINUX_KTHREAD_H */
* [RFC PATCH v4 37/40] mm: Add a kthread to perform targeted compaction for memory power management
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (35 preceding siblings ...)
  2013-09-25 23:21 ` [RFC PATCH v4 36/40] kthread: Split out kthread-worker bits to avoid circular header-file dependency Srivatsa S. Bhat
@ 2013-09-25 23:22 ` Srivatsa S. Bhat
  2013-09-25 23:22 ` [RFC PATCH v4 38/40] mm: Add a mechanism to queue work to the kmempowerd kthread Srivatsa S. Bhat
                   ` (3 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:22 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
To further increase the opportunities for memory power savings, we can perform
targeted compaction to evacuate lightly-filled memory regions. For this
purpose, introduce a dedicated per-node kthread to perform the targeted
compaction work.
Our "kmempowerd" kthread uses the generic kthread-worker framework to do most
of the usual work all kthreads need to do. On top of that, this kthread has the
following infrastructure in place, to perform the region evacuation.
A work item is instantiated for every zone. Accessible to this work item is a
spin-lock protected bitmask, which indicates which regions have to be
evacuated. Each bit set in the bitmask identifies a zone-memory-region within
that zone that would benefit from evacuation.
The operation of the "kmempowerd" kthread is quite straightforward: it makes a
local copy of the bitmask (which represents the work it is supposed to do), and
performs targeted region evacuation for each of the regions represented in
that bitmask. When it's done, it updates the original bitmask by clearing those
bits, to indicate that the requested work was completed. While the kthread is
going about doing its duty, the original bitmask can be updated to indicate the
arrival of more work. So once the kthread finishes one round of processing, it
re-examines the original bitmask to see if any new work had arrived in the
meantime, and does the corresponding work if required. This process continues
until the original bitmask becomes empty (no bits set, so no more work to do).
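The snapshot/process/clear cycle described above can be illustrated with a
single-threaded user-space model. This is a sketch only: struct toy_work,
toy_evacuate_region() and toy_kmempowerd() are hypothetical names, a plain
unsigned long stands in for the bitmap, and the spin-lock is omitted because
nothing runs concurrently here. To mimic work arriving mid-round, the model
assumes that evacuating region 1 causes region 5 to be requested.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-zone work state. */
struct toy_work {
	unsigned long pending;		/* regions requested for evacuation */
	unsigned long evacuated;	/* regions processed so far */
	int rounds;			/* number of non-empty snapshots taken */
};

static void toy_evacuate_region(struct toy_work *w, unsigned int region)
{
	w->evacuated |= 1UL << region;

	/* Simulate new work arriving while a round is in progress. */
	if (region == 1)
		w->pending |= 1UL << 5;
}

static void toy_kmempowerd(struct toy_work *w)
{
	const unsigned int nr_bits = sizeof(w->pending) * CHAR_BIT;

	assert(w != NULL);

	for (;;) {
		/* In the kernel, the snapshot is taken under the lock. */
		unsigned long snapshot = w->pending;
		unsigned int region;

		if (!snapshot)
			return;		/* no pending work left */

		w->rounds++;
		for (region = 0; region < nr_bits; region++)
			if (snapshot & (1UL << region))
				toy_evacuate_region(w, region);

		/* Clear only the bits we handled; newer bits survive. */
		w->pending &= ~snapshot;
	}
}
```

Clearing only the snapshotted bits, rather than zeroing the whole mask, is
what lets requests that arrive during a round be picked up by the next one.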
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |   10 ++++++
 mm/compaction.c        |   80 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 49c8926..257afdf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -10,6 +10,7 @@
 #include <linux/bitops.h>
 #include <linux/cache.h>
 #include <linux/threads.h>
+#include <linux/kthread-work.h>
 #include <linux/numa.h>
 #include <linux/init.h>
 #include <linux/seqlock.h>
@@ -128,6 +129,13 @@ struct region_allocator {
 	DECLARE_BITMAP(ralloc_mask, MAX_NR_ZONE_REGIONS);
 };
 
+struct mempower_work {
+	spinlock_t		lock;
+	DECLARE_BITMAP(mempower_mask, MAX_NR_ZONE_REGIONS);
+
+	struct kthread_work	work;
+};
+
 struct pglist_data;
 
 /*
@@ -460,6 +468,7 @@ struct zone {
 	 */
 	unsigned int inactive_ratio;
 
+	struct mempower_work	mempower_work;
 
 	ZONE_PADDING(_pad2_)
 	/* Rarely used or read-mostly fields */
@@ -830,6 +839,7 @@ typedef struct pglist_data {
 	struct task_struct *kswapd;	/* Protected by lock_memory_hotplug() */
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+	struct kthread_worker mempower_worker;
 #ifdef CONFIG_NUMA_BALANCING
 	/*
 	 * Lock serializing the per destination node AutoNUMA memory
diff --git a/mm/compaction.c b/mm/compaction.c
index 9449b7f..0511eae 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -16,6 +16,7 @@
 #include <linux/sysfs.h>
 #include <linux/balloon_compaction.h>
 #include <linux/page-isolation.h>
+#include <linux/kthread.h>
 #include "internal.h"
 
 #ifdef CONFIG_COMPACTION
@@ -1267,6 +1268,85 @@ int evacuate_mem_region(struct zone *z, struct zone_mem_region *zmr)
 	return compact_range(&cc, &ac, &fc, start_pfn, end_pfn);
 }
 
+#define nr_zone_region_bits	MAX_NR_ZONE_REGIONS
+static DECLARE_BITMAP(mpwork_mask, nr_zone_region_bits);
+
+static void kmempowerd(struct kthread_work *work)
+{
+	struct mempower_work *mpwork;
+	struct zone *zone;
+	unsigned long flags;
+	int region_id;
+
+	mpwork = container_of(work, struct mempower_work, work);
+	zone = container_of(mpwork, struct zone, mempower_work);
+
+	spin_lock_irqsave(&mpwork->lock, flags);
+repeat:
+	bitmap_copy(mpwork_mask, mpwork->mempower_mask, nr_zone_region_bits);
+	spin_unlock_irqrestore(&mpwork->lock, flags);
+
+	if (bitmap_empty(mpwork_mask, nr_zone_region_bits))
+		return;
+
+	for_each_set_bit(region_id, mpwork_mask, nr_zone_region_bits)
+		evacuate_mem_region(zone, &zone->zone_regions[region_id]);
+
+	spin_lock_irqsave(&mpwork->lock, flags);
+
+	bitmap_andnot(mpwork->mempower_mask, mpwork->mempower_mask, mpwork_mask,
+		      nr_zone_region_bits);
+	if (!bitmap_empty(mpwork->mempower_mask, nr_zone_region_bits))
+		goto repeat; /* More work got added in the meanwhile */
+
+	spin_unlock_irqrestore(&mpwork->lock, flags);
+
+}
+
+static void kmempowerd_run(int nid)
+{
+	struct kthread_worker *worker;
+	struct mempower_work *mpwork;
+	struct pglist_data *pgdat;
+	struct task_struct *task;
+	unsigned long flags;
+	int i;
+
+	pgdat = NODE_DATA(nid);
+	worker = &pgdat->mempower_worker;
+
+	init_kthread_worker(worker);
+
+	task = kthread_create_on_node(kthread_worker_fn, worker, nid,
+				      "kmempowerd/%d", nid);
+	if (IS_ERR(task))
+		return;
+
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		mpwork = &pgdat->node_zones[i].mempower_work;
+		init_kthread_work(&mpwork->work, kmempowerd);
+
+		spin_lock_init(&mpwork->lock);
+
+		/* Initialize bitmap to zero to indicate no-pending-work */
+		spin_lock_irqsave(&mpwork->lock, flags);
+		bitmap_zero(mpwork->mempower_mask, nr_zone_region_bits);
+		spin_unlock_irqrestore(&mpwork->lock, flags);
+	}
+
+	wake_up_process(task);
+}
+
+int kmempowerd_init(void)
+{
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY)
+		kmempowerd_run(nid);
+
+	return 0;
+}
+module_init(kmempowerd_init);
 
 /* Compact all zones within a node */
 static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
* [RFC PATCH v4 38/40] mm: Add a mechanism to queue work to the kmempowerd kthread
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (36 preceding siblings ...)
  2013-09-25 23:22 ` [RFC PATCH v4 37/40] mm: Add a kthread to perform targeted compaction for memory power management Srivatsa S. Bhat
@ 2013-09-25 23:22 ` Srivatsa S. Bhat
  2013-09-25 23:22 ` [RFC PATCH v4 39/40] mm: Add intelligence in kmempowerd to ignore regions unsuitable for evacuation Srivatsa S. Bhat
                   ` (2 subsequent siblings)
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:22 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Now that we have a dedicated kthread in place to perform targeted region
evacuation, add and export a mechanism to queue work to the kthread.
Adding work to kmempowerd is very simple: just set the bits corresponding
to the region numbers that we want to evacuate, and queue the work item
to the kthread.
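As a rough user-space illustration of "set the bit, then queue the work
item", consider the sketch below. The names struct toy_mpwork and
toy_queue_mempower_work() are hypothetical, and the queued flag is only a
stand-in for the kthread-worker's notion of a pending work item (the model
assumes that queueing an already-pending item is a no-op that reports false).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-zone work container. */
struct toy_mpwork {
	unsigned long mask;	/* regions that should be evacuated */
	bool queued;		/* work item already pending on the worker? */
};

/*
 * Record the region in the mask and make sure the work item is pending.
 * Returns true if the work item was newly queued, false if it was already
 * pending (the bit is recorded in both cases).
 */
static bool toy_queue_mempower_work(struct toy_mpwork *w,
				    unsigned int region_id)
{
	assert(w != NULL);
	assert(region_id < sizeof(w->mask) * CHAR_BIT);

	w->mask |= 1UL << region_id;

	if (w->queued)
		return false;

	w->queued = true;
	return true;
}
```

The mask carries the actual requests, so queueing the same work item twice
loses nothing: the second caller's region is still visible to the worker.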
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/compaction.c |   26 ++++++++++++++++++++++++++
 mm/internal.h   |    3 +++
 2 files changed, 29 insertions(+)
diff --git a/mm/compaction.c b/mm/compaction.c
index 0511eae..b56be89 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1271,6 +1271,32 @@ int evacuate_mem_region(struct zone *z, struct zone_mem_region *zmr)
 #define nr_zone_region_bits	MAX_NR_ZONE_REGIONS
 static DECLARE_BITMAP(mpwork_mask, nr_zone_region_bits);
 
+void queue_mempower_work(struct pglist_data *pgdat, struct zone *zone,
+			 int region_id)
+{
+	struct mempower_work *mpwork;
+	unsigned long flags;
+
+	mpwork = &zone->mempower_work;
+	spin_lock_irqsave(&mpwork->lock, flags);
+	set_bit(region_id, mpwork->mempower_mask);
+	spin_unlock_irqrestore(&mpwork->lock, flags);
+
+	/*
+	 * The kmempowerd kthread will never miss the work we assign it,
+	 * due to the way queue_kthread_work() and kthread_worker_fn()
+	 * synchronize with each other. If the work is currently executing,
+	 * it gets requeued; but if it is pending, the kthread will naturally
+	 * process it in the future. Either way, it will notice and process
+	 * all the work submitted to it, and won't prematurely go to sleep.
+	 *
+	 * Note: The bits set in the mempower_mask represent the actual
+	 * "work" for the kthread. The work-struct is just a container used
+	 * to communicate that work to the kthread.
+	 */
+	queue_kthread_work(&pgdat->mempower_worker, &mpwork->work);
+}
+
 static void kmempowerd(struct kthread_work *work)
 {
 	struct mempower_work *mpwork;
diff --git a/mm/internal.h b/mm/internal.h
index acb50f8..3fbc9f6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -181,6 +181,9 @@ int compact_range(struct compact_control *cc, struct aggression_control *ac,
 		  struct free_page_control *fc, unsigned long start,
 		  unsigned long end);
 
+void queue_mempower_work(struct pglist_data *pgdat, struct zone *zone,
+			 int region_id);
+
 #endif
 
 /*
* [RFC PATCH v4 39/40] mm: Add intelligence in kmempowerd to ignore regions unsuitable for evacuation
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (37 preceding siblings ...)
  2013-09-25 23:22 ` [RFC PATCH v4 38/40] mm: Add a mechanism to queue work to the kmempowerd kthread Srivatsa S. Bhat
@ 2013-09-25 23:22 ` Srivatsa S. Bhat
  2013-09-25 23:22 ` [RFC PATCH v4 40/40] mm: Add triggers in the page-allocator to kick off region evacuation Srivatsa S. Bhat
  2013-09-25 23:26 ` [Results] [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:22 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Enhance kmempowerd to detect situations where evacuating a region would be
too costly or counter-productive, and skip those regions.
For example, if the region has a significant number of used pages (say more
than 32), then evacuation will involve more work and might not be justifiable.
A region with no pages in use has nothing to migrate, so it is skipped as well.
Also, compacting region 0 would be pointless, since that is the target of all
our compaction runs. Add these checks in the region-evacuator.
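To make the eligibility rule concrete, here is a minimal userspace sketch of
the same checks. It is not the kernel code: struct region_model and
model_should_evacuate() are hypothetical names invented for illustration, and
region 0 is identified by its index rather than by pointer comparison.

```c
#define MAX_MEMPWR_MIGRATE_PAGES	32

/* Hypothetical userspace stand-in for struct zone_mem_region. */
struct region_model {
	unsigned long present_pages;
	unsigned long nr_free;
};

/*
 * Returns 1 if the region is worth evacuating, 0 otherwise:
 *  - region 0 is the migration target, so it is never evacuated;
 *  - a region with no pages in use has nothing to migrate;
 *  - a region with more than MAX_MEMPWR_MIGRATE_PAGES pages in use
 *    is considered too costly to evacuate.
 */
int model_should_evacuate(int region_id, const struct region_model *r)
{
	unsigned long pages_in_use;

	if (region_id == 0)
		return 0;

	pages_in_use = r->present_pages - r->nr_free;

	return pages_in_use > 0 && pages_in_use <= MAX_MEMPWR_MIGRATE_PAGES;
}
```

Under this model, a 131072-page region with 131060 pages free (12 in use)
qualifies, a region with 98662 pages free does not, and region 0 never does.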
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |    2 ++
 mm/compaction.c        |   25 +++++++++++++++++++++++--
 mm/internal.h          |    2 ++
 3 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 257afdf..f383cc8d4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -84,6 +84,8 @@ static inline int get_pageblock_migratetype(struct page *page)
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
+#define MAX_MEMPWR_MIGRATE_PAGES	32
+
 struct mem_region_list {
 	struct list_head	*page_block;
 	unsigned long		nr_free;
diff --git a/mm/compaction.c b/mm/compaction.c
index b56be89..41585b0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1297,9 +1297,26 @@ void queue_mempower_work(struct pglist_data *pgdat, struct zone *zone,
 	queue_kthread_work(&pgdat->mempower_worker, &mpwork->work);
 }
 
+int should_evacuate_region(struct zone *z, struct zone_mem_region *region)
+{
+	unsigned long pages_in_use;
+
+	/* Don't try to evacuate region 0, since it's the target of migration */
+	if (region == z->zone_regions)
+		return 0;
+
+	pages_in_use = region->present_pages - region->nr_free;
+
+	if (pages_in_use > 0 && pages_in_use <= MAX_MEMPWR_MIGRATE_PAGES)
+		return 1;
+
+	return 0;
+}
+
 static void kmempowerd(struct kthread_work *work)
 {
 	struct mempower_work *mpwork;
+	struct zone_mem_region *zmr;
 	struct zone *zone;
 	unsigned long flags;
 	int region_id;
@@ -1315,8 +1332,12 @@ repeat:
 	if (bitmap_empty(mpwork_mask, nr_zone_region_bits))
 		return;
 
-	for_each_set_bit(region_id, mpwork_mask, nr_zone_region_bits)
-		evacuate_mem_region(zone, &zone->zone_regions[region_id]);
+	for_each_set_bit(region_id, mpwork_mask, nr_zone_region_bits) {
+		zmr = &zone->zone_regions[region_id];
+
+		if (should_evacuate_region(zone, zmr))
+			evacuate_mem_region(zone, zmr);
+	}
 
 	spin_lock_irqsave(&mpwork->lock, flags);
 
diff --git a/mm/internal.h b/mm/internal.h
index 3fbc9f6..5b4658c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -184,6 +184,8 @@ int compact_range(struct compact_control *cc, struct aggression_control *ac,
 void queue_mempower_work(struct pglist_data *pgdat, struct zone *zone,
 			 int region_id);
 
+int should_evacuate_region(struct zone *z, struct zone_mem_region *region);
+
 #endif
 
 /*
* [RFC PATCH v4 40/40] mm: Add triggers in the page-allocator to kick off region evacuation
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (38 preceding siblings ...)
  2013-09-25 23:22 ` [RFC PATCH v4 39/40] mm: Add intelligence in kmempowerd to ignore regions unsuitable for evacuation Srivatsa S. Bhat
@ 2013-09-25 23:22 ` Srivatsa S. Bhat
  2013-09-25 23:26 ` [Results] [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
  40 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:22 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel
Now that we have the entire infrastructure to perform targeted region
evacuation from a dedicated kthread (kmempowerd), modify the page-allocator
to invoke the region-evacuator at opportune points.
At a basic level, the most obvious opportunity to try region-evacuation is
when a page is freed back to the page-allocator. The rationale behind this is
explained below.
The page-allocator already has the intelligence to allocate pages such that
they are consolidated within as few regions as possible. That is, due to the
sorted-buddy design, it will _not_ spill allocations to a new region as long
as there is still memory available in lower-numbered regions to satisfy the
allocation request.
So, the fragmentation happens _after_ the pages are allocated, i.e., once
their owner starts freeing the memory in a random order. This freeing of pages
presents an opportunity to the MM subsystem: if the pages freed belong to
lower-numbered regions, then there is a chance that pages from higher-numbered
regions could be moved to these freshly freed pages, thereby causing further
consolidation of regions.
With this in mind, add the region-evac trigger in the page-freeing path.
Along with that, also add the checks necessary to avoid compaction attempts
that don't provide any net benefit. For example, we can avoid compacting
regions in ZONE_DMA, or cases where the freed page's pageblock is neither
MIGRATE_MOVABLE nor MIGRATE_RECLAIMABLE (e.g., mostly MIGRATE_UNMOVABLE
allocations). We also skip evacuation until the system has finished booting.
These checks are best done on the page-allocator side. Apart from them, also
perform the same eligibility checks that the region-evacuator employs, to
avoid useless wakeups of kmempowerd.
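The order of the checks in the page-freeing path can be sketched as below.
This is a simplified userspace model under stated assumptions, not the patch
itself: struct free_ctx, its fields and model_free_path() are hypothetical,
and the region_returnable flag merely stands in for the result of
can_return_region().

```c
#define MAX_MEMPWR_MIGRATE_PAGES	32

enum free_path_action {
	ACT_RETURN_REGION,	/* hand the region back to the region allocator */
	ACT_QUEUE_EVAC,		/* queue evacuation work for kmempowerd */
	ACT_DO_NOTHING,		/* not worth waking up kmempowerd */
};

/* Hypothetical snapshot of the state seen when a page is freed. */
struct free_ctx {
	int system_running;		/* boot has completed */
	int zone_is_dma;		/* page belongs to ZONE_DMA */
	int movable_or_reclaimable;	/* pageblock migratetype check */
	int region_returnable;		/* stand-in for can_return_region() */
	int region_id;
	unsigned long pages_in_use;	/* present pages minus free pages */
};

enum free_path_action model_free_path(const struct free_ctx *c)
{
	/* A region that can be returned needs no evacuation at all. */
	if (c->region_returnable)
		return ACT_RETURN_REGION;

	/* Cheap allocator-side filters come first. */
	if (!c->system_running || c->zone_is_dma || !c->movable_or_reclaimable)
		return ACT_DO_NOTHING;

	/* Then the same eligibility checks the region-evacuator applies. */
	if (c->region_id == 0)
		return ACT_DO_NOTHING;
	if (c->pages_in_use == 0 || c->pages_in_use > MAX_MEMPWR_MIGRATE_PAGES)
		return ACT_DO_NOTHING;

	return ACT_QUEUE_EVAC;
}
```

Running the cheap filters before the per-region check means that most frees
are rejected without ever looking at the region's page counts.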
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 mm/page_alloc.c |   38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4571d30..48b748e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -639,6 +639,29 @@ out:
 static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 				    int region_id);
 
+static inline int region_is_evac_candidate(struct zone *z,
+					   struct zone_mem_region *region,
+					   int migratetype)
+{
+
+	/* Don't start evacuation too early during boot */
+	if (system_state != SYSTEM_RUNNING)
+		return 0;
+
+	/* Don't bother evacuating regions in ZONE_DMA */
+	if (zone_idx(z) == ZONE_DMA)
+		return 0;
+
+	/*
+	 * Don't try evacuations in regions not containing MOVABLE or
+	 * RECLAIMABLE allocations.
+	 */
+	if (!(migratetype == MIGRATE_MOVABLE ||
+		migratetype == MIGRATE_RECLAIMABLE))
+		return 0;
+
+	return should_evacuate_region(z, region);
+}
 
 static inline int can_return_region(struct mem_region_list *region, int order,
 				    struct free_list *free_list)
@@ -683,7 +706,9 @@ static void add_to_freelist(struct page *page, struct free_list *free_list,
 {
 	struct list_head *prev_region_list, *lru;
 	struct mem_region_list *region;
-	int region_id, prev_region_id;
+	int region_id, prev_region_id, migratetype;
+	struct zone *zone;
+	struct pglist_data *pgdat;
 
 	lru = &page->lru;
 	region_id = page_zone_region_id(page);
@@ -741,8 +766,17 @@ try_return_region:
 	 * Try to return the freepages of a memory region to the region
 	 * allocator, if possible.
 	 */
-	if (can_return_region(region, order, free_list))
+	if (can_return_region(region, order, free_list)) {
 		add_to_region_allocator(page_zone(page), free_list, region_id);
+		return;
+	}
+
+	zone = page_zone(page);
+	migratetype = get_pageblock_migratetype(page);
+	pgdat = NODE_DATA(page_to_nid(page));
+
+	if (region_is_evac_candidate(zone, region->zone_region, migratetype))
+		queue_mempower_work(pgdat, zone, region_id);
 }
 
 /*
* [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
                   ` (39 preceding siblings ...)
  2013-09-25 23:22 ` [RFC PATCH v4 40/40] mm: Add triggers in the page-allocator to kick off region evacuation Srivatsa S. Bhat
@ 2013-09-25 23:26 ` Srivatsa S. Bhat
  2013-09-25 23:40   ` Andrew Morton
  40 siblings, 1 reply; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-25 23:26 UTC (permalink / raw)
  To: akpm, mgorman, dave, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: Srivatsa S. Bhat, gargankita, paulmck, svaidy, andi,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel
Experimental Results:
====================
Test setup:
----------
x86 Sandy Bridge dual-socket quad-core HT-enabled machine, with 128GB RAM.
Memory Region size = 512MB.
Testcase:
--------
Strategy:
Try to allocate and free large chunks of memory (comparable to the memory
region size) in multiple threads, and examine the number of completely free
memory regions at the end of the run (when all the memory is freed). (Note
that we don't create any pagecache usage here).
Implementation:
Run 20 instances of multi-threaded ebizzy in parallel, with chunksize=256MB,
and no. of threads=32. This means, potentially 20 * 32 threads can allocate/free
memory in parallel, and each alloc/free size will be 256MB, which is half of
the memory region size.
Cmd-line of each ebizzy instance: ./ebizzy -s 268435456 -n 2 -t 32 -S 60
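The sizes involved in this setup can be cross-checked with a few lines of
arithmetic. The sketch below is only illustrative: the helper names are
hypothetical, and the 4KB base page size is an assumption (it is the usual
x86 configuration, but it is not stated above).

```c
#include <stdint.h>

/* Assumption: 4KB base pages. */
#define BASE_PAGE_BYTES	4096ULL

/* Number of base pages in a memory region of the given size. */
uint64_t pages_per_region(uint64_t region_bytes)
{
	return region_bytes / BASE_PAGE_BYTES;
}

/* Upper bound on the number of regions that fit in the given amount of RAM. */
uint64_t max_regions(uint64_t ram_bytes, uint64_t region_bytes)
{
	return ram_bytes / region_bytes;
}

/* Chunk size expressed as a percentage of the region size. */
unsigned int chunk_percent_of_region(uint64_t chunk_bytes,
				     uint64_t region_bytes)
{
	return (unsigned int)(chunk_bytes * 100 / region_bytes);
}
```

For the setup above, pages_per_region(512ULL << 20) gives 131072 pages,
max_regions(128ULL << 30, 512ULL << 20) gives at most 256 regions, and
chunk_percent_of_region(268435456ULL, 512ULL << 20) gives 50, i.e. the
ebizzy chunk size passed via -s is exactly half a region.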
Effectiveness in consolidating allocations:
------------------------------------------
With the above test case, the higher the number of completely free memory
regions at the end of the run, the better the memory management algorithm is
at consolidating allocations.
Here are the results, with vanilla 3.12-rc2 and with this patchset applied:
                  Free regions at test-start   Free regions after test-run
Without patchset               214                          8
With patchset                  210                        202
This shows that this patchset does far better than the vanilla kernel at
keeping memory allocations consolidated to a minimum no. of memory regions:
202 of 210 regions are completely free after the run, versus 8 of 214 on
vanilla. Note that the amount of memory consumed at the end of the run is 0,
so it shows the drastic extent to which the mainline kernel can fragment
memory by spreading a handful of pages across many memory regions.
And since this patchset teaches the kernel to understand the memory region
granularity/boundaries and influences the MM decisions effectively, it shows
a significant improvement over mainline.
Below is the log of the variation of the no. of completely free regions
from the beginning to the end of the test, at 1 second intervals (total
test-run takes 1 minute).
         Vanilla 3.12-rc2         With this patchset
                214	                210
                214	                210
                214	                210
                214	                210
                214	                210
                210	                208
                194	                194
                165	                165
                117	                145
                87	                115
                34	                82
                21	                57
                11	                37
                4	                27
                4	                13
                4	                9
                4	                5
                4	                5
                4	                5
                4	                5
                4	                5
                4	                5
                4	                5
                4	                5
                4	                5
                4	                6
                4	                7
                4	                9
                4	                9
                4	                9
                4	                9
                4	                13
                4	                15
                4	                18
                4	                19
                4	                21
                4	                22
                4	                22
                4	                25
                4	                26
                4	                26
                4	                28
                4	                28
                4	                29
                4	                29
                4	                29
                4	                31
                4	                75
                4	                144
                4	                150
                4	                150
                4	                154
                4	                154
                4	                154
                4	                156
                4	                157
                4	                157
                4	                157
                4	                162
                4	                163
                4	                163
                4	                163
                4	                163
                4	                163
                4	                163
                4	                164
                4	                166
                4	                166
                4	                166
                4	                166
                4	                167
                4	                167
                4	                167
                4	                167
                4	                167
                8	                202
It is interesting to also examine the fragmentation of memory by
looking at the per-region statistics added by this patchset.
Statistics for vanilla 3.12-rc2 kernel:
======================================
We can see from the statistics that there is a lot of fragmentation
among the MOVABLE migratetype.
Node 0, zone   Normal
  pages free     15808914
        min      5960
        low      7450
        high     8940
        scanned  0
        spanned  16252928
        present  16252928
        managed  15989606
Per-region page stats	 present	 free
	Region      0 	      1 	   1024
	Region      1 	 131072 	 130935
	Region      2 	 131072 	 130989
	Region      3 	 131072 	 130958
	Region      4 	 131072 	 130958
	Region      5 	 131072 	 130945
	Region      6 	 131072 	 130413
	Region      7 	 131072 	 130493
	Region      8 	 131072 	 131801
	Region      9 	 131072 	 130974
	Region     10 	 131072 	 130969
	Region     11 	 131072 	 130007
	Region     12 	 131072 	 131329
	Region     13 	 131072 	 131513
	Region     14 	 131072 	 130988
	Region     15 	 131072 	 130986
	Region     16 	 131072 	 130992
	Region     17 	 131072 	 130962
	Region     18 	 131072 	 130187
	Region     19 	 131072 	 131729
	Region     20 	 131072 	 130875
	Region     21 	 131072 	 130968
	Region     22 	 131072 	 130961
	Region     23 	 131072 	 130966
	Region     24 	 131072 	 130950
	Region     25 	 131072 	 130915
	Region     26 	 131072 	 130438
	Region     27 	 131072 	 130563
	Region     28 	 131072 	 131831
	Region     29 	 131072 	 130109
	Region     30 	 131072 	 131899
	Region     31 	 131072 	 130949
	Region     32 	 131072 	 130975
	Region     33 	 131072 	 130444
	Region     34 	 131072 	 131478
	Region     35 	 131072 	 131002
	Region     36 	 131072 	 130976
	Region     37 	 131072 	 130950
	Region     38 	 131072 	 130222
	Region     39 	 131072 	 130965
	Region     40 	 131072 	 130820
	Region     41 	 131072 	 131332
	Region     42 	 131072 	 130970
	Region     43 	 131072 	 131485
	Region     44 	 131072 	 130964
	Region     45 	 131072 	 130993
	Region     46 	 131072 	 130966
	Region     47 	 131072 	 130907
	Region     48 	 131072 	 130965
	Region     49 	 131072 	 129989
	Region     50 	 131072 	 131912
	Region     51 	 131072 	 130980
	Region     52 	 131072 	 130970
	Region     53 	 131072 	 130962
	Region     54 	 131072 	 130962
	Region     55 	 131072 	 130984
	Region     56 	 131072 	 131000
	Region     57 	 131072 	 130186
	Region     58 	 131072 	 131717
	Region     59 	 131072 	 130942
	Region     60 	 131072 	 130983
	Region     61 	 131072 	 130440
	Region     62 	 131072 	 131504
	Region     63 	 131072 	 130947
	Region     64 	 131072 	 130947
	Region     65 	 131072 	 130977
	Region     66 	 131072 	 130950
	Region     67 	 131072 	 130201
	Region     68 	 131072 	 130948
	Region     69 	 131072 	 131749
	Region     70 	 131072 	 130986
	Region     71 	 131072 	 130406
	Region     72 	 131072 	 131469
	Region     73 	 131072 	 130964
	Region     74 	 131072 	 130983
	Region     75 	 131072 	 130942
	Region     76 	 131072 	 130470
	Region     77 	 131072 	 130980
	Region     78 	 131072 	 130599
	Region     79 	 131072 	 131880
	Region     80 	 131072 	 130961
	Region     81 	 131072 	 130979
	Region     82 	 131072 	 130991
	Region     83 	 131072 	 130136
	Region     84 	 131072 	 130878
	Region     85 	 131072 	 131867
	Region     86 	 131072 	 130994
	Region     87 	 131072 	 130465
	Region     88 	 131072 	 131488
	Region     89 	 131072 	 130937
	Region     90 	 131072 	 130954
	Region     91 	 131072 	 129897
	Region     92 	 131072 	 131970
	Region     93 	 131072 	 130967
	Region     94 	 131072 	 130941
	Region     95 	 131072 	 130191
	Region     96 	 131072 	 130967
	Region     97 	 131072 	 131182
	Region     98 	 131072 	 131494
	Region     99 	 131072 	 130911
	Region    100 	 131072 	 130832
	Region    101 	 131072 	 130445
	Region    102 	 131072 	 130488
	Region    103 	 131072 	 131951
	Region    104 	 131072 	 130937
	Region    105 	 131072 	 130162
	Region    106 	 131072 	 131724
	Region    107 	 131072 	 130954
	Region    108 	 131072 	 130383
	Region    109 	 131072 	 130477
	Region    110 	 131072 	 132062
	Region    111 	 131072 	 131039
	Region    112 	 131072 	 130960
	Region    113 	 131072 	 131062
	Region    114 	 131072 	 129938
	Region    115 	 131072 	 131989
	Region    116 	 131072 	 130903
	Region    117 	 131072 	 131020
	Region    118 	 131072 	 131032
	Region    119 	 131072 	  98662
	Region    120 	 131072 	 115369
	Region    121 	 131072 	 107352
	Region    122 	 131072 	  33060
	Region    123 	 131072 	      0
	Region    124 	 131071 	     67
Page block order: 10
Pages per block:  1024
Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      2      2      1      3      2      0      0      1      1      0
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      2
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable    543    554    431    220     94     18      3      1      0      0      0
Node    0, zone    DMA32, type  Reclaimable      0      1      0     16      8      1      0      1      0      0      0
Node    0, zone    DMA32, type      Movable    754    826    846    811    792    748    659    528    364    168    100
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable    567   1880   2103    849    276     34      0      0      0      0      0
Node    0, zone   Normal, type  Reclaimable      1    512    363    237     97     16      7      1      0      0      0
Node    0, zone   Normal, type      Movable   8383  13055  14648  13112  11081   9161   7898   6882   5694   4630   9660
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      2
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, R  0      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  1      Movable     75     82     80     85     80     73     76     71     62     48     69 
Node    0, zone   Normal, R  2      Movable     43     49     54     55     55     55     55     45     46     40     84 
Node    0, zone   Normal, R  3      Movable     43     50     62     61     60     53     55     54     47     35     85 
Node    0, zone   Normal, R  4      Movable     40     53     59     58     58     59     57     56     45     35     85 
Node    0, zone   Normal, R  5      Movable     41     50     49     48     47     44     45     40     37     30     93 
Node    0, zone   Normal, R  6      Movable     73     86     92     89     84     82     79     66     56     46     72 
Node    0, zone   Normal, R  7      Movable     73     84     79     86     78     80     80     74     65     48     68 
Node    0, zone   Normal, R  8      Movable     59     71     78     77     85     85     84     71     56     43     74 
Node    0, zone   Normal, R  9      Movable     58     64     67     69     69     65     65     64     59     48     73 
Node    0, zone   Normal, R 10      Movable     53     58     62     61     63     63     61     52     49     45     79 
Node    0, zone   Normal, R 11      Movable     55     64     68     62     60     61     61     57     47     39     81 
Node    0, zone   Normal, R 12      Movable     63     97     98     99     92     93     88     84     68     43     68 
Node    0, zone   Normal, R 13      Movable     29     36     39     39     38     37     36     37     35     27     97 
Node    0, zone   Normal, R 14      Movable     40     46     54     54     52     53     53     51     46     37     85 
Node    0, zone   Normal, R 15      Movable     34     46     51     48     48     47     46     45     42     36     88 
Node    0, zone   Normal, R 16      Movable     46     57     56     58     58     56     55     56     50     39     82 
Node    0, zone   Normal, R 17      Movable     40     45     52     52     54     52     53     47     38     34     89 
Node    0, zone   Normal, R 18      Movable     47     54     62     57     57     55     55     50     40     36     86 
Node    0, zone   Normal, R 19      Movable     49     54     63     61     63     61     60     59     47     48     78 
Node    0, zone   Normal, R 20      Movable     65     83     87     79     74     69     71     65     43     42     79 
Node    0, zone   Normal, R 21      Movable     54     67     73     69     71     71     69     60     51     42     78 
Node    0, zone   Normal, R 22      Movable     51     59     60     65     65     61     57     54     49     41     81 
Node    0, zone   Normal, R 23      Movable     48     55     62     58     59     56     54     50     45     33     87 
Node    0, zone   Normal, R 24      Movable     54     68     70     74     70     68     68     65     57     42     76 
Node    0, zone   Normal, R 25      Movable     60     70     81     79     78     72     70     69     53     46     74 
Node    0, zone   Normal, R 26      Movable     60     69     78     77     78     76     75     66     59     54     68 
Node    0, zone   Normal, R 27      Movable     49     65     72     72     65     69     67     63     47     45     77 
Node    0, zone   Normal, R 28      Movable     67     82     90     89     84     79     79     72     61     43     73 
Node    0, zone   Normal, R 29      Movable     43     47     47     47     46     47     47     46     36     29     92 
Node    0, zone   Normal, R 30      Movable     42     45     46     48     48     44     46     45     44     39     87 
Node    0, zone   Normal, R 31      Movable     77     80     84     79     77     76     72     63     55     42     76 
Node    0, zone   Normal, R 32      Movable     63     68     68     69     68     67     60     62     51     43     78 
Node    0, zone   Normal, R 33      Movable     52     66     71     71     70     69     70     66     58     36     78 
Node    0, zone   Normal, R 34      Movable     48     55     58     60     57     57     56     53     45     41     83 
Node    0, zone   Normal, R 35      Movable     40     47     49     50     50     50     48     45     41     30     91 
Node    0, zone   Normal, R 36      Movable     40     50     53     52     54     46     50     46     38     37     88 
Node    0, zone   Normal, R 37      Movable     56     77     79     75     72     75     71     67     54     44     75 
Node    0, zone   Normal, R 38      Movable     38     48     52     51     50     45     40     38     40     34     90 
Node    0, zone   Normal, R 39      Movable     59     69     68     70     69     68     65     59     47     41     80 
Node    0, zone   Normal, R 40      Movable     34     43     45     43     42     43     44     41     37     34     91 
Node    0, zone   Normal, R 41      Movable     62     75     86     91     88     84     78     73     52     42     75 
Node    0, zone   Normal, R 42      Movable     44     59     56     61     61     55     54     52     42     40     84 
Node    0, zone   Normal, R 43      Movable     45     48     50     53     54     52     49     49     34     35     90 
Node    0, zone   Normal, R 44      Movable     58     71     69     67     66     66     63     63     52     44     77 
Node    0, zone   Normal, R 45      Movable     43     51     54     55     53     50     48     48     43     34     88 
Node    0, zone   Normal, R 46      Movable     52     65     68     70     68     67     68     66     47     47     76 
Node    0, zone   Normal, R 47      Movable     61     65     69     75     71     70     68     64     55     43     76 
Node    0, zone   Normal, R 48      Movable     51     63     69     66     62     61     61     62     52     39     80 
Node    0, zone   Normal, R 49      Movable     51     61     68     69     68     69     64     54     54     41     78 
Node    0, zone   Normal, R 50      Movable     64     76     76     76     76     73     66     67     53     45     76 
Node    0, zone   Normal, R 51      Movable     44     52     58     59     57     56     55     48     48     44     81 
Node    0, zone   Normal, R 52      Movable     48     61     68     68     64     64     63     56     46     45     79 
Node    0, zone   Normal, R 53      Movable     57     69     72     68     67     65     65     60     49     42     79 
Node    0, zone   Normal, R 54      Movable     66     82     83     80     78     78     77     64     59     45     73 
Node    0, zone   Normal, R 55      Movable     44     52     55     51     54     48     49     48     47     38     85 
Node    0, zone   Normal, R 56      Movable     42     47     50     49     48     49     47     46     45     34     88 
Node    0, zone   Normal, R 57      Movable     62     72     73     75     74     75     69     66     54     41     76 
Node    0, zone   Normal, R 58      Movable     63     75     74     71     71     69     67     67     59     42     76 
Node    0, zone   Normal, R 59      Movable     50     68     67     71     66     65     65     64     51     40     79 
Node    0, zone   Normal, R 60      Movable     53     59     63     60     58     56     58     52     47     39     83 
Node    0, zone   Normal, R 61      Movable     58     69     77     70     66     68     65     65     44     42     79 
Node    0, zone   Normal, R 62      Movable     40     46     51     50     51     49     50     50     42     35     88 
Node    0, zone   Normal, R 63      Movable     55     64     67     72     68     68     65     65     52     47     75 
Node    0, zone   Normal, R 64      Movable     47     58     68     66     62     61     59     57     53     42     79 
Node    0, zone   Normal, R 65      Movable     53     62     62     61     63     61     58     50     51     41     81 
Node    0, zone   Normal, R 66      Movable     56     65     75     74     73     74     72     65     59     48     72 
Node    0, zone   Normal, R 67      Movable     43     53     53     54     54     51     49     43     44     31     89 
Node    0, zone   Normal, R 68      Movable     74     77     82     85     77     79     76     76     57     49     70 
Node    0, zone   Normal, R 69      Movable     49     54     62     62     60     61     61     57     48     44     80 
Node    0, zone   Normal, R 70      Movable     64     65     66     64     64     63     64     62     51     37     81 
Node    0, zone   Normal, R 71      Movable     58     76     83     81     78     79     72     66     57     49     71 
Node    0, zone   Normal, R 72      Movable     45     56     64     66     64     63     62     59     53     38     81 
Node    0, zone   Normal, R 73      Movable     54     67     72     73     69     63     67     65     64     45     73 
Node    0, zone   Normal, R 74      Movable     49     59     62     61     60     53     57     53     43     37     85 
Node    0, zone   Normal, R 75      Movable     68     77     86     87     81     72     74     68     58     45     73 
Node    0, zone   Normal, R 76      Movable     56     63     66     61     64     60     60     59     54     38     80 
Node    0, zone   Normal, R 77      Movable     40     56     61     59     56     58     58     52     45     38     84 
Node    0, zone   Normal, R 78      Movable     35     44     49     49     48     49     49     50     45     32     88 
Node    0, zone   Normal, R 79      Movable     52     56     59     55     50     52     53     46     42     34     89 
Node    0, zone   Normal, R 80      Movable     60     65     75     73     64     69     65     65     56     43     76 
Node    0, zone   Normal, R 81      Movable     37     49     53     53     52     53     47     48     41     39     86 
Node    0, zone   Normal, R 82      Movable     55     58     63     61     60     61     59     60     54     41     79 
Node    0, zone   Normal, R 83      Movable     64     84     98     87     93     87     86     82     64     48     66 
Node    0, zone   Normal, R 84      Movable     37     47     49     49     49     49     47     47     40     36     88 
Node    0, zone   Normal, R 85      Movable     40     50     58     57     56     53     51     46     38     34     90 
Node    0, zone   Normal, R 86      Movable     50     56     58     57     54     56     56     54     47     47     79 
Node    0, zone   Normal, R 87      Movable     35     51     54     48     50     49     46     44     38     33     90 
Node    0, zone   Normal, R 88      Movable     60     60     67     68     68     64     64     61     51     44     78 
Node    0, zone   Normal, R 89      Movable     59     89     83     84     84     81     81     80     63     50     67 
Node    0, zone   Normal, R 90      Movable     44     61     63     65     64     63     62     55     57     48     75 
Node    0, zone   Normal, R 91      Movable     63     73     78     80     74     72     73     68     55     55     68 
Node    0, zone   Normal, R 92      Movable     58     70     75     74     76     74     75     67     53     52     72 
Node    0, zone   Normal, R 93      Movable     53     67     69     67     65     63     63     54     53     34     83 
Node    0, zone   Normal, R 94      Movable     69     82     85     84     84     83     84     80     64     49     67 
Node    0, zone   Normal, R 95      Movable     67     74     78     76     78     72     69     66     52     48     73 
Node    0, zone   Normal, R 96      Movable     49     61     67     68     68     68     64     64     55     42     77 
Node    0, zone   Normal, R 97      Movable     78     88     96     94     90     89     85     68     57     49     70 
Node    0, zone   Normal, R 98      Movable     58     70     70     67     65     67     63     63     56     35     81 
Node    0, zone   Normal, R 99      Movable     55     66     81     80     80     75     76     69     59     46     72 
Node    0, zone   Normal, R100      Movable     62     81     86     81     77     74     71     69     56     50     71 
Node    0, zone   Normal, R101      Movable     67     83     83     81     79     75     76     69     57     44     73 
Node    0, zone   Normal, R102      Movable     52     58     68     68     64     65     59     50     46     32     86 
Node    0, zone   Normal, R103      Movable     77     85     82     86     82     75     76     64     53     46     75 
Node    0, zone   Normal, R104      Movable     69     82     92     92     92     90     89     80     69     47     66 
Node    0, zone   Normal, R105      Movable     76     81     83     89     89     85     87     75     54     53     67 
Node    0, zone   Normal, R106      Movable     75     85     90     88     89     83     78     72     46     38     79 
Node    0, zone   Normal, R107      Movable     50     66     69     70     67     69     65     65     46     40     80 
Node    0, zone   Normal, R108      Movable     77     95     95     97     90     89     88     74     54     45     71 
Node    0, zone   Normal, R109      Movable     47     65     67     66     64     63     62     61     50     37     81 
Node    0, zone   Normal, R110      Movable     16     17     17     15     13     15     15     15     11      9    118 
Node    0, zone   Normal, R111      Movable     31     32     32     32     32     32     32     30     19     13    109 
Node    0, zone   Normal, R112      Movable     30     29     40     33     25     28     28     25     19     15    109 
Node    0, zone   Normal, R113      Movable     10     10     10      8      9      9      9      9      9      7    120 
Node    0, zone   Normal, R114      Movable     46     46     48     43     47     44     42     42     31     23     97 
Node    0, zone   Normal, R115      Movable     27     29     28     26     24     24     24     23     20     20    108 
Node    0, zone   Normal, R116      Movable     43     46     42     43     41     44     45     39     26     12    105 
Node    0, zone   Normal, R117      Movable     38     37     39     36     32     33     28     27     29     19    104 
Node    0, zone   Normal, R118      Movable     30     31     29     27     21     23     20     20     17     13    112 
Node    0, zone   Normal, R119      Movable    340   1039   1218    960    611    261     94     36     14      7     37 
Node    0, zone   Normal, R120      Movable    514   2102   2401   2023   1473    838    348     76      8      0      0 
Node    0, zone   Normal, R121      Movable   1034   2065   2561   1913   1163    574    235     62      9      0      0 
Node    0, zone   Normal, R122      Movable    361    571    734    560    363    181     63      7      1      0      0 
Node    0, zone   Normal, R123      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R124      Movable      3      2      3      2      0      1      0      0      0      0      0 
Number of blocks type         Unmovable  Reclaimable      Movable      Reserve      Isolate 
Node 0, zone      DMA                 1            0            2            1            0 
Node 0, zone    DMA32                10            2          495            1            0 
Node 0, zone   Normal               121           41        15708            2            0 
Node 0, zone   Normal R  0            0            0            0            1            0 
Node 0, zone   Normal R  1            0            0          126            2            0 
Node 0, zone   Normal R  2            0            0          128            0            0 
Node 0, zone   Normal R  3            0            0          128            0            0 
Node 0, zone   Normal R  4            0            0          128            0            0 
Node 0, zone   Normal R  5            0            0          128            0            0 
Node 0, zone   Normal R  6            0            0          128            0            0 
Node 0, zone   Normal R  7            0            0          128            0            0 
Node 0, zone   Normal R  8            0            0          128            0            0 
Node 0, zone   Normal R  9            0            0          128            0            0 
Node 0, zone   Normal R 10            0            0          128            0            0 
Node 0, zone   Normal R 11            0            0          128            0            0 
Node 0, zone   Normal R 12            0            0          128            0            0 
Node 0, zone   Normal R 13            0            0          128            0            0 
Node 0, zone   Normal R 14            0            0          128            0            0 
Node 0, zone   Normal R 15            0            0          128            0            0 
Node 0, zone   Normal R 16            0            0          128            0            0 
Node 0, zone   Normal R 17            0            0          128            0            0 
Node 0, zone   Normal R 18            0            0          128            0            0 
Node 0, zone   Normal R 19            0            0          128            0            0 
Node 0, zone   Normal R 20            0            0          128            0            0 
Node 0, zone   Normal R 21            0            0          128            0            0 
Node 0, zone   Normal R 22            0            0          128            0            0 
Node 0, zone   Normal R 23            0            0          128            0            0 
Node 0, zone   Normal R 24            0            0          128            0            0 
Node 0, zone   Normal R 25            0            0          128            0            0 
Node 0, zone   Normal R 26            0            0          128            0            0 
Node 0, zone   Normal R 27            0            0          128            0            0 
Node 0, zone   Normal R 28            0            0          128            0            0 
Node 0, zone   Normal R 29            0            0          128            0            0 
Node 0, zone   Normal R 30            0            0          128            0            0 
Node 0, zone   Normal R 31            0            0          128            0            0 
Node 0, zone   Normal R 32            0            0          128            0            0 
Node 0, zone   Normal R 33            0            0          128            0            0 
Node 0, zone   Normal R 34            0            0          128            0            0 
Node 0, zone   Normal R 35            0            0          128            0            0 
Node 0, zone   Normal R 36            0            0          128            0            0 
Node 0, zone   Normal R 37            0            0          128            0            0 
Node 0, zone   Normal R 38            0            0          128            0            0 
Node 0, zone   Normal R 39            0            0          128            0            0 
Node 0, zone   Normal R 40            0            0          128            0            0 
Node 0, zone   Normal R 41            0            0          128            0            0 
Node 0, zone   Normal R 42            0            0          128            0            0 
Node 0, zone   Normal R 43            0            0          128            0            0 
Node 0, zone   Normal R 44            0            0          128            0            0 
Node 0, zone   Normal R 45            0            0          128            0            0 
Node 0, zone   Normal R 46            0            0          128            0            0 
Node 0, zone   Normal R 47            0            0          128            0            0 
Node 0, zone   Normal R 48            0            0          128            0            0 
Node 0, zone   Normal R 49            0            0          128            0            0 
Node 0, zone   Normal R 50            0            0          128            0            0 
Node 0, zone   Normal R 51            0            0          128            0            0 
Node 0, zone   Normal R 52            0            0          128            0            0 
Node 0, zone   Normal R 53            0            0          128            0            0 
Node 0, zone   Normal R 54            0            0          128            0            0 
Node 0, zone   Normal R 55            0            0          128            0            0 
Node 0, zone   Normal R 56            0            0          128            0            0 
Node 0, zone   Normal R 57            0            0          128            0            0 
Node 0, zone   Normal R 58            0            0          128            0            0 
Node 0, zone   Normal R 59            0            0          128            0            0 
Node 0, zone   Normal R 60            0            0          128            0            0 
Node 0, zone   Normal R 61            0            0          128            0            0 
Node 0, zone   Normal R 62            0            0          128            0            0 
Node 0, zone   Normal R 63            0            0          128            0            0 
Node 0, zone   Normal R 64            0            0          128            0            0 
Node 0, zone   Normal R 65            0            0          128            0            0 
Node 0, zone   Normal R 66            0            0          128            0            0 
Node 0, zone   Normal R 67            0            0          128            0            0 
Node 0, zone   Normal R 68            0            0          128            0            0 
Node 0, zone   Normal R 69            0            0          128            0            0 
Node 0, zone   Normal R 70            0            0          128            0            0 
Node 0, zone   Normal R 71            0            0          128            0            0 
Node 0, zone   Normal R 72            0            0          128            0            0 
Node 0, zone   Normal R 73            0            0          128            0            0 
Node 0, zone   Normal R 74            0            0          128            0            0 
Node 0, zone   Normal R 75            0            0          128            0            0 
Node 0, zone   Normal R 76            0            0          128            0            0 
Node 0, zone   Normal R 77            0            0          128            0            0 
Node 0, zone   Normal R 78            0            0          128            0            0 
Node 0, zone   Normal R 79            0            0          128            0            0 
Node 0, zone   Normal R 80            0            0          128            0            0 
Node 0, zone   Normal R 81            0            0          128            0            0 
Node 0, zone   Normal R 82            0            0          128            0            0 
Node 0, zone   Normal R 83            0            0          128            0            0 
Node 0, zone   Normal R 84            0            0          128            0            0 
Node 0, zone   Normal R 85            0            0          128            0            0 
Node 0, zone   Normal R 86            0            0          128            0            0 
Node 0, zone   Normal R 87            0            0          128            0            0 
Node 0, zone   Normal R 88            0            0          128            0            0 
Node 0, zone   Normal R 89            0            0          128            0            0 
Node 0, zone   Normal R 90            0            0          128            0            0 
Node 0, zone   Normal R 91            0            0          128            0            0 
Node 0, zone   Normal R 92            0            0          128            0            0 
Node 0, zone   Normal R 93            0            0          128            0            0 
Node 0, zone   Normal R 94            0            0          128            0            0 
Node 0, zone   Normal R 95            0            0          128            0            0 
Node 0, zone   Normal R 96            0            0          128            0            0 
Node 0, zone   Normal R 97            0            0          128            0            0 
Node 0, zone   Normal R 98            0            0          128            0            0 
Node 0, zone   Normal R 99            0            0          128            0            0 
Node 0, zone   Normal R100            0            0          128            0            0 
Node 0, zone   Normal R101            0            0          128            0            0 
Node 0, zone   Normal R102            0            0          128            0            0 
Node 0, zone   Normal R103            0            0          128            0            0 
Node 0, zone   Normal R104            0            0          128            0            0 
Node 0, zone   Normal R105            0            0          128            0            0 
Node 0, zone   Normal R106            0            0          128            0            0 
Node 0, zone   Normal R107            0            0          128            0            0 
Node 0, zone   Normal R108            0            0          128            0            0 
Node 0, zone   Normal R109            0            0          128            0            0 
Node 0, zone   Normal R110            0            0          128            0            0 
Node 0, zone   Normal R111            0            0          128            0            0 
Node 0, zone   Normal R112            0            0          128            0            0 
Node 0, zone   Normal R113            0            0          128            0            0 
Node 0, zone   Normal R114            0            0          128            0            0 
Node 0, zone   Normal R115            0            0          128            0            0 
Node 0, zone   Normal R116            0            0          128            0            0 
Node 0, zone   Normal R117            0            0          128            0            0 
Node 0, zone   Normal R118            0            0          128            0            0 
Node 0, zone   Normal R119           15           20           93            0            0 
Node 0, zone   Normal R120            3            2          123            0            0 
Node 0, zone   Normal R121           22            2          104            0            0 
Node 0, zone   Normal R122           81           17           30            0            0 
Node 0, zone   Normal R123            0            0          128            0            0 
Node 0, zone   Normal R124            0            0          128            0            0 
Statistics with this patchset applied:
=====================================
Comparing these statistics with those of the vanilla kernel, we see that
fragmentation is significantly lower, as seen in the MOVABLE migratetype.
Node 0, zone   Normal
  pages free     15754148
        min      5960
        low      7450
        high     8940
        scanned  0
        spanned  16252928
        present  16252928
        managed  15989474
Per-region page stats	 present	 free
	Region      0 	      1 	   1024
	Region      1 	 131072 	  24206
	Region      2 	 131072 	  85728
	Region      3 	 131072 	  69362
	Region      4 	 131072 	 120699
	Region      5 	 131072 	 121015
	Region      6 	 131072 	 131053
	Region      7 	 131072 	 131072
	Region      8 	 131072 	 131072
	Region      9 	 131072 	 131072
	Region     10 	 131072 	 131069
	Region     11 	 131072 	 130988
	Region     12 	 131072 	 131001
	Region     13 	 131072 	 131067
	Region     14 	 131072 	 131072
	Region     15 	 131072 	 131072
	Region     16 	 131072 	 131072
	Region     17 	 131072 	 131072
	Region     18 	 131072 	 131072
	Region     19 	 131072 	 131072
	Region     20 	 131072 	 131072
	Region     21 	 131072 	 131072
	Region     22 	 131072 	 131072
	Region     23 	 131072 	 131072
	Region     24 	 131072 	 131072
	Region     25 	 131072 	 131072
	Region     26 	 131072 	 131072
	Region     27 	 131072 	 131072
	Region     28 	 131072 	 131031
	Region     29 	 131072 	 131072
	Region     30 	 131072 	 131072
	Region     31 	 131072 	 131072
	Region     32 	 131072 	 131072
	Region     33 	 131072 	 131072
	Region     34 	 131072 	 131036
	Region     35 	 131072 	 131072
	Region     36 	 131072 	 131072
	Region     37 	 131072 	 131064
	Region     38 	 131072 	 131071
	Region     39 	 131072 	 131072
	Region     40 	 131072 	 131036
	Region     41 	 131072 	 131071
	Region     42 	 131072 	 131072
	Region     43 	 131072 	 131072
	Region     44 	 131072 	 131072
	Region     45 	 131072 	 131007
	Region     46 	 131072 	 131072
	Region     47 	 131072 	 131072
	Region     48 	 131072 	 131036
	Region     49 	 131072 	 131072
	Region     50 	 131072 	 131072
	Region     51 	 131072 	 131072
	Region     52 	 131072 	 131072
	Region     53 	 131072 	 131072
	Region     54 	 131072 	 131072
	Region     55 	 131072 	 131038
	Region     56 	 131072 	 131072
	Region     57 	 131072 	 131072
	Region     58 	 131072 	 131071
	Region     59 	 131072 	 131072
	Region     60 	 131072 	 131036
	Region     61 	 131072 	 131065
	Region     62 	 131072 	 131072
	Region     63 	 131072 	 131072
	Region     64 	 131072 	 131071
	Region     65 	 131072 	 131072
	Region     66 	 131072 	 131072
	Region     67 	 131072 	 131072
	Region     68 	 131072 	 131072
	Region     69 	 131072 	 131072
	Region     70 	 131072 	 131072
	Region     71 	 131072 	 131072
	Region     72 	 131072 	 131072
	Region     73 	 131072 	 131072
	Region     74 	 131072 	 131072
	Region     75 	 131072 	 131072
	Region     76 	 131072 	 131072
	Region     77 	 131072 	 131072
	Region     78 	 131072 	 131072
	Region     79 	 131072 	 131072
	Region     80 	 131072 	 131072
	Region     81 	 131072 	 131067
	Region     82 	 131072 	 131072
	Region     83 	 131072 	 131072
	Region     84 	 131072 	 130852
	Region     85 	 131072 	 131072
	Region     86 	 131072 	 131071
	Region     87 	 131072 	 131072
	Region     88 	 131072 	 131072
	Region     89 	 131072 	 131072
	Region     90 	 131072 	 131072
	Region     91 	 131072 	 131072
	Region     92 	 131072 	 131072
	Region     93 	 131072 	 131072
	Region     94 	 131072 	 131072
	Region     95 	 131072 	 131072
	Region     96 	 131072 	 131072
	Region     97 	 131072 	 131072
	Region     98 	 131072 	 131072
	Region     99 	 131072 	 131072
	Region    100 	 131072 	 131072
	Region    101 	 131072 	 131072
	Region    102 	 131072 	 131072
	Region    103 	 131072 	 131072
	Region    104 	 131072 	 131072
	Region    105 	 131072 	 131072
	Region    106 	 131072 	 131072
	Region    107 	 131072 	 131072
	Region    108 	 131072 	 131072
	Region    109 	 131072 	 131072
	Region    110 	 131072 	 131072
	Region    111 	 131072 	 131072
	Region    112 	 131072 	 131072
	Region    113 	 131072 	 131072
	Region    114 	 131072 	 131072
	Region    115 	 131072 	 131072
	Region    116 	 131072 	 131072
	Region    117 	 131072 	 131072
	Region    118 	 131072 	 131072
	Region    119 	 131072 	 131072
	Region    120 	 131072 	 131072
	Region    121 	 131072 	 131072
	Region    122 	 131072 	 128722
	Region    123 	 131072 	      0
	Region    124 	 131071 	     10
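One rough way to summarize the per-region table above is to count the regions
that are completely free (free == present), since only those regions hold no
allocations at all. The following is a minimal sketch, assuming the stats are
available as text in the "Region N  present  free" layout shown above. The
function names are hypothetical and are not part of this patchset.

```python
import re

# Matches lines such as "    Region      7    131072    131072"
# (tabs or spaces between the columns).
REGION_RE = re.compile(r"^\s*Region\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_region_stats(text):
    """Return a list of (region, present, free) tuples found in the text."""
    rows = []
    for line in text.splitlines():
        m = REGION_RE.match(line)
        if m:
            rows.append(tuple(int(g) for g in m.groups()))
    return rows


def fully_free_regions(rows):
    """Return the region numbers whose free count equals their present count."""
    return [r for r, present, free in rows if present > 0 and free == present]


if __name__ == "__main__":
    # A few rows copied from the table above.
    sample = (
        "  Region      6    131072    131053\n"
        "  Region      7    131072    131072\n"
        "  Region      8    131072    131072\n"
        "  Region    123    131072         0\n"
    )
    print(fully_free_regions(parse_region_stats(sample)))  # [7, 8]
```

Running the same helper over the vanilla kernel's per-region table gives a
single number to compare against, instead of eyeballing the two dumps.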
Page block order: 10
Pages per block:  1024
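The numbers in this dump can be cross-checked with a little arithmetic: a page
block of order 10 is 2^10 = 1024 pages, so a region of 131072 pages contains
128 page blocks, and each column of the per-order table that follows counts
free blocks of 2^order pages. The sketch below works this through. The 4 KiB
page size is an assumption (it is not stated in the dump), and the helper
names are hypothetical.

```python
PAGEBLOCK_ORDER = 10        # "Page block order: 10" from the dump
PAGES_PER_REGION = 131072   # "present" count of a full region
PAGE_SIZE = 4096            # assumption: 4 KiB pages, not stated in the dump


def pages_per_block(order=PAGEBLOCK_ORDER):
    """Number of pages in one block of the given order."""
    return 1 << order


def free_pages_from_orders(counts):
    """Total free pages for a row of per-order free counts (order 0 first)."""
    return sum(count << order for order, count in enumerate(counts))


if __name__ == "__main__":
    print(pages_per_block())                        # 1024
    print(PAGES_PER_REGION // pages_per_block())    # 128 page blocks per region
    print(PAGES_PER_REGION * PAGE_SIZE // 2**20)    # 512 (MiB, if pages are 4 KiB)

    # Movable row for region 38 with the patchset applied.
    r38 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 127]
    print(free_pages_from_orders(r38))              # 131071
```

The last value agrees with the free count listed for Region 38 in the
per-region page stats. Rows do not always match exactly, because a region may
also have free pages on the free lists of other migratetypes.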
Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      2      2      1      3      2      0      0      1      1      0
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      2
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable    586    714    497    300    160     93     66     45     36     24     80
Node    0, zone    DMA32, type  Reclaimable      1      1      0      0      1      1      1      1      1      1      0
Node    0, zone    DMA32, type      Movable    781    661    635    594    495    433    339    227    110     56    178
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable   4357   4070   4542   3024   1866    955    385     92      8      1    110
Node    0, zone   Normal, type  Reclaimable     11      0      1      1      0      0      0      0      1      1     82
Node    0, zone   Normal, type      Movable    207    272    566    504    482    503    456    303    189    120   2676
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      2
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, R  0      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  1      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  2      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  3      Movable     58    116    360    320    289    309    276    133     40      7      1 
Node    0, zone   Normal, R  4      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  5      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  6      Movable      3     11     11     11     11     11      9     10      8      9    119 
Node    0, zone   Normal, R  7      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  8      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  9      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 10      Movable      3      3      3      1      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 11      Movable     22     11     16      4      8      5      8      8      8      4    122 
Node    0, zone   Normal, R 12      Movable     35     25     13     12     13     14      9     10     11      7    119 
Node    0, zone   Normal, R 13      Movable      1      3      3      3      3      1      2      2      2      2    126 
Node    0, zone   Normal, R 14      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 15      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 16      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 17      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 18      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 19      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 20      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 21      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 22      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 23      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 24      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 25      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 26      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 27      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 28      Movable      9      9     11     10     12     12     12     10      7      5    121 
Node    0, zone   Normal, R 29      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 30      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 31      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 32      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 33      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 34      Movable     10     11     11     10     10      7      9      9      9      7    120 
Node    0, zone   Normal, R 35      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 36      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 37      Movable      2      5      5      5      5      5      5      5      5      5    123 
Node    0, zone   Normal, R 38      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 39      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 40      Movable      8      6      6      6      6      7      7      7      7      7    121 
Node    0, zone   Normal, R 41      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 42      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 43      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 44      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 45      Movable      7      8     16     17     16     17     17     15     14     13    114 
Node    0, zone   Normal, R 46      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 47      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 48      Movable      6      5      7      4      7      7      7      7      7      5    122 
Node    0, zone   Normal, R 49      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 50      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 51      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 52      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 53      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 54      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 55      Movable     10     18     20     18     19     19     19     19     15      9    115 
Node    0, zone   Normal, R 56      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 57      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 58      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 59      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 60      Movable      6      9      7      5      6      7      5      6      6      4    123 
Node    0, zone   Normal, R 61      Movable      7      7      5      6      6      6      6      6      6      6    122 
Node    0, zone   Normal, R 62      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 63      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 64      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 65      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 66      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 67      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 68      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 69      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 70      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 71      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 72      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 73      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 74      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 75      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 76      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 77      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 78      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 79      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 80      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 81      Movable      1      3      1      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 82      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 83      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 84      Movable     12     14     57     61     59     56     54     46     33     20     97 
Node    0, zone   Normal, R 85      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 86      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 87      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 88      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 89      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 90      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 91      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 92      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 93      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 94      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 95      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 96      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 97      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 98      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 99      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R100      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R101      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R102      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R103      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R104      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R105      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R106      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R107      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R108      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R109      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R110      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R111      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R112      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R113      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R114      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R115      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R116      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R117      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R118      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R119      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R120      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R121      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R122      Movable      2      2      9      3      3     11      2      1      2      1    124 
Node    0, zone   Normal, R123      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R124      Movable      0      1      0      1      0      0      0      0      0      0      0 
Number of blocks type         Unmovable  Reclaimable      Movable      Reserve      Isolate 
Node 0, zone      DMA                 1            0            2            1            0 
Node 0, zone    DMA32               128            0          379            1            0 
Node 0, zone   Normal               384          128        15359            1            0 
Node 0, zone   Normal R  0            0            0            0            1            0 
Node 0, zone   Normal R  1          127            0            0            1            0 
Node 0, zone   Normal R  2            1          127            0            0            0 
Node 0, zone   Normal R  3            0            1          127            0            0 
Node 0, zone   Normal R  4          127            0            1            0            0 
Node 0, zone   Normal R  5          128            0            0            0            0 
Node 0, zone   Normal R  6            1            0          127            0            0 
Node 0, zone   Normal R  7            0            0          128            0            0 
Node 0, zone   Normal R  8            0            0          128            0            0 
Node 0, zone   Normal R  9            0            0          128            0            0 
Node 0, zone   Normal R 10            0            0          128            0            0 
Node 0, zone   Normal R 11            0            0          128            0            0 
Node 0, zone   Normal R 12            0            0          128            0            0 
Node 0, zone   Normal R 13            0            0          128            0            0 
Node 0, zone   Normal R 14            0            0          128            0            0 
Node 0, zone   Normal R 15            0            0          128            0            0 
Node 0, zone   Normal R 16            0            0          128            0            0 
Node 0, zone   Normal R 17            0            0          128            0            0 
Node 0, zone   Normal R 18            0            0          128            0            0 
Node 0, zone   Normal R 19            0            0          128            0            0 
Node 0, zone   Normal R 20            0            0          128            0            0 
Node 0, zone   Normal R 21            0            0          128            0            0 
Node 0, zone   Normal R 22            0            0          128            0            0 
Node 0, zone   Normal R 23            0            0          128            0            0 
Node 0, zone   Normal R 24            0            0          128            0            0 
Node 0, zone   Normal R 25            0            0          128            0            0 
Node 0, zone   Normal R 26            0            0          128            0            0 
Node 0, zone   Normal R 27            0            0          128            0            0 
Node 0, zone   Normal R 28            0            0          128            0            0 
Node 0, zone   Normal R 29            0            0          128            0            0 
Node 0, zone   Normal R 30            0            0          128            0            0 
Node 0, zone   Normal R 31            0            0          128            0            0 
Node 0, zone   Normal R 32            0            0          128            0            0 
Node 0, zone   Normal R 33            0            0          128            0            0 
Node 0, zone   Normal R 34            0            0          128            0            0 
Node 0, zone   Normal R 35            0            0          128            0            0 
Node 0, zone   Normal R 36            0            0          128            0            0 
Node 0, zone   Normal R 37            0            0          128            0            0 
Node 0, zone   Normal R 38            0            0          128            0            0 
Node 0, zone   Normal R 39            0            0          128            0            0 
Node 0, zone   Normal R 40            0            0          128            0            0 
Node 0, zone   Normal R 41            0            0          128            0            0 
Node 0, zone   Normal R 42            0            0          128            0            0 
Node 0, zone   Normal R 43            0            0          128            0            0 
Node 0, zone   Normal R 44            0            0          128            0            0 
Node 0, zone   Normal R 45            0            0          128            0            0 
Node 0, zone   Normal R 46            0            0          128            0            0 
Node 0, zone   Normal R 47            0            0          128            0            0 
Node 0, zone   Normal R 48            0            0          128            0            0 
Node 0, zone   Normal R 49            0            0          128            0            0 
Node 0, zone   Normal R 50            0            0          128            0            0 
Node 0, zone   Normal R 51            0            0          128            0            0 
Node 0, zone   Normal R 52            0            0          128            0            0 
Node 0, zone   Normal R 53            0            0          128            0            0 
Node 0, zone   Normal R 54            0            0          128            0            0 
Node 0, zone   Normal R 55            0            0          128            0            0 
Node 0, zone   Normal R 56            0            0          128            0            0 
Node 0, zone   Normal R 57            0            0          128            0            0 
Node 0, zone   Normal R 58            0            0          128            0            0 
Node 0, zone   Normal R 59            0            0          128            0            0 
Node 0, zone   Normal R 60            0            0          128            0            0 
Node 0, zone   Normal R 61            0            0          128            0            0 
Node 0, zone   Normal R 62            0            0          128            0            0 
Node 0, zone   Normal R 63            0            0          128            0            0 
Node 0, zone   Normal R 64            0            0          128            0            0 
Node 0, zone   Normal R 65            0            0          128            0            0 
Node 0, zone   Normal R 66            0            0          128            0            0 
Node 0, zone   Normal R 67            0            0          128            0            0 
Node 0, zone   Normal R 68            0            0          128            0            0 
Node 0, zone   Normal R 69            0            0          128            0            0 
Node 0, zone   Normal R 70            0            0          128            0            0 
Node 0, zone   Normal R 71            0            0          128            0            0 
Node 0, zone   Normal R 72            0            0          128            0            0 
Node 0, zone   Normal R 73            0            0          128            0            0 
Node 0, zone   Normal R 74            0            0          128            0            0 
Node 0, zone   Normal R 75            0            0          128            0            0 
Node 0, zone   Normal R 76            0            0          128            0            0 
Node 0, zone   Normal R 77            0            0          128            0            0 
Node 0, zone   Normal R 78            0            0          128            0            0 
Node 0, zone   Normal R 79            0            0          128            0            0 
Node 0, zone   Normal R 80            0            0          128            0            0 
Node 0, zone   Normal R 81            0            0          128            0            0 
Node 0, zone   Normal R 82            0            0          128            0            0 
Node 0, zone   Normal R 83            0            0          128            0            0 
Node 0, zone   Normal R 84            0            0          128            0            0 
Node 0, zone   Normal R 85            0            0          128            0            0 
Node 0, zone   Normal R 86            0            0          128            0            0 
Node 0, zone   Normal R 87            0            0          128            0            0 
Node 0, zone   Normal R 88            0            0          128            0            0 
Node 0, zone   Normal R 89            0            0          128            0            0 
Node 0, zone   Normal R 90            0            0          128            0            0 
Node 0, zone   Normal R 91            0            0          128            0            0 
Node 0, zone   Normal R 92            0            0          128            0            0 
Node 0, zone   Normal R 93            0            0          128            0            0 
Node 0, zone   Normal R 94            0            0          128            0            0 
Node 0, zone   Normal R 95            0            0          128            0            0 
Node 0, zone   Normal R 96            0            0          128            0            0 
Node 0, zone   Normal R 97            0            0          128            0            0 
Node 0, zone   Normal R 98            0            0          128            0            0 
Node 0, zone   Normal R 99            0            0          128            0            0 
Node 0, zone   Normal R100            0            0          128            0            0 
Node 0, zone   Normal R101            0            0          128            0            0 
Node 0, zone   Normal R102            0            0          128            0            0 
Node 0, zone   Normal R103            0            0          128            0            0 
Node 0, zone   Normal R104            0            0          128            0            0 
Node 0, zone   Normal R105            0            0          128            0            0 
Node 0, zone   Normal R106            0            0          128            0            0 
Node 0, zone   Normal R107            0            0          128            0            0 
Node 0, zone   Normal R108            0            0          128            0            0 
Node 0, zone   Normal R109            0            0          128            0            0 
Node 0, zone   Normal R110            0            0          128            0            0 
Node 0, zone   Normal R111            0            0          128            0            0 
Node 0, zone   Normal R112            0            0          128            0            0 
Node 0, zone   Normal R113            0            0          128            0            0 
Node 0, zone   Normal R114            0            0          128            0            0 
Node 0, zone   Normal R115            0            0          128            0            0 
Node 0, zone   Normal R116            0            0          128            0            0 
Node 0, zone   Normal R117            0            0          128            0            0 
Node 0, zone   Normal R118            0            0          128            0            0 
Node 0, zone   Normal R119            0            0          128            0            0 
Node 0, zone   Normal R120            0            0          128            0            0 
Node 0, zone   Normal R121            0            0          128            0            0 
Node 0, zone   Normal R122            0            0          128            0            0 
Node 0, zone   Normal R123            0            0          128            0            0 
Node 0, zone   Normal R124            0            0          128            0            0 
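To summarize a per-region block table like the one above without reading it row by row, the rows can be parsed mechanically. The following is a minimal sketch, assuming the row layout shown in this thread (the "R <n>" region column comes from this patchset's procfs output, not from mainline); the helper names parse_region_blocks and purely_movable are illustrative only.

```python
import re

# Matches per-region rows such as
#   "Node 0, zone   Normal R  7   0   0   128   0   0"
# Assumption: the "R <n>" column is the per-memory-region field shown in this
# thread's output; zone-total rows (no "R" column) are deliberately skipped.
ROW = re.compile(
    r"Node\s+(\d+),\s+zone\s+(\w+)\s+R\s*(\d+)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)
TYPES = ("Unmovable", "Reclaimable", "Movable", "Reserve", "Isolate")


def parse_region_blocks(text):
    """Return {(node, zone, region): {migrate_type: block_count}}."""
    regions = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue  # header, zone totals, or unrelated line
        node, zone, region = int(m.group(1)), m.group(2), int(m.group(3))
        counts = [int(x) for x in m.groups()[3:]]
        regions[(node, zone, region)] = dict(zip(TYPES, counts))
    return regions


def purely_movable(regions):
    """Keys of regions in which every counted pageblock is Movable."""
    return sorted(
        key for key, c in regions.items()
        if c["Movable"] > 0 and sum(c.values()) == c["Movable"]
    )


sample = """\
Node 0, zone   Normal               384          128        15359            1            0
Node 0, zone   Normal R  1          127            0            0            1            0
Node 0, zone   Normal R  6            1            0          127            0            0
Node 0, zone   Normal R  7            0            0          128            0            0
Node 0, zone   Normal R100            0            0          128            0            0
"""
regions = parse_region_blocks(sample)
print(purely_movable(regions))  # [(0, 'Normal', 7), (0, 'Normal', 100)]
```

Fed the full table, the same two helpers would list which regions hold only Movable pageblocks, i.e. the regions that targeted compaction can in principle evacuate.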
Performance impact:
------------------
Kernbench was run with and without the patchset. It shows an _improvement_ of
around 6.8% with the patchset applied. (This is of course a little unexpected;
I'll dig into it further.)
Vanilla kernel:
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 687.140000
User Time 4528.030000
System Time 1382.140000
Percent CPU 860.000000
Context Switches 679060.000000
Sleeps 1343514.000000
With patchset:
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 643.930000
User Time 4371.600000
System Time 985.900000
Percent CPU 831.000000
Context Switches 655479.000000
Sleeps 1360223.000000
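For reference, the relative change in each timing can be worked out directly from the two runs above. A minimal sketch follows (the helper name pct_reduction is illustrative only). Measured as a reduction relative to the vanilla run, elapsed time comes out roughly 6.3% lower (about 6.7% when expressed as the ratio vanilla/patched), and system time roughly 28.7% lower, so most of the gain appears to come from system time.

```python
def pct_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Figures copied from the kernbench runs quoted above (-j 32, optimal load).
vanilla = {"Elapsed Time": 687.14, "User Time": 4528.03, "System Time": 1382.14}
patched = {"Elapsed Time": 643.93, "User Time": 4371.60, "System Time": 985.90}

for metric in vanilla:
    change = pct_reduction(vanilla[metric], patched[metric])
    print(f"{metric}: {change:.1f}% lower with patchset")
```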
Regards,
Srivatsa S. Bhat
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-25 23:26 ` [Results] [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
@ 2013-09-25 23:40   ` Andrew Morton
  2013-09-25 23:47     ` Andi Kleen
  2013-09-26 12:58     ` Srivatsa S. Bhat
  0 siblings, 2 replies; 72+ messages in thread
From: Andrew Morton @ 2013-09-25 23:40 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mgorman, dave, hannes, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel
On Thu, 26 Sep 2013 04:56:32 +0530 "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Experimental Results:
> ====================
> 
> Test setup:
> ----------
> 
> x86 Sandybridge dual-socket quad core HT-enabled machine, with 128GB RAM.
> Memory Region size = 512MB.
Yes, but how much power was saved ;)
Also, the changelogs don't appear to discuss one obvious downside: the
latency incurred in bringing a bank out of one of the low-power states
and back into full operation.  Please do discuss and quantify that to
the best of your knowledge.
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-25 23:40   ` Andrew Morton
@ 2013-09-25 23:47     ` Andi Kleen
  2013-09-26  1:14       ` Arjan van de Ven
  2013-09-26  1:15       ` Arjan van de Ven
  2013-09-26 12:58     ` Srivatsa S. Bhat
  1 sibling, 2 replies; 72+ messages in thread
From: Andi Kleen @ 2013-09-25 23:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srivatsa S. Bhat, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, arjan, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy, andi,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel
> Also, the changelogs don't appear to discuss one obvious downside: the
> latency incurred in bringing a bank out of one of the low-power states
> and back into full operation.  Please do discuss and quantify that to
> the best of your knowledge.
On Sandy Bridge the memory wakeup overhead is really small. It's on by default
in most setups today.
-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-25 23:47     ` Andi Kleen
@ 2013-09-26  1:14       ` Arjan van de Ven
  2013-09-26 13:09         ` Srivatsa S. Bhat
  2013-09-26  1:15       ` Arjan van de Ven
  1 sibling, 1 reply; 72+ messages in thread
From: Arjan van de Ven @ 2013-09-26  1:14 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Srivatsa S. Bhat, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel
On 9/25/2013 4:47 PM, Andi Kleen wrote:
>> Also, the changelogs don't appear to discuss one obvious downside: the
>> latency incurred in bringing a bank out of one of the low-power states
>> and back into full operation.  Please do discuss and quantify that to
>> the best of your knowledge.
>
> On Sandy Bridge the memory wakeup overhead is really small. It's on by default
> in most setups today.
yet grouping is often defeated (in current systems) due to hw level interleaving ;-(
sometimes that's a bios setting though.
in internal experimental bioses we've been able to observe a "swing" of a few watts
(not with these patches but with some other tricks)...
I'm curious to see how these patches do for Srivatsa
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-25 23:47     ` Andi Kleen
  2013-09-26  1:14       ` Arjan van de Ven
@ 2013-09-26  1:15       ` Arjan van de Ven
  2013-09-26  1:21         ` Andrew Morton
  2013-09-26 13:16         ` Srivatsa S. Bhat
  1 sibling, 2 replies; 72+ messages in thread
From: Arjan van de Ven @ 2013-09-26  1:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Srivatsa S. Bhat, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel
On 9/25/2013 4:47 PM, Andi Kleen wrote:
>> Also, the changelogs don't appear to discuss one obvious downside: the
>> latency incurred in bringing a bank out of one of the low-power states
>> and back into full operation.  Please do discuss and quantify that to
>> the best of your knowledge.
>
> On Sandy Bridge the memory wakeup overhead is really small. It's on by default
> in most setups today.
btw note that those kinds of memory power savings are content-preserving,
so likely a whole chunk of these patches is not actually needed on SNB
(or anything else Intel sells or sold)
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26  1:15       ` Arjan van de Ven
@ 2013-09-26  1:21         ` Andrew Morton
  2013-09-26  1:50           ` Andi Kleen
  2013-09-26 15:23           ` Arjan van de Ven
  2013-09-26 13:16         ` Srivatsa S. Bhat
  1 sibling, 2 replies; 72+ messages in thread
From: Andrew Morton @ 2013-09-26  1:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Srivatsa S. Bhat, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel
On Wed, 25 Sep 2013 18:15:21 -0700 Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 9/25/2013 4:47 PM, Andi Kleen wrote:
> >> Also, the changelogs don't appear to discuss one obvious downside: the
> >> latency incurred in bringing a bank out of one of the low-power states
> >> and back into full operation.  Please do discuss and quantify that to
> >> the best of your knowledge.
> >
> > On Sandy Bridge the memory wakeup overhead is really small. It's on by default
> > in most setups today.
> 
> btw note that those kinds of memory power savings are content-preserving,
> so likely a whole chunk of these patches is not actually needed on SNB
> (or anything else Intel sells or sold)
(head spinning a bit).  Could you please expand on this rather a lot?
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26  1:21         ` Andrew Morton
@ 2013-09-26  1:50           ` Andi Kleen
  2013-09-26  2:59             ` Andrew Morton
  2013-09-26 13:37             ` Srivatsa S. Bhat
  2013-09-26 15:23           ` Arjan van de Ven
  1 sibling, 2 replies; 72+ messages in thread
From: Andi Kleen @ 2013-09-26  1:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arjan van de Ven, Andi Kleen, Srivatsa S. Bhat, mgorman, dave,
	hannes, tony.luck, matthew.garrett, riel, srinivas.pandruvada,
	willy, kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel
On Wed, Sep 25, 2013 at 06:21:29PM -0700, Andrew Morton wrote:
> On Wed, 25 Sep 2013 18:15:21 -0700 Arjan van de Ven <arjan@linux.intel.com> wrote:
> 
> > On 9/25/2013 4:47 PM, Andi Kleen wrote:
> > >> Also, the changelogs don't appear to discuss one obvious downside: the
> > >> latency incurred in bringing a bank out of one of the low-power states
> > >> and back into full operation.  Please do discuss and quantify that to
> > >> the best of your knowledge.
> > >
> > > On Sandy Bridge the memory wakeup overhead is really small. It's on by default
> > > in most setups today.
> > 
> > btw note that those kinds of memory power savings are content-preserving,
> > so likely a whole chunk of these patches is not actually needed on SNB
> > (or anything else Intel sells or sold)
> 
> (head spinning a bit).  Could you please expand on this rather a lot?
As far as I understand there is a range of aggressiveness. You could
just group memory a bit better (assuming you can sufficiently predict
the future or have some interface to let someone tell you about it).
Or you can actually move memory around later to get as low footprint
as possible.
This patchkit seems to do both, with the later parts being on the
aggressive side (moving things around).
If you had non-content-preserving memory power saving, you would
need to be aggressive as you couldn't afford any mistakes.
If you had very slow wakeup you also couldn't afford mistakes,
as those could cost a lot of time.
On SandyBridge wakeup is not slow and it's content-preserving, so some
mistakes are ok.
But being aggressive (so moving things around) may still help you save
more power -- I guess only benchmarks can tell. It's a trade-off between
potential gain and potential worst-case performance regression.
It may also depend on the workload.
At least right now the numbers seem to be positive.
-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26  1:50           ` Andi Kleen
@ 2013-09-26  2:59             ` Andrew Morton
  2013-09-26 13:42               ` Srivatsa S. Bhat
  2013-09-26 13:37             ` Srivatsa S. Bhat
  1 sibling, 1 reply; 72+ messages in thread
From: Andrew Morton @ 2013-09-26  2:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Arjan van de Ven, Srivatsa S. Bhat, mgorman, dave, hannes,
	tony.luck, matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel
On Thu, 26 Sep 2013 03:50:16 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> On Wed, Sep 25, 2013 at 06:21:29PM -0700, Andrew Morton wrote:
> > On Wed, 25 Sep 2013 18:15:21 -0700 Arjan van de Ven <arjan@linux.intel.com> wrote:
> > 
> > > On 9/25/2013 4:47 PM, Andi Kleen wrote:
> > > >> Also, the changelogs don't appear to discuss one obvious downside: the
> > > >> latency incurred in bringing a bank out of one of the low-power states
> > > >> and back into full operation.  Please do discuss and quantify that to
> > > >> the best of your knowledge.
> > > >
> > > > On Sandy Bridge the memory wakeup overhead is really small. It's on by default
> > > > in most setups today.
> > > 
> > > btw note that those kinds of memory power savings are content-preserving,
> > > so likely a whole chunk of these patches is not actually needed on SNB
> > > (or anything else Intel sells or sold)
> > 
> > (head spinning a bit).  Could you please expand on this rather a lot?
> 
> As far as I understand there is a range of aggressiveness. You could
> just group memory a bit better (assuming you can sufficiently predict
> the future or have some interface to let someone tell you about it).
> 
> Or you can actually move memory around later to get as low footprint
> as possible.
> 
> This patchkit seems to do both, with the later parts being on the
> aggressive side (moving things around).
> 
> If you had non-content-preserving memory power saving, you would
> need to be aggressive as you couldn't afford any mistakes.
> 
> If you had very slow wakeup you also couldn't afford mistakes,
> as those could cost a lot of time.
> 
> On SandyBridge wakeup is not slow and it's content-preserving, so some
> mistakes are ok.
> 
> But being aggressive (so moving things around) may still help you save
> more power -- I guess only benchmarks can tell. It's a trade-off between
> potential gain and potential worst-case performance regression.
> It may also depend on the workload.
> 
> At least right now the numbers seem to be positive.
OK.  But why are "a whole chunk of these patches not actually needed on SNB
(or anything else Intel sells or sold)"?  What's the difference between
Intel products and whatever-it-is-this-patchset-was-designed-for?
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-25 23:40   ` Andrew Morton
  2013-09-25 23:47     ` Andi Kleen
@ 2013-09-26 12:58     ` Srivatsa S. Bhat
  2013-09-26 15:29       ` Arjan van de Ven
  2013-11-12  8:02       ` Srivatsa S. Bhat
  1 sibling, 2 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-26 12:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgorman, dave, hannes, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, maxime.coquelin, loic.pallardy, amit.kachhap,
	thomas.abraham
On 09/26/2013 05:10 AM, Andrew Morton wrote:
> On Thu, 26 Sep 2013 04:56:32 +0530 "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> 
>> Experimental Results:
>> ====================
>>
>> Test setup:
>> ----------
>>
>> x86 Sandybridge dual-socket quad core HT-enabled machine, with 128GB RAM.
>> Memory Region size = 512MB.
> 
> Yes, but how much power was saved ;)
> 
I don't have those numbers yet, but I'll be able to get them going forward.
Let me explain the challenge I am facing. A prototype powerpc platform that
I work with has the capability to transition memory banks to content-preserving
low-power states at a per-socket granularity. What that means is that we can
get memory power savings *without* needing to go to full-system-idle, unlike
Intel platforms such as Sandybridge.
So, since we can exploit per-socket memory power-savings irrespective of
whether the system is fully idle or not, using this patchset to shape the
memory references appropriately is definitely going to be beneficial on that
platform.
But the challenge is that I don't have all the pieces in place for demarcating
the actual boundaries of the power-manageable memory chunks of that platform
and exposing it to the Linux kernel. As a result, I was not able to test and
report the overall power-savings from this patchset.
But I'll soon start working on getting the required pieces ready to expose
the memory boundary info of the platform via device-tree and then using
that to construct the Linux MM's view of memory regions (instead of hard-coding
them as I did in this patchset). With that done, I should be able to test and
report the overall power-savings numbers on this prototype powerpc platform.
Until then, in this and previous versions of the patchset, I have used an
Intel Sandybridge system just to evaluate the effectiveness of this patchset
by looking at the statistics (such as /proc/zoneinfo, /proc/pagetypeinfo,
etc.). This patchset also has the code to export per-memory-region
info in procfs to enable such analyses. Apart from this, I was able to
evaluate the performance overhead of this patchset similarly, without actually
needing to run on a system with true (hardware) memory region boundaries.
Of course, this was a first-level algorithmic/functional testing and evaluation,
and I was able to demonstrate a huge benefit over mainline in terms of
consolidation of allocations. Going forward, I'll work on getting this running
on a setup that can give me the overall power-savings numbers as well.
BTW, it would be really great if somebody who has access to custom BIOSes
(which export memory region/ACPI MPST info) on x86 platforms could try out
this patchset and let me know how well this patchset performs on x86 in terms
of memory power savings. I don't have a custom x86 BIOS to get that info, so
I don't think I'll be able to try that out myself :-(
> Also, the changelogs don't appear to discuss one obvious downside: the
> latency incurred in bringing a bank out of one of the low-power states
> and back into full operation.  Please do discuss and quantify that to
> the best of your knowledge.
> 
> 
As Andi mentioned, the wakeup latency is not expected to be noticeable. And
this power-savings logic is turned on in the hardware by default. So it's not
as if this patchset is going to _introduce_ that latency. This patchset only
tries to make the Linux MM _cooperate_ with the (already existing) hardware
power-savings logic and thereby get much better memory power-savings benefits
out of it.
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26  1:14       ` Arjan van de Ven
@ 2013-09-26 13:09         ` Srivatsa S. Bhat
  0 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-26 13:09 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Andrew Morton, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel, maxime.coquelin, loic.pallardy,
	thomas.abraham, amit.kachhap
On 09/26/2013 06:44 AM, Arjan van de Ven wrote:
> On 9/25/2013 4:47 PM, Andi Kleen wrote:
>>> Also, the changelogs don't appear to discuss one obvious downside: the
>>> latency incurred in bringing a bank out of one of the low-power states
>>> and back into full operation.  Please do discuss and quantify that to
>>> the best of your knowledge.
>>
>> On Sandy Bridge the memory wakeup overhead is really small. It's on by
>> default
>> in most setups today.
> 
> yet grouping is often defeated (in current systems) due to hw level
> interleaving ;-(
> sometimes that's a bios setting though.
> 
True, and I plan to tweak those hardware settings in the prototype powerpc
platform and evaluate the power vs performance trade-offs of various
interleaving schemes in conjunction with this patchset.
> in internal experimental bioses we've been able to observe a "swing" of
> a few watts
> (not with these patches but with some other tricks)...
Great! So, would you have the opportunity to try out this patchset as well
on those systems that you have? I can modify the patchset to take memory
region info from whatever source you want me to take it from and then we'll
have realistic power-savings numbers to evaluate this patchset and its benefits
on Intel/x86 platforms.
> I'm curious to see how these patches do for Srivatsa
> 
As I mentioned in my other mail, I don't yet have a setup for doing actual
power-measurements. Hence, so far I was focussing on the algorithmic aspects
of the patchset and was trying to get an excellent consolidation ratio,
without hurting performance too much. Going forward, I'll work on getting the
power-measurements as well on the powerpc platform that I have.
 
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26  1:15       ` Arjan van de Ven
  2013-09-26  1:21         ` Andrew Morton
@ 2013-09-26 13:16         ` Srivatsa S. Bhat
  1 sibling, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-26 13:16 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Andrew Morton, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel, maxime.coquelin, loic.pallardy,
	thomas.abraham, amit.kachhap
On 09/26/2013 06:45 AM, Arjan van de Ven wrote:
> On 9/25/2013 4:47 PM, Andi Kleen wrote:
>>> Also, the changelogs don't appear to discuss one obvious downside: the
>>> latency incurred in bringing a bank out of one of the low-power states
>>> and back into full operation.  Please do discuss and quantify that to
>>> the best of your knowledge.
>>
>> On Sandy Bridge the memory wakeup overhead is really small. It's on by
>> default
>> in most setups today.
> 
> btw note that those kind of memory power savings are content-preserving,
> so likely a whole chunk of these patches is not actually needed on SNB
> (or anything else Intel sells or sold)
> 
Umm, why not? By consolidating the allocations to fewer memory regions,
this patchset indirectly consolidates the *references* as well. And
it's the lack of memory references that really makes the hardware transition
the unreferenced banks to low-power (content-preserving) states. So from what
I understand, this patchset should provide noticeable benefits on Intel/SNB
platforms as well.
(BTW, even in the prototype powerpc hardware that I mentioned, the primary
memory power savings is expected to come from content-preserving states. So
it's not like this patchset was designed only for content-losing/full-poweroff
type of scenarios).
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26  1:50           ` Andi Kleen
  2013-09-26  2:59             ` Andrew Morton
@ 2013-09-26 13:37             ` Srivatsa S. Bhat
  1 sibling, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-26 13:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Arjan van de Ven, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel, maxime.coquelin, loic.pallardy,
	thomas.abraham, amit.kachhap
On 09/26/2013 07:20 AM, Andi Kleen wrote:
> On Wed, Sep 25, 2013 at 06:21:29PM -0700, Andrew Morton wrote:
>> On Wed, 25 Sep 2013 18:15:21 -0700 Arjan van de Ven <arjan@linux.intel.com> wrote:
>>
>>> On 9/25/2013 4:47 PM, Andi Kleen wrote:
>>>>> Also, the changelogs don't appear to discuss one obvious downside: the
>>>>> latency incurred in bringing a bank out of one of the low-power states
>>>>> and back into full operation.  Please do discuss and quantify that to
>>>>> the best of your knowledge.
>>>>
>>>> On Sandy Bridge the memory wakeup overhead is really small. It's on by default
>>>> in most setups today.
>>>
>>> btw note that those kind of memory power savings are content-preserving,
>>> so likely a whole chunk of these patches is not actually needed on SNB
>>> (or anything else Intel sells or sold)
>>
>> (head spinning a bit).  Could you please expand on this rather a lot?
> 
> As far as I understand there is a range of aggressiveness. You could
> just group memory a bit better (assuming you can sufficiently predict
> the future or have some interface to let someone tell you about it).
> 
> Or you can actually move memory around later to get as low footprint
> as possible.
> 
> This patchkit seems to do both, with the later parts being on the
> aggressive side (move things around) 
>
Yes, that's correct.
Grouping memory at allocation time is achieved using 2 techniques:
- Sorted-buddy allocator (patches 1-15)
- Split-allocator design or a "Region Allocator" as back-end (patches 16-33)
The aggressive/opportunistic page movement or reclaim is achieved using:
- Targeted region evacuation mechanism (patches 34-40)
Apart from being individually beneficial, the first 2 techniques influence
the Linux MM in such a way that it tremendously improves the yield/benefit
of the targeted compaction as well :-)
[ It's due to the avoidance of fragmentation of allocations pertaining to
_different_ migratetypes within a single memory region. By keeping each memory
region homogeneous with respect to the type of allocation, targeted compaction
becomes much more effective, since for example, we won't have cases where a
stubborn unmovable page will end up disrupting the region-evac attempt of a
region which has mostly movable/reclaimable allocations. ]
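As a rough illustration of the sorted-buddy idea (a userspace toy model in
Python, not the kernel code from the patches): if the free list is kept sorted
so that the lowest-numbered free page is always handed out first, allocations
fill up the lower regions before touching the higher ones. The names
SortedFreeList and REGION_PAGES, and the choice of sorting by page frame
number, are assumptions made only for this sketch.

```python
import bisect

REGION_PAGES = 4  # hypothetical: pages per memory region (tiny, for illustration)


class SortedFreeList:
    """Toy free list that keeps free page frame numbers (PFNs) sorted."""

    def __init__(self, free_pfns):
        self.free = sorted(free_pfns)

    def alloc(self):
        # Always hand out the lowest free PFN, so lower regions fill up first.
        return self.free.pop(0)

    def free_page(self, pfn):
        # Re-insert at the sorted position instead of at the head of the list.
        bisect.insort(self.free, pfn)

    def regions_in_use(self, total_pages):
        # A region counts as "in use" if at least one of its pages is allocated.
        used = set(range(total_pages)) - set(self.free)
        return {pfn // REGION_PAGES for pfn in used}


free_list = SortedFreeList(range(16))        # 16 pages = 4 toy regions
pages = [free_list.alloc() for _ in range(6)]
print(sorted(free_list.regions_in_use(16)))  # only the two lowest regions are touched
```

With an unsorted (LIFO) free list, a freed page from a high region would be
reused immediately and keep that region busy; keeping the list sorted steers
new allocations back toward the regions that are already in use.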
> If you had non content preserving memory saving you would 
> need to be aggressive as you couldn't afford any mistakes.
> 
> If you had very slow wakeup you also couldn't afford mistakes,
> as those could cost a lot of time.
> 
> On SandyBridge is not slow and it's preserving, so some mistakes are ok.
> 
> But being aggressive (so move things around) may still help you saving
> more power -- i guess only benchmarks can tell. It's a trade off between
> potential gain and potential worse case performance regression.
> It may also depend on the workload.
>
True, but we get better consolidation ratios than mainline even without
getting too aggressive. For example, v3 of this patchset didn't have
the targeted compaction logic, and it was still able to show up to around
120 free-regions at the end of the test run.
http://article.gmane.org/gmane.linux.kernel.mm/106283
This version of the patchset with the added aggressive logic (targeted
compaction) makes it only better: the free-regions number comes up to 202.
 
> At least right now the numbers seem to be positive.
> 
:-)
 
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26  2:59             ` Andrew Morton
@ 2013-09-26 13:42               ` Srivatsa S. Bhat
  2013-09-26 15:58                 ` Arjan van de Ven
  0 siblings, 1 reply; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-26 13:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Arjan van de Ven, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel, maxime.coquelin, loic.pallardy,
	thomas.abraham, amit.kachhap
On 09/26/2013 08:29 AM, Andrew Morton wrote:
> On Thu, 26 Sep 2013 03:50:16 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
>> On Wed, Sep 25, 2013 at 06:21:29PM -0700, Andrew Morton wrote:
>>> On Wed, 25 Sep 2013 18:15:21 -0700 Arjan van de Ven <arjan@linux.intel.com> wrote:
>>>
>>>> On 9/25/2013 4:47 PM, Andi Kleen wrote:
>>>>>> Also, the changelogs don't appear to discuss one obvious downside: the
>>>>>> latency incurred in bringing a bank out of one of the low-power states
>>>>>> and back into full operation.  Please do discuss and quantify that to
>>>>>> the best of your knowledge.
>>>>>
>>>>> On Sandy Bridge the memory wakeup overhead is really small. It's on by default
>>>>> in most setups today.
>>>>
>>>> btw note that those kind of memory power savings are content-preserving,
>>>> so likely a whole chunk of these patches is not actually needed on SNB
>>>> (or anything else Intel sells or sold)
>>>
>>> (head spinning a bit).  Could you please expand on this rather a lot?
>>
>> As far as I understand there is a range of aggressiveness. You could
>> just group memory a bit better (assuming you can sufficiently predict
>> the future or have some interface to let someone tell you about it).
>>
>> Or you can actually move memory around later to get as low footprint
>> as possible.
>>
>> This patchkit seems to do both, with the later parts being on the
>> aggressive side (move things around) 
>>
>> If you had non content preserving memory saving you would 
>> need to be aggressive as you couldn't afford any mistakes.
>>
>> If you had very slow wakeup you also couldn't afford mistakes,
>> as those could cost a lot of time.
>>
>> On SandyBridge is not slow and it's preserving, so some mistakes are ok.
>>
>> But being aggressive (so move things around) may still help you saving
>> more power -- i guess only benchmarks can tell. It's a trade off between
>> potential gain and potential worse case performance regression.
>> It may also depend on the workload.
>>
>> At least right now the numbers seem to be positive.
> 
> OK.  But why are "a whole chunk of these patches not actually needed on SNB
> (or anything else Intel sells or sold)"?  What's the difference between
> Intel products and whatever-it-is-this-patchset-was-designed-for?
> 
Arjan, are you referring to the fact that Intel/SNB systems can exploit
memory self-refresh only when the entire system goes idle? Is that why this
patchset won't turn out to be that useful on those platforms?
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26  1:21         ` Andrew Morton
  2013-09-26  1:50           ` Andi Kleen
@ 2013-09-26 15:23           ` Arjan van de Ven
  1 sibling, 0 replies; 72+ messages in thread
From: Arjan van de Ven @ 2013-09-26 15:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Srivatsa S. Bhat, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel
On 9/25/2013 6:21 PM, Andrew Morton wrote:
> On Wed, 25 Sep 2013 18:15:21 -0700 Arjan van de Ven <arjan@linux.intel.com> wrote:
>
>> On 9/25/2013 4:47 PM, Andi Kleen wrote:
>>>> Also, the changelogs don't appear to discuss one obvious downside: the
>>>> latency incurred in bringing a bank out of one of the low-power states
>>>> and back into full operation.  Please do discuss and quantify that to
>>>> the best of your knowledge.
>>>
>>> On Sandy Bridge the memory wakeup overhead is really small. It's on by default
>>> in most setups today.
>>
>> btw note that those kind of memory power savings are content-preserving,
>> so likely a whole chunk of these patches is not actually needed on SNB
>> (or anything else Intel sells or sold)
>
> (head spinning a bit).  Could you please expand on this rather a lot?
so there are two general ways to save power on memory
one way keeps the content of the memory there
the other way loses the content of the memory.
in the first type, there are degrees of power savings (each with their own costs), and the mechanism to enter/exit
tends to be fully automatic, e.g. OS invisible. (and generally very very fast.. measured in low numbers of nanoseconds)
in the latter case the OS by nature has to get involved and actively free the content of the memory prior to
setting the power level lower (and thus lose the content).
on the machines Srivatsa has been measuring, only the first type exists... e.g. content is preserved.
at which point, I am skeptical that it is worth spending a lot of CPU time (and thus power!) to move stuff around
or free memory (e.g. reduce disk cache efficiency -> loses power as well).
the patches posted seem to go to great lengths doing these kinds of things.
to get the power savings, my deep suspicion (based on some rudimentary experiments done internally to Intel
earlier this year) is that it is more than enough to have "statistical" level of "binding", to get 95%+ of
the max theoretical power savings.... basically what today's NUMA policy would do.
>
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26 12:58     ` Srivatsa S. Bhat
@ 2013-09-26 15:29       ` Arjan van de Ven
  2013-11-12  8:02       ` Srivatsa S. Bhat
  1 sibling, 0 replies; 72+ messages in thread
From: Arjan van de Ven @ 2013-09-26 15:29 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Andrew Morton, mgorman, dave, hannes, tony.luck, matthew.garrett,
	riel, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, maxime.coquelin, loic.pallardy, amit.kachhap,
	thomas.abraham
On 9/26/2013 5:58 AM, Srivatsa S. Bhat wrote:
> Let me explain the challenge I am facing. A prototype powerpc platform that
> I work with has the capability to transition memory banks to content-preserving
> low-power states at a per-socket granularity. What that means is that we can
> get memory power savings *without* needing to go to full-system-idle, unlike
> Intel platforms such as Sandybridge.
btw this is not a correct statement
even Sandybridge can put memory in low power states (just not self refresh) even if the system is not idle
(depending on bios settings to enable this)
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26 13:42               ` Srivatsa S. Bhat
@ 2013-09-26 15:58                 ` Arjan van de Ven
  2013-09-26 17:00                   ` Srivatsa S. Bhat
  0 siblings, 1 reply; 72+ messages in thread
From: Arjan van de Ven @ 2013-09-26 15:58 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Andrew Morton, Andi Kleen, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel, maxime.coquelin, loic.pallardy,
	thomas.abraham, amit.kachhap
On 9/26/2013 6:42 AM, Srivatsa S. Bhat wrote:
> On 09/26/2013 08:29 AM, Andrew Morton wrote:
>> On Thu, 26 Sep 2013 03:50:16 +0200 Andi Kleen <andi@firstfloor.org> wrote:
>>
>>> On Wed, Sep 25, 2013 at 06:21:29PM -0700, Andrew Morton wrote:
>>>> On Wed, 25 Sep 2013 18:15:21 -0700 Arjan van de Ven <arjan@linux.intel.com> wrote:
>>>>
>>>>> On 9/25/2013 4:47 PM, Andi Kleen wrote:
>>>>>>> Also, the changelogs don't appear to discuss one obvious downside: the
>>>>>>> latency incurred in bringing a bank out of one of the low-power states
>>>>>>> and back into full operation.  Please do discuss and quantify that to
>>>>>>> the best of your knowledge.
>>>>>>
>>>>>> On Sandy Bridge the memory wakeup overhead is really small. It's on by default
>>>>>> in most setups today.
>>>>>
>>>>> btw note that those kind of memory power savings are content-preserving,
>>>>> so likely a whole chunk of these patches is not actually needed on SNB
>>>>> (or anything else Intel sells or sold)
>>>>
>>>> (head spinning a bit).  Could you please expand on this rather a lot?
>>>
>>> As far as I understand there is a range of aggressiveness. You could
>>> just group memory a bit better (assuming you can sufficiently predict
>>> the future or have some interface to let someone tell you about it).
>>>
>>> Or you can actually move memory around later to get as low footprint
>>> as possible.
>>>
>>> This patchkit seems to do both, with the later parts being on the
>>> aggressive side (move things around)
>>>
>>> If you had non content preserving memory saving you would
>>> need to be aggressive as you couldn't afford any mistakes.
>>>
>>> If you had very slow wakeup you also couldn't afford mistakes,
>>> as those could cost a lot of time.
>>>
>>> On SandyBridge is not slow and it's preserving, so some mistakes are ok.
>>>
>>> But being aggressive (so move things around) may still help you saving
>>> more power -- i guess only benchmarks can tell. It's a trade off between
>>> potential gain and potential worse case performance regression.
>>> It may also depend on the workload.
>>>
>>> At least right now the numbers seem to be positive.
>>
>> OK.  But why are "a whole chunk of these patches not actually needed on SNB
>> (or anything else Intel sells or sold)"?  What's the difference between
>> Intel products and whatever-it-is-this-patchset-was-designed-for?
>>
>
> Arjan, are you referring to the fact that Intel/SNB systems can exploit
> memory self-refresh only when the entire system goes idle? Is that why this
> patchset won't turn out to be that useful on those platforms?
no we can use other things (CKE and co) all the time.
just that we found that statistical grouping gave 95%+ of the benefit,
without the cost of being aggressive on going to a 100.00% grouping
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26 15:58                 ` Arjan van de Ven
@ 2013-09-26 17:00                   ` Srivatsa S. Bhat
  2013-09-26 18:06                     ` Arjan van de Ven
  0 siblings, 1 reply; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-26 17:00 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Andi Kleen, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel, maxime.coquelin, loic.pallardy,
	thomas.abraham, amit.kachhap
On 09/26/2013 09:28 PM, Arjan van de Ven wrote:
> On 9/26/2013 6:42 AM, Srivatsa S. Bhat wrote:
>> On 09/26/2013 08:29 AM, Andrew Morton wrote:
>>> On Thu, 26 Sep 2013 03:50:16 +0200 Andi Kleen <andi@firstfloor.org>
>>> wrote:
>>>
>>>> On Wed, Sep 25, 2013 at 06:21:29PM -0700, Andrew Morton wrote:
>>>>> On Wed, 25 Sep 2013 18:15:21 -0700 Arjan van de Ven
>>>>> <arjan@linux.intel.com> wrote:
>>>>>
>>>>>> On 9/25/2013 4:47 PM, Andi Kleen wrote:
>>>>>>>> Also, the changelogs don't appear to discuss one obvious
>>>>>>>> downside: the
>>>>>>>> latency incurred in bringing a bank out of one of the low-power
>>>>>>>> states
>>>>>>>> and back into full operation.  Please do discuss and quantify
>>>>>>>> that to
>>>>>>>> the best of your knowledge.
>>>>>>>
>>>>>>> On Sandy Bridge the memory wakeup overhead is really small. It's
>>>>>>> on by default
>>>>>>> in most setups today.
>>>>>>
>>>>>> btw note that those kind of memory power savings are
>>>>>> content-preserving,
>>>>>> so likely a whole chunk of these patches is not actually needed on
>>>>>> SNB
>>>>>> (or anything else Intel sells or sold)
>>>>>
>>>>> (head spinning a bit).  Could you please expand on this rather a lot?
>>>>
>>>> As far as I understand there is a range of aggressiveness. You could
>>>> just group memory a bit better (assuming you can sufficiently predict
>>>> the future or have some interface to let someone tell you about it).
>>>>
>>>> Or you can actually move memory around later to get as low footprint
>>>> as possible.
>>>>
>>>> This patchkit seems to do both, with the later parts being on the
>>>> aggressive side (move things around)
>>>>
>>>> If you had non content preserving memory saving you would
>>>> need to be aggressive as you couldn't afford any mistakes.
>>>>
>>>> If you had very slow wakeup you also couldn't afford mistakes,
>>>> as those could cost a lot of time.
>>>>
>>>> On SandyBridge is not slow and it's preserving, so some mistakes are
>>>> ok.
>>>>
>>>> But being aggressive (so move things around) may still help you saving
>>>> more power -- i guess only benchmarks can tell. It's a trade off
>>>> between
>>>> potential gain and potential worse case performance regression.
>>>> It may also depend on the workload.
>>>>
>>>> At least right now the numbers seem to be positive.
>>>
>>> OK.  But why are "a whole chunk of these patches not actually needed
>>> on SNB
>>> (or anything else Intel sells or sold)"?  What's the difference between
>>> Intel products and whatever-it-is-this-patchset-was-designed-for?
>>>
>>
>> Arjan, are you referring to the fact that Intel/SNB systems can exploit
>> memory self-refresh only when the entire system goes idle? Is that why
>> this
>> patchset won't turn out to be that useful on those platforms?
> 
> no we can use other things (CKE and co) all the time.
> 
Ah, ok..
> just that we found that statistical grouping gave 95%+ of the benefit,
> without the cost of being aggressive on going to a 100.00% grouping
> 
And how do you do that statistical grouping? Don't you need patches similar
to those in this patchset? Or are you saying that the existing vanilla
kernel itself does statistical grouping somehow?
Also, I didn't fully understand how NUMA policy will help in this case..
If you want to group memory allocations/references into fewer memory regions
_within_ a node, will NUMA policy really help? For example, in this patchset,
everything (all the allocation/reference shaping) is done _within_ the
NUMA boundary, assuming that the memory regions are subsets of a NUMA node.
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26 17:00                   ` Srivatsa S. Bhat
@ 2013-09-26 18:06                     ` Arjan van de Ven
  2013-09-26 18:33                       ` Srivatsa S. Bhat
  0 siblings, 1 reply; 72+ messages in thread
From: Arjan van de Ven @ 2013-09-26 18:06 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Andrew Morton, Andi Kleen, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel, maxime.coquelin, loic.pallardy,
	thomas.abraham, amit.kachhap
>>>>
>>>
>>> Arjan, are you referring to the fact that Intel/SNB systems can exploit
>>> memory self-refresh only when the entire system goes idle? Is that why
>>> this
>>> patchset won't turn out to be that useful on those platforms?
>>
>> no we can use other things (CKE and co) all the time.
>>
>
> Ah, ok..
>
>> just that we found that statistical grouping gave 95%+ of the benefit,
>> without the cost of being aggressive on going to a 100.00% grouping
>>
>
> And how do you do that statistical grouping? Don't you need patches similar
> to those in this patchset? Or are you saying that the existing vanilla
> kernel itself does statistical grouping somehow?
so the way I scanned your patchset.. half of it is about grouping,
the other half (roughly) is about moving stuff.
the grouping makes total sense to me.
actively moving is the part that I am very worried about; that part burns power to do
(and performance).... for which the ROI is somewhat unclear to me
(but... data speaks. I can easily be convinced with data that proves one way or the other)
is moving stuff around the 95%-of-the-work-for-the-last-5%-of-the-theoretical-gain
or is statistical grouping enough to get > 95% of the gain... without the cost of moving.
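One way to frame that question is as a break-even check: moving pays off only
if the energy saved while the evacuated region stays idle exceeds the energy
spent moving the pages. The sketch below is a toy model in Python; the function
name, its parameters and the example values are all hypothetical placeholders,
not measurements from any platform.

```python
def migration_pays_off(pages_to_move, joules_per_page_move,
                       watts_saved_when_idle, expected_idle_seconds):
    """Return True if evacuating a region is expected to save net energy.

    All four inputs are assumptions that would have to come from real
    measurements; none of them are given in this thread.
    """
    cost_joules = pages_to_move * joules_per_page_move
    gain_joules = watts_saved_when_idle * expected_idle_seconds
    return gain_joules > cost_joules


# Made-up numbers, only to show the shape of the trade-off:
print(migration_pays_off(32, 1e-4, 0.05, 10))    # long idle period
print(migration_pays_off(32, 1e-4, 0.05, 0.01))  # region gets touched again quickly
```

The hard part in practice is the last parameter: how long the region will
actually stay unreferenced after it has been emptied, which is workload
dependent and can only be settled by benchmarking.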
>
> Also, I didn't fully understand how NUMA policy will help in this case..
> If you want to group memory allocations/references into fewer memory regions
> _within_ a node, will NUMA policy really help? For example, in this patchset,
> everything (all the allocation/reference shaping) is done _within_ the
> NUMA boundary, assuming that the memory regions are subsets of a NUMA node.
>
> Regards,
> Srivatsa S. Bhat
>
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26 18:06                     ` Arjan van de Ven
@ 2013-09-26 18:33                       ` Srivatsa S. Bhat
  0 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-26 18:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Andi Kleen, mgorman, dave, hannes, tony.luck,
	matthew.garrett, riel, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw, gargankita, paulmck, svaidy,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel, maxime.coquelin, loic.pallardy,
	thomas.abraham, amit.kachhap
On 09/26/2013 11:36 PM, Arjan van de Ven wrote:
>>>>>
>>>>
>>>> Arjan, are you referring to the fact that Intel/SNB systems can exploit
>>>> memory self-refresh only when the entire system goes idle? Is that why
>>>> this
>>>> patchset won't turn out to be that useful on those platforms?
>>>
>>> no we can use other things (CKE and co) all the time.
>>>
>>
>> Ah, ok..
>>
>>> just that we found that statistical grouping gave 95%+ of the benefit,
>>> without the cost of being aggressive on going to a 100.00% grouping
>>>
>>
>> And how do you do that statistical grouping? Don't you need patches
>> similar
>> to those in this patchset? Or are you saying that the existing vanilla
>> kernel itself does statistical grouping somehow?
> 
> so the way I scanned your patchset.. half of it is about grouping,
> the other half (roughly) is about moving stuff.
> 
Actually, either by number-of-lines or by patch count, a majority of the
patchset is about grouping, and only a few patches do the moving part.
As I mentioned in my earlier mail, patches 1-33 achieve the grouping,
whereas patches 34-40 do the movement. (Both sorted-buddy allocator and
the region allocators are grouping techniques.)
And v3 of this patchset actually didn't have the movement stuff at all,
it just had the grouping parts. And they got me up to around 120 free-regions
at the end of the test run - a noticeably better consolidation ratio compared
to mainline (18).
http://article.gmane.org/gmane.linux.kernel.mm/106283
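To put the free-region counts quoted in this thread (18, 120 and 202) in
proportion, they can be expressed as a fraction of all regions. The snippet
below assumes that the whole 128GB of the test machine is divided uniformly
into 512MB regions, i.e. 256 regions in total; the region count actually seen
by the kernel may differ slightly.

```python
TOTAL_RAM_MB = 128 * 1024   # 128GB, from the test setup
REGION_SIZE_MB = 512        # memory region size, from the test setup
TOTAL_REGIONS = TOTAL_RAM_MB // REGION_SIZE_MB  # assumption: uniform split


def free_fraction(free_regions, total_regions=TOTAL_REGIONS):
    """Fraction of memory regions that are completely free."""
    return free_regions / total_regions


results = [
    ("mainline", 18),
    ("v3 (grouping only)", 120),
    ("v4 (grouping + targeted compaction)", 202),
]
for label, free in results:
    print(f"{label}: {free}/{TOTAL_REGIONS} regions free ({free_fraction(free):.1%})")
```

Under that assumption, mainline leaves roughly 7% of the regions free, the
grouping-only v3 roughly 47%, and v4 roughly 79%.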
> the grouping makes total sense to me.
Ah, great!
> actively moving is the part that I am very worried about; that part
> burns power to do
> (and performance).... for which the ROI is somewhat unclear to me
> (but... data speaks. I can easily be convinced with data that proves one
> way or the other)
> 
Actually I have added some intelligence in the moving parts to avoid being
too aggressive. For example, I don't do _any_ movement if more than 32 pages
in a region are used, since it will take a considerable amount of work to
evacuate that region. Further, my evacuation/compaction technique is very
conservative:
1. I reclaim only clean page-cache pages. So no disk I/O involved.
2. I move movable pages around.
3. I allocate target pages for migration using the fast buddy-allocator
   itself, so there is not a lot of PFN scanning involved.
And that's it! No other case for page movement. And with this conservative
approach itself, I'm getting great consolidation ratios!
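The conservative policy listed above can be sketched in plain C. This is only
an illustration under stated assumptions: the enum values, the function names
and EVAC_MAX_USED_PAGES are hypothetical, the choice to skip completely empty
regions is an assumption of the sketch, and none of this is the code from the
posted patches.

```c
#include <stdbool.h>

/* From the text: no movement if more than 32 pages in a region are used. */
#define EVAC_MAX_USED_PAGES 32

/* Hypothetical page classification; not the kernel's migratetypes. */
enum page_kind {
	PAGE_CLEAN_PAGECACHE,	/* can be dropped without any disk I/O */
	PAGE_DIRTY_PAGECACHE,	/* would need writeback, so it is left alone */
	PAGE_MOVABLE,		/* can be migrated to another region */
	PAGE_UNMOVABLE,		/* cannot be evacuated */
};

enum evac_action { EVAC_SKIP, EVAC_RECLAIM, EVAC_MIGRATE };

/* Is this region lightly used enough to be worth evacuating? */
static bool region_worth_evacuating(unsigned long used_pages)
{
	/* Assumption: an already-empty region needs no work at all. */
	return used_pages > 0 && used_pages <= EVAC_MAX_USED_PAGES;
}

/* What to do with one page of a region that was chosen for evacuation. */
static enum evac_action evac_action_for(enum page_kind kind)
{
	switch (kind) {
	case PAGE_CLEAN_PAGECACHE:
		return EVAC_RECLAIM;	/* rule 1: reclaim clean page-cache only */
	case PAGE_MOVABLE:
		return EVAC_MIGRATE;	/* rule 2: move movable pages */
	default:
		return EVAC_SKIP;	/* no other case for page movement */
	}
}
```

Keeping the "is it worth it" test separate from the per-page decision would
make it easy to swap in a different threshold later without touching the
per-page rules.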
I am also thinking of adding more smartness in the code to be very choosy in
doing the movement, and do it only in cases where it is almost guaranteed to
be beneficial. For example, I can make the kmempowerd kthread more "lazy"
while moving/reclaiming stuff; I can bias the page movements such that "cold"
pages are left around (since they are not expected to be referenced much
anyway) and only the (few) hot pages are moved... etc.
And this aggressiveness can be exposed as a policy/knob to userspace as well,
so that the user can control its degree as he wishes.
> is moving stuff around the
> 95%-of-the-work-for-the-last-5%-of-the-theoretical-gain
> or is statistical grouping enough to get > 95% of the gain... without
> the cost of moving.
>
I certainly agree with you that moving pages should really be a last
resort, done only where it really pays off. So we should definitely go
with grouping first, and then see (by appropriate benchmarking) how much
additional benefit the moving stuff brings, weighed against the overhead
involved.
But one of the goals of this patchset was to give a glimpse of all the
techniques/algorithms we can employ to consolidate memory references, and get
an idea of the extent to which such algorithms would be effective in getting
us excellent consolidation ratios.
And now that we have several techniques to choose from (and with varying
qualities and aggressiveness), we can start evaluating them more deeply and
choose the ones that give us the most benefits with least cost/overhead.
 
> 
>>
>> Also, I didn't fully understand how NUMA policy will help in this case..
>> If you want to group memory allocations/references into fewer memory
>> regions _within_ a node, will NUMA policy really help? For example, in
>> this patchset, everything (all the allocation/reference shaping) is done
>> _within_ the NUMA boundary, assuming that the memory regions are subsets
>> of a NUMA node.
>>
 
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
  2013-09-25 23:14 ` [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in " Srivatsa S. Bhat
@ 2013-09-26 22:16   ` Dave Hansen
  2013-09-27  6:34     ` Srivatsa S. Bhat
  2013-10-23 10:17   ` Johannes Weiner
  1 sibling, 1 reply; 72+ messages in thread
From: Dave Hansen @ 2013-09-26 22:16 UTC (permalink / raw)
  To: Srivatsa S. Bhat, akpm, mgorman, hannes, tony.luck,
	matthew.garrett, riel, arjan, srinivas.pandruvada, willy,
	kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel
On 09/25/2013 04:14 PM, Srivatsa S. Bhat wrote:
> @@ -605,16 +713,22 @@ static inline void __free_one_page(struct page *page,
>  		buddy_idx = __find_buddy_index(combined_idx, order + 1);
>  		higher_buddy = higher_page + (buddy_idx - combined_idx);
>  		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
> -			list_add_tail(&page->lru,
> -				&zone->free_area[order].free_list[migratetype].list);
> +
> +			/*
> +			 * Implementing an add_to_freelist_tail() won't be
> +			 * very useful because both of them (almost) add to
> +			 * the tail within the region. So we could potentially
> +			 * switch off this entire "is next-higher buddy free?"
> +			 * logic when memory regions are used.
> +			 */
> +			add_to_freelist(page, &area->free_list[migratetype]);
>  			goto out;
>  		}
>  	}
Commit 6dda9d55b says that this had some discrete performance gains.
It's a bummer that this deoptimizes it, and I think that (expected)
performance degradation at least needs to be referenced _somewhere_.
I also find it very hard to take code seriously which has stuff like this:
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
> +#endif
nine times.
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
  2013-09-26 22:16   ` Dave Hansen
@ 2013-09-27  6:34     ` Srivatsa S. Bhat
  0 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-27  6:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, maxime.coquelin@stericsson.com,
	loic.pallardy@stericsson.com, thomas.abraham@linaro.org,
	amit.kachhap@linaro.org
On 09/27/2013 03:46 AM, Dave Hansen wrote:
> On 09/25/2013 04:14 PM, Srivatsa S. Bhat wrote:
>> @@ -605,16 +713,22 @@ static inline void __free_one_page(struct page *page,
>>  		buddy_idx = __find_buddy_index(combined_idx, order + 1);
>>  		higher_buddy = higher_page + (buddy_idx - combined_idx);
>>  		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
>> -			list_add_tail(&page->lru,
>> -				&zone->free_area[order].free_list[migratetype].list);
>> +
>> +			/*
>> +			 * Implementing an add_to_freelist_tail() won't be
>> +			 * very useful because both of them (almost) add to
>> +			 * the tail within the region. So we could potentially
>> +			 * switch off this entire "is next-higher buddy free?"
>> +			 * logic when memory regions are used.
>> +			 */
>> +			add_to_freelist(page, &area->free_list[migratetype]);
>>  			goto out;
>>  		}
>>  	}
> 
> Commit 6dda9d55b says that this had some discrete performance gains.
I had seen the comments about this but not the patch which made that change.
Thanks for pointing me to the commit! But now that I went through the changelog
carefully, it appears as if there were only some slight benefits in huge page
allocation benchmarks, and the results were either inconclusive or insubstantial
in most other benchmarks that the author tried.
> It's a bummer that this deoptimizes it, and I think that (expected)
> performance degradation at least needs to be referenced _somewhere_.
>
I'm not so sure about that. Yes, I know that my patchset treats all pages
equally (by adding all of them _far_ _away_ from the head of the list), but
given that the above commit didn't show any significant improvements, I doubt
whether my patchset will lead to any noticeable _degradation_. Perhaps I'll try
out the huge-page allocation benchmark and observe what happens with my patchset.
 
> I also find it very hard to take code seriously which has stuff like this:
> 
>> +#ifdef CONFIG_DEBUG_PAGEALLOC
>> +		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
>> +#endif
> 
> nine times.
> 
Hmm, those debug checks were pretty invaluable for me when testing the code.
I retained them in the patches so that if other people test it and find
problems, they would be able to send bug reports with good amount of info as
to what exactly went wrong. Besides, this patchset adds a ton of new code, and
this list manipulation framework along with the bitmap-based radix tree is one
of the core components. If that goes for a toss, everything from there onwards
will be a train-wreck! So I felt having these checks and balances would be very
useful to validate the correct working of each piece and to debug complex
problems easily.
But please help me understand your point correctly - are you suggesting that
I remove these checks completely or just make them gel well with the other code
so that they don't become such an eyesore as they are at the moment (with all
the #ifdefs sticking out etc)?
If you are suggesting the latter, I completely agree with you. I'll find
a way to do that, and if you have any suggestions, please let me know!
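One common way to keep such checks without the repeated #ifdef blocks is to
hide the config test inside a single macro. A minimal userspace sketch of that
idea follows. REGION_WARN, region_warn_count and struct mem_region_list_sketch
are hypothetical names, CONFIG_DEBUG_PAGEALLOC is defined locally only so the
sketch builds on its own, and fprintf() merely stands in for the kernel's
WARN(); this is not the code from the posted patches.

```c
#include <stdio.h>

/* Local stand-in for the kernel config option, so the sketch is standalone. */
#define CONFIG_DEBUG_PAGEALLOC 1

/* Counts how many checks fired; exists only to make the sketch observable. */
static int region_warn_count;

#ifdef CONFIG_DEBUG_PAGEALLOC
/* Debug build: evaluate the condition and report a violated invariant. */
#define REGION_WARN(cond, msg)						\
	do {								\
		if (cond) {						\
			region_warn_count++;				\
			fprintf(stderr, "%s: %s\n", __func__, msg);	\
		}							\
	} while (0)
#else
/* Non-debug build: the check compiles away completely. */
#define REGION_WARN(cond, msg)	do { } while (0)
#endif

/* Hypothetical, cut-down version of the per-region freelist bookkeeping. */
struct mem_region_list_sketch {
	unsigned long nr_free;
};

/* Account for one page leaving the region; call sites carry no #ifdef. */
static void region_del_one(struct mem_region_list_sketch *region)
{
	REGION_WARN(region->nr_free == 0, "nr_free messed up");
	if (region->nr_free)
		region->nr_free--;
}
```

With this shape the #ifdef appears once, next to the macro definition, and
each of the call sites shrinks to a single line.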
Thank you!
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [RFC PATCH v4 01/40] mm: Introduce memory regions data-structure to capture region boundaries within nodes
  2013-09-25 23:13 ` [RFC PATCH v4 01/40] mm: Introduce memory regions data-structure to capture region boundaries within nodes Srivatsa S. Bhat
@ 2013-10-23  9:54   ` Johannes Weiner
  2013-10-23 14:38     ` Srivatsa S. Bhat
  0 siblings, 1 reply; 72+ messages in thread
From: Johannes Weiner @ 2013-10-23  9:54 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, dave, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel
On Thu, Sep 26, 2013 at 04:43:48AM +0530, Srivatsa S. Bhat wrote:
> The memory within a node can be divided into regions of memory that can be
> independently power-managed. That is, chunks of memory can be transitioned
> (manually or automatically) to low-power states based on the frequency of
> references to that region. For example, if a memory chunk is not referenced
> for a given threshold amount of time, the hardware (memory controller) can
> decide to put that piece of memory into a content-preserving low-power state.
> And of course, on the next reference to that chunk of memory, it will be
> transitioned back to full-power for read/write operations.
> 
> So, the Linux MM can take advantage of this feature by managing the available
> memory with an eye towards power-savings - ie., by keeping the memory
> allocations/references consolidated to a minimum no. of such power-manageable
> memory regions. In order to do so, the first step is to teach the MM about
> the boundaries of these regions - and to capture that info, we introduce a new
> data-structure called "Memory Regions".
> 
> [Also, the concept of memory regions could potentially be extended to work
> with different classes of memory like PCM (Phase Change Memory) etc and
> hence, it is not limited to just power management alone].
> 
> We already sub-divide a node's memory into zones, based on some well-known
> constraints. So the question is, where do we fit in memory regions in this
> hierarchy. Instead of artificially trying to fit it into the hierarchy one
> way or the other, we choose to simply capture the region boundaries in a
> parallel data-structure, since most likely the region boundaries won't
> naturally fit inside the zone boundaries or vice-versa.
> 
> But of course, memory regions are sub-divisions *within* a node, so it makes
> sense to keep the data-structures in the node's struct pglist_data. (Thus
> this placement makes memory regions parallel to zones in that node).
> 
> Once we capture the region boundaries in the memory regions data-structure,
> we can influence MM decisions at various places, such as page allocation,
> reclamation etc, in order to perform power-aware memory management.
> 
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
> 
>  include/linux/mmzone.h |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bd791e4..d3288b0 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -35,6 +35,8 @@
>   */
>  #define PAGE_ALLOC_COSTLY_ORDER 3
>  
> +#define MAX_NR_NODE_REGIONS	512
> +
>  enum {
>  	MIGRATE_UNMOVABLE,
>  	MIGRATE_RECLAIMABLE,
> @@ -708,6 +710,14 @@ struct node_active_region {
>  extern struct page *mem_map;
>  #endif
>  
> +struct node_mem_region {
> +	unsigned long start_pfn;
> +	unsigned long end_pfn;
> +	unsigned long present_pages;
> +	unsigned long spanned_pages;
> +	struct pglist_data *pgdat;
> +};
> +
>  /*
>   * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
>   * (mostly NUMA machines?) to denote a higher-level memory zone than the
> @@ -724,6 +734,8 @@ typedef struct pglist_data {
>  	struct zone node_zones[MAX_NR_ZONES];
>  	struct zonelist node_zonelists[MAX_ZONELISTS];
>  	int nr_zones;
> +	struct node_mem_region node_regions[MAX_NR_NODE_REGIONS];
> +	int nr_node_regions;
>  #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
>  	struct page *node_mem_map;
>  #ifdef CONFIG_MEMCG
Please don't write patches that add data structures but do not use
them.
This is a pattern throughout the whole series.  You add a data
structure in one patch, individual helper functions in followup
patches, optimizations and statistics in yet more patches, even
unrelated cleanups and documentation like the fls() vs __fls() stuff,
until finally you add the actual algorithm, also bit by bit.  I find
it really hard to review when I have to jump back and forth between
several different emails to piece things together.
Prepare the code base as necessary (the fls stuff, instrumentation for
existing code, cleanups), then add the most basic data structure and
code in one patch, then follow up with new statistics, optimizations
etc. (unless the optimizations can be reasonably folded into the
initial implementation in the first place).  This might not always be
possible of course, but please strive for it.
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [RFC PATCH v4 16/40] mm: Introduce a "Region Allocator" to manage entire memory regions
  2013-09-25 23:17 ` [RFC PATCH v4 16/40] mm: Introduce a "Region Allocator" to manage entire memory regions Srivatsa S. Bhat
@ 2013-10-23 10:10   ` Johannes Weiner
  2013-10-23 16:22     ` Srivatsa S. Bhat
  0 siblings, 1 reply; 72+ messages in thread
From: Johannes Weiner @ 2013-10-23 10:10 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, dave, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel
On Thu, Sep 26, 2013 at 04:47:34AM +0530, Srivatsa S. Bhat wrote:
> Today, the MM subsystem uses the buddy 'Page Allocator' to manage memory
> at a 'page' granularity. But this allocator has no notion of the physical
> topology of the underlying memory hardware, and hence it is hard to
> influence memory allocation decisions keeping the platform constraints
> in mind.
This is no longer true after patches 1-15 introduce regions and have
the allocator try to stay within the lowest possible region (patch
15).  Which leaves the question what the following patches are for.
This patch only adds a data structure and I gave up finding where
among the helpers, statistics, and optimization patches an actual
implementation is.
Again, please try to make every single patch a complete logical
change to the code base.
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
  2013-09-25 23:14 ` [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in " Srivatsa S. Bhat
  2013-09-26 22:16   ` Dave Hansen
@ 2013-10-23 10:17   ` Johannes Weiner
  2013-10-23 16:09     ` Srivatsa S. Bhat
  1 sibling, 1 reply; 72+ messages in thread
From: Johannes Weiner @ 2013-10-23 10:17 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, dave, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel
On Thu, Sep 26, 2013 at 04:44:56AM +0530, Srivatsa S. Bhat wrote:
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -517,6 +517,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>  	return 0;
>  }
>  
> +static void add_to_freelist(struct page *page, struct free_list *free_list)
> +{
> +	struct list_head *prev_region_list, *lru;
> +	struct mem_region_list *region;
> +	int region_id, i;
> +
> +	lru = &page->lru;
> +	region_id = page_zone_region_id(page);
> +
> +	region = &free_list->mr_list[region_id];
> +	region->nr_free++;
> +
> +	if (region->page_block) {
> +		list_add_tail(lru, region->page_block);
> +		return;
> +	}
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
> +#endif
> +
> +	if (!list_empty(&free_list->list)) {
> +		for (i = region_id - 1; i >= 0; i--) {
> +			if (free_list->mr_list[i].page_block) {
> +				prev_region_list =
> +					free_list->mr_list[i].page_block;
> +				goto out;
> +			}
> +		}
> +	}
> +
> +	/* This is the first region, so add to the head of the list */
> +	prev_region_list = &free_list->list;
> +
> +out:
> +	list_add(lru, prev_region_list);
> +
> +	/* Save pointer to page block of this region */
> +	region->page_block = lru;
"Pageblock" has a different meaning in the allocator already.
The things you string up here are just called pages, regardless of
which order they are in and how many pages they can be split into.
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [RFC PATCH v4 01/40] mm: Introduce memory regions data-structure to capture region boundaries within nodes
  2013-10-23  9:54   ` Johannes Weiner
@ 2013-10-23 14:38     ` Srivatsa S. Bhat
  0 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-23 14:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, dave, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, mark.gross
On 10/23/2013 03:24 PM, Johannes Weiner wrote:
> On Thu, Sep 26, 2013 at 04:43:48AM +0530, Srivatsa S. Bhat wrote:
>> The memory within a node can be divided into regions of memory that can be
>> independently power-managed. That is, chunks of memory can be transitioned
>> (manually or automatically) to low-power states based on the frequency of
>> references to that region. For example, if a memory chunk is not referenced
>> for a given threshold amount of time, the hardware (memory controller) can
>> decide to put that piece of memory into a content-preserving low-power state.
>> And of course, on the next reference to that chunk of memory, it will be
>> transitioned back to full-power for read/write operations.
>>
>> So, the Linux MM can take advantage of this feature by managing the available
>> memory with an eye towards power-savings - ie., by keeping the memory
>> allocations/references consolidated to a minimum no. of such power-manageable
>> memory regions. In order to do so, the first step is to teach the MM about
>> the boundaries of these regions - and to capture that info, we introduce a new
>> data-structure called "Memory Regions".
>>
>> [Also, the concept of memory regions could potentially be extended to work
>> with different classes of memory like PCM (Phase Change Memory) etc and
>> hence, it is not limited to just power management alone].
>>
>> We already sub-divide a node's memory into zones, based on some well-known
>> constraints. So the question is, where do we fit in memory regions in this
>> hierarchy. Instead of artificially trying to fit it into the hierarchy one
>> way or the other, we choose to simply capture the region boundaries in a
>> parallel data-structure, since most likely the region boundaries won't
>> naturally fit inside the zone boundaries or vice-versa.
>>
>> But of course, memory regions are sub-divisions *within* a node, so it makes
>> sense to keep the data-structures in the node's struct pglist_data. (Thus
>> this placement makes memory regions parallel to zones in that node).
>>
>> Once we capture the region boundaries in the memory regions data-structure,
>> we can influence MM decisions at various places, such as page allocation,
>> reclamation etc, in order to perform power-aware memory management.
>>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>> ---
>>
>>  include/linux/mmzone.h |   12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index bd791e4..d3288b0 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -35,6 +35,8 @@
>>   */
>>  #define PAGE_ALLOC_COSTLY_ORDER 3
>>  
>> +#define MAX_NR_NODE_REGIONS	512
>> +
>>  enum {
>>  	MIGRATE_UNMOVABLE,
>>  	MIGRATE_RECLAIMABLE,
>> @@ -708,6 +710,14 @@ struct node_active_region {
>>  extern struct page *mem_map;
>>  #endif
>>  
>> +struct node_mem_region {
>> +	unsigned long start_pfn;
>> +	unsigned long end_pfn;
>> +	unsigned long present_pages;
>> +	unsigned long spanned_pages;
>> +	struct pglist_data *pgdat;
>> +};
>> +
>>  /*
>>   * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
>>   * (mostly NUMA machines?) to denote a higher-level memory zone than the
>> @@ -724,6 +734,8 @@ typedef struct pglist_data {
>>  	struct zone node_zones[MAX_NR_ZONES];
>>  	struct zonelist node_zonelists[MAX_ZONELISTS];
>>  	int nr_zones;
>> +	struct node_mem_region node_regions[MAX_NR_NODE_REGIONS];
>> +	int nr_node_regions;
>>  #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
>>  	struct page *node_mem_map;
>>  #ifdef CONFIG_MEMCG
> 
> Please don't write patches that add data structures but do not use
> them.
> 
> This is a pattern throughout the whole series.  You add a data
> structure in one patch, individual helper functions in followup
> patches, optimizations and statistics in yet more patches, even
> unrelated cleanups and documentation like the fls() vs __fls() stuff,
> until finally you add the actual algorithm, also bit by bit.  I find
> it really hard to review when I have to jump back and forth between
> several different emails to piece things together.
> 
Hmm, sorry about that! I was trying to keep the amount of code in each
patch small enough that it is easy to review. I didn't realize that the
split was making it difficult to connect the different pieces together
while reviewing the code.
> Prepare the code base as necessary (the fls stuff, instrumentation for
> existing code, cleanups), then add the most basic data structure and
> code in one patch, then follow up with new statistics, optimizations
> etc. (unless the optimizations can be reasonably folded into the
> initial implementation in the first place).  This might not always be
> possible of course, but please strive for it.
> 
Sure, I'll try that in the next posting. But for this patch series, let
me at least describe the high-level goal that each group of patches
tries to achieve, so that it becomes easier to review them.
So here it is:
Patches 1 - 4 do the most basic, first phase of work required to make the
MM subsystem aware of the underlying topology, by building the notion of
independently power-manageable regions, and carving out suitable chunks
from the zones. Thus at the end of patch 4, we have a zone-level
representation of memory regions, and we can determine the memory region
to which any given page belongs. So far, no real influence has been made
in any of the MM decisions such as page allocation.
Patches 5 and 6 start the real work of trying to influence the page
allocator's decisions - they integrate the notion of "memory regions"
within the buddy freelists themselves, by using appropriate data-structures.
These 2 patches also bring about an important change in the mechanics of
how pages are added and deleted from the buddy freelists. In particular,
deleting a page is no longer as simple as list_del(&page->lru). We need to
provide more information than that, as suggested by the prototype of
del_from_freelist(). We need to know exactly which freelist the page
belongs to, and for that we need to accurately keep track of the page's
migratetype even when it is in the buddy allocator.
That gives rise to patches 7 and 8. They fix up things related to migratetype
tracking, to prevent the mechanics of del_from_freelist() from falling
apart. So by now, we have a stable implementation of maintaining freepages
in the buddy allocator, sorted into different region buckets.
So, next come the optimizations. Patch 9 introduces a variable named
'next_region' per freelist, to avoid looking up the page-to-region translation
every time. That's one level of optimization.
Patch 11 adds another optimization by improving the sorting speed by using
a bitmap-based radix tree approach. When developing the patch, I had a hard
time figuring out that __fls() had completely different semantics than fls().
So I thought I should add a comment explaining that part, before I start
using __fls() in patch 11 (because I didn't find any documentation about
that subtle difference anywhere). That's why I put in patch 10 to do that.
But yes, I agree that it's a bit extraneous, and ideally should go in as an
independent patch.
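The difference between the two helpers can be illustrated in userspace. In the
kernel, fls() returns the 1-based position of the most significant set bit
(and 0 for an input of 0), whereas __fls() returns the 0-based index and is
undefined for 0. The sketch below assumes GCC or Clang (for __builtin_clz and
__builtin_clzl). fls_sketch(), fls_zero_based_sketch() and
prev_nonempty_region() are hypothetical stand-ins, and the single-word bitmap
is a deliberate simplification of the bitmap-based radix tree mentioned for
patch 11, not its actual layout.

```c
#include <limits.h>

/* Like fls(): 1-based position of the highest set bit, 0 if x == 0. */
static int fls_sketch(unsigned int x)
{
	return x ? (int)(sizeof(x) * CHAR_BIT) - __builtin_clz(x) : 0;
}

/* Like __fls(): 0-based index of the highest set bit; word must be != 0. */
static unsigned long fls_zero_based_sketch(unsigned long word)
{
	return (sizeof(word) * CHAR_BIT - 1) -
	       (unsigned long)__builtin_clzl(word);
}

/*
 * Bit i of 'bitmap' is set when region i has free pages on this freelist.
 * Return the highest-numbered non-empty region below 'region_id', or -1 if
 * there is none. This replaces a linear backwards scan over the regions.
 */
static int prev_nonempty_region(unsigned long bitmap, unsigned int region_id)
{
	unsigned long below;

	if (region_id >= sizeof(bitmap) * CHAR_BIT)
		below = bitmap;			/* every bit is below region_id */
	else
		below = bitmap & ((1UL << region_id) - 1);

	if (!below)
		return -1;			/* avoid the undefined case */

	return (int)fls_zero_based_sketch(below);
}
```

Note how the zero check has to happen before the 0-based helper is called;
that is exactly the kind of off-by-one and undefined-input trap that makes
the two sets of semantics easy to confuse.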
So by patch 11, we have a reasonably well-contained memory power management
infrastructure. So I felt it would be best to enable per-region statistics
as soon as possible in the patch series, so that we can measure the improvement
brought about by each subsequent optimization or change and make a good
evaluation of how beneficial they are. So patches 12, 13 and 14
implement that and export per-region statistics. IMHO this ordering is quite
important since we are yet to completely agree on which parts of the
patchset are useful in a wide variety of cases and which are not. So exposing
the statistics as early as possible in the patchset enables this sort of
evaluation.
Patch 15 is a fundamental change in how we allocate pages from the page
allocator, so I kept that patch separate, to make it noticeable, since it
has the potential to have direct impacts on performance.
By patch 15, we have the maximum amount of tweaking/tuning/optimization
for the sorted-buddy infrastructure. So from patch 16 onwards, we start
adding some very different stuff, designed to augment the sorted-buddy page
allocator.
Patch 16 inserts a new layer between the page allocator and the memory
hardware, known as the "region allocator". The idea is that the region
allocator allocates entire regions, from which the page allocator can further
split up things and allocate in smaller chunks (pages). The goal here is
to avoid the fragmentation of pages of different migratetypes among
various memory regions, and instead make it easy to have 'n' entire regions
for all MIGRATE_UNMOVABLE allocations, 'm' entire regions for MIGRATE_MOVABLE
and so on. This has a pronounced impact in improving the success of the targeted
region compaction/evacuation framework (which comes later in the patchset).
For example, it can avoid cases where a single unmovable page is stuck in
a region otherwise populated by mostly movable or reclaimable allocations.
So basically you can think of this as a way of extending the 'pageblock_order'
fragmentation avoidance mechanism such that it can incorporate memory region
topology. That is, it will help us avoid mixing pages of different migratetypes
within a single region, and thus keep entire regions homogeneous with respect
to the allocation type.
Patches 17 and 18 add the infrastructure necessary to perform bulk movements
of pages between the page allocator and the region allocator, since that's
how the 2 entities will interact with each other. Then patches 19 and 20
provide helpers to talk to the region allocator itself, in terms of requesting
or giving back memory.
Now since we have _two_ different allocators (page and region), they need
to coordinate in their strategy. The page allocator chooses the lowest numbered
region to allocate from. Patch 22 adds this same strategy to the region
allocator as well.
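The idea of handing out whole regions, lowest-numbered first, so that each
region stays homogeneous with respect to migratetype, can be sketched as
below. This is a toy userspace model under stated assumptions: the struct,
the enum, NR_REGIONS_SKETCH and all function names are hypothetical, and the
real region allocator in patches 16-33 is considerably more involved.

```c
#define NR_REGIONS_SKETCH 8

/* Hypothetical migratetype tags; MT_NONE marks a region nobody owns. */
enum mt_sketch { MT_NONE = -1, MT_UNMOVABLE, MT_RECLAIMABLE, MT_MOVABLE };

/* Toy region allocator: records which migratetype owns each region. */
struct region_allocator_sketch {
	enum mt_sketch owner[NR_REGIONS_SKETCH];
};

/* Initially every region is held by the region allocator itself. */
static void region_allocator_init(struct region_allocator_sketch *ra)
{
	for (int i = 0; i < NR_REGIONS_SKETCH; i++)
		ra->owner[i] = MT_NONE;
}

/*
 * Give the lowest-numbered unowned region to migratetype 'mt', so the page
 * allocator can split it into pages of that type only.
 * Returns the region id, or -1 if no whole region is available.
 */
static int region_alloc_lowest(struct region_allocator_sketch *ra,
			       enum mt_sketch mt)
{
	for (int i = 0; i < NR_REGIONS_SKETCH; i++) {
		if (ra->owner[i] == MT_NONE) {
			ra->owner[i] = mt;
			return i;
		}
	}
	return -1;
}

/* Return a completely free region to the region allocator. */
static void region_free(struct region_allocator_sketch *ra, int id)
{
	if (id >= 0 && id < NR_REGIONS_SKETCH)
		ra->owner[id] = MT_NONE;
}
```

Because every allocation scans from region 0 upwards, a freed low-numbered
region is reused before any higher one, which biases usage towards the low
end and leaves the higher regions idle.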
I admit that patches 23 and 24 are a bit oddly placed.
Patches 25 and 26 finally connect the page and the region allocators, now
that we have all the infrastructure ready. This is kept separate because this
has a policy associated with it and hence needs discussion (as in, how often
does the page allocator try to move regions back to the region allocator,
and at what points in the code (fast/hot vs slow paths etc)).
Patches 27 to 32 are mostly policy changes that drastically change how
fallbacks are handled. These are important to keep the region allocator sane
and simple. If for example, an unmovable page allocation falls back to
movable and then never returns the page to movable freelist even upon free,
then it will be very hard to account for that page as part of the region.
So it will enormously complicate the interaction between the page allocator
and the region allocator. Patches 27 to 32 help avoid that.
Patch 33 is the final patch related to the region allocator - it just adds a
caching logic to avoid frequent interactions between the page allocator and
the region allocator (ping-pong kind of interactions).
Patches 34 to 40 introduce the targeted compaction/region evacuation logic,
which is meant to augment the sorted-buddy and the region allocator, in causing
power-savings. Basically they carve out the reusable compaction bits from
CMA and build a per-node kthread infrastructure to free lightly allocated
regions. Then, the final patch 40 adds the trigger to wakeup these kthreads
from the page allocator, at appropriate opportunity points.
Hope this explanation helps to make it easier to review the patches!
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
  2013-10-23 10:17   ` Johannes Weiner
@ 2013-10-23 16:09     ` Srivatsa S. Bhat
  0 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-23 16:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, dave, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, mark.gross
On 10/23/2013 03:47 PM, Johannes Weiner wrote:
> On Thu, Sep 26, 2013 at 04:44:56AM +0530, Srivatsa S. Bhat wrote:
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -517,6 +517,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>>  	return 0;
>>  }
>>  
>> +static void add_to_freelist(struct page *page, struct free_list *free_list)
>> +{
>> +	struct list_head *prev_region_list, *lru;
>> +	struct mem_region_list *region;
>> +	int region_id, i;
>> +
>> +	lru = &page->lru;
>> +	region_id = page_zone_region_id(page);
>> +
>> +	region = &free_list->mr_list[region_id];
>> +	region->nr_free++;
>> +
>> +	if (region->page_block) {
>> +		list_add_tail(lru, region->page_block);
>> +		return;
>> +	}
>> +
>> +#ifdef CONFIG_DEBUG_PAGEALLOC
>> +	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
>> +#endif
>> +
>> +	if (!list_empty(&free_list->list)) {
>> +		for (i = region_id - 1; i >= 0; i--) {
>> +			if (free_list->mr_list[i].page_block) {
>> +				prev_region_list =
>> +					free_list->mr_list[i].page_block;
>> +				goto out;
>> +			}
>> +		}
>> +	}
>> +
>> +	/* This is the first region, so add to the head of the list */
>> +	prev_region_list = &free_list->list;
>> +
>> +out:
>> +	list_add(lru, prev_region_list);
>> +
>> +	/* Save pointer to page block of this region */
>> +	region->page_block = lru;
> 
> "Pageblock" has a different meaning in the allocator already.
> 
> The things you string up here are just called pages, regardless of
> which order they are in and how many pages they can be split into.
> 
Ah, yes. I'll fix that.
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [RFC PATCH v4 16/40] mm: Introduce a "Region Allocator" to manage entire memory regions
  2013-10-23 10:10   ` Johannes Weiner
@ 2013-10-23 16:22     ` Srivatsa S. Bhat
  0 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-23 16:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, dave, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, mark.gross
On 10/23/2013 03:40 PM, Johannes Weiner wrote:
> On Thu, Sep 26, 2013 at 04:47:34AM +0530, Srivatsa S. Bhat wrote:
>> Today, the MM subsystem uses the buddy 'Page Allocator' to manage memory
>> at a 'page' granularity. But this allocator has no notion of the physical
>> topology of the underlying memory hardware, and hence it is hard to
>> influence memory allocation decisions keeping the platform constraints
>> in mind.
> 
> This is no longer true after patches 1-15 introduce regions and have
> the allocator try to stay within the lowest possible region (patch
> 15).
Sorry, the changelog is indeed misleading. What I really meant to say
here is that there is no way to keep an entire region homogeneous with
respect to allocation types, i.e., to have only a single type of allocation
(like movable) in it. Patches 1-15 don't address that problem. The later
ones do.
>  Which leaves the question what the following patches are for.
> 
The region allocator is meant to help in keeping entire memory regions
homogeneous with respect to allocations. This helps in increasing the
success rate of targeted region evacuation. For example, if we know
that the region has only unmovable allocations, we can completely skip
compaction/evac on that region. And this can be determined just by looking
at the pageblock migratetype of *one* of the pages of that region; thus
it's very cheap. Similarly, if we know that the region has only movable
allocations, we can try compaction on it when it's lightly allocated.
And we won't have horrible scenarios where we moved, say, 15 pages and then
found out that there is an unmovable page stuck in that region, making
all that previous work go to waste.
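To make that decision concrete, here is a rough userspace sketch. It is not
code from the patchset: the simplified migratetype enum, struct region_info,
evac_decide() and the max_used_pct threshold for "lightly allocated" are all
hypothetical names chosen for the illustration. It only relies on the
property described above, namely that a homogeneous region can be classified
from a single migratetype.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's migratetypes (assumption). */
enum region_migratetype { RMT_UNMOVABLE, RMT_RECLAIMABLE, RMT_MOVABLE };

enum evac_action { EVAC_SKIP, EVAC_TRY };

/* Hypothetical per-region summary; the region is assumed homogeneous. */
struct region_info {
	enum region_migratetype type;	/* read from any one pageblock */
	unsigned long nr_pages;		/* total pages in the region */
	unsigned long nr_free;		/* free pages in the region */
};

/*
 * Decide whether evacuating a region is worth attempting.
 * max_used_pct is an illustrative threshold: regions with more than this
 * percentage of pages in use are not considered lightly allocated.
 */
static enum evac_action evac_decide(const struct region_info *r,
				    unsigned int max_used_pct)
{
	unsigned long used;

	assert(r->nr_free <= r->nr_pages);
	used = r->nr_pages - r->nr_free;

	/* Only movable pages can be migrated out; skip everything else. */
	if (r->type != RMT_MOVABLE)
		return EVAC_SKIP;

	/* Nothing to move out of an already empty region. */
	if (used == 0)
		return EVAC_SKIP;

	/* Too heavily used: moving that many pages costs more than it saves. */
	if (used * 100 > (unsigned long)max_used_pct * r->nr_pages)
		return EVAC_SKIP;

	return EVAC_TRY;
}
```

The point of the sketch is that the unmovable case is rejected with one
comparison, without scanning any page in the region.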
> This patch only adds a data structure and I gave up finding where
> among the helpers, statistics, and optimization patches an actual
> implementation is.
> 
I hope the patch-wise explanation that I gave in the other mail will
help make this understandable. Please do let me know if you need any
other clarifications.
> Again, please try to make every single patch a complete logical
> change to the code base.
Sure, I'll strive for that in the next postings.
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-09-26 12:58     ` Srivatsa S. Bhat
  2013-09-26 15:29       ` Arjan van de Ven
@ 2013-11-12  8:02       ` Srivatsa S. Bhat
  2013-11-12 17:34         ` Dave Hansen
  2013-11-12 18:49         ` Srivatsa S. Bhat
  1 sibling, 2 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-11-12  8:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgorman, dave, hannes, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, maxime.coquelin, loic.pallardy, amit.kachhap,
	thomas.abraham, markgross
On 09/26/2013 06:28 PM, Srivatsa S. Bhat wrote:
> On 09/26/2013 05:10 AM, Andrew Morton wrote:
>> On Thu, 26 Sep 2013 04:56:32 +0530 "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>
>>> Experimental Results:
>>> ====================
>>>
>>> Test setup:
>>> ----------
>>>
>>> x86 Sandybridge dual-socket quad core HT-enabled machine, with 128GB RAM.
>>> Memory Region size = 512MB.
>>
>> Yes, but how much power was saved ;)
>>
> 
> I don't have those numbers yet, but I'll be able to get them going forward.
> 
Hi,
I performed experiments on an IBM POWER 7 machine and got actual power-savings
numbers (upto 2.6% of total system power) from this patchset. I presented them
at the Kernel Summit but forgot to post them on LKML. So here they are:
Hardware-setup:
--------------
IBM POWER 7 machine: 4 socket (NUMA), 32 cores, 128GB RAM
 - 4 NUMA nodes with 32 GB RAM each
 - Booted with numa=fake=1 and treated them as 4 memory regions
Software setup:
--------------
Workload: Run modified ebizzy for half an hour, which allocates and frees large
quantities of memory frequently. The modified ebizzy touches every allocated
page a number of times (4 times) before freeing it up. This ensures that
allocating a page in the "wrong" memory region makes it very costly in terms
of power-savings, since every allocated page is accessed before getting
freed (and accesses cause energy consumption). Thus, with this modified
benchmark, sub-optimal MM decisions (in terms of memory power-savings) get
magnified and hence become noticeable.
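The access pattern described above can be sketched as follows. This is not
the actual ebizzy modification, only a minimal illustration of "allocate,
touch every page 4 times, free"; the function name alloc_touch_free() and
its parameters are made up for the sketch.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOUCHES_PER_PAGE 4	/* every page is touched 4 times, as above */

/*
 * Allocate a chunk, write one byte in every page TOUCHES_PER_PAGE times,
 * then free the chunk. Returns the number of page touches performed, or 0
 * if the arguments are invalid or the allocation fails.
 */
static size_t alloc_touch_free(size_t chunk_bytes, size_t page_size)
{
	volatile unsigned char *buf;
	size_t touched = 0;

	if (chunk_bytes == 0 || page_size == 0)
		return 0;

	buf = malloc(chunk_bytes);
	if (!buf)
		return 0;

	for (int pass = 0; pass < TOUCHES_PER_PAGE; pass++) {
		for (size_t off = 0; off < chunk_bytes; off += page_size) {
			/* volatile keeps the compiler from eliding the write */
			buf[off] = (unsigned char)pass;
			touched++;
		}
	}

	free((void *)buf);
	return touched;
}
```

Because every page is written before the chunk is freed, a page handed out
from an otherwise idle memory region forces that region to be referenced,
which is what makes a poor placement decision visible in the power numbers.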
Power-savings compared to mainline (3.12-rc4):
---------------------------------------------
With this patchset applied, the average power of the system dropped by 2.6%
compared to the mainline kernel during the benchmark run. The total system
power is an excellent metric for such evaluations, since it brings out the
overall power-efficiency of the patchset. (IOW, if the patchset shoots up the
CPU or disk power-consumption while causing memory power savings, then the
total system power will not show much difference). So these numbers indicate
that the patchset performs quite well in reducing the power-consumption of
the system as a whole.
This is not the ideal hardware configuration to test on, since I had
only 4 memory regions to play with, but it gives a good initial indication
of the kind of power savings that can be achieved with this patchset.
I expect the same patchset to give power savings of up to 5% of the
total system power on newer prototype hardware that I have (since it has
more memory regions and lower base power consumption).
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-11-12  8:02       ` Srivatsa S. Bhat
@ 2013-11-12 17:34         ` Dave Hansen
  2013-11-12 18:44           ` Srivatsa S. Bhat
  2013-11-12 18:49         ` Srivatsa S. Bhat
  1 sibling, 1 reply; 72+ messages in thread
From: Dave Hansen @ 2013-11-12 17:34 UTC (permalink / raw)
  To: Srivatsa S. Bhat, Andrew Morton
  Cc: mgorman, hannes, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, maxime.coquelin, loic.pallardy, amit.kachhap,
	thomas.abraham, markgross
On 11/12/2013 12:02 AM, Srivatsa S. Bhat wrote:
> I performed experiments on an IBM POWER 7 machine and got actual power-savings
> numbers (upto 2.6% of total system power) from this patchset. I presented them
> at the Kernel Summit but forgot to post them on LKML. So here they are:
"upto"?  What was it, actually?  Essentially what you've told us here is
that you have a patch that tries to do some memory power management and
that it accomplishes that.  But, to what degree?
Was your baseline against a kernel also booted with numa=fake=1, or was
it a kernel booted normally?
1. What is the theoretical power savings from memory?
2. How much of the theoretical numbers can your patch reach?
3. What is the performance impact?  Does it hurt ebizzy?
You also said before:
> On page 40, the paper shows the power-consumption breakdown for an IBM p670
> machine, which shows that as much as 40% of the system energy is consumed by
> the memory sub-system in a mid-range server.
2.6% seems pretty awful for such an invasive patch set if you were
expecting 40%.
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-11-12 17:34         ` Dave Hansen
@ 2013-11-12 18:44           ` Srivatsa S. Bhat
  0 siblings, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-11-12 18:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, mgorman, hannes, tony.luck, matthew.garrett, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, maxime.coquelin, loic.pallardy, amit.kachhap,
	thomas.abraham, markgross
On 11/12/2013 11:04 PM, Dave Hansen wrote:
> On 11/12/2013 12:02 AM, Srivatsa S. Bhat wrote:
>> I performed experiments on an IBM POWER 7 machine and got actual power-savings
>> numbers (upto 2.6% of total system power) from this patchset. I presented them
>> at the Kernel Summit but forgot to post them on LKML. So here they are:
> 
> "upto"?  What was it, actually?
Hmm? It _was_ 2.6% (Maybe my usage of the word 'upto' is misleading in that
sentence, sorry).
>  Essentially what you've told us here is
> that you have a patch that tries to do some memory power management and
> that it accomplishes that.  But, to what degree?
> 
> Was your baseline against a kernel also booted with numa=fake=1, or was
> it a kernel booted normally?
> 
The baseline kernel was also booted with numa=fake=1.
> 1. What is the theoretical power savings from memory?
I don't have that number for POWER (but I'll work on that), but referring to
the previous data from Samsung ARM boards, it was around the same number
(2.7% to 3.2% of total system power) for memory power management using
content-preserving states. For content-destructive (power-off) states, it was
around 6.3%.
http://article.gmane.org/gmane.linux.kernel.mm/65935
> 2. How much of the theoretical numbers can your patch reach?
Honestly, the 2.6% number on the hardware that I tested is not bad at all.
As I mentioned, the base power consumption of the system (power consumption
at idle) was a bit high, so the percentage power-savings value might look small,
but nevertheless it is not insignificant. I'm trying to set up newer prototype
hardware to test this patchset, and I expect to see better numbers on that
with the same code. By some crude initial estimates, I expect to see around
5% power-savings with the same patchset.
> 3. What is the performance impact?  Does it hurt ebizzy?
> 
Ebizzy numbers were quite low in both cases (vanilla and patched kernel),
in single digits, due to the huge allocations/frees that were done on every
loop. So comparing performance with those numbers is not going to be reliable.
I'll work on detailed performance measurements after I'm done with the initial
power-savings experiments.
> You also said before:
>> On page 40, the paper shows the power-consumption breakdown for an IBM p670
>> machine, which shows that as much as 40% of the system energy is consumed by
>> the memory sub-system in a mid-range server.
> 
> 2.6% seems pretty awful for such an invasive patch set if you were
> expecting 40%.
As I said, this was not the ideal hardware to test my patches on. 128GB
is not a particularly large amount of RAM. So obviously it won't contribute a
whole lot to the total system power, at least not as much as, say, a terabyte
of RAM would. So yeah, the overall number is small, but given the relatively
modest amount of RAM installed on that machine, the savings are not negligible.
Also, I used only 4 memory regions on this hardware, which is quite a small
number to play with. The more memory regions there are, the more opportunity
my patches have to save power. So I'll test with newer platforms
(with more memory regions) to see how well that goes.
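The relationship behind this argument can be written down as simple
arithmetic: the saving seen at the wall is the memory subsystem's share of
total power multiplied by the fraction of memory power that is saved. The
helper names below are made up for this sketch, and the memory share of the
POWER 7 test machine was not measured, so any value plugged in for it is an
assumption.

```c
/*
 * Total-system saving (in percent) given the memory subsystem's share of
 * total power and the fraction of memory power saved (both in percent).
 */
static double total_savings_pct(double mem_share_pct, double mem_savings_pct)
{
	return mem_share_pct * mem_savings_pct / 100.0;
}

/*
 * Inverse: memory-power saving implied by an observed total saving.
 * mem_share_pct must be non-zero.
 */
static double implied_mem_savings_pct(double total_pct, double mem_share_pct)
{
	return total_pct * 100.0 / mem_share_pct;
}
```

Purely as an illustration: if memory accounted for 40% of system power (the
p670 figure quoted above, not a measurement from this machine), a 2.6% total
saving would correspond to a 6.5% reduction in memory power. With a smaller
memory share, the same 2.6% would correspond to a proportionally larger
reduction in memory power.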
Regards,
Srivatsa S. Bhat
^ permalink raw reply	[flat|nested] 72+ messages in thread
* Re: [Results] [RFC PATCH v4 00/40] mm: Memory Power Management
  2013-11-12  8:02       ` Srivatsa S. Bhat
  2013-11-12 17:34         ` Dave Hansen
@ 2013-11-12 18:49         ` Srivatsa S. Bhat
  1 sibling, 0 replies; 72+ messages in thread
From: Srivatsa S. Bhat @ 2013-11-12 18:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgorman, dave, hannes, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel, maxime.coquelin, loic.pallardy, amit.kachhap,
	thomas.abraham, markgross
On 11/12/2013 01:32 PM, Srivatsa S. Bhat wrote:
> On 09/26/2013 06:28 PM, Srivatsa S. Bhat wrote:
>> On 09/26/2013 05:10 AM, Andrew Morton wrote:
>>> On Thu, 26 Sep 2013 04:56:32 +0530 "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>
>>>> Experimental Results:
>>>> ====================
>>>>
>>>> Test setup:
>>>> ----------
>>>>
>>>> x86 Sandybridge dual-socket quad core HT-enabled machine, with 128GB RAM.
>>>> Memory Region size = 512MB.
>>>
>>> Yes, but how much power was saved ;)
>>>
>>
>> I don't have those numbers yet, but I'll be able to get them going forward.
>>
> 
> Hi,
> 
> I performed experiments on an IBM POWER 7 machine and got actual power-savings
> numbers (upto 2.6% of total system power) from this patchset. I presented them
> at the Kernel Summit but forgot to post them on LKML. So here they are:
> 
<snip>
And here is a recent LWN article that highlights the important design changes
in this version and gives a good overview of this patchset as a whole:
http://lwn.net/Articles/568891/
And here is the link to the patchset (v4):
http://lwn.net/Articles/568369/
Regards,
Srivatsa S. Bhat
 
^ permalink raw reply	[flat|nested] 72+ messages in thread
end of thread, other threads:[~2013-11-12 18:54 UTC | newest]
Thread overview: 72+ messages
2013-09-25 23:13 [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
2013-09-25 23:13 ` [RFC PATCH v4 01/40] mm: Introduce memory regions data-structure to capture region boundaries within nodes Srivatsa S. Bhat
2013-10-23  9:54   ` Johannes Weiner
2013-10-23 14:38     ` Srivatsa S. Bhat
2013-09-25 23:14 ` [RFC PATCH v4 02/40] mm: Initialize node memory regions during boot Srivatsa S. Bhat
2013-09-25 23:14 ` [RFC PATCH v4 03/40] mm: Introduce and initialize zone memory regions Srivatsa S. Bhat
2013-09-25 23:14 ` [RFC PATCH v4 04/40] mm: Add helpers to retrieve node region and zone region for a given page Srivatsa S. Bhat
2013-09-25 23:14 ` [RFC PATCH v4 05/40] mm: Add data-structures to describe memory regions within the zones' freelists Srivatsa S. Bhat
2013-09-25 23:14 ` [RFC PATCH v4 06/40] mm: Demarcate and maintain pageblocks in region-order in " Srivatsa S. Bhat
2013-09-26 22:16   ` Dave Hansen
2013-09-27  6:34     ` Srivatsa S. Bhat
2013-10-23 10:17   ` Johannes Weiner
2013-10-23 16:09     ` Srivatsa S. Bhat
2013-09-25 23:15 ` [RFC PATCH v4 07/40] mm: Track the freepage migratetype of pages accurately Srivatsa S. Bhat
2013-09-25 23:15 ` [RFC PATCH v4 08/40] mm: Use the correct migratetype during buddy merging Srivatsa S. Bhat
2013-09-25 23:15 ` [RFC PATCH v4 09/40] mm: Add an optimized version of del_from_freelist to keep page allocation fast Srivatsa S. Bhat
2013-09-25 23:15 ` [RFC PATCH v4 10/40] bitops: Document the difference in indexing between fls() and __fls() Srivatsa S. Bhat
2013-09-25 23:16 ` [RFC PATCH v4 11/40] mm: A new optimized O(log n) sorting algo to speed up buddy-sorting Srivatsa S. Bhat
2013-09-25 23:16 ` [RFC PATCH v4 12/40] mm: Add support to accurately track per-memory-region allocation Srivatsa S. Bhat
2013-09-25 23:16 ` [RFC PATCH v4 13/40] mm: Print memory region statistics to understand the buddy allocator behavior Srivatsa S. Bhat
2013-09-25 23:17 ` [RFC PATCH v4 14/40] mm: Enable per-memory-region fragmentation stats in pagetypeinfo Srivatsa S. Bhat
2013-09-25 23:17 ` [RFC PATCH v4 15/40] mm: Add aggressive bias to prefer lower regions during page allocation Srivatsa S. Bhat
2013-09-25 23:17 ` [RFC PATCH v4 16/40] mm: Introduce a "Region Allocator" to manage entire memory regions Srivatsa S. Bhat
2013-10-23 10:10   ` Johannes Weiner
2013-10-23 16:22     ` Srivatsa S. Bhat
2013-09-25 23:17 ` [RFC PATCH v4 17/40] mm: Add a mechanism to add pages to buddy freelists in bulk Srivatsa S. Bhat
2013-09-25 23:18 ` [RFC PATCH v4 18/40] mm: Provide a mechanism to delete pages from " Srivatsa S. Bhat
2013-09-25 23:18 ` [RFC PATCH v4 19/40] mm: Provide a mechanism to release free memory to the region allocator Srivatsa S. Bhat
2013-09-25 23:18 ` [RFC PATCH v4 20/40] mm: Provide a mechanism to request free memory from " Srivatsa S. Bhat
2013-09-25 23:18 ` [RFC PATCH v4 21/40] mm: Maintain the counter for freepages in " Srivatsa S. Bhat
2013-09-25 23:18 ` [RFC PATCH v4 22/40] mm: Propagate the sorted-buddy bias for picking free regions, to " Srivatsa S. Bhat
2013-09-25 23:19 ` [RFC PATCH v4 23/40] mm: Fix vmstat to also account for freepages in the " Srivatsa S. Bhat
2013-09-25 23:19 ` [RFC PATCH v4 24/40] mm: Drop some very expensive sorted-buddy related checks under DEBUG_PAGEALLOC Srivatsa S. Bhat
2013-09-25 23:19 ` [RFC PATCH v4 25/40] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA => RA flow Srivatsa S. Bhat
2013-09-25 23:19 ` [RFC PATCH v4 26/40] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA <= " Srivatsa S. Bhat
2013-09-25 23:19 ` [RFC PATCH v4 27/40] mm: Update the freepage migratetype of pages during region allocation Srivatsa S. Bhat
2013-09-25 23:20 ` [RFC PATCH v4 28/40] mm: Provide a mechanism to check if a given page is in the region allocator Srivatsa S. Bhat
2013-09-25 23:20 ` [RFC PATCH v4 29/40] mm: Add a way to request pages of a particular region from " Srivatsa S. Bhat
2013-09-25 23:20 ` [RFC PATCH v4 30/40] mm: Modify move_freepages() to handle pages in the region allocator properly Srivatsa S. Bhat
2013-09-25 23:20 ` [RFC PATCH v4 31/40] mm: Never change migratetypes of pageblocks during freepage stealing Srivatsa S. Bhat
2013-09-25 23:20 ` [RFC PATCH v4 32/40] mm: Set pageblock migratetype when allocating regions from region allocator Srivatsa S. Bhat
2013-09-25 23:21 ` [RFC PATCH v4 33/40] mm: Use a cache between page-allocator and region-allocator Srivatsa S. Bhat
2013-09-25 23:21 ` [RFC PATCH v4 34/40] mm: Restructure the compaction part of CMA for wider use Srivatsa S. Bhat
2013-09-25 23:21 ` [RFC PATCH v4 35/40] mm: Add infrastructure to evacuate memory regions using compaction Srivatsa S. Bhat
2013-09-25 23:21 ` [RFC PATCH v4 36/40] kthread: Split out kthread-worker bits to avoid circular header-file dependency Srivatsa S. Bhat
2013-09-25 23:22 ` [RFC PATCH v4 37/40] mm: Add a kthread to perform targeted compaction for memory power management Srivatsa S. Bhat
2013-09-25 23:22 ` [RFC PATCH v4 38/40] mm: Add a mechanism to queue work to the kmempowerd kthread Srivatsa S. Bhat
2013-09-25 23:22 ` [RFC PATCH v4 39/40] mm: Add intelligence in kmempowerd to ignore regions unsuitable for evacuation Srivatsa S. Bhat
2013-09-25 23:22 ` [RFC PATCH v4 40/40] mm: Add triggers in the page-allocator to kick off region evacuation Srivatsa S. Bhat
2013-09-25 23:26 ` [Results] [RFC PATCH v4 00/40] mm: Memory Power Management Srivatsa S. Bhat
2013-09-25 23:40   ` Andrew Morton
2013-09-25 23:47     ` Andi Kleen
2013-09-26  1:14       ` Arjan van de Ven
2013-09-26 13:09         ` Srivatsa S. Bhat
2013-09-26  1:15       ` Arjan van de Ven
2013-09-26  1:21         ` Andrew Morton
2013-09-26  1:50           ` Andi Kleen
2013-09-26  2:59             ` Andrew Morton
2013-09-26 13:42               ` Srivatsa S. Bhat
2013-09-26 15:58                 ` Arjan van de Ven
2013-09-26 17:00                   ` Srivatsa S. Bhat
2013-09-26 18:06                     ` Arjan van de Ven
2013-09-26 18:33                       ` Srivatsa S. Bhat
2013-09-26 13:37             ` Srivatsa S. Bhat
2013-09-26 15:23           ` Arjan van de Ven
2013-09-26 13:16         ` Srivatsa S. Bhat
2013-09-26 12:58     ` Srivatsa S. Bhat
2013-09-26 15:29       ` Arjan van de Ven
2013-11-12  8:02       ` Srivatsa S. Bhat
2013-11-12 17:34         ` Dave Hansen
2013-11-12 18:44           ` Srivatsa S. Bhat
2013-11-12 18:49         ` Srivatsa S. Bhat