linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
To: akpm@linux-foundation.org, mgorman@suse.de, hannes@cmpxchg.org,
	tony.luck@intel.com, matthew.garrett@nebula.com, dave@sr71.net,
	riel@redhat.com, arjan@linux.intel.com,
	srinivas.pandruvada@linux.intel.com, willy@linux.intel.com,
	kamezawa.hiroyu@jp.fujitsu.com, lenb@kernel.org, rjw@sisk.pl
Cc: gargankita@gmail.com, paulmck@linux.vnet.ibm.com,
	svaidy@linux.vnet.ibm.com, andi@firstfloor.org,
	isimatu.yasuaki@jp.fujitsu.com, santosh.shilimkar@ti.com,
	kosaki.motohiro@gmail.com, srivatsa.bhat@linux.vnet.ibm.com,
	linux-pm@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: [RFC PATCH v3 13/35] mm: A new optimized O(log n) sorting algo to speed up buddy-sorting
Date: Fri, 30 Aug 2013 18:47:54 +0530	[thread overview]
Message-ID: <20130830131746.4947.94934.stgit@srivatsabhat.in.ibm.com> (raw)
In-Reply-To: <20130830131221.4947.99764.stgit@srivatsabhat.in.ibm.com>

The sorted-buddy design for memory power management depends on
keeping the buddy freelists region-sorted. And this sorting operation
has been pushed to the free() logic, keeping the alloc() path fast.

However, we would like to also keep the free() path as fast as possible,
since it holds the zone->lock, which will indirectly affect alloc() also.

So replace the existing O(n) sorting logic used in the free-path, with
a new special-case sorting algorithm of time complexity O(log n), in order
to optimize the free() path further. This algorithm uses a bitmap-based
radix tree to help speed up the sorting.

One of the other main advantages of this O(log n) design is that it can
support large amounts of RAM (upto 2 TB and beyond) quite effortlessly.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |    2 +
 mm/page_alloc.c        |  144 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 932e71f..b35020f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -102,6 +102,8 @@ struct free_list {
 	 * this freelist.
 	 */
 	struct mem_region_list	mr_list[MAX_NR_ZONE_REGIONS];
+	DECLARE_BITMAP(region_root_mask, BITS_PER_LONG);
+	DECLARE_BITMAP(region_leaf_mask, MAX_NR_ZONE_REGIONS);
 };
 
 struct free_area {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 52b6655..4da02fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -514,11 +514,131 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+/**
+ *
+ * An example should help illustrate the bitmap representation of memory
+ * regions easily. So consider the following scenario:
+ *
+ * MAX_NR_ZONE_REGIONS = 256
+ * DECLARE_BITMAP(region_leaf_mask, MAX_NR_ZONE_REGIONS);
+ * DECLARE_BITMAP(region_root_mask, BITS_PER_LONG);
+ *
+ * Here region_leaf_mask is an array of unsigned longs. And region_root_mask
+ * is a single unsigned long. The tree notion is constructed like this:
+ * Each bit in the region_root_mask will correspond to an array element of
+ * region_leaf_mask, as shown below. (The elements of the region_leaf_mask
+ * array are shown as being discontiguous, only to help illustrate the
+ * concept easily).
+ *
+ *                    Region Root Mask
+ *                   ___________________
+ *                  |____|____|____|____|
+ *                    /    |     \     \
+ *                   /     |      \     \
+ *             ________    |   ________  \
+ *            |________|   |  |________|  \
+ *                         |               \
+ *                      ________        ________
+ *                     |________|      |________|   <--- Region Leaf Mask
+ *                                                         array elements
+ *
+ * If an array element in the leaf mask is non-zero, the corresponding bit
+ * for that array element will be set in the root mask. Every bit in the
+ * region_leaf_mask will correspond to a memory region; it is set if that
+ * region is present in that free list, cleared otherwise.
+ *
+ * This arrangement helps us find the previous set bit in region_leaf_mask
+ * using at most 2 bitmask-searches (each bitmask of size BITS_PER_LONG),
+ * one at the root-level, and one at the leaf level. Thus, this design of
+ * an optimized access structure reduces the search-complexity when dealing
+ * with large amounts of memory. The worst-case time-complexity of buddy
+ * sorting comes to O(log n) using this algorithm, where 'n' is the no. of
+ * memory regions in the zone.
+ *
+ * For example, with MEM_REGION_SIZE = 512 MB, on 64-bit machines, we can
+ * deal with upto 2TB of RAM (MAX_NR_ZONE_REGIONS = 4096) efficiently (just
+ * 12 ops in the worst case, as opposed to 4096 in an O(n) algo) with such
+ * an arrangement, without even needing to extend this 2-level hierarchy
+ * any further.
+ */
+
+static void set_region_bit(int region_id, struct free_list *free_list)
+{
+	set_bit(region_id, free_list->region_leaf_mask);
+	set_bit(BIT_WORD(region_id), free_list->region_root_mask);
+}
+
+static void clear_region_bit(int region_id, struct free_list *free_list)
+{
+	clear_bit(region_id, free_list->region_leaf_mask);
+
+	if (!(free_list->region_leaf_mask[BIT_WORD(region_id)]))
+		clear_bit(BIT_WORD(region_id), free_list->region_root_mask);
+
+}
+
+/* Note that Region 0 corresponds to bit position 1 (0x1) and so on */
+static int find_prev_region(int region_id, struct free_list *free_list)
+{
+	int leaf_word, prev_region_id;
+	unsigned long *region_root_mask, *region_leaf_mask;
+	unsigned long tmp_root_mask, tmp_leaf_mask;
+
+	if (!region_id)
+		return -1; /* No previous region */
+
+	leaf_word = BIT_WORD(region_id);
+
+	region_root_mask = free_list->region_root_mask;
+	region_leaf_mask = free_list->region_leaf_mask;
+
+
+	/*
+	 * Try to get the prev region id without going to the root mask.
+	 * Note that region_id itself might not be set yet.
+	 */
+	if (region_leaf_mask[leaf_word]) {
+		tmp_leaf_mask = region_leaf_mask[leaf_word] &
+							(BIT_MASK(region_id) - 1);
+
+		if (tmp_leaf_mask) {
+			/* Prev region is in this leaf mask itself. Find it. */
+			prev_region_id = leaf_word * BITS_PER_LONG +
+							__fls(tmp_leaf_mask);
+			goto out;
+		}
+	}
+
+	/* Search the root mask for the leaf mask having prev region */
+	tmp_root_mask = *region_root_mask & (BIT(leaf_word) - 1);
+	if (tmp_root_mask) {
+		leaf_word = __fls(tmp_root_mask);
+
+		/* Get the prev region id from the leaf mask */
+		prev_region_id = leaf_word * BITS_PER_LONG +
+					__fls(region_leaf_mask[leaf_word]);
+	} else {
+		/*
+		 * This itself is the first populated region in this
+		 * freelist, so previous region doesn't exist.
+		 */
+		prev_region_id = -1;
+	}
+
+out:
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(prev_region_id >= region_id, "%s: bitmap logic messed up\n",
+								__func__);
+#endif
+	return prev_region_id;
+}
+
 static void add_to_freelist(struct page *page, struct free_list *free_list)
 {
 	struct list_head *prev_region_list, *lru;
 	struct mem_region_list *region;
-	int region_id, i;
+	int region_id, prev_region_id;
 
 	lru = &page->lru;
 	region_id = page_zone_region_id(page);
@@ -536,12 +656,17 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)
 #endif
 
 	if (!list_empty(&free_list->list)) {
-		for (i = region_id - 1; i >= 0; i--) {
-			if (free_list->mr_list[i].page_block) {
-				prev_region_list =
-					free_list->mr_list[i].page_block;
-				goto out;
-			}
+		prev_region_id = find_prev_region(region_id, free_list);
+		if (prev_region_id >= 0) {
+			prev_region_list =
+				free_list->mr_list[prev_region_id].page_block;
+#ifdef CONFIG_DEBUG_PAGEALLOC
+			WARN(prev_region_list == NULL,
+				"%s: prev_region_list is NULL\n"
+				"region_id=%d, prev_region_id=%d\n", __func__,
+				 region_id, prev_region_id);
+#endif
+			goto out;
 		}
 	}
 
@@ -562,6 +687,7 @@ out:
 
 	/* Save pointer to page block of this region */
 	region->page_block = lru;
+	set_region_bit(region_id, free_list);
 }
 
 /**
@@ -576,6 +702,7 @@ static void rmqueue_del_from_freelist(struct page *page,
 				      struct free_list *free_list)
 {
 	struct list_head *lru = &page->lru;
+	int region_id;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	WARN((free_list->list.next != lru),
@@ -599,6 +726,8 @@ static void rmqueue_del_from_freelist(struct page *page,
 	 * in this freelist.
 	 */
 	free_list->next_region->page_block = NULL;
+	region_id = free_list->next_region - free_list->mr_list;
+	clear_region_bit(region_id, free_list);
 
 	/* Set 'next_region' to the new first region in the freelist. */
 	set_next_region_in_freelist(free_list);
@@ -659,6 +788,7 @@ page_found:
 
 	if (region->nr_free == 0) {
 		region->page_block = NULL;
+		clear_region_bit(region_id, free_list);
 	} else {
 		region->page_block = prev_page_lru;
 #ifdef CONFIG_DEBUG_PAGEALLOC

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2013-08-30 13:21 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-30 13:13 [RESEND RFC PATCH v3 00/35] mm: Memory Power Management Srivatsa S. Bhat
2013-08-30 13:14 ` [RFC PATCH v3 01/35] mm: Restructure free-page stealing code and fix a bug Srivatsa S. Bhat
2013-08-30 13:14 ` [RFC PATCH v3 02/35] mm: Fix the value of fallback_migratetype in alloc_extfrag tracepoint Srivatsa S. Bhat
2013-08-30 13:14 ` [RFC PATCH v3 03/35] mm: Introduce memory regions data-structure to capture region boundaries within nodes Srivatsa S. Bhat
2013-08-30 13:15 ` [RFC PATCH v3 04/35] mm: Initialize node memory regions during boot Srivatsa S. Bhat
2013-09-02  6:20   ` Yasuaki Ishimatsu
2013-09-02 17:43     ` Srivatsa S. Bhat
2013-09-03  4:53       ` Yasuaki Ishimatsu
2013-08-30 13:15 ` [RFC PATCH v3 05/35] mm: Introduce and initialize zone memory regions Srivatsa S. Bhat
2013-08-30 13:15 ` [RFC PATCH v3 06/35] mm: Add helpers to retrieve node region and zone region for a given page Srivatsa S. Bhat
2013-09-03  5:56   ` Yasuaki Ishimatsu
2013-09-03  8:34     ` Srivatsa S. Bhat
2013-08-30 13:16 ` [RFC PATCH v3 07/35] mm: Add data-structures to describe memory regions within the zones' freelists Srivatsa S. Bhat
2013-08-30 13:16 ` [RFC PATCH v3 08/35] mm: Demarcate and maintain pageblocks in region-order in " Srivatsa S. Bhat
2013-09-04  7:49   ` Yasuaki Ishimatsu
2013-08-30 13:16 ` [RFC PATCH v3 09/35] mm: Track the freepage migratetype of pages accurately Srivatsa S. Bhat
2013-09-03  6:38   ` Yasuaki Ishimatsu
2013-09-03  8:45     ` Srivatsa S. Bhat
2013-09-04  8:23       ` Yasuaki Ishimatsu
2013-09-06  5:24         ` Srivatsa S. Bhat
2013-08-30 13:16 ` [RFC PATCH v3 10/35] mm: Use the correct migratetype during buddy merging Srivatsa S. Bhat
2013-08-30 13:17 ` [RFC PATCH v3 11/35] mm: Add an optimized version of del_from_freelist to keep page allocation fast Srivatsa S. Bhat
2013-08-30 13:17 ` [RFC PATCH v3 12/35] bitops: Document the difference in indexing between fls() and __fls() Srivatsa S. Bhat
2013-08-30 13:17 ` Srivatsa S. Bhat [this message]
2013-08-30 13:18 ` [RFC PATCH v3 14/35] mm: Add support to accurately track per-memory-region allocation Srivatsa S. Bhat
2013-08-30 13:18 ` [RFC PATCH v3 15/35] mm: Print memory region statistics to understand the buddy allocator behavior Srivatsa S. Bhat
2013-08-30 13:18 ` [RFC PATCH v3 16/35] mm: Enable per-memory-region fragmentation stats in pagetypeinfo Srivatsa S. Bhat
2013-08-30 13:19 ` [RFC PATCH v3 17/35] mm: Add aggressive bias to prefer lower regions during page allocation Srivatsa S. Bhat
2013-08-30 13:19 ` [RFC PATCH v3 18/35] mm: Introduce a "Region Allocator" to manage entire memory regions Srivatsa S. Bhat
2013-08-30 13:19 ` [RFC PATCH v3 19/35] mm: Add a mechanism to add pages to buddy freelists in bulk Srivatsa S. Bhat
2013-08-30 13:20 ` [RFC PATCH v3 20/35] mm: Provide a mechanism to delete pages from " Srivatsa S. Bhat
2013-08-30 13:20 ` [RFC PATCH v3 21/35] mm: Provide a mechanism to release free memory to the region allocator Srivatsa S. Bhat
2013-08-30 13:20 ` [RFC PATCH v3 22/35] mm: Provide a mechanism to request free memory from " Srivatsa S. Bhat
2013-08-30 13:21 ` [RFC PATCH v3 23/35] mm: Maintain the counter for freepages in " Srivatsa S. Bhat
2013-08-30 13:21 ` [RFC PATCH v3 24/35] mm: Propagate the sorted-buddy bias for picking free regions, to " Srivatsa S. Bhat
2013-08-30 13:21 ` [RFC PATCH v3 25/35] mm: Fix vmstat to also account for freepages in the " Srivatsa S. Bhat
2013-08-30 13:22 ` [RFC PATCH v3 26/35] mm: Drop some very expensive sorted-buddy related checks under DEBUG_PAGEALLOC Srivatsa S. Bhat
2013-08-30 13:22 ` [RFC PATCH v3 27/35] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA => RA flow Srivatsa S. Bhat
2013-08-30 13:22 ` [RFC PATCH v3 28/35] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA <= " Srivatsa S. Bhat
2013-08-30 13:23 ` [RFC PATCH v3 29/35] mm: Update the freepage migratetype of pages during region allocation Srivatsa S. Bhat
2013-08-30 13:23 ` [RFC PATCH v3 30/35] mm: Provide a mechanism to check if a given page is in the region allocator Srivatsa S. Bhat
2013-08-30 13:23 ` [RFC PATCH v3 31/35] mm: Add a way to request pages of a particular region from " Srivatsa S. Bhat
2013-08-30 13:24 ` [RFC PATCH v3 32/35] mm: Modify move_freepages() to handle pages in the region allocator properly Srivatsa S. Bhat
2013-08-30 13:24 ` [RFC PATCH v3 33/35] mm: Never change migratetypes of pageblocks during freepage stealing Srivatsa S. Bhat
2013-08-30 13:24 ` [RFC PATCH v3 34/35] mm: Set pageblock migratetype when allocating regions from region allocator Srivatsa S. Bhat
2013-08-30 13:24 ` [RFC PATCH v3 35/35] mm: Use a cache between page-allocator and region-allocator Srivatsa S. Bhat
2013-08-30 13:26 ` [RESEND RFC PATCH v3 00/35] mm: Memory Power Management Srivatsa S. Bhat
2013-08-30 15:27 ` Dave Hansen
2013-08-30 17:50   ` Srivatsa S. Bhat
  -- strict thread matches above, loose matches on Subject: below --
2013-08-30 12:33 [RFC " Srivatsa S. Bhat
2013-08-30 12:38 ` [RFC PATCH v3 13/35] mm: A new optimized O(log n) sorting algo to speed up buddy-sorting Srivatsa S. Bhat

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130830131746.4947.94934.stgit@srivatsabhat.in.ibm.com \
    --to=srivatsa.bhat@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=arjan@linux.intel.com \
    --cc=dave@sr71.net \
    --cc=gargankita@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=isimatu.yasuaki@jp.fujitsu.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@gmail.com \
    --cc=lenb@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=matthew.garrett@nebula.com \
    --cc=mgorman@suse.de \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=riel@redhat.com \
    --cc=rjw@sisk.pl \
    --cc=santosh.shilimkar@ti.com \
    --cc=srinivas.pandruvada@linux.intel.com \
    --cc=svaidy@linux.vnet.ibm.com \
    --cc=tony.luck@intel.com \
    --cc=willy@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).