linux-mm.kvack.org archive mirror
* [patch 1/2 v2]swap: add a simple buddy allocator
@ 2012-11-19  7:59 Shaohua Li
  2012-11-19 10:06 ` Hugh Dickins
  0 siblings, 1 reply; 3+ messages in thread
From: Shaohua Li @ 2012-11-19  7:59 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, hughd, minchan, riel

I'm using a fast SSD for swap. scan_swap_map() sometimes uses 20~30% of CPU
time (up to 80% when a cluster is hard to find), which becomes a bottleneck.
scan_swap_map() scans a byte array to search for a 256-page cluster, which is
very slow.

Here I introduce a simple buddy allocator. Since we only care about 256-page
clusters, we can implement the buddy allocator with just a counter: every 256
pages use one int to store the count. All free clusters are added to a list, so
searching for a cluster is very efficient. With this, the scan_swap_map()
overhead disappears.
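(Editor's illustration: the scheme can be sketched in userspace C, with one
counter per 256-page cluster and a singly linked free list threaded through an
index array. All names below are illustrative, not the patch's kernel
identifiers; the kernel version packs the count and next pointer into a single
int, as the diff further down shows.)

```c
#include <assert.h>
#include <stdlib.h>

#define PAGES_PER_CLUSTER 256
#define FREE_LIST_NULL ((unsigned int)-1)

struct cluster_map {
	unsigned int *count;	/* pages in use per cluster */
	unsigned int *next;	/* next free cluster, or FREE_LIST_NULL */
	unsigned int head, tail;	/* free-list head/tail */
	unsigned int nr_clusters;
};

static void cluster_map_init(struct cluster_map *m, unsigned int nr)
{
	m->count = calloc(nr, sizeof(*m->count));
	m->next = malloc(nr * sizeof(*m->next));
	m->nr_clusters = nr;
	m->head = 0;
	m->tail = nr - 1;
	for (unsigned int i = 0; i < nr; i++)
		m->next[i] = (i + 1 < nr) ? i + 1 : FREE_LIST_NULL;
}

/* O(1): pop a whole free cluster from the list head */
static unsigned int alloc_cluster(struct cluster_map *m)
{
	unsigned int idx = m->head;

	if (idx == FREE_LIST_NULL)
		return FREE_LIST_NULL;
	m->head = m->next[idx];
	if (m->head == FREE_LIST_NULL)
		m->tail = FREE_LIST_NULL;
	return idx;
}

/* count one page allocated inside its cluster */
static void inc_cluster(struct cluster_map *m, unsigned long page)
{
	m->count[page / PAGES_PER_CLUSTER]++;
}

/* count one page freed; a now-empty cluster goes back on the list TAIL */
static void dec_cluster(struct cluster_map *m, unsigned long page)
{
	unsigned int idx = page / PAGES_PER_CLUSTER;

	if (--m->count[idx] == 0) {
		m->next[idx] = FREE_LIST_NULL;
		if (m->head == FREE_LIST_NULL)
			m->head = idx;
		else
			m->next[m->tail] = idx;
		m->tail = idx;
	}
}
```

Tail insertion on free is the detail that gives the FIFO reuse order the cover
letter credits with helping wear leveling.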

Since searching for a cluster in a list is cheap, we can easily implement a
per-cpu cluster algorithm for block allocation, which can make swapout more
efficient. This is on my TODO list.

This might help low-end SD card swap too: if the cluster is aligned, the SD
firmware can do flash erase more efficiently.

We only enable the buddy allocator for SSD. Hard disk swap isn't fast enough,
and has downsides with a buddy allocator that might introduce a regression.

The patch slightly changes which cluster is chosen. It always adds free
clusters to the list tail, which can help wear leveling on low-end SSDs too.
If no free cluster is found, scan_swap_map() falls back to scanning from the
end of the last free cluster, which is effectively random. For SSD this isn't
a problem at all.

Another downside is that clusters must be aligned to 256 pages, which reduces
the chance of finding a cluster. I would expect this isn't a big problem for
SSD because of the absence of a seek penalty.
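(Editor's note: each cluster_info word in the patch packs its state into one
unsigned int: the top 8 bits hold flags, and the low 24 bits hold either the
in-use page count of an allocated cluster or, for a free cluster, the index of
the next free cluster. A userspace restatement of the patch's macros
illustrates the encoding:)

```c
#include <assert.h>

#define CLUSTER_FLAG_FREE	(1 << 0)
#define CLUSTER_FLAG_NEXT_NULL	(1 << 1)
#define CLUSTER_NULL		(CLUSTER_FLAG_NEXT_NULL << 24)

#define cluster_flag(info)	((info) >> 24)
#define cluster_count(info)	((info) & 0xffffff)	/* used pages */
#define cluster_next(info)	((info) & 0xffffff)	/* next free index */
#define cluster_set_count(info, c) \
	((info) = (cluster_flag(info) << 24) | (c))
#define cluster_is_free(info)	(cluster_flag(info) & CLUSTER_FLAG_FREE)
```

With 24 bits for the count or index, this supports up to 2^24 clusters of 256
pages each, i.e. about 16TB of swap per device with 4KB pages.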

Changes from V1 to V2:
1. free cluster is added to a list, which makes searching cluster more efficient
2. only enable the buddy allocator for SSD.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 include/linux/swap.h |    3 
 mm/swapfile.c        |  178 +++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 170 insertions(+), 11 deletions(-)

Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h	2012-11-19 09:07:55.144413455 +0800
+++ linux/include/linux/swap.h	2012-11-19 12:15:30.382915945 +0800
@@ -185,6 +185,9 @@ struct swap_info_struct {
 	signed char	next;		/* next type on the swap list */
 	unsigned int	max;		/* extent of the swap_map */
 	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
+	unsigned int *cluster_info;	/* cluster info. Only for SSD */
+	unsigned int free_cluster_head;
+	unsigned int free_cluster_tail;
 	unsigned int lowest_bit;	/* index of first free in swap_map */
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
Index: linux/mm/swapfile.c
===================================================================
--- linux.orig/mm/swapfile.c	2012-11-19 09:08:15.688155187 +0800
+++ linux/mm/swapfile.c	2012-11-19 12:15:30.370916096 +0800
@@ -182,6 +182,85 @@ static int wait_for_discard(void *word)
 #define SWAPFILE_CLUSTER	256
 #define LATENCY_LIMIT		256
 
+/*
+ * cluster info is a unsigned int, the highest 8 bits stores flags, the low 24
+ * bits stores next cluster if the cluster is free or cluster counter otherwise
+ */
+#define CLUSTER_FLAG_FREE (1 << 0)
+#define CLUSTER_FLAG_NEXT_NULL (1 << 1)
+#define CLUSTER_NULL (CLUSTER_FLAG_NEXT_NULL << 24)
+#define cluster_flag(info) ((info) >> 24)
+#define cluster_set_flag(info, flag) \
+	do { info = ((info) & 0xffffff) | ((flag) << 24); } while (0)
+#define cluster_count(info) ((info) & 0xffffff)
+#define cluster_set_count(info, c) \
+	do { info = (cluster_flag(info) << 24) | (c); } while (0)
+#define cluster_next(info) ((info) & 0xffffff)
+#define cluster_set_next(info, n) \
+	do { info = (cluster_flag(info) << 24) | (n); } while (0)
+#define cluster_is_free(info) (cluster_flag(info) & CLUSTER_FLAG_FREE)
+
+static inline void inc_cluster_info_page(struct swap_info_struct *p,
+	unsigned int *cluster_info, unsigned long page_nr)
+{
+	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+
+	if (!cluster_info)
+		return;
+	if (cluster_is_free(cluster_info[idx])) {
+		VM_BUG_ON(p->free_cluster_head != idx);
+		p->free_cluster_head = cluster_next(cluster_info[idx]);
+		if (p->free_cluster_tail == idx) {
+			p->free_cluster_tail = CLUSTER_NULL;
+			p->free_cluster_head = CLUSTER_NULL;
+		}
+		cluster_set_flag(cluster_info[idx], 0);
+		cluster_set_count(cluster_info[idx], 0);
+	}
+
+	VM_BUG_ON(cluster_count(cluster_info[idx]) >= SWAPFILE_CLUSTER);
+	cluster_set_count(cluster_info[idx],
+		cluster_count(cluster_info[idx]) + 1);
+}
+
+static inline void dec_cluster_info_page(struct swap_info_struct *p,
+	unsigned int *cluster_info, unsigned long page_nr)
+{
+	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+
+	if (!cluster_info)
+		return;
+
+	VM_BUG_ON(cluster_count(cluster_info[idx]) == 0);
+	cluster_set_count(cluster_info[idx],
+		cluster_count(cluster_info[idx]) - 1);
+
+	if (cluster_count(cluster_info[idx]) == 0) {
+		cluster_set_flag(cluster_info[idx], CLUSTER_FLAG_FREE);
+		if (p->free_cluster_head == CLUSTER_NULL) {
+			p->free_cluster_head = idx;
+			p->free_cluster_tail = idx;
+		} else {
+			cluster_set_next(cluster_info[p->free_cluster_tail],
+				idx);
+			p->free_cluster_tail = idx;
+		}
+	}
+}
+
+/*
+ * It's possible for scan_swap_map() to use a free cluster in the middle of the
+ * free cluster list. Check for that case here to avoid list corruption.
+ */
+static inline bool scan_swap_map_recheck_cluster(struct swap_info_struct *si,
+	unsigned long offset)
+{
+	offset /= SWAPFILE_CLUSTER;
+	return si->free_cluster_head != CLUSTER_NULL &&
+		offset != si->free_cluster_head &&
+		cluster_is_free(si->cluster_info[offset]);
+}
+
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
@@ -223,6 +302,24 @@ static unsigned long scan_swap_map(struc
 			si->lowest_alloc = si->max;
 			si->highest_alloc = 0;
 		}
+check_cluster:
+		if (si->free_cluster_head != CLUSTER_NULL) {
+			offset = si->free_cluster_head * SWAPFILE_CLUSTER;
+			last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
+			si->cluster_next = offset;
+			si->cluster_nr = SWAPFILE_CLUSTER - 1;
+			found_free_cluster = 1;
+			goto checks;
+		} else if (si->cluster_info) {
+			/*
+			 * Checking free cluster is fast enough, we can do the
+			 * check every time
+			 */
+			si->cluster_nr = 0;
+			si->lowest_alloc = 0;
+			goto checks;
+		}
+
 		spin_unlock(&swap_lock);
 
 		/*
@@ -283,6 +380,8 @@ static unsigned long scan_swap_map(struc
 	}
 
 checks:
+	if (scan_swap_map_recheck_cluster(si, offset))
+		goto check_cluster;
 	if (!(si->flags & SWP_WRITEOK))
 		goto no_page;
 	if (!si->highest_bit)
@@ -315,6 +414,7 @@ checks:
 		si->highest_bit = 0;
 	}
 	si->swap_map[offset] = usage;
+	inc_cluster_info_page(si, si->cluster_info, offset);
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
 
@@ -549,6 +649,7 @@ static unsigned char swap_entry_free(str
 
 	/* free if no reference */
 	if (!usage) {
+		dec_cluster_info_page(p, p->cluster_info, offset);
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
 		if (offset > p->highest_bit)
@@ -1445,6 +1546,7 @@ static int setup_swap_extents(struct swa
 
 static void enable_swap_info(struct swap_info_struct *p, int prio,
 				unsigned char *swap_map,
+				unsigned int *cluster_info,
 				unsigned long *frontswap_map)
 {
 	int i, prev;
@@ -1455,6 +1557,7 @@ static void enable_swap_info(struct swap
 	else
 		p->prio = --least_priority;
 	p->swap_map = swap_map;
+	p->cluster_info = cluster_info;
 	frontswap_map_set(p, frontswap_map);
 	p->flags |= SWP_WRITEOK;
 	nr_swap_pages += p->pages;
@@ -1480,6 +1583,7 @@ SYSCALL_DEFINE1(swapoff, const char __us
 {
 	struct swap_info_struct *p = NULL;
 	unsigned char *swap_map;
+	unsigned int *cluster_info;
 	struct file *swap_file, *victim;
 	struct address_space *mapping;
 	struct inode *inode;
@@ -1555,7 +1659,8 @@ SYSCALL_DEFINE1(swapoff, const char __us
 		 * sys_swapoff for this swap_info_struct at this point.
 		 */
 		/* re-insert swap space back into swap_list */
-		enable_swap_info(p, p->prio, p->swap_map, frontswap_map_get(p));
+		enable_swap_info(p, p->prio, p->swap_map, p->cluster_info,
+				frontswap_map_get(p));
 		goto out_dput;
 	}
 
@@ -1580,11 +1685,14 @@ SYSCALL_DEFINE1(swapoff, const char __us
 	p->max = 0;
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
+	cluster_info = p->cluster_info;
+	p->cluster_info = NULL;
 	p->flags = 0;
 	frontswap_invalidate_area(type);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
+	vfree(cluster_info);
 	vfree(frontswap_map_get(p));
 	/* Destroy swap account informatin */
 	swap_cgroup_swapoff(type);
@@ -1896,15 +2004,21 @@ static unsigned long read_swap_header(st
 static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					union swap_header *swap_header,
 					unsigned char *swap_map,
+					unsigned int *cluster_info,
 					unsigned long maxpages,
 					sector_t *span)
 {
 	int i;
 	unsigned int nr_good_pages;
 	int nr_extents;
+	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
+	unsigned long idx = p->cluster_next / SWAPFILE_CLUSTER;
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
+	p->free_cluster_head = CLUSTER_NULL;
+	p->free_cluster_tail = CLUSTER_NULL;
+
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 		if (page_nr == 0 || page_nr > swap_header->info.last_page)
@@ -1912,11 +2026,25 @@ static int setup_swap_map_and_extents(st
 		if (page_nr < maxpages) {
 			swap_map[page_nr] = SWAP_MAP_BAD;
 			nr_good_pages--;
+			/*
+			 * Don't mark the cluster free yet; no list
+			 * operation is involved
+			 */
+			inc_cluster_info_page(p, cluster_info, page_nr);
 		}
 	}
 
+	/* Don't mark the cluster free yet; no list operation is involved */
+	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
+		inc_cluster_info_page(p, cluster_info, i);
+
 	if (nr_good_pages) {
 		swap_map[0] = SWAP_MAP_BAD;
+		/*
+		 * Don't mark the cluster free yet; no list
+		 * operation is involved
+		 */
+		inc_cluster_info_page(p, cluster_info, 0);
 		p->max = maxpages;
 		p->pages = nr_good_pages;
 		nr_extents = setup_swap_extents(p, span);
@@ -1929,6 +2057,27 @@ static int setup_swap_map_and_extents(st
 		return -EINVAL;
 	}
 
+	if (!cluster_info)
+		return nr_extents;
+
+	for (i = 0; i < nr_clusters; i++) {
+		if (!cluster_count(cluster_info[idx])) {
+			cluster_set_flag(cluster_info[idx], CLUSTER_FLAG_FREE);
+			if (p->free_cluster_head == CLUSTER_NULL) {
+				p->free_cluster_head = idx;
+				p->free_cluster_tail = idx;
+			} else {
+				cluster_set_next(
+					cluster_info[p->free_cluster_tail],
+					idx);
+				p->free_cluster_tail = idx;
+			}
+		}
+		idx++;
+		if (idx == nr_clusters)
+			idx = 0;
+	}
+
 	return nr_extents;
 }
 
@@ -1946,6 +2095,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 	sector_t span;
 	unsigned long maxpages;
 	unsigned char *swap_map = NULL;
+	unsigned int *cluster_info = NULL;
 	unsigned long *frontswap_map = NULL;
 	struct page *page = NULL;
 	struct inode *inode = NULL;
@@ -2019,13 +2169,24 @@ SYSCALL_DEFINE2(swapon, const char __use
 		error = -ENOMEM;
 		goto bad_swap;
 	}
+	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
+		p->flags |= SWP_SOLIDSTATE;
+		p->cluster_next = 1 + (random32() % p->highest_bit);
+
+		cluster_info = vzalloc(DIV_ROUND_UP(maxpages,
+			SWAPFILE_CLUSTER) * sizeof(*cluster_info));
+		if (!cluster_info) {
+			error = -ENOMEM;
+			goto bad_swap;
+		}
+	}
 
 	error = swap_cgroup_swapon(p->type, maxpages);
 	if (error)
 		goto bad_swap;
 
 	nr_extents = setup_swap_map_and_extents(p, swap_header, swap_map,
-		maxpages, &span);
+		cluster_info, maxpages, &span);
 	if (unlikely(nr_extents < 0)) {
 		error = nr_extents;
 		goto bad_swap;
@@ -2034,21 +2195,15 @@ SYSCALL_DEFINE2(swapon, const char __use
 	if (frontswap_enabled)
 		frontswap_map = vzalloc(maxpages / sizeof(long));
 
-	if (p->bdev) {
-		if (blk_queue_nonrot(bdev_get_queue(p->bdev))) {
-			p->flags |= SWP_SOLIDSTATE;
-			p->cluster_next = 1 + (random32() % p->highest_bit);
-		}
-		if ((swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0)
-			p->flags |= SWP_DISCARDABLE;
-	}
+	if (p->bdev && (swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0)
+		p->flags |= SWP_DISCARDABLE;
 
 	mutex_lock(&swapon_mutex);
 	prio = -1;
 	if (swap_flags & SWAP_FLAG_PREFER)
 		prio =
 		  (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
-	enable_swap_info(p, prio, swap_map, frontswap_map);
+	enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map);
 
 	printk(KERN_INFO "Adding %uk swap on %s.  "
 			"Priority:%d extents:%d across:%lluk %s%s%s\n",
@@ -2078,6 +2233,7 @@ bad_swap:
 	p->flags = 0;
 	spin_unlock(&swap_lock);
 	vfree(swap_map);
+	vfree(cluster_info);
 	if (swap_file) {
 		if (inode && S_ISREG(inode->i_mode)) {
 			mutex_unlock(&inode->i_mutex);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [patch 1/2 v2]swap: add a simple buddy allocator
  2012-11-19  7:59 [patch 1/2 v2]swap: add a simple buddy allocator Shaohua Li
@ 2012-11-19 10:06 ` Hugh Dickins
  2012-11-20  1:41   ` Shaohua Li
  0 siblings, 1 reply; 3+ messages in thread
From: Hugh Dickins @ 2012-11-19 10:06 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, akpm, minchan, riel

On Mon, 19 Nov 2012, Shaohua Li wrote:
> 
> Changes from V1 to V2:
> 1. free cluster is added to a list, which makes searching cluster more efficient
> 2. only enable the buddy allocator for SSD.

Oh.  My fault, not yours at all, but I wish I'd known this was coming.
I spent today testing and fixing (a couple of hangs in discard 2/2) V1,
was about to send you a patch, but looks like none of it relevant to V2.

It's nice work you've done, I thoroughly approve of V1 (very minor mods),
but it'll take me a few more days to get around to looking at V2, sorry.

Though, it may be my ignorance, I entirely fail to see what this has to
do with a buddy allocator: you've speeded up scan_swap_map()'s search
for a cluster by adding an additional cluster map, then extended that
neatly for a much better discard implementation.  Good work yes, but
a buddy allocator??

Thanks!
Hugh



* Re: [patch 1/2 v2]swap: add a simple buddy allocator
  2012-11-19 10:06 ` Hugh Dickins
@ 2012-11-20  1:41   ` Shaohua Li
  0 siblings, 0 replies; 3+ messages in thread
From: Shaohua Li @ 2012-11-20  1:41 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm, akpm, minchan, riel

On Mon, Nov 19, 2012 at 02:06:48AM -0800, Hugh Dickins wrote:
> On Mon, 19 Nov 2012, Shaohua Li wrote:
> > 
> > Changes from V1 to V2:
> > 1. free cluster is added to a list, which makes searching cluster more efficient
> > 2. only enable the buddy allocator for SSD.
> 
> Oh.  My fault, not yours at all, but I wish I'd known this was coming.
> I spent today testing and fixing (a couple of hangs in discard 2/2) V1,
> was about to send you a patch, but looks like none of it relevant to V2.

Apologies for this. I was traveling all of last week, so I lagged a little in
fully testing the V2 patch until yesterday. I didn't have the confidence to
bother you again before I knew it really worked.
 
> It's nice work you've done, I thoroughly approve of V1 (very minor mods),
> but it'll take me a few more days to get around to looking at V2, sorry.
> 
> Though, it may be my ignorance, I entirely fail to see what this has to
> do with a buddy allocator: you've speeded up scan_swap_map()'s search
> for a cluster by adding an additional cluster map, then extended that
> neatly for a much better discard implementation.  Good work yes, but
> a buddy allocator??

That's just the name I used. I thought of it as a variant of a buddy
allocator; don't take it too seriously.

Thanks,
Shaohua

