* [patch 5/21] pagevec infrastructure
From: Andrew Morton @ 2002-08-11  7:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: lkml



This is the first patch in a series of eight which address
pagemap_lru_lock contention, and which simplify the VM locking
hierarchy.

Most testing has been done with all eight patches applied, so it would
be best not to cherrypick, please.

The workload which was optimised was: 4x500MHz PIII CPUs, mem=512m, six
disks, six filesystems, six processes each flat-out writing a large
file onto one of the disks, i.e. a heavy page replacement load.

The frequency with which pagemap_lru_lock is taken is reduced by 90%.

Lockmeter claims that pagemap_lru_lock contention on the 4-way has been
reduced by 98%.  Total amount of system time lost to lock spinning went
from 2.5% to 0.85%.

Anton ran a similar test on 8-way PPC: the reduction in system time was
around 25%, and the reduction in time spent playing with
pagemap_lru_lock was 80%.

	http://samba.org/~anton/linux/2.5.30/standard/
versus
	http://samba.org/~anton/linux/2.5.30/akpm/

Throughput changes on uniprocessor are modest: a 1% speedup with this
workload due to shortened code paths and improved cache locality.

The patches do two main things:

1: In almost all places where the kernel was doing something with
   lots of pages one-at-a-time, convert the code to do the same thing
   sixteen-pages-at-a-time.  Take the lock once rather than sixteen
   times.  Take the lock for the minimum possible time.

2: Multithread the pagecache reclaim function: don't hold
   pagemap_lru_lock while reclaiming pagecache pages.  That function
   was massively expensive.
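
To illustrate point 1, here is a minimal userspace sketch of why batching
cuts the number of lock acquisitions.  It is not kernel code: the toy
single-threaded "lock", the counters and the function names are all
hypothetical stand-ins, and only the batch size of sixteen comes from the
description above.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 16	/* batch size from the description above */

/* Hypothetical single-threaded stand-in for pagemap_lru_lock. */
static int lock_held;
static unsigned long lock_acquisitions;	/* how often the lock was taken */
static unsigned long items_processed;

static void lock_lru(void)
{
	assert(!lock_held);	/* the toy lock is not recursive */
	lock_held = 1;
	lock_acquisitions++;
}

static void unlock_lru(void)
{
	assert(lock_held);
	lock_held = 0;
}

/* Old style: one lock round trip per item. */
static void process_one_at_a_time(size_t n)
{
	for (size_t i = 0; i < n; i++) {
		lock_lru();
		items_processed++;
		unlock_lru();
	}
}

/* Batched style: one lock round trip per group of up to BATCH items. */
static void process_batched(size_t n)
{
	size_t done = 0;

	while (done < n) {
		size_t chunk = (n - done < BATCH) ? n - done : BATCH;

		lock_lru();
		for (size_t i = 0; i < chunk; i++)
			items_processed++;
		unlock_lru();
		done += chunk;
	}
}
```

With 160 items, process_one_at_a_time() takes the toy lock 160 times and
process_batched() takes it 10 times.  The real saving depends on how much
work can be moved outside the lock, which this sketch does not model.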

One fallout from this work is that we never take any other locks while
holding pagemap_lru_lock.  So this lock conceptually disappears from
the VM locking hierarchy.


So.  This is all basically a code tweak to improve kernel scalability. 
It does it by optimising the existing design, rather than by redesign. 
There is little conceptual change to how the VM works.

This is as far as I can tweak it.  It seems that the results are now
acceptable on SMP.  But things are still bad on NUMA.  It is expected
that the per-zone LRU and per-zone LRU lock patches will fix NUMA as
well, but that has yet to be tested.


This first patch introduces `struct pagevec', which is the basic unit
of batched work.  It is simply:

struct pagevec {
	unsigned nr;
	struct page *pages[16];
};

pagevecs are used in the following patches to get the VM away from
page-at-a-time operations.

This patch includes all the pagevec library functions which are used in
later patches.
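
A rough sketch of how a caller might drive a pagevec, written as a
userspace model rather than kernel code.  The struct and the
pagevec_init/count/space/add helpers mirror the header in this patch;
the dummy struct page, flush_batch() and process_pages() are hypothetical
and exist only to show the "add until full, then flush" idiom.

```c
#include <assert.h>
#include <stddef.h>

#define PAGEVEC_SIZE	16

struct page { int id; };	/* hypothetical stand-in for the kernel's struct page */

struct pagevec {
	unsigned nr;
	struct page *pages[PAGEVEC_SIZE];
};

static void pagevec_init(struct pagevec *pvec)
{
	pvec->nr = 0;
}

static unsigned pagevec_count(struct pagevec *pvec)
{
	return pvec->nr;
}

static unsigned pagevec_space(struct pagevec *pvec)
{
	return PAGEVEC_SIZE - pvec->nr;
}

/* Add a page; returns the number of slots still available. */
static unsigned pagevec_add(struct pagevec *pvec, struct page *page)
{
	assert(pvec->nr < PAGEVEC_SIZE);	/* caller must flush before overfilling */
	pvec->pages[pvec->nr++] = page;
	return pagevec_space(pvec);
}

static unsigned flushes;	/* number of batch operations performed */
static unsigned pages_flushed;	/* total pages handled by those batches */

/* Hypothetical batch operation; stands in for e.g. pagevec_release(). */
static void flush_batch(struct pagevec *pvec)
{
	if (pagevec_count(pvec) == 0)
		return;
	flushes++;
	pages_flushed += pagevec_count(pvec);
	pagevec_init(pvec);	/* the library functions also reinitialise the pagevec */
}

/* Feed n pages through a pagevec, flushing whenever it fills up. */
static void process_pages(struct page *pages, size_t n)
{
	struct pagevec pvec;

	pagevec_init(&pvec);
	for (size_t i = 0; i < n; i++) {
		if (pagevec_add(&pvec, &pages[i]) == 0)
			flush_batch(&pvec);
	}
	flush_batch(&pvec);	/* handle the final partial batch */
}
```

Feeding 40 pages through process_pages() results in three batch
operations (16 + 16 + 8 pages) instead of 40 single-page operations.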





 include/linux/pagevec.h |   76 ++++++++++++++++++++++
 mm/page_alloc.c         |    9 ++
 mm/swap.c               |  160 +++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 236 insertions(+), 9 deletions(-)

--- /dev/null	Thu Aug 30 13:30:55 2001
+++ 2.5.31-akpm/include/linux/pagevec.h	Sun Aug 11 00:21:03 2002
@@ -0,0 +1,76 @@
+/*
+ * include/linux/pagevec.h
+ *
+ * In many places it is efficient to batch an operation up against multiple
+ * pages.  A pagevec is a multipage container which is used for that.
+ */
+
+#define PAGEVEC_SIZE	16
+
+struct page;
+
+struct pagevec {
+	unsigned nr;
+	struct page *pages[PAGEVEC_SIZE];
+};
+
+void __pagevec_release(struct pagevec *pvec);
+void __pagevec_release_nonlru(struct pagevec *pvec);
+void __pagevec_free(struct pagevec *pvec);
+void __pagevec_lru_add(struct pagevec *pvec);
+void __pagevec_lru_del(struct pagevec *pvec);
+void pagevec_deactivate_inactive(struct pagevec *pvec);
+
+static inline void pagevec_init(struct pagevec *pvec)
+{
+	pvec->nr = 0;
+}
+
+static inline unsigned pagevec_count(struct pagevec *pvec)
+{
+	return pvec->nr;
+}
+
+static inline unsigned pagevec_space(struct pagevec *pvec)
+{
+	return PAGEVEC_SIZE - pvec->nr;
+}
+
+/*
+ * Add a page to a pagevec.  Returns the number of slots still available.
+ */
+static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
+{
+	pvec->pages[pvec->nr++] = page;
+	return pagevec_space(pvec);
+}
+
+static inline void pagevec_release(struct pagevec *pvec)
+{
+	if (pagevec_count(pvec))
+		__pagevec_release(pvec);
+}
+
+static inline void pagevec_release_nonlru(struct pagevec *pvec)
+{
+	if (pagevec_count(pvec))
+		__pagevec_release_nonlru(pvec);
+}
+
+static inline void pagevec_free(struct pagevec *pvec)
+{
+	if (pagevec_count(pvec))
+		__pagevec_free(pvec);
+}
+
+static inline void pagevec_lru_add(struct pagevec *pvec)
+{
+	if (pagevec_count(pvec))
+		__pagevec_lru_add(pvec);
+}
+
+static inline void pagevec_lru_del(struct pagevec *pvec)
+{
+	if (pagevec_count(pvec))
+		__pagevec_lru_del(pvec);
+}
--- 2.5.31/mm/swap.c~pagevec	Sun Aug 11 00:20:32 2002
+++ 2.5.31-akpm/mm/swap.c	Sun Aug 11 00:21:02 2002
@@ -17,11 +17,9 @@
 #include <linux/kernel_stat.h>
 #include <linux/swap.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/init.h>
-
-#include <asm/dma.h>
-#include <asm/uaccess.h> /* for copy_to/from_user */
-#include <asm/pgtable.h>
+#include <linux/prefetch.h>
 
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
@@ -38,6 +36,9 @@ static inline void activate_page_nolock(
 	}
 }
 
+/*
+ * FIXME: speed this up?
+ */
 void activate_page(struct page * page)
 {
 	spin_lock(&pagemap_lru_lock);
@@ -51,9 +52,10 @@ void activate_page(struct page * page)
  */
 void lru_cache_add(struct page * page)
 {
-	if (!TestSetPageLRU(page)) {
+	if (!PageLRU(page)) {
 		spin_lock(&pagemap_lru_lock);
-		add_page_to_inactive_list(page);
+		if (!TestSetPageLRU(page))
+			add_page_to_inactive_list(page);
 		spin_unlock(&pagemap_lru_lock);
 	}
 }
@@ -68,11 +70,10 @@ void lru_cache_add(struct page * page)
 void __lru_cache_del(struct page * page)
 {
 	if (TestClearPageLRU(page)) {
-		if (PageActive(page)) {
+		if (PageActive(page))
 			del_page_from_active_list(page);
-		} else {
+		else
 			del_page_from_inactive_list(page);
-		}
 	}
 }
 
@@ -88,6 +89,147 @@ void lru_cache_del(struct page * page)
 }
 
 /*
+ * Batched page_cache_release().  Decrement the reference count on all the
+ * pagevec's pages.  If it fell to zero then remove the page from the LRU and
+ * free it.
+ *
+ * Avoid taking pagemap_lru_lock if possible, but if it is taken, retain it
+ * for the remainder of the operation.
+ *
+ * The locking in this function is against shrink_cache(): we recheck the
+ * page count inside the lock to see whether shrink_cache grabbed the page
+ * via the LRU.  If it did, give up: shrink_cache will free it.
+ *
+ * This function reinitialises the caller's pagevec.
+ */
+void __pagevec_release(struct pagevec *pvec)
+{
+	int i;
+	int lock_held = 0;
+	struct pagevec pages_to_free;
+
+	pagevec_init(&pages_to_free);
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		if (!put_page_testzero(page))
+			continue;
+
+		if (!lock_held && PageLRU(page)) {
+			spin_lock(&pagemap_lru_lock);
+			lock_held = 1;
+		}
+
+		if (TestClearPageLRU(page)) {
+			if (PageActive(page))
+				del_page_from_active_list(page);
+			else
+				del_page_from_inactive_list(page);
+		}
+		if (page_count(page) == 0)
+			pagevec_add(&pages_to_free, page);
+	}
+	if (lock_held)
+		spin_unlock(&pagemap_lru_lock);
+
+	pagevec_free(&pages_to_free);
+	pagevec_init(pvec);
+}
+
+/*
+ * pagevec_release() for pages which are known to not be on the LRU
+ *
+ * This function reinitialises the caller's pagevec.
+ */
+void __pagevec_release_nonlru(struct pagevec *pvec)
+{
+	int i;
+	struct pagevec pages_to_free;
+
+	pagevec_init(&pages_to_free);
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		BUG_ON(PageLRU(page));
+		if (put_page_testzero(page))
+			pagevec_add(&pages_to_free, page);
+	}
+	pagevec_free(&pages_to_free);
+	pagevec_init(pvec);
+}
+
+/*
+ * Move all the inactive pages to the head of the inactive list
+ * and release them.  Reinitialises the caller's pagevec.
+ */
+void pagevec_deactivate_inactive(struct pagevec *pvec)
+{
+	int i;
+	int lock_held = 0;
+
+	if (pagevec_count(pvec) == 0)
+		return;
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		if (!lock_held) {
+			if (PageActive(page) || !PageLRU(page))
+				continue;
+			spin_lock(&pagemap_lru_lock);
+			lock_held = 1;
+		}
+		if (!PageActive(page) && PageLRU(page))
+			list_move(&page->lru, &inactive_list);
+	}
+	if (lock_held)
+		spin_unlock(&pagemap_lru_lock);
+	__pagevec_release(pvec);
+}
+
+/*
+ * Add the passed pages to the inactive_list, then drop the caller's refcount
+ * on them.  Reinitialises the caller's pagevec.
+ */
+void __pagevec_lru_add(struct pagevec *pvec)
+{
+	int i;
+
+	spin_lock(&pagemap_lru_lock);
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		if (TestSetPageLRU(page))
+			BUG();
+		add_page_to_inactive_list(page);
+	}
+	spin_unlock(&pagemap_lru_lock);
+	pagevec_release(pvec);
+}
+
+/*
+ * Remove the passed pages from the LRU, then drop the caller's refcount on
+ * them.  Reinitialises the caller's pagevec.
+ */
+void __pagevec_lru_del(struct pagevec *pvec)
+{
+	int i;
+
+	spin_lock(&pagemap_lru_lock);
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		if (!TestClearPageLRU(page))
+			BUG();
+		if (PageActive(page))
+			del_page_from_active_list(page);
+		else
+			del_page_from_inactive_list(page);
+	}
+	spin_unlock(&pagemap_lru_lock);
+	pagevec_release(pvec);
+}
+
+/*
  * Perform any setup for the swap system
  */
 void __init swap_setup(void)
--- 2.5.31/mm/page_alloc.c~pagevec	Sun Aug 11 00:20:32 2002
+++ 2.5.31-akpm/mm/page_alloc.c	Sun Aug 11 00:21:03 2002
@@ -22,6 +22,7 @@
 #include <linux/compiler.h>
 #include <linux/module.h>
 #include <linux/suspend.h>
+#include <linux/pagevec.h>
 
 unsigned long totalram_pages;
 unsigned long totalhigh_pages;
@@ -458,6 +459,14 @@ void page_cache_release(struct page *pag
 	}
 }
 
+void __pagevec_free(struct pagevec *pvec)
+{
+	int i = pagevec_count(pvec);
+
+	while (--i >= 0)
+		__free_pages_ok(pvec->pages[i], 0);
+}
+
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (!PageReserved(page) && put_page_testzero(page))


* Re: [patch 5/21] pagevec infrastructure
From: William Lee Irwin III @ 2002-08-14  8:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, lkml

On Sun, Aug 11, 2002 at 12:38:42AM -0700, Andrew Morton wrote:
> This is the first patch in a series of eight which address
> pagemap_lru_lock contention, and which simplify the VM locking
> hierarchy.
> Most testing has been done with all eight patches applied, so it would
> be best not to cherrypick, please.

dbench 256 on a 16x/16G numaq:

Throughput 50.3639 MB/sec (NB=62.9549 MB/sec  503.639 MBit/sec)  256 procs


address  samples  percent     symbol
c013c444 7514467  78.18       .text.lock.highmem
c013baac 584912   6.08539     kmap_high
c013bca0 556734   5.79223     kunmap_high
c012f260 133797   1.39202     generic_file_write
c012e53c 85433    0.88884     file_read_actor
c0114820 72361    0.752839    scheduler_tick
c01441ec 61757    0.642516    block_prepare_write
c013c18c 46977    0.488746    blk_queue_bounce
c0105394 46657    0.485417    default_idle
c01113b8 37263    0.387682    smp_apic_timer_interrupt
c0135afc 35618    0.370567    rmqueue
c014380c 26328    0.273915    __block_prepare_write
c012dec0 19238    0.200151    unlock_page
c01405f4 17391    0.180935    vfs_write
c01433e8 17382    0.180841    create_empty_buffers
c01361d8 16797    0.174755    page_cache_release
c01143d8 16727    0.174027    load_balance
c0136632 16378    0.170396    .text.lock.page_alloc
c013ba28 12628    0.131381    flush_all_zero_pkmaps
c013429c 10114    0.105225    lru_cache_add
c016de8c 9700     0.100918    ext2_prepare_write
c0135820 7610     0.079174    __free_pages_ok
