[very very drafty] prezeroing to increase the page fault rate

public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed

* [very very drafty] prezeroing to increase the page fault rate
@ 2004-12-15  2:34 Christoph Lameter
  2004-12-15 21:21 ` Robin Holt
                   ` (16 more replies)
  0 siblings, 17 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-15  2:34 UTC (permalink / raw)
  To: linux-ia64

The page fault patches address the scalability of the fault handler
by aggregating requests (anticipatory prefaulting) or by reducing the locking
overhead (page fault scalability patches). However, the main time spend in
the page fault handler is by zeroing pages. The following patch
zeroes pages in the background through hardware (Altix Block Transfer Engine)
or via software when the system is idle. This increases the performance
of the page fault handler dramatically even for only a single thread:

2.6.10-rc3-bk7 (allocating 1 GB):

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  1   1    8    0.029s      1.373s   0.039s 46733.217 167449.984
  1   1    4    0.016s      1.152s   0.043s 56064.229 152067.012
  1   1    2    0.011s      1.074s   0.056s 60349.726 115679.719
  1   1    1    0.012s      0.708s   0.072s 90933.436  90849.200

with patch:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  1   1    8    0.012s      0.759s   0.023s 84840.529 279197.309
  1   1    4    0.014s      0.307s   0.018s203360.588 354015.152
  1   1    2    0.021s      0.373s   0.023s166111.155 283594.162
  1   1    1    0.012s      0.200s   0.021s307839.729 306791.723

I have some spot results here that indicate that a single thread may
do up to 500000 faults a second with this patch alone.

The patches are not ready for prime time yet but I would like to get
some feedback on the approach taken. The patch will only work on IA64
for now and the BTE code is sometimes still doing funky things.

The patch introduces some new state information for pages on the free lists.
First, a zero bit to flag a page that has been zeroed. Second a lock
bit to lock the page during hardware or software zeroing.

This means also that a page may not always be able to be merged
or split by the buddy allocator. Thus page merging may have to be deferred
through a new "free_queue" in the zone structure.

The semantics for the list of free pages are also slightly changed. Zeroed
pages are always at the back of the list, unzeroed at the front. The system
may pick from the back or from the front depending on what type of page
is wanted. In order to use this mechanism all locations where a page
is acquired and then zeroed have been modified to use __GFP_ZERO
to request a zeroed page.

This also necessitates the introduction of a new queue in the pcp structure
for zeroed pages. Pages will be picked from that queue if the application
desires to have a zeroed page.

Hardware zeroing is started if pages are merged in the page
buddy allocator to an order greater than /proc/sys/vm/zero_order.

In the idle loop also a function idle_page_zero is called that picks
the highest order unzeroed page and zeroes it.

This typically results in almost all pages being zeroed at the end of the
bootup process. Any idle time may be used by the system to rezero
its memory and thus be able to speed up future memory allocation. It
will continually zero pages beyond a certain size if hardware support
is available.

Having memory zeroed also has a security aspect because information does
not linger for long in memory after the page that contains the information
has been deallocated.

In order to not negatively impact the system, page zeroing must be done in the
greatest chunks possible and possibly by hardware. The system always
selects the largest order for page to zero (and this necessitates the
introduction of the architecture specific function clear_pages()) in order
to have the highest efficiency. Zeroing pages may wipe out the processor
caches if not done by hardware.

The zeroing system typically discovers large chunks of unzeroed memory
and zeroes it during the startup phase of the system. Later typically
only smaller chunks of memory are processed.

Index: linux-2.6.9/arch/i386/kernel/process.c
=================================--- linux-2.6.9.orig/arch/i386/kernel/process.c	2004-12-13 11:18:42.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/process.c	2004-12-13 13:58:48.000000000 -0800
@@ -148,6 +148,10 @@
 	while (1) {
 		while (!need_resched()) {
 			void (*idle)(void);
+			void idle_page_zero(void);
+
+			idle_page_zero();
+
 			/*
 			 * Mark this as an RCU critical section so that
 			 * synchronize_kernel() in the unload path waits
Index: linux-2.6.9/include/linux/page-flags.h
=================================--- linux-2.6.9.orig/include/linux/page-flags.h	2004-10-18 14:54:39.000000000 -0700
+++ linux-2.6.9/include/linux/page-flags.h	2004-12-13 13:58:48.000000000 -0800
@@ -74,6 +74,7 @@
 #define PG_swapcache		16	/* Swap page: swp_entry_t in private */
 #define PG_mappedtodisk		17	/* Has blocks allocated on-disk */
 #define PG_reclaim		18	/* To be reclaimed asap */
+#define PG_zero			19	/* Page contains zeros (valid only on freelist) */


 /*
@@ -298,6 +299,11 @@
 #define PageSwapCache(page)	0
 #endif

+#define PageZero(page)		test_bit(PG_zero, &(page)->flags)
+#define SetPageZero(page)	set_bit(PG_zero, &(page)->flags)
+#define ClearPageZero(page)	clear_bit(PG_zero, &(page)->flags)
+#define TestClearPageZero(page)	test_and_clear_bit(PG_zero, &(page)->flags)
+
 struct page;	/* forward declaration */

 int test_clear_page_dirty(struct page *page);
Index: linux-2.6.9/arch/ia64/kernel/process.c
=================================--- linux-2.6.9.orig/arch/ia64/kernel/process.c	2004-12-10 12:42:27.000000000 -0800
+++ linux-2.6.9/arch/ia64/kernel/process.c	2004-12-13 13:58:48.000000000 -0800
@@ -238,7 +238,9 @@
 #endif
 		while (!need_resched()) {
 			void (*idle)(void);
+			void idle_page_zero(void);

+			idle_page_zero();
 			if (mark_idle)
 				(*mark_idle)(1);
 			/*
Index: linux-2.6.9/mm/page_alloc.c
=================================--- linux-2.6.9.orig/mm/page_alloc.c	2004-12-10 12:42:33.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c	2004-12-14 16:39:59.000000000 -0800
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Prezeroing of pages, Christoph Lameter, SGI, Dec 2004
  */

 #include <linux/config.h>
@@ -32,6 +33,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
+#include <linux/zero.h>

 #include <asm/tlbflush.h>

@@ -43,6 +45,7 @@
 long nr_swap_pages;
 int numnodes = 1;
 int sysctl_lower_zone_protection = 0;
+unsigned int sysctl_zero_order = 5;

 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);
@@ -90,13 +93,32 @@
 			1 << PG_active	|
 			1 << PG_dirty	|
 			1 << PG_swapcache |
-			1 << PG_writeback);
+			1 << PG_writeback |
+			1 << PG_zero);
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
 	tainted |= TAINT_BAD_PAGE;
 }

+LIST_HEAD(init_zero);
+
+/*
+ * Attempt to zero a page via hardware support.
+ */
+int zero_page(struct free_area *area, struct page *p, int order)
+{
+	struct list_head *l;
+
+	list_for_each(l, &init_zero) {
+		struct zero_driver *d = list_entry(l, struct zero_driver, list);
+		if (d->start_bzero(p, order) = 0) {
+			return 1;
+		}
+	}
+	return 0;
+}
+
 #ifndef CONFIG_HUGETLB_PAGE
 #define prep_compound_page(page, order) do { } while (0)
 #define destroy_compound_page(page, order) do { } while (0)
@@ -179,6 +201,13 @@
  * -- wli
  */

+static inline void free_page_queue(struct zone *zone, struct page *p, int order)
+{
+	printk("free_page_queue: queuing page=%p order=%d\n", page_address(p), order);
+	p->index = order;
+	list_add_tail(&p->lru, &zone->free_queue);
+}
+
 static inline void __free_pages_bulk (struct page *page, struct page *base,
 		struct zone *zone, struct free_area *area, unsigned int order)
 {
@@ -192,7 +221,6 @@
 		BUG();
 	index = page_idx >> (1 + order);

-	zone->free_pages += 1 << order;
 	while (order < MAX_ORDER-1) {
 		struct page *buddy1, *buddy2;

@@ -208,14 +236,72 @@
 		buddy2 = base + page_idx;
 		BUG_ON(bad_range(zone, buddy1));
 		BUG_ON(bad_range(zone, buddy2));
+		if (unlikely(PageLocked(buddy1))) {
+
+			/* Restore the page map */
+			change_bit(index, area->map);
+
+			/*
+			 * Page is locked due to zeroing. Thus we cannot update the map.
+			 * Queue the page for later insertion and leave the buddy page alone
+			 */
+			printk(KERN_ERR "__free_pages_bulk: Locked buddy page %p\n", page_address(buddy1));
+			free_page_queue(zone, buddy2, order);
+			return;
+		}
 		list_del(&buddy1->lru);
+
+		if (unlikely(PageZero(buddy1) && PageZero(buddy2))) {
+			if (buddy1 < buddy2) {
+				SetPageZero(buddy1);
+				ClearPageZero(buddy2);
+			} else {
+				SetPageZero(buddy2);
+				ClearPageZero(buddy1);
+			}
+		} else {
+			ClearPageZero(buddy1);
+			ClearPageZero(buddy2);
+		}
+
 		mask <<= 1;
 		order++;
 		area++;
 		index >>= 1;
 		page_idx &= mask;
 	}
-	list_add(&(base + page_idx)->lru, &area->free_list);
+	if (PageZero(page) || (order > sysctl_zero_order && zero_page(area, page, order)))
+		list_add_tail(&(base + page_idx)->lru, &area->free_list);
+	else
+		list_add(&(base + page_idx)->lru, &area->free_list);
+}
+
+/*
+ * Called by __alloc_pages if memory gets tight to clear up the queue of pages
+ * not yet in the buddy allocator maps
+ */
+static void free_queue_free(struct zone *zone)
+{
+	unsigned long flags;
+	struct list_head *l, *n;
+
+	if (list_empty(&zone->free_queue))
+		return;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list_for_each_safe(l, n, &zone->free_queue) {
+		struct page *page = list_entry(l, struct page, lru);
+		int  order = page->index;
+
+		/* The page may still be in the process of being zeroed... */
+		if (!PageLocked(page)) {
+			printk(KERN_ERR "free_pages_bulk: freeing from free_queue %p. order=%ld\n", page_address(page), page->index);
+			list_del(&page->lru);
+			__free_pages_bulk(page, zone->zone_mem_map, zone,
+				zone->free_area + order, order);
+		}
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -231,6 +317,7 @@
 			1 << PG_reclaim	|
 			1 << PG_slab	|
 			1 << PG_swapcache |
+			1 << PG_zero |
 			1 << PG_writeback )))
 		bad_page(function, page);
 	if (PageDirty(page))
@@ -266,6 +353,7 @@
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
+		zone->free_pages += 1 << order;
 		__free_pages_bulk(page, base, zone, area, order);
 		ret++;
 	}
@@ -316,7 +404,16 @@
 		high--;
 		size >>= 1;
 		BUG_ON(bad_range(zone, &page[size]));
-		list_add(&page[size].lru, &area->free_list);
+		/*
+		 * If the main page was zeroed then the
+		 * split off page is also and must be added to the
+		 * end of the list
+		 */
+		if (PageZero(page)) {
+			SetPageZero(page + size);
+			list_add_tail(&page[size].lru, &area->free_list);
+		} else
+			list_add(&page[size].lru, &area->free_list);
 		MARK_USED(index + size, high, area);
 	}
 	return page;
@@ -341,7 +438,7 @@
 /*
  * This page is about to be returned from the page allocator
  */
-static void prep_new_page(struct page *page, int order)
+static void prep_new_page(struct page *page, int order, int zero)
 {
 	if (page->mapping || page_mapped(page) ||
 	    (page->flags & (
@@ -355,18 +452,35 @@
 			1 << PG_writeback )))
 		bad_page(__FUNCTION__, page);

+	if (zero) {
+		if (PageHighMem(page)) {
+			int n = 1 << order;
+
+			while (n--> 0)
+				clear_highpage(page + n);
+		} else
+		if (!PageZero(page))
+			clear_pages(page_address(page), order);
+	}
+
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_checked | 1 << PG_mappedtodisk);
+			1 << PG_checked | 1 << PG_mappedtodisk |
+			1 << PG_zero);
 	page->private = 0;
 	set_page_refs(page, order);
 }

+/* Ways of removing a page from a queue */
+#define ALLOC_FRONT 0
+#define ALLOC_BACK 1
+#define ALLOC_ZERO 2
+
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int mode)
 {
 	struct free_area * area;
 	unsigned int current_order;
@@ -378,7 +492,18 @@
 		if (list_empty(&area->free_list))
 			continue;

-		page = list_entry(area->free_list.next, struct page, lru);
+		page = list_entry(
+				mode != ALLOC_FRONT ? area->free_list.prev: area->free_list.next,
+				struct page,
+				lru);
+
+		if (PageLocked(page))
+			/* Order is being zeroed. Do not disturb */
+			continue;
+
+		if (mode = ALLOC_ZERO && !PageZero(page))
+			/* Must return zero page and there is no zero page available */
+			continue;
 		list_del(&page->lru);
 		index = page - zone->zone_mem_map;
 		if (current_order != MAX_ORDER-1)
@@ -396,7 +521,7 @@
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int mode)
 {
 	unsigned long flags;
 	int i;
@@ -405,7 +530,7 @@

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, mode);
 		if (page = NULL)
 			break;
 		allocated++;
@@ -546,7 +671,8 @@
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int mode = (gfp_flags & __GFP_ZERO) ? ALLOC_BACK : ALLOC_FRONT;
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*mode;

 	if (order = 0) {
 		struct per_cpu_pages *pcp;
@@ -555,7 +681,7 @@
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, mode * 2);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -567,14 +693,14 @@

 	if (page = NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, mode);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
 		mod_page_state_zone(zone, pgalloc, 1 << order);
-		prep_new_page(page, order);
+		prep_new_page(page, order, mode);
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -693,6 +819,7 @@

 	/* go through the zonelist yet one more time */
 	for (i = 0; (z = zones[i]) != NULL; i++) {
+		free_queue_free(z);
 		min = z->pages_min;
 		if (gfp_mask & __GFP_HIGH)
 			min /= 2;
@@ -767,12 +894,9 @@
 	 */
 	BUG_ON(gfp_mask & __GFP_HIGHMEM);

-	page = alloc_pages(gfp_mask, 0);
-	if (page) {
-		void *address = page_address(page);
-		clear_page(address);
-		return (unsigned long) address;
-	}
+	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+	if (page)
+		return (unsigned long) page_address(page);
 	return 0;
 }

@@ -1030,6 +1154,7 @@

 #define K(x) ((x) << (PAGE_SHIFT-10))

+const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1062,10 +1187,10 @@

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1150,6 +1275,75 @@
 }

 /*
+ * Idle page zero takes a page off the front of the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ *
+ * Page zeroing only works in zone 0 in order to insure that a numa cpu
+ * only clears its own memory space (This is true for SGI Altix but
+ * maybe there are other situations ?? )
+ */
+
+
+void idle_page_zero(void)
+{
+	struct zone *z;
+	struct free_area *area;
+	unsigned long flags;
+
+	if (sysctl_zero_order >= MAX_ORDER || system_state != SYSTEM_RUNNING)
+		return;
+
+	z = NODE_DATA(numa_node_id())->node_zones + 0;
+
+	local_irq_save(flags);
+	if (!spin_trylock(&z->lock)) {
+		/*
+		 * We can easily defer this so if someone is already holding the lock
+		 * be nice to them and let them do what they have to do
+		 */
+		local_irq_restore(flags);
+		return;
+	}
+	/*
+	 * Find order where we could do something. We always begin the
+	 * scan at the top. Lower pages may coalesce into higher orders
+	 * whereupon they may loose the zero page mark. Thus it is advantageous
+	 * always to zero the highest order we can find.
+	 */
+	for(area = z->free_area + MAX_ORDER - 1; area >= z->free_area; area--)
+		if (!list_empty(&area->free_list)) {
+			struct page *p = list_entry(area->free_list.next, struct page, lru);
+
+			if (!PageLocked(p) && !PageZero(p)) {
+				int order = area - z->free_area;
+
+				list_move_tail(&p->lru, &area->free_list);
+				if (zero_page(area, p, order))
+					goto out;
+
+				/* Unable to find a zeroing device that would
+				 * deal with this page so just do it on our own.
+				 * This will likely thrash our caches but the system
+				 * is idle after all and we can handle this with
+				 * minimal administrative overhead after dropping
+				 * the lock
+				 */
+				SetPageZero(p);
+				SetPageLocked(p);
+				spin_unlock_irqrestore(&z->lock, flags);
+				clear_pages(page_address(p), order);
+				smp_wmb();
+				ClearPageLocked(p);
+				return;
+			}
+		}
+out:
+	spin_unlock_irqrestore(&z->lock, flags);
+}
+
+/*
  * Builds allocation fallback zone lists.
  */
 static int __init build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, int j, int k)
@@ -1549,11 +1743,19 @@
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
+		INIT_LIST_HEAD(&zone->free_queue);
 		zone->nr_scan_active = 0;
 		zone->nr_scan_inactive = 0;
 		zone->nr_active = 0;
Index: linux-2.6.9/include/linux/mmzone.h
=================================--- linux-2.6.9.orig/include/linux/mmzone.h	2004-12-10 12:42:33.000000000 -0800
+++ linux-2.6.9/include/linux/mmzone.h	2004-12-13 13:58:48.000000000 -0800
@@ -51,7 +51,7 @@
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -132,7 +132,7 @@
 	 */
 	spinlock_t		lock;
 	struct free_area	free_area[MAX_ORDER];
-
+	struct list_head	free_queue;		/* Queued pages not in maps yet */

 	ZONE_PADDING(_pad1_)

Index: linux-2.6.9/include/linux/gfp.h
=================================--- linux-2.6.9.orig/include/linux/gfp.h	2004-10-18 14:53:44.000000000 -0700
+++ linux-2.6.9/include/linux/gfp.h	2004-12-13 13:58:48.000000000 -0800
@@ -37,6 +37,7 @@
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_ZERO	0x8000	/* Return zeroed page on success */

 #define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.9/mm/memory.c
=================================--- linux-2.6.9.orig/mm/memory.c	2004-12-10 12:42:33.000000000 -0800
+++ linux-2.6.9/mm/memory.c	2004-12-13 13:58:48.000000000 -0800
@@ -1445,10 +1445,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.9/kernel/profile.c
=================================--- linux-2.6.9.orig/kernel/profile.c	2004-12-10 12:42:33.000000000 -0800
+++ linux-2.6.9/kernel/profile.c	2004-12-13 13:58:48.000000000 -0800
@@ -326,17 +326,15 @@
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				return NOTIFY_BAD;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				goto out_free;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
 		}
 		break;
@@ -510,16 +508,14 @@
 		int node = cpu_to_node(cpu);
 		struct page *page;

-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[0]
 				= (struct profile_hit *)page_address(page);
 	}
Index: linux-2.6.9/mm/shmem.c
=================================--- linux-2.6.9.orig/mm/shmem.c	2004-12-10 12:42:33.000000000 -0800
+++ linux-2.6.9/mm/shmem.c	2004-12-13 13:58:48.000000000 -0800
@@ -369,9 +369,8 @@
 		}

 		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
 		if (page) {
-			clear_highpage(page);
 			page->nr_swapped = 0;
 		}
 		spin_lock(&info->lock);
@@ -910,7 +909,7 @@
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -926,7 +925,7 @@
 shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
 				 unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(gfp) | __GFP_ZERO;
 }
 #endif

@@ -1135,7 +1134,6 @@

 		info->alloced++;
 		spin_unlock(&info->lock);
-		clear_highpage(filepage);
 		flush_dcache_page(filepage);
 		SetPageUptodate(filepage);
 	}
Index: linux-2.6.9/mm/hugetlb.c
=================================--- linux-2.6.9.orig/mm/hugetlb.c	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c	2004-12-13 13:58:48.000000000 -0800
@@ -77,7 +77,6 @@
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -88,8 +87,7 @@
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	clear_pages(page_address(page), HUGETLB_PAGE_ORDER);
 	return page;
 }

Index: linux-2.6.9/arch/ia64/lib/Makefile
=================================--- linux-2.6.9.orig/arch/ia64/lib/Makefile	2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/arch/ia64/lib/Makefile	2004-12-13 13:58:48.000000000 -0800
@@ -6,7 +6,7 @@

 lib-y := __divsi3.o __udivsi3.o __modsi3.o __umodsi3.o			\
 	__divdi3.o __udivdi3.o __moddi3.o __umoddi3.o			\
-	bitop.o checksum.o clear_page.o csum_partial_copy.o copy_page.o	\
+	bitop.o checksum.o clear_page.o clear_pages.o csum_partial_copy.o copy_page.o	\
 	clear_user.o strncpy_from_user.o strlen_user.o strnlen_user.o	\
 	flush.o ip_fast_csum.o do_csum.o				\
 	memset.o strlen.o swiotlb.o
Index: linux-2.6.9/include/asm-ia64/page.h
=================================--- linux-2.6.9.orig/include/asm-ia64/page.h	2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/page.h	2004-12-13 13:58:48.000000000 -0800
@@ -57,6 +57,7 @@
 #  define STRICT_MM_TYPECHECKS

 extern void clear_page (void *page);
+extern void clear_pages (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
Index: linux-2.6.9/arch/ia64/lib/clear_pages.S
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/arch/ia64/lib/clear_pages.S	2004-12-13 13:58:48.000000000 -0800
@@ -0,0 +1,84 @@
+/*
+ * Copyright (C) 1999-2002 Hewlett-Packard Co
+ *	Stephane Eranian <eranian@hpl.hp.com>
+ *	David Mosberger-Tang <davidm@hpl.hp.com>
+ * Copyright (C) 2002 Ken Chen <kenneth.w.chen@intel.com>
+ *
+ * 1/06/01 davidm	Tuned for Itanium.
+ * 2/12/02 kchen	Tuned for both Itanium and McKinley
+ * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
+ */
+#include <linux/config.h>
+
+#include <asm/asmmacro.h>
+#include <asm/page.h>
+
+#ifdef CONFIG_ITANIUM
+# define L3_LINE_SIZE	64	// Itanium L3 line size
+# define PREFETCH_LINES	9	// magic number
+#else
+# define L3_LINE_SIZE	128	// McKinley L3 line size
+# define PREFETCH_LINES	12	// magic number
+#endif
+
+#define saved_lc	r2
+#define dst_fetch	r3
+#define dst1		r8
+#define dst2		r9
+#define dst3		r10
+#define dst4		r11
+
+#define dst_last	r31
+#define totsize		r14
+
+GLOBAL_ENTRY(clear_pages)
+	.prologue
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
+	.save ar.lc, saved_lc
+	mov saved_lc = ar.lc
+	;;
+	.body
+	adds dst1 = 16, in0
+	mov ar.lc = (PREFETCH_LINES - 1)
+	mov dst_fetch = in0
+	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
+	;;
+.fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
+	adds dst3 = 48, in0		// executing this multiple times is harmless
+	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
+	;;
+	mov ar.lc = r16			// one L3 line per iteration
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
+	;;
+#ifdef CONFIG_ITANIUM
+	// Optimized for Itanium
+1:	stf.spill.nta [dst1] = f0, 64
+	stf.spill.nta [dst2] = f0, 64
+	cmp.lt p8,p0=dst_fetch, dst_last
+	;;
+#else
+	// Optimized for McKinley
+1:	stf.spill.nta [dst1] = f0, 64
+	stf.spill.nta [dst2] = f0, 64
+	stf.spill.nta [dst3] = f0, 64
+	stf.spill.nta [dst4] = f0, 128
+	cmp.lt p8,p0=dst_fetch, dst_last
+	;;
+	stf.spill.nta [dst1] = f0, 64
+	stf.spill.nta [dst2] = f0, 64
+#endif
+	stf.spill.nta [dst3] = f0, 64
+(p8)	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
+	br.cloop.sptk.few 1b
+	;;
+	mov ar.lc = saved_lc		// restore lc
+	br.ret.sptk.many rp
+END(clear_pages)
Index: linux-2.6.9/kernel/sysctl.c
=================================--- linux-2.6.9.orig/kernel/sysctl.c	2004-12-10 12:42:33.000000000 -0800
+++ linux-2.6.9/kernel/sysctl.c	2004-12-13 21:28:29.000000000 -0800
@@ -67,6 +67,7 @@
 extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
 extern int pid_max_min, pid_max_max;
+extern unsigned int sysctl_zero_order;

 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
 int unknown_nmi_panic;
@@ -816,6 +817,15 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_ZERO_ORDER,
+		.procname	= "zero_order",
+		.data		= &sysctl_zero_order,
+		.maxlen		= sizeof(sysctl_zero_order),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.9/include/linux/sysctl.h
=================================--- linux-2.6.9.orig/include/linux/sysctl.h	2004-12-10 12:42:33.000000000 -0800
+++ linux-2.6.9/include/linux/sysctl.h	2004-12-13 20:42:35.000000000 -0800
@@ -168,6 +168,7 @@
 	VM_VFS_CACHE_PRESSURE&, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT', /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT(, /* default time for token time out */
+	VM_ZERO_ORDER),	/* idle page zeroing */
 };


Index: linux-2.6.9/include/linux/zero.h
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/include/linux/zero.h	2004-12-14 11:40:58.000000000 -0800
@@ -0,0 +1,27 @@
+#ifndef _LINUX_ZERO_H
+#define _LINUX_ZERO_H
+
+/*
+ * Definitions for drivers that allow the zeroing of memory
+ * without
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+        int (*start_bzero)(struct page *p, int order);
+        struct list_head list;
+};
+
+extern struct list_head init_zero;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &init_zero);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+#endif
Index: linux-2.6.9/include/asm-ia64/sn/bte.h
=================================--- linux-2.6.9.orig/include/asm-ia64/sn/bte.h	2004-12-10 12:42:32.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/sn/bte.h	2004-12-14 17:35:45.000000000 -0800
@@ -115,6 +115,7 @@
 	int bte_error_count;	/* Number of errors encountered     */
 	int bte_num;		/* 0 --> BTE0, 1 --> BTE1           */
 	int cleanup_active;	/* Interface is locked for cleanup  */
+	struct page *zp;	/* Page being zeroed 		    */
 	volatile bte_result_t bh_error;	/* error while processing   */
 	volatile u64 *most_rcnt_na;
 };
Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
=================================--- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c	2004-12-13 21:36:19.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/bte.c	2004-12-14 18:19:07.000000000 -0800
@@ -20,6 +20,8 @@
 #include <linux/bootmem.h>
 #include <linux/string.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/zero.h>

 #include <asm/sn/bte.h>

@@ -30,7 +32,7 @@
 /* two interfaces on two btes */
 #define MAX_INTERFACES_TO_TRY		4

-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
 {
 	nodepda_t *tmp_nodepda;

@@ -39,6 +41,14 @@

 }

+static inline void bte_bzero_complete(struct bteinfo_s *bte) {
+	if (bte->zp) {
+		printk(KERN_WARNING "bzero: completed %p\n",page_address(bte->zp));
+		ClearPageLocked(bte->zp);
+		*bte->most_rcnt_na = BTE_WORD_AVAILABLE;
+		bte->zp = NULL;
+	}
+}
 /************************************************************************
  * Block Transfer Engine copy related functions.
  *
@@ -132,13 +142,13 @@
 			if (bte = NULL) {
 				continue;
 			}
-
 			if (spin_trylock(&bte->spinlock)) {
 				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
 				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
 					/* Got the lock but BTE still busy */
 					spin_unlock(&bte->spinlock);
 				} else {
+					bte_bzero_complete(bte);
 					/* we got the lock and it's not busy */
 					break;
 				}
@@ -448,6 +458,94 @@
 		mynodepda->bte_if[i].bte_num = i;
 		mynodepda->bte_if[i].cleanup_active = 0;
 		mynodepda->bte_if[i].bh_error = 0;
+		mynodepda->bte_if[i].zp = NULL;
+	}
+}
+
+static inline void check_bzero_complete(void)
+{
+	unsigned long irq_flags;
+	struct bteinfo_s *bte;
+
+	/* CPU 0 (per node) uses bte0 , CPU 1 uses bte1 */
+	bte = bte_if_on_node(get_nasid(), cpuid_to_subnode(smp_processor_id()));
+
+	if (!bte->zp)
+		return;
+	local_irq_save(irq_flags);
+	if (!spin_trylock(&bte->spinlock)) {
+		local_irq_restore(irq_flags);
+		return;
+	}
+	if (*bte->most_rcnt_na = BTE_WORD_BUSY ||
+            (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
+                spin_unlock_irqrestore(&bte->spinlock, irq_flags);
+		return;
+	}
+	bte_bzero_complete(bte);
+	spin_unlock_irqrestore(&bte->spinlock, irq_flags);
+}
+
+static int bte_start_bzero(struct page *p, int order)
+{
+	struct bteinfo_s *bte;
+	unsigned int len = PAGE_SIZE << order;
+	unsigned long irq_flags;
+
+
+	/* Check limitations.
+		1. System must be running (weird things happen during bootup)
+		2. Size >128KB. Smaller requests cause too much bte traffic
+	 */
+	if (len > BTE_MAX_XFER ||
+	    order < 4 ||
+	    system_state != SYSTEM_RUNNING) {
+		check_bzero_complete();
+		return EINVAL;
+	}
+
+	/* CPU 0 (per node) uses bte0 , CPU 1 uses bte1 */
+	bte = bte_if_on_node(get_nasid(), cpuid_to_subnode(smp_processor_id()));
+	local_irq_save(irq_flags);
+
+	if (!spin_trylock(&bte->spinlock)) {
+		local_irq_restore(irq_flags);
+		printk(KERN_INFO "bzero: bte spinlock locked\n");
+		return EBUSY;
 	}

+	/* Complete any pending bzero notification */
+	bte_bzero_complete(bte);
+
+	if (bte->zp ||
+	    !(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
+	    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
+		/* Got the lock but BTE still busy */
+		spin_unlock_irqrestore(&bte->spinlock, irq_flags);
+		return EBUSY;
+	}
+	printk(KERN_INFO "bzero: start address=%p length=%d\n", page_address(p), len);
+	bte->most_rcnt_na = &bte->notify;
+	*bte->most_rcnt_na = BTE_WORD_BUSY;
+	bte->zp = p;
+	SetPageLocked(p);
+	SetPageZero(p);
+	BTE_LNSTAT_STORE(bte, IBLS_BUSY | ((len >> L1_CACHE_SHIFT) & BTE_LEN_MASK));
+	BTE_SRC_STORE(bte, TO_PHYS(ia64_tpa(page_address(p))));
+	BTE_DEST_STORE(bte, 0);
+	BTE_NOTIF_STORE(bte,
+			TO_PHYS(ia64_tpa((unsigned long)bte->most_rcnt_na)));
+	BTE_CTRL_STORE(bte, BTE_ZERO_FILL);
+
+	spin_unlock_irqrestore(&bte->spinlock, irq_flags);
+	return 0;
+
+}
+
+static struct zero_driver bte_bzero = {
+	.start_bzero = bte_start_bzero
+};
+
+void sn_bte_bzero_init(void) {
+	register_zero_driver(&bte_bzero);
 }
Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
=================================--- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c	2004-12-10 12:42:27.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/setup.c	2004-12-14 12:32:15.000000000 -0800
@@ -243,6 +243,7 @@
 	int pxm;
 	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
 	extern void sn_cpu_init(void);
+	extern void sn_bte_bzero_init(void);

 	/*
 	 * If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
 	screen_info = sn_screen_info;

 	sn_timer_init();
+	sn_bte_bzero_init();
 }

 /**


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
@ 2004-12-15 21:21 ` Robin Holt
  2004-12-15 21:58 ` Christoph Lameter
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Robin Holt @ 2004-12-15 21:21 UTC (permalink / raw)
  To: linux-ia64

On Tue, Dec 14, 2004 at 06:34:13PM -0800, Christoph Lameter wrote:
> The page fault patches address the scalability of the fault handler
> by aggregating requests (anticipatory prefaulting) or by reducing the locking
> overhead (page fault scalability patches). However, the main time spend in
> the page fault handler is by zeroing pages. The following patch
> zeroes pages in the background through hardware (Altix Block Transfer Engine)
> or via software when the system is idle. This increases the performance
> of the page fault handler dramatically even for only a single thread:
> 
> 2.6.10-rc3-bk7 (allocating 1 GB):
> 
>  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>   1   1    8    0.029s      1.373s   0.039s 46733.217 167449.984
>   1   1    4    0.016s      1.152s   0.043s 56064.229 152067.012
>   1   1    2    0.011s      1.074s   0.056s 60349.726 115679.719
>   1   1    1    0.012s      0.708s   0.072s 90933.436  90849.200
> 
> with patch:
> 
>  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>   1   1    8    0.012s      0.759s   0.023s 84840.529 279197.309
>   1   1    4    0.014s      0.307s   0.018s203360.588 354015.152
>   1   1    2    0.021s      0.373s   0.023s166111.155 283594.162
>   1   1    1    0.012s      0.200s   0.021s307839.729 306791.723
> 
> I have some spot results here that indicate that a single thread may
> do up to 500000 faults a second with this patch alone.

This sounds impressive, but from my limited understanding of the patches,
I think it is a misleading figure.  This would require the system to
sit idle for a period of time between large jobs to ensure that
enough pages are free so all allocations could be satisfied from the
pre-zeroed section.

My understanding (very limited as I only spent 15 minutes looking)
is that only idle cpus are actually queueing pages for zereoing.
Is this correct or am I off the mark?

If that is so, I think we need to rethink this some.  I believe the
largest benefit would come if you used timers to check for a previous
page zero operation completing and then queueing up the next.  This
could be done for all nodes that are owned by a parent node.  That would
allow a system with many mbricks (and therefore many btes) with very
few cbricks to effeciently use all the btes for zeroeoing.  Is that
the intent at any point in this patch life?  Otherwise you end up
with speedups only when you have idle cpus for zereoing.

> Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
> =================================> --- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c	2004-12-13 21:36:19.000000000 -0800
> +++ linux-2.6.9/arch/ia64/sn/kernel/bte.c	2004-12-14 18:19:07.000000000 -0800
> @@ -448,6 +458,94 @@
>  		mynodepda->bte_if[i].bte_num = i;
>  		mynodepda->bte_if[i].cleanup_active = 0;
>  		mynodepda->bte_if[i].bh_error = 0;
> +		mynodepda->bte_if[i].zp = NULL;
> +	}
> +}
> +
> +static inline void check_bzero_complete(void)
> +{
> +	unsigned long irq_flags;
> +	struct bteinfo_s *bte;
> +
> +	/* CPU 0 (per node) uses bte0 , CPU 1 uses bte1 */
> +	bte = bte_if_on_node(get_nasid(), cpuid_to_subnode(smp_processor_id()));
> +
> +	if (!bte->zp)
> +		return;
> +	local_irq_save(irq_flags);
> +	if (!spin_trylock(&bte->spinlock)) {
> +		local_irq_restore(irq_flags);
> +		return;
> +	}
> +	if (*bte->most_rcnt_na = BTE_WORD_BUSY ||
> +            (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
> +                spin_unlock_irqrestore(&bte->spinlock, irq_flags);
> +		return;
> +	}
> +	bte_bzero_complete(bte);
> +	spin_unlock_irqrestore(&bte->spinlock, irq_flags);
> +}
> +

Why not have a seperate notification line for zereoing operations.
Add a seperate bte flag in that says "use the zereoing notification
line" and have it return the address of the line being used.

You start then calls bte_copy with the flags and you get back
the notification line you are concerned with.  Alternatively,
you could put the notification line into a structure owned
by the bte_start_zero() private structures and pass the address
in.  This allows the bte_copy code to operation as is.  It will
also simplify the bte_start_bzero significantly and make it
very easy to keep things consistent.  I also makes understanding
the bte_copy code easier since there is no back-door interaction
with any other functions.


> +static int bte_start_bzero(struct page *p, int order)
> +{
> +	struct bteinfo_s *bte;
> +	unsigned int len = PAGE_SIZE << order;
> +	unsigned long irq_flags;
> +
> +
> +	/* Check limitations.
> +		1. System must be running (weird things happen during bootup)
> +		2. Size >128KB. Smaller requests cause too much bte traffic
> +	 */
> +	if (len > BTE_MAX_XFER ||
> +	    order < 4 ||
> +	    system_state != SYSTEM_RUNNING) {
> +		check_bzero_complete();
> +		return EINVAL;
> +	}
> +
> +	/* CPU 0 (per node) uses bte0 , CPU 1 uses bte1 */
> +	bte = bte_if_on_node(get_nasid(), cpuid_to_subnode(smp_processor_id()));
> +	local_irq_save(irq_flags);
> +
> +	if (!spin_trylock(&bte->spinlock)) {
> +		local_irq_restore(irq_flags);
> +		printk(KERN_INFO "bzero: bte spinlock locked\n");
> +		return EBUSY;
>  	}
> 
> +	/* Complete any pending bzero notification */
> +	bte_bzero_complete(bte);
> +
> +	if (bte->zp ||
> +	    !(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
> +	    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
> +		/* Got the lock but BTE still busy */
> +		spin_unlock_irqrestore(&bte->spinlock, irq_flags);
> +		return EBUSY;
> +	}
> +	printk(KERN_INFO "bzero: start address=%p length=%d\n", page_address(p), len);
> +	bte->most_rcnt_na = &bte->notify;
> +	*bte->most_rcnt_na = BTE_WORD_BUSY;
> +	bte->zp = p;
> +	SetPageLocked(p);
> +	SetPageZero(p);
> +	BTE_LNSTAT_STORE(bte, IBLS_BUSY | ((len >> L1_CACHE_SHIFT) & BTE_LEN_MASK));
> +	BTE_SRC_STORE(bte, TO_PHYS(ia64_tpa(page_address(p))));
> +	BTE_DEST_STORE(bte, 0);
> +	BTE_NOTIF_STORE(bte,
> +			TO_PHYS(ia64_tpa((unsigned long)bte->most_rcnt_na)));
> +	BTE_CTRL_STORE(bte, BTE_ZERO_FILL);
> +
> +	spin_unlock_irqrestore(&bte->spinlock, irq_flags);
> +	return 0;
> +
> +}
> +
> +static struct zero_driver bte_bzero = {
> +	.start_bzero = bte_start_bzero
> +};
> +
> +void sn_bte_bzero_init(void) {
> +	register_zero_driver(&bte_bzero);
>  }

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
  2004-12-15 21:21 ` Robin Holt
@ 2004-12-15 21:58 ` Christoph Lameter
  2004-12-15 22:00 ` Christoph Lameter
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-15 21:58 UTC (permalink / raw)
  To: linux-ia64

On Wed, 15 Dec 2004, Robin Holt wrote:

> > I have some spot results here that indicate that a single thread may
> > do up to 500000 faults a second with this patch alone.
>
> This sounds impressive, but from my limited understanding of the patches,
> I think it is a misleading figure.  This would require the system to
> sit idle for a period of time between large jobs to ensure that
> enough pages are free so all allocations could be satisfied from the
> pre-zeroed section.

There is enough time for all pages to be zeroed on bootup so there is
a large repository of zeroed pages. Of course if the system uses up that
reservoir and the zeroing is not effectively keeping up then this
will degenerates into the old behavior.

> My understanding (very limited as I only spent 15 minutes looking)
> is that only idle cpus are actually queueing pages for zereoing.
> Is this correct or am I off the mark?

Idle cpus will zero pages and the bte will be used if pages are coalesced
by the buddy allocator above a certain size.

> If that is so, I think we need to rethink this some.  I believe the
> largest benefit would come if you used timers to check for a previous
> page zero operation completing and then queueing up the next.  This
> could be done for all nodes that are owned by a parent node.  That would
> allow a system with many mbricks (and therefore many btes) with very
> few cbricks to effeciently use all the btes for zeroeoing.  Is that
> the intent at any point in this patch life?  Otherwise you end up
> with speedups only when you have idle cpus for zereoing.

Having access to more than one bte per cpu is certainly an interesting
idea. Any suggestions on how to discover these?

(will follow up in next post on your code comments).


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
  2004-12-15 21:21 ` Robin Holt
  2004-12-15 21:58 ` Christoph Lameter
@ 2004-12-15 22:00 ` Christoph Lameter
  2004-12-16  0:25 ` Nick Piggin
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-15 22:00 UTC (permalink / raw)
  To: linux-ia64

On Wed, 15 Dec 2004, Robin Holt wrote:

> > +static inline void check_bzero_complete(void)
> > +{
> > +	unsigned long irq_flags;
> > +	struct bteinfo_s *bte;
> > +
> > +	/* CPU 0 (per node) uses bte0 , CPU 1 uses bte1 */
> > +	bte = bte_if_on_node(get_nasid(), cpuid_to_subnode(smp_processor_id()));
> > +
> > +	if (!bte->zp)
> > +		return;
> > +	local_irq_save(irq_flags);
> > +	if (!spin_trylock(&bte->spinlock)) {
> > +		local_irq_restore(irq_flags);
> > +		return;
> > +	}
> > +	if (*bte->most_rcnt_na = BTE_WORD_BUSY ||
> > +            (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
> > +                spin_unlock_irqrestore(&bte->spinlock, irq_flags);
> > +		return;
> > +	}
> > +	bte_bzero_complete(bte);
> > +	spin_unlock_irqrestore(&bte->spinlock, irq_flags);
> > +}
> > +
>
> Why not have a seperate notification line for zereoing operations.
> Add a seperate bte flag in that says "use the zereoing notification
> line" and have it return the address of the line being used.
>
> You start then calls bte_copy with the flags and you get back
> the notification line you are concerned with.  Alternatively,
> you could put the notification line into a structure owned
> by the bte_start_zero() private structures and pass the address
> in.  This allows the bte_copy code to operation as is.  It will
> also simplify the bte_start_bzero significantly and make it
> very easy to keep things consistent.  I also makes understanding
> the bte_copy code easier since there is no back-door interaction
> with any other functions.

Good idea. Will do that in the next rev

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (2 preceding siblings ...)
  2004-12-15 22:00 ` Christoph Lameter
@ 2004-12-16  0:25 ` Nick Piggin
  2004-12-16  0:41 ` Christoph Lameter
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2004-12-16  0:25 UTC (permalink / raw)
  To: linux-ia64

On Wed, 2004-12-15 at 13:58 -0800, Christoph Lameter wrote:
> On Wed, 15 Dec 2004, Robin Holt wrote:
> 
> > > I have some spot results here that indicate that a single thread may
> > > do up to 500000 faults a second with this patch alone.
> >
> > This sounds impressive, but from my limited understanding of the patches,
> > I think it is a misleading figure.  This would require the system to
> > sit idle for a period of time between large jobs to ensure that
> > enough pages are free so all allocations could be satisfied from the
> > pre-zeroed section.
> 
> There is enough time for all pages to be zeroed on bootup so there is
> a large repository of zeroed pages. Of course if the system uses up that
> reservoir and the zeroing is not effectively keeping up then this
> will degenerates into the old behavior.
> 

Just curious - how does this go on a real workload as opposed to
raw pagefault performance?

You said the majority of the time in the fault handler is taken
in zeroing pages, but that does have the upshot of warming up the
cache on memory which is likely to be used soon.

I think if there were *no* downsides to this patch, it would
obviously make sense - because in that case it doesn't really matter
whether the cache is heated by the fault handler or by the app
itself... but it does have some downsides (complexity, memory
bandwidth, cache degradation in the case of idle zeroing, less
predictable performance).

I'm not saying the negatives outweigh the positives, but I wonder.
I think this sort of thing has been contentious in the past.

Also just a stupid question - would an madvise(..., MADV_PREFAULT)
be of use to you? Or is that too difficult to get a good NUMA
allocation layout?

Nick

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (3 preceding siblings ...)
  2004-12-16  0:25 ` Nick Piggin
@ 2004-12-16  0:41 ` Christoph Lameter
  2004-12-16  0:41 ` Linus Torvalds
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-16  0:41 UTC (permalink / raw)
  To: linux-ia64

On Thu, 16 Dec 2004, Nick Piggin wrote:

> Just curious - how does this go on a real workload as opposed to
> raw pagefault performance?

No idea. Not tested yet. The bench models the startup behavior
of many large scale apps though. I am trying to find some other tests
that are nearer to reality and see what other effects this has. Lmbench
shows some interesting things but I need to finish that first.

> You said the majority of the time in the fault handler is taken
> in zeroing pages, but that does have the upshot of warming up the
> cache on memory which is likely to be used soon.

Right. The test that I am using only touches one word of each page so it
is not reflecting what a real application would be doing. The patch makes
the page zeroing engine stay away from the hot pages in the pcp structure
though.

> I think if there were *no* downsides to this patch, it would
> obviously make sense - because in that case it doesn't really matter
> whether the cache is heated by the fault handler or by the app
> itself... but it does have some downsides (complexity, memory
> bandwidth, cache degradation in the case of idle zeroing, less
> predictable performance).

I am also not sure if this will make sense. But its something that I felt
had to be tried.

> I'm not saying the negatives outweigh the positives, but I wonder.
> I think this sort of thing has been contentious in the past.

Andrea's patches last summer zeroed hot pages. I think that is definitely
not useful. The zeroing must have minimal impact otherwise this wont work.
The best thing would be to have some dma device that does the zeroing
without touching the cache.

> Also just a stupid question - would an madvise(..., MADV_PREFAULT)
> be of use to you? Or is that too difficult to get a good NUMA
> allocation layout?

Its easy to implement. The current code already checks for MADV_RAND and
switches off prefaulting for that case.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (4 preceding siblings ...)
  2004-12-16  0:41 ` Christoph Lameter
@ 2004-12-16  0:41 ` Linus Torvalds
  2004-12-16  0:46 ` Christoph Lameter
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-12-16  0:41 UTC (permalink / raw)
  To: linux-ia64



On Thu, 16 Dec 2004, Nick Piggin wrote:
> 
> I think if there were *no* downsides to this patch, it would
> obviously make sense - because in that case it doesn't really matter
> whether the cache is heated by the fault handler or by the app
> itself... but it does have some downsides (complexity, memory
> bandwidth, cache degradation in the case of idle zeroing, less
> predictable performance).

The big downside I personally fear is that the added activity disturbs
other processes.

For example, even outside the obvious bus activity issues on SMP systems, 
things like idle loop exit latency and even just CPU power usage are 
things that makes me go "Hmmm.."

I don't dislike pre-zeroing per se, but I'm pretty strongly of the opinion
that it should be done not by the idle loop or any other very low-level
system process. I'd personally much prefer it donne at a somewhat higher
level, so that users can control it as part of system management, and so 
that it does not impact the _real_ idle loop.

So I'd much prefer a background "scrubd" or similar. 

			Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (5 preceding siblings ...)
  2004-12-16  0:41 ` Linus Torvalds
@ 2004-12-16  0:46 ` Christoph Lameter
  2004-12-16  0:50 ` Nick Piggin
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-16  0:46 UTC (permalink / raw)
  To: linux-ia64

On Wed, 15 Dec 2004, Linus Torvalds wrote:

> I don't dislike pre-zeroing per se, but I'm pretty strongly of the opinion
> that it should be done not by the idle loop or any other very low-level
> system process. I'd personally much prefer it donne at a somewhat higher
> level, so that users can control it as part of system management, and so
> that it does not impact the _real_ idle loop.
>
> So I'd much prefer a background "scrubd" or similar.

The current patch triggers a hardware zeroing device to start zeroing when
the buddy allocator generates a free pages over a certain order.  The
order can be configured in /proc/sys/vm/zero_order. This has just minimum
impact on the system (main impact is that the buddy allocator can no
longer merge locked free pages) apart from the system bus.

We could change the idle zeroing to a scrubd which would be
controlled by a similar mechanism. Set the minimal order to be scrubbed in
/proc/sys/vm/scrub_order?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (6 preceding siblings ...)
  2004-12-16  0:46 ` Christoph Lameter
@ 2004-12-16  0:50 ` Nick Piggin
  2004-12-16  0:54 ` Christoph Lameter
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2004-12-16  0:50 UTC (permalink / raw)
  To: linux-ia64

On Wed, 2004-12-15 at 16:41 -0800, Linus Torvalds wrote:
> 
> On Thu, 16 Dec 2004, Nick Piggin wrote:
> > 
> > I think if there were *no* downsides to this patch, it would
> > obviously make sense - because in that case it doesn't really matter
> > whether the cache is heated by the fault handler or by the app
> > itself... but it does have some downsides (complexity, memory
> > bandwidth, cache degradation in the case of idle zeroing, less
> > predictable performance).
> 
> The big downside I personally fear is that the added activity disturbs
> other processes.
> 
> For example, even outside the obvious bus activity issues on SMP systems, 
> things like idle loop exit latency and even just CPU power usage are 
> things that makes me go "Hmmm.."
> 
> I don't dislike pre-zeroing per se, but I'm pretty strongly of the opinion
> that it should be done not by the idle loop or any other very low-level
> system process. I'd personally much prefer it donne at a somewhat higher
> level, so that users can control it as part of system management, and so 
> that it does not impact the _real_ idle loop.
> 
> So I'd much prefer a background "scrubd" or similar. 
> 

I agree.

Christoph, just another thing now that I glance at the patch - the
page locking and deferred coalescing seems to be fairly ugly...

Would it be possible to just *allocate* the page, zero it, then
free it, and have it get put onto the zero page list from there?
If you conceptually think of a page being zeroed as being in use,
then it would be natural to allocate it...?





^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (7 preceding siblings ...)
  2004-12-16  0:50 ` Nick Piggin
@ 2004-12-16  0:54 ` Christoph Lameter
  2004-12-16  1:18 ` Linus Torvalds
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-16  0:54 UTC (permalink / raw)
  To: linux-ia64

On Thu, 16 Dec 2004, Nick Piggin wrote:

> Christoph, just another thing now that I glance at the patch - the
> page locking and deferred coalescing seems to be fairly ugly...

Right. I tried various other things that were even uglier before that
finally worked.

> Would it be possible to just *allocate* the page, zero it, then
> free it, and have it get put onto the zero page list from there?
> If you conceptually think of a page being zeroed as being in use,
> then it would be natural to allocate it...?

I tried that. This increases the overhead significantly. Also if you put
back a zeroed page and its going to be merged with its neighbor (likely
not zeroed) then the zero bit has to be dropped. Its then getting a bit
difficult to get zeroed pages into the buddy allocator.

One could set the zero bit on each individual page but then that increases
again overhead of managing them and checking them.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (8 preceding siblings ...)
  2004-12-16  0:54 ` Christoph Lameter
@ 2004-12-16  1:18 ` Linus Torvalds
  2004-12-16  1:44 ` Christoph Lameter
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-12-16  1:18 UTC (permalink / raw)
  To: linux-ia64



On Wed, 15 Dec 2004, Christoph Lameter wrote:
> 
> The current patch triggers a hardware zeroing device to start zeroing when
> the buddy allocator generates a free pages over a certain order.  The
> order can be configured in /proc/sys/vm/zero_order. This has just minimum
> impact on the system (main impact is that the buddy allocator can no
> longer merge locked free pages) apart from the system bus.
> 
> We could change the idle zeroing to a scrubd which would be
> controlled by a similar mechanism. Set the minimal order to be scrubbed in
> /proc/sys/vm/scrub_order?

Yes. Along with some minimal system inactivity threshold or similar, along
with a way to trigger it actively. For example, there might be loads that
just _know_ that they are doing to do a lot of this, and having a way for
them to say "hey, I'll need a ton of pre-zeroed pages" just sounds like a
good idea if we are going to do this at all.

It's easy to only scrub pages above a certain order, by just looking at
the buddy order lists. No locking even required, ie you could do something 
like

	for (order = min_order ; order < MAX_ORDER; order++) {
		struct free_area *area = zone->free_area + order;
		struct list_head *n = area->free_list.prev;

		/* This is opportunistic, with no locking */
		barrier();
		if (n = area->free_list)
			continue;
		p = list_entry(n, struct page, lru);
		/* Already pre-zeroed? */
		if (p->flags & PAGE_PREZERO)
			continue;

		/* Ok, found a possible candidate, now we need to be more careful */
		spin_lock_irqsave(&zone->lock, flags);
		.. re-do the tests,
		   remove the page from the list,
		   drop the lock,
		   clear the page,
		   mark it zeroed,
		   get the lock, 
		   put the page at the head of the list 
		...

No?

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (9 preceding siblings ...)
  2004-12-16  1:18 ` Linus Torvalds
@ 2004-12-16  1:44 ` Christoph Lameter
  2004-12-16  1:55 ` Linus Torvalds
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-16  1:44 UTC (permalink / raw)
  To: linux-ia64

On Wed, 15 Dec 2004, Linus Torvalds wrote:

> It's easy to only scrub pages above a certain order, by just looking at
> the buddy order lists. No locking even required, ie you could do something
> like
>
> 	for (order = min_order ; order < MAX_ORDER; order++) {
> 		struct free_area *area = zone->free_area + order;
> 		struct list_head *n = area->free_list.prev;
>
> 		/* This is opportunistic, with no locking */
> 		barrier();
> 		if (n = area->free_list)
> 			continue;
> 		p = list_entry(n, struct page, lru);
> 		/* Already pre-zeroed? */
> 		if (p->flags & PAGE_PREZERO)
> 			continue;
>
> 		/* Ok, found a possible candidate, now we need to be more careful */
> 		spin_lock_irqsave(&zone->lock, flags);
> 		.. re-do the tests,
> 		   remove the page from the list,
> 		   drop the lock,
> 		   clear the page,
> 		   mark it zeroed,
> 		   get the lock,
> 		   put the page at the head of the list
> 		...
>
> No?

Yes that is what my patch does. I am missing the barrier though. Why would
that be needed?



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (10 preceding siblings ...)
  2004-12-16  1:44 ` Christoph Lameter
@ 2004-12-16  1:55 ` Linus Torvalds
  2004-12-16  2:17 ` Nick Piggin
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-12-16  1:55 UTC (permalink / raw)
  To: linux-ia64



On Wed, 15 Dec 2004, Christoph Lameter wrote:
> 
> Yes that is what my patch does. I am missing the barrier though. Why would
> that be needed?

I just wanted to make sure that the compiler didn't end up re-loading the 
"next" ptr from memory if it had register pressure.

Think of it as another way to say that because we're doing that
speculative thing, we must consider "area->free_list.prev" to be a
volatile access.

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (11 preceding siblings ...)
  2004-12-16  1:55 ` Linus Torvalds
@ 2004-12-16  2:17 ` Nick Piggin
  2004-12-16  7:59 ` Nick Piggin
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2004-12-16  2:17 UTC (permalink / raw)
  To: linux-ia64

On Wed, 2004-12-15 at 16:41 -0800, Christoph Lameter wrote:
> On Thu, 16 Dec 2004, Nick Piggin wrote:

> > Also just a stupid question - would an madvise(..., MADV_PREFAULT)
> > be of use to you? Or is that too difficult to get a good NUMA
> > allocation layout?
> 
> Its easy to implement. The current code already checks for MADV_RAND and
> switches off prefaulting for that case.

Oh yeah that could be useful too (ie. pre-enlarging the prefault window)

What I had meant though is: MADV_PREFAULT will allocate pages and
instantiate ptes and allocate pages for the region specified. So your
large app would call that on startup and not take any faults in future.

The problem I was imagining is that the allocations will all come from
the caller with regard to NUMA behaviour... so the app would need to
be aware of that (and eg. have each thread call MADV_PREFAULT on their
own working set).

Nick

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (12 preceding siblings ...)
  2004-12-16  2:17 ` Nick Piggin
@ 2004-12-16  7:59 ` Nick Piggin
  2004-12-16 16:27 ` Christoph Lameter
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2004-12-16  7:59 UTC (permalink / raw)
  To: linux-ia64

Christoph Lameter wrote:
> On Thu, 16 Dec 2004, Nick Piggin wrote:
> 
>>Would it be possible to just *allocate* the page, zero it, then
>>free it, and have it get put onto the zero page list from there?
>>If you conceptually think of a page being zeroed as being in use,
>>then it would be natural to allocate it...?
> 
> 
> I tried that. This increases the overhead significantly. Also if you put
> back a zeroed page and its going to be merged with its neighbor (likely
> not zeroed) then the zero bit has to be dropped. Its then getting a bit
> difficult to get zeroed pages into the buddy allocator.
> 
> One could set the zero bit on each individual page but then that increases
> again overhead of managing them and checking them.
> 

Hmm, I was thinking 'free' the zeroed page stright onto the pcp zero list.
This probably doesn't work well, though, if you want to zero a large
proportion of system memory...


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (13 preceding siblings ...)
  2004-12-16  7:59 ` Nick Piggin
@ 2004-12-16 16:27 ` Christoph Lameter
  2004-12-16 18:38 ` Luck, Tony
  2004-12-16 22:37 ` Nick Piggin
  16 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-16 16:27 UTC (permalink / raw)
  To: linux-ia64

On Thu, 16 Dec 2004, Nick Piggin wrote:

> > One could set the zero bit on each individual page but then that increases
> > again overhead of managing them and checking them.
> Hmm, I was thinking 'free' the zeroed page stright onto the pcp zero list.
> This probably doesn't work well, though, if you want to zero a large
> proportion of system memory...

Thats an interesting idea for order 0 pages that would avoid the merge in
issue with the buddy allocator.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (14 preceding siblings ...)
  2004-12-16 16:27 ` Christoph Lameter
@ 2004-12-16 18:38 ` Luck, Tony
  2004-12-16 22:37 ` Nick Piggin
  16 siblings, 0 replies; 99+ messages in thread
From: Luck, Tony @ 2004-12-16 18:38 UTC (permalink / raw)
  To: linux-ia64

>> > Also just a stupid question - would an madvise(..., MADV_PREFAULT)
>> > be of use to you? Or is that too difficult to get a good NUMA
>> > allocation layout?
>> 
>> Its easy to implement. The current code already checks for MADV_RAND and
>> switches off prefaulting for that case.
>
>Oh yeah that could be useful too (ie. pre-enlarging the 
>prefault window)
>
>What I had meant though is: MADV_PREFAULT will allocate pages and
>instantiate ptes and allocate pages for the region specified. So your
>large app would call that on startup and not take any faults in future.

How does that differ from (the already existing) MADV_WILLNEED?

There is also a MADV_SEQUENTIAL ... perhaps that could be used to
kick the prefaulter into higher gear (perhaps go directly to order
2, or more, allocation, instead of ramping up slowly).

-Tony

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [very very drafty] prezeroing to increase the page fault rate
  2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
                   ` (15 preceding siblings ...)
  2004-12-16 18:38 ` Luck, Tony
@ 2004-12-16 22:37 ` Nick Piggin
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
  16 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2004-12-16 22:37 UTC (permalink / raw)
  To: linux-ia64

Luck, Tony wrote:
>>>>Also just a stupid question - would an madvise(..., MADV_PREFAULT)
>>>>be of use to you? Or is that too difficult to get a good NUMA
>>>>allocation layout?
>>>
>>>Its easy to implement. The current code already checks for MADV_RAND and
>>>switches off prefaulting for that case.
>>
>>Oh yeah that could be useful too (ie. pre-enlarging the 
>>prefault window)
>>
>>What I had meant though is: MADV_PREFAULT will allocate pages and
>>instantiate ptes and allocate pages for the region specified. So your
>>large app would call that on startup and not take any faults in future.
> 
> 
> How does that differ from (the already existing) MADV_WILLNEED?
> 

MADV_WILLNEED populates the pagecache, but AFAIKS doesn't actually
set up any ptes in the mapping. So for anonymous memory, I don't
think it would help.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Increase page fault rate by prezeroing V1 [0/3]: Overview
  2004-12-16 22:37 ` Nick Piggin
@ 2004-12-21 19:55   ` Christoph Lameter
  2004-12-21 19:56     ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
                       ` (4 more replies)
  0 siblings, 5 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

The patches increasing the page fault rate (introduction of atomic pte operations
and anticipatory prefaulting) do so by reducing the locking overhead and are
therefore mainly of interest for applications running in SMP systems with a high
number of cpus. The single thread performance does just show minor increases.
Only the performance of multi-threaded applications increase significantly.

The most expensive operation in the page fault handler is (apart of SMP
locking overhead) the zeroing of the page that is also done in the page fault
handler. Others have seen this too and have tried provide a way to provide
zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t\x109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t\x109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m\x104931944213955&w=2

The problem so far has been that simple zeroing of pages simply shifts
the time spend somewhere else. Plus one would not want to zero hot
pages.

This patch addresses those issues by making it more effective to zero pages by:

1. Aggregating zeroing operations to mainly apply to larger order pages
which results in many later order 0 pages to be zeroed in one go.
For that purpose a new achitecture specific function zero_page(page, order)
is introduced.

2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.

The result is a significant increase of the page fault performance even for
single threaded applications:

w/o patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
   4   3    1    0.146s     11.155s  11.030s 69584.896  69566.852

w/patch
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
   1   1    1    0.014s      0.110s   0.012s524292.194 517665.538

This is a performance increase by a factor 8!

The performance can only be upheld if enough zeroed pages are available.
In a heavy memory intensive benchmark the system will run out of these very
fast but the efficient algorithm for page zeroing still makes this a winner
(8 way system with 6 GB RAM, no hardware zeroing support):

w/o patch:

Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 4   3    1    0.146s     11.155s  11.030s 69584.896  69566.852
 4   3    2    0.170s     14.909s   7.097s 52150.369  98643.687
 4   3    4    0.181s     16.597s   5.079s 46869.167 135642.420
 4   3    8    0.166s     23.239s   4.037s 33599.215 179791.120

w/patch
Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 4   3    1    0.183s      2.750s   2.093s268077.996 267952.890
 4   3    2    0.185s      4.876s   2.097s155344.562 263967.292
 4   3    4    0.150s      6.617s   2.097s116205.793 264774.080
 4   3    8    0.186s     13.693s   3.054s 56659.819 221701.073

The patch is composed of 3 parts:

[1/3] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO flag
	and returns zeroed memory on request. Modifies locations throughout
	the linux sources that retrieve a page and then zeroe it to request
	a zeroed page.
	Adds new low level zero_page functions for i386, ia64 and x86_64.
	(x64_64 untested)

[2/3] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background daemon
	called scrubd. scrubd is disable by default but can be enabled
	by writing an order number to /proc/sys/vm/scrub_start. If a page
	is coalesced of that order then the scrub daemon will start zeroing
	until all pages of order /proc/sys/vm/scrub_stop and higher are
	zeroed.

[3/3]	SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into hardware.
	With hardware support there will be minimal impact of zeroing
	on the performance of the system.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
@ 2004-12-21 19:56     ` Christoph Lameter
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
to request zeroed pages from the page allocator.

- Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set

- Replace all page zeroing after allocating pages by request for
  zeroed pages.

- Add an arch specific call zero_page to clear pages greater than
  order 0 and a fallback to repeated calles to clear_page if an
  architecture does not support zero_page(address, order) yet.

- Add ia64 zero_page function
- Add i386 zero_page function
- Add x86_64 zero_page function (untested, unverified)

Index: linux-2.6.9/mm/page_alloc.c
=================================--- linux-2.6.9.orig/mm/page_alloc.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c	2004-12-21 10:19:37.000000000 -0800
@@ -575,6 +575,18 @@
 		BUG_ON(bad_range(zone, page));
 		mod_page_state_zone(zone, pgalloc, 1 << order);
 		prep_new_page(page, order);
+
+		if (gfp_flags & __GFP_ZERO) {
+#ifdef CONFIG_HIGHMEM
+			if (PageHighMem(page)) {
+				int n = 1 << order;
+
+				while (n-- >0)
+					clear_highpage(page + n);
+			} else
+#endif
+			zero_page(page_address(page), order);
+		}
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -767,12 +779,9 @@
 	 */
 	BUG_ON(gfp_mask & __GFP_HIGHMEM);

-	page = alloc_pages(gfp_mask, 0);
-	if (page) {
-		void *address = page_address(page);
-		clear_page(address);
-		return (unsigned long) address;
-	}
+	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+	if (page)
+		return (unsigned long) page_address(page);
 	return 0;
 }

Index: linux-2.6.9/include/linux/gfp.h
=================================--- linux-2.6.9.orig/include/linux/gfp.h	2004-10-18 14:53:44.000000000 -0700
+++ linux-2.6.9/include/linux/gfp.h	2004-12-21 10:19:37.000000000 -0800
@@ -37,6 +37,7 @@
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_ZERO	0x8000	/* Return zeroed page on success */

 #define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.9/mm/memory.c
=================================--- linux-2.6.9.orig/mm/memory.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/mm/memory.c	2004-12-21 10:19:37.000000000 -0800
@@ -1445,10 +1445,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.9/kernel/profile.c
=================================--- linux-2.6.9.orig/kernel/profile.c	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/kernel/profile.c	2004-12-21 10:19:37.000000000 -0800
@@ -326,17 +326,15 @@
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				return NOTIFY_BAD;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				goto out_free;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
 		}
 		break;
@@ -510,16 +508,14 @@
 		int node = cpu_to_node(cpu);
 		struct page *page;

-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[0]
 				= (struct profile_hit *)page_address(page);
 	}
Index: linux-2.6.9/mm/shmem.c
=================================--- linux-2.6.9.orig/mm/shmem.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/mm/shmem.c	2004-12-21 10:19:37.000000000 -0800
@@ -369,9 +369,8 @@
 		}

 		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
 		if (page) {
-			clear_highpage(page);
 			page->nr_swapped = 0;
 		}
 		spin_lock(&info->lock);
@@ -910,7 +909,7 @@
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -926,7 +925,7 @@
 shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
 				 unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(gfp | __GFP_ZERO);
 }
 #endif

@@ -1135,7 +1134,6 @@

 		info->alloced++;
 		spin_unlock(&info->lock);
-		clear_highpage(filepage);
 		flush_dcache_page(filepage);
 		SetPageUptodate(filepage);
 	}
Index: linux-2.6.9/mm/hugetlb.c
=================================--- linux-2.6.9.orig/mm/hugetlb.c	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c	2004-12-21 10:19:37.000000000 -0800
@@ -77,7 +77,6 @@
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -88,8 +87,7 @@
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	zero_page(page_address(page), HUGETLB_PAGE_ORDER);
 	return page;
 }

Index: linux-2.6.9/arch/ia64/lib/Makefile
=================================--- linux-2.6.9.orig/arch/ia64/lib/Makefile	2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/arch/ia64/lib/Makefile	2004-12-21 10:19:37.000000000 -0800
@@ -6,7 +6,7 @@

 lib-y := __divsi3.o __udivsi3.o __modsi3.o __umodsi3.o			\
 	__divdi3.o __udivdi3.o __moddi3.o __umoddi3.o			\
-	bitop.o checksum.o clear_page.o csum_partial_copy.o copy_page.o	\
+	bitop.o checksum.o clear_page.o zero_page.o csum_partial_copy.o copy_page.o	\
 	clear_user.o strncpy_from_user.o strlen_user.o strnlen_user.o	\
 	flush.o ip_fast_csum.o do_csum.o				\
 	memset.o strlen.o swiotlb.o
Index: linux-2.6.9/include/asm-ia64/page.h
=================================--- linux-2.6.9.orig/include/asm-ia64/page.h	2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -57,6 +57,8 @@
 #  define STRICT_MM_TYPECHECKS

 extern void clear_page (void *page);
+extern void zero_page (void *page, int order);
+
 extern void copy_page (void *to, void *from);

 /*
Index: linux-2.6.9/include/asm-ia64/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h	2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h	2004-12-21 10:19:37.000000000 -0800
@@ -61,9 +61,7 @@
 	pgd_t *pgd = pgd_alloc_one_fast(mm);

 	if (unlikely(pgd = NULL)) {
-		pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-		if (likely(pgd != NULL))
-			clear_page(pgd);
+		pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	}
 	return pgd;
 }
@@ -107,10 +105,8 @@
 static inline pmd_t*
 pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pmd != NULL))
-		clear_page(pmd);
 	return pmd;
 }

@@ -141,20 +137,16 @@
 static inline struct page *
 pte_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

-	if (likely(pte != NULL))
-		clear_page(page_address(pte));
 	return pte;
 }

 static inline pte_t *
 pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pte != NULL))
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.9/arch/ia64/lib/zero_page.S
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/arch/ia64/lib/zero_page.S	2004-12-21 10:19:37.000000000 -0800
@@ -0,0 +1,84 @@
+/*
+ * Copyright (C) 1999-2002 Hewlett-Packard Co
+ *	Stephane Eranian <eranian@hpl.hp.com>
+ *	David Mosberger-Tang <davidm@hpl.hp.com>
+ * Copyright (C) 2002 Ken Chen <kenneth.w.chen@intel.com>
+ *
+ * 1/06/01 davidm	Tuned for Itanium.
+ * 2/12/02 kchen	Tuned for both Itanium and McKinley
+ * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
+ */
+#include <linux/config.h>
+
+#include <asm/asmmacro.h>
+#include <asm/page.h>
+
+#ifdef CONFIG_ITANIUM
+# define L3_LINE_SIZE	64	// Itanium L3 line size
+# define PREFETCH_LINES	9	// magic number
+#else
+# define L3_LINE_SIZE	128	// McKinley L3 line size
+# define PREFETCH_LINES	12	// magic number
+#endif
+
+#define saved_lc	r2
+#define dst_fetch	r3
+#define dst1		r8
+#define dst2		r9
+#define dst3		r10
+#define dst4		r11
+
+#define dst_last	r31
+#define totsize		r14
+
+GLOBAL_ENTRY(zero_page)
+	.prologue
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
+	.save ar.lc, saved_lc
+	mov saved_lc = ar.lc
+	;;
+	.body
+	adds dst1 = 16, in0
+	mov ar.lc = (PREFETCH_LINES - 1)
+	mov dst_fetch = in0
+	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
+	;;
+.fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
+	adds dst3 = 48, in0		// executing this multiple times is harmless
+	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
+	;;
+	mov ar.lc = r16			// one L3 line per iteration
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
+	;;
+#ifdef CONFIG_ITANIUM
+	// Optimized for Itanium
+1:	stf.spill.nta [dst1] = f0, 64
+	stf.spill.nta [dst2] = f0, 64
+	cmp.lt p8,p0=dst_fetch, dst_last
+	;;
+#else
+	// Optimized for McKinley
+1:	stf.spill.nta [dst1] = f0, 64
+	stf.spill.nta [dst2] = f0, 64
+	stf.spill.nta [dst3] = f0, 64
+	stf.spill.nta [dst4] = f0, 128
+	cmp.lt p8,p0=dst_fetch, dst_last
+	;;
+	stf.spill.nta [dst1] = f0, 64
+	stf.spill.nta [dst2] = f0, 64
+#endif
+	stf.spill.nta [dst3] = f0, 64
+(p8)	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
+	br.cloop.sptk.few 1b
+	;;
+	mov ar.lc = saved_lc		// restore lc
+	br.ret.sptk.many rp
+END(zero_page)
Index: linux-2.6.9/include/asm-i386/page.h
=================================--- linux-2.6.9.orig/include/asm-i386/page.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-i386/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -20,6 +20,7 @@

 #define clear_page(page)	mmx_clear_page((void *)(page))
 #define copy_page(to,from)	mmx_copy_page(to,from)
+#define zero_page(page, order)	mmx_zero_page(page, order)

 #else

@@ -29,6 +30,7 @@
  */

 #define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define zero_page(page, ordeR)	memset((void *)(page), 0, PAGE_SIZE << order)
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif
Index: linux-2.6.9/include/asm-x86_64/page.h
=================================--- linux-2.6.9.orig/include/asm-x86_64/page.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -33,6 +33,7 @@
 #ifndef __ASSEMBLY__

 void clear_page(void *);
+void zero_page(void *, int);
 void copy_page(void *, void *);

 #define clear_user_page(page, vaddr, pg)	clear_page(page)
Index: linux-2.6.9/include/asm-sparc/page.h
=================================--- linux-2.6.9.orig/include/asm-sparc/page.h	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-sparc/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -29,6 +29,7 @@
 #ifndef __ASSEMBLY__

 #define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define zero_page(page,order)	 memset((void *)(page), 0, PAGE_SIZE <<(order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
 	do { 	clear_page(addr);		\
Index: linux-2.6.9/include/asm-s390/page.h
=================================--- linux-2.6.9.orig/include/asm-s390/page.h	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/include/asm-s390/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -33,6 +33,17 @@
 		      : "+&a" (rp) : : "memory", "cc", "1" );
 }

+static inline void zero_page(void *page, int order)
+{
+	register_pair rp;
+
+	rp.subreg.even = (unsigned long) page;
+	rp.subreg.odd = (unsigned long) 4096 << order;
+        asm volatile ("   slr  1,1\n"
+		      "   mvcl %0,0"
+		      : "+&a" (rp) : : "memory", "cc", "1" );
+}
+
 static inline void copy_page(void *to, void *from)
 {
         if (MACHINE_HAS_MVPG)
Index: linux-2.6.9/arch/i386/lib/mmx.c
=================================--- linux-2.6.9.orig/arch/i386/lib/mmx.c	2004-10-18 14:54:23.000000000 -0700
+++ linux-2.6.9/arch/i386/lib/mmx.c	2004-12-21 10:55:00.000000000 -0800
@@ -161,6 +161,39 @@
 	kernel_fpu_end();
 }

+static void fast_zero_page(void *page, int order)
+{
+	int i;
+
+	kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"  pxor %%mm0, %%mm0\n" : :
+	);
+
+	for(i=0;i<((4096/64) << order);i++)
+	{
+		__asm__ __volatile__ (
+		"  movntq %%mm0, (%0)\n"
+		"  movntq %%mm0, 8(%0)\n"
+		"  movntq %%mm0, 16(%0)\n"
+		"  movntq %%mm0, 24(%0)\n"
+		"  movntq %%mm0, 32(%0)\n"
+		"  movntq %%mm0, 40(%0)\n"
+		"  movntq %%mm0, 48(%0)\n"
+		"  movntq %%mm0, 56(%0)\n"
+		: : "r" (page) : "memory");
+		page+d;
+	}
+	/* since movntq is weakly-ordered, a "sfence" is needed to become
+	 * ordered again.
+	 */
+	__asm__ __volatile__ (
+		"  sfence \n" : :
+	);
+	kernel_fpu_end();
+}
+
 static void fast_copy_page(void *to, void *from)
 {
 	int i;
@@ -293,6 +326,42 @@
 	kernel_fpu_end();
 }

+static void fast_zero_page(void *page, int order)
+{
+	int i;
+
+	kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"  pxor %%mm0, %%mm0\n" : :
+	);
+
+	for(i=0;i<((4096/128) << order);i++)
+	{
+		__asm__ __volatile__ (
+		"  movq %%mm0, (%0)\n"
+		"  movq %%mm0, 8(%0)\n"
+		"  movq %%mm0, 16(%0)\n"
+		"  movq %%mm0, 24(%0)\n"
+		"  movq %%mm0, 32(%0)\n"
+		"  movq %%mm0, 40(%0)\n"
+		"  movq %%mm0, 48(%0)\n"
+		"  movq %%mm0, 56(%0)\n"
+		"  movq %%mm0, 64(%0)\n"
+		"  movq %%mm0, 72(%0)\n"
+		"  movq %%mm0, 80(%0)\n"
+		"  movq %%mm0, 88(%0)\n"
+		"  movq %%mm0, 96(%0)\n"
+		"  movq %%mm0, 104(%0)\n"
+		"  movq %%mm0, 112(%0)\n"
+		"  movq %%mm0, 120(%0)\n"
+		: : "r" (page) : "memory");
+		page+\x128;
+	}
+
+	kernel_fpu_end();
+}
+
 static void fast_copy_page(void *to, void *from)
 {
 	int i;
@@ -359,7 +428,7 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
@@ -369,15 +438,34 @@
 		:"a" (0),"1" (page),"0" (1024)
 		:"memory");
 }
+
+static void slow_zero_page(void * page, int order)
+{
+	int d0, d1;
+	__asm__ __volatile__( \
+		"cld\n\t" \
+		"rep ; stosl" \
+		: "=&c" (d0), "=&D" (d1)
+		:"a" (0),"1" (page),"0" (1024 << order)
+		:"memory");
+}

 void mmx_clear_page(void * page)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page);
 	else
 		fast_clear_page(page);
 }

+void mmx_zero_page(void * page, int order)
+{
+	if(unlikely(in_interrupt()))
+		slow_zero_page(page, order);
+	else
+		fast_zero_page(page, order);
+}
+
 static void slow_copy_page(void *to, void *from)
 {
 	int d0, d1, d2;
Index: linux-2.6.9/arch/i386/mm/pgtable.c
=================================--- linux-2.6.9.orig/arch/i386/mm/pgtable.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/i386/mm/pgtable.c	2004-12-21 10:19:37.000000000 -0800
@@ -132,10 +132,7 @@

 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
-	return pte;
+	return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 }

 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -143,12 +140,10 @@
 	struct page *pte;

 #ifdef CONFIG_HIGHPTE
-	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
 #else
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 #endif
-	if (pte)
-		clear_highpage(pte);
 	return pte;
 }

Index: linux-2.6.9/arch/i386/kernel/i386_ksyms.c
=================================--- linux-2.6.9.orig/arch/i386/kernel/i386_ksyms.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/i386_ksyms.c	2004-12-21 10:19:37.000000000 -0800
@@ -126,6 +126,7 @@
 #ifdef CONFIG_X86_USE_3DNOW
 EXPORT_SYMBOL(_mmx_memcpy);
 EXPORT_SYMBOL(mmx_clear_page);
+EXPORT_SYMBOL(mmx_zero_page);
 EXPORT_SYMBOL(mmx_copy_page);
 #endif

Index: linux-2.6.9/drivers/block/pktcdvd.c
=================================--- linux-2.6.9.orig/drivers/block/pktcdvd.c	2004-12-17 14:40:12.000000000 -0800
+++ linux-2.6.9/drivers/block/pktcdvd.c	2004-12-21 10:19:37.000000000 -0800
@@ -125,22 +125,19 @@
 	int i;
 	struct packet_data *pkt;

-	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
+	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO);
 	if (!pkt)
 		goto no_pkt;
-	memset(pkt, 0, sizeof(struct packet_data));

 	pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE);
 	if (!pkt->w_bio)
 		goto no_bio;

 	for (i = 0; i < PAGES_PER_PACKET; i++) {
-		pkt->pages[i] = alloc_page(GFP_KERNEL);
+		pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (!pkt->pages[i])
 			goto no_page;
 	}
-	for (i = 0; i < PAGES_PER_PACKET; i++)
-		clear_page(page_address(pkt->pages[i]));

 	spin_lock_init(&pkt->lock);

Index: linux-2.6.9/arch/x86_64/lib/zero_page.S
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/arch/x86_64/lib/zero_page.S	2004-12-21 10:19:37.000000000 -0800
@@ -0,0 +1,52 @@
+/*
+ * Zero a page.
+ * rdi	page
+ */
+	.globl zero_page
+	.p2align 4
+zero_page:
+	xorl   %eax,%eax
+	movl   $4096/64,%ecx
+	shl	%ecx, %esi
+	.p2align 4
+.Lloop:
+	decl	%ecx
+#define PUT(x) movq %rax,x*8(%rdi)
+	movq %rax,(%rdi)
+	PUT(1)
+	PUT(2)
+	PUT(3)
+	PUT(4)
+	PUT(5)
+	PUT(6)
+	PUT(7)
+	leaq	64(%rdi),%rdi
+	jnz	.Lloop
+	nop
+	ret
+zero_page_end:
+
+	/* C stepping K8 run faster using the string instructions.
+	   It is also a lot simpler. Use this when possible */
+
+#include <asm/cpufeature.h>
+
+	.section .altinstructions,"a"
+	.align 8
+	.quad  zero_page
+	.quad  zero_page_c
+	.byte  X86_FEATURE_K8_C
+	.byte  zero_page_end-clear_page
+	.byte  zero_page_c_end-clear_page_c
+	.previous
+
+	.section .altinstr_replacement,"ax"
+zero_page_c:
+	movl $4096/8,%ecx
+	shl	%ecx, %esi
+	xorl %eax,%eax
+	rep
+	stosq
+	ret
+zero_page_c_end:
+	.previous
Index: linux-2.6.9/arch/x86_64/lib/Makefile
=================================--- linux-2.6.9.orig/arch/x86_64/lib/Makefile	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/arch/x86_64/lib/Makefile	2004-12-21 10:19:37.000000000 -0800
@@ -7,7 +7,7 @@
 obj-y := io.o

 lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
-	usercopy.o getuser.o putuser.o  \
+	usercopy.o getuser.o putuser.o zero_page.S \
 	thunk.o clear_page.o copy_page.o bitstr.o bitops.o
 lib-y += memcpy.o memmove.o memset.o copy_user.o

Index: linux-2.6.9/include/asm-x86_64/mmx.h
=================================--- linux-2.6.9.orig/include/asm-x86_64/mmx.h	2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/mmx.h	2004-12-21 10:19:37.000000000 -0800
@@ -9,6 +9,7 @@

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
 extern void mmx_clear_page(void *page);
+extern void mmx_zero_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.9/arch/x86_64/kernel/x8664_ksyms.c
=================================--- linux-2.6.9.orig/arch/x86_64/kernel/x8664_ksyms.c	2004-12-17 14:40:11.000000000 -0800
+++ linux-2.6.9/arch/x86_64/kernel/x8664_ksyms.c	2004-12-21 10:19:37.000000000 -0800
@@ -110,6 +110,7 @@

 EXPORT_SYMBOL(copy_page);
 EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(zero_page);

 EXPORT_SYMBOL(cpu_pda);
 #ifdef CONFIG_SMP


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
  2004-12-21 19:56     ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
@ 2004-12-21 19:57     ` Christoph Lameter
  2005-01-01  2:22       ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and Nick Piggin
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Christoph Lameter
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

o Add page zeroing
o Add scrub daemon
o Add ability to view amount of zeroed information in /proc/meminfo

Index: linux-2.6.9/mm/page_alloc.c
=================================--- linux-2.6.9.orig/mm/page_alloc.c	2004-12-21 10:19:37.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c	2004-12-21 11:01:40.000000000 -0800
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Support for page zeroing, Christoph Lameter, SGI, Dec 2004
  */

 #include <linux/config.h>
@@ -32,6 +33,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>

@@ -179,7 +181,7 @@
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
+static inline int __free_pages_bulk (struct page *page, struct page *base,
 		struct zone *zone, struct free_area *area, unsigned int order)
 {
 	unsigned long page_idx, index, mask;
@@ -192,11 +194,10 @@
 		BUG();
 	index = page_idx >> (1 + order);

-	zone->free_pages += 1 << order;
 	while (order < MAX_ORDER-1) {
 		struct page *buddy1, *buddy2;

-		BUG_ON(area >= zone->free_area + MAX_ORDER);
+		BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER);
 		if (!__test_and_change_bit(index, area->map))
 			/*
 			 * the buddy page is still allocated.
@@ -216,6 +217,7 @@
 		page_idx &= mask;
 	}
 	list_add(&(base + page_idx)->lru, &area->free_list);
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -258,7 +260,7 @@
 	int ret = 0;

 	base = zone->zone_mem_map;
-	area = zone->free_area + order;
+	area = zone->free_area[NOT_ZEROED] + order;
 	spin_lock_irqsave(&zone->lock, flags);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
@@ -266,7 +268,10 @@
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
-		__free_pages_bulk(page, base, zone, area, order);
+		zone->free_pages += 1 << order;
+		if (__free_pages_bulk(page, base, zone, area, order)
+			>= sysctl_scrub_start)
+				wakeup_kscrubd(zone);
 		ret++;
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -288,6 +293,21 @@
 	free_pages_bulk(page_zone(page), 1, &list, order);
 }

+void end_zero_page(struct page *page)
+{
+	unsigned long flags;
+	int order = page->index;
+	struct zone * zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	zone->zero_pages += 1 << order;
+	__free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+
 #define MARK_USED(index, order, area) \
 	__change_bit((index) >> (1+(order)), (area)->map)

@@ -366,25 +386,46 @@
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static void inline rmpage(struct page *page, struct zone *zone, struct free_area *area, int order)
+{
+	list_del(&page->lru);
+	if (order != MAX_ORDER-1)
+		MARK_USED(page - zone->zone_mem_map, order, area);
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order)
+{
+	unsigned long flags;
+	struct page *page = NULL;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!list_empty(&area->free_list)) {
+		page = list_entry(area->free_list.next, struct page, lru);
+
+		rmpage(page, zone, area, order);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
-	unsigned int index;

 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = zone->free_area[zero] + current_order;
 		if (list_empty(&area->free_list))
 			continue;

 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		index = page - zone->zone_mem_map;
-		if (current_order != MAX_ORDER-1)
-			MARK_USED(index, current_order, area);
+		rmpage(page, zone, area, current_order);
 		zone->free_pages -= 1UL << order;
-		return expand(zone, page, index, order, current_order, area);
+		if (zero)
+			zone->zero_pages -= 1UL << order;
+		return expand(zone, page, page - zone->zone_mem_map, order, current_order, area);
 	}

 	return NULL;
@@ -396,7 +437,7 @@
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int zero)
 {
 	unsigned long flags;
 	int i;
@@ -405,7 +446,7 @@

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
 		if (page = NULL)
 			break;
 		allocated++;
@@ -546,7 +587,9 @@
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int nr_pages = 1 << order;
+	int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;

 	if (order = 0) {
 		struct per_cpu_pages *pcp;
@@ -555,7 +598,7 @@
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, zero);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -567,19 +610,30 @@

 	if (page = NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+
+		page = __rmqueue(zone, order, zero);
+
+		/*
+		 * If we failed to obtain a zero and/or unzeroed page
+		 * then we may still be able to obtain the other
+		 * type of page.
+		 */
+		if (!page) {
+			page = __rmqueue(zone, order, !zero);
+			zero = 0;
+		}
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
-		mod_page_state_zone(zone, pgalloc, 1 << order);
-		prep_new_page(page, order);
+		mod_page_state_zone(zone, pgalloc, nr_pages);

-		if (gfp_flags & __GFP_ZERO) {
+		if ((gfp_flags & __GFP_ZERO) && !zero) {
 #ifdef CONFIG_HIGHMEM
 			if (PageHighMem(page)) {
-				int n = 1 << order;
+				int n = nr_pages;

 				while (n-- >0)
 					clear_highpage(page + n);
@@ -587,6 +641,7 @@
 #endif
 			zero_page(page_address(page), order);
 		}
+		prep_new_page(page, order);
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -974,7 +1029,7 @@
 }

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
@@ -982,27 +1037,31 @@
 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
 		*free += zones[i].free_pages;
+		*zero += zones[i].zero_pages;
 	}
 }

 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *free, unsigned long *zero)
 {
 	struct pglist_data *pgdat;

 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n,o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
 		*free += n;
+		*zero += o;
 	}
 }

@@ -1039,6 +1098,7 @@

 #define K(x) ((x) << (PAGE_SHIFT-10))

+const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1051,6 +1111,7 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;
 	struct zone *zone;

 	for_each_zone(zone) {
@@ -1071,10 +1132,10 @@

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1082,20 +1143,21 @@
 	}

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 	printk("\nFree pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));

 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
 		nr_free_pages(),
+		zero,
 		ps.nr_slab,
 		ps.nr_mapped,
 		ps.nr_page_table_pages);
@@ -1146,7 +1208,7 @@
 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
 			nr = 0;
-			list_for_each(elem, &zone->free_area[order].free_list)
+			list_for_each(elem, &zone->free_area[NOT_ZEROED][order].free_list)
 				++nr;
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
@@ -1470,14 +1532,18 @@
 	for (order = 0; ; order++) {
 		unsigned long bitmap_size;

-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
 		if (order = MAX_ORDER-1) {
-			zone->free_area[order].map = NULL;
+			zone->free_area[NOT_ZEROED][order].map = NULL;
+			zone->free_area[ZEROED][order].map = NULL;
 			break;
 		}

 		bitmap_size = pages_to_bitmap_size(order, size);
-		zone->free_area[order].map +		zone->free_area[NOT_ZEROED][order].map +		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+		zone->free_area[ZEROED][order].map  		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
 	}
 }
@@ -1503,6 +1569,7 @@

 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kscrubd_wait);

 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
@@ -1525,6 +1592,7 @@
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->zero_pages = 0;

 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

@@ -1558,6 +1626,13 @@
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
@@ -1687,7 +1762,7 @@
 			unsigned long nr_bufs = 0;
 			struct list_head *elem;

-			list_for_each(elem, &(zone->free_area[order].free_list))
+			list_for_each(elem, &(zone->free_area[NOT_ZEROED][order].free_list))
 				++nr_bufs;
 			seq_printf(m, "%6lu ", nr_bufs);
 		}
Index: linux-2.6.9/include/linux/mmzone.h
=================================--- linux-2.6.9.orig/include/linux/mmzone.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/mmzone.h	2004-12-21 11:01:15.000000000 -0800
@@ -51,7 +51,7 @@
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -107,10 +107,14 @@
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */

+#define NOT_ZEROED 0
+#define ZEROED 1
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		zero_pages;
 	/*
 	 * protection[] is a pre-calculated number of extra pages that must be
 	 * available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area[2][MAX_ORDER];


 	ZONE_PADDING(_pad1_)
@@ -265,6 +269,9 @@
 	struct pglist_data *pgdat_next;
 	wait_queue_head_t       kswapd_wait;
 	struct task_struct *kswapd;
+
+	wait_queue_head_t       kscrubd_wait;
+	struct task_struct *kscrubd;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@
 extern struct pglist_data *pgdat_list;

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+			unsigned long *free, unsigned long *zero);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone);

Index: linux-2.6.9/fs/proc/proc_misc.c
=================================--- linux-2.6.9.orig/fs/proc/proc_misc.c	2004-12-17 14:40:15.000000000 -0800
+++ linux-2.6.9/fs/proc/proc_misc.c	2004-12-21 11:01:15.000000000 -0800
@@ -158,13 +158,14 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;
 	unsigned long vmtot;
 	unsigned long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 /*
  * display in kilobytes.
@@ -187,6 +188,7 @@
 	len = sprintf(page,
 		"MemTotal:     %8lu kB\n"
 		"MemFree:      %8lu kB\n"
+		"MemZero:      %8lu kB\n"
 		"Buffers:      %8lu kB\n"
 		"Cached:       %8lu kB\n"
 		"SwapCached:   %8lu kB\n"
@@ -210,6 +212,7 @@
 		"VmallocChunk: %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
+		K(zero),
 		K(i.bufferram),
 		K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
 		K(total_swapcache_pages),
Index: linux-2.6.9/mm/readahead.c
=================================--- linux-2.6.9.orig/mm/readahead.c	2004-10-18 14:53:11.000000000 -0700
+++ linux-2.6.9/mm/readahead.c	2004-12-21 11:01:15.000000000 -0800
@@ -570,7 +570,8 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;

-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
Index: linux-2.6.9/drivers/base/node.c
=================================--- linux-2.6.9.orig/drivers/base/node.c	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/drivers/base/node.c	2004-12-21 11:01:15.000000000 -0800
@@ -41,13 +41,15 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;

 	si_meminfo_node(&i, nid);
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));

 	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
+		       "Node %d MemZero:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
@@ -57,6 +59,7 @@
 		       "Node %d LowFree:      %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
+		       nid, K(zero),
 		       nid, K(i.totalram - i.freeram),
 		       nid, K(active),
 		       nid, K(inactive),
Index: linux-2.6.9/include/linux/sched.h
=================================--- linux-2.6.9.orig/include/linux/sched.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h	2004-12-21 11:01:15.000000000 -0800
@@ -715,6 +715,7 @@
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.9/mm/Makefile
=================================--- linux-2.6.9.orig/mm/Makefile	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/Makefile	2004-12-21 11:01:15.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o prio_tree.o \
Index: linux-2.6.9/mm/scrubd.c
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/mm/scrubd.c	2004-12-21 11:01:15.000000000 -0800
@@ -0,0 +1,148 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = MAX_ORDER;		/* Off */
+unsigned int sysctl_scrub_stop = 2;	/* Mininum order of page to zero */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area, order);
+			struct list_head *l;
+
+			if (!page)
+				continue;
+
+			page->index = order;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+				unsigned long size = PAGE_SIZE << order;
+
+				if (driver->start(page_address(page), size) = 0) {
+
+					unsigned ticks = (size*HZ)/driver->rate;
+					if (ticks) {
+						/* Wait the minimum time of the transfer */
+						current->state = TASK_INTERRUPTIBLE;
+						schedule_timeout(ticks);
+					}
+					/* Then keep on checking until transfer is complete */
+					while (!driver->check())
+						schedule();
+					goto out;
+				}
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			zero_page(page_address(page), order);
+out:
+			end_zero_page(page);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd
+		= find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.9/include/linux/scrub.h
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/include/linux/scrub.h	2004-12-21 11:01:15.000000000 -0800
@@ -0,0 +1,48 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may that allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+        int (*start)(void *, unsigned length);		/* Start bzero transfer */
+	int (*check)(void);				/* Check if bzero is complete */
+	int rate;					/* bzero rate in MB/sec */
+        struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static void inline wakeup_kscrubd(struct zone *zone)
+{
+        if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+                return;
+        wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+				      void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
Index: linux-2.6.9/kernel/sysctl.c
=================================--- linux-2.6.9.orig/kernel/sysctl.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/kernel/sysctl.c	2004-12-21 11:01:15.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -816,6 +817,24 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.9/include/linux/sysctl.h
=================================--- linux-2.6.9.orig/include/linux/sysctl.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sysctl.h	2004-12-21 11:01:15.000000000 -0800
@@ -168,6 +168,8 @@
 	VM_VFS_CACHE_PRESSURE&, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT', /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT(, /* default time for token time out */
+	VM_SCRUB_START0,	/* percentage * 10 at which to start scrubd */
+	VM_SCRUB_STOP1,	/* percentage * 10 at which to stop scrubd */
 };




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
  2004-12-21 19:56     ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
@ 2004-12-21 19:57     ` Christoph Lameter
  2004-12-22 12:46       ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Robin Holt
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-24 18:31     ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
  4 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

o Use the Block Transfer Engine in the Altix SN2 SHub for background zeroing

Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
=================================--- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/bte.c	2004-12-21 11:03:49.000000000 -0800
@@ -4,6 +4,8 @@
  * for more details.
  *
  * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
  */

 #include <linux/config.h>
@@ -20,6 +22,8 @@
 #include <linux/bootmem.h>
 #include <linux/string.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>

 #include <asm/sn/bte.h>

@@ -30,7 +34,11 @@
 /* two interfaces on two btes */
 #define MAX_INTERFACES_TO_TRY		4

-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+DEFINE_PER_CPU(u64 *, bte_zero_notify);
+
+#define bte_zero_notify __get_cpu_var(bte_zero_notify)
+
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
 {
 	nodepda_t *tmp_nodepda;

@@ -132,7 +140,6 @@
 			if (bte = NULL) {
 				continue;
 			}
-
 			if (spin_trylock(&bte->spinlock)) {
 				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
 				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +164,7 @@
 		}
 	} while (1);

-	if (notification = NULL) {
+	if (notification = NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
 		/* User does not want to be notified. */
 		bte->most_rcnt_na = &bte->notify;
 	} else {
@@ -192,6 +199,8 @@

 	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);

+	if (mode & BTE_NOTIFY_AND_GET_POINTER)
+		 *(u64 volatile **)(notification) = &bte->notify;
 	spin_unlock_irqrestore(&bte->spinlock, irq_flags);

 	if (notification != NULL) {
@@ -449,5 +458,31 @@
 		mynodepda->bte_if[i].cleanup_active = 0;
 		mynodepda->bte_if[i].bh_error = 0;
 	}
+}
+
+static int bte_check_bzero(void)
+{
+	return *bte_zero_notify != BTE_WORD_BUSY;
+}
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+	/* Check limitations.
+		1. System must be running (weird things happen during bootup)
+		2. Size >64KB. Smaller requests cause too much bte traffic
+	 */
+	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+		return EINVAL;
+
+	return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, &bte_zero_notify);
+}
+
+static struct zero_driver bte_bzero = {
+	.start = bte_start_bzero,
+	.check = bte_check_bzero,
+	.rate = 500000000		/* 500 MB /sec */
+};

+void sn_bte_bzero_init(void) {
+	register_zero_driver(&bte_bzero);
 }
Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
=================================--- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/setup.c	2004-12-21 11:02:35.000000000 -0800
@@ -243,6 +243,7 @@
 	int pxm;
 	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
 	extern void sn_cpu_init(void);
+	extern void sn_bte_bzero_init(void);

 	/*
 	 * If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
 	screen_info = sn_screen_info;

 	sn_timer_init();
+	sn_bte_bzero_init();
 }

 /**
Index: linux-2.6.9/include/asm-ia64/sn/bte.h
=================================--- linux-2.6.9.orig/include/asm-ia64/sn/bte.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/sn/bte.h	2004-12-21 11:02:35.000000000 -0800
@@ -48,6 +48,8 @@
 #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
 /* Use a reserved bit to let the caller specify a wait for any BTE */
 #define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
 /* Use the BTE on the node with the destination memory */
 #define BTE_USE_DEST (BTE_WACQUIRE << 1)
 /* Use any available BTE interface on any node for the transfer */


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Christoph Lameter
@ 2004-12-22 12:46       ` Robin Holt
  2004-12-22 19:56         ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: Robin Holt @ 2004-12-22 12:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
	torvalds, linux-mm, linux-kernel

We still need to talk.  This is a much smaller patch, which I like.  The
problem I see in my 30 second review is you are doing things per-cpu
when they really need to be done per-node.  It is very likely that
there will be M-Bricks in the system (cranberry2 has one if you want
to test your code out there or you can take any altix and disable the
cpus on a C-Brick).  With M-Bricks, you will essentially limit
yourself to one zero operation per controlling node instead of one
per node.

I think the easy answer is to not have the structure allocated
within bte_copy(), but rather within bte_start_zero and passed
in as the notification address.

Give me a call sometime today (Wed. I am in the office from about
10:00 CDT until around 4:00 CDT)  Maybe we can get this straightened
out quickly.  If you are not calling from the office, email me with
other arrangements.

Thanks,
Robin

On Tue, Dec 21, 2004 at 11:57:57AM -0800, Christoph Lameter wrote:
> o Use the Block Transfer Engine in the Altix SN2 SHub for background zeroing
> 
> Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
> =================================> --- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c	2004-12-17 14:40:10.000000000 -0800
> +++ linux-2.6.9/arch/ia64/sn/kernel/bte.c	2004-12-21 11:03:49.000000000 -0800
> @@ -4,6 +4,8 @@
>   * for more details.
>   *
>   * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
> + *
> + * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
>   */
> 
>  #include <linux/config.h>
> @@ -20,6 +22,8 @@
>  #include <linux/bootmem.h>
>  #include <linux/string.h>
>  #include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/scrub.h>
> 
>  #include <asm/sn/bte.h>
> 
> @@ -30,7 +34,11 @@
>  /* two interfaces on two btes */
>  #define MAX_INTERFACES_TO_TRY		4
> 
> -static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
> +DEFINE_PER_CPU(u64 *, bte_zero_notify);
> +
> +#define bte_zero_notify __get_cpu_var(bte_zero_notify)
> +
> +static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
>  {
>  	nodepda_t *tmp_nodepda;
> 
> @@ -132,7 +140,6 @@
>  			if (bte = NULL) {
>  				continue;
>  			}
> -
>  			if (spin_trylock(&bte->spinlock)) {
>  				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
>  				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
> @@ -157,7 +164,7 @@
>  		}
>  	} while (1);
> 
> -	if (notification = NULL) {
> +	if (notification = NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
>  		/* User does not want to be notified. */
>  		bte->most_rcnt_na = &bte->notify;
>  	} else {
> @@ -192,6 +199,8 @@
> 
>  	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);
> 
> +	if (mode & BTE_NOTIFY_AND_GET_POINTER)
> +		 *(u64 volatile **)(notification) = &bte->notify;
>  	spin_unlock_irqrestore(&bte->spinlock, irq_flags);
> 
>  	if (notification != NULL) {
> @@ -449,5 +458,31 @@
>  		mynodepda->bte_if[i].cleanup_active = 0;
>  		mynodepda->bte_if[i].bh_error = 0;
>  	}
> +}
> +
> +static int bte_check_bzero(void)
> +{
> +	return *bte_zero_notify != BTE_WORD_BUSY;
> +}
> +
> +static int bte_start_bzero(void *p, unsigned long len)
> +{
> +	/* Check limitations.
> +		1. System must be running (weird things happen during bootup)
> +		2. Size >64KB. Smaller requests cause too much bte traffic
> +	 */
> +	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
> +		return EINVAL;
> +
> +	return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, &bte_zero_notify);
> +}
> +
> +static struct zero_driver bte_bzero = {
> +	.start = bte_start_bzero,
> +	.check = bte_check_bzero,
> +	.rate = 500000000		/* 500 MB /sec */
> +};
> 
> +void sn_bte_bzero_init(void) {
> +	register_zero_driver(&bte_bzero);
>  }
> Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
> =================================> --- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c	2004-12-17 14:40:10.000000000 -0800
> +++ linux-2.6.9/arch/ia64/sn/kernel/setup.c	2004-12-21 11:02:35.000000000 -0800
> @@ -243,6 +243,7 @@
>  	int pxm;
>  	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
>  	extern void sn_cpu_init(void);
> +	extern void sn_bte_bzero_init(void);
> 
>  	/*
>  	 * If the generic code has enabled vga console support - lets
> @@ -333,6 +334,7 @@
>  	screen_info = sn_screen_info;
> 
>  	sn_timer_init();
> +	sn_bte_bzero_init();
>  }
> 
>  /**
> Index: linux-2.6.9/include/asm-ia64/sn/bte.h
> =================================> --- linux-2.6.9.orig/include/asm-ia64/sn/bte.h	2004-12-17 14:40:16.000000000 -0800
> +++ linux-2.6.9/include/asm-ia64/sn/bte.h	2004-12-21 11:02:35.000000000 -0800
> @@ -48,6 +48,8 @@
>  #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
>  /* Use a reserved bit to let the caller specify a wait for any BTE */
>  #define BTE_WACQUIRE (0x4000)
> +/* Return the pointer to the notification cacheline to the user */
> +#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
>  /* Use the BTE on the node with the destination memory */
>  #define BTE_USE_DEST (BTE_WACQUIRE << 1)
>  /* Use any available BTE interface on any node for the transfer */

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE
  2004-12-22 12:46       ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Robin Holt
@ 2004-12-22 19:56         ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-22 19:56 UTC (permalink / raw)
  To: Robin Holt
  Cc: Nick Piggin, Luck, Tony, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

I have done some additional tests with a 128 cpu SMP machine and it shows
that the bte slows things down during memory benchmarking by about 10-20%
although its causing less load when the system is not under high stress.

So its not always win and I may drop bte support in a future version. Can
we talk off list about this since this is mostly an SGI thing?

On Wed, 22 Dec 2004, Robin Holt wrote:

> We still need to talk.  This is a much smaller patch, which I like.  The
> problem I see in my 30 second review is you are doing things per-cpu
> when they really need to be done per-node.  It is very likely that
> there will be M-Bricks in the system (cranberry2 has one if you want
> to test your code out there or you can take any altix and disable the
> cpus on a C-Brick).  With M-Bricks, you will essentially limit
> yourself to one zero operation per controlling node instead of one
> per node.
>
> I think the easy answer is to not have the structure allocated
> within bte_copy(), but rather within bte_start_zero and passed
> in as the notification address.
>
> Give me a call sometime today (Wed. I am in the office from about
> 10:00 CDT until around 4:00 CDT)  Maybe we can get this straightened
> out quickly.  If you are not calling from the office, email me with
> other arrangements.
>
> Thanks,
> Robin
>
> On Tue, Dec 21, 2004 at 11:57:57AM -0800, Christoph Lameter wrote:
> > o Use the Block Transfer Engine in the Altix SN2 SHub for background zeroing
> >
> > Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
> > =================================> > --- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c	2004-12-17 14:40:10.000000000 -0800
> > +++ linux-2.6.9/arch/ia64/sn/kernel/bte.c	2004-12-21 11:03:49.000000000 -0800
> > @@ -4,6 +4,8 @@
> >   * for more details.
> >   *
> >   * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
> > + *
> > + * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
> >   */
> >
> >  #include <linux/config.h>
> > @@ -20,6 +22,8 @@
> >  #include <linux/bootmem.h>
> >  #include <linux/string.h>
> >  #include <linux/sched.h>
> > +#include <linux/mm.h>
> > +#include <linux/scrub.h>
> >
> >  #include <asm/sn/bte.h>
> >
> > @@ -30,7 +34,11 @@
> >  /* two interfaces on two btes */
> >  #define MAX_INTERFACES_TO_TRY		4
> >
> > -static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
> > +DEFINE_PER_CPU(u64 *, bte_zero_notify);
> > +
> > +#define bte_zero_notify __get_cpu_var(bte_zero_notify)
> > +
> > +static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
> >  {
> >  	nodepda_t *tmp_nodepda;
> >
> > @@ -132,7 +140,6 @@
> >  			if (bte = NULL) {
> >  				continue;
> >  			}
> > -
> >  			if (spin_trylock(&bte->spinlock)) {
> >  				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
> >  				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
> > @@ -157,7 +164,7 @@
> >  		}
> >  	} while (1);
> >
> > -	if (notification = NULL) {
> > +	if (notification = NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
> >  		/* User does not want to be notified. */
> >  		bte->most_rcnt_na = &bte->notify;
> >  	} else {
> > @@ -192,6 +199,8 @@
> >
> >  	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);
> >
> > +	if (mode & BTE_NOTIFY_AND_GET_POINTER)
> > +		 *(u64 volatile **)(notification) = &bte->notify;
> >  	spin_unlock_irqrestore(&bte->spinlock, irq_flags);
> >
> >  	if (notification != NULL) {
> > @@ -449,5 +458,31 @@
> >  		mynodepda->bte_if[i].cleanup_active = 0;
> >  		mynodepda->bte_if[i].bh_error = 0;
> >  	}
> > +}
> > +
> > +static int bte_check_bzero(void)
> > +{
> > +	return *bte_zero_notify != BTE_WORD_BUSY;
> > +}
> > +
> > +static int bte_start_bzero(void *p, unsigned long len)
> > +{
> > +	/* Check limitations.
> > +		1. System must be running (weird things happen during bootup)
> > +		2. Size >64KB. Smaller requests cause too much bte traffic
> > +	 */
> > +	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
> > +		return EINVAL;
> > +
> > +	return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, &bte_zero_notify);
> > +}
> > +
> > +static struct zero_driver bte_bzero = {
> > +	.start = bte_start_bzero,
> > +	.check = bte_check_bzero,
> > +	.rate = 500000000		/* 500 MB /sec */
> > +};
> >
> > +void sn_bte_bzero_init(void) {
> > +	register_zero_driver(&bte_bzero);
> >  }
> > Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
> > =================================> > --- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c	2004-12-17 14:40:10.000000000 -0800
> > +++ linux-2.6.9/arch/ia64/sn/kernel/setup.c	2004-12-21 11:02:35.000000000 -0800
> > @@ -243,6 +243,7 @@
> >  	int pxm;
> >  	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
> >  	extern void sn_cpu_init(void);
> > +	extern void sn_bte_bzero_init(void);
> >
> >  	/*
> >  	 * If the generic code has enabled vga console support - lets
> > @@ -333,6 +334,7 @@
> >  	screen_info = sn_screen_info;
> >
> >  	sn_timer_init();
> > +	sn_bte_bzero_init();
> >  }
> >
> >  /**
> > Index: linux-2.6.9/include/asm-ia64/sn/bte.h
> > =================================> > --- linux-2.6.9.orig/include/asm-ia64/sn/bte.h	2004-12-17 14:40:16.000000000 -0800
> > +++ linux-2.6.9/include/asm-ia64/sn/bte.h	2004-12-21 11:02:35.000000000 -0800
> > @@ -48,6 +48,8 @@
> >  #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
> >  /* Use a reserved bit to let the caller specify a wait for any BTE */
> >  #define BTE_WACQUIRE (0x4000)
> > +/* Return the pointer to the notification cacheline to the user */
> > +#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
> >  /* Use the BTE on the node with the destination memory */
> >  #define BTE_USE_DEST (BTE_WACQUIRE << 1)
> >  /* Use any available BTE interface on any node for the transfer */
>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V2 [0/3]: Why and When it works
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
                       ` (2 preceding siblings ...)
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Christoph Lameter
@ 2004-12-23 19:29     ` Christoph Lameter
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
                         ` (4 more replies)
  2004-12-24 18:31     ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
  4 siblings, 5 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:29 UTC (permalink / raw)
  Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Change from V1 to V2:
o Add explanation--and some bench results--as to why and when this optimization works
  and why other approaches have not worked.
o Instead of zero_page(p,order) extend clear_page to take second argument
o Update all architectures to accept second argument for clear_pages
o Extensive removal of all page allocs/clear_page combination from all archs
o Blank / typo fixups
o SGI BTE zero driver update: Use node specific variables instead of cpu specific
  since a cpu may be responsible for multiple nodes.

The patches increasing the page fault rate (introduction of atomic pte operations
and anticipatory prefaulting) do so by reducing the locking overhead and are
therefore mainly of interest for applications running in SMP systems with a high
number of cpus. The single thread performance does just show minor increases.
Only the performance of multi-threaded applications increase significantly.

The most expensive operation in the page fault handler is (apart of SMP
locking overhead) the zeroing of the page. This zeroing means that all
cachelines of the faulted page (on Altix that means all 128 cachelines of
128 byte each) must be loaded and later written back. This patch allows to
avoid having to load all cachelines if only a part of the cachelines of
that page is needed immediately after the fault.

Thus the patch will only be effective for sparsely accessed memory which
is typicalfor anonymous memory and pte maps. Prezeroed pages will be used
for those purposes. Unzeroed pages will be used as usual for the other
purposes.

Others have also thought that prezeroing could be a benefit and have tried
provide a way to provide zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t\x109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t\x109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m\x104931944213955&w=2

However, these attempt have tried to zero pages soon to be
accessed (and which may already have recently been accessed). Elements of
these pages are thus already in the cache. Approaches like that will only
shift processing a bit and not yield performance benefits.
Prezeroing only makes sense for pages that are not currently needed and
that are not in the cpu caches. Pages that have recently been touched and
that soon will be touched again are better hot zeroed since the zeroing
will largely be done to cachelines already in the cpu caches.

The patch makes prezeroing very effective by:

1. Aggregating zeroing operations to only apply to pages of higher order,
which results in many pages that will later become order 0 to be
zeroed in one go. For that purpose the existing clear_page function is
extended and made to take an additional argument specifying the order of
the page to be cleared.

2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.

The result is a significant increase of the page fault performance even for
single threaded applications:

w/o patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
   4   3    1    0.146s     11.155s  11.030s 69584.896  69566.852

w/patch
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
   1   1    1    0.014s      0.110s   0.012s524292.194 517665.538

The performance can only be upheld if enough zeroed pages are available.
In a heavy memory intensive benchmarks the system could potentially
run out of zeroed pages but the efficient algorithm for page zeroing still
shows this to be a winner:

(8 way system with 6 GB RAM, no hardware zeroing support)

w/o patch:
Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 4   3    1    0.146s     11.155s  11.030s 69584.896  69566.852
 4   3    2    0.170s     14.909s   7.097s 52150.369  98643.687
 4   3    4    0.181s     16.597s   5.079s 46869.167 135642.420
 4   3    8    0.166s     23.239s   4.037s 33599.215 179791.120

w/patch
Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 4   3    1    0.183s      2.750s   2.093s268077.996 267952.890
 4   3    2    0.185s      4.876s   2.097s155344.562 263967.292
 4   3    4    0.150s      6.617s   2.097s116205.793 264774.080
 4   3    8    0.186s     13.693s   3.054s 56659.819 221701.073

Note that zeroing of pages makes no sense if the application
touches all cache lines of a page allocated (there is no influence of
prezeroing on benchmarks like lmbench for that reason) since the extensive
caching of modern cpus means that the zeroes written to a hot zeroed page
will then be overwritten by the application in the cpu cache and thus
the zeros will never make it to memory! The test program used above only
touches one 128 byte cache line of a 16k page (ia64).

Here is another test in order to gauge the influence of the number of cache
lines touched on the performance of the prezero enhancements:

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  1    1   1    0.01s      0.12s   0.01s500813.853 497925.891
  1  1    1   2    0.01s      0.11s   0.01s493453.103 472877.725
  1  1    1   4    0.02s      0.10s   0.01s479351.658 471507.415
  1  1    1   8    0.01s      0.13s   0.01s424742.054 416725.013
  1  1    1  16    0.05s      0.12s   0.01s347715.359 336983.834
  1  1    1  32    0.12s      0.13s   0.02s258112.286 256246.731
  1  1    1  64    0.24s      0.14s   0.03s169896.381 168189.283
  1  1    1 128    0.49s      0.14s   0.06s102300.257 101674.435

The benefits of prezeroing become smaller the more cache lines of
a page are touched. Prezeroing can only be effective if memory is not
immediately touched after the anonymous page fault.

The patch is composed of 4 parts:

[1/4] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO flag
	and returns zeroed memory on request. Modifies locations throughout
	the linux sources that retrieve a page and then zero it to request
	a zeroed page.

[2/4] Architecture specific clear_page updates
	Adds second order argument to clear_page and updates all arches.

Note: The two first pages may be used alone if no zeroing engine is wanted.

[3/4] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background daemon
	called scrubd. scrubd is disabled by default but can be enabled
	by writing an order number to /proc/sys/vm/scrub_start. If a page
	is coalesced of that order or higher then the scrub daemon will
	start zeroing until all pages of order /proc/sys/vm/scrub_stop and
	higher are zeroed and then go back to sleep.

	In an SMP environment the scrub daemon is typically
	running on the most idle cpu. Thus a single threaded application running
	on one cpu may have the other cpu zeroing pages for it etc. The scrub
	daemon is hardly noticable and usually finished zeroing quickly since most
	processors are optimized for linear memory filling.

[4/4]	SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into hardware.
	With hardware support there will be minimal impact of zeroing
	on the performance of the system.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
@ 2004-12-23 19:33       ` Christoph Lameter
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all Christoph Lameter
                           ` (3 more replies)
  2004-12-23 19:49       ` Prezeroing V2 [0/3]: Why and When it works Arjan van de Ven
                         ` (3 subsequent siblings)
  4 siblings, 4 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:33 UTC (permalink / raw)
  To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
to request zeroed pages from the page allocator.

o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set

o Replace all page zeroing after allocating pages by request for
  zeroed pages.

o requires arch updates to clear_page in order to function properly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/mm/page_alloc.c
=================================--- linux-2.6.9.orig/mm/page_alloc.c	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c	2004-12-22 17:23:43.000000000 -0800
@@ -575,6 +575,18 @@
 		BUG_ON(bad_range(zone, page));
 		mod_page_state_zone(zone, pgalloc, 1 << order);
 		prep_new_page(page, order);
+
+		if (gfp_flags & __GFP_ZERO) {
+#ifdef CONFIG_HIGHMEM
+			if (PageHighMem(page)) {
+				int n = 1 << order;
+
+				while (n-- >0)
+					clear_highpage(page + n);
+			} else
+#endif
+			clear_page(page_address(page), order);
+		}
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -767,12 +779,9 @@
 	 */
 	BUG_ON(gfp_mask & __GFP_HIGHMEM);

-	page = alloc_pages(gfp_mask, 0);
-	if (page) {
-		void *address = page_address(page);
-		clear_page(address);
-		return (unsigned long) address;
-	}
+	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+	if (page)
+		return (unsigned long) page_address(page);
 	return 0;
 }

Index: linux-2.6.9/include/linux/gfp.h
=================================--- linux-2.6.9.orig/include/linux/gfp.h	2004-10-18 14:53:44.000000000 -0700
+++ linux-2.6.9/include/linux/gfp.h	2004-12-22 17:23:43.000000000 -0800
@@ -37,6 +37,7 @@
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_ZERO	0x8000	/* Return zeroed page on success */

 #define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.9/mm/memory.c
=================================--- linux-2.6.9.orig/mm/memory.c	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/mm/memory.c	2004-12-22 17:23:43.000000000 -0800
@@ -1445,10 +1445,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.9/kernel/profile.c
=================================--- linux-2.6.9.orig/kernel/profile.c	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/kernel/profile.c	2004-12-22 17:23:43.000000000 -0800
@@ -326,17 +326,15 @@
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				return NOTIFY_BAD;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				goto out_free;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
 		}
 		break;
@@ -510,16 +508,14 @@
 		int node = cpu_to_node(cpu);
 		struct page *page;

-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[0]
 				= (struct profile_hit *)page_address(page);
 	}
Index: linux-2.6.9/mm/shmem.c
=================================--- linux-2.6.9.orig/mm/shmem.c	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/mm/shmem.c	2004-12-22 17:23:43.000000000 -0800
@@ -369,9 +369,8 @@
 		}

 		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
 		if (page) {
-			clear_highpage(page);
 			page->nr_swapped = 0;
 		}
 		spin_lock(&info->lock);
@@ -910,7 +909,7 @@
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -926,7 +925,7 @@
 shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
 				 unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(gfp | __GFP_ZERO);
 }
 #endif

@@ -1135,7 +1134,6 @@

 		info->alloced++;
 		spin_unlock(&info->lock);
-		clear_highpage(filepage);
 		flush_dcache_page(filepage);
 		SetPageUptodate(filepage);
 	}
Index: linux-2.6.9/mm/hugetlb.c
=================================--- linux-2.6.9.orig/mm/hugetlb.c	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c	2004-12-22 17:23:43.000000000 -0800
@@ -77,7 +77,6 @@
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -88,8 +87,7 @@
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	clear_page(page_address(page), HUGETLB_PAGE_ORDER);
 	return page;
 }

Index: linux-2.6.9/include/asm-ia64/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h	2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -61,9 +61,7 @@
 	pgd_t *pgd = pgd_alloc_one_fast(mm);

 	if (unlikely(pgd = NULL)) {
-		pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-		if (likely(pgd != NULL))
-			clear_page(pgd);
+		pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	}
 	return pgd;
 }
@@ -107,10 +105,8 @@
 static inline pmd_t*
 pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pmd != NULL))
-		clear_page(pmd);
 	return pmd;
 }

@@ -141,20 +137,16 @@
 static inline struct page *
 pte_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

-	if (likely(pte != NULL))
-		clear_page(page_address(pte));
 	return pte;
 }

 static inline pte_t *
 pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pte != NULL))
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.9/arch/i386/mm/pgtable.c
=================================--- linux-2.6.9.orig/arch/i386/mm/pgtable.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/i386/mm/pgtable.c	2004-12-22 17:23:43.000000000 -0800
@@ -132,10 +132,7 @@

 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
-	return pte;
+	return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 }

 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -143,12 +140,10 @@
 	struct page *pte;

 #ifdef CONFIG_HIGHPTE
-	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
 #else
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 #endif
-	if (pte)
-		clear_highpage(pte);
 	return pte;
 }

Index: linux-2.6.9/drivers/block/pktcdvd.c
=================================--- linux-2.6.9.orig/drivers/block/pktcdvd.c	2004-12-22 16:48:15.000000000 -0800
+++ linux-2.6.9/drivers/block/pktcdvd.c	2004-12-22 17:23:43.000000000 -0800
@@ -125,22 +125,19 @@
 	int i;
 	struct packet_data *pkt;

-	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
+	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO);
 	if (!pkt)
 		goto no_pkt;
-	memset(pkt, 0, sizeof(struct packet_data));

 	pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE);
 	if (!pkt->w_bio)
 		goto no_bio;

 	for (i = 0; i < PAGES_PER_PACKET; i++) {
-		pkt->pages[i] = alloc_page(GFP_KERNEL);
+		pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (!pkt->pages[i])
 			goto no_page;
 	}
-	for (i = 0; i < PAGES_PER_PACKET; i++)
-		clear_page(page_address(pkt->pages[i]));

 	spin_lock_init(&pkt->lock);

Index: linux-2.6.9/arch/m68k/mm/motorola.c
=================================--- linux-2.6.9.orig/arch/m68k/mm/motorola.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/m68k/mm/motorola.c	2004-12-22 17:23:43.000000000 -0800
@@ -1,4 +1,4 @@
-/*
+*
  * linux/arch/m68k/motorola.c
  *
  * Routines specific to the Motorola MMU, originally from:
@@ -50,7 +50,7 @@

 	ptablep = (pte_t *)alloc_bootmem_low_pages(PAGE_SIZE);

-	clear_page(ptablep);
+	clear_page(ptablep, 0);
 	__flush_page_to_ram(ptablep);
 	flush_tlb_kernel_page(ptablep);
 	nocache_page(ptablep);
@@ -90,7 +90,7 @@
 	if (((unsigned long)last_pgtable & ~PAGE_MASK) = 0) {
 		last_pgtable = (pmd_t *)alloc_bootmem_low_pages(PAGE_SIZE);

-		clear_page(last_pgtable);
+		clear_page(last_pgtable, 0);
 		__flush_page_to_ram(last_pgtable);
 		flush_tlb_kernel_page(last_pgtable);
 		nocache_page(last_pgtable);
Index: linux-2.6.9/include/asm-mips/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-mips/pgalloc.h	2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-mips/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -56,9 +56,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);

 	return pte;
 }
Index: linux-2.6.9/arch/alpha/mm/init.c
=================================--- linux-2.6.9.orig/arch/alpha/mm/init.c	2004-10-18 14:55:07.000000000 -0700
+++ linux-2.6.9/arch/alpha/mm/init.c	2004-12-22 17:23:43.000000000 -0800
@@ -42,10 +42,9 @@
 {
 	pgd_t *ret, *init;

-	ret = (pgd_t *)__get_free_page(GFP_KERNEL);
+	ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	init = pgd_offset(&init_mm, 0UL);
 	if (ret) {
-		clear_page(ret);
 #ifdef CONFIG_ALPHA_LARGE_VMALLOC
 		memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
 			(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
@@ -63,9 +62,7 @@
 pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.9/include/asm-parisc/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-parisc/pgalloc.h	2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/include/asm-parisc/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -120,18 +120,14 @@
 static inline struct page *
 pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(page != NULL))
-		clear_page(page_address(page));
+	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return page;
 }

 static inline pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(pte != NULL))
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.9/arch/sh/mm/pg-sh4.c
=================================--- linux-2.6.9.orig/arch/sh/mm/pg-sh4.c	2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/pg-sh4.c	2004-12-22 17:23:43.000000000 -0800
@@ -34,7 +34,7 @@
 {
 	__set_bit(PG_mapped, &page->flags);
 	if (((address ^ (unsigned long)to) & CACHE_ALIAS) = 0)
-		clear_page(to);
+		clear_page(to, 0);
 	else {
 		pgprot_t pgprot = __pgprot(_PAGE_PRESENT |
 					   _PAGE_RW | _PAGE_CACHABLE |
Index: linux-2.6.9/include/asm-sparc64/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-sparc64/pgalloc.h	2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/include/asm-sparc64/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -73,10 +73,9 @@
 		struct page *page;

 		preempt_enable();
-		page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+		page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (page) {
 			ret = (struct page *)page_address(page);
-			clear_page(ret);
 			page->lru.prev = (void *) 2UL;

 			preempt_disable();
Index: linux-2.6.9/include/asm-sh/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-sh/pgalloc.h	2004-10-18 14:54:08.000000000 -0700
+++ linux-2.6.9/include/asm-sh/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -44,9 +44,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);

 	return pte;
 }
@@ -56,9 +54,7 @@
 {
 	struct page *pte;

-   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
Index: linux-2.6.9/include/asm-m32r/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-m32r/pgalloc.h	2004-10-18 14:55:07.000000000 -0700
+++ linux-2.6.9/include/asm-m32r/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -23,10 +23,7 @@
  */
 static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
-	if (pgd)
-		clear_page(pgd);
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pgd;
 }
@@ -39,10 +36,7 @@
 static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
-
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pte;
 }
@@ -50,10 +44,8 @@
 static __inline__ struct page *pte_alloc_one(struct mm_struct *mm,
 	unsigned long address)
 {
-	struct page *pte = alloc_page(GFP_KERNEL);
+	struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO);

-	if (pte)
-		clear_page(page_address(pte));

 	return pte;
 }
Index: linux-2.6.9/arch/um/kernel/mem.c
=================================--- linux-2.6.9.orig/arch/um/kernel/mem.c	2004-10-18 14:53:51.000000000 -0700
+++ linux-2.6.9/arch/um/kernel/mem.c	2004-12-22 17:23:43.000000000 -0800
@@ -307,9 +307,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

@@ -317,9 +315,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_highpage(pte);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.9/arch/ppc64/mm/init.c
=================================--- linux-2.6.9.orig/arch/ppc64/mm/init.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/ppc64/mm/init.c	2004-12-22 17:23:43.000000000 -0800
@@ -761,7 +761,7 @@

 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
-	clear_page(page);
+	clear_page(page, 0);

 	if (cur_cpu_spec->cpu_features & CPU_FTR_COHERENT_ICACHE)
 		return;
Index: linux-2.6.9/include/asm-sh64/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-sh64/pgalloc.h	2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-sh64/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -112,9 +112,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO);

 	return pte;
 }
@@ -123,9 +121,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
@@ -150,9 +146,7 @@
 static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
-	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pmd)
-		clear_page(pmd);
+	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pmd;
 }

Index: linux-2.6.9/include/asm-cris/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-cris/pgalloc.h	2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/include/asm-cris/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -24,18 +24,14 @@

 extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
  	return pte;
 }

 extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.9/arch/ppc/mm/pgtable.c
=================================--- linux-2.6.9.orig/arch/ppc/mm/pgtable.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/ppc/mm/pgtable.c	2004-12-22 17:23:43.000000000 -0800
@@ -85,8 +85,7 @@
 {
 	pgd_t *ret;

-	if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL)
-		clear_pages(ret, PGDIR_ORDER);
+	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
 	return ret;
 }

@@ -102,7 +101,7 @@
 	extern void *early_get_page(void);

 	if (mem_init_done) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (pte) {
 			struct page *ptepage = virt_to_page(pte);
 			ptepage->mapping = (void *) mm;
@@ -110,8 +109,6 @@
 		}
 	} else
 		pte = (pte_t *)early_get_page();
-	if (pte)
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.9/arch/ppc/mm/init.c
=================================--- linux-2.6.9.orig/arch/ppc/mm/init.c	2004-10-18 14:53:43.000000000 -0700
+++ linux-2.6.9/arch/ppc/mm/init.c	2004-12-22 17:23:43.000000000 -0800
@@ -595,7 +595,7 @@
 }
 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
-	clear_page(page);
+	clear_page(page, 0);
 	clear_bit(PG_arch_1, &pg->flags);
 }

Index: linux-2.6.9/fs/afs/file.c
=================================--- linux-2.6.9.orig/fs/afs/file.c	2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/fs/afs/file.c	2004-12-22 17:23:43.000000000 -0800
@@ -172,7 +172,7 @@
 				      (size_t) PAGE_SIZE);
 		desc.buffer	= kmap(page);

-		clear_page(desc.buffer);
+		clear_page(desc.buffer, 0);

 		/* read the contents of the file from the server into the
 		 * page */
Index: linux-2.6.9/include/asm-alpha/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-alpha/pgalloc.h	2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-alpha/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -40,9 +40,7 @@
 static inline pmd_t *
 pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (ret)
-		clear_page(ret);
+	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return ret;
 }

Index: linux-2.6.9/include/linux/highmem.h
=================================--- linux-2.6.9.orig/include/linux/highmem.h	2004-10-18 14:54:54.000000000 -0700
+++ linux-2.6.9/include/linux/highmem.h	2004-12-22 17:23:43.000000000 -0800
@@ -47,7 +47,7 @@
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }

Index: linux-2.6.9/arch/sh64/mm/ioremap.c
=================================--- linux-2.6.9.orig/arch/sh64/mm/ioremap.c	2004-10-18 14:54:32.000000000 -0700
+++ linux-2.6.9/arch/sh64/mm/ioremap.c	2004-12-22 17:23:43.000000000 -0800
@@ -399,7 +399,7 @@
 	if (pte_none(*ptep) || !pte_present(*ptep))
 		return;

-	clear_page((void *)ptep);
+	clear_page((void *)ptep, 0);
 	pte_clear(ptep);
 }

Index: linux-2.6.9/include/asm-m68k/motorola_pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-m68k/motorola_pgalloc.h	2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/include/asm-m68k/motorola_pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -12,9 +12,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
@@ -31,7 +30,7 @@

 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	pte_t *pte;

 	if(!page)
@@ -39,7 +38,6 @@

 	pte = kmap(page);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
Index: linux-2.6.9/arch/sh/mm/pg-sh7705.c
=================================--- linux-2.6.9.orig/arch/sh/mm/pg-sh7705.c	2004-12-22 16:48:15.000000000 -0800
+++ linux-2.6.9/arch/sh/mm/pg-sh7705.c	2004-12-22 17:23:43.000000000 -0800
@@ -78,13 +78,13 @@

 	__set_bit(PG_mapped, &page->flags);
 	if (((address ^ (unsigned long)to) & CACHE_ALIAS) = 0) {
-		clear_page(to);
+		clear_page(to, 0);
 		__flush_wback_region(to, PAGE_SIZE);
 	} else {
 		__flush_purge_virtual_region(to,
 					     (void *)(address & 0xfffff000),
 					     PAGE_SIZE);
-		clear_page(to);
+		clear_page(to, 0);
 		__flush_wback_region(to, PAGE_SIZE);
 	}
 }
Index: linux-2.6.9/arch/sparc64/mm/init.c
=================================--- linux-2.6.9.orig/arch/sparc64/mm/init.c	2004-12-22 16:48:15.000000000 -0800
+++ linux-2.6.9/arch/sparc64/mm/init.c	2004-12-22 17:23:43.000000000 -0800
@@ -1687,13 +1687,12 @@
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
 	 */
-	mem_map_zero = alloc_pages(GFP_KERNEL, 0);
+	mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (mem_map_zero = NULL) {
 		prom_printf("paging_init: Cannot alloc zero page.\n");
 		prom_halt();
 	}
 	SetPageReserved(mem_map_zero);
-	clear_page(page_address(mem_map_zero));

 	codepages = (((unsigned long) _etext) - ((unsigned long) _start));
 	codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT;
Index: linux-2.6.9/include/asm-arm/pgalloc.h
=================================--- linux-2.6.9.orig/include/asm-arm/pgalloc.h	2004-10-18 14:55:27.000000000 -0700
+++ linux-2.6.9/include/asm-arm/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -50,9 +50,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
 		pte += PTRS_PER_PTE;
 	}
@@ -65,10 +64,9 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	if (pte) {
 		void *page = page_address(pte);
-		clear_page(page);
 		clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
 	}




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V2 [2/4]: add second parameter to clear_page() for all
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
@ 2004-12-23 19:33         ` Christoph Lameter
  2004-12-24  8:33           ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Pavel Machek
                             ` (2 more replies)
  2004-12-23 19:34         ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter
                           ` (2 subsequent siblings)
  3 siblings, 3 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:33 UTC (permalink / raw)
  To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

o Extend clear_page to take an order parameter for all architectures.

Known to work:

ia64
i386

Trivial modification expected to simply work:

arm
cris
h8300
m68k
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modification made but it would be good to have some feedback from the arch maintainers:

x86_64
s390
alpha
sparc64
sh
mips
m32r

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/include/asm-ia64/page.h
=================================--- linux-2.6.9.orig/include/asm-ia64/page.h	2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.9/include/asm-i386/page.h
=================================--- linux-2.6.9.orig/include/asm-i386/page.h	2004-12-22 16:48:19.000000000 -0800
+++ linux-2.6.9/include/asm-i386/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page),order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-x86_64/page.h
=================================--- linux-2.6.9.orig/include/asm-x86_64/page.h	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-sparc/page.h
=================================--- linux-2.6.9.orig/include/asm-sparc/page.h	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-sparc/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.9/include/asm-s390/page.h
=================================--- linux-2.6.9.orig/include/asm-s390/page.h	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/include/asm-s390/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /* Pure 2^n version of get_order */
Index: linux-2.6.9/arch/i386/lib/mmx.c
=================================--- linux-2.6.9.orig/arch/i386/lib/mmx.c	2004-10-18 14:54:23.000000000 -0700
+++ linux-2.6.9/arch/i386/lib/mmx.c	2004-12-23 07:44:14.000000000 -0800
@@ -128,7 +128,7 @@
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.9/include/asm-x86_64/mmx.h
=================================--- linux-2.6.9.orig/include/asm-x86_64/mmx.h	2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/mmx.h	2004-12-23 07:44:14.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.9/arch/ia64/lib/clear_page.S
=================================--- linux-2.6.9.orig/arch/ia64/lib/clear_page.S	2004-10-18 14:53:10.000000000 -0700
+++ linux-2.6.9/arch/ia64/lib/clear_page.S	2004-12-23 07:44:14.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.9/arch/x86_64/lib/clear_page.S
=================================--- linux-2.6.9.orig/arch/x86_64/lib/clear_page.S	2004-10-18 14:54:07.000000000 -0700
+++ linux-2.6.9/arch/x86_64/lib/clear_page.S	2004-12-23 07:44:14.000000000 -0800
@@ -7,6 +7,7 @@
 clear_page:
 	xorl   %eax,%eax
 	movl   $4096/64,%ecx
+	shl	%esi, %ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -42,6 +43,7 @@
 	.section .altinstr_replacement,"ax"
 clear_page_c:
 	movl $4096/8,%ecx
+	shl	%esi, %ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.9/include/asm-sh/page.h
=================================--- linux-2.6.9.orig/include/asm-sh/page.h	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/include/asm-sh/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.9/include/asm-i386/mmx.h
=================================--- linux-2.6.9.orig/include/asm-i386/mmx.h	2004-10-18 14:54:27.000000000 -0700
+++ linux-2.6.9/include/asm-i386/mmx.h	2004-12-23 07:44:14.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.9/arch/alpha/lib/clear_page.S
=================================--- linux-2.6.9.orig/arch/alpha/lib/clear_page.S	2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/arch/alpha/lib/clear_page.S	2004-12-23 07:44:14.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.9/include/asm-sh64/page.h
=================================--- linux-2.6.9.orig/include/asm-sh64/page.h	2004-10-18 14:54:07.000000000 -0700
+++ linux-2.6.9/include/asm-sh64/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -50,12 +50,20 @@
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(page, order)
+{
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+		sh64_page_clear(page++, 0);
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	sh_clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.9/include/asm-h8300/page.h
=================================--- linux-2.6.9.orig/include/asm-h8300/page.h	2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/include/asm-h8300/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-arm/page.h
=================================--- linux-2.6.9.orig/include/asm-arm/page.h	2004-12-22 16:48:19.000000000 -0800
+++ linux-2.6.9/include/asm-arm/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -128,7 +128,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.9/include/asm-ppc64/page.h
=================================--- linux-2.6.9.orig/include/asm-ppc64/page.h	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/include/asm-ppc64/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, int order)
 {
 	unsigned long lines, line_size;

 	line_size = systemcfg->dCacheL1LineSize;
-	lines = naca->dCacheL1LinesPerPage;
+	lines = naca->dCacheL1LinesPerPage << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.9/include/asm-m32r/page.h
=================================--- linux-2.6.9.orig/include/asm-m32r/page.h	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-m32r/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-alpha/page.h
=================================--- linux-2.6.9.orig/include/asm-alpha/page.h	2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-alpha/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
Index: linux-2.6.9/arch/mips/mm/pg-sb1.c
=================================--- linux-2.6.9.orig/arch/mips/mm/pg-sb1.c	2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/arch/mips/mm/pg-sb1.c	2004-12-23 07:44:14.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.9/include/asm-m68k/page.h
=================================--- linux-2.6.9.orig/include/asm-m68k/page.h	2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/include/asm-m68k/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -50,7 +50,7 @@
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, 0)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.9/include/asm-mips/page.h
=================================--- linux-2.6.9.orig/include/asm-mips/page.h	2004-12-22 16:48:19.000000000 -0800
+++ linux-2.6.9/include/asm-mips/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.9/include/asm-m68knommu/page.h
=================================--- linux-2.6.9.orig/include/asm-m68knommu/page.h	2004-10-18 14:54:07.000000000 -0700
+++ linux-2.6.9/include/asm-m68knommu/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-cris/page.h
=================================--- linux-2.6.9.orig/include/asm-cris/page.h	2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/include/asm-cris/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-v850/page.h
=================================--- linux-2.6.9.orig/include/asm-v850/page.h	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/include/asm-v850/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.9/include/asm-parisc/page.h
=================================--- linux-2.6.9.orig/include/asm-parisc/page.h	2004-10-18 14:53:43.000000000 -0700
+++ linux-2.6.9/include/asm-parisc/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.9/arch/arm/mm/copypage-v6.c
=================================--- linux-2.6.9.orig/arch/arm/mm/copypage-v6.c	2004-12-23 07:44:04.000000000 -0800
+++ linux-2.6.9/arch/arm/mm/copypage-v6.c	2004-12-23 07:44:14.000000000 -0800
@@ -47,7 +47,7 @@
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	_clear_page(kaddr);
 }

 /*
@@ -116,7 +116,7 @@

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	_clear_page((void *)to);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.9/arch/m32r/mm/page.S
=================================--- linux-2.6.9.orig/arch/m32r/mm/page.S	2004-10-18 14:54:31.000000000 -0700
+++ linux-2.6.9/arch/m32r/mm/page.S	2004-12-23 07:44:14.000000000 -0800
@@ -51,7 +51,7 @@
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.9/include/asm-ppc/page.h
=================================--- linux-2.6.9.orig/include/asm-ppc/page.h	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-ppc/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -85,7 +85,7 @@

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define  clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.9/arch/alpha/kernel/alpha_ksyms.c
=================================--- linux-2.6.9.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-22 16:48:13.000000000 -0800
+++ linux-2.6.9/arch/alpha/kernel/alpha_ksyms.c	2004-12-23 07:44:14.000000000 -0800
@@ -88,7 +88,7 @@
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.9/arch/alpha/lib/ev6-clear_page.S
=================================--- linux-2.6.9.orig/arch/alpha/lib/ev6-clear_page.S	2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/arch/alpha/lib/ev6-clear_page.S	2004-12-23 07:44:14.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.9/arch/sh/mm/init.c
=================================--- linux-2.6.9.orig/arch/sh/mm/init.c	2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/init.c	2004-12-23 07:44:14.000000000 -0800
@@ -57,7 +57,7 @@
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.9/arch/sh/mm/pg-dma.c
=================================--- linux-2.6.9.orig/arch/sh/mm/pg-dma.c	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/pg-dma.c	2004-12-23 07:44:14.000000000 -0800
@@ -78,7 +78,7 @@
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.9/arch/sh/mm/pg-nommu.c
=================================--- linux-2.6.9.orig/arch/sh/mm/pg-nommu.c	2004-10-18 14:53:51.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/pg-nommu.c	2004-12-23 07:44:14.000000000 -0800
@@ -27,7 +27,7 @@
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.9/arch/mips/mm/pg-r4k.c
=================================--- linux-2.6.9.orig/arch/mips/mm/pg-r4k.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/mips/mm/pg-r4k.c	2004-12-23 07:44:14.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.9/arch/m32r/kernel/m32r_ksyms.c
=================================--- linux-2.6.9.orig/arch/m32r/kernel/m32r_ksyms.c	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/arch/m32r/kernel/m32r_ksyms.c	2004-12-23 07:44:14.000000000 -0800
@@ -102,7 +102,7 @@
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.9/include/asm-arm26/page.h
=================================--- linux-2.6.9.orig/include/asm-arm26/page.h	2004-10-18 14:54:39.000000000 -0700
+++ linux-2.6.9/include/asm-arm26/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -25,7 +25,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all Christoph Lameter
@ 2004-12-23 19:34         ` Christoph Lameter
  2004-12-23 19:35         ` Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE Christoph Lameter
  2004-12-23 20:08         ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
  3 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:34 UTC (permalink / raw)
  To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

o Add page zeroing
o Add scrub daemon
o Add ability to view amount of zeroed information in /proc/meninfo

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/mm/page_alloc.c
=================================--- linux-2.6.9.orig/mm/page_alloc.c	2004-12-22 13:31:02.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c	2004-12-22 14:24:56.000000000 -0800
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Support for page zeroing, Christoph Lameter, SGI, Dec 2004
  */

 #include <linux/config.h>
@@ -32,6 +33,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>

@@ -179,7 +181,7 @@
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
+static inline int __free_pages_bulk (struct page *page, struct page *base,
 		struct zone *zone, struct free_area *area, unsigned int order)
 {
 	unsigned long page_idx, index, mask;
@@ -192,11 +194,10 @@
 		BUG();
 	index = page_idx >> (1 + order);

-	zone->free_pages += 1 << order;
 	while (order < MAX_ORDER-1) {
 		struct page *buddy1, *buddy2;

-		BUG_ON(area >= zone->free_area + MAX_ORDER);
+		BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER);
 		if (!__test_and_change_bit(index, area->map))
 			/*
 			 * the buddy page is still allocated.
@@ -216,6 +217,7 @@
 		page_idx &= mask;
 	}
 	list_add(&(base + page_idx)->lru, &area->free_list);
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -258,7 +260,7 @@
 	int ret = 0;

 	base = zone->zone_mem_map;
-	area = zone->free_area + order;
+	area = zone->free_area[NOT_ZEROED] + order;
 	spin_lock_irqsave(&zone->lock, flags);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
@@ -266,7 +268,10 @@
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
-		__free_pages_bulk(page, base, zone, area, order);
+		zone->free_pages += 1 << order;
+		if (__free_pages_bulk(page, base, zone, area, order)
+			>= sysctl_scrub_start)
+				wakeup_kscrubd(zone);
 		ret++;
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -288,6 +293,21 @@
 	free_pages_bulk(page_zone(page), 1, &list, order);
 }

+void end_zero_page(struct page *page)
+{
+	unsigned long flags;
+	int order = page->index;
+	struct zone * zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	zone->zero_pages += 1 << order;
+	__free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+
 #define MARK_USED(index, order, area) \
 	__change_bit((index) >> (1+(order)), (area)->map)

@@ -366,25 +386,46 @@
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static void inline rmpage(struct page *page, struct zone *zone, struct free_area *area, int order)
+{
+	list_del(&page->lru);
+	if (order != MAX_ORDER-1)
+		MARK_USED(page - zone->zone_mem_map, order, area);
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order)
+{
+	unsigned long flags;
+	struct page *page = NULL;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!list_empty(&area->free_list)) {
+		page = list_entry(area->free_list.next, struct page, lru);
+
+		rmpage(page, zone, area, order);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
-	unsigned int index;

 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = zone->free_area[zero] + current_order;
 		if (list_empty(&area->free_list))
 			continue;

 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		index = page - zone->zone_mem_map;
-		if (current_order != MAX_ORDER-1)
-			MARK_USED(index, current_order, area);
+		rmpage(page, zone, area, current_order);
 		zone->free_pages -= 1UL << order;
-		return expand(zone, page, index, order, current_order, area);
+		if (zero)
+			zone->zero_pages -= 1UL << order;
+		return expand(zone, page, page - zone->zone_mem_map, order, current_order, area);
 	}

 	return NULL;
@@ -396,7 +437,7 @@
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int zero)
 {
 	unsigned long flags;
 	int i;
@@ -405,7 +446,7 @@

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
 		if (page = NULL)
 			break;
 		allocated++;
@@ -546,7 +587,9 @@
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int nr_pages = 1 << order;
+	int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;

 	if (order = 0) {
 		struct per_cpu_pages *pcp;
@@ -555,7 +598,7 @@
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, zero);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -567,19 +610,30 @@

 	if (page = NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+
+		page = __rmqueue(zone, order, zero);
+
+		/*
+		 * If we failed to obtain a zero and/or unzeroed page
+		 * then we may still be able to obtain the other
+		 * type of page.
+		 */
+		if (!page) {
+			page = __rmqueue(zone, order, !zero);
+			zero = 0;
+		}
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
-		mod_page_state_zone(zone, pgalloc, 1 << order);
-		prep_new_page(page, order);
+		mod_page_state_zone(zone, pgalloc, nr_pages);

-		if (gfp_flags & __GFP_ZERO) {
+		if ((gfp_flags & __GFP_ZERO) && !zero) {
 #ifdef CONFIG_HIGHMEM
 			if (PageHighMem(page)) {
-				int n = 1 << order;
+				int n = nr_pages;

 				while (n-- >0)
 					clear_highpage(page + n);
@@ -587,6 +641,7 @@
 #endif
 			clear_page(page_address(page), order);
 		}
+		prep_new_page(page, order);
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -974,7 +1029,7 @@
 }

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
@@ -982,27 +1037,31 @@
 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
 		*free += zones[i].free_pages;
+		*zero += zones[i].zero_pages;
 	}
 }

 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *free, unsigned long *zero)
 {
 	struct pglist_data *pgdat;

 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n,o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
 		*free += n;
+		*zero += o;
 	}
 }

@@ -1039,6 +1098,7 @@

 #define K(x) ((x) << (PAGE_SHIFT-10))

+const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1051,6 +1111,7 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;
 	struct zone *zone;

 	for_each_zone(zone) {
@@ -1071,10 +1132,10 @@

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1082,20 +1143,21 @@
 	}

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 	printk("\nFree pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));

 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
 		nr_free_pages(),
+		zero,
 		ps.nr_slab,
 		ps.nr_mapped,
 		ps.nr_page_table_pages);
@@ -1146,7 +1208,7 @@
 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
 			nr = 0;
-			list_for_each(elem, &zone->free_area[order].free_list)
+			list_for_each(elem, &zone->free_area[NOT_ZEROED][order].free_list)
 				++nr;
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
@@ -1470,14 +1532,18 @@
 	for (order = 0; ; order++) {
 		unsigned long bitmap_size;

-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
 		if (order = MAX_ORDER-1) {
-			zone->free_area[order].map = NULL;
+			zone->free_area[NOT_ZEROED][order].map = NULL;
+			zone->free_area[ZEROED][order].map = NULL;
 			break;
 		}

 		bitmap_size = pages_to_bitmap_size(order, size);
-		zone->free_area[order].map +		zone->free_area[NOT_ZEROED][order].map +		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+		zone->free_area[ZEROED][order].map  		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
 	}
 }
@@ -1503,6 +1569,7 @@

 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kscrubd_wait);

 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
@@ -1525,6 +1592,7 @@
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->zero_pages = 0;

 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

@@ -1558,6 +1626,13 @@
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
@@ -1687,7 +1762,7 @@
 			unsigned long nr_bufs = 0;
 			struct list_head *elem;

-			list_for_each(elem, &(zone->free_area[order].free_list))
+			list_for_each(elem, &(zone->free_area[NOT_ZEROED][order].free_list))
 				++nr_bufs;
 			seq_printf(m, "%6lu ", nr_bufs);
 		}
Index: linux-2.6.9/include/linux/mmzone.h
=================================--- linux-2.6.9.orig/include/linux/mmzone.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/mmzone.h	2004-12-22 14:24:56.000000000 -0800
@@ -51,7 +51,7 @@
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -107,10 +107,14 @@
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */

+#define NOT_ZEROED 0
+#define ZEROED 1
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		zero_pages;
 	/*
 	 * protection[] is a pre-calculated number of extra pages that must be
 	 * available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area[2][MAX_ORDER];


 	ZONE_PADDING(_pad1_)
@@ -265,6 +269,9 @@
 	struct pglist_data *pgdat_next;
 	wait_queue_head_t       kswapd_wait;
 	struct task_struct *kswapd;
+
+	wait_queue_head_t       kscrubd_wait;
+	struct task_struct *kscrubd;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@
 extern struct pglist_data *pgdat_list;

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+			unsigned long *free, unsigned long *zero);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone);

Index: linux-2.6.9/fs/proc/proc_misc.c
=================================--- linux-2.6.9.orig/fs/proc/proc_misc.c	2004-12-17 14:40:15.000000000 -0800
+++ linux-2.6.9/fs/proc/proc_misc.c	2004-12-22 14:24:56.000000000 -0800
@@ -158,13 +158,14 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;
 	unsigned long vmtot;
 	unsigned long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 /*
  * display in kilobytes.
@@ -187,6 +188,7 @@
 	len = sprintf(page,
 		"MemTotal:     %8lu kB\n"
 		"MemFree:      %8lu kB\n"
+		"MemZero:      %8lu kB\n"
 		"Buffers:      %8lu kB\n"
 		"Cached:       %8lu kB\n"
 		"SwapCached:   %8lu kB\n"
@@ -210,6 +212,7 @@
 		"VmallocChunk: %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
+		K(zero),
 		K(i.bufferram),
 		K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
 		K(total_swapcache_pages),
Index: linux-2.6.9/mm/readahead.c
=================================--- linux-2.6.9.orig/mm/readahead.c	2004-10-18 14:53:11.000000000 -0700
+++ linux-2.6.9/mm/readahead.c	2004-12-22 14:24:56.000000000 -0800
@@ -570,7 +570,8 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;

-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
Index: linux-2.6.9/drivers/base/node.c
=================================--- linux-2.6.9.orig/drivers/base/node.c	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/drivers/base/node.c	2004-12-22 14:24:56.000000000 -0800
@@ -41,13 +41,15 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;

 	si_meminfo_node(&i, nid);
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));

 	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
+		       "Node %d MemZero:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
@@ -57,6 +59,7 @@
 		       "Node %d LowFree:      %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
+		       nid, K(zero),
 		       nid, K(i.totalram - i.freeram),
 		       nid, K(active),
 		       nid, K(inactive),
Index: linux-2.6.9/include/linux/sched.h
=================================--- linux-2.6.9.orig/include/linux/sched.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h	2004-12-22 14:24:56.000000000 -0800
@@ -715,6 +715,7 @@
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.9/mm/Makefile
=================================--- linux-2.6.9.orig/mm/Makefile	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/Makefile	2004-12-22 14:24:56.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o prio_tree.o \
Index: linux-2.6.9/mm/scrubd.c
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/mm/scrubd.c	2004-12-22 14:26:35.000000000 -0800
@@ -0,0 +1,146 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = MAX_ORDER;		/* Off */
+unsigned int sysctl_scrub_stop = 2;	/* Mininum order of page to zero */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area, order);
+			struct list_head *l;
+
+			if (!page)
+				continue;
+
+			page->index = order;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+				unsigned long size = PAGE_SIZE << order;
+
+				if (driver->start(page_address(page), size) = 0) {
+
+					unsigned ticks = (size*HZ)/driver->rate;
+					if (ticks) {
+						/* Wait the minimum time of the transfer */
+						current->state = TASK_INTERRUPTIBLE;
+						schedule_timeout(ticks);
+					}
+					/* Then keep on checking until transfer is complete */
+					while (!driver->check())
+						schedule();
+					goto out;
+				}
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			clear_page(page_address(page), order);
+out:
+			end_zero_page(page);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd
+		= find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.9/include/linux/scrub.h
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/include/linux/scrub.h	2004-12-22 14:24:56.000000000 -0800
@@ -0,0 +1,48 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may that allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+        int (*start)(void *, unsigned long);		/* Start bzero transfer */
+	int (*check)(void);				/* Check if bzero is complete */
+	unsigned long rate;				/* zeroing rate in bytes/sec */
+        struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static void inline wakeup_kscrubd(struct zone *zone)
+{
+        if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+                return;
+        wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+				      void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
Index: linux-2.6.9/kernel/sysctl.c
=================================--- linux-2.6.9.orig/kernel/sysctl.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/kernel/sysctl.c	2004-12-22 14:24:56.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -816,6 +817,24 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.9/include/linux/sysctl.h
=================================--- linux-2.6.9.orig/include/linux/sysctl.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sysctl.h	2004-12-22 14:24:56.000000000 -0800
@@ -168,6 +168,8 @@
 	VM_VFS_CACHE_PRESSURE&, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT', /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT(, /* default time for token time out */
+	VM_SCRUB_START0,	/* percentage * 10 at which to start scrubd */
+	VM_SCRUB_STOP1,	/* percentage * 10 at which to stop scrubd */
 };




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all Christoph Lameter
  2004-12-23 19:34         ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter
@ 2004-12-23 19:35         ` Christoph Lameter
  2004-12-23 20:08         ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
  3 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:35 UTC (permalink / raw)
  To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

o Zeroing driver implemented with the Block Transfer Engine in the Altix SN2 SHub

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
=================================--- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/bte.c	2004-12-22 12:48:23.000000000 -0800
@@ -4,6 +4,8 @@
  * for more details.
  *
  * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
  */

 #include <linux/config.h>
@@ -20,6 +22,8 @@
 #include <linux/bootmem.h>
 #include <linux/string.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>

 #include <asm/sn/bte.h>

@@ -30,7 +34,7 @@
 /* two interfaces on two btes */
 #define MAX_INTERFACES_TO_TRY		4

-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
 {
 	nodepda_t *tmp_nodepda;

@@ -132,7 +136,6 @@
 			if (bte = NULL) {
 				continue;
 			}
-
 			if (spin_trylock(&bte->spinlock)) {
 				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
 				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +160,7 @@
 		}
 	} while (1);

-	if (notification = NULL) {
+	if (notification = NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
 		/* User does not want to be notified. */
 		bte->most_rcnt_na = &bte->notify;
 	} else {
@@ -192,6 +195,8 @@

 	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);

+	if (mode & BTE_NOTIFY_AND_GET_POINTER)
+		 *(u64 volatile **)(notification) = &bte->notify;
 	spin_unlock_irqrestore(&bte->spinlock, irq_flags);

 	if (notification != NULL) {
@@ -449,5 +454,37 @@
 		mynodepda->bte_if[i].cleanup_active = 0;
 		mynodepda->bte_if[i].bh_error = 0;
 	}
+}
+
+u64 *bte_zero_notify[MAX_COMPACT_NODES];
+
+static int bte_check_bzero(void)
+{
+	int node = get_nasid();
+
+	return *(bte_zero_notify[node]) != BTE_WORD_BUSY;
+}
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+	int node = get_nasid();
+
+	/* Check limitations.
+		1. System must be running (weird things happen during bootup)
+		2. Size >64KB. Smaller requests cause too much bte traffic
+	 */
+	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+		return EINVAL;
+
+	return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, bte_zero_notify+node);
+}
+
+static struct zero_driver bte_bzero = {
+	.start = bte_start_bzero,
+	.check = bte_check_bzero,
+	.rate = 500000000		/* 500 MB /sec */
+};

+void sn_bte_bzero_init(void) {
+	register_zero_driver(&bte_bzero);
 }
Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
=================================--- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/setup.c	2004-12-22 12:28:00.000000000 -0800
@@ -243,6 +243,7 @@
 	int pxm;
 	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
 	extern void sn_cpu_init(void);
+	extern void sn_bte_bzero_init(void);

 	/*
 	 * If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
 	screen_info = sn_screen_info;

 	sn_timer_init();
+	sn_bte_bzero_init();
 }

 /**
Index: linux-2.6.9/include/asm-ia64/sn/bte.h
=================================--- linux-2.6.9.orig/include/asm-ia64/sn/bte.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/sn/bte.h	2004-12-22 12:28:00.000000000 -0800
@@ -48,6 +48,8 @@
 #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
 /* Use a reserved bit to let the caller specify a wait for any BTE */
 #define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
 /* Use the BTE on the node with the destination memory */
 #define BTE_USE_DEST (BTE_WACQUIRE << 1)
 /* Use any available BTE interface on any node for the transfer */


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
@ 2004-12-23 19:49       ` Arjan van de Ven
  2004-12-23 20:57       ` Matt Mackall
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 99+ messages in thread
From: Arjan van de Ven @ 2004-12-23 19:49 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel


> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 byte each) must be loaded and later written back. This patch allows to
> avoid having to load all cachelines if only a part of the cachelines of
> that page is needed immediately after the fault.

eh why will all cachelines be loaded? Surely you can avoid the write-
allocate behavior for this case.....



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
                           ` (2 preceding siblings ...)
  2004-12-23 19:35         ` Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE Christoph Lameter
@ 2004-12-23 20:08         ` Brian Gerst
  2004-12-24 16:24           ` Christoph Lameter
  3 siblings, 1 reply; 99+ messages in thread
From: Brian Gerst @ 2004-12-23 20:08 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Christoph Lameter wrote:
> This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> to request zeroed pages from the page allocator.
> 
> o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set
> 
> o Replace all page zeroing after allocating pages by request for
>   zeroed pages.
> 
> o requires arch updates to clear_page in order to function properly.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 

> @@ -125,22 +125,19 @@
>  	int i;
>  	struct packet_data *pkt;
> 
> -	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
> +	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO);
>  	if (!pkt)
>  		goto no_pkt;
> -	memset(pkt, 0, sizeof(struct packet_data));
> 
>  	pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE);
>  	if (!pkt->w_bio)

This part is wrong.  kmalloc() uses the slab allocator instead of 
getting a full page.

--
				Brian Gerst

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
  2004-12-23 19:49       ` Prezeroing V2 [0/3]: Why and When it works Arjan van de Ven
@ 2004-12-23 20:57       ` Matt Mackall
  2004-12-23 21:01       ` Paul Mackerras
  2004-12-23 21:11       ` Paul Mackerras
  4 siblings, 0 replies; 99+ messages in thread
From: Matt Mackall @ 2004-12-23 20:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Thu, Dec 23, 2004 at 11:29:10AM -0800, Christoph Lameter wrote:
> 2. Hardware support for offloading zeroing from the cpu. This avoids
> the invalidation of the cpu caches by extensive zeroing operations.

I'm wondering if it would be possible to use typical video cards for
hardware zeroing. We could set aside a page's worth of zeros in video
memory and then use the card's DMA engines to clear pages on the host.

This could be done in fbdev drivers, which would register a zeroer
with the core.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
                         ` (2 preceding siblings ...)
  2004-12-23 20:57       ` Matt Mackall
@ 2004-12-23 21:01       ` Paul Mackerras
  2004-12-23 21:11       ` Paul Mackerras
  4 siblings, 0 replies; 99+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:01 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, , linux-ia64, torvalds, linux-mm, linux-kernel

Christoph Lameter writes:

> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 byte each) must be loaded and later written back. This patch allows to
> avoid having to load all cachelines if only a part of the cachelines of
> that page is needed immediately after the fault.

On ppc64 we avoid having to zero newly-allocated page table pages by
using a slab cache for them, with a constructor function that zeroes
them.  Page table pages naturally end up being full of zeroes when
they are freed, since ptep_get_and_clear, pmd_clear or pgd_clear has
been used on every non-zero entry by that stage.  Thus there is no
extra work required either when allocating them or freeing them.

I don't see any point in your patches for systems which don't have
some magic hardware for zeroing pages.  Your patch seems like a lot of
extra code that only benefits a very small number of machines.

Paul.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
                         ` (3 preceding siblings ...)
  2004-12-23 21:01       ` Paul Mackerras
@ 2004-12-23 21:11       ` Paul Mackerras
  2004-12-23 21:37         ` Andrew Morton
  2004-12-23 21:48         ` Linus Torvalds
  4 siblings, 2 replies; 99+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:11 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, , linux-ia64, torvalds, linux-mm, linux-kernel

Christoph Lameter writes:

> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page.

Re-reading this I see that you mean the zeroing of the page that is
mapped into the process address space, not the page table pages.  So
ignore my previous reply.

Do you have any statistics on how often a page fault needs to supply a
page of zeroes versus supplying a copy of an existing page, for real
applications?

In any case, unless you have magic page-zeroing hardware, I am still
inclined to think that zeroing the page at the time of the fault is
the most efficient, since that means the page will be hot in the cache
for the process to use.  If you zero it earlier using CPU stores, it
can only cause more overall memory traffic, as far as I can see.

I did some measurements once on my G5 powermac (running a ppc64 linux
kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
page.  This is real-life elapsed time in the kernel, not just some
cache-hot benchmark measurement.  Thus I don't think your patch will
gain us anything on ppc64.

Paul.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:11       ` Paul Mackerras
@ 2004-12-23 21:37         ` Andrew Morton
  2004-12-23 23:00           ` Paul Mackerras
  2004-12-23 21:48         ` Linus Torvalds
  1 sibling, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2004-12-23 21:37 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel

Paul Mackerras <paulus@samba.org> wrote:
>
> Christoph Lameter writes:
> 
> > The most expensive operation in the page fault handler is (apart of SMP
> > locking overhead) the zeroing of the page.
> 
> Re-reading this I see that you mean the zeroing of the page that is
> mapped into the process address space, not the page table pages.  So
> ignore my previous reply.
> 
> Do you have any statistics on how often a page fault needs to supply a
> page of zeroes versus supplying a copy of an existing page, for real
> applications?

When the workload is a gcc run, the pagefault handler dominates the system
time.  That's the page zeroing.

> In any case, unless you have magic page-zeroing hardware, I am still
> inclined to think that zeroing the page at the time of the fault is
> the most efficient, since that means the page will be hot in the cache
> for the process to use.  If you zero it earlier using CPU stores, it
> can only cause more overall memory traffic, as far as I can see.

x86's movnta instructions provide a way of initialising memory without
trashing the caches and it has pretty good bandwidth, I believe.  We should
wire that up to these patches and see if it speeds things up.

> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page.

40GB/s.  Is that straight into L1 or does the measurement include writeback?


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:11       ` Paul Mackerras
  2004-12-23 21:37         ` Andrew Morton
@ 2004-12-23 21:48         ` Linus Torvalds
  2004-12-23 22:34           ` Zwane Mwaikambo
                             ` (2 more replies)
  1 sibling, 3 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-12-23 21:48 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Christoph Lameter, Andrew Morton, linux-ia64, torvalds, linux-mm,
	Kernel Mailing List

On Fri, 24 Dec 2004, Paul Mackerras wrote:
> 
> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page.  This is real-life elapsed time in the kernel, not just some
> cache-hot benchmark measurement.  Thus I don't think your patch will
> gain us anything on ppc64.

Well, the thing is, if we really _know_ the machine is idle (and not just 
waiting for something like disk IO), it might be a good idea to just 
pre-zero everything we can.

The question to me is whether we can have a good enough heuristic to
notice that it triggers often enough to matter, but seldom enough that it
really won't disturb anybody.

And "disturb" very much includes things like laptop battery life,
scheduling latencies, memory bus traffic _and_ cache contents. 

And I really don't see a very good heuristic. Maybe it might literally be
something like "five-second load average goes down to zero" (we've got
fixed-point arithmetic with eleven fractional bits, so we can tune just
how close to "zero" we want to get). The load average is system-wide and
takes disk load (which tends to imply latency-critical work) into account,
so that might actually work out reasonably well as a "the system really is
quiescent".

So if we make the "what load is considered low" tunable, a system 
administrator can use that to make it more aggressive. And indeed, you 
might have a cron-job that says "be more aggressive at clearing pages 
between 2AM and 4AM in the morning" or something - if you have so much 
memory that it actually matters if you clear the memory just occasionally.

And the tunable load-average check has another advantage: if you want to 
benchmark it, you can first set it to true zero (basically never), and run 
the benchmark, and then you can set it to something very agressive ("clear 
pages every five seconds regardless of load") and re-run.

Does this sound sane? Christoph - can you try making the "scrub deamon" do 
that? Instead of the "scrub-low" and "scrub-high" (or in _addition_ to 
them), do a "scub-load" thing that takes a scaled integer, and compares it 
with "avenrun[0]" in kernel/timer.c: calc_load() when the average is 
updated every five seconds..

Personally, at least for a desktop usage, I think that the load average 
would work wonderfully well. I know my machines are often at basically 
zero load, and then having low-latency zero-pages when I sit down sounds 
like a good idea. Whether there is _enough_ free memory around for a 
5-second thing to work out well, I have no idea..

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48         ` Linus Torvalds
@ 2004-12-23 22:34           ` Zwane Mwaikambo
  2004-12-24  9:14           ` Arjan van de Ven
  2004-12-24 16:17           ` Christoph Lameter
  2 siblings, 0 replies; 99+ messages in thread
From: Zwane Mwaikambo @ 2004-12-23 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Thu, 23 Dec 2004, Linus Torvalds wrote:

> Personally, at least for a desktop usage, I think that the load average 
> would work wonderfully well. I know my machines are often at basically 
> zero load, and then having low-latency zero-pages when I sit down sounds 
> like a good idea. Whether there is _enough_ free memory around for a 
> 5-second thing to work out well, I have no idea..

Isn't the basic premise very similar to the following paper;

http://www.usenix.org/publications/library/proceedings/osdi99/full_papers/dougan/dougan_html/dougan.html

In fact i thought ppc32 did something akin to this.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:37         ` Andrew Morton
@ 2004-12-23 23:00           ` Paul Mackerras
  0 siblings, 0 replies; 99+ messages in thread
From: Paul Mackerras @ 2004-12-23 23:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel

Andrew Morton writes:

> When the workload is a gcc run, the pagefault handler dominates the system
> time.  That's the page zeroing.

For a program which uses a lot of heap and doesn't fork, that sounds
reasonable.

> x86's movnta instructions provide a way of initialising memory without
> trashing the caches and it has pretty good bandwidth, I believe.  We should
> wire that up to these patches and see if it speeds things up.

Yes.  I don't know the movnta instruction, but surely, whatever scheme
is used, there has to be a snoop for every cache line's worth of
memory that is zeroed.

The other point is that having the page hot in the cache may well be a
benefit to the program.  Using any sort of cache-bypassing zeroing
might not actually make things faster, when the user time as well as
the system time is taken into account.

> > I did some measurements once on my G5 powermac (running a ppc64 linux
> > kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> > page.
> 
> 40GB/s.  Is that straight into L1 or does the measurement include writeback?

It is the average elapsed time in clear_page, so it would include the
writeback of any cache lines displaced by the zeroing, but not the
writeback of the newly-zeroed cache lines (which we hope will be
modified by the program before they get written back anyway).

This is using the dcbz (data cache block zero) instruction, which
establishes a cache line in modified state with zero contents without
any memory traffic other than a cache line kill transaction sent to
the other CPUs and possible writeback of a dirty cache line displaced
by the newly-zeroed cache line.  The new cache line is established in
the L2 cache, because the L1 is write-through on the G5, and all
stores and dcbz instructions have to go to the L2 cache.

Thus, on the G5 (and POWER4, which is similar) I don't think there
will be much if any benefit from having pre-zeroed cache-cold pages.
We can establish the zero lines in cache much faster using dcbz than
we can by reading them in from main memory.  If the program uses only
a few cache lines out of each new page, then reading them from memory
might be faster, but that seems unlikely.

Paul.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all Christoph Lameter
@ 2004-12-24  8:33           ` Pavel Machek
  2004-12-24 16:18             ` Prezeroing V2 [2/4]: add second parameter to clear_page() for Christoph Lameter
  2004-12-24 17:05           ` David S. Miller
  2005-01-01 10:24           ` Geert Uytterhoeven
  2 siblings, 1 reply; 99+ messages in thread
From: Pavel Machek @ 2004-12-24  8:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Hi!

> o Extend clear_page to take an order parameter for all architectures.
> 

I believe you sould leave clear_page() as is, and introduce
clear_pages() with two arguments.
				Pavel

> -extern void clear_page (void *page);
> +extern void clear_page (void *page, int order);
>  extern void copy_page (void *to, void *from);
> 

-- 
64 bytes from 195.113.31.123: icmp_seq( ttlQ timeD8769.1 ms         


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48         ` Linus Torvalds
  2004-12-23 22:34           ` Zwane Mwaikambo
@ 2004-12-24  9:14           ` Arjan van de Ven
  2004-12-24 18:21             ` Linus Torvalds
  2004-12-24 16:17           ` Christoph Lameter
  2 siblings, 1 reply; 99+ messages in thread
From: Arjan van de Ven @ 2004-12-24  9:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List


> Personally, at least for a desktop usage, I think that the load average 
> would work wonderfully well. I know my machines are often at basically 
> zero load, and then having low-latency zero-pages when I sit down sounds 
> like a good idea. Whether there is _enough_ free memory around for a 
> 5-second thing to work out well, I have no idea..

problem is.. will it buy you anything if you use the page again
anyway... since such pages will be cold cached now. So for sure some of
it is only shifting latency from kernel side to userspace side, but
readprofile doesn't measure the later so it *looks* better...



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48         ` Linus Torvalds
  2004-12-23 22:34           ` Zwane Mwaikambo
  2004-12-24  9:14           ` Arjan van de Ven
@ 2004-12-24 16:17           ` Christoph Lameter
  2 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Andrew Morton, linux-ia64, linux-mm,
	Kernel Mailing List

On Thu, 23 Dec 2004, Linus Torvalds wrote:

> So if we make the "what load is considered low" tunable, a system
> administrator can use that to make it more aggressive. And indeed, you
> might have a cron-job that says "be more aggressive at clearing pages
> between 2AM and 4AM in the morning" or something - if you have so much
> memory that it actually matters if you clear the memory just occasionally.
>
> And the tunable load-average check has another advantage: if you want to
> benchmark it, you can first set it to true zero (basically never), and run
> the benchmark, and then you can set it to something very agressive ("clear
> pages every five seconds regardless of load") and re-run.
>
> Does this sound sane? Christoph - can you try making the "scrub deamon" do
> that? Instead of the "scrub-low" and "scrub-high" (or in _addition_ to
> them), do a "scub-load" thing that takes a scaled integer, and compares it
> with "avenrun[0]" in kernel/timer.c: calc_load() when the average is
> updated every five seconds..

Sure V3 will have that. So far the impact of zeroing is quite minimal
on IA64 (even without using hardware), the big zeroing happens immediately
after activating it anyways. I have not seen any measurable effect on
benchmarks even with 4G allocations on a 6G machine.

> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..

The CPU can do a couple of Gigs of zeroing per second per CPU and the
zeroing zeros local RAM. On my 6G machine with 8 Cpus it can only
take a fraction of a second to zero all RAM.

Merry Christmas, I am off till now next year. SGI mandatory holiday
shutdown so all addicts have to go cold turkey ;-)


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for
  2004-12-24  8:33           ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Pavel Machek
@ 2004-12-24 16:18             ` Christoph Lameter
  2004-12-24 16:27               ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Pavel Machek
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:18 UTC (permalink / raw)
  To: Pavel Machek; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004, Pavel Machek wrote:

> Hi!
>
> > o Extend clear_page to take an order parameter for all architectures.
> >
>
> I believe you sould leave clear_page() as is, and introduce
> clear_pages() with two arguments.

Did that in V1 and Andi Kleen complained about it.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal
  2004-12-23 20:08         ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
@ 2004-12-24 16:24           ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:24 UTC (permalink / raw)
  To: Brian Gerst; +Cc: linux-ia64, linux-mm, linux-kernel

On Thu, 23 Dec 2004, Brian Gerst wrote:

> This part is wrong.  kmalloc() uses the slab allocator instead of
> getting a full page.

Thanks for finding that. V3 will have that fixed.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-24 16:18             ` Prezeroing V2 [2/4]: add second parameter to clear_page() for Christoph Lameter
@ 2004-12-24 16:27               ` Pavel Machek
  2004-12-24 17:02                 ` Prezeroing V2 [2/4]: add second parameter to clear_page() for David S. Miller
  0 siblings, 1 reply; 99+ messages in thread
From: Pavel Machek @ 2004-12-24 16:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Hi!

> > > o Extend clear_page to take an order parameter for all architectures.
> > >
> >
> > I believe you sould leave clear_page() as is, and introduce
> > clear_pages() with two arguments.
> 
> Did that in V1 and Andi Kleen complained about it.

I do not know what Andi said, but having clear_page clearing two
page*s* seems wrong to me.
								Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for
  2004-12-24 16:27               ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Pavel Machek
@ 2004-12-24 17:02                 ` David S. Miller
  0 siblings, 0 replies; 99+ messages in thread
From: David S. Miller @ 2004-12-24 17:02 UTC (permalink / raw)
  To: Pavel Machek; +Cc: clameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004 17:27:45 +0100
Pavel Machek <pavel@ucw.cz> wrote:

> I do not know what Andi said, but having clear_page clearing two
> page*s* seems wrong to me.

It's represented by a single top-level page struct regardless
of it's order, so in that sense it's indeed a single page
no matter it's order.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all Christoph Lameter
  2004-12-24  8:33           ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Pavel Machek
@ 2004-12-24 17:05           ` David S. Miller
  2004-12-27 22:48             ` David S. Miller
  2005-01-03 17:52             ` Christoph Lameter
  2005-01-01 10:24           ` Geert Uytterhoeven
  2 siblings, 2 replies; 99+ messages in thread
From: David S. Miller @ 2004-12-24 17:05 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Thu, 23 Dec 2004 11:33:59 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> Modification made but it would be good to have some feedback from the arch maintainers:
> 
 ...
> sparc64

I don't see any sparc64 bits in this patch, else I'd
review them :-)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24  9:14           ` Arjan van de Ven
@ 2004-12-24 18:21             ` Linus Torvalds
  2004-12-24 18:57               ` Arjan van de Ven
  2004-12-27 22:50               ` David S. Miller
  0 siblings, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2004-12-24 18:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Fri, 24 Dec 2004, Arjan van de Ven wrote:
> 
> problem is.. will it buy you anything if you use the page again
> anyway... since such pages will be cold cached now. So for sure some of
> it is only shifting latency from kernel side to userspace side, but
> readprofile doesn't measure the later so it *looks* better...

Absolutely. I would want to see some real benchmarks before we do this.  
Not just some microbenchmark of "how many page faults can we take without
_using_ the page at all".

I agree 100% with you that we shouldn't shift the costs around. Having a
hice hot-spot that we know about is a good thing, and it means that
performance profiles show what the time is really spent on. Often getting
rid of the hotspot just smears out the work over a wider area, making
other optimizations (like trying to make the memory footprint _smaller_
and removing the work entirely that way) totally impossible because now
the performance profile just has a constant background noise and you can't 
tell what the real problem is.

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [0/3]: Overview
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
                       ` (3 preceding siblings ...)
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
@ 2004-12-24 18:31     ` Andrea Arcangeli
  2005-01-03 17:54       ` Christoph Lameter
  4 siblings, 1 reply; 99+ messages in thread
From: Andrea Arcangeli @ 2004-12-24 18:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
	torvalds, linux-mm, linux-kernel

Did you notice I already implemented full PG_zero caching here with
prezeroing on top of it?

	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2
	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2-no-zerolist-reserve-1

I was about to push this in SP1, but it was a bit late.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24 18:21             ` Linus Torvalds
@ 2004-12-24 18:57               ` Arjan van de Ven
  2004-12-27 22:50               ` David S. Miller
  1 sibling, 0 replies; 99+ messages in thread
From: Arjan van de Ven @ 2004-12-24 18:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Fri, 2004-12-24 at 10:21 -0800, Linus Torvalds wrote:
> 
> On Fri, 24 Dec 2004, Arjan van de Ven wrote:
> > 
> > problem is.. will it buy you anything if you use the page again
> > anyway... since such pages will be cold cached now. So for sure some of
> > it is only shifting latency from kernel side to userspace side, but
> > readprofile doesn't measure the later so it *looks* better...
> 
> Absolutely. I would want to see some real benchmarks before we do this.  
> Not just some microbenchmark of "how many page faults can we take without
> _using_ the page at all".
> 
> I agree 100% with you that we shouldn't shift the costs around. Having a
> hice hot-spot that we know about is a good thing, and it means that
> performance profiles show what the time is really spent on. Often getting
> rid of the hotspot just smears out the work over a wider area, making
> other optimizations (like trying to make the memory footprint _smaller_
> and removing the work entirely that way) totally impossible because now
> the performance profile just has a constant background noise and you can't 
> tell what the real problem is.

I suspect it's even worse.
Think about it; you can spew 4k of zeroes into your L1 cache really fast
(assuming your cpu is smart enough to avoid write-allocate for rep
stosl; not sure which cpus are). I suspect you can do that faster than a
cachemiss or two. And at that point the page is cache hot... so reads
don't miss either.

all this makes me wonder if there is any scenario where this thing will
be a gain, other than cpus that aren't smart enough to avoid the write-
allocate.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for
  2004-12-24 17:05           ` David S. Miller
@ 2004-12-27 22:48             ` David S. Miller
  2005-01-03 17:52             ` Christoph Lameter
  1 sibling, 0 replies; 99+ messages in thread
From: David S. Miller @ 2004-12-27 22:48 UTC (permalink / raw)
  To: David S. Miller
  Cc: clameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004 09:05:39 -0800
"David S. Miller" <davem@davemloft.net> wrote:

> On Thu, 23 Dec 2004 11:33:59 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > Modification made but it would be good to have some feedback from the arch maintainers:
> > 
>  ...
> > sparc64
> 
> I don't see any sparc64 bits in this patch, else I'd
> review them :-)

So I found time to implement the missing sparc64 clear_page()
changes, here they are:

=== arch/sparc64/lib/clear_page.S 1.1 vs edited ==--- 1.1/arch/sparc64/lib/clear_page.S	2004-08-08 19:54:07 -07:00
+++ edited/arch/sparc64/lib/clear_page.S	2004-12-24 08:53:29 -08:00
@@ -28,9 +28,12 @@
 	.text
 
 	.globl		_clear_page
-_clear_page:		/* %o0Þst */
+_clear_page:		/* %o0Þst, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1
 
 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@ clear_user_page:	/* %o0Þst, %o1=vaddr 
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate
 
+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1
 
 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8
=== include/asm-sparc64/page.h 1.19 vs edited ==--- 1.19/include/asm-sparc64/page.h	2004-07-27 12:54:49 -07:00
+++ edited/include/asm-sparc64/page.h	2004-12-24 08:52:17 -08:00
@@ -14,8 +14,8 @@
 
 #ifndef __ASSEMBLY__
 
-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24 18:21             ` Linus Torvalds
  2004-12-24 18:57               ` Arjan van de Ven
@ 2004-12-27 22:50               ` David S. Miller
  2004-12-28 11:53                 ` Marcelo Tosatti
  1 sibling, 1 reply; 99+ messages in thread
From: David S. Miller @ 2004-12-27 22:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: arjan, paulus, clameter, akpm, linux-ia64, linux-mm, linux-kernel

On Fri, 24 Dec 2004 10:21:24 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> Absolutely. I would want to see some real benchmarks before we do this.  
> Not just some microbenchmark of "how many page faults can we take without
> _using_ the page at all".

Here's my small contribution.  I did three "make -j3 vmlinux" timed
runs, one running a kernel without the pre-zeroing stuff applied,
one with it applied.  It did shave a few seconds off the build
consistently.  Here is the before:

real	8m35.248s
user	15m54.132s
sys	1m1.098s

real	8m32.202s
user	15m54.329s
sys	1m0.229s

real	8m31.932s
user	15m54.160s
sys	1m0.245s

and here is the after:

real	8m29.375s
user	15m43.296s
sys	0m59.549s

real	8m28.213s
user	15m39.819s
sys	0m58.790s

real	8m26.140s
user	15m44.145s
sys	0m58.872s

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-27 22:50               ` David S. Miller
@ 2004-12-28 11:53                 ` Marcelo Tosatti
  0 siblings, 0 replies; 99+ messages in thread
From: Marcelo Tosatti @ 2004-12-28 11:53 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linus Torvalds, arjan, paulus, clameter, akpm, linux-ia64,
	linux-mm, linux-kernel

On Mon, Dec 27, 2004 at 02:50:57PM -0800, David S. Miller wrote:
> On Fri, 24 Dec 2004 10:21:24 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > Absolutely. I would want to see some real benchmarks before we do this.  
> > Not just some microbenchmark of "how many page faults can we take without
> > _using_ the page at all".
> 
> Here's my small contribution.  I did three "make -j3 vmlinux" timed
> runs, one running a kernel without the pre-zeroing stuff applied,
> one with it applied.  It did shave a few seconds off the build
> consistently.  Here is the before:
> 
> real	8m35.248s
> user	15m54.132s
> sys	1m1.098s
> 
> real	8m32.202s
> user	15m54.329s
> sys	1m0.229s
> 
> real	8m31.932s
> user	15m54.160s
> sys	1m0.245s
> 
> and here is the after:
> 
> real	8m29.375s
> user	15m43.296s
> sys	0m59.549s
> 
> real	8m28.213s
> user	15m39.819s
> sys	0m58.790s
> 
> real	8m26.140s
> user	15m44.145s
> sys	0m58.872s

Christopher and other SGI fellows,

Get your patch into STP, once its there we can do some wider x86 benchmarking 
easily.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [2/3]: zeroing and
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
@ 2005-01-01  2:22       ` Nick Piggin
  2005-01-01  2:55         ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd pmarques
  0 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2005-01-01  2:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

Christoph Lameter wrote:
> o Add page zeroing
> o Add scrub daemon
> o Add ability to view amount of zeroed information in /proc/meminfo
> 

I quite like how you're handling the page zeroing now. It seems
less intrusive and cleaner in its interface to the page allocator
now.

I think this is pretty close to what I'd be happy with if we decide
to go with zeroing.

Just one small comment - there is a patch in the -mm tree that may
be of use to you; mm-keep-count-of-free-areas.patch is used later
by kswapd to handle and account higher order free areas properly.
You may be able to use it to better implement triggers/watermarks
for the scrub daemon.

Also...

> +
> +/*
> + * zero_highest_order_page takes a page off the freelist
> + * and then hands it off to block zeroing agents.
> + * The cleared pages are added to the back of
> + * the freelist where the page allocator may pick them up.
> + */
> +int zero_highest_order_page(struct zone *z)
> +{
> +	int order;
> +
> +	for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
> +		struct free_area *area = z->free_area[NOT_ZEROED] + order;
> +		if (!list_empty(&area->free_list)) {
> +			struct page *page = scrubd_rmpage(z, area, order);
> +			struct list_head *l;
> +
> +			if (!page)
> +				continue;
> +
> +			page->index = order;
> +
> +			list_for_each(l, &zero_drivers) {
> +				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
> +				unsigned long size = PAGE_SIZE << order;
> +
> +				if (driver->start(page_address(page), size) = 0) {
> +
> +					unsigned ticks = (size*HZ)/driver->rate;
> +					if (ticks) {
> +						/* Wait the minimum time of the transfer */
> +						current->state = TASK_INTERRUPTIBLE;
> +						schedule_timeout(ticks);
> +					}
> +					/* Then keep on checking until transfer is complete */
> +					while (!driver->check())
> +						schedule();
> +					goto out;
> +				}

Would you be better off to just have a driver->zero_me(...) call, with this
logic pushed into those like your BTE which need it? I'm thinking this would
help flexibility if you had say a BTE-thingy that did an interrupt on
completion, or if it was done synchronously by the CPU with cache bypassing
stores.

Also, would there be any use in passing a batch of pages to the zeroing driver?
That may improve performance on some implementations, but could also cut down
the inefficiency in your timeout mechanism due to timer quantization (I guess
probably not much if you are only zeroing quite large areas).

BTW, that while loop is basically a busy-wait. Not a critical problem, but you
may want to renice scrubd to the lowest scheduling priority to be a bit nicer?
(I think you'd want to do that anyway). And put a cpu_relax() call in there?

Just some suggestions.

Nick

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd
  2005-01-01  2:22       ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and Nick Piggin
@ 2005-01-01  2:55         ` pmarques
  0 siblings, 0 replies; 99+ messages in thread
From: pmarques @ 2005-01-01  2:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
	torvalds, linux-mm, linux-kernel

Quoting Nick Piggin <nickpiggin@yahoo.com.au>:
> [...]
> Would you be better off to just have a driver->zero_me(...) call, with this
> logic pushed into those like your BTE which need it? I'm thinking this would
> help flexibility if you had say a BTE-thingy that did an interrupt on
> completion, or if it was done synchronously by the CPU with cache bypassing
> stores.

It seems that people in this discussion are assuming that PC's don't have
hardware to do this at all.

While there is no _official_ hardware, a bt878 with the brightness setting all
the way down, at 1024 pixels per line, 32 bits per pixel would be able to zero
a full physical page in under 60 microseconds (PAL scanline). It could even
zero a _list_ of pages passed to it and generate an interrupt in the end.

This is just an example, and there might be some problems in the implementation
details that make it impossible to work, but there might also be more hardware
out there that could perform similar functions (graphics cards?).

This might not be worth the bother *at all*, but I can imagine some weird
conversation between two sysadmins:
  "My server is wasting a lot of time handling page faults"
  "Why don't you install a video aquisition board with a bt878 chip? It did
wonders on my server"
  "Yes, I've also weard that a radeon graphics card can really accelerate kernel
compiles"

Well, just my 0.02 euro :)

--
Paulo Marques - www.grupopie.com

"A journey of a thousand miles begins with a single step."
Lao-tzu, The Way of Lao-tzu

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all Christoph Lameter
  2004-12-24  8:33           ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Pavel Machek
  2004-12-24 17:05           ` David S. Miller
@ 2005-01-01 10:24           ` Geert Uytterhoeven
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
  2 siblings, 1 reply; 99+ messages in thread
From: Geert Uytterhoeven @ 2005-01-01 10:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

On Thu, 23 Dec 2004, Christoph Lameter wrote:
> o Extend clear_page to take an order parameter for all architectures.

> Index: linux-2.6.9/include/asm-m68k/page.h
> =================================> --- linux-2.6.9.orig/include/asm-m68k/page.h	2004-10-18 14:55:36.000000000 -0700
> +++ linux-2.6.9/include/asm-m68k/page.h	2004-12-23 07:44:14.000000000 -0800
> @@ -50,7 +50,7 @@
>  		       );
>  }
> 
> -static inline void clear_page(void *page)
> +static inline void clear_page(void *page, int order)
>  {
>  	unsigned long tmp;
>  	unsigned long *sp = page;
> @@ -69,16 +69,16 @@
>  			     "dbra   %1,1b\n\t"
>  			     : "=a" (sp), "=d" (tmp)
>  			     : "a" (page), "0" (sp),
> -			       "1" ((PAGE_SIZE - 16) / 16 - 1));
> +			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
>  }
> 
>  #else
> -#define clear_page(page)	memset((page), 0, PAGE_SIZE)
> +#define clear_page(page, 0)	memset((page), 0, PAGE_SIZE << (order))
                            ^
			    order

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for
  2004-12-24 17:05           ` David S. Miller
  2004-12-27 22:48             ` David S. Miller
@ 2005-01-03 17:52             ` Christoph Lameter
  1 sibling, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-03 17:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004, David S. Miller wrote:

> On Thu, 23 Dec 2004 11:33:59 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Modification made but it would be good to have some feedback from the arch maintainers:
> >
>  ...
> > sparc64
>
> I don't see any sparc64 bits in this patch, else I'd
> review them :-)
>

Sorry here it is:

Index: linux-2.6.9/include/asm-sparc64/page.h
=================================--- linux-2.6.9.orig/include/asm-sparc64/page.h 2004-10-18 14:53:51.000000000 -0700
+++ linux-2.6.9/include/asm-sparc64/page.h      2005-01-03 09:50:16.000000000 -0800
@@ -15,7 +15,17 @@
 #ifndef __ASSEMBLY__

 extern void _clear_page(void *page);
-#define clear_page(X)  _clear_page((void *)(X))
+
+static void inline clear_page(void *page, int order)
+{
+       unsigned int nr = 1 << order;
+
+       while (nr-- > 0) {
+               _clear_page(page);
+               page += PAGE_SIZE;
+       }
+}
+
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y) memcpy((void *)(X), (void *)(Y), PAGE_SIZE)


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [0/3]: Overview
  2004-12-24 18:31     ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
@ 2005-01-03 17:54       ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-03 17:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
	torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004, Andrea Arcangeli wrote:

> Did you notice I already implemented full PG_zero caching here with
> prezeroing on top of it?
>
> 	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2
> 	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2-no-zerolist-reserve-1
>
> I was about to push this in SP1, but it was a bit late.

Yes but this did not do the trick and the interface to get zeroed pages is
a bit difficult to handle.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V3 [0/4]: Discussion and i386 performance tests
  2005-01-01 10:24           ` Geert Uytterhoeven
@ 2005-01-04 23:12             ` Christoph Lameter
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
                                 ` (3 more replies)
  0 siblings, 4 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:12 UTC (permalink / raw)
  Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

Change from V2 to V3:
o Updates for clear_page on various platforms
o Performance measurements on i386 (2x PIII-450 384M RAM)
o Port patches to 2.6.10-bk7
o Add scrub_load so that a high load prevents scrubd from running
  (So that people may feel better about this approach. Set by
  default to 999 so its off. The typical result of not running kscrubd
  under high loads is to slow the system down even further since zeroing
  large consecutive areas of memory is more efficient than zeroing page
  size chunks. Memory subsystems are typically optimized for linear accesses
  and reach their peak performance if large areas of memory are written to)
o Various fixes

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running in
SMP systems with a high number of cpus. The single thread performance does
just show minor increases. Only the performance of multi-threaded
applications increases significantly.

The most expensive operation in the page fault handler is (apart of SMP
locking overhead) the zeroing of the page that is also done in the page fault
handler. This zeroing means that all cachelines of the faulted page (on Altix
that means all 128 cachelines of 128 byte each) must be loaded and later
written back. This patch allows to avoid having to load all cachelines
if only a part of the cachelines of that page is needed immediately after
the fault. Doing so will only be effective for sparsely accessed memory
which is typical for anonymous memory and pte maps. Prezeroed pages will
only be used for those purposes. Unzeroed pages will be used as usual for
file mapping, page caching etc etc.

Others have also thought that prezeroing could be a benefit and have tried
provide a way to provide zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t\x109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t\x109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m\x104931944213955&w=2

However, these attempt have tried to zero pages that are like to be used
soon (and that may have recently been accessed). Elements of these pages
are thus already in the cpu caches. Approaches like that will only shift
processing to somewhere else and not bring any performance benefits.
Prezeroing only makes sense for pages that are not currently needed and that
are not in the cpu caches. Pages that have recently been touched and that
soon will be touched again are better hot zeroed since the zeroing will
largely be done to cachelines already in the cpu caches.

The patch makes prezeroing very effective by:

1. Aggregating zeroing operations to only apply to pages of higher order,
which results in many pages that will later become zero 0 to be zeroed in one
step.
For that purpose the existing clear_page function is extended and made to
take an additional argument specifying the order of the page to be cleared.

2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.

The scrub daemon is invoked when a unzeroed page of a certain order has
been generated so that its worth running it. If no higher order pages are
present then the logic will favor hot zeroing rather than simply shifting
processing around. kscrubd typically runs only for a fraction of a second
and sleeps for long periods of time even under memory benchmarking. kscrubd
performs short bursts of zeroing when needed and tries to stay out off the
processor as much as possible.

The result is a significant increase of the page fault performance even for
single threaded applications (i386 2x PIII-450 384M RAM allocating 256M in
each run):

w/o patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  0   1    1    0.006s      0.389s   0.039s157455.320 157070.694
  0   1    2    0.007s      0.607s   0.032s101476.689 190350.885

w/patch
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  0   1    1    0.008s      0.083s   0.009s672151.422 664045.899
  0   1    2    0.005s      0.129s   0.008s459629.796 741857.373

The performance can only be upheld if enough zeroed pages are available.
In a heavy memory intensive benchmark the system may run out of these very
fast but the efficient algorithm for page zeroing still makes this a winner
(2 way system with 384MB RAM, no hardware zeroing support). In the following
measurement the test is repeated 10 times allocating 256M each in rapid
succession which would deplete the pool of zeroed pages quickly):

w/o patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  0  10    1    0.058s      3.913s   3.097s157335.774 157076.932
  0  10    2    0.063s      6.139s   3.027s100756.788 190572.486

w/patch
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  0  10    1    0.059s      1.828s   1.089s330913.517 330225.515
  0  10    2    0.082s      1.951s   1.094s307172.100 320680.232

Note that zeroing of pages makes no sense if the application
touches all cache lines of a page allocated (there is no influence of
prezeroing on benchmarks like lmbench for that reason) since the extensive
caching of modern cpus means that the zeroes written to a hot zeroed page
will then be overwritten by the application in the cpu cache and thus
the zeros will never make it to memory! The test program used above only
touches one 128 byte cache line of a 16k page (ia64). Sparsely
populated and accessed areas are typical for lots of applications.

Here is another test in order to gauge the influence of the number of cache
lines touched on the performance of the prezero enhancements:

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  1    1   1    0.01s      0.12s   0.01s500813.853 497925.891
  1  1    1   2    0.01s      0.11s   0.01s493453.103 472877.725
  1  1    1   4    0.02s      0.10s   0.01s479351.658 471507.415
  1  1    1   8    0.01s      0.13s   0.01s424742.054 416725.013
  1  1    1  16    0.05s      0.12s   0.01s347715.359 336983.834
  1  1    1  32    0.12s      0.13s   0.02s258112.286 256246.731
  1  1    1  64    0.24s      0.14s   0.03s169896.381 168189.283
  1  1    1 128    0.49s      0.14s   0.06s102300.257 101674.435

The benefits of prezeroing are reduced to minimal quantities if all
cachelines of a page are touched. Prezeroing can only be effective
if the whole page is not immediately used after the page fault.

The patch is composed of 4 parts:

[1/4] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO flag
	and returns zeroed memory on request. Modifies locations throughout
	the linux sources that retrieve a page and then zero it to request
	a zeroed page.

[2/4] Architecture specific clear_page updates
	Adds second order argument to clear_page and updates all arches.

Note: The two first pages may be used alone if no zeroing engine is wanted.

[3/4] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background daemon
	called scrubd. scrubd is disabled by default but can be enabled
	by writing an order number to /proc/sys/vm/scrub_start. If a page
	is coalesced of that order or higher then the scrub daemon will
	start zeroing until all pages of order /proc/sys/vm/scrub_stop and
	higher are zeroed and then go back to sleep.

	In an SMP environment the scrub daemon is typically
	running on the most idle cpu. Thus a single threaded application running
	on one cpu may have the other cpu zeroing pages for it etc. The scrub
	daemon is hardly noticable and usually finished zeroing quickly since
	most processors are optimized for linear memory filling.

[4/4]	SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into hardware.
	With hardware support there will be minimal impact of zeroing
	on the performance of the system.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
@ 2005-01-04 23:13               ` Christoph Lameter
  2005-01-04 23:45                 ` Dave Hansen
                                   ` (2 more replies)
  2005-01-04 23:14               ` Prezeroing V3 [2/4]: Extension of clear_page to take an order Christoph Lameter
                                 ` (2 subsequent siblings)
  3 siblings, 3 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:13 UTC (permalink / raw)
  To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
to request zeroed pages from the page allocator.

o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set

o Replace all page zeroing after allocating pages by request for
  zeroed pages.

o requires arch updates to clear_page in order to function properly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/mm/page_alloc.c
=================================--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-04 12:16:49.000000000 -0800
@@ -584,6 +584,18 @@
 		BUG_ON(bad_range(zone, page));
 		mod_page_state_zone(zone, pgalloc, 1 << order);
 		prep_new_page(page, order);
+
+		if (gfp_flags & __GFP_ZERO) {
+#ifdef CONFIG_HIGHMEM
+			if (PageHighMem(page)) {
+				int n = 1 << order;
+
+				while (n-- >0)
+					clear_highpage(page + n);
+			} else
+#endif
+			clear_page(page_address(page), order);
+		}
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -796,12 +808,9 @@
 	 */
 	BUG_ON(gfp_mask & __GFP_HIGHMEM);

-	page = alloc_pages(gfp_mask, 0);
-	if (page) {
-		void *address = page_address(page);
-		clear_page(address);
-		return (unsigned long) address;
-	}
+	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+	if (page)
+		return (unsigned long) page_address(page);
 	return 0;
 }

Index: linux-2.6.10/include/linux/gfp.h
=================================--- linux-2.6.10.orig/include/linux/gfp.h	2004-12-24 13:34:27.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h	2005-01-04 12:16:49.000000000 -0800
@@ -37,6 +37,7 @@
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_ZERO	0x8000	/* Return zeroed page on success */

 #define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.10/mm/memory.c
=================================--- linux-2.6.10.orig/mm/memory.c	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-04 12:16:49.000000000 -0800
@@ -1650,10 +1650,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.10/kernel/profile.c
=================================--- linux-2.6.10.orig/kernel/profile.c	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/kernel/profile.c	2005-01-04 12:16:49.000000000 -0800
@@ -326,17 +326,15 @@
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				return NOTIFY_BAD;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				goto out_free;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
 		}
 		break;
@@ -510,16 +508,14 @@
 		int node = cpu_to_node(cpu);
 		struct page *page;

-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[0]
 				= (struct profile_hit *)page_address(page);
 	}
Index: linux-2.6.10/mm/shmem.c
=================================--- linux-2.6.10.orig/mm/shmem.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/mm/shmem.c	2005-01-04 12:16:49.000000000 -0800
@@ -369,9 +369,8 @@
 		}

 		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
 		if (page) {
-			clear_highpage(page);
 			page->nr_swapped = 0;
 		}
 		spin_lock(&info->lock);
@@ -910,7 +909,7 @@
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -926,7 +925,7 @@
 shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
 				 unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(gfp | __GFP_ZERO);
 }
 #endif

@@ -1135,7 +1134,6 @@

 		info->alloced++;
 		spin_unlock(&info->lock);
-		clear_highpage(filepage);
 		flush_dcache_page(filepage);
 		SetPageUptodate(filepage);
 	}
Index: linux-2.6.10/mm/hugetlb.c
=================================--- linux-2.6.10.orig/mm/hugetlb.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/mm/hugetlb.c	2005-01-04 12:16:49.000000000 -0800
@@ -77,7 +77,6 @@
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -88,8 +87,7 @@
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	clear_page(page_address(page), HUGETLB_PAGE_ORDER);
 	return page;
 }

Index: linux-2.6.10/include/asm-ia64/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -61,9 +61,7 @@
 	pgd_t *pgd = pgd_alloc_one_fast(mm);

 	if (unlikely(pgd = NULL)) {
-		pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-		if (likely(pgd != NULL))
-			clear_page(pgd);
+		pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	}
 	return pgd;
 }
@@ -106,10 +104,8 @@
 static inline pmd_t*
 pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pmd != NULL))
-		clear_page(pmd);
 	return pmd;
 }

@@ -140,20 +136,16 @@
 static inline struct page *
 pte_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

-	if (likely(pte != NULL))
-		clear_page(page_address(pte));
 	return pte;
 }

 static inline pte_t *
 pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pte != NULL))
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.10/arch/i386/mm/pgtable.c
=================================--- linux-2.6.10.orig/arch/i386/mm/pgtable.c	2005-01-04 12:16:39.000000000 -0800
+++ linux-2.6.10/arch/i386/mm/pgtable.c	2005-01-04 12:16:49.000000000 -0800
@@ -140,10 +140,7 @@

 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
-	return pte;
+	return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 }

 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -151,12 +148,10 @@
 	struct page *pte;

 #ifdef CONFIG_HIGHPTE
-	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
 #else
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 #endif
-	if (pte)
-		clear_highpage(pte);
 	return pte;
 }

Index: linux-2.6.10/arch/m68k/mm/motorola.c
=================================--- linux-2.6.10.orig/arch/m68k/mm/motorola.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/m68k/mm/motorola.c	2005-01-04 12:16:49.000000000 -0800
@@ -1,4 +1,4 @@
-/*
+*
  * linux/arch/m68k/motorola.c
  *
  * Routines specific to the Motorola MMU, originally from:
@@ -50,7 +50,7 @@

 	ptablep = (pte_t *)alloc_bootmem_low_pages(PAGE_SIZE);

-	clear_page(ptablep);
+	clear_page(ptablep, 0);
 	__flush_page_to_ram(ptablep);
 	flush_tlb_kernel_page(ptablep);
 	nocache_page(ptablep);
@@ -90,7 +90,7 @@
 	if (((unsigned long)last_pgtable & ~PAGE_MASK) = 0) {
 		last_pgtable = (pmd_t *)alloc_bootmem_low_pages(PAGE_SIZE);

-		clear_page(last_pgtable);
+		clear_page(last_pgtable, 0);
 		__flush_page_to_ram(last_pgtable);
 		flush_tlb_kernel_page(last_pgtable);
 		nocache_page(last_pgtable);
Index: linux-2.6.10/include/asm-mips/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-mips/pgalloc.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-mips/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -56,9 +56,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);

 	return pte;
 }
Index: linux-2.6.10/arch/alpha/mm/init.c
=================================--- linux-2.6.10.orig/arch/alpha/mm/init.c	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/arch/alpha/mm/init.c	2005-01-04 12:16:49.000000000 -0800
@@ -42,10 +42,9 @@
 {
 	pgd_t *ret, *init;

-	ret = (pgd_t *)__get_free_page(GFP_KERNEL);
+	ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	init = pgd_offset(&init_mm, 0UL);
 	if (ret) {
-		clear_page(ret);
 #ifdef CONFIG_ALPHA_LARGE_VMALLOC
 		memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
 			(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
@@ -63,9 +62,7 @@
 pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.10/include/asm-parisc/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-parisc/pgalloc.h	2004-12-24 13:35:39.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -120,18 +120,14 @@
 static inline struct page *
 pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(page != NULL))
-		clear_page(page_address(page));
+	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return page;
 }

 static inline pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(pte != NULL))
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.10/arch/sh/mm/pg-sh4.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-sh4.c	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-sh4.c	2005-01-04 12:16:49.000000000 -0800
@@ -34,7 +34,7 @@
 {
 	__set_bit(PG_mapped, &page->flags);
 	if (((address ^ (unsigned long)to) & CACHE_ALIAS) = 0)
-		clear_page(to);
+		clear_page(to, 0);
 	else {
 		pgprot_t pgprot = __pgprot(_PAGE_PRESENT |
 					   _PAGE_RW | _PAGE_CACHABLE |
Index: linux-2.6.10/include/asm-sparc64/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-sparc64/pgalloc.h	2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -73,10 +73,9 @@
 		struct page *page;

 		preempt_enable();
-		page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+		page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (page) {
 			ret = (struct page *)page_address(page);
-			clear_page(ret);
 			page->lru.prev = (void *) 2UL;

 			preempt_disable();
Index: linux-2.6.10/include/asm-sh/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-sh/pgalloc.h	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-sh/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -44,9 +44,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);

 	return pte;
 }
@@ -56,9 +54,7 @@
 {
 	struct page *pte;

-   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
Index: linux-2.6.10/include/asm-m32r/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-m32r/pgalloc.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -23,10 +23,7 @@
  */
 static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
-	if (pgd)
-		clear_page(pgd);
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pgd;
 }
@@ -39,10 +36,7 @@
 static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
-
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pte;
 }
@@ -50,10 +44,8 @@
 static __inline__ struct page *pte_alloc_one(struct mm_struct *mm,
 	unsigned long address)
 {
-	struct page *pte = alloc_page(GFP_KERNEL);
+	struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO);

-	if (pte)
-		clear_page(page_address(pte));

 	return pte;
 }
Index: linux-2.6.10/arch/um/kernel/mem.c
=================================--- linux-2.6.10.orig/arch/um/kernel/mem.c	2005-01-04 12:16:40.000000000 -0800
+++ linux-2.6.10/arch/um/kernel/mem.c	2005-01-04 12:16:49.000000000 -0800
@@ -327,9 +327,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

@@ -337,9 +335,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_highpage(pte);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.10/arch/ppc64/mm/init.c
=================================--- linux-2.6.10.orig/arch/ppc64/mm/init.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/ppc64/mm/init.c	2005-01-04 12:16:49.000000000 -0800
@@ -761,7 +761,7 @@

 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
-	clear_page(page);
+	clear_page(page, 0);

 	if (cur_cpu_spec->cpu_features & CPU_FTR_COHERENT_ICACHE)
 		return;
Index: linux-2.6.10/include/asm-sh64/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-sh64/pgalloc.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -112,9 +112,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO);

 	return pte;
 }
@@ -123,9 +121,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
@@ -150,9 +146,7 @@
 static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
-	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pmd)
-		clear_page(pmd);
+	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pmd;
 }

Index: linux-2.6.10/include/asm-cris/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-cris/pgalloc.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-cris/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -24,18 +24,14 @@

 extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
  	return pte;
 }

 extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.10/arch/ppc/mm/pgtable.c
=================================--- linux-2.6.10.orig/arch/ppc/mm/pgtable.c	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/arch/ppc/mm/pgtable.c	2005-01-04 12:16:49.000000000 -0800
@@ -85,8 +85,7 @@
 {
 	pgd_t *ret;

-	if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL)
-		clear_pages(ret, PGDIR_ORDER);
+	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
 	return ret;
 }

@@ -102,7 +101,7 @@
 	extern void *early_get_page(void);

 	if (mem_init_done) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (pte) {
 			struct page *ptepage = virt_to_page(pte);
 			ptepage->mapping = (void *) mm;
@@ -110,8 +109,6 @@
 		}
 	} else
 		pte = (pte_t *)early_get_page();
-	if (pte)
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.10/arch/ppc/mm/init.c
=================================--- linux-2.6.10.orig/arch/ppc/mm/init.c	2005-01-04 12:16:40.000000000 -0800
+++ linux-2.6.10/arch/ppc/mm/init.c	2005-01-04 12:16:49.000000000 -0800
@@ -594,7 +594,7 @@
 }
 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
-	clear_page(page);
+	clear_page(page, 0);
 	clear_bit(PG_arch_1, &pg->flags);
 }

Index: linux-2.6.10/fs/afs/file.c
=================================--- linux-2.6.10.orig/fs/afs/file.c	2004-12-24 13:35:59.000000000 -0800
+++ linux-2.6.10/fs/afs/file.c	2005-01-04 12:16:49.000000000 -0800
@@ -172,7 +172,7 @@
 				      (size_t) PAGE_SIZE);
 		desc.buffer	= kmap(page);

-		clear_page(desc.buffer);
+		clear_page(desc.buffer, 0);

 		/* read the contents of the file from the server into the
 		 * page */
Index: linux-2.6.10/include/asm-alpha/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-alpha/pgalloc.h	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -40,9 +40,7 @@
 static inline pmd_t *
 pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (ret)
-		clear_page(ret);
+	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return ret;
 }

Index: linux-2.6.10/include/linux/highmem.h
=================================--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-04 12:16:49.000000000 -0800
@@ -45,7 +45,7 @@
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }

Index: linux-2.6.10/arch/sh64/mm/ioremap.c
=================================--- linux-2.6.10.orig/arch/sh64/mm/ioremap.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/sh64/mm/ioremap.c	2005-01-04 12:16:49.000000000 -0800
@@ -399,7 +399,7 @@
 	if (pte_none(*ptep) || !pte_present(*ptep))
 		return;

-	clear_page((void *)ptep);
+	clear_page((void *)ptep, 0);
 	pte_clear(ptep);
 }

Index: linux-2.6.10/include/asm-m68k/motorola_pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-m68k/motorola_pgalloc.h	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/motorola_pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -12,9 +12,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
@@ -31,7 +30,7 @@

 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	pte_t *pte;

 	if(!page)
@@ -39,7 +38,6 @@

 	pte = kmap(page);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
Index: linux-2.6.10/arch/sh/mm/pg-sh7705.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-sh7705.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-sh7705.c	2005-01-04 12:16:49.000000000 -0800
@@ -78,13 +78,13 @@

 	__set_bit(PG_mapped, &page->flags);
 	if (((address ^ (unsigned long)to) & CACHE_ALIAS) = 0) {
-		clear_page(to);
+		clear_page(to, 0);
 		__flush_wback_region(to, PAGE_SIZE);
 	} else {
 		__flush_purge_virtual_region(to,
 					     (void *)(address & 0xfffff000),
 					     PAGE_SIZE);
-		clear_page(to);
+		clear_page(to, 0);
 		__flush_wback_region(to, PAGE_SIZE);
 	}
 }
Index: linux-2.6.10/arch/sparc64/mm/init.c
=================================--- linux-2.6.10.orig/arch/sparc64/mm/init.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/sparc64/mm/init.c	2005-01-04 12:16:49.000000000 -0800
@@ -1687,13 +1687,12 @@
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
 	 */
-	mem_map_zero = alloc_pages(GFP_KERNEL, 0);
+	mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (mem_map_zero = NULL) {
 		prom_printf("paging_init: Cannot alloc zero page.\n");
 		prom_halt();
 	}
 	SetPageReserved(mem_map_zero);
-	clear_page(page_address(mem_map_zero));

 	codepages = (((unsigned long) _etext) - ((unsigned long) _start));
 	codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT;
Index: linux-2.6.10/include/asm-arm/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-arm/pgalloc.h	2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-arm/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -50,9 +50,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
 		pte += PTRS_PER_PTE;
 	}
@@ -65,10 +64,9 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	if (pte) {
 		void *page = page_address(pte);
-		clear_page(page);
 		clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
 	}

Index: linux-2.6.10/drivers/net/tc35815.c
=================================--- linux-2.6.10.orig/drivers/net/tc35815.c	2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c	2005-01-04 12:16:49.000000000 -0800
@@ -657,7 +657,7 @@
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
 	} else {
-		clear_page(lp->fd_buf);
+		clear_page(lp->fd_buf, 0);
 #ifdef __mips__
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
Index: linux-2.6.10/drivers/block/pktcdvd.c
=================================--- linux-2.6.10.orig/drivers/block/pktcdvd.c	2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/drivers/block/pktcdvd.c	2005-01-04 12:16:49.000000000 -0800
@@ -135,12 +135,10 @@
 		goto no_bio;

 	for (i = 0; i < PAGES_PER_PACKET; i++) {
-		pkt->pages[i] = alloc_page(GFP_KERNEL);
+		pkt->pages[i] = alloc_page(GFP_KERNEL|| __GFP_ZERO);
 		if (!pkt->pages[i])
 			goto no_page;
 	}
-	for (i = 0; i < PAGES_PER_PACKET; i++)
-		clear_page(page_address(pkt->pages[i]));

 	spin_lock_init(&pkt->lock);



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V3 [2/4]: Extension of clear_page to take an order
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
@ 2005-01-04 23:14               ` Christoph Lameter
  2005-01-05 23:25                 ` Christoph Lameter
  2005-01-04 23:15               ` Prezeroing V3 [3/4]: Page zeroing through kscrubd Christoph Lameter
  2005-01-04 23:16               ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
  3 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:14 UTC (permalink / raw)
  To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

o Extend clear_page to take an order parameter for all architectures.

Architecture support:
---------------------

Known to work:

ia64
i386
sparc64
m68k

Trivial modification expected to simply work:

arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modification made but it would be good to have some feedback from the arch maintainers:

x86_64
s390
alpha
sh
mips
m32r

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/include/asm-ia64/page.h
=================================--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.10/include/asm-i386/page.h
=================================--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page),order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-x86_64/page.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-sparc/page.h
=================================--- linux-2.6.10.orig/include/asm-sparc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-s390/page.h
=================================--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /* Pure 2^n version of get_order */
Index: linux-2.6.10/arch/i386/lib/mmx.c
=================================--- linux-2.6.10.orig/arch/i386/lib/mmx.c	2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c	2005-01-04 12:34:03.000000000 -0800
@@ -128,7 +128,7 @@
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/mmx.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h	2005-01-04 12:34:03.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S	2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -7,6 +7,7 @@
 clear_page:
 	xorl   %eax,%eax
 	movl   $4096/64,%ecx
+	shl	%esi, %ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -42,6 +43,7 @@
 	.section .altinstr_replacement,"ax"
 clear_page_c:
 	movl $4096/8,%ecx
+	shl	%esi, %ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.10/include/asm-sh/page.h
=================================--- linux-2.6.10.orig/include/asm-sh/page.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.10/include/asm-i386/mmx.h
=================================--- linux-2.6.10.orig/include/asm-i386/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h	2005-01-04 12:34:03.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
=================================--- linux-2.6.10.orig/include/asm-sh64/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -50,12 +50,20 @@
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(page, order)
+{
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+		sh64_page_clear(page++, 0);
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	sh_clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.10/include/asm-h8300/page.h
=================================--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-arm/page.h
=================================--- linux-2.6.10.orig/include/asm-arm/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -128,7 +128,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
=================================--- linux-2.6.10.orig/include/asm-ppc64/page.h	2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, int order)
 {
 	unsigned long lines, line_size;

 	line_size = systemcfg->dCacheL1LineSize;
-	lines = naca->dCacheL1LinesPerPage;
+	lines = naca->dCacheL1LinesPerPage << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
=================================--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-alpha/page.h
=================================--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
=================================--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c	2005-01-04 12:34:03.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
=================================--- linux-2.6.10.orig/include/asm-m68k/page.h	2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -50,7 +50,7 @@
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-mips/page.h
=================================--- linux-2.6.10.orig/include/asm-mips/page.h	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.10/include/asm-m68knommu/page.h
=================================--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-cris/page.h
=================================--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-v850/page.h
=================================--- linux-2.6.10.orig/include/asm-v850/page.h	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-parisc/page.h
=================================--- linux-2.6.10.orig/include/asm-parisc/page.h	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
=================================--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c	2005-01-04 12:34:03.000000000 -0800
@@ -47,7 +47,7 @@
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	_clear_page(kaddr);
 }

 /*
@@ -116,7 +116,7 @@

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	_clear_page((void *)to);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.10/arch/m32r/mm/page.S
=================================--- linux-2.6.10.orig/arch/m32r/mm/page.S	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S	2005-01-04 12:34:03.000000000 -0800
@@ -51,7 +51,7 @@
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
=================================--- linux-2.6.10.orig/include/asm-ppc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -85,7 +85,7 @@

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define  clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
=================================--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c	2005-01-04 12:34:03.000000000 -0800
@@ -88,7 +88,7 @@
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
=================================--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
=================================--- linux-2.6.10.orig/arch/sh/mm/init.c	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c	2005-01-04 12:34:03.000000000 -0800
@@ -57,7 +57,7 @@
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c	2005-01-04 12:34:03.000000000 -0800
@@ -78,7 +78,7 @@
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c	2005-01-04 12:34:03.000000000 -0800
@@ -27,7 +27,7 @@
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
=================================--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c	2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c	2005-01-04 12:34:03.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
=================================--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c	2005-01-04 12:34:03.000000000 -0800
@@ -102,7 +102,7 @@
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
=================================--- linux-2.6.10.orig/include/asm-arm26/page.h	2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -25,7 +25,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
=================================--- linux-2.6.10.orig/include/asm-sparc64/page.h	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -14,8 +14,8 @@

 #ifndef __ASSEMBLY__

-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S	2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -28,9 +28,12 @@
 	.text

 	.globl		_clear_page
-_clear_page:		/* %o0Þst */
+_clear_page:		/* %o0Þst, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1

 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate

+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1

 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V3 [3/4]: Page zeroing through kscrubd
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  2005-01-04 23:14               ` Prezeroing V3 [2/4]: Extension of clear_page to take an order Christoph Lameter
@ 2005-01-04 23:15               ` Christoph Lameter
  2005-01-04 23:16               ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
  3 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:15 UTC (permalink / raw)
  To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

o Add page zeroing
o Add scrub daemon
o Add ability to view amount of zeroed information in /proc/meninfo

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/mm/page_alloc.c
=================================--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-04 14:17:02.000000000 -0800
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Support for page zeroing, Christoph Lameter, SGI, Dec 2004
  */

 #include <linux/config.h>
@@ -33,6 +34,7 @@
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>

@@ -180,7 +182,7 @@
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
+static inline int __free_pages_bulk (struct page *page, struct page *base,
 		struct zone *zone, struct free_area *area, unsigned int order)
 {
 	unsigned long page_idx, index, mask;
@@ -193,11 +195,10 @@
 		BUG();
 	index = page_idx >> (1 + order);

-	zone->free_pages += 1 << order;
 	while (order < MAX_ORDER-1) {
 		struct page *buddy1, *buddy2;

-		BUG_ON(area >= zone->free_area + MAX_ORDER);
+		BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER);
 		if (!__test_and_change_bit(index, area->map))
 			/*
 			 * the buddy page is still allocated.
@@ -219,6 +220,7 @@
 	}
 	list_add(&(base + page_idx)->lru, &area->free_list);
 	area->nr_free++;
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -261,7 +263,7 @@
 	int ret = 0;

 	base = zone->zone_mem_map;
-	area = zone->free_area + order;
+	area = zone->free_area[NOT_ZEROED] + order;
 	spin_lock_irqsave(&zone->lock, flags);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
@@ -269,7 +271,10 @@
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
-		__free_pages_bulk(page, base, zone, area, order);
+		zone->free_pages += 1 << order;
+		if (__free_pages_bulk(page, base, zone, area, order)
+			>= sysctl_scrub_start)
+				wakeup_kscrubd(zone);
 		ret++;
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -291,6 +296,21 @@
 	free_pages_bulk(page_zone(page), 1, &list, order);
 }

+void end_zero_page(struct page *page)
+{
+	unsigned long flags;
+	int order = page->index;
+	struct zone * zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	zone->zero_pages += 1 << order;
+	__free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+
 #define MARK_USED(index, order, area) \
 	__change_bit((index) >> (1+(order)), (area)->map)

@@ -370,26 +390,47 @@
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static void inline rmpage(struct page *page, struct zone *zone, struct free_area *area, int order)
+{
+	list_del(&page->lru);
+	area->nr_free--;
+	if (order != MAX_ORDER-1)
+		MARK_USED(page - zone->zone_mem_map, order, area);
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order)
+{
+	unsigned long flags;
+	struct page *page = NULL;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!list_empty(&area->free_list)) {
+		page = list_entry(area->free_list.next, struct page, lru);
+
+		rmpage(page, zone, area, order);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
-	unsigned int index;

 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = zone->free_area[zero] + current_order;
 		if (list_empty(&area->free_list))
 			continue;

 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		area->nr_free--;
-		index = page - zone->zone_mem_map;
-		if (current_order != MAX_ORDER-1)
-			MARK_USED(index, current_order, area);
+		rmpage(page, zone, area, current_order);
 		zone->free_pages -= 1UL << order;
-		return expand(zone, page, index, order, current_order, area);
+		if (zero)
+			zone->zero_pages -= 1UL << order;
+		return expand(zone, page, page - zone->zone_mem_map, order, current_order, area);
 	}

 	return NULL;
@@ -401,7 +442,7 @@
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int zero)
 {
 	unsigned long flags;
 	int i;
@@ -410,7 +451,7 @@

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
 		if (page = NULL)
 			break;
 		allocated++;
@@ -457,7 +498,7 @@
 		ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));

 	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list) {
+		list_for_each(curr, &zone->free_area[NOT_ZEROED][order].free_list) {
 			unsigned long start_pfn, i;

 			start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -555,7 +596,9 @@
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int nr_pages = 1 << order;
+	int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;

 	if (order = 0) {
 		struct per_cpu_pages *pcp;
@@ -564,7 +607,7 @@
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, zero);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -576,19 +619,30 @@

 	if (page = NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+
+		page = __rmqueue(zone, order, zero);
+
+		/*
+		 * If we failed to obtain a zero and/or unzeroed page
+		 * then we may still be able to obtain the other
+		 * type of page.
+		 */
+		if (!page) {
+			page = __rmqueue(zone, order, !zero);
+			zero = 0;
+		}
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
-		mod_page_state_zone(zone, pgalloc, 1 << order);
-		prep_new_page(page, order);
+		mod_page_state_zone(zone, pgalloc, nr_pages);

-		if (gfp_flags & __GFP_ZERO) {
+		if ((gfp_flags & __GFP_ZERO) && !zero) {
 #ifdef CONFIG_HIGHMEM
 			if (PageHighMem(page)) {
-				int n = 1 << order;
+				int n = nr_pages;

 				while (n-- >0)
 					clear_highpage(page + n);
@@ -596,6 +650,7 @@
 #endif
 			clear_page(page_address(page), order);
 		}
+		prep_new_page(page, order);
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -622,7 +677,7 @@
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
+		free_pages -= (z->free_area[NOT_ZEROED][o].nr_free + z->free_area[ZEROED][o].nr_free)  << o;

 		/* Require fewer higher order pages to be free */
 		min >>= 1;
@@ -1000,7 +1055,7 @@
 }

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
@@ -1008,27 +1063,31 @@
 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
 		*free += zones[i].free_pages;
+		*zero += zones[i].zero_pages;
 	}
 }

 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *free, unsigned long *zero)
 {
 	struct pglist_data *pgdat;

 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n,o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
 		*free += n;
+		*zero += o;
 	}
 }

@@ -1065,6 +1124,7 @@

 #define K(x) ((x) << (PAGE_SHIFT-10))

+const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1077,6 +1137,7 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;
 	struct zone *zone;

 	for_each_zone(zone) {
@@ -1097,10 +1158,10 @@

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1108,20 +1169,21 @@
 	}

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 	printk("\nFree pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));

 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
 		nr_free_pages(),
+		zero,
 		ps.nr_slab,
 		ps.nr_mapped,
 		ps.nr_page_table_pages);
@@ -1170,7 +1232,7 @@

 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
-			nr = zone->free_area[order].nr_free;
+			nr = zone->free_area[NOT_ZEROED][order].nr_free + zone->free_area[ZEROED][order].nr_free;
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
 		}
@@ -1493,16 +1555,21 @@
 	for (order = 0; ; order++) {
 		unsigned long bitmap_size;

-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
 		if (order = MAX_ORDER-1) {
-			zone->free_area[order].map = NULL;
+			zone->free_area[NOT_ZEROED][order].map = NULL;
+			zone->free_area[ZEROED][order].map = NULL;
 			break;
 		}

 		bitmap_size = pages_to_bitmap_size(order, size);
-		zone->free_area[order].map +		zone->free_area[NOT_ZEROED][order].map +		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+		zone->free_area[ZEROED][order].map  		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
-		zone->free_area[order].nr_free = 0;
+		zone->free_area[NOT_ZEROED][order].nr_free = 0;
+		zone->free_area[ZEROED][order].nr_free = 0;
 	}
 }

@@ -1527,6 +1594,7 @@

 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kscrubd_wait);
 	pgdat->kswapd_max_order = 0;

 	for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -1550,6 +1618,7 @@
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->zero_pages = 0;

 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

@@ -1583,6 +1652,13 @@
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
@@ -1708,7 +1784,7 @@
 		spin_lock_irqsave(&zone->lock, flags);
 		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
 		for (order = 0; order < MAX_ORDER; ++order)
-			seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+			seq_printf(m, "%6lu ", zone->free_area[NOT_ZEROED][order].nr_free);
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
Index: linux-2.6.10/include/linux/mmzone.h
=================================--- linux-2.6.10.orig/include/linux/mmzone.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/linux/mmzone.h	2005-01-04 14:17:02.000000000 -0800
@@ -52,7 +52,7 @@
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -108,10 +108,14 @@
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */

+#define NOT_ZEROED 0
+#define ZEROED 1
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		zero_pages;
 	/*
 	 * protection[] is a pre-calculated number of extra pages that must be
 	 * available in a zone in order for __alloc_pages() to allocate memory
@@ -132,7 +136,7 @@
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area[2][MAX_ORDER];


 	ZONE_PADDING(_pad1_)
@@ -267,6 +271,9 @@
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+
+	wait_queue_head_t       kscrubd_wait;
+	struct task_struct *kscrubd;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -276,9 +283,9 @@
 extern struct pglist_data *pgdat_list;

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+			unsigned long *free, unsigned long *zero);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
Index: linux-2.6.10/fs/proc/proc_misc.c
=================================--- linux-2.6.10.orig/fs/proc/proc_misc.c	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c	2005-01-04 14:17:02.000000000 -0800
@@ -158,13 +158,14 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;
 	unsigned long vmtot;
 	unsigned long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 /*
  * display in kilobytes.
@@ -187,6 +188,7 @@
 	len = sprintf(page,
 		"MemTotal:     %8lu kB\n"
 		"MemFree:      %8lu kB\n"
+		"MemZero:      %8lu kB\n"
 		"Buffers:      %8lu kB\n"
 		"Cached:       %8lu kB\n"
 		"SwapCached:   %8lu kB\n"
@@ -210,6 +212,7 @@
 		"VmallocChunk: %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
+		K(zero),
 		K(i.bufferram),
 		K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
 		K(total_swapcache_pages),
Index: linux-2.6.10/mm/readahead.c
=================================--- linux-2.6.10.orig/mm/readahead.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/readahead.c	2005-01-04 14:17:02.000000000 -0800
@@ -573,7 +573,8 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;

-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
Index: linux-2.6.10/drivers/base/node.c
=================================--- linux-2.6.10.orig/drivers/base/node.c	2005-01-04 14:17:00.000000000 -0800
+++ linux-2.6.10/drivers/base/node.c	2005-01-04 14:17:02.000000000 -0800
@@ -41,13 +41,15 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;

 	si_meminfo_node(&i, nid);
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));

 	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
+		       "Node %d MemZero:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
@@ -57,6 +59,7 @@
 		       "Node %d LowFree:      %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
+		       nid, K(zero),
 		       nid, K(i.totalram - i.freeram),
 		       nid, K(active),
 		       nid, K(inactive),
Index: linux-2.6.10/include/linux/sched.h
=================================--- linux-2.6.10.orig/include/linux/sched.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h	2005-01-04 14:17:02.000000000 -0800
@@ -715,6 +715,7 @@
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.10/mm/Makefile
=================================--- linux-2.6.10.orig/mm/Makefile	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/mm/Makefile	2005-01-04 14:17:02.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o prio_tree.o \
Index: linux-2.6.10/mm/scrubd.c
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/mm/scrubd.c	2005-01-04 14:58:46.000000000 -0800
@@ -0,0 +1,147 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = 7;	/* if a page of this order is coalesed then run kscrubd */
+unsigned int sysctl_scrub_stop = 2;	/* Mininum order of page to zero */
+unsigned int sysctl_scrub_load = 999;	/* Do not run scrubd if load > */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area, order);
+			struct list_head *l;
+
+			if (!page)
+				continue;
+
+			page->index = order;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+				unsigned long size = PAGE_SIZE << order;
+
+				if (driver->start(page_address(page), size) = 0) {
+
+					unsigned ticks = (size*HZ)/driver->rate;
+					if (ticks) {
+						/* Wait the minimum time of the transfer */
+						current->state = TASK_INTERRUPTIBLE;
+						schedule_timeout(ticks);
+					}
+					/* Then keep on checking until transfer is complete */
+					while (!driver->check())
+						schedule();
+					goto out;
+				}
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			clear_page(page_address(page), order);
+out:
+			end_zero_page(page);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd
+		= find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.10/include/linux/scrub.h
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/include/linux/scrub.h	2005-01-04 14:17:02.000000000 -0800
@@ -0,0 +1,51 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may that allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+        int (*start)(void *, unsigned long);		/* Start bzero transfer */
+	int (*check)(void);				/* Check if bzero is complete */
+	unsigned long rate;				/* zeroing rate in bytes/sec */
+        struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+extern unsigned int sysctl_scrub_load;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static void inline wakeup_kscrubd(struct zone *zone)
+{
+        if (avenrun[0] >= (unsigned long)sysctl_scrub_load << FSHIFT)
+		return;
+	if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+                return;
+        wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+				      void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
Index: linux-2.6.10/kernel/sysctl.c
=================================--- linux-2.6.10.orig/kernel/sysctl.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c	2005-01-04 14:17:02.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -826,6 +827,33 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_LOAD,
+		.procname	= "scrub_load",
+		.data		= &sysctl_scrub_load,
+		.maxlen		= sizeof(sysctl_scrub_load),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.10/include/linux/sysctl.h
=================================--- linux-2.6.10.orig/include/linux/sysctl.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h	2005-01-04 14:17:02.000000000 -0800
@@ -169,6 +169,9 @@
 	VM_VFS_CACHE_PRESSURE&, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT', /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT(, /* default time for token time out */
+	VM_SCRUB_START0,	/* percentage * 10 at which to start scrubd */
+	VM_SCRUB_STOP1,	/* percentage * 10 at which to stop scrubd */
+	VM_SCRUB_LOAD1,	/* Load factor at which not to scrub anymore */
 };




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
                                 ` (2 preceding siblings ...)
  2005-01-04 23:15               ` Prezeroing V3 [3/4]: Page zeroing through kscrubd Christoph Lameter
@ 2005-01-04 23:16               ` Christoph Lameter
  3 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:16 UTC (permalink / raw)
  To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

o Zeroing driver implemented with the Block Transfer Engine in the Altix
  SN2 SHub.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/arch/ia64/sn/kernel/bte.c
=================================--- linux-2.6.10.orig/arch/ia64/sn/kernel/bte.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/bte.c	2005-01-03 13:36:07.000000000 -0800
@@ -4,6 +4,8 @@
  * for more details.
  *
  * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
  */

 #include <linux/config.h>
@@ -20,6 +22,8 @@
 #include <linux/bootmem.h>
 #include <linux/string.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>

 #include <asm/sn/bte.h>

@@ -30,7 +34,7 @@
 /* two interfaces on two btes */
 #define MAX_INTERFACES_TO_TRY		4

-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
 {
 	nodepda_t *tmp_nodepda;

@@ -132,7 +136,6 @@
 			if (bte = NULL) {
 				continue;
 			}
-
 			if (spin_trylock(&bte->spinlock)) {
 				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
 				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +160,7 @@
 		}
 	} while (1);

-	if (notification = NULL) {
+	if (notification = NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
 		/* User does not want to be notified. */
 		bte->most_rcnt_na = &bte->notify;
 	} else {
@@ -192,6 +195,8 @@

 	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);

+	if (mode & BTE_NOTIFY_AND_GET_POINTER)
+		 *(u64 volatile **)(notification) = &bte->notify;
 	spin_unlock_irqrestore(&bte->spinlock, irq_flags);

 	if (notification != NULL) {
@@ -449,5 +454,37 @@
 		mynodepda->bte_if[i].cleanup_active = 0;
 		mynodepda->bte_if[i].bh_error = 0;
 	}
+}
+
+u64 *bte_zero_notify[MAX_COMPACT_NODES];
+
+static int bte_check_bzero(void)
+{
+	int node = get_nasid();
+
+	return *(bte_zero_notify[node]) != BTE_WORD_BUSY;
+}
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+	int node = get_nasid();
+
+	/* Check limitations.
+		1. System must be running (weird things happen during bootup)
+		2. Size >64KB. Smaller requests cause too much bte traffic
+	 */
+	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+		return EINVAL;
+
+	return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, bte_zero_notify+node);
+}
+
+static struct zero_driver bte_bzero = {
+	.start = bte_start_bzero,
+	.check = bte_check_bzero,
+	.rate = 500000000		/* 500 MB /sec */
+};

+void sn_bte_bzero_init(void) {
+	register_zero_driver(&bte_bzero);
 }
Index: linux-2.6.10/arch/ia64/sn/kernel/setup.c
=================================--- linux-2.6.10.orig/arch/ia64/sn/kernel/setup.c	2004-12-24 13:34:27.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/setup.c	2005-01-03 13:36:07.000000000 -0800
@@ -243,6 +243,7 @@
 	int pxm;
 	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
 	extern void sn_cpu_init(void);
+	extern void sn_bte_bzero_init(void);

 	/*
 	 * If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
 	screen_info = sn_screen_info;

 	sn_timer_init();
+	sn_bte_bzero_init();
 }

 /**
Index: linux-2.6.10/include/asm-ia64/sn/bte.h
=================================--- linux-2.6.10.orig/include/asm-ia64/sn/bte.h	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/sn/bte.h	2005-01-03 13:36:07.000000000 -0800
@@ -48,6 +48,8 @@
 #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
 /* Use a reserved bit to let the caller specify a wait for any BTE */
 #define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
 /* Use the BTE on the node with the destination memory */
 #define BTE_USE_DEST (BTE_WACQUIRE << 1)
 /* Use any available BTE interface on any node for the transfer */


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
@ 2005-01-04 23:45                 ` Dave Hansen
  2005-01-05  1:16                   ` Christoph Lameter
  2005-01-05  0:34                 ` Linus Torvalds
  2005-01-08 21:12                 ` Hugh Dickins
  2 siblings, 1 reply; 99+ messages in thread
From: Dave Hansen @ 2005-01-04 23:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

On Tue, 2005-01-04 at 15:13 -0800, Christoph Lameter wrote:
> +		if (gfp_flags & __GFP_ZERO) {
> +#ifdef CONFIG_HIGHMEM
> +			if (PageHighMem(page)) {
> +				int n = 1 << order;
> +
> +				while (n-- >0)
> +					clear_highpage(page + n);
> +			} else
> +#endif
> +			clear_page(page_address(page), order);
> +		}
>  		if (order && (gfp_flags & __GFP_COMP))
>  			prep_compound_page(page, order);

That #ifdef can probably die.  The compiler should get that all by
itself:

> #ifdef CONFIG_HIGHMEM
> #define PageHighMem(page)       test_bit(PG_highmem, &(page)->flags)
> #else
> #define PageHighMem(page)       0 /* needed to optimize away at compile time */
> #endif

-- Dave


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  2005-01-04 23:45                 ` Dave Hansen
@ 2005-01-05  0:34                 ` Linus Torvalds
  2005-01-05  0:47                   ` Andrew Morton
  2005-01-08 21:12                 ` Hugh Dickins
  2 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2005-01-05  0:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-ia64, linux-mm, Linux Kernel Development

On Tue, 4 Jan 2005, Christoph Lameter wrote:
>
> This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> to request zeroed pages from the page allocator.

Ok, let's start merging this slowly, and in particular, this 1/4 one looks 
pretty much like a cleanup regardless of whatever else happen, so let's 
just do it. However, for it to really be a cleanup, how about making 
_this_ part:

> +
> +		if (gfp_flags & __GFP_ZERO) {
> +#ifdef CONFIG_HIGHMEM
> +			if (PageHighMem(page)) {
> +				int n = 1 << order;
> +
> +				while (n-- >0)
> +					clear_highpage(page + n);
> +			} else
> +#endif
> +			clear_page(page_address(page), order);
> +		}

Match the existing previous part:

>  		if (order && (gfp_flags & __GFP_COMP))
>  			prep_compound_page(page, order);

and just split it up into a "prep_zero_page(page, order)"? I dislike 
#ifdef's in the middle of deep functions. In the middle of a _trivial_ 
function it's much more palatable.

At that point at least part 1 ends up being a nice clean patch on its own, 
and should even shrink the code-size a bit. IOW, it not only is a cleanup, 
there is even a technical argument for it (even without worrying about the 
next stages).

Hmm?

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-05  0:34                 ` Linus Torvalds
@ 2005-01-05  0:47                   ` Andrew Morton
  2005-01-05  1:15                     ` Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2005-01-05  0:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: clameter, linux-ia64, linux-mm, linux-kernel

Linus Torvalds <torvalds@osdl.org> wrote:
>
> On Tue, 4 Jan 2005, Christoph Lameter wrote:
> >
> > This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> > to request zeroed pages from the page allocator.
> 
> Ok, let's start merging this slowly

One week hence, please.  Things like the no-bitmaps-for-the-buddy-allocator
have been well tested and should go in first.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-05  0:47                   ` Andrew Morton
@ 2005-01-05  1:15                     ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-05  1:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, linux-ia64, linux-mm, linux-kernel

On Tue, 4 Jan 2005, Andrew Morton wrote:

> > Ok, let's start merging this slowly
>
> One week hence, please.  Things like the no-bitmaps-for-the-buddy-allocator
> have been well tested and should go in first.

The first two patches are basically cleanup type stuff and will not affect
the page allocator in a significant way. On the other hand they touch many
files and are thus difficult to maintain.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:45                 ` Dave Hansen
@ 2005-01-05  1:16                   ` Christoph Lameter
  2005-01-05  1:26                     ` Linus Torvalds
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2005-01-05  1:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

On Tue, 4 Jan 2005, Dave Hansen wrote:

> That #ifdef can probably die.  The compiler should get that all by
> itself:
>
> > #ifdef CONFIG_HIGHMEM
> > #define PageHighMem(page)       test_bit(PG_highmem, &(page)->flags)
> > #else
> > #define PageHighMem(page)       0 /* needed to optimize away at compile time */
> > #endif

Ahh. Great. Do I need to submit a corrected patch that removes those two
lines or is it fine as is?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-05  1:16                   ` Christoph Lameter
@ 2005-01-05  1:26                     ` Linus Torvalds
  2005-01-05 23:11                       ` Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2005-01-05  1:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Hansen, Andrew Morton, linux-ia64, linux-mm,
	Linux Kernel Development



On Tue, 4 Jan 2005, Christoph Lameter wrote:
> 
> Ahh. Great. Do I need to submit a corrected patch that removes those two
> lines or is it fine as is?

Please do split it up into a function of its own. It's going to look a lot 
prettier as an intermediate phase. I realize that that touches #3 in the 
series, but I suspect that one will also just be prettier as a result.

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-05  1:26                     ` Linus Torvalds
@ 2005-01-05 23:11                       ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-05 23:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Andrew Morton, linux-ia64, linux-mm,
	Linux Kernel Development

On Tue, 4 Jan 2005, Linus Torvalds wrote:

> Please do split it up into a function of its own. It's going to look a lot
> prettier as an intermediate phase. I realize that that touches #3 in the
> series, but I suspect that one will also just be prettier as a result.

Here is the first patch redone as you wanted. I also removed all
dependencies on the second patch. This should be able to get in
on its own.
I will sent the revised second patch dealing with updating clear_page
later and keep back the last two patches until the bitmap thing has been
changed in the buddy allocator.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
to request zeroed pages from the page allocator.

- Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set

- Replace all page zeroing after allocating pages by prior allocations with
  allocations using __GFP_ZERO

Index: linux-2.6.10/mm/page_alloc.c
=================================--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-05 09:32:52.000000000 -0800
@@ -549,6 +549,12 @@
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
+static inline void prep_zero_page(struct page *page, int order) {
+	int i;
+
+	for(i = 0; i < (1 << order); i++)
+		clear_highpage(page + i);
+}

 static struct page *
 buffered_rmqueue(struct zone *zone, int order, int gfp_flags)
@@ -584,6 +590,10 @@
 		BUG_ON(bad_range(zone, page));
 		mod_page_state_zone(zone, pgalloc, 1 << order);
 		prep_new_page(page, order);
+
+		if (gfp_flags & __GFP_ZERO)
+			prep_zero_page(page, order);
+
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -796,12 +806,9 @@
 	 */
 	BUG_ON(gfp_mask & __GFP_HIGHMEM);

-	page = alloc_pages(gfp_mask, 0);
-	if (page) {
-		void *address = page_address(page);
-		clear_page(address);
-		return (unsigned long) address;
-	}
+	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+	if (page)
+		return (unsigned long) page_address(page);
 	return 0;
 }

Index: linux-2.6.10/include/linux/gfp.h
=================================--- linux-2.6.10.orig/include/linux/gfp.h	2004-12-24 13:34:27.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h	2005-01-05 09:30:39.000000000 -0800
@@ -37,6 +37,7 @@
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_ZERO	0x8000	/* Return zeroed page on success */

 #define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.10/mm/memory.c
=================================--- linux-2.6.10.orig/mm/memory.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-05 09:30:39.000000000 -0800
@@ -1650,10 +1650,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.10/kernel/profile.c
=================================--- linux-2.6.10.orig/kernel/profile.c	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/kernel/profile.c	2005-01-05 09:30:39.000000000 -0800
@@ -326,17 +326,15 @@
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				return NOTIFY_BAD;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				goto out_free;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
 		}
 		break;
@@ -510,16 +508,14 @@
 		int node = cpu_to_node(cpu);
 		struct page *page;

-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[0]
 				= (struct profile_hit *)page_address(page);
 	}
Index: linux-2.6.10/mm/shmem.c
=================================--- linux-2.6.10.orig/mm/shmem.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/mm/shmem.c	2005-01-05 09:30:39.000000000 -0800
@@ -369,9 +369,8 @@
 		}

 		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
 		if (page) {
-			clear_highpage(page);
 			page->nr_swapped = 0;
 		}
 		spin_lock(&info->lock);
@@ -910,7 +909,7 @@
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -926,7 +925,7 @@
 shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
 				 unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(gfp | __GFP_ZERO);
 }
 #endif

@@ -1135,7 +1134,6 @@

 		info->alloced++;
 		spin_unlock(&info->lock);
-		clear_highpage(filepage);
 		flush_dcache_page(filepage);
 		SetPageUptodate(filepage);
 	}
Index: linux-2.6.10/include/asm-ia64/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -61,9 +61,7 @@
 	pgd_t *pgd = pgd_alloc_one_fast(mm);

 	if (unlikely(pgd = NULL)) {
-		pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-		if (likely(pgd != NULL))
-			clear_page(pgd);
+		pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	}
 	return pgd;
 }
@@ -106,10 +104,8 @@
 static inline pmd_t*
 pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pmd != NULL))
-		clear_page(pmd);
 	return pmd;
 }

@@ -140,20 +136,16 @@
 static inline struct page *
 pte_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

-	if (likely(pte != NULL))
-		clear_page(page_address(pte));
 	return pte;
 }

 static inline pte_t *
 pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pte != NULL))
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.10/arch/i386/mm/pgtable.c
=================================--- linux-2.6.10.orig/arch/i386/mm/pgtable.c	2005-01-04 14:16:59.000000000 -0800
+++ linux-2.6.10/arch/i386/mm/pgtable.c	2005-01-05 09:30:39.000000000 -0800
@@ -140,10 +140,7 @@

 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
-	return pte;
+	return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 }

 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -151,12 +148,10 @@
 	struct page *pte;

 #ifdef CONFIG_HIGHPTE
-	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
 #else
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 #endif
-	if (pte)
-		clear_highpage(pte);
 	return pte;
 }

Index: linux-2.6.10/include/asm-mips/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-mips/pgalloc.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-mips/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -56,9 +56,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);

 	return pte;
 }
Index: linux-2.6.10/arch/alpha/mm/init.c
=================================--- linux-2.6.10.orig/arch/alpha/mm/init.c	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/arch/alpha/mm/init.c	2005-01-05 09:30:39.000000000 -0800
@@ -42,10 +42,9 @@
 {
 	pgd_t *ret, *init;

-	ret = (pgd_t *)__get_free_page(GFP_KERNEL);
+	ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	init = pgd_offset(&init_mm, 0UL);
 	if (ret) {
-		clear_page(ret);
 #ifdef CONFIG_ALPHA_LARGE_VMALLOC
 		memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
 			(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
@@ -63,9 +62,7 @@
 pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.10/include/asm-parisc/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-parisc/pgalloc.h	2004-12-24 13:35:39.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -120,18 +120,14 @@
 static inline struct page *
 pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(page != NULL))
-		clear_page(page_address(page));
+	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return page;
 }

 static inline pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(pte != NULL))
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.10/include/asm-sparc64/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-sparc64/pgalloc.h	2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -73,10 +73,9 @@
 		struct page *page;

 		preempt_enable();
-		page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+		page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (page) {
 			ret = (struct page *)page_address(page);
-			clear_page(ret);
 			page->lru.prev = (void *) 2UL;

 			preempt_disable();
Index: linux-2.6.10/include/asm-sh/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-sh/pgalloc.h	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-sh/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -44,9 +44,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);

 	return pte;
 }
@@ -56,9 +54,7 @@
 {
 	struct page *pte;

-   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
Index: linux-2.6.10/include/asm-m32r/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-m32r/pgalloc.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -23,10 +23,7 @@
  */
 static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
-	if (pgd)
-		clear_page(pgd);
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pgd;
 }
@@ -39,10 +36,7 @@
 static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
-
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pte;
 }
@@ -50,10 +44,8 @@
 static __inline__ struct page *pte_alloc_one(struct mm_struct *mm,
 	unsigned long address)
 {
-	struct page *pte = alloc_page(GFP_KERNEL);
+	struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO);

-	if (pte)
-		clear_page(page_address(pte));

 	return pte;
 }
Index: linux-2.6.10/arch/um/kernel/mem.c
=================================--- linux-2.6.10.orig/arch/um/kernel/mem.c	2005-01-04 14:17:00.000000000 -0800
+++ linux-2.6.10/arch/um/kernel/mem.c	2005-01-05 09:30:39.000000000 -0800
@@ -327,9 +327,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

@@ -337,9 +335,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_highpage(pte);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.10/include/asm-sh64/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-sh64/pgalloc.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -112,9 +112,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO);

 	return pte;
 }
@@ -123,9 +121,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
@@ -150,9 +146,7 @@
 static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
-	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pmd)
-		clear_page(pmd);
+	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pmd;
 }

Index: linux-2.6.10/include/asm-cris/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-cris/pgalloc.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-cris/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -24,18 +24,14 @@

 extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
  	return pte;
 }

 extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.10/arch/ppc/mm/pgtable.c
=================================--- linux-2.6.10.orig/arch/ppc/mm/pgtable.c	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/arch/ppc/mm/pgtable.c	2005-01-05 09:30:39.000000000 -0800
@@ -85,8 +85,7 @@
 {
 	pgd_t *ret;

-	if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL)
-		clear_pages(ret, PGDIR_ORDER);
+	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
 	return ret;
 }

@@ -102,7 +101,7 @@
 	extern void *early_get_page(void);

 	if (mem_init_done) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (pte) {
 			struct page *ptepage = virt_to_page(pte);
 			ptepage->mapping = (void *) mm;
@@ -110,8 +109,6 @@
 		}
 	} else
 		pte = (pte_t *)early_get_page();
-	if (pte)
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.10/include/asm-alpha/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-alpha/pgalloc.h	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -40,9 +40,7 @@
 static inline pmd_t *
 pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (ret)
-		clear_page(ret);
+	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return ret;
 }

Index: linux-2.6.10/include/asm-m68k/motorola_pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-m68k/motorola_pgalloc.h	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/motorola_pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -12,9 +12,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
@@ -31,7 +30,7 @@

 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	pte_t *pte;

 	if(!page)
@@ -39,7 +38,6 @@

 	pte = kmap(page);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
Index: linux-2.6.10/arch/sparc64/mm/init.c
=================================--- linux-2.6.10.orig/arch/sparc64/mm/init.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/sparc64/mm/init.c	2005-01-05 09:30:39.000000000 -0800
@@ -1687,13 +1687,12 @@
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
 	 */
-	mem_map_zero = alloc_pages(GFP_KERNEL, 0);
+	mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (mem_map_zero = NULL) {
 		prom_printf("paging_init: Cannot alloc zero page.\n");
 		prom_halt();
 	}
 	SetPageReserved(mem_map_zero);
-	clear_page(page_address(mem_map_zero));

 	codepages = (((unsigned long) _etext) - ((unsigned long) _start));
 	codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT;
Index: linux-2.6.10/include/asm-arm/pgalloc.h
=================================--- linux-2.6.10.orig/include/asm-arm/pgalloc.h	2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-arm/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -50,9 +50,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
 		pte += PTRS_PER_PTE;
 	}
@@ -65,10 +64,9 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	if (pte) {
 		void *page = page_address(pte);
-		clear_page(page);
 		clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
 	}

Index: linux-2.6.10/drivers/block/pktcdvd.c
=================================--- linux-2.6.10.orig/drivers/block/pktcdvd.c	2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/drivers/block/pktcdvd.c	2005-01-05 09:30:39.000000000 -0800
@@ -135,12 +135,10 @@
 		goto no_bio;

 	for (i = 0; i < PAGES_PER_PACKET; i++) {
-		pkt->pages[i] = alloc_page(GFP_KERNEL);
+		pkt->pages[i] = alloc_page(GFP_KERNEL|| __GFP_ZERO);
 		if (!pkt->pages[i])
 			goto no_page;
 	}
-	for (i = 0; i < PAGES_PER_PACKET; i++)
-		clear_page(page_address(pkt->pages[i]));

 	spin_lock_init(&pkt->lock);


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [2/4]: Extension of clear_page to take an order
  2005-01-04 23:14               ` Prezeroing V3 [2/4]: Extension of clear_page to take an order Christoph Lameter
@ 2005-01-05 23:25                 ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-05 23:25 UTC (permalink / raw)
  To: Linus Torvalds, linux-ia64, Andrew Morton, linux-mm,
	Linux Kernel Development

Here is an updated version that is independent of the first patch and
contains all the necessary modifications to make clear_page take a second
parameter.

Architecture support:
---------------------

Known to work:

ia64
i386
sparc64
m68k

Trivial modification expected to simply work:

arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modification made but it would be good to have some feedback from the arch maintainers:

x86_64
s390
alpha
sh
mips
m32r

Index: linux-2.6.10/include/asm-ia64/page.h
=================================--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.10/include/asm-i386/page.h
=================================--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page),order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-x86_64/page.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-sparc/page.h
=================================--- linux-2.6.10.orig/include/asm-sparc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-s390/page.h
=================================--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /* Pure 2^n version of get_order */
Index: linux-2.6.10/arch/i386/lib/mmx.c
=================================--- linux-2.6.10.orig/arch/i386/lib/mmx.c	2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c	2005-01-05 10:09:51.000000000 -0800
@@ -128,7 +128,7 @@
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/mmx.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h	2005-01-05 10:09:51.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S	2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -7,6 +7,7 @@
 clear_page:
 	xorl   %eax,%eax
 	movl   $4096/64,%ecx
+	shl	%esi, %ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -42,6 +43,7 @@
 	.section .altinstr_replacement,"ax"
 clear_page_c:
 	movl $4096/8,%ecx
+	shl	%esi, %ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.10/include/asm-sh/page.h
=================================--- linux-2.6.10.orig/include/asm-sh/page.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.10/include/asm-i386/mmx.h
=================================--- linux-2.6.10.orig/include/asm-i386/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h	2005-01-05 10:09:51.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
=================================--- linux-2.6.10.orig/include/asm-sh64/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -50,12 +50,20 @@
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(page, order)
+{
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+		sh64_page_clear(page++, 0);
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	sh_clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.10/include/asm-h8300/page.h
=================================--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-arm/page.h
=================================--- linux-2.6.10.orig/include/asm-arm/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -128,7 +128,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
=================================--- linux-2.6.10.orig/include/asm-ppc64/page.h	2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, int order)
 {
 	unsigned long lines, line_size;

 	line_size = systemcfg->dCacheL1LineSize;
-	lines = naca->dCacheL1LinesPerPage;
+	lines = naca->dCacheL1LinesPerPage << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
=================================--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-alpha/page.h
=================================--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
=================================--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c	2005-01-05 10:09:51.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
=================================--- linux-2.6.10.orig/include/asm-m68k/page.h	2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -50,7 +50,7 @@
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-mips/page.h
=================================--- linux-2.6.10.orig/include/asm-mips/page.h	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.10/include/asm-m68knommu/page.h
=================================--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-cris/page.h
=================================--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-v850/page.h
=================================--- linux-2.6.10.orig/include/asm-v850/page.h	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-parisc/page.h
=================================--- linux-2.6.10.orig/include/asm-parisc/page.h	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
=================================--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c	2005-01-05 10:09:51.000000000 -0800
@@ -47,7 +47,7 @@
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	_clear_page(kaddr);
 }

 /*
@@ -116,7 +116,7 @@

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	_clear_page((void *)to);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.10/arch/m32r/mm/page.S
=================================--- linux-2.6.10.orig/arch/m32r/mm/page.S	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S	2005-01-05 10:09:51.000000000 -0800
@@ -51,7 +51,7 @@
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
=================================--- linux-2.6.10.orig/include/asm-ppc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -85,7 +85,7 @@

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define  clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
=================================--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c	2005-01-05 10:09:51.000000000 -0800
@@ -88,7 +88,7 @@
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
=================================--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
=================================--- linux-2.6.10.orig/arch/sh/mm/init.c	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c	2005-01-05 10:09:51.000000000 -0800
@@ -57,7 +57,7 @@
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c	2005-01-05 10:09:51.000000000 -0800
@@ -78,7 +78,7 @@
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c	2005-01-05 10:09:51.000000000 -0800
@@ -27,7 +27,7 @@
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
=================================--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c	2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c	2005-01-05 10:09:51.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
=================================--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c	2005-01-05 10:09:51.000000000 -0800
@@ -102,7 +102,7 @@
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
=================================--- linux-2.6.10.orig/include/asm-arm26/page.h	2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -25,7 +25,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
=================================--- linux-2.6.10.orig/include/asm-sparc64/page.h	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -14,8 +14,8 @@

 #ifndef __ASSEMBLY__

-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S	2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -28,9 +28,12 @@
 	.text

 	.globl		_clear_page
-_clear_page:		/* %o0Þst */
+_clear_page:		/* %o0Þst, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1

 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate

+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1

 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8
Index: linux-2.6.10/drivers/net/tc35815.c
=================================--- linux-2.6.10.orig/drivers/net/tc35815.c	2005-01-05 09:43:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c	2005-01-05 10:09:51.000000000 -0800
@@ -657,7 +657,7 @@
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
 	} else {
-		clear_page(lp->fd_buf);
+		clear_page(lp->fd_buf, 0);
 #ifdef __mips__
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
Index: linux-2.6.10/mm/page_alloc.c
=================================--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-05 09:32:52.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-05 10:09:51.000000000 -0800
@@ -550,10 +550,14 @@
  * or two.
  */
 static inline void prep_zero_page(struct page *page, int order) {
-	int i;

-	for(i = 0; i < 1 << order; i++)
-		clear_highpage(page + i);
+	if (PageHighMem(page)) {
+		int i;
+
+		for(i = 0; i < 1 << order; i++)
+			clear_highpage(page + i);
+	} else
+		clear_page(page_address(page), order);
 }

 static struct page *
Index: linux-2.6.10/include/linux/highmem.h
=================================--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-05 10:09:44.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-05 10:10:08.000000000 -0800
@@ -45,7 +45,7 @@
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  2005-01-04 23:45                 ` Dave Hansen
  2005-01-05  0:34                 ` Linus Torvalds
@ 2005-01-08 21:12                 ` Hugh Dickins
  2005-01-08 21:56                   ` David S. Miller
  2005-01-10 17:16                   ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  2 siblings, 2 replies; 99+ messages in thread
From: Hugh Dickins @ 2005-01-08 21:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, David S. Miller, linux-ia64, Linus Torvalds,
	linux-mm, Linux Kernel Development

On Tue, 4 Jan 2005, Christoph Lameter wrote:
> This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> to request zeroed pages from the page allocator.
> ...
> --- linux-2.6.10.orig/mm/memory.c	2005-01-04 12:16:41.000000000 -0800
> +++ linux-2.6.10/mm/memory.c	2005-01-04 12:16:49.000000000 -0800
> @@ -1650,10 +1650,9 @@
> 
>  		if (unlikely(anon_vma_prepare(vma)))
>  			goto no_mem;
> -		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> +		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
>  		if (!page)
>  			goto no_mem;
> -		clear_user_highpage(page, addr);
> 
>  		spin_lock(&mm->page_table_lock);
>  		page_table = pte_offset_map(pmd, addr);

Christoph, a late comment: doesn't this effectively replace
do_anonymous_page's clear_user_highpage by clear_highpage, which would
be a bad idea (inefficient? or corrupting?) on those few architectures
which actually do something with that user addr?

Hugh


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-08 21:12                 ` Hugh Dickins
@ 2005-01-08 21:56                   ` David S. Miller
  2005-01-21 20:09                     ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
  2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
  2005-01-10 17:16                   ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  1 sibling, 2 replies; 99+ messages in thread
From: David S. Miller @ 2005-01-08 21:56 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: clameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Sat, 8 Jan 2005 21:12:10 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> Christoph, a late comment: doesn't this effectively replace
> do_anonymous_page's clear_user_highpage by clear_highpage, which would
> be a bad idea (inefficient? or corrupting?) on those few architectures
> which actually do something with that user addr?

Good catch, it probably does.  We really do need to use
the page clearing routines that pass in the user virtual
address when preparing new anonymous pages or else we'll
get cache aliasing problems on sparc, sparc64, and mips
at the very least.  That is what the virtual address argument
was added for to begin with.

The other way to deal with this is to make whatever routine
the kscrubd thing invokes do all the cache flushing et al.
magic so that the above works when taking pages from the
pre-zero'd pool (only, if no pre-zero'd pages are available
we sill need to invoke clear_user_highpage() with the proper
virtual address).

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-08 21:12                 ` Hugh Dickins
  2005-01-08 21:56                   ` David S. Miller
@ 2005-01-10 17:16                   ` Christoph Lameter
  2005-01-10 18:13                     ` Linus Torvalds
  1 sibling, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2005-01-10 17:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David S. Miller, linux-ia64, Linus Torvalds,
	linux-mm, Linux Kernel Development

On Sat, 8 Jan 2005, Hugh Dickins wrote:

> Christoph, a late comment: doesn't this effectively replace
> do_anonymous_page's clear_user_highpage by clear_highpage, which would
> be a bad idea (inefficient? or corrupting?) on those few architectures
> which actually do something with that user addr?

Yes. Right my ia64 centric vision got me again. Thanks for all the other
patches that were posted. I hope this is now all cleared up?


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-10 17:16                   ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
@ 2005-01-10 18:13                     ` Linus Torvalds
  2005-01-10 20:17                       ` Christoph Lameter
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
  0 siblings, 2 replies; 99+ messages in thread
From: Linus Torvalds @ 2005-01-10 18:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development



On Mon, 10 Jan 2005, Christoph Lameter wrote:
>
> Yes. Right my ia64 centric vision got me again. Thanks for all the other
> patches that were posted. I hope this is now all cleared up?

Hmm.. I fixed things up, but I didn't exactly do it like the posted 
patches. 

Currently the BK tree
 - doesn't use __GFP_ZERO with anonymous user-mapped pages (which is what 
   you wrote this whole thing for ;)

   Potential fix: declare a per-architecture "alloc_user_highpage(vaddr)"
   that does the proper magic on virtually indexed machines, and on others 
   it just does a "alloc_page(GFP_HIGHUSER | __GFP_ZERO)".

 - verifies that nobody ever asks for a HIGHMEM allocation together with 
   __GFP_ZERO (nobody does - a quick grep shows that 99% of all uses are
   statically clearly fine (there's a few HIGHMEM zero-page users, but 
   they are all GFP_KERNEL or similar), with just two special cases:

	- get_zeroed_page() - which can't use HIGHMEM anyway
	- shm.c does "mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO"
	  and that's fine because while the mapping gfp masks may lack
	  GFP_FS and GFP_IO, they are always supposed to be ok with 
	  waiting.

 - moves "kernel_map_pages()" into "prep_new_page()" to fix the 
   DEBUG_PAGEALLOC issue (Chris Wright).

So that should take care of the known problems.

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-10 18:13                     ` Linus Torvalds
@ 2005-01-10 20:17                       ` Christoph Lameter
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
  1 sibling, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-10 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

On Mon, 10 Jan 2005, Linus Torvalds wrote:

> Currently the BK tree
>  - doesn't use __GFP_ZERO with anonymous user-mapped pages (which is what
>    you wrote this whole thing for ;)
>
>    Potential fix: declare a per-architecture "alloc_user_highpage(vaddr)"
>    that does the proper magic on virtually indexed machines, and on others
>    it just does a "alloc_page(GFP_HIGHUSER | __GFP_ZERO)".

The following patch adds an alloc_zeroed_user_highpage(vma, vaddr). It
also uses zeroed pages on COW. clear_user_highpage is now only used by
that function. Fold it into alloc_zeroed_user_highpage?

This is against last hours bitkeeper tree. mm/memory.o compiles fine but
I was not able to build a ia64 kernel due to some pieces that seem to be
missing in last hours tree.

Index: linus/include/asm-ia64/page.h
=================================--- linus.orig/include/asm-ia64/page.h	2004-10-20 12:04:58.000000000 -0700
+++ linus/include/asm-ia64/page.h	2005-01-10 12:05:55.000000000 -0800
@@ -75,6 +75,16 @@
 	flush_dcache_page(page);		\
 } while (0)

+
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+({						\
+	struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+	flush_dcache_page(page);		\
+	 page;					\
+})
+
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)

 #ifdef CONFIG_VIRTUAL_MEM_MAP
Index: linus/include/asm-h8300/page.h
=================================--- linus.orig/include/asm-h8300/page.h	2004-10-20 12:04:58.000000000 -0700
+++ linus/include/asm-h8300/page.h	2005-01-10 11:53:17.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/mm/memory.c
=================================--- linus.orig/mm/memory.c	2005-01-10 11:44:39.000000000 -0800
+++ linus/mm/memory.c	2005-01-10 12:05:21.000000000 -0800
@@ -84,20 +84,6 @@
 EXPORT_SYMBOL(vmalloc_earlyreserve);

 /*
- * We special-case the C-O-W ZERO_PAGE, because it's such
- * a common occurrence (no need to read the page to know
- * that it's zero - better for the cache and memory subsystem).
- */
-static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)
-{
-	if (from = ZERO_PAGE(address)) {
-		clear_user_highpage(to, address);
-		return;
-	}
-	copy_user_highpage(to, from, address);
-}
-
-/*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
@@ -1329,11 +1315,16 @@

 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-	if (!new_page)
-		goto no_new_page;
-	copy_cow_page(old_page,new_page,address);
-
+	if (old_page = ZERO_PAGE(address)) {
+		new_page = alloc_zeroed_user_highpage(vma, address);
+		if (!new_page)
+			goto no_new_page;
+	} else {
+		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		if (!new_page)
+			goto no_new_page;
+		copy_user_highpage(new_page, old_page, address);
+	}
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1795,10 +1786,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_zeroed_user_highpage(vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linus/include/asm-m32r/page.h
=================================--- linus.orig/include/asm-m32r/page.h	2004-10-20 12:04:58.000000000 -0700
+++ linus/include/asm-m32r/page.h	2005-01-10 12:08:03.000000000 -0800
@@ -17,6 +17,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/asm-alpha/page.h
=================================--- linus.orig/include/asm-alpha/page.h	2004-10-20 12:04:57.000000000 -0700
+++ linus/include/asm-alpha/page.h	2005-01-10 11:54:37.000000000 -0800
@@ -18,6 +18,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vmaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

Index: linus/include/asm-m68knommu/page.h
=================================--- linus.orig/include/asm-m68knommu/page.h	2005-01-10 09:53:05.000000000 -0800
+++ linus/include/asm-m68knommu/page.h	2005-01-10 11:54:27.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/asm-cris/page.h
=================================--- linus.orig/include/asm-cris/page.h	2004-10-20 12:04:57.000000000 -0700
+++ linus/include/asm-cris/page.h	2005-01-10 11:55:06.000000000 -0800
@@ -21,6 +21,9 @@
 #define clear_user_page(page, vaddr, pg)    clear_page(page)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/linux/highmem.h
=================================--- linus.orig/include/linux/highmem.h	2005-01-06 12:58:48.000000000 -0800
+++ linus/include/linux/highmem.h	2005-01-10 12:08:56.000000000 -0800
@@ -42,6 +42,17 @@
 	smp_wmb();
 }

+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+static inline struct page* alloc_zeroed_user_highpage(struct vm_area_struct *vma,
+	 unsigned long vaddr)
+{
+	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+
+	clear_user_highpage(page, vaddr);
+	return page;
+}
+#endif
+
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
Index: linus/include/asm-i386/page.h
=================================--- linus.orig/include/asm-i386/page.h	2005-01-06 12:58:47.000000000 -0800
+++ linus/include/asm-i386/page.h	2005-01-10 12:09:43.000000000 -0800
@@ -36,6 +36,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/asm-x86_64/page.h
=================================--- linus.orig/include/asm-x86_64/page.h	2005-01-06 12:58:48.000000000 -0800
+++ linus/include/asm-x86_64/page.h	2005-01-10 11:56:04.000000000 -0800
@@ -38,6 +38,8 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/asm-s390/page.h
=================================--- linus.orig/include/asm-s390/page.h	2004-10-20 12:04:59.000000000 -0700
+++ linus/include/asm-s390/page.h	2005-01-10 11:56:33.000000000 -0800
@@ -106,6 +106,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /* Pure 2^n version of get_order */
 extern __inline__ int get_order(unsigned long size)
 {

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V4 [0/4]: Overview
  2005-01-10 18:13                     ` Linus Torvalds
  2005-01-10 20:17                       ` Christoph Lameter
@ 2005-01-10 23:53                       ` Christoph Lameter
  2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
                                           ` (3 more replies)
  1 sibling, 4 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

Changes from V3 to V4:
o Drop __GFP_ZERO patch since its in Linus tree. Include new patch that allows
  archs that need special measures around zeroing of user pages during a page
  fault to maintain their special adaptations.
o Use zeroed pages during COW.
o Updates for clear_page for various platforms. Make clear_page an optional
  patch and fall back to a series of clear_page without order if the patch
  to expand clear_page patch has not been applied.
o x86_64 asm code fixed up
o Port patches to 2.6.10-bk13 and make it fit the bitmapless buddy allocator

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running in
SMP systems with a high number of cpus. The single thread performance does
just show minor increases. Only the performance of multi-threaded
applications increases significantly.

The most expensive operation in the page fault handler is (apart of SMP
locking overhead) the zeroing of the page that is also done in the page fault
handler. This zeroing means that all cachelines of the faulted page (on Altix
that means all 128 cachelines of 128 byte each) must be loaded and later
written back. This patch allows to avoid having to load all cachelines
if only a part of the cachelines of that page is needed immediately after
the fault. Doing so will only be effective for sparsely accessed memory
which is typical for anonymous memory and pte maps. Prezeroed pages will
only be used for those purposes. Unzeroed pages will be used as usual for
file mapping, page caching etc etc.

The patch makes prezeroing very effective by:

1. Aggregating zeroing operations to only apply to pages of higher order,
which results in many pages that will later become zero 0 to be zeroed in one
step.
For that purpose the existing clear_page function is extended and made to
take an additional argument specifying the order of the page to be cleared.

2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.

The scrub daemon is invoked when a unzeroed page of a certain order has
been generated so that its worth running it. If no higher order pages are
present then the logic will favor hot zeroing rather than simply shifting
processing around. kscrubd typically runs only for a fraction of a second
and sleeps for long periods of time even under memory benchmarking. kscrubd
performs short bursts of zeroing when needed and tries to stay out off the
processor as much as possible.

The benefits of prezeroing are reduced to minimal quantities if all
cachelines of a page are touched. Prezeroing can only be effective
if the whole page is not immediately used after the page fault.

The patch is composed of 4 parts:

[1/4] GFP_ZERO fixups
	Adds alloc_zeroed_user_highpage(vma, vaddr) that may be customized for
	each arch by defining __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE. Includes
	proper definitions for a large selection of arches, others fall back to
	the default function in include/linux/highmem.h (and falls back to not
	using prezeroed pages).

[2/4] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background daemon
	called scrubd. scrubd is disabled by default but can be enabled
	by writing an order number to /proc/sys/vm/scrub_start. If a page
	is coalesced of that order or higher then the scrub daemon will
	start zeroing until all pages of order /proc/sys/vm/scrub_stop and
	higher are zeroed and then go back to sleep.

	In an SMP environment the scrub daemon is typically
	running on the most idle cpu. Thus a single threaded application running
	on one cpu may have the other cpu zeroing pages for it etc. The scrub
	daemon is hardly noticable and usually finished zeroing quickly since
	most processors are optimized for linear memory filling.

The following patches increase performance but may be omitted:

[2/4] SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into hardware.
	With hardware support the impact of zeroing on the system is reduced
	to a minimum.

[4/4] Architecture specific clear_page updates
	Adds second order argument to clear_page and updates all arches.
	This allows the zeroing of large areas of memory without repeately
	invoking clear_page() for the page allocator, scrubd and the huge
	page allocator.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
@ 2005-01-10 23:54                         ` Christoph Lameter
  2005-01-11  0:41                           ` Chris Wright
  2005-01-10 23:55                         ` Prezeroing V4 [2/4]: Zeroing implementation Christoph Lameter
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

This patch fixes the __GFP_ZERO related code by adding a new function
alloc_zeroed_user_highpage that is then used in the anonymous page fault
handler and in the COW code to allocate pages. The function can be defined
per arch to setup special processing for user pages by defining
__HAVE_ARCH_ALLOC_ZEROED_USER_PAGE.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/include/asm-ia64/page.h
=================================--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -75,6 +75,16 @@
 	flush_dcache_page(page);		\
 } while (0)

+
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+({						\
+	struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+	flush_dcache_page(page);		\
+	 page;					\
+})
+
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)

 #ifdef CONFIG_VIRTUAL_MEM_MAP
Index: linux-2.6.10/include/asm-h8300/page.h
=================================--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/mm/memory.c
=================================--- linux-2.6.10.orig/mm/memory.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-10 13:54:30.000000000 -0800
@@ -84,20 +84,6 @@
 EXPORT_SYMBOL(vmalloc_earlyreserve);

 /*
- * We special-case the C-O-W ZERO_PAGE, because it's such
- * a common occurrence (no need to read the page to know
- * that it's zero - better for the cache and memory subsystem).
- */
-static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)
-{
-	if (from = ZERO_PAGE(address)) {
-		clear_user_highpage(to, address);
-		return;
-	}
-	copy_user_highpage(to, from, address);
-}
-
-/*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
@@ -1329,11 +1315,16 @@

 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-	if (!new_page)
-		goto no_new_page;
-	copy_cow_page(old_page,new_page,address);
-
+	if (old_page = ZERO_PAGE(address)) {
+		new_page = alloc_zeroed_user_highpage(vma, address);
+		if (!new_page)
+			goto no_new_page;
+	} else {
+		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		if (!new_page)
+			goto no_new_page;
+		copy_user_highpage(new_page, old_page, address);
+	}
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1795,7 +1786,7 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
+		page = alloc_zeroed_user_highpage(vma, addr);
 		if (!page)
 			goto no_mem;

Index: linux-2.6.10/include/asm-m32r/page.h
=================================--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -17,6 +17,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-alpha/page.h
=================================--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -18,6 +18,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vmaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

Index: linux-2.6.10/include/asm-m68knommu/page.h
=================================--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-cris/page.h
=================================--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -21,6 +21,9 @@
 #define clear_user_page(page, vaddr, pg)    clear_page(page)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/linux/highmem.h
=================================--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-10 13:53:59.000000000 -0800
@@ -42,6 +42,17 @@
 	smp_wmb();
 }

+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+static inline struct page* alloc_zeroed_user_highpage(struct vm_area_struct *vma,
+	 unsigned long vaddr)
+{
+	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+
+	clear_user_highpage(page, vaddr);
+	return page;
+}
+#endif
+
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
Index: linux-2.6.10/include/asm-i386/page.h
=================================--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -36,6 +36,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-x86_64/page.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -38,6 +38,8 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-s390/page.h
=================================--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -106,6 +106,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /* Pure 2^n version of get_order */
 extern __inline__ int get_order(unsigned long size)
 {


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V4 [2/4]: Zeroing implementation
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
  2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
@ 2005-01-10 23:55                         ` Christoph Lameter
  2005-01-10 23:55                         ` Prezeroing V4 [3/4]: Altix SN2 BTE zero driver Christoph Lameter
  2005-01-10 23:56                         ` Prezeroing V4 [4/4]: Extend clear_page to take an order parameter Christoph Lameter
  3 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

o Add page zeroing
o Add scrub daemon
o Add ability to view amount of zeroed information in /proc/meninfo

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/mm/page_alloc.c
=================================--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-10 14:44:22.000000000 -0800
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Support for page zeroing, Christoph Lameter, SGI, Dec 2004
  */

 #include <linux/config.h>
@@ -33,6 +34,7 @@
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -167,16 +169,16 @@
  * zone->lock is already acquired when we use these.
  * So, we don't need atomic page->flags operations here.
  */
-static inline unsigned long page_order(struct page *page) {
+static inline unsigned long page_zorder(struct page *page) {
 	return page->private;
 }

-static inline void set_page_order(struct page *page, int order) {
-	page->private = order;
+static inline void set_page_zorder(struct page *page, int order, int zero) {
+	page->private = order + (zero << 10);
 	__SetPagePrivate(page);
 }

-static inline void rmv_page_order(struct page *page)
+static inline void rmv_page_zorder(struct page *page)
 {
 	__ClearPagePrivate(page);
 	page->private = 0;
@@ -187,14 +189,15 @@
  * we can do coalesce a page and its buddy if
  * (a) the buddy is free &&
  * (b) the buddy is on the buddy system &&
- * (c) a page and its buddy have the same order.
+ * (c) a page and its buddy have the same order and the same
+ *     zeroing status.
  * for recording page's order, we use page->private and PG_private.
  *
  */
-static inline int page_is_buddy(struct page *page, int order)
+static inline int page_is_buddy(struct page *page, int order, int zero)
 {
        if (PagePrivate(page)           &&
-           (page_order(page) = order) &&
+           (page_zorder(page) = order + (zero << 10)) &&
            !PageReserved(page)         &&
             page_count(page) = 0)
                return 1;
@@ -225,22 +228,20 @@
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
-		struct zone *zone, unsigned int order)
+static inline int __free_pages_bulk (struct page *page, struct page *base,
+		struct zone *zone, unsigned int order, int zero)
 {
 	unsigned long page_idx;
 	struct page *coalesced;
-	int order_size = 1 << order;

 	if (unlikely(order))
 		destroy_compound_page(page, order);

 	page_idx = page - base;

-	BUG_ON(page_idx & (order_size - 1));
+	BUG_ON(page_idx & (( 1 << order) - 1));
 	BUG_ON(bad_range(zone, page));

-	zone->free_pages += order_size;
 	while (order < MAX_ORDER-1) {
 		struct free_area *area;
 		struct page *buddy;
@@ -250,20 +251,21 @@
 		buddy = base + buddy_idx;
 		if (bad_range(zone, buddy))
 			break;
-		if (!page_is_buddy(buddy, order))
+		if (!page_is_buddy(buddy, order, zero))
 			break;
 		/* Move the buddy up one level. */
 		list_del(&buddy->lru);
-		area = zone->free_area + order;
+		area = zone->free_area[zero] + order;
 		area->nr_free--;
-		rmv_page_order(buddy);
+		rmv_page_zorder(buddy);
 		page_idx &= buddy_idx;
 		order++;
 	}
 	coalesced = base + page_idx;
-	set_page_order(coalesced, order);
-	list_add(&coalesced->lru, &zone->free_area[order].free_list);
-	zone->free_area[order].nr_free++;
+	set_page_zorder(coalesced, order, zero);
+	list_add(&coalesced->lru, &zone->free_area[zero][order].free_list);
+	zone->free_area[zero][order].nr_free++;
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -312,8 +314,11 @@
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
-		__free_pages_bulk(page, base, zone, order);
+		if (__free_pages_bulk(page, base, zone, order, NOT_ZEROED)
+			>= sysctl_scrub_start)
+				wakeup_kscrubd(zone);
 		ret++;
+		zone->free_pages += 1UL << order;
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
@@ -341,6 +346,18 @@
 	free_pages_bulk(page_zone(page), 1, &list, order);
 }

+void end_zero_page(struct page *page, unsigned int order)
+{
+	unsigned long flags;
+	struct zone * zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	__free_pages_bulk(page, zone->zone_mem_map, zone, order, ZEROED);
+	zone->zero_pages += 1UL << order;
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}

 /*
  * The order of subdivision here is critical for the IO subsystem.
@@ -358,7 +375,7 @@
  */
 static inline struct page *
 expand(struct zone *zone, struct page *page,
- 	int low, int high, struct free_area *area)
+ 	int low, int high, struct free_area *area, int zero)
 {
 	unsigned long size = 1 << high;

@@ -369,7 +386,7 @@
 		BUG_ON(bad_range(zone, &page[size]));
 		list_add(&page[size].lru, &area->free_list);
 		area->nr_free++;
-		set_page_order(&page[size], high);
+		set_page_zorder(&page[size], high, zero);
 	}
 	return page;
 }
@@ -419,23 +436,44 @@
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static void inline rmpage(struct page *page, struct free_area *area)
+{
+	list_del(&page->lru);
+	rmv_page_zorder(page);
+	area->nr_free--;
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area)
+{
+	unsigned long flags;
+	struct page *page = NULL;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	if (!list_empty(&area->free_list)) {
+		page = list_entry(area->free_list.next, struct page, lru);
+		rmpage(page, area);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
 {
-	struct free_area * area;
+	struct free_area *area;
 	unsigned int current_order;
 	struct page *page;

 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = zone->free_area[zero] + current_order;
 		if (list_empty(&area->free_list))
 			continue;

 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		rmv_page_order(page);
-		area->nr_free--;
+		rmpage(page, zone->free_area[zero] + current_order);
 		zone->free_pages -= 1UL << order;
-		return expand(zone, page, order, current_order, area);
+		if (zero)
+			zone->zero_pages -= 1UL << order;
+		return expand(zone, page, order, current_order, area, zero);
 	}

 	return NULL;
@@ -447,7 +485,7 @@
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int zero)
 {
 	unsigned long flags;
 	int i;
@@ -456,7 +494,7 @@

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
 		if (page = NULL)
 			break;
 		allocated++;
@@ -503,7 +541,7 @@
 		ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));

 	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list) {
+		list_for_each(curr, &zone->free_area[NOT_ZEROED][order].free_list) {
 			unsigned long start_pfn, i;

 			start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -595,7 +633,7 @@
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
-static inline void prep_zero_page(struct page *page, int order)
+void prep_zero_page(struct page *page, unsigned int order)
 {
 	int i;

@@ -608,7 +646,9 @@
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int nr_pages = 1 << order;
+	int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;

 	if (order = 0) {
 		struct per_cpu_pages *pcp;
@@ -617,7 +657,7 @@
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, zero);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -629,16 +669,25 @@

 	if (page = NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
+		/*
+		 * If we failed to obtain a zero and/or unzeroed page
+		 * then we may still be able to obtain the other
+		 * type of page.
+		 */
+		if (!page) {
+			page = __rmqueue(zone, order, !zero);
+			zero = 0;
+		}
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
-		mod_page_state_zone(zone, pgalloc, 1 << order);
+		mod_page_state_zone(zone, pgalloc, nr_pages);
 		prep_new_page(page, order);

-		if (gfp_flags & __GFP_ZERO)
+		if ((gfp_flags & __GFP_ZERO) && !zero)
 			prep_zero_page(page, order);

 		if (order && (gfp_flags & __GFP_COMP))
@@ -667,7 +716,7 @@
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
+		free_pages -= (z->free_area[NOT_ZEROED][o].nr_free + z->free_area[ZEROED][o].nr_free)  << o;

 		/* Require fewer higher order pages to be free */
 		min >>= 1;
@@ -1045,7 +1094,7 @@
 }

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
@@ -1053,27 +1102,31 @@
 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
 		*free += zones[i].free_pages;
+		*zero += zones[i].zero_pages;
 	}
 }

 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *free, unsigned long *zero)
 {
 	struct pglist_data *pgdat;

 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n,o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
 		*free += n;
+		*zero += o;
 	}
 }

@@ -1110,6 +1163,7 @@

 #define K(x) ((x) << (PAGE_SHIFT-10))

+const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1122,6 +1176,7 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;
 	struct zone *zone;

 	for_each_zone(zone) {
@@ -1142,10 +1197,10 @@

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1153,20 +1208,21 @@
 	}

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 	printk("\nFree pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));

 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
 		nr_free_pages(),
+		zero,
 		ps.nr_slab,
 		ps.nr_mapped,
 		ps.nr_page_table_pages);
@@ -1215,7 +1271,7 @@

 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
-			nr = zone->free_area[order].nr_free;
+			nr = zone->free_area[NOT_ZEROED][order].nr_free + zone->free_area[ZEROED][order].nr_free;
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
 		}
@@ -1515,8 +1571,10 @@
 {
 	int order;
 	for (order = 0; order < MAX_ORDER ; order++) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
-		zone->free_area[order].nr_free = 0;
+		INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
+		zone->free_area[NOT_ZEROED][order].nr_free = 0;
+		zone->free_area[ZEROED][order].nr_free = 0;
 	}
 }

@@ -1541,6 +1599,7 @@

 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kscrubd_wait);
 	pgdat->kswapd_max_order = 0;

 	for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -1564,6 +1623,7 @@
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->zero_pages = 0;

 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

@@ -1597,6 +1657,13 @@
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
@@ -1722,7 +1789,7 @@
 		spin_lock_irqsave(&zone->lock, flags);
 		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
 		for (order = 0; order < MAX_ORDER; ++order)
-			seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+			seq_printf(m, "%6lu ", zone->free_area[NOT_ZEROED][order].nr_free);
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
Index: linux-2.6.10/include/linux/mmzone.h
=================================--- linux-2.6.10.orig/include/linux/mmzone.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/mmzone.h	2005-01-10 13:54:50.000000000 -0800
@@ -51,7 +51,7 @@
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -107,10 +107,14 @@
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */

+#define NOT_ZEROED 0
+#define ZEROED 1
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		zero_pages;
 	/*
 	 * protection[] is a pre-calculated number of extra pages that must be
 	 * available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area[2][MAX_ORDER];


 	ZONE_PADDING(_pad1_)
@@ -266,6 +270,9 @@
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+
+	wait_queue_head_t       kscrubd_wait;
+	struct task_struct *kscrubd;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@
 extern struct pglist_data *pgdat_list;

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+			unsigned long *free, unsigned long *zero);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
Index: linux-2.6.10/fs/proc/proc_misc.c
=================================--- linux-2.6.10.orig/fs/proc/proc_misc.c	2005-01-10 13:48:10.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c	2005-01-10 13:54:50.000000000 -0800
@@ -123,12 +123,13 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;
 	unsigned long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 /*
  * display in kilobytes.
@@ -148,6 +149,7 @@
 	len = sprintf(page,
 		"MemTotal:     %8lu kB\n"
 		"MemFree:      %8lu kB\n"
+		"MemZero:      %8lu kB\n"
 		"Buffers:      %8lu kB\n"
 		"Cached:       %8lu kB\n"
 		"SwapCached:   %8lu kB\n"
@@ -171,6 +173,7 @@
 		"VmallocChunk: %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
+		K(zero),
 		K(i.bufferram),
 		K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
 		K(total_swapcache_pages),
Index: linux-2.6.10/mm/readahead.c
=================================--- linux-2.6.10.orig/mm/readahead.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/readahead.c	2005-01-10 13:54:50.000000000 -0800
@@ -573,7 +573,8 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;

-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
Index: linux-2.6.10/drivers/base/node.c
=================================--- linux-2.6.10.orig/drivers/base/node.c	2005-01-10 13:48:08.000000000 -0800
+++ linux-2.6.10/drivers/base/node.c	2005-01-10 13:54:50.000000000 -0800
@@ -42,13 +42,15 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;

 	si_meminfo_node(&i, nid);
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));

 	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
+		       "Node %d MemZero:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
@@ -58,6 +60,7 @@
 		       "Node %d LowFree:      %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
+		       nid, K(zero),
 		       nid, K(i.totalram - i.freeram),
 		       nid, K(active),
 		       nid, K(inactive),
Index: linux-2.6.10/include/linux/sched.h
=================================--- linux-2.6.10.orig/include/linux/sched.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h	2005-01-10 13:54:50.000000000 -0800
@@ -731,6 +731,7 @@
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.10/mm/Makefile
=================================--- linux-2.6.10.orig/mm/Makefile	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/Makefile	2005-01-10 13:54:50.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
Index: linux-2.6.10/mm/scrubd.c
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/mm/scrubd.c	2005-01-10 14:56:20.000000000 -0800
@@ -0,0 +1,134 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = 5;	/* if a page of this order is coalesed then run kscrubd */
+unsigned int sysctl_scrub_stop = 2;	/* Mininum order of page to zero */
+unsigned int sysctl_scrub_load = 999;	/* Do not run scrubd if load > */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area);
+			struct list_head *l;
+			int size = PAGE_SIZE << order;
+
+			if (!page)
+				continue;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+
+				if (driver->start(page_address(page), size) = 0)
+					goto done;
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			prep_zero_page(page, order);
+done:
+			end_zero_page(page, order);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd
+		= find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.10/include/linux/scrub.h
=================================--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/include/linux/scrub.h	2005-01-10 14:34:25.000000000 -0800
@@ -0,0 +1,49 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may that allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+        int (*start)(void *, unsigned long);		/* Start bzero transfer */
+        struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+extern unsigned int sysctl_scrub_load;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area);
+
+static void inline wakeup_kscrubd(struct zone *zone)
+{
+        if (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT))
+		return;
+	if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+                return;
+        wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+				      void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page, unsigned int order);
+#endif
Index: linux-2.6.10/kernel/sysctl.c
=================================--- linux-2.6.10.orig/kernel/sysctl.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c	2005-01-10 13:54:50.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -827,6 +828,33 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_LOAD,
+		.procname	= "scrub_load",
+		.data		= &sysctl_scrub_load,
+		.maxlen		= sizeof(sysctl_scrub_load),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.10/include/linux/sysctl.h
=================================--- linux-2.6.10.orig/include/linux/sysctl.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h	2005-01-10 13:54:50.000000000 -0800
@@ -169,6 +169,9 @@
 	VM_VFS_CACHE_PRESSURE&, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT', /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT(, /* default time for token time out */
+	VM_SCRUB_START0,	/* percentage * 10 at which to start scrubd */
+	VM_SCRUB_STOP1,	/* percentage * 10 at which to stop scrubd */
+	VM_SCRUB_LOAD2,	/* Load factor at which not to scrub anymore */
 };


Index: linux-2.6.10/include/linux/gfp.h
=================================--- linux-2.6.10.orig/include/linux/gfp.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h	2005-01-10 13:54:50.000000000 -0800
@@ -132,4 +132,5 @@

 void page_alloc_init(void);

+void prep_zero_page(struct page *, unsigned int order);
 #endif /* __LINUX_GFP_H */


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V4 [3/4]: Altix SN2 BTE zero driver
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
  2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
  2005-01-10 23:55                         ` Prezeroing V4 [2/4]: Zeroing implementation Christoph Lameter
@ 2005-01-10 23:55                         ` Christoph Lameter
  2005-01-10 23:56                         ` Prezeroing V4 [4/4]: Extend clear_page to take an order parameter Christoph Lameter
  3 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

o Zeroing driver implemented with the Block Transfer Engine in the Altix
  SN2 SHub.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/arch/ia64/sn/kernel/bte.c
=================================--- linux-2.6.10.orig/arch/ia64/sn/kernel/bte.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/bte.c	2005-01-10 13:54:52.000000000 -0800
@@ -4,6 +4,8 @@
  * for more details.
  *
  * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
  */

 #include <linux/config.h>
@@ -20,6 +22,8 @@
 #include <linux/bootmem.h>
 #include <linux/string.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>

 #include <asm/sn/bte.h>

@@ -30,7 +34,7 @@
 /* two interfaces on two btes */
 #define MAX_INTERFACES_TO_TRY		4

-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
 {
 	nodepda_t *tmp_nodepda;

@@ -132,7 +136,6 @@
 			if (bte = NULL) {
 				continue;
 			}
-
 			if (spin_trylock(&bte->spinlock)) {
 				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
 				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +160,7 @@
 		}
 	} while (1);

-	if (notification = NULL) {
+	if (notification = NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
 		/* User does not want to be notified. */
 		bte->most_rcnt_na = &bte->notify;
 	} else {
@@ -192,6 +195,8 @@

 	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);

+	if (mode & BTE_NOTIFY_AND_GET_POINTER)
+		 *(u64 volatile **)(notification) = &bte->notify;
 	spin_unlock_irqrestore(&bte->spinlock, irq_flags);

 	if (notification != NULL) {
@@ -449,5 +454,47 @@
 		mynodepda->bte_if[i].cleanup_active = 0;
 		mynodepda->bte_if[i].bh_error = 0;
 	}
+}
+
+u64 *bte_zero_notify[MAX_COMPACT_NODES];
+
+#define ZERO_RATE_PER_SEC 500000000
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+	int rc;
+	int ticks;
+	int node = get_nasid();
+
+	/* Check limitations.
+		1. System must be running (weird things happen during bootup)
+		2. Size >64KB. Smaller requests cause too much bte traffic
+	 */
+	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+		return EINVAL;
+
+	rc = bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, bte_zero_notify+node);
+	if (rc)
+		return rc;
+
+	ticks = (len*HZ)/ZERO_RATE_PER_SEC;
+	if (ticks) {
+		/* Wait the minimum time of the transfer */
+		current->state = TASK_INTERRUPTIBLE;
+		schedule_timeout(ticks);
+	}
+	while (*(bte_zero_notify[node]) != BTE_WORD_BUSY) {
+		/* Then keep on checking until transfer is complete */
+		cpu_relax();
+		schedule();
+	}
+	return 0;
+}
+
+static struct zero_driver bte_bzero = {
+	.start = bte_start_bzero,
+};

+void sn_bte_bzero_init(void) {
+	register_zero_driver(&bte_bzero);
 }
Index: linux-2.6.10/arch/ia64/sn/kernel/setup.c
=================================--- linux-2.6.10.orig/arch/ia64/sn/kernel/setup.c	2005-01-10 13:48:08.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/setup.c	2005-01-10 13:54:52.000000000 -0800
@@ -244,6 +244,7 @@
 	int pxm;
 	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
 	extern void sn_cpu_init(void);
+	extern void sn_bte_bzero_init(void);

 	/*
 	 * If the generic code has enabled vga console support - lets
@@ -334,6 +335,7 @@
 	screen_info = sn_screen_info;

 	sn_timer_init();
+	sn_bte_bzero_init();
 }

 /**
Index: linux-2.6.10/include/asm-ia64/sn/bte.h
=================================--- linux-2.6.10.orig/include/asm-ia64/sn/bte.h	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/sn/bte.h	2005-01-10 13:54:52.000000000 -0800
@@ -48,6 +48,8 @@
 #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
 /* Use a reserved bit to let the caller specify a wait for any BTE */
 #define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
 /* Use the BTE on the node with the destination memory */
 #define BTE_USE_DEST (BTE_WACQUIRE << 1)
 /* Use any available BTE interface on any node for the transfer */


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Prezeroing V4 [4/4]: Extend clear_page to take an order parameter
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
                                           ` (2 preceding siblings ...)
  2005-01-10 23:55                         ` Prezeroing V4 [3/4]: Altix SN2 BTE zero driver Christoph Lameter
@ 2005-01-10 23:56                         ` Christoph Lameter
  3 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development


- Extend clear_page to take an order parameter.

Architecture support:
---------------------

Known to work:

ia64
i386
x86_64
sparc64
m68k

Trivial modification expected to simply work:

arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modification made but it would be good to have some feedback from the arch maintainers:

s390
alpha
sh
mips
m32r

Index: linux-2.6.10/include/asm-ia64/page.h
=================================--- linux-2.6.10.orig/include/asm-ia64/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-10 14:23:21.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.10/include/asm-i386/page.h
=================================--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page),order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-x86_64/page.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-sparc/page.h
=================================--- linux-2.6.10.orig/include/asm-sparc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-s390/page.h
=================================--- linux-2.6.10.orig/include/asm-s390/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/arch/i386/lib/mmx.c
=================================--- linux-2.6.10.orig/arch/i386/lib/mmx.c	2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c	2005-01-10 14:23:22.000000000 -0800
@@ -128,7 +128,7 @@
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/mmx.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h	2005-01-10 14:23:22.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S	2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -1,12 +1,16 @@
 /*
  * Zero a page.
  * rdi	page
+ * rsi	order
  */
 	.globl clear_page
 	.p2align 4
 clear_page:
+	movl   $4096/64,%eax
+	movl	%esi, %ecx
+	shll	%cl, %eax
+	movl	%eax, %ecx
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -41,7 +45,10 @@

 	.section .altinstr_replacement,"ax"
 clear_page_c:
-	movl $4096/8,%ecx
+	movl $4096/8,%eax
+	movl %esi, %ecx
+	shll %cl, %eax
+	movl %eax, %ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.10/include/asm-sh/page.h
=================================--- linux-2.6.10.orig/include/asm-sh/page.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.10/include/asm-i386/mmx.h
=================================--- linux-2.6.10.orig/include/asm-i386/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h	2005-01-10 14:23:22.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
=================================--- linux-2.6.10.orig/include/asm-sh64/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -50,12 +50,20 @@
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(page, order)
+{
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+		sh64_page_clear(page++, 0);
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	sh_clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.10/include/asm-h8300/page.h
=================================--- linux-2.6.10.orig/include/asm-h8300/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-arm/page.h
=================================--- linux-2.6.10.orig/include/asm-arm/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -128,7 +128,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
=================================--- linux-2.6.10.orig/include/asm-ppc64/page.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, unsigned int order)
 {
 	unsigned long lines, line_size;

 	line_size = ppc64_caches.dline_size;
-	lines = ppc64_caches.dlines_per_page;
+	lines = ppc64_caches.dlines_per_page << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
=================================--- linux-2.6.10.orig/include/asm-m32r/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-alpha/page.h
=================================--- linux-2.6.10.orig/include/asm-alpha/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vmaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
=================================--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c	2005-01-10 14:23:22.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
=================================--- linux-2.6.10.orig/include/asm-m68k/page.h	2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -50,7 +50,7 @@
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-mips/page.h
=================================--- linux-2.6.10.orig/include/asm-mips/page.h	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.10/include/asm-m68knommu/page.h
=================================--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-cris/page.h
=================================--- linux-2.6.10.orig/include/asm-cris/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-v850/page.h
=================================--- linux-2.6.10.orig/include/asm-v850/page.h	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-parisc/page.h
=================================--- linux-2.6.10.orig/include/asm-parisc/page.h	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
=================================--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c	2005-01-10 14:23:22.000000000 -0800
@@ -47,7 +47,7 @@
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	_clear_page(kaddr);
 }

 /*
@@ -116,7 +116,7 @@

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	_clear_page((void *)to);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.10/arch/m32r/mm/page.S
=================================--- linux-2.6.10.orig/arch/m32r/mm/page.S	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S	2005-01-10 14:23:22.000000000 -0800
@@ -51,7 +51,7 @@
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
=================================--- linux-2.6.10.orig/include/asm-ppc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -85,7 +85,7 @@

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define  clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
=================================--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c	2005-01-10 14:23:22.000000000 -0800
@@ -88,7 +88,7 @@
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
=================================--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
=================================--- linux-2.6.10.orig/arch/sh/mm/init.c	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c	2005-01-10 14:23:22.000000000 -0800
@@ -57,7 +57,7 @@
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c	2005-01-10 14:23:22.000000000 -0800
@@ -78,7 +78,7 @@
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c	2005-01-10 14:23:22.000000000 -0800
@@ -27,7 +27,7 @@
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
=================================--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c	2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c	2005-01-10 14:23:22.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
=================================--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c	2005-01-10 14:23:22.000000000 -0800
@@ -102,7 +102,7 @@
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
=================================--- linux-2.6.10.orig/include/asm-arm26/page.h	2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -25,7 +25,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
=================================--- linux-2.6.10.orig/include/asm-sparc64/page.h	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -14,8 +14,8 @@

 #ifndef __ASSEMBLY__

-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S	2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -28,9 +28,12 @@
 	.text

 	.globl		_clear_page
-_clear_page:		/* %o0Þst */
+_clear_page:		/* %o0Þst, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1

 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate

+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1

 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8
Index: linux-2.6.10/drivers/net/tc35815.c
=================================--- linux-2.6.10.orig/drivers/net/tc35815.c	2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c	2005-01-10 14:23:22.000000000 -0800
@@ -657,7 +657,7 @@
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
 	} else {
-		clear_page(lp->fd_buf);
+		clear_page(lp->fd_buf, 0);
 #ifdef __mips__
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
Index: linux-2.6.10/include/linux/highmem.h
=================================--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-10 14:23:22.000000000 -0800
@@ -56,7 +56,7 @@
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }

Index: linux-2.6.10/fs/afs/file.c
=================================--- linux-2.6.10.orig/fs/afs/file.c	2004-12-24 13:35:59.000000000 -0800
+++ linux-2.6.10/fs/afs/file.c	2005-01-10 14:23:22.000000000 -0800
@@ -172,7 +172,7 @@
 				      (size_t) PAGE_SIZE);
 		desc.buffer	= kmap(page);

-		clear_page(desc.buffer);
+		clear_page(desc.buffer, 0);

 		/* read the contents of the file from the server into the
 		 * page */
Index: linux-2.6.10/fs/ntfs/compress.c
=================================--- linux-2.6.10.orig/fs/ntfs/compress.c	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/fs/ntfs/compress.c	2005-01-10 14:23:22.000000000 -0800
@@ -107,7 +107,7 @@
 		 * FIXME: Using clear_page() will become wrong when we get
 		 * PAGE_CACHE_SIZE != PAGE_SIZE but for now there is no problem.
 		 */
-		clear_page(kp);
+		clear_page(kp, 0);
 		return;
 	}
 	kp_ofs = ni->initialized_size & ~PAGE_CACHE_MASK;
@@ -742,7 +742,7 @@
 				 * for now there is no problem.
 				 */
 				if (likely(!cur_ofs))
-					clear_page(page_address(page));
+					clear_page(page_address(page), 0);
 				else
 					memset(page_address(page) + cur_ofs, 0,
 							PAGE_CACHE_SIZE -
Index: linux-2.6.10/mm/page_alloc.c
=================================--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-10 14:21:06.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-10 14:23:22.000000000 -0800
@@ -639,6 +639,10 @@
 {
 	int i;

+	if (!PageHighMem(page)) {
+		clear_page(page_address(page), order);
+		return;
+	}
 	for(i = 0; i < (1 << order); i++)
 		clear_highpage(page + i);
 }
Index: linux-2.6.10/mm/hugetlb.c
=================================--- linux-2.6.10.orig/mm/hugetlb.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/hugetlb.c	2005-01-10 14:23:22.000000000 -0800
@@ -89,8 +89,7 @@
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	clear_page(page_address(page), HUGETLB_PAGE_ORDER);
 	return page;
 }



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
  2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
@ 2005-01-11  0:41                           ` Chris Wright
  2005-01-11  0:46                             ` Prezeroing V4 [1/4]: Arch specific page zeroing during page Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: Chris Wright @ 2005-01-11  0:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David S. Miller,
	linux-ia64, linux-mm, Linux Kernel Development

* Christoph Lameter (clameter@sgi.com) wrote:
> @@ -1795,7 +1786,7 @@
> 
>  		if (unlikely(anon_vma_prepare(vma)))
>  			goto no_mem;
> -		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
> +		page = alloc_zeroed_user_highpage(vma, addr);

Oops, HIGHZERO is gone already in Linus' tree.

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V4 [1/4]: Arch specific page zeroing during page
  2005-01-11  0:41                           ` Chris Wright
@ 2005-01-11  0:46                             ` Christoph Lameter
  2005-01-11  0:49                               ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Chris Wright
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2005-01-11  0:46 UTC (permalink / raw)
  To: Chris Wright
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David S. Miller,
	linux-ia64, linux-mm, Linux Kernel Development

On Mon, 10 Jan 2005, Chris Wright wrote:

> * Christoph Lameter (clameter@sgi.com) wrote:
> > @@ -1795,7 +1786,7 @@
> >
> >  		if (unlikely(anon_vma_prepare(vma)))
> >  			goto no_mem;
> > -		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
> > +		page = alloc_zeroed_user_highpage(vma, addr);
>
> Oops, HIGHZERO is gone already in Linus' tree.

Use bk13 as I indicated.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
  2005-01-11  0:46                             ` Prezeroing V4 [1/4]: Arch specific page zeroing during page Christoph Lameter
@ 2005-01-11  0:49                               ` Chris Wright
  0 siblings, 0 replies; 99+ messages in thread
From: Chris Wright @ 2005-01-11  0:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Wright, Linus Torvalds, Hugh Dickins, Andrew Morton,
	David S. Miller, linux-ia64, linux-mm, Linux Kernel Development

* Christoph Lameter (clameter@sgi.com) wrote:
> Use bk13 as I indicated.

Ah, so you did, thanks ;-)
-chris
--
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 99+ messages in thread

* alloc_zeroed_user_highpage to fix the clear_user_highpage issue
  2005-01-08 21:56                   ` David S. Miller
@ 2005-01-21 20:09                     ` Christoph Lameter
  2005-02-09  9:58                       ` [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL Michael Ellerman
  2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
  1 sibling, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2005-01-21 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: Hugh Dickins, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

This patch adds a new function alloc_zeroed_user_highpage that is then used in the
anonymous page fault handler and in the COW code to allocate zeroed pages. The function
can be defined per arch to setup special processing for user pages by defining
__HAVE_ARCH_ALLOC_ZEROED_USER_PAGE. For arches that do not need to do special things
for user pages, alloc_zeroed_user_highpage is defined to simply do

	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)

Patch against 2.6.11-rc1-bk9

This patch needs to update a number of archs. Wish there was a better way
to do this.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/include/linux/highmem.h
=================================--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-21 10:44:27.000000000 -0800
@@ -42,6 +42,17 @@ static inline void clear_user_highpage(s
 	smp_wmb();
 }

+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+static inline struct page* alloc_zeroed_user_highpage(struct vm_area_struct *vma,
+	 unsigned long vaddr)
+{
+	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+
+	clear_user_highpage(page, vaddr);
+	return page;
+}
+#endif
+
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
Index: linux-2.6.10/mm/memory.c
=================================--- linux-2.6.10.orig/mm/memory.c	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-21 11:10:42.000000000 -0800
@@ -84,20 +84,6 @@ EXPORT_SYMBOL(high_memory);
 EXPORT_SYMBOL(vmalloc_earlyreserve);

 /*
- * We special-case the C-O-W ZERO_PAGE, because it's such
- * a common occurrence (no need to read the page to know
- * that it's zero - better for the cache and memory subsystem).
- */
-static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)
-{
-	if (from = ZERO_PAGE(address)) {
-		clear_user_highpage(to, address);
-		return;
-	}
-	copy_user_highpage(to, from, address);
-}
-
-/*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
@@ -1329,11 +1315,16 @@ static int do_wp_page(struct mm_struct *

 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-	if (!new_page)
-		goto no_new_page;
-	copy_cow_page(old_page,new_page,address);
-
+	if (old_page = ZERO_PAGE(address)) {
+		new_page = alloc_zeroed_user_highpage(vma, address);
+		if (!new_page)
+			goto no_new_page;
+	} else {
+		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		if (!new_page)
+			goto no_new_page;
+		copy_user_highpage(new_page, old_page, address);
+	}
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1795,10 +1786,9 @@ do_anonymous_page(struct mm_struct *mm,

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_zeroed_user_highpage(vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.10/include/asm-ia64/page.h
=================================--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -75,6 +75,16 @@ do {						\
 	flush_dcache_page(page);		\
 } while (0)

+
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+({						\
+	struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+	flush_dcache_page(page);		\
+	page;					\
+})
+
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)

 #ifdef CONFIG_VIRTUAL_MEM_MAP
Index: linux-2.6.10/include/asm-i386/page.h
=================================--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -36,6 +36,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-x86_64/page.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -38,6 +38,8 @@ void copy_page(void *, void *);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-m32r/page.h
=================================--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -17,6 +17,9 @@ extern void copy_page(void *to, void *fr
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-alpha/page.h
=================================--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -18,6 +18,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vmaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

Index: linux-2.6.10/include/asm-m68knommu/page.h
=================================--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-cris/page.h
=================================--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -21,6 +21,9 @@
 #define clear_user_page(page, vaddr, pg)    clear_page(page)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-s390/page.h
=================================--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -106,6 +106,9 @@ static inline void copy_page(void *to, v
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /* Pure 2^n version of get_order */
 extern __inline__ int get_order(unsigned long size)
 {
Index: linux-2.6.10/include/asm-h8300/page.h
=================================--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Extend clear_page by an order parameter
  2005-01-08 21:56                   ` David S. Miller
  2005-01-21 20:09                     ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
@ 2005-01-21 20:12                     ` Christoph Lameter
  2005-01-21 22:29                       ` Paul Mackerras
  2005-01-23  7:45                       ` Andrew Morton
  1 sibling, 2 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-21 20:12 UTC (permalink / raw)
  To: David S. Miller
  Cc: Hugh Dickins, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

The zeroing of a page of a arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
clear_page that is capable of zeroing multiple pages at once (and scrubd
too but that is now an independent patch). The following patch extends
clear_page with a second parameter specifying the order of the page to be zeroed to allow an
efficient zeroing of pages. Hope I caught everything....

Patch against 2.6.11-rc1-bk9

Architecture support:
---------------------

Known to work:

ia64
i386
x86_64
sparc64
m68k

Trivial modification expected to simply work:

arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modification made but it would be good to have some feedback from the arch maintainers:

s390
alpha
sh
mips
m32r

Index: linux-2.6.10/mm/page_alloc.c
=================================--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-21 11:51:39.000000000 -0800
@@ -591,11 +591,16 @@ void fastcall free_cold_page(struct page
 	free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
 {
 	int i;

 	BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) = __GFP_HIGHMEM);
+	if (!PageHighMem(page)) {
+		clear_page(page_address(page), order);
+		return;
+	}
+
 	for(i = 0; i < (1 << order); i++)
 		clear_highpage(page + i);
 }
Index: linux-2.6.10/mm/hugetlb.c
=================================--- linux-2.6.10.orig/mm/hugetlb.c	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/hugetlb.c	2005-01-21 11:51:39.000000000 -0800
@@ -78,7 +78,6 @@ void free_huge_page(struct page *page)
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -89,8 +88,7 @@ struct page *alloc_huge_page(void)
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	prep_zero_page(page, HUGETLB_PAGE_ORDER, GFP_HIGHUSER);
 	return page;
 }

Index: linux-2.6.10/include/linux/highmem.h
=================================--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-21 11:51:39.000000000 -0800
@@ -45,7 +45,7 @@ static inline void clear_user_highpage(s
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }

Index: linux-2.6.10/drivers/net/tc35815.c
=================================--- linux-2.6.10.orig/drivers/net/tc35815.c	2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c	2005-01-21 11:51:39.000000000 -0800
@@ -657,7 +657,7 @@ tc35815_init_queues(struct net_device *d
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
 	} else {
-		clear_page(lp->fd_buf);
+		clear_page(lp->fd_buf, 0);
 #ifdef __mips__
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
Index: linux-2.6.10/fs/afs/file.c
=================================--- linux-2.6.10.orig/fs/afs/file.c	2004-12-24 13:35:59.000000000 -0800
+++ linux-2.6.10/fs/afs/file.c	2005-01-21 11:51:39.000000000 -0800
@@ -172,7 +172,7 @@ static int afs_file_readpage(struct file
 				      (size_t) PAGE_SIZE);
 		desc.buffer	= kmap(page);

-		clear_page(desc.buffer);
+		clear_page(desc.buffer, 0);

 		/* read the contents of the file from the server into the
 		 * page */
Index: linux-2.6.10/fs/ntfs/compress.c
=================================--- linux-2.6.10.orig/fs/ntfs/compress.c	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/fs/ntfs/compress.c	2005-01-21 11:51:39.000000000 -0800
@@ -107,7 +107,7 @@ static void zero_partial_compressed_page
 		 * FIXME: Using clear_page() will become wrong when we get
 		 * PAGE_CACHE_SIZE != PAGE_SIZE but for now there is no problem.
 		 */
-		clear_page(kp);
+		clear_page(kp, 0);
 		return;
 	}
 	kp_ofs = ni->initialized_size & ~PAGE_CACHE_MASK;
@@ -742,7 +742,7 @@ lock_retry_remap:
 				 * for now there is no problem.
 				 */
 				if (likely(!cur_ofs))
-					clear_page(page_address(page));
+					clear_page(page_address(page), 0);
 				else
 					memset(page_address(page) + cur_ofs, 0,
 							PAGE_CACHE_SIZE -
Index: linux-2.6.10/include/asm-ia64/page.h
=================================--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@ extern void copy_page (void *to, void *f
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.10/arch/ia64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S	2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.10/include/asm-i386/page.h
=================================--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page),order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-i386/mmx.h
=================================--- linux-2.6.10.orig/include/asm-i386/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h	2005-01-21 11:51:39.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/i386/lib/mmx.c
=================================--- linux-2.6.10.orig/arch/i386/lib/mmx.c	2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c	2005-01-21 11:51:39.000000000 -0800
@@ -128,7 +128,7 @@ void *_mmx_memcpy(void *to, const void *
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@ static void fast_clear_page(void *page)
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@ static void fast_copy_page(void *to, voi
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@ static void fast_clear_page(void *page)
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@ static void fast_copy_page(void *to, voi
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/page.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-x86_64/mmx.h
=================================--- linux-2.6.10.orig/include/asm-x86_64/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h	2005-01-21 11:51:39.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -1,12 +1,16 @@
 /*
  * Zero a page.
  * rdi	page
+ * rsi	order
  */
 	.globl clear_page
 	.p2align 4
 clear_page:
+	movl   $4096/64,%eax
+	movl	%esi, %ecx
+	shll	%cl, %eax
+	movl	%eax, %ecx
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -41,7 +45,10 @@ clear_page_end:

 	.section .altinstr_replacement,"ax"
 clear_page_c:
-	movl $4096/8,%ecx
+	movl $4096/8,%eax
+	movl %esi, %ecx
+	shll %cl, %eax
+	movl %eax, %ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.10/include/asm-sparc/page.h
=================================--- linux-2.6.10.orig/include/asm-sparc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-s390/page.h
=================================--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@ static inline void copy_page(void *to, v

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@ static inline void copy_page(void *to, v

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /* Pure 2^n version of get_order */
Index: linux-2.6.10/include/asm-sh/page.h
=================================--- linux-2.6.10.orig/include/asm-sh/page.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@ extern void copy_user_page(void *to, voi
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@ clear_page:
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
=================================--- linux-2.6.10.orig/include/asm-sh64/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -50,12 +50,20 @@ extern struct page *mem_map;
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(page, order)
+{
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+		sh64_page_clear(page++, 0);
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	sh_clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.10/include/asm-h8300/page.h
=================================--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-arm/page.h
=================================--- linux-2.6.10.orig/include/asm-arm/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -128,7 +128,7 @@ extern void __cpu_copy_user_page(void *t
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
=================================--- linux-2.6.10.orig/include/asm-ppc64/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, unsigned int order)
 {
 	unsigned long lines, line_size;

 	line_size = ppc64_caches.dline_size;
-	lines = ppc64_caches.dlines_per_page;
+	lines = ppc64_caches.dlines_per_page << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
=================================--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-alpha/page.h
=================================--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
=================================--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c	2005-01-21 11:51:39.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@ void sb1_dma_init(void)
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@ void copy_page(void *to, void *from)

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
=================================--- linux-2.6.10.orig/include/asm-m68k/page.h	2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -50,7 +50,7 @@ static inline void copy_page(void *to, v
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@ static inline void clear_page(void *page
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-mips/page.h
=================================--- linux-2.6.10.orig/include/asm-mips/page.h	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- >0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@ static inline void clear_user_page(void
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.10/include/asm-m68knommu/page.h
=================================--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-cris/page.h
=================================--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-v850/page.h
=================================--- linux-2.6.10.orig/include/asm-v850/page.h	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-parisc/page.h
=================================--- linux-2.6.10.orig/include/asm-parisc/page.h	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
=================================--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c	2005-01-21 11:51:39.000000000 -0800
@@ -47,7 +47,7 @@ void v6_copy_user_page_nonaliasing(void
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	_clear_page(kaddr);
 }

 /*
@@ -116,7 +116,7 @@ void v6_clear_user_page_aliasing(void *k

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	_clear_page((void *)to);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.10/arch/m32r/mm/page.S
=================================--- linux-2.6.10.orig/arch/m32r/mm/page.S	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S	2005-01-21 11:51:39.000000000 -0800
@@ -51,7 +51,7 @@ copy_page:
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@ copy_page:
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
=================================--- linux-2.6.10.orig/include/asm-ppc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -85,7 +85,7 @@ typedef unsigned long pgprot_t;

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define  clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
=================================--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c	2005-01-21 11:51:39.000000000 -0800
@@ -88,7 +88,7 @@ EXPORT_SYMBOL(__memset);
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
=================================--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@ clear_page:
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
=================================--- linux-2.6.10.orig/arch/sh/mm/init.c	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c	2005-01-21 11:51:39.000000000 -0800
@@ -57,7 +57,7 @@ bootmem_data_t discontig_node_bdata[MAX_
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@ void __init mem_init(void)
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c	2005-01-21 11:51:39.000000000 -0800
@@ -78,7 +78,7 @@ static int __init pg_dma_init(void)
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
=================================--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c	2005-01-21 11:51:39.000000000 -0800
@@ -27,7 +27,7 @@ static void clear_page_nommu(void *to)
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
=================================--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c	2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c	2005-01-21 11:51:39.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
=================================--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c	2005-01-21 11:51:39.000000000 -0800
@@ -102,7 +102,7 @@ EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
=================================--- linux-2.6.10.orig/include/asm-arm26/page.h	2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -25,7 +25,7 @@ extern void copy_page(void *to, const vo
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
=================================--- linux-2.6.10.orig/include/asm-sparc64/page.h	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -14,8 +14,8 @@

 #ifndef __ASSEMBLY__

-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
=================================--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S	2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -28,9 +28,12 @@
 	.text

 	.globl		_clear_page
-_clear_page:		/* %o0Þst */
+_clear_page:		/* %o0Þst, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1

 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@ clear_user_page:	/* %o0Þst, %o1=vaddr
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate

+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1

 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
@ 2005-01-21 22:29                       ` Paul Mackerras
  2005-01-21 23:48                         ` Christoph Lameter
  2005-01-23  7:45                       ` Andrew Morton
  1 sibling, 1 reply; 99+ messages in thread
From: Paul Mackerras @ 2005-01-21 22:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David S. Miller, Hugh Dickins, akpm, linux-ia64, torvalds,
	linux-mm, linux-kernel

Christoph Lameter writes:

> The zeroing of a page of a arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> clear_page that is capable of zeroing multiple pages at once (and scrubd
> too but that is now an independent patch). The following patch extends
> clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> efficient zeroing of pages. Hope I caught everything....

Wouldn't it be nicer to call the version that takes the order
parameter "clear_pages" and then define clear_page(p) as
clear_pages(p, 0) ?

Paul.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-21 22:29                       ` Paul Mackerras
@ 2005-01-21 23:48                         ` Christoph Lameter
  2005-01-22  0:35                           ` Paul Mackerras
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2005-01-21 23:48 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: David S. Miller, Hugh Dickins, akpm, linux-ia64, torvalds,
	linux-mm, linux-kernel

On Sat, 22 Jan 2005, Paul Mackerras wrote:

> Christoph Lameter writes:
>
> > The zeroing of a page of a arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> > clear_page that is capable of zeroing multiple pages at once (and scrubd
> > too but that is now an independent patch). The following patch extends
> > clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> > efficient zeroing of pages. Hope I caught everything....
>
> Wouldn't it be nicer to call the version that takes the order
> parameter "clear_pages" and then define clear_page(p) as
> clear_pages(p, 0) ?

clear_page clears one page of the specified order. clear_page cannot clear
multiple pages. Calling the function clear_pages would give a wrong
impression on what the function does and may lead to attempts to specify
the number of zero order pages as a parameter instead of the order.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-21 23:48                         ` Christoph Lameter
@ 2005-01-22  0:35                           ` Paul Mackerras
  2005-01-22  0:43                             ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: Paul Mackerras @ 2005-01-22  0:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David S. Miller, Hugh Dickins, akpm, linux-ia64, torvalds,
	linux-mm, linux-kernel

Christoph Lameter writes:

> clear_page clears one page of the specified order.

Now you're really being confusing.  A cluster of 2^n contiguous pages
isn't one page by any normal definition.  Call it "clear_page_cluster"
or "clear_page_order" or something, but not "clear_page".

Paul.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  0:35                           ` Paul Mackerras
@ 2005-01-22  0:43                             ` Andrew Morton
  2005-01-22  1:08                               ` Paul Mackerras
                                                 ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Andrew Morton @ 2005-01-22  0:43 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: clameter, davem, hugh, linux-ia64, torvalds, linux-mm,
	linux-kernel

Paul Mackerras <paulus@samba.org> wrote:
>
> A cluster of 2^n contiguous pages
>  isn't one page by any normal definition.

It is, actually, from the POV of the page allocator.  It's a "higher order
page" and is controlled by a struct page*, just like a zero-order page...

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  0:43                             ` Andrew Morton
@ 2005-01-22  1:08                               ` Paul Mackerras
  2005-01-22  1:20                               ` Roman Zippel
  2005-01-22  1:25                               ` Paul Mackerras
  2 siblings, 0 replies; 99+ messages in thread
From: Paul Mackerras @ 2005-01-22  1:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, davem, hugh, linux-ia64, torvalds, linux-mm,
	linux-kernel

Andrew Morton writes:

> It is, actually, from the POV of the page allocator.  It's a "higher order
> page" and is controlled by a struct page*, just like a zero-order page...

OK.  I still reckon it's confusing terminology for the rest of us who
don't have our heads deep in the page allocator code.

Paul.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  0:43                             ` Andrew Morton
  2005-01-22  1:08                               ` Paul Mackerras
@ 2005-01-22  1:20                               ` Roman Zippel
  2005-01-22  1:25                               ` Paul Mackerras
  2 siblings, 0 replies; 99+ messages in thread
From: Roman Zippel @ 2005-01-22  1:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Mackerras, clameter, davem, hugh, linux-ia64, torvalds,
	linux-mm, linux-kernel

Hi,

On Fri, 21 Jan 2005, Andrew Morton wrote:

> Paul Mackerras <paulus@samba.org> wrote:
> >
> > A cluster of 2^n contiguous pages
> >  isn't one page by any normal definition.
> 
> It is, actually, from the POV of the page allocator.  It's a "higher order
> page" and is controlled by a struct page*, just like a zero-order page...

OTOH we also have alloc_page/alloc_pages.

bye, Roman

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  0:43                             ` Andrew Morton
  2005-01-22  1:08                               ` Paul Mackerras
  2005-01-22  1:20                               ` Roman Zippel
@ 2005-01-22  1:25                               ` Paul Mackerras
  2005-01-22  1:54                                 ` Christoph Lameter
  2 siblings, 1 reply; 99+ messages in thread
From: Paul Mackerras @ 2005-01-22  1:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, davem, hugh, linux-ia64, torvalds, linux-mm,
	linux-kernel

Andrew Morton writes:

> It is, actually, from the POV of the page allocator.  It's a "higher order
> page" and is controlled by a struct page*, just like a zero-order page...

So why is the function that gets me one of these "higher order pages"
called "get_free_pages" with an "s"? :)

Christoph's patch is bigger than it needs to be because he has to
change all the occurrences of clear_page(x) to clear_page(x, 0), and
then he has to change a lot of architectures' clear_page functions to
be called _clear_page instead.  If he picked a different name for the
"clear a higher order page" function it would end up being less
invasive as well as less confusing.

The argument that clear_page is called that because it clears a higher
order page won't wash; all the clear_page implementations in his patch
are perfectly capable of clearing any contiguous set of 2^order pages
(oops, I mean "zero-order pages"), not just a "higher order page".

Paul.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  1:25                               ` Paul Mackerras
@ 2005-01-22  1:54                                 ` Christoph Lameter
  2005-01-22  2:53                                   ` Paul Mackerras
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2005-01-22  1:54 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andrew Morton, davem, hugh, linux-ia64, torvalds, linux-mm,
	linux-kernel

On Sat, 22 Jan 2005, Paul Mackerras wrote:

> Christoph's patch is bigger than it needs to be because he has to
> change all the occurrences of clear_page(x) to clear_page(x, 0), and
> then he has to change a lot of architectures' clear_page functions to
> be called _clear_page instead.  If he picked a different name for the
> "clear a higher order page" function it would end up being less
> invasive as well as less confusing.

I had the name "zero_page" in V1 and V2 of the patch where it was
separate. Then someone complained about code duplication.

> The argument that clear_page is called that because it clears a higher
> order page won't wash; all the clear_page implementations in his patch
> are perfectly capable of clearing any contiguous set of 2^order pages
> (oops, I mean "zero-order pages"), not just a "higher order page".

clear_page is called clear_page because it clears one page of *any* order
not just higher orders. zero-order pages are not segregated nor are they
intrisincally better just because they contain more memory ;-).

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  1:54                                 ` Christoph Lameter
@ 2005-01-22  2:53                                   ` Paul Mackerras
  0 siblings, 0 replies; 99+ messages in thread
From: Paul Mackerras @ 2005-01-22  2:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, davem, hugh, linux-ia64, torvalds, linux-mm,
	linux-kernel

Christoph Lameter writes:

> I had the name "zero_page" in V1 and V2 of the patch where it was
> separate. Then someone complained about code duplication.

Well, if you duplicated each arch's clear_page implementation in
zero_page, then yes, that would be unnecessary code duplication.  I
would suggest that for architectures where the clear_page
implementation can easily be extended, rename it to clear_page_order
(or something) and #define clear_page(x) to be clear_page_order(x, 0).
For architectures where it can't, leave clear_page as clear_page and
define clear_page_order as an inline function that calls clear_page in
a loop.

> clear_page is called clear_page because it clears one page of *any* order
> not just higher orders. zero-order pages are not segregated nor are they
> intrisincally better just because they contain more memory ;-).

You have missed my point, which was about address constraints, not a
distinction between zero-order pages and higher-order pages.

Anyway, I remain of the opinion that your naming is inconsistent with
the naming of other functions that deal with zero-order and
higher-order pages, such as get_free_pages, alloc_pages, free_pages,
etc., and that your patch is unnecessarily intrusive.  I guess it's up
to Andrew to decide which way we go.

Paul.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
  2005-01-21 22:29                       ` Paul Mackerras
@ 2005-01-23  7:45                       ` Andrew Morton
  2005-01-24 16:37                         ` Christoph Lameter
  1 sibling, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2005-01-23  7:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

Christoph Lameter <clameter@sgi.com> wrote:
>
> The zeroing of a page of a arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
>  clear_page that is capable of zeroing multiple pages at once (and scrubd
>  too but that is now an independent patch). The following patch extends
>  clear_page with a second parameter specifying the order of the page to be zeroed to allow an
>  efficient zeroing of pages. Hope I caught everything....
> 

Sorry, I take it back.  As Paul says:

: Wouldn't it be nicer to call the version that takes the order
: parameter "clear_pages" and then define clear_page(p) as
: clear_pages(p, 0) ?

It would make the patch considerably smaller, and our naming is all over
the place anyway...

>  -static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
>  +void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
>   {
>   	int i;
> 
>   	BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) = __GFP_HIGHMEM);
>  +	if (!PageHighMem(page)) {
>  +		clear_page(page_address(page), order);
>  +		return;
>  +	}
>  +
>   	for(i = 0; i < (1 << order); i++)
>   		clear_highpage(page + i);
>   }

I'd have thought that we'd want to make the new clear_pages() handle
highmem pages too, if only from a regularity POV.  x86 hugetlbpages could
use it then, if someone thinks up a fast page-clearer.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-23  7:45                       ` Andrew Morton
@ 2005-01-24 16:37                         ` Christoph Lameter
  2005-01-24 20:23                           ` David S. Miller
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2005-01-24 16:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

On Sat, 22 Jan 2005, Andrew Morton wrote:

> Christoph Lameter <clameter@sgi.com> wrote:
> >
> > The zeroing of a page of a arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> >  clear_page that is capable of zeroing multiple pages at once (and scrubd
> >  too but that is now an independent patch). The following patch extends
> >  clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> >  efficient zeroing of pages. Hope I caught everything....
> >
>
> Sorry, I take it back.  As Paul says:
>
> : Wouldn't it be nicer to call the version that takes the order
> : parameter "clear_pages" and then define clear_page(p) as
> : clear_pages(p, 0) ?

> It would make the patch considerably smaller, and our naming is all over
> the place anyway...

Sounds good. Note though that this just means renaming clear_page to
clear_pages for all arches which would increase the patch size for the
arch specific section.

> I'd have thought that we'd want to make the new clear_pages() handle
> highmem pages too, if only from a regularity POV.  x86 hugetlbpages could
> use it then, if someone thinks up a fast page-clearer.

That would get us back to code duplication. We would have a clear_page (no
highmem support) and a clear_pages (supporting highmem). Then it may
also be better to pass the page struct to clear_pages instead of a memory address.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-24 16:37                         ` Christoph Lameter
@ 2005-01-24 20:23                           ` David S. Miller
  2005-01-24 20:33                             ` Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: David S. Miller @ 2005-01-24 20:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

On Mon, 24 Jan 2005 08:37:15 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> Then it may also be better to pass the page struct to clear_pages
> instead of a memory address.

What is more generally available at the call sites at this time?
Consider both HIGHMEM and non-HIGHMEM setups in your estimation
please :-)


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-24 20:23                           ` David S. Miller
@ 2005-01-24 20:33                             ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2005-01-24 20:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

On Mon, 24 Jan 2005, David S. Miller wrote:

> On Mon, 24 Jan 2005 08:37:15 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Then it may also be better to pass the page struct to clear_pages
> > instead of a memory address.
>
> What is more generally available at the call sites at this time?
> Consider both HIGHMEM and non-HIGHMEM setups in your estimation
> please :-)

The only call site is prep_zero_page which has a GFP flag, the order and
the pointer to struct page.

The patch makes the huge page code call prep_zero_page and scrubd will
also call prep_zero_page.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL
  2005-01-21 20:09                     ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
@ 2005-02-09  9:58                       ` Michael Ellerman
  0 siblings, 0 replies; 99+ messages in thread
From: Michael Ellerman @ 2005-02-09  9:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: davem, hugh, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Hi All,

The generic and IA-64 versions of alloc_zeroed_user_highpage() don't check the return value from alloc_page_vma(). This can lead to an oops if we're OOM.

This fixes my oops on PPC64, but I haven't got an IA-64 machine/compiler handy.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>

diff -rN -p -u oombreakage-old/include/asm-ia64/page.h oombreakage-new/include/asm-ia64/page.h
--- oombreakage-old/include/asm-ia64/page.h	2005-02-04 04:10:37.000000000 +1100
+++ oombreakage-new/include/asm-ia64/page.h	2005-02-09 20:53:37.000000000 +1100
@@ -79,7 +79,8 @@ do {						\
 #define alloc_zeroed_user_highpage(vma, vaddr) \
 ({						\
 	struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
-	flush_dcache_page(page);		\
+	if (page)				\
+ 		flush_dcache_page(page);	\
 	page;					\
 })
 
diff -rN -p -u oombreakage-old/include/linux/highmem.h oombreakage-new/include/linux/highmem.h
--- oombreakage-old/include/linux/highmem.h	2005-02-09 20:22:41.000000000 +1100
+++ oombreakage-new/include/linux/highmem.h	2005-02-09 20:47:01.000000000 +1100
@@ -48,7 +48,9 @@ alloc_zeroed_user_highpage(struct vm_are
 {
 	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
 
-	clear_user_highpage(page, vaddr);
+	if (page)
+		clear_user_highpage(page, vaddr);
+
 	return page;
 }
 #endif




^ permalink raw reply	[flat|nested] 99+ messages in thread

end of thread, other threads:[~2005-02-09  9:58 UTC | newest]

Thread overview: 99+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-12-15  2:34 [very very drafty] prezeroing to increase the page fault rate Christoph Lameter
2004-12-15 21:21 ` Robin Holt
2004-12-15 21:58 ` Christoph Lameter
2004-12-15 22:00 ` Christoph Lameter
2004-12-16  0:25 ` Nick Piggin
2004-12-16  0:41 ` Christoph Lameter
2004-12-16  0:41 ` Linus Torvalds
2004-12-16  0:46 ` Christoph Lameter
2004-12-16  0:50 ` Nick Piggin
2004-12-16  0:54 ` Christoph Lameter
2004-12-16  1:18 ` Linus Torvalds
2004-12-16  1:44 ` Christoph Lameter
2004-12-16  1:55 ` Linus Torvalds
2004-12-16  2:17 ` Nick Piggin
2004-12-16  7:59 ` Nick Piggin
2004-12-16 16:27 ` Christoph Lameter
2004-12-16 18:38 ` Luck, Tony
2004-12-16 22:37 ` Nick Piggin
2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
2004-12-21 19:56     ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
2005-01-01  2:22       ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and Nick Piggin
2005-01-01  2:55         ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd pmarques
2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Christoph Lameter
2004-12-22 12:46       ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Robin Holt
2004-12-22 19:56         ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Christoph Lameter
2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all Christoph Lameter
2004-12-24  8:33           ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Pavel Machek
2004-12-24 16:18             ` Prezeroing V2 [2/4]: add second parameter to clear_page() for Christoph Lameter
2004-12-24 16:27               ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Pavel Machek
2004-12-24 17:02                 ` Prezeroing V2 [2/4]: add second parameter to clear_page() for David S. Miller
2004-12-24 17:05           ` David S. Miller
2004-12-27 22:48             ` David S. Miller
2005-01-03 17:52             ` Christoph Lameter
2005-01-01 10:24           ` Geert Uytterhoeven
2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2005-01-04 23:45                 ` Dave Hansen
2005-01-05  1:16                   ` Christoph Lameter
2005-01-05  1:26                     ` Linus Torvalds
2005-01-05 23:11                       ` Christoph Lameter
2005-01-05  0:34                 ` Linus Torvalds
2005-01-05  0:47                   ` Andrew Morton
2005-01-05  1:15                     ` Christoph Lameter
2005-01-08 21:12                 ` Hugh Dickins
2005-01-08 21:56                   ` David S. Miller
2005-01-21 20:09                     ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
2005-02-09  9:58                       ` [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL Michael Ellerman
2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
2005-01-21 22:29                       ` Paul Mackerras
2005-01-21 23:48                         ` Christoph Lameter
2005-01-22  0:35                           ` Paul Mackerras
2005-01-22  0:43                             ` Andrew Morton
2005-01-22  1:08                               ` Paul Mackerras
2005-01-22  1:20                               ` Roman Zippel
2005-01-22  1:25                               ` Paul Mackerras
2005-01-22  1:54                                 ` Christoph Lameter
2005-01-22  2:53                                   ` Paul Mackerras
2005-01-23  7:45                       ` Andrew Morton
2005-01-24 16:37                         ` Christoph Lameter
2005-01-24 20:23                           ` David S. Miller
2005-01-24 20:33                             ` Christoph Lameter
2005-01-10 17:16                   ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2005-01-10 18:13                     ` Linus Torvalds
2005-01-10 20:17                       ` Christoph Lameter
2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
2005-01-11  0:41                           ` Chris Wright
2005-01-11  0:46                             ` Prezeroing V4 [1/4]: Arch specific page zeroing during page Christoph Lameter
2005-01-11  0:49                               ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Chris Wright
2005-01-10 23:55                         ` Prezeroing V4 [2/4]: Zeroing implementation Christoph Lameter
2005-01-10 23:55                         ` Prezeroing V4 [3/4]: Altix SN2 BTE zero driver Christoph Lameter
2005-01-10 23:56                         ` Prezeroing V4 [4/4]: Extend clear_page to take an order parameter Christoph Lameter
2005-01-04 23:14               ` Prezeroing V3 [2/4]: Extension of clear_page to take an order Christoph Lameter
2005-01-05 23:25                 ` Christoph Lameter
2005-01-04 23:15               ` Prezeroing V3 [3/4]: Page zeroing through kscrubd Christoph Lameter
2005-01-04 23:16               ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
2004-12-23 19:34         ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter
2004-12-23 19:35         ` Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE Christoph Lameter
2004-12-23 20:08         ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
2004-12-24 16:24           ` Christoph Lameter
2004-12-23 19:49       ` Prezeroing V2 [0/3]: Why and When it works Arjan van de Ven
2004-12-23 20:57       ` Matt Mackall
2004-12-23 21:01       ` Paul Mackerras
2004-12-23 21:11       ` Paul Mackerras
2004-12-23 21:37         ` Andrew Morton
2004-12-23 23:00           ` Paul Mackerras
2004-12-23 21:48         ` Linus Torvalds
2004-12-23 22:34           ` Zwane Mwaikambo
2004-12-24  9:14           ` Arjan van de Ven
2004-12-24 18:21             ` Linus Torvalds
2004-12-24 18:57               ` Arjan van de Ven
2004-12-27 22:50               ` David S. Miller
2004-12-28 11:53                 ` Marcelo Tosatti
2004-12-24 16:17           ` Christoph Lameter
2004-12-24 18:31     ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
2005-01-03 17:54       ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox