[RFC][PATCH 0/4] ZERO PAGE again v2

* [RFC][PATCH 0/4] ZERO PAGE again v2
@ 2009-07-07  7:51 KAMEZAWA Hiroyuki
  2009-07-07  7:52 ` [RFC][PATCH 1/4] introduce pte_zero() KAMEZAWA Hiroyuki
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-07  7:51 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: npiggin, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

Hi, this is v2 of the ZERO_PAGE mapping revival patch.

ZERO_PAGE was removed in 2.6.24 (=> http://lkml.org/lkml/2007/10/9/112)
and I had no objections at that time.

Recently, in user support work, I noticed that a few customers
are making use of ZERO_PAGE intentionally: brute-force mmap and scan, etc.
(For example, scanning a big sparse table and saving the contents.)
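
As a rough userspace illustration of that access pattern (a sketch only; the
function name scan_untouched_region() and its parameters are made up for this
example and do not come from any customer code), the program below maps private
anonymous memory and only reads it. Each first touch is a read fault, which is
the case a shared zero page would serve without allocating a new page. The sum
is zero with or without ZERO_PAGE; only the memory cost differs.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on strict compilers */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of private anonymous memory, read one byte per page
 * without ever writing, and return the sum of the bytes read.
 * Returns -1 if the mapping cannot be created.
 */
long scan_untouched_region(size_t len, size_t page_size)
{
	const volatile unsigned char *v;
	unsigned char *p;
	long sum = 0;
	size_t off;

	assert(page_size != 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	v = p;
	for (off = 0; off < len; off += page_size)
		sum += v[off];	/* read fault only, the page is never written */

	munmap(p, len);
	return sum;
}
```

Calling scan_untouched_region((size_t)16 << 20, 4096) touches 4096 pages
read-only and never dirties any of them.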

They are using RHEL4-5 (2.6.18 and earlier), so they have not noticed yet
that ZERO_PAGE is gone.
Yes, I can tell them "ZERO_PAGE is gone" in the next-generation distro.

Recently, a question about this came to lkml (http://lkml.org/lkml/2009/6/4/383).

There may be users of ZERO_PAGE other than my customers.
So, can't we use ZERO_PAGE again?

IIUC, the problems with ZERO_PAGE were
  - cache ping-pong on the reference count,
  - complicated handling,
  - the page-fault-twice behavior, which can make applications slow.

This patch set is a trial of a ZERO_PAGE without reference counting.

This series includes 4 patches.
[1/4] introduce pte_zero() et al.
[2/4] use ZERO_PAGE for READ fault in anonymous mapping.
[3/4] corner cases, get_user_pages()
[4/4] introduce get_user_pages_nozero().

I feel these patches need to be cleaner, but they cover almost all of the
mess we have to handle to use ZERO_PAGE again.

What I feel now is
 a. Technically, we can do it, because we did it before.
 b. Considering maintenance, code beauty, etc., ZERO_PAGE adds mess.
 c. Very big benefits for some (a few?) users but no benefit to usual programs.

 There is a trade-off between b. and c.

Any comments are welcome.
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [RFC][PATCH 1/4] introduce pte_zero()
  2009-07-07  7:51 [RFC][PATCH 0/4] ZERO PAGE again v2 KAMEZAWA Hiroyuki
@ 2009-07-07  7:52 ` KAMEZAWA Hiroyuki
  2009-07-07  7:54 ` [RFC][PATCH 2/4] use ZERO_PAGE for READ fault in regular anonymous mapping KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-07  7:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, npiggin, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

Initializing zero_page_pfn is not clean yet...

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Add some helper functions for supporting zero-page again.
This patch itself adds some tiny functions but no behavior change.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/mm.h |   20 ++++++++++++++++++++
 mm/memory.c        |    8 ++++++++
 2 files changed, 28 insertions(+)

Index: zeropage-trial/include/linux/mm.h
===================================================================
--- zeropage-trial.orig/include/linux/mm.h
+++ zeropage-trial/include/linux/mm.h
@@ -822,6 +822,26 @@ static inline int handle_mm_fault(struct
 }
 #endif
 
+/*
+ * ZERO page is used for read-only (never written) private page mappings.
+ * It is not usually used, but is sometimes useful when mapping /dev/zero,
+ * scanning sparsely used big private memory, or calculating with a sparse
+ * matrix where most entries are zero. ZERO page is not refcounted and exists
+ * as a PG_reserved page. zero_page_pfn is the pfn of ZERO_PAGE(0).
+ */
+
+extern unsigned long zero_page_pfn;
+static inline int pte_zero(pte_t pte)
+{
+	return (pte_pfn(pte) == zero_page_pfn);
+}
+
+static inline int page_is_zero(struct page *page)
+{
+	return page == ZERO_PAGE(0);
+}
+
+
 extern int make_pages_present(unsigned long addr, unsigned long end);
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
 
Index: zeropage-trial/mm/memory.c
===================================================================
--- zeropage-trial.orig/mm/memory.c
+++ zeropage-trial/mm/memory.c
@@ -106,6 +106,14 @@ static int __init disable_randmaps(char 
 }
 __setup("norandmaps", disable_randmaps);
 
+unsigned long zero_page_pfn __read_mostly;
+static int __init zeropage_init(void)
+{
+	zero_page_pfn = page_to_pfn(ZERO_PAGE(0));
+	return 0;
+}
+__initcall(zeropage_init);
+
 
 /*
  * If a p?d_bad entry is found while walking page tables, report

* [RFC][PATCH 2/4] use ZERO_PAGE for READ fault in regular anonymous mapping
  2009-07-07  7:51 [RFC][PATCH 0/4] ZERO PAGE again v2 KAMEZAWA Hiroyuki
  2009-07-07  7:52 ` [RFC][PATCH 1/4] introduce pte_zero() KAMEZAWA Hiroyuki
@ 2009-07-07  7:54 ` KAMEZAWA Hiroyuki
  2009-07-07  7:59 ` [RFC][PATCH 3/4] get_user_pages READ fault handling special cases KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-07  7:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, npiggin, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch makes vm_normal_page() return NULL if it finds the zero page.
If the caller must handle the zero page, it should check pte_zero() before/after
calling vm_normal_page().

In summary,
 - vm_normal_page() returns NULL if it finds ZERO_PAGE.
 - As in the old days, a mapped ZERO_PAGE is counted as file_rss in the mm struct.
 - Read access by get_user_pages() can return ZERO_PAGE.
   This is the same as the old ZERO_PAGE behavior,
   but it causes some trouble now. This problem is handled in the next patch
   in the series.
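
The fault behavior summarized above can be modeled in plain userspace C. This
is only a sketch under stated assumptions: struct model_pte, struct model_mm,
model_pte_zero() and model_anon_fault() are hypothetical stand-ins invented for
illustration, not kernel structures or functions. A read fault on an empty pte
maps the shared zero page and counts it as file_rss; a later write fault
replaces it with a freshly zeroed private page and moves the count to anon_rss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-ins for the kernel structures; not kernel code. */
unsigned char model_zero_page[MODEL_PAGE_SIZE];	/* shared, never written */

struct model_pte {
	unsigned char *page;	/* NULL models pte_none() */
};

struct model_mm {
	long file_rss;
	long anon_rss;
};

int model_pte_zero(const struct model_pte *pte)
{
	return pte->page == model_zero_page;
}

/*
 * Model of a fault on an anonymous private mapping.
 * Returns 0 on success, -1 if no memory is available.
 */
int model_anon_fault(struct model_mm *mm, struct model_pte *pte, int write)
{
	unsigned char *new_page;

	if (!write) {
		if (pte->page == NULL) {
			/* read fault on an empty pte: map the zero page */
			pte->page = model_zero_page;
			mm->file_rss++;
		}
		return 0;
	}
	if (pte->page != NULL && !model_pte_zero(pte))
		return 0;	/* already backed by a private page */

	/* write fault: a zeroed page stands in for copying the zero page */
	new_page = calloc(1, MODEL_PAGE_SIZE);
	if (new_page == NULL)
		return -1;
	if (model_pte_zero(pte))
		mm->file_rss--;	/* the zero page mapping goes away */
	mm->anon_rss++;
	pte->page = new_page;
	return 0;
}
```

The model allocates with calloc() in the same spirit as the __GFP_ZERO path in
do_wp_page(): when the old mapping is the zero page, there is nothing to copy.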

Changelog: v1->v2
 - make use of pte_zero() rather than modifying vm_normal_page() too much.
 - don't handle (VM_PFNMAP | VM_MIXEDMAP) pages.
 - split the get_user_pages(READ) workaround into another patch.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/task_mmu.c |   10 ++++++
 mm/fremap.c        |    3 ++
 mm/memory.c        |   78 +++++++++++++++++++++++++++++++++++++++++++++--------
 mm/mempolicy.c     |   11 ++-----
 mm/rmap.c          |    2 -
 5 files changed, 83 insertions(+), 21 deletions(-)

Index: zeropage-trial/mm/memory.c
===================================================================
--- zeropage-trial.orig/mm/memory.c
+++ zeropage-trial/mm/memory.c
@@ -490,6 +490,7 @@ static inline int is_cow_mapping(unsigne
  * advantage is that we don't have to follow the strict linearity rule of
  * PFNMAP mappings in order to support COWable mappings.
  *
+ * vm_normal_page() returns NULL if ZERO_PAGE is found.
  */
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 # define HAVE_PTE_SPECIAL 1
@@ -527,11 +528,12 @@ struct page *vm_normal_page(struct vm_ar
 	}
 
 check_pfn:
+	if (unlikely(pte_zero(pte)))
+		return NULL;
 	if (unlikely(pfn > highest_memmap_pfn)) {
 		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
-
 	/*
 	 * NOTE! We still have PageReserved() pages in the page tables.
 	 * eg. VDSO mappings can cause them to exist.
@@ -605,7 +607,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
 		rss[!!PageAnon(page)]++;
-	}
+	} else if (pte_zero(pte))
+		rss[1]++;
 
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
@@ -813,6 +816,8 @@ static unsigned long zap_pte_range(struc
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
+			if (pte_zero(ptent))
+				file_rss--;
 			if (unlikely(!page))
 				continue;
 			if (unlikely(details) && details->nonlinear_vma
@@ -1149,9 +1154,13 @@ struct page *follow_page(struct vm_area_
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
-	page = vm_normal_page(vma, address, pte);
-	if (unlikely(!page))
-		goto bad_page;
+
+	if (likely(!pte_zero(pte))) {
+		page = vm_normal_page(vma, address, pte);
+		if (unlikely(!page))
+			goto bad_page;
+	} else
+		page = ZERO_PAGE(0);
 
 	if (flags & FOLL_GET)
 		get_page(page);
@@ -1164,7 +1173,8 @@ struct page *follow_page(struct vm_area_
 		 * is needed to avoid losing the dirty bit: it is easier to use
 		 * mark_page_accessed().
 		 */
-		mark_page_accessed(page);
+		if (!pte_zero(pte))
+			mark_page_accessed(page);
 	}
 unlock:
 	pte_unmap_unlock(ptep, ptl);
@@ -1267,7 +1277,12 @@ int __get_user_pages(struct task_struct 
 				return i ? : -EFAULT;
 			}
 			if (pages) {
-				struct page *page = vm_normal_page(gate_vma, start, *pte);
+				struct page *page;
+				if (!pte_zero(*pte))
+					page = vm_normal_page(gate_vma,
+							      start, *pte);
+				else
+					page = ZERO_PAGE(0);
 				pages[i] = page;
 				if (page)
 					get_page(page);
@@ -1960,6 +1975,13 @@ static int do_wp_page(struct mm_struct *
 	int reuse = 0, ret = 0;
 	int page_mkwrite = 0;
 	struct page *dirty_page = NULL;
+	gfp_t gfpflags = GFP_HIGHUSER_MOVABLE;
+
+	if (pte_zero(orig_pte)) {
+		gfpflags |= __GFP_ZERO;
+		old_page = NULL;
+		goto gotten;
+	}
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
@@ -2082,7 +2104,7 @@ gotten:
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 	VM_BUG_ON(old_page == ZERO_PAGE(0));
-	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	new_page = alloc_page_vma(gfpflags, vma, address);
 	if (!new_page)
 		goto oom;
 	/*
@@ -2094,7 +2116,9 @@ gotten:
 		clear_page_mlock(old_page);
 		unlock_page(old_page);
 	}
-	cow_user_page(new_page, old_page, address, vma);
+	/* If zeropage COW, page is already cleared */
+	if (!pte_zero(orig_pte))
+		cow_user_page(new_page, old_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
@@ -2110,8 +2134,11 @@ gotten:
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
 			}
-		} else
+		} else {
+			if (pte_zero(orig_pte))
+				dec_mm_counter(mm, file_rss);
 			inc_mm_counter(mm, anon_rss);
+		}
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2618,6 +2645,32 @@ out_page:
 	return ret;
 }
 
+static int do_anon_zeromap(struct mm_struct *mm, struct vm_area_struct *vma,
+			   pmd_t *pmd, unsigned long address)
+{
+	spinlock_t *ptl;
+	pte_t entry;
+	pte_t *page_table;
+	int ret = 1;
+	/*
+	 * only usual linear objrmap-vma can use zeropage. see vm_normal_page().
+	 */
+	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+		return ret;
+
+	entry = mk_pte(ZERO_PAGE(0), vma->vm_page_prot);
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_none(*page_table))
+		goto out_unlock;
+	inc_mm_counter(mm, file_rss);
+	set_pte_at(mm, address, page_table, entry);
+	update_mmu_cache(vma, address, entry);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+}
+
 /*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2631,9 +2684,12 @@ static int do_anonymous_page(struct mm_s
 	spinlock_t *ptl;
 	pte_t entry;
 
-	/* Allocate our own private page. */
 	pte_unmap(page_table);
 
+	if (unlikely(!(flags & FAULT_FLAG_WRITE)))
+		if (!do_anon_zeromap(mm, vma, pmd, address))
+			return 0;
+	/* Allocate our own private page */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, address);
Index: zeropage-trial/mm/fremap.c
===================================================================
--- zeropage-trial.orig/mm/fremap.c
+++ zeropage-trial/mm/fremap.c
@@ -41,6 +41,9 @@ static void zap_pte(struct mm_struct *mm
 			page_cache_release(page);
 			update_hiwater_rss(mm);
 			dec_mm_counter(mm, file_rss);
+		} else if (pte_zero(pte)) {
+			update_hiwater_rss(mm);
+			dec_mm_counter(mm, file_rss);
 		}
 	} else {
 		if (!pte_file(pte))
Index: zeropage-trial/mm/rmap.c
===================================================================
--- zeropage-trial.orig/mm/rmap.c
+++ zeropage-trial/mm/rmap.c
@@ -941,7 +941,7 @@ static int try_to_unmap_cluster(unsigned
 	update_hiwater_rss(mm);
 
 	for (; address < end; pte++, address += PAGE_SIZE) {
-		if (!pte_present(*pte))
+		if (!pte_present(*pte) || pte_zero(*pte))
 			continue;
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
Index: zeropage-trial/fs/proc/task_mmu.c
===================================================================
--- zeropage-trial.orig/fs/proc/task_mmu.c
+++ zeropage-trial/fs/proc/task_mmu.c
@@ -342,7 +342,11 @@ static int smaps_pte_range(pmd_t *pmd, u
 			continue;
 
 		mss->resident += PAGE_SIZE;
-
+		if (pte_zero(ptent)) {
+			mss->shared_clean += PAGE_SIZE;
+			/* pss can be considered to be 0 */
+			continue;
+		}
 		page = vm_normal_page(vma, addr, ptent);
 		if (!page)
 			continue;
@@ -451,6 +455,10 @@ static int clear_refs_pte_range(pmd_t *p
 		if (!pte_present(ptent))
 			continue;
 
+		if (pte_zero(ptent)) {
+			ptep_test_and_clear_young(vma, addr, pte);
+			continue;
+		}
 		page = vm_normal_page(vma, addr, ptent);
 		if (!page)
 			continue;
Index: zeropage-trial/mm/mempolicy.c
===================================================================
--- zeropage-trial.orig/mm/mempolicy.c
+++ zeropage-trial/mm/mempolicy.c
@@ -404,19 +404,14 @@ static int check_pte_range(struct vm_are
 
 		if (!pte_present(*pte))
 			continue;
+		/* vm_normal_page() returns NULL for the zero page here. */
 		page = vm_normal_page(vma, addr, *pte);
 		if (!page)
 			continue;
 		/*
 		 * The check for PageReserved here is important to avoid
-		 * handling zero pages and other pages that may have been
-		 * marked special by the system.
-		 *
-		 * If the PageReserved would not be checked here then f.e.
-		 * the location of the zero page could have an influence
-		 * on MPOL_MF_STRICT, zero pages would be counted for
-		 * the per node stats, and there would be useless attempts
-		 * to put zero pages on the migration list.
+		 * handling pages that may have been marked special by the
+		 * system.
 		 */
 		if (PageReserved(page))
 			continue;

* [RFC][PATCH 3/4] get_user_pages READ fault handling special cases
  2009-07-07  7:51 [RFC][PATCH 0/4] ZERO PAGE again v2 KAMEZAWA Hiroyuki
  2009-07-07  7:52 ` [RFC][PATCH 1/4] introduce pte_zero() KAMEZAWA Hiroyuki
  2009-07-07  7:54 ` [RFC][PATCH 2/4] use ZERO_PAGE for READ fault in regular anonymous mapping KAMEZAWA Hiroyuki
@ 2009-07-07  7:59 ` KAMEZAWA Hiroyuki
  2009-07-07 16:50   ` Linus Torvalds
  2009-07-07  8:01 ` [RFC][PATCH 4/4] add get user pages nozero KAMEZAWA Hiroyuki
  2009-07-07  8:47 ` [RFC][PATCH 0/4] ZERO PAGE again v2 Nick Piggin
  4 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-07  7:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, npiggin, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

Most parts of this patch are overwritten by the 4/4 patch.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, get_user_pages(READ) can return ZERO_PAGE, but this creates some trouble.
This patch is a workaround for each caller.
 - mlock() .... ignore ZERO_PAGE if found. This happens only when mlock on a
		read-only mapping finds zero pages.
 - futex() .... if ZERO_PAGE is found .... BUG? (but possible...)
 - lookup_node() .... no good idea; this is the same behavior as in 2.6.23.

Others?

I wonder whether it would be better to add a function that replaces
ZERO_PAGE with a usual page (do copy-on-write if ZERO_PAGE)
in some special cases (like the futex case above).

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Index: zeropage-trial/mm/mlock.c
===================================================================
--- zeropage-trial.orig/mm/mlock.c
+++ zeropage-trial/mm/mlock.c
@@ -220,19 +220,24 @@ static long __mlock_vma_pages_range(stru
 		for (i = 0; i < ret; i++) {
 			struct page *page = pages[i];
 
-			lock_page(page);
+
 			/*
 			 * Because we lock page here and migration is blocked
 			 * by the elevated reference, we need only check for
 			 * page truncation (file-cache only).
+			 *
+			 * The page can be ZERO_PAGE if VM_WRITE is not set.
 			 */
-			if (page->mapping) {
-				if (mlock)
-					mlock_vma_page(page);
-				else
-					munlock_vma_page(page);
+			if (!page_is_zero(page)) {
+				lock_page(page);
+				if (page->mapping) {
+					if (mlock)
+						mlock_vma_page(page);
+					else
+						munlock_vma_page(page);
+				}
+				unlock_page(page);
 			}
-			unlock_page(page);
 			put_page(page);		/* ref from get_user_pages() */
 
 			/*
Index: zeropage-trial/kernel/futex.c
===================================================================
--- zeropage-trial.orig/kernel/futex.c
+++ zeropage-trial/kernel/futex.c
@@ -249,9 +249,23 @@ again:
 
 	lock_page(page);
 	if (!page->mapping) {
+		if (!page_is_zero(page)) {
+			unlock_page(page);
+			put_page(page);
+			goto again;
+		}
+		/*
+	 	* Finding ZERO PAGE here is obviously user's BUG because
+	 	* futex_wake()etc. is called against never-written page.
+	 	* Considering how futex is used, this kind of bug should not
+	 	* happen i.e. very strange system bug. Then, print out message.
+	 	*/
 		unlock_page(page);
 		put_page(page);
-		goto again;
+		printk(KERN_WARNING "futex is called against not-initialized"
+				     "memory %d(%s) at %p", current->pid,
+				     current->comm, (void*)address);
+		return -EINVAL;
 	}
 
 	/*
Index: zeropage-trial/mm/mempolicy.c
===================================================================
--- zeropage-trial.orig/mm/mempolicy.c
+++ zeropage-trial/mm/mempolicy.c
@@ -684,6 +684,10 @@ static int lookup_node(struct mm_struct 
 	struct page *p;
 	int err;
 
+	/*
+	 * This get_user_page() may catch ZERO PAGE. In that case, returned
+	 * value will not be very useful. But we can't return error here.
+	 */
 	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
 	if (err >= 0) {
 		err = page_to_nid(p);

* Re: [RFC][PATCH 3/4] get_user_pages READ fault handling special cases
  2009-07-07  7:59 ` [RFC][PATCH 3/4] get_user_pages READ fault handling special cases KAMEZAWA Hiroyuki
@ 2009-07-07 16:50   ` Linus Torvalds
  2009-07-08  0:03     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-07-07 16:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, npiggin, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org

On Tue, 7 Jul 2009, KAMEZAWA Hiroyuki wrote:
>
> Now, get_user_pages(READ) can return ZERO_PAGE, but this creates some trouble.
> This patch is a workaround for each caller.
>  - mlock() .... ignore ZERO_PAGE if found. This happens only when mlock on a
> 		read-only mapping finds zero pages.
>  - futex() .... if ZERO_PAGE is found .... BUG? (but possible...)
>  - lookup_node() .... no good idea; this is the same behavior as in 2.6.23.

Gaah. None of these special cases seem at all valid.

I _like_ ZERO_PAGE(), but I always liked it mainly with the whole 
"PAGE_RESERVED" flag.

And I think that if we resurrect zero-page, then we should do it with the 
modern equivalent of PAGE_RESERVED, namely the "pte_special()" bit. 
Anybody who walks page tables had better already handle special PTE 
entries (or we could trivially extend them - in case they currently just 
look at the vm_flags and decide that the range can have no special pages).

So I'd suggest instead:

 - always mark the zero page with PTE_SPECIAL. This avoids the constant 
   page count updates - that's what PTE_SPECIAL means, after all.

   The page count updates was what killed ZERO_PAGE. It's wonderful for 
   cache behaviour _other_ than the ping-pong of having to modify the 
   "struct page".

 - for architectures that don't have the PTE_SPECIAL bit in the page 
   tables, we don't do the magic zero page at all.

 - for architectures that have virtual caches and cannot handle a single 
   zero page well (eg the mess we had with MIPS and multiple zero-pages), 
   also simply don't do it, at least not initially.

 - for the rest, depend on pte_special().

 - pass down the fault flags to "vm_normal_page()", and let one of the 
   bits in there say "I want the zero-page". That way "get_user_pages()" 
   can just treat the zero page as a normal page (it's read-only, of 
   course, but we check the page tables, so that's ok). We'd increment the 
   page count there, but nowhere else (we _need_ to increment the zero 
   page count there, since it will be decremented at free time, and we've 
   lost the page table entry that says that the "struct page *" is 
   special).

With something like the above, there really shouldn't be a lot of 
special-case code. None of these games with mlock etc. Nothing should 
_ever_ need to test "is_zero_page()", because the only thing that does so 
is vm_normal_page() - and if that one returns the "struct page *", then 
it's going to be considered a normal page, nothing special.

That's how the _original_ ZERO_PAGE worked. It had pretty much no special 
case logic. It was basically treated as an IO page from an allocation 
standpoint, thanks to the PG_Reserved bit, but other than that nobody 
really cared.
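
One way to picture the suggestion above is the userspace model below. It is
only a sketch: SKETCH_WANT_ZERO, struct sketch_pte, struct sketch_page,
sketch_normal_page() and sketch_gup_one() are hypothetical names made up for
illustration and are not existing kernel interfaces. The lookup normally hides
special ptes, hands back the zero page only when the caller asks for it, and
the get_user_pages-like caller is the one place that takes a reference.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_WANT_ZERO 0x1	/* hypothetical "I want the zero-page" flag */

struct sketch_page {
	int refcount;
};

struct sketch_pte {
	struct sketch_page *page;
	int special;		/* models the pte_special() bit */
};

struct sketch_page sketch_zero_page;	/* the single shared zero page */

/* Model of the lookup: special ptes yield NULL unless the zero page is wanted. */
struct sketch_page *sketch_normal_page(const struct sketch_pte *pte,
				       unsigned int flags)
{
	if (!pte->special)
		return pte->page;
	if ((flags & SKETCH_WANT_ZERO) && pte->page == &sketch_zero_page)
		return pte->page;
	return NULL;
}

/* Model of a get_user_pages-like caller: the only place that takes a ref. */
struct sketch_page *sketch_gup_one(const struct sketch_pte *pte)
{
	struct sketch_page *page = sketch_normal_page(pte, SKETCH_WANT_ZERO);

	if (page)
		page->refcount++;	/* dropped again when the caller puts the page */
	return page;
}
```

In this model every other page-table walker passes 0 as flags, sees NULL for
the zero page, and needs no zero-page test of its own.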

			Linus

* Re: [RFC][PATCH 3/4] get_user_pages READ fault handling special cases
  2009-07-07 16:50   ` Linus Torvalds
@ 2009-07-08  0:03     ` KAMEZAWA Hiroyuki
  2009-07-08  1:38       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-08  0:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm@kvack.org, npiggin, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org

On Tue, 7 Jul 2009 09:50:19 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Tue, 7 Jul 2009, KAMEZAWA Hiroyuki wrote:
> >
> > Now, get_user_pages(READ) can return ZERO_PAGE, but this creates some trouble.
> > This patch is a workaround for each caller.
> >  - mlock() .... ignore ZERO_PAGE if found. This happens only when mlock on a
> > 		read-only mapping finds zero pages.
> >  - futex() .... if ZERO_PAGE is found .... BUG? (but possible...)
> >  - lookup_node() .... no good idea; this is the same behavior as in 2.6.23.
> 
> Gaah. None of these special cases seem at all valid.
> 
Yes, this patch is for hearing how it should be done.

> I _like_ ZERO_PAGE(), but I always liked it mainly with the whole 
> "PAGE_RESERVED" flag.
> 
ok.

> And I think that if we resurrect zero-page, then we should do it with the 
> modern equivalent of PAGE_RESERVED, namely the "pte_special()" bit. 
> Anybody who walks page tables had better already handle special PTE 
> entries (or we could trivially extend them - in case they currently just 
> look at the vm_flags and decide that the range can have no special pages).
> 
Hm, ok. I'll remove pte_zero and use pte_special instead of it.

> So I'd suggest instead:
> 
>  - always mark the zero page with PTE_SPECIAL. This avoids the constant 
>    page count updates - that's what PTE_SPECIAL means, after all.
> 
>    The page count updates was what killed ZERO_PAGE. It's wonderful for 
>    cache behaviour _other_ than the ping-pong of having to modify the 
>    "struct page".
> 
yes.

>  - for architectures that don't have the PTE_SPECIAL bit in the page 
>    tables, we don't do the magic zero page at all.
> 
ok.

>  - for architectures that have virtual caches and cannot handle a single 
>    zero page well (eg the mess we had with MIPS and multiple zero-pages), 
>    also simply don't do it, at least not initially.
> 
ok. will add config check in do_anonymous_page as

#ifdef CONFIG_ARCH_USE_ZEROPAGE
static int do_zeromap_anon_private()
{
	......
}
#else
static int do_zeromap_anon_private()
{
	return false;
}
#endif



>  - for the rest, depend on pte_special().
> 
sure.

>  - pass down the fault flags to "vm_normal_page()", and let one of the 
>    bits in there say "I want the zero-page". That way "get_user_pages()" 
>    can just treat the zero page as a normal page (it's read-only, of 
>    course, but we check the page tables, so that's ok). We'd increment the 
>    page count there, but nowhere else (we _need_ to increment the zero 
>    page count there, since it will be decremented at free time, and we've 
>    lost the page table entry that says that the "struct page *" is 
>    special).
> 
ok. This is not far from this patch series, except for pte_zero() vs. pte_special().

> With something like the above, there really shouldn't be a lot of 
> special-case code. None of these games with mlock etc. Nothing should 
> _ever_ need to test "is_zero_page()", because the only thing that does so 
> is vm_normal_page() - and if that one returns the "struct page *", then 
> it's going to be considered a normal page, nothing special.
> 
> That's how the _original_ ZERO_PAGE worked. It had pretty much no special 
> case logic. It was basically treated as an IO page from an allocation 
> standpoint, thanks to the PG_Reserved bit, but other than that nobody 
> really cared.
> 
About the above 3 cases:
 - mlock() case .... yes, pte_special/PG_reserved will work well.
 - mempolicy case ... This was broken even with the original ZERO_PAGE, I think.
                      I'll ignore this for now.
 - futex case   .... my mistake. I missed that I should handle the
                     VM_SHARED|VM_MAYSHARE (i.e. not private) case and avoid
                     ZERO_PAGE for such vmas. I'll remove this.

By using pte_special(), the whole patch size will be reduced. Let me try v3.

Thanks,
-Kame

* Re: [RFC][PATCH 3/4] get_user_pages READ fault handling special cases
  2009-07-08  0:03     ` KAMEZAWA Hiroyuki
@ 2009-07-08  1:38       ` KAMEZAWA Hiroyuki
  2009-07-08  2:27         ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-08  1:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linus Torvalds, linux-mm@kvack.org, npiggin,
	hugh.dickins@tiscali.co.uk, avi, akpm@linux-foundation.org

On Wed, 8 Jul 2009 09:03:44 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 7 Jul 2009 09:50:19 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > And I think that if we resurrect zero-page, then we should do it with the 
> > modern equivalent of PAGE_RESERVED, namely the "pte_special()" bit. 
> > Anybody who walks page tables had better already handle special PTE 
> > entries (or we could trivially extend them - in case they currently just 
> > look at the vm_flags and decide that the range can have no special pages).
> > 
> Hm, ok. I'll remove pte_zero and use pte_special instead of it.
> 

Can I ask a question?

As far as I know,

 - ZERO_PAGE was not accounted as RSS (in the 2.6.9 days)
 - ZERO_PAGE was accounted as file_rss (until 2.6.24)

Maybe this commit is the change:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4294621f41a85497019fae64341aa5351a1921b7

Is there a special reason to have to account zero page as file_rss ?
If not, pte_special() solution works well. (I think not necessary..)

This was one reason I added pte_zero().

Thanks,
-Kame

* Re: [RFC][PATCH 3/4] get_user_pages READ fault handling special cases
  2009-07-08  1:38       ` KAMEZAWA Hiroyuki
@ 2009-07-08  2:27         ` Linus Torvalds
  0 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2009-07-08  2:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, npiggin, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org



On Wed, 8 Jul 2009, KAMEZAWA Hiroyuki wrote:
> 
> Is there a special reason to have to account zero page as file_rss ?
> If not, pte_special() solution works well. (I think not necessary..)

I would suggest _not_ accounting the zero page. After all, it doesn't 
actually use any memory.

		Linus

* [RFC][PATCH 4/4] add get user pages nozero
  2009-07-07  7:51 [RFC][PATCH 0/4] ZERO PAGE again v2 KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2009-07-07  7:59 ` [RFC][PATCH 3/4] get_user_pages READ fault handling special cases KAMEZAWA Hiroyuki
@ 2009-07-07  8:01 ` KAMEZAWA Hiroyuki
  2009-07-07  8:47 ` [RFC][PATCH 0/4] ZERO PAGE again v2 Nick Piggin
  4 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-07  8:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, npiggin, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

Just an experiment. Better ideas are welcome.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, get_user_pages() can return ZERO_PAGE if it is mapped. But for some callers,
ZERO_PAGE is not suitable and it is better to avoid it.

This patch adds
  - get_user_pages_nozero() (READ fault only)

This function works like the usual get_user_pages(), but if it finds ZERO_PAGE
in a usual mapping (a mapping where a vma exists), it purges the zero pte and
faults again. In this page fault, zero page mapping is avoided.
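
The purge-and-refault idea can be modeled in userspace as follows. This is a
sketch only: struct demo_mm, demo_read_fault() and demo_get_page_nozero() are
hypothetical names for illustration, the "page table" is a single slot, and
the model assumes that any nonzero avoid_zeromap count disables zero page
mapping, which may not match the exact condition used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096

unsigned char demo_zero_page[DEMO_PAGE_SIZE];	/* shared, never written */

struct demo_mm {
	int avoid_zeromap;	/* nonzero: read faults must not map the zero page */
	unsigned char *pte;	/* single-slot "page table"; NULL = not present */
};

/* Model of a read fault. Returns 0 on success, -1 on out-of-memory. */
int demo_read_fault(struct demo_mm *mm)
{
	if (mm->pte != NULL)
		return 0;		/* already mapped */
	if (mm->avoid_zeromap == 0) {
		mm->pte = demo_zero_page;
		return 0;
	}
	mm->pte = calloc(1, DEMO_PAGE_SIZE);	/* private zeroed page instead */
	return mm->pte != NULL ? 0 : -1;
}

/* Model of the nozero lookup: never hands the shared zero page to the caller. */
unsigned char *demo_get_page_nozero(struct demo_mm *mm)
{
	unsigned char *page = NULL;

	mm->avoid_zeromap++;		/* like mm_exclude_zeropage() */
	if (mm->pte == demo_zero_page)
		mm->pte = NULL;		/* purge the zero pte */
	if (demo_read_fault(mm) == 0)	/* fault again, zero page now avoided */
		page = mm->pte;
	mm->avoid_zeromap--;		/* like mm_allow_zeropage() */
	return page;
}
```

The counter is raised only around the lookup, so ordinary read faults outside
it still get the zero page in this model.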

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/mm.h       |   19 +++++++++++++++
 include/linux/mm_types.h |    2 +
 kernel/futex.c           |   25 ++++++++++---------
 mm/internal.h            |    1 
 mm/memory.c              |   59 ++++++++++++++++++++++++++++++++++++++++++++---
 mm/mempolicy.c           |    5 +--
 6 files changed, 93 insertions(+), 18 deletions(-)

Index: zeropage-trial/include/linux/mm.h
===================================================================
--- zeropage-trial.orig/include/linux/mm.h
+++ zeropage-trial/include/linux/mm.h
@@ -841,6 +841,21 @@ static inline int page_is_zero(struct pa
 	return page == ZERO_PAGE(0);
 }
 
+/*
+ * These functions are for avoiding zero-page mapping while someone calls
+ * get_user_pages() etc. See mm/memory.c::get_user_pages_nozero().
+ * While mm->avoid_zeromap > 0, a new read page fault to not-present memory
+ * will not use ZERO_PAGE. This is not set in usual page faults.
+ */
+static inline void mm_exclude_zeropage(struct mm_struct *mm)
+{
+	atomic_inc(&mm->avoid_zeromap);
+}
+
+static inline void mm_allow_zeropage(struct mm_struct *mm)
+{
+	atomic_dec(&mm->avoid_zeromap);
+}
 
 extern int make_pages_present(unsigned long addr, unsigned long end);
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
@@ -851,6 +866,9 @@ int get_user_pages(struct task_struct *t
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 
+int get_user_pages_nozero(struct task_struct *tsk, struct mm_struct *mm,
+		  unsigned long start, int nr_pages, struct page **pages);
+
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);
 
@@ -1262,6 +1280,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
+#define FOLL_NOZERO	0x10	/* return NULL even if ZEROPAGE is mapped */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
Index: zeropage-trial/kernel/futex.c
===================================================================
--- zeropage-trial.orig/kernel/futex.c
+++ zeropage-trial/kernel/futex.c
@@ -254,18 +254,19 @@ again:
 			put_page(page);
 			goto again;
 		}
-		/*
-	 	* Finding ZERO PAGE here is obviously user's BUG because
-	 	* futex_wake()etc. is called against never-written page.
-	 	* Considering how futex is used, this kind of bug should not
-	 	* happen i.e. very strange system bug. Then, print out message.
-	 	*/
-		unlock_page(page);
-		put_page(page);
-		printk(KERN_WARNING "futex is called against not-initialized"
-				     "memory %d(%s) at %p", current->pid,
-				     current->comm, (void*)address);
-		return -EINVAL;
+		{
+			struct mm_struct *mm = current->mm;
+			/*
+			 * _VERY_ SLOW PATH: we find zeropage...replace it
+			 * see mm/memory.c
+			 */
+			down_read(&mm->mmap_sem);
+			err = get_user_pages_nozero(current, mm,
+						    address, 1, &page);
+			up_read(&mm->mmap_sem);
+			if (err < 0)
+				return err;
+		}
 	}
 
 	/*
Index: zeropage-trial/mm/internal.h
===================================================================
--- zeropage-trial.orig/mm/internal.h
+++ zeropage-trial/mm/internal.h
@@ -254,6 +254,7 @@ static inline void mminit_validate_memmo
 #define GUP_FLAGS_FORCE                  0x2
 #define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
 #define GUP_FLAGS_IGNORE_SIGKILL         0x8
+#define GUP_FLAGS_NOZEROPAGE		 0x10
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
Index: zeropage-trial/mm/memory.c
===================================================================
--- zeropage-trial.orig/mm/memory.c
+++ zeropage-trial/mm/memory.c
@@ -1159,7 +1159,9 @@ struct page *follow_page(struct vm_area_
 		page = vm_normal_page(vma, address, pte);
 		if (unlikely(!page))
 			goto bad_page;
-	} else
+	} else if (flags & FOLL_NOZERO)
+		goto unlock;
+	else
 		page = ZERO_PAGE(0);
 
 	if (flags & FOLL_GET)
@@ -1234,6 +1236,7 @@ int __get_user_pages(struct task_struct 
 	int force = !!(flags & GUP_FLAGS_FORCE);
 	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
 	int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+	int no_zero = !!(flags & GUP_FLAGS_NOZEROPAGE);
 
 	if (nr_pages <= 0)
 		return 0;
@@ -1277,11 +1280,11 @@ int __get_user_pages(struct task_struct 
 				return i ? : -EFAULT;
 			}
 			if (pages) {
-				struct page *page;
+				struct page *page = NULL;
 				if (!pte_zero(*pte))
 					page = vm_normal_page(gate_vma,
 							      start, *pte);
-				else
+				else if (!no_zero)
 					page = ZERO_PAGE(page);
 				pages[i] = page;
 				if (page)
@@ -1329,6 +1332,8 @@ int __get_user_pages(struct task_struct 
 
 			if (write)
 				foll_flags |= FOLL_WRITE;
+			if (no_zero)
+				foll_flags |= FOLL_NOZERO;
 
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
@@ -1452,6 +1457,26 @@ int get_user_pages(struct task_struct *t
 
 EXPORT_SYMBOL(get_user_pages);
 
+/*
+ * get_user_pages_nozero() is provided for READ operations of
+ * get_user_pages() and is guaranteed not to return ZERO_PAGE. If
+ * get_user_pages(_fast)() returns ZERO_PAGE and the caller doesn't want
+ * that, it can call this function to allocate new anon pages in that place.
+ * Unnecessary flags are omitted.
+ *
+ */
+int get_user_pages_nozero(struct task_struct *tsk, struct mm_struct *mm,
+		  unsigned long start, int nr_pages, struct page **pages)
+{
+	int ret;
+	mm_exclude_zeropage(mm);
+	ret = __get_user_pages(tsk, mm, start, nr_pages,
+				      GUP_FLAGS_NOZEROPAGE, pages, NULL);
+	mm_allow_zeropage(mm);
+	return ret;
+}
+
+
 pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
 			spinlock_t **ptl)
 {
@@ -2657,6 +2682,8 @@ static int do_anon_zeromap(struct mm_str
 	 */
 	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 		return ret;
+	if (atomic_read(&mm->avoid_zeromap))
+		return ret;
 
 	entry = mk_pte(ZERO_PAGE(0), vma->vm_page_prot);
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -2954,6 +2981,19 @@ static int do_nonlinear_fault(struct mm_
 }
 
 /*
+ * Unmap zero page and allow fault here again.
+ */
+void flush_zero_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+		    unsigned long address, pte_t *pte)
+{
+	flush_cache_page(vma, address, zero_page_pfn);
+	ptep_clear_flush_notify(vma, address, pte);
+	update_hiwater_rss(mm);
+	dec_mm_counter(mm, file_rss);
+}
+
+
+/*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
  * RISC architectures).  The early dirtying is also good on the i386.
@@ -2974,6 +3014,19 @@ static inline int handle_pte_fault(struc
 	spinlock_t *ptl;
 
 	entry = *pte;
+	/*
+	 * Read fault on a mapped zero page...this is caused artificially by
+	 * get_user_pages(). We purge this pte and fall through. This is a very
+	 * rare case. On a write fault, copy-on-write will handle everything.
+	 */
+	if (unlikely(!(flags & FAULT_FLAG_WRITE) && pte_zero(entry))) {
+		ptl = pte_lockptr(mm, pmd);
+		spin_lock(ptl);
+		if (pte_zero(*pte)) /* purge mapped zero page */
+			flush_zero_pte(mm, vma, address, pte);
+		spin_unlock(ptl);
+		entry = *pte;
+	}
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
Index: zeropage-trial/include/linux/mm_types.h
===================================================================
--- zeropage-trial.orig/include/linux/mm_types.h
+++ zeropage-trial/include/linux/mm_types.h
@@ -266,6 +266,8 @@ struct mm_struct {
 	spinlock_t		ioctx_lock;
 	struct hlist_head	ioctx_list;
 
+	atomic_t		avoid_zeromap;
+
 #ifdef CONFIG_MM_OWNER
 	/*
 	 * "owner" points to a task that is regarded as the canonical
Index: zeropage-trial/mm/mempolicy.c
===================================================================
--- zeropage-trial.orig/mm/mempolicy.c
+++ zeropage-trial/mm/mempolicy.c
@@ -685,10 +685,9 @@ static int lookup_node(struct mm_struct 
 	int err;
 
 	/*
-	 * This get_user_page() may catch ZERO PAGE. In that case, returned
-	 * value will not be very useful. But we can't return error here.
+	 * get_user_pages_nozero() never returns ZERO PAGE.
 	 */
-	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+	err = get_user_pages_nozero(current, mm, addr & PAGE_MASK, 1, &p);
 	if (err >= 0) {
 		err = page_to_nid(p);
 		put_page(p);

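To make the bracketing in get_user_pages_nozero() easier to follow, here is a small userspace model of the counter. It is only a sketch: struct mm_model and the two helper predicates are hypothetical stand-ins that do not exist in the kernel, and C11 atomics are used in place of the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the single field the patch adds to mm_struct. */
struct mm_model {
	atomic_int avoid_zeromap;	/* > 0: read faults must not map the zero page */
};

static void mm_exclude_zeropage(struct mm_model *mm)
{
	atomic_fetch_add(&mm->avoid_zeromap, 1);
}

static void mm_allow_zeropage(struct mm_model *mm)
{
	int old = atomic_fetch_sub(&mm->avoid_zeromap, 1);

	assert(old > 0);	/* an unbalanced "allow" would be a caller bug */
	(void)old;
}

/* Models the check added to do_anon_zeromap(): may a read fault map the zero page? */
static bool read_fault_may_use_zeropage(struct mm_model *mm)
{
	return atomic_load(&mm->avoid_zeromap) == 0;
}

/*
 * Models get_user_pages_nozero(): any fault taken between exclude and
 * allow sees the counter raised, so it reports "zero page not allowed".
 */
static bool nozero_lookup_sees_zeropage(struct mm_model *mm)
{
	bool allowed;

	mm_exclude_zeropage(mm);
	allowed = read_fault_may_use_zeropage(mm);
	mm_allow_zeropage(mm);
	return allowed;
}
```

Because the counter lives in the mm rather than in the calling task, read faults from other threads sharing that mm also skip the zero page while any caller is inside the bracket. A counter rather than a flag lets such brackets nest or overlap.
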

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-07  7:51 [RFC][PATCH 0/4] ZERO PAGE again v2 KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2009-07-07  8:01 ` [RFC][PATCH 4/4] add get user pages nozero KAMEZAWA Hiroyuki
@ 2009-07-07  8:47 ` Nick Piggin
  2009-07-07  9:05   ` Avi Kivity
  2009-07-07  9:06   ` KAMEZAWA Hiroyuki
  4 siblings, 2 replies; 34+ messages in thread
From: Nick Piggin @ 2009-07-07  8:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

On Tue, Jul 07, 2009 at 04:51:01PM +0900, KAMEZAWA Hiroyuki wrote:
> Hi, this is ZERO_PAGE mapping revival patch v2.
> 
> ZERO PAGE was removed in 2.6.24 (=> http://lkml.org/lkml/2007/10/9/112)
> and I had no objections.
> 
> In these days, at user support jobs, I noticed a few of customers
> are making use of ZERO_PAGE intentionally...brutal mmap and scan, etc. 
> (For example, scanning big sparse table and save the contents.)
> 
> They are using RHEL4-5(before 2.6.18) then they don't notice that ZERO_PAGE
> is gone, yet.
> yes, I can say  "ZERO PAGE is gone" to them in next generation distro.
> 
> Recently, a question comes to lkml (http://lkml.org/lkml/2009/6/4/383
> 
> Maybe there are some users of ZERO_PAGE other than my customers.
> So, can't we use ZERO_PAGE again ?
> 
> IIUC, the problem of ZERO_PAGE was
>   - reference count cache ping-pong
>   - complicated handling.
>   - the behavior page-fault-twice can make applications slow.
> 
> This patch is a trial to de-refcounted ZERO_PAGE.
> 
> This includes 4 patches.
> [1/4] introduce pte_zero() at el.
> [2/4] use ZERO_PAGE for READ fault in anonymous mapping.
> [3/4] corner cases, get_user_pages()
> [4/4] introduce get_user_pages_nozero().
> 
> I feel these patches needs to be clearer but includes almost all
> messes we have to handle at using ZERO_PAGE again.
> 
> What I feel now is
>  a. technically, we can do because we did.
>  b. Considering maintenance, code's beauty etc.. ZERO_PAGE adds messes.
>  c. Very big benefits for some (a few?) users but no benefits to usual programs.
>  
>  There are trade-off between b. and c.
>  
> Any comments are welcome.

Can we just try to wean them off it? Using zero page for huge sparse
matrices is probably not ideal anyway because it needs to still be
faulted in and it occupies TLB space. They might see better performance
by using a better algorithm.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-07  8:47 ` [RFC][PATCH 0/4] ZERO PAGE again v2 Nick Piggin
@ 2009-07-07  9:05   ` Avi Kivity
  2009-07-07  9:18     ` KAMEZAWA Hiroyuki
  2009-07-07  9:06   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 34+ messages in thread
From: Avi Kivity @ 2009-07-07  9:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	akpm@linux-foundation.org, torvalds

On 07/07/2009 11:47 AM, Nick Piggin wrote:
> On Tue, Jul 07, 2009 at 04:51:01PM +0900, KAMEZAWA Hiroyuki wrote:
>    
>> Hi, this is ZERO_PAGE mapping revival patch v2.
>>
>> ZERO PAGE was removed in 2.6.24 (=>  http://lkml.org/lkml/2007/10/9/112)
>> and I had no objections.
>>
>> In these days, at user support jobs, I noticed a few of customers
>> are making use of ZERO_PAGE intentionally...brutal mmap and scan, etc.
>> (For example, scanning big sparse table and save the contents.)
>>
>> They are using RHEL4-5(before 2.6.18) then they don't notice that ZERO_PAGE
>> is gone, yet.
>> yes, I can say  "ZERO PAGE is gone" to them in next generation distro.
>>
>> Recently, a question comes to lkml (http://lkml.org/lkml/2009/6/4/383
>>
>> Maybe there are some users of ZERO_PAGE other than my customers.
>> So, can't we use ZERO_PAGE again ?
>>
>> IIUC, the problem of ZERO_PAGE was
>>    - reference count cache ping-pong
>>    - complicated handling.
>>    - the behavior page-fault-twice can make applications slow.
>>
>> This patch is a trial to de-refcounted ZERO_PAGE.
>>
>> This includes 4 patches.
>> [1/4] introduce pte_zero() at el.
>> [2/4] use ZERO_PAGE for READ fault in anonymous mapping.
>> [3/4] corner cases, get_user_pages()
>> [4/4] introduce get_user_pages_nozero().
>>
>> I feel these patches needs to be clearer but includes almost all
>> messes we have to handle at using ZERO_PAGE again.
>>
>> What I feel now is
>>   a. technically, we can do because we did.
>>   b. Considering maintenance, code's beauty etc.. ZERO_PAGE adds messes.
>>   c. Very big benefits for some (a few?) users but no benefits to usual programs.
>>
>>   There are trade-off between b. and c.
>>
>> Any comments are welcome.
>>      
>
> Can we just try to wean them off it? Using zero page for huge sparse
> matrices is probably not ideal anyway because it needs to still be
> faulted in and it occupies TLB space. They might see better performance
> by using a better algorithm.
>    

For kvm live migration, I've thought of extending mincore() to report if 
a page will be read as zeros.
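
For reference, mincore() as it exists today reports only residency, one byte per page. The sketch below counts resident pages in a range; count_resident_pages() is a hypothetical helper name, and the "reads as zeros" bit I have in mind is an idea only and is not shown because it does not exist.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count the pages of [addr, addr + len) that mincore() reports as
 * resident. addr must be page-aligned. Returns -1 on error.
 */
static long count_resident_pages(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *vec;
	size_t npages;
	long resident = 0;

	if (page <= 0)
		return -1;
	assert((uintptr_t)addr % (uintptr_t)page == 0);

	npages = (len + (size_t)page - 1) / (size_t)page;
	vec = malloc(npages ? npages : 1);
	if (!vec)
		return -1;

	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}

	/* Only the least significant bit of each byte is defined. */
	for (size_t i = 0; i < npages; i++)
		resident += vec[i] & 1;

	free(vec);
	return resident;
}
```

A migration tool could skip pages that were never touched by using the residency bit alone, but it still cannot tell a resident all-zero page from a resident page with data, which is the gap the extension would fill.
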

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-07  9:05   ` Avi Kivity
@ 2009-07-07  9:18     ` KAMEZAWA Hiroyuki
  2009-07-07  9:26       ` Avi Kivity
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-07  9:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	akpm@linux-foundation.org, torvalds

On Tue, 07 Jul 2009 12:05:24 +0300
Avi Kivity <avi@redhat.com> wrote:

> On 07/07/2009 11:47 AM, Nick Piggin wrote:
> >> Any comments are welcome.
> >>      
> >
> > Can we just try to wean them off it? Using zero page for huge sparse
> > matrices is probably not ideal anyway because it needs to still be
> > faulted in and it occupies TLB space. They might see better performance
> > by using a better algorithm.
> >    
> 
> For kvm live migration, I've thought of extending mincore() to report if 
> a page will be read as zeros.
> 
BTW, can ksm scale well enough to merge all pages that contain only zeros?
Is there no heavy cache ping-pong without the zero page?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-07  9:18     ` KAMEZAWA Hiroyuki
@ 2009-07-07  9:26       ` Avi Kivity
  0 siblings, 0 replies; 34+ messages in thread
From: Avi Kivity @ 2009-07-07  9:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Nick Piggin, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	akpm@linux-foundation.org, torvalds

On 07/07/2009 12:18 PM, KAMEZAWA Hiroyuki wrote:
>> For kvm live migration, I've thought of extending mincore() to report if
>> a page will be read as zeros.
>>
>>      
> BTW, can ksm scale well enough to merge all pages that contain only zeros?
> Is there no heavy cache ping-pong without the zero page?
>    

ksm will increase cpu and cache load; it's oriented towards workloads 
where reducing memory pressure is more important than cpu load.  For 
cpu-intensive, low sharing workloads it will be disabled.  That's why I 
want an alternative way to deal with zero pages; it can be ZERO_PAGE, 
mincore(), or madvise(MADV_DROP_IFZERO).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-07  8:47 ` [RFC][PATCH 0/4] ZERO PAGE again v2 Nick Piggin
  2009-07-07  9:05   ` Avi Kivity
@ 2009-07-07  9:06   ` KAMEZAWA Hiroyuki
  2009-07-07 14:00     ` Nick Piggin
  2009-07-08 17:32     ` Andrea Arcangeli
  1 sibling, 2 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-07  9:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm@kvack.org, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

On Tue, 7 Jul 2009 10:47:50 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Jul 07, 2009 at 04:51:01PM +0900, KAMEZAWA Hiroyuki wrote:
> > Hi, this is ZERO_PAGE mapping revival patch v2.
> > 
> > ZERO PAGE was removed in 2.6.24 (=> http://lkml.org/lkml/2007/10/9/112)
> > and I had no objections.
> > 
> > In these days, at user support jobs, I noticed a few of customers
> > are making use of ZERO_PAGE intentionally...brutal mmap and scan, etc. 
> > (For example, scanning big sparse table and save the contents.)
> > 
> > They are using RHEL4-5(before 2.6.18) then they don't notice that ZERO_PAGE
> > is gone, yet.
> > yes, I can say  "ZERO PAGE is gone" to them in next generation distro.
> > 
> > Recently, a question comes to lkml (http://lkml.org/lkml/2009/6/4/383
> > 
> > Maybe there are some users of ZERO_PAGE other than my customers.
> > So, can't we use ZERO_PAGE again ?
> > 
> > IIUC, the problem of ZERO_PAGE was
> >   - reference count cache ping-pong
> >   - complicated handling.
> >   - the behavior page-fault-twice can make applications slow.
> > 
> > This patch is a trial to de-refcounted ZERO_PAGE.
> > 
> > This includes 4 patches.
> > [1/4] introduce pte_zero() at el.
> > [2/4] use ZERO_PAGE for READ fault in anonymous mapping.
> > [3/4] corner cases, get_user_pages()
> > [4/4] introduce get_user_pages_nozero().
> > 
> > I feel these patches needs to be clearer but includes almost all
> > messes we have to handle at using ZERO_PAGE again.
> > 
> > What I feel now is
> >  a. technically, we can do because we did.
> >  b. Considering maintenance, code's beauty etc.. ZERO_PAGE adds messes.
> >  c. Very big benefits for some (a few?) users but no benefits to usual programs.
> >  
> >  There are trade-off between b. and c.
> >  
> > Any comments are welcome.
> 
> Can we just try to wean them off it? Using zero page for huge sparse
> matrices is probably not ideal anyway because it needs to still be
> faulted in and it occupies TLB space. They might see better performance
> by using a better algorithm.
> 
TLB usage is a separate problem, I think...

I agreed with the removal of ZERO_PAGE in 2.6.24, but I'm now retrying this
for the following reasons.

1. From a programmer's perspective, I almost agree with you. But most users
are _not_ programmers, and telling them "please rewrite your program
because the OS changed its implementation" is no help.
What they want is to calculate something, not to write a program.

2. This change is _very_ implicit and doesn't affect almost all programs.
I think ZERO_PAGE() is used only when an application does some special jobs.

So most users will not notice that ZERO_PAGE is unavailable until
they find an OOM-killer message. This is a terrible situation for me
(and for most system admins).

3. For saving and restoring an application's data table, ZERO_PAGE is
   useful, maybe.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-07  9:06   ` KAMEZAWA Hiroyuki
@ 2009-07-07 14:00     ` Nick Piggin
  2009-07-07 16:59       ` Linus Torvalds
  2009-07-08 17:32     ` Andrea Arcangeli
  1 sibling, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2009-07-07 14:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

On Tue, Jul 07, 2009 at 06:06:29PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 7 Jul 2009 10:47:50 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Tue, Jul 07, 2009 at 04:51:01PM +0900, KAMEZAWA Hiroyuki wrote:
> > > Hi, this is ZERO_PAGE mapping revival patch v2.
> > > 
> > > ZERO PAGE was removed in 2.6.24 (=> http://lkml.org/lkml/2007/10/9/112)
> > > and I had no objections.
> > > 
> > > In these days, at user support jobs, I noticed a few of customers
> > > are making use of ZERO_PAGE intentionally...brutal mmap and scan, etc. 
> > > (For example, scanning big sparse table and save the contents.)
> > > 
> > > They are using RHEL4-5(before 2.6.18) then they don't notice that ZERO_PAGE
> > > is gone, yet.
> > > yes, I can say  "ZERO PAGE is gone" to them in next generation distro.
> > > 
> > > Recently, a question comes to lkml (http://lkml.org/lkml/2009/6/4/383
> > > 
> > > Maybe there are some users of ZERO_PAGE other than my customers.
> > > So, can't we use ZERO_PAGE again ?
> > > 
> > > IIUC, the problem of ZERO_PAGE was
> > >   - reference count cache ping-pong
> > >   - complicated handling.
> > >   - the behavior page-fault-twice can make applications slow.
> > > 
> > > This patch is a trial to de-refcounted ZERO_PAGE.
> > > 
> > > This includes 4 patches.
> > > [1/4] introduce pte_zero() at el.
> > > [2/4] use ZERO_PAGE for READ fault in anonymous mapping.
> > > [3/4] corner cases, get_user_pages()
> > > [4/4] introduce get_user_pages_nozero().
> > > 
> > > I feel these patches needs to be clearer but includes almost all
> > > messes we have to handle at using ZERO_PAGE again.
> > > 
> > > What I feel now is
> > >  a. technically, we can do because we did.
> > >  b. Considering maintenance, code's beauty etc.. ZERO_PAGE adds messes.
> > >  c. Very big benefits for some (a few?) users but no benefits to usual programs.
> > >  
> > >  There are trade-off between b. and c.
> > >  
> > > Any comments are welcome.
> > 
> > Can we just try to wean them off it? Using zero page for huge sparse
> > matrices is probably not ideal anyway because it needs to still be
> > faulted in and it occupies TLB space. They might see better performance
> > by using a better algorithm.
> > 
> TLB usage is a separate problem, I think...
> 
> I agreed with the removal of ZERO_PAGE in 2.6.24, but I'm now retrying this
> for the following reasons.
> 
> 1. From a programmer's perspective, I almost agree with you. But most users
> are _not_ programmers, and telling them "please rewrite your program
> because the OS changed its implementation" is no help.
> What they want is to calculate something, not to write a program.
> 
> 2. This change is _very_ implicit and doesn't affect almost all programs.
> I think ZERO_PAGE() is used only when an application does some special jobs.
> 
> So most users will not notice that ZERO_PAGE is unavailable until
> they find an OOM-killer message. This is a terrible situation for me
> (and for most system admins).
> 
> 3. For saving and restoring an application's data table, ZERO_PAGE is
>    useful, maybe.

I just wouldn't like to re-add significant complexity back to
the vm without good and concrete examples. OK I agree that
just saying "rewrite your code" is not so good, but are there
real significant problems? Is it just inside a particular linear
algebra library or something that might be able to be updated?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-07 14:00     ` Nick Piggin
@ 2009-07-07 16:59       ` Linus Torvalds
  2009-07-08  6:21         ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-07-07 16:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	avi, akpm@linux-foundation.org

On Tue, 7 Jul 2009, Nick Piggin wrote:
> 
> I just wouldn't like to re-add significant complexity back to
> the vm without good and concrete examples. OK I agree that
> just saying "rewrite your code" is not so good, but are there
> real significant problems? Is it just inside a particular linear
> algebra library or something that might be able to be updated?

The thing is, ZERO_PAGE really used to work very well.

It was not only useful for simple "I want lots of memory, and I'm going to 
use it pretty sparsely" (which _is_ a very valid thing to do), but it was 
useful for TLB benchmarking, and for cache-efficient "I'm going to write 
lots of zeroes to files", and for a number of other uses.

You can talk about TLB pressure all you want, but the fact is, quite often 
normal cache effects dominate - and ZERO_PAGE is _wonderful_ for sharing 
cachelines (which is why it was so useful for TLB performance testing: map 
a huge area, and you know that there will be no cache effects, only TLB 
effects).

There are actually very few cases where TLB effects are the primary ones - 
they tend to happen when you have truly random accesses that have no 
locality even on a small scale. That's pretty rare. Even things that depend 
on sparse arrays etc tend to mainly _access_ the parts it works on (ie you 
may have allocated hundreds of megs of memory to simplify your memory 
management, but you work on only a small part of it).
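
As a minimal userspace sketch of that "map a lot, touch a little" pattern: the function below maps an anonymous region and only reads from it. scan_untouched_mapping() is a hypothetical name, and whether the reads are backed by a shared zero page or by freshly allocated pages depends on the kernel, which is exactly the difference under discussion.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of anonymous memory, read one byte from every page
 * without ever writing, and return the sum of the bytes read.
 * Returns -1 if the mapping cannot be created.
 */
static long scan_untouched_mapping(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile const unsigned char *p;
	long sum = 0;
	void *map;

	assert(len > 0);
	if (page <= 0)
		return -1;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	p = map;
	for (size_t off = 0; off < len; off += (size_t)page)
		sum += p[off];	/* read fault only; the page is never written */

	munmap(map, len);
	return sum;
}
```

The sum is always zero because untouched anonymous memory reads as zeros. What changes between kernels is the cost: with a shared zero page the scan consumes no extra memory, while without it every read fault allocates and clears a private page.
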

So it's not just "people actually use it". It really was a useful feature, 
with valid uses. We got rid of it, but if we can re-introduce it cleanly, 
we definitely should.

I don't understand why you fight it. If we can do it well (read: without 
having fork/exit cause endless amounts of cache ping-pongs due to touching 
'struct page *'), there are no downsides that I can see. It's not like 
it's a complicated feature.

		Linus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-07 16:59       ` Linus Torvalds
@ 2009-07-08  6:21         ` Nick Piggin
  2009-07-08 16:07           ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2009-07-08  6:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	avi, akpm@linux-foundation.org

On Tue, Jul 07, 2009 at 09:59:39AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 7 Jul 2009, Nick Piggin wrote:
> > 
> > I just wouldn't like to re-add significant complexity back to
> > the vm without good and concrete examples. OK I agree that
> > just saying "rewrite your code" is not so good, but are there
> > real significant problems? Is it just inside a particular linear
> > algebra library or something that might be able to be updated?
> 
> The thing is, ZERO_PAGE really used to work very well.
> 
> It was not only useful for simple "I want lots of memory, and I'm going to 
> use it pretty sparsely" (which _is_ a very valid thing to do), but it was 
> useful for TLB benchmarking, and for cache-efficient "I'm going to write 
> lots of zeroes to files", and for a number of other uses.
> 
> You can talk about TLB pressure all you want, but the fact is, quite often 
> normal cache effects dominate - and ZERO_PAGE is _wonderful_ for sharing 
> cachelines (which is why it was so useful for TLB performance testing: map 
> a huge area, and you know that there will be no cache effects, only TLB 
> effects).
> 
> There are actually very few cases where TLB effects are the primary ones - 
> they tend to happen when you have truly random accesses that have no 
> locality even on a small scale. That's pretty rare. Even things that depend 
> on sparse arrays etc tend to mainly _access_ the parts it works on (ie you 
> may have allocated hundreds of megs of memory to simplify your memory 
> management, but you work on only a small part of it).

I'm talking about the cases where you would want to use ZERO_PAGE for
computing with anonymous memory (not for zeroing IO). In that case,
the TLB would probably be the primary one. For IO, having zero page
for /dev/zero mapping would be a good idea (I think I actually
implemented that in a sles kernel for someone doing benchmarking).

 
> So it's not just "people actually use it". It really was a useful feature, 
> with valid uses. We got rid of it, but if we can re-introduce it cleanly, 
> we definitely should.
> 
> I don't understand why you fight it. If we can do it well (read: without 
> having fork/exit cause endless amounts of cache ping-pongs due to touching 
> 'struct page *'), there are no downsides that I can see. It's not like 
> it's a complicated feature.

I don't fight it. I had proposals to get rid of cache pingpong too,
but you rejected that ;)

I just think that right now seeing as we have gotten rid of it for
a year or so, then it would be good to know of some real cases where
it helps before reintroducing it. I'm not saying none exist, I just
want to know about them.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-08  6:21         ` Nick Piggin
@ 2009-07-08 16:07           ` Linus Torvalds
  2009-07-09  7:47             ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-07-08 16:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	avi, akpm@linux-foundation.org

On Wed, 8 Jul 2009, Nick Piggin wrote:
> 
> I'm talking about the cases where you would want to use ZERO_PAGE for
> computing with anonymous memory (not for zeroing IO). In that case,
> the TLB would probably be the primary one.

Umm. Are you even listening to yourself?

OF COURSE the TLB would be the primary issue, since the zero page has made 
cache effects go away.

BUT THAT IS A GOOD THING.

Instead of making it sound like "that's a bad thing, because now TLB 
dominates", just say what's really going on: "that's a good thing, because 
you made the cache access patterns wonderful".

See? You claim TLB is a problem, but it's really that you made all _other_ 
problems go away. 

Now, it's true that you can avoid the TLB costs by moving the costs into a 
"software TLB" (aka "indirection table"), and make the TLB footprint go 
away by turning it into something else (indirection through a pointer). 

Sometimes that speeds things up - because you may be able to actually 
avoid doing other things by noticing huge gaps etc - but sometimes it 
slows you down too - because indirection isn't free, and maybe there are 
common cases where there aren't that many sparse accesses.
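
To make the "software TLB" trade-off concrete, here is a rough sketch of a sparse array behind an indirection table. Everything in it (struct sparse_array, CHUNK_ELEMS, the sparse_* helpers) is invented for illustration and comes from no existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_ELEMS 512	/* elements per chunk; an arbitrary choice */

/* chunks[] is the indirection table; a NULL slot means "all zeros". */
struct sparse_array {
	size_t nchunks;
	double **chunks;
};

static int sparse_init(struct sparse_array *a, size_t nelems)
{
	a->nchunks = nelems / CHUNK_ELEMS + 1;
	a->chunks = calloc(a->nchunks, sizeof(*a->chunks));
	return a->chunks ? 0 : -1;
}

/* Every read pays one extra pointer load; holes touch no data memory. */
static double sparse_get(const struct sparse_array *a, size_t i)
{
	const double *chunk = a->chunks[i / CHUNK_ELEMS];

	return chunk ? chunk[i % CHUNK_ELEMS] : 0.0;
}

/* The first write into a hole allocates a zeroed chunk. */
static int sparse_set(struct sparse_array *a, size_t i, double v)
{
	double **slot = &a->chunks[i / CHUNK_ELEMS];

	if (!*slot) {
		*slot = calloc(CHUNK_ELEMS, sizeof(**slot));
		if (!*slot)
			return -1;
	}
	(*slot)[i % CHUNK_ELEMS] = v;
	return 0;
}

/* The "noticing huge gaps" win: whole empty chunks are skipped. */
static double sparse_sum(const struct sparse_array *a)
{
	double sum = 0.0;

	for (size_t c = 0; c < a->nchunks; c++) {
		if (!a->chunks[c])
			continue;
		for (size_t k = 0; k < CHUNK_ELEMS; k++)
			sum += a->chunks[c][k];
	}
	return sum;
}

static void sparse_free(struct sparse_array *a)
{
	for (size_t c = 0; c < a->nchunks; c++)
		free(a->chunks[c]);
	free(a->chunks);
}
```

The caller must keep every index below the nelems passed to sparse_init(), since the sketch does no bounds checking. The extra load and branch in sparse_get() are the cost side of the trade-off: on a densely populated array they buy nothing, while a flat mapping backed by the zero page needs no such check.
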

> I don't fight it. I had proposals to get rid of cache pingpong too,
> but you rejected that ;)

Yeah, and they were ugly as hell. I had a suggestion to just continue to 
use PG_reserved (which was _way_ simpler than your version) before the 
counting, but you and Hugh were on a religious agenda against the whole 
PG_reserved bit.

So I don't understand why you claim that you fight it, when you CLEARLY 
do. The patches that KAMEZAWA-san posted were already simpler than your 
complicated models were - I just think they can be simpler still.

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-08 16:07           ` Linus Torvalds
@ 2009-07-09  7:47             ` Nick Piggin
  2009-07-09 17:54               ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2009-07-09  7:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	avi, akpm@linux-foundation.org

On Wed, Jul 08, 2009 at 09:07:08AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 8 Jul 2009, Nick Piggin wrote:
> > 
> > I'm talking about the cases where you would want to use ZERO_PAGE for
> > computing with anonymous memory (not for zeroing IO). In that case,
> > the TLB would probably be the primary one.
> 
> Umm. Are you even listening to yourself?
> 
> OF COURSE the TLB would be the primary issue, since the zero page has made 
> cache effects go away.

Yes, that's what I said.

> BUT THAT IS A GOOD THING.
> 
> Instead of making it sound like "that's a bad thing, because now TLB 
> dominates", just say what's really going on: "that's a good thing, because 
> you made the cache access patterns wonderful".
> 
> See? You claim TLB is a problem, but it's really that you made all _other_ 
> problems go away. 

No I don't. Re-read what I wrote. I said that an app that scans huge
sparse matrices *might* be better off with a different data format
rather than relying on ZERO_PAGE with a naive format. Of course if it
does rely on ZERO_PAGE for this, then having ZERO_PAGE is going to be
better than allocating lots of anonymous memory for it, I didn't claim
otherwise.

> Now, it's true that you can avoid the TLB costs by moving the costs into a 
> "software TLB" (aka "indirection table"), and make the TLB footprint go 
> away by turning it into something else (indirection through a pointer). 
> 
> Sometimes that speeds things up - because you may be able to actually 
> avoid doing other things by noticing huge gaps etc - but sometimes it 
> slows you down too - because indirection isn't free, and maybe there are 
> common cases where there isn't so many sparse accesses.

Sometimes there are much more efficient data formats for sparse
matrices too, which can also avoid the quantization effects
(and cache usage) of page size.
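
One well-known format of this kind is compressed sparse row (CSR). The sketch below is an illustration only, not something proposed in this thread; the helper names `to_csr` and `csr_matvec` are hypothetical. It stores just the non-zero values plus their column indices, so storage scales with the number of non-zeros rather than with page-sized chunks.

```python
def to_csr(dense):
    """Convert a dense list-of-rows matrix to (values, col_index, row_start)."""
    values, col_index, row_start = [], [], [0]
    for row in dense:
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_index.append(col)
        # row_start[r + 1] marks where row r ends in `values`.
        row_start.append(len(values))
    return values, col_index, row_start


def csr_matvec(csr, x):
    """Multiply a CSR matrix by vector x, touching only non-zero entries."""
    values, col_index, row_start = csr
    y = []
    for r in range(len(row_start) - 1):
        lo, hi = row_start[r], row_start[r + 1]
        y.append(sum(values[k] * x[col_index[k]] for k in range(lo, hi)))
    return y


dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
csr = to_csr(dense)
print(csr)
print(csr_matvec(csr, [1, 1, 1, 1]))
```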

> > I don't fight it. I had proposals to get rid of cache pingpong too,
> > but you rejected that ;)
> 
> Yeah, and they were ugly as hell. I had a suggestion to just continue to 
> use PG_reserved (which was _way_ simpler than your version) before the 
> counting, but you and Hugh were on a religious agenda against the whole 
> PG_reserved bit.

No, I had no problem with it. I didn't see the big difference between
explicitly testing for ZERO_PAGE or using a new page flag bit (which
aren't free -- PG_reserved can basically be reclaimed now if somebody
cares to go through arch init code).

Now if there was more than one type of page to test for, then yes
a page flag would be better because it would reduce branches. I
just didn't see why you were religiously against testing ZERO_PAGE
but thought PG_zero (or PG_reserved or whatever) was so much better.

> So I don't understand why you claim that you fight it, when you CLEARLY 
> do. The patches that KAMEZAWA-san posted were already simpler than your 
> complicated models were - I just think they can be simpler still.

Having a ZERO_PAGE I'm not against, so I don't know why you claim
I am. All I'm saying is that now we don't have one, we should have
some good reasons to introduce it again. Unreasonable?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-09  7:47             ` Nick Piggin
@ 2009-07-09 17:54               ` Linus Torvalds
  2009-07-10  2:09                 ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-07-09 17:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	avi, akpm@linux-foundation.org



On Thu, 9 Jul 2009, Nick Piggin wrote:
>
> Having a ZERO_PAGE I'm not against, so I don't know why you claim
> I am. All I'm saying is that now we don't have one, we should have
> some good reasons to introduce it again. Unreasonable?

Umm. I had good reasons to introduce it in the _first_ place.

And now you have reports of people who depend on the behaviour, and point 
to the new behaviour as a *regression*.

What the _hell_ more do you want?

		Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-09 17:54               ` Linus Torvalds
@ 2009-07-10  2:09                 ` Nick Piggin
  2009-07-10  3:38                   ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2009-07-10  2:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	avi, akpm@linux-foundation.org

On Thu, Jul 09, 2009 at 10:54:02AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 9 Jul 2009, Nick Piggin wrote:
> >
> > Having a ZERO_PAGE I'm not against, so I don't know why you claim
> > I am. All I'm saying is that now we don't have one, we should have
> > some good reasons to introduce it again. Unreasonable?
> 
> Umm. I had good reasons to introduce it in the _first_ place.
> 
> And now you have reports of people who depend on the behaviour, and point 
> to the new behaviour as a *regression*.
> 
> What the _hell_ more do you want?

Well there is obviously no way to test a representative sample of
workloads, and we pretty much knew that some people are going to
prefer to have a ZERO_PAGE with their app.

So if you were going to re-add the zero page when a single regression
is reported after a year or two, then it was wrong of you to remove
the zero page to begin with.

So to answer your question, I guess I would like to know a bit
more about the regression and what the app is doing.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-10  2:09                 ` Nick Piggin
@ 2009-07-10  3:38                   ` Linus Torvalds
  2009-07-10  3:51                     ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-07-10  3:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	avi, akpm@linux-foundation.org

On Fri, 10 Jul 2009, Nick Piggin wrote:
> 
> So if you were going to re-add the zero page when a single regression
> is reported after a year or two, then it was wrong of you to remove
> the zero page to begin with.

Oh, I argued against it. And I told people we can always revert it.

But even better than reverting it is to just fix it cleanly in the new 
world order, wouldn't you say?

> So to answer your question, I guess I would like to know a bit
> more about the regression and what the app is doing.

Ok, go ahead and try to figure it out. But please don't cc me on it any 
more. I'm not interested in your hang-ups with ZERO_PAGE.

Because I just don't care. I think ZERO_PAGE was great to begin with, I 
put it to use myself historically at Transmeta, and I didn't like your 
crusade against it.

People (including me) have told you why it's useful. Whatever. If you 
still want more information, go bother somebody else.

		Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-10  3:38                   ` Linus Torvalds
@ 2009-07-10  3:51                     ` Nick Piggin
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Piggin @ 2009-07-10  3:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	avi, akpm@linux-foundation.org

On Thu, Jul 09, 2009 at 08:38:41PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 10 Jul 2009, Nick Piggin wrote:
> > 
> > So if you were going to re-add the zero page when a single regression
> > is reported after a year or two, then it was wrong of you to remove
> > the zero page to begin with.
> 
> Oh, I argued against it. And I told people we can always revert it.
> 
> But even better than reverting it is to just fix it cleanly in the new 
> world order, wouldn't you say?

If it is put back in without being refcounted, that should be
fine. That's what I first proposed for it (although you didn't
think my actual implementation was clean and preferred to remove
it completely).

I would like to see support for architectures which don't define
a pte_special bit too, however.


> > So to answer your question, I guess I would like to know a bit
> > more about the regression and what the app is doing.
> 
> Ok, go ahead and try to figure it out. But please don't cc me on it any 
> more. I'm not interested in your hang-ups with ZERO_PAGE.
> 
> Because I just don't care. I think ZERO_PAGE was great to begin with, I 
> put it to use myself historically at Transmeta, and I didn't like your 
> crusade against it.
> 
> People (including me) have told you why it's useful. Whatever. If you 
> still want more information, go bother somebody else.

You're apparently not reading what I write when I do cc you, so
I don't think there would be much difference.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-07  9:06   ` KAMEZAWA Hiroyuki
  2009-07-07 14:00     ` Nick Piggin
@ 2009-07-08 17:32     ` Andrea Arcangeli
  2009-07-09  1:12       ` KAMEZAWA Hiroyuki
  2009-07-10 11:18       ` Hugh Dickins
  1 sibling, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2009-07-08 17:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Nick Piggin, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

On Tue, Jul 07, 2009 at 06:06:29PM +0900, KAMEZAWA Hiroyuki wrote:
> Then,  most of users will not notice that ZERO_PAGE is not available until
> he(she) find OOM-Killer message. This is very terrible situation for me.
> (and most of system admins.)

Can you try to teach them to use KSM and see if they gain a whole lot
more from it (surely they also do some memset(dst, 0) sometimes, not
only memcpy(zerosrc, dst)). Not to mention when they init their
arrays/matrices to non-zero values, which is a bit harder to optimize
for with zero page...

My only dislike is that zero page requires a flood of "if ()" new
branches in fast paths that benefits nothing but badly written app,
and that's the only reason I liked its removal.

For well (and badly) written scientific apps there is KSM, which will do
more than zeropage while dealing with matrix algorithms and such. If
they try KSM and they don't gain a lot more free memory than with the
zero page hack, then I agree with reintroducing it, but I guess when
they try KSM they will ask you to patch the kernel with it, instead of
patching the kernel with zeropage. If they don't gain anything more with KSM
than with zeropage, and the kksmd overhead is too high, then it would
make sense to use zeropage for them, I agree, even if it bites in the
fast path of all apps that can't benefit from it. (Not to mention the
fact that reading zero and writing non zero back for normal apps is
harmful as there's a double page fault generated instead of a single
one; kksmd has a cost but zeropage isn't free in terms of page
faults either.)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-08 17:32     ` Andrea Arcangeli
@ 2009-07-09  1:12       ` KAMEZAWA Hiroyuki
  2009-07-10 11:18       ` Hugh Dickins
  1 sibling, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-09  1:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk, avi,
	akpm@linux-foundation.org, torvalds

On Wed, 8 Jul 2009 19:32:06 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Tue, Jul 07, 2009 at 06:06:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > Then,  most of users will not notice that ZERO_PAGE is not available until
> > he(she) find OOM-Killer message. This is very terrible situation for me.
> > (and most of system admins.)
> 
> Can you try to teach them to use KSM and see if they gain a whole lot
> more from it (surely they also do some memset(dst, 0) sometime not
> only memcpy(zerosrc, dst)). Not to tell when they init to non zero
> values their arrays/matrix which is a bit harder to optimize for with
> zero page...
> 
Hmm, scan & take diff & merge user pages in the kernel ?
IIUC, it can only help if the zero-page's lifetime is verrrry long.

> My only dislike is that zero page requires a flood of "if ()" new
> branches in fast paths that benefits nothing but badly written app,
> and that's the only reason I liked its removal.
> 
I'll take Linus's suggestion "use pte_special() in vm_normal_page()".
Then the "if()"s will not increase as much as the expected flood.

In usual apps which don't use any zero-page, the following path will be checked.

 - "is this WRITE fault ?" in do_anonymous_page().
 - vm_normal_page() never finds pte_special() then no more "if"s.
 - get_user_pages() etc. will have 2-3 more if()s, depending on the passed flags.
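
The checks listed above can be pictured with a toy user-space model. This is not kernel code and not the patch itself: the function names `do_anonymous_page` and `vm_normal_page` are borrowed from the discussion, while the `Pte` class, the `write_fault_on` helper and all function bodies are hypothetical stand-ins.

```python
PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)  # one shared, read-only page of zeros


class Pte:
    """Toy page-table entry: a page plus a 'special' marker bit."""

    def __init__(self, page, special):
        self.page = page
        self.special = special


def do_anonymous_page(is_write):
    """Model of the first touch of an anonymous page."""
    if is_write:
        # Write fault: allocate a private zeroed page straight away.
        return Pte(bytearray(PAGE_SIZE), special=False)
    # Read fault: map the shared zero page and mark the pte special,
    # so no reference count on the zero page is touched.
    return Pte(ZERO_PAGE, special=True)


def vm_normal_page(pte):
    """Return the page if it is a normal, refcounted one, else None."""
    if pte.special:
        return None
    return pte.page


def write_fault_on(pte):
    """Model of a later write to a mapping: copy if it is the zero page."""
    if vm_normal_page(pte) is None:
        return Pte(bytearray(pte.page), special=False)
    return pte


read_pte = do_anonymous_page(is_write=False)
write_pte = do_anonymous_page(is_write=True)
cow_pte = write_fault_on(read_pte)
print(vm_normal_page(read_pte) is None, vm_normal_page(write_pte) is not None)
```

In this model an app that only ever writes takes the `is_write` branch once and never sees a special pte, which is the "no more ifs" case; the read-then-write sequence is the one that costs a second fault.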

Anyway, I'll reduce overheads as much as possible. Please see v3.
pte_special() checks (which are already used) reduce "if()" to some extent.

> For goodly (and badly) written scientific app there KSM that will do
> more than zeropage while dealing with matrix algorithms and such. If
> they try KSM and they don't gain a lot more free memory than with the
> zero page hack, then I agree in reintroducing it, but I guess when
> they try KSM they will ask you to patch kernel with it, instead of
> patch kernel with zeropage. 

The main difference between the zeropage and KSM solutions is that
zeropage requires no refcnt/rmap handling, never pollutes caches, etc.
This will be a big advantage.

> If they don't gain anything more with KSM
> than with zeropage, and the kksmd overhead is too high, then it would
> make sense to use zeropage for them I agree even if it bites in the
> fast path of all apps that can't benefit from it. (not to tell the
> fact that reading zero and writing non zero back for normal apps is
> harmful as there's a double page fault generated instead of a single
> one, kksmd has a cost but zeropage isn't free either in term of page
> faults too)
> 
Sorry, _all_ my customers use RHEL5, and there is no ksm there yet.

BTW, I love the concept of KSM, but I don't yet trust it enough to recommend
it to my customers. It's a bit young for production from my point of view.
AFAIK, no bug reports about ksm have reached this mailing list yet.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-08 17:32     ` Andrea Arcangeli
  2009-07-09  1:12       ` KAMEZAWA Hiroyuki
@ 2009-07-10 11:18       ` Hugh Dickins
  2009-07-10 13:42         ` Andrea Arcangeli
  2009-07-13  6:46         ` Nick Piggin
  1 sibling, 2 replies; 34+ messages in thread
From: Hugh Dickins @ 2009-07-10 11:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, linux-mm@kvack.org, avi, akpm,
	torvalds

On Wed, 8 Jul 2009, Andrea Arcangeli wrote:
> On Tue, Jul 07, 2009 at 06:06:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > Then,  most of users will not notice that ZERO_PAGE is not available until
> > he(she) find OOM-Killer message. This is very terrible situation for me.
> > (and most of system admins.)
> 
> Can you try to teach them to use KSM and see if they gain a whole lot
> more from it (surely they also do some memset(dst, 0) sometime not
> only memcpy(zerosrc, dst)). Not to tell when they init to non zero
> values their arrays/matrix which is a bit harder to optimize for with
> zero page...
> 
> My only dislike is that zero page requires a flood of "if ()" new
> branches in fast paths that benefits nothing but badly written app,
> and that's the only reason I liked its removal.
> 
> For goodly (and badly) written scientific app there KSM that will do
> more than zeropage while dealing with matrix algorithms and such. If
> they try KSM and they don't gain a lot more free memory than with the
> zero page hack, then I agree in reintroducing it, but I guess when
> they try KSM they will ask you to patch kernel with it, instead of
> patch kernel with zeropage. If they don't gain anything more with KSM
> than with zeropage, and the kksmd overhead is too high, then it would
> make sense to use zeropage for them I agree even if it bites in the
> fast path of all apps that can't benefit from it. (not to tell the
> fact that reading zero and writing non zero back for normal apps is
> harmful as there's a double page fault generated instead of a single
> one, kksmd has a cost but zeropage isn't free either in term of page
> faults too)

Much as I like KSM, I have to agree with Avi, that if people are
wanting the ZERO_PAGE back in compute-intensive loads, then relying
on ksmd to put Humpty Dumpty together again is much too expensive a
way to go about it: ZERO_PAGE saves him from falling off the wall
in the first place, and that's much the better way to deal with it.

It might turn out in the end to be convenient to treat the ZERO_PAGE
as an "automatic" KSM page, I don't know; or we'll need to teach KSM
not to waste its time remerging instances of the ZERO_PAGE to a
zeroed KSM page.  We'll worry about that once both sets are in mmotm.
 
I didn't care for Kamezawa-san's original patchsets, seemed messy
and branchy, but it looks to be heading the right way now using
vm_normal_page (pity about arches without pte_special, oh well).

Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-10 11:18       ` Hugh Dickins
@ 2009-07-10 13:42         ` Andrea Arcangeli
  2009-07-10 14:12           ` KAMEZAWA Hiroyuki
  2009-07-10 17:09           ` Hugh Dickins
  2009-07-13  6:46         ` Nick Piggin
  1 sibling, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2009-07-10 13:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, linux-mm@kvack.org, avi, akpm,
	torvalds

On Fri, Jul 10, 2009 at 12:18:07PM +0100, Hugh Dickins wrote:
> as an "automatic" KSM page, I don't know; or we'll need to teach KSM
> not to waste its time remerging instances of the ZERO_PAGE to a
> zeroed KSM page.  We'll worry about that once both sets in mmotm.

There is no risk of collision, zero page is not anonymous so...

I think it's a mistake for them not to try ksm first regardless of the
new zeropage patches floating around, because my whole point is
that those kinds of apps will save more than just zero page with
ksm. Sure not guaranteed... but possible and worth checking.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-10 13:42         ` Andrea Arcangeli
@ 2009-07-10 14:12           ` KAMEZAWA Hiroyuki
  2009-07-10 15:16             ` Andrea Arcangeli
  2009-07-10 17:09           ` Hugh Dickins
  1 sibling, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10 14:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, KAMEZAWA Hiroyuki, Nick Piggin, linux-mm@kvack.org,
	avi, akpm, torvalds

Andrea Arcangeli wrote:
> On Fri, Jul 10, 2009 at 12:18:07PM +0100, Hugh Dickins wrote:
>> as an "automatic" KSM page, I don't know; or we'll need to teach KSM
>> not to waste its time remerging instances of the ZERO_PAGE to a
>> zeroed KSM page.  We'll worry about that once both sets in mmotm.
>
> There is no risk of collision, zero page is not anonymous so...
>
> I think it's a mistake for them not to try ksm first regardless of the
> new zeropage patches being floating around, because my whole point is
> that those kind of apps will save more than just zero page with
> ksm. Sure not guaranteed... but possible and worth checking.
>
How many merciless teachers who know what is correct there are...

BTW, does ksm have no refcnt pingpong problem?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-10 14:12           ` KAMEZAWA Hiroyuki
@ 2009-07-10 15:16             ` Andrea Arcangeli
  2009-07-10 15:32               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2009-07-10 15:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, Nick Piggin, linux-mm@kvack.org, avi, akpm,
	torvalds

On Fri, Jul 10, 2009 at 11:12:38PM +0900, KAMEZAWA Hiroyuki wrote:
> BTW, ksm has no refcnt pingpong problem ?

Well sure it has, the refcount has to be increased when pages are
shared, just like for regular fork() on anonymous memory, but the
point is that you pay for it only when you're saving ram, so the
probability that it is just pure overhead is lower than for the zero
page... it always depends on the app. I simply suggest trying
it... perhaps zero page is the way to go for your users.. they should
tell, not us...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-10 15:16             ` Andrea Arcangeli
@ 2009-07-10 15:32               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-10 15:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, Hugh Dickins, Nick Piggin, linux-mm@kvack.org,
	avi, akpm, torvalds

Andrea Arcangeli wrote:
> On Fri, Jul 10, 2009 at 11:12:38PM +0900, KAMEZAWA Hiroyuki wrote:
>> BTW, ksm has no refcnt pingpong problem ?
>
> Well sure it has, the refcount has to be increased when pages are
> shared, just like for regular fork() on anonymous memory, but the
> point is that you pay for it only when you're saving ram, so the
> probability that is just pure overhead is lower than for the zero
> page... it always depend on the app. I simply suggest in trying
> it... perhaps zero page is way to go for your users.. they should
> tell, not us...
>
My point is that we don't have to say "Unless you evolve yourself,
you'll die" to users. They will evolve by themselves if they are sane.
As I said, I like ksm. But demanding that users rewrite private apps is
a different problem. I'd like to say "You can live as you are, but here
are better options" rather than "die!".
Adding documentation/advertisement and showing the pros and cons of ksm or
something correct is what we can do to increase the number of sane users.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-10 13:42         ` Andrea Arcangeli
  2009-07-10 14:12           ` KAMEZAWA Hiroyuki
@ 2009-07-10 17:09           ` Hugh Dickins
  1 sibling, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2009-07-10 17:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, linux-mm@kvack.org, avi, akpm,
	torvalds

On Fri, 10 Jul 2009, Andrea Arcangeli wrote:
> On Fri, Jul 10, 2009 at 12:18:07PM +0100, Hugh Dickins wrote:
> > as an "automatic" KSM page, I don't know; or we'll need to teach KSM
> > not to waste its time remerging instances of the ZERO_PAGE to a
> > zeroed KSM page.  We'll worry about that once both sets in mmotm.
> 
> There is no risk of collision, zero page is not anonymous so...

You're right, yes, no change required.

> 
> I think it's a mistake for them not to try ksm first regardless of the
> new zeropage patches being floating around, because my whole point is
> that those kind of apps will save more than just zero page with
> ksm. Sure not guaranteed... but possible and worth checking.

Okay, you're right to ask people to give KSM a try: there may be some
apps wanting ZERO_PAGE back, which would really benefit from having
other pages also merged for them, despite the cost.

(And the cost may not be so bad, given that you can stop KSM scanning
for merges, while still keeping all the merges already made.)

But I'm not going to hold my breath on that, and I don't think Kame
should hold back his patch for that.  Particularly since it would
need the extensions to apply KSM to other processes, and we're not
giving those any thought this time around.

(Beyond musing that if we're going to apply madvise MADV_MERGEABLE
to other processes, wouldn't we do better to extend the idea, to be
able to apply madvise and mlock generally to other processes?).
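
For reference, a minimal sketch of how a process opts its own memory in to KSM with madvise, assuming a kernel and Python build that expose `MADV_MERGEABLE` (Python 3.8+ on Linux). The `map_mergeable` helper is hypothetical. It only marks the caller's own mapping, which is the limitation discussed above, and actual merging still depends on the KSM daemon being enabled.

```python
import mmap


def map_mergeable(length, fill=b"A"):
    """Map private anonymous memory, fill it, and ask KSM to consider it.

    Returns (mapping, accepted); accepted is False when the kernel or the
    Python build does not support MADV_MERGEABLE.
    """
    m = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    m.write(fill * length)  # identical pages are candidates for merging
    advice = getattr(mmap, "MADV_MERGEABLE", None)
    if advice is None or not hasattr(m, "madvise"):
        return m, False
    try:
        m.madvise(advice)
    except OSError:
        # For example EINVAL on a kernel built without KSM.
        return m, False
    return m, True


mapping, accepted = map_mergeable(16 * mmap.PAGESIZE)
print("MADV_MERGEABLE accepted:", accepted)
```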

Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-10 11:18       ` Hugh Dickins
  2009-07-10 13:42         ` Andrea Arcangeli
@ 2009-07-13  6:46         ` Nick Piggin
  2009-07-13  7:24           ` Nick Piggin
  1 sibling, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2009-07-13  6:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm@kvack.org, avi,
	akpm, torvalds

On Fri, Jul 10, 2009 at 12:18:07PM +0100, Hugh Dickins wrote:
> On Wed, 8 Jul 2009, Andrea Arcangeli wrote:
> > On Tue, Jul 07, 2009 at 06:06:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > harmful as there's a double page fault generated instead of a single
> > one, kksmd has a cost but zeropage isn't free either in term of page
> > faults too)
> 
> Much as I like KSM, I have to agree with Avi, that if people are
> wanting the ZERO_PAGE back in compute-intensive loads, then relying

I can't imagine ZERO_PAGE would be too widely used in compute-intensive
loads. At least, not serious stuff. Nobody wants to spend 4K of cache
and one TLB entry for one or two non-zero floating point numbers in a
big sparse matrix. Not to mention the cache and memory overhead of just
scanning through lots of zeros.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
  2009-07-13  6:46         ` Nick Piggin
@ 2009-07-13  7:24           ` Nick Piggin
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Piggin @ 2009-07-13  7:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm@kvack.org, avi,
	akpm, torvalds

On Mon, Jul 13, 2009 at 08:46:41AM +0200, Nick Piggin wrote:
> On Fri, Jul 10, 2009 at 12:18:07PM +0100, Hugh Dickins wrote:
> > On Wed, 8 Jul 2009, Andrea Arcangeli wrote:
> > > On Tue, Jul 07, 2009 at 06:06:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > > harmful as there's a double page fault generated instead of a single
> > > one, kksmd has a cost but zeropage isn't free either in term of page
> > > faults too)
> > 
> > Much as I like KSM, I have to agree with Avi, that if people are
> > wanting the ZERO_PAGE back in compute-intensive loads, then relying
> 
> I can't imagine ZERO_PAGE would be too widely used in compute-intensive
> loads. At least, not serious stuff. Nobody wants to spend 4K of cache
> and one TLB entry for one or two non-zero floating point numbers in a
> big sparse matrix. Not to mention the cache and memory overhead of just
> scanning through lots of zeros.

Heh, oops: before anyone thinks it will be fun to make some
personal insults, there won't be much memory overhead from
zero page of course! Cache and *TLB* overhead is going to be
involved.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC][PATCH 0/4] ZERO PAGE again v2
@ 2009-07-07 15:50 KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-07-07 15:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
	avi, akpm@linux-foundation.org, torvalds

Nick Piggin wrote:
> On Tue, Jul 07, 2009 at 06:06:29PM +0900, KAMEZAWA Hiroyuki wrote:
>> 3. Considering save&restore application's data table, ZERO_PAGE is
>> useful.
>>    maybe.
>
> I just wouldn't like to re-add significant complexity back to
> the vm without good and concrete examples. OK I agree that
> just saying "rewrite your code" is not so good, but are there
> real significant problems? Is it inside just a particular linear
> algebra library or something that might be able to be updated?
>
As far as I can tell, I know of 2 cases from my limited experience in
user support.

1. A middleware maps /dev/zero with a PRIVATE mapping and uses
   copy-on-write intentionally. I think this is because their Solaris(?)
   apps required /dev/zero to use ZERO_PAGE or anon memory.
   I don't know much about Solaris, but
   "mapping /dev/zero eats up tons of memory" sounds strange to me.

2. An HPC middleware seems to make use of ZERO_PAGE to do checkpoint/restart
   of its jobs. (Maybe they can rewrite their programs as you say.)
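
To make case 2 concrete, here is a minimal sketch of the kind of scan a
checkpoint tool might run over a big sparse table, saving only pages
that hold data. This is an assumption about how such a tool could work,
not the actual middleware; page_is_zero() and count_pages_to_save() are
hypothetical names. Note that the scan reads every page, so every
untouched page takes a read fault.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the page_size bytes at p are all zero, else 0. */
static int page_is_zero(const unsigned char *p, size_t page_size)
{
	assert(page_size > 0);
	/* First byte is zero and each byte equals the next one: all zero. */
	return p[0] == 0 && memcmp(p, p + 1, page_size - 1) == 0;
}

/* Count the pages of table that a checkpoint would have to write out. */
static size_t count_pages_to_save(const unsigned char *table,
				  size_t npages, size_t page_size)
{
	size_t i, to_save = 0;

	for (i = 0; i < npages; i++)
		if (!page_is_zero(table + i * page_size, page_size))
			to_save++;
	return to_save;
}
```

With ZERO_PAGE those read faults can all map one shared page; without
it, each one gets its own freshly zeroed page, so the scan itself makes
the sparse table resident.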

Maybe there are others. (I'm not worried about famous OSS applications/libraries.
There will be enough technical support for such apps.)

To be honest, I'd like to support /dev/zero, at least.
"mmap(/dev/zero, PROT_READ) causes OOM" sounds like crazy behavior for an OS.

Is it OK to write a fault handler for /dev/zero and use the zero page even if
this request is rejected?
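
For reference, the user-space pattern in question looks roughly like
the sketch below: a private, read-only mapping of /dev/zero that is
only ever read. scan_dev_zero() is a hypothetical helper written for
this mail, using only standard open/mmap/munmap calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of /dev/zero privately and read-only, then read one
 * byte from each page. Returns the number of non-zero bytes seen
 * (expected to be 0), or -1 if the mapping could not be set up.
 */
static long scan_dev_zero(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	long nonzero = 0;
	const volatile unsigned char *p;
	void *map;
	size_t off;
	int fd;

	assert(page > 0);

	fd = open("/dev/zero", O_RDONLY);
	if (fd < 0)
		return -1;

	map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping stays valid after close */
	if (map == MAP_FAILED)
		return -1;

	/* Each first touch of a page here is a read fault. */
	p = map;
	for (off = 0; off < len; off += (size_t)page)
		if (p[off] != 0)
			nonzero++;

	munmap(map, len);
	return nonzero;
}
```

The program never writes, so every fault in the loop is a read fault;
whether each one is backed by a shared zero page or by a new page is
exactly what is being discussed here.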

One choice was to announce "ZERO PAGE is not available any more, please
check and rewrite your applications" to all my customers. But I'm
pessimistic about that approach (hence this patch set).
Users will not understand what has changed, and I'll see some OOM
reports caused by this change.

Thanks,
-Kame



end of thread, other threads:[~2009-07-13  7:03 UTC | newest]

Thread overview: 34+ messages
2009-07-07  7:51 [RFC][PATCH 0/4] ZERO PAGE again v2 KAMEZAWA Hiroyuki
2009-07-07  7:52 ` [RFC][PATCH 1/4] introduce pte_zero() KAMEZAWA Hiroyuki
2009-07-07  7:54 ` [RFC][PATCH 2/4] use ZERO_PAGE for READ fault in regular anonymous mapping KAMEZAWA Hiroyuki
2009-07-07  7:59 ` [RFC][PATCH 3/4] get_user_pages READ fault handling special cases KAMEZAWA Hiroyuki
2009-07-07 16:50   ` Linus Torvalds
2009-07-08  0:03     ` KAMEZAWA Hiroyuki
2009-07-08  1:38       ` KAMEZAWA Hiroyuki
2009-07-08  2:27         ` Linus Torvalds
2009-07-07  8:01 ` [RFC][PATCH 4/4] add get user pages nozero KAMEZAWA Hiroyuki
2009-07-07  8:47 ` [RFC][PATCH 0/4] ZERO PAGE again v2 Nick Piggin
2009-07-07  9:05   ` Avi Kivity
2009-07-07  9:18     ` KAMEZAWA Hiroyuki
2009-07-07  9:26       ` Avi Kivity
2009-07-07  9:06   ` KAMEZAWA Hiroyuki
2009-07-07 14:00     ` Nick Piggin
2009-07-07 16:59       ` Linus Torvalds
2009-07-08  6:21         ` Nick Piggin
2009-07-08 16:07           ` Linus Torvalds
2009-07-09  7:47             ` Nick Piggin
2009-07-09 17:54               ` Linus Torvalds
2009-07-10  2:09                 ` Nick Piggin
2009-07-10  3:38                   ` Linus Torvalds
2009-07-10  3:51                     ` Nick Piggin
2009-07-08 17:32     ` Andrea Arcangeli
2009-07-09  1:12       ` KAMEZAWA Hiroyuki
2009-07-10 11:18       ` Hugh Dickins
2009-07-10 13:42         ` Andrea Arcangeli
2009-07-10 14:12           ` KAMEZAWA Hiroyuki
2009-07-10 15:16             ` Andrea Arcangeli
2009-07-10 15:32               ` KAMEZAWA Hiroyuki
2009-07-10 17:09           ` Hugh Dickins
2009-07-13  6:46         ` Nick Piggin
2009-07-13  7:24           ` Nick Piggin
  -- strict thread matches above, loose matches on Subject: below --
2009-07-07 15:50 KAMEZAWA Hiroyuki
