linux-mm.kvack.org archive mirror
* [RFC v3] Support volatile range for anon vma
@ 2012-12-11  2:34 Minchan Kim
  2012-12-11  2:41 ` Minchan Kim
  2012-12-11 18:45 ` John Stultz
  0 siblings, 2 replies; 16+ messages in thread
From: Minchan Kim @ 2012-12-11  2:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Michael Kerrisk, Arun Sharma,
	sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

This is still [RFC v3] because it has only passed my simple test
with a tweaked TCMalloc.

I hope for more input from the user-space allocator people, and that they
test the patch with their allocators, because it might require a change of
their arena management design to get real value out of it.

Changelog from v2

 * Remove madvise(addr, length, MADV_NOVOLATILE).
 * Add a vmstat counter for the number of discarded volatile pages.
 * Discard volatile pages without promotion in the reclaim path.

This is based on v3.6.

- What is madvise(addr, length, MADV_VOLATILE)?

  It's a hint the user delivers to the kernel so the kernel can *discard*
  pages in the range at any time.

- What happens if the user accesses a page (ie, virtual address) that was
  discarded by the kernel?

  The user sees zero-fill-on-demand pages, as with madvise(DONTNEED).

- What happens if the user accesses a page (ie, virtual address) that was
  not discarded by the kernel?

  The user sees the old data, without a page fault.

- What's different from madvise(DONTNEED)?

  System call semantics

  DONTNEED guarantees the user always sees zero-fill pages after calling
  madvise, while with VOLATILE the user may see either zero-fill pages or
  the old data.

  Internal implementation

  madvise(DONTNEED) has to zap all mapped pages in the range, so its
  overhead grows linearly with the number of mapped pages. On top of that,
  if the user later writes to a zapped page, a page fault + page
  allocation + memset happens.

  madvise(VOLATILE) only marks a flag on the range (ie, the VMA). It
  doesn't touch the pages at all, so the overhead of the system call is
  very small. If memory pressure happens, the VM can discard pages in VMAs
  marked VOLATILE. If the user writes to an address whose page the VM
  discarded, he sees a zero-fill page, so the cost is the same as
  DONTNEED; but if memory pressure wasn't severe, the user sees the old
  data without (page fault + page allocation + memset).

  The VOLATILE mark is removed in the page fault handler when the first
  page fault occurs in the marked vma, so subsequent page faults follow
  the normal page fault path. That's why the user doesn't need a
  madvise(MADV_NOVOLATILE) interface. (A usage sketch follows at the end
  of this description.)

- What's the benefit compared to DONTNEED?

  1. The system call overhead is smaller because VOLATILE just marks a
     flag on the VMA instead of zapping all the pages in the range.

  2. It has a chance to eliminate the re-access overhead (ex, page fault +
     page allocation + memset(PAGE_SIZE)).

- Isn't there any drawback?

  DONTNEED doesn't need exclusive mmap_sem locking, so concurrent page
  faults from other threads are still allowed. VOLATILE needs exclusive
  mmap_sem, so other threads would be blocked if they try to access
  not-yet-mapped pages. That's why I designed madvise(VOLATILE) to have as
  small an overhead as possible.

  The other concern with exclusive mmap_sem is the page fault in a
  VOLATILE-marked vma. We have to remove the flag from the vma and merge
  adjacent vmas, which needs exclusive mmap_sem. That can slow down page
  fault handling and prevent concurrent page faults. But such handling is
  needed only once, on the first page fault after the VMA was marked
  VOLATILE, and that fault only happens if memory pressure discarded the
  page. So it shouldn't be common, and the benefit we get from this
  feature should outweigh the loss.

- What is this targeting?

  Firstly, user-space allocators like ptmalloc and tcmalloc, or the heap
  management of virtual machines like Dalvik. It also comes in handy for
  embedded systems which don't have a swap device and therefore can't
  reclaim anonymous pages. By discarding instead of swapping, it can be
  used on such non-swap systems. For that, we have to age the anon lru
  list even though we don't have swap, because I don't want to discard
  volatile pages with top priority when memory pressure happens: volatile
  in this patch means "we don't need to swap out, because the user can
  handle the data disappearing suddenly", NOT "they are useless, so hurry
  up and reclaim them". So I want to apply the same aging rule as for
  normal pages to them.

  Anonymous page background aging on non-swap systems is the trade-off for
  getting this feature. We actually did that until [1] was merged two
  years ago, and I believe the gain from this patch will beat the cost of
  anon lru aging once allocators start to use the madvise call.
  (This patch doesn't include background aging for the non-swap case, but
  it's trivial to add if we decide to.)

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
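
Below is a minimal user-space sketch of the intended allocator usage, based
on the semantics above. MADV_VOLATILE's value is taken from this patch (it
is not in released headers, and madvise() simply fails with -EINVAL on an
unpatched kernel); the arena/chunk bookkeeping is left out.

    #include <sys/mman.h>
    #include <string.h>
    #include <stdio.h>

    #ifndef MADV_VOLATILE
    #define MADV_VOLATILE 5           /* from this patch only */
    #endif

    #define CHUNK_SIZE (2UL << 20)    /* one 2M arena chunk */

    int main(void)
    {
        char *chunk = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (chunk == MAP_FAILED)
            return 1;

        memset(chunk, 0xaa, CHUNK_SIZE);        /* chunk is in use */

        /*
         * free() path: instead of madvise(MADV_DONTNEED), just mark the
         * chunk volatile. No pages are zapped, so this is cheap.
         */
        madvise(chunk, CHUNK_SIZE, MADV_VOLATILE);

        /*
         * malloc() later hands the same chunk back. No MADV_NOVOLATILE is
         * needed: the first fault in the vma clears the flag. If the VM
         * discarded pages in between, they come back zero-filled;
         * otherwise the old 0xaa data is still there and no fault happens
         * at all.
         */
        chunk[0] = 0x55;
        printf("first byte after reuse: %#x\n", chunk[0]);

        munmap(chunk, CHUNK_SIZE);
        return 0;
    }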

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
CC: David Rientjes <rientjes@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/x86/mm/fault.c               |    2 +
 include/asm-generic/mman-common.h |    6 ++
 include/linux/mm.h                |    7 ++-
 include/linux/rmap.h              |   20 ++++++
 include/linux/vm_event_item.h     |    2 +-
 mm/madvise.c                      |   19 +++++-
 mm/memory.c                       |   32 ++++++++++
 mm/migrate.c                      |    6 +-
 mm/rmap.c                         |  125 ++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                       |    7 +++
 mm/vmstat.c                       |    1 +
 11 files changed, 218 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 76dcd9d..a734166 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		}
 
 		out_of_memory(regs, error_code, address);
+	} else if (fault & VM_FAULT_BAD_AREA) {
+			bad_area(regs, error_code, address);
 	} else {
 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
 			     VM_FAULT_HWPOISON_LARGE))
diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index d030d2c..f07781e 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -34,6 +34,12 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+/*
+ * Unlike other flags, we need two locks to protect MADV_VOLATILE.
+ * For changing the flag, we need mmap_sem's write lock and volatile_lock
+ * while we just need volatile_lock in case of reading the flag.
+ */
+#define MADV_VOLATILE	5		/* pages will disappear suddenly */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..89027b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_VOLATILE	0x100000000	/* Pages in the vma could be discardable without swap */
 
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);
  * Special vmas that are non-mergable, non-mlock()able.
  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
  */
-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)
 
 /*
  * mapping from the currently active vm_flags protection bits (the
@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
-
+#define VM_FAULT_SIGSEG	0x0800	/* -> There is no vma */
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
-			 VM_FAULT_HWPOISON_LARGE)
+			 VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)
 
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 3fce545..735d7a3 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -67,6 +67,9 @@ struct anon_vma_chain {
 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
 };
 
+void volatile_lock(struct vm_area_struct *vma);
+void volatile_unlock(struct vm_area_struct *vma);
+
 #ifdef CONFIG_MMU
 static inline void get_anon_vma(struct anon_vma *anon_vma)
 {
@@ -170,6 +173,7 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	return ptep;
 }
 
+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
+                                unsigned long, spinlock_t **);
+
+static inline pte_t *page_check_volatile_address(struct page *page,
+                                        struct mm_struct *mm,
+                                        unsigned long address,
+                                        spinlock_t **ptlp)
+{
+        pte_t *ptep;
+
+        __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
+                                        mm, address, ptlp));
+        return ptep;
+}
+
 /*
  * Used by swapoff to help locate where page is expected in vma.
  */
@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
 #define SWAP_MLOCK	3
+#define SWAP_DISCARD	4
 
 #endif	/* _LINUX_RMAP_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 57f7b10..3f9a40b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -23,7 +23,7 @@
 
 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
-		PGFREE, PGACTIVATE, PGDEACTIVATE,
+		PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
diff --git a/mm/madvise.c b/mm/madvise.c
index 14d260f..53a19d8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_VOLATILE:
+		if (vma->vm_flags & VM_LOCKED) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_VOLATILE;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
 success:
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
+	 * In case of MADV_VOLATILE, we need anon_vma_lock additionally.
 	 */
+	if (behavior == MADV_VOLATILE)
+		volatile_lock(vma);
 	vma->vm_flags = new_flags;
-
+	if (behavior == MADV_VOLATILE)
+		volatile_unlock(vma);
 out:
 	if (error == -ENOMEM)
 		error = -EAGAIN;
@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_VOLATILE:
 		return 1;
 
 	default:
@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 		goto out;
 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
 
+	if (behavior != MADV_VOLATILE)
+		len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+	else
+		len = len_in & PAGE_MASK;
+
 	/* Check to see whether len was rounded up from small -ve to zero */
 	if (len_in && !len)
 		goto out;
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..b5e4996 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/mempolicy.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,
 					return do_linear_fault(mm, vma, address,
 						pte, pmd, flags, entry);
 			}
+			if (vma->vm_flags & VM_VOLATILE) {
+				struct vm_area_struct *prev;
+
+				up_read(&mm->mmap_sem);
+				down_write(&mm->mmap_sem);
+				vma = find_vma_prev(mm, address, &prev);
+
+				/* Someone unmapped the vma */
+				if (unlikely(!vma) || vma->vm_start > address) {
+					downgrade_write(&mm->mmap_sem);
+					return VM_FAULT_SIGSEG;
+				}
+				/* Someone else already handled it */
+				if (vma->vm_flags & VM_VOLATILE) {
+					/*
+					 * From now on, we hold mmap_sem as
+					 * exclusive.
+					 */
+					volatile_lock(vma);
+					vma->vm_flags &= ~VM_VOLATILE;
+					volatile_unlock(vma);
+
+					vma_merge(mm, prev, vma->vm_start,
+						vma->vm_end, vma->vm_flags,
+						vma->anon_vma, vma->vm_file,
+						vma->vm_pgoff, vma_policy(vma));
+
+				}
+
+				downgrade_write(&mm->mmap_sem);
+			}
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, flags);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..08b009c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
 
 skip_unmap:
 	if (!page_mapped(page))
@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (PageAnon(hpage))
 		anon_vma = page_get_anon_vma(hpage);
 
-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
 
 	if (!page_mapped(hpage))
 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..1a0ab2b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
+		unsigned long address, spinlock_t **ptlp)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	swp_entry_t entry = { .val = page_private(page) };
+
+	if (unlikely(PageHuge(page))) {
+		pte = huge_pte_offset(mm, address);
+		ptl = &mm->page_table_lock;
+		goto check;
+	}
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return NULL;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return NULL;
+	if (pmd_trans_huge(*pmd))
+		return NULL;
+
+	pte = pte_offset_map(pmd, address);
+	ptl = pte_lockptr(mm, pmd);
+check:
+	spin_lock(ptl);
+	if (PageAnon(page)) {
+		if (!pte_present(*pte) && entry.val ==
+				pte_to_swp_entry(*pte).val) {
+			*ptlp = ptl;
+			return pte;
+		}
+	} else {
+		if (pte_none(*pte)) {
+			*ptlp = ptl;
+			return pte;
+		}
+	}
+	pte_unmap_unlock(pte, ptl);
+	return NULL;
+}
+
 /*
  * Check that @page is mapped at @address into @mm.
  *
@@ -1218,6 +1269,35 @@ out:
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
+                unsigned long address)
+{
+        struct mm_struct *mm = vma->vm_mm;
+        pte_t *pte;
+        pte_t pteval;
+        spinlock_t *ptl;
+
+        pte = page_check_volatile_address(page, mm, address, &ptl);
+        if (!pte)
+                return 0;
+
+        /* Nuke the page table entry. */
+        flush_cache_page(vma, address, page_to_pfn(page));
+        pteval = ptep_clear_flush(vma, address, pte);
+
+        if (PageAnon(page)) {
+                swp_entry_t entry = { .val = page_private(page) };
+                if (PageSwapCache(page)) {
+                        dec_mm_counter(mm, MM_SWAPENTS);
+                        swap_free(entry);
+                }
+        }
+
+        pte_unmap_unlock(pte, ptl);
+        mmu_notifier_invalidate_page(mm, address);
+        return 1;
+}
+
 /*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 	struct anon_vma *anon_vma;
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
+	bool is_volatile = true;
+
+	if (flags & TTU_IGNORE_VOLATILE)
+		is_volatile = false;
 
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 		 * temporary VMAs until after exec() completes.
 		 */
 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
-				is_vma_temporary_stack(vma))
+				is_vma_temporary_stack(vma)) {
+			is_volatile = false;
 			continue;
+		}
 
 		address = vma_address(page, vma);
 		if (address == -EFAULT)
 			continue;
+                /*
+                 * A volatile page will only be purged if ALL vmas
+		 * pointing to it are VM_VOLATILE.
+                 */
+                if (!(vma->vm_flags & VM_VOLATILE))
+                        is_volatile = false;
+
 		ret = try_to_unmap_one(page, vma, address, flags);
 		if (ret != SWAP_AGAIN || !page_mapped(page))
 			break;
 	}
 
+        if (page_mapped(page) || is_volatile == false)
+                goto out;
+
+        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+                struct vm_area_struct *vma = avc->vma;
+                unsigned long address;
+
+                address = vma_address(page, vma);
+                try_to_zap_one(page, vma, address);
+        }
+        /* We're throwing this page out, so mark it clean */
+        ClearPageDirty(page);
+        ret = SWAP_DISCARD;
+out:
 	page_unlock_anon_vma(anon_vma);
 	return ret;
 }
@@ -1651,6 +1758,7 @@ out:
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
+ * SWAP_DISCARD - page is volatile.
  */
 int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 		ret = try_to_unmap_anon(page, flags);
 	else
 		ret = try_to_unmap_file(page, flags);
-	if (ret != SWAP_MLOCK && !page_mapped(page))
+	if (ret != SWAP_MLOCK && !page_mapped(page) &&
+					ret != SWAP_DISCARD)
 		ret = SWAP_SUCCESS;
 	return ret;
 }
@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
 	anon_vma_free(anon_vma);
 }
 
+void volatile_lock(struct vm_area_struct *vma)
+{
+        if (vma->anon_vma)
+                anon_vma_lock(vma->anon_vma);
+}
+
+void volatile_unlock(struct vm_area_struct *vma)
+{
+        if (vma->anon_vma)
+                anon_vma_unlock(vma->anon_vma);
+}
+
 #ifdef CONFIG_MIGRATION
 /*
  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99b434b..4e463a4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	if (vm_flags & VM_VOLATILE)
+		return PAGEREF_RECLAIM;
+
 	if (referenced_ptes) {
 		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		if (page_mapped(page) && mapping) {
 			switch (try_to_unmap(page, TTU_UNMAP)) {
+			case SWAP_DISCARD:
+				count_vm_event(PGVOLATILE);
+				goto discard_page;
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -857,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
+discard_page:
 		/*
 		 * If the page has buffers, try to free the buffer mappings
 		 * associated with this page. If we succeed we try to free
diff --git a/mm/vmstat.c b/mm/vmstat.c
index df7a674..410caf5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -734,6 +734,7 @@ const char * const vmstat_text[] = {
 	TEXTS_FOR_ZONES("pgalloc")
 
 	"pgfree",
+	"pgvolatile",
 	"pgactivate",
 	"pgdeactivate",
 
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11  2:34 [RFC v3] Support volatile range for anon vma Minchan Kim
@ 2012-12-11  2:41 ` Minchan Kim
  2012-12-11  7:17   ` Mike Hommey
                     ` (3 more replies)
  2012-12-11 18:45 ` John Stultz
  1 sibling, 4 replies; 16+ messages in thread
From: Minchan Kim @ 2012-12-11  2:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Michael Kerrisk, Arun Sharma, sanjay,
	Paul Turner, David Rientjes, John Stultz, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

Sorry, resending with the compile error fixed. :(

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11  2:41 ` Minchan Kim
@ 2012-12-11  7:17   ` Mike Hommey
  2012-12-11  7:37     ` Minchan Kim
  2012-12-12  6:43   ` Wanpeng Li
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Mike Hommey @ 2012-12-11  7:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> - What's the madvise(addr, length, MADV_VOLATILE)?
> 
>   It's a hint that user deliver to kernel so kernel can *discard*
>   pages in a range anytime.
> 
> - What happens if user access page(ie, virtual address) discarded
>   by kernel?
> 
>   The user can see zero-fill-on-demand pages as if madvise(DONTNEED).

What happened to getting SIGBUS?

Mike

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11  7:17   ` Mike Hommey
@ 2012-12-11  7:37     ` Minchan Kim
  2012-12-11  7:59       ` Mike Hommey
  0 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2012-12-11  7:37 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Tue, Dec 11, 2012 at 08:17:42AM +0100, Mike Hommey wrote:
> On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> > - What's the madvise(addr, length, MADV_VOLATILE)?
> > 
> >   It's a hint that user deliver to kernel so kernel can *discard*
> >   pages in a range anytime.
> > 
> > - What happens if user access page(ie, virtual address) discarded
> >   by kernel?
> > 
> >   The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
> 
> What happened to getting SIGBUS?

I thought it would force the user to handle a signal.
And if the user receives the signal, what can he do?
Maybe he could call madvise(NOVOLATILE), as in my old version, but I removed
it in this version so the user doesn't need to do any signal handling.

The problem with madvise(NOVOLATILE) is the time gap between the allocator
handing a free chunk to the user and the user really accessing the memory.
Normally the allocator would call madvise(NOVOLATILE) when it returns the
free chunk to the customer, but the user might only access the memory a long
time afterwards. During that gap the pages could be swapped out, which works
against the patch's goal.
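
Roughly, the window looks like this (the allocator shape and
take_free_chunk() are made up for the sketch; MADV_NOVOLATILE itself no
longer exists in v3):

    #include <stddef.h>

    /* hypothetical allocator hook; arena bookkeeping not shown */
    extern void *take_free_chunk(size_t len);

    void *alloc_from_volatile_arena(size_t len)
    {
        void *chunk = take_free_chunk(len);     /* time T0 */

        /*
         * A v2-style allocator would have to clear the flag here:
         *
         *      madvise(chunk, len, MADV_NOVOLATILE);
         *
         * but the caller may not touch the memory until much later
         * (time T1). Between T0 and T1 the pages are ordinary anonymous
         * pages again, so they can be swapped out, which is the very I/O
         * this patch tries to avoid. In v3 the pages stay volatile
         * (discardable, never swapped) until the caller's first fault.
         */
        return chunk;
    }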

Yes, it's not good for tmpfs volatile pages. If you are interested in
tmpfs-volatile, please look at this:

https://lkml.org/lkml/2012/12/10/695

> 
> Mike
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11  7:37     ` Minchan Kim
@ 2012-12-11  7:59       ` Mike Hommey
  2012-12-11  8:11         ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Hommey @ 2012-12-11  7:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Tue, Dec 11, 2012 at 04:37:44PM +0900, Minchan Kim wrote:
> On Tue, Dec 11, 2012 at 08:17:42AM +0100, Mike Hommey wrote:
> > On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> > > - What's the madvise(addr, length, MADV_VOLATILE)?
> > > 
> > >   It's a hint that user deliver to kernel so kernel can *discard*
> > >   pages in a range anytime.
> > > 
> > > - What happens if user access page(ie, virtual address) discarded
> > >   by kernel?
> > > 
> > >   The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
> > 
> > What happened to getting SIGBUS?
> 
> I thought it could force for user to handle signal.
> If user can receive signal, what can he do?
> Maybe he can call madivse(NOVOLATILE) in my old version but I removed it
> in this version so user don't need handle signal handling.

NOVOLATILE and signal throwing are two different and not necessarily
related needs. We (Mozilla) could probably live without NOVOLATILE,
but certainly not without signal throwing.

Mike

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11  7:59       ` Mike Hommey
@ 2012-12-11  8:11         ` Minchan Kim
  2012-12-11  8:29           ` Mike Hommey
  0 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2012-12-11  8:11 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Tue, Dec 11, 2012 at 08:59:50AM +0100, Mike Hommey wrote:
> On Tue, Dec 11, 2012 at 04:37:44PM +0900, Minchan Kim wrote:
> > On Tue, Dec 11, 2012 at 08:17:42AM +0100, Mike Hommey wrote:
> > > On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> > > > - What's the madvise(addr, length, MADV_VOLATILE)?
> > > > 
> > > >   It's a hint that user deliver to kernel so kernel can *discard*
> > > >   pages in a range anytime.
> > > > 
> > > > - What happens if user access page(ie, virtual address) discarded
> > > >   by kernel?
> > > > 
> > > >   The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
> > > 
> > > What happened to getting SIGBUS?
> > 
> > I thought it could force for user to handle signal.
> > If user can receive signal, what can he do?
> > Maybe he can call madivse(NOVOLATILE) in my old version but I removed it
> > in this version so user don't need handle signal handling.
> 
> NOVOLATILE and signal throwing are two different and not necessarily
> related needs. We (Mozilla) could probably live without NOVOLATILE,
> but certainly not without signal throwing.

What's the shortcoming if we don't provide signal handling?
Could you explain how you would want to use the signal in your allocator?

It would be very helpful for improving this patch.
Thanks for the input.

> 
> Mike
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11  8:11         ` Minchan Kim
@ 2012-12-11  8:29           ` Mike Hommey
  2012-12-11  8:45             ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Hommey @ 2012-12-11  8:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Tue, Dec 11, 2012 at 05:11:17PM +0900, Minchan Kim wrote:
> On Tue, Dec 11, 2012 at 08:59:50AM +0100, Mike Hommey wrote:
> > On Tue, Dec 11, 2012 at 04:37:44PM +0900, Minchan Kim wrote:
> > > On Tue, Dec 11, 2012 at 08:17:42AM +0100, Mike Hommey wrote:
> > > > On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> > > > > - What's the madvise(addr, length, MADV_VOLATILE)?
> > > > > 
> > > > >   It's a hint that user deliver to kernel so kernel can *discard*
> > > > >   pages in a range anytime.
> > > > > 
> > > > > - What happens if user access page(ie, virtual address) discarded
> > > > >   by kernel?
> > > > > 
> > > > >   The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
> > > > 
> > > > What happened to getting SIGBUS?
> > > 
> > > I thought it could force for user to handle signal.
> > > If user can receive signal, what can he do?
> > > Maybe he can call madivse(NOVOLATILE) in my old version but I removed it
> > > in this version so user don't need handle signal handling.
> > 
> > NOVOLATILE and signal throwing are two different and not necessarily
> > related needs. We (Mozilla) could probably live without NOVOLATILE,
> > but certainly not without signal throwing.
> 
> What's shortcoming if we don't provide signal handling?
> Could you explain how you want to signal in your allocator?

The main use case we have for signals is not an allocator. We're
currently using ashmem to decompress libraries on Android. We would like
to use volatile memory for that instead, so that unused pages can be
discarded. With NOVOLATILE, or when getting zero-filled pages, that just
doesn't pan out: you may well be jumping in the volatile memory from
anywhere, and you can't check the status of the page you're jumping into
before jumping. Thus you need to be signaled when reaching a discarded
page.
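
Roughly, what we would want to do on our side, assuming a discarded page
raised SIGBUS (which this patch currently doesn't do), and with
redecompress_page() standing in for our actual decompression code:

    #include <signal.h>
    #include <string.h>
    #include <stdint.h>

    #ifndef PAGE_SIZE
    #define PAGE_SIZE 4096
    #endif

    /* stand-in: refill one page from the compressed library image */
    extern void redecompress_page(void *page);

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
        /* round the faulting address down to its page boundary */
        void *page = (void *)((uintptr_t)si->si_addr &
                              ~(uintptr_t)(PAGE_SIZE - 1));
        redecompress_page(page);  /* refill; the faulting access retries */
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);

        /* ... map the decompressed library, mark it volatile,
         * and jump into it ... */
        return 0;
    }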

Mike

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11  8:29           ` Mike Hommey
@ 2012-12-11  8:45             ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-12-11  8:45 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Tue, Dec 11, 2012 at 09:29:03AM +0100, Mike Hommey wrote:
> On Tue, Dec 11, 2012 at 05:11:17PM +0900, Minchan Kim wrote:
> > On Tue, Dec 11, 2012 at 08:59:50AM +0100, Mike Hommey wrote:
> > > On Tue, Dec 11, 2012 at 04:37:44PM +0900, Minchan Kim wrote:
> > > > On Tue, Dec 11, 2012 at 08:17:42AM +0100, Mike Hommey wrote:
> > > > > On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> > > > > > - What's the madvise(addr, length, MADV_VOLATILE)?
> > > > > > 
> > > > > >   It's a hint that user deliver to kernel so kernel can *discard*
> > > > > >   pages in a range anytime.
> > > > > > 
> > > > > > - What happens if user access page(ie, virtual address) discarded
> > > > > >   by kernel?
> > > > > > 
> > > > > >   The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
> > > > > 
> > > > > What happened to getting SIGBUS?
> > > > 
> > > > I thought it could force for user to handle signal.
> > > > If user can receive signal, what can he do?
> > > > Maybe he can call madivse(NOVOLATILE) in my old version but I removed it
> > > > in this version so user don't need handle signal handling.
> > > 
> > > NOVOLATILE and signal throwing are two different and not necessarily
> > > related needs. We (Mozilla) could probably live without NOVOLATILE,
> > > but certainly not without signal throwing.
> > 
> > What's shortcoming if we don't provide signal handling?
> > Could you explain how you want to signal in your allocator?
> 
> The main use case we have for signals is not an allocator. We're
> currently using ashmem to decompress libraries on Android. We would like
> to use volatile memory for that instead, so that unused pages can be
> discarded. With NOVOLATILE, or when getting zero-filled pages, that just
> doesn't pan out: you may well be jumping in the volatile memory from
> anywhere, and you can't check the status of the page you're jumping into
> before jumping. Thus you need to be signaled when reaching a discarded
> page.

It seems you are talking about tmpfs-based volatile ranges.
As I mentioned in John's thread, some interface to pin memory
(in ashmem's terms) is needed for tmpfs-based volatile ranges.
But for the allocator case we might not need it, so this patch, which
targets the allocator use case, removed SIGBUS.
If the user-space allocator guys ask for such an interface, it wouldn't be
a problem to unify both use cases, but if they don't want it because of
the performance cost, I don't want to add it. In that case, there are two
choices:

1) Go separate ways, with an interface for each.
   (madvise for anon vs fadvise or fallocate for tmpfs)
2) A new system call to unify them.

> 
> Mike
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11  2:34 [RFC v3] Support volatile range for anon vma Minchan Kim
  2012-12-11  2:41 ` Minchan Kim
@ 2012-12-11 18:45 ` John Stultz
  2012-12-11 23:21   ` Minchan Kim
  1 sibling, 1 reply; 16+ messages in thread
From: John Stultz @ 2012-12-11 18:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 12/10/2012 06:34 PM, Minchan Kim wrote:
> This still is [RFC v3] because just passed my simple test
> with TCMalloc tweaking.
>
> I hope more inputs from user-space allocator people and test patch
> with their allocator because it might need design change of arena
> management design for getting real vaule.
>
> Changelog from v2
>
>   * Removing madvise(addr, length, MADV_NOVOLATILE).
>   * add vmstat about the number of discarded volatile pages
>   * discard volatile pages without promotion in reclaim path
>
> This is based on v3.6.
>
> - What's the madvise(addr, length, MADV_VOLATILE)?
>
>    It's a hint that user deliver to kernel so kernel can *discard*
>    pages in a range anytime.
>
> - What happens if user access page(ie, virtual address) discarded
>    by kernel?
>
>    The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
>
> - What happens if user access page(ie, virtual address) doesn't
>    discarded by kernel?
>
>    The user can see old data without page fault.
>
> - What's different with madvise(DONTNEED)?
>
>    System call semantic
>
>    DONTNEED makes sure user always can see zero-fill pages after
>    he calls madvise while VOLATILE can see zero-fill pages or
>    old data.
I still need to really read and understand the patch, but at a high 
level I'm not sure how this works. So does the VOLATILE flag get cleared 
on any access, even if the pages have not been discarded? What happens 
if an application wants to store non-volatile data in an area that was 
once marked volatile. If there was never memory pressure, it seems the 
volatility would persist with no way of removing it.

Either way, I feel that with this revision, specifically dropping the 
NOVOLATILE call and the SIGBUS optimization the Mozilla folks suggested, 
your implementation has drifted quite far from the concept I'm pushing. 
While I hope we can still align the underlying mm implementation, I 
might ask that you use a different term for the semantics you propose, 
so we don't add too much confusion to the discussion.

Maybe you could call it DONTNEED_DEFERRED or something?

In the meantime, I'll be reading your patch in detail and seeing how we 
might be able to combine our differing approaches.

thanks
-john

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11 18:45 ` John Stultz
@ 2012-12-11 23:21   ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-12-11 23:21 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

Hi John,

On Tue, Dec 11, 2012 at 10:45:27AM -0800, John Stultz wrote:
> On 12/10/2012 06:34 PM, Minchan Kim wrote:
> >This still is [RFC v3] because just passed my simple test
> >with TCMalloc tweaking.
> >
> >I hope more inputs from user-space allocator people and test patch
> >with their allocator because it might need design change of arena
> >management design for getting real vaule.
> >
> >Changelog from v2
> >
> >  * Removing madvise(addr, length, MADV_NOVOLATILE).
> >  * add vmstat about the number of discarded volatile pages
> >  * discard volatile pages without promotion in reclaim path
> >
> >This is based on v3.6.
> >
> >- What's the madvise(addr, length, MADV_VOLATILE)?
> >
> >   It's a hint that user deliver to kernel so kernel can *discard*
> >   pages in a range anytime.
> >
> >- What happens if user access page(ie, virtual address) discarded
> >   by kernel?
> >
> >   The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
> >
> >- What happens if user access page(ie, virtual address) doesn't
> >   discarded by kernel?
> >
> >   The user can see old data without page fault.
> >
> >- What's different with madvise(DONTNEED)?
> >
> >   System call semantic
> >
> >   DONTNEED makes sure user always can see zero-fill pages after
> >   he calls madvise while VOLATILE can see zero-fill pages or
> >   old data.
> I still need to really read and understand the patch, but at a high
> level I'm not sure how this works. So does the VOLATILE flag get
> cleared on any access, even if the pages have not been discarded?

No. It is only cleared when the user touches a discarded page, so
this patch is utter crap. I missed that point.
Thanks for pointing that out, John.

Hmm, in the end, we need NOVOLATILE.

> What happens if an application wants to store non-volatile data in
> an area that was once marked volatile. If there was never memory
> pressure, it seems the volatility would persist with no way of
> removing it.

Yes. That's why this patch is crap and I'm insane. :(

> 
> Either way, I feel that with this revision, specifically dropping
> the NOVOLATILE call and the SIGBUS optimization the Mozilla folks
> suggested, your implementation has drifted quite far from the
> concept I'm pushing. While I hope we can still align the underlying
> mm implementation, I might ask that you use a different term for the
> semantics you propose, so we don't add too much confusion to the
> discussion.
> 
> Maybe you could call it DONTNEED_DEFERRED or something?
> 
> In the meantime, I'll be reading your patch in detail and seeing how
> we might be able to combine our differing approaches.

You don't need it. Ignore this patch.
I will rework.

Thanks.

> 
> thanks
> -john
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-11  2:41 ` Minchan Kim
  2012-12-11  7:17   ` Mike Hommey
@ 2012-12-12  6:43   ` Wanpeng Li
  2012-12-12  8:17     ` Wanpeng Li
                       ` (2 more replies)
  2012-12-12  6:43   ` Wanpeng Li
       [not found]   ` <50c827cb.ce98320a.7d38.ffffad3fSMTPIN_ADDED_BROKEN@mx.google.com>
  3 siblings, 3 replies; 16+ messages in thread
From: Wanpeng Li @ 2012-12-12  6:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
>Sorry, resending with fixing compile error. :(
>
>>From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001
>From: Minchan Kim <minchan@kernel.org>
>Date: Tue, 11 Dec 2012 11:38:30 +0900
>Subject: [RFC v3] Support volatile range for anon vma
>
>This still is [RFC v3] because just passed my simple test
>with TCMalloc tweaking.
>
>I hope more inputs from user-space allocator people and test patch
>with their allocator because it might need design change of arena
>management design for getting real vaule.
>
>Changelog from v2
>
> * Removing madvise(addr, length, MADV_NOVOLATILE).
> * add vmstat about the number of discarded volatile pages
> * discard volatile pages without promotion in reclaim path
>
>This is based on v3.6.
>
>- What's the madvise(addr, length, MADV_VOLATILE)?
>
>  It's a hint that user deliver to kernel so kernel can *discard*
>  pages in a range anytime.
>
>- What happens if user access page(ie, virtual address) discarded
>  by kernel?
>
>  The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
>
>- What happens if user access page(ie, virtual address) doesn't
>  discarded by kernel?
>
>  The user can see old data without page fault.
>
>- What's different with madvise(DONTNEED)?
>
>  System call semantic
>
>  DONTNEED makes sure user always can see zero-fill pages after
>  he calls madvise while VOLATILE can see zero-fill pages or
>  old data.
>
>  Internal implementation
>
>  The madvise(DONTNEED) should zap all mapped pages in range so
>  overhead is increased linearly with the number of mapped pages.
>  Even, if user access zapped pages by write, page fault + page
>  allocation + memset should be happened.
>
>  The madvise(VOLATILE) should mark the flag in a range(ie, VMA).
>  It doesn't touch pages any more so overhead of the system call
>  should be very small. If memory pressure happens, VM can discard
>  pages in VMAs marked by VOLATILE. If user access address with
>  write mode by discarding by VM, he can see zero-fill pages so the
>  cost is same with DONTNEED but if memory pressure isn't severe,
>  user can see old data without (page fault + page allocation + memset)
>
>  The VOLATILE mark should be removed in page fault handler when first
>  page fault occur in marked vma so next page faults will follow normal
>  page fault path. That's why user don't need madvise(MADV_NOVOLATILE)
>  interface.
>
>- What's the benefit compared to DONTNEED?
>
>  1. The system call overhead is smaller because VOLATILE just marks
>     the flag to VMA instead of zapping all the page in a range.
>
>  2. It has a chance to eliminate overheads (ex, page fault +
>     page allocation + memset(PAGE_SIZE)).
>
>- Isn't there any drawback?
>
>  DONTNEED doesn't need exclusive mmap_sem locking so concurrent page
>  fault of other threads could be allowed. But VOLATILE needs exclusive
>  mmap_sem so other thread would be blocked if they try to access
>  not-mapped pages. That's why I designed madvise(VOLATILE)'s overhead
>  should be small as far as possible.
>
>  Other concern of exclusive mmap_sem is when page fault occur in
>  VOLATILE marked vma. We should remove the flag of vma and merge
>  adjacent vmas so needs exclusive mmap_sem. It can slow down page fault
>  handling and prevent concurrent page fault. But we need such handling
>  just once when page fault occur after we mark VOLATILE into VMA
>  only if memory pressure happpens so the page is discarded. So it wouldn't
>  not common so that benefit we get by this feature would be bigger than
>  lose.
>
>- What's for targetting?
>
>  Firstly, user-space allocator like ptmalloc, tcmalloc or heap management
>  of virtual machine like Dalvik. Also, it comes in handy for embedded
>  which doesn't have swap device so they can't reclaim anonymous pages.
>  By discarding instead of swap, it could be used in the non-swap system.
>  For it,  we have to age anon lru list although we don't have swap because
>  I don't want to discard volatile pages by top priority when memory pressure
>  happens as volatile in this patch means "We don't need to swap out because
>  user can handle the situation which data are disappear suddenly", NOT
>  "They are useless so hurry up to reclaim them". So I want to apply same
>  aging rule of nomal pages to them.
>
>  Anonymous page background aging of non-swap system would be a trade-off
>  for getting good feature. Even, we had done it two years ago until merge
>  [1] and I believe gain of this patch will beat loss of anon lru aging's
>  overead once all of allocator start to use madvise.
>  (This patch doesn't include background aging in case of non-swap system
>  but it's trivial if we decide)
>
>[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
>
>Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>Cc: Arun Sharma <asharma@fb.com>
>Cc: sanjay@google.com
>Cc: Paul Turner <pjt@google.com>
>CC: David Rientjes <rientjes@google.com>
>Cc: John Stultz <john.stultz@linaro.org>
>Cc: Andrew Morton <akpm@linux-foundation.org>
>Cc: Christoph Lameter <cl@linux.com>
>Cc: Android Kernel Team <kernel-team@android.com>
>Cc: Robert Love <rlove@google.com>
>Cc: Mel Gorman <mel@csn.ul.ie>
>Cc: Hugh Dickins <hughd@google.com>
>Cc: Dave Hansen <dave@linux.vnet.ibm.com>
>Cc: Rik van Riel <riel@redhat.com>
>Cc: Dave Chinner <david@fromorbit.com>
>Cc: Neil Brown <neilb@suse.de>
>Cc: Mike Hommey <mh@glandium.org>
>Cc: Taras Glek <tglek@mozilla.com>
>Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
>Cc: Christoph Lameter <cl@linux.com>
>Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>Signed-off-by: Minchan Kim <minchan@kernel.org>
>---
> arch/x86/mm/fault.c               |    2 +
> include/asm-generic/mman-common.h |    6 ++
> include/linux/mm.h                |    7 ++-
> include/linux/rmap.h              |   20 ++++++
> include/linux/vm_event_item.h     |    2 +-
> mm/madvise.c                      |   19 +++++-
> mm/memory.c                       |   32 ++++++++++
> mm/migrate.c                      |    6 +-
> mm/rmap.c                         |  125 ++++++++++++++++++++++++++++++++++++-
> mm/vmscan.c                       |    7 +++
> mm/vmstat.c                       |    1 +
> 11 files changed, 218 insertions(+), 9 deletions(-)
>
>diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>index 76dcd9d..17c1c20 100644
>--- a/arch/x86/mm/fault.c
>+++ b/arch/x86/mm/fault.c
>@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> 		}
>
> 		out_of_memory(regs, error_code, address);
>+	} else if (fault & VM_FAULT_SIGSEG) {
>+			bad_area(regs, error_code, address);
> 	} else {
> 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
> 			     VM_FAULT_HWPOISON_LARGE))
>diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
>index d030d2c..f07781e 100644
>--- a/include/asm-generic/mman-common.h
>+++ b/include/asm-generic/mman-common.h
>@@ -34,6 +34,12 @@
> #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> #define MADV_WILLNEED	3		/* will need these pages */
> #define MADV_DONTNEED	4		/* don't need these pages */
>+/*
>+ * Unlike other flags, we need two locks to protect MADV_VOLATILE.
>+ * For changing the flag, we need mmap_sem's write lock and volatile_lock
>+ * while we just need volatile_lock in case of reading the flag.
>+ */
>+#define MADV_VOLATILE	5		/* pages will disappear suddenly */
>
> /* common parameters: try to keep these consistent across architectures */
> #define MADV_REMOVE	9		/* remove these pages & resources */
>diff --git a/include/linux/mm.h b/include/linux/mm.h
>index 311be90..89027b5 100644
>--- a/include/linux/mm.h
>+++ b/include/linux/mm.h
>@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);
> #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
> #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
> #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
>+#define VM_VOLATILE	0x100000000	/* Pages in the vma could be discarable without swap */
>
> /* Bits set in the VMA until the stack is in its final location */
> #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
>@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);
>  * Special vmas that are non-mergable, non-mlock()able.
>  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
>  */
>-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
>+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)
>
> /*
>  * mapping from the currently active vm_flags protection bits (the
>@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)
> #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
> #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
> #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
>-
>+#define VM_FAULT_SIGSEG	0x0800	/* -> There is no vma */
> #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
>
> #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
>-			 VM_FAULT_HWPOISON_LARGE)
>+			 VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)
>
> /* Encode hstate index for a hwpoisoned large page */
> #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
>diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>index 3fce545..735d7a3 100644
>--- a/include/linux/rmap.h
>+++ b/include/linux/rmap.h
>@@ -67,6 +67,9 @@ struct anon_vma_chain {
> 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
> };
>
>+void volatile_lock(struct vm_area_struct *vma);
>+void volatile_unlock(struct vm_area_struct *vma);
>+
> #ifdef CONFIG_MMU
> static inline void get_anon_vma(struct anon_vma *anon_vma)
> {
>@@ -170,6 +173,7 @@ enum ttu_flags {
> 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
>+	TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
> };
> #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>
>@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> 	return ptep;
> }
>
>+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
>+                                unsigned long, spinlock_t **);
>+
>+static inline pte_t *page_check_volatile_address(struct page *page,
>+                                        struct mm_struct *mm,
>+                                        unsigned long address,
>+                                        spinlock_t **ptlp)
>+{
>+        pte_t *ptep;
>+
>+        __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
>+                                        mm, address, ptlp));
>+        return ptep;
>+}
>+
> /*
>  * Used by swapoff to help locate where page is expected in vma.
>  */
>@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)
> #define SWAP_AGAIN	1
> #define SWAP_FAIL	2
> #define SWAP_MLOCK	3
>+#define SWAP_DISCARD	4
>
> #endif	/* _LINUX_RMAP_H */
>diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>index 57f7b10..3f9a40b 100644
>--- a/include/linux/vm_event_item.h
>+++ b/include/linux/vm_event_item.h
>@@ -23,7 +23,7 @@
>
> enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> 		FOR_ALL_ZONES(PGALLOC),
>-		PGFREE, PGACTIVATE, PGDEACTIVATE,
>+		PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,
> 		PGFAULT, PGMAJFAULT,
> 		FOR_ALL_ZONES(PGREFILL),
> 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
>diff --git a/mm/madvise.c b/mm/madvise.c
>index 14d260f..53a19d8 100644
>--- a/mm/madvise.c
>+++ b/mm/madvise.c
>@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
> 		if (error)
> 			goto out;
> 		break;
>+	case MADV_VOLATILE:
>+		if (vma->vm_flags & VM_LOCKED) {
>+			error = -EINVAL;
>+			goto out;
>+		}
>+		new_flags |= VM_VOLATILE;
>+		break;
> 	}
>
> 	if (new_flags == vma->vm_flags) {
>@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
> success:
> 	/*
> 	 * vm_flags is protected by the mmap_sem held in write mode.
>+	 * In case of MADV_VOLATILE, we need anon_vma_lock additionally.
> 	 */
>+	if (behavior == MADV_VOLATILE)
>+		volatile_lock(vma);
> 	vma->vm_flags = new_flags;
>-
>+	if (behavior == MADV_VOLATILE)
>+		volatile_unlock(vma);
> out:
> 	if (error == -ENOMEM)
> 		error = -EAGAIN;
>@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)
> #endif
> 	case MADV_DONTDUMP:
> 	case MADV_DODUMP:
>+	case MADV_VOLATILE:
> 		return 1;
>
> 	default:
>@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> 		goto out;
> 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>
>+	if (behavior != MADV_VOLATILE)
>+		len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>+	else
>+		len = len_in & PAGE_MASK;
>+
> 	/* Check to see whether len was rounded up from small -ve to zero */
> 	if (len_in && !len)
> 		goto out;
>diff --git a/mm/memory.c b/mm/memory.c
>index 5736170..b5e4996 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -57,6 +57,7 @@
> #include <linux/swapops.h>
> #include <linux/elf.h>
> #include <linux/gfp.h>
>+#include <linux/mempolicy.h>
>
> #include <asm/io.h>
> #include <asm/pgalloc.h>
>@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,
> 					return do_linear_fault(mm, vma, address,
> 						pte, pmd, flags, entry);
> 			}
>+			if (vma->vm_flags & VM_VOLATILE) {
>+				struct vm_area_struct *prev;
>+
>+				up_read(&mm->mmap_sem);
>+				down_write(&mm->mmap_sem);
>+				vma = find_vma_prev(mm, address, &prev);
>+
>+				/* Someone unmapped the vma */
>+				if (unlikely(!vma) || vma->vm_start > address) {
>+					downgrade_write(&mm->mmap_sem);
>+					return VM_FAULT_SIGSEG;
>+				}
>+				/* Someone else already handled it */
>+				if (vma->vm_flags & VM_VOLATILE) {
>+					/*
>+					 * From now on, we hold mmap_sem as
>+					 * exclusive.
>+					 */
>+					volatile_lock(vma);
>+					vma->vm_flags &= ~VM_VOLATILE;
>+					volatile_unlock(vma);
>+
>+					vma_merge(mm, prev, vma->vm_start,
>+						vma->vm_end, vma->vm_flags,
>+						vma->anon_vma, vma->vm_file,
>+						vma->vm_pgoff, vma_policy(vma));
>+
>+				}
>+
>+				downgrade_write(&mm->mmap_sem);
>+			}
> 			return do_anonymous_page(mm, vma, address,
> 						 pte, pmd, flags);
> 		}
>diff --git a/mm/migrate.c b/mm/migrate.c
>index 77ed2d7..08b009c 100644
>--- a/mm/migrate.c
>+++ b/mm/migrate.c
>@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> 	}
>
> 	/* Establish migration ptes or remove ptes */
>-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
>+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>
> skip_unmap:
> 	if (!page_mapped(page))
>@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> 	if (PageAnon(hpage))
> 		anon_vma = page_get_anon_vma(hpage);
>
>-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
>+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>
> 	if (!page_mapped(hpage))
> 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
>diff --git a/mm/rmap.c b/mm/rmap.c
>index 0f3b7cd..1a0ab2b 100644
>--- a/mm/rmap.c
>+++ b/mm/rmap.c
>@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> 	return vma_address(page, vma);
> }
>
>+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
>+		unsigned long address, spinlock_t **ptlp)
>+{
>+	pgd_t *pgd;
>+	pud_t *pud;
>+	pmd_t *pmd;
>+	pte_t *pte;
>+	spinlock_t *ptl;
>+
>+	swp_entry_t entry = { .val = page_private(page) };
>+
>+	if (unlikely(PageHuge(page))) {
>+		pte = huge_pte_offset(mm, address);
>+		ptl = &mm->page_table_lock;
>+		goto check;
>+	}
>+
>+	pgd = pgd_offset(mm, address);
>+	if (!pgd_present(*pgd))
>+		return NULL;
>+
>+	pud = pud_offset(pgd, address);
>+	if (!pud_present(*pud))
>+		return NULL;
>+
>+	pmd = pmd_offset(pud, address);
>+	if (!pmd_present(*pmd))
>+		return NULL;
>+	if (pmd_trans_huge(*pmd))
>+		return NULL;
>+
>+	pte = pte_offset_map(pmd, address);
>+	ptl = pte_lockptr(mm, pmd);
>+check:
>+	spin_lock(ptl);
>+	if (PageAnon(page)) {
>+		if (!pte_present(*pte) && entry.val ==
>+				pte_to_swp_entry(*pte).val) {
>+			*ptlp = ptl;
>+			return pte;
>+		}
>+	} else {
>+		if (pte_none(*pte)) {
>+			*ptlp = ptl;
>+			return pte;
>+		}
>+	}
>+	pte_unmap_unlock(pte, ptl);
>+	return NULL;
>+}
>+
> /*
>  * Check that @page is mapped at @address into @mm.
>  *
>@@ -1218,6 +1269,35 @@ out:
> 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> }
>
>+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
>+                unsigned long address)
>+{
>+        struct mm_struct *mm = vma->vm_mm;
>+        pte_t *pte;
>+        pte_t pteval;
>+        spinlock_t *ptl;
>+
>+        pte = page_check_volatile_address(page, mm, address, &ptl);
>+        if (!pte)
>+                return 0;
>+
>+        /* Nuke the page table entry. */
>+        flush_cache_page(vma, address, page_to_pfn(page));
>+        pteval = ptep_clear_flush(vma, address, pte);
>+
>+        if (PageAnon(page)) {
>+                swp_entry_t entry = { .val = page_private(page) };
>+                if (PageSwapCache(page)) {
>+                        dec_mm_counter(mm, MM_SWAPENTS);
>+                        swap_free(entry);
>+                }
>+        }
>+
>+        pte_unmap_unlock(pte, ptl);
>+        mmu_notifier_invalidate_page(mm, address);
>+        return 1;
>+}
>+
> /*
>  * Subfunctions of try_to_unmap: try_to_unmap_one called
>  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
>@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
> 	struct anon_vma *anon_vma;
> 	struct anon_vma_chain *avc;
> 	int ret = SWAP_AGAIN;
>+	bool is_volatile = true;
>+
>+	if (flags & TTU_IGNORE_VOLATILE)
>+		is_volatile = false;
>
> 	anon_vma = page_lock_anon_vma(page);
> 	if (!anon_vma)
>@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
> 		 * temporary VMAs until after exec() completes.
> 		 */
> 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
>-				is_vma_temporary_stack(vma))
>+				is_vma_temporary_stack(vma)) {
>+			is_volatile = false;
> 			continue;
>+		}
>
> 		address = vma_address(page, vma);
> 		if (address == -EFAULT)
> 			continue;
>+                /*
>+                 * A volatile page will only be purged if ALL vmas
>+		 * pointing to it are VM_VOLATILE.
>+                 */
>+                if (!(vma->vm_flags & VM_VOLATILE))
>+                        is_volatile = false;
>+
> 		ret = try_to_unmap_one(page, vma, address, flags);
> 		if (ret != SWAP_AGAIN || !page_mapped(page))
> 			break;
> 	}
>
>+        if (page_mapped(page) || is_volatile == false)
>+                goto out;
>+
>+        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
>+                struct vm_area_struct *vma = avc->vma;
>+                unsigned long address;
>+
>+                address = vma_address(page, vma);
>+                try_to_zap_one(page, vma, address);
>+        }
>+        /* We're throwing this page out, so mark it clean */
>+        ClearPageDirty(page);
>+        ret = SWAP_DISCARD;
>+out:
> 	page_unlock_anon_vma(anon_vma);
> 	return ret;
> }
>@@ -1651,6 +1758,7 @@ out:
>  * SWAP_AGAIN	- we missed a mapping, try again later
>  * SWAP_FAIL	- the page is unswappable
>  * SWAP_MLOCK	- page is mlocked.
>+ * SWAP_DISCARD - page is volatile.
>  */
> int try_to_unmap(struct page *page, enum ttu_flags flags)
> {
>@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
> 		ret = try_to_unmap_anon(page, flags);
> 	else
> 		ret = try_to_unmap_file(page, flags);
>-	if (ret != SWAP_MLOCK && !page_mapped(page))
>+	if (ret != SWAP_MLOCK && !page_mapped(page) &&
>+					ret != SWAP_DISCARD)
> 		ret = SWAP_SUCCESS;
> 	return ret;
> }
>@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
> 	anon_vma_free(anon_vma);
> }
>
>+void volatile_lock(struct vm_area_struct *vma)
>+{
>+        if (vma->anon_vma)
>+                anon_vma_lock(vma->anon_vma);
>+}
>+
>+void volatile_unlock(struct vm_area_struct *vma)
>+{
>+        if (vma->anon_vma)
>+                anon_vma_unlock(vma->anon_vma);
>+}
>+
> #ifdef CONFIG_MIGRATION
> /*
>  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 99b434b..4e463a4 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,
> 	if (vm_flags & VM_LOCKED)
> 		return PAGEREF_RECLAIM;
>
>+	if (vm_flags & VM_VOLATILE)
>+		return PAGEREF_RECLAIM;
>+
> 	if (referenced_ptes) {
> 		if (PageSwapBacked(page))
> 			return PAGEREF_ACTIVATE;
>@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> 		 */

Hi Minchan,

IIUC, the anonymous page has already been added to the swap cache through
add_to_swap called by shrink_page_list, but I can't figure out where you
remove it from the swap cache.

Regards,
Wanpeng Li 
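
For reference, a rough sketch of the ordering the question refers to,
simplified from mm/vmscan.c around v3.6 (error handling and most cases
elided; an illustration, not the exact kernel code):

	page = lru_to_page(page_list);

	if (PageAnon(page) && !PageSwapCache(page))
		add_to_swap(page);		/* page enters the swap cache
						 * and now owns a swap entry  */
	mapping = page_mapping(page);		/* the swap cache, for anon    */

	if (page_mapped(page) && mapping) {
		switch (try_to_unmap(page, TTU_UNMAP)) {
		case SWAP_DISCARD:		/* added by this patch         */
			goto discard_page;	/* pageout() is skipped, but
						 * the swap-cache entry taken
						 * by add_to_swap() still
						 * exists at this point       */
		/* ... */
		}
	}
	/* ... */
discard_page:
	/* buffer handling, then __remove_mapping(mapping, page) follows */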

> 		if (page_mapped(page) && mapping) {
> 			switch (try_to_unmap(page, TTU_UNMAP)) {
>+			case SWAP_DISCARD:
>+				count_vm_event(PGVOLATILE);
>+				goto discard_page;
> 			case SWAP_FAIL:
> 				goto activate_locked;
> 			case SWAP_AGAIN:
>@@ -857,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> 			}
> 		}
>
>+discard_page:
> 		/*
> 		 * If the page has buffers, try to free the buffer mappings
> 		 * associated with this page. If we succeed we try to free
>diff --git a/mm/vmstat.c b/mm/vmstat.c
>index df7a674..410caf5 100644
>--- a/mm/vmstat.c
>+++ b/mm/vmstat.c
>@@ -734,6 +734,7 @@ const char * const vmstat_text[] = {
> 	TEXTS_FOR_ZONES("pgalloc")
>
> 	"pgfree",
>+	"pgvolatile",
> 	"pgactivate",
> 	"pgdeactivate",
>
>-- 
>1.7.9.5
>
>-- 
>Kind regards,
>Minchan Kim
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
       [not found]   ` <50c827cb.ce98320a.7d38.ffffad3fSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2012-12-12  8:15     ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-12-12  8:15 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Wed, Dec 12, 2012 at 02:43:49PM +0800, Wanpeng Li wrote:
> On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> >Sorry, resending with fixing compile error. :(
> >
> >>From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001
> >From: Minchan Kim <minchan@kernel.org>
> >Date: Tue, 11 Dec 2012 11:38:30 +0900
> >Subject: [RFC v3] Support volatile range for anon vma
> >
> >This still is [RFC v3] because just passed my simple test
> >with TCMalloc tweaking.
> >
> >I hope more inputs from user-space allocator people and test patch
> >with their allocator because it might need design change of arena
> >management design for getting real value.
> >
> >Changelog from v2
> >
> > * Removing madvise(addr, length, MADV_NOVOLATILE).
> > * add vmstat about the number of discarded volatile pages
> > * discard volatile pages without promotion in reclaim path
> >
> >This is based on v3.6.
> >
> >- What's the madvise(addr, length, MADV_VOLATILE)?
> >
> >  It's a hint that user deliver to kernel so kernel can *discard*
> >  pages in a range anytime.
> >
> >- What happens if user access page(ie, virtual address) discarded
> >  by kernel?
> >
> >  The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
> >
> >- What happens if user access page(ie, virtual address) doesn't
> >  discarded by kernel?
> >
> >  The user can see old data without page fault.
> >
> >- What's different with madvise(DONTNEED)?
> >
> >  System call semantic
> >
> >  DONTNEED makes sure user always can see zero-fill pages after
> >  he calls madvise while VOLATILE can see zero-fill pages or
> >  old data.
> >
> >  Internal implementation
> >
> >  The madvise(DONTNEED) should zap all mapped pages in range so
> >  overhead is increased linearly with the number of mapped pages.
> >  Even, if user access zapped pages by write, page fault + page
> >  allocation + memset should be happened.
> >
> >  The madvise(VOLATILE) should mark the flag in a range(ie, VMA).
> >  It doesn't touch pages any more so overhead of the system call
> >  should be very small. If memory pressure happens, VM can discard
> >  pages in VMAs marked by VOLATILE. If user access address with
> >  write mode by discarding by VM, he can see zero-fill pages so the
> >  cost is same with DONTNEED but if memory pressure isn't severe,
> >  user can see old data without (page fault + page allocation + memset)
> >
> >  The VOLATILE mark should be removed in page fault handler when first
> >  page fault occur in marked vma so next page faults will follow normal
> >  page fault path. That's why user don't need madvise(MADV_NOVOLATILE)
> >  interface.
> >
> >- What's the benefit compared to DONTNEED?
> >
> >  1. The system call overhead is smaller because VOLATILE just marks
> >     the flag to VMA instead of zapping all the page in a range.
> >
> >  2. It has a chance to eliminate overheads (ex, page fault +
> >     page allocation + memset(PAGE_SIZE)).
> >
> >- Isn't there any drawback?
> >
> >  DONTNEED doesn't need exclusive mmap_sem locking so concurrent page
> >  fault of other threads could be allowed. But VOLATILE needs exclusive
> >  mmap_sem so other thread would be blocked if they try to access
> >  not-mapped pages. That's why I designed madvise(VOLATILE)'s overhead
> >  should be small as far as possible.
> >
> >  Other concern of exclusive mmap_sem is when page fault occur in
> >  VOLATILE marked vma. We should remove the flag of vma and merge
> >  adjacent vmas so needs exclusive mmap_sem. It can slow down page fault
> >  handling and prevent concurrent page fault. But we need such handling
> >  just once when page fault occur after we mark VOLATILE into VMA
> >  only if memory pressure happens so the page is discarded. So it wouldn't
> >  be common, and the benefit we get from this feature would be bigger than
> >  the loss.
> >
> >- What's for targetting?
> >
> >  Firstly, user-space allocator like ptmalloc, tcmalloc or heap management
> >  of virtual machine like Dalvik. Also, it comes in handy for embedded
> >  which doesn't have swap device so they can't reclaim anonymous pages.
> >  By discarding instead of swap, it could be used in the non-swap system.
> >  For it,  we have to age anon lru list although we don't have swap because
> >  I don't want to discard volatile pages by top priority when memory pressure
> >  happens as volatile in this patch means "We don't need to swap out because
> >  user can handle the situation which data are disappear suddenly", NOT
> >  "They are useless so hurry up to reclaim them". So I want to apply same
> >  aging rule of normal pages to them.
> >
> >  Anonymous page background aging of non-swap system would be a trade-off
> >  for getting good feature. Even, we had done it two years ago until merge
> >  [1] and I believe gain of this patch will beat loss of anon lru aging's
> >  overhead once all of the allocators start to use madvise.
> >  (This patch doesn't include background aging in case of non-swap system
> >  but it's trivial if we decide)
> >
> >[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
> >
> >Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> >Cc: Arun Sharma <asharma@fb.com>
> >Cc: sanjay@google.com
> >Cc: Paul Turner <pjt@google.com>
> >CC: David Rientjes <rientjes@google.com>
> >Cc: John Stultz <john.stultz@linaro.org>
> >Cc: Andrew Morton <akpm@linux-foundation.org>
> >Cc: Christoph Lameter <cl@linux.com>
> >Cc: Android Kernel Team <kernel-team@android.com>
> >Cc: Robert Love <rlove@google.com>
> >Cc: Mel Gorman <mel@csn.ul.ie>
> >Cc: Hugh Dickins <hughd@google.com>
> >Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> >Cc: Rik van Riel <riel@redhat.com>
> >Cc: Dave Chinner <david@fromorbit.com>
> >Cc: Neil Brown <neilb@suse.de>
> >Cc: Mike Hommey <mh@glandium.org>
> >Cc: Taras Glek <tglek@mozilla.com>
> >Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> >Cc: Christoph Lameter <cl@linux.com>
> >Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> > arch/x86/mm/fault.c               |    2 +
> > include/asm-generic/mman-common.h |    6 ++
> > include/linux/mm.h                |    7 ++-
> > include/linux/rmap.h              |   20 ++++++
> > include/linux/vm_event_item.h     |    2 +-
> > mm/madvise.c                      |   19 +++++-
> > mm/memory.c                       |   32 ++++++++++
> > mm/migrate.c                      |    6 +-
> > mm/rmap.c                         |  125 ++++++++++++++++++++++++++++++++++++-
> > mm/vmscan.c                       |    7 +++
> > mm/vmstat.c                       |    1 +
> > 11 files changed, 218 insertions(+), 9 deletions(-)
> >
> >diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> >index 76dcd9d..17c1c20 100644
> >--- a/arch/x86/mm/fault.c
> >+++ b/arch/x86/mm/fault.c
> >@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> > 		}
> >
> > 		out_of_memory(regs, error_code, address);
> >+	} else if (fault & VM_FAULT_SIGSEG) {
> >+			bad_area(regs, error_code, address);
> > 	} else {
> > 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
> > 			     VM_FAULT_HWPOISON_LARGE))
> >diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
> >index d030d2c..f07781e 100644
> >--- a/include/asm-generic/mman-common.h
> >+++ b/include/asm-generic/mman-common.h
> >@@ -34,6 +34,12 @@
> > #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> > #define MADV_WILLNEED	3		/* will need these pages */
> > #define MADV_DONTNEED	4		/* don't need these pages */
> >+/*
> >+ * Unlike other flags, we need two locks to protect MADV_VOLATILE.
> >+ * For changing the flag, we need mmap_sem's write lock and volatile_lock
> >+ * while we just need volatile_lock in case of reading the flag.
> >+ */
> >+#define MADV_VOLATILE	5		/* pages will disappear suddenly */
> >
> > /* common parameters: try to keep these consistent across architectures */
> > #define MADV_REMOVE	9		/* remove these pages & resources */
> >diff --git a/include/linux/mm.h b/include/linux/mm.h
> >index 311be90..89027b5 100644
> >--- a/include/linux/mm.h
> >+++ b/include/linux/mm.h
> >@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);
> > #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
> > #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
> > #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
> >+#define VM_VOLATILE	0x100000000	/* Pages in the vma could be discarded without swap */
> >
> > /* Bits set in the VMA until the stack is in its final location */
> > #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
> >@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);
> >  * Special vmas that are non-mergable, non-mlock()able.
> >  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
> >  */
> >-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
> >+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)
> >
> > /*
> >  * mapping from the currently active vm_flags protection bits (the
> >@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)
> > #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
> > #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
> > #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
> >-
> >+#define VM_FAULT_SIGSEG	0x0800	/* -> There is no vma */
> > #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
> >
> > #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
> >-			 VM_FAULT_HWPOISON_LARGE)
> >+			 VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)
> >
> > /* Encode hstate index for a hwpoisoned large page */
> > #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
> >diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> >index 3fce545..735d7a3 100644
> >--- a/include/linux/rmap.h
> >+++ b/include/linux/rmap.h
> >@@ -67,6 +67,9 @@ struct anon_vma_chain {
> > 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
> > };
> >
> >+void volatile_lock(struct vm_area_struct *vma);
> >+void volatile_unlock(struct vm_area_struct *vma);
> >+
> > #ifdef CONFIG_MMU
> > static inline void get_anon_vma(struct anon_vma *anon_vma)
> > {
> >@@ -170,6 +173,7 @@ enum ttu_flags {
> > 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> > 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> > 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
> >+	TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
> > };
> > #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
> >
> >@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> > 	return ptep;
> > }
> >
> >+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
> >+                                unsigned long, spinlock_t **);
> >+
> >+static inline pte_t *page_check_volatile_address(struct page *page,
> >+                                        struct mm_struct *mm,
> >+                                        unsigned long address,
> >+                                        spinlock_t **ptlp)
> >+{
> >+        pte_t *ptep;
> >+
> >+        __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
> >+                                        mm, address, ptlp));
> >+        return ptep;
> >+}
> >+
> > /*
> >  * Used by swapoff to help locate where page is expected in vma.
> >  */
> >@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)
> > #define SWAP_AGAIN	1
> > #define SWAP_FAIL	2
> > #define SWAP_MLOCK	3
> >+#define SWAP_DISCARD	4
> >
> > #endif	/* _LINUX_RMAP_H */
> >diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> >index 57f7b10..3f9a40b 100644
> >--- a/include/linux/vm_event_item.h
> >+++ b/include/linux/vm_event_item.h
> >@@ -23,7 +23,7 @@
> >
> > enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > 		FOR_ALL_ZONES(PGALLOC),
> >-		PGFREE, PGACTIVATE, PGDEACTIVATE,
> >+		PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,
> > 		PGFAULT, PGMAJFAULT,
> > 		FOR_ALL_ZONES(PGREFILL),
> > 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> >diff --git a/mm/madvise.c b/mm/madvise.c
> >index 14d260f..53a19d8 100644
> >--- a/mm/madvise.c
> >+++ b/mm/madvise.c
> >@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
> > 		if (error)
> > 			goto out;
> > 		break;
> >+	case MADV_VOLATILE:
> >+		if (vma->vm_flags & VM_LOCKED) {
> >+			error = -EINVAL;
> >+			goto out;
> >+		}
> >+		new_flags |= VM_VOLATILE;
> >+		break;
> > 	}
> >
> > 	if (new_flags == vma->vm_flags) {
> >@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
> > success:
> > 	/*
> > 	 * vm_flags is protected by the mmap_sem held in write mode.
> >+	 * In case of MADV_VOLATILE, we need anon_vma_lock additionally.
> > 	 */
> >+	if (behavior == MADV_VOLATILE)
> >+		volatile_lock(vma);
> > 	vma->vm_flags = new_flags;
> >-
> >+	if (behavior == MADV_VOLATILE)
> >+		volatile_unlock(vma);
> > out:
> > 	if (error == -ENOMEM)
> > 		error = -EAGAIN;
> >@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)
> > #endif
> > 	case MADV_DONTDUMP:
> > 	case MADV_DODUMP:
> >+	case MADV_VOLATILE:
> > 		return 1;
> >
> > 	default:
> >@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > 		goto out;
> > 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> >
> >+	if (behavior != MADV_VOLATILE)
> >+		len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> >+	else
> >+		len = len_in & PAGE_MASK;
> >+
> > 	/* Check to see whether len was rounded up from small -ve to zero */
> > 	if (len_in && !len)
> > 		goto out;
> >diff --git a/mm/memory.c b/mm/memory.c
> >index 5736170..b5e4996 100644
> >--- a/mm/memory.c
> >+++ b/mm/memory.c
> >@@ -57,6 +57,7 @@
> > #include <linux/swapops.h>
> > #include <linux/elf.h>
> > #include <linux/gfp.h>
> >+#include <linux/mempolicy.h>
> >
> > #include <asm/io.h>
> > #include <asm/pgalloc.h>
> >@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,
> > 					return do_linear_fault(mm, vma, address,
> > 						pte, pmd, flags, entry);
> > 			}
> >+			if (vma->vm_flags & VM_VOLATILE) {
> >+				struct vm_area_struct *prev;
> >+
> >+				up_read(&mm->mmap_sem);
> >+				down_write(&mm->mmap_sem);
> >+				vma = find_vma_prev(mm, address, &prev);
> >+
> >+				/* Someone unmapped the vma */
> >+				if (unlikely(!vma) || vma->vm_start > address) {
> >+					downgrade_write(&mm->mmap_sem);
> >+					return VM_FAULT_SIGSEG;
> >+				}
> >+				/* Someone else already handled it */
> >+				if (vma->vm_flags & VM_VOLATILE) {
> >+					/*
> >+					 * From now on, we hold mmap_sem as
> >+					 * exclusive.
> >+					 */
> >+					volatile_lock(vma);
> >+					vma->vm_flags &= ~VM_VOLATILE;
> >+					volatile_unlock(vma);
> >+
> >+					vma_merge(mm, prev, vma->vm_start,
> >+						vma->vm_end, vma->vm_flags,
> >+						vma->anon_vma, vma->vm_file,
> >+						vma->vm_pgoff, vma_policy(vma));
> >+
> >+				}
> >+
> >+				downgrade_write(&mm->mmap_sem);
> >+			}
> > 			return do_anonymous_page(mm, vma, address,
> > 						 pte, pmd, flags);
> > 		}
> >diff --git a/mm/migrate.c b/mm/migrate.c
> >index 77ed2d7..08b009c 100644
> >--- a/mm/migrate.c
> >+++ b/mm/migrate.c
> >@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> > 	}
> >
> > 	/* Establish migration ptes or remove ptes */
> >-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> >+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> >+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
> >
> > skip_unmap:
> > 	if (!page_mapped(page))
> >@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> > 	if (PageAnon(hpage))
> > 		anon_vma = page_get_anon_vma(hpage);
> >
> >-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> >+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> >+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
> >
> > 	if (!page_mapped(hpage))
> > 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
> >diff --git a/mm/rmap.c b/mm/rmap.c
> >index 0f3b7cd..1a0ab2b 100644
> >--- a/mm/rmap.c
> >+++ b/mm/rmap.c
> >@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> > 	return vma_address(page, vma);
> > }
> >
> >+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
> >+		unsigned long address, spinlock_t **ptlp)
> >+{
> >+	pgd_t *pgd;
> >+	pud_t *pud;
> >+	pmd_t *pmd;
> >+	pte_t *pte;
> >+	spinlock_t *ptl;
> >+
> >+	swp_entry_t entry = { .val = page_private(page) };
> >+
> >+	if (unlikely(PageHuge(page))) {
> >+		pte = huge_pte_offset(mm, address);
> >+		ptl = &mm->page_table_lock;
> >+		goto check;
> >+	}
> >+
> >+	pgd = pgd_offset(mm, address);
> >+	if (!pgd_present(*pgd))
> >+		return NULL;
> >+
> >+	pud = pud_offset(pgd, address);
> >+	if (!pud_present(*pud))
> >+		return NULL;
> >+
> >+	pmd = pmd_offset(pud, address);
> >+	if (!pmd_present(*pmd))
> >+		return NULL;
> >+	if (pmd_trans_huge(*pmd))
> >+		return NULL;
> >+
> >+	pte = pte_offset_map(pmd, address);
> >+	ptl = pte_lockptr(mm, pmd);
> >+check:
> >+	spin_lock(ptl);
> >+	if (PageAnon(page)) {
> >+		if (!pte_present(*pte) && entry.val ==
> >+				pte_to_swp_entry(*pte).val) {
> >+			*ptlp = ptl;
> >+			return pte;
> >+		}
> >+	} else {
> >+		if (pte_none(*pte)) {
> >+			*ptlp = ptl;
> >+			return pte;
> >+		}
> >+	}
> >+	pte_unmap_unlock(pte, ptl);
> >+	return NULL;
> >+}
> >+
> > /*
> >  * Check that @page is mapped at @address into @mm.
> >  *
> >@@ -1218,6 +1269,35 @@ out:
> > 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> > }
> >
> >+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
> >+                unsigned long address)
> >+{
> >+        struct mm_struct *mm = vma->vm_mm;
> >+        pte_t *pte;
> >+        pte_t pteval;
> >+        spinlock_t *ptl;
> >+
> >+        pte = page_check_volatile_address(page, mm, address, &ptl);
> >+        if (!pte)
> >+                return 0;
> >+
> >+        /* Nuke the page table entry. */
> >+        flush_cache_page(vma, address, page_to_pfn(page));
> >+        pteval = ptep_clear_flush(vma, address, pte);
> >+
> >+        if (PageAnon(page)) {
> >+                swp_entry_t entry = { .val = page_private(page) };
> >+                if (PageSwapCache(page)) {
> >+                        dec_mm_counter(mm, MM_SWAPENTS);
> >+                        swap_free(entry);
> >+                }
> >+        }
> >+
> >+        pte_unmap_unlock(pte, ptl);
> >+        mmu_notifier_invalidate_page(mm, address);
> >+        return 1;
> >+}
> >+
> > /*
> >  * Subfunctions of try_to_unmap: try_to_unmap_one called
> >  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
> >@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
> > 	struct anon_vma *anon_vma;
> > 	struct anon_vma_chain *avc;
> > 	int ret = SWAP_AGAIN;
> >+	bool is_volatile = true;
> >+
> >+	if (flags & TTU_IGNORE_VOLATILE)
> >+		is_volatile = false;
> >
> > 	anon_vma = page_lock_anon_vma(page);
> > 	if (!anon_vma)
> >@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
> > 		 * temporary VMAs until after exec() completes.
> > 		 */
> > 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> >-				is_vma_temporary_stack(vma))
> >+				is_vma_temporary_stack(vma)) {
> >+			is_volatile = false;
> > 			continue;
> >+		}
> >
> > 		address = vma_address(page, vma);
> > 		if (address == -EFAULT)
> > 			continue;
> >+                /*
> >+                 * A volatile page will only be purged if ALL vmas
> >+		 * pointing to it are VM_VOLATILE.
> >+                 */
> >+                if (!(vma->vm_flags & VM_VOLATILE))
> >+                        is_volatile = false;
> >+
> > 		ret = try_to_unmap_one(page, vma, address, flags);
> > 		if (ret != SWAP_AGAIN || !page_mapped(page))
> > 			break;
> > 	}
> >
> >+        if (page_mapped(page) || is_volatile == false)
> >+                goto out;
> >+
> >+        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> >+                struct vm_area_struct *vma = avc->vma;
> >+                unsigned long address;
> >+
> >+                address = vma_address(page, vma);
> >+                try_to_zap_one(page, vma, address);
> >+        }
> >+        /* We're throwing this page out, so mark it clean */
> >+        ClearPageDirty(page);
> >+        ret = SWAP_DISCARD;
> >+out:
> > 	page_unlock_anon_vma(anon_vma);
> > 	return ret;
> > }
> >@@ -1651,6 +1758,7 @@ out:
> >  * SWAP_AGAIN	- we missed a mapping, try again later
> >  * SWAP_FAIL	- the page is unswappable
> >  * SWAP_MLOCK	- page is mlocked.
> >+ * SWAP_DISCARD - page is volatile.
> >  */
> > int try_to_unmap(struct page *page, enum ttu_flags flags)
> > {
> >@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
> > 		ret = try_to_unmap_anon(page, flags);
> > 	else
> > 		ret = try_to_unmap_file(page, flags);
> >-	if (ret != SWAP_MLOCK && !page_mapped(page))
> >+	if (ret != SWAP_MLOCK && !page_mapped(page) &&
> >+					ret != SWAP_DISCARD)
> > 		ret = SWAP_SUCCESS;
> > 	return ret;
> > }
> >@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
> > 	anon_vma_free(anon_vma);
> > }
> >
> >+void volatile_lock(struct vm_area_struct *vma)
> >+{
> >+        if (vma->anon_vma)
> >+                anon_vma_lock(vma->anon_vma);
> >+}
> >+
> >+void volatile_unlock(struct vm_area_struct *vma)
> >+{
> >+        if (vma->anon_vma)
> >+                anon_vma_unlock(vma->anon_vma);
> >+}
> >+
> > #ifdef CONFIG_MIGRATION
> > /*
> >  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index 99b434b..4e463a4 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,
> > 	if (vm_flags & VM_LOCKED)
> > 		return PAGEREF_RECLAIM;
> >
> >+	if (vm_flags & VM_VOLATILE)
> >+		return PAGEREF_RECLAIM;
> >+
> > 	if (referenced_ptes) {
> > 		if (PageSwapBacked(page))
> > 			return PAGEREF_ACTIVATE;
> >@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > 		 */
> 
> Hi Minchan,
> 
> IIUC, an anonymous page has already been added to the swap cache through
> add_to_swap called by shrink_page_list, but I can't figure out where you
> remove it from the swap cache.

I intended it to be removed from the swap cache (and the swap entry freed) in __remove_mapping().
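
Roughly, that happens in the swap-cache branch of __remove_mapping() -- the
sketch below is paraphrased from mm/vmscan.c in v3.6 and is not part of this
patch; once the page's refcount has been frozen, the swap cache entry and
the swap slot are released there:

	if (PageSwapCache(page)) {
		swp_entry_t swap = { .val = page_private(page) };

		/* page is locked and its refcount frozen at this point */
		__delete_from_swap_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		swapcache_free(swap, page);
	}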

Thanks.
-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
  2012-12-12  6:43   ` Wanpeng Li
@ 2012-12-12  8:17     ` Wanpeng Li
  2012-12-12  8:17     ` Wanpeng Li
       [not found]     ` <50c83d9b.49fe2a0a.57ee.ffff90b0SMTPIN_ADDED_BROKEN@mx.google.com>
  2 siblings, 0 replies; 16+ messages in thread
From: Wanpeng Li @ 2012-12-12  8:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: John Stultz, Andrew Morton, linux-mm, linux-kernel,
	Michael Kerrisk, Arun Sharma, sanjay, Paul Turner, David Rientjes,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Wed, Dec 12, 2012 at 02:43:49PM +0800, Wanpeng Li wrote:
>On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
>>Sorry, resending with fixing compile error. :(
>>
>>>From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001
>>From: Minchan Kim <minchan@kernel.org>
>>Date: Tue, 11 Dec 2012 11:38:30 +0900
>>Subject: [RFC v3] Support volatile range for anon vma
>>
>>This still is [RFC v3] because just passed my simple test
>>with TCMalloc tweaking.
>>
>>I hope more inputs from user-space allocator people and test patch
>>with their allocator because it might need design change of arena
>>management design for getting real vaule.
>>
>>Changelog from v2
>>
>> * Removing madvise(addr, length, MADV_NOVOLATILE).
>> * add vmstat about the number of discarded volatile pages
>> * discard volatile pages without promotion in reclaim path
>>
>>This is based on v3.6.
>>
>>- What's the madvise(addr, length, MADV_VOLATILE)?
>>
>>  It's a hint that user deliver to kernel so kernel can *discard*
>>  pages in a range anytime.
>>
>>- What happens if user access page(ie, virtual address) discarded
>>  by kernel?
>>
>>  The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
>>
>>- What happens if user access page(ie, virtual address) doesn't
>>  discarded by kernel?
>>
>>  The user can see old data without page fault.
>>
>>- What's different with madvise(DONTNEED)?
>>
>>  System call semantic
>>
>>  DONTNEED makes sure user always can see zero-fill pages after
>>  he calls madvise while VOLATILE can see zero-fill pages or
>>  old data.
>>
>>  Internal implementation
>>
>>  The madvise(DONTNEED) should zap all mapped pages in range so
>>  overhead is increased linearly with the number of mapped pages.
>>  Even, if user access zapped pages by write, page fault + page
>>  allocation + memset should be happened.
>>
>>  The madvise(VOLATILE) should mark the flag in a range(ie, VMA).
>>  It doesn't touch pages any more so overhead of the system call
>>  should be very small. If memory pressure happens, VM can discard
>>  pages in VMAs marked by VOLATILE. If user access address with
>>  write mode by discarding by VM, he can see zero-fill pages so the
>>  cost is same with DONTNEED but if memory pressure isn't severe,
>>  user can see old data without (page fault + page allocation + memset)
>>
>>  The VOLATILE mark should be removed in page fault handler when first
>>  page fault occur in marked vma so next page faults will follow normal
>>  page fault path. That's why user don't need madvise(MADV_NOVOLATILE)
>>  interface.
>>
>>- What's the benefit compared to DONTNEED?
>>
>>  1. The system call overhead is smaller because VOLATILE just marks
>>     the flag to VMA instead of zapping all the page in a range.
>>
>>  2. It has a chance to eliminate overheads (ex, page fault +
>>     page allocation + memset(PAGE_SIZE)).
>>
>>- Isn't there any drawback?
>>
>>  DONTNEED doesn't need exclusive mmap_sem locking so concurrent page
>>  fault of other threads could be allowed. But VOLATILE needs exclusive
>>  mmap_sem so other thread would be blocked if they try to access
>>  not-mapped pages. That's why I designed madvise(VOLATILE)'s overhead
>>  should be small as far as possible.
>>
>>  Other concern of exclusive mmap_sem is when page fault occur in
>>  VOLATILE marked vma. We should remove the flag of vma and merge
>>  adjacent vmas so needs exclusive mmap_sem. It can slow down page fault
>>  handling and prevent concurrent page fault. But we need such handling
>>  just once when page fault occur after we mark VOLATILE into VMA
>>  only if memory pressure happpens so the page is discarded. So it wouldn't
>>  not common so that benefit we get by this feature would be bigger than
>>  lose.
>>
>>- What's for targetting?
>>
>>  Firstly, user-space allocator like ptmalloc, tcmalloc or heap management
>>  of virtual machine like Dalvik. Also, it comes in handy for embedded
>>  which doesn't have swap device so they can't reclaim anonymous pages.
>>  By discarding instead of swap, it could be used in the non-swap system.
>>  For it,  we have to age anon lru list although we don't have swap because
>>  I don't want to discard volatile pages by top priority when memory pressure
>>  happens as volatile in this patch means "We don't need to swap out because
>>  user can handle the situation which data are disappear suddenly", NOT
>>  "They are useless so hurry up to reclaim them". So I want to apply same
>>  aging rule of nomal pages to them.
>>
>>  Anonymous page background aging of non-swap system would be a trade-off
>>  for getting good feature. Even, we had done it two years ago until merge
>>  [1] and I believe gain of this patch will beat loss of anon lru aging's
>>  overead once all of allocator start to use madvise.
>>  (This patch doesn't include background aging in case of non-swap system
>>  but it's trivial if we decide)
>>
>>[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
>>
>>Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>>Cc: Arun Sharma <asharma@fb.com>
>>Cc: sanjay@google.com
>>Cc: Paul Turner <pjt@google.com>
>>CC: David Rientjes <rientjes@google.com>
>>Cc: John Stultz <john.stultz@linaro.org>
>>Cc: Andrew Morton <akpm@linux-foundation.org>
>>Cc: Christoph Lameter <cl@linux.com>
>>Cc: Android Kernel Team <kernel-team@android.com>
>>Cc: Robert Love <rlove@google.com>
>>Cc: Mel Gorman <mel@csn.ul.ie>
>>Cc: Hugh Dickins <hughd@google.com>
>>Cc: Dave Hansen <dave@linux.vnet.ibm.com>
>>Cc: Rik van Riel <riel@redhat.com>
>>Cc: Dave Chinner <david@fromorbit.com>
>>Cc: Neil Brown <neilb@suse.de>
>>Cc: Mike Hommey <mh@glandium.org>
>>Cc: Taras Glek <tglek@mozilla.com>
>>Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
>>Cc: Christoph Lameter <cl@linux.com>
>>Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>Signed-off-by: Minchan Kim <minchan@kernel.org>
>>---
>> arch/x86/mm/fault.c               |    2 +
>> include/asm-generic/mman-common.h |    6 ++
>> include/linux/mm.h                |    7 ++-
>> include/linux/rmap.h              |   20 ++++++
>> include/linux/vm_event_item.h     |    2 +-
>> mm/madvise.c                      |   19 +++++-
>> mm/memory.c                       |   32 ++++++++++
>> mm/migrate.c                      |    6 +-
>> mm/rmap.c                         |  125 ++++++++++++++++++++++++++++++++++++-
>> mm/vmscan.c                       |    7 +++
>> mm/vmstat.c                       |    1 +
>> 11 files changed, 218 insertions(+), 9 deletions(-)
>>
>>diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>>index 76dcd9d..17c1c20 100644
>>--- a/arch/x86/mm/fault.c
>>+++ b/arch/x86/mm/fault.c
>>@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>> 		}
>>
>> 		out_of_memory(regs, error_code, address);
>>+	} else if (fault & VM_FAULT_SIGSEG) {
>>+			bad_area(regs, error_code, address);
>> 	} else {
>> 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
>> 			     VM_FAULT_HWPOISON_LARGE))
>>diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
>>index d030d2c..f07781e 100644
>>--- a/include/asm-generic/mman-common.h
>>+++ b/include/asm-generic/mman-common.h
>>@@ -34,6 +34,12 @@
>> #define MADV_SEQUENTIAL	2		/* expect sequential page references */
>> #define MADV_WILLNEED	3		/* will need these pages */
>> #define MADV_DONTNEED	4		/* don't need these pages */
>>+/*
>>+ * Unlike other flags, we need two locks to protect MADV_VOLATILE.
>>+ * For changing the flag, we need mmap_sem's write lock and volatile_lock
>>+ * while we just need volatile_lock in case of reading the flag.
>>+ */
>>+#define MADV_VOLATILE	5		/* pages will disappear suddenly */
>>
>> /* common parameters: try to keep these consistent across architectures */
>> #define MADV_REMOVE	9		/* remove these pages & resources */
>>diff --git a/include/linux/mm.h b/include/linux/mm.h
>>index 311be90..89027b5 100644
>>--- a/include/linux/mm.h
>>+++ b/include/linux/mm.h
>>@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);
>> #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
>> #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
>> #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
>>+#define VM_VOLATILE	0x100000000	/* Pages in the vma can be discarded without swap */
>>
>> /* Bits set in the VMA until the stack is in its final location */
>> #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
>>@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);
>>  * Special vmas that are non-mergable, non-mlock()able.
>>  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
>>  */
>>-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
>>+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)
>>
>> /*
>>  * mapping from the currently active vm_flags protection bits (the
>>@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)
>> #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
>> #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
>> #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
>>-
>>+#define VM_FAULT_SIGSEG	0x0800	/* -> There is no vma */
>> #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
>>
>> #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
>>-			 VM_FAULT_HWPOISON_LARGE)
>>+			 VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)
>>
>> /* Encode hstate index for a hwpoisoned large page */
>> #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
>>diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>index 3fce545..735d7a3 100644
>>--- a/include/linux/rmap.h
>>+++ b/include/linux/rmap.h
>>@@ -67,6 +67,9 @@ struct anon_vma_chain {
>> 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
>> };
>>
>>+void volatile_lock(struct vm_area_struct *vma);
>>+void volatile_unlock(struct vm_area_struct *vma);
>>+
>> #ifdef CONFIG_MMU
>> static inline void get_anon_vma(struct anon_vma *anon_vma)
>> {
>>@@ -170,6 +173,7 @@ enum ttu_flags {
>> 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
>> 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
>> 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
>>+	TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
>> };
>> #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>>
>>@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
>> 	return ptep;
>> }
>>
>>+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
>>+                                unsigned long, spinlock_t **);
>>+
>>+static inline pte_t *page_check_volatile_address(struct page *page,
>>+                                        struct mm_struct *mm,
>>+                                        unsigned long address,
>>+                                        spinlock_t **ptlp)
>>+{
>>+        pte_t *ptep;
>>+
>>+        __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
>>+                                        mm, address, ptlp));
>>+        return ptep;
>>+}
>>+
>> /*
>>  * Used by swapoff to help locate where page is expected in vma.
>>  */
>>@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)
>> #define SWAP_AGAIN	1
>> #define SWAP_FAIL	2
>> #define SWAP_MLOCK	3
>>+#define SWAP_DISCARD	4
>>
>> #endif	/* _LINUX_RMAP_H */
>>diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>index 57f7b10..3f9a40b 100644
>>--- a/include/linux/vm_event_item.h
>>+++ b/include/linux/vm_event_item.h
>>@@ -23,7 +23,7 @@
>>
>> enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> 		FOR_ALL_ZONES(PGALLOC),
>>-		PGFREE, PGACTIVATE, PGDEACTIVATE,
>>+		PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,
>> 		PGFAULT, PGMAJFAULT,
>> 		FOR_ALL_ZONES(PGREFILL),
>> 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
>>diff --git a/mm/madvise.c b/mm/madvise.c
>>index 14d260f..53a19d8 100644
>>--- a/mm/madvise.c
>>+++ b/mm/madvise.c
>>@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
>> 		if (error)
>> 			goto out;
>> 		break;
>>+	case MADV_VOLATILE:
>>+		if (vma->vm_flags & VM_LOCKED) {
>>+			error = -EINVAL;
>>+			goto out;
>>+		}
>>+		new_flags |= VM_VOLATILE;
>>+		break;
>> 	}
>>
>> 	if (new_flags == vma->vm_flags) {
>>@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
>> success:
>> 	/*
>> 	 * vm_flags is protected by the mmap_sem held in write mode.
>>+	 * In case of MADV_VOLATILE, we need anon_vma_lock additionally.
>> 	 */
>>+	if (behavior == MADV_VOLATILE)
>>+		volatile_lock(vma);
>> 	vma->vm_flags = new_flags;
>>-
>>+	if (behavior == MADV_VOLATILE)
>>+		volatile_unlock(vma);
>> out:
>> 	if (error == -ENOMEM)
>> 		error = -EAGAIN;
>>@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)
>> #endif
>> 	case MADV_DONTDUMP:
>> 	case MADV_DODUMP:
>>+	case MADV_VOLATILE:
>> 		return 1;
>>
>> 	default:
>>@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>> 		goto out;
>> 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>>
>>+	if (behavior != MADV_VOLATILE)
>>+		len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>>+	else
>>+		len = len_in & PAGE_MASK;
>>+
>> 	/* Check to see whether len was rounded up from small -ve to zero */
>> 	if (len_in && !len)
>> 		goto out;
>>diff --git a/mm/memory.c b/mm/memory.c
>>index 5736170..b5e4996 100644
>>--- a/mm/memory.c
>>+++ b/mm/memory.c
>>@@ -57,6 +57,7 @@
>> #include <linux/swapops.h>
>> #include <linux/elf.h>
>> #include <linux/gfp.h>
>>+#include <linux/mempolicy.h>
>>
>> #include <asm/io.h>
>> #include <asm/pgalloc.h>
>>@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,
>> 					return do_linear_fault(mm, vma, address,
>> 						pte, pmd, flags, entry);
>> 			}
>>+			if (vma->vm_flags & VM_VOLATILE) {
>>+				struct vm_area_struct *prev;
>>+
>>+				up_read(&mm->mmap_sem);
>>+				down_write(&mm->mmap_sem);
>>+				vma = find_vma_prev(mm, address, &prev);
>>+
>>+				/* Someone unmapped the vma */
>>+				if (unlikely(!vma) || vma->vm_start > address) {
>>+					downgrade_write(&mm->mmap_sem);
>>+					return VM_FAULT_SIGSEG;
>>+				}
>>+				/* Someone else already handled it */
>>+				if (vma->vm_flags & VM_VOLATILE) {
>>+					/*
>>+					 * From now on, we hold mmap_sem as
>>+					 * exclusive.
>>+					 */
>>+					volatile_lock(vma);
>>+					vma->vm_flags &= ~VM_VOLATILE;
>>+					volatile_unlock(vma);
>>+
>>+					vma_merge(mm, prev, vma->vm_start,
>>+						vma->vm_end, vma->vm_flags,
>>+						vma->anon_vma, vma->vm_file,
>>+						vma->vm_pgoff, vma_policy(vma));
>>+
>>+				}
>>+
>>+				downgrade_write(&mm->mmap_sem);
>>+			}
>> 			return do_anonymous_page(mm, vma, address,
>> 						 pte, pmd, flags);
>> 		}
>>diff --git a/mm/migrate.c b/mm/migrate.c
>>index 77ed2d7..08b009c 100644
>>--- a/mm/migrate.c
>>+++ b/mm/migrate.c
>>@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>> 	}
>>
>> 	/* Establish migration ptes or remove ptes */
>>-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>>+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
>>+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>>
>> skip_unmap:
>> 	if (!page_mapped(page))
>>@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>> 	if (PageAnon(hpage))
>> 		anon_vma = page_get_anon_vma(hpage);
>>
>>-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>>+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
>>+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>>
>> 	if (!page_mapped(hpage))
>> 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
>>diff --git a/mm/rmap.c b/mm/rmap.c
>>index 0f3b7cd..1a0ab2b 100644
>>--- a/mm/rmap.c
>>+++ b/mm/rmap.c
>>@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>> 	return vma_address(page, vma);
>> }
>>
>>+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
>>+		unsigned long address, spinlock_t **ptlp)
>>+{
>>+	pgd_t *pgd;
>>+	pud_t *pud;
>>+	pmd_t *pmd;
>>+	pte_t *pte;
>>+	spinlock_t *ptl;
>>+
>>+	swp_entry_t entry = { .val = page_private(page) };
>>+
>>+	if (unlikely(PageHuge(page))) {
>>+		pte = huge_pte_offset(mm, address);
>>+		ptl = &mm->page_table_lock;
>>+		goto check;
>>+	}
>>+
>>+	pgd = pgd_offset(mm, address);
>>+	if (!pgd_present(*pgd))
>>+		return NULL;
>>+
>>+	pud = pud_offset(pgd, address);
>>+	if (!pud_present(*pud))
>>+		return NULL;
>>+
>>+	pmd = pmd_offset(pud, address);
>>+	if (!pmd_present(*pmd))
>>+		return NULL;
>>+	if (pmd_trans_huge(*pmd))
>>+		return NULL;
>>+
>>+	pte = pte_offset_map(pmd, address);
>>+	ptl = pte_lockptr(mm, pmd);
>>+check:
>>+	spin_lock(ptl);
>>+	if (PageAnon(page)) {
>>+		if (!pte_present(*pte) && entry.val ==
>>+				pte_to_swp_entry(*pte).val) {
>>+			*ptlp = ptl;
>>+			return pte;
>>+		}
>>+	} else {
>>+		if (pte_none(*pte)) {
>>+			*ptlp = ptl;
>>+			return pte;
>>+		}
>>+	}
>>+	pte_unmap_unlock(pte, ptl);
>>+	return NULL;
>>+}
>>+
>> /*
>>  * Check that @page is mapped at @address into @mm.
>>  *
>>@@ -1218,6 +1269,35 @@ out:
>> 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> }
>>
>>+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
>>+                unsigned long address)
>>+{
>>+        struct mm_struct *mm = vma->vm_mm;
>>+        pte_t *pte;
>>+        pte_t pteval;
>>+        spinlock_t *ptl;
>>+
>>+        pte = page_check_volatile_address(page, mm, address, &ptl);
>>+        if (!pte)
>>+                return 0;
>>+
>>+        /* Nuke the page table entry. */
>>+        flush_cache_page(vma, address, page_to_pfn(page));
>>+        pteval = ptep_clear_flush(vma, address, pte);
>>+
>>+        if (PageAnon(page)) {
>>+                swp_entry_t entry = { .val = page_private(page) };
>>+                if (PageSwapCache(page)) {
>>+                        dec_mm_counter(mm, MM_SWAPENTS);
>>+                        swap_free(entry);
>>+                }
>>+        }
>>+
>>+        pte_unmap_unlock(pte, ptl);
>>+        mmu_notifier_invalidate_page(mm, address);
>>+        return 1;
>>+}
>>+
>> /*
>>  * Subfunctions of try_to_unmap: try_to_unmap_one called
>>  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
>>@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>> 	struct anon_vma *anon_vma;
>> 	struct anon_vma_chain *avc;
>> 	int ret = SWAP_AGAIN;
>>+	bool is_volatile = true;
>>+
>>+	if (flags & TTU_IGNORE_VOLATILE)
>>+		is_volatile = false;
>>
>> 	anon_vma = page_lock_anon_vma(page);
>> 	if (!anon_vma)
>>@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>> 		 * temporary VMAs until after exec() completes.
>> 		 */
>> 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
>>-				is_vma_temporary_stack(vma))
>>+				is_vma_temporary_stack(vma)) {
>>+			is_volatile = false;
>> 			continue;
>>+		}
>>
>> 		address = vma_address(page, vma);
>> 		if (address == -EFAULT)
>> 			continue;
>>+                /*
>>+                 * A volatile page will only be purged if ALL vmas
>>+		 * pointing to it are VM_VOLATILE.
>>+                 */
>>+                if (!(vma->vm_flags & VM_VOLATILE))
>>+                        is_volatile = false;
>>+
>> 		ret = try_to_unmap_one(page, vma, address, flags);
>> 		if (ret != SWAP_AGAIN || !page_mapped(page))
>> 			break;
>> 	}
>>
>>+        if (page_mapped(page) || is_volatile == false)
>>+                goto out;
>>+
>>+        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
>>+                struct vm_area_struct *vma = avc->vma;
>>+                unsigned long address;
>>+
>>+                address = vma_address(page, vma);
>>+                try_to_zap_one(page, vma, address);
>>+        }
>>+        /* We're throwing this page out, so mark it clean */
>>+        ClearPageDirty(page);
>>+        ret = SWAP_DISCARD;
>>+out:
>> 	page_unlock_anon_vma(anon_vma);
>> 	return ret;
>> }
>>@@ -1651,6 +1758,7 @@ out:
>>  * SWAP_AGAIN	- we missed a mapping, try again later
>>  * SWAP_FAIL	- the page is unswappable
>>  * SWAP_MLOCK	- page is mlocked.
>>+ * SWAP_DISCARD - page is volatile.
>>  */
>> int try_to_unmap(struct page *page, enum ttu_flags flags)
>> {
>>@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
>> 		ret = try_to_unmap_anon(page, flags);
>> 	else
>> 		ret = try_to_unmap_file(page, flags);
>>-	if (ret != SWAP_MLOCK && !page_mapped(page))
>>+	if (ret != SWAP_MLOCK && !page_mapped(page) &&
>>+					ret != SWAP_DISCARD)
>> 		ret = SWAP_SUCCESS;
>> 	return ret;
>> }
>>@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
>> 	anon_vma_free(anon_vma);
>> }
>>
>>+void volatile_lock(struct vm_area_struct *vma)
>>+{
>>+        if (vma->anon_vma)
>>+                anon_vma_lock(vma->anon_vma);
>>+}
>>+
>>+void volatile_unlock(struct vm_area_struct *vma)
>>+{
>>+        if (vma->anon_vma)
>>+                anon_vma_unlock(vma->anon_vma);
>>+}
>>+
>> #ifdef CONFIG_MIGRATION
>> /*
>>  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
>>diff --git a/mm/vmscan.c b/mm/vmscan.c
>>index 99b434b..4e463a4 100644
>>--- a/mm/vmscan.c
>>+++ b/mm/vmscan.c
>>@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,
>> 	if (vm_flags & VM_LOCKED)
>> 		return PAGEREF_RECLAIM;
>>
>>+	if (vm_flags & VM_VOLATILE)
>>+		return PAGEREF_RECLAIM;
>>+
>> 	if (referenced_ptes) {
>> 		if (PageSwapBacked(page))
>> 			return PAGEREF_ACTIVATE;
>>@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 		 */
>
>Hi Minchan,
>
>IIUC, an anonymous page has already been added to the swap cache through
>add_to_swap called by shrink_page_list, but I can't figure out where you
>remove it from the swap cache.
>

Yeah, that is all done in shrink_page_list. What I mean is whether you could
avoid the round trip of adding the page to the swap cache and then removing
it again, since your idea doesn't need swapout.
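
Just to make that concrete, the sketch below is roughly what I have in mind
for shrink_page_list -- untested, and page_is_purely_volatile() /
try_to_discard() are made-up helper names, not anything in this patch;
try_to_unmap_one() would also have to cope with anon pages that never got a
swap entry:

	if (PageAnon(page) && !PageSwapCache(page)) {
		/*
		 * Hypothetical: if every vma mapping the page is
		 * VM_VOLATILE, the page can only be discarded, so skip
		 * the swap cache round trip and zap the mappings
		 * directly instead of falling back to add_to_swap().
		 */
		if (page_is_purely_volatile(page)) {
			if (try_to_discard(page) == SWAP_DISCARD) {
				count_vm_event(PGVOLATILE);
				goto discard_page;
			}
		} else {
			if (!(sc->gfp_mask & __GFP_IO))
				goto keep_locked;
			if (!add_to_swap(page))
				goto activate_locked;
			may_enter_fs = 1;
		}
	}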

>Regards,
>Wanpeng Li 
>
>> 		if (page_mapped(page) && mapping) {
>> 			switch (try_to_unmap(page, TTU_UNMAP)) {
>>+			case SWAP_DISCARD:
>>+				count_vm_event(PGVOLATILE);
>>+				goto discard_page;
>> 			case SWAP_FAIL:
>> 				goto activate_locked;
>> 			case SWAP_AGAIN:
>>@@ -857,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 			}
>> 		}
>>
>>+discard_page:
>> 		/*
>> 		 * If the page has buffers, try to free the buffer mappings
>> 		 * associated with this page. If we succeed we try to free
>>diff --git a/mm/vmstat.c b/mm/vmstat.c
>>index df7a674..410caf5 100644
>>--- a/mm/vmstat.c
>>+++ b/mm/vmstat.c
>>@@ -734,6 +734,7 @@ const char * const vmstat_text[] = {
>> 	TEXTS_FOR_ZONES("pgalloc")
>>
>> 	"pgfree",
>>+	"pgvolatile",
>> 	"pgactivate",
>> 	"pgdeactivate",
>>
>>-- 
>>1.7.9.5
>>
>>-- 
>>Kind regards,
>>Minchan Kim
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v3] Support volatile range for anon vma
       [not found]     ` <50c83d9b.49fe2a0a.57ee.ffff90b0SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2012-12-12  8:42       ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-12-12  8:42 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: John Stultz, Andrew Morton, linux-mm, linux-kernel,
	Michael Kerrisk, Arun Sharma, sanjay, Paul Turner, David Rientjes,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Wed, Dec 12, 2012 at 04:17:14PM +0800, Wanpeng Li wrote:
> On Wed, Dec 12, 2012 at 02:43:49PM +0800, Wanpeng Li wrote:
> >On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> >>Sorry, resending with fixing compile error. :(
> >>
> >>>From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001
> >>From: Minchan Kim <minchan@kernel.org>
> >>Date: Tue, 11 Dec 2012 11:38:30 +0900
> >>Subject: [RFC v3] Support volatile range for anon vma
> >>
> >>This still is [RFC v3] because just passed my simple test
> >>with TCMalloc tweaking.
> >>
> >>I hope more inputs from user-space allocator people and test patch
> >>with their allocator because it might need design change of arena
> >>management design for getting real vaule.
> >>
> >>Changelog from v2
> >>
> >> * Removing madvise(addr, length, MADV_NOVOLATILE).
> >> * add vmstat about the number of discarded volatile pages
> >> * discard volatile pages without promotion in reclaim path
> >>
> >>This is based on v3.6.
> >>
> >>- What's the madvise(addr, length, MADV_VOLATILE)?
> >>
> >>  It's a hint that user deliver to kernel so kernel can *discard*
> >>  pages in a range anytime.
> >>
> >>- What happens if user access page(ie, virtual address) discarded
> >>  by kernel?
> >>
> >>  The user can see zero-fill-on-demand pages as if madvise(DONTNEED).
> >>
> >>- What happens if user access page(ie, virtual address) doesn't
> >>  discarded by kernel?
> >>
> >>  The user can see old data without page fault.
> >>
> >>- What's different with madvise(DONTNEED)?
> >>
> >>  System call semantic
> >>
> >>  DONTNEED makes sure user always can see zero-fill pages after
> >>  he calls madvise while VOLATILE can see zero-fill pages or
> >>  old data.
> >>
> >>  Internal implementation
> >>
> >>  The madvise(DONTNEED) should zap all mapped pages in range so
> >>  overhead is increased linearly with the number of mapped pages.
> >>  Even, if user access zapped pages by write, page fault + page
> >>  allocation + memset should be happened.
> >>
> >>  The madvise(VOLATILE) should mark the flag in a range(ie, VMA).
> >>  It doesn't touch pages any more so overhead of the system call
> >>  should be very small. If memory pressure happens, VM can discard
> >>  pages in VMAs marked by VOLATILE. If user access address with
> >>  write mode by discarding by VM, he can see zero-fill pages so the
> >>  cost is same with DONTNEED but if memory pressure isn't severe,
> >>  user can see old data without (page fault + page allocation + memset)
> >>
> >>  The VOLATILE mark should be removed in page fault handler when first
> >>  page fault occur in marked vma so next page faults will follow normal
> >>  page fault path. That's why user don't need madvise(MADV_NOVOLATILE)
> >>  interface.
> >>
> >>- What's the benefit compared to DONTNEED?
> >>
> >>  1. The system call overhead is smaller because VOLATILE just marks
> >>     the flag to VMA instead of zapping all the page in a range.
> >>
> >>  2. It has a chance to eliminate overheads (ex, page fault +
> >>     page allocation + memset(PAGE_SIZE)).
> >>
> >>- Isn't there any drawback?
> >>
> >>  DONTNEED doesn't need exclusive mmap_sem locking so concurrent page
> >>  fault of other threads could be allowed. But VOLATILE needs exclusive
> >>  mmap_sem so other thread would be blocked if they try to access
> >>  not-mapped pages. That's why I designed madvise(VOLATILE)'s overhead
> >>  should be small as far as possible.
> >>
> >>  Other concern of exclusive mmap_sem is when page fault occur in
> >>  VOLATILE marked vma. We should remove the flag of vma and merge
> >>  adjacent vmas so needs exclusive mmap_sem. It can slow down page fault
> >>  handling and prevent concurrent page fault. But we need such handling
> >>  just once when page fault occur after we mark VOLATILE into VMA
> >>  only if memory pressure happpens so the page is discarded. So it wouldn't
> >>  not common so that benefit we get by this feature would be bigger than
> >>  lose.
> >>
> >>- What's for targetting?
> >>
> >>  Firstly, user-space allocator like ptmalloc, tcmalloc or heap management
> >>  of virtual machine like Dalvik. Also, it comes in handy for embedded
> >>  which doesn't have swap device so they can't reclaim anonymous pages.
> >>  By discarding instead of swap, it could be used in the non-swap system.
> >>  For it,  we have to age anon lru list although we don't have swap because
> >>  I don't want to discard volatile pages by top priority when memory pressure
> >>  happens as volatile in this patch means "We don't need to swap out because
> >>  user can handle the situation which data are disappear suddenly", NOT
> >>  "They are useless so hurry up to reclaim them". So I want to apply same
> >>  aging rule of nomal pages to them.
> >>
> >>  Anonymous page background aging of non-swap system would be a trade-off
> >>  for getting good feature. Even, we had done it two years ago until merge
> >>  [1] and I believe gain of this patch will beat loss of anon lru aging's
> >>  overead once all of allocator start to use madvise.
> >>  (This patch doesn't include background aging in case of non-swap system
> >>  but it's trivial if we decide)
> >>
> >>[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
> >>
> >>Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> >>Cc: Arun Sharma <asharma@fb.com>
> >>Cc: sanjay@google.com
> >>Cc: Paul Turner <pjt@google.com>
> >>CC: David Rientjes <rientjes@google.com>
> >>Cc: John Stultz <john.stultz@linaro.org>
> >>Cc: Andrew Morton <akpm@linux-foundation.org>
> >>Cc: Christoph Lameter <cl@linux.com>
> >>Cc: Android Kernel Team <kernel-team@android.com>
> >>Cc: Robert Love <rlove@google.com>
> >>Cc: Mel Gorman <mel@csn.ul.ie>
> >>Cc: Hugh Dickins <hughd@google.com>
> >>Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> >>Cc: Rik van Riel <riel@redhat.com>
> >>Cc: Dave Chinner <david@fromorbit.com>
> >>Cc: Neil Brown <neilb@suse.de>
> >>Cc: Mike Hommey <mh@glandium.org>
> >>Cc: Taras Glek <tglek@mozilla.com>
> >>Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> >>Cc: Christoph Lameter <cl@linux.com>
> >>Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >>Signed-off-by: Minchan Kim <minchan@kernel.org>
> >>---
> >> arch/x86/mm/fault.c               |    2 +
> >> include/asm-generic/mman-common.h |    6 ++
> >> include/linux/mm.h                |    7 ++-
> >> include/linux/rmap.h              |   20 ++++++
> >> include/linux/vm_event_item.h     |    2 +-
> >> mm/madvise.c                      |   19 +++++-
> >> mm/memory.c                       |   32 ++++++++++
> >> mm/migrate.c                      |    6 +-
> >> mm/rmap.c                         |  125 ++++++++++++++++++++++++++++++++++++-
> >> mm/vmscan.c                       |    7 +++
> >> mm/vmstat.c                       |    1 +
> >> 11 files changed, 218 insertions(+), 9 deletions(-)
> >>
> >>diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> >>index 76dcd9d..17c1c20 100644
> >>--- a/arch/x86/mm/fault.c
> >>+++ b/arch/x86/mm/fault.c
> >>@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> >> 		}
> >>
> >> 		out_of_memory(regs, error_code, address);
> >>+	} else if (fault & VM_FAULT_SIGSEG) {
> >>+			bad_area(regs, error_code, address);
> >> 	} else {
> >> 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
> >> 			     VM_FAULT_HWPOISON_LARGE))
> >>diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
> >>index d030d2c..f07781e 100644
> >>--- a/include/asm-generic/mman-common.h
> >>+++ b/include/asm-generic/mman-common.h
> >>@@ -34,6 +34,12 @@
> >> #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> >> #define MADV_WILLNEED	3		/* will need these pages */
> >> #define MADV_DONTNEED	4		/* don't need these pages */
> >>+/*
> >>+ * Unlike other flags, we need two locks to protect MADV_VOLATILE.
> >>+ * For changing the flag, we need mmap_sem's write lock and volatile_lock
> >>+ * while we just need volatile_lock in case of reading the flag.
> >>+ */
> >>+#define MADV_VOLATILE	5		/* pages will disappear suddenly */
> >>
> >> /* common parameters: try to keep these consistent across architectures */
> >> #define MADV_REMOVE	9		/* remove these pages & resources */
> >>diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>index 311be90..89027b5 100644
> >>--- a/include/linux/mm.h
> >>+++ b/include/linux/mm.h
> >>@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);
> >> #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
> >> #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
> >> #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
> >>+#define VM_VOLATILE	0x100000000	/* Pages in the vma can be discarded without swap */
> >>
> >> /* Bits set in the VMA until the stack is in its final location */
> >> #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
> >>@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);
> >>  * Special vmas that are non-mergable, non-mlock()able.
> >>  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
> >>  */
> >>-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
> >>+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)
> >>
> >> /*
> >>  * mapping from the currently active vm_flags protection bits (the
> >>@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)
> >> #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
> >> #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
> >> #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
> >>-
> >>+#define VM_FAULT_SIGSEG	0x0800	/* -> There is no vma */
> >> #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
> >>
> >> #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
> >>-			 VM_FAULT_HWPOISON_LARGE)
> >>+			 VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)
> >>
> >> /* Encode hstate index for a hwpoisoned large page */
> >> #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
> >>diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> >>index 3fce545..735d7a3 100644
> >>--- a/include/linux/rmap.h
> >>+++ b/include/linux/rmap.h
> >>@@ -67,6 +67,9 @@ struct anon_vma_chain {
> >> 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
> >> };
> >>
> >>+void volatile_lock(struct vm_area_struct *vma);
> >>+void volatile_unlock(struct vm_area_struct *vma);
> >>+
> >> #ifdef CONFIG_MMU
> >> static inline void get_anon_vma(struct anon_vma *anon_vma)
> >> {
> >>@@ -170,6 +173,7 @@ enum ttu_flags {
> >> 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> >> 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> >> 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
> >>+	TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
> >> };
> >> #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
> >>
> >>@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> >> 	return ptep;
> >> }
> >>
> >>+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
> >>+                                unsigned long, spinlock_t **);
> >>+
> >>+static inline pte_t *page_check_volatile_address(struct page *page,
> >>+                                        struct mm_struct *mm,
> >>+                                        unsigned long address,
> >>+                                        spinlock_t **ptlp)
> >>+{
> >>+        pte_t *ptep;
> >>+
> >>+        __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
> >>+                                        mm, address, ptlp));
> >>+        return ptep;
> >>+}
> >>+
> >> /*
> >>  * Used by swapoff to help locate where page is expected in vma.
> >>  */
> >>@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)
> >> #define SWAP_AGAIN	1
> >> #define SWAP_FAIL	2
> >> #define SWAP_MLOCK	3
> >>+#define SWAP_DISCARD	4
> >>
> >> #endif	/* _LINUX_RMAP_H */
> >>diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> >>index 57f7b10..3f9a40b 100644
> >>--- a/include/linux/vm_event_item.h
> >>+++ b/include/linux/vm_event_item.h
> >>@@ -23,7 +23,7 @@
> >>
> >> enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >> 		FOR_ALL_ZONES(PGALLOC),
> >>-		PGFREE, PGACTIVATE, PGDEACTIVATE,
> >>+		PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,
> >> 		PGFAULT, PGMAJFAULT,
> >> 		FOR_ALL_ZONES(PGREFILL),
> >> 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> >>diff --git a/mm/madvise.c b/mm/madvise.c
> >>index 14d260f..53a19d8 100644
> >>--- a/mm/madvise.c
> >>+++ b/mm/madvise.c
> >>@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
> >> 		if (error)
> >> 			goto out;
> >> 		break;
> >>+	case MADV_VOLATILE:
> >>+		if (vma->vm_flags & VM_LOCKED) {
> >>+			error = -EINVAL;
> >>+			goto out;
> >>+		}
> >>+		new_flags |= VM_VOLATILE;
> >>+		break;
> >> 	}
> >>
> >> 	if (new_flags == vma->vm_flags) {
> >>@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
> >> success:
> >> 	/*
> >> 	 * vm_flags is protected by the mmap_sem held in write mode.
> >>+	 * In case of MADV_VOLATILE, we need anon_vma_lock additionally.
> >> 	 */
> >>+	if (behavior == MADV_VOLATILE)
> >>+		volatile_lock(vma);
> >> 	vma->vm_flags = new_flags;
> >>-
> >>+	if (behavior == MADV_VOLATILE)
> >>+		volatile_unlock(vma);
> >> out:
> >> 	if (error == -ENOMEM)
> >> 		error = -EAGAIN;
> >>@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)
> >> #endif
> >> 	case MADV_DONTDUMP:
> >> 	case MADV_DODUMP:
> >>+	case MADV_VOLATILE:
> >> 		return 1;
> >>
> >> 	default:
> >>@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >> 		goto out;
> >> 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> >>
> >>+	if (behavior != MADV_VOLATILE)
> >>+		len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> >>+	else
> >>+		len = len_in & PAGE_MASK;
> >>+
> >> 	/* Check to see whether len was rounded up from small -ve to zero */
> >> 	if (len_in && !len)
> >> 		goto out;
> >>diff --git a/mm/memory.c b/mm/memory.c
> >>index 5736170..b5e4996 100644
> >>--- a/mm/memory.c
> >>+++ b/mm/memory.c
> >>@@ -57,6 +57,7 @@
> >> #include <linux/swapops.h>
> >> #include <linux/elf.h>
> >> #include <linux/gfp.h>
> >>+#include <linux/mempolicy.h>
> >>
> >> #include <asm/io.h>
> >> #include <asm/pgalloc.h>
> >>@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,
> >> 					return do_linear_fault(mm, vma, address,
> >> 						pte, pmd, flags, entry);
> >> 			}
> >>+			if (vma->vm_flags & VM_VOLATILE) {
> >>+				struct vm_area_struct *prev;
> >>+
> >>+				up_read(&mm->mmap_sem);
> >>+				down_write(&mm->mmap_sem);
> >>+				vma = find_vma_prev(mm, address, &prev);
> >>+
> >>+				/* Someone unmapped the vma */
> >>+				if (unlikely(!vma) || vma->vm_start > address) {
> >>+					downgrade_write(&mm->mmap_sem);
> >>+					return VM_FAULT_SIGSEG;
> >>+				}
> >>+				/* Someone else already handled it */
> >>+				if (vma->vm_flags & VM_VOLATILE) {
> >>+					/*
> >>+					 * From now on, we hold mmap_sem as
> >>+					 * exclusive.
> >>+					 */
> >>+					volatile_lock(vma);
> >>+					vma->vm_flags &= ~VM_VOLATILE;
> >>+					volatile_unlock(vma);
> >>+
> >>+					vma_merge(mm, prev, vma->vm_start,
> >>+						vma->vm_end, vma->vm_flags,
> >>+						vma->anon_vma, vma->vm_file,
> >>+						vma->vm_pgoff, vma_policy(vma));
> >>+
> >>+				}
> >>+
> >>+				downgrade_write(&mm->mmap_sem);
> >>+			}
> >> 			return do_anonymous_page(mm, vma, address,
> >> 						 pte, pmd, flags);
> >> 		}
> >>diff --git a/mm/migrate.c b/mm/migrate.c
> >>index 77ed2d7..08b009c 100644
> >>--- a/mm/migrate.c
> >>+++ b/mm/migrate.c
> >>@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >> 	}
> >>
> >> 	/* Establish migration ptes or remove ptes */
> >>-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> >>+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> >>+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
> >>
> >> skip_unmap:
> >> 	if (!page_mapped(page))
> >>@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >> 	if (PageAnon(hpage))
> >> 		anon_vma = page_get_anon_vma(hpage);
> >>
> >>-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> >>+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
> >>+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
> >>
> >> 	if (!page_mapped(hpage))
> >> 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
> >>diff --git a/mm/rmap.c b/mm/rmap.c
> >>index 0f3b7cd..1a0ab2b 100644
> >>--- a/mm/rmap.c
> >>+++ b/mm/rmap.c
> >>@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> >> 	return vma_address(page, vma);
> >> }
> >>
> >>+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
> >>+		unsigned long address, spinlock_t **ptlp)
> >>+{
> >>+	pgd_t *pgd;
> >>+	pud_t *pud;
> >>+	pmd_t *pmd;
> >>+	pte_t *pte;
> >>+	spinlock_t *ptl;
> >>+
> >>+	swp_entry_t entry = { .val = page_private(page) };
> >>+
> >>+	if (unlikely(PageHuge(page))) {
> >>+		pte = huge_pte_offset(mm, address);
> >>+		ptl = &mm->page_table_lock;
> >>+		goto check;
> >>+	}
> >>+
> >>+	pgd = pgd_offset(mm, address);
> >>+	if (!pgd_present(*pgd))
> >>+		return NULL;
> >>+
> >>+	pud = pud_offset(pgd, address);
> >>+	if (!pud_present(*pud))
> >>+		return NULL;
> >>+
> >>+	pmd = pmd_offset(pud, address);
> >>+	if (!pmd_present(*pmd))
> >>+		return NULL;
> >>+	if (pmd_trans_huge(*pmd))
> >>+		return NULL;
> >>+
> >>+	pte = pte_offset_map(pmd, address);
> >>+	ptl = pte_lockptr(mm, pmd);
> >>+check:
> >>+	spin_lock(ptl);
> >>+	if (PageAnon(page)) {
> >>+		if (!pte_present(*pte) && entry.val ==
> >>+				pte_to_swp_entry(*pte).val) {
> >>+			*ptlp = ptl;
> >>+			return pte;
> >>+		}
> >>+	} else {
> >>+		if (pte_none(*pte)) {
> >>+			*ptlp = ptl;
> >>+			return pte;
> >>+		}
> >>+	}
> >>+	pte_unmap_unlock(pte, ptl);
> >>+	return NULL;
> >>+}
> >>+
> >> /*
> >>  * Check that @page is mapped at @address into @mm.
> >>  *
> >>@@ -1218,6 +1269,35 @@ out:
> >> 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> >> }
> >>
> >>+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
> >>+                unsigned long address)
> >>+{
> >>+        struct mm_struct *mm = vma->vm_mm;
> >>+        pte_t *pte;
> >>+        pte_t pteval;
> >>+        spinlock_t *ptl;
> >>+
> >>+        pte = page_check_volatile_address(page, mm, address, &ptl);
> >>+        if (!pte)
> >>+                return 0;
> >>+
> >>+        /* Nuke the page table entry. */
> >>+        flush_cache_page(vma, address, page_to_pfn(page));
> >>+        pteval = ptep_clear_flush(vma, address, pte);
> >>+
> >>+        if (PageAnon(page)) {
> >>+                swp_entry_t entry = { .val = page_private(page) };
> >>+                if (PageSwapCache(page)) {
> >>+                        dec_mm_counter(mm, MM_SWAPENTS);
> >>+                        swap_free(entry);
> >>+                }
> >>+        }
> >>+
> >>+        pte_unmap_unlock(pte, ptl);
> >>+        mmu_notifier_invalidate_page(mm, address);
> >>+        return 1;
> >>+}
> >>+
> >> /*
> >>  * Subfunctions of try_to_unmap: try_to_unmap_one called
> >>  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
> >>@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
> >> 	struct anon_vma *anon_vma;
> >> 	struct anon_vma_chain *avc;
> >> 	int ret = SWAP_AGAIN;
> >>+	bool is_volatile = true;
> >>+
> >>+	if (flags & TTU_IGNORE_VOLATILE)
> >>+		is_volatile = false;
> >>
> >> 	anon_vma = page_lock_anon_vma(page);
> >> 	if (!anon_vma)
> >>@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
> >> 		 * temporary VMAs until after exec() completes.
> >> 		 */
> >> 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
> >>-				is_vma_temporary_stack(vma))
> >>+				is_vma_temporary_stack(vma)) {
> >>+			is_volatile = false;
> >> 			continue;
> >>+		}
> >>
> >> 		address = vma_address(page, vma);
> >> 		if (address == -EFAULT)
> >> 			continue;
> >>+                /*
> >>+                 * A volatile page will only be purged if ALL vmas
> >>+		 * pointing to it are VM_VOLATILE.
> >>+                 */
> >>+                if (!(vma->vm_flags & VM_VOLATILE))
> >>+                        is_volatile = false;
> >>+
> >> 		ret = try_to_unmap_one(page, vma, address, flags);
> >> 		if (ret != SWAP_AGAIN || !page_mapped(page))
> >> 			break;
> >> 	}
> >>
> >>+        if (page_mapped(page) || is_volatile == false)
> >>+                goto out;
> >>+
> >>+        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> >>+                struct vm_area_struct *vma = avc->vma;
> >>+                unsigned long address;
> >>+
> >>+                address = vma_address(page, vma);
> >>+                try_to_zap_one(page, vma, address);
> >>+        }
> >>+        /* We're throwing this page out, so mark it clean */
> >>+        ClearPageDirty(page);
> >>+        ret = SWAP_DISCARD;
> >>+out:
> >> 	page_unlock_anon_vma(anon_vma);
> >> 	return ret;
> >> }
> >>@@ -1651,6 +1758,7 @@ out:
> >>  * SWAP_AGAIN	- we missed a mapping, try again later
> >>  * SWAP_FAIL	- the page is unswappable
> >>  * SWAP_MLOCK	- page is mlocked.
> >>+ * SWAP_DISCARD - page is volatile.
> >>  */
> >> int try_to_unmap(struct page *page, enum ttu_flags flags)
> >> {
> >>@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
> >> 		ret = try_to_unmap_anon(page, flags);
> >> 	else
> >> 		ret = try_to_unmap_file(page, flags);
> >>-	if (ret != SWAP_MLOCK && !page_mapped(page))
> >>+	if (ret != SWAP_MLOCK && !page_mapped(page) &&
> >>+					ret != SWAP_DISCARD)
> >> 		ret = SWAP_SUCCESS;
> >> 	return ret;
> >> }
> >>@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
> >> 	anon_vma_free(anon_vma);
> >> }
> >>
> >>+void volatile_lock(struct vm_area_struct *vma)
> >>+{
> >>+        if (vma->anon_vma)
> >>+                anon_vma_lock(vma->anon_vma);
> >>+}
> >>+
> >>+void volatile_unlock(struct vm_area_struct *vma)
> >>+{
> >>+        if (vma->anon_vma)
> >>+                anon_vma_unlock(vma->anon_vma);
> >>+}
> >>+
> >> #ifdef CONFIG_MIGRATION
> >> /*
> >>  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
> >>diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>index 99b434b..4e463a4 100644
> >>--- a/mm/vmscan.c
> >>+++ b/mm/vmscan.c
> >>@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,
> >> 	if (vm_flags & VM_LOCKED)
> >> 		return PAGEREF_RECLAIM;
> >>
> >>+	if (vm_flags & VM_VOLATILE)
> >>+		return PAGEREF_RECLAIM;
> >>+
> >> 	if (referenced_ptes) {
> >> 		if (PageSwapBacked(page))
> >> 			return PAGEREF_ACTIVATE;
> >>@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >> 		 */
> >
> >Hi Minchan,
> >
> >IIUC, the anonymous page has already been added to the swapcache via add_to_swap
> >called by shrink_page_list, but I can't figure out where you remove it from the swapcache.
> >
> 
> Yeah, that is all done in shrink_page_list. I mean, can you avoid the process of
> adding to the swapcache and removing from the swapcache, since your idea doesn't need swapout?

It's not that simple, because we need a swap entry for unmapping, to store in the PTEs,
since we can't know in advance whether the page is volatile or not. In addition, we have
to put the page into the swapcache before putting the entry into the PTEs, so that we can
handle a race with another fault doing swapin.
So if it's not a severe problem, I'd like to leave it as it is, which keeps things simple.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-12-12  8:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-11  2:34 [RFC v3] Support volatile range for anon vma Minchan Kim
2012-12-11  2:41 ` Minchan Kim
2012-12-11  7:17   ` Mike Hommey
2012-12-11  7:37     ` Minchan Kim
2012-12-11  7:59       ` Mike Hommey
2012-12-11  8:11         ` Minchan Kim
2012-12-11  8:29           ` Mike Hommey
2012-12-11  8:45             ` Minchan Kim
2012-12-12  6:43   ` Wanpeng Li
2012-12-12  8:17     ` Wanpeng Li
2012-12-12  8:17     ` Wanpeng Li
     [not found]     ` <50c83d9b.49fe2a0a.57ee.ffff90b0SMTPIN_ADDED_BROKEN@mx.google.com>
2012-12-12  8:42       ` Minchan Kim
2012-12-12  6:43   ` Wanpeng Li
     [not found]   ` <50c827cb.ce98320a.7d38.ffffad3fSMTPIN_ADDED_BROKEN@mx.google.com>
2012-12-12  8:15     ` Minchan Kim
2012-12-11 18:45 ` John Stultz
2012-12-11 23:21   ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).