diff for duplicates of <20121211024104.GA10523@blaptop>

diff --git a/a/1.txt b/N1/1.txt
index e1fe1be..ef3279e 100644
--- a/a/1.txt
+++ b/N1/1.txt
@@ -1 +1,642 @@
 Sorry, resending with fixing compile error. :(
+
+>From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001
+From: Minchan Kim <minchan@kernel.org>
+Date: Tue, 11 Dec 2012 11:38:30 +0900
+Subject: [RFC v3] Support volatile range for anon vma
+
+This is still an RFC (v3) because it has only passed my simple test
+with a tweaked TCMalloc.
+
+I'd like more input from user-space allocator people, and I hope they
+test the patch with their allocators, because getting real value from
+it might require changes to their arena management design.
+
+Changelog from v2
+
+ * Remove madvise(addr, length, MADV_NOVOLATILE).
+ * Add a vmstat counter for the number of discarded volatile pages
+ * Discard volatile pages without promotion in the reclaim path
+
+This is based on v3.6.
+
+- What is madvise(addr, length, MADV_VOLATILE)?
+
+  It's a hint from the user to the kernel that the kernel may *discard*
+  pages in the range at any time.
+
+- What happens if the user accesses a page (i.e., a virtual address)
+  discarded by the kernel?
+
+  The user sees zero-fill-on-demand pages, as with madvise(DONTNEED).
+
+- What happens if the user accesses a page (i.e., a virtual address)
+  not discarded by the kernel?
+
+  The user sees the old data without a page fault.
+
+- How does it differ from madvise(DONTNEED)?
+
+  System call semantics
+
+  DONTNEED guarantees that the user always sees zero-fill pages after
+  calling madvise, while with VOLATILE the user may see either
+  zero-fill pages or the old data.
+
+  Internal implementation
+
+  madvise(DONTNEED) has to zap all mapped pages in the range, so its
+  overhead grows linearly with the number of mapped pages.
+  Moreover, if the user then writes to a zapped page, a page fault +
+  page allocation + memset have to happen.
+
+  madvise(VOLATILE) only sets a flag on the range (i.e., the VMA).
+  It doesn't touch the pages at all, so the overhead of the system call
+  should be very small. If memory pressure occurs, the VM can discard
+  pages in VMAs marked VOLATILE. If the user writes to an address whose
+  page the VM has discarded, the user sees a zero-fill page, so the
+  cost is the same as DONTNEED, but if memory pressure isn't severe,
+  the user sees the old data without (page fault + page allocation + memset).
+
+  The VOLATILE mark is removed in the page fault handler when the first
+  page fault occurs in a marked VMA, so later page faults follow the normal
+  page fault path. That's why the user doesn't need a
+  madvise(MADV_NOVOLATILE) interface.
+
+- What's the benefit compared to DONTNEED?
+
+  1. The system call overhead is smaller because VOLATILE just sets
+     a flag on the VMA instead of zapping all the pages in the range.
+
+  2. It has a chance to eliminate overheads (e.g., page fault +
+     page allocation + memset(PAGE_SIZE)).
+
+- Are there any drawbacks?
+
+  DONTNEED doesn't need exclusive mmap_sem locking, so concurrent page
+  faults from other threads are allowed. VOLATILE needs mmap_sem held
+  exclusively, so other threads are blocked if they try to access
+  not-mapped pages. That's why I designed madvise(VOLATILE) so that its
+  overhead is as small as possible.
+
+  The other concern with exclusive mmap_sem is a page fault in a
+  VOLATILE-marked VMA. We have to clear the VMA's flag and merge
+  adjacent VMAs, which needs exclusive mmap_sem. That can slow down page
+  fault handling and prevent concurrent page faults. But such handling is
+  needed just once, when a page fault occurs after we mark the VMA
+  VOLATILE, and only if memory pressure has caused the page to be
+  discarded. So it shouldn't be common, and the benefit of this feature
+  should outweigh the loss.
+
+- What are the target users?
+
+  First, user-space allocators like ptmalloc and tcmalloc, or the heap
+  management of a virtual machine like Dalvik. It also comes in handy for
+  embedded systems that have no swap device and so can't reclaim anonymous
+  pages. By discarding instead of swapping, it can be used on swapless
+  systems. For that, we have to age the anon LRU list even without swap,
+  because I don't want volatile pages to be discarded first when memory
+  pressure occurs: volatile in this patch means "We don't need to swap out
+  because the user can handle data disappearing suddenly", NOT
+  "They are useless so hurry up and reclaim them". So I want to apply the
+  same aging rule to them as to normal pages.
+
+  Background aging of anonymous pages on swapless systems would be a
+  trade-off for getting a good feature. In fact, we did that until [1] was
+  merged two years ago, and I believe the gain from this patch will beat
+  the cost of anon LRU aging once all allocators start to use madvise.
+  (This patch doesn't include background aging for swapless systems,
+  but it's trivial to add if we decide to.)
+
+[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
+
+Cc: Michael Kerrisk <mtk.manpages@gmail.com>
+Cc: Arun Sharma <asharma@fb.com>
+Cc: sanjay@google.com
+Cc: Paul Turner <pjt@google.com>
+CC: David Rientjes <rientjes@google.com>
+Cc: John Stultz <john.stultz@linaro.org>
+Cc: Andrew Morton <akpm@linux-foundation.org>
+Cc: Christoph Lameter <cl@linux.com>
+Cc: Android Kernel Team <kernel-team@android.com>
+Cc: Robert Love <rlove@google.com>
+Cc: Mel Gorman <mel@csn.ul.ie>
+Cc: Hugh Dickins <hughd@google.com>
+Cc: Dave Hansen <dave@linux.vnet.ibm.com>
+Cc: Rik van Riel <riel@redhat.com>
+Cc: Dave Chinner <david@fromorbit.com>
+Cc: Neil Brown <neilb@suse.de>
+Cc: Mike Hommey <mh@glandium.org>
+Cc: Taras Glek <tglek@mozilla.com>
+Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
+Cc: Christoph Lameter <cl@linux.com>
+Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
+Signed-off-by: Minchan Kim <minchan@kernel.org>
+---
+ arch/x86/mm/fault.c               |    2 +
+ include/asm-generic/mman-common.h |    6 ++
+ include/linux/mm.h                |    7 ++-
+ include/linux/rmap.h              |   20 ++++++
+ include/linux/vm_event_item.h     |    2 +-
+ mm/madvise.c                      |   19 +++++-
+ mm/memory.c                       |   32 ++++++++++
+ mm/migrate.c                      |    6 +-
+ mm/rmap.c                         |  125 ++++++++++++++++++++++++++++++++++++-
+ mm/vmscan.c                       |    7 +++
+ mm/vmstat.c                       |    1 +
+ 11 files changed, 218 insertions(+), 9 deletions(-)
+
+diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
+index 76dcd9d..17c1c20 100644
+--- a/arch/x86/mm/fault.c
++++ b/arch/x86/mm/fault.c
+@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
+ 		}
+ 
+ 		out_of_memory(regs, error_code, address);
++	} else if (fault & VM_FAULT_SIGSEG) {
++			bad_area(regs, error_code, address);
+ 	} else {
+ 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
+ 			     VM_FAULT_HWPOISON_LARGE))
+diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
+index d030d2c..f07781e 100644
+--- a/include/asm-generic/mman-common.h
++++ b/include/asm-generic/mman-common.h
+@@ -34,6 +34,12 @@
+ #define MADV_SEQUENTIAL	2		/* expect sequential page references */
+ #define MADV_WILLNEED	3		/* will need these pages */
+ #define MADV_DONTNEED	4		/* don't need these pages */
++/*
++ * Unlike other flags, we need two locks to protect MADV_VOLATILE.
++ * For changing the flag, we need mmap_sem's write lock and volatile_lock
++ * while we just need volatile_lock in case of reading the flag.
++ */
++#define MADV_VOLATILE	5		/* pages will disappear suddenly */
+ 
+ /* common parameters: try to keep these consistent across architectures */
+ #define MADV_REMOVE	9		/* remove these pages & resources */
+diff --git a/include/linux/mm.h b/include/linux/mm.h
+index 311be90..89027b5 100644
+--- a/include/linux/mm.h
++++ b/include/linux/mm.h
+@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);
+ #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
+ #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
+ #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_VOLATILE	0x100000000	/* Pages in the vma can be discarded without swap */
+ 
+ /* Bits set in the VMA until the stack is in its final location */
+ #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
+@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);
+  * Special vmas that are non-mergable, non-mlock()able.
+  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
+  */
+-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
++#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)
+ 
+ /*
+  * mapping from the currently active vm_flags protection bits (the
+@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)
+ #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
+ #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
+ #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
+-
++#define VM_FAULT_SIGSEG	0x0800	/* -> There is no vma */
+ #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
+ 
+ #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
+-			 VM_FAULT_HWPOISON_LARGE)
++			 VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)
+ 
+ /* Encode hstate index for a hwpoisoned large page */
+ #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
+diff --git a/include/linux/rmap.h b/include/linux/rmap.h
+index 3fce545..735d7a3 100644
+--- a/include/linux/rmap.h
++++ b/include/linux/rmap.h
+@@ -67,6 +67,9 @@ struct anon_vma_chain {
+ 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
+ };
+ 
++void volatile_lock(struct vm_area_struct *vma);
++void volatile_unlock(struct vm_area_struct *vma);
++
+ #ifdef CONFIG_MMU
+ static inline void get_anon_vma(struct anon_vma *anon_vma)
+ {
+@@ -170,6 +173,7 @@ enum ttu_flags {
+ 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
+ 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+ 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
++	TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
+ };
+ #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+ 
+@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
+ 	return ptep;
+ }
+ 
++pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
++                                unsigned long, spinlock_t **);
++
++static inline pte_t *page_check_volatile_address(struct page *page,
++                                        struct mm_struct *mm,
++                                        unsigned long address,
++                                        spinlock_t **ptlp)
++{
++        pte_t *ptep;
++
++        __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
++                                        mm, address, ptlp));
++        return ptep;
++}
++
+ /*
+  * Used by swapoff to help locate where page is expected in vma.
+  */
+@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)
+ #define SWAP_AGAIN	1
+ #define SWAP_FAIL	2
+ #define SWAP_MLOCK	3
++#define SWAP_DISCARD	4
+ 
+ #endif	/* _LINUX_RMAP_H */
+diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
+index 57f7b10..3f9a40b 100644
+--- a/include/linux/vm_event_item.h
++++ b/include/linux/vm_event_item.h
+@@ -23,7 +23,7 @@
+ 
+ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
+ 		FOR_ALL_ZONES(PGALLOC),
+-		PGFREE, PGACTIVATE, PGDEACTIVATE,
++		PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,
+ 		PGFAULT, PGMAJFAULT,
+ 		FOR_ALL_ZONES(PGREFILL),
+ 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
+diff --git a/mm/madvise.c b/mm/madvise.c
+index 14d260f..53a19d8 100644
+--- a/mm/madvise.c
++++ b/mm/madvise.c
+@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
+ 		if (error)
+ 			goto out;
+ 		break;
++	case MADV_VOLATILE:
++		if (vma->vm_flags & VM_LOCKED) {
++			error = -EINVAL;
++			goto out;
++		}
++		new_flags |= VM_VOLATILE;
++		break;
+ 	}
+ 
+ 	if (new_flags == vma->vm_flags) {
+@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
+ success:
+ 	/*
+ 	 * vm_flags is protected by the mmap_sem held in write mode.
+	 * In case of MADV_VOLATILE, we additionally need anon_vma_lock.
+ 	 */
++	if (behavior == MADV_VOLATILE)
++		volatile_lock(vma);
+ 	vma->vm_flags = new_flags;
+-
++	if (behavior == MADV_VOLATILE)
++		volatile_unlock(vma);
+ out:
+ 	if (error == -ENOMEM)
+ 		error = -EAGAIN;
+@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)
+ #endif
+ 	case MADV_DONTDUMP:
+ 	case MADV_DODUMP:
++	case MADV_VOLATILE:
+ 		return 1;
+ 
+ 	default:
+@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+ 		goto out;
+ 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+ 
++	if (behavior != MADV_VOLATILE)
++		len = (len_in + ~PAGE_MASK) & PAGE_MASK;
++	else
++		len = len_in & PAGE_MASK;
++
+ 	/* Check to see whether len was rounded up from small -ve to zero */
+ 	if (len_in && !len)
+ 		goto out;
+diff --git a/mm/memory.c b/mm/memory.c
+index 5736170..b5e4996 100644
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -57,6 +57,7 @@
+ #include <linux/swapops.h>
+ #include <linux/elf.h>
+ #include <linux/gfp.h>
++#include <linux/mempolicy.h>
+ 
+ #include <asm/io.h>
+ #include <asm/pgalloc.h>
+@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,
+ 					return do_linear_fault(mm, vma, address,
+ 						pte, pmd, flags, entry);
+ 			}
++			if (vma->vm_flags & VM_VOLATILE) {
++				struct vm_area_struct *prev;
++
++				up_read(&mm->mmap_sem);
++				down_write(&mm->mmap_sem);
++				vma = find_vma_prev(mm, address, &prev);
++
+				/* Someone unmapped the vma */
++				if (unlikely(!vma) || vma->vm_start > address) {
++					downgrade_write(&mm->mmap_sem);
++					return VM_FAULT_SIGSEG;
++				}
+				/* Nobody else has handled it yet */
++				if (vma->vm_flags & VM_VOLATILE) {
++					/*
++					 * From now on, we hold mmap_sem as
++					 * exclusive.
++					 */
++					volatile_lock(vma);
++					vma->vm_flags &= ~VM_VOLATILE;
++					volatile_unlock(vma);
++
++					vma_merge(mm, prev, vma->vm_start,
++						vma->vm_end, vma->vm_flags,
++						vma->anon_vma, vma->vm_file,
++						vma->vm_pgoff, vma_policy(vma));
++
++				}
++
++				downgrade_write(&mm->mmap_sem);
++			}
+ 			return do_anonymous_page(mm, vma, address,
+ 						 pte, pmd, flags);
+ 		}
+diff --git a/mm/migrate.c b/mm/migrate.c
+index 77ed2d7..08b009c 100644
+--- a/mm/migrate.c
++++ b/mm/migrate.c
+@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
+ 	}
+ 
+ 	/* Establish migration ptes or remove ptes */
+-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
++	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
++				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
+ 
+ skip_unmap:
+ 	if (!page_mapped(page))
+@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
+ 	if (PageAnon(hpage))
+ 		anon_vma = page_get_anon_vma(hpage);
+ 
+-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
++	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
++				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
+ 
+ 	if (!page_mapped(hpage))
+ 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
+diff --git a/mm/rmap.c b/mm/rmap.c
+index 0f3b7cd..1a0ab2b 100644
+--- a/mm/rmap.c
++++ b/mm/rmap.c
+@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
+ 	return vma_address(page, vma);
+ }
+ 
++pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
++		unsigned long address, spinlock_t **ptlp)
++{
++	pgd_t *pgd;
++	pud_t *pud;
++	pmd_t *pmd;
++	pte_t *pte;
++	spinlock_t *ptl;
++
++	swp_entry_t entry = { .val = page_private(page) };
++
++	if (unlikely(PageHuge(page))) {
++		pte = huge_pte_offset(mm, address);
++		ptl = &mm->page_table_lock;
++		goto check;
++	}
++
++	pgd = pgd_offset(mm, address);
++	if (!pgd_present(*pgd))
++		return NULL;
++
++	pud = pud_offset(pgd, address);
++	if (!pud_present(*pud))
++		return NULL;
++
++	pmd = pmd_offset(pud, address);
++	if (!pmd_present(*pmd))
++		return NULL;
++	if (pmd_trans_huge(*pmd))
++		return NULL;
++
++	pte = pte_offset_map(pmd, address);
++	ptl = pte_lockptr(mm, pmd);
++check:
++	spin_lock(ptl);
++	if (PageAnon(page)) {
++		if (!pte_present(*pte) && entry.val ==
++				pte_to_swp_entry(*pte).val) {
++			*ptlp = ptl;
++			return pte;
++		}
++	} else {
++		if (pte_none(*pte)) {
++			*ptlp = ptl;
++			return pte;
++		}
++	}
++	pte_unmap_unlock(pte, ptl);
++	return NULL;
++}
++
+ /*
+  * Check that @page is mapped at @address into @mm.
+  *
+@@ -1218,6 +1269,35 @@ out:
+ 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+ }
+ 
++int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
++                unsigned long address)
++{
++        struct mm_struct *mm = vma->vm_mm;
++        pte_t *pte;
++        pte_t pteval;
++        spinlock_t *ptl;
++
++        pte = page_check_volatile_address(page, mm, address, &ptl);
++        if (!pte)
++                return 0;
++
++        /* Nuke the page table entry. */
++        flush_cache_page(vma, address, page_to_pfn(page));
++        pteval = ptep_clear_flush(vma, address, pte);
++
++        if (PageAnon(page)) {
++                swp_entry_t entry = { .val = page_private(page) };
++                if (PageSwapCache(page)) {
++                        dec_mm_counter(mm, MM_SWAPENTS);
++                        swap_free(entry);
++                }
++        }
++
++        pte_unmap_unlock(pte, ptl);
++        mmu_notifier_invalidate_page(mm, address);
++        return 1;
++}
++
+ /*
+  * Subfunctions of try_to_unmap: try_to_unmap_one called
+  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
+@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
+ 	struct anon_vma *anon_vma;
+ 	struct anon_vma_chain *avc;
+ 	int ret = SWAP_AGAIN;
++	bool is_volatile = true;
++
++	if (flags & TTU_IGNORE_VOLATILE)
++		is_volatile = false;
+ 
+ 	anon_vma = page_lock_anon_vma(page);
+ 	if (!anon_vma)
+@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
+ 		 * temporary VMAs until after exec() completes.
+ 		 */
+ 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
+-				is_vma_temporary_stack(vma))
++				is_vma_temporary_stack(vma)) {
++			is_volatile = false;
+ 			continue;
++		}
+ 
+ 		address = vma_address(page, vma);
+ 		if (address == -EFAULT)
+ 			continue;
++                /*
++                 * A volatile page will only be purged if ALL vmas
++		 * pointing to it are VM_VOLATILE.
++                 */
++                if (!(vma->vm_flags & VM_VOLATILE))
++                        is_volatile = false;
++
+ 		ret = try_to_unmap_one(page, vma, address, flags);
+ 		if (ret != SWAP_AGAIN || !page_mapped(page))
+ 			break;
+ 	}
+ 
++        if (page_mapped(page) || is_volatile == false)
++                goto out;
++
++        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
++                struct vm_area_struct *vma = avc->vma;
++                unsigned long address;
++
++                address = vma_address(page, vma);
++                try_to_zap_one(page, vma, address);
++        }
++        /* We're throwing this page out, so mark it clean */
++        ClearPageDirty(page);
++        ret = SWAP_DISCARD;
++out:
+ 	page_unlock_anon_vma(anon_vma);
+ 	return ret;
+ }
+@@ -1651,6 +1758,7 @@ out:
+  * SWAP_AGAIN	- we missed a mapping, try again later
+  * SWAP_FAIL	- the page is unswappable
+  * SWAP_MLOCK	- page is mlocked.
++ * SWAP_DISCARD - page is volatile.
+  */
+ int try_to_unmap(struct page *page, enum ttu_flags flags)
+ {
+@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
+ 		ret = try_to_unmap_anon(page, flags);
+ 	else
+ 		ret = try_to_unmap_file(page, flags);
+-	if (ret != SWAP_MLOCK && !page_mapped(page))
++	if (ret != SWAP_MLOCK && !page_mapped(page) &&
++					ret != SWAP_DISCARD)
+ 		ret = SWAP_SUCCESS;
+ 	return ret;
+ }
+@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
+ 	anon_vma_free(anon_vma);
+ }
+ 
++void volatile_lock(struct vm_area_struct *vma)
++{
++        if (vma->anon_vma)
++                anon_vma_lock(vma->anon_vma);
++}
++
++void volatile_unlock(struct vm_area_struct *vma)
++{
++        if (vma->anon_vma)
++                anon_vma_unlock(vma->anon_vma);
++}
++
+ #ifdef CONFIG_MIGRATION
+ /*
+  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
+diff --git a/mm/vmscan.c b/mm/vmscan.c
+index 99b434b..4e463a4 100644
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,
+ 	if (vm_flags & VM_LOCKED)
+ 		return PAGEREF_RECLAIM;
+ 
++	if (vm_flags & VM_VOLATILE)
++		return PAGEREF_RECLAIM;
++
+ 	if (referenced_ptes) {
+ 		if (PageSwapBacked(page))
+ 			return PAGEREF_ACTIVATE;
+@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
+ 		 */
+ 		if (page_mapped(page) && mapping) {
+ 			switch (try_to_unmap(page, TTU_UNMAP)) {
++			case SWAP_DISCARD:
++				count_vm_event(PGVOLATILE);
++				goto discard_page;
+ 			case SWAP_FAIL:
+ 				goto activate_locked;
+ 			case SWAP_AGAIN:
+@@ -857,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
+ 			}
+ 		}
+ 
++discard_page:
+ 		/*
+ 		 * If the page has buffers, try to free the buffer mappings
+ 		 * associated with this page. If we succeed we try to free
+diff --git a/mm/vmstat.c b/mm/vmstat.c
+index df7a674..410caf5 100644
+--- a/mm/vmstat.c
++++ b/mm/vmstat.c
+@@ -734,6 +734,7 @@ const char * const vmstat_text[] = {
+ 	TEXTS_FOR_ZONES("pgalloc")
+ 
+ 	"pgfree",
++	"pgvolatile",
+ 	"pgactivate",
+ 	"pgdeactivate",
+ 
+-- 
+1.7.9.5
+
+-- 
+Kind regards,
+Minchan Kim
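A minimal userspace sketch of how an allocator might use the madvise(addr, length, MADV_VOLATILE) hint described in the message above. Several things here are assumptions rather than part of the patch: MADV_VOLATILE and its value 5 come from this RFC only, so stock headers are not expected to define it and a kernel without the patch will most likely reject the call with EINVAL; inner_page_range() and mark_volatile() are hypothetical helper names. Because the patch rounds the length down to a page boundary for MADV_VOLATILE, and madvise() requires a page-aligned start address, the sketch first shrinks a free span to the whole pages it fully contains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: value proposed by this RFC; stock headers do not define it. */
#ifndef MADV_VOLATILE
#define MADV_VOLATILE 5
#endif

/*
 * Hypothetical helper: shrink [addr, addr + len) to the whole pages it
 * fully contains. Returns the length of that inner range (0 if the span
 * holds no complete page) and stores its start in *start.
 * page_size must be a power of two; address overflow is not handled.
 */
static size_t inner_page_range(uintptr_t addr, size_t len, size_t page_size,
			       uintptr_t *start)
{
	uintptr_t mask = (uintptr_t)page_size - 1;
	uintptr_t first, end;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	first = (addr + mask) & ~mask;	/* round the start up */
	end = (addr + len) & ~mask;	/* round the end down */
	*start = first;
	return end > first ? (size_t)(end - first) : 0;
}

/*
 * Hypothetical helper: hint that a free span may be discarded under
 * memory pressure. Returns 0 on success or when the span contains no
 * complete page, and -1 with errno set otherwise (for example EINVAL
 * on a kernel that does not know MADV_VOLATILE).
 */
static int mark_volatile(void *addr, size_t len, size_t page_size)
{
	uintptr_t start;
	size_t inner = inner_page_range((uintptr_t)addr, len, page_size, &start);

	if (inner == 0)
		return 0;
	return madvise((void *)start, inner, MADV_VOLATILE);
}
```

An allocator could call mark_volatile() when it returns a span to its free list. When it reuses the span it has to treat the contents as undefined, since each page may hold either the old data or zeroes.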
diff --git a/a/content_digest b/N1/content_digest
index bbbda7f..711ae86 100644
--- a/a/content_digest
+++ b/N1/content_digest
@@ -26,6 +26,647 @@
  " KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>\0"
  "\00:1\0"
  "b\0"
- Sorry, resending with fixing compile error. :(
+ "Sorry, resending with fixing compile error. :(\n"
+ "\n"
+ ">From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001\n"
+ "From: Minchan Kim <minchan@kernel.org>\n"
+ "Date: Tue, 11 Dec 2012 11:38:30 +0900\n"
+ "Subject: [RFC v3] Support volatile range for anon vma\n"
+ "\n"
+ "This is still an RFC (v3) because it has only passed my simple test\n"
+ "with a tweaked TCMalloc.\n"
+ "\n"
+ "I'd like more input from user-space allocator people, and I hope they\n"
+ "test the patch with their allocators, because getting real value from\n"
+ "it might require changes to their arena management design.\n"
+ "\n"
+ "Changelog from v2\n"
+ "\n"
+ " * Remove madvise(addr, length, MADV_NOVOLATILE).\n"
+ " * Add a vmstat counter for the number of discarded volatile pages\n"
+ " * Discard volatile pages without promotion in the reclaim path\n"
+ "\n"
+ "This is based on v3.6.\n"
+ "\n"
+ "- What is madvise(addr, length, MADV_VOLATILE)?\n"
+ "\n"
+ "  It's a hint from the user to the kernel that the kernel may *discard*\n"
+ "  pages in the range at any time.\n"
+ "\n"
+ "- What happens if the user accesses a page (i.e., a virtual address)\n"
+ "  discarded by the kernel?\n"
+ "\n"
+ "  The user sees zero-fill-on-demand pages, as with madvise(DONTNEED).\n"
+ "\n"
+ "- What happens if the user accesses a page (i.e., a virtual address)\n"
+ "  not discarded by the kernel?\n"
+ "\n"
+ "  The user sees the old data without a page fault.\n"
+ "\n"
+ "- How does it differ from madvise(DONTNEED)?\n"
+ "\n"
+ "  System call semantics\n"
+ "\n"
+ "  DONTNEED guarantees that the user always sees zero-fill pages after\n"
+ "  calling madvise, while with VOLATILE the user may see either\n"
+ "  zero-fill pages or the old data.\n"
+ "\n"
+ "  Internal implementation\n"
+ "\n"
+ "  madvise(DONTNEED) has to zap all mapped pages in the range, so its\n"
+ "  overhead grows linearly with the number of mapped pages.\n"
+ "  Moreover, if the user then writes to a zapped page, a page fault +\n"
+ "  page allocation + memset have to happen.\n"
+ "\n"
+ "  madvise(VOLATILE) only sets a flag on the range (i.e., the VMA).\n"
+ "  It doesn't touch the pages at all, so the overhead of the system call\n"
+ "  should be very small. If memory pressure occurs, the VM can discard\n"
+ "  pages in VMAs marked VOLATILE. If the user writes to an address whose\n"
+ "  page the VM has discarded, the user sees a zero-fill page, so the\n"
+ "  cost is the same as DONTNEED, but if memory pressure isn't severe,\n"
+ "  the user sees the old data without (page fault + page allocation + memset).\n"
+ "\n"
+ "  The VOLATILE mark is removed in the page fault handler when the first\n"
+ "  page fault occurs in a marked VMA, so later page faults follow the normal\n"
+ "  page fault path. That's why the user doesn't need a\n"
+ "  madvise(MADV_NOVOLATILE) interface.\n"
+ "\n"
+ "- What's the benefit compared to DONTNEED?\n"
+ "\n"
+ "  1. The system call overhead is smaller because VOLATILE just sets\n"
+ "     a flag on the VMA instead of zapping all the pages in the range.\n"
+ "\n"
+ "  2. It has a chance to eliminate overheads (e.g., page fault +\n"
+ "     page allocation + memset(PAGE_SIZE)).\n"
+ "\n"
+ "- Are there any drawbacks?\n"
+ "\n"
+ "  DONTNEED doesn't need exclusive mmap_sem locking, so concurrent page\n"
+ "  faults from other threads are allowed. VOLATILE needs mmap_sem held\n"
+ "  exclusively, so other threads are blocked if they try to access\n"
+ "  not-mapped pages. That's why I designed madvise(VOLATILE) so that its\n"
+ "  overhead is as small as possible.\n"
+ "\n"
+ "  The other concern with exclusive mmap_sem is a page fault in a\n"
+ "  VOLATILE-marked VMA. We have to clear the VMA's flag and merge\n"
+ "  adjacent VMAs, which needs exclusive mmap_sem. That can slow down page\n"
+ "  fault handling and prevent concurrent page faults. But such handling is\n"
+ "  needed just once, when a page fault occurs after we mark the VMA\n"
+ "  VOLATILE, and only if memory pressure has caused the page to be\n"
+ "  discarded. So it shouldn't be common, and the benefit of this feature\n"
+ "  should outweigh the loss.\n"
+ "\n"
+ "- What are the target users?\n"
+ "\n"
+ "  First, user-space allocators like ptmalloc and tcmalloc, or the heap\n"
+ "  management of a virtual machine like Dalvik. It also comes in handy for\n"
+ "  embedded systems that have no swap device and so can't reclaim anonymous\n"
+ "  pages. By discarding instead of swapping, it can be used on swapless\n"
+ "  systems. For that, we have to age the anon LRU list even without swap,\n"
+ "  because I don't want volatile pages to be discarded first when memory\n"
+ "  pressure occurs: volatile in this patch means \"We don't need to swap out\n"
+ "  because the user can handle data disappearing suddenly\", NOT\n"
+ "  \"They are useless so hurry up and reclaim them\". So I want to apply the\n"
+ "  same aging rule to them as to normal pages.\n"
+ "\n"
+ "  Background aging of anonymous pages on swapless systems would be a\n"
+ "  trade-off for getting a good feature. In fact, we did that until [1] was\n"
+ "  merged two years ago, and I believe the gain from this patch will beat\n"
+ "  the cost of anon LRU aging once all allocators start to use madvise.\n"
+ "  (This patch doesn't include background aging for swapless systems,\n"
+ "  but it's trivial to add if we decide to.)\n"
+ "\n"
+ "[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system\n"
+ "\n"
+ "Cc: Michael Kerrisk <mtk.manpages@gmail.com>\n"
+ "Cc: Arun Sharma <asharma@fb.com>\n"
+ "Cc: sanjay@google.com\n"
+ "Cc: Paul Turner <pjt@google.com>\n"
+ "CC: David Rientjes <rientjes@google.com>\n"
+ "Cc: John Stultz <john.stultz@linaro.org>\n"
+ "Cc: Andrew Morton <akpm@linux-foundation.org>\n"
+ "Cc: Christoph Lameter <cl@linux.com>\n"
+ "Cc: Android Kernel Team <kernel-team@android.com>\n"
+ "Cc: Robert Love <rlove@google.com>\n"
+ "Cc: Mel Gorman <mel@csn.ul.ie>\n"
+ "Cc: Hugh Dickins <hughd@google.com>\n"
+ "Cc: Dave Hansen <dave@linux.vnet.ibm.com>\n"
+ "Cc: Rik van Riel <riel@redhat.com>\n"
+ "Cc: Dave Chinner <david@fromorbit.com>\n"
+ "Cc: Neil Brown <neilb@suse.de>\n"
+ "Cc: Mike Hommey <mh@glandium.org>\n"
+ "Cc: Taras Glek <tglek@mozilla.com>\n"
+ "Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>\n"
+ "Cc: Christoph Lameter <cl@linux.com>\n"
+ "Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>\n"
+ "Signed-off-by: Minchan Kim <minchan@kernel.org>\n"
+ "---\n"
+ " arch/x86/mm/fault.c               |    2 +\n"
+ " include/asm-generic/mman-common.h |    6 ++\n"
+ " include/linux/mm.h                |    7 ++-\n"
+ " include/linux/rmap.h              |   20 ++++++\n"
+ " include/linux/vm_event_item.h     |    2 +-\n"
+ " mm/madvise.c                      |   19 +++++-\n"
+ " mm/memory.c                       |   32 ++++++++++\n"
+ " mm/migrate.c                      |    6 +-\n"
+ " mm/rmap.c                         |  125 ++++++++++++++++++++++++++++++++++++-\n"
+ " mm/vmscan.c                       |    7 +++\n"
+ " mm/vmstat.c                       |    1 +\n"
+ " 11 files changed, 218 insertions(+), 9 deletions(-)\n"
+ "\n"
+ "diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c\n"
+ "index 76dcd9d..17c1c20 100644\n"
+ "--- a/arch/x86/mm/fault.c\n"
+ "+++ b/arch/x86/mm/fault.c\n"
+ "@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,\n"
+ " \t\t}\n"
+ " \n"
+ " \t\tout_of_memory(regs, error_code, address);\n"
+ "+\t} else if (fault & VM_FAULT_SIGSEG) {\n"
+ "+\t\t\tbad_area(regs, error_code, address);\n"
+ " \t} else {\n"
+ " \t\tif (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|\n"
+ " \t\t\t     VM_FAULT_HWPOISON_LARGE))\n"
+ "diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h\n"
+ "index d030d2c..f07781e 100644\n"
+ "--- a/include/asm-generic/mman-common.h\n"
+ "+++ b/include/asm-generic/mman-common.h\n"
+ "@@ -34,6 +34,12 @@\n"
+ " #define MADV_SEQUENTIAL\t2\t\t/* expect sequential page references */\n"
+ " #define MADV_WILLNEED\t3\t\t/* will need these pages */\n"
+ " #define MADV_DONTNEED\t4\t\t/* don't need these pages */\n"
+ "+/*\n"
+ "+ * Unlike other flags, we need two locks to protect MADV_VOLATILE.\n"
+ "+ * For changing the flag, we need mmap_sem's write lock and volatile_lock\n"
+ "+ * while we just need volatile_lock in case of reading the flag.\n"
+ "+ */\n"
+ "+#define MADV_VOLATILE\t5\t\t/* pages will disappear suddenly */\n"
+ " \n"
+ " /* common parameters: try to keep these consistent across architectures */\n"
+ " #define MADV_REMOVE\t9\t\t/* remove these pages & resources */\n"
+ "diff --git a/include/linux/mm.h b/include/linux/mm.h\n"
+ "index 311be90..89027b5 100644\n"
+ "--- a/include/linux/mm.h\n"
+ "+++ b/include/linux/mm.h\n"
+ "@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);\n"
+ " #define VM_SAO\t\t0x20000000\t/* Strong Access Ordering (powerpc) */\n"
+ " #define VM_PFN_AT_MMAP\t0x40000000\t/* PFNMAP vma that is fully mapped at mmap time */\n"
+ " #define VM_MERGEABLE\t0x80000000\t/* KSM may merge identical pages */\n"
+ "+#define VM_VOLATILE\t0x100000000\t/* Pages in the vma can be discarded without swap */\n"
+ " \n"
+ " /* Bits set in the VMA until the stack is in its final location */\n"
+ " #define VM_STACK_INCOMPLETE_SETUP\t(VM_RAND_READ | VM_SEQ_READ)\n"
+ "@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);\n"
+ "  * Special vmas that are non-mergable, non-mlock()able.\n"
+ "  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.\n"
+ "  */\n"
+ "-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)\n"
+ "+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)\n"
+ " \n"
+ " /*\n"
+ "  * mapping from the currently active vm_flags protection bits (the\n"
+ "@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)\n"
+ " #define VM_FAULT_NOPAGE\t0x0100\t/* ->fault installed the pte, not return page */\n"
+ " #define VM_FAULT_LOCKED\t0x0200\t/* ->fault locked the returned page */\n"
+ " #define VM_FAULT_RETRY\t0x0400\t/* ->fault blocked, must retry */\n"
+ "-\n"
+ "+#define VM_FAULT_SIGSEG\t0x0800\t/* -> There is no vma */\n"
+ " #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */\n"
+ " \n"
+ " #define VM_FAULT_ERROR\t(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \\\n"
+ "-\t\t\t VM_FAULT_HWPOISON_LARGE)\n"
+ "+\t\t\t VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)\n"
+ " \n"
+ " /* Encode hstate index for a hwpoisoned large page */\n"
+ " #define VM_FAULT_SET_HINDEX(x) ((x) << 12)\n"
+ "diff --git a/include/linux/rmap.h b/include/linux/rmap.h\n"
+ "index 3fce545..735d7a3 100644\n"
+ "--- a/include/linux/rmap.h\n"
+ "+++ b/include/linux/rmap.h\n"
+ "@@ -67,6 +67,9 @@ struct anon_vma_chain {\n"
+ " \tstruct list_head same_anon_vma;\t/* locked by anon_vma->mutex */\n"
+ " };\n"
+ " \n"
+ "+void volatile_lock(struct vm_area_struct *vma);\n"
+ "+void volatile_unlock(struct vm_area_struct *vma);\n"
+ "+\n"
+ " #ifdef CONFIG_MMU\n"
+ " static inline void get_anon_vma(struct anon_vma *anon_vma)\n"
+ " {\n"
+ "@@ -170,6 +173,7 @@ enum ttu_flags {\n"
+ " \tTTU_IGNORE_MLOCK = (1 << 8),\t/* ignore mlock */\n"
+ " \tTTU_IGNORE_ACCESS = (1 << 9),\t/* don't age */\n"
+ " \tTTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */\n"
+ "+\tTTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */\n"
+ " };\n"
+ " #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)\n"
+ " \n"
+ "@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,\n"
+ " \treturn ptep;\n"
+ " }\n"
+ " \n"
+ "+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,\n"
+ "+                                unsigned long, spinlock_t **);\n"
+ "+\n"
+ "+static inline pte_t *page_check_volatile_address(struct page *page,\n"
+ "+                                        struct mm_struct *mm,\n"
+ "+                                        unsigned long address,\n"
+ "+                                        spinlock_t **ptlp)\n"
+ "+{\n"
+ "+        pte_t *ptep;\n"
+ "+\n"
+ "+        __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,\n"
+ "+                                        mm, address, ptlp));\n"
+ "+        return ptep;\n"
+ "+}\n"
+ "+\n"
+ " /*\n"
+ "  * Used by swapoff to help locate where page is expected in vma.\n"
+ "  */\n"
+ "@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)\n"
+ " #define SWAP_AGAIN\t1\n"
+ " #define SWAP_FAIL\t2\n"
+ " #define SWAP_MLOCK\t3\n"
+ "+#define SWAP_DISCARD\t4\n"
+ " \n"
+ " #endif\t/* _LINUX_RMAP_H */\n"
+ "diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h\n"
+ "index 57f7b10..3f9a40b 100644\n"
+ "--- a/include/linux/vm_event_item.h\n"
+ "+++ b/include/linux/vm_event_item.h\n"
+ "@@ -23,7 +23,7 @@\n"
+ " \n"
+ " enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,\n"
+ " \t\tFOR_ALL_ZONES(PGALLOC),\n"
+ "-\t\tPGFREE, PGACTIVATE, PGDEACTIVATE,\n"
+ "+\t\tPGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,\n"
+ " \t\tPGFAULT, PGMAJFAULT,\n"
+ " \t\tFOR_ALL_ZONES(PGREFILL),\n"
+ " \t\tFOR_ALL_ZONES(PGSTEAL_KSWAPD),\n"
+ "diff --git a/mm/madvise.c b/mm/madvise.c\n"
+ "index 14d260f..53a19d8 100644\n"
+ "--- a/mm/madvise.c\n"
+ "+++ b/mm/madvise.c\n"
+ "@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,\n"
+ " \t\tif (error)\n"
+ " \t\t\tgoto out;\n"
+ " \t\tbreak;\n"
+ "+\tcase MADV_VOLATILE:\n"
+ "+\t\tif (vma->vm_flags & VM_LOCKED) {\n"
+ "+\t\t\terror = -EINVAL;\n"
+ "+\t\t\tgoto out;\n"
+ "+\t\t}\n"
+ "+\t\tnew_flags |= VM_VOLATILE;\n"
+ "+\t\tbreak;\n"
+ " \t}\n"
+ " \n"
+ " \tif (new_flags == vma->vm_flags) {\n"
+ "@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,\n"
+ " success:\n"
+ " \t/*\n"
+ " \t * vm_flags is protected by the mmap_sem held in write mode.\n"
+ "+\t * In case of MADV_VOLATILE, we need anon_vma_lock additionally.\n"
+ " \t */\n"
+ "+\tif (behavior == MADV_VOLATILE)\n"
+ "+\t\tvolatile_lock(vma);\n"
+ " \tvma->vm_flags = new_flags;\n"
+ "-\n"
+ "+\tif (behavior == MADV_VOLATILE)\n"
+ "+\t\tvolatile_unlock(vma);\n"
+ " out:\n"
+ " \tif (error == -ENOMEM)\n"
+ " \t\terror = -EAGAIN;\n"
+ "@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)\n"
+ " #endif\n"
+ " \tcase MADV_DONTDUMP:\n"
+ " \tcase MADV_DODUMP:\n"
+ "+\tcase MADV_VOLATILE:\n"
+ " \t\treturn 1;\n"
+ " \n"
+ " \tdefault:\n"
+ "@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)\n"
+ " \t\tgoto out;\n"
+ " \tlen = (len_in + ~PAGE_MASK) & PAGE_MASK;\n"
+ " \n"
+ "+\tif (behavior != MADV_VOLATILE)\n"
+ "+\t\tlen = (len_in + ~PAGE_MASK) & PAGE_MASK;\n"
+ "+\telse\n"
+ "+\t\tlen = len_in & PAGE_MASK;\n"
+ "+\n"
+ " \t/* Check to see whether len was rounded up from small -ve to zero */\n"
+ " \tif (len_in && !len)\n"
+ " \t\tgoto out;\n"
+ "diff --git a/mm/memory.c b/mm/memory.c\n"
+ "index 5736170..b5e4996 100644\n"
+ "--- a/mm/memory.c\n"
+ "+++ b/mm/memory.c\n"
+ "@@ -57,6 +57,7 @@\n"
+ " #include <linux/swapops.h>\n"
+ " #include <linux/elf.h>\n"
+ " #include <linux/gfp.h>\n"
+ "+#include <linux/mempolicy.h>\n"
+ " \n"
+ " #include <asm/io.h>\n"
+ " #include <asm/pgalloc.h>\n"
+ "@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,\n"
+ " \t\t\t\t\treturn do_linear_fault(mm, vma, address,\n"
+ " \t\t\t\t\t\tpte, pmd, flags, entry);\n"
+ " \t\t\t}\n"
+ "+\t\t\tif (vma->vm_flags & VM_VOLATILE) {\n"
+ "+\t\t\t\tstruct vm_area_struct *prev;\n"
+ "+\n"
+ "+\t\t\t\tup_read(&mm->mmap_sem);\n"
+ "+\t\t\t\tdown_write(&mm->mmap_sem);\n"
+ "+\t\t\t\tvma = find_vma_prev(mm, address, &prev);\n"
+ "+\n"
+ "+\t\t\t\t/* Someone unmapped the vma */\n"
+ "+\t\t\t\tif (unlikely(!vma) || vma->vm_start > address) {\n"
+ "+\t\t\t\t\tdowngrade_write(&mm->mmap_sem);\n"
+ "+\t\t\t\t\treturn VM_FAULT_SIGSEG;\n"
+ "+\t\t\t\t}\n"
+ "+\t\t\t\t/* Someone else already handled */\n"
+ "+\t\t\t\tif (vma->vm_flags & VM_VOLATILE) {\n"
+ "+\t\t\t\t\t/*\n"
+ "+\t\t\t\t\t * From now on, we hold mmap_sem as\n"
+ "+\t\t\t\t\t * exclusive.\n"
+ "+\t\t\t\t\t */\n"
+ "+\t\t\t\t\tvolatile_lock(vma);\n"
+ "+\t\t\t\t\tvma->vm_flags &= ~VM_VOLATILE;\n"
+ "+\t\t\t\t\tvolatile_unlock(vma);\n"
+ "+\n"
+ "+\t\t\t\t\tvma_merge(mm, prev, vma->vm_start,\n"
+ "+\t\t\t\t\t\tvma->vm_end, vma->vm_flags,\n"
+ "+\t\t\t\t\t\tvma->anon_vma, vma->vm_file,\n"
+ "+\t\t\t\t\t\tvma->vm_pgoff, vma_policy(vma));\n"
+ "+\n"
+ "+\t\t\t\t}\n"
+ "+\n"
+ "+\t\t\t\tdowngrade_write(&mm->mmap_sem);\n"
+ "+\t\t\t}\n"
+ " \t\t\treturn do_anonymous_page(mm, vma, address,\n"
+ " \t\t\t\t\t\t pte, pmd, flags);\n"
+ " \t\t}\n"
+ "diff --git a/mm/migrate.c b/mm/migrate.c\n"
+ "index 77ed2d7..08b009c 100644\n"
+ "--- a/mm/migrate.c\n"
+ "+++ b/mm/migrate.c\n"
+ "@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,\n"
+ " \t}\n"
+ " \n"
+ " \t/* Establish migration ptes or remove ptes */\n"
+ "-\ttry_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);\n"
+ "+\ttry_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|\n"
+ "+\t\t\t\tTTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);\n"
+ " \n"
+ " skip_unmap:\n"
+ " \tif (!page_mapped(page))\n"
+ "@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,\n"
+ " \tif (PageAnon(hpage))\n"
+ " \t\tanon_vma = page_get_anon_vma(hpage);\n"
+ " \n"
+ "-\ttry_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);\n"
+ "+\ttry_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|\n"
+ "+\t\t\t\tTTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);\n"
+ " \n"
+ " \tif (!page_mapped(hpage))\n"
+ " \t\trc = move_to_new_page(new_hpage, hpage, 1, mode);\n"
+ "diff --git a/mm/rmap.c b/mm/rmap.c\n"
+ "index 0f3b7cd..1a0ab2b 100644\n"
+ "--- a/mm/rmap.c\n"
+ "+++ b/mm/rmap.c\n"
+ "@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)\n"
+ " \treturn vma_address(page, vma);\n"
+ " }\n"
+ " \n"
+ "+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,\n"
+ "+\t\tunsigned long address, spinlock_t **ptlp)\n"
+ "+{\n"
+ "+\tpgd_t *pgd;\n"
+ "+\tpud_t *pud;\n"
+ "+\tpmd_t *pmd;\n"
+ "+\tpte_t *pte;\n"
+ "+\tspinlock_t *ptl;\n"
+ "+\n"
+ "+\tswp_entry_t entry = { .val = page_private(page) };\n"
+ "+\n"
+ "+\tif (unlikely(PageHuge(page))) {\n"
+ "+\t\tpte = huge_pte_offset(mm, address);\n"
+ "+\t\tptl = &mm->page_table_lock;\n"
+ "+\t\tgoto check;\n"
+ "+\t}\n"
+ "+\n"
+ "+\tpgd = pgd_offset(mm, address);\n"
+ "+\tif (!pgd_present(*pgd))\n"
+ "+\t\treturn NULL;\n"
+ "+\n"
+ "+\tpud = pud_offset(pgd, address);\n"
+ "+\tif (!pud_present(*pud))\n"
+ "+\t\treturn NULL;\n"
+ "+\n"
+ "+\tpmd = pmd_offset(pud, address);\n"
+ "+\tif (!pmd_present(*pmd))\n"
+ "+\t\treturn NULL;\n"
+ "+\tif (pmd_trans_huge(*pmd))\n"
+ "+\t\treturn NULL;\n"
+ "+\n"
+ "+\tpte = pte_offset_map(pmd, address);\n"
+ "+\tptl = pte_lockptr(mm, pmd);\n"
+ "+check:\n"
+ "+\tspin_lock(ptl);\n"
+ "+\tif (PageAnon(page)) {\n"
+ "+\t\tif (!pte_present(*pte) && entry.val ==\n"
+ "+\t\t\t\tpte_to_swp_entry(*pte).val) {\n"
+ "+\t\t\t*ptlp = ptl;\n"
+ "+\t\t\treturn pte;\n"
+ "+\t\t}\n"
+ "+\t} else {\n"
+ "+\t\tif (pte_none(*pte)) {\n"
+ "+\t\t\t*ptlp = ptl;\n"
+ "+\t\t\treturn pte;\n"
+ "+\t\t}\n"
+ "+\t}\n"
+ "+\tpte_unmap_unlock(pte, ptl);\n"
+ "+\treturn NULL;\n"
+ "+}\n"
+ "+\n"
+ " /*\n"
+ "  * Check that @page is mapped at @address into @mm.\n"
+ "  *\n"
+ "@@ -1218,6 +1269,35 @@ out:\n"
+ " \t\tmem_cgroup_end_update_page_stat(page, &locked, &flags);\n"
+ " }\n"
+ " \n"
+ "+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,\n"
+ "+                unsigned long address)\n"
+ "+{\n"
+ "+        struct mm_struct *mm = vma->vm_mm;\n"
+ "+        pte_t *pte;\n"
+ "+        pte_t pteval;\n"
+ "+        spinlock_t *ptl;\n"
+ "+\n"
+ "+        pte = page_check_volatile_address(page, mm, address, &ptl);\n"
+ "+        if (!pte)\n"
+ "+                return 0;\n"
+ "+\n"
+ "+        /* Nuke the page table entry. */\n"
+ "+        flush_cache_page(vma, address, page_to_pfn(page));\n"
+ "+        pteval = ptep_clear_flush(vma, address, pte);\n"
+ "+\n"
+ "+        if (PageAnon(page)) {\n"
+ "+                swp_entry_t entry = { .val = page_private(page) };\n"
+ "+                if (PageSwapCache(page)) {\n"
+ "+                        dec_mm_counter(mm, MM_SWAPENTS);\n"
+ "+                        swap_free(entry);\n"
+ "+                }\n"
+ "+        }\n"
+ "+\n"
+ "+        pte_unmap_unlock(pte, ptl);\n"
+ "+        mmu_notifier_invalidate_page(mm, address);\n"
+ "+        return 1;\n"
+ "+}\n"
+ "+\n"
+ " /*\n"
+ "  * Subfunctions of try_to_unmap: try_to_unmap_one called\n"
+ "  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.\n"
+ "@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)\n"
+ " \tstruct anon_vma *anon_vma;\n"
+ " \tstruct anon_vma_chain *avc;\n"
+ " \tint ret = SWAP_AGAIN;\n"
+ "+\tbool is_volatile = true;\n"
+ "+\n"
+ "+\tif (flags & TTU_IGNORE_VOLATILE)\n"
+ "+\t\tis_volatile = false;\n"
+ " \n"
+ " \tanon_vma = page_lock_anon_vma(page);\n"
+ " \tif (!anon_vma)\n"
+ "@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)\n"
+ " \t\t * temporary VMAs until after exec() completes.\n"
+ " \t\t */\n"
+ " \t\tif (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&\n"
+ "-\t\t\t\tis_vma_temporary_stack(vma))\n"
+ "+\t\t\t\tis_vma_temporary_stack(vma)) {\n"
+ "+\t\t\tis_volatile = false;\n"
+ " \t\t\tcontinue;\n"
+ "+\t\t}\n"
+ " \n"
+ " \t\taddress = vma_address(page, vma);\n"
+ " \t\tif (address == -EFAULT)\n"
+ " \t\t\tcontinue;\n"
+ "+                /*\n"
+ "+                 * A volatile page will only be purged if ALL vmas\n"
+ "+\t\t * pointing to it are VM_VOLATILE.\n"
+ "+                 */\n"
+ "+                if (!(vma->vm_flags & VM_VOLATILE))\n"
+ "+                        is_volatile = false;\n"
+ "+\n"
+ " \t\tret = try_to_unmap_one(page, vma, address, flags);\n"
+ " \t\tif (ret != SWAP_AGAIN || !page_mapped(page))\n"
+ " \t\t\tbreak;\n"
+ " \t}\n"
+ " \n"
+ "+        if (page_mapped(page) || is_volatile == false)\n"
+ "+                goto out;\n"
+ "+\n"
+ "+        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {\n"
+ "+                struct vm_area_struct *vma = avc->vma;\n"
+ "+                unsigned long address;\n"
+ "+\n"
+ "+                address = vma_address(page, vma);\n"
+ "+                try_to_zap_one(page, vma, address);\n"
+ "+        }\n"
+ "+        /* We're throwing this page out, so mark it clean */\n"
+ "+        ClearPageDirty(page);\n"
+ "+        ret = SWAP_DISCARD;\n"
+ "+out:\n"
+ " \tpage_unlock_anon_vma(anon_vma);\n"
+ " \treturn ret;\n"
+ " }\n"
+ "@@ -1651,6 +1758,7 @@ out:\n"
+ "  * SWAP_AGAIN\t- we missed a mapping, try again later\n"
+ "  * SWAP_FAIL\t- the page is unswappable\n"
+ "  * SWAP_MLOCK\t- page is mlocked.\n"
+ "+ * SWAP_DISCARD - page is volatile.\n"
+ "  */\n"
+ " int try_to_unmap(struct page *page, enum ttu_flags flags)\n"
+ " {\n"
+ "@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)\n"
+ " \t\tret = try_to_unmap_anon(page, flags);\n"
+ " \telse\n"
+ " \t\tret = try_to_unmap_file(page, flags);\n"
+ "-\tif (ret != SWAP_MLOCK && !page_mapped(page))\n"
+ "+\tif (ret != SWAP_MLOCK && !page_mapped(page) &&\n"
+ "+\t\t\t\t\tret != SWAP_DISCARD)\n"
+ " \t\tret = SWAP_SUCCESS;\n"
+ " \treturn ret;\n"
+ " }\n"
+ "@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)\n"
+ " \tanon_vma_free(anon_vma);\n"
+ " }\n"
+ " \n"
+ "+void volatile_lock(struct vm_area_struct *vma)\n"
+ "+{\n"
+ "+        if (vma->anon_vma)\n"
+ "+                anon_vma_lock(vma->anon_vma);\n"
+ "+}\n"
+ "+\n"
+ "+void volatile_unlock(struct vm_area_struct *vma)\n"
+ "+{\n"
+ "+        if (vma->anon_vma)\n"
+ "+                anon_vma_unlock(vma->anon_vma);\n"
+ "+}\n"
+ "+\n"
+ " #ifdef CONFIG_MIGRATION\n"
+ " /*\n"
+ "  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():\n"
+ "diff --git a/mm/vmscan.c b/mm/vmscan.c\n"
+ "index 99b434b..4e463a4 100644\n"
+ "--- a/mm/vmscan.c\n"
+ "+++ b/mm/vmscan.c\n"
+ "@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,\n"
+ " \tif (vm_flags & VM_LOCKED)\n"
+ " \t\treturn PAGEREF_RECLAIM;\n"
+ " \n"
+ "+\tif (vm_flags & VM_VOLATILE)\n"
+ "+\t\treturn PAGEREF_RECLAIM;\n"
+ "+\n"
+ " \tif (referenced_ptes) {\n"
+ " \t\tif (PageSwapBacked(page))\n"
+ " \t\t\treturn PAGEREF_ACTIVATE;\n"
+ "@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,\n"
+ " \t\t */\n"
+ " \t\tif (page_mapped(page) && mapping) {\n"
+ " \t\t\tswitch (try_to_unmap(page, TTU_UNMAP)) {\n"
+ "+\t\t\tcase SWAP_DISCARD:\n"
+ "+\t\t\t\tcount_vm_event(PGVOLATILE);\n"
+ "+\t\t\t\tgoto discard_page;\n"
+ " \t\t\tcase SWAP_FAIL:\n"
+ " \t\t\t\tgoto activate_locked;\n"
+ " \t\t\tcase SWAP_AGAIN:\n"
+ "@@ -857,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,\n"
+ " \t\t\t}\n"
+ " \t\t}\n"
+ " \n"
+ "+discard_page:\n"
+ " \t\t/*\n"
+ " \t\t * If the page has buffers, try to free the buffer mappings\n"
+ " \t\t * associated with this page. If we succeed we try to free\n"
+ "diff --git a/mm/vmstat.c b/mm/vmstat.c\n"
+ "index df7a674..410caf5 100644\n"
+ "--- a/mm/vmstat.c\n"
+ "+++ b/mm/vmstat.c\n"
+ "@@ -734,6 +734,7 @@ const char * const vmstat_text[] = {\n"
+ " \tTEXTS_FOR_ZONES(\"pgalloc\")\n"
+ " \n"
+ " \t\"pgfree\",\n"
+ "+\t\"pgvolatile\",\n"
+ " \t\"pgactivate\",\n"
+ " \t\"pgdeactivate\",\n"
+ " \n"
+ "-- \n"
+ "1.7.9.5\n"
+ "\n"
+ "-- \n"
+ "Kind regards,\n"
+ Minchan Kim
 
-bbb39ee4e4da38253f234f7578ddcf32588a6bcb78c81d1c7deb466cd16df0e3
+50667ac0930e903f0758f3f122e5967b978d38feeda060bb4787d6a81dc241b6
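A note on the mm/madvise.c hunk in the quoted patch: for MADV_VOLATILE the length is rounded down to a page boundary instead of up, presumably so that a partially covered trailing page is never marked discardable. The sketch below reproduces the two computations in user-space C. It assumes 4096-byte pages, and the names PAGE_SIZE_ASSUMED, PAGE_MASK_ASSUMED, len_round_up and len_round_down are hypothetical helpers for illustration, not part of the patch.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages; the kernel uses the architecture's PAGE_SIZE. */
#define PAGE_SIZE_ASSUMED 4096UL
#define PAGE_MASK_ASSUMED (~(PAGE_SIZE_ASSUMED - 1))

/* Existing computation for the other advice values: round the length up. */
static size_t len_round_up(size_t len_in)
{
	return (len_in + ~PAGE_MASK_ASSUMED) & PAGE_MASK_ASSUMED;
}

/* Computation the patch adds for MADV_VOLATILE: round the length down. */
static size_t len_round_down(size_t len_in)
{
	return len_in & PAGE_MASK_ASSUMED;
}
```

With these assumptions, a length of 4097 bytes covers two pages when rounded up but only one when rounded down, and a non-zero length below one page rounds down to zero. In the patched syscall that last case appears to reach the existing `if (len_in && !len)` check and be rejected.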
