diff for duplicates of <20121211024104.GA10523@blaptop> diff --git a/a/1.txt b/N1/1.txt index e1fe1be..ef3279e 100644 --- a/a/1.txt +++ b/N1/1.txt @@ -1 +1,642 @@ Sorry, resending with fixing compile error. :( + +>From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001 +From: Minchan Kim <minchan@kernel.org> +Date: Tue, 11 Dec 2012 11:38:30 +0900 +Subject: [RFC v3] Support volatile range for anon vma + +This is still [RFC v3] because it has only passed my simple test +with a tweaked TCMalloc. + +I hope for more input from user-space allocator people, and for tests of this patch +with their allocators, because getting real value might need a design change in +arena management. + +Changelog from v2 + + * Remove madvise(addr, length, MADV_NOVOLATILE). + * Add a vmstat counter for the number of discarded volatile pages + * Discard volatile pages without promotion in the reclaim path + +This is based on v3.6. + +- What is madvise(addr, length, MADV_VOLATILE)? + + It's a hint that the user delivers to the kernel so the kernel can *discard* + pages in a range at any time. + +- What happens if the user accesses a page (i.e., virtual address) discarded + by the kernel? + + The user sees zero-fill-on-demand pages, as with madvise(DONTNEED). + +- What happens if the user accesses a page (i.e., virtual address) not + discarded by the kernel? + + The user sees the old data without a page fault. + +- How does it differ from madvise(DONTNEED)? + + System call semantics + + DONTNEED guarantees the user always sees zero-fill pages after + calling madvise, while with VOLATILE the user sees either zero-fill pages or + old data. + + Internal implementation + + madvise(DONTNEED) has to zap all mapped pages in the range, so its + overhead increases linearly with the number of mapped pages. + Moreover, if the user writes to zapped pages, page fault + page + allocation + memset have to happen. + + madvise(VOLATILE) only marks a flag on the range (i.e., VMA). + It doesn't touch the pages at all, so the overhead of the system call + should be very small.
If memory pressure happens, the VM can discard + pages in VMAs marked VOLATILE. If the user writes to an address + whose page was discarded by the VM, the user sees zero-fill pages, so the + cost is the same as DONTNEED; but if memory pressure isn't severe, + the user sees the old data without (page fault + page allocation + memset). + + The VOLATILE mark is removed in the page fault handler when the first + page fault occurs in a marked vma, so later page faults follow the normal + page fault path. That's why the user doesn't need a madvise(MADV_NOVOLATILE) + interface. + +- What's the benefit compared to DONTNEED? + + 1. The system call overhead is smaller because VOLATILE just sets + a flag on the VMA instead of zapping all the pages in a range. + + 2. It has a chance to eliminate overheads (e.g., page fault + + page allocation + memset(PAGE_SIZE)). + +- Are there any drawbacks? + + DONTNEED doesn't need exclusive mmap_sem locking, so concurrent page + faults from other threads are allowed. But VOLATILE needs exclusive + mmap_sem, so other threads are blocked if they try to access + not-mapped pages. That's why I designed madvise(VOLATILE)'s overhead + to be as small as possible. + + Another concern with exclusive mmap_sem is when a page fault occurs in a + VOLATILE-marked vma. We have to remove the flag from the vma and merge + adjacent vmas, which needs exclusive mmap_sem. That can slow down page fault + handling and prevent concurrent page faults. But we need such handling + just once, when a page fault occurs after we mark the VMA VOLATILE, and + only if memory pressure happens so the page is discarded. So it shouldn't + be common, and the benefit we get from this feature should be bigger than + the loss. + +- What is it targeting? + + Firstly, user-space allocators like ptmalloc and tcmalloc, or heap management + of virtual machines like Dalvik. Also, it comes in handy for embedded systems + which don't have a swap device, so they can't reclaim anonymous pages. + By discarding instead of swapping, it can be used on non-swap systems.
+ For that, we have to age the anon lru list even though we don't have swap, because + I don't want to discard volatile pages with top priority when memory pressure + happens: volatile in this patch means "We don't need to swap out because + the user can handle the situation where data disappears suddenly", NOT + "They are useless so hurry up and reclaim them". So I want to apply the same + aging rule as normal pages to them. + + Background aging of anonymous pages on non-swap systems would be a trade-off + for getting a good feature. In fact, we did it until the merge of + [1] two years ago, and I believe the gain of this patch will beat the loss from the anon lru aging + overhead once all allocators start to use madvise. + (This patch doesn't include background aging for non-swap systems, + but it's trivial to add if we decide to.) + +[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system + +Cc: Michael Kerrisk <mtk.manpages@gmail.com> +Cc: Arun Sharma <asharma@fb.com> +Cc: sanjay@google.com +Cc: Paul Turner <pjt@google.com> +CC: David Rientjes <rientjes@google.com> +Cc: John Stultz <john.stultz@linaro.org> +Cc: Andrew Morton <akpm@linux-foundation.org> +Cc: Christoph Lameter <cl@linux.com> +Cc: Android Kernel Team <kernel-team@android.com> +Cc: Robert Love <rlove@google.com> +Cc: Mel Gorman <mel@csn.ul.ie> +Cc: Hugh Dickins <hughd@google.com> +Cc: Dave Hansen <dave@linux.vnet.ibm.com> +Cc: Rik van Riel <riel@redhat.com> +Cc: Dave Chinner <david@fromorbit.com> +Cc: Neil Brown <neilb@suse.de> +Cc: Mike Hommey <mh@glandium.org> +Cc: Taras Glek <tglek@mozilla.com> +Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> +Cc: Christoph Lameter <cl@linux.com> +Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> +Signed-off-by: Minchan Kim <minchan@kernel.org> +--- + arch/x86/mm/fault.c | 2 + + include/asm-generic/mman-common.h | 6 ++ + include/linux/mm.h | 7 ++- + include/linux/rmap.h | 20 ++++++ + include/linux/vm_event_item.h | 2 +- + mm/madvise.c | 19 +++++- + mm/memory.c | 32 ++++++++++ +
mm/migrate.c | 6 +- + mm/rmap.c | 125 ++++++++++++++++++++++++++++++++++++- + mm/vmscan.c | 7 +++ + mm/vmstat.c | 1 + + 11 files changed, 218 insertions(+), 9 deletions(-) + +diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c +index 76dcd9d..17c1c20 100644 +--- a/arch/x86/mm/fault.c ++++ b/arch/x86/mm/fault.c +@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, + } + + out_of_memory(regs, error_code, address); ++ } else if (fault & VM_FAULT_SIGSEG) { ++ bad_area(regs, error_code, address); + } else { + if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON| + VM_FAULT_HWPOISON_LARGE)) +diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h +index d030d2c..f07781e 100644 +--- a/include/asm-generic/mman-common.h ++++ b/include/asm-generic/mman-common.h +@@ -34,6 +34,12 @@ + #define MADV_SEQUENTIAL 2 /* expect sequential page references */ + #define MADV_WILLNEED 3 /* will need these pages */ + #define MADV_DONTNEED 4 /* don't need these pages */ ++/* ++ * Unlike other flags, we need two locks to protect MADV_VOLATILE. ++ * For changing the flag, we need mmap_sem's write lock and volatile_lock ++ * while we just need volatile_lock in case of reading the flag. 
++ */ ++#define MADV_VOLATILE 5 /* pages will disappear suddenly */ + + /* common parameters: try to keep these consistent across architectures */ + #define MADV_REMOVE 9 /* remove these pages & resources */ +diff --git a/include/linux/mm.h b/include/linux/mm.h +index 311be90..89027b5 100644 +--- a/include/linux/mm.h ++++ b/include/linux/mm.h +@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp); + #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */ + #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */ + #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */ ++#define VM_VOLATILE 0x100000000 /* Pages in the vma could be discardable without swap */ + + /* Bits set in the VMA until the stack is in its final location */ + #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ) +@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp); + * Special vmas that are non-mergable, non-mlock()able. + * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
+ */ +-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP) ++#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE) + + /* + * mapping from the currently active vm_flags protection bits (the +@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page) + #define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */ + #define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */ + #define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */ +- ++#define VM_FAULT_SIGSEG 0x0800 /* -> There is no vma */ + #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */ + + #define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \ +- VM_FAULT_HWPOISON_LARGE) ++ VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG) + + /* Encode hstate index for a hwpoisoned large page */ + #define VM_FAULT_SET_HINDEX(x) ((x) << 12) +diff --git a/include/linux/rmap.h b/include/linux/rmap.h +index 3fce545..735d7a3 100644 +--- a/include/linux/rmap.h ++++ b/include/linux/rmap.h +@@ -67,6 +67,9 @@ struct anon_vma_chain { + struct list_head same_anon_vma; /* locked by anon_vma->mutex */ + }; + ++void volatile_lock(struct vm_area_struct *vma); ++void volatile_unlock(struct vm_area_struct *vma); ++ + #ifdef CONFIG_MMU + static inline void get_anon_vma(struct anon_vma *anon_vma) + { +@@ -170,6 +173,7 @@ enum ttu_flags { + TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */ + TTU_IGNORE_ACCESS = (1 << 9), /* don't age */ + TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */ ++ TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */ + }; + #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK) + +@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm, + return ptep; + } + ++pte_t *__page_check_volatile_address(struct page *, struct mm_struct *, ++ unsigned long, spinlock_t **); ++ ++static inline pte_t 
*page_check_volatile_address(struct page *page, ++ struct mm_struct *mm, ++ unsigned long address, ++ spinlock_t **ptlp) ++{ ++ pte_t *ptep; ++ ++ __cond_lock(*ptlp, ptep = __page_check_volatile_address(page, ++ mm, address, ptlp)); ++ return ptep; ++} ++ + /* + * Used by swapoff to help locate where page is expected in vma. + */ +@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page) + #define SWAP_AGAIN 1 + #define SWAP_FAIL 2 + #define SWAP_MLOCK 3 ++#define SWAP_DISCARD 4 + + #endif /* _LINUX_RMAP_H */ +diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h +index 57f7b10..3f9a40b 100644 +--- a/include/linux/vm_event_item.h ++++ b/include/linux/vm_event_item.h +@@ -23,7 +23,7 @@ + + enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, + FOR_ALL_ZONES(PGALLOC), +- PGFREE, PGACTIVATE, PGDEACTIVATE, ++ PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE, + PGFAULT, PGMAJFAULT, + FOR_ALL_ZONES(PGREFILL), + FOR_ALL_ZONES(PGSTEAL_KSWAPD), +diff --git a/mm/madvise.c b/mm/madvise.c +index 14d260f..53a19d8 100644 +--- a/mm/madvise.c ++++ b/mm/madvise.c +@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma, + if (error) + goto out; + break; ++ case MADV_VOLATILE: ++ if (vma->vm_flags & VM_LOCKED) { ++ error = -EINVAL; ++ goto out; ++ } ++ new_flags |= VM_VOLATILE; ++ break; + } + + if (new_flags == vma->vm_flags) { +@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma, + success: + /* + * vm_flags is protected by the mmap_sem held in write mode. ++ * In case of MADV_VOLATILE, we need anon_vma_lock additionally.
+ */ ++ if (behavior == MADV_VOLATILE) ++ volatile_lock(vma); + vma->vm_flags = new_flags; +- ++ if (behavior == MADV_VOLATILE) ++ volatile_unlock(vma); + out: + if (error == -ENOMEM) + error = -EAGAIN; +@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior) + #endif + case MADV_DONTDUMP: + case MADV_DODUMP: ++ case MADV_VOLATILE: + return 1; + + default: +@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) + goto out; + len = (len_in + ~PAGE_MASK) & PAGE_MASK; + ++ if (behavior != MADV_VOLATILE) ++ len = (len_in + ~PAGE_MASK) & PAGE_MASK; ++ else ++ len = len_in & PAGE_MASK; ++ + /* Check to see whether len was rounded up from small -ve to zero */ + if (len_in && !len) + goto out; +diff --git a/mm/memory.c b/mm/memory.c +index 5736170..b5e4996 100644 +--- a/mm/memory.c ++++ b/mm/memory.c +@@ -57,6 +57,7 @@ + #include <linux/swapops.h> + #include <linux/elf.h> + #include <linux/gfp.h> ++#include <linux/mempolicy.h> + + #include <asm/io.h> + #include <asm/pgalloc.h> +@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm, + return do_linear_fault(mm, vma, address, + pte, pmd, flags, entry); + } ++ if (vma->vm_flags & VM_VOLATILE) { ++ struct vm_area_struct *prev; ++ ++ up_read(&mm->mmap_sem); ++ down_write(&mm->mmap_sem); ++ vma = find_vma_prev(mm, address, &prev); ++ ++ /* Someone unmapped the vma */ ++ if (unlikely(!vma) || vma->vm_start > address) { ++ downgrade_write(&mm->mmap_sem); ++ return VM_FAULT_SIGSEG; ++ } ++ /* Someone else already handled */ ++ if (vma->vm_flags & VM_VOLATILE) { ++ /* ++ * From now on, we hold mmap_sem as ++ * exclusive.
++ */ ++ volatile_lock(vma); ++ vma->vm_flags &= ~VM_VOLATILE; ++ volatile_unlock(vma); ++ ++ vma_merge(mm, prev, vma->vm_start, ++ vma->vm_end, vma->vm_flags, ++ vma->anon_vma, vma->vm_file, ++ vma->vm_pgoff, vma_policy(vma)); ++ ++ } ++ ++ downgrade_write(&mm->mmap_sem); ++ } + return do_anonymous_page(mm, vma, address, + pte, pmd, flags); + } +diff --git a/mm/migrate.c b/mm/migrate.c +index 77ed2d7..08b009c 100644 +--- a/mm/migrate.c ++++ b/mm/migrate.c +@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage, + } + + /* Establish migration ptes or remove ptes */ +- try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); ++ try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK| ++ TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE); + + skip_unmap: + if (!page_mapped(page)) +@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, + if (PageAnon(hpage)) + anon_vma = page_get_anon_vma(hpage); + +- try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); ++ try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK| ++ TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE); + + if (!page_mapped(hpage)) + rc = move_to_new_page(new_hpage, hpage, 1, mode); +diff --git a/mm/rmap.c b/mm/rmap.c +index 0f3b7cd..1a0ab2b 100644 +--- a/mm/rmap.c ++++ b/mm/rmap.c +@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma) + return vma_address(page, vma); + } + ++pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm, ++ unsigned long address, spinlock_t **ptlp) ++{ ++ pgd_t *pgd; ++ pud_t *pud; ++ pmd_t *pmd; ++ pte_t *pte; ++ spinlock_t *ptl; ++ ++ swp_entry_t entry = { .val = page_private(page) }; ++ ++ if (unlikely(PageHuge(page))) { ++ pte = huge_pte_offset(mm, address); ++ ptl = &mm->page_table_lock; ++ goto check; ++ } ++ ++ pgd = pgd_offset(mm, address); ++ if (!pgd_present(*pgd)) ++ return NULL; ++ ++ pud = pud_offset(pgd, address); ++ if (!pud_present(*pud)) 
++ return NULL; ++ ++ pmd = pmd_offset(pud, address); ++ if (!pmd_present(*pmd)) ++ return NULL; ++ if (pmd_trans_huge(*pmd)) ++ return NULL; ++ ++ pte = pte_offset_map(pmd, address); ++ ptl = pte_lockptr(mm, pmd); ++check: ++ spin_lock(ptl); ++ if (PageAnon(page)) { ++ if (!pte_present(*pte) && entry.val == ++ pte_to_swp_entry(*pte).val) { ++ *ptlp = ptl; ++ return pte; ++ } ++ } else { ++ if (pte_none(*pte)) { ++ *ptlp = ptl; ++ return pte; ++ } ++ } ++ pte_unmap_unlock(pte, ptl); ++ return NULL; ++} ++ + /* + * Check that @page is mapped at @address into @mm. + * +@@ -1218,6 +1269,35 @@ out: + mem_cgroup_end_update_page_stat(page, &locked, &flags); + } + ++int try_to_zap_one(struct page *page, struct vm_area_struct *vma, ++ unsigned long address) ++{ ++ struct mm_struct *mm = vma->vm_mm; ++ pte_t *pte; ++ pte_t pteval; ++ spinlock_t *ptl; ++ ++ pte = page_check_volatile_address(page, mm, address, &ptl); ++ if (!pte) ++ return 0; ++ ++ /* Nuke the page table entry. */ ++ flush_cache_page(vma, address, page_to_pfn(page)); ++ pteval = ptep_clear_flush(vma, address, pte); ++ ++ if (PageAnon(page)) { ++ swp_entry_t entry = { .val = page_private(page) }; ++ if (PageSwapCache(page)) { ++ dec_mm_counter(mm, MM_SWAPENTS); ++ swap_free(entry); ++ } ++ } ++ ++ pte_unmap_unlock(pte, ptl); ++ mmu_notifier_invalidate_page(mm, address); ++ return 1; ++} ++ + /* + * Subfunctions of try_to_unmap: try_to_unmap_one called + * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file. +@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags) + struct anon_vma *anon_vma; + struct anon_vma_chain *avc; + int ret = SWAP_AGAIN; ++ bool is_volatile = true; ++ ++ if (flags & TTU_IGNORE_VOLATILE) ++ is_volatile = false; + + anon_vma = page_lock_anon_vma(page); + if (!anon_vma) +@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags) + * temporary VMAs until after exec() completes. 
+ */ + if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) && +- is_vma_temporary_stack(vma)) ++ is_vma_temporary_stack(vma)) { ++ is_volatile = false; + continue; ++ } + + address = vma_address(page, vma); + if (address == -EFAULT) + continue; ++ /* ++ * A volatile page will only be purged if ALL vmas ++ * pointing to it are VM_VOLATILE. ++ */ ++ if (!(vma->vm_flags & VM_VOLATILE)) ++ is_volatile = false; ++ + ret = try_to_unmap_one(page, vma, address, flags); + if (ret != SWAP_AGAIN || !page_mapped(page)) + break; + } + ++ if (page_mapped(page) || is_volatile == false) ++ goto out; ++ ++ list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { ++ struct vm_area_struct *vma = avc->vma; ++ unsigned long address; ++ ++ address = vma_address(page, vma); ++ try_to_zap_one(page, vma, address); ++ } ++ /* We're throwing this page out, so mark it clean */ ++ ClearPageDirty(page); ++ ret = SWAP_DISCARD; ++out: + page_unlock_anon_vma(anon_vma); + return ret; + } +@@ -1651,6 +1758,7 @@ out: + * SWAP_AGAIN - we missed a mapping, try again later + * SWAP_FAIL - the page is unswappable + * SWAP_MLOCK - page is mlocked. ++ * SWAP_DISCARD - page is volatile. 
+ */ + int try_to_unmap(struct page *page, enum ttu_flags flags) + { +@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags) + ret = try_to_unmap_anon(page, flags); + else + ret = try_to_unmap_file(page, flags); +- if (ret != SWAP_MLOCK && !page_mapped(page)) ++ if (ret != SWAP_MLOCK && !page_mapped(page) && ++ ret != SWAP_DISCARD) + ret = SWAP_SUCCESS; + return ret; + } +@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma) + anon_vma_free(anon_vma); + } + ++void volatile_lock(struct vm_area_struct *vma) ++{ ++ if (vma->anon_vma) ++ anon_vma_lock(vma->anon_vma); ++} ++ ++void volatile_unlock(struct vm_area_struct *vma) ++{ ++ if (vma->anon_vma) ++ anon_vma_unlock(vma->anon_vma); ++} ++ + #ifdef CONFIG_MIGRATION + /* + * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file(): +diff --git a/mm/vmscan.c b/mm/vmscan.c +index 99b434b..4e463a4 100644 +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page, + if (vm_flags & VM_LOCKED) + return PAGEREF_RECLAIM; + ++ if (vm_flags & VM_VOLATILE) ++ return PAGEREF_RECLAIM; ++ + if (referenced_ptes) { + if (PageSwapBacked(page)) + return PAGEREF_ACTIVATE; +@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list, + */ + if (page_mapped(page) && mapping) { + switch (try_to_unmap(page, TTU_UNMAP)) { ++ case SWAP_DISCARD: ++ count_vm_event(PGVOLATILE); ++ goto discard_page; + case SWAP_FAIL: + goto activate_locked; + case SWAP_AGAIN: +@@ -857,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, + } + } + ++discard_page: + /* + * If the page has buffers, try to free the buffer mappings + * associated with this page. 
If we succeed we try to free +diff --git a/mm/vmstat.c b/mm/vmstat.c +index df7a674..410caf5 100644 +--- a/mm/vmstat.c ++++ b/mm/vmstat.c +@@ -734,6 +734,7 @@ const char * const vmstat_text[] = { + TEXTS_FOR_ZONES("pgalloc") + + "pgfree", ++ "pgvolatile", + "pgactivate", + "pgdeactivate", + +-- +1.7.9.5 + +-- +Kind regards, +Minchan Kim diff --git a/a/content_digest b/N1/content_digest index bbbda7f..711ae86 100644 --- a/a/content_digest +++ b/N1/content_digest @@ -26,6 +26,647 @@ " KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>\0" "\00:1\0" "b\0" - Sorry, resending with fixing compile error. :( + "Sorry, resending with fixing compile error. :(\n" + "\n" + ">From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001\n" + "From: Minchan Kim <minchan@kernel.org>\n" + "Date: Tue, 11 Dec 2012 11:38:30 +0900\n" + "Subject: [RFC v3] Support volatile range for anon vma\n" + "\n" + "This still is [RFC v3] because just passed my simple test\n" + "with TCMalloc tweaking.\n" + "\n" + "I hope more inputs from user-space allocator people and test patch\n" + "with their allocator because it might need design change of arena\n" + "management design for getting real vaule.\n" + "\n" + "Changelog from v2\n" + "\n" + " * Removing madvise(addr, length, MADV_NOVOLATILE).\n" + " * add vmstat about the number of discarded volatile pages\n" + " * discard volatile pages without promotion in reclaim path\n" + "\n" + "This is based on v3.6.\n" + "\n" + "- What's the madvise(addr, length, MADV_VOLATILE)?\n" + "\n" + " It's a hint that user deliver to kernel so kernel can *discard*\n" + " pages in a range anytime.\n" + "\n" + "- What happens if user access page(ie, virtual address) discarded\n" + " by kernel?\n" + "\n" + " The user can see zero-fill-on-demand pages as if madvise(DONTNEED).\n" + "\n" + "- What happens if user access page(ie, virtual address) doesn't\n" + " discarded by kernel?\n" + "\n" + " The user can see old data without page fault.\n" + "\n" + "- What's 
different with madvise(DONTNEED)?\n" + "\n" + " System call semantic\n" + "\n" + " DONTNEED makes sure user always can see zero-fill pages after\n" + " he calls madvise while VOLATILE can see zero-fill pages or\n" + " old data.\n" + "\n" + " Internal implementation\n" + "\n" + " The madvise(DONTNEED) should zap all mapped pages in range so\n" + " overhead is increased linearly with the number of mapped pages.\n" + " Even, if user access zapped pages by write, page fault + page\n" + " allocation + memset should be happened.\n" + "\n" + " The madvise(VOLATILE) should mark the flag in a range(ie, VMA).\n" + " It doesn't touch pages any more so overhead of the system call\n" + " should be very small. If memory pressure happens, VM can discard\n" + " pages in VMAs marked by VOLATILE. If user access address with\n" + " write mode by discarding by VM, he can see zero-fill pages so the\n" + " cost is same with DONTNEED but if memory pressure isn't severe,\n" + " user can see old data without (page fault + page allocation + memset)\n" + "\n" + " The VOLATILE mark should be removed in page fault handler when first\n" + " page fault occur in marked vma so next page faults will follow normal\n" + " page fault path. That's why user don't need madvise(MADV_NOVOLATILE)\n" + " interface.\n" + "\n" + "- What's the benefit compared to DONTNEED?\n" + "\n" + " 1. The system call overhead is smaller because VOLATILE just marks\n" + " the flag to VMA instead of zapping all the page in a range.\n" + "\n" + " 2. It has a chance to eliminate overheads (ex, page fault +\n" + " page allocation + memset(PAGE_SIZE)).\n" + "\n" + "- Isn't there any drawback?\n" + "\n" + " DONTNEED doesn't need exclusive mmap_sem locking so concurrent page\n" + " fault of other threads could be allowed. But VOLATILE needs exclusive\n" + " mmap_sem so other thread would be blocked if they try to access\n" + " not-mapped pages. 
That's why I designed madvise(VOLATILE)'s overhead\n" + " should be small as far as possible.\n" + "\n" + " Other concern of exclusive mmap_sem is when page fault occur in\n" + " VOLATILE marked vma. We should remove the flag of vma and merge\n" + " adjacent vmas so needs exclusive mmap_sem. It can slow down page fault\n" + " handling and prevent concurrent page fault. But we need such handling\n" + " just once when page fault occur after we mark VOLATILE into VMA\n" + " only if memory pressure happpens so the page is discarded. So it wouldn't\n" + " not common so that benefit we get by this feature would be bigger than\n" + " lose.\n" + "\n" + "- What's for targetting?\n" + "\n" + " Firstly, user-space allocator like ptmalloc, tcmalloc or heap management\n" + " of virtual machine like Dalvik. Also, it comes in handy for embedded\n" + " which doesn't have swap device so they can't reclaim anonymous pages.\n" + " By discarding instead of swap, it could be used in the non-swap system.\n" + " For it, we have to age anon lru list although we don't have swap because\n" + " I don't want to discard volatile pages by top priority when memory pressure\n" + " happens as volatile in this patch means \"We don't need to swap out because\n" + " user can handle the situation which data are disappear suddenly\", NOT\n" + " \"They are useless so hurry up to reclaim them\". So I want to apply same\n" + " aging rule of nomal pages to them.\n" + "\n" + " Anonymous page background aging of non-swap system would be a trade-off\n" + " for getting good feature. 
Even, we had done it two years ago until merge\n" + " [1] and I believe gain of this patch will beat loss of anon lru aging's\n" + " overead once all of allocator start to use madvise.\n" + " (This patch doesn't include background aging in case of non-swap system\n" + " but it's trivial if we decide)\n" + "\n" + "[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system\n" + "\n" + "Cc: Michael Kerrisk <mtk.manpages@gmail.com>\n" + "Cc: Arun Sharma <asharma@fb.com>\n" + "Cc: sanjay@google.com\n" + "Cc: Paul Turner <pjt@google.com>\n" + "CC: David Rientjes <rientjes@google.com>\n" + "Cc: John Stultz <john.stultz@linaro.org>\n" + "Cc: Andrew Morton <akpm@linux-foundation.org>\n" + "Cc: Christoph Lameter <cl@linux.com>\n" + "Cc: Android Kernel Team <kernel-team@android.com>\n" + "Cc: Robert Love <rlove@google.com>\n" + "Cc: Mel Gorman <mel@csn.ul.ie>\n" + "Cc: Hugh Dickins <hughd@google.com>\n" + "Cc: Dave Hansen <dave@linux.vnet.ibm.com>\n" + "Cc: Rik van Riel <riel@redhat.com>\n" + "Cc: Dave Chinner <david@fromorbit.com>\n" + "Cc: Neil Brown <neilb@suse.de>\n" + "Cc: Mike Hommey <mh@glandium.org>\n" + "Cc: Taras Glek <tglek@mozilla.com>\n" + "Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>\n" + "Cc: Christoph Lameter <cl@linux.com>\n" + "Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>\n" + "Signed-off-by: Minchan Kim <minchan@kernel.org>\n" + "---\n" + " arch/x86/mm/fault.c | 2 +\n" + " include/asm-generic/mman-common.h | 6 ++\n" + " include/linux/mm.h | 7 ++-\n" + " include/linux/rmap.h | 20 ++++++\n" + " include/linux/vm_event_item.h | 2 +-\n" + " mm/madvise.c | 19 +++++-\n" + " mm/memory.c | 32 ++++++++++\n" + " mm/migrate.c | 6 +-\n" + " mm/rmap.c | 125 ++++++++++++++++++++++++++++++++++++-\n" + " mm/vmscan.c | 7 +++\n" + " mm/vmstat.c | 1 +\n" + " 11 files changed, 218 insertions(+), 9 deletions(-)\n" + "\n" + "diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c\n" + "index 76dcd9d..17c1c20 100644\n" + "--- a/arch/x86/mm/fault.c\n" 
+ "+++ b/arch/x86/mm/fault.c\n" + "@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,\n" + " \t\t}\n" + " \n" + " \t\tout_of_memory(regs, error_code, address);\n" + "+\t} else if (fault & VM_FAULT_SIGSEG) {\n" + "+\t\t\tbad_area(regs, error_code, address);\n" + " \t} else {\n" + " \t\tif (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|\n" + " \t\t\t VM_FAULT_HWPOISON_LARGE))\n" + "diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h\n" + "index d030d2c..f07781e 100644\n" + "--- a/include/asm-generic/mman-common.h\n" + "+++ b/include/asm-generic/mman-common.h\n" + "@@ -34,6 +34,12 @@\n" + " #define MADV_SEQUENTIAL\t2\t\t/* expect sequential page references */\n" + " #define MADV_WILLNEED\t3\t\t/* will need these pages */\n" + " #define MADV_DONTNEED\t4\t\t/* don't need these pages */\n" + "+/*\n" + "+ * Unlike other flags, we need two locks to protect MADV_VOLATILE.\n" + "+ * For changing the flag, we need mmap_sem's write lock and volatile_lock\n" + "+ * while we just need volatile_lock in case of reading the flag.\n" + "+ */\n" + "+#define MADV_VOLATILE\t5\t\t/* pages will disappear suddenly */\n" + " \n" + " /* common parameters: try to keep these consistent across architectures */\n" + " #define MADV_REMOVE\t9\t\t/* remove these pages & resources */\n" + "diff --git a/include/linux/mm.h b/include/linux/mm.h\n" + "index 311be90..89027b5 100644\n" + "--- a/include/linux/mm.h\n" + "+++ b/include/linux/mm.h\n" + "@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);\n" + " #define VM_SAO\t\t0x20000000\t/* Strong Access Ordering (powerpc) */\n" + " #define VM_PFN_AT_MMAP\t0x40000000\t/* PFNMAP vma that is fully mapped at mmap time */\n" + " #define VM_MERGEABLE\t0x80000000\t/* KSM may merge identical pages */\n" + "+#define VM_VOLATILE\t0x100000000\t/* Pages in the vma could be discarable without swap */\n" + " \n" + " /* Bits set in the VMA until the stack is in its final location */\n" + " 
#define VM_STACK_INCOMPLETE_SETUP\t(VM_RAND_READ | VM_SEQ_READ)\n" + "@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);\n" + " * Special vmas that are non-mergable, non-mlock()able.\n" + " * Note: mm/huge_memory.c VM_NO_THP depends on this definition.\n" + " */\n" + "-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)\n" + "+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)\n" + " \n" + " /*\n" + " * mapping from the currently active vm_flags protection bits (the\n" + "@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)\n" + " #define VM_FAULT_NOPAGE\t0x0100\t/* ->fault installed the pte, not return page */\n" + " #define VM_FAULT_LOCKED\t0x0200\t/* ->fault locked the returned page */\n" + " #define VM_FAULT_RETRY\t0x0400\t/* ->fault blocked, must retry */\n" + "-\n" + "+#define VM_FAULT_SIGSEG\t0x0800\t/* -> There is no vma */\n" + " #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */\n" + " \n" + " #define VM_FAULT_ERROR\t(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \\\n" + "-\t\t\t VM_FAULT_HWPOISON_LARGE)\n" + "+\t\t\t VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)\n" + " \n" + " /* Encode hstate index for a hwpoisoned large page */\n" + " #define VM_FAULT_SET_HINDEX(x) ((x) << 12)\n" + "diff --git a/include/linux/rmap.h b/include/linux/rmap.h\n" + "index 3fce545..735d7a3 100644\n" + "--- a/include/linux/rmap.h\n" + "+++ b/include/linux/rmap.h\n" + "@@ -67,6 +67,9 @@ struct anon_vma_chain {\n" + " \tstruct list_head same_anon_vma;\t/* locked by anon_vma->mutex */\n" + " };\n" + " \n" + "+void volatile_lock(struct vm_area_struct *vma);\n" + "+void volatile_unlock(struct vm_area_struct *vma);\n" + "+\n" + " #ifdef CONFIG_MMU\n" + " static inline void get_anon_vma(struct anon_vma *anon_vma)\n" + " {\n" + "@@ -170,6 +173,7 @@ enum ttu_flags {\n" + " \tTTU_IGNORE_MLOCK = (1 << 8),\t/* ignore mlock */\n" + " \tTTU_IGNORE_ACCESS = (1 << 
9),\t/* don't age */\n" + " \tTTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */\n" + "+\tTTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */\n" + " };\n" + " #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)\n" + " \n" + "@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,\n" + " \treturn ptep;\n" + " }\n" + " \n" + "+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,\n" + "+ unsigned long, spinlock_t **);\n" + "+\n" + "+static inline pte_t *page_check_volatile_address(struct page *page,\n" + "+ struct mm_struct *mm,\n" + "+ unsigned long address,\n" + "+ spinlock_t **ptlp)\n" + "+{\n" + "+ pte_t *ptep;\n" + "+\n" + "+ __cond_lock(*ptlp, ptep = __page_check_volatile_address(page,\n" + "+ mm, address, ptlp));\n" + "+ return ptep;\n" + "+}\n" + "+\n" + " /*\n" + " * Used by swapoff to help locate where page is expected in vma.\n" + " */\n" + "@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)\n" + " #define SWAP_AGAIN\t1\n" + " #define SWAP_FAIL\t2\n" + " #define SWAP_MLOCK\t3\n" + "+#define SWAP_DISCARD\t4\n" + " \n" + " #endif\t/* _LINUX_RMAP_H */\n" + "diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h\n" + "index 57f7b10..3f9a40b 100644\n" + "--- a/include/linux/vm_event_item.h\n" + "+++ b/include/linux/vm_event_item.h\n" + "@@ -23,7 +23,7 @@\n" + " \n" + " enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,\n" + " \t\tFOR_ALL_ZONES(PGALLOC),\n" + "-\t\tPGFREE, PGACTIVATE, PGDEACTIVATE,\n" + "+\t\tPGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,\n" + " \t\tPGFAULT, PGMAJFAULT,\n" + " \t\tFOR_ALL_ZONES(PGREFILL),\n" + " \t\tFOR_ALL_ZONES(PGSTEAL_KSWAPD),\n" + "diff --git a/mm/madvise.c b/mm/madvise.c\n" + "index 14d260f..53a19d8 100644\n" + "--- a/mm/madvise.c\n" + "+++ b/mm/madvise.c\n" + "@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,\n" + " \t\tif (error)\n" + " \t\t\tgoto out;\n" + " \t\tbreak;\n" + 
"+\tcase MADV_VOLATILE:\n" + "+\t\tif (vma->vm_flags & VM_LOCKED) {\n" + "+\t\t\terror = -EINVAL;\n" + "+\t\t\tgoto out;\n" + "+\t\t}\n" + "+\t\tnew_flags |= VM_VOLATILE;\n" + "+\t\tbreak;\n" + " \t}\n" + " \n" + " \tif (new_flags == vma->vm_flags) {\n" + "@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,\n" + " success:\n" + " \t/*\n" + " \t * vm_flags is protected by the mmap_sem held in write mode.\n" + "+\t * In case of MADV_VOLATILE, we need anon_vma_lock additionally.\n" + " \t */\n" + "+\tif (behavior == MADV_VOLATILE)\n" + "+\t\tvolatile_lock(vma);\n" + " \tvma->vm_flags = new_flags;\n" + "-\n" + "+\tif (behavior == MADV_VOLATILE)\n" + "+\t\tvolatile_unlock(vma);\n" + " out:\n" + " \tif (error == -ENOMEM)\n" + " \t\terror = -EAGAIN;\n" + "@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)\n" + " #endif\n" + " \tcase MADV_DONTDUMP:\n" + " \tcase MADV_DODUMP:\n" + "+\tcase MADV_VOLATILE:\n" + " \t\treturn 1;\n" + " \n" + " \tdefault:\n" + "@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)\n" + " \t\tgoto out;\n" + " \tlen = (len_in + ~PAGE_MASK) & PAGE_MASK;\n" + " \n" + "+\tif (behavior != MADV_VOLATILE)\n" + "+\t\tlen = (len_in + ~PAGE_MASK) & PAGE_MASK;\n" + "+\telse\n" + "+\t\tlen = len_in & PAGE_MASK;\n" + "+\n" + " \t/* Check to see whether len was rounded up from small -ve to zero */\n" + " \tif (len_in && !len)\n" + " \t\tgoto out;\n" + "diff --git a/mm/memory.c b/mm/memory.c\n" + "index 5736170..b5e4996 100644\n" + "--- a/mm/memory.c\n" + "+++ b/mm/memory.c\n" + "@@ -57,6 +57,7 @@\n" + " #include <linux/swapops.h>\n" + " #include <linux/elf.h>\n" + " #include <linux/gfp.h>\n" + "+#include <linux/mempolicy.h>\n" + " \n" + " #include <asm/io.h>\n" + " #include <asm/pgalloc.h>\n" + "@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,\n" + " \t\t\t\t\treturn do_linear_fault(mm, vma, address,\n" + " \t\t\t\t\t\tpte, pmd, flags, entry);\n" + " \t\t\t}\n" + 
"+\t\t\tif (vma->vm_flags & VM_VOLATILE) {\n" + "+\t\t\t\tstruct vm_area_struct *prev;\n" + "+\n" + "+\t\t\t\tup_read(&mm->mmap_sem);\n" + "+\t\t\t\tdown_write(&mm->mmap_sem);\n" + "+\t\t\t\tvma = find_vma_prev(mm, address, &prev);\n" + "+\n" + "+\t\t\t\t/* Someone unmapped the vma */\n" + "+\t\t\t\tif (unlikely(!vma) || vma->vm_start > address) {\n" + "+\t\t\t\t\tdowngrade_write(&mm->mmap_sem);\n" + "+\t\t\t\t\treturn VM_FAULT_SIGSEG;\n" + "+\t\t\t\t}\n" + "+\t\t\t\t/* Someone else already handled */\n" + "+\t\t\t\tif (vma->vm_flags & VM_VOLATILE) {\n" + "+\t\t\t\t\t/*\n" + "+\t\t\t\t\t * From now on, we hold mmap_sem as\n" + "+\t\t\t\t\t * exclusive.\n" + "+\t\t\t\t\t */\n" + "+\t\t\t\t\tvolatile_lock(vma);\n" + "+\t\t\t\t\tvma->vm_flags &= ~VM_VOLATILE;\n" + "+\t\t\t\t\tvolatile_unlock(vma);\n" + "+\n" + "+\t\t\t\t\tvma_merge(mm, prev, vma->vm_start,\n" + "+\t\t\t\t\t\tvma->vm_end, vma->vm_flags,\n" + "+\t\t\t\t\t\tvma->anon_vma, vma->vm_file,\n" + "+\t\t\t\t\t\tvma->vm_pgoff, vma_policy(vma));\n" + "+\n" + "+\t\t\t\t}\n" + "+\n" + "+\t\t\t\tdowngrade_write(&mm->mmap_sem);\n" + "+\t\t\t}\n" + " \t\t\treturn do_anonymous_page(mm, vma, address,\n" + " \t\t\t\t\t\t pte, pmd, flags);\n" + " \t\t}\n" + "diff --git a/mm/migrate.c b/mm/migrate.c\n" + "index 77ed2d7..08b009c 100644\n" + "--- a/mm/migrate.c\n" + "+++ b/mm/migrate.c\n" + "@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,\n" + " \t}\n" + " \n" + " \t/* Establish migration ptes or remove ptes */\n" + "-\ttry_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);\n" + "+\ttry_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|\n" + "+\t\t\t\tTTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);\n" + " \n" + " skip_unmap:\n" + " \tif (!page_mapped(page))\n" + "@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,\n" + " \tif (PageAnon(hpage))\n" + " \t\tanon_vma = page_get_anon_vma(hpage);\n" + " \n" + "-\ttry_to_unmap(hpage, 
TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);\n" + "+\ttry_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|\n" + "+\t\t\t\tTTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);\n" + " \n" + " \tif (!page_mapped(hpage))\n" + " \t\trc = move_to_new_page(new_hpage, hpage, 1, mode);\n" + "diff --git a/mm/rmap.c b/mm/rmap.c\n" + "index 0f3b7cd..1a0ab2b 100644\n" + "--- a/mm/rmap.c\n" + "+++ b/mm/rmap.c\n" + "@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)\n" + " \treturn vma_address(page, vma);\n" + " }\n" + " \n" + "+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,\n" + "+\t\tunsigned long address, spinlock_t **ptlp)\n" + "+{\n" + "+\tpgd_t *pgd;\n" + "+\tpud_t *pud;\n" + "+\tpmd_t *pmd;\n" + "+\tpte_t *pte;\n" + "+\tspinlock_t *ptl;\n" + "+\n" + "+\tswp_entry_t entry = { .val = page_private(page) };\n" + "+\n" + "+\tif (unlikely(PageHuge(page))) {\n" + "+\t\tpte = huge_pte_offset(mm, address);\n" + "+\t\tptl = &mm->page_table_lock;\n" + "+\t\tgoto check;\n" + "+\t}\n" + "+\n" + "+\tpgd = pgd_offset(mm, address);\n" + "+\tif (!pgd_present(*pgd))\n" + "+\t\treturn NULL;\n" + "+\n" + "+\tpud = pud_offset(pgd, address);\n" + "+\tif (!pud_present(*pud))\n" + "+\t\treturn NULL;\n" + "+\n" + "+\tpmd = pmd_offset(pud, address);\n" + "+\tif (!pmd_present(*pmd))\n" + "+\t\treturn NULL;\n" + "+\tif (pmd_trans_huge(*pmd))\n" + "+\t\treturn NULL;\n" + "+\n" + "+\tpte = pte_offset_map(pmd, address);\n" + "+\tptl = pte_lockptr(mm, pmd);\n" + "+check:\n" + "+\tspin_lock(ptl);\n" + "+\tif (PageAnon(page)) {\n" + "+\t\tif (!pte_present(*pte) && entry.val ==\n" + "+\t\t\t\tpte_to_swp_entry(*pte).val) {\n" + "+\t\t\t*ptlp = ptl;\n" + "+\t\t\treturn pte;\n" + "+\t\t}\n" + "+\t} else {\n" + "+\t\tif (pte_none(*pte)) {\n" + "+\t\t\t*ptlp = ptl;\n" + "+\t\t\treturn pte;\n" + "+\t\t}\n" + "+\t}\n" + "+\tpte_unmap_unlock(pte, ptl);\n" + "+\treturn NULL;\n" + "+}\n" + "+\n" + " /*\n" + " * Check that @page is mapped at 
@address into @mm.\n" + " *\n" + "@@ -1218,6 +1269,35 @@ out:\n" + " \t\tmem_cgroup_end_update_page_stat(page, &locked, &flags);\n" + " }\n" + " \n" + "+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,\n" + "+ unsigned long address)\n" + "+{\n" + "+ struct mm_struct *mm = vma->vm_mm;\n" + "+ pte_t *pte;\n" + "+ pte_t pteval;\n" + "+ spinlock_t *ptl;\n" + "+\n" + "+ pte = page_check_volatile_address(page, mm, address, &ptl);\n" + "+ if (!pte)\n" + "+ return 0;\n" + "+\n" + "+ /* Nuke the page table entry. */\n" + "+ flush_cache_page(vma, address, page_to_pfn(page));\n" + "+ pteval = ptep_clear_flush(vma, address, pte);\n" + "+\n" + "+ if (PageAnon(page)) {\n" + "+ swp_entry_t entry = { .val = page_private(page) };\n" + "+ if (PageSwapCache(page)) {\n" + "+ dec_mm_counter(mm, MM_SWAPENTS);\n" + "+ swap_free(entry);\n" + "+ }\n" + "+ }\n" + "+\n" + "+ pte_unmap_unlock(pte, ptl);\n" + "+ mmu_notifier_invalidate_page(mm, address);\n" + "+ return 1;\n" + "+}\n" + "+\n" + " /*\n" + " * Subfunctions of try_to_unmap: try_to_unmap_one called\n" + " * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.\n" + "@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)\n" + " \tstruct anon_vma *anon_vma;\n" + " \tstruct anon_vma_chain *avc;\n" + " \tint ret = SWAP_AGAIN;\n" + "+\tbool is_volatile = true;\n" + "+\n" + "+\tif (flags & TTU_IGNORE_VOLATILE)\n" + "+\t\tis_volatile = false;\n" + " \n" + " \tanon_vma = page_lock_anon_vma(page);\n" + " \tif (!anon_vma)\n" + "@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)\n" + " \t\t * temporary VMAs until after exec() completes.\n" + " \t\t */\n" + " \t\tif (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&\n" + "-\t\t\t\tis_vma_temporary_stack(vma))\n" + "+\t\t\t\tis_vma_temporary_stack(vma)) {\n" + "+\t\t\tis_volatile = false;\n" + " \t\t\tcontinue;\n" + "+\t\t}\n" + " \n" + " \t\taddress = vma_address(page, 
vma);\n" + " \t\tif (address == -EFAULT)\n" + " \t\t\tcontinue;\n" + "+ /*\n" + "+ * A volatile page will only be purged if ALL vmas\n" + "+\t\t * pointing to it are VM_VOLATILE.\n" + "+ */\n" + "+ if (!(vma->vm_flags & VM_VOLATILE))\n" + "+ is_volatile = false;\n" + "+\n" + " \t\tret = try_to_unmap_one(page, vma, address, flags);\n" + " \t\tif (ret != SWAP_AGAIN || !page_mapped(page))\n" + " \t\t\tbreak;\n" + " \t}\n" + " \n" + "+ if (page_mapped(page) || is_volatile == false)\n" + "+ goto out;\n" + "+\n" + "+ list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {\n" + "+ struct vm_area_struct *vma = avc->vma;\n" + "+ unsigned long address;\n" + "+\n" + "+ address = vma_address(page, vma);\n" + "+ try_to_zap_one(page, vma, address);\n" + "+ }\n" + "+ /* We're throwing this page out, so mark it clean */\n" + "+ ClearPageDirty(page);\n" + "+ ret = SWAP_DISCARD;\n" + "+out:\n" + " \tpage_unlock_anon_vma(anon_vma);\n" + " \treturn ret;\n" + " }\n" + "@@ -1651,6 +1758,7 @@ out:\n" + " * SWAP_AGAIN\t- we missed a mapping, try again later\n" + " * SWAP_FAIL\t- the page is unswappable\n" + " * SWAP_MLOCK\t- page is mlocked.\n" + "+ * SWAP_DISCARD - page is volatile.\n" + " */\n" + " int try_to_unmap(struct page *page, enum ttu_flags flags)\n" + " {\n" + "@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)\n" + " \t\tret = try_to_unmap_anon(page, flags);\n" + " \telse\n" + " \t\tret = try_to_unmap_file(page, flags);\n" + "-\tif (ret != SWAP_MLOCK && !page_mapped(page))\n" + "+\tif (ret != SWAP_MLOCK && !page_mapped(page) &&\n" + "+\t\t\t\t\tret != SWAP_DISCARD)\n" + " \t\tret = SWAP_SUCCESS;\n" + " \treturn ret;\n" + " }\n" + "@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)\n" + " \tanon_vma_free(anon_vma);\n" + " }\n" + " \n" + "+void volatile_lock(struct vm_area_struct *vma)\n" + "+{\n" + "+ if (vma->anon_vma)\n" + "+ anon_vma_lock(vma->anon_vma);\n" + "+}\n" + "+\n" + "+void volatile_unlock(struct vm_area_struct 
*vma)\n" + "+{\n" + "+ if (vma->anon_vma)\n" + "+ anon_vma_unlock(vma->anon_vma);\n" + "+}\n" + "+\n" + " #ifdef CONFIG_MIGRATION\n" + " /*\n" + " * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():\n" + "diff --git a/mm/vmscan.c b/mm/vmscan.c\n" + "index 99b434b..4e463a4 100644\n" + "--- a/mm/vmscan.c\n" + "+++ b/mm/vmscan.c\n" + "@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,\n" + " \tif (vm_flags & VM_LOCKED)\n" + " \t\treturn PAGEREF_RECLAIM;\n" + " \n" + "+\tif (vm_flags & VM_VOLATILE)\n" + "+\t\treturn PAGEREF_RECLAIM;\n" + "+\n" + " \tif (referenced_ptes) {\n" + " \t\tif (PageSwapBacked(page))\n" + " \t\t\treturn PAGEREF_ACTIVATE;\n" + "@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,\n" + " \t\t */\n" + " \t\tif (page_mapped(page) && mapping) {\n" + " \t\t\tswitch (try_to_unmap(page, TTU_UNMAP)) {\n" + "+\t\t\tcase SWAP_DISCARD:\n" + "+\t\t\t\tcount_vm_event(PGVOLATILE);\n" + "+\t\t\t\tgoto discard_page;\n" + " \t\t\tcase SWAP_FAIL:\n" + " \t\t\t\tgoto activate_locked;\n" + " \t\t\tcase SWAP_AGAIN:\n" + "@@ -857,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,\n" + " \t\t\t}\n" + " \t\t}\n" + " \n" + "+discard_page:\n" + " \t\t/*\n" + " \t\t * If the page has buffers, try to free the buffer mappings\n" + " \t\t * associated with this page. If we succeed we try to free\n" + "diff --git a/mm/vmstat.c b/mm/vmstat.c\n" + "index df7a674..410caf5 100644\n" + "--- a/mm/vmstat.c\n" + "+++ b/mm/vmstat.c\n" + "@@ -734,6 +734,7 @@ const char * const vmstat_text[] = {\n" + " \tTEXTS_FOR_ZONES(\"pgalloc\")\n" + " \n" + " \t\"pgfree\",\n" + "+\t\"pgvolatile\",\n" + " \t\"pgactivate\",\n" + " \t\"pgdeactivate\",\n" + " \n" + "-- \n" + "1.7.9.5\n" + "\n" + "-- \n" + "Kind regards,\n" + Minchan Kim -bbb39ee4e4da38253f234f7578ddcf32588a6bcb78c81d1c7deb466cd16df0e3 +50667ac0930e903f0758f3f122e5967b978d38feeda060bb4787d6a81dc241b6