linux-mm.kvack.org archive mirror
* [RFC v5 0/8] Support volatile for anonymous range
@ 2013-01-03  4:27 Minchan Kim
  2013-01-03  4:27 ` [RFC 1/8] Introduce new system call mvolatile Minchan Kim
                   ` (8 more replies)
  0 siblings, 9 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-03  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Michael Kerrisk, Arun Sharma,
	sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

This is still an RFC because we need more input from user-space
people, more stress testing, and design discussion about the interface
and the reclaim policy for volatile pages. I also want to expand this
concept to tmpfs volatile ranges if that is possible without a big
performance drop for anonymous volatile ranges.
(Let's settle on terminology: anon volatile vs. tmpfs volatile? John?)

I hope to get more input from user-space allocator people and to have
the patch tested with their allocators, because getting real value out
of it might require changes to their arena management design.

TODO
 * Improve volatile range scanning speed
 * Aware of NUMA policy with vma's mempolicy
 * Add direct reclaim hook for discarding volatile pages first
 * Support tmpfs-volatile

Changelog from v5 - There are many changes.

 * Support CONFIG_VOLATILE_PAGE
 * Work with THP/KSM
 * Remove the vma hacking logic from the m[no]volatile system calls
 * Discard pages without involving the swap cache
 * Kswapd discards volatile pages, so we can discard volatile pages
   even on systems without swap.

Changelog from v4

 * Add new system calls mvolatile/mnovolatile
 * Deliver SIGBUS when the user tries to access a discarded volatile range
 * Rebased on v3.7
 * Applied a bug fix from John Stultz, thanks!

Changelog from v3

 * Remove madvise(addr, length, MADV_NOVOLATILE)
 * Add a vmstat counter for the number of discarded volatile pages
 * Discard volatile pages without promotion in the reclaim path

This is based on v3.7

- What is mvolatile(addr, length)?

  It is a hint the user gives the kernel so that the kernel may *discard*
  the pages in the range at any time.

- What happens if the user accesses a page (i.e., virtual address) that
  the kernel has discarded?

  The user can encounter SIGBUS.

- What should the user do to avoid SIGBUS?

  Call mnovolatile(addr, length) before accessing a range that was
  previously passed to mvolatile.

- What happens if the user accesses a page (i.e., virtual address) that
  the kernel has not discarded?

  The user sees the old data without a page fault.
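
  To make these semantics concrete, here is a minimal user-space sketch
  (for illustration only, not part of the patch set). It invokes the new
  system calls directly via syscall(2), using the x86-64 numbers 313/314
  that patch 1/8 adds, just like the test program at the end of this
  mail:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SYS_mvolatile	313	/* x86-64 numbers added by patch 1/8 */
#define SYS_mnovolatile	314

int main(void)
{
	size_t len = 16 * 4096;
	char *buf = mmap(NULL, len, PROT_READ|PROT_WRITE,
			 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 1, len);			/* populate the range */
	syscall(SYS_mvolatile, buf, len);	/* kernel may discard it from now on */

	/*
	 * ... later, before touching the range again, withdraw the hint
	 * and check whether anything was discarded in the meantime.
	 */
	if (syscall(SYS_mnovolatile, buf, len) == 1)
		memset(buf, 1, len);		/* purged: regenerate the contents */
	else
		printf("old data intact: first byte = %d\n", buf[0]);

	return 0;
}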

- How is it different from madvise(DONTNEED)?

  System call semantics

  DONTNEED guarantees the user always sees zero-filled pages after the
  call, while with mvolatile the user may see the old data or encounter
  SIGBUS.

  Internal implementation

  madvise(DONTNEED) has to zap all mapped pages in the range, so its
  overhead grows linearly with the number of mapped pages. Worse, if
  the user later writes to a zapped page, a page fault, a page
  allocation and a memset happen on top of that.

  mvolatile just marks a flag on the range (i.e., the VMA) instead of
  zapping all of the ptes in the vma, so it does not touch the ptes at
  all.

- What's the benefit compared to DONTNEED?

  1. The system call overhead is smaller because mvolatile just marks
     a flag on the VMA instead of zapping every page in the range
     (see the allocator sketch below).

  2. It has a chance to eliminate the later overhead (zapping ptes +
     page fault + page allocation + memset(PAGE_SIZE)) entirely if
     memory pressure isn't severe.

  3. It can still zap all the ptes and free the pages if memory
     pressure is severe, so the reclaim overhead could disappear - TODO
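
  As a rough illustration of how a user-space allocator might use this
  (a hand-written sketch, not taken from any real allocator; the arena
  structure and hook names are hypothetical, and the wrappers and
  x86-64 syscall numbers match the test program at the end of this
  mail):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SYS_mvolatile	313	/* x86-64 numbers added by patch 1/8 */
#define SYS_mnovolatile	314

static int mvolatile(void *addr, size_t len)
{
	return syscall(SYS_mvolatile, addr, len);
}

static int mnovolatile(void *addr, size_t len)
{
	return syscall(SYS_mnovolatile, addr, len);
}

/* Hypothetical arena descriptor of a user-space allocator. */
struct arena {
	void	*base;
	size_t	size;
	int	populated;	/* do the pages still hold valid data? */
};

/* Free path: let the kernel discard the pages lazily under memory
 * pressure instead of zapping them right away as MADV_DONTNEED would. */
static void arena_release(struct arena *a)
{
	mvolatile(a->base, a->size);
}

/* Alloc path: withdraw the hint and learn whether the data survived. */
static void arena_reuse(struct arena *a)
{
	if (mnovolatile(a->base, a->size) == 1)
		a->populated = 0;	/* purged: caller must reinitialize */
}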

- Isn't there any drawback?

  madvise(DONTNEED) doesn't need the exclusive mmap_sem, so concurrent
  page faults from other threads are still allowed. m[no]volatile does
  need the exclusive mmap_sem, so other threads would be blocked if they
  try to access not-yet-mapped pages. That's why I designed the
  m[no]volatile overhead to be as small as possible.

  It could also increase the maximum RSS, because madvise(DONTNEED)
  deallocates pages instantly when the system call is issued while
  mvolatile delays that until memory pressure happens; if memory
  pressure becomes severe due to the increased RSS, the system would
  suffer. For now the allocator needs some balancing logic for that, or
  the kernel could handle it by zapping the pages despite the mvolatile
  call when memory pressure is severe. The problem is how we know that
  memory pressure is severe. One solution is to check whether kswapd is
  active; another is Anton's mempressure work, so the allocator can
  handle it itself.

- What is this targeting?

  Firstly, user-space allocators like ptmalloc or tcmalloc and the heap
  management of virtual machines like Dalvik. It also comes in handy on
  embedded systems which have no swap device and therefore cannot
  reclaim anonymous pages: by discarding instead of swapping out, it can
  be used on swapless systems.

- Stupid performance test

  I attach a test program/script below. It is utter crap and no real
  allocator would behave like it, so we need more practical data with a
  real allocator.

  KVM - 8 core, 2G

VOLATILE test
13.16user 7.58system 0:06.04elapsed 343%CPU (0avgtext+0avgdata 2624096maxresident)k
0inputs+0outputs (0major+164050minor)pagefaults 0swaps

DONTNEED test
23.30user 228.92system 0:33.10elapsed 762%CPU (0avgtext+0avgdata 213088maxresident)k
0inputs+0outputs (0major+16384210minor)pagefaults 0swaps

  x86-64 - 12 core, 2G

VOLATILE test
33.38user 0.44system 0:02.87elapsed 1178%CPU (0avgtext+0avgdata 3935008maxresident)k
0inputs+0outputs (0major+245989minor)pagefaults 0swaps

DONTNEED test
28.02user 41.25system 0:05.80elapsed 1192%CPU (0avgtext+0avgdata 387776maxresident)k

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system

Any comments are welcome!

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
CC: David Rientjes <rientjes@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Minchan Kim (8):
  Introduce new system call mvolatile
  Don't allow volatile attribute on THP and KSM
  bail out when the page is in VOLATILE vma
  add page_locked parameter in free_swap_and_cache
  Discard volatile page
  add PGVOLATILE vmstat count
  add volatile page discard hook to kswapd
  extend PGVOLATILE vmstat to kswapd

 arch/x86/syscalls/syscall_64.tbl |    2 +
 fs/exec.c                        |    4 +-
 include/linux/memory.h           |    2 +
 include/linux/mm.h               |    6 +-
 include/linux/mm_types.h         |    4 +
 include/linux/mvolatile.h        |   63 +++
 include/linux/rmap.h             |    2 +
 include/linux/sched.h            |    1 +
 include/linux/swap.h             |    6 +-
 include/linux/syscalls.h         |    2 +
 include/linux/vm_event_item.h    |    4 +
 kernel/fork.c                    |    2 +
 mm/Kconfig                       |   11 +
 mm/Makefile                      |    2 +-
 mm/fremap.c                      |    2 +-
 mm/huge_memory.c                 |    9 +-
 mm/internal.h                    |    2 +
 mm/ksm.c                         |    3 +-
 mm/madvise.c                     |    2 +-
 mm/memory.c                      |   12 +-
 mm/mempolicy.c                   |    2 +-
 mm/mlock.c                       |    7 +-
 mm/mmap.c                        |   62 ++-
 mm/mprotect.c                    |    3 +-
 mm/mremap.c                      |    2 +-
 mm/mvolatile.c                   |  813 ++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                        |   11 +-
 mm/shmem.c                       |    2 +-
 mm/swapfile.c                    |    7 +-
 mm/vmscan.c                      |   57 ++-
 mm/vmstat.c                      |    4 +
 31 files changed, 1065 insertions(+), 46 deletions(-)
 create mode 100644 include/linux/mvolatile.h
 create mode 100644 mm/mvolatile.c

================== 8< =============================

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SYS_mvolatile 313
#define SYS_mnovolatile 314

#define ALLOC_SIZE (8 << 20)
#define MAP_SIZE  (ALLOC_SIZE * 10)
#define PAGE_SIZE (1 << 12)
#define RETRY 100

pthread_barrier_t barrier;
int mode;
#define VOLATILE_MODE 1

static int mvolatile(void *addr, size_t length)
{
	return syscall(SYS_mvolatile, addr, length);
}

static int mnovolatile(void *addr, size_t length)
{
	return syscall(SYS_mnovolatile, addr, length);
}

void *thread_entry(void *data)
{
	unsigned long i;
	cpu_set_t set;
	int cpu = *(int*)data;
	void *mmap_area;
	int retry = RETRY;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);

	mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	mmap_area = mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE,
					MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	if (mmap_area == MAP_FAILED) {
		fprintf(stderr, "Fail to mmap [%d]\n", *(int*)data);
		exit(1);
	}

	pthread_barrier_wait(&barrier);

	while(retry--) {
		if (mode == VOLATILE_MODE) {
			/*
			 * Mark the whole area volatile, then for each chunk:
			 * take the hint back, dirty the chunk, and mark it
			 * volatile again.
			 */
			mvolatile(mmap_area, MAP_SIZE);
			for (i = 0; i < MAP_SIZE; i += ALLOC_SIZE) {
				mnovolatile(mmap_area + i, ALLOC_SIZE);
				memset(mmap_area + i, i, ALLOC_SIZE);
				mvolatile(mmap_area + i, ALLOC_SIZE);
			}
		} else {
			/*
			 * Baseline: dirty each chunk and drop it with
			 * MADV_DONTNEED.
			 */
			for (i = 0; i < MAP_SIZE; i += ALLOC_SIZE) {
				memset(mmap_area + i, i, ALLOC_SIZE);
				madvise(mmap_area + i, ALLOC_SIZE, MADV_DONTNEED);
			}
		}
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	int i, nr_thread;
	int *data;

	if (argc < 3)
		return 1;

	nr_thread = atoi(argv[1]);
	mode = atoi(argv[2]);

	pthread_t *thread = malloc(sizeof(pthread_t) * nr_thread);
	data = malloc(sizeof(int) * nr_thread);
	pthread_barrier_init(&barrier, NULL, nr_thread);

	for (i = 0; i < nr_thread; i++) {
		data[i] = i;
		if (pthread_create(&thread[i], NULL, thread_entry, &data[i])) {
			perror("Fail to create thread\n");
			exit(1);
		}
	}

	for (i = 0; i < nr_thread; i++) {
		if (pthread_join(thread[i], NULL))
			perror("Fail to join thread\n");
		printf("[%d] thread done\n", i);
	}

	return 0;
}

-- 
1.7.9.5


* [RFC 1/8] Introduce new system call mvolatile
  2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
@ 2013-01-03  4:27 ` Minchan Kim
  2013-01-03 18:35   ` Taras Glek
  2013-01-17  1:48   ` John Stultz
  2013-01-03  4:28 ` [RFC 2/8] Don't allow volatile attribute on THP and KSM Minchan Kim
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-03  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Michael Kerrisk, Arun Sharma,
	sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

This patch adds the new system calls m[no]volatile.
If someone asks for an is_volatile system call, it could be added, too.

The reason I introduced new system calls instead of extending madvise
is that m[no]volatile's vma handling is quite different from madvise's
vma handling:

1) m[no]volatile should succeed even if the range includes unmapped
   or non-volatile areas. It just skips such areas instead of stopping
   and returning an error when it encounters an invalid range. This is
   convenient for users, who would otherwise have to issue several
   system calls on small ranges - suggested by John Stultz.

2) The purged state of a volatile range should be propagated out to
   the user even if the range is merged with an adjacent non-volatile
   range when the user calls mnovolatile.

3) mvolatile's interface may diverge from madvise in future
   discussion. For example, I feel we may need
   mvolatile(start, len, mode), where 'mode' is FULL_VOLATILE or
   PARTIAL_VOLATILE. FULL_VOLATILE means that if the VM decides to
   reclaim the range, it reclaims all of the pages in the range, while
   with PARTIAL_VOLATILE the VM may reclaim just a few pages in the
   range.
   In the tmpfs-volatile case, the user may have to regenerate all of
   the image data once any page in the range is discarded, so it is
   pointless for the VM to discard only one page of the range when
   memory pressure is severe.
   In the anon-volatile case, overly aggressive discarding causes too
   many minor faults for the allocator, so it is better to discard only
   part of the range.

4) The mvolatile system call's return values are quite different from
   madvise's. See the semantics explanation below.

So I want to separate mvolatile from madvise.

mvolatile(start, len)'s semantics

1) It makes the range (start, len) volatile; unmapped areas, special
mappings and mlocked areas within the range are just skipped.

Return -EINVAL if the range doesn't include a suitable vma at all.
Return -ENOMEM, interrupting the range operation, if there is not
enough memory to merge/split vmas. In this case some of the range may
already be volatile and the rest not, so the user may retry mvolatile
after cancelling the whole range with mnovolatile.
Return 0 if the range consists only of proper vmas.
Return 1 if part of the range includes a hole/huge/ksm/mlock/special area.

2) If the user calls mvolatile on a range that is already a volatile
VMA, even one in the purged state, the VOLATILE attribute remains but
the purged state is reset. I expect some users will want to split a
volatile vma into smaller ranges. Although they could do that with
mnovolatile(whole range) followed by several mvolatile(smaller range)
calls, this behaviour lets them skip the mnovolatile step if they don't
care about the purged state. I'm not sure we really need it, so I'd
like to hear opinions. Unfortunately, the current implementation doesn't
split the volatile VMA along the new range in this case; I forgot to
implement that in this version but decided to send the series anyway to
gather opinions, because implementing it is rather trivial once we decide.

mnovolatile(start, len)'s semantics are as follows.

1) It makes the range (start, len) non-volatile; unmapped areas,
special mappings and non-volatile ranges within it are just skipped.

2) If the range was purged, it returns 1 regardless of whether the
range also includes invalid areas.

3) It returns -ENOMEM if the system doesn't have enough memory for the
vma operations.

4) It returns -EINVAL if the range doesn't include a suitable vma at all.

5) If the user tries to access a purged range without calling
mnovolatile, the access raises SIGBUS, which the next patch implements.
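
For illustration only (not part of this patch): a minimal user-space
sketch that provokes and catches the SIGBUS described in 5). It uses
the x86-64 syscall number 313 added by this patch; whether the read
actually faults depends on whether the kernel purged the range in the
meantime.

#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SYS_mvolatile	313	/* x86-64 number added by this patch */

static sigjmp_buf env;

static void bus_handler(int sig)
{
	(void)sig;
	siglongjmp(env, 1);		/* jump back out of the faulting access */
}

int main(void)
{
	size_t len = 1 << 20;
	char *buf;

	signal(SIGBUS, bus_handler);

	buf = mmap(NULL, len, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 1, len);
	syscall(SYS_mvolatile, buf, len);

	if (sigsetjmp(env, 1) == 0)
		/* If the kernel purged the range, this read raises SIGBUS. */
		printf("still there: first byte = %d\n", buf[0]);
	else
		printf("range was purged: call mnovolatile() and regenerate\n");

	return 0;
}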

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
CC: David Rientjes <rientjes@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/x86/syscalls/syscall_64.tbl |    2 +
 fs/exec.c                        |    4 +-
 include/linux/mm.h               |    6 +-
 include/linux/mm_types.h         |    4 +
 include/linux/mvolatile.h        |   30 ++++
 include/linux/syscalls.h         |    2 +
 mm/Kconfig                       |   11 ++
 mm/Makefile                      |    2 +-
 mm/madvise.c                     |    2 +-
 mm/mempolicy.c                   |    2 +-
 mm/mlock.c                       |    7 +-
 mm/mmap.c                        |   62 ++++++--
 mm/mprotect.c                    |    3 +-
 mm/mremap.c                      |    2 +-
 mm/mvolatile.c                   |  312 ++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                        |    2 +
 16 files changed, 427 insertions(+), 26 deletions(-)
 create mode 100644 include/linux/mvolatile.h
 create mode 100644 mm/mvolatile.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a582bfe..568d488 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -319,6 +319,8 @@
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
+313	common	mvolatile		sys_mvolatile
+314	common	mnovolatile		sys_mnovolatile
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index 0039055..da677d1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -594,7 +594,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	/*
 	 * cover the whole range: [new_start, old_end)
 	 */
-	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
+	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL, NULL))
 		return -ENOMEM;
 
 	/*
@@ -628,7 +628,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
 	 */
-	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL, NULL);
 
 	return 0;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcaab4e..4bb59f3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -87,6 +87,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
 
+#define VM_VOLATILE	0x00001000	/* Pages could be discarded without swapout */
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
@@ -1411,11 +1412,12 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 /* mmap.c */
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
 extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
+	bool *purged);
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *);
+	struct mempolicy *, bool *purged);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..1eaf458 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -275,6 +275,10 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+#ifdef CONFIG_VOLATILE_PAGE
+	/* True if more than a page in this vma is reclaimed. */
+	bool purged;	/* Serialized by mmap_sem and anon_vma's mutex */
+#endif
 };
 
 struct core_thread {
diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
new file mode 100644
index 0000000..cfb12b4
--- /dev/null
+++ b/include/linux/mvolatile.h
@@ -0,0 +1,30 @@
+#ifndef __LINUX_MVOLATILE_H
+#define __LINUX_MVOLATILE_H
+
+#include <linux/syscalls.h>
+
+#ifdef CONFIG_VOLATILE_PAGE
+static inline bool vma_purged(struct vm_area_struct *vma)
+{
+	return vma->purged;
+}
+
+static inline void vma_purge_copy(struct vm_area_struct *dst,
+					struct vm_area_struct *src)
+{
+	dst->purged = src->purged;
+}
+#else
+static inline bool vma_purged(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline void vma_purge_copy(struct vm_area_struct *dst,
+					struct vm_area_struct *src)
+{
+
+}
+#endif
+#endif
+
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 727f0cd..a8ded1c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -470,6 +470,8 @@ asmlinkage long sys_munlock(unsigned long start, size_t len);
 asmlinkage long sys_mlockall(int flags);
 asmlinkage long sys_munlockall(void);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_mvolatile(unsigned long start, size_t len);
+asmlinkage long sys_mnovolatile(unsigned long start, size_t len);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index a3f8ddd..30b24ba 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -355,6 +355,17 @@ choice
 	  benefit.
 endchoice
 
+config VOLATILE_PAGE
+	bool "Volatile Page Support"
+	depends on MMU
+	help
+	  Enabling this option adds the system calls mvolatile and mnovolatile,
+	  which let a process hand a range of its address space to the kernel
+	  so the VM can discard pages in that range at any time instead of
+	  swapping them out. This can enhance the performance of certain
+	  applications (e.g. memory allocators, a web browser's tmpfs pages)
+	  by reducing the number of minor faults and swap-outs.
+
 config CROSS_MEMORY_ATTACH
 	bool "Cross Memory Support"
 	depends on MMU
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..1efb735 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o pgtable-generic.o
+			   mvolatile.o vmalloc.o pagewalk.o pgtable-generic.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/madvise.c b/mm/madvise.c
index 03dfa5c..6ffad21 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -99,7 +99,7 @@ static long madvise_behavior(struct vm_area_struct * vma,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
-				vma->vm_file, pgoff, vma_policy(vma));
+				vma->vm_file, pgoff, vma_policy(vma), NULL);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ea600d..9b1aa2d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -675,7 +675,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
 				  vma->anon_vma, vma->vm_file, pgoff,
-				  new_pol);
+				  new_pol, NULL);
 		if (prev) {
 			vma = prev;
 			next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index f0b9ce5..e03523a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -316,13 +316,14 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	int ret = 0;
 	int lock = !!(newflags & VM_LOCKED);
 
-	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
+	if (newflags == vma->vm_flags || (vma->vm_flags &
+		(VM_SPECIAL|VM_VOLATILE)) || is_vm_hugetlb_page(vma) ||
+		vma == get_gate_vma(current->mm))
 		goto out;	/* don't set VM_LOCKED,  don't count */
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
-			  vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma), NULL);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index 9a796c4..ba636c3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -31,6 +31,7 @@
 #include <linux/audit.h>
 #include <linux/khugepaged.h>
 #include <linux/uprobes.h>
+#include <linux/mvolatile.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -516,7 +517,8 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
  * before we drop the necessary locks.
  */
 int vma_adjust(struct vm_area_struct *vma, unsigned long start,
-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
+	bool *purged)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *next = vma->vm_next;
@@ -527,10 +529,9 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	struct file *file = vma->vm_file;
 	long adjust_next = 0;
 	int remove_next = 0;
+	struct vm_area_struct *exporter = NULL;
 
 	if (next && !insert) {
-		struct vm_area_struct *exporter = NULL;
-
 		if (end >= next->vm_end) {
 			/*
 			 * vma expands, overlapping all the next, and
@@ -621,6 +622,15 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (adjust_next) {
 		next->vm_start += adjust_next << PAGE_SHIFT;
 		next->vm_pgoff += adjust_next;
+		/*
+		 * Look at mm/mvolatile.c for knowing terminology.
+		 * V4. NNPPVV -> NNNPVV
+		 */
+		if (purged) {
+			*purged = vma_purged(next);
+			if (exporter == vma) /* V5. VVPPNN -> VVPNNN */
+				*purged = vma_purged(vma);
+		}
 	}
 
 	if (root) {
@@ -651,6 +661,13 @@ again:			remove_next = 1 + (end > next->vm_end);
 		anon_vma_interval_tree_post_update_vma(vma);
 		if (adjust_next)
 			anon_vma_interval_tree_post_update_vma(next);
+		/*
+		 * Look at mm/mvolatile.c for knowing terminology.
+		 * V7. VVPPVV -> VVNPVV
+		 * V8. VVPPVV -> VVPNVV
+		 */
+		if (insert)
+			vma_purge_copy(insert, vma);
 		anon_vma_unlock(anon_vma);
 	}
 	if (mapping)
@@ -670,6 +687,20 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
+
+		/*
+		 * next is detached from the anon vma chain, so purged can't
+		 * be raced on any more.
+		 * Look at mm/mvolatile.c for knowing terminology.
+		 *
+		 * V1. NNPPVV -> NNNNVV
+		 * V2. VVPPNN -> VVNNNN
+		 * V3. NNPPNN -> NNNNNN
+		 */
+		if (purged) {
+			*purged |= vma_purged(vma); /* case V2 */
+			*purged |= vma_purged(next); /* case V1,V3 */
+		}
 		mm->map_count--;
 		mpol_put(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
@@ -798,7 +829,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
 		     	struct anon_vma *anon_vma, struct file *file,
-			pgoff_t pgoff, struct mempolicy *policy)
+			pgoff_t pgoff, struct mempolicy *policy, bool *purged)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -808,7 +839,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	 * We later require that vma->vm_flags == vm_flags,
 	 * so this tests vma->vm_flags & VM_SPECIAL, too.
 	 */
-	if (vm_flags & VM_SPECIAL)
+	if (vm_flags & (VM_SPECIAL|VM_VOLATILE))
 		return NULL;
 
 	if (prev)
@@ -837,10 +868,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 						      next->anon_vma, NULL)) {
 							/* cases 1, 6 */
 			err = vma_adjust(prev, prev->vm_start,
-				next->vm_end, prev->vm_pgoff, NULL);
+				next->vm_end, prev->vm_pgoff, NULL, purged);
 		} else					/* cases 2, 5, 7 */
 			err = vma_adjust(prev, prev->vm_start,
-				end, prev->vm_pgoff, NULL);
+				end, prev->vm_pgoff, NULL, purged);
 		if (err)
 			return NULL;
 		khugepaged_enter_vma_merge(prev);
@@ -856,10 +887,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 					anon_vma, file, pgoff+pglen)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = vma_adjust(prev, prev->vm_start,
-				addr, prev->vm_pgoff, NULL);
+				addr, prev->vm_pgoff, NULL, purged);
 		else					/* cases 3, 8 */
 			err = vma_adjust(area, addr, next->vm_end,
-				next->vm_pgoff - pglen, NULL);
+				next->vm_pgoff - pglen, NULL, purged);
 		if (err)
 			return NULL;
 		khugepaged_enter_vma_merge(area);
@@ -1292,7 +1323,8 @@ munmap_back:
 	/*
 	 * Can we just expand an old mapping?
 	 */
-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file,
+				pgoff, NULL, NULL);
 	if (vma)
 		goto out;
 
@@ -2025,9 +2057,10 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 
 	if (new_below)
 		err = vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
-			((addr - new->vm_start) >> PAGE_SHIFT), new);
+			((addr - new->vm_start) >> PAGE_SHIFT), new, NULL);
 	else
-		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
+		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff,
+			new, NULL);
 
 	/* Success. */
 	if (!err)
@@ -2240,7 +2273,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 
 	/* Can we just expand an old private anonymous mapping? */
 	vma = vma_merge(mm, prev, addr, addr + len, flags,
-					NULL, NULL, pgoff, NULL);
+					NULL, NULL, pgoff, NULL, NULL);
 	if (vma)
 		goto out;
 
@@ -2396,7 +2429,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			NULL);
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..f461177 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -179,7 +179,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 */
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			NULL);
 	if (*pprev) {
 		vma = *pprev;
 		goto success;
diff --git a/mm/mremap.c b/mm/mremap.c
index 1b61c2d..8586c52 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -512,7 +512,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 			int pages = (new_len - old_len) >> PAGE_SHIFT;
 
 			if (vma_adjust(vma, vma->vm_start, addr + new_len,
-				       vma->vm_pgoff, NULL)) {
+				       vma->vm_pgoff, NULL, NULL)) {
 				ret = -ENOMEM;
 				goto out;
 			}
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
new file mode 100644
index 0000000..8b812d2
--- /dev/null
+++ b/mm/mvolatile.c
@@ -0,0 +1,312 @@
+/*
+ *	linux/mm/mvolatile.c
+ *
+ *  Copyright 2012 Minchan Kim
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mvolatile.h>
+#include <linux/mm_types.h>
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/mempolicy.h>
+
+#ifndef CONFIG_VOLATILE_PAGE
+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
+{
+	return -EINVAL;
+}
+
+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
+{
+	return -EINVAL;
+}
+#else
+
+#define NO_PURGED	0
+#define PURGED		1
+
+/*
+ * N: Normal VMA
+ * V: Volatile VMA
+ * P: Purged volatile VMA
+ *
+ * Assume that each VMA has two blocks, so cases 1-8 consist of three VMAs.
+ * For example, NNPPVV means VMA1 is a normal VMA, VMA2 is a purged volatile
+ * VMA, and VMA3 is a volatile VMA. As another example, NNPVVV means VMA1 is
+ * a normal VMA, VMA2-1 is a purged volatile VMA, and VMA2-2 is a volatile VMA.
+ *
+ * Cases 7,8 create a new VMA, which we call VMA4; it can be located before
+ * VMA2 or after it.
+ *
+ * Notice: The merge between volatile VMAs shouldn't happen.
+ * If we call mnovolatile(VMA2),
+ *
+ * Case 1 NNPPVV -> NNNNVV
+ * Case 2 VVPPNN -> VVNNNN
+ * Case 3 NNPPNN -> NNNNNN
+ * Case 4 NNPPVV -> NNNPVV
+ * case 5 VVPPNN -> VVPNNN
+ * case 6 VVPPVV -> VVNNVV
+ * case 7 VVPPVV -> VVNPVV
+ * case 8 VVPPVV -> VVPNVV
+ */
+static int do_mnovolatile(struct vm_area_struct *vma,
+		struct vm_area_struct **prev, unsigned long start,
+		unsigned long end, bool *is_purged)
+{
+	unsigned long new_flags;
+	int error = 0;
+	struct mm_struct *mm = vma->vm_mm;
+	pgoff_t pgoff;
+	bool purged = false;
+
+	new_flags = vma->vm_flags & ~VM_VOLATILE;
+	if (new_flags == vma->vm_flags) {
+		*prev = vma;
+		goto success;
+	}
+
+
+	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
+			vma->vm_file, pgoff, vma_policy(vma), &purged);
+	if (*prev) {
+		vma = *prev;
+		goto success;
+	}
+
+	*prev = vma;
+
+	if (start != vma->vm_start) {
+		error = split_vma(mm, vma, start, 1);
+		if (error)
+			goto out;
+	}
+
+	if (end != vma->vm_end) {
+		error = split_vma(mm, vma, end, 0);
+		if (error)
+			goto out;
+	}
+
+success:
+	/* V6. VVPPVV -> VVNNVV */
+	vma_lock_anon_vma(vma);
+	*is_purged |= (vma->purged|purged);
+	vma_unlock_anon_vma(vma);
+
+	vma->vm_flags = new_flags;
+	vma->purged = false;
+	return 0;
+out:
+	return error;
+}
+
+/* I didn't look into KSM/hugepage, so disallow volatile on them */
+#define VM_NO_VOLATILE	(VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|\
+		VM_MERGEABLE|VM_HUGEPAGE|VM_LOCKED)
+
+static int do_mvolatile(struct vm_area_struct *vma,
+	struct vm_area_struct **prev, unsigned long start, unsigned long end)
+{
+	int error = -EINVAL;
+	vm_flags_t new_flags = vma->vm_flags;
+	struct mm_struct *mm = vma->vm_mm;
+
+	new_flags |= VM_VOLATILE;
+
+	/* Note : Current version doesn't support file vma volatile */
+	if (vma->vm_file) {
+		*prev = vma;
+		goto out;
+	}
+
+	if (vma->vm_flags & VM_NO_VOLATILE ||
+			(vma == get_gate_vma(current->mm))) {
+		*prev = vma;
+		goto out;
+	}
+	/*
+	 * If mvolatile is called again on an already-volatile range,
+	 * we just reset the purged state.
+	 */
+	if (new_flags == vma->vm_flags) {
+		*prev = vma;
+		vma_lock_anon_vma(vma);
+		vma->purged = false;
+		vma_unlock_anon_vma(vma);
+		error = 0;
+		goto out;
+	}
+
+	*prev = vma;
+
+	if (start != vma->vm_start) {
+		error = split_vma(mm, vma, start, 1);
+		if (error)
+			goto out;
+	}
+
+	if (end != vma->vm_end) {
+		error = split_vma(mm, vma, end, 0);
+		if (error)
+			goto out;
+	}
+
+	error = 0;
+
+	vma_lock_anon_vma(vma);
+	vma->vm_flags = new_flags;
+	vma_unlock_anon_vma(vma);
+out:
+	return error;
+}
+
+/*
+ * Return -EINVAL if the range doesn't include a suitable vma at all.
+ * Return -ENOMEM, interrupting the range operation, if there is not enough
+ * memory to merge/split vmas.
+ * Return 0 if the range consists only of proper vmas.
+ * Return 1 if part of the range includes an invalid area (e.g. a hole/huge/
+ * ksm/mlock/special area).
+ */
+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
+{
+	unsigned long end, tmp;
+	struct vm_area_struct *vma, *prev;
+	bool invalid = false;
+	int error = -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (!vma)
+		goto out;
+
+	if (start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+			invalid = true;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = do_mvolatile(vma, &prev, start, tmp);
+		if (error == -ENOMEM) {
+			up_write(&current->mm->mmap_sem);
+			return error;
+		}
+		if (error == -EINVAL)
+			invalid = true;
+		else
+			error = 0;
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+
+		vma = prev->vm_next;
+		if (!vma)
+			break;
+	}
+out:
+	up_write(&current->mm->mmap_sem);
+	return invalid ? 1 : 0;
+}
+/*
+ * Return -ENOMEM, interrupting the range operation, if there is not enough
+ * memory to merge/split vmas.
+ * Return 1 if any part of the range was purged, otherwise return 0.
+ */
+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
+{
+	unsigned long end, tmp;
+	struct vm_area_struct *vma, *prev;
+	int ret, error = -EINVAL;
+	bool is_purged = false;
+
+	down_write(&current->mm->mmap_sem);
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (!vma)
+		goto out;
+
+	if (start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = do_mnovolatile(vma, &prev, start, tmp, &is_purged);
+		if (error) {
+			WARN_ON(error != -ENOMEM);
+			goto out;
+		}
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+
+		vma = prev->vm_next;
+		if (!vma)
+			break;
+	}
+out:
+	up_write(&current->mm->mmap_sem);
+
+	if (error)
+		ret = error;
+	else if (is_purged)
+		ret = PURGED;
+	else
+		ret = NO_PURGED;
+
+	return ret;
+}
+#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ee1ef0..402d9da 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -57,6 +57,7 @@
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
 #include <linux/backing-dev.h>
+#include <linux/mvolatile.h>
 
 #include <asm/tlbflush.h>
 
@@ -308,6 +309,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	vma->anon_vma = anon_vma;
 	anon_vma_lock(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
+	vma_purge_copy(vma, pvma);
 	anon_vma_unlock(anon_vma);
 
 	return 0;
-- 
1.7.9.5


* [RFC 2/8] Don't allow volatile attribute on THP and KSM
  2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
  2013-01-03  4:27 ` [RFC 1/8] Introduce new system call mvolatile Minchan Kim
@ 2013-01-03  4:28 ` Minchan Kim
  2013-01-03 16:27   ` Dave Hansen
  2013-01-03  4:28 ` [RFC 3/8] bail out when the page is in VOLATILE vma Minchan Kim
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2013-01-03  4:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Rik van Riel, Hugh Dickins,
	Andrea Arcangeli

VOLATILE implies the pages in the range are no longer part of the
working set, so it is pointless to turn them into THP or KSM pages.

Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/huge_memory.c |    9 +++++++--
 mm/ksm.c         |    3 ++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..5ddd00e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1477,7 +1477,8 @@ out:
 	return ret;
 }
 
-#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
+#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|\
+			VM_SHARED|VM_MAYSHARE|VM_VOLATILE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
@@ -1641,6 +1642,8 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
 		 * page fault if needed.
 		 */
 		return 0;
+	if (vma->vm_flags & VM_VOLATILE)
+		return 0;
 	if (vma->vm_ops)
 		/* khugepaged not yet working on file or special mappings */
 		return 0;
@@ -1969,6 +1972,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 		goto out;
 	if (is_vma_temporary_stack(vma))
 		goto out;
+	if (vma->vm_flags & VM_VOLATILE)
+		goto out;
 	VM_BUG_ON(vma->vm_flags & VM_NO_THP);
 
 	pgd = pgd_offset(mm, address);
@@ -2196,7 +2201,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 		if ((!(vma->vm_flags & VM_HUGEPAGE) &&
 		     !khugepaged_always()) ||
-		    (vma->vm_flags & VM_NOHUGEPAGE)) {
+		     (vma->vm_flags & (VM_NOHUGEPAGE|VM_VOLATILE))) {
 		skip:
 			progress++;
 			continue;
diff --git a/mm/ksm.c b/mm/ksm.c
index ae539f0..2775f59 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1486,7 +1486,8 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		 */
 		if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
 				 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_NONLINEAR | VM_MIXEDMAP))
+				 VM_HUGETLB | VM_NONLINEAR | VM_MIXEDMAP   |
+				 VM_VOLATILE))
 			return 0;		/* just ignore the advice */
 
 #ifdef VM_SAO
-- 
1.7.9.5


* [RFC 3/8] bail out when the page is in VOLATILE vma
  2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
  2013-01-03  4:27 ` [RFC 1/8] Introduce new system call mvolatile Minchan Kim
  2013-01-03  4:28 ` [RFC 2/8] Don't allow volatile attribute on THP and KSM Minchan Kim
@ 2013-01-03  4:28 ` Minchan Kim
  2013-01-03  4:28 ` [RFC 4/8] add page_locked parameter in free_swap_and_cache Minchan Kim
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-03  4:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Rik van Riel, Mel Gorman

If we find that a page is in a VOLATILE vma, hurry up and discard it
instead of checking the access bit, because it is very unlikely to be
part of the working set.

The next patch will use this.

Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/rmap.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 402d9da..fea01cd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -695,10 +695,12 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		if (!pte)
 			goto out;
 
-		if (vma->vm_flags & VM_LOCKED) {
+		if ((vma->vm_flags & VM_LOCKED) ||
+				(vma->vm_flags & VM_VOLATILE)) {
 			pte_unmap_unlock(pte, ptl);
 			*mapcount = 0;	/* break early from loop */
-			*vm_flags |= VM_LOCKED;
+			*vm_flags |= (vma->vm_flags & VM_LOCKED ?
+					VM_LOCKED : VM_VOLATILE);
 			goto out;
 		}
 
-- 
1.7.9.5


* [RFC 4/8] add page_locked parameter in free_swap_and_cache
  2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
                   ` (2 preceding siblings ...)
  2013-01-03  4:28 ` [RFC 3/8] bail out when the page is in VOLATILE vma Minchan Kim
@ 2013-01-03  4:28 ` Minchan Kim
  2013-01-03  4:28 ` [RFC 5/8] Discard volatile page Minchan Kim
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-03  4:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Hugh Dickins, Mel Gorman,
	Rik van Riel

Add a page_locked parameter so that callers which already hold the page
lock can avoid trylock_page.
The next patch will use it.

Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |    6 +++---
 mm/fremap.c          |    2 +-
 mm/memory.c          |    2 +-
 mm/shmem.c           |    2 +-
 mm/swapfile.c        |    7 ++++---
 5 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 68df9c1..5cf2191 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -357,7 +357,7 @@ extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free(swp_entry_t, struct page *page);
-extern int free_swap_and_cache(swp_entry_t);
+extern int free_swap_and_cache(swp_entry_t, bool);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
@@ -397,8 +397,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(swp)	is_migration_entry(swp)
-#define swapcache_prepare(swp)		is_migration_entry(swp)
+#define free_swap_and_cache(swp, page_locked)	is_migration_entry(swp)
+#define swapcache_prepare(swp)			is_migration_entry(swp)
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
diff --git a/mm/fremap.c b/mm/fremap.c
index a0aaf0e..a300508 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -44,7 +44,7 @@ static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 	} else {
 		if (!pte_file(pte))
-			free_swap_and_cache(pte_to_swp_entry(pte));
+			free_swap_and_cache(pte_to_swp_entry(pte), false);
 		pte_clear_not_present_full(mm, addr, ptep, 0);
 	}
 }
diff --git a/mm/memory.c b/mm/memory.c
index 221fc9f..c475cc1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1198,7 +1198,7 @@ again:
 				else
 					rss[MM_FILEPAGES]--;
 			}
-			if (unlikely(!free_swap_and_cache(entry)))
+			if (unlikely(!free_swap_and_cache(entry, false)))
 				print_bad_pte(vma, addr, ptent, NULL);
 		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
diff --git a/mm/shmem.c b/mm/shmem.c
index 50c5b8f..33ec719 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -391,7 +391,7 @@ static int shmem_free_swap(struct address_space *mapping,
 	error = shmem_radix_tree_replace(mapping, index, radswap, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!error)
-		free_swap_and_cache(radix_to_swp_entry(radswap));
+		free_swap_and_cache(radix_to_swp_entry(radswap), false);
 	return error;
 }
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f91a255..43437ff 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -688,7 +688,7 @@ int try_to_free_swap(struct page *page)
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
  */
-int free_swap_and_cache(swp_entry_t entry)
+int free_swap_and_cache(swp_entry_t entry, bool page_locked)
 {
 	struct swap_info_struct *p;
 	struct page *page = NULL;
@@ -700,7 +700,7 @@ int free_swap_and_cache(swp_entry_t entry)
 	if (p) {
 		if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
 			page = find_get_page(&swapper_space, entry.val);
-			if (page && !trylock_page(page)) {
+			if (page && !page_locked && !trylock_page(page)) {
 				page_cache_release(page);
 				page = NULL;
 			}
@@ -717,7 +717,8 @@ int free_swap_and_cache(swp_entry_t entry)
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
 		}
-		unlock_page(page);
+		if (!page_locked)
+			unlock_page(page);
 		page_cache_release(page);
 	}
 	return p != NULL;
-- 
1.7.9.5


* [RFC 5/8] Discard volatile page
  2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
                   ` (3 preceding siblings ...)
  2013-01-03  4:28 ` [RFC 4/8] add page_locked parameter in free_swap_and_cache Minchan Kim
@ 2013-01-03  4:28 ` Minchan Kim
  2013-01-03  4:28 ` [RFC 6/8] add PGVOLATILE vmstat count Minchan Kim
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-03  4:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Rik van Riel, Mel Gorman,
	Hugh Dickins

The VM doesn't need to swap out volatile pages. Instead, it just
discards them and sets the vma's purged state to true, so if the user
tries to access a purged vma without calling mnovolatile, the access
raises SIGBUS.

Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/memory.h    |    2 +
 include/linux/mvolatile.h |   20 +++++
 include/linux/rmap.h      |    2 +
 mm/memory.c               |   10 ++-
 mm/mvolatile.c            |  185 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/rmap.c                 |    3 +-
 mm/vmscan.c               |   13 ++++
 7 files changed, 230 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory.h b/include/linux/memory.h
index ff9a9f8..0c50bec 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -150,5 +150,7 @@ struct memory_accessor {
  * can sleep.
  */
 extern struct mutex text_mutex;
+void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
+			  pte_t pte, struct page *page);
 
 #endif /* _LINUX_MEMORY_H_ */
diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
index cfb12b4..eb07761 100644
--- a/include/linux/mvolatile.h
+++ b/include/linux/mvolatile.h
@@ -2,8 +2,15 @@
 #define __LINUX_MVOLATILE_H
 
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 #ifdef CONFIG_VOLATILE_PAGE
+
+static inline bool is_volatile_vma(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_VOLATILE;
+}
+
 static inline bool vma_purged(struct vm_area_struct *vma)
 {
 	return vma->purged;
@@ -14,6 +21,8 @@ static inline void vma_purge_copy(struct vm_area_struct *dst,
 {
 	dst->purged = src->purged;
 }
+
+int discard_volatile_page(struct page *page, enum ttu_flags ttu_flags);
 #else
 static inline bool vma_purged(struct vm_area_struct *vma)
 {
@@ -25,6 +34,17 @@ static inline void vma_purge_copy(struct vm_area_struct *dst,
 {
 
 }
+
+static inline int discard_volatile_page(struct page *page,
+					enum ttu_flags ttu_flags)
+{
+	return 0;
+}
+
+static inline bool is_volatile_vma(struct vm_area_struct *vma)
+{
+	return false;
+}
 #endif
 #endif
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bfe1f47..5429804 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -223,6 +223,7 @@ int try_to_munlock(struct page *);
 struct anon_vma *page_lock_anon_vma(struct page *page);
 void page_unlock_anon_vma(struct anon_vma *anon_vma);
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
+unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
 
 /*
  * Called by migrate.c to remove migration ptes, but might be used more later.
@@ -244,6 +245,7 @@ static inline int page_referenced(struct page *page, int is_locked,
 	return 0;
 }
 
+
 #define try_to_unmap(page, refs) SWAP_FAIL
 
 static inline int page_mkclean(struct page *page)
diff --git a/mm/memory.c b/mm/memory.c
index c475cc1..0646375 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/mvolatile.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -655,7 +656,7 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
  *
  * The calling function must still handle the error.
  */
-static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
+void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 			  pte_t pte, struct page *page)
 {
 	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
@@ -3459,6 +3460,8 @@ int handle_pte_fault(struct mm_struct *mm,
 					return do_linear_fault(mm, vma, address,
 						pte, pmd, flags, entry);
 			}
+			if (unlikely(is_volatile_vma(vma)))
+				return VM_FAULT_SIGBUS;
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, flags);
 		}
@@ -3528,9 +3531,12 @@ retry:
 	if (!pmd)
 		return VM_FAULT_OOM;
 	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
-		if (!vma->vm_ops)
+		if (!vma->vm_ops) {
+			if (unlikely(is_volatile_vma(vma)))
+				return VM_FAULT_SIGBUS;
 			return do_huge_pmd_anonymous_page(mm, vma, address,
 							  pmd, flags);
+		}
 	} else {
 		pmd_t orig_pmd = *pmd;
 		int ret;
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
index 8b812d2..6bc9f7e 100644
--- a/mm/mvolatile.c
+++ b/mm/mvolatile.c
@@ -10,8 +10,12 @@
 #include <linux/mvolatile.h>
 #include <linux/mm_types.h>
 #include <linux/mm.h>
-#include <linux/rmap.h>
+#include <linux/memory.h>
 #include <linux/mempolicy.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
 
 #ifndef CONFIG_VOLATILE_PAGE
 SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
@@ -25,6 +29,185 @@ SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
 }
 #else
 
+/*
+ * Check that @page is mapped at @address into @mm
+ * The difference with __page_check_address is this function checks
+ * pte has swap entry of page.
+ *
+ * On success returns with pte mapped and locked.
+ */
+static pte_t *__page_check_volatile_address(struct page *page,
+	struct mm_struct *mm, unsigned long address, spinlock_t **ptlp)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return NULL;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return NULL;
+
+	VM_BUG_ON(pmd_trans_huge(*pmd));
+
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (pte_none(*pte))
+		goto out;
+
+	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
+		*ptlp = ptl;
+		return pte;
+	} else {
+		swp_entry_t entry = { .val = page_private(page) };
+
+		WARN_ON(pte_present(*pte));
+		VM_BUG_ON(non_swap_entry(entry));
+
+		if (entry.val != pte_to_swp_entry(*pte).val)
+			goto out;
+
+		*ptlp = ptl;
+		return pte;
+	}
+out:
+	pte_unmap_unlock(pte, ptl);
+	return NULL;
+}
+
+static inline pte_t *page_check_volatile_address(struct page *page,
+			struct mm_struct *mm, unsigned long address,
+			spinlock_t **ptlp)
+{
+	pte_t *ptep;
+
+	__cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
+				mm, address, ptlp));
+	return ptep;
+}
+
+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
+		unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *pte;
+	pte_t pteval;
+	spinlock_t *ptl;
+	int ret = 0;
+	bool present;
+
+	VM_BUG_ON(!PageLocked(page));
+
+	pte = page_check_volatile_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	present = pte_present(*pte);
+	flush_cache_page(vma, address, page_to_pfn(page));
+	pteval = ptep_clear_flush(vma, address, pte);
+
+	update_hiwater_rss(mm);
+	dec_mm_counter(mm, MM_ANONPAGES);
+
+	page_remove_rmap(page);
+	page_cache_release(page);
+
+	if (!present) {
+		swp_entry_t entry = pte_to_swp_entry(*pte);
+		dec_mm_counter(mm, MM_SWAPENTS);
+		if (unlikely(!free_swap_and_cache(entry, true)))
+			print_bad_pte(vma, address, *pte, NULL);
+	}
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, address);
+	ret = 1;
+out:
+	return ret;
+}
+
+static int try_to_volatile_page(struct page *page, enum ttu_flags flags)
+{
+	struct anon_vma *anon_vma;
+	pgoff_t pgoff;
+	struct anon_vma_chain *avc;
+	unsigned long address;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int ret = 0;
+
+	VM_BUG_ON(!PageLocked(page));
+
+	if (unlikely(PageTransHuge(page)))
+		if (unlikely(split_huge_page(page)))
+			return 0;
+
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return ret;
+
+	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		pte_t *pte;
+		spinlock_t *ptl;
+
+		vma = avc->vma;
+		mm = vma->vm_mm;
+		/*
+		 * During exec, a temporary VMA is setup and later moved.
+		 * The VMA is moved under the anon_vma lock but not the
+		 * page tables leading to a race where migration cannot
+		 * find the migration ptes. Rather than increasing the
+		 * locking requirements of exec(), migration skips
+		 * temporary VMAs until after exec() completes.
+		 */
+		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
+				is_vma_temporary_stack(vma))
+			continue;
+
+		address = vma_address(page, vma);
+		pte = page_check_volatile_address(page, mm, address, &ptl);
+		if (!pte)
+			continue;
+		pte_unmap_unlock(pte, ptl);
+
+		if (!(vma->vm_flags & VM_VOLATILE))
+			goto out;
+	}
+
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		struct vm_area_struct *vma = avc->vma;
+
+		address = vma_address(page, vma);
+		if (try_to_zap_one(page, vma, address))
+			vma->purged = true;
+	}
+
+	ret = 1;
+out:
+	page_unlock_anon_vma(anon_vma);
+	return ret;
+}
+
+int discard_volatile_page(struct page *page, enum ttu_flags ttu_flags)
+{
+	if (try_to_volatile_page(page, ttu_flags)) {
+		if (page_freeze_refs(page, 1)) {
+			unlock_page(page);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 #define NO_PURGED	0
 #define PURGED		1
 
diff --git a/mm/rmap.c b/mm/rmap.c
index fea01cd..e305bbf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -525,8 +525,7 @@ __vma_address(struct page *page, struct vm_area_struct *vma)
 	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 }
 
-inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
+unsigned long vma_address(struct page *page, struct vm_area_struct *vma)
 {
 	unsigned long address = __vma_address(page, vma);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7ed376..449ec95 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -42,6 +42,7 @@
 #include <linux/sysctl.h>
 #include <linux/oom.h>
 #include <linux/prefetch.h>
+#include <linux/mvolatile.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -609,6 +610,7 @@ redo:
 enum page_references {
 	PAGEREF_RECLAIM,
 	PAGEREF_RECLAIM_CLEAN,
+	PAGEREF_DISCARD,
 	PAGEREF_KEEP,
 	PAGEREF_ACTIVATE,
 };
@@ -627,9 +629,16 @@ static enum page_references page_check_references(struct page *page,
 	 * Mlock lost the isolation race with us.  Let try_to_unmap()
 	 * move the page to the unevictable list.
 	 */
+
+	VM_BUG_ON((vm_flags & (VM_LOCKED|VM_VOLATILE)) ==
+				(VM_LOCKED|VM_VOLATILE));
+
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	if (vm_flags & VM_VOLATILE)
+		return PAGEREF_DISCARD;
+
 	if (referenced_ptes) {
 		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
@@ -768,6 +777,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto activate_locked;
 		case PAGEREF_KEEP:
 			goto keep_locked;
+		case PAGEREF_DISCARD:
+			if (discard_volatile_page(page, ttu_flags))
+				goto free_it;
+			break;
 		case PAGEREF_RECLAIM:
 		case PAGEREF_RECLAIM_CLEAN:
 			; /* try to reclaim the page below */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC 6/8] add PGVOLATILE vmstat count
  2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
                   ` (4 preceding siblings ...)
  2013-01-03  4:28 ` [RFC 5/8] Discard volatile page Minchan Kim
@ 2013-01-03  4:28 ` Minchan Kim
  2013-01-03  4:28 ` [RFC 7/8] add volatile page discard hook to kswapd Minchan Kim
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-03  4:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim

This patch adds a pgvolatile vmstat counter so an admin can see how many
volatile pages have been discarded by the VM so far.
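
For reference, a minimal user-space sketch (not part of this patch) of how
the new counter could be read; the "pgvolatile" field name matches the
vmstat_text entry added below, everything else is only illustrative:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[128];
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "pgvolatile", 10))
				fputs(line, stdout);	/* e.g. "pgvolatile 1234" */
		fclose(f);
		return 0;
	}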

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/vm_event_item.h |    3 +++
 mm/mvolatile.c                |    1 +
 mm/vmstat.c                   |    3 +++
 3 files changed, 7 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..721d096 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
+#ifdef CONFIG_VOLATILE_PAGE
+		PGVOLATILE,
+#endif
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
 		FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
index 6bc9f7e..c66c3bc 100644
--- a/mm/mvolatile.c
+++ b/mm/mvolatile.c
@@ -201,6 +201,7 @@ int discard_volatile_page(struct page *page, enum ttu_flags ttu_flags)
 	if (try_to_volatile_page(page, ttu_flags)) {
 		if (page_freeze_refs(page, 1)) {
 			unlock_page(page);
+			count_vm_event(PGVOLATILE);
 			return 1;
 		}
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..3d08e1a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -753,6 +753,9 @@ const char * const vmstat_text[] = {
 	"pgfault",
 	"pgmajfault",
 
+#ifdef CONFIG_VOLATILE_PAGE
+	"pgvolatile",
+#endif
 	TEXTS_FOR_ZONES("pgrefill")
 	TEXTS_FOR_ZONES("pgsteal_kswapd")
 	TEXTS_FOR_ZONES("pgsteal_direct")
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC 7/8] add volatile page discard hook to kswapd
  2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
                   ` (5 preceding siblings ...)
  2013-01-03  4:28 ` [RFC 6/8] add PGVOLATILE vmstat count Minchan Kim
@ 2013-01-03  4:28 ` Minchan Kim
  2013-01-03  4:28 ` [RFC 8/8] extend PGVOLATILE vmstat " Minchan Kim
  2013-01-03 17:19 ` [RFC v5 0/8] Support volatile for anonymous range Sanjay Ghemawat
  8 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-03  4:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Hugh Dickins,
	Andrea Arcangeli, Rik van Riel, Mel Gorman

This patch adds a volatile page discard hook to kswapd to minimize
eviction of the working set and to enable discarding volatile pages
even when swap is not enabled.

Much of this code is borrowed from the THP implementation.

Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/mvolatile.h |   13 ++
 include/linux/sched.h     |    1 +
 kernel/fork.c             |    2 +
 mm/internal.h             |    2 +
 mm/mvolatile.c            |  314 +++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c               |   44 ++++++-
 6 files changed, 374 insertions(+), 2 deletions(-)

diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
index eb07761..9276022 100644
--- a/include/linux/mvolatile.h
+++ b/include/linux/mvolatile.h
@@ -23,6 +23,9 @@ static inline void vma_purge_copy(struct vm_area_struct *dst,
 }
 
 int discard_volatile_page(struct page *page, enum ttu_flags ttu_flags);
+unsigned int discard_volatile_pages(struct zone *zone, unsigned int nr_pages);
+void mvolatile_exit(struct mm_struct *mm);
+
 #else
 static inline bool vma_purged(struct vm_area_struct *vma)
 {
@@ -45,6 +48,16 @@ static inline bool is_volatile_vma(struct vm_area_struct *vma)
 {
 	return false;
 }
+
+static inline unsigned int discard_volatile_pages(struct zone *zone,
+							unsigned int nr_pages)
+{
+	return 0;
+}
+
+static inline void mvolatile_exit(struct mm_struct *mm)
+{
+}
 #endif
 #endif
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0dd42a0..7ae95df 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -408,6 +408,7 @@ extern int get_dumpable(struct mm_struct *mm);
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_VM_VOLATILE		21	/* set when VM_VOLATILE is set on vma */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..9d7d218 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
 #include <linux/khugepaged.h>
 #include <linux/signalfd.h>
 #include <linux/uprobes.h>
+#include <linux/mvolatile.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -612,6 +613,7 @@ void mmput(struct mm_struct *mm)
 		uprobe_clear_state(mm);
 		exit_aio(mm);
 		ksm_exit(mm);
+		mvolatile_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
diff --git a/mm/internal.h b/mm/internal.h
index a4fa284..e595224 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -351,6 +351,8 @@ extern unsigned long vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
+unsigned long discard_volatile_page_list(struct zone *zone,
+					    struct list_head *page_list);
 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 					    struct list_head *page_list);
 /* The ALLOC_WMARK bits are used as an index to zone->watermark */
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
index c66c3bc..1c7bf5a 100644
--- a/mm/mvolatile.c
+++ b/mm/mvolatile.c
@@ -16,6 +16,8 @@
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
+#include <linux/migrate.h>
+#include "internal.h"
 
 #ifndef CONFIG_VOLATILE_PAGE
 SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
@@ -29,6 +31,49 @@ SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
 }
 #else
 
+static DEFINE_SPINLOCK(mvolatile_mm_lock);
+
+#define MM_SLOTS_HASH_SHIFT 10
+#define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
+
+struct mvolatile_scan {
+	struct list_head mm_head;
+	struct mm_slot *mm_slot;
+	unsigned long address;
+};
+
+static struct mvolatile_scan mvolatile_scan = {
+	.mm_head = LIST_HEAD_INIT(mvolatile_scan.mm_head),
+};
+
+static struct hlist_head mm_slots_hash[MM_SLOTS_HASH_HEADS];
+
+struct mm_slot {
+	struct hlist_node hash;
+	struct list_head mm_node;
+	struct mm_struct *mm;
+};
+
+static struct kmem_cache *mm_slot_cache __read_mostly;
+
+static int __init mvolatile_slab_init(void)
+{
+	mm_slot_cache = kmem_cache_create("mvolatile_mm_slot",
+			sizeof(struct mm_slot),
+			__alignof(struct mm_slot), 0, NULL);
+	if (!mm_slot_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int __init mvolatile_init(void)
+{
+	mvolatile_slab_init();
+	return 0;
+}
+module_init(mvolatile_init)
+
 /*
  * Check that @page is mapped at @address into @mm
  * The difference with __page_check_address is this function checks
@@ -209,6 +254,274 @@ int discard_volatile_page(struct page *page, enum ttu_flags ttu_flags)
 	return 0;
 }
 
+
+static inline struct mm_slot *alloc_mm_slot(void)
+{
+	if (!mm_slot_cache)
+		return NULL;
+	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot(struct mm_slot *mm_slot)
+{
+	kmem_cache_free(mm_slot_cache, mm_slot);
+}
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	struct hlist_head *bucket;
+	struct hlist_node *node;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+		% MM_SLOTS_HASH_HEADS];
+	hlist_for_each_entry(mm_slot, node, bucket, hash) {
+		if (mm == mm_slot->mm)
+			return mm_slot;
+	}
+	return NULL;
+
+}
+
+void insert_to_mm_slots_hash(struct mm_struct *mm, struct mm_slot *mm_slot)
+{
+	struct hlist_head *bucket;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+		% MM_SLOTS_HASH_HEADS];
+	mm_slot->mm = mm;
+	hlist_add_head(&mm_slot->hash, bucket);
+}
+
+int mvolatile_enter(struct vm_area_struct *vma)
+{
+	struct mm_slot *mm_slot;
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (test_bit(MMF_VM_VOLATILE, &mm->flags))
+		return 0;
+
+	mm_slot = alloc_mm_slot();
+	if (!mm_slot)
+		return -ENOMEM;
+
+	if (unlikely(test_and_set_bit(MMF_VM_VOLATILE, &mm->flags))) {
+		free_mm_slot(mm_slot);
+		return 0;
+	}
+
+	spin_lock(&mvolatile_mm_lock);
+	insert_to_mm_slots_hash(mm, mm_slot);
+	list_add_tail(&mm_slot->mm_node, &mvolatile_scan.mm_head);
+	spin_unlock(&mvolatile_mm_lock);
+
+	atomic_inc(&mm->mm_count);
+	return 0;
+}
+
+void mvolatile_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	bool free = false;
+
+	if (!test_bit(MMF_VM_VOLATILE, &mm->flags))
+		return;
+	/* TODO : revisit spin_lock vs spin_lock_irq */
+	spin_lock(&mvolatile_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	/* TODO Consider current mm_slot we are scanning now */
+	if (mm_slot && mvolatile_scan.mm_slot != mm_slot) {
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+		free = true;
+	}
+	spin_unlock(&mvolatile_mm_lock);
+	if (free) {
+		clear_bit(MMF_VM_VOLATILE, &mm->flags);
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static inline int mvolatile_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static void collect_mm_slot(struct mm_slot *mm_slot)
+{
+	struct mm_struct *mm = mm_slot->mm;
+
+	if (mvolatile_test_exit(mm)) {
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	}
+}
+
+/* TODO: consider nr_pages */
+static unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
+			struct vm_area_struct *vma, unsigned long address,
+			unsigned int nr_pages)
+{
+	LIST_HEAD(pagelist);
+	struct page *page;
+	int ret = 0;
+
+	for (; mvolatile_scan.address < vma->vm_end;
+			mvolatile_scan.address += PAGE_SIZE) {
+
+		if (mvolatile_test_exit(mm))
+			break;
+
+		/*
+		 * TODO : optimize page walking with the lock
+		 *        batch isolate_lru_page
+		 */
+		page = follow_page(vma, mvolatile_scan.address,
+				FOLL_GET|FOLL_SPLIT);
+		if (IS_ERR_OR_NULL(page)) {
+			cond_resched();
+			continue;
+		}
+
+		VM_BUG_ON(PageCompound(page));
+		BUG_ON(!PageAnon(page));
+		VM_BUG_ON(!PageSwapBacked(page));
+
+		/*
+		 * TODO : putback pages into tail of the zones.
+		 */
+		if (page_zone(page) != zone || isolate_lru_page(page)) {
+			put_page(page);
+			continue;
+		}
+
+		put_page(page);
+		list_add(&page->lru, &pagelist);
+		inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+	}
+
+	if (!list_empty(&pagelist))
+		ret = discard_volatile_page_list(zone, &pagelist);
+
+	/* TODO : putback pages into lru's tail */
+	putback_lru_pages(&pagelist);
+	return ret;
+}
+
+static unsigned int discard_mm_slot(struct zone *zone,
+		unsigned int nr_to_reclaim)
+{
+	struct mm_slot *mm_slot;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	unsigned int nr_discard = 0;
+
+	VM_BUG_ON(!nr_to_reclaim);
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&mvolatile_mm_lock));
+
+	if (mvolatile_scan.mm_slot)
+		mm_slot = mvolatile_scan.mm_slot;
+	else {
+		mm_slot = list_entry(mvolatile_scan.mm_head.next,
+				struct mm_slot, mm_node);
+		mvolatile_scan.address = 0;
+		mvolatile_scan.mm_slot = mm_slot;
+	}
+	spin_unlock(&mvolatile_mm_lock);
+
+	mm = mm_slot->mm;
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		vma = NULL;
+		goto next_mm;
+	}
+
+	if (unlikely(mvolatile_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, mvolatile_scan.address);
+
+	for (; vma; vma = vma->vm_next) {
+		cond_resched();
+
+		if (!(vma->vm_flags & VM_VOLATILE) || !vma->anon_vma) {
+			mvolatile_scan.address = vma->vm_end;
+			continue;
+		}
+
+		if (mvolatile_scan.address < vma->vm_start)
+			mvolatile_scan.address = vma->vm_start;
+
+		if (unlikely(mvolatile_test_exit(mm)))
+			break;
+
+		nr_discard += discard_vma_pages(zone, mm, vma,
+				mvolatile_scan.address,
+				nr_to_reclaim - nr_discard);
+
+		mvolatile_scan.address = vma->vm_end;
+		if (nr_discard >= nr_to_reclaim)
+			break;
+	}
+	up_read(&mm->mmap_sem);
+next_mm:
+	spin_lock(&mvolatile_mm_lock);
+	VM_BUG_ON(mvolatile_scan.mm_slot != mm_slot);
+
+	if (mvolatile_test_exit(mm) || !vma) {
+		if (mm_slot->mm_node.next != &mvolatile_scan.mm_head) {
+			mvolatile_scan.mm_slot = list_entry(
+					mm_slot->mm_node.next,
+					struct mm_slot, mm_node);
+			mvolatile_scan.address = 0;
+		} else {
+			mvolatile_scan.mm_slot = NULL;
+		}
+
+		collect_mm_slot(mm_slot);
+	}
+
+	return nr_discard;
+}
+
+
+#define MAX_NODISCARD_PROGRESS     (12)
+
+unsigned int discard_volatile_pages(struct zone *zone,
+		unsigned int nr_to_reclaim)
+{
+	unsigned int nr_discard = 0;
+	int nodiscard_progress = 0;
+
+	while (nr_discard < nr_to_reclaim) {
+		unsigned int ret;
+
+		cond_resched();
+
+		spin_lock(&mvolatile_mm_lock);
+		if (list_empty(&mvolatile_scan.mm_head)) {
+			spin_unlock(&mvolatile_mm_lock);
+			break;
+		}
+		ret = discard_mm_slot(zone, nr_to_reclaim);
+		if (!ret)
+			nodiscard_progress++;
+		spin_unlock(&mvolatile_mm_lock);
+		if (nodiscard_progress >= MAX_NODISCARD_PROGRESS)
+			break;
+		nr_to_reclaim -= ret;
+		nr_discard += ret;
+	}
+
+	return nr_discard;
+}
+
 #define NO_PURGED	0
 #define PURGED		1
 
@@ -345,6 +658,7 @@ static int do_mvolatile(struct vm_area_struct *vma,
 	vma_lock_anon_vma(vma);
 	vma->vm_flags = new_flags;
 	vma_unlock_anon_vma(vma);
+	mvolatile_enter(vma);
 out:
 	return error;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 449ec95..c936880 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -977,6 +977,34 @@ keep:
 	return nr_reclaimed;
 }
 
+#ifdef CONFIG_VOLATILE_PAGE
+unsigned long discard_volatile_page_list(struct zone *zone,
+		struct list_head *page_list)
+{
+	unsigned long ret;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.priority = DEF_PRIORITY,
+		.may_unmap = 1,
+		.may_swap = 1
+	};
+
+	unsigned long dummy1, dummy2;
+	struct page *page;
+
+	list_for_each_entry(page, page_list, lru) {
+		VM_BUG_ON(!PageAnon(page));
+		ClearPageActive(page);
+	}
+
+	ret = shrink_page_list(page_list, zone, &sc,
+			TTU_UNMAP|TTU_IGNORE_ACCESS,
+			&dummy1, &dummy2, false);
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -ret);
+	return ret;
+}
+#endif
+
 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 					    struct list_head *page_list)
 {
@@ -2703,8 +2731,19 @@ loop_again:
 				testorder = 0;
 
 			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
-			    !zone_balanced(zone, testorder,
-					   balance_gap, end_zone)) {
+					!zone_balanced(zone, testorder,
+						balance_gap, end_zone)) {
+				unsigned int nr_discard;
+				if (testorder == 0) {
+					nr_discard = discard_volatile_pages(
+							zone,
+							SWAP_CLUSTER_MAX);
+					sc.nr_reclaimed += nr_discard;
+					if (zone_balanced(zone, testorder, 0,
+								end_zone))
+						goto zone_balanced;
+				}
+
 				shrink_zone(zone, &sc);
 
 				reclaim_state->reclaimed_slab = 0;
@@ -2742,6 +2781,7 @@ loop_again:
 					    min_wmark_pages(zone), end_zone, 0))
 					has_under_min_watermark_zone = 1;
 			} else {
+zone_balanced:
 				/*
 				 * If a zone reaches its high watermark,
 				 * consider it to be no longer congested. It's
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC 8/8] extend PGVOLATILE vmstat to kswapd
  2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
                   ` (6 preceding siblings ...)
  2013-01-03  4:28 ` [RFC 7/8] add volatile page discard hook to kswapd Minchan Kim
@ 2013-01-03  4:28 ` Minchan Kim
  2013-01-03 17:19 ` [RFC v5 0/8] Support volatile for anonymous range Sanjay Ghemawat
  8 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-03  4:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim

Now that kswapd can discard volatile pages, account for that in vmstat as well.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/vm_event_item.h |    3 ++-
 mm/mvolatile.c                |    5 ++++-
 mm/vmstat.c                   |    3 ++-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 721d096..4efa3bf 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -26,7 +26,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 #ifdef CONFIG_VOLATILE_PAGE
-		PGVOLATILE,
+		PGVOLATILE_DIRECT,
+		PGVOLATILE_KSWAPD,
 #endif
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
index 1c7bf5a..08a7eb3 100644
--- a/mm/mvolatile.c
+++ b/mm/mvolatile.c
@@ -246,7 +246,10 @@ int discard_volatile_page(struct page *page, enum ttu_flags ttu_flags)
 	if (try_to_volatile_page(page, ttu_flags)) {
 		if (page_freeze_refs(page, 1)) {
 			unlock_page(page);
-			count_vm_event(PGVOLATILE);
+			if (current_is_kswapd())
+				count_vm_event(PGVOLATILE_KSWAPD);
+			else
+				count_vm_event(PGVOLATILE_DIRECT);
 			return 1;
 		}
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3d08e1a..416f550 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -754,7 +754,8 @@ const char * const vmstat_text[] = {
 	"pgmajfault",
 
 #ifdef CONFIG_VOLATILE_PAGE
-	"pgvolatile",
+	"pgvolatile_direct",
+	"pgvolatile_kswapd",
 #endif
 	TEXTS_FOR_ZONES("pgrefill")
 	TEXTS_FOR_ZONES("pgsteal_kswapd")
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC 2/8] Don't allow volatile attribute on THP and KSM
  2013-01-03  4:28 ` [RFC 2/8] Don't allow volatile attribute on THP and KSM Minchan Kim
@ 2013-01-03 16:27   ` Dave Hansen
  2013-01-04  2:51     ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2013-01-03 16:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins,
	Andrea Arcangeli

On 01/02/2013 08:28 PM, Minchan Kim wrote:
> VOLATILE implies the pages in the range are no longer part of the working
> set, so it's pointless to make them into THP/KSM pages.

One of the points of this implementation is that it be able to preserve
memory contents when there is no pressure.  If those contents happen to
contain a THP/KSM page, and there's no pressure, it seems like the right
thing to do is to leave that memory in place.

It might be a fair thing to do this in order to keep the implementation
more sane at the moment.  But, we should make sure there's some good
text on that in the changelog.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC v5 0/8] Support volatile for anonymous range
  2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
                   ` (7 preceding siblings ...)
  2013-01-03  4:28 ` [RFC 8/8] extend PGVOLATILE vmstat " Minchan Kim
@ 2013-01-03 17:19 ` Sanjay Ghemawat
  2013-01-04  5:15   ` Minchan Kim
  8 siblings, 1 reply; 17+ messages in thread
From: Sanjay Ghemawat @ 2013-01-03 17:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Wed, Jan 2, 2013 at 8:27 PM, Minchan Kim <minchan@kernel.org> wrote:
> This is still RFC because we need more input from user-space
> people, more stress test, design discussion about interface/reclaim

Speaking as one of the authors of tcmalloc, I don't see any particular
need for this new system call for tcmalloc.  We are fine using
madvise(MADV_DONTNEED) and don't notice any significant
performance issues caused by it.  Background: we throttle how
quickly we release memory back to the system (1-10MB/s), so
we do not call madvise() very much, and we don't end up reusing
madvise-ed away pages at a fast rate. My guess is that we won't
see large enough application-level performance improvements to
cause us to change tcmalloc to use this system call.
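
For what it's worth, the release pattern I'm describing looks roughly like
the sketch below; it is illustrative only (not tcmalloc's actual code) and
the rate limiting is elided:

	#include <sys/mman.h>
	#include <stddef.h>

	/*
	 * Hand one free span back to the kernel.  The allocator throttles
	 * how often this runs (roughly 1-10MB/s overall), so the madvise()
	 * cost stays off the fast path and released pages are rarely
	 * touched again soon afterwards.
	 */
	static void release_span(void *addr, size_t len)
	{
		/* contents are dropped; the next touch gets zero-filled pages */
		madvise(addr, len, MADV_DONTNEED);
	}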

> - What's different with madvise(DONTNEED)?
>
>   System call semantic
>
>   DONTNEED makes sure user always can see zero-fill pages after
>   he calls madvise while mvolatile can see old data or encounter
>   SIGBUS.

Do you need a new system call for this?  Why not just a new flag to madvise
with weaker guarantees than zero-filling?  All of the implementation changes
you point out below could be triggered from that flag.
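
Concretely, I'm imagining something like the fragment below, where
MADV_VOLATILE and MADV_NOVOLATILE are purely hypothetical flag names (no
kernel defines them today), the "purged" return mirrors what mnovolatile
reports in this series, and regenerate_contents() is just a placeholder:

	/* hypothetical madvise-based interface -- not this patch's API */
	madvise(addr, len, MADV_VOLATILE);	/* kernel may discard the pages */
	/* ... later, before reusing the range ... */
	ret = madvise(addr, len, MADV_NOVOLATILE);
	if (ret == 1)				/* hypothetical PURGED return */
		regenerate_contents(addr, len);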

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/8] Introduce new system call mvolatile
  2013-01-03  4:27 ` [RFC 1/8] Introduce new system call mvolatile Minchan Kim
@ 2013-01-03 18:35   ` Taras Glek
  2013-01-04  4:25     ` Minchan Kim
  2013-01-17  1:48   ` John Stultz
  1 sibling, 1 reply; 17+ messages in thread
From: Taras Glek @ 2013-01-03 18:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 1/2/2013 8:27 PM, Minchan Kim wrote:
> This patch adds new system call m[no]volatile.
> If someone asks is_volatile system call, it could be added, too.
>
> The reason why I introduced new system call instead of madvise is
> m[no]volatile vma handling is totally different with madvise's vma
> handling.
>
> 1) The m[no]volatile should be successful although the range includes
>     unmapped or non-volatile range. It just skips such range
>     without stopping with returning error although it encounters
>     invalid range. It makes user convenient without calling several
>     system call of small range - Suggested by John Stultz
>
> 2) The purged state of volatile range should be propagated out to user
>     although the range is merged with adjacent non-volatile range when
>     user calls mnovolatile.
>
> 3) mvolatile's interface could be changed with madvise
>     in future discussion.  For example, I feel needs
>     movlatile(start, len, mode).
>     'mode' means FULL_VOLATILE or PARTIAL_VOLATILE.
>     FULL volatile means that if VM decide to reclaim the range, it would
>     reclaim all of pages in the range but in case of PARTIAL_VOLATILE,
>     VM could reclaim just a few number of pages in the range.
>     In case of tmpfs-volatile, user may regenerate all images data once
>     one of page in the range is discarded so there is pointless that
>     VM discard a page in the range when memory pressure is severe.
>     In case of anon-volatile, too excess discarding cause too many minor
>     fault for the allocator so it would be better to discard part of
>     the range.
I don't understand point 3).
Are you saying that using mvolatile in conjunction with madvise could
allow mvolatile behavior to be tweaked in the future? Or are you
suggesting adding an extra parameter in the future (what would that have
to do with madvise)?

4) Having a new system call makes it easier for userspace apps to detect 
kernels without this functionality.

I really like the proposed interface. I like the suggestion of having
explicit FULL|PARTIAL_VOLATILE. Why not include PARTIAL_VOLATILE as a
required 3rd param in the first version, with the expectation that
FULL_VOLATILE will be added later (returning some not-supported error in
the meantime)?
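
That is, something along these lines (the names and the error value are
only a suggestion, not anything defined in this patch):

	/* hypothetical 3-argument form */
	long mvolatile(unsigned long start, size_t len, int mode);

	mvolatile(start, len, MVOLATILE_PARTIAL);	/* supported from day one */
	mvolatile(start, len, MVOLATILE_FULL);		/* not-supported error for now */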
>
> 3) The mvolatile system call's return value is quite different with
>     madvise. Look at below semantic explanation.
>
> So I want to separate mvolatile from madvise.
>
> mvolatile(start, len)'s semantics
>
> 1) It makes range(start, len) as volatile although the range includes
> unmapped area, special mapping and mlocked area which are just skipped.
>
> Return -EINVAL if range doesn't include a right vma at all.
> Return -ENOMEM, interrupting the range operation, if memory is not
> enough to merge/split vmas. In this case, some ranges would be
> volatile and others not so user may recall mvolatile after he
> cancel all range by mnovolatile.
> Return 0 if range consists of only proper vmas.
> Return 1 if part of range includes hole/huge/ksm/mlock/special area.
>
> 2) If user calls mvolatile to the range which was already volatile VMA and
> even purged state, VOLATILE attributes still remains but purged state
> is reset. I expect some user want to split volatile vma into smaller
> ranges. Although he can do it for mnovlatile(whole range) and serveral calling
> with movlatile(smaller range), this function can avoid mnovolatile if he
> doesn't care purged state. I'm not sure we really need this function so
> I hope listen opinions. Unfortunately, current implemenation doesn't split
> volatile VMA with new range in this case. I forgot implementing it
> in this version but decide to send it to listen opinions because
> implementing is rather trivial if we decided.
>
> mnovolatile(start, len)'s semantics is following as.
>
> 1) It makes range(start, len) as non-volatile although the range
> includes unmapped area, special mapping and non-volatile range
> which are just skipped.
>
> 2) If the range is purged, it will return 1 regardless of including
> invalid range.
If I understand this correctly:
mvolatile(0, 10);
// then range [9,10] is purged by the kernel
mnovolatile(0, 4) will fail?
That seems counterintuitive.

One of the uses for mnovolatile is to atomically lock the pages (vs a racy
proposed is_volatile syscall). The above situation would make it less
effective.
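
To make the locking pattern concrete, here is roughly what I have in mind;
the wrappers and span_* helpers are purely illustrative, the syscall
numbers are the x86-64 entries added by this patch (313/314), and the "1"
return is the PURGED value described in the changelog:

	#include <unistd.h>
	#include <string.h>

	#define __NR_mvolatile		313
	#define __NR_mnovolatile	314

	static long mvolatile(void *addr, size_t len)
	{
		return syscall(__NR_mvolatile, addr, len);
	}

	static long mnovolatile(void *addr, size_t len)
	{
		return syscall(__NR_mnovolatile, addr, len);
	}

	/* free path: let the kernel take the span under memory pressure */
	static void span_free(void *span, size_t len)
	{
		mvolatile(span, len);
	}

	/* alloc path: pin the span again before handing it back out */
	static void *span_reuse(void *span, size_t len)
	{
		if (mnovolatile(span, len) == 1)	/* PURGED */
			memset(span, 0, len);	/* contents were discarded */
		return span;
	}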


>
> 3) It returns -ENOMEM if system doesn't have enough memory for vma operation.
>
> 4) It returns -EINVAL if range doesn't include a right vma at all.
>
> 5) If user try to access purged range without mnovoatile call, it encounters
> SIGBUS which would show up next patch.
>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Arun Sharma <asharma@fb.com>
> Cc: sanjay@google.com
> Cc: Paul Turner <pjt@google.com>
> CC: David Rientjes <rientjes@google.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   arch/x86/syscalls/syscall_64.tbl |    2 +
>   fs/exec.c                        |    4 +-
>   include/linux/mm.h               |    6 +-
>   include/linux/mm_types.h         |    4 +
>   include/linux/mvolatile.h        |   30 ++++
>   include/linux/syscalls.h         |    2 +
>   mm/Kconfig                       |   11 ++
>   mm/Makefile                      |    2 +-
>   mm/madvise.c                     |    2 +-
>   mm/mempolicy.c                   |    2 +-
>   mm/mlock.c                       |    7 +-
>   mm/mmap.c                        |   62 ++++++--
>   mm/mprotect.c                    |    3 +-
>   mm/mremap.c                      |    2 +-
>   mm/mvolatile.c                   |  312 ++++++++++++++++++++++++++++++++++++++
>   mm/rmap.c                        |    2 +
>   16 files changed, 427 insertions(+), 26 deletions(-)
>   create mode 100644 include/linux/mvolatile.h
>   create mode 100644 mm/mvolatile.c
>
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index a582bfe..568d488 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -319,6 +319,8 @@
>   310	64	process_vm_readv	sys_process_vm_readv
>   311	64	process_vm_writev	sys_process_vm_writev
>   312	common	kcmp			sys_kcmp
> +313	common	mvolatile		sys_mvolatile
> +314	common	mnovolatile		sys_mnovolatile
>   
>   #
>   # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/exec.c b/fs/exec.c
> index 0039055..da677d1 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -594,7 +594,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
>   	/*
>   	 * cover the whole range: [new_start, old_end)
>   	 */
> -	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
> +	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL, NULL))
>   		return -ENOMEM;
>   
>   	/*
> @@ -628,7 +628,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
>   	/*
>   	 * Shrink the vma to just the new range.  Always succeeds.
>   	 */
> -	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
> +	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL, NULL);
>   
>   	return 0;
>   }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bcaab4e..4bb59f3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -87,6 +87,7 @@ extern unsigned int kobjsize(const void *objp);
>   #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
>   #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
>   
> +#define VM_VOLATILE	0x00001000	/* Pages could be discarede without swapout */
>   #define VM_LOCKED	0x00002000
>   #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
>   
> @@ -1411,11 +1412,12 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>   /* mmap.c */
>   extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
>   extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> -	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
> +	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> +	bool *purged);
>   extern struct vm_area_struct *vma_merge(struct mm_struct *,
>   	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>   	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> -	struct mempolicy *);
> +	struct mempolicy *, bool *purged);
>   extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
>   extern int split_vma(struct mm_struct *,
>   	struct vm_area_struct *, unsigned long addr, int new_below);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 31f8a3a..1eaf458 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -275,6 +275,10 @@ struct vm_area_struct {
>   #ifdef CONFIG_NUMA
>   	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>   #endif
> +#ifdef CONFIG_VOLATILE_PAGE
> +	/* True if more than a page in this vma is reclaimed. */
> +	bool purged;	/* Serialized by mmap_sem and anon_vma's mutex */
> +#endif
>   };
>   
>   struct core_thread {
> diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
> new file mode 100644
> index 0000000..cfb12b4
> --- /dev/null
> +++ b/include/linux/mvolatile.h
> @@ -0,0 +1,30 @@
> +#ifndef __LINUX_MVOLATILE_H
> +#define __LINUX_MVOLATILE_H
> +
> +#include <linux/syscalls.h>
> +
> +#ifdef CONFIG_VOLATILE_PAGE
> +static inline bool vma_purged(struct vm_area_struct *vma)
> +{
> +	return vma->purged;
> +}
> +
> +static inline void vma_purge_copy(struct vm_area_struct *dst,
> +					struct vm_area_struct *src)
> +{
> +	dst->purged = src->purged;
> +}
> +#else
> +static inline bool vma_purged(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +
> +static inline void vma_purge_copy(struct vm_area_struct *dst,
> +					struct vm_area_struct *src)
> +{
> +
> +}
> +#endif
> +#endif
> +
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 727f0cd..a8ded1c 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -470,6 +470,8 @@ asmlinkage long sys_munlock(unsigned long start, size_t len);
>   asmlinkage long sys_mlockall(int flags);
>   asmlinkage long sys_munlockall(void);
>   asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> +asmlinkage long sys_mvolatile(unsigned long start, size_t len);
> +asmlinkage long sys_mnovolatile(unsigned long start, size_t len);
>   asmlinkage long sys_mincore(unsigned long start, size_t len,
>   				unsigned char __user * vec);
>   
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a3f8ddd..30b24ba 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -355,6 +355,17 @@ choice
>   	  benefit.
>   endchoice
>   
> +config VOLATILE_PAGE
> +	bool "Volatile Page Support"
> +	depends on MMU
> +	help
> +	  Enabling this option adds the system calls mvolatile and mnovolatile
> +	  which are for giving user's address space range to kernel so VM
> +	  can discard pages of the range anytime instead swapout. This feature
> +	  can enhance performance to certain application(ex, memory allocator,
> +	  web browser's tmpfs pages) by reduce the number of minor fault and
> +          swap out.
> +
>   config CROSS_MEMORY_ATTACH
>   	bool "Cross Memory Support"
>   	depends on MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 6b025f8..1efb735 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -5,7 +5,7 @@
>   mmu-y			:= nommu.o
>   mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
>   			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
> -			   vmalloc.o pagewalk.o pgtable-generic.o
> +			   mvolatile.o vmalloc.o pagewalk.o pgtable-generic.o
>   
>   ifdef CONFIG_CROSS_MEMORY_ATTACH
>   mmu-$(CONFIG_MMU)	+= process_vm_access.o
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 03dfa5c..6ffad21 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -99,7 +99,7 @@ static long madvise_behavior(struct vm_area_struct * vma,
>   
>   	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
>   	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
> -				vma->vm_file, pgoff, vma_policy(vma));
> +				vma->vm_file, pgoff, vma_policy(vma), NULL);
>   	if (*prev) {
>   		vma = *prev;
>   		goto success;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ea600d..9b1aa2d 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -675,7 +675,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
>   			((vmstart - vma->vm_start) >> PAGE_SHIFT);
>   		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
>   				  vma->anon_vma, vma->vm_file, pgoff,
> -				  new_pol);
> +				  new_pol, NULL);
>   		if (prev) {
>   			vma = prev;
>   			next = vma->vm_next;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index f0b9ce5..e03523a 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -316,13 +316,14 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
>   	int ret = 0;
>   	int lock = !!(newflags & VM_LOCKED);
>   
> -	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
> -	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
> +	if (newflags == vma->vm_flags || (vma->vm_flags &
> +		(VM_SPECIAL|VM_VOLATILE)) || is_vm_hugetlb_page(vma) ||
> +		vma == get_gate_vma(current->mm))
>   		goto out;	/* don't set VM_LOCKED,  don't count */
>   
>   	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
>   	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
> -			  vma->vm_file, pgoff, vma_policy(vma));
> +			  vma->vm_file, pgoff, vma_policy(vma), NULL);
>   	if (*prev) {
>   		vma = *prev;
>   		goto success;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9a796c4..ba636c3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -31,6 +31,7 @@
>   #include <linux/audit.h>
>   #include <linux/khugepaged.h>
>   #include <linux/uprobes.h>
> +#include <linux/mvolatile.h>
>   
>   #include <asm/uaccess.h>
>   #include <asm/cacheflush.h>
> @@ -516,7 +517,8 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
>    * before we drop the necessary locks.
>    */
>   int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> -	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
> +	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> +	bool *purged)
>   {
>   	struct mm_struct *mm = vma->vm_mm;
>   	struct vm_area_struct *next = vma->vm_next;
> @@ -527,10 +529,9 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
>   	struct file *file = vma->vm_file;
>   	long adjust_next = 0;
>   	int remove_next = 0;
> +	struct vm_area_struct *exporter = NULL;
>   
>   	if (next && !insert) {
> -		struct vm_area_struct *exporter = NULL;
> -
>   		if (end >= next->vm_end) {
>   			/*
>   			 * vma expands, overlapping all the next, and
> @@ -621,6 +622,15 @@ again:			remove_next = 1 + (end > next->vm_end);
>   	if (adjust_next) {
>   		next->vm_start += adjust_next << PAGE_SHIFT;
>   		next->vm_pgoff += adjust_next;
> +		/*
> +		 * Look at mm/mvolatile.c for knowing terminology.
> +		 * V4. NNPPVV -> NNNPVV
> +		 */
> +		if (purged) {
> +			*purged = vma_purged(next);
> +			if (exporter == vma) /* V5. VVPPNN -> VVPNNN */
> +				*purged = vma_purged(vma);
> +		}
>   	}
>   
>   	if (root) {
> @@ -651,6 +661,13 @@ again:			remove_next = 1 + (end > next->vm_end);
>   		anon_vma_interval_tree_post_update_vma(vma);
>   		if (adjust_next)
>   			anon_vma_interval_tree_post_update_vma(next);
> +		/*
> +		 * Look at mm/mvolatile.c for knowing terminology.
> +		 * V7. VVPPVV -> VVNPVV
> +		 * V8. VVPPVV -> VVPNVV
> +		 */
> +		if (insert)
> +			vma_purge_copy(insert, vma);
>   		anon_vma_unlock(anon_vma);
>   	}
>   	if (mapping)
> @@ -670,6 +687,20 @@ again:			remove_next = 1 + (end > next->vm_end);
>   		}
>   		if (next->anon_vma)
>   			anon_vma_merge(vma, next);
> +
> +		/*
> +		 * next is detatched from anon vma chain so purged isn't
> +		 * raced any more.
> +		 * Look at mm/mvolatile.c for knowing terminology.
> +		 *
> +		 * V1. NNPPVV -> NNNNVV
> +		 * V2. VVPPNN -> VVNNNN
> +		 * V3. NNPPNN -> NNNNNN
> +		 */
> +		if (purged) {
> +			*purged |= vma_purged(vma); /* case V2 */
> +			*purged |= vma_purged(next); /* case V1,V3 */
> +		}
>   		mm->map_count--;
>   		mpol_put(vma_policy(next));
>   		kmem_cache_free(vm_area_cachep, next);
> @@ -798,7 +829,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>   			struct vm_area_struct *prev, unsigned long addr,
>   			unsigned long end, unsigned long vm_flags,
>   		     	struct anon_vma *anon_vma, struct file *file,
> -			pgoff_t pgoff, struct mempolicy *policy)
> +			pgoff_t pgoff, struct mempolicy *policy, bool *purged)
>   {
>   	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
>   	struct vm_area_struct *area, *next;
> @@ -808,7 +839,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>   	 * We later require that vma->vm_flags == vm_flags,
>   	 * so this tests vma->vm_flags & VM_SPECIAL, too.
>   	 */
> -	if (vm_flags & VM_SPECIAL)
> +	if (vm_flags & (VM_SPECIAL|VM_VOLATILE))
>   		return NULL;
>   
>   	if (prev)
> @@ -837,10 +868,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>   						      next->anon_vma, NULL)) {
>   							/* cases 1, 6 */
>   			err = vma_adjust(prev, prev->vm_start,
> -				next->vm_end, prev->vm_pgoff, NULL);
> +				next->vm_end, prev->vm_pgoff, NULL, purged);
>   		} else					/* cases 2, 5, 7 */
>   			err = vma_adjust(prev, prev->vm_start,
> -				end, prev->vm_pgoff, NULL);
> +				end, prev->vm_pgoff, NULL, purged);
>   		if (err)
>   			return NULL;
>   		khugepaged_enter_vma_merge(prev);
> @@ -856,10 +887,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>   					anon_vma, file, pgoff+pglen)) {
>   		if (prev && addr < prev->vm_end)	/* case 4 */
>   			err = vma_adjust(prev, prev->vm_start,
> -				addr, prev->vm_pgoff, NULL);
> +				addr, prev->vm_pgoff, NULL, purged);
>   		else					/* cases 3, 8 */
>   			err = vma_adjust(area, addr, next->vm_end,
> -				next->vm_pgoff - pglen, NULL);
> +				next->vm_pgoff - pglen, NULL, purged);
>   		if (err)
>   			return NULL;
>   		khugepaged_enter_vma_merge(area);
> @@ -1292,7 +1323,8 @@ munmap_back:
>   	/*
>   	 * Can we just expand an old mapping?
>   	 */
> -	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
> +	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file,
> +				pgoff, NULL, NULL);
>   	if (vma)
>   		goto out;
>   
> @@ -2025,9 +2057,10 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
>   
>   	if (new_below)
>   		err = vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
> -			((addr - new->vm_start) >> PAGE_SHIFT), new);
> +			((addr - new->vm_start) >> PAGE_SHIFT), new, NULL);
>   	else
> -		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
> +		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff,
> +			new, NULL);
>   
>   	/* Success. */
>   	if (!err)
> @@ -2240,7 +2273,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
>   
>   	/* Can we just expand an old private anonymous mapping? */
>   	vma = vma_merge(mm, prev, addr, addr + len, flags,
> -					NULL, NULL, pgoff, NULL);
> +					NULL, NULL, pgoff, NULL, NULL);
>   	if (vma)
>   		goto out;
>   
> @@ -2396,7 +2429,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>   	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
>   		return NULL;	/* should never get here */
>   	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> -			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
> +			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> +			NULL);
>   	if (new_vma) {
>   		/*
>   		 * Source vma may have been merged into new_vma
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index a409926..f461177 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -179,7 +179,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
>   	 */
>   	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
>   	*pprev = vma_merge(mm, *pprev, start, end, newflags,
> -			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
> +			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> +			NULL);
>   	if (*pprev) {
>   		vma = *pprev;
>   		goto success;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 1b61c2d..8586c52 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -512,7 +512,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>   			int pages = (new_len - old_len) >> PAGE_SHIFT;
>   
>   			if (vma_adjust(vma, vma->vm_start, addr + new_len,
> -				       vma->vm_pgoff, NULL)) {
> +				       vma->vm_pgoff, NULL, NULL)) {
>   				ret = -ENOMEM;
>   				goto out;
>   			}
> diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> new file mode 100644
> index 0000000..8b812d2
> --- /dev/null
> +++ b/mm/mvolatile.c
> @@ -0,0 +1,312 @@
> +/*
> + *	linux/mm/mvolatile.c
> + *
> + *  Copyright 2012 Minchan Kim
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mvolatile.h>
> +#include <linux/mm_types.h>
> +#include <linux/mm.h>
> +#include <linux/rmap.h>
> +#include <linux/mempolicy.h>
> +
> +#ifndef CONFIG_VOLATILE_PAGE
> +SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> +{
> +	return -EINVAL;
> +}
> +
> +SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> +{
> +	return -EINVAL;
> +}
> +#else
> +
> +#define NO_PURGED	0
> +#define PURGED		1
> +
> +/*
> + * N: Normal VMA
> + * V: Volatile VMA
> + * P: Purged volatile VMA
> + *
> + * Assume that each VMA has two block so case 1-8 consists of three VMA.
> + * For example, NNPPVV means VMA1 has normal VMA, VMA2 has purged volailte VMA,
> + * and VMA3 has volatile VMA. With another example, NNPVVV means VMA1 has
> + * normal VMA, VMA2-1 has purged volatile VMA, VMA2-2 has volatile VMA.
> + *
> + * Case 7,8 create a new VMA and we call it VMA4 which can be loated before VMA2
> + * or after.
> + *
> + * Notice: The merge between volatile VMAs shouldn't happen.
> + * If we call mnovolatile(VMA2),
> + *
> + * Case 1 NNPPVV -> NNNNVV
> + * Case 2 VVPPNN -> VVNNNN
> + * Case 3 NNPPNN -> NNNNNN
> + * Case 4 NNPPVV -> NNNPVV
> + * case 5 VVPPNN -> VVPNNN
> + * case 6 VVPPVV -> VVNNVV
> + * case 7 VVPPVV -> VVNPVV
> + * case 8 VVPPVV -> VVPNVV
> + */
> +static int do_mnovolatile(struct vm_area_struct *vma,
> +		struct vm_area_struct **prev, unsigned long start,
> +		unsigned long end, bool *is_purged)
> +{
> +	unsigned long new_flags;
> +	int error = 0;
> +	struct mm_struct *mm = vma->vm_mm;
> +	pgoff_t pgoff;
> +	bool purged = false;
> +
> +	new_flags = vma->vm_flags & ~VM_VOLATILE;
> +	if (new_flags == vma->vm_flags) {
> +		*prev = vma;
> +		goto success;
> +	}
> +
> +
> +	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> +	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
> +			vma->vm_file, pgoff, vma_policy(vma), &purged);
> +	if (*prev) {
> +		vma = *prev;
> +		goto success;
> +	}
> +
> +	*prev = vma;
> +
> +	if (start != vma->vm_start) {
> +		error = split_vma(mm, vma, start, 1);
> +		if (error)
> +			goto out;
> +	}
> +
> +	if (end != vma->vm_end) {
> +		error = split_vma(mm, vma, end, 0);
> +		if (error)
> +			goto out;
> +	}
> +
> +success:
> +	/* V6. VVPPVV -> VVNNVV */
> +	vma_lock_anon_vma(vma);
> +	*is_purged |= (vma->purged|purged);
> +	vma_unlock_anon_vma(vma);
> +
> +	vma->vm_flags = new_flags;
> +	vma->purged = false;
> +	return 0;
> +out:
> +	return error;
> +}
> +
> +/* I didn't look into KSM/Hugepage so disalbed them */
> +#define VM_NO_VOLATILE	(VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|\
> +		VM_MERGEABLE|VM_HUGEPAGE|VM_LOCKED)
> +
> +static int do_mvolatile(struct vm_area_struct *vma,
> +	struct vm_area_struct **prev, unsigned long start, unsigned long end)
> +{
> +	int error = -EINVAL;
> +	vm_flags_t new_flags = vma->vm_flags;
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	new_flags |= VM_VOLATILE;
> +
> +	/* Note : Current version doesn't support file vma volatile */
> +	if (vma->vm_file) {
> +		*prev = vma;
> +		goto out;
> +	}
> +
> +	if (vma->vm_flags & VM_NO_VOLATILE ||
> +			(vma == get_gate_vma(current->mm))) {
> +		*prev = vma;
> +		goto out;
> +	}
> +	/*
> +	 * In case of calling MADV_VOLATILE again,
> +	 * We just reset purged state.
> +	 */
> +	if (new_flags == vma->vm_flags) {
> +		*prev = vma;
> +		vma_lock_anon_vma(vma);
> +		vma->purged = false;
> +		vma_unlock_anon_vma(vma);
> +		error = 0;
> +		goto out;
> +	}
> +
> +	*prev = vma;
> +
> +	if (start != vma->vm_start) {
> +		error = split_vma(mm, vma, start, 1);
> +		if (error)
> +			goto out;
> +	}
> +
> +	if (end != vma->vm_end) {
> +		error = split_vma(mm, vma, end, 0);
> +		if (error)
> +			goto out;
> +	}
> +
> +	error = 0;
> +
> +	vma_lock_anon_vma(vma);
> +	vma->vm_flags = new_flags;
> +	vma_unlock_anon_vma(vma);
> +out:
> +	return error;
> +}
> +
> +/*
> + * Return -EINVAL if range doesn't include a right vma at all.
> + * Return -ENOMEM with interrupting range opeartion if memory is not enough to
> + * merge/split vmas.
> + * Return 0 if range consists of only proper vmas.
> + * Return 1 if part of range includes inavlid area(ex, hole/huge/ksm/mlock/
> + * special area)
> + */
> +SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> +{
> +	unsigned long end, tmp;
> +	struct vm_area_struct *vma, *prev;
> +	bool invalid = false;
> +	int error = -EINVAL;
> +
> +	down_write(&current->mm->mmap_sem);
> +	if (start & ~PAGE_MASK)
> +		goto out;
> +
> +	len &= PAGE_MASK;
> +	if (!len)
> +		goto out;
> +
> +	end = start + len;
> +	if (end < start)
> +		goto out;
> +
> +	vma = find_vma_prev(current->mm, start, &prev);
> +	if (!vma)
> +		goto out;
> +
> +	if (start > vma->vm_start)
> +		prev = vma;
> +
> +	for (;;) {
> +		/* Here start < (end|vma->vm_end). */
> +		if (start < vma->vm_start) {
> +			start = vma->vm_start;
> +			if (start >= end)
> +				goto out;
> +			invalid = true;
> +		}
> +
> +		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> +		error = do_mvolatile(vma, &prev, start, tmp);
> +		if (error == -ENOMEM) {
> +			up_write(&current->mm->mmap_sem);
> +			return error;
> +		}
> +		if (error == -EINVAL)
> +			invalid = true;
> +		else
> +			error = 0;
> +		start = tmp;
> +		if (prev && start < prev->vm_end)
> +			start = prev->vm_end;
> +		if (start >= end)
> +			break;
> +
> +		vma = prev->vm_next;
> +		if (!vma)
> +			break;
> +	}
> +out:
> +	up_write(&current->mm->mmap_sem);
> +	return invalid ? 1 : 0;
> +}
> +/*
> + * Return -ENOMEM with interrupting range opeartion if memory is not enough
> + * to merge/split vmas.
> + * Return 1 if part of range includes purged's one, otherwise, return 0
> + */
> +SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> +{
> +	unsigned long end, tmp;
> +	struct vm_area_struct *vma, *prev;
> +	int ret, error = -EINVAL;
> +	bool is_purged = false;
> +
> +	down_write(&current->mm->mmap_sem);
> +	if (start & ~PAGE_MASK)
> +		goto out;
> +
> +	len &= PAGE_MASK;
> +	if (!len)
> +		goto out;
> +
> +	end = start + len;
> +	if (end < start)
> +		goto out;
> +
> +	vma = find_vma_prev(current->mm, start, &prev);
> +	if (!vma)
> +		goto out;
> +
> +	if (start > vma->vm_start)
> +		prev = vma;
> +
> +	for (;;) {
> +		/* Here start < (end|vma->vm_end). */
> +		if (start < vma->vm_start) {
> +			start = vma->vm_start;
> +			if (start >= end)
> +				goto out;
> +		}
> +
> +		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> +		error = do_mnovolatile(vma, &prev, start, tmp, &is_purged);
> +		if (error) {
> +			WARN_ON(error != -ENOMEM);
> +			goto out;
> +		}
> +		start = tmp;
> +		if (prev && start < prev->vm_end)
> +			start = prev->vm_end;
> +		if (start >= end)
> +			break;
> +
> +		vma = prev->vm_next;
> +		if (!vma)
> +			break;
> +	}
> +out:
> +	up_write(&current->mm->mmap_sem);
> +
> +	if (error)
> +		ret = error;
> +	else if (is_purged)
> +		ret = PURGED;
> +	else
> +		ret = NO_PURGED;
> +
> +	return ret;
> +}
> +#endif
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2ee1ef0..402d9da 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -57,6 +57,7 @@
>   #include <linux/migrate.h>
>   #include <linux/hugetlb.h>
>   #include <linux/backing-dev.h>
> +#include <linux/mvolatile.h>
>   
>   #include <asm/tlbflush.h>
>   
> @@ -308,6 +309,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>   	vma->anon_vma = anon_vma;
>   	anon_vma_lock(anon_vma);
>   	anon_vma_chain_link(vma, avc, anon_vma);
> +	vma_purge_copy(vma, pvma);
>   	anon_vma_unlock(anon_vma);
>   
>   	return 0;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 2/8] Don't allow volatile attribute on THP and KSM
  2013-01-03 16:27   ` Dave Hansen
@ 2013-01-04  2:51     ` Minchan Kim
  0 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-04  2:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins,
	Andrea Arcangeli

On Thu, Jan 03, 2013 at 08:27:31AM -0800, Dave Hansen wrote:
> On 01/02/2013 08:28 PM, Minchan Kim wrote:
> > VOLATILE implies the pages in the range are no longer part of the working
> > set, so it's pointless to make them into THP/KSM pages.
> 
> One of the points of this implementation is that it be able to preserve
> memory contents when there is no pressure.  If those contents happen to
> contain a THP/KSM page, and there's no pressure, it seems like the right
> thing to do is to leave that memory in place.

Indeed. I should have written this more clearly.

The current implementation is as follows:

1. madvised-THP/KSM(1, 10) -> mvolatile(1, 10) -> fail
2. mvolatile(1, 10) -> madvised-THP/KSM(1, 10) -> fail
3. always-THP -> mvolatile -> success -> if memory pressure happens
   -> split_huge_page -> discard.

I think 2 and 3 make sense, but we need to fix 1 in further patches.
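
For case 1, a rough userspace illustration (just a sketch; the syscall
number is the x86_64 one this series adds, and I am assuming the skip is
reported through the "invalid range" return value):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#define __NR_mvolatile 313	/* from this series' syscall_64.tbl hunk */

int main(void)
{
	size_t len = 4UL << 20;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Case 1: madvise the range for THP first, then mark it volatile. */
	madvise(buf, len, MADV_HUGEPAGE);
	memset(buf, 0, len);

	/* With the current patches the madvised-THP vma is skipped, so this
	 * does not return 0 even though the range looks perfectly normal. */
	printf("mvolatile -> %ld\n", syscall(__NR_mvolatile, buf, len));
	return 0;
}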

> 
> It might be a fair thing to do this in order to keep the implementation
> more sane at the moment.  But, we should make sure there's some good
> text on that in the changelog.

Absolutely. Thanks for pointing that out, Dave.

> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/8] Introduce new system call mvolatile
  2013-01-03 18:35   ` Taras Glek
@ 2013-01-04  4:25     ` Minchan Kim
  0 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-04  4:25 UTC (permalink / raw)
  To: Taras Glek
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, KOSAKI Motohiro, KAMEZAWA Hiroyuki

Hi,

On Thu, Jan 03, 2013 at 10:35:24AM -0800, Taras Glek wrote:
> On 1/2/2013 8:27 PM, Minchan Kim wrote:
> >This patch adds new system call m[no]volatile.
> >If someone asks is_volatile system call, it could be added, too.
> >
> >The reason why I introduced new system call instead of madvise is
> >m[no]volatile vma handling is totally different with madvise's vma
> >handling.
> >
> >1) The m[no]volatile call should succeed even if the range includes
> >    unmapped or non-volatile areas. It just skips such areas instead
> >    of stopping and returning an error when it encounters an invalid
> >    range. This is convenient for the user, who otherwise would have
> >    to make several system calls on small ranges - Suggested by John Stultz
> >
> >2) The purged state of volatile range should be propagated out to user
> >    although the range is merged with adjacent non-volatile range when
> >    user calls mnovolatile.
> >
> >3) mvolatile's interface could still change relative to madvise
> >    in future discussion.  For example, I feel we may need
> >    mvolatile(start, len, mode).
> >    'mode' means FULL_VOLATILE or PARTIAL_VOLATILE.
> >    FULL_VOLATILE means that if the VM decides to reclaim the range, it
> >    reclaims all of the pages in the range, while with PARTIAL_VOLATILE
> >    the VM could reclaim just a few pages in the range.
> >    In the tmpfs-volatile case, the user may regenerate all image data once
> >    one page in the range is discarded, so it is pointless for the
> >    VM to discard a single page in the range when memory pressure is severe.
> >    In the anon-volatile case, too much discarding causes too many minor
> >    faults for the allocator, so it would be better to discard only part of
> >    the range.
> I don't understand point 3).
> Are you saying that using mvolatile in conjunction with madvise could
> allow mvolatile behavior to be tweaked in the future? Or are you
> suggesting adding an extra parameter in the future (what would that
> have to do with madvise)?

I meant that I might want to expand mvolatile's interface, like below,
during the discussion.

        int mvolatile(start, len, mode);
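
Something like this, as a header-style sketch only (neither the mode
argument nor these names exist in the current patches):

#include <stddef.h>

/* Hypothetical mode values, names not part of this series. */
#define MVOLATILE_FULL		0	/* reclaim discards the whole range or nothing */
#define MVOLATILE_PARTIAL	1	/* reclaim may discard only part of the range */

/*
 * FULL would suit tmpfs-volatile style users that have to regenerate the
 * whole object once any page is lost; PARTIAL would suit allocators that
 * want to limit the number of minor faults when the range is reused.
 */
int mvolatile(unsigned long start, size_t len, int mode);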

> 
> 4) Having a new system call makes it easier for userspace apps to
> detect kernels without this functionality.

I couldn't understand your claim.
Right now mvolatile just returns EINVAL on a !CONFIG_VOLATILE_PAGE system.
Why is that easier than getting EINVAL when we call madvise(VOLATILE)
on !CONFIG_VOLATILE_PAGE?
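
For what it's worth, a minimal probe from userspace could look roughly
like this (sketch only, assuming the x86_64 syscall number from this
series); it only distinguishes a kernel that knows the syscall from one
that doesn't, since a !CONFIG_VOLATILE_PAGE kernel answers -EINVAL just
like the madvise variant would:

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_mvolatile 313	/* from this series' syscall_64.tbl hunk */

/* True if the running kernel has the mvolatile entry point at all.
 * A kernel built without CONFIG_VOLATILE_PAGE still answers -EINVAL,
 * exactly like madvise(VOLATILE) would on such a kernel. */
static bool kernel_has_mvolatile(void)
{
	long ret = syscall(__NR_mvolatile, 0UL, (size_t)0);

	return !(ret == -1 && errno == ENOSYS);
}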

> 
> I really like the proposed interface. I like the suggestion of

Thanks.

> having explicit FULL|PARTIAL_VOLATILE. Why not include
> PARTIAL_VOLATILE as a required 3rd param in first version with
> expectation that FULL_VOLATILE will be added later(and returning
> some not-supported error in meantime)?

I just wanted to discuss whether we need it.
The reason I want PARTIAL_VOLATILE is that it avoids many minor faults
for the allocator. Is it useful for tmpfs-volatile, too?

Thanks for the feedback, Taras.

> >
> >3) The mvolatile system call's return value is quite different from
> >    madvise's. Look at the semantic explanation below.
> >
> >So I want to separate mvolatile from madvise.
> >
> >mvolatile(start, len)'s semantics
> >
> >1) It makes the range (start, len) volatile even if the range includes
> >unmapped areas, special mappings and mlocked areas, which are just skipped.
> >
> >Return -EINVAL if the range doesn't include a right vma at all.
> >Return -ENOMEM, interrupting the range operation, if memory is not
> >enough to merge/split vmas. In this case, some ranges would be
> >volatile and others not, so the user may call mvolatile again after
> >cancelling the whole range with mnovolatile.
> >Return 0 if the range consists of only proper vmas.
> >Return 1 if part of the range includes hole/huge/ksm/mlock/special areas.
> >
> >2) If the user calls mvolatile on a range that is already a volatile VMA,
> >even one in the purged state, the VOLATILE attribute still remains but the
> >purged state is reset. I expect some users want to split a volatile vma
> >into smaller ranges. Although they can do that with mnovolatile(whole range)
> >and several calls to mvolatile(smaller range), this behaviour avoids the
> >mnovolatile call if they don't care about the purged state. I'm not sure we
> >really need it, so I hope to hear opinions. Unfortunately, the current
> >implementation doesn't split the volatile VMA at the new range in this case.
> >I forgot to implement it in this version but decided to send the series
> >anyway to gather opinions, because implementing it is rather trivial once
> >we decide.
> >
> >mnovolatile(start, len)'s semantics are as follows.
> >
> >1) It makes the range (start, len) non-volatile even if the range
> >includes unmapped areas, special mappings and non-volatile ranges,
> >which are just skipped.
> >
> >2) If the range was purged, it returns 1 regardless of whether the
> >range includes invalid areas.
> If I understand this correctly:
> mvolatile(0, 10);
> //then range [9,10] is purged by kernel
> mnovolatile(0,4) will fail?
> that seems counterintuitive.
> 
> One of the uses for mnovolatile is to atomicly lock the pages(vs a
> racy proposed is_volatile) syscall. Above situation would make it
> less effective.
> 
> 
> >
> >3) It returns -ENOMEM if the system doesn't have enough memory for the vma operation.
> >
> >4) It returns -EINVAL if the range doesn't include a right vma at all.
> >
> >5) If the user tries to access a purged range without calling mnovolatile, it
> >encounters SIGBUS, which shows up in the next patch.
> >
> >Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> >Cc: Arun Sharma <asharma@fb.com>
> >Cc: sanjay@google.com
> >Cc: Paul Turner <pjt@google.com>
> >CC: David Rientjes <rientjes@google.com>
> >Cc: John Stultz <john.stultz@linaro.org>
> >Cc: Andrew Morton <akpm@linux-foundation.org>
> >Cc: Christoph Lameter <cl@linux.com>
> >Cc: Android Kernel Team <kernel-team@android.com>
> >Cc: Robert Love <rlove@google.com>
> >Cc: Mel Gorman <mel@csn.ul.ie>
> >Cc: Hugh Dickins <hughd@google.com>
> >Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> >Cc: Rik van Riel <riel@redhat.com>
> >Cc: Dave Chinner <david@fromorbit.com>
> >Cc: Neil Brown <neilb@suse.de>
> >Cc: Mike Hommey <mh@glandium.org>
> >Cc: Taras Glek <tglek@mozilla.com>
> >Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> >Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  arch/x86/syscalls/syscall_64.tbl |    2 +
> >  fs/exec.c                        |    4 +-
> >  include/linux/mm.h               |    6 +-
> >  include/linux/mm_types.h         |    4 +
> >  include/linux/mvolatile.h        |   30 ++++
> >  include/linux/syscalls.h         |    2 +
> >  mm/Kconfig                       |   11 ++
> >  mm/Makefile                      |    2 +-
> >  mm/madvise.c                     |    2 +-
> >  mm/mempolicy.c                   |    2 +-
> >  mm/mlock.c                       |    7 +-
> >  mm/mmap.c                        |   62 ++++++--
> >  mm/mprotect.c                    |    3 +-
> >  mm/mremap.c                      |    2 +-
> >  mm/mvolatile.c                   |  312 ++++++++++++++++++++++++++++++++++++++
> >  mm/rmap.c                        |    2 +
> >  16 files changed, 427 insertions(+), 26 deletions(-)
> >  create mode 100644 include/linux/mvolatile.h
> >  create mode 100644 mm/mvolatile.c
> >
> >diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> >index a582bfe..568d488 100644
> >--- a/arch/x86/syscalls/syscall_64.tbl
> >+++ b/arch/x86/syscalls/syscall_64.tbl
> >@@ -319,6 +319,8 @@
> >  310	64	process_vm_readv	sys_process_vm_readv
> >  311	64	process_vm_writev	sys_process_vm_writev
> >  312	common	kcmp			sys_kcmp
> >+313	common	mvolatile		sys_mvolatile
> >+314	common	mnovolatile		sys_mnovolatile
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> >diff --git a/fs/exec.c b/fs/exec.c
> >index 0039055..da677d1 100644
> >--- a/fs/exec.c
> >+++ b/fs/exec.c
> >@@ -594,7 +594,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
> >  	/*
> >  	 * cover the whole range: [new_start, old_end)
> >  	 */
> >-	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
> >+	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL, NULL))
> >  		return -ENOMEM;
> >  	/*
> >@@ -628,7 +628,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
> >  	/*
> >  	 * Shrink the vma to just the new range.  Always succeeds.
> >  	 */
> >-	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
> >+	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL, NULL);
> >  	return 0;
> >  }
> >diff --git a/include/linux/mm.h b/include/linux/mm.h
> >index bcaab4e..4bb59f3 100644
> >--- a/include/linux/mm.h
> >+++ b/include/linux/mm.h
> >@@ -87,6 +87,7 @@ extern unsigned int kobjsize(const void *objp);
> >  #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
> >  #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
> >+#define VM_VOLATILE	0x00001000	/* Pages could be discarded without swapout */
> >  #define VM_LOCKED	0x00002000
> >  #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
> >@@ -1411,11 +1412,12 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
> >  /* mmap.c */
> >  extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
> >  extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
> >+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> >+	bool *purged);
> >  extern struct vm_area_struct *vma_merge(struct mm_struct *,
> >  	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> >  	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> >-	struct mempolicy *);
> >+	struct mempolicy *, bool *purged);
> >  extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
> >  extern int split_vma(struct mm_struct *,
> >  	struct vm_area_struct *, unsigned long addr, int new_below);
> >diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >index 31f8a3a..1eaf458 100644
> >--- a/include/linux/mm_types.h
> >+++ b/include/linux/mm_types.h
> >@@ -275,6 +275,10 @@ struct vm_area_struct {
> >  #ifdef CONFIG_NUMA
> >  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
> >  #endif
> >+#ifdef CONFIG_VOLATILE_PAGE
> >+	/* True if at least one page in this vma has been reclaimed. */
> >+	bool purged;	/* Serialized by mmap_sem and anon_vma's mutex */
> >+#endif
> >  };
> >  struct core_thread {
> >diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
> >new file mode 100644
> >index 0000000..cfb12b4
> >--- /dev/null
> >+++ b/include/linux/mvolatile.h
> >@@ -0,0 +1,30 @@
> >+#ifndef __LINUX_MVOLATILE_H
> >+#define __LINUX_MVOLATILE_H
> >+
> >+#include <linux/syscalls.h>
> >+
> >+#ifdef CONFIG_VOLATILE_PAGE
> >+static inline bool vma_purged(struct vm_area_struct *vma)
> >+{
> >+	return vma->purged;
> >+}
> >+
> >+static inline void vma_purge_copy(struct vm_area_struct *dst,
> >+					struct vm_area_struct *src)
> >+{
> >+	dst->purged = src->purged;
> >+}
> >+#else
> >+static inline bool vma_purged(struct vm_area_struct *vma)
> >+{
> >+	return false;
> >+}
> >+
> >+static inline void vma_purge_copy(struct vm_area_struct *dst,
> >+					struct vm_area_struct *src)
> >+{
> >+
> >+}
> >+#endif
> >+#endif
> >+
> >diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> >index 727f0cd..a8ded1c 100644
> >--- a/include/linux/syscalls.h
> >+++ b/include/linux/syscalls.h
> >@@ -470,6 +470,8 @@ asmlinkage long sys_munlock(unsigned long start, size_t len);
> >  asmlinkage long sys_mlockall(int flags);
> >  asmlinkage long sys_munlockall(void);
> >  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> >+asmlinkage long sys_mvolatile(unsigned long start, size_t len);
> >+asmlinkage long sys_mnovolatile(unsigned long start, size_t len);
> >  asmlinkage long sys_mincore(unsigned long start, size_t len,
> >  				unsigned char __user * vec);
> >diff --git a/mm/Kconfig b/mm/Kconfig
> >index a3f8ddd..30b24ba 100644
> >--- a/mm/Kconfig
> >+++ b/mm/Kconfig
> >@@ -355,6 +355,17 @@ choice
> >  	  benefit.
> >  endchoice
> >+config VOLATILE_PAGE
> >+	bool "Volatile Page Support"
> >+	depends on MMU
> >+	help
> >+	  Enabling this option adds the system calls mvolatile and mnovolatile,
> >+	  which let a process hand ranges of its address space to the kernel so
> >+	  the VM can discard pages in those ranges at any time instead of
> >+	  swapping them out. This feature can enhance performance for certain
> >+	  applications (e.g. a memory allocator, a web browser's tmpfs pages)
> >+	  by reducing the number of minor faults and swapouts.
> >+
> >  config CROSS_MEMORY_ATTACH
> >  	bool "Cross Memory Support"
> >  	depends on MMU
> >diff --git a/mm/Makefile b/mm/Makefile
> >index 6b025f8..1efb735 100644
> >--- a/mm/Makefile
> >+++ b/mm/Makefile
> >@@ -5,7 +5,7 @@
> >  mmu-y			:= nommu.o
> >  mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
> >  			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
> >-			   vmalloc.o pagewalk.o pgtable-generic.o
> >+			   mvolatile.o vmalloc.o pagewalk.o pgtable-generic.o
> >  ifdef CONFIG_CROSS_MEMORY_ATTACH
> >  mmu-$(CONFIG_MMU)	+= process_vm_access.o
> >diff --git a/mm/madvise.c b/mm/madvise.c
> >index 03dfa5c..6ffad21 100644
> >--- a/mm/madvise.c
> >+++ b/mm/madvise.c
> >@@ -99,7 +99,7 @@ static long madvise_behavior(struct vm_area_struct * vma,
> >  	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> >  	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
> >-				vma->vm_file, pgoff, vma_policy(vma));
> >+				vma->vm_file, pgoff, vma_policy(vma), NULL);
> >  	if (*prev) {
> >  		vma = *prev;
> >  		goto success;
> >diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> >index 4ea600d..9b1aa2d 100644
> >--- a/mm/mempolicy.c
> >+++ b/mm/mempolicy.c
> >@@ -675,7 +675,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
> >  			((vmstart - vma->vm_start) >> PAGE_SHIFT);
> >  		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
> >  				  vma->anon_vma, vma->vm_file, pgoff,
> >-				  new_pol);
> >+				  new_pol, NULL);
> >  		if (prev) {
> >  			vma = prev;
> >  			next = vma->vm_next;
> >diff --git a/mm/mlock.c b/mm/mlock.c
> >index f0b9ce5..e03523a 100644
> >--- a/mm/mlock.c
> >+++ b/mm/mlock.c
> >@@ -316,13 +316,14 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  	int ret = 0;
> >  	int lock = !!(newflags & VM_LOCKED);
> >-	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
> >-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
> >+	if (newflags == vma->vm_flags || (vma->vm_flags &
> >+		(VM_SPECIAL|VM_VOLATILE)) || is_vm_hugetlb_page(vma) ||
> >+		vma == get_gate_vma(current->mm))
> >  		goto out;	/* don't set VM_LOCKED,  don't count */
> >  	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> >  	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
> >-			  vma->vm_file, pgoff, vma_policy(vma));
> >+			  vma->vm_file, pgoff, vma_policy(vma), NULL);
> >  	if (*prev) {
> >  		vma = *prev;
> >  		goto success;
> >diff --git a/mm/mmap.c b/mm/mmap.c
> >index 9a796c4..ba636c3 100644
> >--- a/mm/mmap.c
> >+++ b/mm/mmap.c
> >@@ -31,6 +31,7 @@
> >  #include <linux/audit.h>
> >  #include <linux/khugepaged.h>
> >  #include <linux/uprobes.h>
> >+#include <linux/mvolatile.h>
> >  #include <asm/uaccess.h>
> >  #include <asm/cacheflush.h>
> >@@ -516,7 +517,8 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
> >   * before we drop the necessary locks.
> >   */
> >  int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
> >+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> >+	bool *purged)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct vm_area_struct *next = vma->vm_next;
> >@@ -527,10 +529,9 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >  	struct file *file = vma->vm_file;
> >  	long adjust_next = 0;
> >  	int remove_next = 0;
> >+	struct vm_area_struct *exporter = NULL;
> >  	if (next && !insert) {
> >-		struct vm_area_struct *exporter = NULL;
> >-
> >  		if (end >= next->vm_end) {
> >  			/*
> >  			 * vma expands, overlapping all the next, and
> >@@ -621,6 +622,15 @@ again:			remove_next = 1 + (end > next->vm_end);
> >  	if (adjust_next) {
> >  		next->vm_start += adjust_next << PAGE_SHIFT;
> >  		next->vm_pgoff += adjust_next;
> >+		/*
> >+		 * See mm/mvolatile.c for the terminology.
> >+		 * V4. NNPPVV -> NNNPVV
> >+		 */
> >+		if (purged) {
> >+			*purged = vma_purged(next);
> >+			if (exporter == vma) /* V5. VVPPNN -> VVPNNN */
> >+				*purged = vma_purged(vma);
> >+		}
> >  	}
> >  	if (root) {
> >@@ -651,6 +661,13 @@ again:			remove_next = 1 + (end > next->vm_end);
> >  		anon_vma_interval_tree_post_update_vma(vma);
> >  		if (adjust_next)
> >  			anon_vma_interval_tree_post_update_vma(next);
> >+		/*
> >+		 * See mm/mvolatile.c for the terminology.
> >+		 * V7. VVPPVV -> VVNPVV
> >+		 * V8. VVPPVV -> VVPNVV
> >+		 */
> >+		if (insert)
> >+			vma_purge_copy(insert, vma);
> >  		anon_vma_unlock(anon_vma);
> >  	}
> >  	if (mapping)
> >@@ -670,6 +687,20 @@ again:			remove_next = 1 + (end > next->vm_end);
> >  		}
> >  		if (next->anon_vma)
> >  			anon_vma_merge(vma, next);
> >+
> >+		/*
> >+		 * next is detached from the anon vma chain so purged can
> >+		 * no longer be raced on.
> >+		 * See mm/mvolatile.c for the terminology.
> >+		 *
> >+		 * V1. NNPPVV -> NNNNVV
> >+		 * V2. VVPPNN -> VVNNNN
> >+		 * V3. NNPPNN -> NNNNNN
> >+		 */
> >+		if (purged) {
> >+			*purged |= vma_purged(vma); /* case V2 */
> >+			*purged |= vma_purged(next); /* case V1,V3 */
> >+		}
> >  		mm->map_count--;
> >  		mpol_put(vma_policy(next));
> >  		kmem_cache_free(vm_area_cachep, next);
> >@@ -798,7 +829,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> >  			struct vm_area_struct *prev, unsigned long addr,
> >  			unsigned long end, unsigned long vm_flags,
> >  		     	struct anon_vma *anon_vma, struct file *file,
> >-			pgoff_t pgoff, struct mempolicy *policy)
> >+			pgoff_t pgoff, struct mempolicy *policy, bool *purged)
> >  {
> >  	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
> >  	struct vm_area_struct *area, *next;
> >@@ -808,7 +839,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> >  	 * We later require that vma->vm_flags == vm_flags,
> >  	 * so this tests vma->vm_flags & VM_SPECIAL, too.
> >  	 */
> >-	if (vm_flags & VM_SPECIAL)
> >+	if (vm_flags & (VM_SPECIAL|VM_VOLATILE))
> >  		return NULL;
> >  	if (prev)
> >@@ -837,10 +868,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> >  						      next->anon_vma, NULL)) {
> >  							/* cases 1, 6 */
> >  			err = vma_adjust(prev, prev->vm_start,
> >-				next->vm_end, prev->vm_pgoff, NULL);
> >+				next->vm_end, prev->vm_pgoff, NULL, purged);
> >  		} else					/* cases 2, 5, 7 */
> >  			err = vma_adjust(prev, prev->vm_start,
> >-				end, prev->vm_pgoff, NULL);
> >+				end, prev->vm_pgoff, NULL, purged);
> >  		if (err)
> >  			return NULL;
> >  		khugepaged_enter_vma_merge(prev);
> >@@ -856,10 +887,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> >  					anon_vma, file, pgoff+pglen)) {
> >  		if (prev && addr < prev->vm_end)	/* case 4 */
> >  			err = vma_adjust(prev, prev->vm_start,
> >-				addr, prev->vm_pgoff, NULL);
> >+				addr, prev->vm_pgoff, NULL, purged);
> >  		else					/* cases 3, 8 */
> >  			err = vma_adjust(area, addr, next->vm_end,
> >-				next->vm_pgoff - pglen, NULL);
> >+				next->vm_pgoff - pglen, NULL, purged);
> >  		if (err)
> >  			return NULL;
> >  		khugepaged_enter_vma_merge(area);
> >@@ -1292,7 +1323,8 @@ munmap_back:
> >  	/*
> >  	 * Can we just expand an old mapping?
> >  	 */
> >-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
> >+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file,
> >+				pgoff, NULL, NULL);
> >  	if (vma)
> >  		goto out;
> >@@ -2025,9 +2057,10 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
> >  	if (new_below)
> >  		err = vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
> >-			((addr - new->vm_start) >> PAGE_SHIFT), new);
> >+			((addr - new->vm_start) >> PAGE_SHIFT), new, NULL);
> >  	else
> >-		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
> >+		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff,
> >+			new, NULL);
> >  	/* Success. */
> >  	if (!err)
> >@@ -2240,7 +2273,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
> >  	/* Can we just expand an old private anonymous mapping? */
> >  	vma = vma_merge(mm, prev, addr, addr + len, flags,
> >-					NULL, NULL, pgoff, NULL);
> >+					NULL, NULL, pgoff, NULL, NULL);
> >  	if (vma)
> >  		goto out;
> >@@ -2396,7 +2429,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> >  	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
> >  		return NULL;	/* should never get here */
> >  	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> >-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
> >+			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> >+			NULL);
> >  	if (new_vma) {
> >  		/*
> >  		 * Source vma may have been merged into new_vma
> >diff --git a/mm/mprotect.c b/mm/mprotect.c
> >index a409926..f461177 100644
> >--- a/mm/mprotect.c
> >+++ b/mm/mprotect.c
> >@@ -179,7 +179,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> >  	 */
> >  	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> >  	*pprev = vma_merge(mm, *pprev, start, end, newflags,
> >-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
> >+			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> >+			NULL);
> >  	if (*pprev) {
> >  		vma = *pprev;
> >  		goto success;
> >diff --git a/mm/mremap.c b/mm/mremap.c
> >index 1b61c2d..8586c52 100644
> >--- a/mm/mremap.c
> >+++ b/mm/mremap.c
> >@@ -512,7 +512,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> >  			int pages = (new_len - old_len) >> PAGE_SHIFT;
> >  			if (vma_adjust(vma, vma->vm_start, addr + new_len,
> >-				       vma->vm_pgoff, NULL)) {
> >+				       vma->vm_pgoff, NULL, NULL)) {
> >  				ret = -ENOMEM;
> >  				goto out;
> >  			}
> >diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> >new file mode 100644
> >index 0000000..8b812d2
> >--- /dev/null
> >+++ b/mm/mvolatile.c
> >@@ -0,0 +1,312 @@
> >+/*
> >+ *	linux/mm/mvolatile.c
> >+ *
> >+ *  Copyright 2012 Minchan Kim
> >+ *
> >+ *  This work is licensed under the terms of the GNU GPL, version 2. See
> >+ *  the COPYING file in the top-level directory.
> >+ */
> >+
> >+#include <linux/mvolatile.h>
> >+#include <linux/mm_types.h>
> >+#include <linux/mm.h>
> >+#include <linux/rmap.h>
> >+#include <linux/mempolicy.h>
> >+
> >+#ifndef CONFIG_VOLATILE_PAGE
> >+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> >+{
> >+	return -EINVAL;
> >+}
> >+
> >+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> >+{
> >+	return -EINVAL;
> >+}
> >+#else
> >+
> >+#define NO_PURGED	0
> >+#define PURGED		1
> >+
> >+/*
> >+ * N: Normal VMA
> >+ * V: Volatile VMA
> >+ * P: Purged volatile VMA
> >+ *
> >+ * Assume that each VMA has two blocks, so cases 1-8 consist of three VMAs.
> >+ * For example, NNPPVV means VMA1 is a normal VMA, VMA2 is a purged volatile VMA,
> >+ * and VMA3 is a volatile VMA. As another example, NNPVVV means VMA1 is a
> >+ * normal VMA, VMA2-1 is a purged volatile VMA, and VMA2-2 is a volatile VMA.
> >+ *
> >+ * Cases 7,8 create a new VMA which we call VMA4; it can be located before VMA2
> >+ * or after.
> >+ *
> >+ * Notice: The merge between volatile VMAs shouldn't happen.
> >+ * If we call mnovolatile(VMA2),
> >+ *
> >+ * Case 1 NNPPVV -> NNNNVV
> >+ * Case 2 VVPPNN -> VVNNNN
> >+ * Case 3 NNPPNN -> NNNNNN
> >+ * Case 4 NNPPVV -> NNNPVV
> >+ * case 5 VVPPNN -> VVPNNN
> >+ * case 6 VVPPVV -> VVNNVV
> >+ * case 7 VVPPVV -> VVNPVV
> >+ * case 8 VVPPVV -> VVPNVV
> >+ */
> >+static int do_mnovolatile(struct vm_area_struct *vma,
> >+		struct vm_area_struct **prev, unsigned long start,
> >+		unsigned long end, bool *is_purged)
> >+{
> >+	unsigned long new_flags;
> >+	int error = 0;
> >+	struct mm_struct *mm = vma->vm_mm;
> >+	pgoff_t pgoff;
> >+	bool purged = false;
> >+
> >+	new_flags = vma->vm_flags & ~VM_VOLATILE;
> >+	if (new_flags == vma->vm_flags) {
> >+		*prev = vma;
> >+		goto success;
> >+	}
> >+
> >+
> >+	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> >+	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
> >+			vma->vm_file, pgoff, vma_policy(vma), &purged);
> >+	if (*prev) {
> >+		vma = *prev;
> >+		goto success;
> >+	}
> >+
> >+	*prev = vma;
> >+
> >+	if (start != vma->vm_start) {
> >+		error = split_vma(mm, vma, start, 1);
> >+		if (error)
> >+			goto out;
> >+	}
> >+
> >+	if (end != vma->vm_end) {
> >+		error = split_vma(mm, vma, end, 0);
> >+		if (error)
> >+			goto out;
> >+	}
> >+
> >+success:
> >+	/* V6. VVPPVV -> VVNNVV */
> >+	vma_lock_anon_vma(vma);
> >+	*is_purged |= (vma->purged|purged);
> >+	vma_unlock_anon_vma(vma);
> >+
> >+	vma->vm_flags = new_flags;
> >+	vma->purged = false;
> >+	return 0;
> >+out:
> >+	return error;
> >+}
> >+
> >+/* I didn't look into KSM/Hugepage so disabled them */
> >+#define VM_NO_VOLATILE	(VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|\
> >+		VM_MERGEABLE|VM_HUGEPAGE|VM_LOCKED)
> >+
> >+static int do_mvolatile(struct vm_area_struct *vma,
> >+	struct vm_area_struct **prev, unsigned long start, unsigned long end)
> >+{
> >+	int error = -EINVAL;
> >+	vm_flags_t new_flags = vma->vm_flags;
> >+	struct mm_struct *mm = vma->vm_mm;
> >+
> >+	new_flags |= VM_VOLATILE;
> >+
> >+	/* Note : Current version doesn't support file vma volatile */
> >+	if (vma->vm_file) {
> >+		*prev = vma;
> >+		goto out;
> >+	}
> >+
> >+	if (vma->vm_flags & VM_NO_VOLATILE ||
> >+			(vma == get_gate_vma(current->mm))) {
> >+		*prev = vma;
> >+		goto out;
> >+	}
> >+	/*
> >+	 * In case of calling mvolatile again,
> >+	 * we just reset the purged state.
> >+	 */
> >+	if (new_flags == vma->vm_flags) {
> >+		*prev = vma;
> >+		vma_lock_anon_vma(vma);
> >+		vma->purged = false;
> >+		vma_unlock_anon_vma(vma);
> >+		error = 0;
> >+		goto out;
> >+	}
> >+
> >+	*prev = vma;
> >+
> >+	if (start != vma->vm_start) {
> >+		error = split_vma(mm, vma, start, 1);
> >+		if (error)
> >+			goto out;
> >+	}
> >+
> >+	if (end != vma->vm_end) {
> >+		error = split_vma(mm, vma, end, 0);
> >+		if (error)
> >+			goto out;
> >+	}
> >+
> >+	error = 0;
> >+
> >+	vma_lock_anon_vma(vma);
> >+	vma->vm_flags = new_flags;
> >+	vma_unlock_anon_vma(vma);
> >+out:
> >+	return error;
> >+}
> >+
> >+/*
> >+ * Return -EINVAL if range doesn't include a right vma at all.
> >+ * Return -ENOMEM, interrupting the range operation, if memory is not enough to
> >+ * merge/split vmas.
> >+ * Return 0 if range consists of only proper vmas.
> >+ * Return 1 if part of the range includes an invalid area (ex, hole/huge/ksm/mlock/
> >+ * special area)
> >+ */
> >+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> >+{
> >+	unsigned long end, tmp;
> >+	struct vm_area_struct *vma, *prev;
> >+	bool invalid = false;
> >+	int error = -EINVAL;
> >+
> >+	down_write(&current->mm->mmap_sem);
> >+	if (start & ~PAGE_MASK)
> >+		goto out;
> >+
> >+	len &= PAGE_MASK;
> >+	if (!len)
> >+		goto out;
> >+
> >+	end = start + len;
> >+	if (end < start)
> >+		goto out;
> >+
> >+	vma = find_vma_prev(current->mm, start, &prev);
> >+	if (!vma)
> >+		goto out;
> >+
> >+	if (start > vma->vm_start)
> >+		prev = vma;
> >+
> >+	for (;;) {
> >+		/* Here start < (end|vma->vm_end). */
> >+		if (start < vma->vm_start) {
> >+			start = vma->vm_start;
> >+			if (start >= end)
> >+				goto out;
> >+			invalid = true;
> >+		}
> >+
> >+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> >+		tmp = vma->vm_end;
> >+		if (end < tmp)
> >+			tmp = end;
> >+
> >+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >+		error = do_mvolatile(vma, &prev, start, tmp);
> >+		if (error == -ENOMEM) {
> >+			up_write(&current->mm->mmap_sem);
> >+			return error;
> >+		}
> >+		if (error == -EINVAL)
> >+			invalid = true;
> >+		else
> >+			error = 0;
> >+		start = tmp;
> >+		if (prev && start < prev->vm_end)
> >+			start = prev->vm_end;
> >+		if (start >= end)
> >+			break;
> >+
> >+		vma = prev->vm_next;
> >+		if (!vma)
> >+			break;
> >+	}
> >+out:
> >+	up_write(&current->mm->mmap_sem);
> >+	return invalid ? 1 : 0;
> >+}
> >+/*
> >+ * Return -ENOMEM, interrupting the range operation, if memory is not enough
> >+ * to merge/split vmas.
> >+ * Return 1 if part of the range includes purged pages, otherwise return 0
> >+ */
> >+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> >+{
> >+	unsigned long end, tmp;
> >+	struct vm_area_struct *vma, *prev;
> >+	int ret, error = -EINVAL;
> >+	bool is_purged = false;
> >+
> >+	down_write(&current->mm->mmap_sem);
> >+	if (start & ~PAGE_MASK)
> >+		goto out;
> >+
> >+	len &= PAGE_MASK;
> >+	if (!len)
> >+		goto out;
> >+
> >+	end = start + len;
> >+	if (end < start)
> >+		goto out;
> >+
> >+	vma = find_vma_prev(current->mm, start, &prev);
> >+	if (!vma)
> >+		goto out;
> >+
> >+	if (start > vma->vm_start)
> >+		prev = vma;
> >+
> >+	for (;;) {
> >+		/* Here start < (end|vma->vm_end). */
> >+		if (start < vma->vm_start) {
> >+			start = vma->vm_start;
> >+			if (start >= end)
> >+				goto out;
> >+		}
> >+
> >+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> >+		tmp = vma->vm_end;
> >+		if (end < tmp)
> >+			tmp = end;
> >+
> >+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >+		error = do_mnovolatile(vma, &prev, start, tmp, &is_purged);
> >+		if (error) {
> >+			WARN_ON(error != -ENOMEM);
> >+			goto out;
> >+		}
> >+		start = tmp;
> >+		if (prev && start < prev->vm_end)
> >+			start = prev->vm_end;
> >+		if (start >= end)
> >+			break;
> >+
> >+		vma = prev->vm_next;
> >+		if (!vma)
> >+			break;
> >+	}
> >+out:
> >+	up_write(&current->mm->mmap_sem);
> >+
> >+	if (error)
> >+		ret = error;
> >+	else if (is_purged)
> >+		ret = PURGED;
> >+	else
> >+		ret = NO_PURGED;
> >+
> >+	return ret;
> >+}
> >+#endif
> >diff --git a/mm/rmap.c b/mm/rmap.c
> >index 2ee1ef0..402d9da 100644
> >--- a/mm/rmap.c
> >+++ b/mm/rmap.c
> >@@ -57,6 +57,7 @@
> >  #include <linux/migrate.h>
> >  #include <linux/hugetlb.h>
> >  #include <linux/backing-dev.h>
> >+#include <linux/mvolatile.h>
> >  #include <asm/tlbflush.h>
> >@@ -308,6 +309,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> >  	vma->anon_vma = anon_vma;
> >  	anon_vma_lock(anon_vma);
> >  	anon_vma_chain_link(vma, avc, anon_vma);
> >+	vma_purge_copy(vma, pvma);
> >  	anon_vma_unlock(anon_vma);
> >  	return 0;
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC v5 0/8] Support volatile for anonymous range
  2013-01-03 17:19 ` [RFC v5 0/8] Support volatile for anonymous range Sanjay Ghemawat
@ 2013-01-04  5:15   ` Minchan Kim
  0 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-04  5:15 UTC (permalink / raw)
  To: Sanjay Ghemawat
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

Hello,

On Thu, Jan 03, 2013 at 09:19:08AM -0800, Sanjay Ghemawat wrote:
> On Wed, Jan 2, 2013 at 8:27 PM, Minchan Kim <minchan@kernel.org> wrote:
> > This is still RFC because we need more input from user-space
> > people, more stress test, design discussion about interface/reclaim
> 
> Speaking as one of the authors of tcmalloc, I don't see any particular
> need for this new system call for tcmalloc.  We are fine using
> madvise(MADV_DONTNEED) and don't notice any significant
> performance issues caused by it.  Background: we throttle how
> quickly we release memory back to the system (1-10MB/s), so
> we do not call madvise() very much, and we don't end up reusing
> madvise-ed away pages at a fast rate. My guess is that we won't

So tcmalloc controls the madvise rate dynamically without the
user's intervention? Smart tcmalloc!

Let me ask some questions.
What is your policy for controlling the throttling of madvise?
I guess the policy is something like the following.

Calling madvise frequently is bad because of the pte zap overhead of
madvise, plus the later page fault/memset, plus the page access bit
emulation fault on some architectures like ARM when the range is
reused. So we should only call it at a fast rate, very carefully, when
memory pressure happens. Is that similar to your throttling logic?

If my assumption isn't totally wrong, how can a process know the
memory pressure of the moment from just a per-process view, NOT the
system view?

If your logic makes a mistake (for instance, memory pressure is severe
but it doesn't call madvise), the working set could be reclaimed like
file-backed pages, which could wipe out the benefit of the madvise
throttling. I guess it's very fragile. It's worse in the embedded
world because they don't use swap, so the system hits OOM instead of
swapping out.

On this point, mvolatile's concept is a lightweight system call that
just marks a flag in the vma, so the reclaimer frees the pages
automatically when the system suffers from memory pressure (my plan
here is to zap all pages at mvolatile time if kswapd is already
active) while preventing working set pages from being evicted;
otherwise it speeds the application up by removing the (minor fault +
page allocation + memset) cost. Also, it would make the allocator
simpler by removing the control logic, which is less error-prone and
might even make the already smart tcmalloc better than it is now,
although it doesn't have any significant performance issue today.
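
To make that concrete, here is a rough sketch of what an allocator's
free/reuse paths could look like on top of this interface (hypothetical
wrapper names; the syscall numbers are the x86_64 ones this series adds,
and PURGED/NO_PURGED are the 1/0 returns defined in mm/mvolatile.c):

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_mvolatile   313
#define __NR_mnovolatile 314

/* Free path: no pte zapping and no throttling decision here; just mark
 * the span so kswapd may discard it if memory pressure actually happens. */
static void span_release(void *addr, size_t len)
{
	syscall(__NR_mvolatile, (unsigned long)addr, len);
}

/* Reuse path: clear the hint first.  PURGED (1) means the kernel discarded
 * pages and the contents are garbage; NO_PURGED (0) means the old data is
 * intact and no fault was paid.  Errors are treated conservatively. */
static int span_reuse(void *addr, size_t len)
{
	long ret = syscall(__NR_mnovolatile, (unsigned long)addr, len);

	return ret != 0;	/* caller must re-initialize unless NO_PURGED */
}

The point is that no rate-limiting logic is left in the allocator; the
decision about when pages are actually freed moves into the reclaim path.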

> see large enough application-level performance improvements to
> cause us to change tcmalloc to use this system call.
> 
> > - What's different with madvise(DONTNEED)?
> >
> >   System call semantic
> >
> >   DONTNEED makes sure user always can see zero-fill pages after
> >   he calls madvise while mvolatile can see old data or encounter
> >   SIGBUS.
> 
> Do you need a new system call for this?  Why not just a new flag to madvise
> with weaker guarantees than zero-filling?  All of the implementation changes
> you point out below could be triggered from that flag.

Agreed, and actually I tried it but changed my mind, because it required
adding a lot of hacky code to madvise: the return value and error semantics
are totally different from normal madvise's, and it needs at least three
flags at the moment, with possibly more to come out of the discussion.

MADV_VOLATILE, MADV_NOVOLATILE, MADV_[NO]VOLATILE|MADV_PARTIAL_DISCARD

I don't want to make madvise dirty and consume lots of new madvise flags
for a volatile feature. But if everybody wants to fold this into madvise,
I can do it, too.

Thanks for the feedback, Sanjay!

> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/8] Introduce new system call mvolatile
  2013-01-03  4:27 ` [RFC 1/8] Introduce new system call mvolatile Minchan Kim
  2013-01-03 18:35   ` Taras Glek
@ 2013-01-17  1:48   ` John Stultz
  2013-01-18  5:30     ` Minchan Kim
  1 sibling, 1 reply; 17+ messages in thread
From: John Stultz @ 2013-01-17  1:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 01/02/2013 08:27 PM, Minchan Kim wrote:
> This patch adds new system call m[no]volatile.
> If someone asks is_volatile system call, it could be added, too.

So some nits below from my initial playing around with this patchset.

> +/*
> + * Return -EINVAL if range doesn't include a right vma at all.
> + * Return -ENOMEM, interrupting the range operation, if memory is not enough to
> + * merge/split vmas.
> + * Return 0 if range consists of only proper vmas.
> + * Return 1 if part of the range includes an invalid area (ex, hole/huge/ksm/mlock/
> + * special area)
> + */
> +SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> +{
> +	unsigned long end, tmp;
> +	struct vm_area_struct *vma, *prev;
> +	bool invalid = false;
> +	int error = -EINVAL;
> +
> +	down_write(&current->mm->mmap_sem);
> +	if (start & ~PAGE_MASK)
> +		goto out;
> +
> +	len &= PAGE_MASK;
> +	if (!len)
> +		goto out;
> +
> +	end = start + len;
> +	if (end < start)
> +		goto out;
> +
> +	vma = find_vma_prev(current->mm, start, &prev);
> +	if (!vma)
> +		goto out;
> +
> +	if (start > vma->vm_start)
> +		prev = vma;
> +
> +	for (;;) {
> +		/* Here start < (end|vma->vm_end). */
> +		if (start < vma->vm_start) {
> +			start = vma->vm_start;
> +			if (start >= end)
> +				goto out;
> +			invalid = true;
> +		}
> +
> +		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> +		error = do_mvolatile(vma, &prev, start, tmp);
> +		if (error == -ENOMEM) {
> +			up_write(&current->mm->mmap_sem);
> +			return error;
> +		}
> +		if (error == -EINVAL)
> +			invalid = true;
> +		else
> +			error = 0;
> +		start = tmp;
> +		if (prev && start < prev->vm_end)
> +			start = prev->vm_end;
> +		if (start >= end)
> +			break;
> +
> +		vma = prev->vm_next;
> +		if (!vma)
> +			break;
> +	}
> +out:
> +	up_write(&current->mm->mmap_sem);
> +	return invalid ? 1 : 0;
> +}

The error logic here is really strange. If any of the early error cases 
are triggered (ie: (start & ~PAGE_MASK), etc), then we jump to out and 
return 0 (instead of EINVAL). I don't think that's what you intended.


> +/*
> + * Return -ENOMEM, interrupting the range operation, if memory is not enough
> + * to merge/split vmas.
> + * Return 1 if part of the range includes purged pages, otherwise return 0
> + */
> +SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> +{
> +	unsigned long end, tmp;
> +	struct vm_area_struct *vma, *prev;
> +	int ret, error = -EINVAL;
> +	bool is_purged = false;
> +
> +	down_write(&current->mm->mmap_sem);
> +	if (start & ~PAGE_MASK)
> +		goto out;
> +
> +	len &= PAGE_MASK;
> +	if (!len)
> +		goto out;
> +
> +	end = start + len;
> +	if (end < start)
> +		goto out;
> +
> +	vma = find_vma_prev(current->mm, start, &prev);
> +	if (!vma)
> +		goto out;
> +
> +	if (start > vma->vm_start)
> +		prev = vma;
> +
> +	for (;;) {
> +		/* Here start < (end|vma->vm_end). */
> +		if (start < vma->vm_start) {
> +			start = vma->vm_start;
> +			if (start >= end)
> +				goto out;
> +		}
> +
> +		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> +		error = do_mnovolatile(vma, &prev, start, tmp, &is_purged);
> +		if (error) {
> +			WARN_ON(error != -ENOMEM);
> +			goto out;
> +		}
> +		start = tmp;
> +		if (prev && start < prev->vm_end)
> +			start = prev->vm_end;
> +		if (start >= end)
> +			break;
> +
> +		vma = prev->vm_next;
> +		if (!vma)
> +			break;
> +	}

I'm still not sure how this logic improves over the madvise case. If we 
catch an error mid-way through setting a series of vmas to non-volatile, 
we end up exiting and losing state (i.e.: if only the first vma was 
purged, but half way through 10 vmas we get an ENOMEM error, then the first 
vma is now non-volatile, but we do not return the purged flag).

If we're going to have a new syscall for this (which I'm not sure is the 
right approach), we should make use of multiple arguments so we can 
report whether data was purged, even if we hit an error midway.
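
For instance, something with roughly this shape (just a sketch of the 
idea, not a worked-out patch):

/*
 * Sketch only: hand the purge state back through an out-parameter so it
 * is not lost when the vma walk stops part-way through with -ENOMEM.
 */
SYSCALL_DEFINE3(mnovolatile, unsigned long, start, size_t, len,
		int __user *, purgedp)
{
	bool is_purged = false;
	long error = 0;

	/*
	 * ... walk and un-volatile the vmas exactly as today, accumulating
	 * is_purged from do_mnovolatile() and setting error on failure ...
	 */

	/* Report whatever we learned, even if the walk stopped early. */
	if (purgedp && put_user(is_purged ? 1 : 0, purgedp))
		return -EFAULT;

	return error;
}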

Alternatively, if we can find a way to allocate any necessary memory 
before we do any vma volatility state changes, then we can return ENOMEM 
then and be confident we won't end up with failed partial state change 
(this is the approach I used in my fallocate-volatile patches).

thanks
-john


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/8] Introduce new system call mvolatile
  2013-01-17  1:48   ` John Stultz
@ 2013-01-18  5:30     ` Minchan Kim
  0 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2013-01-18  5:30 UTC (permalink / raw)
  To: John Stultz
  Cc: Andrew Morton, linux-mm, linux-kernel, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Wed, Jan 16, 2013 at 05:48:37PM -0800, John Stultz wrote:
> On 01/02/2013 08:27 PM, Minchan Kim wrote:
> >This patch adds new system call m[no]volatile.
> >If someone asks is_volatile system call, it could be added, too.
> 
> So some nits below from my initial playing around with this patchset.
> 
> >+/*
> >+ * Return -EINVAL if range doesn't include a right vma at all.
> >+ * Return -ENOMEM, interrupting the range operation, if memory is not enough to
> >+ * merge/split vmas.
> >+ * Return 0 if range consists of only proper vmas.
> >+ * Return 1 if part of the range includes an invalid area (ex, hole/huge/ksm/mlock/
> >+ * special area)
> >+ */
> >+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> >+{
> >+	unsigned long end, tmp;
> >+	struct vm_area_struct *vma, *prev;
> >+	bool invalid = false;
> >+	int error = -EINVAL;
> >+
> >+	down_write(&current->mm->mmap_sem);
> >+	if (start & ~PAGE_MASK)
> >+		goto out;
> >+
> >+	len &= PAGE_MASK;
> >+	if (!len)
> >+		goto out;
> >+
> >+	end = start + len;
> >+	if (end < start)
> >+		goto out;
> >+
> >+	vma = find_vma_prev(current->mm, start, &prev);
> >+	if (!vma)
> >+		goto out;
> >+
> >+	if (start > vma->vm_start)
> >+		prev = vma;
> >+
> >+	for (;;) {
> >+		/* Here start < (end|vma->vm_end). */
> >+		if (start < vma->vm_start) {
> >+			start = vma->vm_start;
> >+			if (start >= end)
> >+				goto out;
> >+			invalid = true;
> >+		}
> >+
> >+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> >+		tmp = vma->vm_end;
> >+		if (end < tmp)
> >+			tmp = end;
> >+
> >+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >+		error = do_mvolatile(vma, &prev, start, tmp);
> >+		if (error == -ENOMEM) {
> >+			up_write(&current->mm->mmap_sem);
> >+			return error;
> >+		}
> >+		if (error == -EINVAL)
> >+			invalid = true;
> >+		else
> >+			error = 0;
> >+		start = tmp;
> >+		if (prev && start < prev->vm_end)
> >+			start = prev->vm_end;
> >+		if (start >= end)
> >+			break;
> >+
> >+		vma = prev->vm_next;
> >+		if (!vma)
> >+			break;
> >+	}
> >+out:
> >+	up_write(&current->mm->mmap_sem);
> >+	return invalid ? 1 : 0;
> >+}
> 
> The error logic here is really strange. If any of the early error
> cases are triggered (ie: (start & ~PAGE_MASK), etc), then we jump to
> out and return 0 (instead of EINVAL). I don't think that's what you
> intended.

Need fixing.

> 
> 
> >+/*
> >+ * Return -ENOMEM, interrupting the range operation, if memory is not enough
> >+ * to merge/split vmas.
> >+ * Return 1 if part of the range includes purged pages, otherwise return 0
> >+ */
> >+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> >+{
> >+	unsigned long end, tmp;
> >+	struct vm_area_struct *vma, *prev;
> >+	int ret, error = -EINVAL;
> >+	bool is_purged = false;
> >+
> >+	down_write(&current->mm->mmap_sem);
> >+	if (start & ~PAGE_MASK)
> >+		goto out;
> >+
> >+	len &= PAGE_MASK;
> >+	if (!len)
> >+		goto out;
> >+
> >+	end = start + len;
> >+	if (end < start)
> >+		goto out;
> >+
> >+	vma = find_vma_prev(current->mm, start, &prev);
> >+	if (!vma)
> >+		goto out;
> >+
> >+	if (start > vma->vm_start)
> >+		prev = vma;
> >+
> >+	for (;;) {
> >+		/* Here start < (end|vma->vm_end). */
> >+		if (start < vma->vm_start) {
> >+			start = vma->vm_start;
> >+			if (start >= end)
> >+				goto out;
> >+		}
> >+
> >+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> >+		tmp = vma->vm_end;
> >+		if (end < tmp)
> >+			tmp = end;
> >+
> >+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >+		error = do_mnovolatile(vma, &prev, start, tmp, &is_purged);
> >+		if (error) {
> >+			WARN_ON(error != -ENOMEM);
> >+			goto out;
> >+		}
> >+		start = tmp;
> >+		if (prev && start < prev->vm_end)
> >+			start = prev->vm_end;
> >+		if (start >= end)
> >+			break;
> >+
> >+		vma = prev->vm_next;
> >+		if (!vma)
> >+			break;
> >+	}
> 
> I'm still not sure how this logic improves over the madvise case. If
> we catch an error mid-way through setting a series of vmas to
> non-volatile, we end up exiting and losing state (i.e.: if only the
> first vma was purged, but half way through 10 vmas we get an ENOMEM
> error, then the first vma is now non-volatile, but we do not return
> the purged flag).

Right. 

> 
> If we're going to have a new syscall for this (which I'm not sure is
> the right approach), we should make use of multiple arguments so we
> can report whether data was purged, even if we hit an error midway.

That would be an easier way to achieve our goal than the suggestion
below, at least for the VMA-based approach, because it's hard to
predict how many vmas we will need so that we could set them up
atomically.

Will do it in the next version.

> 
> Alternatively, if we can find a way to allocate any necessary memory
> before we do any vma volatility state changes, then we can return
> ENOMEM then and be confident we won't end up with failed partial
> state change (this is the approach I used in my fallocate-volatile
> patches).

Thanks for the review, John.

> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2013-01-18  5:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-03  4:27 [RFC v5 0/8] Support volatile for anonymous range Minchan Kim
2013-01-03  4:27 ` [RFC 1/8] Introduce new system call mvolatile Minchan Kim
2013-01-03 18:35   ` Taras Glek
2013-01-04  4:25     ` Minchan Kim
2013-01-17  1:48   ` John Stultz
2013-01-18  5:30     ` Minchan Kim
2013-01-03  4:28 ` [RFC 2/8] Don't allow volatile attribute on THP and KSM Minchan Kim
2013-01-03 16:27   ` Dave Hansen
2013-01-04  2:51     ` Minchan Kim
2013-01-03  4:28 ` [RFC 3/8] bail out when the page is in VOLATILE vma Minchan Kim
2013-01-03  4:28 ` [RFC 4/8] add page_locked parameter in free_swap_and_cache Minchan Kim
2013-01-03  4:28 ` [RFC 5/8] Discard volatile page Minchan Kim
2013-01-03  4:28 ` [RFC 6/8] add PGVOLATILE vmstat count Minchan Kim
2013-01-03  4:28 ` [RFC 7/8] add volatile page discard hook to kswapd Minchan Kim
2013-01-03  4:28 ` [RFC 8/8] extend PGVOLATILE vmstat " Minchan Kim
2013-01-03 17:19 ` [RFC v5 0/8] Support volatile for anonymous range Sanjay Ghemawat
2013-01-04  5:15   ` Minchan Kim
