linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: John Stultz <john.stultz@linaro.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: John Stultz <john.stultz@linaro.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Android Kernel Team <kernel-team@android.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Robert Love <rlove@google.com>, Mel Gorman <mel@csn.ul.ie>,
	Hugh Dickins <hughd@google.com>, Dave Hansen <dave@sr71.net>,
	Rik van Riel <riel@redhat.com>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Neil Brown <neilb@suse.de>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Mike Hommey <mh@glandium.org>, Taras Glek <tglek@mozilla.com>,
	Dhaval Giani <dgiani@mozilla.com>, Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	Minchan Kim <minchan@kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas
Date: Fri, 14 Mar 2014 11:33:31 -0700	[thread overview]
Message-ID: <1394822013-23804-2-git-send-email-john.stultz@linaro.org> (raw)
In-Reply-To: <1394822013-23804-1-git-send-email-john.stultz@linaro.org>

This patch introduces the vrange() syscall, which allows for specifying
ranges of memory as volatile, and able to be discarded by the system.

This initial patch simply adds the syscall, and the vma handling,
splitting and merging the vmas as needed, and marking them with
VM_VOLATILE.

No purging or discarding of volatile ranges is done at this point.

Example man page:

NAME
	vrange - Mark or unmark range of memory as volatile

SYNOPSIS
	int vrange(unsigned_long start, size_t length, int mode,
			 int *purged);

DESCRIPTION
	Applications can use vrange(2) to advise the kernel how it should
	handle paging I/O in this VM area.  The idea is to help the kernel
	discard pages of vrange instead of reclaiming when memory pressure
	happens. It means kernel doesn't discard any pages of vrange if
	there is no memory pressure.

	mode:
	VRANGE_VOLATILE
		hint to kernel so VM can discard in vrange pages when
		memory pressure happens.
	VRANGE_NONVOLATILE
		hint to kernel so VM doesn't discard vrange pages
		any more.

	If user try to access purged memory without VRANGE_NONVOLATILE call,
	he can encounter SIGBUS if the page was discarded by kernel.

	purged: Pointer to an integer which will return 1 if
	mode == VRANGE_NONVOLATILE and any page in the affected range
	was purged. If purged returns zero during a mode ==
	VRANGE_NONVOLATILE call, it means all of the pages in the range
	are intact.

RETURN VALUE
	On success vrange returns the number of bytes marked or unmarked.
	Similar to write(), it may return fewer bytes then specified
	if it ran into a problem.

	If an error is returned, no changes were made.

ERRORS
	EINVAL This error can occur for the following reasons:
		* The value length is negative or not page size units.
		* addr is not page-aligned
		* mode not a valid value.

	ENOMEM Not enough memory

	EFAULT purged pointer is invalid

This a simplified implementation which reuses some of the logic
from Minchan's earlier efforts. So credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Dhaval Giani <dgiani@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/mm.h               |   1 +
 include/linux/vrange.h           |   7 ++
 mm/Makefile                      |   2 +-
 mm/vrange.c                      | 150 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 160 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/vrange.h
 create mode 100644 mm/vrange.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..7ae3940 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	vrange			sys_vrange
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1b7414..a1f11da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
 					/* Used by sys_madvise() */
+#define VM_VOLATILE	0x00001000	/* VMA is volatile */
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
 #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
 
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
new file mode 100644
index 0000000..652396b
--- /dev/null
+++ b/include/linux/vrange.h
@@ -0,0 +1,7 @@
+#ifndef _LINUX_VRANGE_H
+#define _LINUX_VRANGE_H
+
+#define VRANGE_NONVOLATILE 0
+#define VRANGE_VOLATILE 1
+
+#endif /* _LINUX_VRANGE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..20229e2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o balloon_compaction.o \
+			   compaction.o balloon_compaction.o vrange.o \
 			   interval_tree.o list_lru.o $(mmu-y)
 
 obj-y += init-mm.o
diff --git a/mm/vrange.c b/mm/vrange.c
new file mode 100644
index 0000000..acb4356
--- /dev/null
+++ b/mm/vrange.c
@@ -0,0 +1,150 @@
+#include <linux/syscalls.h>
+#include <linux/vrange.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
+				unsigned long end, int mode, int *purged)
+{
+	struct vm_area_struct *vma, *prev;
+	unsigned long orig_start = start;
+	ssize_t count = 0, ret = 0;
+	int lpurged = 0;
+
+	down_read(&mm->mmap_sem);
+
+	vma = find_vma_prev(mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		unsigned long new_flags;
+		pgoff_t pgoff;
+		unsigned long tmp;
+
+		if (!vma)
+			goto out;
+
+		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+					VM_HUGETLB))
+			goto out;
+
+		/* We don't support volatility on files for now */
+		if (vma->vm_file) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		new_flags = vma->vm_flags;
+
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+		}
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		switch (mode) {
+		case VRANGE_VOLATILE:
+			new_flags |= VM_VOLATILE;
+			break;
+		case VRANGE_NONVOLATILE:
+			new_flags &= ~VM_VOLATILE;
+		}
+
+		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+		prev = vma_merge(mm, prev, start, tmp, new_flags,
+					vma->anon_vma, vma->vm_file, pgoff,
+					vma_policy(vma));
+		if (prev)
+			goto success;
+
+		if (start != vma->vm_start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				goto out;
+		}
+
+		if (tmp != vma->vm_end) {
+			ret = split_vma(mm, vma, tmp, 0);
+			if (ret)
+				goto out;
+		}
+
+		prev = vma;
+success:
+		vma->vm_flags = new_flags;
+		*purged = lpurged;
+
+		/* update count to distance covered so far*/
+		count = tmp - orig_start;
+
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			goto out;
+		if (prev)
+			vma = prev->vm_next;
+		else	/* madvise_remove dropped mmap_sem */
+			vma = find_vma(mm, start);
+	}
+out:
+	up_read(&mm->mmap_sem);
+
+	/* report bytes successfully marked, even if we're exiting on error */
+	if (count)
+		return count;
+
+	return ret;
+}
+
+SYSCALL_DEFINE4(vrange, unsigned long, start,
+		size_t, len, int, mode, int __user *, purged)
+{
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+	ssize_t ret = -EINVAL;
+	int p = 0;
+
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	if (start >= TASK_SIZE)
+		goto out;
+
+	if (purged) {
+		/* Test pointer is valid before making any changes */
+		if (put_user(p, purged))
+			return -EFAULT;
+	}
+
+	ret = do_vrange(mm, start, end, mode, &p);
+
+	if (purged) {
+		if (put_user(p, purged)) {
+			/*
+			 * This would be bad, since we've modified volatilty
+			 * and the change in purged state would be lost.
+			 */
+			WARN_ONCE(1, "vrange: purge state possibly lost\n");
+		}
+	}
+
+out:
+	return ret;
+}
-- 
1.8.3.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-03-14 18:33 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-14 18:33 [PATCH 0/3] Volatile Ranges (v11) John Stultz
2014-03-14 18:33 ` John Stultz [this message]
2014-03-17  9:21   ` [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas Jan Kara
2014-03-17  9:43     ` Jan Kara
2014-03-18  0:36       ` John Stultz
2014-03-17 22:19     ` John Stultz
2014-03-14 18:33 ` [PATCH 2/3] vrange: Add purged page detection on setting memory non-volatile John Stultz
2014-03-17  9:39   ` Jan Kara
2014-03-17 22:22     ` John Stultz
2014-03-14 18:33 ` [PATCH 3/3] vrange: Add page purging logic & SIGBUS trap John Stultz
2014-03-18 12:24 ` [PATCH 0/3] Volatile Ranges (v11) Michal Hocko
2014-03-18 17:53   ` John Stultz
2014-03-20  0:38   ` Dave Hansen
2014-03-20  0:57     ` John Stultz
2014-03-20  7:45     ` Minchan Kim
2014-03-18 15:11 ` Minchan Kim
2014-03-18 18:07   ` John Stultz
2014-03-19  0:49     ` Minchan Kim
2014-03-19 10:12       ` Jan Kara
2014-03-20  1:09         ` Minchan Kim
2014-03-20  8:13           ` Jan Kara
2014-03-21  5:29             ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1394822013-23804-2-git-send-email-john.stultz@linaro.org \
    --to=john.stultz@linaro.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave@sr71.net \
    --cc=dgiani@mozilla.com \
    --cc=dmitry.adamushko@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=kernel-team@android.com \
    --cc=kosaki.motohiro@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    --cc=mh@glandium.org \
    --cc=minchan@kernel.org \
    --cc=neilb@suse.de \
    --cc=riel@redhat.com \
    --cc=rlove@google.com \
    --cc=tglek@mozilla.com \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).