From: Simon Jeons <simon.jeons@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Rusty Russell <rusty@rustcorp.com.au>,
LKML <linux-kernel@vger.kernel.org>,
Nick Piggin <npiggin@suse.de>,
Stewart Smith <stewart@flamingspork.com>,
linux-mm@kvack.org, linux-arch@vger.kernel.org
Subject: Re: [patch 1/2] mm: fincore()
Date: Tue, 19 Feb 2013 18:25:31 +0800
Message-ID: <5123531B.8090301@gmail.com>
In-Reply-To: <20130215063450.GA24047@cmpxchg.org>
Hi Johannes,
On 02/15/2013 02:34 PM, Johannes Weiner wrote:
> On Mon, Feb 11, 2013 at 02:12:39PM -0800, Andrew Morton wrote:
>> Also, having to mmap the file to be able to query pagecache state is a
>> hack. Whatever happened to the fincore() patch?
> I don't know, but how about this one:
>
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch 1/2] mm: fincore()
>
> Provide a syscall to determine whether a given file's pages are cached
> in memory. This is more elegant than mmapping the file for the sole
> purpose of using mincore(), and also works on NOMMU.
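For reference, the workaround the changelog calls a hack looks roughly like the sketch below: map the file, then ask mincore() about the mapping. This is a minimal userspace illustration under my own assumptions, not part of the patch. count_cached_pages() is a made-up helper name, and error handling is kept to the bare minimum.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count the pages of `path` that mincore() reports as resident, using the
 * mmap() + mincore() workaround. The total number of pages is stored in
 * *total. Returns the resident count, or -1 on error.
 * (Hypothetical helper for illustration only.)
 */
long count_cached_pages(const char *path, size_t *total)
{
	struct stat st;
	unsigned char *vec = NULL;
	void *map = MAP_FAILED;
	size_t page, len, nr_pages, i;
	long resident = -1;
	int fd;

	assert(path && total);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0)
		goto out;

	page = (size_t)sysconf(_SC_PAGESIZE);
	len = (size_t)st.st_size;
	nr_pages = (len + page - 1) / page;	/* one status byte per page */
	*total = nr_pages;
	if (nr_pages == 0) {
		resident = 0;			/* empty file: nothing to map */
		goto out;
	}

	/* The mapping exists only so that mincore() has an address range. */
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	vec = malloc(nr_pages);
	if (map == MAP_FAILED || !vec)
		goto out;

	if (mincore(map, len, vec) == 0) {
		resident = 0;
		for (i = 0; i < nr_pages; i++)
			resident += vec[i] & 1;	/* bit 0: page is resident */
	}
out:
	free(vec);
	if (map != MAP_FAILED)
		munmap(map, len);
	close(fd);
	return resident;
}
```

The mmap()/munmap() pair is the part a file-descriptor-based call would make unnecessary. As with the proposed syscall, the result may already be stale by the time the caller looks at it.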
Who are the users of mincore()/fincore()? In which scenarios do user
processes need to know whether their pages are resident in memory?
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> include/linux/syscalls.h | 2 +
> mm/Makefile | 2 +-
> mm/fincore.c | 128 +++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 131 insertions(+), 1 deletion(-)
> create mode 100644 mm/fincore.c
>
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 313a8e0..3ceab2a 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -897,4 +897,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
> asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
> unsigned long idx1, unsigned long idx2);
> asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
> +asmlinkage long sys_fincore(unsigned int fd, loff_t start, loff_t len,
> + unsigned char __user * vec);
> #endif
> diff --git a/mm/Makefile b/mm/Makefile
> index 185a22b..221cdae 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -17,7 +17,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
> util.o mmzone.o vmstat.o backing-dev.o \
> mm_init.o mmu_context.o percpu.o slab_common.o \
> compaction.o balloon_compaction.o \
> - interval_tree.o $(mmu-y)
> + interval_tree.o fincore.o $(mmu-y)
>
> obj-y += init-mm.o
>
> diff --git a/mm/fincore.c b/mm/fincore.c
> new file mode 100644
> index 0000000..d504611
> --- /dev/null
> +++ b/mm/fincore.c
> @@ -0,0 +1,128 @@
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +
> +static long do_fincore(struct address_space *mapping, pgoff_t pgstart,
> +		       unsigned long nr_pages, unsigned char *vec)
> +{
> +	pgoff_t pgend = pgstart + nr_pages;
> +	struct radix_tree_iter iter;
> +	void **slot;
> +	long nr = 0;
> +
> +	rcu_read_lock();
> +restart:
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, pgstart) {
> +		unsigned char present;
> +		struct page *page;
> +
> +		/* Handle holes */
> +		if (iter.index != pgstart + nr) {
> +			if (iter.index < pgend)
> +				nr_pages = iter.index - pgstart;
> +			break;
> +		}
> +repeat:
> +		page = radix_tree_deref_slot(slot);
> +		if (unlikely(!page))
> +			continue;
> +
> +		if (radix_tree_exception(page)) {
> +			if (radix_tree_deref_retry(page)) {
> +				/*
> +				 * Transient condition which can only trigger
> +				 * when entry at index 0 moves out of or back
> +				 * to root: none yet gotten, safe to restart.
> +				 */
> +				WARN_ON(iter.index);
> +				goto restart;
> +			}
> +			present = 0;
> +		} else {
> +			if (!page_cache_get_speculative(page))
> +				goto repeat;
> +
> +			/* Has the page moved? */
> +			if (unlikely(page != *slot)) {
> +				page_cache_release(page);
> +				goto repeat;
> +			}
> +
> +			present = PageUptodate(page);
> +			page_cache_release(page);
> +		}
> +		vec[nr] = present;
> +
> +		if (++nr == nr_pages)
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	if (nr < nr_pages)
> +		memset(vec + nr, 0, nr_pages - nr);
> +
> +	return nr_pages;
> +}
> +
> +/*
> + * The fincore(2) system call.
> + *
> + * fincore() returns the memory residency status of the given file's
> + * pages, in the range [start, start + len].
> + * The status is returned in a vector of bytes. The least significant
> + * bit of each byte is 1 if the referenced page is in memory, otherwise
> + * it is zero.
> + *
> + * Because the status of a page can change after fincore() checks it
> + * but before it returns to the application, the returned vector may
> + * contain stale information.
> + *
> + * return values:
> + * zero - success
> + * -EBADF - fd isn't a valid open file descriptor
> + * -EFAULT - vec points to an illegal address
> + * -EINVAL - start is not a multiple of PAGE_CACHE_SIZE
> + */
> +SYSCALL_DEFINE4(fincore, unsigned int, fd, loff_t, start, loff_t, len,
> +		unsigned char __user *, vec)
> +{
> +	unsigned long nr_pages;
> +	pgoff_t pgstart;
> +	struct fd f;
> +	long ret;
> +
> +	if (start & ~PAGE_CACHE_MASK)
> +		return -EINVAL;
> +
> +	f = fdget(fd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pgstart = start >> PAGE_CACHE_SHIFT;
> +	nr_pages = DIV_ROUND_UP(len, PAGE_CACHE_SIZE);
> +
> +	while (nr_pages) {
> +		unsigned char tmp[64];
> +
> +		ret = do_fincore(f.file->f_mapping, pgstart,
> +				 min(nr_pages, sizeof(tmp)), tmp);
> +		if (ret <= 0)
> +			break;
> +
> +		if (copy_to_user(vec, tmp, ret)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		nr_pages -= ret;
> +		pgstart += ret;
> +		vec += ret;
> +		ret = 0;
> +	}
> +
> +	fdput(f);
> +
> +	return ret;
> +}
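Going by the comment above the syscall, a caller would need one status byte per page in the range and would test bit 0 of each byte. Below is a small sketch of that bookkeeping, assuming the interface stays as posted. The helper names fincore_vec_len() and fincore_count_resident() are made up for illustration, and no syscall is invoked.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of status bytes needed to cover `len` bytes of a file:
 * one byte per page, rounded up, mirroring the patch's
 * DIV_ROUND_UP(len, PAGE_CACHE_SIZE).
 * (Hypothetical helper, not part of the patch.)
 */
size_t fincore_vec_len(size_t len, size_t page_size)
{
	assert(page_size > 0);
	return (len + page_size - 1) / page_size;
}

/*
 * Count the entries whose least significant bit is set, i.e. the pages
 * reported as resident. Only bit 0 is tested, since the patch comment
 * defines only that bit.
 * (Hypothetical helper, not part of the patch.)
 */
size_t fincore_count_resident(const unsigned char *vec, size_t nr_pages)
{
	size_t i, n = 0;

	for (i = 0; i < nr_pages; i++)
		n += vec[i] & 1;
	return n;
}
```

Masking with 1 rather than comparing the byte against zero keeps the caller correct if the remaining bits are ever given a meaning.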