linux-api.vger.kernel.org archive mirror
From: Andrew Morton <akpm@linux-foundation.org>
To: Christoph Hellwig <hch@infradead.org>,
	Jeremy Allison <jra@samba.org>, Milosz Tanski <milosz@adfin.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-aio@kvack.org, Mel Gorman <mgorman@suse.de>,
	Volker Lendecke <Volker.Lendecke@sernet.de>,
	Tejun Heo <tj@kernel.org>, Jeff Moyer <jmoyer@redhat.com>,
	Theodore Ts'o <tytso@mit.edu>, Al Viro <viro@zeniv.linux.org.uk>,
	linux-api@vger.kernel.org,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	linux-arch@vger.kernel.org, Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
Date: Fri, 3 Apr 2015 20:42:09 -0700
Message-ID: <20150403204209.75405f37.akpm@linux-foundation.org>
In-Reply-To: <20150330132625.52b1250527ca3dcda79e349e@linux-foundation.org>

On Mon, 30 Mar 2015 13:26:25 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> d) fincore() is more expensive

Actually, I kinda take that back.  fincore() will be faster than
preadv2() in the case of a pagecache miss, and slower in the case of a
pagecache hit.

The breakpoint appears to be a hit rate of 30% - if fewer than 30% of
queries find the page in pagecache, fincore() will be faster than
preadv2().

This is because for a pagecache miss, fincore() will be about twice as
fast as preadv2().  For a pagecache hit, fincore()+pread() is 55%
slower than preadv2().  If there are lots of misses, fincore() is
faster overall.




Minimal fincore() implementation is below.  It doesn't implement the
page_map!=NULL mode at all and will be slow for large areas - it needs
to be taught about radix_tree_for_each_*().  But it's good enough for
testing.  
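
For readers who want to check the return-value semantics without building a kernel, here is a rough user-space model of the coverage loop in the patch below.  It is only a sketch: the `cached[]` array, the `model_fincore()` name and the 4 KiB page size are assumptions made for illustration, and a simple `-1` stands in for `-EINVAL`.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12                       /* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  (1LL << MODEL_PAGE_SHIFT)
#define MODEL_PAGE_MASK  (~(MODEL_PAGE_SIZE - 1))

/*
 * Hypothetical model of the fincore() loop.  cached[i] says whether page i
 * is present and up to date.  Returns the number of bytes, starting at
 * offset, that are covered by an unbroken run of cached pages.
 */
static long long model_fincore(const bool *cached, size_t npages,
			       long long i_size, long long offset,
			       long long len)
{
	long long end, cur_off, ret = 0;
	size_t pgoff;

	if (offset < 0 || len <= 0)
		return -1;	/* stands in for -EINVAL */

	/* Clamp the range to the file size, as the patch does. */
	end = offset + len < i_size ? offset + len : i_size;
	pgoff = (size_t)(offset >> MODEL_PAGE_SHIFT);

	for (cur_off = offset; cur_off < end; ) {
		long long end_of_coverage;

		/* Stop at the first page that is not cached. */
		if (pgoff >= npages || !cached[pgoff])
			break;

		pgoff++;
		end_of_coverage = (long long)pgoff << MODEL_PAGE_SHIFT;
		if (end_of_coverage > end)
			end_of_coverage = end;
		ret += end_of_coverage - cur_off;

		/* Advance to the start of the next page. */
		cur_off = (cur_off + MODEL_PAGE_SIZE) & MODEL_PAGE_MASK;
	}
	return ret;
}
```

With pages 0 and 1 cached and page 2 missing, a query at offset 100 for 10000 bytes is short of `len`, so the caller would punt; with all three pages cached it returns exactly `len`.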

On a slow machine, in nanoseconds:

null syscall:		528
fincore (miss):		674
fincore (hit):		729
single byte pread:	1026
single byte preadv:	1134

pread() is a bit faster than preadv() and samba uses pread(), so the
implementations are:

	if (fincore(fd, NULL, offset, len) == len)
		pread();
	else
		punt();

	if (preadv2(fd, ..., offset, len) == len)
		...
	else
		punt();

fincore+pread, pagecache-hit:	1755ns
fincore+pread, pagecache-miss:	674ns
preadv():			1134ns (preadv2() will be a little faster for misses)
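
The break-even hit rate can be worked out from these figures.  The helper below is a sketch, and `breakeven_hit_rate()` is a name invented here.  It solves h*hit_A + (1-h)*miss_A = h*hit_B + (1-h)*miss_B for h.  The preadv2() miss cost was not measured above, so it is left as a parameter.

```c
/*
 * Hit rate at which fincore()+pread() and preadv2() cost the same on
 * average.  Below this rate fincore()+pread() is cheaper, above it
 * preadv2() is cheaper.  All arguments are in nanoseconds.
 */
static double breakeven_hit_rate(double fincore_miss_ns,
				 double fincore_pread_hit_ns,
				 double preadv2_hit_ns,
				 double preadv2_miss_ns)
{
	/* Time fincore() saves on every miss. */
	double saved_per_miss = preadv2_miss_ns - fincore_miss_ns;
	/* Time fincore()+pread() loses on every hit. */
	double extra_per_hit = fincore_pread_hit_ns - preadv2_hit_ns;

	return saved_per_miss / (saved_per_miss + extra_per_hit);
}
```

Plugging in 674, 1755, 1134 and, as an upper bound, 1134 for the preadv2() miss gives roughly 0.43.  A cheaper preadv2() miss shrinks `saved_per_miss` and moves the break-even point lower.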



Now, a pagecache hit rate of 30% sounds high so one would think that
fincore+pread is clearly ahead.  But the pagecache hit rate in this
code will actually be quite high, because of readahead.

For a large linear read of a file which is perfectly laid out on disk
and is fully *uncached*, the hit rates will be as good as 99.8%,
because readahead is bringing in data in 2MB blobs.
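
As a sanity check on that figure, here is the arithmetic under two stated assumptions: each query reads one 4 KiB page, and only the first query in each readahead window misses.  The function name is made up for this sketch.

```c
/*
 * Fraction of queries that hit the pagecache during a linear read when
 * every readahead window of readahead_bytes costs exactly one miss.
 */
static double linear_read_hit_rate(long long readahead_bytes,
				   long long query_bytes)
{
	long long queries = readahead_bytes / query_bytes;

	return (double)(queries - 1) / (double)queries;
}
```

For a 2MB window and 4 KiB queries that is 511 hits out of 512 queries, or about 99.8%.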

In practice I expect that fincore()+pread() will be slower for linear
reads of medium to large files and faster for small files and seeky
accesses.

How much does all this matter?  Not much.  On a fast machine a
single-byte pread() takes 240ns.  So if your server thread is handling
25000 requests/sec, we're only talking 0.6% overhead.
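
The overhead estimate is just syscall time multiplied by request rate.  A small sketch, with a helper name invented here:

```c
/*
 * Fraction of one CPU-second spent in a syscall that costs syscall_ns
 * nanoseconds and is issued requests_per_sec times per second.
 */
static double syscall_overhead_fraction(double syscall_ns,
					double requests_per_sec)
{
	return syscall_ns * requests_per_sec / 1e9;
}
```

240ns at 25000 requests/sec is 6ms of every second, which is the 0.6% quoted above.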

Note that we can trivially monitor the hit rate with either preadv2()
or fincore()+pread(): just count how many times all the data is there
versus how many times it isn't.
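
A minimal sketch of such a counter follows.  The `hit_stats` struct and its helpers are hypothetical names, not part of any patch in this thread; `returned` is whatever fincore() or preadv2() handed back and `requested` is the length asked for.

```c
#include <stddef.h>
#include <sys/types.h>

/* Running totals of fully satisfied versus short or failed queries. */
struct hit_stats {
	unsigned long hits;
	unsigned long misses;
};

/* Count a query as a hit only if all requested bytes were available. */
static void hit_stats_record(struct hit_stats *s, ssize_t returned,
			     size_t requested)
{
	if (returned >= 0 && (size_t)returned == requested)
		s->hits++;
	else
		s->misses++;
}

/* Hit rate so far, or 0.0 if nothing has been recorded yet. */
static double hit_stats_rate(const struct hit_stats *s)
{
	unsigned long total = s->hits + s->misses;

	return total ? (double)s->hits / (double)total : 0.0;
}
```

A server would call `hit_stats_record()` once per read and compare `hit_stats_rate()` against the break-even point for its workload.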



Also, note that we can use *both* fincore() and preadv2() to detect the
problematic page-just-disappeared race:

	if (fincore(fd, NULL, offset, len) == len) {
		if (preadv2(fd, ..., offset, len) != len)
			printf("race just happened\n");
	}

It would be great if someone could apply the below, modify the
preadv2() callsite as above and determine under what conditions (if
any) the page-stealing race occurs.



 arch/x86/syscalls/syscall_64.tbl |    1 
 include/linux/syscalls.h         |    2 
 mm/Makefile                      |    2 
 mm/fincore.c                     |   65 +++++++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 1 deletion(-)

diff -puN arch/x86/syscalls/syscall_64.tbl~fincore arch/x86/syscalls/syscall_64.tbl
--- a/arch/x86/syscalls/syscall_64.tbl~fincore
+++ a/arch/x86/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	64	preadv2			sys_preadv2
 324	64	pwritev2		sys_pwritev2
+325	common	fincore			sys_fincore
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN include/linux/syscalls.h~fincore include/linux/syscalls.h
--- a/include/linux/syscalls.h~fincore
+++ a/include/linux/syscalls.h
@@ -880,6 +880,8 @@ asmlinkage long sys_process_vm_writev(pi
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 			 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_fincore(int fd, unsigned char __user *page_map,
+			    loff_t offset, size_t len);
 asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
 			    const char __user *uargs);
 asmlinkage long sys_getrandom(char __user *buf, size_t count,
diff -puN mm/Makefile~fincore mm/Makefile
--- a/mm/Makefile~fincore
+++ a/mm/Makefile
@@ -19,7 +19,7 @@ obj-y			:= filemap.o mempool.o oom_kill.
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o vmacache.o \
+			   compaction.o vmacache.o fincore.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
diff -puN /dev/null mm/fincore.c
--- /dev/null
+++ a/mm/fincore.c
@@ -0,0 +1,65 @@
+#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+
+SYSCALL_DEFINE4(fincore, int, fd, unsigned char __user *, page_map,
+		loff_t, offset, size_t, len)
+{
+	struct fd f;
+	struct address_space *mapping;
+	loff_t cur_off;
+	loff_t end;
+	pgoff_t pgoff;
+	long ret = 0;
+
+	if (offset < 0 || (ssize_t)len <= 0)
+		return -EINVAL;
+
+	f = fdget(fd);
+
+	if (!f.file)
+		return -EBADF;
+
+	if (is_file_hugepages(f.file)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!S_ISREG(file_inode(f.file)->i_mode)) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	end = min_t(loff_t, offset + len, i_size_read(file_inode(f.file)));
+	pgoff = offset >> PAGE_CACHE_SHIFT;
+	mapping = f.file->f_mapping;
+
+	/*
+	 * We probably need to do something here to reduce the chance of the
+	 * pages being reclaimed between fincore() and read().  eg,
+	 * SetPageReferenced(page) or mark_page_accessed(page) or
+	 * activate_page(page).
+	 */
+	for (cur_off = offset; cur_off < end ; ) {
+		struct page *page;
+		loff_t end_of_coverage;
+
+		page = find_get_page(mapping, pgoff);
+		if (!page || !PageUptodate(page))
+			break;
+		page_cache_release(page);
+
+		pgoff++;
+		end_of_coverage = min_t(loff_t, pgoff << PAGE_CACHE_SHIFT, end);
+		ret += end_of_coverage - cur_off;
+		cur_off = (cur_off + PAGE_CACHE_SIZE) & PAGE_CACHE_MASK;
+	}
+
+out:
+	fdput(f);
+	return ret;
+}
_

