* read()/readv() only from page cache
From: Milosz Tanski @ 2014-07-25  2:36 UTC
  To: Mel Gorman; +Cc: LKML

Mel,

I've been following your recent work with the postgres folks to improve
the kernel for postgres-like workloads (which can really help all
database-like loads). After spending some time of my own fighting
similar problems, I figured I'd reach out to see if there's something
that can be done to make my use case easier. I was wondering if there
is a read-family syscall that allows me to read from a file descriptor
only if the data is in the page cache (or to read only the portion of
the data that is in the page cache).

My userspace application (a database-like system) is divided into three
kinds of threads. There are threads for processing data, IO threads
(mostly for reading data), and threads for networking (epoll), but the
last kind is not interesting here. What I would like to be able to do
is issue a read call in the processing thread to get more data ... if
it exists in the page cache. If it doesn't, I would queue that work to
the IO threads. Today as it stands I always have to queue the work to
the IO threads, and I end up paying for the message passing (and
synchronization) even in the case where it's a simple page cache to
userspace buffer memcpy. Add kernel readahead to my example and it's a
pretty big win.

I'm not the only person who laments the lack of this kind of facility.
Other folks have also been frustrated by not being able to tell whether
a read will block:

http://www.1024cores.net/home/scalable-architecture/parallel-disk-io/the-solution

The sad part is that we do have a similar facility for non-file fds:
recvmsg(), where you can pass the MSG_DONTWAIT flag and have it return
immediately if there's no data in the buffer. Sadly it doesn't work for
regular files. I understand that there is a mincore() syscall, but in
this case it's not useful since it requires an extra syscall (and
mapping the file first).

Is there any kind of facility / solution for my problem that I can
leverage in the Linux kernel? Linus is always adamant about working
with the page cache rather than against it, and in this case that's
exactly what I'm trying to do.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
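To make the cost described above concrete, here is a minimal sketch of
the status quo: every read, even one that would be a pure
page-cache-to-userspace memcpy, pays for the hand-off to an IO thread.
All names (read_req, io_thread_serve, and so on) are illustrative
assumptions, not taken from any real codebase.

/*
 * Status-quo sketch: the processing thread always queues reads to the
 * IO thread pool and pays message-passing + synchronization costs,
 * even when the data is already in the page cache.
 */
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

struct read_req {
	int fd;
	void *buf;
	size_t len;
	off_t off;
	ssize_t result;
	pthread_mutex_t lock;
	pthread_cond_t done;
	bool ready;
};

/* Runs on an IO thread: the pread() itself may be a cheap memcpy from
 * the page cache, but the queueing round trip is paid regardless. */
static void io_thread_serve(struct read_req *r)
{
	r->result = pread(r->fd, r->buf, r->len, r->off);
	pthread_mutex_lock(&r->lock);
	r->ready = true;
	pthread_cond_signal(&r->done);
	pthread_mutex_unlock(&r->lock);
}

/* Runs on a processing thread: blocks on the pool even for cached data. */
static ssize_t processing_thread_wait(struct read_req *r)
{
	pthread_mutex_lock(&r->lock);
	while (!r->ready)
		pthread_cond_wait(&r->done, &r->lock);
	pthread_mutex_unlock(&r->lock);
	return r->result;
}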
* Re: read()/readv() only from page cache
From: Mel Gorman @ 2014-09-05 11:09 UTC
  To: Milosz Tanski; +Cc: LKML

On Thu, Jul 24, 2014 at 10:36:33PM -0400, Milosz Tanski wrote:
> After spending some time of my own fighting similar problems, I
> figured I'd reach out to see if there's something that can be done to
> make my use case easier. I was wondering if there is a read-family
> syscall that allows me to read from a file descriptor only if the
> data is in the page cache (or to read only the portion of the data
> that is in the page cache).

I suggest you look at the recent fincore debate. It did not progress
much the last time because the author wanted to push a lot of
functionality in there whereas reviewers felt it should start simple.
The simple case is likely a good fit for what you want. The primary
downside is that it would be race-prone under memory pressure, as a
page could be reclaimed between the fincore check and the read, but I
expect that your application is already avoiding reclaim activity.
Depending on your application, fincore is far cheaper than mincore,
because mincore requires the file to be mapped first, which in a
threaded application will crucify performance if called regularly.

Technically nothing would prevent the implementation of an fcntl
operation that returned failure from read() when the page is not in the
page cache. However, the use case is so specific, and so
Linux-specific, that it would encounter resistance being merged. The
likely feedback would be to implement fincore, or to explain in detail
why fincore is not sufficient -- which would be a tough argument to
win. You'll get beaten with the "interfaces are forever and your use
case is too specific" stick.

The argument that fincore is an extra syscall is not likely to get much
traction, as it'll be pointed out that you are already incurring IPC
and synchronisation overhead. Relative to that, the cost of fincore
should be negligible.

-- 
Mel Gorman
SUSE Labs
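For reference, this is roughly what the residency probe looks like with
what exists today, using mincore() (fincore was never merged at the
time of this thread). It illustrates both costs mentioned above: the
file has to be mapped first, and the answer can go stale before the
read() runs. A sketch, not a recommendation:

/* Probe whether [off, off+len) is resident in the page cache.
 * Returns 1 if all pages are resident, 0 if any is missing, -1 on
 * error.  Even a "1" can be invalidated by reclaim before the read. */
#include <sys/mman.h>
#include <unistd.h>

static int range_in_page_cache(int fd, off_t off, size_t len)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	off_t aligned = off & ~((off_t)pgsz - 1);	/* mmap needs page alignment */
	size_t maplen = len + (size_t)(off - aligned);
	size_t pages = (maplen + pgsz - 1) / pgsz;
	unsigned char vec[pages];			/* VLA for brevity */
	void *p = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, aligned);
	if (p == MAP_FAILED)
		return -1;
	int rc = mincore(p, maplen, vec);
	munmap(p, maplen);
	if (rc < 0)
		return -1;
	for (size_t i = 0; i < pages; i++)
		if (!(vec[i] & 1))
			return 0;
	return 1;
}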
* Re: read()/readv() only from page cache
From: Christoph Hellwig @ 2014-09-05 15:48 UTC
  To: Mel Gorman; +Cc: Milosz Tanski, LKML, Volker Lendecke, Tejun Heo, linux-aio

[-- Attachment #1: Type: text/plain, Size: 1660 bytes --]

On Fri, Sep 05, 2014 at 12:09:27PM +0100, Mel Gorman wrote:
> I suggest you look at the recent fincore debate. It did not progress
> much the last time because the author wanted to push a lot of
> functionality in there whereas reviewers felt it should start simple.
> The simple case is likely a good fit for what you want. The primary
> downside is that it would be race-prone under memory pressure, as a
> page could be reclaimed between the fincore check and the read, but I
> expect that your application is already avoiding reclaim activity.

I've actually experimentally hacked up O_NONBLOCK support for regular
files so that it only returns data from the page cache, and nothing
otherwise. Volker promised to test it with Samba, but we never made any
progress on it, and just last week a customer told me they would have
liked to use it if it was available.

Note that we might want to also avoid blocking on locks, and I have
some vague memory that we shouldn't actually implement O_NONBLOCK on
regular files for compatibility reasons, but would have to use a new
flag instead.

Note that mincore/fincore would not help for the usual use case where
you have a non-blocking event main loop and want to offload actual
blocking I/O to helper threads, as it returns information that can go
stale at any time.

One further consideration would be to finally implement real buffered
I/O in kernel space by something like the above plus offloading to
workqueues in kernelspace. I think our workqueues now are way better
than any possible user thread pool, although we'd need to find a way
to temporarily tie the work threads to a user address space.

[-- Attachment #2: support-O_NONBLOCK-reads --]
[-- Type: text/plain, Size: 2925 bytes --]

Index: xfs/include/linux/mm.h
===================================================================
--- xfs.orig/include/linux/mm.h	2012-09-21 17:50:32.243490371 +0200
+++ xfs/include/linux/mm.h	2012-09-21 17:50:43.333490305 +0200
@@ -1467,6 +1467,12 @@ void page_cache_async_readahead(struct a
 			pgoff_t offset,
 			unsigned long size);
 
+void page_cache_nonblock_readahead(struct address_space *mapping,
+			struct file_ra_state *ra,
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long size);
+
 unsigned long max_sane_readahead(unsigned long nr);
 
 unsigned long ra_submit(struct file_ra_state *ra,
 			struct address_space *mapping,
Index: xfs/mm/filemap.c
===================================================================
--- xfs.orig/mm/filemap.c	2012-09-21 17:50:32.243490371 +0200
+++ xfs/mm/filemap.c	2012-09-21 18:35:53.343474115 +0200
@@ -1107,6 +1107,9 @@ static void do_generic_file_read(struct
 find_page:
 		page = find_get_page(mapping, index);
 		if (!page) {
+			if (filp && filp->f_flags & O_NONBLOCK)
+				goto short_read;
+
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -1218,6 +1221,12 @@ page_not_up_to_date_locked:
 		}
 
 readpage:
+		if (filp && filp->f_flags & O_NONBLOCK) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto short_read;
+		}
+
 		/*
 		 * A previous I/O error may have been due to temporary
 		 * failures, eg. multipath errors.
@@ -1293,6 +1302,10 @@ out:
 	*ppos = ((loff_t)index << PAGE_CACHE_SHIFT) + offset;
 
 	file_accessed(filp);
+	return;
+short_read:
+	page_cache_nonblock_readahead(mapping, ra, filp, index,
+			last_index - index);
 }
 
 int file_read_actor(read_descriptor_t *desc, struct page *page,
@@ -1466,7 +1479,8 @@ generic_file_aio_read(struct kiocb *iocb
 		do_generic_file_read(filp, ppos, &desc, file_read_actor);
 		retval += desc.written;
 		if (desc.error) {
-			retval = retval ?: desc.error;
+			if (!retval)
+				retval = desc.error;
 			break;
 		}
 		if (desc.count > 0)
Index: xfs/mm/readahead.c
===================================================================
--- xfs.orig/mm/readahead.c	2012-09-21 17:50:32.243490371 +0200
+++ xfs/mm/readahead.c	2012-09-21 17:50:43.336823641 +0200
@@ -565,6 +565,22 @@ page_cache_async_readahead(struct addres
 }
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
 
+void
+page_cache_nonblock_readahead(struct address_space *mapping,
+		struct file_ra_state *ra, struct file *filp,
+		pgoff_t offset, unsigned long req_size)
+{
+	/*
+	 * Defer asynchronous read-ahead on IO congestion.
+	 */
+	if (bdi_read_congested(mapping->backing_dev_info))
+		return;
+
+	/* do read-ahead */
+	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_nonblock_readahead);
+
 static ssize_t
 do_readahead(struct address_space *mapping, struct file *filp,
 		pgoff_t index, unsigned long nr)
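As a reading aid for the patch, a usage sketch (not part of the patch;
the return-value conventions here are inferred from the code above, not
defined by it): with the hack applied, a file opened with O_NONBLOCK
yields a short read for any uncached tail while readahead is kicked off
in the background.

/* Hedged usage sketch against the experimental patch above. */
#include <fcntl.h>
#include <unistd.h>

static ssize_t try_cached_read(const char *path, void *buf, size_t len)
{
	int fd = open(path, O_RDONLY | O_NONBLOCK);	/* patched semantics assumed */
	if (fd < 0)
		return -1;
	ssize_t n = read(fd, buf, len);
	/* n < len: the tail was not cached and readahead is now queued;
	 * retry on the IO threads later.  n == 0 is ambiguous with EOF --
	 * one reason a new flag is suggested instead of O_NONBLOCK. */
	close(fd);
	return n;
}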
* Re: read()/readv() only from page cache
From: Jeff Moyer @ 2014-09-05 16:02 UTC
  To: Christoph Hellwig
  Cc: Mel Gorman, Milosz Tanski, LKML, Volker Lendecke, Tejun Heo, linux-aio

Christoph Hellwig <hch@infradead.org> writes:

> On Fri, Sep 05, 2014 at 12:09:27PM +0100, Mel Gorman wrote:
>> I suggest you look at the recent fincore debate. It did not progress
>> much the last time because the author wanted to push a lot of
>> functionality in there whereas reviewers felt it should start
>> simple. The simple case is likely a good fit for what you want. The
>> primary downside is that it would be race-prone under memory
>> pressure, as a page could be reclaimed between the fincore check and
>> the read, but I expect that your application is already avoiding
>> reclaim activity.
>
> I've actually experimentally hacked up O_NONBLOCK support for regular
> files so that it only returns data from the page cache, and nothing
> otherwise. Volker promised to test it with Samba, but we never made
> any progress on it, and just last week a customer told me they would
> have liked to use it if it was available.
>
> Note that we might want to also avoid blocking on locks, and I have
> some vague memory that we shouldn't actually implement O_NONBLOCK on
> regular files for compatibility reasons, but would have to use a new
> flag instead.

FWIW, here's a discussion from an old attempt at O_NONBLOCK for regular
files:

http://www.gossamer-threads.com/lists/linux/kernel/477936?do=post_view_threaded#477936

I recall it blowing up in various situations, so yeah, a new flag would
be a good idea.

> Note that mincore/fincore would not help for the usual use case where
> you have a non-blocking event main loop and want to offload actual
> blocking I/O to helper threads, as it returns information that can go
> stale at any time.
>
> One further consideration would be to finally implement real buffered
> I/O in kernel space by something like the above plus offloading to
> workqueues in kernelspace. I think our workqueues now are way better
> than any possible user thread pool, although we'd need to find a way
> to temporarily tie the work threads to a user address space.

Do you mean real buffered AIO?

Cheers,
Jeff
* Re: read()/readv() only from page cache
From: Christoph Hellwig @ 2014-09-05 16:04 UTC
  To: Jeff Moyer
  Cc: Christoph Hellwig, Mel Gorman, Milosz Tanski, LKML, Volker Lendecke, Tejun Heo, linux-aio

On Fri, Sep 05, 2014 at 12:02:19PM -0400, Jeff Moyer wrote:
> > One further consideration would be to finally implement real
> > buffered I/O in kernel space by something like the above plus
> > offloading to workqueues in kernelspace. I think our workqueues now
> > are way better than any possible user thread pool, although we'd
> > need to find a way to temporarily tie the work threads to a user
> > address space.
>
> Do you mean real buffered AIO?

Yes.
* Re: read()/readv() only from page cache
From: Milosz Tanski @ 2014-09-05 16:27 UTC
  To: Christoph Hellwig; +Cc: Mel Gorman, LKML, Volker Lendecke, Tejun Heo, linux-aio

I would prefer an interface more like recv(), where I can specify per
call whether I want blocking behavior for this read or not. Let me
explain why.

In a VLDB-like workload this would enable me to lower the latency of
common fast requests. By fast requests I mean ones that do not require
much data, where the data is cached, or where there's a predictable
read pattern (read-ahead). Obviously it would be at the expense of the
latency of large/slow requests (they have to make 2 read calls, the
first one always returning EWOULDBLOCK) ... but in that case it doesn't
matter, since the time to do the actual IO would trump any extra
latency.

Essentially, it's using the kernel facilities (the page cache) to help
me perform better (in a more predictable fashion). I would implement
this in our application tomorrow. It's frustrating that there is a
similar interface (the recv* family) that I cannot use.

I know there have been a bunch of attempts at buffered AIO and none of
them made it into the kernel. This would let me build a buffered AIO
implementation in user-space using a threadpool, and cached data would
not end up getting blocked behind other non-cached requests sitting in
the queue. I know there are other sources of blocking (locking,
metadata lookups) but direct AIO already suffers from those, so I'm
fine to paper over that for now.

On Fri, Sep 5, 2014 at 11:48 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Sep 05, 2014 at 12:09:27PM +0100, Mel Gorman wrote:
>> I suggest you look at the recent fincore debate. It did not progress
>> much the last time because the author wanted to push a lot of
>> functionality in there whereas reviewers felt it should start
>> simple. The simple case is likely a good fit for what you want. The
>> primary downside is that it would be race-prone under memory
>> pressure, as a page could be reclaimed between the fincore check and
>> the read, but I expect that your application is already avoiding
>> reclaim activity.
>
> I've actually experimentally hacked up O_NONBLOCK support for regular
> files so that it only returns data from the page cache, and nothing
> otherwise. Volker promised to test it with Samba, but we never made
> any progress on it, and just last week a customer told me they would
> have liked to use it if it was available.
>
> Note that we might want to also avoid blocking on locks, and I have
> some vague memory that we shouldn't actually implement O_NONBLOCK on
> regular files for compatibility reasons, but would have to use a new
> flag instead.
>
> Note that mincore/fincore would not help for the usual use case where
> you have a non-blocking event main loop and want to offload actual
> blocking I/O to helper threads, as it returns information that can go
> stale at any time.
>
> One further consideration would be to finally implement real buffered
> I/O in kernel space by something like the above plus offloading to
> workqueues in kernelspace. I think our workqueues now are way better
> than any possible user thread pool, although we'd need to find a way
> to temporarily tie the work threads to a user address space.
-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
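For concreteness, a sketch of what the per-call flag asked for above
could look like, modeled on recv(2)'s flags argument. The syscall name
preadv2 and the flag RWF_NONBLOCK are assumptions for illustration; no
such interface existed at the time of this thread.

/* Hypothetical per-call non-blocking read.  preadv2()/RWF_NONBLOCK are
 * assumed names, not an existing API here. */
#include <errno.h>
#include <sys/uio.h>

#define RWF_NONBLOCK 0x00000001		/* assumed: serve from page cache only */

extern ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
		       off_t offset, int flags);	/* assumed syscall */

static ssize_t read_fast_path(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t n = preadv2(fd, &iov, 1, off, RWF_NONBLOCK);
	if (n < 0 && errno == EWOULDBLOCK)
		return -1;	/* not cached: caller hands off to IO threads */
	return n;		/* cached (possibly short) read, never blocked */
}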
* Re: read()/readv() only from page cache
From: Christoph Hellwig @ 2014-09-05 16:32 UTC
  To: Milosz Tanski; +Cc: Mel Gorman, LKML, Volker Lendecke, Tejun Heo, linux-aio

On Fri, Sep 05, 2014 at 12:27:21PM -0400, Milosz Tanski wrote:
> I would prefer an interface more like recv(), where I can specify per
> call whether I want blocking behavior for this read or not. Let me
> explain why.
>
> In a VLDB-like workload this would enable me to lower the latency of
> common fast requests. By fast requests I mean ones that do not
> require much data, where the data is cached, or where there's a
> predictable read pattern (read-ahead). Obviously it would be at the
> expense of the latency of large/slow requests (they have to make 2
> read calls, the first one always returning EWOULDBLOCK) ... but in
> that case it doesn't matter, since the time to do the actual IO would
> trump any extra latency.

This is another good suggestion. I've actually heard people asking for
per-I/O flags for other use cases. The one I can remember is applying
O_DSYNC only for FUA writes on a SCSI target; the other one would be
Samba again, as SMB allows per-I/O flags on the wire as well.

> Essentially, it's using the kernel facilities (the page cache) to
> help me perform better (in a more predictable fashion). I would
> implement this in our application tomorrow. It's frustrating that
> there is a similar interface (the recv* family) that I cannot use.
>
> I know there have been a bunch of attempts at buffered AIO and none
> of them made it into the kernel. This would let me build a buffered
> AIO implementation in user-space using a threadpool, and cached data
> would not end up getting blocked behind other non-cached requests
> sitting in the queue. I know there are other sources of blocking
> (locking, metadata lookups) but direct AIO already suffers from
> those, so I'm fine to paper over that for now.

Although I still think providing useful AIO at the kernel level would
be better than having everyone reimplement it, it would still be useful
to allow people to sanely reimplement it -- if only to avoid the
discussion about which API to use, between the non-standard and not
really that nice Linux io_submit and the utterly horrible POSIX aio_*
semantics.
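The write-side use case mentioned above would look analogous; again the
names (pwritev2, RWF_DSYNC) are assumed for illustration only.

/* Sketch: per-I/O O_DSYNC so e.g. a SCSI target can honor FUA on a
 * single write without reopening the file.  Assumed names throughout. */
#include <sys/uio.h>

#define RWF_DSYNC 0x00000002			/* assumed flag */

extern ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
			off_t offset, int flags);	/* assumed syscall */

static ssize_t fua_write(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	return pwritev2(fd, &iov, 1, off, RWF_DSYNC);	/* this write only */
}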
* Re: read()/readv() only from page cache
From: Milosz Tanski @ 2014-09-05 16:45 UTC
  To: Christoph Hellwig; +Cc: Mel Gorman, LKML, Volker Lendecke, Tejun Heo, linux-aio

On Fri, Sep 5, 2014 at 12:32 PM, Christoph Hellwig <hch@infradead.org> wrote:
> This is another good suggestion. I've actually heard people asking
> for per-I/O flags for other use cases. The one I can remember is
> applying O_DSYNC only for FUA writes on a SCSI target; the other one
> would be Samba again, as SMB allows per-I/O flags on the wire as
> well.
>
> Although I still think providing useful AIO at the kernel level would
> be better than having everyone reimplement it, it would still be
> useful to allow people to sanely reimplement it -- if only to avoid
> the discussion about which API to use, between the non-standard and
> not really that nice Linux io_submit and the utterly horrible POSIX
> aio_* semantics.

Yeah, I would love for that to happen, but I've been lurking and
following the non-blocking buffered AIO discussions and attempts on
lkml since about 2008, and the threads go back much further than that
-- about 12 years. I would take a much less ambitious read/pread
syscall that gets me 90% of the way there and build the remainder in
user-space. It also has the nice side effect of providing a
not-horrible fallback for older/non-Linux systems, where all IO goes
into the thread pool (without the option to skip it).

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
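A sketch of the user-space fallback scheme described above: try the
hypothetical non-blocking cached read first, and only queue to the
thread pool on a miss, so cached reads never sit behind uncached ones.
On older or non-Linux kernels the fast path simply always fails and
every read goes to the pool. read_fast_path() is the hypothetical
wrapper sketched earlier in the thread; submit_to_pool() is an assumed
thread-pool primitive.

#include <errno.h>
#include <unistd.h>

extern ssize_t read_fast_path(int fd, void *buf, size_t len, off_t off);
extern void submit_to_pool(int fd, void *buf, size_t len, off_t off,
			   void (*cb)(ssize_t, void *), void *arg);

static void buffered_aio_read(int fd, void *buf, size_t len, off_t off,
			      void (*cb)(ssize_t, void *), void *arg)
{
	ssize_t n = read_fast_path(fd, buf, len, off);
	if (n >= 0) {
		cb(n, arg);	/* served from the page cache, no queueing */
		return;
	}
	/* miss (or no kernel support): fall back to the thread pool */
	submit_to_pool(fd, buf, len, off, cb, arg);
}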
* Re: read()/readv() only from page cache
From: Volker Lendecke @ 2014-09-07 20:48 UTC
  To: Milosz Tanski; +Cc: Christoph Hellwig, Mel Gorman, LKML, Tejun Heo, linux-aio

On Fri, Sep 05, 2014 at 12:27:21PM -0400, Milosz Tanski wrote:
> In a VLDB-like workload this would enable me to lower the latency of
> common fast requests. By fast requests I mean ones that do not
> require much data, where the data is cached, or where there's a
> predictable read pattern (read-ahead). Obviously it would be at the
> expense of the latency of large/slow requests (they have to make 2
> read calls, the first one always returning EWOULDBLOCK) ... but in
> that case it doesn't matter, since the time to do the actual IO would
> trump any extra latency.

That was my thinking as well when I discussed it with Christoph. Sorry
for not getting around to actually testing his patch. Samba right now
uses a thread pool for AIO: we get parallel read requests from clients
over a single TCP connection, and we need to keep all disks busy
simultaneously. We'd like to avoid the thread overhead when it's not
necessary because the data is already around. In this pattern a
per-request flag would fit Samba better than a flag settable by fcntl;
Samba can't easily open a file twice (once with O_NONBLOCK and once
without).

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt@sernet.de
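To illustrate why the fcntl route fits this pattern badly, assuming the
O_NONBLOCK hack from earlier in the thread: the flag lives on the open
file description, so toggling it for one request is visible to, and
races with, every other request sharing the same fd.

/* Why per-fd O_NONBLOCK is awkward when one fd serves many requests. */
#include <fcntl.h>
#include <unistd.h>

static ssize_t racy_cached_read(int fd, void *buf, size_t len)
{
	int fl = fcntl(fd, F_GETFL);
	fcntl(fd, F_SETFL, fl | O_NONBLOCK);
	/* another thread's blocking read on fd just became non-blocking */
	ssize_t n = read(fd, buf, len);
	fcntl(fd, F_SETFL, fl);	/* ...and this may clobber its own toggle */
	return n;
}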