* read()/readv() only from page cache
From: Milosz Tanski @ 2014-07-25  2:36 UTC
  To: Mel Gorman; +Cc: LKML

Mel,

I've been following your recent work with the postgres folks to improve
the kernel for postgres-like workloads (which can really help all
database-like loads). After spending some time of my own fighting
similar problems, I figured I'd reach out to see if there's something
that can be done to make my use case easier. I was wondering if there
is a read-family syscall that allows me to read from a file descriptor
only if the data is in the page cache (or to read only the portion of
the data that is in the page cache).

My userspace application (a database-like system) is divided into three
kinds of threads. There are threads for processing data, IO threads
(mostly for reading data), and threads for networking (epoll), but the
last kind is not interesting here. What I would like to be able to do
is issue a read call in the processing thread to get more data ... if
it exists in the page cache. If it doesn't, I would queue that work to
the IO threads. Today as it stands I always have to queue the work to
the IO threads, and I end up paying for the message passing (and
synchronization) even in the case where it's a simple page cache to
userspace buffer memcpy. Add kernel readahead to my example and it's a
pretty big win.

I'm not the only person who laments the lack of this kind of facility.
Other folks have also been frustrated by not being able to tell whether
a read will block:

http://www.1024cores.net/home/scalable-architecture/parallel-disk-io/the-solution

The sad part is that we do have a similar facility for non-file fds:
recvmsg(), where you can pass the MSG_DONTWAIT flag and have it return
immediately if there's no data in the buffer. Sadly it doesn't work for
regular files. I understand that there is a mincore() syscall, but in
this case it's not useful since it requires an extra syscall (and
mapping the file first).

Is there any kind of facility / solution for my problem that I can
leverage in the Linux kernel? Linus is always adamant about working
with the page cache rather than against it, and in this case that's
exactly what I'm trying to do.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
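To make the cost described above concrete, here is a minimal sketch of
the status quo: every read, even one that would be a pure
page-cache-to-userspace memcpy, pays for the hand-off to an IO thread.
All names (read_req, io_thread_serve, and so on) are illustrative
assumptions, not taken from any real codebase.

/*
 * Status-quo sketch: the processing thread always queues reads to the
 * IO thread pool and pays message-passing + synchronization costs,
 * even when the data is already in the page cache.
 */
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

struct read_req {
	int fd;
	void *buf;
	size_t len;
	off_t off;
	ssize_t result;
	pthread_mutex_t lock;
	pthread_cond_t done;
	bool ready;
};

/* Runs on an IO thread: the pread() itself may be a cheap memcpy from
 * the page cache, but the queueing round trip is paid regardless. */
static void io_thread_serve(struct read_req *r)
{
	r->result = pread(r->fd, r->buf, r->len, r->off);
	pthread_mutex_lock(&r->lock);
	r->ready = true;
	pthread_cond_signal(&r->done);
	pthread_mutex_unlock(&r->lock);
}

/* Runs on a processing thread: blocks on the pool even for cached data. */
static ssize_t processing_thread_wait(struct read_req *r)
{
	pthread_mutex_lock(&r->lock);
	while (!r->ready)
		pthread_cond_wait(&r->done, &r->lock);
	pthread_mutex_unlock(&r->lock);
	return r->result;
}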
* Re: read()/readv() only from page cache
From: Mel Gorman @ 2014-09-05 11:09 UTC
  To: Milosz Tanski; +Cc: LKML

On Thu, Jul 24, 2014 at 10:36:33PM -0400, Milosz Tanski wrote:
> After spending some time of my own fighting similar problems, I
> figured I'd reach out to see if there's something that can be done to
> make my use case easier. I was wondering if there is a read-family
> syscall that allows me to read from a file descriptor only if the
> data is in the page cache (or to read only the portion of the data
> that is in the page cache).

I suggest you look at the recent fincore debate. It did not progress
much the last time because the author wanted to push a lot of
functionality in there whereas reviewers felt it should start simple.
The simple case is likely a good fit for what you want. The primary
downside is that it would be race-prone under memory pressure, as a
page could be reclaimed between the fincore check and the read, but I
expect that your application is already avoiding reclaim activity.
Depending on your application, fincore is far cheaper than mincore,
because mincore requires the file to be mapped first, which in a
threaded application will crucify performance if called regularly.

Technically nothing would prevent the implementation of an fcntl
operation that returned failure from read() when the page is not in the
page cache. However, the use case is so specific, and so
Linux-specific, that it would encounter resistance being merged. The
likely feedback would be to implement fincore, or to explain in detail
why fincore is not sufficient -- which would be a tough argument to
win. You'll get beaten with the "interfaces are forever and your use
case is too specific" stick.

The argument that fincore is an extra syscall is not likely to get much
traction, as it'll be pointed out that you are already incurring IPC
and synchronisation overhead. Relative to that, the cost of fincore
should be negligible.

-- 
Mel Gorman
SUSE Labs
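For reference, this is roughly what the residency probe looks like with
what exists today, using mincore() (fincore was never merged at the
time of this thread). It illustrates both costs mentioned above: the
file has to be mapped first, and the answer can go stale before the
read() runs. A sketch, not a recommendation:

/* Probe whether [off, off+len) is resident in the page cache.
 * Returns 1 if all pages are resident, 0 if any is missing, -1 on
 * error.  Even a "1" can be invalidated by reclaim before the read. */
#include <sys/mman.h>
#include <unistd.h>

static int range_in_page_cache(int fd, off_t off, size_t len)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	off_t aligned = off & ~((off_t)pgsz - 1);	/* mmap needs page alignment */
	size_t maplen = len + (size_t)(off - aligned);
	size_t pages = (maplen + pgsz - 1) / pgsz;
	unsigned char vec[pages];			/* VLA for brevity */
	void *p = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, aligned);
	if (p == MAP_FAILED)
		return -1;
	int rc = mincore(p, maplen, vec);
	munmap(p, maplen);
	if (rc < 0)
		return -1;
	for (size_t i = 0; i < pages; i++)
		if (!(vec[i] & 1))
			return 0;
	return 1;
}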
* Re: read()/readv() only from page cache
From: Christoph Hellwig @ 2014-09-05 15:48 UTC
  To: Mel Gorman; +Cc: Milosz Tanski, LKML, Volker Lendecke, Tejun Heo, linux-aio

[-- Attachment #1: Type: text/plain, Size: 1660 bytes --]

On Fri, Sep 05, 2014 at 12:09:27PM +0100, Mel Gorman wrote:
> I suggest you look at the recent fincore debate. It did not progress
> much the last time because the author wanted to push a lot of
> functionality in there whereas reviewers felt it should start simple.
> The simple case is likely a good fit for what you want. The primary
> downside is that it would be race-prone under memory pressure, as a
> page could be reclaimed between the fincore check and the read, but I
> expect that your application is already avoiding reclaim activity.

I've actually experimentally hacked up O_NONBLOCK support for regular
files so that it only returns data from the page cache, and nothing
otherwise. Volker promised to test it with Samba, but we never made any
progress on it, and just last week a customer told me they would have
liked to use it if it was available.

Note that we might want to also avoid blocking on locks, and I have
some vague memory that we shouldn't actually implement O_NONBLOCK on
regular files for compatibility reasons, but would have to use a new
flag instead.

Note that mincore/fincore would not help for the usual use case where
you have a non-blocking event main loop and want to offload actual
blocking I/O to helper threads, as it returns information that can go
stale at any time.

One further consideration would be to finally implement real buffered
I/O in kernel space by something like the above plus offloading to
workqueues in kernelspace. I think our workqueues now are way better
than any possible user thread pool, although we'd need to find a way
to temporarily tie the work threads to a user address space.

[-- Attachment #2: support-O_NONBLOCK-reads --]
[-- Type: text/plain, Size: 2925 bytes --]

Index: xfs/include/linux/mm.h
===================================================================
--- xfs.orig/include/linux/mm.h	2012-09-21 17:50:32.243490371 +0200
+++ xfs/include/linux/mm.h	2012-09-21 17:50:43.333490305 +0200
@@ -1467,6 +1467,12 @@ void page_cache_async_readahead(struct a
 			pgoff_t offset,
 			unsigned long size);
 
+void page_cache_nonblock_readahead(struct address_space *mapping,
+			struct file_ra_state *ra,
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long size);
+
 unsigned long max_sane_readahead(unsigned long nr);
 
 unsigned long ra_submit(struct file_ra_state *ra,
 			struct address_space *mapping,
Index: xfs/mm/filemap.c
===================================================================
--- xfs.orig/mm/filemap.c	2012-09-21 17:50:32.243490371 +0200
+++ xfs/mm/filemap.c	2012-09-21 18:35:53.343474115 +0200
@@ -1107,6 +1107,9 @@ static void do_generic_file_read(struct
 find_page:
 		page = find_get_page(mapping, index);
 		if (!page) {
+			if (filp && filp->f_flags & O_NONBLOCK)
+				goto short_read;
+
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -1218,6 +1221,12 @@ page_not_up_to_date_locked:
 		}
 
 readpage:
+		if (filp && filp->f_flags & O_NONBLOCK) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto short_read;
+		}
+
 		/*
 		 * A previous I/O error may have been due to temporary
 		 * failures, eg. multipath errors.
@@ -1293,6 +1302,10 @@ out:
 	*ppos = ((loff_t)index << PAGE_CACHE_SHIFT) + offset;
 
 	file_accessed(filp);
+	return;
+short_read:
+	page_cache_nonblock_readahead(mapping, ra, filp, index,
+			last_index - index);
 }
 
 int file_read_actor(read_descriptor_t *desc, struct page *page,
@@ -1466,7 +1479,8 @@ generic_file_aio_read(struct kiocb *iocb
 		do_generic_file_read(filp, ppos, &desc, file_read_actor);
 		retval += desc.written;
 		if (desc.error) {
-			retval = retval ?: desc.error;
+			if (!retval)
+				retval = desc.error;
 			break;
 		}
 		if (desc.count > 0)
Index: xfs/mm/readahead.c
===================================================================
--- xfs.orig/mm/readahead.c	2012-09-21 17:50:32.243490371 +0200
+++ xfs/mm/readahead.c	2012-09-21 17:50:43.336823641 +0200
@@ -565,6 +565,22 @@ page_cache_async_readahead(struct addres
 }
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
 
+void
+page_cache_nonblock_readahead(struct address_space *mapping,
+		struct file_ra_state *ra, struct file *filp,
+		pgoff_t offset, unsigned long req_size)
+{
+	/*
+	 * Defer asynchronous read-ahead on IO congestion.
+	 */
+	if (bdi_read_congested(mapping->backing_dev_info))
+		return;
+
+	/* do read-ahead */
+	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_nonblock_readahead);
+
 static ssize_t
 do_readahead(struct address_space *mapping, struct file *filp,
 		pgoff_t index, unsigned long nr)
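As a reading aid for the patch, a usage sketch (not part of the patch;
the return-value conventions here are inferred from the code above, not
defined by it): with the hack applied, a file opened with O_NONBLOCK
yields a short read for any uncached tail while readahead is kicked off
in the background.

/* Hedged usage sketch against the experimental patch above. */
#include <fcntl.h>
#include <unistd.h>

static ssize_t try_cached_read(const char *path, void *buf, size_t len)
{
	int fd = open(path, O_RDONLY | O_NONBLOCK);	/* patched semantics assumed */
	if (fd < 0)
		return -1;
	ssize_t n = read(fd, buf, len);
	/* n < len: the tail was not cached and readahead is now queued;
	 * retry on the IO threads later.  n == 0 is ambiguous with EOF --
	 * one reason a new flag is suggested instead of O_NONBLOCK. */
	close(fd);
	return n;
}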
* Re: read()/readv() only from page cache
From: Jeff Moyer @ 2014-09-05 16:02 UTC
  To: Christoph Hellwig
  Cc: Mel Gorman, Milosz Tanski, LKML, Volker Lendecke, Tejun Heo, linux-aio

Christoph Hellwig <hch@infradead.org> writes:

> On Fri, Sep 05, 2014 at 12:09:27PM +0100, Mel Gorman wrote:
>> I suggest you look at the recent fincore debate. It did not progress
>> much the last time because the author wanted to push a lot of
>> functionality in there whereas reviewers felt it should start
>> simple. The simple case is likely a good fit for what you want. The
>> primary downside is that it would be race-prone under memory
>> pressure, as a page could be reclaimed between the fincore check and
>> the read, but I expect that your application is already avoiding
>> reclaim activity.
>
> I've actually experimentally hacked up O_NONBLOCK support for regular
> files so that it only returns data from the page cache, and nothing
> otherwise. Volker promised to test it with Samba, but we never made
> any progress on it, and just last week a customer told me they would
> have liked to use it if it was available.
>
> Note that we might want to also avoid blocking on locks, and I have
> some vague memory that we shouldn't actually implement O_NONBLOCK on
> regular files for compatibility reasons, but would have to use a new
> flag instead.

FWIW, here's a discussion from an old attempt at O_NONBLOCK for regular
files:

http://www.gossamer-threads.com/lists/linux/kernel/477936?do=post_view_threaded#477936

I recall it blowing up in various situations, so yeah, a new flag would
be a good idea.

> Note that mincore/fincore would not help for the usual use case where
> you have a non-blocking event main loop and want to offload actual
> blocking I/O to helper threads, as it returns information that can go
> stale at any time.
>
> One further consideration would be to finally implement real buffered
> I/O in kernel space by something like the above plus offloading to
> workqueues in kernelspace. I think our workqueues now are way better
> than any possible user thread pool, although we'd need to find a way
> to temporarily tie the work threads to a user address space.

Do you mean real buffered AIO?

Cheers,
Jeff
* Re: read()/readv() only from page cache
From: Christoph Hellwig @ 2014-09-05 16:04 UTC
  To: Jeff Moyer
  Cc: Christoph Hellwig, Mel Gorman, Milosz Tanski, LKML, Volker Lendecke, Tejun Heo, linux-aio

On Fri, Sep 05, 2014 at 12:02:19PM -0400, Jeff Moyer wrote:
> > One further consideration would be to finally implement real
> > buffered I/O in kernel space by something like the above plus
> > offloading to workqueues in kernelspace. I think our workqueues now
> > are way better than any possible user thread pool, although we'd
> > need to find a way to temporarily tie the work threads to a user
> > address space.
>
> Do you mean real buffered AIO?

Yes.
* Re: read()/readv() only from page cache
From: Milosz Tanski @ 2014-09-05 16:27 UTC
  To: Christoph Hellwig; +Cc: Mel Gorman, LKML, Volker Lendecke, Tejun Heo, linux-aio

I would prefer an interface more like recv(), where I can specify per
call whether I want blocking behavior for this read or not. Let me
explain why.

In a VLDB-like workload this would enable me to lower the latency of
common fast requests. By fast requests I mean ones that do not require
much data, where the data is cached, or where there's a predictable
read pattern (read-ahead). Obviously it would be at the expense of the
latency of large/slow requests (they have to make 2 read calls, the
first one always returning EWOULDBLOCK) ... but in that case it doesn't
matter, since the time to do the actual IO would trump any extra
latency.

Essentially, it's using the kernel facilities (the page cache) to help
me perform better (in a more predictable fashion). I would implement
this in our application tomorrow. It's frustrating that there is a
similar interface (the recv* family) that I cannot use.

I know there have been a bunch of attempts at buffered AIO and none of
them made it into the kernel. This would let me build a buffered AIO
implementation in user-space using a threadpool, and cached data would
not end up getting blocked behind other non-cached requests sitting in
the queue. I know there are other sources of blocking (locking,
metadata lookups) but direct AIO already suffers from those, so I'm
fine to paper over that for now.

On Fri, Sep 5, 2014 at 11:48 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Sep 05, 2014 at 12:09:27PM +0100, Mel Gorman wrote:
>> I suggest you look at the recent fincore debate. It did not progress
>> much the last time because the author wanted to push a lot of
>> functionality in there whereas reviewers felt it should start
>> simple. The simple case is likely a good fit for what you want. The
>> primary downside is that it would be race-prone under memory
>> pressure, as a page could be reclaimed between the fincore check and
>> the read, but I expect that your application is already avoiding
>> reclaim activity.
>
> I've actually experimentally hacked up O_NONBLOCK support for regular
> files so that it only returns data from the page cache, and nothing
> otherwise. Volker promised to test it with Samba, but we never made
> any progress on it, and just last week a customer told me they would
> have liked to use it if it was available.
>
> Note that we might want to also avoid blocking on locks, and I have
> some vague memory that we shouldn't actually implement O_NONBLOCK on
> regular files for compatibility reasons, but would have to use a new
> flag instead.
>
> Note that mincore/fincore would not help for the usual use case where
> you have a non-blocking event main loop and want to offload actual
> blocking I/O to helper threads, as it returns information that can go
> stale at any time.
>
> One further consideration would be to finally implement real buffered
> I/O in kernel space by something like the above plus offloading to
> workqueues in kernelspace. I think our workqueues now are way better
> than any possible user thread pool, although we'd need to find a way
> to temporarily tie the work threads to a user address space.
-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
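For concreteness, a sketch of what the per-call flag asked for above
could look like, modeled on recv(2)'s flags argument. The syscall name
preadv2 and the flag RWF_NONBLOCK are assumptions for illustration; no
such interface existed at the time of this thread.

/* Hypothetical per-call non-blocking read.  preadv2()/RWF_NONBLOCK are
 * assumed names, not an existing API here. */
#include <errno.h>
#include <sys/uio.h>

#define RWF_NONBLOCK 0x00000001		/* assumed: serve from page cache only */

extern ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
		       off_t offset, int flags);	/* assumed syscall */

static ssize_t read_fast_path(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t n = preadv2(fd, &iov, 1, off, RWF_NONBLOCK);
	if (n < 0 && errno == EWOULDBLOCK)
		return -1;	/* not cached: caller hands off to IO threads */
	return n;		/* cached (possibly short) read, never blocked */
}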
* Re: read()/readv() only from page cache
From: Christoph Hellwig @ 2014-09-05 16:32 UTC
  To: Milosz Tanski; +Cc: Mel Gorman, LKML, Volker Lendecke, Tejun Heo, linux-aio

On Fri, Sep 05, 2014 at 12:27:21PM -0400, Milosz Tanski wrote:
> I would prefer an interface more like recv(), where I can specify per
> call whether I want blocking behavior for this read or not. Let me
> explain why.
>
> In a VLDB-like workload this would enable me to lower the latency of
> common fast requests. By fast requests I mean ones that do not
> require much data, where the data is cached, or where there's a
> predictable read pattern (read-ahead). Obviously it would be at the
> expense of the latency of large/slow requests (they have to make 2
> read calls, the first one always returning EWOULDBLOCK) ... but in
> that case it doesn't matter, since the time to do the actual IO would
> trump any extra latency.

This is another good suggestion. I've actually heard people asking for
per-I/O flags for other use cases. The one I can remember is applying
O_DSYNC only for FUA writes on a SCSI target; the other one would be
Samba again, as SMB allows per-I/O flags on the wire as well.

> Essentially, it's using the kernel facilities (the page cache) to
> help me perform better (in a more predictable fashion). I would
> implement this in our application tomorrow. It's frustrating that
> there is a similar interface (the recv* family) that I cannot use.
>
> I know there have been a bunch of attempts at buffered AIO and none
> of them made it into the kernel. This would let me build a buffered
> AIO implementation in user-space using a threadpool, and cached data
> would not end up getting blocked behind other non-cached requests
> sitting in the queue. I know there are other sources of blocking
> (locking, metadata lookups) but direct AIO already suffers from
> those, so I'm fine to paper over that for now.

Although I still think providing useful AIO at the kernel level would
be better than having everyone reimplement it, it would still be useful
to allow people to sanely reimplement it -- if only to avoid the
discussion about which API to use, between the non-standard and not
really that nice Linux io_submit and the utterly horrible POSIX aio_*
semantics.
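The write-side use case mentioned above would look analogous; again the
names (pwritev2, RWF_DSYNC) are assumed for illustration only.

/* Sketch: per-I/O O_DSYNC so e.g. a SCSI target can honor FUA on a
 * single write without reopening the file.  Assumed names throughout. */
#include <sys/uio.h>

#define RWF_DSYNC 0x00000002			/* assumed flag */

extern ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
			off_t offset, int flags);	/* assumed syscall */

static ssize_t fua_write(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	return pwritev2(fd, &iov, 1, off, RWF_DSYNC);	/* this write only */
}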
* Re: read()/readv() only from page cache
From: Milosz Tanski @ 2014-09-05 16:45 UTC
  To: Christoph Hellwig; +Cc: Mel Gorman, LKML, Volker Lendecke, Tejun Heo, linux-aio

On Fri, Sep 5, 2014 at 12:32 PM, Christoph Hellwig <hch@infradead.org> wrote:
> This is another good suggestion. I've actually heard people asking
> for per-I/O flags for other use cases. The one I can remember is
> applying O_DSYNC only for FUA writes on a SCSI target; the other one
> would be Samba again, as SMB allows per-I/O flags on the wire as
> well.
>
> Although I still think providing useful AIO at the kernel level would
> be better than having everyone reimplement it, it would still be
> useful to allow people to sanely reimplement it -- if only to avoid
> the discussion about which API to use, between the non-standard and
> not really that nice Linux io_submit and the utterly horrible POSIX
> aio_* semantics.

Yeah, I would love for that to happen, but I've been lurking and
following the non-blocking buffered AIO discussions and attempts on
lkml since about 2008, and the threads go back much further than that
-- about 12 years. I would take a much less ambitious read/pread
syscall that gets me 90% of the way there and build the remainder in
user-space. It also has the nice side effect of providing a
not-horrible fallback for older/non-Linux systems, where all IO goes
into the thread pool (without the option to skip it).

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
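A sketch of the user-space fallback scheme described above: try the
hypothetical non-blocking cached read first, and only queue to the
thread pool on a miss, so cached reads never sit behind uncached ones.
On older or non-Linux kernels the fast path simply always fails and
every read goes to the pool. read_fast_path() is the hypothetical
wrapper sketched earlier in the thread; submit_to_pool() is an assumed
thread-pool primitive.

#include <errno.h>
#include <unistd.h>

extern ssize_t read_fast_path(int fd, void *buf, size_t len, off_t off);
extern void submit_to_pool(int fd, void *buf, size_t len, off_t off,
			   void (*cb)(ssize_t, void *), void *arg);

static void buffered_aio_read(int fd, void *buf, size_t len, off_t off,
			      void (*cb)(ssize_t, void *), void *arg)
{
	ssize_t n = read_fast_path(fd, buf, len, off);
	if (n >= 0) {
		cb(n, arg);	/* served from the page cache, no queueing */
		return;
	}
	/* miss (or no kernel support): fall back to the thread pool */
	submit_to_pool(fd, buf, len, off, cb, arg);
}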
* Re: read()/readv() only from page cache
From: Volker Lendecke @ 2014-09-07 20:48 UTC
  To: Milosz Tanski; +Cc: Christoph Hellwig, Mel Gorman, LKML, Tejun Heo, linux-aio

On Fri, Sep 05, 2014 at 12:27:21PM -0400, Milosz Tanski wrote:
> In a VLDB-like workload this would enable me to lower the latency of
> common fast requests. By fast requests I mean ones that do not
> require much data, where the data is cached, or where there's a
> predictable read pattern (read-ahead). Obviously it would be at the
> expense of the latency of large/slow requests (they have to make 2
> read calls, the first one always returning EWOULDBLOCK) ... but in
> that case it doesn't matter, since the time to do the actual IO would
> trump any extra latency.

That was my thinking as well when I discussed it with Christoph. Sorry
for not getting around to actually testing his patch. Samba right now
uses a thread pool for AIO: we get parallel read requests from clients
over a single TCP connection, and we need to keep all disks busy
simultaneously. We'd like to avoid the thread overhead when it's not
necessary because the data is already around. In this pattern a
per-request flag would fit Samba better than a flag settable by fcntl;
Samba can't easily open a file twice (once with O_NONBLOCK and once
without).

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt@sernet.de
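To illustrate why the fcntl route fits this pattern badly, assuming the
O_NONBLOCK hack from earlier in the thread: the flag lives on the open
file description, so toggling it for one request is visible to, and
races with, every other request sharing the same fd.

/* Why per-fd O_NONBLOCK is awkward when one fd serves many requests. */
#include <fcntl.h>
#include <unistd.h>

static ssize_t racy_cached_read(int fd, void *buf, size_t len)
{
	int fl = fcntl(fd, F_GETFL);
	fcntl(fd, F_SETFL, fl | O_NONBLOCK);
	/* another thread's blocking read on fd just became non-blocking */
	ssize_t n = read(fd, buf, len);
	fcntl(fd, F_SETFL, fl);	/* ...and this may clobber its own toggle */
	return n;
}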