linux-fsdevel.vger.kernel.org archive mirror
From: Jamie Lokier <jamie@shareable.org>
To: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Miklos Szeredi <miklos@szeredi.hu>,
	jens.axboe@oracle.com, akpm@linux-foundation.org,
	nickpiggin@yahoo.com.au, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch v3] splice: fix race with page invalidation
Date: Thu, 31 Jul 2008 13:33:50 +0100
Message-ID: <20080731123350.GB16481@shareable.org>
In-Reply-To: <20080731102612.GA29766@2ka.mipt.ru>

Evgeniy Polyakov wrote:
> On Thu, Jul 31, 2008 at 07:12:01AM +0100, Jamie Lokier (jamie@shareable.org) wrote:
> > The obvious mechanism for completion notifications is the AIO event
> > interface.  I.e. aio_sendfile that reports completion when it's safe
> > to modify data it was using.  aio_splice would be logical for similar
> > reasons.  Note it doesn't mean when the data has reached a particular
> > place, it means when the pages it's holding are released.  Pity AIO
> > still sucks ;-)
> 
> It is not that simple: page can be held in hardware or tcp queues for
> a long time, and the only possible way to know, that system finished
> with it, is receiving ack from the remote side. There is a project to
> implement such a callback at skb destruction time (it is freed after ack
> from the other peer), but do we really need it? System which does care
> about transmit will implement own ack mechanism, so page can be unlocked
> at higher layer. Actually page can be locked during transfer and
> unlocked after rpc reply received, so underlying page invalidation will
> be postponed and will not affect sendfile/splice.

This is why marking the pages COW would be better.  Automatic!
There's no need for a notification, merely letting go of the page
references - yes, the hardware / TCP acks already do that, no locking
or anything!  :-)  The last reference is nothing special, it just means
the next file write/truncate sees the count is 1 and doesn't need to
COW the page.
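
A toy model of that rule, as a sketch only: `Page`, `ToyFile`, `sendfile_grab` and `release` are hypothetical names made up for illustration, not kernel structures.  It shows a write copying the page only while someone else still holds a reference, and writing in place once the count is back to 1.

```python
class Page:
    """Toy stand-in for a page-cache page; not kernel code."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 1  # the file's own reference


class ToyFile:
    """Hypothetical single-page file illustrating COW-on-write."""

    def __init__(self, data: bytes):
        self.page = Page(data)

    def sendfile_grab(self) -> Page:
        # A sender (think: a queued network buffer) takes a reference.
        self.page.refs += 1
        return self.page

    def release(self, page: Page) -> None:
        # The hardware / TCP ack path lets go of the page.
        page.refs -= 1

    def write(self, offset: int, buf: bytes) -> bool:
        """Write into the file; return True if the page had to be copied."""
        copied = False
        if self.page.refs > 1:
            # Still referenced elsewhere: the file moves to a fresh copy
            # and the old page stays untouched for whoever holds it.
            old = self.page
            self.page = Page(bytes(old.data))
            old.refs -= 1
            copied = True
        self.page.data[offset:offset + len(buf)] = buf
        return copied
```

With a reference outstanding, `write()` returns True and the held page keeps its old contents; after `release()`, the next `write()` returns False and modifies the page in place.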


Two reasons for being mildly curious about sendfile page releases in an
application though:

   - Sendfile on tmpfs files: zero copy sending of calculated data.
     Only trouble is when can you reuse the pages?  Current solution
     is to use a set of files, consume the pages in sequential order,
     delete files at some point, let the kernel hold the pages.  Works for
     sequentially generated and transmitted data, but not good for
     userspace caches where different parts expire separately.  Also,
     may pin a lot of page cache; not sure if that's accounted.

   - Sendfile on real large data contained in a userspace
     database-cum-filesystem (the future!).  App wants to send big
     blobs, and with COW it can forget about them, but for performance
     it would rather allocate new writes in the file to areas that are
     not sendfile-hot.  It can approximate with heuristics though.
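
The first case can be sketched roughly as below, using Python's `os.sendfile` wrapper around the syscall.  `send_generated` and its `tmp_dir` parameter are hypothetical; pointing `tmp_dir` at a tmpfs mount (for example /dev/shm on many systems, which is an assumption about the machine) gives the tmpfs variant.  The scratch file is written once, sent, unlinked and never reused, so the question of when its pages become safe to modify never comes up.

```python
import os
import socket
import tempfile


def send_generated(payload: bytes, sock: socket.socket, tmp_dir=None) -> int:
    """Write payload to a scratch file, sendfile it to sock, unlink the file.

    Returns the number of bytes handed to the socket.
    """
    fd, path = tempfile.mkstemp(dir=tmp_dir)
    try:
        # Fill the scratch file with the calculated data.
        written = 0
        while written < len(payload):
            written += os.write(fd, payload[written:])

        # Hand the file's pages to the socket without copying through
        # a userspace buffer.
        sent = 0
        while sent < len(payload):
            n = os.sendfile(sock.fileno(), fd, sent, len(payload) - sent)
            if n == 0:
                break
            sent += n
        return sent
    finally:
        # Drop our handle and the name; the pages are never rewritten.
        os.close(fd)
        os.unlink(path)
```

This only covers the sequential generate-and-send pattern; it does nothing for a cache whose parts expire separately.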

> > Btw, Windows had this since forever, it's called overlapped
> > TransmitFile with an I/O completion event.  Don't know if it's any
> > good though ;-)
> 
> There was a linux aio_sendfile() too. Google still knows about its
> numbers, graphs and so on... :)

I vaguely remember its performance didn't seem that good.

One of the problems is you don't really want AIO all the time, just
when a process would block because the data isn't in cache.  You
really don't want to be sending *all* ops to worker threads, even
kernel threads.  And you preferably don't want the AIO interface
overhead for ops satisfied from cache.

Syslets got some of the way there, and maybe that's why they were
faster than AIO for some things.  There are user-space hacks which are
a bit like syslets.  (Bind two processes to the same CPU, process 1
wakes process 2 just before 1 does a syscall, and puts 2 back to sleep
if 2 didn't wake and do an atomic op to prove it's awake).  I haven't
tested their performance, it could suck.

Look up LAIO, Lazy Asynchronous I/O.  Apparently FreeBSD, NetBSD,
Solaris, Tru64, and Windows can all call a synchronous I/O op and, if
it's satisfied from cache, simply return the result.  If not, they
either queue it and return an AIO event later (Windows style, with
some cleverer thread balancing too), or wake another thread to handle
it (FreeBSD style).  I believe Linus suggested something like the
latter approach some time ago.
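
A userspace toy of the lazy idea, as a sketch under stated assumptions: `LazyReader` and `slow_read` are hypothetical, a dict stands in for the page cache, and a thread pool stands in for whatever the kernel would do on a miss.  A hit returns the data synchronously with no async machinery involved; only a miss is handed to another thread.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyReader:
    """Toy LAIO-style reader: synchronous on a hit, asynchronous on a miss."""

    def __init__(self, slow_read, workers: int = 2):
        self.slow_read = slow_read  # hypothetical blocking fetch(key) -> bytes
        self.cache = {}             # stand-in for the page cache
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def read(self, key):
        """Return (True, data) on a cache hit, else (False, future)."""
        if key in self.cache:
            # Fast path: no worker thread, no completion event.
            return True, self.cache[key]
        # Slow path: let another thread block on the real read.
        return False, self.pool.submit(self._fill, key)

    def _fill(self, key):
        data = self.slow_read(key)
        self.cache[key] = data
        return data
```

The caller checks the first element of the result: on a hit it already has the data, on a miss it gets a future to wait on or poll later.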

-- Jamie

