public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <jens.axboe@oracle.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Eric Dumazet <dada1@cosmosbay.com>,
	linux-kernel@vger.kernel.org, cotte@de.ibm.com, hugh@veritas.com,
	neilb@suse.de, zanussi@us.ibm.com, hch@infradead.org
Subject: Re: [PATCH] sendfile removal
Date: Sat, 2 Jun 2007 17:01:04 +0200	[thread overview]
Message-ID: <20070602150104.GH32105@kernel.dk> (raw)
In-Reply-To: <alpine.LFD.0.98.0706010904110.3957@woody.linux-foundation.org>

On Fri, Jun 01 2007, Linus Torvalds wrote:
> 
> 
> On Fri, 1 Jun 2007, H. Peter Anvin wrote:
> > 
> > Fair enough.  Unix has traditionally not acknowledged the possibility of
> > nonblocking I/O on conventional files, for some odd reason.
> 
> It's not odd at all.
> 
> If you return EAGAIN, you had better have a way to _wait_ for that EAGAIN 
> to go away, otherwise the EAGAIN is just a total waste of time.
> 
> So the rule about EAGAIN is very simple:
>  (a) the file descriptor must be O_NONBLOCK
>  (b) the access must otherwise block
> AND
>  (c) the condition must be something we can wait for with poll/select
> 
> I don't know why people continually ignore that (c) point, even though 
> it's obvious and very very important!
> 
> If you cannot wait for it, tell me why the kernel should _ever_ return 
> EAGAIN? The only option for the user is to just do the operation again 
> immediately.
> 
> And the thing is, neither poll nor select work on regular files. And no, 
> that is _not_ just an implementation issue. It's very fundamental: neither 
> poll nor select get the file offset to wait for!
> 
> And that file offset is _critical_ for a regular file, in a way it 
> obviously is _not_ for a socket, pipe, or other special file. Because 
> without knowing the file offset, you cannot know which page you should be 
> waiting for!
> 
> And no, the file offset is not "f_pos". sendfile(), along with 
> pread/pwrite, uses a totally separate file offset, so if select/poll were 
> to base their decision on f_pos, they'd be _wrong_.
> 
> This really is very fundamental. 
> 
> Now, you can argue that you can always just return -EAGAIN anyway, but 
> then the calling process will basically be busy-looping, calling 
> sendfile() (or splice()) over and over again. That's _horrible_. It's much 
> better to just not return EAGAIN, and sleep like a good process should!
> 
> So there's a few things to take away from this:
> 
>  - regular file access MUST NOT return EAGAIN just because a page isn't 
>    in the cache. Doing so is simply a bug. No ifs, buts or maybe's about 
>    it!
> 
>    Busy-looping is NOT ACCEPTABLE!
> 
>  - you *could* make some alternative conventions:
> 
> 	(a) you could make O_NONBLOCK mean that you'll at least 
> 	    guarantee that you *start* the IO, and while you never return 
> 	    EAGAIN, you migth validly return a _partial_ result!
> 
> 	(b) variation on (a): it's ok to return EAGAIN if _you_ were the 
> 	    one who started the IO during this particular time aroudn the 
> 	    loop. But if you find a page that isn't up-to-date yet, and 
> 	    you didn't start the IO, you *must* wait for it, so that you 
> 	    end up returning EAGAIN atmost once! Exactly because 
> 	    busy-looping is simply not acceptable behaviour!
> 
> I have to admit that I didn't look at what raw splice() itself does these 
> days. I would not be surprised if Jens also didn't realize this very 
> fundamental issue. It seems too easy to miss, because people think 
> that EAGAIN stands on its own, and don't realize that EAGAIN must be 
> paired with select/poll to make sense.

splice() WILL return EAGAIN, but in that case it should have triggered
the read-ahead and thus started some IO. At least that was/is the
intention, and it worked like that originally. Since I'm deep on the
splice updating for sendfile etc right now anyway, I'll doubly verify
that it still works that way as expected!

For the from-file case, see __generic_file_splice_read(). splice does:

       if (!PageUptodate(page)) {
               /*
                * If in nonblock mode then dont block on
                * waiting
                * for an in-flight io page
                */
               if (flags & SPLICE_F_NONBLOCK) {
                       if (TestSetPageLocked(page))
                               break;
               } else
                       lock_page(page);

but it doesn't know whether this call into __generic_file_splice_read()
initiated IO to that page or if someone else did. Naturally this will
most often be the case that __generic_file_splice_read() was the one
that initiated IO to that page, but it could be someone else of course.

The code above doesn't actually return EAGAIN, but
generic_file_splice_read() turns a 0 return for SPLICE_F_NONBLOCK into
EAGAIN.

So we need a bit of tweaking. I don't see how we can make b) work well
enough without getting hackish and breaking into the page cache and
read-ahead code, so I'd be inclined to suggest we make it conform to
your a) case. Now I didn't test yet, but it seems that just doing:

      if ((flags & SPLICE_F_NONBLOCK) && spd.nr_pages) {
              if (TestSetPageLocked(page))
                      break;
      } else
              lock_page(page);

should do that - always block for the first page and potentially return
a partial results for the remaining pages that read-ahead kicked into
gear.

-- 
Jens Axboe


  parent reply	other threads:[~2007-06-02 15:02 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-31 10:33 [PATCH] sendfile removal Jens Axboe
2007-05-31 10:47 ` Jens Axboe
2007-05-31 10:47 ` Eric Dumazet
2007-05-31 10:53   ` Jens Axboe
2007-06-01  4:09     ` H. Peter Anvin
2007-06-01  5:41       ` Jens Axboe
2007-06-01  5:50         ` H. Peter Anvin
2007-06-01  7:22           ` Eric Dumazet
2007-06-01 15:52             ` H. Peter Anvin
2007-06-01 16:18               ` Linus Torvalds
2007-06-01 16:47                 ` Eric Dumazet
2007-06-01 16:53                 ` H. Peter Anvin
2007-06-02 15:02                   ` Jens Axboe
2007-06-02 15:01                 ` Jens Axboe [this message]
2007-06-02 15:40                   ` Linus Torvalds
2007-06-02 16:35                     ` Jens Axboe
     [not found]                     ` <20070603130507.GA11170@mail.ustc.edu.cn>
2007-06-03 13:05                       ` Fengguang Wu
     [not found]                       ` <20070603142931.GA5916@mail.ustc.edu.cn>
2007-06-03 14:29                         ` Fengguang Wu
     [not found]                         ` <20070604004647.GA8076@mail.ustc.edu.cn>
2007-06-04  0:46                           ` Fengguang Wu
2007-06-04  8:05                           ` Jens Axboe
     [not found]                             ` <20070604112214.GA7457@mail.ustc.edu.cn>
2007-06-04 11:22                               ` Fengguang Wu
2007-06-01 16:22               ` Pádraig Brady
2007-05-31 10:55 ` Christoph Hellwig
2007-05-31 11:05   ` Jens Axboe
2007-05-31 12:26     ` Neil Brown
2007-05-31 12:27       ` Jens Axboe
2007-06-01  2:44         ` [PATCH] sendfile removal (nfsd update) Neil Brown
2007-06-01  5:44           ` Jens Axboe
2007-06-01  8:01             ` Jens Axboe
2007-06-01  8:15     ` [PATCH] sendfile removal Jens Axboe
2007-05-31 11:04 ` Carsten Otte
2007-05-31 11:06   ` Jens Axboe
2007-05-31 15:33 ` Tom Zanussi
2007-05-31 19:01   ` Jens Axboe
2007-05-31 17:06 ` Hugh Dickins
2007-05-31 17:31   ` Christoph Hellwig
2007-05-31 19:03   ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070602150104.GH32105@kernel.dk \
    --to=jens.axboe@oracle.com \
    --cc=cotte@de.ibm.com \
    --cc=dada1@cosmosbay.com \
    --cc=hch@infradead.org \
    --cc=hpa@zytor.com \
    --cc=hugh@veritas.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=torvalds@linux-foundation.org \
    --cc=zanussi@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox