From: Jens Axboe <axboe@suse.de>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Diego Calleja <diegocg@gmail.com>, linux-kernel@vger.kernel.org
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 21:19:55 +0200
Message-ID: <20060420191954.GG4717@suse.de>
In-Reply-To: <Pine.LNX.4.64.0604200818490.3701@g5.osdl.org>
On Thu, Apr 20 2006, Linus Torvalds wrote:
>
>
> On Thu, 20 Apr 2006, Jens Axboe wrote:
> >
> > > - an ioctl/fcntl to set the maximum size of the buffer. Right now it's
> > > hardcoded to 16 "buffer entries" (which in turn are normally limited to
> > > one page each, although there's nothing that _requires_ that a buffer
> > > entry always be a page).
> >
> > This is on a TODO, but not very high up since I've yet to see a case
> > where the current 16 page limitation is an issue. I'm sure something
> > will come up eventually, but until then I'd rather not bother.
>
> The real reason for limiting the number of buffer entries is not to make
> the number _larger_ (although that can be a performance optimization), but
> to make it _smaller_ or at least knowing/limiting how big it is.
>
> It doesn't matter with the current interfaces which are mostly agnostic as
> to how big the buffer is, but it _does_ matter with vmsplice().
>
> Why?
>
> Simple: for a vmsplice() user, it's very important to know when they can
> start re-using the buffer(s) that they used vmsplice() on previously. And
> while the user could just ask the kernel how many bytes are left in the
> pipe buffer, that's pretty inefficient for many normal streaming cases.
>
> The _efficient_ way is to make the user-space buffer that you use for
> splicing information to another entity a circular buffer that is at least
> as large as any of the splice pipes involved in the transfer (depending on
> use. In many cases, you will probably want to make the user-space buffer
> _twice_ as big as the kernel buffer, which makes the tracking even easier:
> while half of the buffer is busy, you can write to the half that is
> guaranteed to not be in the kernel buffer, so you effectively do "double
> buffering")
>
> So if you do that, then you can continue to write to the buffer without
> ever worrying about re-use, because you know that by the time you wrap
> around, the kernel buffer will have been flushed out, or the vmsplice()
> would have blocked, waiting for the receiver. So now you no longer need to
> worry about "how much has flushed" - you only need to worry about doing
> the vmsplice() call at least twice per buffer traversal (assuming the
> "user buffer is double the size of the kernel buffer" approach).
>
> So you could do a very efficient "stdio-like" implementation for logging,
> for example, since this allows you to re-use the same pages over and over
> for splicing, without ever having any copying overhead, and without ever
> having to play VM-related games (ie you don't need to do unmaps or
> mprotects or anything expensive like that in order to get a new page or
> something).
>
> But in order to do that, you really do need to know (and preferably set)
> the size of the splice buffer. Otherwise, if the in-kernel splice buffer
> is larger than the circular buffer you use in user space, the kernel will
> add the same page _twice_ to the buffer, and you'll overwrite the data
> that you already spliced.
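
The double-buffering scheme described above could be sketched roughly as
follows. This is only an illustration under stated assumptions: struct
dbl_buf and the dbl_buf_*() helpers are hypothetical names, and the pipe
size is queried with F_GETPIPE_SZ, the fcntl that mainline only gained
later (Linux 2.6.35), so nothing like it existed when this thread was
written. The sketch also assumes the reader consumes the data with read();
if the reader splices the pages onward, they may stay referenced longer
than this reasoning allows. Each half is always submitted whole, because
the wrap-around argument depends on every page of the user buffer taking
up a pipe entry in turn.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper type: a user buffer made of two halves, each as
 * large as the pipe, so one half can be filled while the other may
 * still be referenced by the kernel pipe buffer. */
struct dbl_buf {
    char  *base;  /* page-aligned, 2 * half bytes */
    size_t half;  /* bytes per half == pipe capacity */
    int    cur;   /* index (0 or 1) of the half being filled */
};

/* Size the buffer from the pipe's actual capacity; returns 0 on success. */
static int dbl_buf_init(struct dbl_buf *b, int pipe_wr)
{
    long cap = fcntl(pipe_wr, F_GETPIPE_SZ);
    long page = sysconf(_SC_PAGESIZE);
    void *mem = NULL;

    if (cap <= 0 || page <= 0)
        return -1;
    if (posix_memalign(&mem, (size_t)page, 2 * (size_t)cap) != 0)
        return -1;
    b->base = mem;
    b->half = (size_t)cap;
    b->cur = 0;
    return 0;
}

/* The half that is currently safe to write into. */
static char *dbl_buf_fill(struct dbl_buf *b)
{
    return b->base + (size_t)b->cur * b->half;
}

/* Splice the whole current half into the pipe (blocking), then switch
 * halves. Once this returns, every page of this half has been queued,
 * so the pipe can no longer hold any page of the other half. */
static int dbl_buf_submit(struct dbl_buf *b, int pipe_wr)
{
    struct iovec iov = { .iov_base = dbl_buf_fill(b), .iov_len = b->half };

    assert(b->base != NULL);
    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(pipe_wr, &iov, 1, 0);
        if (n < 0)
            return -1;               /* errno is set by vmsplice() */
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
    }
    b->cur ^= 1;
    return 0;
}
```

A producer would write into dbl_buf_fill(), call dbl_buf_submit() when the
half is full, and carry on in the other half without ever asking the
kernel how much has been flushed.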
Good point; as you can tell, I had other uses in mind for this. I'd
prefer using fcntl for this instead of an ioctl - how about a set of
matching F_SETPIPESZ/F_GETPIPESZ or something along those lines? Right now
we can just -EINVAL stub the pipe size setting, but really implement the
pipe size getting.
Other suggestions?
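
For illustration, a get/set pair of this kind did eventually land in
mainline under the names F_GETPIPE_SZ and F_SETPIPE_SZ (Linux 2.6.35),
not the spellings floated in this message. A minimal sketch of using
them, where pipe_capacity() and pipe_set_capacity() are hypothetical
wrapper names:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Capacity in bytes of the pipe behind fd, or -1 with errno set. */
static long pipe_capacity(int fd)
{
    return fcntl(fd, F_GETPIPE_SZ);
}

/* Ask for a capacity of at least `bytes`. The kernel may round the
 * request up; the capacity actually set is returned, or -1 on error. */
static long pipe_set_capacity(int fd, int bytes)
{
    assert(bytes > 0);
    return fcntl(fd, F_SETPIPE_SZ, bytes);
}
```

A vmsplice() user would call pipe_capacity() once at startup and size its
circular buffer from the result, rather than assuming 16 pages.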

The vmsplice addition itself is pretty slim:

 splice.c | 163 +++++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 132 insertions(+), 31 deletions(-)

> (Now, you still need to be very careful with vmsplice() in general, since
> it leaves the data page writable in the source VM and thus allows for all
> kinds of confusion, but the theory here is "give them rope". Rope enough
> to do clever things always ends up being rope enough to hang yourself too.
> Tough.).
Oh definitely :-)
--
Jens Axboe