From: Coywolf Qi Hunt <coywolf@gmail.com>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>,
William Lee Irwin III <wli@holomorphy.com>,
linux-kernel@vger.kernel.org
Subject: Re: Make pipe data structure be a circular list of pages, rather than
Date: Thu, 18 Aug 2005 14:07:02 +0800
Message-ID: <2cd57c90050817230735895530@mail.gmail.com>
In-Reply-To: <Pine.LNX.4.58.0501070735000.2272@ppc970.osdl.org>
On 1/8/05, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Fri, 7 Jan 2005, Oleg Nesterov wrote:
> >
> > If I understand this patch correctly, then this code
> >
> > 	for (;;)
> > 		write(pipe_fd, &byte, 1);
> >
> > will block after writing PIPE_BUFFERS == 16 characters, no?
> > And pipe_inode_info will use 64K to hold 16 bytes!
>
> Yes.
>
> > Is it ok?
>
> If you want throughput, don't do single-byte writes. Obviously we _could_
> do coalescing, but there's a reason I'd prefer to avoid it. So I consider
> it a "don't do that then", and I'll wait to see if people do. I can't
> think of anything that cares about performance that does that anyway:
> because system calls are reasonably expensive regardless, anybody who
> cares at all about performance will have buffered things up in user space.
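To make the "buffered things up in user space" point concrete, here is a minimal sketch of a writer that collects single bytes and issues one write(2) per flush. The names struct bufwriter, bw_putc and bw_flush are hypothetical, and the 4096-byte buffer assumes a 4K page; nothing here is from the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: collects bytes and hands them to write(2) in bulk. */
struct bufwriter {
	int fd;			/* destination, e.g. the write end of a pipe */
	size_t used;		/* bytes currently held in buf */
	unsigned writes;	/* successful write(2) calls issued so far */
	char buf[4096];		/* assumed page size, not taken from the patch */
};

/* Push everything buffered so far to the fd. Returns 0, or -1 on error. */
static int bw_flush(struct bufwriter *w)
{
	size_t off = 0;

	while (off < w->used) {
		ssize_t n = write(w->fd, w->buf + off, w->used - off);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			/* Keep the unwritten tail so the caller can retry. */
			memmove(w->buf, w->buf + off, w->used - off);
			w->used -= off;
			return -1;
		}
		w->writes++;
		off += (size_t)n;
	}
	w->used = 0;
	return 0;
}

/* Queue one byte; only touches the kernel when the buffer is full. */
static int bw_putc(struct bufwriter *w, char c)
{
	assert(w->used <= sizeof w->buf);
	if (w->used == sizeof w->buf && bw_flush(w) < 0)
		return -1;
	w->buf[w->used++] = c;
	return 0;
}
```

With this, a loop of bw_putc() calls followed by one bw_flush() occupies a single pipe buffer per flush instead of one page per byte.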
>
> > Maybe it makes sense to add data to the last allocated page
> > until buf->len > PAGE_SIZE ?
>
> The reason I don't want to coalesce is that I don't ever want to modify a
> page that is on a pipe buffer (well, at least not through the pipe buffer
> - it might get modified some other way). Why? Because the long-term plan
> for pipe-buffers is to allow the data to come from _other_ sources than
> just a user space copy. For example, it might be a page directly from the
> page cache, or a partial page that contains the data part of an skb that
> just came in off the network.
>
> With this organization, a pipe ends up being able to act as a "conduit"
> for pretty much any data, including some high-bandwidth things like video
> streams, where you really _really_ don't want to copy the data. So the
> next stage is:
>
>  - allow the buffer size to be set dynamically per-pipe (probably only
>    increased by root, due to obvious issues, although a per-user limit is
>    not out of the question - it's just a "mlock" in kernel buffer space,
>    after all)
>  - add per-"struct pipe_buffer" ops pointer to a structure with
>    operation function pointers: "release()", "wait_for_ready()", "poll()"
>    (and possibly "merge()", if we want to coalesce things, although I
>    really hope we won't need to)
>  - add a "splice(fd, fd)" system call that copies pages (by incrementing
>    their reference count, not by copying the data!) from an input source
>    to the pipe, or from a pipe to an output.
>  - add a "tee(in, out1, out2)" system call that duplicates the pages
>    (again, incrementing their reference count, not copying the data) from
>    one pipe to two other pipes.
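The "incrementing their reference count, not copying the data" idea can be modelled in user space. This is only a toy sketch: every model_* name is hypothetical, the slot count mirrors the PIPE_BUFFERS == 16 mentioned earlier in the thread, and real kernel pages, locking and blocking are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PIPE_BUFFERS 16	/* mirrors PIPE_BUFFERS == 16 from the thread */

/* Stand-in for a kernel page: only the reference count matters here. */
struct model_page {
	int refcount;
};

/* One slot: a (possibly partial) page plus the byte range in use. */
struct model_buf {
	struct model_page *page;
	unsigned offset, len;
};

/* Circular list of slots. */
struct model_pipe {
	struct model_buf bufs[MODEL_PIPE_BUFFERS];
	unsigned head;	/* index of the oldest filled slot */
	unsigned count;	/* number of filled slots */
};

/* Append a reference to `page`; no data is copied. Returns 0, or -1 if full. */
static int model_push(struct model_pipe *p, struct model_page *page,
		      unsigned offset, unsigned len)
{
	struct model_buf *b;

	if (p->count == MODEL_PIPE_BUFFERS)
		return -1;	/* a real writer would block here */
	b = &p->bufs[(p->head + p->count) % MODEL_PIPE_BUFFERS];
	page->refcount++;
	b->page = page;
	b->offset = offset;
	b->len = len;
	p->count++;
	return 0;
}

/* Drop the oldest slot and its page reference. Returns 0, or -1 if empty. */
static int model_pop(struct model_pipe *p)
{
	struct model_buf *b;

	if (p->count == 0)
		return -1;
	b = &p->bufs[p->head];
	assert(b->page->refcount > 0);
	b->page->refcount--;
	b->page = NULL;
	p->head = (p->head + 1) % MODEL_PIPE_BUFFERS;
	p->count--;
	return 0;
}

/* Model of the proposed tee(): share every slot of `in` with both outputs. */
static int model_tee(const struct model_pipe *in,
		     struct model_pipe *out1, struct model_pipe *out2)
{
	unsigned i;

	if (in->count > MODEL_PIPE_BUFFERS - out1->count ||
	    in->count > MODEL_PIPE_BUFFERS - out2->count)
		return -1;	/* not enough room in one of the outputs */
	for (i = 0; i < in->count; i++) {
		const struct model_buf *b =
			&in->bufs[(in->head + i) % MODEL_PIPE_BUFFERS];
		model_push(out1, b->page, b->offset, b->len);
		model_push(out2, b->page, b->offset, b->len);
	}
	return 0;
}
```

In this model a page is only "freed" when its count drops back to zero, which is also why coalescing writes into a shared page would be unsafe: another pipe may still be looking at it.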
>
> All of the above is basically a few lines of code (the "splice()" thing
> requires some help from drivers/networking/pagecache etc, but it's not
> complex help, and not everybody needs to do it - I'll start off with
> _just_ a generic page cache helper to get the thing rolling, that's easy).
>
> Now, imagine using the above in a media server, for example. Let's say
> that a year or two has passed, so that the video drivers have been updated
> to be able to do the splice thing, and what can you do? You can:
>
>  - splice from the (mpeg or whatever - let's just assume that the video
>    input is either digital or does the encoding on its own - like they
>    pretty much all do) video input into a pipe (remember: no copies - the
>    video input will just DMA directly into memory, and splice will just
>    set up the pages in the pipe buffer)
>  - tee that pipe to split it up
>  - splice one end to a file (ie "save the compressed stream to disk")
>  - splice the other end to a real-time video decoder window for your
>    real-time viewing pleasure.
>
> That's the plan, at least. I think it makes sense, and the thing that
> convinced me about it was (a) how simple all of this seems to be
> implementation-wise (modulo details - but there are no "conceptually
> complex" parts: no horrid asynchronous interfaces, no questions about
> how to buffer things, no direct user access to pages that might
> partially contain protected data etc etc) and (b) it's so UNIXy. If
> there's something that says "the UNIX way", it's pipes between entities
> that act on the data.
>
> For example, let's say that you wanted to serve a file from disk (or any
> other source) with a header to another program (or to a TCP connection, or
> to whatever - it's just a file descriptor). You'd do
>
> 	fd = create_pipe_to_destination();
>
> 	input = open("filename", O_RDONLY);
> 	write(fd, "header goes here", length_of_header);
> 	for (;;) {
> 		ssize_t err;
> 		err = splice(input, fd,
> 			~0 /* maxlen */,
> 			0 /* msg flags - think "sendmsg" */);
> 		if (err > 0)
> 			continue;
> 		if (!err) /* EOF */
> 			break;
> 		.. handle input errors here ..
> 	}
>
> (obviously, if this is a real server, this would likely all be in a
> select/epoll loop, but that just gets too hard to describe concisely, so
> I'm showing the single-threaded simple version).
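For comparison, the same header-then-body loop can be sketched against the splice(2) interface that Linux later gained, which takes optional offsets for both descriptors and requires at least one of them to be a pipe. The helper name send_with_header and the 64K chunk size are my own assumptions; the call blocks if the pipe is full and nobody is draining it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* needed for splice() in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `header` into the pipe `pipe_wr`, then move
 * everything readable from `in_fd` (from its current position) into the
 * same pipe without copying through user space.
 * Returns 0 on success, -1 on error with errno set.
 */
static int send_with_header(int pipe_wr, const char *header, int in_fd)
{
	size_t hdr_len;

	assert(header != NULL);
	hdr_len = strlen(header);
	if (write(pipe_wr, header, hdr_len) != (ssize_t)hdr_len)
		return -1;

	for (;;) {
		/* NULL offsets: use and advance each fd's own position. */
		ssize_t n = splice(in_fd, NULL, pipe_wr, NULL, 65536, 0);

		if (n > 0)
			continue;
		if (n == 0)	/* EOF on the input */
			break;
		if (errno == EINTR)
			continue;
		return -1;
	}
	return 0;
}
```

The other end of the pipe can then be read() normally or spliced onward to a socket or file.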
>
> Further, this also ends up giving a nice pollable interface to regular
> files too: just splice from the file (at any offset) into a pipe, and poll
> on the result. The "splice()" will just do the page cache operations and
> start the IO if necessary, the "poll()" will wait for the first page to be
> actually available. All _trivially_ done with the "struct pipe_buffer"
> operations.
>
> So the above kind of "send a file to another destination" should
> automatically work very naturally in any poll loop: instead of filling a
> writable pipe with a "write()", you just fill it with "splice()" instead
> (and you can read it with a 'read()' or you just splice it to somewhere
> else, or you tee() it to two destinations....).
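The tee(2) call that Linux eventually shipped differs from the three-argument form proposed above: it takes one input pipe and one output pipe and duplicates the queued data without consuming it from the input, so the "second destination" is simply the original pipe. A small sketch, with dup_pipe_data as a hypothetical wrapper name:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* needed for tee() in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: duplicate up to `len` bytes queued in the pipe
 * `src_rd` into the pipe `dst_wr`. The data stays readable on `src_rd`.
 * Returns the number of bytes duplicated, or -1 on error.
 */
static ssize_t dup_pipe_data(int src_rd, int dst_wr, size_t len)
{
	ssize_t n;

	assert(len > 0);
	do {
		n = tee(src_rd, dst_wr, len, 0);
	} while (n < 0 && errno == EINTR);
	return n;
}
```

After a successful call, both src_rd and the read end of the destination pipe yield the same bytes, so one can go to a file and the other to a viewer.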
>
> I think the media server kind of environment is the one most easily
> explained, where you have potentially tons of data that the server process
> really never actually wants to _see_ - it just wants to push it on to
> another process or connection or save it to a log-file or something. But
> as with regular pipes, it's not a specialized interface: it really is just
> a channel of communication.
>
> The difference being that a historical UNIX pipe is always a channel
> between two process spaces (ie you can only fill it and empty it into the
> process address space), and the _only_ thing I'm trying to do is to have
> it be able to be a channel between two different file descriptors too. You
> still need the process to "control" the channel, but the data doesn't have
> to touch the address space any more.
>
> Think of all the servers or other processes that really don't care about
> the data. Think of something as simple as encrypting a file, for example.
> Imagine that you have hardware encryption support that does DMA from the
> source, and writes the results using DMA. I think it's pretty obvious how
> you'd connect this up using pipes and two splices (one for the input, one
> for the output).
>
> And notice how _flexible_ it is (both the input and the output can be any
> kind of fd you want - the pipes end up doing both the "conversion" into a
> common format of "list of (possibly partial) pages" and the buffering,
> which is why the different "engines" don't need to care where the data
> comes from, or where it goes). So while you can use it to encrypt a file
> into another file, you could equally easily use it for something like
>
> tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network
>
> and the whole pipeline would not have a _single_ actual data copy: the
> pipes are channels.
>
> Of course, since it's a pipe, the nice thing is that people don't have to
> use "splice()" to access it - the above pipeline has a perfectly regular
> "tar" process that probably just does regular writes. You can have a
> process that does "splice()" to fill the pipe, and the other end is a
> normal thing that just uses regular "read()" and doesn't even _know_ that
> the pipe is using new-fangled technology to be filled.
>
> I'm clearly enamoured with this concept. I think it's one of those few
> "RightThing(tm)" that doesn't come along all that often. I don't know of
> anybody else doing this, and I think it's both useful and clever. If you
> now prove me wrong, I'll hate you forever ;)
>
> Linus
Actually, this is how L4 differs from a traditional micro-kernel, IMHO. Is it?
I have a long-term plan to add some messaging subsystem based on this.
--
Coywolf Qi Hunt
http://ahbl.org/~coywolf/