From: Larry McVoy <lm@bitmover.com>
To: Russell Leighton <leighton@imake.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: Is sendfile all that sexy?
Date: Thu, 18 Jan 2001 08:36:55 -0800
Message-ID: <20010118083655.D6787@work.bitmover.com>
In-Reply-To: <200101181001.f0IA11I25258@webber.adilger.net> <3A66CDB1.B61CD27B@imake.com>
In-Reply-To: <3A66CDB1.B61CD27B@imake.com>

On Thu, Jan 18, 2001 at 06:04:17AM -0500, Russell Leighton wrote:
> 
> "copy this fd to that one, and optimize that if you can"
> 
> ... isn't this Larry M's "splice" (http://www.bitmover.com/lm/papers/splice.ps)?

Not really.  It's not clear to me that people really understood what I was
getting at in that paper.  I've had some coffee and BK 2.0 is just about
ready to ship (shameless plug :-) so I'll give it another go.

The goal of splice is to avoid both data copies and virtual memory completely.
My SGI experience taught me that once you remove the data copy problem, the
next problem becomes setting up and tearing down the virtual mappings to the
data.  Linux is quite a bit lighter than IRIX but that doesn't remove this
issue, it just moves the point on the spectrum where the setup/teardown
becomes a problem.

Another goal of splice was to be general enough to allow data to flow from
any place to any place.  The idea was to have a good model and then iterate
over all the possible endpoints; I can think of files, sockets, and virtual
address spaces right off the top of my head.  Devices are a subset of files,
as will become apparent.

A final goal was to be able to handle caching vs non-caching.
Sometimes one of the endpoints is a cache, such as the file system cache.
Sometimes you want data to stay in the cache and sometimes you want to
bypass it completely.  The model had to handle this.

OK, so the issues are
    - avoid copying
    - avoid virtual memory as much as possible
    - allow data flow to/from non-aligned, non-page-sized objects
    - handle caching or non-caching

This leads pretty naturally to some observations about the shape of the
solution:

    - the basic unit of data is a physical page, or part of one.  That's
      physical page, not a virtual address which points to a physical page.
    - since we may be coming from sockets, where the payload is buried in
      the middle of a page, there needs to be a vector of pages and a
      vector of { pageno, offset, len } that goes along with the first
      vector.  There are two vectors because you could have multiple payloads
      in a single page, i.e., there is not a 1:1 mapping between pages and
      payloads.
    - The page vector needs some flags, which handle caching.  I had just
      two flags, the "LOAN" flag and the "GIFT" flag.
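
A minimal sketch of how the two vectors and the flags might be laid out.
Only { pageno, offset, len }, splice_size() and the LOAN/GIFT flags come
from the description above; every struct and field name here is
hypothetical, and none of this is an existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the post only names the LOAN and GIFT flags. */
enum splice_flag { SPLICE_LOAN = 1, SPLICE_GIFT = 2 };

/* One entry in the page vector: a physical page plus its caching flag. */
struct splice_page {
	uint64_t pfn;		/* physical page frame number, not a virtual address */
	enum splice_flag flag;	/* loaned by a cache, or gifted outright */
};

/* One entry in the payload vector: a byte range inside one page. */
struct splice_payload {
	size_t pageno;		/* index into the page vector */
	size_t offset;		/* byte offset of the payload within that page */
	size_t len;		/* payload length in bytes */
};

struct splice_vec {
	struct splice_page *pages;
	size_t npages;
	struct splice_payload *payloads;
	size_t npayloads;
};

/* Total payload bytes described by the vector. */
static size_t splice_size(const struct splice_vec *v)
{
	size_t total = 0;

	assert(v != NULL);
	for (size_t i = 0; i < v->npayloads; i++)
		total += v->payloads[i].len;
	return total;
}
```

With this layout, one page holding two separate socket payloads is
simply one page entry and two payload entries that share the same pageno.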

In my mind, this was enough that everyone should "get it" at this point, but
that's me being lazy.

So how would this all work?  The first thing is that we are now dealing
in vectors of physical pages.  That's key - if you look at an OS, it
spends a lot of time with data going into a physical page, then being
translated to a virtual page, being copied to another virtual page, and
then being translated back to a physical page so that it can be sent to
a different device.  That's the basic FTP loop.
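
For reference, the copying loop described above looks roughly like this
from user space.  This is the conventional POSIX read()/write() pattern,
not code from the splice paper, and copy_loop() is a made-up name.  Each
pass copies the data from kernel pages into buf and then back out again.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Copy everything from from_fd to to_fd through a user-space buffer.
 * Returns the number of bytes copied, or -1 on error.
 */
static long copy_loop(int from_fd, int to_fd)
{
	char buf[4096];
	long total = 0;
	ssize_t n;

	assert(from_fd >= 0 && to_fd >= 0);

	/* read() copies kernel pages into buf (first copy) */
	while ((n = read(from_fd, buf, sizeof(buf))) != 0) {
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		/* write() copies buf back into kernel pages (second copy) */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(to_fd, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
	return total;
}
```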

So you go "hey, just always talk physical pages and you avoid a lot of this
wasted time".  Now is a good time to observe that splice() is a library
interface.  The kernel level interfaces I called pull() and push().  The
idea was that you could do

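	vectors = 0;		/* no data gathered yet */

	/* accumulate pages until there is enough to be worth pushing */
	do {
		vectors = pull(from_fd, vectors);
	} while (splice_size(vectors) < BIG_ENOUGH_SIZE);
	push(to_fd, vectors);	/* hand the whole batch to the destination */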

The idea was that you maintain a pointer to the vectors.  The pointer is
a "cookie": you can't really dereference it in user space, at least not all
of it, but the kernel doesn't want to maintain this stuff, it wants you to
do that.  So you start pulling and then you push what you got.  And you,
being the user land process, never look at the data.  In fact, you can't:
you hold a pointer to a data structure which describes the data, but you
can't look at the data itself.
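
The pull/push loop can be exercised with a toy user-space model.  This
is only a sketch under heavy assumptions: the endpoints are in-memory
structs rather than fds, the "physical pages" are a static array, and
struct endpoint, PAGE_SZ and splice_all() are invented for illustration.
The loop also gains an end-of-data check so that it terminates when the
source runs dry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SZ		4096
#define MAX_PAGES	8
#define BIG_ENOUGH_SIZE	(3 * PAGE_SZ)

static char fake_pages[MAX_PAGES][PAGE_SZ];	/* stand-in for physical pages */

/* Toy endpoint: a source uses next_page/npages, a sink uses bytes_received. */
struct endpoint {
	size_t next_page;
	size_t npages;
	size_t bytes_received;
};

/* The "cookie": the caller holds it but never touches the page contents. */
struct splice_vec {
	size_t n;
	char *page[MAX_PAGES];
};

static size_t splice_size(const struct splice_vec *v)
{
	return v ? v->n * PAGE_SZ : 0;
}

/* Append a reference to the next source page; no data is copied. */
static struct splice_vec *pull(struct endpoint *from, struct splice_vec *v)
{
	if (v == NULL) {
		v = calloc(1, sizeof(*v));
		if (v == NULL)
			return NULL;
	}
	if (from->next_page < from->npages && v->n < MAX_PAGES)
		v->page[v->n++] = fake_pages[from->next_page++];
	return v;
}

/* Hand the gathered pages to the sink and release the cookie. */
static void push(struct endpoint *to, struct splice_vec *v)
{
	to->bytes_received += splice_size(v);
	free(v);
}

/* Move one batch; returns the sink's running byte count. */
static size_t splice_all(struct endpoint *from, struct endpoint *to)
{
	struct splice_vec *v = NULL;

	do {
		v = pull(from, v);
		assert(v != NULL);
	} while (splice_size(v) < BIG_ENOUGH_SIZE &&
		 from->next_page < from->npages);
	push(to, v);
	return to->bytes_received;
}
```

With a five-page source, the first call to splice_all() moves three
pages and the second moves the remaining two.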

A couple of interesting things: 
    - this design allows for multiplexing.  You could pull from multiple devices
      and then push to one.  The interface needs a little tweaking for that to
      be meaningful; we can steal from pipe semantics.  We need to be able to
      say how much to pull, so we add a length.
    - there is no reason that you couldn't have an fd which was open to
      /proc/self/my_address_space and you could essentially do an mmap()
      by seeking to where you want the mapping and doing a push to it.
      This is a fairly important point: it allows for end to end.  There are
      lots of nasty issues with non-page-sized chunks in the vector; what you
      do there depends on the semantics you want.
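
A sketch of what the multiplexing case with a length argument might look
like, again as a toy model.  pull_len(), struct source and the payload
layout are all assumptions made for illustration; the short-count
behaviour is borrowed from how read() on a pipe behaves.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAYLOADS 8

/* A byte range taken from source number `src`. */
struct payload {
	int src;
	size_t offset;
	size_t len;
};

struct splice_vec {
	size_t n;
	struct payload p[MAX_PAYLOADS];
};

/* Toy endpoint holding `size` bytes with a read cursor at `pos`. */
struct source {
	int id;
	size_t pos;
	size_t size;
};

/*
 * Pull at most len bytes from one source into the vector.  As with a
 * pipe, a short count is allowed.  Returns the number of bytes pulled.
 */
static size_t pull_len(struct source *from, struct splice_vec *v, size_t len)
{
	size_t avail, take;

	assert(from->pos <= from->size);
	avail = from->size - from->pos;
	take = len < avail ? len : avail;
	if (take == 0 || v->n == MAX_PAYLOADS)
		return 0;
	v->p[v->n++] = (struct payload){ from->id, from->pos, take };
	from->pos += take;
	return take;
}

/* Total bytes gathered so far, across all sources. */
static size_t splice_size(const struct splice_vec *v)
{
	size_t total = 0;

	for (size_t i = 0; i < v->n; i++)
		total += v->p[i].len;
	return total;
}
```

Interleaving pull_len() calls on two sources yields a single vector
whose payload entries record which source each range came from, ready
for one push to the destination.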

So what about the caching?  That's the loan/gift distinction.  The deal is that
these pages have reference counts and when the reference count goes to zero,
somebody has to free them.  So the page vector needs a free_page() function
pointer and if the pages are a loan, you call that function pointer when you
are done with them.   In other words, if the file system cache loaned you 
the pages, you do a call back to let the file system know you are done with
them.  If the pages were a gift, then the function pointer is null and you
have to manage them.  You can put the normal decrement_and_free() function
in there and when you get done with them you call that and the pages go back
to the free list.  You can also "free" them into your private page pool, etc.
The point is that if the endpoint which is being pulled from wants the
pages cached, it "loans" them; if it doesn't, it "gifts" them.  Sockets as
a "from" endpoint would always gift; files as a "from" endpoint would
typically loan.
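
One way the loan/gift release path could be sketched.  free_page and
decrement_and_free() are named in the text above; struct page_vec,
release_pages(), cache_free_page() and the two counters are hypothetical
stand-ins for the file system cache and the free list.

```c
#include <assert.h>
#include <stddef.h>

enum splice_flag { SPLICE_LOAN, SPLICE_GIFT };

struct page {
	int refcount;
};

struct page_vec {
	enum splice_flag flag;
	void (*free_page)(struct page *);	/* set by the lender; NULL for a gift */
	struct page **pages;
	size_t npages;
};

static int cache_returns;	/* pages handed back to the toy cache */
static int freelist_returns;	/* pages returned to the toy free list */

/* Lender's callback: the cache learns that the borrower is done. */
static void cache_free_page(struct page *pg)
{
	(void)pg;
	cache_returns++;
}

/* Default for gifted pages: drop the reference, free on zero. */
static void decrement_and_free(struct page *pg)
{
	if (--pg->refcount == 0)
		freelist_returns++;
}

/* Called by the consumer once it is done with the pages. */
static void release_pages(struct page_vec *v)
{
	void (*done)(struct page *) = v->free_page;

	if (done == NULL) {
		/* a gift: the receiver owns the pages and picks the policy */
		assert(v->flag == SPLICE_GIFT);
		done = decrement_and_free;
	}
	for (size_t i = 0; i < v->npages; i++)
		done(v->pages[i]);
	v->npages = 0;
}
```

A receiver that wanted to keep gifted pages in a private pool would
substitute its own function for decrement_and_free() here.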

So, there's the set of ideas.  I'm ashamed to admit that I don't really know
how close kiobufs are to this.  I am interested in hearing what you all think,
but especially what the people who have been playing around with kiobufs
and sendfile think.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 