From: "Stephen C. Tweedie" <sct@redhat.com>
To: Linus Torvalds <torvalds@transmeta.com>
Cc: linux-kernel@vger.kernel.org, Stephen Tweedie <sct@redhat.com>
Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
Date: Wed, 10 Jan 2001 15:21:58 +0000
Message-ID: <20010110152158.F10633@redhat.com>
In-Reply-To: <20010109141806.F4284@redhat.com> <Pine.LNX.4.30.0101091532150.4368-100000@e2> <20010109151725.D9321@redhat.com> <93g357$2jf$1@penguin.transmeta.com>
In-Reply-To: <93g357$2jf$1@penguin.transmeta.com>; from torvalds@transmeta.com on Tue, Jan 09, 2001 at 02:25:43PM -0800
Hi,
On Tue, Jan 09, 2001 at 02:25:43PM -0800, Linus Torvalds wrote:
> In article <20010109151725.D9321@redhat.com>,
> Stephen C. Tweedie <sct@redhat.com> wrote:
> >
> >Jes has also got hard numbers for the performance advantages of
> >jumbograms on some of the networks he's been using, and you ain't
> >going to get udp jumbograms through a page-by-page API, ever.
>
> The only thing you need is a nagle-type thing that coalesces requests.
Is this robust enough to build a useful user-level API on top of?
What happens if we have a threaded application in which more than one
thread may be issuing UDP sendmsg() calls on the same file descriptor?
If we end up decomposing each datagram into multiple page-sized chunks,
then you can imagine them arriving at the fd stream in interleaved order.
You can fix that by adding extra locking, but that just indicates that
the original API wasn't sufficient to communicate the precise intent
of the application in the first place.
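
To make the interleaving concrete, here is a toy userspace model (not
kernel code). The names stream, stream_put, send_whole and
send_interleaved are invented for the illustration, CHUNK stands in for
the page size, and the strict round-robin between the two senders is an
assumed worst-case schedule, not something the scheduler guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* stand-in for the page size, tiny so results are readable */

/* Toy "fd stream": chunks are appended in the order they arrive. */
struct stream {
    char   data[64];
    size_t len;
};

/* Append one chunk of at most CHUNK bytes to the stream. */
static void stream_put(struct stream *s, const char *buf, size_t n)
{
    assert(n <= CHUNK && s->len + n <= sizeof(s->data));
    memcpy(s->data + s->len, buf, n);
    s->len += n;
}

/* One sender submits a whole datagram in a single call: its chunks
 * cannot be separated by anybody else's. */
static void send_whole(struct stream *s, const char *msg, size_t n)
{
    for (size_t off = 0; off < n; off += CHUNK)
        stream_put(s, msg + off, n - off < CHUNK ? n - off : CHUNK);
}

/* Two senders each submit an n-byte datagram chunk by chunk, with no
 * locking, and happen to alternate (assumed worst case). */
static void send_interleaved(struct stream *s, const char *a,
                             const char *b, size_t n)
{
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = n - off < CHUNK ? n - off : CHUNK;

        stream_put(s, a + off, len);
        stream_put(s, b + off, len);
    }
}
```

With two 8-byte datagrams "AAAAAAAA" and "BBBBBBBB", send_interleaved
leaves "AAAABBBBAAAABBBB" in the stream, while two send_whole calls
leave "AAAAAAAABBBBBBBB". Nothing in the chunked interface tells the
receiver of the chunks where one datagram ends and the next begins.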
Things look worse from the point of view of ll_rw_block, which lacks
any concept of (a) a file descriptor, or (b) a non-reorderable stream
of atomic requests. ll_rw_block coalesces in any order it chooses, so
its coalescing function is a _lot_ more complex than hooking the next
page onto a linked list.
Once the queue size grows non-trivial, adding a new request can become
quite expensive (even with only one item on the request queue at once,
make_request is still by far the biggest cost on a kernel profile
running raw IO). If you've got a 32-page IO to send, sending it in
chunks means either merging 32 times into that queue when you could
have just done it once, or holding off all merging until you're told
to unplug: but with multiple clients, you just encounter the lack of
caller context again, and each client can unplug the other before its
time.
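
A rough way to see the "32 merges versus one" cost is the toy model
below. It is only a sketch: the linear scan is a deliberate
simplification and is not how make_request or the real elevator works,
and struct queue, queue_add, submit_paged and submit_whole are names
made up for the example. SECTORS_PER_PAGE assumes 4096-byte pages and
512-byte sectors.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_MAX        128
#define SECTORS_PER_PAGE 8   /* assumes 4096-byte pages, 512-byte sectors */

struct request {
    unsigned long sector;      /* first sector of the extent */
    unsigned long nr_sectors;  /* length of the extent */
};

struct queue {
    struct request req[QUEUE_MAX];
    size_t         nr;
    unsigned long  scan_steps; /* requests examined while hunting for a merge */
};

/* Add an extent: back-merge it into a contiguous request if one exists,
 * otherwise append a new request. Every request examined costs one step. */
static void queue_add(struct queue *q, unsigned long sector, unsigned long nr)
{
    for (size_t i = 0; i < q->nr; i++) {
        q->scan_steps++;
        if (q->req[i].sector + q->req[i].nr_sectors == sector) {
            q->req[i].nr_sectors += nr;
            return;
        }
    }
    assert(q->nr < QUEUE_MAX);
    q->req[q->nr].sector = sector;
    q->req[q->nr].nr_sectors = nr;
    q->nr++;
}

/* Page-by-page submission: one queue walk per page. */
static void submit_paged(struct queue *q, unsigned long sector, size_t pages)
{
    for (size_t i = 0; i < pages; i++)
        queue_add(q, sector + i * SECTORS_PER_PAGE, SECTORS_PER_PAGE);
}

/* Whole-IO submission: a single queue walk for the entire extent. */
static void submit_whole(struct queue *q, unsigned long sector, size_t pages)
{
    queue_add(q, sector, pages * SECTORS_PER_PAGE);
}
```

With ten unrelated requests already on the queue, submitting a 32-page
IO through submit_paged examines 351 queue entries in this model, while
submit_whole examines 10. Both end up with the same single 256-sector
request on the queue.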
I realise these are apples and oranges to some extent, because
ll_rw_block doesn't accept a file descriptor: the place where we _do_
use file descriptors, block_write(), could be doing some of this if
the requests were coming from an application.
However, that doesn't address the fact that we have got raw devices
and filesystems such as XFS already generating large multi-page block
IO requests and having to cram them down the thin pipe which is
ll_rw_block, and the MSG_MORE flag doesn't seem capable of extending
to ll_rw_block sufficiently well.
I guess it comes down to this: what problem are we trying to fix? If
it's strictly limited to sendfile/writev and related calls, then
you've convinced me that page-by-page MSG_MORE can work if you add a
bit of locking, but that locking is by itself nasty.
Think about O_DIRECT to a database file. We get a write() call,
locate the physical pages through unspecified magic, and fire off a
series of page or partial-page writes to the O_DIRECT fd. If we are
coalescing these via MSG_MORE, then we have to keep the fd locked for
write until we've processed the whole IO (including any page faults
that result). The filesystem --- which is what understands the
concept of a file descriptor --- can merge these together into another
request, but we'd just have to split that request into chunks again to
send them to ll_rw_block.
We may also have things like software RAID layers in the write path.
That's the motivation for having an object capable of describing
multi-page IOs --- it lets us pass the desired IO chunks down through
the filesystem, virtual block devices and physical block devices,
without any context being required and without having to
decompose/merge at each layer.
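
As a sketch of what such an object might look like (purely
illustrative: struct mpio and its helpers are hypothetical names, not
an existing kernel interface, and the model assumes whole, page-aligned
pages with 512-byte sectors), a layer can remap or split the IO by
adjusting the descriptor rather than by re-submitting page by page:

```c
#include <assert.h>
#include <stddef.h>

#define MPIO_MAX_PAGES   64
#define SECTORS_PER_PAGE 8   /* assumes 4096-byte pages, 512-byte sectors */

/* Hypothetical descriptor for one multi-page IO. It carries everything
 * a lower layer needs, so no per-caller context has to travel with it. */
struct mpio {
    void         *pages[MPIO_MAX_PAGES]; /* the data pages, in order */
    size_t        nr_pages;
    unsigned long sector;                /* start sector on the target device */
};

/* A linear remapping layer (say, a partition or a concatenated volume)
 * only needs to shift the start sector; the page list is untouched. */
static void mpio_remap(struct mpio *io, unsigned long start_sector)
{
    io->sector += start_sector;
}

/* A striping layer splits the IO once at a boundary into two
 * descriptors, instead of decomposing it into single pages and hoping
 * a lower layer merges them back together. */
static void mpio_split(const struct mpio *src, size_t head_pages,
                       struct mpio *head, struct mpio *tail)
{
    assert(src->nr_pages <= MPIO_MAX_PAGES);
    assert(head_pages <= src->nr_pages);

    head->nr_pages = head_pages;
    head->sector = src->sector;
    for (size_t i = 0; i < head_pages; i++)
        head->pages[i] = src->pages[i];

    tail->nr_pages = src->nr_pages - head_pages;
    tail->sector = src->sector + head_pages * SECTORS_PER_PAGE;
    for (size_t i = 0; i < tail->nr_pages; i++)
        tail->pages[i] = src->pages[head_pages + i];
}
```

A 32-page IO passing through a remap and one split still reaches the
physical devices as two large requests, however many layers sit in
between.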
Cheers,
Stephen