* [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
@ 2001-01-08 1:24 David S. Miller
From: David S. Miller @ 2001-01-08 1:24 UTC
To: linux-kernel; +Cc: netdev
I've put a patch up for testing on the kernel.org mirrors:
/pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz
It provides a framework for zerocopy transmits and delayed
receive fragment coalescing. TUX-1.01 uses this framework.

Zerocopy transmit requires some driver support; drivers without that
support behave exactly as they did before. Currently, sg+csum support
has been added to the Acenic, 3c59x, sunhme, and loopback drivers. We
had eepro100 support coded at one point, but it was removed because
we did not know how to distinguish the cards which support hw csum
assist from the ones which do not.
I would like people to test this hard and report any bugs they
discover. _PLEASE_ try to see whether 2.4.0 without this patch
produces the same problem, and if so report it as a 2.4.0 bug, _not_
as a bug in the zerocopy patch. Thank you.
In particular, I am interested in hearing about any new breakage
caused by the zerocopy patches when using netfilter. When reporting
bugs, please note which networking card you are using; whether the
card is actually using hw csum assist and sg support is an important
data point.
Finally, regardless of networking card, NFS clients should see a
measurable performance boost from this patch due to the delayed
fragment coalescing. KNFSD does not take full advantage of this
facility yet.
Later,
David S. Miller
davem@redhat.com
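For concreteness, the "sg+csum driver support" mentioned above boils
down to a driver advertising the new feature bits on its net_device
so the stack may hand it paged skbs. A minimal sketch, assuming the
NETIF_F_* flags the patch introduces; the function itself is
hypothetical, not code from the patch:

    #include <linux/netdevice.h>

    /* Sketch: advertise scatter-gather and hardware IP-checksum
     * support during device setup so the zerocopy transmit path
     * gets used.  Only the feature flags come from the patch. */
    static void mydrv_setup_features(struct net_device *dev)
    {
            dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;
            /* a card that can checksum arbitrary packets would set
             * NETIF_F_HW_CSUM instead of NETIF_F_IP_CSUM */
    }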
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Christoph Hellwig @ 2001-01-08 10:39 UTC
To: "David S. Miller"; +Cc: netdev, linux-kernel

In article <200101080124.RAA08134@pizda.ninka.net> you wrote:
> I've put a patch up for testing on the kernel.org mirrors:
>
>	/pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz
>
> It provides a framework for zerocopy transmits and delayed
> receive fragment coalescing. TUX-1.01 uses this framework.

Hi Dave,

don't you think the writepage file operation is rather hackish? I'd
much prefer Ben LaHaise's rw_kiovec [1] operation: it is more generic
(it supports both read and write) and should be easily usable for
zerocopy networking with plain old write (using map_user_kio).
Besides that, the FS crew thinks it should go in soon because of aio
anyway...

	Christoph

[1] for those who don't know it yet, the prototype is:

	rw_kiovec(struct file *filp, int rw, int nr,
		  struct kiobuf **kiovec, int flags,
		  size_t size, loff_t pos);

--
Whip me.  Beat me.  Make me maintain AIX.
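For readers unfamiliar with the kiobuf API, here is a rough sketch of
the calling sequence Christoph has in mind, using the 2.4 kiobuf
helpers (alloc_kiovec, map_user_kiobuf, unmap_kiobuf, free_kiovec).
The f_op->rw_kiovec member follows the prototype above but is only a
proposal, its return convention is assumed, and the surrounding
function is illustrative:

    #include <linux/fs.h>
    #include <linux/iobuf.h>

    /* Sketch: zerocopy write of a user buffer through the proposed
     * rw_kiovec file operation. */
    static int write_kio_sketch(struct file *filp, char *ubuf,
                                size_t count)
    {
            struct kiobuf *iobuf;
            int err;

            err = alloc_kiovec(1, &iobuf);
            if (err)
                    return err;
            /* pin the user pages and record them in the kiobuf */
            err = map_user_kiobuf(WRITE, iobuf,
                                  (unsigned long) ubuf, count);
            if (!err) {
                    /* proposed op; return convention assumed */
                    err = filp->f_op->rw_kiovec(filp, WRITE, 1,
                                                &iobuf, 0, count,
                                                filp->f_pos);
                    unmap_kiobuf(iobuf);
            }
            free_kiovec(1, &iobuf);
            return err;
    }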
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: David S. Miller @ 2001-01-08 10:34 UTC
To: hch; +Cc: netdev, linux-kernel

   Date: Mon, 8 Jan 2001 11:39:15 +0100
   From: Christoph Hellwig <hch@caldera.de>

   don't you think the writepage file operation is rather hackish?

Not at all, it's simply direct sendfile support. It does not try to
be any fancier than that.

Later,
David S. Miller
davem@redhat.com
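To make "direct sendfile support" concrete: the idea is that
sendfile() walks the input file's page cache and hands each page,
with an offset and length, straight to the socket's writepage op, so
no data is ever copied into an intermediate buffer. An illustrative
sketch only; the page-cache lookup helper and the exact writepage
signature are assumptions, not the patch's actual code:

    /* Sketch of sendfile()-over-writepage. */
    static ssize_t sendfile_sketch(struct file *sock, struct file *in,
                                   loff_t pos, size_t count)
    {
            ssize_t sent = 0;

            while (count) {
                    unsigned long off = pos & ~PAGE_CACHE_MASK;
                    size_t len = PAGE_CACHE_SIZE - off;
                    struct page *page;
                    ssize_t ret;

                    if (len > count)
                            len = count;
                    /* assumed page-cache lookup helper */
                    page = get_cached_page(in, pos >> PAGE_CACHE_SHIFT);
                    if (!page)
                            break;
                    /* the op under discussion: one page at a time */
                    ret = sock->f_op->writepage(sock, page, off, len);
                    page_cache_release(page);
                    if (ret <= 0)
                            break;
                    pos += ret;
                    sent += ret;
                    count -= ret;
            }
            return sent;
    }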
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Rik van Riel @ 2001-01-08 18:05 UTC
To: David S. Miller; +Cc: hch, netdev, linux-kernel

On Mon, 8 Jan 2001, David S. Miller wrote:
> From: Christoph Hellwig <hch@caldera.de>
>
>    don't you think the writepage file operation is rather hackish?
>
> Not at all, it's simply direct sendfile support. It does
> not try to be any fancier than that.

I really think the zerocopy network stuff should be ported to kiobuf
proper. The usefulness of the patch you posted is rather .. umm ..
limited.

Having proper kiobuf support would make it possible to, for example,
do zerocopy network->disk data transfers and lots of other things.

Furthermore, by using kiobuf for the network zerocopy stuff there's a
good chance the networking code will be integrated. Otherwise we just
might end up with a zero-copy-for-everything-except-networking Linux
2.5 kernel ;)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: David S. Miller @ 2001-01-08 21:07 UTC
To: riel; +Cc: hch, netdev, linux-kernel

   Date: Mon, 8 Jan 2001 16:05:23 -0200 (BRDT)
   From: Rik van Riel <riel@conectiva.com.br>

   I really think the zerocopy network stuff should be ported to
   kiobuf proper.

That is how it could be done in 2.5.x, sure. But this patch is
intended for 2.4.x, so "minimum impact" applies.

Later,
David S. Miller
davem@redhat.com
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 10:23 UTC
To: Rik van Riel; +Cc: David S. Miller, hch, netdev, linux-kernel

On Mon, 8 Jan 2001, Rik van Riel wrote:

> I really think the zerocopy network stuff should be ported to kiobuf
> proper.

yep, we talked to Stephen Tweedie about this already, but it involves
some changes in kiovec support and we didn't want to touch too much
code for 2.4. In any case, the zerocopy code is 'kiovec in spirit'
(it uses vectors of struct page *, offset, size entities), so the
transition to a finalized kiovec framework (or whatever other
mechanism) is trivial. Right now kiovecs are *way* too bloated for
the purposes of skb fragments.

> The usefulness of the patch you posted is rather .. umm .. limited.
> [...]

i violently disagree :-) The upcoming TUX release is based on David's
and Alexey's cleaned-up zerocopy framework [thus TUX and zerocopy are
separated]. David's patch adds a *very* scalable implementation of
zerocopy sendfile() and zerocopy sendmsg(), the panacea of fileserver
(webserver) scalability - it can be used by Apache, Samba and other
fileservers. The new zerocopy networking code DMAs straight out of
the pagecache, and natively supports hardware checksumming, highmem
(64-bit DMA on 32-bit systems) zerocopy, and multi-fragment DMA - no
limitations. We can saturate a gigabit link with TCP traffic at about
20% CPU usage on a 500 MHz x86 UP system. David and Alexey's patch is
cool - check it out!

> Having proper kiobuf support would make it possible to, for example,
> do zerocopy network->disk data transfers and lots of other things.

i used to think that this is useful, but these days it isn't. It's a
waste of PCI bandwidth, and it's much cheaper to keep a cache in RAM
than to do direct disk=>network DMA *every time* some resource is
requested.

> Furthermore, by using kiobuf for the network zerocopy stuff there's
> a good chance the networking code will be integrated.

David and Alexey are the TCP/IP networking code maintainers. So if
you see a 'test this' networking framework patch from them on l-k, it
has quite high chances of being integrated into the networking code
:-)

	Ingo
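The "vectors of struct page *, offset, size entities" Ingo describes
are the skb fragment descriptors of the patch. Roughly, with field
names as recalled from the 2.4 zerocopy code (treat the exact
declarations as approximate):

    /* One fragment of a paged skb: a piece of a page-cache page
     * that the NIC can DMA from directly. */
    typedef struct skb_frag_struct {
            struct page     *page;          /* where the bytes live */
            __u16           page_offset;    /* start within the page */
            __u16           size;           /* bytes in this fragment */
    } skb_frag_t;

    /* A paged skb then carries a small array of these: */
    struct skb_shared_info {
            atomic_t        dataref;
            unsigned int    nr_frags;
            skb_frag_t      frags[MAX_SKB_FRAGS];
    };

A three-word descriptor per fragment is what makes the comparison
with a several-hundred-byte kiovec so stark.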
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Christoph Hellwig @ 2001-01-09 10:31 UTC
To: Ingo Molnar; +Cc: Rik van Riel, David S. Miller, netdev, linux-kernel

On Tue, Jan 09, 2001 at 11:23:41AM +0100, Ingo Molnar wrote:
> On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> > I really think the zerocopy network stuff should be ported to
> > kiobuf proper.
>
> yep, we talked to Stephen Tweedie about this already, but it
> involves some changes in kiovec support and we didn't want to touch
> too much code for 2.4. In any case, the zerocopy code is 'kiovec in
> spirit' (it uses vectors of struct page *, offset, size entities),

Yep. That is why I was so worried about the writepage file op. It's
rather hackish (write only, and it looks useful only for networking)
compared with the proposed rw_kiovec fop.

> > The usefulness of the patch you posted is rather .. umm ..
> > limited. [...]
>
> i violently disagree :-) [...] David and Alexey's patch is cool -
> check it out!

Yuck. A new file op just to get a few benchmarks right... I hope the
writepage stuff will not be merged into Linus' tree (though I do want
the code behind it!).

	Christoph

--
Whip me.  Beat me.  Make me maintain AIX.
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: David S. Miller @ 2001-01-09 10:31 UTC
To: hch; +Cc: mingo, riel, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 11:31:45 +0100
   From: Christoph Hellwig <hch@caldera.de>

   Yuck. A new file op just to get a few benchmarks right... I hope
   the writepage stuff will not be merged into Linus' tree (though I
   do want the code behind it!).

It's a "I know how to send a page somewhere via this file descriptor
all by myself" operation. I don't see why people need to take
painkillers over this for 2.4.x. "I think f_op->write is stupid, such
a special-case file operation just to get a few benchmarks right" -
this is the kind of argument I am hearing.

Orthogonal to f_op->write, which specifies a low-level implementation
of sys_write, f_op->writepage specifies a low-level implementation of
sys_sendfile. Can you grok that?

Linus has already seen this. Originally he had a gripe because an
older revision of the code allowed multiple pages to be passed in an
array to the writepage(s) operation. He didn't like that, so I made
it take only one page, as he requested. He had no other major
objections to the infrastructure.

Later,
David S. Miller
davem@redhat.com
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Christoph Hellwig @ 2001-01-09 11:28 UTC
To: David S. Miller; +Cc: mingo, riel, netdev, linux-kernel

On Tue, Jan 09, 2001 at 02:31:13AM -0800, David S. Miller wrote:
> Orthogonal to f_op->write, which specifies a low-level
> implementation of sys_write, f_op->writepage specifies a low-level
> implementation of sys_sendfile. Can you grok that?

Sure. But sendfile is not one of the fundamental UNIX operations...
If there were no alternative I probably would not have said anything,
but with the rw_kiovec file op just around the corner I don't see any
reason to add this _very_ specific file operation. An alloc_kiovec
before and a free_kiovec after the actual call, plus the memory
overhead of a kiobuf, won't hurt so much that they stand against a
clean interface, IMHO.

> Linus has already seen this. Originally he had a gripe because an
> older revision of the code allowed multiple pages to be passed in
> an array to the writepage(s) operation. He didn't like that, so I
> made it take only one page, as he requested. He had no other major
> objections to the infrastructure.

You get that multiple-page call with kiobufs for free...

	Christoph

--
Whip me.  Beat me.  Make me maintain AIX.
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: David S. Miller @ 2001-01-09 11:42 UTC
To: hch; +Cc: mingo, riel, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 12:28:10 +0100
   From: Christoph Hellwig <hch@caldera.de>

   Sure. But sendfile is not one of the fundamental UNIX
   operations...

It's a fundamental Linux interface, and a fundamental
VFS-->networking interface.

   An alloc_kiovec before and a free_kiovec after the actual call,
   plus the memory overhead of a kiobuf, won't hurt so much that
   they stand against a clean interface, IMHO.

This whole exercise is pointless unless it performs well. The
overhead _DOES_ matter; we've tested and profiled all of this with
full specweb99 runs, zerocopy ftp server loads, etc. Removing one
word of state from anything involved in these code paths makes an
enormous difference. Have you run such tests with your suggested
kiobuf scheme?

Know what I really hate? People who keep talking, "almost done",
"designing" the "real solution" to a problem, with no code - i.e. no
total working implementation - to show for it. Often they have not
one line of code to show. Then the folks who actually get off their
lazy asses and build something real, which works, and which in fact
exceeded most of our personal performance expectations, are the ones
who get told that what they did was crap. What was the first thing
out of people's mouths? Not "nice work", but "I think writepage is
ugly and an eyesore, I hope nobody seriously considers this code for
inclusion."

Keep designing... like Linus says, "show me the code".

Later,
David S. Miller
davem@redhat.com
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 12:04 UTC
To: Christoph Hellwig; +Cc: David S. Miller, riel, netdev, linux-kernel

On Tue, 9 Jan 2001, Christoph Hellwig wrote:

> Sure. But sendfile is not one of the fundamental UNIX operations...

Neither were eg. kernel-based semaphores. So what? Unix wasn't
perfect and isn't perfect - but it was a (very) good starting point.
If you are arguing against the existence or importance of sendfile()
you should re-think: sendfile() is a unique (and important) interface
because it enables moving information between files (streams) without
involving any interim user-space memory buffer. No original Unix API
did this AFAIK, so we obviously had to add it. It's an important
Linux API category.

> If there were no alternative I probably would not have said
> anything, but with the rw_kiovec file op just around the corner I
> don't see any reason to add this _very_ specific file operation.

I do think that the kiovec code has to be rewritten substantially
before it can be used for networking zero-copy, so right now we do
the least damage if we do not increase the coverage of the kiovec
code.

> An alloc_kiovec before and a free_kiovec after the actual call,
> plus the memory overhead of a kiobuf, won't hurt so much that they
> stand against a clean interface, IMHO.

please study the networking portions of the zerocopy patch and you'll
see why this is not desirable. An alloc_kiovec()/free_kiovec() pair
is exactly the thing we cannot afford in a sendfile() operation.
sendfile() is lightweight; the setup times of kiovecs are not.
Basically, the current kiovec design does not deal with the realities
of high-speed, featherweight networking. DO NOT talk in
hypotheticals. The code is there: do it, measure it. You might not
care about performance; we do.

another, more theoretical issue is that i think the kernel should not
be littered with multi-page interfaces; we should keep the one
"struct page * at a time" interfaces. Eg. check out how the new
zerocopy code generates perfect MTU-sized frames via the
->writepage() interface. No interim container objects are necessary.

	Ingo
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Stephen C. Tweedie @ 2001-01-09 14:25 UTC
To: Ingo Molnar; +Cc: Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 01:04:49PM +0100, Ingo Molnar wrote:

> please study the networking portions of the zerocopy patch and
> you'll see why this is not desirable. An alloc_kiovec()/
> free_kiovec() pair is exactly the thing we cannot afford in a
> sendfile() operation. sendfile() is lightweight; the setup times
> of kiovecs are not.

Right. However, kiobufs can be kept around for as long as you want
and can be reused easily, and even if allocating and freeing them is
more work than you want, populating an existing kiobuf is _very_
cheap.

> another, more theoretical issue is that i think the kernel should
> not be littered with multi-page interfaces; we should keep the one
> "struct page * at a time" interfaces.

Bad bad bad. We already have SCSI devices optimised for bandwidth
which don't approach decent performance until you are passing them
1MB IOs, and even in networking the 1.5K packet limit kills us in
some cases and we need an interface capable of generating jumbograms.
Perhaps tcp can merge internal 4K requests, but if you're doing udp
jumbograms (or STP or VIA), you do need an interface which can give
the networking stack more than one page at once.

--Stephen
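The cheap-reuse pattern Stephen describes, sketched against the 2.4
kiobuf layout (field names from include/linux/iobuf.h as recalled;
the helper function itself is hypothetical):

    #include <linux/iobuf.h>

    /* Refill an already-allocated kiobuf for a new request: no
     * allocation, just the fields Stephen lists. */
    static void kiobuf_refill(struct kiobuf *iobuf,
                              struct page **pages, int nr_pages,
                              int offset, int length)
    {
            int i;

            iobuf->offset = offset;
            iobuf->length = length;
            iobuf->nr_pages = nr_pages;
            iobuf->errno = 0;
            for (i = 0; i < nr_pages; i++)
                    iobuf->maplist[i] = pages[i];
    }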
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Alan Cox @ 2001-01-09 14:33 UTC
To: Stephen C. Tweedie; +Cc: Ingo Molnar, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

> Bad bad bad. We already have SCSI devices optimised for bandwidth
> which don't approach decent performance until you are passing them
> 1MB IOs, and even in networking the 1.5K packet limit kills us in
> some cases

Even low-end cheap RAID cards like the AMI MegaRAID dearly want 128K
writes. It's quite a difference on them.
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 15:00 UTC
To: Stephen C. Tweedie; +Cc: Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:

> Right. However, kiobufs can be kept around for as long as you want
> and can be reused easily, and even if allocating and freeing them
> is more work than you want, populating an existing kiobuf is
> _very_ cheap.

we do have SLAB [which essentially caches structures, on a per-CPU
basis], which i did take into account, but still, initializing a 600+
byte kiovec is probably more work than the rest of sending a packet!
I mean, i'd love to eliminate the 200+ byte skb initialization as
well - it shows up.

> > another, more theoretical issue is that i think the kernel
> > should not be littered with multi-page interfaces; we should
> > keep the one "struct page * at a time" interfaces.
>
> Bad bad bad. We already have SCSI devices optimised for bandwidth
> which don't approach decent performance until you are passing them
> 1MB IOs, [...]

The fact that we're using single-page interfaces doesn't preclude us
from having nicely clustered requests - this is what IO-plugging is
about!

> and even in networking the 1.5K packet limit kills us in some cases
> and we need an interface capable of generating jumbograms.

which cases?

> Perhaps tcp can merge internal 4K requests, [...]

yes, because depending on the application to send properly sized
requests is a futile act IMO. So we do have intelligent buffering and
clustering in basically every kernel subsystem - and we'll continue
to have it because we have no choice: most of Linux's user-visible IO
APIs have byte granularity (which is good, btw). Adding a multi-page
interface will IMO mostly just complicate the design and the
implementation. Do you have empirical (or theoretical) proof which
shows that single-page interfaces cannot perform well?

> but if you're doing udp jumbograms (or STP or VIA), you do need an
> interface which can give the networking stack more than one page
> at once.

nothing prevents the introduction of specialized interfaces - if they
feel like they can get enough traction. I was talking about the
normal Linux IO APIs, read()/write()/sendfile(), which are byte
granularity and invoke an almost mandatory buffering/clustering
mechanism in every kernel subsystem they deal with.

	Ingo
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Stephen C. Tweedie @ 2001-01-09 15:27 UTC
To: Ingo Molnar; +Cc: Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 04:00:34PM +0100, Ingo Molnar wrote:

> we do have SLAB [which essentially caches structures, on a per-CPU
> basis], which i did take into account, but still, initializing a
> 600+ byte kiovec is probably more work than the rest of sending a
> packet! I mean, i'd love to eliminate the 200+ byte skb
> initialization as well - it shows up.

Reusing a kiobuf for a request involves setting up the length, offset
and maybe errno fields, and writing the struct page *'s into the
maplist[]. Nothing more.

> The fact that we're using single-page interfaces doesn't preclude
> us from having nicely clustered requests - this is what IO-plugging
> is about!

We've already got measurements showing how insane this is. Raw IO
requests, plus internal pagebuf contiguous requests from XFS, have to
get broken down into page-sized chunks by the current ll_rw_block()
API, only to get reassembled by the make_request code. It's
*enormous* overhead, and the kiobuf-based disk IO code demonstrates
this clearly. We have already shown that the IO-plugging API sucks,
I'm afraid.

> > and even in networking the 1.5K packet limit kills us in some
> > cases and we need an interface capable of generating jumbograms.
>
> which cases?

Gig Ethernet, HIPPI... It's not so bad with an intelligent
controller, admittedly.

> > but if you're doing udp jumbograms (or STP or VIA), you do need
> > an interface which can give the networking stack more than one
> > page at once.
>
> nothing prevents the introduction of specialized interfaces - if
> they feel like they can get enough traction.

So you mean we'll introduce two separate APIs for general zero-copy,
just to get around the problems in the single-page-based one?

> I was talking about the normal Linux IO APIs,
> read()/write()/sendfile(), which are byte granularity and invoke
> an almost mandatory buffering/clustering mechanism in every kernel
> subsystem they deal with.

Only tcp and ll_rw_block. ll_rw_block has already been fixed in the
SGI patches, and gets _much_ better performance as a result. udp
doesn't do any such clustering. That leaves tcp.

The presence of terrible performance in the old ll_rw_block code is
NOT a good excuse for perpetuating that model.

Cheers,
 Stephen
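The contrast Stephen draws, in interface terms: the buffer_head path
submits one fixed-size chunk per element and relies on the request
layer to re-merge them, while the kiobuf path describes the whole
transfer in a single call. Both declarations below are from 2.4 as
best recalled (brw_kiovec lives in the raw-IO/kiobuf code); treat
them as approximate:

    /* buffer_head path: nr block-sized pieces, one buffer_head
     * each, re-merged later by the make_request code. */
    void ll_rw_block(int rw, int nr, struct buffer_head *bhs[]);

    /* kiobuf path: one call describes a possibly multi-megabyte
     * transfer against a device. */
    int brw_kiovec(int rw, int nr, struct kiobuf *iovec[],
                   kdev_t dev, unsigned long blocks[], int size);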
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 16:16 UTC
To: Stephen C. Tweedie; +Cc: Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:

> Reusing a kiobuf for a request involves setting up the length,
> offset and maybe errno fields, and writing the struct page *'s
> into the maplist[]. Nothing more.

i'm talking about kiovecs, not kiobufs (because those are equivalent
to a fragmented packet - every packet fragment can be anywhere).
Initializing a kiovec involves touching a dozen cachelines. Keeping
structures compressed is very important. i don't know - i don't think
it's necessarily bad for a subsystem to have its own 'native
structure' for how it manages data.

> We've already got measurements showing how insane this is. Raw IO
> requests, plus internal pagebuf contiguous requests from XFS, have
> to get broken down into page-sized chunks by the current
> ll_rw_block() API, only to get reassembled by the make_request
> code. It's *enormous* overhead, and the kiobuf-based disk IO code
> demonstrates this clearly.

i do believe that you are wrong here. We did have a multi-page API
between sendfile and the TCP layer initially, and it made *absolutely
no performance difference*. But it was more complex and harder to
fix. And we had to keep intelligent buffering/clustering/merging in
any case, because some native Linux interfaces, such as write() and
read(), have byte granularity. So unless there is some fundamental
difference between the two approaches, i don't buy this argument. I'm
not necessarily saying that your measurements are wrong; i'm saying
that the performance analysis is wrong.

> We have already shown that the IO-plugging API sucks, I'm afraid.

it might not be important to others, but we do hold one particular
SPECweb99 world record: on a 2-way, 2 GB RAM box, testing a load with
a full fileset of ~9 GB. It generates insane block-IO load, and we
beat other OSs that have multipage support, including SGI's. (And no,
it's not due to kernel-space acceleration alone this time - it's
mostly due to very good block-IO performance.) We use Jens Axboe's
IO-batching fixes, which dramatically improve the block scheduler's
performance under high load.

> > > and even in networking the 1.5K packet limit kills us in some
> > > cases and we need an interface capable of generating
> > > jumbograms.
> >
> > which cases?
>
> Gig Ethernet, [...]

we handle gigabit ethernet with 1.5K zero-copy packets just fine. One
thing people forget is IRQ throttling: when switching from 1500-byte
packets to 9000-byte packets, the number of interrupts drops by a
factor of 6. If a driver's tunings are not changed accordingly, a
1500-byte MTU can show dramatically lower performance than a
9000-byte MTU. But if tuned properly, i see little difference between
a 1500-byte and a 9000-byte MTU (when using a good protocol such as
TCP).

> > nothing prevents the introduction of specialized interfaces - if
> > they feel like they can get enough traction.
>
> So you mean we'll introduce two separate APIs for general
> zero-copy, just to get around the problems in the single-page-based
> one?

no. But i think that none of the mainstream protocols or APIs mandate
a multi-page interface - i do think that the performance problems
mentioned were mis-analyzed. I'd call the multi-page API thing an
urban legend. Nobody in their right mind can claim that a series of
function calls shows any difference in *block IO* performance
compared to a multi-page API (which has an additional vector-setup
cost). Only functional differences can explain any measured
performance difference - and for those merging/clustering bugs,
multipage support is only a workaround.

> Only tcp and ll_rw_block. ll_rw_block has already been fixed in the
> SGI patches, and gets _much_ better performance as a result. [...]

as mentioned above, i think this is not due to going multipage.

> The presence of terrible performance in the old ll_rw_block code is
> NOT a good excuse for perpetuating that model.

i'd like to measure this performance problem (because i'd like to
double-check it) - what measurement method was used?

	Ingo
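For scale, Ingo's factor of 6 is simple arithmetic (back-of-envelope,
assuming one interrupt per packet, no coalescing, and full gigabit
line rate): 1 Gbit/s is 125 Mbyte/s, so a 1500-byte MTU means roughly
125,000,000 / 1500, about 83,000 packets (and interrupts) per second,
while a 9000-byte MTU means roughly 125,000,000 / 9000, about 14,000
per second - six times fewer.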
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Alan Cox @ 2001-01-09 16:37 UTC
To: mingo; +Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

> > We have already shown that the IO-plugging API sucks, I'm afraid.
>
> it might not be important to others, but we do hold one particular
> SPECweb99 world record: on a 2-way, 2 GB RAM box, testing a load
> with a full fileset of ~9 GB.

And its real-world value is exactly the same as the Mindcraft NT
values. Don't forget that.
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 16:48 UTC
To: Alan Cox; +Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Tue, 9 Jan 2001, Alan Cox wrote:

> > it might not be important to others, but we do hold one
> > particular SPECweb99 world record: on a 2-way, 2 GB RAM box,
> > testing a load with a full fileset of ~9 GB.
>
> And its real-world value is exactly the same as the Mindcraft NT
> values. Don't forget that.

(what you have not quoted is the part that says that the fileset is
9 GB. This is one of the busiest and most complex block-IO Linux
systems i've ever seen, which is why i quoted it - the talk was about
block-IO performance, and Stephen said that our block IO sucks. It
used to suck, but in 2.4, with the right patch from Jens, it doesn't
suck anymore.)

	Ingo
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Alan Cox @ 2001-01-09 17:29 UTC
To: mingo; +Cc: Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

> ever seen, which is why i quoted it - the talk was about block-IO
> performance, and Stephen said that our block IO sucks. It used to
> suck, but in 2.4, with the right patch from Jens, it doesn't suck
> anymore.)

That's fine. Get me 128K-512K chunks nicely streaming into my RAID
controller and I'll be a happy man.

I don't have a problem with the claim that it's not the per-page
stuff and plugging that breaks ll_rw_blk. If there is evidence
contradicting the SGI stuff, it's very interesting.
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Jens Axboe @ 2001-01-09 17:38 UTC
To: Alan Cox; +Cc: mingo, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Tue, Jan 09 2001, Alan Cox wrote:
> That's fine. Get me 128K-512K chunks nicely streaming into my RAID
> controller and I'll be a happy man.

No problem, apply blk-13B and you'll get 512K chunks for SCSI and
RAID.

--
* Jens Axboe <axboe@suse.de>
* SuSE Labs
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 18:38 UTC
To: Jens Axboe; +Cc: Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Tue, 9 Jan 2001, Jens Axboe wrote:

> No problem, apply blk-13B and you'll get 512K chunks for SCSI and
> RAID.

i cannot agree more - Jens' patch did wonders for IO performance
here. It fixes a long-standing bug in the Linux block-IO scheduler
that caused very suboptimal requests to be issued to low-level
drivers once the request queue gets full. I think this patch is a
clear candidate for 2.4.x inclusion.

	Ingo
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Andrea Arcangeli @ 2001-01-09 19:54 UTC
To: Ingo Molnar; +Cc: Jens Axboe, Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Tue, Jan 09, 2001 at 07:38:28PM +0100, Ingo Molnar wrote:
> i cannot agree more - Jens' patch did wonders for IO performance
> here.

BTW, I noticed what is left in blk-13B seems to be my work (Jens's
fixes for merging when the I/O queue is full have just been
integrated in test1x). The 512K SCSI command, wake_up_nr, the
elevator fixes and cleanups, and the removal of the bogus 64
max_segment limit in scsi.c - which matters only with the IOMMU, to
allow devices with sg_tablesize < 64 to do SG with 64 segments - were
all thought out and implemented by me. My last public patch with most
of the blk-13B stuff in it was here:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3

I submitted a later revision of the above blkdev-3 to Jens and he has
kept it nicely maintained in sync with 2.4.x-latest.

My blkdev tree is even more advanced, but I haven't had time to
update it for 2.4.0 and merge it with Jens yet (I just described to
Jens what "more advanced" means; in practice it means something like
a 2x speedup in tiotest seek-write numbers - streaming I/O doesn't
change on highmem boxes, but it doesn't hurt lowmem boxes anymore).
Current blk-13B isn't OK for integration yet because it hurts lowmem
boxes (try mem=32m with your SCSI array that gets 512K*512 requests
in flight :) and it's not able to exploit the elevator as well as my
tree, even on high-memory machines. So I'd wait until I merge the
last bits with Jens (I raised QUEUE_NR_REQUESTS to 3000) before
inclusion. Confirm, Jens?

Andrea
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 20:10 UTC
To: Andrea Arcangeli; +Cc: Jens Axboe, Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Tue, 9 Jan 2001, Andrea Arcangeli wrote:

> BTW, I noticed what is left in blk-13B seems to be my work (Jens's
> fixes for merging when the I/O queue is full have just been
> integrated in test1x). [...]

it was Jens' batch-freeing changes [i think those were implemented by
Jens entirely] that made the most difference. (we did profile it step
by step.)

> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3

great! i'm happy that the block IO layer and IO scheduler now have a
real home :-) nice work.

	Ingo
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Andrea Arcangeli @ 2001-01-10 0:00 UTC
To: Ingo Molnar; +Cc: Jens Axboe, linux-kernel

On Tue, Jan 09, 2001 at 09:10:24PM +0100, Ingo Molnar wrote:
> it was Jens' batch-freeing changes [i think those were implemented
> by Jens entirely] that made the most difference. (we did profile it
> step by step.)

Confirmed, the batch-freeing was Jens's work.

Andrea
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Jens Axboe @ 2001-01-09 20:12 UTC
To: Andrea Arcangeli; +Cc: Ingo Molnar, Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Tue, Jan 09 2001, Andrea Arcangeli wrote:
> BTW, I noticed what is left in blk-13B seems to be my work (Jens's
> fixes for merging when the I/O queue is full have just been
> integrated in test1x). The 512K SCSI command, wake_up_nr, the
> elevator fixes and cleanups, and the removal of the bogus 64
> max_segment limit in scsi.c [...] were all thought out and
> implemented by me.
>
> I submitted a later revision of the above blkdev-3 to Jens and he
> has kept it nicely maintained in sync with 2.4.x-latest.

There are several parts that have been merged beyond recognition at
this point :-). The wake_up_nr was actually partially redone by Ingo;
I suspect he can fill in the gaps there. Then there are the general
cleanups and cruft removal done by you (the elevator->nr_segments
stuff). The bogus 64 max segments in SCSI was there before the merge
too; I think I've actually had that in my tree for ages!

The request free batching and pending queues were done by me, and
Ingo helped tweak them during the spec runs to find a sweet spot for
how much to batch, etc. The elevator received lots of massaging
beyond blkdev-3. For one, there is now only one complete queue scan
for merge and insert of a request, where before we did one for each
of them. The merger also does correct accounting and aging. In
addition there are a bunch of other small fixes in there; I'm too
lazy to list them all now :)

> My blkdev tree is even more advanced, but I haven't had time to
> update it for 2.4.0 and merge it with Jens yet (I just described to
> Jens what "more advanced" means; in practice it means something
> like a 2x speedup in tiotest seek-write numbers

I haven't heard anything beyond the raised QUEUE_NR_REQUESTS, so I'd
like to see what you have pending so we can merge :-). The tiotest
seek increase was mainly due to the elevator having 3000 requests to
juggle and thus being able to eliminate a lot of seeks, right?

> streaming I/O doesn't change on highmem boxes, but it doesn't hurt
> lowmem boxes anymore). Current blk-13B isn't OK for integration yet
> because it hurts lowmem boxes (try mem=32m with your SCSI array
> that gets 512K*512 requests in flight :) and it's not able to
> exploit the elevator as

I don't see any lowmem problems -- if under pressure, the queue
should be fired and thus it won't get as long as if you have lots of
memory free.

> well as my tree, even on high-memory machines. So I'd wait until I
> merge the last bits with Jens (I raised QUEUE_NR_REQUESTS to 3000)
> before inclusion.

?? What do you mean, exploit the elevator?

--
* Jens Axboe <axboe@suse.de>
* SuSE Labs
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Andrea Arcangeli @ 2001-01-09 23:20 UTC
To: Jens Axboe; +Cc: Ingo Molnar, Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Tue, Jan 09, 2001 at 09:12:04PM +0100, Jens Axboe wrote:
> I haven't heard anything beyond the raised QUEUE_NR_REQUESTS, so
> I'd like to see what you have pending so we can merge :-). The
> tiotest seek increase was mainly due to the elevator having 3000
> requests to juggle and thus being able to eliminate a lot of seeks,
> right?

Raising QUEUE_NR_REQUESTS is possible because of the rework of other
parts of ll_rw_block meant to fix the lowmem boxes.

> I don't see any lowmem problems -- if under pressure, the queue
> should be fired and thus it won't get as long as if you have lots
> of memory free.

A write(2) shouldn't cause the allocator to wait for I/O completion.
It's the write that should block when it's only polluting the cache,
or you'll hurt the innocent rest of the system that isn't writing.

At least with my original implementation of the 512K large SCSI
command support that you merged, before a write could block you first
had to generate at least 128 Mbytes of memory _locked_, all queued in
the I/O request list waiting for the driver to process the requests
(only locked, without counting the dirty part of memory). Since you
raised the queue from 256 requests to 512 with your patch, you may
have to generate 256 Mbytes of locked memory before a write can
block. This is great on the 8G boxes that run specweb, but it isn't
that great on a 32-Mbyte box that happens to be connected to a decent
SCSI adapter.

I say "may" because I didn't check closely whether you introduced any
kind of logic to avoid this. It seems not, though, because such logic
needs to touch at least blkdev_release_request, and that's what I
developed in my tree: there I could raise the number of I/O requests
in the queue to 10000 if I wanted, without any problem, because the
maximum I/O in flight was controlled properly. (This allowed me to
optimize away not 256, or in your case 512, seeks but 10000 seeks.)
This is what I meant by exploiting the elevator. No panic, there's no
buffer overflow there ;)

Andrea
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Jens Axboe @ 2001-01-09 23:34 UTC
To: Andrea Arcangeli; +Cc: Ingo Molnar, Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Wed, Jan 10 2001, Andrea Arcangeli wrote:
> Raising QUEUE_NR_REQUESTS is possible because of the rework of
> other parts of ll_rw_block meant to fix the lowmem boxes.

Ah, I see. It would be nice to base QUEUE_NR_REQUESTS on something
other than a static number. For example, 3000 per queue translates
into 281Kb of request slots per queue. On a typical system with a
floppy, hard drive, and CD-ROM, that is getting close to 1Mb of RAM
used for this alone. On a 32Mb box this is unacceptable. I previously
had blk_init_queue_nr(q, nr_free_slots) to e.g. not use that many
free slots on, say, the floppy, where it doesn't really make much
sense anyway.

> A write(2) shouldn't cause the allocator to wait for I/O
> completion. It's the write that should block when it's only
> polluting the cache, or you'll hurt the innocent rest of the
> system that isn't writing.
>
> [...] This is great on the 8G boxes that run specweb, but it isn't
> that great on a 32-Mbyte box that happens to be connected to a
> decent SCSI adapter.

Yes, I see your point. However, memory shortage will fire the queue
in due time; it won't make the WRITEs block, though. In that case it
would be bdflush blocking on the WRITEs, which seems to be exactly
what we don't want?

> I say "may" because I didn't check closely whether you introduced
> any kind of logic to avoid this. It seems not, though, because such
> logic needs to touch at least blkdev_release_request, and that's
> what I developed in my tree: there I could raise the number of I/O
> requests in the queue to 10000 if I wanted, without any problem,
> because the maximum I/O in flight was controlled properly. [...]
> This is what I meant by exploiting the elevator.

So you imposed an MB limit on how much I/O can be outstanding in
blkdev_release_request? Wouldn't it make more sense to move this to
get_request time, since with the blkdev_release_request approach you
won't catch lots of outstanding locked buffers before you start
releasing the first of them, at which point it would be too late (it
might recover, but still).

--
* Jens Axboe <axboe@suse.de>
* SuSE Labs
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Andrea Arcangeli @ 2001-01-09 23:52 UTC
To: Jens Axboe; +Cc: Ingo Molnar, Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel

On Wed, Jan 10, 2001 at 12:34:35AM +0100, Jens Axboe wrote:
> Ah, I see. It would be nice to base QUEUE_NR_REQUESTS on something
> other than a static number. For example, 3000 per queue translates
> into 281Kb of request slots per queue. On a typical system with a
> floppy, hard drive, and CD-ROM, that is getting close to 1Mb of RAM
> used for this alone. On a 32Mb box this is unacceptable.

Yes, of course. In fact 3000 was just the number I chose when doing
the benchmarks on a 128M box. Things need to be autotuned, and that's
not yet implemented. I meant 3000 to show how far such a number can
grow: right now, if you use 3000, you will need to lock 1.5G of RAM
(more than the normal zone!) before you can block, with the 512K SCSI
commands. This was just to show that the rest of the blkdev layer was
obviously restructured. On an 8G box, 10000 requests would probably
be a good number.

> Yes, I see your point. However, memory shortage will fire the queue
> in due time; it won't make the WRITEs block, though.

That's the performance problem I'm talking about on the lowmem boxes.
In fact this problem will happen in 2.4.x too, just less pronounced
than with the 512K SCSI commands and your increase of the number of
requests from 256 to 512.

> In that case it would be bdflush blocking on the WRITEs, which
> seems to be exactly what we don't want?

In 2.4.0 Linus fixed wakeup_bdflush not to wait on bdflush anymore,
as I suggested. Now it's the task context that submits the requests
directly to the I/O queue, so it's the task that must block, not
bdflush. And the task will block correctly _if_ we unplug at the sane
time in ll_rw_block.

> So you imposed an MB limit on how much I/O can be outstanding in
> blkdev_release_request? Wouldn't it make more sense to move this to

No, absolutely not in blkdev_release_request. The changes there are
because you need to somehow do some accounting at I/O completion.

> get_request time, since with the blkdev_release_request approach
> you won't catch lots of outstanding locked buffers

Yes: only ll_rw_block unplugs, not blkdev_release_request -
obviously, since the latter runs from irqs.

Andrea
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 19:54 ` Andrea Arcangeli 2001-01-09 20:10 ` Ingo Molnar 2001-01-09 20:12 ` Jens Axboe @ 2001-01-17 5:16 ` Rik van Riel 2 siblings, 0 replies; 101+ messages in thread From: Rik van Riel @ 2001-01-17 5:16 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Jens Axboe, Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, netdev, linux-kernel On Tue, 9 Jan 2001, Andrea Arcangeli wrote: > BTW, I noticed what is left in blk-13B seems to be my work Yeah yeah, we'll buy you beer at the next conference... ;) Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 16:48 ` Ingo Molnar 2001-01-09 17:29 ` Alan Cox @ 2001-01-09 17:56 ` Chris Evans 2001-01-09 18:41 ` Ingo Molnar 1 sibling, 1 reply; 101+ messages in thread From: Chris Evans @ 2001-01-09 17:56 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel On Tue, 9 Jan 2001, Ingo Molnar wrote: > This is one of the busiest and most complex block-IO Linux systems i've > ever seen, this is why i quoted it - the talk was about block-IO > performance, and Stephen said that our block IO sucks. It used to suck, > but in 2.4, with the right patch from Jens, it doesnt suck anymore. ) Is this "right patch from Jens" on the radar for 2.4 inclusion? Cheers Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 17:56 ` Chris Evans @ 2001-01-09 18:41 ` Ingo Molnar 2001-01-09 22:58 ` [patch]: ac4 blk (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) Jens Axboe 0 siblings, 1 reply; 101+ messages in thread From: Ingo Molnar @ 2001-01-09 18:41 UTC (permalink / raw) To: Chris Evans; +Cc: linux-kernel On Tue, 9 Jan 2001, Chris Evans wrote: > > but in 2.4, with the right patch from Jens, it doesnt suck anymore. ) > > Is this "right patch from Jens" on the radar for 2.4 inclusion? i do hope so! Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* [patch]: ac4 blk (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) 2001-01-09 18:41 ` Ingo Molnar @ 2001-01-09 22:58 ` Jens Axboe 0 siblings, 0 replies; 101+ messages in thread From: Jens Axboe @ 2001-01-09 22:58 UTC (permalink / raw) To: Ingo Molnar; +Cc: Chris Evans, linux-kernel On Tue, Jan 09 2001, Ingo Molnar wrote: > > > > but in 2.4, with the right patch from Jens, it doesnt suck anymore. ) > > > > Is this "right patch from Jens" on the radar for 2.4 inclusion? > > i do hope so! Here's a version against 2.4.0-ac4, blk-13B did not apply cleanly due to moving of i2o files and S/390 dasd changes: *.kernel.org/pub/linux/kernel/people/axboe/patches/2.4.0-ac4/blk-13C.bz2 -- * Jens Axboe <axboe@suse.de> * SuSE Labs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 16:37 ` Alan Cox 2001-01-09 16:48 ` Ingo Molnar @ 2001-01-09 19:20 ` J Sloan 1 sibling, 0 replies; 101+ messages in thread From: J Sloan @ 2001-01-09 19:20 UTC (permalink / raw) To: Alan Cox Cc: mingo, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel Alan Cox wrote: > > > it might not be important to others, but we do hold one particular > > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full > > And its real world value is exactly the same as the mindcraft NT values. Don't > forget that. In other words, devastating. jjs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 16:16 ` Ingo Molnar 2001-01-09 16:37 ` Alan Cox @ 2001-01-09 18:10 ` Stephen C. Tweedie 1 sibling, 0 replies; 101+ messages in thread From: Stephen C. Tweedie @ 2001-01-09 18:10 UTC (permalink / raw) To: Ingo Molnar Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel Hi, On Tue, Jan 09, 2001 at 05:16:40PM +0100, Ingo Molnar wrote: > On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: > > i'm talking about kiovecs not kiobufs (because those are equivalent to a > fragmented packet - every packet fragment can be anywhere). Initializing a > kiovec involves touching a dozen cachelines. Keeping structures compressed > is very important. > > i dont know. I dont think it's necessarily bad for a subsystem to have its > own 'native structure' for how it manages data. For the transmit case, unless the sender needs seriously fragmented data, the kiovec is just a kiobuf*. > i do believe that you are wrong here. We did have a multi-page API between > sendfile and the TCP layer initially, and it made *absolutely no > performance difference*. That may be fine for tcp, but tcp explicitly maintains the state of the caller and can stream things sequentially to a specific file descriptor. The block device layer, on the other hand, has to accept requests _in any order_ and still reorder them to the optimal elevator order. The merging in ll_rw_block is _far_ more expensive than adding a request to the end of a list. It's not helped by the fact that each such request has a buffer_head and a struct request associated with it, so deconstructing the large IO into buffer_heads results in huge amounts of data being allocated and deleted. We could streamline this greatly if the block device layer kept per-caller context in the way that tcp does, but the block device API just doesn't work that way. > > We have already shown that the IO-plugging API sucks, I'm afraid. > > it might not be important to others, but we do hold one particular > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full > fileset of ~9 GB. It generates insane block-IO load, and we do beat other > OSs that have multipage support, including SGI. (and no, it's not due to > kernel-space acceleration alone this time - it's mostly due to very good > block-IO performance.) We use Jens Axboe's IO-batching fixes that > dramatically improve the block scheduler's performance under high load. Perhaps, but we have proven and significant reductions in CPU utilisation from eliminating the per-buffer_head API to the block layer. Next time M$ gets close to our specweb records, maybe this is the next place to look for those extra few % points! > > Gig Ethernet, [...] > > we handle gigabit ethernet with 1.5K zero-copy packets just fine. One > thing people forget is IRQ throttling: when switching from 1500 byte > packets to 9000 byte packets then the amount of interrupts drops by a > factor of 6. Now if the tunings of a driver are not changed accordingly, > 1500 byte MTU can show dramatically lower performance than 9000 byte MTU. > But if tuned properly, i see little difference between 1500 byte and 9000 > byte MTU. (when using a good protocol such as TCP.) Maybe you see good throughput numbers, but I still bet the CPU utilisation could be bettered significantly with jumbograms. 
That's one of the problems with benchmarks: our CPU may be fast enough that we can keep the IO subsystems streaming, and the benchmark will not show up any OS bottlenecks, but we may still be consuming far too much CPU time internally. That's certainly the case with the block IO measurements made on XFS: sure, ext2 can keep a fast disk loaded to pretty much 100%, but at the cost of far more system CPU time than XFS+pagebuf+kiobuf-IO takes on the same disk. > > The presence of terrible performance in the old ll_rw_block code is > > NOT a good excuse for perpetuating that model. > > i'd like to measure this performance problem (because i'd like to > double-check it) - what measurement method was used? "time" will show it. A 13MB/sec raw IO dd using 64K blocks uses something between 5% and 15% of CPU time on the various systems I've tested on (up to 30% on an old 486 with a 1540, but that's hardly representative. :) The kernel profile clearly shows the buffer management as the biggest cost, with the SCSI code walking those buffer heads a close second. On my main scsi server test box, I get raw 32K reads taking about 7% system time on the cpu, with make_request and __get_request_wait being the biggest hogs. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 15:00 ` Ingo Molnar 2001-01-09 15:27 ` Stephen C. Tweedie @ 2001-01-09 15:38 ` Benjamin C.R. LaHaise 2001-01-09 16:40 ` Ingo Molnar 2001-01-09 17:53 ` Christoph Hellwig 1 sibling, 2 replies; 101+ messages in thread From: Benjamin C.R. LaHaise @ 2001-01-09 15:38 UTC (permalink / raw) To: Ingo Molnar Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel On Tue, 9 Jan 2001, Ingo Molnar wrote: > > On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: > > > > please study the networking portions of the zerocopy patch and you'll see > > > why this is not desirable. An alloc_kiovec()/free_kiovec() is exactly the > > > thing we cannot afford in a sendfile() operation. sendfile() is > > > lightweight, the setup times of kiovecs are not. > > > > > Right. However, kiobufs can be kept around for as long as you want > > and can be reused easily, and even if allocating and freeing them is > > more work than you want, populating an existing kiobuf is _very_ > > cheap. > > we do have SLAB [which essentially caches structures, on a per-CPU basis] > which i did take into account, but still, initializing a 600+ byte kiovec > is probably more work than the rest of sending a packet! I mean i'd love > to eliminate the 200+ bytes skb initialization as well, it shows up. Do the math again: for transmitting a single page in a kiobuf only 64 bytes needs to be initialized. If map_array is moved to the end of the structure, that's all contiguous data and is a single cacheline. What you're completely ignoring is that sendpages is lacking a huge amount of functionality that is *needed*. I can't implement clean async io on top of sendpages -- it'll require keeping 1 task around per outstanding io, which is exactly the bottleneck we're trying to work around. > The fact that we're using single-page interfaces doesnt preclude us from > having nicely clustered requests, this is what IO-plugging is about! It does waste a significant amount of CPU cycles trying to reassemble io requests and is not deterministic. Unplugging the io queue is a real pain with async io. -ben - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
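Ben's cacheline argument, made concrete: if the static page array sits at the tail of the structure, everything a single-page transfer has to initialize is contiguous at the head. The field set below only approximates the 2.4 kiobuf; the exact layout is an assumption for illustration:

/*
 * Sketch of the reordering implied above; not the actual kiobuf.
 */
struct kiobuf_reordered {
        /* hot: all a single-page transfer must touch, one cacheline */
        int             nr_pages;       /* pages actually referenced */
        unsigned int    offset;         /* offset to start of valid data */
        unsigned int    length;         /* number of valid bytes */
        struct page     **maplist;      /* usually points at map_array */
        int             errno;          /* status of completed IO */
        void            (*end_io)(struct kiobuf_reordered *);

        /* cold tail: the static map only matters for multi-page IO */
        struct page     *map_array[KIO_STATIC_PAGES];
};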
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 15:38 ` Benjamin C.R. LaHaise @ 2001-01-09 16:40 ` Ingo Molnar 2001-01-09 17:30 ` Benjamin C.R. LaHaise 2001-01-09 17:53 ` Christoph Hellwig 1 sibling, 1 reply; 101+ messages in thread From: Ingo Molnar @ 2001-01-09 16:40 UTC (permalink / raw) To: Benjamin C.R. LaHaise Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote: > Do the math again: for transmitting a single page in a kiobuf only 64 > bytes needs to be initialized. If map_array is moved to the end of > the structure, that's all contiguous data and is a single cacheline. but you are comparing apples to oranges: an iobuf to a fragment-array. A fragment-array is equivalent to an array of iobufs. In typical (eg. HTTP) output we have mixed sendfile() and sendmsg() based output, so we have an array of (page, offset, size) memory-areas, not a (initial_offset, page[]) array like kiobufs. The closest thing would be an array of kiobufs (where each kiobuf would use a single page only). this is why i meant that *right now* kiobufs are not suited for networking, at least the way we do it. Maybe if kiobufs had the same kind of internal structure as sk_frag (ie. array of (page,offset,size) triples, not array of pages), that would work out better. > What you're completely ignoring is that sendpages is lacking a huge > amount of functionality that is *needed*. I can't implement clean > async io on top of sendpages -- it'll require keeping 1 task around > per outstanding io, which is exactly the bottleneck we're trying to > work around. Please take a look at the next release of TUX. Probably the last missing piece was that i added O_NONBLOCK to generic_file_read() && sendfile(), so not fully cached requests can be offloaded to IO threads. Otherwise the current lowlevel filesystem infrastructure is not suited for implementing "process-less async IO" - and kiovecs wont be able to help that either. Unless we implement async, IRQ-driven bmap(), we'll always need some sort of process context to set up IO. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 16:40 ` Ingo Molnar @ 2001-01-09 17:30 ` Benjamin C.R. LaHaise 2001-01-09 18:12 ` Stephen C. Tweedie 2001-01-09 18:35 ` Ingo Molnar 0 siblings, 2 replies; 101+ messages in thread From: Benjamin C.R. LaHaise @ 2001-01-09 17:30 UTC (permalink / raw) To: Ingo Molnar Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel On Tue, 9 Jan 2001, Ingo Molnar wrote: > this is why i meant that *right now* kiobufs are not suited for networking, > at least the way we do it. Maybe if kiobufs had the same kind of internal > structure as sk_frag (ie. array of (page,offset,size) triples, not array > of pages), that would work out better. That I can agree with, and it would make my life easier since I really only care about the completion of an entire io, not the individual fragments of it. > Please take a look at the next release of TUX. Probably the last missing piece > was that i added O_NONBLOCK to generic_file_read() && sendfile(), so not > fully cached requests can be offloaded to IO threads. > > Otherwise the current lowlevel filesystem infrastructure is not suited for > implementing "process-less async IO" - and kiovecs wont be able to help > that either. Unless we implement async, IRQ-driven bmap(), we'll always > need some sort of process context to set up IO. I've already got fully async read and write working via a helper thread for doing the bmaps when the page is not uptodate in the page cache. The primitives for async locking of pages and waiting on events exist, such that converting ext2 to perform full async bmap should be trivial. Note that O_NONBLOCK is not good enough because you can't implement an asynchronous O_SYNC write with it. -ben - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 17:30 ` Benjamin C.R. LaHaise @ 2001-01-09 18:12 ` Stephen C. Tweedie 2001-01-09 18:35 ` Ingo Molnar 1 sibling, 0 replies; 101+ messages in thread From: Stephen C. Tweedie @ 2001-01-09 18:12 UTC (permalink / raw) To: Benjamin C.R. LaHaise Cc: Ingo Molnar, Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel Hi, On Tue, Jan 09, 2001 at 12:30:39PM -0500, Benjamin C.R. LaHaise wrote: > On Tue, 9 Jan 2001, Ingo Molnar wrote: > > > this is why i ment that *right now* kiobufs are not suited for networking, > > at least the way we do it. Maybe if kiobufs had the same kind of internal > > structure as sk_frag (ie. array of (page,offset,size) triples, not array > > of pages), that would work out better. > > That I can agree with, and it would make my life easier since I really > only care about the completion of an entire io, not the individual > fragments of it. Right, but this is why the kiobuf IO functions are supposed to accept kiovecs (ie. counted vectors of kiobuf *s, just like ll_rw_block receives buffer_heads). The kiobuf is supposed to be a unit of memory, not of IO. You can map several different kiobufs from different sources and send them all together to brw_kiovec() as a single IO. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 17:30 ` Benjamin C.R. LaHaise 2001-01-09 18:12 ` Stephen C. Tweedie @ 2001-01-09 18:35 ` Ingo Molnar 1 sibling, 0 replies; 101+ messages in thread From: Ingo Molnar @ 2001-01-09 18:35 UTC (permalink / raw) To: Benjamin C.R. LaHaise Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote: > I've already got fully async read and write working via a helper thread ^^^^^^^^^^^^^^^^^^^ > for doing the bmaps when the page is not uptodate in the page cache. ^^^^^^^^^^^^^^^^^^^ thats what TUX 2.0 does. (it does async reads at the moment.) > The primitives for async locking of pages and waiting on events exist, such > that converting ext2 to perform full async bmap should be trivial. well - if you think it's trivial (ie. no process context, no helper thread will be needed), more power to you. How are you going to assure that the issuing process does not block during the bmap()? [without extensive lowlevel-FS changes that is.] > Note that O_NONBLOCK is not good enough because you can't implement an > asynchronous O_SYNC write with it. (i'm using it for reads only.) Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 15:38 ` Benjamin C.R. LaHaise 2001-01-09 16:40 ` Ingo Molnar @ 2001-01-09 17:53 ` Christoph Hellwig 1 sibling, 0 replies; 101+ messages in thread From: Christoph Hellwig @ 2001-01-09 17:53 UTC (permalink / raw) To: Benjamin C.R. LaHaise Cc: Ingo Molnar, Stephen C. Tweedie, David S. Miller, riel, netdev, linux-kernel On Tue, Jan 09, 2001 at 10:38:30AM -0500, Benjamin C.R. LaHaise wrote: > What you're completely ignoring is that sendpages is lacking a huge amount > of functionality that is *needed*. I can't implement clean async io on > top of sendpages -- it'll require keeping 1 task around per outstanding > io, which is exactly the bottleneck we're trying to work around. Yepp. That's why I proposed to use rw_kiovec. Currently Alexey seems to have his own hack for socket-only asynch IO with some COW semantics for the userlevel buffers, but I would much prefer a generic version... Christoph P.S. Any chance to find a new version of your aio-patch somewhere? -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 14:25 ` Stephen C. Tweedie 2001-01-09 14:33 ` Alan Cox 2001-01-09 15:00 ` Ingo Molnar @ 2001-01-09 21:13 ` David S. Miller 2 siblings, 0 replies; 101+ messages in thread From: David S. Miller @ 2001-01-09 21:13 UTC (permalink / raw) To: sct; +Cc: mingo, hch, riel, netdev, linux-kernel, sct Date: Tue, 9 Jan 2001 14:25:42 +0000 From: "Stephen C. Tweedie" <sct@redhat.com> Perhaps tcp can merge internal 4K requests, but if you're doing udp jumbograms (or STP or VIA), you do need an interface which can give the networking stack more than one page at once. All network protocols can use the current interface and get the result you are after, see MSG_MORE. TCP isn't "special" in this regard. Later, David S. Miller davem@redhat.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
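For reference, the userspace shape of what Dave is pointing at (a sketch, error handling trimmed): MSG_MORE marks the header as "more data follows", so the stack may merge it with the subsequent sendfile() data into full packets instead of pushing out a short segment.

/* Sketch: protocol header plus file body, using MSG_MORE and sendfile(2). */
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>

int send_reply(int sock, int filefd, off_t filelen, const char *hdr)
{
        off_t off = 0;

        /* queue the header, telling the stack more data follows */
        if (send(sock, hdr, strlen(hdr), MSG_MORE) < 0)
                return -1;

        /* the kernel can coalesce this with the queued header */
        while (off < filelen)
                if (sendfile(sock, filefd, &off, filelen - off) < 0)
                        return -1;
        return 0;
}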
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 11:28 ` Christoph Hellwig 2001-01-09 11:42 ` David S. Miller 2001-01-09 12:04 ` Ingo Molnar @ 2001-01-09 19:14 ` Linus Torvalds 2001-01-09 20:07 ` Ingo Molnar 2001-01-12 1:42 ` Stephen C. Tweedie 2 siblings, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2001-01-09 19:14 UTC (permalink / raw) To: linux-kernel In article <20010109122810.A3115@caldera.de>, Christoph Hellwig <hch@caldera.de> wrote: > >You get that multiple page call with kiobufs for free... No, you don't. kiobufs are crap. Face it. They do NOT allow proper multi-page scatter gather, regardless of what the kiobuf PR department has said. I've complained about it before, and nobody listened. David's zero-copy network code had the same bug. I complained about it to David, and David took about a day to understand my arguments, and fixed it. It's more likely that the zero-copy network code will be used in real life than kiobufs will ever be. The kiobufs are damn ugly by comparison, and the fact that the kiobuf people don't even seem to realize the problems makes me just more convinced that it's not worth even arguing about. What is the problem with kiobuf's? Simple: they have a "offset" and a "length", and an array of pages. What that completely and utterly misses is that if you have an array of pages, you should have an array of "offset" and "length" too. As it is, kiobuf's cannot be used for things like readv() and writev(). Yes, to work around this limitation, there's the notion of "kiovec", an array of kiobuf's. Never mind the fact that if kiobuf's had been properly designed in the first place, you wouldn't need kiovec's at all. And kiovec's are too damn heavy to use for something like the networking zero-copy, with all the double indirection etc. I told David that he can fix the network zero-copy code two ways: either he makes it _truly_ scatter-gather (an array of not just pages, but of proper page-offset-length tuples), or he makes it just a single area and lets the low-level TCP/whatever code build up multiple segments internally. Either of which are good designs. It so happens that none of the users actually wanted multi-page scatter-gather, and the only thing that really wanted to do the sg was the networking layer when it created a single packet out of multiple areas, so the zero-copy stuff uses the simpler non-array interface. And kiobufs can rot in hell for their design mistakes. Maybe somebody will listen some day and fix them up, and in the meantime they can look at the networking code for an example of how to do it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 19:14 ` Linus Torvalds @ 2001-01-09 20:07 ` Ingo Molnar 2001-01-09 20:15 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Ingo Molnar @ 2001-01-09 20:07 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On 9 Jan 2001, Linus Torvalds wrote: > I told David that he can fix the network zero-copy code two ways: either > he makes it _truly_ scatter-gather (an array of not just pages, but of > proper page-offset-length tuples), or he makes it just a single area and > lets the low-level TCP/whatever code build up multiple segments > internally. Either of which are good designs. it's actually truly zero-copy internally, we use an array of (page,offset,length) tuples, with proper per-page usage counting. We did this for more than half a year. I believe the array-of-pages solution you refer to went only from the pagecache layer into the highest level of TCP - then it got converted into the internal representation. These tuples right now do not have their own life, they are always associated with actual outgoing packets (and in fact are allocated together with skb's and are at the end of the header area). the lowlevel networking drivers (and even midlevel networking code) know nothing about kiovecs or arrays of pages, it's using the array-of-tuples representation:

typedef struct skb_frag_struct skb_frag_t;

struct skb_frag_struct {
        struct page *page;
        __u16 page_offset;
        __u16 size;
};

/* This data is invariant across clones and lives at
 * the end of the header data, ie. at skb->end.
 */
struct skb_shared_info {
        atomic_t        dataref;
        unsigned int    nr_frags;
        struct sk_buff  *frag_list;
        skb_frag_t      frags[MAX_SKB_FRAGS];
};

(the __u16 thing is more of a cache footprint paranoia than real necessity, it could be int as well.). So i do believe that the networking code is properly designed in this respect, and this concept goes to the highest level of the networking code. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
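To show how a driver consumes this: a sg-capable transmit routine walks exactly that fragment array when filling its hardware descriptors. A sketch only; hw_add_buffer() is a placeholder for the descriptor setup, the bus-address mapping is elided, and skb->data_len counting the fragment bytes is an assumption about the zerocopy patch's conventions:

/*
 * Sketch: consuming the skb_shared_info fragment array in a driver.
 */
extern void hw_add_buffer(void *addr, unsigned int len);  /* placeholder */

static void xmit_skb_frags(struct sk_buff *skb)
{
        struct skb_shared_info *info = (struct skb_shared_info *)skb->end;
        unsigned int i;

        /* the linear header part of the packet first */
        hw_add_buffer(skb->data, skb->len - skb->data_len);

        /* then every (page, offset, size) fragment, copy-free */
        for (i = 0; i < info->nr_frags; i++) {
                skb_frag_t *f = &info->frags[i];

                hw_add_buffer((char *)page_address(f->page) + f->page_offset,
                              f->size);
        }
}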
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 20:07 ` Ingo Molnar @ 2001-01-09 20:15 ` Linus Torvalds 2001-01-09 20:36 ` Christoph Hellwig 0 siblings, 1 reply; 101+ messages in thread From: Linus Torvalds @ 2001-01-09 20:15 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel On Tue, 9 Jan 2001, Ingo Molnar wrote: > > So i do believe that the networking > code is properly designed in this respect, and this concept goes to the > highest level of the networking code. Absolutely. This is why I have no conceptual problems with the networking layer changes, and why I am in violent disagreement with people who think the networking layer should have used the (much inferior, in my opinion) kiobuf/kiovec approach. For people who worry about code re-use and argue for kiobuf/kiovec on those grounds, I can only say that the code re-use should go the other way. It should be "the bad code should re-use code from the good code". It should NOT be "the new code should re-use code from the old code". Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 20:15 ` Linus Torvalds @ 2001-01-09 20:36 ` Christoph Hellwig 2001-01-09 20:55 ` Linus Torvalds 0 siblings, 1 reply; 101+ messages in thread From: Christoph Hellwig @ 2001-01-09 20:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: migo, linux-kernel In article <Pine.LNX.4.10.10101091212520.2331-100000@penguin.transmeta.com> you wrote: > On Tue, 9 Jan 2001, Ingo Molnar wrote: >> >> So i do believe that the networking >> code is properly designed in this respect, and this concept goes to the >> highest level of the networking code. > Absolutely. This is why I have no conceptual problems with the networking > layer changes, and why I am in violent disagreement with people who think > the networking layer should have used the (much inferior, in my opinion) > kiobuf/kiovec approach. At least I (who started this thread) haven't said they should use kiobufs internally. I said: use kiovecs in the interface, because this interface is a little more general and allows it to integrate into other parts (namely Ben's aio work) nicely. Also the tuple argument you gave earlier isn't right in this specific case: when doing sendfile from pagecache to an fs, you have a bunch of pages, an offset in the first and a length that makes the data end before the last page's end. > For people who worry about code re-use and argue for kiobuf/kiovec on > those grounds, I can only say that the code re-use should go the other > way. It should be "the bad code should re-use code from the good code". It > should NOT be "the new code should re-use code from the old code". It's not really about reusing, but about compatibility with other interfaces... Christoph -- Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 20:36 ` Christoph Hellwig @ 2001-01-09 20:55 ` Linus Torvalds 2001-01-09 21:12 ` Christoph Hellwig 2001-01-09 23:06 ` Benjamin C.R. LaHaise 0 siblings, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2001-01-09 20:55 UTC (permalink / raw) To: Christoph Hellwig; +Cc: migo, linux-kernel On Tue, 9 Jan 2001, Christoph Hellwig wrote: > > Also the tuple argument you gave earlier isn't right in this specific case: > > when doing sendfile from pagecache to an fs, you have a bunch of pages, > an offset in the first and a length that makes the data end before the last > page's end. No. Look at sendfile(). You do NOT have a "bunch" of pages. Sendfile() is very much a page-at-a-time thing, and expects the actual IO layers to do its own scatter-gather. So sendfile() doesn't want any array at all: it only wants a single page-offset-length tuple interface. The _lower-level_ stuff (ie TCP and the drivers) want the "array of tuples", and again, they do NOT want an array of pages, because if somebody does two sendfile() calls that fit in one packet, it really needs an array of tuples. In short, the kiobuf interface is _always_ the wrong one. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 20:55 ` Linus Torvalds @ 2001-01-09 21:12 ` Christoph Hellwig 2001-01-09 21:26 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Christoph Hellwig @ 2001-01-09 21:12 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christoph Hellwig, migo, linux-kernel On Tue, Jan 09, 2001 at 12:55:51PM -0800, Linus Torvalds wrote: > > > On Tue, 9 Jan 2001, Christoph Hellwig wrote: > > > > Also the tuple argument you gave earlier isn't right in this specific case: > > > > when doing sendfile from pagecache to an fs, you have a bunch of pages, > > an offset in the first and a length that makes the data end before the last > > page's end. > > No. > > Look at sendfile(). You do NOT have a "bunch" of pages. > > Sendfile() is very much a page-at-a-time thing, and expects the actual IO > layers to do its own scatter-gather. > > So sendfile() doesn't want any array at all: it only wants a single > page-offset-length tuple interface. The current implementation does. But others are possible. I could post one in a few days to show that it is possible. Christoph -- Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 21:12 ` Christoph Hellwig @ 2001-01-09 21:26 ` Linus Torvalds 2001-01-10 7:42 ` Christoph Hellwig 0 siblings, 1 reply; 101+ messages in thread From: Linus Torvalds @ 2001-01-09 21:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: migo, linux-kernel On Tue, 9 Jan 2001, Christoph Hellwig wrote: > > > > Look at sendfile(). You do NOT have a "bunch" of pages. > > > > Sendfile() is very much a page-at-a-time thing, and expects the actual IO > > layers to do its own scatter-gather. > > > > So sendfile() doesn't want any array at all: it only wants a single > > page-offset-length tuple interface. > > The current implementation does. > But others are possible. I could post one in a few days to show that it is > possible. Why do you bother arguing, when I've shown you that even if sendfile() _did_ do multiple pages, it STILL wouldn't make kiobuf's the right interface. You just snipped out that part of my email, which states that the networking layer would still need to do better scatter-gather than kiobuf's can give it for multiple send-file invocations. Let me iterate: - the layers like TCP _need_ to do scatter-gather anyway: you absolutely want to be able to send out just one packet even if the data comes from two different sources (for example, one source might be the http header, while the other source is the actual file contents. This is definitely not a made-up example, this is THE example of something like this, and happens with just about all protocols that have a notion of a header, which is pretty much 100% of them). - because TCP needs to do scatter-gather anyway across calls, there is no real reason for sendfile() to do it. And sendfile() doing it would _not_ obviate the need for it in the networking layer - it would only add complexity for absolutely no performance gain. So neither sendfile _nor_ the networking layer want kiobuf's. Never have, never will. The "half-way scatter-gather" support they give ends up either being too much baggage, or too little. It's never the right fit. kiovec adds the support for true scatter-gather, but with a horribly bad interface, and much too much overhead - and absolutely NO advantages over the _proper_ array of <page,offset,length> tuples, which is much simpler than the complex two-level arrays that you get with kiovec+kiobuf. End of story. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-09 21:26 ` Linus Torvalds @ 2001-01-10 7:42 ` Christoph Hellwig 2001-01-10 8:05 ` Linus Torvalds 2001-01-17 14:05 ` Rik van Riel 0 siblings, 2 replies; 101+ messages in thread From: Christoph Hellwig @ 2001-01-10 7:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christoph Hellwig, migo, linux-kernel On Tue, Jan 09, 2001 at 01:26:44PM -0800, Linus Torvalds wrote: > > > On Tue, 9 Jan 2001, Christoph Hellwig wrote: > > > > > > Look at sendfile(). You do NOT have a "bunch" of pages. > > > > > > Sendfile() is very much a page-at-a-time thing, and expects the actual IO > > > layers to do its own scatter-gather. > > > > > > So sendfile() doesn't want any array at all: it only wants a single > > > page-offset-length tuple interface. > > > > The current implementation does. > > But others are possible. I could post one in a few days to show that it is > > possible. > > Why do you bother arguing, when I've shown you that even if sendfile() > _did_ do multiple pages, it STILL wouldn't make kiobuf's the right > interface. You just snipped out that part of my email, which states that > the networking layer would still need to do better scatter-gather than > kiobuf's can give it for multiple send-file invocations. Simple. Because I stated before that I DON'T even want the networking to use kiobufs in lower layers. My whole argument is to pass a kiovec into the fileop instead of a page, because it makes sense for other drivers to use multiple pages, and doesn't hurt networking besides the cost of one kiobuf (116k) and the processor cycles for creating and destroying it once per sys_sendfile. Christoph -- Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-10 7:42 ` Christoph Hellwig @ 2001-01-10 8:05 ` Linus Torvalds 2001-01-10 8:33 ` Christoph Hellwig 2001-01-10 8:37 ` Andrew Morton 2001-01-17 14:05 ` Rik van Riel 1 sibling, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2001-01-10 8:05 UTC (permalink / raw) To: Christoph Hellwig; +Cc: migo, linux-kernel On Wed, 10 Jan 2001, Christoph Hellwig wrote: > > Simple. Because I stated before that I DON'T even want the networking > to use kiobufs in lower layers. My whole argument is to pass a kiovec > into the fileop instead of a page, because it makes sense for other > drivers to use multiple pages, and doesn't hurt networking besides > the cost of one kiobuf (116k) and the processor cycles for creating > and destroying it once per sys_sendfile. Fair enough. My whole argument against that is that I think kiovec's are incredibly ugly, and the less I see of them in critical regions, the happier I am. And that, I have to admit, is really mostly a matter of "taste". De gustibus non disputandum. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-10 8:05 ` Linus Torvalds @ 2001-01-10 8:33 ` Christoph Hellwig 0 siblings, 0 replies; 101+ messages in thread From: Christoph Hellwig @ 2001-01-10 8:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: migo, linux-kernel On Wed, Jan 10, 2001 at 12:05:01AM -0800, Linus Torvalds wrote: > > > On Wed, 10 Jan 2001, Christoph Hellwig wrote: > > > > Simple. Because I stated before that I DON'T even want the networking > > to use kiobufs in lower layers. My whole argument is to pass a kiovec > > into the fileop instead of a page, because it makes sense for other > > drivers to use multiple pages, and doesn't hurt networking besides > > the cost of one kiobuf (116k) and the processor cycles for creating > > and destroying it once per sys_sendfile. > > Fair enough. > > My whole argument against that is that I think kiovec's are incredibly > ugly, and the less I see of them in critical regions, the happier I am. > > And that, I have to admit, is really mostly a matter of "taste". Ok. That statement makes all the current kiobuf efforts look a lot less interesting than before. IMHO it's time to find a generic interface for IO that is acceptable to you and widely usable. As you stated before, that seems to be something with page,offset,length tuples. Christoph -- Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-10 8:05 ` Linus Torvalds 2001-01-10 8:33 ` Christoph Hellwig @ 2001-01-10 8:37 ` Andrew Morton 2001-01-10 23:32 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Andrew Morton @ 2001-01-10 8:37 UTC (permalink / raw) To: linux-kernel Linus Torvalds wrote: > > De gustibus non disputandum. http://cogprints.soton.ac.uk/documents/disk0/00/00/07/57/ "ingestion of the afterbirth during delivery" eh? http://www.degustibus.co.uk/ "Award winning artisan breadmakers." Ah. That'll be it. - - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-10 8:37 ` Andrew Morton @ 2001-01-10 23:32 ` Linus Torvalds 2001-01-19 15:55 ` Andrew Scott 0 siblings, 1 reply; 101+ messages in thread From: Linus Torvalds @ 2001-01-10 23:32 UTC (permalink / raw) To: linux-kernel In article <3A5C1F64.99C611F2@uow.edu.au>, Andrew Morton <andrewm@uow.edu.au> wrote: >Linus Torvalds wrote: >> >> De gustibus non disputandum. > >http://cogprints.soton.ac.uk/documents/disk0/00/00/07/57/ > > "ingestion of the afterbirth during delivery" > >eh? > > >http://www.degustibus.co.uk/ > > "Award winning artisan breadmakers." > >Ah. That'll be it. Latin 101. Literally "about taste no argument". I suspect that it _should_ be "De gustibus non disputandum est", but it's been too many years. That adds the required verb ("is") to make it a full sentence. In English: "There is no arguing taste". Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-10 23:32 ` Linus Torvalds @ 2001-01-19 15:55 ` Andrew Scott 0 siblings, 0 replies; 101+ messages in thread From: Andrew Scott @ 2001-01-19 15:55 UTC (permalink / raw) To: Linus Torvalds, linux-kernel On 10 Jan 2001, at 15:32, Linus Torvalds wrote: > Latin 101. Literally "about taste no argument". Or "about taste no argument there is" if you add the 'est', which still makes sense in english, in a twisted (convoluted as opposed to 'bad' or 'sick') way. Q.E.D. > I suspect that it _should_ be "De gustibus non disputandum est", but > it's been too many years. That adds the required verb ("is") to make it > a full sentence. > > In English: "There is no arguing taste". > > Linus ------------------Mailed via Pegasus 3.12c & Mercury 1.48--------------- A.J.Scott@casdn.neu.edu Fax (617)373-2942 Andrew Scott Tel (617)373-5278 _ Northeastern University--138 Meserve Hall / \ / College of Arts & Sciences-Deans Office / \ \ / Boston, Ma. 02115 / \_/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-10 7:42 ` Christoph Hellwig 2001-01-10 8:05 ` Linus Torvalds @ 2001-01-17 14:05 ` Rik van Riel 2001-01-18 0:53 ` Christoph Hellwig 1 sibling, 1 reply; 101+ messages in thread From: Rik van Riel @ 2001-01-17 14:05 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Linus Torvalds, migo, linux-kernel On Wed, 10 Jan 2001, Christoph Hellwig wrote: > Simple. Because I stated before that I DON'T even want the > networking to use kiobufs in lower layers. My whole argument is > to pass a kiovec into the fileop instead of a page, because it > makes sense for other drivers to use multiple pages, Now wouldn't it be great if we had one type of data structure that would work for both the network layer and the block layer (and v4l, ...) ? If we constantly need to convert between zerocopy metadata types, I'm sure we'll lose most of the performance gain we started this whole idea for in the first place. cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-17 14:05 ` Rik van Riel @ 2001-01-18 0:53 ` Christoph Hellwig 2001-01-18 1:13 ` Linus Torvalds 0 siblings, 1 reply; 101+ messages in thread From: Christoph Hellwig @ 2001-01-18 0:53 UTC (permalink / raw) To: Rik van Riel; +Cc: Christoph Hellwig, Linus Torvalds, mingo, linux-kernel On Thu, Jan 18, 2001 at 01:05:43AM +1100, Rik van Riel wrote: > On Wed, 10 Jan 2001, Christoph Hellwig wrote: > > > Simple. Because I stated before that I DON'T even want the > > networking to use kiobufs in lower layers. My whole argument is > > to pass a kiovec into the fileop instead of a page, because it > > makes sense for other drivers to use multiple pages, > > Now wouldn't it be great if we had one type of data > structure that would work for both the network layer > and the block layer (and v4l, ...) ? Sure it would be nice, and IIRC that was what the kiobuf stuff was designed for. But it looks like it doesn't do well for the networking (and maybe other) guys. That means we have to find something that might be worth paying a little overhead for in all layers, but that on the other hand is usable everywhere. So after the last flame^H^H^H^H^Hthread I've come up with the following structures:

/*
 * a simple page,offset,length tuple like Linus wants it
 */
struct kiobuf2 {
        struct page *   page;   /* The page itself */
        u_int16_t       offset; /* Offset to start of valid data */
        u_int16_t       length; /* Number of valid bytes of data */
};

/*
 * A container for the tuples - it is actually pretty similar to the old
 * kiobuf, but on the other hand allows SG
 */
struct kiovec2 {
        int             nbufs;          /* Kiobufs actually referenced */
        int             array_len;      /* Space in the allocated lists */
        struct kiobuf * bufs;

        unsigned int    locked : 1;     /* If set, pages have been locked */

        /* Always embed enough struct pages for 64k of IO */
        struct kiobuf * buf_array[KIO_STATIC_PAGES];

        /* Private data */
        void *          private;

        /* Dynamic state for IO completion: */
        atomic_t        io_count;       /* IOs still in progress */
        int             errno;

        /* Status of completed IO */
        void            (*end_io) (struct kiovec *);    /* Completion callback */
        wait_queue_head_t wait_queue;
};

We don't need the page-length/offset in the usual block-io path, but on the other hand, if we get a common interface for it... Christoph -- Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-18 0:53 ` Christoph Hellwig @ 2001-01-18 1:13 ` Linus Torvalds 2001-01-18 17:50 ` Christoph Hellwig 2001-01-18 21:12 ` Albert D. Cahalan 0 siblings, 2 replies; 101+ messages in thread From: Linus Torvalds @ 2001-01-18 1:13 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Rik van Riel, mingo, linux-kernel On Thu, 18 Jan 2001, Christoph Hellwig wrote:
>
> /*
>  * a simple page,offset,length tuple like Linus wants it
>  */
> struct kiobuf2 {
>         struct page *   page;   /* The page itself */
>         u_int16_t       offset; /* Offset to start of valid data */
>         u_int16_t       length; /* Number of valid bytes of data */
> };

Please use "u16". Or "__u16" if you want to export it to user space.

> struct kiovec2 {
>         int             nbufs;          /* Kiobufs actually referenced */
>         int             array_len;      /* Space in the allocated lists */
>         struct kiobuf * bufs;

Any reason for array_len?

Why not just

        int nbufs,
        struct kiobuf *bufs;

Remember: simplicity is a virtue. Simplicity is also what makes it usable for people who do NOT want to have huge overhead.

>         unsigned int    locked : 1;     /* If set, pages have been locked */

Remove this. I don't think it's valid to lock the pages. Who wants to use this anyway?

>         /* Always embed enough struct pages for 64k of IO */
>         struct kiobuf * buf_array[KIO_STATIC_PAGES];

Kill kill kill kill. If somebody wants to embed a kiovec into their own data structure, THEY can decide to add their own buffers etc. A fundamental data structure should _never_ make assumptions like this.

>         /* Private data */
>         void *          private;
>
>         /* Dynamic state for IO completion: */
>         atomic_t        io_count;       /* IOs still in progress */

What is io_count used for?

>         int             errno;
>
>         /* Status of completed IO */
>         void            (*end_io) (struct kiovec *);   /* Completion callback */
>         wait_queue_head_t wait_queue;

I suspect all of the above ("private", "end_io" etc) should be at a higher layer. Not everybody will necessarily need them. Remember: if this is to be well designed, we want to have the data structures to pass down to low-level drivers etc, that may not want or need a lot of high-level stuff. You should not pass down more than the driver really needs. In the end, the only thing you _know_ a driver will need (assuming that it wants these kinds of buffers) is just

        int nbufs;
        struct kiobuf *bufs;

That's kind of the minimal set. That should be one level of abstraction in its own right. Never over-design. Never think "Hmm, maybe somebody would find this useful". Start from what you know people _have_ to have, and try to make that set smaller. When you can make it no smaller, you've reached one point. That's a good point to start from - use that for some real implementation. Once you've gotten that far, you can see how well you can embed the lower layers into higher layers. That does _not_ mean that the lower layers should know about the high-level data structures. Try to avoid pushing down abstractions too far. Maybe you'll want to push down the error code. But maybe not. And you should NOT link the callback with the vector of IO's: you may find (in fact, I bet you _will_ find), that the lowest level will want a callback to call up to when it is ready, and that layer may want _another_ callback to call up to higher levels. Imagine, for example, the network driver telling the IP layer that "ok, packet sent". That's _NOT_ the same callback as the TCP layer telling the upper layers that the packet data has been sent and successfully acknowledged, and that the data structures can be free'd now. 
They are at two completely different levels of abstraction, and one level needing something doesn't mean that the other level should necessarily even care. Don't imagine that everybody wants the same data structure, and that that data structure should thus be very generic. Genericity kills good ideas. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
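The two-level callback separation of that example, sketched with invented types: the driver completes to the midlayer only, and the midlayer keeps its own state to decide when the whole operation completes to the caller.

/*
 * Sketch of layered completion; every name here is invented.
 */
struct lowlevel_io {
        int nbufs;
        struct kiobuf2 *bufs;                   /* tuples, as proposed above */
        void (*done)(struct lowlevel_io *);     /* driver -> midlayer */
        void *up;                               /* opaque upper-level handle */
};

struct highlevel_op {
        atomic_t pending;                       /* lowlevel IOs in flight */
        void (*complete)(struct highlevel_op *); /* midlayer -> caller */
};

static void midlayer_done(struct lowlevel_io *io)
{
        struct highlevel_op *op = io->up;

        if (atomic_dec_and_test(&op->pending))
                op->complete(op);               /* whole operation finished */
}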
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-18 1:13 ` Linus Torvalds @ 2001-01-18 17:50 ` Christoph Hellwig 2001-01-18 18:04 ` Linus Torvalds 1 sibling, 1 reply; 101+ messages in thread From: Christoph Hellwig @ 2001-01-18 17:50 UTC (permalink / raw) To: Linus Torvalds; +Cc: Rik van Riel, mingo, linux-kernel, kiobuf-io-devel On Wed, Jan 17, 2001 at 05:13:31PM -0800, Linus Torvalds wrote:
>
> On Thu, 18 Jan 2001, Christoph Hellwig wrote:
> >
> > /*
> >  * a simple page,offset,length tuple like Linus wants it
> >  */
> > struct kiobuf2 {
> >         struct page *   page;   /* The page itself */
> >         u_int16_t       offset; /* Offset to start of valid data */
> >         u_int16_t       length; /* Number of valid bytes of data */
> > };
>
> Please use "u16". Or "__u16" if you want to export it to user space.

Ok.

> > struct kiovec2 {
> >         int             nbufs;          /* Kiobufs actually referenced */
> >         int             array_len;      /* Space in the allocated lists */
> >         struct kiobuf * bufs;
>
> Any reason for array_len?

It's useful for the expand function - but with kiobufs as a secondary data structure it may no longer be necessary.

> Why not just
>
>         int nbufs,
>         struct kiobuf *bufs;
>
> Remember: simplicity is a virtue.
>
> Simplicity is also what makes it usable for people who do NOT want to have
> huge overhead.
>
> >         unsigned int    locked : 1;     /* If set, pages have been locked */
>
> Remove this. I don't think it's valid to lock the pages. Who wants to use
> this anyway?

E.g. in the block IO paths the pages have to be locked. It's also used by free_kiovec to see whether to do unlock_kiovec before.

> >         /* Always embed enough struct pages for 64k of IO */
> >         struct kiobuf * buf_array[KIO_STATIC_PAGES];
>
> Kill kill kill kill.
>
> If somebody wants to embed a kiovec into their own data structure, THEY
> can decide to add their own buffers etc. A fundamental data structure
> should _never_ make assumptions like this.

Ok.

> >         /* Private data */
> >         void *          private;
> >
> >         /* Dynamic state for IO completion: */
> >         atomic_t        io_count;       /* IOs still in progress */
>
> What is io_count used for?

In the current buffer_head based IO-scheme it is used to determine whether all bh requests are finished. It's obsolete once we pass kiobufs to the low-level drivers.

> >         int             errno;
> >
> >         /* Status of completed IO */
> >         void            (*end_io) (struct kiovec *);   /* Completion callback */
> >         wait_queue_head_t wait_queue;
>
> I suspect all of the above ("private", "end_io" etc) should be at a higher
> layer. Not everybody will necessarily need them.
>
> Remember: if this is to be well designed, we want to have the data
> structures to pass down to low-level drivers etc, that may not want or
> need a lot of high-level stuff. You should not pass down more than the
> driver really needs.
>
> In the end, the only thing you _know_ a driver will need (assuming that it
> wants these kinds of buffers) is just
>
>         int nbufs;
>         struct kiobuf *bufs;
>
> That's kind of the minimal set. That should be one level of abstraction in
> its own right.

Ok. Then we need an additional more or less generic object that is used for passing in a rw_kiovec file operation (and we really want that for many kinds of IO). It should mostly be used for communicating with the high-level driver. 
/*
 * the name is just plain stupid, but that shouldn't matter
 */
struct vfs_kiovec {
        struct kiovec * iov;

        /* private data, mostly for the callback */
        void * private;

        /* completion callback */
        void (*end_io) (struct vfs_kiovec *);
        wait_queue_head_t wait_queue;
};

Christoph -- Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
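A sketch of how a synchronous caller could drive that object, assuming an f_op->rw_kiovec(file, rw, vfs_kiovec, size, pos) operation with a submit-then-call-end_io convention; both the signature and the convention are assumptions here, since rw_kiovec is only a proposal in this thread:

/*
 * Sketch: synchronous wrapper around the proposed rw_kiovec fileop.
 */
struct sync_waiter {
        volatile int done;
};

static void sync_end_io(struct vfs_kiovec *v)
{
        struct sync_waiter *w = v->private;

        w->done = 1;
        wake_up(&v->wait_queue);
}

static int rw_kiovec_sync(struct file *file, int rw, struct kiovec *iov,
                          size_t size, loff_t pos)
{
        struct sync_waiter w = { 0 };
        struct vfs_kiovec v;
        int err;

        v.iov = iov;
        v.private = &w;
        v.end_io = sync_end_io;
        init_waitqueue_head(&v.wait_queue);

        err = file->f_op->rw_kiovec(file, rw, &v, size, pos);
        if (err)
                return err;

        wait_event(v.wait_queue, w.done);       /* sleep until end_io runs */
        return 0;
}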
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-18 17:50 ` Christoph Hellwig @ 2001-01-18 18:04 ` Linus Torvalds 0 siblings, 0 replies; 101+ messages in thread From: Linus Torvalds @ 2001-01-18 18:04 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Rik van Riel, mingo, linux-kernel, kiobuf-io-devel On Thu, 18 Jan 2001, Christoph Hellwig wrote: > > > > Remove this. I don't think it's valid to lock the pages. Who wants to use > > this anyway? > > E.g. in the block IO pathes the pages have to be locked. > It's also used by free_kiovec to see wether to do unlock_kiovec before. This is all MUCH higher level functionality, and probably bogus anyway. > > That's kind of the minimal set. That should be one level of abstraction in > > its own right. > > Ok. Then we need an additional more or less generic object that is used for > passing in a rw_kiovec file operation (and we really want that for many kinds > of IO). I thould mostly be used for communicating to the high-level driver. That's fine. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 2001-01-18 1:13 ` Linus Torvalds 2001-01-18 17:50 ` Christoph Hellwig @ 2001-01-18 21:12 ` Albert D. Cahalan 2001-01-19 1:52 ` 2.4.1-pre8 video/ohci1394 compile problem ebi4 2001-01-19 6:55 ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Linus Torvalds 1 sibling, 2 replies; 101+ messages in thread From: Albert D. Cahalan @ 2001-01-18 21:12 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christoph Hellwig, Rik van Riel, mingo, linux-kernel
>> struct kiovec2 {
>>         int             nbufs;          /* Kiobufs actually referenced */
>>         int             array_len;      /* Space in the allocated lists */
>>         struct kiobuf * bufs;
>
> Any reason for array_len?
>
> Why not just
>
>         int nbufs,
>         struct kiobuf *bufs;
>
> Remember: simplicity is a virtue.
>
> Simplicity is also what makes it usable for people who do NOT want to have
> huge overhead.
>
>>         unsigned int    locked : 1;     /* If set, pages have been locked */
>
> Remove this. I don't think it's valid to lock the pages. Who wants to use
> this anyway?
>
>>         /* Always embed enough struct pages for 64k of IO */
>>         struct kiobuf * buf_array[KIO_STATIC_PAGES];
>
> Kill kill kill kill.
>
> If somebody wants to embed a kiovec into their own data structure, THEY
> can decide to add their own buffers etc. A fundamental data structure
> should _never_ make assumptions like this.

What about getting rid of both that and the pointer, and just hanging that data on the end as a variable length array?

struct kiovec2 {
        int nbufs;
        /* ... */
        struct kiobuf bufs[0];
};

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 101+ messages in thread
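For what it's worth, allocating that trailing-array form is a single kmalloc (a sketch; the zero-length array is the GNU C extension the kernel already uses elsewhere):

/* Sketch: one-shot allocation of the trailing-array kiovec2. */
static struct kiovec2 *alloc_kiovec2(int n)
{
        struct kiovec2 *kv;

        kv = kmalloc(sizeof(*kv) + n * sizeof(struct kiobuf), GFP_KERNEL);
        if (kv)
                kv->nbufs = n;
        return kv;      /* bufs[] follows the header in the same block */
}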
* 2.4.1-pre8 video/ohci1394 compile problem
From: ebi4 @ 2001-01-19 1:52 UTC
  To: linux-kernel

video1394.o(.data+0x0): multiple definition of `ohci_csr_rom'
ohci1394.o(.data+0x0): first defined here
make[3]: *** [ieee1394drv.o] Error 1

Compilation fails here.

::::: Gene Imes http://www.ozob.net :::::

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Linus Torvalds @ 2001-01-19 6:55 UTC
  To: linux-kernel

In article <200101182112.f0ILCmZ113705@saturn.cs.uml.edu>,
Albert D. Cahalan <acahalan@cs.uml.edu> wrote:
>
> What about getting rid of both that and the pointer, and just
> hanging that data on the end as a variable length array?
>
> struct kiovec2 {
>         int nbufs;
>         /* ... */
>         struct kiobuf bufs[0];
> };

If the struct ends up having lots of other fields, yes.

On the other hand, if one basic form of kiobuf's ends up being really
just the array and the number of elements, there are reasons not to do
this. One is that you can "peel" off parts of the buffer, and split it
up if (for example) your driver has some limitation to the number of
scatter-gather requests it can make. For example, you may have code
that looks roughly like

	.. int nr, struct kiobuf *buf ..

	while (nr > MAX_SEGMENTS) {
		lower_level(MAX_SEGMENTS, buf);
		nr -= MAX_SEGMENTS;
		buf += MAX_SEGMENTS;
	}
	lower_level(nr, buf);

which is rather awkward to do if you tie "nr" and the array too
closely together.

(Of course, the driver could just split them up - take it from the
structure and pass them down in the separated manner. I don't know
which level the separation is worth doing at, but I have this feeling
that if the structure ends up being _only_ the nbufs and bufs, they
should not be tied together.)

		Linus

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Benjamin C.R. LaHaise @ 2001-01-09 23:06 UTC
  To: Linus Torvalds; +Cc: Christoph Hellwig, mingo, linux-kernel

On Tue, 9 Jan 2001, Linus Torvalds wrote:

> The _lower-level_ stuff (ie TCP and the drivers) want the "array of
> tuples", and again, they do NOT want an array of pages, because if
> somebody does two sendfile() calls that fit in one packet, it really needs
> an array of tuples.

A kiobuf simply provides that tuple plus the completion callback. Stick
a bunch of them together and you've got a kiovec. I don't see the
advantage of moving to simpler primitives if they don't provide needed
functionality.

> In short, the kiobuf interface is _always_ the wrong one.

Please tell me what you think the right interface is that provides a
hook on io completion and is asynchronous.

		-ben

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Linus Torvalds @ 2001-01-09 23:54 UTC
  To: Benjamin C.R. LaHaise; +Cc: Christoph Hellwig, mingo, linux-kernel

On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote:
> On Tue, 9 Jan 2001, Linus Torvalds wrote:
>
> > The _lower-level_ stuff (ie TCP and the drivers) want the "array of
> > tuples", and again, they do NOT want an array of pages, because if
> > somebody does two sendfile() calls that fit in one packet, it really needs
> > an array of tuples.
>
> A kiobuf simply provides that tuple plus the completion callback. Stick a
> bunch of them together and you've got a kiovec. I don't see the advantage
> of moving to simpler primitives if they don't provide needed
> functionality.

Ehh. Let's re-state your argument:

  "You could have used the existing, complex and cumbersome primitives
   that had the wrong semantics. I don't see the advantage of pointing
   out the fact that those primitives are badly designed for the problem
   at hand and moving to simpler and better designed primitives that fit
   the problem well"

Would you agree that that is the essence of what you said? And if not,
then why not?

> Please tell me what you think the right interface is that provides a hook
> on io completion and is asynchronous.

Suggested fix to kiovec's: get rid of them. Immediately. Replace them
with kiobuf's that can handle scatter-gather pages. kiobuf's have 90%
of that support already.

Never EVER have a "struct page **" interface. It is never the valid
thing to do. You should have

	struct fragment {
		struct page *page;
		__u16 offset, length;
	};

and then have "struct fragment **" inside the kiobuf's instead. Rename
"nr_pages" as "nr_fragments", and get rid of the global offset/length,
as they don't make any sense. Voila - your kiobuf is suddenly a lot
more flexible.

Finally, don't embed the static KIO_STATIC_PAGES array in the kiobuf.
The caller knows when it makes sense, and when it doesn't. Don't embed
that knowledge in fundamental data structures.

In the meantime, I'm more than happy to make sure that the networking
infrastructure is sane. Which implies that the networking
infrastructure does NOT use kiovecs.

		Linus

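To see why the tuples matter, take the two-sendfile() case mentioned
above: one packet may carry the tail of one file and the head of
another, in unrelated pages at unrelated offsets. A sketch with
made-up values, using the struct fragment just proposed (page_a and
page_b stand for page-cache pages looked up elsewhere):

    /* Tail of file A: bytes 3000..4095 of page_a (1096 bytes).
     * Head of file B: bytes 0..399 of page_b (400 bytes).      */
    struct fragment tail_a = { page_a, 3000, 1096 };
    struct fragment head_b = { page_b,    0,  400 };

    /* The "struct fragment **" array a kiobuf would carry: */
    struct fragment *frags[2] = { &tail_a, &head_b };

A plain "struct page **" cannot describe this packet: it has no place
to record that only part of each page belongs to the payload.
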
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Gerd Knorr @ 2001-01-10 7:51 UTC
  To: linux-kernel

> > Please tell me what you think the right interface is that provides a hook
> > on io completion and is asynchronous.
>
> Suggested fix to kiovec's: get rid of them. Immediately. Replace them with
> kiobuf's that can handle scatter-gather pages. kiobuf's have 90% of that
> support already.
>
> Never EVER have a "struct page **" interface. It is never the valid thing
> to do.

Hmm, /me is quite happy with it. It's fine for *big* chunks of memory
like video frames: I just need a large number of pages, length and
offset.

If someone wants to have a look: a rewritten bttv version which uses
kiobufs is available at
http://www.strusel007.de/linux/bttv/bttv-0.8.8.tar.gz
It does _not_ use kiovecs though (to be exact: kiovecs with just one
single kiobuf in there).

> You should have
>
>	struct fragment {
>		struct page *page;
>		__u16 offset, length;
>	};

What happens with big memory blocks? Do all pages but the first and
last get offset=0 and length=PAGE_SIZE then?

  Gerd

--
Get back there in front of the computer NOW. Christmas can wait.
	-- Linus "the Grinch" Torvalds, 24 Dec 2000 on linux-kernel

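The answer Gerd is after follows directly from the tuple form: only the
first fragment of a big contiguous block needs a nonzero offset, and
only the last needs a short length; everything in between is a full
page. A sketch of that bookkeeping (fill_fragments is a made-up helper;
struct fragment is the proposal quoted above):

    /* Describe 'len' bytes starting at 'offset' within pages[0..] as
     * (page, offset, length) tuples; returns the fragment count.    */
    static int fill_fragments(struct fragment *frags, struct page **pages,
                              unsigned int offset, unsigned int len)
    {
            int i = 0;

            while (len) {
                    unsigned int chunk = PAGE_SIZE - offset;

                    if (chunk > len)
                            chunk = len;    /* last page may be short */
                    frags[i].page = pages[i];
                    frags[i].offset = offset;
                    frags[i].length = chunk;
                    offset = 0;     /* pages after the first start at 0 */
                    len -= chunk;
                    i++;
            }
            return i;
    }
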
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Stephen C. Tweedie @ 2001-01-12 1:42 UTC
  To: Linus Torvalds; +Cc: linux-kernel, Stephen Tweedie

Hi,

On Tue, Jan 09, 2001 at 11:14:54AM -0800, Linus Torvalds wrote:
> In article <20010109122810.A3115@caldera.de>,
>
> kiobufs are crap. Face it. They do NOT allow proper multi-page scatter
> gather, regardless of what the kiobuf PR department has said.

It's not surprising, since they were designed to solve a totally
different problem.

Kiobufs were always intended to represent logical buffers --- a virtual
address range from some process, or a region of a cached file. The
purpose behind them was, if you remember, to allow something like
map_user_kiobuf() to produce a list of physical pages from the user VA
range.

This works exactly as intended. The raw IO device driver may build a
kiobuf to represent a user VA range, and the XFS filesystem may build
one for its pagebuf abstraction to represent a range within a file in
the page cache. The lower level IO routines just don't care where the
buffers came from.

There are still problems here --- the encoding of block addresses in
the list, dealing with a stack of completion events if you push these
buffers down through various layers of logical block device such as
raid/lvm, carving requests up and merging them if you get requests
which span a raid or LVM stripe, for example. Kiobufs don't solve
those, but neither do skfrags, and neither does the MSG_MORE concept.

If you want a scatter-gather list capable of taking individual
buffer_heads and merging them, then sure, kiobufs won't do the trick as
they stand now: they were never intended to. The whole point of kiobufs
was to encapsulate one single buffer in the higher layers, and to allow
lower layers to work on that buffer without caring where the memory
came from.

But adding the sub-page sg lists is a simple extension. I've got a
number of raw IO fixes pending, and we've just traced the source of the
last problem that was holding it up, so if you want I'll add the
per-page offset/length with those.

--Stephen

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 11:05 UTC
  To: Christoph Hellwig; +Cc: Rik van Riel, David S. Miller, netdev, linux-kernel

On Tue, 9 Jan 2001, Christoph Hellwig wrote:

> > 2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses
> > vectors of struct page *, offset, size entities),
>
> Yep. That is why I was so worried about the writepage file op.

i believe you misunderstand. kiovecs (in their current form) are simply
too bloated for networking purposes. Due to its nature and
nonpersistency, networking is very lightweight and
memory-footprint-sensitive code (as opposed to eg. block IO code).
Right now a 'struct skb_shared_info' [which is roughly equivalent to a
kiovec] is 12+4*6 == 36 bytes, which includes support for 6 distinct
fragments (each fragment can be on any page, any offset, any size). A
*single* kiobuf (which is roughly equivalent to an skb fragment) is
52+16*4 == 116 bytes. 6 of these would be 696 bytes, for a single TCP
packet (!!!). This is simply not something to be used for lightweight
zero-copy networking.

so it's easy to say 'use kiovecs', but right now it's simply not
practical. kiobufs are a loaded concept, and i'm not sure whether it's
desirable at all to mix networking zero-copy concepts with
block-IO/filesystem zero-copy concepts. Just to make it even more
clear: although i do believe it to be desirable from an architectural
point of view, i'm not sure at all whether it's possible, based on the
experience we gathered while implementing TCP-zerocopy.

we talked (and are talking) to Stephen about this problem, but it's
clearly a 2.5 kernel issue. Merging to a finalized zero-copy framework
will be easy. (The overwhelming percentage of zero-copy code is in the
networking code itself and is insensitive to any kiovec issues.)

> It's rather hackish (only write, looks useful only for networking)
> instead of the proposed rw_kiovec fop.

i'm not sure what you are trying to say. You mean we should remove
sendfile() as well? It's only write, looks useful mostly for
networking. A substantial percentage of kernel code is useful only for
networking :-)

> > zerocopy sendfile() and zerocopy sendmsg(), the panacea of fileserver
> > (webserver) scalability - it can be used by Apache, Samba and other
> > fileservers. The new zerocopy networking code DMA-s straight out of the
> > pagecache, natively supports hardware-checksumming and highmem (64-bit
> > DMA on 32-bit systems) zerocopy as well and multi-fragment DMA - no
> > limitations. We can saturate a gigabit link with TCP traffic, at about
> > 20% CPU usage on a 500 MHz x86 UP system. David and Alexey's patch is
> > cool - check it out!
>
> Yuck. A new file_op just to get a few benchmarks right ... no.

As David said, it's direct sendfile() support. It's completely
isolated, it's 20 lines of code, it does not impact filesystems, it
only shows up in sendfile(). So i truly dont understand your point.
This interface has gone through several iterations and was actually
further simplified.

	Ingo

ps1. "first they say it's impossible, then they ridicule you, then they
     oppose you, finally they say it's self-evident". Looks like, after
     many many years, zero-copy networking for Linux is now finally in
     phase III. :-)

ps2. i'm joking :-)

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Christoph Hellwig @ 2001-01-09 18:27 UTC
  To: Ingo Molnar; +Cc: Rik van Riel, David S. Miller, netdev, linux-kernel

On Tue, Jan 09, 2001 at 12:05:59PM +0100, Ingo Molnar wrote:
>
> On Tue, 9 Jan 2001, Christoph Hellwig wrote:
>
> > > 2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses
> > > vectors of struct page *, offset, size entities),
> >
> > Yep. That is why I was so worried about the writepage file op.
>
> i believe you misunderstand. kiovecs (in their current form) are simply
> too bloated for networking purposes.

Stop. I NEVER said you should use them internally. My concern is to use
a file operation with a kiobuf ** as main argument instead of page *.
With a little more bloat it allows you to do the same you do now. But
it also offers a real advantage: you don't have to call into the
network stack for every single page, and this fits easily in Ben's AIO
stuff, so your stuff is very well integrated into the (future) async IO
framework. (The latter was my main concern.)

You pay 116 bytes and a few cycles for a _lot_ more abstraction and
integration. Exactly this kind of design principle (design over raw
speed) is why UNIX has survived so long.

> Due to its nature and nonpersistency, networking is very lightweight
> and memory-footprint-sensitive code (as opposed to eg. block IO code).
> Right now a 'struct skb_shared_info' [which is roughly equivalent to a
> kiovec] is 12+4*6 == 36 bytes, which includes support for 6 distinct
> fragments (each fragment can be on any page, any offset, any size). A
> *single* kiobuf (which is roughly equivalent to an skb fragment) is
> 52+16*4 == 116 bytes. 6 of these would be 696 bytes, for a single TCP
> packet (!!!). This is simply not something to be used for lightweight
> zero-copy networking.

This doesn't matter, because rw_kiovec can easily take only one kiobuf,
and you don't really need the different fragments there.

> so it's easy to say 'use kiovecs', but right now it's simply not
> practical. kiobufs are a loaded concept, and i'm not sure whether it's
> desirable at all to mix networking zero-copy concepts with
> block-IO/filesystem zero-copy concepts.

I didn't want to suggest that - I'm too clueless concerning networking
to even consider an internal design for network zero-copy IO. I'm just
talking about the VFS interface to the rest of the kernel.

> we talked (and are talking) to Stephen about this problem, but it's
> clearly a 2.5 kernel issue. Merging to a finalized zero-copy framework
> will be easy. (The overwhelming percentage of zero-copy code is in the
> networking code itself and is insensitive to any kiovec issues.)

Agreed.

> > It's rather hackish (only write, looks useful only for networking)
> > instead of the proposed rw_kiovec fop.
>
> i'm not sure what you are trying to say. You mean we should remove
> sendfile() as well? It's only write, looks useful mostly for
> networking. A substantial percentage of kernel code is useful only for
> networking :-)

No. But it looks like a recvmsg syscall wouldn't be too bad either ...

	Christoph

--
Whip me.  Beat me.  Make me maintain AIX.

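For comparison, here is a caller-side sketch of the rw_kiovec file
operation Christoph is arguing for, using the prototype posted earlier
in the thread. alloc_kiovec/map_user_kiobuf/unmap_kiobuf/free_kiovec
are stock 2.4 kiobuf helpers; the rw_kiovec member itself is the
proposal under discussion, not shipping code.

    /* Write 'len' user-space bytes at 'addr' through a single kiobuf,
     * via the proposed f_op->rw_kiovec.  Sketch only. */
    static int example_kiovec_write(struct file *file, unsigned long addr,
                                    size_t len)
    {
            struct kiobuf *iobuf;
            int err;

            err = alloc_kiovec(1, &iobuf);
            if (err)
                    return err;
            err = map_user_kiobuf(WRITE, iobuf, addr, len); /* pin pages */
            if (!err) {
                    err = file->f_op->rw_kiovec(file, WRITE, 1, &iobuf,
                                                0, len, file->f_pos);
                    unmap_kiobuf(iobuf);
            }
            free_kiovec(1, &iobuf);
            return err;
    }
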
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 19:19 UTC
  To: Christoph Hellwig; +Cc: Rik van Riel, David S. Miller, netdev, linux-kernel

On Tue, 9 Jan 2001, Christoph Hellwig wrote:

> I didn't want to suggest that - I'm too clueless concerning networking
> to even consider an internal design for network zero-copy IO. I'm just
> talking about the VFS interface to the rest of the kernel.

(well, i think you just cannot be clueless about one and then demand
various things about the other...)

	Ingo

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Stephen C. Tweedie @ 2001-01-09 14:18 UTC
  To: Ingo Molnar; +Cc: Rik van Riel, David S. Miller, hch, netdev, linux-kernel, Stephen Tweedie

Hi,

On Tue, Jan 09, 2001 at 11:23:41AM +0100, Ingo Molnar wrote:
>
> > Having proper kiobuf support would make it possible to, for example,
> > do zerocopy network->disk data transfers and lots of other things.
>
> i used to think that this is useful, but these days it isnt. It's a waste
> of PCI bandwidth resources, and it's much cheaper to keep a cache in RAM
> instead of doing direct disk=>network DMA *all the time* some resource is
> requested.

No. I'm certain you're right when talking about things like web
serving, but it just doesn't apply when you look at some other
applications, such as streaming out video data or performing
fileserving in a high-performance compute cluster where you are serving
bulk data. The multimedia and HPC worlds typically operate on datasets
which are far too large to cache, so you want to keep them in memory as
little as possible when you ship them over the wire.

Cheers,
 Stephen

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 14:40 UTC
  To: Stephen C. Tweedie; +Cc: Rik van Riel, David S. Miller, hch, netdev, linux-kernel

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:

> > i used to think that this is useful, but these days it isnt. It's a waste
> > of PCI bandwidth resources, and it's much cheaper to keep a cache in RAM
> > instead of doing direct disk=>network DMA *all the time* some resource is
> > requested.
>
> No. I'm certain you're right when talking about things like web
> serving, [...]

yep, i was concentrating on fileserving load.

> but it just doesn't apply when you look at some other applications,
> such as streaming out video data or performing fileserving in a
> high-performance compute cluster where you are serving bulk data.
> The multimedia and HPC worlds typically operate on datasets which are
> far too large to cache, so you want to keep them in memory as little
> as possible when you ship them over the wire.

i'd love to first see these kinds of applications (under Linux) before
designing for them. Eg. if an IO operation (eg. streaming video
webcast) does a DMA from a camera card to an outgoing networking card,
would it be possible to access the packet data in case of a TCP
retransmit? Basically these applications are limited enough in scope to
justify even temporary 'hacks' that enable them - and once we *see*
things in action, we could design for them. Not the other way around.

	Ingo

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Alan Cox @ 2001-01-09 14:51 UTC
  To: mingo; +Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev, linux-kernel

> designing for them. Eg. if an IO operation (eg. streaming video webcast)
> does a DMA from a camera card to an outgoing networking card, would it be

Most mpeg2 hardware isnt set up for that kind of use. And webcast
protocols like h.263 tend to be software implemented. Capturing raw
video for pre-processing is similar. Right now thats best done with
mmap() on the ring buffer and O_DIRECT I/O it seems

Alan

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Stephen C. Tweedie @ 2001-01-09 15:17 UTC
  To: Ingo Molnar; +Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 03:40:56PM +0100, Ingo Molnar wrote:
>
> i'd love to first see these kinds of applications (under Linux) before
> designing for them.

Things like Beowulf have been around for a while now, and SGI have been
doing that sort of multimedia stuff for ages. I don't think that
there's any doubt that there's a demand for this.

> Eg. if an IO operation (eg. streaming video webcast)
> does a DMA from a camera card to an outgoing networking card, would it be
> possible to access the packet data in case of a TCP retransmit?

I'm not thinking about pci-to-pci as much as pci-to-memory-to-pci with
no memory-to-memory copies. That's no different to writepage: doing a
zero-copy writepage on a page cache page still gives you the problem of
maintaining retransmit semantics if a user mmaps the file or writes to
it after your initial transmit.

And if you want other examples, we have applications such as Oracle who
want to do raw disk IO in chunks of at least 128K. Going through a
page-by-page interface for large IOs is almost as bad as the existing
buffer_head-by-buffer_head interface, and we have already demonstrated
that to be a bottleneck in the block device layer.

Jes has also got hard numbers for the performance advantages of
jumbograms on some of the networks he's been using, and you ain't going
to get udp jumbograms through a page-by-page API, ever.

Cheers,
 Stephen

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 15:37 UTC
  To: Stephen C. Tweedie; +Cc: Rik van Riel, David S. Miller, hch, netdev, linux-kernel

On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:

> Jes has also got hard numbers for the performance advantages of
> jumbograms on some of the networks he's been using, and you ain't
> going to get udp jumbograms through a page-by-page API, ever.

i know the performance advantages of jumbograms (typically when it's
over a local network), it's undisputed. Still i dont see why it should
be impossible to do effective UDP via a single-page interface. Eg.
buffering of outgoing pages could be supported, and MSG_MORE in
sendmsg() used to indicate end of stream. This is why ->writepage() has
a 'more' flag (and tcp_sendpage() has a flag as well).

	Ingo

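In userspace terms, the buffering Ingo describes would look like the
sketch below: every send() flagged MSG_MORE queues data, and the first
unflagged send() releases one large datagram. The flag name is the one
from this thread; treat its exact UDP semantics as proposed behavior
under discussion, not a guarantee of any shipped kernel.

    #include <sys/socket.h>

    /* Emit header + payload as a single UDP datagram on a connected
     * socket 'fd'.  Error handling omitted for brevity. */
    static void send_as_one_datagram(int fd,
                                     const void *hdr, size_t hdr_len,
                                     const void *payload, size_t pay_len)
    {
            send(fd, hdr, hdr_len, MSG_MORE); /* queued, nothing sent yet */
            send(fd, payload, pay_len, 0);    /* end of stream: one
                                                 jumbogram goes out */
    }
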
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: David S. Miller @ 2001-01-09 21:18 UTC
  To: sct; +Cc: mingo, sct, riel, hch, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 15:17:25 +0000
   From: "Stephen C. Tweedie" <sct@redhat.com>

   Jes has also got hard numbers for the performance advantages of
   jumbograms on some of the networks he's been using, and you ain't
   going to get udp jumbograms through a page-by-page API, ever.

Again, see MSG_MORE in the patches. It is possible and our UDP
implementation could make it easily.

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Linus Torvalds @ 2001-01-09 22:25 UTC
  To: linux-kernel

In article <20010109151725.D9321@redhat.com>,
Stephen C. Tweedie <sct@redhat.com> wrote:
>
> Jes has also got hard numbers for the performance advantages of
> jumbograms on some of the networks he's been using, and you ain't
> going to get udp jumbograms through a page-by-page API, ever.

Wrong.

The only thing you need is a nagle-type thing that coalesces requests.

In the case of UDP, that coalescing obviously has to be explicitly
controlled, as the "standard" UDP behaviour is to send out just one
packet per write. But this is a problem for TCP too: you want to tell
TCP to _not_ send out a short packet even if there are none in-flight,
if you know you want to send more. So you want to have some way to
anti-nagle for TCP anyway.

Also, if you look at the problem of "writev()", you'll notice that you
have many of the same issues: what you really want is to _always_
coalesce, and only send out when explicitly asked for (and then that
explicit ask would be on by default at the end of write() and at the
very end of the last segment in "writev()").

It so happens that this logic already exists, it's called MSG_MORE or
something similar (I'm too lazy to check the actual patches). And it's
there exactly because it is stupid to make the upper layers have to
gather everything into one packet if the lower layers need that logic
for other reasons anyway. Which they obviously do.

So what you can do is to just do multiple writes, and set the MSG_MORE
flag. This works with sendfile(), but more importantly it is also an
uncommonly good interface to user mode. With this, you can actually
implement things like "writev()" _properly_ from user-space, and we
could get rid of the special socket writev() magic if we wanted to.

So if you have a header, you just send out that header separately (with
the MSG_MORE flag), and then do a "sendfile()" or whatever to send out
the data. This is much more flexible than writev(), and a lot easier to
use. It's also a hell of a lot more flexible than the ugly sendfile()
interfaces that HP-UX and the BSD people have - I'm ashamed of how
little taste the BSD group in general has had in interface design. Ugh.
Tacking on a mixture of writev() and sendfile() in the same system
call. Tacky.

		Linus

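The header-then-body pattern Linus describes, sketched as userspace
calls (sock is a connected TCP socket, filefd an open file; error
handling is trimmed, and MSG_MORE semantics are as discussed in this
thread):

    #include <sys/socket.h>
    #include <sys/sendfile.h>

    /* Send a reply header without letting TCP push it out as its own
     * short packet, then stream the file body behind it. */
    static int send_reply(int sock, int filefd,
                          const void *hdr, size_t hdr_len,
                          size_t body_len)
    {
            off_t off = 0;

            if (send(sock, hdr, hdr_len, MSG_MORE) < 0)
                    return -1;
            if (sendfile(sock, filefd, &off, body_len) < 0)
                    return -1;
            return 0;
    }
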
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Stephen C. Tweedie @ 2001-01-10 15:21 UTC
  To: Linus Torvalds; +Cc: linux-kernel, Stephen Tweedie

Hi,

On Tue, Jan 09, 2001 at 02:25:43PM -0800, Linus Torvalds wrote:
> In article <20010109151725.D9321@redhat.com>,
> Stephen C. Tweedie <sct@redhat.com> wrote:
> >
> > Jes has also got hard numbers for the performance advantages of
> > jumbograms on some of the networks he's been using, and you ain't
> > going to get udp jumbograms through a page-by-page API, ever.
>
> The only thing you need is a nagle-type thing that coalesces requests.

Is this robust enough to build a useful user-level API on top of? What
happens if we have a threaded application in which more than one
process may be sending udp sendmsg()s to the file descriptor? If we end
up decomposing each datagram into multiple page-sized chunks, then you
can imagine them arriving at the fd stream in interleaved order. You
can fix that by adding extra locking, but that just indicates that the
original API wasn't sufficient to communicate the precise intent of the
application in the first place.

Things look worse from the point of view of ll_rw_block, which lacks
any concept of (a) a file descriptor, or (b) a non-reorderable stream
of atomic requests. ll_rw_block coalesces in any order it chooses, so
its coalescing function is a _lot_ more complex than hooking the next
page onto a linked list. Once the queue size grows non-trivial, adding
a new request can become quite expensive (even with only one item on
the request queue at once, make_request is still by far the biggest
cost on a kernel profile running raw IO). If you've got a 32-page IO to
send, sending it in chunks means either merging 32 times into that
queue when you could have just done it once, or holding off all merging
until you're told to unplug: but with multiple clients, you just
encounter the lack of caller context again, and each client can unplug
the other before its time.

I realise these are apples and oranges to some extent, because
ll_rw_block doesn't accept a file descriptor: the place where we _do_
use file descriptors, block_write(), could be doing some of this if the
requests were coming from an application. However, that doesn't address
the fact that we have got raw devices and filesystems such as XFS
already generating large multi-page block IO requests and having to
cram them down the thin pipe which is ll_rw_block, and the MSG_MORE
flag doesn't seem capable of extending to ll_rw_block sufficiently
well.

I guess it comes down to this: what problem are we trying to fix? If
it's strictly limited to sendfile/writev and related calls, then you've
convinced me that page-by-page MSG_MORE can work if you add a bit of
locking, but that locking is by itself nasty.

Think about O_DIRECT to a database file. We get a write() call, locate
the physical pages through unspecified magic, and fire off a series of
page or partial-page writes to the O_DIRECT fd. If we are coalescing
these via MSG_MORE, then we have to keep the fd locked for write until
we've processed the whole IO (including any page faults that result).
The filesystem --- which is what understands the concept of a file
descriptor --- can merge these together into another request, but we'd
just have to split that request into chunks again to send them to
ll_rw_block.

We may also have things like software raid layers in the write path.
That's the motivation for having an object capable of describing
multi-page IOs --- it lets us pass the desired IO chunks down through
the filesystem, virtual block devices and physical block devices,
without any context being required and without having to
decompose/merge at each layer.

Cheers,
 Stephen

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Stephen Frost @ 2001-01-09 15:25 UTC
  To: Ingo Molnar; +Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2153 bytes --]

* Ingo Molnar (mingo@elte.hu) wrote:
>
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
>
> > but it just doesn't apply when you look at some other applications,
> > such as streaming out video data or performing fileserving in a
> > high-performance compute cluster where you are serving bulk data.
> > The multimedia and HPC worlds typically operate on datasets which are
> > far too large to cache, so you want to keep them in memory as little
> > as possible when you ship them over the wire.
>
> i'd love to first see these kinds of applications (under Linux) before
> designing for them. Eg. if an IO operation (eg. streaming video webcast)
> does a DMA from a camera card to an outgoing networking card, would it be
> possible to access the packet data in case of a TCP retransmit? Basically
> these applications are limited enough in scope to justify even temporary
> 'hacks' that enable them - and once we *see* things in action, we could
> design for them. Not the other way around.

Well, I know I for one use a system that you might have heard of called
'MOSIX'. It's a (kinda large) kernel patch with some user-space tools
but allows for migration of processes between machines without
modifying any code. There are some limitations (threaded applications
and shared memory and whatnot) but it works very well for the rendering
work we use it for. We use radiance, which in general has pretty little
inter-process communication, and what it has is done through the
filesystem.

Now, the interesting bit here is that the processes can grow to be
pretty large (200M+, up as high as 500M, higher if we let it ;) ) and
what happens with MOSIX is that entire processes get sent over the wire
to other machines for work. MOSIX will also attempt to rebalance the
load on all of the machines in the cluster and whatnot, so it can often
be moving processes back and forth.

So, anyhow, this is just an FYI in case you weren't aware of it: I
believe more than a few people are using MOSIX these days for similar
applications, and it's available at http://www.mosix.org if you're
curious.

	Stephen

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-09 15:40 UTC
  To: Stephen Frost; +Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev, linux-kernel

On Tue, 9 Jan 2001, Stephen Frost wrote:

> Now, the interesting bit here is that the processes can grow to be
> pretty large (200M+, up as high as 500M, higher if we let it ;) ) and what
> happens with MOSIX is that entire processes get sent over the wire to
> other machines for work. MOSIX will also attempt to rebalance the load on
> all of the machines in the cluster and whatnot so it can often be moving
> processes back and forth.

then you'll love the zerocopy patch :-) Just use sendfile() or specify
MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
DMA-and-checksumming on cards that support it.

the discussion with Stephen is about various device-to-device schemes.
(which Mosix i dont think wants to use. Mosix wants to use memory to
device zero-copy, right?)

	Ingo

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Stephen Frost @ 2001-01-09 15:48 UTC
  To: Ingo Molnar; +Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]

* Ingo Molnar (mingo@elte.hu) wrote:
>
> On Tue, 9 Jan 2001, Stephen Frost wrote:
>
> > Now, the interesting bit here is that the processes can grow to be
> > pretty large (200M+, up as high as 500M, higher if we let it ;) ) and what
> > happens with MOSIX is that entire processes get sent over the wire to
> > other machines for work. MOSIX will also attempt to rebalance the load on
> > all of the machines in the cluster and whatnot so it can often be moving
> > processes back and forth.
>
> then you'll love the zerocopy patch :-) Just use sendfile() or specify
> MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
> DMA-and-checksumming on cards that support it.

Excellent, this patch certainly sounds interesting, which is why I've
been following this discussion. Once the MOSIX patch for 2.4 comes out
I think I'm going to tinker with this and see if I can get MOSIX to use
these methods.

> the discussion with Stephen is about various device-to-device schemes.
> (which Mosix i dont think wants to use. Mosix wants to use memory to
> device zero-copy, right?)

Yes, very much so actually now that I think about it. A lot of
memory->device and device->memory work going on. I was mainly replying
to the idea of clustering since that's what MOSIX is all about.

	Stephen

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Dave Zarzycki @ 2001-01-10 1:14 UTC
  To: Ingo Molnar; +Cc: linux-kernel

On Tue, 9 Jan 2001, Ingo Molnar wrote:

> then you'll love the zerocopy patch :-) Just use sendfile() or specify
> MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
> DMA-and-checksumming on cards that support it.

I'm confused. In user space, how do you know when it's safe to reuse
the buffer that was handed to sendmsg() with the MSG_NOCOPY flag? Or
does sendmsg() with that flag block until the buffer isn't needed by
the kernel any more? If it does block, doesn't that defeat the use of
non-blocking I/O?

davez

--
Dave Zarzycki  http://thor.sbay.org/~dave/

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: David S. Miller @ 2001-01-10 1:14 UTC
  To: dave; +Cc: mingo, linux-kernel

   Date: Tue, 9 Jan 2001 17:14:33 -0800 (PST)
   From: Dave Zarzycki <dave@zarzycki.org>

   On Tue, 9 Jan 2001, Ingo Molnar wrote:

   > then you'll love the zerocopy patch :-) Just use sendfile() or specify
   > MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
   > DMA-and-checksumming on cards that support it.

   I'm confused. In user space, how do you know when it's safe to
   reuse the buffer that was handed to sendmsg() with the MSG_NOCOPY
   flag? Or does sendmsg() with that flag block until the buffer isn't
   needed by the kernel any more? If it does block, doesn't that
   defeat the use of non-blocking I/O?

Ignore Ingo's comments about the MSG_NOCOPY flag, I've not included
those parts in the zerocopy patches as they are very controversial and
require some VM layer support.

Basically, it pins the userspace pages, so if you write to them before
the data is fully sent and the networking buffer freed, they get copied
with a COW fault.

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Dave Zarzycki @ 2001-01-10 2:18 UTC
  To: David S. Miller; +Cc: mingo, linux-kernel

On Tue, 9 Jan 2001, David S. Miller wrote:

> Ignore Ingo's comments about the MSG_NOCOPY flag, I've not included
> those parts in the zerocopy patches as they are very controversial
> and require some VM layer support.

Okay, I talked to some kernel engineers where I work and they were (I
think) very justifiably skeptical of zero-copy work with respect to
read/write style APIs.

> Basically, it pins the userspace pages, so if you write to them before
> the data is fully sent and the networking buffer freed, they get
> copied with a COW fault.

Yum...

Assuming a gigabit ethernet link is saturated with the
sendmsg(MSG_NOCOPY) API, what is CPU utilization like for a given clock
speed and processor make? Is it any different from the sendfile() case?

davez

--
Dave Zarzycki  http://thor.sbay.org/~dave/

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Ingo Molnar @ 2001-01-10 1:19 UTC
  To: Dave Zarzycki; +Cc: linux-kernel

On Tue, 9 Jan 2001, Dave Zarzycki wrote:

> In user space, how do you know when its safe to reuse the buffer that
> was handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg()
> with that flag block until the buffer isn't needed by the kernel any
> more? If it does block, doesn't that defeat the use of non-blocking
> I/O?

sendmsg() marks those pages COW and copies the original page into a new
one for further usage. (the old page is used until the packet is
released.) So for maximum performance user-space should not reuse such
buffers immediately.

	Ingo

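As a usage sketch only - David notes above that MSG_NOCOPY is not part
of the published patches - the buffer-reuse caveat Ingo describes would
play out like this (next_free_buffer is a hypothetical rotation
helper):

    /* Proposed (unmerged) flag: the kernel keeps buf's pages pinned
     * until the packet is released; writing to buf before then forces
     * a COW copy and forfeits the zero-copy benefit. */
    send(fd, buf, len, MSG_NOCOPY);
    buf = next_free_buffer();   /* write new data elsewhere instead of
                                   scribbling on the in-flight buffer */
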
* storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
From: dean gaudet @ 2001-01-10 2:56 UTC
  To: Ingo Molnar; +Cc: Rik van Riel, David S. Miller, hch, netdev, linux-kernel

On Tue, 9 Jan 2001, Ingo Molnar wrote:

> On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> > Having proper kiobuf support would make it possible to, for example,
> > do zerocopy network->disk data transfers and lots of other things.
>
> i used to think that this is useful, but these days it isnt.

this seems to be in the general theme of "network receive is boring".
which i mostly agree with... except recently i've been thinking about
an application where it may not be so boring, but i haven't researched
all the details yet.

the application is storage over IP -- SAN using IP (i.e. gigabit
ethernet) technologies instead of fiberchannel technologies. several
companies are doing it or planning to do it (for example EMC, 3ware).

i'm taking a wild guess that SCSI over FC is arranged conveniently to
allow a scatter request to read packets off the FC NIC such that the
headers go one way and the data lands neatly into the page cache (i.e.
fixed length headers). i've never investigated the actual protocols
though so maybe the solution used was to just push a lot of the detail
down into the controllers.

a quick look at the iSCSI specification
<http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-02.txt>, and
the FCIP spec
<http://www.ietf.org/internet-drafts/draft-ietf-ips-fcovertcpip-01.txt>
show that both use TCP/IP. TCP/IP has variable length headers (or am i
on crack?), which totally complicates the receive path.

the iSCSI requirements document seems to imply they're happy with
pushing this extra processing down to a special storage NIC. that kind
of sucks -- one of the benefits of storage over IP would be the ability
to redundantly connect a box to storage and IP with only two NICs
(instead of 4 -- 2 IP and 2 FC).

is NFS receive single copy today? anyone tried doing packet
demultiplexing by grabbing headers on one pass and scattering the data
on a second pass?

i'm hoping i'm missing something. anyone else looked around at this
stuff yet?

-dean

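A userspace approximation of the two-pass receive dean asks about, for
a protocol with fixed-length headers over TCP (illustrative only; it
still costs a CPU copy to place the data, which is exactly the
limitation under discussion):

    #include <sys/socket.h>

    /* Pass 1: pull in just the fixed-size header and decide where the
     * payload belongs.  Pass 2: read the payload straight into that
     * (e.g. page-aligned) destination.  Short reads and errors are
     * ignored for brevity. */
    static void recv_one_request(int fd, void *hdr, size_t hdr_len,
                                 void *payload, size_t pay_len)
    {
            recv(fd, hdr, hdr_len, MSG_WAITALL);
            /* ... parse hdr; pick/validate the payload destination ... */
            recv(fd, payload, pay_len, MSG_WAITALL);
    }
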
* Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
From: David S. Miller @ 2001-01-10 2:58 UTC
  To: dean-list-linux-kernel; +Cc: mingo, riel, hch, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 18:56:33 -0800 (PST)
   From: dean gaudet <dean-list-linux-kernel@arctic.org>

   is NFS receive single copy today?

With the zerocopy patches, NFS client receive is "single cpu copy" if
that's what you mean.

Later,
David S. Miller
davem@redhat.com

* Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
From: dean gaudet @ 2001-01-10 3:18 UTC
  To: David S. Miller; +Cc: mingo, riel, hch, netdev, linux-kernel

On Tue, 9 Jan 2001, David S. Miller wrote:

>    is NFS receive single copy today?
>
> With the zerocopy patches, NFS client receive is "single cpu copy" if
> that's what you mean.

yeah sorry, i meant:

- NIC DMAs packet to memory
- CPU reads headers from memory, figures out it's NFS
- CPU copies data bytes from packet image in memory to pagecache

-dean

* Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
From: David S. Miller @ 2001-01-10 3:09 UTC
  To: dean-list-linux-kernel; +Cc: mingo, riel, hch, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 19:18:53 -0800 (PST)
   From: dean gaudet <dean-list-linux-kernel@arctic.org>

   - NIC DMAs packet to memory
   - CPU reads headers from memory, figures out it's NFS
   - CPU copies data bytes from packet image in memory to pagecache

Yes, this is precisely what happens in the NFS client with the zerocopy
patches applied.

Later,
David S. Miller
davem@redhat.com

* Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
From: Alan Cox @ 2001-01-10 3:05 UTC
  To: dean gaudet; +Cc: Ingo Molnar, Rik van Riel, David S. Miller, hch, netdev, linux-kernel

> fixed length headers). i've never investigated the actual protocols
> though so maybe the solution used was to just push a lot of the detail
> down into the controllers.

The stuff I have access to (MPT fusion) pushes the FC handling down
onto the board. Basically you talk scsi and IP to it (See
drivers/message/fusion in -ac)

> <http://www.ietf.org/internet-drafts/draft-ietf-ips-fcovertcpip-01.txt>
> show that both use TCP/IP. TCP/IP has variable length headers (or am i on
> crack?), which totally complicates the receive path.

TCP has variable length headers. It also prevents you re-ordering
commands in the stream which would be beneficial. I've not checked if
the draft uses multiple TCP streams but then you have scaling
questions.

Alan

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: Jes Sorensen @ 2001-01-08 21:56 UTC
  To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> "David" == David S Miller <davem@redhat.com> writes:

David> I've put a patch up for testing on the kernel.org mirrors:
David> /pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz
David> It provides a framework for zerocopy transmits and delayed
David> receive fragment coalescing. TUX-1.01 uses this framework.
David> Zerocopy transmit requires some driver support, things run as
David> they did before for drivers which do not have the support
David> added. Currently sg+csum driver support has been added to
David> Acenic, 3c59x, sunhme, and loopback drivers. We had eepro100
David> support coded at one point, but it was removed because we
David> didn't know how to identify the cards which support hw csum
David> assist vs. ones which could not.

I haven't had time to test this patch, but looking over the changes to
the acenic driver I have to say that I am quite displeased with the way
the changes were done. I can't comment on how the authors of the other
drivers which were changed feel about it.

However I find it highly annoying that someone goes off and makes major
cosmetic structural changes to someone else's code without even
consulting the author who happens to maintain the code. It doesn't help
that the patch reverts changes that should not have been reverted.

I don't think it's too much to ask that one actually tries to
communicate with an author of a piece of code before making such major
changes and submitting them for inclusion in the kernel.

Jes

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
From: David S. Miller @ 2001-01-08 21:48 UTC
  To: jes; +Cc: linux-kernel, netdev

   From: Jes Sorensen <jes@linuxcare.com>
   Date: 08 Jan 2001 22:56:48 +0100

   I don't think it's too much to ask that one actually tries to
   communicate with an author of a piece of code before making such
   major changes and submitting them for inclusion in the kernel.

Jes, I have not submitted this for inclusion into the kernel. This is
the "everyone, including driver authors, take a look" part of the
development process.

We _had_ to change some drivers to show how to support this new SKB api
for transmit sg+csum support. If you can think of a way for us to
effectively do this work without changing at least a few drivers as
examples (and proof of concept), please let us know.

In the process we hit real bugs in your driver, and tried to deal with
them as best we could so that we could continue testing and debugging
our own code.

As a side note, as much as you may hate some of Alexey's changes to
your driver, several of his changes fix long-standing real bugs in the
Acenic driver that you've been papering over with workarounds for quite
some time. I would even go so far as to say that in many regards Alexey
understands the Acenic much better than you, and you would be wise to
work with Alexey and not against him. Thanks.

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 21:48 ` David S. Miller
@ 2001-01-08 22:32 ` Jes Sorensen
  2001-01-08 22:36   ` David S. Miller
  2001-01-08 22:43   ` Stephen Frost
  0 siblings, 2 replies; 101+ messages in thread
From: Jes Sorensen @ 2001-01-08 22:32 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> "David" == David S Miller <davem@redhat.com> writes:

David> We _had_ to change some drivers to show how to support this
David> new SKB API for transmit sg+csum support. If you can think of
David> a way for us to effectively do this work without changing at
David> least a few drivers as examples (and proof of concept), please
David> let us know.

Dave, I am not complaining about drivers having to be changed for
this to work; I am fully aware of this need. My complaints are about
how this is being done: some people try to maintain drivers and have
certain ideas about how they structure their code, etc. If you had
sent me a short email saying "this is what we plan to do and this is
what we think should be done to your code, what's your opinion?", I
would have volunteered to help write the code and get the stuff
integrated much earlier, as well as given you my input on how I would
like to see the changes implemented. Instead we now have a fairly
large patch which will take me a long time to merge into the driver
version that I maintain.

David> In the process we hit real bugs in your driver, and tried to
David> deal with them as best we could so that we could continue
David> testing and debugging our own code.

I would have appreciated a simple email saying "we found bug X in
your driver" with either a patch attached or a short note of your
observations.

David> As a side note, as much as you may hate some of Alexey's
David> changes to your driver, several of them fix long-standing real
David> bugs in the Acenic driver that you've been papering over with
David> workarounds for quite some time. I would even go so far as to
David> say that in many regards Alexey understands the Acenic much
David> better than you do, and you would be wise to work with Alexey
David> and not against him. Thanks.

I don't question Alexey's skills and I have no intention of working
against him. All I am asking is that someone lets me know if they
make major changes to my code so I can keep track of what's
happening. It is really hard to maintain code if you work on major
changes while someone else branches off in a different direction
without you knowing. It's simply a waste of everybody's time.

Thanks,
Jes
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 22:32 ` Jes Sorensen
@ 2001-01-08 22:36 ` David S. Miller
  2001-01-09 12:12   ` Ingo Molnar
  2001-01-08 22:43   ` Stephen Frost
  1 sibling, 1 reply; 101+ messages in thread
From: David S. Miller @ 2001-01-08 22:36 UTC (permalink / raw)
To: jes; +Cc: linux-kernel, netdev

   From: Jes Sorensen <jes@linuxcare.com>
   Date: 08 Jan 2001 23:32:48 +0100

   All I am asking is that someone lets me know if they make major
   changes to my code so I can keep track of what's happening.

We have not made any major changes to your code, given that this is
not code which is actually being submitted yet.

If it bothers you that someone has publicly published changes to your
driver which you disagree with, oh well... :-)

This "please check things out" phase is precisely what you are asking
of us; it is how we are saying "here is what we need to do with your
driver, please comment".

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 22:36 ` David S. Miller
@ 2001-01-09 12:12 ` Ingo Molnar
  0 siblings, 0 replies; 101+ messages in thread
From: Ingo Molnar @ 2001-01-09 12:12 UTC (permalink / raw)
To: David S. Miller; +Cc: jes, linux-kernel, netdev

On Mon, 8 Jan 2001, David S. Miller wrote:

> All I am asking is that someone lets me know if they make major
> changes to my code so I can keep track of what's happening.
>
> We have not made any major changes to your code, given that this is
> not code which is actually being submitted yet.
>
> If it bothers you that someone has publicly published changes to
> your driver which you disagree with, oh well... :-)

i did tell Jes about our zerocopy work, months ago (and IIRC we even
exchanged emails about technical issues briefly). The changes were
first published in the TUX 1.0 source code last August, and subsequent
cleanups (more than 10 iterations) were published on Alexey's public
FTP site:

    ftp://ftp.inr.ac.ru/ip-routing/

i think this whole issue got miscommunicated because Jes moved to
Canada exactly when we wrote the fragmented-API changes. I do believe
Jes will like most of our changes though, and i can surely tell that
the elegant and clean code of the Acenic driver made these changes so
much easier. Jes's Acenic driver was the first Linux networking driver
in history to support zero-copy TCP.

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 22:32 ` Jes Sorensen
  2001-01-08 22:36   ` David S. Miller
@ 2001-01-08 22:43 ` Stephen Frost
  2001-01-08 22:37   ` David S. Miller
  1 sibling, 1 reply; 101+ messages in thread
From: Stephen Frost @ 2001-01-08 22:43 UTC (permalink / raw)
To: Jes Sorensen; +Cc: David S. Miller, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1186 bytes --]

* Jes Sorensen (jes@linuxcare.com) wrote:
> >>>>> "David" == David S Miller <davem@redhat.com> writes:
>
> I don't question Alexey's skills and I have no intention of working
> against him. All I am asking is that someone lets me know if they
> make major changes to my code so I can keep track of what's
> happening. It is really hard to maintain code if you work on major
> changes while someone else branches off in a different direction
> without you knowing. It's simply a waste of everybody's time.

	Perhaps you missed it, but I believe Dave's intent is for this
to be only a proof of concept at this time. These changes are not
currently up for inclusion in the mainstream kernel. I cannot imagine
that Dave would ever just step around a maintainer and submit a patch
with large changes to Linus.

	If many people test these changes and things work out well for
them, then I'm sure Dave will go back to the maintainers with the code
and the API and work with them to get it into the mainstream kernel,
soliciting ideas and suggestions on how to improve the API and the
code paths in the drivers to handle this new method most effectively.

	Stephen

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 22:43 ` Stephen Frost
@ 2001-01-08 22:37 ` David S. Miller
  0 siblings, 0 replies; 101+ messages in thread
From: David S. Miller @ 2001-01-08 22:37 UTC (permalink / raw)
To: sfrost; +Cc: jes, linux-kernel, netdev

   Date: Mon, 8 Jan 2001 17:43:56 -0500
   From: Stephen Frost <sfrost@snowman.net>

   Perhaps you missed it, but I believe Dave's intent is for this to
   be only a proof of concept at this time.

Thank you Stephen, this is the point Jes continues to miss.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08  1:24 [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 David S. Miller
  2001-01-08 10:39 ` Christoph Hellwig
  2001-01-08 21:56 ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Jes Sorensen
@ 2001-01-09 13:52 ` Trond Myklebust
  2001-01-09 13:42   ` David S. Miller
  2 siblings, 1 reply; 101+ messages in thread
From: Trond Myklebust @ 2001-01-09 13:52 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> " " == David S Miller <davem@redhat.com> writes:

> I've put a patch up for testing on the kernel.org mirrors:

>   /pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz
.....
> Finally, regardless of networking card, there should be a
> measurable performance boost for NFS clients with this patch
> due to the delayed fragment coalescing. KNFSD does not take
> full advantage of this facility yet.

Hi David,

I don't really want to be chiming in with another 'make it a kiobuf',
but given that you have already written 'do_tcp_sendpages()', why did
you make sock->ops->sendpage() take a single page as an argument
rather than have it take a 'struct page **'?

I would have thought that one of the main attractions of doing
something like this would be to allow us to speed up large writes to
the socket for ncpfs/knfsd/nfs/smbfs/... After all, in both the case
of client WRITE requests and of server READ responses, we end up with
a set of several pages that just need to be pushed down the network
without further ado. Unless I misunderstood the code, it seems that
do_tcp_sendpages() fits the bill nicely...

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 13:52 ` Trond Myklebust
@ 2001-01-09 13:42 ` David S. Miller
  2001-01-09 15:27   ` Trond Myklebust
  0 siblings, 1 reply; 101+ messages in thread
From: David S. Miller @ 2001-01-09 13:42 UTC (permalink / raw)
To: trond.myklebust; +Cc: linux-kernel, netdev

   From: Trond Myklebust <trond.myklebust@fys.uio.no>
   Date: 09 Jan 2001 14:52:40 +0100

   I don't really want to be chiming in with another 'make it a
   kiobuf', but given that you have already written
   'do_tcp_sendpages()', why did you make sock->ops->sendpage() take
   a single page as an argument rather than have it take a 'struct
   page **'?

It was like that to begin with. But to do it cleanly you have to pass
in not a vector of "pages" but a vector of "page+offset+len" triplets.
Linus hated it, and I understood why, so I reverted the API to be
single-page based.

   I would have thought that one of the main attractions of doing
   something like this would be to allow us to speed up large writes
   to the socket for ncpfs/knfsd/nfs/smbfs/...

This is what TCP_CORK/MSG_MORE et al. are for; things get coalesced
perfectly. Sending in a vector of pages seems nice, but none of the
page cache infrastructure works like this: all of the core routines
work on a page at a time. It actually simplifies a lot.

The writepage interface optimizes large file writes to a socket just
fine.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 101+ messages in thread
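Concretely, the single-page interface composes into multi-page sends by
looping, with MSG_MORE set on every call but the last so TCP keeps
filling segments rather than flushing once per call. The sketch below
assumes the sendpage prototype as merged in 2.4 (page, offset, length,
flags); the send_pages() wrapper itself is a hypothetical illustration,
not a function from the patch.

#include <linux/mm.h>
#include <linux/net.h>
#include <linux/socket.h>

/*
 * Hypothetical helper: push npages-1 full pages plus a partial last
 * page through the per-page sendpage op. MSG_MORE on all but the
 * final call tells TCP that more data follows, so it coalesces into
 * full segments instead of emitting one segment per call.
 */
static ssize_t send_pages(struct socket *sock, struct page **pages,
			  int npages, size_t last_len)
{
	ssize_t total = 0;
	int i;

	for (i = 0; i < npages; i++) {
		size_t len = (i == npages - 1) ? last_len : PAGE_SIZE;
		int flags = (i == npages - 1) ? 0 : MSG_MORE;
		ssize_t ret = sock->ops->sendpage(sock, pages[i], 0,
						  len, flags);

		if (ret < 0)
			return total ? total : ret;
		total += ret;
	}
	return total;
}

This is the trade Dave describes: the per-call flag recovers the batching
a 'struct page **' vector would have provided, without a triplet-vector
API that the rest of the page cache infrastructure has no use for.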
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 13:42 ` David S. Miller
@ 2001-01-09 15:27 ` Trond Myklebust
  2001-01-09 21:19   ` David S. Miller
  0 siblings, 1 reply; 101+ messages in thread
From: Trond Myklebust @ 2001-01-09 15:27 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> David S Miller <davem@redhat.com> writes:

> I would have thought that one of the main attractions of doing
> something like this would be to allow us to speed up large writes
> to the socket for ncpfs/knfsd/nfs/smbfs/...

> This is what TCP_CORK/MSG_MORE et al. are for; things get coalesced
> perfectly. Sending in a vector of pages seems nice, but none of the
> page cache infrastructure works like this: all of the core routines
> work on a page at a time. It actually simplifies a lot.

> The writepage interface optimizes large file writes to a socket
> just fine.

OK, but can you eventually generalize it to non-stream protocols
(e.g. UDP)? After all, it doesn't make sense to differentiate between
zero-copy on stream and non-stream sockets, and Linux NFS, at least,
remains heavily UDP-oriented...

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:27 ` Trond Myklebust
@ 2001-01-09 21:19 ` David S. Miller
  2001-01-10  9:21   ` Trond Myklebust
  0 siblings, 1 reply; 101+ messages in thread
From: David S. Miller @ 2001-01-09 21:19 UTC (permalink / raw)
To: trond.myklebust; +Cc: linux-kernel, netdev

   Date: Tue, 9 Jan 2001 16:27:49 +0100 (CET)
   From: Trond Myklebust <trond.myklebust@fys.uio.no>

   OK, but can you eventually generalize it to non-stream protocols
   (e.g. UDP)?

Sure, this is what MSG_MORE is meant to accommodate. UDP could
support it just fine.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 101+ messages in thread
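On a datagram socket the same hint works per-datagram rather than
per-segment: MSG_MORE queues data without emitting a packet, and the
first un-flagged send releases everything queued so far as one datagram.
A minimal userspace sketch of those semantics follows, assuming a
connected UDP socket; the helper name is made up, and this illustrates
the MSG_MORE behaviour Linux eventually shipped, not code in this patch.

#include <sys/socket.h>

/*
 * Build one UDP datagram from two send() calls: the header is queued
 * by MSG_MORE, and header+payload leave the host together when the
 * final, un-flagged send() is made. udp_fd must be a connected
 * SOCK_DGRAM socket for plain send() to work.
 */
int send_record(int udp_fd, const void *hdr, size_t hdr_len,
		const void *payload, size_t payload_len)
{
	/* Header queued in the socket; no packet is emitted yet. */
	if (send(udp_fd, hdr, hdr_len, MSG_MORE) < 0)
		return -1;

	/* Payload appended; one datagram goes out here. */
	if (send(udp_fd, payload, payload_len, 0) < 0)
		return -1;

	return 0;
}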
* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 21:19 ` David S. Miller
@ 2001-01-10  9:21 ` Trond Myklebust
  0 siblings, 0 replies; 101+ messages in thread
From: Trond Myklebust @ 2001-01-10  9:21 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> " " == David S Miller <davem@redhat.com> writes:

> OK, but can you eventually generalize it to non-stream protocols
> (e.g. UDP)?

> Sure, this is what MSG_MORE is meant to accommodate. UDP could
> support it just fine.

Great! I've been waiting for something like this. In particular, the
knfsd TCP server code can get very buffer-intensive without it, since
you need to pre-allocate one set of buffers per TCP connection
(otherwise you risk denial of service through buffer saturation when
doing wait+retry on blocked sockets).

If it all gets into the kernel, I'll do the work of adapting the NFS
and sunrpc code.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 101+ messages in thread
end of thread, other threads: [~2001-01-19 15:56 UTC | newest]

Thread overview: 101+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2001-01-08  1:24 [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 David S. Miller
2001-01-08 10:39 ` Christoph Hellwig
2001-01-08 10:34 ` David S. Miller
2001-01-08 18:05 ` Rik van Riel
2001-01-08 21:07 ` David S. Miller
2001-01-09 10:23 ` Ingo Molnar
2001-01-09 10:31 ` Christoph Hellwig
2001-01-09 10:31 ` David S. Miller
2001-01-09 11:28 ` Christoph Hellwig
2001-01-09 11:42 ` David S. Miller
2001-01-09 12:04 ` Ingo Molnar
2001-01-09 14:25 ` Stephen C. Tweedie
2001-01-09 14:33 ` Alan Cox
2001-01-09 15:00 ` Ingo Molnar
2001-01-09 15:27 ` Stephen C. Tweedie
2001-01-09 16:16 ` Ingo Molnar
2001-01-09 16:37 ` Alan Cox
2001-01-09 16:48 ` Ingo Molnar
2001-01-09 17:29 ` Alan Cox
2001-01-09 17:38 ` Jens Axboe
2001-01-09 18:38 ` Ingo Molnar
2001-01-09 19:54 ` Andrea Arcangeli
2001-01-09 20:10 ` Ingo Molnar
2001-01-10  0:00 ` Andrea Arcangeli
2001-01-09 20:12 ` Jens Axboe
2001-01-09 23:20 ` Andrea Arcangeli
2001-01-09 23:34 ` Jens Axboe
2001-01-09 23:52 ` Andrea Arcangeli
2001-01-17  5:16 ` Rik van Riel
2001-01-09 17:56 ` Chris Evans
2001-01-09 18:41 ` Ingo Molnar
2001-01-09 22:58 ` [patch]: ac4 blk (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) Jens Axboe
2001-01-09 19:20 ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 J Sloan
2001-01-09 18:10 ` Stephen C. Tweedie
2001-01-09 15:38 ` Benjamin C.R. LaHaise
2001-01-09 16:40 ` Ingo Molnar
2001-01-09 17:30 ` Benjamin C.R. LaHaise
2001-01-09 18:12 ` Stephen C. Tweedie
2001-01-09 18:35 ` Ingo Molnar
2001-01-09 17:53 ` Christoph Hellwig
2001-01-09 21:13 ` David S. Miller
2001-01-09 19:14 ` Linus Torvalds
2001-01-09 20:07 ` Ingo Molnar
2001-01-09 20:15 ` Linus Torvalds
2001-01-09 20:36 ` Christoph Hellwig
2001-01-09 20:55 ` Linus Torvalds
2001-01-09 21:12 ` Christoph Hellwig
2001-01-09 21:26 ` Linus Torvalds
2001-01-10  7:42 ` Christoph Hellwig
2001-01-10  8:05 ` Linus Torvalds
2001-01-10  8:33 ` Christoph Hellwig
2001-01-10  8:37 ` Andrew Morton
2001-01-10 23:32 ` Linus Torvalds
2001-01-19 15:55 ` Andrew Scott
2001-01-17 14:05 ` Rik van Riel
2001-01-18  0:53 ` Christoph Hellwig
2001-01-18  1:13 ` Linus Torvalds
2001-01-18 17:50 ` Christoph Hellwig
2001-01-18 18:04 ` Linus Torvalds
2001-01-18 21:12 ` Albert D. Cahalan
2001-01-19  1:52 ` 2.4.1-pre8 video/ohci1394 compile problem ebi4
2001-01-19  6:55 ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Linus Torvalds
2001-01-09 23:06 ` Benjamin C.R. LaHaise
2001-01-09 23:54 ` Linus Torvalds
2001-01-10  7:51 ` Gerd Knorr
2001-01-12  1:42 ` Stephen C. Tweedie
2001-01-09 11:05 ` Ingo Molnar
2001-01-09 18:27 ` Christoph Hellwig
2001-01-09 19:19 ` Ingo Molnar
2001-01-09 14:18 ` Stephen C. Tweedie
2001-01-09 14:40 ` Ingo Molnar
2001-01-09 14:51 ` Alan Cox
2001-01-09 15:17 ` Stephen C. Tweedie
2001-01-09 15:37 ` Ingo Molnar
2001-01-09 21:18 ` David S. Miller
2001-01-09 22:25 ` Linus Torvalds
2001-01-10 15:21 ` Stephen C. Tweedie
2001-01-09 15:25 ` Stephen Frost
2001-01-09 15:40 ` Ingo Molnar
2001-01-09 15:48 ` Stephen Frost
2001-01-10  1:14 ` Dave Zarzycki
2001-01-10  1:14 ` David S. Miller
2001-01-10  2:18 ` Dave Zarzycki
2001-01-10  1:19 ` Ingo Molnar
2001-01-10  2:56 ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) dean gaudet
2001-01-10  2:58 ` David S. Miller
2001-01-10  3:18 ` dean gaudet
2001-01-10  3:09 ` David S. Miller
2001-01-10  3:05 ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, Alan Cox
2001-01-08 21:56 ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Jes Sorensen
2001-01-08 21:48 ` David S. Miller
2001-01-08 22:32 ` Jes Sorensen
2001-01-08 22:36 ` David S. Miller
2001-01-09 12:12 ` Ingo Molnar
2001-01-08 22:43 ` Stephen Frost
2001-01-08 22:37 ` David S. Miller
2001-01-09 13:52 ` Trond Myklebust
2001-01-09 13:42 ` David S. Miller
2001-01-09 15:27 ` Trond Myklebust
2001-01-09 21:19 ` David S. Miller
2001-01-10  9:21 ` Trond Myklebust