* Is sendfile all that sexy?
@ 2001-01-14 18:29 jamal
2001-01-14 18:50 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 130+ messages in thread
From: jamal @ 2001-01-14 18:29 UTC (permalink / raw)
To: linux-kernel, netdev
I thought i'd run some tests on the new zerocopy patches
(this is using a hacked ttcp which knows how to do sendfile
and does MSG_TRUNC for true zero-copy receive, if you know what i mean
;-> ).
2 back to back SMP 2*PII-450Mhz hooked up via 1M acenics (gigE).
MTU 9K.
Before getting excited i had the courage to give plain 2.4.0-pre3 a whirl
and some things bothered me.
test1:
------
regular ttcp, no ZC and no sendfile. Send as much as you can in 15 secs;
actually 8192-byte chunks, 2048 of them at a time. Repeat until the 15
secs are up.
Repeat the test 5 times to narrow experimental deviation.
Throughput: ~99MB/sec (for those obsessed with Mbps ~810Mbps)
CPU abuse: server side 87%, client side 22% (the CPU measurement could do
with some work and a proper measure for SMP).
test2:
------
sendfile server.
created a file which is 8192*2048 bytes. Again the same 15 second
exercise as test1 (and the 5-set thing):
- throughput: 86MB/sec
- CPU: server 100%, client 17%
So i figured, no problem i'll re-run it with a file 10 times larger.
**I was disappointed to see no improvement.**
Looking at the system calls being made:
with the non-sendfile version, approximately 182K write-to-socket system
calls were made, each writing 8192 bytes; each call lasted on average
0.08 ms.
With the sendfile version (test2): 78 calls were made, each sending the
whole 8192*2048-byte file; each lasted about 199 msecs.
TWO observations:
- Given Linux's non-pre-emptability of the kernel i get the feeling that
sendfile could starve other user space programs. Imagine trying to send a
1Gig file on 10Mbps pipe in one shot.
- It doesn't matter if you break down the file into chunks for
self-pre-emption; sendfile is still a pig.
I have a feeling i am missing some very serious shit. So enlighten me.
Has anyone done similar tests?
Anyways, the struggle continues next with zc patches.
cheers,
jamal
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: Is sendfile all that sexy?
From: Ingo Molnar @ 2001-01-14 18:50 UTC
To: jamal; +Cc: linux-kernel, netdev

On Sun, 14 Jan 2001, jamal wrote:

> regular ttcp, no ZC and no sendfile. [...]
> Throughput: ~99MB/sec (for those obsessed with Mbps ~810Mbps)
> CPU abuse: server side 87% client side 22% [...]
> sendfile server.
> - throughput: 86MB/sec
> - CPU: server 100%, client 17%

i believe what you are seeing here is the overhead of the pagecache. When
using sendmsg() only, you do not read() the file every time, right? Is
ttcp using multiple threads? In that case the sendfile() may be using the
*same* file all the time, creating SMP locking overhead.

if this is the case, what result do you get if you use a separate,
isolated file per process? (And i bet that with DaveM's pagecache
scalability patch the situation would also get much better - the global
pagecache_lock hurts.)

	Ingo
* Re: Is sendfile all that sexy?
From: jamal @ 2001-01-14 19:02 UTC
To: Ingo Molnar; +Cc: linux-kernel, netdev

On Sun, 14 Jan 2001, Ingo Molnar wrote:

> i believe what you are seeing here is the overhead of the pagecache. When
> using sendmsg() only, you do not read() the file every time, right?

In that case just a user space buffer is sent, i.e. no file association.

> Is ttcp using multiple threads?

Only a single thread, single flow setup. Very primitive but simple.

> In that case the sendfile() may be using the *same* file all the time,
> creating SMP locking overhead.
>
> if this is the case, what result do you get if you use a separate,
> isolated file per process? (And i bet that with DaveM's pagecache
> scalability patch the situation would also get much better - the global
> pagecache_lock hurts.)

Already doing the single file, single process. However, i do run by time,
which means i could read the file from the beginning (offset 0) to the
end, then re-do it as many times as 15 secs would allow. Does this affect
it? I tried one 1.5 GB file; it was oopsing, and given my setup right now
i can't trace it. So i am using about 170M, which is read about 8 times
in the 15 secs.

cheers,
jamal
* Re: Is sendfile all that sexy?
From: Ingo Molnar @ 2001-01-14 19:09 UTC
To: jamal; +Cc: linux-kernel, netdev

On Sun, 14 Jan 2001, jamal wrote:

> Already doing the single file, single process. [...]

in this case there could still be valid performance differences, as
copying from user-space is cheaper than copying from the pagecache. To
rule out SMP interactions, you could try a UP-IOAPIC kernel on that box.
(I'm also curious what kind of numbers you'll get with the zerocopy
patch.)

> However, i do run by time, which means i could read the file from the
> beginning (offset 0) to the end, then re-do it as many times as
> 15 secs would allow. Does this affect it? [...]

no, in the case of a single thread this should have minimal impact. But
i'd suggest increasing the /proc/sys/net/tcp*mem* values (to 1MB or
more).

	Ingo
* Re: Is sendfile all that sexy?
From: jamal @ 2001-01-14 19:18 UTC
To: Ingo Molnar; +Cc: linux-kernel, netdev

On Sun, 14 Jan 2001, Ingo Molnar wrote:

> in this case there could still be valid performance differences, as
> copying from user-space is cheaper than copying from the pagecache. To
> rule out SMP interactions, you could try a UP-IOAPIC kernel on that box.

Let me complete this with the ZC patches first; then i'll do that. There
are a few retransmits; maybe receiver IRQ affinity might help some.

> (I'm also curious what kind of numbers you'll get with the zerocopy
> patch.)

Working on it.

> no, in the case of a single thread this should have minimal impact. But
> i'd suggest increasing the /proc/sys/net/tcp*mem* values (to 1MB or
> more).

The upper thresholds to 1000000? I should have mentioned that i set
/proc/sys/net/core/*mem* to currently 262144.

cheers,
jamal
* Re: Is sendfile all that sexy?
From: Linus Torvalds @ 2001-01-14 20:22 UTC
To: linux-kernel

In article <Pine.GSO.4.30.0101141237020.12354-100000@shell.cyberus.ca>,
jamal <hadi@cyberus.ca> wrote:
>
> Before getting excited i had the courage to give plain 2.4.0-pre3 a whirl
> and some things bothered me.

Note that "sendfile(fd, file, len)" is never going to be faster than
"write(fd, userdata, len)".

That's not the point of sendfile(). The point of sendfile() is to be
faster than the _combination_ of:

	addr = mmap(file, ...len...);
	write(fd, addr, len);

or

	read(file, userdata, len);
	write(fd, userdata, len);

and in your case you're not comparing sendfile() against this
combination. You're just comparing sendfile() against a simple
"write()".

And no, I don't actually think that sendfile() is all that hot. It was
_very_ easy to implement, and can be considered a 5-minute hack to give
a feature that fit very well in the MM architecture, and that the Apache
folks had already been using on other architectures.

The only obvious use for it is file serving, and as high-performance
file serving tends to end up as a kernel module in the end anyway (the
only hold-out is samba, and that's been discussed too), "sendfile()"
really is more a proof of concept than anything else.

Does anybody but apache actually use it?

		Linus

PS. I still _like_ sendfile(), even if the above sounds negative. It's
basically a "cool feature" that has zero negative impact on the design
of the system. It uses the same "do_generic_file_read()" that is used
for normal "read()", and is also used by the loop device and by
in-kernel fileserving. But it's not really "important".
* Re: Is sendfile all that sexy?
From: Ingo Molnar @ 2001-01-14 20:38 UTC
To: Linus Torvalds; +Cc: Linux Kernel List

On 14 Jan 2001, Linus Torvalds wrote:

> Does anybody but apache actually use it?

There is a Samba patch as well that makes it sendfile() based. Various
other projects use it too (phttpd for example), some FTP servers i
believe, and khttpd and TUX.

	Ingo
* Re: Is sendfile all that sexy?
From: Linus Torvalds @ 2001-01-14 21:44 UTC
To: Ingo Molnar; +Cc: Linux Kernel List

On Sun, 14 Jan 2001, Ingo Molnar wrote:

> There is a Samba patch as well that makes it sendfile() based. Various
> other projects use it too (phttpd for example), some FTP servers i
> believe, and khttpd and TUX.

At least khttpd uses "do_generic_file_read()", not sendfile per se. I
assume TUX does too. Sendfile itself is mainly only useful from user
space..

		Linus
* Re: Is sendfile all that sexy?
From: Ingo Molnar @ 2001-01-14 21:49 UTC
To: Linus Torvalds; +Cc: Linux Kernel List

On Sun, 14 Jan 2001, Linus Torvalds wrote:

> > There is a Samba patch as well that makes it sendfile() based. Various
> > other projects use it too (phttpd for example), some FTP servers i
> > believe, and khttpd and TUX.
>
> At least khttpd uses "do_generic_file_read()", not sendfile per se. I
> assume TUX does too. Sendfile itself is mainly only useful from user
> space..

yes, you are right. TUX does it mainly to avoid some of the user-space
interfacing overhead present in sys_sendfile(), and to be able to control
packet boundaries (ie. to have or not have the MSG_MORE flag). So TUX is
using its own sock_send_actor and own read_descriptor.

	Ingo
* Re: Is sendfile all that sexy?
From: Gerhard Mack @ 2001-01-14 21:54 UTC
To: Ingo Molnar; +Cc: Linus Torvalds, Linux Kernel List

On Sun, 14 Jan 2001, Ingo Molnar wrote:

> On 14 Jan 2001, Linus Torvalds wrote:
>
> > Does anybody but apache actually use it?
>
> There is a Samba patch as well that makes it sendfile() based. Various
> other projects use it too (phttpd for example), some FTP servers i
> believe, and khttpd and TUX.

Proftpd, to name one FTP server; nice little daemon, uses linux-privs
too.

	Gerhard

PS I wish someone would explain to me why distros insist on using WU
instead, given its horrid security record.

--
Gerhard Mack  gmack@innerfire.net

<><  As a computer I find your faith in technology amusing.
* Re: Is sendfile all that sexy?
From: Linus Torvalds @ 2001-01-14 22:40 UTC
To: Gerhard Mack; +Cc: Ingo Molnar, Linux Kernel List

On Sun, 14 Jan 2001, Gerhard Mack wrote:

> PS I wish someone would explain to me why distros insist on using WU
> instead, given its horrid security record.

I think it's a case of "better the devil you know..". Think of all the
security scares sendmail has historically had. But it's a pretty secure
piece of work now - and people know it backwards and forwards. Few
people advocate switching from sendmail these days (sure, they do exist,
but what I'm saying is that a long track record that includes security
issues isn't necessarily bad, if they have gotten fixed).

Of course, you may be right on wuftpd. It obviously wasn't designed with
security in mind; other alternatives may be better.

		Linus
* Re: Is sendfile all that sexy?
From: J Sloan @ 2001-01-14 22:45 UTC
To: Kernel Mailing List

Linus Torvalds wrote:

> Of course, you may be right on wuftpd. It obviously wasn't designed with
> security in mind; other alternatives may be better.

I run proftpd on all my ftp servers - it's fast, configurable, and can
do all the tricks I need - even Red Hat seems to agree that proftpd is
the way to go. Visit any Red Hat ftp site and they are running proftpd -

So, why do they keep shipping us wu-ftpd instead?

That really frosts me.

jjs
* Re: Is sendfile all that sexy?
From: H. Peter Anvin @ 2001-01-15 20:15 UTC
To: linux-kernel

Followup to: <3A622C25.766F3BCE@pobox.com>
By author: J Sloan <jjs@pobox.com>
In newsgroup: linux.dev.kernel

> I run proftpd on all my ftp servers - it's fast, configurable
> and can do all the tricks I need - even red hat seems to
> agree that proftpd is the way to go.
>
> Visit any red hat ftp site and they are running proftpd -
>
> So, why do they keep shipping us wu-ftpd instead?
>
> That really frosts me.

proftpd is not what you want for an FTP server whose main function is
*non-*anonymous access. It is very much written for the sole purpose of
being a great FTP server for a large anonymous FTP site. If you're
running a site large enough to matter, you can replace an RPM or two.

	-hpa

--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
* Re: Is sendfile all that sexy?
From: Michael Peddemors @ 2001-01-15 3:43 UTC
To: Gerhard Mack; +Cc: Ingo Molnar, Linux Kernel List

The two things I change every time are sendmail->qmail and
wuftpd->proftpd. But remember, security bugs are caught because more
people use one vs the other.. Bugs in Proftpd weren't caught until more
people started changing from wu-ftpd... Often, all it means when one
product has more bugs than another is that more people tried to find
bugs in one than the other... (Yes, a plug to get everyone to test 2.4
here.)

On Sun, 14 Jan 2001, Linus Torvalds wrote:

> On Sun, 14 Jan 2001, Gerhard Mack wrote:
> > PS I wish someone would explain to me why distros insist on using WU
> > instead, given its horrid security record.
>
> Of course, you may be right on wuftpd. It obviously wasn't designed with
> security in mind; other alternatives may be better.
>
> Linus

--
Michael Peddemors - Senior Consultant
Unix Administration - WebSite Hosting
Network Services - Programming
Wizard Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com
(604) 589-0037 Beautiful British Columbia, Canada
* Re: Is sendfile all that sexy?
From: Florian Weimer @ 2001-01-15 13:02 UTC
To: Gerhard Mack; +Cc: Linux Kernel List

Gerhard Mack <gmack@innerfire.net> writes:

> PS I wish someone would explain to me why distros insist on using WU
> instead, given its horrid security record.

The security record of Proftpd is not horrid, but embarrassing. They
once claimed to have fixed a vulnerability, but in fact introduced
another one...
* RE: Is sendfile all that sexy?
From: Tristan Greaves @ 2001-01-15 13:45 UTC
To: 'Linux Kernel List'

> -----Original Message-----
> From: Florian Weimer
> Sent: 15 January 2001 13:02
>
> The security record of Proftpd is not horrid, but embarrassing. They
> once claimed to have fixed a vulnerability, but in fact introduced
> another one...

Oh, come on, this is a classic event in bug fixing. All Software Has
Bugs [TM]. Nothing Is Completely Secure [TM]. As long as the
vulnerabilities are fixed as they happen (where possible), we should be
happy.

Tris.
* Re: Is sendfile all that sexy?
From: Dan Hollis @ 2001-01-15 1:14 UTC
To: Linus Torvalds; +Cc: linux-kernel

On 14 Jan 2001, Linus Torvalds wrote:

> That's not the point of sendfile(). The point of sendfile() is to be
> faster than the _combination_ of:
>	addr = mmap(file, ...len...);
>	write(fd, addr, len);
> or
>	read(file, userdata, len);
>	write(fd, userdata, len);

And boy is it ever. It blows both away by more than double. Not only
that, the mmap one grinds my box into the ground with swapping, while in
the sendfile() case you can't even tell it's running, except that the
drive is going like mad.

> Does anybody but apache actually use it?

I wonder why samba doesn't use it.

-Dan
* Re: Is sendfile all that sexy?
From: Jonathan Thackray @ 2001-01-15 15:24 UTC
To: linux-kernel

> Does anybody but apache actually use it?

Zeus uses it! (it was HP who added it to HP-UX first, at our request :-)

> PS. I still _like_ sendfile(), even if the above sounds negative. It's
> basically a "cool feature" that has zero negative impact on the design
> of the system. It uses the same "do_generic_file_read()" that is used
> for normal "read()", and is also used by the loop device and by
> in-kernel fileserving. But it's not really "important".

It's a very useful system call and makes file serving much more
scalable, and I'm glad that most Un*xes now have support for it (Linux,
FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to Linux is
sendpath(), which does the open() before the sendfile(), all combined
into one system call. Ugh, I hear you all scream :-)

Jon.

--
Jonathan Thackray  Zeus House, Cowley Road, Cambridge CB4 OZT, UK
Software Engineer  +44 1223 525000, fax +44 1223 525100
Zeus Technology    http://www.zeus.com/
* Re: Is sendfile all that sexy?
From: Matti Aarnio @ 2001-01-15 15:36 UTC
To: Jonathan Thackray; +Cc: linux-kernel

On Mon, Jan 15, 2001 at 03:24:55PM +0000, Jonathan Thackray wrote:

> It's a very useful system call and makes file serving much more
> scalable, and I'm glad that most Un*xes now have support for it
> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
> Linux is sendpath(), which does the open() before the sendfile(),
> all combined into one system call.

One thing about 'sendfile' (and likely 'sendpath') is that the current
(hammered into running binaries -> unchangeable) syscalls support only
up to 2GB files on 32-bit systems.

Glibc 2.2(9) at Red Hat, <sys/sendfile.h>:

	#ifdef __USE_FILE_OFFSET64
	# error "<sendfile.h> cannot be used with _FILE_OFFSET_BITS=64"
	#endif

I do admit that doing sendfile() on some extremely large file is
unlikely, but still...

	/Matti Aarnio
* Re: Is sendfile all that sexy?
From: H. Peter Anvin @ 2001-01-15 20:17 UTC
To: linux-kernel

Followup to: <20010115173607.S25659@mea-ext.zmailer.org>
By author: Matti Aarnio <matti.aarnio@zmailer.org>
In newsgroup: linux.dev.kernel

> One thing about 'sendfile' (and likely 'sendpath') is that the current
> (hammered into running binaries -> unchangeable) syscalls support only
> up to 2GB files on 32-bit systems.
>
> Glibc 2.2(9) at Red Hat, <sys/sendfile.h>:
>
>	#ifdef __USE_FILE_OFFSET64
>	# error "<sendfile.h> cannot be used with _FILE_OFFSET_BITS=64"
>	#endif
>
> I do admit that doing sendfile() on some extremely large file is
> unlikely, but still...

2 GB isn't really that extremely large these days. This is an
unpleasant limitation.

	-hpa

--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
* Re: Is sendfile all that sexy?
From: dean gaudet @ 2001-01-15 16:05 UTC
To: Jonathan Thackray; +Cc: linux-kernel

On Mon, 15 Jan 2001, Jonathan Thackray wrote:

> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
> Linux is sendpath(), which does the open() before the sendfile(),
> all combined into one system call.

how would sendpath() construct the Content-Length in the HTTP header?

it's totally unfortunate that the other unixes chose to combine writev()
into sendfile() rather than implementing TCP_CORK. TCP_CORK is useful
for FAR more than just sendfile() headers and footers. it's arguably the
most correct way to write server code. nagle/no-nagle in the default BSD
API both suck -- nagle because it delays packets which need to be sent;
no-nagle because it can send incomplete packets. i'm completely happy
that linus, davem and ingo refused to combine writev() into sendfile()
and suggested CORK when i pointed out the header/trailer problem.

imnsho if you want to optimise static file serving then it's pretty
pointless to continue working in userland. nobody is going to catch up
with all the kernel-side implementations in linux, NT, and solaris.

-dean

p.s. linus, apache-1.3 does *not* use sendfile(). it's in apache-2.0,
which unfortunately is now performing like crap because they didn't
listen to some of my advice well over a year ago. a case of "let's make
a pretty API and hope performance works out"... where i told them "i've
already written code using the API you suggest, and it *doesn't* work."
</rant> thankfully linux now has TUX.
* Re: Is sendfile all that sexy?
From: Jonathan Thackray @ 2001-01-15 18:34 UTC
To: dean gaudet; +Cc: linux-kernel

> how would sendpath() construct the Content-Length in the HTTP header?

You'd still stat() the file to decide whether to use sendpath() to send
it or not, check its Last-Modified: etc. Of course, you'd cache stat()
calls too for a few seconds. The main thing is that you save a valuable
fd, and open() is expensive, even more so than stat().

> TCP_CORK is useful for FAR more than just sendfile() headers and
> footers. it's arguably the most correct way to write server code.

Agreed -- the hard-coded Nagle algorithm makes no sense these days.

> imnsho if you want to optimise static file serving then it's pretty
> pointless to continue working in userland. nobody is going to catch up
> with all the kernel-side implementations in linux, NT, and solaris.

Hmmm, there's a place for userland httpds that are within a few percent
of kernel ones (like Zeus is, when I last looked). But I agree, hybrid
approaches will become more common, although the trend towards
server-side dynamic pages negates this. A kernel approach is a definite
win if you're used to a limited-scalability userland httpd like Apache.

Jon.

--
Jonathan Thackray  Zeus House, Cowley Road, Cambridge CB4 OZT, UK
Software Engineer  +44 1223 525000, fax +44 1223 525100
Zeus Technology    http://www.zeus.com/
* Re: Is sendfile all that sexy? 2001-01-15 18:34 ` Jonathan Thackray @ 2001-01-15 18:46 ` Linus Torvalds 2001-01-15 20:47 ` [patch] sendpath() support, 2.4.0-test3/-ac9 Ingo Molnar 2001-01-15 18:58 ` Is sendfile all that sexy? dean gaudet 1 sibling, 1 reply; 130+ messages in thread From: Linus Torvalds @ 2001-01-15 18:46 UTC (permalink / raw) To: linux-kernel In article <14947.17050.127502.936533@leda.cam.zeus.com>, Jonathan Thackray <jthackray@zeus.com> wrote: > >> how would sendpath() construct the Content-Length in the HTTP header? > >You'd still stat() the file to decide whether to use sendpath() to >send it or not, if it was Last-Modified: etc. Of course, you'd cache >stat() calls too for a few seconds. The main thing is that you save >a valuable fd and open() is expensive, even more so than stat(). "open" expensive? Maybe on HP-UX and other platforms. But give me numbers: I seriously doubt that int fd = open(..) fstat(fd..); sendfile(fd..); close(fd); is any slower than .. cache stat() in user space based on name .. sendpath(name, ..); on any real load. >> TCP_CORK is useful for FAR more than just sendfile() headers and >> footers. it's arguably the most correct way to write server code. > >Agreed -- the hard-coded Nagle algorithm makes no sense these days. The fact I dislike about the HP-UX implementation is that it is so _obviously_ stupid. And I have to say that I absolutely despise the BSD people. They did sendfile() after both Linux and HP-UX had done it, and they must have known about both implementations. And they chose the HP-UX braindamage, and even brag about the fact that they were stupid and didn't understand TCP_CORK (they don't say so in those exact words, of course - they just show that they were stupid and clueless by the things they brag about). Oh, well. Not everybody can be as goodlooking as me. It's a curse. 
Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
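The four-call sequence Linus describes can be written down as a minimal user-space sketch. It sends into a plain file descriptor so the demo stays self-contained (sendfile() accepts a non-socket out_fd on Linux 2.6.33 and later); a web server would pass the client socket, bracketed by TCP_CORK, as out_fd. Error handling is trimmed and the function name is invented:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* open + fstat + sendfile + close, returning bytes sent or -1. */
static ssize_t serve_file(int out_fd, const char *path)
{
    struct stat st;
    off_t off = 0;
    ssize_t sent;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    /* A web server would emit the response header here, taking
     * Content-Length from st.st_size (ideally under TCP_CORK so the
     * header and the first data page share a packet). */
    sent = sendfile(out_fd, fd, &off, st.st_size);
    close(fd);
    return sent;
}
```

The fstat() supplies the Content-Length that started this subthread, which is why the fd-based sequence loses nothing to a hypothetical sendpath().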
* [patch] sendpath() support, 2.4.0-test3/-ac9 2001-01-15 18:46 ` Linus Torvalds @ 2001-01-15 20:47 ` Ingo Molnar 2001-01-16 4:51 ` dean gaudet 0 siblings, 1 reply; 130+ messages in thread From: Ingo Molnar @ 2001-01-15 20:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel List, Jonathan Thackray [-- Attachment #1: Type: TEXT/PLAIN, Size: 596 bytes --] On 15 Jan 2001, Linus Torvalds wrote: > int fd = open(..) > fstat(fd..); > sendfile(fd..); > close(fd); > > is any slower than > > .. cache stat() in user space based on name .. > sendpath(name, ..); > > on any real load. just for kicks i've implemented sendpath() support. (patch against 2.4.0-test and sample code attached) It appears to work just fine here. With a bit of reorganization in mm/filemap.c it was quite straightforward to do. Jonathan, is this what Zeus needs? If yes, it could be interesting to run a simple benchmark to compare sendpath() to open()+sendfile()? Ingo [-- Attachment #2: Type: TEXT/PLAIN, Size: 4020 bytes --] --- linux/mm/filemap.c.orig Mon Jan 15 22:43:21 2001 +++ linux/mm/filemap.c Mon Jan 15 23:09:55 2001 @@ -39,6 +39,8 @@ * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com> * * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de> + * + * Started sendpath() support, (C) 2000 Ingo Molnar <mingo@redhat.com> */ atomic_t page_cache_size = ATOMIC_INIT(0); @@ -1450,15 +1452,15 @@ return written; } -asmlinkage ssize_t sys_sendfile(int out_fd, int in_fd, off_t *offset, size_t count) +/* + * Get input file, and verify that it is ok.. + */ +static struct file * get_verify_in_file (int in_fd, size_t count) { - ssize_t retval; - struct file * in_file, * out_file; - struct inode * in_inode, * out_inode; + struct inode * in_inode; + struct file * in_file; + int retval; - /* - * Get input file, and verify that it is ok.. 
- */ retval = -EBADF; in_file = fget(in_fd); if (!in_file) @@ -1474,10 +1476,21 @@ retval = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, in_file->f_pos, count); if (retval) goto fput_in; + return in_file; +fput_in: + fput(in_file); +out: + return ERR_PTR(retval); +} +/* + * Get output file, and verify that it is ok.. + */ +static struct file * get_verify_out_file (int out_fd, size_t count) +{ + struct file *out_file; + struct inode *out_inode; + int retval; - /* - * Get output file, and verify that it is ok.. - */ retval = -EBADF; out_file = fget(out_fd); if (!out_file) @@ -1491,6 +1504,29 @@ retval = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file->f_pos, count); if (retval) goto fput_out; + return out_file; + +fput_out: + fput(out_file); +fput_in: + return ERR_PTR(retval); +} + +asmlinkage ssize_t sys_sendfile(int out_fd, int in_fd, off_t *offset, size_t count) +{ + ssize_t retval; + struct file * in_file, *out_file; + + in_file = get_verify_in_file(in_fd, count); + if (IS_ERR(in_file)) { + retval = PTR_ERR(in_file); + goto out; + } + out_file = get_verify_out_file(out_fd, count); + if (IS_ERR(out_file)) { + retval = PTR_ERR(out_file); + goto fput_in; + } retval = 0; if (count) { @@ -1524,6 +1560,56 @@ fput(in_file); out: return retval; +} + +asmlinkage ssize_t sys_sendpath(int out_fd, char *path, off_t *offset, size_t count) +{ + struct file in_file, *out_file; + read_descriptor_t desc; + loff_t pos = 0, *ppos; + struct nameidata nd; + int ret; + + out_file = get_verify_out_file(out_fd, count); + if (IS_ERR(out_file)) { + ret = PTR_ERR(out_file); + goto err; + } + ret = user_path_walk(path, &nd); + if (ret) + goto put_out; + ret = -EINVAL; + if (!nd.dentry || !nd.dentry->d_inode) + goto put_in_out; + + memset(&in_file, 0, sizeof(in_file)); + in_file.f_dentry = nd.dentry; + in_file.f_op = nd.dentry->d_inode->i_fop; + + ppos = &in_file.f_pos; + if (offset) { + if (get_user(pos, offset)) + goto put_in_out; + ppos = &pos; + } + 
desc.written = 0; + desc.count = count; + desc.buf = (char *) out_file; + desc.error = 0; + do_generic_file_read(&in_file, ppos, &desc, file_send_actor, 0); + + ret = desc.written; + if (!ret) + ret = desc.error; + if (offset) + put_user(pos, offset); + +put_in_out: + fput(out_file); +put_out: + path_release(&nd); +err: + return ret; } /* --- linux/arch/i386/kernel/entry.S.orig Mon Jan 15 22:42:47 2001 +++ linux/arch/i386/kernel/entry.S Mon Jan 15 22:43:12 2001 @@ -646,6 +646,7 @@ .long SYMBOL_NAME(sys_getdents64) /* 220 */ .long SYMBOL_NAME(sys_fcntl64) .long SYMBOL_NAME(sys_ni_syscall) /* reserved for TUX */ + .long SYMBOL_NAME(sys_sendpath) /* * NOTE!! This doesn't have to be exact - we just have @@ -653,6 +654,6 @@ * entries. Don't panic if you notice that this hasn't * been shrunk every time we add a new system call. */ - .rept NR_syscalls-221 + .rept NR_syscalls-223 .long SYMBOL_NAME(sys_ni_syscall) .endr [-- Attachment #3: Type: TEXT/PLAIN, Size: 593 bytes --] /* * Sample sendpath() code. It should mainly be used for sockets. */ #include <linux/unistd.h> #include <sys/sendfile.h> #include <stdlib.h> #include <unistd.h> #include <stdio.h> #include <fcntl.h> #define __NR_sendpath 223 _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off, size_t, size) int main (int argc, char **argv) { int out_fd; int ret; out_fd = open("./tmpfile", O_RDWR|O_CREAT|O_TRUNC, 0700); ret = sendpath(out_fd, "/usr/include/unistd.h", NULL, 300); printf("sendpath wrote %d bytes into ./tmpfile.\n", ret); return 0; } ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: [patch] sendpath() support, 2.4.0-test3/-ac9 2001-01-15 20:47 ` [patch] sendpath() support, 2.4.0-test3/-ac9 Ingo Molnar @ 2001-01-16 4:51 ` dean gaudet 2001-01-16 4:59 ` Linus Torvalds 2001-01-16 9:19 ` [patch] sendpath() support, 2.4.0-test3/-ac9 Ingo Molnar 0 siblings, 2 replies; 130+ messages in thread From: dean gaudet @ 2001-01-16 4:51 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, Linux Kernel List, Jonathan Thackray On Mon, 15 Jan 2001, Ingo Molnar wrote: > just for kicks i've implemented sendpath() support. > > _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off, size_t, size) hey so how do you implement transmit timeouts with sendpath() ? (i.e. drop the client after 30 seconds of no progress.) -dean - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: [patch] sendpath() support, 2.4.0-test3/-ac9 2001-01-16 4:51 ` dean gaudet @ 2001-01-16 4:59 ` Linus Torvalds 2001-01-16 9:48 ` 'native files', 'object fingerprints' [was: sendpath()] Ingo Molnar 2001-01-16 9:19 ` [patch] sendpath() support, 2.4.0-test3/-ac9 Ingo Molnar 1 sibling, 1 reply; 130+ messages in thread From: Linus Torvalds @ 2001-01-16 4:59 UTC (permalink / raw) To: dean gaudet; +Cc: Ingo Molnar, Linux Kernel List, Jonathan Thackray On Mon, 15 Jan 2001, dean gaudet wrote: > On Mon, 15 Jan 2001, Ingo Molnar wrote: > > > just for kicks i've implemented sendpath() support. > > > > _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off, size_t, size) > > hey so how do you implement transmit timeouts with sendpath() ? (i.e. > drop the client after 30 seconds of no progress.) The whole "sendpath()" idea is just stupid. You want to do a non-blocking send, so that you don't block on the socket, and do some simple multiplexing in your server. And "sendpath()" cannot do that without having to look up the name again, and again, and again. Which makes the performance "optimization" a horrible pessimisation. Basically, sendpath() seems to be only useful for blocking and uninterruptible file sending. Bad design. I'm not touching it with a ten-foot pole. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
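The multiplexing Linus refers to (keep the fd open, mark the socket non-blocking, resume sendfile() from the saved offset whenever the socket drains) can be sketched as a single-client toy. `pump_file` and the self-draining peer are demo inventions; a real server would poll() many sockets and use the saved offset to detect and time out stalled clients:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Push 'count' bytes of in_fd down a non-blocking out_fd, resuming
 * from the saved offset after every short or EAGAIN'd write. peer_fd
 * is the other end of a socketpair; the demo drains it so a single
 * process cannot deadlock. A server would poll() here instead and
 * update a per-client progress timestamp, dropping the client when
 * the offset stops advancing for too long. */
static ssize_t pump_file(int out_fd, int peer_fd, int in_fd, size_t count)
{
    off_t off = 0;
    char sink[4096];

    while ((size_t)off < count) {
        ssize_t n = sendfile(out_fd, in_fd, &off, count - (size_t)off);
        if (n < 0 && errno != EAGAIN)
            return -1;
        /* Socket buffer full (or just written to): make room. */
        while (read(peer_fd, sink, sizeof(sink)) > 0)
            ;
    }
    return (ssize_t)off;
}
```

Because the offset lives in user space between calls, progress tracking comes for free; a path-based call would have to redo the lookup on every resume, which is Linus's objection.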
* 'native files', 'object fingerprints' [was: sendpath()]
  2001-01-16  4:59           ` Linus Torvalds
@ 2001-01-16  9:48             ` Ingo Molnar
  2000-01-01  2:02               ` Pavel Machek
                                ` (5 more replies)
  0 siblings, 6 replies; 130+ messages in thread
From: Ingo Molnar @ 2001-01-16 9:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dean gaudet, Linux Kernel List, Jonathan Thackray

On Mon, 15 Jan 2001, Linus Torvalds wrote:

> > _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off, size_t, size)

> You want to do a non-blocking send, so that you don't block on the
> socket, and do some simple multiplexing in your server.
>
> And "sendpath()" cannot do that without having to look up the name
> again, and again, and again. Which makes the performance
> "optimization" a horrible pessimisation.

yep, correct. But take a look at the trick it does with file
descriptors, i believe it could be a useful way of doing things. It
basically privatizes a struct file, without inserting it into the
enumerated file descriptors. This shows that 'native files' are
possible: file structs without file descriptor integers mapped to them.

ob'plug: this privatized file descriptor mechanism is used in TUX [TUX
privatizes files by putting them into the HTTP request structure - ie.
timeouts and continuation/nonblocking logic can be done with them]. But
TUX is trusted code, and it can pass a struct file to the VFS without
having to validate it, and TUX will also free such file descriptors.

But even user-space code could use 'native files', via the following,
safe mechanism:

1) current->native_files list, freed at exit_files() time.

2) "struct native_file" which embeds "struct file". It has the
   following fields:

	struct native_file {
		unsigned long master_fingerprint[8];
		unsigned long file_fingerprint[8];
		struct file file;
	};

'fingerprints' are 256-bit, true random numbers. master_fingerprint is
global to the kernel and is generated once per boot. It validates the
pointer of the structure. The master fingerprint is never known to
user-space.

file_fingerprint is a 256-bit identifier generated for this native
file. The file fingerprint and the (kernel) pointer to the native file
is returned to user-space. The cryptographic safety of these 256-bit
random numbers guarantees that no breach can occur in a reasonable
period of time. It's in essence an 'encrypted' communication between
kernel and user-space.

user-space thus can pass a pointer to the following structure:

	struct safe_kpointer {
		void *kaddr;
		unsigned long fingerprint[4];
	};

the kernel can validate kaddr by 1) validating the pointer via the
master fingerprint (every valid kernel pointer must point to a
structure that starts with the master fingerprint's copy). Then
usage-permissions are validated by checking the file fingerprint (the
per-object fingerprint).

this is a safe, very fast [ O(1) ] object-permission model. (it's a
variation of a former idea of yours.) A process can pass object
fingerprints and kernel pointers to other processes too - thus the
other process can access the object too. Threads will 'naturally'
share objects, because fingerprints are typically stored in memory.

3) on closing a native file the fingerprint is destroyed (first byte
of the master fingerprint copy is overwritten).

what do you think about this? I believe most of the file APIs can be /
should be reworked to use native files, and 'Unix files' would just be
a compatibility layer parallel to them. Then various applications
could convert to 'native file' usage - i believe file servers which
have lots of file descriptors would do this first.

(this 'fingerprint' mechanism can be used for any object, not only
files.)

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 130+ messages in thread
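The validation scheme above can be modelled in user space. In the following toy, plain rand() stands in for the true random numbers the scheme requires (which, as the replies note, is exactly the risky dependency), an int payload stands in for struct file, and every name is invented for the sketch:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FP_WORDS 8 /* stands in for the 256-bit fingerprints */

static unsigned long master_fp[FP_WORDS]; /* "generated once per boot" */

struct native_obj {
    unsigned long master_copy[FP_WORDS]; /* validates the pointer */
    unsigned long obj_fp[FP_WORDS];      /* validates the capability */
    int payload;                         /* stands in for struct file */
};

/* rand() is NOT cryptographically safe - a real implementation needs
 * true random numbers. */
static void fp_random(unsigned long *fp)
{
    for (int i = 0; i < FP_WORDS; i++)
        fp[i] = ((unsigned long)rand() << 15) ^ (unsigned long)rand();
}

static struct native_obj *native_create(int payload, unsigned long *fp_out)
{
    struct native_obj *o = malloc(sizeof(*o));

    memcpy(o->master_copy, master_fp, sizeof(o->master_copy));
    fp_random(o->obj_fp);
    memcpy(fp_out, o->obj_fp, sizeof(o->obj_fp));
    o->payload = payload;
    return o;
}

/* The O(1) check: the master fingerprint validates the pointer, the
 * per-object fingerprint validates permission to use it. */
static struct native_obj *native_lookup(void *kaddr, const unsigned long *fp)
{
    struct native_obj *o = kaddr;

    if (memcmp(o->master_copy, master_fp, sizeof(master_fp)) != 0)
        return NULL;
    if (memcmp(o->obj_fp, fp, FP_WORDS * sizeof(*fp)) != 0)
        return NULL;
    return o;
}

/* Close: destroying the master copy makes stale tokens fail. (The toy
 * leaks the object instead of freeing it, to keep the demo safe.) */
static void native_close(struct native_obj *o)
{
    o->master_copy[0] ^= 1UL;
}
```

The model makes the trade visible: lookup is two memcmp()s regardless of how many objects exist, but revocation, accounting, and entropy quality all become the hard problems raised downthread.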
* Re: 'native files', 'object fingerprints' [was: sendpath()]
  2001-01-16  9:48 ` 'native files', 'object fingerprints' [was: sendpath()] Ingo Molnar
@ 2000-01-01  2:02   ` Pavel Machek
  2001-01-16 11:13   ` Andi Kleen
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 130+ messages in thread
From: Pavel Machek @ 2000-01-01 2:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, dean gaudet, Linux Kernel List, Jonathan Thackray

Hi!

> struct safe_kpointer {
>     void *kaddr;
>     unsigned long fingerprint[4];
> };
>
> the kernel can validate kaddr by 1) validating the pointer via the master
> fingerprint (every valid kernel pointer must point to a structure that
> starts with the master fingerprint's copy). Then usage-permissions are
> validated by checking the file fingerprint (the per-object fingerprint).
>
> this is a safe, very fast [ O(1) ] object-permission model. (it's a
> variation of a former idea of yours.) A process can pass object
> fingerprints and kernel pointers to other processes too - thus the other
> process can access the object too. Threads will 'naturally' share objects,
> because fingerprints are typically stored in memory.

I do not know if I'd trust this. First, (fd < current->fdlimit &&
current->fdlist[fd]) is O(1), too. Sure, passing those is slightly
hard, but we can do that already.

With your proposal, all hopes for fuser and revoke are out. Ouch; you
say a process can pass it to another process. How will the kernel know
not to free the fd until _both_ have freed it?

Plus, you are playing tricks with random numbers. Up to now, only ssh
and similar depended on random numbers. Now the kernel relies on them
during boot. Notice that the most important "master fingerprint" is
generated first. At that time you might not have enough entropy in
your pools.
								Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: 'native files', 'object fingerprints' [was: sendpath()] 2001-01-16 9:48 ` 'native files', 'object fingerprints' [was: sendpath()] Ingo Molnar 2000-01-01 2:02 ` Pavel Machek @ 2001-01-16 11:13 ` Andi Kleen 2001-01-16 11:26 ` Ingo Molnar 2001-01-16 13:57 ` 'native files', 'object fingerprints' [was: sendpath()] Jamie Lokier ` (3 subsequent siblings) 5 siblings, 1 reply; 130+ messages in thread From: Andi Kleen @ 2001-01-16 11:13 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, dean gaudet, Linux Kernel List, Jonathan Thackray On Tue, Jan 16, 2001 at 10:48:34AM +0100, Ingo Molnar wrote: > this is a safe, very fast [ O(1) ] object-permission model. (it's a > variation of a former idea of yours.) A process can pass object > fingerprints and kernel pointers to other processes too - thus the other > process can access the object too. Threads will 'naturally' share objects, >... Just setuid etc. doesn't work with that because access cannot be easily revoked without disturbing other clients. To handle that you would probably need a "relookup if needed" mechanism similar to what NFSv4 has, so that you can force other users to relookup after you revoked a key. That complicates the use a lot though. Also the model depends on good secure random numbers, which is questionable in many environments (e.g. a diskless box where the random device effectively gets no new input) -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: 'native files', 'object fingerprints' [was: sendpath()] 2001-01-16 11:13 ` Andi Kleen @ 2001-01-16 11:26 ` Ingo Molnar 2001-01-16 11:37 ` Andi Kleen 0 siblings, 1 reply; 130+ messages in thread From: Ingo Molnar @ 2001-01-16 11:26 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, dean gaudet, Linux Kernel List, Jonathan Thackray On Tue, 16 Jan 2001, Andi Kleen wrote: > On Tue, Jan 16, 2001 at 10:48:34AM +0100, Ingo Molnar wrote: > > this is a safe, very fast [ O(1) ] object-permission model. (it's a > > variation of a former idea of yours.) A process can pass object > > fingerprints and kernel pointers to other processes too - thus the other > > process can access the object too. Threads will 'naturally' share objects, > >... > > Just setuid etc. doesn't work with that because access cannot be > easily revoked without disturbing other clients. well, you cannot easily close() an already shared file descriptor in another process's context either. Is revocation so important? Why is setuid() a problem? A native file is just like a normal file, with the difference that not an integer but a fingerprint identifies it, and that access and usage counts are not automatically inherited across some explicit sharing interface. perhaps we could get most of the advantages by allowing the relaxation of the 'allocate first free file descriptor number' rule for normal Unix files? > Also the model depends on good secure random numbers, which is > questionable in many environments (e.g. a diskless box where the > random device effectively gets no new input) true, although newer chipsets include hardware random generators. But indeed, object fingerprints (tokens? ids?) make the random generator a much more central thing. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: 'native files', 'object fingerprints' [was: sendpath()]
  2001-01-16 11:26     ` Ingo Molnar
@ 2001-01-16 11:37       ` Andi Kleen
  2001-01-16 12:04         ` O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] Ingo Molnar
  0 siblings, 1 reply; 130+ messages in thread
From: Andi Kleen @ 2001-01-16 11:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Linus Torvalds, dean gaudet, Linux Kernel List,
      Jonathan Thackray

On Tue, Jan 16, 2001 at 12:26:12PM +0100, Ingo Molnar wrote:
>
> On Tue, 16 Jan 2001, Andi Kleen wrote:
>
> > On Tue, Jan 16, 2001 at 10:48:34AM +0100, Ingo Molnar wrote:
> > > this is a safe, very fast [ O(1) ] object-permission model. (it's a
> > > variation of a former idea of yours.) A process can pass object
> > > fingerprints and kernel pointers to other processes too - thus the other
> > > process can access the object too. Threads will 'naturally' share objects,
> > >...
> >
> > Just setuid etc. doesn't work with that because access cannot be
> > easily revoked without disturbing other clients.
>
> well, you cannot easily close() an already shared file descriptor in
> another process's context either. Is revocation so important? Why is
> setuid() a problem? A native file is just like a normal file, with the
> difference that not an integer but a fingerprint identifies it, and that
> access and usage counts are not automatically inherited across some
> explicit sharing interface.

Actually on second thought exec() is more of a problem than setuid(),
because it requires closing of file descriptors. So if you could devise
a security model that doesn't depend on exec giving you a clean plate
-- then it could work, but would probably not be very unixy.
I'm amazed how non flamed you can present radical API ideas though, I even get flamed for much smaller things (like using text errors to replace the hundreds of EINVALs in the rtnetlink message interface) ;);) > > perhaps we could get most of the advantages by allowing the relaxation of > the 'allocate first free file descriptor number' rule for normal Unix > files? Not sure I follow. You mean dup2() ? -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]]
  2001-01-16 11:37       ` Andi Kleen
@ 2001-01-16 12:04         ` Ingo Molnar
  2001-01-16 12:09           ` Ingo Molnar
                            ` (3 more replies)
  0 siblings, 4 replies; 130+ messages in thread
From: Ingo Molnar @ 2001-01-16 12:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, dean gaudet, Linux Kernel List, Jonathan Thackray

On Tue, 16 Jan 2001, Andi Kleen wrote:

> > the 'allocate first free file descriptor number' rule for normal Unix
> > files?

> Not sure I follow. You mean dup2() ?

I'm sure you know this: when there are thousands of files open already,
much of the overhead of opening a new file comes from the mandatory
POSIX requirement of allocating the first not yet allocated file
descriptor integer to this file. Eg. if files 0, 1, 2, 10, 11 are
already open, the kernel must allocate file descriptor 3. Many
utilities rely on this, and the rule makes sense in a select()
environment, because it compresses the 'file descriptor spectrum'. But
in a non-select(), event-driven environment it becomes unnecessary
overhead.

- probably the most radical solution is what i suggested, to completely
avoid the unique mapping of file structures to an integer range, and
use the address of the file structure (and some cookies) as an
identification.

- a less radical solution would be to still map file structures to an
integer range (file descriptors) and usage-maintain files per process,
but relax the 'allocate first non-allocated integer in the range' rule.
I'm not sure exactly how simple this is, but something like this should
work: on close()-ing file descriptors the freed file descriptors would
be cached in a list (this needs a new, separate structure which must be
allocated/freed as well). Something like:

	struct lazy_filedesc {
		int fd;
		struct file *file;
	};

	struct task {
		...
		struct lazy_filedesc *lazy_files;
		...
	};

the actual file descriptor bit of a 'lazy file' would be cleared for
real on close(), and the '*file' argument is not a real file - it's
NULL if at close() time this process wasn't the last user of the file,
or contains a pointer to an allocated (but otherwise invalid) file
structure. This must happen to ensure the first-free-desc rule, and to
optimize freeing/allocation of file structures.

Now, if the new code does a:

	fd = open(..., O_ANY);

then the kernel looks at the current->lazy_files list, and tries to set
the file descriptor bit in the current->files file table. If successful
then open() uses desc->fd and desc->file (if available) for opening the
new file, and unlinks+frees the lazy descriptor. If unsuccessful then
open() frees desc->file, frees and unlinks the descriptor and goes on
to look at the next descriptor.

- worst-case overhead is the extra allocation overhead of the (very
small) lazy file descriptor. Worst case happens only if O_ANY
allocation is mixed in a special way with normal open()s.

- best-case overhead saves us a get_unused_fd() call, which can be
*very* expensive (in terms of CPU time and cache footprint) if
thousands of files are used. If O_ANY is used mostly, then the best
case is always triggered.

- (the number of lazy files must be limited to some sane value)

at exit_files() time the current->lazy_files list must be processed. On
exec() it does not get inherited. current->lazy_files has no effect on
task state or semantics otherwise, it's only an isolated 'information
cache'.

Have i missed something important?

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 130+ messages in thread
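The contrast between the two allocation rules can be modelled with a user-space toy: the POSIX rule scans for the lowest free slot (the find_next_zero_bit() cost mentioned downthread), while the O_ANY path pops a recently closed descriptor in O(1) and falls back to the scan, skipping stale cache entries. All names are invented for the sketch and the kernel's real fd table differs:

```c
#include <assert.h>

#define MAX_FD 1024

static unsigned char fd_used[MAX_FD];
static int lazy_fds[MAX_FD]; /* stack of recently closed descriptors */
static int lazy_top;

/* POSIX rule: linear scan for the lowest free descriptor. */
static int alloc_lowest(void)
{
    for (int fd = 0; fd < MAX_FD; fd++)
        if (!fd_used[fd]) {
            fd_used[fd] = 1;
            return fd;
        }
    return -1;
}

/* O_ANY rule: reuse any cached descriptor in O(1), else fall back. */
static int alloc_any(void)
{
    while (lazy_top > 0) {
        int fd = lazy_fds[--lazy_top];
        if (!fd_used[fd]) { /* slot may have been reused meanwhile */
            fd_used[fd] = 1;
            return fd;
        }
    }
    return alloc_lowest();
}

static void close_fd(int fd)
{
    fd_used[fd] = 0;
    if (lazy_top < MAX_FD)
        lazy_fds[lazy_top++] = fd;
}
```

The stale-entry check in alloc_any() plays the role of the "otherwise invalid file structure" in the lazy_filedesc proposal: a cached slot is only a hint, and a normal lowest-fd open() may have claimed it in the meantime.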
* Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] 2001-01-16 12:04 ` O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] Ingo Molnar @ 2001-01-16 12:09 ` Ingo Molnar 2001-01-16 12:13 ` Peter Samuelson ` (2 subsequent siblings) 3 siblings, 0 replies; 130+ messages in thread From: Ingo Molnar @ 2001-01-16 12:09 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, dean gaudet, Linux Kernel List, Jonathan Thackray On Tue, 16 Jan 2001, Ingo Molnar wrote: > struct lazy_filedesc { > int fd; > struct file *file; > } in fact "struct file" can (ab)used for this, no need for new structures or new fields. Eg. file->f_flags contains the cached descriptor-information. file->f_list is used for the current->lazy_files ringlist. this way there is no additional allocation overhead in the worst-case. (unless i'm missing something obvious.) Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] 2001-01-16 12:04 ` O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] Ingo Molnar 2001-01-16 12:09 ` Ingo Molnar @ 2001-01-16 12:13 ` Peter Samuelson 2001-01-16 12:33 ` Ingo Molnar 2001-01-16 12:34 ` Andi Kleen 2001-01-16 13:00 ` Mitchell Blank Jr 3 siblings, 1 reply; 130+ messages in thread From: Peter Samuelson @ 2001-01-16 12:13 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linux Kernel List [Ingo Molnar] > - probably the most radical solution is what i suggested, to > completely avoid the unique-mapping of file structures to an integer > range, and use the address of the file structure (and some cookies) > as an identification. Careful, these must cast to non-negative integers, without clashing. > fd = open(...,O_ANY); I like this idea, but call it O_ALLOCANYFD. Peter - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] 2001-01-16 12:13 ` Peter Samuelson @ 2001-01-16 12:33 ` Ingo Molnar 2001-01-16 14:40 ` Felix von Leitner 0 siblings, 1 reply; 130+ messages in thread From: Ingo Molnar @ 2001-01-16 12:33 UTC (permalink / raw) To: Peter Samuelson; +Cc: Linux Kernel List On Tue, 16 Jan 2001, Peter Samuelson wrote: > [Ingo Molnar] > > - probably the most radical solution is what i suggested, to > > completely avoid the unique-mapping of file structures to an integer > > range, and use the address of the file structure (and some cookies) > > as an identification. > > Careful, these must cast to non-negative integers, without clashing. if you read my (radical) proposal, the identification is based on a kernel pointer and a 256-bit random integer. So non-negative integers are not needed. (file-IO system-calls would be modified to detect if 'Unix file descriptors' or pointers to 'native file descriptors' are passed to them, so this is truly radical.) Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] 2001-01-16 12:33 ` Ingo Molnar @ 2001-01-16 14:40 ` Felix von Leitner 0 siblings, 0 replies; 130+ messages in thread From: Felix von Leitner @ 2001-01-16 14:40 UTC (permalink / raw) To: Linux Kernel List Thus spake Ingo Molnar (mingo@elte.hu): > if you read my (radical) proposal, the identification is based on a kernel > pointer and a 256-bit random integer. So non-negative integers are not > needed. (file-IO system-calls would be modified to detect if 'Unix file > descriptors' or pointers to 'native file descriptors' are passed to them, > so this is truly radical.) Yuck, don't pass pointers in kernel space to user space! NT does it and look what kernel call argument verification havoc it wrought over them! Felix - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] 2001-01-16 12:04 ` O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] Ingo Molnar 2001-01-16 12:09 ` Ingo Molnar 2001-01-16 12:13 ` Peter Samuelson @ 2001-01-16 12:34 ` Andi Kleen 2001-01-16 13:00 ` Mitchell Blank Jr 3 siblings, 0 replies; 130+ messages in thread From: Andi Kleen @ 2001-01-16 12:34 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Linus Torvalds, dean gaudet, Linux Kernel List, Jonathan Thackray On Tue, Jan 16, 2001 at 01:04:22PM +0100, Ingo Molnar wrote: > - a less radical solution would be to still map file structures to an > integer range (file descriptors) and usage-maintain files per processes, > but relax the 'allocate first non-allocated integer in the range' rule. > I'm not sure exactly how simple this is, but something like this should > work: on close()-ing file descriptors the freed file descriptors would be > cached in a list (this needs a new, separate structure which must be > allocated/freed as well). Something like: > > struct lazy_filedesc { > int fd; > struct file *file; > } More generic file -> fd mapping would be useful to speed up poll() too, because the event trigger could directly modify the poll table without a second slow walk over the whole table. So you could add another bit that tells if the fd is open or closed and share it with poll. Also in that table you could just keep a linked ordered free list and not use GFP_ANY, because getting the lowest would be rather cheap. Disadvantage is that it would need more cache and more overhead than the current scheme. [in a way it is a ugly duck like pte<->vma links] > - Best-case overhead saves us a get_unused_fd() call, which can be *very* > expensive (in terms of CPU time and cache footprint) if thousands of > files are used. If O_ANY is used mostly, then the best-case is always > triggered. Really? Does the open_fds bitmap get that big ? 
Maybe it just needs a faster find_next_zero_bit() @) -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] 2001-01-16 12:04 ` O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] Ingo Molnar ` (2 preceding siblings ...) 2001-01-16 12:34 ` Andi Kleen @ 2001-01-16 13:00 ` Mitchell Blank Jr 3 siblings, 0 replies; 130+ messages in thread From: Mitchell Blank Jr @ 2001-01-16 13:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linux Kernel List Ingo Molnar wrote: > - probably the most radical solution is what i suggested, to completely > avoid the unique-mapping of file structures to an integer range, and use > the address of the file structure (and some cookies) as an identification. IMO... gross. We do pretty much this exact thing in the ATM code (for the signalling daemon and the kernel exchainging status on VCCs) and it's pretty disgusting. I want to make it go away. > - a less radical solution would be to still map file structures to an > integer range (file descriptors) and usage-maintain files per processes, > but relax the 'allocate first non-allocated integer in the range' rule. [...] > fd = open(...,O_ANY); Yeah, this gets talked about, but I don't think a new flag for open is a good way to do this, because open() isn't the only thing that returns a new fd. What about socket()? pipe()? Maybe we could have a new prctl() control that turns this behavior on and off. Then you'd just have to be careful to turn it back off before calling any library functions that require ordering (like popen). Other than that, I think it'd be a good idea, especially if it could be implemented clean enough to make it CONFIG_'urable. That can't really be fairly judged until someone produces the code. -Mitch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: 'native files', 'object fingerprints' [was: sendpath()] 2001-01-16 9:48 ` 'native files', 'object fingerprints' [was: sendpath()] Ingo Molnar 2000-01-01 2:02 ` Pavel Machek 2001-01-16 11:13 ` Andi Kleen @ 2001-01-16 13:57 ` Jamie Lokier 2001-01-16 14:27 ` Felix von Leitner ` (2 subsequent siblings) 5 siblings, 0 replies; 130+ messages in thread From: Jamie Lokier @ 2001-01-16 13:57 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, dean gaudet, Linux Kernel List, Jonathan Thackray Ingo Molnar wrote: > struct native_file { > unsigned long master_fingerprint[8]; > unsigned long file_fingerprint[8]; > struct file file; > }; > > 'fingerprints' are 256 bit, true random numbers. master_fingerprint is > global to the kernel and is generated once per boot. It validates the > pointer of the structure. The master fingerprint is never known to > user-space. > > file_fingerprint is a 256-bit identifier generated for this native file. > The file fingerprint and the (kernel) pointer to the native file is > returned to user-space. The cryptographical safety of these 256-bit random > numbers guarantees that no breach can occur in a reasonable period of > time. It's in essence an 'encrypted' communication between kernel and > user-space. Sounds similar to the Hurd... -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: 'native files', 'object fingerprints' [was: sendpath()]
  2001-01-16  9:48 ` 'native files', 'object fingerprints' [was: sendpath()] Ingo Molnar
  ` (2 preceding siblings ...)
  2001-01-16 13:57 ` 'native files', 'object fingerprints' [was: sendpath()] Jamie Lokier
@ 2001-01-16 14:27 ` Felix von Leitner
  2001-01-16 17:47 ` Linus Torvalds
  2001-01-17  4:39 ` dean gaudet
  5 siblings, 0 replies; 130+ messages in thread
From: Felix von Leitner @ 2001-01-16 14:27 UTC (permalink / raw)
To: Linux Kernel List

Thus spake Ingo Molnar (mingo@elte.hu):
> But even user-space code could use 'native files', via the following, safe
> mechanizm:
[something reminiscent of a token from a capability system]
> (this 'fingerprint' mechanizm can be used for any object, not only files.)

One good thing about tokens is that file handles can be implemented on top
of them in user space. On the other hand, there already are mechanisms to
pass file descriptors around and so on, so you don't gain anything tangible
from your effort.

I would advise reading some textbooks about capability systems; there is a
lot to be learned here. But retrofitting something like this onto an
existing kernel is probably not a very good idea. Experience shows that you
can't "un-bloat" a piece of software by introducing a few elegant concepts.
The compatibility stuff eats most of the benefits.

Felix
* Re: 'native files', 'object fingerprints' [was: sendpath()] 2001-01-16 9:48 ` 'native files', 'object fingerprints' [was: sendpath()] Ingo Molnar ` (3 preceding siblings ...) 2001-01-16 14:27 ` Felix von Leitner @ 2001-01-16 17:47 ` Linus Torvalds 2001-01-17 4:39 ` dean gaudet 5 siblings, 0 replies; 130+ messages in thread From: Linus Torvalds @ 2001-01-16 17:47 UTC (permalink / raw) To: Ingo Molnar; +Cc: dean gaudet, Linux Kernel List, Jonathan Thackray On Tue, 16 Jan 2001, Ingo Molnar wrote: > > yep, correct. But take a look at the trick it does with file descriptors, > i believe it could be a useful way of doing things. It basically > privatizes a struct file, without inserting it into the enumerated file > descriptors. This shows that 'native files' are possible: file struct > without file descriptor integers mapped to them. That's nothing new: the exec() code does exactly the same. In fact, there's a function for it: filp_open() and filp_close(). Which do a better job of it than your private implementation did, I suspect. I don't think your object fingerprints are anything more generic than the current file descriptors. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: 'native files', 'object fingerprints' [was: sendpath()]
  2001-01-16  9:48 ` 'native files', 'object fingerprints' [was: sendpath()] Ingo Molnar
  ` (4 preceding siblings ...)
  2001-01-16 17:47 ` Linus Torvalds
@ 2001-01-17  4:39 ` dean gaudet
  5 siblings, 0 replies; 130+ messages in thread
From: dean gaudet @ 2001-01-17  4:39 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, Linux Kernel List, Jonathan Thackray

On Tue, 16 Jan 2001, Ingo Molnar wrote:
> But even user-space code could use 'native files', via the following, safe
> mechanizm:

so here's an alternative to ingo's proposal which i think solves some of
the other objections raised. it's something i've proposed in the past
under the name "extended file handles".

struct extended_file_permission {
    int refcount;
    some form of mutex to protect refcount;
    some list structure head;
};

struct extended_file {
    struct file *file;
    struct extended_file_permission *perm;
    whatever list foo is needed to link with extended_file_permission above;
};

if you allocate a few huge arrays of struct extended_file, then you can
verify if a pointer passed from user space fits into one of those arrays
pretty quickly.

struct task has a struct extended_file_permission * added to it to
indicate which perm struct that task is associated with. so you just
compare f->perm to current->extended_file_perm and you know if the task
is allowed to use it or not.

clone() allows you to create tasks sharing the same
extended_file_permissions. fork()/exec() would create new
extended_file_perms -- which implicitly causes all those files to be
closed. this gives you pretty light cgi fork()/exec() off a main
"process" which is handling thousands of sockets.

i also proposed various methods of doing O_foo flag inheritance... but
the above is more interesting.
-dean - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
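dean's "verify if a pointer passed from user space fits into one of those
arrays" check can be sketched in a few lines of userspace C. The struct
layout, array name, and table size below are invented for illustration;
the point is that membership in a preallocated array is an O(1) test:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the proposed structures (fields trimmed). */
struct extended_file_permission { int refcount; };
struct extended_file {
    struct extended_file_permission *perm;
    void *file;
};

#define EXT_TABLE_SIZE 1024
static struct extended_file ext_table[EXT_TABLE_SIZE];

/* O(1) check: does p point at the start of a slot in ext_table[]? */
static int valid_extended_file(const void *p)
{
    uintptr_t base = (uintptr_t)ext_table;
    uintptr_t addr = (uintptr_t)p;

    if (addr < base || addr >= base + sizeof(ext_table))
        return 0;                       /* outside the array entirely */
    /* must land exactly on a slot boundary, not mid-struct */
    return (addr - base) % sizeof(struct extended_file) == 0;
}
```

A real kernel would repeat this for each of the "few huge arrays", then
compare the entry's perm pointer against current's, as described above.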
* Re: [patch] sendpath() support, 2.4.0-test3/-ac9 2001-01-16 4:51 ` dean gaudet 2001-01-16 4:59 ` Linus Torvalds @ 2001-01-16 9:19 ` Ingo Molnar 2001-01-17 0:03 ` dean gaudet 1 sibling, 1 reply; 130+ messages in thread From: Ingo Molnar @ 2001-01-16 9:19 UTC (permalink / raw) To: dean gaudet; +Cc: Linus Torvalds, Linux Kernel List, Jonathan Thackray On Mon, 15 Jan 2001, dean gaudet wrote: > > just for kicks i've implemented sendpath() support. > > > > _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off, size_t, size) > > hey so how do you implement transmit timeouts with sendpath() ? > (i.e. drop the client after 30 seconds of no progress.) well this problem is not unique to sendpath(), sendfile() has it as well. in TUX i've added per-socket connection timers, and i believe something like this should be done in Apache as well - timers are IMO not a good enough excuse for avoiding event-based IO models and using select() or poll(). Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: [patch] sendpath() support, 2.4.0-test3/-ac9 2001-01-16 9:19 ` [patch] sendpath() support, 2.4.0-test3/-ac9 Ingo Molnar @ 2001-01-17 0:03 ` dean gaudet 0 siblings, 0 replies; 130+ messages in thread From: dean gaudet @ 2001-01-17 0:03 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, Linux Kernel List, Jonathan Thackray On Tue, 16 Jan 2001, Ingo Molnar wrote: > > On Mon, 15 Jan 2001, dean gaudet wrote: > > > > just for kicks i've implemented sendpath() support. > > > > > > _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off, size_t, size) > > > > hey so how do you implement transmit timeouts with sendpath() ? > > (i.e. drop the client after 30 seconds of no progress.) > > well this problem is not unique to sendpath(), sendfile() has it as well. hrm? with sendfile() i just send 32k or 64k at a time and use alarm() or non-blocking/select() to implement timeouts. with sendpath() i can do the same thing but i'm gonna pay a path lookup each time... and there's no guarantee that i'm getting the same file each time. > in TUX i've added per-socket connection timers, and i believe something > like this should be done in Apache as well - timers are IMO not a good > enough excuse for avoiding event-based IO models and using select() or > poll(). i wasn't suggesting avoiding sendfile/sendpath -- i just couldn't see how to use sendpath() effectively. explain per-socket connection timers. are they available to the userland? at least with the apache-2.0 i/o stuff i should be able to support kernel-based timers. apache-2.0 uses non-blocking/poll() to implement timeouts -- does write() or sendfile() until there's an EWOULDBLOCK then it calls poll() waiting for write/timeout. with kernel supported timeouts i could just block in the write() and that'd be fine by me. 1.2 used alarm() ... 
1.3 communicates each child's activity to the parent through the scoreboard and the parent occasionally wakes up and sends SIGALRM to children that are past their timeout. (that let me get rid of a few syscalls.) -dean - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-15 18:34 ` Jonathan Thackray 2001-01-15 18:46 ` Linus Torvalds @ 2001-01-15 18:58 ` dean gaudet 1 sibling, 0 replies; 130+ messages in thread From: dean gaudet @ 2001-01-15 18:58 UTC (permalink / raw) To: Jonathan Thackray; +Cc: linux-kernel On Mon, 15 Jan 2001, Jonathan Thackray wrote: > > TCP_CORK is useful for FAR more than just sendfile() headers and > > footers. it's arguably the most correct way to write server code. > > Agreed -- the hard-coded Nagle algorithm makes no sense these days. hey, actually a little more thinking this morning made me think nagle *may* have a place. i don't like any of the solutions i've come up with though for this. the problem specifically is how do you implement an efficient HTTP/ng server which supports WebMUX and parallel processing of multiple responses. the problem in a nutshell is that multiple threads may be working on responses which are multiplexed onto a single socket -- there's some extra mux header info used to separate each of the response streams. like what if the response stream is a few hundred HEADs (for cache validation) some of which are static files and others which require some dynamic code. the static responses will finish really fast, and you want to fill up network packets with them. but you don't know when the dynamic responses will finish so you can't be sure when to start sending the packets. i don't know NFSv3 very much, but i imagine it's got similar problems -- any multiplexed request/response protocol allowing out-of-order responses would have this problem. any gurus got suggestions? -dean - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
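For reference, the TCP_CORK pattern under discussion looks like this (the
function and response strings are invented for illustration; a real server
would typically put a sendfile() between the cork and the uncork):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

/* Cork the socket, queue header and body, then uncork: the kernel is
 * free to pack both writes into as few segments as possible, instead
 * of sending a runt packet per write() as Nagle-off would. */
static void send_corked(int sock, const char *hdr, const char *body)
{
    int on = 1, off = 0;

    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
    write(sock, hdr, strlen(hdr));      /* held back by the cork */
    write(sock, body, strlen(body));    /* sendfile() would fit here */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof off); /* flush */
}
```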
* Re: Is sendfile all that sexy?
  2001-01-15 15:24 ` Jonathan Thackray
  2001-01-15 15:36 ` Matti Aarnio
  2001-01-15 16:05 ` dean gaudet
@ 2001-01-15 19:41 ` Ingo Molnar
  2001-01-15 20:33 ` Albert D. Cahalan
  2 siblings, 1 reply; 130+ messages in thread
From: Ingo Molnar @ 2001-01-15 19:41 UTC (permalink / raw)
To: Jonathan Thackray; +Cc: Linux Kernel List

On Mon, 15 Jan 2001, Jonathan Thackray wrote:

> It's a very useful system call and makes file serving much more
> scalable, and I'm glad that most Un*xes now have support for it
> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
> Linux is sendpath(), which does the open() before the sendfile() all
> combined into one system call.

i believe the right model for a user-space webserver is to cache open
file descriptors, and directly hash URLs to open files. This way you can
do pure sendfile() without any open(). Not that open() is too expensive
in Linux:

m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall open
Simple open/close: 7.5756 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall stat
Simple stat: 5.4864 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall write
Simple write: 0.9614 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall read
Simple read: 1.1420 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall null
Simple syscall: 0.6349 microseconds

(note that lmbench opens a nontrivial path, it can be cheaper than
this.) nevertheless saving the lookup can be a win.

[ TUX uses dentries directly so there is no file opening cost - it's
  pretty equivalent to sendpath(), with the difference that TUX can do
  security evaluation on the (held) file prior to sending it - while
  sendpath() is pretty much a shot into the dark.
] Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy?
  2001-01-15 19:41 ` Ingo Molnar
@ 2001-01-15 20:33 ` Albert D. Cahalan
  2001-01-15 21:00 ` Linus Torvalds
  2001-01-16 10:40 ` Felix von Leitner
  0 siblings, 2 replies; 130+ messages in thread
From: Albert D. Cahalan @ 2001-01-15 20:33 UTC (permalink / raw)
To: mingo; +Cc: Jonathan Thackray, Linux Kernel List

Ingo Molnar writes:
> On Mon, 15 Jan 2001, Jonathan Thackray wrote:
>> It's a very useful system call and makes file serving much more
>> scalable, and I'm glad that most Un*xes now have support for it
>> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
>> Linux is sendpath(), which does the open() before the sendfile() all
>> combined into one system call.

Ingo Molnar's data in a nice table:

open/close  7.5756 microseconds
stat        5.4864 microseconds
write       0.9614 microseconds
read        1.1420 microseconds
syscall     0.6349 microseconds

Rather than combining open() with sendfile(), it could be combined with
stat(). Since the syscall would be new anyway, it could skip the normal
requirement about returning the next free file descriptor in favor of
returning whatever can be most quickly found.
* Re: Is sendfile all that sexy? 2001-01-15 20:33 ` Albert D. Cahalan @ 2001-01-15 21:00 ` Linus Torvalds 2001-01-16 10:40 ` Felix von Leitner 1 sibling, 0 replies; 130+ messages in thread From: Linus Torvalds @ 2001-01-15 21:00 UTC (permalink / raw) To: linux-kernel In article <200101152033.f0FKXpv250839@saturn.cs.uml.edu>, Albert D. Cahalan <acahalan@cs.uml.edu> wrote: >Ingo Molnar writes: >> On Mon, 15 Jan 2001, Jonathan Thackray wrote: > >>> It's a very useful system call and makes file serving much more >>> scalable, and I'm glad that most Un*xes now have support for it >>> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to >>> Linux is sendpath(), which does the open() before the sendfile() all >>> combined into one system call. > >Ingo Molnar's data in a nice table: > >open/close 7.5756 microseconds >stat 5.4864 microseconds >write 0.9614 microseconds >read 1.1420 microseconds >syscall 0.6349 microseconds > >Rather than combining open() with sendfile(), it could be combined >with stat(). Note that "fstat()" is fairly low-overhead (unlike "stat()" it obviously doesn't have to parse the name again), so "open+fstat" is quite fine as-is. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
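Linus's point in code: once the path has been resolved by open(), fstat()
works on the descriptor and skips the name lookup entirely, so "open +
fstat" gets the metadata nearly for free. (The helper name below is
invented for illustration.)

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open `path` and return its size via the cheap descriptor-based
 * fstat(), avoiding a second name lookup that stat(path) would cost.
 * Returns the size and stores the fd, or returns -1 on error. */
static off_t sized_open(const char *path, int *fd_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {           /* no path parsing here */
        close(fd);
        return -1;
    }
    *fd_out = fd;
    return st.st_size;
}
```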
* Re: Is sendfile all that sexy?
  2001-01-15 20:33 ` Albert D. Cahalan
  2001-01-15 21:00 ` Linus Torvalds
@ 2001-01-16 10:40 ` Felix von Leitner
  2001-01-16 11:56 ` Peter Samuelson
  ` (2 more replies)
  1 sibling, 3 replies; 130+ messages in thread
From: Felix von Leitner @ 2001-01-16 10:40 UTC (permalink / raw)
To: Linux Kernel List

Thus spake Albert D. Cahalan (acahalan@cs.uml.edu):
> Rather than combining open() with sendfile(), it could be combined
> with stat(). Since the syscall would be new anyway, it could skip
> the normal requirement about returning the next free file descriptor
> in favor of returning whatever can be most quickly found.

I don't know how Linux does it, but returning the first free file
descriptor can be implemented as an O(1) operation.

Felix
* Re: Is sendfile all that sexy? 2001-01-16 10:40 ` Felix von Leitner @ 2001-01-16 11:56 ` Peter Samuelson 2001-01-16 12:37 ` Ingo Molnar 2001-01-16 12:42 ` Ingo Molnar 2 siblings, 0 replies; 130+ messages in thread From: Peter Samuelson @ 2001-01-16 11:56 UTC (permalink / raw) To: Linux Kernel List [Felix von Leitner] > I don't know how Linux does it, but returning the first free file > descriptor can be implemented as O(1) operation. How exactly? Maybe I'm being dense today. Having used up the lowest available fd, how do you find the next-lowest one, the next open()? I can't think of anything that isn't O(n). (Sure you can amortize it different ways by keeping lists of fds, etc.) Peter - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-16 10:40 ` Felix von Leitner 2001-01-16 11:56 ` Peter Samuelson @ 2001-01-16 12:37 ` Ingo Molnar 2001-01-16 12:42 ` Ingo Molnar 2 siblings, 0 replies; 130+ messages in thread From: Ingo Molnar @ 2001-01-16 12:37 UTC (permalink / raw) To: Felix von Leitner; +Cc: Linux Kernel List On Tue, 16 Jan 2001, Felix von Leitner wrote: > I don't know how Linux does it, but returning the first free file > descriptor can be implemented as O(1) operation. only if special allocation patters are assumed. Otherwise it cannot be a generic O(1) solution. The first-free rule adds an implicit ordering to the file descriptor space, and this order cannot be maintained in an O(1) way. Linux can allocate up to a million file descriptors. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-16 10:40 ` Felix von Leitner 2001-01-16 11:56 ` Peter Samuelson 2001-01-16 12:37 ` Ingo Molnar @ 2001-01-16 12:42 ` Ingo Molnar 2001-01-16 12:47 ` Felix von Leitner 2 siblings, 1 reply; 130+ messages in thread From: Ingo Molnar @ 2001-01-16 12:42 UTC (permalink / raw) To: Felix von Leitner; +Cc: Linux Kernel List On Tue, 16 Jan 2001, Felix von Leitner wrote: > I don't know how Linux does it, but returning the first free file > descriptor can be implemented as O(1) operation. to put it more accurately: the requirement is to be able to open(), use and close() an unlimited number of file descriptors with O(1) overhead, under any allocation pattern, with only RAM limiting the number of files. Both of my proposals attempt to provide this. It's possible to open() O(1) but do a O(log(N)) close(), but that is of no practical value IMO. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-16 12:42 ` Ingo Molnar @ 2001-01-16 12:47 ` Felix von Leitner 2001-01-16 13:48 ` Jamie Lokier 0 siblings, 1 reply; 130+ messages in thread From: Felix von Leitner @ 2001-01-16 12:47 UTC (permalink / raw) To: Linux Kernel List Thus spake Ingo Molnar (mingo@elte.hu): > > I don't know how Linux does it, but returning the first free file > > descriptor can be implemented as O(1) operation. > to put it more accurately: the requirement is to be able to open(), use > and close() an unlimited number of file descriptors with O(1) overhead, > under any allocation pattern, with only RAM limiting the number of files. > Both of my proposals attempt to provide this. It's possible to open() O(1) > but do a O(log(N)) close(), but that is of no practical value IMO. I cheated. I was only talking about open(). close() is of course more expensive then. Other than that: where does the requirement come from? Can't we just use a free list where we prepend closed fds and always use the first one on open()? That would even increase spatial locality and be good for the CPU caches. Felix - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
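Felix's free list, sketched as a userspace model: open() and close() are
both O(1) precisely because the allocator hands back the most recently
closed slot instead of the lowest one — abandoning the POSIX rule Ingo is
defending. (Names and the fixed table size are invented for illustration.)

```c
#define MAX_FDS 1024

static int free_list[MAX_FDS];  /* LIFO stack of free descriptors */
static int free_top;

static void fd_pool_init(void)
{
    /* Fill the stack so that fd 0 is popped first, then 1, 2, ... */
    for (int i = 0; i < MAX_FDS; i++)
        free_list[i] = MAX_FDS - 1 - i;
    free_top = MAX_FDS;
}

static int fd_alloc(void)       /* O(1), but NOT lowest-first */
{
    return free_top ? free_list[--free_top] : -1;
}

static void fd_free(int fd)     /* O(1): prepend to the free list */
{
    free_list[free_top++] = fd;
}
```

The spatial-locality benefit Felix mentions comes from reusing the
just-freed (still cache-warm) slot on the next open().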
* Re: Is sendfile all that sexy? 2001-01-16 12:47 ` Felix von Leitner @ 2001-01-16 13:48 ` Jamie Lokier 2001-01-16 14:20 ` Felix von Leitner 0 siblings, 1 reply; 130+ messages in thread From: Jamie Lokier @ 2001-01-16 13:48 UTC (permalink / raw) To: Linux Kernel List Felix von Leitner wrote: > I cheated. I was only talking about open(). > close() is of course more expensive then. > > Other than that: where does the requirement come from? > Can't we just use a free list where we prepend closed fds and always use > the first one on open()? That would even increase spatial locality and > be good for the CPU caches. You would need to use a new open() flag: O_ANYFD. The requirement comes from this like this: close (0); close (1); close (2); open ("/dev/console", O_RDWR); dup (); dup (); -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
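Jamie's /dev/console idiom works only because open() must return the
lowest free descriptor. A quick userspace demonstration of that guarantee
(using /dev/null so stdio stays intact):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* POSIX: open() returns the lowest fd not currently open, so a hole
 * created by close() is always refilled by the next open(). */
static void demo_lowest_fd(void)
{
    int a = open("/dev/null", O_RDONLY);
    int b = open("/dev/null", O_RDONLY);
    assert(a >= 0 && b > a);    /* everything below a was in use */

    close(a);
    int c = open("/dev/null", O_RDONLY);
    assert(c == a);             /* the hole left by close(a) is reused */

    close(b);
    close(c);
}
```

This is exactly what the close(0..2)/open/dup/dup sequence relies on:
after the three closes, open() is guaranteed to return fd 0, and the two
dup() calls fill fds 1 and 2.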
* Re: Is sendfile all that sexy? 2001-01-16 13:48 ` Jamie Lokier @ 2001-01-16 14:20 ` Felix von Leitner 2001-01-16 15:05 ` David L. Parsley 0 siblings, 1 reply; 130+ messages in thread From: Felix von Leitner @ 2001-01-16 14:20 UTC (permalink / raw) To: Linux Kernel List Thus spake Jamie Lokier (lk@tantalophile.demon.co.uk): > You would need to use a new open() flag: O_ANYFD. > The requirement comes from this like this: > close (0); > close (1); > close (2); > open ("/dev/console", O_RDWR); > dup (); > dup (); So it's not actually part of POSIX, it's just to get around fixing legacy code? ;-) Felix - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-16 14:20 ` Felix von Leitner @ 2001-01-16 15:05 ` David L. Parsley 2001-01-16 15:05 ` Jakub Jelinek 2001-01-17 19:27 ` dean gaudet 0 siblings, 2 replies; 130+ messages in thread From: David L. Parsley @ 2001-01-16 15:05 UTC (permalink / raw) To: Felix von Leitner, linux-kernel, mingo Felix von Leitner wrote: > > close (0); > > close (1); > > close (2); > > open ("/dev/console", O_RDWR); > > dup (); > > dup (); > > So it's not actually part of POSIX, it's just to get around fixing > legacy code? ;-) This makes me wonder... If the kernel only kept a queue of the three smallest unused fd's, and when the queue emptied handed out whatever it liked, how many things would break? I suspect this would cover a lot of bases... <dons flameproof underwear> regards, David -- David L. Parsley Network Administrator Roanoke College - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-16 15:05 ` David L. Parsley @ 2001-01-16 15:05 ` Jakub Jelinek 2001-01-16 15:46 ` David L. Parsley 2001-01-17 19:27 ` dean gaudet 1 sibling, 1 reply; 130+ messages in thread From: Jakub Jelinek @ 2001-01-16 15:05 UTC (permalink / raw) To: David L. Parsley; +Cc: Felix von Leitner, linux-kernel, mingo On Tue, Jan 16, 2001 at 10:05:06AM -0500, David L. Parsley wrote: > Felix von Leitner wrote: > > > close (0); > > > close (1); > > > close (2); > > > open ("/dev/console", O_RDWR); > > > dup (); > > > dup (); > > > > So it's not actually part of POSIX, it's just to get around fixing > > legacy code? ;-) > > This makes me wonder... > > If the kernel only kept a queue of the three smallest unused fd's, and > when the queue emptied handed out whatever it liked, how many things > would break? I suspect this would cover a lot of bases... First it would break Unix98 and other standards: The Single UNIX (R) Specification, Version 2 Copyright (c) 1997 The Open Group ... int open(const char *path, int oflag, ... ); ... The open() function will return a file descriptor for the named file that is the lowest file descriptor not currently open for that process. The open file description is new, and therefore the file descriptor does not share it with any other process in the system. The FD_CLOEXEC file descriptor flag associated with the new file descriptor will be cleared. Jakub - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy?
  2001-01-16 15:05 ` Jakub Jelinek
@ 2001-01-16 15:46 ` David L. Parsley
  2001-01-18 14:00 ` Laramie Leavitt
  0 siblings, 1 reply; 130+ messages in thread
From: David L. Parsley @ 2001-01-16 15:46 UTC (permalink / raw)
To: Jakub Jelinek, linux-kernel, leitner, mingo

Jakub Jelinek wrote:
> > This makes me wonder...
> >
> > If the kernel only kept a queue of the three smallest unused fd's, and
> > when the queue emptied handed out whatever it liked, how many things
> > would break? I suspect this would cover a lot of bases...
>
> First it would break Unix98 and other standards:
[snip]

Yeah, I realized it would violate at least POSIX. The discussion was just
bandying about ways to avoid an expensive 'open()' without breaking lots
of utilities and glibc stuff. This might be something that could be
configured for specific server environments, where performance is more
important than POSIX/Unix98, but you still don't want to completely break
the system. Just a thought, brain-damaged as it might be. ;-)

regards,
  David

--
David L. Parsley
Network Administrator
Roanoke College
* RE: Is sendfile all that sexy?
  2001-01-16 15:46 ` David L. Parsley
@ 2001-01-18 14:00 ` Laramie Leavitt
  0 siblings, 0 replies; 130+ messages in thread
From: Laramie Leavitt @ 2001-01-18 14:00 UTC (permalink / raw)
To: linux-kernel

> Jakub Jelinek wrote:
> > > This makes me wonder...
> > >
> > > If the kernel only kept a queue of the three smallest unused fd's, and
> > > when the queue emptied handed out whatever it liked, how many things
> > > would break? I suspect this would cover a lot of bases...
> >
> > First it would break Unix98 and other standards:
> [snip]
>
> Yeah, I realized it would violate at least POSIX. The discussion was
> just bandying about ways to avoid an expensive 'open()' without breaking
> lots of utilities and glibc stuff. This might be something that could
> be configured for specific server environments, where performance is
> more important than POSIX/Unix98, but you still don't want to completely
> break the system. Just a thought, brain-damaged as it might be. ;-)
>

Merely following the discussion, a thought occurred to me of how to make
fd allocation fairly efficient (and simple) even if it retains the O(n)
structure worst case. I don't know how it is currently implemented, so
this may be how it is done, or I may be way off base.

First, keep a table of FDs in sorted order (mark deleted entries) that
you can access quickly. O(1) lookup. Then, maintain a struct like:

struct {
    int lowest_fd;
    int highest_fd;
};

open:
    if (lowest_fd == highest_fd) {
        fd = lowest_fd;
        lowest_fd = ++highest_fd;
    } else if (flags == IGNORE_UNIX98) {
        fd = highest_fd++;
    } else {
        fd = lowest_fd;
        lowest_fd = linear_search(lowest_fd + 1, highest_fd);
    }

close:
    if (fd < lowest_fd)
        lowest_fd = fd;
    else if (fd == highest_fd - 1)
        --highest_fd;

For common cases this would be fairly quick.
It would be very easy to implement an O(1) allocation if you want it to be fast ( at the expense of a growing file handle table. ) Just thinking about it. Laramie. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
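Laramie's scheme can be made concrete with an in-use bitmap standing in
for the fd table (a userspace model; names and sizes are invented).
close() is O(1), open() is O(1) whenever there are no holes, and the
linear search runs only on the hole-reuse path:

```c
#include <stdbool.h>

#define POOL_FDS 1024
static bool in_use[POOL_FDS];
static int pool_lowest;     /* lowest free fd */
static int pool_highest;    /* one past the highest fd ever handed out */

static void pool_init(void)
{
    for (int i = 0; i < POOL_FDS; i++)
        in_use[i] = false;
    pool_lowest = pool_highest = 0;
}

static int pool_open(void)  /* always lowest-first, per POSIX */
{
    if (pool_lowest >= POOL_FDS)
        return -1;          /* table full */
    int fd = pool_lowest;
    in_use[fd] = true;
    if (fd == pool_highest)
        pool_highest++;     /* no holes below: O(1) fast path */
    /* the linear search from the proposal, to find the next hole */
    while (pool_lowest < pool_highest && in_use[pool_lowest])
        pool_lowest++;
    return fd;
}

static void pool_close(int fd)  /* O(1) */
{
    in_use[fd] = false;
    if (fd < pool_lowest)
        pool_lowest = fd;
}
```

The worst case is still O(n) when the descriptor space is full of holes,
which is Ingo's objection upthread; the structure only makes the common
dense-allocation pattern cheap.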
* Re: Is sendfile all that sexy?
  2001-01-16 15:05 ` David L. Parsley
  2001-01-16 15:05   ` Jakub Jelinek
@ 2001-01-17 19:27   ` dean gaudet
  1 sibling, 0 replies; 130+ messages in thread
From: dean gaudet @ 2001-01-17 19:27 UTC (permalink / raw)
To: David L. Parsley; +Cc: Felix von Leitner, linux-kernel, mingo

On Tue, 16 Jan 2001, David L. Parsley wrote:

> Felix von Leitner wrote:
> > > close (0);
> > > close (1);
> > > close (2);
> > > open ("/dev/console", O_RDWR);
> > > dup ();
> > > dup ();
> >
> > So it's not actually part of POSIX, it's just to get around fixing
> > legacy code? ;-)

it's part of POSIX.

> This makes me wonder...
>
> If the kernel only kept a queue of the three smallest unused fd's, and
> when the queue emptied handed out whatever it liked, how many things
> would break? I suspect this would cover a lot of bases...

apache-1.3 relies on the open-lowest-numbered-free-fd behaviour... but
only as a band-aid to work around other broken behaviours surrounding
FD_SETSIZE.

when opening the log files and listening sockets, apache uses
fcntl(F_DUPFD) to push them all higher than fd 15 (see ap_slack). some
sites are configured in a way that there's thousands of log files or
listening fds (both are bogus configs in my opinion, but hey, let the
admins shoot themselves). this generally leaves a handful of low-numbered
fds available.

this pretty much protects apache from broken libraries compiled with a
small FD_SETSIZE, or which otherwise can't handle big fds. libc used to
be just such a library, because it used select() in the DNS resolver
code. (a libc guru can tell you when this was fixed.)

it also ensures that the client fd will be low-numbered, and lets us be
lazy and just use select() rather than do all the config tests to figure
out which OSes support poll().

it's all pretty gross... but then select() is pretty gross, and it's
essentially the bug that necessitated this. (solaris also has a stupid
FILE * limitation: it can't use fds > 255 in a FILE *, which breaks even
more libraries than fds >= FD_SETSIZE.)

-dean
* Re: Is sendfile all that sexy?
  2001-01-14 20:22 ` Linus Torvalds
  ` (2 preceding siblings ...)
  2001-01-15 15:24 ` Jonathan Thackray
@ 2001-01-24  0:58 ` Sasi Peter
  2001-01-24  8:44   ` James Sutherland
  2001-01-25 10:20   ` Anton Blanchard
  3 siblings, 2 replies; 130+ messages in thread
From: Sasi Peter @ 2001-01-24 0:58 UTC (permalink / raw)
To: linux-kernel

On 14 Jan 2001, Linus Torvalds wrote:

> The only obvious use for it is file serving, and as high-performance
> file serving tends to end up as a kernel module in the end anyway (the
> only hold-out is samba, and that's been discussed too), "sendfile()"
> really is more a proof of concept than anything else.

No plans for samba to use sendfile? Even better, make it a tux-like
module? (That would enable NetWare-like performance with the standard
Linux kernel... would be cool after all ;)

--
SaPE - Peter, Sasi - mailto:sape@sch.hu - http://sape.iq.rulez.org/
* Re: Is sendfile all that sexy?
  2001-01-24  0:58 ` Sasi Peter
@ 2001-01-24  8:44 ` James Sutherland
  2001-01-25 10:20 ` Anton Blanchard
  1 sibling, 0 replies; 130+ messages in thread
From: James Sutherland @ 2001-01-24 8:44 UTC (permalink / raw)
To: Sasi Peter; +Cc: linux-kernel

On Wed, 24 Jan 2001, Sasi Peter wrote:

> On 14 Jan 2001, Linus Torvalds wrote:
>
> > The only obvious use for it is file serving, and as high-performance
> > file serving tends to end up as a kernel module in the end anyway (the
> > only hold-out is samba, and that's been discussed too), "sendfile()"
> > really is more a proof of concept than anything else.
>
> No plans for samba to use sendfile? Even better make it a tux-like module?
> (that would enable Netware-Linux like performance with the standard
> kernel... would be cool afterall ;)

AIUI, Jeff Merkey was working on loading "userspace" apps into the kernel
to tackle this sort of problem generically. I don't know if he's tried it
with Samba - the forking would probably be a problem...

James.
* Re: Is sendfile all that sexy?
  2001-01-24  0:58 ` Sasi Peter
  2001-01-24  8:44 ` James Sutherland
@ 2001-01-25 10:20 ` Anton Blanchard
  2001-01-25 10:58   ` Sasi Peter
  1 sibling, 1 reply; 130+ messages in thread
From: Anton Blanchard @ 2001-01-25 10:20 UTC (permalink / raw)
To: Sasi Peter; +Cc: linux-kernel

> No plans for samba to use sendfile? Even better make it a tux-like module?
> (that would enable Netware-Linux like performance with the standard
> kernel... would be cool afterall ;)

I have patches for samba to do sendfile. Making a tux module does not make
sense to me, especially since we are nowhere near the limits of samba in
userspace. Once userspace samba can run no faster, then we should think
about other options.

Anton
* Re: Is sendfile all that sexy?
  2001-01-25 10:20 ` Anton Blanchard
@ 2001-01-25 10:58 ` Sasi Peter
  2001-01-26  6:10   ` Anton Blanchard
  0 siblings, 1 reply; 130+ messages in thread
From: Sasi Peter @ 2001-01-25 10:58 UTC (permalink / raw)
To: Anton Blanchard; +Cc: linux-kernel

On Thu, 25 Jan 2001, Anton Blanchard wrote:

> I have patches for samba to do sendfile. Making a tux module does not make
> sense to me, especially since we are nowhere near the limits of samba in
> userspace. Once userspace samba can run no faster, then we should think
> about other options.

Do you have it at a URL?

--
SaPE - Peter, Sasi - mailto:sape@sch.hu - http://sape.iq.rulez.org/
* Re: Is sendfile all that sexy?
  2001-01-25 10:58 ` Sasi Peter
@ 2001-01-26  6:10 ` Anton Blanchard
  2001-01-26 11:46   ` David S. Miller
  0 siblings, 1 reply; 130+ messages in thread
From: Anton Blanchard @ 2001-01-26 6:10 UTC (permalink / raw)
To: Sasi Peter; +Cc: linux-kernel

> Do you have it at a URL?

The patch is small so I have attached it to this email. It should apply
to the samba CVS tree. Remember this is still a hack, and I need to add
code to handle the case where the file is truncated and we sendfile()
less than we promised. (After talking to tridge and davem, this should be
fixed shortly.)

There is a lot more going on than in the web serving case, so
sendfile + zero copy is not going to help us as much as it did for the
tux guys. For example, currently on 2.4.0 + zero copy patches:

anton@drongo:~/dbench$ ~anton/samba/source/bin/smbtorture //otherhost/netbench -U% -N 15 NBW95

read/write: Throughput 16.5478 MB/sec (NB=20.6848 MB/sec 165.478 MBit/sec)
sendfile:   Throughput 17.0128 MB/sec (NB=21.266 MB/sec 170.128 MBit/sec)

Of course there is still lots to be done :)

Cheers,
Anton

diff -u -u -r1.195 includes.h
--- source/include/includes.h	2000/12/06 00:05:14	1.195
+++ source/include/includes.h	2001/01/26 05:38:51
@@ -871,7 +871,8 @@
 /* default socket options.  Dave Miller thinks we should default to
    TCP_NODELAY given the socket IO pattern that Samba uses */
-#ifdef TCP_NODELAY
+
+#if 0
 #define DEFAULT_SOCKET_OPTIONS "TCP_NODELAY"
 #else
 #define DEFAULT_SOCKET_OPTIONS ""
diff -u -u -r1.257 reply.c
--- source/smbd/reply.c	2001/01/24 19:34:53	1.257
+++ source/smbd/reply.c	2001/01/26 05:38:53
@@ -2383,6 +2391,51 @@
 		END_PROFILE(SMBreadX);
 		return(ERROR(ERRDOS,ERRlock));
 	}
+
+#if 1
+	/* We can use sendfile if it is not chained */
+	if (CVAL(inbuf,smb_vwv0) == 0xFF) {
+		off_t tmpoffset;
+		struct stat buf;
+		int flags = 0;
+
+		nread = smb_maxcnt;
+
+		fstat(fsp->fd, &buf);
+		if (startpos > buf.st_size)
+			return(UNIXERROR(ERRDOS,ERRnoaccess));
+		if (nread > (buf.st_size - startpos))
+			nread = (buf.st_size - startpos);
+
+		SSVAL(outbuf,smb_vwv5,nread);
+		SSVAL(outbuf,smb_vwv6,smb_offset(data,outbuf));
+		SSVAL(smb_buf(outbuf),-2,nread);
+		CVAL(outbuf,smb_vwv0) = 0xFF;
+		set_message(outbuf,12,nread,False);
+
+#define MSG_MORE 0x8000
+		if (nread > 0)
+			flags = MSG_MORE;
+		if (send(smbd_server_fd(), outbuf, data - outbuf, flags) == -1)
+			DEBUG(0,("reply_read_and_X: send ERROR!\n"));
+
+		tmpoffset = startpos;
+		while(nread) {
+			int nwritten;
+			nwritten = sendfile(smbd_server_fd(), fsp->fd, &tmpoffset, nread);
+			if (nwritten == -1)
+				DEBUG(0,("reply_read_and_X: sendfile ERROR!\n"));
+
+			if (!nwritten)
+				break;
+
+			nread -= nwritten;
+		}
+
+		return -1;
+	}
+#endif
+
 	nread = read_file(fsp,data,startpos,smb_maxcnt);
 
 	if (nread < 0) {
* Re: Is sendfile all that sexy?
  2001-01-26  6:10 ` Anton Blanchard
@ 2001-01-26 11:46 ` David S. Miller
  2001-01-26 14:12   ` Anton Blanchard
  0 siblings, 1 reply; 130+ messages in thread
From: David S. Miller @ 2001-01-26 11:46 UTC (permalink / raw)
To: Anton Blanchard; +Cc: Sasi Peter, linux-kernel

Anton Blanchard writes:
> diff -u -u -r1.257 reply.c
> --- source/smbd/reply.c	2001/01/24 19:34:53	1.257
> +++ source/smbd/reply.c	2001/01/26 05:38:53
> @@ -2383,6 +2391,51 @@
...
> +		while(nread) {
> +			int nwritten;
> +			nwritten = sendfile(smbd_server_fd(), fsp->fd, &tmpoffset, nread);
> +			if (nwritten == -1)
> +				DEBUG(0,("reply_read_and_X: sendfile ERROR!\n"));
> +
> +			if (!nwritten)
> +				break;
> +
> +			nread -= nwritten;
> +		}
> +
> +		return -1;

Anton, why are you always returning -1 (which means error for the
smb_message[] array functions) when using sendfile? Aren't you supposed
to return the number of bytes output or something like this?

I'm probably missing something subtle here, so just let me know what I
missed. Thanks.

Later,
David S. Miller
davem@redhat.com
* Re: Is sendfile all that sexy?
  2001-01-26 11:46 ` David S. Miller
@ 2001-01-26 14:12 ` Anton Blanchard
  0 siblings, 0 replies; 130+ messages in thread
From: Anton Blanchard @ 2001-01-26 14:12 UTC (permalink / raw)
To: David S. Miller; +Cc: Sasi Peter, linux-kernel

Hi Dave,

How are the VB withdrawal symptoms going? :)

> Anton, why are you always returning -1 (which means error for the
> smb_message[] array functions) when using sendfile?

Returning -1 tells the higher level code that we actually sent the bytes
out ourselves and not to bother doing it.

> Aren't you supposed to return the number of bytes output or
> something like this?

Only if you want the code to do a send() on outbuf, which we don't here.

Cheers,
Anton
* Re: Is sendfile all that sexy?
  2001-01-14 18:29 Is sendfile all that sexy? jamal
  2001-01-14 18:50 ` Ingo Molnar
  2001-01-14 20:22 ` Linus Torvalds
@ 2001-01-15 23:16 ` Pavel Machek
  2001-01-16 13:47   ` jamal
  2 siblings, 1 reply; 130+ messages in thread
From: Pavel Machek @ 2001-01-15 23:16 UTC (permalink / raw)
To: jamal, linux-kernel, netdev

Hi!

> TWO observations:
> - Given Linux's non-pre-emptability of the kernel i get the feeling that
> sendfile could starve other user space programs. Imagine trying to send a
> 1Gig file on 10Mbps pipe in one shot.

Hehe, try sigkilling a process doing that transfer. Last time I tried it,
it did not work.

Pavel
--
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org
* Re: Is sendfile all that sexy?
  2001-01-15 23:16 ` Pavel Machek
@ 2001-01-16 13:47 ` jamal
  2001-01-16 14:41   ` Pavel Machek
  0 siblings, 1 reply; 130+ messages in thread
From: jamal @ 2001-01-16 13:47 UTC (permalink / raw)
To: Pavel Machek; +Cc: linux-kernel, netdev

On Tue, 16 Jan 2001, Pavel Machek wrote:

> > TWO observations:
> > - Given Linux's non-pre-emptability of the kernel i get the feeling that
> > sendfile could starve other user space programs. Imagine trying to send a
> > 1Gig file on 10Mbps pipe in one shot.
>
> Hehe, try sigkilling process doing that transfer. Last time I tried it
> it did not work.

From Alexey's response: it does get descheduled, possibly on every sndbuf
send. So you should be able to sneak that sigkill in.

cheers,
jamal
* Re: Is sendfile all that sexy?
  2001-01-16 13:47 ` jamal
@ 2001-01-16 14:41 ` Pavel Machek
  0 siblings, 0 replies; 130+ messages in thread
From: Pavel Machek @ 2001-01-16 14:41 UTC (permalink / raw)
To: jamal; +Cc: linux-kernel, netdev

Hi!

> > > TWO observations:
> > > - Given Linux's non-pre-emptability of the kernel i get the feeling that
> > > sendfile could starve other user space programs. Imagine trying to send a
> > > 1Gig file on 10Mbps pipe in one shot.
> >
> > Hehe, try sigkilling process doing that transfer. Last time I tried it
> > it did not work.
>
> From Alexey's response: it does get descheduled possibly every sndbuf
> send. So you should be able to sneak that sigkill.

Did you actually try it? Last time I did the test, SIGKILL did not make
it in. sendfile did not actually check for signals...

(And you could do something like sending 100MB from cache into /dev/null.
I do not see where a sigkill could sneak in in that case.)

Pavel
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+
* Re: Is sendfile all that sexy?
@ 2001-01-16 13:50 Andries.Brouwer
2001-01-17 6:56 ` Ton Hospel
0 siblings, 1 reply; 130+ messages in thread
From: Andries.Brouwer @ 2001-01-16 13:50 UTC (permalink / raw)
To: mingo; +Cc: linux-kernel
From: Ingo Molnar <mingo@elte.hu>
On Tue, 16 Jan 2001, Felix von Leitner wrote:
> I don't know how Linux does it, but returning the first free file
> descriptor can be implemented as O(1) operation.
to put it more accurately: the requirement is to be able to open(), use
and close() an unlimited number of file descriptors with O(1) overhead,
under any allocation pattern, with only RAM limiting the number of files.
Both of my proposals attempt to provide this. It's possible to open() O(1)
but do a O(log(N)) close(), but that is of no practical value IMO.
Ingo
> Both of my proposals
I am afraid I have missed most earlier messages in this thread.
However, let me remark that the problem of assigning a
file descriptor is the one that is usually described by
"priority queue". The version of Peter van Emde Boas takes
time O(log log N) for both open() and close().
Of course this is not meant to suggest that we use it.
Andries
* Re: Is sendfile all that sexy?
  2001-01-16 13:50 Andries.Brouwer
@ 2001-01-17  6:56 ` Ton Hospel
  2001-01-17  7:31   ` Steve VanDevender
  0 siblings, 1 reply; 130+ messages in thread
From: Ton Hospel @ 2001-01-17 6:56 UTC (permalink / raw)
To: linux-kernel

In article <UTC200101161350.OAA141869.aeb@ark.cwi.nl>,
Andries.Brouwer@cwi.nl writes:
>
> I am afraid I have missed most earlier messages in this thread.
> However, let me remark that the problem of assigning a
> file descriptor is the one that is usually described by
> "priority queue". The version of Peter van Emde Boas takes
> time O(log log N) for both open() and close().
> Of course this is not meant to suggest that we use it.
>

Fascinating! But how is this possible? What stops me from using this
algorithm to insert N values and extract them again in order, ending up
with an O(N log log N) sorting algorithm? (That would be better than
log N! ~ N log N.)

(At least the web pages I found about this seem to suggest you can use it
on any set with a full order relation.)
* Re: Is sendfile all that sexy?
  2001-01-17  6:56 ` Ton Hospel
@ 2001-01-17  7:31 ` Steve VanDevender
  2001-01-17  8:09   ` Ton Hospel
  0 siblings, 1 reply; 130+ messages in thread
From: Steve VanDevender @ 2001-01-17 7:31 UTC (permalink / raw)
To: linux-kernel

Ton Hospel writes:
> In article <UTC200101161350.OAA141869.aeb@ark.cwi.nl>,
> Andries.Brouwer@cwi.nl writes:
> > I am afraid I have missed most earlier messages in this thread.
> > However, let me remark that the problem of assigning a
> > file descriptor is the one that is usually described by
> > "priority queue". The version of Peter van Emde Boas takes
> > time O(loglog N) for both open() and close().
> > Of course this is not meant to suggest that we use it.
> >
> Fascinating ! But how is this possible ? What stops me from
> using this algorithm from entering N values and extracting
> them again in order and so end up with a O(N*log log N)
> sorting algorithm ? (which would be better than log N! ~ N*logN)
>
> (at least the web pages I found about this seem to suggest you
> can use this on any set with a full order relation)

How do you know how to extract the items in order, unless you've already
sorted them independently from placing them in this data structure?

Besides, there are plenty of sorting algorithms that work only on
specific kinds of data sets that are better than the O(n log n) bound
for generalized sorting. For example, there's the O(n) "mailbox sort".
You have an unordered array u of m integers, each in the range 1..n;
allocate an array s of n integers initialized to all zeros, and for i in
1..m increment s[u[i]]. Then for j in 1..n print j s[j] times. If n is
of reasonable size then you can sort that list of integers in O(m) time.
* Re: Is sendfile all that sexy?
  2001-01-17  7:31 ` Steve VanDevender
@ 2001-01-17  8:09 ` Ton Hospel
  0 siblings, 0 replies; 130+ messages in thread
From: Ton Hospel @ 2001-01-17 8:09 UTC (permalink / raw)
To: linux-kernel

In article <14949.19028.404458.318735@tzadkiel.efn.org>,
Steve VanDevender <stevev@efn.org> writes:
> Ton Hospel writes:
> > Fascinating ! But how is this possible ? What stops me from
> > using this algorithm from entering N values and extracting
> > them again in order and so end up with a O(N*log log N)
> > sorting algorithm ? (which would be better than log N! ~ N*logN)
> >
> > (at least the web pages I found about this seem to suggest you
> > can use this on any set with a full order relation)
>
> How do you know how to extract the items in order, unless you've already
> sorted them independently from placing them in this data structure?

Because "extract max" is a basic operation of a priority queue, which I
just do N times.

> Besides, there are plenty of sorting algorithms that work only on
> specific kinds of data sets that are better than the O(n log n) bound
> for generalized sorting. For example, there's the O(n) "mailbox sort".
> You have an unordered array u of m integers, each in the range 1..n;
> allocate an array s of n integers initialized to all zeros, and for i in
> 1..m increment s[u[i]]. Then for j in 1..n print j s[j] times. If n is
> of reasonable size then you can sort that list of integers in O(m) time.

Yes, I know. That's why you see the "any set with a full order relation"
in there. It basically disallows using extra structure of the elements.

Notice that the radix sort you describe basically hides the log N in the
representation of a number of max n (which has a length that is basically
log n). It just doesn't account for that, because we do the operation on
processors where these bits are handled in parallel, and so they do not
end up in the O-notation. Any attempt to make radix sort handle
arbitrary-width integers on a fixed-width processor will make the log N
reappear.

Having said that, in the particular case of fd allocation we DO have
additional structure (in fact, it's indeed integers in 0..n). So I can
very well imagine the existence of a priority queue for this where the
basic operations are better than O(log N). I just don't understand how it
can exist for a generic priority queue algorithm (which the Peter van
Emde Boas method seems to be; unfortunately I have found no full
description yet of the algorithm used to do the insert/extract in the
queue nodes).
* Re: Is sendfile all that sexy?
@ 2001-01-17 15:02 Ben Mansell
2000-01-01 2:10 ` Pavel Machek
2001-01-17 19:32 ` Linus Torvalds
0 siblings, 2 replies; 130+ messages in thread
From: Ben Mansell @ 2001-01-17 15:02 UTC (permalink / raw)
To: torvalds; +Cc: linux-kernel
On 14 Jan 2001, Linus Torvalds wrote:
> And no, I don't actually think that sendfile() is all that hot. It was
> _very_ easy to implement, and can be considered a 5-minute hack to give
> a feature that fit very well in the MM architecture, and that the Apache
> folks had already been using on other architectures.
The current sendfile() has the limitation that it can't read data from
a socket. Would it be another 5-minute hack to remove this limitation, so
you could sendfile between sockets? Now _that_ would be sexy :)
Ben
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: Is sendfile all that sexy?
  2001-01-17 15:02 Ben Mansell
@ 2000-01-01  2:10 ` Pavel Machek
  2001-01-17 19:32 ` Linus Torvalds
  1 sibling, 0 replies; 130+ messages in thread
From: Pavel Machek @ 2000-01-01 2:10 UTC (permalink / raw)
To: Ben Mansell; +Cc: torvalds, linux-kernel

Hi!

> > And no, I don't actually think that sendfile() is all that hot. It was
> > _very_ easy to implement, and can be considered a 5-minute hack to give
> > a feature that fit very well in the MM architecture, and that the Apache
> > folks had already been using on other architectures.
>
> The current sendfile() has the limitation that it can't read data from
> a socket. Would it be another 5-minute hack to remove this limitation, so
> you could sendfile between sockets? Now _that_ would be sexy :)

I had a patch to do that. (Unoptimized, of course.)

--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
* Re: Is sendfile all that sexy?
  2001-01-17 15:02 Ben Mansell
  2000-01-01  2:10 ` Pavel Machek
@ 2001-01-17 19:32 ` Linus Torvalds
  2001-01-18  2:34   ` Olivier Galibert
  ` (2 more replies)
  1 sibling, 3 replies; 130+ messages in thread
From: Linus Torvalds @ 2001-01-17 19:32 UTC (permalink / raw)
To: linux-kernel

In article <Pine.LNX.4.30.0101171454340.29536-100000@baphomet.bogo.bogus>,
Ben Mansell <linux-kernel@slimyhorror.com> wrote:
>On 14 Jan 2001, Linus Torvalds wrote:
>
>> And no, I don't actually think that sendfile() is all that hot. It was
>> _very_ easy to implement, and can be considered a 5-minute hack to give
>> a feature that fit very well in the MM architecture, and that the Apache
>> folks had already been using on other architectures.
>
>The current sendfile() has the limitation that it can't read data from
>a socket. Would it be another 5-minute hack to remove this limitation, so
>you could sendfile between sockets? Now _that_ would be sexy :)

I don't think that would be all that sexy at all.

You have to realize that sendfile() is meant as an optimization: by being
able to re-use the same buffers that act as the in-kernel page cache as
buffers for sending data, you avoid one copy.

However, for socket->socket we would not have such an advantage. A
socket->socket sendfile() would not avoid any copies the way the
networking is done today. That _may_ change, of course. But it might
not. And I'd rather tell people using sendfile() that you get EINVAL if
it isn't able to optimize the transfer..

Linus
* Re: Is sendfile all that sexy?
  2001-01-17 19:32 ` Linus Torvalds
@ 2001-01-18  2:34 ` Olivier Galibert
  2001-01-21 21:22   ` LA Walsh
  1 sibling, 1 reply; 130+ messages in thread
From: Olivier Galibert @ 2001-01-18 2:34 UTC (permalink / raw)
To: linux-kernel

On Wed, Jan 17, 2001 at 11:32:35AM -0800, Linus Torvalds wrote:
> However, for socket->socket, we would not have such an advantage. A
> socket->socket sendfile() would not avoid any copies the way the
> networking is done today. That _may_ change, of course. But it might
> not. And I'd rather tell people using sendfile() that you get EINVAL if
> it isn't able to optimize the transfer..

On the other hand you could consider sendfile to be a concept rather
than an optimization. That is, "move n bytes from this fd to that one".
That would be very nice for things like tar (file <-> file or tty), cp
(file <-> file), and application-level routing (socket <-> socket). Hey,
even cat(1) would be simplified.

Whether the kernel can optimize it in zero-copy mode is another problem
that will change with time anyway. But "I want to move x amount of data
from here to there, and I don't need to see the actual contents" is
something that happens quite often, and being able to do it with one
syscall that does not muck with page tables (aka no mmap nor malloc)
would be both more readable and scale better on SMP.

OG.
* RE: Is sendfile all that sexy?
  2001-01-18  2:34 ` Olivier Galibert
@ 2001-01-21 21:22 ` LA Walsh
  0 siblings, 0 replies; 130+ messages in thread
From: LA Walsh @ 2001-01-21 21:22 UTC (permalink / raw)
To: linux-kernel, torvalds

FYI - another use sendfile(2) might be put to.

Suppose you were to generate large amounts of data in the kernel - maybe
kernel profiling data, audit data, whatever. You want to pull that data
out as fast as possible and write it to a disk or network socket.
Normally, I think you'd do a "read/write" that would transfer the data
into user space, then write it back to the target in system space. With
sendfile, it seems, one could write a dump daemon that used sendfile to
dump the data directly out to a target file descriptor without it going
through user space. Just make sure the internal 'raw' data is massaged
into the format of a block device and voila!

A side benefit would be that data in the kernel that is written to the
block device would be 'queued' in the block buffers by being marked
'dirty' and needing to be written out. The device driver marks the
buffers as clean once they are pushed out of a fd by doing a 'seek' to a
new (later) position in the file - whole buffers before that point are
marked 'clean' and freed.

Seems like this would have the benefit of reusing an existing buffer
management system for buffering while also using a single copy to get
data to the target. ???

-l
--
L A Walsh | Trust Technology, Core Linux, SGI
law@sgi.com | Voice/Vmail: (650) 933-5338
* Re: Is sendfile all that sexy? 2001-01-17 19:32 ` Linus Torvalds 2001-01-18 2:34 ` Olivier Galibert @ 2001-01-18 8:23 ` Rogier Wolff 2001-01-18 10:01 ` Andreas Dilger 2001-01-18 12:17 ` Peter Samuelson 2001-01-22 18:13 ` Val Henson 2 siblings, 2 replies; 130+ messages in thread From: Rogier Wolff @ 2001-01-18 8:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel Linus Torvalds wrote: > In article <Pine.LNX.4.30.0101171454340.29536-100000@baphomet.bogo.bogus>, > Ben Mansell <linux-kernel@slimyhorror.com> wrote: > >On 14 Jan 2001, Linus Torvalds wrote: > > > >> And no, I don't actually hink that sendfile() is all that hot. It was > >> _very_ easy to implement, and can be considered a 5-minute hack to give > >> a feature that fit very well in the MM architecture, and that the Apache > >> folks had already been using on other architectures. > > > >The current sendfile() has the limitation that it can't read data from > >a socket. Would it be another 5-minute hack to remove this limitation, so > >you could sendfile between sockets? Now _that_ would be sexy :) > > I don't think that would be all that sexy at all. > > You have to realize, that sendfile() is meant as an optimization, by > being able to re-use the same buffers that act as the in-kernel page > cache as buffers for sending data. So you avoid one copy. > > However, for socket->socket, we would not have such an advantage. A > socket->socket sendfile() would not avoid any copies the way the > networking is done today. That _may_ change, of course. But it might > not. And I'd rather tell people using sendfile() that you get EINVAL if > it isn't able to optimize the transfer.. Linus, I admire your good taste in designing interface, but here is one where we disagree. I'd prefer an interface that says "copy this fd to that one, and optimize that if you can". All cases that can't be optimized would end up doing an in-kernel read / write loop. 
Sure, there is no advantage above doing that same loop in userspace, but this way the kernel can "grow" and optimize more stuff later on. For example, copying a file from one disk to another. I'm pretty sure that some efficiency can be gained if you don't need to handle the possibility of the userspace program accessing the data in between the read and the write. Sure this may not qualify as a "trivial optimization, that can be done with the existing infrastructure" right now, but programs that want to indicate "kernel, please optimize this if you can" can say so. Currently, once the optimization happens to become possible (*), we'll have to upgrade all apps that happen to be able to use it. If now we start advertizing the interface (at a cost of a read/write loop in the kernel: five lines of code) we will be able to upgrade the kernel, and automatically improve the performance of every app that happens to use the interface. Roger. (*) Either because the infrastructure makes it "trivial", or because someone convinces you that it is a valid optimization that makes a huge difference in an important case. -- ** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 ** *-- BitWizard writes Linux device drivers for any device you may have! --* * There are old pilots, and there are bold pilots. * There are also old, bald pilots. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy?
  2001-01-18  8:23 ` Rogier Wolff
@ 2001-01-18 10:01   ` Andreas Dilger
  2001-01-18 11:04     ` Russell Leighton
  2001-01-18 16:24     ` Linus Torvalds
  2001-01-18 12:17   ` Peter Samuelson
  1 sibling, 2 replies; 130+ messages in thread
From: Andreas Dilger @ 2001-01-18 10:01 UTC (permalink / raw)
To: Rogier Wolff; +Cc: Linus Torvalds, linux-kernel

Roger Wolff writes:
> I'd prefer an interface that says "copy this fd to that one, and
> optimize that if you can".
>
> For example, copying a file from one disk to another. I'm pretty sure
> that some efficiency can be gained if you don't need to handle the
> possibility of the userspace program accessing the data in between the
> read and the write. Sure this may not qualify as a "trivial
> optimization, that can be done with the existing infrastructure" right
> now, but programs that want to indicate "kernel, please optimize this
> if you can" can say so.

Actually, this is a great example, because at one point I was working
on a device interface which would offload all of the disk-disk copying
overhead to the disks themselves, and not involve the CPU/RAM at all.

I seem to recall that I2O promised something along these lines as well
(i.e. direct device-device communication).

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/  -- Dogbert
* Re: Is sendfile all that sexy?
  2001-01-18 10:01 ` Andreas Dilger
@ 2001-01-18 11:04   ` Russell Leighton
  2001-01-18 16:36     ` Larry McVoy
  2001-01-19  1:53     ` Linus Torvalds
  2001-01-18 16:24   ` Linus Torvalds
  1 sibling, 2 replies; 130+ messages in thread
From: Russell Leighton @ 2001-01-18 11:04 UTC (permalink / raw)
To: linux-kernel

"copy this fd to that one, and optimize that if you can"

... isn't this Larry M's "splice"
(http://www.bitmover.com/lm/papers/splice.ps)?

Andreas Dilger wrote:
> Roger Wolff writes:
> > I'd prefer an interface that says "copy this fd to that one, and
> > optimize that if you can".
> >
> > For example, copying a file from one disk to another. I'm pretty sure
> > that some efficiency can be gained if you don't need to handle the
> > possibility of the userspace program accessing the data in between the
> > read and the write. Sure this may not qualify as a "trivial
> > optimization, that can be done with the existing infrastructure" right
> > now, but programs that want to indicate "kernel, please optimize this
> > if you can" can say so.
>
> Actually, this is a great example, because at one point I was working
> on a device interface which would offload all of the disk-disk copying
> overhead to the disks themselves, and not involve the CPU/RAM at all.
>
> I seem to recall that I2O promised something along these lines as well
> (i.e. direct device-device communication).
>
> Cheers, Andreas

-- 
-------------------------------------------------
Russell Leighton       leighton@imake.com
http://www.247media.com
Company Vision: To be the preeminent global provider of
interactive marketing solutions and services.
-------------------------------------------------
* Re: Is sendfile all that sexy?
  2001-01-18 11:04 ` Russell Leighton
@ 2001-01-18 16:36   ` Larry McVoy
  2001-01-19  1:53   ` Linus Torvalds
  1 sibling, 0 replies; 130+ messages in thread
From: Larry McVoy @ 2001-01-18 16:36 UTC (permalink / raw)
To: Russell Leighton; +Cc: linux-kernel

On Thu, Jan 18, 2001 at 06:04:17AM -0500, Russell Leighton wrote:
>
> "copy this fd to that one, and optimize that if you can"
>
> ... isn't this Larry M's "splice" (http://www.bitmover.com/lm/papers/splice.ps)?

Not really. It's not clear to me that people really understood what I was
getting at in that and I've had some coffee and BK 2.0 is just about ready
to ship (shameless plug :-) so I'll give it another go.

The goal of splice is to avoid both data copies and virtual memory
completely. My SGI experience taught me that once you remove the data
copy problem, the next problem becomes setting up and tearing down the
virtual mappings to the data. Linux is quite a bit lighter than IRIX but
that doesn't remove this issue, it just moves the point on the spectrum
where the setup/teardown becomes a problem.

Another goal of splice was to be general enough to allow data to flow
from any place to any place. The idea was to have a good model and then
iterate over all the possible endpoints; I can think of files, sockets,
and virtual address spaces right off the top of my head, and devices are
a subset of files as will become apparent.

A final goal was to be able to handle caching vs non-caching. Sometimes
one of the endpoints is a cache, such as the file system cache. Sometimes
you want data to stay in the cache and sometimes you want to bypass it
completely. The model had to handle this.

OK, so the issues are

    - avoid copying
    - avoid virtual memory as much as possible
    - allow data flow to/from non-aligned, non-page-sized objects
    - handle caching or non-caching

This leads pretty naturally to some observations about the shape of the
solution:

    - the basic unit of data is a physical page, or part of one. That's
      physical page, not a virtual address which points to a physical page.

    - since we may be coming from sockets, where the payload is buried in
      the middle of a page, there needs to be a vector of pages and a vector
      of { pageno, offset, len } that goes along with the first vector.
      There are two vectors because you could have multiple payloads in a
      single page, i.e., there is not a 1:1 mapping between pages and
      payloads.

    - The page vector needs some flags, which handle caching. I had just
      two flags, the "LOAN" flag and the "GIFT" flag. In my mind, this was
      enough that everyone should "get it" at this point, but that's me
      being lazy.

So how would this all work? The first thing is that we are now dealing
in vectors of physical pages. That's key - if you look at an OS, it
spends a lot of time with data going into a physical page, then being
translated to a virtual page, being copied to another virtual page, and
then being translated back to a physical page so that it can be sent to
a different device. That's the basic FTP loop. So you go "hey, just
always talk physical pages and you avoid a lot of this wasted time".

Now is a good time to observe that splice() is a library interface.
The kernel level interfaces I called pull() and push(). The idea was
that you could do

	vectors = 0;
	do {
		vectors = pull(from_fd, vectors);
	} while (splice_size(vectors) < BIG_ENOUGH_SIZE);
	push(to_fd, vectors);

The idea was that you maintained a pointer to the vectors, the pointer
is a "cookie", you can't really dereference it in user space, at least
not all of it, but the kernel doesn't want to maintain this stuff, it
wants you to do that.

So you start pulling and then you push what you got. And you, being the
user land process, are never looking at the data, in fact, you can't,
you have a pointer to a data structure which describes the data but you
can't look at it.

A couple of interesting things:

    - this design allows for multiplexing. You could pull from multiple
      devices and then push to one. The interface needs a little tweaking
      for that to be meaningful, we can steal from pipe semantics. We need
      to be able to say how much to pull, so we add a length.

    - there is no reason that you couldn't have an fd which was open to
      /proc/self/my_address_space and you could essentially do an mmap()
      by seeking to where you want the mapping and doing a push to it.
      This is a fairly important point, it allows for end to end. Lots of
      nasty issues with non-page-sized chunks in the vector; what you do
      there depends on the semantics you want.

So what about the caching? That's the loan/gift distinction. The deal
is that these pages have reference counts and when the reference count
goes to zero, somebody has to free them. So the page vector needs a
free_page() function pointer and if the pages are a loan, you call that
function pointer when you are done with them. In other words, if the
file system cache loaned you the pages, you do a call back to let the
file system know you are done with them.

If the pages were a gift, then the function pointer is null and you have
to manage them. You can put the normal decrement_and_free() function in
there and when you get done with them you call that and the pages go
back to the free list. You can also "free" them into your private page
pool, etc.

The point is that if the endpoint which is being pulled() from wants the
pages cached, it "loans" them; if it doesn't, it "gifts" them. Sockets
as a "from" endpoint would always gift, files as a "from" endpoint would
typically loan.

So, there's the set of ideas. I'm ashamed to admit that I don't really
know how close kiobufs are to this.

I am interested in hearing what you all think, but especially what the
people think who have been playing around with kiobufs and sendfile.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm
* Re: Is sendfile all that sexy?
  2001-01-18 11:04 ` Russell Leighton
  2001-01-18 16:36   ` Larry McVoy
@ 2001-01-19  1:53   ` Linus Torvalds
  1 sibling, 0 replies; 130+ messages in thread
From: Linus Torvalds @ 2001-01-19 1:53 UTC (permalink / raw)
To: linux-kernel

In article <3A66CDB1.B61CD27B@imake.com>,
Russell Leighton <leighton@imake.com> wrote:
>
>"copy this fd to that one, and optimize that if you can"
>
>... isn't this Larry M's "splice" (http://www.bitmover.com/lm/papers/splice.ps)?

We talked extensively about "splice()" with Larry. It was one of the
motivations for doing sendfile().

The problem with "splice()" is that it did not have very good semantics
on who does the push and who does the pull, and how to actually
implement this efficiently yet in a generic manner. In many ways, that
lack of good generic interfaces is what turned me off splice(). I showed
Larry the simple solution that gets 95% of what people wanted splice
for, and he didn't object. He didn't have any really good solutions to
the implementation problems either.

Now, the reason it is called "sendfile()" is obviously partially because
others _did_ have sendfiles (NT and HP-UX), but it's also because I
wanted to make it clear that this was NOT a generic splice(). It could
really only work in one direction: from the page cache out. The page
cache would always do a push, and nobody would do a pull.

Now, the page cache has improved, and these days we could _almost_ do a
"receivefile()", with the page cache doing a pull, in addition to the
push it can already do. And yes, I'd probably use the same system call,
and possibly rename it to be "splice()", even though it still wouldn't
be the generic case.

Now, the reason I say "almost" on the page cache "pull()" thing is that
while the page cache can now do basically "prepare_write()" + "pull()" +
"commit_write()", the problem is that it still needs to know the _size_
of the pull() in order to be able to prepare for the write.

Basically, the pull<->push model turns into a four-way handshake:

    (a) prepare for the pull (source)
    (b) prepare for the push (destination)
    (c) do the pull (source)
    (d) commit the push (destination)

and with this kind of model I suspect that we could actually do a fairly
real splice(), where sendfile() would just be a special case.

Right now, the only part we lack above is (a) - everything else we have.
(b) is "prepare_write()", (c) is "read()", (d) is "commit_write()". So
we lack a "prepare_read()" as things stand now.

The interface would probably be something on the order of

	int (*prepare_read)(struct file *, int);

where we'd pass in the "struct file" and the amount of data we'd _like_
to see, and we'd get back the amount of data we can actually have so
that we can successfully prepare for the push (ie "prepare_write()").

		Linus
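[Editorial note: the four-way handshake can be sketched against two in-memory "files". The `prepare_read`/`prepare_write`/`pull_data`/`commit_write` functions below are hypothetical stand-ins for the ops Linus names, not existing kernel entry points; the point is only the ordering of the four steps:]

```c
#include <string.h>
#include <sys/types.h>

struct toyfile {
    char data[4096];
    size_t size;     /* bytes committed so far */
    size_t pos;      /* read position */
};

/* (a) prepare the pull: source answers how much it can actually give. */
static size_t prepare_read(struct toyfile *f, size_t want)
{
    size_t avail = f->size - f->pos;
    return want < avail ? want : avail;
}

/* (b) prepare the push: destination reserves space for exactly n bytes. */
static char *prepare_write(struct toyfile *f, size_t n)
{
    return f->size + n <= sizeof(f->data) ? f->data + f->size : NULL;
}

/* (c) do the pull: source fills the reserved destination slot. */
static void pull_data(struct toyfile *f, char *dst, size_t n)
{
    memcpy(dst, f->data + f->pos, n);
    f->pos += n;
}

/* (d) commit the push: destination makes the new bytes visible. */
static void commit_write(struct toyfile *f, size_t n)
{
    f->size += n;
}

/* One splice step: the size negotiated in (a) is what makes (b) possible. */
static ssize_t toy_splice(struct toyfile *dst, struct toyfile *src, size_t want)
{
    size_t n = prepare_read(src, want);      /* (a) */
    char *slot = prepare_write(dst, n);      /* (b) */
    if (slot == NULL)
        return -1;
    pull_data(src, slot, n);                 /* (c) */
    commit_write(dst, n);                    /* (d) */
    return (ssize_t)n;
}
```

Step (a) is the missing `prepare_read()`: without it, the destination cannot size its reservation in step (b), which is exactly the gap Linus identifies.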
* Re: Is sendfile all that sexy?
  2001-01-18 10:01 ` Andreas Dilger
  2001-01-18 11:04   ` Russell Leighton
@ 2001-01-18 16:24   ` Linus Torvalds
  2001-01-18 18:46     ` Kai Henningsen
  2001-01-18 18:58     ` Roman Zippel
  1 sibling, 2 replies; 130+ messages in thread
From: Linus Torvalds @ 2001-01-18 16:24 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Rogier Wolff, linux-kernel

On Thu, 18 Jan 2001, Andreas Dilger wrote:
>
> Actually, this is a great example, because at one point I was working
> on a device interface which would offload all of the disk-disk copying
> overhead to the disks themselves, and not involve the CPU/RAM at all.

It's a horrible example.

device-to-device copies sound like the ultimate thing.

They suck. They add a lot of complexity and do not work in general. And,
if your "normal" usage pattern really is to just move the data without
even looking at it, then you have to ask yourself whether you're doing
something worthwhile in the first place.

Not going to happen.

		Linus
* Re: Is sendfile all that sexy?
  2001-01-18 16:24 ` Linus Torvalds
@ 2001-01-18 18:46   ` Kai Henningsen
  2001-01-18 18:58   ` Roman Zippel
  1 sibling, 0 replies; 130+ messages in thread
From: Kai Henningsen @ 2001-01-18 18:46 UTC (permalink / raw)
To: linux-kernel

torvalds@transmeta.com (Linus Torvalds) wrote on 18.01.01 in
<Pine.LNX.4.10.10101180822020.18072-100000@penguin.transmeta.com>:

> if your "normal" usage pattern really is to just move the data without
> even looking at it, then you have to ask yourself whether you're doing
> something worthwhile in the first place.

Web server. FTP server. Network file server. cp. mv. cat. dd.

In short, vfs->net (what sendfile already does) and vfs->vfs are
probably the most interesting applications, with net->vfs as a possible
third. Classical bulk data copy applications.

All the other stuff I can think of really does want to look at the
data, and we can already handle virtual memory just fine with
read/write/mmap.

MfG Kai
* Re: Is sendfile all that sexy?
  2001-01-18 16:24 ` Linus Torvalds
  2001-01-18 18:46   ` Kai Henningsen
@ 2001-01-18 18:58   ` Roman Zippel
  2001-01-18 19:42     ` Linus Torvalds
  2001-01-18 19:51     ` Rick Jones
  1 sibling, 2 replies; 130+ messages in thread
From: Roman Zippel @ 2001-01-18 18:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel

Hi,

On Thu, 18 Jan 2001, Linus Torvalds wrote:

> > Actually, this is a great example, because at one point I was working
> > on a device interface which would offload all of the disk-disk copying
> > overhead to the disks themselves, and not involve the CPU/RAM at all.
>
> It's a horrible example.
>
> device-to-device copies sound like the ultimate thing.
>
> They suck. They add a lot of complexity and do not work in general. And,
> if your "normal" usage pattern really is to just move the data without
> even looking at it, then you have to ask yourself whether you're doing
> something worthwhile in the first place.
>
> Not going to happen.

device-to-device is not the same as disk-to-disk. A better example would
be a streaming file server. Slowly the pci bus becomes a bottleneck;
why would you want to move the data twice over the pci bus if once is
enough and the data is very likely not needed afterwards? Sure, you can
use a more expensive 64bit/66MHz bus, but why should you if the
32bit/33MHz bus is theoretically fast enough for your application?

So I'm not advising it as "the ultimate thing", but I don't understand
why it shouldn't happen.

bye, Roman
* Re: Is sendfile all that sexy?
  2001-01-18 18:58 ` Roman Zippel
@ 2001-01-18 19:42   ` Linus Torvalds
  2001-01-19  0:18     ` Roman Zippel
  2001-01-20 15:36     ` Kai Henningsen
  1 sibling, 2 replies; 130+ messages in thread
From: Linus Torvalds @ 2001-01-18 19:42 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel

On Thu, 18 Jan 2001, Roman Zippel wrote:
> >
> > Not going to happen.
>
> device-to-device is not the same as disk-to-disk. A better example would
> be a streaming file server.

No, it wouldn't be.

[ Crystal ball mode: ON ]

It's too damn device-dependent, and it's not worth it. There's no way to
make it general with any current hardware, and there probably isn't going
to be for at least another decade or so. And because it's expensive and
slow to do even on a hardware level, it probably won't be done even then.

Which means that it will continue to be a pure localized hack for the
foreseeable future.

Quite frankly, show me a setup where the network bandwidth is even
_close_ to big enough that it would make sense to try to stream directly
from the disk? The only one I can think of is basically DoD-type
installations with big satellite pictures on a huge server, and gigabit
ethernet everywhere. Quite frankly, if that huge server is so borderline
that it cannot handle the double copy, the system administrators have
big problems.

Streaming local video to disk? Sure, I can see that people might want
that. But if you can't see that people might want to _see_ it while they
are streaming, then you're missing a big part of the picture called
"psychology". So you'd still want to have an in-memory buffer for things
like that.

Come back to this in ten years, when devices and buses are smarter.
MAYBE they'll support it (but see later about why I don't think they
will). Today, you're living in a pipe dream.

You can't practically do it with any real devices of today - even when
both parts support busmastering, they do NOT tend to support "busmaster
to the disk", or "busmaster from the disk". I don't know of any disk
interfaces that offer that kind of interface (they'd basically need to
have some way to busmaster directly to the controller caches, and do
cache management in software. Can be done, but probably exposes more of
the hardware than most people want to see).

Right now the only special case might be some very specific embedded
devices, things like routers, video recorders etc. And for them it would
be very much a special case, with special drivers and everything. This
is NOT a generic kernel issue, and we have not reached the point where
it's even worth it trying to design the interfaces for it yet.

An important point in interface design is to know when you don't know
enough. We do not have the internal interfaces for doing anything like
this, and I seriously doubt they'll be around soon.

And you have to realize that it's not at all a given that device
protocols will even move towards this kind of environment. It's equally
likely that device protocols in the future will be more
memory-intensive, where the basic protocol will all be "read from
memory" and "write to memory", and nobody will even have a notion of
mapping memory into device space like PCI kind of does now.

I haven't looked at what infiniband/NGIO etc spec out, but I'd actually
be surprised if they allow you to effectively short-circuit the IO
networks together. It is not an operation that lends itself well to a
network topology - it happens to work on PCI due to the traditional
"shared bus" kind of logic that PCI inherited. And even on PCI, there
are just a LOT of PCI bridges that apparently do not like seeing PCI-PCI
transfers.

(Short and sweet: most high-performance people want point-to-point
serial line IO with no hops, because it's a known art to make that go
fast. No general-case routing in hardware - if you want to go as fast as
the devices and the link can go, you just don't have time to route.
Trying to support device->device transfers easily slows down the
_common_ case, which is why I personally doubt it will even be supported
10-15 years from now. Better hardware does NOT mean "more features").

		Linus
* Re: Is sendfile all that sexy?
  2001-01-18 19:42 ` Linus Torvalds
@ 2001-01-19  0:18   ` Roman Zippel
  2001-01-19  1:14     ` Linus Torvalds
  2001-01-20 15:36   ` Kai Henningsen
  1 sibling, 1 reply; 130+ messages in thread
From: Roman Zippel @ 2001-01-19 0:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel

Hi,

On Thu, 18 Jan 2001, Linus Torvalds wrote:

> It's too damn device-dependent, and it's not worth it. There's no way to
> make it general with any current hardware, and there probably isn't going
> to be for at least another decade or so. And because it's expensive and
> slow to do even on a hardware level, it probably won't be done even then.
>
> [...]
>
> An important point in interface design is to know when you don't know
> enough. We do not have the internal interfaces for doing anything like
> this, and I seriously doubt they'll be around soon.

I agree, it's device dependent, but such hardware exists. It needs of
course its own memory, but then you can see it as a NUMA architecture
and we already have the support for this. Create a new memory zone for
the device memory and keep the pages reserved. Now you can use it almost
like other memory, e.g. reading from/writing to it using
address_space_ops.

An application where I'd like to use it is audio recording/playback
(24bit, 96kHz on 144 channels). It's possible to copy that amount of
data around, but then you can't do much besides this. All the data is
most of the time only needed on the soundcard, so why should I copy it
first to the main memory?

Right now I'm stuck with accessing a scsi device directly, but I would
love to use the generic file/address_space interface for that, so you
can directly stream to/from any filesystem. The only problem is that the
fs interface is still too slow. That's btw the reason I suggested to
split the get_block function. If you record into a file, you first just
want to allocate any block from the fs for that file. A bit later, when
you start the write, you need a real block. And again a bit later you
can still update the inode. These three stages have completely different
locking requirements (except the page lock) and you can use the same
mechanism for delayed writes.

Anyway, now with the zerocopy network patches, there are basically
already all the needed interfaces and you don't have to wait for 10
years, so I think you need to polish your crystal ball. :-)

bye, Roman
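[Editorial note: Roman's bandwidth argument is easy to check against the numbers he gives (144 channels of 24-bit/96 kHz audio), assuming the commonly quoted theoretical peak of roughly 133 MB/s for a 32-bit/33 MHz PCI bus; real sustained PCI throughput is considerably lower:]

```c
/* Back-of-envelope arithmetic for the streaming-audio example. */

/* 144 channels x 3 bytes (24-bit) x 96000 samples/s, in MB/s */
static double stream_rate_mb(void)
{
    return 144.0 * 3.0 * 96000.0 / 1e6;   /* ~41.5 MB/s */
}

/* Theoretical peak of a 32-bit/33 MHz PCI bus, in MB/s */
static double pci32_peak_mb(void)
{
    return 4.0 * 33e6 / 1e6;              /* ~132 MB/s */
}
```

Copying through main memory moves the stream over the bus twice, roughly 83 MB/s, already around 63% of the theoretical bus peak before any disk or network traffic; streaming device-to-device once would halve that.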
* Re: Is sendfile all that sexy?
  2001-01-19  0:18 ` Roman Zippel
@ 2001-01-19  1:14   ` Linus Torvalds
  2001-01-19  6:57     ` Alan Cox
  2001-01-19 10:13     ` Roman Zippel
  0 siblings, 2 replies; 130+ messages in thread
From: Linus Torvalds @ 2001-01-19 1:14 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel

On Fri, 19 Jan 2001, Roman Zippel wrote:
>
> On Thu, 18 Jan 2001, Linus Torvalds wrote:
>
> > It's too damn device-dependent, and it's not worth it. There's no way to
> > make it general with any current hardware, and there probably isn't going
> > to be for at least another decade or so. And because it's expensive and
> > slow to do even on a hardware level, it probably won't be done even then.
> >
> > [...]
> >
> > An important point in interface design is to know when you don't know
> > enough. We do not have the internal interfaces for doing anything like
> > this, and I seriously doubt they'll be around soon.
>
> I agree, it's device dependent, but such hardware exists.

Show me any practical case where the hardware actually exists.

I do not know of _any_ disk controllers that let you map the controller
buffers over PCI. Which means that with current hardware, you have to
assume that the disk is the initiator of the PCI-PCI DMA requests.
Agreed?

Which in turn implies that the non-disk target hardware has to be able
to have a PCI-mapped memory buffer for the source or the destination,
AND they have to be able to cope with the fact that the data you get off
the disk will have to be the raw data at 512-byte granularity.

There are really quite few devices that do this. The most common example
by far would be a frame buffer, where you could think of streaming a few
frames at a time directly from disk into graphics memory. But nobody
actually saves pictures that way in reality - they all need processing
to show up.

Even when the graphics card does things like mpeg2 decoding in hardware,
the decoding logic is not set up the way the data comes off the disk in
any case I know of.

As to soundcards, all the ones I know about that are worthwhile
certainly have on-board memory, but that memory tends to be used for
things like waveforms etc, and most of them refill their audio data by
doing DMA. Again, they are the initiator of the IO, not a passive
receiver.

I'm sure there are sound cards that just expose their buffers directly.
Fine. Make a special user-space driver for it. Don't try to make it into
a design.

> It needs of
> course its own memory, but then you can see it as a NUMA architecture and
> we already have the support for this. Create a new memory zone for the
> device memory and keep the pages reserved. Now you can use it almost like
> other memory, e.g. reading from/writing to it using address_space_ops.

You need to have a damn special sound card to do the above.

And you wouldn't need a new memory zone - the kernel wouldn't ever touch
the memory anyway, you'd just ioremap() it if you needed to access it
programmatically in addition to the streaming of data off disk.

> An application, where I'd like to use it, is audio recording/playback
> (24bit, 96kHz on 144 channels). Although it's possible to copy that amount
> of data around, but then you can't do much beside this. All the data is
> most of the time only needed on the soundcard, so why should I copy it
> first to the main memory?

Because with 99% of the hardware, there is no other way to get at it?

Also, even when you happen to have the 1% card combination where it
would work in the first place, you'd better make sure that they are on
the same PCI bus. That's usually true on most PC's today, but that's
probably going to be an issue eventually.

> Anyway, now with the zerocopy network patches, there are basically already
> all the needed interfaces and you don't have to wait for 10 years, so I
> think you need to polish your crystal ball. :-)

The zero-copy network patches have _none_ of the interfaces you think
you need. They do not fix the fact that hardware usually doesn't even
_allow_ for what you are hoping for. And what you want is probably going
to be less likely in the future than more likely.

		Linus
* Re: Is sendfile all that sexy?
  2001-01-19  1:14 ` Linus Torvalds
@ 2001-01-19  6:57   ` Alan Cox
  2001-01-19 10:13   ` Roman Zippel
  1 sibling, 0 replies; 130+ messages in thread
From: Alan Cox @ 2001-01-19 6:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Roman Zippel, Andreas Dilger, Rogier Wolff, linux-kernel

> Which in turn implies that the non-disk target hardware has to be able to
> have a PCI-mapped memory buffer for the source or the destination, AND
> they have to be able to cope with the fact that the data you get off the
> disk will have to be the raw data at 512-byte granularity.

And that the chipset gets it right. Which is a big assumption, as tv
card driver folks can tell you. The pcipci stuff in quirks is only a
beginning, alas.
* Re: Is sendfile all that sexy?
  2001-01-19  1:14 ` Linus Torvalds
  2001-01-19  6:57   ` Alan Cox
@ 2001-01-19 10:13   ` Roman Zippel
  2001-01-19 10:55     ` Andre Hedrick
  2001-01-19 20:18     ` kuznet
  1 sibling, 2 replies; 130+ messages in thread
From: Roman Zippel @ 2001-01-19 10:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel

Hi,

On Thu, 18 Jan 2001, Linus Torvalds wrote:

> > I agree, it's device dependent, but such hardware exists.
>
> Show me any practical case where the hardware actually exists.

http://www.augan.com

> I do not know of _any_ disk controllers that let you map the controller
> buffers over PCI. Which means that with current hardware, you have to
> assume that the disk is the initiator of the PCI-PCI DMA requests. Agreed?

Yes.

> I'm sure there are sound cards that just expose their buffers directly.
> Fine. Make a special user-space driver for it. Don't try to make it into a
> design.
> [..]
> You need to have a damn special sound card to do the above.

That's true. "Soundcard" is actually a small understatement. :) Why
should I make a new design for it, when it fits nicely into the current
design?

> And you wouldn't need a new memory zone - the kernel wouldn't ever touch
> the memory anyway, you'd just ioremap() it if you needed to access it
> programmatically in addition to the streaming of data off disk.

ioremapped memory is not the same (that's what we do right now); you
have to fake some virtual address to get the data to the right physical
location.

> Also, even when you happen to have the 1% card combination where it would
> work in the first place, you'd better make sure that they are on the same
> PCI bus. That's usually true on most PC's today, but that's probably going
> to be an issue eventually.

I agree, it's a special setup.

> > Anyway, now with the zerocopy network patches, there are basically already
> > all the needed interfaces and you don't have to wait for 10 years, so I
> > think you need to polish your crystal ball. :-)
>
> The zero-copy network patches have _none_ of the interfaces you think you
> need. They do not fix the fact that hardware usually doesn't even _allow_
> for what you are hoping for. And what you want is probably going to be
> less likely in the future than more likely.

It's about direct i/o from/to pages; for that you need a page struct (so
the ioremapping doesn't work). See the memory on the pci card as normal
memory, except that you can't allocate it normally, but you can still
organize it like normal memory. All you need to do is to set up this
memory area, then you can use it like normal memory, e.g. I can put it
into the page cache and I can do a normal read/write with it. The
changes are very minor, but it would solve so many other problems
(especially alias issues).

I know that this isn't possible with just any hardware combination;
nonetheless it's not that big a problem to support it where it's
possible.

bye, Roman
* Re: Is sendfile all that sexy? 2001-01-19 10:13 ` Roman Zippel @ 2001-01-19 10:55 ` Andre Hedrick 2001-01-19 20:18 ` kuznet 1 sibling, 0 replies; 130+ messages in thread From: Andre Hedrick @ 2001-01-19 10:55 UTC (permalink / raw) To: Roman Zippel; +Cc: Linus Torvalds, Andreas Dilger, Rogier Wolff, linux-kernel On Fri, 19 Jan 2001, Roman Zippel wrote: > Hi, > > On Thu, 18 Jan 2001, Linus Torvalds wrote: > > > > I agree, it's device dependent, but such hardware exists. > > > > Show me any practical case where the hardware actually exists. > > http://www.augan.com > > > I do not know of _any_ disk controllers that let you map the controller > > buffers over PCI. Which means that with current hardware, you have to > > assume that the disk is the initiator of the PCI-PCI DMA requests. Agreed? Err, first-party DMA devices do this, I think. I do have some of these on the radar map. Andre Hedrick Linux ATA Development - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-19 10:13 ` Roman Zippel 2001-01-19 10:55 ` Andre Hedrick @ 2001-01-19 20:18 ` kuznet 2001-01-19 21:45 ` Linus Torvalds 1 sibling, 1 reply; 130+ messages in thread From: kuznet @ 2001-01-19 20:18 UTC (permalink / raw) To: Roman Zippel; +Cc: linux-kernel Hello! > It's about direct i/o from/to pages, Yes. Formally, there are no problems to send to tcp directly from io space. But could someone explain one thing to me. Does bus-mastering from io really work? And if it does, is it fast enough? At least, looking at my book on pci, I do not understand how such transfers are able to use bursts. MRM is banned for them... Alexey - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-19 20:18 ` kuznet @ 2001-01-19 21:45 ` Linus Torvalds 2001-01-20 18:53 ` kuznet 0 siblings, 1 reply; 130+ messages in thread From: Linus Torvalds @ 2001-01-19 21:45 UTC (permalink / raw) To: linux-kernel In article <200101192018.XAA25263@ms2.inr.ac.ru>, <kuznet@ms2.inr.ac.ru> wrote: >Hello! > >> It's about direct i/o from/to pages, > >Yes. Formally, there are no problems to send to tcp directly from io space. Actually, as long as there is no "struct page" there _are_ problems. This is why the NUMA stuff was brought up - it would require that there be a mem_map for the PCI pages.. (to do ref-counting etc). >But could someone explain one thing to me. Does bus-mastering >from io really work? And if it does, is it fast enough? >At least, looking at my book on pci, I do not understand >how such transfers are able to use bursts. MRM is banned for them... It does work at least on some hardware. But no, I don't think you can depend on bursting (but I don't see why it couldn't work in theory). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-19 21:45 ` Linus Torvalds @ 2001-01-20 18:53 ` kuznet 2001-01-20 19:26 ` Linus Torvalds 0 siblings, 1 reply; 130+ messages in thread From: kuznet @ 2001-01-20 18:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel Hello! > Actually, as long as there is no "struct page" there _are_ problems. > This is why the NUMA stuff was brought up - it would require that there > be a mem_map for the PCI pages.. (to do ref-counting etc). I see. Is this strong "no-no-no"? What is obstacle to allow "struct page" to sit outside of mem_map (in some private table, or as full orphan)? Only bloat of struct page with reference to some "page_ops" or something more profound? > It does work at least on some hardware. But no, I don't think you can > depend on bursting (but I don't see why it couldn't work in theory). I do not see either, but the documents are pretty obscure in explaining this. MRM seems to be prohibited for pci-pci. But my education is still not enough even to understand whether MRM is required to burst, or whether this is fully orthogonal. 8) Alexey - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-20 18:53 ` kuznet @ 2001-01-20 19:26 ` Linus Torvalds 2001-01-20 21:20 ` Roman Zippel 2001-01-21 23:21 ` David Woodhouse 0 siblings, 2 replies; 130+ messages in thread From: Linus Torvalds @ 2001-01-20 19:26 UTC (permalink / raw) To: kuznet; +Cc: linux-kernel On Sat, 20 Jan 2001 kuznet@ms2.inr.ac.ru wrote: > > Actually, as long as there is no "struct page" there _are_ problems. > > This is why the NUMA stuff was brought up - it would require that there > > be a mem_map for the PCI pages.. (to do ref-counting etc). > > I see. > > Is this strong "no-no-no"? What is obstacle to allow "struct page" > to sit outside of mem_map (in some private table, or as full orphan)? > Only bloat of struct page with reference to some "page_ops" or something > more profound? There's no no-no here: you can even create the "struct page"s on demand, and create a dummy local zone that contains them that they all point back to. It should be trivial - nobody else cares about those pages or that zone anyway. This is very much how the MM layer in 2.4.x is set up to work. That said, nobody has actually done this in practice yet, so there may be details to work out, of course. I don't see any fundamental reasons it wouldn't easily work, but.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
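[Editor's note: Linus's "dummy local zone" suggestion can be sketched in a few lines. The following is a user-space model with hypothetical, heavily simplified structures — the real 2.4 struct page and zone_struct carry many more fields, and io_zone_create() is an invented name, not a kernel API:]

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified versions of the 2.4 structures. */
struct zone_struct;

struct page {
	struct zone_struct *zone;	/* back-pointer: every page knows its zone */
	int count;			/* reference count */
};

struct zone_struct {
	unsigned long zone_start_pfn;	/* first page-frame number of the aperture */
	unsigned long size;		/* number of pages */
	struct page *zone_mem_map;	/* local map -- NOT the global mem_map */
};

/* Create a dummy local zone covering a PCI aperture on demand.
 * Nobody else ever looks at these pages, so they can live entirely
 * outside the global mem_map. */
static struct zone_struct *io_zone_create(unsigned long start_pfn,
					  unsigned long nr_pages)
{
	struct zone_struct *zone = malloc(sizeof(*zone));
	unsigned long i;

	zone->zone_start_pfn = start_pfn;
	zone->size = nr_pages;
	zone->zone_mem_map = calloc(nr_pages, sizeof(struct page));
	for (i = 0; i < nr_pages; i++)
		zone->zone_mem_map[i].zone = zone;	/* all point back to the zone */
	return zone;
}

/* Page-frame number of a page in such a zone: position in the local
 * map plus the zone's starting frame number. */
static unsigned long page_to_pfn(struct page *page)
{
	struct zone_struct *zone = page->zone;

	return zone->zone_start_pfn + (unsigned long)(page - zone->zone_mem_map);
}
```

[The point of the model is simply that nothing here ever touches the global mem_map: the pages live in a private table and find their zone, and hence their frame number, through the back-pointer.]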
* Re: Is sendfile all that sexy? 2001-01-20 19:26 ` Linus Torvalds @ 2001-01-20 21:20 ` Roman Zippel 2001-01-21 0:25 ` Linus Torvalds 2001-01-21 23:21 ` David Woodhouse 1 sibling, 1 reply; 130+ messages in thread From: Roman Zippel @ 2001-01-20 21:20 UTC (permalink / raw) To: Linus Torvalds; +Cc: kuznet, linux-kernel Hi, On Sat, 20 Jan 2001, Linus Torvalds wrote: > There's no no-no here: you can even create the "struct page"s on demand, > and create a dummy local zone that contains them that they all point back > to. It should be trivial - nobody else cares about those pages or that > zone anyway. AFAIK as long as that dummy page struct is only used in the page cache, that should work, but you get new problems as soon as you map the page also into a user process (grep for CONFIG_DISCONTIGMEM under include/asm-mips64 to see the needed changes). In the worst case one might need reverse mapping to get the page back. :) > That said, nobody has actually done this in practice yet, so there may be > details to work out, of course. I don't see any fundamental reasons it > wouldn't easily work, but.. I hope I soon have the time to experiment with this, so I'll know for sure. I don't see major problems, except I don't know yet how the performance will be. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-20 21:20 ` Roman Zippel @ 2001-01-21 0:25 ` Linus Torvalds 2001-01-21 2:03 ` Roman Zippel 2001-01-21 18:00 ` kuznet 0 siblings, 2 replies; 130+ messages in thread From: Linus Torvalds @ 2001-01-21 0:25 UTC (permalink / raw) To: Roman Zippel; +Cc: kuznet, linux-kernel On Sat, 20 Jan 2001, Roman Zippel wrote: > > AFAIK as long as that dummy page struct is only used in the page cache, > that should work, but you get new problems as soon as you map the page > also into a user process (grep for CONFIG_DISCONTIGMEM under > include/asm-mips64 to see the needed changes). In the worst case one > might need reverse mapping to get the page back. :) No, for the CONTIGMEM case you can just use remap_page_range() directly: it won't actually map the "struct page*" into the user space, it will just map a special reserved page into user space. No changes needed. So it just so happens that the physical address of the two "pages" is the same in this case - one reachable through the dummy "struct page *" and one reachable through the VM layer. The VM layer will never see the dummy "struct page", and that's ok. It doesn't need it. Now, there are things to look out for: when you do these kinds of dummy "struct page" tricks, some macros etc WILL NOT WORK. In particular, we do not currently have a good "page_to_bus/phys()" function. That means that anybody trying to do DMA to this page is currently screwed, simply because he has no good way of getting the physical address. This is a limitation in general: the PTE access functions would also like to have "page_to_phys()" and "phys_to_page()" functions. It gets even worse with IO mappings, where "CPU-physical" is NOT necessarily the same as "bus-physical". 
It shouldn't be too hard to do the phys/bus addresses in general; something like this should actually do it:

	static inline unsigned long page_to_physnr(struct page * page)
	{
		unsigned long offset;
		struct zone_struct * zone = page->zone;

		offset = page - zone->zone_mem_map;
		return zone->zone_start_paddr + offset;
	}

except right now I think "zone_start_paddr" is defined wrong (it's defined to be the actual physical address, rather than being the "physical address shifted right by the page-size"). It needs to be the latter in order to handle physical memory spaces that are bigger than "unsigned long" (ie x86 PAE mode). Making the thing "unsigned long long" is _not_ an option, considering how crappy gcc is at double integers. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
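[Editor's note: the parenthetical about PAE is easy to check numerically. A small, self-contained illustration (my own example, not from the thread): a 36-bit PAE physical address can exceed what a 32-bit unsigned long holds, while the page-frame number — the address shifted right by PAGE_SHIFT — needs only 36 - 12 = 24 bits and always fits:]

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4 KB pages, as on x86 */

/* Would this value fit in a 32-bit "unsigned long"? */
static int fits_in_32_bits(uint64_t x)
{
	return x <= 0xffffffffULL;
}

/* Page-frame number: the physical address shifted right by the page
 * size, which is what zone_start_paddr would need to store. */
static uint64_t paddr_to_pfn(uint64_t paddr)
{
	return paddr >> PAGE_SHIFT;
}
```

[So storing frame numbers instead of byte addresses keeps the arithmetic in 32 bits, which is exactly why "unsigned long long" can be avoided.]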
* Re: Is sendfile all that sexy? 2001-01-21 0:25 ` Linus Torvalds @ 2001-01-21 2:03 ` Roman Zippel 2001-01-21 18:00 ` kuznet 1 sibling, 0 replies; 130+ messages in thread From: Roman Zippel @ 2001-01-21 2:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: kuznet, linux-kernel Hi, On Sat, 20 Jan 2001, Linus Torvalds wrote: > Now, there are things to look out for: when you do these kinds of dummy > "struct page" tricks, some macros etc WILL NOT WORK. In particular, we do > not currently have a good "page_to_bus/phys()" function. That means that > anybody trying to do DMA to this page is currently screwed, simply because > he has no good way of getting the physical address. > > This is a limitation in general: the PTE access functions would also like > to have "page_to_phys()" and "phys_to_page()" functions. It gets even > worse with IO mappings, where "CPU-physical" is NOT necessarily the same > as "bus-physical". That's why I want to avoid dummy struct page and use a real mem_map instead. I have two options: 1. I map everything together in one mem_map, like it's still done for m68k, the overhead here is in the phys_to_virt()/virt_to_phys() function. 2. I use several nodes like mips64/arm and virt_to_page() gets more complex, but this usually assumes a specific memory layout to keep it fast. Once that problem is solved, I can manage the memory on the card like the main memory and use it however I want. I probably do something like ia64 and use the highest bits as an offset into a table. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
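[Editor's note: Roman's second option — using the highest address bits as an index into a table of per-node mem_maps — can be sketched as follows. This is a hypothetical user-space model (names, node size, and layout invented for illustration), not code from any kernel tree:]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT	12
#define NODE_SHIFT	30	/* hypothetical: 1 GB of address space per node */
#define MAX_NODES	8

struct page { int count; };

/* One mem_map per node (main memory, a PCI card, ...), selected by the
 * high bits of the physical address -- the ia64-like table lookup
 * Roman mentions. */
static struct page *node_mem_map[MAX_NODES];
static unsigned long node_start_paddr[MAX_NODES];

static struct page *phys_to_page(unsigned long paddr)
{
	unsigned int node = paddr >> NODE_SHIFT;	/* table index from high bits */
	unsigned long idx = (paddr - node_start_paddr[node]) >> PAGE_SHIFT;

	return &node_mem_map[node][idx];
}

/* Demo setup: node 0 is "main memory", node 1 is the card's memory. */
static struct page demo_pages0[4], demo_pages1[4];

static void setup_demo(void)
{
	node_mem_map[0] = demo_pages0;
	node_start_paddr[0] = 0;
	node_mem_map[1] = demo_pages1;
	node_start_paddr[1] = 1UL << NODE_SHIFT;
}
```

[The cost Roman alludes to is visible here: virt_to_page()/phys_to_page() becomes a table lookup plus a subtraction instead of a single array index off mem_map.]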
* Re: Is sendfile all that sexy? 2001-01-21 0:25 ` Linus Torvalds 2001-01-21 2:03 ` Roman Zippel @ 2001-01-21 18:00 ` kuznet 1 sibling, 0 replies; 130+ messages in thread From: kuznet @ 2001-01-21 18:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: zippel, mingo, linux-kernel Hello! > "struct page" tricks, some macros etc WILL NOT WORK. In particular, we do > not currently have a good "page_to_bus/phys()" function. That means that > anybody trying to do DMA to this page is currently screwed, simply because > he has no good way of getting the physical address. We already have a similar problem with 64-bit DMA on Intel. Namely, we need page_to_bus() and, moreover, we need 64-bit bus addresses for devices understanding them. Now we make this in acenic like:

	#if defined(CONFIG_X86) && defined(CONFIG_HIGHMEM)

	#define BITS_PER_DMAADDR 64

	typedef unsigned long long dmaaddr_high_t;

	static inline dmaaddr_high_t pci_map_single_high(struct pci_dev *hwdev,
							 struct page *page, int offset,
							 size_t size, int dir)
	{
		dmaaddr_high_t phys;

		phys = (page - mem_map) * (unsigned long long) PAGE_SIZE + offset;
		return phys;
	}

	#else

Ingo, do you remember that we agreed not to consider this code as "ready for release" until this issue is cleaned up? I forgot this. 8)8)8) Seems we can remove at least the direct dependencies on mem_map using zone_struct. Alexey - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-20 19:26 ` Linus Torvalds 2001-01-20 21:20 ` Roman Zippel @ 2001-01-21 23:21 ` David Woodhouse 1 sibling, 0 replies; 130+ messages in thread From: David Woodhouse @ 2001-01-21 23:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: kuznet, linux-kernel On Sat, 20 Jan 2001, Linus Torvalds wrote: > There's no no-no here: you can even create the "struct page"s on demand, > and create a dummy local zone that contains them that they all point back > to. It should be trivial - nobody else cares about those pages or that > zone anyway. > > This is very much how the MM layer in 2.4.x is set up to work. > > That said, nobody has actually done this in practice yet, so there may be > details to work out, of course. I don't see any fundamental reasons it > wouldn't easily work, but.. If I follow you correctly, this is how I was planning to provide execute-in-place support for filesystems on flash chips - allocating 'struct page's and adding them to the page cache on read_inode(). -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-18 19:42 ` Linus Torvalds 2001-01-19 0:18 ` Roman Zippel @ 2001-01-20 15:36 ` Kai Henningsen 2001-01-20 21:01 ` Linus Torvalds 1 sibling, 1 reply; 130+ messages in thread From: Kai Henningsen @ 2001-01-20 15:36 UTC (permalink / raw) To: torvalds; +Cc: linux-kernel torvalds@transmeta.com (Linus Torvalds) wrote on 18.01.01 in <Pine.LNX.4.10.10101181120070.18387-100000@penguin.transmeta.com>: > (Short and sweet: most high-performance people want point-to-point serial > line IO with no hops, because it's a known art to make that go fast. No > general-case routing in hardware - if you want to go as fast as the > devices and the link can go, you just don't have time to route. Trying to > support device->device transfers easily slows down the _common_ case, > which is why I personally doubt it will even be supported 10-15 years from > now. Better hardware does NOT mean "more features"). Well, maybe. Then again, I could easily see those I/O devices go the general embedded route, which in a decade or two could well mean they run some sort of embedded Linux on the controller. Which would make some features rather easy to implement. (Think about it: twenty years from now, a typical desktop machine may be a heterogeneous Linux cluster. Didn't someone say something about World Domination?) (Note that I predicted this 2001-01-20T16:35:30. Just in case it actually works out that way.) MfG Kai - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-20 15:36 ` Kai Henningsen @ 2001-01-20 21:01 ` Linus Torvalds 2001-01-20 21:10 ` Mo McKinlay 2001-01-20 22:24 ` Roman Zippel 0 siblings, 2 replies; 130+ messages in thread From: Linus Torvalds @ 2001-01-20 21:01 UTC (permalink / raw) To: Kai Henningsen; +Cc: linux-kernel On 20 Jan 2001, Kai Henningsen wrote: > > Then again, I could easily see those I/O devices go the general embedded > route, which in a decade or two could well mean they run some sort of > embedded Linux on the controller. > > Which would make some features rather easy to implement. I'm not worried about a certain class of features. I will predict, for example, that disk subsystems etc will continue to get smarter, to the point where most people will end up just buying a "file server" whenever they buy a disk. THOSE kinds of features are the obvious ones when you have devices that get smarter, and the kinds of features people are willing to pay for. The things I find really doubtful is that somebody would be so silly as to make the low-level electrical protocol be anything but a simple direct point-to-point link. Shared buses just do not scale, and they also have some major problems with true high-performance GBps bandwidth. Look at where ethernet is today. Ten years ago most people used it as a bus. These days almost everybody thinks of ethernet as point-to-point, with switches and hubs to make it look nothing like the bus of yore. You just don't connect multiple devices to one wire any more. The advantage of direct point-to-point links is that it's a hell of a lot faster, and it's also much easier to distribute - the links don't have to be in lock-step any more etc. It's perfectly ok to have one really high-performance link for devices that need it, and a few low-performance links in the same system do not bog the fast one down. But point-to-point also means that you don't get any real advantage from doing things like device-to-device DMA. 
Because the links are asynchronous, you need buffers in between them anyway, and there is no bandwidth advantage of not going through the hub if the topology is a pretty normal "star" kind of thing. And you _do_ want the star topology, because in the end most of the bandwidth you want concentrated at the point that uses it. The exception to this will be when you have smart devices that _internally_ also have the same kind of structure, and you have a RAID device with multiple disks in a star around the raid controller. Then you'll find the raid controller doing raid rebuilds etc without the data ever coming off that "local star" - but this is not something that the OS will even get involved in other than sending the raid controller the command to start the rebuild. It's not a "device-device" transfer in that bigger sense - it's internal to the raid unit. Just wait. My crystal ball is infallible. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-20 21:01 ` Linus Torvalds @ 2001-01-20 21:10 ` Mo McKinlay 2001-01-20 22:24 ` Roman Zippel 1 sibling, 0 replies; 130+ messages in thread From: Mo McKinlay @ 2001-01-20 21:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kai Henningsen, linux-kernel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Today, Linus Torvalds (torvalds@transmeta.com) wrote: > Just wait. My crystal ball is infallible. One of these days, that line will be your downfall :-) *grins* Mo. - -- Mo McKinlay mmckinlay@gnu.org - ------------------------------------------------------------------------- GnuPG/PGP Key: pub 1024D/76A275F9 2000-07-22 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.4 (GNU/Linux) Comment: For info see http://www.gnupg.org iEYEARECAAYFAjpp/ssACgkQRcGgB3aidfmcagCgkieTFD77O+Xqn+nmcaoiYERh UwwAoIL8cWZPdaKine4fZ4fJmQqwTvBZ =i1Ax -----END PGP SIGNATURE----- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-20 21:01 ` Linus Torvalds 2001-01-20 21:10 ` Mo McKinlay @ 2001-01-20 22:24 ` Roman Zippel 2001-01-21 0:33 ` Linus Torvalds 1 sibling, 1 reply; 130+ messages in thread From: Roman Zippel @ 2001-01-20 22:24 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kai Henningsen, linux-kernel Hi, On Sat, 20 Jan 2001, Linus Torvalds wrote: > But point-to-point also means that you don't get any real advantage from > doing things like device-to-device DMA. Because the links are > asynchronous, you need buffers in between them anyway, and there is no > bandwidth advantage of not going through the hub if the topology is a > pretty normal "star" kind of thing. And you _do_ want the star topology, > because in the end most of the bandwidth you want concentrated at the > point that uses it. I agree, but who says, that the buffer always has to be the main memory? That might be true especially for embedded devices. The CPU is then just the local controller that manages several devices, each with its own buffer. Let's take a file server with multiple disks and multiple network cards, each with its own buffer. For stuff like this you don't want to go through the main memory; on the other hand, you still need to synchronize all the data. Although I don't know of such hardware, I don't see a reason not to do it under Linux. :-) bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-20 22:24 ` Roman Zippel @ 2001-01-21 0:33 ` Linus Torvalds 2001-01-21 1:29 ` David Schwartz ` (2 more replies) 0 siblings, 3 replies; 130+ messages in thread From: Linus Torvalds @ 2001-01-21 0:33 UTC (permalink / raw) To: Roman Zippel; +Cc: Kai Henningsen, linux-kernel On Sat, 20 Jan 2001, Roman Zippel wrote: > > On Sat, 20 Jan 2001, Linus Torvalds wrote: > > > But point-to-point also means that you don't get any real advantage from > > doing things like device-to-device DMA. Because the links are > > asynchronous, you need buffers in between them anyway, and there is no > > bandwidth advantage of not going through the hub if the topology is a > > pretty normal "star" kind of thing. And you _do_ want the star topology, > > because in the end most of the bandwidth you want concentrated at the > > point that uses it. > > I agree, but who says, that the buffer always has to be the main memory? It doesn't _have_ to be. But think like a good hardware designer. In 99% of all cases, where do you want the results of a read to end up? Where do you want the contents of a write to come from? Right. Memory. Now, optimize for the common case. Make the common case go as fast as you can, with as little latency and as high bandwidth as you can. What kind of hardware would _you_ design for the point-to-point link? I'm claiming that you'd do a nice DMA engine for each link point. There wouldn't be any reason to have any other buffers (except, of course, minimal buffers inside the IO chip itself - not for the whole packet, but for just being able to handle cases where you don't have 100% access to the memory bus all the time - and for doing things like burst reads and writes to memory etc). I'm _not_ seeing the point for a high-performance link to have a generic packet buffer. 
Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* RE: Is sendfile all that sexy? 2001-01-21 0:33 ` Linus Torvalds @ 2001-01-21 1:29 ` David Schwartz 2001-01-21 2:42 ` Roman Zippel 2001-01-21 9:52 ` James Sutherland 2 siblings, 0 replies; 130+ messages in thread From: David Schwartz @ 2001-01-21 1:29 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel > I'm _not_ seeing the point for a high-performance link to have a generic > packet buffer. > > Linus Well suppose your RAID controller can take over control of disks distributed throughout your I/O subsystem. If you assume the bandwidth of the I/O subsystem is not the limiting factor, there's no need to hang the disks directly off the RAID controller. This makes even more sense if your computer can upload code to your peripherals which they can then run autonomously. Imagine if your filesystem code is mobile and can reside (perhaps to a variable extent) in your drives if you want it to. Of course none of this really relates to the case of the OS trying to get peripherals to talk to each other directly. DS - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-21 0:33 ` Linus Torvalds 2001-01-21 1:29 ` David Schwartz @ 2001-01-21 2:42 ` Roman Zippel 2001-01-21 9:52 ` James Sutherland 2 siblings, 0 replies; 130+ messages in thread From: Roman Zippel @ 2001-01-21 2:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kai Henningsen, linux-kernel Hi, On Sat, 20 Jan 2001, Linus Torvalds wrote: > But think like a good hardware designer. > > In 99% of all cases, where do you want the results of a read to end up? > Where do you want the contents of a write to come from? > > Right. Memory. > > Now, optimize for the common case. Make the common case go as fast as you > can, with as little latency and as high bandwidth as you can. > > What kind of hardware would _you_ design for the point-to-point link? > > I'm claiming that you'd do a nice DMA engine for each link point. There > wouldn't be any reason to have any other buffers (except, of course, > minimal buffers inside the IO chip itself - not for the whole packet, but > for just being able to handle cases where you don't have 100% access to > the memory bus all the time - and for doing things like burst reads and > writes to memory etc). > > I'm _not_ seeing the point for a high-performance link to have a generic > packet buffer. I completely agree, if we are talking about standard PC hardware. I was thinking more about some dedicated hardware, where you want to get the data directly to the correct place. If the hardware does a bit more with the data, you need large buffers. In a standard PC the main CPU does most of the data processing, but in dedicated hardware you might have several cards, each with its own logic and memory, and here the CPU only manages that stuff. You can do all this from user space, of course, but this means you have to copy the data around, which you don't want with such hardware when the kernel can help you a bit.
bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-21 0:33 ` Linus Torvalds 2001-01-21 1:29 ` David Schwartz 2001-01-21 2:42 ` Roman Zippel @ 2001-01-21 9:52 ` James Sutherland 2001-01-21 10:02 ` Ingo Molnar 2001-01-22 9:52 ` Helge Hafting 2 siblings, 2 replies; 130+ messages in thread From: James Sutherland @ 2001-01-21 9:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: Roman Zippel, Kai Henningsen, linux-kernel On Sat, 20 Jan 2001, Linus Torvalds wrote: > > > On Sat, 20 Jan 2001, Roman Zippel wrote: > > > > On Sat, 20 Jan 2001, Linus Torvalds wrote: > > > > > But point-to-point also means that you don't get any real advantage from > > > doing things like device-to-device DMA. Because the links are > > > asynchronous, you need buffers in between them anyway, and there is no > > > bandwidth advantage of not going through the hub if the topology is a > > > pretty normal "star" kind of thing. And you _do_ want the star topology, > > > because in the end most of the bandwidth you want concentrated at the > > > point that uses it. > > > > I agree, but who says, that the buffer always has to be the main memory? > > It doesn't _have_ to be. > > But think like a good hardware designer. > > In 99% of all cases, where do you want the results of a read to end up? > Where do you want the contents of a write to come from? > > Right. Memory. For many applications, yes - but think about a file server for a moment. 99% of the data read from the RAID (or whatever) is really aimed at the appropriate NIC - going via main memory would just slow things down. Take a heavily laden webserver. With a nice intelligent NIC and RAID controller, you might have the httpd write the header to this NIC, then have the NIC and RAID controller handle the sendfile operation themselves - without ever touching the OS with this data. > Now, optimize for the common case. Make the common case go as fast as > you can, with as little latency and as high bandwidth as you can. 
> > What kind of hardware would _you_ design for the point-to-point link? > > I'm claiming that you'd do a nice DMA engine for each link point. There > wouldn't be any reason to have any other buffers (except, of course, > minimal buffers inside the IO chip itself - not for the whole packet, but > for just being able to handle cases where you don't have 100% access to > the memory bus all the time - and for doing things like burst reads and > writes to memory etc). > > I'm _not_ seeing the point for a high-performance link to have a generic > packet buffer. I'd agree with that, but I would want peripherals to be able to send data to each other without touching the host memory - think about playing video files with an accelerator (just pipe the files from disk to the accelerator), music with an "intelligent" sound card (just pipe the music to the card), video capture, file servers, CD burning... Having an Ethernet-style point-to-point "network" (everything connected as a star, with something intelligent in the middle to direct the data where it needs to go) makes sense, but don't assume everything is heading for the host's memory. DMA straight to/from a "switch" would be a nice solution, though... James. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Is sendfile all that sexy? 2001-01-21 9:52 ` James Sutherland @ 2001-01-21 10:02 ` Ingo Molnar 2001-01-22 9:52 ` Helge Hafting 1 sibling, 0 replies; 130+ messages in thread From: Ingo Molnar @ 2001-01-21 10:02 UTC (permalink / raw) To: James Sutherland Cc: Linus Torvalds, Roman Zippel, Kai Henningsen, Linux Kernel List On Sun, 21 Jan 2001, James Sutherland wrote: > For many applications, yes - but think about a file server for a > moment. 99% of the data read from the RAID (or whatever) is really > aimed at the appropriate NIC - going via main memory would just slow > things down. patently wrong. Compare the bandwidth of PCI and the bandwidth of memory controllers. It's both slower, has higher latency and uses up more valuable (PCI) bandwidth to do PCI->PCI transfers. The number of situations where PCI->PCI transactions are the preferred method are *very* limited, and i think we should deal with them when we see them. But this has been said at the very beginning of this thread already, please read it all ... Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 130+ messages in thread
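[Editor's note: Ingo's bandwidth comparison can be made concrete with rough 2001-era theoretical peak numbers (figures mine, not from the thread; sustained rates are considerably lower): 32-bit/33 MHz PCI peaks around 132 MB/s, 64-bit/66 MHz PCI around 528 MB/s, while a 64-bit PC133 SDRAM bus peaks around 1064 MB/s.]

```c
#include <assert.h>

/* Theoretical peak bandwidth of a synchronous bus in MB/s:
 * (bytes transferred per clock) * (clocks per microsecond).
 * Illustrative back-of-the-envelope only -- real buses never
 * sustain their peak, and PCI's "33 MHz" is really 33.33 MHz. */
static unsigned int bus_peak_mb_s(unsigned int width_bits, unsigned int mhz)
{
	return (width_bits / 8) * mhz;
}
```

[Even the fastest PCI variant of the day stays well under the memory bus, which is Ingo's point: routing through main memory is not the bottleneck, the PCI segments are.]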
* Re: Is sendfile all that sexy?
  2001-01-21  9:52 ` James Sutherland
  2001-01-21 10:02   ` Ingo Molnar
@ 2001-01-22  9:52   ` Helge Hafting
  2001-01-22 13:00     ` James Sutherland
  1 sibling, 1 reply; 130+ messages in thread
From: Helge Hafting @ 2001-01-22 9:52 UTC (permalink / raw)
To: James Sutherland, linux-kernel

James Sutherland wrote:
>
> On Sat, 20 Jan 2001, Linus Torvalds wrote:
> >
> > On Sat, 20 Jan 2001, Roman Zippel wrote:
> > >
> > > On Sat, 20 Jan 2001, Linus Torvalds wrote:
> > >
> > > > But point-to-point also means that you don't get any real advantage
> > > > from doing things like device-to-device DMA. Because the links are
> > > > asynchronous, you need buffers in between them anyway, and there is no
> > > > bandwidth advantage of not going through the hub if the topology is a
> > > > pretty normal "star" kind of thing. And you _do_ want the star
> > > > topology, because in the end most of the bandwidth you want
> > > > concentrated at the point that uses it.
> > >
> > > I agree, but who says that the buffer always has to be the main memory?
> >
> > It doesn't _have_ to be.
> >
> > But think like a good hardware designer.
> >
> > In 99% of all cases, where do you want the results of a read to end up?
> > Where do you want the contents of a write to come from?
> >
> > Right. Memory.
>
> For many applications, yes - but think about a file server for a moment.
> 99% of the data read from the RAID (or whatever) is really aimed at the
> appropriate NIC - going via main memory would just slow things down.
>
> Take a heavily laden webserver. With a nice intelligent NIC and RAID
> controller, you might have the httpd write the header to this NIC, then
> have the NIC and RAID controller handle the sendfile operation themselves
> - without ever touching the OS with this data.

And when the next user wants the same webpage/file you read it from the
RAID again? Seems to me you lose the benefit of caching stuff in memory
with this scheme. Sure - the RAID controller might have some cache, but
it is usually smaller than main memory anyway.

And then there are things like retransmissions...

Helge Hafting
* Re: Is sendfile all that sexy?
  2001-01-22  9:52 ` Helge Hafting
@ 2001-01-22 13:00   ` James Sutherland
  2001-01-23  9:01     ` Helge Hafting
  0 siblings, 1 reply; 130+ messages in thread
From: James Sutherland @ 2001-01-22 13:00 UTC (permalink / raw)
To: Helge Hafting; +Cc: linux-kernel

On Mon, 22 Jan 2001, Helge Hafting wrote:

> And when the next user wants the same webpage/file you read it from
> the RAID again? Seems to me you lose the benefit of caching stuff in
> memory with this scheme. Sure - the RAID controller might have some
> cache, but it is usually smaller than main memory anyway.

Hrm... good point. Using "main memory" (whose memory, on a NUMA box??) as
a cache could be a performance boost in some circumstances. On the other
hand, you're eating up a chunk of memory bandwidth which could be used
for other things - even when you only cache in "spare" RAM, how do you
decide who uses that RAM - and whether or not they should?

There certainly comes a point at which not caching in RAM would be a net
win, but ATM the kernel doesn't know enough to determine this.

On a shared bus, probably the best solution would be to have the data
sent to both devices (NIC and RAM) at once?

> And then there are things like retransmissions...

Hopefully handled by an intelligent NIC in most cases; if you're caching
the file in RAM as well (by "CCing" the data there the first time) this
is OK anyway.

Something to think about, but probably more on-topic for linux-futures I
suspect...


James.
* Re: Is sendfile all that sexy?
  2001-01-22 13:00 ` James Sutherland
@ 2001-01-23  9:01   ` Helge Hafting
  2001-01-23  9:37     ` James Sutherland
  0 siblings, 1 reply; 130+ messages in thread
From: Helge Hafting @ 2001-01-23 9:01 UTC (permalink / raw)
To: James Sutherland; +Cc: linux-kernel

James Sutherland wrote:
>
> On Mon, 22 Jan 2001, Helge Hafting wrote:
>
> > And when the next user wants the same webpage/file you read it from
> > the RAID again? Seems to me you lose the benefit of caching stuff in
> > memory with this scheme. Sure - the RAID controller might have some
> > cache, but it is usually smaller than main memory anyway.
>
> Hrm... good point. Using "main memory" (whose memory, on a NUMA box??) as
> a cache could be a performance boost in some circumstances. On the other
> hand, you're eating up a chunk of memory bandwidth which could be used
> for other things - even when you only cache in "spare" RAM, how do you
> decide who uses that RAM - and whether or not they should?

If we will need it again soon - cache it. If not, consider your
device->device scheme. What we will need is often impossible to know, so
approximations like LRU are used. You could have an object table
(probably a file table or disk block table) counting how often various
files/objects are referenced. You can then decide to use RAID->NIC
transfers for something that hasn't been read before, and memory cache
when something is re-read for the nth time in a given time interval. "n"
and the time interval depend on how much cache you have, and the size of
your working set.

This might be a win, maybe even a big win under some circumstances. But
considering how it works for only a few devices, and how complicated it
is, the conclusion becomes: don't do it for standard linux.

You may of course try to make super-performance servers that work for a
special hw combination, with a single very optimized linux driver taking
care of the RAID adapter, the NIC(s), the fs, parts of the network stack
and possibly the web server too.

Helge Hafting
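[Editor's sketch] The counting heuristic Helge describes - route the first
reads device-to-device, promote an object to the RAM cache once it has been
re-read more than n times within a time window - can be sketched as below.
The table, names, and thresholds are all invented for illustration; nothing
like this exists in the kernel.

```c
#include <string.h>

#define MAX_OBJECTS 64

enum route { DEVICE_TO_DEVICE, RAM_CACHE };

struct read_counter {
	const char *object;	/* file or disk-block identifier */
	long window_start;	/* when the current window began */
	int reads;		/* reads seen within this window */
};

static struct read_counter table[MAX_OBJECTS];

/* Decide where a read of `object` at time `now` should go: an object
 * becomes "hot" (worth caching in RAM) after more than n reads within
 * `window` time units. */
static enum route route_read(const char *object, long now, int n, long window)
{
	struct read_counter *rc = NULL;
	for (int i = 0; i < MAX_OBJECTS; i++) {
		if (table[i].object && strcmp(table[i].object, object) == 0) {
			rc = &table[i];
			break;
		}
		if (!table[i].object) {	/* first time we see this object */
			rc = &table[i];
			rc->object = object;
			rc->window_start = now;
			rc->reads = 0;
			break;
		}
	}
	if (!rc)
		return DEVICE_TO_DEVICE;	/* table full: don't cache */

	if (now - rc->window_start > window) {	/* window expired: restart */
		rc->window_start = now;
		rc->reads = 0;
	}
	rc->reads++;
	return rc->reads > n ? RAM_CACHE : DEVICE_TO_DEVICE;
}
```

With n=2 and a window of 10, the first two reads of an object go
device-to-device and the third within the window hits the RAM cache.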
* Re: Is sendfile all that sexy?
  2001-01-23  9:01 ` Helge Hafting
@ 2001-01-23  9:37   ` James Sutherland
  0 siblings, 0 replies; 130+ messages in thread
From: James Sutherland @ 2001-01-23 9:37 UTC (permalink / raw)
To: Helge Hafting; +Cc: linux-kernel

On Tue, 23 Jan 2001, Helge Hafting wrote:
> James Sutherland wrote:
> >
> > On Mon, 22 Jan 2001, Helge Hafting wrote:
> >
> > > And when the next user wants the same webpage/file you read it from
> > > the RAID again? Seems to me you lose the benefit of caching stuff in
> > > memory with this scheme. Sure - the RAID controller might have some
> > > cache, but it is usually smaller than main memory anyway.
> >
> > Hrm... good point. Using "main memory" (whose memory, on a NUMA box??)
> > as a cache could be a performance boost in some circumstances. On the
> > other hand, you're eating up a chunk of memory bandwidth which could be
> > used for other things - even when you only cache in "spare" RAM, how do
> > you decide who uses that RAM - and whether or not they should?
>
> If we will need it again soon - cache it. If not, consider your
> device->device scheme. What we will need is often impossible to know, so
> approximations like LRU are used. You could have an object table
> (probably a file table or disk block table) counting how often various
> files/objects are referenced. You can then decide to use RAID->NIC
> transfers for something that hasn't been read before, and memory cache
> when something is re-read for the nth time in a given time interval.

I think my compromise of sending it to both simultaneously is better: if
you do reuse it, you've just got a cache hit (win); if not, you've just
burned some RAM bandwidth, which isn't a catastrophe.

> This might be a win, maybe even a big win under some circumstances.
> But considering how it works for only a few devices, and how
> complicated it is, the conclusion becomes: don't do it for standard
> linux.

Eh? This is a hardware system - Linux has very little hardware in it :-)

> You may of course try to make super-performance servers that work for
> a special hw combination, with a single very optimized linux driver
> taking care of the RAID adapter, the NIC(s), the fs, parts of the
> network stack and possibly the web server too.

Actually, I'd like it to be a much more generic thing. If you get an
"intelligent" NIC, it will have, say, a StrongARM processor on it. Why
shouldn't the code running on that processor be supplied by the kernel,
as part of the NIC driver? Given a powerful enough CPU on the NIC, you
could offload a useful chunk of the Linux network stack to it.

Or a RAID adapter - instead of coming with the manufacturer's proprietary
code for striping etc., upload Linux's own RAID software to the CPU. Run
some subset of X's primitives on the graphics card. On an I2O-type system
(dedicated ARM processor or similar for handling I/O), run the low-level
caching stuff, perhaps, or some of the FS code.

Over the next few years, we'll see a lot of little baby CPUs in our PCs,
on network cards, video cards etc. I'd like to see Linux able to take
advantage of this sort of off-load capability where possible.

James.
* Re: Is sendfile all that sexy?
  2001-01-18 18:58 ` Roman Zippel
  2001-01-18 19:42   ` Linus Torvalds
@ 2001-01-18 19:51   ` Rick Jones
  1 sibling, 0 replies; 130+ messages in thread
From: Rick Jones @ 2001-01-18 19:51 UTC (permalink / raw)
To: Roman Zippel; +Cc: Linus Torvalds, Andreas Dilger, Rogier Wolff, linux-kernel

> device-to-device is not the same as disk-to-disk. A better example would
> be a streaming file server. Slowly the pci bus becomes a bottleneck, why
> would you want to move the data twice over the pci bus if once is enough
> and the data very likely not needed afterwards? Sure you can use a more
> expensive 64bit/66MHz bus, but why should you if the 32bit/33MHz bus is
> theoretically fast enough for your application?

theoretically fast enough for the application would imply the dual
transfers across the bus would fit :)

also, if a system was doing something with that much throughput, i
suspect it would not only be designed with 64/66 busses (or better), but
also have things on several different busses. that makes device to
device life more of a challenge.

rick jones
--
ftp://ftp.cup.hp.com/dist/networking/misc/rachel/
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, OR post, but please do NOT do BOTH...
my email address is raj in the cup.hp.com domain...
* Re: Is sendfile all that sexy?
  2001-01-18  8:23 ` Rogier Wolff
  2001-01-18 10:01   ` Andreas Dilger
@ 2001-01-18 12:17   ` Peter Samuelson
  1 sibling, 0 replies; 130+ messages in thread
From: Peter Samuelson @ 2001-01-18 12:17 UTC (permalink / raw)
To: Rogier Wolff; +Cc: Linus Torvalds, linux-kernel

[Rogier Wolff]
> I'd prefer an interface that says "copy this fd to that one, and
> optimize that if you can".

So do exactly that in libc.

  sendfile()
  {
          if (sys_sendfile() == -1)
                  return (errno == EINVAL) ? do_slow_sendfile() : -1;
          return 0;
  }

Peter
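[Editor's sketch] For illustration, the do_slow_sendfile() fallback in
Peter's wrapper could look like the obvious read()/write() emulation below.
The name and shape are taken from his sketch and are hypothetical - this
is not the actual glibc code - but the loop shows the copies that the real
sendfile() avoids, plus the short-write and EINTR handling a correct
userspace version needs.

```c
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>

/* Copy up to `count` bytes from in_fd to out_fd through a userspace
 * buffer; returns bytes copied, or -1 on error. */
ssize_t do_slow_sendfile(int out_fd, int in_fd, size_t count)
{
	char buf[8192];
	ssize_t total = 0;

	while ((size_t)total < count) {
		size_t want = count - total;
		if (want > sizeof(buf))
			want = sizeof(buf);

		ssize_t n = read(in_fd, buf, want);
		if (n == 0)
			break;			/* EOF on the source */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		/* Short writes are possible (e.g. on sockets): loop. */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out_fd, buf + off, n - off);
			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
	return total;
}
```

Every byte crosses user space twice here - exactly the double copy that the
in-kernel page-cache path eliminates.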
* Re: Is sendfile all that sexy?
  2001-01-17 19:32 ` Linus Torvalds
  2001-01-18  2:34   ` Olivier Galibert
  2001-01-18  8:23   ` Rogier Wolff
@ 2001-01-22 18:13   ` Val Henson
  2001-01-22 18:27     ` David Lang
  2001-01-22 18:54     ` Linus Torvalds
  2 siblings, 2 replies; 130+ messages in thread
From: Val Henson @ 2001-01-22 18:13 UTC (permalink / raw)
To: linux-kernel; +Cc: Linus Torvalds

On Wed, Jan 17, 2001 at 11:32:35AM -0800, Linus Torvalds wrote:
> In article <Pine.LNX.4.30.0101171454340.29536-100000@baphomet.bogo.bogus>,
> Ben Mansell <linux-kernel@slimyhorror.com> wrote:
> >
> > The current sendfile() has the limitation that it can't read data from
> > a socket. Would it be another 5-minute hack to remove this limitation,
> > so you could sendfile between sockets? Now _that_ would be sexy :)
>
> I don't think that would be all that sexy at all.
>
> You have to realize that sendfile() is meant as an optimization, by
> being able to re-use the same buffers that act as the in-kernel page
> cache as buffers for sending data. So you avoid one copy.
>
> However, for socket->socket, we would not have such an advantage. A
> socket->socket sendfile() would not avoid any copies the way the
> networking is done today. That _may_ change, of course. But it might
> not. And I'd rather tell people using sendfile() that you get EINVAL if
> it isn't able to optimize the transfer..

Yes, socket->socket sendfile is not that sexy. I actually did this for
2.2.16 in the obvious (and stupid) way, copying data into a buffer and
writing it out again. The performance was unsurprisingly _exactly_
identical to a userspace read()/write() loop.

There is a use for an optimized socket->socket transfer - proxying high
speed TCP connections. It would be exciting if the zerocopy networking
framework led to a decent socket->socket transfer.

-VAL
* Re: Is sendfile all that sexy?
  2001-01-22 18:13 ` Val Henson
@ 2001-01-22 18:27   ` David Lang
  2001-01-22 19:37     ` Val Henson
  2001-01-22 18:54   ` Linus Torvalds
  1 sibling, 1 reply; 130+ messages in thread
From: David Lang @ 2001-01-22 18:27 UTC (permalink / raw)
To: Val Henson; +Cc: linux-kernel, Linus Torvalds

On Mon, 22 Jan 2001, Val Henson wrote:

> There is a use for an optimized socket->socket transfer - proxying
> high speed TCP connections. It would be exciting if the zerocopy
> networking framework led to a decent socket->socket transfer.

if you are proxying connections you should really be looking at what
data you pass through your proxy.

now replace proxying with routing and I would agree with you (but I'll
bet this is handled in the kernel IP stack anyway)

David Lang
* Re: Is sendfile all that sexy?
  2001-01-22 18:27 ` David Lang
@ 2001-01-22 19:37   ` Val Henson
  2001-01-22 20:01     ` David Lang
  0 siblings, 1 reply; 130+ messages in thread
From: Val Henson @ 2001-01-22 19:37 UTC (permalink / raw)
To: David Lang; +Cc: linux-kernel, Linus Torvalds

On Mon, Jan 22, 2001 at 10:27:58AM -0800, David Lang wrote:
> On Mon, 22 Jan 2001, Val Henson wrote:
>
> > There is a use for an optimized socket->socket transfer - proxying
> > high speed TCP connections. It would be exciting if the zerocopy
> > networking framework led to a decent socket->socket transfer.
>
> if you are proxying connections you should really be looking at what
> data you pass through your proxy.
>
> now replace proxying with routing and I would agree with you (but I'll
> bet this is handled in the kernel IP stack anyway)

Well, there is a (real-world) case where your TCP proxy doesn't want to
look at the data and you can't use IP forwarding. If you have TCP
connections between networks that have very different MTUs, using IP
forwarding will result in tiny packets on the large MTU networks.

So who cares? Some machines, notably Crays and NECs, have a severely
rate-limited network stack and can only transmit up to about 3500
packets per second. That's 40 Mbps on a 1500 byte MTU network, but
greater than line speed on HIPPI (65280 MTU, 800 Mbps).

So, for a rate-limited network stack on a HIPPI network, the best way to
talk to a machine on a gigabit ethernet network is through a TCP proxy
which just doesn't care about the data going through it. Hence my
interest in socket->socket sendfile().

I'll admit this is an odd corner case which isn't important enough to
justify socket->socket sendfile() on its own. But this odd corner case
did make enough money to pay my salary for years to come. :)

-VAL
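[Editor's sketch] Val's figures check out with one line of arithmetic: a
stack limited to a fixed packet rate is bounded by rate x MTU, so throughput
scales linearly with packet size. Using the numbers quoted in the post:

```c
/* Upper bound on throughput (in Mbps) for a network stack that is
 * limited by packet rate rather than by byte rate. */
static double max_throughput_mbps(double packets_per_sec, double mtu_bytes)
{
	/* bytes/s -> bits/s -> megabits/s */
	return packets_per_sec * mtu_bytes * 8.0 / 1e6;
}

/* max_throughput_mbps(3500, 1500)  -> 42 Mbps    ("about 40 Mbps")
 * max_throughput_mbps(3500, 65280) -> ~1828 Mbps, well above HIPPI's
 *                                     800 Mbps line rate - hence
 *                                     "greater than line speed". */
```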
* Re: Is sendfile all that sexy?
  2001-01-22 19:37 ` Val Henson
@ 2001-01-22 20:01   ` David Lang
  2001-01-22 22:04     ` Ion Badulescu
  0 siblings, 1 reply; 130+ messages in thread
From: David Lang @ 2001-01-22 20:01 UTC (permalink / raw)
To: Val Henson; +Cc: linux-kernel, Linus Torvalds

how about always_defragment (or whatever the option is now called) so
that your routing box always reassembles packets and then fragments them
to the correct size for the next segment? wouldn't this do the job?

David Lang

On Mon, 22 Jan 2001, Val Henson wrote:

> Date: Mon, 22 Jan 2001 12:37:07 -0700
> From: Val Henson <vhenson@esscom.com>
> To: David Lang <dlang@diginsite.com>
> Cc: linux-kernel@vger.kernel.org, Linus Torvalds <torvalds@transmeta.com>
> Subject: Re: Is sendfile all that sexy?
>
> On Mon, Jan 22, 2001 at 10:27:58AM -0800, David Lang wrote:
> > On Mon, 22 Jan 2001, Val Henson wrote:
> >
> > > There is a use for an optimized socket->socket transfer - proxying
> > > high speed TCP connections. It would be exciting if the zerocopy
> > > networking framework led to a decent socket->socket transfer.
> >
> > if you are proxying connections you should really be looking at what
> > data you pass through your proxy.
> >
> > now replace proxying with routing and I would agree with you (but I'll
> > bet this is handled in the kernel IP stack anyway)
>
> Well, there is a (real-world) case where your TCP proxy doesn't want
> to look at the data and you can't use IP forwarding. If you have TCP
> connections between networks that have very different MTUs, using IP
> forwarding will result in tiny packets on the large MTU networks.
>
> So who cares? Some machines, notably Crays and NECs, have a severely
> rate-limited network stack and can only transmit up to about 3500
> packets per second. That's 40 Mbps on a 1500 byte MTU network, but
> greater than line speed on HIPPI (65280 MTU, 800 Mbps).
>
> So, for a rate-limited network stack on a HIPPI network, the best way
> to talk to a machine on a gigabit ethernet network is through a TCP
> proxy which just doesn't care about the data going through it. Hence
> my interest in socket->socket sendfile().
>
> I'll admit this is an odd corner case which isn't important enough to
> justify socket->socket sendfile() on its own. But this odd corner
> case did make enough money to pay my salary for years to come. :)
>
> -VAL
* Re: Is sendfile all that sexy?
  2001-01-22 20:01 ` David Lang
@ 2001-01-22 22:04   ` Ion Badulescu
  0 siblings, 0 replies; 130+ messages in thread
From: Ion Badulescu @ 2001-01-22 22:04 UTC (permalink / raw)
To: David Lang; +Cc: linux-kernel, Linus Torvalds, Val Henson

On Mon, 22 Jan 2001 12:01:23 -0800 (PST), David Lang <dlang@diginsite.com> wrote:

> how about always_defragment (or whatever the option is now called) so
> that your routing box always reassembles packets and then fragments them
> to the correct size for the next segment? wouldn't this do the job?

It doesn't help with TCP, because the negotiated MSS will always be 1500
and thus there won't be any fragments to re-assemble.

> On Mon, 22 Jan 2001, Val Henson wrote:
>
> > Well, there is a (real-world) case where your TCP proxy doesn't want
> > to look at the data and you can't use IP forwarding. If you have TCP
> > connections between networks that have very different MTUs, using IP
> > forwarding will result in tiny packets on the large MTU networks.

There is another real-world case: a load-balancing proxy. socket->socket
sendfile would allow the proxy to open a non-keepalive connection to the
backend server, send the request, and then just link the two sockets
together using sendfile.

Of course, some changes would have to be made to the API. An asynchronous
sendsocket()/sendfile() system call would be just lovely, in fact. :-)

Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
* Re: Is sendfile all that sexy?
  2001-01-22 18:13 ` Val Henson
  2001-01-22 18:27   ` David Lang
@ 2001-01-22 18:54   ` Linus Torvalds
  1 sibling, 0 replies; 130+ messages in thread
From: Linus Torvalds @ 2001-01-22 18:54 UTC (permalink / raw)
To: Val Henson; +Cc: linux-kernel

On Mon, 22 Jan 2001, Val Henson wrote:
> On Wed, Jan 17, 2001 at 11:32:35AM -0800, Linus Torvalds wrote:
> >
> > However, for socket->socket, we would not have such an advantage. A
> > socket->socket sendfile() would not avoid any copies the way the
> > networking is done today. That _may_ change, of course. But it might
> > not. And I'd rather tell people using sendfile() that you get EINVAL if
> > it isn't able to optimize the transfer..
>
> Yes, socket->socket sendfile is not that sexy. I actually did this
> for 2.2.16 in the obvious (and stupid) way, copying data into a buffer
> and writing it out again. The performance was unsurprisingly
> _exactly_ identical to a userspace read()/write() loop.

The thing is, that if I knew that I could always beat the user-space
numbers (by virtue of having fewer system calls etc), I would still
consider "sendfile()" to be ok for it.

But we can actually do _worse_ in sendfile() than in user-space
applications. For example, userspace "read+write" may know more about
packet boundary behaviour etc, which sendfile is totally clueless about,
so a userspace application might actually get _better_ performance by
doing it by hand.

That's why I currently want sendfile() to only work for the things we
_know_ we can do better.

		Linus
[parent not found: <Pine.LNX.4.10.10101190911130.10218-100000@penguin.transmeta.com>]
* Re: Is sendfile all that sexy?
       [not found] <Pine.LNX.4.10.10101190911130.10218-100000@penguin.transmeta.com>
@ 2001-01-19 17:23 ` Rogier Wolff
  0 siblings, 0 replies; 130+ messages in thread
From: Rogier Wolff @ 2001-01-19 17:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> > I wrote a driver for a zoran-chipset frame-grabber card. The "natural"
> > way to save a video stream was exactly the way it came out of the
> > card. And the card was structured that you could put on an "mpeg
> > decoder" (or encoder) chip, and you could DMA the stream directly into
> > that chip.
>
> Ehh..
>
> And how many of these chips are out on the market?
>
> Would you agree that it is less than 0.01% of all PC hardware? Like MUCH
> less?

Someone asked me to write a driver for one of these cards. I was assuming
that most of them work like this. And I'm never wrong, you know...

> > The way soundcards are commonly programmed, they don't play from their
> > own memory, but from main memory. However, they all can play from
> > their own memory.
>
> And how do you synchronize the streams etc? It's a nasty piece of
> business, and direct PCI-PCI streaming is not the answer.
>
> > > And you wouldn't need a new memory zone - the kernel wouldn't ever
> > > touch the memory anyway, you'd just ioremap() it if you needed to
> > > access it programmatically in addition to the streaming of data off
> > > disk.
> >
> > That's the way things currently work. If you start thinking about it
> > as a NUMA, it may improve the situation for "common users" too.
> >
> > A PC is a NUMA machine! We have disk (swap) and main memory. We also
> > have a frame buffer, which doesn't currently fit into our memory
> > architecture.
>
> Don't be silly. It fits _fine_ in our memory architecture. We map it to
> xfree86, and we're done with it.
>
> Using the frame buffer for "backing store" for normal memory is not
> worth it. That's what disks are for. Frame buffers are _way_ too small
> to be interesting as a memory resource.

It's a silly small resource that suddenly becomes usable should the right
infrastructure be in place. It isn't. You're not planning on doing it
soonish. Neither am I.

			Roger.

--
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.
* Re: Is sendfile all that sexy?
@ 2001-01-24 15:12 Sasi Peter
  2001-01-24 15:29 ` James Sutherland
  2001-01-25  1:11 ` Alan Cox
  0 siblings, 2 replies; 130+ messages in thread
From: Sasi Peter @ 2001-01-24 15:12 UTC (permalink / raw)
To: James Sutherland, linux-kernel

> AIUI, Jeff Merkey was working on loading "userspace" apps into the
> kernel to tackle this sort of problem generically. I don't know if he's
> tried it with Samba - the forking would probably be a problem...

I think that is not what we need. Ingo once wrote that since HTTP serving
can also be viewed as a kind of fileserving, it should be possible to
create a TUX-like module for the same framework that serves using the SMB
protocol instead of HTTP...

--
SaPE / Sasi Péter / mailto: sape@sch.hu / http://sape.iq.rulez.org/
* Re: Is sendfile all that sexy?
  2001-01-24 15:12 Sasi Peter
@ 2001-01-24 15:29 ` James Sutherland
  0 siblings, 0 replies; 130+ messages in thread
From: James Sutherland @ 2001-01-24 15:29 UTC (permalink / raw)
To: Sasi Peter; +Cc: linux-kernel

On Wed, 24 Jan 2001, Sasi Peter wrote:

> > AIUI, Jeff Merkey was working on loading "userspace" apps into the
> > kernel to tackle this sort of problem generically. I don't know if
> > he's tried it with Samba - the forking would probably be a problem...
>
> I think that is not what we need. Ingo once wrote that since HTTP
> serving can also be viewed as a kind of fileserving, it should be
> possible to create a TUX-like module for the same framework that serves
> using the SMB protocol instead of HTTP...

I must admit I'm a bit sceptical - apart from anything else, Jeff's
approach allows a bug in the server software to blow the whole OS away,
instead of just quietly coring! (Or, worse still, trample on some FS
metadata in RAM... eek!)

A TUX module would be a nice idea, although I haven't even been able to
find a proper TUX web page - Google just gave page after page of mailing
list archives and discussion about it :-(


James.
* Re: Is sendfile all that sexy?
  2001-01-24 15:12 Sasi Peter
  2001-01-24 15:29 ` James Sutherland
@ 2001-01-25  1:11 ` Alan Cox
  2001-01-25  9:06   ` James Sutherland
  1 sibling, 1 reply; 130+ messages in thread
From: Alan Cox @ 2001-01-25 1:11 UTC (permalink / raw)
To: Sasi Peter; +Cc: James Sutherland, linux-kernel

> I think that is not what we need. Ingo once wrote that since HTTP
> serving can also be viewed as a kind of fileserving, it should be
> possible to create a TUX-like module for the same framework that serves
> using the SMB protocol instead of HTTP...

Kernel SMB is basically not a sane idea. sendfile can help it though
* Re: Is sendfile all that sexy?
  2001-01-25  1:11 ` Alan Cox
@ 2001-01-25  9:06   ` James Sutherland
  2001-01-25 10:42     ` bert hubert
  0 siblings, 1 reply; 130+ messages in thread
From: James Sutherland @ 2001-01-25 9:06 UTC (permalink / raw)
To: Alan Cox; +Cc: Sasi Peter, linux-kernel

On Thu, 25 Jan 2001, Alan Cox wrote:

> > I think that is not what we need. Ingo once wrote that since HTTP
> > serving can also be viewed as a kind of fileserving, it should be
> > possible to create a TUX-like module for the same framework that
> > serves using the SMB protocol instead of HTTP...
>
> Kernel SMB is basically not a sane idea. sendfile can help it though

Right now, ISTR Samba is still a forking daemon?? This has less impact on
performance than it would for an httpd, because of the long-lived
sessions, but rewriting it as a state machine (no forking, threads or
other crap, just use non-blocking I/O) would probably make much more
sense.


James.
* Re: Is sendfile all that sexy?
  2001-01-25  9:06 ` James Sutherland
@ 2001-01-25 10:42   ` bert hubert
  2001-01-25 12:14     ` James Sutherland
  0 siblings, 1 reply; 130+ messages in thread
From: bert hubert @ 2001-01-25 10:42 UTC (permalink / raw)
To: linux-kernel

On Thu, Jan 25, 2001 at 09:06:33AM +0000, James Sutherland wrote:

> performance than it would for an httpd, because of the long-lived
> sessions, but rewriting it as a state machine (no forking, threads or
> other crap, just use non-blocking I/O) would probably make much more
> sense.

From a kernel coder's perspective, possibly. But a lot of SMB details are
pretty convoluted. State machines may produce more efficient code but can
be hell to maintain and expand. Bugs can hide in lots of corners.

Regards,

bert hubert

--
PowerDNS                     Versatile DNS Services
Trilab                       The Technology People
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet
* Re: Is sendfile all that sexy?
  2001-01-25 10:42 ` bert hubert
@ 2001-01-25 12:14   ` James Sutherland
  0 siblings, 0 replies; 130+ messages in thread
From: James Sutherland @ 2001-01-25 12:14 UTC (permalink / raw)
To: bert hubert; +Cc: linux-kernel

On Thu, 25 Jan 2001, bert hubert wrote:

> On Thu, Jan 25, 2001 at 09:06:33AM +0000, James Sutherland wrote:
>
> > performance than it would for an httpd, because of the long-lived
> > sessions, but rewriting it as a state machine (no forking, threads or
> > other crap, just use non-blocking I/O) would probably make much more
> > sense.
>
> From a kernel coder's perspective, possibly. But a lot of SMB details
> are pretty convoluted. State machines may produce more efficient code
> but can be hell to maintain and expand. Bugs can hide in lots of
> corners.

I said they were good from a performance PoV - I didn't say they were
easy! Obviously there are reasons why the Samba guys have done what they
have. In fact, some parts of Samba ARE implemented as state machines to
some extent; presumably the remainder were considered too difficult to
reimplement that way for the time being.


James.
Thread overview: 130+ messages
2001-01-14 18:29 Is sendfile all that sexy? jamal
2001-01-14 18:50 ` Ingo Molnar
2001-01-14 19:02 ` jamal
2001-01-14 19:09 ` Ingo Molnar
2001-01-14 19:18 ` jamal
2001-01-14 20:22 ` Linus Torvalds
2001-01-14 20:38 ` Ingo Molnar
2001-01-14 21:44 ` Linus Torvalds
2001-01-14 21:49 ` Ingo Molnar
2001-01-14 21:54 ` Gerhard Mack
2001-01-14 22:40 ` Linus Torvalds
2001-01-14 22:45 ` J Sloan
2001-01-15 20:15 ` H. Peter Anvin
2001-01-15 3:43 ` Michael Peddemors
2001-01-15 13:02 ` Florian Weimer
2001-01-15 13:45 ` Tristan Greaves
2001-01-15 1:14 ` Dan Hollis
2001-01-15 15:24 ` Jonathan Thackray
2001-01-15 15:36 ` Matti Aarnio
2001-01-15 20:17 ` H. Peter Anvin
2001-01-15 16:05 ` dean gaudet
2001-01-15 18:34 ` Jonathan Thackray
2001-01-15 18:46 ` Linus Torvalds
2001-01-15 20:47 ` [patch] sendpath() support, 2.4.0-test3/-ac9 Ingo Molnar
2001-01-16 4:51 ` dean gaudet
2001-01-16 4:59 ` Linus Torvalds
2001-01-16 9:48 ` 'native files', 'object fingerprints' [was: sendpath()] Ingo Molnar
2000-01-01 2:02 ` Pavel Machek
2001-01-16 11:13 ` Andi Kleen
2001-01-16 11:26 ` Ingo Molnar
2001-01-16 11:37 ` Andi Kleen
2001-01-16 12:04 ` O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]] Ingo Molnar
2001-01-16 12:09 ` Ingo Molnar
2001-01-16 12:13 ` Peter Samuelson
2001-01-16 12:33 ` Ingo Molnar
2001-01-16 14:40 ` Felix von Leitner
2001-01-16 12:34 ` Andi Kleen
2001-01-16 13:00 ` Mitchell Blank Jr
2001-01-16 13:57 ` 'native files', 'object fingerprints' [was: sendpath()] Jamie Lokier
2001-01-16 14:27 ` Felix von Leitner
2001-01-16 17:47 ` Linus Torvalds
2001-01-17 4:39 ` dean gaudet
2001-01-16 9:19 ` [patch] sendpath() support, 2.4.0-test3/-ac9 Ingo Molnar
2001-01-17 0:03 ` dean gaudet
2001-01-15 18:58 ` Is sendfile all that sexy? dean gaudet
2001-01-15 19:41 ` Ingo Molnar
2001-01-15 20:33 ` Albert D. Cahalan
2001-01-15 21:00 ` Linus Torvalds
2001-01-16 10:40 ` Felix von Leitner
2001-01-16 11:56 ` Peter Samuelson
2001-01-16 12:37 ` Ingo Molnar
2001-01-16 12:42 ` Ingo Molnar
2001-01-16 12:47 ` Felix von Leitner
2001-01-16 13:48 ` Jamie Lokier
2001-01-16 14:20 ` Felix von Leitner
2001-01-16 15:05 ` David L. Parsley
2001-01-16 15:05 ` Jakub Jelinek
2001-01-16 15:46 ` David L. Parsley
2001-01-18 14:00 ` Laramie Leavitt
2001-01-17 19:27 ` dean gaudet
2001-01-24 0:58 ` Sasi Peter
2001-01-24 8:44 ` James Sutherland
2001-01-25 10:20 ` Anton Blanchard
2001-01-25 10:58 ` Sasi Peter
2001-01-26 6:10 ` Anton Blanchard
2001-01-26 11:46 ` David S. Miller
2001-01-26 14:12 ` Anton Blanchard
2001-01-15 23:16 ` Pavel Machek
2001-01-16 13:47 ` jamal
2001-01-16 14:41 ` Pavel Machek
-- strict thread matches above, loose matches on Subject: below --
2001-01-16 13:50 Andries.Brouwer
2001-01-17 6:56 ` Ton Hospel
2001-01-17 7:31 ` Steve VanDevender
2001-01-17 8:09 ` Ton Hospel
2001-01-17 15:02 Ben Mansell
2000-01-01 2:10 ` Pavel Machek
2001-01-17 19:32 ` Linus Torvalds
2001-01-18 2:34 ` Olivier Galibert
2001-01-21 21:22 ` LA Walsh
2001-01-18 8:23 ` Rogier Wolff
2001-01-18 10:01 ` Andreas Dilger
2001-01-18 11:04 ` Russell Leighton
2001-01-18 16:36 ` Larry McVoy
2001-01-19 1:53 ` Linus Torvalds
2001-01-18 16:24 ` Linus Torvalds
2001-01-18 18:46 ` Kai Henningsen
2001-01-18 18:58 ` Roman Zippel
2001-01-18 19:42 ` Linus Torvalds
2001-01-19 0:18 ` Roman Zippel
2001-01-19 1:14 ` Linus Torvalds
2001-01-19 6:57 ` Alan Cox
2001-01-19 10:13 ` Roman Zippel
2001-01-19 10:55 ` Andre Hedrick
2001-01-19 20:18 ` kuznet
2001-01-19 21:45 ` Linus Torvalds
2001-01-20 18:53 ` kuznet
2001-01-20 19:26 ` Linus Torvalds
2001-01-20 21:20 ` Roman Zippel
2001-01-21 0:25 ` Linus Torvalds
2001-01-21 2:03 ` Roman Zippel
2001-01-21 18:00 ` kuznet
2001-01-21 23:21 ` David Woodhouse
2001-01-20 15:36 ` Kai Henningsen
2001-01-20 21:01 ` Linus Torvalds
2001-01-20 21:10 ` Mo McKinlay
2001-01-20 22:24 ` Roman Zippel
2001-01-21 0:33 ` Linus Torvalds
2001-01-21 1:29 ` David Schwartz
2001-01-21 2:42 ` Roman Zippel
2001-01-21 9:52 ` James Sutherland
2001-01-21 10:02 ` Ingo Molnar
2001-01-22 9:52 ` Helge Hafting
2001-01-22 13:00 ` James Sutherland
2001-01-23 9:01 ` Helge Hafting
2001-01-23 9:37 ` James Sutherland
2001-01-18 19:51 ` Rick Jones
2001-01-18 12:17 ` Peter Samuelson
2001-01-22 18:13 ` Val Henson
2001-01-22 18:27 ` David Lang
2001-01-22 19:37 ` Val Henson
2001-01-22 20:01 ` David Lang
2001-01-22 22:04 ` Ion Badulescu
2001-01-22 18:54 ` Linus Torvalds
[not found] <Pine.LNX.4.10.10101190911130.10218-100000@penguin.transmeta.com>
2001-01-19 17:23 ` Rogier Wolff
2001-01-24 15:12 Sasi Peter
2001-01-24 15:29 ` James Sutherland
2001-01-25 1:11 ` Alan Cox
2001-01-25 9:06 ` James Sutherland
2001-01-25 10:42 ` bert hubert
2001-01-25 12:14 ` James Sutherland