* Re: Is sendfile all that sexy?
2001-01-17 15:02 Is sendfile all that sexy? Ben Mansell
@ 2000-01-01 2:10 ` Pavel Machek
2001-01-17 19:32 ` Linus Torvalds
1 sibling, 0 replies; 109+ messages in thread
From: Pavel Machek @ 2000-01-01 2:10 UTC (permalink / raw)
To: Ben Mansell; +Cc: torvalds, linux-kernel
Hi!
> > And no, I don't actually think that sendfile() is all that hot. It was
> > _very_ easy to implement, and can be considered a 5-minute hack to give
> > a feature that fit very well in the MM architecture, and that the Apache
> > folks had already been using on other architectures.
>
> The current sendfile() has the limitation that it can't read data from
> a socket. Would it be another 5-minute hack to remove this limitation, so
> you could sendfile between sockets? Now _that_ would be sexy :)
I had a patch to do that. (Unoptimized, of course.)
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Is sendfile all that sexy?
@ 2001-01-14 18:29 jamal
2001-01-14 18:50 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 109+ messages in thread
From: jamal @ 2001-01-14 18:29 UTC (permalink / raw)
To: linux-kernel, netdev
I thought i'd run some tests on the new zerocopy patches
(this is using a hacked ttcp which knows how to do sendfile
and does MSG_TRUNC for true zero-copy receive, if you know what i mean
;-> ).
2 back-to-back SMP 2*PII-450MHz machines hooked up via 1M acenics (gigE).
MTU 9K.
Before getting excited i had the courage to give plain 2.4.0-pre3 a whirl
and some things bothered me.
test1:
------
regular ttcp, no ZC and no sendfile. Send as much as you can in 15 secs;
actually 8192-byte chunks, 2048 of them at a time. Repeat until the 15 secs
are complete.
Repeat the test 5 times to narrow experimental deviation.
Throughput: ~99MB/sec (for those obsessed with Mbps ~810Mbps)
CPU abuse: server side 87%, client side 22% (the CPU measurement could do
with some work and a proper measure for SMP).
test2:
------
sendfile server.
created a file which is 8192*2048 bytes. Again the same 15 second
exercise as test1 (and the 5-set thing):
- throughput: 86MB/sec
- CPU: server 100%, client 17%
So i figured, no problem, i'll re-run it with a file 10 times larger.
**I was disappointed to see no improvement.**
Looking at the system calls being made:
with the non-sendfile version, approximately 182K write-to-socket system
calls were made, each writing 8192 bytes. Each call lasted on average
0.08ms.
With sendfile (test2): 78 calls were made, each sending the full file
size of 8192*2048 bytes; each lasted about 199 msecs.
TWO observations:
- Given Linux's non-pre-emptability of the kernel, i get the feeling that
sendfile could starve other user-space programs. Imagine trying to send a
1Gig file on a 10Mbps pipe in one shot.
- It doesn't matter if you break down the file into chunks for
self-pre-emption; sendfile is still a pig.
I have a feeling i am missing some very serious shit. So enlighten me.
Has anyone done similar tests?
Anyways, the struggle continues next with zc patches.
cheers,
jamal
* Re: Is sendfile all that sexy?
2001-01-14 18:29 jamal
@ 2001-01-14 18:50 ` Ingo Molnar
2001-01-14 19:02 ` jamal
2001-01-14 20:22 ` Linus Torvalds
2001-01-15 23:16 ` Pavel Machek
2 siblings, 1 reply; 109+ messages in thread
From: Ingo Molnar @ 2001-01-14 18:50 UTC (permalink / raw)
To: jamal; +Cc: linux-kernel, netdev
On Sun, 14 Jan 2001, jamal wrote:
> regular ttcp, no ZC and no sendfile. [...]
> Throughput: ~99MB/sec (for those obsessed with Mbps ~810Mbps)
> CPU abuse: server side 87% client side 22% [...]
> sendfile server.
> - throughput: 86MB/sec
> - CPU: server 100%, client 17%
i believe what you are seeing here is the overhead of the pagecache. When
using sendmsg() only, you do not read() the file every time, right? Is
ttcp using multiple threads? In that case, if sendfile() is using the
*same* file all the time, it creates SMP locking overhead.
if this is the case, what result do you get if you use a separate,
isolated file per process? (And i bet that with DaveM's pagecache
scalability patch the situation would also get much better - the global
pagecache_lock hurts.)
Ingo
* Re: Is sendfile all that sexy?
2001-01-14 18:50 ` Ingo Molnar
@ 2001-01-14 19:02 ` jamal
2001-01-14 19:09 ` Ingo Molnar
0 siblings, 1 reply; 109+ messages in thread
From: jamal @ 2001-01-14 19:02 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, netdev
On Sun, 14 Jan 2001, Ingo Molnar wrote:
>
> i believe what you are seeing here is the overhead of the pagecache. When
> using sendmsg() only, you do not read() the file every time, right? Is
In that case just a user-space buffer is sent, i.e. no file association.
> ttcp using multiple threads?
Only a single thread, single flow setup. Very primitive but simple.
> In that case, if sendfile() is using the
> *same* file all the time, it creates SMP locking overhead.
>
> if this is the case, what result do you get if you use a separate,
> isolated file per process? (And i bet that with DaveM's pagecache
> scalability patch the situation would also get much better - the global
> pagecache_lock hurts.)
>
Already doing the single file, single process. However, i do run by time,
which means i could read the file from the beginning (offset 0) to the end,
then re-do it as many times as 15 secs would allow. Does this affect
it? I tried one 1.5 GB file, but it was oopsing and given my setup right
now i can't trace it. So i am using about 170M, which is read about 8
times in the 15 secs.
cheers,
jamal
* Re: Is sendfile all that sexy?
2001-01-14 19:02 ` jamal
@ 2001-01-14 19:09 ` Ingo Molnar
2001-01-14 19:18 ` jamal
0 siblings, 1 reply; 109+ messages in thread
From: Ingo Molnar @ 2001-01-14 19:09 UTC (permalink / raw)
To: jamal; +Cc: linux-kernel, netdev
On Sun, 14 Jan 2001, jamal wrote:
> Already doing the single file, single process. [...]
in this case there could still be valid performance differences, as
copying from user-space is cheaper than copying from the pagecache. To
rule out SMP interactions, you could try a UP-IOAPIC kernel on that box.
(I'm also curious what kind of numbers you'll get with the zerocopy
patch.)
> However, i do run by time, which means i could read the file from the
> beginning (offset 0) to the end, then re-do it as many times as
> 15 secs would allow. Does this affect it? [...]
no, in the case of a single thread this should have minimum impact. But
i'd suggest to increase the /proc/sys/net/tcp*mem* values (to 1MB or
more).
Ingo
* Re: Is sendfile all that sexy?
2001-01-14 19:09 ` Ingo Molnar
@ 2001-01-14 19:18 ` jamal
0 siblings, 0 replies; 109+ messages in thread
From: jamal @ 2001-01-14 19:18 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, netdev
On Sun, 14 Jan 2001, Ingo Molnar wrote:
>
> in this case there could still be valid performance differences, as
> copying from user-space is cheaper than copying from the pagecache. To
> rule out SMP interactions, you could try a UP-IOAPIC kernel on that box.
>
Let me complete this with the ZC patches first. then i'll do that.
There are a few retransmits; maybe receiver IRQ affinity might help some.
> (I'm also curious what kind of numbers you'll get with the zerocopy
> patch.)
Working on it.
> no, in the case of a single thread this should have minimum impact. But
> i'd suggest to increase the /proc/sys/net/tcp*mem* values (to 1MB or
> more).
The upper thresholds to 1000000?
I should have mentioned that i currently set /proc/sys/net/core/*mem*
to 262144.
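For reference, the knobs discussed in this exchange are sysctls; a sketch of setting them on a 2.4-era box, with illustrative values (the thread itself only names the 1MB upper threshold and 262144):

```shell
# Per-socket TCP buffer bounds: "min default max" in bytes.
# 1048576 = 1MB, the upper threshold Ingo suggests raising.
echo "4096 87380 1048576" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 1048576" > /proc/sys/net/ipv4/tcp_wmem

# The core socket-buffer caps jamal set to 262144.
echo 262144 > /proc/sys/net/core/rmem_max
echo 262144 > /proc/sys/net/core/wmem_max
```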
cheers,
jamal
* Re: Is sendfile all that sexy?
2001-01-14 18:29 jamal
2001-01-14 18:50 ` Ingo Molnar
@ 2001-01-14 20:22 ` Linus Torvalds
2001-01-14 20:38 ` Ingo Molnar
` (3 more replies)
2001-01-15 23:16 ` Pavel Machek
2 siblings, 4 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-14 20:22 UTC (permalink / raw)
To: linux-kernel
In article <Pine.GSO.4.30.0101141237020.12354-100000@shell.cyberus.ca>,
jamal <hadi@cyberus.ca> wrote:
>
>Before getting excited i had the courage to give plain 2.4.0-pre3 a whirl
>and some things bothered me.
Note that "sendfile(fd, file, len)" is never going to be faster than
"write(fd, userdata, len)".
That's not the point of sendfile(). The point of sendfile() is to be
faster than the _combination_ of:
addr = mmap(file, ...len...);
write(fd, addr, len);
or
read(file, userdata, len);
write(fd, userdata, len);
and in your case you're not comparing sendfile() against this
combination. You're just comparing sendfile() against a simple
"write()".
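As a hedged sketch (not code from this thread; the helper names are invented), the combination Linus describes versus sendfile() looks like this:

```c
/* Contrast the two data paths: read()/write() copies the data into a
 * user buffer and back out; sendfile() moves page-cache pages without
 * the user-space bounce. (Historically out_fd had to be a socket;
 * kernels 2.6.33+ accept any fd.) */
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* read()+write() bounce: two copies across the user/kernel boundary. */
ssize_t copy_via_buffer(int out_fd, int in_fd, size_t len)
{
    char buf[8192];
    size_t total = 0;
    while (total < len) {
        size_t want = len - total < sizeof buf ? len - total : sizeof buf;
        ssize_t n = read(in_fd, buf, want);
        if (n <= 0)
            return n < 0 ? -1 : (ssize_t)total;
        if (write(out_fd, buf, (size_t)n) != n)
            return -1;
        total += (size_t)n;
    }
    return (ssize_t)total;
}

/* sendfile(): the kernel pushes page-cache pages straight to out_fd. */
ssize_t copy_via_sendfile(int out_fd, int in_fd, size_t len)
{
    off_t off = 0;
    size_t total = 0;
    while (total < len) {
        ssize_t n = sendfile(out_fd, in_fd, &off, len - total);
        if (n <= 0)
            return n < 0 ? -1 : (ssize_t)total;
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

jamal's benchmark compares the second helper against a bare write() of a user buffer, which skips the read() half of the first helper entirely.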
And no, I don't actually think that sendfile() is all that hot. It was
_very_ easy to implement, and can be considered a 5-minute hack to give
a feature that fit very well in the MM architecture, and that the Apache
folks had already been using on other architectures.
The only obvious use for it is file serving, and as high-performance
file serving tends to end up as a kernel module in the end anyway (the
only hold-out is samba, and that's been discussed too), "sendfile()"
really is more a proof of concept than anything else.
Does anybody but apache actually use it?
Linus
PS. I still _like_ sendfile(), even if the above sounds negative. It's
basically a "cool feature" that has zero negative impact on the design
of the system. It uses the same "do_generic_file_read()" that is used
for normal "read()", and is also used by the loop device and by
in-kernel fileserving. But it's not really "important".
* Re: Is sendfile all that sexy?
2001-01-14 20:22 ` Linus Torvalds
@ 2001-01-14 20:38 ` Ingo Molnar
2001-01-14 21:44 ` Linus Torvalds
2001-01-14 21:54 ` Gerhard Mack
2001-01-15 1:14 ` Dan Hollis
` (2 subsequent siblings)
3 siblings, 2 replies; 109+ messages in thread
From: Ingo Molnar @ 2001-01-14 20:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux Kernel List
On 14 Jan 2001, Linus Torvalds wrote:
> Does anybody but apache actually use it?
There is a Samba patch as well that makes it sendfile() based. Various
other projects use it too (phttpd for example), some FTP servers i
believe, and khttpd and TUX.
Ingo
* Re: Is sendfile all that sexy?
2001-01-14 20:38 ` Ingo Molnar
@ 2001-01-14 21:44 ` Linus Torvalds
2001-01-14 21:49 ` Ingo Molnar
2001-01-14 21:54 ` Gerhard Mack
1 sibling, 1 reply; 109+ messages in thread
From: Linus Torvalds @ 2001-01-14 21:44 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel List
On Sun, 14 Jan 2001, Ingo Molnar wrote:
>
> There is a Samba patch as well that makes it sendfile() based. Various
> other projects use it too (phttpd for example), some FTP servers i
> believe, and khttpd and TUX.
At least khttpd uses "do_generic_file_read()", not sendfile per se. I
assume TUX does too. Sendfile itself is mainly only useful from user
space..
Linus
* Re: Is sendfile all that sexy?
2001-01-14 21:44 ` Linus Torvalds
@ 2001-01-14 21:49 ` Ingo Molnar
0 siblings, 0 replies; 109+ messages in thread
From: Ingo Molnar @ 2001-01-14 21:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux Kernel List
On Sun, 14 Jan 2001, Linus Torvalds wrote:
> > There is a Samba patch as well that makes it sendfile() based. Various
> > other projects use it too (phttpd for example), some FTP servers i
> > believe, and khttpd and TUX.
>
> At least khttpd uses "do_generic_file_read()", not sendfile per se. I
> assume TUX does too. Sendfile itself is mainly only useful from user
> space..
yes, you are right. TUX does it mainly to avoid some of the user-space
interfacing overhead present in sys_sendfile(), and to be able to control
packet boundaries. (ie. to have or not have the MSG_MORE flag). So TUX is
using its own sock_send_actor and own read_descriptor.
Ingo
* Re: Is sendfile all that sexy?
2001-01-14 20:38 ` Ingo Molnar
2001-01-14 21:44 ` Linus Torvalds
@ 2001-01-14 21:54 ` Gerhard Mack
2001-01-14 22:40 ` Linus Torvalds
2001-01-15 13:02 ` Florian Weimer
1 sibling, 2 replies; 109+ messages in thread
From: Gerhard Mack @ 2001-01-14 21:54 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, Linux Kernel List
On Sun, 14 Jan 2001, Ingo Molnar wrote:
>
> On 14 Jan 2001, Linus Torvalds wrote:
>
> > Does anybody but apache actually use it?
>
> There is a Samba patch as well that makes it sendfile() based. Various
> other projects use it too (phttpd for example), some FTP servers i
> believe, and khttpd and TUX.
Proftpd, to name one ftp server; a nice little daemon, and it uses linux-privs too.
Gerhard
PS I wish someone would explain to me why distros insist on using WU
instead, given its horrid security record.
--
Gerhard Mack
gmack@innerfire.net
<>< As a computer I find your faith in technology amusing.
* Re: Is sendfile all that sexy?
2001-01-14 21:54 ` Gerhard Mack
@ 2001-01-14 22:40 ` Linus Torvalds
2001-01-14 22:45 ` J Sloan
2001-01-15 3:43 ` Michael Peddemors
2001-01-15 13:02 ` Florian Weimer
1 sibling, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-14 22:40 UTC (permalink / raw)
To: Gerhard Mack; +Cc: Ingo Molnar, Linux Kernel List
On Sun, 14 Jan 2001, Gerhard Mack wrote:
>
> PS I wish someone would explain to me why distros insist on using WU
> instead, given its horrid security record.
I think it's a case of "better the devil you know..".
Think of all the security scares sendmail has historically had. But it's a
pretty secure piece of work now - and people know it backwards and
forwards. Few people advocate switching from sendmail these days (sure,
they do exist, but what I'm saying is that a long track record that
includes security issues isn't necessarily bad, if they have gotten fixed).
Of course, you may be right on wuftpd. It obviously wasn't designed with
security in mind, other alternatives may be better.
Linus
* Re: Is sendfile all that sexy?
2001-01-14 22:40 ` Linus Torvalds
@ 2001-01-14 22:45 ` J Sloan
2001-01-15 20:15 ` H. Peter Anvin
2001-01-15 3:43 ` Michael Peddemors
1 sibling, 1 reply; 109+ messages in thread
From: J Sloan @ 2001-01-14 22:45 UTC (permalink / raw)
To: Kernel Mailing List
Linus Torvalds wrote:
> Of course, you may be right on wuftpd. It obviously wasn't designed with
> security in mind, other alternatives may be better.
I run proftpd on all my ftp servers - it's fast, configurable
and can do all the tricks I need - even red hat seems to
agree that proftpd is the way to go.
Visit any red hat ftp site and they are running proftpd -
So, why do they keep shipping us wu-ftpd instead?
That really frosts me.
jjs
* Re: Is sendfile all that sexy?
2001-01-14 20:22 ` Linus Torvalds
2001-01-14 20:38 ` Ingo Molnar
@ 2001-01-15 1:14 ` Dan Hollis
2001-01-15 15:24 ` Jonathan Thackray
2001-01-24 0:58 ` Sasi Peter
3 siblings, 0 replies; 109+ messages in thread
From: Dan Hollis @ 2001-01-15 1:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
On 14 Jan 2001, Linus Torvalds wrote:
> That's not the point of sendfile(). The point of sendfile() is to be
> faster than the _combination_ of:
> addr = mmap(file, ...len...);
> write(fd, addr, len);
> or
> read(file, userdata, len);
> write(fd, userdata, len);
And boy is it ever. It blows both away by more than double.
Not only that, the mmap one grinds my box into the ground with swapping,
while in the sendfile() case you can't even tell it's running except that
the drive is going like mad.
> Does anybody but apache actually use it?
I wonder why samba doesn't use it.
-Dan
* Re: Is sendfile all that sexy?
2001-01-14 22:40 ` Linus Torvalds
2001-01-14 22:45 ` J Sloan
@ 2001-01-15 3:43 ` Michael Peddemors
1 sibling, 0 replies; 109+ messages in thread
From: Michael Peddemors @ 2001-01-15 3:43 UTC (permalink / raw)
To: Gerhard Mack; +Cc: Ingo Molnar, Linux Kernel List
The two things I change every time are sendmail->qmail and wuftpd->proftpd.
But remember, security bugs are caught because more people use one vs. the
other. Bugs in Proftpd weren't caught until more people started changing
from wu-ftpd...
Often, all it means when one product has more bugs than another is that
more people tried to find bugs in one than in the other...
(Yes, a plug to get everyone to test 2.4 here)
On Sun, 14 Jan 2001, Linus Torvalds wrote:
> On Sun, 14 Jan 2001, Gerhard Mack wrote:
> > PS I wish someone would explain to me why distros insist on using WU
> > instead given its horrid security record.
>
> Of course, you may be right on wuftpd. It obviously wasn't designed with
> security in mind, other alternatives may be better.
>
> Linus
--
--------------------------------------------------------
Michael Peddemors - Senior Consultant
Unix Administration - WebSite Hosting
Network Services - Programming
Wizard Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com
--------------------------------------------------------
(604) 589-0037 Beautiful British Columbia, Canada
--------------------------------------------------------
* Re: Is sendfile all that sexy?
2001-01-14 21:54 ` Gerhard Mack
2001-01-14 22:40 ` Linus Torvalds
@ 2001-01-15 13:02 ` Florian Weimer
2001-01-15 13:45 ` Tristan Greaves
1 sibling, 1 reply; 109+ messages in thread
From: Florian Weimer @ 2001-01-15 13:02 UTC (permalink / raw)
To: Gerhard Mack; +Cc: Linux Kernel List
Gerhard Mack <gmack@innerfire.net> writes:
> PS I wish someone would explain to me why distros insist on using WU
> instead given its horrid security record.
The security record of Proftpd is not horrid, but embarrassing. They
once claimed to have fixed a vulnerability, but in fact introduced
another one...
* RE: Is sendfile all that sexy?
2001-01-15 13:02 ` Florian Weimer
@ 2001-01-15 13:45 ` Tristan Greaves
0 siblings, 0 replies; 109+ messages in thread
From: Tristan Greaves @ 2001-01-15 13:45 UTC (permalink / raw)
To: 'Linux Kernel List'
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Florian Weimer
> Sent: 15 January 2001 13:02
> To: Gerhard Mack
> Cc: Linux Kernel List
> Subject: Re: Is sendfile all that sexy?
>
> The security record of Proftpd is not horrid, but embarrassing. They
> once claimed to have fixed a vulnerability, but in fact introduced
> another one...
Oh, come on, this is a classic event in bug fixing. All Software Has
Bugs [TM]. Nothing Is Completely Secure [TM].
As long as the vulnerabilities are fixed as they happen (where possible),
we should be happy.
Tris.
* Re: Is sendfile all that sexy?
2001-01-14 20:22 ` Linus Torvalds
2001-01-14 20:38 ` Ingo Molnar
2001-01-15 1:14 ` Dan Hollis
@ 2001-01-15 15:24 ` Jonathan Thackray
2001-01-15 15:36 ` Matti Aarnio
` (2 more replies)
2001-01-24 0:58 ` Sasi Peter
3 siblings, 3 replies; 109+ messages in thread
From: Jonathan Thackray @ 2001-01-15 15:24 UTC (permalink / raw)
To: linux-kernel
> Does anybody but apache actually use it?
Zeus uses it! (it was HP who added it to HP-UX first at our request :-)
> PS. I still _like_ sendfile(), even if the above sounds negative. It's
> basically a "cool feature" that has zero negative impact on the design
> of the system. It uses the same "do_generic_file_read()" that is used
> for normal "read()", and is also used by the loop device and by
> in-kernel fileserving. But it's not really "important".
It's a very useful system call and makes file serving much more
scalable, and I'm glad that most Un*xes now have support for it
(Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
Linux is sendpath(), which does the open() before the sendfile()
all combined into one system call.
Ugh, I hear you all scream :-)
Jon.
--
Jonathan Thackray Zeus House, Cowley Road, Cambridge CB4 OZT, UK
Software Engineer +44 1223 525000, fax +44 1223 525100
Zeus Technology http://www.zeus.com/
* Re: Is sendfile all that sexy?
2001-01-15 15:24 ` Jonathan Thackray
@ 2001-01-15 15:36 ` Matti Aarnio
2001-01-15 20:17 ` H. Peter Anvin
2001-01-15 16:05 ` dean gaudet
2001-01-15 19:41 ` Ingo Molnar
2 siblings, 1 reply; 109+ messages in thread
From: Matti Aarnio @ 2001-01-15 15:36 UTC (permalink / raw)
To: Jonathan Thackray; +Cc: linux-kernel
On Mon, Jan 15, 2001 at 03:24:55PM +0000, Jonathan Thackray wrote:
> It's a very useful system call and makes file serving much more
> scalable, and I'm glad that most Un*xes now have support for it
> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
> Linux is sendpath(), which does the open() before the sendfile()
> all combined into one system call.
One thing about 'sendfile' (and likely 'sendpath') is that the
current (hammered into running binaries -> unchangeable)
syscalls support files only up to 2GB on 32-bit systems.
Glibc 2.2(9) at RedHat <sys/sendfile.h>:
#ifdef __USE_FILE_OFFSET64
# error "<sendfile.h> cannot be used with _FILE_OFFSET_BITS=64"
#endif
I do admit that doing sendfile() on some extremely large
file is unlikely, but still...
> Ugh, I hear you all scream :-)
> Jon.
> --
> Jonathan Thackray Zeus House, Cowley Road, Cambridge CB4 OZT, UK
> Zeus Technology http://www.zeus.com/
/Matti Aarnio
* Re: Is sendfile all that sexy?
2001-01-15 15:24 ` Jonathan Thackray
2001-01-15 15:36 ` Matti Aarnio
@ 2001-01-15 16:05 ` dean gaudet
2001-01-15 18:34 ` Jonathan Thackray
2001-01-15 19:41 ` Ingo Molnar
2 siblings, 1 reply; 109+ messages in thread
From: dean gaudet @ 2001-01-15 16:05 UTC (permalink / raw)
To: Jonathan Thackray; +Cc: linux-kernel
On Mon, 15 Jan 2001, Jonathan Thackray wrote:
> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
> Linux is sendpath(), which does the open() before the sendfile()
> all combined into one system call.
how would sendpath() construct the Content-Length in the HTTP header?
it's totally unfortunate that the other unixes chose to combine writev()
into sendfile() rather than implementing TCP_CORK. TCP_CORK is useful for
FAR more than just sendfile() headers and footers. it's arguably the most
correct way to write server code. nagle/no-nagle in the default BSD API
both suck -- nagle because it delays packets which need to be sent;
no-nagle because it can send incomplete packets.
i'm completely happy that linus, davem and ingo refused to combine
writev() into sendfile() and suggested CORK when i pointed out the
header/trailer problem.
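A minimal sketch of the TCP_CORK pattern described above (not dean's code; the function names and the "header then body" framing are invented for illustration):

```c
/* TCP_CORK pattern: cork the socket, write the protocol header,
 * sendfile() the body, then uncork. The kernel coalesces header and
 * body into full-size packets instead of emitting a short
 * header-only packet first. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

static int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Send hdr followed by file_len bytes of file_fd; returns body bytes
 * sent or -1. file_len would come from a prior fstat() in a real
 * server (e.g. for Content-Length). */
ssize_t send_response(int sock, const char *hdr, int file_fd, size_t file_len)
{
    if (set_cork(sock, 1) < 0)
        return -1;
    if (write(sock, hdr, strlen(hdr)) != (ssize_t)strlen(hdr))
        return -1;
    off_t off = 0;
    ssize_t sent = sendfile(sock, file_fd, &off, file_len);
    set_cork(sock, 0);  /* uncork: flush the final partial packet now */
    return sent;
}
```

This is what makes corking more general than a header/trailer argument to sendfile(): any sequence of writes, from any source, can be coalesced the same way.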
imnsho if you want to optimise static file serving then it's pretty
pointless to continue working in userland. nobody is going to catch up
with all the kernel-side implementations in linux, NT, and solaris.
-dean
p.s. linus, apache-1.3 does *not* use sendfile(). it's in apache-2.0,
which unfortunately is now performing like crap because they didn't listen
to some of my advice well over a year ago. a case of "let's make a pretty
API and hope performance works out"... where i told them "i've already
written code using the API you suggest, and it *doesn't* work." </rant>
thankfully linux now has TUX.
* Re: Is sendfile all that sexy?
2001-01-15 16:05 ` dean gaudet
@ 2001-01-15 18:34 ` Jonathan Thackray
2001-01-15 18:46 ` Linus Torvalds
2001-01-15 18:58 ` dean gaudet
0 siblings, 2 replies; 109+ messages in thread
From: Jonathan Thackray @ 2001-01-15 18:34 UTC (permalink / raw)
To: dean gaudet; +Cc: linux-kernel
> how would sendpath() construct the Content-Length in the HTTP header?
You'd still stat() the file to decide whether to use sendpath() to
send it or not, to handle Last-Modified: etc. Of course, you'd cache
stat() calls too for a few seconds. The main thing is that you save
a valuable fd, and open() is expensive, even more so than stat().
> TCP_CORK is useful for FAR more than just sendfile() headers and
> footers. it's arguably the most correct way to write server code.
Agreed -- the hard-coded Nagle algorithm makes no sense these days.
> imnsho if you want to optimise static file serving then it's pretty
> pointless to continue working in userland. nobody is going to catch up
> with all the kernel-side implementations in linux, NT, and solaris.
Hmmm, there's a place for userland httpds that are within a few
percent of kernel ones (like Zeus is, when I last looked). But I
agree, hybrid approaches will become more common, although the trend
towards server-side dynamic pages negates this. A kernel approach is a
definite win if you're used to using a limited-scalability userland
httpd like Apache.
Jon.
--
Jonathan Thackray Zeus House, Cowley Road, Cambridge CB4 OZT, UK
Software Engineer +44 1223 525000, fax +44 1223 525100
Zeus Technology http://www.zeus.com/
* Re: Is sendfile all that sexy?
2001-01-15 18:34 ` Jonathan Thackray
@ 2001-01-15 18:46 ` Linus Torvalds
2001-01-15 18:58 ` dean gaudet
1 sibling, 0 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-15 18:46 UTC (permalink / raw)
To: linux-kernel
In article <14947.17050.127502.936533@leda.cam.zeus.com>,
Jonathan Thackray <jthackray@zeus.com> wrote:
>
>> how would sendpath() construct the Content-Length in the HTTP header?
>
>You'd still stat() the file to decide whether to use sendpath() to
>send it or not, if it was Last-Modified: etc. Of course, you'd cache
>stat() calls too for a few seconds. The main thing is that you save
>a valuable fd and open() is expensive, even more so than stat().
"open" expensive?
Maybe on HP-UX and other platforms. But give me numbers: I seriously
doubt that
int fd = open(..);
fstat(fd..);
sendfile(fd..);
close(fd);
is any slower than
.. cache stat() in user space based on name ..
sendpath(name, ..);
on any real load.
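That four-call sequence, fleshed out as a hedged, runnable sketch (serve_file is a made-up name; on kernels before 2.6.33 out_fd must be a socket, as it would be in a web server):

```c
/* open + fstat + sendfile + close, as a server would use it.
 * Returns bytes sent, or -1 on error. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t serve_file(int out_fd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;                /* fstat() gives the Content-Length */
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    off_t off = 0;
    while (off < st.st_size) {     /* sendfile() may send less than asked */
        ssize_t n = sendfile(out_fd, fd, &off, (size_t)(st.st_size - off));
        if (n <= 0) {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return (ssize_t)off;
}
```

The sendpath() proposal would collapse the open/close pair into the send, at the cost of redoing the name lookup on every request.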
>> TCP_CORK is useful for FAR more than just sendfile() headers and
>> footers. it's arguably the most correct way to write server code.
>
>Agreed -- the hard-coded Nagle algorithm makes no sense these days.
The fact I dislike about the HP-UX implementation is that it is so
_obviously_ stupid.
And I have to say that I absolutely despise the BSD people. They did
sendfile() after both Linux and HP-UX had done it, and they must have
known about both implementations. And they chose the HP-UX braindamage,
and even brag about the fact that they were stupid and didn't understand
TCP_CORK (they don't say so in those exact words, of course - they just
show that they were stupid and clueless by the things they brag about).
Oh, well. Not everybody can be as good-looking as me. It's a curse.
Linus
* Re: Is sendfile all that sexy?
2001-01-15 18:34 ` Jonathan Thackray
2001-01-15 18:46 ` Linus Torvalds
@ 2001-01-15 18:58 ` dean gaudet
1 sibling, 0 replies; 109+ messages in thread
From: dean gaudet @ 2001-01-15 18:58 UTC (permalink / raw)
To: Jonathan Thackray; +Cc: linux-kernel
On Mon, 15 Jan 2001, Jonathan Thackray wrote:
> > TCP_CORK is useful for FAR more than just sendfile() headers and
> > footers. it's arguably the most correct way to write server code.
>
> Agreed -- the hard-coded Nagle algorithm makes no sense these days.
hey, actually a little more thinking this morning made me think nagle
*may* have a place. i don't like any of the solutions i've come up with
for this, though. the problem specifically is how you implement an
efficient HTTP/ng server which supports WebMUX and parallel processing of
multiple responses.
the problem in a nutshell is that multiple threads may be working on
responses which are multiplexed onto a single socket -- there's some extra
mux header info used to separate each of the response streams.
like what if the response stream is a few hundred HEADs (for cache
validation) some of which are static files and others which require some
dynamic code. the static responses will finish really fast, and you want
to fill up network packets with them. but you don't know when the dynamic
responses will finish so you can't be sure when to start sending the
packets.
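For reference, the corking dance being discussed looks roughly like this on Linux; a minimal sketch, assuming `sock` is an already-connected TCP socket (`send_corked` is a hypothetical helper, and short writes are not retried):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string.h>

/* Cork the socket, queue the header and body separately, then uncork:
   the kernel is free to coalesce the pieces into full-sized packets. */
int send_corked(int sock, const char *hdr, const char *body)
{
    int on = 1, off = 0;
    if (setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on) < 0)
        return -1;
    if (write(sock, hdr, strlen(hdr)) < 0 ||
        write(sock, body, strlen(body)) < 0)
        return -1;
    /* un-corking flushes whatever is still queued */
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof off);
}
```

TCP_CORK is Linux-specific; the mux problem above is exactly that the application, not the kernel, has to decide when to pull the cork.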
i don't know NFSv3 very much, but i imagine it's got similar problems --
any multiplexed request/response protocol allowing out-of-order responses
would have this problem. any gurus got suggestions?
-dean
* Re: Is sendfile all that sexy?
2001-01-15 15:24 ` Jonathan Thackray
2001-01-15 15:36 ` Matti Aarnio
2001-01-15 16:05 ` dean gaudet
@ 2001-01-15 19:41 ` Ingo Molnar
2001-01-15 20:33 ` Albert D. Cahalan
2 siblings, 1 reply; 109+ messages in thread
From: Ingo Molnar @ 2001-01-15 19:41 UTC (permalink / raw)
To: Jonathan Thackray; +Cc: Linux Kernel List
On Mon, 15 Jan 2001, Jonathan Thackray wrote:
> It's a very useful system call and makes file serving much more
> scalable, and I'm glad that most Un*xes now have support for it
> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
> Linux is sendpath(), which does the open() before the sendfile() all
> combined into one system call.
i believe the right model for a user-space webserver is to cache open file
descriptors, and directly hash URLs to open files. This way you can do
pure sendfile() without any open(). Not that open() is too expensive in
Linux:
m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall open
Simple open/close: 7.5756 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall stat
Simple stat: 5.4864 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall write
Simple write: 0.9614 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall read
Simple read: 1.1420 microseconds
m:~/lm/lmbench-2alpha9/bin/i686-linux> ./lat_syscall null
Simple syscall: 0.6349 microseconds
(note that lmbench opens a nontrivial path; it can be cheaper than this.)
nevertheless saving the lookup can be a win.
[ TUX uses dentries directly so there is no file opening cost - it's
pretty equivalent to sendpath(), with the difference that TUX can do
security evaluation on the (held) file prior sending it - while sendpath()
is pretty much a shot into the dark. ]
Ingo
* Re: Is sendfile all that sexy?
2001-01-14 22:45 ` J Sloan
@ 2001-01-15 20:15 ` H. Peter Anvin
0 siblings, 0 replies; 109+ messages in thread
From: H. Peter Anvin @ 2001-01-15 20:15 UTC (permalink / raw)
To: linux-kernel
Followup to: <3A622C25.766F3BCE@pobox.com>
By author: J Sloan <jjs@pobox.com>
In newsgroup: linux.dev.kernel
>
> Linus Torvalds wrote:
>
> > Of course, you may be right on wuftpd. It obviously wasn't designed with
> > security in mind, other alternatives may be better.
>
> I run proftpd on all my ftp servers - it's fast, configurable
> and can do all the tricks I need - even red hat seems to
> agree that proftpd is the way to go.
>
> Visit any red hat ftp site and they are running proftpd -
>
> So, why do they keep shipping us wu-ftpd instead?
>
> That really frosts me.
>
proftpd is not what you want for an FTP server whose main function is
*non-*anonymous access. It is very much written for the sole purpose
of being a great FTP server for a large anonymous FTP site. If you're
running a site large enough to matter, you can replace an RPM or two.
-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
* Re: Is sendfile all that sexy?
2001-01-15 15:36 ` Matti Aarnio
@ 2001-01-15 20:17 ` H. Peter Anvin
0 siblings, 0 replies; 109+ messages in thread
From: H. Peter Anvin @ 2001-01-15 20:17 UTC (permalink / raw)
To: linux-kernel
Followup to: <20010115173607.S25659@mea-ext.zmailer.org>
By author: Matti Aarnio <matti.aarnio@zmailer.org>
In newsgroup: linux.dev.kernel
>
> One thing about 'sendfile' (and likely 'sendpath') is that
> current (hammered into running binaries -> unchangeable)
> syscalls support only up to 2GB files at 32 bit systems.
>
> Glibc 2.2(9) at RedHat <sys/sendfile.h>:
>
> #ifdef __USE_FILE_OFFSET64
> # error "<sendfile.h> cannot be used with _FILE_OFFSET_BITS=64"
> #endif
>
> I do admit that doing sendfile() on some extremely large
> file is unlikely, but still...
>
2 GB isn't really that extremely large these days. This is an
unpleasant limitation.
-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
* Re: Is sendfile all that sexy?
2001-01-15 19:41 ` Ingo Molnar
@ 2001-01-15 20:33 ` Albert D. Cahalan
2001-01-15 21:00 ` Linus Torvalds
2001-01-16 10:40 ` Felix von Leitner
0 siblings, 2 replies; 109+ messages in thread
From: Albert D. Cahalan @ 2001-01-15 20:33 UTC (permalink / raw)
To: mingo; +Cc: Jonathan Thackray, Linux Kernel List
Ingo Molnar writes:
> On Mon, 15 Jan 2001, Jonathan Thackray wrote:
>> It's a very useful system call and makes file serving much more
>> scalable, and I'm glad that most Un*xes now have support for it
>> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
>> Linux is sendpath(), which does the open() before the sendfile() all
>> combined into one system call.
Ingo Molnar's data in a nice table:
open/close 7.5756 microseconds
stat 5.4864 microseconds
write 0.9614 microseconds
read 1.1420 microseconds
syscall 0.6349 microseconds
Rather than combining open() with sendfile(), it could be combined
with stat(). Since the syscall would be new anyway, it could skip
the normal requirement about returning the next free file descriptor
in favor of returning whatever can be most quickly found.
* Re: Is sendfile all that sexy?
2001-01-15 20:33 ` Albert D. Cahalan
@ 2001-01-15 21:00 ` Linus Torvalds
2001-01-16 10:40 ` Felix von Leitner
1 sibling, 0 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-15 21:00 UTC (permalink / raw)
To: linux-kernel
In article <200101152033.f0FKXpv250839@saturn.cs.uml.edu>,
Albert D. Cahalan <acahalan@cs.uml.edu> wrote:
>Ingo Molnar writes:
>> On Mon, 15 Jan 2001, Jonathan Thackray wrote:
>
>>> It's a very useful system call and makes file serving much more
>>> scalable, and I'm glad that most Un*xes now have support for it
>>> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
>>> Linux is sendpath(), which does the open() before the sendfile() all
>>> combined into one system call.
>
>Ingo Molnar's data in a nice table:
>
>open/close 7.5756 microseconds
>stat 5.4864 microseconds
>write 0.9614 microseconds
>read 1.1420 microseconds
>syscall 0.6349 microseconds
>
>Rather than combining open() with sendfile(), it could be combined
>with stat().
Note that "fstat()" is fairly low-overhead (unlike "stat()" it obviously
doesn't have to parse the name again), so "open+fstat" is quite fine
as-is.
Linus
* Re: Is sendfile all that sexy?
2001-01-14 18:29 jamal
2001-01-14 18:50 ` Ingo Molnar
2001-01-14 20:22 ` Linus Torvalds
@ 2001-01-15 23:16 ` Pavel Machek
2001-01-16 13:47 ` jamal
2 siblings, 1 reply; 109+ messages in thread
From: Pavel Machek @ 2001-01-15 23:16 UTC (permalink / raw)
To: jamal, linux-kernel, netdev
Hi!
> TWO observations:
> - Given Linux's non-pre-emptability of the kernel i get the feeling that
> sendfile could starve other user space programs. Imagine trying to send a
> 1Gig file on 10Mbps pipe in one shot.
Hehe, try sigkilling a process doing that transfer. Last time I tried it
it did not work.
Pavel
--
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org
* Re: Is sendfile all that sexy?
2001-01-15 20:33 ` Albert D. Cahalan
2001-01-15 21:00 ` Linus Torvalds
@ 2001-01-16 10:40 ` Felix von Leitner
2001-01-16 11:56 ` Peter Samuelson
` (2 more replies)
1 sibling, 3 replies; 109+ messages in thread
From: Felix von Leitner @ 2001-01-16 10:40 UTC (permalink / raw)
To: Linux Kernel List
Thus spake Albert D. Cahalan (acahalan@cs.uml.edu):
> Rather than combining open() with sendfile(), it could be combined
> with stat(). Since the syscall would be new anyway, it could skip
> the normal requirement about returning the next free file descriptor
> in favor of returning whatever can be most quickly found.
I don't know how Linux does it, but returning the first free file
descriptor can be implemented as O(1) operation.
Felix
* Re: Is sendfile all that sexy?
2001-01-16 10:40 ` Felix von Leitner
@ 2001-01-16 11:56 ` Peter Samuelson
2001-01-16 12:37 ` Ingo Molnar
2001-01-16 12:42 ` Ingo Molnar
2 siblings, 0 replies; 109+ messages in thread
From: Peter Samuelson @ 2001-01-16 11:56 UTC (permalink / raw)
To: Linux Kernel List
[Felix von Leitner]
> I don't know how Linux does it, but returning the first free file
> descriptor can be implemented as O(1) operation.
How exactly? Maybe I'm being dense today. Having used up the lowest
available fd, how do you find the next-lowest one, the next open()? I
can't think of anything that isn't O(n). (Sure you can amortize it
different ways by keeping lists of fds, etc.)
Peter
* Re: Is sendfile all that sexy?
2001-01-16 10:40 ` Felix von Leitner
2001-01-16 11:56 ` Peter Samuelson
@ 2001-01-16 12:37 ` Ingo Molnar
2001-01-16 12:42 ` Ingo Molnar
2 siblings, 0 replies; 109+ messages in thread
From: Ingo Molnar @ 2001-01-16 12:37 UTC (permalink / raw)
To: Felix von Leitner; +Cc: Linux Kernel List
On Tue, 16 Jan 2001, Felix von Leitner wrote:
> I don't know how Linux does it, but returning the first free file
> descriptor can be implemented as O(1) operation.
only if special allocation patterns are assumed. Otherwise it cannot be a
generic O(1) solution. The first-free rule adds an implicit ordering to
the file descriptor space, and this order cannot be maintained in an O(1)
way. Linux can allocate up to a million file descriptors.
Ingo
* Re: Is sendfile all that sexy?
2001-01-16 10:40 ` Felix von Leitner
2001-01-16 11:56 ` Peter Samuelson
2001-01-16 12:37 ` Ingo Molnar
@ 2001-01-16 12:42 ` Ingo Molnar
2001-01-16 12:47 ` Felix von Leitner
2 siblings, 1 reply; 109+ messages in thread
From: Ingo Molnar @ 2001-01-16 12:42 UTC (permalink / raw)
To: Felix von Leitner; +Cc: Linux Kernel List
On Tue, 16 Jan 2001, Felix von Leitner wrote:
> I don't know how Linux does it, but returning the first free file
> descriptor can be implemented as O(1) operation.
to put it more accurately: the requirement is to be able to open(), use
and close() an unlimited number of file descriptors with O(1) overhead,
under any allocation pattern, with only RAM limiting the number of files.
Both of my proposals attempt to provide this. It's possible to open() O(1)
but do a O(log(N)) close(), but that is of no practical value IMO.
Ingo
* Re: Is sendfile all that sexy?
2001-01-16 12:42 ` Ingo Molnar
@ 2001-01-16 12:47 ` Felix von Leitner
2001-01-16 13:48 ` Jamie Lokier
0 siblings, 1 reply; 109+ messages in thread
From: Felix von Leitner @ 2001-01-16 12:47 UTC (permalink / raw)
To: Linux Kernel List
Thus spake Ingo Molnar (mingo@elte.hu):
> > I don't know how Linux does it, but returning the first free file
> > descriptor can be implemented as O(1) operation.
> to put it more accurately: the requirement is to be able to open(), use
> and close() an unlimited number of file descriptors with O(1) overhead,
> under any allocation pattern, with only RAM limiting the number of files.
> Both of my proposals attempt to provide this. It's possible to open() O(1)
> but do a O(log(N)) close(), but that is of no practical value IMO.
I cheated. I was only talking about open().
close() is of course more expensive then.
Other than that: where does the requirement come from?
Can't we just use a free list where we prepend closed fds and always use
the first one on open()? That would even increase spatial locality and
be good for the CPU caches.
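Felix's free list is easy to sketch, and the sketch also shows Ingo's objection: alloc and free are O(1), but the lowest-free-descriptor rule is silently dropped, because a freed slot just goes on the front of the list. Toy code with made-up names and a 16-entry table:

```c
#define MAX_FD 16

static int next_free[MAX_FD];   /* -2 = in use, else index of next free */
static int free_head = -1;
static int initialized;

static void fd_table_init(void)
{
    for (int i = 0; i < MAX_FD; i++)
        next_free[i] = (i + 1 < MAX_FD) ? i + 1 : -1;
    free_head = 0;
    initialized = 1;
}

/* O(1): pop the head of the free list. */
int fd_alloc(void)
{
    if (!initialized)
        fd_table_init();
    if (free_head < 0)
        return -1;              /* table full */
    int fd = free_head;
    free_head = next_free[fd];
    next_free[fd] = -2;
    return fd;
}

/* O(1): prepend, which is exactly what loses the ordering. */
void fd_free(int fd)
{
    next_free[fd] = free_head;
    free_head = fd;
}
```

After allocating 0, 1, 2 and freeing 1 then 2, the next allocation returns 2, although POSIX open() semantics would require 1.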
Felix
* Re: Is sendfile all that sexy?
2001-01-15 23:16 ` Pavel Machek
@ 2001-01-16 13:47 ` jamal
2001-01-16 14:41 ` Pavel Machek
0 siblings, 1 reply; 109+ messages in thread
From: jamal @ 2001-01-16 13:47 UTC (permalink / raw)
To: Pavel Machek; +Cc: linux-kernel, netdev
On Tue, 16 Jan 2001, Pavel Machek wrote:
> > TWO observations:
> > - Given Linux's non-pre-emptability of the kernel i get the feeling that
> > sendfile could starve other user space programs. Imagine trying to send a
> > 1Gig file on 10Mbps pipe in one shot.
>
> Hehe, try sigkilling process doing that transfer. Last time I tried it
> it did not work.
From Alexey's response: it does get descheduled possibly every sndbuf
send. So you should be able to sneak that sigkill in.
cheers,
jamal
* Re: Is sendfile all that sexy?
2001-01-16 12:47 ` Felix von Leitner
@ 2001-01-16 13:48 ` Jamie Lokier
2001-01-16 14:20 ` Felix von Leitner
0 siblings, 1 reply; 109+ messages in thread
From: Jamie Lokier @ 2001-01-16 13:48 UTC (permalink / raw)
To: Linux Kernel List
Felix von Leitner wrote:
> I cheated. I was only talking about open().
> close() is of course more expensive then.
>
> Other than that: where does the requirement come from?
> Can't we just use a free list where we prepend closed fds and always use
> the first one on open()? That would even increase spatial locality and
> be good for the CPU caches.
You would need to use a new open() flag: O_ANYFD.
The requirement comes from code like this:
close (0);
close (1);
close (2);
open ("/dev/console", O_RDWR);
dup (0);
dup (0);
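The snippet above leans on the guarantee that open() returns the lowest-numbered free descriptor, so after closing 0 the next open() lands on stdin's slot. A minimal check of that behaviour (`reopen_stdin` is a hypothetical helper, and descriptor 0 is assumed to be closeable):

```c
#include <fcntl.h>
#include <unistd.h>

/* Close stdin and reopen it from `path`. POSIX requires the open()
   to return 0, the lowest free descriptor. */
int reopen_stdin(const char *path)
{
    close(0);
    return open(path, O_RDONLY);
}
```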
-- Jamie
* Re: Is sendfile all that sexy?
@ 2001-01-16 13:50 Andries.Brouwer
2001-01-17 6:56 ` Ton Hospel
0 siblings, 1 reply; 109+ messages in thread
From: Andries.Brouwer @ 2001-01-16 13:50 UTC (permalink / raw)
To: mingo; +Cc: linux-kernel
From: Ingo Molnar <mingo@elte.hu>
On Tue, 16 Jan 2001, Felix von Leitner wrote:
> I don't know how Linux does it, but returning the first free file
> descriptor can be implemented as O(1) operation.
to put it more accurately: the requirement is to be able to open(), use
and close() an unlimited number of file descriptors with O(1) overhead,
under any allocation pattern, with only RAM limiting the number of files.
Both of my proposals attempt to provide this. It's possible to open() O(1)
but do a O(log(N)) close(), but that is of no practical value IMO.
Ingo
> Both of my proposals
I am afraid I have missed most earlier messages in this thread.
However, let me remark that the problem of assigning a
file descriptor is the one that is usually described by
"priority queue". The version of Peter van Emde Boas takes
time O(loglog N) for both open() and close().
Of course this is not meant to suggest that we use it.
Andries
* Re: Is sendfile all that sexy?
2001-01-16 13:48 ` Jamie Lokier
@ 2001-01-16 14:20 ` Felix von Leitner
2001-01-16 15:05 ` David L. Parsley
0 siblings, 1 reply; 109+ messages in thread
From: Felix von Leitner @ 2001-01-16 14:20 UTC (permalink / raw)
To: Linux Kernel List
Thus spake Jamie Lokier (lk@tantalophile.demon.co.uk):
> You would need to use a new open() flag: O_ANYFD.
> The requirement comes from this like this:
> close (0);
> close (1);
> close (2);
> open ("/dev/console", O_RDWR);
> dup ();
> dup ();
So it's not actually part of POSIX, it's just to get around fixing
legacy code? ;-)
Felix
* Re: Is sendfile all that sexy?
2001-01-16 13:47 ` jamal
@ 2001-01-16 14:41 ` Pavel Machek
0 siblings, 0 replies; 109+ messages in thread
From: Pavel Machek @ 2001-01-16 14:41 UTC (permalink / raw)
To: jamal; +Cc: linux-kernel, netdev
Hi!
> > > TWO observations:
> > > - Given Linux's non-pre-emptability of the kernel i get the feeling that
> > > sendfile could starve other user space programs. Imagine trying to send a
> > > 1Gig file on 10Mbps pipe in one shot.
> >
> > Hehe, try sigkilling process doing that transfer. Last time I tried it
> > it did not work.
>
> From Alexey's response: it does get descheduled possibly every sndbuf
> send. So you should be able to sneak that sigkill.
Did you actually try it? Last time I did the test, SIGKILL did not
make it in. sendfile did not actually check for signals...
(And you could do something like send 100MB from cache into /dev/null.
I do not see where SIGKILL could sneak in in this case.)
Pavel
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+
* Re: Is sendfile all that sexy?
2001-01-16 14:20 ` Felix von Leitner
@ 2001-01-16 15:05 ` David L. Parsley
2001-01-16 15:05 ` Jakub Jelinek
2001-01-17 19:27 ` dean gaudet
0 siblings, 2 replies; 109+ messages in thread
From: David L. Parsley @ 2001-01-16 15:05 UTC (permalink / raw)
To: Felix von Leitner, linux-kernel, mingo
Felix von Leitner wrote:
> > close (0);
> > close (1);
> > close (2);
> > open ("/dev/console", O_RDWR);
> > dup ();
> > dup ();
>
> So it's not actually part of POSIX, it's just to get around fixing
> legacy code? ;-)
This makes me wonder...
If the kernel only kept a queue of the three smallest unused fd's, and
when the queue emptied handed out whatever it liked, how many things
would break? I suspect this would cover a lot of bases...
<dons flameproof underwear>
regards,
David
--
David L. Parsley
Network Administrator
Roanoke College
* Re: Is sendfile all that sexy?
2001-01-16 15:05 ` David L. Parsley
@ 2001-01-16 15:05 ` Jakub Jelinek
2001-01-16 15:46 ` David L. Parsley
2001-01-17 19:27 ` dean gaudet
1 sibling, 1 reply; 109+ messages in thread
From: Jakub Jelinek @ 2001-01-16 15:05 UTC (permalink / raw)
To: David L. Parsley; +Cc: Felix von Leitner, linux-kernel, mingo
On Tue, Jan 16, 2001 at 10:05:06AM -0500, David L. Parsley wrote:
> Felix von Leitner wrote:
> > > close (0);
> > > close (1);
> > > close (2);
> > > open ("/dev/console", O_RDWR);
> > > dup ();
> > > dup ();
> >
> > So it's not actually part of POSIX, it's just to get around fixing
> > legacy code? ;-)
>
> This makes me wonder...
>
> If the kernel only kept a queue of the three smallest unused fd's, and
> when the queue emptied handed out whatever it liked, how many things
> would break? I suspect this would cover a lot of bases...
First it would break Unix98 and other standards:
The Single UNIX (R) Specification, Version 2
Copyright (c) 1997 The Open Group
...
int open(const char *path, int oflag, ... );
...
The open() function will return a file descriptor for the named file that is
the lowest file descriptor not currently open for that process. The open file
description is new, and therefore the file descriptor does not share it with
any other process in the system. The FD_CLOEXEC file descriptor flag
associated with the new file descriptor will be cleared.
Jakub
* Re: Is sendfile all that sexy?
2001-01-16 15:05 ` Jakub Jelinek
@ 2001-01-16 15:46 ` David L. Parsley
2001-01-18 14:00 ` Laramie Leavitt
0 siblings, 1 reply; 109+ messages in thread
From: David L. Parsley @ 2001-01-16 15:46 UTC (permalink / raw)
To: Jakub Jelinek, linux-kernel, leitner, mingo
Jakub Jelinek wrote:
> > This makes me wonder...
> >
> > If the kernel only kept a queue of the three smallest unused fd's, and
> > when the queue emptied handed out whatever it liked, how many things
> > would break? I suspect this would cover a lot of bases...
>
> First it would break Unix98 and other standards:
[snip]
Yeah, I realized it would violate at least POSIX. The discussion was
just bandying about ways to avoid an expensive open() without breaking
lots of utilities and glibc stuff. This might be something that could
be configured for specific server environments, where performance is
more important than POSIX/Unix98, but you still don't want to completely
break the system. Just a thought, brain-damaged as it might be. ;-)
regards,
David
--
David L. Parsley
Network Administrator
Roanoke College
* Re: Is sendfile all that sexy?
2001-01-16 13:50 Andries.Brouwer
@ 2001-01-17 6:56 ` Ton Hospel
2001-01-17 7:31 ` Steve VanDevender
0 siblings, 1 reply; 109+ messages in thread
From: Ton Hospel @ 2001-01-17 6:56 UTC (permalink / raw)
To: linux-kernel
In article <UTC200101161350.OAA141869.aeb@ark.cwi.nl>,
Andries.Brouwer@cwi.nl writes:
>
> I am afraid I have missed most earlier messages in this thread.
> However, let me remark that the problem of assigning a
> file descriptor is the one that is usually described by
> "priority queue". The version of Peter van Emde Boas takes
> time O(loglog N) for both open() and close().
> Of course this is not meant to suggest that we use it.
>
Fascinating! But how is this possible? What stops me from
using this algorithm by entering N values and extracting
them again in order, and so ending up with an O(N*log log N)
sorting algorithm? (which would be better than log N! ~ N*log N)
(at least the web pages I found about this seem to suggest you
can use this on any set with a full order relation)
* Re: Is sendfile all that sexy?
2001-01-17 6:56 ` Ton Hospel
@ 2001-01-17 7:31 ` Steve VanDevender
2001-01-17 8:09 ` Ton Hospel
0 siblings, 1 reply; 109+ messages in thread
From: Steve VanDevender @ 2001-01-17 7:31 UTC (permalink / raw)
To: linux-kernel
Ton Hospel writes:
> In article <UTC200101161350.OAA141869.aeb@ark.cwi.nl>,
> Andries.Brouwer@cwi.nl writes:
> > I am afraid I have missed most earlier messages in this thread.
> > However, let me remark that the problem of assigning a
> > file descriptor is the one that is usually described by
> > "priority queue". The version of Peter van Emde Boas takes
> > time O(loglog N) for both open() and close().
> > Of course this is not meant to suggest that we use it.
> >
> Fascinating ! But how is this possible ? What stops me from
> using this algorithm from entering N values and extracting
> them again in order and so end up with a O(N*log log N)
> sorting algorithm ? (which would be better than log N! ~ N*logN)
>
> (at least the web pages I found about this seem to suggest you
> can use this on any set with a full order relation)
How do you know how to extract the items in order, unless you've already
sorted them independently from placing them in this data structure?
Besides, there are plenty of sorting algorithms that work only on
specific kinds of data sets that are better than the O(n log n) bound
for generalized sorting. For example, there's the O(n) "mailbox sort".
You have an unordered array u of m integers, each in the range 1..n;
allocate an array s of n integers initialized to all zeros, and for i in
1..m increment s[u[i]]. Then for j in 1..n print j s[j] times. If n is
of reasonable size then you can sort that list of integers in O(m) time.
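In code, the sort Steve describes is a counting sort; a sketch with illustrative names, using O(n) extra space for the tallies and O(m + n) time overall:

```c
#include <string.h>

/* Sort `m` integers in u[], each in the range 1..n, into out[]
   (also length m), by tallying occurrences and replaying them. */
void mailbox_sort(const int *u, int m, int n, int *out)
{
    int count[n + 1];                /* C99 VLA; fine for reasonable n */
    memset(count, 0, sizeof count);
    for (int i = 0; i < m; i++)
        count[u[i]]++;               /* one pass to tally */
    int k = 0;
    for (int j = 1; j <= n; j++)     /* replay values in order */
        while (count[j]--)
            out[k++] = j;
}
```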
* Re: Is sendfile all that sexy?
2001-01-17 7:31 ` Steve VanDevender
@ 2001-01-17 8:09 ` Ton Hospel
0 siblings, 0 replies; 109+ messages in thread
From: Ton Hospel @ 2001-01-17 8:09 UTC (permalink / raw)
To: linux-kernel
In article <14949.19028.404458.318735@tzadkiel.efn.org>,
Steve VanDevender <stevev@efn.org> writes:
> Ton Hospel writes:
> > In article <UTC200101161350.OAA141869.aeb@ark.cwi.nl>,
> > Andries.Brouwer@cwi.nl writes:
> > > I am afraid I have missed most earlier messages in this thread.
> > > However, let me remark that the problem of assigning a
> > > file descriptor is the one that is usually described by
> > > "priority queue". The version of Peter van Emde Boas takes
> > > time O(loglog N) for both open() and close().
> > > Of course this is not meant to suggest that we use it.
> > >
> > Fascinating ! But how is this possible ? What stops me from
> > using this algorithm from entering N values and extracting
> > them again in order and so end up with a O(N*log log N)
> > sorting algorithm ? (which would be better than log N! ~ N*logN)
> >
> > (at least the web pages I found about this seem to suggest you
> > can use this on any set with a full order relation)
>
> How do you know how to extract the items in order, unless you've already
> sorted them independently from placing them in this data structure?
Because "extract max" is a basic operation of a priority queue,
which I just do N times.
>
> Besides, there are plenty of sorting algorithms that work only on
> specific kinds of data sets that are better than the O(n log n) bound
> for generalized sorting. For example, there's the O(n) "mailbox sort".
> You have an unordered array u of m integers, each in the range 1..n;
> allocate an array s of n integers initialized to all zeros, and for i in
> 1..m increment s[u[i]]. Then for j in 1..n print j s[j] times. If n is
> of reasonable size then you can sort that list of integers in O(m) time.
Yes, I know. That's why you see the "any set with a full order relation"
in there. That basically disallows using extra structure of the elements.
Notice that the radix sort you describe basically hides the log N in the
representation of a number of max n (which has a length that is
basically log n). It just doesn't account for that because we do the
operation on processors where these bits are basically handled in parallel,
and so do not end up in the O-notation. Any attempt to make radix sort
handle arbitrary width integers on a fixed width processor will make the
log N reappear.
Having said that, in the particular case of fd allocation, we DO have
additional structure (in fact, it's indeed integers in 0..n). So I can
very well imagine the existence of a priority queue for this where the
basic operators are better than O(log N). I just don't understand how
it can exist for a generic priority queue algorithm (which the
Peter van Emde Boas method seems to be. Unfortunately I have found no
full description of the algorithm that's used to do the insert/extract
in the queue nodes yet).
* Re: Is sendfile all that sexy?
@ 2001-01-17 15:02 Ben Mansell
2000-01-01 2:10 ` Pavel Machek
2001-01-17 19:32 ` Linus Torvalds
0 siblings, 2 replies; 109+ messages in thread
From: Ben Mansell @ 2001-01-17 15:02 UTC (permalink / raw)
To: torvalds; +Cc: linux-kernel
On 14 Jan 2001, Linus Torvalds wrote:
> And no, I don't actually think that sendfile() is all that hot. It was
> _very_ easy to implement, and can be considered a 5-minute hack to give
> a feature that fit very well in the MM architecture, and that the Apache
> folks had already been using on other architectures.
The current sendfile() has the limitation that it can't read data from
a socket. Would it be another 5-minute hack to remove this limitation, so
you could sendfile between sockets? Now _that_ would be sexy :)
Ben
* Re: Is sendfile all that sexy?
2001-01-16 15:05 ` David L. Parsley
2001-01-16 15:05 ` Jakub Jelinek
@ 2001-01-17 19:27 ` dean gaudet
1 sibling, 0 replies; 109+ messages in thread
From: dean gaudet @ 2001-01-17 19:27 UTC (permalink / raw)
To: David L. Parsley; +Cc: Felix von Leitner, linux-kernel, mingo
On Tue, 16 Jan 2001, David L. Parsley wrote:
> Felix von Leitner wrote:
> > > close (0);
> > > close (1);
> > > close (2);
> > > open ("/dev/console", O_RDWR);
> > > dup ();
> > > dup ();
> >
> > So it's not actually part of POSIX, it's just to get around fixing
> > legacy code? ;-)
it's part of POSIX.
> This makes me wonder...
>
> If the kernel only kept a queue of the three smallest unused fd's, and
> when the queue emptied handed out whatever it liked, how many things
> would break? I suspect this would cover a lot of bases...
apache-1.3 relies on the open-lowest-numbered-free-fd behaviour... but
only as a band-aid to work around other broken behaviours surrounding
FD_SETSIZE.
when opening the log files and listening sockets, apache uses
fcntl(F_DUPFD) to push them all higher than fd 15. (see ap_slack) some
sites are configured in a way that there are thousands of log files or
listening fds (both are bogus configs in my opinion, but hey, let the
admins shoot themselves).
this generally leaves a handful of low numbered fds available. this
pretty much protects apache from broken libraries compiled with small
FD_SETSIZE, or which otherwise can't handle big fds. libc used to be just
such a library because it used select() in the DNS resolver code. (a libc
guru can tell you when this was fixed.)
it also ensures that the client fd will be low numbered, and lets us be
lazy and just use select() rather than do all the config tests to figure
out which OSs support poll().
it's all pretty gross... but then select() is pretty gross and it's
essentially the bug that necessitated this.
(solaris also has a stupid FILE * limitation that it can't use fds > 255
in a FILE * ... which breaks even more libraries than fds >= FD_SETSIZE.)
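The ap_slack trick described above boils down to something like the following sketch of the idea (my own rendering, not Apache's actual code; the threshold 16 mirrors the "higher than fd 15" rule):

```c
#include <fcntl.h>
#include <unistd.h>

/* Push a long-lived fd (log file, listening socket) above the
 * low-numbered range, so that short-lived client fds stay small
 * enough for select()/FD_SETSIZE-limited libraries. A sketch of
 * the idea behind ap_slack, not Apache's actual code. */
int push_fd_high(int fd)
{
    int high = fcntl(fd, F_DUPFD, 16);  /* lowest free dup >= 16 */
    if (high < 0)
        return fd;                      /* out of fds: keep the original */
    close(fd);
    return high;
}
```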
-dean
* Re: Is sendfile all that sexy?
2001-01-17 15:02 Is sendfile all that sexy? Ben Mansell
2000-01-01 2:10 ` Pavel Machek
@ 2001-01-17 19:32 ` Linus Torvalds
2001-01-18 2:34 ` Olivier Galibert
` (2 more replies)
1 sibling, 3 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-17 19:32 UTC (permalink / raw)
To: linux-kernel
In article <Pine.LNX.4.30.0101171454340.29536-100000@baphomet.bogo.bogus>,
Ben Mansell <linux-kernel@slimyhorror.com> wrote:
>On 14 Jan 2001, Linus Torvalds wrote:
>
>> And no, I don't actually think that sendfile() is all that hot. It was
>> _very_ easy to implement, and can be considered a 5-minute hack to give
>> a feature that fit very well in the MM architecture, and that the Apache
>> folks had already been using on other architectures.
>
>The current sendfile() has the limitation that it can't read data from
>a socket. Would it be another 5-minute hack to remove this limitation, so
>you could sendfile between sockets? Now _that_ would be sexy :)
I don't think that would be all that sexy at all.
You have to realize that sendfile() is meant as an optimization: by
re-using the same buffers that act as the in-kernel page cache as the
buffers for sending data, you avoid one copy.
However, for socket->socket, we would not have such an advantage. A
socket->socket sendfile() would not avoid any copies the way the
networking is done today. That _may_ change, of course. But it might
not. And I'd rather tell people using sendfile() that you get EINVAL if
it isn't able to optimize the transfer.
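The EINVAL convention described here leads callers to the usual pattern below; this is a sketch assuming the Linux sendfile(2) signature, and the fallback loop is illustrative, not library code:

```c
#include <sys/sendfile.h>
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Send `count` bytes from in_fd to out_fd, preferring the zero-copy
 * sendfile() path and falling back to a plain read/write loop when
 * the kernel returns EINVAL (i.e. it cannot optimize this pair of
 * fds). A sketch of the caller-side convention. */
ssize_t send_file_or_copy(int out_fd, int in_fd, off_t *off, size_t count)
{
    ssize_t n = sendfile(out_fd, in_fd, off, count);
    if (n >= 0 || errno != EINVAL)
        return n;

    /* fallback: the copy the kernel would not do for us */
    char buf[8192];
    size_t done = 0;
    while (done < count) {
        size_t want = count - done < sizeof buf ? count - done : sizeof buf;
        ssize_t r = read(in_fd, buf, want);
        if (r <= 0)
            break;
        if (write(out_fd, buf, (size_t)r) != r)
            return -1;
        done += (size_t)r;
    }
    return (ssize_t)done;
}
```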
Linus
* Re: Is sendfile all that sexy?
2001-01-17 19:32 ` Linus Torvalds
@ 2001-01-18 2:34 ` Olivier Galibert
2001-01-21 21:22 ` LA Walsh
2001-01-18 8:23 ` Rogier Wolff
2001-01-22 18:13 ` Val Henson
2 siblings, 1 reply; 109+ messages in thread
From: Olivier Galibert @ 2001-01-18 2:34 UTC (permalink / raw)
To: linux-kernel
On Wed, Jan 17, 2001 at 11:32:35AM -0800, Linus Torvalds wrote:
> However, for socket->socket, we would not have such an advantage. A
> socket->socket sendfile() would not avoid any copies the way the
> networking is done today. That _may_ change, of course. But it might
> not. And I'd rather tell people using sendfile() that you get EINVAL if
> it isn't able to optimize the transfer..
On the other hand you could consider sendfile to be a concept rather
than an optimization. That is, "move n bytes from this fd to that
one". That would be very nice for things like tar (file <-> file or
tty), cp (file <-> file), and application-level routing (socket <->
socket). Hey, even cat(1) would be simplified.
Whether the kernel can optimize it in zero-copy mode is another
problem, and one that will change with time anyway. But "I want to move x
amount of data from here to there, and I don't need to see the actual
contents" is something that happens quite often, and being able to do
it with one syscall that does not muck with page tables (i.e. no mmap
nor malloc) would be both more readable and scale better on SMP.
OG.
* Re: Is sendfile all that sexy?
2001-01-17 19:32 ` Linus Torvalds
2001-01-18 2:34 ` Olivier Galibert
@ 2001-01-18 8:23 ` Rogier Wolff
2001-01-18 10:01 ` Andreas Dilger
2001-01-18 12:17 ` Peter Samuelson
2001-01-22 18:13 ` Val Henson
2 siblings, 2 replies; 109+ messages in thread
From: Rogier Wolff @ 2001-01-18 8:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
Linus Torvalds wrote:
> In article <Pine.LNX.4.30.0101171454340.29536-100000@baphomet.bogo.bogus>,
> Ben Mansell <linux-kernel@slimyhorror.com> wrote:
> >On 14 Jan 2001, Linus Torvalds wrote:
> >
> >> And no, I don't actually think that sendfile() is all that hot. It was
> >> _very_ easy to implement, and can be considered a 5-minute hack to give
> >> a feature that fit very well in the MM architecture, and that the Apache
> >> folks had already been using on other architectures.
> >
> >The current sendfile() has the limitation that it can't read data from
> >a socket. Would it be another 5-minute hack to remove this limitation, so
> >you could sendfile between sockets? Now _that_ would be sexy :)
>
> I don't think that would be all that sexy at all.
>
> You have to realize, that sendfile() is meant as an optimization, by
> being able to re-use the same buffers that act as the in-kernel page
> cache as buffers for sending data. So you avoid one copy.
>
> However, for socket->socket, we would not have such an advantage. A
> socket->socket sendfile() would not avoid any copies the way the
> networking is done today. That _may_ change, of course. But it might
> not. And I'd rather tell people using sendfile() that you get EINVAL if
> it isn't able to optimize the transfer..
Linus,
I admire your good taste in designing interfaces, but here is one where
we disagree.
I'd prefer an interface that says "copy this fd to that one, and
optimize that if you can".
All cases that can't be optimized would end up doing an in-kernel read/
write loop. Sure, there is no advantage over doing that same loop
in userspace, but this way the kernel can "grow" and optimize more
stuff later on.
For example, copying a file from one disk to another. I'm pretty sure
that some efficiency can be gained if you don't need to handle the
possibility of the userspace program accessing the data in between the
read and the write. Sure this may not qualify as a "trivial
optimization, that can be done with the existing infrastructure" right
now, but programs that want to indicate "kernel, please optimize this
if you can" can say so.
Currently, once the optimization happens to become possible (*), we'll
have to upgrade all apps that happen to be able to use it. If we now
start advertising the interface (at the cost of a read/write loop in the
kernel: five lines of code), we will be able to upgrade the kernel and
automatically improve the performance of every app that happens to use
the interface.
Roger.
(*) Either because the infrastructure makes it "trivial", or because
someone convinces you that it is a valid optimization that makes a
huge difference in an important case.
--
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.
* Re: Is sendfile all that sexy?
2001-01-18 8:23 ` Rogier Wolff
@ 2001-01-18 10:01 ` Andreas Dilger
2001-01-18 11:04 ` Russell Leighton
2001-01-18 16:24 ` Linus Torvalds
2001-01-18 12:17 ` Peter Samuelson
1 sibling, 2 replies; 109+ messages in thread
From: Andreas Dilger @ 2001-01-18 10:01 UTC (permalink / raw)
To: Rogier Wolff; +Cc: Linus Torvalds, linux-kernel
Roger Wolff writes:
> I'd prefer an interface that says "copy this fd to that one, and
> optimize that if you can".
>
> For example, copying a file from one disk to another. I'm pretty sure
> that some efficiency can be gained if you don't need to handle the
> possibility of the userspace program accessing the data in between the
> read and the write. Sure this may not qualify as a "trivial
> optimization, that can be done with the existing infrastructure" right
> now, but programs that want to indicate "kernel, please optimize this
> if you can" can say so.
Actually, this is a great example, because at one point I was working
on a device interface which would offload all of the disk-disk copying
overhead to the disks themselves, and not involve the CPU/RAM at all.
I seem to recall that I2O promised something along these lines as well
(i.e. direct device-device communication).
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
* Re: Is sendfile all that sexy?
2001-01-18 10:01 ` Andreas Dilger
@ 2001-01-18 11:04 ` Russell Leighton
2001-01-18 16:36 ` Larry McVoy
2001-01-19 1:53 ` Linus Torvalds
2001-01-18 16:24 ` Linus Torvalds
1 sibling, 2 replies; 109+ messages in thread
From: Russell Leighton @ 2001-01-18 11:04 UTC (permalink / raw)
To: linux-kernel
"copy this fd to that one, and optimize that if you can"
... isn't this Larry M's "splice" (http://www.bitmover.com/lm/papers/splice.ps)?
Andreas Dilger wrote:
> Roger Wolff writes:
> > I'd prefer an interface that says "copy this fd to that one, and
> > optimize that if you can".
> >
> > For example, copying a file from one disk to another. I'm pretty sure
> > that some efficiency can be gained if you don't need to handle the
> > possibility of the userspace program accessing the data in between the
> > read and the write. Sure this may not qualify as a "trivial
> > optimization, that can be done with the existing infrastructure" right
> > now, but programs that want to indicate "kernel, please optimize this
> > if you can" can say so.
>
> Actually, this is a great example, because at one point I was working
> on a device interface which would offload all of the disk-disk copying
> overhead to the disks themselves, and not involve the CPU/RAM at all.
>
> I seem to recall that I2O promised something along these lines as well
> (i.e. direct device-device communication).
>
> Cheers, Andreas
> --
> Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
> \ would they cancel out, leaving him still hungry?"
> http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
--
-------------------------------------------------
Russell Leighton
leighton@imake.com
http://www.247media.com
Company Vision:
To be the preeminent global provider
of interactive marketing solutions and services.
-------------------------------------------------
* Re: Is sendfile all that sexy?
2001-01-18 8:23 ` Rogier Wolff
2001-01-18 10:01 ` Andreas Dilger
@ 2001-01-18 12:17 ` Peter Samuelson
1 sibling, 0 replies; 109+ messages in thread
From: Peter Samuelson @ 2001-01-18 12:17 UTC (permalink / raw)
To: Rogier Wolff; +Cc: Linus Torvalds, linux-kernel
[Rogier Wolff]
> I'd prefer an interface that says "copy this fd to that one, and
> optimize that if you can".
So do exactly that in libc.
    sendfile () {
        if (sys_sendfile() == -1)
            return (errno == EINVAL) ? do_slow_sendfile() : -1;
        return 0;
    }
Peter
* RE: Is sendfile all that sexy?
2001-01-16 15:46 ` David L. Parsley
@ 2001-01-18 14:00 ` Laramie Leavitt
0 siblings, 0 replies; 109+ messages in thread
From: Laramie Leavitt @ 2001-01-18 14:00 UTC (permalink / raw)
To: linux-kernel
> Jakub Jelinek wrote:
>
> > > This makes me wonder...
> > >
> > > If the kernel only kept a queue of the three smallest unused fd's, and
> > > when the queue emptied handed out whatever it liked, how many things
> > > would break? I suspect this would cover a lot of bases...
> >
> > First it would break Unix98 and other standards:
> [snip]
>
> Yeah, I realized it would violate at least POSIX. The discussion was
> just bandying about ways to avoid an expensive 'open()' without breaking
> lots of utilities and glibc stuff. This might be something that could
> be configured for specific server environments, where performance is
> more important than POSIX/Unix98, but you still don't want to completely
> break the system. Just a thought, brain-damaged as it might be. ;-)
>
Merely following the discussion, a thought occurred to me about how
to make fd allocation fairly efficient (and simple), even if it retains
the O(n) worst case. I don't know how it is currently implemented,
so this may be how it is done, or I may be way off base.
First, keep a table of fds in sorted order (mark deleted entries)
that you can access quickly: O(1) lookup.
Then, maintain a struct like:

    struct {
        int lowest_fd;
        int highest_fd;
    };

open:
    if (lowest_fd == highest_fd) {
        fd = lowest_fd;
        lowest_fd = ++highest_fd;
    } else if (flags == IGNORE_UNIX98) {
        fd = highest_fd++;
    } else {
        fd = lowest_fd;
        lowest_fd = linear_search(lowest_fd + 1, highest_fd);
    }

close:
    if (fd < lowest_fd) {
        lowest_fd = fd;
    } else if (fd == highest_fd - 1) {
        --highest_fd;
        if (lowest_fd > highest_fd)
            lowest_fd = highest_fd;
    }

For common cases this would be fairly quick. It would be very easy to
implement an O(1) allocation if you want it to be fast (at the expense
of a growing file handle table).
Just thinking about it.
Laramie.
* Re: Is sendfile all that sexy?
2001-01-18 10:01 ` Andreas Dilger
2001-01-18 11:04 ` Russell Leighton
@ 2001-01-18 16:24 ` Linus Torvalds
2001-01-18 18:46 ` Kai Henningsen
2001-01-18 18:58 ` Roman Zippel
1 sibling, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-18 16:24 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Rogier Wolff, linux-kernel
On Thu, 18 Jan 2001, Andreas Dilger wrote:
>
> Actually, this is a great example, because at one point I was working
> on a device interface which would offload all of the disk-disk copying
> overhead to the disks themselves, and not involve the CPU/RAM at all.
It's a horrible example.
Device-to-device copies sound like the ultimate thing.
They suck. They add a lot of complexity and do not work in general. And,
if your "normal" usage pattern really is to just move the data without
even looking at it, then you have to ask yourself whether you're doing
something worthwhile in the first place.
Not going to happen.
Linus
* Re: Is sendfile all that sexy?
2001-01-18 11:04 ` Russell Leighton
@ 2001-01-18 16:36 ` Larry McVoy
2001-01-19 1:53 ` Linus Torvalds
1 sibling, 0 replies; 109+ messages in thread
From: Larry McVoy @ 2001-01-18 16:36 UTC (permalink / raw)
To: Russell Leighton; +Cc: linux-kernel
On Thu, Jan 18, 2001 at 06:04:17AM -0500, Russell Leighton wrote:
>
> "copy this fd to that one, and optimize that if you can"
>
> ... isn't this Larry M's "splice" (http://www.bitmover.com/lm/papers/splice.ps)?
Not really. It's not clear to me that people really understood what I was
getting at in that paper, and since I've had some coffee and BK 2.0 is just
about ready to ship (shameless plug :-), I'll give it another go.
The goal of splice is to avoid both data copies and virtual memory completely.
My SGI experience taught me that once you remove the data copy problem, the
next problem becomes setting up and tearing down the virtual mappings to the
data. Linux is quite a bit lighter than IRIX but that doesn't remove this
issue, it just moves the point on the spectrum where the setup/teardown
becomes a problem.
Another goal of splice was to be general enough to allow data to flow from
any place to any place. The idea was to have a good model and then iterate
over all the possible endpoints; I can think of files, sockets, and virtual
address spaces right off the top of my head. Devices are a subset of files,
as will become apparent.
A final goal was to be able to handle caching vs non-caching.
Sometimes one of the endpoints is a cache, such as the file system cache.
Sometimes you want data to stay in the cache and sometimes you want to
bypass it completely. The model had to handle this.
OK, so the issues are
- avoid copying
- avoid virtual memory as much as possible
- allow data flow to/from non aligned, non page sized objects
- handle caching or non-caching
This leads pretty naturally to some observations about the shape of the
solution:
- the basic unit of data is a physical page, or part of one. That's
physical page, not a virtual address which points to a physical page.
- since we may be coming from sockets, where the payload is buried in
the middle of a page, there needs to be a vector of pages and a
vector of { pageno, offset, len } that goes along with the first
vector. There are two vectors because you could have multiple payloads
in a single page, i.e., there is not a 1:1 mapping between pages and payloads.
- The page vector needs some flags, which handle caching. I had just
two flags, the "LOAN" flag and the "GIFT" flag.
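A speculative C rendering of the two vectors and flags described above; every name in it is invented for illustration and none of it is the paper's actual code:

```c
#include <stddef.h>

struct page;  /* opaque: a physical page, not a virtual address */

/* One payload fragment: splice payloads need not be page-aligned
 * or page-sized, and one page may hold several of them. */
struct splice_frag {
    size_t pageno;   /* index into the page vector below */
    size_t offset;   /* payload start within that page */
    size_t len;      /* payload length */
};

#define SPLICE_LOAN 0x1  /* source keeps caching; call free_fn when done */
#define SPLICE_GIFT 0x2  /* receiver owns the pages and frees them itself */

/* The "cookie" a pull() might hand back: two parallel vectors plus
 * the loan/gift ownership flag. All names here are hypothetical. */
struct splice_vec {
    struct page **pages;             /* physical pages */
    struct splice_frag *frags;       /* may be more frags than pages */
    size_t npages, nfrags;
    unsigned flags;                  /* SPLICE_LOAN or SPLICE_GIFT */
    void (*free_fn)(struct page *);  /* non-NULL for loaned pages */
};
```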
In my mind, this was enough that everyone should "get it" at this point, but
that's me being lazy.
So how would this all work? The first thing is that we are now dealing
in vectors of physical pages. That's key - if you look at an OS, it
spends a lot of time with data going into a physical page, then being
translated to a virtual page, being copied to another virtual page, and
then being translated back to a physical page so that it can be sent to
a different device. That's the basic FTP loop.
So you go "hey, just always talk physical pages and you avoid a lot of this
wasted time". Now is a good time to observe that splice() is a library
interface. The kernel level interfaces I called pull() and push(). The
idea was that you could do
    vectors = 0;
    do {
        vectors = pull(from_fd, vectors);
    } while (splice_size(vectors) < BIG_ENOUGH_SIZE);
    push(to_fd, vectors);
The idea was that you maintain a pointer to the vectors; the pointer is
a "cookie": you can't really dereference it in user space, at least not all
of it, but the kernel doesn't want to maintain this stuff, it wants you to
do that. So you start pulling and then you push what you got. And you,
being the user-land process, are never looking at the data; in fact, you
can't: you have a pointer to a data structure which describes the data
but you can't look at it.
A couple of interesting things:
- this design allows for multiplexing. You could pull from multiple devices
and then push to one. The interface needs a little tweaking for that to
be meaningful, we can steal from pipe semantics. We need to be able to
say how much to pull, so we add a length.
- there is no reason that you couldn't have an fd which was open to
/proc/self/my_address_space and you could essentially do an mmap()
by seeking to where you want the mapping and doing a push to it.
This is a fairly important point, it allows for end to end. Lots of
nasty issues with non-page sized chunks in the vector, what you do there
depends on the semantics you want.
So what about the caching? That's the loan/gift distinction. The deal is that
these pages have reference counts and when the reference count goes to zero,
somebody has to free them. So the page vector needs a free_page() function
pointer and if the pages are a loan, you call that function pointer when you
are done with them. In other words, if the file system cache loaned you
the pages, you do a call back to let the file system know you are done with
them. If the pages were a gift, then the function pointer is null and you
have to manage them. You can put the normal decrement_and_free() function
in there and when you get done with them you call that and the pages go back
to the free list. You can also "free" them into your private page pool, etc.
The point is that if the end point which is being pulled() from wants the
pages cached, it "loans" them, if it doesn't, it "gifts" them. Sockets as
a "from" end point would always gift, files as a from endpoint would typically
loan.
So, there's the set of ideas. I'm ashamed to admit that I don't really know
how close kiobufs are to this. I am interested in hearing what you all think,
but especially what the people think who have been playing around with kiobufs
and sendfile.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
* Re: Is sendfile all that sexy?
2001-01-18 16:24 ` Linus Torvalds
@ 2001-01-18 18:46 ` Kai Henningsen
2001-01-18 18:58 ` Roman Zippel
1 sibling, 0 replies; 109+ messages in thread
From: Kai Henningsen @ 2001-01-18 18:46 UTC (permalink / raw)
To: linux-kernel
torvalds@transmeta.com (Linus Torvalds) wrote on 18.01.01 in <Pine.LNX.4.10.10101180822020.18072-100000@penguin.transmeta.com>:
> if your "normal" usage pattern really is to just move the data without
> even looking at it, then you have to ask yourself whether you're doing
> something worthwhile in the first place.
Web server. FTP server. Network file server. cp. mv. cat. dd.
In short, vfs->net (what sendfile already does) and vfs->vfs are probably
the most interesting applications, with net->vfs as a possible third.
Classical bulk data copy applications.
All the other stuff I can think of really does want to look at the data,
and we can already handle virtual memory just fine with read/write/mmap.
MfG Kai
* Re: Is sendfile all that sexy?
2001-01-18 16:24 ` Linus Torvalds
2001-01-18 18:46 ` Kai Henningsen
@ 2001-01-18 18:58 ` Roman Zippel
2001-01-18 19:42 ` Linus Torvalds
2001-01-18 19:51 ` Rick Jones
1 sibling, 2 replies; 109+ messages in thread
From: Roman Zippel @ 2001-01-18 18:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel
Hi,
On Thu, 18 Jan 2001, Linus Torvalds wrote:
> > Actually, this is a great example, because at one point I was working
> > on a device interface which would offload all of the disk-disk copying
> > overhead to the disks themselves, and not involve the CPU/RAM at all.
>
> It's a horrible example.
>
> device-to-device copies sound like the ultimate thing.
>
> They suck. They add a lot of complexity and do not work in general. And,
> if your "normal" usage pattern really is to just move the data without
> even looking at it, then you have to ask yourself whether you're doing
> something worthwhile in the first place.
>
> Not going to happen.
device-to-device is not the same as disk-to-disk. A better example would
be a streaming file server. Slowly the PCI bus becomes a bottleneck: why
would you want to move the data twice over the PCI bus if once is enough,
and the data is very likely not needed afterwards? Sure, you can use a more
expensive 64-bit/66MHz bus, but why should you if the 32-bit/33MHz bus is
theoretically fast enough for your application?
So I'm not advising it as "the ultimate thing", but I don't understand,
why it shouldn't happen.
bye, Roman
* Re: Is sendfile all that sexy?
2001-01-18 18:58 ` Roman Zippel
@ 2001-01-18 19:42 ` Linus Torvalds
2001-01-19 0:18 ` Roman Zippel
2001-01-20 15:36 ` Kai Henningsen
2001-01-18 19:51 ` Rick Jones
1 sibling, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-18 19:42 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel
On Thu, 18 Jan 2001, Roman Zippel wrote:
> >
> > Not going to happen.
>
> device-to-device is not the same as disk-to-disk. A better example would
> be a streaming file server.
No, it wouldn't be.
[ Crystal ball mode: ON ]
It's too damn device-dependent, and it's not worth it. There's no way to
make it general with any current hardware, and there probably isn't going
to be for at least another decade or so. And because it's expensive and
slow to do even on a hardware level, it probably won't be done even then.
Which means that it will continue to be a pure localized hack for the
foreseeable future.
Quite frankly, show me a setup where the network bandwidth is even _close_
to big enough that it would make sense to try to stream directly from the
disk? The only one I can think of is basically DoD-type installations with
big satellite pictures on a huge server, and gigabit ethernet everywhere.
Quite frankly, if that huge server is so borderline that it cannot handle
the double copy, the system administrators have big problems.
Streaming local video to disk? Sure, I can see that people might want
that. But if you can't see that people might want to _see_ it while they
are streaming, then you're missing a big part of the picture called
"psychology". So you'd still want to have an in-memory buffer for things
like that.
Come back to this in ten years, when devices and buses are smarter. MAYBE
they'll support it (but see later about why I don't think they will).
Today, you're living in a pipe dream. You can't practically do it with any
real devices of today - even when both parts support busmastering, they do
NOT tend to support "busmaster to the disk", or "busmaster from the disk".
I don't know of any disk interfaces that work that way
(they'd basically need to have some way to busmaster directly to the
controller caches, and do cache management in software. It can be done, but
probably exposes more of the hardware than most people want to see).
Right now the only special case might be some very specific embedded
devices, things like routers, video recorders etc. And for them it would
be very much a special case, with special drivers and everything. This is
NOT a generic kernel issue, and we have not reached the point where it's
even worth it trying to design the interfaces for it yet.
An important point in interface design is to know when you don't know
enough. We do not have the internal interfaces for doing anything like
this, and I seriously doubt they'll be around soon.
And you have to realize that it's not at all a given that device protocols
will even move towards this kind of environment. It's equally likely that
device protocols in the future will be more memory-intensive, where the
basic protocol will all be "read from memory" and "write to memory", and
nobody will even have a notion of mapping memory into device space like
PCI kind of does now.
I haven't looked at what infiniband/NGIO etc spec out, but I'd actually be
surprised if they allow you to effectively short-circuit the IO networks
together. It is not an operation that lends itself well to a network
topology - it happens to work on PCI due to the traditional "shared bus"
kind of logic that PCI inherited. And even on PCI, there are just a LOT of
PCI bridges that apparently do not like seeing PCI-PCI transfers.
(Short and sweet: most high-performance people want point-to-point serial
line IO with no hops, because it's a known art to make that go fast. No
general-case routing in hardware - if you want to go as fast as the
devices and the link can go, you just don't have time to route. Trying to
support device->device transfers easily slows down the _common_ case,
which is why I personally doubt it will even be supported 10-15 years from
now. Better hardware does NOT mean "more features").
Linus
* Re: Is sendfile all that sexy?
2001-01-18 18:58 ` Roman Zippel
2001-01-18 19:42 ` Linus Torvalds
@ 2001-01-18 19:51 ` Rick Jones
1 sibling, 0 replies; 109+ messages in thread
From: Rick Jones @ 2001-01-18 19:51 UTC (permalink / raw)
To: Roman Zippel; +Cc: Linus Torvalds, Andreas Dilger, Rogier Wolff, linux-kernel
> device-to-device is not the same as disk-to-disk. A better example would
> be a streaming file server. Slowly the pci bus becomes a bottleneck, why
> would you want to move the data twice over the pci bus if once is enough
> and the data very likely not needed afterwards? Sure you can use a more
> expensive 64bit/66MHz bus, but why should you if the 32bit/33MHz bus is
> theoretically fast enough for your application?
theoretically fast enough for the application would imply the dual
transfers across the bus would fit :)
also, if a system was doing something with that much throughput, i
suspect it would not only be designed with 64/66 busses (or better), but
also have things on several different busses. that makes device to
device life more of a challenge.
rick jones
--
ftp://ftp.cup.hp.com/dist/networking/misc/rachel/
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, OR post, but please do NOT do BOTH...
my email address is raj in the cup.hp.com domain...
* Re: Is sendfile all that sexy?
2001-01-18 19:42 ` Linus Torvalds
@ 2001-01-19 0:18 ` Roman Zippel
2001-01-19 1:14 ` Linus Torvalds
2001-01-20 15:36 ` Kai Henningsen
1 sibling, 1 reply; 109+ messages in thread
From: Roman Zippel @ 2001-01-19 0:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel
Hi,
On Thu, 18 Jan 2001, Linus Torvalds wrote:
> It's too damn device-dependent, and it's not worth it. There's no way to
> make it general with any current hardware, and there probably isn't going
> to be for at least another decade or so. And because it's expensive and
> slow to do even on a hardware level, it probably won't be done even then.
>
> [...]
>
> An important point in interface design is to know when you don't know
> enough. We do not have the internal interfaces for doing anything like
> this, and I seriously doubt they'll be around soon.
I agree, it's device dependent, but such hardware exists. It needs of
course its own memory, but then you can see it as a NUMA architecture and
we already have the support for this. Create a new memory zone for the
device memory and keep the pages reserved. Now you can use it almost like
other memory, e.g. reading from/writing to it using address_space_ops.
An application where I'd like to use it is audio recording/playback
(24bit, 96kHz on 144 channels). It's possible to copy that amount
of data around, but then you can't do much besides this. Most of the time
the data is only needed on the soundcard, so why should I copy it
to main memory first?
Right now I'm stuck with accessing a scsi device directly, but I would love
to use the generic file/address_space interface for that, so you can
stream directly to/from any filesystem. The only problem is that the fs
interface is still too slow.
That's btw the reason I suggested splitting the get_block function. If you
record into a file, you first just want to allocate any block from the fs
for that file. A bit later when you start the write, you need a real
block. And again a bit later you can still update the inode. These three
stages have completely different locking requirements (except the page
lock) and you can use the same mechanism for delayed writes.
Anyway, now with the zerocopy network patches, there are basically already
all the needed interfaces and you don't have to wait for 10 years, so I
think you need to polish your crystal ball. :-)
bye, Roman
* Re: Is sendfile all that sexy?
2001-01-19 0:18 ` Roman Zippel
@ 2001-01-19 1:14 ` Linus Torvalds
2001-01-19 6:57 ` Alan Cox
2001-01-19 10:13 ` Roman Zippel
0 siblings, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-19 1:14 UTC (permalink / raw)
To: Roman Zippel; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel
On Fri, 19 Jan 2001, Roman Zippel wrote:
>
> On Thu, 18 Jan 2001, Linus Torvalds wrote:
>
> > It's too damn device-dependent, and it's not worth it. There's no way to
> > make it general with any current hardware, and there probably isn't going
> > to be for at least another decade or so. And because it's expensive and
> > slow to do even on a hardware level, it probably won't be done even then.
> >
> > [...]
> >
> > An important point in interface design is to know when you don't know
> > enough. We do not have the internal interfaces for doing anything like
> > this, and I seriously doubt they'll be around soon.
>
> I agree, it's device dependent, but such hardware exists.
Show me any practical case where the hardware actually exists.
I do not know of _any_ disk controllers that let you map the controller
buffers over PCI. Which means that with current hardware, you have to
assume that the disk is the initiator of the PCI-PCI DMA requests. Agreed?
Which in turn implies that the non-disk target hardware has to be able to
have a PCI-mapped memory buffer for the source or the destination, AND
they have to be able to cope with the fact that the data you get off the
disk will have to be the raw data at 512-byte granularity.
There are really quite few devices that do this. The most common example
by far would be a frame buffer, where you could think of streaming a few
frames at a time directly from disk into graphics memory. But nobody
actually saves pictures that way in reality - they all need processing to
show up. Even when the graphics card does things like mpeg2 decoding in
hardware, the decoding logic is not set up the way the data comes off the
disk in any case I know of.
As to soundcards, all the ones I know about that are worthwhile certainly
have on-board memory, but that memory tends to be used for things
like waveforms etc, and most of them refill their audio data by doing DMA.
Again, they are the initiator of the IO, not a passive receiver.
I'm sure there are sound cards that just expose their buffers directly.
Fine. Make a special user-space driver for it. Don't try to make it into a
design.
> It needs of
> course its own memory, but then you can see it as a NUMA architecture and
> we already have the support for this. Create a new memory zone for the
> device memory and keep the pages reserved. Now you can use it almost like
> other memory, e.g. reading from/writing to it using address_space_ops.
You need to have a damn special sound card to do the above.
And you wouldn't need a new memory zone - the kernel wouldn't ever touch
the memory anyway, you'd just ioremap() it if you needed to access it
programmatically in addition to the streaming of data off disk.
> An application where I'd like to use it is audio recording/playback
> (24bit, 96kHz on 144 channels). It's possible to copy that amount
> of data around, but then you can't do much besides this. Most of the time
> the data is only needed on the soundcard, so why should I copy it
> to main memory first?
Because with 99% of the hardware, there is no other way to get at it?
Also, even when you happen to have the 1% card combination where it would
work in the first place, you'd better make sure that they are on the same
PCI bus. That's usually true on most PC's today, but that's probably going
to be an issue eventually.
> Anyway, now with the zerocopy network patches, there are basically already
> all the needed interfaces and you don't have to wait for 10 years, so I
> think you need to polish your crystal ball. :-)
The zero-copy network patches have _none_ of the interfaces you think you
need. They do not fix the fact that hardware usually doesn't even _allow_
for what you are hoping for. And what you want is probably going to be
less likely in the future than more likely.
Linus
* Re: Is sendfile all that sexy?
2001-01-18 11:04 ` Russell Leighton
2001-01-18 16:36 ` Larry McVoy
@ 2001-01-19 1:53 ` Linus Torvalds
1 sibling, 0 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-19 1:53 UTC (permalink / raw)
To: linux-kernel
In article <3A66CDB1.B61CD27B@imake.com>,
Russell Leighton <leighton@imake.com> wrote:
>
>"copy this fd to that one, and optimize that if you can"
>
>... isn't this Larry M's "splice" (http://www.bitmover.com/lm/papers/splice.ps)?
We talked extensively about "splice()" with Larry. It was one of the
motivations for doing sendfile(). The problem with "splice()" is that it
did not have very good semantics on who does the push and who does the
pull, and how to actually implement this efficiently yet in a generic
manner.
In many ways, that lack of good generic interfaces is what turned me off
splice(). I showed Larry the simple solution that gets 95% of what
people wanted splice for, and he didn't object. He didn't have any
really good solutions to the implementation problems either.
Now, the reason it is called "sendfile()" is obviously partially because
others _did_ have sendfiles (NT and HP-UX), but it's also because I
wanted to make it clear that this was NOT a generic splice(). It could
really only work in one direction: from the page cache out. The page
cache would always do a push, and nobody would do a pull.
Now, the page cache has improved, and these days we could _almost_ do a
"receivefile()", with the page cache doing a pull, in addition to the
push it can already do. And yes, I'd probably use the same system call,
and possibly rename it to be "splice()", even though it still wouldn't
be the generic case.
Now, the reason I say "almost" on the page cache "pull()" thing is that
while the page cache can now do basically "prepare_write()" + "pull()" +
"commit_write()", the problem is that it still needs to know the _size_
of the pull() in order to be able to prepare for the write.
Basically, the pull<->push model turns into a four-way handshake:
(a) prepare for the pull (source)
(b) prepare for the push (destination)
(c) do the pull (source)
(d) commit the push (destination)
and with this kind of model I suspect that we could actually do a fairly
real splice(), where sendfile() would just be a special case.
Right now, the only part we lack above is (a) - everything else we have.
(b) is "prepare_write()", (c) is "read()", (d) is "commit_write()".
So we lack a "prepare_read()" as things stand now. The interface would
probably be something on the order of
int (*prepare_read)(struct file *, int);
where we'd pass in the "struct file" and the amount of data we'd _like_
to see, and we'd get back the amount of data we can actually have so
that we can successfully prepare for the push (ie "prepare_write()").
Linus
* Re: Is sendfile all that sexy?
2001-01-19 1:14 ` Linus Torvalds
@ 2001-01-19 6:57 ` Alan Cox
2001-01-19 10:13 ` Roman Zippel
1 sibling, 0 replies; 109+ messages in thread
From: Alan Cox @ 2001-01-19 6:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Roman Zippel, Andreas Dilger, Rogier Wolff, linux-kernel
> Which in turn implies that the non-disk target hardware has to be able to
> have a PCI-mapped memory buffer for the source or the destination, AND
> they have to be able to cope with the fact that the data you get off the
> disk will have to be the raw data at 512-byte granularity.
And that the chipset gets it right. Which is a big assumption, as tv card
driver folks can tell you.
The pcipci stuff in quirks is only a beginning, alas.
* Re: Is sendfile all that sexy?
2001-01-19 1:14 ` Linus Torvalds
2001-01-19 6:57 ` Alan Cox
@ 2001-01-19 10:13 ` Roman Zippel
2001-01-19 10:55 ` Andre Hedrick
2001-01-19 20:18 ` kuznet
1 sibling, 2 replies; 109+ messages in thread
From: Roman Zippel @ 2001-01-19 10:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andreas Dilger, Rogier Wolff, linux-kernel
Hi,
On Thu, 18 Jan 2001, Linus Torvalds wrote:
> > I agree, it's device dependent, but such hardware exists.
>
> Show me any practical case where the hardware actually exists.
http://www.augan.com
> I do not know of _any_ disk controllers that let you map the controller
> buffers over PCI. Which means that with current hardware, you have to
> assume that the disk is the initiator of the PCI-PCI DMA requests. Agreed?
Yes.
> I'm sure there are sound cards that just expose their buffers directly.
> Fine. Make a special user-space driver for it. Don't try to make it into a
> design.
> [..]
> You need to have a damn special sound card to do the above.
That's true. "Soundcard" is actually a small understatement. :)
Why should I make a new design for it, when it fits nicely into the
current design?
> And you wouldn't need a new memory zone - the kernel wouldn't ever touch
> the memory anyway, you'd just ioremap() it if you needed to access it
> programmatically in addition to the streaming of data off disk.
ioremapped memory is not the same (that's what we do right now), you have
to fake some virtual address to get the data to the right physical
location.
> Also, even when you happen to have the 1% card combination where it would
> work in the first place, you'd better make sure that they are on the same
> PCI bus. That's usually true on most PC's today, but that's probably going
> to be an issue eventually.
I agree, it's a special setup.
> > Anyway, now with the zerocopy network patches, there are basically already
> > all the needed interfaces and you don't have to wait for 10 years, so I
> > think you need to polish your crystal ball. :-)
>
> The zero-copy network patches have _none_ of the interfaces you think you
> need. They do not fix the fact that hardware usually doesn't even _allow_
> for what you are hoping for. And what you want is probably going to be
> less likely in the future than more likely.
It's about direct i/o from/to pages, and for that you need a page struct
(so ioremapping doesn't work). See the memory on the pci card as normal
memory, except that you can't allocate it normally, but you can still
organize it like normal memory. All you need to do is set up this memory
area, and then you can use it like normal memory, e.g. I can put it into
the page cache and do a normal read/write with it. The changes are very
minor, but it would solve so many other problems (especially alias
issues).
I know that this isn't possible with every hardware combination;
nonetheless it's not that big a problem to support it where it's possible.
bye, Roman
* Re: Is sendfile all that sexy?
2001-01-19 10:13 ` Roman Zippel
@ 2001-01-19 10:55 ` Andre Hedrick
2001-01-19 20:18 ` kuznet
1 sibling, 0 replies; 109+ messages in thread
From: Andre Hedrick @ 2001-01-19 10:55 UTC (permalink / raw)
To: Roman Zippel; +Cc: Linus Torvalds, Andreas Dilger, Rogier Wolff, linux-kernel
On Fri, 19 Jan 2001, Roman Zippel wrote:
> Hi,
>
> On Thu, 18 Jan 2001, Linus Torvalds wrote:
>
> > > I agree, it's device dependent, but such hardware exists.
> >
> > Show me any practical case where the hardware actually exists.
>
> http://www.augan.com
>
> > I do not know of _any_ disk controllers that let you map the controller
> > buffers over PCI. Which means that with current hardware, you have to
> > assume that the disk is the initiator of the PCI-PCI DMA requests. Agreed?
Err, first-party DMA devices do this, I think.
I do have some of these on the radar map.
Andre Hedrick
Linux ATA Development
* Re: Is sendfile all that sexy?
[not found] <Pine.LNX.4.10.10101190911130.10218-100000@penguin.transmeta.com>
@ 2001-01-19 17:23 ` Rogier Wolff
0 siblings, 0 replies; 109+ messages in thread
From: Rogier Wolff @ 2001-01-19 17:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
Linus Torvalds wrote:
> > I wrote a driver for a zoran-chipset frame-grabber card. The "natural"
> > way to save a video stream was exactly the way it came out of the
> > card. And the card was structured that you could put on an "mpeg
> > decoder" (or encoder) chip, and you could DMA the stream directly into
> > that chip.
>
> Ehh..
>
> And how many of these chips are out on the market?
>
> Would you agree that it is less than 0.01% of all PC hardware? Like MUCH
> less?
Someone asked me to write a driver for one of these cards. I was
assuming that most of them work like this. And I'm never wrong, you
know...
> > The way soundcards are commonly programmed, they don't play from their
> > own memory, but from main memory. However, they all can play from
> > their own memory.
>
> And how do you synchronize the streams etc? It's a nasty piece of
> business, and direct PCI-PCI streaming is not the answer.
>
> > > And you wouldn't need a new memory zone - the kernel wouldn't ever touch
> > > the memory anyway, you'd just ioremap() it if you needed to access it
> > > programmatically in addition to the streaming of data off disk.
> >
> > That's the way things currently work. If you start thinking about it
> > as a NUMA, it may improve the situation for "common users" too.
> >
> > A PC is a NUMA machine! We have disk (swap) and main memory. We also
> > have a frame buffer, which doesn't currently fit into our memory
> > architecture.
>
Don't be silly. It fits _fine_ in our memory architecture. We map it to
> xfree86, and we're done with it.
>
> Using the frame buffer for "backing store" for normal memory is not worth
> it. That's what disks are for. Frame buffers are _way_ too small to be
> interesting as a memory resource.
It's a silly small resource that suddenly becomes usable should the
right infrastructure be in place. It isn't. You're not planning on doing
it soonish. Neither am I.
Roger.
--
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.
* Re: Is sendfile all that sexy?
2001-01-19 10:13 ` Roman Zippel
2001-01-19 10:55 ` Andre Hedrick
@ 2001-01-19 20:18 ` kuznet
2001-01-19 21:45 ` Linus Torvalds
1 sibling, 1 reply; 109+ messages in thread
From: kuznet @ 2001-01-19 20:18 UTC (permalink / raw)
To: Roman Zippel; +Cc: linux-kernel
Hello!
> It's about direct i/o from/to pages,
Yes. Formally, there is no problem sending to tcp directly from io space.
But could someone explain one thing to me. Does bus-mastering
from io really work? And if it does, is it fast enough?
At least, looking at my book on pci, I do not understand
how such transfers are able to use bursts. MRM is banned for them...
Alexey
* Re: Is sendfile all that sexy?
2001-01-19 20:18 ` kuznet
@ 2001-01-19 21:45 ` Linus Torvalds
2001-01-20 18:53 ` kuznet
0 siblings, 1 reply; 109+ messages in thread
From: Linus Torvalds @ 2001-01-19 21:45 UTC (permalink / raw)
To: linux-kernel
In article <200101192018.XAA25263@ms2.inr.ac.ru>,
<kuznet@ms2.inr.ac.ru> wrote:
>Hello!
>
>> It's about direct i/o from/to pages,
>
>Yes. Formally, there is no problem sending to tcp directly from io space.
Actually, as long as there is no "struct page" there _are_ problems.
This is why the NUMA stuff was brought up - it would require that there
be a mem_map for the PCI pages.. (to do ref-counting etc).
>But could someone explain one thing to me. Does bus-mastering
>from io really work? And if it does, is it fast enough?
>At least, looking at my book on pci, I do not understand
>how such transfers are able to use bursts. MRM is banned for them...
It does work at least on some hardware. But no, I don't think you can
depend on bursting (but I don't see why it couldn't work in theory).
Linus
* Re: Is sendfile all that sexy?
2001-01-18 19:42 ` Linus Torvalds
2001-01-19 0:18 ` Roman Zippel
@ 2001-01-20 15:36 ` Kai Henningsen
2001-01-20 21:01 ` Linus Torvalds
1 sibling, 1 reply; 109+ messages in thread
From: Kai Henningsen @ 2001-01-20 15:36 UTC (permalink / raw)
To: torvalds; +Cc: linux-kernel
torvalds@transmeta.com (Linus Torvalds) wrote on 18.01.01 in <Pine.LNX.4.10.10101181120070.18387-100000@penguin.transmeta.com>:
> (Short and sweet: most high-performance people want point-to-point serial
> line IO with no hops, because it's a known art to make that go fast. No
> general-case routing in hardware - if you want to go as fast as the
> devices and the link can go, you just don't have time to route. Trying to
> support device->device transfers easily slows down the _common_ case,
> which is why I personally doubt it will even be supported 10-15 years from
> now. Better hardware does NOT mean "more features").
Well, maybe.
Then again, I could easily see those I/O devices go the general embedded
route, which in a decade or two could well mean they run some sort of
embedded Linux on the controller.
Which would make some features rather easy to implement.
(Think about it: twenty years from now, a typical desktop machine may be a
heterogeneous Linux cluster. Didn't someone say something about World
Domination?)
(Note that I predicted this 2001-01-20T16:35:30. Just in case it actually
works out that way.)
MfG Kai
* Re: Is sendfile all that sexy?
2001-01-19 21:45 ` Linus Torvalds
@ 2001-01-20 18:53 ` kuznet
2001-01-20 19:26 ` Linus Torvalds
0 siblings, 1 reply; 109+ messages in thread
From: kuznet @ 2001-01-20 18:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
Hello!
> Actually, as long as there is no "struct page" there _are_ problems.
> This is why the NUMA stuff was brought up - it would require that there
> be a mem_map for the PCI pages.. (to do ref-counting etc).
I see.
Is this a strong "no-no-no"? What is the obstacle to allowing "struct page"
to sit outside of mem_map (in some private table, or as a full orphan)?
Only the bloat of struct page with a reference to some "page_ops", or
something more profound?
> It does work at least on some hardware. But no, I don't think you can
> depend on bursting (but I don't see why it couldn't work in theory).
I do not see why either, but the documents are pretty obscure on this.
MRM seems to be prohibited for pci-pci. But my education is still not
enough even to understand whether MRM is required to burst
or whether this is fully orthogonal. 8)
Alexey
* Re: Is sendfile all that sexy?
2001-01-20 18:53 ` kuznet
@ 2001-01-20 19:26 ` Linus Torvalds
2001-01-20 21:20 ` Roman Zippel
2001-01-21 23:21 ` David Woodhouse
0 siblings, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-20 19:26 UTC (permalink / raw)
To: kuznet; +Cc: linux-kernel
On Sat, 20 Jan 2001 kuznet@ms2.inr.ac.ru wrote:
> > Actually, as long as there is no "struct page" there _are_ problems.
> > This is why the NUMA stuff was brought up - it would require that there
> > be a mem_map for the PCI pages.. (to do ref-counting etc).
>
> I see.
>
> Is this a strong "no-no-no"? What is the obstacle to allowing "struct page"
> to sit outside of mem_map (in some private table, or as a full orphan)?
> Only the bloat of struct page with a reference to some "page_ops", or
> something more profound?
There's no no-no here: you can even create the "struct page"s on demand,
and create a dummy local zone to contain them, which they all point back
to. It should be trivial - nobody else cares about those pages or that
zone anyway.
This is very much how the MM layer in 2.4.x is set up to work.
That said, nobody has actually done this in practice yet, so there may be
details to work out, of course. I don't see any fundamental reasons it
wouldn't easily work, but..
Linus
* Re: Is sendfile all that sexy?
2001-01-20 15:36 ` Kai Henningsen
@ 2001-01-20 21:01 ` Linus Torvalds
2001-01-20 21:10 ` Mo McKinlay
2001-01-20 22:24 ` Roman Zippel
0 siblings, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-20 21:01 UTC (permalink / raw)
To: Kai Henningsen; +Cc: linux-kernel
On 20 Jan 2001, Kai Henningsen wrote:
>
> Then again, I could easily see those I/O devices go the general embedded
> route, which in a decade or two could well mean they run some sort of
> embedded Linux on the controller.
>
> Which would make some features rather easy to implement.
I'm not worried about a certain class of features. I will predict, for
example, that disk subsystems etc will continue to get smarter, to the
point where most people will end up just buying a "file server" whenever
they buy a disk. THOSE kinds of features are the obvious ones when you
have devices that get smarter, and the kinds of features people are
willing to pay for.
The thing I find really doubtful is that somebody would be so silly as to
make the low-level electrical protocol be anything but a simple direct
point-to-point link. Shared buses just do not scale, and they also have
some major problems with true high-performance GBps bandwidth.
Look at where ethernet is today. Ten years ago most people used it as a
bus. These days almost everybody thinks of ethernet as point-to-point,
with switches and hubs to make it look nothing like the bus of yore. You
just don't connect multiple devices to one wire any more.
The advantage of direct point-to-point links is that it's a hell of a lot
faster, and it's also much easier to distribute - the links don't have to
be in lock-step any more etc. It's perfectly ok to have one really
high-performance link for devices that need it, and a few low-performance
links in the same system do not bog the fast one down.
But point-to-point also means that you don't get any real advantage from
doing things like device-to-device DMA. Because the links are
asynchronous, you need buffers in between them anyway, and there is no
bandwidth advantage of not going through the hub if the topology is a
pretty normal "star" kind of thing. And you _do_ want the star topology,
because in the end you want most of the bandwidth concentrated at the
point that uses it.
The exception to this will be when you have smart devices that
_internally_ also have the same kind of structure, and you have a RAID
device with multiple disks in a star around the raid controller. Then
you'll find the raid controller doing raid rebuilds etc without the data
ever coming off that "local star" - but this is not something that the OS
will even get involved in other than sending the raid controller the
command to start the rebuild. It's not a "device-device" transfer in that
bigger sense - it's internal to the raid unit.
Just wait. My crystal ball is infallible.
Linus
* Re: Is sendfile all that sexy?
2001-01-20 21:01 ` Linus Torvalds
@ 2001-01-20 21:10 ` Mo McKinlay
2001-01-20 22:24 ` Roman Zippel
1 sibling, 0 replies; 109+ messages in thread
From: Mo McKinlay @ 2001-01-20 21:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kai Henningsen, linux-kernel
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Today, Linus Torvalds (torvalds@transmeta.com) wrote:
> Just wait. My crystal ball is infallible.
One of these days, that line will be your downfall :-)
*grins*
Mo.
- --
Mo McKinlay
mmckinlay@gnu.org
- -------------------------------------------------------------------------
GnuPG/PGP Key: pub 1024D/76A275F9 2000-07-22
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.4 (GNU/Linux)
Comment: For info see http://www.gnupg.org
iEYEARECAAYFAjpp/ssACgkQRcGgB3aidfmcagCgkieTFD77O+Xqn+nmcaoiYERh
UwwAoIL8cWZPdaKine4fZ4fJmQqwTvBZ
=i1Ax
-----END PGP SIGNATURE-----
* Re: Is sendfile all that sexy?
2001-01-20 19:26 ` Linus Torvalds
@ 2001-01-20 21:20 ` Roman Zippel
2001-01-21 0:25 ` Linus Torvalds
2001-01-21 23:21 ` David Woodhouse
1 sibling, 1 reply; 109+ messages in thread
From: Roman Zippel @ 2001-01-20 21:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: kuznet, linux-kernel
Hi,
On Sat, 20 Jan 2001, Linus Torvalds wrote:
> There's no no-no here: you can even create the "struct page"s on demand,
> and create a dummy local zone that contains them that they all point back
> to. It should be trivial - nobody else cares about those pages or that
> zone anyway.
AFAIK as long as that dummy page struct is only used in the page cache,
that should work, but you get new problems as soon as you map the page
also into a user process (grep for CONFIG_DISCONTIGMEM under
include/asm-mips64 to see the needed changes). In the worst case one
might need reverse mapping to get the page back. :)
> That said, nobody has actually done this in practice yet, so there may be
> details to work out, of course. I don't see any fundamental reasons it
> wouldn't easily work, but..
I hope I'll soon have the time to experiment with this, so I'll know for sure.
I don't see major problems, except that I don't know yet how the performance
will be.
bye, Roman
* Re: Is sendfile all that sexy?
2001-01-20 21:01 ` Linus Torvalds
2001-01-20 21:10 ` Mo McKinlay
@ 2001-01-20 22:24 ` Roman Zippel
2001-01-21 0:33 ` Linus Torvalds
1 sibling, 1 reply; 109+ messages in thread
From: Roman Zippel @ 2001-01-20 22:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kai Henningsen, linux-kernel
Hi,
On Sat, 20 Jan 2001, Linus Torvalds wrote:
> But point-to-point also means that you don't get any real advantage from
> doing things like device-to-device DMA. Because the links are
> asynchronous, you need buffers in between them anyway, and there is no
> bandwidth advantage of not going through the hub if the topology is a
> pretty normal "star" kind of thing. And you _do_ want the star topology,
> because in the end most of the bandwidth you want concentrated at the
> point that uses it.
I agree, but who says that the buffer always has to be the main memory?
That might be true especially for embedded devices. The cpu is then just
the local controller that manages several devices, each with its own buffer.
Let's take a file server with multiple disks and multiple network cards,
each with its own buffer. For stuff like this you don't want to go through
the main memory; on the other hand you still need to synchronize all the
data. I don't know of such hardware, but I don't see a reason not to do it
under Linux. :-)
bye, Roman
* Re: Is sendfile all that sexy?
2001-01-20 21:20 ` Roman Zippel
@ 2001-01-21 0:25 ` Linus Torvalds
2001-01-21 2:03 ` Roman Zippel
2001-01-21 18:00 ` kuznet
0 siblings, 2 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-21 0:25 UTC (permalink / raw)
To: Roman Zippel; +Cc: kuznet, linux-kernel
On Sat, 20 Jan 2001, Roman Zippel wrote:
>
> AFAIK as long as that dummy page struct is only used in the page cache,
> that should work, but you get new problems as soon as you map the page
> also into a user process (grep for CONFIG_DISCONTIGMEM under
> include/asm-mips64 to see the needed changes). In the worst case one
> might need reverse mapping to get the page back. :)
No, for the CONTIGMEM case you can just use remap_page_range() directly:
it won't actually map the "struct page*" into the user space, it will just
map a special reserved page into user space. No changes needed.
So it just so happens that the physical address of the two "pages" is the
same in this case - one reachable through the dummy "struct page *" and
one reachable through the VM layer. The VM layer will never see the dummy
"struct page", and that's ok. It doesn't need it.
Now, there are things to look out for: when you do these kinds of dummy
"struct page" tricks, some macros etc WILL NOT WORK. In particular, we do
not currently have a good "page_to_bus/phys()" function. That means that
anybody trying to do DMA to this page is currently screwed, simply because
he has no good way of getting the physical address.
This is a limitation in general: the PTE access functions would also like
to have "page_to_phys()" and "phys_to_page()" functions. It gets even
worse with IO mappings, where "CPU-physical" is NOT necessarily the same
as "bus-physical".
It shouldn't be too hard to do the phys/bus addresses in general;
something like this should actually do it:

static inline unsigned long page_to_physnr(struct page * page)
{
	unsigned long offset;
	struct zone_struct * zone = page->zone;

	offset = page - zone->zone_mem_map;
	return zone->zone_start_paddr + offset;
}
except right now I think "zone_start_paddr" is defined wrong (it's defined
to be the actual physical address, rather than being the "physical address
shifted right by the page size"). It needs to be the latter in order to
handle physical memory spaces that are bigger than "unsigned long" (i.e.
x86 PAE mode). Making the thing "unsigned long long" is _not_ an option,
considering how crappy gcc is at double integers.
Linus
* Re: Is sendfile all that sexy?
2001-01-20 22:24 ` Roman Zippel
@ 2001-01-21 0:33 ` Linus Torvalds
2001-01-21 1:29 ` David Schwartz
` (2 more replies)
0 siblings, 3 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-21 0:33 UTC (permalink / raw)
To: Roman Zippel; +Cc: Kai Henningsen, linux-kernel
On Sat, 20 Jan 2001, Roman Zippel wrote:
>
> On Sat, 20 Jan 2001, Linus Torvalds wrote:
>
> > But point-to-point also means that you don't get any real advantage from
> > doing things like device-to-device DMA. Because the links are
> > asynchronous, you need buffers in between them anyway, and there is no
> > bandwidth advantage of not going through the hub if the topology is a
> > pretty normal "star" kind of thing. And you _do_ want the star topology,
> > because in the end most of the bandwidth you want concentrated at the
> > point that uses it.
>
> I agree, but who says, that the buffer always has to be the main memory?
It doesn't _have_ to be.
But think like a good hardware designer.
In 99% of all cases, where do you want the results of a read to end up?
Where do you want the contents of a write to come from?
Right. Memory.
Now, optimize for the common case. Make the common case go as fast as you
can, with as little latency and as high bandwidth as you can.
What kind of hardware would _you_ design for the point-to-point link?
I'm claiming that you'd do a nice DMA engine for each link point. There
wouldn't be any reason to have any other buffers (except, of course,
minimal buffers inside the IO chip itself - not for the whole packet, but
for just being able to handle cases where you don't have 100% access to
the memory bus all the time - and for doing things like burst reads and
writes to memory etc).
I'm _not_ seeing the point for a high-performance link to have a generic
packet buffer.
Linus
* RE: Is sendfile all that sexy?
2001-01-21 0:33 ` Linus Torvalds
@ 2001-01-21 1:29 ` David Schwartz
2001-01-21 2:42 ` Roman Zippel
2001-01-21 9:52 ` James Sutherland
2 siblings, 0 replies; 109+ messages in thread
From: David Schwartz @ 2001-01-21 1:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
> I'm _not_ seeing the point for a high-performance link to have a generic
> packet buffer.
>
> Linus
Well suppose your RAID controller can take over control of disks
distributed throughout your I/O subsystem. If you assume the bandwidth of
the I/O subsystem is not the limiting factor, there's no need to hang the
disks directly off the RAID controller.
This makes even more sense if your computer can upload code to your
peripherals which they can then run autonomously. Imagine if your filesystem
code is mobile and can reside (perhaps to a variable extent) in your drives
if you want it to.
Of course none of this really relates to the case of the OS trying to get
peripherals to talk to each other directly.
DS
* Re: Is sendfile all that sexy?
2001-01-21 0:25 ` Linus Torvalds
@ 2001-01-21 2:03 ` Roman Zippel
2001-01-21 18:00 ` kuznet
1 sibling, 0 replies; 109+ messages in thread
From: Roman Zippel @ 2001-01-21 2:03 UTC (permalink / raw)
To: Linus Torvalds; +Cc: kuznet, linux-kernel
Hi,
On Sat, 20 Jan 2001, Linus Torvalds wrote:
> Now, there are things to look out for: when you do these kinds of dummy
> "struct page" tricks, some macros etc WILL NOT WORK. In particular, we do
> not currently have a good "page_to_bus/phys()" function. That means that
> anybody trying to do DMA to this page is currently screwed, simply because
> he has no good way of getting the physical address.
>
> This is a limitation in general: the PTE access functions would also like
> to have "page_to_phys()" and "phys_to_page()" functions. It gets even
> worse with IO mappings, where "CPU-physical" is NOT necessarily the same
> as "bus-physical".
That's why I want to avoid the dummy struct page and use a real mem_map
instead. I have two options:
1. I map everything together in one mem_map, like it's still done for
m68k; the overhead here is in the phys_to_virt()/virt_to_phys() functions.
2. I use several nodes, like mips64/arm, and virt_to_page() gets more
complex, but this usually assumes a specific memory layout to keep it
fast.
Once that problem is solved, I can manage the memory on the card like the
main memory and use it however I want. I'll probably do something like
ia64 and use the highest bits as an offset into a table.
bye, Roman
* Re: Is sendfile all that sexy?
2001-01-21 0:33 ` Linus Torvalds
2001-01-21 1:29 ` David Schwartz
@ 2001-01-21 2:42 ` Roman Zippel
2001-01-21 9:52 ` James Sutherland
2 siblings, 0 replies; 109+ messages in thread
From: Roman Zippel @ 2001-01-21 2:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kai Henningsen, linux-kernel
Hi,
On Sat, 20 Jan 2001, Linus Torvalds wrote:
> But think like a good hardware designer.
>
> In 99% of all cases, where do you want the results of a read to end up?
> Where do you want the contents of a write to come from?
>
> Right. Memory.
>
> Now, optimize for the common case. Make the common case go as fast as you
> can, with as little latency and as high bandwidth as you can.
>
> What kind of hardware would _you_ design for the point-to-point link?
>
> I'm claiming that you'd do a nice DMA engine for each link point. There
> wouldn't be any reason to have any other buffers (except, of course,
> minimal buffers inside the IO chip itself - not for the whole packet, but
> for just being able to handle cases where you don't have 100% access to
> the memory bus all the time - and for doing things like burst reads and
> writes to memory etc).
>
> I'm _not_ seeing the point for a high-performance link to have a generic
> packet buffer.
I completely agree, if we are talking about standard pc hardware. I was
more thinking about some dedicated hardware, where you want to get the
data directly to the correct place. If the hardware does a bit more with
the data, you need large buffers. In a standard pc the main cpu does most
of the data processing, but in dedicated hardware you might have several
cards, each with its own logic and memory, and here the cpu only manages
that stuff. You can do all this of course from user space, but that means
you have to copy the data around, which you don't want with such
hardware, when the kernel can help you a bit.
bye, Roman
* Re: Is sendfile all that sexy?
2001-01-21 0:33 ` Linus Torvalds
2001-01-21 1:29 ` David Schwartz
2001-01-21 2:42 ` Roman Zippel
@ 2001-01-21 9:52 ` James Sutherland
2001-01-21 10:02 ` Ingo Molnar
2001-01-22 9:52 ` Helge Hafting
2 siblings, 2 replies; 109+ messages in thread
From: James Sutherland @ 2001-01-21 9:52 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Roman Zippel, Kai Henningsen, linux-kernel
On Sat, 20 Jan 2001, Linus Torvalds wrote:
>
>
> On Sat, 20 Jan 2001, Roman Zippel wrote:
> >
> > On Sat, 20 Jan 2001, Linus Torvalds wrote:
> >
> > > But point-to-point also means that you don't get any real advantage from
> > > doing things like device-to-device DMA. Because the links are
> > > asynchronous, you need buffers in between them anyway, and there is no
> > > bandwidth advantage of not going through the hub if the topology is a
> > > pretty normal "star" kind of thing. And you _do_ want the star topology,
> > > because in the end most of the bandwidth you want concentrated at the
> > > point that uses it.
> >
> > I agree, but who says, that the buffer always has to be the main memory?
>
> It doesn't _have_ to be.
>
> But think like a good hardware designer.
>
> In 99% of all cases, where do you want the results of a read to end up?
> Where do you want the contents of a write to come from?
>
> Right. Memory.
For many applications, yes - but think about a file server for a moment.
99% of the data read from the RAID (or whatever) is really aimed at the
appropriate NIC - going via main memory would just slow things down.
Take a heavily laden webserver. With a nice intelligent NIC and RAID
controller, you might have the httpd write the header to this NIC, then
have the NIC and RAID controller handle the sendfile operation themselves
- without ever touching the OS with this data.
> Now, optimize for the common case. Make the common case go as fast as
> you can, with as little latency and as high bandwidth as you can.
>
> What kind of hardware would _you_ design for the point-to-point link?
>
> I'm claiming that you'd do a nice DMA engine for each link point. There
> wouldn't be any reason to have any other buffers (except, of course,
> minimal buffers inside the IO chip itself - not for the whole packet, but
> for just being able to handle cases where you don't have 100% access to
> the memory bus all the time - and for doing things like burst reads and
> writes to memory etc).
>
> I'm _not_ seeing the point for a high-performance link to have a generic
> packet buffer.
I'd agree with that, but I would want peripherals to be able to send data
to each other without touching the host memory - think about playing video
files with an accelerator (just pipe the files from disk to the
accelerator), music with an "intelligent" sound card (just pipe the music
to the card), video capture, file servers, CD burning...
Having an Ethernet-style point-to-point "network" (everything connected as
a star, with something intelligent in the middle to direct the data where
it needs to go) makes sense, but don't assume everything is heading for
the host's memory. DMA straight to/from a "switch" would be a nice
solution, though...
James.
* Re: Is sendfile all that sexy?
2001-01-21 9:52 ` James Sutherland
@ 2001-01-21 10:02 ` Ingo Molnar
2001-01-22 9:52 ` Helge Hafting
1 sibling, 0 replies; 109+ messages in thread
From: Ingo Molnar @ 2001-01-21 10:02 UTC (permalink / raw)
To: James Sutherland
Cc: Linus Torvalds, Roman Zippel, Kai Henningsen, Linux Kernel List
On Sun, 21 Jan 2001, James Sutherland wrote:
> For many applications, yes - but think about a file server for a
> moment. 99% of the data read from the RAID (or whatever) is really
> aimed at the appropriate NIC - going via main memory would just slow
> things down.
patently wrong. Compare the bandwidth of PCI and the bandwidth of memory
controllers. It's slower, has higher latency, and uses up more valuable
(PCI) bandwidth to do PCI->PCI transfers. The number of situations where
PCI->PCI transactions are the preferred method is *very* limited, and i
think we should deal with them when we see them. But this has been said
at the very beginning of this thread already, please read it all ...
Ingo
* Re: Is sendfile all that sexy?
2001-01-21 0:25 ` Linus Torvalds
2001-01-21 2:03 ` Roman Zippel
@ 2001-01-21 18:00 ` kuznet
1 sibling, 0 replies; 109+ messages in thread
From: kuznet @ 2001-01-21 18:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: zippel, mingo, linux-kernel
Hello!
> "struct page" tricks, some macros etc WILL NOT WORK. In particular, we do
> not currently have a good "page_to_bus/phys()" function. That means that
> anybody trying to do DMA to this page is currently screwed, simply because
> he has no good way of getting the physical address.
We already have similar problem with 64bit dma on Intel.
Namely, we need page_to_bus() and, moreover, we need 64bit bus addresses
for devices understanding them.
Now we make this in acenic like:
#if defined(CONFIG_X86) && defined(CONFIG_HIGHMEM)
#define BITS_PER_DMAADDR 64
typedef unsigned long long dmaaddr_high_t;

static inline dmaaddr_high_t
pci_map_single_high(struct pci_dev *hwdev, struct page *page,
		    int offset, size_t size, int dir)
{
	dmaaddr_high_t phys;

	phys = (page - mem_map) * (unsigned long long) PAGE_SIZE + offset;
	return phys;
}
#else
Ingo, do you remember that we agreed not to consider this code
as "ready for release" until this issue is cleaned up?
I forgot this. 8)8)8)
It seems we can at least remove the direct dependencies on mem_map
by using zone_struct.
Alexey
* RE: Is sendfile all that sexy?
2001-01-18 2:34 ` Olivier Galibert
@ 2001-01-21 21:22 ` LA Walsh
0 siblings, 0 replies; 109+ messages in thread
From: LA Walsh @ 2001-01-21 21:22 UTC (permalink / raw)
To: linux-kernel, torvalds
FYI -
Another use sendfile(2) might be put to: suppose you were to generate
large amounts of data in the kernel -- maybe kernel profiling data, audit
data, whatever.
You want to pull that data out as fast as possible and write it to
a disk or network socket. Normally, I think you'd do a "read/write" that
would transfer the data into user space, then write it back to the target
in system space. With sendfile, it seems, one could write a dump-daemon
that used sendfile to dump the data directly out to a target file descriptor
without it going through user space.
Just make sure the internal 'raw' data is massaged into the format
of a block device and voila! A side benefit would be that data in the
kernel written to the block device would be 'queued' in the block
buffers, with the buffers marked 'dirty' and needing to be written out.
The device driver marks the buffers as clean once they are pushed out
of an fd: by doing a 'seek' to a new (later) position in the file, whole
buffers before that point are marked 'clean' and freed.
Seems like this would have the benefit of reusing an existing
buffer management system for buffering while also using a single copy
to get data to the target.
???
-l
--
L A Walsh | Trust Technology, Core Linux, SGI
law@sgi.com | Voice/Vmail: (650) 933-5338
* Re: Is sendfile all that sexy?
2001-01-20 19:26 ` Linus Torvalds
2001-01-20 21:20 ` Roman Zippel
@ 2001-01-21 23:21 ` David Woodhouse
1 sibling, 0 replies; 109+ messages in thread
From: David Woodhouse @ 2001-01-21 23:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: kuznet, linux-kernel
On Sat, 20 Jan 2001, Linus Torvalds wrote:
> There's no no-no here: you can even create the "struct page"s on demand,
> and create a dummy local zone that contains them that they all point back
> to. It should be trivial - nobody else cares about those pages or that
> zone anyway.
>
> This is very much how the MM layer in 2.4.x is set up to work.
>
> That said, nobody has actually done this in practice yet, so there may be
> details to work out, of course. I don't see any fundamental reasons it
> wouldn't easily work, but..
If I follow you correctly, this is how I was planning to provide
execute-in-place support for filesystems on flash chips - allocating
'struct page's and adding them to the page cache on read_inode().
--
dwmw2
* Re: Is sendfile all that sexy?
2001-01-21 9:52 ` James Sutherland
2001-01-21 10:02 ` Ingo Molnar
@ 2001-01-22 9:52 ` Helge Hafting
2001-01-22 13:00 ` James Sutherland
1 sibling, 1 reply; 109+ messages in thread
From: Helge Hafting @ 2001-01-22 9:52 UTC (permalink / raw)
To: James Sutherland, linux-kernel
James Sutherland wrote:
>
> On Sat, 20 Jan 2001, Linus Torvalds wrote:
>
> >
> >
> > On Sat, 20 Jan 2001, Roman Zippel wrote:
> > >
> > > On Sat, 20 Jan 2001, Linus Torvalds wrote:
> > >
> > > > But point-to-point also means that you don't get any real advantage from
> > > > doing things like device-to-device DMA. Because the links are
> > > > asynchronous, you need buffers in between them anyway, and there is no
> > > > bandwidth advantage of not going through the hub if the topology is a
> > > > pretty normal "star" kind of thing. And you _do_ want the star topology,
> > > > because in the end most of the bandwidth you want concentrated at the
> > > > point that uses it.
> > >
> > > I agree, but who says, that the buffer always has to be the main memory?
> >
> > It doesn't _have_ to be.
> >
> > But think like a good hardware designer.
> >
> > In 99% of all cases, where do you want the results of a read to end up?
> > Where do you want the contents of a write to come from?
> >
> > Right. Memory.
>
> For many applications, yes - but think about a file server for a moment.
> 99% of the data read from the RAID (or whatever) is really aimed at the
> appropriate NIC - going via main memory would just slow things down.
>
> Take a heavily laden webserver. With a nice intelligent NIC and RAID
> controller, you might have the httpd write the header to this NIC, then
> have the NIC and RAID controller handle the sendfile operation themselves
> - without ever touching the OS with this data.
And when the next user wants the same webpage/file, you read it from the
RAID again?
Seems to me you lose the benefit of caching stuff in memory with this
scheme.
Sure - the RAID controller might have some cache, but it is usually
smaller than main memory anyway. And then there are things like
retransmissions...
Helge Hafting
* Re: Is sendfile all that sexy?
2001-01-22 9:52 ` Helge Hafting
@ 2001-01-22 13:00 ` James Sutherland
2001-01-23 9:01 ` Helge Hafting
0 siblings, 1 reply; 109+ messages in thread
From: James Sutherland @ 2001-01-22 13:00 UTC (permalink / raw)
To: Helge Hafting; +Cc: linux-kernel
On Mon, 22 Jan 2001, Helge Hafting wrote:
> And when the next user wants the same webpage/file you read it from
> the RAID again? Seems to me you loose the benefit of caching stuff in
> memory with this scheme. Sure - the RAID controller might have some
> cache, but it is usually smaller than main memory anyway.
Hrm... good point. Using "main memory" (whose memory, on a NUMA box??) as
a cache could be a performance boost in some circumstances. On the other
hand, you're eating up a chunk of memory bandwidth which could be used for
other things - even when you only cache in "spare" RAM, how do you decide
who uses that RAM - and whether or not they should?
There certainly comes a point at which not caching in RAM would be a net
win, but ATM the kernel doesn't know enough to determine this.
On a shared bus, probably the best solution would be to have the data sent
to both devices (NIC and RAM) at once?
> And then there are things like retransmissions...
Hopefully handled by an intelligent NIC in most cases; if you're caching
the file in RAM as well (by "CCing" the data there the first time) this is
OK anyway.
Something to think about, but probably more on-topic for linux-futures I
suspect...
James.
* Re: Is sendfile all that sexy?
2001-01-17 19:32 ` Linus Torvalds
2001-01-18 2:34 ` Olivier Galibert
2001-01-18 8:23 ` Rogier Wolff
@ 2001-01-22 18:13 ` Val Henson
2001-01-22 18:27 ` David Lang
2001-01-22 18:54 ` Linus Torvalds
2 siblings, 2 replies; 109+ messages in thread
From: Val Henson @ 2001-01-22 18:13 UTC (permalink / raw)
To: linux-kernel; +Cc: Linus Torvalds
On Wed, Jan 17, 2001 at 11:32:35AM -0800, Linus Torvalds wrote:
> In article <Pine.LNX.4.30.0101171454340.29536-100000@baphomet.bogo.bogus>,
> Ben Mansell <linux-kernel@slimyhorror.com> wrote:
> >
> >The current sendfile() has the limitation that it can't read data from
> >a socket. Would it be another 5-minute hack to remove this limitation, so
> >you could sendfile between sockets? Now _that_ would be sexy :)
>
> I don't think that would be all that sexy at all.
>
> You have to realize, that sendfile() is meant as an optimization, by
> being able to re-use the same buffers that act as the in-kernel page
> cache as buffers for sending data. So you avoid one copy.
>
> However, for socket->socket, we would not have such an advantage. A
> socket->socket sendfile() would not avoid any copies the way the
> networking is done today. That _may_ change, of course. But it might
> not. And I'd rather tell people using sendfile() that you get EINVAL if
> it isn't able to optimize the transfer..
Yes, socket->socket sendfile is not that sexy. I actually did this
for 2.2.16 in the obvious (and stupid) way, copying data into a buffer
and writing it out again. The performance was, unsurprisingly,
_exactly_ identical to a userspace read()/write() loop.
There is a use for an optimized socket->socket transfer - proxying
high speed TCP connections. It would be exciting if the zerocopy
networking framework led to a decent socket->socket transfer.
-VAL
* Re: Is sendfile all that sexy?
2001-01-22 18:13 ` Val Henson
@ 2001-01-22 18:27 ` David Lang
2001-01-22 19:37 ` Val Henson
2001-01-22 18:54 ` Linus Torvalds
1 sibling, 1 reply; 109+ messages in thread
From: David Lang @ 2001-01-22 18:27 UTC (permalink / raw)
To: Val Henson; +Cc: linux-kernel, Linus Torvalds
On Mon, 22 Jan 2001, Val Henson wrote:
> There is a use for an optimized socket->socket transfer - proxying
> high speed TCP connections. It would be exciting if the zerocopy
> networking framework led to a decent socket->socket transfer.
if you are proxying connections you should really be looking at what data
you pass through your proxy.
now replace proxying with routing and I would agree with you (but I'll bet
this is handled in the kernel IP stack anyway)
David Lang
* Re: Is sendfile all that sexy?
2001-01-22 18:13 ` Val Henson
2001-01-22 18:27 ` David Lang
@ 2001-01-22 18:54 ` Linus Torvalds
1 sibling, 0 replies; 109+ messages in thread
From: Linus Torvalds @ 2001-01-22 18:54 UTC (permalink / raw)
To: Val Henson; +Cc: linux-kernel
On Mon, 22 Jan 2001, Val Henson wrote:
> On Wed, Jan 17, 2001 at 11:32:35AM -0800, Linus Torvalds wrote:
> >
> > However, for socket->socket, we would not have such an advantage. A
> > socket->socket sendfile() would not avoid any copies the way the
> > networking is done today. That _may_ change, of course. But it might
> > not. And I'd rather tell people using sendfile() that you get EINVAL if
> > it isn't able to optimize the transfer..
>
> Yes, socket->socket sendfile is not that sexy. I actually did this
> for 2.2.16 in the obvious (and stupid) way, copying data into a buffer
> and writing it out again. The performance was unsurprisingly
> _exactly_ identical to a userspace read()/write() loop.
The thing is, that if I knew that I could always beat the user-space
numbers (by virtue of having fewer system calls etc), I would still
consider "sendfile()" to be ok for it.
But we can actually do _worse_ in sendfile() than in user-space
applications. For example, a userspace "read+write" may know more about
packet boundary behaviour etc, which sendfile is totally clueless about,
so a userspace application might actually get _better_ performance by
doing it by hand.
That's why I currently want sendfile() to only work for the things we
_know_ we can do better.
Linus
* Re: Is sendfile all that sexy?
2001-01-22 18:27 ` David Lang
@ 2001-01-22 19:37 ` Val Henson
2001-01-22 20:01 ` David Lang
0 siblings, 1 reply; 109+ messages in thread
From: Val Henson @ 2001-01-22 19:37 UTC (permalink / raw)
To: David Lang; +Cc: linux-kernel, Linus Torvalds
On Mon, Jan 22, 2001 at 10:27:58AM -0800, David Lang wrote:
> On Mon, 22 Jan 2001, Val Henson wrote:
>
> > There is a use for an optimized socket->socket transfer - proxying
> > high speed TCP connections. It would be exciting if the zerocopy
> > networking framework led to a decent socket->socket transfer.
>
> if you are proxying connextions you should really be looking at what data
> you pass through your proxy.
>
> now replay proxying with routing and I would agree with you (but I'll bet
> this is handled in the kernel IP stack anyway)
Well, there is a (real-world) case where your TCP proxy doesn't want
to look at the data and you can't use IP forwarding. If you have TCP
connections between networks that have very different MTU's, using IP
forwarding will result in tiny packets on the large MTU networks.
So who cares? Some machines, notably Crays and NEC's, have a severely
rate-limited network stack and can only transmit up to about 3500
packets per second. That's 40 Mbps on a 1500 byte MTU network, but
greater than line speed on HIPPI (65280 MTU, 800 Mbps).
So, for a rate-limited network stack on a HIPPI network, the best way
to talk to a machine on a gigabit ethernet network is through a TCP
proxy which just doesn't care about the data going through it. Hence
my interest in socket->socket sendfile().
I'll admit this is an odd corner case which isn't important enough to
justify socket->socket sendfile() on its own. But this odd corner
case did make enough money to pay my salary for years to come. :)
-VAL
* Re: Is sendfile all that sexy?
2001-01-22 19:37 ` Val Henson
@ 2001-01-22 20:01 ` David Lang
2001-01-22 22:04 ` Ion Badulescu
0 siblings, 1 reply; 109+ messages in thread
From: David Lang @ 2001-01-22 20:01 UTC (permalink / raw)
To: Val Henson; +Cc: linux-kernel, Linus Torvalds
how about always_defragment (or whatever the option is now called) so that
your routing box always reassembles packets and then fragments them to the
correct size for the next segment? wouldn't this do the job?
David Lang
On Mon, 22 Jan 2001, Val Henson wrote:
> Date: Mon, 22 Jan 2001 12:37:07 -0700
> From: Val Henson <vhenson@esscom.com>
> To: David Lang <dlang@diginsite.com>
> Cc: linux-kernel@vger.kernel.org, Linus Torvalds <torvalds@transmeta.com>
> Subject: Re: Is sendfile all that sexy?
>
> On Mon, Jan 22, 2001 at 10:27:58AM -0800, David Lang wrote:
> > On Mon, 22 Jan 2001, Val Henson wrote:
> >
> > > There is a use for an optimized socket->socket transfer - proxying
> > > high speed TCP connections. It would be exciting if the zerocopy
> > > networking framework led to a decent socket->socket transfer.
> >
> > if you are proxying connections you should really be looking at what data
> > you pass through your proxy.
> >
> > now replace proxying with routing and I would agree with you (but I'll bet
> > this is handled in the kernel IP stack anyway)
>
> Well, there is a (real-world) case where your TCP proxy doesn't want
> to look at the data and you can't use IP forwarding. If you have TCP
> connections between networks that have very different MTUs, using IP
> forwarding will result in tiny packets on the large-MTU networks.
>
> So who cares? Some machines, notably Crays and NECs, have a severely
> rate-limited network stack and can only transmit up to about 3500
> packets per second. That's 40 Mbps on a 1500-byte MTU network, but
> greater than line speed on HIPPI (65280 MTU, 800 Mbps).
>
> So, for a rate-limited network stack on a HIPPI network, the best way
> to talk to a machine on a gigabit ethernet network is through a TCP
> proxy which just doesn't care about the data going through it. Hence
> my interest in socket->socket sendfile().
>
> I'll admit this is an odd corner case which isn't important enough to
> justify socket->socket sendfile() on its own. But this odd corner
> case did make enough money to pay my salary for years to come. :)
>
> -VAL
>
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-22 20:01 ` David Lang
@ 2001-01-22 22:04 ` Ion Badulescu
0 siblings, 0 replies; 109+ messages in thread
From: Ion Badulescu @ 2001-01-22 22:04 UTC (permalink / raw)
To: David Lang; +Cc: linux-kernel, Linus Torvalds, Val Henson
On Mon, 22 Jan 2001 12:01:23 -0800 (PST), David Lang <dlang@diginsite.com> wrote:
> how about always_defragment (or whatever the option is now called) so that
> your routing box always reassembles packets and then fragments them to the
> correct size for the next segment? wouldn't this do the job?
It doesn't help with TCP, because the negotiated MSS will always be derived
from the 1500-byte MTU, and thus there won't be any fragments to re-assemble.
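Concretely, each TCP endpoint derives its MSS from its own interface MTU
(for IPv4 with no options, MTU minus 40 bytes of IP and TCP headers), so the
segments are always small enough that no fragmentation occurs on the 1500-MTU
path in the first place. A trivial illustrative helper:

```c
/* MSS a host advertises for a given interface MTU: the MTU minus
 * the 20-byte IPv4 header and 20-byte TCP header (no options).
 * Illustrative only -- real stacks also account for options. */
int tcp_mss_for_mtu(int mtu)
{
    return mtu - 40;
}
```

So a 1500-MTU ethernet peer caps segment payloads at 1460 bytes, regardless
of how large the MTU is on the other side of the proxy.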
> On Mon, 22 Jan 2001, Val Henson wrote:
>
>> Well, there is a (real-world) case where your TCP proxy doesn't want
>> to look at the data and you can't use IP forwarding. If you have TCP
>> connections between networks that have very different MTU's, using IP
>> forwarding will result in tiny packets on the large MTU networks.
There is another real-world case: a load-balancing proxy. socket->socket
sendfile would allow the proxy to open a non-keepalive connection to the
backend server, send the request, and then just link the two sockets
together using sendfile.
Of course, some changes would have to be made to the API. An asynchronous
sendsocket()/sendfile() system call would be just lovely, in fact. :-)
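Without a socket-source sendfile(), such a proxy's inner loop has to bounce
every block through a userspace buffer, paying two copies per block. A sketch
of that fallback loop (plain blocking I/O; invented names, not Samba or
kernel code) shows what a socket->socket sendfile() would collapse into a
single in-kernel call:

```c
#include <errno.h>
#include <unistd.h>

/* Relay bytes from one connected socket to another until the
 * source peer closes. Returns 0 on clean EOF, -1 on error.
 * A socket-source sendfile() would replace this whole loop. */
int relay(int from, int to)
{
    char buf[8192];

    for (;;) {
        ssize_t n = read(from, buf, sizeof(buf));
        if (n == 0)
            return 0;                  /* peer closed: done */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        char *p = buf;
        while (n > 0) {                /* short writes are normal on sockets */
            ssize_t w = write(to, p, (size_t)n);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            p += w;
            n -= w;
        }
    }
}
```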
Ion
--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-22 13:00 ` James Sutherland
@ 2001-01-23 9:01 ` Helge Hafting
2001-01-23 9:37 ` James Sutherland
0 siblings, 1 reply; 109+ messages in thread
From: Helge Hafting @ 2001-01-23 9:01 UTC (permalink / raw)
To: James Sutherland; +Cc: linux-kernel
James Sutherland wrote:
>
> On Mon, 22 Jan 2001, Helge Hafting wrote:
>
> > And when the next user wants the same webpage/file you read it from
> > the RAID again? Seems to me you lose the benefit of caching stuff in
> > memory with this scheme. Sure - the RAID controller might have some
> > cache, but it is usually smaller than main memory anyway.
>
> Hrm... good point. Using "main memory" (whose memory, on a NUMA box??) as
> a cache could be a performance boost in some circumstances. On the other
> hand, you're eating up a chunk of memory bandwidth which could be used for
> other things - even when you only cache in "spare" RAM, how do you decide
> who uses that RAM - and whether or not they should?
If we will need it again soon - cache it. If not, consider
your device->device scheme. What we will need is often impossible to
know, so approximations like LRU are used. You could have an object table
(probably a file table or disk block table) counting how often various
files/objects are referenced. You can then decide to use RAID->NIC
transfers for something that hasn't been read before, and the memory
cache when something is re-read for the nth time in a given time
interval.
"n" and the time interval depend on how much cache you have, and
the size of your working set.
This might be a win, maybe even a big win under some circumstances.
But considering that it works for only a few devices, and how
complicated it is, the conclusion becomes: don't do it for
standard Linux.
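For concreteness, the counting heuristic described above might look like the
sketch below. The table layout, threshold, and interval are invented for
illustration and not taken from any kernel code:

```c
/* Sketch of the re-read-count heuristic: count how often each
 * object is requested inside a time interval, and only promote it
 * to the memory cache once it has been read n times; first reads
 * go device->device. */

#define TABLE_SIZE    1024
#define PROMOTE_AFTER 2        /* "n": reads before caching */
#define INTERVAL      60       /* seconds */

struct ref_entry {
    unsigned long object;      /* e.g. inode number or disk block */
    long first_seen;           /* start of current interval */
    int count;
};

static struct ref_entry table[TABLE_SIZE];

/* Returns nonzero if the object should be served from (and added
 * to) the memory cache, zero if a direct device->device transfer
 * is preferable. */
int should_cache(unsigned long object, long now)
{
    struct ref_entry *e = &table[object % TABLE_SIZE];

    if (e->object != object || now - e->first_seen > INTERVAL) {
        e->object = object;    /* new object, or interval expired */
        e->first_seen = now;
        e->count = 1;
        return 0;              /* first read: go device->device */
    }
    return ++e->count >= PROMOTE_AFTER;
}
```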
You may of course try to make super-performance
servers that work for a special hw combination, with a single
very optimized linux driver taking care of the RAID adapter, the NIC(s),
the fs, parts of the network stack and possibly the web server too.
Helge Hafting
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-23 9:01 ` Helge Hafting
@ 2001-01-23 9:37 ` James Sutherland
0 siblings, 0 replies; 109+ messages in thread
From: James Sutherland @ 2001-01-23 9:37 UTC (permalink / raw)
To: Helge Hafting; +Cc: linux-kernel
On Tue, 23 Jan 2001, Helge Hafting wrote:
> James Sutherland wrote:
> >
> > On Mon, 22 Jan 2001, Helge Hafting wrote:
> >
> > > And when the next user wants the same webpage/file you read it from
> > > the RAID again? Seems to me you lose the benefit of caching stuff in
> > > memory with this scheme. Sure - the RAID controller might have some
> > > cache, but it is usually smaller than main memory anyway.
> >
> > Hrm... good point. Using "main memory" (whose memory, on a NUMA box??) as
> > a cache could be a performance boost in some circumstances. On the other
> > hand, you're eating up a chunk of memory bandwidth which could be used for
> > other things - even when you only cache in "spare" RAM, how do you decide
> > who uses that RAM - and whether or not they should?
>
> If we will need it again soon - cache it. If not, consider
> your device->device scheme. What we will need is often impossible to
> know, so approximations like LRU are used. You could have an object table
> (probably a file table or disk block table) counting how often various
> files/objects are referenced. You can then decide to use RAID->NIC
> transfers for something that hasn't been read before, and the memory
> cache when something is re-read for the nth time in a given time
> interval.
I think my compromise of sending it to both simultaneously is better: if
you do reuse it, you've just got a cache hit (win); if not, you've just
burned some RAM bandwidth, which isn't a catastrophe.
> This might be a win, maybe even a big win under some circumstances.
> But considering that it works for only a few devices, and how
> complicated it is, the conclusion becomes: don't do it for
> standard Linux.
Eh? This is a hardware system - Linux has very little hardware in it :-)
> You may of course try to make super-performance servers that work for
> a special hw combination, with a single very optimized linux driver
> taking care of the RAID adapter, the NIC(s), the fs, parts of the
> network stack and possibly the web server too.
Actually, I'd like it to be a much more generic thing. If you get an
"intelligent" NIC, it will have, say, a StrongARM processor on it. Why
shouldn't the code running on that processor be supplied by the kernel, as
part of the NIC driver? Given a powerful enough CPU on the NIC, you could
offload a useful chunk of the Linux network stack to it.
Or a RAID adapter - instead of coming with the manufacturer's proprietary
code for striping etc., upload Linux's own RAID software to the CPU. Run
some subset of X's primitives on the graphics card. On an I2O-type system
(dedicated ARM processor or similar for handling I/O), run the low-level
caching stuff, perhaps, or some of the FS code.
Over the next few years, we'll see a lot of little baby CPUs in our PCs,
on network cards, video cards etc. I'd like to see Linux able to take
advantage of this sort of off-load capability where possible.
James.
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-14 20:22 ` Linus Torvalds
` (2 preceding siblings ...)
2001-01-15 15:24 ` Jonathan Thackray
@ 2001-01-24 0:58 ` Sasi Peter
2001-01-24 8:44 ` James Sutherland
2001-01-25 10:20 ` Anton Blanchard
3 siblings, 2 replies; 109+ messages in thread
From: Sasi Peter @ 2001-01-24 0:58 UTC (permalink / raw)
To: linux-kernel
On 14 Jan 2001, Linus Torvalds wrote:
> The only obvious use for it is file serving, and as high-performance
> file serving tends to end up as a kernel module in the end anyway (the
> only hold-out is samba, and that's been discussed too), "sendfile()"
> really is more a proof of concept than anything else.
No plans for samba to use sendfile? Even better, make it a tux-like module?
(that would enable Netware-like performance on Linux with the standard
kernel... would be cool after all ;)
--
SaPE - Peter, Sasi - mailto:sape@sch.hu - http://sape.iq.rulez.org/
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-24 0:58 ` Sasi Peter
@ 2001-01-24 8:44 ` James Sutherland
2001-01-25 10:20 ` Anton Blanchard
1 sibling, 0 replies; 109+ messages in thread
From: James Sutherland @ 2001-01-24 8:44 UTC (permalink / raw)
To: Sasi Peter; +Cc: linux-kernel
On Wed, 24 Jan 2001, Sasi Peter wrote:
> On 14 Jan 2001, Linus Torvalds wrote:
>
> > The only obvious use for it is file serving, and as high-performance
> > file serving tends to end up as a kernel module in the end anyway (the
> > only hold-out is samba, and that's been discussed too), "sendfile()"
> > really is more a proof of concept than anything else.
>
> No plans for samba to use sendfile? Even better, make it a tux-like module?
> (that would enable Netware-like performance on Linux with the standard
> kernel... would be cool after all ;)
AIUI, Jeff Merkey was working on loading "userspace" apps into the kernel
to tackle this sort of problem generically. I don't know if he's tried it
with Samba - the forking would probably be a problem...
James.
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
@ 2001-01-24 15:12 Sasi Peter
2001-01-24 15:29 ` James Sutherland
2001-01-25 1:11 ` Alan Cox
0 siblings, 2 replies; 109+ messages in thread
From: Sasi Peter @ 2001-01-24 15:12 UTC (permalink / raw)
To: James Sutherland, linux-kernel
> AIUI, Jeff Merkey was working on loading "userspace" apps into the kernel
> to tackle this sort of problem generically. I don't know if he's tried it
> with Samba - the forking would probably be a problem...
I think that is not what we need. Ingo once wrote that, since HTTP
serving can also be viewed as a kind of file serving, it should be
possible to create a TUX-like module for the same framework that serves
using the SMB protocol instead of HTTP...
--
SaPE / Sasi Péter / mailto:sape@sch.hu / http://sape.iq.rulez.org/
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-24 15:12 Sasi Peter
@ 2001-01-24 15:29 ` James Sutherland
2001-01-25 1:11 ` Alan Cox
1 sibling, 0 replies; 109+ messages in thread
From: James Sutherland @ 2001-01-24 15:29 UTC (permalink / raw)
To: Sasi Peter; +Cc: linux-kernel
On Wed, 24 Jan 2001, Sasi Peter wrote:
> > AIUI, Jeff Merkey was working on loading "userspace" apps into the kernel
> > to tackle this sort of problem generically. I don't know if he's tried it
> > with Samba - the forking would probably be a problem...
>
> I think that is not what we need. Ingo once wrote that, since HTTP
> serving can also be viewed as a kind of file serving, it should be
> possible to create a TUX-like module for the same framework that serves
> using the SMB protocol instead of HTTP...
I must admit I'm a bit sceptical - apart from anything else, Jeff's
approach allows a bug in the server software to blow the whole OS away,
instead of just quietly coring! (Or, worse still, trample on some FS
metadata in RAM... eek!) A TUX module would be a nice idea, although I
haven't even been able to find a proper TUX web page - Google just gave
page after page of mailing list archives and discussion about it :-(
James.
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-24 15:12 Sasi Peter
2001-01-24 15:29 ` James Sutherland
@ 2001-01-25 1:11 ` Alan Cox
2001-01-25 9:06 ` James Sutherland
1 sibling, 1 reply; 109+ messages in thread
From: Alan Cox @ 2001-01-25 1:11 UTC (permalink / raw)
To: Sasi Peter; +Cc: James Sutherland, linux-kernel
> I think that is not what we need. Ingo once wrote that, since HTTP
> serving can also be viewed as a kind of file serving, it should be
> possible to create a TUX-like module for the same framework that serves
> using the SMB protocol instead of HTTP...
Kernel SMB is basically not a sane idea. sendfile can help it though
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-25 1:11 ` Alan Cox
@ 2001-01-25 9:06 ` James Sutherland
2001-01-25 10:42 ` bert hubert
0 siblings, 1 reply; 109+ messages in thread
From: James Sutherland @ 2001-01-25 9:06 UTC (permalink / raw)
To: Alan Cox; +Cc: Sasi Peter, linux-kernel
On Thu, 25 Jan 2001, Alan Cox wrote:
> > I think that is not what we need. Ingo once wrote that, since HTTP
> > serving can also be viewed as a kind of file serving, it should be
> > possible to create a TUX-like module for the same framework that serves
> > using the SMB protocol instead of HTTP...
>
>
> Kernel SMB is basically not a sane idea. sendfile can help it though
Right now, ISTR Samba is still a forking daemon?? This has less impact on
performance than it would for an httpd, because of the long-lived
sessions, but rewriting it as a state machine (no forking, threads or
other crap, just use non-blocking I/O) would probably make much more
sense.
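The shape of such a state machine is roughly this toy skeleton: each
connection carries explicit state that the event loop advances whenever
poll()/select() reports its fd ready. The state names and sizes are invented
for illustration; a real SMB server would need many more states:

```c
#include <stddef.h>

/* Per-connection state for a single-process, non-blocking server.
 * No forking or threads: the event loop just advances whichever
 * connection has I/O ready. */
enum conn_state { ST_READING, ST_WRITING, ST_DONE };

struct conn {
    enum conn_state state;
    size_t need;               /* bytes left in the current phase */
};

/* Called after a non-blocking read/write moved 'n' bytes for this
 * connection; advances the state machine, tolerating partial I/O. */
void conn_advance(struct conn *c, size_t n)
{
    if (n < c->need) {
        c->need -= n;          /* partial I/O: stay in this state */
        return;
    }
    switch (c->state) {
    case ST_READING:           /* full request in: start the reply */
        c->state = ST_WRITING;
        c->need = 512;         /* pretend reply size */
        break;
    case ST_WRITING:           /* reply flushed: connection done */
        c->state = ST_DONE;
        c->need = 0;
        break;
    case ST_DONE:
        break;
    }
}
```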
James.
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-24 0:58 ` Sasi Peter
2001-01-24 8:44 ` James Sutherland
@ 2001-01-25 10:20 ` Anton Blanchard
2001-01-25 10:58 ` Sasi Peter
1 sibling, 1 reply; 109+ messages in thread
From: Anton Blanchard @ 2001-01-25 10:20 UTC (permalink / raw)
To: Sasi Peter; +Cc: linux-kernel
> No plans for samba to use sendfile? Even better, make it a tux-like module?
> (that would enable Netware-like performance on Linux with the standard
> kernel... would be cool after all ;)
I have patches for samba to do sendfile. Making a tux module does not make
sense to me, especially since we are nowhere near the limits of samba in
userspace. Once userspace samba can run no faster, then we should think
about other options.
Anton
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-25 9:06 ` James Sutherland
@ 2001-01-25 10:42 ` bert hubert
2001-01-25 12:14 ` James Sutherland
0 siblings, 1 reply; 109+ messages in thread
From: bert hubert @ 2001-01-25 10:42 UTC (permalink / raw)
To: linux-kernel
On Thu, Jan 25, 2001 at 09:06:33AM +0000, James Sutherland wrote:
> performance than it would for an httpd, because of the long-lived
> sessions, but rewriting it as a state machine (no forking, threads or
> other crap, just use non-blocking I/O) would probably make much more
> sense.
From a kernel coder's perspective, possibly. But a lot of SMB details are
pretty convoluted. State machines may produce more efficient code but can be
hell to maintain and expand. Bugs can hide in lots of corners.
Regards,
bert hubert
--
PowerDNS Versatile DNS Services
Trilab The Technology People
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-25 10:20 ` Anton Blanchard
@ 2001-01-25 10:58 ` Sasi Peter
2001-01-26 6:10 ` Anton Blanchard
0 siblings, 1 reply; 109+ messages in thread
From: Sasi Peter @ 2001-01-25 10:58 UTC (permalink / raw)
To: Anton Blanchard; +Cc: linux-kernel
On Thu, 25 Jan 2001, Anton Blanchard wrote:
> I have patches for samba to do sendfile. Making a tux module does not make
> sense to me, especially since we are nowhere near the limits of samba in
> userspace. Once userspace samba can run no faster, then we should think
> about other options.
Do you have it at a URL?
--
SaPE - Peter, Sasi - mailto:sape@sch.hu - http://sape.iq.rulez.org/
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-25 10:42 ` bert hubert
@ 2001-01-25 12:14 ` James Sutherland
0 siblings, 0 replies; 109+ messages in thread
From: James Sutherland @ 2001-01-25 12:14 UTC (permalink / raw)
To: bert hubert; +Cc: linux-kernel
On Thu, 25 Jan 2001, bert hubert wrote:
> On Thu, Jan 25, 2001 at 09:06:33AM +0000, James Sutherland wrote:
>
> > performance than it would for an httpd, because of the long-lived
> > sessions, but rewriting it as a state machine (no forking, threads or
> > other crap, just use non-blocking I/O) would probably make much more
> > sense.
>
> From a kernel coder's perspective, possibly. But a lot of SMB details are
> pretty convoluted. State machines may produce more efficient code but can be
> hell to maintain and expand. Bugs can hide in lots of corners.
I said they were good from a performance PoV - I didn't say they were
easy! Obviously there are reasons why the Samba guys have done what they
have. In fact, some parts of Samba ARE implemented as state machines to
some extent; presumably the remainder were considered too difficult to
reimplement that way for the time being.
James.
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-25 10:58 ` Sasi Peter
@ 2001-01-26 6:10 ` Anton Blanchard
2001-01-26 11:46 ` David S. Miller
0 siblings, 1 reply; 109+ messages in thread
From: Anton Blanchard @ 2001-01-26 6:10 UTC (permalink / raw)
To: Sasi Peter; +Cc: linux-kernel
> Do you have it at a URL?
The patch is small so I have attached it to this email. It should apply
to the samba CVS tree. Remember this is still a hack, and I need to add
code to handle the case where the file is truncated and we sendfile() less
than we promised. (After talking to tridge and davem, this should be fixed shortly.)
There is a lot more going on than in the web serving case, so
sendfile+zero copy is not going to help us as much as it did for the tux
guys. For example currently on 2.4.0 + zero copy patches:
anton@drongo:~/dbench$ ~anton/samba/source/bin/smbtorture //otherhost/netbench -U% -N 15 NBW95
read/write:
Throughput 16.5478 MB/sec (NB=20.6848 MB/sec 165.478 MBit/sec)
sendfile:
Throughput 17.0128 MB/sec (NB=21.266 MB/sec 170.128 MBit/sec)
Of course there is still lots to be done :)
Cheers,
Anton
diff -u -u -r1.195 includes.h
--- source/include/includes.h 2000/12/06 00:05:14 1.195
+++ source/include/includes.h 2001/01/26 05:38:51
@@ -871,7 +871,8 @@
/* default socket options. Dave Miller thinks we should default to TCP_NODELAY
given the socket IO pattern that Samba uses */
-#ifdef TCP_NODELAY
+
+#if 0
#define DEFAULT_SOCKET_OPTIONS "TCP_NODELAY"
#else
#define DEFAULT_SOCKET_OPTIONS ""
diff -u -u -r1.257 reply.c
--- source/smbd/reply.c 2001/01/24 19:34:53 1.257
+++ source/smbd/reply.c 2001/01/26 05:38:53
@@ -2383,6 +2391,51 @@
END_PROFILE(SMBreadX);
return(ERROR(ERRDOS,ERRlock));
}
+
+#if 1
+ /* We can use sendfile if it is not chained */
+ if (CVAL(inbuf,smb_vwv0) == 0xFF) {
+ off_t tmpoffset;
+ struct stat buf;
+ int flags = 0;
+
+ nread = smb_maxcnt;
+
+ fstat(fsp->fd, &buf);
+ if (startpos > buf.st_size)
+ return(UNIXERROR(ERRDOS,ERRnoaccess));
+ if (nread > (buf.st_size - startpos))
+ nread = (buf.st_size - startpos);
+
+ SSVAL(outbuf,smb_vwv5,nread);
+ SSVAL(outbuf,smb_vwv6,smb_offset(data,outbuf));
+ SSVAL(smb_buf(outbuf),-2,nread);
+ CVAL(outbuf,smb_vwv0) = 0xFF;
+ set_message(outbuf,12,nread,False);
+
+#define MSG_MORE 0x8000
+ if (nread > 0)
+ flags = MSG_MORE;
+ if (send(smbd_server_fd(), outbuf, data - outbuf, flags) == -1)
+ DEBUG(0,("reply_read_and_X: send ERROR!\n"));
+
+ tmpoffset = startpos;
+ while(nread) {
+ int nwritten;
+ nwritten = sendfile(smbd_server_fd(), fsp->fd, &tmpoffset, nread);
+ if (nwritten == -1)
+ DEBUG(0,("reply_read_and_X: sendfile ERROR!\n"));
+
+ if (!nwritten)
+ break;
+
+ nread -= nwritten;
+ }
+
+ return -1;
+ }
+#endif
+
nread = read_file(fsp,data,startpos,smb_maxcnt);
if (nread < 0) {
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-26 6:10 ` Anton Blanchard
@ 2001-01-26 11:46 ` David S. Miller
2001-01-26 14:12 ` Anton Blanchard
0 siblings, 1 reply; 109+ messages in thread
From: David S. Miller @ 2001-01-26 11:46 UTC (permalink / raw)
To: Anton Blanchard; +Cc: Sasi Peter, linux-kernel
Anton Blanchard writes:
> diff -u -u -r1.257 reply.c
> --- source/smbd/reply.c 2001/01/24 19:34:53 1.257
> +++ source/smbd/reply.c 2001/01/26 05:38:53
> @@ -2383,6 +2391,51 @@
...
> + while(nread) {
> + int nwritten;
> + nwritten = sendfile(smbd_server_fd(), fsp->fd, &tmpoffset, nread);
> + if (nwritten == -1)
> + DEBUG(0,("reply_read_and_X: sendfile ERROR!\n"));
> +
> + if (!nwritten)
> + break;
> +
> + nread -= nwritten;
> + }
> +
> + return -1;
Anton, why are you always returning -1 (which means error for the
smb_message[] array functions) when using sendfile?
Aren't you supposed to return the number of bytes output or
something like this?
I'm probably missing something subtle here, so just let me
know what I missed.
Thanks.
Later,
David S. Miller
davem@redhat.com
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: Is sendfile all that sexy?
2001-01-26 11:46 ` David S. Miller
@ 2001-01-26 14:12 ` Anton Blanchard
0 siblings, 0 replies; 109+ messages in thread
From: Anton Blanchard @ 2001-01-26 14:12 UTC (permalink / raw)
To: David S. Miller; +Cc: Sasi Peter, linux-kernel
Hi Dave,
How are the VB withdrawal symptoms going? :)
> Anton, why are you always returning -1 (which means error for the
> smb_message[] array functions) when using sendfile?
Returning -1 tells the higher level code that we actually sent the bytes
out ourselves and not to bother doing it.
> Aren't you supposed to return the number of bytes output or
> something like this?
Only if you want the code to do a send() on outbuf, which we don't here.
Cheers,
Anton
^ permalink raw reply [flat|nested] 109+ messages in thread
end of thread, other threads:[~2001-01-26 14:16 UTC | newest]
Thread overview: 109+ messages
2001-01-17 15:02 Is sendfile all that sexy? Ben Mansell
2000-01-01 2:10 ` Pavel Machek
2001-01-17 19:32 ` Linus Torvalds
2001-01-18 2:34 ` Olivier Galibert
2001-01-21 21:22 ` LA Walsh
2001-01-18 8:23 ` Rogier Wolff
2001-01-18 10:01 ` Andreas Dilger
2001-01-18 11:04 ` Russell Leighton
2001-01-18 16:36 ` Larry McVoy
2001-01-19 1:53 ` Linus Torvalds
2001-01-18 16:24 ` Linus Torvalds
2001-01-18 18:46 ` Kai Henningsen
2001-01-18 18:58 ` Roman Zippel
2001-01-18 19:42 ` Linus Torvalds
2001-01-19 0:18 ` Roman Zippel
2001-01-19 1:14 ` Linus Torvalds
2001-01-19 6:57 ` Alan Cox
2001-01-19 10:13 ` Roman Zippel
2001-01-19 10:55 ` Andre Hedrick
2001-01-19 20:18 ` kuznet
2001-01-19 21:45 ` Linus Torvalds
2001-01-20 18:53 ` kuznet
2001-01-20 19:26 ` Linus Torvalds
2001-01-20 21:20 ` Roman Zippel
2001-01-21 0:25 ` Linus Torvalds
2001-01-21 2:03 ` Roman Zippel
2001-01-21 18:00 ` kuznet
2001-01-21 23:21 ` David Woodhouse
2001-01-20 15:36 ` Kai Henningsen
2001-01-20 21:01 ` Linus Torvalds
2001-01-20 21:10 ` Mo McKinlay
2001-01-20 22:24 ` Roman Zippel
2001-01-21 0:33 ` Linus Torvalds
2001-01-21 1:29 ` David Schwartz
2001-01-21 2:42 ` Roman Zippel
2001-01-21 9:52 ` James Sutherland
2001-01-21 10:02 ` Ingo Molnar
2001-01-22 9:52 ` Helge Hafting
2001-01-22 13:00 ` James Sutherland
2001-01-23 9:01 ` Helge Hafting
2001-01-23 9:37 ` James Sutherland
2001-01-18 19:51 ` Rick Jones
2001-01-18 12:17 ` Peter Samuelson
2001-01-22 18:13 ` Val Henson
2001-01-22 18:27 ` David Lang
2001-01-22 19:37 ` Val Henson
2001-01-22 20:01 ` David Lang
2001-01-22 22:04 ` Ion Badulescu
2001-01-22 18:54 ` Linus Torvalds
-- strict thread matches above, loose matches on Subject: below --
2001-01-24 15:12 Sasi Peter
2001-01-24 15:29 ` James Sutherland
2001-01-25 1:11 ` Alan Cox
2001-01-25 9:06 ` James Sutherland
2001-01-25 10:42 ` bert hubert
2001-01-25 12:14 ` James Sutherland
[not found] <Pine.LNX.4.10.10101190911130.10218-100000@penguin.transmeta.com>
2001-01-19 17:23 ` Rogier Wolff
2001-01-16 13:50 Andries.Brouwer
2001-01-17 6:56 ` Ton Hospel
2001-01-17 7:31 ` Steve VanDevender
2001-01-17 8:09 ` Ton Hospel
2001-01-14 18:29 jamal
2001-01-14 18:50 ` Ingo Molnar
2001-01-14 19:02 ` jamal
2001-01-14 19:09 ` Ingo Molnar
2001-01-14 19:18 ` jamal
2001-01-14 20:22 ` Linus Torvalds
2001-01-14 20:38 ` Ingo Molnar
2001-01-14 21:44 ` Linus Torvalds
2001-01-14 21:49 ` Ingo Molnar
2001-01-14 21:54 ` Gerhard Mack
2001-01-14 22:40 ` Linus Torvalds
2001-01-14 22:45 ` J Sloan
2001-01-15 20:15 ` H. Peter Anvin
2001-01-15 3:43 ` Michael Peddemors
2001-01-15 13:02 ` Florian Weimer
2001-01-15 13:45 ` Tristan Greaves
2001-01-15 1:14 ` Dan Hollis
2001-01-15 15:24 ` Jonathan Thackray
2001-01-15 15:36 ` Matti Aarnio
2001-01-15 20:17 ` H. Peter Anvin
2001-01-15 16:05 ` dean gaudet
2001-01-15 18:34 ` Jonathan Thackray
2001-01-15 18:46 ` Linus Torvalds
2001-01-15 18:58 ` dean gaudet
2001-01-15 19:41 ` Ingo Molnar
2001-01-15 20:33 ` Albert D. Cahalan
2001-01-15 21:00 ` Linus Torvalds
2001-01-16 10:40 ` Felix von Leitner
2001-01-16 11:56 ` Peter Samuelson
2001-01-16 12:37 ` Ingo Molnar
2001-01-16 12:42 ` Ingo Molnar
2001-01-16 12:47 ` Felix von Leitner
2001-01-16 13:48 ` Jamie Lokier
2001-01-16 14:20 ` Felix von Leitner
2001-01-16 15:05 ` David L. Parsley
2001-01-16 15:05 ` Jakub Jelinek
2001-01-16 15:46 ` David L. Parsley
2001-01-18 14:00 ` Laramie Leavitt
2001-01-17 19:27 ` dean gaudet
2001-01-24 0:58 ` Sasi Peter
2001-01-24 8:44 ` James Sutherland
2001-01-25 10:20 ` Anton Blanchard
2001-01-25 10:58 ` Sasi Peter
2001-01-26 6:10 ` Anton Blanchard
2001-01-26 11:46 ` David S. Miller
2001-01-26 14:12 ` Anton Blanchard
2001-01-15 23:16 ` Pavel Machek
2001-01-16 13:47 ` jamal
2001-01-16 14:41 ` Pavel Machek