public inbox for linux-kernel@vger.kernel.org
* [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
@ 2007-11-20  6:53 Ulrich Drepper
  2007-11-20  7:59 ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2007-11-20  6:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, mingo, tglx, torvalds

This patch adds support for setting the O_NONBLOCK flag of the file
descriptors returned by socket, socketpair, and accept.
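
For context, a minimal user-space sketch of the two-step sequence that is
needed when the flag cannot be set at creation time: create the descriptor,
then set O_NONBLOCK with fcntl().  The helper name is made up for
illustration; socket() and fcntl() are the standard POSIX calls.  The patch
lets the kernel apply the flag while the file is being set up instead.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a socket, then switch it to non-blocking
 * mode with a second call.  Returns the descriptor, or -1 on error.
 */
int socket_nonblock_two_step(int family, int type, int protocol)
{
	int fd = socket(family, type, protocol);
	int fl;

	if (fd < 0)
		return -1;

	/* Read the current status flags and add O_NONBLOCK to them. */
	fl = fcntl(fd, F_GETFL);
	if (fl < 0 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would use it as socket_nonblock_two_step(AF_UNIX, SOCK_STREAM, 0)
and can confirm the result with fcntl(fd, F_GETFL) & O_NONBLOCK.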

 socket.c |   15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

--- net/socket.c
+++ net/socket.c
@@ -362,7 +362,7 @@ static int sock_alloc_fd(struct file **filep, int flags)
 	return fd;
 }
 
-static int sock_attach_fd(struct socket *sock, struct file *file)
+static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
@@ -384,7 +384,7 @@ static int sock_attach_fd(struct socket *sock, struct file *file)
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,
 		  &socket_file_ops);
 	SOCK_INODE(sock)->i_fop = &socket_file_ops;
-	file->f_flags = O_RDWR;
+	file->f_flags = O_RDWR | (flags & O_NONBLOCK);
 	file->f_pos = 0;
 	file->private_data = sock;
 
@@ -397,7 +397,7 @@ static int sock_map_fd_flags(struct socket *sock, int flags)
 	int fd = sock_alloc_fd(&newfile, flags);
 
 	if (likely(fd >= 0)) {
-		int err = sock_attach_fd(sock, newfile);
+		int err = sock_attach_fd(sock, newfile, flags);
 
 		if (unlikely(err < 0)) {
 			put_filp(newfile);
@@ -1268,12 +1268,14 @@ asmlinkage long sys_socketpair(int family, int type, int protocol,
 		goto out_release_both;
 	}
 
-	err = sock_attach_fd(sock1, newfile1);
+	err = sock_attach_fd(sock1, newfile1,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(err < 0)) {
 		goto out_fd2;
 	}
 
-	err = sock_attach_fd(sock2, newfile2);
+	err = sock_attach_fd(sock2, newfile2,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(err < 0)) {
 		fput(newfile1);
 		goto out_fd1;
@@ -1423,7 +1425,8 @@ asmlinkage long sys_accept(int fd, struct sockaddr __user *upeer_sockaddr,
 		goto out_put;
 	}
 
-	err = sock_attach_fd(newsock, newfile);
+	err = sock_attach_fd(newsock, newfile,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (err < 0)
 		goto out_fd_simple;
 


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20  6:53 [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets Ulrich Drepper
@ 2007-11-20  7:59 ` David Miller
  2007-11-20 16:04   ` Ulrich Drepper
  2007-11-20 17:54   ` Zach Brown
  0 siblings, 2 replies; 27+ messages in thread
From: David Miller @ 2007-11-20  7:59 UTC (permalink / raw)
  To: drepper; +Cc: linux-kernel, akpm, mingo, tglx, torvalds

From: Ulrich Drepper <drepper@redhat.com>
Date: Tue, 20 Nov 2007 01:53:14 -0500

FWIW, I think this indirect syscall stuff is the most ugly interface
I've ever seen proposed for the kernel.

And I agree with all of the objections raised by both H. Peter Anvin
and Eric Dumazet.

> This patch adds support for setting the O_NONBLOCK flag of the file
> descriptors returned by socket, socketpair, and accept.
 ...
> -	err = sock_attach_fd(sock1, newfile1);
> +	err = sock_attach_fd(sock1, newfile1,
> +			     INDIRECT_PARAM(file_flags, flags));

Where does this INDIRECT_PARAM() macro get defined?  I do not
see it being defined anywhere in these patches.



* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20  7:59 ` David Miller
@ 2007-11-20 16:04   ` Ulrich Drepper
  2007-11-20 18:13     ` H. Peter Anvin
  2007-11-20 21:48     ` David Miller
  2007-11-20 17:54   ` Zach Brown
  1 sibling, 2 replies; 27+ messages in thread
From: Ulrich Drepper @ 2007-11-20 16:04 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, akpm, mingo, tglx, torvalds

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

David Miller wrote:
> FWIW, I think this indirect syscall stuff is the most ugly interface
> I've ever seen proposed for the kernel.

Well, the alternative is to introduce dozens of new interfaces.  It
was Linus who suggested this alternative.  Plus, it seems that for
syslets we need basically the same interface anyway.


> And I agree with all of the objections raised by both H. Peter Anvin
> and Eric Dumazet.

Eric had no arguments and HP's comments lack a viable alternative proposal.


> Where does this INDIRECT_PARAM() macro get defined?  I do not
> see it being defined anywhere in these patches.

Defined in <linux/indirect.h>:

+#define INDIRECT_PARAM(set, name) current->indirect_params.set.name

Not my idea, I was following one review comment.
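
A user-space mock of how the macro expands, for readers without the rest of
the series at hand.  The struct layouts, the mock_task variable and the
new_socket_f_flags() helper are assumptions made for illustration; only the
macro text and the masking expression are taken from the thread.

```c
#include <fcntl.h>

/* Stand-ins for the kernel types; these layouts are assumptions. */
struct file_flags_params { int flags; };
union indirect_params { struct file_flags_params file_flags; };
struct task_struct { union indirect_params indirect_params; };

/* In the kernel, "current" is the running task; here it is one static mock. */
static struct task_struct mock_task;
#define current (&mock_task)

/* The macro as quoted above. */
#define INDIRECT_PARAM(set, name) current->indirect_params.set.name

/*
 * Mirrors the masking done in sock_attach_fd(): of the indirect flags,
 * only O_NONBLOCK is allowed through to f_flags.
 */
int new_socket_f_flags(void)
{
	int flags = INDIRECT_PARAM(file_flags, flags);

	return O_RDWR | (flags & O_NONBLOCK);
}
```

With mock_task.indirect_params.file_flags.flags set to O_NONBLOCK | O_APPEND,
the helper returns O_RDWR | O_NONBLOCK; the O_APPEND bit is dropped by the
mask.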

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHQwWl2ijCOnn/RHQRAhEbAJ9/bkrb/phOMRl16Fb0N1TDYglSsgCeNhHQ
3huhdKCAVTu4CJnktf/ufy4=
=Jj6h
-----END PGP SIGNATURE-----


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20  7:59 ` David Miller
  2007-11-20 16:04   ` Ulrich Drepper
@ 2007-11-20 17:54   ` Zach Brown
  1 sibling, 0 replies; 27+ messages in thread
From: Zach Brown @ 2007-11-20 17:54 UTC (permalink / raw)
  To: David Miller; +Cc: drepper, linux-kernel, akpm, mingo, tglx, torvalds

David Miller wrote:
> From: Ulrich Drepper <drepper@redhat.com>
> Date: Tue, 20 Nov 2007 01:53:14 -0500
> 
> FWIW, I think this indirect syscall stuff is the most ugly interface
> I've ever seen proposed for the kernel.

Well, there's no XML in /proc :) :).

But, yes, I agree that the internal code needs a lot more cleanup before
being considered for merging.

> And I agree with all of the objections raised by both H. Peter Anvin
> and Eric Dumazet.

I'm worried, too.  Do we have a stronger alternative?  I'm all ears,
this isn't really my area of expertise.

- z


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 16:04   ` Ulrich Drepper
@ 2007-11-20 18:13     ` H. Peter Anvin
  2007-11-20 18:24       ` Zach Brown
  2007-11-26 18:17       ` Linus Torvalds
  2007-11-20 21:48     ` David Miller
  1 sibling, 2 replies; 27+ messages in thread
From: H. Peter Anvin @ 2007-11-20 18:13 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: David Miller, linux-kernel, akpm, mingo, tglx, torvalds

Ulrich Drepper wrote:
> 
>> And I agree with all of the objections raised by both H. Peter Anvin
>> and Eric Dumazet.
> 
> Eric had no arguments and HP's comments lack a viable alternative proposal.
> 

That's only because you're being, deliberately or accidentally, vague 
about what your actual (as opposed to imagined) requirements are.

The only thing concrete that I have seen is that the limitation to 6 
system call arguments is insufficient.  This is clearly true, as 
evidenced by things like pselect.  To which I responded that I'd *much* 
rather see a systematized way to handle the system call ABI beyond 6
arguments... the system call interface is a calling convention and 
should be treated as such, and the last thing we need is something that 
ends up looking like the MS-DOS kernel interface where every call has 
its own random convention.

The easy answer, to repeat myself, is to adopt the convention that for
system calls with more than 6 arguments, the sixth argument register
carries a pointer to arguments 6 and above.  This has minor performance
disadvantages on platforms which use the stack for return addresses AND
use exactly six registers for arguments (a surprisingly common number.)
On those platforms we have the option of either taking the extra user
space copies, or picking a method for passing the in-memory copy in a
pointer.
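
A minimal sketch of that convention in plain C, with no real system call
involved.  The eight-argument call, the struct overflow_args layout and both
function names are hypothetical; the point is only that the entry point sees
five values plus one pointer, and a generic stub packs the rest.

```c
#include <stddef.h>

/* Hypothetical in-memory block holding arguments 6 and above. */
struct overflow_args {
	long arg6;
	long arg7;
	long arg8;
};

/* Entry point as the ABI would see it: five words plus one pointer. */
long mock_sys_sum8(long a1, long a2, long a3, long a4, long a5,
		   const struct overflow_args *rest)
{
	if (rest == NULL)
		return -1;	/* stand-in for an error return */
	return a1 + a2 + a3 + a4 + a5 +
	       rest->arg6 + rest->arg7 + rest->arg8;
}

/* User-space stub: spills arguments 6..8 to memory and passes the address. */
long sum8(long a1, long a2, long a3, long a4, long a5,
	  long a6, long a7, long a8)
{
	struct overflow_args rest = { a6, a7, a8 };

	return mock_sys_sum8(a1, a2, a3, a4, a5, &rest);
}
```

Because the packing rule does not depend on which call is being made, a stub
like sum8() could in principle be generated rather than written by hand.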

If the whole thing is about "a dozen new [system calls]" then a dozen
system calls added to the existing tables are better than this mess.

Inside the kernel, a lot of things could be cleaned up substantially by 
automating the generation of stubs, where necessary.  I did a lot of 
work in klibc to automatically generate stubs of various sorts; some of 
that work may be possible to re-use.

	-hpa


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 18:13     ` H. Peter Anvin
@ 2007-11-20 18:24       ` Zach Brown
  2007-11-20 19:12         ` H. Peter Anvin
  2007-11-26 18:17       ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Zach Brown @ 2007-11-20 18:24 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ulrich Drepper, David Miller, linux-kernel, akpm, mingo, tglx,
	torvalds


> That's only because you're being, deliberately or accidentally, vague
> about what your actual (as opposed to imagined) requirements are.

Maybe I can help by summarizing how syslets fit in to this.

Currently the syslet patches add a single submission call which includes
an argument which is a structure which duplicates the system call ABI.
The submission syscall in the kernel does some syslet specific work
which amounts to verifying state and storing it in the task_struct.  It
then has to unpack the system call arguments from this submission
syscall argument and call the specified system call.
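
To make the shape of that concrete, a toy sketch in plain C.  The record
layout, the two-entry table and every name here are hypothetical and are not
taken from the syslet patches; the sketch only shows a record that duplicates
"number plus six words" being unpacked and dispatched.

```c
#include <stddef.h>

#define MOCK_NR_ARGS 6

/* Hypothetical submission record: syscall number plus six argument words. */
struct mock_submission {
	unsigned int nr;
	unsigned long args[MOCK_NR_ARGS];
};

typedef long (*mock_syscall_fn)(unsigned long, unsigned long, unsigned long,
				unsigned long, unsigned long, unsigned long);

/* Two toy "system calls"; unused argument words are ignored. */
static long mock_sys_add(unsigned long a, unsigned long b, unsigned long c,
			 unsigned long d, unsigned long e, unsigned long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return (long)(a + b);
}

static long mock_sys_first(unsigned long a, unsigned long b, unsigned long c,
			   unsigned long d, unsigned long e, unsigned long f)
{
	(void)b; (void)c; (void)d; (void)e; (void)f;
	return (long)a;
}

static const mock_syscall_fn mock_table[] = { mock_sys_add, mock_sys_first };

/* Unpack the record and call the selected entry; -1 on a bad record. */
long mock_submit(const struct mock_submission *s)
{
	if (s == NULL || s->nr >= sizeof(mock_table) / sizeof(mock_table[0]))
		return -1;
	return mock_table[s->nr](s->args[0], s->args[1], s->args[2],
				 s->args[3], s->args[4], s->args[5]);
}
```

The pack step on the caller's side and the unpack step in mock_submit() are
the per-architecture helpers referred to in the next paragraph.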

Every architecture will need helpers, then, on either side.  They'll
need to pack their arguments into the struct and then unpack and call in
the kernel.  The PPC64 guys have already expressed concern about this.

It's, in effect, adding the syslet arguments to every single system call.

So, instead of duplicating the system call ABI in the argument to a
syslet submission syscall, we could pass the syslet arguments via this
indirect parameters convention.  This, hopefully, will reduce complexity
by reducing the number of places that we have to muck around with the
syscall ABI.

That's the high level summary, anyway.  I'm working on the simplest
expression of this mechanism at the moment.  We'll have code to argue
about before the silly thanksgiving break, I hope.

- z


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 18:24       ` Zach Brown
@ 2007-11-20 19:12         ` H. Peter Anvin
  2007-11-20 22:22           ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2007-11-20 19:12 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ulrich Drepper, David Miller, linux-kernel, akpm, mingo, tglx,
	torvalds

Zach Brown wrote:
>> That's only because you're being, deliberately or accidentally, vague
>> about what your actual (as opposed to imagined) requirements are.
> 
> Maybe I can help by summarizing how syslets fit in to this.
> 
> Currently the syslet patches add a single submission call which includes
> an argument which is a structure which duplicates the system call ABI.
> The submission syscall in the kernel does some syslet specific work
> which amounts to verifying state and storing it in the task_struct.  It
> then has to unpack the system call arguments from this submission
> syscall argument and call the specified system call.
> 
> Every architecture will need helpers, then, on either side.  They'll
> need to pack their arguments into the struct and then unpack and call in
> the kernel.  The PPC64 guys have already expressed concern about this.
> 
> It's, in effect, adding the syslet arguments to every single system call.
> 
> So, instead of duplicating the system call ABI in the argument to a
> syslet submission syscall, we could pass the syslet arguments via this
> indirect parameters convention.  This, hopefully, will reduce complexity
> by reducing the number of places that we have to muck around with the
> syscall ABI.
> 
> That's the high level summary, anyway.  I'm working on the simplest
> expression of this mechanism at the moment.  We'll have code to argue
> about before the silly thanksgiving break, I hope.
> 

It seems that you're doing the same thing in both cases, except you're 
now extending it to include other random functionality, which means 
other things than syslets are suddenly affected.

syslets are arguably a little bit different, since what you're 
effectively doing there is running a miniature interpreted language in 
kernel space.  A higher startup overhead should be acceptable, since 
you're amortizing it over a larger number of calls.  Extending that 
mechanism suddenly means you HAVE to use that interpreted language 
message mechanism to access certain system calls, which really does not 
seem like a good thing either for performance or for encouraging sane
design of interfaces.

Everyone who designs a multiplexer has good reasons for the expediency
that it provides, but it really isn't a good thing in the long term.
The reason I mentioned MS-DOS is that MS-DOS has tons of multiplexers,
sometimes three levels deep.  Furthermore, it doesn't have any kind of
uniformity to its system call calling convention.  The end result is
hand-crafted stubs and wrappers, on both sides of the interface.

	-hpa


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 16:04   ` Ulrich Drepper
  2007-11-20 18:13     ` H. Peter Anvin
@ 2007-11-20 21:48     ` David Miller
  2007-11-20 21:55       ` Zach Brown
  1 sibling, 1 reply; 27+ messages in thread
From: David Miller @ 2007-11-20 21:48 UTC (permalink / raw)
  To: drepper; +Cc: linux-kernel, akpm, mingo, tglx, torvalds

From: Ulrich Drepper <drepper@redhat.com>
Date: Tue, 20 Nov 2007 08:04:53 -0800

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> David Miller wrote:
> > Where does this INDIRECT_PARAM() macro get defined?  I do not
> > see it being defined anywhere in these patches.
> 
> Defined in <linux/indirect.h>:
> 
> +#define INDIRECT_PARAM(set, name) current->indirect_params.set.name
> 
> Not my idea, I was following one review comment.

This was not in the patches you posted, I double checked before
sending my reply.


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 21:48     ` David Miller
@ 2007-11-20 21:55       ` Zach Brown
  2007-11-20 22:36         ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Zach Brown @ 2007-11-20 21:55 UTC (permalink / raw)
  To: David Miller; +Cc: drepper, linux-kernel, akpm, mingo, tglx, torvalds


>>> Where does this INDIRECT_PARAM() macro get defined?  I do not
>>> see it being defined anywhere in these patches.
>> Defined in <linux/indirect.h>:
>>
>> +#define INDIRECT_PARAM(set, name) current->indirect_params.set.name
>>
>> Not my idea, I was following one review comment.
> 
> This was not in the patches you posted, I double checked before
> sending my reply.

Not to belabor this point, but it was:

http://lkml.org/lkml/2007/11/20/53

$ grep -l INDIRECT_PARAM .git/patches/master/*
.git/patches/master/indirect-v4-4.patch
.git/patches/master/indirect-v4-5.patch
.git/patches/master/indirect-v4-6.patch

Maybe the patches got to you out of order so you saw 5/ before 4/?

- z



* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 19:12         ` H. Peter Anvin
@ 2007-11-20 22:22           ` Ingo Molnar
  2007-11-20 22:33             ` Davide Libenzi
  2007-11-20 23:25             ` H. Peter Anvin
  0 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2007-11-20 22:22 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Zach Brown, Ulrich Drepper, David Miller, linux-kernel, akpm,
	tglx, torvalds


* H. Peter Anvin <hpa@zytor.com> wrote:

> It seems that you're doing the same thing in both cases, except you're 
> now extending it to include other random functionality, which means 
> other things than syslets are suddenly affected.
>
> syslets are arguably a little bit different, since what you're 
> effectively doing there is running a miniature interpreted language in 
> kernel space.  A higher startup overhead should be acceptable, since 
> you're amortizing it over a larger number of calls.  Extending that 
> mechanism suddenly means you HAVE to use that interpreted language 
> message mechanism to access certain system calls, which really does 
> not seem like a good thing either for performance or for encouraging
> sane design of interfaces.

whether that interpreted syslet language survives is still an open 
question - it was extremely ugly when i wrote the first version of it 
and it only got uglier since then :-)

do you suggest that extending the system call calling convention to 
include an arbitrary number of parameters will solve all these API needs 
we have at the moment?

if yes, then a one-shot syslet/async call would in essence be:

	syslet_arg1 ... N, syscall_arg 1 ... M

the same is true for the indirect stuff, we in essence nest syscalls 
inside another syscall:

	sys_indirect arg1 ... N, syscall arg 1 ... M

this all assumes an arbitrarily extendable syscall ABI, which can take 
N+M parameters. Right?

i'm not entirely sure we really want to do this. Nested syscalls would 
have to unpack the arguments and repack them into a kernel-internal call 
format anyway. So there's no performance upside - in fact i can only see 
additional complications.

Why not just pin down the current ABI that there's 6 syscall parameters 
_and not more_? It's totally sensible, and indirection has some minimal 
costs anyway, so copying the nested syscall parameters is a non-issue. 
This is not ad-hoc and when i wrote syslets i actually profiled the
performance of a variable-length calling convention and decided _against_
it. Nothing beats the performance of a straight fixed-length copy of 6x4
(or 6x8) bytes.

The memory access cost argument you mentioned is largely irrelevant and 
inapposite here, this is all passed in on the stack which is well-cached 
in the L1 cache.
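
For illustration, the two shapes being compared, as a plain C sketch.  The
struct and function names are made up, memcpy() stands in for whatever copy
primitive a kernel would use, and no performance claim is implied by the
code itself.

```c
#include <string.h>

#define NR_SYSCALL_WORDS 6

/* Six argument words: 6x4 bytes on 32-bit, 6x8 bytes on 64-bit. */
struct arg_block {
	unsigned long w[NR_SYSCALL_WORDS];
};

/* Fixed-length form: the size is a compile-time constant, no checks. */
void copy_args_fixed(struct arg_block *dst, const struct arg_block *src)
{
	memcpy(dst, src, sizeof(*dst));
}

/*
 * Variable-length form: the count has to be validated and the unused
 * slots cleared before the copy.  Returns 0 on success, -1 if the
 * count does not fit.
 */
int copy_args_variable(struct arg_block *dst, const unsigned long *src,
		       size_t count)
{
	if (count > NR_SYSCALL_WORDS)
		return -1;
	memset(dst, 0, sizeof(*dst));
	memcpy(dst->w, src, count * sizeof(src[0]));
	return 0;
}
```

The variable-length form carries a bounds check and a length-dependent copy
that the fixed-length form does not need.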

	Ingo


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 22:22           ` Ingo Molnar
@ 2007-11-20 22:33             ` Davide Libenzi
  2007-11-20 22:42               ` Ingo Molnar
  2007-11-20 23:25             ` H. Peter Anvin
  1 sibling, 1 reply; 27+ messages in thread
From: Davide Libenzi @ 2007-11-20 22:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Zach Brown, Ulrich Drepper, David Miller,
	Linux Kernel Mailing List, Andrew Morton, tglx, Linus Torvalds

On Tue, 20 Nov 2007, Ingo Molnar wrote:

> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
> > It seems that you're doing the same thing in both cases, except you're 
> > now extending it to include other random functionality, which means 
> > other things than syslets are suddenly affected.
> >
> > syslets are arguably a little bit different, since what you're 
> > effectively doing there is running a miniature interpreted language in 
> > kernel space.  A higher startup overhead should be acceptable, since 
> > you're amortizing it over a larger number of calls.  Extending that 
> > mechanism suddenly means you HAVE to use that interpreted language 
> > message mechanism to access certain system calls, which really does 
> > not seem like a good thing either for performance or for encouraging
> > sane design of interfaces.
> 
> whether that interpreted syslet language survives is still an open 
> question - it was extremely ugly when i wrote the first version of it 
> and it only got uglier since then :-)

Aha! You admitted it finally :)



- Davide




* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 21:55       ` Zach Brown
@ 2007-11-20 22:36         ` David Miller
  0 siblings, 0 replies; 27+ messages in thread
From: David Miller @ 2007-11-20 22:36 UTC (permalink / raw)
  To: zach.brown; +Cc: drepper, linux-kernel, akpm, mingo, tglx, torvalds

From: Zach Brown <zach.brown@oracle.com>
Date: Tue, 20 Nov 2007 13:55:56 -0800

> Not to belabor this point, but it was:
> 
> http://lkml.org/lkml/2007/11/20/53
> 
> $ grep -l INDIRECT_PARAM .git/patches/master/*
> .git/patches/master/indirect-v4-4.patch
> .git/patches/master/indirect-v4-5.patch
> .git/patches/master/indirect-v4-6.patch
> 
> Maybe the patches got to you out of order so you saw 5/ before 4/?

Thanks for pointing this out, I stand corrected.


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 22:33             ` Davide Libenzi
@ 2007-11-20 22:42               ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2007-11-20 22:42 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: H. Peter Anvin, Zach Brown, Ulrich Drepper, David Miller,
	Linux Kernel Mailing List, Andrew Morton, tglx, Linus Torvalds


* Davide Libenzi <davidel@xmailserver.org> wrote:

> > whether that interpreted syslet language survives is still an open 
> > question - it was extremely ugly when i wrote the first version of 
> > it and it only got uglier since then :-)
> 
> Aha! You admitted it finally :)

damn :-)

but if the only alternative is to be fundamentally slower, i am not 
afraid of some ugliness :-)

	Ingo


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 22:22           ` Ingo Molnar
  2007-11-20 22:33             ` Davide Libenzi
@ 2007-11-20 23:25             ` H. Peter Anvin
  2007-11-20 23:41               ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2007-11-20 23:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, Ulrich Drepper, David Miller, linux-kernel, akpm,
	tglx, torvalds

Ingo Molnar wrote:
> do you suggest that extending the system call calling convention to 
> include an arbitrary number of parameters will solve all these API needs 
> we have at the moment?
> 
> if yes, then a one-shot syslet/async call would in essence be:
> 
> 	syslet_arg1 ... N, syscall_arg 1 ... M
> 
> the same is true for the indirect stuff, we in essence nest syscalls 
> inside another syscall:
> 
> 	sys_indirect arg1 ... N, syscall arg 1 ... M
> 
> this all assumes an arbitrarily extendable syscall ABI, which can take 
> N+M parameters. Right?
> 
> i'm not entirely sure we really want to do this. Nested syscalls would 
> have to unpack the arguments and repack them into a kernel-internal call 
> format anyway. So there's no performance upside - in fact i can only see 
> additional complications.

Forget about indirection for the moment.  Let's first look at the need 
of plain system calls.

> Why not just pin down the current ABI that there's 6 syscall parameters 
> _and not more_?

Because we have already violated it.  There are system calls that need 
more than 6 arguments: we need *a* convention.  Worse, we're not 
actually talking 6 *arguments*, we're talking 6 *words*; on 32-bit 
platforms a single argument can occupy two words.
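
A small sketch of the words-versus-arguments point.  The helper names are
hypothetical, and the word count ignores any alignment padding a particular
32-bit ABI may require for 64-bit values.

```c
#include <stdint.h>

/* Split one 64-bit argument into the two 32-bit words that carry it. */
void split_u64(uint64_t v, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)(v & 0xffffffffu);
	*hi = (uint32_t)(v >> 32);
}

/* Reassemble the argument on the receiving side. */
uint64_t join_u64(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}

/*
 * Argument words needed on a 32-bit platform for a call with n32
 * word-sized arguments and n64 64-bit arguments (padding ignored).
 */
unsigned int words_needed_32bit(unsigned int n32, unsigned int n64)
{
	return n32 + 2 * n64;
}
```

For example, a call with four word-sized arguments and two 64-bit ones has
six arguments but needs eight words, which already exceeds six registers.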

Uli talks about the need to add additional system calls with
parameters, and suggests "back-dooring" them via the sys_indirect interface.

pselect introduced the convention that to take more than 6 arguments, 
the 6th argument register contains a pointer into a user-space memory 
area which contains the real arguments 6 and above.  This is a simple 
convention, which can be trivially executed as a rule set.  Furthermore, 
*with some care* it can be mapped 1:1 onto the C calling convention by 
system-call-generic code, as opposed to needing system-call-specific 
stubs to marshall parameters.

Now, if you execute that asynchronously, you of course need to make sure 
that userspace doesn't clobber those additional arguments, so you would 
have to save them away or otherwise restrict userspace from reclaiming this.

** This is the situation as it stands today, and any solution needs to 
take this into account. **

> It's totally sensible, and indirection has some minimal 
> costs anyway, so copying the nested syscall parameters is a non-issue. 
> This is not ad-hoc and when i wrote syslets i actually profiled the
> performance of a variable-length calling convention and decided _against_
> it. Nothing beats the performance of a straight fixed-length copy of 6x4
> (or 6x8) bytes.
> 
> The memory access cost argument you mentioned is largely irrelevant and 
> inapposite here, this is all passed in on the stack which is well-cached 
> in the L1 cache.

What I'm objecting to, strongly, is the use of this syslets-style 
indirection for unrelated purposes, such as modifying the behaviour of 
existing system calls.  The sys_indirect call, as far as I understand 
it, basically is a way to inject commands into a separate 
hyper-lightweight thread of kernel execution.  That's fine so far.

However, proposing that we should have system calls (call a spade a 
spade) which can ONLY be accessed via this indirection interface is bad 
interface design at best and something much stronger at worst.  What we 
have done in the past when we want to add new parameters to a system 
call is that we assign it a new system call number, and point the old 
system call number to a thunk which sets the new parameters to specific 
default values and then tailcalls the new system call.  This is a very 
straightforward thing to do, and imposes any costs at all only on users 
of the legacy system call number -- and then they are only a handful of 
instructions.
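
A sketch of that thunk pattern in plain C.  Both function names and the
return convention are made up for illustration; they are not real kernel
entry points.

```c
#include <fcntl.h>

/*
 * Hypothetical new entry point: the old call plus a flags word.
 * Returns the f_flags the new file would get, or -1 on bad input.
 */
long mock_sys_accept_flags(int listen_fd, int flags)
{
	if (listen_fd < 0)
		return -1;
	if (flags & ~O_NONBLOCK)
		return -1;	/* reject flags this call does not know */
	return O_RDWR | flags;
}

/*
 * Legacy entry point: keeps its old number and signature, supplies the
 * default for the new parameter, and calls straight through.
 */
long mock_sys_accept(int listen_fd)
{
	return mock_sys_accept_flags(listen_fd, 0);
}
```

Callers of the old number pay for one extra call with a constant argument;
callers of the new number pay nothing extra.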

	-hpa


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 23:25             ` H. Peter Anvin
@ 2007-11-20 23:41               ` Ingo Molnar
  2007-11-20 23:57                 ` H. Peter Anvin
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2007-11-20 23:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Zach Brown, Ulrich Drepper, David Miller, linux-kernel, akpm,
	tglx, torvalds


* H. Peter Anvin <hpa@zytor.com> wrote:

>> Why not just pin down the current ABI that there's 6 syscall 
>> parameters _and not more_?
>
> Because we have already violated it.  There are system calls that need 
> more than 6 arguments: we need *a* convention.  Worse, we're not 
> actually talking 6 *arguments*, we're talking 6 *words*; on 32-bit 
> platforms a single argument can occupy two words.

i think you are at least partly wrong here. Multiplexing/demultiplexing 
can go on infinitely - for example sys_write(fd, size, buf) can be 
thought of as a function call that passes in fd, size and a variable 
number of arguments of the data to be written.

in that sense capping function arguments at 6 is _sensible_ because it 
prefers _simple_ interfaces. When i wrote syslets i did a histogram of
the number of arguments per syscall:

  #args   #syscalls
  -----------------
      0       22
      1       51
      2       83
      3       85
      4       40
      5       23
      6        8


Fortunately what we see today is that 80% of all syscalls have 4 or less 
parameters. (yes, there are a few 6-parameter syscalls that arguably 
hurt, but still, it's the exception not the rule)

this histogram shows a healthy bell curve which is _not_ limited by the 
arguments limit of 6, but by common sense! If the 6-arguments limit was 
a problem then we'd see a pile-up of 6-param syscalls.

so i believe you should start thinking about lots-of-arguments syscalls 
as an exception not as something that needs to fit into some generic 
ABI. (Especially as most schemes that were supposed to handle this 
problem would hurt the sane 4-parameter (or less) syscall case too.)
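
The cumulative shares can be checked directly from the table.  A short
sketch, with the counts copied from the histogram above and helper names
made up for illustration:

```c
#define MAX_ARGS 6

/* Number of syscalls taking 0, 1, ..., 6 arguments, from the table. */
static const unsigned int syscalls_by_argc[MAX_ARGS + 1] = {
	22, 51, 83, 85, 40, 23, 8
};

/* How many syscalls take at most max_args arguments. */
unsigned int syscalls_with_at_most(unsigned int max_args)
{
	unsigned int i, sum = 0;

	if (max_args > MAX_ARGS)
		max_args = MAX_ARGS;
	for (i = 0; i <= max_args; i++)
		sum += syscalls_by_argc[i];
	return sum;
}

/* The same figure as a percentage of all syscalls, rounded down. */
unsigned int percent_with_at_most(unsigned int max_args)
{
	return 100u * syscalls_with_at_most(max_args) /
	       syscalls_with_at_most(MAX_ARGS);
}
```

By these counts, 281 of the 312 calls (about 90%) take four or fewer
arguments and 241 (about 77%) take three or fewer, so the 80% figure is on
the conservative side.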

	Ingo


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 23:41               ` Ingo Molnar
@ 2007-11-20 23:57                 ` H. Peter Anvin
  0 siblings, 0 replies; 27+ messages in thread
From: H. Peter Anvin @ 2007-11-20 23:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, Ulrich Drepper, David Miller, linux-kernel, akpm,
	tglx, torvalds

Ingo Molnar wrote:
> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
>>> Why not just pin down the current ABI that there's 6 syscall 
>>> parameters _and not more_?
>> Because we have already violated it.  There are system calls that need 
>> more than 6 arguments: we need *a* convention.  Worse, we're not 
>> actually talking 6 *arguments*, we're talking 6 *words*; on 32-bit 
>> platforms a single argument can occupy two words.
> 
> i think you are at least partly wrong here. Multiplexing/demultiplexing 
> can go on infinitely - for example sys_write(fd, size, buf) can be 
> thought of as a function call that passes in fd, size and a variable 
> number of arguments of the data to be written.
> 
> in that sense capping function arguments at 6 is _sensible_ because it 
> prefers _simple_ interfaces. When i wrote syslets i did a histogram of
> the number of arguments per syscall:
> 
>   #args   #syscalls
>   -----------------
>       0       22
>       1       51
>       2       83
>       3       85
>       4       40
>       5       23
>       6        8
> 
> Fortunately what we see today is that 80% of all syscalls have 4 or less 
> parameters. (yes, there are a few 6-parameter syscalls that arguably 
> hurt, but still, it's the exception not the rule)
> 
> this histogram shows a healthy bell curve which is _not_ limited by the 
> arguments limit of 6, but by common sense! If the 6-arguments limit was 
> a problem then we'd see a pile-up of 6-param syscalls.
> 
> so i believe you should start thinking about lots-of-arguments syscalls 
> as an exception not as something that needs to fit into some generic 
> ABI. (Especially as most schemes that were supposed to handle this 
> problem would hurt the sane 4-parameter (or less) syscall case too.)
> 

I guess I'm confused here... all I said was I wanted them to be 
systematic, and not need ad-hoc interfaces.  In particular, I really 
don't want to see an interface where "oh, the fifth parameter is really 
a flags field so it's passed with sys_indirect, and is only accessible 
via a sys_indirect" is the norm.

We don't have all that many; pselect() being the main one (I think there 
might be a handful more on 32-bit platforms, but not positive.)  It 
introduced the convention of pointing argument register 6 to a 
user-space data structure.  Simple, and as you correctly point out, it's 
a comparatively rare case.  In klibc, I currently handle it as a special 
case, but I would prefer to avoid special cases of that sort going forward.

Note that on s390, 6-parameter system calls are already a special case: 
anything with over 5 parameters is invoked via a memory structure.  This 
actually means that for pselect on s390, we indirect via a memory 
structure not once, but twice, for no good reason.

	-hpa



* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-20 18:13     ` H. Peter Anvin
  2007-11-20 18:24       ` Zach Brown
@ 2007-11-26 18:17       ` Linus Torvalds
  2007-11-26 18:45         ` Ingo Molnar
  2007-11-26 19:20         ` H. Peter Anvin
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2007-11-26 18:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ulrich Drepper, David Miller, linux-kernel, akpm, mingo, tglx



On Tue, 20 Nov 2007, H. Peter Anvin wrote:
> 
> If the whole thing is about "a dozen new [system calls]" then a dozen system
> calls added to the existing tables are better than this mess.

No it's not.

The point about the indirect calls is that we can do it for other things
than just a dozen random things that want this one flag.

We'll eventually want AIO calls for filename lookup etc, for example. 
That's another dozen calls (stat, lstat, open, etc). Having an indirect 
call interface to do these kinds of things would be wonderful, instead of 
having to add new system calls every time some issue with a flag that 
changes behaviour for an already existing system call comes up.

THAT is why I'd much rather have indirect system calls. 

The actual calling convention details are open for debate, of course. We 
could encode the information in the system call number itself, for example 
(eg have a bit there that says "extended information"). But we'll never 
get away from the fact that we have the odd architecture-specific system 
call interfaces with things like "pselect()" having pointers etc, if only 
because of legacy issues.

So we can *never* have a truly "generic" argument marshalling setup. We'll 
have to live with each architecture having system calls with special 
rules: some of those rules will be architecture-specific (eg number of 
easily available registers and/or historical reasons), and a few of the 
rules will be architecture-independent (eg things like sigreturn, clone 
and execve, that need to have direct access to the whole kernel return 
stack and simply *cannot* be called from any indirect code!)

So the choice is basically one of:

 - come up with a totally new interface to system calls, and effectively 
   duplicate the whole system call table.

   I'd hate to do this. We already have duplicated system call tables due 
   to compat stuff, it's painful.

 - just emulate the *existing* interface exactly, but with indirection. 
   IOW, the system call interface on x86 is an unconditional "six words in 
   six registers, the meaning of which is totally up to the system call 
   implementation itself".

   This is what Uli's sys_indirect() does.

 - add whole new system calls with extended information, making the 6-word 
   limits even worse, and likely forcing a whole new argument marshalling 
   code with conditionals depending on per-system-call flags, which 
   further complicates it and slows things down.

Quite frankly, I can't really see many other approaches. And of the above 
three ones, the sys_indirect() approach really does seem to be the 
simplest *and* the best-performing. It's basically faster to just 
unconditionally load six registers off an indirect block than it would be 
to have any conditionals based on which system call it is.
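To make the second option concrete, here is a minimal user-space sketch of the idea, not the actual sys_indirect() patch. The dispatcher always copies six words out of an indirect block and calls the handler, with no per-call conditionals on the argument count. All names (indirect_block, demo_table, demo_indirect, demo_current_flags) are hypothetical, and parking the extra flags in a global only loosely mirrors how the posted patch reads them through INDIRECT_PARAM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical layout: call number, six argument words, extra flags. */
struct indirect_block {
	unsigned long nr;
	long args[6];
	long extra_flags;	/* where something like O_NONBLOCK could travel */
};

typedef long (*demo_call_t)(long, long, long, long, long, long);

/* Extra information is parked here for the duration of one call. */
static long demo_current_flags;

static long demo_add(long a, long b, long c, long d, long e, long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return a + b;
}

static long demo_get_flags(long a, long b, long c, long d, long e, long f)
{
	(void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
	return demo_current_flags;
}

static const demo_call_t demo_table[] = { demo_add, demo_get_flags };

/* Always loads all six words; their meaning is up to the handler. */
long demo_indirect(const struct indirect_block *blk)
{
	long ret;

	assert(blk != NULL);
	if (blk->nr >= sizeof(demo_table) / sizeof(demo_table[0]))
		return -ENOSYS;

	demo_current_flags = blk->extra_flags;
	ret = demo_table[blk->nr](blk->args[0], blk->args[1], blk->args[2],
				  blk->args[3], blk->args[4], blk->args[5]);
	demo_current_flags = 0;
	return ret;
}
```

The handlers keep their existing six-word signature, which is the point of the approach: only the entry path changes.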

		Linus


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-26 18:17       ` Linus Torvalds
@ 2007-11-26 18:45         ` Ingo Molnar
  2007-11-26 19:07           ` H. Peter Anvin
  2007-11-26 19:20         ` H. Peter Anvin
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2007-11-26 18:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Ulrich Drepper, David Miller, linux-kernel, akpm,
	tglx


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Quite frankly, I can't really see many other approaches. And of the 
> above three ones, the sys_indirect() approach really does seem to be 
> the simplest *and* the best-performing. It's basically faster to just 
> unconditionally load six registers off an indirect block than it would 
> be to have any conditionals based on which system call it is.

yeah. And even assuming for the sake of argument, that there was only 
one dominant architecture we care about, even there many of our existing 
syscall APIs are _already_ special-purpose APIs that do not encode 
parameters in a 'flat' way.

So it's not like sys_indirect() would break some magic pristine state of 
a flat parameter space - on the contrary, most of the nontrivial 
syscalls take pointers to structures or pointers to streams of 
information. The parameter count histogram i believe further underlines 
this point:

  #args   #syscalls
  -----------------
      0       22
      1       51
      2       83
      3       85
      4       40
      5       23
      6        8

the natural 'center' of function call parameter counts is around 1-4 
parameters, and that is natural. (most operators that the human brain 
prefers to operate with are like that - having higher complexity than 
that often defeats the purpose of getting an API used by ... humans.)
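For reference, the histogram above adds up to 312 system calls, of which 259 (about 83%) take 1 to 4 parameters and 31 (about 10%) take 5 or 6. A small sketch that reproduces this arithmetic, with the counts copied from the table and a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Counts copied from the histogram; index = number of arguments. */
static const unsigned int syscalls_by_argc[7] = { 22, 51, 83, 85, 40, 23, 8 };

/* Sum the syscalls taking between lo and hi arguments, inclusive. */
unsigned int count_in_range(size_t lo, size_t hi)
{
	unsigned int sum = 0;
	size_t i;

	assert(lo <= hi && hi < 7);
	for (i = lo; i <= hi; i++)
		sum += syscalls_by_argc[i];
	return sum;
}
```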

(side-note: in that sense, introducing some generic "arbitrary number of 
parameters" ABI design that was suggested instead of sys_indirect() 
would be _counter productive_ from a meta-design POV: it would not 
punish sucky, over-complicated APIs that expose way too many details in 
their main API call.)

	Ingo


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-26 18:45         ` Ingo Molnar
@ 2007-11-26 19:07           ` H. Peter Anvin
  2007-11-26 19:55             ` Davide Libenzi
  0 siblings, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2007-11-26 19:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Ulrich Drepper, David Miller, linux-kernel, akpm,
	tglx

Ingo Molnar wrote:
> 
> So it's not like sys_indirect() would break some magic pristine state of 
> a flat parameter space - on the contrary, most of the nontrivial 
> syscalls take pointers to structures or pointers to streams of 
> information. The parameter count histogram i believe further underlines 
> this point:
> 
>   #args   #syscalls
>   -----------------
>       0       22
>       1       51
>       2       83
>       3       85
>       4       40
>       5       23
>       6        8
> 
> the natural 'center' of function call parameter counts is around 1-4 
> parameters, and that is natural. (most operators that the human brain 
> prefers to operate with are like that - having higher complexity than 
> that often defeats the purpose of getting an API used by ... humans.)
> 

I was preparing a response to Linus' email, but I really feel this needs 
to be addressed specifically.

When it comes to dealing with the operator-visible state, what matters 
is what happens on the API level, NOT on the system call level. 
Furthermore, the proposed sys_indirect interface just means that there 
are parameters hidden from immediate view, even though they 
fundamentally change the operation performed, and that it is much harder 
to correlate, say, the output of strace(1) with what actually happened 
in the program.  So from a *psychological* point of view, this seems to 
be an insane design choice.

	-hpa


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-26 18:17       ` Linus Torvalds
  2007-11-26 18:45         ` Ingo Molnar
@ 2007-11-26 19:20         ` H. Peter Anvin
  2007-11-26 23:25           ` Ulrich Drepper
  2007-11-27  2:14           ` Linus Torvalds
  1 sibling, 2 replies; 27+ messages in thread
From: H. Peter Anvin @ 2007-11-26 19:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, David Miller, linux-kernel, akpm, mingo, tglx

Linus Torvalds wrote:
> 
> On Tue, 20 Nov 2007, H. Peter Anvin wrote:
>> If the whole thing is about "a dozen new [system calls]" then a dozen system
>> calls added to the existing tables are better than this mess.
> 
> No it's not.
> 
> The point about the indirect calls is that we can do it for other things 
> than just a dozen random things that want this one flag.
> 
> We'll eventually want AIO calls for filename lookup etc, for example. 
> That's another dozen calls (stat, lstat, open, etc). Having an indirect 
> call interface to do these kinds of things would be wonderful, instead of 
> having to add new system calls every time some issue with a flag that 
> changes behaviour for an already existing system call comes up.
> 
> THAT is why I'd much rather have indirect system calls. 

I'm presuming you're not talking about some sort of 
syslets/fibrils/threadlets here (executing an interpreted thread of 
execution in kernel space).  That's a whole separate ball of wax.

> The actual calling convention details are open for debate, of course. We 
> could encode the information in the system call number itself, for example 
> (eg have a bit there that says "extended information"). But we'll never 
> get away from the fact that we have the odd architecture-specific system 
> call interfaces with things like "pselect()" having pointers etc, if only 
> because of legacy issues.
> 
> So we can *never* have a truly "generic" argument marshalling setup. We'll 
> have to live with each architecture having system calls with special 
> rules: some of those rules will be architecture-specific (eg number of 
> easily available registers and/or historical reasons), and a few of the 
> rules will be architecture-independent (eg things like sigreturn, clone 
> and execve, that need to have direct access to the whole kernel return 
> stack and simply *cannot* be called from any indirect code!)

> So the choice is basically one of:
> 
>  - come up with a totally new interface to system calls, and effectively 
>    duplicate the whole system call table.
> 
>    I'd hate to do this. We already have duplicated system call tables due 
>    to compat stuff, it's painful.

This would be the right thing to do if we were to redesign the system 
call interface from the ground up, which it doesn't exactly sound like 
we are intending.

>  - just emulate the *existing* interface exactly, but with indirection. 
>    IOW, the system call interface on x86 is an unconditional "six words in 
>    six registers, the meaning of which is totally up to the system call 
>    implementation itself".
> 
>    This is what Uli's sys_indirect() does.
> 
>  - add whole new system calls with extended information, making the 6-word 
>    limits even worse, and likely forcing a whole new argument marshalling 
>    code with conditionals depending on per-system-call flags, which 
>    further complicates it and slows things down.

The 6-word limit is a red herring.  There are at least two ways to deal 
with it (and this doesn't mean wiping the legacy stuff we already have):

- Let each architecture pick a calling convention and redefine the 
architecture-independent bits to take an arbitrary number of arguments. 
  This is a one-time panarchitectural change.

- Define the architecture-independent interface inside the kernel to be 
a 6-word interface and use a marshalling thunk when the number of 
parameters exceeds this number.  **This is what we're currently doing.** 
This is inefficient for s390 (which already has to thunk 
6-parameter functions in its arch layer), but I think all other 
architectures are fine.  Those thunks (stubs) could be generated 
automatically if we wanted to.

So I would advocate admitting that we already broke the 6-word limit and 
abolish it.  Then we can create new system calls that match what the 
user would see.

> Quite frankly, I can't really see many other approaches. And of the above 
> three ones, the sys_indirect() approach really does seem to be the 
> simplest *and* the best-performing. It's basically faster to just 
> unconditionally load six registers off an indirect block than it would be 
> to have any conditionals based on which system call it is.

I find it very hard to see how it could be better performing than 
jumping through a thunk; in fact, for the second option (the one we're 
currently using) when gcc does top-level reordering the thunk (e.g. 
sys_pselect6) SHOULD simply fall through into the system call function proper (e.g. 
sys_pselect7).  For one thing, you will have at least one additional 
data-dependent indirect call in the path.

	-hpa



* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-26 19:07           ` H. Peter Anvin
@ 2007-11-26 19:55             ` Davide Libenzi
  0 siblings, 0 replies; 27+ messages in thread
From: Davide Libenzi @ 2007-11-26 19:55 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Linus Torvalds, Ulrich Drepper, David Miller,
	Linux Kernel Mailing List, Andrew Morton, tglx

On Mon, 26 Nov 2007, H. Peter Anvin wrote:

> Ingo Molnar wrote:
> > 
> > So it's not like sys_indirect() would break some magic pristine state of a
> > flat parameter space - on the contrary, most of the nontrivial syscalls take
> > pointers to structures or pointers to streams of information. The parameter
> > count histogram i believe further underlines this point:
> > 
> >   #args   #syscalls
> >   -----------------
> >       0       22
> >       1       51
> >       2       83
> >       3       85
> >       4       40
> >       5       23
> >       6        8
> > 
> > the natural 'center' of function call parameter counts is around 1-4
> > parameters, and that is natural. (most operators that the human brain
> > prefers to operate with are like that - having higher complexity than that
> > often defeats the purpose of getting an API used by ... humans.)
> > 
> 
> I was preparing a response to Linus' email, but I really feel this needs to be
> addressed specifically.
> 
> When it comes to dealing with the operator-visible state, what matters is what
> happens on the API level, NOT on the system call level. Furthermore, the
> proposed sys_indirect interface just means that there are parameters hidden
> from immediate view, even though they fundamentally change the operation
> performed, and that it is much harder to correlate, say, the output of
> strace(1) with what actually happened in the program.  So from a
> *psychological* point of view, this seems to be an insane design choice.

I think there are two different issues. One is the proliferation of system 
calls, and the other is the sane design of internal kernel APIs.
The first one is not very interesting to me, since I don't have a strong 
opinion either way.
The second is the one I care about most. I think that, whatever is the 
solution used to address the first, internal kernel APIs should be 
designed so that parameters flow down from the system call to the 
parameter's user code. IMO, besides very few cases where it could make 
some sense [*], setting some thread-global bits in the upper layer, to be 
magically picked up by code in the lower layers, does not lead to readable 
interfaces.



[*] Things that already read from a shared context, that is already 
    exposed to the user through some sort of set/get APIs.
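A minimal user-space sketch of the two styles being contrasted, under the assumption that a single flag has to reach a low-level attach routine. Every name here (demo_file, DEMO_NONBLOCK, open_explicit, open_implicit, set_context_flags) is hypothetical, and the plain static variable is only a stand-in for per-thread state.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NONBLOCK 0x800	/* hypothetical flag value, for illustration */

struct demo_file {
	int f_flags;
};

/* Style 1: the flag flows down the call chain as an ordinary parameter. */
static void attach_explicit(struct demo_file *file, int flags)
{
	file->f_flags = flags & DEMO_NONBLOCK;
}

int open_explicit(struct demo_file *file, int flags)
{
	assert(file != NULL);
	attach_explicit(file, flags);
	return 0;
}

/* Style 2: an upper layer sets context state that a lower layer reads. */
static int demo_context_flags;	/* stand-in for per-thread state */

void set_context_flags(int flags)
{
	demo_context_flags = flags;
}

static void attach_implicit(struct demo_file *file)
{
	/* Nothing in this signature reveals that flags are consulted. */
	file->f_flags = demo_context_flags & DEMO_NONBLOCK;
}

int open_implicit(struct demo_file *file)
{
	assert(file != NULL);
	attach_implicit(file);
	return 0;
}
```

In the second style the result of open_implicit() depends on an earlier set_context_flags() call that is invisible at the call site, which is the readability concern raised above.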



- Davide




* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-26 19:20         ` H. Peter Anvin
@ 2007-11-26 23:25           ` Ulrich Drepper
  2007-11-27  0:14             ` H. Peter Anvin
  2007-11-27  2:14           ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2007-11-26 23:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, David Miller, linux-kernel, akpm, mingo, tglx

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

H. Peter Anvin wrote:
> The 6-word limit is a red herring.  There are at least two ways to deal
> with it (and this doesn't mean wiping the legacy stuff we already have):
> 
> - Let each architecture pick a calling convention and redefine the
> architecture-independent bits to take an arbitrary number of arguments.
>  This is a one-time panarchitectural change.
> [...]

Just think beyond wishful thinking for a moment.  What does it take to
come up with something completely new and grand?

Let's start with the basics: you need to signal that the new syscall
calling convention is used.  Since the syscall entry code is limited (at
least the likes of syscall/sysenter, it would be easy enough to use int
$0x81 in addition to int $0x80) you would have to extend the use of the
syscall number while keeping binary compatibility.  This means
additional costs for every single syscall.

Once you're past that, how do you implement the expandable syscall
parameter count?  There are two ways:

- - pass to the real sys_* implementations the number of provided syscall
parameters and have each function figure out what this means

- - dynamically construct a call to the sys_* functions where the syscall
magic adds an appropriate number of parameters filled with zeros.  This
is quite complicated and, more importantly, it requires that you have
code/data somewhere which specifies how many parameters each of the
sys_* function actually requires.  The actual sys_* code and the data
has to be kept in sync at all times.  A maintenance nightmare.
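The second variant can be sketched in user space as follows. This is only an illustration under simplifying assumptions: handlers take an array of words instead of a dynamically constructed call, and the table, handler, and dispatcher names are all hypothetical. It does show the coupling in question: the nargs field has to agree with what each handler actually reads.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ARGS 6

typedef long (*demo_fn_t)(const long args[DEMO_MAX_ARGS]);

struct demo_entry {
	demo_fn_t fn;
	unsigned int nargs;	/* must match what fn actually reads */
};

static long demo_sum3(const long a[DEMO_MAX_ARGS])
{
	return a[0] + a[1] + a[2];
}

static long demo_sum4(const long a[DEMO_MAX_ARGS])
{
	return a[0] + a[1] + a[2] + a[3];
}

static const struct demo_entry demo_calls[] = {
	{ demo_sum3, 3 },
	{ demo_sum4, 4 },
};

/* Call entry nr with the provided words, zero-filling the missing ones. */
long demo_dispatch(size_t nr, const long *provided, unsigned int nprovided)
{
	long args[DEMO_MAX_ARGS] = { 0 };
	unsigned int i;

	assert(nr < sizeof(demo_calls) / sizeof(demo_calls[0]));
	if (nprovided > demo_calls[nr].nargs)
		return -1;	/* more words than the table says fn takes */
	for (i = 0; i < nprovided; i++)
		args[i] = provided[i];
	return demo_calls[nr].fn(args);
}
```

A caller that knows only three parameters can still invoke the four-parameter entry; the fourth word then arrives as zero.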


The handling of syscalls with many parameters should not be a driver of
this design at all.  Syscalls shouldn't be that complicated; I
completely agree with Ingo.


I'm perfectly willing to give you the benefit of the doubt: show us a design
for what you're proposing which is not slower than the current code,
doesn't impact existing code, and solves the problem in a nice and clean
way.  I cannot really see it now but I might be missing something.  The
sys_indirect approach ain't pretty but it does its job, doesn't impact
performance, and is expandable in directions we *know* we will want to go
very soon.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFHS1X12ijCOnn/RHQRAihRAJwLNJ9fT8GTv6MAoO6RZGOub07sGgCdGBLR
frXyQVB8Oh5VgWY5YJhpitg=
=FuBx
-----END PGP SIGNATURE-----


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-26 23:25           ` Ulrich Drepper
@ 2007-11-27  0:14             ` H. Peter Anvin
  2007-11-27  0:42               ` Ulrich Drepper
  0 siblings, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2007-11-27  0:14 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Linus Torvalds, David Miller, linux-kernel, akpm, mingo, tglx

Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> H. Peter Anvin wrote:
>> The 6-word limit is a red herring.  There are at least two ways to deal
>> with it (and this doesn't mean wiping the legacy stuff we already have):
>>
>> - Let each architecture pick a calling convention and redefine the
>> architecture-independent bits to take an arbitrary number of arguments.
>>  This is a one-time panarchitectural change.
>> [...]
> 
> Just think beyond wishful thinking for a moment.  What does it take to
> come up with something completely new and grand?
> 
> Let's start with the basics: you need to signal that the new syscall
> calling convention is used.  Since the syscall entry code is limited (at
> least the likes of syscall/sysenter, it would be easy enough to use int
> $0x81 in addition to int $0x80) you would have to extend the use of the
> syscall number while keeping binary compatibility.  This means
> additional costs for every single syscall.

No.

I already said I'm not looking at changing the calling convention for 
existing syscalls.  I don't think that makes sense.  (The only realistic 
exception to that is that if we really want a small number (16 or less) 
of flags fully orthogonal to the system call table, I guess it might 
make sense to stuff those in the high bits of the system call register. 
  However, I am a bit concerned about the auditability of that.)
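As a rough sketch of what stuffing flags into the high bits could look like, assume a 32-bit system call word split into a 16-bit call number and 16 flag bits. The split, the macro names, and the helper names are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FLAG_SHIFT 16
#define DEMO_NR_MASK    ((UINT32_C(1) << DEMO_FLAG_SHIFT) - 1)

/* Combine a call number (low 16 bits) with up to 16 flag bits (high). */
uint32_t demo_pack(uint32_t nr, uint32_t flags)
{
	assert(nr <= DEMO_NR_MASK && flags <= 0xffffu);
	return (flags << DEMO_FLAG_SHIFT) | nr;
}

/* Recover the call number used to index the system call table. */
uint32_t demo_nr(uint32_t word)
{
	return word & DEMO_NR_MASK;
}

/* Recover the orthogonal flag bits. */
uint32_t demo_flags(uint32_t word)
{
	return word >> DEMO_FLAG_SHIFT;
}
```

With no flags set the packed word equals the plain call number, so existing callers would be unaffected in this scheme; the entry path pays for one mask on every call.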

> Once you're past that, how do you implement the expandable syscall
> parameter count?  There are two ways:
> 
> - - pass to the real sys_* implementations the number of provided syscall
> parameters and have each function figure out what this means
> 
> - - dynamically construct a call to the sys_* functions where the syscall
> magic adds an appropriate number of parameters filled with zeros.  This
> is quite complicated and, more importantly, it requires that you have
> code/data somewhere which specifies how many parameters each of the
> sys_* function actually requires.  The actual sys_* code and the data
> has to be kept in sync at all times.  A maintenance nightmare.

Hardly so, as evidenced by the fact that we have successfully done so 
for 15 years already; a number of Linux architectures require this 
information for the existing system calls.

> The handling of syscalls with many parameters should not be a driver of
> this design at all.  Syscalls shouldn't be that complicated; I
> completely agree with Ingo.

You *ARE* introducing additional syscall parameters, whether or not 
you're admitting it.  It's exactly what you're doing, and by 
making those parameters hidden, all we're accomplishing is:

- a penalty any time those parameters have to be invoked,
- a good possibility that all the combinations are not audited.

> I'm perfectly willing to give you the benefit of the doubt: show us a design
> for what you're proposing which is not slower than the current code,
> doesn't impact existing code, and solves the problem in a nice and clean
> way.  I cannot really see it now but I might be missing something.  The
> sys_indirect approach ain't pretty but it does its job, doesn't impact
> performance, and is expandable in directions we *know* we will want to go
> very soon.

We have a ton of examples in the kernel already for both dealing with 
additional parameters and (somewhat fewer examples, but sys_pselect is a 
good one) too many parameters: in all cases, we invoke a wrapper 
function that sets up the parameters and then invokes the "true" syscall 
function.  Right now we're generating those manually, which appears to 
work okay simply because the amount of work it takes is small compared 
to the amount of work it takes to write the code for a proper syscall. 
(That doesn't mean it is the ideal model, of course.)  We *could* 
generate them automatically if we wanted to, with either of the models I 
mentioned -- the code I have in klibc to generate syscall stubs should 
be relatively easily modifiable to do this, although it's probably 
overkill for this job.

In this case we do minimal thunking of parameters for the legacy 
entrypoints, and for the current entrypoints we do the guaranteed 
minimum amount of work.

	-hpa



* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-27  0:14             ` H. Peter Anvin
@ 2007-11-27  0:42               ` Ulrich Drepper
  2007-11-27  1:23                 ` H. Peter Anvin
  0 siblings, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2007-11-27  0:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, David Miller, linux-kernel, akpm, mingo, tglx

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

H. Peter Anvin wrote:
> No.
> 
> I already said I'm not looking at changing the calling convention for
> existing syscalls.

I did not suggest or ask for that at all.

I was asking you to consider the real implementation details for a new
syscall mechanism.

We do not want to abandon the use of syscall/sysenter and go back to int
(on x86/x86-64).  This means that you have to come up with a mechanism
which hooks into the current syscall/sysenter path while preserving full
backward compatibility.

Now it's your turn.  How do you do this without additional costs?


> Hardly so, as evidenced by the fact that we have successfully done so
> for 15 years already; a number of Linux architectures require this
> information for the existing system calls.

Nothing at this scale is there at the moment, as far as I can see.  And
nothing so critical to get right.

Talk is cheap.  You still haven't shown one bit of design for how you want
to achieve your grand goal.  The time for hand-waving is over.  Do some
work or step out of the way.  Nothing you have said so far convinces me in
the least, and your arguments like "sys_indirect adds parameters" are
not really contested.  Yes, that's what sys_indirect does.  So what?  It
does this with almost no cost, which outweighs the ugliness factor in my
book.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFHS2gQ2ijCOnn/RHQRAlN5AKCWZQL97sROWBv33//Uj/MN+CNi3gCdFgCU
uLVEOfclERpakp1kdYzy2oI=
=stVB
-----END PGP SIGNATURE-----


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-27  0:42               ` Ulrich Drepper
@ 2007-11-27  1:23                 ` H. Peter Anvin
  0 siblings, 0 replies; 27+ messages in thread
From: H. Peter Anvin @ 2007-11-27  1:23 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Linus Torvalds, David Miller, linux-kernel, akpm, mingo, tglx

Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> H. Peter Anvin wrote:
>> No.
>>
>> I already said I'm not looking at changing the calling convention for
>> existing syscalls.
> 
> I did not suggest or ask for that at all.
> 
> I was asking you to consider the real implementation details for a new
> syscall mechanism.
> 
> We do not want to abandon the use of syscall/sysenter and go back to int
> (on x86/x86-64).  This means that you have to come up with a mechanism
> which hooks into the current syscall/sysenter path while preserving full
> backward compatibility.
> 
> Now it's your turn.  How do you do this without additional costs?
> 

- Add sys_new_call to the syscall table
- Create a stub thunk:

asmlinkage long sys_old_call(long parm1, long parm2, long parm3)
{
	return sys_new_call(parm1, parm2, parm3, 0);
}

We have 2^n examples of this in the kernel already.

Or, if the new syscall requires more than 6 parameters (with the current 
convention):

asmlinkage long sys_new_call6(long parm1, long parm2, long parm3,
			      long parm4, long parm5,
			      long __user *additional)
{
	long xparm[3];	/* 8 parameters, total */

	/* copy_from_user() returns the number of bytes NOT copied; 0 is success */
	if (copy_from_user(xparm, additional, sizeof xparm))
		return -EFAULT;

	return sys_new_call(parm1, parm2, parm3, parm4, parm5,
			    xparm[0], xparm[1], xparm[2]);
}

This is a fixed-size copy from userspace, which obviously cannot be avoided.
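The same shape can be tried out in user space. The sketch below replaces copy_from_user() with a mock that follows the kernel helper's convention of returning the number of bytes that could not be copied (so 0 means success), and treats a NULL pointer as a faulting user address. The names mock_copy_from_user, demo_new_call, and demo_new_call6 are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Mock: returns the number of bytes NOT copied, 0 on success. */
static unsigned long mock_copy_from_user(void *to, const void *from,
					 unsigned long n)
{
	assert(to != NULL);
	if (from == NULL)
		return n;	/* pretend the user pointer faulted */
	memcpy(to, from, n);
	return 0;
}

/* The "true" eight-parameter function. */
static long demo_new_call(long p1, long p2, long p3, long p4,
			  long p5, long p6, long p7, long p8)
{
	return p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8;
}

/* Six-word entry point: five direct words plus a pointer to three more. */
long demo_new_call6(long p1, long p2, long p3, long p4, long p5,
		    const long *additional)
{
	long xparm[3];	/* 8 parameters, total */

	if (mock_copy_from_user(xparm, additional, sizeof xparm) != 0)
		return -EFAULT;

	return demo_new_call(p1, p2, p3, p4, p5,
			     xparm[0], xparm[1], xparm[2]);
}
```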

The C version isn't optimal, obviously, hence my mentioning the 
possibility of doing it in the arch layer.

	-hpa


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-26 19:20         ` H. Peter Anvin
  2007-11-26 23:25           ` Ulrich Drepper
@ 2007-11-27  2:14           ` Linus Torvalds
  2007-11-27  2:38             ` H. Peter Anvin
  1 sibling, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2007-11-27  2:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ulrich Drepper, David Miller, linux-kernel, akpm, mingo, tglx



On Mon, 26 Nov 2007, H. Peter Anvin wrote:
> 
> I'm presuming you're not talking about some sort of syslets/fibrils/threadlets
> here (executing an interpreted thread of execution in kernel space).  That's a
> whole separate ball of wax.

Indeed. 

I'm hoping that just dies. It's too complex. But the "do this single 
system call asynchronously" isn't, and has lots of historical 
implementations, ranging from VMS to the braindead POSIX "aio" setup.

I do think that more complex threadlets could be useful in theory, I just 
doubt they'd be used in practice..

> > So the choice is basically one of:
> > 
> >  - come up with a totally new interface to system calls, and effectively
> > duplicating the whole system call table.
> > 
> >    I'd hate to do this. We already have duplicated system call tables due
> > to compat stuff, it's painful.
> 
> This would be the right thing to do if we were to redesign the system call
> interface from the ground up, which it doesn't exactly sound like we are
> intending.

Yeah. I'm also not sure it's the right thing even if we did redesign from 
scratch.

The current system call interface may look less than regular, but it has 
some very solid foundation: it's fast. Passing arguments in registers is 
by definition a lot faster *and*safer* than passing them any other way. 
There are no subtle security issues with people playing games with the 
argument base pointer (ie usually the stack pointer) and trying to fool 
the kernel into accessing kernel memory etc.

Immediately when you do anything but registers, it is much *much* more 
costly. The "get_user()" and "copy_from_user()" stuff is not exactly slow, 
but it's quite noticeable overhead for simple system calls. It gets worse 
if this all is described by some indirect table setup.

In the system call path, right now, for some system calls, the biggest two 
overheads are

 - the CPU system call overhead itself. We can't do much about this, but 
   the CPU designers do seem to be slowly getting it fixed (ie it's slower 
   than it should need to be, but it's a hell of a lot faster than a P4 
   used to be ;)

 - the cost of just the single indirect - and unpredictable - call.

(The second cost is actually often totally hidden in the trivial system 
call benchmarks people run: if the benchmark just does "getppid()" a 
million times in a tight loop, the indirect call on the system call number 
seems really quite fast, but outside of benchmarks it is generally totally 
unpredictable indeed, and a real cost for real-life system call usage).
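The kind of trivial benchmark described above can be sketched like this. It is an illustration only, assuming a POSIX system with clock_gettime(); the function name is hypothetical. Because the loop issues the same call every time, the indirect branch on the system call number is easy to predict, so the number it reports says little about mixed real-world system call sequences.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <time.h>
#include <unistd.h>

/* Elapsed nanoseconds for `iterations` back-to-back getppid() calls. */
long long time_getppid_loop(long iterations)
{
	struct timespec start, end;
	long i;

	assert(iterations >= 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++)
		(void)getppid();
	clock_gettime(CLOCK_MONOTONIC, &end);

	return (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
	       + (end.tv_nsec - start.tv_nsec);
}
```

Dividing the result by the iteration count gives a per-call figure for this best-case pattern.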

Everything else in the system call path is generally as fast as we can 
make it. Doing more indirection and conditionals would be really quite 
nasty.

Of course, for *most* of system calls, the work the kernel actually does 
ends up being so big that it doesn't much matter, but I was literally 
chasing down why a page fault had slowed down by ~70 cycles two weeks ago. 
And it doesn't take more than a couple of unpredictable jumps to do things 
like that!

> The 6-word limit is a red herring.  There are at least two ways to deal with it
> (and this doesn't mean wiping the legacy stuff we already have):
> 
> - Let each architecture pick a calling convention and redefine the
> architecture-independent bits to take an arbitrary number of arguments.  This
> is a one-time panarchitectural change.

Not applicable on x86-32.

The six-word limit is effectively a hardware limit there. Once it goes 
past that limit, one of the words needs to be a pointer to extended 
information that is fundamentally slower to access. Happily, only very 
rare system calls do that (and none of them are of the simple variety 
where we see a few cycles easily).

On other architectures, we could more easily just use more registers. But 
x86-32 is still a big part (bulk) of what matters for most people.

			Linus


* Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
  2007-11-27  2:14           ` Linus Torvalds
@ 2007-11-27  2:38             ` H. Peter Anvin
  0 siblings, 0 replies; 27+ messages in thread
From: H. Peter Anvin @ 2007-11-27  2:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, David Miller, linux-kernel, akpm, mingo, tglx

Linus Torvalds wrote:
> 
>> The 6-word limit is a red herring.  There are at least two ways to deal with it
>> (and this doesn't mean wiping the legacy stuff we already have):
>>
>> - Let each architecture pick a calling convention and redefine the
>> architecture-independent bits to take an arbitrary number of arguments.  This
>> is a one-time panarchitectural change.
> 
> Not applicable on x86-32.
> 
> The six-word limit is effectively a hardware limit there. Once it goes 
> past that limit, one of the words needs to be a pointer to extended 
> information that is fundamentally slower to access. Happily, only very 
> rare system calls do that (and none of them are of the simple variety 
> where we see a few cycles easily).
> 
> On other architectures, we could more easily just use more registers. But 
> x86-32 is still a big part (bulk) of what matters for most people.
> 

Well, x86-32 and x86-64 are surprisingly similar here, for very 
different reasons (on x86-64, it is because only seven of the clobbered 
registers aren't destroyed by the syscall instruction itself).
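
A minimal sketch of the x86-64 convention referred to here, assuming GCC
or Clang extended inline assembly on Linux. The helper name raw_syscall6
is hypothetical. It loads the six argument registers and lists %rcx and
%r11 as clobbered, because the syscall instruction overwrites both. On
any other platform the sketch simply falls back to the libc syscall()
wrapper, which reports errors through errno instead of a negative return
value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a system call with six argument words. */
long raw_syscall6(long nr, long a0, long a1, long a2,
		  long a3, long a4, long a5)
{
#if defined(__x86_64__) && defined(__linux__)
	long ret;
	/* Arguments 4..6 travel in %r10, %r8 and %r9. */
	register long r10 __asm__("r10") = a3;
	register long r8 __asm__("r8") = a4;
	register long r9 __asm__("r9") = a5;

	__asm__ volatile("syscall"
			 : "=a"(ret)
			 : "a"(nr), "D"(a0), "S"(a1), "d"(a2),
			   "r"(r10), "r"(r8), "r"(r9)
			 /* syscall itself destroys %rcx and %r11 */
			 : "rcx", "r11", "memory");
	return ret;	/* values in -4095..-1 are negated errno codes */
#else
	/* Portable fallback through libc. */
	return syscall(nr, a0, a1, a2, a3, a4, a5);
#endif
}
```

For example, raw_syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0) should return the
same value as getpid().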

However, on both of these we could make the user-space side cheaper, by 
making sure that we don't have to do additional copies in user space. 
For both these architectures, anything more than 3 parameters (i386) or 
6 parameters (x86-64) will already be in memory on the stack, so if we 
can use that image as-is then we at least save the intra-user-space copy 
that goes along with it.

x86-64 requires some minor thought, since the obvious way of doing it 
(using arg register 6 to push in a pointer) would end up with a 
discontiguous frame.  One can do it with something like this, although 
it's not clear to me it is a win at all (the more obvious sequence using 
XCHG isn't usable since XCHG locks unconditionally):

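	pop	%r10			# Return address
	push	%r9			# Argument 6
	movq	%rsp, %r11		# Pointer to argument 6 and the stack arguments after it
	push	%r10			# Put the return address back
	movq	%rcx, %r10		# Argument 4 (syscall overwrites %rcx)
	syscall
	cmpq	$-4095, %rax		# -4095..-1 are error codes
	jae	...
	pop	%r10			# Return address
	pop	%r9			# Drop argument 6 from the stack
	push	%r10			# Restore the return address
	retq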

The number of registers does vary, obviously, with s390 having the 
smallest number (5).

> Immediately when you do anything but registers, it is much *much* more 
> costly. The "get_user()" and "copy_from_user()" stuff is not exactly slow, 
> but it's quite noticeable overhead for simple system calls. It gets worse 
> if this all is described by some indirect table setup.

True, of course, although we're talking here about different ways to 
pull arguments out of userspace memory; *definitely* agreed that we 
don't want any additional indirection.
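
A minimal user-space sketch of the trade-off discussed above, under
loudly stated assumptions: every name here (struct extra_args,
model_copy_from_user, sum_regs, sum_extended, MAX_EXTRA) is hypothetical,
and memcpy() merely stands in for the kernel's copy_from_user(). It
contrasts a call whose six words all arrive as plain arguments with one
whose sixth word is a pointer to further arguments in memory, which costs
an extra copy and extra validation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_EXTRA 4	/* hypothetical cap on extra argument words */

/* Hypothetical in-memory block that the sixth argument word points to. */
struct extra_args {
	size_t nwords;
	long word[MAX_EXTRA];
};

/* User-space stand-in for copy_from_user(): 0 on success, nonzero on fault. */
static int model_copy_from_user(void *dst, const void *src, size_t n)
{
	if (src == NULL)
		return 1;
	memcpy(dst, src, n);
	return 0;
}

/* Register-only call: all six words arrive as plain arguments. */
long sum_regs(long a0, long a1, long a2, long a3, long a4, long a5)
{
	return a0 + a1 + a2 + a3 + a4 + a5;
}

/* Extended call: five words in registers, the sixth points to memory. */
long sum_extended(long a0, long a1, long a2, long a3, long a4,
		  const struct extra_args *uptr)
{
	struct extra_args args;
	long sum = a0 + a1 + a2 + a3 + a4;
	size_t i;

	/* This copy is the cost that the register-only path avoids. */
	if (model_copy_from_user(&args, uptr, sizeof(args)))
		return -EFAULT;
	if (args.nwords > MAX_EXTRA)
		return -EINVAL;
	for (i = 0; i < args.nwords; i++)
		sum += args.word[i];
	return sum;
}
```

The extended variant has to copy and validate the block before it can
touch a single extra word, which is the overhead being weighed against
simply using registers.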

	-hpa


Thread overview: 27+ messages
2007-11-20  6:53 [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets Ulrich Drepper
2007-11-20  7:59 ` David Miller
2007-11-20 16:04   ` Ulrich Drepper
2007-11-20 18:13     ` H. Peter Anvin
2007-11-20 18:24       ` Zach Brown
2007-11-20 19:12         ` H. Peter Anvin
2007-11-20 22:22           ` Ingo Molnar
2007-11-20 22:33             ` Davide Libenzi
2007-11-20 22:42               ` Ingo Molnar
2007-11-20 23:25             ` H. Peter Anvin
2007-11-20 23:41               ` Ingo Molnar
2007-11-20 23:57                 ` H. Peter Anvin
2007-11-26 18:17       ` Linus Torvalds
2007-11-26 18:45         ` Ingo Molnar
2007-11-26 19:07           ` H. Peter Anvin
2007-11-26 19:55             ` Davide Libenzi
2007-11-26 19:20         ` H. Peter Anvin
2007-11-26 23:25           ` Ulrich Drepper
2007-11-27  0:14             ` H. Peter Anvin
2007-11-27  0:42               ` Ulrich Drepper
2007-11-27  1:23                 ` H. Peter Anvin
2007-11-27  2:14           ` Linus Torvalds
2007-11-27  2:38             ` H. Peter Anvin
2007-11-20 21:48     ` David Miller
2007-11-20 21:55       ` Zach Brown
2007-11-20 22:36         ` David Miller
2007-11-20 17:54   ` Zach Brown
