From: Ingo Molnar <mingo@kernel.org>
To: David Drysdale <drysdale@google.com>
Cc: linux-api@vger.kernel.org,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>,
	Shuah Khan <shuahkh@osg.samsung.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Eric B Munson <emunson@akamai.com>,
	Randy Dunlap <rdunlap@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Peter Zijlstra <peterz@infradead.org>,
	Vivek Goyal <vgoyal@redhat.com>,
	Alexei Starovoitov <ast@plumgrid.com>,
	David Herrmann <dh.herrmann@gmail.com>,
	Theodore Ts'o <tytso@mit.edu>, Kees Cook <keescook@chromium.org>
Subject: Re: [PATCHv2 1/1] Documentation: describe how to add a system call
Date: Thu, 30 Jul 2015 10:38:31 +0200
Message-ID: <20150730083831.GA22182@gmail.com>
In-Reply-To: <1438242731-27756-2-git-send-email-drysdale@google.com>


* David Drysdale <drysdale@google.com> wrote:

> +Designing the API
> +-----------------
> +
> +A new system call forms part of the API of the kernel, and has to be supported
> +indefinitely.  As such, it's a very good idea to explicitly discuss the
> +interface on the kernel mailing list, and to plan for future extensions of the
> +interface.  In particular:
> +
> +  **Include a flags argument for every new system call**

Sorry, but I think that's bad advice, because even a 'flags' field is inflexible
and stupid in many cases - it fosters an 'ioctl' kind of design.

> +The syscall table is littered with historical examples where this wasn't done, 
> +together with the corresponding follow-up system calls (eventfd/eventfd2, 
> +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so 
> +learn from the history of the kernel and include a flags argument from the 
> +start.

The syscall table is also littered with system calls that have an argument space 
considerably larger than what 6 parameters can express, where various 'flags' are 
used to bring in different parts of new APIs, in a rather messy way.

The right approach IMHO is to think about how extensible a system call is expected 
to be, and to plan accordingly.

If you are anywhere close to 6 parameters, you should not introduce 'flags' but 
you should _reduce_ the number of parameters to a clean essential of 2 or 3 
parameters and should shuffle parameters out to a separate 'parameters/attributes' 
structure that is passed in by pointer:

	SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params);

And it's the design of 'struct params' that determines future flexibility of the 
interface. A very flexible approach is to not use flags but a 'size' argument:

	struct params {
		u32 size;
		u32 param_1;
		u64 param_2;
		u64 param_3;
	};

Where 'size' is set by user-space to the size of 'struct params' known to it at 
build time:

	params->size = sizeof(*params);

In the normal case the kernel will get params->size == sizeof(*params) as known to
the kernel.
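
To make the user-space side concrete, here is a minimal sketch of how a caller
might prepare the structure. It is an illustration only: params_init() is a
hypothetical helper, and uint32_t/uint64_t stand in for the kernel's u32/u64.
Zeroing the whole structure first matters, because any field the caller does not
set explicitly then reads as 0, which is the "old behavior" value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as 'struct params' above, with user-space fixed-width types. */
struct params {
	uint32_t size;
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
};

/* Hypothetical helper: zero everything, then record the build-time size. */
static void params_init(struct params *params)
{
	memset(params, 0, sizeof(*params));
	params->size = sizeof(*params);
}
```

A caller would run params_init(&p), fill in the fields it cares about, and pass
&p to the system call. With this layout sizeof(struct params) is 24 bytes.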

When the system call is extended in the future on the kernel side, with 'u64 
param_4', then the structure expands from an old size of 24 to a new size of 32 
bytes. The following scenarios might occur:

 - the common case: new user-space calls the new kernel code, ->size is 32 on both 
   sides.

 - old binaries might call the kernel with params->size == 24, in which case the 
   kernel sets the new fields to 0. The new feature should be written
   accordingly, so that a value of 0 means the old behavior.

 - new binaries might run on old kernels, with params->size == 32. In this case 
   the old kernel will check that all the new fields it does not know about are 
   set to 0 - if they are nonzero (if the new feature is used) it returns with 
   -ENOSYS or -EINVAL.
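
The three scenarios can be sketched as a single copy routine. This is a
user-space simulation under stated assumptions, not the actual kernel code:
copy_sized_struct(), params_v1, params_v2 and PARAMS_SIZE_VER0 are hypothetical
names, plain memcpy() stands in for copy_from_user(), -EINVAL is picked
arbitrarily from the error codes mentioned above, and a real implementation
would also cap 'size' at some sane upper bound.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct params_v1 {		/* original layout: 24 bytes */
	uint32_t size;
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
};

struct params_v2 {		/* extended layout: 32 bytes */
	uint32_t size;
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
	uint64_t param_4;
};

#define PARAMS_SIZE_VER0 24	/* smallest size ever published */

/*
 * Copy a size-prefixed struct from 'src' (the caller's view) into 'dst'
 * (the kernel's view, 'ksize' bytes). Returns 0 or a negative errno.
 */
static int copy_sized_struct(void *dst, size_t ksize, const void *src)
{
	const unsigned char *bytes = src;
	uint32_t usize;
	size_t i;

	memcpy(&usize, bytes, sizeof(usize));
	if (usize < PARAMS_SIZE_VER0)
		return -EINVAL;

	/* New binary, old kernel: fields we do not know must all be 0. */
	for (i = ksize; i < usize; i++)
		if (bytes[i] != 0)
			return -EINVAL;

	/* Old binary, new kernel: fields the caller lacks default to 0. */
	memset(dst, 0, ksize);
	memcpy(dst, bytes, usize < ksize ? usize : ksize);
	return 0;
}
```

When both sides agree on the size, the loop and the zero-fill are no-ops and the
routine degenerates into a plain copy.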

With this approach we have both backwards and forwards binary compatibility: new
binaries will run on old kernels just fine, even if they have ->size set to 32, as
long as they don't make use of the new features.

This design simplifies application design considerably: new code can mostly
forget about old ABIs, there are no multiple versions to take care of, there's
just a single 'struct params' known to both sides, and there's no version skew.

We are using such a design in perf_event_open(), see perf_copy_attr() in 
kernel/events/core.c. And yes, ironically that system call still has a historic 
'flags' argument, but it's not used anymore for extension: we've made over 30 
extensions to the ABI in the last 3 years, which would have been impossible with a 
'flags' approach.

Thanks,

	Ingo
