Linux userland API discussions
 help / color / mirror / Atom feed
* Re: Edited seccomp.2 man page for review [v2]
From: Kees Cook @ 2014-12-30 17:16 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Daniel Borkmann, Linux API,
	linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lkml
In-Reply-To: <54A29722.1010901-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Tue, Dec 30, 2014 at 4:14 AM, Michael Kerrisk (man-pages)
<mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi Kees, (and all),
>
> Thanks for your comments on the previous draft of the seccomp(2)
> man page and (once again) my apologies for the slow follow-up.
>
> I have done some further editing of the page. Could you check
> the revised version below. I have added a number of FIXMEs
> for points where I'd either like you to check new text that I
> added (in case it contains errors) or where I hope you can
> provide answers to questions relating to details that may need
> clarifying in the page.
>
> I've appended the revised page at the foot of this mail. You can also
> find the branch holding this page in Git at:
> http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_seccomp
>
> Notable changes from the previous draft:
> * Several new error cases added under ERRORS
> * New subsection on Seccomp-specific BPF details
> * Add some detail in discussion of 'siginfo_t' fields
> * Tweaked comments on BPF program in EXAMPLE section
> * Added various FIXMEs
>
> I also have one API quibble, regarding the name of the
> SYS_SECCOMP constant; see below.
>
> Feedback as inline comments to the below would be great!
>
> Cheers,
>
> Michael
>
> .\" Copyright (C) 2014 Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> .\" and Copyright (C) 2012 Will Drewry <wad-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> .\" and Copyright (C) 2008, 2014 Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> .\"
> .\" %%%LICENSE_START(VERBATIM)
> .\" Permission is granted to make and distribute verbatim copies of this
> .\" manual provided the copyright notice and this permission notice are
> .\" preserved on all copies.
> .\"
> .\" Permission is granted to copy and distribute modified versions of this
> .\" manual under the conditions for verbatim copying, provided that the
> .\" entire resulting derived work is distributed under the terms of a
> .\" permission notice identical to this one.
> .\"
> .\" Since the Linux kernel and libraries are constantly changing, this
> .\" manual page may be incorrect or out-of-date.  The author(s) assume no
> .\" responsibility for errors or omissions, or for damages resulting from
> .\" the use of the information contained herein.  The author(s) may not
> .\" have taken the same level of care in the production of this manual,
> .\" which is licensed free of charge, as they might when working
> .\" professionally.
> .\"
> .\" Formatted or processed versions of this manual, if unaccompanied by
> .\" the source, must acknowledge the copyright and authors of this work.
> .\" %%%LICENSE_END
> .\"
> .TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
> .SH NAME
> seccomp \- operate on Secure Computing state of the process
> .SH SYNOPSIS
> .nf
> .B #include <linux/seccomp.h>
> .B #include <linux/filter.h>
> .B #include <linux/audit.h>
> .B #include <linux/signal.h>
> .B #include <sys/ptrace.h>
> .\" Kees Cook noted: Anything that uses SECCOMP_RET_TRACE returns will
> .\"                  need <sys/ptrace.h>
>
> .BI "int seccomp(unsigned int " operation ", unsigned int " flags \
> ", void *" args );
> .fi
> .SH DESCRIPTION
> The
> .BR seccomp ()
> system call operates on the Secure Computing (seccomp) state of the
> calling process.
>
> Currently, Linux supports the following
> .IR operation
> values:
> .TP
> .BR SECCOMP_SET_MODE_STRICT
> The only system calls that the calling thread is permitted to make are
> .BR read (2),
> .BR write (2),
> .BR _exit (2),
> and
> .BR sigreturn (2).
> Other system calls result in the delivery of a
> .BR SIGKILL
> signal.
> Strict secure computing mode is useful for number-crunching
> applications that may need to execute untrusted byte code, perhaps
> obtained by reading from a pipe or socket.
>
> This operation is available only if the kernel is configured with
> .BR CONFIG_SECCOMP
> enabled.
>
> The value of
> .IR flags
> must be 0, and
> .IR args
> must be NULL.
>
> This operation is functionally identical to the call:
>
>     prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
> .TP
> .BR SECCOMP_SET_MODE_FILTER
> The system calls allowed are defined by a pointer to a Berkeley Packet
> Filter (BPF) passed via
> .IR args .
> This argument is a pointer to a
> .IR "struct\ sock_fprog" ;
> it can be designed to filter arbitrary system calls and system call
> arguments.
> If the filter is invalid,
> .BR seccomp ()
> fails, returning
> .BR EINVAL
> in
> .IR errno .
>
> If
> .BR fork (2)
> or
> .BR clone (2)
> is allowed by the filter, any child processes will be constrained to
> the same system call filters as the parent.
> If
> .BR execve (2)
> is allowed,
> the existing filters will be preserved across a call to
> .BR execve (2).
>
> In order to use the
> .BR SECCOMP_SET_MODE_FILTER
> operation, either the caller must have the
> .BR CAP_SYS_ADMIN
> capability, or the thread must already have the
> .I no_new_privs
> bit set.
> If that bit was not already set by an ancestor of this thread,
> the thread must make the following call:
>
>     prctl(PR_SET_NO_NEW_PRIVS, 1);
>
> Otherwise, the
> .BR SECCOMP_SET_MODE_FILTER
> operation will fail and return
> .BR EACCES
> in
> .IR errno .
> This requirement ensures that an unprivileged process cannot apply
> a malicious filter and then invoke a set-user-ID or
> other privileged program using
> .BR execve (2),
> thus potentially compromising that program.
> (Such a malicious filter might, for example, cause an attempt to use
> .BR setuid (2)
> to set the caller's user IDs to non-zero values to instead
> return 0 without actually making the system call.
> Thus, the program might be tricked into retaining superuser privileges
> in circumstances where it is possible to influence it to do
> dangerous things because it did not actually drop privileges.)
>
> If
> .BR prctl (2)
> or
> .BR seccomp (2)
> is allowed by the attached filter, further filters may be added.
> This will increase evaluation time, but allows for further reduction of
> the attack surface during execution of a thread.
>
> The
> .BR SECCOMP_SET_MODE_FILTER
> operation is available only if the kernel is configured with
> .BR CONFIG_SECCOMP_FILTER
> enabled.
>
> When
> .IR flags
> is 0, this operation is functionally identical to the call:
>
>     prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);
>
> The recognized
> .IR flags
> are:
> .RS
> .TP
> .BR SECCOMP_FILTER_FLAG_TSYNC
> When adding a new filter, synchronize all other threads of the calling
> process to the same seccomp filter tree.
> A "filter tree" is the ordered list of filters attached to a thread.
> (Attaching identical filters in separate
> .BR seccomp ()
> calls results in different filters from this perspective.)
>
> If any thread cannot synchronize to the same filter tree,
> the call will not attach the new seccomp filter,
> and will fail, returning the first thread ID found that cannot synchronize.
> Synchronization will fail if another thread in the same process is in
> .BR SECCOMP_MODE_STRICT
> or if it has attached new seccomp filters to itself,
> diverging from the calling thread's filter tree.
> .RE
> .SS Filters
> When adding filters via
> .BR SECCOMP_SET_MODE_FILTER ,
> .IR args
> points to a filter program:
>
> .in +4n
> .nf
> struct sock_fprog {
>     unsigned short      len;    /* Number of BPF instructions */
>     struct sock_filter *filter; /* Pointer to array of
>                                    BPF instructions */
> };
> .fi
> .in
>
> Each program must contain one or more BPF instructions:
>
> .in +4n
> .nf
> struct sock_filter {            /* Filter block */
>     __u16 code;                 /* Actual filter code */
>     __u8  jt;                   /* Jump true */
>     __u8  jf;                   /* Jump false */
>     __u32 k;                    /* Generic multiuse field */
> };
> .fi
> .in
>
> .\" FIXME I reworded/enhanced the following sentence. Is it okay?
> When executing the instructions, the BPF program operates on the
> system call information made available (i.e., use the
> .BR BPF_ABS
> addressing mode) as a buffer of the following form:

That looks correct to me, yes.

>
> .in +4n
> .nf
> struct seccomp_data {
>     int   nr;                   /* System call number */
>     __u32 arch;                 /* AUDIT_ARCH_* value
>                                    (see <linux/audit.h>) */
>     __u64 instruction_pointer;  /* CPU instruction pointer */
>     __u64 args[6];              /* Up to 6 system call arguments */
> };
> .fi
> .in
>
> A seccomp filter returns a 32-bit value consisting of two parts:
> the most significant 16 bits
> (corresponding to the mask defined by the constant
> .BR SECCOMP_RET_ACTION )
> contain one of the "action" values listed below;
> the least significant 16-bits (defined by the constant
> .BR SECCOMP_RET_DATA )
> are "data" to be associated with this return value.
>
> If multiple filters exist, they are all executed,
> in reverse order of their addition to the filter tree
> (i.e., the most recently installed filter is executed first).
> The return value for the evaluation of a given system call is the first-seen
> .BR SECCOMP_RET_ACTION
> value of highest precedence (along with its accompanying data)
> returned by execution of all of the filters.
>
> In decreasing order of precedence,
> the values that may be returned by a seccomp filter are:
> .TP
> .BR SECCOMP_RET_KILL
> This value results in the process exiting immediately
> without executing the system call.
> The process terminates as though killed by a
> .B SIGSYS
> signal
> .RI ( not
> .BR SIGKILL ).
> .TP
> .BR SECCOMP_RET_TRAP
> This value results in the kernel sending a
> .BR SIGSYS
> signal to the triggering process without executing the system call.
> Various fields will be set in the
> .I siginfo_t
> structure (see
> .BR sigaction (2))
> associated with signal:
> .RS
> .IP * 3
> .I si_signo
> will contain
> .BR SIGSYS .
> .IP *
> .IR si_call_addr
> will show the address of the system call instruction.
> .IP *
> .IR si_syscall
> and
> .IR si_arch
> will indicate which system call was attempted.
> .IP *
> .I si_code
> .\" FIXME Why is the constant thus named? All of the other 'si_code'
> .\"       constants are prefixed 'SI_'. Why the inconsistency?
> will contain
> .BR SYS_SECCOMP .

Only certain reserved values have the SI_ prefix. All the
signal-specific values have their signal name as the prefix. See ILL_*
FPE_* SEGV_* BUS_* TRAP_* CLD_* POLL_* and SYS_*. I see these in
/usr/include/asm-generic/siginfo.h

> .IP *
> .I si_errno
> will contain the
> .BR SECCOMP_RET_DATA
> portion of the filter return value.
> .RE
> .IP
> The program counter will be as though the system call happened
> (i.e., it will not point to the system call instruction).
> The return value register will contain an architecture\-dependent value;
> if resuming execution, set it to something sensible.
> .\" FIXME Regarding the preceding line, can you give an example(s)
> .\"       of "something sensible"? (Depending on the answer, maybe it
> .\"       might be useful to add some text on this point.)

This means sensible in the context of the syscall made, or the desired
behavior. For example, setting the return value to ELOOP for something
like a "bind" syscall isn't very sensible.

> .\"
> .\" FIXME Please check:
> .\"     In an attempt to make the text clearer, I changed
> .\"     "replacing it with" to "setting the return value register to"
> .\"     Okay?
> (The architecture dependency is because setting the return value register to
> .BR ENOSYS
> could overwrite some useful information.)

Well, the arch dependency is really because _how_ to change the
register, and the register itself, is different between architectures.
(i.e. which ptrace call is needed, and which register is being
changed.) The overwriting of useful information is certainly true too,
though.

> .TP
> .BR SECCOMP_RET_ERRNO
> This value results in the
> .B SECCOMP_RET_DATA
> portion of the filter's return value being passed to user space as the
> .IR errno
> value without executing the system call.
> .TP
> .BR SECCOMP_RET_TRACE
> When returned, this value will cause the kernel to attempt to notify a
> .BR ptrace (2)-based
> tracer prior to executing the system call.
> If there is no tracer present,
> the system call is not executed and returns a failure status with
> .I errno
> set to
> .BR ENOSYS .
>
> A tracer will be notified if it requests
> .BR PTRACE_O_TRACESECCOMP
> using
> .IR ptrace(PTRACE_SETOPTIONS) .
> The tracer will be notified of a
> .BR PTRACE_EVENT_SECCOMP
> and the
> .BR SECCOMP_RET_DATA
> portion of the filter's return value will be available to the tracer via
> .BR PTRACE_GETEVENTMSG .
>
> The tracer can skip the system call by changing the system call number
> to \-1.
> Alternatively, the tracer can change the system call
> requested by changing the system call to a valid system call number.
> If the tracer asks to skip the system call, then the system call will
> appear to return the value that the tracer puts in the return value register.
>
> The seccomp check will not be run again after the tracer is notified.
> (This means that seccomp-based sandboxes
> .B "must not"
> allow use of
> .BR ptrace (2)\(emeven
> of other
> sandboxed processes\(emwithout extreme care;
> .\" FIXME Below, I think it would be helpful to add some words after
> .\"       "to escape", as in "to escape [what?]" I suppose the wording
> .\"       would be something like "to escape the seccomp sandbox mechanism"
> .\"       but perhaps you have a better wording.
> ptracers can use this mechanism to escape.)

Yeah, that could be further clarified to "... use this mechanism to
escape from the seccomp sandbox." How does that sound?

> .TP
> .BR SECCOMP_RET_ALLOW
> This value results in the system call being executed.
> .SH RETURN VALUE
> On success,
> .BR seccomp ()
> returns 0.
> On error, if
> .BR SECCOMP_FILTER_FLAG_TSYNC
> was used,
> the return value is the ID of the thread
> that caused the synchronization failure.
> (This ID is a kernel thread ID of the type returned by
> .BR clone (2)
> and
> .BR gettid (2).)
> On other errors, \-1 is returned, and
> .IR errno
> is set to indicate the cause of the error.
> .SH ERRORS
> .BR seccomp ()
> can fail for the following reasons:
> .TP
> .BR EACCESS
> The caller did not have the
> .BR CAP_SYS_ADMIN
> capability, or had not set
> .IR no_new_privs
> before using
> .BR SECCOMP_SET_MODE_FILTER .
> .TP
> .BR EFAULT
> .IR args
> was not a valid address.
> .TP
> .BR EINVAL
> .IR operation
> is unknown; or
> .IR flags
> are invalid for the given
> .IR operation .
> .\" FIXME Please review the following
> .TP
> .BR EINVAL
> .I operation
> included
> .BR BPF_ABS ,
> but the specified offset was not aligned to a 32-bit boundary or exceeded
> .IR "sizeof(struct\ seccomp_data)" .
> .\" FIXME Please review the following
> .TP
> .BR EINVAL
> .\" See kernel/seccomp.c::seccomp_may_assign_mode() in 3.18 sources
> A secure computing mode has already been set, and
> .I operation
> differs from the existing setting.
> .\" FIXME Please review the following
> .TP
> .BR EINVAL
> .\" See stub kernel/seccomp.c::seccomp_set_mode_filter() in 3.18 sources
> .I operation
> specified
> .BR SECCOMP_SET_MODE_FILTER ,
> but the kernel was not built with
> .B CONFIG_SECCOMP_FILTER
> enabled.
> .\" FIXME Please review the following
> .TP
> .BR EINVAL
> .I operation
> specified
> .BR SECCOMP_SET_MODE_FILTER ,
> but the filter program pointed to by
> .I args
> was not valid or the length of the filter program was zero or exceeded
> .B BPF_MAXINSNS
> (4096) instructions.
> .BR EINVAL
> .TP
> .BR ENOMEM
> Out of memory.
> .\" FIXME Please review the following
> .TP
> .BR ENOMEM
> .\" ENOMEM in kernel/seccomp.c::seccomp_attach_filter() in 3.18 sources
> The total length of all filter programs attached
> to the calling thread would exceed
> .B MAX_INSNS_PER_PATH
> (32768) instructions.
> Note that for the purposes of calculating this limit,
> each already existing filter program incurs an
> overhead penalty of 4 instructions.
> .TP
> .BR ESRCH
> Another thread caused a failure during thread sync, but its ID could not
> be determined.
> .SH VERSIONS
> The
> .BR seccomp()
> system call first appeared in Linux 3.17.
> .\" FIXME . Add glibc version
> .SH CONFORMING TO
> The
> .BR seccomp()
> system call is a nonstandard Linux extension.
> .SH NOTES
> .BR seccomp ()
> provides a superset of the functionality provided by the
> .BR prctl (2)
> .BR PR_SET_SECCOMP
> operation (which does not support
> .IR flags ).
> .\" FIXME Please review the following new subsection {{{
> .SS Seccomp-specific BPF details
> Note the following BPF details specific to seccomp filters:
> .IP * 3
> The
> .B BPF_H
> and
> .B BPF_B
> size modifiers are not supported: all operations must load and store
> (4-byte) words
> .RB ( BPF_W ).
> .IP *
> To access the contents of the
> .I seccomp_data
> buffer, use the
> .B BPF_ABS
> addressing mode modifier.
> .\" FIXME What is the significance of the line
> .\"           ftest->code = BPF_LDX | BPF_W | BPF_ABS;
> .\"       in kernel/seccomp.c::seccomp_check_filter()?

This is converting an accumulator load (BPF_LD) into a index load
(BPF_LDX). I think this is to avoid addressing modes 1 and 2, but Will
may remember more here. The LD|W|ABS structure is very common, so I
think this was a way to accept that in the filter, but change it into
a more limited command.

> .IP *
> The
> .B BPF_LEN
> addressing mode modifier yields an immediate mode operand
> whose value is the size of the
> .IR seccomp_data
> buffer.
> .\" FIXME Any other seccomp-specific BPF details that should be added here?
> .\"
> .\" FIXME End of new subsection for review }}}

All the rest of the FIXMEs above (excepting the standing glibc one)
looks correct to me.

> .SH EXAMPLE
> The program below accepts four or more arguments.
> The first three arguments are a system call number,
> a numeric architecture identifier, and an error number.
> The program uses these values to construct a BPF filter
> that is used at run time to perform the following checks:
> .IP [1] 4
> If the program is not running on the specified architecture,
> the BPF filter causes system calls to fail with the error
> .BR ENOSYS .
> .IP [2]
> If the program attempts to execute the system call with the specified number,
> the BPF filter causes the system call to fail, with
> .I errno
> being set to the specified error number.
> .PP
> The remaining command-line arguments specify
> the pathname and additional arguments of a program
> that the example program should attempt to execute using
> .BR execve (3)
> (a library function that employs the
> .BR execve (2)
> system call).
> Some example runs of the program are shown below.
>
> First, we display the architecture that we are running on (x86-64)
> and then construct a shell function that looks up system call
> numbers on this architecture:
>
> .nf
> .in +4n
> $ \fBuname -m\fP
> x86_64
> $ \fBsyscall_nr() {
>     cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \\
>     awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
> }\fP
> .in
> .fi
>
> When the BPF filter rejects a system call (case [2] above),
> it causes the system call to fail with the error number
> specified on the command line.
> In the experiments shown here, we'll use error number 99:
>
> .nf
> .in +4n
> $ \fBerrno 99\fP
> EADDRNOTAVAIL 99 Cannot assign requested address
> .in
> .fi
>
> In the following example, we attempt to run the command
> .BR whoami (1),
> but the BPF filter rejects the
> .BR execve (2)
> system call, so that the command is not even executed:
>
> .nf
> .in +4n
> $ \fBsyscall_nr execve\fP
> 59
> $ \fB./a.out\fP
> Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
> Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
>                  AUDIT_ARCH_X86_64: 0xC000003E
> $ \fB./a.out 59 0xC000003E 99 /bin/whoami\fP
> execv: Cannot assign requested address
> .in
> .fi
>
> In the next example, the BPF filter rejects the
> .BR write (2)
> system call, so that, although it is successfully started, the
> .BR whoami (1)
> command is not able to write output:
>
> .nf
> .in +4n
> $ \fBsyscall_nr write\fP
> 1
> $ \fB./a.out 1 0xC000003E 99 /bin/whoami\fP
> .in
> .fi
>
> In the final example,
> the BPF filter rejects a system call that is not used by the
> .BR whoami (1)
> command, so it is able to successfully execute and produce output:
>
> .nf
> .in +4n
> $ \fBsyscall_nr preadv\fP
> 295
> $ \fB./a.out 295 0xC000003E 99 /bin/whoami\fP
> cecilia
> .in
> .fi
> .SS Program source
> .fi
> .nf
> #include <errno.h>
> #include <stddef.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <linux/audit.h>
> #include <linux/filter.h>
> #include <linux/seccomp.h>
> #include <sys/prctl.h>
>
> static int
> install_filter(int syscall_nr, int t_arch, int f_errno)
> {
> .\" FIXME In the BPF program below, you use '+' to build the instructions.
> .\"       However, most other BPF example code I see uses '|'. While I
> .\"       assume it's equivalent (i.e., the bit fields are nonoverlapping),
> .\"       was there a reason to use '+' rather than '|'? (To me, the
> .\"       latter is a little clearer in its intent.)

Ah, no, "|" should be used, good catch.

> .\"
> .\" FIXME I expanded comments [0], [1], [2], [3], [4] a little.
> .\"       Are they okay? */

Yup, these look good to me.

> .\"
>     struct sock_filter filter[] = {
>         /* [0] Load architecture from 'seccomp_data' buffer into
>                accumulator */
>         BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
>                  (offsetof(struct seccomp_data, arch))),
>
>         /* [1] Jump forward 4 instructions if architecture does not
>                match 't_arch' */
>         BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, t_arch, 0, 4),
>
>         /* [2] Load system call number from 'seccomp_data' buffer into
>                accumulator */
>         BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
>                  (offsetof(struct seccomp_data, nr))),
>
>         /* [3] Jump forward 1 instruction if system call number
>                does not match 'syscall_nr' */
>         BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, syscall_nr, 0, 1),
>
>         /* [4] Matching architecture and system call: don't execute
>                the system call, and return 'f_errno' in 'errno' */
>         BPF_STMT(BPF_RET + BPF_K,
>                  SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),
>
>         /* [5] Destination of system call number mismatch: allow other
>                system calls */
>         BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
>
>         /* [6] Destination of architecture mismatch: kill process */
>         BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
>     };
>
>     struct sock_fprog prog = {
>         .len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
>         .filter = filter,
>     };
>
>     if (seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog)) {
>         perror("seccomp");
>         return 1;
>     }
>
>     return 0;
> }
>
> int
> main(int argc, char **argv)
> {
>     if (argc < 5) {
>         fprintf(stderr, "Usage: "
>                 "%s <syscall_nr> <arch> <errno> <prog> [<args>]\\n"
>                 "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\\n"
>                 "                 AUDIT_ARCH_X86_64: 0x%X\\n"
>                 "\\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
>         exit(EXIT_FAILURE);
>     }
>
>     if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
>         perror("prctl");
>         exit(EXIT_FAILURE);
>     }
>
>     if (install_filter(strtol(argv[1], NULL, 0),
>                        strtol(argv[2], NULL, 0),
>                        strtol(argv[3], NULL, 0)))
>         exit(EXIT_FAILURE);
>
>     execv(argv[4], &argv[4]);
>     perror("execv");
>     exit(EXIT_FAILURE);
> }
> .fi
> .SH SEE ALSO
> .BR prctl (2),
> .BR ptrace (2),
> .BR signal (7),
> .BR socket (7)
> .sp
> The kernel source files
> .IR Documentation/networking/filter.txt
> and
> .IR Documentation/prctl/seccomp_filter.txt .
> .sp
> McCanne, S. and Jacobson, V. (1992)
> .IR "The BSD Packet Filter: A New Architecture for User-level Packet Capture" ,
> Proceedings of the USENIX Winter 1993 Conference
> .UR http://www.tcpdump.org/papers/bpf-usenix93.pdf
> .UE
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

Thanks for the additional details and clarifications!

-Kees

-- 
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH RFC v2 7/7] virtio_pci: drop virtio_config dependency
From: Michael S. Tsirkin @ 2014-12-30 16:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rusty Russell, cornelia.huck, virtualization, linux-api
In-Reply-To: <1419957310-26009-1-git-send-email-mst@redhat.com>

virtio_pci does not depend on virtio_config:
let's not include it, users can pull it in as necessary.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_pci.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 5d546c6..e841edd 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -39,7 +39,7 @@
 #ifndef _LINUX_VIRTIO_PCI_H
 #define _LINUX_VIRTIO_PCI_H
 
-#include <linux/virtio_config.h>
+#include <linux/types.h>
 
 #ifndef VIRTIO_PCI_NO_LEGACY
 
-- 
MST

^ permalink raw reply related

* [PATCH RFC v2 6/7] virtio_pci: macros for PCI layout offsets.
From: Michael S. Tsirkin @ 2014-12-30 16:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-api, virtualization
In-Reply-To: <1419957310-26009-1-git-send-email-mst@redhat.com>

QEMU wants it, so why not?  Trust, but verify.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 include/uapi/linux/virtio_pci.h    | 30 ++++++++++++++++++
 drivers/virtio/virtio_pci_modern.c | 63 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 28c2ce0..5d546c6 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -159,6 +159,36 @@ struct virtio_pci_common_cfg {
 	__le32 queue_used_hi;		/* read-write */
 };
 
+/* Macro versions of offsets for the Old Timers! */
+#define VIRTIO_PCI_CAP_VNDR		0
+#define VIRTIO_PCI_CAP_NEXT		1
+#define VIRTIO_PCI_CAP_LEN		2
+#define VIRTIO_PCI_CAP_TYPE_AND_BAR	3
+#define VIRTIO_PCI_CAP_OFFSET		4
+#define VIRTIO_PCI_CAP_LENGTH		8
+
+#define VIRTIO_PCI_NOTIFY_CAP_MULT	12
+
+#define VIRTIO_PCI_COMMON_DFSELECT	0
+#define VIRTIO_PCI_COMMON_DF		4
+#define VIRTIO_PCI_COMMON_GFSELECT	8
+#define VIRTIO_PCI_COMMON_GF		12
+#define VIRTIO_PCI_COMMON_MSIX		16
+#define VIRTIO_PCI_COMMON_NUMQ		18
+#define VIRTIO_PCI_COMMON_STATUS	20
+#define VIRTIO_PCI_COMMON_CFGGENERATION	21
+#define VIRTIO_PCI_COMMON_Q_SELECT	22
+#define VIRTIO_PCI_COMMON_Q_SIZE	24
+#define VIRTIO_PCI_COMMON_Q_MSIX	26
+#define VIRTIO_PCI_COMMON_Q_ENABLE	28
+#define VIRTIO_PCI_COMMON_Q_NOFF	30
+#define VIRTIO_PCI_COMMON_Q_DESCLO	32
+#define VIRTIO_PCI_COMMON_Q_DESCHI	36
+#define VIRTIO_PCI_COMMON_Q_AVAILLO	40
+#define VIRTIO_PCI_COMMON_Q_AVAILHI	44
+#define VIRTIO_PCI_COMMON_Q_USEDLO	48
+#define VIRTIO_PCI_COMMON_Q_USEDHI	52
+
 #endif /* VIRTIO_PCI_NO_MODERN */
 
 #endif
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 6f63b4c..ba04055 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -470,6 +470,67 @@ static inline int virtio_pci_find_capability(struct pci_dev *dev, u8 cfg_type,
 	return 0;
 }
 
+/* This is part of the ABI.  Don't screw with it. */
+static inline void check_offsets(void)
+{
+	/* Note: disk space was harmed in compilation of this function. */
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_VNDR !=
+		     offsetof(struct virtio_pci_cap, cap_vndr));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_NEXT !=
+		     offsetof(struct virtio_pci_cap, cap_next));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_LEN !=
+		     offsetof(struct virtio_pci_cap, cap_len));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_TYPE_AND_BAR !=
+		     offsetof(struct virtio_pci_cap, type_and_bar));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_OFFSET !=
+		     offsetof(struct virtio_pci_cap, offset));
+	BUILD_BUG_ON(VIRTIO_PCI_CAP_LENGTH !=
+		     offsetof(struct virtio_pci_cap, length));
+	BUILD_BUG_ON(VIRTIO_PCI_NOTIFY_CAP_MULT !=
+		     offsetof(struct virtio_pci_notify_cap,
+			      notify_off_multiplier));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_DFSELECT !=
+		     offsetof(struct virtio_pci_common_cfg,
+			      device_feature_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_DF !=
+		     offsetof(struct virtio_pci_common_cfg, device_feature));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_GFSELECT !=
+		     offsetof(struct virtio_pci_common_cfg,
+			      guest_feature_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_GF !=
+		     offsetof(struct virtio_pci_common_cfg, guest_feature));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_MSIX !=
+		     offsetof(struct virtio_pci_common_cfg, msix_config));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_NUMQ !=
+		     offsetof(struct virtio_pci_common_cfg, num_queues));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_STATUS !=
+		     offsetof(struct virtio_pci_common_cfg, device_status));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_CFGGENERATION !=
+		     offsetof(struct virtio_pci_common_cfg, config_generation));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SELECT !=
+		     offsetof(struct virtio_pci_common_cfg, queue_select));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SIZE !=
+		     offsetof(struct virtio_pci_common_cfg, queue_size));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_MSIX !=
+		     offsetof(struct virtio_pci_common_cfg, queue_msix_vector));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_ENABLE !=
+		     offsetof(struct virtio_pci_common_cfg, queue_enable));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_NOFF !=
+		     offsetof(struct virtio_pci_common_cfg, queue_notify_off));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_desc_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_desc_hi));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_avail_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_avail_hi));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDLO !=
+		     offsetof(struct virtio_pci_common_cfg, queue_used_lo));
+	BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDHI !=
+		     offsetof(struct virtio_pci_common_cfg, queue_used_hi));
+}
+
 /* the PCI probing function */
 int virtio_pci_modern_probe(struct pci_dev *pci_dev,
 			    const struct pci_device_id *id)
@@ -479,6 +540,8 @@ int virtio_pci_modern_probe(struct pci_dev *pci_dev,
 	struct virtio_device_id virtio_id;
 	u32 notify_length;
 
+	check_offsets();
+
 	/* We only own devices >= 0x1000 and <= 0x107f: leave the rest. */
 	if (pci_dev->device < 0x1000 || pci_dev->device > 0x107f)
 		return -ENODEV;
-- 
MST

^ permalink raw reply related

* [PATCH RFC v2 4/7] virtio-pci: define layout for virtio 1.0
From: Michael S. Tsirkin @ 2014-12-30 16:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-api, virtualization
In-Reply-To: <1419957310-26009-1-git-send-email-mst@redhat.com>

From: Rusty Russell <rusty@rustcorp.com.au>

Based on patches by Michael S. Tsirkin <mst@redhat.com>, but I found it
hard to follow so changed to use structures which are more
self-documenting.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_pci.h | 62 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 35b552c..28c2ce0 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -99,4 +99,66 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+#ifndef VIRTIO_PCI_NO_MODERN
+
+/* IDs for different capabilities.  Must all exist. */
+
+/* Common configuration */
+#define VIRTIO_PCI_CAP_COMMON_CFG	1
+/* Notifications */
+#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
+/* ISR access */
+#define VIRTIO_PCI_CAP_ISR_CFG		3
+/* Device specific confiuration */
+#define VIRTIO_PCI_CAP_DEVICE_CFG	4
+
+/* This is the PCI capability header: */
+struct virtio_pci_cap {
+	__u8 cap_vndr;		/* Generic PCI field: PCI_CAP_ID_VNDR */
+	__u8 cap_next;		/* Generic PCI field: next ptr. */
+	__u8 cap_len;		/* Generic PCI field: capability length */
+	__u8 type_and_bar;	/* Upper 3 bits: bar.
+				 * Lower 3 is VIRTIO_PCI_CAP_*_CFG. */
+	__le32 offset;		/* Offset within bar. */
+	__le32 length;		/* Length. */
+};
+
+#define VIRTIO_PCI_CAP_BAR_SHIFT	5
+#define VIRTIO_PCI_CAP_BAR_MASK		0x7
+#define VIRTIO_PCI_CAP_TYPE_SHIFT	0
+#define VIRTIO_PCI_CAP_TYPE_MASK	0x7
+
+struct virtio_pci_notify_cap {
+	struct virtio_pci_cap cap;
+	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
+};
+
+/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
+struct virtio_pci_common_cfg {
+	/* About the whole device. */
+	__le32 device_feature_select;	/* read-write */
+	__le32 device_feature;		/* read-only */
+	__le32 guest_feature_select;	/* read-write */
+	__le32 guest_feature;		/* read-write */
+	__le16 msix_config;		/* read-write */
+	__le16 num_queues;		/* read-only */
+	__u8 device_status;		/* read-write */
+	__u8 config_generation;		/* read-only */
+
+	/* About a specific virtqueue. */
+	__le16 queue_select;		/* read-write */
+	__le16 queue_size;		/* read-write, power of 2. */
+	__le16 queue_msix_vector;	/* read-write */
+	__le16 queue_enable;		/* read-write */
+	__le16 queue_notify_off;	/* read-only */
+	__le32 queue_desc_lo;		/* read-write */
+	__le32 queue_desc_hi;		/* read-write */
+	__le32 queue_avail_lo;		/* read-write */
+	__le32 queue_avail_hi;		/* read-write */
+	__le32 queue_used_lo;		/* read-write */
+	__le32 queue_used_hi;		/* read-write */
+};
+
+#endif /* VIRTIO_PCI_NO_MODERN */
+
 #endif
-- 
MST

^ permalink raw reply related

* Edited seccomp.2 man page for review [v2]
From: Michael Kerrisk (man-pages) @ 2014-12-30 12:14 UTC (permalink / raw)
  To: Kees Cook
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Daniel Borkmann, Linux API,
	linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lkml

Hi Kees, (and all),

Thanks for your comments on the previous draft of the seccomp(2) 
man page and (once again) my apologies for the slow follow-up. 

I have done some further editing of the page. Could you check
the revised version below. I have added a number of FIXMEs
for points where I'd either like you to check new text that I 
added (in case it contains errors) or where I hope you can 
provide answers to questions relating to details that may need 
clarifying in the page.

I've appended the revised page at the foot of this mail. You can also
find the branch holding this page in Git at:
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_seccomp

Notable changes from the previous draft:
* Several new error cases added under ERRORS
* New subsection on Seccomp-specific BPF details
* Add some detail in discussion of 'siginfo_t' fields
* Tweaked comments on BPF program in EXAMPLE section
* Added various FIXMEs

I also have one API quibble, regarding the name of the
SYS_SECCOMP constant; see below.

Feedback as inline comments to the below would be great!

Cheers,

Michael

.\" Copyright (C) 2014 Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
.\" and Copyright (C) 2012 Will Drewry <wad-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
.\" and Copyright (C) 2008, 2014 Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
.SH NAME
seccomp \- operate on Secure Computing state of the process
.SH SYNOPSIS
.nf
.B #include <linux/seccomp.h>
.B #include <linux/filter.h>
.B #include <linux/audit.h>
.B #include <linux/signal.h>
.B #include <sys/ptrace.h>
.\" Kees Cook noted: Anything that uses SECCOMP_RET_TRACE returns will
.\"                  need <sys/ptrace.h>

.BI "int seccomp(unsigned int " operation ", unsigned int " flags \
", void *" args );
.fi
.SH DESCRIPTION
The
.BR seccomp ()
system call operates on the Secure Computing (seccomp) state of the
calling process.

Currently, Linux supports the following
.IR operation
values:
.TP
.BR SECCOMP_SET_MODE_STRICT
The only system calls that the calling thread is permitted to make are
.BR read (2),
.BR write (2),
.BR _exit (2),
and
.BR sigreturn (2).
Other system calls result in the delivery of a
.BR SIGKILL
signal.
Strict secure computing mode is useful for number-crunching
applications that may need to execute untrusted byte code, perhaps
obtained by reading from a pipe or socket.

This operation is available only if the kernel is configured with
.BR CONFIG_SECCOMP
enabled.

The value of
.IR flags
must be 0, and
.IR args
must be NULL.

This operation is functionally identical to the call:

    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
.TP
.BR SECCOMP_SET_MODE_FILTER
The system calls allowed are defined by a pointer to a Berkeley Packet
Filter (BPF) passed via
.IR args .
This argument is a pointer to a
.IR "struct\ sock_fprog" ;
it can be designed to filter arbitrary system calls and system call
arguments.
If the filter is invalid,
.BR seccomp ()
fails, returning
.BR EINVAL
in
.IR errno .

If
.BR fork (2)
or
.BR clone (2)
is allowed by the filter, any child processes will be constrained to
the same system call filters as the parent.
If
.BR execve (2)
is allowed,
the existing filters will be preserved across a call to
.BR execve (2).

In order to use the
.BR SECCOMP_SET_MODE_FILTER
operation, either the caller must have the
.BR CAP_SYS_ADMIN
capability, or the thread must already have the
.I no_new_privs
bit set.
If that bit was not already set by an ancestor of this thread,
the thread must make the following call:

    prctl(PR_SET_NO_NEW_PRIVS, 1);

Otherwise, the
.BR SECCOMP_SET_MODE_FILTER
operation will fail and return
.BR EACCES
in
.IR errno .
This requirement ensures that an unprivileged process cannot apply
a malicious filter and then invoke a set-user-ID or
other privileged program using
.BR execve (2),
thus potentially compromising that program.
(Such a malicious filter might, for example, cause an attempt to use
.BR setuid (2)
to set the caller's user IDs to non-zero values to instead
return 0 without actually making the system call.
Thus, the program might be tricked into retaining superuser privileges
in circumstances where it is possible to influence it to do
dangerous things because it did not actually drop privileges.)

If
.BR prctl (2)
or
.BR seccomp (2)
is allowed by the attached filter, further filters may be added.
This will increase evaluation time, but allows for further reduction of
the attack surface during execution of a thread.

The
.BR SECCOMP_SET_MODE_FILTER
operation is available only if the kernel is configured with
.BR CONFIG_SECCOMP_FILTER
enabled.

When
.IR flags
is 0, this operation is functionally identical to the call:

    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);

The recognized
.IR flags
are:
.RS
.TP
.BR SECCOMP_FILTER_FLAG_TSYNC
When adding a new filter, synchronize all other threads of the calling
process to the same seccomp filter tree.
A "filter tree" is the ordered list of filters attached to a thread.
(Attaching identical filters in separate
.BR seccomp ()
calls results in different filters from this perspective.)

If any thread cannot synchronize to the same filter tree,
the call will not attach the new seccomp filter,
and will fail, returning the first thread ID found that cannot synchronize.
Synchronization will fail if another thread in the same process is in
.BR SECCOMP_MODE_STRICT
or if it has attached new seccomp filters to itself,
diverging from the calling thread's filter tree.
.RE
.SS Filters
When adding filters via
.BR SECCOMP_SET_MODE_FILTER ,
.IR args
points to a filter program:

.in +4n
.nf
struct sock_fprog {
    unsigned short      len;    /* Number of BPF instructions */
    struct sock_filter *filter; /* Pointer to array of
                                   BPF instructions */
};
.fi
.in

Each program must contain one or more BPF instructions:

.in +4n
.nf
struct sock_filter {            /* Filter block */
    __u16 code;                 /* Actual filter code */
    __u8  jt;                   /* Jump true */
    __u8  jf;                   /* Jump false */
    __u32 k;                    /* Generic multiuse field */
};
.fi
.in

.\" FIXME I reworded/enhanced the following sentence. Is it okay?
When executing the instructions, the BPF program operates on the
system call information made available (i.e., use the
.BR BPF_ABS
addressing mode) as a buffer of the following form:

.in +4n
.nf
struct seccomp_data {
    int   nr;                   /* System call number */
    __u32 arch;                 /* AUDIT_ARCH_* value
                                   (see <linux/audit.h>) */
    __u64 instruction_pointer;  /* CPU instruction pointer */
    __u64 args[6];              /* Up to 6 system call arguments */
};
.fi
.in

A seccomp filter returns a 32-bit value consisting of two parts:
the most significant 16 bits
(corresponding to the mask defined by the constant
.BR SECCOMP_RET_ACTION )
contain one of the "action" values listed below;
the least significant 16-bits (defined by the constant
.BR SECCOMP_RET_DATA )
are "data" to be associated with this return value.

If multiple filters exist, they are all executed,
in reverse order of their addition to the filter tree
(i.e., the most recently installed filter is executed first).
The return value for the evaluation of a given system call is the first-seen
.BR SECCOMP_RET_ACTION
value of highest precedence (along with its accompanying data)
returned by execution of all of the filters.

In decreasing order of precedence,
the values that may be returned by a seccomp filter are:
.TP
.BR SECCOMP_RET_KILL
This value results in the process exiting immediately
without executing the system call.
The process terminates as though killed by a
.B SIGSYS
signal
.RI ( not
.BR SIGKILL ).
.TP
.BR SECCOMP_RET_TRAP
This value results in the kernel sending a
.BR SIGSYS
signal to the triggering process without executing the system call.
Various fields will be set in the
.I siginfo_t
structure (see
.BR sigaction (2))
associated with signal:
.RS
.IP * 3
.I si_signo
will contain
.BR SIGSYS .
.IP *
.IR si_call_addr
will show the address of the system call instruction.
.IP *
.IR si_syscall
and
.IR si_arch
will indicate which system call was attempted.
.IP *
.I si_code
.\" FIXME Why is the constant thus named? All of the other 'si_code'
.\"       constants are prefixed 'SI_'. Why the inconsistency?
will contain
.BR SYS_SECCOMP .
.IP *
.I si_errno
will contain the
.BR SECCOMP_RET_DATA
portion of the filter return value.
.RE
.IP
The program counter will be as though the system call happened
(i.e., it will not point to the system call instruction).
The return value register will contain an architecture\-dependent value;
if resuming execution, set it to something sensible.
.\" FIXME Regarding the preceding line, can you give an example(s)
.\"       of "something sensible"? (Depending on the answer, maybe it
.\"       might be useful to add some text on this point.)
.\"
.\" FIXME Please check:
.\"     In an attempt to make the text clearer, I changed
.\"     "replacing it with" to "setting the return value register to"
.\"     Okay?
(The architecture dependency is because setting the return value register to
.BR ENOSYS
could overwrite some useful information.)
.TP
.BR SECCOMP_RET_ERRNO
This value results in the
.B SECCOMP_RET_DATA
portion of the filter's return value being passed to user space as the
.IR errno
value without executing the system call.
.TP
.BR SECCOMP_RET_TRACE
When returned, this value will cause the kernel to attempt to notify a
.BR ptrace (2)-based
tracer prior to executing the system call.
If there is no tracer present,
the system call is not executed and returns a failure status with
.I errno
set to
.BR ENOSYS .

A tracer will be notified if it requests
.BR PTRACE_O_TRACESECCOMP
using
.IR ptrace(PTRACE_SETOPTIONS) .
The tracer will be notified of a
.BR PTRACE_EVENT_SECCOMP
and the
.BR SECCOMP_RET_DATA
portion of the filter's return value will be available to the tracer via
.BR PTRACE_GETEVENTMSG .

The tracer can skip the system call by changing the system call number
to \-1.
Alternatively, the tracer can change the system call
requested by changing the system call to a valid system call number.
If the tracer asks to skip the system call, then the system call will
appear to return the value that the tracer puts in the return value register.

The seccomp check will not be run again after the tracer is notified.
(This means that seccomp-based sandboxes
.B "must not"
allow use of
.BR ptrace (2)\(emeven
of other
sandboxed processes\(emwithout extreme care;
.\" FIXME Below, I think it would be helpful to add some words after
.\"       "to escape", as in "to escape [what?]" I suppose the wording
.\"       would be something like "to escape the seccomp sandbox mechanism"
.\"       but perhaps you have a better wording.
ptracers can use this mechanism to escape.)
.TP
.BR SECCOMP_RET_ALLOW
This value results in the system call being executed.
.SH RETURN VALUE
On success,
.BR seccomp ()
returns 0.
On error, if
.BR SECCOMP_FILTER_FLAG_TSYNC
was used,
the return value is the ID of the thread
that caused the synchronization failure.
(This ID is a kernel thread ID of the type returned by
.BR clone (2)
and
.BR gettid (2).)
On other errors, \-1 is returned, and
.IR errno
is set to indicate the cause of the error.
.SH ERRORS
.BR seccomp ()
can fail for the following reasons:
.TP
.BR EACCESS
The caller did not have the
.BR CAP_SYS_ADMIN
capability, or had not set
.IR no_new_privs
before using
.BR SECCOMP_SET_MODE_FILTER .
.TP
.BR EFAULT
.IR args
was not a valid address.
.TP
.BR EINVAL
.IR operation
is unknown; or
.IR flags
are invalid for the given
.IR operation .
.\" FIXME Please review the following
.TP
.BR EINVAL
.I operation
included
.BR BPF_ABS ,
but the specified offset was not aligned to a 32-bit boundary or exceeded
.IR "sizeof(struct\ seccomp_data)" .
.\" FIXME Please review the following
.TP
.BR EINVAL
.\" See kernel/seccomp.c::seccomp_may_assign_mode() in 3.18 sources
A secure computing mode has already been set, and
.I operation
differs from the existing setting.
.\" FIXME Please review the following
.TP
.BR EINVAL
.\" See stub kernel/seccomp.c::seccomp_set_mode_filter() in 3.18 sources
.I operation
specified
.BR SECCOMP_SET_MODE_FILTER ,
but the kernel was not built with
.B CONFIG_SECCOMP_FILTER
enabled.
.\" FIXME Please review the following
.TP
.BR EINVAL
.I operation
specified
.BR SECCOMP_SET_MODE_FILTER ,
but the filter program pointed to by
.I args
was not valid or the length of the filter program was zero or exceeded
.B BPF_MAXINSNS
(4096) instructions.
.BR EINVAL
.TP
.BR ENOMEM
Out of memory.
.\" FIXME Please review the following
.TP
.BR ENOMEM
.\" ENOMEM in kernel/seccomp.c::seccomp_attach_filter() in 3.18 sources
The total length of all filter programs attached
to the calling thread would exceed
.B MAX_INSNS_PER_PATH
(32768) instructions.
Note that for the purposes of calculating this limit,
each already existing filter program incurs an
overhead penalty of 4 instructions.
.TP
.BR ESRCH
Another thread caused a failure during thread sync, but its ID could not
be determined.
.SH VERSIONS
The
.BR seccomp()
system call first appeared in Linux 3.17.
.\" FIXME . Add glibc version
.SH CONFORMING TO
The
.BR seccomp()
system call is a nonstandard Linux extension.
.SH NOTES
.BR seccomp ()
provides a superset of the functionality provided by the
.BR prctl (2)
.BR PR_SET_SECCOMP
operation (which does not support
.IR flags ).
.\" FIXME Please review the following new subsection {{{
.SS Seccomp-specific BPF details
Note the following BPF details specific to seccomp filters:
.IP * 3
The
.B BPF_H
and
.B BPF_B
size modifiers are not supported: all operations must load and store
(4-byte) words
.RB ( BPF_W ).
.IP *
To access the contents of the
.I seccomp_data
buffer, use the
.B BPF_ABS
addressing mode modifier.
.\" FIXME What is the significance of the line
.\"           ftest->code = BPF_LDX | BPF_W | BPF_ABS;
.\"       in kernel/seccomp.c::seccomp_check_filter()?
.IP *
The
.B BPF_LEN
addressing mode modifier yields an immediate mode operand
whose value is the size of the
.IR seccomp_data
buffer.
.\" FIXME Any other seccomp-specific BPF details that should be added here?
.\"
.\" FIXME End of new subsection for review }}}
.SH EXAMPLE
The program below accepts four or more arguments.
The first three arguments are a system call number,
a numeric architecture identifier, and an error number.
The program uses these values to construct a BPF filter
that is used at run time to perform the following checks:
.IP [1] 4
If the program is not running on the specified architecture,
the BPF filter causes system calls to fail with the error
.BR ENOSYS .
.IP [2]
If the program attempts to execute the system call with the specified number,
the BPF filter causes the system call to fail, with
.I errno
being set to the specified error number.
.PP
The remaining command-line arguments specify
the pathname and additional arguments of a program
that the example program should attempt to execute using
.BR execve (3)
(a library function that employs the
.BR execve (2)
system call).
Some example runs of the program are shown below.

First, we display the architecture that we are running on (x86-64)
and then construct a shell function that looks up system call
numbers on this architecture:

.nf
.in +4n
$ \fBuname -m\fP
x86_64
$ \fBsyscall_nr() {
    cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \\
    awk '$2 != "x32" && $3 == "'$1'" { print $1 }' 
}\fP
.in
.fi

When the BPF filter rejects a system call (case [2] above),
it causes the system call to fail with the error number
specified on the command line.
In the experiments shown here, we'll use error number 99:

.nf
.in +4n
$ \fBerrno 99\fP
EADDRNOTAVAIL 99 Cannot assign requested address
.in
.fi

In the following example, we attempt to run the command
.BR whoami (1),
but the BPF filter rejects the
.BR execve (2)
system call, so that the command is not even executed:

.nf
.in +4n
$ \fBsyscall_nr execve\fP
59      
$ \fB./a.out\fP
Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
                 AUDIT_ARCH_X86_64: 0xC000003E
$ \fB./a.out 59 0xC000003E 99 /bin/whoami\fP
execv: Cannot assign requested address
.in
.fi

In the next example, the BPF filter rejects the
.BR write (2)
system call, so that, although it is successfully started, the
.BR whoami (1)
command is not able to write output:

.nf
.in +4n
$ \fBsyscall_nr write\fP
1
$ \fB./a.out 1 0xC000003E 99 /bin/whoami\fP
.in
.fi

In the final example,
the BPF filter rejects a system call that is not used by the
.BR whoami (1)
command, so it is able to successfully execute and produce output:

.nf
.in +4n
$ \fBsyscall_nr preadv\fP
295
$ \fB./a.out 295 0xC000003E 99 /bin/whoami\fP
cecilia
.in
.fi
.SS Program source
.fi
.nf
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

static int
install_filter(int syscall_nr, int t_arch, int f_errno)
{
.\" FIXME In the BPF program below, you use '+' to build the instructions.
.\"       However, most other BPF example code I see uses '|'. While I
.\"       assume it's equivalent (i.e., the bit fields are nonoverlapping),
.\"       was there a reason to use '+' rather than '|'? (To me, the
.\"       latter is a little clearer in its intent.)
.\"
.\" FIXME I expanded comments [0], [1], [2], [3], [4] a little.
.\"       Are they okay? */
.\"
    struct sock_filter filter[] = {
        /* [0] Load architecture from 'seccomp_data' buffer into
               accumulator */
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
                 (offsetof(struct seccomp_data, arch))),

        /* [1] Jump forward 4 instructions if architecture does not
               match 't_arch' */
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, t_arch, 0, 4),

        /* [2] Load system call number from 'seccomp_data' buffer into
               accumulator */
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
                 (offsetof(struct seccomp_data, nr))),

        /* [3] Jump forward 1 instruction if system call number
               does not match 'syscall_nr' */
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, syscall_nr, 0, 1),

        /* [4] Matching architecture and system call: don't execute
	       the system call, and return 'f_errno' in 'errno' */
        BPF_STMT(BPF_RET + BPF_K,
                 SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

        /* [5] Destination of system call number mismatch: allow other
               system calls */
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),

        /* [6] Destination of architecture mismatch: kill process */
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog prog = {
        .len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    if (seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog)) {
        perror("seccomp");
        return 1;
    }

    return 0;
}

int
main(int argc, char **argv)
{
    if (argc < 5) {
        fprintf(stderr, "Usage: "
                "%s <syscall_nr> <arch> <errno> <prog> [<args>]\\n"
                "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\\n"
                "                 AUDIT_ARCH_X86_64: 0x%X\\n"
                "\\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
        exit(EXIT_FAILURE);
    }

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("prctl");
        exit(EXIT_FAILURE);
    }

    if (install_filter(strtol(argv[1], NULL, 0),
                       strtol(argv[2], NULL, 0),
                       strtol(argv[3], NULL, 0)))
        exit(EXIT_FAILURE);

    execv(argv[4], &argv[4]);
    perror("execv");
    exit(EXIT_FAILURE);
}
.fi
.SH SEE ALSO
.BR prctl (2),
.BR ptrace (2),
.BR signal (7),
.BR socket (7)
.sp
The kernel source files
.IR Documentation/networking/filter.txt
and
.IR Documentation/prctl/seccomp_filter.txt .
.sp
McCanne, S. and Jacobson, V. (1992)
.IR "The BSD Packet Filter: A New Architecture for User-level Packet Capture" ,
Proceedings of the USENIX Winter 1993 Conference
.UR http://www.tcpdump.org/papers/bpf-usenix93.pdf
.UE

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Edited seccomp.2 man page for review
From: Michael Kerrisk (man-pages) @ 2014-12-30 12:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lkml,
	Andy Lutomirski, Linux API, Daniel Borkmann
In-Reply-To: <CAGXu5jJ4MMZ66JzWqenaz4h5nYLnE8o_H0D5sr+M=j1YYnf0AQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi Kees,

My apologies for the slow follow-up.

On 11/10/2014 10:13 PM, Kees Cook wrote:
> On Sat, Nov 8, 2014 at 4:22 AM, Michael Kerrisk (man-pages)
> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Hi Kees, (and all),
>>
>> Thanks for the seccomp.2 draft man page that you provided a few
>> weeks ago (https://lkml.org/lkml/2014/9/25/685), and my apologies
>> for the slow follow-up.
>>
>> I have done some substantial editing of the page. Therefore, could
>> you please carefully read the revised version below, in case I have
>> somewhere injected errors.
> 
> Woo! Thanks for all your work on it!
> 
>> In addition, I've added a number of FIXMEs to the page source. Could
>> you please review these.
> 
> Sure, I'll try to avoid being redundant with Andy. :)
> 
>> I've also added long piece to the example section, describing the
>> program and demonstrating its use. Again, I'd appreciate it if you
>> could check that over.
>>
>> One other question about these man-pages changes: should we add
>> a note in prctl(2) to say that seccomp(2) is preferred over
>> PR_SET_SECCOMP for new code?
> 
> Given how new it is, I was shy to suggest it. Anything needed the new
> features (TSYNC) obviously must use it, but it'll be a while before
> this syscall is in distros. I think it should be used over prctl, but
> there's no strong reason to change existing code.

I've settled for adding the following under the description of
PR_SET_SECCOMP in prctl(2)

     The more recent seccomp(2) system call provides a superset
     of the functionality of PR_SET_SECCOMP .

Okay?

>> I've appended the revised page at the foot of this mail. You can also
>> find the branch holding this page (and thus, the series of changes
>> I've made in Git at:
>> http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_seccomp
>>
>> Feedback either as inline comments to the below, or as a patch based on
>> the Git branch, would be great!
>>
>> Cheers,
>>
>> Michael
>>
>> .\" Copyright (C) 2014 Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>> .\" and Copyright (C) 2012 Will Drewry <wad-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
>> .\" and Copyright (C) 2008, 2014 Michael Kerrisk <mtk.manpages@gmail.com>
>> .\"
>> .\" %%%LICENSE_START(VERBATIM)
>> .\" Permission is granted to make and distribute verbatim copies of this
>> .\" manual provided the copyright notice and this permission notice are
>> .\" preserved on all copies.
>> .\"
>> .\" Permission is granted to copy and distribute modified versions of this
>> .\" manual under the conditions for verbatim copying, provided that the
>> .\" entire resulting derived work is distributed under the terms of a
>> .\" permission notice identical to this one.
>> .\"
>> .\" Since the Linux kernel and libraries are constantly changing, this
>> .\" manual page may be incorrect or out-of-date.  The author(s) assume no
>> .\" responsibility for errors or omissions, or for damages resulting from
>> .\" the use of the information contained herein.  The author(s) may not
>> .\" have taken the same level of care in the production of this manual,
>> .\" which is licensed free of charge, as they might when working
>> .\" professionally.
>> .\"
>> .\" Formatted or processed versions of this manual, if unaccompanied by
>> .\" the source, must acknowledge the copyright and authors of this work.
>> .\" %%%LICENSE_END
>> .\"
>> .TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
>> .SH NAME
>> seccomp \- operate on Secure Computing state of the process
>> .SH SYNOPSIS
>> .nf
>> .B #include <linux/seccomp.h>
>> .B #include <linux/filter.h>
>> .B #include <linux/audit.h>
>> .B #include <linux/signal.h>
>> .\" FIXME Is sys/ptrace.h really required? It is not used in
>> .\"       the example program below.
>> .B #include <sys/ptrace.h>
> 
> It's not required for this example, but anything uses the
> SECCOMP_RET_TRACE returns, it'll want it. And given the mention of
> things like PTRACE_O_TRACESECCOMP, it seemed like we should include
> the #include. I'll leave it to your discretion on what's appropriate
> for a man-page header, though. :)

Okay -- I'll keep it. (Thanks for providing the background detail.)

>> .BI "int seccomp(unsigned int " operation ", unsigned int " flags \
>> ", void *" args );
>> .fi
>> .SH DESCRIPTION
>> The
>> .BR seccomp ()
>> system call operates on the Secure Computing (seccomp) state of the
>> calling process.
>> .\" FIXME: This page various uses the terms "process', "thread" and "task".
>> .\" Probably only one of these (not "task"!) should be used in all
>> .\" cases. I suspect it should be "thread".
> 
> Yeah, "task" should be avoided, my mistake! I will try to correct them
> below. The above general case is correct, since TSYNC can change the
> state on all threads of the process.

Okay.

>> Currently, Linux supports the following
>> .IR operation
>> values:
>> .TP
>> .BR SECCOMP_SET_MODE_STRICT
>> The only system calls that the thread is permitted to make are
> 
> Should this be clarified to "the calling thread", or is that implied?

Thanks. It's better to be clearer. Changed.

>> .BR read (2),
>> .BR write (2),
>> .BR _exit (2),
>> and
>> .BR sigreturn (2).
>> Other system calls result in the delivery of a
>> .BR SIGKILL
>> signal
> 
> "signal" needs a period ending the sentence above.

Thanks.

>> Strict secure computing mode is useful for number-crunching
>> applications that may need to execute untrusted byte code, perhaps
>> obtained by reading from a pipe or socket.
>>
>> This operation is available only if the kernel is configured with
>> .BR CONFIG_SECCOMP
>> enabled.
>>
>> The value of
>> .IR flags
>> must be 0, and
>> .IR args
>> must be NULL.
>>
>> This operation is functionally identical to the call:
>>
>>     prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
>> .TP
>> .BR SECCOMP_SET_MODE_FILTER
>> The system calls allowed are defined by a pointer to a Berkeley Packet
>> Filter (BPF) passed via
>> .IR args .
>> This arguMent is a pointer to a
> 
> s/M/m/

Fixed.

>> .IR "struct\ sock_fprog" ;
>> it can be designed to filter arbitrary system calls and system call
>> arguments.
>> If the filter is invalid,
>> .BR seccomp ()
>> fails, returning
>> .BR EACCESS
> 
> EINVAL (EACCESS would be for lacking CAP_SYS_ADMIN or no-new-privs).

Fixed.

>> in
>> .IR errno .
>>
>> .\" FIXME I (mtk) reworded the following paragraph substantially.
>> .\" Please check it.
>> If
>> .BR fork (2)
>> or
>> .BR clone (2)
>> is allowed by the filter, any child processes will be constrained to
>> the same filters and system calls as the parent.
> 
> To me, "and system calls" implies something other than filters. Maybe:
> "the same system call filters as the parent"?

Fixed.

>> If
>> .BR execve (2)
>> is allowed by the filter,
>> the filters and constraints on permitted system calls are preserved across an
>> .BR execve (2).
> 
> Perhaps "Similarly, if execve is allowed, the existing filters will be
> preserved across the call to execve." The filter _is_ the "constraints
> on permitted system calls", but since it can do more than constrain,
> I'm shy to imply a limit to the scope of this description.

Changed as you suggest. Thanks.

>> .\" FIXME I (mtk) reworded the following paragraph substantially.
>> .\" Please check it.
>> In order to use the
>> .BR SECCOMP_SET_MODE_FILTER
>> operation, either the caller must have the
>> .BR CAP_SYS_ADMIN
>> capability or the call must be preceded by the call:
>>
>>     prctl(PR_SET_NO_NEW_PRIVS, 1);
> 
> Strictly speaking, if any ancestor ever called PR_SET_NO_NEW_PRIVS,
> the process already has it set. Perhaps "... capability, or the thread
> must already have thew "no new privs" prctl bit set. If not already
> set by an ancestory, the thread must call: ..."

Changed as you suggest.

>> Otherwise, the
>> .BR SECCOMP_SET_MODE_FILTER
>> operation will fail and return
>> .BR EACCES
>> in
>> .IR errno .
>> This requirement ensures that filter programs cannot be applied to child
>> .\" FIXME What does "installed" in the following line mean?
>> processes with greater privileges than the process that installed them.
> 
> Andy mentioned the "why", but "installed" here means "called seccomp()
> to add filters", e.g. add (install) a filter to have
> "setuid(non-root)" return 0 instead of actually getting called, and
> then exec a setuid process that tries to drop privileges, which
> doesn't happen, and now the original caller (non-root) has a setuid
> process running as root that it may be able to influence into doing
> dangerous things because it didn't _actually_ drop privileges.

Thanks. That's a really good example. I incorporated your text, revising slightly:

[[
(Such a malicious filter might, for example, cause an attempt to use
.BR setuid (2)
to set the caller's user IDs to non-zero values to instead
return 0 without actually making the system call.
Thus, the program might be tricked into retaining superuser privileges
in circumstances where it is possible to influence it to do
dangerous things because it did not actually drop privileges.)
]]

Okay?

>> If
>> .BR prctl (2)
>> or
>> .BR seccomp (2)
>> is allowed by the attached filter, further filters may be added.
>> This will increase evaluation time, but allows for further reduction of
>> the attack surface during execution of a process.
> 
> Strictly speaking, "process" -> "thread"

Fixed.

>> The
>> .BR SECCOMP_SET_MODE_FILTER
>> operation is available only if the kernel is configured with
>> .BR CONFIG_SECCOMP_FILTER
>> enabled.
>>
>> When
>> .IR flags
>> is 0, this operation is functionally identical to the call:
>>
>>     prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);
>>
>> The recognized
>> .IR flags
>> are:
>> .RS
>> .TP
>> .BR SECCOMP_FILTER_FLAG_TSYNC
>> When adding a new filter, synchronize all other threads of the calling
>> process to the same seccomp filter tree.
>> .\" FIXME Nowhere in this page is the term "filter tree" defined.
>> .\" There should be a definition somewhere.
>> .\" Is it: "the set of filters attached to a thread"?
> 
> As Andy said, the list of filters attached to the process. A process
> may have multiple threads adding filters, which would cause those
> threads to have separate branches of seccomp filters (though they
> would share a common root from when the process started). It is only
> possible to use TSYNC if threads haven't diverged in this way.

Yup, got it already, thanks to Andy's explanation. Thanks for the
further detail though, which helped me tweak the wording a little.

>> If any thread cannot do this,
>> the call will not attach the new seccomp filter,
>> and will fail, returning the first thread ID found that cannot synchronize.
>> Synchronization will fail if another thread is in
>> .BR SECCOMP_MODE_STRICT
>> or if it has attached new seccomp filters to itself,
>> diverging from the calling thread's filter tree.
>> .RE
>> .SH FILTERS
>> When adding filters via
>> .BR SECCOMP_SET_MODE_FILTER ,
>> .IR args
>> points to a filter program:
>>
>> .in +4n
>> .nf
>> struct sock_fprog {
>>     unsigned short      len;    /* Number of BPF instructions */
>>     struct sock_filter *filter;
>> };
>> .fi
>> .in
>>
>> Each program must contain one or more BPF instructions:
>>
>> .in +4n
>> .nf
>> struct sock_filter {    /* Filter block */
>>     __u16   code;       /* Actual filter code */
>>     __u8    jt;         /* Jump true */
>>     __u8    jf;         /* Jump false */
>>     __u32   k;          /* Generic multiuse field */
>> };
>> .fi
>> .in
>>
>> When executing the instructions, the BPF program executes over the
>> system call information made available via:
>>
>> .in +4n
>> .nf
>> struct seccomp_data {
>>     int nr;                     /* system call number */
>>     __u32 arch;                 /* AUDIT_ARCH_* value */
>>     __u64 instruction_pointer;  /* CPU instruction pointer */
>>     __u64 args[6];              /* up to 6 system call arguments */
>> };
>> .fi
>> .in
>>
>> .\" FIXME I find the next piece a little hard to understand, so,
>> .\"       some questions:
>> .\"       * If there are multiple filters, in what order are they executed?
>> .\"         (The man page should probably detail the answer to this question.)
> 
> They are executed in reverse order (most recently added is executed first).

Thanks. I integrated that point into the page.
 
>> .\"       * If there are multiple filters, are they all always executed?
>> .\"         I assume not, but the notion that
>> .\"             "the return value for the evaluation of a given system call
>> .\"              will always use the value with the highest precedence"
>> .\"         implies that even that if one filter generates (say)
>> .\"         SECCOMP_RET_ERRNO, then further filters may still be executed,
>> .\"         including one that generates (say) the "higher priority"
>> .\"         SECCOMP_RET_KILL condition.
>> .\"       Can you clarify the above?
> 
> Correct. All filters are executed. The returned value is the one with
> the first seen highest priority (lowest numerical value) action of
> those returned by each filter. For example, if a filter was installed
> that returned SECCOMP_RET_ERRNO|1, and then another filter installed
> SECCOMP_RET_ERRNO|22, and then another filter installed
> SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO|22 would be returned.
> SECCOMP_RET_ERRNO is higher priority than SECCOMP_RET_ALLOW, but since
> the SECCOMP_RET_ERRNO|22 was seen first, it's data (22) will be used,
> even though the last filter returned a lower data (1), as only action
> values are compared.

So, the penny finally dropped here (and I confirmed by looking at the
code, which I didn't find time to do earlier). The return value consists
of two parts: SECCOMP_RET_ACTION and SECCOMP_RET_DATA. This was hinted 
at in various places in the text, but was not made explicit.

>> A seccomp filter returns one of the values listed below.
> 
> Based on discussion further below, perhaps "value" should be called
> "action" here? Maybe:
> 
> A seccomp filter returns a value. The high 16 bits
> (SECCOMP_RET_ACTION) is the seccomp filter "action" to take. The low
> 16 bits (SECCOMP_RET_DATA) is data specific to the action.

Ahhh -- and then I see you explain it as I (now) understand it ;-).

I added some text that conveys the same info. So, by now this piece of 
the page reads:

       A seccomp filter returns a 32-bit value consisting of two  parts:
       the  most  significant 16 bits (corresponding to the mask defined
       by the constant SECCOMP_RET_ACTION) contain one of  the  "action"
       values  listed  below;  the least significant 16-bits (defined by
       the constant SECCOMP_RET_DATA) are "data" to be  associated  with
       this return value.

       If  multiple  filters  exist,  they  are all executed, in reverse
       order of their addition  to  the  filter  tree  (i.e.,  the  most
       recently  installed  filter is executed first).  The return value
       for the evaluation of a given system call is the first-seen  SEC‐
       COMP_RET_ACTION  value  of  highest  precedence  (along  with its
       accompanying data) returned by execution of all of the filters.

(And I've removed some repetition of the same points later in the page.)

>> If multiple filters exist,
>> the return value for the evaluation of a given system call
>> will always use the value with the highest precedence.
>> (For example,
>> .BR SECCOMP_RET_KILL
>> will always take precedence.)
>>
>> In decreasing order order of precedence,
>> the values that may be returned by a seccomp filter are:
>> .TP
>> .BR SECCOMP_RET_KILL
>> Results in the task exiting immediately without executing the system call.
>> The task terminates as though killed by a
> 
> Both "task" -> "process" above.

Fixed.

>> .B SIGSYS
>> signal
>> .RI ( not
>> .BR SIGKILL ).
>> .TP
>> .BR SECCOMP_RET_TRAP
>> Results in the kernel sending a
>> .BR SIGSYS
>> signal to the triggering task without executing the system call.
> 
> "task" -> "process"

Fixed.

>> .IR siginfo\->si_call_addr
>> will show the address of the system call instruction, and
>> .IR siginfo\->si_syscall
>> and
>> .IR siginfo\->si_arch
>> will indicate which system call was attempted.
>> The program counter will be as though the system call happened
>> (i.e., it will not point to the system call instruction).
>> The return value register will contain an architecture\-dependent value;
>> if resuming execution, set it to something sensible.
>> (The architecture dependency is because replacing it with
>> .BR ENOSYS
>> could overwrite some useful information.)
>>
>> .\" FIXME The following sentence is the first time that SECCOMP_RET_DATA
>> .\"       is mentioned. SECCOMP_RET_DATA needs to be described in this
>> .\"       man page.
> 
> How should these be detailed? (I took a stab at it further above.)

I think the text above suffices. probably no need to actually show the 
constant definitions.

> #define SECCOMP_RET_ACTION      0x7fff0000U
> #define SECCOMP_RET_DATA        0x0000ffffU
>
>> The
>> .BR SECCOMP_RET_DATA
>> portion of the return value will be passed as
>> .IR si_errno .
>>
>> .BR SIGSYS
>> triggered by seccomp will have the value
>> .BR SYS_SECCOMP
>> in the
>> .IR si_code
>> field.
>> .TP
>> .BR SECCOMP_RET_ERRNO
>> .\" FIXME What does "the return value" refer to in the next sentence?
>> .\"       It is not obvious to me.
> 
> As Andy said, the 32 bit value returned by the BPF filter.

Yep, got it now.

>> Results in the lower 16-bits of the return value being passed
>> to user space as the
>> .IR errno
>> without executing the system call.
>> .TP
>> .BR SECCOMP_RET_TRACE
>> When returned, this value will cause the kernel to attempt to notify a
>> .BR ptrace (2)-based
>> tracer prior to executing the system call.
>> .\" FIXME I (mtk) reworded the following sentence substantially.
>> .\" Please check it.
> 
> Yes, correct.

Thanks for checking.

>> If there is no tracer present,
>> the system call is not executed and returns a failure status with
>> .I errno
>> set to
>> .BR ENOSYS .
>>
>> A tracer will be notified if it requests
>> .BR PTRACE_O_TRACESECCOMP
>> using
>> .IR ptrace(PTRACE_SETOPTIONS) .
>> The tracer will be notified of a
>> .BR PTRACE_EVENT_SECCOMP
>> and the
>> .BR SECCOMP_RET_DATA
>> portion of the BPF program return value will be available to the tracer
>> via
>> .BR PTRACE_GETEVENTMSG .
>>
>> The tracer can skip the system call by changing the system call number
>> to \-1.
>> Alternatively, the tracer can change the system call
>> requested by changing the system call to a valid system call number.
>> If the tracer asks to skip the system call, then the system call will
>> appear to return the value that the tracer puts in the return value register.
>>
>> The seccomp check will not be run again after the tracer is notified.
>> (This means that seccomp-based sandboxes
>> .B "must not"
>> allow use of
>> .BR ptrace (2)\(emeven
>> of other
>> sandboxed processes\(emwithout extreme care;
>> ptracers can use this mechanism to escape.)
>> .TP
>> .BR SECCOMP_RET_ALLOW
>> Results in the system call being executed.
>> .PP
>> If multiple filters exist, the return value for the evaluation of a
>> given system call will always use the highest precedent value.
>>
>> .\" FIXME The following sentence is the first time that SECCOMP_RET_ACTION
>> .\"       is mentioned. SECCOMP_RET_ACTION needs to be described in this
>> .\"       man page.
> 
> Attempted earlier...

Yep, we're good now.
 
>> Precedence is determined using only the
>> .BR SECCOMP_RET_ACTION
>> mask.
>> When multiple filters return values of the same precedence,
>> only the
>> .BR SECCOMP_RET_DATA
>> from the most recently installed filter will be returned.
> 
> The above tries to document what was mentioned about the order of
> return value parsing I discussed further above.

Yep, got it.

>> .SH RETURN VALUE
>> On success,
>> .BR seccomp ()
>> returns 0.
>> On error, if
>> .BR SECCOMP_FILTER_FLAG_TSYNC
>> was used,
>> the return value is the thread ID that caused the synchronization failure.
>> On other errors, \-1 is returned, and
>> .IR errno
>> is set to indicate the cause of the error.
>> .SH ERRORS
>> .BR seccomp ()
>> can fail for the following reasons:
>> .TP
>> .BR EACCESS
>> The caller did not have the
>> .BR CAP_SYS_ADMIN
>> capability, or had not set
>> .IR no_new_privs
>> before using
>> .BR SECCOMP_SET_MODE_FILTER .
>> .TP
>> .BR EFAULT
>> .IR args
>> was required to be a valid address.
>> .TP
>> .BR EINVAL
>> .IR operation
>> is unknown; or
>> .IR flags
>> are invalid for the given
>> .IR operation
>> .TP
>> .BR ESRCH
>> Another thread caused a failure during thread sync, but its ID could not
>> be determined.
>> .SH VERSIONS
>> The
>> .BR seccomp()
>> system call first appeared in Linux 3.17.
>> .\" FIXME Add glibc version
>> .SH CONFORMING TO
>> The
>> .BR seccomp()
>> system call is a nonstandard Linux extension.
>> .SH NOTES
>> .BR seccomp ()
>> provides a superset of the functionality provided by the
>> .BR prctl (2)
>> .BR PR_SET_SECCOMP
>> operation (which does not support
>> .IR flags ).
>> .SH EXAMPLE
>> .\" FIXME Please carefully review the following new piece that
>> .\"       demonstrates the use of your example program.
> 
> This is great! Thanks for expanding this.

You're welcome. In the first instance, I was just experimenting
to ensure that I understood what was going on. Then it seemed
worthwhile to include such experiments in the page itself.

>> The program below accepts four or more arguments.
>> The first three arguments are a system call number,
>> a numeric architecture identifier, and an error number.
>> The program uses these values to construct a BPF filter
>> that is used at run time to perform the following checks:
>> .IP [1] 4
>> If the program is not running on the specified architecture,
>> the BPF filter causes system calls to fail with the error
>> .BR ENOSYS .
>> .IP [2]
>> If the program attempts to execute the system call with the specified number,
>> the BPF filter causes the system call to fail, with
>> .I errno
>> being set to the specified error number.
>> .PP
>> The remaining command-line arguments specify
>> the pathname and additional arguments of a program
>> that the example program should attempt to execute using
>> .BR execve (3)
>> (a library function that employs the
>> .BR execve (2)
>> system call).
>> Some example runs of the program are shown below.
>>
>> First, we display the architecture that we are running on (x86-64)
>> and then construct a shell function that looks up system call
>> numbers on this architecture:
>>
>> .nf
>> .in +4n
>> $ \fBuname -m\fP
>> x86_64
>> $ \fBsyscall_nr() {
>>     cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \\
>>     awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
>> }\fP
>> .in
>> .fi
>>
>> When the BPF filter rejects a system call (case [2] above),
>> it causes the system call to fail with the error number
>> specified on the command line.
>> In the experiments shown here, we'll use error number 99:
>>
>> .nf
>> .in +4n
>> $ \fBerrno 99\fP
>> EADDRNOTAVAIL 99 Cannot assign requested address
>> .in
>> .fi
>>
>> In the following example, we attempt to run the command
>> .BR whoami (1),
>> but the BPF filter rejects the
>> .BR execve (2)
>> system call, so that the command is not even executed:
>>
>> .nf
>> .in +4n
>> $ \fBsyscall_nr execve\fP
>> 59
>> $ \fB./a.out 59 0xC000003E 99 /bin/whoami\fP
> 
> It it worth showing where you got the 0xC000003E value from? (i.e.
> from just running ./a.out and looking at its hints)

Yes, that seems worthwhile. Done.

>> execv: Cannot assign requested address
>> .in
>> .fi
>>
>> In the next example, the BPF filter rejects the
>> .BR write (2)
>> system call, so that, although it is successfully started, the
>> .BR whoami (1)
>> command is not able to write output:
>>
>> .nf
>> .in +4n
>> $ \fBsyscall_nr write\fP
>> 1
>> $ \fB./a.out 1 0xC000003E 99 /bin/whoami\fP
>> .in
>> .fi
>>
>> In the final example,
>> the BPF filter rejects a system call that is not used by the
>> .BR whoami (1)
>> command, so it is able to successfully execute and produce output:
>>
>> .nf
>> .in +4n
>> $ \fBsyscall_nr preadv\fP
>> 295
>> $ \fB./a.out 295 0xC000003E 99 /bin/whoami\fP
>> cecilia
>> .in
>> .fi
>> .SS Program source
>> .fi
>> .nf
>> #include <errno.h>
>> #include <stddef.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <linux/audit.h>
>> #include <linux/filter.h>
>> #include <linux/seccomp.h>
>> #include <sys/prctl.h>
>>
>> static int
>> install_filter(int syscall, int arch, int error)
>> {
>>     struct sock_filter filter[] = {
>>         /* [0] Load architecture */
>>         BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
>>                  (offsetof(struct seccomp_data, arch))),
>>
>>         /* [1] Jump forward 4 instructions on architecture mismatch */
>>         BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, arch, 0, 4),
>>
>>         /* [2] Load system call number */
>>         BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
>>                  (offsetof(struct seccomp_data, nr))),
>>
>>         /* [3] Jump forward 1 instruction on system call number
>>                mismatch */
>>         BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, syscall, 0, 1),
>>
>>         /* [4] Matching architecture and system call: return
>>                specific errno */
>>         BPF_STMT(BPF_RET + BPF_K,
>>                  SECCOMP_RET_ERRNO | (error & SECCOMP_RET_DATA)),
>>
>>         /* [5] Destination of system call number mismatch: allow other
>>                system calls */
>>         BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
>>
>>         /* [6] Destination of architecture mismatch: kill process */
>>         BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
>>     };
>>
>>     struct sock_fprog prog = {
>>         .len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
>>         .filter = filter,
>>     };
>>
>>     if (seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog)) {
>>         perror("seccomp");
>>         return 1;
>>     }
>>
>>     return 0;
>> }
>>
>> int
>> main(int argc, char **argv)
>> {
>>     if (argc < 5) {
>>         fprintf(stderr, "Usage:\\n"
>>                 "refuse <syscall_nr> <arch> <errno> <prog> [<args>]\\n"
>>                 "Hint:  AUDIT_ARCH_I386: 0x%X\\n"
>>                 "       AUDIT_ARCH_X86_64: 0x%X\\n"
>>                 "\\n", AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
>>         exit(EXIT_FAILURE);
>>     }
>>
>>     if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
>>         perror("prctl");
>>         exit(EXIT_FAILURE);
>>     }
>>
>>     if (install_filter(strtol(argv[1], NULL, 0),
>>                        strtol(argv[2], NULL, 0),
>>                        strtol(argv[3], NULL, 0)))
>>         exit(EXIT_FAILURE);
>>
>>     execv(argv[4], &argv[4]);
>>     perror("execv");
>>     exit(EXIT_FAILURE);
>> }
>> .fi
>> .SH SEE ALSO
>> .BR prctl (2),
>> .BR ptrace (2),
>> .BR signal (7),
>> .BR socket (7)
>> .sp
>> .\" FIXME: Is the following the best source of info on the BPF language?
>> The kernel source file
>> .IR Documentation/networking/filter.txt .
> 
> I don't know of anything better.

Okay.

> 
> Thanks! This is looking really good. :)

Thanks for all these comments, Kees. I've now taken a further pass through 
the page, and made further edits, and some more FIXMEs as a result. I'll 
send the new draft in a separate message.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: Edited seccomp.2 man page for review
From: Michael Kerrisk (man-pages) @ 2014-12-30 12:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: mtk.manpages, Kees Cook, linux-man@vger.kernel.org, lkml,
	Linux API, Daniel Borkmann
In-Reply-To: <CALCETrWSz5hZJb5vavKX_kbfjm42w-e4aQjdRNsvS4m5uw4Q2w@mail.gmail.com>

Hi Andy,

Apologies for the slow follow-up.

On 11/10/2014 08:37 PM, Andy Lutomirski wrote:
> On Sat, Nov 8, 2014 at 4:22 AM, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> Hi Kees, (and all),
>>
>> Thanks for the seccomp.2 draft man page that you provided a few
>> weeks ago (https://lkml.org/lkml/2014/9/25/685), and my apologies
>> for the slow follow-up.
>>
> 
> Answers to some of your questions below.
> 
>> .BR execve (2)
>> is allowed by the filter,
>> the filters and constraints on permitted system calls are preserved across an
>> .BR execve (2).
>>
>> .\" FIXME I (mtk) reworded the following paragraph substantially.
>> .\" Please check it.
>> In order to use the
>> .BR SECCOMP_SET_MODE_FILTER
>> operation, either the caller must have the
>> .BR CAP_SYS_ADMIN
>> capability or the call must be preceded by the call:
>>
>>     prctl(PR_SET_NO_NEW_PRIVS, 1);
>>
>> Otherwise, the
>> .BR SECCOMP_SET_MODE_FILTER
>> operation will fail and return
>> .BR EACCES
>> in
>> .IR errno .
>> This requirement ensures that filter programs cannot be applied to child
>> .\" FIXME What does "installed" in the following line mean?
>> processes with greater privileges than the process that installed them.
>>
> 
> This requirement ensures that an unprivileged process cannot apply a
> malicious filter and then invoke a setuid or other privileged program
> using execve, thus potentially compromising that program.

Thanks. Much easier to understand. I've taken your text pretty much as
given into the man page.

>> If
>> .BR prctl (2)
>> or
>> .BR seccomp (2)
>> is allowed by the attached filter, further filters may be added.
>> This will increase evaluation time, but allows for further reduction of
>> the attack surface during execution of a process.
>>
>> The
>> .BR SECCOMP_SET_MODE_FILTER
>> operation is available only if the kernel is configured with
>> .BR CONFIG_SECCOMP_FILTER
>> enabled.
>>
>> When
>> .IR flags
>> is 0, this operation is functionally identical to the call:
>>
>>     prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);
>>
>> The recognized
>> .IR flags
>> are:
>> .RS
>> .TP
>> .BR SECCOMP_FILTER_FLAG_TSYNC
>> When adding a new filter, synchronize all other threads of the calling
>> process to the same seccomp filter tree.
>> .\" FIXME Nowhere in this page is the term "filter tree" defined.
>> .\" There should be a definition somewhere.
>> .\" Is it: "the set of filters attached to a thread"?
> 
> It's the ordered list of filters attached to a thread, where attaching
> identical filters in separate syscalls results in different filters
> from this perspective.

Thanks again. I've pretty much taken that text into the man page.

>> If any thread cannot do this,
>> the call will not attach the new seccomp filter,
>> and will fail, returning the first thread ID found that cannot synchronize.
>> Synchronization will fail if another thread is in
>> .BR SECCOMP_MODE_STRICT
>> or if it has attached new seccomp filters to itself,
>> diverging from the calling thread's filter tree.
>> .RE
>> .SH FILTERS
>> When adding filters via
>> .BR SECCOMP_SET_MODE_FILTER ,
>> .IR args
>> points to a filter program:
>>
>> .in +4n
>> .nf
>> struct sock_fprog {
>>     unsigned short      len;    /* Number of BPF instructions */
>>     struct sock_filter *filter;
>> };
>> .fi
>> .in
>>
>> Each program must contain one or more BPF instructions:
>>
>> .in +4n
>> .nf
>> struct sock_filter {    /* Filter block */
>>     __u16   code;       /* Actual filter code */
>>     __u8    jt;         /* Jump true */
>>     __u8    jf;         /* Jump false */
>>     __u32   k;          /* Generic multiuse field */
>> };
>> .fi
>> .in
>>
>> When executing the instructions, the BPF program executes over the
>> system call information made available via:
>>
>> .in +4n
>> .nf
>> struct seccomp_data {
>>     int nr;                     /* system call number */
>>     __u32 arch;                 /* AUDIT_ARCH_* value */
>>     __u64 instruction_pointer;  /* CPU instruction pointer */
>>     __u64 args[6];              /* up to 6 system call arguments */
>> };
>> .fi
>> .in
>>
>> .\" FIXME I find the next piece a little hard to understand, so,
>> .\"       some questions:
>> .\"       * If there are multiple filters, in what order are they executed?
>> .\"         (The man page should probably detail the answer to this question.)
> 
> All of them are executed.  The precedence rules determine what happens
> if the filters return different values.

Got it. Thanks.

>> .\"       * If there are multiple filters, are they all always executed?
>> .\"         I assume not, but the notion that
>> .\"             "the return value for the evaluation of a given system call
>> .\"              will always use the value with the highest precedence"
>> .\"         implies that even that if one filter generates (say)
>> .\"         SECCOMP_RET_ERRNO, then further filters may still be executed,
>> .\"         including one that generates (say) the "higher priority"
>> .\"         SECCOMP_RET_KILL condition.
>> .\"       Can you clarify the above?
>> A seccomp filter returns one of the values listed below.
>> If multiple filters exist,
>> the return value for the evaluation of a given system call
>> will always use the value with the highest precedence.
>> (For example,
>> .BR SECCOMP_RET_KILL
>> will always take precedence.)
>>
>> In decreasing order order of precedence,
>> the values that may be returned by a seccomp filter are:
>> .TP
>> .BR SECCOMP_RET_KILL
>> Results in the task exiting immediately without executing the system call.
>> The task terminates as though killed by a
>> .B SIGSYS
>> signal
>> .RI ( not
>> .BR SIGKILL ).
>> .TP
>> .BR SECCOMP_RET_TRAP
>> Results in the kernel sending a
>> .BR SIGSYS
>> signal to the triggering task without executing the system call.
>> .IR siginfo\->si_call_addr
>> will show the address of the system call instruction, and
>> .IR siginfo\->si_syscall
>> and
>> .IR siginfo\->si_arch
>> will indicate which system call was attempted.
>> The program counter will be as though the system call happened
>> (i.e., it will not point to the system call instruction).
>> The return value register will contain an architecture\-dependent value;
>> if resuming execution, set it to something sensible.
>> (The architecture dependency is because replacing it with
>> .BR ENOSYS
>> could overwrite some useful information.)
>>
>> .\" FIXME The following sentence is the first time that SECCOMP_RET_DATA
>> .\"       is mentioned. SECCOMP_RET_DATA needs to be described in this
>> .\"       man page.
>> The
>> .BR SECCOMP_RET_DATA
>> portion of the return value will be passed as
>> .IR si_errno .
>>
>> .BR SIGSYS
>> triggered by seccomp will have the value
>> .BR SYS_SECCOMP
>> in the
>> .IR si_code
>> field.
>> .TP
>> .BR SECCOMP_RET_ERRNO
>> .\" FIXME What does "the return value" refer to in the next sentence?
>> .\"       It is not obvious to me.
> 
> The return value is the value returned by the BPF program.

Got it!

Thanks for the comments, Andy!

Cheers,

Michael





-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* [GIT PULL] kselftest fixes for 3.19
From: Shuah Khan @ 2014-12-29 19:42 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, open list:KERNEL SELFTEST F..., shuahkh

Hi Linus,

Please pull the following kselftest fix for 3.19.

thanks,
-- Shuah

The following changes since commit 97bf6af1f928216fd6c5a66e8a57bfa95a659672:

  Linux 3.19-rc1 (2014-12-20 17:08:50 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-3.19-fixes

for you to fetch changes up to 6898b627aab6ba553e6d8b40a0b1ddc43c48d42f:

  selftests/exec: Use %zu to format size_t (2014-12-22 11:11:36 -0700)

----------------------------------------------------------------
kselftest fixes for: 3.19

Fix exec test compile warnings.

----------------------------------------------------------------
Geert Uytterhoeven (1):
      selftests/exec: Use %zu to format size_t

 tools/testing/selftests/exec/execveat.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

-- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978

^ permalink raw reply

* Re: [PATCH v6 1/4] crypto: AF_ALG: add AEAD support
From: Herbert Xu @ 2014-12-29 17:33 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <29582980.qoHS2EjmLy-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>

On Mon, Dec 29, 2014 at 04:05:40PM +0100, Stephan Mueller wrote:
>
> This would mean that the check must stay in recvmsg as only here we know that 
> the caller wants data to be processed.

On the send side you would do the check when MSG_MORE is unset.
On the receive side you should stop waiting only when ctx->more
is false and the send-side check succeeded.

Perhaps rename ctx->more to ctx->done and then you can use it
to indicate to the receive side that we're ready and have valid
data for it.  The receive side can then simply wait for ctx->done
to become true.

> > PS we should add a length check for missing/partial auth tags
> > to crypto_aead_decrypt.  We can then remove such checks from
> > individual implementations.
> 
> I agree in full here. Shall I create such a patch together with the AEAD 
> AF_ALG interface, or can we merge the AEAD without that patch now and create a 
> separate patch later?

We should at least add a check in crypto_aead_decrypt first so as
to guarantee nothing slips through.

Thanks,
-- 
Email: Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v3 00/20] kselftest install target feature
From: Shuah Khan @ 2014-12-29 15:24 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: mmarek-AlSwsSmVLrQ, gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	rostedt-nx8X9YLhiw1AfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, keescook-F7+t8E8rja9g9hUCZPvPmw,
	tranmanphong-Re5JQEeQqe8AvxtiuMwx3w, cov-sgV2jX0FEOL9JmXXK+q4OQ,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, hughd-hpIqsD4AKlfQT0dZR+AlfA,
	bobby.prani-Re5JQEeQqe8AvxtiuMwx3w,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, tim.bird-/MT0OVThwyLZJqsBc5GL+g,
	josh-iaAMLnmF4UmaiuxdJuQwMA, koct9i-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kbuild-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1419828790.26911.3.camel-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>

On 12/28/2014 09:53 PM, Michael Ellerman wrote:
> On Wed, 2014-12-24 at 09:27 -0700, Shuah Khan wrote:
>> This patch series adds a new kselftest_install make target
>> to enable selftest install. When make kselftest_install is
>> run, selftests are installed on the system. A new install
>> target is added to selftests Makefile which will install
>> targets for the tests that are specified in INSTALL_TARGETS.
>> During install, a script is generated to run tests that are
>> installed. This script will be installed in the selftest install
>> directory. Individual test Makefiles are changed to add to the
>> script. This will allow new tests to add install and run test
>> commands to the generated kselftest script. kselftest target
>> now depends on kselftest_install and runs the generated kselftest
>> script to reduce duplicate work and for common look and feel when
>> running tests.
>>
>> This approach leverages and extends the existing framework that
>> uses makefile targets to implement run_tests and adds install
>> target. This will scale well as new tests get added and makes
>> it easier for test writers to add install target at the same
>> time new test gets added.
>>
>> This v3 series reduces duplicate code to generate script
>> in indiviual test Makefiles and consolidates support in
>> selftests main Makefile. In the main Makefile, it does
>> minimal work to set and export install path. In this
>> series exec and powerpc tests are not included in the
>> install, this work will be done in future patches. exec
>> and powerpc are still run when make kselftest is invoked.
> 
> Any particular reason you excluded the powerpc tests? Going by a quick count,
> powerpc has 32 of the 54 self tests, ie. more than half.

No particular reason other than not having a good way to test the
changes I need to make. It does have sub-directory structure with
multiple makefiles underneath. I would like to work on this after
this patch series gets in or maybe you can help out on powerpc
changes if you like.

> 
> Sorry I didn't get a chance to review v1 or v2, but is this really the best
> solution we can come up with? It seems to involve a lot of boiler plate getting
> repeated in every Makefile.

This approach extends the existing approach to use makefile targets as
a means to support running tests. Also it gives full control to the
individual test developer in making changes to the targets as needed
without conflicts with work that is in progress on other tests.

There isn't a whole lot of boiler plating code repeated as such in
individual makefiles. They all add their specific targets to the
main script.

-- Shuah


-- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978

^ permalink raw reply

* Re: [PATCH v6 1/4] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2014-12-29 15:05 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141229103319.GB13334-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>

Am Montag, 29. Dezember 2014, 21:33:19 schrieb Herbert Xu:

Hi Herbert,

> On Thu, Dec 25, 2014 at 11:01:47PM +0100, Stephan Mueller wrote:
> > +	err = -ENOMEM;
> 
> This should be EINVAL.

Changed
> 
> > +	if (!aead_sufficient_data(ctx))
> > +		goto unlock;
> 
> So we're checking two things here, one that we have enough data
> for AD and two we have the authentication tag.  The latter is
> redundant as the underlying implementation should be able to cope
> with short input so we should only check the assoclen here.

Agreed, will change it to

if (ctx->used < ctx->aead_assoclen)

> 
> Also this check should be moved to the sendmsg side as that'll
> make it more obvious as to what went wrong.

I would be a bit uneasy about that as this would open up a potential kernel 
crasher: the sleep in aead_readable() can wake up recvmsg in two conditions: 
either we received sufficient data or we do not expect more data (due to !ctx-
>more). If the latter triggers, we still may have insufficient AD data. Yet, 
the following code now sets the AD with aead_request_set_assoc using the 
initially expected data. So, the data buffer provided to  
aead_request_set_assoc is not long enough. The mentioned check shall prevent 
this problem.

In addition, I do not see how we can move that check to the sendmsg/sendpage 
side: the code currently allows the caller to freely invoke the syscall 
arbitrary amount of times. Thus, one particular invocation of sendmsg/sendpage 
does not mean we receive all AD.

Again, to allow the caller the greatest degree of freedom, you can call 
sendmsg with an arbitrary amount of bytes as often as you want (until we fill 
up all buffers) before the recvmsg is triggered. So, there is no need to send 
the entire AD (or even AD+message) buffer in one sendmsg call. Compare the 
AEAD interface with a hash interface:

- the AEAD sendmsg/sendpage is logically equivalent to a hash update that you 
can call an arbitrary number of times with an arbitrary number of bytes.

- the AEAD recvmsg is logically equivalent to the hash final.

This would mean that the check must stay in recvmsg as only here we know that 
the caller wants data to be processed.

> 
> PS we should add a length check for missing/partial auth tags
> to crypto_aead_decrypt.  We can then remove such checks from
> individual implementations.

I agree in full here. Shall I create such a patch together with the AEAD 
AF_ALG interface, or can we merge the AEAD without that patch now and create a 
separate patch later?
> 
> Thanks,


-- 
Ciao
Stephan

^ permalink raw reply

* Re: [PATCH v2] [media] Add RGB444_1X12 and RGB565_1X16 media bus formats
From: Boris Brezillon @ 2014-12-29 12:45 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Boris Brezillon, Hans Verkuil, Laurent Pinchart,
	linux-media-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, Philipp Zabel, Sakari Ailus
In-Reply-To: <1416126278-17708-1-git-send-email-boris.brezillon-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8@public.gmane.org>

Hello Mauro,

Last week I received a notification informing me that this patch was
"Not Applicable".
Could you give more details on why you think this should not go through
the media-tree (or am I misunderstanding the meaning of "Not
Applicable") ?

I really need this patch for the atmel HLCDC DRM driver, moreover this
patch from Philip [1] depends on mine.

Regards,

Boris

[1]http://comments.gmane.org/gmane.linux.drivers.video-input-infrastructure/85952

On Sun, 16 Nov 2014 09:24:38 +0100
Boris Brezillon <boris.brezillon-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8@public.gmane.org> wrote:

> Add RGB444_1X12 and RGB565_1X16 format definitions and update the
> documentation.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8@public.gmane.org>
> Acked-by: Mauro Carvalho Chehab <mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
> ---
> Changes since v1:
> - keep BPP and bits per sample ordering
> 
>  Documentation/DocBook/media/v4l/subdev-formats.xml | 40 ++++++++++++++++++++++
>  include/uapi/linux/media-bus-format.h              |  4 ++-
>  2 files changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/DocBook/media/v4l/subdev-formats.xml b/Documentation/DocBook/media/v4l/subdev-formats.xml
> index 18730b9..0d6f731 100644
> --- a/Documentation/DocBook/media/v4l/subdev-formats.xml
> +++ b/Documentation/DocBook/media/v4l/subdev-formats.xml
> @@ -176,6 +176,24 @@
>  	    </row>
>  	  </thead>
>  	  <tbody valign="top">
> +	    <row id="MEDIA-BUS-FMT-RGB444-1X12">
> +	      <entry>MEDIA_BUS_FMT_RGB444_1X12</entry>
> +	      <entry>0x100d</entry>
> +	      <entry></entry>
> +	      &dash-ent-20;
> +	      <entry>r<subscript>3</subscript></entry>
> +	      <entry>r<subscript>2</subscript></entry>
> +	      <entry>r<subscript>1</subscript></entry>
> +	      <entry>r<subscript>0</subscript></entry>
> +	      <entry>g<subscript>3</subscript></entry>
> +	      <entry>g<subscript>2</subscript></entry>
> +	      <entry>g<subscript>1</subscript></entry>
> +	      <entry>g<subscript>0</subscript></entry>
> +	      <entry>b<subscript>3</subscript></entry>
> +	      <entry>b<subscript>2</subscript></entry>
> +	      <entry>b<subscript>1</subscript></entry>
> +	      <entry>b<subscript>0</subscript></entry>
> +	    </row>
>  	    <row id="MEDIA-BUS-FMT-RGB444-2X8-PADHI-BE">
>  	      <entry>MEDIA_BUS_FMT_RGB444_2X8_PADHI_BE</entry>
>  	      <entry>0x1001</entry>
> @@ -288,6 +306,28 @@
>  	      <entry>g<subscript>4</subscript></entry>
>  	      <entry>g<subscript>3</subscript></entry>
>  	    </row>
> +	    <row id="MEDIA-BUS-FMT-RGB565-1X16">
> +	      <entry>MEDIA_BUS_FMT_RGB565_1X16</entry>
> +	      <entry>0x100d</entry>
> +	      <entry></entry>
> +	      &dash-ent-16;
> +	      <entry>r<subscript>4</subscript></entry>
> +	      <entry>r<subscript>3</subscript></entry>
> +	      <entry>r<subscript>2</subscript></entry>
> +	      <entry>r<subscript>1</subscript></entry>
> +	      <entry>r<subscript>0</subscript></entry>
> +	      <entry>g<subscript>5</subscript></entry>
> +	      <entry>g<subscript>4</subscript></entry>
> +	      <entry>g<subscript>3</subscript></entry>
> +	      <entry>g<subscript>2</subscript></entry>
> +	      <entry>g<subscript>1</subscript></entry>
> +	      <entry>g<subscript>0</subscript></entry>
> +	      <entry>b<subscript>4</subscript></entry>
> +	      <entry>b<subscript>3</subscript></entry>
> +	      <entry>b<subscript>2</subscript></entry>
> +	      <entry>b<subscript>1</subscript></entry>
> +	      <entry>b<subscript>0</subscript></entry>
> +	    </row>
>  	    <row id="MEDIA-BUS-FMT-BGR565-2X8-BE">
>  	      <entry>MEDIA_BUS_FMT_BGR565_2X8_BE</entry>
>  	      <entry>0x1005</entry>
> diff --git a/include/uapi/linux/media-bus-format.h b/include/uapi/linux/media-bus-format.h
> index 23b4090..37091c6 100644
> --- a/include/uapi/linux/media-bus-format.h
> +++ b/include/uapi/linux/media-bus-format.h
> @@ -33,11 +33,13 @@
>  
>  #define MEDIA_BUS_FMT_FIXED			0x0001
>  
> -/* RGB - next is	0x100e */
> +/* RGB - next is	0x1010 */
> +#define MEDIA_BUS_FMT_RGB444_1X12		0x100e
>  #define MEDIA_BUS_FMT_RGB444_2X8_PADHI_BE	0x1001
>  #define MEDIA_BUS_FMT_RGB444_2X8_PADHI_LE	0x1002
>  #define MEDIA_BUS_FMT_RGB555_2X8_PADHI_BE	0x1003
>  #define MEDIA_BUS_FMT_RGB555_2X8_PADHI_LE	0x1004
> +#define MEDIA_BUS_FMT_RGB565_1X16		0x100f
>  #define MEDIA_BUS_FMT_BGR565_2X8_BE		0x1005
>  #define MEDIA_BUS_FMT_BGR565_2X8_LE		0x1006
>  #define MEDIA_BUS_FMT_RGB565_2X8_BE		0x1007



-- 
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH v6 4/4] crypto: AF_ALG: enable RNG interface compilation
From: Herbert Xu @ 2014-12-29 10:41 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto, linux-api
In-Reply-To: <1531939.gTKbxagG6Z@tachyon.chronox.de>

On Thu, Dec 25, 2014 at 11:00:39PM +0100, Stephan Mueller wrote:
> Enable compilation of the RNG AF_ALG support and provide a Kconfig
> option to compile the RNG AF_ALG support.
> 
> Signed-off-by: Stephan Mueller <smueller@chronox.de>

Patch applied.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v6 3/4] crypto: AF_ALG: add random number generator support
From: Herbert Xu @ 2014-12-29 10:41 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <2323421.LJnyPUDp59-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>

On Thu, Dec 25, 2014 at 11:00:06PM +0100, Stephan Mueller wrote:
> This patch adds the random number generator support for AF_ALG.
> 
> A random number generator's purpose is to generate data without
> requiring the caller to provide any data. Therefore, the AF_ALG
> interface handler for RNGs only implements a callback handler for
> recvmsg.
> 
> The following parameters provided with a recvmsg are processed by the
> RNG callback handler:
> 
> 	* sock - to resolve the RNG context data structure accessing the
> 	  RNG instance private to the socket
> 
> 	* len - this parameter allows userspace callers to specify how
> 	  many random bytes the RNG shall produce and return. As the
> 	  kernel context for the RNG allocates a buffer of 128 bytes to
> 	  store random numbers before copying them to userspace, the len
> 	  parameter is checked that it is not larger than 128. If a
> 	  caller wants more random numbers, a new request for recvmsg
> 	  shall be made.
> 
> The size of 128 bytes is chose because of the following considerations:
> 
> 	* to increase the memory footprint of the kernel too much (note,
> 	  that would be 128 bytes per open socket)
> 
> 	* 128 is divisible by any typical cryptographic block size an
> 	  RNG may have
> 
> 	* A request for random numbers typically only shall supply small
> 	  amount of data like for keys or IVs that should only require
> 	  one invocation of the recvmsg function.
> 
> Note, during instantiation of the RNG, the code checks whether the RNG
> implementation requires seeding. If so, the RNG is seeded with output
> from get_random_bytes.
> 
> A fully working example using all aspects of the RNG interface is
> provided at http://www.chronox.de/libkcapi.html
> 
> Signed-off-by: Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>

Patch applied.
-- 
Email: Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v6 1/4] crypto: AF_ALG: add AEAD support
From: Herbert Xu @ 2014-12-29 10:33 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <5002301.TQO37u96dE-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>

On Thu, Dec 25, 2014 at 11:01:47PM +0100, Stephan Mueller wrote:
>
> +	err = -ENOMEM;

This should be EINVAL.

> +	if (!aead_sufficient_data(ctx))
> +		goto unlock;

So we're checking two things here, one that we have enough data
for AD and two we have the authentication tag.  The latter is
redundant as the underlying implementation should be able to cope
with short input so we should only check the assoclen here.

Also this check should be moved to the sendmsg side as that'll
make it more obvious as to what went wrong.

PS we should add a length check for missing/partial auth tags
to crypto_aead_decrypt.  We can then remove such checks from
individual implementations.

Thanks,
-- 
Email: Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v6 0/4] crypto: AF_ALG: add AEAD and RNG support
From: Herbert Xu @ 2014-12-29 10:20 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto, linux-api
In-Reply-To: <5682082.ffPqvQlSqN@tachyon.chronox.de>

On Thu, Dec 25, 2014 at 10:58:01PM +0100, Stephan Mueller wrote:
> Hi,
> 
> This patch set adds AEAD and RNG support to the AF_ALG interface
> exported by the kernel crypto API. By extending AF_ALG with AEAD and RNG
> support, all cipher types the kernel crypto API allows access to are
> now accessible from userspace.

For some reason your 1st patch came out last due to its Date
header.  Please fix this up in your next submission as otherwise
it screws up the patch ordering for me.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v3 00/20] kselftest install target feature
From: Michael Ellerman @ 2014-12-29  4:53 UTC (permalink / raw)
  To: Shuah Khan
  Cc: mmarek-AlSwsSmVLrQ, gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	rostedt-nx8X9YLhiw1AfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, keescook-F7+t8E8rja9g9hUCZPvPmw,
	tranmanphong-Re5JQEeQqe8AvxtiuMwx3w, cov-sgV2jX0FEOL9JmXXK+q4OQ,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, hughd-hpIqsD4AKlfQT0dZR+AlfA,
	bobby.prani-Re5JQEeQqe8AvxtiuMwx3w,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, tim.bird-/MT0OVThwyLZJqsBc5GL+g,
	josh-iaAMLnmF4UmaiuxdJuQwMA, koct9i-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kbuild-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <cover.1419387513.git.shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

On Wed, 2014-12-24 at 09:27 -0700, Shuah Khan wrote:
> This patch series adds a new kselftest_install make target
> to enable selftest install. When make kselftest_install is
> run, selftests are installed on the system. A new install
> target is added to selftests Makefile which will install
> targets for the tests that are specified in INSTALL_TARGETS.
> During install, a script is generated to run tests that are
> installed. This script will be installed in the selftest install
> directory. Individual test Makefiles are changed to add to the
> script. This will allow new tests to add install and run test
> commands to the generated kselftest script. kselftest target
> now depends on kselftest_install and runs the generated kselftest
> script to reduce duplicate work and for common look and feel when
> running tests.
> 
> This approach leverages and extends the existing framework that
> uses makefile targets to implement run_tests and adds install
> target. This will scale well as new tests get added and makes
> it easier for test writers to add install target at the same
> time new test gets added.
> 
> This v3 series reduces duplicate code to generate script
> in indiviual test Makefiles and consolidates support in
> selftests main Makefile. In the main Makefile, it does
> minimal work to set and export install path. In this
> series exec and powerpc tests are not included in the
> install, this work will be done in future patches. exec
> and powerpc are still run when make kselftest is invoked.

Any particular reason you excluded the powerpc tests? Going by a quick count,
powerpc has 32 of the 54 self tests, ie. more than half.

Sorry I didn't get a chance to review v1 or v2, but is this really the best
solution we can come up with? It seems to involve a lot of boiler plate getting
repeated in every Makefile.

I'm off this week so I can't immediately come up with something better, I'll
try in the new year.

cheers

^ permalink raw reply

* Andoid Binder sneaking in [was Re: [GIT PULL] Staging driver patches for 3.19-rc1]
From: Pavel Machek @ 2014-12-28 17:53 UTC (permalink / raw)
  To: Greg KH
  Cc: Richard Weinberger, linux-api, LKML, Christoph Hellwig, arve,
	john.stultz, viro, devel@linuxdriverproject.org, Andrew Morton,
	Linus Torvalds
In-Reply-To: <20141215184103.GA6761@kroah.com>

On Mon 2014-12-15 10:41:03, Greg KH wrote:
> On Mon, Dec 15, 2014 at 10:39:15AM -0800, Christoph Hellwig wrote:
> > On Mon, Dec 15, 2014 at 07:23:35PM +0100, Richard Weinberger wrote:
> > > I don't understand this kind of logic.
> > > a) Binder is considered a piece of shite.
> > > b) Google is working on a (hopefully sane) replacement.
> > > 
> > > Why moving it out of staging then? What is the benefit?
> > 
> > There is none, and Greg didn't even bother addressing the various
> > comments when this first came up.
> 
> I thought I did, it was a long thread at the time, and I was on the road
> for 3 weeks, sorry if I missed something.

I pointed quite a lot of simple cleanups that could be done, but got
no feedback...

You should really post new version for review to people that commented
on the old one.

Plus "I set a rule that code must be cleaned in staging, and this is
not happening here, so it has to be moved to mainline, ignoring all
the usual rules" is quite interesting justification.

> > So a clear NAK from me on this one.
> 
> You don't have to maintain it, I do, so why does it concern you?

You ignored even NAKs from people that maintain stuff this interfaces with.

Late NAK here, too, FWIW. Because it is going to be used as an
argument "it is in mainline, so it must be ok". You are willing to
ignore mainline rules for this; it should be way easier to ignore
single staging rule for this one.
									
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply

* [PATCH net-next] l2tp : multicast notification to the registered listeners
From: Stephen Hemminger @ 2014-12-27 18:12 UTC (permalink / raw)
  To: David Miller; +Cc: Tom Herbert, Guillaume Nault, linux-api, netdev

From: Bill Hong <bhong@brocade.com>

Previously l2tp module did not provide any means for the user space to
get notified when tunnels/sessions are added/modified/deleted.
This change contains the following
- create a multicast group for the listeners to register.
- notify the registered listeners when the tunnels/sessions are
  created/modified/deleted.

Signed-off-by: Bill Hong <bhong@brocade.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Sven-Thorsten Dietrich <sven@brocade.com>


--- a/include/uapi/linux/l2tp.h	2014-12-27 09:58:06.650962294 -0800
+++ b/include/uapi/linux/l2tp.h	2014-12-27 09:58:06.646962270 -0800
@@ -178,5 +178,6 @@ enum l2tp_seqmode {
  */
 #define L2TP_GENL_NAME		"l2tp"
 #define L2TP_GENL_VERSION	0x1
+#define L2TP_GENL_MCGROUP       "l2tp"
 
 #endif /* _UAPI_LINUX_L2TP_H_ */
--- a/net/l2tp/l2tp_netlink.c	2014-12-27 09:58:06.650962294 -0800
+++ b/net/l2tp/l2tp_netlink.c	2014-12-27 10:08:42.058892710 -0800
@@ -40,6 +40,18 @@ static struct genl_family l2tp_nl_family
 	.netnsok	= true,
 };
 
+static const struct genl_multicast_group l2tp_multicast_group[] = {
+	{
+		.name = L2TP_GENL_MCGROUP,
+	},
+};
+
+static int l2tp_nl_tunnel_send(struct sk_buff *skb, u32 portid, u32 seq,
+			       int flags, struct l2tp_tunnel *tunnel, u8 cmd);
+static int l2tp_nl_session_send(struct sk_buff *skb, u32 portid, u32 seq,
+				int flags, struct l2tp_session *session,
+				u8 cmd);
+
 /* Accessed under genl lock */
 static const struct l2tp_nl_cmd_ops *l2tp_nl_cmd_ops[__L2TP_PWTYPE_MAX];
 
@@ -97,6 +109,52 @@ out:
 	return ret;
 }
 
+static int l2tp_tunnel_notify(struct genl_family *family,
+			      struct genl_info *info,
+			      struct l2tp_tunnel *tunnel,
+			      u8 cmd)
+{
+	struct sk_buff *msg;
+	int ret;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	ret = l2tp_nl_tunnel_send(msg, info->snd_portid, info->snd_seq,
+				  NLM_F_ACK, tunnel, cmd);
+
+	if (ret >= 0)
+		return genlmsg_multicast_allns(family, msg, 0,	0, GFP_ATOMIC);
+
+	nlmsg_free(msg);
+
+	return ret;
+}
+
+static int l2tp_session_notify(struct genl_family *family,
+			       struct genl_info *info,
+			       struct l2tp_session *session,
+			       u8 cmd)
+{
+	struct sk_buff *msg;
+	int ret;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	ret = l2tp_nl_session_send(msg, info->snd_portid, info->snd_seq,
+				   NLM_F_ACK, session, cmd);
+
+	if (ret >= 0)
+		return genlmsg_multicast_allns(family, msg, 0,	0, GFP_ATOMIC);
+
+	nlmsg_free(msg);
+
+	return ret;
+}
+
 static int l2tp_nl_cmd_tunnel_create(struct sk_buff *skb, struct genl_info *info)
 {
 	u32 tunnel_id;
@@ -188,6 +246,9 @@ static int l2tp_nl_cmd_tunnel_create(str
 		break;
 	}
 
+	if (ret >= 0)
+		ret = l2tp_tunnel_notify(&l2tp_nl_family, info,
+					 tunnel, L2TP_CMD_TUNNEL_CREATE);
 out:
 	return ret;
 }
@@ -211,6 +272,9 @@ static int l2tp_nl_cmd_tunnel_delete(str
 		goto out;
 	}
 
+	l2tp_tunnel_notify(&l2tp_nl_family, info,
+			   tunnel, L2TP_CMD_TUNNEL_DELETE);
+
 	(void) l2tp_tunnel_delete(tunnel);
 
 out:
@@ -239,12 +303,15 @@ static int l2tp_nl_cmd_tunnel_modify(str
 	if (info->attrs[L2TP_ATTR_DEBUG])
 		tunnel->debug = nla_get_u32(info->attrs[L2TP_ATTR_DEBUG]);
 
+	ret = l2tp_tunnel_notify(&l2tp_nl_family, info,
+				 tunnel, L2TP_CMD_TUNNEL_MODIFY);
+
 out:
 	return ret;
 }
 
 static int l2tp_nl_tunnel_send(struct sk_buff *skb, u32 portid, u32 seq, int flags,
-			       struct l2tp_tunnel *tunnel)
+			       struct l2tp_tunnel *tunnel, u8 cmd)
 {
 	void *hdr;
 	struct nlattr *nest;
@@ -254,8 +321,7 @@ static int l2tp_nl_tunnel_send(struct sk
 	struct ipv6_pinfo *np = NULL;
 #endif
 
-	hdr = genlmsg_put(skb, portid, seq, &l2tp_nl_family, flags,
-			  L2TP_CMD_TUNNEL_GET);
+	hdr = genlmsg_put(skb, portid, seq, &l2tp_nl_family, flags, cmd);
 	if (!hdr)
 		return -EMSGSIZE;
 
@@ -359,7 +425,7 @@ static int l2tp_nl_cmd_tunnel_get(struct
 	}
 
 	ret = l2tp_nl_tunnel_send(msg, info->snd_portid, info->snd_seq,
-				  NLM_F_ACK, tunnel);
+				  NLM_F_ACK, tunnel, L2TP_CMD_TUNNEL_GET);
 	if (ret < 0)
 		goto err_out;
 
@@ -385,7 +451,7 @@ static int l2tp_nl_cmd_tunnel_dump(struc
 
 		if (l2tp_nl_tunnel_send(skb, NETLINK_CB(cb->skb).portid,
 					cb->nlh->nlmsg_seq, NLM_F_MULTI,
-					tunnel) <= 0)
+					tunnel, L2TP_CMD_TUNNEL_GET) <= 0)
 			goto out;
 
 		ti++;
@@ -539,6 +605,13 @@ static int l2tp_nl_cmd_session_create(st
 		ret = (*l2tp_nl_cmd_ops[cfg.pw_type]->session_create)(net, tunnel_id,
 			session_id, peer_session_id, &cfg);
 
+	if (ret >= 0) {
+		session = l2tp_session_find(net, tunnel, session_id);
+		if (session)
+			ret = l2tp_session_notify(&l2tp_nl_family, info, session,
+						  L2TP_CMD_SESSION_CREATE);
+	}
+
 out:
 	return ret;
 }
@@ -555,6 +628,9 @@ static int l2tp_nl_cmd_session_delete(st
 		goto out;
 	}
 
+	l2tp_session_notify(&l2tp_nl_family, info,
+			    session, L2TP_CMD_SESSION_DELETE);
+
 	pw_type = session->pwtype;
 	if (pw_type < __L2TP_PWTYPE_MAX)
 		if (l2tp_nl_cmd_ops[pw_type] && l2tp_nl_cmd_ops[pw_type]->session_delete)
@@ -601,12 +677,15 @@ static int l2tp_nl_cmd_session_modify(st
 	if (info->attrs[L2TP_ATTR_MRU])
 		session->mru = nla_get_u16(info->attrs[L2TP_ATTR_MRU]);
 
+	ret = l2tp_session_notify(&l2tp_nl_family, info,
+				  session, L2TP_CMD_SESSION_MODIFY);
+
 out:
 	return ret;
 }
 
 static int l2tp_nl_session_send(struct sk_buff *skb, u32 portid, u32 seq, int flags,
-				struct l2tp_session *session)
+				struct l2tp_session *session, u8 cmd)
 {
 	void *hdr;
 	struct nlattr *nest;
@@ -615,7 +694,7 @@ static int l2tp_nl_session_send(struct s
 
 	sk = tunnel->sock;
 
-	hdr = genlmsg_put(skb, portid, seq, &l2tp_nl_family, flags, L2TP_CMD_SESSION_GET);
+	hdr = genlmsg_put(skb, portid, seq, &l2tp_nl_family, flags, cmd);
 	if (!hdr)
 		return -EMSGSIZE;
 
@@ -699,7 +778,7 @@ static int l2tp_nl_cmd_session_get(struc
 	}
 
 	ret = l2tp_nl_session_send(msg, info->snd_portid, info->snd_seq,
-				   0, session);
+				   0, session, L2TP_CMD_SESSION_GET);
 	if (ret < 0)
 		goto err_out;
 
@@ -737,7 +816,7 @@ static int l2tp_nl_cmd_session_dump(stru
 
 		if (l2tp_nl_session_send(skb, NETLINK_CB(cb->skb).portid,
 					 cb->nlh->nlmsg_seq, NLM_F_MULTI,
-					 session) <= 0)
+					 session, L2TP_CMD_SESSION_GET) <= 0)
 			break;
 
 		si++;
@@ -896,7 +975,9 @@ EXPORT_SYMBOL_GPL(l2tp_nl_unregister_ops
 static int l2tp_nl_init(void)
 {
 	pr_info("L2TP netlink interface\n");
-	return genl_register_family_with_ops(&l2tp_nl_family, l2tp_nl_ops);
+	return genl_register_family_with_ops_groups(&l2tp_nl_family,
+						    l2tp_nl_ops,
+						    l2tp_multicast_group);
 }
 
 static void l2tp_nl_cleanup(void)

^ permalink raw reply

* [PATCH v6 1/4] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2014-12-25 22:01 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto, linux-api
In-Reply-To: <5682082.ffPqvQlSqN@tachyon.chronox.de>

This patch adds the AEAD support for AF_ALG.

The implementation is based on algif_skcipher, but contains heavy
modifications to streamline the interface for AEAD uses.

To use AEAD, the user space consumer has to use the salg_type named
"aead".

The AEAD implementation includes some overhead to calculate the size of
the ciphertext, because the AEAD implementation of the kernel crypto API
makes implied assumption on the location of the authentication tag. When
performing an encryption, the tag will be added to the created
ciphertext (note, the tag is placed adjacent to the ciphertext). For
decryption, the caller must hand in the ciphertext with the tag appended
to the ciphertext. Therefore, the selection of the used memory
needs to add/subtract the tag size from the source/destination buffers
depending on the encryption type. The code is provided with comments
explaining when and how that operation is performed.

A fully working example using all aspects of AEAD is provided at
http://www.chronox.de/libkcapi.html

Signed-off-by: Stephan Mueller <smueller@chronox.de>
---
 crypto/algif_aead.c | 651 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 651 insertions(+)
 create mode 100644 crypto/algif_aead.c

diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
new file mode 100644
index 0000000..c5d7e26
--- /dev/null
+++ b/crypto/algif_aead.c
@@ -0,0 +1,651 @@
+/*
+ * algif_aeadr: User-space interface for AEAD algorithms
+ *
+ * Copyright (C) 2014, Stephan Mueller <smueller@chronox.de>
+ *
+ * This file provides the user-space API for AEAD ciphers.
+ *
+ * This file is derived from algif_skcipher.c.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#include <crypto/scatterwalk.h>
+#include <crypto/if_alg.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <net/sock.h>
+
+struct aead_sg_list {
+	unsigned int cur;
+	struct scatterlist sg[ALG_MAX_PAGES];
+};
+
+struct aead_ctx {
+	struct aead_sg_list tsgl;
+	struct af_alg_sgl rsgl;
+
+	void *iv;
+
+	struct af_alg_completion completion;
+
+	unsigned long used;
+
+	unsigned int len;
+	bool more;
+	bool merge;
+	bool enc;
+
+	size_t aead_assoclen;
+	struct aead_request aead_req;
+};
+
+static inline int aead_sndbuf(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+
+	return max_t(int, max_t(int, sk->sk_sndbuf & PAGE_MASK, PAGE_SIZE) -
+			  ctx->used, 0);
+}
+
+static inline bool aead_writable(struct sock *sk)
+{
+	return PAGE_SIZE <= aead_sndbuf(sk);
+}
+
+static inline bool aead_sufficient_data(struct aead_ctx *ctx)
+{
+	unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+
+	return (ctx->used >= (ctx->aead_assoclen + (ctx->enc ?: as)));
+}
+static inline bool aead_readable(struct aead_ctx *ctx)
+{
+	/*
+	 * Ensure that assoc data is present, the plaintext / ciphertext
+	 * is non-zero and that the authentication tag is also present
+	 * in case of a decryption operation.
+	 *
+	 * Also, wait until all data is received before processing.
+	 */
+	return (aead_sufficient_data(ctx) && !ctx->more);
+}
+
+static void aead_put_sgl(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	struct scatterlist *sg = sgl->sg;
+	unsigned int i;
+
+	for (i = 0; i < sgl->cur; i++) {
+		if (!sg_page(sg + i))
+			continue;
+
+		put_page(sg_page(sg + i));
+		sg_assign_page(sg + i, NULL);
+	}
+	sgl->cur = 0;
+	ctx->used = 0;
+	ctx->more = 0;
+	ctx->merge = 0;
+}
+
+static int aead_wait_for_wmem(struct sock *sk, unsigned flags)
+{
+	long timeout;
+	DEFINE_WAIT(wait);
+	int err = -ERESTARTSYS;
+
+	if (flags & MSG_DONTWAIT)
+		return -EAGAIN;
+
+	set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+	for (;;) {
+		if (signal_pending(current))
+			break;
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		timeout = MAX_SCHEDULE_TIMEOUT;
+		if (sk_wait_event(sk, &timeout, aead_writable(sk))) {
+			err = 0;
+			break;
+		}
+	}
+	finish_wait(sk_sleep(sk), &wait);
+
+	return err;
+}
+
+static void aead_wmem_wakeup(struct sock *sk)
+{
+	struct socket_wq *wq;
+
+	if (!aead_writable(sk))
+		return;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
+							   POLLRDNORM |
+							   POLLRDBAND);
+	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	rcu_read_unlock();
+}
+
+static int aead_wait_for_data(struct sock *sk, unsigned flags)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	long timeout;
+	DEFINE_WAIT(wait);
+	int err = -ERESTARTSYS;
+
+	if (flags & MSG_DONTWAIT) {
+		return -EAGAIN;
+	}
+
+	set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+	for (;;) {
+		if (signal_pending(current))
+			break;
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		timeout = MAX_SCHEDULE_TIMEOUT;
+		if (sk_wait_event(sk, &timeout, aead_readable(ctx))) {
+			err = 0;
+			break;
+		}
+	}
+	finish_wait(sk_sleep(sk), &wait);
+
+	clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+	return err;
+}
+
+static void aead_data_wakeup(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	struct socket_wq *wq;
+
+	if (!aead_readable(ctx))
+		return;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
+							   POLLRDNORM |
+							   POLLRDBAND);
+	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+	rcu_read_unlock();
+}
+
+static int aead_sendmsg(struct kiocb *unused, struct socket *sock,
+		        struct msghdr *msg, size_t size)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned ivsize =
+		crypto_aead_ivsize(crypto_aead_reqtfm(&ctx->aead_req));
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	struct af_alg_control con = {};
+	long copied = 0;
+	bool enc = 0;
+	bool init = 0;
+	int err = -EINVAL;
+
+	if (msg->msg_controllen) {
+		err = af_alg_cmsg_send(msg, &con);
+		if (err)
+			return err;
+
+		init = 1;
+		switch (con.op) {
+		case ALG_OP_ENCRYPT:
+			enc = 1;
+			break;
+		case ALG_OP_DECRYPT:
+			enc = 0;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (con.iv && con.iv->ivlen != ivsize)
+			return -EINVAL;
+
+		if (!con.aead_assoclen)
+			return -EINVAL;
+
+		/* aead_recvmsg limits the maximum AD size to one page */
+		if (con.aead_assoclen > PAGE_SIZE)
+			return -E2BIG;
+	}
+
+	lock_sock(sk);
+	if (!ctx->more && ctx->used)
+		goto unlock;
+
+	if (init) {
+		ctx->enc = enc;
+		if (con.iv)
+			memcpy(ctx->iv, con.iv->iv, ivsize);
+
+		ctx->aead_assoclen = con.aead_assoclen;
+	}
+
+	while (size) {
+		unsigned long len = size;
+		struct scatterlist *sg = NULL;
+
+		if (ctx->merge) {
+			sg = sgl->sg + sgl->cur - 1;
+			len = min_t(unsigned long, len,
+				    PAGE_SIZE - sg->offset - sg->length);
+			err = memcpy_from_msg(page_address(sg_page(sg)) +
+					      sg->offset + sg->length,
+					      msg, len);
+			if (err)
+				goto unlock;
+
+			sg->length += len;
+			ctx->merge = (sg->offset + sg->length) &
+				     (PAGE_SIZE - 1);
+
+			ctx->used += len;
+			copied += len;
+			size -= len;
+		}
+
+		if (!aead_writable(sk)) {
+			/*
+			 * If there is more data to be expected, but we cannot
+			 * write more data, forcefully define that we do not
+			 * expect more data to invoke the AEAD operation. This
+			 * prevents a deadlock in user space.
+			 */
+			ctx->more = 0;
+			err = aead_wait_for_wmem(sk, msg->msg_flags);
+			if (err)
+				goto unlock;
+		}
+
+		len = min_t(unsigned long, size, aead_sndbuf(sk));
+		while (len && sgl->cur < ALG_MAX_PAGES) {
+			int plen = 0;
+
+			sg = sgl->sg + sgl->cur;
+			plen = min_t(int, len, PAGE_SIZE);
+
+			if (sgl->cur >= ALG_MAX_PAGES) {
+				err = -E2BIG;
+				goto unlock;
+			}
+
+			sg_assign_page(sg, alloc_page(GFP_KERNEL));
+			err = -ENOMEM;
+			if (!sg_page(sg))
+				goto unlock;
+
+			err = memcpy_from_msg(page_address(sg_page(sg)),
+					      msg, plen);
+			if (err) {
+				__free_page(sg_page(sg));
+				sg_assign_page(sg, NULL);
+				goto unlock;
+			}
+
+			sg->length = plen;
+			len -= plen;
+			ctx->used += plen;
+			copied += plen;
+			sgl->cur++;
+			size -= plen;
+			ctx->merge = plen & (PAGE_SIZE - 1);
+		}
+	}
+
+	err = 0;
+
+	ctx->more = msg->msg_flags & MSG_MORE;
+
+unlock:
+	aead_data_wakeup(sk);
+	release_sock(sk);
+
+	return copied ?: err;
+}
+
+static ssize_t aead_sendpage(struct socket *sock, struct page *page,
+			     int offset, size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	int err = -EINVAL;
+
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		flags |= MSG_MORE;
+
+	if (sgl->cur >= ALG_MAX_PAGES)
+		return -E2BIG;
+
+	lock_sock(sk);
+	if (!ctx->more && ctx->used)
+		goto unlock;
+
+	if (!size)
+		goto done;
+
+	if (!aead_writable(sk)) {
+		/* see aead_sendmsg why more is set to 0 */
+		ctx->more = 0;
+		err = aead_wait_for_wmem(sk, flags);
+		if (err)
+			goto unlock;
+	}
+
+	ctx->merge = 0;
+
+	get_page(page);
+	sg_set_page(sgl->sg + sgl->cur, page, size, offset);
+	sgl->cur++;
+	ctx->used += size;
+
+	err = 0;
+
+done:
+	ctx->more = flags & MSG_MORE;
+
+unlock:
+	aead_data_wakeup(sk);
+	release_sock(sk);
+
+	return err ?: size;
+}
+
+static int aead_recvmsg(struct kiocb *unused, struct socket *sock,
+			    struct msghdr *msg, size_t ignored, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned bs = crypto_aead_blocksize(crypto_aead_reqtfm(&ctx->aead_req));
+	unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	struct scatterlist *sg = sgl->sg;
+	struct scatterlist assoc;
+	size_t assoclen = 0;
+	unsigned int i = 0;
+	int err = -EAGAIN;
+	unsigned long used = 0;
+	unsigned long outlen = 0;
+
+	/*
+	 * Require exactly one IOV block as the AEAD operation is a one shot
+	 * due to the authentication tag.
+	 */
+	if (msg->msg_iter.nr_segs != 1)
+		return -ENOMSG;
+
+	lock_sock(sk);
+	/*
+	* AEAD memory structure: For encryption, the tag is appended to the
+	* ciphertext which implies that the memory allocated for the ciphertext
+	* must be increased by the tag length. For decryption, the tag
+	* is expected to be concatenated to the ciphertext. The plaintext
+	* therefore has a memory size of the ciphertext minus the tag length.
+	*
+	* The memory structure for cipher operation has the following
+	* structure:
+	*	AEAD encryption input:  assoc data || plaintext
+	*	AEAD encryption output: cipherntext || auth tag
+	*	AEAD decryption input:  assoc data || ciphertext || auth tag
+	*	AEAD decryption output: plaintext
+	*/
+
+	if (!aead_readable(ctx)) {
+		err = aead_wait_for_data(sk, flags);
+		if (err)
+			goto unlock;
+	}
+
+	used = ctx->used;
+
+	err = -ENOMEM;
+	if (!aead_sufficient_data(ctx))
+		goto unlock;
+	/*
+	 * The cipher operation input data is reduced by the associated data
+	 * length as this data is processed separately later on.
+	 */
+	used -= ctx->aead_assoclen;
+
+	if (ctx->enc) {
+		/* round up output buffer to multiple of block size */
+		outlen = ((used + bs - 1) / bs * bs);
+		/* add the size needed for the auth tag to be created */
+		outlen += as;
+	} else {
+		/* output data size is input without the authentication tag */
+		outlen = used - as;
+		/* round up output buffer to multiple of block size */
+		outlen = ((outlen + bs - 1) / bs * bs);
+	}
+
+	/* ensure output buffer is sufficiently large */
+	if (msg->msg_iter.iov->iov_len < outlen)
+		goto unlock;
+
+	outlen = af_alg_make_sg(&ctx->rsgl, msg->msg_iter.iov->iov_base,
+				outlen, 1);
+	err = outlen;
+	if (err < 0)
+		goto unlock;
+
+	err = -EINVAL;
+	/*
+	 * first chunk of input is AD -- one scatterlist entry is one page,
+	 * and we process only one scatterlist, the maximum size of AD is
+	 * one page
+	 */
+	sg_init_table(&assoc, 1);
+	sg_set_page(&assoc, sg_page(sg), ctx->aead_assoclen, sg->offset);
+	aead_request_set_assoc(&ctx->aead_req, &assoc, ctx->aead_assoclen);
+
+	/* point sg to cipher/plaintext start */
+	assoclen = ctx->aead_assoclen;
+	for(i = 0; i < ctx->tsgl.cur; i++) {
+		sg = sgl->sg + i;
+		if (sg->length <= assoclen) {
+			assoclen -= sg->length;
+			if (i >= ctx->tsgl.cur)
+				goto unlock;
+		} else {
+			sg->length -= assoclen;
+			sg->offset += assoclen;
+			break;
+		}
+	}
+
+	aead_request_set_crypt(&ctx->aead_req, sg, ctx->rsgl.sg, used, ctx->iv);
+
+	err = af_alg_wait_for_completion(ctx->enc ?
+					 crypto_aead_encrypt(&ctx->aead_req) :
+					 crypto_aead_decrypt(&ctx->aead_req),
+					 &ctx->completion);
+
+	af_alg_free_sg(&ctx->rsgl);
+
+	if (err)
+		goto unlock;
+
+	aead_put_sgl(sk);
+
+	err = 0;
+
+unlock:
+	aead_wmem_wakeup(sk);
+	release_sock(sk);
+
+	return err ? err : outlen;
+}
+
+static unsigned int aead_poll(struct file *file, struct socket *sock,
+				  poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned int mask;
+
+	sock_poll_wait(file, sk_sleep(sk), wait);
+	mask = 0;
+
+	if (aead_readable(ctx))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (aead_writable(sk))
+		mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
+
+	return mask;
+}
+
+static struct proto_ops algif_aead_ops = {
+	.family		=	PF_ALG,
+
+	.connect	=	sock_no_connect,
+	.socketpair	=	sock_no_socketpair,
+	.getname	=	sock_no_getname,
+	.ioctl		=	sock_no_ioctl,
+	.listen		=	sock_no_listen,
+	.shutdown	=	sock_no_shutdown,
+	.getsockopt	=	sock_no_getsockopt,
+	.mmap		=	sock_no_mmap,
+	.bind		=	sock_no_bind,
+	.accept		=	sock_no_accept,
+	.setsockopt	=	sock_no_setsockopt,
+
+	.release	=	af_alg_release,
+	.sendmsg	=	aead_sendmsg,
+	.sendpage	=	aead_sendpage,
+	.recvmsg	=	aead_recvmsg,
+	.poll		=	aead_poll,
+};
+
+static void *aead_bind(const char *name, u32 type, u32 mask)
+{
+	return crypto_alloc_aead(name, type, mask);
+}
+
+static void aead_release(void *private)
+{
+	crypto_free_aead(private);
+}
+
+static int aead_setauthsize(void *private, unsigned int authsize)
+{
+	return crypto_aead_setauthsize(private, authsize);
+}
+
+static int aead_setkey(void *private, const u8 *key, unsigned int keylen)
+{
+	return crypto_aead_setkey(private, key, keylen);
+}
+
+static void aead_sock_destruct(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned int ivlen = crypto_aead_ivsize(
+				crypto_aead_reqtfm(&ctx->aead_req));
+
+	aead_put_sgl(sk);
+	sock_kzfree_s(sk, ctx->iv, ivlen);
+	sock_kfree_s(sk, ctx, ctx->len);
+	af_alg_release_parent(sk);
+}
+
+static int aead_accept_parent(void *private, struct sock *sk)
+{
+	struct aead_ctx *ctx;
+	struct alg_sock *ask = alg_sk(sk);
+	unsigned int len = sizeof(*ctx) + crypto_aead_reqsize(private);
+	unsigned int ivlen = crypto_aead_ivsize(private);
+
+	ctx = sock_kmalloc(sk, len, GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+	memset(ctx, 0, len);
+
+	ctx->iv = sock_kmalloc(sk, ivlen, GFP_KERNEL);
+	if (!ctx->iv) {
+		sock_kfree_s(sk, ctx, len);
+		return -ENOMEM;
+	}
+	memset(ctx->iv, 0, ivlen);
+
+	ctx->len = len;
+	ctx->used = 0;
+	ctx->more = 0;
+	ctx->merge = 0;
+	ctx->enc = 0;
+	ctx->tsgl.cur = 0;
+	ctx->aead_assoclen = 0;
+	af_alg_init_completion(&ctx->completion);
+	sg_init_table(ctx->tsgl.sg, ALG_MAX_PAGES);
+
+	ask->private = ctx;
+
+	aead_request_set_tfm(&ctx->aead_req, private);
+	aead_request_set_callback(&ctx->aead_req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				  af_alg_complete, &ctx->completion);
+
+	sk->sk_destruct = aead_sock_destruct;
+
+	return 0;
+}
+
+static const struct af_alg_type algif_type_aead = {
+	.bind		=	aead_bind,
+	.release	=	aead_release,
+	.setkey		=	aead_setkey,
+	.setauthsize	=	aead_setauthsize,
+	.accept		=	aead_accept_parent,
+	.ops		=	&algif_aead_ops,
+	.name		=	"aead",
+	.owner		=	THIS_MODULE
+};
+
+static int __init algif_aead_init(void)
+{
+	return af_alg_register_type(&algif_type_aead);
+}
+
+static void __exit algif_aead_exit(void)
+{
+	int err = af_alg_unregister_type(&algif_type_aead);
+	BUG_ON(err);
+}
+
+module_init(algif_aead_init);
+module_exit(algif_aead_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Stephan Mueller <smueller@chronox.de>");
+MODULE_DESCRIPTION("AEAD kernel crypto API user space interface");
-- 
2.1.0

^ permalink raw reply related

* [PATCH v6 4/4] crypto: AF_ALG: enable RNG interface compilation
From: Stephan Mueller @ 2014-12-25 22:00 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto, linux-api
In-Reply-To: <5682082.ffPqvQlSqN@tachyon.chronox.de>

Enable compilation of the RNG AF_ALG support and provide a Kconfig
option to compile the RNG AF_ALG support.

Signed-off-by: Stephan Mueller <smueller@chronox.de>
---
 crypto/Kconfig  | 9 +++++++++
 crypto/Makefile | 1 +
 2 files changed, 10 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index cd3e6fd..f2d434b 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1523,6 +1523,15 @@ config CRYPTO_USER_API_AEAD
 	  This option enables the user-spaces interface for AEAD
 	  cipher algorithms.
 
+config CRYPTO_USER_API_RNG
+	tristate "User-space interface for random number generator algorithms"
+	depends on NET
+	select CRYPTO_RNG
+	select CRYPTO_USER_API
+	help
+	  This option enables the user-spaces interface for random
+	  number generator algorithms.
+
 config CRYPTO_HASH_INFO
 	bool
 
diff --git a/crypto/Makefile b/crypto/Makefile
index 593fd3c..c109df5 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
 obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
 obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
+obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o
 
 #
 # generic algorithms and the async_tx api
-- 
2.1.0

^ permalink raw reply related

* [PATCH v6 3/4] crypto: AF_ALG: add random number generator support
From: Stephan Mueller @ 2014-12-25 22:00 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto, linux-api
In-Reply-To: <5682082.ffPqvQlSqN@tachyon.chronox.de>

This patch adds the random number generator support for AF_ALG.

A random number generator's purpose is to generate data without
requiring the caller to provide any data. Therefore, the AF_ALG
interface handler for RNGs only implements a callback handler for
recvmsg.

The following parameters provided with a recvmsg are processed by the
RNG callback handler:

	* sock - to resolve the RNG context data structure accessing the
	  RNG instance private to the socket

	* len - this parameter allows userspace callers to specify how
	  many random bytes the RNG shall produce and return. As the
	  kernel context for the RNG allocates a buffer of 128 bytes to
	  store random numbers before copying them to userspace, the len
	  parameter is checked that it is not larger than 128. If a
	  caller wants more random numbers, a new request for recvmsg
	  shall be made.

The size of 128 bytes is chose because of the following considerations:

	* to increase the memory footprint of the kernel too much (note,
	  that would be 128 bytes per open socket)

	* 128 is divisible by any typical cryptographic block size an
	  RNG may have

	* A request for random numbers typically only shall supply small
	  amount of data like for keys or IVs that should only require
	  one invocation of the recvmsg function.

Note, during instantiation of the RNG, the code checks whether the RNG
implementation requires seeding. If so, the RNG is seeded with output
from get_random_bytes.

A fully working example using all aspects of the RNG interface is
provided at http://www.chronox.de/libkcapi.html

Signed-off-by: Stephan Mueller <smueller@chronox.de>
---
 crypto/algif_rng.c | 192 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 crypto/algif_rng.c

diff --git a/crypto/algif_rng.c b/crypto/algif_rng.c
new file mode 100644
index 0000000..91c06f5
--- /dev/null
+++ b/crypto/algif_rng.c
@@ -0,0 +1,192 @@
+/*
+ * algif_rng: User-space interface for random number generators
+ *
+ * This file provides the user-space API for random number generators.
+ *
+ * Copyright (C) 2014, Stephan Mueller <smueller@chronox.de>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, and the entire permission notice in its entirety,
+ *    including the disclaimer of warranties.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. The name of the author may not be used to endorse or promote
+ *    products derived from this software without specific prior
+ *    written permission.
+ *
+ * ALTERNATIVELY, this product may be distributed under the terms of
+ * the GNU General Public License, in which case the provisions of the GPL2
+ * are required INSTEAD OF the above restrictions.  (This clause is
+ * necessary due to a potential bad interaction between the GPL and
+ * the restrictions contained in a BSD-style copyright.)
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ALL OF
+ * WHICH ARE HEREBY DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
+ * OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
+ * USE OF THIS SOFTWARE, EVEN IF NOT ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ */
+
+#include <linux/module.h>
+#include <crypto/rng.h>
+#include <linux/random.h>
+#include <crypto/if_alg.h>
+#include <linux/net.h>
+#include <net/sock.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Stephan Mueller <smueller@chronox.de>");
+MODULE_DESCRIPTION("User-space interface for random number generators");
+
+struct rng_ctx {
+#define MAXSIZE 128
+	unsigned int len;
+	struct crypto_rng *drng;
+};
+
+static int rng_recvmsg(struct kiocb *unused, struct socket *sock,
+		       struct msghdr *msg, size_t len, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct rng_ctx *ctx = ask->private;
+	int err = -EFAULT;
+	int genlen = 0;
+	u8 result[MAXSIZE];
+
+	if (len == 0)
+		return 0;
+	if (len > MAXSIZE)
+		len = MAXSIZE;
+
+	/*
+	 * although not strictly needed, this is a precaution against coding
+	 * errors
+	 */
+	memset(result, 0, len);
+
+	/*
+	 * The enforcement of a proper seeding of an RNG is done within an
+	 * RNG implementation. Some RNGs (DRBG, krng) do not need specific
+	 * seeding as they automatically seed. The X9.31 DRNG will return
+	 * an error if it was not seeded properly.
+	 */
+	genlen = crypto_rng_get_bytes(ctx->drng, result, len);
+	if (genlen < 0)
+		return genlen;
+
+	err = memcpy_to_msg(msg, result, len);
+	memzero_explicit(result, genlen);
+
+	return err ? err : len;
+}
+
+static struct proto_ops algif_rng_ops = {
+	.family		=	PF_ALG,
+
+	.connect	=	sock_no_connect,
+	.socketpair	=	sock_no_socketpair,
+	.getname	=	sock_no_getname,
+	.ioctl		=	sock_no_ioctl,
+	.listen		=	sock_no_listen,
+	.shutdown	=	sock_no_shutdown,
+	.getsockopt	=	sock_no_getsockopt,
+	.mmap		=	sock_no_mmap,
+	.bind		=	sock_no_bind,
+	.accept		=	sock_no_accept,
+	.setsockopt	=	sock_no_setsockopt,
+	.poll		=	sock_no_poll,
+	.sendmsg	=	sock_no_sendmsg,
+	.sendpage	=	sock_no_sendpage,
+
+	.release	=	af_alg_release,
+	.recvmsg	=	rng_recvmsg,
+};
+
+static void *rng_bind(const char *name, u32 type, u32 mask)
+{
+	return crypto_alloc_rng(name, type, mask);
+}
+
+static void rng_release(void *private)
+{
+	crypto_free_rng(private);
+}
+
+static void rng_sock_destruct(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct rng_ctx *ctx = ask->private;
+
+	sock_kfree_s(sk, ctx, ctx->len);
+	af_alg_release_parent(sk);
+}
+
+static int rng_accept_parent(void *private, struct sock *sk)
+{
+	struct rng_ctx *ctx;
+	struct alg_sock *ask = alg_sk(sk);
+	unsigned int len = sizeof(*ctx);
+
+	ctx = sock_kmalloc(sk, len, GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->len = len;
+
+	/*
+	 * No seeding done at that point -- if multiple accepts are
+	 * done on one RNG instance, each resulting FD points to the same
+	 * state of the RNG.
+	 */
+
+	ctx->drng = private;
+	ask->private = ctx;
+	sk->sk_destruct = rng_sock_destruct;
+
+	return 0;
+}
+
+static int rng_setkey(void *private, const u8 *seed, unsigned int seedlen)
+{
+	/*
+	 * Check whether seedlen is of sufficient size is done in RNG
+	 * implementations.
+	 */
+	return crypto_rng_reset(private, (u8 *)seed, seedlen);
+}
+
+static const struct af_alg_type algif_type_rng = {
+	.bind		=	rng_bind,
+	.release	=	rng_release,
+	.accept		=	rng_accept_parent,
+	.setkey		=	rng_setkey,
+	.ops		=	&algif_rng_ops,
+	.name		=	"rng",
+	.owner		=	THIS_MODULE
+};
+
+static int __init rng_init(void)
+{
+	return af_alg_register_type(&algif_type_rng);
+}
+
+void __exit rng_exit(void)
+{
+	int err = af_alg_unregister_type(&algif_type_rng);
+	BUG_ON(err);
+}
+
+module_init(rng_init);
+module_exit(rng_exit);
-- 
2.1.0

^ permalink raw reply related

* [PATCH v6 2/4] crypto: AF_ALG: enable AEAD interface compilation
From: Stephan Mueller @ 2014-12-25 21:59 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <5682082.ffPqvQlSqN-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>

Enable compilation of the AEAD AF_ALG support and provide a Kconfig
option to compile the AEAD AF_ALG support.

Signed-off-by: Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>
---
 crypto/Kconfig  | 9 +++++++++
 crypto/Makefile | 1 +
 2 files changed, 10 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 1618468..cd3e6fd 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1514,6 +1514,15 @@ config CRYPTO_USER_API_SKCIPHER
 	  This option enables the user-spaces interface for symmetric
 	  key cipher algorithms.
 
+config CRYPTO_USER_API_AEAD
+	tristate "User-space interface for AEAD cipher algorithms"
+	depends on NET
+	select CRYPTO_AEAD
+	select CRYPTO_USER_API
+	help
+	  This option enables the user-spaces interface for AEAD
+	  cipher algorithms.
+
 config CRYPTO_HASH_INFO
 	bool
 
diff --git a/crypto/Makefile b/crypto/Makefile
index 1445b91..593fd3c 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_CRYPTO_GHASH) += ghash-generic.o
 obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
 obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
+obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
 
 #
 # generic algorithms and the async_tx api
-- 
2.1.0

^ permalink raw reply related

* [PATCH v6 0/4] crypto: AF_ALG: add AEAD and RNG support
From: Stephan Mueller @ 2014-12-25 21:58 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

Hi,

This patch set adds AEAD and RNG support to the AF_ALG interface
exported by the kernel crypto API. By extending AF_ALG with AEAD and RNG
support, all cipher types the kernel crypto API allows access to are
now accessible from userspace.

Both, AEAD and RNG implementations are stand-alone and do not depend
other AF_ALG interfaces (like hash or skcipher).

The AEAD implementation uses the same approach as provided with
skcipher by offering the following interfaces:

	* sendmsg and recvmsg interfaces allowing multiple
	  invocations supporting a threaded user space. To support
	  multi-threaded user space, kernel-side buffering
	  is implemented similarly to skcipher.

	* splice / vmsplice interfaces allowing a zero-copy
	  invocation

The RNG interface only implements the recvmsg interface as
zero-copy is not applicable.

The new AEAD and RNG interfaces are fully tested with the test application
provided at [1]. That test application exercises all newly added user space
interfaces. The testing covers:

	* use of the sendmsg/recvmsg interface

	* use of the splice / vmsplice interface

	* invocation of all AF_ALG types (aead, rng, skcipher, hash)

	* using all types of operation (encryption, decryption, keyed MD,
	  MD, random numbers, AEAD decryption with positive and negative
	  authentication verification)

	* stress testing by running all tests for 30 minutes in an
	  endless loop

	* test execution on 64 bit and 32 bit

[1] http://www.chronox.de/libkcapi.html

Changes v2:
* rebase to current cryptodev-2.6 tree
* use memzero_explicit to zeroize AEAD associated data
* use sizeof for determining length of AEAD associated data
* update algif_rng.c covering all suggestions from Daniel Borkmann
  <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
* addition of patch 9: add digestsize interface for hashes
* addition of patch to update documentation covering the userspace interface
* change numbers of getsockopt options: separate them from sendmsg interface
  definitions

Changes v3:
* remove getsockopt interface
* AEAD: associated data is set prepended to the plain/ciphertext
* AEAD: allowing arbitrary associated data lengths
* remove setkey patch as protection was already in the existing code

Changes v4:
* stand-alone implementation of AEAD
* testing of all interfaces offered by AEAD
* stress testing of AEAD and RNG

Changes v5:
* AEAD: add outer while(size) loop in aead_sendmsg to ensure all data is
  copied into the kernel (reporter Herbert Xu)
* AEAD: aead_sendmsg bug fix: change size -= len; to size -= plen;
* AF_ALG / AEAD: add aead_setauthsize and associated extension to
  struct af_alg_type as well as alg_setsockopt (reporter Herbert Xu)
* RNG: rng_recvmsg: use 128 byte stack variable for output of RNG instead
  of ctx->result (reporter Herbert Xu)
* RNG / AF_ALG: allow user space to seed RNG via setsockopt
* RNG: rng_recvmsg bug fix: use genlen as result variable for
  crypto_rng_get_bytes as previously no negative errors were obtained
* AF_ALG: alg_setop: zeroize buffer before free

Changes v6:
* AEAD/RNG: port to 3.19-rc1 with the iov_iter handling
* RNG: use the setkey interface to obtain the seed and drop the patch adding
  a separate reseeding interface
* extract the zeroization patch for alg_setkey into a stand-alone patch
  submission
* fix bug in aead_sufficient_data (reporter Herbert Xu)
* testing of all interfaces with test application provided with libkcapi version
  0.6.2

Stephan Mueller (4):
  crypto: AF_ALG: add AEAD support
  crypto: AF_ALG: enable AEAD interface compilation
  crypto: AF_ALG: add random number generator support
  crypto: AF_ALG: enable RNG interface compilation

 crypto/Kconfig      |  18 ++
 crypto/Makefile     |   2 +
 crypto/algif_aead.c | 651 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 crypto/algif_rng.c  | 192 ++++++++++++++++
 4 files changed, 863 insertions(+)
 create mode 100644 crypto/algif_aead.c
 create mode 100644 crypto/algif_rng.c

-- 
2.1.0

^ permalink raw reply

* Re: [PATCH v5 3/8] crypto: AF_ALG: add AEAD support
From: Herbert Xu @ 2014-12-25 20:28 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: Daniel Borkmann, 'Quentin Gouchet', 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <2159528.zCJB0y2Cap-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>

On Wed, Dec 24, 2014 at 09:54:33AM +0100, Stephan Mueller wrote:
> 
> That is right, but isn't that the nature of AEAD ciphers in general? Even if 
> you are in the kernel, you need to have all scatter lists together for one 
> invocation of the AEAD cipher.

It's actually only the nature of certain algorithms like CCM which
requires the total length to start the operation.  Most AEAD
algorithms can be implemented in a way that allows piecemeal
operation.  However, as the only users of AEAD is IPsec, it's
probably not worth adding more complexity for now.

So let's proceed with your current solution.

Thanks,
-- 
Email: Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox