Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH v4 5/7] fs/nfsd: support compiling out splice
From: Pieter Smith @ 2014-11-24 23:01 UTC (permalink / raw)
  To: pieter
  Cc: Josh Triplett, Alexander Duyck, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Bertrand Jacquin,
	Catalina Mocanu, Daniel Borkmann, David S. Miller, Eric Dumazet,
	Eric W. Biederman, Fabian Frederick,
	open list:FUSE: FILESYSTEM..., Geert Uytterhoeven, Hugh Dickins,
	Iulia Manda, Jan Beulich, J. Bruce Fields, Jeff Layton,
	open list:ABI/API, linux-fsdevel, linux-kernel
In-Reply-To: <1416870079-15254-1-git-send-email-pieter@boesman.nl>

The goal of the larger patch set is to completely compile out fs/splice, and
as a result, splice support for all file-systems. This patch ensures that
fs/nfsd falls back to non-splice fs support when CONFIG_SYSCALL_SPLICE is
undefined.

Signed-off-by: Pieter Smith <pieter@boesman.nl>
---
 net/sunrpc/svc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ca8a795..6cacc37 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1084,7 +1084,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
 		goto err_short_len;
 
 	/* Will be turned off only in gss privacy case: */
-	rqstp->rq_splice_ok = true;
+	rqstp->rq_splice_ok = IS_ENABLED(CONFIG_SPLICE_SYSCALL);
 	/* Will be turned off only when NFSv4 Sessions are used */
 	rqstp->rq_usedeferral = true;
 	rqstp->rq_dropme = false;
-- 
2.1.0

^ permalink raw reply related

* [PATCH v4 4/7] fs/fuse: support compiling out splice
From: Pieter Smith @ 2014-11-24 23:01 UTC (permalink / raw)
  To: pieter
  Cc: Josh Triplett, Alexander Duyck, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Bertrand Jacquin,
	Catalina Mocanu, Daniel Borkmann, David S. Miller, Eric Dumazet,
	Eric W. Biederman, Fabian Frederick,
	open list:FUSE: FILESYSTEM..., Geert Uytterhoeven, Hugh Dickins,
	Iulia Manda, Jan Beulich, J. Bruce Fields, Jeff Layton,
	open list:ABI/API, linux-fsdevel, linux-kernel
In-Reply-To: <1416870079-15254-1-git-send-email-pieter@boesman.nl>

To implement splice support, fs/fuse makes use of nosteal_pipe_buf_ops. This
struct is exported by fs/splice. The goal of the larger patch set is to
completely compile out fs/splice, so uses of the exported struct need to be
compiled out along with fs/splice.

This patch therefore compiles out splice support in fs/fuse when
CONFIG_SYSCALL_SPLICE is undefined.

Signed-off-by: Pieter Smith <pieter@boesman.nl>
---
 fs/fuse/dev.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ca88731..e984302 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1191,8 +1191,9 @@ __releases(fc->lock)
  * request_end().  Otherwise add it to the processing list, and set
  * the 'sent' flag.
  */
-static ssize_t fuse_dev_do_read(struct fuse_conn *fc, struct file *file,
-				struct fuse_copy_state *cs, size_t nbytes)
+static ssize_t __maybe_unused
+fuse_dev_do_read(struct fuse_conn *fc, struct file *file,
+		 struct fuse_copy_state *cs, size_t nbytes)
 {
 	int err;
 	struct fuse_req *req;
@@ -1291,6 +1292,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov,
 	return fuse_dev_do_read(fc, file, &cs, iov_length(iov, nr_segs));
 }
 
+#ifdef CONFIG_SYSCALL_SPLICE
 static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 				    struct pipe_inode_info *pipe,
 				    size_t len, unsigned int flags)
@@ -1368,6 +1370,9 @@ out:
 	kfree(bufs);
 	return ret;
 }
+#else /* CONFIG_SYSCALL_SPLICE */
+#define fuse_dev_splice_read NULL
+#endif
 
 static int fuse_notify_poll(struct fuse_conn *fc, unsigned int size,
 			    struct fuse_copy_state *cs)
-- 
2.1.0

^ permalink raw reply related

* [PATCH v4 3/7] fs/splice: support compiling out splice-family syscalls
From: Pieter Smith @ 2014-11-24 23:01 UTC (permalink / raw)
  To: pieter
  Cc: Josh Triplett, Alexander Duyck, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Bertrand Jacquin,
	Catalina Mocanu, Daniel Borkmann, David S. Miller, Eric Dumazet,
	Eric W. Biederman, Fabian Frederick,
	open list:FUSE: FILESYSTEM..., Geert Uytterhoeven, Hugh Dickins,
	Iulia Manda, Jan Beulich, J. Bruce Fields, Jeff Layton,
	open list:ABI/API, linux-fsdevel, linux-kernel
In-Reply-To: <1416870079-15254-1-git-send-email-pieter@boesman.nl>

Many embedded systems will not need the splice-family syscalls (splice,
vmsplice, tee and sendfile). Omitting them saves space.  This adds a new EXPERT
config option CONFIG_SYSCALL_SPLICE (default y) to support compiling them out.

The goal is to completely compile out fs/splice along with the syscalls. To
achieve this, the remaining patch-set will deal with fs/splice exports. As far
as possible, the impact on other device drivers will be minimized so as to
reduce the overal maintenance burden of CONFIG_SYSCALL_SPLICE.

The use of exported functions will be solved by transparently mocking them out
with static inlines. Uses of the exported pipe_buf_operations struct however
require direct modification in fs/fuse and net/core. The next two patches will
deal with this. A macro is defined that will assist with NULL'ing out callbacks
when CONFIG_SYSCALL_SPLICE is undefined: __splice_p().

Once all exports are solved, fs/splice can be compiled out.

The bloat benefit of this patch given a tinyconfig is:

add/remove: 0/16 grow/shrink: 2/5 up/down: 114/-3693 (-3579)
function                                     old     new   delta
splice_direct_to_actor                       348     416     +68
splice_to_pipe                               371     417     +46
splice_from_pipe_next                        107     106      -1
fdput                                         11       -     -11
signal_pending                                39      26     -13
fdget                                         56      42     -14
user_page_pipe_buf_ops                        20       -     -20
user_page_pipe_buf_steal                      25       -     -25
file_end_write                                58      29     -29
file_start_write                              68      34     -34
pipe_to_user                                  43       -     -43
wakeup_pipe_readers                           54       -     -54
do_splice_to                                  87       -     -87
ipipe_prep.part                               92       -     -92
opipe_prep.part                              119       -    -119
sys_sendfile                                 122       -    -122
sys_sendfile64                               126       -    -126
sys_vmsplice                                 137       -    -137
vmsplice_to_user                             205       -    -205
sys_tee                                      491       -    -491
do_sendfile                                  492       -    -492
vmsplice_to_pipe                             558       -    -558
sys_splice                                  1020       -   -1020

Signed-off-by: Pieter Smith <pieter@boesman.nl>
---
 fs/splice.c     |  2 ++
 init/Kconfig    | 10 ++++++++++
 kernel/sys_ni.c |  8 ++++++++
 3 files changed, 20 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 44b201b..7c4c695 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1316,6 +1316,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 	return ret;
 }
 
+#ifdef CONFIG_SYSCALL_SPLICE
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -2200,4 +2201,5 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+#endif
 
diff --git a/init/Kconfig b/init/Kconfig
index d811d5f..dec9819 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1571,6 +1571,16 @@ config NTP
 	  system clock to an NTP server, you can disable this option to save
 	  space.
 
+config SYSCALL_SPLICE
+	bool "Enable splice/vmsplice/tee/sendfile syscalls" if EXPERT
+	default y
+	help
+	  This option enables the splice, vmsplice, tee and sendfile syscalls. These
+	  are used by applications to: move data between buffers and arbitrary file
+	  descriptors; "copy" data between buffers; or copy data from userspace into
+	  buffers. If building an embedded system where no applications use these
+	  syscalls, you can disable this option to save space.
+
 config PCI_QUIRKS
 	default y
 	bool "Enable PCI quirk workarounds" if EXPERT
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d2f5b00..25d5551 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -170,6 +170,14 @@ cond_syscall(sys_fstat);
 cond_syscall(sys_stat);
 cond_syscall(sys_uname);
 cond_syscall(sys_olduname);
+cond_syscall(sys_vmsplice);
+cond_syscall(sys_splice);
+cond_syscall(sys_tee);
+cond_syscall(sys_sendfile);
+cond_syscall(sys_sendfile64);
+cond_syscall(compat_sys_vmsplice);
+cond_syscall(compat_sys_sendfile);
+cond_syscall(compat_sys_sendfile64);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
-- 
2.1.0

^ permalink raw reply related

* [PATCH v4 2/7] fs: moved kernel_write to fs/read_write
From: Pieter Smith @ 2014-11-24 23:01 UTC (permalink / raw)
  To: pieter
  Cc: Josh Triplett, Alexander Duyck, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Bertrand Jacquin,
	Catalina Mocanu, Daniel Borkmann, David S. Miller, Eric Dumazet,
	Eric W. Biederman, Fabian Frederick,
	open list:FUSE: FILESYSTEM..., Geert Uytterhoeven, Hugh Dickins,
	Iulia Manda, Jan Beulich, J. Bruce Fields, Jeff Layton,
	open list:ABI/API, linux-fsdevel, linux-kernel
In-Reply-To: <1416870079-15254-1-git-send-email-pieter@boesman.nl>

kernel_write shares infrastructure with the read_write translation unit but not
with the splice translation unit. Grouping kernel_write with the read_write
translation unit is more logical. It also paves the way to compiling out the
splice group of syscalls for embedded systems that do not need them.

Signed-off-by: Pieter Smith <pieter@boesman.nl>
---
 fs/read_write.c | 16 ++++++++++++++++
 fs/splice.c     | 16 ----------------
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d9451ba..f4c8d8b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1191,3 +1191,19 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
 }
 #endif
 
+ssize_t kernel_write(struct file *file, const char *buf, size_t count,
+			    loff_t pos)
+{
+	mm_segment_t old_fs;
+	ssize_t res;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	/* The cast to a user pointer is valid due to the set_fs() */
+	res = vfs_write(file, (__force const char __user *)buf, count, &pos);
+	set_fs(old_fs);
+
+	return res;
+}
+EXPORT_SYMBOL(kernel_write);
+
diff --git a/fs/splice.c b/fs/splice.c
index c1a2861..44b201b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -583,22 +583,6 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
 	return res;
 }
 
-ssize_t kernel_write(struct file *file, const char *buf, size_t count,
-			    loff_t pos)
-{
-	mm_segment_t old_fs;
-	ssize_t res;
-
-	old_fs = get_fs();
-	set_fs(get_ds());
-	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_write(file, (__force const char __user *)buf, count, &pos);
-	set_fs(old_fs);
-
-	return res;
-}
-EXPORT_SYMBOL(kernel_write);
-
 ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
-- 
2.1.0

^ permalink raw reply related

* [PATCH v4 1/7] fs: move sendfile syscall into fs/splice
From: Pieter Smith @ 2014-11-24 23:01 UTC (permalink / raw)
  To: pieter
  Cc: Josh Triplett, Alexander Duyck, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Bertrand Jacquin,
	Catalina Mocanu, Daniel Borkmann, David S. Miller, Eric Dumazet,
	Eric W. Biederman, Fabian Frederick,
	open list:FUSE: FILESYSTEM..., Geert Uytterhoeven, Hugh Dickins,
	Iulia Manda, Jan Beulich, J. Bruce Fields, Jeff Layton,
	open list:ABI/API, linux-fsdevel, linux-kernel
In-Reply-To: <1416870079-15254-1-git-send-email-pieter@boesman.nl>

sendfile functionally forms part of the splice group of syscalls (splice,
vmsplice and tee). Grouping sendfile with splice paves the way to compiling out
the splice group of syscalls for embedded systems that do not need these.

add/remove: 0/0 grow/shrink: 7/2 up/down: 86/-61 (25)
function                                     old     new   delta
file_start_write                              34      68     +34
file_end_write                                29      58     +29
sys_pwritev                                  115     122      +7
sys_preadv                                   115     122      +7
fdput_pos                                     29      36      +7
sys_pwrite64                                 115     116      +1
sys_pread64                                  115     116      +1
sys_tee                                      497     491      -6
sys_splice                                  1075    1020     -55

Signed-off-by: Pieter Smith <pieter@boesman.nl>
---
 fs/read_write.c | 175 -------------------------------------------------------
 fs/splice.c     | 178 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 178 insertions(+), 175 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 7d9318c..d9451ba 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1191,178 +1191,3 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
 }
 #endif
 
-static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
-		  	   size_t count, loff_t max)
-{
-	struct fd in, out;
-	struct inode *in_inode, *out_inode;
-	loff_t pos;
-	loff_t out_pos;
-	ssize_t retval;
-	int fl;
-
-	/*
-	 * Get input file, and verify that it is ok..
-	 */
-	retval = -EBADF;
-	in = fdget(in_fd);
-	if (!in.file)
-		goto out;
-	if (!(in.file->f_mode & FMODE_READ))
-		goto fput_in;
-	retval = -ESPIPE;
-	if (!ppos) {
-		pos = in.file->f_pos;
-	} else {
-		pos = *ppos;
-		if (!(in.file->f_mode & FMODE_PREAD))
-			goto fput_in;
-	}
-	retval = rw_verify_area(READ, in.file, &pos, count);
-	if (retval < 0)
-		goto fput_in;
-	count = retval;
-
-	/*
-	 * Get output file, and verify that it is ok..
-	 */
-	retval = -EBADF;
-	out = fdget(out_fd);
-	if (!out.file)
-		goto fput_in;
-	if (!(out.file->f_mode & FMODE_WRITE))
-		goto fput_out;
-	retval = -EINVAL;
-	in_inode = file_inode(in.file);
-	out_inode = file_inode(out.file);
-	out_pos = out.file->f_pos;
-	retval = rw_verify_area(WRITE, out.file, &out_pos, count);
-	if (retval < 0)
-		goto fput_out;
-	count = retval;
-
-	if (!max)
-		max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
-
-	if (unlikely(pos + count > max)) {
-		retval = -EOVERFLOW;
-		if (pos >= max)
-			goto fput_out;
-		count = max - pos;
-	}
-
-	fl = 0;
-#if 0
-	/*
-	 * We need to debate whether we can enable this or not. The
-	 * man page documents EAGAIN return for the output at least,
-	 * and the application is arguably buggy if it doesn't expect
-	 * EAGAIN on a non-blocking file descriptor.
-	 */
-	if (in.file->f_flags & O_NONBLOCK)
-		fl = SPLICE_F_NONBLOCK;
-#endif
-	file_start_write(out.file);
-	retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
-	file_end_write(out.file);
-
-	if (retval > 0) {
-		add_rchar(current, retval);
-		add_wchar(current, retval);
-		fsnotify_access(in.file);
-		fsnotify_modify(out.file);
-		out.file->f_pos = out_pos;
-		if (ppos)
-			*ppos = pos;
-		else
-			in.file->f_pos = pos;
-	}
-
-	inc_syscr(current);
-	inc_syscw(current);
-	if (pos > max)
-		retval = -EOVERFLOW;
-
-fput_out:
-	fdput(out);
-fput_in:
-	fdput(in);
-out:
-	return retval;
-}
-
-SYSCALL_DEFINE4(sendfile, int, out_fd, int, in_fd, off_t __user *, offset, size_t, count)
-{
-	loff_t pos;
-	off_t off;
-	ssize_t ret;
-
-	if (offset) {
-		if (unlikely(get_user(off, offset)))
-			return -EFAULT;
-		pos = off;
-		ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS);
-		if (unlikely(put_user(pos, offset)))
-			return -EFAULT;
-		return ret;
-	}
-
-	return do_sendfile(out_fd, in_fd, NULL, count, 0);
-}
-
-SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, loff_t __user *, offset, size_t, count)
-{
-	loff_t pos;
-	ssize_t ret;
-
-	if (offset) {
-		if (unlikely(copy_from_user(&pos, offset, sizeof(loff_t))))
-			return -EFAULT;
-		ret = do_sendfile(out_fd, in_fd, &pos, count, 0);
-		if (unlikely(put_user(pos, offset)))
-			return -EFAULT;
-		return ret;
-	}
-
-	return do_sendfile(out_fd, in_fd, NULL, count, 0);
-}
-
-#ifdef CONFIG_COMPAT
-COMPAT_SYSCALL_DEFINE4(sendfile, int, out_fd, int, in_fd,
-		compat_off_t __user *, offset, compat_size_t, count)
-{
-	loff_t pos;
-	off_t off;
-	ssize_t ret;
-
-	if (offset) {
-		if (unlikely(get_user(off, offset)))
-			return -EFAULT;
-		pos = off;
-		ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS);
-		if (unlikely(put_user(pos, offset)))
-			return -EFAULT;
-		return ret;
-	}
-
-	return do_sendfile(out_fd, in_fd, NULL, count, 0);
-}
-
-COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
-		compat_loff_t __user *, offset, compat_size_t, count)
-{
-	loff_t pos;
-	ssize_t ret;
-
-	if (offset) {
-		if (unlikely(copy_from_user(&pos, offset, sizeof(loff_t))))
-			return -EFAULT;
-		ret = do_sendfile(out_fd, in_fd, &pos, count, 0);
-		if (unlikely(put_user(pos, offset)))
-			return -EFAULT;
-		return ret;
-	}
-
-	return do_sendfile(out_fd, in_fd, NULL, count, 0);
-}
-#endif
diff --git a/fs/splice.c b/fs/splice.c
index f5cb9ba..c1a2861 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -28,6 +28,7 @@
 #include <linux/export.h>
 #include <linux/syscalls.h>
 #include <linux/uio.h>
+#include <linux/fsnotify.h>
 #include <linux/security.h>
 #include <linux/gfp.h>
 #include <linux/socket.h>
@@ -2039,3 +2040,180 @@ SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
 
 	return error;
 }
+
+static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
+			   size_t count, loff_t max)
+{
+	struct fd in, out;
+	struct inode *in_inode, *out_inode;
+	loff_t pos;
+	loff_t out_pos;
+	ssize_t retval;
+	int fl;
+
+	/*
+	 * Get input file, and verify that it is ok..
+	 */
+	retval = -EBADF;
+	in = fdget(in_fd);
+	if (!in.file)
+		goto out;
+	if (!(in.file->f_mode & FMODE_READ))
+		goto fput_in;
+	retval = -ESPIPE;
+	if (!ppos) {
+		pos = in.file->f_pos;
+	} else {
+		pos = *ppos;
+		if (!(in.file->f_mode & FMODE_PREAD))
+			goto fput_in;
+	}
+	retval = rw_verify_area(READ, in.file, &pos, count);
+	if (retval < 0)
+		goto fput_in;
+	count = retval;
+
+	/*
+	 * Get output file, and verify that it is ok..
+	 */
+	retval = -EBADF;
+	out = fdget(out_fd);
+	if (!out.file)
+		goto fput_in;
+	if (!(out.file->f_mode & FMODE_WRITE))
+		goto fput_out;
+	retval = -EINVAL;
+	in_inode = file_inode(in.file);
+	out_inode = file_inode(out.file);
+	out_pos = out.file->f_pos;
+	retval = rw_verify_area(WRITE, out.file, &out_pos, count);
+	if (retval < 0)
+		goto fput_out;
+	count = retval;
+
+	if (!max)
+		max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
+
+	if (unlikely(pos + count > max)) {
+		retval = -EOVERFLOW;
+		if (pos >= max)
+			goto fput_out;
+		count = max - pos;
+	}
+
+	fl = 0;
+#if 0
+	/*
+	 * We need to debate whether we can enable this or not. The
+	 * man page documents EAGAIN return for the output at least,
+	 * and the application is arguably buggy if it doesn't expect
+	 * EAGAIN on a non-blocking file descriptor.
+	 */
+	if (in.file->f_flags & O_NONBLOCK)
+		fl = SPLICE_F_NONBLOCK;
+#endif
+	file_start_write(out.file);
+	retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
+	file_end_write(out.file);
+
+	if (retval > 0) {
+		add_rchar(current, retval);
+		add_wchar(current, retval);
+		fsnotify_access(in.file);
+		fsnotify_modify(out.file);
+		out.file->f_pos = out_pos;
+		if (ppos)
+			*ppos = pos;
+		else
+			in.file->f_pos = pos;
+	}
+
+	inc_syscr(current);
+	inc_syscw(current);
+	if (pos > max)
+		retval = -EOVERFLOW;
+
+fput_out:
+	fdput(out);
+fput_in:
+	fdput(in);
+out:
+	return retval;
+}
+
+SYSCALL_DEFINE4(sendfile, int, out_fd, int, in_fd, off_t __user *, offset, size_t, count)
+{
+	loff_t pos;
+	off_t off;
+	ssize_t ret;
+
+	if (offset) {
+		if (unlikely(get_user(off, offset)))
+			return -EFAULT;
+		pos = off;
+		ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS);
+		if (unlikely(put_user(pos, offset)))
+			return -EFAULT;
+		return ret;
+	}
+
+	return do_sendfile(out_fd, in_fd, NULL, count, 0);
+}
+
+SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, loff_t __user *, offset, size_t, count)
+{
+	loff_t pos;
+	ssize_t ret;
+
+	if (offset) {
+		if (unlikely(copy_from_user(&pos, offset, sizeof(loff_t))))
+			return -EFAULT;
+		ret = do_sendfile(out_fd, in_fd, &pos, count, 0);
+		if (unlikely(put_user(pos, offset)))
+			return -EFAULT;
+		return ret;
+	}
+
+	return do_sendfile(out_fd, in_fd, NULL, count, 0);
+}
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE4(sendfile, int, out_fd, int, in_fd,
+		compat_off_t __user *, offset, compat_size_t, count)
+{
+	loff_t pos;
+	off_t off;
+	ssize_t ret;
+
+	if (offset) {
+		if (unlikely(get_user(off, offset)))
+			return -EFAULT;
+		pos = off;
+		ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS);
+		if (unlikely(put_user(pos, offset)))
+			return -EFAULT;
+		return ret;
+	}
+
+	return do_sendfile(out_fd, in_fd, NULL, count, 0);
+}
+
+COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
+		compat_loff_t __user *, offset, compat_size_t, count)
+{
+	loff_t pos;
+	ssize_t ret;
+
+	if (offset) {
+		if (unlikely(copy_from_user(&pos, offset, sizeof(loff_t))))
+			return -EFAULT;
+		ret = do_sendfile(out_fd, in_fd, &pos, count, 0);
+		if (unlikely(put_user(pos, offset)))
+			return -EFAULT;
+		return ret;
+	}
+
+	return do_sendfile(out_fd, in_fd, NULL, count, 0);
+}
+#endif
+
-- 
2.1.0

^ permalink raw reply related

* [PATCH v4 0/7] kernel tinification: optionally compile out splice family of syscalls (splice, vmsplice, tee and sendfile)
From: Pieter Smith @ 2014-11-24 23:00 UTC (permalink / raw)
  To: pieter
  Cc: Josh Triplett, Alexander Duyck, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Bertrand Jacquin,
	Catalina Mocanu, Daniel Borkmann, David S. Miller, Eric Dumazet,
	Eric W. Biederman, Fabian Frederick,
	open list:FUSE: FILESYSTEM..., Geert Uytterhoeven, Hugh Dickins,
	Iulia Manda, Jan Beulich, J. Bruce Fields, Jeff Layton,
	open list:ABI/API, linux-fsdevel, linux-kernel

REPO: https://github.com/smipi1/linux-tinification.git

BRANCH: tiny/config-syscall-splice

BACKGROUND: This patch-set forms part of the Linux Kernel Tinification effort (
  https://tiny.wiki.kernel.org/).

GOAL: Support compiling out the splice family of syscalls (splice, vmsplice,
  tee and sendfile) along with all supporting infrastructure if not needed.
  Many embedded systems will not need the splice-family syscalls. Omitting them
  saves space.

HISTORY:
  PATCH v4:
    - Drops __splice_p()
    - Let nfsd fall back to non-splice support when splice is compiled out
    - Style fixes
  
  PATCH v3:
    - Fixup commit logs so that they are consistent with patch strategy
    - Style fixes
  
  PATCH v2:
    - Avoid the ifdef mess introduced in PATCH v1 by mocking out exported splice
      functions.

STRATEGY:
a. With the goal of eventually compiling out fs/splice.c, several functions
   that are only used in support of the the splice family of syscalls are moved
   into fs/splice.c from fs/read_write.c. The kernel_write function that is not
   used to support the splice syscalls is moved to fs/read_write.c.

b. Introduce an EXPERT kernel configuration option; CONFIG_SYSCALL_SPLICE; to
   compile out the splice family of syscalls. This removes all userspace uses
   of the splice infrastructure.

c. Splice exports an operations struct, nosteal_pipe_buf_ops. Eliminate the 
   use of this struct when CONFIG_SYSCALL_SPLICE is undefined, so that splice
   can later be compiled out.

d. Let nfsd fall back to non-splice support when splice is compiled out.

e. Compile out fs/splice.c. Functions exported by fs/splice are mocked out with
   failing static inlines. This is done so as to all but eliminate the
   maintenance burden on file-system drivers.

RESULTS: A tinyconfig bloat-o-meter score for the entire patch-set:

add/remove: 0/41 grow/shrink: 5/7 up/down: 23/-8422 (-8399)
function                                     old     new   delta
sys_pwritev                                  115     122      +7
sys_preadv                                   115     122      +7
fdput_pos                                     29      36      +7
sys_pwrite64                                 115     116      +1
sys_pread64                                  115     116      +1
pipe_to_null                                   4       -      -4
generic_pipe_buf_nosteal                       6       -      -6
spd_release_page                              10       -     -10
fdput                                         11       -     -11
PageUptodate                                  22      11     -11
lock_page                                     36      24     -12
signal_pending                                39      26     -13
fdget                                         56      42     -14
page_cache_pipe_buf_release                   16       -     -16
user_page_pipe_buf_ops                        20       -     -20
splice_write_null                             24       4     -20
page_cache_pipe_buf_ops                       20       -     -20
nosteal_pipe_buf_ops                          20       -     -20
default_pipe_buf_ops                          20       -     -20
generic_splice_sendpage                       24       -     -24
user_page_pipe_buf_steal                      25       -     -25
splice_shrink_spd                             27       -     -27
pipe_to_user                                  43       -     -43
direct_splice_actor                           47       -     -47
default_file_splice_write                     49       -     -49
wakeup_pipe_writers                           54       -     -54
wakeup_pipe_readers                           54       -     -54
write_pipe_buf                                71       -     -71
page_cache_pipe_buf_confirm                   80       -     -80
splice_grow_spd                               87       -     -87
do_splice_to                                  87       -     -87
ipipe_prep.part                               92       -     -92
splice_from_pipe                              93       -     -93
splice_from_pipe_next                        107       -    -107
pipe_to_sendpage                             109       -    -109
page_cache_pipe_buf_steal                    114       -    -114
opipe_prep.part                              119       -    -119
sys_sendfile                                 122       -    -122
generic_file_splice_read                     131       8    -123
sys_sendfile64                               126       -    -126
sys_vmsplice                                 137       -    -137
do_splice_direct                             148       -    -148
vmsplice_to_user                             205       -    -205
__splice_from_pipe                           246       -    -246
splice_direct_to_actor                       348       -    -348
splice_to_pipe                               371       -    -371
do_sendfile                                  492       -    -492
sys_tee                                      497       -    -497
vmsplice_to_pipe                             558       -    -558
default_file_splice_read                     688       -    -688
iter_file_splice_write                       702       4    -698
sys_splice                                  1075       -   -1075
__generic_file_splice_read                  1109       -   -1109

Pieter Smith (7):
  fs: move sendfile syscall into fs/splice
  fs: moved kernel_write to fs/read_write
  fs/splice: support compiling out splice-family syscalls
  fs/fuse: support compiling out splice
  fs/nfsd: support compiling out splice
  net/core: support compiling out splice
  fs/splice: full support for compiling out splice

 fs/Makefile            |   3 +-
 fs/fuse/dev.c          |   9 ++-
 fs/read_write.c        | 181 +++------------------------------------------
 fs/splice.c            | 194 +++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/fs.h     |  26 +++++++
 include/linux/skbuff.h |  10 +++
 include/linux/splice.h |  42 +++++++++++
 init/Kconfig           |  10 +++
 kernel/sys_ni.c        |   8 ++
 net/core/skbuff.c      |  11 ++-
 net/sunrpc/svc.c       |   2 +-
 11 files changed, 302 insertions(+), 194 deletions(-)

-- 
2.1.0

^ permalink raw reply

* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
From: Andi Kleen @ 2014-11-24 22:43 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, corbet, mingo, hpa, peterz, riel, akpm, rientjes, mgorman,
	liwanp, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api
In-Reply-To: <1416862595-24513-1-git-send-email-khalid.aziz@oracle.com>

> +1. Location of shared flag can be set using prctl() only once. To
> +   write a new memory address, the previous memory address must be
> +   cleared first by writing NULL. Each new memory address requires
> +   validation in the kernel and update of pointers. Changing this
> +   address too many times creates too much overhead.

Can you explain this more? Doesn't make any sense to me.
The validation is just access_ok() which is only a few instructions?

Also I would drop the config symbol. Linux normally doesn't
do CONFIG for things like that.

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..7f0d843 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1671,6 +1671,11 @@ long do_fork(unsigned long clone_flags,
>  			init_completion(&vfork);
>  			get_task_struct(p);
>  		}
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +		p->sched_preempt_delay.delay_req = NULL;
> +		p->sched_preempt_delay.delay_granted = 0;
> +		p->sched_preempt_delay.yield_penalty = 0;
> +#endif

FWIW this would lead to every new thread having to reexecute
this. No good way around it, but it may eventually make
thread spawns more expensive if it was widely used.

>  
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	/*
> +	 * Clear the penalty flag for current task to reward it for
> +	 * palying by the rules
> +	 */
> +	current->sched_preempt_delay.yield_penalty = 0;
> +#endif

Doesn't that need to be quantified? After all they may yield
only near the end of their time slice.

> +	}
> +
> +	/*
> +	 * Get the value of preemption delay request flag from userspace.
> +	 * Task had already passed us the address where the flag is stored
> +	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
> +	 * futex, leverage the futex code here to read the flag. If there

I don't think any of the calls below are futex code.

> +	case PR_GET_PREEMPT_DELAY:
> +		error = put_user(
> +			(unsigned long)current->sched_preempt_delay.delay_req,
> +					(unsigned long __user *)arg2);
> +		break;
> +#endif

Unnecessary cast.

> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  #endif
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	{
> +		.procname	= "preempt_delay_available",
> +		.data		= &sysctl_preempt_delay_available,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0600,

Better 0644, so users can know if they can use it.

Rest looks reasonable to me.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply

* Re: [fuse-devel] [PATCH 4/6] fs/fuse: support compiling out splice
From: Pieter Smith @ 2014-11-24 21:49 UTC (permalink / raw)
  To: Greg KH
  Cc: josh-iaAMLnmF4UmaiuxdJuQwMA, Richard Weinberger,
	Michael S. Tsirkin, Bertrand Jacquin, Oleg Nesterov,
	J. Bruce Fields, Eric Dumazet, 蔡正龙,
	Jeff Layton, Tom Herbert, Alexei Starovoitov, Miklos Szeredi,
	Peter Foley, Hugh Dickins, Xiao Guangrong, Geert Uytterhoeven,
	Mel Gorman, Matt Turner, Paul E. McKenney, Alexander Duyck,
	open list:FUSE: FILESYSTEM...
In-Reply-To: <20141124202214.GA11362-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On Mon, Nov 24, 2014 at 12:22:14PM -0800, Greg KH wrote:
> On Mon, Nov 24, 2014 at 12:14:50PM -0800, josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org wrote:
> > > I would, again, argue that stuff like __splice_p() not be implemented at
> > > all please.  It will only cause a huge proliferation of stuff like this
> > > that will not make any sense, and only cause a trivial, if any, amount
> > > of code savings.
> > > 
> > > I thought you were going to not do this type of thing until you got the
> > > gcc optimizer working for function callbacks.
> > 
> > Compared to the previous patchset, there are now only two instances of
> > ifdefs outside of the splice code for this, and this is one of them.  In
> > this case, the issue is no longer about making the code for this
> > splice_read function disappear, but rather to eliminate a reference to a
> > bit of splice functionality (used *inside* the FUSE splice code) that
> > will not work without SPLICE_SYSCALL.
> > 
> > Would you prefer to see this specific case handled via an #ifdef in
> > fs/fuse/dev.c rather than introducing a __splice_p that people might be
> > inclined to propagate?  That'd be fine; the code could simply wrap
> > fuse_dev_splice_read in an #ifdef and have the #else define a NULL
> > fuse_dev_splice_read.
> 
> Yes, I would prefer that, but I'm not the fuse maintainer.
> 
> thanks,
> 
> greg k-h

Okay. I'll do my part to prevent the type of proliferation guaranteed to rate
high on the respected K-H icky-scale. __splice_p() goes the way of the dodo
in favor of the solution presented by Josh.

I pray that the Gods of fuse maintenance will look favorable upon the result.

^ permalink raw reply

* Re: kdbus: add documentation
From: Andy Lutomirski @ 2014-11-24 20:57 UTC (permalink / raw)
  To: David Herrmann
  Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Linux API,
	linux-kernel@vger.kernel.org, Daniel Mack, Djalal Harouni
In-Reply-To: <CANq1E4QA53-L8vPOWJfjhzXaYVQnNvgvR6AcYev0EaP6SfOjQQ@mail.gmail.com>

On Mon, Nov 24, 2014 at 12:16 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
> Hi Andy!
>
> On Fri, Nov 21, 2014 at 6:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Nov 20, 2014 at 9:02 PM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>>> From: Daniel Mack <daniel@zonque.org>
>>>
>>> kdbus is a system for low-latency, low-overhead, easy to use
>>> interprocess communication (IPC).
>>>
>>> The interface to all functions in this driver is implemented through
>>> ioctls on files exposed through the mount point of a kdbusfs.  This
>>> patch adds detailed documentation about the kernel level API design.
>>>
>>> Signed-off-by: Daniel Mack <daniel@zonque.org>
>>> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
>>> Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
>>> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> ---
>>
>>> +  Pool:
>>> +    Each connection allocates a piece of shmem-backed memory that is used
>>> +    to receive messages and answers to ioctl command from the kernel. It is
>>> +    never used to send anything to the kernel. In order to access that memory,
>>> +    userspace must mmap() it into its task.
>>> +    See section 12 for more details.
>>
>> At the risk of opening a can of worms, wouldn't this be much more
>> useful if you could share a pool between multiple connections?
>
> Within a process it could theoretically be possible to share the same
> memory pool between multiple connections made by the process. However,
> note that normally a process only has a single connection to the bus
> open (possibly two, if it opens a connection to both the system and
> the user bus). Now, sharing the receiver buffer could certainly be
> considered an optimization, but it would have no effect on
> "usefulness", though, as just allocating space from a single shared
> per-process receiver won't give you any new possibilities...
>
> We have thought about this, but decided to delay it for now. Shared
> pools can easily be added as an extension later on.
>
> [snip]
>>> +KDBUS_CMD_BUS_MAKE, and KDBUS_CMD_ENDPOINT_MAKE take a
>>> +struct kdbus_cmd_make argument.
>>> +
>>> +struct kdbus_cmd_make {
>>> +  __u64 size;
>>> +    The overall size of the struct, including its items.
>>> +
>>> +  __u64 flags;
>>> +    The flags for creation.
>>> +
>>> +    KDBUS_MAKE_ACCESS_GROUP
>>> +      Make the device file group-accessible
>>> +
>>> +    KDBUS_MAKE_ACCESS_WORLD
>>> +      Make the device file world-accessible
>>
>> This thing is a file.  What's wrong with using a normal POSIX mode?
>> (And what to the read, write, and exec modes do?)
>
> Domains and buses are directories, endpoints are files. Domains also
> create control-files implicitly.
>
> For kdbus clients there is just access or no access, but not
> distinction between read, write and execute access. Due to that we
> just break this down to per-group and world access bits, since doing
> more is pointless, and we shouldn't allow shoehorning more stuff into
> the access mode.
>
> [snip]
>>> +      KDBUS_ITEM_CREDS
>>> +      KDBUS_ITEM_SECLABEL
>>> +        Privileged bus users may submit these types in order to create
>>> +        connections with faked credentials. The only real use case for this
>>> +        is a proxy service which acts on behalf of some other tasks. For a
>>> +        connection that runs in that mode, the message's metadata items will
>>> +        be limited to what's specified here. See section 13 for more
>>> +        information.
>>
>> This is still confusing.  There are multiple places in which metadata
>> is attached.  Which does this apply to?  And why are only creds and
>> seclabel listed?
>
> Yes, and there are multiple places where metadata is *gathered*. This
> ioctl creates connections, so only the items that are actually
> *gathered* by that ioctl are documented here. These items are not part
> of any messages, but are used as identification of the connection
> owner (and in this particular case, to allow privileged proxies to
> overwrite the items so they can properly proxy a legacy-dbus peer
> connection).

But don't proxies need to override the per-message metadata, too?
This is why I'm confused (and my confusion about what's happening goes
down into the code, too).  IMO it would be great if all the variables
were named things like message_metadata, conn_metadata, bus_metadata,
etc.

>
> [snip]
>>> +6.5 Getting information about a connection's bus creator
>>> +--------------------------------------------------------
>>> +
>>> +The KDBUS_CMD_BUS_CREATOR_INFO ioctl takes the same struct as
>>> +KDBUS_CMD_CONN_INFO but is used to retrieve information about the creator of
>>> +the bus the connection is attached to. The metadata returned by this call is
>>> +collected during the creation of the bus and is never altered afterwards, so
>>> +it provides pristine information on the task that created the bus, at the
>>> +moment when it did so.
>>
>> What's this for?  I understand the need for the creator of busses to
>> be authenticated, but doing it like this mean that anyone who will
>> *fail* authentication can DoS the authentic creator.
>
> This returns information on a bus owner, to determine whether a
> connection is connected to a system, user or session bus. Note that
> the bus-creator itself is not a valid peer on the bus, so you cannot
> send messages to them. Which kind of DoS do you have in mind?

I assume that the logic is something like:

connect to bus
request bus metadata
if (bus metadata matches expectations) {
  great, trust the bus!
} else {
  oh crap!
}

If I'm understanding it right, then user code only really has two
outcomes: the good case and the "oh crap!" case.  The problem is that
"oh crap!" isn't a clean failure -- if it happens, then the
application has just been DoSed, because in that case, one of two
things happened:

1. Some policy mismatch means that the legitimate bus owner did create
the bus, but the user application is confused.  This will result in
difficult-to-diagnose failures.

2. A malicious or confused program created the bus.  This is a DoS --
even the legitimate bus creator can't actually create the bus now.

So I think that the policy should be applied at the time that the bus
name is claimed, not at the time that someone else tries to use the
bus.  IOW, the way that you verify you're talking to the system bus
should be by checking that the bus is called "system", not by checking
that UID 0 created the bus.

>
>>> +
>>> +7.3 Passing of Payload Data
>>> +---------------------------
>>> +
>>> +When connecting to the bus, receivers request a memory pool of a given size,
>>> +large enough to carry all backlog of data enqueued for the connection. The
>>> +pool is internally backed by a shared memory file which can be mmap()ed by
>>> +the receiver.
>>> +
>>> +KDBUS_MSG_PAYLOAD_VEC:
>>> +  Messages are directly copied by the sending process into the receiver's pool,
>>> +  that way two peers can exchange data by effectively doing a single-copy from
>>> +  one process to another, the kernel will not buffer the data anywhere else.
>>> +
>>> +KDBUS_MSG_PAYLOAD_MEMFD:
>>> +  Messages can reference memfd files which contain the data.
>>> +  memfd files are tmpfs-backed files that allow sealing of the content of the
>>> +  file, which prevents all writable access to the file content.
>>> +  Only sealed memfd files are accepted as payload data, which enforces
>>> +  reliable passing of data; the receiver can assume that neither the sender nor
>>> +  anyone else can alter the content after the message is sent.
>>
>> This should specify *which* seals are checked.
>
> True. Will be added to the documentation. For the record, it's the
> full set of seals right now, so
> F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE|F_SEAL_SEAL.

Makes sense.

>
>>> +
>>> +Apart from the sender filling-in the content into memfd files, the data will
>>> +be passed as zero-copy from one process to another, read-only, shared between
>>> +the peers.
>>> +
>>> +
>>> +7.4 Receiving messages
>>> +----------------------
>>> +
>>> +Messages are received by the client with the KDBUS_CMD_MSG_RECV ioctl. The
>>> +endpoint file of the bus supports poll() to wake up the receiving process when
>>> +new messages are queued up to be received.
>>> +
>>> +With the KDBUS_CMD_MSG_RECV ioctl, a struct kdbus_cmd_recv is used.
>>> +
>>> +struct kdbus_cmd_recv {
>>> +  __u64 flags;
>>> +    Flags to control the receive command.
>>> +
>>> +    KDBUS_RECV_PEEK
>>> +      Just return the location of the next message. Do not install file
>>> +      descriptors or anything else. This is usually used to determine the
>>> +      sender of the next queued message.
>>> +
>>> +    KDBUS_RECV_DROP
>>> +      Drop the next message without doing anything else with it, and free the
>>> +      pool slice. This a short-cut for KDBUS_RECV_PEEK and KDBUS_CMD_FREE.
>>> +
>>> +    KDBUS_RECV_USE_PRIORITY
>>> +      Use the priority field (see below).
>>> +
>>> +  __u64 kernel_flags;
>>> +    Valid flags for this command, returned by the kernel upon each call.
>>> +
>>> +  __s64 priority;
>>> +      With KDBUS_RECV_USE_PRIORITY set in flags, receive the next message in
>>> +      the queue with at least the given priority. If no such message is waiting
>>> +      in the queue, -ENOMSG is returned.
>>> +
>>> +  __u64 offset;
>>> +      Upon return of the ioctl, this field contains the offset in the
>>> +      receiver's memory pool.
>>> +};
>>> +
>>> +Unless KDBUS_RECV_DROP was passed, and given that the ioctl succeeded, the
>>> +offset field contains the location of the new message inside the receiver's
>>> +pool. The message is stored as struct kdbus_msg at this offset, and can be
>>> +interpreted with the semantics described above.
>>
>> I'm confused here.  Is sent data written to the pool when send is
>> called or when recv is called?
>
> When send is called.
>
>> If the former, what prevents DoS, especially DoS due to sending too many fds?
>
> FDs are installed into the task only when the receiver issues
> CMD_RECV, so a task won't explode unless it issues that ioctl. Plus,
> receiving FDs is already a per-connection opt-in. Furthermore, you can
> CMD_PEEK messages which allows you to look at the full message
> _without_ installing FDs. If you don't want the FDs, you can CMD_DROP
> the message.
>
>> If the latter, where is the data buffered in the mean time?
>>
>>> +
>>> +Also, if the connection allowed for file descriptor to be passed
>>> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
>>> +installed into the receiving process after the KDBUS_CMD_MSG_RECV ioctl
>>> +returns. The receiving task is obliged to close all of them appropriately.
>>
>> This makes it sound like fds are installed at receive time.  What
>> prevents resource exhaustion due to having excessive numbers of fds in
>> transit (that are presumably not accounted to anyone)?
>
> We have a per-user message accounting for undelivered messages, as
> well as a maximum number of pending messages per connection on the
> receiving end. These limits are accounted on a "user<->user" basis, so
> the limit of a user A will not affect two other users (B and C)
> talking.

But you can shove tons of fds in a message, and you can have lots of
messages, and some of the fds can be fds of unix sockets that have fds
queued up in them, and one of those fds could be the fd to the kdbus
connection that sent the fd...

This is not advice as to what to do about it, but I think that it will
be a problem at some point.


>>> +11. Policy
>>> +===============================================================================
>>> +
>>> +A policy databases restrict the possibilities of connections to own, see and
>>> +talk to well-known names. It can be associated with a bus (through a policy
>>> +holder connection) or a custom endpoint.
>>
>> ISTM metadata items on bus names should be replaced with policy that
>> applies to the domain as a whole and governs bus creation.
>
> No, well-known names are bound to buses, so a bus is really the right
> place to hold policy about which process is allowed to claim them.
> Every user is allowed to create a bus of its own, there's no policy
> for that, and there shouldn't be.
>
> It has nothing to do with metadata items.

But it does -- the creator of the bus binds metadata to that bus at
creation time.

I think that a better solution would be to have a global policy that
says, for example, "to create the bus called 'system', the creator
must have selinux label xyz" or "to create a user bus called
uid-1000-privileged-ui-bus the creator must have some cgroup" or
whatever.

Although maybe a better solution would leave this in the kernel but
allow any cgroup to create a bus with a same that indicates the
creating cgroup.  Then I could have my desktop shell create a
"/cgroup/path/to/desktop" for per-user privileged things.

>
>>> +A set of policy rules is described by a name and multiple access rules, defined
>>> +by the following struct.
>>> +
>>> +struct kdbus_policy_access {
>>> +  __u64 type;  /* USER, GROUP, WORLD */
>>> +    One of the following.
>>> +
>>> +    KDBUS_POLICY_ACCESS_USER
>>> +      Grant access to a user with the uid stored in the 'id' field.
>>> +
>>> +    KDBUS_POLICY_ACCESS_GROUP
>>> +      Grant access to a user with the gid stored in the 'id' field.
>>> +
>>> +    KDBUS_POLICY_ACCESS_WORLD
>>> +      Grant access to everyone. The 'id' field is ignored.
>>> +
>>> +  __u64 access;        /* OWN, TALK, SEE */
>>> +    The access to grant.
>>> +
>>> +    KDBUS_POLICY_SEE
>>> +      Allow the name to be seen.
>>> +
>>> +    KDBUS_POLICY_TALK
>>> +      Allow the name to be talked to.
>>> +
>>> +    KDBUS_POLICY_OWN
>>> +      Allow the name to be owned.
>>> +
>>> +  __u64 id;
>>> +    For KDBUS_POLICY_ACCESS_USER, stores the uid.
>>> +    For KDBUS_POLICY_ACCESS_GROUP, stores the gid.
>>> +};
>>
>>
>> What happens if there are multiple matches?
>
> We only have _granting_ policy entries. We search through the
> policy-db until we find an entry that grants access. We do _not_ stop
> on the first item that matches.

Yay!  Can you document that more clearly?

>
>>
>>> +
>>> +11.4 TALK access and multiple well-known names per connection
>>> +-------------------------------------------------------------
>>> +
>>> +Note that TALK access is checked against all names of a connection.
>>> +For example, if a connection owns both 'org.foo.bar' and 'org.blah.baz', and
>>> +the policy database allows 'org.blah.baz' to be talked to by WORLD, then this
>>> +permission is also granted to 'org.foo.bar'. That might sound illogical, but
>>> +after all, we allow messages to be directed to either the name or a well-known
>>> +name, and policy is applied to the connection, not the name. In other words,
>>> +the effective TALK policy for a connection is the most permissive of all names
>>> +the connection owns.
>>
>> This does seem illogical.  Does the recipient at least know which
>> well-known name was addressed?
>
> If the sender addressed it to a well-known name, yes. If the sender
> addressed the message to a unique-ID, there will be no such name, of
> course. Still, the policy applies to such transactions either way
> (standard D-Bus behavior).
>
> Note however that dbus1 will not pass along the destination well-known
> name, hence most userspace libraries will ignore this information too,
> even if they run on kdbus which might pass this information around.
> The right way for services that carry multiple service names to
> discern which actual service is being talked to is having separate
> object paths for the different functionality to hide between the
> services.

It seems unfortunate to keep around this really weird behavior for the
benefit of legacy applications.  Could there perhaps be a flag that a
connection can set to indicate that it understands per-destination
access control and therefore wants stricter policy enforcement?

>
>>> +11.5 Implicit policies
>>> +----------------------
>>> +
>>> +Depending on the type of the endpoint, a set of implicit rules might be
>>> +enforced. On default endpoints, the following set is enforced:
>>> +
>>
>> How do these rules interact with installed policy?
>
> As said before, all policy entries _grant_ access. We look through all
> entries until we find one that grants access.
>
>>> +  * Privileged connections always override any installed policy. Those
>>> +    connections could easily install their own policies, so there is no
>>> +    reason to enforce installed policies.
>>> +  * Connections can always talk to connections of the same user. This
>>> +    includes broadcast messages.
>>
>> Why?
>
> All limits on buses are enforced on a user<->user basis. We don't want
> to provide policies that are more fine-grained than our accounting.

This seems completely at odds with all the fine-grained metadata
stuff.  Also, anything that relies on this may get very confused when
the LSM hooks go in, because I'm reasonably sure that the intent is
for them to *not* follow this principle.

>
>> If anyone ever strengthens the concept of identity to include
>> things other than users (hmm -- there are already groups), this could
>> be very limiting.
>
> If user-based accounting is not suitable, you can create custom
> endpoints. Future extensions to that are always welcome. So far, the
> default user-based accounting was enough. And I think it's suitable as
> default.
>
>>> +  * Connections that own names might send broadcast messages to other
>>> +    connections that belong to a different user, but only if that
>>> +    destination connection does not own any name.
>>> +

(Also, what does "might" mean here?)

>>
>> This is weird.  It is also differently illogical than the "illogical"
>> thing above.
>
> Actually it follows the same model described above. If two connections
> are running under the same user then broadcasts are allowed, but if
> they are running under different users *and* if the destination owns a
> well-known name, then broadcasts are subject to TALK policy checks
> since that destination may own a restricted well-known name that is
> not interested in broadcasts. So this implicit policy is just
> fast-path for the common case where the target is subscribed to a
> broadcast and does not own any name.

Huh?

Say I have two users, "Sender" and "Receiver", each with a single
connection.  If Receiver owns no well-known names, then Sender can
send to it.  If Receiver owns one well-known name, then Sender needs
to pass a TALK check on that name.  If Reciever owns two well-known
names, then Sender only needs to pass a TALK check on one of them.

Am I understanding this right?  If I am, then I think this is in the
category of baroque and inconsistent security rules which everyone
will screw up and therefore introduce security vulnerabilities.

Can you really not enforce the much simpler rule that, to send to a
name, you must have permission to send to *that* name?  If legacy
dbus1 receivers register two names and don't validate everything
correctly, then only the legacy receivers have problems.

>>> +
>>> +To access the memory, the caller is expected to mmap() it to its task, like
>>> +this:
>>> +
>>> +  /*
>>> +   * POOL_SIZE has to be a multiple of PAGE_SIZE, and it must match the
>>> +   * value that was previously passed in the .pool_size field of struct
>>> +   * kdbus_cmd_hello.
>>> +   */
>>> +
>>> +  buf = mmap(NULL, POOL_SIZE, PROT_READ, MAP_PRIVATE, conn_fd, 0);
>>> +
>>
>> Will mapping with PROT_WRITE fail?  What about MAP_SHARED?
>
> PROT_WRITE will fail and VM_MAYWRITE is cleared.
>
>> And why are you suggesting MAP_PRIVATE?  That's just strange.
>
> This was a leftover from pre-3.17. memfds require you to use
> MAP_PRIVATE if SEAL_WRITE is set (which is a linux-specific behavior,
> where MAP_SHARED is always accounted as writable mapping as
> mprotect(VM_WRITE) can be called any time). I fixed this up.

Thanks.  (I assume that memfd has nothing to do with this directly,
since these are pools, not memfds.)

>
>>> +
>>> +13. Metadata
>>> +===============================================================================
> [snip]
>>> +13.1 Known item types
>>> +---------------------
>>> +
>>> +The following attach flags are currently supported.
>>> +
>>> +  KDBUS_ATTACH_TIMESTAMP
>>> +    Attaches an item of type KDBUS_ITEM_TIMESTAMP which contains both the
>>> +    monotonic and the realtime timestamp, taken when the message was
>>> +    processed on the kernel side.
>>> +
>>> +  KDBUS_ATTACH_CREDS
>>> +    Attaches an item of type KDBUS_ITEM_CREDS, containing credentials as
>>> +    described in kdbus_creds: the uid, gid, pid, tid and starttime of the task.
>>> +
>>
>> As mentioned last time, please remove or justify starttime.
>
> starttime allows detecting PID overflows. Exposing the process
> starttime is useful to detect when a PID is getting reused.
> Unfortunately, we don't have 64bit pids, so we need the pid+time
> combination to avoid ambiguity.

NAK, I think.

I agree that PID overflow is a real issue and should be addressed
somehow.  But please address it for real instead of adding Yet Another
Hack (tm).  In the mean time, leave that hack out, please.

I would *love* to see PIDs have extra high bits at the end, done in a
way that supports CRIU and that guarantees no reuse unless something
privileged intentionally mis-programs it.  But starttime isn't that
mechanism.

>
>>> +  KDBUS_ATTACH_AUXGROUPS
>>> +    Attaches an item of type KDBUS_ITEM_AUXGROUPS, containing a dynamic
>>> +    number of auxiliary groups the sending task was a member of.
>>> +
>>> +  KDBUS_ATTACH_NAMES
>>> +    Attaches items of type KDBUS_ITEM_OWNED_NAME, one for each name the sending
>>> +    connection currently owns. The name and flags are stored in kdbus_item.name
>>> +    for each of them.
>>> +
>>
>> That's interesting.  What's it for?
>
> It a valuable piece of information for receivers to know which bus
> names a sender has claimed. For instance, we need this information for
> the D-Bus proxy service, because we have to apply D-Bus1 policy in
> that case, and we need to get a list of owned names in a race-free
> manner to check the policy against.

But if you change the rule to the sensible one where you need
permission to TALK to the name that you're talking to, this goes away,
right?

>
>>> +  KDBUS_ATTACH_TID_COMM
>>> +    Attaches an items of type KDBUS_ITEM_TID_COMM, transporting the sending
>>> +    task's 'comm', for the tid.  The string is stored in kdbus_item.str.
>>> +
>>> +  KDBUS_ATTACH_PID_COMM
>>> +    Attaches an items of type KDBUS_ITEM_PID_COMM, transporting the sending
>>> +    task's 'comm', for the pid.  The string is stored in kdbus_item.str.
>>> +
>>> +  KDBUS_ATTACH_EXE
>>> +    Attaches an item of type KDBUS_ITEM_EXE, containing the path to the
>>> +    executable of the sending task, stored in kdbus_item.str.
>>> +
>>> +  KDBUS_ATTACH_CMDLINE
>>> +    Attaches an item of type KDBUS_ITEM_CMDLINE, containing the command line
>>> +    arguments of the sending task, as an array of strings, stored in
>>> +    kdbus_item.str.
>>
>> Please remove these four items.  They are genuinely useless.  Anything
>> that uses them for anything is either buggy or should have asked the
>> sender to put the value in the payload (and immediately wondered why
>> it was doing that).
>
> We use them for logging, debugging and monitoring. With our wireshark
> extension it's pretty useful to know the comm-name of a process when
> monitoring a bus. As we explained last time, this is not about
> security. We're aware that a process can modify them. We use them only
> as additional meta-data for logging and debugging.

Use the PID.  Really.  Your wireshark extention can look this crap up
in /proc and, if it fails due to a race, big frickin' deal.

>
> If we put those items into the payload, we have to transmit this data
> even if the destination process is not interested in this.
> Furthermore, each caller has to run multiple syscalls on each message
> to retrieve those values.
>
> We use these items heavily for filtering and debugging, regardless of
> the payload protocol that is transmitted on the bus.
>
> To give another specific use-case here: dbus supports bus activation,
> where a message sent to a non-running service causes it to be spawned
> implicitly without losing the message. Now, with such a scheme it is
> incredibly useful to be able to log which client caused a service to
> be triggered, hence we want to know the cmdline/exe/comm of the
> client. Not knowing this is a major pita when trying to trace the boot
> process and figuring out why a specific service got activated.

Again, use the PID for tracing, please.

At the very least, make it impossible to specify these fields in the
"must be received" set and rename them to something like
KDBUS_INSTALL_UNRELIABLE_CMDLINE, etc, because they're unreliable

Finally, this stuff should only be readable by privileged users.  And
using the PID accomplishes that.

>
> Also note that since v2 of the patch there's actually a per-sender
> mask for meta-data like this, hence a peer which doesn't want to pass
> its exec/cmdline/comm along can do that. Of course, this will
> seriously hamper debuggability and transparency...

Transparency is a terrible thing here.

How many users put passwords into things on the command line?  Yes,
it's a bad idea (for reasons that are entirely stupid), but now those
passwords get *logged*.

If this is in the kernel, and someone complains that sensitive data is
showing up on ten different logs on their system, they'll *correctly*
blame the kernel.  If you at least use the PID and restrict it to the
logging code, then at least the bug report will go to the logging
daemon, which will be *correctly* accused of doing something daft, and
it can be fixed.

>
>>> +
>>> +  KDBUS_ATTACH_CGROUP
>>> +    Attaches an item of type KDBUS_ITEM_CGROUP with the task's cgroup path.
>>> +
>>> +  KDBUS_ATTACH_CAPS
>>> +    Attaches an item of type KDBUS_ITEM_CAPS, carrying sets of capabilities
>>> +    that should be accessed via kdbus_item.caps.caps. Also, userspace should
>>> +    be written in a way that it takes kdbus_item.caps.last_cap into account,
>>> +    and derive the number of sets and rows from the item size and the reported
>>> +    number of valid capability bits.
>>> +
>>
>> Please remove this, too, or justify its use.
>
> cgroup information tells us which service is doing a bus request. This
> is useful for a variety of things. For example, the bus activation
> logging item above benefits from it. In general, if some message shall
> be logged about any client it is useful to know its service name.
>
> Capabilities are useful to authenticate specific method calls. For
> example, when a client asks systemd to reboot, without this concept we
> can only check for UID==0 to decide whether to allow this. Breaking
> this down to capabilities in a race-free way has the benefit of
> allowing systemd to bind this to CAP_SYS_BOOT instead. There is no
> reason to deny a process with CAP_SYS_BOOT to reboot via bus-APIs, as
> they could just enforce it via syscalls, anyway.

With all due respect, BS.

I admit that there is probably no reason to deny systemd-based reboot
to a CAP_SYS_BOOT-capable process, but there is absolutely no reason
to give processes that are supposed to reboot using systemd the
CAP_SYS_BOOT capability.

In any event, I suspect you'll have a hard time justifying this for
anything other than CAP_SYS_BOOT.  Just because CAP_SYS_ADMIN users
can probably do whatever they want doesn't mean that systemd should
make that a built-in policy.

Also, wtf is the bounding set and such for?  At the very least this
should only be the effective set.

>
> We think it's a useful and reliable authentication method. Why should
> we remove it?

Because the implementation is buggy and therefore it's insecure?
Remember that caps are namespaced in an interesting way.

>
> Anyway, these items are  just optional. The sender can refuse the
> reveal them, and the item is only transmitted if the receiver opted in
> for it, too. So there's no need to drop any item type from the
> protocol.

No.

Because if receivers opt in to most of these, *they're doing it
wrong*, and the kernel shouldn't be in the business of helping them.

>
>>> +  KDBUS_ATTACH_SECLABEL
>>> +    Attaches an item of type KDBUS_ITEM_SECLABEL, which contains the SELinux
>>> +    security label of the sending task. Access via kdbus_item->str.
>>> +
>>
>> This one, too, and please justify why code that uses it will work in
>> containers an on non-selinux systems.
>
> This maps to SCM_PEERSEC on AF_UNIX sockets. This is actually heavily
> used already on dbus1. For example, systemd actually uses this
> information about a bus peer to check whether certain operations are
> allowed, on top of the normal policy. To give an explicit example:
> when starting a service systemd gets the client's peer label from
> AF_UNIX via SCM_PEERSEC, reads the security label of the service file
> in question, and then checks the selinux database (via one of
> libselinux APIs) whether to allow this change.
>
> Note that this is really nothing we came up with, it's code from the
> SELinux folks, it's simple enough, and has been exposed and used that
> way since ages in dbus1. libselinux offers all the right APIs to make
> use of this, and kdbus really needs to provide the same functionality
> as dbus1 in this regard here.
>
> Let met stress that checking the selinux database here is alway *on
> top* of the normal UID/caps based security checks services do. This is
> exactly how selinux enforces checks on files on top of UID/caps
> checks, or on process or anything else that is selinux-managed.

But I thought that the LSM hooks were going to replace all of that.

Can you get a selinux person to confirm that this is actually necessary?

>
>>> +  KDBUS_ATTACH_AUDIT
>>> +    Attaches an item of type KDBUS_ITEM_AUDIT, which contains the audio label
>>> +    of the sending taskj. Access via kdbus_item->str.
>>> +
>>
>> I will NAK the hell out of this until, at the very least, someone
>> documents what this means, how to parse it, what its stability rules
>> are, who is allowed to see the value it contains, why that value will
>> never evolve to become *more* security sensitive than it is now, etc.
>>
>> Audit gets to do crazy sh*t because it's restricted to privileged
>> receivers.  This isn't restricted like that, so it doesn't deserve the
>> same dispensation.  (And, honestly, I'm not sure that audit really
>> deserves its free pass on making sense.)
>
> This was based on a misunderstanding, so I will ignore it here. Lets
> discuss this on the metadata-patch, in case it's still unclear.

Yeah, I think you're right about this.

If I'm understanding this right, then I think it just needs a
documentation change and possibly a renaming of the metadata item
name.

>
>>> +  KDBUS_ATTACH_CONN_DESCRIPTION
>>> +    Attaches an item of type KDBUS_ITEM_CONN_DESCRIPTION that contains the
>>> +    sender connection's current name in kdbus_item.str.
>>> +
>>
>> Which name?  Can't there be several?
>
> No. There is only one connection description string, copied verbatim
> from what the connection supplied during HELLO.
>
> Note that this is really just about debugging, since in some cases
> processes might end up with multiple kdbus fds, and it is useful in
> tools like "busctl" to know which one is which.

Fair enough.

>
>>> +
>>> +13.1 Metadata and namespaces
>>> +----------------------------
>>> +
>>> +Metadata such as PIDs, UIDs or GIDs are automatically translated to the
>>> +namespaces of the domain that is used to send a message over. The namespaces
>>> +of a domain are pinned at creation time, which is when the filesystem has been
>>> +mounted.
>>> +
>>> +Metadata items that cannot be translated are dropped.
>>
>> What if the receiver said that the item was mandatory?
>
> It is still dropped. It's the responsibility of the receiver to reject
> messages that lack required metadata.

Thanks,
Andy

^ permalink raw reply

* [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
From: Khalid Aziz @ 2014-11-24 20:56 UTC (permalink / raw)
  To: tglx, corbet, mingo, hpa, peterz, riel, akpm, rientjes, ak,
	mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck
  Cc: Khalid Aziz, linux-kernel, linux-doc, linux-api

sched/fair: Add advisory flag for borrowing a timeslice

This patch adds a way for a task to request to borrow one timeslice
from future if it is about to be preempted, so it could delay
preemption and complete any critical task it is in the middle of.

This feature helps with performance on databases and has been
used for many years on other OSs by the databases. This feature
helps in situation where a task acquires a lock before performing a
critical operation on the database and happens to get preempted before
it completes its task.  This lock being held causes all other tasks
that also acquire the same lock to perform their critical operation
on the database, to start queueing up and causing large number of
context switches.  This queueing problem can be avoided if the task
that acquires lock first could request scheduler to let it borrow one
timeslice once it enters its critical section and hence allow it to
complete its critical section without causing queueing problem. If
critical section completes before the task is due for preemption,
the task can simply desassert its request. A task sends the
scheduler this request by setting a flag in a memory location it has
shared with the kernel.  Kernel uses bytes in the same memory location
to let the task know when its request for amnesty from preemption
has been granted.

These rules apply to the use of this feature:

- Request to borrow timeslice is not guranteed to be honored.
- If the task is allowed to borrow, kernel will inform the task
  of this. When this happens, task must yield the processor as soon
  as it completes its critical section.
- If the task fails to yield processor after being allowed to
  borrow, it is penalized by forcing it to skip its next time slot
  by the scheduler.
- Task is charged additional time for the borrowed timeslice as
  accumulated run time. This pushes it further down in consideration
  for the next task to run.

This feature was tested with a TPC-C workload. TPC-C workload shows
a 3% improvement in tpcc throughput when using this feature, which
is a significant improvement.

A new sysctl tunable kernel.preempt_delay_available enables this
feature at run time. The kernel boots up with this feature disabled
by default.

Documentation file included in this patch contains details on how to
use this feature, and conditions associated with its use. This patch
also adds a new field in scheduler statistics which keeps track of
how many times a task was granted amnesty from preemption.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---

With this new version of this patch, the kernel will not enable
preemption delay by default. This feature must be turned on by using
sysctl tunable kernel.preempt_delay_available. With this change, there
are now two ways to eliminate the impact of this feature on systems
that do not intend to use it and are sensitive to scheduling delays
that may be caused by the use of this feature. This feature can be
configured out for custom built kernels. For pre-compiled kernels where
this feature may have been configured in, it will stay off until enabled
through sysctl tunable.

Changelog:
v3:
	- Use prctl() syscall to give kernel the location for shared flag 
	  instead of using a proc file.
	- Disabled this feature by default on a newly booted kernel and
	  added a sysctl tunable to enable/disable it at runtime.
v2:
	- Replaced mmap operation with a more memory efficient futex
	  like communication between userspace and kernel
	- Added a flag to let userspace know if it was granted amnesty
	- Added a penalty for tasks failing to yield CPU when they
	  are granted amnesty from pre-emption

v1:
	- Initial RFC patch with mmap for communication between userspace
	  and kernel

 Documentation/scheduler/sched-preempt-delay.txt | 117 ++++++++++++++++++++++++
 arch/x86/Kconfig                                |  12 +++
 include/linux/sched.h                           |  15 +++
 include/linux/sched/sysctl.h                    |   4 +
 include/uapi/linux/prctl.h                      |   3 +
 kernel/fork.c                                   |   5 +
 kernel/sched/core.c                             |   8 ++
 kernel/sched/debug.c                            |   1 +
 kernel/sched/fair.c                             | 115 ++++++++++++++++++++++-
 kernel/sys.c                                    |  43 +++++++++
 kernel/sysctl.c                                 |   9 ++
 11 files changed, 329 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/scheduler/sched-preempt-delay.txt

diff --git a/Documentation/scheduler/sched-preempt-delay.txt b/Documentation/scheduler/sched-preempt-delay.txt
new file mode 100644
index 0000000..07dd699
--- /dev/null
+++ b/Documentation/scheduler/sched-preempt-delay.txt
@@ -0,0 +1,117 @@
+=================================
+What is preemption delay feature?
+=================================
+
+There are times when a userspace task is executing a critical section
+which gates a number of other tasks that want access to the same
+critical section. If the task holding the lock that guards this critical
+section is preempted by the scheduler in the middle of its critical
+section because its timeslice is up, scheduler ends up scheduling other
+threads which immediately try to grab the lock to enter the critical
+section. This only results in lots of context changes are tasks wake up
+and go to sleep immediately again. If on the other hand, the original
+task were allowed to run for an extra timeslice, it could have completed
+executing its critical section allowing other tasks to make progress
+when they get scheduled. Preemption delay feature allows a task to
+request scheduler to grant it one extra timeslice, if possible.
+
+
+==================================
+Using the preemption delay feature
+==================================
+
+This feature is compiled in the kernel by setting
+CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. By default, the
+kernel boots up with this feature disabled. Enable it using sysctl
+tunable kernel.preempt_delay_available. Once this feature is
+enabled, the userspace process communicates with the kernel using a
+4-byte memory location in its address space. This location must be
+aligned to 4-byte boundary. It first gives the kernel address for this
+memory location by making a prctl() system call with PR_SET_PREEMPT_DELAY
+option. This memory location is interpreted as a sequence of 4 bytes:
+
+	byte[0] = flag to request preemption delay
+	byte[1] = flag from kernel indicating preemption delay was granted
+	byte[2] = reserved for future use
+	byte[3] = reserved for future use
+
+Task requests a preemption delay by writing a non-zero value to the
+first byte. Scheduler checks this value before preempting the task.
+Scheduler can choose to grant one and only an additional time slice to
+the task for each delay request but this delay is not guaranteed.
+If scheduler does grant an additional timeslice, it will set the flag
+in second byte. Upon completion of the section of code where the task
+wants preemption delay, task should check the second byte. If the flag
+in second byte is set, it should clear this flag and call sched_yield()
+so as to not hog the processor. If a thread was granted additional
+timeslice and it fails to call sched_yield(), scheduler will penalize
+it by denying its next request for additional timeslice. Following sample
+code illustrates how to use this feature:
+
+int main()
+{
+	unsigned char buf[256];
+	union {
+		struct {
+			unsigned char nopreempt;
+			unsigned char yield;
+			unsigned char rsvd[2];
+		} flags;
+		uint32_t preempt_delay;
+	} delay;
+
+	delay.preempt_delay = 0;
+
+	/* Tell kernel where the flag lives */
+	prctl(PR_SET_PREEMPT_DELAY, &delay);
+
+	while (/* some condition is true */) {
+		/* do some work and get ready to enter critical section */
+		delay.flags.nopreempt = 1;
+		/*
+		 * Obtain lock for critical section
+		 */
+		/*
+		 * critical section
+		 */
+		/*
+		 * Release lock for critical section
+		 */
+		delay.flags.nopreempt = 0;
+		/* Give the CPU up if required */
+		if (delay.flags.yield) {
+			delay.flags.yield = 0;
+			sched_yield();
+		}
+		/* do some more work */
+	}
+	/*
+	 * Tell kernel we are done asking for preemption delay
+	 */
+	prctl(PR_SET_PREEMPT_DELAY, NULL);
+}
+
+
+====================
+Scheduler statistics
+====================
+
+Preemption delay features adds a new field to scheduler statictics -
+nr_preempt_delayed. This is a per thread statistic that tracks the
+number of times a thread was granted amnesty from preemption when it
+requested for one. "cat /proc/<pid>/task/<tid>/sched" will list this
+number along with other scheduler statistics.
+
+
+=====
+Notes
+=====
+
+1. Location of shared flag can be set using prctl() only once. To
+   write a new memory address, the previous memory address must be
+   cleared first by writing NULL. Each new memory address requires
+   validation in the kernel and update of pointers. Changing this
+   address too many times creates too much overhead.
+
+2. If the location of shared flag is not aligned to 4-byte boundary,
+   prctl() will terminate with EFAULT.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ded8a67..3dac536 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -852,6 +852,18 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_PREEMPT_DELAY
+	def_bool n
+	prompt "Scheduler preemption delay support"
+	depends on PROC_FS
+	---help---
+	  Say Y here if you want to be able to delay scheduler preemption
+	  when possible by setting a flag in a memory location after
+	  sharing the address of this location with kernel using
+	  PR_SET_PREEMPT_DELAY prctl() call. See
+	  Documentation/scheduler/sched-preempt-delay.txt for details.
+	  If in doubt, say "N".
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..f5150e4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1116,6 +1116,7 @@ struct sched_statistics {
 	u64			nr_wakeups_affine_attempts;
 	u64			nr_wakeups_passive;
 	u64			nr_wakeups_idle;
+	u64			nr_preempt_delayed;
 };
 #endif
 
@@ -1324,6 +1325,13 @@ struct task_struct {
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	struct preempt_delay {
+		u32 __user *delay_req;		/* delay request flag pointer */
+		unsigned char delay_granted;	/* currently in delay */
+		unsigned char yield_penalty;	/* failure to yield penalty */
+	} sched_preempt_delay;
+#endif
 
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
@@ -2179,6 +2187,13 @@ extern u64 scheduler_tick_max_deferment(void);
 static inline bool sched_can_stop_tick(void) { return false; }
 #endif
 
+#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
+extern void sched_preempt_delay_show(struct seq_file *m,
+					struct task_struct *task);
+extern void sched_preempt_delay_set(struct task_struct *task,
+					unsigned char *val);
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern void sched_autogroup_create_attach(struct task_struct *p);
 extern void sched_autogroup_detach(struct task_struct *p);
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 596a0e0..516f74e 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -107,4 +107,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
 				 void __user *buffer, size_t *lenp,
 				 loff_t *ppos);
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+extern int sysctl_preempt_delay_available;
+#endif
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75..ecfd2cd 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,7 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+#define PR_SET_PREEMPT_DELAY	43
+#define PR_GET_PREEMPT_DELAY	44
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..7f0d843 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1671,6 +1671,11 @@ long do_fork(unsigned long clone_flags,
 			init_completion(&vfork);
 			get_task_struct(p);
 		}
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+		p->sched_preempt_delay.delay_req = NULL;
+		p->sched_preempt_delay.delay_granted = 0;
+		p->sched_preempt_delay.yield_penalty = 0;
+#endif
 
 		wake_up_new_task(p);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c..38cb515 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4230,6 +4230,14 @@ SYSCALL_DEFINE0(sched_yield)
 {
 	struct rq *rq = this_rq_lock();
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	/*
+	 * Clear the penalty flag for current task to reward it for
+	 * palying by the rules
+	 */
+	current->sched_preempt_delay.yield_penalty = 0;
+#endif
+
 	schedstat_inc(rq, yld_count);
 	current->sched_class->yield_task(rq);
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ce33780..618d2ac 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -597,6 +597,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	P(se.statistics.nr_preempt_delayed);
 
 	{
 		u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 34baa60..58f6e1b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -428,6 +428,115 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+/*
+ * delay_resched_rq(): Check if the task about to be preempted has
+ *	requested an additional time slice. If it has, grant it additional
+ *	timeslice once.
+ */
+static void
+delay_resched_rq(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_entity *se;
+	int cpu = task_cpu(curr);
+	u32 __user *delay_req;
+	unsigned int delay_req_flag;
+	unsigned char *delay_flag;
+
+	/*
+	 * Check if task is using pre-emption delay feature. If address
+	 * for preemption delay request flag is not set, this task is
+	 * not using preemption delay feature, we can reschedule without
+	 * any delay
+	 */
+	delay_req = curr->sched_preempt_delay.delay_req;
+
+	if ((delay_req == NULL) || (cpu != smp_processor_id()))
+		goto resched_now;
+
+	/*
+	 * Pre-emption delay will  be granted only once. If this task
+	 * has already been granted delay, rechedule now
+	 */
+	if (curr->sched_preempt_delay.delay_granted) {
+		curr->sched_preempt_delay.delay_granted = 0;
+		goto resched_now;
+	}
+
+	/*
+	 * Get the value of preemption delay request flag from userspace.
+	 * Task had already passed us the address where the flag is stored
+	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
+	 * futex, leverage the futex code here to read the flag. If there
+	 * is a page fault accessing this flag in userspace, that means
+	 * userspace has not touched this flag recently and we can
+	 * assume no preemption delay is needed.
+	 *
+	 * If task is not requesting additional timeslice, resched now
+	 */
+	if (delay_req) {
+		int ret;
+
+		pagefault_disable();
+		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
+				sizeof(u32));
+		pagefault_enable();
+		delay_flag = &delay_req_flag;
+		if (ret || !delay_flag[0])
+			goto resched_now;
+	} else {
+		goto resched_now;
+	}
+
+	/*
+	 * Current thread has requested preemption delay and has not
+	 * been granted an extension yet. If this thread failed to yield
+	 * processor after being granted amnesty last time, penalize it
+	 * by not granting this delay request, otherwise give it an extra
+	 * timeslice.
+	 */
+	if (curr->sched_preempt_delay.yield_penalty) {
+		curr->sched_preempt_delay.yield_penalty = 0;
+		goto resched_now;
+	}
+
+	se = &curr->se;
+	curr->sched_preempt_delay.delay_granted = 1;
+
+	/*
+	 * Set the penalty flag for failing to yield the processor after
+	 * being granted immunity. This flag will be cleared in
+	 * sched_yield() if the thread indeed calls sched_yield
+	 */
+	curr->sched_preempt_delay.yield_penalty = 1;
+
+	/*
+	 * Let the thread know it got amnesty and it should call
+	 * sched_yield() when it is done to avoid penalty next time
+	 * it wants amnesty. We need to write to userspace location.
+	 * Since we just read from this location, chances are extremley
+	 * low we might page fault. If we do page fault, we will ignore
+	 * it and accept the cost of failed write in form of unnecessary
+	 * penalty for userspace task for not yielding processor.
+	 * This is a highly unlikely scenario.
+	 */
+	delay_flag[0] = 0;
+	delay_flag[1] = 1;
+	pagefault_disable();
+	__copy_to_user_inatomic(delay_req, &delay_req_flag, sizeof(u32));
+	pagefault_enable();
+
+	schedstat_inc(curr, se.statistics.nr_preempt_delayed);
+	return;
+
+resched_now:
+	resched_curr(rq);
+}
+#else
+#define delay_resched_rq(rq) resched_curr(rq)
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -2939,7 +3048,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
 	if (delta_exec > ideal_runtime) {
-		resched_curr(rq_of(cfs_rq));
+		delay_resched_rq(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
 		 * re-elected due to buddy favours.
@@ -2963,7 +3072,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		return;
 
 	if (delta > ideal_runtime)
-		resched_curr(rq_of(cfs_rq));
+		delay_resched_rq(rq_of(cfs_rq));
 }
 
 static void
@@ -4780,7 +4889,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;
 
 preempt:
-	resched_curr(rq);
+	delay_resched_rq(rq);
 	/*
 	 * Only set the backward buddy when the current task is still
 	 * on the rq. This can happen when a wakeup gets interleaved
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0..9edf572 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2025,6 +2025,36 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 }
 #endif
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+int sysctl_preempt_delay_available;
+
+static int
+preempt_delay_write(struct task_struct *task, unsigned long preempt_delay_addr)
+{
+	/*
+	 * Do not allow write if pointer is currently set
+	 */
+	if (task->sched_preempt_delay.delay_req &&
+			((void *)preempt_delay_addr != NULL))
+		return -EINVAL;
+
+	/*
+	 * Validate the pointer. It should be aligned to 4-byte boundary.
+	 */
+	if (unlikely(!IS_ALIGNED(preempt_delay_addr, 4)))
+		return -EFAULT;
+	if (unlikely(!access_ok(rw, preempt_delay_addr, sizeof(u32))))
+		return -EFAULT;
+
+	task->sched_preempt_delay.delay_req = (u32 __user *) preempt_delay_addr;
+
+	/* zero out flags */
+	put_user(0, (uint32_t *)preempt_delay_addr);
+
+	return 0;
+}
+#endif
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2203,6 +2233,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			me->mm->def_flags &= ~VM_NOHUGEPAGE;
 		up_write(&me->mm->mmap_sem);
 		break;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	case PR_SET_PREEMPT_DELAY:
+		if (sysctl_preempt_delay_available)
+			error = preempt_delay_write(current, arg2);
+		else
+			error = -EPERM;
+		break;
+	case PR_GET_PREEMPT_DELAY:
+		error = put_user(
+			(unsigned long)current->sched_preempt_delay.delay_req,
+					(unsigned long __user *)arg2);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 15f2511..ad450cf 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	{
+		.procname	= "preempt_delay_available",
+		.data		= &sysctl_preempt_delay_available,
+		.maxlen		= sizeof(int),
+		.mode		= 0600,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v3 4/7] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2014-11-24 20:55 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Daniel Borkmann, 'Quentin Gouchet',
	lkml - Kernel Mailing List, linux-crypto, linux-api
In-Reply-To: <20141124142945.GB31469@gondor.apana.org.au>

Am Montag, 24. November 2014, 22:29:46 schrieb Herbert Xu:

Hi Herbert,

> On Fri, Nov 21, 2014 at 06:32:16AM +0100, Stephan Mueller wrote:
> > This patch adds the AEAD support for AF_ALG.
> > 
> > The AEAD implementation uses the entire memory handling and
> > infrastructure of the existing skcipher implementation.
> > 
> > To use AEAD, the user space consumer has to use the salg_type named
> > "aead". The AEAD extension only uses the bind callback as the key
> > differentiator. The previously added functions that select whether to
> > use AEAD or ablkcipher crypto API functions depend on the TFM type
> > allocated during the bind() call.
> > 
> > The addition of AEAD brings a bit of overhead to calculate the size of
> > the ciphertext, because the AEAD implementation of the kernel crypto API
> > makes implied assumption on the location of the authentication tag. When
> > performing an encryption, the tag will be added to the created
> > ciphertext (note, the tag is placed adjacent to the ciphertext). For
> > decryption, the caller must hand in the ciphertext with the tag appended
> > to the ciphertext. Therefore, the selection of the used memory
> > needs to add/subtract the tag size from the source/destination buffers
> > depending on the encryption type. The code is provided with comments
> > explainint when and how that operation is performed.
> > 
> > Note: The AF_ALG interface does not support zero length input data.
> > Such zero length input data may be used if one wants to access the hash
> > implementation of an AEAD directly (e.g. the GHASH of GCM or CMAC for
> > CCM). However, this is a use case that is not of interest. GHASH or
> > CMAC is directly available via the hash AF_ALG interface and we
> > therefore do not need to take precautions for this use case.
> > 
> > A fully working example using all aspects of AEAD is provided at
> > http://www.chronox.de/libkcapi.html
> > 
> > Signed-off-by: Stephan Mueller <smueller@chronox.de>
> 
> I appreciate the effort to share code, but shoe-horning AEAD into
> algif_skcipher is just too ugly.
> 
> How about let's just start with a separate algif_aead and then
> add helpers to merge common code as applicable?

Instead of creating a separate algif_aead file, may I propose that the inline 
functions wrapping the kernel crypto API calls to keep them in a separate 
header file. That should remove code that distracts from the real 
functionality.

The only AEAD code that is left is the memory handling in the recvmsg and 
setting the AD in sendmsg.

Thanks
> 
> Thanks,


-- 
Ciao
Stephan

^ permalink raw reply

* Re: [fuse-devel] [PATCH 4/6] fs/fuse: support compiling out splice
From: Greg KH @ 2014-11-24 20:22 UTC (permalink / raw)
  To: josh
  Cc: Pieter Smith, Richard Weinberger, Michael S. Tsirkin,
	Bertrand Jacquin, Oleg Nesterov, J. Bruce Fields, Eric Dumazet,
	蔡正龙, Jeff Layton, Tom Herbert,
	Alexei Starovoitov, Miklos Szeredi, Peter Foley, Hugh Dickins,
	Xiao Guangrong, Geert Uytterhoeven, Mel Gorman, Matt Turner,
	Paul E. McKenney, Alexander Duyck, open list:FUSE: FILESYSTEM...
In-Reply-To: <20141124201450.GA18776@cloud>

On Mon, Nov 24, 2014 at 12:14:50PM -0800, josh@joshtriplett.org wrote:
> > I would, again, argue that stuff like __splice_p() not be implemented at
> > all please.  It will only cause a huge proliferation of stuff like this
> > that will not make any sense, and only cause a trivial, if any, amount
> > of code savings.
> > 
> > I thought you were going to not do this type of thing until you got the
> > gcc optimizer working for function callbacks.
> 
> Compared to the previous patchset, there are now only two instances of
> ifdefs outside of the splice code for this, and this is one of them.  In
> this case, the issue is no longer about making the code for this
> splice_read function disappear, but rather to eliminate a reference to a
> bit of splice functionality (used *inside* the FUSE splice code) that
> will not work without SPLICE_SYSCALL.
> 
> Would you prefer to see this specific case handled via an #ifdef in
> fs/fuse/dev.c rather than introducing a __splice_p that people might be
> inclined to propagate?  That'd be fine; the code could simply wrap
> fuse_dev_splice_read in an #ifdef and have the #else define a NULL
> fuse_dev_splice_read.

Yes, I would prefer that, but I'm not the fuse maintainer.

thanks,

greg k-h

^ permalink raw reply

* Re: kdbus: add documentation
From: David Herrmann @ 2014-11-24 20:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Linux API,
	linux-kernel@vger.kernel.org, Daniel Mack, Djalal Harouni
In-Reply-To: <CALCETrWUekPq6pquuiVperBihQHgcvCF=WzVJTZziHhUJYsh4Q@mail.gmail.com>

Hi Andy!

On Fri, Nov 21, 2014 at 6:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Nov 20, 2014 at 9:02 PM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
>> From: Daniel Mack <daniel@zonque.org>
>>
>> kdbus is a system for low-latency, low-overhead, easy to use
>> interprocess communication (IPC).
>>
>> The interface to all functions in this driver is implemented through
>> ioctls on files exposed through the mount point of a kdbusfs.  This
>> patch adds detailed documentation about the kernel level API design.
>>
>> Signed-off-by: Daniel Mack <daniel@zonque.org>
>> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
>> Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
>> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> ---
>
>> +  Pool:
>> +    Each connection allocates a piece of shmem-backed memory that is used
>> +    to receive messages and answers to ioctl command from the kernel. It is
>> +    never used to send anything to the kernel. In order to access that memory,
>> +    userspace must mmap() it into its task.
>> +    See section 12 for more details.
>
> At the risk of opening a can of worms, wouldn't this be much more
> useful if you could share a pool between multiple connections?

Within a process it could theoretically be possible to share the same
memory pool between multiple connections made by the process. However,
note that normally a process only has a single connection to the bus
open (possibly two, if it opens a connection to both the system and
the user bus). Now, sharing the receiver buffer could certainly be
considered an optimization, but it would have no effect on
"usefulness", though, as just allocating space from a single shared
per-process receiver won't give you any new possibilities...

We have thought about this, but decided to delay it for now. Shared
pools can easily be added as an extension later on.

[snip]
>> +KDBUS_CMD_BUS_MAKE, and KDBUS_CMD_ENDPOINT_MAKE take a
>> +struct kdbus_cmd_make argument.
>> +
>> +struct kdbus_cmd_make {
>> +  __u64 size;
>> +    The overall size of the struct, including its items.
>> +
>> +  __u64 flags;
>> +    The flags for creation.
>> +
>> +    KDBUS_MAKE_ACCESS_GROUP
>> +      Make the device file group-accessible
>> +
>> +    KDBUS_MAKE_ACCESS_WORLD
>> +      Make the device file world-accessible
>
> This thing is a file.  What's wrong with using a normal POSIX mode?
> (And what to the read, write, and exec modes do?)

Domains and buses are directories, endpoints are files. Domains also
create control-files implicitly.

For kdbus clients there is just access or no access, but not
distinction between read, write and execute access. Due to that we
just break this down to per-group and world access bits, since doing
more is pointless, and we shouldn't allow shoehorning more stuff into
the access mode.

[snip]
>> +      KDBUS_ITEM_CREDS
>> +      KDBUS_ITEM_SECLABEL
>> +        Privileged bus users may submit these types in order to create
>> +        connections with faked credentials. The only real use case for this
>> +        is a proxy service which acts on behalf of some other tasks. For a
>> +        connection that runs in that mode, the message's metadata items will
>> +        be limited to what's specified here. See section 13 for more
>> +        information.
>
> This is still confusing.  There are multiple places in which metadata
> is attached.  Which does this apply to?  And why are only creds and
> seclabel listed?

Yes, and there are multiple places where metadata is *gathered*. This
ioctl creates connections, so only the items that are actually
*gathered* by that ioctl are documented here. These items are not part
of any messages, but are used as identification of the connection
owner (and in this particular case, to allow privileged proxies to
overwrite the items so they can properly proxy a legacy-dbus peer
connection).

[snip]
>> +6.5 Getting information about a connection's bus creator
>> +--------------------------------------------------------
>> +
>> +The KDBUS_CMD_BUS_CREATOR_INFO ioctl takes the same struct as
>> +KDBUS_CMD_CONN_INFO but is used to retrieve information about the creator of
>> +the bus the connection is attached to. The metadata returned by this call is
>> +collected during the creation of the bus and is never altered afterwards, so
>> +it provides pristine information on the task that created the bus, at the
>> +moment when it did so.
>
> What's this for?  I understand the need for the creator of busses to
> be authenticated, but doing it like this mean that anyone who will
> *fail* authentication can DoS the authentic creator.

This returns information on a bus owner, to determine whether a
connection is connected to a system, user or session bus. Note that
the bus-creator itself is not a valid peer on the bus, so you cannot
send messages to them. Which kind of DoS do you have in mind?

>> +
>> +7.3 Passing of Payload Data
>> +---------------------------
>> +
>> +When connecting to the bus, receivers request a memory pool of a given size,
>> +large enough to carry all backlog of data enqueued for the connection. The
>> +pool is internally backed by a shared memory file which can be mmap()ed by
>> +the receiver.
>> +
>> +KDBUS_MSG_PAYLOAD_VEC:
>> +  Messages are directly copied by the sending process into the receiver's pool,
>> +  that way two peers can exchange data by effectively doing a single-copy from
>> +  one process to another, the kernel will not buffer the data anywhere else.
>> +
>> +KDBUS_MSG_PAYLOAD_MEMFD:
>> +  Messages can reference memfd files which contain the data.
>> +  memfd files are tmpfs-backed files that allow sealing of the content of the
>> +  file, which prevents all writable access to the file content.
>> +  Only sealed memfd files are accepted as payload data, which enforces
>> +  reliable passing of data; the receiver can assume that neither the sender nor
>> +  anyone else can alter the content after the message is sent.
>
> This should specify *which* seals are checked.

True. Will be added to the documentation. For the record, it's the
full set of seals right now, so
F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE|F_SEAL_SEAL.

>> +
>> +Apart from the sender filling-in the content into memfd files, the data will
>> +be passed as zero-copy from one process to another, read-only, shared between
>> +the peers.
>> +
>> +
>> +7.4 Receiving messages
>> +----------------------
>> +
>> +Messages are received by the client with the KDBUS_CMD_MSG_RECV ioctl. The
>> +endpoint file of the bus supports poll() to wake up the receiving process when
>> +new messages are queued up to be received.
>> +
>> +With the KDBUS_CMD_MSG_RECV ioctl, a struct kdbus_cmd_recv is used.
>> +
>> +struct kdbus_cmd_recv {
>> +  __u64 flags;
>> +    Flags to control the receive command.
>> +
>> +    KDBUS_RECV_PEEK
>> +      Just return the location of the next message. Do not install file
>> +      descriptors or anything else. This is usually used to determine the
>> +      sender of the next queued message.
>> +
>> +    KDBUS_RECV_DROP
>> +      Drop the next message without doing anything else with it, and free the
>> +      pool slice. This a short-cut for KDBUS_RECV_PEEK and KDBUS_CMD_FREE.
>> +
>> +    KDBUS_RECV_USE_PRIORITY
>> +      Use the priority field (see below).
>> +
>> +  __u64 kernel_flags;
>> +    Valid flags for this command, returned by the kernel upon each call.
>> +
>> +  __s64 priority;
>> +      With KDBUS_RECV_USE_PRIORITY set in flags, receive the next message in
>> +      the queue with at least the given priority. If no such message is waiting
>> +      in the queue, -ENOMSG is returned.
>> +
>> +  __u64 offset;
>> +      Upon return of the ioctl, this field contains the offset in the
>> +      receiver's memory pool.
>> +};
>> +
>> +Unless KDBUS_RECV_DROP was passed, and given that the ioctl succeeded, the
>> +offset field contains the location of the new message inside the receiver's
>> +pool. The message is stored as struct kdbus_msg at this offset, and can be
>> +interpreted with the semantics described above.
>
> I'm confused here.  Is sent data written to the pool when send is
> called or when recv is called?

When send is called.

> If the former, what prevents DoS, especially DoS due to sending too many fds?

FDs are installed into the task only when the receiver issues
CMD_RECV, so a task won't explode unless it issues that ioctl. Plus,
receiving FDs is already a per-connection opt-in. Furthermore, you can
CMD_PEEK messages which allows you to look at the full message
_without_ installing FDs. If you don't want the FDs, you can CMD_DROP
the message.

> If the latter, where is the data buffered in the mean time?
>
>> +
>> +Also, if the connection allowed for file descriptor to be passed
>> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
>> +installed into the receiving process after the KDBUS_CMD_MSG_RECV ioctl
>> +returns. The receiving task is obliged to close all of them appropriately.
>
> This makes it sound like fds are installed at receive time.  What
> prevents resource exhaustion due to having excessive numbers of fds in
> transit (that are presumably not accounted to anyone)?

We have a per-user message accounting for undelivered messages, as
well as a maximum number of pending messages per connection on the
receiving end. These limits are accounted on a "user<->user" basis, so
the limit of a user A will not affect two other users (B and C)
talking.

>> +
>> +7.5 Canceling messages synchronously waiting for replies
>> +--------------------------------------------------------
>> +
>> +When a connection sends a message with KDBUS_MSG_FLAGS_SYNC_REPLY and
>> +blocks while waiting for the reply, the KDBUS_CMD_MSG_CANCEL ioctl can be
>> +used on the same file descriptor to cancel the message, based on its cookie.
>> +If there are multiple messages with the same cookie that are all synchronously
>> +waiting for a reply, all of them will be canceled. Obviously, this is only
>> +possible in multi-threaded applications.
>
> What does "cancel the message" mean?  Does it just mean that the wait
> for the reply is cancelled?

Yes, exactly. That will be made clearer in the docs. Thanks.

>> +11. Policy
>> +===============================================================================
>> +
>> +A policy databases restrict the possibilities of connections to own, see and
>> +talk to well-known names. It can be associated with a bus (through a policy
>> +holder connection) or a custom endpoint.
>
> ISTM metadata items on bus names should be replaced with policy that
> applies to the domain as a whole and governs bus creation.

No, well-known names are bound to buses, so a bus is really the right
place to hold policy about which process is allowed to claim them.
Every user is allowed to create a bus of its own, there's no policy
for that, and there shouldn't be.

It has nothing to do with metadata items.

>> +A set of policy rules is described by a name and multiple access rules, defined
>> +by the following struct.
>> +
>> +struct kdbus_policy_access {
>> +  __u64 type;  /* USER, GROUP, WORLD */
>> +    One of the following.
>> +
>> +    KDBUS_POLICY_ACCESS_USER
>> +      Grant access to a user with the uid stored in the 'id' field.
>> +
>> +    KDBUS_POLICY_ACCESS_GROUP
>> +      Grant access to a user with the gid stored in the 'id' field.
>> +
>> +    KDBUS_POLICY_ACCESS_WORLD
>> +      Grant access to everyone. The 'id' field is ignored.
>> +
>> +  __u64 access;        /* OWN, TALK, SEE */
>> +    The access to grant.
>> +
>> +    KDBUS_POLICY_SEE
>> +      Allow the name to be seen.
>> +
>> +    KDBUS_POLICY_TALK
>> +      Allow the name to be talked to.
>> +
>> +    KDBUS_POLICY_OWN
>> +      Allow the name to be owned.
>> +
>> +  __u64 id;
>> +    For KDBUS_POLICY_ACCESS_USER, stores the uid.
>> +    For KDBUS_POLICY_ACCESS_GROUP, stores the gid.
>> +};
>
>
> What happens if there are multiple matches?

We only have _granting_ policy entries. We search through the
policy-db until we find an entry that grants access. We do _not_ stop
on the first item that matches.

>
>> +
>> +11.4 TALK access and multiple well-known names per connection
>> +-------------------------------------------------------------
>> +
>> +Note that TALK access is checked against all names of a connection.
>> +For example, if a connection owns both 'org.foo.bar' and 'org.blah.baz', and
>> +the policy database allows 'org.blah.baz' to be talked to by WORLD, then this
>> +permission is also granted to 'org.foo.bar'. That might sound illogical, but
>> +after all, we allow messages to be directed to either the name or a well-known
>> +name, and policy is applied to the connection, not the name. In other words,
>> +the effective TALK policy for a connection is the most permissive of all names
>> +the connection owns.
>
> This does seem illogical.  Does the recipient at least know which
> well-known name was addressed?

If the sender addressed it to a well-known name, yes. If the sender
addressed the message to a unique-ID, there will be no such name, of
course. Still, the policy applies to such transactions either way
(standard D-Bus behavior).

Note however that dbus1 will not pass along the destination well-known
name, hence most userspace libraries will ignore this information too,
even if they run on kdbus which might pass this information around.
The right way for services that carry multiple service names to
discern which actual service is being talked to is having separate
object paths for the different functionality to hide between the
services.

>> +11.5 Implicit policies
>> +----------------------
>> +
>> +Depending on the type of the endpoint, a set of implicit rules might be
>> +enforced. On default endpoints, the following set is enforced:
>> +
>
> How do these rules interact with installed policy?

As said before, all policy entries _grant_ access. We look through all
entries until we find one that grants access.

>> +  * Privileged connections always override any installed policy. Those
>> +    connections could easily install their own policies, so there is no
>> +    reason to enforce installed policies.
>> +  * Connections can always talk to connections of the same user. This
>> +    includes broadcast messages.
>
> Why?

All limits on buses are enforced on a user<->user basis. We don't want
to provide policies that are more fine-grained than our accounting.

> If anyone ever strengthens the concept of identity to include
> things other than users (hmm -- there are already groups), this could
> be very limiting.

If user-based accounting is not suitable, you can create custom
endpoints. Future extensions to that are always welcome. So far, the
default user-based accounting was enough. And I think it's suitable as
default.

>> +  * Connections that own names might send broadcast messages to other
>> +    connections that belong to a different user, but only if that
>> +    destination connection does not own any name.
>> +
>
> This is weird.  It is also differently illogical than the "illogical"
> thing above.

Actually it follows the same model described above. If two connections
are running under the same user then broadcasts are allowed, but if
they are running under different users *and* if the destination owns a
well-known name, then broadcasts are subject to TALK policy checks
since that destination may own a restricted well-known name that is
not interested in broadcasts. So this implicit policy is just
fast-path for the common case where the target is subscribed to a
broadcast and does not own any name.

>> +12. Pool
>> +===============================================================================
>> +
>> +A pool for data received from the kernel is installed for every connection of
>> +the bus, and is sized according to kdbus_cmd_hello.pool_size. It is accessed
>> +when one of the following ioctls is issued:
>> +
>> +  * KDBUS_CMD_MSG_RECV, to receive a message
>> +  * KDBUS_CMD_NAME_LIST, to dump the name registry
>> +  * KDBUS_CMD_CONN_INFO, to retrieve information on a connection
>> +
>> +Internally, the pool is organized in slices, stored in an rb-tree. The offsets
>> +returned by either one of the aforementioned ioctls describe offsets inside the
>> +pool. In order to make the slice available for subsequent calls, KDBUS_CMD_FREE
>> +has to be called on the offset.
>
> Why are you documenting that the slices are stored in an rb-tree?
> That's just an implementation details, right?

Dropped, thanks.

>> +
>> +To access the memory, the caller is expected to mmap() it to its task, like
>> +this:
>> +
>> +  /*
>> +   * POOL_SIZE has to be a multiple of PAGE_SIZE, and it must match the
>> +   * value that was previously passed in the .pool_size field of struct
>> +   * kdbus_cmd_hello.
>> +   */
>> +
>> +  buf = mmap(NULL, POOL_SIZE, PROT_READ, MAP_PRIVATE, conn_fd, 0);
>> +
>
> Will mapping with PROT_WRITE fail?  What about MAP_SHARED?

PROT_WRITE will fail and VM_MAYWRITE is cleared.

> And why are you suggesting MAP_PRIVATE?  That's just strange.

This was a leftover from pre-3.17. memfds require you to use
MAP_PRIVATE if SEAL_WRITE is set (which is a linux-specific behavior,
where MAP_SHARED is always accounted as writable mapping as
mprotect(VM_WRITE) can be called any time). I fixed this up.

>> +
>> +13. Metadata
>> +===============================================================================
[snip]
>> +13.1 Known item types
>> +---------------------
>> +
>> +The following attach flags are currently supported.
>> +
>> +  KDBUS_ATTACH_TIMESTAMP
>> +    Attaches an item of type KDBUS_ITEM_TIMESTAMP which contains both the
>> +    monotonic and the realtime timestamp, taken when the message was
>> +    processed on the kernel side.
>> +
>> +  KDBUS_ATTACH_CREDS
>> +    Attaches an item of type KDBUS_ITEM_CREDS, containing credentials as
>> +    described in kdbus_creds: the uid, gid, pid, tid and starttime of the task.
>> +
>
> As mentioned last time, please remove or justify starttime.

starttime allows detecting PID overflows. Exposing the process
starttime is useful to detect when a PID is getting reused.
Unfortunately, we don't have 64bit pids, so we need the pid+time
combination to avoid ambiguity.

>> +  KDBUS_ATTACH_AUXGROUPS
>> +    Attaches an item of type KDBUS_ITEM_AUXGROUPS, containing a dynamic
>> +    number of auxiliary groups the sending task was a member of.
>> +
>> +  KDBUS_ATTACH_NAMES
>> +    Attaches items of type KDBUS_ITEM_OWNED_NAME, one for each name the sending
>> +    connection currently owns. The name and flags are stored in kdbus_item.name
>> +    for each of them.
>> +
>
> That's interesting.  What's it for?

It a valuable piece of information for receivers to know which bus
names a sender has claimed. For instance, we need this information for
the D-Bus proxy service, because we have to apply D-Bus1 policy in
that case, and we need to get a list of owned names in a race-free
manner to check the policy against.

>> +  KDBUS_ATTACH_TID_COMM
>> +    Attaches an items of type KDBUS_ITEM_TID_COMM, transporting the sending
>> +    task's 'comm', for the tid.  The string is stored in kdbus_item.str.
>> +
>> +  KDBUS_ATTACH_PID_COMM
>> +    Attaches an items of type KDBUS_ITEM_PID_COMM, transporting the sending
>> +    task's 'comm', for the pid.  The string is stored in kdbus_item.str.
>> +
>> +  KDBUS_ATTACH_EXE
>> +    Attaches an item of type KDBUS_ITEM_EXE, containing the path to the
>> +    executable of the sending task, stored in kdbus_item.str.
>> +
>> +  KDBUS_ATTACH_CMDLINE
>> +    Attaches an item of type KDBUS_ITEM_CMDLINE, containing the command line
>> +    arguments of the sending task, as an array of strings, stored in
>> +    kdbus_item.str.
>
> Please remove these four items.  They are genuinely useless.  Anything
> that uses them for anything is either buggy or should have asked the
> sender to put the value in the payload (and immediately wondered why
> it was doing that).

We use them for logging, debugging and monitoring. With our wireshark
extension it's pretty useful to know the comm-name of a process when
monitoring a bus. As we explained last time, this is not about
security. We're aware that a process can modify them. We use them only
as additional meta-data for logging and debugging.

If we put those items into the payload, we have to transmit this data
even if the destination process is not interested in this.
Furthermore, each caller has to run multiple syscalls on each message
to retrieve those values.

We use these items heavily for filtering and debugging, regardless of
the payload protocol that is transmitted on the bus.

To give another specific use-case here: dbus supports bus activation,
where a message sent to a non-running service causes it to be spawned
implicitly without losing the message. Now, with such a scheme it is
incredibly useful to be able to log which client caused a service to
be triggered, hence we want to know the cmdline/exe/comm of the
client. Not knowing this is a major pita when trying to trace the boot
process and figuring out why a specific service got activated.

Also note that since v2 of the patch there's actually a per-sender
mask for meta-data like this, hence a peer which doesn't want to pass
its exec/cmdline/comm along can do that. Of course, this will
seriously hamper debuggability and transparency...

>> +
>> +  KDBUS_ATTACH_CGROUP
>> +    Attaches an item of type KDBUS_ITEM_CGROUP with the task's cgroup path.
>> +
>> +  KDBUS_ATTACH_CAPS
>> +    Attaches an item of type KDBUS_ITEM_CAPS, carrying sets of capabilities
>> +    that should be accessed via kdbus_item.caps.caps. Also, userspace should
>> +    be written in a way that it takes kdbus_item.caps.last_cap into account,
>> +    and derive the number of sets and rows from the item size and the reported
>> +    number of valid capability bits.
>> +
>
> Please remove this, too, or justify its use.

cgroup information tells us which service is doing a bus request. This
is useful for a variety of things. For example, the bus activation
logging item above benefits from it. In general, if some message shall
be logged about any client it is useful to know its service name.

Capabilities are useful to authenticate specific method calls. For
example, when a client asks systemd to reboot, without this concept we
can only check for UID==0 to decide whether to allow this. Breaking
this down to capabilities in a race-free way has the benefit of
allowing systemd to bind this to CAP_SYS_BOOT instead. There is no
reason to deny a process with CAP_SYS_BOOT to reboot via bus-APIs, as
they could just enforce it via syscalls, anyway.

We think it's a useful and reliable authentication method. Why should
we remove it?

Anyway, these items are  just optional. The sender can refuse the
reveal them, and the item is only transmitted if the receiver opted in
for it, too. So there's no need to drop any item type from the
protocol.

>> +  KDBUS_ATTACH_SECLABEL
>> +    Attaches an item of type KDBUS_ITEM_SECLABEL, which contains the SELinux
>> +    security label of the sending task. Access via kdbus_item->str.
>> +
>
> This one, too, and please justify why code that uses it will work in
> containers an on non-selinux systems.

This maps to SCM_PEERSEC on AF_UNIX sockets. This is actually heavily
used already on dbus1. For example, systemd actually uses this
information about a bus peer to check whether certain operations are
allowed, on top of the normal policy. To give an explicit example:
when starting a service systemd gets the client's peer label from
AF_UNIX via SCM_PEERSEC, reads the security label of the service file
in question, and then checks the selinux database (via one of
libselinux APIs) whether to allow this change.

Note that this is really nothing we came up with, it's code from the
SELinux folks, it's simple enough, and has been exposed and used that
way since ages in dbus1. libselinux offers all the right APIs to make
use of this, and kdbus really needs to provide the same functionality
as dbus1 in this regard here.

Let met stress that checking the selinux database here is alway *on
top* of the normal UID/caps based security checks services do. This is
exactly how selinux enforces checks on files on top of UID/caps
checks, or on process or anything else that is selinux-managed.

>> +  KDBUS_ATTACH_AUDIT
>> +    Attaches an item of type KDBUS_ITEM_AUDIT, which contains the audio label
>> +    of the sending taskj. Access via kdbus_item->str.
>> +
>
> I will NAK the hell out of this until, at the very least, someone
> documents what this means, how to parse it, what its stability rules
> are, who is allowed to see the value it contains, why that value will
> never evolve to become *more* security sensitive than it is now, etc.
>
> Audit gets to do crazy sh*t because it's restricted to privileged
> receivers.  This isn't restricted like that, so it doesn't deserve the
> same dispensation.  (And, honestly, I'm not sure that audit really
> deserves its free pass on making sense.)

This was based on a misunderstanding, so I will ignore it here. Lets
discuss this on the metadata-patch, in case it's still unclear.

>> +  KDBUS_ATTACH_CONN_DESCRIPTION
>> +    Attaches an item of type KDBUS_ITEM_CONN_DESCRIPTION that contains the
>> +    sender connection's current name in kdbus_item.str.
>> +
>
> Which name?  Can't there be several?

No. There is only one connection description string, copied verbatim
from what the connection supplied during HELLO.

Note that this is really just about debugging, since in some cases
processes might end up with multiple kdbus fds, and it is useful in
tools like "busctl" to know which one is which.

>> +
>> +13.1 Metadata and namespaces
>> +----------------------------
>> +
>> +Metadata such as PIDs, UIDs or GIDs are automatically translated to the
>> +namespaces of the domain that is used to send a message over. The namespaces
>> +of a domain are pinned at creation time, which is when the filesystem has been
>> +mounted.
>> +
>> +Metadata items that cannot be translated are dropped.
>
> What if the receiver said that the item was mandatory?

It is still dropped. It's the responsibility of the receiver to reject
messages that lack required metadata.

Thanks for the review, Andy. Documentation fixes coming up!
David

^ permalink raw reply

* Re: [fuse-devel] [PATCH 4/6] fs/fuse: support compiling out splice
From: josh-iaAMLnmF4UmaiuxdJuQwMA @ 2014-11-24 20:14 UTC (permalink / raw)
  To: Greg KH
  Cc: Pieter Smith, Richard Weinberger, Michael S. Tsirkin,
	Bertrand Jacquin, Oleg Nesterov, J. Bruce Fields, Eric Dumazet,
	蔡正龙, Jeff Layton, Tom Herbert,
	Alexei Starovoitov, Miklos Szeredi, Peter Foley, Hugh Dickins,
	Xiao Guangrong, Geert Uytterhoeven, Mel Gorman, Matt Turner,
	Paul E. McKenney, Alexander Duyck, open list:FUSE: FILESYSTEM...
In-Reply-To: <20141124193412.GB31618-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On Mon, Nov 24, 2014 at 11:34:12AM -0800, Greg KH wrote:
> On Mon, Nov 24, 2014 at 08:05:10AM -0800, Josh Triplett wrote:
> > On Mon, Nov 24, 2014 at 10:49:31AM +0100, Pieter Smith wrote:
> > > On Sun, Nov 23, 2014 at 03:23:02PM -0800, Josh Triplett wrote:
> > > > On Sun, Nov 23, 2014 at 11:29:08PM +0100, Richard Weinberger wrote:
> > > > > On Sun, Nov 23, 2014 at 3:20 PM, Pieter Smith <pieter-qeJ+1H9vRZbz+pZb47iToQ@public.gmane.org> wrote:
> > > > > > To implement splice support, fs/fuse makes use of nosteal_pipe_buf_ops. This
> > > > > > struct is exported by fs/splice. The goal of the larger patch set is to
> > > > > > completely compile out fs/splice, so uses of the exported struct need to be
> > > > > > compiled out along with fs/splice.
> > > > > >
> > > > > > This patch therefore compiles out splice support in fs/fuse when
> > > > > > CONFIG_SYSCALL_SPLICE is undefined.
> > > > > >
> > > > > > Signed-off-by: Pieter Smith <pieter-qeJ+1H9vRZbz+pZb47iToQ@public.gmane.org>
> > > > > > ---
> > > > > >  fs/fuse/dev.c      | 4 ++--
> > > > > >  include/linux/fs.h | 6 ++++++
> > > > > >  2 files changed, 8 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > > > > index ca88731..f8f92a4 100644
> > > > > > --- a/fs/fuse/dev.c
> > > > > > +++ b/fs/fuse/dev.c
> > > > > > @@ -1291,7 +1291,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov,
> > > > > >         return fuse_dev_do_read(fc, file, &cs, iov_length(iov, nr_segs));
> > > > > >  }
> > > > > >
> > > > > > -static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
> > > > > > +static ssize_t __maybe_unused fuse_dev_splice_read(struct file *in, loff_t *ppos,
> > > > > >                                     struct pipe_inode_info *pipe,
> > > > > >                                     size_t len, unsigned int flags)
> > > > > >  {
> > > > > > @@ -2144,7 +2144,7 @@ const struct file_operations fuse_dev_operations = {
> > > > > >         .llseek         = no_llseek,
> > > > > >         .read           = do_sync_read,
> > > > > >         .aio_read       = fuse_dev_read,
> > > > > > -       .splice_read    = fuse_dev_splice_read,
> > > > > > +       .splice_read    = __splice_p(fuse_dev_splice_read),
> > > > > >         .write          = do_sync_write,
> > > > > >         .aio_write      = fuse_dev_write,
> > > > > >         .splice_write   = fuse_dev_splice_write,
> > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > index a957d43..04c0975 100644
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -2443,6 +2443,12 @@ extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
> > > > > >                         int datasync);
> > > > > >  extern void block_sync_page(struct page *page);
> > > > > >
> > > > > > +#ifdef CONFIG_SYSCALL_SPLICE
> > > > > > +#define __splice_p(x) x
> > > > > > +#else
> > > > > > +#define __splice_p(x) NULL
> > > > > > +#endif
> > > > > > +
> > > > > 
> > > > > This needs to go into a different patch.
> > > > > One logical change per patch please. :-)
> > > > 
> > > > Easy enough to merge this one into the patch introducing
> > > > CONFIG_SYSCALL_SPLICE, then.
> > > > 
> > > > - Josh Triplett
> > > 
> > > The patch introducing CONFIG_SYSCALL_SPLICE (PATCH 3) only compiles out the
> > > syscalls. PATCH 6 on the other hand, compiles out fs/splice.c. This patch
> > > allows fs/fuse to be compiled when fs/splice.c is compiled out. If I am to
> > > squash it, it would be logical to include it in PATCH 6, not 3.
> > 
> > The suggestion wasn't to move the fs/fuse/dev.c bits; those should
> > definitely stay in this patch.  The suggestion was just to move the bit
> > of the patch defining __splice_p from this patch to patch 3.  (Note that
> > you need to define it before you use it, so it can't go in patch 6.)
> 
> I would, again, argue that stuff like __splice_p() not be implemented at
> all please.  It will only cause a huge proliferation of stuff like this
> that will not make any sense, and only cause a trivial, if any, amount
> of code savings.
> 
> I thought you were going to not do this type of thing until you got the
> gcc optimizer working for function callbacks.

Compared to the previous patchset, there are now only two instances of
ifdefs outside of the splice code for this, and this is one of them.  In
this case, the issue is no longer about making the code for this
splice_read function disappear, but rather to eliminate a reference to a
bit of splice functionality (used *inside* the FUSE splice code) that
will not work without SPLICE_SYSCALL.

Would you prefer to see this specific case handled via an #ifdef in
fs/fuse/dev.c rather than introducing a __splice_p that people might be
inclined to propagate?  That'd be fine; the code could simply wrap
fuse_dev_splice_read in an #ifdef and have the #else define a NULL
fuse_dev_splice_read.

- Josh Triplett

^ permalink raw reply

* Re: [fuse-devel] [PATCH 4/6] fs/fuse: support compiling out splice
From: Greg KH @ 2014-11-24 19:34 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Pieter Smith, Richard Weinberger, Michael S. Tsirkin,
	Bertrand Jacquin, Oleg Nesterov, J. Bruce Fields, Eric Dumazet,
	蔡正龙, Jeff Layton, Tom Herbert,
	Alexei Starovoitov, Miklos Szeredi, Peter Foley, Hugh Dickins,
	Xiao Guangrong, Geert Uytterhoeven, Mel Gorman, Matt Turner,
	Paul E. McKenney, Alexander Duyck, open list:FUSE: FILESYSTEM...
In-Reply-To: <20141124160510.GA2446@jtriplet-mobl1>

On Mon, Nov 24, 2014 at 08:05:10AM -0800, Josh Triplett wrote:
> On Mon, Nov 24, 2014 at 10:49:31AM +0100, Pieter Smith wrote:
> > On Sun, Nov 23, 2014 at 03:23:02PM -0800, Josh Triplett wrote:
> > > On Sun, Nov 23, 2014 at 11:29:08PM +0100, Richard Weinberger wrote:
> > > > On Sun, Nov 23, 2014 at 3:20 PM, Pieter Smith <pieter@boesman.nl> wrote:
> > > > > To implement splice support, fs/fuse makes use of nosteal_pipe_buf_ops. This
> > > > > struct is exported by fs/splice. The goal of the larger patch set is to
> > > > > completely compile out fs/splice, so uses of the exported struct need to be
> > > > > compiled out along with fs/splice.
> > > > >
> > > > > This patch therefore compiles out splice support in fs/fuse when
> > > > > CONFIG_SYSCALL_SPLICE is undefined.
> > > > >
> > > > > Signed-off-by: Pieter Smith <pieter@boesman.nl>
> > > > > ---
> > > > >  fs/fuse/dev.c      | 4 ++--
> > > > >  include/linux/fs.h | 6 ++++++
> > > > >  2 files changed, 8 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > > > index ca88731..f8f92a4 100644
> > > > > --- a/fs/fuse/dev.c
> > > > > +++ b/fs/fuse/dev.c
> > > > > @@ -1291,7 +1291,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov,
> > > > >         return fuse_dev_do_read(fc, file, &cs, iov_length(iov, nr_segs));
> > > > >  }
> > > > >
> > > > > -static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
> > > > > +static ssize_t __maybe_unused fuse_dev_splice_read(struct file *in, loff_t *ppos,
> > > > >                                     struct pipe_inode_info *pipe,
> > > > >                                     size_t len, unsigned int flags)
> > > > >  {
> > > > > @@ -2144,7 +2144,7 @@ const struct file_operations fuse_dev_operations = {
> > > > >         .llseek         = no_llseek,
> > > > >         .read           = do_sync_read,
> > > > >         .aio_read       = fuse_dev_read,
> > > > > -       .splice_read    = fuse_dev_splice_read,
> > > > > +       .splice_read    = __splice_p(fuse_dev_splice_read),
> > > > >         .write          = do_sync_write,
> > > > >         .aio_write      = fuse_dev_write,
> > > > >         .splice_write   = fuse_dev_splice_write,
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index a957d43..04c0975 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -2443,6 +2443,12 @@ extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
> > > > >                         int datasync);
> > > > >  extern void block_sync_page(struct page *page);
> > > > >
> > > > > +#ifdef CONFIG_SYSCALL_SPLICE
> > > > > +#define __splice_p(x) x
> > > > > +#else
> > > > > +#define __splice_p(x) NULL
> > > > > +#endif
> > > > > +
> > > > 
> > > > This needs to go into a different patch.
> > > > One logical change per patch please. :-)
> > > 
> > > Easy enough to merge this one into the patch introducing
> > > CONFIG_SYSCALL_SPLICE, then.
> > > 
> > > - Josh Triplett
> > 
> > The patch introducing CONFIG_SYSCALL_SPLICE (PATCH 3) only compiles out the
> > syscalls. PATCH 6 on the other hand, compiles out fs/splice.c. This patch
> > allows fs/fuse to be compiled when fs/splice.c is compiled out. If I am to
> > squash it, it would be logical to include it in PATCH 6, not 3.
> 
> The suggestion wasn't to move the fs/fuse/dev.c bits; those should
> definitely stay in this patch.  The suggestion was just to move the bit
> of the patch defining __splice_p from this patch to patch 3.  (Note that
> you need to define it before you use it, so it can't go in patch 6.)

I would, again, argue that stuff like __splice_p() not be implemented at
all please.  It will only cause a huge proliferation of stuff like this
that will not make any sense, and only cause a trivial, if any, amount
of code savings.

I thought you were going to not do this type of thing until you got the
gcc optimizer working for function callbacks.

Again, don't do this please.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCHv10 2/5] x86: Hook up execveat system call.
From: Thomas Gleixner @ 2014-11-24 18:53 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: David Drysdale, Eric W. Biederman, Andy Lutomirski,
	Alexander Viro, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Stephen Rothwell, Oleg Nesterov, Michael Kerrisk,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Rich Felker, Christoph Hellwig, x86, linux-arch, linux-api,
	sparclinux
In-Reply-To: <20141124170626.GA5316@mwanda>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1813 bytes --]

On Mon, 24 Nov 2014, Dan Carpenter wrote:

> On Mon, Nov 24, 2014 at 11:53:56AM +0000, David Drysdale wrote:
> > Hook up x86-64, i386 and x32 ABIs.
> > 
> > Signed-off-by: David Drysdale <drysdale@google.com>
> 
> This one has been breaking my linux-next build for the past week.  I'm
> not sure what's going on.  I build with a script:
> 
> make allmodconfig
> 
> cat << EOF >> .config
> CONFIG_DYNAMIC_DEBUG=n
> CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=n
> CONFIG_DYNAMIC_DEBUG=y
> EOF
> 
> make oldconfig
> 
> Here are the errors:
> 
>   CHK     include/generated/compile.h
>   CHECK   arch/x86/ia32/audit.c
>   CC      arch/x86/ia32/audit.o
> arch/x86/ia32/audit.c: In function ‘ia32_classify_syscall’:
> arch/x86/ia32/audit.c:38:7: error: ‘__NR_execveat’ undeclared (first use in this function)
> arch/x86/ia32/audit.c:38:7: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [arch/x86/ia32/audit.o] Error 1
> make[2]: Target `__build' not remade because of errors.
> make[1]: *** [arch/x86/ia32] Error 2
>   CHECK   arch/x86/kernel/audit_64.c
>   CHK     kernel/config_data.h
>   CC      arch/x86/kernel/audit_64.o
> arch/x86/kernel/audit_64.c: In function ‘audit_classify_syscall’:
> arch/x86/kernel/audit_64.c:53:7: error: ‘__NR_execveat’ undeclared (first use in this function)
> arch/x86/kernel/audit_64.c:53:7: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [arch/x86/kernel/audit_64.o] Error 1
> make[2]: Target `__build' not remade because of errors.
> make[1]: *** [arch/x86/kernel] Error 2
> make[1]: Target `__build' not remade because of errors.

I don't know what you're doing.

Tried the above and it rebuilds the relevant unistd*.h files and
compiles happily.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v3 04/41] virtio: memory access APIs
From: Michael S. Tsirkin @ 2014-11-24 18:49 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Bjarke Istrup Pedersen, Anup Patel, rusty, Greg Kroah-Hartman,
	linux-kernel@vger.kernel.org, virtualization, Sakari Ailus,
	linux-api, Paolo Bonzini, Pranavkumar Sawargaonkar, David Miller,
	Alexei Starovoitov
In-Reply-To: <CAMuHMdWJA4aUyHo8wq=RX-+0w+PX3NtG1ZgvWgcMwKT5KYT_Tg@mail.gmail.com>

On Mon, Nov 24, 2014 at 01:58:43PM +0100, Geert Uytterhoeven wrote:
> On Mon, Nov 24, 2014 at 1:15 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Nov 24, 2014 at 01:03:24PM +0100, Geert Uytterhoeven wrote:
> >> On Mon, Nov 24, 2014 at 12:52 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > virtio 1.0 makes all memory structures LE, so
> >> > we need APIs to conditionally do a byteswap on BE
> >> > architectures.
> >> >
> >> > To make it easier to check code statically,
> >> > add virtio specific types for multi-byte integers
> >> > in memory.
> >> >
> >> > Add low level wrappers that do a byteswap conditionally, these will be
> >> > useful e.g. for vhost.  Add high level wrappers that
> >> > query device endian-ness and act accordingly.
> >>
> >> > diff --git a/include/linux/virtio_byteorder.h b/include/linux/virtio_byteorder.h
> >> > new file mode 100644
> >> > index 0000000..824ed0b
> >> > --- /dev/null
> >> > +++ b/include/linux/virtio_byteorder.h
> >>
> >> > +static inline u16 __virtio16_to_cpu(bool little_endian, __virtio16 val)
> >> > +{
> >> > +       if (little_endian)
> >> > +               return le16_to_cpu((__force __le16)val);
> >> > +       else
> >> > +               return (__force u16)val;
> >> > +}
> >>
> >> What's wrong with just using le16-to_cpu() ...
> >
> > le16-to_cpu() is simply wrong: virtio needs to be
> > LE or native endian, depending on whether it's running
> > in 0.9 or 1.0 mode.
> 
> IC, that was not clear from the description for this patch.
> I thought it was dependent on BE architectures.
> 
> Nevertheless, any chance you can get rid of the "conditional"?

I don't see how - this is fundamental to any virtio device
that wants to support both 0.9 and 1.0 from the same
codebase.

> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds

^ permalink raw reply

* Re: [PATCHv10 4/5] sparc: Hook up execveat system call.
From: David Miller @ 2014-11-24 18:36 UTC (permalink / raw)
  To: drysdale
  Cc: ebiederm, luto, viro, meredydd, linux-kernel, akpm, tglx, sfr,
	oleg, mtk.manpages, mingo, hpa, keescook, arnd, dalias, hch, x86,
	linux-arch, linux-api, sparclinux
In-Reply-To: <1416830039-21952-5-git-send-email-drysdale@google.com>

From: David Drysdale <drysdale@google.com>
Date: Mon, 24 Nov 2014 11:53:58 +0000

> Signed-off-by: David Drysdale <drysdale@google.com>

Acked-by: David S. Miller <davem@davemloft.net>> 

^ permalink raw reply

* Re: [PATCHv10 2/5] x86: Hook up execveat system call.
From: David Drysdale @ 2014-11-24 18:26 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Eric W. Biederman, Andy Lutomirski, Alexander Viro, Meredydd Luff,
	linux-kernel@vger.kernel.org, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov, Michael Kerrisk,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Rich Felker, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux
In-Reply-To: <20141124170626.GA5316@mwanda>

On Mon, Nov 24, 2014 at 5:06 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> On Mon, Nov 24, 2014 at 11:53:56AM +0000, David Drysdale wrote:
>> Hook up x86-64, i386 and x32 ABIs.
>>
>> Signed-off-by: David Drysdale <drysdale@google.com>
>
> This one has been breaking my linux-next build for the past week.  I'm
> not sure what's going on.

Hi Dan,

Sorry if this has been causing you problems -- I've not had any
errors from the kbuild robots or my local builds.

> I build with a script:
>
> make allmodconfig
>
> cat << EOF >> .config
> CONFIG_DYNAMIC_DEBUG=n
> CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=n
> CONFIG_DYNAMIC_DEBUG=y
> EOF
>
> make oldconfig
>
> Here are the errors:
>
>   CHK     include/generated/compile.h
>   CHECK   arch/x86/ia32/audit.c
>   CC      arch/x86/ia32/audit.o
> arch/x86/ia32/audit.c: In function ‘ia32_classify_syscall’:
> arch/x86/ia32/audit.c:38:7: error: ‘__NR_execveat’ undeclared (first use in this function)
> arch/x86/ia32/audit.c:38:7: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [arch/x86/ia32/audit.o] Error 1
> make[2]: Target `__build' not remade because of errors.
> make[1]: *** [arch/x86/ia32] Error 2
>   CHECK   arch/x86/kernel/audit_64.c
>   CHK     kernel/config_data.h
>   CC      arch/x86/kernel/audit_64.o
> arch/x86/kernel/audit_64.c: In function ‘audit_classify_syscall’:
> arch/x86/kernel/audit_64.c:53:7: error: ‘__NR_execveat’ undeclared (first use in this function)
> arch/x86/kernel/audit_64.c:53:7: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [arch/x86/kernel/audit_64.o] Error 1
> make[2]: Target `__build' not remade because of errors.
> make[1]: *** [arch/x86/kernel] Error 2
> make[1]: Target `__build' not remade because of errors.

That seems odd -- the generic definition of __NR_execveat is in the first
patch in the series, and the various x86-specific definitions should get
generated from the table entries in the second patch in the series (at
least since the v9 set I sent on 19 Nov, which split out the x86 wiring
from the general implementation).

Are the syscall table generation steps happening in your build?  And does
__NR_execveat appear in the various generated x86 unistd*.h headers?

As an aside, I've just built next-20141124 (a4cfa44aa26a) fine from
scratch with your config steps.   The build output included the header
generation steps:

  SYSTBL  arch/x86/syscalls/../include/generated/asm/syscalls_32.h
  SYSHDR  arch/x86/syscalls/../include/generated/asm/unistd_32_ia32.h
  SYSHDR  arch/x86/syscalls/../include/generated/asm/unistd_64_x32.h
  SYSTBL  arch/x86/syscalls/../include/generated/asm/syscalls_64.h
  SYSHDR  arch/x86/syscalls/../include/generated/uapi/asm/unistd_32.h
  HOSTCC  scripts/basic/bin2c
  SYSHDR  arch/x86/syscalls/../include/generated/uapi/asm/unistd_64.h
  SYSHDR  arch/x86/syscalls/../include/generated/uapi/asm/unistd_x32.h

and the resulting files did include the __NR_execveat constant:

  % grep execveat usr/include/asm*/*.h arch/x86/include/generated/uapi/asm/*.h
  usr/include/asm-generic/unistd.h:#define __NR_execveat 281
  usr/include/asm-generic/unistd.h:__SC_COMP(__NR_execveat,
sys_execveat, compat_sys_execveat)
  usr/include/asm/unistd_32.h:#define __NR_execveat 358
  usr/include/asm/unistd_64.h:#define __NR_execveat 322
  usr/include/asm/unistd_x32.h:#define __NR_execveat (__X32_SYSCALL_BIT + 545)
  arch/x86/include/generated/uapi/asm/unistd_32.h:#define __NR_execveat 358
  arch/x86/include/generated/uapi/asm/unistd_64.h:#define __NR_execveat 322
  arch/x86/include/generated/uapi/asm/unistd_x32.h:#define
__NR_execveat (__X32_SYSCALL_BIT + 545)

So I can't (yet) reproduce your problem I'm afraid...

> regards,
> dan carpenter
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3] selftest: size: Add size test for Linux kernel
From: Tim Bird @ 2014-11-24 18:20 UTC (permalink / raw)
  To: Shuah Khan, linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-embedded-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <546D321F.2080405-/MT0OVThwyLZJqsBc5GL+g@public.gmane.org>


This test shows the amount of memory used by the system.
Note that this is dependent on the user-space that is loaded
when this program runs.  Optimally, this program would be
run as the init program itself.

The program is optimized for size itself, to avoid conflating
its own execution with that of the system software.
The code is compiled statically, with no stdlibs. On my x86_64 system,
this results in a statically linked binary of less than 5K.

Changes from v2:
  - add return values to print routines
  - add .gitignore file

Changes from v1:
  - use more correct Copyright string in get_size.c

Signed-off-by: Tim Bird <tim.bird-/MT0OVThwyLZJqsBc5GL+g@public.gmane.org>
---
 tools/testing/selftests/Makefile        |   1 +
 tools/testing/selftests/size/.gitignore |   1 +
 tools/testing/selftests/size/Makefile   |  21 +++++++
 tools/testing/selftests/size/get_size.c | 105 ++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+)
 create mode 100644 tools/testing/selftests/size/.gitignore
 create mode 100644 tools/testing/selftests/size/Makefile
 create mode 100644 tools/testing/selftests/size/get_size.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 45f145c..fa91aef 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -15,6 +15,7 @@ TARGETS += user
 TARGETS += sysctl
 TARGETS += firmware
 TARGETS += ftrace
+TARGETS += size
 
 TARGETS_HOTPLUG = cpu-hotplug
 TARGETS_HOTPLUG += memory-hotplug
diff --git a/tools/testing/selftests/size/.gitignore b/tools/testing/selftests/size/.gitignore
new file mode 100644
index 0000000..189b781
--- /dev/null
+++ b/tools/testing/selftests/size/.gitignore
@@ -0,0 +1 @@
+get_size
diff --git a/tools/testing/selftests/size/Makefile b/tools/testing/selftests/size/Makefile
new file mode 100644
index 0000000..51e5fbd
--- /dev/null
+++ b/tools/testing/selftests/size/Makefile
@@ -0,0 +1,21 @@
+#ifndef CC
+	CC = $(CROSS_COMPILE)gcc
+#endif
+
+#ifndef STRIP
+	STRIP = $(CROSS_COMPILE)strip
+#endif
+
+all: get_size
+
+get_size: get_size.c
+	$(CC) --static -ffreestanding -nostartfiles \
+		-Wl,--entry=main get_size.c -o get_size \
+		`cc -print-libgcc-file-name`
+	$(STRIP) -s get_size
+
+run_tests: all
+	./get_size
+
+clean:
+	$(RM) get_size
diff --git a/tools/testing/selftests/size/get_size.c b/tools/testing/selftests/size/get_size.c
new file mode 100644
index 0000000..9c8d3cd
--- /dev/null
+++ b/tools/testing/selftests/size/get_size.c
@@ -0,0 +1,105 @@
+/*
+ * Copyright 2014 Sony
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Selftest for runtime system size
+ *
+ * Prints the amount of RAM that the currently running system is using.
+ *
+ * This program tries to be as small as possible itself, to
+ * avoid perturbing the system memory utilization with its
+ * own execution.  It also attempts to have as few dependencies
+ * on kernel features as possible.
+ *
+ * It should be statically linked, with startup libs avoided.
+ * It uses no library calls, and only the following 3 syscalls:
+ *   sysinfo(), write(), and _exit()
+ *
+ * For output, it avoids printf (which in some C libraries
+ * has large external dependencies) by implementing its own
+ * strlen(), number output and print() routines.
+ */
+
+#include <sys/sysinfo.h>
+#include <unistd.h>
+
+#define STDOUT_FILENO 1
+
+int my_strlen(const char *s)
+{
+	int len = 0;
+
+	while (*s++)
+		len++;
+	return len;
+}
+
+/*
+ * num_to_str - put digits from num into *s, left to right
+ *   do this by dividing the number by powers of 10
+ *   the tricky part is to omit leading zeros
+ *   don't print zeros until we've started printing any numbers at all
+ */
+void num_to_str(unsigned long num, char *s)
+{
+	unsigned long long temp, div;
+	int started;
+
+	temp = num;
+	div = 1000000000000000000LL;
+	started = 0;
+	while (div) {
+		if (temp/div || started) {
+			*s++ = (unsigned char)(temp/div + '0');
+			started = 1;
+		}
+		temp -= (temp/div)*div;
+		div /= 10;
+	}
+	*s = 0;
+}
+
+int print_num(unsigned long num)
+{
+	char num_buf[30];
+
+	num_to_str(num, num_buf);
+	return write(STDOUT_FILENO, num_buf, my_strlen(num_buf));
+}
+
+int print(char *s)
+{
+	return write(STDOUT_FILENO, s, my_strlen(s));
+}
+
+void main(int argc, char **argv)
+{
+	int ccode;
+	unsigned long used;
+	struct sysinfo info;
+	unsigned long long temp;
+
+	print("Testing system size.\n");
+	print("1..1\n");
+
+	ccode = sysinfo(&info);
+	if (ccode < 0) {
+		print("not ok 1 get size runtime size\n");
+		print("# could not get sysinfo\n");
+		_exit(ccode);
+	}
+
+	/* ignore cache complexities for now */
+	temp = info.totalram - info.freeram - info.bufferram;
+	temp = temp * info.mem_unit;
+	temp = temp / 1024;
+
+	used = temp;
+
+	print("ok 1 get runtime size # size = ");
+	print_num(used);
+	print(" K\n");
+
+	_exit(0);
+}
-- 
1.8.2.2

^ permalink raw reply related

* Re: [PATCH v2] selftest: size: Add size test for Linux kernel
From: Tim Bird @ 2014-11-24 18:08 UTC (permalink / raw)
  To: Shuah Khan, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-embedded-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <546E640F.1090006-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

On 11/20/2014 01:58 PM, Shuah Khan wrote:
> On 11/19/2014 05:13 PM, Tim Bird wrote:
>>
>> This test shows the amount of memory used by the system.
>> Note that this is dependent on the user-space that is loaded
>> when this program runs.  Optimally, this program would be
>> run as the init program itself.
>>
>> The program is optimized for size itself, to avoid conflating
>> its own execution with that of the system software.
>> The code is compiled statically, with no stdlibs. On my x86_64 system,
>> this results in a statically linked binary of less than 5K.
>>
>> Changes from v1:
>>   - use more correct Copyright string in get_size.c
>>
>> Signed-off-by: Tim Bird <tim.bird-/MT0OVThwyLZJqsBc5GL+g@public.gmane.org>
>> ---
>>  tools/testing/selftests/Makefile        |   1 +
>>  tools/testing/selftests/size/Makefile   |  21 +++++++
>>  tools/testing/selftests/size/get_size.c | 105 ++++++++++++++++++++++++++++++++
>>  3 files changed, 127 insertions(+)
>>  create mode 100644 tools/testing/selftests/size/Makefile
>>  create mode 100644 tools/testing/selftests/size/get_size.c
> 
> Tim,
> 
> The test looks good, but you are missing .gitignore file.
> Please add a .gitignore for the binary that gets generated to
> avoid git status including the binary it in its output.
Will do.

I'll send v3 today.

Thanks,
 -- Tim

^ permalink raw reply

* Re: [PATCH] kselftest: Move the docs to the Documentation dir
From: Jonathan Corbet @ 2014-11-24 17:49 UTC (permalink / raw)
  To: Shuah Khan
  Cc: Tim Bird, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-embedded-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <546DF325.6060507-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

On Thu, 20 Nov 2014 06:56:53 -0700
Shuah Khan <shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org> wrote:

> Thanks for doing this. Looks good to me. I think you missed
> Documentation maintainer. Adding linux-doc and Jon Corbet to
> the thread with my ack to take this through Documentation tree.

Thanks for the heads-up.  I've applied it to the docs tree now.

jon

^ permalink raw reply

* Re: [PATCHv10 2/5] x86: Hook up execveat system call.
From: Dan Carpenter @ 2014-11-24 17:06 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Andy Lutomirski, Alexander Viro, Meredydd Luff,
	linux-kernel, Andrew Morton, David Miller, Thomas Gleixner,
	Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux
In-Reply-To: <1416830039-21952-3-git-send-email-drysdale@google.com>

On Mon, Nov 24, 2014 at 11:53:56AM +0000, David Drysdale wrote:
> Hook up x86-64, i386 and x32 ABIs.
> 
> Signed-off-by: David Drysdale <drysdale@google.com>

This one has been breaking my linux-next build for the past week.  I'm
not sure what's going on.  I build with a script:

make allmodconfig

cat << EOF >> .config
CONFIG_DYNAMIC_DEBUG=n
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=n
CONFIG_DYNAMIC_DEBUG=y
EOF

make oldconfig

Here are the errors:

  CHK     include/generated/compile.h
  CHECK   arch/x86/ia32/audit.c
  CC      arch/x86/ia32/audit.o
arch/x86/ia32/audit.c: In function ‘ia32_classify_syscall’:
arch/x86/ia32/audit.c:38:7: error: ‘__NR_execveat’ undeclared (first use in this function)
arch/x86/ia32/audit.c:38:7: note: each undeclared identifier is reported only once for each function it appears in
make[2]: *** [arch/x86/ia32/audit.o] Error 1
make[2]: Target `__build' not remade because of errors.
make[1]: *** [arch/x86/ia32] Error 2
  CHECK   arch/x86/kernel/audit_64.c
  CHK     kernel/config_data.h
  CC      arch/x86/kernel/audit_64.o
arch/x86/kernel/audit_64.c: In function ‘audit_classify_syscall’:
arch/x86/kernel/audit_64.c:53:7: error: ‘__NR_execveat’ undeclared (first use in this function)
arch/x86/kernel/audit_64.c:53:7: note: each undeclared identifier is reported only once for each function it appears in
make[2]: *** [arch/x86/kernel/audit_64.o] Error 1
make[2]: Target `__build' not remade because of errors.
make[1]: *** [arch/x86/kernel] Error 2
make[1]: Target `__build' not remade because of errors.

regards,
dan carpenter
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [fuse-devel] [PATCH 4/6] fs/fuse: support compiling out splice
From: Josh Triplett @ 2014-11-24 16:05 UTC (permalink / raw)
  To: Pieter Smith
  Cc: Richard Weinberger, Michael S. Tsirkin, Bertrand Jacquin,
	Oleg Nesterov, J. Bruce Fields, Eric Dumazet,
	蔡正龙, Jeff Layton, Tom Herbert,
	Alexei Starovoitov, Miklos Szeredi, Peter Foley, Hugh Dickins,
	Xiao Guangrong, Geert Uytterhoeven, Mel Gorman, Matt Turner,
	Paul E. McKenney, Alexander Duyck, open list:FUSE: FILESYSTEM...,
	Luis R. Rodriguez
In-Reply-To: <20141124094931.GA1055@smipidev>

On Mon, Nov 24, 2014 at 10:49:31AM +0100, Pieter Smith wrote:
> On Sun, Nov 23, 2014 at 03:23:02PM -0800, Josh Triplett wrote:
> > On Sun, Nov 23, 2014 at 11:29:08PM +0100, Richard Weinberger wrote:
> > > On Sun, Nov 23, 2014 at 3:20 PM, Pieter Smith <pieter@boesman.nl> wrote:
> > > > To implement splice support, fs/fuse makes use of nosteal_pipe_buf_ops. This
> > > > struct is exported by fs/splice. The goal of the larger patch set is to
> > > > completely compile out fs/splice, so uses of the exported struct need to be
> > > > compiled out along with fs/splice.
> > > >
> > > > This patch therefore compiles out splice support in fs/fuse when
> > > > CONFIG_SYSCALL_SPLICE is undefined.
> > > >
> > > > Signed-off-by: Pieter Smith <pieter@boesman.nl>
> > > > ---
> > > >  fs/fuse/dev.c      | 4 ++--
> > > >  include/linux/fs.h | 6 ++++++
> > > >  2 files changed, 8 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > > index ca88731..f8f92a4 100644
> > > > --- a/fs/fuse/dev.c
> > > > +++ b/fs/fuse/dev.c
> > > > @@ -1291,7 +1291,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov,
> > > >         return fuse_dev_do_read(fc, file, &cs, iov_length(iov, nr_segs));
> > > >  }
> > > >
> > > > -static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
> > > > +static ssize_t __maybe_unused fuse_dev_splice_read(struct file *in, loff_t *ppos,
> > > >                                     struct pipe_inode_info *pipe,
> > > >                                     size_t len, unsigned int flags)
> > > >  {
> > > > @@ -2144,7 +2144,7 @@ const struct file_operations fuse_dev_operations = {
> > > >         .llseek         = no_llseek,
> > > >         .read           = do_sync_read,
> > > >         .aio_read       = fuse_dev_read,
> > > > -       .splice_read    = fuse_dev_splice_read,
> > > > +       .splice_read    = __splice_p(fuse_dev_splice_read),
> > > >         .write          = do_sync_write,
> > > >         .aio_write      = fuse_dev_write,
> > > >         .splice_write   = fuse_dev_splice_write,
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index a957d43..04c0975 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -2443,6 +2443,12 @@ extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
> > > >                         int datasync);
> > > >  extern void block_sync_page(struct page *page);
> > > >
> > > > +#ifdef CONFIG_SYSCALL_SPLICE
> > > > +#define __splice_p(x) x
> > > > +#else
> > > > +#define __splice_p(x) NULL
> > > > +#endif
> > > > +
> > > 
> > > This needs to go into a different patch.
> > > One logical change per patch please. :-)
> > 
> > Easy enough to merge this one into the patch introducing
> > CONFIG_SYSCALL_SPLICE, then.
> > 
> > - Josh Triplett
> 
> The patch introducing CONFIG_SYSCALL_SPLICE (PATCH 3) only compiles out the
> syscalls. PATCH 6 on the other hand, compiles out fs/splice.c. This patch
> allows fs/fuse to be compiled when fs/splice.c is compiled out. If I am to
> squash it, it would be logical to include it in PATCH 6, not 3.

The suggestion wasn't to move the fs/fuse/dev.c bits; those should
definitely stay in this patch.  The suggestion was just to move the bit
of the patch defining __splice_p from this patch to patch 3.  (Note that
you need to define it before you use it, so it can't go in patch 6.)

- Josh Triplett

^ permalink raw reply

* Re: [PATCH v3 5/7] crypto: AF_ALG: add random number generator support
From: Stephan Mueller @ 2014-11-24 15:08 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Daniel Borkmann, 'Quentin Gouchet',
	lkml - Kernel Mailing List, linux-crypto, linux-api
In-Reply-To: <20141124143150.GC31469@gondor.apana.org.au>

Am Montag, 24. November 2014, 22:31:50 schrieb Herbert Xu:

Hi Herbert,

>On Fri, Nov 21, 2014 at 06:32:52AM +0100, Stephan Mueller wrote:
>> This patch adds the random number generator support for AF_ALG.
>> 
>> A random number generator's purpose is to generate data without
>> requiring the caller to provide any data. Therefore, the AF_ALG
>> interface handler for RNGs only implements a callback handler for
>> recvmsg.
>> 
>> The following parameters provided with a recvmsg are processed by the
>> 
>> RNG callback handler:
>>         * sock - to resolve the RNG context data structure accessing
>>         the
>>         
>>           RNG instance private to the socket
>>         
>>         * len - this parameter allows userspace callers to specify
>>         how
>>         
>>           many random bytes the RNG shall produce and return. As the
>>           kernel context for the RNG allocates a buffer of 128 bytes
>>           to
>>           store random numbers before copying them to userspace, the
>>           len
>>           parameter is checked that it is not larger than 128. If a
>>           caller wants more random numbers, a new request for recvmsg
>>           shall be made.
>> 
>> The size of 128 bytes is chose because of the following 
considerations:
>>         * to increase the memory footprint of the kernel too much
>>         (note,
>>         
>>           that would be 128 bytes per open socket)
>>         
>>         * 128 is divisible by any typical cryptographic block size an
>>         
>>           RNG may have
>>         
>>         * A request for random numbers typically only shall supply
>>         small
>>         
>>           amount of data like for keys or IVs that should only
>>           require
>>           one invocation of the recvmsg function.
>> 
>> Note, during instantiation of the RNG, the code checks whether the
>> RNG
>> implementation requires seeding. If so, the RNG is seeded with output
>> from get_random_bytes.
>> 
>> A fully working example using all aspects of the RNG interface is
>> provided at http://www.chronox.de/libkcapi.html
>> 
>> Signed-off-by: Stephan Mueller <smueller@chronox.de>
>
>Sorry but who is going to use this and for what purpose?
>
>Every other algif interface exports real hardware features that
>cannot otherwise be accessed from user-space.  All crypto RNGs
>are by definition software-only, so what is the point of this?


My idea is twofold: The software-RNGs currently available (X9.31 and 
DRBG) use other ciphers as backends. Therefore, they can be considered 
as transforms on top of these backend ciphers. Now, if these backend 
ciphers are available in kernel mode only, currently only these in-
kernel RNGs can use the hardware.

With the consideration of AEAD, all ciphers supported by the kernel 
crypto API are available to user space. That means, there is no need for 
an additional crypto library in user space in addition to provide 
hardware access. The RNG part is there to complement the case for not 
needing an additional crypto lib in user space.

Ciao
Stephan

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox