Linux userland API discussions
 help / color / mirror / Atom feed
* [PATCHv8 3/4] sparc: Hook up execveat system call.
From: David Drysdale @ 2014-11-14 16:23 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro, Meredydd Luff,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, David Miller
  Cc: Oleg Nesterov, Michael Kerrisk, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86-DgEjT+Ai2ygdnm+yROfE0A,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	sparclinux-u79uwXL29TY76Z2rM5mHXA, David Drysdale
In-Reply-To: <1415982183-20525-1-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 arch/sparc/include/uapi/asm/unistd.h | 3 ++-
 arch/sparc/kernel/systbls_32.S       | 1 +
 arch/sparc/kernel/systbls_64.S       | 2 ++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/include/uapi/asm/unistd.h b/arch/sparc/include/uapi/asm/unistd.h
index 46d83842eddc..6f35f4df17f2 100644
--- a/arch/sparc/include/uapi/asm/unistd.h
+++ b/arch/sparc/include/uapi/asm/unistd.h
@@ -415,8 +415,9 @@
 #define __NR_getrandom		347
 #define __NR_memfd_create	348
 #define __NR_bpf		349
+#define __NR_execveat		350

-#define NR_syscalls		350
+#define NR_syscalls		351

 /* Bitmask values returned from kern_features system call.  */
 #define KERN_FEATURE_MIXED_MODE_STACK	0x00000001
diff --git a/arch/sparc/kernel/systbls_32.S b/arch/sparc/kernel/systbls_32.S
index ad0cdf497b78..e31a9056a303 100644
--- a/arch/sparc/kernel/systbls_32.S
+++ b/arch/sparc/kernel/systbls_32.S
@@ -87,3 +87,4 @@ sys_call_table:
 /*335*/	.long sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
 /*340*/	.long sys_ni_syscall, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
 /*345*/	.long sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
+/*350*/	.long sys_execveat
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index 580cde9370c9..15069cb35dac 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -88,6 +88,7 @@ sys_call_table32:
 	.word sys_syncfs, compat_sys_sendmmsg, sys_setns, compat_sys_process_vm_readv, compat_sys_process_vm_writev
 /*340*/	.word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
 	.word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
+/*350*/	.word sys_execveat

 #endif /* CONFIG_COMPAT */

@@ -167,3 +168,4 @@ sys_call_table:
 	.word sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
 /*340*/	.word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
 	.word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
+/*350*/	.word sys_execveat
--
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* [PATCHv8 man-pages 4/4] execveat.2: initial man page for execveat(2)
From: David Drysdale @ 2014-11-14 16:23 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro, Meredydd Luff,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, David Miller
  Cc: Oleg Nesterov, Michael Kerrisk, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86-DgEjT+Ai2ygdnm+yROfE0A,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	sparclinux-u79uwXL29TY76Z2rM5mHXA, David Drysdale
In-Reply-To: <1415982183-20525-1-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)
 create mode 100644 man2/execveat.2

diff --git a/man2/execveat.2 b/man2/execveat.2
new file mode 100644
index 000000000000..937d79e4c4f0
--- /dev/null
+++ b/man2/execveat.2
@@ -0,0 +1,153 @@
+.\" Copyright (c) 2014 Google, Inc.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
+.SH NAME
+execveat \- execute program relative to a directory file descriptor
+.SH SYNOPSIS
+.B #include <unistd.h>
+.sp
+.BI "int execveat(int " fd ", const char *" pathname ","
+.br
+.BI "             char *const " argv "[],  char *const " envp "[],"
+.br
+.BI "             int " flags);
+.SH DESCRIPTION
+The
+.BR execveat ()
+system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
+The
+.BR execveat ()
+system call operates in exactly the same way as
+.BR execve (2),
+except for the differences described in this manual page.
+
+If the pathname given in
+.I pathname
+is relative, then it is interpreted relative to the directory
+referred to by the file descriptor
+.I fd
+(rather than relative to the current working directory of
+the calling process, as is done by
+.BR execve (2)
+for a relative pathname).
+
+If
+.I pathname
+is relative and
+.I fd
+is the special value
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the current working
+directory of the calling process (like
+.BR execve (2)).
+
+If
+.I pathname
+is absolute, then
+.I fd
+is ignored.
+
+If
+.I pathname
+is an empty string and the
+.BR AT_EMPTY_PATH
+flag is specified, then the file descriptor
+.I fd
+specifies the file to be executed.
+
+.I flags
+can either be 0, or include the following flags:
+.TP
+.BR AT_EMPTY_PATH
+If
+.I pathname
+is an empty string, operate on the file referred to by
+.IR fd
+(which may have been obtained using the
+.BR open (2)
+.B O_PATH
+flag).
+.TP
+.B AT_SYMLINK_NOFOLLOW
+If the file identified by
+.I fd
+and a non-NULL
+.I pathname
+is a symbolic link, then the call fails with the error
+.BR EINVAL .
+.SH "RETURN VALUE"
+On success,
+.BR execveat ()
+does not return. On error \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+The same errors that occur for
+.BR execve (2)
+can also occur for
+.BR execveat ().
+The following additional errors can occur for
+.BR execveat ():
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B ENOENT
+The program identified by \fIfd\fP and \fIpathname\fP requires the
+use of an interpreter program (such as a script starting with
+"#!") but the file descriptor
+.I fd
+was opened with the
+.B O_CLOEXEC
+flag and so the program file is inaccessible to the launched interpreter.
+.TP
+.B EINVAL
+Invalid flag specified in
+.IR flags .
+.TP
+.B ENOTDIR
+.I pathname
+is relative and
+.I fd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR execveat ()
+was added to Linux in kernel 3.???.
+.SH NOTES
+In addition to the reasons explained in
+.BR openat (2),
+the
+.BR execveat ()
+system call is also needed to allow
+.BR fexecve (3)
+to be implemented on systems that do not have the
+.I /proc
+filesystem mounted.
+.SH SEE ALSO
+.BR execve (2),
+.BR fexecve (3)
--
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* Re: [PATCH v2 net-next 1/7] bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM command
From: Alexei Starovoitov @ 2014-11-14 16:31 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S. Miller, Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Eric Dumazet, Linux API, Network Development, LKML
In-Reply-To: <1415981203.15154.45.camel@localhost>

On Fri, Nov 14, 2014 at 8:06 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On Fr, 2014-11-14 at 07:33 -0800, Alexei Starovoitov wrote:
>> On Fri, Nov 14, 2014 at 4:11 AM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> > On Do, 2014-11-13 at 17:36 -0800, Alexei Starovoitov wrote:
>> >> the current meaning of BPF_MAP_UPDATE_ELEM syscall command is:
>> >> either update existing map element or create a new one.
>> >> Initially the plan was to add a new command to handle the case of
>> >> 'create new element if it didn't exist', but 'flags' style looks
>> >> cleaner and overall diff is much smaller (more code reused), so add 'flags'
>> >> attribute to BPF_MAP_UPDATE_ELEM command with the following meaning:
>> >>  #define BPF_ANY      0 /* create new element or update existing */
>> >>  #define BPF_NOEXIST  1 /* create new element if it didn't exist */
>> >>  #define BPF_EXIST    2 /* update existing element */
>> >
>> > Would a cmpxchg-alike function be handy here?
>>
>> you mean cmpxchg command in addition to
>> update() command ?
>> May be... it will have an extra 'value' argument
>> (key, old_value, new_value)
>> I don't have a use case for it yet though.
>
> I don't neither. ;)
>
> I just wanted to bring this up before user space api might get public
> and the additional argument might make problems.

addition of cmpxchg command won't be a problem obviously.
(just another 'new_value' field in existing struct inside bpf_attr union).

^ permalink raw reply

* Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)
From: Jeff Moyer @ 2014-11-14 16:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, davej
In-Reply-To: <20141111064417.GT23575@dastard>

Dave Chinner <david@fromorbit.com> writes:

> On Mon, Nov 10, 2014 at 11:40:23AM -0500, Milosz Tanski wrote:
>> This patcheset introduces an ability to perform a non-blocking read from
>> regular files in buffered IO mode. This works by only for those filesystems
>> that have data in the page cache.
>> 
>> It does this by introducing new syscalls new syscalls preadv2/pwritev2. These
>> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
>> extra flag argument (RWF_NONBLOCK).
>> 
>> It's a very common patern today (samba, libuv, etc..) use a large threadpool to
>> perform buffered IO operations. They submit the work form another thread
>> that performs network IO and epoll or other threads that perform CPU work. This
>> leads to increased latency for processing, esp. in the case of data that's
>> already cached in the page cache.
>> 
>> With the new interface the applications will now be able to fetch the data in
>> their network / cpu bound thread(s) and only defer to a threadpool if it's not
>> there. In our own application (VLDB) we've observed a decrease in latency for
>> "fast" request by avoiding unnecessary queuing and having to swap out current
>> tasks in IO bound work threads.
>
> Can you write a test (or set of) for fstests that exercises this new
> functionality? I'm not worried about performance, just
> correctness....

On the subject of testing, I added support to trinity (attached,
untested).  That did raise one question.  Do we expect applications to
#include <linux/fs.h> to get the RWF_NONBLOCK definition?

Cheers,
Jeff

diff --git a/include/syscalls-i386.h b/include/syscalls-i386.h
index 767be6e..3125064 100644
--- a/include/syscalls-i386.h
+++ b/include/syscalls-i386.h
@@ -365,4 +365,6 @@ struct syscalltable syscalls_i386[] = {
 	{ .entry = &syscall_getrandom },
 	{ .entry = &syscall_memfd_create },
 	{ .entry = &syscall_bpf },
+	{ .entry = &syscall_preadv2 },
+	{ .entry = &syscall_pwritev2 },
 };
diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
index cb609ad..8d32571 100644
--- a/include/syscalls-x86_64.h
+++ b/include/syscalls-x86_64.h
@@ -329,4 +329,6 @@ struct syscalltable syscalls_x86_64[] = {
 	{ .entry = &syscall_memfd_create },
 	{ .entry = &syscall_kexec_file_load },
 	{ .entry = &syscall_bpf },
+	{ .entry = &syscall_preadv2 },
+	{ .entry = &syscall_pwritev2 },
 };
diff --git a/syscalls/read.c b/syscalls/read.c
index e0948a2..adbf146 100644
--- a/syscalls/read.c
+++ b/syscalls/read.c
@@ -3,6 +3,7 @@
  */
 #include <stdlib.h>
 #include <string.h>
+#include <linux/fs.h>
 #include "arch.h"
 #include "maps.h"
 #include "random.h"
@@ -94,3 +95,29 @@ struct syscallentry syscall_preadv = {
 	.arg5name = "pos_h",
 	.flags = NEED_ALARM,
 };
+
+/*
+ * SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+ */
+
+struct syscallentry syscall_preadv2 = {
+	.name = "preadv2",
+	.num_args = 5,
+	.arg1name = "fd",
+	.arg1type = ARG_FD,
+	.arg2name = "vec",
+	.arg2type = ARG_IOVEC,
+	.arg3name = "vlen",
+	.arg3type = ARG_IOVECLEN,
+	.arg4name = "pos_l",
+	.arg5name = "pos_h",
+	.arg6name = "flags",
+	.arg6type = ARG_OP,
+	.arg6list = {
+		.num = 1,
+		.values = { RWF_NONBLOCK, },
+	},
+	.flags = NEED_ALARM,
+};
diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
index 5a7748b..04400dd 100644
--- a/syscalls/syscalls.h
+++ b/syscalls/syscalls.h
@@ -375,5 +375,7 @@ extern struct syscallentry syscall_seccomp;
 extern struct syscallentry syscall_memfd_create;
 extern struct syscallentry syscall_kexec_file_load;
 extern struct syscallentry syscall_bpf;
+extern struct syscallentry syscall_preadv2;
+extern struct syscallentry syscall_pwritev2;
 
 unsigned int random_fcntl_setfl_flags(void);
diff --git a/syscalls/write.c b/syscalls/write.c
index f37e760..4218ccc 100644
--- a/syscalls/write.c
+++ b/syscalls/write.c
@@ -2,6 +2,7 @@
  * SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count)
  */
 #include <stdlib.h>
+#include <linux/fs.h>
 #include "arch.h"	// page_size
 #include "maps.h"
 #include "random.h"
@@ -95,3 +96,30 @@ struct syscallentry syscall_pwritev = {
 	.arg5name = "pos_h",
 	.flags = NEED_ALARM,
 };
+
+
+/*
+ * SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+ */
+
+struct syscallentry syscall_pwritev2 = {
+	.name = "pwritev2",
+	.num_args = 6,
+	.arg1name = "fd",
+	.arg1type = ARG_FD,
+	.arg2name = "vec",
+	.arg2type = ARG_IOVEC,
+	.arg3name = "vlen",
+	.arg3type = ARG_IOVECLEN,
+	.arg4name = "pos_l",
+	.arg5name = "pos_h",
+	.arg6name = "flags",
+	.arg6type = ARG_OP,
+	.arg6list = {
+		.num = 1,
+		.values = { RWF_NONBLOCK, },
+	},
+	.flags = NEED_ALARM,
+};

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related

* Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)
From: Dave Jones @ 2014-11-14 16:39 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dave Chinner, Milosz Tanski, linux-kernel, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch
In-Reply-To: <x49mw7tyc7e.fsf@segfault.boston.devel.redhat.com>

On Fri, Nov 14, 2014 at 11:32:53AM -0500, Jeff Moyer wrote:

 > > Can you write a test (or set of) for fstests that exercises this new
 > > functionality? I'm not worried about performance, just
 > > correctness....
 > 
 > On the subject of testing, I added support to trinity (attached,
 > untested).  That did raise one question.  Do we expect applications to
 > #include <linux/fs.h> to get the RWF_NONBLOCK definition?
 
Trinity will at least need an addition to include/compat.h for
older headers that won't have the definition.  Looks ok otherwise.

Also, I usually sit on stuff like this until the syscall numbers are
in Linus tree. This is 3.19 stuff I presume ?
istr akpm picked up execveat recently, so if that goes in first, we'll
need to respin this anyway..

	Dave

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply

* Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)
From: Jeff Moyer @ 2014-11-14 16:51 UTC (permalink / raw)
  To: Dave Jones
  Cc: Dave Chinner, Milosz Tanski, linux-kernel, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch
In-Reply-To: <20141114163912.GA23769@redhat.com>

Dave Jones <davej@redhat.com> writes:

> On Fri, Nov 14, 2014 at 11:32:53AM -0500, Jeff Moyer wrote:
>
>  > > Can you write a test (or set of) for fstests that exercises this new
>  > > functionality? I'm not worried about performance, just
>  > > correctness....
>  > 
>  > On the subject of testing, I added support to trinity (attached,
>  > untested).  That did raise one question.  Do we expect applications to
>  > #include <linux/fs.h> to get the RWF_NONBLOCK definition?
>  
> Trinity will at least need an addition to include/compat.h for
> older headers that won't have the definition.  Looks ok otherwise.

OK, I'll add that.

> Also, I usually sit on stuff like this until the syscall numbers are
> in Linus tree. This is 3.19 stuff I presume ?
> istr akpm picked up execveat recently, so if that goes in first, we'll
> need to respin this anyway..

Sure.  I just wanted to test with trinity *before* it makes it into the
kernel.  Crazy, I know.  ;-)

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Eric Dumazet @ 2014-11-14 17:00 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: David Miller, netdev, Ying Cai, Willem de Bruijn, Neal Cardwell,
	Linux API
In-Reply-To: <CAHO5Pa0OGtgbUp4R287jbK2SFSpVUoXWCJybvHTFsGyCLynqLg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, 2014-11-14 at 09:05 +0100, Michael Kerrisk wrote:
> Hi Eric,
> 
> Since this is an API change ( Documentation/SubmitChecklist),
> linux-api@ should be CCed.

Right, I am sorry !

Thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-14 17:17 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Eric Dumazet, David Miller, netdev, Ying Cai, Willem de Bruijn,
	Neal Cardwell, Linux API
In-Reply-To: <CAHO5Pa0OGtgbUp4R287jbK2SFSpVUoXWCJybvHTFsGyCLynqLg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Nov 14, 2014 at 12:05 AM, Michael Kerrisk
<mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi Eric,
>
> Since this is an API change ( Documentation/SubmitChecklist),
> linux-api@ should be CCed.
>
> Thanks,
>
> Michael
>
>
>
> On Fri, Nov 7, 2014 at 9:51 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> From: Eric Dumazet <edumazet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>
>> Alternative to RPS/RFS is to use hardware support for multi queue.
>>
>> Then split a set of million of sockets into worker threads, each
>> one using epoll() to manage events on its own socket pool.
>>
>> Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
>> know after accept() or connect() on which queue/cpu a socket is managed.
>>
>> We normally use one cpu per RX queue (IRQ smp_affinity being properly
>> set), so remembering on socket structure which cpu delivered last packet
>> is enough to solve the problem.
>>
>> After accept(), connect(), or even file descriptor passing around
>> processes, applications can use :
>>
>>  int cpu;
>>  socklen_t len = sizeof(cpu);
>>
>>  getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
>>
>> And use this information to put the socket into the right silo
>> for optimal performance, as all networking stack should run
>> on the appropriate cpu, without need to send IPI (RPS/RFS).

As a heavy user of RFS (and finder of bugs in it, too), here's my
question about this API:

How does an application tell whether the socket represents a
non-actively-steered flow?  If the flow is subject to RFS, then moving
the application handling to the socket's CPU seems problematic, as the
socket's CPU might move as well.  The current implementation in this
patch seems to tell me which CPU the most recent packet came in on,
which is not necessarily very useful.

Some possibilities:

1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.

2. Change the interface a bit to report the socket's preferred CPU
(where it would go without RFS, for example) and then let the
application use setsockopt to tell the socket to stay put (i.e. turn
off RFS and RPS for that flow).

3. Report the preferred CPU as in (2) but let the application ask for
something different.

For example, I have flows for which I know which CPU I want.  A nice
API to put the flow there would be quite useful.


Also, it may be worth changing the naming to indicate that these are
about the rx cpu (they are, right?).  For some applications (sparse,
low-latency flows, for example), it can be useful to keep the tx
completion handling on a different CPU.

--Andy

>>
>> Signed-off-by: Eric Dumazet <edumazet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>>  arch/alpha/include/uapi/asm/socket.h   |    2 ++
>>  arch/avr32/include/uapi/asm/socket.h   |    2 ++
>>  arch/cris/include/uapi/asm/socket.h    |    2 ++
>>  arch/frv/include/uapi/asm/socket.h     |    2 ++
>>  arch/ia64/include/uapi/asm/socket.h    |    2 ++
>>  arch/m32r/include/uapi/asm/socket.h    |    2 ++
>>  arch/mips/include/uapi/asm/socket.h    |    2 ++
>>  arch/mn10300/include/uapi/asm/socket.h |    2 ++
>>  arch/parisc/include/uapi/asm/socket.h  |    2 ++
>>  arch/powerpc/include/uapi/asm/socket.h |    2 ++
>>  arch/s390/include/uapi/asm/socket.h    |    2 ++
>>  arch/sparc/include/uapi/asm/socket.h   |    2 ++
>>  arch/xtensa/include/uapi/asm/socket.h  |    2 ++
>>  include/net/sock.h                     |   12 ++++++++++++
>>  include/uapi/asm-generic/socket.h      |    2 ++
>>  net/core/sock.c                        |    5 +++++
>>  net/ipv4/tcp_ipv4.c                    |    1 +
>>  net/ipv4/udp.c                         |    1 +
>>  net/ipv6/tcp_ipv6.c                    |    1 +
>>  net/ipv6/udp.c                         |    1 +
>>  net/sctp/ulpqueue.c                    |    5 +++--
>>  21 files changed, 52 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
>> index 3de1394bcab821984674e89a3ee022cc6dd5f0f2..e2fe0700b3b442bffc1f606b1b8b0bb7759aa157 100644
>> --- a/arch/alpha/include/uapi/asm/socket.h
>> +++ b/arch/alpha/include/uapi/asm/socket.h
>> @@ -87,4 +87,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _UAPI_ASM_SOCKET_H */
>> diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
>> index 6e6cd159924b1855aa5f1811ad4e4c60b403c431..92121b0f5b989a61c008e0be24030725bab88e36 100644
>> --- a/arch/avr32/include/uapi/asm/socket.h
>> +++ b/arch/avr32/include/uapi/asm/socket.h
>> @@ -80,4 +80,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _UAPI__ASM_AVR32_SOCKET_H */
>> diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
>> index ed94e5ed0a238c2750e677ccb806a6bc0a94041a..60f60f5b9b35bd219d7a9834fe5394e8ac5fdbab 100644
>> --- a/arch/cris/include/uapi/asm/socket.h
>> +++ b/arch/cris/include/uapi/asm/socket.h
>> @@ -82,6 +82,8 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _ASM_SOCKET_H */
>>
>>
>> diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
>> index ca2c6e6f31c6817780d31a246652adcc9847e373..2c6890209ea60c149bf097c2a1b369519cb8c301 100644
>> --- a/arch/frv/include/uapi/asm/socket.h
>> +++ b/arch/frv/include/uapi/asm/socket.h
>> @@ -80,5 +80,7 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _ASM_SOCKET_H */
>>
>> diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
>> index a1b49bac7951929127ed08db549218c2c16ccf89..09a93fb566f6c6c6fe29c10c95b931881843d1cd 100644
>> --- a/arch/ia64/include/uapi/asm/socket.h
>> +++ b/arch/ia64/include/uapi/asm/socket.h
>> @@ -89,4 +89,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _ASM_IA64_SOCKET_H */
>> diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
>> index 6c9a24b3aefa3a4f3048c17a7fa06d97b585ec14..e8589819c2743c6e112b15a245fc3ebd146e6313 100644
>> --- a/arch/m32r/include/uapi/asm/socket.h
>> +++ b/arch/m32r/include/uapi/asm/socket.h
>> @@ -80,4 +80,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _ASM_M32R_SOCKET_H */
>> diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
>> index a14baa218c76f14de988ef106bdac5dadc48aceb..2e9ee8c55a103a0337d9f80f71fe9ef28be1154b 100644
>> --- a/arch/mips/include/uapi/asm/socket.h
>> +++ b/arch/mips/include/uapi/asm/socket.h
>> @@ -98,4 +98,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _UAPI_ASM_SOCKET_H */
>> diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
>> index 6aa3ce1854aa9523d46bc28851eddabd59edeb37..f3492e8c9f7009c33e07168df916f7337bef3929 100644
>> --- a/arch/mn10300/include/uapi/asm/socket.h
>> +++ b/arch/mn10300/include/uapi/asm/socket.h
>> @@ -80,4 +80,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _ASM_SOCKET_H */
>> diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
>> index fe35ceacf0e72cad69a43d9b1ce7b8f5ec3da98a..7984a1cab3da980f1f810827967b4b67616eb89b 100644
>> --- a/arch/parisc/include/uapi/asm/socket.h
>> +++ b/arch/parisc/include/uapi/asm/socket.h
>> @@ -79,4 +79,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      0x4029
>>
>> +#define SO_INCOMING_CPU                0x402A
>> +
>>  #endif /* _UAPI_ASM_SOCKET_H */
>> diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
>> index a9c3e2e18c054a1e952fe33599401de57c6a6544..3474e4ef166df4a573773916b325d0fa9f3b45d0 100644
>> --- a/arch/powerpc/include/uapi/asm/socket.h
>> +++ b/arch/powerpc/include/uapi/asm/socket.h
>> @@ -87,4 +87,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _ASM_POWERPC_SOCKET_H */
>> diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
>> index e031332096d7c7b23b5953680289e8f3bcc3b378..8457636c33e1b67a9b7804daa05627839035a8fb 100644
>> --- a/arch/s390/include/uapi/asm/socket.h
>> +++ b/arch/s390/include/uapi/asm/socket.h
>> @@ -86,4 +86,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _ASM_SOCKET_H */
>> diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
>> index 54d9608681b6947ae25dab008f808841d96125c0..4a8003a9416348006cfa85d5bcdf7553c8d23958 100644
>> --- a/arch/sparc/include/uapi/asm/socket.h
>> +++ b/arch/sparc/include/uapi/asm/socket.h
>> @@ -76,6 +76,8 @@
>>
>>  #define SO_BPF_EXTENSIONS      0x0032
>>
>> +#define SO_INCOMING_CPU                0x0033
>> +
>>  /* Security levels - as per NRL IPv6 - don't actually do anything */
>>  #define SO_SECURITY_AUTHENTICATION             0x5001
>>  #define SO_SECURITY_ENCRYPTION_TRANSPORT       0x5002
>> diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
>> index 39acec0cf0b1d500c1c40f9b523ef3a9a142c2f1..c46f6a696849c6f7f8a34b2cc522b48e04b17380 100644
>> --- a/arch/xtensa/include/uapi/asm/socket.h
>> +++ b/arch/xtensa/include/uapi/asm/socket.h
>> @@ -91,4 +91,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* _XTENSA_SOCKET_H */
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 6767d75ecb17693eb59a99b8218da4319854ccc0..7789b59c0c400eb99f65d1f0e03cd9773664cf93 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -273,6 +273,7 @@ struct cg_proto;
>>    *    @sk_rcvtimeo: %SO_RCVTIMEO setting
>>    *    @sk_sndtimeo: %SO_SNDTIMEO setting
>>    *    @sk_rxhash: flow hash received from netif layer
>> +  *    @sk_incoming_cpu: record cpu processing incoming packets
>>    *    @sk_txhash: computed flow hash for use on transmit
>>    *    @sk_filter: socket filtering instructions
>>    *    @sk_protinfo: private area, net family specific, when not using slab
>> @@ -350,6 +351,12 @@ struct sock {
>>  #ifdef CONFIG_RPS
>>         __u32                   sk_rxhash;
>>  #endif
>> +       u16                     sk_incoming_cpu;
>> +       /* 16bit hole
>> +        * Warned : sk_incoming_cpu can be set from softirq,
>> +        * Do not use this hole without fully understanding possible issues.
>> +        */
>> +
>>         __u32                   sk_txhash;
>>  #ifdef CONFIG_NET_RX_BUSY_POLL
>>         unsigned int            sk_napi_id;
>> @@ -833,6 +840,11 @@ static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
>>         return sk->sk_backlog_rcv(sk, skb);
>>  }
>>
>> +static inline void sk_incoming_cpu_update(struct sock *sk)
>> +{
>> +       sk->sk_incoming_cpu = raw_smp_processor_id();
>> +}
>> +
>>  static inline void sock_rps_record_flow_hash(__u32 hash)
>>  {
>>  #ifdef CONFIG_RPS
>> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
>> index ea0796bdcf88404ef0f127eb6e64ba00c16ea856..f541ccefd4acbeb4ad757be9dbf4b67f204bf21d 100644
>> --- a/include/uapi/asm-generic/socket.h
>> +++ b/include/uapi/asm-generic/socket.h
>> @@ -82,4 +82,6 @@
>>
>>  #define SO_BPF_EXTENSIONS      48
>>
>> +#define SO_INCOMING_CPU                49
>> +
>>  #endif /* __ASM_GENERIC_SOCKET_H */
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index ac56dd06c306f3712e57ce8e4724c79565589499..0725cf0cb685787b2122606437da53299fb24621 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -1213,6 +1213,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
>>                 v.val = sk->sk_max_pacing_rate;
>>                 break;
>>
>> +       case SO_INCOMING_CPU:
>> +               v.val = sk->sk_incoming_cpu;
>> +               break;
>> +
>>         default:
>>                 return -ENOPROTOOPT;
>>         }
>> @@ -1517,6 +1521,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>>
>>                 newsk->sk_err      = 0;
>>                 newsk->sk_priority = 0;
>> +               newsk->sk_incoming_cpu = raw_smp_processor_id();
>>                 /*
>>                  * Before updating sk_refcnt, we must commit prior changes to memory
>>                  * (Documentation/RCU/rculist_nulls.txt for details)
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 9c7d7621466b1241f404a5ca11de809dcff2d02a..3893f51972f28271a6d27a763c05495c5c2554f7 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -1662,6 +1662,7 @@ process:
>>                 goto discard_and_relse;
>>
>>         sk_mark_napi_id(sk, skb);
>> +       sk_incoming_cpu_update(sk);
>>         skb->dev = NULL;
>>
>>         bh_lock_sock_nested(sk);
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index df19027f44f3d6fbe13dec78d3b085968dbf2329..f52b6081158e87caa5df32e8e5d27dbf314a01b1 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -1445,6 +1445,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>>         if (inet_sk(sk)->inet_daddr) {
>>                 sock_rps_save_rxhash(sk, skb);
>>                 sk_mark_napi_id(sk, skb);
>> +               sk_incoming_cpu_update(sk);
>>         }
>>
>>         rc = sock_queue_rcv_skb(sk, skb);
>> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
>> index ace29b60813cf8a1d7182ad2262cbcbd21810fa7..ac40d23204b5e55da5172c80dafd1d4854b370d5 100644
>> --- a/net/ipv6/tcp_ipv6.c
>> +++ b/net/ipv6/tcp_ipv6.c
>> @@ -1455,6 +1455,7 @@ process:
>>                 goto discard_and_relse;
>>
>>         sk_mark_napi_id(sk, skb);
>> +       sk_incoming_cpu_update(sk);
>>         skb->dev = NULL;
>>
>>         bh_lock_sock_nested(sk);
>> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
>> index 9b6809232b178c16d699ce3d152196b8c4cb096b..0125ca3daf47a4a3333e7462a11550d3e2f96875 100644
>> --- a/net/ipv6/udp.c
>> +++ b/net/ipv6/udp.c
>> @@ -577,6 +577,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>>         if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
>>                 sock_rps_save_rxhash(sk, skb);
>>                 sk_mark_napi_id(sk, skb);
>> +               sk_incoming_cpu_update(sk);
>>         }
>>
>>         rc = sock_queue_rcv_skb(sk, skb);
>> diff --git a/net/sctp/ulpqueue.c b/net/sctp/ulpqueue.c
>> index d49dc2ed30adb97a809eb37902b9956c366a2862..ce469d648ffbe166f9ae1c5650f481256f31a7f8 100644
>> --- a/net/sctp/ulpqueue.c
>> +++ b/net/sctp/ulpqueue.c
>> @@ -205,9 +205,10 @@ int sctp_ulpq_tail_event(struct sctp_ulpq *ulpq, struct sctp_ulpevent *event)
>>         if (sock_flag(sk, SOCK_DEAD) || (sk->sk_shutdown & RCV_SHUTDOWN))
>>                 goto out_free;
>>
>> -       if (!sctp_ulpevent_is_notification(event))
>> +       if (!sctp_ulpevent_is_notification(event)) {
>>                 sk_mark_napi_id(sk, skb);
>> -
>> +               sk_incoming_cpu_update(sk);
>> +       }
>>         /* Check if the user wishes to receive this event.  */
>>         if (!sctp_ulpevent_is_enabled(event, &sctp_sk(sk)->subscribe))
>>                 goto out_free;
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Michael Kerrisk Linux man-pages maintainer;
> http://www.kernel.org/doc/man-pages/
> Author of "The Linux Programming Interface", http://blog.man7.org/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)
From: Milosz Tanski @ 2014-11-14 18:45 UTC (permalink / raw)
  To: Dave Jones, Jeff Moyer, Dave Chinner, Milosz Tanski, LKML,
	Christoph Hellwig,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Mel Gorman,
	Volker Lendecke, Tejun Heo, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141114163912.GA23769-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Fri, Nov 14, 2014 at 11:39 AM, Dave Jones <davej-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Nov 14, 2014 at 11:32:53AM -0500, Jeff Moyer wrote:
>
>  > > Can you write a test (or set of) for fstests that exercises this new
>  > > functionality? I'm not worried about performance, just
>  > > correctness....
>  >
>  > On the subject of testing, I added support to trinity (attached,
>  > untested).  That did raise one question.  Do we expect applications to
>  > #include <linux/fs.h> to get the RWF_NONBLOCK definition?
>
> Trinity will at least need an addition to include/compat.h for
> older headers that won't have the definition.  Looks ok otherwise.
>
> Also, I usually sit on stuff like this until the syscall numbers are
> in Linus tree. This is 3.19 stuff I presume ?
> istr akpm picked up execveat recently, so if that goes in first, we'll
> need to respin this anyway..

Yes, I am hoping to get it into 3.19. It's a large pain having to deal
with other changes to the syscall code.

On a unrelated note I just back to figuring out how to add this to
xfstests. I got busy with other things the last few days. I'm still
not quite sure how to write a test using the framework, the
documentation (README) seams very XFS specific and otherwise the test
seam to be be split between many different files / directories / C
code / shell code. I might be me being slow... but it's just not
obvious for me how to glue the whole thing together.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org

^ permalink raw reply

* Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)
From: Milosz Tanski @ 2014-11-14 18:46 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dave Jones, Dave Chinner, LKML, Christoph Hellwig,
	linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, Mel Gorman,
	Volker Lendecke, Tejun Heo, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch
In-Reply-To: <x491tp5ybcc.fsf@segfault.boston.devel.redhat.com>

On Fri, Nov 14, 2014 at 11:51 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dave Jones <davej@redhat.com> writes:
>
>> On Fri, Nov 14, 2014 at 11:32:53AM -0500, Jeff Moyer wrote:
>>
>>  > > Can you write a test (or set of) for fstests that exercises this new
>>  > > functionality? I'm not worried about performance, just
>>  > > correctness....
>>  >
>>  > On the subject of testing, I added support to trinity (attached,
>>  > untested).  That did raise one question.  Do we expect applications to
>>  > #include <linux/fs.h> to get the RWF_NONBLOCK definition?
>>
>> Trinity will at least need an addition to include/compat.h for
>> older headers that won't have the definition.  Looks ok otherwise.
>
> OK, I'll add that.
>
>> Also, I usually sit on stuff like this until the syscall numbers are
>> in Linus tree. This is 3.19 stuff I presume ?
>> istr akpm picked up execveat recently, so if that goes in first, we'll
>> need to respin this anyway..
>
> Sure.  I just wanted to test with trinity *before* it makes it into the
> kernel.  Crazy, I know.  ;-)

I am happy to help out to make sure it's solid... although deep down
inside I secretly wish that now wasn't the time we decided to start
doing it :)

>
> Cheers,
> Jeff



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply

* Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)
From: Jeff Moyer @ 2014-11-14 18:52 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Dave Jones, Dave Chinner, LKML, Christoph Hellwig,
	linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, Mel Gorman,
	Volker Lendecke, Tejun Heo, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch
In-Reply-To: <CANP1eJH6BNiJ8zwA2Ba4Dc-2NZ5s2TBE4NVg3Ydk_cvSdDQUFQ@mail.gmail.com>

Milosz Tanski <milosz@adfin.com> writes:

> On a unrelated note I just back to figuring out how to add this to
> xfstests. I got busy with other things the last few days. I'm still
> not quite sure how to write a test using the framework, the
> documentation (README) seams very XFS specific and otherwise the test
> seam to be be split between many different files / directories / C
> code / shell code. I might be me being slow... but it's just not
> obvious for me how to glue the whole thing together.

I can help you get started with this.  I'll send email off-list.

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Eric Dumazet @ 2014-11-14 19:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Michael Kerrisk, David Miller, netdev, Ying Cai, Willem de Bruijn,
	Neal Cardwell, Linux API
In-Reply-To: <CALCETrXmCWSoeQLC3WCqMf_sehGVE9KTbTGmsA77ZN84B-V24Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:

> As a heavy user of RFS (and finder of bugs in it, too), here's my
> question about this API:
> 
> How does an application tell whether the socket represents a
> non-actively-steered flow?  If the flow is subject to RFS, then moving
> the application handling to the socket's CPU seems problematic, as the
> socket's CPU might move as well.  The current implementation in this
> patch seems to tell me which CPU the most recent packet came in on,
> which is not necessarily very useful.

Its the cpu that hit the TCP stack, bringing dozens of cache lines in
its cache. This is all that matters,

> 
> Some possibilities:
> 
> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.

Well, idea is to not use RFS at all. Otherwise, it is useless.

RFS is the other way around : You want the flow to follow your thread.

RPS wont be a problem if you have sensible RPS settings.

> 
> 2. Change the interface a bit to report the socket's preferred CPU
> (where it would go without RFS, for example) and then let the
> application use setsockopt to tell the socket to stay put (i.e. turn
> off RFS and RPS for that flow).
> 
> 3. Report the preferred CPU as in (2) but let the application ask for
> something different.
> 
> For example, I have flows for which I know which CPU I want.  A nice
> API to put the flow there would be quite useful.
> 
> 
> Also, it may be worth changing the naming to indicate that these are
> about the rx cpu (they are, right?).  For some applications (sparse,
> low-latency flows, for example), it can be useful to keep the tx
> completion handling on a different CPU.

SO_INCOMING_CPU is rx, like incoming ;)

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Tom Herbert @ 2014-11-14 19:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andy Lutomirski, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <1415993614.17262.55.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>
>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>> question about this API:
>>
>> How does an application tell whether the socket represents a
>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>> the application handling to the socket's CPU seems problematic, as the
>> socket's CPU might move as well.  The current implementation in this
>> patch seems to tell me which CPU the most recent packet came in on,
>> which is not necessarily very useful.
>
> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
> its cache. This is all that matters,
>
>>
>> Some possibilities:
>>
>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>
> Well, idea is to not use RFS at all. Otherwise, it is useless.
>
Bear in mind this is only an interface to report RX CPU and in itself
doesn't provide any functionality for changing scheduling, there is
obviously logic needed in user space that would need to do something.

If we track the interrupting CPU in skb, the interface could be easily
extended to provide the interrupting CPU, the RPS CPU (calculated at
reported time), and the CPU processing transport (post steering which
is what is currently returned). That would provide the complete
picture to control scheduling a flow from userspace, and an interface
to selectively turn off RFS for a socket would make sense then.

> RFS is the other way around : You want the flow to follow your thread.
>
> RPS wont be a problem if you have sensible RPS settings.
>
>>
>> 2. Change the interface a bit to report the socket's preferred CPU
>> (where it would go without RFS, for example) and then let the
>> application use setsockopt to tell the socket to stay put (i.e. turn
>> off RFS and RPS for that flow).
>>
>> 3. Report the preferred CPU as in (2) but let the application ask for
>> something different.
>>
>> For example, I have flows for which I know which CPU I want.  A nice
>> API to put the flow there would be quite useful.
>>
>>
>> Also, it may be worth changing the naming to indicate that these are
>> about the rx cpu (they are, right?).  For some applications (sparse,
>> low-latency flows, for example), it can be useful to keep the tx
>> completion handling on a different CPU.
>
> SO_INCOMING_CPU is rx, like incoming ;)
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-14 20:16 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CA+mtBx_DyUCxE6BqL_xXcqKPUDu7b3jLShV+AL6L+gCzJLriPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>
>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>> question about this API:
>>>
>>> How does an application tell whether the socket represents a
>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>> the application handling to the socket's CPU seems problematic, as the
>>> socket's CPU might move as well.  The current implementation in this
>>> patch seems to tell me which CPU the most recent packet came in on,
>>> which is not necessarily very useful.
>>
>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>> its cache. This is all that matters,
>>
>>>
>>> Some possibilities:
>>>
>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>
>> Well, idea is to not use RFS at all. Otherwise, it is useless.

Sure, but how do I know that it'll be the same CPU next time?

>>
> Bear in mind this is only an interface to report RX CPU and in itself
> doesn't provide any functionality for changing scheduling, there is
> obviously logic needed in user space that would need to do something.
>
> If we track the interrupting CPU in skb, the interface could be easily
> extended to provide the interrupting CPU, the RPS CPU (calculated at
> reported time), and the CPU processing transport (post steering which
> is what is currently returned). That would provide the complete
> picture to control scheduling a flow from userspace, and an interface
> to selectively turn off RFS for a socket would make sense then.

I think that a turn-off-RFS interface would also want a way to figure
out where the flow would go without RFS.  Can the network stack do
that (e.g. evaluate the rx indirection hash or whatever happens these
days)?

>
>> RFS is the other way around : You want the flow to follow your thread.
>>
>> RPS wont be a problem if you have sensible RPS settings.
>>
>>>
>>> 2. Change the interface a bit to report the socket's preferred CPU
>>> (where it would go without RFS, for example) and then let the
>>> application use setsockopt to tell the socket to stay put (i.e. turn
>>> off RFS and RPS for that flow).
>>>
>>> 3. Report the preferred CPU as in (2) but let the application ask for
>>> something different.
>>>
>>> For example, I have flows for which I know which CPU I want.  A nice
>>> API to put the flow there would be quite useful.
>>>
>>>
>>> Also, it may be worth changing the naming to indicate that these are
>>> about the rx cpu (they are, right?).  For some applications (sparse,
>>> low-latency flows, for example), it can be useful to keep the tx
>>> completion handling on a different CPU.
>>
>> SO_INCOMING_CPU is rx, like incoming ;)
>>
>>

Duh :)

>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Tom Herbert @ 2014-11-14 20:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrUuChgxXcNrKjYzpe1ry0hx3kf19EfAbY1N1YnG7NFZUw@mail.gmail.com>

On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert@google.com> wrote:
>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>
>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>> question about this API:
>>>>
>>>> How does an application tell whether the socket represents a
>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>> the application handling to the socket's CPU seems problematic, as the
>>>> socket's CPU might move as well.  The current implementation in this
>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>> which is not necessarily very useful.
>>>
>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>> its cache. This is all that matters,
>>>
>>>>
>>>> Some possibilities:
>>>>
>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>
>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>
> Sure, but how do I know that it'll be the same CPU next time?
>
>>>
>> Bear in mind this is only an interface to report RX CPU and in itself
>> doesn't provide any functionality for changing scheduling, there is
>> obviously logic needed in user space that would need to do something.
>>
>> If we track the interrupting CPU in skb, the interface could be easily
>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>> reported time), and the CPU processing transport (post steering which
>> is what is currently returned). That would provide the complete
>> picture to control scheduling a flow from userspace, and an interface
>> to selectively turn off RFS for a socket would make sense then.
>
> I think that a turn-off-RFS interface would also want a way to figure
> out where the flow would go without RFS.  Can the network stack do
> that (e.g. evaluate the rx indirection hash or whatever happens these
> days)?
>
Yes,. We need the rxhash and the CPU that packets are received on from
the device for the socket. The former we already have, the latter
might be done by adding a field to skbuff to set received CPU. Given
the L4 hash and interrupting CPU we can calculated the RPS CPU which
is where packet would have landed with RFS off.

>>
>>> RFS is the other way around : You want the flow to follow your thread.
>>>
>>> RPS wont be a problem if you have sensible RPS settings.
>>>
>>>>
>>>> 2. Change the interface a bit to report the socket's preferred CPU
>>>> (where it would go without RFS, for example) and then let the
>>>> application use setsockopt to tell the socket to stay put (i.e. turn
>>>> off RFS and RPS for that flow).
>>>>
>>>> 3. Report the preferred CPU as in (2) but let the application ask for
>>>> something different.
>>>>
>>>> For example, I have flows for which I know which CPU I want.  A nice
>>>> API to put the flow there would be quite useful.
>>>>
>>>>
>>>> Also, it may be worth changing the naming to indicate that these are
>>>> about the rx cpu (they are, right?).  For some applications (sparse,
>>>> low-latency flows, for example), it can be useful to keep the tx
>>>> completion handling on a different CPU.
>>>
>>> SO_INCOMING_CPU is rx, like incoming ;)
>>>
>>>
>
> Duh :)
>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-14 20:34 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CA+mtBx_tNW_Bz0hovZ90XxcLK6pL8d6RX=v5Rg8utfKkaFoJ+Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>
>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>> question about this API:
>>>>>
>>>>> How does an application tell whether the socket represents a
>>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>> socket's CPU might move as well.  The current implementation in this
>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>> which is not necessarily very useful.
>>>>
>>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>>> its cache. This is all that matters,
>>>>
>>>>>
>>>>> Some possibilities:
>>>>>
>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>
>>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>>
>> Sure, but how do I know that it'll be the same CPU next time?
>>
>>>>
>>> Bear in mind this is only an interface to report RX CPU and in itself
>>> doesn't provide any functionality for changing scheduling, there is
>>> obviously logic needed in user space that would need to do something.
>>>
>>> If we track the interrupting CPU in skb, the interface could be easily
>>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>>> reported time), and the CPU processing transport (post steering which
>>> is what is currently returned). That would provide the complete
>>> picture to control scheduling a flow from userspace, and an interface
>>> to selectively turn off RFS for a socket would make sense then.
>>
>> I think that a turn-off-RFS interface would also want a way to figure
>> out where the flow would go without RFS.  Can the network stack do
>> that (e.g. evaluate the rx indirection hash or whatever happens these
>> days)?
>>
> Yes,. We need the rxhash and the CPU that packets are received on from
> the device for the socket. The former we already have, the latter
> might be done by adding a field to skbuff to set received CPU. Given
> the L4 hash and interrupting CPU we can calculated the RPS CPU which
> is where packet would have landed with RFS off.

Hmm.  I think this would be useful for me.  It would *definitely* be
useful for me if I could pin an RFS flow to a cpu of my choice.

With SO_INCOMING_CPU as described, I'm worried that people will write
programs that perform very well if RFS is off, but that once that code
runs with RFS on, weird things could happen.

(On a side note: the RFS flow hash stuff seems to be rather buggy.
Some Solarflare engineers know about this, but a fix seems to be
rather slow in the works.  I think that some of the bugs are in core
code, though.)

--Andy

>
>>>
>>>> RFS is the other way around : You want the flow to follow your thread.
>>>>
>>>> RPS wont be a problem if you have sensible RPS settings.
>>>>
>>>>>
>>>>> 2. Change the interface a bit to report the socket's preferred CPU
>>>>> (where it would go without RFS, for example) and then let the
>>>>> application use setsockopt to tell the socket to stay put (i.e. turn
>>>>> off RFS and RPS for that flow).
>>>>>
>>>>> 3. Report the preferred CPU as in (2) but let the application ask for
>>>>> something different.
>>>>>
>>>>> For example, I have flows for which I know which CPU I want.  A nice
>>>>> API to put the flow there would be quite useful.
>>>>>
>>>>>
>>>>> Also, it may be worth changing the naming to indicate that these are
>>>>> about the rx cpu (they are, right?).  For some applications (sparse,
>>>>> low-latency flows, for example), it can be useful to keep the tx
>>>>> completion handling on a different CPU.
>>>>
>>>> SO_INCOMING_CPU is rx, like incoming ;)
>>>>
>>>>
>>
>> Duh :)
>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Tom Herbert @ 2014-11-14 21:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrWYrR2X9u7CAr5xkSg8kLH_O=sdk=JF+ZW+FxCTNTFBZg@mail.gmail.com>

On Fri, Nov 14, 2014 at 12:34 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert@google.com> wrote:
>> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert@google.com> wrote:
>>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>>
>>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>>> question about this API:
>>>>>>
>>>>>> How does an application tell whether the socket represents a
>>>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>>> socket's CPU might move as well.  The current implementation in this
>>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>>> which is not necessarily very useful.
>>>>>
>>>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>>>> its cache. This is all that matters,
>>>>>
>>>>>>
>>>>>> Some possibilities:
>>>>>>
>>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>>
>>>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>>>
>>> Sure, but how do I know that it'll be the same CPU next time?
>>>
>>>>>
>>>> Bear in mind this is only an interface to report RX CPU and in itself
>>>> doesn't provide any functionality for changing scheduling, there is
>>>> obviously logic needed in user space that would need to do something.
>>>>
>>>> If we track the interrupting CPU in skb, the interface could be easily
>>>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>>>> reported time), and the CPU processing transport (post steering which
>>>> is what is currently returned). That would provide the complete
>>>> picture to control scheduling a flow from userspace, and an interface
>>>> to selectively turn off RFS for a socket would make sense then.
>>>
>>> I think that a turn-off-RFS interface would also want a way to figure
>>> out where the flow would go without RFS.  Can the network stack do
>>> that (e.g. evaluate the rx indirection hash or whatever happens these
>>> days)?
>>>
>> Yes,. We need the rxhash and the CPU that packets are received on from
>> the device for the socket. The former we already have, the latter
>> might be done by adding a field to skbuff to set received CPU. Given
>> the L4 hash and interrupting CPU we can calculated the RPS CPU which
>> is where packet would have landed with RFS off.
>
> Hmm.  I think this would be useful for me.  It would *definitely* be
> useful for me if I could pin an RFS flow to a cpu of my choice.
>
Andy, can you elaborate a little more on your use case. I've thought
several times about an interface to program the flow table from
userspace, but never quite came up with a compelling use case and
there is the security concern that a user could "steal" cycles from
arbitrary CPUs.

> With SO_INCOMING_CPU as described, I'm worried that people will write
> programs that perform very well if RFS is off, but that once that code
> runs with RFS on, weird things could happen.
>
> (On a side note: the RFS flow hash stuff seems to be rather buggy.
> Some Solarflare engineers know about this, but a fix seems to be
> rather slow in the works.  I think that some of the bugs are in core
> code, though.)

This is problems with accelerated RFS or just getting the flow hash for packets?

Thanks,
Tom

>
> --Andy
>
>>
>>>>
>>>>> RFS is the other way around : You want the flow to follow your thread.
>>>>>
>>>>> RPS wont be a problem if you have sensible RPS settings.
>>>>>
>>>>>>
>>>>>> 2. Change the interface a bit to report the socket's preferred CPU
>>>>>> (where it would go without RFS, for example) and then let the
>>>>>> application use setsockopt to tell the socket to stay put (i.e. turn
>>>>>> off RFS and RPS for that flow).
>>>>>>
>>>>>> 3. Report the preferred CPU as in (2) but let the application ask for
>>>>>> something different.
>>>>>>
>>>>>> For example, I have flows for which I know which CPU I want.  A nice
>>>>>> API to put the flow there would be quite useful.
>>>>>>
>>>>>>
>>>>>> Also, it may be worth changing the naming to indicate that these are
>>>>>> about the rx cpu (they are, right?).  For some applications (sparse,
>>>>>> low-latency flows, for example), it can be useful to keep the tx
>>>>>> completion handling on a different CPU.
>>>>>
>>>>> SO_INCOMING_CPU is rx, like incoming ;)
>>>>>
>>>>>
>>>
>>> Duh :)
>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Andy Lutomirski
>>> AMA Capital Management, LLC
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-14 22:10 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CA+mtBx9dSNOD5Z8GinU8F1mj_PgU0sTC5g8S9vRuSnSUzAoD=g@mail.gmail.com>

On Fri, Nov 14, 2014 at 1:36 PM, Tom Herbert <therbert@google.com> wrote:
> On Fri, Nov 14, 2014 at 12:34 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert@google.com> wrote:
>>> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert@google.com> wrote:
>>>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>>>
>>>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>>>> question about this API:
>>>>>>>
>>>>>>> How does an application tell whether the socket represents a
>>>>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>>>> socket's CPU might move as well.  The current implementation in this
>>>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>>>> which is not necessarily very useful.
>>>>>>
>>>>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>>>>> its cache. This is all that matters,
>>>>>>
>>>>>>>
>>>>>>> Some possibilities:
>>>>>>>
>>>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>>>
>>>>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>>>>
>>>> Sure, but how do I know that it'll be the same CPU next time?
>>>>
>>>>>>
>>>>> Bear in mind this is only an interface to report RX CPU and in itself
>>>>> doesn't provide any functionality for changing scheduling, there is
>>>>> obviously logic needed in user space that would need to do something.
>>>>>
>>>>> If we track the interrupting CPU in skb, the interface could be easily
>>>>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>>>>> reported time), and the CPU processing transport (post steering which
>>>>> is what is currently returned). That would provide the complete
>>>>> picture to control scheduling a flow from userspace, and an interface
>>>>> to selectively turn off RFS for a socket would make sense then.
>>>>
>>>> I think that a turn-off-RFS interface would also want a way to figure
>>>> out where the flow would go without RFS.  Can the network stack do
>>>> that (e.g. evaluate the rx indirection hash or whatever happens these
>>>> days)?
>>>>
>>> Yes,. We need the rxhash and the CPU that packets are received on from
>>> the device for the socket. The former we already have, the latter
>>> might be done by adding a field to skbuff to set received CPU. Given
>>> the L4 hash and interrupting CPU we can calculated the RPS CPU which
>>> is where packet would have landed with RFS off.
>>
>> Hmm.  I think this would be useful for me.  It would *definitely* be
>> useful for me if I could pin an RFS flow to a cpu of my choice.
>>
> Andy, can you elaborate a little more on your use case. I've thought
> several times about an interface to program the flow table from
> userspace, but never quite came up with a compelling use case and
> there is the security concern that a user could "steal" cycles from
> arbitrary CPUs.

I have a bunch of threads that are pinned to various CPUs or groups of
CPUs.  Each thread is responsible for a fixed set of flows.  I'd like
those flows to go to those CPUs.

RFS will eventually do it, but it would be nice if I could
deterministically ask for a flow to be routed to the right CPU.  Also,
if my thread bounces temporarily to another CPU, I don't really need
the flow to follow it -- I'd like it to stay put.

This has a significant benefit over using automatic steering: with
automatic steering, I have to make all of the hash tables have a size
around the square of the total number of the flows in order to make it
reliable.

Something like SO_STEER_TO_THIS_CPU would be fine, as long as it
reported whether it worked (for my diagnostics).

>
>> With SO_INCOMING_CPU as described, I'm worried that people will write
>> programs that perform very well if RFS is off, but that once that code
>> runs with RFS on, weird things could happen.
>>
>> (On a side note: the RFS flow hash stuff seems to be rather buggy.
>> Some Solarflare engineers know about this, but a fix seems to be
>> rather slow in the works.  I think that some of the bugs are in core
>> code, though.)
>
> This is problems with accelerated RFS or just getting the flow hash for packets?

Accelerated RFS.

Digging through my email, I thought that
net.core.rps_sock_flow_entries != the per-queue rps_flow_cnt would
make no sense, although I haven't configured it that way.

More importantly, though, I think that some of the has table stuff is
problematic.  My understanding is:

get_rps_cpu may call set_rps_cpu, passing rflow =
flow_table->flows[hash & flow_table->mask];

set_rps_cpu will compute flow_id = hash & flow_table->mask, which
looks to me like it has the property that rflow == flow_table[hash]
(unless we race with a hash table resize).

Now set_rps_cpu tries to steer the new flow to the right CPU (all good
so far), but then it gets weird.  We have a very high probability of
old_flow == rflow.  rflow->filter gets overwritten with the filter id,
the if condition doesn't execute, and nothing gets set to
RPS_NO_FILTER.

This is technically all correct, but if there are two active flows
with the same hash, they'll each keep getting steered to the same
place.  This wastes cycles and seems to anger the sfc driver (the
latter is presumably an sfc bug).  It also means that some of the
filters are likely to get expired for no good reason.

Also, shouldn't ndo_rx_flow_steer be passed the full hash, not just
the index into the hash table?  What's the point of cutting off the
high bits?

--Andy

--Andy

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Eric Dumazet @ 2014-11-14 22:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tom Herbert, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrUuChgxXcNrKjYzpe1ry0hx3kf19EfAbY1N1YnG7NFZUw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, 2014-11-14 at 12:16 -0800, Andy Lutomirski wrote:

> Sure, but how do I know that it'll be the same CPU next time?

Because the NIC always use same RX queue for a given flow.

So if you setup your IRQ affinities properly, the same CPU will drain
packets from this RX queue. And since RFS is off, you have the guarantee
the same CPU will be used to process packets in TCP stack.

This SO_INCOMING_CPU info is a hint, there is no guarantee eg if you use
bonding and some load balancer or switch decides to send packets on
different links.

Most NIC use Toeplitz hash, so given the 4-tuple, and rss key (40
bytes), you can actually compute the hash in software and know on which
RX queue traffic should land.

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-14 22:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <1416003371.17262.66.camel-XN9IlZ5yJG9HTL0Zs8A6p/gx64E7kk8eUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>

On Fri, Nov 14, 2014 at 2:16 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, 2014-11-14 at 12:16 -0800, Andy Lutomirski wrote:
>
>> Sure, but how do I know that it'll be the same CPU next time?
>
> Because the NIC always use same RX queue for a given flow.
>
> So if you setup your IRQ affinities properly, the same CPU will drain
> packets from this RX queue. And since RFS is off, you have the guarantee
> the same CPU will be used to process packets in TCP stack.

Right.  My concern is that if RFS is off, everything works fine, but
if RFS is on (and the app has no good way to know), then silly results
will happen.  I think I'd rather have the getsockopt fail if RFS is on
or at least give some indication so that the app can react
accordingly.

--Andy

>
> This SO_INCOMING_CPU info is a hint, there is no guarantee eg if you use
> bonding and some load balancer or switch decides to send packets on
> different links.
>
> Most NIC use Toeplitz hash, so given the 4-tuple, and rss key (40
> bytes), you can actually compute the hash in software and know on which
> RX queue traffic should land.
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Eric Dumazet @ 2014-11-14 22:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tom Herbert, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrXf8j52nmpw-A2+bQt0yoz-fiD-4GP4Qf-afwH6UjCVnw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, 2014-11-14 at 14:10 -0800, Andy Lutomirski wrote:

> I have a bunch of threads that are pinned to various CPUs or groups of
> CPUs.  Each thread is responsible for a fixed set of flows.  I'd like
> those flows to go to those CPUs.
> 
> RFS will eventually do it, but it would be nice if I could
> deterministically ask for a flow to be routed to the right CPU.  Also,
> if my thread bounces temporarily to another CPU, I don't really need
> the flow to follow it -- I'd like it to stay put.
> 
> This has a significant benefit over using automatic steering: with
> automatic steering, I have to make all of the hash tables have a size
> around the square of the total number of the flows in order to make it
> reliable.
> 
> Something like SO_STEER_TO_THIS_CPU would be fine, as long as it
> reported whether it worked (for my diagnostics).

This requires some kind of hardware support, and unfortunately this is
not generic.

With SO_INCOMING_CPU, you simply can pass fd of sockets around threads,
so that a dumb RSS multiqueue NIC is OK (assuming you are not using some
encapsulation that NIC is not able to parse to find L4 information)

Steering is a dream, I really think its easier to build flows so that
their RX queue matches your requirements.

We usually can pick at least one element of the 4-tuple, so its actually
possible to get this before connect().


Two cases :

1) Passive connections.

   After accept(), get SO_INCOMING_CPU, then pass the fd to appropriate
thread of your pool.

2) Active connections .
   find a proper 4-tuple, bind() then connect(). Eventually check
SO_INCOMING_CPU to verify your expectations.

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-14 22:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <1416003883.17262.72.camel-XN9IlZ5yJG9HTL0Zs8A6p/gx64E7kk8eUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>

On Fri, Nov 14, 2014 at 2:24 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, 2014-11-14 at 14:10 -0800, Andy Lutomirski wrote:
>
>> I have a bunch of threads that are pinned to various CPUs or groups of
>> CPUs.  Each thread is responsible for a fixed set of flows.  I'd like
>> those flows to go to those CPUs.
>>
>> RFS will eventually do it, but it would be nice if I could
>> deterministically ask for a flow to be routed to the right CPU.  Also,
>> if my thread bounces temporarily to another CPU, I don't really need
>> the flow to follow it -- I'd like it to stay put.
>>
>> This has a significant benefit over using automatic steering: with
>> automatic steering, I have to make all of the hash tables have a size
>> around the square of the total number of the flows in order to make it
>> reliable.
>>
>> Something like SO_STEER_TO_THIS_CPU would be fine, as long as it
>> reported whether it worked (for my diagnostics).
>
> This requires some kind of hardware support, and unfortunately this is
> not generic.
>
> With SO_INCOMING_CPU, you simply can pass fd of sockets around threads,
> so that a dumb RSS multiqueue NIC is OK (assuming you are not using some
> encapsulation that NIC is not able to parse to find L4 information)

I can't really do this.  It means that the performance of my system
will be wildly different every time I restart it.  I don't have enough
connections for everything to average out.

>
> Steering is a dream, I really think its easier to build flows so that
> their RX queue matches your requirements.

I have supporting hardware :)  I just want it to work without
programming the ntuple table myself.

>
> We usually can pick at least one element of the 4-tuple, so its actually
> possible to get this before connect().
>

Hmm.  An API for that would be quite nice :)

>
> Two cases :
>
> 1) Passive connections.
>
>    After accept(), get SO_INCOMING_CPU, then pass the fd to appropriate
> thread of your pool.
>
> 2) Active connections .
>    find a proper 4-tuple, bind() then connect(). Eventually check
> SO_INCOMING_CPU to verify your expectations.

The people at the other end will be really pissed if that results in
lots of reconnections.

--Andy

>
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Eric Dumazet @ 2014-11-14 22:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tom Herbert, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrWhsj+byMeT=WHb9eLR_vgsUe58Li8JDLaTvsSPWp1DKQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, 2014-11-14 at 14:27 -0800, Andy Lutomirski wrote:

> The people at the other end will be really pissed if that results in
> lots of reconnections.

No reconnections necessary.

I believe you misunderstood : On the 4-tuple (SADDR,SPORT,DADDR,DPORT),
you can pick for example SPORT so that hash(SADDR,SPORT,DADDR,DPORT)
maps to a known and wanted RX queue number.

Once you know that, you use bind(SADDR, SPORT), then
connect(DADDR,DPORT). 

Anyway, if your hardware is able to cope with the few number of flows,
just use the hardware and be happy.

Here we want about 10 millions sockets, there is little hope for
hardware being helpful.

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-14 23:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <1416005912.17262.76.camel-XN9IlZ5yJG9HTL0Zs8A6p/gx64E7kk8eUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>

On Fri, Nov 14, 2014 at 2:58 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, 2014-11-14 at 14:27 -0800, Andy Lutomirski wrote:
>
>> The people at the other end will be really pissed if that results in
>> lots of reconnections.
>
> No reconnections necessary.
>
> I believe you misunderstood : On the 4-tuple (SADDR,SPORT,DADDR,DPORT),
> you can pick for example SPORT so that hash(SADDR,SPORT,DADDR,DPORT)
> maps to a known and wanted RX queue number.
>
> Once you know that, you use bind(SADDR, SPORT), then
> connect(DADDR,DPORT).

If the kernel had an API for this, I'd be all for using it.

>
> Anyway, if your hardware is able to cope with the few number of flows,
> just use the hardware and be happy.
>
> Here we want about 10 millions sockets, there is little hope for
> hardware being helpful.
>

It's the intermediate numbers that are bad.

With ten flows, the current accelerated RFS works fine.  With 10M
flows, RFS is a lost cause and this solution is much nicer.  With,
say, 1k flows, accelerated RFS *deserves* to work perfectly, because
the hardware has enough filter slots.  But making it work reliably
requires a ridiculously large hash table, and collisions cause silent
bad behavior.

--Andy

>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Eric Dumazet @ 2014-11-14 23:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tom Herbert, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrUAd50g2mYEOKT-5pEqMvwSstfQcgU8=7GRsO1XcKBSFA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, 2014-11-14 at 15:03 -0800, Andy Lutomirski wrote:

> If the kernel had an API for this, I'd be all for using it.

It would be user land code, not kernel.

Why doing a system call when you can avoid it ? ;)

Doing it in the kernel would be quite complex actually.

In userland, you can even implement this by machine learning.

For every connection you made, you get the 4-tuple and INCOMING_CPU,
then you store it in a cache.

Next time you need to connect to same remote peer, you can lookup in the
cache to find a good candidate.

It would actually be good if we could do in a single socket verb a
bind_and_connect(fd, &source, &destination)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox