Netdev List
* Re: [RFCv4 PATCH 2/2] net: Allow protocols to provide an unlocked_recvmsg socket method
From: Arnaldo Carvalho de Melo @ 2009-09-17 21:53 UTC (permalink / raw)
  To: Nir Tzachar
  Cc: David Miller, Linux Networking Development Mailing List,
	Caitlin Bestler, Chris Van Hoof, Clark Williams, Neil Horman,
	Nivedita Singhvi, Paul Moore, Rémi Denis-Courmont,
	Steven Whitehouse, Ziv Ayalon
In-Reply-To: <20090917212113.GC3691@ghostprotocols.net>

[-- Attachment #1: Type: text/plain, Size: 1044 bytes --]

Em Thu, Sep 17, 2009 at 06:21:13PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Thu, Sep 17, 2009 at 05:09:19PM +0300, Nir Tzachar escreveu:
> > Hello.
> > 
> > Below are some test results with the patch (only part 1, as I did not
> > manage to apply part 2).
> 
> I forgot to mention that the patches were made against DaveM's
> net-next-2.6 tree at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
> 
> If you have a linux-2.6 git tree, just do:
> 
> cd linux-2.6
> git remote add net-next git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
> git fetch net-next
> git checkout -b net-next-recvmmsg net-next/master
> 
> And you should be able to apply the two patches cleanly.

Strange. I just checked out v2.6.31, and only one hunk in the _first_
patch (adding the recvmmsg entry in the sparc 32 syscall table) failed;
the second patch applied with just minor offsets.

You must have corrupted the patch somehow when saving it. Anyway, find
both patches, rebased against v2.6.31, attached.

Back to building the kernel on 10 Gbit/s hardware.

- Arnaldo

[-- Attachment #2: 0001-net-Introduce-recvmmsg-socket-syscall.patch --]
[-- Type: text/plain, Size: 26147 bytes --]

From fbdd4648e212c95d82672f385996df0d01086c00 Mon Sep 17 00:00:00 2001
From: Arnaldo Carvalho de Melo <acme@redhat.com>
Date: Thu, 17 Sep 2009 18:44:40 -0300
Subject: [PATCH 1/2] net: Introduce recvmmsg socket syscall
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

recvmmsg means "receive multiple messages": it receives several messages
in one call, reducing the number of syscalls and net stack entry/exit
operations.

Next patches will introduce mechanisms where protocols that want to
optimize this operation will provide an unlocked_recvmsg operation.

This takes into account comments made by:

. Paul Moore: sock_recvmsg is called only for the first datagram,
  sock_recvmsg_nosec is used for the rest.

. Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
  works in the same fashion as the ppoll one.

  If the underlying protocol returns a datagram with MSG_OOB set, this
  will make recvmmsg return right away with as many datagrams (+ the OOB
  one) as it has received so far.

. Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
  datagrams and then recvmsg returns an error, recvmmsg will return
  the successfully received datagrams, store the error and return it
  in the next call.

This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
where we will be able to acquire the lock only at batch start and end, not at
every underlying recvmsg call.

Cc: Caitlin Bestler <caitlin.bestler@gmail.com>
Cc: Chris Van Hoof <vanhoof@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Nir Tzachar <nir.tzachar@gmail.com>
Cc: Nivedita Singhvi <niv@us.ibm.com>
Cc: Paul Moore <paul.moore@hp.com>
Cc: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Cc: Steven Whitehouse <steve@chygwyn.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 arch/alpha/kernel/systbls.S            |    1 +
 arch/arm/kernel/calls.S                |    1 +
 arch/avr32/kernel/syscall_table.S      |    1 +
 arch/blackfin/mach-common/entry.S      |    1 +
 arch/ia64/kernel/entry.S               |    1 +
 arch/microblaze/kernel/syscall_table.S |    1 +
 arch/mips/kernel/scall32-o32.S         |    1 +
 arch/mips/kernel/scall64-64.S          |    1 +
 arch/mips/kernel/scall64-n32.S         |    1 +
 arch/mips/kernel/scall64-o32.S         |    1 +
 arch/sh/kernel/syscalls_64.S           |    1 +
 arch/sparc/kernel/systbls_64.S         |    4 +-
 arch/x86/ia32/ia32entry.S              |    1 +
 arch/x86/include/asm/unistd_32.h       |    1 +
 arch/x86/include/asm/unistd_64.h       |    2 +
 arch/x86/kernel/syscall_table_32.S     |    1 +
 arch/xtensa/include/asm/unistd.h       |    4 +-
 include/linux/net.h                    |    1 +
 include/linux/socket.h                 |   10 ++
 include/linux/syscalls.h               |    4 +
 include/net/compat.h                   |    8 +
 kernel/sys_ni.c                        |    2 +
 net/compat.c                           |   33 +++++-
 net/socket.c                           |  225 ++++++++++++++++++++++++++------
 24 files changed, 259 insertions(+), 48 deletions(-)

diff --git a/arch/alpha/kernel/systbls.S b/arch/alpha/kernel/systbls.S
index 95c9aef..cda6b8b 100644
--- a/arch/alpha/kernel/systbls.S
+++ b/arch/alpha/kernel/systbls.S
@@ -497,6 +497,7 @@ sys_call_table:
 	.quad sys_signalfd
 	.quad sys_ni_syscall
 	.quad sys_eventfd
+	.quad sys_recvmmsg
 
 	.size sys_call_table, . - sys_call_table
 	.type sys_call_table, @object
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index f776e72..43995f6 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -374,6 +374,7 @@
 		CALL(sys_pwritev)
 		CALL(sys_rt_tgsigqueueinfo)
 		CALL(sys_perf_counter_open)
+		CALL(sys_recvmmsg)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/avr32/kernel/syscall_table.S b/arch/avr32/kernel/syscall_table.S
index 7ee0057..e76bad1 100644
--- a/arch/avr32/kernel/syscall_table.S
+++ b/arch/avr32/kernel/syscall_table.S
@@ -295,4 +295,5 @@ sys_call_table:
 	.long	sys_signalfd
 	.long	sys_ni_syscall		/* 280, was sys_timerfd */
 	.long	sys_eventfd
+	.long	sys_recvmmsg
 	.long	sys_ni_syscall		/* r8 is saturated at nr_syscalls */
diff --git a/arch/blackfin/mach-common/entry.S b/arch/blackfin/mach-common/entry.S
index fb1795d..e4d3d0f 100644
--- a/arch/blackfin/mach-common/entry.S
+++ b/arch/blackfin/mach-common/entry.S
@@ -1612,6 +1612,7 @@ ENTRY(_sys_call_table)
 	.long _sys_pwritev
 	.long _sys_rt_tgsigqueueinfo
 	.long _sys_perf_counter_open
+	.long _sys_recvmmsg
 
 	.rept NR_syscalls-(.-_sys_call_table)/4
 	.long _sys_ni_syscall
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index d0e7d37..d75b872 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -1806,6 +1806,7 @@ sys_call_table:
 	data8 sys_preadv
 	data8 sys_pwritev			// 1320
 	data8 sys_rt_tgsigqueueinfo
+	data8 sys_recvmmsg
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
 #endif /* __IA64_ASM_PARAVIRTUALIZED_NATIVE */
diff --git a/arch/microblaze/kernel/syscall_table.S b/arch/microblaze/kernel/syscall_table.S
index 4572160..623dbf1 100644
--- a/arch/microblaze/kernel/syscall_table.S
+++ b/arch/microblaze/kernel/syscall_table.S
@@ -371,3 +371,4 @@ ENTRY(sys_call_table)
 	.long sys_ni_syscall
 	.long sys_rt_tgsigqueueinfo	/* 365 */
 	.long sys_perf_counter_open
+	.long sys_recvmmsg
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index b570821..b92aa3e 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -655,6 +655,7 @@ einval:	li	v0, -ENOSYS
 	sys	sys_rt_tgsigqueueinfo	4
 	sys	sys_perf_counter_open	5
 	sys	sys_accept4		4
+	sys     sys_recvmmsg            5
 	.endm
 
 	/* We pre-compute the number of _instruction_ bytes needed to
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index 3d866f2..d3384d8 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -492,4 +492,5 @@ sys_call_table:
 	PTR	sys_rt_tgsigqueueinfo
 	PTR	sys_perf_counter_open
 	PTR	sys_accept4
+	PTR     sys_recvmmsg
 	.size	sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index e855b11..c332346 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -418,4 +418,5 @@ EXPORT(sysn32_call_table)
 	PTR	compat_sys_rt_tgsigqueueinfo	/* 5295 */
 	PTR	sys_perf_counter_open
 	PTR	sys_accept4
+	PTR     compat_sys_recvmmsg
 	.size	sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index 0c49f1a..12bc997 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -538,4 +538,5 @@ sys_call_table:
 	PTR	compat_sys_rt_tgsigqueueinfo
 	PTR	sys_perf_counter_open
 	PTR	sys_accept4
+	PTR     compat_sys_recvmmsg
 	.size	sys_call_table,.-sys_call_table
diff --git a/arch/sh/kernel/syscalls_64.S b/arch/sh/kernel/syscalls_64.S
index bf420b6..056e0a7 100644
--- a/arch/sh/kernel/syscalls_64.S
+++ b/arch/sh/kernel/syscalls_64.S
@@ -391,3 +391,4 @@ sys_call_table:
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo
 	.long sys_perf_counter_open
+	.long sys_recvmmsg		/* 365 */
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index 2ee7250..7e77138 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -83,7 +83,7 @@ sys_call_table32:
 /*310*/	.word compat_sys_utimensat, compat_sys_signalfd, sys_timerfd_create, sys_eventfd, compat_sys_fallocate
 	.word compat_sys_timerfd_settime, compat_sys_timerfd_gettime, compat_sys_signalfd4, sys_eventfd2, sys_epoll_create1
 /*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, compat_sys_preadv
-	.word compat_sys_pwritev, compat_sys_rt_tgsigqueueinfo
+	.word compat_sys_pwritev, compat_sys_rt_tgsigqueueinfo, compat_sys_recvmmsg
 
 #endif /* CONFIG_COMPAT */
 
@@ -158,4 +158,4 @@ sys_call_table:
 /*310*/	.word sys_utimensat, sys_signalfd, sys_timerfd_create, sys_eventfd, sys_fallocate
 	.word sys_timerfd_settime, sys_timerfd_gettime, sys_signalfd4, sys_eventfd2, sys_epoll_create1
 /*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv
-	.word sys_pwritev, sys_rt_tgsigqueueinfo
+	.word sys_pwritev, sys_rt_tgsigqueueinfo, sys_recvmmsg
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index e590261..2a188e5 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -832,4 +832,5 @@ ia32_sys_call_table:
 	.quad compat_sys_pwritev
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_counter_open
+	.quad compat_sys_recvmmsg
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 732a307..3e72cae 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_recvmmsg		337
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 900e161..713a32a 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_counter_open			298
 __SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)
+#define __NR_recvmmsg				299
+__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d51321d..4881b14 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long sys_recvmmsg
diff --git a/arch/xtensa/include/asm/unistd.h b/arch/xtensa/include/asm/unistd.h
index c092c8f..4e55dc7 100644
--- a/arch/xtensa/include/asm/unistd.h
+++ b/arch/xtensa/include/asm/unistd.h
@@ -681,8 +681,10 @@ __SYSCALL(304, sys_signalfd, 3)
 __SYSCALL(305, sys_ni_syscall, 0)
 #define __NR_eventfd				306
 __SYSCALL(306, sys_eventfd, 1)
+#define __NR_recvmmsg				307
+__SYSCALL(307, sys_recvmmsg, 5)
 
-#define __NR_syscall_count			307
+#define __NR_syscall_count			308
 
 /*
  * sysxtensa syscall handler
diff --git a/include/linux/net.h b/include/linux/net.h
index 4fc2ffd..d67587a 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -41,6 +41,7 @@
 #define SYS_SENDMSG	16		/* sys_sendmsg(2)		*/
 #define SYS_RECVMSG	17		/* sys_recvmsg(2)		*/
 #define SYS_ACCEPT4	18		/* sys_accept4(2)		*/
+#define SYS_RECVMMSG	19		/* sys_recvmmsg(2)		*/
 
 typedef enum {
 	SS_FREE = 0,			/* not allocated		*/
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 3b461df..c192bf8 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -65,6 +65,12 @@ struct msghdr {
 	unsigned	msg_flags;
 };
 
+/* For recvmmsg/sendmmsg */
+struct mmsghdr {
+	struct msghdr   msg_hdr;
+	unsigned        msg_len;
+};
+
 /*
  *	POSIX 1003.1g - ancillary data object information
  *	Ancillary data consits of a sequence of pairs of
@@ -327,6 +333,10 @@ extern int move_addr_to_user(struct sockaddr *kaddr, int klen, void __user *uadd
 extern int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr *kaddr);
 extern int put_cmsg(struct msghdr*, int level, int type, int len, void *data);
 
+struct timespec;
+
+extern int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
+			  unsigned int flags, struct timespec *timeout);
 #endif
 #endif /* not kernel and not glibc */
 #endif /* _LINUX_SOCKET_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 80de700..a3532ef 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -25,6 +25,7 @@ struct linux_dirent64;
 struct list_head;
 struct msgbuf;
 struct msghdr;
+struct mmsghdr;
 struct msqid_ds;
 struct new_utsname;
 struct nfsctl_arg;
@@ -559,6 +560,9 @@ asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
 asmlinkage long sys_recvfrom(int, void __user *, size_t, unsigned,
 				struct sockaddr __user *, int __user *);
 asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg, unsigned flags);
+asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
+			     unsigned int vlen, unsigned flags,
+			     struct timespec __user *timeout);
 asmlinkage long sys_socket(int, int, int);
 asmlinkage long sys_socketpair(int, int, int, int __user *);
 asmlinkage long sys_socketcall(int call, unsigned long __user *args);
diff --git a/include/net/compat.h b/include/net/compat.h
index 5bbf8bf..96c38d8 100644
--- a/include/net/compat.h
+++ b/include/net/compat.h
@@ -18,6 +18,11 @@ struct compat_msghdr {
 	compat_uint_t	msg_flags;
 };
 
+struct compat_mmsghdr {
+	struct compat_msghdr msg_hdr;
+	compat_uint_t        msg_len;
+};
+
 struct compat_cmsghdr {
 	compat_size_t	cmsg_len;
 	compat_int_t	cmsg_level;
@@ -35,6 +40,9 @@ extern int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *);
 extern int verify_compat_iovec(struct msghdr *, struct iovec *, struct sockaddr *, int);
 extern asmlinkage long compat_sys_sendmsg(int,struct compat_msghdr __user *,unsigned);
 extern asmlinkage long compat_sys_recvmsg(int,struct compat_msghdr __user *,unsigned);
+extern asmlinkage long compat_sys_recvmmsg(int, struct compat_mmsghdr __user *,
+					   unsigned, unsigned,
+					   struct timespec __user *);
 extern asmlinkage long compat_sys_getsockopt(int, int, int, char __user *, int __user *);
 extern int put_cmsg_compat(struct msghdr*, int, int, int, void *);
 
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 68320f6..f581fb0 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,7 +48,9 @@ cond_syscall(sys_shutdown);
 cond_syscall(sys_sendmsg);
 cond_syscall(compat_sys_sendmsg);
 cond_syscall(sys_recvmsg);
+cond_syscall(sys_recvmmsg);
 cond_syscall(compat_sys_recvmsg);
+cond_syscall(compat_sys_recvmmsg);
 cond_syscall(sys_socketcall);
 cond_syscall(sys_futex);
 cond_syscall(compat_sys_futex);
diff --git a/net/compat.c b/net/compat.c
index 8d73905..9a149a6 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -727,10 +727,10 @@ EXPORT_SYMBOL(compat_mc_getsockopt);
 
 /* Argument list sizes for compat_sys_socketcall */
 #define AL(x) ((x) * sizeof(u32))
-static unsigned char nas[19]={AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
+static unsigned char nas[20]={AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
 				AL(3),AL(3),AL(4),AL(4),AL(4),AL(6),
 				AL(6),AL(2),AL(5),AL(5),AL(3),AL(3),
-				AL(4)};
+				AL(4),AL(5)};
 #undef AL
 
 asmlinkage long compat_sys_sendmsg(int fd, struct compat_msghdr __user *msg, unsigned flags)
@@ -743,13 +743,36 @@ asmlinkage long compat_sys_recvmsg(int fd, struct compat_msghdr __user *msg, uns
 	return sys_recvmsg(fd, (struct msghdr __user *)msg, flags | MSG_CMSG_COMPAT);
 }
 
+asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
+				    unsigned vlen, unsigned int flags,
+				    struct timespec __user *timeout)
+{
+	int datagrams;
+	struct timespec ktspec;
+	struct compat_timespec __user *utspec =
+			(struct compat_timespec __user *)timeout;
+
+	if (get_user(ktspec.tv_sec, &utspec->tv_sec) ||
+	    get_user(ktspec.tv_nsec, &utspec->tv_nsec))
+		return -EFAULT;
+
+	datagrams = __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
+				   flags | MSG_CMSG_COMPAT, &ktspec);
+	if (datagrams > 0 &&
+	    (put_user(ktspec.tv_sec, &utspec->tv_sec) ||
+	     put_user(ktspec.tv_nsec, &utspec->tv_nsec)))
+		datagrams = -EFAULT;
+
+	return datagrams;
+}
+
 asmlinkage long compat_sys_socketcall(int call, u32 __user *args)
 {
 	int ret;
 	u32 a[6];
 	u32 a0, a1;
 
-	if (call < SYS_SOCKET || call > SYS_ACCEPT4)
+	if (call < SYS_SOCKET || call > SYS_RECVMMSG)
 		return -EINVAL;
 	if (copy_from_user(a, args, nas[call]))
 		return -EFAULT;
@@ -810,6 +833,10 @@ asmlinkage long compat_sys_socketcall(int call, u32 __user *args)
 	case SYS_RECVMSG:
 		ret = compat_sys_recvmsg(a0, compat_ptr(a1), a[2]);
 		break;
+	case SYS_RECVMMSG:
+		ret = compat_sys_recvmmsg(a0, compat_ptr(a1), a[2], a[3],
+					  compat_ptr(a[4]));
+		break;
 	case SYS_ACCEPT4:
 		ret = sys_accept4(a0, compat_ptr(a1), compat_ptr(a[2]), a[3]);
 		break;
diff --git a/net/socket.c b/net/socket.c
index 6d47165..32db56a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -668,10 +668,9 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 
 EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
 
-static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
-				 struct msghdr *msg, size_t size, int flags)
+static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
+				       struct msghdr *msg, size_t size, int flags)
 {
-	int err;
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 
 	si->sock = sock;
@@ -680,13 +679,17 @@ static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
 	si->size = size;
 	si->flags = flags;
 
-	err = security_socket_recvmsg(sock, msg, size, flags);
-	if (err)
-		return err;
-
 	return sock->ops->recvmsg(iocb, sock, msg, size, flags);
 }
 
+static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
+				 struct msghdr *msg, size_t size, int flags)
+{
+	int err = security_socket_recvmsg(sock, msg, size, flags);
+
+	return err ?: __sock_recvmsg_nosec(iocb, sock, msg, size, flags);
+}
+
 int sock_recvmsg(struct socket *sock, struct msghdr *msg,
 		 size_t size, int flags)
 {
@@ -702,6 +705,21 @@ int sock_recvmsg(struct socket *sock, struct msghdr *msg,
 	return ret;
 }
 
+static int sock_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
+			      size_t size, int flags)
+{
+	struct kiocb iocb;
+	struct sock_iocb siocb;
+	int ret;
+
+	init_sync_kiocb(&iocb, NULL);
+	iocb.private = &siocb;
+	ret = __sock_recvmsg_nosec(&iocb, sock, msg, size, flags);
+	if (-EIOCBQUEUED == ret)
+		ret = wait_on_sync_kiocb(&iocb);
+	return ret;
+}
+
 int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
 		   struct kvec *vec, size_t num, size_t size, int flags)
 {
@@ -1965,22 +1983,15 @@ out:
 	return err;
 }
 
-/*
- *	BSD recvmsg interface
- */
-
-SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
-		unsigned int, flags)
+static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
+			 struct msghdr *msg_sys, unsigned flags, int nosec)
 {
 	struct compat_msghdr __user *msg_compat =
 	    (struct compat_msghdr __user *)msg;
-	struct socket *sock;
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
-	struct msghdr msg_sys;
 	unsigned long cmsg_ptr;
 	int err, iov_size, total_len, len;
-	int fput_needed;
 
 	/* kernel mode address */
 	struct sockaddr_storage addr;
@@ -1990,27 +2001,23 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 	int __user *uaddr_len;
 
 	if (MSG_CMSG_COMPAT & flags) {
-		if (get_compat_msghdr(&msg_sys, msg_compat))
+		if (get_compat_msghdr(msg_sys, msg_compat))
 			return -EFAULT;
 	}
-	else if (copy_from_user(&msg_sys, msg, sizeof(struct msghdr)))
+	else if (copy_from_user(msg_sys, msg, sizeof(struct msghdr)))
 		return -EFAULT;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
-	if (!sock)
-		goto out;
-
 	err = -EMSGSIZE;
-	if (msg_sys.msg_iovlen > UIO_MAXIOV)
-		goto out_put;
+	if (msg_sys->msg_iovlen > UIO_MAXIOV)
+		goto out;
 
 	/* Check whether to allocate the iovec area */
 	err = -ENOMEM;
-	iov_size = msg_sys.msg_iovlen * sizeof(struct iovec);
-	if (msg_sys.msg_iovlen > UIO_FASTIOV) {
+	iov_size = msg_sys->msg_iovlen * sizeof(struct iovec);
+	if (msg_sys->msg_iovlen > UIO_FASTIOV) {
 		iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
 		if (!iov)
-			goto out_put;
+			goto out;
 	}
 
 	/*
@@ -2018,46 +2025,47 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 	 *      kernel msghdr to use the kernel address space)
 	 */
 
-	uaddr = (__force void __user *)msg_sys.msg_name;
+	uaddr = (__force void __user *)msg_sys->msg_name;
 	uaddr_len = COMPAT_NAMELEN(msg);
 	if (MSG_CMSG_COMPAT & flags) {
-		err = verify_compat_iovec(&msg_sys, iov,
+		err = verify_compat_iovec(msg_sys, iov,
 					  (struct sockaddr *)&addr,
 					  VERIFY_WRITE);
 	} else
-		err = verify_iovec(&msg_sys, iov,
+		err = verify_iovec(msg_sys, iov,
 				   (struct sockaddr *)&addr,
 				   VERIFY_WRITE);
 	if (err < 0)
 		goto out_freeiov;
 	total_len = err;
 
-	cmsg_ptr = (unsigned long)msg_sys.msg_control;
-	msg_sys.msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
+	cmsg_ptr = (unsigned long)msg_sys->msg_control;
+	msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
 
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
-	err = sock_recvmsg(sock, &msg_sys, total_len, flags);
+	err = (nosec ? sock_recvmsg_nosec : sock_recvmsg)(sock, msg_sys,
+							  total_len, flags);
 	if (err < 0)
 		goto out_freeiov;
 	len = err;
 
 	if (uaddr != NULL) {
 		err = move_addr_to_user((struct sockaddr *)&addr,
-					msg_sys.msg_namelen, uaddr,
+					msg_sys->msg_namelen, uaddr,
 					uaddr_len);
 		if (err < 0)
 			goto out_freeiov;
 	}
-	err = __put_user((msg_sys.msg_flags & ~MSG_CMSG_COMPAT),
+	err = __put_user((msg_sys->msg_flags & ~MSG_CMSG_COMPAT),
 			 COMPAT_FLAGS(msg));
 	if (err)
 		goto out_freeiov;
 	if (MSG_CMSG_COMPAT & flags)
-		err = __put_user((unsigned long)msg_sys.msg_control - cmsg_ptr,
+		err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr,
 				 &msg_compat->msg_controllen);
 	else
-		err = __put_user((unsigned long)msg_sys.msg_control - cmsg_ptr,
+		err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr,
 				 &msg->msg_controllen);
 	if (err)
 		goto out_freeiov;
@@ -2066,21 +2074,150 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 out_freeiov:
 	if (iov != iovstack)
 		sock_kfree_s(sock->sk, iov, iov_size);
-out_put:
+out:
+	return err;
+}
+
+/*
+ *	BSD recvmsg interface
+ */
+
+SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
+		unsigned int, flags)
+{
+	int fput_needed, err;
+	struct msghdr msg_sys;
+	struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);
+
+	if (!sock)
+		goto out;
+
+	err = __sys_recvmsg(sock, msg, &msg_sys, flags, 0);
+
 	fput_light(sock->file, fput_needed);
 out:
 	return err;
 }
 
-#ifdef __ARCH_WANT_SYS_SOCKETCALL
+/*
+ *     Linux recvmmsg interface
+ */
+
+int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
+		   unsigned int flags, struct timespec *timeout)
+{
+	int fput_needed, err, datagrams;
+	struct socket *sock;
+	struct mmsghdr __user *entry;
+	struct msghdr msg_sys;
+	struct timespec end_time;
+
+	if (timeout &&
+	    poll_select_set_timeout(&end_time, timeout->tv_sec,
+				    timeout->tv_nsec))
+		return -EINVAL;
+
+	datagrams = 0;
+
+	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	if (!sock)
+		return err;
+
+	err = sock_error(sock->sk);
+	if (err)
+		goto out_put;
+
+	entry = mmsg;
+
+	while (datagrams < vlen) {
+		/*
+		 * No need to ask LSM for more than the first datagram.
+		 */
+		err = __sys_recvmsg(sock, (struct msghdr __user *)entry,
+				    &msg_sys, flags, datagrams);
+		if (err < 0)
+			break;
+		err = put_user(err, &entry->msg_len);
+		if (err)
+			break;
+		++entry;
+		++datagrams;
+
+		if (timeout) {
+			ktime_get_ts(timeout);
+			*timeout = timespec_sub(end_time, *timeout);
+			if (timeout->tv_sec < 0) {
+				timeout->tv_sec = timeout->tv_nsec = 0;
+				break;
+			}
+
+			/* Timeout, return less than vlen datagrams */
+			if (timeout->tv_nsec == 0 && timeout->tv_sec == 0)
+				break;
+		}
+
+		/* Out of band data, return right away */
+		if (msg_sys.msg_flags & MSG_OOB)
+			break;
+	}
+
+out_put:
+	fput_light(sock->file, fput_needed);
 
+	if (err == 0)
+		return datagrams;
+
+	if (datagrams != 0) {
+		/*
+		 * We may return less entries than requested (vlen) if the
+		 * sock is non block and there aren't enough datagrams...
+		 */
+		if (err != -EAGAIN) {
+			/*
+			 * ... or  if recvmsg returns an error after we
+			 * received some datagrams, where we record the
+			 * error to return on the next call or if the
+			 * app asks about it using getsockopt(SO_ERROR).
+			 */
+			sock->sk->sk_err = -err;
+		}
+
+		return datagrams;
+	}
+
+	return err;
+}
+
+SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
+		unsigned int, vlen, unsigned int, flags,
+		struct timespec __user *, timeout)
+{
+	int datagrams;
+	struct timespec timeout_sys;
+
+	if (!timeout)
+		return __sys_recvmmsg(fd, mmsg, vlen, flags, NULL);
+
+	if (copy_from_user(&timeout_sys, timeout, sizeof(timeout_sys)))
+		return -EFAULT;
+
+	datagrams = __sys_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys);
+
+	if (datagrams > 0 &&
+	    copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
+		datagrams = -EFAULT;
+
+	return datagrams;
+}
+
+#ifdef __ARCH_WANT_SYS_SOCKETCALL
 /* Argument list sizes for sys_socketcall */
 #define AL(x) ((x) * sizeof(unsigned long))
-static const unsigned char nargs[19]={
+static const unsigned char nargs[20] = {
 	AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
 	AL(3),AL(3),AL(4),AL(4),AL(4),AL(6),
 	AL(6),AL(2),AL(5),AL(5),AL(3),AL(3),
-	AL(4)
+	AL(4),AL(5)
 };
 
 #undef AL
@@ -2099,7 +2236,7 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
 	unsigned long a0, a1;
 	int err;
 
-	if (call < 1 || call > SYS_ACCEPT4)
+	if (call < 1 || call > SYS_RECVMMSG)
 		return -EINVAL;
 
 	/* copy_from_user should be SMP safe. */
@@ -2173,6 +2310,10 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
 	case SYS_RECVMSG:
 		err = sys_recvmsg(a0, (struct msghdr __user *)a1, a[2]);
 		break;
+	case SYS_RECVMMSG:
+		err = sys_recvmmsg(a0, (struct mmsghdr __user *)a1, a[2], a[3],
+				   (struct timespec __user *)a[4]);
+		break;
 	case SYS_ACCEPT4:
 		err = sys_accept4(a0, (struct sockaddr __user *)a1,
 				  (int __user *)a[2], a[3]);
-- 
1.6.2.5
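
The changelog of the patch above describes recvmmsg as filling an array of
struct mmsghdr entries in a single call and returning how many datagrams it
got. A minimal userspace sketch of that usage follows. It is not part of the
patch and rests on several assumptions: it invokes the raw syscall through
syscall(2) and relies on <sys/syscall.h> providing SYS_recvmmsg, which only
holds on kernels and C libraries newer than this RFC; struct mmsg_entry is a
local, hypothetical mirror of the struct mmsghdr added to
include/linux/socket.h; batch_receive() and demo_batch_receive() are
illustrative names; and an AF_UNIX datagram socketpair is used only to keep
the demo deterministic (a kernel carrying exactly this RFC series may
support the call on UDP sockets only).

```c
/* Sketch only: batch-receive datagrams through the raw recvmmsg syscall. */
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH 8		/* vlen: illustrative batch size */
#define BUFSZ 64	/* illustrative per-datagram buffer size */

/* Local mirror of the patch's struct mmsghdr (the name is hypothetical). */
struct mmsg_entry {
	struct msghdr msg_hdr;
	unsigned int  msg_len;	/* set by the kernel for each datagram */
};

/*
 * Receive up to BATCH datagrams from fd without blocking.
 * lens[i] receives the length of datagram i.
 * Returns the number of datagrams received, or -1 with errno set.
 */
int batch_receive(int fd, char bufs[BATCH][BUFSZ], unsigned int lens[BATCH])
{
	struct mmsg_entry msgs[BATCH];
	struct iovec iov[BATCH];
	long n;
	int i;

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = BUFSZ;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* NULL timeout: no time limit. MSG_DONTWAIT ends the batch as
	 * soon as the receive queue is empty. */
	n = syscall(SYS_recvmmsg, fd, msgs, BATCH, MSG_DONTWAIT, NULL);
	for (i = 0; i < n; i++)
		lens[i] = msgs[i].msg_len;
	return (int)n;
}

/*
 * Demo: queue three datagrams on a socketpair, then fetch them with a
 * single batch_receive() call. Returns the datagram count or -1.
 */
int demo_batch_receive(unsigned int lens[BATCH])
{
	static const char *const payload[] = { "one", "three", "fifteen" };
	char bufs[BATCH][BUFSZ];
	int sv[2], i, n = -1;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	for (i = 0; i < 3; i++)
		if (send(sv[0], payload[i], strlen(payload[i]), 0) < 0)
			goto out;
	n = batch_receive(sv[1], bufs, lens);
out:
	close(sv[0]);
	close(sv[1]);
	return n;
}
```

On a kernel where the syscall behaves as the changelog describes, one call
to demo_batch_receive() should report three datagrams, with lens[] holding
the payload lengths, at the cost of one kernel entry instead of three.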

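The receive loop in __sys_recvmmsg() above recomputes the remaining timeout
after every datagram: it subtracts the current time from the precomputed end
time, clamps a negative result to zero, and stops once nothing is left. The
arithmetic can be sketched in userspace C as below. update_remaining() is a
hypothetical helper written for illustration; the kernel code uses
ktime_get_ts() and timespec_sub() instead.

```c
#include <time.h>

/*
 * Sketch of the per-datagram timeout update:
 *   remaining = end_time - now, clamped to zero.
 * Returns 1 when the time budget is used up, 0 otherwise.
 */
int update_remaining(const struct timespec *end_time,
		     const struct timespec *now,
		     struct timespec *remaining)
{
	remaining->tv_sec = end_time->tv_sec - now->tv_sec;
	remaining->tv_nsec = end_time->tv_nsec - now->tv_nsec;
	if (remaining->tv_nsec < 0) {
		/* Borrow one second so tv_nsec stays in [0, 1e9). */
		remaining->tv_sec--;
		remaining->tv_nsec += 1000000000L;
	}
	if (remaining->tv_sec < 0) {
		/* Deadline already passed: report zero time left. */
		remaining->tv_sec = 0;
		remaining->tv_nsec = 0;
	}
	return remaining->tv_sec == 0 && remaining->tv_nsec == 0;
}
```

A caller would take `now` from a monotonic clock after each received
datagram and leave the loop when the helper returns 1, which corresponds to
returning fewer than vlen datagrams on timeout.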

[-- Attachment #3: 0002-net-Allow-protocols-to-provide-an-unlocked_recvmsg.patch --]
[-- Type: text/plain, Size: 37300 bytes --]

From ccafdce1eefb3d59793931e746f1f07722fcfbbe Mon Sep 17 00:00:00 2001
From: Arnaldo Carvalho de Melo <acme@redhat.com>
Date: Thu, 17 Sep 2009 18:48:32 -0300
Subject: [PATCH 2/2] net: Allow protocols to provide an unlocked_recvmsg socket method
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

So that recvmmsg can use it. With this patch recvmmsg actually _requires_ that
socket->ops->unlocked_recvmsg exists, and that socket->sk->sk_prot->unlocked_recvmsg
is non-NULL.

We may well switch back to the previous scheme where sys_recvmmsg checks if
the underlying protocol provides an unlocked version and uses it, falling
back to the locked version if there is none.

But first let's see if this works with recvmmsg alone and what kinds of gains we
get with the unlocked_recvmsg implementation in UDP. Follow-up patches can
restore that behaviour if we want to use it with, say, DCCP and SCTP without a
specific unlocked version.

This should address the concerns raised by Rémi about the MSG_UNLOCKED problem.

Cc: Caitlin Bestler <caitlin.bestler@gmail.com>
Cc: Chris Van Hoof <vanhoof@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Nir Tzachar <nir.tzachar@gmail.com>
Cc: Nivedita Singhvi <niv@us.ibm.com>
Cc: Paul Moore <paul.moore@hp.com>
Cc: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Cc: Steven Whitehouse <steve@chygwyn.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 drivers/isdn/mISDN/socket.c    |    2 +
 drivers/net/pppoe.c            |    1 +
 drivers/net/pppol2tp.c         |    1 +
 include/linux/net.h            |    7 +++
 include/net/sock.h             |   13 +++++
 net/appletalk/ddp.c            |    1 +
 net/atm/pvc.c                  |    1 +
 net/atm/svc.c                  |    1 +
 net/ax25/af_ax25.c             |    1 +
 net/bluetooth/bnep/sock.c      |    1 +
 net/bluetooth/cmtp/sock.c      |    1 +
 net/bluetooth/hci_sock.c       |    1 +
 net/bluetooth/hidp/sock.c      |    1 +
 net/bluetooth/l2cap.c          |    1 +
 net/bluetooth/rfcomm/sock.c    |    1 +
 net/bluetooth/sco.c            |    1 +
 net/can/bcm.c                  |    1 +
 net/can/raw.c                  |    1 +
 net/core/sock.c                |   26 +++++++++
 net/dccp/ipv4.c                |    1 +
 net/dccp/ipv6.c                |    1 +
 net/decnet/af_decnet.c         |    1 +
 net/econet/af_econet.c         |    1 +
 net/ieee802154/af_ieee802154.c |    2 +
 net/ipv4/af_inet.c             |    3 +
 net/ipv4/udp.c                 |   52 +++++++++++++++---
 net/ipv6/af_inet6.c            |    2 +
 net/ipv6/raw.c                 |    1 +
 net/ipx/af_ipx.c               |    1 +
 net/irda/af_irda.c             |    4 ++
 net/iucv/af_iucv.c             |    1 +
 net/key/af_key.c               |    1 +
 net/llc/af_llc.c               |    1 +
 net/netlink/af_netlink.c       |    1 +
 net/netrom/af_netrom.c         |    1 +
 net/packet/af_packet.c         |    2 +
 net/phonet/socket.c            |    2 +
 net/rds/af_rds.c               |    1 +
 net/rose/af_rose.c             |    1 +
 net/rxrpc/af_rxrpc.c           |    1 +
 net/sctp/ipv6.c                |    1 +
 net/sctp/protocol.c            |    1 +
 net/socket.c                   |  112 +++++++++++++++++++++++++++++++++++----
 net/tipc/socket.c              |    3 +
 net/unix/af_unix.c             |    3 +
 net/x25/af_x25.c               |    1 +
 46 files changed, 244 insertions(+), 21 deletions(-)

diff --git a/drivers/isdn/mISDN/socket.c b/drivers/isdn/mISDN/socket.c
index c36f521..6da3a71 100644
--- a/drivers/isdn/mISDN/socket.c
+++ b/drivers/isdn/mISDN/socket.c
@@ -590,6 +590,7 @@ static const struct proto_ops data_sock_ops = {
 	.getname	= data_sock_getname,
 	.sendmsg	= mISDN_sock_sendmsg,
 	.recvmsg	= mISDN_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= datagram_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
@@ -743,6 +744,7 @@ static const struct proto_ops base_sock_ops = {
 	.getname	= sock_no_getname,
 	.sendmsg	= sock_no_sendmsg,
 	.recvmsg	= sock_no_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= sock_no_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index 5f20902..bf30741 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -1121,6 +1121,7 @@ static const struct proto_ops pppoe_ops = {
 	.getsockopt	= sock_no_getsockopt,
 	.sendmsg	= pppoe_sendmsg,
 	.recvmsg	= pppoe_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.ioctl		= pppox_ioctl,
 };
diff --git a/drivers/net/pppol2tp.c b/drivers/net/pppol2tp.c
index e0f9219..af6160c 100644
--- a/drivers/net/pppol2tp.c
+++ b/drivers/net/pppol2tp.c
@@ -2590,6 +2590,7 @@ static struct proto_ops pppol2tp_ops = {
 	.getsockopt	= pppol2tp_getsockopt,
 	.sendmsg	= pppol2tp_sendmsg,
 	.recvmsg	= pppol2tp_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.ioctl		= pppox_ioctl,
 };
diff --git a/include/linux/net.h b/include/linux/net.h
index d67587a..8b852de 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -186,6 +186,10 @@ struct proto_ops {
 	int		(*recvmsg)   (struct kiocb *iocb, struct socket *sock,
 				      struct msghdr *m, size_t total_len,
 				      int flags);
+	int		(*unlocked_recvmsg)(struct kiocb *iocb,
+					    struct socket *sock,
+					    struct msghdr *m,
+					    size_t total_len, int flags);
 	int		(*mmap)	     (struct file *file, struct socket *sock,
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
@@ -316,6 +320,8 @@ SOCKCALL_WRAP(name, sendmsg, (struct kiocb *iocb, struct socket *sock, struct ms
 	      (iocb, sock, m, len)) \
 SOCKCALL_WRAP(name, recvmsg, (struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t len, int flags), \
 	      (iocb, sock, m, len, flags)) \
+SOCKCALL_WRAP(name, unlocked_recvmsg, (struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t len, int flags), \
+	      (iocb, sock, m, len, flags)) \
 SOCKCALL_WRAP(name, mmap, (struct file *file, struct socket *sock, struct vm_area_struct *vma), \
 	      (file, sock, vma)) \
 	      \
@@ -337,6 +343,7 @@ static const struct proto_ops name##_ops = {			\
 	.getsockopt	= __lock_##name##_getsockopt,	\
 	.sendmsg	= __lock_##name##_sendmsg,	\
 	.recvmsg	= __lock_##name##_recvmsg,	\
+	.unlocked_recvmsg = __lock_##name##_unlocked_recvmsg,	\
 	.mmap		= __lock_##name##_mmap,		\
 };
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 950409d..7c62428 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -644,6 +644,11 @@ struct proto {
 					   struct msghdr *msg,
 					size_t len, int noblock, int flags, 
 					int *addr_len);
+	int			(*unlocked_recvmsg)(struct kiocb *iocb,
+						    struct sock *sk,
+						    struct msghdr *msg,
+						    size_t len, int noblock,
+						    int flags, int *addr_len);
 	int			(*sendpage)(struct sock *sk, struct page *page,
 					int offset, size_t size, int flags);
 	int			(*bind)(struct sock *sk, 
@@ -998,6 +1003,11 @@ extern int                      sock_no_sendmsg(struct kiocb *, struct socket *,
 						struct msghdr *, size_t);
 extern int                      sock_no_recvmsg(struct kiocb *, struct socket *,
 						struct msghdr *, size_t, int);
+extern int			sock_no_unlocked_recvmsg(struct kiocb *iocb,
+							 struct socket *sock,
+							 struct msghdr *msg,
+							 size_t size,
+							 int flags);
 extern int			sock_no_mmap(struct file *file,
 					     struct socket *sock,
 					     struct vm_area_struct *vma);
@@ -1014,6 +1024,9 @@ extern int sock_common_getsockopt(struct socket *sock, int level, int optname,
 				  char __user *optval, int __user *optlen);
 extern int sock_common_recvmsg(struct kiocb *iocb, struct socket *sock,
 			       struct msghdr *msg, size_t size, int flags);
+extern int sock_common_unlocked_recvmsg(struct kiocb *iocb, struct socket *sock,
+					struct msghdr *msg, size_t size,
+					int flags);
 extern int sock_common_setsockopt(struct socket *sock, int level, int optname,
 				  char __user *optval, int optlen);
 extern int compat_sock_common_getsockopt(struct socket *sock, int level,
diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index 875eda5..100c5d7 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -1842,6 +1842,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(atalk_dgram_ops) = {
 	.getsockopt	= sock_no_getsockopt,
 	.sendmsg	= atalk_sendmsg,
 	.recvmsg	= atalk_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/atm/pvc.c b/net/atm/pvc.c
index e1d22d9..5c03749 100644
--- a/net/atm/pvc.c
+++ b/net/atm/pvc.c
@@ -122,6 +122,7 @@ static const struct proto_ops pvc_proto_ops = {
 	.getsockopt =	pvc_getsockopt,
 	.sendmsg =	vcc_sendmsg,
 	.recvmsg =	vcc_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/atm/svc.c b/net/atm/svc.c
index 7b831b5..6c66ae9 100644
--- a/net/atm/svc.c
+++ b/net/atm/svc.c
@@ -644,6 +644,7 @@ static const struct proto_ops svc_proto_ops = {
 	.setsockopt =	svc_setsockopt,
 	.getsockopt =	svc_getsockopt,
 	.sendmsg =	vcc_sendmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.recvmsg =	vcc_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index da0f64f..43f4f57 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1976,6 +1976,7 @@ static const struct proto_ops ax25_proto_ops = {
 	.getsockopt	= ax25_getsockopt,
 	.sendmsg	= ax25_sendmsg,
 	.recvmsg	= ax25_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/bluetooth/bnep/sock.c b/net/bluetooth/bnep/sock.c
index e857628..0b26b3c 100644
--- a/net/bluetooth/bnep/sock.c
+++ b/net/bluetooth/bnep/sock.c
@@ -178,6 +178,7 @@ static const struct proto_ops bnep_sock_ops = {
 	.getname	= sock_no_getname,
 	.sendmsg	= sock_no_sendmsg,
 	.recvmsg	= sock_no_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= sock_no_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
diff --git a/net/bluetooth/cmtp/sock.c b/net/bluetooth/cmtp/sock.c
index 16b0fad..72a4b5d 100644
--- a/net/bluetooth/cmtp/sock.c
+++ b/net/bluetooth/cmtp/sock.c
@@ -173,6 +173,7 @@ static const struct proto_ops cmtp_sock_ops = {
 	.getname	= sock_no_getname,
 	.sendmsg	= sock_no_sendmsg,
 	.recvmsg	= sock_no_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= sock_no_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 4f9621f..bd0aace 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -603,6 +603,7 @@ static const struct proto_ops hci_sock_ops = {
 	.getname	= hci_sock_getname,
 	.sendmsg	= hci_sock_sendmsg,
 	.recvmsg	= hci_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.ioctl		= hci_sock_ioctl,
 	.poll		= datagram_poll,
 	.listen		= sock_no_listen,
diff --git a/net/bluetooth/hidp/sock.c b/net/bluetooth/hidp/sock.c
index 37c9d7d..90b40e2 100644
--- a/net/bluetooth/hidp/sock.c
+++ b/net/bluetooth/hidp/sock.c
@@ -224,6 +224,7 @@ static const struct proto_ops hidp_sock_ops = {
 	.getname	= sock_no_getname,
 	.sendmsg	= sock_no_sendmsg,
 	.recvmsg	= sock_no_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= sock_no_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
diff --git a/net/bluetooth/l2cap.c b/net/bluetooth/l2cap.c
index bd0a4c1..945df03 100644
--- a/net/bluetooth/l2cap.c
+++ b/net/bluetooth/l2cap.c
@@ -2743,6 +2743,7 @@ static const struct proto_ops l2cap_sock_ops = {
 	.getname	= l2cap_sock_getname,
 	.sendmsg	= l2cap_sock_sendmsg,
 	.recvmsg	= l2cap_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= bt_sock_poll,
 	.ioctl		= bt_sock_ioctl,
 	.mmap		= sock_no_mmap,
diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
index 0b85e81..00b1a41 100644
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -1092,6 +1092,7 @@ static const struct proto_ops rfcomm_sock_ops = {
 	.getname	= rfcomm_sock_getname,
 	.sendmsg	= rfcomm_sock_sendmsg,
 	.recvmsg	= rfcomm_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.shutdown	= rfcomm_sock_shutdown,
 	.setsockopt	= rfcomm_sock_setsockopt,
 	.getsockopt	= rfcomm_sock_getsockopt,
diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c
index 51ae0c3..5ef7b5c 100644
--- a/net/bluetooth/sco.c
+++ b/net/bluetooth/sco.c
@@ -965,6 +965,7 @@ static const struct proto_ops sco_sock_ops = {
 	.getname	= sco_sock_getname,
 	.sendmsg	= sco_sock_sendmsg,
 	.recvmsg	= bt_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= bt_sock_poll,
 	.ioctl		= bt_sock_ioctl,
 	.mmap		= sock_no_mmap,
diff --git a/net/can/bcm.c b/net/can/bcm.c
index 72720c7..6e388b3 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1575,6 +1575,7 @@ static struct proto_ops bcm_ops __read_mostly = {
 	.getsockopt    = sock_no_getsockopt,
 	.sendmsg       = bcm_sendmsg,
 	.recvmsg       = bcm_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap          = sock_no_mmap,
 	.sendpage      = sock_no_sendpage,
 };
diff --git a/net/can/raw.c b/net/can/raw.c
index db3152d..b8fa610 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -730,6 +730,7 @@ static struct proto_ops raw_ops __read_mostly = {
 	.getsockopt    = raw_getsockopt,
 	.sendmsg       = raw_sendmsg,
 	.recvmsg       = raw_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap          = sock_no_mmap,
 	.sendpage      = sock_no_sendpage,
 };
diff --git a/net/core/sock.c b/net/core/sock.c
index 7633422..76a6279 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1643,6 +1643,13 @@ int sock_no_connect(struct socket *sock, struct sockaddr *saddr,
 }
 EXPORT_SYMBOL(sock_no_connect);
 
+int sock_no_unlocked_recvmsg(struct kiocb *iocb, struct socket *sock,
+			     struct msghdr *msg, size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL(sock_no_unlocked_recvmsg);
+
 int sock_no_socketpair(struct socket *sock1, struct socket *sock2)
 {
 	return -EOPNOTSUPP;
@@ -2004,6 +2011,25 @@ int sock_common_recvmsg(struct kiocb *iocb, struct socket *sock,
 }
 EXPORT_SYMBOL(sock_common_recvmsg);
 
+int sock_common_unlocked_recvmsg(struct kiocb *iocb, struct socket *sock,
+				 struct msghdr *msg, size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	int addr_len = 0;
+	int err;
+
+	if (sk->sk_prot->unlocked_recvmsg == NULL)
+		return -EOPNOTSUPP;
+
+	err = sk->sk_prot->unlocked_recvmsg(iocb, sk, msg, size,
+					    flags & MSG_DONTWAIT,
+					    flags & ~MSG_DONTWAIT, &addr_len);
+	if (err >= 0)
+		msg->msg_namelen = addr_len;
+	return err;
+}
+EXPORT_SYMBOL(sock_common_unlocked_recvmsg);
+
 /*
  *	Set socket options on an inet socket.
  */
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index a0a36c9..263c9b8 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -974,6 +974,7 @@ static const struct proto_ops inet_dccp_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 3e70faa..ae1f650 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -1175,6 +1175,7 @@ static struct proto_ops inet6_dccp_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 77d4028..aa1af0b 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -2348,6 +2348,7 @@ static const struct proto_ops dn_proto_ops = {
 	.getsockopt =	dn_getsockopt,
 	.sendmsg =	dn_sendmsg,
 	.recvmsg =	dn_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/econet/af_econet.c b/net/econet/af_econet.c
index f0bbc57..857ff5b 100644
--- a/net/econet/af_econet.c
+++ b/net/econet/af_econet.c
@@ -765,6 +765,7 @@ static const struct proto_ops econet_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	econet_sendmsg,
 	.recvmsg =	econet_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/ieee802154/af_ieee802154.c b/net/ieee802154/af_ieee802154.c
index af66180..1602409 100644
--- a/net/ieee802154/af_ieee802154.c
+++ b/net/ieee802154/af_ieee802154.c
@@ -197,6 +197,7 @@ static const struct proto_ops ieee802154_raw_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = ieee802154_sock_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
@@ -222,6 +223,7 @@ static const struct proto_ops ieee802154_dgram_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = ieee802154_sock_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 566ea6c..e8a44d4 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -854,6 +854,7 @@ const struct proto_ops inet_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = tcp_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = tcp_sendpage,
 	.splice_read	   = tcp_splice_read,
@@ -880,6 +881,7 @@ const struct proto_ops inet_dgram_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_common_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
@@ -909,6 +911,7 @@ static const struct proto_ops inet_sockraw_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 80e3812..4033ae5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -872,13 +872,34 @@ int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 	return 0;
 }
 
+static void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
+{
+	lock_sock(sk);
+	skb_free_datagram(sk, skb);
+	release_sock(sk);
+}
+
+static int skb_kill_datagram_locked(struct sock *sk, struct sk_buff *skb,
+				    unsigned int flags)
+{
+	int ret;
+	lock_sock(sk);
+	ret = skb_kill_datagram(sk, skb, flags);
+	release_sock(sk);
+	return ret;
+}
+
 /*
  * 	This should be easy, if there is something there we
  * 	return it, otherwise we block.
  */
-
-int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
-		size_t len, int noblock, int flags, int *addr_len)
+static int __udp_recvmsg(struct kiocb *iocb, struct sock *sk,
+			 struct msghdr *msg, size_t len, int noblock,
+			 int flags, int *addr_len,
+			 void (*free_datagram)(struct sock *,
+					       struct sk_buff *),
+			 int  (*kill_datagram)(struct sock *,
+					       struct sk_buff *, unsigned int))
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
@@ -956,23 +977,35 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
-	skb_free_datagram(sk, skb);
-	release_sock(sk);
+	free_datagram(sk, skb);
 out:
 	return err;
 
 csum_copy_err:
-	lock_sock(sk);
-	if (!skb_kill_datagram(sk, skb, flags))
+	if (!kill_datagram(sk, skb, flags))
 		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	release_sock(sk);
 
 	if (noblock)
 		return -EAGAIN;
 	goto try_again;
 }
 
+int udp_recvmsg(struct kiocb *iocb, struct sock *sk,
+		struct msghdr *msg, size_t len, int noblock,
+		int flags, int *addr_len)
+{
+	return __udp_recvmsg(iocb, sk, msg, len, noblock, flags, addr_len,
+			     skb_free_datagram_locked,
+			     skb_kill_datagram_locked);
+}
+
+int udp_unlocked_recvmsg(struct kiocb *iocb, struct sock *sk,
+			 struct msghdr *msg, size_t len, int noblock,
+			 int flags, int *addr_len)
+{
+	return __udp_recvmsg(iocb, sk, msg, len, noblock, flags, addr_len,
+			     skb_free_datagram, skb_kill_datagram);
+}
 
 int udp_disconnect(struct sock *sk, int flags)
 {
@@ -1565,6 +1598,7 @@ struct proto udp_prot = {
 	.getsockopt	   = udp_getsockopt,
 	.sendmsg	   = udp_sendmsg,
 	.recvmsg	   = udp_recvmsg,
+	.unlocked_recvmsg  = udp_unlocked_recvmsg,
 	.sendpage	   = udp_sendpage,
 	.backlog_rcv	   = __udp_queue_rcv_skb,
 	.hash		   = udp_lib_hash,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 45f9a2a..7d0cc2f 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -518,6 +518,7 @@ const struct proto_ops inet6_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = tcp_sendmsg,		/* ok		*/
 	.recvmsg	   = sock_common_recvmsg,	/* ok		*/
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = tcp_sendpage,
 	.splice_read	   = tcp_splice_read,
@@ -544,6 +545,7 @@ const struct proto_ops inet6_dgram_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet_sendmsg,		/* ok		*/
 	.recvmsg	   = sock_common_recvmsg,	/* ok		*/
+	.unlocked_recvmsg  = sock_common_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index d6c3c1c..c05ec59 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1326,6 +1326,7 @@ static const struct proto_ops inet6_sockraw_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet_sendmsg,		/* ok		*/
 	.recvmsg	   = sock_common_recvmsg,	/* ok		*/
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipx/af_ipx.c b/net/ipx/af_ipx.c
index f1118d9..45048a0 100644
--- a/net/ipx/af_ipx.c
+++ b/net/ipx/af_ipx.c
@@ -1953,6 +1953,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(ipx_dgram_ops) = {
 	.getsockopt	= ipx_getsockopt,
 	.sendmsg	= ipx_sendmsg,
 	.recvmsg	= ipx_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c
index 50b43c5..7e97581 100644
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -2489,6 +2489,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(irda_stream_ops) = {
 	.getsockopt =	irda_getsockopt,
 	.sendmsg =	irda_sendmsg,
 	.recvmsg =	irda_recvmsg_stream,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -2513,6 +2514,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(irda_seqpacket_ops) = {
 	.getsockopt =	irda_getsockopt,
 	.sendmsg =	irda_sendmsg,
 	.recvmsg =	irda_recvmsg_dgram,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -2537,6 +2539,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(irda_dgram_ops) = {
 	.getsockopt =	irda_getsockopt,
 	.sendmsg =	irda_sendmsg_dgram,
 	.recvmsg =	irda_recvmsg_dgram,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -2562,6 +2565,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(irda_ultra_ops) = {
 	.getsockopt =	irda_getsockopt,
 	.sendmsg =	irda_sendmsg_ultra,
 	.recvmsg =	irda_recvmsg_dgram,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 49c15b4..c208622 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -1693,6 +1693,7 @@ static struct proto_ops iucv_sock_ops = {
 	.getname	= iucv_sock_getname,
 	.sendmsg	= iucv_sock_sendmsg,
 	.recvmsg	= iucv_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= iucv_sock_poll,
 	.ioctl		= sock_no_ioctl,
 	.mmap		= sock_no_mmap,
diff --git a/net/key/af_key.c b/net/key/af_key.c
index dba9abd..f1af697 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -3636,6 +3636,7 @@ static const struct proto_ops pfkey_ops = {
 	.getsockopt	=	sock_no_getsockopt,
 	.mmap		=	sock_no_mmap,
 	.sendpage	=	sock_no_sendpage,
+	.unlocked_recvmsg =	sock_no_unlocked_recvmsg,
 
 	/* Now the operations that really occur. */
 	.release	=	pfkey_release,
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index c45eee1..d948caf 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -1115,6 +1115,7 @@ static const struct proto_ops llc_ui_ops = {
 	.getsockopt  = llc_ui_getsockopt,
 	.sendmsg     = llc_ui_sendmsg,
 	.recvmsg     = llc_ui_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap	     = sock_no_mmap,
 	.sendpage    = sock_no_sendpage,
 };
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 2936fa3..e7a51bb 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1978,6 +1978,7 @@ static const struct proto_ops netlink_ops = {
 	.getsockopt =	netlink_getsockopt,
 	.sendmsg =	netlink_sendmsg,
 	.recvmsg =	netlink_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index ce1a34b..3550d34 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -1395,6 +1395,7 @@ static const struct proto_ops nr_proto_ops = {
 	.getsockopt	=	nr_getsockopt,
 	.sendmsg	=	nr_sendmsg,
 	.recvmsg	=	nr_recvmsg,
+	.unlocked_recvmsg =	sock_no_unlocked_recvmsg,
 	.mmap		=	sock_no_mmap,
 	.sendpage	=	sock_no_sendpage,
 };
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ebe5718..dc5d7ff 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2340,6 +2340,7 @@ static const struct proto_ops packet_ops_spkt = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	packet_sendmsg_spkt,
 	.recvmsg =	packet_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -2361,6 +2362,7 @@ static const struct proto_ops packet_ops = {
 	.getsockopt =	packet_getsockopt,
 	.sendmsg =	packet_sendmsg,
 	.recvmsg =	packet_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		packet_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index ada2a35..2bd24a5 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -327,6 +327,7 @@ const struct proto_ops phonet_dgram_ops = {
 #endif
 	.sendmsg	= pn_socket_sendmsg,
 	.recvmsg	= sock_common_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
@@ -352,6 +353,7 @@ const struct proto_ops phonet_stream_ops = {
 #endif
 	.sendmsg	= pn_socket_sendmsg,
 	.recvmsg	= sock_common_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b11e7e5..3e8c846 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -377,6 +377,7 @@ static struct proto_ops rds_proto_ops = {
 	.getsockopt =	rds_getsockopt,
 	.sendmsg =	rds_sendmsg,
 	.recvmsg =	rds_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index e5f478c..a64c623 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1532,6 +1532,7 @@ static struct proto_ops rose_proto_ops = {
 	.getsockopt	=	rose_getsockopt,
 	.sendmsg	=	rose_sendmsg,
 	.recvmsg	=	rose_recvmsg,
+	.unlocked_recvmsg =	sock_no_unlocked_recvmsg,
 	.mmap		=	sock_no_mmap,
 	.sendpage	=	sock_no_sendpage,
 };
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index bfe493e..bf4c38a 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -766,6 +766,7 @@ static const struct proto_ops rxrpc_rpc_ops = {
 	.getsockopt	= sock_no_getsockopt,
 	.sendmsg	= rxrpc_sendmsg,
 	.recvmsg	= rxrpc_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 6a4b190..b68d9f9 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -918,6 +918,7 @@ static const struct proto_ops inet6_seqpacket_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_sock_common_setsockopt,
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index a76da65..78f52a3 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -897,6 +897,7 @@ static const struct proto_ops inet_seqpacket_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/socket.c b/net/socket.c
index 32db56a..dc5b976 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -690,6 +690,32 @@ static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
 	return err ?: __sock_recvmsg_nosec(iocb, sock, msg, size, flags);
 }
 
+static inline int __sock_unlocked_recvmsg_nosec(struct kiocb *iocb,
+						struct socket *sock,
+						struct msghdr *msg,
+						size_t size, int flags)
+{
+	struct sock_iocb *si = kiocb_to_siocb(iocb);
+
+	si->sock = sock;
+	si->scm = NULL;
+	si->msg = msg;
+	si->size = size;
+	si->flags = flags;
+
+	return sock->ops->unlocked_recvmsg(iocb, sock, msg, size, flags);
+}
+
+static inline int __sock_unlocked_recvmsg(struct kiocb *iocb,
+					  struct socket *sock,
+					  struct msghdr *msg, size_t size,
+					  int flags)
+{
+	int err = security_socket_recvmsg(sock, msg, size, flags);
+
+	return err ?: __sock_unlocked_recvmsg_nosec(iocb, sock, msg, size, flags);
+}
+
 int sock_recvmsg(struct socket *sock, struct msghdr *msg,
 		 size_t size, int flags)
 {
@@ -720,6 +746,58 @@ static int sock_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
 	return ret;
 }
 
+static int sock_unlocked_recvmsg(struct socket *sock, struct msghdr *msg,
+				 size_t size, int flags)
+{
+	struct kiocb iocb;
+	struct sock_iocb siocb;
+	int ret;
+
+	init_sync_kiocb(&iocb, NULL);
+	iocb.private = &siocb;
+	ret = __sock_unlocked_recvmsg(&iocb, sock, msg, size, flags);
+	if (-EIOCBQUEUED == ret)
+		ret = wait_on_sync_kiocb(&iocb);
+	return ret;
+}
+
+static int sock_unlocked_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
+				       size_t size, int flags)
+{
+	struct kiocb iocb;
+	struct sock_iocb siocb;
+	int ret;
+
+	init_sync_kiocb(&iocb, NULL);
+	iocb.private = &siocb;
+	ret = __sock_unlocked_recvmsg_nosec(&iocb, sock, msg, size, flags);
+	if (-EIOCBQUEUED == ret)
+		ret = wait_on_sync_kiocb(&iocb);
+	return ret;
+}
+
+enum sock_recvmsg_security {
+	SOCK_RECVMSG_SEC = 0,
+	SOCK_RECVMSG_NOSEC,
+};
+
+enum sock_recvmsg_locking {
+	SOCK_LOCKED_RECVMSG = 0,
+	SOCK_UNLOCKED_RECVMSG,
+};
+
+static int (*sock_recvmsg_table[2][2])(struct socket *sock, struct msghdr *msg,
+				       size_t size, int flags) = {
+	[SOCK_RECVMSG_SEC] = {
+		[SOCK_LOCKED_RECVMSG]	= sock_recvmsg, /* The old one */
+		[SOCK_UNLOCKED_RECVMSG] = sock_unlocked_recvmsg,
+	},
+	[SOCK_RECVMSG_NOSEC] = {
+		[SOCK_LOCKED_RECVMSG]	= sock_recvmsg_nosec,
+		[SOCK_UNLOCKED_RECVMSG] = sock_unlocked_recvmsg_nosec,
+	},
+};
+
 int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
 		   struct kvec *vec, size_t num, size_t size, int flags)
 {
@@ -1984,7 +2062,9 @@ out:
 }
 
 static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
-			 struct msghdr *msg_sys, unsigned flags, int nosec)
+			 struct msghdr *msg_sys, unsigned flags,
+			 enum sock_recvmsg_security security,
+			 enum sock_recvmsg_locking locking)
 {
 	struct compat_msghdr __user *msg_compat =
 	    (struct compat_msghdr __user *)msg;
@@ -2044,8 +2124,8 @@ static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
 
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
-	err = (nosec ? sock_recvmsg_nosec : sock_recvmsg)(sock, msg_sys,
-							  total_len, flags);
+	err = sock_recvmsg_table[security][locking](sock, msg_sys,
+						    total_len, flags);
 	if (err < 0)
 		goto out_freeiov;
 	len = err;
@@ -2092,7 +2172,8 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 	if (!sock)
 		goto out;
 
-	err = __sys_recvmsg(sock, msg, &msg_sys, flags, 0);
+	err = __sys_recvmsg(sock, msg, &msg_sys, flags,
+			    SOCK_RECVMSG_SEC, SOCK_LOCKED_RECVMSG);
 
 	fput_light(sock->file, fput_needed);
 out:
@@ -2111,6 +2192,7 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	struct mmsghdr __user *entry;
 	struct msghdr msg_sys;
 	struct timespec end_time;
+	enum sock_recvmsg_security security;
 
 	if (timeout &&
 	    poll_select_set_timeout(&end_time, timeout->tv_sec,
@@ -2123,20 +2205,25 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	if (!sock)
 		return err;
 
+	lock_sock(sock->sk);
+
 	err = sock_error(sock->sk);
 	if (err)
 		goto out_put;
 
 	entry = mmsg;
 
+	security = SOCK_RECVMSG_SEC;
 	while (datagrams < vlen) {
-		/*
-		 * No need to ask LSM for more than the first datagram.
-		 */
 		err = __sys_recvmsg(sock, (struct msghdr __user *)entry,
-				    &msg_sys, flags, datagrams);
+				    &msg_sys, flags, security,
+				    SOCK_UNLOCKED_RECVMSG);
 		if (err < 0)
 			break;
+		/*
+		 * No need to ask LSM for more than the first datagram.
+		 */
+		security = SOCK_RECVMSG_NOSEC;
 		err = put_user(err, &entry->msg_len);
 		if (err)
 			break;
@@ -2165,9 +2252,8 @@ out_put:
 	fput_light(sock->file, fput_needed);
 
 	if (err == 0)
-		return datagrams;
-
-	if (datagrams != 0) {
+		err = datagrams;
+	else if (datagrams != 0) {
 		/*
 		 * We may return less entries than requested (vlen) if the
 		 * sock is non block and there aren't enough datagrams...
@@ -2182,9 +2268,11 @@ out_put:
 			sock->sk->sk_err = -err;
 		}
 
-		return datagrams;
+		err = datagrams;
 	}
 
+	release_sock(sock->sk);
+
 	return err;
 }
 
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1848693..141539b 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1791,6 +1791,7 @@ static const struct proto_ops msg_ops = {
 	.getsockopt	= getsockopt,
 	.sendmsg	= send_msg,
 	.recvmsg	= recv_msg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage
 };
@@ -1812,6 +1813,7 @@ static const struct proto_ops packet_ops = {
 	.getsockopt	= getsockopt,
 	.sendmsg	= send_packet,
 	.recvmsg	= recv_msg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage
 };
@@ -1833,6 +1835,7 @@ static const struct proto_ops stream_ops = {
 	.getsockopt	= getsockopt,
 	.sendmsg	= send_stream,
 	.recvmsg	= recv_stream,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fc3ebb9..7e726a6 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -521,6 +521,7 @@ static const struct proto_ops unix_stream_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	unix_stream_sendmsg,
 	.recvmsg =	unix_stream_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -542,6 +543,7 @@ static const struct proto_ops unix_dgram_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	unix_dgram_sendmsg,
 	.recvmsg =	unix_dgram_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -563,6 +565,7 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	unix_seqpacket_sendmsg,
 	.recvmsg =	unix_dgram_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 5e6c072..7c20b26 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1620,6 +1620,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(x25_proto_ops) = {
 	.getsockopt =	x25_getsockopt,
 	.sendmsg =	x25_sendmsg,
 	.recvmsg =	x25_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
-- 
1.6.2.5


^ permalink raw reply related

* Re: Netlink API for bonding ?
From: Stephen Hemminger @ 2009-09-17 21:51 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: Jay Vosburgh, bonding-devel, netdev, Jiri Pirko
In-Reply-To: <4AB2ADBE.1060402@free.fr>

On Thu, 17 Sep 2009 23:44:30 +0200
Nicolas de Pesloüan <nicolas.2p.debian@free.fr> wrote:

> Stephen Hemminger wrote:
> > On Mon, 31 Aug 2009 22:34:50 +0200
> > Nicolas de Pesloüan <nicolas.2p.debian@free.fr> wrote:
> > 
> >> Stephen,
> >>
> >> Can you please describe the netlink API you plan to implement for bonding ?
> >>
> >> Both Jiri Pirko and I plan to add some advanced active slave selection rules, 
> >> for more-than-two-slaves bonding configuration.
> >>
> >> Jay suggested that such advanced features be implemented in user space, using 
> >> netlink to notify a daemon when slaves come up or fall down. I agree with Jay, 
> >> but don't want to design something without having first a view at your proposed 
> >> API for bonding.
> >>
> >> Do you plan to have some notification to user space, or only the ability to read 
> >> and set bonding configuration using netlink ?
> >>
> >> Thanks,
> >>
> >> 	Nicolas.
> > 
> > No paper spec, but was looking to add interface similar to vlan and macvlan.
> > Just use (and extend if needed) existing rtnl_link_ops.
> > 
> > 
> > Was not planning on adding a notification interface, that's a good idea but
> > really not what I was looking at.
> 
> What kind of notification system would you suggest to notify userland that a 
> given bond device just lost the current active slave ?

First why should user land care?  Unless all slaves are gone maybe it
should just be transparent.

Use existing link ops mechanism (see vlan and macvlan). You may need
to add new operations, but these should be generic enough so that bonding and bridging
have the same operations. 

     .newlink => create bond device
     .dellink => remove bond device
     .newport => add slave
     .delport => remove slave

Also, dellink should always work (even if slaves are present).
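
To make the shape of such an ops table concrete, here is a rough
userspace-only sketch. It is not kernel code: struct link_sketch,
struct link_ops_sketch and the sketch_* functions are hypothetical names
invented for illustration, and newport/delport are the proposed
operations, which the existing rtnl_link_ops does not have.

```c
#include <stddef.h>

/* Hypothetical model of a master device (bond or bridge). */
struct link_sketch {
	int nports;	/* number of attached ports */
	int alive;	/* 1 between newlink and dellink */
};

/* Hypothetical ops table: newlink/dellink mirror the existing link ops,
 * newport/delport are the proposed generic additions. */
struct link_ops_sketch {
	const char *kind;
	int (*newlink)(struct link_sketch *l);
	int (*dellink)(struct link_sketch *l);
	int (*newport)(struct link_sketch *l);
	int (*delport)(struct link_sketch *l);
};

static int sketch_newlink(struct link_sketch *l)
{
	l->nports = 0;
	l->alive = 1;
	return 0;
}

/* dellink always works: any remaining ports are released first. */
static int sketch_dellink(struct link_sketch *l)
{
	l->nports = 0;
	l->alive = 0;
	return 0;
}

static int sketch_newport(struct link_sketch *l)
{
	if (!l->alive)
		return -1;	/* no device to attach to */
	l->nports++;
	return 0;
}

static int sketch_delport(struct link_sketch *l)
{
	if (l->nports == 0)
		return -1;	/* nothing attached */
	l->nports--;
	return 0;
}

/* Bonding and bridging could share the same set of operations. */
static const struct link_ops_sketch bond_ops_sketch = {
	.kind    = "bond",
	.newlink = sketch_newlink,
	.dellink = sketch_dellink,
	.newport = sketch_newport,
	.delport = sketch_delport,
};
```

The only point of the sketch is that dellink succeeds regardless of how
many ports are still attached, and that nothing in the table mentions
"slave".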


The term "slave" is not widely used outside of bonding, and so probably
shouldn't be buried in the API.

^ permalink raw reply

* Re: Netlink API for bonding ?
From: Nicolas de Pesloüan @ 2009-09-17 21:44 UTC (permalink / raw)
  To: Stephen Hemminger, Jay Vosburgh, bonding-devel, netdev; +Cc: Jiri Pirko
In-Reply-To: <20090831150000.4bcd1481@nehalam>

Stephen Hemminger wrote:
> On Mon, 31 Aug 2009 22:34:50 +0200
> Nicolas de Pesloüan <nicolas.2p.debian@free.fr> wrote:
> 
>> Stephen,
>>
>> Can you please describe the netlink API you plan to implement for bonding ?
>>
>> Both Jiri Pirko and I plan to add some advanced active slave selection rules, 
>> for more-than-two-slaves bonding configuration.
>>
>> Jay suggested that such advanced features be implemented in user space, using 
>> netlink to notify a daemon when slaves come up or fall down. I agree with Jay, 
>> but don't want to design something without having first a view at your proposed 
>> API for bonding.
>>
>> Do you plan to have some notification to user space, or only the ability to read 
>> and set bonding configuration using netlink ?
>>
>> Thanks,
>>
>> 	Nicolas.
> 
> No paper spec, but was looking to add interface similar to vlan and macvlan.
> Just use (and extend if needed) existing rtnl_link_ops.
> 
> 
> Was not planning on adding a notification interface, that's a good idea but
> really not what I was looking at.

What kind of notification system would you suggest to notify userland that a 
given bond device just lost the current active slave ?

1/ Adding to the list of broadcast groups (RTMGRP_*) for the NETLINK_ROUTE 
protocol in include/linux/rtnetlink.h.

2/ Registering a new NETLINK protocol NETLINK_BONDING in include/linux/netlink.h 
and one or more broadcast groups for this new protocol ?

3/ Not using a broadcast group for notification, but expecting userland to 
register with the driver using a rtnl_link_ops attribute, to give its PID, so the 
driver can then send unicast netlink messages to userland, which would bind() on 
NETLINK_ROUTE ?

4/ Using NETLINK_GENERIC in some way ?

Also, we need a way to ensure that userland is still available to decide 
quickly what to do when the active slave disappears. At least some sort of 
timeout that, when elapsed, would cause the bonding driver to fall back to the 
normal behavior.

Should the notification message hold all the available information about the 
current status of the bonding device, so that userland is able to decide 
quickly, without asking the driver to provide extra information ? This would 
require the receiving buffer to be very large, and with a variable length, 
because the status length depends on the number of slaves for this particular 
bonding device. Not really nice...

Another way would be to simply notify userland that "something happens to bond 
device bondX", and expect userland to ask for the information, by first asking 
for the buffer size, then asking to fill the buffer. This would lead to some 
extra process time, that might be too long to be acceptable to select a new 
active slave.

Any comments ?

	Nicolas.

^ permalink raw reply

* Re: [RFCv4 PATCH 2/2] net: Allow protocols to provide an unlocked_recvmsg socket method
From: Arnaldo Carvalho de Melo @ 2009-09-17 21:21 UTC (permalink / raw)
  To: Nir Tzachar
  Cc: David Miller, Linux Networking Development Mailing List,
	Caitlin Bestler, Chris Van Hoof, Clark Williams, Neil Horman,
	Nivedita Singhvi, Paul Moore, Rémi Denis-Courmont,
	Steven Whitehouse, Ziv Ayalon
In-Reply-To: <9b2db90b0909170709n400859c6q13514b315970dde9@mail.gmail.com>

On Thu, Sep 17, 2009 at 05:09:19PM +0300, Nir Tzachar wrote:
> Hello.
> 
> Below are some test results with the patch (only part 1, as I did not
> manage to apply part 2).

I forgot to mention that the patches were made against DaveM's
net-next-2.6 tree at:

git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6

If you have a linux-2.6 git tree, just do:

cd linux-2.6
git remote add net-next git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
git fetch net-next
git checkout -b net-next-recvmmsg net-next/master

And you should be able to apply the two patches cleanly.

> The test application is attached below, and works as follows:
> 
> I set out to measure the latency which can be saved by this patch, and
> the application is designed accordingly. It is composed of three
> parts: a producer, which time-stamps packets and sends them as fast as
> possible, a mirror, which receives messages and bounces them to a
> remote destination and finally, a consumer, which receives messages as
> fast as possible and measures latency and throughput.
> 
> Both the producer and consumer are executed on the same host and the
> mirror on a remote host. Both hosts are running linux 2.6.31 with v4
> of the patch (but, as I said before, only part 1, with the unlocked_*
> stuff). All processes are executed under SCHED_FIFO. Both hosts are

Here is the problem, the patch, as mentioned above, was made against
net-next-2.6.

I'll rework the 2nd patch so that you can test with both.

> connected by a switched 1G Ethernet network. The mirror is executed on
> an 8-core Nehalem beast, and the producer and consumer on my desktop,
> which is a quad. /proc/cpuinfo and lspcis and .configs can be supplied
> if needed. Network cards are Intel Corporation 82566DM-2 Gigabit
> Network and Broadcom Corporation NetXtreme II BCM5709 Gigabit
> Ethernet.
> 
> The results (which follow below) clearly show the advantages of using
> recvmmsg over recvmsg both latency wise and throughput wise. The
> addition of a sendmmsg would also have a huge impact, IMO.

Yeah, there are even some smarts that can be done in the sendmmsg case,
like avoiding passing the same payload to multiple destinations, just
marking the mmsghdr size with zero that would thus mean "use the latest
non-zero sized payload".
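
A rough userspace sketch of that convention, just to pin the idea down.
Nothing here is an existing interface: struct msg_sketch and
resolve_payloads() are hypothetical names, and a real sendmmsg would
carry a struct msghdr with iovecs rather than a bare pointer and length.

```c
#include <stddef.h>

/* Hypothetical per-message descriptor. */
struct msg_sketch {
	const void *payload;
	size_t len;	/* 0 means "reuse the latest non-zero sized payload" */
};

/*
 * Fill in the reused payloads in place. Returns the number of entries
 * resolved from the start of the vector; it stops early if a zero-sized
 * entry appears before any payload has been seen.
 */
static size_t resolve_payloads(struct msg_sketch *v, size_t n)
{
	const struct msg_sketch *last = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (v[i].len != 0) {
			last = &v[i];		/* new payload, remember it */
		} else if (last) {
			v[i].payload = last->payload;
			v[i].len = last->len;
		} else {
			break;			/* nothing to reuse yet */
		}
	}
	return i;
}
```

With this, a caller sending one payload to N destinations fills in the
first entry and leaves the remaining N-1 sizes at zero.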

> Receiving batches of 30 packets, each of 1024 bytes, results with no
> latency improvements, but with a ~55% throughput improvement, from 72
> megabytes per second to  111. Repeating the same test, but with
> batches of 3000, displays the same behaviour. The more interesting
> result (to me, at least :) is when using small packets. Sending
> packets of size 100 and receiving in batches of 30  gives 470 micro
> latency and 244669 packets per second. On the other hand, without
> recvmmsg we get 750 micro latency and 210818 packets per second. A
> huge improvement here.
> 
> I think that with a bit more tinkering we can even stretch these results a bit.

I guess so too, with luck I'll be able to test this over a 10 Gbit/s
link today, will use my and your test cases.

Thanks a lot!
 
- Arnaldo

^ permalink raw reply

* Re: [ANNOUNCE] new iptables module match large amount of ip addresses
From: Mikulas Patocka @ 2009-09-17 20:36 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netfilter-devel, netdev
In-Reply-To: <1253217817.21074.9.camel@ice-age>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 736 bytes --]

On Thu, 17 Sep 2009, Eric Leblond wrote:

> Hi,
> 
> On Thursday 17 September 2009 at 21:15 +0200, Mikulas Patocka wrote:
> > Hi
> > 
> > Here I submit an iptables module that can match large amounts (millions) 
> > of ip addresses efficiently using binary search.
> 
> What are the differences with ipset ? (http://ipset.netfilter.org/)
> 
> BR,

What I wrote is static --- once loaded, then used. The only way to update 
the addresses is to reload it. Ipset is dynamic (and has more memory 
consumption because of it). In my implementation, the kernel reads the ip 
addresses, in ipset, the userspace tool reads them. 

I didn't know about ipset before because it is not in the kernel (will it 
ever be?)

Mikulas

^ permalink raw reply

* Re: fanotify as syscalls
From: Andreas Gruenbacher @ 2009-09-17 20:07 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Eric Paris, Linus Torvalds, Evgeniy Polyakov, David Miller,
	linux-kernel, linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <20090916121708.GD29359@shareable.org>

On Wednesday, 16 September 2009 14:17:08 Jamie Lokier wrote:
> Eric Paris wrote:
> > On Wed, 2009-09-16 at 08:52 +0100, Jamie Lokier wrote:
> > > Seriously, what does system-wide fanotify do when run from a
> > > chroot/namespace/cgroup, and a file outside them is accessed?
> >
> > At the moment an fanotify global listener is system wide.  Truly system
> > wide.  A gentleman from SUSE is looking to rectify the problem so that if
> > run inside a namespace it stays inside the namespace.  Note that this
> > particular little tidbit is not in the 8 patches I proposed.  At the
> > moment those just include the UI and basic notification.
>
> I'll be really interested in the gentleman's solution.

I guess Eric meant me.

From my point of view, "global" events make no sense, and fanotify listeners 
should register which directories they are interested in (e.g., include "/", 
exclude "/proc"). This takes care of chroots and namespaces as well.

I think we want to register for events on objects rather than in the 
namespace, i.e., for inodes visible in multiple places because of hardlinks 
or bind mounts, we get the same kinds of events no matter which path is used. 
(The path actually used would still show up in /proc/self/fd/x.) When moving 
registered inodes, the registrations would move with them. This is how 
inotify works, except that inotify watches are not recursive.

The difficulty with this is that in the worst case, this would require walking 
the entire namespace and all cached inodes. I don't see how this could be 
done for two reasons:

 * First, we can't take the vfsmount_lock and dcache_lock for the entire time.

 * Second, we would need to pin almost all the inodes, which is a clear no-go.

   [Why pin?  At least we would need to remember which objects a listener has
    registered interest in, so we need to pin the inodes.  We could still
    allow unregistered directory inodes to be thrown out because we can
    recreate their registration status from the parent. We can't recreate the
    registration status of non-directories because of hardlinks, though.]

The only other idea I could come up with is to only allow recursive 
registrations at mount points: instead of inodes, the vfsmounts would be 
included or excluded (probably automatically including bind mounts). This has 
one big drawback though: users would no longer be able to watch arbitrary 
subtrees. Privileged users could still arrange to watch almost all 
subtrees with bind mounts (mount --bind /foo/bar /foo/bar).

Any ideas?

Thanks,
Andreas

^ permalink raw reply

* [PATCH] i2400m: minimal ethtool support
From: Dan Williams @ 2009-09-17 20:06 UTC (permalink / raw)
  To: Inaky Perez-Gonzalez; +Cc: wimax@linuxwimax.org, netdev

Add minimal ethtool support for carrier detection.

Signed-off-by: Dan Williams <dcbw@redhat.com>


diff --git a/drivers/net/wimax/i2400m/netdev.c b/drivers/net/wimax/i2400m/netdev.c
index 9653f47..c915775 100644
--- a/drivers/net/wimax/i2400m/netdev.c
+++ b/drivers/net/wimax/i2400m/netdev.c
@@ -74,6 +74,7 @@
  */
 #include <linux/if_arp.h>
 #include <linux/netdevice.h>
+#include <linux/ethtool.h>
 #include "i2400m.h"
 
 
@@ -559,6 +560,22 @@ static const struct net_device_ops i2400m_netdev_ops = {
 	.ndo_change_mtu = i2400m_change_mtu,
 };
 
+static void i2400m_get_drvinfo(struct net_device *net_dev,
+			       struct ethtool_drvinfo *info)
+{
+	struct i2400m *i2400m = net_dev_to_i2400m(net_dev);
+
+	strncpy(info->driver, KBUILD_MODNAME, sizeof(info->driver) - 1);
+	strncpy(info->fw_version, i2400m->fw_name, sizeof(info->fw_version) - 1);
+	if (net_dev->dev.parent)
+		strncpy(info->bus_info, dev_name(net_dev->dev.parent),
+			sizeof(info->bus_info) - 1);
+}
+
+static const struct ethtool_ops i2400m_ethtool_ops = {
+	.get_drvinfo = i2400m_get_drvinfo,
+	.get_link = ethtool_op_get_link,
+};
 
 /**
  * i2400m_netdev_setup - Setup setup @net_dev's i2400m private data
@@ -580,6 +597,7 @@ void i2400m_netdev_setup(struct net_device *net_dev)
 		   & ~IFF_MULTICAST);
 	net_dev->watchdog_timeo = I2400M_TX_TIMEOUT;
 	net_dev->netdev_ops = &i2400m_netdev_ops;
+	net_dev->ethtool_ops = &i2400m_ethtool_ops;
 	d_fnend(3, NULL, "(net_dev %p) = void\n", net_dev);
 }
 EXPORT_SYMBOL_GPL(i2400m_netdev_setup);


^ permalink raw reply related

* Re: [ANNOUNCE] new iptables module match large amount of ip addresses
From: Eric Leblond @ 2009-09-17 20:03 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: netfilter-devel, netdev
In-Reply-To: <Pine.LNX.4.64.0909172100220.27299@artax.karlin.mff.cuni.cz>

[-- Attachment #1: Type: text/plain, Size: 1900 bytes --]

Hi,

On Thursday 17 September 2009 at 21:15 +0200, Mikulas Patocka wrote:
> Hi
> 
> Here I submit an iptables module that can match large amounts (millions) 
> of ip addresses efficiently using binary search.

What are the differences with ipset ? (http://ipset.netfilter.org/)

BR,

>  I needed it to protect my 
> home network from spam. It may be useful for other people too, so if you 
> want it, you can take it and add it to the kernel.
> 
> Get the patches for netfilter and kernel at:
> http://artax.karlin.mff.cuni.cz/~mikulas/xt_ipfile/
> (you need to copy the file include/linux/netfilter/xt_ipfile.h from kernel 
> sources to /usr/include/linux/netfilter/ to compile the userspace)
> 
> The main features:
> - fast matching of large amount of ip addresses using binary search.
> - an ability to match ranges of addresses or address/mask subnets.
> - fast loading of the addresses (on Pentium 3 850, 2 million addresses 
> load in 5.5s, if they are already sorted in the file, the load time is 
> just 1.5s).
> - memory efficient --- consumes only 8 bytes per address.
> 
> USAGE:
> 
> prepare a file with addresses, in this example /root/firewall/bad-ips. One 
> entry per line, the allowed formats are:
> 1.2.3.4
> 1.2.3.0/24
> 1.2.3.4-1.2.3.8
> 
> insert it into iptables with:
> iptables -A SPAM -m ipfile --src-file /root/firewall/bad-ips -j DROP
> 
> The module doesn't support ipv6 because I don't use it, but it's generic 
> enough that it could be extended for it. It could be also extended to 
> match ethernet MAC addresses.
> 
> Mikulas
-- 
Eric Leblond <eric@inl.fr>
INL: http://www.inl.fr/
NuFW: http://www.nufw.org/

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* [ANNOUNCE] new iptables module match large amount of ip addresses
From: Mikulas Patocka @ 2009-09-17 19:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev

Hi

Here I submit an iptables module that can match large amounts (millions) 
of ip addresses efficiently using binary search. I needed it to protect my 
home network from spam. It may be useful for other people too, so if you 
want it, you can take it and add it to the kernel.

Get the patches for netfilter and kernel at:
http://artax.karlin.mff.cuni.cz/~mikulas/xt_ipfile/
(you need to copy the file include/linux/netfilter/xt_ipfile.h from kernel 
sources to /usr/include/linux/netfilter/ to compile the userspace)

The main features:
- fast matching of large amount of ip addresses using binary search.
- an ability to match ranges of addresses or address/mask subnets.
- fast loading of the addresses (on Pentium 3 850, 2 million addresses 
load in 5.5s, if they are already sorted in the file, the load time is 
just 1.5s).
- memory efficient --- consumes only 8 bytes per address.
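
For readers who want to see the matching idea, here is a minimal
userspace sketch, not the xt_ipfile code itself. It assumes each entry is
an inclusive range stored as two 32-bit host-order values (which is one
layout that gives 8 bytes per entry), that the table is sorted by start
address, and that ranges do not overlap. struct ip_range,
range_from_cidr() and ip_ranges_match() are names made up for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed entry layout: inclusive [start, end], host byte order. */
struct ip_range {
	uint32_t start;
	uint32_t end;
};

/* Turn an address and a prefix length (0..32) into an inclusive range. */
static struct ip_range range_from_cidr(uint32_t addr, unsigned int prefix)
{
	uint32_t mask = prefix ? ~(uint32_t)0 << (32 - prefix) : 0;
	struct ip_range r = { addr & mask, (addr & mask) | ~mask };

	return r;
}

/*
 * Binary search over a sorted, non-overlapping table.
 * Returns 1 if addr falls inside some range, 0 otherwise.
 */
static int ip_ranges_match(const struct ip_range *tbl, size_t n, uint32_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < tbl[mid].start)
			hi = mid;		/* look in the lower half */
		else if (addr > tbl[mid].end)
			lo = mid + 1;		/* look in the upper half */
		else
			return 1;		/* start <= addr <= end */
	}
	return 0;
}
```

A single address is just a range with start == end, so all three input
formats reduce to the same entry type and a lookup costs O(log n)
comparisons.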

USAGE:

prepare a file with addresses, in this example /root/firewall/bad-ips. One 
entry per line, the allowed formats are:
1.2.3.4
1.2.3.0/24
1.2.3.4-1.2.3.8

insert it into iptables with:
iptables -A SPAM -m ipfile --src-file /root/firewall/bad-ips -j DROP

The module doesn't support ipv6 because I don't use it, but it's generic 
enough that it could be extended for it. It could be also extended to 
match ethernet MAC addresses.

Mikulas

^ permalink raw reply

* RE: [PATCH] ks8851_ml ethernet network driver
From: Choi, David @ 2009-09-17 19:30 UTC (permalink / raw)
  To: David Miller, greg; +Cc: netdev, Li, Charles, Choi, jgarzik, shemminger
In-Reply-To: <20090916.204801.190052862.davem@davemloft.net>

Hello David Miller,

Sorry for the resend. My previous e-mail included only part of the patch,
not the complete patch.

My fix-ups are as follows:
	-Remove DEBUG definition
	-Remove MALLOC definition
	-Attempt to fix the compile warnings, which I cannot reproduce
	 in my test environment (linux-2.6.31-rc3 source tree and
	 gcc 4.2.1). If you still see warnings with my fix-ups, please
	 give me a brief description of how to reproduce them.

===================
--- linux-2.6.31-rc3/drivers/net/ks8851_mll.c.orig	2009-09-17 10:18:56.000000000 -0700
+++ linux-2.6.31-rc3/drivers/net/ks8851_mll.c	2009-09-17 10:09:37.000000000 -0700
@@ -21,8 +21,6 @@
  * KS8851 16bit MLL chip from Micrel Inc.
  */
 
-#define DEBUG
-
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/netdevice.h>
@@ -419,7 +417,6 @@ union ks_tx_hdr {
  * or one of the work queues.
  *
  */
-#define MALLOC(x)		kmalloc(x, GFP_KERNEL)
 
 /* Receive multiplex framer header info */
 struct type_frame_head {
@@ -552,11 +549,9 @@ static void ks_wrreg16(struct ks_net *ks
  */
 static inline void ks_inblk(struct ks_net *ks, u16 *wptr, u32 len)
 {
-	u32 data_port = (u32)ks->hw_addr;
 	len >>= 1;
-	do {
-		*wptr++ = (u16)ioread16(data_port);
-	} while (--len);
+	while (len--)
+		*wptr++ = (u16)ioread16(ks->hw_addr);
 }


 /**
@@ -568,11 +563,9 @@ static inline void ks_inblk(struct ks_ne
  */
 static inline void ks_outblk(struct ks_net *ks, u16 *wptr, u32 len)
 {
-	u32 data_port = (u32)ks->hw_addr;
 	len >>= 1;
-	do {
-		iowrite16(*wptr++, data_port);
-	} while (--len);
+	while (len--)
+		iowrite16(*wptr++, ks->hw_addr);
 }

@@ -1515,12 +1510,13 @@ void ks_enable(struct ks_net *ks)
 
 static int ks_hw_init(struct ks_net *ks)
 {
+#define	MHEADER_SIZE	(sizeof(struct type_frame_head) * MAX_RECV_FRAMES)
 	ks->promiscuous = 0;
 	ks->all_mcast = 0;
 	ks->mcast_lst_size = 0;
 
 	ks->frame_head_info = (struct type_frame_head *) \
-		MALLOC(sizeof(struct type_frame_head) * MAX_RECV_FRAMES);
+		kmalloc(MHEADER_SIZE, GFP_KERNEL);
 	if (!ks->frame_head_info) {
 		printk(KERN_ERR "Error: Fail to allocate frame memory\n");
 		return false;



Regards,
David J. Choi



-----Original Message-----
From: David Miller [mailto:davem@davemloft.net] 
Sent: Wednesday, September 16, 2009 8:48 PM
To: greg@kroah.com
Cc: netdev@vger.kernel.org; Li, Charles; Choi@kroah.com; Choi, David;
jgarzik@redhat.com; shemminger@vyatta.com
Subject: Re: [PATCH] ks8851_ml ethernet network driver

From: Greg KH <greg@kroah.com>
Date: Wed, 16 Sep 2009 19:38:36 -0700

> From: Choi, David <David.Choi@Micrel.Com>
> 
> This is a network driver for the ks8851 16bit MLL ethernet device.
> 
> Signed-off-by: David J. Choi <david.choi@micrel.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

This doesn't even build cleanly:

drivers/net/ks8851_mll.c: In function 'ks_inblk':
drivers/net/ks8851_mll.c:555: warning: cast from pointer to integer of different size
drivers/net/ks8851_mll.c:558: warning: passing argument 1 of '_readw' makes pointer from integer without a cast
drivers/net/ks8851_mll.c: In function 'ks_outblk':
drivers/net/ks8851_mll.c:571: warning: cast from pointer to integer of different size
drivers/net/ks8851_mll.c:574: warning: passing argument 2 of '_writew' makes pointer from integer without a cast

It also has a big "#define DEBUG" at the beginning of the driver.

And it also has stuff like:

+#define MALLOC(x)		kmalloc(x, GFP_KERNEL)

which actually decreases the readability of this driver.

Please fix this up.

^ permalink raw reply

* RE: [PATCH] ks8851_ml ethernet network driver
From: Choi, David @ 2009-09-17 19:20 UTC (permalink / raw)
  To: Stephen Hemminger, Greg KH, Li, Charles
  Cc: netdev, David S. Miller, Choi, Jeff Garzik
In-Reply-To: <20090916210702.617b5069@s6510>

Hello Stephen Hemminger,

Here are my fix-ups.
	-mutex_lock is intended to guarantee exclusive access to the
	 hardware registers. But as you mentioned, this mutex is redundant
	 in ks_net_open() because this function does not access the
	 hardware, so I removed it.

================
@@ -858,7 +856,6 @@ static int ks_net_open(struct net_device
 	/* lock the card, even if we may not actually do anything
 	 * else at the moment.
 	 */
-	mutex_lock(&ks->lock);
 
 	if (netif_msg_ifup(ks))
 		ks_dbg(ks, "%s - entry\n", __func__);
@@ -875,8 +872,6 @@ static int ks_net_open(struct net_device
 	if (netif_msg_ifup(ks))
 		ks_dbg(ks, "network device %s up\n", netdev->name);
 
-	mutex_unlock(&ks->lock);
-
 	return 0;
 }


Regards,
David J. Choi


-----Original Message-----
From: Stephen Hemminger [mailto:shemminger@vyatta.com] 
Sent: Wednesday, September 16, 2009 9:07 PM
To: Greg KH; Li, Charles
Cc: netdev@vger.kernel.org; David S. Miller; Choi@kroah.com; Choi,
David; Jeff Garzik
Subject: Re: [PATCH] ks8851_ml ethernet network driver

On Wed, 16 Sep 2009 19:38:36 -0700
Greg KH <greg@kroah.com> wrote:

> +
> +/**
> + * ks_net_open - open network device
> + * @netdev: The network device being opened.
> + *
> + * Called when the network device is marked active, such as a user executing
> + * 'ifconfig up' on the device.
> + */
> +static int ks_net_open(struct net_device *netdev)
> +{
> +	struct ks_net *ks = netdev_priv(netdev);
> +	int err;
> +
> +#define	KS_INT_FLAGS	(IRQF_DISABLED|IRQF_TRIGGER_LOW)
> +	/* lock the card, even if we may not actually do anything
> +	 * else at the moment.
> +	 */
> +	mutex_lock(&ks->lock);
> +

I don't understand the purpose of ks->lock mutex. What is it
really protecting? open/close are already protected by rtnl_mutex,
is it really only for the PHY?

^ permalink raw reply

* RE: [PATCH] ks8851_ml ethernet network driver
From: Choi, David @ 2009-09-17 19:06 UTC (permalink / raw)
  To: David Miller, greg; +Cc: netdev, Li, Charles, Choi, jgarzik, shemminger
In-Reply-To: <20090916.204801.190052862.davem@davemloft.net>

Hello David Miller,

My fix-ups are as follows:
	-Remove DEBUG definition
	-Remove MALLOC definition
	-Attempt to fix the compile warnings, which I cannot reproduce
	 in my test environment (linux-2.6.31-rc3 source tree and
	 gcc 4.2.1). If you still see warnings with my fix-ups, please
	 give me a brief description of how to reproduce them.

=================
--- linux-2.6.31-rc3/drivers/net/ks8851_mll.c.orig	2009-09-17 10:18:56.000000000 -0700
+++ linux-2.6.31-rc3/drivers/net/ks8851_mll.c	2009-09-17 10:09:37.000000000 -0700
@@ -21,8 +21,6 @@
  * KS8851 16bit MLL chip from Micrel Inc.
  */
 
-#define DEBUG
-
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/netdevice.h>
@@ -419,7 +417,6 @@ union ks_tx_hdr {
  * or one of the work queues.
  *
  */
-#define MALLOC(x)		kmalloc(x, GFP_KERNEL)
 
 /* Receive multiplex framer header info */
 struct type_frame_head {
@@ -552,11 +549,9 @@ static void ks_wrreg16(struct ks_net *ks
  */
 static inline void ks_inblk(struct ks_net *ks, u16 *wptr, u32 len)
 {
-	u32 data_port = (u32)ks->hw_addr;
 	len >>= 1;
-	do {
-		*wptr++ = (u16)ioread16(data_port);
-	} while (--len);
+	while (len--)
+		*wptr++ = (u16)ioread16(ks->hw_addr);
 }
 
 /**
@@ -568,11 +563,9 @@ static inline void ks_inblk(struct ks_ne
  */
 static inline void ks_outblk(struct ks_net *ks, u16 *wptr, u32 len)
 {
-	u32 data_port = (u32)ks->hw_addr;
 	len >>= 1;
-	do {
-		iowrite16(*wptr++, data_port);
-	} while (--len);
+	while (len--)
+		iowrite16(*wptr++, ks->hw_addr);
 }
 
 /**

Regards,
David J. Choi


-----Original Message-----
From: David Miller [mailto:davem@davemloft.net] 
Sent: Wednesday, September 16, 2009 8:48 PM
To: greg@kroah.com
Cc: netdev@vger.kernel.org; Li, Charles; Choi@kroah.com; Choi, David;
jgarzik@redhat.com; shemminger@vyatta.com
Subject: Re: [PATCH] ks8851_ml ethernet network driver

From: Greg KH <greg@kroah.com>
Date: Wed, 16 Sep 2009 19:38:36 -0700

> From: Choi, David <David.Choi@Micrel.Com>
> 
> This is a network driver for the ks8851 16bit MLL ethernet device.
> 
> Signed-off-by: David J. Choi <david.choi@micrel.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

This doesn't even build cleanly:

drivers/net/ks8851_mll.c: In function 'ks_inblk':
drivers/net/ks8851_mll.c:555: warning: cast from pointer to integer of different size
drivers/net/ks8851_mll.c:558: warning: passing argument 1 of '_readw' makes pointer from integer without a cast
drivers/net/ks8851_mll.c: In function 'ks_outblk':
drivers/net/ks8851_mll.c:571: warning: cast from pointer to integer of different size
drivers/net/ks8851_mll.c:574: warning: passing argument 2 of '_writew' makes pointer from integer without a cast

It also has a big "#define DEBUG" at the beginning of the driver.

And it also has stuff like:

+#define MALLOC(x)		kmalloc(x, GFP_KERNEL)

which actually decreases the readability of this driver.

Please fix this up.

^ permalink raw reply

* RE: [PATCH] ks8851_ml ethernet network driver
From: Choi, David @ 2009-09-17 19:11 UTC (permalink / raw)
  To: Stephen Hemminger, Li, Charles
  Cc: Greg KH, netdev, David S. Miller, Choi, Jeff Garzik
In-Reply-To: <20090916210315.04dc743e@s6510>

Hello Stephen Hemminger,

My fix-up is as follows:
	Even if the IRQ is not shared, it is safe to skip processing
	when the hardware reports no pending interrupt status.

====================
@@ -818,6 +811,11 @@ static irqreturn_t ks_irq(int irq, void 
 	ks_save_cmd_reg(ks);
 
 	status = ks_rdreg16(ks, KS_ISR);
+	if (unlikely(!status)) {
+		ks_restore_cmd_reg(ks);
+		return IRQ_NONE;
+	}
+
 	ks_wrreg16(ks, KS_ISR, status);
 
 	if (likely(status & IRQ_RXI))



Regards,
David J. Choi


-----Original Message-----
From: Stephen Hemminger [mailto:shemminger@vyatta.com] 
Sent: Wednesday, September 16, 2009 9:03 PM
To: Li, Charles
Cc: Greg KH; netdev@vger.kernel.org; David S. Miller; Choi@kroah.com;
Choi, David; Jeff Garzik
Subject: Re: [PATCH] ks8851_ml ethernet network driver

On Wed, 16 Sep 2009 19:38:36 -0700
Greg KH <greg@kroah.com> wrote:

> /**
> + * ks_irq - device interrupt handler
> + * @irq: Interrupt number passed from the IRQ handler.
> + * @pw: The private word passed to register_irq(), our struct ks_net.
> + *
> + * This is the handler invoked to find out what happened
> + *
> + * Read the interrupt status, work out what needs to be done and then clear
> + * any of the interrupts that are not needed.
> + */
> +
> +static irqreturn_t ks_irq(int irq, void *pw)
> +{
> +	struct ks_net *ks = pw;
> +	struct net_device *netdev = ks->netdev;
> +	u16 status;
> +
> +	/*this should be the first in IRQ handler */
> +	ks_save_cmd_reg(ks);
> +
> +	status = ks_rdreg16(ks, KS_ISR);
> +	ks_wrreg16(ks, KS_ISR, status);

if status == 0 or status == ~0 then device should not return
IRQ_HANDLED. In the former case, the IRQ is shared; in the latter
case the device is not present on the bus (hotplug).
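
As a rough illustration of both checks, here is a small sketch that
builds outside the kernel. enum irq_result and classify_irq_status() are
stand-ins invented for the sketch (the driver itself would return
IRQ_NONE or IRQ_HANDLED from ks_irq()), and the status register is
assumed to be 16 bits wide as read by ks_rdreg16().

```c
#include <stdint.h>

/* Stand-ins for the kernel's irqreturn_t values. */
enum irq_result { SKETCH_IRQ_NONE = 0, SKETCH_IRQ_HANDLED = 1 };

/* Decide what the handler should report for a 16-bit status word. */
static enum irq_result classify_irq_status(uint16_t status)
{
	if (status == 0)	/* nothing pending: not our interrupt */
		return SKETCH_IRQ_NONE;
	if (status == 0xFFFF)	/* all ones: device is gone from the bus */
		return SKETCH_IRQ_NONE;
	return SKETCH_IRQ_HANDLED;
}
```

In the real handler the saved command register would still need to be
restored before returning early in either case.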

^ permalink raw reply

* Re: fanotify as syscalls
From: Eric Paris @ 2009-09-17 18:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Evgeniy Polyakov, David Miller, linux-kernel,
	linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <alpine.LFD.2.01.0909170934450.4950@localhost.localdomain>

On Thu, 2009-09-17 at 09:40 -0700, Linus Torvalds wrote:
> 
> On Wed, 16 Sep 2009, Jamie Lokier wrote:
> > 
> > I'd forgotten about Linus' strace argument.  That's a good one.
> > 
> > Of course everything should be a syscall by that argument :-)
> 
> Oh yes, everything _should_ be a syscall.

I rewrote the interface and prototyped out a working fanotify like so:

SYSCALL_DEFINE4(fanotify_init, unsigned int, flags, int, event_f_flags,
		__u64, mask, int, priority)

int flags indicates - things like global or directed, fd's or wd's,
could include fail allow vs fail deny, O_CLOEXEC, O_NONBLOCK, etc
int event_f_flags - flags used when opening an fd for the listener
__u64 mask - in global mode the events of interest
int priority - the order fanotify listeners should be checked (so HSM
		can be before AV scanners)

Do we need a timeout for access decisions?  I left room for that in the
bind address, but we can't just leave room to spare with a syscall...

SYSCALL_DEFINE6(fanotify_modify_mark, int, fanotify_fd,
		unsigned int, flags, int, fd,
		const char  __user *, pathname, __u64, mask,
		__u64, ignored_mask)

int fanotify_fd - duh
int flags - add, remove, flush, events on child, event on subtree?
int fd - either fd to object or fd to dir for relative pathname
const char __user * pathname - either pathname or null if only use fd
__u64 mask - events this inode cares about
__u64 ignored_mask - events this inode does NOT care about

(not yet done, would someone like to comment?)
fanotify_response(int fanotify_fd, __u64 cookie, __u32 response);
__u64 cookie - which of our permission requests we are waiting on
__u32 response - allow, deny, wait longer

Could be done using write(), but I think the strace argument clearly
says that this should be a syscall that can be easily found and reported

(not settled in my mind)
int fanotify_ignore_sb(int fanotify_fd, unsigned int flags,
                       long f_type, fsid_t f_fsid)
int fanotify_fd - duh
unsigned int flags - f_type or fsid?
long f_type - statfs(2) f_type
fsid_t f_fsid - statfs(2) f_fsid

Reads from the fd would return data of this structure:

struct fanotify_event_metadata {
	__u32 event_len;
	__u32 vers;
	__s32 fd;
	__u32 mask;
	__u32 f_flags;
	__s32 pid;
	__s32 uid;
	__s32 tgid;
	__u64 cookie;
}  __attribute__((packed));

Thanks to event_len and vers, we could extend it to include

__u32 filename1_len,
char filename1[filename1_len]
__u32 filename2_len,
char filename2[filename2_len]

This can all take shape as that work is completed, and I don't believe
it should block merging.
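
To make the event_len/vers point concrete, here is a user-space sketch of how a listener could walk a buffer of such events while skipping trailing data it does not understand. The struct mirrors the layout proposed above, which is a prototype and not a merged ABI; the sketch_ names and the parsing function are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout copied from the proposal above; hypothetical, not a merged ABI. */
struct sketch_fanotify_event_metadata {
	uint32_t event_len;
	uint32_t vers;
	int32_t  fd;
	uint32_t mask;
	uint32_t f_flags;
	int32_t  pid;
	int32_t  uid;
	int32_t  tgid;
	uint64_t cookie;
} __attribute__((packed));

/*
 * Count the complete events in buf[0..len).
 * Each record is advanced by its own event_len, so a newer kernel can
 * append fields (e.g. filename lengths and names) without breaking an
 * older reader. Returns -1 if a record claims an impossible length;
 * trailing bytes shorter than one header are ignored.
 */
static int sketch_count_events(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int count = 0;

	while (len - off >= sizeof(struct sketch_fanotify_event_metadata)) {
		struct sketch_fanotify_event_metadata ev;

		/* Copy out the header; buf may not be suitably aligned. */
		memcpy(&ev, buf + off, sizeof(ev));
		if (ev.event_len < sizeof(ev) || ev.event_len > len - off)
			return -1;
		off += ev.event_len;
		count++;
	}
	return count;
}
```

A reader that knows a newer vers could look at the bytes between sizeof(header) and event_len; one that does not simply steps over them.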

Do my syscalls look pretty enough?  I'm down to 3 or 4.
Jamie, do you tend to agree that the interface and the event types are
nice enough that we can build out (if we get the right hooks in the
right places) everywhere we need to go?

-Eric



* [GIT]: Networking
From: David Miller @ 2009-09-17 17:54 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) The idea to use ->close() and then a ->open() to refresh multicast
   addresses in the bonding driver was a bad idea and breaks a bunch
   of stuff.  Better scheme from Moni Shoua.

2) Some TCP code was still assuming ssthresh was a u16, oops.  Fix from
   Ilpo Järvinen.

3) RXRPC updates from David Howells.

4) genetlink table locking busted with netns, fix from Johannes Berg.

5) Several fixes for multiq packet scheduler fallout from Jarek Poplawski.

6) CAN receives packets in the wrong context, eliciting warnings from
   local_softirq_pending(), fix from Oliver Hartkopp.

7) TCP MD5 code has preempt level imbalance, fix from Robert Varga.

8) SKY2 bug fixes from Stephen Hemminger.

9) Alexey Dobriyan const'ified several protocol and ops structures.

10) Wireless bug fixes via John Linville.

11) IPV6 conformance fix wrt ROUTER_PREF_INVALID options and making
    the DAD failure log message more informative, from Jens Rosenboom.

12) S390 IUCV stack fixes from Ursula Braun and Hendrik Brueckner.

13) ieee802154 stack fixes from Dmitry Eremin-Solenikov.

Please pull, thanks a lot!

The following changes since commit 4142e0d1def2c0176c27fd2e810243045a62eb6d:
  Linus Torvalds (1):
        Merge branch 'osync_cleanup' of git://git.kernel.org/.../jack/linux-fs-2.6

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Alexander Duyck (2):
      igb: reset sgmii phy at start of init
      igb: do not allow phy sw reset code to make calls to null pointers

Alexey Dobriyan (3):
      net: constify struct net_protocol
      net: constify struct inet6_protocol
      net: constify remaining proto_ops

Christian Lamparter (1):
      p54usb: add Zcomax XG-705A usbid

Daniel C Halperin (1):
      iwlwifi: fix HT operation in 2.4 GHz band

David Howells (5):
      RxRPC: Declare the security index constants symbolically
      RxRPC: Allow key payloads to be passed in XDR form
      RxRPC: Allow RxRPC keys to be read
      RxRPC: Parse security index 5 keys (Kerberos 5)
      RxRPC: Use uX/sX rather than uintX_t/intX_t types

David S. Miller (3):
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-next-2.6
      Merge branch 'for-linus' of git://git.kernel.org/.../lowpan/lowpan
      wl12xx: Fix print_mac() conversion.

Dmitry Eremin-Solenikov (2):
      af_ieee802154: setsockopt optlen arg isn't __user
      ieee802154: add locking for seq numbers

Dongdong Deng (1):
      b44: the poll handler b44_poll must not enable IRQ unconditionally

Eric Dumazet (1):
      net: kmemcheck annotation in struct socket

Gerrit Renker (1):
      net-next-2.6 [PATCH 1/1] dccp: ccids whitespace-cleanup / CodingStyle

Hendrik Brueckner (6):
      iucv: fix iucv_buffer_cpumask check when calling IUCV functions
      iucv: use correct output register in iucv_query_maxconn()
      af_iucv: fix race in __iucv_sock_wait()
      af_iucv: handle non-accepted sockets after resuming from suspend
      af_iucv: do not call iucv_sock_kill() twice
      af_iucv: fix race when queueing skbs on the backlog queue

Holger Schurig (2):
      cfg80211: use cfg80211_wext_freq() for freq conversion
      cfg80211: minimal error handling for wext-compat freq scanning

Ilpo Järvinen (1):
      tcp: fix ssthresh u16 leftover

Jarek Poplawski (4):
      pkt_sched: Fix qdisc_graft WRT ingress qdisc
      pkt_sched: Fix tx queue selection in tc_modify_qdisc
      pkt_sched: Fix qdisc_create on stab error handling
      pkt_sched: Fix qstats.qlen updating in dump_stats

Jean-Christophe PLAGNIOL-VILLARD (1):
      wl12xx: switch to %pM to print the mac address

Jens Rosenboom (2):
      ipv6: Ignore route option with ROUTER_PREF_INVALID
      ipv6: Log the affected address when DAD failure occurs

Jie Yang (1):
      atl1e: fix 2.6.31-git4 -- ATL1E 0000:03:00.0: DMA-API: device driver frees DMA

Jiri Pirko (1):
      bonding: make ab_arp select active slaves as other modes

Johannes Berg (3):
      iwlwifi: disable powersave for 4965
      genetlink: fix netns vs. netlink table locking
      cfg80211: fix SME connect

Ken Kawasaki (1):
      pcnet_cs: add cis of Linksys multifunction pcmcia card

Larry Finger (1):
      ssb: Fix error when V1 SPROM extraction is forced

Luis R. Rodriguez (1):
      wireless: default CONFIG_WLAN to y

Mark Smith (1):
      Have atalk_route_packet() return NET_RX_SUCCESS not NET_XMIT_SUCCESS

Martin Decky (1):
      hostap: Revert a toxic part of the conversion to net_device_ops

Michael Buesch (3):
      b43: Force-wake queues on init
      ssb: Disable verbose SDIO coreswitch
      b43: Fix resume failure

Michael Hennerich (1):
      netdev: smc91x: drop Blackfin cruft

Moni Shoua (1):
      bonding: remap muticast addresses without using dev_close() and dev_open()

Oliver Hartkopp (1):
      can: fix NOHZ local_softirq_pending 08 warning

Pavel Roskin (1):
      rc80211_minstrel: fix contention window calculation

Peter P Waskiewicz Jr (3):
      ixgbe: Properly disable packet split per-ring when globally disabled
      ixgbe: Add support for 82599-based CX4 adapters
      ixgbe: Create separate media type for CX4 adapters

Randy Dunlap (1):
      ssb/sdio: fix printk format warnings

Reinette Chatre (1):
      iwlwifi: fix potential rx buffer loss

Robert Varga (1):
      tcp: fix CONFIG_TCP_MD5SIG + CONFIG_PREEMPT timer BUG()

Rémi Denis-Courmont (2):
      Phonet: Netlink event for autoconfigured addresses
      cdc-phonet: remove noisy debug statement

Sathya Perla (1):
      be2net: fix some cmds to use mccq instead of mbox

Stephen Hemminger (2):
      sky2: transmit ring accounting
      sky2: Make sure both ports initialize correctly

Sujith (1):
      ath9k: Fix bug in ANI channel handling

Ursula Braun (1):
      iucv: suspend/resume error msg for left over pathes

Vitaliy Gusev (1):
      mlx4: Fix access to freed memory

Wey-Yi Guy (1):
      iwlwifi: find the correct first antenna

 drivers/net/atl1e/atl1e.h                   |    9 +
 drivers/net/atl1e/atl1e_main.c              |   15 +-
 drivers/net/b44.c                           |    7 +-
 drivers/net/benet/be.h                      |    1 +
 drivers/net/benet/be_cmds.c                 |  412 +++++++-----
 drivers/net/benet/be_cmds.h                 |    5 +-
 drivers/net/benet/be_main.c                 |   42 +-
 drivers/net/bonding/bond_main.c             |  131 ++---
 drivers/net/can/vcan.c                      |    2 +-
 drivers/net/igb/e1000_82575.c               |  198 +++---
 drivers/net/igb/e1000_82575.h               |    2 +-
 drivers/net/igb/e1000_defines.h             |    2 +-
 drivers/net/igb/e1000_phy.c                 |    5 +-
 drivers/net/igb/igb_main.c                  |    2 +-
 drivers/net/ixgbe/ixgbe_82598.c             |    6 +-
 drivers/net/ixgbe/ixgbe_82599.c             |    3 +
 drivers/net/ixgbe/ixgbe_main.c              |    4 +
 drivers/net/ixgbe/ixgbe_type.h              |    2 +
 drivers/net/mlx4/catas.c                    |   11 +-
 drivers/net/pcmcia/pcnet_cs.c               |   10 +-
 drivers/net/pppol2tp.c                      |    4 +-
 drivers/net/sky2.c                          |   24 +-
 drivers/net/smc91x.h                        |   28 -
 drivers/net/usb/cdc-phonet.c                |    1 -
 drivers/net/wireless/Kconfig                |    1 +
 drivers/net/wireless/ath/ath9k/ani.c        |    6 +-
 drivers/net/wireless/b43/main.c             |    8 +-
 drivers/net/wireless/hostap/hostap_main.c   |    3 +-
 drivers/net/wireless/iwlwifi/iwl-4965.c     |    1 +
 drivers/net/wireless/iwlwifi/iwl-agn-rs.c   |   10 +-
 drivers/net/wireless/iwlwifi/iwl-core.c     |    9 +-
 drivers/net/wireless/iwlwifi/iwl-core.h     |    1 +
 drivers/net/wireless/iwlwifi/iwl-power.c    |    5 +-
 drivers/net/wireless/iwlwifi/iwl-rx.c       |   24 +-
 drivers/net/wireless/iwlwifi/iwl3945-base.c |   24 +-
 drivers/net/wireless/p54/p54usb.c           |    1 +
 drivers/net/wireless/wl12xx/wl1271_main.c   |    5 +-
 drivers/serial/serial_cs.c                  |   14 +-
 drivers/ssb/pci.c                           |    1 +
 drivers/ssb/sdio.c                          |    6 +-
 firmware/Makefile                           |    3 +-
 firmware/WHENCE                             |   12 +
 firmware/cis/MT5634ZLX.cis.ihex             |   11 +
 firmware/cis/PCMLM28.cis.ihex               |   18 +
 firmware/cis/RS-COM-2P.cis.ihex             |   10 +
 include/keys/rxrpc-type.h                   |  107 ++++
 include/linux/igmp.h                        |    2 +
 include/linux/net.h                         |    5 +
 include/linux/netdevice.h                   |    3 +-
 include/linux/netlink.h                     |    4 +
 include/linux/notifier.h                    |    2 +
 include/linux/rxrpc.h                       |    7 +
 include/net/addrconf.h                      |    2 +
 include/net/protocol.h                      |   13 +-
 include/net/sch_generic.h                   |    2 +-
 include/net/tcp.h                           |    7 +
 net/appletalk/ddp.c                         |    2 +-
 net/can/af_can.c                            |    4 +-
 net/core/dev.c                              |    4 +-
 net/dccp/ccids/Kconfig                      |    6 +-
 net/dccp/ccids/ccid2.c                      |    2 -
 net/dccp/ccids/ccid2.h                      |    8 +-
 net/dccp/ccids/ccid3.c                      |    5 +-
 net/dccp/ccids/ccid3.h                      |   50 +-
 net/dccp/ccids/lib/loss_interval.c          |    7 +-
 net/dccp/ccids/lib/loss_interval.h          |    2 -
 net/dccp/ccids/lib/packet_history.c         |    4 +-
 net/dccp/ccids/lib/packet_history.h         |    1 -
 net/dccp/ccids/lib/tfrc.h                   |    4 +-
 net/dccp/ccids/lib/tfrc_equation.c          |   26 +-
 net/dccp/ipv4.c                             |    2 +-
 net/dccp/ipv6.c                             |    4 +-
 net/ieee802154/dgram.c                      |    2 +-
 net/ieee802154/netlink.c                    |    4 +
 net/ieee802154/raw.c                        |    2 +-
 net/ipv4/af_inet.c                          |   18 +-
 net/ipv4/ah4.c                              |    2 +-
 net/ipv4/devinet.c                          |    6 +
 net/ipv4/esp4.c                             |    2 +-
 net/ipv4/icmp.c                             |    2 +-
 net/ipv4/igmp.c                             |   22 +
 net/ipv4/ip_gre.c                           |    2 +-
 net/ipv4/ip_input.c                         |    2 +-
 net/ipv4/ipcomp.c                           |    2 +-
 net/ipv4/ipmr.c                             |    6 +-
 net/ipv4/protocol.c                         |    6 +-
 net/ipv4/tcp.c                              |    2 +-
 net/ipv4/tcp_input.c                        |    2 +-
 net/ipv4/tcp_ipv4.c                         |    4 +-
 net/ipv4/tcp_minisocks.c                    |    4 +-
 net/ipv4/tunnel4.c                          |    4 +-
 net/ipv4/udplite.c                          |    2 +-
 net/ipv6/addrconf.c                         |   23 +-
 net/ipv6/af_inet6.c                         |   10 +-
 net/ipv6/ah6.c                              |    2 +-
 net/ipv6/esp6.c                             |    2 +-
 net/ipv6/exthdrs.c                          |    6 +-
 net/ipv6/icmp.c                             |    4 +-
 net/ipv6/ip6_input.c                        |    2 +-
 net/ipv6/ip6mr.c                            |    6 +-
 net/ipv6/ipcomp6.c                          |    2 +-
 net/ipv6/mcast.c                            |   19 +
 net/ipv6/protocol.c                         |    6 +-
 net/ipv6/reassembly.c                       |    2 +-
 net/ipv6/route.c                            |    2 +-
 net/ipv6/tcp_ipv6.c                         |    7 +-
 net/ipv6/tunnel6.c                          |    4 +-
 net/ipv6/udp.c                              |    2 +-
 net/ipv6/udplite.c                          |    2 +-
 net/iucv/af_iucv.c                          |   33 +-
 net/iucv/iucv.c                             |   38 +-
 net/mac80211/rc80211_minstrel.c             |    2 +-
 net/netlink/af_netlink.c                    |   51 +-
 net/netlink/genetlink.c                     |    5 +-
 net/phonet/pn_dev.c                         |    9 +-
 net/rds/af_rds.c                            |    2 +-
 net/rose/af_rose.c                          |    4 +-
 net/rxrpc/ar-ack.c                          |    6 +-
 net/rxrpc/ar-internal.h                     |   32 +-
 net/rxrpc/ar-key.c                          |  914 +++++++++++++++++++++++++--
 net/rxrpc/ar-security.c                     |    8 +-
 net/rxrpc/rxkad.c                           |   47 +-
 net/sched/sch_api.c                         |   29 +-
 net/sched/sch_drr.c                         |    4 +-
 net/sched/sch_mq.c                          |   14 +-
 net/sched/sch_multiq.c                      |    1 +
 net/sched/sch_prio.c                        |    1 +
 net/sctp/ipv6.c                             |    2 +-
 net/sctp/protocol.c                         |    2 +-
 net/socket.c                                |    1 +
 net/wireless/scan.c                         |    7 +-
 net/wireless/sme.c                          |   21 +-
 132 files changed, 2009 insertions(+), 805 deletions(-)
 create mode 100644 firmware/cis/MT5634ZLX.cis.ihex
 create mode 100644 firmware/cis/PCMLM28.cis.ihex
 create mode 100644 firmware/cis/RS-COM-2P.cis.ihex


* Re: [crash] kernel BUG at net/core/pktgen.c:3503!
From: Cyrill Gorcunov @ 2009-09-17 17:51 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: David Miller, torvalds, akpm, netdev, linux-kernel
In-Reply-To: <20090917174448.GA9548@elte.hu>

[Ingo Molnar - Thu, Sep 17, 2009 at 07:44:48PM +0200]
...
| 
| > 
| > Ingo, does Cyrill's patch help?
| 
| For now i've turned pktgen off in my tests. Will check it again once 
| things have calmed down somewhat.
| 
| Also, i just tried to reproduce the pktgen crash with latest -git and 
| the config i sent - no luck, so i cannot test Cyrill's patch either.
| 
| Btw., we are seeing some other preempt count and task related 
| weirdnesses as well in other code, maybe it's related. No good pattern 
| yet to act upon.
| 
| Anyway - please disregard this bugreport until i've investigated it 
| closer.
| 
| 	Ingo
| 

I'm unable to reproduce this issue either. I tried many ways
(under kvm) and no bug triggered. On the system for which I wrote
this patch in the first place, though, the bug was hitting all the
time (but that system contains custom vcpu management code, which
is not the case here).

	-- Cyrill


* Re: [crash] kernel BUG at net/core/pktgen.c:3503!
From: David Miller @ 2009-09-17 17:49 UTC (permalink / raw)
  To: mingo; +Cc: gorcunov, torvalds, akpm, netdev, linux-kernel
In-Reply-To: <20090917174448.GA9548@elte.hu>

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 17 Sep 2009 19:44:48 +0200

> Anyway - please disregard this bugreport until i've investigated it 
> closer.

Ok, thanks for the status update.


* Re: [crash] kernel BUG at net/core/pktgen.c:3503!
From: Ingo Molnar @ 2009-09-17 17:44 UTC (permalink / raw)
  To: David Miller; +Cc: gorcunov, torvalds, akpm, netdev, linux-kernel
In-Reply-To: <20090917.102923.174779685.davem@davemloft.net>


* David Miller <davem@davemloft.net> wrote:

> From: Cyrill Gorcunov <gorcunov@gmail.com>
> Date: Tue, 15 Sep 2009 22:51:12 +0400
> 
> > [Ingo Molnar - Tue, Sep 15, 2009 at 08:36:47PM +0200]
> > | 
> > | not sure which merge caused this, but i got this boot crash with latest 
> > | -git:
> > | 
> > | calling  flow_cache_init+0x0/0x1b9 @ 1
> > | initcall flow_cache_init+0x0/0x1b9 returned 0 after 64 usecs
> > | calling  pg_init+0x0/0x37c @ 1
> > | pktgen 2.72: Packet Generator for packet performance testing.
> > | ------------[ cut here ]------------
> > | kernel BUG at net/core/pktgen.c:3503!
> > | invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > | last sysfs file: 
> > | 
> > 
> > Hi Ingo,
> > 
> > just curious, will the following patch fix the problem?
> > I've been fixing a problem with familiar symptoms on a
> > system with a custom virtual cpu implementation, so
> > it may not help in mainline, but anyway :)
> 
> Ingo, does Cyrill's patch help?

For now i've turned pktgen off in my tests. Will check it again once 
things have calmed down somewhat.

Also, i just tried to reproduce the pktgen crash with latest -git and 
the config i sent - no luck, so i cannot test Cyrill's patch either.

Btw., we are seeing some other preempt count and task related 
weirdnesses as well in other code, maybe it's related. No good pattern 
yet to act upon.

Anyway - please disregard this bugreport until i've investigated it 
closer.

	Ingo


* Re: [PATCH 1/2] wl12xx: switch to %pM to print the mac address
From: Jean-Christophe PLAGNIOL-VILLARD @ 2009-09-17 17:31 UTC (permalink / raw)
  To: David Miller; +Cc: linville, bhutchings, netdev
In-Reply-To: <20090917.101939.109704983.davem@davemloft.net>

On 10:19 Thu 17 Sep     , David Miller wrote:
> From: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
> Date: Thu, 17 Sep 2009 16:42:08 +0200
> 
> > On 08:55 Thu 17 Sep     , John W. Linville wrote:
> >> Ugh, you're right -- remind me not to ACK things before bed...
> >> 
> >> Jean-Christophe posted a new patch that looked better, although it
> >> probably needs to be rebased on this one since I think Dave applied
> >> it after my (misguided) ACK.
> > ok I'll do asap really sorry
> 
> I've already fixed this in my tree.
Tks

Best Regards,
J/


* Re: fanotify as syscalls
From: Arjan van de Ven @ 2009-09-17 17:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Evgeniy Polyakov, Eric Paris, David Miller,
	linux-kernel, linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <alpine.LFD.2.01.0909170934450.4950@localhost.localdomain>

On Thu, 17 Sep 2009 09:40:16 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> And then we have page faults. I've long wished that from a system
> call tracing standpoint we could show page faults as
> pseudo-system-calls (at least as long as they happen from user space
> - trying to handle nesting is not worth it). It would make it _so_
> much more obvious what the performance patterns are if you could just
> do
> 
> 	strace -ttT firefox
> 
> for the cold-cache case and you'd see where the time is really spent.
> 
> (yeah, yeah, you can get that kind of information other ways, but
> it's a hell of a lot less convenient than just getting a nice trace
> with timestamps).

ohhh I should add pagefaults to timechart
good one.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* 2.6.31 / WARNING / tcp_input
From: Denys Fedoryschenko @ 2009-09-17 17:33 UTC (permalink / raw)
  To: netdev

Nothing unusual in configs, heavily loaded HTTP proxy.

Some tunings:
sysctl -w net.core.somaxconn=4096
sysctl -w net.core.wmem_max=384000
kernel.panic_on_oops=1
kernel.panic=5
vm.min_free_kbytes=16384
net.ipv4.tcp_max_syn_backlog=10240
net.ipv4.conf.all.arp_filter=1
vm.panic_on_oom=2
net.core.netdev_max_backlog=4000


[  123.221032] ------------[ cut here ]------------
[  123.221183] WARNING: at net/ipv4/tcp_input.c:2920 tcp_ack+0xc6c/0x17bf()
[  123.221316] Hardware name: PowerEdge 2900
[  123.221438] Modules linked in: ext4 jbd2 mptspi mptsas scsi_transport_spi 
scsi_transport_sas mptscsih mptbase xt_owner ipt_REDIRECT xt_tcpudp xt_dscp 
iptable_raw iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack rtc_cmos 
rtc_core rtc_lib nf_defrag_ipv4 iptable_filter ip_tables x_tables 8021q garp 
stp llc loop usb_storage mtdblock mtd_blkdevs mtd iTCO_wdt 
iTCO_vendor_support pata_acpi ata_piix ata_generic libata megaraid_sas bnx2 
sr_mod cdrom tulip r8169 sky2 via_velocity via_rhine sis900 ne2k_pci 8390 
skge tg3 libphy 8139too e1000 e100 usbhid ohci_hcd uhci_hcd ehci_hcd usbcore 
nls_base
[  123.225025] Pid: 0, comm: swapper Not tainted 2.6.31-build-0046-32bit #20
[  123.225162] Call Trace:
[  123.225289]  [<c012d051>] warn_slowpath_common+0x60/0x90
[  123.225415]  [<c012d08e>] warn_slowpath_null+0xd/0x10
[  123.225540]  [<c02c8e15>] tcp_ack+0xc6c/0x17bf
[  123.225665]  [<c02ca51c>] tcp_rcv_established+0x6dc/0x8fd
[  123.225791]  [<c02cff7f>] tcp_v4_do_rcv+0x24/0x17e
[  123.225915]  [<c02d04c0>] tcp_v4_rcv+0x3e7/0x5a8
[  123.226039]  [<c02b953a>] ip_local_deliver_finish+0xb6/0x12e
[  123.226174]  [<c02b9613>] ip_local_deliver+0x61/0x6a
[  123.226303]  [<c02b9215>] ip_rcv_finish+0x29d/0x2b3
[  123.226427]  [<c02b9458>] ip_rcv+0x22d/0x259
[  123.226552]  [<c029d700>] netif_receive_skb+0x439/0x458
[  123.226677]  [<c02ab3ca>] ? eth_type_trans+0x25/0xa9
[  123.226808]  [<f84c1193>] bnx2_poll_work+0xffd/0x1137 [bnx2]
[  123.226938]  [<c012331a>] ? enqueue_task_fair+0x131/0x173
[  123.227069]  [<c011f015>] ? activate_task+0x3c/0x4b
[  123.227206]  [<f84c145f>] bnx2_poll+0xf8/0x1d8 [bnx2]
[  123.227331]  [<c029dc93>] net_rx_action+0x93/0x177
[  123.227457]  [<c0131ad9>] __do_softirq+0xa7/0x144
[  123.227582]  [<c0131a32>] ? __do_softirq+0x0/0x144
[  123.227705]  <IRQ>  [<c0131853>] ? irq_exit+0x29/0x5c
[  123.227876]  [<c0104393>] ? do_IRQ+0x80/0x96
[  123.228000]  [<c0102f49>] ? common_interrupt+0x29/0x30
[  123.228136]  [<c0108a1e>] ? mwait_idle+0x8a/0xb9
[  123.228266]  [<c0101bf0>] ? cpu_idle+0x44/0x60
[  123.228390]  [<c02f90ed>] ? start_secondary+0x195/0x19a
[  123.228515] ---[ end trace 22f267765b97f808 ]---


* Re: [net-next-2.6 PATCH] be2net: fix some cmds to use mccq instead of mbox
From: David Miller @ 2009-09-17 17:30 UTC (permalink / raw)
  To: sathyap; +Cc: netdev
In-Reply-To: <20090917044331.GA14568@serverengines.com>

From: Sathya Perla <sathyap@serverengines.com>
Date: Thu, 17 Sep 2009 10:13:31 +0530

> All cmds issued to BE after the creation of mccq must now use the mcc-q
> (and not mbox) to avoid a hw issue that results in mbox poll timeout.
> 
> Signed-off-by: Sathya Perla <sathyap@serverengines.com>

Applied, thanks.


* Re: [crash] kernel BUG at net/core/pktgen.c:3503!
From: David Miller @ 2009-09-17 17:29 UTC (permalink / raw)
  To: gorcunov; +Cc: mingo, torvalds, akpm, netdev, linux-kernel
In-Reply-To: <20090915185112.GA17587@lenovo>

From: Cyrill Gorcunov <gorcunov@gmail.com>
Date: Tue, 15 Sep 2009 22:51:12 +0400

> [Ingo Molnar - Tue, Sep 15, 2009 at 08:36:47PM +0200]
> | 
> | not sure which merge caused this, but i got this boot crash with latest 
> | -git:
> | 
> | calling  flow_cache_init+0x0/0x1b9 @ 1
> | initcall flow_cache_init+0x0/0x1b9 returned 0 after 64 usecs
> | calling  pg_init+0x0/0x37c @ 1
> | pktgen 2.72: Packet Generator for packet performance testing.
> | ------------[ cut here ]------------
> | kernel BUG at net/core/pktgen.c:3503!
> | invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> | last sysfs file: 
> | 
> 
> Hi Ingo,
> 
> just curious, will the following patch fix the problem?
> I've been fixing a problem with familiar symptoms on a
> system with a custom virtual cpu implementation, so
> it may not help in mainline, but anyway :)

Ingo, does Cyrill's patch help?


* Re: [Patch net-next]atl1e:fix 2.6.31-git4 -- ATL1E 0000:03:00.0: DMA-API: device driver frees DMA
From: David Miller @ 2009-09-17 17:27 UTC (permalink / raw)
  To: jie.yang; +Cc: miles.lane, chris.snook, jcliburn, netdev, linux-kernel
In-Reply-To: <12530933101782-git-send-email-jie.yang@atheros.com>

From: <jie.yang@atheros.com>
Date: Wed, 16 Sep 2009 17:28:30 +0800

> The driver used the wrong API when freeing DMA mappings. When mapping
> DMA, use a flag to record whether the buffer was mapped with
> 'pci_map_single' or 'pci_map_page'. When freeing the DMA mapping, check
> the flag to select the right API ('pci_unmap_single' or
> 'pci_unmap_page').
> 
> Set the flags type to u16 instead of unsigned long, per David's comments.
> 
> Signed-off-by: Jie Yang <jie.yang@atheros.com>

Applied.
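
For readers skimming the thread, the approach described in the commit message can be sketched in user space roughly as follows. This is not the atl1e code: the flag values, the struct, and the counters standing in for pci_unmap_single() / pci_unmap_page() are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical flag values recording how a buffer was mapped. */
#define SKETCH_MAP_SINGLE 0x0001
#define SKETCH_MAP_PAGE   0x0002

struct sketch_tx_buffer {
	uint16_t flags;	/* u16, as the commit message describes */
	int mapped;	/* nonzero while a mapping is live */
};

/* Counters standing in for pci_unmap_single() / pci_unmap_page(). */
static int sketch_unmap_single_calls;
static int sketch_unmap_page_calls;

/* Remember at map time which mapping API was used. */
static void sketch_map(struct sketch_tx_buffer *b, uint16_t how)
{
	b->flags = how;
	b->mapped = 1;
}

/* At free time, pick the unmap call that matches the recorded flag. */
static void sketch_unmap(struct sketch_tx_buffer *b)
{
	if (!b->mapped)
		return;
	if (b->flags & SKETCH_MAP_SINGLE)
		sketch_unmap_single_calls++;
	else if (b->flags & SKETCH_MAP_PAGE)
		sketch_unmap_page_calls++;
	b->mapped = 0;
	b->flags = 0;
}
```

The point of the flag is only that the unmap side no longer has to guess; mismatched map/unmap pairs are what the DMA-API debug warning in the subject line complains about.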


* Re: [PATCH] pkt_sched: Fix qstats.qlen updating in dump_stats
From: David Miller @ 2009-09-17 17:26 UTC (permalink / raw)
  To: jarkao2; +Cc: kaber, netdev
In-Reply-To: <20090916103838.GA9719@ff.dom.local>

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 16 Sep 2009 10:38:38 +0000

> Some classful qdiscs miss qstats.qlen updating with q.qlen of their
> child qdiscs in dump_stats methods.
> 
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Applied, thanks.
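
The patch itself is not shown in this thread, so the following is only a guess at the shape of the fix, written as a user-space sketch with hypothetical struct and function names: before a classful qdisc reports a child's queue stats, it copies the child's live q.qlen into qstats.qlen.

```c
/* Minimal stand-ins for the relevant fields (hypothetical types). */
struct sketch_queue  { unsigned int qlen; };
struct sketch_qstats { unsigned int qlen; unsigned int backlog; };

struct sketch_qdisc {
	struct sketch_queue  q;		/* live queue, updated on enqueue/dequeue */
	struct sketch_qstats qstats;	/* snapshot reported to user space */
};

/*
 * Dump the child's queue stats for a class. Without the first line,
 * qstats.qlen would keep whatever stale value it last held.
 */
static struct sketch_qstats sketch_dump_class_stats(struct sketch_qdisc *child)
{
	child->qstats.qlen = child->q.qlen;	/* sync before reporting */
	return child->qstats;
}
```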
