Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 01/11] netlink: add symbolic value for congested state
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1315070771-18576-1-git-send-email-kaber@trash.net>

From: Patrick McHardy <kaber@trash.net>

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netlink/af_netlink.c |   18 +++++++++++-------
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0a4db02..fc63ca5 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -88,6 +88,10 @@ struct listeners {
 	unsigned long		masks[0];
 };
 
+/* state bits */
+#define NETLINK_CONGESTED	0x0
+
+/* flags */
 #define NETLINK_KERNEL_SOCKET	0x1
 #define NETLINK_RECV_PKTINFO	0x2
 #define NETLINK_BROADCAST_SEND_ERROR	0x4
@@ -737,7 +741,7 @@ static void netlink_overrun(struct sock *sk)
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (!(nlk->flags & NETLINK_RECV_NO_ENOBUFS)) {
-		if (!test_and_set_bit(0, &nlk_sk(sk)->state)) {
+		if (!test_and_set_bit(NETLINK_CONGESTED, &nlk_sk(sk)->state)) {
 			sk->sk_err = ENOBUFS;
 			sk->sk_error_report(sk);
 		}
@@ -798,7 +802,7 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 	nlk = nlk_sk(sk);
 
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    test_bit(0, &nlk->state)) {
+	    test_bit(NETLINK_CONGESTED, &nlk->state)) {
 		DECLARE_WAITQUEUE(wait, current);
 		if (!*timeo) {
 			if (!ssk || netlink_is_kernel(ssk))
@@ -812,7 +816,7 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 		add_wait_queue(&nlk->wait, &wait);
 
 		if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-		     test_bit(0, &nlk->state)) &&
+		     test_bit(NETLINK_CONGESTED, &nlk->state)) &&
 		    !sock_flag(sk, SOCK_DEAD))
 			*timeo = schedule_timeout(*timeo);
 
@@ -876,8 +880,8 @@ static inline void netlink_rcv_wake(struct sock *sk)
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (skb_queue_empty(&sk->sk_receive_queue))
-		clear_bit(0, &nlk->state);
-	if (!test_bit(0, &nlk->state))
+		clear_bit(NETLINK_CONGESTED, &nlk->state);
+	if (!test_bit(NETLINK_CONGESTED, &nlk->state))
 		wake_up_interruptible(&nlk->wait);
 }
 
@@ -958,7 +962,7 @@ static inline int netlink_broadcast_deliver(struct sock *sk,
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-	    !test_bit(0, &nlk->state)) {
+	    !test_bit(NETLINK_CONGESTED, &nlk->state)) {
 		skb_set_owner_r(skb, sk);
 		skb_queue_tail(&sk->sk_receive_queue, skb);
 		sk->sk_data_ready(sk, skb->len);
@@ -1236,7 +1240,7 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 	case NETLINK_NO_ENOBUFS:
 		if (val) {
 			nlk->flags |= NETLINK_RECV_NO_ENOBUFS;
-			clear_bit(0, &nlk->state);
+			clear_bit(NETLINK_CONGESTED, &nlk->state);
 			wake_up_interruptible(&nlk->wait);
 		} else
 			nlk->flags &= ~NETLINK_RECV_NO_ENOBUFS;
-- 
1.7.4.4

^ permalink raw reply related

* [PATCH 09/11] netlink: implement memory mapped recvmsg()
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1315070771-18576-1-git-send-email-kaber@trash.net>

From: Patrick McHardy <kaber@trash.net>

Add support for memory mapped recvmsg() to netlink. When the kernel wants
to send a message to userspace, it uses a special skb allocation function
that looks up the receiving socket and, in case the socket uses memory
mapped I/O, allocates an skb without a data area and sets the data pointer
to point to the data area of the ring frame. Message construction than
proceeds as normally, once completed the ownership of the ring frame is
transfered to userspace by setting the status to NL_MMAP_STATUS_VALID.

When memory mapped I/O can't be used because the required size exceeds
the frame size or the sending subsystem doesn't support memory mapped
I/O, the skb is queued to the socket receive queue and the frame status
is set to NL_MMAP_STATUS_COPY to instruct userspace to invoke recvmsg()
to receive the queued message.

Unlike with regular socket I/O, userspace normally doesn't invoke
recvmsg(), so flow control of dumps happens in netlink_poll() under
the assumption that poll() is usually only invoked when all pending
frames have been processed.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/linux/netlink.h  |    2 +
 net/netlink/af_netlink.c |  207 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 182 insertions(+), 27 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 955adc1..2bef899 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -223,6 +223,8 @@ extern void __netlink_clear_multicast_users(struct sock *sk, unsigned int group)
 extern void netlink_clear_multicast_users(struct sock *sk, unsigned int group);
 extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
 extern int netlink_has_listeners(struct sock *sk, unsigned int group);
+extern struct sk_buff *netlink_alloc_skb(struct sock *ssk, unsigned int size,
+					 u32 dst_pid, gfp_t gfp_mask);
 extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock);
 extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid,
 			     __u32 group, gfp_t allocation);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 9b6400f..2f4745d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -176,12 +176,40 @@ static struct hlist_head *nl_pid_hashfn(struct nl_pid_hash *hash, u32 pid)
 	return &hash->table[jhash_1word(pid, hash->rnd) & hash->mask];
 }
 
+static void netlink_overrun(struct sock *sk)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+
+	if (!(nlk->flags & NETLINK_RECV_NO_ENOBUFS)) {
+		if (!test_and_set_bit(NETLINK_CONGESTED, &nlk_sk(sk)->state)) {
+			sk->sk_err = ENOBUFS;
+			sk->sk_error_report(sk);
+		}
+	}
+	atomic_inc(&sk->sk_drops);
+}
+
+static inline void netlink_rcv_wake(struct sock *sk)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+
+	if (skb_queue_empty(&sk->sk_receive_queue))
+		clear_bit(NETLINK_CONGESTED, &nlk->state);
+	if (!test_bit(NETLINK_CONGESTED, &nlk->state))
+		wake_up_interruptible(&nlk->wait);
+}
+
 #ifdef CONFIG_NETLINK_MMAP
 static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
 {
 	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
 }
 
+static bool netlink_rx_is_mmaped(struct sock *sk)
+{
+	return nlk_sk(sk)->rx_ring.pg_vec != NULL;
+}
+
 static bool netlink_tx_is_mmaped(struct sock *sk)
 {
 	return nlk_sk(sk)->tx_ring.pg_vec != NULL;
@@ -508,6 +536,22 @@ static unsigned int netlink_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk;
 	struct netlink_sock *nlk = nlk_sk(sk);
 	unsigned int mask;
+	int err;
+
+	if (nlk->rx_ring.pg_vec != NULL) {
+		/* Memory mapped sockets don't call recvmsg(), so flow control
+		 * is performed here under the assumption that the entire ring
+		 * has been processed before invoking poll().
+		 */
+		if (nlk->cb != NULL) {
+			err = netlink_dump(sk);
+			if (err < 0) {
+				sk->sk_err = err;
+				sk->sk_error_report(sk);
+			}
+		}
+		netlink_rcv_wake(sk);
+	}
 
 	mask = datagram_poll(file, sock, wait);
 
@@ -649,8 +693,53 @@ out:
 	mutex_unlock(&nlk->pg_vec_lock);
 	return err;
 }
+
+static void netlink_queue_mmaped_skb(struct sock *sk, struct sk_buff *skb)
+{
+	struct nl_mmap_hdr *hdr;
+
+	hdr = netlink_mmap_hdr(skb);
+	hdr->nm_len	= skb->len;
+	hdr->nm_group	= NETLINK_CB(skb).dst_group;
+	hdr->nm_pid	= NETLINK_CB(skb).creds.pid;
+	hdr->nm_uid	= NETLINK_CB(skb).creds.uid;
+	hdr->nm_gid	= NETLINK_CB(skb).creds.gid;
+	netlink_frame_flush_dcache(hdr);
+	netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
+
+	NETLINK_CB(skb).flags |= NETLINK_SKB_DELIVERED;
+	kfree_skb(skb);
+}
+
+static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+	struct netlink_ring *ring = &nlk->rx_ring;
+	struct nl_mmap_hdr *hdr;
+
+	spin_lock_bh(&sk->sk_receive_queue.lock);
+	hdr = netlink_current_frame(ring, NL_MMAP_STATUS_UNUSED);
+	if (hdr == NULL) {
+		spin_unlock_bh(&sk->sk_receive_queue.lock);
+		kfree_skb(skb);
+		netlink_overrun(sk);
+		return;
+	}
+	netlink_increment_head(ring);
+	__skb_queue_tail(&sk->sk_receive_queue, skb);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+
+	hdr->nm_len	= skb->len;
+	hdr->nm_group	= NETLINK_CB(skb).dst_group;
+	hdr->nm_pid	= NETLINK_CB(skb).creds.pid;
+	hdr->nm_uid	= NETLINK_CB(skb).creds.uid;
+	hdr->nm_gid	= NETLINK_CB(skb).creds.gid;
+	netlink_set_status(hdr, NL_MMAP_STATUS_COPY);
+}
+
 #else /* CONFIG_NETLINK_MMAP */
 #define netlink_skb_is_mmaped(skb)	false
+#define netlink_rx_is_mmaped(sk)	false
 #define netlink_tx_is_mmaped(sk)	false
 #define netlink_mmap			sock_no_mmap
 #define netlink_poll			datagram_poll
@@ -661,7 +750,14 @@ static void netlink_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	unsigned int len = skb->len;
 
-	skb_queue_tail(&sk->sk_receive_queue, skb);
+#ifdef CONFIG_NETLINK_MMAP
+	if (netlink_skb_is_mmaped(skb))
+		netlink_queue_mmaped_skb(sk, skb);
+	else if (netlink_rx_is_mmaped(sk))
+		netlink_ring_set_copied(sk, skb);
+	else
+#endif /* CONFIG_NETLINK_MMAP */
+		skb_queue_tail(&sk->sk_receive_queue, skb);
 	sk->sk_data_ready(sk, len);
 }
 
@@ -1309,19 +1405,6 @@ static int netlink_getname(struct socket *sock, struct sockaddr *addr,
 	return 0;
 }
 
-static void netlink_overrun(struct sock *sk)
-{
-	struct netlink_sock *nlk = nlk_sk(sk);
-
-	if (!(nlk->flags & NETLINK_RECV_NO_ENOBUFS)) {
-		if (!test_and_set_bit(NETLINK_CONGESTED, &nlk_sk(sk)->state)) {
-			sk->sk_err = ENOBUFS;
-			sk->sk_error_report(sk);
-		}
-	}
-	atomic_inc(&sk->sk_drops);
-}
-
 static struct sock *netlink_getsockbypid(struct sock *ssk, u32 pid)
 {
 	struct sock *sock;
@@ -1450,16 +1533,6 @@ static inline struct sk_buff *netlink_trim(struct sk_buff *skb,
 	return skb;
 }
 
-static inline void netlink_rcv_wake(struct sock *sk)
-{
-	struct netlink_sock *nlk = nlk_sk(sk);
-
-	if (skb_queue_empty(&sk->sk_receive_queue))
-		clear_bit(NETLINK_CONGESTED, &nlk->state);
-	if (!test_bit(NETLINK_CONGESTED, &nlk->state))
-		wake_up_interruptible(&nlk->wait);
-}
-
 static inline int netlink_unicast_kernel(struct sock *sk, struct sk_buff *skb)
 {
 	int ret;
@@ -1512,6 +1585,69 @@ retry:
 }
 EXPORT_SYMBOL(netlink_unicast);
 
+struct sk_buff *netlink_alloc_skb(struct sock *ssk, unsigned int size,
+				  u32 dst_pid, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NETLINK_MMAP
+	struct sock *sk = NULL;
+	struct sk_buff *skb;
+	struct netlink_ring *ring;
+	struct nl_mmap_hdr *hdr;
+	unsigned int maxlen;
+
+	sk = netlink_getsockbypid(ssk, dst_pid);
+	if (IS_ERR(sk))
+		goto out;
+
+	ring = &nlk_sk(sk)->rx_ring;
+	/* fast-path without atomic ops for common case: non-mmaped receiver */
+	if (ring->pg_vec == NULL)
+		goto out_put;
+
+	skb = alloc_skb_head(gfp_mask);
+	if (skb == NULL)
+		goto err1;
+
+	spin_lock_bh(&sk->sk_receive_queue.lock);
+	/* check again under lock */
+	if (ring->pg_vec == NULL)
+		goto out_free;
+
+	maxlen = ring->frame_size - NL_MMAP_HDRLEN;
+	if (maxlen < size)
+		goto out_free;
+
+	netlink_forward_ring(ring);
+	hdr = netlink_current_frame(ring, NL_MMAP_STATUS_UNUSED);
+	if (hdr == NULL)
+		goto err2;
+	netlink_ring_setup_skb(skb, sk, ring, hdr);
+	netlink_set_status(hdr, NL_MMAP_STATUS_RESERVED);
+	atomic_inc(&ring->pending);
+	netlink_increment_head(ring);
+
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+	return skb;
+
+err2:
+	kfree_skb(skb);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+	netlink_overrun(sk);
+err1:
+	sock_put(sk);
+	return NULL;
+
+out_free:
+	kfree_skb(skb);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+out_put:
+	sock_put(sk);
+out:
+#endif
+	return alloc_skb(size, gfp_mask);
+}
+EXPORT_SYMBOL_GPL(netlink_alloc_skb);
+
 int netlink_has_listeners(struct sock *sk, unsigned int group)
 {
 	int res = 0;
@@ -2278,9 +2414,13 @@ static int netlink_dump(struct sock *sk)
 
 	alloc_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
 
-	skb = sock_rmalloc(sk, alloc_size, 0, GFP_KERNEL);
+	if (!netlink_rx_is_mmaped(sk) &&
+	    atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
+		goto errout_skb;
+	skb = netlink_alloc_skb(sk, alloc_size, nlk->pid, GFP_KERNEL);
 	if (!skb)
 		goto errout_skb;
+	netlink_skb_set_owner_r(skb, sk);
 
 	len = cb->dump(skb, cb);
 
@@ -2337,11 +2477,23 @@ int netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	if (cb == NULL)
 		return -ENOBUFS;
 
+	/* Memory mapped dump requests need to be copied to avoid looping
+	 * on the pending state in netlink_mmap_sendmsg() while the cb holds
+	 * a reference to the skb.
+	 */
+	if (netlink_skb_is_mmaped(skb)) {
+		skb = skb_copy(skb, GFP_KERNEL);
+		if (skb == NULL) {
+			kfree(cb);
+			return -ENOBUFS;
+		}
+	} else
+		atomic_inc(&skb->users);
+
 	cb->dump = dump;
 	cb->done = done;
 	cb->nlh = nlh;
 	cb->min_dump_alloc = min_dump_alloc;
-	atomic_inc(&skb->users);
 	cb->skb = skb;
 
 	sk = netlink_lookup(sock_net(ssk), ssk->sk_protocol, NETLINK_CB(skb).pid);
@@ -2386,7 +2538,8 @@ void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err)
 	if (err)
 		payload += nlmsg_len(nlh);
 
-	skb = nlmsg_new(payload, GFP_KERNEL);
+	skb = netlink_alloc_skb(in_skb->sk, nlmsg_total_size(payload),
+				NETLINK_CB(in_skb).pid, GFP_KERNEL);
 	if (!skb) {
 		struct sock *sk;
 
-- 
1.7.4.4

^ permalink raw reply related

* [PATCH 10/11] netlink: add documentation for memory mapped I/O
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1315070771-18576-1-git-send-email-kaber@trash.net>

From: Patrick McHardy <kaber@trash.net>

Add documentation and some example code for memory mapped netlink I/O.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 Documentation/networking/netlink_mmap.txt |  328 +++++++++++++++++++++++++++++
 1 files changed, 328 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/netlink_mmap.txt

diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
new file mode 100644
index 0000000..8034811
--- /dev/null
+++ b/Documentation/networking/netlink_mmap.txt
@@ -0,0 +1,328 @@
+This file documents how to use memory mapped I/O with netlink.
+
+Overview
+--------
+
+Memory mapped netlink I/O can be used to increase throughput and decrease
+overhead of unicast receive and transmit operations. Some netlink subsystems
+require high throughput, these are mainly the netfilter subsystems
+nfnetlink_queue and nfnetlink_log, but it can also help speed up large
+dump operations of f.i. the routing database.
+
+Memory mapped netlink I/O used two circular ring buffers for RX and TX which
+are mapped into the processes address space.
+
+The RX ring is used by the kernel to directly construct netlink messages into
+user-space memory without copying them as done with regular socket I/O,
+additionally as long as the ring contains messages no recvmsg() or poll()
+syscalls have to be issued by user-space to get more message.
+
+The TX ring is used to process messages directly from user-space memory, the
+kernel processes all messages contained in the ring using a single sendmsg()
+call.
+
+Usage overview
+--------------
+
+In order to use memory mapped netlink I/O, user-space needs three main changes:
+
+- ring setup
+- conversion of the RX path to get messages from the ring instead of recvmsg()
+- conversion of the TX path to construct messages into the ring
+
+Ring setup is done using setsockopt() to provide the ring parameters to the
+kernel, then a call to mmap() to map the ring into the processes address space:
+
+- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &params, sizeof(params));
+- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &params, sizeof(params));
+- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
+
+Usage of either ring is optional, but even if only the RX ring is used the
+mapping still needs to be writable in order to update the frame status after
+processing.
+
+Conversion of the reception path involves calling poll() on the file
+descriptor, once the socket is readable the frames from the ring are
+processsed in order until no more messages are available, as indicated by
+a status word in the frame header.
+
+On kernel side, in order to make use of memory mapped I/O on receive, the
+originating netlink subsystem needs to support memory mapped I/O, otherwise
+it will use an allocated socket buffer as usual and the contents will be
+ copied to the ring on transmission, nullifying most of the performance gains.
+Dumps of kernel databases automatically support memory mapped I/O.
+
+Conversion of the transmit path involves changing message contruction to
+use memory from the TX ring instead of (usually) a buffer declared on the
+stack and setting up the frame header approriately. Optionally poll() can
+be used to wait for free frames in the TX ring.
+
+Structured and definitions for using memory mapped I/O are contained in
+<linux/netlink.h>.
+
+RX and TX rings
+----------------
+
+Each ring contains a number of continous memory blocks, containing frames of
+fixed size dependant on the parameters used for ring setup.
+
+Ring:	[ block 0 ]
+		[ frame 0 ]
+		[ frame 1 ]
+	[ block 1 ]
+		[ frame 2 ]
+		[ frame 3 ]
+	...
+	[ block n ]
+		[ frame 2 * n ]
+		[ frame 2 * n + 1 ]
+
+The blocks are only visible to the kernel, from the point of view of user-space
+the ring just contains the frames in a continous memory zone.
+
+The ring parameters used for setting up the ring are defined as follows:
+
+struct nl_mmap_req {
+	unsigned int	nm_block_size;
+	unsigned int	nm_block_nr;
+	unsigned int	nm_frame_size;
+	unsigned int	nm_frame_nr;
+};
+
+Frames are grouped into blocks, where each block is a continous region of memory
+and holds nm_block_size / nm_frame_size frames. The total number of frames in
+the ring is nm_frame_nr. The following invariants hold:
+
+- frames_per_block = nm_block_size / nm_frame_size
+
+- nm_frame_nr = frames_per_block * nm_block_nr
+
+Some parameters are constrained, specifically:
+
+- nm_block_size must be a multiple of the architectures memory page size.
+  The getpagesize() function can be used to get the page size.
+
+- nm_frame_size must be equal or larger to NL_MMAP_HDRLEN, IOW a frame must be
+  able to hold at least the frame header
+
+- nm_frame_size must be smaller or equal to nm_block_size
+
+- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT
+
+- nm_frame_nr must equal the actual number of frames as specified above.
+
+When the kernel can't allocate phsyically continous memory for a ring block,
+it will fall back to use physically discontinous memory. This might affect
+performance negatively, in order to avoid this the nm_frame_size parameter
+should be chosen to be as small as possible for the required frame size and
+the number of blocks should be increased instead.
+
+Ring frames
+------------
+
+Each frames contain a frame header, consisting of a synchronization word and some
+meta-data, and the message itself.
+
+Frame:	[ header message ]
+
+The frame header is defined as follows:
+
+struct nl_mmap_hdr {
+	unsigned int	nm_status;
+	unsigned int	nm_len;
+	__u32		nm_group;
+	/* credentials */
+	__u32		nm_pid;
+	__u32		nm_uid;
+	__u32		nm_gid;
+};
+
+- nm_status is used for synchronizing processing between the kernel and user-
+  space and specifies ownership of the frame as well as the operation to perform
+
+- nm_len contains the length of the message contained in the data area
+
+- nm_group specified the destination multicast group of message
+
+- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending
+  process. These values correspond to the data available using SOCK_PASSCRED in
+  the SCM_CREDENTIALS cmsg.
+
+The possible values in the status word are:
+
+- NL_MMAP_STATUS_UNUSED:
+	RX ring:	frame belongs to the kernel and contains no message
+			for user-space. Approriate action is to invoke poll()
+			to wait for new messages.
+	TX ring:	frame belongs to user-space and can be used for
+			message construction.
+
+- NL_MMAP_STATUS_RESERVED:
+	RX ring only:	frame is currently used by the kernel for message
+			construction and contains no valid message yet.
+			Appropriate action is to invoke poll() to wait for
+			new messages.
+
+- NL_MMAP_STATUS_VALID:
+	RX ring:	frame contains a valid message. Approriate action is
+			to process the message and release the frame back to
+			the kernel by setting the status to
+			NL_MMAP_STATUS_UNUSED.
+	TX ring:	the frame contains a valid message from user-space to
+			be processed by the kernel. After completing processing
+			the kernel will release the frame back to user-space by
+			setting the status to NL_MMAP_STATUS_UNUSED.
+
+- NL_MMAP_STATUS_COPY:
+	RX ring only:	a message is ready to be processed but could not be
+			stored in the ring, either because it exceeded the
+			frame size or because the originating subsystem does
+			not support memory mapped I/O. Appropriate action is
+			to invoke recvmsg() to receive the message and release
+			the frame back to the kernel by setting the status to
+			NL_MMAP_STATUS_UNUSED.
+
+- NL_MMAP_STATUS_SKIP:
+	RX ring only:	user-space queued the message for later processing, but
+			processed some messages following it in the ring. The
+			kernel should skip this frame when looking for unused
+			frames.
+
+The data area of a frame begins at a offset of NL_MMAP_HDRLEN relative to the
+frame header.
+
+TX limitations
+--------------
+
+Kernel processing usually involves validation of the message received by
+user-space, then processing its contents. The kernel must assure that
+userspace is not able to modify the message contents after they have been
+validated. In order to do so, the message is copied from the ring frame
+to an allocated buffer if either of these conditions is false:
+
+- only a single mapping of the ring exists
+- the file descriptor is not shared between processes
+
+This means that for threaded programs, the kernel will fall back to copying.
+
+Example
+-------
+
+Ring setup:
+
+	unsigned int block_size = 16 * getpagesize();
+	struct nl_mmap_req req = {
+		.nm_block_size		= block_size,
+		.nm_block_nr		= 64,
+		.nm_frame_size		= 16384,
+		.nm_frame_nr		= 64 * block_size / 16384,
+	};
+	unsigned int ring_size;
+	void *rx_ring, *tx_ring;
+
+	/* Configure ring parameters */
+	if (setsockopt(fd, NETLINK_RX_RING, &req, sizeof(req)) < 0)
+		exit(1);
+	if (setsockopt(fd, NETLINK_TX_RING, &req, sizeof(req)) < 0)
+		exit(1)
+
+	/* Calculate size of each invididual ring */
+	ring_size = req.nm_block_nr * req.nm_block_size;
+
+	/* Map RX/TX rings. The TX ring is located after the RX ring */
+	rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE,
+		       MAP_SHARED, fd, 0);
+	if ((long)rx_ring == -1L)
+		exit(1);
+	tx_ring = rx_ring + ring_size:
+
+Message reception:
+
+This example assumes some ring parameters of the ring setup are available.
+
+	unsigned int frame_offset = 0;
+	struct nl_mmap_hdr *hdr;
+	struct nlmsghdr *nlh;
+	unsigned char buf[16384];
+	ssize_t len;
+
+	while (1) {
+		struct pollfd pfds[1];
+
+		pfds[0].fd	= fd;
+		pfds[0].events	= POLLIN | POLLERR;
+		pfds[0].revents	= 0;
+
+		if (poll(pfds, 1, -1) < 0 && errno != -EINTR)
+			exit(1);
+
+		/* Check for errors. Error handling omitted */
+		if (pfds[0].revents & POLLERR)
+			<handle error>
+
+		/* If no new messages, poll again */
+		if (!(pfds[0].revents & POLLIN))
+			continue;
+
+		/* Process all frames */
+		while (1) {
+			/* Get next frame header */
+			hdr = rx_ring + frame_offset;
+
+			if (hdr->nm_status == NL_MMAP_STATUS_VALID)
+				/* Regular memory mapped frame */
+				nlh = (void *hdr) + NL_MMAP_HDRLEN;
+				len = hdr->nm_len;
+			} else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
+				/* Frame queued to socket receive queue */
+				len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
+				if (len <= 0)
+					break;
+				nlh = buf;
+			} else
+				/* No more messages to process, continue polling */
+				break;
+
+			process_msg(nlh);
+
+			/* Release frame back to the kernel */
+			hdr->nm_status = NL_MMAP_STATUS_UNUSED;
+
+			/* Advance frame offset to next frame */
+			frame_offset = (frame_offset + frame_size) % ring_size;
+		}
+	}
+
+Message transmission:
+
+This example assumes some ring parameters of the ring setup are available.
+A single message is constructed and transmitted, to send multiple messages
+at once they would be constructed in consecutive frames before a final call
+to sendto().
+
+	unsigned int frame_offset = 0;
+	struct nl_mmap_hdr *hdr;
+	struct nlmsghdr *nlh;
+	struct sockaddr_nl addr = {
+		.nl_family	= AF_NETLINK,
+	};
+
+	hdr = tx_ring + frame_offset;
+	if (hdr->nm_status != NL_MMAP_STATUS_UNUSED)
+		/* No frame available. Use poll() to avoid. */
+		exit(1);
+
+	nlh = (void *)hdr + NL_MMAP_HDRLEN;
+
+	/* Build message */
+	build_message(nlh);
+
+	/* Fill frame header: length and status need to be set */
+	hdr->nm_len	= nlh->nlmsg_len;
+	hdr->nm_status	= NL_MMAP_STATUS_VALID;
+
+	if (sendto(fd, NULL, 0, 0, &addr, sizeof(addr)) < 0)
+		exit(1);
+
+	/* Advance frame offset to next frame */
+	frame_offset = (frame_offset + frame_size) % ring_size;
-- 
1.7.4.4

^ permalink raw reply related

* Re: [PATCH] usbnet: ignore get interface retval of -EINPROGRESS
From: Jim Wylder @ 2011-09-03 17:32 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: netdev
In-Reply-To: <201109031837.48444.oliver@neukum.org>

Oliver,

Thanks for the quick feedback.  True usbnet_start_xmit() could be
running at anytime, but the usb_autopm_get_interface_async() will only
return -EINPROGRESS when rpm_resume detects that
dev->power.runtime_status == RPM_RESUMING.  My understanding is that
for an asynchronous request, the promise that the device is resuming
would be equivalent to cases where usb_autopm_get_interface_async()
returns success.

In all other cases, when we are not attempting to resume an already
resuming interface, this change should have no impact.

Are you recommending that I add an additional check for DEV_ASLEEP,
possibly to decide whether to drop the or continue on?  Or am I
missing your point?  I had not done anything similar because my
understanding was that knowing that the device is in fact resuming
would be sufficient.

Jim

On Sat, Sep 3, 2011 at 11:37 AM, Oliver Neukum <oliver@neukum.org> wrote:
> Am Samstag, 3. September 2011, 16:21:00 schrieb Jim Wylder:
>> When calling pm_runtime_get, usb_autopm_get_interface_async
>> treats a return value of -EINPROGRESS as a success and
>> increments the usage count.  Since the interface is resuming,
>> it is safe for usbnet_start_xmit to submit the urb.
>
> usbnet_start_xmit() is exported, so simply stating that it is
> called when the interface is resumingisn't enough. It seems to
> me that the later check for DEV_ASLEEP will save as, but have
> you checked for that?
>
>        Regards
>                Oliver
>

^ permalink raw reply

* [PATCH 1/1] drivers/net/usb/usbnet.c: No need for usb_free_urb in rx_complete
From: Kautuk Consul @ 2011-09-03 17:35 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
	Kautuk Consul

usb_hcd_giveback_urb frees the urb irrespective of any return
value from the urb->complete function. It calls usb_put_urb which 
is #defined to usb_free_urb, so maybe we shouldn't be freeing this 
from rx_complete.

But I can't get around the fact that this has not been detected as a problem 
till now, so I could be quite wrong about this. :)

Signed-off-by: Kautuk Consul <consul.kautuk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 drivers/net/usb/usbnet.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index ce395fe..6df8094 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -491,7 +491,6 @@ block:
 			rx_submit (dev, urb, GFP_ATOMIC);
 			return;
 		}
-		usb_free_urb (urb);
 	}
 	netif_dbg(dev, rx_err, dev->net, "no read resubmitted\n");
 }
-- 
1.7.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Nicolas de Pesloüan @ 2011-09-03 18:32 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David S. Miller, netdev
In-Reply-To: <20110902151100.327af0bf@nehalam.ftrdhcpuser.net>

Le 03/09/2011 00:11, Stephen Hemminger a écrit :
> On Fri, 02 Sep 2011 23:39:03 +0200
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:
>
>> Le 02/09/2011 19:22, Stephen Hemminger a écrit :
>>> This resolves a regression seen by some users of bridging.
>>> Some users use the bridge like a dummy device.
>>> They expect to be able to put an IPv6 address on the device
>>> with no ports attached during boot.
>>>
>>> Note: the bridge still will reflect the state of ports in the
>>> bridge if there are any added.
>>
>> Doesn't this jeopardize the behavior introduced in 1faa4356a3bd89ea11fb92752d897cff3a20ec0e
>> "bridge: control carrier based on ports online"?
>>
>> If the user starts the DHCP client before adding the first port to the bridge, the DHCP client will
>> have a carrier and start the autoconfiguration process. This was the old behavior, but you fixed it.
>>
>> 	Nicolas.
>>
>
> There is no perfect solution.
> If DHCP works then IPv6 breaks?

Instead of asserting carrier when the bridge have no port, can't we assert carrier when the three 
following condition are true at the same time :

- The bridge have no port.
- At least one IP address is setup on the bridge.
- The two above conditions are true for more than a configurable amount of seconds, with a default 
of 10, for example.

This would only delay carrier on for a few seconds for the regression and keep the current behavior 
(carrier off until at least 1 port is on) for DHCP.

	Nicolas.

^ permalink raw reply

* [PATCH] netfilter: install nf_nat.h and related headers to INSTALL_HDR_PATH
From: Anthony G. Basile @ 2011-09-03 18:49 UTC (permalink / raw)
  To: davem
  Cc: kaber, basile, blueness, gurligebis, base-system, kernel,
	toolchain, mchehab, hverkuil, laurent.pinchart, arnd, eparis,
	linux-kernel, netdev, netfilter-devel, netfilter, coreteam

Currently nf_nat.h, nf_conntrack_tuple.h and related headers under
include/net/netfilter are not installed as part of the public kernel
headers.   However, there are userland applications, other than iptables
which ships with its own headers, which need these to make use of NAT in
the kernel's netfilter API.  For example, miniupnpd, requires them and is
forced to search /usr/src/linux when building.

This patch makes these headers public by installing them in
INSTALL_HDR_PATH.

See: https://bugs.gentoo.org/376873

Signed-off-by: Anthony G. Basile <blueness@gentoo.org>
---
 include/Kbuild                    |    1 +
 include/linux/Kbuild              |    1 +
 include/net/Kbuild                |    1 +
 include/net/netfilter/Kbuild      |    6 ++++++
 include/net/netfilter/ipv4/Kbuild |    1 +
 include/net/netfilter/ipv6/Kbuild |    2 ++
 6 files changed, 12 insertions(+), 0 deletions(-)
 create mode 100644 include/net/Kbuild
 create mode 100644 include/net/netfilter/Kbuild
 create mode 100644 include/net/netfilter/ipv4/Kbuild
 create mode 100644 include/net/netfilter/ipv6/Kbuild

diff --git a/include/Kbuild b/include/Kbuild
index 8d226bf..9fb8300 100644
--- a/include/Kbuild
+++ b/include/Kbuild
@@ -5,6 +5,7 @@ header-y += asm-generic/
 header-y += linux/
 header-y += sound/
 header-y += mtd/
+header-y += net/
 header-y += rdma/
 header-y += video/
 header-y += drm/
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 619b565..5569432 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -228,6 +228,7 @@ header-y += keyboard.h
 header-y += keyctl.h
 header-y += l2tp.h
 header-y += limits.h
+header-y += list_nulls.h
 header-y += llc.h
 header-y += loop.h
 header-y += lp.h
diff --git a/include/net/Kbuild b/include/net/Kbuild
new file mode 100644
index 0000000..9546082
--- /dev/null
+++ b/include/net/Kbuild
@@ -0,0 +1 @@
+header-y += netfilter/
diff --git a/include/net/netfilter/Kbuild b/include/net/netfilter/Kbuild
new file mode 100644
index 0000000..143f188
--- /dev/null
+++ b/include/net/netfilter/Kbuild
@@ -0,0 +1,6 @@
+header-y += nf_nat.h
+header-y += nf_conntrack.h
+header-y += nf_conntrack_tuple.h
+header-y += nf_conntrack_extend.h
+header-y += ipv4/
+header-y += ipv6/
diff --git a/include/net/netfilter/ipv4/Kbuild b/include/net/netfilter/ipv4/Kbuild
new file mode 100644
index 0000000..a15e304
--- /dev/null
+++ b/include/net/netfilter/ipv4/Kbuild
@@ -0,0 +1 @@
+header-y += nf_conntrack_ipv4.h
diff --git a/include/net/netfilter/ipv6/Kbuild b/include/net/netfilter/ipv6/Kbuild
new file mode 100644
index 0000000..07d43a4
--- /dev/null
+++ b/include/net/netfilter/ipv6/Kbuild
@@ -0,0 +1,2 @@
+header-y += nf_conntrack_icmpv6.h
+header-y += nf_conntrack_ipv6.h
-- 
1.7.0.4


^ permalink raw reply related

* Re: [PATCH] netfilter: install nf_nat.h and related headers to INSTALL_HDR_PATH
From: Jan Engelhardt @ 2011-09-03 19:41 UTC (permalink / raw)
  To: Anthony G. Basile
  Cc: davem, kaber, blueness, gurligebis, base-system, kernel,
	toolchain, mchehab, hverkuil, laurent.pinchart, arnd, eparis,
	linux-kernel, netdev, netfilter-devel, netfilter, coreteam
In-Reply-To: <1315075784-10163-1-git-send-email-basile@opensource.dyc.edu>

On Saturday 2011-09-03 20:49, Anthony G. Basile wrote:

>Currently nf_nat.h, nf_conntrack_tuple.h and related headers under
>include/net/netfilter are not installed as part of the public kernel
>headers.   However, there are userland applications, other than iptables
>which ships with its own headers, which need these to make use of NAT in
>the kernel's netfilter API.  For example, miniupnpd, requires them and is
>forced to search /usr/src/linux when building.
>
>This patch makes these headers public by installing them in
>INSTALL_HDR_PATH.
>
>See: https://bugs.gentoo.org/376873
>
>Signed-off-by: Anthony G. Basile <blueness@gentoo.org>

>@@ -0,0 +1,6 @@
>+header-y += nf_nat.h
>+header-y += nf_conntrack.h
>+header-y += nf_conntrack_tuple.h
>+header-y += nf_conntrack_extend.h
>+header-y += ipv4/
>+header-y += ipv6/

Should not the to-be-exported files better go into linux/ instead?

^ permalink raw reply

* Re: [PATCH 1/1] drivers/net/usb/usbnet.c: No need for usb_free_urb in rx_complete
From: Alan Stern @ 2011-09-03 20:02 UTC (permalink / raw)
  To: Kautuk Consul; +Cc: Oliver Neukum, netdev, linux-usb
In-Reply-To: <1315071353-14978-1-git-send-email-consul.kautuk@gmail.com>

On Sat, 3 Sep 2011, Kautuk Consul wrote:

> usb_hcd_giveback_urb frees the urb irrespective of any return
> value from the urb->complete function. It calls usb_put_urb which 
> is #defined to usb_free_urb, so maybe we shouldn't be freeing this 
> from rx_complete.

usb_free_urb() doesn't free its argument unless the refcount drops to 
0.  The refcount is incremented during usb_hcd_submit_urb(), and the 
decrement during usb_hcd_giveback_urb() merely undoes this increment.

> But I can't get around the fact that this has not been detected as a problem 
> till now, so I could be quite wrong about this. :)

Quite probably.  :-)

Alan Stern

^ permalink raw reply

* Re: [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper
From: Matt Helsley @ 2011-09-04  1:51 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
In-Reply-To: <1314993400-6910-5-git-send-email-serge@hallyn.com>

On Fri, Sep 02, 2011 at 07:56:27PM +0000, Serge Hallyn wrote:
> From: "Serge E. Hallyn" <serge@hallyn.com>

I was confused about this patch until I realized that you're not
simply "moving" the capability checks but "distributing" them. Then
you're showing that you'll soon change some to nsown_capable() or
ns_capable() using the strange cpp pattern in the snippet below.

At least I think that's what you intended. A commit message would
help :).

Cheers,
	-Matt Helsley

> 
> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  ipc/namespace.c          |    7 +++++++
>  kernel/fork.c            |    5 +++++
>  kernel/nsproxy.c         |   11 ++++++++---
>  kernel/utsname.c         |    7 +++++++
>  net/core/net_namespace.c |    7 +++++++
>  5 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index ce0a647..a0a7609 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -163,6 +163,13 @@ static void ipcns_put(void *ns)
> 
>  static int ipcns_install(struct nsproxy *nsproxy, void *ns)
>  {
> +#if 0
> +	struct ipc_namespace *newns = ns;
> +	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
> +#else
> +	if (!capable(CAP_SYS_ADMIN))
> +#endif
> +		return -1;
>  	/* Ditch state from the old ipc namespace */
>  	exit_sem(current);
>  	put_ipc_ns(nsproxy->ipc_ns);

^ permalink raw reply

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Stephen Hemminger @ 2011-09-04  4:14 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: David S. Miller, netdev
In-Reply-To: <4E6272BC.4020707@gmail.com>

On Sat, 03 Sep 2011 20:32:28 +0200
Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:

> Le 03/09/2011 00:11, Stephen Hemminger a écrit :
> > On Fri, 02 Sep 2011 23:39:03 +0200
> > Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:
> >
> >> Le 02/09/2011 19:22, Stephen Hemminger a écrit :
> >>> This resolves a regression seen by some users of bridging.
> >>> Some users use the bridge like a dummy device.
> >>> They expect to be able to put an IPv6 address on the device
> >>> with no ports attached during boot.
> >>>
> >>> Note: the bridge still will reflect the state of ports in the
> >>> bridge if there are any added.
> >>
> >> Doesn't this jeopardize the behavior introduced in 1faa4356a3bd89ea11fb92752d897cff3a20ec0e
> >> "bridge: control carrier based on ports online"?
> >>
> >> If the user starts the DHCP client before adding the first port to the bridge, the DHCP client will
> >> have a carrier and start the autoconfiguration process. This was the old behavior, but you fixed it.
> >>
> >> 	Nicolas.
> >>
> >
> > There is no perfect solution.
> > If DHCP works then IPv6 breaks?
> 
> Instead of asserting carrier when the bridge have no port, can't we assert carrier when the three 
> following condition are true at the same time :
> 
> - The bridge have no port.
> - At least one IP address is setup on the bridge.
> - The two above conditions are true for more than a configurable amount of seconds, with a default 
> of 10, for example.
> 
> This would only delay carrier on for a few seconds for the regression and keep the current behavior 
> (carrier off until at least 1 port is on) for DHCP.

This fails on two counts:
1. Bridge's often run without IP addresses!
2. DHCP won't try and send out request until carrier is true.

^ permalink raw reply

* [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Yan, Zheng @ 2011-09-04  5:44 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: davem@davemloft.net, sfr@canb.auug.org.au,
	tim.c.chen@linux.intel.com, jirislaby@gmail.com,
	sedat.dilek@gmail.com

Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
in Unix socket's send and receive path) introduced a use-after-free bug.
It passes the scm reference to the first skb. Skb(s) afterwards may
reference freed data structure because the first skb can be destructed
by the receiver at anytime. The fix is by passing the scm reference to
the very last skb.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reported-by: Jiri Slaby <jirislaby@gmail.com>
---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e6d9d10..77ec8e8 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1577,6 +1577,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	int sent = 0;
 	struct scm_cookie tmp_scm;
 	bool fds_sent = false;
+	bool scm_ref = true;
 	int max_level;
 
 	if (NULL == siocb->scm)
@@ -1637,12 +1638,15 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		 */
 		size = min_t(int, size, skb_tailroom(skb));
 
+		/* pass the scm reference to the very last skb */
+		if (sent + size >= len)
+			scm_ref = false;
 
-		/* Only send the fds and no ref to pid in the first buffer */
-		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
+		/* Only send the fds in the first buffer */
+		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, scm_ref);
 		if (err < 0) {
 			kfree_skb(skb);
-			goto out;
+			goto out_err;
 		}
 		max_level = err + 1;
 		fds_sent = true;
@@ -1650,7 +1654,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		err = memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size);
 		if (err) {
 			kfree_skb(skb);
-			goto out;
+			goto out_err;
 		}
 
 		unix_state_lock(other);
@@ -1667,10 +1671,10 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		sent += size;
 	}
 
-	if (skb)
-		scm_release(siocb->scm);
-	else
+	if (scm_ref)
 		scm_destroy(siocb->scm);
+	else
+		scm_release(siocb->scm);
 	siocb->scm = NULL;
 
 	return sent;
@@ -1683,9 +1687,10 @@ pipe_err:
 		send_sig(SIGPIPE, current, 0);
 	err = -EPIPE;
 out_err:
-	if (skb == NULL)
+	if (scm_ref)
 		scm_destroy(siocb->scm);
-out:
+	else
+		scm_release(siocb->scm);
 	siocb->scm = NULL;
 	return sent ? : err;
 }

^ permalink raw reply related

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Sedat Dilek @ 2011-09-04  7:12 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	tim.c.chen@linux.intel.com, jirislaby@gmail.com
In-Reply-To: <4E631032.6050606@intel.com>

On Sun, Sep 4, 2011 at 7:44 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
> in Unix socket's send and receive path) introduced a use-after-free bug.
> It passes the scm reference to the first skb. Skb(s) afterwards may
> reference freed data structure because the first skb can be destructed
> by the receiver at anytime. The fix is by passing the scm reference to
> the very last skb.
>

s/by passing/bypassing ?

> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
> Reported-by: Jiri Slaby <jirislaby@gmail.com>
> ---

Tested on i386 against linux-next (next-20110831).

- Sedat -

^ permalink raw reply

* Re: [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Nicolas de Pesloüan @ 2011-09-04  7:35 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David S. Miller, netdev
In-Reply-To: <20110903211438.2a43d2f2@nehalam.ftrdhcpuser.net>

Le 04/09/2011 06:14, Stephen Hemminger a écrit :

>> Instead of asserting carrier when the bridge have no port, can't we assert carrier when the three
>> following condition are true at the same time :
>>
>> - The bridge have no port.
>> - At least one IP address is setup on the bridge.
>> - The two above conditions are true for more than a configurable amount of seconds, with a default
>> of 10, for example.
>>
>> This would only delay carrier on for a few seconds for the regression and keep the current behavior
>> (carrier off until at least 1 port is on) for DHCP.
>
> This fails on two counts:
> 1. Bridge's often run without IP addresses!
> 2. DHCP won't try and send out request until carrier is true.

Sorry, I missed to say that we should of course also assert carrier on if one port has carrier on.

And rethinking about it, the delay is probably useless :

bridge_carrier_on = at_least_one_port_has_carrier_on | (bridge_has_no_port & bridge_has_at_least_one_ip)

That way :
- for those using bridge without any port, manually setting the IP will assert carrier on. (By the 
way, why don't they use a dummy device instead?)

- for those using bridge with ports:
-- Using any kind of autoconfig will work as expected. Carrier will only be asserted at the time 
first port get carrier.
-- Using static IP confifiguration, carrier will possibly be erroneously reported as on during the 
small time gap between IP address configuration and first port is added to the bridge. This time gap 
may be removed by simply configuring the IP after the first port is added. This is probably already 
true for most distribs. And anyway, this time gap is probably not a problem.
-- Carrier will also be erroneously reported as on after removing the last port, if the bridge still 
has an IP. (But we can arrange for this not to happen).

And in order to ensure user really understand why carrier is on of off, we can simply issue an INFO 
message for the non-natural case (bridge_has_no_port & bridga_has_at_least_one_ip).

I consider all this reasonable.

	Nicolas.

^ permalink raw reply

* Re: [PATCH 02/11] net: add function to allocate skbuff head without data area
From: Eric Dumazet @ 2011-09-04  8:12 UTC (permalink / raw)
  To: kaber; +Cc: davem, netfilter-devel, netdev
In-Reply-To: <1315070771-18576-3-git-send-email-kaber@trash.net>

Le samedi 03 septembre 2011 à 19:26 +0200, kaber@trash.net a écrit :

> +struct sk_buff *__alloc_skb_head(gfp_t gfp_mask, int node)
> +{
> +	struct sk_buff *skb;
> +
> +	/* Get the HEAD */
> +	skb = kmem_cache_alloc_node(skbuff_head_cache,
> +				    gfp_mask & ~__GFP_DMA, node);
> +	if (!skb)
> +		goto out;
> +	prefetchw(skb);

Please remove this prefetchw(), since we have no delay between it and
actual memset(skb).

> +
> +	/*
> +	 * Only clear those fields we need to clear, not those that we will
> +	 * actually initialise below. Hence, don't put any more fields after
> +	 * the tail pointer in struct sk_buff!
> +	 */
> +	memset(skb, 0, offsetof(struct sk_buff, tail));



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Yan, Zheng @ 2011-09-04  8:23 UTC (permalink / raw)
  To: sedat.dilek
  Cc: netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	tim.c.chen@linux.intel.com, jirislaby@gmail.com
In-Reply-To: <CA+icZUU25QMMCVT1VfF9XB0wrkppNmKQbHwoHDAQ-qV9bi+HqQ@mail.gmail.com>

On Sun, Sep 4, 2011 at 3:12 PM, Sedat Dilek <sedat.dilek@googlemail.com> wrote:
> On Sun, Sep 4, 2011 at 7:44 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
>> in Unix socket's send and receive path) introduced a use-after-free bug.
>> It passes the scm reference to the first skb. Skb(s) afterwards may
>> reference freed data structure because the first skb can be destructed
>> by the receiver at anytime. The fix is by passing the scm reference to
>> the very last skb.
>>
>
> s/by passing/bypassing ?

No

>
>> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
>> Reported-by: Jiri Slaby <jirislaby@gmail.com>
>> ---
>
> Tested on i386 against linux-next (next-20110831).
>

Thank you.

> - Sedat -
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH] usbnet: ignore get interface retval of -EINPROGRESS
From: Oliver Neukum @ 2011-09-04 10:16 UTC (permalink / raw)
  To: Jim Wylder; +Cc: netdev
In-Reply-To: <CAPopfEWi0v9VXeDYcBfEgnWYxg7WcTTj3GBc6USzwcN+aGrKOQ@mail.gmail.com>

Am Samstag, 3. September 2011, 19:32:44 schrieb Jim Wylder:

Hi,

> Thanks for the quick feedback.  True usbnet_start_xmit() could be
> running at anytime, but the usb_autopm_get_interface_async() will only
> return -EINPROGRESS when rpm_resume detects that
> dev->power.runtime_status == RPM_RESUMING.  My understanding is that
> for an asynchronous request, the promise that the device is resuming
> would be equivalent to cases where usb_autopm_get_interface_async()
> returns success.

If CONFIG_PM is set.

> In all other cases, when we are not attempting to resume an already
> resuming interface, this change should have no impact.
> 
> Are you recommending that I add an additional check for DEV_ASLEEP,

The check is already there but depending on CONFIG_PM.

> possibly to decide whether to drop the or continue on?  Or am I
> missing your point?  I had not done anything similar because my
> understanding was that knowing that the device is in fact resuming
> would be sufficient.

At that point it is sufficient. Upon further thought it looks like your check
is correct.
We just need to make very sure that for the decision to queue or
to submit EVENT_DEV_ASLEEP is relevant.
Would you consider defining a macro with a nice name for the
==0 || == -EINPROGRESS check, so that any reader knows what
this is about? This is because I doubt only usbnet is affected.

	Regards
		Oliver

^ permalink raw reply

* Re: linux-next: Tree for Aug 26 (git-bisect: [0856a30] Scm: Remove unnecessary pid & credential references in Unix socket's send and receive path)
From: Sedat Dilek @ 2011-09-04 10:43 UTC (permalink / raw)
  To: Stephen Rothwell, tim.c.chen
  Cc: linux-next, LKML, jirislaby, David Miller, Yan, Zheng,
	Valdis.Kletnieks, netdev
In-Reply-To: <CA+icZUXOnoG07Z8-AAUNqvuJmq6-OY1_dTzkxsBOLczoNYYGEw@mail.gmail.com>

On Fri, Sep 2, 2011 at 11:43 PM, Sedat Dilek <sedat.dilek@googlemail.com> wrote:
> On Fri, Aug 26, 2011 at 7:00 AM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>> Hi all,
>>
>> The powerpc allyesconfig build still fails today.
>>
>> Changes since 20110825:
>>
>> The slave-dma tree gained a conflict against Linus' tree.
>>
>> I have reverted the x86/spinlocks branch from the tip tree for today.
>>
>> The ptrace tree lost its build failure.
>>
>> The xen tree lost its conflict since I reveretd the version of the same
>> patches in the tip tree..
>>
>> The tty tree still has its build failure but I applied a supplied test
>> patch for it.
>>
>> The staging tree lost its build failures and gained a conflict against
>> the net tree.
>>
>> The moduleh tree lost a conflict.
>>
>> The akpm tree lost its build failure.
>>
>> ----------------------------------------------------------------------------
>>
>
> Hi,
>
> I saw diverse strange kernel-panics letting my notebook blink and
> scream a high-frequency tone, really ugly.
>
> My last good linux-next (l-n) kernel was next-20110818 and the next
> l-n kernel in series which I built and was broken: next-20110826.
> ( BTW, 29/31 Aug are broken too. )
>
> For the correctness I built 22/23/24/25 Aug which were all good.
> As usual when I do once a year a git-bisect... its the last step of 12
> steps in total (see git-bisect_visualize.txt and
> git-bisect_view-stat.txt).
>
> I was irritated by using next-20110825 and next-20110826 as git-bisect
> good and bad, as the first offered commit-id pointed me to
> 3.1-rc1-665-gec5efe7, but I was sure the culprit is definitely
> v3.1-rc3+ and furthermore I got no info how many revisions have to be
> walked through. So, I cancelled this run.
>
> Next thought was to do a git format-patch using above mentionned
> commit-ids (of the tags): 819 single patches.
> I took the commit-ids of 0001 (good) and 0818 (bad) - 0819 was
> linux-specific mumbo-jumbo.
>
> Jiri has reported a similiar breakage in [2] and returning CULPRIT
> commit [1] helped him.
> ( This git-bisect steal a whole Friday. Anyway, it's good to see we
> isolated the "bad" patch. )
>
> As a personal conclusion, trust more git-bisect...
> It does not hurt to NOT turn off brain my looking over the single
> patches, there were lots of staging, arm, powerpc which could be
> surely ignored, but in the end you are wiser - 12 steps are faster
> than reverting xxx single commits on spec.
>
> - Sedat -
>
> [1] http://git.us.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commit;h=0856a304091b33a8e8f9f9c98e776f425af2b625
> [2] http://lkml.org/lkml/2011/9/1/332
> [3] http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
> [4] http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#using-bisect
>
> P.S.: Used 0001 (good) and 0818 (bad) for git-bisect
>
> $ head -5 0001-target-Change-TCM_NON_EXISTENT_LUN-response-to-ASC-L.patch
> 0818-Fixes-the-deadlock-when-inserting-and-removing-the-d.patch
> ==> 0001-target-Change-TCM_NON_EXISTENT_LUN-response-to-ASC-L.patch <==
> From eb39d34004888afcc0a44d9c36383cd69fa3b3b9 Mon Sep 17 00:00:00 2001
> From: Nicholas Bellinger <nab@linux-iscsi.org>
> Date: Tue, 26 Jul 2011 16:59:00 -0700
> Subject: [PATCH 001/819] target: Change TCM_NON_EXISTENT_LUN response to
>  ASC=LOGICAL UNIT NOT SUPPORTED
>
> ==> 0818-Fixes-the-deadlock-when-inserting-and-removing-the-d.patch <==
> From 9b198e560524e3fe341cd2ae478a1cdf2f57fc33 Mon Sep 17 00:00:00 2001
> From: Clifton Barnes <cabarnes@indesign-llc.com>
> Date: Thu, 25 Aug 2011 09:47:59 +1000
> Subject: [PATCH 818/819] Fixes the deadlock when inserting and removing the
>  ds2780.
>

Issue fixed, see "[PATCH -next v2] unix stream: Fix use-after-free crashes" [1].

- Sedat -

[1] http://marc.info/?l=linux-netdev&m=131511507007022&w=2

^ permalink raw reply

* [PATCH 2/2] net: Handle different key sizes between address families in flow cache
From: David Ward @ 2011-09-04 13:07 UTC (permalink / raw)
  To: netdev; +Cc: Julian Anastasov, David Ward
In-Reply-To: <alpine.LFD.2.00.1109031004500.1492@ja.ssi.bg>

With the conversion of struct flowi to a union of AF-specific structs, some
operations on the flow cache need to account for the exact size of the key.

Signed-off-by: David Ward <david.ward@ll.mit.edu>
---
 include/net/flow.h |   19 +++++++++++++++++++
 net/core/flow.c    |   28 ++++++++++++++++------------
 2 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 2ec377d..99119f6 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -7,6 +7,7 @@
 #ifndef _NET_FLOW_H
 #define _NET_FLOW_H
 
+#include <linux/socket.h>
 #include <linux/in6.h>
 #include <linux/atomic.h>
 
@@ -161,6 +162,24 @@ static inline struct flowi *flowidn_to_flowi(struct flowidn *fldn)
 	return container_of(fldn, struct flowi, u.dn);
 }
 
+typedef unsigned long flow_compare_t;
+
+static inline size_t flowi_size(u16 family)
+{
+	switch (family) {
+	case AF_INET:
+		BUILD_BUG_ON(sizeof(struct flowi4) % sizeof(flow_compare_t));
+		return sizeof(struct flowi4);
+	case AF_INET6:
+		BUILD_BUG_ON(sizeof(struct flowi6) % sizeof(flow_compare_t));
+		return sizeof(struct flowi6);
+	case AF_DECnet:
+		BUILD_BUG_ON(sizeof(struct flowidn) % sizeof(flow_compare_t));
+		return sizeof(struct flowidn);
+	}
+	return 0;
+}
+
 #define FLOW_DIR_IN	0
 #define FLOW_DIR_OUT	1
 #define FLOW_DIR_FWD	2
diff --git a/net/core/flow.c b/net/core/flow.c
index bf32c33..1545445 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -172,26 +172,24 @@ static void flow_new_hash_rnd(struct flow_cache *fc,
 
 static u32 flow_hash_code(struct flow_cache *fc,
 			  struct flow_cache_percpu *fcp,
-			  const struct flowi *key)
+			  const struct flowi *key,
+			  size_t keysize)
 {
 	const u32 *k = (const u32 *) key;
 
-	return jhash2(k, (sizeof(*key) / sizeof(u32)), fcp->hash_rnd)
+	return jhash2(k, (keysize / sizeof(u32)), fcp->hash_rnd)
 		& (flow_cache_hash_size(fc) - 1);
 }
 
-typedef unsigned long flow_compare_t;
-
 /* I hear what you're saying, use memcmp.  But memcmp cannot make
  * important assumptions that we can here, such as alignment and
- * constant size.
+ * size.
  */
-static int flow_key_compare(const struct flowi *key1, const struct flowi *key2)
+static int flow_key_compare(const struct flowi *key1, const struct flowi *key2,
+			    size_t keysize)
 {
 	const flow_compare_t *k1, *k1_lim, *k2;
-	const int n_elem = sizeof(struct flowi) / sizeof(flow_compare_t);
-
-	BUILD_BUG_ON(sizeof(struct flowi) % sizeof(flow_compare_t));
+	const int n_elem = keysize / sizeof(flow_compare_t);
 
 	k1 = (const flow_compare_t *) key1;
 	k1_lim = k1 + n_elem;
@@ -215,6 +213,7 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
 	struct flow_cache_entry *fle, *tfle;
 	struct hlist_node *entry;
 	struct flow_cache_object *flo;
+	size_t keysize;
 	unsigned int hash;
 
 	local_bh_disable();
@@ -222,6 +221,11 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
 
 	fle = NULL;
 	flo = NULL;
+
+	keysize = flowi_size(family);
+	if (!keysize)
+		goto nocache;
+
 	/* Packet really early in init?  Making flow_cache_init a
 	 * pre-smp initcall would solve this.  --RR */
 	if (!fcp->hash_table)
@@ -230,11 +234,11 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
 	if (fcp->hash_rnd_recalc)
 		flow_new_hash_rnd(fc, fcp);
 
-	hash = flow_hash_code(fc, fcp, key);
+	hash = flow_hash_code(fc, fcp, key, keysize);
 	hlist_for_each_entry(tfle, entry, &fcp->hash_table[hash], u.hlist) {
 		if (tfle->family == family &&
 		    tfle->dir == dir &&
-		    flow_key_compare(key, &tfle->key) == 0) {
+		    flow_key_compare(key, &tfle->key, keysize) == 0) {
 			fle = tfle;
 			break;
 		}
@@ -248,7 +252,7 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
 		if (fle) {
 			fle->family = family;
 			fle->dir = dir;
-			memcpy(&fle->key, key, sizeof(*key));
+			memcpy(&fle->key, key, keysize);
 			fle->object = NULL;
 			hlist_add_head(&fle->u.hlist, &fcp->hash_table[hash]);
 			fcp->hash_count++;
-- 
1.7.1

^ permalink raw reply related

* [PATCH 0/2] Fixes to flow cache for AF-specifc flowi structs
From: David Ward @ 2011-09-04 13:07 UTC (permalink / raw)
  To: netdev; +Cc: Julian Anastasov, David Ward
In-Reply-To: <alpine.LFD.2.00.1109031004500.1492@ja.ssi.bg>

These fixes to the flow cache are needed with the conversion to AF-specific
flowi structs. They are written so as to avoid introducing AF-specific code
into net/core/flow.c.

Note that __xfrm_policy_check (in net/xfrm/xfrm_policy.c) still allocates a
struct flowi on the stack and passes it to flow_cache_lookup. My understanding
is that since this is on the stack, this will not be aligned, and therefore it
will cause problems with flow_hash_code and flow_key_compare. Is that correct?

Signed-off-by: David Ward <david.ward@ll.mit.edu>

David Ward (2):
  net: Align AF-specific flowi structs to long
  net: Handle different key sizes between address families in flow
    cache

 include/net/flow.h |   25 ++++++++++++++++++++++---
 net/core/flow.c    |   28 ++++++++++++++++------------
 2 files changed, 38 insertions(+), 15 deletions(-)

^ permalink raw reply

* [PATCH 1/2] net: Align AF-specific flowi structs to long
From: David Ward @ 2011-09-04 13:07 UTC (permalink / raw)
  To: netdev; +Cc: Julian Anastasov, David Ward
In-Reply-To: <alpine.LFD.2.00.1109031004500.1492@ja.ssi.bg>

AF-specific flowi structs are now passed to flow_key_compare, which must
also be aligned to a long.

Signed-off-by: David Ward <david.ward@ll.mit.edu>
---
 include/net/flow.h |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 78113da..2ec377d 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -68,7 +68,7 @@ struct flowi4 {
 #define fl4_ipsec_spi		uli.spi
 #define fl4_mh_type		uli.mht.type
 #define fl4_gre_key		uli.gre_key
-};
+} __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 static inline void flowi4_init_output(struct flowi4 *fl4, int oif,
 				      __u32 mark, __u8 tos, __u8 scope,
@@ -112,7 +112,7 @@ struct flowi6 {
 #define fl6_ipsec_spi		uli.spi
 #define fl6_mh_type		uli.mht.type
 #define fl6_gre_key		uli.gre_key
-};
+} __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 struct flowidn {
 	struct flowi_common	__fl_common;
@@ -127,7 +127,7 @@ struct flowidn {
 	union flowi_uli		uli;
 #define fld_sport		uli.ports.sport
 #define fld_dport		uli.ports.dport
-};
+} __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 struct flowi {
 	union {
-- 
1.7.1

^ permalink raw reply related

* Re: [next] unix stream crashes
From: Valdis.Kletnieks @ 2011-09-04 14:43 UTC (permalink / raw)
  To: Yan, Zheng 
  Cc: Jiri Slaby, sedat.dilek, Sedat Dilek, Tim Chen, David S. Miller,
	ML netdev, LKML, Stephen Rothwell
In-Reply-To: <CAAM7YAkB3VVNMmBMVuvEZuV6oGZeyog37_sjFGUunu+15apvZA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 321 bytes --]

On Sat, 03 Sep 2011 20:30:19 +0800, "Yan, Zheng " said:
> The skb can be destructed before the while loop in unix_stream_sendmsg stops.
> please try below patch.

Tried this on top of next-20110831, and system is up and running fine.

Feel free to stick this on it:

Tested-By: Valdis Kletnieks <valdis.kletnieks@vt.edu>

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] net: Handle different key sizes between address families in flow cache
From: Michał Mirosław @ 2011-09-04 15:34 UTC (permalink / raw)
  To: David Ward; +Cc: netdev, Julian Anastasov
In-Reply-To: <1315141641-3120-3-git-send-email-david.ward@ll.mit.edu>

2011/9/4 David Ward <david.ward@ll.mit.edu>:
> With the conversion of struct flowi to a union of AF-specific structs, some
> operations on the flow cache need to account for the exact size of the key.
[...]
> --- a/include/net/flow.h
> +++ b/include/net/flow.h
[...]
> +typedef unsigned long flow_compare_t;
> +
> +static inline size_t flowi_size(u16 family)
> +{
> +       switch (family) {
> +       case AF_INET:
> +               BUILD_BUG_ON(sizeof(struct flowi4) % sizeof(flow_compare_t));
> +               return sizeof(struct flowi4);
> +       case AF_INET6:
> +               BUILD_BUG_ON(sizeof(struct flowi6) % sizeof(flow_compare_t));
> +               return sizeof(struct flowi6);
> +       case AF_DECnet:
> +               BUILD_BUG_ON(sizeof(struct flowidn) % sizeof(flow_compare_t));
> +               return sizeof(struct flowidn);
> +       }
> +       return 0;
> +}

Since most called user (flow_key_compare) uses returned value didided
by sizeof(flow_compare_t), you could just return divided value here
and save some shift operations by it.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Joe Perches @ 2011-09-04 15:50 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: sedat.dilek, netdev@vger.kernel.org, davem@davemloft.net,
	sfr@canb.auug.org.au, tim.c.chen@linux.intel.com,
	jirislaby@gmail.com
In-Reply-To: <CAAM7YAk0csJLLVFF6KYA8vMG_BN6QDvmYvtnY0Sq7P=cBsuKfg@mail.gmail.com>

On Sun, 2011-09-04 at 16:23 +0800, Yan, Zheng wrote:
> On Sun, Sep 4, 2011 at 3:12 PM, Sedat Dilek <sedat.dilek@googlemail.com> wrote:
> > On Sun, Sep 4, 2011 at 7:44 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> >> It passes the scm reference to the first skb. Skb(s) afterwards may
> >> reference freed data structure because the first skb can be destructed
> >> by the receiver at anytime. The fix is by passing the scm reference to
> >> the very last skb.
> > s/by passing/bypassing ?
> No

(putting on my Randy Dunlap hat)

The issue was fixed by passing...
or maybe
The fix is to pass...

^ permalink raw reply

* Re: [PATCH 08/11] netlink: implement memory mapped sendmsg()
From: Michał Mirosław @ 2011-09-04 16:18 UTC (permalink / raw)
  To: kaber; +Cc: davem, netfilter-devel, netdev
In-Reply-To: <1315070771-18576-9-git-send-email-kaber@trash.net>

On Sat, Sep 03, 2011 at 07:26:08PM +0200, kaber@trash.net wrote:
> From: Patrick McHardy <kaber@trash.net>
> 
> Add support for memory mapped sendmsg() to netlink. Userspace queued to
> be processed frames into the TX ring and invokes sendmsg with
> msg.iov.iov_base = NULL to trigger processing of all pending messages.
> 
> Since the kernel usually performs full message validation before beginning
> processing, userspace must be prevented from modifying the message
> contents while the kernel is processing them. In order to do so, the
> frames contents are copied to an allocated skb in case the the ring is
> mapped more than once or the file descriptor is shared (f.i. through
> AF_UNIX file descriptor passing).
> 
> Otherwise an skb without a data area is allocated, the data pointer set
> to point to the data area of the ring frame and the skb is processed.
> Once the skb is freed, the destructor releases the frame back to userspace
> by setting the status to NL_MMAP_STATUS_UNUSED.

Is this protected from threads? Like: one thread waits on sendmsg() and
another (same process) changes the buffer.

Best Regards,
Michał Mirosław
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox