netdev.vger.kernel.org archive mirror
* [PATCH net-next v4 0/3] udp msg_zerocopy
@ 2018-11-30 20:32 Willem de Bruijn
  2018-11-30 20:32 ` [PATCH net-next v4 1/3] udp: msg_zerocopy Willem de Bruijn
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Willem de Bruijn @ 2018-11-30 20:32 UTC
  To: netdev; +Cc: davem, pabeni, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Enable MSG_ZEROCOPY for udp sockets

Patch 1/3 is the main patch, a rework of RFC patch
  http://patchwork.ozlabs.org/patch/899630/
  more details in the patch commit message

Patch 2/3 is an optimization that removes a branch from the UDP hot
  path and, when zerocopy is used, a refcount_inc/refcount_dec_and_test
  pair. This was part of the first patch in v2.

Patch 3/3 runs the already existing udp zerocopy tests
  as part of kselftest

See also recent Linux Plumbers presentation
  https://linuxplumbersconf.org/event/2/contributions/106/attachments/104/128/willemdebruijn-lpc2018-udpgso-presentation-20181113.pdf

Changes:
  v1 -> v2
    - Fixup reverse christmas tree violation
  v2 -> v3
    - Split refcount avoidance optimization into separate patch
      - Fix refcount leak on error in fragmented case
        (thanks to Paolo Abeni for pointing this one out!)
      - Fix refcount inc on zero
  v3 -> v4
    - Move skb_zcopy_set below the only kfree_skb that might cause
      a premature uarg destroy before sock_zerocopy_put_abort
      - Move the entire skb_shinfo assignment block, to keep that
        cacheline access in one place

Willem de Bruijn (3):
  udp: msg_zerocopy
  udp: elide zerocopy operation in hot path
  selftests: extend zerocopy tests to udp

 include/linux/skbuff.h                      | 13 +++++---
 net/core/skbuff.c                           | 15 ++++++---
 net/core/sock.c                             |  5 ++-
 net/ipv4/ip_output.c                        | 37 ++++++++++++++++-----
 net/ipv4/tcp.c                              |  2 +-
 net/ipv6/ip6_output.c                       | 37 ++++++++++++++++-----
 tools/testing/selftests/net/msg_zerocopy.c  |  3 +-
 tools/testing/selftests/net/msg_zerocopy.sh |  2 ++
 tools/testing/selftests/net/udpgso_bench.sh |  3 ++
 9 files changed, 90 insertions(+), 27 deletions(-)

-- 
2.20.0.rc1.387.gf8505762e3-goog


* [PATCH net-next v4 1/3] udp: msg_zerocopy
  2018-11-30 20:32 [PATCH net-next v4 0/3] udp msg_zerocopy Willem de Bruijn
@ 2018-11-30 20:32 ` Willem de Bruijn
  2018-12-03 12:46   ` Paolo Abeni
  2018-11-30 20:32 ` [PATCH net-next v4 2/3] udp: elide zerocopy operation in hot path Willem de Bruijn
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Willem de Bruijn @ 2018-11-30 20:32 UTC
  To: netdev; +Cc: davem, pabeni, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweighs the cost of
generating a completion notification.

tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

    ipv4 udp -t 1
    tx=191312 (11938 MB) txc=0 zc=n
    rx=191312 (11938 MB)
    ipv4 udp -z -t 1
    tx=304507 (19002 MB) txc=304507 zc=y
    rx=304507 (19002 MB)
    ok
    ipv6 udp -t 1
    tx=174485 (10888 MB) txc=0 zc=n
    rx=174485 (10888 MB)
    ipv6 udp -z -t 1
    tx=294801 (18396 MB) txc=294801 zc=y
    rx=294801 (18396 MB)
    ok

Changes
  v1 -> v2
    - Fixup reverse christmas tree violation
  v2 -> v3
    - Split refcount avoidance optimization into separate patch
      - Fix refcount leak on error in fragmented case
        (thanks to Paolo Abeni for pointing this one out!)
      - Fix refcount inc on zero
      - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
        This is needed since commit 5cf4a8532c99 ("tcp: really ignore
        MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      |  6 ++++++
 net/core/sock.c        |  5 ++++-
 net/ipv4/ip_output.c   | 23 ++++++++++++++++++++++-
 net/ipv6/ip6_output.c  | 23 ++++++++++++++++++++++-
 5 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73902acf2b71..04f52e719571 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -485,6 +485,7 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len);
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
 			     struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3c814565ed7c..1350901c5cb8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1105,6 +1105,12 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
 extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 				   struct iov_iter *from, size_t length);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len)
+{
+	return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter_dgram);
+
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
 			     struct ubuf_info *uarg)
diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..f5bb89785e47 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1018,7 +1018,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 
 	case SO_ZEROCOPY:
 		if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
-			if (sk->sk_protocol != IPPROTO_TCP)
+			if (!((sk->sk_type == SOCK_STREAM &&
+			       sk->sk_protocol == IPPROTO_TCP) ||
+			      (sk->sk_type == SOCK_DGRAM &&
+			       sk->sk_protocol == IPPROTO_UDP)))
 				ret = -ENOTSUPP;
 		} else if (sk->sk_family != PF_RDS) {
 			ret = -ENOTSUPP;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 5dbec21856f4..6f843aff628c 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -867,6 +867,7 @@ static int __ip_append_data(struct sock *sk,
 			    unsigned int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
+	struct ubuf_info *uarg = NULL;
 	struct sk_buff *skb;
 
 	struct ip_options *opt = cork->opt;
@@ -916,6 +917,19 @@ static int __ip_append_data(struct sock *sk,
 	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
 		csummode = CHECKSUM_PARTIAL;
 
+	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+		if (rt->dst.dev->features & NETIF_F_SG &&
+		    csummode == CHECKSUM_PARTIAL) {
+			paged = true;
+		} else {
+			uarg->zerocopy = 0;
+			skb_zcopy_set(skb, uarg);
+		}
+	}
+
 	cork->length += length;
 
 	/* So, what's going on in the loop below?
@@ -1006,6 +1020,7 @@ static int __ip_append_data(struct sock *sk,
 			cork->tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes.
@@ -1068,7 +1083,7 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
@@ -1098,6 +1113,10 @@ static int __ip_append_data(struct sock *sk,
 			skb->data_len += copy;
 			skb->truesize += copy;
 			wmem_alloc_delta += copy;
+		} else {
+			err = skb_zerocopy_iter_dgram(skb, from, copy);
+			if (err < 0)
+				goto error;
 		}
 		offset += copy;
 		length -= copy;
@@ -1105,11 +1124,13 @@ static int __ip_append_data(struct sock *sk,
 
 	if (wmem_alloc_delta)
 		refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
+	sock_zerocopy_put(uarg);
 	return 0;
 
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 827a3f5ff3bb..7df04d20a91f 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1245,6 +1245,7 @@ static int __ip6_append_data(struct sock *sk,
 {
 	struct sk_buff *skb, *skb_prev = NULL;
 	unsigned int maxfraglen, fragheaderlen, mtu, orig_mtu, pmtu;
+	struct ubuf_info *uarg = NULL;
 	int exthdrlen = 0;
 	int dst_exthdrlen = 0;
 	int hh_len;
@@ -1322,6 +1323,19 @@ static int __ip6_append_data(struct sock *sk,
 	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
 		csummode = CHECKSUM_PARTIAL;
 
+	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+		if (rt->dst.dev->features & NETIF_F_SG &&
+		    csummode == CHECKSUM_PARTIAL) {
+			paged = true;
+		} else {
+			uarg->zerocopy = 0;
+			skb_zcopy_set(skb, uarg);
+		}
+	}
+
 	/*
 	 * Let's try using as much space as possible.
 	 * Use MTU if total length of the message fits into the MTU.
@@ -1445,6 +1459,7 @@ static int __ip6_append_data(struct sock *sk,
 			cork->tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes
@@ -1506,7 +1521,7 @@ static int __ip6_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
@@ -1536,6 +1551,10 @@ static int __ip6_append_data(struct sock *sk,
 			skb->data_len += copy;
 			skb->truesize += copy;
 			wmem_alloc_delta += copy;
+		} else {
+			err = skb_zerocopy_iter_dgram(skb, from, copy);
+			if (err < 0)
+				goto error;
 		}
 		offset += copy;
 		length -= copy;
@@ -1543,11 +1562,13 @@ static int __ip6_append_data(struct sock *sk,
 
 	if (wmem_alloc_delta)
 		refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
+	sock_zerocopy_put(uarg);
 	return 0;
 
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
-- 
2.20.0.rc1.387.gf8505762e3-goog


* [PATCH net-next v4 2/3] udp: elide zerocopy operation in hot path
  2018-11-30 20:32 [PATCH net-next v4 0/3] udp msg_zerocopy Willem de Bruijn
  2018-11-30 20:32 ` [PATCH net-next v4 1/3] udp: msg_zerocopy Willem de Bruijn
@ 2018-11-30 20:32 ` Willem de Bruijn
  2018-12-03 12:50   ` Paolo Abeni
  2018-11-30 20:32 ` [PATCH net-next v4 3/3] selftests: extend zerocopy tests to udp Willem de Bruijn
  2018-12-03 23:59 ` [PATCH net-next v4 0/3] udp msg_zerocopy David Miller
  3 siblings, 1 reply; 7+ messages in thread
From: Willem de Bruijn @ 2018-11-30 20:32 UTC
  To: netdev; +Cc: davem, pabeni, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
Release of its last reference triggers a completion notification.

The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
the skbs, because it can build, send and free skbs within its loop,
possibly reaching refcount zero and freeing the ubuf_info too soon.

The UDP stack currently also takes this extra ref, but does not need
it as all skbs are sent after return from __ip(6)_append_data.

Avoid the extra refcount_inc and refcount_dec_and_test, and generally
the sock_zerocopy_put in the common path, by passing the initial
reference to the first skb.

This approach is taken instead of initializing the refcount to 0, as
that would generate error "refcount_t: increment on 0" on the
next skb_zcopy_set.
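
The hand-off can be illustrated with a small user-space model. This is only a sketch: toy_uarg, toy_skb, toy_zcopy_set and toy_append are hypothetical stand-ins, and a plain int replaces refcount_t. It mirrors the idea that the first skb consumes the caller's initial reference while later skbs take their own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct ubuf_info and struct sk_buff. */
struct toy_uarg { int refcnt; };
struct toy_skb { struct toy_uarg *uarg; };

/* Attach uarg to skb. If the caller still owns the initial
 * reference (*have_ref is true), pass it to the skb instead of
 * incrementing; otherwise take a new reference.
 */
void toy_zcopy_set(struct toy_skb *skb, struct toy_uarg *uarg,
		   bool *have_ref)
{
	if (skb && uarg && !skb->uarg) {
		if (have_ref && *have_ref)
			*have_ref = false;
		else
			uarg->refcnt++;
		skb->uarg = uarg;
	}
}

/* Model of one append call that builds n skbs. Returns the final
 * reference count; *extra_uref reports whether the caller still
 * holds the initial reference (true only when no skb took it).
 */
int toy_append(struct toy_uarg *uarg, struct toy_skb *skbs, size_t n,
	       bool *extra_uref)
{
	size_t i;

	uarg->refcnt = 1;	/* the allocation returns one reference */
	*extra_uref = true;
	for (i = 0; i < n; i++)
		toy_zcopy_set(&skbs[i], uarg, extra_uref);
	return uarg->refcnt;
}
```

With three skbs the count ends at three, one reference per skb and none held by the caller, so nothing needs to be dropped on the success path. With zero skbs the caller still owns the initial reference, which is the case the have_uref argument of sock_zerocopy_put_abort covers on the error path.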

Changes
  v3 -> v4
    - Move skb_zcopy_set below the only kfree_skb that might cause
      a premature uarg destroy before sock_zerocopy_put_abort
      - Move the entire skb_shinfo assignment block, to keep that
        cacheline access in one place

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h | 12 ++++++++----
 net/core/skbuff.c      |  9 +++++----
 net/ipv4/ip_output.c   | 22 +++++++++++-----------
 net/ipv4/tcp.c         |  2 +-
 net/ipv6/ip6_output.c  | 22 +++++++++++-----------
 5 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 04f52e719571..75d50ab7997c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -481,7 +481,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 }
 
 void sock_zerocopy_put(struct ubuf_info *uarg);
-void sock_zerocopy_put_abort(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
@@ -1326,10 +1326,14 @@ static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
 	return is_zcopy ? skb_uarg(skb) : NULL;
 }
 
-static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg,
+				 bool *have_ref)
 {
 	if (skb && uarg && !skb_zcopy(skb)) {
-		sock_zerocopy_get(uarg);
+		if (unlikely(have_ref && *have_ref))
+			*have_ref = false;
+		else
+			sock_zerocopy_get(uarg);
 		skb_shinfo(skb)->destructor_arg = uarg;
 		skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
 	}
@@ -1374,7 +1378,7 @@ static inline void skb_zcopy_abort(struct sk_buff *skb)
 	struct ubuf_info *uarg = skb_zcopy(skb);
 
 	if (uarg) {
-		sock_zerocopy_put_abort(uarg);
+		sock_zerocopy_put_abort(uarg, false);
 		skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
 	}
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1350901c5cb8..c78ce114537e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1089,7 +1089,7 @@ void sock_zerocopy_put(struct ubuf_info *uarg)
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put);
 
-void sock_zerocopy_put_abort(struct ubuf_info *uarg)
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
 	if (uarg) {
 		struct sock *sk = skb_from_uarg(uarg)->sk;
@@ -1097,7 +1097,8 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
 		atomic_dec(&sk->sk_zckey);
 		uarg->len--;
 
-		sock_zerocopy_put(uarg);
+		if (have_uref)
+			sock_zerocopy_put(uarg);
 	}
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
@@ -1137,7 +1138,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 		return err;
 	}
 
-	skb_zcopy_set(skb, uarg);
+	skb_zcopy_set(skb, uarg, NULL);
 	return skb->len - orig_len;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
@@ -1157,7 +1158,7 @@ static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			if (skb_copy_ubufs(nskb, GFP_ATOMIC))
 				return -EIO;
 		}
-		skb_zcopy_set(nskb, skb_uarg(orig));
+		skb_zcopy_set(nskb, skb_uarg(orig), NULL);
 	}
 	return 0;
 }
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6f843aff628c..78f028bdad30 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -881,8 +881,8 @@ static int __ip_append_data(struct sock *sk,
 	int csummode = CHECKSUM_NONE;
 	struct rtable *rt = (struct rtable *)cork->dst;
 	unsigned int wmem_alloc_delta = 0;
+	bool paged, extra_uref;
 	u32 tskey = 0;
-	bool paged;
 
 	skb = skb_peek_tail(queue);
 
@@ -921,12 +921,13 @@ static int __ip_append_data(struct sock *sk,
 		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
 		if (!uarg)
 			return -ENOBUFS;
+		extra_uref = true;
 		if (rt->dst.dev->features & NETIF_F_SG &&
 		    csummode == CHECKSUM_PARTIAL) {
 			paged = true;
 		} else {
 			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg);
+			skb_zcopy_set(skb, uarg, &extra_uref);
 		}
 	}
 
@@ -1015,13 +1016,6 @@ static int __ip_append_data(struct sock *sk,
 			skb->csum = 0;
 			skb_reserve(skb, hh_len);
 
-			/* only the initial fragment is time stamped */
-			skb_shinfo(skb)->tx_flags = cork->tx_flags;
-			cork->tx_flags = 0;
-			skb_shinfo(skb)->tskey = tskey;
-			tskey = 0;
-			skb_zcopy_set(skb, uarg);
-
 			/*
 			 *	Find where to start putting bytes.
 			 */
@@ -1054,6 +1048,13 @@ static int __ip_append_data(struct sock *sk,
 			exthdrlen = 0;
 			csummode = CHECKSUM_NONE;
 
+			/* only the initial fragment is time stamped */
+			skb_shinfo(skb)->tx_flags = cork->tx_flags;
+			cork->tx_flags = 0;
+			skb_shinfo(skb)->tskey = tskey;
+			tskey = 0;
+			skb_zcopy_set(skb, uarg, &extra_uref);
+
 			if ((flags & MSG_CONFIRM) && !skb_prev)
 				skb_set_dst_pending_confirm(skb, 1);
 
@@ -1124,13 +1125,12 @@ static int __ip_append_data(struct sock *sk,
 
 	if (wmem_alloc_delta)
 		refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
-	sock_zerocopy_put(uarg);
 	return 0;
 
 error_efault:
 	err = -EFAULT;
 error:
-	sock_zerocopy_put_abort(uarg);
+	sock_zerocopy_put_abort(uarg, extra_uref);
 	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 252048776dbb..444cdbff0638 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1423,7 +1423,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	if (copied + copied_syn)
 		goto out;
 out_err:
-	sock_zerocopy_put_abort(uarg);
+	sock_zerocopy_put_abort(uarg, true);
 	err = sk_stream_error(sk, flags, err);
 	/* make sure we wake any epoll edge trigger waiter */
 	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7df04d20a91f..ec8c235ea891 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1258,7 +1258,7 @@ static int __ip6_append_data(struct sock *sk,
 	int csummode = CHECKSUM_NONE;
 	unsigned int maxnonfragsize, headersize;
 	unsigned int wmem_alloc_delta = 0;
-	bool paged;
+	bool paged, extra_uref;
 
 	skb = skb_peek_tail(queue);
 	if (!skb) {
@@ -1327,12 +1327,13 @@ static int __ip6_append_data(struct sock *sk,
 		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
 		if (!uarg)
 			return -ENOBUFS;
+		extra_uref = true;
 		if (rt->dst.dev->features & NETIF_F_SG &&
 		    csummode == CHECKSUM_PARTIAL) {
 			paged = true;
 		} else {
 			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg);
+			skb_zcopy_set(skb, uarg, &extra_uref);
 		}
 	}
 
@@ -1454,13 +1455,6 @@ static int __ip6_append_data(struct sock *sk,
 			skb_reserve(skb, hh_len + sizeof(struct frag_hdr) +
 				    dst_exthdrlen);
 
-			/* Only the initial fragment is time stamped */
-			skb_shinfo(skb)->tx_flags = cork->tx_flags;
-			cork->tx_flags = 0;
-			skb_shinfo(skb)->tskey = tskey;
-			tskey = 0;
-			skb_zcopy_set(skb, uarg);
-
 			/*
 			 *	Find where to start putting bytes
 			 */
@@ -1492,6 +1486,13 @@ static int __ip6_append_data(struct sock *sk,
 			exthdrlen = 0;
 			dst_exthdrlen = 0;
 
+			/* Only the initial fragment is time stamped */
+			skb_shinfo(skb)->tx_flags = cork->tx_flags;
+			cork->tx_flags = 0;
+			skb_shinfo(skb)->tskey = tskey;
+			tskey = 0;
+			skb_zcopy_set(skb, uarg, &extra_uref);
+
 			if ((flags & MSG_CONFIRM) && !skb_prev)
 				skb_set_dst_pending_confirm(skb, 1);
 
@@ -1562,13 +1563,12 @@ static int __ip6_append_data(struct sock *sk,
 
 	if (wmem_alloc_delta)
 		refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
-	sock_zerocopy_put(uarg);
 	return 0;
 
 error_efault:
 	err = -EFAULT;
 error:
-	sock_zerocopy_put_abort(uarg);
+	sock_zerocopy_put_abort(uarg, extra_uref);
 	cork->length -= length;
 	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
-- 
2.20.0.rc1.387.gf8505762e3-goog


* [PATCH net-next v4 3/3] selftests: extend zerocopy tests to udp
  2018-11-30 20:32 [PATCH net-next v4 0/3] udp msg_zerocopy Willem de Bruijn
  2018-11-30 20:32 ` [PATCH net-next v4 1/3] udp: msg_zerocopy Willem de Bruijn
  2018-11-30 20:32 ` [PATCH net-next v4 2/3] udp: elide zerocopy operation in hot path Willem de Bruijn
@ 2018-11-30 20:32 ` Willem de Bruijn
  2018-12-03 23:59 ` [PATCH net-next v4 0/3] udp msg_zerocopy David Miller
  3 siblings, 0 replies; 7+ messages in thread
From: Willem de Bruijn @ 2018-11-30 20:32 UTC
  To: netdev; +Cc: davem, pabeni, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Both msg_zerocopy and udpgso_bench have udp zerocopy variants.
Exercise these as part of the standard kselftest run.

With udp, msg_zerocopy has no control channel. Ensure that the
receiver exits after the sender by accounting for the initial
delay in starting them (in msg_zerocopy.sh).
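
As a rough illustration of the timing fix, the receiver deadline can be computed as below. This is a sketch with hypothetical helper names (now_ms, receiver_deadline_ms). The 400 ms margin matches cfg_receiver_wait_ms in the patch; everything else is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Same margin as cfg_receiver_wait_ms in the patch. */
#define RECEIVER_WAIT_MS 400

/* Hypothetical helper: wall-clock time in milliseconds. */
uint64_t now_ms(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (uint64_t)tv.tv_sec * 1000 + (uint64_t)tv.tv_usec / 1000;
}

/* The receiver keeps reading until start + runtime + margin, so it
 * outlives a sender that was started slightly later and runs for
 * the same runtime.
 */
uint64_t receiver_deadline_ms(uint64_t start_ms, uint64_t runtime_ms)
{
	return start_ms + runtime_ms + RECEIVER_WAIT_MS;
}
```

A receiver loop would call receiver_deadline_ms(now_ms(), runtime) once at startup and poll until now_ms() passes the result.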

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/net/msg_zerocopy.c  | 3 ++-
 tools/testing/selftests/net/msg_zerocopy.sh | 2 ++
 tools/testing/selftests/net/udpgso_bench.sh | 3 +++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index 406cc70c571d..4b02933cab8a 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -651,12 +651,13 @@ static void do_flush_datagram(int fd, int type)
 
 static void do_rx(int domain, int type, int protocol)
 {
+	const int cfg_receiver_wait_ms = 400;
 	uint64_t tstop;
 	int fd;
 
 	fd = do_setup_rx(domain, type, protocol);
 
-	tstop = gettimeofday_ms() + cfg_runtime_ms;
+	tstop = gettimeofday_ms() + cfg_runtime_ms + cfg_receiver_wait_ms;
 	do {
 		if (type == SOCK_STREAM)
 			do_flush_tcp(fd);
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh b/tools/testing/selftests/net/msg_zerocopy.sh
index c43c6debda06..825ffec85cea 100755
--- a/tools/testing/selftests/net/msg_zerocopy.sh
+++ b/tools/testing/selftests/net/msg_zerocopy.sh
@@ -25,6 +25,8 @@ readonly path_sysctl_mem="net.core.optmem_max"
 if [[ "$#" -eq "0" ]]; then
 	$0 4 tcp -t 1
 	$0 6 tcp -t 1
+	$0 4 udp -t 1
+	$0 6 udp -t 1
 	echo "OK. All tests passed"
 	exit 0
 fi
diff --git a/tools/testing/selftests/net/udpgso_bench.sh b/tools/testing/selftests/net/udpgso_bench.sh
index 0f0628613f81..5670a9ffd8eb 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -35,6 +35,9 @@ run_udp() {
 
 	echo "udp gso"
 	run_in_netns ${args} -S 0
+
+	echo "udp gso zerocopy"
+	run_in_netns ${args} -S 0 -z
 }
 
 run_tcp() {
-- 
2.20.0.rc1.387.gf8505762e3-goog


* Re: [PATCH net-next v4 1/3] udp: msg_zerocopy
  2018-11-30 20:32 ` [PATCH net-next v4 1/3] udp: msg_zerocopy Willem de Bruijn
@ 2018-12-03 12:46   ` Paolo Abeni
  0 siblings, 0 replies; 7+ messages in thread
From: Paolo Abeni @ 2018-12-03 12:46 UTC
  To: Willem de Bruijn, netdev; +Cc: davem, Willem de Bruijn

On Fri, 2018-11-30 at 15:32 -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
> interpret flag MSG_ZEROCOPY.
> 
> This patch was previously part of the zerocopy RFC patchsets. Zerocopy
> is not effective at small MTU. With segmentation offload building
> larger datagrams, the benefit of page flipping outweighs the cost of
> generating a completion notification.
> 
> tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
> test patch and making skb_orphan_frags_rx same as skb_orphan_frags:
> 
>     ipv4 udp -t 1
>     tx=191312 (11938 MB) txc=0 zc=n
>     rx=191312 (11938 MB)
>     ipv4 udp -z -t 1
>     tx=304507 (19002 MB) txc=304507 zc=y
>     rx=304507 (19002 MB)
>     ok
>     ipv6 udp -t 1
>     tx=174485 (10888 MB) txc=0 zc=n
>     rx=174485 (10888 MB)
>     ipv6 udp -z -t 1
>     tx=294801 (18396 MB) txc=294801 zc=y
>     rx=294801 (18396 MB)
>     ok
> 
> Changes
>   v1 -> v2
>     - Fixup reverse christmas tree violation
>   v2 -> v3
>     - Split refcount avoidance optimization into separate patch
>       - Fix refcount leak on error in fragmented case
>         (thanks to Paolo Abeni for pointing this one out!)
>       - Fix refcount inc on zero
>       - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
>         This is needed since commit 5cf4a8532c99 ("tcp: really ignore
> 	MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Acked-by: Paolo Abeni <pabeni@redhat.com>


* Re: [PATCH net-next v4 2/3] udp: elide zerocopy operation in hot path
  2018-11-30 20:32 ` [PATCH net-next v4 2/3] udp: elide zerocopy operation in hot path Willem de Bruijn
@ 2018-12-03 12:50   ` Paolo Abeni
  0 siblings, 0 replies; 7+ messages in thread
From: Paolo Abeni @ 2018-12-03 12:50 UTC
  To: Willem de Bruijn, netdev; +Cc: davem, Willem de Bruijn

On Fri, 2018-11-30 at 15:32 -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
> Release of its last reference triggers a completion notification.
> 
> The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
> the skbs, because it can build, send and free skbs within its loop,
> possibly reaching refcount zero and freeing the ubuf_info too soon.
> 
> The UDP stack currently also takes this extra ref, but does not need
> it as all skbs are sent after return from __ip(6)_append_data.
> 
> Avoid the extra refcount_inc and refcount_dec_and_test, and generally
> the sock_zerocopy_put in the common path, by passing the initial
> reference to the first skb.
> 
> This approach is taken instead of initializing the refcount to 0, as
> that would generate error "refcount_t: increment on 0" on the
> next skb_zcopy_set.
> 
> Changes
>   v3 -> v4
>     - Move skb_zcopy_set below the only kfree_skb that might cause
>       a premature uarg destroy before sock_zerocopy_put_abort
>       - Move the entire skb_shinfo assignment block, to keep that
>         cacheline access in one place
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>

I like this solution!

Acked-by: Paolo Abeni <pabeni@redhat.com>


* Re: [PATCH net-next v4 0/3] udp msg_zerocopy
  2018-11-30 20:32 [PATCH net-next v4 0/3] udp msg_zerocopy Willem de Bruijn
                   ` (2 preceding siblings ...)
  2018-11-30 20:32 ` [PATCH net-next v4 3/3] selftests: extend zerocopy tests to udp Willem de Bruijn
@ 2018-12-03 23:59 ` David Miller
  3 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2018-12-03 23:59 UTC
  To: willemdebruijn.kernel; +Cc: netdev, pabeni, willemb

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Fri, 30 Nov 2018 15:32:38 -0500

> Enable MSG_ZEROCOPY for udp sockets

Series applied, thanks for keeping up with this.
