kvm.vger.kernel.org archive mirror
* [PATCH V3 0/8] macvtap/vhost TX zero copy support
@ 2011-04-20 19:36 Shirley Ma
  2011-04-20 19:42 ` [PATCH V3 1/8] Add a new sock zerocopy flag Shirley Ma
                   ` (10 more replies)
  0 siblings, 11 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 19:36 UTC (permalink / raw)
  To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
  Cc: netdev, kvm, linux-kernel

This patchset adds support for TX zero-copy between the guest and the host
kernel through vhost. It significantly reduces CPU utilization on the
local host on which the guest is located (vhost thread CPU usage dropped
30-50% in single-stream tests). The patchset is based on the previous
submission and on community comments about when and how to release the
guest kernel buffers. This is the simplest approach I could come up with
after comparing several other solutions.
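
To make the flow concrete, here is a rough, simplified sketch of one
zero-copy TX iteration as wired up by the patches listed below (pseudo-C,
error handling omitted; the helper and field names are the ones introduced
in this series):

	struct skb_ubuf_info pend;

	head = vhost_get_vq_desc(...);             /* pull guest TX buffers   */
	pend.callback = vhost_zerocopy_callback;   /* run when DMA completes  */
	pend.arg  = vq;
	pend.desc = vq->upend_idx;                 /* slot in the pending ring */
	msg.msg_control = &pend;                   /* handed down to macvtap  */
	sock->ops->sendmsg(NULL, sock, &msg, len); /* macvtap pins the guest
	                                            * pages and attaches them
	                                            * as skb frags; the NIC
	                                            * DMAs directly from them */
	/* kfree_skb() on the last reference invokes the callback, which marks
	 * vq->heads[pend.desc] as done; vhost_zerocopy_signal_used() later
	 * adds the finished heads to the used ring and signals the guest.   */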

This patchset includes:

1/8: Add a new sock zero-copy flag, SOCK_ZEROCOPY;

2/8: Add a new device flag, NETIF_F_ZEROCOPY, for lower-level devices
that support zero-copy;

3/8: Add a new struct skb_ubuf_info in skb_shared_info carrying a
userspace buffer release callback, invoked once the lower device has
finished DMA for that skb;

4/8: Add the vhost zero-copy callback, run when the skb's last refcnt is
gone; add vhost_zerocopy_signal_used to notify the guest to release TX
skb buffers;

5/8: Add macvtap zero-copy in the lower device when the packet being
sent is greater than 128 bytes;

6/8: Add the zero-copy feature flag to the Chelsio 10Gb NIC;

7/8: Add the zero-copy feature flag to the Intel 10Gb NIC;

8/8: Add the zero-copy feature flag to the Emulex 10Gb NIC.

The patchset is built against the most recent linux-2.6.git tree. It has
passed netperf/netserver multi-stream stress tests on the above NICs.

Single-stream test results from the 2.6.37 kernel on the Chelsio NIC:

64K message size: copy_from_user dropped from 40% to 5%; vhost thread
CPU utilization dropped from 76% to 28%.

I am collecting more test results against the 2.6.39-rc3 kernel and will
provide the test matrix later.

Thanks
Shirley



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH V3 1/8] Add a new sock zerocopy flag
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
@ 2011-04-20 19:42 ` Shirley Ma
  2011-04-20 19:44 ` [PATCH V3 2/8] Add a new zerocopy device flag Shirley Ma
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 19:42 UTC (permalink / raw)
  To: David Miller
  Cc: mst, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel

This sock zerocopy flag is used to indicate that the lower-level device
can DMA directly from userspace buffers.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 include/net/sock.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 01810a3..daa0a80 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -562,6 +562,7 @@ enum sock_flags {
 	SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
 	SOCK_FASYNC, /* fasync() active */
 	SOCK_RXQ_OVFL,
+	SOCK_ZEROCOPY,
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V3 2/8] Add a new zerocopy device flag
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
  2011-04-20 19:42 ` [PATCH V3 1/8] Add a new sock zerocopy flag Shirley Ma
@ 2011-04-20 19:44 ` Shirley Ma
  2011-04-20 19:48   ` Ben Hutchings
  2011-05-02 10:42   ` Michael S. Tsirkin
  2011-04-20 19:47 ` [PATCH V3 3/8] Add userspace buffers support in skb Shirley Ma
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 19:44 UTC (permalink / raw)
  To: David Miller
  Cc: mst, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel

This zerocopy flag is used to indicate that the device can DMA directly from userspace buffers.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 include/linux/netdevice.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0249fe7..0998d3d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1067,6 +1067,9 @@ struct net_device {
 #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
 #define NETIF_F_RXCSUM		(1 << 29) /* Receive checksumming offload */
 
+/* bit 29 is for device to map userspace buffers -- zerocopy */
+#define NETIF_F_ZEROCOPY	(1 << 29)
+
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
 #define NETIF_F_GSO_MASK	0x00ff0000

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V3 3/8] Add userspace buffers support in skb
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
  2011-04-20 19:42 ` [PATCH V3 1/8] Add a new sock zerocopy flag Shirley Ma
  2011-04-20 19:44 ` [PATCH V3 2/8] Add a new zerocopy device flag Shirley Ma
@ 2011-04-20 19:47 ` Shirley Ma
  2011-05-02 10:53   ` Michael S. Tsirkin
  2011-04-20 20:07 ` [PATCH V3 4/8] vhost TX zero copy support Shirley Ma
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 19:47 UTC (permalink / raw)
  To: David Miller
  Cc: mst, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel

This patch adds userspace buffer support in skb. A new struct
skb_ubuf_info is needed to maintain the userspace buffer argument
and index; a callback is used to notify userspace to release the
buffers once the lower device has finished DMA (i.e. the last
reference to that skb is gone).
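
A minimal usage sketch for a sender that built the skb from pinned
userspace pages (the callback and context names here are hypothetical,
not part of this patch):

	skb_shinfo(skb)->ubuf.callback = my_release_cb;  /* hypothetical callback */
	skb_shinfo(skb)->ubuf.arg      = my_ctx;         /* hypothetical context  */
	skb_shinfo(skb)->ubuf.desc     = buf_index;      /* caller's buffer index */
	/* when the last reference is dropped, skb_release_data() invokes
	 * ubuf.callback(skb) exactly once and then clears callback and arg. */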

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 include/linux/skbuff.h |   14 ++++++++++++++
 net/core/skbuff.c      |   15 ++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d0ae90a..47a187b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -189,6 +189,16 @@ enum {
 	SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
 };
 
+/* The callback notifies userspace to release buffers when skb DMA is done in
+ * lower device, the desc is used to track userspace buffer index.
+ */
+struct skb_ubuf_info {
+	/* support buffers allocation from userspace */
+	void		(*callback)(struct sk_buff *);
+	void		*arg;
+	size_t		desc;
+};
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -211,6 +221,10 @@ struct skb_shared_info {
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+
+	/* DMA mapping from/to userspace buffers */
+	struct skb_ubuf_info ubuf;
+
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7ebeed0..822c07d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo = skb_shinfo(skb);
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
+	shinfo->ubuf.callback = NULL;
+	shinfo->ubuf.arg = NULL;
 	kmemcheck_annotate_variable(shinfo->destructor_arg);
 
 	if (fclone) {
@@ -327,7 +329,15 @@ static void skb_release_data(struct sk_buff *skb)
 			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 				put_page(skb_shinfo(skb)->frags[i].page);
 		}
-
+		/*
+		 * if skb buf is from userspace, we need to notify the caller
+		 * the lower device DMA has done;
+		 */
+		if (skb_shinfo(skb)->ubuf.callback) {
+			skb_shinfo(skb)->ubuf.callback(skb);
+			skb_shinfo(skb)->ubuf.callback = NULL;
+			skb_shinfo(skb)->ubuf.arg = NULL;
+		}
 		if (skb_has_frag_list(skb))
 			skb_drop_fraglist(skb);
 
@@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (irqs_disabled())
 		return false;
 
+	if (shinfo->ubuf.callback)
+		return false;
+
 	if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
 		return false;
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-04-20 19:44 ` [PATCH V3 2/8] Add a new zerocopy device flag Shirley Ma
@ 2011-04-20 19:48   ` Ben Hutchings
  2011-04-20 20:05     ` Shirley Ma
  2011-04-20 20:09     ` Shirley Ma
  2011-05-02 10:42   ` Michael S. Tsirkin
  1 sibling, 2 replies; 30+ messages in thread
From: Ben Hutchings @ 2011-04-20 19:48 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann,
	netdev, kvm, linux-kernel

On Wed, 2011-04-20 at 12:44 -0700, Shirley Ma wrote:
> This zerocopy flag is used to support device DMA userspace buffers.
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
> 
>  include/linux/netdevice.h |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0249fe7..0998d3d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1067,6 +1067,9 @@ struct net_device {
>  #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
>  #define NETIF_F_RXCSUM		(1 << 29) /* Receive checksumming offload */
>  
> +/* bit 29 is for device to map userspace buffers -- zerocopy */
> +#define NETIF_F_ZEROCOPY	(1 << 29)

Look above.

Ben.

>  	/* Segmentation offload features */
>  #define NETIF_F_GSO_SHIFT	16
>  #define NETIF_F_GSO_MASK	0x00ff0000
> 

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-04-20 19:48   ` Ben Hutchings
@ 2011-04-20 20:05     ` Shirley Ma
  2011-04-20 20:09     ` Shirley Ma
  1 sibling, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:05 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann,
	netdev, kvm, linux-kernel

Thanks. I need to update it to bit 30.

Shirley


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH V3 4/8] vhost TX zero copy support
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
                   ` (2 preceding siblings ...)
  2011-04-20 19:47 ` [PATCH V3 3/8] Add userspace buffers support in skb Shirley Ma
@ 2011-04-20 20:07 ` Shirley Ma
  2011-04-20 20:12 ` [PATCH V3 0/8] macvtap/vhost " Shirley Ma
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:07 UTC (permalink / raw)
  To: mst, David Miller
  Cc: Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel

This patch tracks the outstanding userspace buffers in the order they
are delivered to vhost. An outstanding userspace buffer is marked as
done once the lower device has finished DMA for it; this is detected
via the callback run when the skb's last reference is dropped in
kfree_skb. Two buffer indexes are used for this purpose.

vhost passes the userspace buffer info to the lower device's skb
through msg_control. There may already be completed DMAs when entering
vhost handle_tx; in the worst case all buffers in the vq are in
pending/done state, so we need to notify the guest to release DMA-done
buffers before getting any new buffers from the vq.
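
In other words, the bookkeeping introduced below can be summarized with
this sketch (a restatement of the code in this patch, not extra logic;
indexes wrap modulo UIO_MAXIOV):

	/* Ring layout:
	 *   done_idx .. upend_idx-1 : heads handed to the device, DMA pending
	 *   heads[i].len == 1       : DMA finished, guest not yet notified  */

	/* completion side, run from kfree_skb (vhost_zerocopy_callback):    */
	vq->heads[skb_shinfo(skb)->ubuf.desc].len = 1;

	/* reporting side, run from handle_tx (vhost_zerocopy_signal_used):
	 * walk from done_idx while len == 1, reset len to 0, push those
	 * heads to the used ring, advance done_idx, signal the guest and
	 * subtract the completed count from vq->refcnt.                     */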

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 drivers/vhost/net.c   |   30 +++++++++++++++++++++++++++-
 drivers/vhost/vhost.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.h |   10 +++++++++
 3 files changed, 87 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2f7c76a..1bc4536 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,8 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
+#define MAX_ZEROCOPY_PEND 64
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -129,6 +131,7 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	struct skb_ubuf_info pend;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -151,6 +154,10 @@ static void handle_tx(struct vhost_net *net)
 	hdr_size = vq->vhost_hlen;
 
 	for (;;) {
+		/* Release DMAs done buffers first */
+		if (sock_flag(sock->sk, SOCK_ZEROCOPY))
+			vhost_zerocopy_signal_used(vq);
+
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
@@ -166,6 +173,12 @@ static void handle_tx(struct vhost_net *net)
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
+			/* If more outstanding DMAs, queue the work */
+			if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
+			    (atomic_read(&vq->refcnt) > MAX_ZEROCOPY_PEND)) {
+				vhost_poll_queue(&vq->poll);
+				break;
+			}
 			if (unlikely(vhost_enable_notify(vq))) {
 				vhost_disable_notify(vq);
 				continue;
@@ -188,17 +201,30 @@ static void handle_tx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), hdr_size);
 			break;
 		}
+		/* use msg_control to pass vhost zerocopy ubuf info to skb */
+		if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
+			pend.callback = vhost_zerocopy_callback;
+			pend.arg = vq;
+			pend.desc = vq->upend_idx;
+			msg.msg_control = &pend;
+			msg.msg_controllen = sizeof(pend);
+			vq->heads[vq->upend_idx].id = head;
+			vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
+			atomic_inc(&vq->refcnt);
+		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
-			vhost_discard_vq_desc(vq, 1);
+			if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+				vhost_discard_vq_desc(vq, 1);
 			tx_poll_start(net, sock);
 			break;
 		}
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ab2912..09bcb1d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
+	vq->upend_idx = 0;
+	vq->done_idx = 0;
+	atomic_set(&vq->refcnt, 0);
 }
 
 static int vhost_worker(void *data)
@@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
 					       UIO_MAXIOV, GFP_KERNEL);
 		dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
 					  GFP_KERNEL);
-		dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
+		dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads *
 					    UIO_MAXIOV, GFP_KERNEL);
 
 		if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
@@ -385,10 +388,41 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
 	return 0;
 }
 
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+{
+	int i, j = 0;
+
+	i = vq->done_idx;
+	while (i != vq->upend_idx) {
+		/* len = 1 means DMA done */
+		if (vq->heads[i].len == 1) {
+			/* reset len = 0 */
+			vq->heads[i].len = 0;
+			i = (i + 1) % UIO_MAXIOV;
+			++j;
+		} else
+			break;
+	}
+	if (j) {
+		if (i > vq->done_idx)
+			vhost_add_used_n(vq, &vq->heads[vq->done_idx], j);
+		else {
+			vhost_add_used_n(vq, &vq->heads[vq->done_idx],
+					 UIO_MAXIOV - vq->done_idx);
+			vhost_add_used_n(vq, vq->heads, i);
+		}
+		vq->done_idx = i;
+		vhost_signal(vq->dev, vq);
+		atomic_sub(j, &vq->refcnt);
+	}
+}
+
 /* Caller should have device mutex */
 void vhost_dev_cleanup(struct vhost_dev *dev)
 {
 	int i;
+	unsigned long begin = jiffies;
+
 
 	for (i = 0; i < dev->nvqs; ++i) {
 		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
@@ -405,6 +439,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 			eventfd_ctx_put(dev->vqs[i].call_ctx);
 		if (dev->vqs[i].call)
 			fput(dev->vqs[i].call);
+		/* wait for all lower device DMAs done, then notify guest */
+		while (atomic_read(&dev->vqs[i].refcnt)) {
+			if (time_after(jiffies, begin + 5 * HZ))
+				vhost_zerocopy_signal_used(&dev->vqs[i]);
+		}
 		vhost_vq_reset(dev, dev->vqs + i);
 	}
 	vhost_dev_free_iovecs(dev);
@@ -1416,3 +1455,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
+
+void vhost_zerocopy_callback(struct sk_buff *skb)
+{
+	int idx = skb_shinfo(skb)->ubuf.desc;
+	struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
+
+	/* set len = 1 to mark this desc buffers done DMA */
+	vq->heads[idx].len = 1;
+}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3363ae..cd2febb 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -108,6 +108,14 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	/* vhost zerocopy support */
+	atomic_t refcnt; /* num of outstanding zerocopy DMAs */
+	/* index of zerocopy pending DMA buffers */
+	int upend_idx;
+	/* index of zerocopy done DMA buffers, but not notify guest yet */
+	int done_idx;
+	/* notify vhost zerocopy DMA buffers has done in lower device */
+	void (*callback)(struct sk_buff *);
 };
 
 struct vhost_dev {
@@ -154,6 +162,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
+void vhost_zerocopy_callback(struct sk_buff *skb);
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);
 
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-04-20 19:48   ` Ben Hutchings
  2011-04-20 20:05     ` Shirley Ma
@ 2011-04-20 20:09     ` Shirley Ma
  2011-04-20 20:24       ` Dimitris Michailidis
  1 sibling, 1 reply; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:09 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann,
	netdev, kvm, linux-kernel

Resubmit this patch with the new bit.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 include/linux/netdevice.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0249fe7..0998d3d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1067,6 +1067,9 @@ struct net_device {
 #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
 #define NETIF_F_RXCSUM		(1 << 29) /* Receive checksumming offload */
 
+/* Bit 30 is for device to map userspace buffers -- zerocopy */
+#define NETIF_F_ZEROCOPY	(1 << 30)
+
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
 #define NETIF_F_GSO_MASK	0x00ff0000

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
                   ` (3 preceding siblings ...)
  2011-04-20 20:07 ` [PATCH V3 4/8] vhost TX zero copy support Shirley Ma
@ 2011-04-20 20:12 ` Shirley Ma
  2011-04-20 20:13 ` [PATCH V3 5/8] Enable cxgb3 to support zerocopy Shirley Ma
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:12 UTC (permalink / raw)
  To: David Miller, Arnd Bergmann
  Cc: mst, Eric Dumazet, Avi Kivity, netdev, kvm, linux-kernel

macvtap enables zero-copy only when the buffer size is greater than
GOODCOPY_LEN (128 bytes).
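
For reference, a rough sketch of how the zerocopy_sg_from_iovec() helper
added below turns one iovec segment into skb frags (assuming 4K pages;
the numbers are only an example, and ~PAGE_MASK equals PAGE_SIZE - 1):

	base   = (unsigned long)from->iov_base + offset;   /* e.g. ends in 0x100 */
	npages = ((base & ~PAGE_MASK) + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
	                                   /* 256 + 6000 bytes -> 2 pages here */
	get_user_pages_fast(base, npages, 0, pages);       /* pin guest pages  */
	/* each pinned page becomes one skb_frag_t {page, page_offset, size},
	 * skb->len/data_len/truesize grow by len, and the pages are released
	 * with put_page() in the normal skb free path.                       */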

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 drivers/net/macvtap.c |  124 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 112 insertions(+), 12 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..b4e6656 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
  */
 static dev_t macvtap_major;
 #define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN  (L1_CACHE_BYTES < 128 ? 128 : L1_CACHE_BYTES)
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
 {
 	struct net *net = current->nsproxy->net_ns;
 	struct net_device *dev = dev_get_by_index(net, iminor(inode));
+	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvtap_queue *q;
 	int err;
 
@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
 	q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
 
+	/*
+	 * so far only VM uses macvtap, enable zero copy between guest
+	 * kernel and host kernel when lower device supports high memory
+	 * DMA
+	 */
+	if (vlan) {
+		if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
+			sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+	}
+
 	err = macvtap_set_queue(dev, file, q);
 	if (err)
 		sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
 	return skb;
 }
 
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+				  int offset, size_t count)
+{
+	int len = iov_length(from, count) - offset;
+	int copy = skb_headlen(skb);
+	int size, offset1 = 0;
+	int i = 0;
+	skb_frag_t *f;
+
+	/* Skip over from offset */
+	while (offset >= from->iov_len) {
+		offset -= from->iov_len;
+		++from;
+		--count;
+	}
+
+	/* copy up to skb headlen */
+	while (copy > 0) {
+		size = min_t(unsigned int, copy, from->iov_len - offset);
+		if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+				   size))
+			return -EFAULT;
+		if (copy > size) {
+			++from;
+			--count;
+		}
+		copy -= size;
+		offset1 += size;
+		offset = 0;
+	}
+
+	if (len == offset1)
+		return 0;
+
+	while (count--) {
+		struct page *page[MAX_SKB_FRAGS];
+		int num_pages;
+		unsigned long base;
+
+		len = from->iov_len - offset1;
+		if (!len) {
+			offset1 = 0;
+			++from;
+			continue;
+		}
+		base = (unsigned long)from->iov_base + offset1;
+		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+		if ((num_pages != size) ||
+		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+			/* put_page is in skb free */
+			return -EFAULT;
+		skb->data_len += len;
+		skb->len += len;
+		skb->truesize += len;
+		while (len) {
+			f = &skb_shinfo(skb)->frags[i];
+			f->page = page[i];
+			f->page_offset = base & ~PAGE_MASK;
+			f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+			skb_shinfo(skb)->nr_frags++;
+			/* increase sk_wmem_alloc */
+			atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+			base += f->size;
+			len -= f->size;
+			i++;
+		}
+		offset1 = 0;
+		++from;
+	}
+	return 0;
+}
+
 /*
  * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
  * be shared with the tun/tap driver.
@@ -515,17 +601,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
 
 
 /* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
-				const struct iovec *iv, size_t count,
-				int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+				const struct iovec *iv, unsigned long total_len,
+				size_t count, int noblock)
 {
 	struct sk_buff *skb;
 	struct macvlan_dev *vlan;
-	size_t len = count;
+	unsigned long len = total_len;
 	int err;
 	struct virtio_net_hdr vnet_hdr = { 0 };
 	int vnet_hdr_len = 0;
+	int copylen, zerocopy;
 
+	zerocopy = sock_flag(&q->sk, SOCK_ZEROCOPY) && (len > GOODCOPY_LEN);
 	if (q->flags & IFF_VNET_HDR) {
 		vnet_hdr_len = q->vnet_hdr_sz;
 
@@ -552,12 +640,24 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	if (unlikely(len < ETH_HLEN))
 		goto err;
 
-	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
-				noblock, &err);
+	if (zerocopy)
+		copylen = vnet_hdr.hdr_len;
+	else
+		copylen = len;
+
+	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
+				vnet_hdr.hdr_len, noblock, &err);
 	if (!skb)
 		goto err;
-
-	err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+
+	if (zerocopy)
+		err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+	else
+		err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+						   len);
+	if (sock_flag(&q->sk, SOCK_ZEROCOPY))
+		memcpy(&skb_shinfo(skb)->ubuf, m->msg_control,
+			sizeof(struct skb_ubuf_info));
 	if (err)
 		goto err_kfree;
 
@@ -579,7 +679,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 		kfree_skb(skb);
 	rcu_read_unlock_bh();
 
-	return count;
+	return total_len;
 
 err_kfree:
 	kfree_skb(skb);
@@ -601,8 +701,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
 	ssize_t result = -ENOLINK;
 	struct macvtap_queue *q = file->private_data;
 
-	result = macvtap_get_user(q, iv, iov_length(iv, count),
-			      file->f_flags & O_NONBLOCK);
+	result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+				  file->f_flags & O_NONBLOCK);
 	return result;
 }
 
@@ -815,7 +915,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
 			   struct msghdr *m, size_t total_len)
 {
 	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
-	return macvtap_get_user(q, m->msg_iov, total_len,
+	return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
 			    m->msg_flags & MSG_DONTWAIT);
 }
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V3 5/8] Enable cxgb3 to support zerocopy
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
                   ` (4 preceding siblings ...)
  2011-04-20 20:12 ` [PATCH V3 0/8] macvtap/vhost " Shirley Ma
@ 2011-04-20 20:13 ` Shirley Ma
  2011-04-20 20:52   ` Dimitris Michailidis
  2011-04-20 20:15 ` [PATCH V3 7/8] Enable ixgbe " Shirley Ma
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:13 UTC (permalink / raw)
  To: David Miller
  Cc: mst, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 drivers/net/cxgb3/cxgb3_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index 9108931..93a1101 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -3313,7 +3313,7 @@ static int __devinit init_one(struct pci_dev *pdev,
 		netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
 		netdev->features |= NETIF_F_GRO;
 		if (pci_using_dac)
-			netdev->features |= NETIF_F_HIGHDMA;
+			netdev->features |= NETIF_F_HIGHDMA | NETIF_F_ZEROCOPY;
 
 		netdev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
 		netdev->netdev_ops = &cxgb_netdev_ops;

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V3 7/8] Enable ixgbe to support zerocopy
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
                   ` (5 preceding siblings ...)
  2011-04-20 20:13 ` [PATCH V3 5/8] Enable cxgb3 to support zerocopy Shirley Ma
@ 2011-04-20 20:15 ` Shirley Ma
  2011-04-20 20:17 ` [PATCH V3 8/8] Enable benet " Shirley Ma
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:15 UTC (permalink / raw)
  To: David Miller
  Cc: mst, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 drivers/net/ixgbe/ixgbe_main.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 6f8adc7..68f1e93 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -7395,6 +7395,7 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 #endif /* IXGBE_FCOE */
 	if (pci_using_dac) {
 		netdev->features |= NETIF_F_HIGHDMA;
+		netdev->features |= NETIF_F_ZEROCOPY;
 		netdev->vlan_features |= NETIF_F_HIGHDMA;
 	}
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V3 8/8] Enable benet to support zerocopy
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
                   ` (6 preceding siblings ...)
  2011-04-20 20:15 ` [PATCH V3 7/8] Enable ixgbe " Shirley Ma
@ 2011-04-20 20:17 ` Shirley Ma
  2011-04-20 20:27 ` [PATCH V3 6/8] macvtap/vhost TX zero copy support Shirley Ma
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:17 UTC (permalink / raw)
  To: David Miller
  Cc: mst, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 drivers/net/benet/be_main.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index 7cb5a11..d7b7254 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -2982,6 +2982,7 @@ static int __devinit be_probe(struct pci_dev *pdev,
 	status = dma_set_mask(&pdev->dev, DMA_BIT_MASK(64));
 	if (!status) {
 		netdev->features |= NETIF_F_HIGHDMA;
+		netdev->features |= NETIF_F_ZEROCOPY;
 	} else {
 		status = dma_set_mask(&pdev->dev, DMA_BIT_MASK(32));
 		if (status) {



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-04-20 20:09     ` Shirley Ma
@ 2011-04-20 20:24       ` Dimitris Michailidis
  2011-04-20 20:28         ` Shirley Ma
  2011-04-20 20:30         ` Shirley Ma
  0 siblings, 2 replies; 30+ messages in thread
From: Dimitris Michailidis @ 2011-04-20 20:24 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Ben Hutchings, David Miller, mst, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel

On 04/20/2011 01:09 PM, Shirley Ma wrote:
> Resubmit this patch with the new bit.

Bit 30 is also taken in net-next.

> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
> 
>  include/linux/netdevice.h |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0249fe7..0998d3d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1067,6 +1067,9 @@ struct net_device {
>  #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
>  #define NETIF_F_RXCSUM		(1 << 29) /* Receive checksumming offload */
>  
> +/* Bit 30 is for device to map userspace buffers -- zerocopy */
> +#define NETIF_F_ZEROCOPY	(1 << 30)
> +
>  	/* Segmentation offload features */
>  #define NETIF_F_GSO_SHIFT	16
>  #define NETIF_F_GSO_MASK	0x00ff0000
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH V3 6/8] macvtap/vhost TX zero copy support
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
                   ` (7 preceding siblings ...)
  2011-04-20 20:17 ` [PATCH V3 8/8] Enable benet " Shirley Ma
@ 2011-04-20 20:27 ` Shirley Ma
  2011-05-02 18:35   ` Shirley Ma
  2011-04-21 14:29 ` [PATCH V3 0/8] " Jon Mason
  2011-04-22 17:52 ` Shirley Ma
  10 siblings, 1 reply; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:27 UTC (permalink / raw)
  To: David Miller, Arnd Bergmann, mst
  Cc: Eric Dumazet, Avi Kivity, netdev, kvm, linux-kernel

macvtap enables zero-copy only when the buffer size is greater than
GOODCOPY_LEN (128 bytes).

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 drivers/net/macvtap.c |  124 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 112 insertions(+), 12 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..b4e6656 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
  */
 static dev_t macvtap_major;
 #define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN  (L1_CACHE_BYTES < 128 ? 128 : L1_CACHE_BYTES)
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
 {
 	struct net *net = current->nsproxy->net_ns;
 	struct net_device *dev = dev_get_by_index(net, iminor(inode));
+	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvtap_queue *q;
 	int err;
 
@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
 	q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
 
+	/*
+	 * so far only VM uses macvtap, enable zero copy between guest
+	 * kernel and host kernel when lower device supports high memory
+	 * DMA
+	 */
+	if (vlan) {
+		if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
+			sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+	}
+
 	err = macvtap_set_queue(dev, file, q);
 	if (err)
 		sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
 	return skb;
 }
 
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+				  int offset, size_t count)
+{
+	int len = iov_length(from, count) - offset;
+	int copy = skb_headlen(skb);
+	int size, offset1 = 0;
+	int i = 0;
+	skb_frag_t *f;
+
+	/* Skip over from offset */
+	while (offset >= from->iov_len) {
+		offset -= from->iov_len;
+		++from;
+		--count;
+	}
+
+	/* copy up to skb headlen */
+	while (copy > 0) {
+		size = min_t(unsigned int, copy, from->iov_len - offset);
+		if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+				   size))
+			return -EFAULT;
+		if (copy > size) {
+			++from;
+			--count;
+		}
+		copy -= size;
+		offset1 += size;
+		offset = 0;
+	}
+
+	if (len == offset1)
+		return 0;
+
+	while (count--) {
+		struct page *page[MAX_SKB_FRAGS];
+		int num_pages;
+		unsigned long base;
+
+		len = from->iov_len - offset1;
+		if (!len) {
+			offset1 = 0;
+			++from;
+			continue;
+		}
+		base = (unsigned long)from->iov_base + offset1;
+		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+		if ((num_pages != size) ||
+		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+			/* put_page is in skb free */
+			return -EFAULT;
+		skb->data_len += len;
+		skb->len += len;
+		skb->truesize += len;
+		while (len) {
+			f = &skb_shinfo(skb)->frags[i];
+			f->page = page[i];
+			f->page_offset = base & ~PAGE_MASK;
+			f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+			skb_shinfo(skb)->nr_frags++;
+			/* increase sk_wmem_alloc */
+			atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+			base += f->size;
+			len -= f->size;
+			i++;
+		}
+		offset1 = 0;
+		++from;
+	}
+	return 0;
+}
+
 /*
  * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
  * be shared with the tun/tap driver.
@@ -515,17 +601,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
 
 
 /* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
-				const struct iovec *iv, size_t count,
-				int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+				const struct iovec *iv, unsigned long total_len,
+				size_t count, int noblock)
 {
 	struct sk_buff *skb;
 	struct macvlan_dev *vlan;
-	size_t len = count;
+	unsigned long len = total_len;
 	int err;
 	struct virtio_net_hdr vnet_hdr = { 0 };
 	int vnet_hdr_len = 0;
+	int copylen, zerocopy;
 
+	zerocopy = sock_flag(&q->sk, SOCK_ZEROCOPY) && (len > GOODCOPY_LEN);
 	if (q->flags & IFF_VNET_HDR) {
 		vnet_hdr_len = q->vnet_hdr_sz;
 
@@ -552,12 +640,24 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	if (unlikely(len < ETH_HLEN))
 		goto err;
 
-	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
-				noblock, &err);
+	if (zerocopy)
+		copylen = vnet_hdr.hdr_len;
+	else
+		copylen = len;
+
+	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
+				vnet_hdr.hdr_len, noblock, &err);
 	if (!skb)
 		goto err;
-
-	err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+
+	if (zerocopy)
+		err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+	else
+		err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+						   len);
+	if (sock_flag(&q->sk, SOCK_ZEROCOPY))
+		memcpy(&skb_shinfo(skb)->ubuf, m->msg_control,
+			sizeof(struct skb_ubuf_info));
 	if (err)
 		goto err_kfree;
 
@@ -579,7 +679,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 		kfree_skb(skb);
 	rcu_read_unlock_bh();
 
-	return count;
+	return total_len;
 
 err_kfree:
 	kfree_skb(skb);
@@ -601,8 +701,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
 	ssize_t result = -ENOLINK;
 	struct macvtap_queue *q = file->private_data;
 
-	result = macvtap_get_user(q, iv, iov_length(iv, count),
-			      file->f_flags & O_NONBLOCK);
+	result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+				  file->f_flags & O_NONBLOCK);
 	return result;
 }
 
@@ -815,7 +915,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
 			   struct msghdr *m, size_t total_len)
 {
 	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
-	return macvtap_get_user(q, m->msg_iov, total_len,
+	return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
 			    m->msg_flags & MSG_DONTWAIT);
 }
 



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-04-20 20:24       ` Dimitris Michailidis
@ 2011-04-20 20:28         ` Shirley Ma
  2011-04-20 20:30         ` Shirley Ma
  1 sibling, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:28 UTC (permalink / raw)
  To: Dimitris Michailidis
  Cc: Ben Hutchings, David Miller, mst, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel

On Wed, 2011-04-20 at 13:24 -0700, Dimitris Michailidis wrote:
> Bit 30 is also taken in net-next.

How about 31?

Thanks
Shirley

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-04-20 20:24       ` Dimitris Michailidis
  2011-04-20 20:28         ` Shirley Ma
@ 2011-04-20 20:30         ` Shirley Ma
  1 sibling, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:30 UTC (permalink / raw)
  To: Dimitris Michailidis
  Cc: Ben Hutchings, David Miller, mst, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel

Resubmitting it with bit 31.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 include/linux/netdevice.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0249fe7..0998d3d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1067,6 +1067,9 @@ struct net_device {
 #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
 #define NETIF_F_RXCSUM		(1 << 29) /* Receive checksumming offload */
 
+/* Bit 31 is for device to map userspace buffers -- zerocopy */
+#define NETIF_F_ZEROCOPY	(1 << 31)
+
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
 #define NETIF_F_GSO_MASK	0x00ff0000

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 5/8] Enable cxgb3 to support zerocopy
  2011-04-20 20:13 ` [PATCH V3 5/8] Enable cxgb3 to support zerocopy Shirley Ma
@ 2011-04-20 20:52   ` Dimitris Michailidis
  2011-04-20 20:58     ` Shirley Ma
  0 siblings, 1 reply; 30+ messages in thread
From: Dimitris Michailidis @ 2011-04-20 20:52 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann,
	netdev, kvm, linux-kernel

On 04/20/2011 01:13 PM, Shirley Ma wrote:
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
> 
>  drivers/net/cxgb3/cxgb3_main.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
> index 9108931..93a1101 100644
> --- a/drivers/net/cxgb3/cxgb3_main.c
> +++ b/drivers/net/cxgb3/cxgb3_main.c
> @@ -3313,7 +3313,7 @@ static int __devinit init_one(struct pci_dev *pdev,
>  		netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
>  		netdev->features |= NETIF_F_GRO;
>  		if (pci_using_dac)
> -			netdev->features |= NETIF_F_HIGHDMA;
> +			netdev->features |= NETIF_F_HIGHDMA | NETIF_F_ZEROCOPY;
>  
>  		netdev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
>  		netdev->netdev_ops = &cxgb_netdev_ops;

The features handling has been reworked in net-next and patches like this 
won't apply as the code you're patching has changed.  Also core code now 
does a lot of the related work and you'll need to tell it what to do with 
any new flags.

What properties does a device or driver need to meet to set this flag?  It 
seems to be set a bit too unconditionally.  For example, does it work if one 
disables scatter/gather?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 5/8] Enable cxgb3 to support zerocopy
  2011-04-20 20:52   ` Dimitris Michailidis
@ 2011-04-20 20:58     ` Shirley Ma
  2011-04-20 21:15       ` Shirley Ma
  0 siblings, 1 reply; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 20:58 UTC (permalink / raw)
  To: Dimitris Michailidis
  Cc: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann,
	netdev, kvm, linux-kernel

On Wed, 2011-04-20 at 13:52 -0700, Dimitris Michailidis wrote:
> The features handling has been reworked in net-next and patches like
> this 
> won't apply as the code you're patching has changed.  Also core code
> now 
> does a lot of the related work and you'll need to tell it what to do
> with 
> any new flags.
Ok, will do.

> What properties does a device or driver need to meet to set this flag?
> It 
> seems to be set a bit too unconditionally.  For example, does it work
> if one 
> disables scatter/gather? 

This flag can be ON when HIGHDMA and scatter/gather are supported. I will
modify the patch to set it conditionally.

thanks
Shirley


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 5/8] Enable cxgb3 to support zerocopy
  2011-04-20 20:58     ` Shirley Ma
@ 2011-04-20 21:15       ` Shirley Ma
  0 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-20 21:15 UTC (permalink / raw)
  To: Dimitris Michailidis
  Cc: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann,
	netdev, kvm, linux-kernel

On Wed, 2011-04-20 at 13:58 -0700, Shirley Ma wrote:
> This flag can be ON when HIGHDMA and scatter/gather support. I will
> modify the patch to make it conditionally.

Double checked: it only needs the HIGHDMA condition, not scatter/gather.
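
A minimal sketch of what that conditional setting could look like on the
driver side (illustrative only, not an actual patch):

	if (netdev->features & NETIF_F_HIGHDMA)
		netdev->features |= NETIF_F_ZEROCOPY;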

thanks
Shirley


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
                   ` (8 preceding siblings ...)
  2011-04-20 20:27 ` [PATCH V3 6/8] macvtap/vhost TX zero copy support Shirley Ma
@ 2011-04-21 14:29 ` Jon Mason
  2011-04-22 17:31   ` Shirley Ma
  2011-04-22 17:52 ` Shirley Ma
  10 siblings, 1 reply; 30+ messages in thread
From: Jon Mason @ 2011-04-21 14:29 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann,
	netdev, kvm, linux-kernel

On Wed, Apr 20, 2011 at 3:36 PM, Shirley Ma <mashirle@us.ibm.com> wrote:
> This patchset add supports for TX zero-copy between guest and host
> kernel through vhost. It significantly reduces CPU utilization on the
> local host on which the guest is located (It reduced 30-50% CPU usage
> for vhost thread for single stream test). The patchset is based on
> previous submission and comments from the community regarding when/how
> to handle guest kernel buffers to be released. This is the simplest
> approach I can think of after comparing with several other solutions.
>
> This patchset includes:
>
> 1/8: Add a new sock zero-copy flag, SOCK_ZEROCOPY;
>
> 2/8: Add a new device flag, NETIF_F_ZEROCOPY for lower level device
> support zero-copy;
>
> 3/8: Add a new struct skb_ubuf_info in skb_share_info for userspace
> buffers release callback when lower device DMA has done for that skb;
>
> 4/8: Add vhost zero-copy callback in vhost when skb last refcnt is gone;
> add vhost_zerocopy_add_used_and_signal to notify guest to release TX skb
> buffers.
>
> 5/8: Add macvtap zero-copy in lower device when sending packet is
> greater than 128 bytes.
>
> 6/8: Add Chelsio 10Gb NIC to zero copy feature flag
>
> 7/8: Add Intel 10Gb NIC zero copy feature flag
>
> 8/8: Add Emulex 10Gb NIC zero copy feature flag

Why are only these 3 drivers getting support?  As far as I can tell,
the only requirement is HIGHDMA.  If this is the case, is there really
a need for an additional flag to support this?  If you can key off of
HIGHDMA, all devices that support this would get the benefit.



> The patchset is built against most recent linux 2.6.git. It has passed
> netperf/netserver multiple streams stress test on above NICs.
>
> The single stream test results from 2.6.37 kernel on Chelsio:
>
> 64K message size: copy_from_user dropped from 40% to 5%; vhost thread
> cpu utilization dropped from 76% to 28%
>
> I am collecting more test results against 2.6.39-rc3 kernel and will
> provide the test matrix later.
>
> Thanks
> Shirley
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support
  2011-04-21 14:29 ` [PATCH V3 0/8] " Jon Mason
@ 2011-04-22 17:31   ` Shirley Ma
  0 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-22 17:31 UTC (permalink / raw)
  To: Jon Mason
  Cc: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann,
	netdev, kvm, linux-kernel

On Thu, 2011-04-21 at 10:29 -0400, Jon Mason wrote:
> Why are only these 3 drivers getting support?  As far as I can tell,
> the only requirement is HIGHDMA.  If this is the case, is there really
> a need for an additional flag to support this?  If you can key off of
> HIGHDMA, all devices that support this would get the benefit.

Agreed. So far I have only verified the three 10Gb NICs we have in our
lab. We can enable this for all devices that support HIGHDMA.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support
  2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
                   ` (9 preceding siblings ...)
  2011-04-21 14:29 ` [PATCH V3 0/8] " Jon Mason
@ 2011-04-22 17:52 ` Shirley Ma
  10 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-04-22 17:52 UTC (permalink / raw)
  To: David Miller
  Cc: mst, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel

On Wed, 2011-04-20 at 12:36 -0700, Shirley Ma wrote:
> I am collecting more test results against 2.6.39-rc3 kernel and will
> provide the test matrix later.

Single TCP_STREAM 120-second test results over the ixgbe 10Gb NIC:

Message size   BW (Mb/s)   qemu-kvm CPU   vhost-net CPU   PerfTop irq/s
4K             7408.57      92.1%         22.6%           1229
4K (Orig)      4913.17     118.1%         84.1%           2086
8K             9129.90      89.3%         23.3%           1141
8K (Orig)      7094.55     115.9%         84.7%           2157
16K            9178.81      89.1%         23.3%           1139
16K (Orig)     8927.1      118.7%         83.4%           2262
64K            9171.43      88.4%         24.9%           1253
64K (Orig)     9085.85     115.9%         82.4%           2229

For message sizes less than or equal to 2K, there is a known KVM guest TX
overrun issue. With this zerocopy patch the issue becomes more severe:
guest io_exits have tripled compared to before, so performance is not good.
Once the TX overrun problem has been addressed, I will retest small
message size performance.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-04-20 19:44 ` [PATCH V3 2/8] Add a new zerocopy device flag Shirley Ma
  2011-04-20 19:48   ` Ben Hutchings
@ 2011-05-02 10:42   ` Michael S. Tsirkin
  2011-05-02 18:47     ` Shirley Ma
  1 sibling, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2011-05-02 10:42 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel

On Wed, Apr 20, 2011 at 12:44:08PM -0700, Shirley Ma wrote:
> This zerocopy flag is used to support device DMA userspace buffers.
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
> 
>  include/linux/netdevice.h |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0249fe7..0998d3d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1067,6 +1067,9 @@ struct net_device {
>  #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
>  #define NETIF_F_RXCSUM		(1 << 29) /* Receive checksumming offload */
>  
> +/* bit 29 is for device to map userspace buffers -- zerocopy */

This comment should specify what exactly is the promise the
device makes by setting this flag. Specifically, the
condition is that no skb fragments are used
after the uinfo callback has been called.

The way it's implemented, it probably means the device
should not use any of skb_clone, expand head etc.

> +#define NETIF_F_ZEROCOPY	(1 << 29)
> +
>  	/* Segmentation offload features */
>  #define NETIF_F_GSO_SHIFT	16
>  #define NETIF_F_GSO_MASK	0x00ff0000
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 3/8] Add userspace buffers support in skb
  2011-04-20 19:47 ` [PATCH V3 3/8] Add userspace buffers support in skb Shirley Ma
@ 2011-05-02 10:53   ` Michael S. Tsirkin
  2011-05-03 17:36     ` Shirley Ma
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2011-05-02 10:53 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel

On Wed, Apr 20, 2011 at 12:47:57PM -0700, Shirley Ma wrote:
> This patch adds userspace buffers support in skb. A new struct
> skb_ubuf_info is needed to maintain the userspace buffers argument
> and index, a callback is used to notify userspace to release the
> buffers once lower device has done DMA (Last reference to that skb
> has gone).
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
> 
>  include/linux/skbuff.h |   14 ++++++++++++++
>  net/core/skbuff.c      |   15 ++++++++++++++-
>  2 files changed, 28 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d0ae90a..47a187b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -189,6 +189,16 @@ enum {
>  	SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
>  };
>  
> +/* The callback notifies userspace to release buffers when skb DMA is done in
> + * lower device, the desc is used to track userspace buffer index.
> + */
> +struct skb_ubuf_info {
> +	/* support buffers allocation from userspace */
> +	void		(*callback)(struct sk_buff *);
> +	void		*arg;
> +	size_t		desc;
> +};
> +
>  /* This data is invariant across clones and lives at
>   * the end of the header data, ie. at skb->end.
>   */
> @@ -211,6 +221,10 @@ struct skb_shared_info {
>  	/* Intermediate layers must ensure that destructor_arg
>  	 * remains valid until skb destructor */
>  	void *		destructor_arg;
> +
> +	/* DMA mapping from/to userspace buffers */
> +	struct skb_ubuf_info ubuf;
> +
>  	/* must be last field, see pskb_expand_head() */
>  	skb_frag_t	frags[MAX_SKB_FRAGS];
>  };
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7ebeed0..822c07d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	shinfo = skb_shinfo(skb);
>  	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>  	atomic_set(&shinfo->dataref, 1);
> +	shinfo->ubuf.callback = NULL;
> +	shinfo->ubuf.arg = NULL;
>  	kmemcheck_annotate_variable(shinfo->destructor_arg);
>  
>  	if (fclone) {
> @@ -327,7 +329,15 @@ static void skb_release_data(struct sk_buff *skb)
>  			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>  				put_page(skb_shinfo(skb)->frags[i].page);
>  		}
> -
> +		/*
> +		 * if skb buf is from userspace, we need to notify the caller
> +		 * the lower device DMA has done;
> +		 */
> +		if (skb_shinfo(skb)->ubuf.callback) {
> +			skb_shinfo(skb)->ubuf.callback(skb);
> +			skb_shinfo(skb)->ubuf.callback = NULL;
> +			skb_shinfo(skb)->ubuf.arg = NULL;
> +		}
>  		if (skb_has_frag_list(skb))
>  			skb_drop_fraglist(skb);
>  

We probably don't need to touch arg if callback is NULL?

> @@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
>  	if (irqs_disabled())
>  		return false;
>  
> +	if (shinfo->ubuf.callback)
> +		return false;
> +
>  	if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
>  		return false;

This is not the only API unsupported for these skbs, is it?
Probably need to check and fail there as well.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 6/8] macvtap/vhost TX zero copy support
  2011-04-20 20:27 ` [PATCH V3 6/8] macvtap/vhost TX zero copy support Shirley Ma
@ 2011-05-02 18:35   ` Shirley Ma
  0 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-05-02 18:35 UTC (permalink / raw)
  To: David Miller
  Cc: Arnd Bergmann, mst, Eric Dumazet, Avi Kivity, netdev, kvm,
	linux-kernel

Resubmitting the patch with a fix for a bug found by
<liyonghelpme@foxmail.com> when he tested the 2.6.38.2 kernel.

When virtio_net doesn't enable GSO, vnet hdr_len is 0, so we can't use
vnet hdr_len to size the linear part of the skb in the zerocopy path.
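
Roughly, the intended linear-sizing logic in the hunk below is (a sketch
of the intent, not the literal diff):

	headlen = vnet_hdr.hdr_len;
	if (zerocopy && !headlen)
		headlen = GOODCOPY_LEN;        /* GSO off: hdr_len is 0     */
	copylen = zerocopy ? headlen : len;    /* zerocopy copies only the
	                                        * header into the linear area */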

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 drivers/net/macvtap.c |  124 +++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 114 insertions(+), 10 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..e8dac4d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
  */
 static dev_t macvtap_major;
 #define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN  (L1_CACHE_BYTES < 128 ? 128 : L1_CACHE_BYTES)
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
 {
 	struct net *net = current->nsproxy->net_ns;
 	struct net_device *dev = dev_get_by_index(net, iminor(inode));
+	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvtap_queue *q;
 	int err;
 
@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
 	q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
 
+	/*
+	 * so far only VM uses macvtap, enable zero copy between guest
+	 * kernel and host kernel when lower device supports high memory
+	 * DMA
+	 */
+	if (vlan) {
+		if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
+			sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+	}
+
 	err = macvtap_set_queue(dev, file, q);
 	if (err)
 		sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
 	return skb;
 }
 
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+				  int offset, size_t count)
+{
+	int len = iov_length(from, count) - offset;
+	int copy = skb_headlen(skb);
+	int size, offset1 = 0;
+	int i = 0;
+	skb_frag_t *f;
+
+	/* Skip over from offset */
+	while (offset >= from->iov_len) {
+		offset -= from->iov_len;
+		++from;
+		--count;
+	}
+
+	/* copy up to skb headlen */
+	while (copy > 0) {
+		size = min_t(unsigned int, copy, from->iov_len - offset);
+		if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+				   size))
+			return -EFAULT;
+		if (copy > size) {
+			++from;
+			--count;
+		}
+		copy -= size;
+		offset1 += size;
+		offset = 0;
+	}
+
+	if (len == offset1)
+		return 0;
+
+	while (count--) {
+		struct page *page[MAX_SKB_FRAGS];
+		int num_pages;
+		unsigned long base;
+
+		len = from->iov_len - offset1;
+		if (!len) {
+			offset1 = 0;
+			++from;
+			continue;
+		}
+		base = (unsigned long)from->iov_base + offset1;
+		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+		if ((num_pages != size) ||
+		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+			/* put_page is in skb free */
+			return -EFAULT;
+		skb->data_len += len;
+		skb->len += len;
+		skb->truesize += len;
+		while (len) {
+			f = &skb_shinfo(skb)->frags[i];
+			f->page = page[i];
+			f->page_offset = base & ~PAGE_MASK;
+			f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+			skb_shinfo(skb)->nr_frags++;
+			/* increase sk_wmem_alloc */
+			atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+			base += f->size;
+			len -= f->size;
+			i++;
+		}
+		offset1 = 0;
+		++from;
+	}
+	return 0;
+}
+
 /*
  * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
  * be shared with the tun/tap driver.
@@ -515,17 +601,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
 
 
 /* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
-				const struct iovec *iv, size_t count,
-				int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+				const struct iovec *iv, unsigned long total_len,
+				size_t count, int noblock)
 {
 	struct sk_buff *skb;
 	struct macvlan_dev *vlan;
-	size_t len = count;
+	unsigned long len = total_len;
 	int err;
 	struct virtio_net_hdr vnet_hdr = { 0 };
 	int vnet_hdr_len = 0;
+	int headlen, copylen, zerocopy;
 
+	zerocopy = sock_flag(&q->sk, SOCK_ZEROCOPY) && (len > GOODCOPY_LEN);
 	if (q->flags & IFF_VNET_HDR) {
 		vnet_hdr_len = q->vnet_hdr_sz;
 
@@ -552,12 +640,28 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	if (unlikely(len < ETH_HLEN))
 		goto err;
 
-	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
+	headlen = vnet_hdr.hdr_len;
+	if (zerocopy) {
+		/* create a linear skb when vnet hdr doesn't enable gso */
+		if (!vnet_hdr.hdr_len)
+			headlen = GOODCOPY_LEN;
+		copylen = headlen;
+	} else
+		copylen = len;
+
+	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen, headlen,
 				noblock, &err);
 	if (!skb)
 		goto err;
 
-	err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+	if (zerocopy)
+		err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+	else
+		err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+						   len);
+	if (m && m->msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY))
+		memcpy(&skb_shinfo(skb)->ubuf, m->msg_control,
+		       sizeof(struct skb_ubuf_info));
 	if (err)
 		goto err_kfree;
 
@@ -579,7 +683,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 		kfree_skb(skb);
 	rcu_read_unlock_bh();
 
-	return count;
+	return total_len;
 
 err_kfree:
 	kfree_skb(skb);
@@ -601,8 +705,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
 	ssize_t result = -ENOLINK;
 	struct macvtap_queue *q = file->private_data;
 
-	result = macvtap_get_user(q, iv, iov_length(iv, count),
-			      file->f_flags & O_NONBLOCK);
+	result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+				  file->f_flags & O_NONBLOCK);
 	return result;
 }
 
@@ -815,7 +919,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
 			   struct msghdr *m, size_t total_len)
 {
 	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
-	return macvtap_get_user(q, m->msg_iov, total_len,
+	return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
 			    m->msg_flags & MSG_DONTWAIT);
 }
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-05-02 10:42   ` Michael S. Tsirkin
@ 2011-05-02 18:47     ` Shirley Ma
  2011-05-02 19:53       ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Shirley Ma @ 2011-05-02 18:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel

On Mon, 2011-05-02 at 13:42 +0300, Michael S. Tsirkin wrote:
> This comment should specify what exactly is the promise the
> device makes by setting this flag. Specifically, the
> condition is that no skb fragments are used
> after the uinfo callback has been called.
> 
> The way it's implemented, it probably means the device
> should not use any of skb_clone, expand head etc.

Agreed. Or maybe force a copy when the device uses skb_clone, expands
the head, etc.?
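
As a rough sketch of the force-a-copy idea (the helper name is
hypothetical, and highmem handling and error unwinding are omitted),
the pinned userspace pages could be swapped for kernel pages before the
clone or head expansion:

static int skb_detach_userspace_frags(struct sk_buff *skb, gfp_t gfp)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	int i;

	for (i = 0; i < shinfo->nr_frags; i++) {
		skb_frag_t *f = &shinfo->frags[i];
		struct page *page = alloc_page(gfp);

		if (!page)
			return -ENOMEM;
		memcpy(page_address(page) + f->page_offset,
		       page_address(f->page) + f->page_offset, f->size);
		put_page(f->page);	/* drop the pinned userspace page */
		f->page = page;
	}
	/* the userspace buffers are no longer referenced, release them */
	if (shinfo->ubuf.callback) {
		shinfo->ubuf.callback(skb);
		shinfo->ubuf.callback = NULL;
	}
	return 0;
}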

Thanks
Shirley

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-05-02 18:47     ` Shirley Ma
@ 2011-05-02 19:53       ` Michael S. Tsirkin
  2011-05-03 17:42         ` Shirley Ma
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2011-05-02 19:53 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel

On Mon, May 02, 2011 at 11:47:08AM -0700, Shirley Ma wrote:
> On Mon, 2011-05-02 at 13:42 +0300, Michael S. Tsirkin wrote:
> > This comment should specify what exactly is the promise the
> > device makes by setting this flag. Specifically, the
> > condition is that no skb fragments are used
> > after the uinfo callback has been called.
> > 
> > The way it's implemented, it probably means the device
> > should not use any of skb_clone, expand head etc.
> 
> Agree. Or maybe force a copy when device uses skb_clone, expand
> head ...?
> 
> Thanks
> Shirley

Copying from userspace upfront, without locking, is probably cheaper?

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 3/8] Add userspace buffers support in skb
  2011-05-02 10:53   ` Michael S. Tsirkin
@ 2011-05-03 17:36     ` Shirley Ma
  0 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-05-03 17:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel

On Mon, 2011-05-02 at 13:53 +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 20, 2011 at 12:47:57PM -0700, Shirley Ma wrote:
> > This patch adds userspace buffers support in skb. A new struct
> > skb_ubuf_info is needed to maintain the userspace buffers argument
> > and index, a callback is used to notify userspace to release the
> > buffers once lower device has done DMA (Last reference to that skb
> > has gone).
> > 
> > Signed-off-by: Shirley Ma <xma@us.ibm.com>
> > ---
> > 
> >  include/linux/skbuff.h |   14 ++++++++++++++
> >  net/core/skbuff.c      |   15 ++++++++++++++-
> >  2 files changed, 28 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index d0ae90a..47a187b 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -189,6 +189,16 @@ enum {
> >       SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
> >  };
> >  
> > +/* The callback notifies userspace to release buffers when skb DMA
> is done in
> > + * lower device, the desc is used to track userspace buffer index.
> > + */
> > +struct skb_ubuf_info {
> > +     /* support buffers allocation from userspace */
> > +     void            (*callback)(struct sk_buff *);
> > +     void            *arg;
> > +     size_t          desc;
> > +};
> > +
> >  /* This data is invariant across clones and lives at
> >   * the end of the header data, ie. at skb->end.
> >   */
> > @@ -211,6 +221,10 @@ struct skb_shared_info {
> >       /* Intermediate layers must ensure that destructor_arg
> >        * remains valid until skb destructor */
> >       void *          destructor_arg;
> > +
> > +     /* DMA mapping from/to userspace buffers */
> > +     struct skb_ubuf_info ubuf;
> > +
> >       /* must be last field, see pskb_expand_head() */
> >       skb_frag_t      frags[MAX_SKB_FRAGS];
> >  };
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 7ebeed0..822c07d 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size,
> gfp_t gfp_mask,
> >       shinfo = skb_shinfo(skb);
> >       memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> >       atomic_set(&shinfo->dataref, 1);
> > +     shinfo->ubuf.callback = NULL;
> > +     shinfo->ubuf.arg = NULL;
> >       kmemcheck_annotate_variable(shinfo->destructor_arg);
> >  
> >       if (fclone) {
> > @@ -327,7 +329,15 @@ static void skb_release_data(struct sk_buff
> *skb)
> >                       for (i = 0; i < skb_shinfo(skb)->nr_frags; i
> ++)
> >                               put_page(skb_shinfo(skb)->frags[i].page);
> >               }
> > -
> > +             /*
> > +              * if skb buf is from userspace, we need to notify the
> caller
> > +              * the lower device DMA has done;
> > +              */
> > +             if (skb_shinfo(skb)->ubuf.callback) {
> > +                     skb_shinfo(skb)->ubuf.callback(skb);
> > +                     skb_shinfo(skb)->ubuf.callback = NULL;
> > +                     skb_shinfo(skb)->ubuf.arg = NULL;
> > +             }
> >               if (skb_has_frag_list(skb))
> >                       skb_drop_fraglist(skb);
> >  
> 
> We probably don't need to touch arg if callback is NULL?

Yes.
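
I.e., only the callback pointer needs clearing, roughly:

		if (skb_shinfo(skb)->ubuf.callback) {
			skb_shinfo(skb)->ubuf.callback(skb);
			skb_shinfo(skb)->ubuf.callback = NULL;
		}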

> > @@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int
> skb_size)
> >       if (irqs_disabled())
> >               return false;
> >  
> > +     if (shinfo->ubuf.callback)
> > +             return false;
> > +
> >       if (skb_is_nonlinear(skb) || skb->fclone !=
> SKB_FCLONE_UNAVAILABLE)
> >               return false;
> 
> This is not the only API unsupported for these skbs, is it?
> Probably need to check and fail there as well. 

Yes, I am going through all of these skb APIs to make sure they are all
covered.

Thanks
Shirley

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-05-02 19:53       ` Michael S. Tsirkin
@ 2011-05-03 17:42         ` Shirley Ma
  2011-05-03 20:11           ` Shirley Ma
  0 siblings, 1 reply; 30+ messages in thread
From: Shirley Ma @ 2011-05-03 17:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel

On Mon, 2011-05-02 at 22:53 +0300, Michael S. Tsirkin wrote:
> On Mon, May 02, 2011 at 11:47:08AM -0700, Shirley Ma wrote:
> > On Mon, 2011-05-02 at 13:42 +0300, Michael S. Tsirkin wrote:
> > > This comment should specify what exactly is the promise the
> > > device makes by setting this flag. Specifically, the
> > > condition is that no skb fragments are used
> > > after the uinfo callback has been called.
> > > 
> > > The way it's implemented, it probably means the device
> > > should not use any of skb_clone, expand head etc.
> > 
> > Agree. Or maybe force a copy when device uses skb_clone, expand
> > head ...?
> > 
> > Thanks
> > Shirley
> 
> Copy from userspace upfront without locking is probably cheaper? 

Better to prevent this kind of skb from being used in skb_clone, expand
head, etc. for now.

Thanks
Shirley

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V3 2/8] Add a new zerocopy device flag
  2011-05-03 17:42         ` Shirley Ma
@ 2011-05-03 20:11           ` Shirley Ma
  0 siblings, 0 replies; 30+ messages in thread
From: Shirley Ma @ 2011-05-03 20:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel

On Tue, 2011-05-03 at 10:42 -0700, Shirley Ma wrote:
> Better to prevent this kind of skbs to be used in skb_clone, expand
> head
> for now.

I looked at the code: skb_clone shouldn't be a problem, since the ubuf
callback is only invoked after the lower device has finished DMA. I can
raise the zero-copy copy length to 256 bytes, so expand head should be
OK as well. That leaves skb recycling as the only path we need to
prevent; I also checked the device drivers, and only a few of them
recycle RX buffers, so there shouldn't be any problem.

I will add more comments here to make it clear that when the ZEROCOPY
flag is set, the ubuf callback must only be called once the last
reference to the skb is gone.
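
Something along these lines, as a sketch of the wording only:

/*
 * Zero-copy skbs: when SOCK_ZEROCOPY/NETIF_F_ZEROCOPY are in use, the
 * frags may point at pinned userspace pages.  The ubuf callback must
 * only be invoked when the last reference to the skb is gone (i.e.
 * after the lower device has finished DMA), and such skbs must not be
 * recycled via skb_recycle_check().
 */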


Thanks
Shirley

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread

Thread overview: 30+ messages
2011-04-20 19:36 [PATCH V3 0/8] macvtap/vhost TX zero copy support Shirley Ma
2011-04-20 19:42 ` [PATCH V3 1/8] Add a new sock zerocopy flag Shirley Ma
2011-04-20 19:44 ` [PATCH V3 2/8] Add a new zerocopy device flag Shirley Ma
2011-04-20 19:48   ` Ben Hutchings
2011-04-20 20:05     ` Shirley Ma
2011-04-20 20:09     ` Shirley Ma
2011-04-20 20:24       ` Dimitris Michailidis
2011-04-20 20:28         ` Shirley Ma
2011-04-20 20:30         ` Shirley Ma
2011-05-02 10:42   ` Michael S. Tsirkin
2011-05-02 18:47     ` Shirley Ma
2011-05-02 19:53       ` Michael S. Tsirkin
2011-05-03 17:42         ` Shirley Ma
2011-05-03 20:11           ` Shirley Ma
2011-04-20 19:47 ` [PATCH V3 3/8] Add userspace buffers support in skb Shirley Ma
2011-05-02 10:53   ` Michael S. Tsirkin
2011-05-03 17:36     ` Shirley Ma
2011-04-20 20:07 ` [PATCH V3 4/8] vhost TX zero copy support Shirley Ma
2011-04-20 20:12 ` [PATCH V3 0/8] macvtap/vhost " Shirley Ma
2011-04-20 20:13 ` [PATCH V3 5/8] Enable cxgb3 to support zerocopy Shirley Ma
2011-04-20 20:52   ` Dimitris Michailidis
2011-04-20 20:58     ` Shirley Ma
2011-04-20 21:15       ` Shirley Ma
2011-04-20 20:15 ` [PATCH V3 7/8] Enable ixgbe " Shirley Ma
2011-04-20 20:17 ` [PATCH V3 8/8] Enable benet " Shirley Ma
2011-04-20 20:27 ` [PATCH V3 6/8] macvtap/vhost TX zero copy support Shirley Ma
2011-05-02 18:35   ` Shirley Ma
2011-04-21 14:29 ` [PATCH V3 0/8] " Jon Mason
2011-04-22 17:31   ` Shirley Ma
2011-04-22 17:52 ` Shirley Ma
