Netdev List
 help / color / mirror / Atom feed
* [PATCH] mlx4: Performing SENSE_PORT command only when supported
From: Yevgeny Petrilin @ 2011-05-04 13:38 UTC (permalink / raw)
  To: davem; +Cc: netdev, yevgenyp


Not all HW supports this functionality, and in this case FW would
report command error.
This patch checks this capability before trying to sense link partner.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
 drivers/net/mlx4/main.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 3814fc9..f47ac5a 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -944,10 +944,12 @@ static int mlx4_setup_hca(struct mlx4_dev *dev)
 	}
 
 	for (port = 1; port <= dev->caps.num_ports; port++) {
-		enum mlx4_port_type port_type = 0;
-		mlx4_SENSE_PORT(dev, port, &port_type);
-		if (port_type)
-			dev->caps.port_type[port] = port_type;
+		if (dev->caps.flags & MLX4_DEV_CAP_FLAG_DPDP) {
+			enum mlx4_port_type port_type = 0;
+			mlx4_SENSE_PORT(dev, port, &port_type);
+			if (port_type)
+				dev->caps.port_type[port] = port_type;
+		}
 		ib_port_default_caps = 0;
 		err = mlx4_get_port_ib_caps(dev, port, &ib_port_default_caps);
 		if (err)
-- 
1.6.0.2




^ permalink raw reply related

* [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
From: Krishna Kumar @ 2011-05-04 14:02 UTC (permalink / raw)
  To: davem; +Cc: eric.dumazet, kvm, mst, netdev, rusty, Krishna Kumar

Earlier approach to improving small packet performance went
along the lines of dropping packets when the txq is full to
avoid stop/start of the txq. Though performance improved
significantly (upto 3x) for a single thread, multiple netperf
sessions showed a regression of upto -17% (starting from 4
sessions).

This patch proposes a different approach with the following
changes:

A. virtio:
	- Provide a API to get available number of slots.

B. virtio-net:
	- Remove stop/start txq's and associated callback.
	- Pre-calculate the number of slots needed to transmit
	  the skb in xmit_skb and bail out early if enough space
	  is not available. My testing shows that 2.5-3% of
	  packets are benefited by using this API.
	- Do not drop skbs but instead return TX_BUSY like other
	  drivers.
	- When returning EBUSY, set a per-txq variable to indicate
	  to dev_queue_xmit() whether to restart xmits on this txq.

C. net/sched/sch_generic.c:
	Since virtio-net now returns EBUSY, the skb is requeued to
	gso_skb. This allows adding the addional check for restart
	xmits in just the slow-path (the first re-queued packet
	case of dequeue_skb, where it checks for gso_skb) before
	deciding whether to call the driver or not.

Patch was also tested between two servers with Emulex OneConnect
10G cards to confirm there is no regression. Though the patch is
an attempt to improve only small packet performance, there was
improvement for 1K, 2K and also 16K both in BW and SD. Results
from Guest -> Remote Host (BW in Mbps) for 1K and 16K I/O sizes:

________________________________________________________
			I/O Size: 1K
#	BW1	BW2 (%)		SD1	SD2 (%)
________________________________________________________
1	1226	3313 (170.2)	6.6	1.9 (-71.2)
2	3223	7705 (139.0)	18.0	7.1 (-60.5)
4	7223	8716 (20.6)	36.5	29.7 (-18.6)
8	8689	8693 (0)	131.5	123.0 (-6.4)
16	8059	8285 (2.8)	578.3	506.2 (-12.4)
32	7758	7955 (2.5)	2281.4	2244.2 (-1.6)
64	7503	7895 (5.2)	9734.0	9424.4 (-3.1)
96	7496	7751 (3.4)	21980.9	20169.3 (-8.2)
128	7389	7741 (4.7)	40467.5	34995.5 (-13.5)
________________________________________________________
Summary:	BW: 16.2%	SD: -10.2%

________________________________________________________
			I/O Size: 16K
#	BW1	BW2 (%)		SD1	SD2 (%)
________________________________________________________
1	6684	7019 (5.0)	1.1	1.1 (0)
2	7674	7196 (-6.2)	5.0	4.8 (-4.0)
4	7358	8032 (9.1)	21.3	20.4 (-4.2)
8	7393	8015 (8.4)	82.7	82.0 (-.8)
16	7958	8366 (5.1)	283.2	310.7 (9.7)
32	7792	8113 (4.1)	1257.5	1363.0 (8.3)
64	7673	8040 (4.7)	5723.1	5812.4 (1.5)
96	7462	7883 (5.6)	12731.8	12119.8 (-4.8)
128	7338	7800 (6.2)	21331.7	21094.7 (-1.1)
________________________________________________________
Summary:	BW: 4.6%	SD: -1.5%

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---

^ permalink raw reply

* [PATCH 1/4] [RFC] netdevice: Introduce per-txq xmit_restart
From: Krishna Kumar @ 2011-05-04 14:03 UTC (permalink / raw)
  To: davem; +Cc: eric.dumazet, kvm, mst, netdev, rusty, Krishna Kumar
In-Reply-To: <20110504140258.14817.66596.sendpatchset@krkumar2.in.ibm.com>

Add a per-txq field that can (optionally) be set by participating
drivers to indicate when to restart tx.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/linux/netdevice.h |    1 +
 1 file changed, 1 insertion(+)

diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2011-05-04 18:57:06.000000000 +0530
+++ new/include/linux/netdevice.h	2011-05-04 18:57:09.000000000 +0530
@@ -571,6 +571,7 @@ struct netdev_queue {
 	 * please use this field instead of dev->trans_start
 	 */
 	unsigned long		trans_start;
+	unsigned long		xmit_restart_jiffies; /* jiffies to restart */
 } ____cacheline_aligned_in_smp;
 
 static inline int netdev_queue_numa_node_read(const struct netdev_queue *q)

^ permalink raw reply

* [PATCH 2/4] [RFC] virtio: Introduce new API to get free space
From: Krishna Kumar @ 2011-05-04 14:03 UTC (permalink / raw)
  To: davem; +Cc: eric.dumazet, kvm, mst, netdev, rusty, Krishna Kumar
In-Reply-To: <20110504140258.14817.66596.sendpatchset@krkumar2.in.ibm.com>

Introduce virtqueue_get_capacity() to help bail out of transmit
path early. Also remove notification when we run out of space (I
am not sure if this should be under a feature bit).

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/virtio/virtio_ring.c |   13 ++++++++-----
 include/linux/virtio.h       |    5 +++++
 2 files changed, 13 insertions(+), 5 deletions(-)

diff -ruNp org/include/linux/virtio.h new/include/linux/virtio.h
--- org/include/linux/virtio.h	2011-05-04 18:57:06.000000000 +0530
+++ new/include/linux/virtio.h	2011-05-04 18:57:09.000000000 +0530
@@ -27,6 +27,9 @@ struct virtqueue {
 
 /**
  * operations for virtqueue
+ * virtqueue_get_capacity: Get vq capacity
+ *	vq: the struct virtqueue we're talking about.
+ *	Returns remaining capacity of queue
  * virtqueue_add_buf: expose buffer to other end
  *	vq: the struct virtqueue we're talking about.
  *	sg: the description of the buffer(s).
@@ -62,6 +65,8 @@ struct virtqueue {
  * All operations can be called in any context.
  */
 
+int virtqueue_get_capacity(struct virtqueue *vq);
+
 int virtqueue_add_buf_gfp(struct virtqueue *vq,
 			  struct scatterlist sg[],
 			  unsigned int out_num,
diff -ruNp org/drivers/virtio/virtio_ring.c new/drivers/virtio/virtio_ring.c
--- org/drivers/virtio/virtio_ring.c	2011-05-04 18:57:06.000000000 +0530
+++ new/drivers/virtio/virtio_ring.c	2011-05-04 18:57:09.000000000 +0530
@@ -156,6 +156,14 @@ static int vring_add_indirect(struct vri
 	return head;
 }
 
+int virtqueue_get_capacity(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	return vq->num_free;
+}
+EXPORT_SYMBOL_GPL(virtqueue_get_capacity);
+
 int virtqueue_add_buf_gfp(struct virtqueue *_vq,
 			  struct scatterlist sg[],
 			  unsigned int out,
@@ -185,11 +193,6 @@ int virtqueue_add_buf_gfp(struct virtque
 	if (vq->num_free < out + in) {
 		pr_debug("Can't add buf len %i - avail = %i\n",
 			 out + in, vq->num_free);
-		/* FIXME: for historical reasons, we force a notify here if
-		 * there are outgoing parts to the buffer.  Presumably the
-		 * host should service the ring ASAP. */
-		if (out)
-			vq->notify(&vq->vq);
 		END_USE(vq);
 		return -ENOSPC;
 	}

^ permalink raw reply

* [PATCH 3/4] [RFC] virtio-net: Changes to virtio-net driver
From: Krishna Kumar @ 2011-05-04 14:03 UTC (permalink / raw)
  To: davem; +Cc: eric.dumazet, kvm, mst, netdev, rusty, Krishna Kumar
In-Reply-To: <20110504140258.14817.66596.sendpatchset@krkumar2.in.ibm.com>

Changes:

1. Remove xmit notification
2. free_old_xmit_skbs() frees upto a limit to reduce tx jitter.
3. xmit_skb() precalculates the number of slots and checks if
   that is available. It assumes that we are not using
   indirect descriptors at this time.
4. start_xmit() becomes a small routine that removes most error
   checks, does not drop packets but instead returns EBUSY if
   there is no space to transmit. It also sets when to restart
   xmits in future.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/net/virtio_net.c |   70 ++++++++++---------------------------
 1 file changed, 20 insertions(+), 50 deletions(-)

diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c	2011-05-04 18:57:06.000000000 +0530
+++ new/drivers/net/virtio_net.c	2011-05-04 18:57:09.000000000 +0530
@@ -117,17 +117,6 @@ static struct page *get_a_page(struct vi
 	return p;
 }
 
-static void skb_xmit_done(struct virtqueue *svq)
-{
-	struct virtnet_info *vi = svq->vdev->priv;
-
-	/* Suppress further interrupts. */
-	virtqueue_disable_cb(svq);
-
-	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
-}
-
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
 			 unsigned int offset, unsigned int *len)
 {
@@ -509,19 +498,18 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static inline void free_old_xmit_skbs(struct virtnet_info *vi)
 {
 	struct sk_buff *skb;
-	unsigned int len, tot_sgs = 0;
+	unsigned int count = 0, len;
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while (count++ < MAX_SKB_FRAGS+2 &&
+	       (skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
 		dev_kfree_skb_any(skb);
 	}
-	return tot_sgs;
 }
 
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
@@ -531,6 +519,12 @@ static int xmit_skb(struct virtnet_info 
 
 	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
 
+	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
+	if (unlikely(hdr->num_sg > virtqueue_get_capacity(vi->svq))) {
+		/* Don't rely on indirect descriptors when reaching capacity */
+		return -ENOSPC;
+	}
+
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		hdr->hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
 		hdr->hdr.csum_start = skb_checksum_start_offset(skb);
@@ -566,7 +560,6 @@ static int xmit_skb(struct virtnet_info 
 	else
 		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
 	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
 					0, skb);
 }
@@ -574,30 +567,21 @@ static int xmit_skb(struct virtnet_info 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	int capacity;
 
 	/* Free up any pending old buffers before queueing new ones. */
 	free_old_xmit_skbs(vi);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	if (unlikely(xmit_skb(vi, skb) < 0)) {
+		struct netdev_queue *txq;
 
-	/* This can happen with OOM and indirect buffers. */
-	if (unlikely(capacity < 0)) {
-		if (net_ratelimit()) {
-			if (likely(capacity == -ENOMEM)) {
-				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
-			} else {
-				dev->stats.tx_fifo_errors++;
-				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
-			}
-		}
-		dev->stats.tx_dropped++;
-		kfree_skb(skb);
-		return NETDEV_TX_OK;
+		/*
+		 * Tell kernel to restart xmits after 1 jiffy to help the
+		 * host catch up.
+		 */
+		txq = netdev_get_tx_queue(dev, 0);
+		txq->xmit_restart_jiffies = jiffies + 1;
+		return NETDEV_TX_BUSY;
 	}
 	virtqueue_kick(vi->svq);
 
@@ -605,20 +589,6 @@ static netdev_tx_t start_xmit(struct sk_
 	skb_orphan(skb);
 	nf_reset(skb);
 
-	/* Apparently nice girls don't return TX_BUSY; stop the queue
-	 * before it gets out of hand.  Naturally, this wastes entries. */
-	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
-			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
-			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
-			}
-		}
-	}
-
 	return NETDEV_TX_OK;
 }
 
@@ -881,7 +851,7 @@ static int virtnet_probe(struct virtio_d
 	struct net_device *dev;
 	struct virtnet_info *vi;
 	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
+	vq_callback_t *callbacks[] = { skb_recv_done, NULL, NULL};
 	const char *names[] = { "input", "output", "control" };
 	int nvqs;
 

^ permalink raw reply

* [PATCH 4/4] [RFC] sched: Changes to dequeue_skb
From: Krishna Kumar @ 2011-05-04 14:03 UTC (permalink / raw)
  To: davem; +Cc: eric.dumazet, kvm, mst, netdev, rusty, Krishna Kumar
In-Reply-To: <20110504140258.14817.66596.sendpatchset@krkumar2.in.ibm.com>

Dequeue_skb has an additional check, for the first packet that
is requeued, to see if the device has requested xmits after a
interval. This is intended to not affect the fast xmit path, and
have minimal overhead to the slow path. Drivers setting the
restart time should not stop/start their tx queues, and hence
the frozen/stopped check can be avoided.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/sched/sch_generic.c |   23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c	2011-05-04 18:57:06.000000000 +0530
+++ new/net/sched/sch_generic.c	2011-05-04 18:57:09.000000000 +0530
@@ -50,17 +50,30 @@ static inline int dev_requeue_skb(struct
 	return 0;
 }
 
+/*
+ * This function can return a rare false positive for drivers setting
+ * xmit_restart_jiffies (e.g. virtio-net) when xmit_restart_jiffies is
+ * zero but the device may not be ready. That only leads to the skb
+ * being requeued again.
+ */
+static inline int can_restart_xmit(struct Qdisc *q, struct sk_buff *skb)
+{
+	struct net_device *dev = qdisc_dev(q);
+	struct netdev_queue *txq;
+
+	txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
+	if (unlikely(txq->xmit_restart_jiffies))
+		return time_after_eq(jiffies, txq->xmit_restart_jiffies);
+	return !netif_tx_queue_frozen_or_stopped(txq);
+}
+
 static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
 {
 	struct sk_buff *skb = q->gso_skb;
 
 	if (unlikely(skb)) {
-		struct net_device *dev = qdisc_dev(q);
-		struct netdev_queue *txq;
-
 		/* check the reason of requeuing without tx lock first */
-		txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
-		if (!netif_tx_queue_frozen_or_stopped(txq)) {
+		if (can_restart_xmit(q, skb)) {
 			q->gso_skb = NULL;
 			q->q.qlen--;
 		} else

^ permalink raw reply

* Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
From: Krishna Kumar @ 2011-05-04 14:10 UTC (permalink / raw)
  To: davem; +Cc: eric.dumazet, kvm, mst, netdev, rusty, Krishna Kumar

Krishna Kumar2/India/IBM@IBMIN wrote on 05/04/2011 07:32:58 PM:

> [PATCH 0/4] [RFC] virtio-net: Improve small packet performance

I found having tabs in the table made the results a little
difficult to understand. Converting the same to spaces, hope
it is clear this time.

________________________________________________________
               I/O Size: 1K
#     BW1      BW2 (%)            SD1       SD2 (%)
________________________________________________________
1     1226     3313 (170.2)       6.6       1.9 (-71.2)
2     3223     7705 (139.0)       18.0      7.1 (-60.5)
4     7223     8716 (20.6)        36.5      29.7 (-18.6)
8     8689     8693 (0)           131.5     123.0 (-6.4)
16    8059     8285 (2.8)         578.3     506.2 (-12.4)
32    7758     7955 (2.5)         2281.4    2244.2 (-1.6)
64    7503     7895 (5.2)         9734.0    9424.4 (-3.1)
96    7496     7751 (3.4)         21980.9   20169.3 (-8.2)
128   7389     7741 (4.7)         40467.5   34995.5 (-13.5)
________________________________________________________
Summary:     BW: 16.2%     SD: -10.2%

________________________________________________________
               I/O Size: 16K
#     BW1      BW2 (%)            SD1       SD2 (%)
________________________________________________________
1     6684     7019 (5.0)         1.1       1.1 (0)
2     7674     7196 (-6.2)        5.0       4.8 (-4.0)
4     7358     8032 (9.1)         21.3      20.4 (-4.2)
8     7393     8015 (8.4)         82.7      82.0 (-.8)
16    7958     8366 (5.1)         283.2     310.7 (9.7)
32    7792     8113 (4.1)         1257.5    1363.0 (8.3)
64    7673     8040 (4.7)         5723.1    5812.4 (1.5)
96    7462     7883 (5.6)         12731.8   12119.8 (-4.8)
128   7338     7800 (6.2)         21331.7   21094.7 (-1.1)
________________________________________________________
Summary:     BW: 4.6%     SD: -1.5%

^ permalink raw reply

* Re: [PATCH v4 1/1] can: add pruss CAN driver.
From: Wolfgang Grandegger @ 2011-05-04 14:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: sachi-EvXpCiN+lbve9wHmmfpqLFaTQe2KTcn/,
	davinci-linux-open-source-VycZQUHpC/PFrsHnngEfi1aTQe2KTcn/,
	Subhasish Ghosh, nsekhar-l0cyMroinI0, open list,
	CAN NETWORK DRIVERS, Marc Kleine-Budde,
	Netdev-u79uwXL29TY76Z2rM5mHXA, m-watkins-l0cyMroinI0,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <201105041511.54095.arnd-r2nGTMty4D4@public.gmane.org>

Hi Arnd,

On 05/04/2011 03:11 PM, Arnd Bergmann wrote:
> On Wednesday 04 May 2011, Subhasish Ghosh wrote:
>> CAN requires mail box IDs to be programmed in. But, the socket
>> CAN subsystem supports only software filtering of the mail box IDs.
>>
>> So, the mail box IDs programmed into socket CAN during initialization
>> does not propagate into the hardware. This is planned to be a future
>> implementation in Socket CAN.
>>
>> In our case, we support hardware filtering, to work around with this,
>> Wolfgang (Socket CAN owner) suggested that we implement
>> this using sysfs.
>>
>> These setting are not for debugging, but to program the mail box IDs
>> into the hardware. 
> 
> Ok, I see. Can you point me to that discussion?
> 
> Wolfgang, I'm a bit worried by the API being split between sockets and sysfs.
> The problem is that once the sysfs API is established, users will start
> relying on it, and you can no longer migrate away from it, even when
> a later version of the Socket CAN also supports setting through a different
> interface. What is the current interface to set mail box IDs in software?

Note that this CAN controller is *very* special. It cannot handle all
CAN id's due to a lack or resources. The PRUSS firmware is able to
manage just up to 8 different CAN identifiers out of the usual 4096
(12-bit) or even more for the extended CAN ids using 29 bits. There is
no other CAN controller with such rather serious limitations and
therefore there exists also no appropriate interface. I think using
sysfs is OK for such device-specific parameters, at least for the time
being.

> How hard would it be to implement that feature in Socket CAN?

CAN controllers usually provide some kind of hardware CAN id filtering,
but in a very hardware dependent way. A generic interface may be able to
handle the PRUSS restrictions as well. CAN devices are usually
configured through the netlink interface. e.g.

  $ ip link set can0 up type can bitrate 125000

and such a common interface would be netlink based as well.

> Is that something that Subhasish or someone else could to as a prerequisite
> to merging the driver?

Any ideas on how to handle hardware filtering in a generic way are
welcome. I will try to come up with a proposal sooner than later.

Wolfgang.

^ permalink raw reply

* Re: [PATCH] mlx4_en: Setting RSS hash result to skb->rxhash field
From: Ben Hutchings @ 2011-05-04 14:44 UTC (permalink / raw)
  To: Yevgeny Petrilin; +Cc: davem, netdev
In-Reply-To: <4DC156AF.4050807@mellanox.co.il>

On Wed, 2011-05-04 at 16:37 +0300, Yevgeny Petrilin wrote:
> Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
> ---
>  drivers/net/mlx4/en_rx.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/mlx4/en_rx.c b/drivers/net/mlx4/en_rx.c
> index 62dd21b..bb4d66a 100644
> --- a/drivers/net/mlx4/en_rx.c
> +++ b/drivers/net/mlx4/en_rx.c
> @@ -610,6 +610,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  					gro_skb->data_len = length;
>  					gro_skb->truesize += length;
>  					gro_skb->ip_summed = CHECKSUM_UNNECESSARY;
> +					gro_skb->rxhash = be32_to_cpu(cqe->immed_rss_invalid) << 24;
> +					skb_record_rx_queue(gro_skb, cq->ring);

An 8-bit hash is almost useless.  It's entirely useless if you then
shift it into the top bits of rxhash.

Ben.

>  					if (priv->vlgrp && (cqe->vlan_my_qpn &
>  							    cpu_to_be32(MLX4_CQE_VLAN_PRESENT_MASK)))
> @@ -643,6 +645,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  			goto next;
>  		}
>  
> +		skb->rxhash = be32_to_cpu(cqe->immed_rss_invalid) << 24;
>  		skb->ip_summed = ip_summed;
>  		skb->protocol = eth_type_trans(skb, dev);
>  		skb_record_rx_queue(skb, cq->ring);

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
From: Michael S. Tsirkin @ 2011-05-04 14:46 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20110504140258.14817.66596.sendpatchset@krkumar2.in.ibm.com>

On Wed, May 04, 2011 at 07:32:58PM +0530, Krishna Kumar wrote:
> Earlier approach to improving small packet performance went
> along the lines of dropping packets when the txq is full to
> avoid stop/start of the txq. Though performance improved
> significantly (upto 3x) for a single thread, multiple netperf
> sessions showed a regression of upto -17% (starting from 4
> sessions).
> 
> This patch proposes a different approach with the following
> changes:
> 
> A. virtio:
> 	- Provide a API to get available number of slots.
> 
> B. virtio-net:
> 	- Remove stop/start txq's and associated callback.
> 	- Pre-calculate the number of slots needed to transmit
> 	  the skb in xmit_skb and bail out early if enough space
> 	  is not available. My testing shows that 2.5-3% of
> 	  packets are benefited by using this API.
> 	- Do not drop skbs but instead return TX_BUSY like other
> 	  drivers.
> 	- When returning EBUSY, set a per-txq variable to indicate
> 	  to dev_queue_xmit() whether to restart xmits on this txq.
> 
> C. net/sched/sch_generic.c:
> 	Since virtio-net now returns EBUSY, the skb is requeued to
> 	gso_skb. This allows adding the addional check for restart
> 	xmits in just the slow-path (the first re-queued packet
> 	case of dequeue_skb, where it checks for gso_skb) before
> 	deciding whether to call the driver or not.
> 
> Patch was also tested between two servers with Emulex OneConnect
> 10G cards to confirm there is no regression. Though the patch is
> an attempt to improve only small packet performance, there was
> improvement for 1K, 2K and also 16K both in BW and SD. Results
> from Guest -> Remote Host (BW in Mbps) for 1K and 16K I/O sizes:
> 
> ________________________________________________________
> 			I/O Size: 1K
> #	BW1	BW2 (%)		SD1	SD2 (%)
> ________________________________________________________
> 1	1226	3313 (170.2)	6.6	1.9 (-71.2)
> 2	3223	7705 (139.0)	18.0	7.1 (-60.5)
> 4	7223	8716 (20.6)	36.5	29.7 (-18.6)
> 8	8689	8693 (0)	131.5	123.0 (-6.4)
> 16	8059	8285 (2.8)	578.3	506.2 (-12.4)
> 32	7758	7955 (2.5)	2281.4	2244.2 (-1.6)
> 64	7503	7895 (5.2)	9734.0	9424.4 (-3.1)
> 96	7496	7751 (3.4)	21980.9	20169.3 (-8.2)
> 128	7389	7741 (4.7)	40467.5	34995.5 (-13.5)
> ________________________________________________________
> Summary:	BW: 16.2%	SD: -10.2%
> 
> ________________________________________________________
> 			I/O Size: 16K
> #	BW1	BW2 (%)		SD1	SD2 (%)
> ________________________________________________________
> 1	6684	7019 (5.0)	1.1	1.1 (0)
> 2	7674	7196 (-6.2)	5.0	4.8 (-4.0)
> 4	7358	8032 (9.1)	21.3	20.4 (-4.2)
> 8	7393	8015 (8.4)	82.7	82.0 (-.8)
> 16	7958	8366 (5.1)	283.2	310.7 (9.7)
> 32	7792	8113 (4.1)	1257.5	1363.0 (8.3)
> 64	7673	8040 (4.7)	5723.1	5812.4 (1.5)
> 96	7462	7883 (5.6)	12731.8	12119.8 (-4.8)
> 128	7338	7800 (6.2)	21331.7	21094.7 (-1.1)
> ________________________________________________________
> Summary:	BW: 4.6%	SD: -1.5%
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---

So IIUC, we delay transmit by an arbitrary value and hope
that the host is done with the packets by then?

Interesting.

I am currently testing an approach where
we tell the host explicitly to interrupt us only after
a large part of the queue is empty.
With 256 entries in a queue, we should get 1 interrupt per
on the order of 100 packets which does not seem like a lot.

I can post it, mind testing this?

-- 
MST

^ permalink raw reply

* Re: [PATCH v4 1/1] can: add pruss CAN driver.
From: Arnd Bergmann @ 2011-05-04 14:48 UTC (permalink / raw)
  To: Wolfgang Grandegger
  Cc: sachi-EvXpCiN+lbve9wHmmfpqLFaTQe2KTcn/,
	davinci-linux-open-source-VycZQUHpC/PFrsHnngEfi1aTQe2KTcn/,
	Subhasish Ghosh, nsekhar-l0cyMroinI0, open list,
	CAN NETWORK DRIVERS, Marc Kleine-Budde,
	Netdev-u79uwXL29TY76Z2rM5mHXA, m-watkins-l0cyMroinI0,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <4DC163D7.9010309-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

On Wednesday 04 May 2011, Wolfgang Grandegger wrote:
> On 05/04/2011 03:11 PM, Arnd Bergmann wrote:

> > Wolfgang, I'm a bit worried by the API being split between sockets and sysfs.
> > The problem is that once the sysfs API is established, users will start
> > relying on it, and you can no longer migrate away from it, even when
> > a later version of the Socket CAN also supports setting through a different
> > interface. What is the current interface to set mail box IDs in software?
> 
> Note that this CAN controller is *very* special. It cannot handle all
> CAN id's due to a lack or resources. The PRUSS firmware is able to
> manage just up to 8 different CAN identifiers out of the usual 4096
> (12-bit) or even more for the extended CAN ids using 29 bits. 

So for other controllers, they can simply access every ID within
the range (12 or 29 bits), but there is no need to configure?

What are these IDs for? Do they identify a local port, a remote address,
a connection, or something else?

> There is
> no other CAN controller with such rather serious limitations and
> therefore there exists also no appropriate interface. I think using
> sysfs is OK for such device-specific parameters, at least for the time
> being.

It sounds like it's not very scalable, especially since the implementation
is done completely in firmware. Imagine a new firmware version suddenly
supporting 256 ids instead of 8 -- you'd then have to create 256 sysfs
files to be compatible if I understand you correctly.

> > How hard would it be to implement that feature in Socket CAN?
> 
> CAN controllers usually provide some kind of hardware CAN id filtering,
> but in a very hardware dependent way. A generic interface may be able to
> handle the PRUSS restrictions as well. CAN devices are usually
> configured through the netlink interface. e.g.
> 
>   $ ip link set can0 up type can bitrate 125000
> 
> and such a common interface would be netlink based as well.

Agreed.

> > Is that something that Subhasish or someone else could to as a prerequisite
> > to merging the driver?
> 
> Any ideas on how to handle hardware filtering in a generic way are
> welcome. I will try to come up with a proposal sooner than later.

It sounds to me like the best solution would be change the firmware
to lift that restriction and simply allow all IDs, in case it's not
actually a hardware limitation (which sounds unlikely).

If that's not possible, maybe it's possible to define a generic
filtering interface using netlink, and then either do it completely
in the kernel, or using the hardware support.

	Arnd

^ permalink raw reply

* Re: [PATCH 2/4] [RFC] virtio: Introduce new API to get free space
From: Michael S. Tsirkin @ 2011-05-04 14:50 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20110504140319.14817.23145.sendpatchset@krkumar2.in.ibm.com>

On Wed, May 04, 2011 at 07:33:19PM +0530, Krishna Kumar wrote:
> Introduce virtqueue_get_capacity() to help bail out of transmit
> path early. Also remove notification when we run out of space (I
> am not sure if this should be under a feature bit).
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  drivers/virtio/virtio_ring.c |   13 ++++++++-----
>  include/linux/virtio.h       |    5 +++++
>  2 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff -ruNp org/include/linux/virtio.h new/include/linux/virtio.h
> --- org/include/linux/virtio.h	2011-05-04 18:57:06.000000000 +0530
> +++ new/include/linux/virtio.h	2011-05-04 18:57:09.000000000 +0530
> @@ -27,6 +27,9 @@ struct virtqueue {
>  
>  /**
>   * operations for virtqueue
> + * virtqueue_get_capacity: Get vq capacity
> + *	vq: the struct virtqueue we're talking about.
> + *	Returns remaining capacity of queue
>   * virtqueue_add_buf: expose buffer to other end
>   *	vq: the struct virtqueue we're talking about.
>   *	sg: the description of the buffer(s).
> @@ -62,6 +65,8 @@ struct virtqueue {
>   * All operations can be called in any context.
>   */
>  
> +int virtqueue_get_capacity(struct virtqueue *vq);
> +
>  int virtqueue_add_buf_gfp(struct virtqueue *vq,
>  			  struct scatterlist sg[],
>  			  unsigned int out_num,

This is same as Shirley sent?
Maybe split and attribute ...

> diff -ruNp org/drivers/virtio/virtio_ring.c new/drivers/virtio/virtio_ring.c
> --- org/drivers/virtio/virtio_ring.c	2011-05-04 18:57:06.000000000 +0530
> +++ new/drivers/virtio/virtio_ring.c	2011-05-04 18:57:09.000000000 +0530
> @@ -156,6 +156,14 @@ static int vring_add_indirect(struct vri
>  	return head;
>  }
>  
> +int virtqueue_get_capacity(struct virtqueue *_vq)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +
> +	return vq->num_free;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_get_capacity);
> +
>  int virtqueue_add_buf_gfp(struct virtqueue *_vq,
>  			  struct scatterlist sg[],
>  			  unsigned int out,
> @@ -185,11 +193,6 @@ int virtqueue_add_buf_gfp(struct virtque
>  	if (vq->num_free < out + in) {
>  		pr_debug("Can't add buf len %i - avail = %i\n",
>  			 out + in, vq->num_free);
> -		/* FIXME: for historical reasons, we force a notify here if
> -		 * there are outgoing parts to the buffer.  Presumably the
> -		 * host should service the ring ASAP. */
> -		if (out)
> -			vq->notify(&vq->vq);
>  		END_USE(vq);
>  		return -ENOSPC;
>  	}

This will break qemu versions 0.13 and back.
I'm adding some new virtio ring flags, we'll be
able to reuse one of these to mean 'no need for
work around', I think.

-- 
MST

^ permalink raw reply

* Re: [PATCH V4 5/8]macvtap: macvtap TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-04 14:58 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1304496893.20660.88.camel@localhost.localdomain>

On Wed, May 04, 2011 at 01:14:53AM -0700, Shirley Ma wrote:
> Only when buffer size is greater than GOODCOPY_LEN (256), macvtap
> enables zero-copy.
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>


Looks good. Some thoughts below.

> ---
> 
>  drivers/net/macvtap.c |  126 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 files changed, 115 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 6696e56..e8bc5ff 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
>   */
>  static dev_t macvtap_major;
>  #define MACVTAP_NUM_DEVS 65536
> +#define GOODCOPY_LEN 256

Scope with MACVTAP_ please.

For small packets, is it better to copy in vhost
and skip all the back and forth with callbacks? If yes, does
it make sense to put the constant above in some header
shared with vhost-net?

>  static struct class *macvtap_class;
>  static struct cdev macvtap_cdev;
>  
> @@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
>  {
>  	struct net *net = current->nsproxy->net_ns;
>  	struct net_device *dev = dev_get_by_index(net, iminor(inode));
> +	struct macvlan_dev *vlan = netdev_priv(dev);
>  	struct macvtap_queue *q;
>  	int err;
>  
> @@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
>  	q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
>  	q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
>  
> +	/*
> +	 * so far only VM uses macvtap, enable zero copy between guest
> +	 * kernel and host kernel when lower device supports high memory
> +	 * DMA
> +	 */
> +	if (vlan) {
> +		if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
> +			sock_set_flag(&q->sk, SOCK_ZEROCOPY);
> +	}
> +
>  	err = macvtap_set_queue(dev, file, q);
>  	if (err)
>  		sock_put(&q->sk);
> @@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
>  	return skb;
>  }
>  
> +/* set skb frags from iovec, this can move to core network code for reuse */
> +static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
> +				  int offset, size_t count)
> +{
> +	int len = iov_length(from, count) - offset;
> +	int copy = skb_headlen(skb);
> +	int size, offset1 = 0;
> +	int i = 0;
> +	skb_frag_t *f;
> +
> +	/* Skip over from offset */
> +	while (offset >= from->iov_len) {
> +		offset -= from->iov_len;
> +		++from;
> +		--count;
> +	}
> +
> +	/* copy up to skb headlen */
> +	while (copy > 0) {
> +		size = min_t(unsigned int, copy, from->iov_len - offset);
> +		if (copy_from_user(skb->data + offset1, from->iov_base + offset,
> +				   size))
> +			return -EFAULT;
> +		if (copy > size) {
> +			++from;
> +			--count;
> +		}
> +		copy -= size;
> +		offset1 += size;
> +		offset = 0;
> +	}
> +
> +	if (len == offset1)
> +		return 0;
> +
> +	while (count--) {
> +		struct page *page[MAX_SKB_FRAGS];
> +		int num_pages;
> +		unsigned long base;
> +
> +		len = from->iov_len - offset1;
> +		if (!len) {
> +			offset1 = 0;
> +			++from;
> +			continue;
> +		}
> +		base = (unsigned long)from->iov_base + offset1;
> +		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> +		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
> +		if ((num_pages != size) ||
> +		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
> +			/* put_page is in skb free */
> +			return -EFAULT;
> +		skb->data_len += len;
> +		skb->len += len;
> +		skb->truesize += len;
> +		while (len) {
> +			f = &skb_shinfo(skb)->frags[i];
> +			f->page = page[i];
> +			f->page_offset = base & ~PAGE_MASK;
> +			f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
> +			skb_shinfo(skb)->nr_frags++;
> +			/* increase sk_wmem_alloc */
> +			atomic_add(f->size, &skb->sk->sk_wmem_alloc);

One thing that gave me pause that we only accound for part of the page
here. I think we should count the whole page, no?

> +			base += f->size;
> +			len -= f->size;
> +			i++;
> +		}
> +		offset1 = 0;
> +		++from;
> +	}
> +	return 0;
> +}
> +
>  /*
>   * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
>   * be shared with the tun/tap driver.
> @@ -515,17 +601,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
>  
>  
>  /* Get packet from user space buffer */
> -static ssize_t macvtap_get_user(struct macvtap_queue *q,
> -				const struct iovec *iv, size_t count,
> -				int noblock)
> +static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
> +				const struct iovec *iv, unsigned long total_len,
> +				size_t count, int noblock)
>  {
>  	struct sk_buff *skb;
>  	struct macvlan_dev *vlan;
> -	size_t len = count;
> +	unsigned long len = total_len;
>  	int err;
>  	struct virtio_net_hdr vnet_hdr = { 0 };
>  	int vnet_hdr_len = 0;
> +	int copylen, zerocopy;
>  
> +	zerocopy = sock_flag(&q->sk, SOCK_ZEROCOPY) && (len > GOODCOPY_LEN);
>  	if (q->flags & IFF_VNET_HDR) {
>  		vnet_hdr_len = q->vnet_hdr_sz;
>  
> @@ -552,12 +640,28 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
>  	if (unlikely(len < ETH_HLEN))
>  		goto err;
>  
> -	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
> -				noblock, &err);
> +	if (zerocopy)
> +		/* There are 256 bytes to be copied in skb, so there is enough
> +		 * room for skb expand head in case it is used.
> +		 * The rest buffer is mapped from userspace.
> +		 */
> +		copylen = GOODCOPY_LEN;

Just curious: where does the number 256 come from?
Also, as long as we are copying, should we care about
alignment?

> +	else
> +		copylen = len;
> +
> +	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
> +				vnet_hdr.hdr_len, noblock, &err);
>  	if (!skb)
>  		goto err;
>  
> -	err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
> +	if (zerocopy)
> +		err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
> +	else
> +		err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
> +						   len);
> +	if (sock_flag(&q->sk, SOCK_ZEROCOPY))
> +		memcpy(&skb_shinfo(skb)->ubuf, m->msg_control,
> +			sizeof(struct skb_ubuf_info));
>  	if (err)
>  		goto err_kfree;
>  
> @@ -579,7 +683,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
>  		kfree_skb(skb);
>  	rcu_read_unlock_bh();
>  
> -	return count;
> +	return total_len;
>  
>  err_kfree:
>  	kfree_skb(skb);
> @@ -601,8 +705,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
>  	ssize_t result = -ENOLINK;
>  	struct macvtap_queue *q = file->private_data;
>  
> -	result = macvtap_get_user(q, iv, iov_length(iv, count),
> -			      file->f_flags & O_NONBLOCK);
> +	result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
> +				  file->f_flags & O_NONBLOCK);
>  	return result;
>  }
>  
> @@ -815,7 +919,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
>  			   struct msghdr *m, size_t total_len)
>  {
>  	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
> -	return macvtap_get_user(q, m->msg_iov, total_len,
> +	return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
>  			    m->msg_flags & MSG_DONTWAIT);
>  }
>  
> 

^ permalink raw reply

* Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
From: Krishna Kumar2 @ 2011-05-04 14:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20110504144622.GA15823@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 05/04/2011 08:16:22 PM:

> > A. virtio:
> >    - Provide a API to get available number of slots.
> >
> > B. virtio-net:
> >    - Remove stop/start txq's and associated callback.
> >    - Pre-calculate the number of slots needed to transmit
> >      the skb in xmit_skb and bail out early if enough space
> >      is not available. My testing shows that 2.5-3% of
> >      packets are benefited by using this API.
> >    - Do not drop skbs but instead return TX_BUSY like other
> >      drivers.
> >    - When returning EBUSY, set a per-txq variable to indicate
> >      to dev_queue_xmit() whether to restart xmits on this txq.
> >
> > C. net/sched/sch_generic.c:
> >    Since virtio-net now returns EBUSY, the skb is requeued to
> >    gso_skb. This allows adding the addional check for restart
> >    xmits in just the slow-path (the first re-queued packet
> >    case of dequeue_skb, where it checks for gso_skb) before
> >    deciding whether to call the driver or not.
> >
> > Patch was also tested between two servers with Emulex OneConnect
> > 10G cards to confirm there is no regression. Though the patch is
> > an attempt to improve only small packet performance, there was
> > improvement for 1K, 2K and also 16K both in BW and SD. Results
> > from Guest -> Remote Host (BW in Mbps) for 1K and 16K I/O sizes:
> >
> > ________________________________________________________
> >          I/O Size: 1K
> > #   BW1   BW2 (%)      SD1   SD2 (%)
> > ________________________________________________________
> > 1   1226   3313 (170.2)   6.6   1.9 (-71.2)
> > 2   3223   7705 (139.0)   18.0   7.1 (-60.5)
> > 4   7223   8716 (20.6)   36.5   29.7 (-18.6)
> > 8   8689   8693 (0)   131.5   123.0 (-6.4)
> > 16   8059   8285 (2.8)   578.3   506.2 (-12.4)
> > 32   7758   7955 (2.5)   2281.4   2244.2 (-1.6)
> > 64   7503   7895 (5.2)   9734.0   9424.4 (-3.1)
> > 96   7496   7751 (3.4)   21980.9   20169.3 (-8.2)
> > 128   7389   7741 (4.7)   40467.5   34995.5 (-13.5)
> > ________________________________________________________
> > Summary:   BW: 16.2%   SD: -10.2%
> >
> > ________________________________________________________
> >          I/O Size: 16K
> > #   BW1   BW2 (%)      SD1   SD2 (%)
> > ________________________________________________________
> > 1   6684   7019 (5.0)   1.1   1.1 (0)
> > 2   7674   7196 (-6.2)   5.0   4.8 (-4.0)
> > 4   7358   8032 (9.1)   21.3   20.4 (-4.2)
> > 8   7393   8015 (8.4)   82.7   82.0 (-.8)
> > 16   7958   8366 (5.1)   283.2   310.7 (9.7)
> > 32   7792   8113 (4.1)   1257.5   1363.0 (8.3)
> > 64   7673   8040 (4.7)   5723.1   5812.4 (1.5)
> > 96   7462   7883 (5.6)   12731.8   12119.8 (-4.8)
> > 128   7338   7800 (6.2)   21331.7   21094.7 (-1.1)
> > ________________________________________________________
> > Summary:   BW: 4.6%   SD: -1.5%
> >
> > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> > ---
>
> So IIUC, we delay transmit by an arbitrary value and hope
> that the host is done with the packets by then?

Not "hope" exactly. If the device is not ready, then
the packet is requeued. The main idea is to avoid
drops/stop/starts, etc.

> Interesting.
>
> I am currently testing an approach where
> we tell the host explicitly to interrupt us only after
> a large part of the queue is empty.
> With 256 entries in a queue, we should get 1 interrupt per
> on the order of 100 packets which does not seem like a lot.
>
> I can post it, mind testing this?

Sure.

- KK


^ permalink raw reply

* Re: [PATCHv3 2/2] tg3: Allow ethtool to enable/disable loopback.
From: Stephen Hemminger @ 2011-05-04 15:04 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Mahesh Bandewar, Matt Carlson, David Miller, netdev, Michael Chan,
	Ben Hutchings, Tom Herbert
In-Reply-To: <20110504111112.GA15486@rere.qmqm.pl>

On Wed, 4 May 2011 13:11:12 +0200
Michał Mirosław <mirq-linux@rere.qmqm.pl> wrote:

> On Tue, May 03, 2011 at 06:18:55PM -0700, Mahesh Bandewar wrote:
> > This patch adds tg3_set_features() to handle loopback mode. Currently the
> > capability is added for the devices which support internal MAC loopback mode.
> > So when enabled, it enables internal-MAC loopback.
> [...]
> > diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> > index 7c7c9a8..46de633 100644
> > --- a/drivers/net/tg3.c
> > +++ b/drivers/net/tg3.c
> > @@ -6319,6 +6319,51 @@ static u32 tg3_fix_features(struct net_device *dev, u32 features)
> >  	return features;
> >  }
> >  
> > +static int tg3_set_features(struct net_device *dev, u32 features)
> > +{
> > +	struct tg3 *tp = netdev_priv(dev);
> > +	u32 cur_mode = 0;
> > +	int err = 0;
> > +
> > +	if (!netif_running(dev)) {
> > +		err = -EAGAIN;
> > +		goto sfeatures_out;
> > +	}
> 
> netdev_update_features() is not designed to handle -EAGAIN from
> ndo_set_features callback. It might be useful to implement this
> handling, but in this case you should just return 0 and check
> dev->features in ndo_open callback.
> 
> Best Regards,
> Michał Mirosław

EAGAIN is a bad choice of error code anyway. It implies that the
application should retry, which in this case is not true.

This error code is used for things like flow controlled sockets where
the application should do a select/poll and the condition will clear
when other side has read.

Why not use ENETDOWN instead?

^ permalink raw reply

* Re: [PATCH] net: add mac_pton() for parsing MAC address
From: Stephen Hemminger @ 2011-05-04 15:12 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: davem, netdev
In-Reply-To: <20110504061551.GA12297@p183>

On Wed, 4 May 2011 09:15:51 +0300
Alexey Dobriyan <adobriyan@gmail.com> wrote:

> +int mac_pton(const char *s, u8 *mac)
> +{
> +	int i;
> +
> +	/* XX:XX:XX:XX:XX:XX */
> +	if (strlen(s) < 3 * ETH_ALEN - 1)
> +		return 0;
> +
> +	/* Don't half dirty result. */
Shouldn't this be "Don't allow dirty result."?

> +	for (i = 0; i < ETH_ALEN; i++) {
> +		if (!strchr("0123456789abcdefABCDEF", s[i * 3]))
> +			return 0;
> +		if (!strchr("0123456789abcdefABCDEF", s[i * 3 + 1]))
> +			return 0;

		if (!isxdigit(s[i*3]) || !isxdigit(s[i*3+1]))
			return 0;

> +		if (i != ETH_ALEN - 1 && s[i * 3 + 2] != ':')
> +			return 0;
> +	}
> +	for (i = 0; i < ETH_ALEN; i++) {
> +		mac[i] = (hex_to_bin(s[i * 3]) << 4) | hex_to_bin(s[i * 3 + 1]);
		hex2bin(&mac[i], &s[i*3], 1);
> +	}
> +	return 1;
> +}
> +EXPORT_SYMBOL(mac_pton);

Also don't need two loops, okay to parse partial result.

^ permalink raw reply

* Re: [PATCH V4 4/8]vhost: vhost TX zero-copy support
From: Shirley Ma @ 2011-05-04 15:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <20110504095650.GA11592@redhat.com>

On Wed, 2011-05-04 at 12:56 +0300, Michael S. Tsirkin wrote:
> On Wed, May 04, 2011 at 01:11:24AM -0700, Shirley Ma wrote:
> > This patch maintains the outstanding userspace buffers in the 
> > sequence it is delivered to vhost. The outstanding userspace
> buffers 
> > will be marked as done once the lower device buffers DMA has
> finished. 
> > This is monitored through last reference of kfree_skb callback. Two
> > buffer index are used for this purpose.
> > 
> > The vhost passes the userspace buffers info to lower device skb 
> > through message control. Since there will be some done DMAs when
> > entering vhost handle_tx. The worse case is all buffers in the vq
> are
> > in pending/done status, so we need to notify guest to release DMA
> done 
> > buffers first before get any new buffers from the vq.
> > 
> > Signed-off-by: Shirley <xma@us.ibm.com>
> 
> Looks good overall. Some nits to iron out below.
> 
> > ---
> > 
> >  drivers/vhost/net.c   |   30 +++++++++++++++++++++++++++-
> >  drivers/vhost/vhost.c |   50
> > ++++++++++++++++++++++++++++++++++++++++++++++++-
> >  drivers/vhost/vhost.h |   10 +++++++++
> >  3 files changed, 87 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 2f7c76a..c403afb 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -32,6 +32,8 @@
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> >  
> > +#define MAX_ZEROCOPY_PEND 64
> > +
> 
> Pls document what the above is. Also scope with VHOST_

Will do.

> >  enum {
> >       VHOST_NET_VQ_RX = 0,
> >       VHOST_NET_VQ_TX = 1,
> > @@ -129,6 +131,7 @@ static void handle_tx(struct vhost_net *net)
> >       int err, wmem;
> >       size_t hdr_size;
> >       struct socket *sock;
> > +     struct skb_ubuf_info pend;
> >  
> >       /* TODO: check that we are running from vhost_worker? */
> >       sock = rcu_dereference_check(vq->private_data, 1);
> > @@ -151,6 +154,10 @@ static void handle_tx(struct vhost_net *net)
> >       hdr_size = vq->vhost_hlen;
> >  
> >       for (;;) {
> > +             /* Release DMAs done buffers first */
> > +             if (sock_flag(sock->sk, SOCK_ZEROCOPY))
> > +                     vhost_zerocopy_signal_used(vq);
> > +
> >               head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >                                        ARRAY_SIZE(vq->iov),
> >                                        &out, &in,
> > @@ -166,6 +173,12 @@ static void handle_tx(struct vhost_net *net)
> >                               set_bit(SOCK_ASYNC_NOSPACE,
> &sock->flags);
> >                               break;
> >                       }
> > +                     /* If more outstanding DMAs, queue the work */
> > +                     if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
> > +                         (atomic_read(&vq->refcnt) >
> MAX_ZEROCOPY_PEND)) {
> > +                             vhost_poll_queue(&vq->poll);
> 
> Well, this just keeps polling, doesn't it?
> If you want to wait until # of DMAs is below MAX_ZEROCOPY_PEND,
> you'll need to do the queueing from some callback.
> 
> Something like this: when refcnt is above 2 * MAX_ZEROCOPY_PEND,
> stop submitting and wait until some are freed.
Will try this today

> BTW, can the socket poll wakeup do the job?
Not sure, will try this today.

> > +                             break;
> > +                     }
> >                       if (unlikely(vhost_enable_notify(vq))) {
> >                               vhost_disable_notify(vq);
> >                               continue;
> > @@ -188,17 +201,30 @@ static void handle_tx(struct vhost_net *net)
> >                              iov_length(vq->hdr, s), hdr_size);
> >                       break;
> >               }
> > +             /* use msg_control to pass vhost zerocopy ubuf info to
> skb */
> > +             if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> > +                     pend.callback = vhost_zerocopy_callback;
> > +                     pend.arg = vq;
> > +                     pend.desc = vq->upend_idx;
> > +                     msg.msg_control = &pend;
> > +                     msg.msg_controllen = sizeof(pend);
> > +                     vq->heads[vq->upend_idx].id = head;
> > +                     vq->upend_idx = (vq->upend_idx + 1) %
> UIO_MAXIOV;
> > +                     atomic_inc(&vq->refcnt);
> > +             }
> >               /* TODO: Check specific error and bomb out unless
> ENOBUFS? */
> >               err = sock->ops->sendmsg(NULL, sock, &msg, len);
> >               if (unlikely(err < 0)) {
> > -                     vhost_discard_vq_desc(vq, 1);
> > +                     if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > +                             vhost_discard_vq_desc(vq, 1);
> >                       tx_poll_start(net, sock);
> >                       break;
> >               }
> >               if (err != len)
> >                       pr_debug("Truncated TX packet: "
> >                                " len %d != %zd\n", err, len);
> > -             vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > +             if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > +                     vhost_add_used_and_signal(&net->dev, vq, head,
> 0);
> >               total_len += len;
> >               if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >                       vhost_poll_queue(&vq->poll);
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 2ab2912..3048953 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev
> *dev,
> >       vq->call_ctx = NULL;
> >       vq->call = NULL;
> >       vq->log_ctx = NULL;
> > +     vq->upend_idx = 0;
> > +     vq->done_idx = 0;
> > +     atomic_set(&vq->refcnt, 0);
> >  }
> >  
> >  static int vhost_worker(void *data)
> > @@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct
> vhost_dev
> > *dev)
> >                                              UIO_MAXIOV,
> GFP_KERNEL);
> >               dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log *
> UIO_MAXIOV,
> >                                         GFP_KERNEL);
> > -             dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads
> *
> > +             dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads
> *
> >                                           UIO_MAXIOV, GFP_KERNEL);
> 
> Do we really need to zero it all out? We generally tried to only
> init what is necessary ...

Don't need to.

> >               if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
> > @@ -385,10 +388,41 @@ long vhost_dev_reset_owner(struct vhost_dev
> *dev)
> >       return 0;
> >  }
> >  
> 
> Pls document what the below does.

Ok.

> > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
> > +{
> > +     int i, j = 0;
> > +
> > +     i = vq->done_idx;
> > +     while (i != vq->upend_idx) {
> > +             /* len = 1 means DMA done */
> 
> Hmm. Guests aren't likely to use len 1 in practice,
> but I think it's better to support this.

Will check guest.

> I'd suggest sticking extra stuff in id, IIRC only values
> < vq size are legal there, anything else we can use.
> Also, pls add some defines for special values, better than
> a comment:
>         if (len == VHOST_DMA_DONE_LEN)
OK.

> > +             if (vq->heads[i].len == 1) {
> > +                     /* reset len = 0 */
> > +                     vq->heads[i].len = 0;
> > +                     i = (i + 1) % UIO_MAXIOV;
> > +                     ++j;
> > +             } else
> > +                     break;
> > +     }
> > +     if (j) {
> 
> Pls add some comments to explain the logic here.
OK.

> > +             if (i > vq->done_idx)
> > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> j);
> > +             else {
> > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> > +                                      UIO_MAXIOV - vq->done_idx);
> > +                     vhost_add_used_n(vq, vq->heads, i);
> > +             }
> > +             vq->done_idx = i;
> > +             vhost_signal(vq->dev, vq);
> > +             atomic_sub(j, &vq->refcnt);
> > +     }
> > +}
> > +
> >  /* Caller should have device mutex */
> >  void vhost_dev_cleanup(struct vhost_dev *dev)
> >  {
> >       int i;
> > +     unsigned long begin = jiffies;
> > +
> >  
> >       for (i = 0; i < dev->nvqs; ++i) {
> >               if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
> > @@ -405,6 +439,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> >                       eventfd_ctx_put(dev->vqs[i].call_ctx);
> >               if (dev->vqs[i].call)
> >                       fput(dev->vqs[i].call);
> > +             /* wait for all lower device DMAs done, then notify
> guest */
> > +             while (atomic_read(&dev->vqs[i].refcnt)) {
> > +                     if (time_after(jiffies, begin + 5 * HZ))
> 
> Hmm, does this actually busy-wait?  Let's at least sleep here.
> Or maybe some wakeup scheme can be cooked up.
> For example, have a kref with release function that signals some
> completion.

Will do.

> >
> +                             vhost_zerocopy_signal_used(&dev->vqs[i]);
> > +             }
> >               vhost_vq_reset(dev, dev->vqs + i);
> >       }
> >       vhost_dev_free_iovecs(dev);
> > @@ -1416,3 +1455,12 @@ void vhost_disable_notify(struct
> vhost_virtqueue
> > *vq)
> >               vq_err(vq, "Failed to enable notification at %p: %d
> \n",
> >                      &vq->used->flags, r);
> >  }
> > +
> > +void vhost_zerocopy_callback(struct sk_buff *skb)
> > +{
> > +     int idx = skb_shinfo(skb)->ubuf.desc;
> > +     struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
> > +
> > +     /* set len = 1 to mark this desc buffers done DMA */
> > +     vq->heads[idx].len = 1;
> > +}
> 
> So any kind of callback like that, that goes into the skb,
> will be racy wrt module unloading because module can go away
> after you mark dma done and before this function returns.
> Solution is to have a core function that does the
> final signalling (e.g. sock_wfree is in core).
> Would be nice to fix, even though this race is
> completely theoretical, I don't believe it will
> trigger in practice.

I run lots of stress tests, and never hit this. 

But I can try to fix it.

> 
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index b3363ae..ec032a0 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -108,6 +108,14 @@ struct vhost_virtqueue {
> >       /* Log write descriptors */
> >       void __user *log_base;
> >       struct vhost_log *log;
> > +     /* vhost zerocopy support */
> > +     atomic_t refcnt; /* num of outstanding zerocopy DMAs */
> > +     /* index of zerocopy pending DMA buffers */
> > +     int upend_idx;
> > +     /* index of zerocopy done DMA buffers, but not notify guest
> yet */
> > +     int done_idx;
> 
> Pls try to find more descriptive names for the above,
> and clarify the comments: I could not tell what do the
> comments mean.
> 
> upend_idx seems to be a copy of avail idx?
> done_idx is ... ?

OK.

> > +     /* notify vhost zerocopy DMA buffers has done in lower device
> */
> 
> Do you mean 'notify vhost that zerocopy DMA is complete'?
> 
> > +     void (*callback)(struct sk_buff *);
> 
> Is this actually used?
> If yes rename it zerocopy_dma_done or something like this?

You might be right, I used to use it, but the current code level doesn't
need it any more.

> >  };
> >  
> >  struct vhost_dev {
> > @@ -154,6 +162,8 @@ bool vhost_enable_notify(struct vhost_virtqueue
> *);
> >  
> >  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log
> *log,
> >                   unsigned int log_num, u64 len);
> > +void vhost_zerocopy_callback(struct sk_buff *skb);
> > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);
> >  
> >  #define vq_err(vq, fmt, ...) do {
> \
> >               pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> > 


^ permalink raw reply

* Re: [PATCH V4 5/8]macvtap: macvtap TX zero-copy support
From: Shirley Ma @ 2011-05-04 15:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <20110504145819.GA15648@redhat.com>

On Wed, 2011-05-04 at 17:58 +0300, Michael S. Tsirkin wrote:
> On Wed, May 04, 2011 at 01:14:53AM -0700, Shirley Ma wrote:
> > Only when buffer size is greater than GOODCOPY_LEN (256), macvtap
> > enables zero-copy.
> > 
> > Signed-off-by: Shirley Ma <xma@us.ibm.com>
> 
> 
> Looks good. Some thoughts below.
> 
> > ---
> > 
> >  drivers/net/macvtap.c |  126
> ++++++++++++++++++++++++++++++++++++++++++++----
> >  1 files changed, 115 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> > index 6696e56..e8bc5ff 100644
> > --- a/drivers/net/macvtap.c
> > +++ b/drivers/net/macvtap.c
> > @@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
> >   */
> >  static dev_t macvtap_major;
> >  #define MACVTAP_NUM_DEVS 65536
> > +#define GOODCOPY_LEN 256
> 
> Scope with MACVTAP_ please.
Ok.

> For small packets, is it better to copy in vhost
> and skip all the back and forth with callbacks? If yes, does
> it make sense to put the constant above in some header
> shared with vhost-net?

skb is created in macvtap, the small packet copy is in skb, so I don't
think we can do it in vhost here.

> >  static struct class *macvtap_class;
> >  static struct cdev macvtap_cdev;
> >  
> > @@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode,
> struct file *file)
> >  {
> >       struct net *net = current->nsproxy->net_ns;
> >       struct net_device *dev = dev_get_by_index(net, iminor(inode));
> > +     struct macvlan_dev *vlan = netdev_priv(dev);
> >       struct macvtap_queue *q;
> >       int err;
> >  
> > @@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode,
> struct file *file)
> >       q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
> >       q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
> >  
> > +     /*
> > +      * so far only VM uses macvtap, enable zero copy between guest
> > +      * kernel and host kernel when lower device supports high
> memory
> > +      * DMA
> > +      */
> > +     if (vlan) {
> > +             if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
> > +                     sock_set_flag(&q->sk, SOCK_ZEROCOPY);
> > +     }
> > +
> >       err = macvtap_set_queue(dev, file, q);
> >       if (err)
> >               sock_put(&q->sk);
> > @@ -433,6 +445,80 @@ static inline struct sk_buff
> *macvtap_alloc_skb(struct sock *sk, size_t prepad,
> >       return skb;
> >  }
> >  
> > +/* set skb frags from iovec, this can move to core network code for
> reuse */
> > +static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct
> iovec *from,
> > +                               int offset, size_t count)
> > +{
> > +     int len = iov_length(from, count) - offset;
> > +     int copy = skb_headlen(skb);
> > +     int size, offset1 = 0;
> > +     int i = 0;
> > +     skb_frag_t *f;
> > +
> > +     /* Skip over from offset */
> > +     while (offset >= from->iov_len) {
> > +             offset -= from->iov_len;
> > +             ++from;
> > +             --count;
> > +     }
> > +
> > +     /* copy up to skb headlen */
> > +     while (copy > 0) {
> > +             size = min_t(unsigned int, copy, from->iov_len -
> offset);
> > +             if (copy_from_user(skb->data + offset1, from->iov_base
> + offset,
> > +                                size))
> > +                     return -EFAULT;
> > +             if (copy > size) {
> > +                     ++from;
> > +                     --count;
> > +             }
> > +             copy -= size;
> > +             offset1 += size;
> > +             offset = 0;
> > +     }
> > +
> > +     if (len == offset1)
> > +             return 0;
> > +
> > +     while (count--) {
> > +             struct page *page[MAX_SKB_FRAGS];
> > +             int num_pages;
> > +             unsigned long base;
> > +
> > +             len = from->iov_len - offset1;
> > +             if (!len) {
> > +                     offset1 = 0;
> > +                     ++from;
> > +                     continue;
> > +             }
> > +             base = (unsigned long)from->iov_base + offset1;
> > +             size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >>
> PAGE_SHIFT;
> > +             num_pages = get_user_pages_fast(base, size, 0,
> &page[i]);
> > +             if ((num_pages != size) ||
> > +                 (num_pages > MAX_SKB_FRAGS -
> skb_shinfo(skb)->nr_frags))
> > +                     /* put_page is in skb free */
> > +                     return -EFAULT;
> > +             skb->data_len += len;
> > +             skb->len += len;
> > +             skb->truesize += len;
> > +             while (len) {
> > +                     f = &skb_shinfo(skb)->frags[i];
> > +                     f->page = page[i];
> > +                     f->page_offset = base & ~PAGE_MASK;
> > +                     f->size = min_t(int, len, PAGE_SIZE -
> f->page_offset);
> > +                     skb_shinfo(skb)->nr_frags++;
> > +                     /* increase sk_wmem_alloc */
> > +                     atomic_add(f->size, &skb->sk->sk_wmem_alloc);
> 
> One thing that gave me pause that we only accound for part of the page
> here. I think we should count the whole page, no?

The whole page is pinned, but it might not be for this sock only. 

I think I should change to atomic_add to outside the loop to save cost.

> > +                     base += f->size;
> > +                     len -= f->size;
> > +                     i++;
> > +             }
> > +             offset1 = 0;
> > +             ++from;
> > +     }
> > +     return 0;
> > +}
> > +
> >  /*
> >   * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
> >   * be shared with the tun/tap driver.
> > @@ -515,17 +601,19 @@ static int macvtap_skb_to_vnet_hdr(const
> struct sk_buff *skb,
> >  
> >  
> >  /* Get packet from user space buffer */
> > -static ssize_t macvtap_get_user(struct macvtap_queue *q,
> > -                             const struct iovec *iv, size_t count,
> > -                             int noblock)
> > +static ssize_t macvtap_get_user(struct macvtap_queue *q, struct
> msghdr *m,
> > +                             const struct iovec *iv, unsigned long
> total_len,
> > +                             size_t count, int noblock)
> >  {
> >       struct sk_buff *skb;
> >       struct macvlan_dev *vlan;
> > -     size_t len = count;
> > +     unsigned long len = total_len;
> >       int err;
> >       struct virtio_net_hdr vnet_hdr = { 0 };
> >       int vnet_hdr_len = 0;
> > +     int copylen, zerocopy;
> >  
> > +     zerocopy = sock_flag(&q->sk, SOCK_ZEROCOPY) && (len >
> GOODCOPY_LEN);
> >       if (q->flags & IFF_VNET_HDR) {
> >               vnet_hdr_len = q->vnet_hdr_sz;
> >  
> > @@ -552,12 +640,28 @@ static ssize_t macvtap_get_user(struct
> macvtap_queue *q,
> >       if (unlikely(len < ETH_HLEN))
> >               goto err;
> >  
> > -     skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len,
> vnet_hdr.hdr_len,
> > -                             noblock, &err);
> > +     if (zerocopy)
> > +             /* There are 256 bytes to be copied in skb, so there
> is enough
> > +              * room for skb expand head in case it is used.
> > +              * The rest buffer is mapped from userspace.
> > +              */
> > +             copylen = GOODCOPY_LEN;
> 
> Just curious: where does the number 256 come from?
> Also, as long as we are copying, should we care about
> alignment?

256 makes the size big enough for any skb head expanding.

That's my concern before. But guest should alignment the buffer already,
after moving the pointer 256 bytes. It should still alignment, right?

> > +     else
> > +             copylen = len;
> > +
> > +     skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
> > +                             vnet_hdr.hdr_len, noblock, &err);
> >       if (!skb)
> >               goto err;
> >  
> > -     err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
> len);
> > +     if (zerocopy)
> > +             err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len,
> count);
> > +     else
> > +             err = skb_copy_datagram_from_iovec(skb, 0, iv,
> vnet_hdr_len,
> > +                                                len);
> > +     if (sock_flag(&q->sk, SOCK_ZEROCOPY))
> > +             memcpy(&skb_shinfo(skb)->ubuf, m->msg_control,
> > +                     sizeof(struct skb_ubuf_info));
> >       if (err)
> >               goto err_kfree;
> >  
> > @@ -579,7 +683,7 @@ static ssize_t macvtap_get_user(struct
> macvtap_queue *q,
> >               kfree_skb(skb);
> >       rcu_read_unlock_bh();
> >  
> > -     return count;
> > +     return total_len;
> >  
> >  err_kfree:
> >       kfree_skb(skb);
> > @@ -601,8 +705,8 @@ static ssize_t macvtap_aio_write(struct kiocb
> *iocb, const struct iovec *iv,
> >       ssize_t result = -ENOLINK;
> >       struct macvtap_queue *q = file->private_data;
> >  
> > -     result = macvtap_get_user(q, iv, iov_length(iv, count),
> > -                           file->f_flags & O_NONBLOCK);
> > +     result = macvtap_get_user(q, NULL, iv, iov_length(iv, count),
> count,
> > +                               file->f_flags & O_NONBLOCK);
> >       return result;
> >  }
> >  
> > @@ -815,7 +919,7 @@ static int macvtap_sendmsg(struct kiocb *iocb,
> struct socket *sock,
> >                          struct msghdr *m, size_t total_len)
> >  {
> >       struct macvtap_queue *q = container_of(sock, struct
> macvtap_queue, sock);
> > -     return macvtap_get_user(q, m->msg_iov, total_len,
> > +     return macvtap_get_user(q, m, m->msg_iov, total_len,
> m->msg_iovlen,
> >                           m->msg_flags & MSG_DONTWAIT);
> >  }
> >  
> > 
> 
> 

^ permalink raw reply

* ath5k regression associating with APs in 2.6.38
From: Seth Forshee @ 2011-05-04 15:38 UTC (permalink / raw)
  To: Jiri Slaby, Nick Kossifidis, Luis R. Rodriguez, Bob Copeland
  Cc: John W. Linville, linux-wireless, ath5k-devel, netdev,
	linux-kernel

I've been investigating some reports of a regression in associating with
APs with AR2413 in 2.6.38. Association repeatedly fails with some
"direct probe to x timed out" messages (see syslog excerpt below),
although it will generally associate eventually, after many tries.

Bisection identifies 8aec7af (ath5k: Support synth-only channel change
for AR2413/AR5413) as offending commit. Prior to this commit there are
no direct probe messages at all in the logs. I've also found that
forcing fast to false at the top of ath5k_hw_reset() fixes the issue.
I'm not sure what the connection is between this commit and the
timeouts. Any suggestions?

Thanks,
Seth


May  4 09:32:15 AcerAspire5100 wpa_supplicant[757]: Trying to associate with c0:3f:0e:b9:f3:b2 (SSID='aureola' freq=2452 MHz)
May  4 09:32:15 AcerAspire5100 kernel: [  120.063271] wlan0: direct probe to c0:3f:0e:b9:f3:b2 (try 1/3)
May  4 09:32:16 AcerAspire5100 kernel: [  120.260074] wlan0: direct probe to c0:3f:0e:b9:f3:b2 (try 2/3)
May  4 09:32:16 AcerAspire5100 kernel: [  120.460084] wlan0: direct probe to c0:3f:0e:b9:f3:b2 (try 3/3)
May  4 09:32:16 AcerAspire5100 kernel: [  120.660082] wlan0: direct probe to c0:3f:0e:b9:f3:b2 timed out
May  4 09:32:25 AcerAspire5100 wpa_supplicant[757]: Authentication with c0:3f:0e:b9:f3:b2 timed out.

^ permalink raw reply

* Re: [PATCH V4 4/8]vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-04 15:56 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1304522284.7076.12.camel@localhost.localdomain>

On Wed, May 04, 2011 at 08:18:04AM -0700, Shirley Ma wrote:
> > > +void vhost_zerocopy_callback(struct sk_buff *skb)
> > > +{
> > > +     int idx = skb_shinfo(skb)->ubuf.desc;
> > > +     struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
> > > +
> > > +     /* set len = 1 to mark this desc buffers done DMA */
> > > +     vq->heads[idx].len = 1;
> > > +}
> > 
> > So any kind of callback like that, that goes into the skb,
> > will be racy wrt module unloading because module can go away
> > after you mark dma done and before this function returns.
> > Solution is to have a core function that does the
> > final signalling (e.g. sock_wfree is in core).
> > Would be nice to fix, even though this race is
> > completely theoretical, I don't believe it will
> > trigger in practice.
> 
> I run lots of stress tests, and never hit this. 
> 
> But I can try to fix it.

Yes, it's a theoretical thing. Nice to have but not a must.

^ permalink raw reply

* Re: [PATCH v4 1/1] can: add pruss CAN driver.
From: Kurt Van Dijck @ 2011-05-04 15:57 UTC (permalink / raw)
  To: Wolfgang Grandegger
  Cc: sachi-EvXpCiN+lbve9wHmmfpqLFaTQe2KTcn/,
	davinci-linux-open-source-VycZQUHpC/PFrsHnngEfi1aTQe2KTcn/,
	Arnd Bergmann, Subhasish Ghosh, nsekhar-l0cyMroinI0, open list,
	CAN NETWORK DRIVERS, Marc Kleine-Budde,
	Netdev-u79uwXL29TY76Z2rM5mHXA, m-watkins-l0cyMroinI0,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <4DC163D7.9010309-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

> 
> > How hard would it be to implement that feature in Socket CAN?
> 
> CAN controllers usually provide some kind of hardware CAN id filtering,
> but in a very hardware dependent way. A generic interface may be able to
> handle the PRUSS restrictions as well. CAN devices are usually
> configured through the netlink interface. e.g.
> 
>   $ ip link set can0 up type can bitrate 125000
> 
> and such a common interface would be netlink based as well.
ack.
> 
> > Is that something that Subhasish or someone else could to as a prerequisite
> > to merging the driver?
> 
> Any ideas on how to handle hardware filtering in a generic way are
> welcome. I will try to come up with a proposal sooner than later.

When doing so, I'd vote for an unlimited(by software) list of hardware filters (id/mask).
The hardware must abort when no more filters are available.
I think that when using hardware filters, knowing the actual device
with it's amount of hardware filters is the least of your problems.
Userspace applications that suddenly stop working properly due to
hw filters (i.e. some traffic not coming in anymore) will be a major
source of bugreports.

Kurt

^ permalink raw reply

* Re: [PATCH v4 1/1] can: add pruss CAN driver.
From: Wolfgang Grandegger @ 2011-05-04 16:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: sachi-EvXpCiN+lbve9wHmmfpqLFaTQe2KTcn/,
	davinci-linux-open-source-VycZQUHpC/PFrsHnngEfi1aTQe2KTcn/,
	Subhasish Ghosh, nsekhar-l0cyMroinI0, open list,
	CAN NETWORK DRIVERS, Marc Kleine-Budde,
	Netdev-u79uwXL29TY76Z2rM5mHXA, m-watkins-l0cyMroinI0,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <201105041648.37199.arnd-r2nGTMty4D4@public.gmane.org>

On 05/04/2011 04:48 PM, Arnd Bergmann wrote:
> On Wednesday 04 May 2011, Wolfgang Grandegger wrote:
>> On 05/04/2011 03:11 PM, Arnd Bergmann wrote:
> 
>>> Wolfgang, I'm a bit worried by the API being split between sockets and sysfs.
>>> The problem is that once the sysfs API is established, users will start
>>> relying on it, and you can no longer migrate away from it, even when
>>> a later version of the Socket CAN also supports setting through a different
>>> interface. What is the current interface to set mail box IDs in software?
>>
>> Note that this CAN controller is *very* special. It cannot handle all
>> CAN id's due to a lack or resources. The PRUSS firmware is able to
>> manage just up to 8 different CAN identifiers out of the usual 4096
>> (12-bit) or even more for the extended CAN ids using 29 bits. 
> 
> So for other controllers, they can simply access every ID within
> the range (12 or 29 bits), but there is no need to configure?

Yes, 11 or 29 bits, to be correct.

> What are these IDs for? Do they identify a local port, a remote address,
> a connection, or something else?

It's a message identifier, which is used for bus arbitration and which
other CAN nodes can listen to. See also:

http://lxr.linux.no/#linux+v2.6.38/Documentation/networking/can.txt#L146
http://en.wikipedia.org/wiki/Controller_area_network

>> There is
>> no other CAN controller with such rather serious limitations and
>> therefore there exists also no appropriate interface. I think using
>> sysfs is OK for such device-specific parameters, at least for the time
>> being.
> 
> It sounds like it's not very scalable, especially since the implementation
> is done completely in firmware. Imagine a new firmware version suddenly
> supporting 256 ids instead of 8 -- you'd then have to create 256 sysfs
> files to be compatible if I understand you correctly.

Well, than an array of CAN identifiers per file would indeed be more
appropriate.

>>> How hard would it be to implement that feature in Socket CAN?
>>
>> CAN controllers usually provide some kind of hardware CAN id filtering,
>> but in a very hardware dependent way. A generic interface may be able to
>> handle the PRUSS restrictions as well. CAN devices are usually
>> configured through the netlink interface. e.g.
>>
>>   $ ip link set can0 up type can bitrate 125000
>>
>> and such a common interface would be netlink based as well.
> 
> Agreed.
> 
>>> Is that something that Subhasish or someone else could to as a prerequisite
>>> to merging the driver?
>>
>> Any ideas on how to handle hardware filtering in a generic way are
>> welcome. I will try to come up with a proposal sooner than later.
> 
> It sounds to me like the best solution would be change the firmware
> to lift that restriction and simply allow all IDs, in case it's not
> actually a hardware limitation (which sounds unlikely).

Yes, that would be best but they told us, that it's not possible with
the available hardware resources. Subhasish?

> If that's not possible, maybe it's possible to define a generic
> filtering interface using netlink, and then either do it completely
> in the kernel, or using the hardware support.

Well, I hesitate to implement an interface especially for such an exotic
device. Fine if it could be handled by a generic CAN hardware filter
interface, which is especially useful for normal CAN controllers.

Wolfgang.

^ permalink raw reply

* Re: [PATCHv3 2/2] tg3: Allow ethtool to enable/disable loopback.
From: Michał Mirosław @ 2011-05-04 16:05 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Mahesh Bandewar, Matt Carlson, David Miller, netdev, Michael Chan,
	Ben Hutchings, Tom Herbert
In-Reply-To: <20110504080408.0142f701@nehalam>

On Wed, May 04, 2011 at 08:04:08AM -0700, Stephen Hemminger wrote:
> On Wed, 4 May 2011 13:11:12 +0200
> Michał Mirosław <mirq-linux@rere.qmqm.pl> wrote:
> > On Tue, May 03, 2011 at 06:18:55PM -0700, Mahesh Bandewar wrote:
> > > This patch adds tg3_set_features() to handle loopback mode. Currently the
> > > capability is added for the devices which support internal MAC loopback mode.
> > > So when enabled, it enables internal-MAC loopback.
> > [...]
> > > diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> > > index 7c7c9a8..46de633 100644
> > > --- a/drivers/net/tg3.c
> > > +++ b/drivers/net/tg3.c
> > > @@ -6319,6 +6319,51 @@ static u32 tg3_fix_features(struct net_device *dev, u32 features)
> > >  	return features;
> > >  }
> > >  
> > > +static int tg3_set_features(struct net_device *dev, u32 features)
> > > +{
> > > +	struct tg3 *tp = netdev_priv(dev);
> > > +	u32 cur_mode = 0;
> > > +	int err = 0;
> > > +
> > > +	if (!netif_running(dev)) {
> > > +		err = -EAGAIN;
> > > +		goto sfeatures_out;
> > > +	}
> > netdev_update_features() is not designed to handle -EAGAIN from
> > ndo_set_features callback. It might be useful to implement this
> > handling, but in this case you should just return 0 and check
> > dev->features in ndo_open callback.
> 
> EAGAIN is a bad choice of error code anyway. It implies that the
> application should retry, which in this case is not true.
> 
> This error code is used for things like flow controlled sockets where
> the application should do a select/poll and the condition will clear
> when other side has read.
> 
> Why not use ENETDOWN instead?

The "application" here is netdev_update_features() here. Whatever the
error code, it would need to be handled there.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCHv3 2/2] tg3: Allow ethtool to enable/disable loopback.
From: Ben Hutchings @ 2011-05-04 16:08 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Stephen Hemminger, Mahesh Bandewar, Matt Carlson, David Miller,
	netdev, Michael Chan, Tom Herbert
In-Reply-To: <20110504160521.GA2035@rere.qmqm.pl>

On Wed, 2011-05-04 at 18:05 +0200, Michał Mirosław wrote:
> On Wed, May 04, 2011 at 08:04:08AM -0700, Stephen Hemminger wrote:
> > On Wed, 4 May 2011 13:11:12 +0200
> > Michał Mirosław <mirq-linux@rere.qmqm.pl> wrote:
> > > On Tue, May 03, 2011 at 06:18:55PM -0700, Mahesh Bandewar wrote:
> > > > This patch adds tg3_set_features() to handle loopback mode. Currently the
> > > > capability is added for the devices which support internal MAC loopback mode.
> > > > So when enabled, it enables internal-MAC loopback.
> > > [...]
> > > > diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> > > > index 7c7c9a8..46de633 100644
> > > > --- a/drivers/net/tg3.c
> > > > +++ b/drivers/net/tg3.c
> > > > @@ -6319,6 +6319,51 @@ static u32 tg3_fix_features(struct net_device *dev, u32 features)
> > > >  	return features;
> > > >  }
> > > >  
> > > > +static int tg3_set_features(struct net_device *dev, u32 features)
> > > > +{
> > > > +	struct tg3 *tp = netdev_priv(dev);
> > > > +	u32 cur_mode = 0;
> > > > +	int err = 0;
> > > > +
> > > > +	if (!netif_running(dev)) {
> > > > +		err = -EAGAIN;
> > > > +		goto sfeatures_out;
> > > > +	}
> > > netdev_update_features() is not designed to handle -EAGAIN from
> > > ndo_set_features callback. It might be useful to implement this
> > > handling, but in this case you should just return 0 and check
> > > dev->features in ndo_open callback.
> > 
> > EAGAIN is a bad choice of error code anyway. It implies that the
> > application should retry, which in this case is not true.
> > 
> > This error code is used for things like flow controlled sockets where
> > the application should do a select/poll and the condition will clear
> > when other side has read.
> > 
> > Why not use ENETDOWN instead?
> 
> The "application" here is netdev_update_features() here. Whatever the
> error code, it would need to be handled there.

The important point - which I think you already stated - is that when
the device is down this function is not expected to apply changes to the
hardware and therefore it should return 0.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH v4 1/1] can: add pruss CAN driver.
From: Wolfgang Grandegger @ 2011-05-04 16:09 UTC (permalink / raw)
  To: Arnd Bergmann, Subhasish Ghosh, linux-arm-kernel,
	Marc Kleine-Budde, sachi
In-Reply-To: <20110504155750.GC322@e-circ.dyndns.org>

Hi Kurt,

On 05/04/2011 05:57 PM, Kurt Van Dijck wrote:
>>
>>> How hard would it be to implement that feature in Socket CAN?
>>
>> CAN controllers usually provide some kind of hardware CAN id filtering,
>> but in a very hardware dependent way. A generic interface may be able to
>> handle the PRUSS restrictions as well. CAN devices are usually
>> configured through the netlink interface. e.g.
>>
>>   $ ip link set can0 up type can bitrate 125000
>>
>> and such a common interface would be netlink based as well.
> ack.
>>
>>> Is that something that Subhasish or someone else could to as a prerequisite
>>> to merging the driver?
>>
>> Any ideas on how to handle hardware filtering in a generic way are
>> welcome. I will try to come up with a proposal sooner than later.
> 
> When doing so, I'd vote for an unlimited(by software) list of hardware filters (id/mask).
> The hardware must abort when no more filters are available.

Sounds good and not even to complicated. For the SJA1000 we would just
allow to set the global mask.

> I think that when using hardware filters, knowing the actual device
> with it's amount of hardware filters is the least of your problems.
> Userspace applications that suddenly stop working properly due to
> hw filters (i.e. some traffic not coming in anymore) will be a major
> source of bugreports.

Well, hardware filtering will be off by default and must explicitly be
set by the user, like for the bitrate setting.

Wolfgang.


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox