Netdev List
* Re: linux-next: build failure after merge of the final tree (net tree related)
From: David Miller @ 2011-06-29  9:58 UTC (permalink / raw)
  To: sfr; +Cc: adobriyan, netdev, linux-next, linux-kernel
In-Reply-To: <20110629160901.2d3cd6da.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 29 Jun 2011 16:09:01 +1000

> Again, this has taken some time to fix ...
> 
> Dave, I think you have raised my expectations too far ;-)

Sorry.  I try to give the offender time to fix the build failure
himself, but as you saw this time that didn't work out.

To be honest I'm going to be quite reluctant to apply these kinds of
header removal patches in the future if this is how responsive the
patch submitter is going to be to build fallout. :-/

> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Wed, 29 Jun 2011 16:03:18 +1000
> Subject: [PATCH] net: include dma-mapping.h for dma_map_single etc
> 
> fixes these build errors:
 ...
> Caused by commit b7f080cfe223 ("net: remove mm.h inclusion from
> netdevice.h").
> 
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

Applied, thanks!


* Re: linux-next: build failure after merge of the final tree (net tree related)
From: David Miller @ 2011-06-29  9:56 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, adobriyan
In-Reply-To: <20110629160133.372e53ed.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 29 Jun 2011 16:01:33 +1000

> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Wed, 29 Jun 2011 15:52:00 +1000
> Subject: [PATCH] net: include io.h in for iounmap etc
> 
> fixes these build errors:
 ...
> Caused by commit b7f080cfe223 ("net: remove mm.h inclusion from
> netdevice.h").
> 
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

Applied.


* Re: [PATCH V7 4/4 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-06-29  9:13 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1306611267.5180.97.camel@localhost.localdomain>

On Sat, May 28, 2011 at 12:34:27PM -0700, Shirley Ma wrote:
> Hello Michael,
> 
> In order to use wait for completion in shutting down, seems to me
> another work thread is needed to call vhost_zerocopy_add_used,

Hmm I don't see vhost_zerocopy_add_used here.

> it seems
> too much work to address a minor issue here. Do we really need it?

Assuming you mean vhost_zerocopy_signal_used, here's how I would do it:
add a kref and a completion, and signal the completion in the kref_put
callback. When the backend is set, kref_get; on cleanup, kref_put and
then wait_for_completion_interruptible.
Where's the need for another thread coming from?
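
A minimal userspace sketch of the kref-plus-completion idea described above,
not the vhost implementation. The names zc_ref, zc_ref_init, zc_ref_get,
zc_ref_put and zc_ref_completed are hypothetical stand-ins for struct kref,
struct completion and their helpers, and taking one extra reference per
in-flight buffer is an assumption made for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel's kref and completion. */
struct zc_ref {
	atomic_int refcount;	/* models struct kref */
	atomic_bool done;	/* models struct completion */
};

/* Backend is set: the initial count of 1 models the kref_get taken then. */
static void zc_ref_init(struct zc_ref *r)
{
	atomic_init(&r->refcount, 1);
	atomic_init(&r->done, false);
}

/* Assumed to be taken for every buffer handed to the lower device. */
static void zc_ref_get(struct zc_ref *r)
{
	atomic_fetch_add(&r->refcount, 1);
}

/* Models kref_put: dropping the last reference signals the completion. */
static void zc_ref_put(struct zc_ref *r)
{
	int old = atomic_fetch_sub(&r->refcount, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
	if (old == 1)
		atomic_store(&r->done, true);
}

/* Cleanup drops the backend reference, then waits until this is true. */
static bool zc_ref_completed(struct zc_ref *r)
{
	return atomic_load(&r->done);
}
```

In kernel code the last step would sleep in
wait_for_completion_interruptible() rather than poll a flag, so no extra
thread is involved: whichever context drops the last reference signals the
waiter.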

If you like, post a patch with busywait + a FIXME comment,
and I can write up a patch on top.

(BTW, ideally the function that does the signalling should be
in core networking bits so that it's still around
even if the vhost module gets removed).

> Right now, the approach I am using is to ignore outstanding userspace
> buffers during shutting down if any, the device might DMAed some wrong
> data to the wire, do we really care?
> 
> Thanks
> Shirley

I think so, yes, guest is told that memory can be reused so
it might put the credit card number or whatever there :)

> ------------
> 
> This patch maintains the outstanding userspace buffers in the 
> sequence it is delivered to vhost. The outstanding userspace buffers 
> will be marked as done once the lower device buffers DMA has finished. 
> This is monitored through last reference of kfree_skb callback. Two
> buffer index are used for this purpose.
> 
> The vhost passes the userspace buffers info to lower device skb 
> through message control. Since there will be some done DMAs when
> entering vhost handle_tx. The worse case is all buffers in the vq are
> in pending/done status, so we need to notify guest to release DMA done 
> buffers first before get any new buffers from the vq.
> 
> Signed-off-by: Shirley <xma@us.ibm.com>
> ---
> 
>  drivers/vhost/net.c   |   46 +++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |   15 +++++++++++++++
>  3 files changed, 107 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 2f7c76a..e2eaba6 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,11 @@
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
>  
> +/* MAX number of TX used buffers for outstanding zerocopy */
> +#define VHOST_MAX_PEND 128
> +/* change it to 256 when small message size performance issue is addressed */
> +#define VHOST_GOODCOPY_LEN 2048
> +
>  enum {
>  	VHOST_NET_VQ_RX = 0,
>  	VHOST_NET_VQ_TX = 1,
> @@ -151,6 +156,10 @@ static void handle_tx(struct vhost_net *net)
>  	hdr_size = vq->vhost_hlen;
>  
>  	for (;;) {
> +		/* Release DMAs done buffers first */
> +		if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND)
> +			vhost_zerocopy_signal_used(vq, false);
> +
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
>  					 &out, &in,
> @@ -166,6 +175,12 @@ static void handle_tx(struct vhost_net *net)
>  				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
>  				break;
>  			}
> +			/* If more outstanding DMAs, queue the work */
> +			if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) {
> +				tx_poll_start(net, sock);
> +				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> +				break;
> +			}
>  			if (unlikely(vhost_enable_notify(vq))) {
>  				vhost_disable_notify(vq);
>  				continue;
> @@ -188,6 +203,26 @@ static void handle_tx(struct vhost_net *net)
>  			       iov_length(vq->hdr, s), hdr_size);
>  			break;
>  		}
> +		/* use msg_control to pass vhost zerocopy ubuf info to skb */
> +		if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> +			vq->heads[vq->upend_idx].id = head;
> +			if (len < VHOST_GOODCOPY_LEN)
> +				/* copy don't need to wait for DMA done */
> +				vq->heads[vq->upend_idx].len =
> +							VHOST_DMA_DONE_LEN;
> +			else {
> +				struct ubuf_info *ubuf = &vq->ubuf_info[head];
> +
> +				vq->heads[vq->upend_idx].len = len;
> +				ubuf->callback = vhost_zerocopy_callback;
> +				ubuf->arg = vq;
> +				ubuf->desc = vq->upend_idx;
> +				msg.msg_control = ubuf;
> +				msg.msg_controllen = sizeof(ubuf);
> +			}
> +			atomic_inc(&vq->refcnt);
> +			vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
> +		}
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>  		err = sock->ops->sendmsg(NULL, sock, &msg, len);
>  		if (unlikely(err < 0)) {
> @@ -198,12 +233,21 @@ static void handle_tx(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
>  		total_len += len;
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
>  			break;
>  		}
> +		/* if upend_idx is full, then wait for free more */
> +/*
> +		if (unlikely(vq->upend_idx == vq->done_idx)) {
> +			tx_poll_start(net, sock);
> +			set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> +				break;
> +		}
> +*/

What's this BTW?


>  	}
>  
>  	mutex_unlock(&vq->mutex);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 7aa4eea..b76e42a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> +	vq->upend_idx = 0;
> +	vq->done_idx = 0;
> +	atomic_set(&vq->refcnt, 0);
>  }
>  
>  static int vhost_worker(void *data)
> @@ -232,6 +235,8 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
>  					  GFP_KERNEL);
>  		dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
>  					    UIO_MAXIOV, GFP_KERNEL);
> +		dev->vqs[i].ubuf_info = kmalloc(sizeof *dev->vqs[i].ubuf_info *
> +					    UIO_MAXIOV, GFP_KERNEL);
>  
>  		if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
>  			!dev->vqs[i].heads)
> @@ -244,6 +249,7 @@ err_nomem:
>  		kfree(dev->vqs[i].indirect);
>  		kfree(dev->vqs[i].log);
>  		kfree(dev->vqs[i].heads);
> +		kfree(dev->vqs[i].ubuf_info);
>  	}
>  	return -ENOMEM;
>  }
> @@ -385,6 +391,30 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
>  	return 0;
>  }
>  
> +/* In case of DMA done not in order in lower device driver for some reason.
> + * upend_idx is used to track end of used idx, done_idx is used to track head
> + * of used idx. Once lower device DMA done contiguously, we will signal KVM
> + * guest used idx.
> + */
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
> +{
> +	int i, j = 0;
> +
> +	for (i = vq->done_idx; i != vq->upend_idx; i = ++i % UIO_MAXIOV) {
> +		if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
> +			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
> +			vhost_add_used_and_signal(vq->dev, vq,
> +						  vq->heads[i].id, 0);
> +			++j;
> +		} else
> +			break;
> +	}
> +	if (j) {
> +		vq->done_idx = i;
> +		atomic_sub(j, &vq->refcnt);
> +	}
> +}
> +
>  /* Caller should have device mutex */
>  void vhost_dev_cleanup(struct vhost_dev *dev)
>  {
> @@ -395,6 +425,9 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  			vhost_poll_stop(&dev->vqs[i].poll);
>  			vhost_poll_flush(&dev->vqs[i].poll);
>  		}
> +		/* Wait for all lower device DMAs done */
> +		if (atomic_read(&dev->vqs[i].refcnt))
> +			vhost_zerocopy_signal_used(&dev->vqs[i], true);
>  		if (dev->vqs[i].error_ctx)
>  			eventfd_ctx_put(dev->vqs[i].error_ctx);
>  		if (dev->vqs[i].error)
> @@ -603,6 +636,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>  
>  	mutex_lock(&vq->mutex);
>  
> +	/* clean up lower device outstanding DMAs, before setting ring */
> +	if (atomic_read(&vq->refcnt))
> +		vhost_zerocopy_signal_used(vq, true);
> +
>  	switch (ioctl) {
>  	case VHOST_SET_VRING_NUM:
>  		/* Resizing ring with an active backend?
> @@ -1416,3 +1453,13 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
>  		vq_err(vq, "Failed to enable notification at %p: %d\n",
>  		       &vq->used->flags, r);
>  }
> +
> +void vhost_zerocopy_callback(void *arg)
> +{
> +	struct ubuf_info *ubuf = (struct ubuf_info *)arg;
> +	struct vhost_virtqueue *vq;
> +
> +	vq = (struct vhost_virtqueue *)ubuf->arg;
> +	/* set len = 1 to mark this desc buffers done DMA */
> +	vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> +}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index b3363ae..0b11734 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -13,6 +13,11 @@
>  #include <linux/virtio_ring.h>
>  #include <asm/atomic.h>
>  
> +/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
> + * done */
> +#define VHOST_DMA_DONE_LEN	1
> +#define VHOST_DMA_CLEAR_LEN	0
> +
>  struct vhost_device;
>  
>  struct vhost_work;
> @@ -108,6 +113,14 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	/* vhost zerocopy support */
> +	atomic_t refcnt; /* num of outstanding zerocopy DMAs */
> +	/* last used idx for outstanding DMA zerocopy buffers */
> +	int upend_idx;
> +	/* first used idx for DMA done zerocopy buffers */
> +	int done_idx;
> +	/* an array of userspace buffers info */
> +	struct ubuf_info *ubuf_info;
>  };
>  
>  struct vhost_dev {
> @@ -154,6 +167,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
>  
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>  		    unsigned int log_num, u64 len);
> +void vhost_zerocopy_callback(void *arg);
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
>  
>  #define vq_err(vq, fmt, ...) do {                                  \
>  		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> 
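
For readers following the done_idx/upend_idx bookkeeping in the quoted
vhost_zerocopy_signal_used, here is a small userspace sketch of the same
in-order scan. It is an illustration under assumptions, not the patch itself:
struct zc_ring, zc_ring_reap and RING_SIZE are hypothetical names, and the
guest notification is reduced to a counter. Note that the quoted loop advances
with "i = ++i % UIO_MAXIOV", which modifies i twice in one expression; the
sketch uses (i + 1) % RING_SIZE instead.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 8	/* hypothetical; the patch uses UIO_MAXIOV */
#define DMA_DONE  1	/* mirrors VHOST_DMA_DONE_LEN */
#define DMA_CLEAR 0	/* mirrors VHOST_DMA_CLEAR_LEN */

struct zc_ring {
	int len[RING_SIZE];	/* per-slot state; other values mean DMA pending */
	int done_idx;		/* first slot not yet reported to the guest */
	int upend_idx;		/* one past the last slot handed to the device */
};

/*
 * Report slots strictly in order, stopping at the first slot whose DMA is
 * still pending unless force is set. Returns the number of slots reported.
 */
static int zc_ring_reap(struct zc_ring *r, bool force)
{
	int i, reaped = 0;

	assert(r->done_idx >= 0 && r->done_idx < RING_SIZE);
	for (i = r->done_idx; i != r->upend_idx; i = (i + 1) % RING_SIZE) {
		if (r->len[i] != DMA_DONE && !force)
			break;
		r->len[i] = DMA_CLEAR;
		reaped++;
	}
	r->done_idx = i;
	return reaped;
}
```

A slot that completes out of order stays marked done until every slot before
it has completed, so the guest only ever sees a contiguous run of used buffers.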


* Re: RFT: virtio_net: limit xmit polling
From: Michael S. Tsirkin @ 2011-06-29  8:42 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Krishna Kumar2, habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	lguest-uLR06cmDAlY/bJ5BZ2RsiQ, Shirley Ma,
	kvm-u79uwXL29TY76Z2rM5mHXA, Carsten Otte,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, Heiko Carstens,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	netdev-u79uwXL29TY76Z2rM5mHXA, Martin Schwidefsky,
	linux390-tA70FqPdS9bQT0dZR+AlfA, roprabhu-FYB4Gu1CFyUAvxtiuMwx3w
In-Reply-To: <201106281108.09285.tahm-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

On Tue, Jun 28, 2011 at 11:08:07AM -0500, Tom Lendacky wrote:
> On Sunday, June 19, 2011 05:27:00 AM Michael S. Tsirkin wrote:
> > OK, different people seem to test different trees.  In the hope to get
> > everyone on the same page, I created several variants of this patch so
> > they can be compared. Whoever's interested, please check out the
> > following, and tell me how these compare:
> > 
> > kernel:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
> > 
> > virtio-net-limit-xmit-polling/base - this is net-next baseline to test against
> > virtio-net-limit-xmit-polling/v0 - fixes checks on out of capacity
> > virtio-net-limit-xmit-polling/v1 - previous revision of the patch
> > 		this does xmit,free,xmit,2*free,free
> > virtio-net-limit-xmit-polling/v2 - new revision of the patch
> > 		this does free,xmit,2*free,free
> > 
> 
> Here's a summary of the results.  I've also attached an ODS format spreadsheet
> (30 KB in size) that might be easier to analyze and also has some pinned VM
> results data.  I broke the tests down into a local guest-to-guest scenario
> and a remote host-to-guest scenario.
> 
> Within the local guest-to-guest scenario I ran:
>   - TCP_RR tests using two different message sizes and four different
>     instance counts among 1 pair of VMs and 2 pairs of VMs.
>   - TCP_STREAM tests using four different message sizes and two different
>     instance counts among 1 pair of VMs and 2 pairs of VMs.
> 
> Within the remote host-to-guest scenario I ran:
>   - TCP_RR tests using two different message sizes and four different
>     instance counts to 1 VM and 4 VMs.
>   - TCP_STREAM and TCP_MAERTS tests using four different message sizes and
>     two different instance counts to 1 VM and 4 VMs.
> over a 10GbE link.

roprabhu, Tom,

Thanks very much for the testing. At first glance
one seems to see a significant performance gain in V0 here,
and a slightly less significant one in V2, with V1
being worse than base. But I'm afraid that's not the
whole story, and we'll need to work some more to
know what really goes on. Please see below.


Some comments on the results: I found out that V0, because of a mistake
on my part, was actually almost identical to base.
I pushed out virtio-net-limit-xmit-polling/v1a instead, which
actually does what I intended to check. However,
the fact that we get such a huge spread in Tom's results
most likely means that the noise factor is very large.


From my experience one way to get stable results is to
divide the throughput by the host CPU utilization
(measured by something like mpstat).
Sometimes throughput doesn't increase (e.g. guest-host)
but CPU utilization does decrease. So it's interesting.
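
As a rough sketch of that normalization (the function name
throughput_per_cpu and the units are assumptions for illustration, and any
figures fed to it would have to come from real netperf and mpstat runs):

```c
#include <assert.h>

/*
 * Throughput normalized by host CPU cost: Mbps divided by host CPU
 * utilization in percent (e.g. 100 minus the idle figure from mpstat).
 * A higher value means more throughput per unit of host CPU.
 */
static double throughput_per_cpu(double mbps, double host_cpu_pct)
{
	assert(host_cpu_pct > 0.0);
	return mbps / host_cpu_pct;
}
```

With this metric, two runs with identical throughput still compare
differently if one of them used less host CPU, which is the case described
above.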


Another issue is that we are trying to improve the latency
of a busy queue here. However STREAM/MAERTS tests ignore the latency
(more or less) while TCP_RR by default runs a single packet per queue.
Without arguing about whether these are practically interesting
workloads, these results are thus unlikely to be significantly affected
by the optimization in question.

What we are interested in, thus, is either TCP_RR with a -b flag
(configure with  --enable-burst) or multiple concurrent
TCP_RRs.



> *** Local Guest-to-Guest ***
> 
> Here's the local guest-to-guest summary for 1 VM pair doing TCP_RR with
> 256/256 request/response message size in transactions per second:
> 
> Instances	Base		V0		V1		V2
> 1		 8,151.56	 8,460.72	 8,439.16	 9,990.37
> 25		48,761.74	51,032.62	51,103.25	49,533.52
> 50		55,687.38	55,974.18	56,854.10	54,888.65
> 100		58,255.06	58,255.86	60,380.90	59,308.36
> 
> Here's the local guest-to-guest summary for 2 VM pairs doing TCP_RR with
> 256/256 request/response message size in transactions per second:
> 
> Instances	Base		V0		V1		V2
> 1		18,758.48	19,112.50	18,597.07	19,252.04
> 25		80,500.50	78,801.78	80,590.68	78,782.07
> 50		80,594.20	77,985.44	80,431.72	77,246.90
> 100		82,023.23	81,325.96	81,303.32	81,727.54
> 
> Here's the local guest-to-guest summary for 1 VM pair doing TCP_STREAM with
> 256, 1K, 4K and 16K message size in Mbps:
> 
> 256:
> Instances	Base		V0		V1		V2
> 1		   961.78	 1,115.92	   794.02	   740.37
> 4		 2,498.33	 2,541.82	 2,441.60	 2,308.26
> 
> 1K:
> 1		 3,476.61	 3,522.02	 2,170.86	 1,395.57
> 4		 6,344.30	 7,056.57	 7,275.16	 7,174.09
> 
> 4K:
> 1		 9,213.57	10,647.44	 9,883.42	 9,007.29
> 4		11,070.66	11,300.37	11,001.02	12,103.72
> 
> 16K:
> 1		12,065.94	 9,437.78	11,710.60	 6,989.93
> 4		12,755.28	13,050.78	12,518.06	13,227.33
> 
> Here's the local guest-to-guest summary for 2 VM pairs doing TCP_STREAM with
> 256, 1K, 4K and 16K message size in Mbps:
> 
> 256:
> Instances	Base		V0		V1		V2
> 1		 2,434.98	 2,403.23	 2,308.69	 2,261.35
> 4		 5,973.82	 5,729.48	 5,956.76	 5,831.86
> 
> 1K:
> 1		 5,305.99	 5,148.72	 4,960.67	 5,067.76
> 4		10,628.38	10,649.49	10,098.90	10,380.09
> 
> 4K:
> 1		11,577.03	10,710.33	11,700.53	10,304.09
> 4		14,580.66	14,881.38	14,551.17	15,053.02
> 
> 16K:
> 1		16,801.46	16,072.50	15,773.78	15,835.66
> 4		17,194.00	17,294.02	17,319.78	17,121.09
> 
> 
> *** Remote Host-to-Guest ***
> 
> Here's the remote host-to-guest summary for 1 VM doing TCP_RR with
> 256/256 request/response message size in transactions per second:
> 
> Instances	Base		V0		V1		V2
> 1		 9,732.99	10,307.98	10,529.82	 8,889.28
> 25		43,976.18	49,480.50	46,536.66	45,682.38
> 50		63,031.33	67,127.15	60,073.34	65,748.62
> 100		64,778.43	65,338.07	66,774.12	69,391.22
> 
> Here's the remote host-to-guest summary for 4 VMs doing TCP_RR with
> 256/256 request/response message size in transactions per second:
> 
> Instances	Base		V0		V1		V2
> 1		 39,270.42	 38,253.60	 39,353.10	 39,566.33
> 25		207,120.91	207,964.50	211,539.70	213,882.21
> 50		218,801.54	221,490.56	220,529.48	223,594.25
> 100		218,432.62	215,061.44	222,011.61	223,480.47
> 
> Here's the remote host-to-guest summary for 1 VM doing TCP_STREAM with
> 256, 1K, 4K and 16K message size in Mbps:
> 
> 256:
> Instances	Base		V0		V1		V2
> 1		2,274.74	2,220.38	2,245.26	2,212.30
> 4		5,689.66	5,953.86	5,984.80	5,827.94
> 
> 1K:
> 1		7,804.38	7,236.29	6,716.58	7,485.09
> 4		7,722.42	8,070.38	7,700.45	7,856.76
> 
> 4K:
> 1		8,976.14	9,026.77	9,147.32	9,095.58
> 4		7,532.25	7,410.80	7,683.81	7,524.94
> 
> 16K:
> 1		8,991.61	9,045.10	9,124.58	9,238.34
> 4		7,406.10	7,626.81	7,711.62	7,345.37
> 
> Here's the remote host-to-guest summary for 1 VM doing TCP_MAERTS with
> 256, 1K, 4K and 16K message size in Mbps:
> 
> 256:
> Instances	Base		V0		V1		V2
> 1		1,165.69	1,181.92	1,152.20	1,104.68
> 4		2,580.46	2,545.22	2,436.30	2,601.74
> 
> 1K:
> 1		2,393.34	2,457.22	2,128.86	2,258.92
> 4		7,152.57	7,606.60	8,004.64	7,576.85
> 
> 4K:
> 1		9,258.93	8,505.06	9,309.78	9,215.05
> 4		9,374.20	9,363.48	9,372.53	9,352.00
> 
> 16K:
> 1		9,244.70	9,287.72	9,298.60	9,322.28
> 4		9,380.02	9,347.50	9,377.46	9,372.98
> 
> Here's the remote host-to-guest summary for 4 VMs doing TCP_STREAM with
> 256, 1K, 4K and 16K message size in Mbps:
> 
> 256:
> Instances	Base		V0		V1		V2
> 1		9,392.37	9,390.74	9,395.58	9,392.46
> 4		9,394.24	9,394.46	9,395.42	9,394.05
> 
> 1K:
> 1		9,396.34	9,397.46	9,396.64	9,443.26
> 4		9,397.14	9,402.25	9,398.67	9,391.09
> 
> 4K:
> 1		9,397.16	9,398.07	9,397.30	9,396.33
> 4		9,395.64	9,400.25	9,397.54	9,397.75
> 
> 16K:
> 1		9,396.58	9,397.01	9,397.58	9,397.70
> 4		9,399.15	9,400.02	9,399.66	9,400.16
> 
> 
> Here's the remote host-to-guest summary for 4 VMs doing TCP_MAERTS with
> 256, 1K, 4K and 16K message size in Mbps:
> 
> 256:
> Instances	Base		V0		V1		V2
> 1		5,048.66	5,007.26	5,074.98	4,974.86
> 4		9,217.23	9,245.14	9,263.97	9,294.23
> 
> 1K:
> 1		9,378.32	9,387.12	9,386.21	9,361.55
> 4		9,384.42	9,384.02	9,385.50	9,385.55
> 
> 4K:
> 1		9,391.10	9,390.28	9,389.70	9,391.02
> 4		9,384.38	9,383.39	9,384.74	9,384.19
> 
> 16K:
> 1		9,390.77	9,389.62	9,388.07	9,388.19
> 4		9,381.86	9,382.37	9,385.54	9,383.88
> 
> 
> Tom
> 
> > There's also this on top:
> > virtio-net-limit-xmit-polling/v3 -> don't delay avail index update
> > I don't think it's important to test this one, yet
> > 
> > Userspace to use: event index work is not yet merged upstream
> > so the revision to use is still this:
> > git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git
> > virtio-net-event-idx-v3


* [patch] wanxl: remove a stray irq enable
From: Dan Carpenter @ 2011-06-29  6:29 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: open list:NETWORKING DRIVERS, kernel-janitors

This error path calls spin_unlock_irq() where we haven't disabled the
IRQs.  The comment says that this error path can never happen.

Signed-off-by: Dan Carpenter <error27@gmail.com>
---
This is a static checker fix and I don't have the hardware to test.

diff --git a/drivers/net/wan/wanxl.c b/drivers/net/wan/wanxl.c
index 8d7aa43..44b7071 100644
--- a/drivers/net/wan/wanxl.c
+++ b/drivers/net/wan/wanxl.c
@@ -284,7 +284,7 @@ static netdev_tx_t wanxl_xmit(struct sk_buff *skb, struct net_device *dev)
                 printk(KERN_DEBUG "%s: transmitter buffer full\n", dev->name);
 #endif
 		netif_stop_queue(dev);
-		spin_unlock_irq(&port->lock);
+		spin_unlock(&port->lock);
 		return NETDEV_TX_BUSY;       /* request packet to be queued */
 	}
 


* Re: linux-next: build failure after merge of the final tree (net tree related)
From: Stephen Rothwell @ 2011-06-29  6:09 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: David Miller, netdev, linux-next, linux-kernel
In-Reply-To: <20110627152927.943c34dc.sfr@canb.auug.org.au>

Hi all,

On Mon, 27 Jun 2011 15:29:27 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> On Thu, 23 Jun 2011 16:04:13 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:
> >
> > On Thu, Jun 23, 2011 at 8:29 AM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> > > After merging the final tree, today's linux-next build (powerpc
> > > allyesconfig) failed like this:
> > 
> > I build on all powerpc defconfigs, somehow this driver is not pinned
> > by any of them :^)
> 
> Well, yes, but an allmodconfig or allyesconfig build will get the error.
> 
> > > drivers/net/ll_temac_main.c:209:4: error: implicit declaration of function 'dma_unmap_single'
> 
> Is there a fix?

Again, this has taken some time to fix ...

Dave, I think you have raised my expectations too far ;-)

I have added the following patch for today:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 29 Jun 2011 16:03:18 +1000
Subject: [PATCH] net: include dma-mapping.h for dma_map_single etc

fixes these build errors:

drivers/net/ll_temac_main.c: In function 'temac_dma_bd_release':
drivers/net/ll_temac_main.c:209:4: error: implicit declaration of function 'dma_unmap_single'
drivers/net/ll_temac_main.c:215:3: error: implicit declaration of function 'dma_free_coherent'
drivers/net/ll_temac_main.c: In function 'temac_dma_bd_init':
drivers/net/ll_temac_main.c:243:2: error: implicit declaration of function 'dma_alloc_coherent'
drivers/net/ll_temac_main.c:243:14: warning: assignment makes pointer from integer without a cast
drivers/net/ll_temac_main.c:251:14: warning: assignment makes pointer from integer without a cast
drivers/net/ll_temac_main.c:280:3: error: implicit declaration of function 'dma_map_single'
drivers/net/ll_temac_main.c: In function 'temac_start_xmit_done':
drivers/net/ll_temac_main.c:628:22: warning: cast to pointer from integer of different size

Caused by commit b7f080cfe223 ("net: remove mm.h inclusion from
netdevice.h").

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 drivers/net/ll_temac_main.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ll_temac_main.c b/drivers/net/ll_temac_main.c
index e3f1925..728fe41 100644
--- a/drivers/net/ll_temac_main.c
+++ b/drivers/net/ll_temac_main.c
@@ -49,6 +49,7 @@
 #include <linux/ip.h>
 #include <linux/slab.h>
 #include <linux/interrupt.h>
+#include <linux/dma-mapping.h>
 
 #include "ll_temac.h"
 
-- 
1.7.5.4

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: linux-next: build failure after merge of the final tree (net tree related)
From: Stephen Rothwell @ 2011-06-29  6:01 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: linux-next, linux-kernel, Alexey Dobriyan
In-Reply-To: <20110627153522.8884b125.sfr@canb.auug.org.au>

Hi all,

On Mon, 27 Jun 2011 15:35:22 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> On Thu, 23 Jun 2011 15:25:35 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >
> > After merging the final tree, today's linux-next build (powerpc
> > allyesconfig) failed like this:
> > 
> > drivers/net/can/sja1000/sja1000_of_platform.c: In function 'sja1000_ofp_read_reg':
> > drivers/net/can/sja1000/sja1000_of_platform.c:61:2: error: implicit declaration of function 'in_8'
> > drivers/net/can/sja1000/sja1000_of_platform.c: In function 'sja1000_ofp_write_reg':
> > drivers/net/can/sja1000/sja1000_of_platform.c:67:2: error: implicit declaration of function 'out_8'
> > drivers/net/can/sja1000/sja1000_of_platform.c: In function 'sja1000_ofp_remove':
> > drivers/net/can/sja1000/sja1000_of_platform.c:81:2: error: implicit declaration of function 'iounmap'
> > drivers/net/can/sja1000/sja1000_of_platform.c: In function 'sja1000_ofp_probe':
> > drivers/net/can/sja1000/sja1000_of_platform.c:113:2: error: implicit declaration of function 'ioremap_nocache'
> > drivers/net/can/sja1000/sja1000_of_platform.c:113:7: warning: assignment makes pointer from integer without a cast
> > 
> > Since this file has not been changed recently, I suspect that this was
> > caused by commit b7f080cfe223 ("net: remove mm.h inclusion from
> > netdevice.h").
> > 
> > I have left the build broken for now since it is also broken for other
> > reasons.
> 
> I am still getting these, is there a fix pending?

I have to wonder why this has taken so long to be fixed ...

I have applied this patch for today:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 29 Jun 2011 15:52:00 +1000
Subject: [PATCH] net: include io.h in for iounmap etc

fixes these build errors:

drivers/net/can/sja1000/sja1000_of_platform.c: In function 'sja1000_ofp_read_reg':
drivers/net/can/sja1000/sja1000_of_platform.c:61:2: error: implicit declaration of function 'in_8'
drivers/net/can/sja1000/sja1000_of_platform.c: In function 'sja1000_ofp_write_reg':
drivers/net/can/sja1000/sja1000_of_platform.c:67:2: error: implicit declaration of function 'out_8'
drivers/net/can/sja1000/sja1000_of_platform.c: In function 'sja1000_ofp_remove':
drivers/net/can/sja1000/sja1000_of_platform.c:81:2: error: implicit declaration of function 'iounmap'
drivers/net/can/sja1000/sja1000_of_platform.c: In function 'sja1000_ofp_probe':
drivers/net/can/sja1000/sja1000_of_platform.c:113:2: error: implicit declaration of function 'ioremap_nocache'
drivers/net/can/sja1000/sja1000_of_platform.c:113:7: warning: assignment makes pointer from integer without a cast

Caused by commit b7f080cfe223 ("net: remove mm.h inclusion from
netdevice.h").

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 drivers/net/can/sja1000/sja1000_of_platform.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/can/sja1000/sja1000_of_platform.c b/drivers/net/can/sja1000/sja1000_of_platform.c
index 9793df6..cee6ba2 100644
--- a/drivers/net/can/sja1000/sja1000_of_platform.c
+++ b/drivers/net/can/sja1000/sja1000_of_platform.c
@@ -38,6 +38,7 @@
 #include <linux/interrupt.h>
 #include <linux/netdevice.h>
 #include <linux/delay.h>
+#include <linux/io.h>
 #include <linux/can/dev.h>
 
 #include <linux/of_platform.h>
-- 
1.7.5.4

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: [PATCH V7 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
From: Shirley Ma @ 2011-06-29  4:28 UTC (permalink / raw)
  To: David Miller; +Cc: mst, eric.dumazet, avi, arnd, netdev, kvm, linux-kernel
In-Reply-To: <20110628.163508.222352070705159851.davem@davemloft.net>

I have submitted this patchset from Version 1 to Version 7 already in
the past few months.
Here is the link to the patchset:

http://lists.openwall.net/netdev/2011/05/28/

I am working on V8 now.

Thanks
Shirley


On Tue, 2011-06-28 at 16:35 -0700, David Miller wrote:
> From: Shirley Ma <mashirle@us.ibm.com>
> Date: Tue, 28 Jun 2011 09:51:32 -0700
> 
> > On Mon, 2011-06-27 at 15:54 -0700, David Miller wrote:
> >> From: Shirley Ma <mashirle@us.ibm.com>
> >> Date: Mon, 27 Jun 2011 08:45:10 -0700
> >> 
> >> > To support skb zero-copy, a pointer is needed to add to skb share info.
> >> > Do you agree with this approach? If not, do you have any other
> >> > suggestions?
> >> 
> >> I really can't form an opinion unless I am shown the complete
> >> implementation, what this gives us in return, what the impact is, etc.
>  ..
> > You can see the overall CPU saved 50% w/i zero-copy.
> > 
> > The impact is every skb allocation consumed one more pointer in skb
> > share info, and a pointer check in skb release when last reference is
> > gone.
> > 
> > For skb clone, skb expand private head and skb copy, it still keeps copy
> > the buffers to kernel, so we can avoid user application, like tcpdump to
> > hold the user-space buffers too long.
> 
> Ok, now show me the "complete implementation".  I'm as interested in
> the code as I am in the numbers, that's why I asked for both. 



* Re: Bug#631945: linux-image-2.6.39-2-amd64: HFSC puts lots of "WARNING: at .../sch_hfsc.c:1427 hfsc_dequeue+0x155/0x28b [sch_hfsc]()" in dmesg
From: Ben Hutchings @ 2011-06-29  4:08 UTC (permalink / raw)
  To: Patrick McHardy, Michal Soltys; +Cc: 631945, Michal Pokrywka, netdev
In-Reply-To: <20110628131310.9941.39557.reportbug@wenus.h1.pl>


On Tue, 2011-06-28 at 15:13 +0200, Michal Pokrywka wrote:
> Package: linux-2.6
> Version: 2.6.39-2
> Severity: important
> Tags: upstream
> 
> 
> I had a working HFSC configuration on a gateway machine with kernel
> 2.6.26-21~bpo40+1. Now I switched to a new machine and kernel 2.6.39-2 with the
> same configuration script. Shaping works but after a couple of minutes
> hundreds of warnings per second fill dmesg, severely slowing the whole system
> (kernel log attached below).

This is the warning in hfsc_schedule_watchdog().  Patrick, Michal, it
looks like you made some changes in this area since 2.6.32 (the last
version reported as OK).  Can you see how this can happen?

Ben.

> My configuration on eth0 involves SFQ queues connected to HFSC leaves, and one
> RED queue connected to one HFSC leaf. A similar configuration runs on ifb0.
> All traffic incoming to eth0 is directed to ifb0 using egress action mirred.
> There is also some hashtable filtering based on IP addresses.
> 
> When I erased all queues and, for testing purposes, put only one HFSC queue on
> eth0 with one default class, I noticed that the problem no longer exists.
> So I connected one SFQ queue to this single HFSC class and the problem still
> does not occur.
> 
> Then I did the same thing on ifb0 and redirected traffic from eth0 egress
> and noticed still no warnings in dmesg.
> 
> Connecting RED to HFSC leaf also produces no warnings.
> 
> Switching in udev eth0->eth1 and eth1->eth0 (to make HFSC work with different
> ethernet driver: forcedeth->tg3) does not solve the problem.
> 
> Maybe someone can suggest what else could be done to trace the problem.
> 
> Here is the HFSC script that triggers the bug (without the repetitive parts).
> It is a bit complicated, but it works with different users on three different
> machines (different ethernet devices) with the 2.6.32 kernel.
> It should work on every machine which has the 192.168.20.0/24 network on eth0
> and internet access on eth1.
> 
> -- CUT HERE (hfsc.sh begins) --
> modprobe ifb numifbs=1
> ### DOWNLOAD ###
> tc qdisc del dev eth0 root 2>&1 >/dev/null
> tc qdisc add dev eth0 root handle 1: hfsc default 4
> tc class add dev eth0 parent 1:  classid 1:1 hfsc ls rate 100000kbit ul rate 100000kbit # main eth0 download
> 
> tc class add dev eth0 parent 1:1 classid 1:2 hfsc ls rate 3891kbit ul rate 3891kbit # main internet download
> tc class add dev eth0 parent 1:2 classid 1:4 hfsc ls rate 50kbit ul rate 100kbit # bulk class
> tc class add dev eth0 parent 1:2 classid 1:5 hfsc ls rate 200kbit ul rate 1000kbit # wireless class
> tc qdisc add dev eth0 parent 1:4 handle 4: sfq perturb 5 quantum 1920 # bulk qdisc
> tc qdisc add dev eth0 parent 1:5 handle 5: sfq perturb 5 quantum 1920 # wifi qdisc
> 
> tc class add dev eth0 parent 1:1 classid 1:3 hfsc ls rate 94904kbit ul rate 94904kbit # local class
> tc qdisc add dev eth0 parent 1:3 handle 3: red min 75000 max 225000 probability 0.02 limit 675000 avpkt 1000 burst 120 bandwidth 94904
> 
> tc filter add dev eth0 parent 1: protocol ip prio 3 u32 match ip src 192.168.20.0/24 match ip dst 62.***.***.88/29 flowid 1:3
> tc filter add dev eth0 parent 1: protocol ip prio 3 u32 match ip src 192.168.20.0/24 match ip dst 62.***.***.96/28 flowid 1:3
> tc filter add dev eth0 parent 1: protocol ip prio 3 u32 match ip dst 192.168.20.0/24 match ip src 62.***.***.88/29 flowid 1:3
> tc filter add dev eth0 parent 1: protocol ip prio 3 u32 match ip dst 192.168.20.0/24 match ip src 62.***.***.96/28 flowid 1:3
> tc filter add dev eth0 parent 1: protocol ip prio 3 u32 match ip src 192.168.20.91 match ip dst 192.168.20.0/24 flowid 1:3
> 
> tc filter add dev eth0 parent 1: protocol ip prio 3 u32 match ip dst 10.0.0.0/8 flowid 1:5
> 
> tc filter add dev eth0 parent 1: prio 5 handle 100: protocol ip u32 divisor 256
> tc filter add dev eth0 protocol ip parent 1: prio 5 u32 ht 800:: match ip dst 192.168.20.0/24 hashkey mask 0x000000ff at 16 link 100:
> 
> # example_user / 192.168.20.16
> tc class add dev eth0 parent 1:2    classid 1:0016  hfsc ls m1 400kbit d 4s m2 100kbit ul rate 3696kbit
> tc class add dev eth0 parent 1:0016 classid 1:1016  hfsc rt rate 336bit ls m1 400kbit d 4s m2 100kbit ul rate 3696kbit # ack
> tc qdisc add dev eth0 parent 1:1016 handle    1016: sfq perturb 10 quantum 1920
> tc class add dev eth0 parent 1:0016 classid 1:2016  hfsc rt rate 94566bit ls m1 400kbit d 4s m2 100kbit ul rate 3696kbit # data
> tc qdisc add dev eth0 parent 1:2016 handle    2016: sfq perturb 10 quantum 1920
> 
> tc filter add dev eth0 parent 1: protocol ip prio  5 u32 ht 100:10 match ip protocol 6 0xff match u8 0x10 0xff at nexthdr+33 match u16 0x0000 0xffc0 at 2 flowid 1:1016 # ack
> tc filter add dev eth0 parent 1: protocol ip prio 10 u32 ht 100:10 match u32 0 0 flowid 1:2016 # data
> 
> (... more users ...)
> 
> ### UPLOAD ###
> ifconfig ifb0 up
> 
> tc qdisc del dev ifb0 root 2>&1 >/dev/null
> tc qdisc add dev ifb0 root handle 1: hfsc default 4
> tc class add dev ifb0 parent 1:  classid 1:1 hfsc ls rate 100000kbit ul rate 100000kbit # main eth0 upload
> tc class add dev ifb0 parent 1:1 classid 1:2 hfsc ls rate 460kbit ul rate 460kbit # main internet upload
> 
> tc class add dev ifb0 parent 1:2 classid 1:4 hfsc ls rate 50kbit ul rate 10kbit # bulk class
> tc class add dev ifb0 parent 1:2 classid 1:5 hfsc ls rate 200kbit ul rate 100kbit # wireless class
> tc qdisc add dev ifb0 parent 1:4 handle 4: sfq perturb 5 quantum 1920 # bulk qdisc
> tc qdisc add dev ifb0 parent 1:5 handle 5: sfq perturb 5 quantum 1920 # wifi qdisc
> 
> tc class add dev ifb0 parent 1:1 classid 1:3 hfsc ls rate 94904kbit ul rate 94904kbit # local class
> tc qdisc add dev ifb0 parent 1:3 handle 3: red min 75000 max 225000 probability 0.02 limit 675000 avpkt 1000 burst 120 bandwidth 94904
> 
> tc filter add dev ifb0 parent 1: protocol ip prio 2 u32 match ip src 192.168.20.0/24 match ip dst 62.***.***.88/29 flowid 1:3
> tc filter add dev ifb0 parent 1: protocol ip prio 2 u32 match ip src 192.168.20.0/24 match ip dst 62.***.***.96/28 flowid 1:3
> tc filter add dev ifb0 parent 1: protocol ip prio 2 u32 match ip dst 192.168.20.0/24 match ip src 62.***.***.88/29 flowid 1:3
> tc filter add dev ifb0 parent 1: protocol ip prio 2 u32 match ip dst 192.168.20.0/24 match ip src 62.***.***.96/28 flowid 1:3
> tc filter add dev ifb0 parent 1: protocol ip prio 2 u32 match ip dst 192.168.20.91 match ip src 192.168.20.0/24 flowid 1:3
> tc filter add dev ifb0 parent 1: protocol ip prio 2 u32 match ip src 192.168.20.91 match ip dst 192.168.20.0/24 flowid 1:3
> 
> tc filter add dev ifb0 parent 1: protocol ip prio 3 u32 match ip src 10.0.0.0/8 flowid 1:5
> tc filter add dev ifb0 parent 1: prio 5 handle 100: protocol ip u32 divisor 256
> 
> # example_user / 192.168.20.16
> tc class add dev ifb0 parent 1:2    classid 1:0016  hfsc ls m1 400kbit d 4s m2 100kbit ul rate 437kbit
> tc class add dev ifb0 parent 1:0016 classid 1:1016  hfsc rt rate 2847bit ls m1 400kbit d 4s m2 100kbit ul rate 437kbit # ack
> tc qdisc add dev ifb0 parent 1:1016 handle    1016: sfq perturb 10 quantum 1920
> tc class add dev ifb0 parent 1:0016 classid 1:2016  hfsc rt rate 8372bit ls m1 400kbit d 4s m2 100kbit ul rate 437kbit # data
> tc qdisc add dev ifb0 parent 1:2016 handle    2016: sfq perturb 10 quantum 1920
> 
> tc filter add dev ifb0 parent 1: protocol ip prio  5 u32 ht 100:10 match ip protocol 6 0xff match u8 0x10 0xff at nexthdr+33 match u16 0x0000 0xffc0 at 2 flowid 1:1016 # ack
> tc filter add dev ifb0 parent 1: protocol ip prio 10 u32 ht 100:10 match u32 0 0 flowid 1:2016 # data
> 
> (... more users ...)
> 
> tc qdisc del dev eth0 ingress 2>&1 >/dev/null
> tc qdisc add dev eth0 ingress
> 
> tc filter add dev eth0 parent ffff: protocol ip prio 40 u32 match ip src 192.168.20.0/24 flowid 1: action mirred egress redirect dev ifb0
> tc filter add dev eth0 parent ffff: protocol ip prio 40 u32 match ip src 10.0.0.0/8  flowid 1: action mirred egress redirect dev ifb0
> 
> # Local to hashtable
> tc filter add dev ifb0 protocol ip parent 1: prio 40 u32 ht 800:: match ip src 192.168.20.0/24 hashkey mask 0x000000ff at 12 link 100:
> 
> -- CUT HERE (hfsc.sh ends) --
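
For readers following the script above: the "hashkey mask 0x000000ff at 16"
filter, combined with "divisor 256", appears to pick a bucket of table 100:
from the last octet of the destination address, which would explain why the
rules for 192.168.20.16 are placed in "ht 100:10". Below is a minimal
userspace C sketch of that arithmetic. The helper names ipv4() and
u32_bucket() are hypothetical, and this is a simplified model under those
assumptions, not the kernel's cls_u32 code.

```c
#include <assert.h>
#include <stdint.h>

/* Build a host-order IPv4 address from its four octets. */
static uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
	       ((uint32_t)c << 8) | (uint32_t)d;
}

/*
 * Simplified model of "hashkey mask <mask>" with "divisor <divisor>":
 * keep the masked bits of the address and reduce them to a bucket index.
 * Illustration only; the real classifier also shifts by the mask position.
 */
static unsigned int u32_bucket(uint32_t addr, uint32_t mask,
			       unsigned int divisor)
{
	assert(divisor != 0);
	return (addr & mask) % divisor;
}
```

With these helpers, u32_bucket(ipv4(192, 168, 20, 16), 0x000000ff, 256)
evaluates to 0x10, the bucket named in "ht 100:10".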
> 
> -- Package-specific info:
> ** Version:
> Linux version 2.6.39-2-amd64 (Debian 2.6.39-2) (ben@decadent.org.uk) (gcc version 4.4.6 (Debian 4.4.6-3) ) #1 SMP Wed Jun 8 11:01:04 UTC 2011
> 
> ** Command line:
> BOOT_IMAGE=/boot/vmlinuz-2.6.39-2-amd64 root=UUID=b9cc0f29-98ad-4034-ba84-167769b10cd6 ro quiet
> 
> ** Tainted: W (512)
>  * Taint on warning.
> 
> ** Kernel log:
> [ 4522.772401] ------------[ cut here ]------------
> [ 4522.772405] WARNING: at /build/buildd-linux-2.6_2.6.39-2-amd64-kuqdRa/linux-2.6-2.6.39/debian/build/source_amd64_none/net/sched/sch_hfsc.c:1427 hfsc_dequeue+0x155/0x28b [sch_hfsc]()
> [ 4522.772410] Hardware name: S2865
> [ 4522.772412] Modules linked in: act_mirred sch_ingress cls_u32 sch_red sch_sfq sch_hfsc ifb xt_tcpudp xt_state iptable_filter iptable_nat ip_tables x_tables nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack xfs dme1737 hwmon_vid loop snd_pcm snd_timer amd64_edac_mod snd soundcore snd_page_alloc k8temp edac_core serio_raw edac_mce_amd pcspkr evdev joydev parport_pc parport i2c_nforce2 processor button i2c_core ext4 mbcache jbd2 crc16 raid1 md_mod sg usbhid hid sr_mod sd_mod crc_t10dif cdrom ohci_hcd ata_generic pata_amd sata_nv libata ehci_hcd tg3 scsi_mod thermal usbcore floppy fan thermal_sys libphy forcedeth [last unloaded: scsi_wait_scan]
> [ 4522.772453] Pid: 3, comm: ksoftirqd/0 Tainted: G        W    2.6.39-2-amd64 #1
> [ 4522.772455] Call Trace:
> [ 4522.772459]  [<ffffffff810458b4>] ? warn_slowpath_common+0x78/0x8c
> [ 4522.772463]  [<ffffffffa039ab3f>] ? hfsc_dequeue+0x155/0x28b [sch_hfsc]
> [ 4522.772467]  [<ffffffff81291962>] ? __qdisc_run+0x93/0x11a
> [ 4522.772471]  [<ffffffff8127696f>] ? net_tx_action+0x10f/0x186
> [ 4522.772475]  [<ffffffff8104b4cf>] ? __do_softirq+0xc3/0x19e
> [ 4522.772479]  [<ffffffff8104b626>] ? run_ksoftirqd+0x7c/0x122
> [ 4522.772483]  [<ffffffff8104b5aa>] ? __do_softirq+0x19e/0x19e
> [ 4522.772486]  [<ffffffff8104b5aa>] ? __do_softirq+0x19e/0x19e
> [ 4522.772490]  [<ffffffff8105ef05>] ? kthread+0x7a/0x82
> [ 4522.772494]  [<ffffffff81339ee4>] ? kernel_thread_helper+0x4/0x10
> [ 4522.772498]  [<ffffffff8105ee8b>] ? kthread_worker_fn+0x147/0x147
> [ 4522.772502]  [<ffffffff81339ee0>] ? gs_change+0x13/0x13
> [ 4522.772505] ---[ end trace 9a0810cc4dbd421e ]---
[...]

-- 
Ben Hutchings
If at first you don't succeed, you're doing about average.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply

* Re: [PATCH V7 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
From: David Miller @ 2011-06-28 23:35 UTC (permalink / raw)
  To: mashirle; +Cc: mst, eric.dumazet, avi, arnd, netdev, kvm, linux-kernel
In-Reply-To: <1309279892.3559.6.camel@localhost.localdomain>

From: Shirley Ma <mashirle@us.ibm.com>
Date: Tue, 28 Jun 2011 09:51:32 -0700

> On Mon, 2011-06-27 at 15:54 -0700, David Miller wrote:
>> From: Shirley Ma <mashirle@us.ibm.com>
>> Date: Mon, 27 Jun 2011 08:45:10 -0700
>> 
>> > To support skb zero-copy, a pointer is needed to add to skb share
>> info.
>> > Do you agree with this approach? If not, do you have any other
>> > suggestions?
>> 
>> I really can't form an opinion unless I am shown the complete
>> implementation, what this give us in return, what the impact is, etc. 
 ..
> You can see the overall CPU saved 50% w/i zero-copy.
> 
> The impact is every skb allocation consumed one more pointer in skb
> share info, and a pointer check in skb release when last reference is
> gone.
> 
> For skb clone, skb expand private head and skb copy, it still keeps copy
> the buffers to kernel, so we can avoid user application, like tcpdump to
> hold the user-space buffers too long.

Ok, now show me the "complete implementation".  I'm as interested in
the code as I am in the numbers, that's why I asked for both.

^ permalink raw reply

* RE: [PATCH] [v3][net][bna] Fix call trace when interrupts are disabled while sleeping function kzalloc is called
From: Rasesh Mody @ 2011-06-28 23:32 UTC (permalink / raw)
  To: Shyam Iyer, netdev@vger.kernel.org
  Cc: Debashis Dutt, Jing Huang, davem@davemloft.net, Shyam Iyer
In-Reply-To: <1309287485-6077-1-git-send-email-shyam_iyer@dell.com>

>From: Shyam Iyer [mailto:shyam.iyer.t@gmail.com]
>Sent: Tuesday, June 28, 2011 11:58 AM
>
>request_threaded_irq will call kzalloc, which can sleep. Initializing the
>flags variable outside of spin_lock_irqsave/restore in
>bnad_mbox_irq_alloc will avoid call traces like below.
>
>@@ -1125,18 +1125,17 @@ bnad_mbox_irq_alloc(struct bnad *bnad,
> 	if (bnad->cfg_flags & BNAD_CF_MSIX) {
> 		irq_handler = (irq_handler_t)bnad_msix_mbox_handler;
> 		irq = bnad->msix_table[bnad->msix_num - 1].vector;
>-		flags = 0;
> 		intr_info->intr_type = BNA_INTR_T_MSIX;
> 		intr_info->idl[0].vector = bnad->msix_num - 1;
> 	} else {
> 		irq_handler = (irq_handler_t)bnad_isr;
> 		irq = bnad->pcidev->irq;
>-		flags = IRQF_SHARED;
>+		irq_flags = IRQF_SHARED;
> 		intr_info->intr_type = BNA_INTR_T_INTX;
> 		/* intr_info->idl.vector = 0 ? */
> 	}
> 	spin_unlock_irqrestore(&bnad->bna_lock, flags);
>-
>+	flags = irq_flags;
> 	sprintf(bnad->mbox_irq_name, "%s", BNAD_NAME);
>
> 	/*

Patch looks fine.

Acked-by: Rasesh Mody <rmody@brocade.com>
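
For readers following the patch: the underlying problem seems to be that one
variable served both as the saved interrupt state for spin_lock_irqsave()
and as the request_irq() flags. A minimal userspace C sketch of that failure
mode follows. The fake_* helpers, cpu_irq_state, and both alloc_* functions
are hypothetical stand-ins, and the initial value of irq_flags is an
assumption because its declaration is not shown in the quoted hunk.

```c
#include <assert.h>

#define IRQF_SHARED 0x00000080UL	/* same value as in linux/interrupt.h */

/* Hypothetical model of the CPU interrupt state: 1 = enabled, 0 = disabled. */
static unsigned long cpu_irq_state = 1;

static void fake_lock_irqsave(unsigned long *saved)
{
	*saved = cpu_irq_state;		/* remember the previous state */
	cpu_irq_state = 0;		/* interrupts off while locked */
}

static void fake_unlock_irqrestore(unsigned long saved)
{
	cpu_irq_state = saved;		/* restore whatever was handed in */
}

/* Old shape: "flags" is overwritten while it still holds the saved state. */
static unsigned long alloc_buggy(int msix)
{
	unsigned long flags;

	cpu_irq_state = 1;
	fake_lock_irqsave(&flags);
	flags = msix ? 0 : IRQF_SHARED;	/* clobbers the saved state */
	fake_unlock_irqrestore(flags);
	return cpu_irq_state;		/* state seen by the code that may sleep */
}

/* Patched shape: the IRQ flags live in their own variable. */
static unsigned long alloc_fixed(int msix)
{
	unsigned long flags, irq_flags = 0;

	cpu_irq_state = 1;
	fake_lock_irqsave(&flags);
	assert(cpu_irq_state == 0);	/* critical section runs with IRQs off */
	if (!msix)
		irq_flags = IRQF_SHARED;
	fake_unlock_irqrestore(flags);
	(void)irq_flags;		/* would now be passed to request_irq() */
	return cpu_irq_state;
}
```

In this model alloc_buggy(1) leaves interrupts disabled after the unlock,
which matches the MSI-X path sleeping in kzalloc with interrupts off, while
alloc_fixed() restores the original state in both paths.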

^ permalink raw reply

* [PATCH v2] bridge: ignore pause & bonding frames
From: David Lamparter @ 2011-06-28 22:10 UTC (permalink / raw)
  To: netdev; +Cc: Nick Carter, David Lamparter, Stephen Hemminger, davem
In-Reply-To: <1309298599-11266-1-git-send-email-equinox@diac24.net>

this patch adds bonding frames to the special treatment party and has
both pause and bonding frames delivered on the underlying bridge member
device. we thus become 802.1AX section 5.2.1 compliant which quite
clearly has the link aggregation directly on top of the MAC layer, below
any bridging.

the matching is changed to mimic hardware switches, which match
802.3x/pause and 802.3ad/lacp by the hardware address (keep in mind the
existence of LLC/SNAP headers).

Signed-off-by: David Lamparter <equinox@diac24.net>
---
[v2:] ... and obviously i forgot the "*pskb = skb;" - it's getting late.

note: this should actually fix some issues i was having with bridging
screwing up my bonding. i've resolved those by "brouting" LACP frames
in ebtables... (which this patch will also result in)

compile-tested only

 net/bridge/br_input.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f3ac1e8..3ef2861 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -160,9 +160,14 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
 	p = br_port_get_rcu(skb->dev);
 
 	if (unlikely(is_link_local(dest))) {
-		/* Pause frames shouldn't be passed up by driver anyway */
-		if (skb->protocol == htons(ETH_P_PAUSE))
-			goto drop;
+		/* Pause/3x frames shouldn't be passed up by driver anyway
+		 * LACP/3ad can never be allowed to cross even a dumb hub
+		 *
+		 * both are usually matched by group address */
+		if (dest[5] == 0x01 || dest[5] == 0x02) {
+			*pskb = skb;
+			return RX_HANDLER_PASS;
+		}
 
 		/* If STP is turned off, then forward */
 		if (p->br->stp_enabled == BR_NO_STP && dest[5] == 0)
-- 
1.7.5.3


^ permalink raw reply related

* [PATCH 2/2] bridge: pass through 802.1X & co. in 'dumb' mode
From: David Lamparter @ 2011-06-28 22:03 UTC (permalink / raw)
  To: netdev; +Cc: Nick Carter, David Lamparter, Stephen Hemminger, davem
In-Reply-To: <1309298599-11266-1-git-send-email-equinox@diac24.net>

when operating without STP, we're a dumb switch and should be able to
forward ethernet management protocols like 802.1X, LLDP and GVRP.

if this is not desired, it can be enacted as local policy through
ebtables.

if we're in STP mode we basically claim to be an intelligent switch and
should implement these protocols properly (in userspace).

Signed-off-by: David Lamparter <equinox@diac24.net>
---
compile-tested only

 net/bridge/br_input.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index c873db5..4cee1b5 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -167,16 +167,19 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
 		if (dest[5] == 0x01 || dest[5] == 0x02)
 			return RX_HANDLER_PASS;
 
-		/* If STP is turned off, then forward */
-		if (p->br->stp_enabled == BR_NO_STP && dest[5] == 0)
+		/* If STP is turned off, we're a dumb switch and therefore
+		 * forward the remaining link-locals. (STP, 802.1X, LLDP,
+		 * GVRP & co.) */
+		if (p->br->stp_enabled == BR_NO_STP)
 			goto forward;
 
 		if (NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, skb, skb->dev,
 			    NULL, br_handle_local_finish)) {
 			return RX_HANDLER_CONSUMED; /* consumed by filter */
 		} else {
+			/* stay on physdev for userspace implementation */
 			*pskb = skb;
-			return RX_HANDLER_PASS;	/* continue processing */
+			return RX_HANDLER_PASS;
 		}
 	}
 
-- 
1.7.5.3
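
For readers following the series: with both patches applied, the handling
of a link-local destination looks like a three-step decision. A minimal
userspace C sketch of that ordering follows. The enum and classify() are
hypothetical names, and the netfilter hook in the STP-enabled branch is
collapsed into a single "local" verdict, so this is a model of the control
flow rather than the bridge code itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum verdict {
	VERDICT_BRIDGE,		/* not link-local: normal bridging path */
	VERDICT_PASS,		/* pause/LACP: stays on the member device */
	VERDICT_FORWARD,	/* STP off: forwarded like a dumb switch */
	VERDICT_LOCAL		/* STP on: left for local (userspace) handling */
};

static enum verdict classify(const uint8_t dest[6], int stp_enabled)
{
	static const uint8_t prefix[5] = { 0x01, 0x80, 0xc2, 0x00, 0x00 };

	assert(dest != NULL);

	/* Anything outside 01:80:C2:00:00:0X is ordinary traffic. */
	if (memcmp(dest, prefix, sizeof(prefix)) != 0 || (dest[5] & 0xf0))
		return VERDICT_BRIDGE;

	/* Patch 1/2: pause and bonding frames never cross the bridge. */
	if (dest[5] == 0x01 || dest[5] == 0x02)
		return VERDICT_PASS;

	/* Patch 2/2: without STP, forward the remaining link-locals. */
	if (!stp_enabled)
		return VERDICT_FORWARD;

	return VERDICT_LOCAL;
}
```

For example, an LLDP frame sent to 01:80:C2:00:00:0E is forwarded when STP
is off and kept local when STP is on, while a LACP frame gets the same
verdict in both modes.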


^ permalink raw reply related

* [PATCH 1/2] bridge: ignore pause & bonding frames
From: David Lamparter @ 2011-06-28 22:03 UTC (permalink / raw)
  To: netdev; +Cc: Nick Carter, David Lamparter, Stephen Hemminger, davem
In-Reply-To: <20110628214637.GE2121496@jupiter.n2.diac24.net>

this patch adds bonding frames to the special treatment party and has
both pause and bonding frames delivered on the underlying bridge member
device. we thus become 802.1AX section 5.2.1 compliant which quite
clearly has the link aggregation directly on top of the MAC layer, below
any bridging.

the matching is changed to mimic hardware switches, which match
802.3x/pause and 802.3ad/lacp by the hardware address (keep in mind the
existence of LLC/SNAP headers).

Signed-off-by: David Lamparter <equinox@diac24.net>
---
note: this should actually fix some issues i was having with bridging
screwing up my bonding. i've resolved those by "brouting" LACP frames
in ebtables... (which this patch will also result in)

compile-tested only

 net/bridge/br_input.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f3ac1e8..c873db5 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -160,9 +160,12 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
 	p = br_port_get_rcu(skb->dev);
 
 	if (unlikely(is_link_local(dest))) {
-		/* Pause frames shouldn't be passed up by driver anyway */
-		if (skb->protocol == htons(ETH_P_PAUSE))
-			goto drop;
+		/* Pause/3x frames shouldn't be passed up by driver anyway
+		 * LACP/3ad can never be allowed to cross even a dumb hub
+		 *
+		 * both are usually matched by group address */
+		if (dest[5] == 0x01 || dest[5] == 0x02)
+			return RX_HANDLER_PASS;
 
 		/* If STP is turned off, then forward */
 		if (p->br->stp_enabled == BR_NO_STP && dest[5] == 0)
-- 
1.7.5.3


^ permalink raw reply related

* [PATCH RFC] r8169: minimal rtl8111e-vl support
From: Francois Romieu @ 2011-06-28 21:44 UTC (permalink / raw)
  To: Hayes Wang; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 9909 bytes --]

Mostly bits from version 8.023.00 of Realtek's own r8168 driver. It applies
on top of davem's net-next branch + a small, uninteresting attached patch.

I have briefly tested it w/o the new-format rtl8168e-3_0.0.1 firmware
and the results are fairly encouraging.

The code is still incomplete :
- WoL needs some care. No difficulty here.
- rtl8168e_2_hw_phy_config imho deserves a few comments similar to those in
  rtl8168e_1_hw_phy_config. Hayes, can you take care of it ?
- I have excluded a set of completely unidentified registers / bits
  operations, for instance:
  - Config5
      BIT_0
  - Config2
      BIT_5
      BIT_7
  - TxConfig
      BIT_7
  - 0x1a
      BIT_2
      BIT_3
  - 0x1b
      0xf8 / 0x07
  - 0xb0,
      0xee480010
  - 0xd0
      BIT_6
  - 0xd3
      BIT_7
  - 0xf2
      BIT_6
  Either they are not needed or someone will have to name them adequately.
  Hayes ?
- Short packets apparently need to be checksummed by the driver (?) but
  the work-around seems strange. It is not clear to me what Realtek's
  driver is trying to achieve in hard_start_xmit. Hayes, can you elaborate ?

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/r8169.c |  254 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 254 insertions(+), 0 deletions(-)

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index f5b8d52..1b38a0f 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -41,6 +41,7 @@
 #define FIRMWARE_8168D_2	"rtl_nic/rtl8168d-2.fw"
 #define FIRMWARE_8168E_1	"rtl_nic/rtl8168e-1.fw"
 #define FIRMWARE_8168E_2	"rtl_nic/rtl8168e-2.fw"
+#define FIRMWARE_8168E_3	"rtl_nic/rtl8168e-3.fw"
 #define FIRMWARE_8105E_1	"rtl_nic/rtl8105e-1.fw"
 
 #ifdef RTL8169_DEBUG
@@ -133,6 +134,7 @@ enum mac_version {
 	RTL_GIGA_MAC_VER_31,
 	RTL_GIGA_MAC_VER_32,
 	RTL_GIGA_MAC_VER_33,
+	RTL_GIGA_MAC_VER_34,
 	RTL_GIGA_MAC_NONE   = 0xff,
 };
 
@@ -217,6 +219,8 @@ static const struct {
 		_R("RTL8168e/8111e",	RTL_TD_1, FIRMWARE_8168E_1),
 	[RTL_GIGA_MAC_VER_33] =
 		_R("RTL8168e/8111e",	RTL_TD_1, FIRMWARE_8168E_2),
+	[RTL_GIGA_MAC_VER_34] =
+		_R("RTL8168evl/8111evl",RTL_TD_1, FIRMWARE_8168E_3),
 };
 #undef _R
 
@@ -715,6 +719,76 @@ static int rtl8169_poll(struct napi_struct *napi, int budget);
 static const unsigned int rtl8169_rx_config =
 	(RX_FIFO_THRESH << RxCfgFIFOShift) | (RX_DMA_BURST << RxCfgDMAShift);
 
+static void rtl_eri_cmd(void __iomem *ioaddr, int type, int addr, u32 cmd)
+{
+	int i;
+
+	RTL_W32(ERIAR, cmd | type | ERIAR_BYTEEN | addr);
+
+	for (i = 0; i < 10; i++) {
+		udelay(100);
+
+		if ((RTL_R32(ERIAR) ^ cmd) & ERIAR_FLAG)
+			break;
+	}
+}
+
+static u32 eri_data_mask[] = { 0x000000ff, 0x0000ffff, 0x00ffffff, 0xffffffff };
+
+u32 rtl_eri_read(void __iomem *ioaddr, int addr, int len, int type)
+{
+	int done = 0;
+	u32 val = 0;
+
+	BUG_ON((len > 4) || (len <= 0));
+
+	while ((len <= 4) && (len > 0)) {
+		int offset = addr % ERIAR_ADDR_BYTE_ALIGN;
+		int avail = ERIAR_ADDR_BYTE_ALIGN - offset;
+		u32 data;
+
+		addr &= ~(ERIAR_ADDR_BYTE_ALIGN - 1);
+
+		rtl_eri_cmd(ioaddr, type, addr, ERIAR_READ_CMD);
+
+		data = (RTL_R32(ERIDR) >> (8 * offset)) & eri_data_mask[len -1];
+		val |= data << (8 * done);
+		done += min(avail, len);
+		len -= done;
+		addr += ERIAR_ADDR_BYTE_ALIGN;
+	}
+
+	return val;
+}
+
+static void rtl_eri_write(void __iomem *ioaddr, int addr, int len, u32 val,
+			  int type)
+{
+	BUG_ON((len > 4) || (len <= 0));
+
+	while ((len <= 4) && (len > 0)) {
+		int offset = addr % ERIAR_ADDR_BYTE_ALIGN;
+		int avail = ERIAR_ADDR_BYTE_ALIGN - offset;
+		u32 data;
+		int done;
+
+		addr &= ~(ERIAR_ADDR_BYTE_ALIGN - 1);
+
+		data  = rtl_eri_read(ioaddr, addr, 4, type);
+		data &= ~(eri_data_mask[len - 1] << (8 * offset));
+		data |= val << (8 * offset);
+
+		RTL_W32(ERIDR, data);
+
+		rtl_eri_cmd(ioaddr, type, addr, ERIAR_WRITE_CMD);
+
+		done = min(avail, len);
+		val >>= 8 * done;
+		len -= done;
+		addr += ERIAR_ADDR_BYTE_ALIGN;
+	}
+}
+
 static u32 ocp_read(struct rtl8169_private *tp, u8 mask, u16 reg)
 {
 	void __iomem *ioaddr = tp->mmio_addr;
@@ -1641,6 +1715,7 @@ static void rtl8169_get_mac_version(struct rtl8169_private *tp,
 		int mac_version;
 	} mac_info[] = {
 		/* 8168E family. */
+		{ 0x7cf00000, 0x2c800000,	RTL_GIGA_MAC_VER_34 },
 		{ 0x7cf00000, 0x2c200000,	RTL_GIGA_MAC_VER_33 },
 		{ 0x7cf00000, 0x2c100000,	RTL_GIGA_MAC_VER_32 },
 		{ 0x7c800000, 0x2c000000,	RTL_GIGA_MAC_VER_33 },
@@ -2691,6 +2766,109 @@ static void rtl8168e_1_hw_phy_config(struct rtl8169_private *tp)
 	rtl_writephy(tp, 0x0d, 0x0000);
 }
 
+static void rtl8168e_2_hw_phy_config(struct rtl8169_private *tp)
+{
+	static const struct phy_reg phy_reg_init_1[] = {
+		/* ? */
+		{ 0x1f, 0x0001 },
+		{ 0x0b, 0x6c14 },
+		{ 0x14, 0x7f3d },
+		{ 0x1c, 0xfafe },
+		{ 0x08, 0x07c5 },
+		{ 0x10, 0xf090 },
+		{ 0x1f, 0x0003 },
+		{ 0x14, 0x641a },
+		{ 0x1a, 0x0606 },
+		{ 0x12, 0xf480 },
+		{ 0x13, 0x0747 },
+		{ 0x1f, 0x0000 },
+
+		/* ? */
+		{ 0x1f, 0x0004 },
+		{ 0x1f, 0x0007 },
+		{ 0x1e, 0x0078 },
+		{ 0x15, 0xa408 },
+		{ 0x17, 0x5100 },
+		{ 0x19, 0x0008 },
+		{ 0x1f, 0x0002 },
+		{ 0x1f, 0x0000 },
+
+		/* ? */
+		{ 0x1f, 0x0003 },
+		{ 0x0d, 0x0207 },
+		{ 0x02, 0x5fd0 },
+		{ 0x1f, 0x0000 }
+	}, phy_reg_init_2[] = {
+		/* ? */
+		{ 0x1f, 0x0004 },
+		{ 0x1f, 0x0007 },
+		{ 0x1e, 0x00ac },
+		{ 0x18, 0x0006 },
+		{ 0x1f, 0x0002 },
+		{ 0x1f, 0x0000 },
+
+		/* ? */
+		{ 0x1f, 0x0003 },
+		{ 0x09, 0xa20f },
+		{ 0x1f, 0x0000 },
+
+		/* ? */
+		{ 0x1f, 0x0005 },
+		{ 0x05, 0x8b5b },
+		{ 0x06, 0x9222 },
+		{ 0x05, 0x8b6d },
+		{ 0x06, 0x8000 },
+		{ 0x05, 0x8b76 },
+		{ 0x06, 0x8000 },
+		{ 0x1f, 0x0000 }
+	};
+
+	rtl_apply_firmware(tp);
+
+	/* ? */
+	rtl_writephy(tp, 0x1f, 0x0005);
+	rtl_writephy(tp, 0x05, 0x8b80);
+	rtl_w1w0_phy(tp, 0x06, 0x0006, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* ? */
+	rtl_writephy(tp, 0x1f, 0x0004);
+	rtl_writephy(tp, 0x1f, 0x0007);
+	rtl_writephy(tp, 0x1e, 0x002d);
+	rtl_w1w0_phy(tp, 0x18, 0x0010, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0002);
+	rtl_writephy(tp, 0x1f, 0x0000);
+	rtl_w1w0_phy(tp, 0x14, 0x8000, 0x0000);
+
+	rtl_writephy(tp, 0x1f, 0x0000);
+	rtl_writephy(tp, 0x15, 0x1006);
+
+	rtl_writephy(tp, 0x1f, 0x0005);
+	rtl_writephy(tp, 0x05, 0x8b86);
+	rtl_w1w0_phy(tp, 0x06, 0x0001, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	rtl_writephy_batch(tp, phy_reg_init_1, ARRAY_SIZE(phy_reg_init_1));
+
+	/* ? */
+	rtl_writephy(tp, 0x1f, 0x0004);
+	rtl_writephy(tp, 0x1f, 0x0007);
+	rtl_writephy(tp, 0x1e, 0x00a1);
+	rtl_w1w0_phy(tp, 0x1a, 0x0000, 0x0004);
+	rtl_writephy(tp, 0x1f, 0x0002);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* ? */
+	rtl_writephy(tp, 0x1f, 0x0004);
+	rtl_writephy(tp, 0x1f, 0x0007);
+	rtl_writephy(tp, 0x1e, 0x002d);
+	rtl_w1w0_phy(tp, 0x16, 0x0020, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0002);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	rtl_writephy_batch(tp, phy_reg_init_2, ARRAY_SIZE(phy_reg_init_2));
+}
+
 static void rtl8102e_hw_phy_config(struct rtl8169_private *tp)
 {
 	static const struct phy_reg phy_reg_init[] = {
@@ -2812,6 +2990,9 @@ static void rtl_hw_phy_config(struct net_device *dev)
 	case RTL_GIGA_MAC_VER_33:
 		rtl8168e_1_hw_phy_config(tp);
 		break;
+	case RTL_GIGA_MAC_VER_34:
+		rtl8168e_2_hw_phy_config(tp);
+		break;
 
 	default:
 		break;
@@ -3170,6 +3351,7 @@ static void r8168_phy_power_down(struct rtl8169_private *tp)
 	switch (tp->mac_version) {
 	case RTL_GIGA_MAC_VER_32:
 	case RTL_GIGA_MAC_VER_33:
+	case RTL_GIGA_MAC_VER_34:
 		rtl_writephy(tp, MII_BMCR, BMCR_ANENABLE | BMCR_PDOWN);
 		break;
 
@@ -3926,6 +4108,20 @@ static void rtl_csi_access_enable_2(void __iomem *ioaddr)
 	rtl_csi_access_enable(ioaddr, 0x27000000);
 }
 
+struct ephy_reg {
+	u16 offset;
+	u16 val;
+};
+
+static void rtl_write_ephy_batch(void __iomem *ioaddr,
+				 const struct ephy_reg *regs, int len)
+{
+	while (len-- > 0) {
+		rtl_ephy_write(ioaddr, regs->offset, regs->val);
+		regs++;
+	}
+}
+
 struct ephy_info {
 	unsigned int offset;
 	u16 mask;
@@ -4184,6 +4380,60 @@ static void rtl_hw_start_8168e_1(void __iomem *ioaddr, struct pci_dev *pdev)
 	RTL_W8(Config5, RTL_R8(Config5) & ~Spi_en);
 }
 
+struct eri_reg {
+	u16 addr;
+	u16 len;
+	u32 data;
+};
+
+static void rtl_write_eri_batch(void __iomem * ioaddr,
+				const struct eri_reg *regs, int len, int type)
+{
+	while (len-- > 0) {
+		rtl_eri_write(ioaddr, regs->addr, regs->len, regs->data, type);
+		regs++;
+	}
+}
+
+static void rtl_hw_start_8168e_2(void __iomem *ioaddr, struct pci_dev *pdev)
+{
+	static const struct ephy_reg ephy_reg_init[] = {
+		{ 0x06, 0xf020 },
+		{ 0x07, 0x01ff },
+		{ 0x00, 0x5027 },
+		{ 0x01, 0x0003 },
+		{ 0x02, 0x2d16 },
+		{ 0x03, 0x6d49 },
+		{ 0x08, 0x0006 },
+		{ 0x0a, 0x00c8 },
+	};
+	static const struct ephy_info e_info_8168e[] = {
+		{ 0x09, 0x0000, 0x0080 },
+		{ 0x19, 0x0000, 0x0224 }
+	};
+	static const struct eri_reg eri_reg_init[] = {
+		{ 0xd5, 1, 0x0000000c },
+		{ 0xc0, 2, 0x00000000 },
+		{ 0xb8, 2, 0x00000000 },
+		{ 0xc8, 4, 0x00100002 },
+		{ 0xe8, 4, 0x00100006 }
+	};
+
+	rtl_csi_access_enable_1(ioaddr);
+	rtl_tx_performance_tweak(pdev, 0x5 << MAX_READ_REQUEST_SHIFT);
+	RTL_W8(MaxTxPacketSize, TxPacketMax);
+
+	rtl_write_eri_batch(ioaddr, eri_reg_init, ARRAY_SIZE(eri_reg_init),
+			    ERIAR_EXGMAC);
+
+	rtl_eri_write(ioaddr, 0x01dc, 1, 0x64, ERIAR_EXGMAC);
+
+	rtl_write_ephy_batch(ioaddr, ephy_reg_init, ARRAY_SIZE(ephy_reg_init));
+	rtl_ephy_init(ioaddr, e_info_8168e, ARRAY_SIZE(e_info_8168e));
+
+	rtl_disable_clock_request(pdev);
+}
+
 static void rtl_hw_start_8168(struct net_device *dev)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
@@ -4275,6 +4525,10 @@ static void rtl_hw_start_8168(struct net_device *dev)
 		rtl_hw_start_8168e_1(ioaddr, pdev);
 		break;
 
+	case RTL_GIGA_MAC_VER_34:
+		rtl_hw_start_8168e_2(ioaddr, pdev);
+		break;
+
 	default:
 		printk(KERN_ERR PFX "%s: unknown chipset (mac_version = %d).\n",
 			dev->name, tp->mac_version);
-- 
1.7.4.4


[-- Attachment #2: 0001-r8169-noise-redux.patch --]
[-- Type: text/plain, Size: 3375 bytes --]

From d777443c097b65b06f1d0e7da37db1368185db8e Mon Sep 17 00:00:00 2001
From: Francois Romieu <romieu@fr.zoreil.com>
Date: Tue, 28 Jun 2011 17:01:29 +0200
Subject: [PATCH 1/3] r8169: noise redux.

- insert an empty record in rtl_chip_infos for future chipsets.
  NB: the array is only accessed through the RTL_GIGA_MAC_VER_xy enum.
- ready-to-use ERIAR bits definitions. Those were not used.
- ERIDR / ERIAR confusion. This one is already in Linus's tree. I'll
  remove it before the final submission.
- make room for different 8168e hw_phy_config and hw_start methods.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/r8169.c |   23 ++++++++++++-----------
 1 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index fbd6838..f5b8d52 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -216,7 +216,7 @@ static const struct {
 	[RTL_GIGA_MAC_VER_32] =
 		_R("RTL8168e/8111e",	RTL_TD_1, FIRMWARE_8168E_1),
 	[RTL_GIGA_MAC_VER_33] =
-		_R("RTL8168e/8111e",	RTL_TD_1, FIRMWARE_8168E_2)
+		_R("RTL8168e/8111e",	RTL_TD_1, FIRMWARE_8168E_2),
 };
 #undef _R
 
@@ -351,12 +351,12 @@ enum rtl8168_registers {
 #define ERIAR_WRITE_CMD			0x80000000
 #define ERIAR_READ_CMD			0x00000000
 #define ERIAR_ADDR_BYTE_ALIGN		4
-#define ERIAR_EXGMAC			0
-#define ERIAR_MSIX			1
-#define ERIAR_ASF			2
 #define ERIAR_TYPE_SHIFT		16
-#define ERIAR_BYTEEN			0x0f
+#define ERIAR_EXGMAC			(0x00 << ERIAR_TYPE_SHIFT)
+#define ERIAR_MSIX			(0x01 << ERIAR_TYPE_SHIFT)
+#define ERIAR_ASF			(0x02 << ERIAR_TYPE_SHIFT)
 #define ERIAR_BYTEEN_SHIFT		12
+#define ERIAR_BYTEEN			(0x0f << ERIAR_BYTEEN_SHIFT)
 	EPHY_RXER_NUM		= 0x7c,
 	OCPDR			= 0xb0,	/* OCP GPHY access */
 #define OCPDR_WRITE_CMD			0x80000000
@@ -749,11 +749,12 @@ static void rtl8168_oob_notify(struct rtl8169_private *tp, u8 cmd)
 	int i;
 
 	RTL_W8(ERIDR, cmd);
-	RTL_W32(ERIAR, 0x800010e8);
+	/* Could 0x10e8 be replaced with (ERIAR_BYTEEN | 0x00e8) ? */
+	RTL_W32(ERIAR, ERIAR_WRITE_CMD | ERIAR_EXGMAC | 0x10e8);
 	msleep(2);
 	for (i = 0; i < 5; i++) {
 		udelay(100);
-		if (!(RTL_R32(ERIDR) & ERIAR_FLAG))
+		if (!(RTL_R32(ERIAR) & ERIAR_FLAG))
 			break;
 	}
 
@@ -2617,7 +2618,7 @@ static void rtl8168d_4_hw_phy_config(struct rtl8169_private *tp)
 	rtl_patchphy(tp, 0x0d, 1 << 5);
 }
 
-static void rtl8168e_hw_phy_config(struct rtl8169_private *tp)
+static void rtl8168e_1_hw_phy_config(struct rtl8169_private *tp)
 {
 	static const struct phy_reg phy_reg_init[] = {
 		/* Enable Delay cap */
@@ -2809,7 +2810,7 @@ static void rtl_hw_phy_config(struct net_device *dev)
 		break;
 	case RTL_GIGA_MAC_VER_32:
 	case RTL_GIGA_MAC_VER_33:
-		rtl8168e_hw_phy_config(tp);
+		rtl8168e_1_hw_phy_config(tp);
 		break;
 
 	default:
@@ -4148,7 +4149,7 @@ static void rtl_hw_start_8168d_4(void __iomem *ioaddr, struct pci_dev *pdev)
 	rtl_enable_clock_request(pdev);
 }
 
-static void rtl_hw_start_8168e(void __iomem *ioaddr, struct pci_dev *pdev)
+static void rtl_hw_start_8168e_1(void __iomem *ioaddr, struct pci_dev *pdev)
 {
 	static const struct ephy_info e_info_8168e[] = {
 		{ 0x00, 0x0200,	0x0100 },
@@ -4271,7 +4272,7 @@ static void rtl_hw_start_8168(struct net_device *dev)
 
 	case RTL_GIGA_MAC_VER_32:
 	case RTL_GIGA_MAC_VER_33:
-		rtl_hw_start_8168e(ioaddr, pdev);
+		rtl_hw_start_8168e_1(ioaddr, pdev);
 		break;
 
 	default:
-- 
1.7.4.4
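
For readers following the ERIAR cleanup: the magic value 0x800010e8 can be
decomposed with the definitions from the hunk above. A minimal userspace C
sketch follows. eriar_word() is a hypothetical helper, and the assumption
that the address occupies the 12 bits below the byte-enable field is
inferred from the shift values rather than from a datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the definitions in the patch. */
#define ERIAR_WRITE_CMD		0x80000000u
#define ERIAR_TYPE_SHIFT	16
#define ERIAR_EXGMAC		(0x00u << ERIAR_TYPE_SHIFT)
#define ERIAR_BYTEEN_SHIFT	12
#define ERIAR_BYTEEN		(0x0fu << ERIAR_BYTEEN_SHIFT)

/* Compose an ERIAR command word from its fields. */
static uint32_t eriar_word(uint32_t cmd, uint32_t type,
			   unsigned int byteen, unsigned int addr)
{
	assert(byteen <= 0x0f);		/* 4-bit byte-enable field */
	assert(addr <= 0x0fff);		/* assumed 12-bit address field */

	return cmd | type | ((uint32_t)byteen << ERIAR_BYTEEN_SHIFT) | addr;
}
```

Under these definitions eriar_word(ERIAR_WRITE_CMD, ERIAR_EXGMAC, 0x1, 0xe8)
gives 0x800010e8, whereas using all four byte enables gives 0x8000f0e8. So
0x10e8 and (ERIAR_BYTEEN | 0x00e8) are not bit-identical: the former enables
only the lowest byte.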


^ permalink raw reply related

* Re: [PATCH] net/core: Convert to current logging forms
From: Joe Perches @ 2011-06-28 21:55 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Ben Hutchings, netdev, David S. Miller, Neil Horman, linux-kernel
In-Reply-To: <20110628144923.2bdac9f9@nehalam.ftrdhcpuser.net>

On Tue, 2011-06-28 at 14:49 -0700, Stephen Hemminger wrote:
> Does this actually create useful benefit or just create more bug possibilities?

Probably not the bug possibilities.
It's pretty trivial.

> Those messages are mostly self explanatory as is.

It also has some utility for the cases in drivers
between the alloc_netdev and register_netdev[ice]
where printk/pr_<level> currently has to be used.

Beyond that, (foolish?) consistency only.

cheers, Joe


^ permalink raw reply

* Re: [PATCH] net/core: Convert to current logging forms
From: Stephen Hemminger @ 2011-06-28 21:49 UTC (permalink / raw)
  To: Joe Perches
  Cc: Ben Hutchings, netdev, David S. Miller, Neil Horman, linux-kernel
In-Reply-To: <1309297068.29598.26.camel@Joe-Laptop>

On Tue, 28 Jun 2011 14:37:48 -0700
Joe Perches <joe@perches.com> wrote:

> On Tue, 2011-06-28 at 21:36 +0100, Ben Hutchings wrote:
> > On Tue, 2011-06-28 at 13:31 -0700, Joe Perches wrote:
> > > On Tue, 2011-06-28 at 21:21 +0100, Ben Hutchings wrote:
> > > > On Tue, 2011-06-28 at 12:40 -0700, Joe Perches wrote:
> > > > > Use pr_fmt, pr_<level>, and netdev_<level> as appropriate.
> > > > > Coalesce long formats.
> > > > [...]
> > > > > --- a/net/core/dev.c
> > > > > +++ b/net/core/dev.c
> > > > > @@ -72,6 +72,8 @@
> > > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > > [...]
> > > > KBUILD_MODNAME is presumably going to be "dev".
> > > 'tis.
> > > > That's not very meaningful.
> > > I think it's not useless though.
> > > Anything else you think it should be?
> > > Maybe "net_core_device:" or some such like that?
> > "netdev"
> > > Here are the format strings now prefaced by "dev:"
> > > $ strings net/core/built-in.o |grep "^<.>dev:"
> > > <6>dev: netif_stop_queue() cannot be called before register_netdev()
> > > <4>dev: dev_remove_pack: %p not found
> > > <3>dev: Loading kernel module for a network device with CAP_SYS_MODULE (deprecated)
> > > <0>dev: %s: failed to move %s to init_net: %d
> > > <3>dev: alloc_netdev: Unable to allocate device with zero queues
> > > <3>dev: alloc_netdev: Unable to allocate device with zero RX queues
> > > <3>dev: alloc_netdev: Unable to allocate device
> > Many of these refer to a specific device and should be formatted with
> > one of the netdev_* logging functions.
> 
> Perhaps another way to do this is like this:
> 
> As soon as alloc_netdev is called, netdev_<level> can
> be used, but the netdev_name() function can contain
> printf formatting control codes like %d.
> 
> Use pr_fmt(fmt) "netdev: " fmt in net/core/dev.c
> Add netdev_is_registered() to netdevice.h
> Extend uses of netdev_name() with netdev_is_registered()
> to show "(unregistered)" as may be necessary.
> Move setting of dev->name in alloc_netdev_mqs to just
> after allocation instead of at end of function.
> 
> (on top of the original patch, will respin if wanted)
> 
> Thoughts?

Does this actually create useful benefit or just create more bug possibilities?
Those messages are mostly self explanatory as is.


^ permalink raw reply

* Re: [PATCH] bridge: Forward EAPOL Kconfig option BRIDGE_PAE_FORWARD
From: David Lamparter @ 2011-06-28 21:46 UTC (permalink / raw)
  To: Nick Carter; +Cc: David Lamparter, Stephen Hemminger, netdev, davem
In-Reply-To: <BANLkTikN_TPPcyywVT5W9CrGAFJX8qQiyQ@mail.gmail.com>

On Tue, Jun 28, 2011 at 10:22:53PM +0100, Nick Carter wrote:
> > I beg to differ, there very much is. You never ever ever want to be
> > running STP with 802.1X packets passing through... some client will shut
> > down your upstream port, your STP will break and you will die in a fire.
> >
> > The general idea, though, is that a STP-enabled switch is an intelligent
> > switch. And an intelligent switch can speak all those pesky small
> > side-dish protocols.
[...]
> >> > (Some quick googling reveals that hardware switch chips special-drop
> >> > 01:80:c2:00:00:01 [802.3x/pause] and :02 [802.3ad/lacp] and nothing
> >> > else - for the dumb ones anyway. It also seems like the match for pause
> >> > frames usually works on the address, not on the protocol field like we
> >> > do it...)
> >> 'Enterprise' switches drop :03 [802.1x]
> >
> > They also speak STP, see above about never STP+1X :)
> But if you turn off STP they won't start forwarding 802.1x.

Yes, hence my suggestion to have a knob for all of the link-local
ethernet groups. (Which I'm still not actually endorsing, just placing
the idea)

> Also having STP on and forwarding 802.1x would be useful (but
> non-standard) in some cheap redundant periphery switches, connecting
> to a couple of 'core' switches acting as 802.1x authenticators.

That wouldn't really make much sense since those central 802.1X
authenticators wouldn't be able to switch the client-facing ports on and
off. Instead, you now have to (1) disable the port switching to make
sure your upstreams stay on and (2) deal with 802.1X clients being
re"routed" by STP to different authenticators that don't know them.


-David


^ permalink raw reply

* Re: [PATCH] net/core: Convert to current logging forms
From: Joe Perches @ 2011-06-28 21:37 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, David S. Miller, Neil Horman, linux-kernel
In-Reply-To: <1309293416.2771.50.camel@bwh-desktop>

On Tue, 2011-06-28 at 21:36 +0100, Ben Hutchings wrote:
> On Tue, 2011-06-28 at 13:31 -0700, Joe Perches wrote:
> > On Tue, 2011-06-28 at 21:21 +0100, Ben Hutchings wrote:
> > > On Tue, 2011-06-28 at 12:40 -0700, Joe Perches wrote:
> > > > Use pr_fmt, pr_<level>, and netdev_<level> as appropriate.
> > > > Coalesce long formats.
> > > [...]
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -72,6 +72,8 @@
> > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > [...]
> > > KBUILD_MODNAME is presumably going to be "dev".
> > 'tis.
> > > That's not very meaningful.
> > I think it's not useless though.
> > Anything else you think it should be?
> > Maybe "net_core_device:" or some such like that?
> "netdev"
> > Here are the format strings now prefaced by "dev:"
> > $ strings net/core/built-in.o |grep "^<.>dev:"
> > <6>dev: netif_stop_queue() cannot be called before register_netdev()
> > <4>dev: dev_remove_pack: %p not found
> > <3>dev: Loading kernel module for a network device with CAP_SYS_MODULE (deprecated)
> > <0>dev: %s: failed to move %s to init_net: %d
> > <3>dev: alloc_netdev: Unable to allocate device with zero queues
> > <3>dev: alloc_netdev: Unable to allocate device with zero RX queues
> > <3>dev: alloc_netdev: Unable to allocate device
> Many of these refer to a specific device and should be formatted with
> one of the netdev_* logging functions.

Perhaps another way to do this:

As soon as alloc_netdev is called, netdev_<level> can
be used, but the name returned by netdev_name() can
contain printf formatting control codes like %d.

Use pr_fmt(fmt) "netdev: " fmt in net/core/dev.c
Add netdev_is_registered() to netdevice.h
Extend uses of netdev_name() with netdev_is_registered()
to show "(unregistered)" as may be necessary.
Move setting of dev->name in alloc_netdev_mqs to just
after allocation instead of at end of function.

(on top of the original patch, will respin if wanted)

Thoughts?

---

 include/linux/netdevice.h |   29 ++++++++++++++++++++++-------
 net/core/dev.c            |   18 ++++++++++--------
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 011eb89..4b6c4e2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2625,11 +2625,16 @@ static inline u32 dev_ethtool_get_flags(struct net_device *dev)
 
 static inline const char *netdev_name(const struct net_device *dev)
 {
-	if (dev->reg_state != NETREG_REGISTERED)
-		return "(unregistered net_device)";
+	if (!dev)
+		return "NULL net_device";
 	return dev->name;
 }
 
+static inline bool netdev_is_registered(const struct net_device *dev)
+{
+	return dev && dev->reg_state == NETREG_REGISTERED;
+}
+
 extern int netdev_printk(const char *level, const struct net_device *dev,
 			 const char *format, ...)
 	__attribute__ ((format (printf, 3, 4)));
@@ -2657,8 +2662,11 @@ extern int netdev_info(const struct net_device *dev, const char *format, ...)
 #elif defined(CONFIG_DYNAMIC_DEBUG)
 #define netdev_dbg(__dev, format, args...)			\
 do {								\
-	dynamic_dev_dbg((__dev)->dev.parent, "%s: " format,	\
-			netdev_name(__dev), ##args);		\
+	dynamic_dev_dbg((__dev)->dev.parent, "%s%s: " format,	\
+			netdev_name(__dev),			\
+			netdev_is_registered(__dev)		\
+			  ? "" : " (unregistered)",		\
+			##args);				\
 } while (0)
 #else
 #define netdev_dbg(__dev, format, args...)			\
@@ -2687,7 +2695,11 @@ do {								\
  * file/line information and a backtrace.
  */
 #define netdev_WARN(dev, format, args...)			\
-	WARN(1, "netdevice: %s\n" format, netdev_name(dev), ##args);
+	WARN(1, "netdevice: %s%s\n" format,			\
+	     netdev_name(dev),					\
+	     netdev_is_registered(dev)				\
+	       ? "" : " (unregistered)",			\
+	     ##args);
 
 /* netif printk helpers, similar to netdev_printk */
 
@@ -2726,8 +2738,11 @@ do {								\
 do {								\
 	if (netif_msg_##type(priv))				\
 		dynamic_dev_dbg((netdev)->dev.parent,		\
-				"%s: " format,			\
-				netdev_name(netdev), ##args);	\
+				"%s%s: " format,		\
+				netdev_name(netdev),		\
+				netdev_is_registered(netdev)	\
+				  ? "" : " (unregistered)",	\
+				##args);			\
 } while (0)
 #else
 #define netif_dbg(priv, type, dev, format, args...)			\
diff --git a/net/core/dev.c b/net/core/dev.c
index 3401227..44e2646 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -72,7 +72,7 @@
  *				        - netif_rx() feedback
  */
 
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#define pr_fmt(fmt) "netdev: " fmt
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -5803,13 +5803,13 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
 	if (txqs < 1) {
-		pr_err("alloc_netdev: Unable to allocate device with zero queues\n");
+		pr_err("Unable to allocate device with zero queues\n");
 		return NULL;
 	}
 
 #ifdef CONFIG_RPS
 	if (rxqs < 1) {
-		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
+		pr_err("Unable to allocate device with zero RX queues\n");
 		return NULL;
 	}
 #endif
@@ -5825,13 +5825,15 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 
 	p = kzalloc(alloc_size, GFP_KERNEL);
 	if (!p) {
-		pr_err("alloc_netdev: Unable to allocate device\n");
+		pr_err("Unable to allocate device\n");
 		return NULL;
 	}
 
 	dev = PTR_ALIGN(p, NETDEV_ALIGN);
 	dev->padded = (char *)dev - (char *)p;
 
+	strcpy(dev->name, name);
+
 	dev->pcpu_refcnt = alloc_percpu(int);
 	if (!dev->pcpu_refcnt)
 		goto free_p;
@@ -5864,7 +5866,6 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		goto free_all;
 #endif
 
-	strcpy(dev->name, name);
 	dev->group = INIT_NETDEV_GROUP;
 	return dev;
 
@@ -6270,10 +6271,11 @@ static int __netdev_printk(const char *level, const struct net_device *dev,
 	if (dev && dev->dev.parent)
 		r = dev_printk(level, dev->dev.parent, "%s: %pV",
 			       netdev_name(dev), vaf);
-	else if (dev)
-		r = printk("%s%s: %pV", level, netdev_name(dev), vaf);
 	else
-		r = printk("%s(NULL net_device): %pV", level, vaf);
+		r = printk("%s%s%s: %pV",
+			   level, netdev_name(dev),
+			   netdev_is_registered(dev) ? "" : " (unregistered)",
+			   vaf);
 
 	return r;
 }
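
A minimal userspace sketch of the naming logic in the patch above, assuming
simplified stand-ins for the kernel types. The names net_device_model,
model_name(), model_is_registered() and model_format() are hypothetical and
exist only for this illustration; only the two checks and the
" (unregistered)" suffix are taken from the diff. Note that the device name
is passed as a %s argument, so a name such as "eth%d" is printed literally
and never interpreted as a format.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel definitions (assumption). */
enum netreg_state { NETREG_UNINITIALIZED, NETREG_REGISTERED };

struct net_device_model {
	char name[16];
	enum netreg_state reg_state;
};

/* Mirrors the patched netdev_name(): only a NULL device is special. */
static const char *model_name(const struct net_device_model *dev)
{
	if (!dev)
		return "NULL net_device";
	return dev->name;
}

/* Mirrors the proposed netdev_is_registered(). */
static bool model_is_registered(const struct net_device_model *dev)
{
	return dev && dev->reg_state == NETREG_REGISTERED;
}

/* Builds "<name><suffix>: <msg>", like the printk() fallback branch. */
static int model_format(char *buf, size_t size,
			const struct net_device_model *dev, const char *msg)
{
	return snprintf(buf, size, "%s%s: %s", model_name(dev),
			model_is_registered(dev) ? "" : " (unregistered)",
			msg);
}
```

With this model a NULL device comes out as "NULL net_device (unregistered)",
because the suffix is appended whenever the device is not registered.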



^ permalink raw reply related

* Re: [PATCH] bridge: Forward EAPOL Kconfig option BRIDGE_PAE_FORWARD
From: Nick Carter @ 2011-06-28 21:22 UTC (permalink / raw)
  To: David Lamparter; +Cc: Stephen Hemminger, netdev, davem
In-Reply-To: <20110628210434.GD2121496@jupiter.n2.diac24.net>

On 28 June 2011 22:04, David Lamparter <equinox@diac24.net> wrote:
> On Tue, Jun 28, 2011 at 09:54:01PM +0100, Nick Carter wrote:
>> > 'reinject' isn't possible when it hits that code path - which is pretty
>> > much why I'm saying we should be forwarding everything in the non-STP
>> > case.
>> I'm not sure I like this turn off STP and suddenly start forwarding
>> random groups.  There is no connection between wanting STP on / off
>> and forwarding pae on / off.
>
> I beg to differ, there very much is. You never ever ever want to be
> running STP with 802.1X packets passing through... some client will shut
> down your upstream port, your STP will break and you will die in a fire.
>
> The general idea, though, is that a STP-enabled switch is an intelligent
> switch. And an intelligent switch can speak all those pesky small
> side-dish protocols.
>
> With STP disabled on the other hand, it depends on site policy. Now
> policy...
>
>> There are no dependencies between the protocols.
>> Also on reflection I think a knob per mac group would be better.
>
> .... policy can be done nice and good with ebtables. You can match the
> groups you want, or the protocols, or the phase of the moon.
>
>> We are only interested in 3 and if I enable pae forwarding so I can
>> connect virtual machine supplicants, I don't then want to turn on LLDP
>> forwarding which will needlessly hit my virtual machines.
>> So how about sysfs
>> ../bridge/pae_forwarding
>> ../bridge/ldp_forwarding
>> ../bridge/mvrp_forwarding
>
> It's not like either LLDP or MVRP will trash your VMs. Those protocols
> send a packet once every few seconds.
>
> MVRP is interesting for the STP-enabled case though. I'm not aware of
> any userspace *VRP implementations, and dropping *VRP without a
> userspace daemon to speak it on our behalf means we're creating a *VRP
> border/break.
>
> I would however say that doing a userspace *VRP implementation is a
> better solution than kernel hacks for this particular case.
>
>> > (Some quick googling reveals that hardware switch chips special-drop
>> > 01:80:c2:00:00:01 [802.3x/pause] and :02 [802.3ad/lacp] and nothing
>> > else - for the dumb ones anyway. It also seems like the match for pause
>> > frames usually works on the address, not on the protocol field like we
>> > do it...)
>> 'Enterprise' switches drop :03 [802.1x]
>
> They also speak STP, see above about never STP+1X :)
But if you turn off STP they won't start forwarding 802.1x.
Also having STP on and forwarding 802.1x would be useful (but
non-standard) in some cheap redundant periphery switches, connecting
to a couple of 'core' switches acting as 802.1x authenticators.
Nick
>
> -David
>
>

^ permalink raw reply

* [RFC PATCH] asix: drop rx skb if header length is invalid
From: maksim.rayskiy @ 2011-06-28 21:21 UTC (permalink / raw)
  To: netdev; +Cc: Maksim Rayskiy

From: Maksim Rayskiy <mrayskiy@broadcom.com>

Signed-off-by: Maksim Rayskiy <mrayskiy@broadcom.com>
---

I am using an AX88772 usbnet dongle, and sometimes after system resume I see
corrupt rx packets which generate an endless stream of
asix_rx_fixup() Bad Header Length
messages.
Looking at asix_rx_fixup(), I see that, depending on what junk arrives in the
skb, the while loop may never terminate. Wouldn't it be safer to bail out as
soon as an incorrect header length is detected?

 drivers/net/usb/asix.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/usb/asix.c b/drivers/net/usb/asix.c
index 6998aa6..9d7a6ec 100644
--- a/drivers/net/usb/asix.c
+++ b/drivers/net/usb/asix.c
@@ -317,6 +317,7 @@ static int asix_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 		if ((short)(header & 0x0000ffff) !=
 		    ~((short)((header & 0xffff0000) >> 16))) {
 			netdev_err(dev->net, "asix_rx_fixup() Bad Header Length\n");
+			return 0;
 		}
 		/* get the packet length */
 		size = (u16) (header & 0x0000ffff);
-- 
1.7.4.1
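
A minimal userspace sketch of the check discussed above, assuming the header
layout implied by the quoted condition: the low 16 bits carry the packet
length and the high 16 bits carry its bitwise complement. The functions
rx_header_ok() and count_rx_packets() are hypothetical names for this
illustration, and the framing is deliberately simplified (little-endian
header, packets packed back to back with no padding), which may not match
the real device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A header is consistent when the high half is the complement of the low. */
static bool rx_header_ok(uint32_t header)
{
	uint16_t len = header & 0xffff;
	uint16_t inv = header >> 16;

	return len == (uint16_t)~inv;
}

/*
 * Walk a buffer of [header][payload] records. Returns the number of
 * well-formed packets, or -1 as soon as a bad header or a length that
 * overruns the buffer is seen, instead of continuing to parse junk.
 */
static int count_rx_packets(const uint8_t *buf, size_t len)
{
	int count = 0;

	while (len >= 4) {
		uint32_t header = (uint32_t)buf[0] |
				  (uint32_t)buf[1] << 8 |
				  (uint32_t)buf[2] << 16 |
				  (uint32_t)buf[3] << 24;
		size_t size;

		if (!rx_header_ok(header))
			return -1;	/* bail out on the first bad header */

		size = header & 0xffff;
		buf += 4;
		len -= 4;
		if (size > len)
			return -1;	/* claimed length overruns the data */

		buf += size;
		len -= size;
		count++;
	}
	return count;
}
```

Every iteration either returns or consumes at least the four header bytes,
so the loop is guaranteed to terminate whatever the buffer contains.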



^ permalink raw reply related

* Re: [PATCH] bridge: Forward EAPOL Kconfig option BRIDGE_PAE_FORWARD
From: David Lamparter @ 2011-06-28 21:04 UTC (permalink / raw)
  To: Nick Carter; +Cc: David Lamparter, Stephen Hemminger, netdev, davem
In-Reply-To: <BANLkTim1owW7iMS3hgKhm7DzTzE4X-jxHA@mail.gmail.com>

On Tue, Jun 28, 2011 at 09:54:01PM +0100, Nick Carter wrote:
> > 'reinject' isn't possible when it hits that code path - which is pretty
> > much why I'm saying we should be forwarding everything in the non-STP
> > case.
> I'm not sure I like this turn off STP and suddenly start forwarding
> random groups.  There is no connection between wanting STP on / off
> and forwarding pae on / off.

I beg to differ, there very much is. You never ever ever want to be
running STP with 802.1X packets passing through... some client will shut
down your upstream port, your STP will break and you will die in a fire.

The general idea, though, is that a STP-enabled switch is an intelligent
switch. And an intelligent switch can speak all those pesky small 
side-dish protocols.

With STP disabled on the other hand, it depends on site policy. Now
policy...

> There are no dependencies between the protocols.
> Also on reflection I think a knob per mac group would be better.

.... policy can be done nice and good with ebtables. You can match the
groups you want, or the protocols, or the phase of the moon.

> We are only interested in 3 and if I enable pae forwarding so I can
> connect virtual machine supplicants, I don't then want to turn on LLDP
> forwarding which will needlessly hit my virtual machines.
> So how about sysfs
> ../bridge/pae_forwarding
> ../bridge/ldp_forwarding
> ../bridge/mvrp_forwarding

It's not like either LLDP or MVRP will trash your VMs. Those protocols
send a packet once every few seconds.

MVRP is interesting for the STP-enabled case though. I'm not aware of
any userspace *VRP implementations, and dropping *VRP without a
userspace daemon to speak it on our behalf means we're creating a *VRP
border/break.

I would however say that doing a userspace *VRP implementation is a
better solution than kernel hacks for this particular case.

> > (Some quick googling reveals that hardware switch chips special-drop
> > 01:80:c2:00:00:01 [802.3x/pause] and :02 [802.3ad/lacp] and nothing
> > else - for the dumb ones anyway. It also seems like the match for pause
> > frames usually works on the address, not on the protocol field like we
> > do it...)
> 'Enterprise' switches drop :03 [802.1x]

They also speak STP, see above about never STP+1X :)

-David
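
A minimal userspace sketch of the address groups this thread keeps referring
to, assuming plain byte arrays for MAC addresses and a host-order EtherType.
The helper names are hypothetical; the group labels are the ones used in the
thread, and forward_when_stp_off() only models the condition proposed earlier
in the thread (forward when STP is off and the frame is either for the :00
group or carries the PAE EtherType), not the bridge code itself.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETHERTYPE_PAE 0x888e	/* EAPOL / 802.1X, host byte order here */

/* True for the reserved 01:80:C2:00:00:0x group addresses. */
static bool is_link_local_group(const uint8_t dest[6])
{
	static const uint8_t prefix[5] = { 0x01, 0x80, 0xc2, 0x00, 0x00 };

	return memcmp(dest, prefix, sizeof(prefix)) == 0 &&
	       (dest[5] & 0xf0) == 0;
}

/* Labels follow the thread's own naming of each group. */
static const char *group_label(uint8_t last_octet)
{
	switch (last_octet) {
	case 0x00: return "STP";
	case 0x01: return "802.3x pause";
	case 0x02: return "802.3ad LACP";
	case 0x03: return "802.1X";
	case 0x0d: return "GVRP/MVRP";
	case 0x0e: return "LLDP";
	default:   return "other";
	}
}

/* Model of the proposed check: only meaningful for link-local frames. */
static bool forward_when_stp_off(bool stp_enabled, const uint8_t dest[6],
				 uint16_t ethertype)
{
	return !stp_enabled &&
	       (dest[5] == 0 || ethertype == ETHERTYPE_PAE);
}
```

Under this model an LLDP frame (:0E) is still not forwarded with STP off,
which is the gap the "forward everything in the non-STP case" argument is
about.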


^ permalink raw reply

* Re: [PATCH] bridge: Forward EAPOL Kconfig option BRIDGE_PAE_FORWARD
From: Nick Carter @ 2011-06-28 20:54 UTC (permalink / raw)
  To: David Lamparter; +Cc: Stephen Hemminger, netdev, davem
In-Reply-To: <20110628202200.GB2121496@jupiter.n2.diac24.net>

On 28 June 2011 21:22, David Lamparter <equinox@diac24.net> wrote:
> On Tue, Jun 28, 2011 at 09:00:16PM +0100, Nick Carter wrote:
>> >>                 /* If STP is turned off, then forward */
>> >> -               if (p->br->stp_enabled == BR_NO_STP && dest[5] == 0)
>> >> +               if (p->br->stp_enabled == BR_NO_STP &&
>> >> +                       (dest[5] == 0 || skb->protocol == htons(ETH_P_PAE)))
>> >>                         goto forward;
>> >> Nick
>> >
>> > That code actually looks quite wrong to me, we should be forwarding all of
>> > the 01:80:C2:00:00:0x groups in non-STP mode, especially :0E and :0D.
>> > (LLDP and GVRP/MVRP)
>> >
>> > Pause frames are the one exception that makes the rule, but as the
>> > comment a few lines above states, "Pause frames shouldn't be passed up by
>> > driver anyway".
>> >
>> > Btw, what might make sense is a general knob for forwarding those
>> > link-local groups, split off from the STP switch so the STP switch
>> > controls only the :00 (STP) group. That way you can decide separately
>> > whether you want to be LLDP/GVRP/802.1X/... transparent and whether you
>> > want to run STP.
>>
>> Sounds good to me.  So we go for :03, :0D, and :0E.  We can't touch :02, see:
>>  commit f01cb5fbea1c1613621f9f32f385e12c1a29dde0
>>  Revert "bridge: Forward reserved group addresses if !STP"
>>
>> > Not sure if it's needed, it can always be done with ebtables...
>> What would be the ebtables rules to achieve the forwarding of :03 ?  I
>> asked this question on the netfilter list and the only response I got
>> said ebtables was a filter and could not do this. :03 is hitting
>> NF_BR_LOCAL_IN.  How would you 'reinject' it so it is forwarded ?
>
> 'reinject' isn't possible when it hits that code path - which is pretty
> much why I'm saying we should be forwarding everything in the non-STP
> case.
I'm not sure I like this: turn off STP and suddenly start forwarding
random groups.  There is no connection between wanting STP on / off
and forwarding PAE on / off.  There are no dependencies between the
protocols.
Also, on reflection, I think a knob per MAC group would be better.  We
are only interested in 3, and if I enable PAE forwarding so I can
connect virtual machine supplicants, I don't then want to turn on LLDP
forwarding, which will needlessly hit my virtual machines.
So how about sysfs
../bridge/pae_forwarding
../bridge/ldp_forwarding
../bridge/mvrp_forwarding
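
One way the per-group knobs above could be modelled is a single bitmask with
one bit per 01:80:C2:00:00:0x group. This is only a sketch under that
assumption: group_bit() and group_forwarded() are hypothetical names, and
nothing here is taken from the bridge code or from an existing sysfs
interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit for group 01:80:C2:00:00:0x, or 0 if the octet is out of range. */
static uint16_t group_bit(uint8_t last_octet)
{
	if (last_octet > 0x0f)
		return 0;
	return (uint16_t)(1u << last_octet);
}

/* True if the mask enables forwarding for the given group. */
static bool group_forwarded(uint16_t fwd_mask, uint8_t last_octet)
{
	return (fwd_mask & group_bit(last_octet)) != 0;
}
```

Enabling only PAE forwarding would then be a mask of group_bit(0x03), which
leaves the :0E group untouched.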
>
> I have to read up on the bonding interactions, but to my understanding
> the only reasonable usage case is to have the bond below the bridge like
>  eth0 \
>      |- bond0 - br0
>  eth1 /
> then the bonding should receive (and consume) the packets before they
> reach the bridge.
>
> (Some quick googling reveals that hardware switch chips special-drop
> 01:80:c2:00:00:01 [802.3x/pause] and :02 [802.3ad/lacp] and nothing
> else - for the dumb ones anyway. It also seems like the match for pause
> frames usually works on the address, not on the protocol field like we
> do it...)
'Enterprise' switches drop :03 [802.1x]
Nick
>
>
> -David
>
>

^ permalink raw reply

* Re: [PATCH] net/core: Convert to current logging forms
From: Joe Perches @ 2011-06-28 20:40 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, David S. Miller, Neil Horman, linux-kernel
In-Reply-To: <1309293416.2771.50.camel@bwh-desktop>

On Tue, 2011-06-28 at 21:36 +0100, Ben Hutchings wrote:
> On Tue, 2011-06-28 at 13:31 -0700, Joe Perches wrote:
> > On Tue, 2011-06-28 at 21:21 +0100, Ben Hutchings wrote:
> > > On Tue, 2011-06-28 at 12:40 -0700, Joe Perches wrote:
> > > > Use pr_fmt, pr_<level>, and netdev_<level> as appropriate.
> > > > Coalesce long formats.
> > > [...]
> > > > --- a/net/core/dev.c
[]
> > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > [...]
> > > KBUILD_MODNAME is presumably going to be "dev".
> > 'tis.
> > > That's not very meaningful.
> > Anything else you think it should be?
> > Maybe "net_core_device:" or some such like that?
> "netdev"

Maybe.  David?  Got an opinion?

> > Here are the format strings now prefaced by "dev:"
> > 
> > $ strings net/core/built-in.o |grep "^<.>dev:"
> > <6>dev: netif_stop_queue() cannot be called before register_netdev()
> > <4>dev: dev_remove_pack: %p not found
> > <3>dev: Loading kernel module for a network device with CAP_SYS_MODULE (deprecated)
> > <0>dev: %s: failed to move %s to init_net: %d
> > <3>dev: alloc_netdev: Unable to allocate device with zero queues
> > <3>dev: alloc_netdev: Unable to allocate device with zero RX queues
> > <3>dev: alloc_netdev: Unable to allocate device
> Many of these refer to a specific device and should be formatted with
> one of the netdev_* logging functions.

Not true.

These are in the alloc_netdev function where
netdev_<level> can not yet be successfully used.
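
For readers unfamiliar with the pr_fmt convention under discussion, here is a
minimal userspace sketch of how the prefix ends up in the format strings. It
relies only on compile-time concatenation of adjacent string literals.
KBUILD_MODNAME is defined by hand here (in a kernel build it comes from the
build system), and the macro and function names are hypothetical.

```c
#include <stdio.h>

/* Stand-in: the thread reports "dev" as the value for net/core/dev.c. */
#define KBUILD_MODNAME "dev"

/* Adjacent string literals are merged by the compiler, so the prefix
 * becomes part of the format string itself. */
#define pr_fmt_modname(fmt) KBUILD_MODNAME ": " fmt
#define pr_fmt_netdev(fmt)  "netdev: " fmt

/* Format string as produced with the KBUILD_MODNAME prefix. */
static const char *msg_with_modname(void)
{
	return pr_fmt_modname("alloc_netdev: Unable to allocate device\n");
}

/* The same idea with the fixed "netdev" prefix suggested in the thread. */
static const char *msg_with_netdev(void)
{
	return pr_fmt_netdev("Unable to allocate device\n");
}
```

The first string matches the "dev: alloc_netdev: ..." form in the strings
output quoted above, minus the log-level marker.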



^ permalink raw reply

* Re: [PATCH] net/core: Convert to current logging forms
From: Ben Hutchings @ 2011-06-28 20:36 UTC (permalink / raw)
  To: Joe Perches; +Cc: netdev, David S. Miller, Neil Horman, linux-kernel
In-Reply-To: <1309293068.29598.14.camel@Joe-Laptop>

On Tue, 2011-06-28 at 13:31 -0700, Joe Perches wrote:
> On Tue, 2011-06-28 at 21:21 +0100, Ben Hutchings wrote:
> > On Tue, 2011-06-28 at 12:40 -0700, Joe Perches wrote:
> > > Use pr_fmt, pr_<level>, and netdev_<level> as appropriate.
> > > Coalesce long formats.
> > [...]
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -72,6 +72,8 @@
> > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > [...]
> > KBUILD_MODNAME is presumably going to be "dev".
> 
> 'tis.
> 
> > That's not very meaningful.
> 
> I think it's not useless though.
>
> Anything else you think it should be?
> Maybe "net_core_device:" or some such like that?

"netdev"

> Here are the format strings now prefaced by "dev:"
> 
> $ strings net/core/built-in.o |grep "^<.>dev:"
> <6>dev: netif_stop_queue() cannot be called before register_netdev()
> <4>dev: dev_remove_pack: %p not found
> <3>dev: Loading kernel module for a network device with CAP_SYS_MODULE (deprecated)
> <0>dev: %s: failed to move %s to init_net: %d
> <3>dev: alloc_netdev: Unable to allocate device with zero queues
> <3>dev: alloc_netdev: Unable to allocate device with zero RX queues
> <3>dev: alloc_netdev: Unable to allocate device

Many of these refer to a specific device and should be formatted with
one of the netdev_* logging functions.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

