Netdev List
* Re: [BUG] bnx2 + vlan + TSO : doesnt work
From: David Miller @ 2011-01-18  6:09 UTC (permalink / raw)
  To: jesse; +Cc: bhutchings, eric.dumazet, netdev
In-Reply-To: <AANLkTincFt4fKLvtity=MgQ2+cuvNFrBHqu3+bMLsypc@mail.gmail.com>

From: Jesse Gross <jesse@nicira.com>
Date: Mon, 17 Jan 2011 16:13:18 -0800

> I think it is better for netif_skb_features() to actually return the
> correct features rather than bypass it here.  NETIF_F_HW_VLAN_TX
> doesn't depend on any other offloads, so we can always include it if
> it is in dev->features.
> 
> Separately, this means there is a problem with bnx2 because it allows
> vlan insertion to be turned off, which would have the same effect.
> Maybe it is looking directly at skb->protocol or similar for TSO.

Please, someone cons up an acceptable fix fast.

Thanks.

^ permalink raw reply
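
Jesse's suggestion above can be illustrated with a toy model. This is only a sketch, not kernel code: the flag names and bit values (TOY_F_*) and the function toy_skb_features() are hypothetical stand-ins, and the masking rule is an assumption drawn from his description that VLAN tag insertion does not depend on any other offload and so can be kept whenever the device advertises it.

```c
#include <stdint.h>

/* Hypothetical flag bits for illustration; NOT the kernel's values. */
#define TOY_F_SG         (1u << 0)
#define TOY_F_TSO        (1u << 1)
#define TOY_F_HW_VLAN_TX (1u << 2)

/*
 * Toy feature computation for one packet.
 * Untagged packets may use everything the device offers.
 * For VLAN-tagged packets the offloads are limited to those the device
 * supports on VLAN traffic, but tag insertion itself is kept if the
 * device has it, because it does not depend on the other offloads.
 */
static uint32_t toy_skb_features(uint32_t dev_features,
                                 uint32_t vlan_features,
                                 int vlan_tagged)
{
    if (!vlan_tagged)
        return dev_features;
    return dev_features & (vlan_features | TOY_F_HW_VLAN_TX);
}
```

With this rule, a tagged packet on a device that advertises TOY_F_HW_VLAN_TX keeps that bit even though it is never listed in vlan_features, while a device without it never gains it.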

* Re: [PATCH 2/3] vhost-net: Unify the code of mergeable and big buffer handling
From: Michael S. Tsirkin @ 2011-01-18  4:37 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, kvm, linux-kernel
In-Reply-To: <19765.893.776528.869640@gargle.gargle.HOWL>

On Tue, Jan 18, 2011 at 11:05:33AM +0800, Jason Wang wrote:
> Michael S. Tsirkin writes:
>  > On Mon, Jan 17, 2011 at 04:11:08PM +0800, Jason Wang wrote:
>  > > Codes duplication were found between the handling of mergeable and big
>  > > buffers, so this patch tries to unify them. This could be easily done
>  > > by adding a quota to the get_rx_bufs() which is used to limit the
>  > > number of buffers it returns (for mergeable buffer, the quota is
>  > > simply UIO_MAXIOV, for big buffers, the quota is just 1), and then the
>  > > previous handle_rx_mergeable() could also be reused for big buffers.
>  > > 
>  > > Signed-off-by: Jason Wang <jasowang@redhat.com>
>  > 
>  > We actually started this way. However the code turned out
>  > to have measurable overhead when handle_rx_mergeable
>  > is called with quota 1.
>  > So before applying this I'd like to see some data
>  > to show this is not the case anymore.
>  > 
> 
> I've run a round of tests (Host->Guest) for these three patches on my desktop:

Yes but what if you only apply patch 3?

> Without these patches
> 
> mergeable buffers:
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.42 (10.66.91.42) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384     64    60.00       575.87   69.20    26.36    39.375  7.499  
>  87380  16384    256    60.01      1123.57   73.16    34.73    21.335  5.064  
>  87380  16384    512    60.00      1351.12   75.26    35.80    18.251  4.341  
>  87380  16384   1024    60.00      1955.31   74.73    37.67    12.523  3.156  
>  87380  16384   2048    60.01      3411.92   74.82    39.49    7.186   1.896  
> 
> big buffers:
> Netperf test results
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.109 (10.66.91.109) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384     64    60.00       567.10   72.06    26.13    41.638  7.550  
>  87380  16384    256    60.00      1143.69   74.66    32.58    21.392  4.667  
>  87380  16384    512    60.00      1460.92   73.94    33.40    16.585  3.746  
>  87380  16384   1024    60.00      3454.85   77.49    33.89    7.349   1.607  
>  87380  16384   2048    60.00      3781.11   76.51    38.38    6.631   1.663  
> 
> With these patches:
> 
> mergeable buffers:
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.236 (10.66.91.236) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384     64    60.00       657.53   71.27    26.42    35.517  6.583  
>  87380  16384    256    60.00      1217.73   74.34    34.67    20.004  4.665  
>  87380  16384    512    60.00      1575.25   75.27    37.12    15.658  3.861  
>  87380  16384   1024    60.00      2416.07   74.77    37.20    10.140  2.522  
>  87380  16384   2048    60.00      3702.29   77.31    36.29    6.842   1.606  
> 
> big buffers:
> Netperf test results
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.202 (10.66.91.202) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384     64    60.00       647.67   71.86    27.26    36.356  6.895  
>  87380  16384    256    60.00      1265.82   76.19    36.54    19.724  4.729  
>  87380  16384    512    60.00      1796.64   76.06    39.48    13.872  3.601  
>  87380  16384   1024    60.00      4008.37   77.05    36.47    6.299   1.491  
>  87380  16384   2048    60.00      4468.56   75.18    41.79    5.513   1.532 
> 
> It looks like the unification does not hurt performance, and with these patches
> we get some improvement. BTW, the regression with mergeable buffers still
> exists.
> 
>  > > ---
>  > >  drivers/vhost/net.c |  128 +++------------------------------------------------
>  > >  1 files changed, 7 insertions(+), 121 deletions(-)
>  > > 
>  > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>  > > index 95e49de..c32a2e4 100644
>  > > --- a/drivers/vhost/net.c
>  > > +++ b/drivers/vhost/net.c
>  > > @@ -227,6 +227,7 @@ static int peek_head_len(struct sock *sk)
>  > >   * @iovcount	- returned count of io vectors we fill
>  > >   * @log		- vhost log
>  > >   * @log_num	- log offset
>  > > + * @quota       - headcount quota, 1 for big buffer
>  > >   *	returns number of buffer heads allocated, negative on error
>  > >   */
>  > >  static int get_rx_bufs(struct vhost_virtqueue *vq,
>  > > @@ -234,7 +235,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>  > >  		       int datalen,
>  > >  		       unsigned *iovcount,
>  > >  		       struct vhost_log *log,
>  > > -		       unsigned *log_num)
>  > > +		       unsigned *log_num,
>  > > +		       unsigned int quota)
>  > >  {
>  > >  	unsigned int out, in;
>  > >  	int seg = 0;
>  > > @@ -242,7 +244,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>  > >  	unsigned d;
>  > >  	int r, nlogs = 0;
>  > >  
>  > > -	while (datalen > 0) {
>  > > +	while (datalen > 0 && headcount < quota) {
>  > >  		if (unlikely(seg >= UIO_MAXIOV)) {
>  > >  			r = -ENOBUFS;
>  > >  			goto err;
>  > > @@ -282,116 +284,7 @@ err:
>  > >  
>  > >  /* Expects to be always run from workqueue - which acts as
>  > >   * read-size critical section for our kind of RCU. */
>  > > -static void handle_rx_big(struct vhost_net *net)
>  > > -{
>  > > -	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
>  > > -	unsigned out, in, log, s;
>  > > -	int head;
>  > > -	struct vhost_log *vq_log;
>  > > -	struct msghdr msg = {
>  > > -		.msg_name = NULL,
>  > > -		.msg_namelen = 0,
>  > > -		.msg_control = NULL, /* FIXME: get and handle RX aux data. */
>  > > -		.msg_controllen = 0,
>  > > -		.msg_iov = vq->iov,
>  > > -		.msg_flags = MSG_DONTWAIT,
>  > > -	};
>  > > -
>  > > -	struct virtio_net_hdr hdr = {
>  > > -		.flags = 0,
>  > > -		.gso_type = VIRTIO_NET_HDR_GSO_NONE
>  > > -	};
>  > > -
>  > > -	size_t len, total_len = 0;
>  > > -	int err;
>  > > -	size_t hdr_size;
>  > > -	struct socket *sock = rcu_dereference(vq->private_data);
>  > > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
>  > > -		return;
>  > > -
>  > > -	mutex_lock(&vq->mutex);
>  > > -	vhost_disable_notify(vq);
>  > > -	hdr_size = vq->vhost_hlen;
>  > > -
>  > > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
>  > > -		vq->log : NULL;
>  > > -
>  > > -	for (;;) {
>  > > -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  > > -					 ARRAY_SIZE(vq->iov),
>  > > -					 &out, &in,
>  > > -					 vq_log, &log);
>  > > -		/* On error, stop handling until the next kick. */
>  > > -		if (unlikely(head < 0))
>  > > -			break;
>  > > -		/* OK, now we need to know about added descriptors. */
>  > > -		if (head == vq->num) {
>  > > -			if (unlikely(vhost_enable_notify(vq))) {
>  > > -				/* They have slipped one in as we were
>  > > -				 * doing that: check again. */
>  > > -				vhost_disable_notify(vq);
>  > > -				continue;
>  > > -			}
>  > > -			/* Nothing new?  Wait for eventfd to tell us
>  > > -			 * they refilled. */
>  > > -			break;
>  > > -		}
>  > > -		/* We don't need to be notified again. */
>  > > -		if (out) {
>  > > -			vq_err(vq, "Unexpected descriptor format for RX: "
>  > > -			       "out %d, int %d\n",
>  > > -			       out, in);
>  > > -			break;
>  > > -		}
>  > > -		/* Skip header. TODO: support TSO/mergeable rx buffers. */
>  > > -		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
>  > > -		msg.msg_iovlen = in;
>  > > -		len = iov_length(vq->iov, in);
>  > > -		/* Sanity check */
>  > > -		if (!len) {
>  > > -			vq_err(vq, "Unexpected header len for RX: "
>  > > -			       "%zd expected %zd\n",
>  > > -			       iov_length(vq->hdr, s), hdr_size);
>  > > -			break;
>  > > -		}
>  > > -		err = sock->ops->recvmsg(NULL, sock, &msg,
>  > > -					 len, MSG_DONTWAIT | MSG_TRUNC);
>  > > -		/* TODO: Check specific error and bomb out unless EAGAIN? */
>  > > -		if (err < 0) {
>  > > -			vhost_discard_vq_desc(vq, 1);
>  > > -			break;
>  > > -		}
>  > > -		/* TODO: Should check and handle checksum. */
>  > > -		if (err > len) {
>  > > -			pr_debug("Discarded truncated rx packet: "
>  > > -				 " len %d > %zd\n", err, len);
>  > > -			vhost_discard_vq_desc(vq, 1);
>  > > -			continue;
>  > > -		}
>  > > -		len = err;
>  > > -		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
>  > > -		if (err) {
>  > > -			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
>  > > -			       vq->iov->iov_base, err);
>  > > -			break;
>  > > -		}
>  > > -		len += hdr_size;
>  > > -		vhost_add_used_and_signal(&net->dev, vq, head, len);
>  > > -		if (unlikely(vq_log))
>  > > -			vhost_log_write(vq, vq_log, log, len);
>  > > -		total_len += len;
>  > > -		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  > > -			vhost_poll_queue(&vq->poll);
>  > > -			break;
>  > > -		}
>  > > -	}
>  > > -
>  > > -	mutex_unlock(&vq->mutex);
>  > > -}
>  > > -
>  > > -/* Expects to be always run from workqueue - which acts as
>  > > - * read-size critical section for our kind of RCU. */
>  > > -static void handle_rx_mergeable(struct vhost_net *net)
>  > > +static void handle_rx(struct vhost_net *net)
>  > >  {
>  > >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
>  > >  	unsigned uninitialized_var(in), log;
>  > > @@ -431,7 +324,8 @@ static void handle_rx_mergeable(struct vhost_net *net)
>  > >  		sock_len += sock_hlen;
>  > >  		vhost_len = sock_len + vhost_hlen;
>  > >  		headcount = get_rx_bufs(vq, vq->heads, vhost_len,
>  > > -					&in, vq_log, &log);
>  > > +					&in, vq_log, &log,
>  > > +					likely(mergeable) ? UIO_MAXIOV : 1);
>  > >  		/* On error, stop handling until the next kick. */
>  > >  		if (unlikely(headcount < 0))
>  > >  			break;
>  > > @@ -497,14 +391,6 @@ static void handle_rx_mergeable(struct vhost_net *net)
>  > >  	mutex_unlock(&vq->mutex);
>  > >  }
>  > >  
>  > > -static void handle_rx(struct vhost_net *net)
>  > > -{
>  > > -	if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
>  > > -		handle_rx_mergeable(net);
>  > > -	else
>  > > -		handle_rx_big(net);
>  > > -}
>  > > -
>  > >  static void handle_tx_kick(struct vhost_work *work)
>  > >  {
>  > >  	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,

^ permalink raw reply
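
The quota idea from the patch description can be sketched outside the kernel. The following is a simplified model under stated assumptions, not the vhost code: toy_get_rx_bufs(), buf_len[] and nbufs are hypothetical, and descriptor fetching, logging and error unwinding are left out. Only the loop condition mirrors the "datalen > 0 && headcount < quota" test in the diff.

```c
#include <stddef.h>

/*
 * Toy stand-in for get_rx_bufs(): consume receive buffers ("heads")
 * until datalen bytes are covered or quota heads have been taken.
 * buf_len[i] is the capacity of the i-th available buffer.
 * Returns the number of heads used, or -1 if the buffers run out
 * while data remains and the quota has not been reached.
 */
static int toy_get_rx_bufs(const size_t *buf_len, size_t nbufs,
                           long datalen, unsigned int quota)
{
    unsigned int headcount = 0;

    while (datalen > 0 && headcount < quota) {
        if (headcount >= nbufs)
            return -1;              /* no more buffers available */
        datalen -= (long)buf_len[headcount];
        headcount++;
    }
    return (int)headcount;
}
```

In this model a large quota lets one packet span several buffers, which corresponds to the mergeable case, while a quota of 1 stops after the first buffer, which corresponds to the big-buffer case. Whether the extra loop condition costs anything measurable is exactly the question raised above, and the sketch says nothing about that.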

* Re: [PATCH 1/3] vhost-net: check the support of mergeable buffer outside the receive loop
From: Michael S. Tsirkin @ 2011-01-18  4:36 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, kvm, linux-kernel
In-Reply-To: <19765.5737.352179.50100@gargle.gargle.HOWL>

On Tue, Jan 18, 2011 at 12:26:17PM +0800, Jason Wang wrote:
> Michael S. Tsirkin writes:
>  > On Mon, Jan 17, 2011 at 04:10:59PM +0800, Jason Wang wrote:
>  > > No need to check the support of mergeable buffer inside the receive
>  > > loop as the whole handle_rx_xx() is in the read critical region.  So
>  > > this patch moves it ahead of the receiving loop.
>  > > 
>  > > Signed-off-by: Jason Wang <jasowang@redhat.com>
>  > 
>  > Well feature check is mostly just features & bit
>  > so why would it be slower? Because of the rcu
>  > adding memory barriers? Do you observe a
>  > measurable speedup with this patch?
>  > 
> 
> I have not measured the performance of this patch alone; the difference may
> not be noticeable. It also helps with the code unification.
> 
>  > Apropos, I noticed that the check in vhost_has_feature
>  > is wrong: it uses the same kind of RCU as the
>  > private pointer. So we'll have to fix that properly
>  > by adding more lockdep classes, but for now
>  > we'll need to make
>  > the check 1 || lockdep_is_held(&dev->mutex);
>  > and add a TODO.
>  > 
> 
> Not sure, lockdep_is_held(&dev->mutex) may not be accurate but is sufficient,
> as it was always held in the read critical region.

Not really, when we call vhost_has_feature from the vq handling thread
it's not.

>  > > ---
>  > >  drivers/vhost/net.c |    5 +++--
>  > >  1 files changed, 3 insertions(+), 2 deletions(-)
>  > > 
>  > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>  > > index 14fc189..95e49de 100644
>  > > --- a/drivers/vhost/net.c
>  > > +++ b/drivers/vhost/net.c
>  > > @@ -411,7 +411,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
>  > >  	};
>  > >  
>  > >  	size_t total_len = 0;
>  > > -	int err, headcount;
>  > > +	int err, headcount, mergeable;
>  > >  	size_t vhost_hlen, sock_hlen;
>  > >  	size_t vhost_len, sock_len;
>  > >  	struct socket *sock = rcu_dereference(vq->private_data);
>  > > @@ -425,6 +425,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
>  > >  
>  > >  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
>  > >  		vq->log : NULL;
>  > > +	mergeable = vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF);
>  > >  
>  > >  	while ((sock_len = peek_head_len(sock->sk))) {
>  > >  		sock_len += sock_hlen;
>  > > @@ -474,7 +475,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
>  > >  			break;
>  > >  		}
>  > >  		/* TODO: Should check and handle checksum. */
>  > > -		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
>  > > +		if (likely(mergeable) &&
>  > >  		    memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
>  > >  				      offsetof(typeof(hdr), num_buffers),
>  > >  				      sizeof hdr.num_buffers)) {

^ permalink raw reply

* Re: [PATCH 1/3] vhost-net: check the support of mergeable buffer outside the receive loop
From: Jason Wang @ 2011-01-18  4:26 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, virtualization, netdev, kvm, linux-kernel
In-Reply-To: <20110117084644.GB23479@redhat.com>

Michael S. Tsirkin writes:
 > On Mon, Jan 17, 2011 at 04:10:59PM +0800, Jason Wang wrote:
 > > No need to check the support of mergeable buffer inside the receive
 > > loop as the whole handle_rx_xx() is in the read critical region.  So
 > > this patch moves it ahead of the receiving loop.
 > > 
 > > Signed-off-by: Jason Wang <jasowang@redhat.com>
 > 
 > Well feature check is mostly just features & bit
 > so why would it be slower? Because of the rcu
 > adding memory barriers? Do you observe a
 > measurable speedup with this patch?
 > 

I have not measured the performance of this patch alone; the difference may
not be noticeable. It also helps with the code unification.

 > Apropos, I noticed that the check in vhost_has_feature
 > is wrong: it uses the same kind of RCU as the
 > private pointer. So we'll have to fix that properly
 > by adding more lockdep classes, but for now
 > we'll need to make
 > the check 1 || lockdep_is_held(&dev->mutex);
 > and add a TODO.
 > 

Not sure, lockdep_is_held(&dev->mutex) may not be accurate but is sufficient,
as it was always held in the read critical region.

 > > ---
 > >  drivers/vhost/net.c |    5 +++--
 > >  1 files changed, 3 insertions(+), 2 deletions(-)
 > > 
 > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 > > index 14fc189..95e49de 100644
 > > --- a/drivers/vhost/net.c
 > > +++ b/drivers/vhost/net.c
 > > @@ -411,7 +411,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >  	};
 > >  
 > >  	size_t total_len = 0;
 > > -	int err, headcount;
 > > +	int err, headcount, mergeable;
 > >  	size_t vhost_hlen, sock_hlen;
 > >  	size_t vhost_len, sock_len;
 > >  	struct socket *sock = rcu_dereference(vq->private_data);
 > > @@ -425,6 +425,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >  
 > >  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 > >  		vq->log : NULL;
 > > +	mergeable = vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF);
 > >  
 > >  	while ((sock_len = peek_head_len(sock->sk))) {
 > >  		sock_len += sock_hlen;
 > > @@ -474,7 +475,7 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >  			break;
 > >  		}
 > >  		/* TODO: Should check and handle checksum. */
 > > -		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
 > > +		if (likely(mergeable) &&
 > >  		    memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
 > >  				      offsetof(typeof(hdr), num_buffers),
 > >  				      sizeof hdr.num_buffers)) {

^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: John Fastabend @ 2011-01-18  3:16 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Oleg V. Ukhno, netdev@vger.kernel.org, David S. Miller
In-Reply-To: <26330.1295049912@death>

On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
> Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>> Jay Vosburgh wrote:
>>
>>> 	This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
>>> (f), which requires that all frames of a given "conversation" are passed
>>> to a single port.
>>>
>>> 	The existing layer3+4 hash has a similar problem (that it may
>>> send packets from a conversation to multiple ports), but for that case
>>> it's an unlikely exception (only in the case of IP fragmentation), but
>>> here it's the norm.  At a minimum, this must be clearly documented.
>>>
>>> 	Also, what does a round robin in 802.3ad provide that the
>>> existing round robin does not?  My presumption is that you're looking to
>>> get the aggregator autoconfiguration that 802.3ad provides, but you
>>> don't say.
> 
> 	I'm still curious about this question.  Given the rather
> intricate setup of your particular network (described below), I'm not
> sure why 802.3ad is of benefit over traditional etherchannel
> (balance-rr / balance-xor).
> 
>>> 	I don't necessarily think this is a bad cheat (round robining on
>>> 802.3ad as an explicit non-standard extension), since everybody wants to
>>> stripe their traffic across multiple slaves.  I've given some thought to
>>> making round robin into just another hash mode, but this also does some
>>> magic to the MAC addresses of the outgoing frames (more on that below).
>> Yes, I am resetting MAC addresses when transmitting packets to have switch
>> to put packets into different ports of the receiving etherchannel.
> 
> 	By "etherchannel" do you really mean "Cisco switch with a
> port-channel group using LACP"?
> 
>> I am using this patch to provide full-mesh ISCSI connectivity between at
>> least 4 hosts (all hosts of course are in same ethernet segment) and every
>> host is connected with aggregate link with 4 slaves(usually).
>> Using round-robin I provide near-equal load striping when transmitting,
>> using MAC address magic I force switch to stripe packets over all slave
>> links in destination port-channel(when number of rx-ing slaves is equal to
>> number of tx-ing slaves and is even).
> 
> 	By "MAC address magic" do you mean that you're assigning
> specifically chosen MAC addresses to the slaves so that the switch's
> hash is essentially "assigning" the bonding slaves to particular ports
> on the outgoing port-channel group?
> 
> 	Assuming that this is the case, it's an interesting idea, but
> I'm unconvinced that it's better on 802.3ad vs. balance-rr.  Unless I'm
> missing something, you can get everything you need from an option to
> have balance-rr / balance-xor utilize the slave's permanent address as
> the source address for outgoing traffic.
> 
>> [...] So I am able to utilize all slaves
>> for tx and for rx up to maximum capacity; besides I am getting L2 link
>> failure detection (and load rebalancing), which is (in my opinion) much
>> faster and robust than L3 or than dm-multipath provides.
>> It's my idea with the patch
> 
> 	Can somebody (John?) more knowledgeable than I about dm-multipath
> comment on the above?

Here I'll give it a go.

I don't think detecting L2 link failure this way is very robust. If there
is a failure farther away than your immediate link, you're going to break
completely? Your bonding hash will continue to round robin the iscsi
packets and half of them will get dropped on the floor. dm-multipath handles
this reasonably gracefully. Also in this bonding environment you seem to
be very sensitive to RTT times on the network. Maybe not bad outright, but
I wouldn't consider this robust either.

You could tweak your scsi timeout values and fail_fast values, set the io
retry to 0 to cause the fail over to occur faster. I suspect you already
did this and still it is too slow? Maybe adding a checker in multipathd to
listen for link events would be fast enough. The checker could then fail
the path immediately.

I'll try to address your comments from the other thread here. In general I
wonder if it would be better to solve the problems in dm-multipath rather than
add another bonding mode?

OVU - it is slow (I am using ISCSI for Oracle, so I need to minimize latency)

The dm-multipath layer is adding latency? How much? If this is really true,
maybe it's best to address the real issue here and not avoid it by
using the bonding layer.

OVU - it handles any link failure badly, because of its command queue
limitation (all queued commands above 32 are discarded in case of path
failure, as I remember)

Maybe true, but only link failures with the immediate peer are handled
by a bonding strategy. By working at the block layer we can detect
failures throughout the path. I would need to look into this again. I
know that when we were looking at this some time ago there was some talk about
improving this behavior. I need to take some time to go back through the
error recovery stuff to remember how this works.

OVU - it performs very badly when there are many devices and many paths (I was
unable to utilize more than 2Gbps of 4 even with 100 disks with 4 paths
per disk)

Hmm, well, that sounds like something is broken. I'll try this setup when
I get some time in the next few days. This really shouldn't be the case: dm-multipath
should not add a bunch of extra latency or affect throughput significantly.
By the way, what are you seeing without mpio?

Thanks,
John

^ permalink raw reply

* Re: [PATCH v2] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Po-Yu Chuang @ 2011-01-18  3:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, ratbert, bhutchings, joe, dilinger
In-Reply-To: <1295288462.3335.55.camel@edumazet-laptop>

Dear Eric,

On Tue, Jan 18, 2011 at 2:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> BTW, shouldn't you use cpu_to_be32() or cpu_to_le32(), if this driver is
> multi-platform?

This reminds me of another thing. Should I use u32 instead of unsigned int for all
hardware-related variables (registers, descriptors)?
I'm not quite sure about these cross-platform issues.

best regards,
Po-Yu Chuang

^ permalink raw reply
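
The portability concern behind cpu_to_le32()/cpu_to_be32() and fixed-width types can be shown in plain userspace C. This is a sketch under assumptions: put_le32() and get_le32() are hypothetical helpers, not kernel APIs, and little-endian is chosen purely for illustration, since the thread does not say which byte order the FTMAC100 descriptors use.

```c
#include <stdint.h>

/*
 * Store a 32-bit value into a byte buffer in little-endian order,
 * independent of the host CPU's byte order. uint32_t is used rather
 * than unsigned int so that the width is exactly 32 bits on every
 * platform.
 */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read a little-endian 32-bit value back into host order. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}
```

Because the shifts operate on values rather than on memory layout, the same bytes come out on big-endian and little-endian hosts. In a driver, the kernel's own conversion helpers and fixed-width types would normally take the place of helpers like these.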

* Re: [PATCH 2/3] vhost-net: Unify the code of mergeable and big buffer handling
From: Jason Wang @ 2011-01-18  3:05 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, virtualization, netdev, kvm, linux-kernel
In-Reply-To: <20110117083644.GA23479@redhat.com>

Michael S. Tsirkin writes:
 > On Mon, Jan 17, 2011 at 04:11:08PM +0800, Jason Wang wrote:
 > > Codes duplication were found between the handling of mergeable and big
 > > buffers, so this patch tries to unify them. This could be easily done
 > > by adding a quota to the get_rx_bufs() which is used to limit the
 > > number of buffers it returns (for mergeable buffer, the quota is
 > > simply UIO_MAXIOV, for big buffers, the quota is just 1), and then the
 > > previous handle_rx_mergeable() could also be reused for big buffers.
 > > 
 > > Signed-off-by: Jason Wang <jasowang@redhat.com>
 > 
 > We actually started this way. However the code turned out
 > to have measurable overhead when handle_rx_mergeable
 > is called with quota 1.
 > So before applying this I'd like to see some data
 > to show this is not the case anymore.
 > 

I've run a round of tests (Host->Guest) for these three patches on my desktop:

Without these patches

mergeable buffers:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.42 (10.66.91.42) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00       575.87   69.20    26.36    39.375  7.499  
 87380  16384    256    60.01      1123.57   73.16    34.73    21.335  5.064  
 87380  16384    512    60.00      1351.12   75.26    35.80    18.251  4.341  
 87380  16384   1024    60.00      1955.31   74.73    37.67    12.523  3.156  
 87380  16384   2048    60.01      3411.92   74.82    39.49    7.186   1.896  

big buffers:
Netperf test results
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.109 (10.66.91.109) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00       567.10   72.06    26.13    41.638  7.550  
 87380  16384    256    60.00      1143.69   74.66    32.58    21.392  4.667  
 87380  16384    512    60.00      1460.92   73.94    33.40    16.585  3.746  
 87380  16384   1024    60.00      3454.85   77.49    33.89    7.349   1.607  
 87380  16384   2048    60.00      3781.11   76.51    38.38    6.631   1.663  

With these patches:

mergeable buffers:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.236 (10.66.91.236) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00       657.53   71.27    26.42    35.517  6.583  
 87380  16384    256    60.00      1217.73   74.34    34.67    20.004  4.665  
 87380  16384    512    60.00      1575.25   75.27    37.12    15.658  3.861  
 87380  16384   1024    60.00      2416.07   74.77    37.20    10.140  2.522  
 87380  16384   2048    60.00      3702.29   77.31    36.29    6.842   1.606  

big buffers:
Netperf test results
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.202 (10.66.91.202) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00       647.67   71.86    27.26    36.356  6.895  
 87380  16384    256    60.00      1265.82   76.19    36.54    19.724  4.729  
 87380  16384    512    60.00      1796.64   76.06    39.48    13.872  3.601  
 87380  16384   1024    60.00      4008.37   77.05    36.47    6.299   1.491  
 87380  16384   2048    60.00      4468.56   75.18    41.79    5.513   1.532 

It looks like the unification does not hurt performance, and with these patches
we get some improvement. BTW, the regression with mergeable buffers still
exists.
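
To put numbers on "some improvement", the throughput columns of the tables above can be compared row by row. The sketch below only does arithmetic on the figures already posted. The helper names are hypothetical, and it assumes the runs are directly comparable, which is uncertain given that each run targets a different guest address and reflects a single pass on a desktop.

```c
#include <stddef.h>

/* Throughput in 10^6 bits/s, copied from the tables above:
 * five mergeable-buffer rows followed by five big-buffer rows. */
static const double before_mbps[10] = {
    575.87, 1123.57, 1351.12, 1955.31, 3411.92,
    567.10, 1143.69, 1460.92, 3454.85, 3781.11
};
static const double after_mbps[10] = {
    657.53, 1217.73, 1575.25, 2416.07, 3702.29,
    647.67, 1265.82, 1796.64, 4008.37, 4468.56
};

/* Relative change from before to after, in percent. */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Smallest percent change across n rows (n must be at least 1). */
static double min_pct_change(const double *before, const double *after,
                             size_t n)
{
    double min = pct_change(before[0], after[0]);

    for (size_t i = 1; i < n; i++) {
        double c = pct_change(before[i], after[i]);
        if (c < min)
            min = c;
    }
    return min;
}
```

Calling min_pct_change(before_mbps, after_mbps, 10) gives the weakest of the ten comparisons. A positive result is consistent with the conclusion above that throughput did not drop in any row, though it does not separate the effect of the individual patches.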

 > > ---
 > >  drivers/vhost/net.c |  128 +++------------------------------------------------
 > >  1 files changed, 7 insertions(+), 121 deletions(-)
 > > 
 > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 > > index 95e49de..c32a2e4 100644
 > > --- a/drivers/vhost/net.c
 > > +++ b/drivers/vhost/net.c
 > > @@ -227,6 +227,7 @@ static int peek_head_len(struct sock *sk)
 > >   * @iovcount	- returned count of io vectors we fill
 > >   * @log		- vhost log
 > >   * @log_num	- log offset
 > > + * @quota       - headcount quota, 1 for big buffer
 > >   *	returns number of buffer heads allocated, negative on error
 > >   */
 > >  static int get_rx_bufs(struct vhost_virtqueue *vq,
 > > @@ -234,7 +235,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 > >  		       int datalen,
 > >  		       unsigned *iovcount,
 > >  		       struct vhost_log *log,
 > > -		       unsigned *log_num)
 > > +		       unsigned *log_num,
 > > +		       unsigned int quota)
 > >  {
 > >  	unsigned int out, in;
 > >  	int seg = 0;
 > > @@ -242,7 +244,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 > >  	unsigned d;
 > >  	int r, nlogs = 0;
 > >  
 > > -	while (datalen > 0) {
 > > +	while (datalen > 0 && headcount < quota) {
 > >  		if (unlikely(seg >= UIO_MAXIOV)) {
 > >  			r = -ENOBUFS;
 > >  			goto err;
 > > @@ -282,116 +284,7 @@ err:
 > >  
 > >  /* Expects to be always run from workqueue - which acts as
 > >   * read-size critical section for our kind of RCU. */
 > > -static void handle_rx_big(struct vhost_net *net)
 > > -{
 > > -	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 > > -	unsigned out, in, log, s;
 > > -	int head;
 > > -	struct vhost_log *vq_log;
 > > -	struct msghdr msg = {
 > > -		.msg_name = NULL,
 > > -		.msg_namelen = 0,
 > > -		.msg_control = NULL, /* FIXME: get and handle RX aux data. */
 > > -		.msg_controllen = 0,
 > > -		.msg_iov = vq->iov,
 > > -		.msg_flags = MSG_DONTWAIT,
 > > -	};
 > > -
 > > -	struct virtio_net_hdr hdr = {
 > > -		.flags = 0,
 > > -		.gso_type = VIRTIO_NET_HDR_GSO_NONE
 > > -	};
 > > -
 > > -	size_t len, total_len = 0;
 > > -	int err;
 > > -	size_t hdr_size;
 > > -	struct socket *sock = rcu_dereference(vq->private_data);
 > > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 > > -		return;
 > > -
 > > -	mutex_lock(&vq->mutex);
 > > -	vhost_disable_notify(vq);
 > > -	hdr_size = vq->vhost_hlen;
 > > -
 > > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 > > -		vq->log : NULL;
 > > -
 > > -	for (;;) {
 > > -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 > > -					 ARRAY_SIZE(vq->iov),
 > > -					 &out, &in,
 > > -					 vq_log, &log);
 > > -		/* On error, stop handling until the next kick. */
 > > -		if (unlikely(head < 0))
 > > -			break;
 > > -		/* OK, now we need to know about added descriptors. */
 > > -		if (head == vq->num) {
 > > -			if (unlikely(vhost_enable_notify(vq))) {
 > > -				/* They have slipped one in as we were
 > > -				 * doing that: check again. */
 > > -				vhost_disable_notify(vq);
 > > -				continue;
 > > -			}
 > > -			/* Nothing new?  Wait for eventfd to tell us
 > > -			 * they refilled. */
 > > -			break;
 > > -		}
 > > -		/* We don't need to be notified again. */
 > > -		if (out) {
 > > -			vq_err(vq, "Unexpected descriptor format for RX: "
 > > -			       "out %d, int %d\n",
 > > -			       out, in);
 > > -			break;
 > > -		}
 > > -		/* Skip header. TODO: support TSO/mergeable rx buffers. */
 > > -		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
 > > -		msg.msg_iovlen = in;
 > > -		len = iov_length(vq->iov, in);
 > > -		/* Sanity check */
 > > -		if (!len) {
 > > -			vq_err(vq, "Unexpected header len for RX: "
 > > -			       "%zd expected %zd\n",
 > > -			       iov_length(vq->hdr, s), hdr_size);
 > > -			break;
 > > -		}
 > > -		err = sock->ops->recvmsg(NULL, sock, &msg,
 > > -					 len, MSG_DONTWAIT | MSG_TRUNC);
 > > -		/* TODO: Check specific error and bomb out unless EAGAIN? */
 > > -		if (err < 0) {
 > > -			vhost_discard_vq_desc(vq, 1);
 > > -			break;
 > > -		}
 > > -		/* TODO: Should check and handle checksum. */
 > > -		if (err > len) {
 > > -			pr_debug("Discarded truncated rx packet: "
 > > -				 " len %d > %zd\n", err, len);
 > > -			vhost_discard_vq_desc(vq, 1);
 > > -			continue;
 > > -		}
 > > -		len = err;
 > > -		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
 > > -		if (err) {
 > > -			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
 > > -			       vq->iov->iov_base, err);
 > > -			break;
 > > -		}
 > > -		len += hdr_size;
 > > -		vhost_add_used_and_signal(&net->dev, vq, head, len);
 > > -		if (unlikely(vq_log))
 > > -			vhost_log_write(vq, vq_log, log, len);
 > > -		total_len += len;
 > > -		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 > > -			vhost_poll_queue(&vq->poll);
 > > -			break;
 > > -		}
 > > -	}
 > > -
 > > -	mutex_unlock(&vq->mutex);
 > > -}
 > > -
 > > -/* Expects to be always run from workqueue - which acts as
 > > - * read-size critical section for our kind of RCU. */
 > > -static void handle_rx_mergeable(struct vhost_net *net)
 > > +static void handle_rx(struct vhost_net *net)
 > >  {
 > >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 > >  	unsigned uninitialized_var(in), log;
 > > @@ -431,7 +324,8 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >  		sock_len += sock_hlen;
 > >  		vhost_len = sock_len + vhost_hlen;
 > >  		headcount = get_rx_bufs(vq, vq->heads, vhost_len,
 > > -					&in, vq_log, &log);
 > > +					&in, vq_log, &log,
 > > +					likely(mergeable) ? UIO_MAXIOV : 1);
 > >  		/* On error, stop handling until the next kick. */
 > >  		if (unlikely(headcount < 0))
 > >  			break;
 > > @@ -497,14 +391,6 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >  	mutex_unlock(&vq->mutex);
 > >  }
 > >  
 > > -static void handle_rx(struct vhost_net *net)
 > > -{
 > > -	if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
 > > -		handle_rx_mergeable(net);
 > > -	else
 > > -		handle_rx_big(net);
 > > -}
 > > -
 > >  static void handle_tx_kick(struct vhost_work *work)
 > >  {
 > >  	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,


* Re: [PATCH] ethtool : Add option -L | --set-common to set common flags.
From: Ben Hutchings @ 2011-01-18  2:59 UTC (permalink / raw)
  To: Mahesh Bandewar; +Cc: David Miller, Tom Herbert, Laurent Chavey, netdev
In-Reply-To: <AANLkTiko=12CLgG_MMPshfwSRt7mZA+3DW+GRtxhaxh=@mail.gmail.com>

On Mon, 2011-01-17 at 18:17 -0800, Mahesh Bandewar wrote:
> On Fri, Jan 14, 2011 at 1:19 PM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
> > On Thu, 2011-01-13 at 16:11 -0800, Mahesh Bandewar wrote:
[...]
> >> +static int do_scommon(int fd, struct ifreq *ifr)
> >> +{
> >> +     struct ethtool_value eval;
> >> +
> >> +     if (common_flags_mask) {
> >> +             eval.cmd = ETHTOOL_GFLAGS;
> >> +             eval.data = 0;
> >> +             ifr->ifr_data = (caddr_t)&eval;
> >> +             if (ioctl(fd, SIOCETHTOOL, ifr)) {
> >> +                     perror("Cannot get device common flags");
> >> +                     return 1;
> >> +             }
> >> +
> >> +             eval.cmd = ETHTOOL_SFLAGS;
> >> +             eval.data =
> >> +                 ((eval.data & ~(common_flags_mask | off_flags_mask)) |
> >> +                  (common_flags_wanted | off_flags_wanted));
> >
> > Why should this use off_flags_mask and off_flags_wanted?  They should
> > both be 0 if this function is called.
> >
> That is right! Actually the get (ETHTOOL_GFLAGS) operation confused
> me. I thought the values were fetched and *preserved* while setting the
> new value. But when I looked at it carefully, that is not the case.
> Actually, why is that ioctl() with ETHTOOL_GFLAGS required?
[...]

This is a read-modify-write operation.  We have to:

1. Parse the options to find out which flags are to be changed
(common_flags_mask) and the wanted values (common_flags_wanted).
2. Read the current flags (ETHTOOL_GFLAGS reads them into eval.data).
3. Modify the flags (eval.data = ...).
4. Write the new flags (ETHTOOL_SFLAGS).
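
The modify step (3) can be sketched in isolation. This is a minimal
sketch, not code from ethtool: merge_flags() is a hypothetical helper,
and the mask/wanted arguments stand in for common_flags_mask and
common_flags_wanted. It assumes the parser never sets a wanted bit
outside the mask.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of ethtool): combine the flags read
 * back from the device with the changes requested on the command line.
 *
 *   current - value read via ETHTOOL_GFLAGS
 *   mask    - bits the user asked to change
 *   wanted  - new values for those bits (must be a subset of mask)
 */
static uint32_t merge_flags(uint32_t current, uint32_t mask, uint32_t wanted)
{
	/* Assumption: the option parser only sets wanted bits inside mask. */
	assert((wanted & ~mask) == 0);

	/* Clear the bits being changed, keep all others, then set the new values. */
	return (current & ~mask) | wanted;
}
```

Every bit outside the mask passes through unchanged, which is why the
current value has to be read first: ETHTOOL_SFLAGS writes the whole
word, not individual bits.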

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: [PATCH] ethtool : Add option -L | --set-common to set common flags.
From: Mahesh Bandewar @ 2011-01-18  2:17 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, Tom Herbert, Laurent Chavey, netdev
In-Reply-To: <1295039984.5386.19.camel@bwh-desktop>

On Fri, Jan 14, 2011 at 1:19 PM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Thu, 2011-01-13 at 16:11 -0800, Mahesh Bandewar wrote:
>> This patch adds -L | --set-common option to add / remove common flags which
>> includes loopback flag. The -l | --show-common displays the current values
>> for these common flags.
>>
>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>> ---
>>  ethtool-copy.h |    1 +
>>  ethtool.8      |   16 ++++++++++
>>  ethtool.c      |   90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 105 insertions(+), 2 deletions(-)
>>
>> diff --git a/ethtool-copy.h b/ethtool-copy.h
>> index 75c3ae7..5fd18c7 100644
>> --- a/ethtool-copy.h
>> +++ b/ethtool-copy.h
>> @@ -309,6 +309,7 @@ struct ethtool_perm_addr {
>>   * flag differs from the read-only value.
>>   */
>>  enum ethtool_flags {
>> +     ETH_FLAG_LOOPBACK       = (1 << 2),     /* Loopback enable / disable */
>>       ETH_FLAG_TXVLAN         = (1 << 7),     /* TX VLAN offload enabled */
>>       ETH_FLAG_RXVLAN         = (1 << 8),     /* RX VLAN offload enabled */
>>       ETH_FLAG_LRO            = (1 << 15),    /* LRO is enabled */
>> diff --git a/ethtool.8 b/ethtool.8
>> index 1760924..cf7128f 100644
>> --- a/ethtool.8
>> +++ b/ethtool.8
>> @@ -174,6 +174,13 @@ ethtool \- Display or change ethernet card settings
>>  .B2 txvlan on off
>>  .B2 rxhash on off
>>
>> +.B ethtool \-l|\-\-show\-common
>> +.I ethX
>> +
>> +.B ethtool \-L|\-\-set\-common
>> +.I ethX
>> +.B2 loopback on off
>> +
>>  .B ethtool \-p|\-\-identify
>>  .I ethX
>>  .RI [ N ]
>> @@ -406,6 +413,15 @@ Specifies whether TX VLAN acceleration should be enabled
>>  .A2 rxhash on off
>>  Specifies whether receive hashing offload should be enabled
>>  .TP
>> +.B \-l \-\-show\-common
>> +Queries the specified ethernet device for common flag settings.
>> +.TP
>> +.B \-L \-\-set\-common
>> +Changes the common parameters of the specified ethernet device.
>> +.TP
>> +.A2 loopback on off
>> +Specifies whether loopback should be enabled.
>> +.TP
>
> I've just gone through the manual page and changed 'ethernet device' to
> 'network device' for all generic operations; please follow that.  The
> source for the manual page was also renamed to ethtool.8.in as it now
> goes through autoconf substitution.
>
OK. I'll make those changes.

>>  .B \-p \-\-identify
>>  Initiates adapter-specific action intended to enable an operator to
>>  easily identify the adapter by sight.  Typically this involves
>> diff --git a/ethtool.c b/ethtool.c
>> index 63e0ead..1a0c10c 100644
>> --- a/ethtool.c
>> +++ b/ethtool.c
> [...]
>> @@ -1905,6 +1932,13 @@ static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso,
>>       return 0;
>>  }
>>
>> +static int dump_common_flags(int loopback)
>> +{
>> +     fprintf(stdout, "loopback: %s\n", loopback ? "on" : "off");
>> +
>> +     return 0;
>> +}
>> +
>>  static int dump_rxfhash(int fhash, u64 val)
>>  {
>>       switch (fhash) {
> [...]
>> @@ -2219,6 +2257,53 @@ static int do_scoalesce(int fd, struct ifreq *ifr)
>>       return 0;
>>  }
>>
>> +static int do_gcommon(int fd, struct ifreq *ifr)
>> +{
>> +     struct ethtool_value eval;
>> +     int loopback = 0;
>> +
>> +     fprintf(stdout, "Common flags for %s:\n", devname);
>> +
>> +     eval.cmd = ETHTOOL_GFLAGS;
>> +     ifr->ifr_data = (caddr_t)&eval;
>> +     if (ioctl(fd, SIOCETHTOOL, ifr)) {
>> +             perror("Cannot get device flags");
>> +     } else {
>> +             loopback = (eval.data & ETH_FLAG_LOOPBACK) != 0;
>> +     }
>> +
>> +     return dump_common_flags(loopback);
>
> Breaking up a bitmask into a list of flag parameters is fairly
> pointless.  I realise do_goffload() and dump_offload() do that but I am
> just waiting for Michał Mirosław's changes to offload flags to be
> settled before I fix them.
>
>> +}
>> +
>> +static int do_scommon(int fd, struct ifreq *ifr)
>> +{
>> +     struct ethtool_value eval;
>> +
>> +     if (common_flags_mask) {
>> +             eval.cmd = ETHTOOL_GFLAGS;
>> +             eval.data = 0;
>> +             ifr->ifr_data = (caddr_t)&eval;
>> +             if (ioctl(fd, SIOCETHTOOL, ifr)) {
>> +                     perror("Cannot get device common flags");
>> +                     return 1;
>> +             }
>> +
>> +             eval.cmd = ETHTOOL_SFLAGS;
>> +             eval.data =
>> +                 ((eval.data & ~(common_flags_mask | off_flags_mask)) |
>> +                  (common_flags_wanted | off_flags_wanted));
>
> Why should this use off_flags_mask and off_flags_wanted?  They should
> both be 0 if this function is called.
>
That is right! Actually the get (ETHTOOL_GFLAGS) operation confused
me. I thought the values were fetched and *preserved* while setting the
new value. But when I looked at it carefully, that is not the case.
Actually, why is that ioctl() with ETHTOOL_GFLAGS required?

>> +             if (ioctl(fd, SIOCETHTOOL, ifr)) {
>> +                     perror("Cannot set device common flags");
>> +                     return 1;
>> +             }
>> +     } else {
>> +             fprintf(stdout, "No common settings changed\n");
>> +     }
>> +
>> +     return 0;
>> +}
>> +
>>  static int do_goffload(int fd, struct ifreq *ifr)
>>  {
>>       struct ethtool_value eval;
>> @@ -2407,8 +2492,9 @@ static int do_soffload(int fd, struct ifreq *ifr)
>>               }
>>
>>               eval.cmd = ETHTOOL_SFLAGS;
>> -             eval.data = ((eval.data & ~off_flags_mask) |
>> -                          off_flags_wanted);
>> +             eval.data =
>> +                 ((eval.data & ~(off_flags_mask | common_flags_mask)) |
>> +                  (off_flags_wanted | common_flags_wanted));
>
> Similarly, why should this use common_flags_mask and
> common_flags_wanted?

For the same (wrong) reason mentioned above. I'll correct it and post
the new patch.

Thanks,
--mahesh..
>
> Ben.
>
>>
>>               err = ioctl(fd, SIOCETHTOOL, ifr);
>>               if (err) {
>
> --
> Ben Hutchings, Senior Software Engineer, Solarflare Communications
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
>


* Re: Realtek r8168C / r8169 driver VLAN TAG stripping
From: Francois Romieu @ 2011-01-18  1:21 UTC (permalink / raw)
  To: Anand Raj Manickam; +Cc: netdev, Hayes
In-Reply-To: <AANLkTikoXfgHrJyMbe6A_HmvXT_QRo55w76st5Wo0hSv@mail.gmail.com>

Anand Raj Manickam <anandrm@gmail.com> :
> On Mon, Jan 17, 2011 at 11:52 AM, Anand Raj Manickam <anandrm@gmail.com> wrote:
[...]
> > This is the dmesg  for XID
> >
> > eth0: RTL8168c/8111c at 0xf9628000, 00:17:54:00:f6:62, XID 1c4000c0 IRQ 31
> > r8169: mac_version = 0x16

I do not have one of those (RTL_GIGA_MAC_VER_22) to check if it handles vlan
correctly yet.

> > r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> >
> > Unfortunately, I am not able to upgrade my kernel now. If there is a
> > fix for it, that would be great!

I doubt there is a lot of glamour/fortune/fame in backporting the 89 r8169
patches between v2.6.30 and v2.6.37 but you may help yourself and give it
a try.

-- 
Ueimor


* [PATCH] scm: provide full privilege set via SCM_PRIVILEGE
From: Casey Schaufler @ 2011-01-18  0:34 UTC (permalink / raw)
  To: netdev; +Cc: Casey Schaufler


Subject: [PATCH] scm: provide full privilege set via SCM_PRIVILEGE

The SCM mechanism currently provides interfaces for delivering
the uid/gid and the "security context" (LSM information) of the
peer on a UDS socket. All of the security credential information
is available, but there is no interface available to obtain it.
Further, the existing interfaces require that the user choose
between the uid/gid and the context, as the existing interfaces
are exclusive.

This patch introduces an additional interface that provides
a complete set of security information from the peer credential.
No additional work is required to provide the information
internally, it is all being passed, just not exposed.
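
For illustration only: the patch below formats the control message
payload as a single space-separated string of key=value fields
("euid=... uid=... egid=... gid=... grps=... cap-e=..."). A receiver
could pick out a numeric field roughly as follows. This is a sketch
under assumptions: priv_get_int() is a hypothetical userspace helper,
not part of the patch, and it assumes values contain no spaces.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace helper (not part of the patch): look up "key"
 * in a space-separated "key=value" string and parse the value as a
 * decimal integer. Returns 0 on success, -1 if the key is absent or
 * the value does not start with a number.
 */
static int priv_get_int(const char *buf, const char *key, long *out)
{
	size_t klen = strlen(key);
	const char *p = buf;

	while (*p) {
		/* p always points at the start of a field here. */
		if (strncmp(p, key, klen) == 0 && p[klen] == '=') {
			const char *val = p + klen + 1;
			char *end;
			long v = strtol(val, &end, 10);

			if (end == val)
				return -1;	/* no digits after '=' */
			*out = v;
			return 0;
		}
		/* Advance to the next field, if any. */
		p = strchr(p, ' ');
		if (!p)
			break;
		p++;
	}
	return -1;
}
```

Matching only at field boundaries keeps a lookup for "uid" from
hitting the "euid" field that precedes it in the string.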

Also sent to LKML and LSM lists. 

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
---

 include/asm-generic/socket.h |    1 +
 include/linux/net.h          |    1 +
 include/linux/socket.h       |    1 +
 include/net/scm.h            |   80 +++++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c              |   11 ++++++
 5 files changed, 93 insertions(+), 1 deletions(-)
diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 9a6115e..7aa8e84 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -64,4 +64,5 @@
 #define SO_DOMAIN		39
 
 #define SO_RXQ_OVFL             40
+#define SO_PASSPRIV		41
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/linux/net.h b/include/linux/net.h
index 16faa13..159a929 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -71,6 +71,7 @@ struct net;
 #define SOCK_NOSPACE		2
 #define SOCK_PASSCRED		3
 #define SOCK_PASSSEC		4
+#define SOCK_PASSPRIV		5
 
 #ifndef ARCH_HAS_SOCKET_TYPES
 /**
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 86b652f..e9cfd68 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -147,6 +147,7 @@ static inline struct cmsghdr * cmsg_nxthdr (struct msghdr *__msg, struct cmsghdr
 #define	SCM_RIGHTS	0x01		/* rw: access rights (array of int) */
 #define SCM_CREDENTIALS 0x02		/* rw: struct ucred		*/
 #define SCM_SECURITY	0x03		/* rw: security label		*/
+#define SCM_PRIVILEGES  0x04		/* rw: privilege set		*/
 
 struct ucred {
 	__u32	pid;
diff --git a/include/net/scm.h b/include/net/scm.h
index 3165650..4b8db21 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -101,6 +101,83 @@ static inline void scm_passec(struct socket *sock, struct msghdr *msg, struct sc
 { }
 #endif /* CONFIG_SECURITY_NETWORK */
 
+static __inline__ void scm_passpriv(struct socket *sock, struct msghdr *msg,
+				struct scm_cookie *scm)
+{
+	const struct cred *credp = scm->cred;
+	const struct group_info *gip;
+	char *result;
+	char *cp;
+	int i;
+#ifdef CONFIG_SECURITY_NETWORK
+	char *secdata;
+	u32 seclen;
+	int err;
+#endif /* CONFIG_SECURITY_NETWORK */
+
+	if (!test_bit(SOCK_PASSPRIV, &sock->flags))
+		return;
+
+	gip = credp->group_info;
+
+	/*
+	 * uid + euid + gid + egid + group-list + capabilities
+	 *     + "uid=" + "euid=" + "gid=" + "egid=" + "grps="
+	 *     + "cap-e=" + "cap-p=" + "cap-i="
+	 * 10  + 10   + 10  + 10   + (ngrps * 10) + ecap + pcap + icap
+	 *     + 4 + 5 + 4 + 5 + 5 + 6 + 6 + 6
+	 */
+	i = ((4 + gip->ngroups) * 11) + (3 * (_KERNEL_CAPABILITY_U32S * 8 + 1))
+		+ 41;
+
+#ifdef CONFIG_SECURITY_NETWORK
+	err = security_secid_to_secctx(scm->secid, &secdata, &seclen);
+	if (!err)
+		/*
+		 * " context="
+		 */
+		i += seclen + 10;
+#endif /* CONFIG_SECURITY_NETWORK */
+
+	result = kzalloc(i, GFP_KERNEL);
+	if (result == NULL)
+		return;
+
+	cp = result + sprintf(result, "euid=%d uid=%d egid=%d gid=%d",
+				credp->euid, credp->uid,
+				credp->egid, credp->gid);
+
+	if (gip != NULL && gip->ngroups > 0) {
+		cp += sprintf(cp, " grps=%d", GROUP_AT(gip, 0));
+		for (i = 1 ; i < gip->ngroups; i++)
+			cp += sprintf(cp, ",%d", GROUP_AT(gip, i));
+	}
+
+	cp += sprintf(cp, " cap-e=");
+	CAP_FOR_EACH_U32(i)
+		cp += sprintf(cp, "%08x", credp->cap_effective.cap[i]);
+	cp += sprintf(cp, " cap-p=");
+	CAP_FOR_EACH_U32(i)
+		cp += sprintf(cp, "%08x", credp->cap_permitted.cap[i]);
+	cp += sprintf(cp, " cap-i=");
+	CAP_FOR_EACH_U32(i)
+		cp += sprintf(cp, "%08x", credp->cap_inheritable.cap[i]);
+
+#ifdef CONFIG_SECURITY_NETWORK
+	cp += sprintf(cp, " context=");
+	strncpy(cp, secdata, seclen);
+	cp += seclen;
+	*cp = '\0';
+
+	security_release_secctx(secdata, seclen);
+#endif /* CONFIG_SECURITY_NETWORK */
+
+	put_cmsg(msg, SOL_SOCKET, SCM_PRIVILEGES, strlen(result)+1, result);
+
+	kfree(result);
+}
+
+
 static __inline__ void scm_recv(struct socket *sock, struct msghdr *msg,
 				struct scm_cookie *scm, int flags)
 {
@@ -114,6 +191,8 @@ static __inline__ void scm_recv(struct socket *sock, struct msghdr *msg,
 	if (test_bit(SOCK_PASSCRED, &sock->flags))
 		put_cmsg(msg, SOL_SOCKET, SCM_CREDENTIALS, sizeof(scm->creds), &scm->creds);
 
+	scm_passpriv(sock, msg, scm);
+
 	scm_destroy_cred(scm);
 
 	scm_passec(sock, msg, scm);
@@ -124,6 +203,5 @@ static __inline__ void scm_recv(struct socket *sock, struct msghdr *msg,
 	scm_detach_fds(msg, scm);
 }
 
-
 #endif /* __LINUX_NET_SCM_H */
 
diff --git a/net/core/sock.c b/net/core/sock.c
index fb60801..f134126 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -725,6 +725,13 @@ set_rcvbuf:
 		else
 			clear_bit(SOCK_PASSSEC, &sock->flags);
 		break;
+
+	case SO_PASSPRIV:
+		if (valbool)
+			set_bit(SOCK_PASSPRIV, &sock->flags);
+		else
+			clear_bit(SOCK_PASSPRIV, &sock->flags);
+		break;
 	case SO_MARK:
 		if (!capable(CAP_NET_ADMIN))
 			ret = -EPERM;
@@ -950,6 +957,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = test_bit(SOCK_PASSSEC, &sock->flags) ? 1 : 0;
 		break;
 
+	case SO_PASSPRIV:
+		v.val = test_bit(SOCK_PASSPRIV, &sock->flags) ? 1 : 0;
+		break;
+
 	case SO_PEERSEC:
 		return security_socket_getpeersec_stream(sock, optval, optlen, len);
 





* Re: [BUG] bnx2 + vlan + TSO : doesnt work
From: Jesse Gross @ 2011-01-18  0:13 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <1295309365.6264.23.camel@bwh-desktop>

On Mon, Jan 17, 2011 at 4:09 PM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Tue, 2011-01-18 at 01:00 +0100, Eric Dumazet wrote:
>> Le lundi 17 janvier 2011 à 23:47 +0000, Ben Hutchings a écrit :
>> > On Tue, 2011-01-18 at 00:41 +0100, Eric Dumazet wrote:
>> > > Oh well, my dev machine, with bnx2 and tg3 NICs, isnt working at all
>> > > with current linux-2.6 tree (I need vlans on my setup)
>> > >
>> > > tg3 : vlan not working
>> > >
>> > > bnx2 : vlan + TSO not working (or very slowly, since only non GSO frames
>> > > are OK)
>> > >
>> > > I suspect recent commits from Jesse are the problem.
>> > > (bnx2_xmit() is feeded with zeroed vlan_tci skbs)
>> > >
>> > > Maybe f01a5236bd4b1401 (net offloading: Generalize
>> > > netif_get_vlan_features().) ?
>> > >
>> > > I dont see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
>> > > They only set this flag in dev->features
>> > >
>> > > I dont think changing all drivers to also set vlan_features makes sense.
>> > >
>> > > Is following patch the right path ? (It does solve my problem)
>> >
>> > This isn't right.  NETIF_F_HW_VLAN_TX in vlan_features would mean that
>> > the hardware can do two levels of VLAN tag insertion, which is generally
>> > not true.
>> >
>>
>> So should we revert part of Jesse patch ? I want my vlan back ;)
>
> Yeah, that looks right.

I think it is better for netif_skb_features() to actually return the
correct features rather than bypass it here.  NETIF_F_HW_VLAN_TX
doesn't depend on any other offloads, so we can always include it if
it is in dev->features.

Separately, this means there is a problem with bnx2 because it allows
vlan insertion to be turned off, which would have the same effect.
Maybe it is looking directly at skb->protocol or similar for TSO.
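
A rough model of the first suggestion, for illustration only. This is
not the kernel's netif_skb_features(): the function name and the F_*
bit values below are made up, and the real function also takes the
protocol and other state into account.

```c
#include <stdint.h>

/* Illustrative bit values only; not the kernel's NETIF_F_* definitions. */
#define F_SG		(1u << 0)
#define F_TSO		(1u << 1)
#define F_HW_VLAN_TX	(1u << 2)

/*
 * Hypothetical model of per-packet feature selection.
 * Untagged packets may use everything the device advertises. Tagged
 * packets are limited to the offloads listed in vlan_features, except
 * that tag insertion itself is taken straight from dev_features since
 * it does not depend on any other offload.
 */
static uint32_t model_skb_features(uint32_t dev_features,
				   uint32_t vlan_features,
				   int vlan_tagged)
{
	if (!vlan_tagged)
		return dev_features;

	return dev_features & (vlan_features | F_HW_VLAN_TX);
}
```

With this shape, a driver that never lists the tag insertion bit in
vlan_features still gets it reported for tagged packets, while a
device that has tag insertion switched off in dev_features still
falls back to software tagging.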


* Re: [BUG] bnx2 + vlan + TSO : doesnt work
From: Ben Hutchings @ 2011-01-18  0:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Jesse Gross
In-Reply-To: <1295308844.3362.135.camel@edumazet-laptop>

On Tue, 2011-01-18 at 01:00 +0100, Eric Dumazet wrote:
> Le lundi 17 janvier 2011 à 23:47 +0000, Ben Hutchings a écrit :
> > On Tue, 2011-01-18 at 00:41 +0100, Eric Dumazet wrote:
> > > Oh well, my dev machine, with bnx2 and tg3 NICs, isnt working at all
> > > with current linux-2.6 tree (I need vlans on my setup)
> > > 
> > > tg3 : vlan not working
> > > 
> > > bnx2 : vlan + TSO not working (or very slowly, since only non GSO frames
> > > are OK)
> > > 
> > > I suspect recent commits from Jesse are the problem.
> > > (bnx2_xmit() is feeded with zeroed vlan_tci skbs)
> > > 
> > > Maybe f01a5236bd4b1401 (net offloading: Generalize
> > > netif_get_vlan_features().) ?
> > > 
> > > I dont see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
> > > They only set this flag in dev->features
> > > 
> > > I dont think changing all drivers to also set vlan_features makes sense.
> > > 
> > > Is following patch the right path ? (It does solve my problem)
> > 
> > This isn't right.  NETIF_F_HW_VLAN_TX in vlan_features would mean that
> > the hardware can do two levels of VLAN tag insertion, which is generally
> > not true.
> > 
> 
> So should we revert part of Jesse patch ? I want my vlan back ;)

Yeah, that looks right.

Ben.

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 54277df..8209d93 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2076,7 +2076,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>  		features = netif_skb_features(skb);
>  
>  		if (vlan_tx_tag_present(skb) &&
> -		    !(features & NETIF_F_HW_VLAN_TX)) {
> +		    !(dev->features & NETIF_F_HW_VLAN_TX)) {
>  			skb = __vlan_put_tag(skb, vlan_tx_tag_get(skb));
>  			if (unlikely(!skb))
>  				goto out;
> 
> 

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: [BUG] bnx2 + vlan + TSO : doesnt work
From: Eric Dumazet @ 2011-01-18  0:00 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, Jesse Gross
In-Reply-To: <1295308041.6264.22.camel@bwh-desktop>

Le lundi 17 janvier 2011 à 23:47 +0000, Ben Hutchings a écrit :
> On Tue, 2011-01-18 at 00:41 +0100, Eric Dumazet wrote:
> > Oh well, my dev machine, with bnx2 and tg3 NICs, isnt working at all
> > with current linux-2.6 tree (I need vlans on my setup)
> > 
> > tg3 : vlan not working
> > 
> > bnx2 : vlan + TSO not working (or very slowly, since only non GSO frames
> > are OK)
> > 
> > I suspect recent commits from Jesse are the problem.
> > (bnx2_xmit() is feeded with zeroed vlan_tci skbs)
> > 
> > Maybe f01a5236bd4b1401 (net offloading: Generalize
> > netif_get_vlan_features().) ?
> > 
> > I dont see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
> > They only set this flag in dev->features
> > 
> > I dont think changing all drivers to also set vlan_features makes sense.
> > 
> > Is following patch the right path ? (It does solve my problem)
> 
> This isn't right.  NETIF_F_HW_VLAN_TX in vlan_features would mean that
> the hardware can do two levels of VLAN tag insertion, which is generally
> not true.
> 

So should we revert part of Jesse patch ? I want my vlan back ;)

diff --git a/net/core/dev.c b/net/core/dev.c
index 54277df..8209d93 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2076,7 +2076,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		features = netif_skb_features(skb);
 
 		if (vlan_tx_tag_present(skb) &&
-		    !(features & NETIF_F_HW_VLAN_TX)) {
+		    !(dev->features & NETIF_F_HW_VLAN_TX)) {
 			skb = __vlan_put_tag(skb, vlan_tx_tag_get(skb));
 			if (unlikely(!skb))
 				goto out;




* Re: [BUG] bnx2 + vlan + TSO : doesnt work
From: Ben Hutchings @ 2011-01-17 23:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Jesse Gross
In-Reply-To: <1295307669.3362.106.camel@edumazet-laptop>

On Tue, 2011-01-18 at 00:41 +0100, Eric Dumazet wrote:
> Oh well, my dev machine, with bnx2 and tg3 NICs, isnt working at all
> with current linux-2.6 tree (I need vlans on my setup)
> 
> tg3 : vlan not working
> 
> bnx2 : vlan + TSO not working (or very slowly, since only non GSO frames
> are OK)
> 
> I suspect recent commits from Jesse are the problem.
> (bnx2_xmit() is feeded with zeroed vlan_tci skbs)
> 
> Maybe f01a5236bd4b1401 (net offloading: Generalize
> netif_get_vlan_features().) ?
> 
> I dont see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
> They only set this flag in dev->features
> 
> I dont think changing all drivers to also set vlan_features makes sense.
> 
> Is following patch the right path ? (It does solve my problem)

This isn't right.  NETIF_F_HW_VLAN_TX in vlan_features would mean that
the hardware can do two levels of VLAN tag insertion, which is generally
not true.

Ben.

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 54277df..5c13e55 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5265,6 +5265,9 @@ int register_netdevice(struct net_device *dev)
>  	 */
>  	dev->vlan_features |= (NETIF_F_GRO | NETIF_F_HIGHDMA);
>  
> +	/* allow netif_skb_features() not mask this bit from dev->features */
> +	dev->vlan_features |= NETIF_F_HW_VLAN_TX;
> +
>  	ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
>  	ret = notifier_to_errno(ret);
>  	if (ret)

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* [BUG] bnx2 + vlan + TSO : doesnt work
From: Eric Dumazet @ 2011-01-17 23:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jesse Gross

Oh well, my dev machine, with bnx2 and tg3 NICs, isn't working at all
with the current linux-2.6 tree (I need vlans on my setup)

tg3 : vlan not working

bnx2 : vlan + TSO not working (or very slowly, since only non-GSO frames
are OK)

I suspect recent commits from Jesse are the problem.
(bnx2_xmit() is fed skbs with a zeroed vlan_tci)

Maybe f01a5236bd4b1401 (net offloading: Generalize
netif_get_vlan_features().) ?

I don't see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
They only set this flag in dev->features.

I don't think changing all drivers to also set vlan_features makes sense.

Is the following patch the right path? (It does solve my problem)

diff --git a/net/core/dev.c b/net/core/dev.c
index 54277df..5c13e55 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5265,6 +5265,9 @@ int register_netdevice(struct net_device *dev)
 	 */
 	dev->vlan_features |= (NETIF_F_GRO | NETIF_F_HIGHDMA);
 
+	/* allow netif_skb_features() not mask this bit from dev->features */
+	dev->vlan_features |= NETIF_F_HW_VLAN_TX;
+
 	ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
 	ret = notifier_to_errno(ret);
 	if (ret)




* Re: [PATCH] net_sched: factorize qdisc stats handling
From: Eric Dumazet @ 2011-01-17 22:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Stephen Hemminger, Changli Gao, David Miller, Fabio Checconi,
	netdev, Luigi Rizzo, hawk
In-Reply-To: <Pine.LNX.4.64.1101172243470.15896@ask.diku.dk>

Le lundi 17 janvier 2011 à 22:55 +0100, Jesper Dangaard Brouer a écrit :
> 
> On Sat, 8 Jan 2011, Eric Dumazet wrote:
> >> Changli Gao <xiaosuo@gmail.com> wrote:
> >>
> >>>> +       cl->bstats.packets += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;
> 
> What about the qlen when we enqueue a GSO packet? Should we account for 
> the number of internal GSO packets, or just account a GSO packet as a 
> single packet?
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 0918834..1a8c0a3 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -453,7 +453,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, 
> struct Qdisc* qdisc)
>                  struct sk_buff_head *list = band2list(priv, band);
> 
>                  priv->bitmap |= (1 << band);
> -               qdisc->q.qlen++;
> +               /* Should the number of GSO packets be taken into account?*/
> +               qdisc->q.qlen += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;
>                  return __qdisc_enqueue_tail(skb, qdisc, list);
>          }
> 
> Is it at all legal to modify the q.qlen this way?

nope ;)

q.qlen is really the number of skbs here.
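
To spell out the distinction, here is a minimal standalone model. The
struct and function names are made up for illustration and are not
kernel types: the queue length counts buffers, and only the packet
statistic counts GSO segments.

```c
#include <stdint.h>

/* Simplified stand-ins for sk_buff and Qdisc; names are hypothetical. */
struct model_skb {
	uint16_t gso_segs;	/* 0 means "not a GSO packet" */
	uint32_t len;		/* total bytes carried */
};

struct model_queue {
	uint32_t qlen;		/* number of queued skbs */
	uint64_t packets;	/* wire packets accounted in the stats */
	uint64_t bytes;
};

static void model_enqueue(struct model_queue *q, const struct model_skb *skb)
{
	/* One skb occupies one slot in the queue, GSO or not. */
	q->qlen++;

	/* The stats count each GSO segment as a packet on the wire. */
	q->packets += skb->gso_segs ? skb->gso_segs : 1;
	q->bytes += skb->len;
}
```

Since each dequeue removes exactly one skb, adding gso_segs to the
length on enqueue would leave the two sides of that counter out of
step.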





* Re: [PATCH] net_sched: factorize qdisc stats handling
From: Jesper Dangaard Brouer @ 2011-01-17 21:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Changli Gao, David Miller, Fabio Checconi,
	netdev, Luigi Rizzo, hawk
In-Reply-To: <1294478789.2709.79.camel@edumazet-laptop>



On Sat, 8 Jan 2011, Eric Dumazet wrote:
>> Changli Gao <xiaosuo@gmail.com> wrote:
>>
>>>> +       cl->bstats.packets += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;

What about the qlen when we enqueue a GSO packet? Should we account for 
the number of internal GSO packets, or just account a GSO packet as a 
single packet?

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 0918834..1a8c0a3 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -453,7 +453,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, 
struct Qdisc* qdisc)
                 struct sk_buff_head *list = band2list(priv, band);

                 priv->bitmap |= (1 << band);
-               qdisc->q.qlen++;
+               /* Should the number of GSO packets be taken into account?*/
+               qdisc->q.qlen += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;
                 return __qdisc_enqueue_tail(skb, qdisc, list);
         }

Is it at all legal to modify the q.qlen this way?

Cheers
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------


* Re: NETLINK: Failed to browse: Invalid argument from avahi-daemon in F14 since 2.6.37-git8
From: Alessandro Suardi @ 2011-01-17 21:23 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: linux-kernel, netdev
In-Reply-To: <20110117083208.GA7569@ff.dom.local>

On Mon, Jan 17, 2011 at 9:32 AM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> On 2011-01-15 19:34, Alessandro Suardi wrote:
>> /var/log/messages says:
>>
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Found user 'avahi' (UID 70)
>> and group 'avahi' (GID 70).
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Successfully dropped root privileges.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: avahi-daemon 0.6.27 starting up.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Successfully called chroot().
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Successfully dropped remaining capabilities.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Loading service file /services/ssh.service.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Loading service file /services/udisks.service.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: NETLINK: Failed to browse: Invalid argument
>> Jan 13 12:42:23 duff acpid: starting up with netlink and the input layer
>>
>>
>> Happens both at boot and after boot if restarting avahi-daemon service,
>>  and there is a 10-15" wait before the service start script prints a [FAILED]
>>  red tag; avahi-daemon process does start up anyways.
>>
>> -git7 is the latest "good" kernel
>> -git8, -git9, -git11, -git13 have been reproducing the issue
>>
>>
>> thanks, ciao,
>
> Seems to be resolved in this thread:
>
> http://www.spinics.net/lists/netdev/msg152762.html

Thanks Jarek - filed Fedora bug 670316 for this one.

https://bugzilla.redhat.com/show_bug.cgi?id=670316

--alessandro

 "There's always a siren singing you to shipwreck"

   (Radiohead, "There There")

^ permalink raw reply

* Re: [PATCH v2] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Eric Dumazet @ 2011-01-17 20:39 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Po-Yu Chuang, netdev, linux-kernel, ratbert, joe, dilinger
In-Reply-To: <1295290718.6264.19.camel@bwh-desktop>

On Monday, 17 January 2011 at 18:58 +0000, Ben Hutchings wrote:
> On Mon, 2011-01-17 at 18:29 +0100, Eric Dumazet wrote:
> > On Monday, 17 January 2011 at 17:21 +0800, Po-Yu Chuang wrote:
> > 
> > 
> > > +static int ftmac100_rx_packet(struct ftmac100 *priv, int *processed)
> > > +{
> > > +	struct net_device *netdev = priv->netdev;
> > > +	struct ftmac100_rxdes *rxdes;
> > > +	struct sk_buff *skb;
> > > +	int length;
> > > +	int copied = 0;
> > > +	int done = 0;
> > > +
> > > +	rxdes = ftmac100_rx_locate_first_segment(priv);
> > > +	if (!rxdes)
> > > +		return 0;
> > > +
> > > +	length = ftmac100_rxdes_frame_length(rxdes);
> > > +
> > > +	netdev->stats.rx_packets++;
> > > +	netdev->stats.rx_bytes += length;
> > > +
> > > +	if (unlikely(ftmac100_rx_packet_error(priv, rxdes))) {
> > > +		ftmac100_rx_drop_packet(priv);
> > > +		return 1;
> > > +	}
> > > +
> > > +	/* start processing */
> > > +	skb = netdev_alloc_skb_ip_align(netdev, length);
> > > +	if (unlikely(!skb)) {
> > > +		if (net_ratelimit())
> > > +			netdev_err(netdev, "rx skb alloc failed\n");
> > > +
> > > +		ftmac100_rx_drop_packet(priv);
> > > +		return 1;
> > > +	}
> > > +
> > 
> > Please don't increase rx_packets/rx_bytes before the
> > netdev_alloc_skb_ip_align().
> > 
> > In case of a memory allocation failure, it would be better not to
> > pretend we handled a packet.
> >
> > drivers/net/r8169.c for example does the rx_packets/rx_bytes only if
> > packet is delivered to upper stack.
> 
> That's news to me.  I specifically advised Po-Yu Chuang to increment
> these earlier because my understanding is that all packets/bytes should
> be counted.  And drivers which use hardware MAC stats will generally do
> that, so I really don't think it makes sense to make other drivers
> different deliberately.
> 

I see, but when a frame is dropped because of RX ring buffer
under/overflow, we don't account for the lost packet/bytes.

That's probably not very important, but it would be good if all drivers
behaved the same.
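
To make the two accounting policies concrete, here is a rough userspace
sketch of the "count only what is delivered" variant. It is not the
ftmac100 or r8169 code: struct rx_stats and rx_one_frame() are made-up
names, malloc() stands in for netdev_alloc_skb_ip_align(), and bumping
rx_dropped on allocation failure is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the netdev stats counters. */
struct rx_stats {
	unsigned long rx_packets;
	unsigned long rx_bytes;
	unsigned long rx_dropped;
};

/*
 * Model of one receive attempt. alloc_ok simulates whether the buffer
 * allocation succeeds. Returns 1 because a descriptor was consumed
 * either way.
 */
static int rx_one_frame(struct rx_stats *stats, size_t length, int alloc_ok)
{
	void *buf = alloc_ok ? malloc(length ? length : 1) : NULL;

	if (!buf) {
		/* Allocation failed: record a drop, not a received packet. */
		stats->rx_dropped++;
		return 1;
	}

	/* The frame would be handed to the upper stack here. */
	stats->rx_packets++;
	stats->rx_bytes += length;

	free(buf);
	return 1;
}
```

With the other policy (count everything the MAC saw), the two increments
would simply move above the allocation, as in the posted patch.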




^ permalink raw reply

* [PATCH] ns83820: Avoid bad pointer deref in ns83820_init_one().
From: Jesper Juhl @ 2011-01-17 20:24 UTC (permalink / raw)
  To: netdev
  Cc: linux-ns83820, linux-kernel, Tejun Heo, Tejun Heo,
	Kulikov Vasiliy, Denis Kirjanov, David S. Miller,
	Benjamin LaHaise

In drivers/net/ns83820.c::ns83820_init_one() we dynamically allocate 
memory via alloc_etherdev(). We then call PRIV() on the returned storage 
which is 'return netdev_priv()'. netdev_priv() takes the pointer it is 
passed and adds 'ALIGN(sizeof(struct net_device), NETDEV_ALIGN)' to it and 
returns it. Then we test the resulting pointer for NULL, which it is 
unlikely to be at this point, and later dereference it. This will go bad 
if alloc_etherdev() actually returned NULL.

This patch reworks the code slightly so that we test for a NULL pointer 
(and return -ENOMEM) directly after calling alloc_etherdev().

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
---
 ns83820.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

  Compile tested only. I have no way to test this for real.
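
  For anyone who wants to see why the old NULL test could not fire, here
  is a small userspace model of the offset arithmetic. It is only a
  sketch: struct model_net_device, MODEL_ALIGN and model_priv_addr() are
  invented stand-ins for struct net_device, NETDEV_ALIGN and
  netdev_priv(), and the arithmetic is done on uintptr_t so the example
  does not itself do pointer arithmetic on NULL.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_ALIGN 32	/* stand-in for NETDEV_ALIGN */
#define MODEL_ALIGN_UP(x) \
	(((size_t)(x) + MODEL_ALIGN - 1) & ~(size_t)(MODEL_ALIGN - 1))

/* Stand-in for struct net_device; only its size matters here. */
struct model_net_device {
	char name[16];
	unsigned long state;
};

/*
 * Models what netdev_priv() computes: the device address plus the
 * aligned size of the device structure.
 */
static uintptr_t model_priv_addr(uintptr_t dev_addr)
{
	return dev_addr + MODEL_ALIGN_UP(sizeof(struct model_net_device));
}
```

  model_priv_addr(0) yields the aligned structure size rather than 0, so
  a NULL test on the private pointer passes even when the allocation
  failed.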

diff --git a/drivers/net/ns83820.c b/drivers/net/ns83820.c
index 84134c7..a41b2cf 100644
--- a/drivers/net/ns83820.c
+++ b/drivers/net/ns83820.c
@@ -1988,12 +1988,11 @@ static int __devinit ns83820_init_one(struct pci_dev *pci_dev,
 	}
 
 	ndev = alloc_etherdev(sizeof(struct ns83820));
-	dev = PRIV(ndev);
-
 	err = -ENOMEM;
-	if (!dev)
+	if (!ndev)
 		goto out;
 
+	dev = PRIV(ndev);
 	dev->ndev = ndev;
 
 	spin_lock_init(&dev->rx_info.lock);


-- 
Jesper Juhl <jj@chaosbits.net>            http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

^ permalink raw reply related

* Re: [PATCH v2 1/3] can: at91_can: clean up usage of AT91_MB_RX_FIRST and AT91_MB_RX_NUM
From: Marc Kleine-Budde @ 2011-01-17 19:50 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1294752085-30151-2-git-send-email-mkl-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>


On 01/11/2011 02:21 PM, Marc Kleine-Budde wrote:
> This patch cleans up the usage of two macros which specify the mailbox
> usage. AT91_MB_RX_FIRST and AT91_MB_RX_NUM define the first and the
> number of RX mailboxes. The current driver uses these variables in an
> unclean way; assuming that AT91_MB_RX_FIRST is 0;
> 
> This patch cleans up the usage of these macros, no longer assuming
> AT91_MB_RX_FIRST == 0.
> 
> Signed-off-by: Marc Kleine-Budde <mkl-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>

Any comments on this?

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |



^ permalink raw reply

* Re: [PATCH v2] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Ben Hutchings @ 2011-01-17 18:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Po-Yu Chuang, netdev, linux-kernel, ratbert, joe, dilinger
In-Reply-To: <1295285354.3335.10.camel@edumazet-laptop>

On Mon, 2011-01-17 at 18:29 +0100, Eric Dumazet wrote:
> On Monday, 17 January 2011 at 17:21 +0800, Po-Yu Chuang wrote:
> 
> 
> > +static int ftmac100_rx_packet(struct ftmac100 *priv, int *processed)
> > +{
> > +	struct net_device *netdev = priv->netdev;
> > +	struct ftmac100_rxdes *rxdes;
> > +	struct sk_buff *skb;
> > +	int length;
> > +	int copied = 0;
> > +	int done = 0;
> > +
> > +	rxdes = ftmac100_rx_locate_first_segment(priv);
> > +	if (!rxdes)
> > +		return 0;
> > +
> > +	length = ftmac100_rxdes_frame_length(rxdes);
> > +
> > +	netdev->stats.rx_packets++;
> > +	netdev->stats.rx_bytes += length;
> > +
> > +	if (unlikely(ftmac100_rx_packet_error(priv, rxdes))) {
> > +		ftmac100_rx_drop_packet(priv);
> > +		return 1;
> > +	}
> > +
> > +	/* start processing */
> > +	skb = netdev_alloc_skb_ip_align(netdev, length);
> > +	if (unlikely(!skb)) {
> > +		if (net_ratelimit())
> > +			netdev_err(netdev, "rx skb alloc failed\n");
> > +
> > +		ftmac100_rx_drop_packet(priv);
> > +		return 1;
> > +	}
> > +
> 
> Please don't increase rx_packets/rx_bytes before the
> netdev_alloc_skb_ip_align().
> 
> In case of a memory allocation failure, it would be better not to
> pretend we handled a packet.
>
> drivers/net/r8169.c for example does the rx_packets/rx_bytes only if
> packet is delivered to upper stack.

That's news to me.  I specifically advised Po-Yu Chuang to increment
these earlier because my understanding is that all packets/bytes should
be counted.  And drivers which use hardware MAC stats will generally do
that, so I really don't think it makes sense to make other drivers
different deliberately.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH v2] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Eric Dumazet @ 2011-01-17 18:21 UTC (permalink / raw)
  To: Po-Yu Chuang; +Cc: netdev, linux-kernel, ratbert, bhutchings, joe, dilinger
In-Reply-To: <1295256060-2091-1-git-send-email-ratbert.chuang@gmail.com>

On Monday, 17 January 2011 at 17:21 +0800, Po-Yu Chuang wrote:


> +
> +	spin_lock_irqsave(&priv->tx_lock, flags);
> +	ftmac100_txdes_set_skb(txdes, skb);
> +	ftmac100_txdes_set_dma_addr(txdes, map);
> +
> +	ftmac100_txdes_set_first_segment(txdes);
> +	ftmac100_txdes_set_last_segment(txdes);
> +	ftmac100_txdes_set_txint(txdes);
> +	ftmac100_txdes_set_buffer_size(txdes, len);
> +

I wonder if it's not too expensive to read/modify/write txdes->txdes1

Maybe you should use a temporary u32 var and perform one final write to
txdes->txdes1 (with the set_dma_own)

> +	priv->tx_pending++;
> +	if (priv->tx_pending == TX_QUEUE_ENTRIES) {
> +		if (net_ratelimit())
> +			netdev_info(netdev, "tx queue full\n");
> +
> +		netif_stop_queue(netdev);
> +	}
> +
> +	/* start transmit */
> +	ftmac100_txdes_set_dma_own(txdes);

	txdes->txdes1 = txdes1;

BTW, shouldn't you use cpu_to_be32() or cpu_to_le32() if this driver is
multi-platform?
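
Roughly what I have in mind, as a userspace sketch only. The bit
positions, struct model_txdes and the model_* helpers are invented for
illustration (they are not the FTMAC100 register layout), little-endian
descriptors are assumed, and model_cpu_to_le32() stands in for the
kernel's cpu_to_le32(). The ownership bit is left out because where it
lives is not shown in the quoted code.

```c
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define MODEL_TXDES1_BUF_SIZE(x)	((uint32_t)(x) & 0x7ffu)
#define MODEL_TXDES1_LTS		(1u << 28)	/* last segment */
#define MODEL_TXDES1_FTS		(1u << 29)	/* first segment */
#define MODEL_TXDES1_TXIC		(1u << 31)	/* interrupt on completion */

struct model_txdes {
	uint32_t txdes0;
	uint32_t txdes1;	/* stored little-endian, as the device reads it */
};

/* Userspace stand-in for cpu_to_le32(): bytes end up LE in memory. */
static uint32_t model_cpu_to_le32(uint32_t v)
{
	union { uint8_t b[4]; uint32_t w; } u;

	u.b[0] = (uint8_t)(v & 0xff);
	u.b[1] = (uint8_t)((v >> 8) & 0xff);
	u.b[2] = (uint8_t)((v >> 16) & 0xff);
	u.b[3] = (uint8_t)((v >> 24) & 0xff);
	return u.w;
}

static void model_txdes_fill(volatile struct model_txdes *txdes,
			     unsigned int len)
{
	uint32_t txdes1 = 0;	/* composed in a local, not in descriptor memory */

	txdes1 |= MODEL_TXDES1_FTS;
	txdes1 |= MODEL_TXDES1_LTS;
	txdes1 |= MODEL_TXDES1_TXIC;
	txdes1 |= MODEL_TXDES1_BUF_SIZE(len);

	/* One store to the descriptor instead of four read/modify/writes. */
	txdes->txdes1 = model_cpu_to_le32(txdes1);
}
```

The point is only that the descriptor word is touched once and goes
through an explicit byte-order conversion on the way out.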

^ permalink raw reply

* Re: 2.6.37 regression: adding main interface to a bridge breaks vlan interface RX
From: Simon Arlott @ 2011-01-17 18:17 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, Linux Kernel Mailing List, jesse, Herbert Xu
In-Reply-To: <1295280044.6264.5.camel@bwh-desktop>

On 17/01/11 16:00, Ben Hutchings wrote:
> On Sun, 2011-01-16 at 14:09 +0000, Simon Arlott wrote:
>> [    1.666706] forcedeth 0000:00:08.0: ifname eth0, PHY OUI 0x5043 @ 16, addr 00:e0:81:4d:2b:ec
>> [    1.666767] forcedeth 0000:00:08.0: highdma csum vlan pwrctl mgmt gbit lnktim msi desc-v3
>> 
>> I have eth0 and eth0.3840 which works until I add eth0 to a bridge.
>> While eth0 is in a bridge (the bridge device is up), eth0.3840 is unable
>> to receive packets. Using tcpdump on eth0 shows the packets being
>> received with a VLAN tag but they don't appear on eth0.3840. They appear
>> with the VLAN tag on the bridge interface.
> [...]
> 
> This means the behaviour is now consistent, whether or not hardware VLAN
> tag stripping is enabled.  (I previously pointed out the inconsistent
> behaviour in <http://thread.gmane.org/gmane.linux.network/149864>.)  I
> would consider this an improvement.

Shouldn't the kernel also prevent a device from being both part of a
bridge and having VLANs? Instead everything appears to work except
incoming traffic.

-- 
Simon Arlott

^ permalink raw reply

