Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 1/2] net: Add layer 2 hardware acceleration operations for macvlan devices
From: Neil Horman @ 2013-10-14 10:48 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, Andy Gospodarek, David Miller
In-Reply-To: <525B0689.4070109@gmail.com>

On Sun, Oct 13, 2013 at 01:46:01PM -0700, John Fastabend wrote:
> On 10/11/2013 11:43 AM, Neil Horman wrote:
> >Add a operations structure that allows a network interface to export the fact
> >that it supports package forwarding in hardware between physical interfaces and
> >other mac layer devices assigned to it (such as macvlans).  this operaions
> >structure can be used by virtual mac devices to bypass software switching so
> >that forwarding can be done in hardware more efficiently.
> >
> >Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >CC: john.fastabend@gmail.com
> >CC: Andy Gospodarek <andy@greyhouse.net>
> >CC: "David S. Miller" <davem@davemloft.net>
> >---
> 
> [...]
> 
> >
> >diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >index 2ddb48d..1710fdb 100644
> >--- a/include/linux/skbuff.h
> >+++ b/include/linux/skbuff.h
> >@@ -426,9 +426,9 @@ struct sk_buff {
> >  	char			cb[48] __aligned(8);
> >
> >  	unsigned long		_skb_refdst;
> >-#ifdef CONFIG_XFRM
> >+
> 
> Is this a hold-over from the previous patches? 'sp' isn't touched
> anywhere else so put the ifdef/endif back.
> 
Yeah, my screw up, I wanted to get this out before the weekend and missed that
screw up.  Sorry.

Neil

> >  	struct	sec_path	*sp;
> >-#endif
> >+
> >  	unsigned int		len,
> >  				data_len;
> >  	__u16			mac_len,
> >diff --git a/net/core/dev.c b/net/core/dev.c
> >index 5c713f2..ecad8c2 100644
> 
> -- 
> John Fastabend         Intel Corporation
> 

^ permalink raw reply

* RE: [PATCH net-next v4 2/5] xen-netback: add support for IPv6 checksum offload from guest
From: Paul Durrant @ 2013-10-14 10:49 UTC (permalink / raw)
  To: Wei Liu
  Cc: xen-devel@lists.xen.org, netdev@vger.kernel.org, Wei Liu,
	David Vrabel, Ian Campbell
In-Reply-To: <20131014104235.GB11739@zion.uk.xensource.com>

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: 14 October 2013 11:43
> To: Paul Durrant
> Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David Vrabel;
> Ian Campbell
> Subject: Re: [PATCH net-next v4 2/5] xen-netback: add support for IPv6
> checksum offload from guest
> 
> On Fri, Oct 11, 2013 at 04:06:19PM +0100, Paul Durrant wrote:
> [...]
> > -/*
> > - * This is the amount of packet we copy rather than map, so that the
> > - * guest can't fiddle with the contents of the headers while we do
> > - * packet processing on them (netfilter, routing, etc).
> > +/* This is a miniumum size for the linear area to avoid lots of
> > + * calls to __pskb_pull_tail() as we set up checksum offsets.
> >   */
> 
> You seem to forget to explain why 128 is chosen. :-)

Is that not sufficient explanation? What sort of thing are you looking for?

  Paul

^ permalink raw reply

* Re: [PATCH 2/2] ixgbe: enable l2 forwarding acceleration for macvlans
From: Neil Horman @ 2013-10-14 10:50 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, Andy Gospodarek, David Miller
In-Reply-To: <525B0721.7060709@gmail.com>

On Sun, Oct 13, 2013 at 01:48:33PM -0700, John Fastabend wrote:
> On 10/11/2013 11:43 AM, Neil Horman wrote:
> >Now that l2 acceleration ops are in place from the prior patch, enable ixgbe to
> >take advantage of these operations.  Allow it to allocate queues for a macvlan
> >so that when we transmit a frame, we can do the switching in hardware inside the
> >ixgbe card, rather than in software.
> >
> >Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >CC: john.fastabend@gmail.com
> >CC: Andy Gospodarek <andy@greyhouse.net>
> >CC: "David S. Miller" <davem@davemloft.net>
> >---
> 
> Neil, I'm fairly sure I can simplify this patch further. I'll be able
> to take a shot at it Tuesday if you don't mind waiting.
> 
> I can resubmit the series then and preserve your signed-off on at least
> the first patch. Seem reasonable?
> 
That sounds fine.
Thanks!
Neil

> .John
> 
> -- 
> John Fastabend         Intel Corporation
> 

^ permalink raw reply

* Re: [PATCH net-next v4 1/5] xen-netback: add support for IPv6 checksum offload to guest
From: Ian Campbell @ 2013-10-14 10:53 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, netdev, Wei Liu, David Vrabel
In-Reply-To: <1381503982-1418-2-git-send-email-paul.durrant@citrix.com>

On Fri, 2013-10-11 at 16:06 +0100, Paul Durrant wrote:
> Check xenstore flag feature-ipv6-csum-offload to determine if a
> guest is happy to accept IPv6 packets with only partial checksum.
> Also check analogous feature-ip-csum-offload to determine if a
> guest is happy to accept IPv4 packets with only partial checksum
> as a replacement for a negated feature-no-csum-offload value and
> add a comment to deprecate use of feature-no-csum-offload.
> 
> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>

Shouldn't this come later in the series, i.e. after netback is actually
able to cope with ipv6 offloads?

> diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
> index 5715318..b4a9a3c 100644
> --- a/drivers/net/xen-netback/common.h
> +++ b/drivers/net/xen-netback/common.h
> @@ -153,7 +153,8 @@ struct xenvif {
>  	u8 can_sg:1;
>  	u8 gso:1;
>  	u8 gso_prefix:1;
> -	u8 csum:1;
> +	u8 ip_csum:1;
> +	u8 ipv6_csum:1;

Why not ipv4_csum for consistency/unambiguity?

> diff --git a/include/xen/interface/io/netif.h b/include/xen/interface/io/netif.h
> index eb262e3..d9fb44739 100644
> --- a/include/xen/interface/io/netif.h
> +++ b/include/xen/interface/io/netif.h
> @@ -51,6 +51,16 @@
>   */
>  
>  /*
> + * "feature-no-csum-offload" was used to turn off IPv4 TCP/UDP checksum
> + * offload but is now deprecated. Two new feature flags should now be used
> + * to control checksum offload:

How is a frontend to know which sort of backend it is talking too? Is
there going to be a feature flag to indicate support for these new
flags?

In particular a new frontend running on an old backend needs to know
that it needs to set no-csum-offload instead of ip-csum-offload somehow.

> + * "feature-ip-csum-offload" should be used to turn IPv4 TCP/UDP checksum

"ipv4" again?

> + * offload on or off. If it is missing then the feature is assumed to be on.
> + * "feature-ipv6-csum-offload" should be used to turn IPv6 TCP/UDP checksum
> + * offload on or off. If it is missing then the feature is assumed to be off.
> + */
> +
> +/*
>   * This is the 'wire' format for packets:
>   *  Request 1: xen_netif_tx_request  -- XEN_NETTXF_* (any flags)
>   * [Request 2: xen_netif_extra_info]    (only if request 1 has XEN_NETTXF_extra_info)

^ permalink raw reply

* Re: [PATCH net-next v4 2/5] xen-netback: add support for IPv6 checksum offload from guest
From: Wei Liu @ 2013-10-14 10:55 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Wei Liu, xen-devel@lists.xen.org, netdev@vger.kernel.org,
	David Vrabel, Ian Campbell
In-Reply-To: <9AAE0902D5BC7E449B7C8E4E778ABCD0138B8D@AMSPEX01CL01.citrite.net>

On Mon, Oct 14, 2013 at 11:49:20AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > Sent: 14 October 2013 11:43
> > To: Paul Durrant
> > Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David Vrabel;
> > Ian Campbell
> > Subject: Re: [PATCH net-next v4 2/5] xen-netback: add support for IPv6
> > checksum offload from guest
> > 
> > On Fri, Oct 11, 2013 at 04:06:19PM +0100, Paul Durrant wrote:
> > [...]
> > > -/*
> > > - * This is the amount of packet we copy rather than map, so that the
> > > - * guest can't fiddle with the contents of the headers while we do
> > > - * packet processing on them (netfilter, routing, etc).
> > > +/* This is a miniumum size for the linear area to avoid lots of
> > > + * calls to __pskb_pull_tail() as we set up checksum offsets.
> > >   */
> > 
> > You seem to forget to explain why 128 is chosen. :-)
> 
> Is that not sufficient explanation? What sort of thing are you looking for?
> 

>From the second version of this patch, we had a conversation.

> Where does 128 come from?
>

"It's just an arbitrary power of 2 that was chosen because it seems to
cover most likely v6 headers and all v4 headers."

So something like: "We choose 128 which is likely to cover most V6
headers and all V4 headers" would be sufficeint.

Wei.

^ permalink raw reply

* Re: [PATCH net-next v4 3/5] xen-netback: Unconditionally set NETIF_F_RXCSUM
From: Ian Campbell @ 2013-10-14 10:56 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, netdev, Wei Liu, David Vrabel
In-Reply-To: <1381503982-1418-4-git-send-email-paul.durrant@citrix.com>

On Fri, 2013-10-11 at 16:06 +0100, Paul Durrant wrote:
> There is no mechanism to insist that a guest always generates a packet
> with good checksum (at least for IPv4)

Isn't this what feature-no-csum-offload is?

>  so we must handle checksum
> offloading from the guest and hence should set NETIF_F_RXCSUM.
> 
> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> ---
>  drivers/net/xen-netback/interface.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
> index 8e92783..cb0d8ea 100644
> --- a/drivers/net/xen-netback/interface.c
> +++ b/drivers/net/xen-netback/interface.c
> @@ -321,7 +321,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
>  	dev->hw_features = NETIF_F_SG |
>  		NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
>  		NETIF_F_TSO;
> -	dev->features = dev->hw_features;
> +	dev->features = dev->hw_features | NETIF_F_RXCSUM;
>  	SET_ETHTOOL_OPS(dev, &xenvif_ethtool_ops);
>  
>  	dev->tx_queue_len = XENVIF_QUEUE_LENGTH;

^ permalink raw reply

* RE: [PATCH net-next v4 2/5] xen-netback: add support for IPv6 checksum offload from guest
From: Paul Durrant @ 2013-10-14 10:57 UTC (permalink / raw)
  To: Wei Liu
  Cc: Wei Liu, xen-devel@lists.xen.org, netdev@vger.kernel.org,
	David Vrabel, Ian Campbell
In-Reply-To: <20131014105527.GD11739@zion.uk.xensource.com>

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: 14 October 2013 11:55
> To: Paul Durrant
> Cc: Wei Liu; xen-devel@lists.xen.org; netdev@vger.kernel.org; David Vrabel;
> Ian Campbell
> Subject: Re: [PATCH net-next v4 2/5] xen-netback: add support for IPv6
> checksum offload from guest
> 
> On Mon, Oct 14, 2013 at 11:49:20AM +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > > Sent: 14 October 2013 11:43
> > > To: Paul Durrant
> > > Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David
> Vrabel;
> > > Ian Campbell
> > > Subject: Re: [PATCH net-next v4 2/5] xen-netback: add support for IPv6
> > > checksum offload from guest
> > >
> > > On Fri, Oct 11, 2013 at 04:06:19PM +0100, Paul Durrant wrote:
> > > [...]
> > > > -/*
> > > > - * This is the amount of packet we copy rather than map, so that the
> > > > - * guest can't fiddle with the contents of the headers while we do
> > > > - * packet processing on them (netfilter, routing, etc).
> > > > +/* This is a miniumum size for the linear area to avoid lots of
> > > > + * calls to __pskb_pull_tail() as we set up checksum offsets.
> > > >   */
> > >
> > > You seem to forget to explain why 128 is chosen. :-)
> >
> > Is that not sufficient explanation? What sort of thing are you looking for?
> >
> 
> From the second version of this patch, we had a conversation.
> 
> > Where does 128 come from?
> >
> 
> "It's just an arbitrary power of 2 that was chosen because it seems to
> cover most likely v6 headers and all v4 headers."
> 
> So something like: "We choose 128 which is likely to cover most V6
> headers and all V4 headers" would be sufficeint.
> 

Ok. I figured that was implied by "set up checksum offsets" but I can be more explicit.

  Paul

^ permalink raw reply

* RE: [PATCH net-next v4 3/5] xen-netback: Unconditionally set NETIF_F_RXCSUM
From: Paul Durrant @ 2013-10-14 11:03 UTC (permalink / raw)
  To: Ian Campbell
  Cc: xen-devel@lists.xen.org, netdev@vger.kernel.org, Wei Liu,
	David Vrabel
In-Reply-To: <1381748176.24708.103.camel@kazak.uk.xensource.com>

> -----Original Message-----
> From: Ian Campbell
> Sent: 14 October 2013 11:56
> To: Paul Durrant
> Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David Vrabel
> Subject: Re: [PATCH net-next v4 3/5] xen-netback: Unconditionally set
> NETIF_F_RXCSUM
> 
> On Fri, 2013-10-11 at 16:06 +0100, Paul Durrant wrote:
> > There is no mechanism to insist that a guest always generates a packet
> > with good checksum (at least for IPv4)
> 
> Isn't this what feature-no-csum-offload is?
> 

Theoretically, yes, but netback does not have code to advertise that flag (and never has?) and I don't see anything in xen-netfront that checks such a flag so I think we have to assume it's not going to be honoured even if we were to introduce it now.

  Paul

^ permalink raw reply

* Re: [PATCH net 1/2] virtio-net: don't respond to cpu hotplug notifier if we're not ready
From: Michael S. Tsirkin @ 2013-10-14 11:06 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <1381744595-26881-1-git-send-email-jasowang@redhat.com>

On Mon, Oct 14, 2013 at 05:56:34PM +0800, Jason Wang wrote:
> We're trying to re-configure the affinity unconditionally in cpu hotplug
> callback. This may lead the issue during resuming from s3/s4 since
> 
> - virt queues haven't been allocated at that time.
> - it's unnecessary since thaw method will re-configure the affinity.
> 
> Fix this issue by checking the config_enable and do nothing is we're not ready.
> 
> The bug were introduced by commit 8de4b2f3ae90c8fc0f17eeaab87d5a951b66ee17
> (virtio-net: reset virtqueue affinity when doing cpu hotplug).
> 
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> The patch is need for 3.8 and above.
> ---
>  drivers/net/virtio_net.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index defec2b..c4bc1cc 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1116,6 +1116,11 @@ static int virtnet_cpu_callback(struct notifier_block *nfb,
>  {
>  	struct virtnet_info *vi = container_of(nfb, struct virtnet_info, nb);
>  
> +	mutex_lock(&vi->config_lock);
> +
> +	if (!vi->config_enable)
> +		goto done;
> +
>  	switch(action & ~CPU_TASKS_FROZEN) {
>  	case CPU_ONLINE:
>  	case CPU_DOWN_FAILED:
> @@ -1128,6 +1133,9 @@ static int virtnet_cpu_callback(struct notifier_block *nfb,
>  	default:
>  		break;
>  	}
> +
> +done:
> +	mutex_unlock(&vi->config_lock);
>  	return NOTIFY_OK;
>  }
>  
> -- 
> 1.8.1.2

^ permalink raw reply

* Re: [PATCH net 2/2] virtio-net: refill only when device is up during setting queues
From: Michael S. Tsirkin @ 2013-10-14 11:09 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <1381744595-26881-2-git-send-email-jasowang@redhat.com>

On Mon, Oct 14, 2013 at 05:56:35PM +0800, Jason Wang wrote:
> We used to schedule the refill work unconditionally after changing the
> number of queues. This may lead an issue if the device is not
> up. Since we only try to cancel the work in ndo_stop(), this may cause
> the refill work still work after removing the device. Fix this by only
> schedule the work when device is up.
> 
> The bug were introduce by commit 9b9cd8024a2882e896c65222aa421d461354e3f2.
> (virtio-net: fix the race between channels setting and refill)
> 
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

It bothers me that we look at the flag without any
locks here.
I think we'll need to take the rtnl lock at least
on restore.

> ---
> The patch were need for 3.10 and above.
> ---
>  drivers/net/virtio_net.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c4bc1cc..92f0096 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -938,7 +938,9 @@ static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
>  		return -EINVAL;
>  	} else {
>  		vi->curr_queue_pairs = queue_pairs;
> -		schedule_delayed_work(&vi->refill, 0);
> +		/* virtnet_open() will refill when device is going to up. */
> +		if (dev->flags & IFF_UP)
> +			schedule_delayed_work(&vi->refill, 0);
>  	}
>  
>  	return 0;
> -- 
> 1.8.1.2

^ permalink raw reply

* RE: [PATCH net-next v4 1/5] xen-netback: add support for IPv6 checksum offload to guest
From: Paul Durrant @ 2013-10-14 11:10 UTC (permalink / raw)
  To: Ian Campbell
  Cc: xen-devel@lists.xen.org, netdev@vger.kernel.org, Wei Liu,
	David Vrabel
In-Reply-To: <1381748013.24708.102.camel@kazak.uk.xensource.com>

> -----Original Message-----
> From: Ian Campbell
> Sent: 14 October 2013 11:54
> To: Paul Durrant
> Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David Vrabel
> Subject: Re: [PATCH net-next v4 1/5] xen-netback: add support for IPv6
> checksum offload to guest
> 
> On Fri, 2013-10-11 at 16:06 +0100, Paul Durrant wrote:
> > Check xenstore flag feature-ipv6-csum-offload to determine if a
> > guest is happy to accept IPv6 packets with only partial checksum.
> > Also check analogous feature-ip-csum-offload to determine if a
> > guest is happy to accept IPv4 packets with only partial checksum
> > as a replacement for a negated feature-no-csum-offload value and
> > add a comment to deprecate use of feature-no-csum-offload.
> >
> > Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
> > Cc: Wei Liu <wei.liu2@citrix.com>
> > Cc: David Vrabel <david.vrabel@citrix.com>
> > Cc: Ian Campbell <ian.campbell@citrix.com>
> 
> Shouldn't this come later in the series, i.e. after netback is actually
> able to cope with ipv6 offloads?
> 

I guess that's debatable. The patches don't have any dependency relation; offloads to and from the guest are quite independent.

> > diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-
> netback/common.h
> > index 5715318..b4a9a3c 100644
> > --- a/drivers/net/xen-netback/common.h
> > +++ b/drivers/net/xen-netback/common.h
> > @@ -153,7 +153,8 @@ struct xenvif {
> >  	u8 can_sg:1;
> >  	u8 gso:1;
> >  	u8 gso_prefix:1;
> > -	u8 csum:1;
> > +	u8 ip_csum:1;
> > +	u8 ipv6_csum:1;
> 
> Why not ipv4_csum for consistency/unambiguity?
> 

I followed general linux naming conventions e.g. ip_hdr and ipv6_hdr.

> > diff --git a/include/xen/interface/io/netif.h
> b/include/xen/interface/io/netif.h
> > index eb262e3..d9fb44739 100644
> > --- a/include/xen/interface/io/netif.h
> > +++ b/include/xen/interface/io/netif.h
> > @@ -51,6 +51,16 @@
> >   */
> >
> >  /*
> > + * "feature-no-csum-offload" was used to turn off IPv4 TCP/UDP
> checksum
> > + * offload but is now deprecated. Two new feature flags should now be
> used
> > + * to control checksum offload:
> 
> How is a frontend to know which sort of backend it is talking too? Is
> there going to be a feature flag to indicate support for these new
> flags?
> 
> In particular a new frontend running on an old backend needs to know
> that it needs to set no-csum-offload instead of ip-csum-offload somehow.
> 

Good point. Without any version I guess we have to live with the old flag forever. I'll stick with it for v4 and just leave the new one for v6.

  Paul

> > + * "feature-ip-csum-offload" should be used to turn IPv4 TCP/UDP
> checksum
> 
> "ipv4" again?
> 
> > + * offload on or off. If it is missing then the feature is assumed to be on.
> > + * "feature-ipv6-csum-offload" should be used to turn IPv6 TCP/UDP
> checksum
> > + * offload on or off. If it is missing then the feature is assumed to be off.
> > + */
> > +
> > +/*
> >   * This is the 'wire' format for packets:
> >   *  Request 1: xen_netif_tx_request  -- XEN_NETTXF_* (any flags)
> >   * [Request 2: xen_netif_extra_info]    (only if request 1 has
> XEN_NETTXF_extra_info)
> 


^ permalink raw reply

* Re: DomU's network interface will hung when Dom0 running 32bit
From: Wei Liu @ 2013-10-14 11:19 UTC (permalink / raw)
  To: jianhai luan; +Cc: Ian Campbell, Wei Liu, xen-devel, netdev
In-Reply-To: <52590DFE.6080203@oracle.com>

On Sat, Oct 12, 2013 at 04:53:18PM +0800, jianhai luan wrote:
> Hi Ian,
>   I meet the DomU's network interface hung issue recently, and have
> been working on the issue from that time. I find that DomU's network
> interface, which send lesser package, will hung if Dom0 running
> 32bit and DomU's up-time is very long.  I think that one jiffies
> overflow bug exist in the function tx_credit_exceeded().
>   I know the inline function time_after_eq(a,b) will process jiffies
> overflow, but the function have one limit a should little that (b +
> MAX_SIGNAL_LONG). If a large than the value, time_after_eq will
> return false. The MAX_SINGNAL_LONG should be 0x7fffffff at 32-bit
> machine.
>   If DomU's network interface send lesser package (<0.5k/s if
> jiffies=250 and credit_bytes=ULONG_MAX), jiffies will beyond out
> (credit_timeout.expires + MAX_SIGNAL_LONG) and time_after_eq(now,
> next_credit) will failure (should be true). So one timer which will
> not be trigger in short time, and later process will be aborted when
> timer_pending(&vif->credit_timeout) is true. The result will be
> DomU's network interface will be hung in long time (> 40days).
>   Please think about the below scenario:
>   Condition:
>     Dom0 running 32-bit and HZ = 1000
>     vif->credit_timeout->expire = 0xffffffff, vif->remaining_credit
> = 0xffffffff, vif->credit_usec=0 jiffies=0
>     vif receive lesser package (DomU send lesser package). If the
> value is litter than 2K/s, consume 4G(0xffffffff) will need 582.55
> hours. jiffies will large than 0x7ffffff. we guess jiffies =
> 0x800000ff, time_after_eq(0x800000ff, 0xffffffff) will failure, and
> one time which expire is 0xfffffff will be pended into system. So
> the interface will hung until jiffies recount 0xffffffff (that will
> need very long time).

If I'm not mistaken you meant time_after_eq(now, next_credit) in
netback. How does next_credit become 0xffffffff?

Wei.

> 
>   If some error exist in above explain, please help me point it out.
> 
> Thanks,
> Jason

^ permalink raw reply

* Re: [PATCH] net/ethernet: cpsw: Bugfix interrupts before enabling napi
From: Peter Korsgaard @ 2013-10-14 11:48 UTC (permalink / raw)
  To: Markus Pargmann
  Cc: David S. Miller, Florian Fainelli, Mugunthan V N,
	linux-arm-kernel, netdev, kernel
In-Reply-To: <1381691821-25498-1-git-send-email-mpa@pengutronix.de>

>>>>> "Markus" == Markus Pargmann <mpa@pengutronix.de> writes:

 Markus> If interrupts happen before napi_enable was called, the driver will not
 Markus> work as expected. Network transmissions are impossible in this state.
 Markus> This bug can be reproduced easily by restarting the network interface in
 Markus> a loop. After some time any network transmissions on the network
 Markus> interface will fail.

 Markus> This patch fixes the bug by enabling napi before enabling the network
 Markus> interface interrupts.

 Markus> Signed-off-by: Markus Pargmann <mpa@pengutronix.de>

Acked-by: Peter Korsgaard <jacmet@sunsite.dk>

-- 
Bye, Peter Korsgaard

^ permalink raw reply

* Re: [Xen-devel] [PATCH net-next v4 2/5] xen-netback: add support for IPv6 checksum offload from guest
From: David Vrabel @ 2013-10-14 12:19 UTC (permalink / raw)
  To: Wei Liu
  Cc: Paul Durrant, netdev@vger.kernel.org, Ian Campbell, David Vrabel,
	xen-devel@lists.xen.org
In-Reply-To: <20131014105527.GD11739@zion.uk.xensource.com>

On 14/10/13 11:55, Wei Liu wrote:
> On Mon, Oct 14, 2013 at 11:49:20AM +0100, Paul Durrant wrote:
>>> -----Original Message-----
>>> From: Wei Liu [mailto:wei.liu2@citrix.com]
>>> Sent: 14 October 2013 11:43
>>> To: Paul Durrant
>>> Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David Vrabel;
>>> Ian Campbell
>>> Subject: Re: [PATCH net-next v4 2/5] xen-netback: add support for IPv6
>>> checksum offload from guest
>>>
>>> On Fri, Oct 11, 2013 at 04:06:19PM +0100, Paul Durrant wrote:
>>> [...]
>>>> -/*
>>>> - * This is the amount of packet we copy rather than map, so that the
>>>> - * guest can't fiddle with the contents of the headers while we do
>>>> - * packet processing on them (netfilter, routing, etc).
>>>> +/* This is a miniumum size for the linear area to avoid lots of
>>>> + * calls to __pskb_pull_tail() as we set up checksum offsets.
>>>>   */
>>>
>>> You seem to forget to explain why 128 is chosen. :-)
>>
>> Is that not sufficient explanation? What sort of thing are you looking for?
>>
> 
>>From the second version of this patch, we had a conversation.
> 
>> Where does 128 come from?
>>
> 
> "It's just an arbitrary power of 2 that was chosen because it seems to
> cover most likely v6 headers and all v4 headers."
> 
> So something like: "We choose 128 which is likely to cover most V6
> headers and all V4 headers" would be sufficeint.

Is "most IPv6 headers" actually good enough?  Don't we need to ensure
netback copies all IP headers?

David

^ permalink raw reply

* [patch] yam: integer underflow in yam_ioctl()
From: Dan Carpenter @ 2013-10-14 12:28 UTC (permalink / raw)
  To: Jean-Paul Roubelat; +Cc: linux-hams, netdev, kernel-janitors

We cap bitrate at YAM_MAXBITRATE in yam_ioctl(), but it could also be
negative.  I don't know the impact of using a negative bitrate but let's
prevent it.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/include/linux/yam.h b/include/linux/yam.h
index 7fe2822..512cdc2 100644
--- a/include/linux/yam.h
+++ b/include/linux/yam.h
@@ -77,6 +77,6 @@ struct yamdrv_ioctl_cfg {
 
 struct yamdrv_ioctl_mcs {
 	int cmd;
-	int bitrate;
+	unsigned int bitrate;
 	unsigned char bits[YAM_FPGA_SIZE];
 };

^ permalink raw reply related

* RE: [Xen-devel] [PATCH net-next v4 2/5] xen-netback: add support for IPv6 checksum offload from guest
From: Paul Durrant @ 2013-10-14 12:34 UTC (permalink / raw)
  To: David Vrabel, Wei Liu
  Cc: netdev@vger.kernel.org, Ian Campbell, xen-devel@lists.xen.org
In-Reply-To: <525BE148.1010508@citrix.com>

> -----Original Message-----
> From: David Vrabel
> Sent: 14 October 2013 13:19
> To: Wei Liu
> Cc: Paul Durrant; netdev@vger.kernel.org; Ian Campbell; David Vrabel; xen-
> devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH net-next v4 2/5] xen-netback: add support
> for IPv6 checksum offload from guest
> 
> On 14/10/13 11:55, Wei Liu wrote:
> > On Mon, Oct 14, 2013 at 11:49:20AM +0100, Paul Durrant wrote:
> >>> -----Original Message-----
> >>> From: Wei Liu [mailto:wei.liu2@citrix.com]
> >>> Sent: 14 October 2013 11:43
> >>> To: Paul Durrant
> >>> Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David
> Vrabel;
> >>> Ian Campbell
> >>> Subject: Re: [PATCH net-next v4 2/5] xen-netback: add support for IPv6
> >>> checksum offload from guest
> >>>
> >>> On Fri, Oct 11, 2013 at 04:06:19PM +0100, Paul Durrant wrote:
> >>> [...]
> >>>> -/*
> >>>> - * This is the amount of packet we copy rather than map, so that the
> >>>> - * guest can't fiddle with the contents of the headers while we do
> >>>> - * packet processing on them (netfilter, routing, etc).
> >>>> +/* This is a miniumum size for the linear area to avoid lots of
> >>>> + * calls to __pskb_pull_tail() as we set up checksum offsets.
> >>>>   */
> >>>
> >>> You seem to forget to explain why 128 is chosen. :-)
> >>
> >> Is that not sufficient explanation? What sort of thing are you looking for?
> >>
> >
> >>From the second version of this patch, we had a conversation.
> >
> >> Where does 128 come from?
> >>
> >
> > "It's just an arbitrary power of 2 that was chosen because it seems to
> > cover most likely v6 headers and all v4 headers."
> >
> > So something like: "We choose 128 which is likely to cover most V6
> > headers and all V4 headers" would be sufficeint.
> 
> Is "most IPv6 headers" actually good enough?  Don't we need to ensure
> netback copies all IP headers?
> 

It will do if checksum offload is in use, but perhaps the pull as far as the transport header needs to be done anyway? I'm unsure of the expectations of other code.

  Paul

^ permalink raw reply

* [patch] yam: remove a no-op in yam_ioctl()
From: Dan Carpenter @ 2013-10-14 12:46 UTC (permalink / raw)
  To: Jean-Paul Roubelat; +Cc: linux-hams, netdev, kernel-janitors

We overwrite the ->bitrate with the user supplied information on the
next line.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/drivers/net/hamradio/yam.c b/drivers/net/hamradio/yam.c
index 0721e72..5af1c3e 100644
--- a/drivers/net/hamradio/yam.c
+++ b/drivers/net/hamradio/yam.c
@@ -975,7 +975,6 @@ static int yam_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 			return -EINVAL;		/* Cannot change this parameter when up */
 		if ((ym = kmalloc(sizeof(struct yamdrv_ioctl_mcs), GFP_KERNEL)) == NULL)
 			return -ENOBUFS;
-		ym->bitrate = 9600;
 		if (copy_from_user(ym, ifr->ifr_data, sizeof(struct yamdrv_ioctl_mcs))) {
 			kfree(ym);
 			return -EFAULT;

^ permalink raw reply related

* [PATCH] net: sctp: fix a cacc_saw_newack missetting issue
From: Chang Xiangzhong @ 2013-10-14 13:33 UTC (permalink / raw)
  To: nhorman, vyasevich
  Cc: davem, linux-sctp, netdev, linux-kernel, Chang Xiangzhong

For for each TSN t being newly acked (Not only cumulatively,
but also SELECTIVELY) cacc_saw_newack should be set to 1.

Signed-off-by: Xiangzhong Chang <changxiangzhong@gmail.com>
---
 net/sctp/outqueue.c |   42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 94df758..d86032b 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -1398,6 +1398,27 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 				forward_progress = true;
 			}
 
+			if (!tchunk->tsn_gap_acked) {
+				/*
+				 * SFR-CACC algorithm:
+				 * 2) If the SACK contains gap acks
+				 * and the flag CHANGEOVER_ACTIVE is
+				 * set the receiver of the SACK MUST
+				 * take the following action:
+				 *
+				 * B) For each TSN t being acked that
+				 * has not been acked in any SACK so
+				 * far, set cacc_saw_newack to 1 for
+				 * the destination that the TSN was
+				 * sent to.
+				 */
+				if (transport &&
+				    sack->num_gap_ack_blocks &&
+				    q->asoc->peer.primary_path->cacc.
+				    changeover_active)
+					transport->cacc.cacc_saw_newack	= 1;
+			}
+
 			if (TSN_lte(tsn, sack_ctsn)) {
 				/* RFC 2960  6.3.2 Retransmission Timer Rules
 				 *
@@ -1411,27 +1432,6 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 				restart_timer = 1;
 				forward_progress = true;
 
-				if (!tchunk->tsn_gap_acked) {
-					/*
-					 * SFR-CACC algorithm:
-					 * 2) If the SACK contains gap acks
-					 * and the flag CHANGEOVER_ACTIVE is
-					 * set the receiver of the SACK MUST
-					 * take the following action:
-					 *
-					 * B) For each TSN t being acked that
-					 * has not been acked in any SACK so
-					 * far, set cacc_saw_newack to 1 for
-					 * the destination that the TSN was
-					 * sent to.
-					 */
-					if (transport &&
-					    sack->num_gap_ack_blocks &&
-					    q->asoc->peer.primary_path->cacc.
-					    changeover_active)
-						transport->cacc.cacc_saw_newack
-							= 1;
-				}
 
 				list_add_tail(&tchunk->transmitted_list,
 					      &q->sacked);
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH net-next] sctp: Namespacify checksum_disable
From: Vlad Yasevich @ 2013-10-14 14:08 UTC (permalink / raw)
  To: Fan Du, nhorman; +Cc: davem, netdev
In-Reply-To: <1381739545-674-1-git-send-email-fan.du@windriver.com>



Fan Du <fan.du@windriver.com> wrote:

>SCTP CRC32-C checksum computing and verifying should be
>namespace-sensible,
>as each, e.g. tenant instance might need different checksum
>configuration for
>its requirement. So this patch enhance SCTP with this feature.
>
>Signed-off-by: Fan Du <fan.du@windriver.com>

NACK.  We don't want that setting to be sysctl configurable.  It is only useful in very limited circumstances and is not really for production/everyday use.

In fact, I am going to send in a patch that makes this module parameter read only in /sys. 

-vlad
>---
> include/net/netns/sctp.h   |    5 +++++
> include/net/sctp/structs.h |    3 ---
> net/sctp/input.c           |    2 +-
> net/sctp/output.c          |    4 +++-
> net/sctp/protocol.c        |    5 +++--
> net/sctp/sysctl.c          |    7 +++++++
> 6 files changed, 19 insertions(+), 7 deletions(-)
>
>diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
>index 3573a81..704adb3 100644
>--- a/include/net/netns/sctp.h
>+++ b/include/net/netns/sctp.h
>@@ -129,6 +129,11 @@ struct netns_sctp {
> 
> 	/* Threshold for autoclose timeout, in seconds. */
> 	unsigned long max_autoclose;
>+
>+	/* Set to 1 to disable CRC32-C checksum computing and verifying.
>+	 * Default to 0 to enable this feature.
>+	 */
>+	int checksum_disable;
> };
> 
> #endif /* __NETNS_SCTP_H__ */
>diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
>index 2174d8d..820895e 100644
>--- a/include/net/sctp/structs.h
>+++ b/include/net/sctp/structs.h
>@@ -134,9 +134,6 @@ extern struct sctp_globals {
> 	__u16 max_instreams;
> 	__u16 max_outstreams;
> 
>-	/* Flag to indicate whether computing and verifying checksum
>-	 * is disabled. */
>-        bool checksum_disable;
> } sctp_globals;
> 
> #define sctp_max_instreams		(sctp_globals.max_instreams)
>diff --git a/net/sctp/input.c b/net/sctp/input.c
>index 98b69bb..9db2a65 100644
>--- a/net/sctp/input.c
>+++ b/net/sctp/input.c
>@@ -134,7 +134,7 @@ int sctp_rcv(struct sk_buff *skb)
> 	__skb_pull(skb, skb_transport_offset(skb));
> 	if (skb->len < sizeof(struct sctphdr))
> 		goto discard_it;
>-	if (!sctp_checksum_disable && !skb_csum_unnecessary(skb) &&
>+	if (!net->sctp.checksum_disable && !skb_csum_unnecessary(skb) &&
> 		  sctp_rcv_checksum(net, skb) < 0)
> 		goto discard_it;
> 
>diff --git a/net/sctp/output.c b/net/sctp/output.c
>index 6de6402..5d0a45e 100644
>--- a/net/sctp/output.c
>+++ b/net/sctp/output.c
>@@ -395,6 +395,7 @@ int sctp_packet_transmit(struct sctp_packet
>*packet)
> 	struct sk_buff *nskb;
> 	struct sctp_chunk *chunk, *tmp;
> 	struct sock *sk;
>+	struct net *net;
> 	int err = 0;
> 	int padding;		/* How much padding do we need?  */
> 	__u8 has_data = 0;
>@@ -411,6 +412,7 @@ int sctp_packet_transmit(struct sctp_packet
>*packet)
> 	/* Set up convenience variables... */
> 	chunk = list_entry(packet->chunk_list.next, struct sctp_chunk, list);
> 	sk = chunk->skb->sk;
>+	net = sock_net(sk);
> 
> 	/* Allocate the new skb.  */
> 	nskb = alloc_skb(packet->size + LL_MAX_HEADER, GFP_ATOMIC);
>@@ -545,7 +547,7 @@ int sctp_packet_transmit(struct sctp_packet
>*packet)
> 	 * Note: Adler-32 is no longer applicable, as has been replaced
> 	 * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
> 	 */
>-	if (!sctp_checksum_disable) {
>+	if (!net->sctp.checksum_disable) {
> 		if ((!(dst->dev->features & NETIF_F_SCTP_CSUM)) ||
> 			is_xfrm_armed(dst)) {
> 
>diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
>index 5e17092..b3c51ce 100644
>--- a/net/sctp/protocol.c
>+++ b/net/sctp/protocol.c
>@@ -1239,6 +1239,9 @@ static int __net_init sctp_net_init(struct net
>*net)
> 	/* Initialize maximum autoclose timeout. */
> 	net->sctp.max_autoclose		= INT_MAX / HZ;
> 
>+	/* Enable checksum by default. */
>+	net->sctp.checksum_disable = 0;
>+
> 	status = sctp_sysctl_net_register(net);
> 	if (status)
> 		goto err_sysctl_register;
>@@ -1543,6 +1546,4 @@ MODULE_ALIAS("net-pf-" __stringify(PF_INET)
>"-proto-132");
> MODULE_ALIAS("net-pf-" __stringify(PF_INET6) "-proto-132");
>MODULE_AUTHOR("Linux Kernel SCTP developers
><linux-sctp@vger.kernel.org>");
> MODULE_DESCRIPTION("Support for the SCTP protocol (RFC2960)");
>-module_param_named(no_checksums, sctp_checksum_disable, bool, 0644);
>-MODULE_PARM_DESC(no_checksums, "Disable checksums computing and
>verification");
> MODULE_LICENSE("GPL");
>diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
>index 6b36561..d6a6cca 100644
>--- a/net/sctp/sysctl.c
>+++ b/net/sctp/sysctl.c
>@@ -290,6 +290,13 @@ static struct ctl_table sctp_net_table[] = {
> 		.extra1		= &max_autoclose_min,
> 		.extra2		= &max_autoclose_max,
> 	},
>+	{
>+		.procname	= "checksum_disable",
>+		.data		= &init_net.sctp.checksum_disable,
>+		.maxlen		= sizeof(int),
>+		.mode		= 0644,
>+		.proc_handler	= proc_dointvec,
>+	},
> 
> 	{ /* sentinel */ }
> };

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* [PATCH][net-next] gianfar: Simplify MQ polling to avoid soft lockup
From: Claudiu Manoil @ 2013-10-14 14:05 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller

Under certain low traffic conditions, the single core
devices with multiple Rx/Tx queues (MQ mode) may reach
soft lockup due to gfar_poll not returning in proper time.
The following exception was obtained using iperf on a 100Mbit
half-duplex link, for a p1010 single core device:

BUG: soft lockup - CPU#0 stuck for 23s! [iperf:2847]
Modules linked in:
CPU: 0 PID: 2847 Comm: iperf Not tainted 3.12.0-rc3 #16
task: e8bf8000 ti: eeb16000 task.ti: ee646000
NIP: c0255b6c LR: c0367ae8 CTR: c0461c18
REGS: eeb17e70 TRAP: 0901   Not tainted  (3.12.0-rc3)
MSR: 00029000 <CE,EE,ME>  CR: 44228428  XER: 20000000

GPR00: c0367ad4 eeb17f20 e8bf8000 ee01f4b4 00000008 ffffffff ffffffff
00000000
GPR08: 000000c0 00000008 000000ff ffffffc0 000193fe
NIP [c0255b6c] find_next_bit+0xb8/0xc4
LR [c0367ae8] gfar_poll+0xc8/0x1d8
Call Trace:
[eeb17f20] [c0367ad4] gfar_poll+0xb4/0x1d8 (unreliable)
[eeb17f70] [c0422100] net_rx_action+0xa4/0x158
[eeb17fa0] [c003ec6c] __do_softirq+0xcc/0x17c
[eeb17ff0] [c000c28c] call_do_softirq+0x24/0x3c
[ee647cc0] [c0004660] do_softirq+0x6c/0x94
[ee647ce0] [c003eb9c] local_bh_enable+0x9c/0xa0
[ee647cf0] [c0454fe8] tcp_prequeue_process+0xa4/0xdc
[ee647d10] [c0457e44] tcp_recvmsg+0x498/0x96c
[ee647d80] [c047b630] inet_recvmsg+0x40/0x64
[ee647da0] [c040ca8c] sock_recvmsg+0x90/0xc0
[ee647e30] [c040edb8] SyS_recvfrom+0x98/0xfc

To prevent this, the outer while() loop has been removed
allowing gfar_poll() to return faster even if there's
still budget left.  Also, there's no need to recompute
the budget per Rx queue anymore.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c | 87 ++++++++++++++------------------
 1 file changed, 38 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 9fbe4dd..d6d810c 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -2918,7 +2918,7 @@ static int gfar_poll(struct napi_struct *napi, int budget)
 	struct gfar_priv_rx_q *rx_queue = NULL;
 	int work_done = 0, work_done_per_q = 0;
 	int i, budget_per_q = 0;
-	int has_tx_work;
+	int has_tx_work = 0;
 	unsigned long rstat_rxf;
 	int num_act_queues;
 
@@ -2933,62 +2933,51 @@ static int gfar_poll(struct napi_struct *napi, int budget)
 	if (num_act_queues)
 		budget_per_q = budget/num_act_queues;
 
-	while (1) {
-		has_tx_work = 0;
-		for_each_set_bit(i, &gfargrp->tx_bit_map, priv->num_tx_queues) {
-			tx_queue = priv->tx_queue[i];
-			/* run Tx cleanup to completion */
-			if (tx_queue->tx_skbuff[tx_queue->skb_dirtytx]) {
-				gfar_clean_tx_ring(tx_queue);
-				has_tx_work = 1;
-			}
+	for_each_set_bit(i, &gfargrp->tx_bit_map, priv->num_tx_queues) {
+		tx_queue = priv->tx_queue[i];
+		/* run Tx cleanup to completion */
+		if (tx_queue->tx_skbuff[tx_queue->skb_dirtytx]) {
+			gfar_clean_tx_ring(tx_queue);
+			has_tx_work = 1;
 		}
+	}
 
-		for_each_set_bit(i, &gfargrp->rx_bit_map, priv->num_rx_queues) {
-			/* skip queue if not active */
-			if (!(rstat_rxf & (RSTAT_CLEAR_RXF0 >> i)))
-				continue;
-
-			rx_queue = priv->rx_queue[i];
-			work_done_per_q =
-				gfar_clean_rx_ring(rx_queue, budget_per_q);
-			work_done += work_done_per_q;
-
-			/* finished processing this queue */
-			if (work_done_per_q < budget_per_q) {
-				/* clear active queue hw indication */
-				gfar_write(&regs->rstat,
-					   RSTAT_CLEAR_RXF0 >> i);
-				rstat_rxf &= ~(RSTAT_CLEAR_RXF0 >> i);
-				num_act_queues--;
-
-				if (!num_act_queues)
-					break;
-				/* recompute budget per Rx queue */
-				budget_per_q =
-					(budget - work_done) / num_act_queues;
-			}
-		}
+	for_each_set_bit(i, &gfargrp->rx_bit_map, priv->num_rx_queues) {
+		/* skip queue if not active */
+		if (!(rstat_rxf & (RSTAT_CLEAR_RXF0 >> i)))
+			continue;
 
-		if (work_done >= budget)
-			break;
+		rx_queue = priv->rx_queue[i];
+		work_done_per_q =
+			gfar_clean_rx_ring(rx_queue, budget_per_q);
+		work_done += work_done_per_q;
+
+		/* finished processing this queue */
+		if (work_done_per_q < budget_per_q) {
+			/* clear active queue hw indication */
+			gfar_write(&regs->rstat,
+				   RSTAT_CLEAR_RXF0 >> i);
+			num_act_queues--;
+
+			if (!num_act_queues)
+				break;
+		}
+	}
 
-		if (!num_act_queues && !has_tx_work) {
+	if (!num_act_queues && !has_tx_work) {
 
-			napi_complete(napi);
+		napi_complete(napi);
 
-			/* Clear the halt bit in RSTAT */
-			gfar_write(&regs->rstat, gfargrp->rstat);
+		/* Clear the halt bit in RSTAT */
+		gfar_write(&regs->rstat, gfargrp->rstat);
 
-			gfar_write(&regs->imask, IMASK_DEFAULT);
+		gfar_write(&regs->imask, IMASK_DEFAULT);
 
-			/* If we are coalescing interrupts, update the timer
-			 * Otherwise, clear it
-			 */
-			gfar_configure_coalescing(priv, gfargrp->rx_bit_map,
-						  gfargrp->tx_bit_map);
-			break;
-		}
+		/* If we are coalescing interrupts, update the timer
+		 * Otherwise, clear it
+		 */
+		gfar_configure_coalescing(priv, gfargrp->rx_bit_map,
+					  gfargrp->tx_bit_map);
 	}
 
 	return work_done;
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH][net-next] gianfar: Simplify MQ polling to avoid soft lockup
From: Eric Dumazet @ 2013-10-14 14:34 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: netdev, David S. Miller
In-Reply-To: <1381759509-26882-1-git-send-email-claudiu.manoil@freescale.com>

On Mon, 2013-10-14 at 17:05 +0300, Claudiu Manoil wrote:
> Under certain low traffic conditions, the single core
> devices with multiple Rx/Tx queues (MQ mode) may reach
> soft lockup due to gfar_poll not returning in proper time.
> The following exception was obtained using iperf on a 100Mbit
> half-duplex link, for a p1010 single core device:
> 
> BUG: soft lockup - CPU#0 stuck for 23s! [iperf:2847]
> Modules linked in:
> CPU: 0 PID: 2847 Comm: iperf Not tainted 3.12.0-rc3 #16
> task: e8bf8000 ti: eeb16000 task.ti: ee646000
> NIP: c0255b6c LR: c0367ae8 CTR: c0461c18
> REGS: eeb17e70 TRAP: 0901   Not tainted  (3.12.0-rc3)
> MSR: 00029000 <CE,EE,ME>  CR: 44228428  XER: 20000000
> 
> GPR00: c0367ad4 eeb17f20 e8bf8000 ee01f4b4 00000008 ffffffff ffffffff
> 00000000
> GPR08: 000000c0 00000008 000000ff ffffffc0 000193fe
> NIP [c0255b6c] find_next_bit+0xb8/0xc4
> LR [c0367ae8] gfar_poll+0xc8/0x1d8
> Call Trace:
> [eeb17f20] [c0367ad4] gfar_poll+0xb4/0x1d8 (unreliable)
> [eeb17f70] [c0422100] net_rx_action+0xa4/0x158
> [eeb17fa0] [c003ec6c] __do_softirq+0xcc/0x17c
> [eeb17ff0] [c000c28c] call_do_softirq+0x24/0x3c
> [ee647cc0] [c0004660] do_softirq+0x6c/0x94
> [ee647ce0] [c003eb9c] local_bh_enable+0x9c/0xa0
> [ee647cf0] [c0454fe8] tcp_prequeue_process+0xa4/0xdc
> [ee647d10] [c0457e44] tcp_recvmsg+0x498/0x96c
> [ee647d80] [c047b630] inet_recvmsg+0x40/0x64
> [ee647da0] [c040ca8c] sock_recvmsg+0x90/0xc0
> [ee647e30] [c040edb8] SyS_recvfrom+0x98/0xfc
> 
> To prevent this, the outer while() loop has been removed
> allowing gfar_poll() to return faster even if there's
> still budget left.  Also, there's no need to recompute
> the budget per Rx queue anymore.

It seems there is a race condition, and this patch only makes it happen
less often ?

return faster means what exactly ?

^ permalink raw reply

* [RFC PATCH 0/2] Improve tracing at the driver/core boundary
From: Ben Hutchings @ 2013-10-14 14:39 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

These patches add static tracpeoints at the driver/core boundary which
record various skb fields likely to be useful for datapath debugging.
On the TX side the boundary is where the core calls ndo_start_xmit, and
on the RX side it is where any of the various exported receive functions
is called.

The set of skb fields is mostly based on what I thought would be
interesting for sfc, and may need to be augmented to be more general.

Ben.

Ben Hutchings (2):
  net: Add net_dev_start_xmit trace event, exposing more skb fields
  net: Add trace events for all receive entry points, exposing more skb
    fields

 include/trace/events/net.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c             | 102 ++++++++++++++++++-----------
 2 files changed, 221 insertions(+), 39 deletions(-)


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [RFC PATCH 1/2] net: Add net_dev_start_xmit trace event, exposing more skb fields
From: Ben Hutchings @ 2013-10-14 14:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1381761552.1626.8.camel@bwh-desktop.uk.level5networks.com>

The existing net/net_dev_xmit trace event provides little information
about the skb that has been passed to the driver, and it is not
simple to add more since the skb may already have been freed at
the point the event is emitted.

Add a separate trace event before the skb is passed to the driver,
including most fields that are likely to be interesting for debugging
driver datapath behaviour.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/trace/events/net.h | 58 ++++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c             |  2 ++
 2 files changed, 60 insertions(+)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index f99645d..0b61f2a 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -6,9 +6,67 @@
 
 #include <linux/skbuff.h>
 #include <linux/netdevice.h>
+#include <linux/if_vlan.h>
 #include <linux/ip.h>
 #include <linux/tracepoint.h>
 
+TRACE_EVENT(net_dev_start_xmit,
+
+	TP_PROTO(const struct sk_buff *skb, const struct net_device *dev),
+
+	TP_ARGS(skb, dev),
+
+	TP_STRUCT__entry(
+		__string(	name,			dev->name	)
+		__field(	u16,			queue_mapping	)
+		__field(	const void *,		skbaddr		)
+		__field(	bool,			vlan_tagged	)
+		__field(	u16,			vlan_proto	)
+		__field(	u16,			vlan_tci	)
+		__field(	u16,			protocol	)
+		__field(	u8,			ip_summed	)
+		__field(	unsigned int,		len		)
+		__field(	unsigned int,		data_len	)
+		__field(	int,			network_offset	)
+		__field(	bool,			transport_offset_valid)
+		__field(	int,			transport_offset)
+		__field(	u8,			tx_flags	)
+		__field(	u16,			gso_size	)
+		__field(	u16,			gso_segs	)
+		__field(	u16,			gso_type	)
+	),
+
+	TP_fast_assign(
+		__assign_str(name, dev->name);
+		__entry->queue_mapping = skb->queue_mapping;
+		__entry->skbaddr = skb;
+		__entry->vlan_tagged = vlan_tx_tag_present(skb);
+		__entry->vlan_proto = ntohs(skb->vlan_proto);
+		__entry->vlan_tci = vlan_tx_tag_get(skb);
+		__entry->protocol = ntohs(skb->protocol);
+		__entry->ip_summed = skb->ip_summed;
+		__entry->len = skb->len;
+		__entry->data_len = skb->data_len;
+		__entry->network_offset = skb_network_offset(skb);
+		__entry->transport_offset_valid =
+			skb_transport_header_was_set(skb);
+		__entry->transport_offset = skb_transport_offset(skb);
+		__entry->tx_flags = skb_shinfo(skb)->tx_flags;
+		__entry->gso_size = skb_shinfo(skb)->gso_size;
+		__entry->gso_segs = skb_shinfo(skb)->gso_segs;
+		__entry->gso_type = skb_shinfo(skb)->gso_type;
+	),
+
+	TP_printk("dev=%s queue_mapping=%u skbaddr=%p vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x ip_summed=%d len=%u data_len=%u network_offset=%d transport_offset_valid=%d transport_offset=%d tx_flags=%d gso_size=%d gso_segs=%d gso_type=%#x",
+		  __get_str(name), __entry->queue_mapping, __entry->skbaddr,
+		  __entry->vlan_tagged, __entry->vlan_proto, __entry->vlan_tci,
+		  __entry->protocol, __entry->ip_summed, __entry->len,
+		  __entry->data_len, 
+		  __entry->network_offset, __entry->transport_offset_valid,
+		  __entry->transport_offset, __entry->tx_flags,
+		  __entry->gso_size, __entry->gso_segs, __entry->gso_type)
+);
+
 TRACE_EVENT(net_dev_xmit,
 
 	TP_PROTO(struct sk_buff *skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index 1b6eadf..e221963 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2602,6 +2602,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			dev_queue_xmit_nit(skb, dev);
 
 		skb_len = skb->len;
+		trace_net_dev_start_xmit(skb, dev);
 		rc = ops->ndo_start_xmit(skb, dev);
 		trace_net_dev_xmit(skb, rc, dev, skb_len);
 		if (rc == NETDEV_TX_OK)
@@ -2620,6 +2621,7 @@ gso:
 			dev_queue_xmit_nit(nskb, dev);
 
 		skb_len = nskb->len;
+		trace_net_dev_start_xmit(nskb, dev);
 		rc = ops->ndo_start_xmit(nskb, dev);
 		trace_net_dev_xmit(nskb, rc, dev, skb_len);
 		if (unlikely(rc != NETDEV_TX_OK)) {


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [RFC PATCH 2/2] net: Add trace events for all receive entry points, exposing more skb fields
From: Ben Hutchings @ 2013-10-14 14:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1381761552.1626.8.camel@bwh-desktop.uk.level5networks.com>

The existing net/netif_rx and net/netif_receive_skb trace events
provide little information about the skb, nor do they indicate how it
entered the stack.

Add trace events at entry of each of the exported functions, including
most fields that are likely to be interesting for debugging driver
datapath behaviour.  Split netif_rx() and netif_receive_skb() so that
internal calls are not traced.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
There is a call to netif_rx() from xfrm_input() where I think the skb
has not passed through any layered device.  I'm thinking that perhaps
this should call netif_rx_internal() to avoid a confusing trace event.
Are there any other cases like this?

I'm not that happy about the event names here and am open to
suggestions.

Ben.

 include/trace/events/net.h | 100 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c             | 100 +++++++++++++++++++++++++++------------------
 2 files changed, 161 insertions(+), 39 deletions(-)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index 0b61f2a..731907c 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -136,6 +136,106 @@ DEFINE_EVENT(net_dev_template, netif_rx,
 
 	TP_ARGS(skb)
 );
+
+DECLARE_EVENT_CLASS(net_dev_rx_verbose_template,
+
+	TP_PROTO(const struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__string(	name,			skb->dev->name	)
+		__field(	unsigned int,		napi_id		)
+		__field(	u16,			queue_mapping	)
+		__field(	const void *,		skbaddr		)
+		__field(	bool,			vlan_tagged	)
+		__field(	u16,			vlan_proto	)
+		__field(	u16,			vlan_tci	)
+		__field(	u16,			protocol	)
+		__field(	u8,			ip_summed	)
+		__field(	u32,			rxhash		)
+		__field(	bool,			l4_rxhash	)
+		__field(	unsigned int,		len		)
+		__field(	unsigned int,		data_len	)
+		__field(	unsigned int,		truesize	)
+		__field(	bool,			mac_header_valid)
+		__field(	int,			mac_header	)
+		__field(	unsigned char,		nr_frags	)
+		__field(	u16,			gso_size	)
+		__field(	u16,			gso_type	)
+	),
+
+	TP_fast_assign(
+		__assign_str(name, skb->dev->name);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		__entry->napi_id = skb->napi_id;
+#else
+		__entry->napi_id = 0;
+#endif
+		__entry->queue_mapping = skb->queue_mapping;
+		__entry->skbaddr = skb;
+		__entry->vlan_tagged = vlan_tx_tag_present(skb);
+		__entry->vlan_proto = ntohs(skb->vlan_proto);
+		__entry->vlan_tci = vlan_tx_tag_get(skb);
+		__entry->protocol = ntohs(skb->protocol);
+		__entry->ip_summed = skb->ip_summed;
+		__entry->rxhash = skb->rxhash;
+		__entry->l4_rxhash = skb->l4_rxhash;
+		__entry->len = skb->len;
+		__entry->data_len = skb->data_len;
+		__entry->truesize = skb->truesize;
+		__entry->mac_header_valid = skb_mac_header_was_set(skb);
+		__entry->mac_header = skb_mac_header(skb) - skb->data;
+		__entry->nr_frags = skb_shinfo(skb)->nr_frags;
+		__entry->gso_size = skb_shinfo(skb)->gso_size;
+		__entry->gso_type = skb_shinfo(skb)->gso_type;
+	),
+
+	TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x ip_summed=%d rxhash=0x%08x l4_rxhash=%d len=%u data_len=%u truesize=%u mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d gso_type=%#x",
+		  __get_str(name), __entry->napi_id, __entry->queue_mapping,
+		  __entry->skbaddr, __entry->vlan_tagged, __entry->vlan_proto,
+		  __entry->vlan_tci, __entry->protocol, __entry->ip_summed,
+		  __entry->rxhash, __entry->l4_rxhash, __entry->len,
+		  __entry->data_len, __entry->truesize,
+		  __entry->mac_header_valid, __entry->mac_header,
+		  __entry->nr_frags, __entry->gso_size, __entry->gso_type)
+);
+
+DEFINE_EVENT(net_dev_rx_verbose_template, napi_gro_frags_entry,
+
+	TP_PROTO(const struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_rx_verbose_template, napi_gro_receive_entry,
+
+	TP_PROTO(const struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_entry,
+
+	TP_PROTO(const struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_entry,
+
+	TP_PROTO(const struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_ni_entry,
+
+	TP_PROTO(const struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
 #endif /* _TRACE_NET_H */
 
 /* This part must be outside protection */
diff --git a/net/core/dev.c b/net/core/dev.c
index e221963..faf49b7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -146,6 +146,8 @@ struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;	/* Taps */
 static struct list_head offload_base __read_mostly;
 
+static int netif_rx_internal(struct sk_buff *skb);
+
 /*
  * The @dev_base_head list is protected by @dev_base_lock and the rtnl
  * semaphore.
@@ -1698,7 +1700,7 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 	 */
 	skb_scrub_packet(skb, true);
 
-	return netif_rx(skb);
+	return netif_rx_internal(skb);
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
 
@@ -3213,22 +3215,7 @@ enqueue:
 	return NET_RX_DROP;
 }
 
-/**
- *	netif_rx	-	post buffer to the network code
- *	@skb: buffer to post
- *
- *	This function receives a packet from a device driver and queues it for
- *	the upper (protocol) levels to process.  It always succeeds. The buffer
- *	may be dropped during processing for congestion control or by the
- *	protocol layers.
- *
- *	return values:
- *	NET_RX_SUCCESS	(no congestion)
- *	NET_RX_DROP     (packet was dropped)
- *
- */
-
-int netif_rx(struct sk_buff *skb)
+static int netif_rx_internal(struct sk_buff *skb)
 {
 	int ret;
 
@@ -3264,14 +3251,38 @@ int netif_rx(struct sk_buff *skb)
 	}
 	return ret;
 }
+
+/**
+ *	netif_rx	-	post buffer to the network code
+ *	@skb: buffer to post
+ *
+ *	This function receives a packet from a device driver and queues it for
+ *	the upper (protocol) levels to process.  It always succeeds. The buffer
+ *	may be dropped during processing for congestion control or by the
+ *	protocol layers.
+ *
+ *	return values:
+ *	NET_RX_SUCCESS	(no congestion)
+ *	NET_RX_DROP     (packet was dropped)
+ *
+ */
+
+int netif_rx(struct sk_buff *skb)
+{
+	trace_netif_rx_entry(skb);
+
+	return netif_rx_internal(skb);
+}
 EXPORT_SYMBOL(netif_rx);
 
 int netif_rx_ni(struct sk_buff *skb)
 {
 	int err;
 
+	trace_netif_rx_ni_entry(skb);
+
 	preempt_disable();
-	err = netif_rx(skb);
+	err = netif_rx_internal(skb);
 	if (local_softirq_pending())
 		do_softirq();
 	preempt_enable();
@@ -3653,22 +3664,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	return ret;
 }
 
-/**
- *	netif_receive_skb - process receive buffer from network
- *	@skb: buffer to process
- *
- *	netif_receive_skb() is the main receive data processing function.
- *	It always succeeds. The buffer may be dropped during processing
- *	for congestion control or by the protocol layers.
- *
- *	This function may only be called from softirq context and interrupts
- *	should be enabled.
- *
- *	Return values (usually ignored):
- *	NET_RX_SUCCESS: no congestion
- *	NET_RX_DROP: packet was dropped
- */
-int netif_receive_skb(struct sk_buff *skb)
+static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	net_timestamp_check(netdev_tstamp_prequeue, skb);
 
@@ -3694,6 +3690,28 @@ int netif_receive_skb(struct sk_buff *skb)
 #endif
 	return __netif_receive_skb(skb);
 }
+
+/**
+ *	netif_receive_skb - process receive buffer from network
+ *	@skb: buffer to process
+ *
+ *	netif_receive_skb() is the main receive data processing function.
+ *	It always succeeds. The buffer may be dropped during processing
+ *	for congestion control or by the protocol layers.
+ *
+ *	This function may only be called from softirq context and interrupts
+ *	should be enabled.
+ *
+ *	Return values (usually ignored):
+ *	NET_RX_SUCCESS: no congestion
+ *	NET_RX_DROP: packet was dropped
+ */
+int netif_receive_skb(struct sk_buff *skb)
+{
+	trace_netif_receive_skb_entry(skb);
+
+	return netif_receive_skb_internal(skb);
+}
 EXPORT_SYMBOL(netif_receive_skb);
 
 /* Network device is going away, flush any packets still pending
@@ -3755,7 +3773,7 @@ static int napi_gro_complete(struct sk_buff *skb)
 	}
 
 out:
-	return netif_receive_skb(skb);
+	return netif_receive_skb_internal(skb);
 }
 
 /* napi->gro_list contains packets ordered by age.
@@ -3906,7 +3924,7 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 {
 	switch (ret) {
 	case GRO_NORMAL:
-		if (netif_receive_skb(skb))
+		if (netif_receive_skb_internal(skb))
 			ret = GRO_DROP;
 		break;
 
@@ -3948,6 +3966,8 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
 
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
+	trace_napi_gro_receive_entry(skb);
+
 	skb_gro_reset_offset(skb);
 
 	return napi_skb_finish(dev_gro_receive(napi, skb), skb);
@@ -3989,7 +4009,7 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi, struct sk_buff *
 
 		if (ret == GRO_HELD)
 			skb_gro_pull(skb, -ETH_HLEN);
-		else if (netif_receive_skb(skb))
+		else if (netif_receive_skb_internal(skb))
 			ret = GRO_DROP;
 		break;
 
@@ -4048,6 +4068,8 @@ gro_result_t napi_gro_frags(struct napi_struct *napi)
 	if (!skb)
 		return GRO_DROP;
 
+	trace_napi_gro_frags_entry(skb);
+
 	return napi_frags_finish(napi, skb, dev_gro_receive(napi, skb));
 }
 EXPORT_SYMBOL(napi_gro_frags);
@@ -6591,11 +6613,11 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 
 	/* Process offline CPU's input_pkt_queue */
 	while ((skb = __skb_dequeue(&oldsd->process_queue))) {
-		netif_rx(skb);
+		netif_rx_internal(skb);
 		input_queue_head_incr(oldsd);
 	}
 	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
-		netif_rx(skb);
+		netif_rx_internal(skb);
 		input_queue_head_incr(oldsd);
 	}
 

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* Re: [PATCH][net-next] gianfar: Simplify MQ polling to avoid soft lockup
From: Claudiu Manoil @ 2013-10-14 15:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller
In-Reply-To: <1381761267.3392.49.camel@edumazet-glaptop.roam.corp.google.com>

On 10/14/2013 5:34 PM, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 17:05 +0300, Claudiu Manoil wrote:
>> Under certain low traffic conditions, the single core
>> devices with multiple Rx/Tx queues (MQ mode) may reach
>> soft lockup due to gfar_poll not returning in proper time.
>> The following exception was obtained using iperf on a 100Mbit
>> half-duplex link, for a p1010 single core device:
>>
>> BUG: soft lockup - CPU#0 stuck for 23s! [iperf:2847]
>> Modules linked in:
>> CPU: 0 PID: 2847 Comm: iperf Not tainted 3.12.0-rc3 #16
>> task: e8bf8000 ti: eeb16000 task.ti: ee646000
>> NIP: c0255b6c LR: c0367ae8 CTR: c0461c18
>> REGS: eeb17e70 TRAP: 0901   Not tainted  (3.12.0-rc3)
>> MSR: 00029000 <CE,EE,ME>  CR: 44228428  XER: 20000000
>>
>> GPR00: c0367ad4 eeb17f20 e8bf8000 ee01f4b4 00000008 ffffffff ffffffff
>> 00000000
>> GPR08: 000000c0 00000008 000000ff ffffffc0 000193fe
>> NIP [c0255b6c] find_next_bit+0xb8/0xc4
>> LR [c0367ae8] gfar_poll+0xc8/0x1d8
>> Call Trace:
>> [eeb17f20] [c0367ad4] gfar_poll+0xb4/0x1d8 (unreliable)
>> [eeb17f70] [c0422100] net_rx_action+0xa4/0x158
>> [eeb17fa0] [c003ec6c] __do_softirq+0xcc/0x17c
>> [eeb17ff0] [c000c28c] call_do_softirq+0x24/0x3c
>> [ee647cc0] [c0004660] do_softirq+0x6c/0x94
>> [ee647ce0] [c003eb9c] local_bh_enable+0x9c/0xa0
>> [ee647cf0] [c0454fe8] tcp_prequeue_process+0xa4/0xdc
>> [ee647d10] [c0457e44] tcp_recvmsg+0x498/0x96c
>> [ee647d80] [c047b630] inet_recvmsg+0x40/0x64
>> [ee647da0] [c040ca8c] sock_recvmsg+0x90/0xc0
>> [ee647e30] [c040edb8] SyS_recvfrom+0x98/0xfc
>>
>> To prevent this, the outer while() loop has been removed
>> allowing gfar_poll() to return faster even if there's
>> still budget left.  Also, there's no need to recompute
>> the budget per Rx queue anymore.
>
> It seems there is a race condition, and this patch only makes it happen
> less often ?
>
> return faster means what exactly ?
>

Hi Eric,
Because of the outer while loop, gfar_poll may not return due
to continuous tx work. The later implementation of gfar_poll
allows only one iteration of the Tx queues before returning
control to net_rx_action(), that's what I meant with "returns faster".
I tested this fix with different loads, and the soft lockup
didn't trigger (without the fix it triggers right away).

Besides, isn't this a more appropriate napi poll implementation
than the former one with the outer while() loop?

Thanks,
Claudiu

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox