Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 04/15] security: Add Linux Security Modules hook for AF_BUS sockets
From: Paul Moore @ 2012-07-09 18:02 UTC (permalink / raw)
  To: Vincent Sanders
  Cc: netdev, linux-kernel, David S. Miller, Javier Martinez Canillas,
	casey
In-Reply-To: <1340988354-26981-5-git-send-email-vincent.sanders@collabora.co.uk>

On Friday, June 29, 2012 05:45:43 PM Vincent Sanders wrote:
> From: Javier Martinez Canillas <javier.martinez@collabora.co.uk>
> 
> AF_BUS implements a security hook bus_connect() to be used by LSM to
> enforce connectivity security policies.
> 
> Signed-off-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk>
> Signed-off-by: Vincent Sanders <vincent.sanders@collabora.co.uk>

In future postings, please reorder the patchset so that this patch (and the 
LSM specific patches) are applied after the actual AF_BUS implementation 
(patch 08/15 in this patchset).  This makes it easier to quickly understand 
how the LSM hooks/implementation interacts with the AF_BUS code.

A good rule of thumb that I try to follow when submitting large patchsets is 
that each patch should contain code that won't be optimized away during the 
build because there is no caller.  Sometimes that isn't possible without 
making things overly awkward, but in this particular case it shouldn't cause a 
problem.

> ---
>  include/linux/security.h |   11 +++++++++++
>  security/capability.c    |    7 +++++++
>  security/security.c      |    7 +++++++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 4e5a73c..d30dc4a 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h

...

> +static inline int security_bus_connect(struct socket *sock,
> +				       struct sock *other,
> +				       struct sock *newsk)
> +{
> +	return 0;
> +}
> +

Other than the AF_UNIX specific name, is there a reason why you chose not to 
reuse the unix_stream_connect() LSM hook?  The arguments are the same, and 
based on an initial quick review of the SELinux hook implementations they 
appear to do almost identical things; the permissions are different but it 
should be trivial to make that conditional on the parent socket's address 
family (SELinux does similar things with other socket operations).  Looking at 
the Smack implementation, I don't think it would be a problem there either 
(CC'd Casey for his thoughts).

I'm still reviewing the rest of the AF_BUS patches but wanted to ask this now 
in case I was missing something.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH 0/4] Add a driver for the ASIX AX88172A
From: Grant Grundler @ 2012-07-09 17:45 UTC (permalink / raw)
  To: Christian Riesch
  Cc: netdev, Oliver Neukum, Eric Dumazet, Allan Chou, Mark Lord,
	Ming Lei, Michael Riesch
In-Reply-To: <1341574388-7464-1-git-send-email-christian.riesch@omicron.at>

Christian,
Here's my $0.02 response to your questions.

On Fri, Jul 6, 2012 at 4:33 AM, Christian Riesch
<christian.riesch@omicron.at> wrote:
...
> I have a few questions:
>
> 1) Is it ok to factor out the common code like I did? Or should
>    it go into a separate kernel module?

I think it's ok. I'd rather not see additional kernel modules unless
the driver is substantially different.


> 2) phylib/usbnet: Currently I have an empty .status function
>    in my const struct driver_info ax88172a_info. I think this
>    notifies me of a link change, right? I don't know
>    what I should do in this function. Trigger the phy state machine
>    like a phy interrupt would do, since link changes are handled
>    by the phy state machine?

I don't see any Documentation for this entry point. My reading of the
code is driver_info->status (include/linux/usb/usbnet.h) is to poll
current link status from URBs. See  usbnet_probe()  (calls
init_status()) and intr_complete() (calls your stub).  I also looked
at usbnet_cdc_status() in cdc_ethernet.c. Maybe other drivers are
better examples.

hth,
grant

^ permalink raw reply

* Re: [PATCH 04/16] mm: allow PF_MEMALLOC from softirq context
From: Sebastian Andrzej Siewior @ 2012-07-09 16:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Eric Dumazet
In-Reply-To: <20120709100442.GZ14154@suse.de>

On Mon, Jul 09, 2012 at 11:04:42AM +0100, Mel Gorman wrote:
> > - lets assume your allocation happens with kmalloc() without __GFP_MEMALLOC
> >   and current->flags has PF_MEMALLOC ORed and your SLAB pool is empty. This
> >   forces SLAB to allocate more pages from the buddy allocator with it will
> >   receive more likely (due to ->current->flags + PF_MEMALLOC) but SLAB will
> >   drop this extra memory because the page has ->pf_memory (or something like
> >   that) set and the GFP_FLAGS do not have __GFP_MEMALLOC set.
> > 
> 
> It's recorded if the slab page was allocated from PFMEMALLOC reserves (see
> patch 2 from the swap over NBD series). slab will use this page for objects
> but only allocate them to callers that pass a gfp_pfmemalloc_allowed() check.
> kmalloc() users with either __GFP_MEMALLOC or PF_MEMALLOC will get
> the pages they need but they will not "leak" to !_GFP_MEMALLOC users as
> that would potentially deadlock.

Argh, I missed that gfp_to_alloc_flags() is not only called from
within the buddy allocater but also from slab. So this is fine then :)

One thing:
You only get current->flags |= PF_MEMALLOC in softirq _if_ the skb, which is 
passed to netif_receive_skb(), was allocated with __GFP_MEMALLOC. That
means if the NIC's RX allocation did not require an allocation from the
emergency pool (without ->pfmemalloc set) then you never use this extra
pool, even if this skb would end up in your swap socket. Also, the other way
around, where you allocate it from the emergency pool but it is a user
socket and you could drop it.

What about extending sk_set_memalloc() to record socket's ips + ports
in a separate list so that skb_pfmemalloc_protocol() might use that
information and decide on per-protocol basis if the skb is worth to
spend more ressource to deliver it. That means you would enable the
extra pool if the currently received skb is part of your swap socket and
not if the skb was allocated from the emergency pool.

That said, there is nothing wrong with the code as of now and this
optimization could be added later (if at all).

Sebastian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
From: Rick Jones @ 2012-07-09 16:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <4FFA4EAD.7000707@redhat.com>

On 07/08/2012 08:23 PM, Jason Wang wrote:
> On 07/07/2012 12:23 AM, Rick Jones wrote:
>> On 07/06/2012 12:42 AM, Jason Wang wrote:
>> Which mechanism to address skew error?  The netperf manual describes
>> more than one:
>
> This mechanism is missed in my test, I would add them to my test scripts.
>>
>> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
>>
>>
>> Personally, my preference these days is to use the "demo mode" method
>> of aggregate results as it can be rather faster than (ab)using the
>> confidence intervals mechanism, which I suspect may not really scale
>> all that well to large numbers of concurrent netperfs.
>
> During my test, the confidence interval would even hard to achieved in
> RR test when I pin vhost/vcpus in the processors, so I didn't use it.

When running aggregate netperfs, *something* has to be done to address 
the prospect of skew error.  Otherwise the results are suspect.

happy benchmarking,

rick jones

^ permalink raw reply

* router advertisement processing.
From: BALAKUMARAN KANNAN @ 2012-07-09 16:35 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Dear all,

    I am running tahi IPv6 ready-logo test suite (www.tahi.org) in kernel-3.0.26. Currently I am facing a problem in RFC-4861. The test cases 145, 147 and 149 fails. All three test cases are related to processing the Router Advertisements (ra).

    The setup is having a tester node (tn: a FreeBSD machine) and a node under test (nut: a target board running kernel-3.0.26).

    In test case 145, the tn will send a ra with curhoplimit=64. And will send a ICMP request to the nut. Then tn will receive the ICMP reply from nut and check its hoplimit for 64. Then it will send another ra with curhoplimit=0 (the tn expects this ra should be ignored). Again will a send an ICMP request to check the hoplimit of ICMP reply from nut. The tn expects the reply should have the same hoplimit, 64 as it wants the later ra should be ignored.

    But In my case, the nut doesn't use the value given in the curhoplimit from the tn. Instead for the first ICMP reply it uses the value of /proc/sys/net/ipv6/conf/eth0/hop_limit and for the later ICMP reply it uses the value 255 as hoplimit. So this test case fails. I read the source net/ipv6/ndisc.c and net/ipv6/icmp.c.

    In ndisc.c line 1285, the curhoplimit value is stored. But while preparing ICMP reply, to determine the hoplimit (in route.c line number: 1162), the function ip6_dst_hoplimit, gets the dst_metric_raw and returns that value. But that value becomes 255 once the ra with 0 received.

   And 147 and 149. Both test cases are related to one concept. The tn  sends router advertisement (ra) with some time limit. And it continuously sends ICMP request till the time expires. And also some seconds even after the time expires.  The tn expects that the nut should not reply to the ICMP request after the time expires. But the nut does reply.  In net/ipv6/icmp.c, I can't find the code checking for the time expiry. Also in net/ipv6/ndisc.c I can't find the value lifetime (line no: 1234) got stored in the dst->metrics array.

    I furnished this mail as per my understanding. I am very newbie to kernel code. Frankly speaking, this is my first encounter with kernel source. So there is a very good chance for me to be wrong. Please direct me in right path if I am wrong. Kindly help me to solve this issue.


Thank you all.

--Regards,
K.Balakumaran
+91 7406 479 544

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: populate skb->l4_rxhash
From: Eilon Greenstein @ 2012-07-09 16:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn
In-Reply-To: <1341849744.3265.3219.camel@edumazet-glaptop>

On Mon, 2012-07-09 at 18:02 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> l4_rxhash is set on skb when rxhash is obtained from canonical 4-tuple
> over transport ports/addresses.
> 
> We can set skb->l4_rxhash for all incoming TCP packets on bnx2x for
> free, as cqe status contains a hash type information.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Eilon Greenstein <eilong@broadcom.com>
> Cc: Willem de Bruijn <willemb@google.com>
> ---

This looks very nice :)

Obviously, I have not had a chance to run it in the lab yet, but it
sounds like you already did. So I'm acking it, and will start testing -
if I see anything, I will keep you in the loop...

Thanks Eric!

Acked-by: Eilon Greenstein <eilong@broadcom.com>

^ permalink raw reply

* [PATCH net-next] bnx2x: populate skb->l4_rxhash
From: Eric Dumazet @ 2012-07-09 16:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eilon Greenstein, Willem de Bruijn

From: Eric Dumazet <edumazet@google.com>

l4_rxhash is set on skb when rxhash is obtained from canonical 4-tuple
over transport ports/addresses.

We can set skb->l4_rxhash for all incoming TCP packets on bnx2x for
free, as cqe status contains a hash type information.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Cc: Willem de Bruijn <willemb@google.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h     |    1 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |   19 +++++++++++---
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 362d16f..d2dc420 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -454,6 +454,7 @@ struct bnx2x_agg_info {
 	u16			vlan_tag;
 	u16			len_on_bd;
 	u32			rxhash;
+	bool			l4_rxhash;
 	u16			gro_size;
 	u16			full_page;
 };
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 00951b3..5aeb034 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -295,12 +295,20 @@ static inline void bnx2x_update_sge_prod(struct bnx2x_fastpath *fp,
  * CQE (calculated by HW).
  */
 static u32 bnx2x_get_rxhash(const struct bnx2x *bp,
-			    const struct eth_fast_path_rx_cqe *cqe)
+			    const struct eth_fast_path_rx_cqe *cqe,
+			    bool *l4_rxhash)
 {
 	/* Set Toeplitz hash from CQE */
 	if ((bp->dev->features & NETIF_F_RXHASH) &&
-	    (cqe->status_flags & ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
+	    (cqe->status_flags & ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG)) {
+		enum eth_rss_hash_type htype;
+
+		htype = cqe->status_flags & ETH_FAST_PATH_RX_CQE_RSS_HASH_TYPE;
+		*l4_rxhash = (htype == TCP_IPV4_HASH_TYPE) ||
+			     (htype == TCP_IPV6_HASH_TYPE);
 		return le32_to_cpu(cqe->rss_hash_result);
+	}
+	*l4_rxhash = false;
 	return 0;
 }
 
@@ -354,7 +362,7 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
 	tpa_info->tpa_state = BNX2X_TPA_START;
 	tpa_info->len_on_bd = le16_to_cpu(cqe->len_on_bd);
 	tpa_info->placement_offset = cqe->placement_offset;
-	tpa_info->rxhash = bnx2x_get_rxhash(bp, cqe);
+	tpa_info->rxhash = bnx2x_get_rxhash(bp, cqe, &tpa_info->l4_rxhash);
 	if (fp->mode == TPA_MODE_GRO) {
 		u16 gro_size = le16_to_cpu(cqe->pkt_len_or_gro_seg_len);
 		tpa_info->full_page =
@@ -589,6 +597,7 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 		skb_reserve(skb, pad + NET_SKB_PAD);
 		skb_put(skb, len);
 		skb->rxhash = tpa_info->rxhash;
+		skb->l4_rxhash = tpa_info->l4_rxhash;
 
 		skb->protocol = eth_type_trans(skb, bp->dev);
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
@@ -712,6 +721,7 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 		enum eth_rx_cqe_type cqe_fp_type;
 		u16 len, pad, queue;
 		u8 *data;
+		bool l4_rxhash;
 
 #ifdef BNX2X_STOP_ON_ERROR
 		if (unlikely(bp->panic))
@@ -855,7 +865,8 @@ reuse_rx:
 		skb->protocol = eth_type_trans(skb, bp->dev);
 
 		/* Set Toeplitz hash for a none-LRO skb */
-		skb->rxhash = bnx2x_get_rxhash(bp, cqe_fp);
+		skb->rxhash = bnx2x_get_rxhash(bp, cqe_fp, &l4_rxhash);
+		skb->l4_rxhash = l4_rxhash;
 
 		skb_checksum_none_assert(skb);
 

^ permalink raw reply related

* Re: [RFC PATCH] net/ipv4/ipip: add support to move between network namespaces
From: Joe Perches @ 2012-07-09 16:00 UTC (permalink / raw)
  To: Christian Franke; +Cc: netdev
In-Reply-To: <1341848473-2666-1-git-send-email-christian.franke@adytonsystems.com>

On Mon, 2012-07-09 at 17:41 +0200, Christian Franke wrote:
> Below there is a first attempt at adding support for IPIP tunnels to be moved
> across network namespaces. This allows e.g. for tunnel setups where the inner
> network is completely isolated from the outer transport network.

trivia:

> diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
[]
> @@ -652,6 +660,9 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
>  		break;
>  
>  	case SIOCADDTUNNEL:
> +		/* New Tunnels will be created in the current namespace */

New tunnels

> @@ -701,6 +712,15 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
>  				t->parms.iph.tos = p.iph.tos;
>  				t->parms.iph.frag_off = p.iph.frag_off;
>  				if (t->parms.link != p.link) {
> +					if (!net_eq(dev_net(dev),
> +						    target_net(dev))) {
> +						pr_info_once("%s: rebinding "
> +							     "cross ns device "
> +							     "is not supported\n",
> +							     __func__);

Please coalesce format strings.
						pr_info_once("%s: rebinding cross ns device is not supported\n",
							     __func__);

shouldn't "cross ns device" be "different ns devices"?

^ permalink raw reply

* [RFC PATCH] net/ipv4/ipip: add support to move between network namespaces
From: Christian Franke @ 2012-07-09 15:41 UTC (permalink / raw)
  To: netdev; +Cc: Christian Franke

Hi,

Below there is a first attempt at adding support for IPIP tunnels to be moved
across network namespaces. This allows e.g. for tunnel setups where the inner
network is completely isolated from the outer transport network.

One thing I would especially like comments on is the current approach at
namespace reference counting. Currently, the tunnel will acquire a reference
to its original namespace when it is moved to a different namespace, preventing
the transport namespace from being destroyed until the tunnel is either returned
or deleted.

Best Regards,
Christian Franke
---
 include/net/ipip.h |  1 +
 net/ipv4/ipip.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/include/net/ipip.h b/include/net/ipip.h
index a93cf6d..f7ab237 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -18,6 +18,7 @@ struct ip_tunnel_6rd_parm {
 struct ip_tunnel {
 	struct ip_tunnel __rcu	*next;
 	struct net_device	*dev;
+	struct net		*target_net;
 
 	int			err_count;	/* Number of arrived ICMP errors */
 	unsigned long		err_time;	/* Time when the last ICMP error arrived */
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 715338a..2321a34 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -99,6 +99,7 @@
 #include <asm/uaccess.h>
 #include <linux/skbuff.h>
 #include <linux/netdevice.h>
+#include <linux/notifier.h>
 #include <linux/in.h>
 #include <linux/tcp.h>
 #include <linux/udp.h>
@@ -151,6 +152,13 @@ struct pcpu_tstats {
 	struct u64_stats_sync	syncp;
 };
 
+static inline struct net *target_net(struct net_device *dev)
+{
+	struct ip_tunnel *t = netdev_priv(dev);
+
+	return t->target_net ? t->target_net : dev_net(dev);
+}
+
 static struct rtnl_link_stats64 *ipip_get_stats64(struct net_device *dev,
 						  struct rtnl_link_stats64 *tot)
 {
@@ -314,7 +322,7 @@ failed_free:
 /* called with RTNL */
 static void ipip_tunnel_uninit(struct net_device *dev)
 {
-	struct net *net = dev_net(dev);
+	struct net *net = target_net(dev);
 	struct ipip_net *ipn = net_generic(net, ipip_net_id);
 
 	if (dev == ipn->fb_tunnel_dev)
@@ -481,7 +489,7 @@ static netdev_tx_t ipip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 		dst = rt->rt_gateway;
 	}
 
-	rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
+	rt = ip_route_output_ports(target_net(dev), &fl4, NULL,
 				   dst, tiph->saddr,
 				   0, 0,
 				   IPPROTO_IPIP, RT_TOS(tos),
@@ -631,7 +639,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	int err = 0;
 	struct ip_tunnel_parm p;
 	struct ip_tunnel *t;
-	struct net *net = dev_net(dev);
+	struct net *net = target_net(dev);
 	struct ipip_net *ipn = net_generic(net, ipip_net_id);
 
 	switch (cmd) {
@@ -652,6 +660,9 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		break;
 
 	case SIOCADDTUNNEL:
+		/* New Tunnels will be created in the current namespace */
+		net = dev_net(dev);
+		ipn = net_generic(net, ipip_net_id);
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
 		if (!capable(CAP_NET_ADMIN))
@@ -701,6 +712,15 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 				t->parms.iph.tos = p.iph.tos;
 				t->parms.iph.frag_off = p.iph.frag_off;
 				if (t->parms.link != p.link) {
+					if (!net_eq(dev_net(dev),
+						    target_net(dev))) {
+						pr_info_once("%s: rebinding "
+							     "cross ns device "
+							     "is not supported\n",
+							     __func__);
+						err = -ENOTTY;
+						goto done;
+					}
 					t->parms.link = p.link;
 					ipip_tunnel_bind_dev(dev);
 					netdev_state_change(dev);
@@ -759,6 +779,10 @@ static const struct net_device_ops ipip_netdev_ops = {
 
 static void ipip_dev_free(struct net_device *dev)
 {
+	struct ip_tunnel *t = netdev_priv(dev);
+
+	if (t->target_net)
+		put_net(t->target_net);
 	free_percpu(dev->tstats);
 	free_netdev(dev);
 }
@@ -774,7 +798,6 @@ static void ipip_tunnel_setup(struct net_device *dev)
 	dev->flags		= IFF_NOARP;
 	dev->iflink		= 0;
 	dev->addr_len		= 4;
-	dev->features		|= NETIF_F_NETNS_LOCAL;
 	dev->features		|= NETIF_F_LLTX;
 	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
 }
@@ -904,20 +927,69 @@ static struct pernet_operations ipip_net_ops = {
 	.size = sizeof(struct ipip_net),
 };
 
+static bool ipip_device_exists(struct net_device *dev)
+{
+	/* TODO: this is probably not the right check */
+	return dev->netdev_ops == &ipip_netdev_ops;
+}
+
+static int ipip_device_event(struct notifier_block *unused,
+			     unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct ip_tunnel *t;
+
+	if (!ipip_device_exists(dev))
+		return NOTIFY_DONE;
+
+	t = netdev_priv(dev);
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		/* When the tunnel is moved from its natural
+		 * network namespace, it will keep a reference
+		 * to it. */
+		if (dev->reg_state != NETREG_UNREGISTERING) {
+			if (!t->target_net)
+				t->target_net = get_net(dev_net(dev));
+		}
+		break;
+	case NETDEV_REGISTER:
+		if (net_eq(dev_net(dev), t->target_net)) {
+			put_net(t->target_net);
+			t->target_net = NULL;
+		}
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block ipip_notifier_block = {
+	.notifier_call = ipip_device_event,
+};
+
 static int __init ipip_init(void)
 {
 	int err;
 
 	printk(banner);
 
-	err = register_pernet_device(&ipip_net_ops);
+	err = register_netdevice_notifier(&ipip_notifier_block);
 	if (err < 0)
 		return err;
+
+	err = register_pernet_device(&ipip_net_ops);
+	if (err < 0)
+		goto out_pernet;
+
 	err = xfrm4_tunnel_register(&ipip_handler, AF_INET);
-	if (err < 0) {
-		unregister_pernet_device(&ipip_net_ops);
-		pr_info("%s: can't register tunnel\n", __func__);
-	}
+	if (err < 0)
+		goto out_xfrm;
+	return err;
+out_xfrm:
+	unregister_pernet_device(&ipip_net_ops);
+out_pernet:
+	unregister_netdevice_notifier(&ipip_notifier_block);
+	pr_info("%s: can't register tunnel\n", __func__);
 	return err;
 }
 
@@ -927,6 +999,7 @@ static void __exit ipip_fini(void)
 		pr_info("%s: can't deregister tunnel\n", __func__);
 
 	unregister_pernet_device(&ipip_net_ops);
+	unregister_netdevice_notifier(&ipip_notifier_block);
 }
 
 module_init(ipip_init);
-- 
1.7.11

^ permalink raw reply related

* Re: [RFC PATCH] tcp: limit data skbs in qdisc layer
From: Eric Dumazet @ 2012-07-09 14:55 UTC (permalink / raw)
  To: David Miller; +Cc: nanditad, netdev, ycheng, codel, mattmathis, ncardwell
In-Reply-To: <20120709.000834.1182150057463599677.davem@davemloft.net>

On Mon, 2012-07-09 at 00:08 -0700, David Miller wrote:

> I'm suspicious and anticipate that 10G will need more queueing than
> you are able to get away with tg3 at 1G speeds.  But it is an exciting
> idea nonetheless :-)

There is a fundamental problem calling any xmit function from skb
destructor.

skb destructor can be called while qdisc lock is taken, so we can
deadlock trying to reacquire it.

One such path is the dev_deactivate_queue() -> qdisc_reset() ->
qdisc_reset_queue(), but also any dropped skbs in qdisc.

So I should only do this stuff from a separate context, for example a
tasklet or timer.

Alternative would be to use dev_kfree_skb_irq() for all dropped skbs in
qdisc layer.

^ permalink raw reply

* Re: [PATCH] ip.7: Improve explanation about calling listen or connect
From: Peter Schiffer @ 2012-07-09 14:53 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: Flavio Leitner, linux-man-u79uwXL29TY76Z2rM5mHXA, netdev
In-Reply-To: <20120606114426.0595fbac-xCsz71Izw+A8CFLZRvg3uA@public.gmane.org>

ping

On 06/06/2012 04:44 PM, Flavio Leitner wrote:
> Hi,
>
> Could someone tell me what's the patch current state?
> It has been a month already with no feedback.
> thanks,
> fbl
>
> On Fri, 25 May 2012 13:02:48 +0200
> Peter Schiffer <pschiffe-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
>> Hi Michael,
>>
>> do you have any comments for this update? Or do you need some supporting
>> info?
>>
>> peter
>>
>> On 05/09/2012 02:30 PM, Flavio Leitner wrote:
>>> Signed-off-by: Flavio Leitner<fbl-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>    man7/ip.7 |   15 +++++++++------
>>>    1 files changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/man7/ip.7 b/man7/ip.7
>>> index 9f560df..84fe32d 100644
>>> --- a/man7/ip.7
>>> +++ b/man7/ip.7
>>> @@ -69,12 +69,11 @@ For
>>>    you may specify a valid IANA IP protocol defined in
>>>    RFC\ 1700 assigned numbers.
>>>    .PP
>>> -.\" FIXME ip current does an autobind in listen, but I'm not sure
>>> -.\" if that should be documented.
>>>    When a process wants to receive new incoming packets or connections, it
>>>    should bind a socket to a local interface address using
>>>    .BR bind (2).
>>> -Only one IP socket may be bound to any given local (address, port) pair.
>>> +In this case, only one IP socket may be bound to any given local
>>> +(address, port) pair.
>>>    When
>>>    .B INADDR_ANY
>>>    is specified in the bind call, the socket will be bound to
>>> @@ -82,10 +81,14 @@ is specified in the bind call, the socket will be bound to
>>>    local interfaces.
>>>    When
>>>    .BR listen (2)
>>> -or
>>> +is called on an unbound socket, the socket is automatically bound
>>> +to a random free port with the local address set to
>>> +.BR INADDR_ANY .
>>> +When
>>>    .BR connect (2)
>>> -are called on an unbound socket, it is automatically bound to a
>>> -random free port with the local address set to
>>> +is called on an unbound socket, the socket is automatically bound
>>> +to a random free port or an usable shared port with the local address
>>> +set to
>>>    .BR INADDR_ANY .
>>>
>>>    A TCP local socket address that has been bound is unavailable for
> --
> To unsubscribe from this list: send the line "unsubscribe linux-man" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net/fsl_pq_mdio: use spin_event_timeout() to poll the indicator register
From: Tabi Timur-B04825 @ 2012-07-09 14:31 UTC (permalink / raw)
  To: David Miller; +Cc: Fleming Andy-AFLEMING, netdev@vger.kernel.org
In-Reply-To: <20120708.235926.1117975937932919247.davem@davemloft.net>

David Miller wrote:

> Define a macro for the timeout value rather than use an arbitrary
> constant.

Ok.

>> +	status = spin_event_timeout(!(in_be32(&regs->miimind) &	MIIMIND_BUSY),
>> +		1000, 0);
>
> This indentation is absolutely terrible.

Can you give me a clue as to how you think it should look?  I could not 
come up with a good way to break up these lines and keep them under 80 
characters.

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [RFC PATCH] ppp: add support for L2 multihop / tunnel switching
From: Benjamin LaHaise @ 2012-07-09 14:15 UTC (permalink / raw)
  To: James Chapman; +Cc: netdev, linux-ppp
In-Reply-To: <4FFAC5EF.6030003@katalix.com>

On Mon, Jul 09, 2012 at 12:52:15PM +0100, James Chapman wrote:
> As a mechanism for switching PPP interfaces together, this patch is
> good. For L2TP though, I prefer an approach that would be applicable for
> all L2TP traffic types, not just PPP.

*nod*  This seems like a reasonable consideration.

> L2TP supports many different pseudowire types, and this patch will only
> be useful for tunnel switching between PPP pseudowires. Whereas if we
> implement it within the L2TP core, rather than in the PPP code, we would
> get switching between all pseudowire types. If we add this patch and
> then subsequently add switching between other pseudowires in the L2TP
> core (which we're likely to want to do), then we're left with two
> different interfaces for doing L2TP tunnel switching in the kernel.

At least for ethernet pseudowires, it can already be implemented by using 
an ethernet bridge device.  Besides PPP and ethernet pseudowires, what 
other types are supported at present by the L2TP core?

> The L2TP core allows traffic to be passed directly into an L2TP session.
> In the case of PPPoE, for example, the PPP data can be  extracted from a
> PPPoE packet and passed into an L2TP tunnel/session, with no PPP
> interface(s) involved.
> 
> That said, your approach allows two PPP interfaces to be switched
> together, which has its own advantages.

I think the approach I'm using should be reasonably efficient for PPPoE 
to L2TP, although the locking overhead in the PPP core probably needs to 
be reduced to improve scaling.  I haven't yet done any benchmarking on this 
approach to see how much overhead there is compared to the other code I'd 
written which took a more direct approach (this wasn't on top of the 
ppp_generic core, but the old Babylon kernel modules which have had this 
functionality for a long time).

> > The reasoning behind using dev_queue_xmit() rather than outputting directly 
> > to another PPP channel is to enable the use of the traffic shaping and 
> > queuing features of the kernel on multihop sessions.
> 
> I'm not sure about using a pseudo packet type to do this. For L2TP, it
> would seem better to add netfilter/tc support for L2TP data packets,
> which would let people add rules for, say, traffic in L2TP tunnel x /
> session y. This would avoid the need for ETH_P_PPP and you could then
> output directly to the ppp channel.

The downside of an L2TP specific method is that all the mechanisms need to 
be duplicated, resulting in a much higher maintenance overhead for the 
code and functionality, not to mention all the tool changes to go along 
with that.

As for the pseudo packet type, it may indeed be better to avoid the pseudo 
packet type for known PPP packet types.  One of the benefits of going the 
network device route is that it makes it much easier to implement additional 
functionality like lawful intercept, which would be yet more functionality 
that would have to be implemented if the mechanism is L2TP specific.  The 
pseudo packet type would still be needed for forwarding PPP frames that the 
kernel doesn't know about (all the *CP packet types and MLPPP come to mind)

I had thought about doing the packet forwarding in a manner similar to the 
bridging code -- that is, as a pseudowire bridge in the network core that 
only works between 2 devices.  That approach might work better for L2TP, as 
it would be able to pass packets of any type between the 2 endpoints.

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply

* Re: TCP transmit performance regression
From: Eric Dumazet @ 2012-07-09 13:54 UTC (permalink / raw)
  To: Ming Lei; +Cc: Network Development, David Miller
In-Reply-To: <CACVXFVNYwNrdWS_EiQ4O0-ou369AQ2tB4qbvmzY1rt5T3QnyUw@mail.gmail.com>

On Mon, 2012-07-09 at 21:23 +0800, Ming Lei wrote:

> Looks the patch replaces skb_clone with netdev_alloc_skb_ip_align and
> introduces extra copies on incoming data, so would you mind explaining
> it in a bit detail? And why is skb_clone not OK for the purpose?

Problem with cloning is that some paths will have to make a private copy
of the skb.

So you dont see the cost here in the driver, but later in upper stacks.

Since this driver defaults to a huge RX area of more than 16Kbytes,
a copy to a much smaller skb (we call this 'copybreak' in our jargon )
is more than welcome to avoid OOM problems anyway.

TCP coalescing (skb_try_coalesce) for example wont work for cloned skbs,
so TCP receive window will close pretty fast, and performance sucks in
lossy environments (like the Internet)

Actually, since this driver lies about skb->truesize, a single UDP frame
consumes 32Kbytes of memory, escaping normal memory limits we have in
kernel by a factor of 64. Thats pretty bad, especially for a beagle
board.

^ permalink raw reply

* Re: [PATCH] smsc95xx: support ethtool get_regs
From: Émeric Vigier @ 2012-07-09 13:44 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Steve Glendinning, steve glendinning, netdev, Nancy Lin
In-Reply-To: <1341690959.25597.168.camel@deadeye.wl.decadent.org.uk>

----- Mail original -----
> On Sat, 2012-07-07 at 09:58 -0400, Émeric Vigier wrote:
> [...]
> > > > +	for (i = 0; i <= PHY_SPECIAL; i++)
> > > > +		data[j++] = smsc95xx_mdio_read(netdev, dev->mii.phy_id, i);
> > > > +}
> > > 
> > > Again, why use PHY_SPECIAL (+ 1) here as opposed to 32 in the
> > > calculation of the length?
> > 
> > 32 was ok, but I hesitated between defining a SMSC95XX_PHY_END or
> > using the last defined register.
> > Are 32 register-PHY generic to most devices? I mean could 32 be use
> > widely?
> 
> Yes, the address space for the original MDIO protocol ('clause 22')
> allows for 32 registers.  Perhaps that number should be named in
> <linux/mii.h>.

I have not found any definition of this in mii.h.

> 
> As another reviewer commented, though, MDIO PHY registers should be
> accessible with SIOCGMIIREG and mii-tool so it's not really necessary
> to
> duplicate them here.

Ok, I remove this then.

> 
> Ben.
> 
> --
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> 

-- 
Emeric

^ permalink raw reply

* [PATCH 2/4] pch_gbe: fix transmit watchdog timeout
From: Andy Cress @ 2012-07-09 13:24 UTC (permalink / raw)
  To: netdev

Author: Andy Cress <andy.cress@us.kontron.com>

An extended ping and iperf test with 6 vlans resulted in transmit
timeout stats 
and sometimes a driver oops with a netdev watchdog transmit timeout.
Fix WATCHDOG_TIMEOUT to be more like e1000e at 5 * HZ, to avoid 
unnecessary transmit timeouts.

Signed-off-by: Andy Cress <andy.cress@us.kontron.com>

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 4c04843..a746064 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -35,7 +35,7 @@ const char pch_driver_version[] = DRV_VERSION;
 #define DSC_INIT16			0xC000
 #define PCH_GBE_DMA_ALIGN		0
 #define PCH_GBE_DMA_PADDING		2
-#define PCH_GBE_WATCHDOG_PERIOD		(1 * HZ)	/*
watchdog time */
+#define PCH_GBE_WATCHDOG_PERIOD		(5 * HZ)	/*
watchdog time */
 #define PCH_GBE_COPYBREAK_DEFAULT	256
 #define PCH_GBE_PCI_BAR			1
 #define PCH_GBE_RESERVE_MEMORY		0x200000	/* 2MB */

^ permalink raw reply related

* [PATCH 1/4] pch_gbe: fix the checksum fill to the error location
From: Andy Cress @ 2012-07-09 13:30 UTC (permalink / raw)
  To: netdev

From: Zhong Hongbo <hongbo.zhong@windriver.com>
Date: Mon, 9 Apr 2012 10:51:28 +0800

Due to some unknown hardware limitations the pch_gbe hardware cannot
calculate checksums when the length of network package is less
than 64 bytes, where we will surprisingly encounter a problem of
the destination IP incorrectly changed.

When forwarding network packages at the network layer the IP packages
won't be relayed to the upper transport layer and analyzed there,
consequently, skb->transport_header pointer will be mistakenly remained
the same as that of skb->network_header, resulting in TCP checksum
wrongly
filled into the field of destination IP in IP header.

We can fix this issue by manually calculate the offset of the TCP
checksum
 and update it accordingly.

Signed-off-by: Zhong Hongbo <hongbo.zhong@windriver.com>
Merged-by: Andy Cress <andy.cress@us.kontron.com>

---
 drivers/net/pch_gbe/pch_gbe_main.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 3787c64..4c04843 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -1180,30 +1180,32 @@ static void pch_gbe_tx_queue(struct
pch_gbe_adapter *adapter,
 	 * when the received data size is less than 64 bytes.
 	 */
 	if (skb->len < PCH_GBE_SHORT_PKT && skb->ip_summed !=
CHECKSUM_NONE) {
+		struct iphdr *iph = ip_hdr(skb);
 		frame_ctrl |= PCH_GBE_TXD_CTRL_APAD |
 			      PCH_GBE_TXD_CTRL_TCPIP_ACC_OFF;
 		if (skb->protocol == htons(ETH_P_IP)) {
-			struct iphdr *iph = ip_hdr(skb);
 			unsigned int offset;
-			offset = skb_transport_offset(skb);
+			offset = (unsigned char *)((u8 *)iph + iph->ihl
* 4) - skb->data;
 			if (iph->protocol == IPPROTO_TCP) {
+				struct tcphdr *tcphdr_point = (struct
tcphdr *)((u8 *)iph + iph->ihl * 4);
 				skb->csum = 0;
-				tcp_hdr(skb)->check = 0;
+				tcphdr_point->check = 0;
 				skb->csum = skb_checksum(skb, offset,
 							 skb->len -
offset, 0);
-				tcp_hdr(skb)->check =
+				tcphdr_point->check = 
 					csum_tcpudp_magic(iph->saddr,
 							  iph->daddr,
 							  skb->len -
offset,
 							  IPPROTO_TCP,
 							  skb->csum);
 			} else if (iph->protocol == IPPROTO_UDP) {
+				struct udphdr *udphdr_point = (struct
udphdr *)((u8 *)iph + iph->ihl * 4);
 				skb->csum = 0;
-				udp_hdr(skb)->check = 0;
+				udphdr_point->check = 0;
 				skb->csum =
 					skb_checksum(skb, offset,
 						     skb->len - offset,
0);
-				udp_hdr(skb)->check =
+				udphdr_point->check = 
 					csum_tcpudp_magic(iph->saddr,
 							  iph->daddr,
 							  skb->len -
offset,

^ permalink raw reply related

* [PATCH 4/4] pch_gbe: vlan skb len fix
From: Andy Cress @ 2012-07-09 13:30 UTC (permalink / raw)
  To: netdev

Author: Veaceslav Falico <vfalico@redhat.com>
Date:   Tue Apr 10 08:14:17 2012 +0200

    pch_gbe: correctly verify skb->len in vlan case
             to avoid bogus transfer length errors.

Signed-off-by: Andy Cress <andy.cress@us.kontron.com>

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 30ef285..04b0e49 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -2158,8 +2158,10 @@ static int pch_gbe_xmit_frame(struct sk_buff
*skb, struct net_device *netdev)
 	struct pch_gbe_adapter *adapter = netdev_priv(netdev);
 	struct pch_gbe_tx_ring *tx_ring = adapter->tx_ring;
 	unsigned long flags;
+	int offset;
 
-	if (unlikely(skb->len > (adapter->hw.mac.max_frame_size - 4))) {
+	offset = skb->protocol == htons(ETH_P_8021Q) ? 0 : 4;
+	if (unlikely(skb->len > (adapter->hw.mac.max_frame_size -
offset))) {
 		pr_err("Transfer length Error: skb len: %d > max: %d\n",
 		       skb->len, adapter->hw.mac.max_frame_size);
 		dev_kfree_skb_any(skb);

^ permalink raw reply related

* [PATCH 0/4] pch_gbe: avoiding transmit timeouts, etc.
From: Andy Cress @ 2012-07-09 13:19 UTC (permalink / raw)
  To: netdev


When the interface is stressed with 6 VLANs, some transmit timeout stats
were observed, which is a potential precursor to the more severe netdev
watchdog timeout oops.  Also we saw more than the expected number of
transmit restarts, which impacted performance.   The following patches
were applied and resolved the symptom of the transmit timeout stats, and
reduced the number of transmit restarts.  

This patch set includes the following patches:
0001-pch_gbe-Fix-the-checksum-fill-to-the-error-location.patch  
0002-pch_gbe-fix-transmit-watchdog-timeout.patch  
0003-pch_gbe-add-extra-clean-tx.patch  (which also includes bumping the
version to 1.01)
0004-pch_gbe-vlan-skb-len-fix.patch

The resulting pch_gbe 1.01 driver has been tested on Kontron Tunnel
Creek EG20T modules and Intel Crown Bay EG20T modules, so I believe that
these are appropriate for consideration in the upstream pch_gbe driver.


Please review and comment.

Thanks,
Andy

^ permalink raw reply

* [PATCH 3/4] pch_gbe: add extra clean tx
From: Andy Cress @ 2012-07-09 13:28 UTC (permalink / raw)
  To: netdev

Author: Andy Cress <andy.cress@us.kontron.com>

This adds extra cleaning to the pch_gbe_clean_tx routine to avoid 
transmit timeouts on some PHYs that have different timing.
Also update the DRV_VERSION to 1.01, and show it.

Signed-off-by: Andy Cress <andy.cress@us.kontron.com>

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index a746064..30ef285 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -26,7 +26,7 @@
 #include <linux/ptp_classify.h>
 #endif
 
-#define DRV_VERSION     "1.00"
+#define DRV_VERSION     "1.01"
 const char pch_driver_version[] = DRV_VERSION;
 
 #define PCI_DEVICE_ID_INTEL_IOH1_GBE	0x8802		/* Pci device ID
*/
@@ -1581,7 +1581,8 @@ pch_gbe_clean_tx(struct pch_gbe_adapter *adapter,
 	struct sk_buff *skb;
 	unsigned int i;
 	unsigned int cleaned_count = 0;
-	bool cleaned = true;
+	bool cleaned = false;
+	int unused, thresh;
 
 	pr_debug("next_to_clean : %d\n", tx_ring->next_to_clean);
 
@@ -1590,10 +1591,36 @@ pch_gbe_clean_tx(struct pch_gbe_adapter
*adapter,
 	pr_debug("gbec_status:0x%04x  dma_status:0x%04x\n",
 		 tx_desc->gbec_status, tx_desc->dma_status);
 
+	unused = PCH_GBE_DESC_UNUSED(tx_ring);
+	thresh = tx_ring->count - PCH_GBE_TX_WEIGHT;
+	if ((tx_desc->gbec_status == DSC_INIT16) && (unused < thresh))
+	{  /* current marked clean, tx queue filling up, do extra clean
*/
+		int j, k;
+		if (unused < 8) {  /* tx queue nearly full */
+			pr_debug("clean_tx: transmit queue warning
(%x,%x) unused=%d\n",
+
tx_ring->next_to_clean,tx_ring->next_to_use,unused);
+		}
+	   
+		/* current marked clean, scan for more that need
cleaning. */
+		k = i;
+		for (j = 0; j < PCH_GBE_TX_WEIGHT; j++) 
+		{
+			tx_desc = PCH_GBE_TX_DESC(*tx_ring, k);
+			if (tx_desc->gbec_status != DSC_INIT16) break;
/*found*/
+			if (++k >= tx_ring->count) k = 0;  /*increment,
wrap*/
+		}
+		if (j < PCH_GBE_TX_WEIGHT) {
+			pr_debug("clean_tx: unused=%d loops=%d found
tx_desc[%x,%x:%x].gbec_status=%04x\n",
+				unused,j, i,k, tx_ring->next_to_use,
tx_desc->gbec_status);
+			i = k;  /*found one to clean, usu
gbec_status==2000.*/
+		}
+	}
+
 	while ((tx_desc->gbec_status & DSC_INIT16) == 0x0000) {
 		pr_debug("gbec_status:0x%04x\n", tx_desc->gbec_status);
 		buffer_info = &tx_ring->buffer_info[i];
 		skb = buffer_info->skb;
+		cleaned = true;
 
 		if ((tx_desc->gbec_status & PCH_GBE_TXD_GMAC_STAT_ABT))
{
 			adapter->stats.tx_aborted_errors++;
@@ -1641,18 +1668,21 @@ pch_gbe_clean_tx(struct pch_gbe_adapter
*adapter,
 	}
 	pr_debug("called pch_gbe_unmap_and_free_tx_resource() %d
count\n",
 		 cleaned_count);
-	/* Recover from running out of Tx resources in xmit_frame */
-	spin_lock(&tx_ring->tx_lock);
-	if (unlikely(cleaned && (netif_queue_stopped(adapter->netdev))))
{
-		netif_wake_queue(adapter->netdev);
-		adapter->stats.tx_restart_count++;
-		pr_debug("Tx wake queue\n");
-	}
+	if (cleaned_count > 0)  { /*skip this if nothing cleaned*/
+		/* Recover from running out of Tx resources in
xmit_frame */
+		spin_lock(&tx_ring->tx_lock);
+		if (unlikely(cleaned &&
(netif_queue_stopped(adapter->netdev))))
+		{
+			netif_wake_queue(adapter->netdev);
+			adapter->stats.tx_restart_count++;
+			pr_debug("Tx wake queue\n");
+		}
 
-	tx_ring->next_to_clean = i;
+		tx_ring->next_to_clean = i;
 
-	pr_debug("next_to_clean : %d\n", tx_ring->next_to_clean);
-	spin_unlock(&tx_ring->tx_lock);
+		pr_debug("next_to_clean : %d\n",
tx_ring->next_to_clean);
+		spin_unlock(&tx_ring->tx_lock);
+	}
 	return cleaned;
 }
 
@@ -2389,7 +2419,7 @@ static int pch_gbe_napi_poll(struct napi_struct
*napi, int budget)
 	pch_gbe_clean_rx(adapter, adapter->rx_ring, &work_done, budget);
 	cleaned = pch_gbe_clean_tx(adapter, adapter->tx_ring);
 
-	if (!cleaned)
+	if (cleaned)
 		work_done = budget;
 	/* If no Tx and not enough Rx work done,
 	 * exit the polling mode
@@ -2795,6 +2825,7 @@ static int __init pch_gbe_init_module(void)
 {
 	int ret;
 
+	pr_info("EG20T PCH Gigabit Ethernet Driver - version
%s\n",DRV_VERSION);
 	ret = pci_register_driver(&pch_gbe_driver);
 	if (copybreak != PCH_GBE_COPYBREAK_DEFAULT) {
 		if (copybreak == 0) {

^ permalink raw reply related

* Re: TCP transmit performance regression
From: Ming Lei @ 2012-07-09 13:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Network Development, David Miller
In-Reply-To: <1341481760.2583.3579.camel@edumazet-glaptop>

On Thu, Jul 5, 2012 at 5:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-07-05 at 16:42 +0800, Ming Lei wrote:
>> On Thu, Jul 5, 2012 at 4:33 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Thu, 2012-07-05 at 16:27 +0800, Ming Lei wrote:
>> >
>> >> After some investigation, the problem is caused by enabling
>> >> DEBUG_SLAB, so it is not a regression.
>> >>
>> >
>> > Strange, unless your machine is a _very_ slow one maybe ?
>>
>> It is a beagle-xm board, and its cpu is ARMv7, 1GHz.
>
> OK, driver seems buggy, please try following patch (on both sides if
> possible)
>
>  drivers/net/usb/smsc95xx.c |   11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
> index b1112e7..0a4ae35 100644
> --- a/drivers/net/usb/smsc95xx.c
> +++ b/drivers/net/usb/smsc95xx.c
> @@ -1084,26 +1084,23 @@ static int smsc95xx_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
>                         if (skb->len == size) {
>                                 if (dev->net->features & NETIF_F_RXCSUM)
>                                         smsc95xx_rx_csum_offload(skb);
> -                               skb_trim(skb, skb->len - 4); /* remove fcs */
> +                               __skb_trim(skb, skb->len - 4); /* remove fcs */
>                                 skb->truesize = size + sizeof(struct sk_buff);
>
>                                 return 1;
>                         }
>
> -                       ax_skb = skb_clone(skb, GFP_ATOMIC);
> +                       ax_skb = netdev_alloc_skb_ip_align(dev->net, size);
>                         if (unlikely(!ax_skb)) {
>                                 netdev_warn(dev->net, "Error allocating skb\n");
>                                 return 0;
>                         }
>
> -                       ax_skb->len = size;
> -                       ax_skb->data = packet;
> -                       skb_set_tail_pointer(ax_skb, size);
> +                       memcpy(skb_put(ax_skb, size), packet, size);
>
>                         if (dev->net->features & NETIF_F_RXCSUM)
>                                 smsc95xx_rx_csum_offload(ax_skb);
> -                       skb_trim(ax_skb, ax_skb->len - 4); /* remove fcs */
> -                       ax_skb->truesize = size + sizeof(struct sk_buff);
> +                       __skb_trim(ax_skb, ax_skb->len - 4); /* remove fcs */
>
>                         usbnet_skb_return(dev, ax_skb);
>                 }
>
>

Looks the patch replaces skb_clone with netdev_alloc_skb_ip_align and
introduces extra copies on incoming data, so would you mind explaining
it in a bit detail? And why is skb_clone not OK for the purpose?


Thanks,
-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH] net: cgroup: fix out of bounds accesses
From: Neil Horman @ 2012-07-09 12:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, linux-kernel, netdev, lizefan, tj, Gao feng
In-Reply-To: <1341837625.3265.2748.camel@edumazet-glaptop>

On Mon, Jul 09, 2012 at 02:40:25PM +0200, Eric Dumazet wrote:
> On Mon, 2012-07-09 at 08:13 -0400, Neil Horman wrote:
> > On Mon, Jul 09, 2012 at 01:50:52PM +0200, Eric Dumazet wrote:
> > > On Mon, 2012-07-09 at 07:01 -0400, Neil Horman wrote:
> > > 
> > > > Thank you for doing this Eric, Gao.  Just to be sure (I asked in the previous
> > > > thread), would it be better to avoid the length check in skb_update_prio, and
> > > > instead update the netdev tables to be long enough in cgrp_create and in
> > > > netprio_device_event on device registration?
> > > 
> > > Yes probably, and it is even needed because extend_netdev_table() can
> > > acutally fail to expand the table if kzalloc() returned NULL.
> > > 
> > > Current code just ignores this allocation failure so we also can crash
> > > in write_priomap()
> > > 
> > ACK, can you follow up with a patch please?
> 
> Gao was working on this allocation problem (he privately sent me a v1 of
> his patch), so I think we can wait Gao submit a v2 to combine all the
> work/ideas in a single patch.
> 
> (ie make sure we dont need additional bound checkings in fast path)
> 
> 
Ok, I agree.  thanks!
Neil

> 
> 

^ permalink raw reply

* Re: [PATCH] net: cgroup: fix out of bounds accesses
From: Eric Dumazet @ 2012-07-09 12:40 UTC (permalink / raw)
  To: Neil Horman; +Cc: David Miller, linux-kernel, netdev, lizefan, tj, Gao feng
In-Reply-To: <20120709121350.GC9186@hmsreliant.think-freely.org>

On Mon, 2012-07-09 at 08:13 -0400, Neil Horman wrote:
> On Mon, Jul 09, 2012 at 01:50:52PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-07-09 at 07:01 -0400, Neil Horman wrote:
> > 
> > > Thank you for doing this Eric, Gao.  Just to be sure (I asked in the previous
> > > thread), would it be better to avoid the length check in skb_update_prio, and
> > > instead update the netdev tables to be long enough in cgrp_create and in
> > > netprio_device_event on device registration?
> > 
> > Yes probably, and it is even needed because extend_netdev_table() can
> > acutally fail to expand the table if kzalloc() returned NULL.
> > 
> > Current code just ignores this allocation failure so we also can crash
> > in write_priomap()
> > 
> ACK, can you follow up with a patch please?

Gao was working on this allocation problem (he privately sent me a v1 of
his patch), so I think we can wait Gao submit a v2 to combine all the
work/ideas in a single patch.

(ie make sure we dont need additional bound checkings in fast path)

^ permalink raw reply

* Re: 82571EB: Detected Hardware Unit Hang
From: Joe Jin @ 2012-07-09 12:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev@vger.kernel.org, e1000-devel, linux-kernel@vger.kernel.org
In-Reply-To: <1341825677.3265.2330.camel@edumazet-glaptop>

On 07/09/12 17:21, Eric Dumazet wrote:
> On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
>> Hi list,
>>
>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing
>> scp test. this issue is easy do reproduced on SUN FIRE X2270 M2, just copy
>> a big file (>500M) from another server will hit it at once. 
>>
>> Would you please help on this?
>>
> 
> Its a known problem.
> 
> But apparently Intel guys are not very responsive, as they have another
> patch than the following :
> 
> http://permalink.gmane.org/gmane.linux.network/232669

Eris, 

Thanks for you reply, but seems this patch not help for me, 
applied the patch still hit the issue:

# dmesg
e1000e 0000:05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <7e>
  next_to_use          <7e>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <fffd48dc>
  next_to_watch        <74>
  jiffies              <fffd5344>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e 0000:05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <7e>
  next_to_use          <7e>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <fffd48dc>
  next_to_watch        <74>
  jiffies              <fffd5b14>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e 0000:05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <7e>
  next_to_use          <7e>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <fffd48dc>
  next_to_watch        <74>
  jiffies              <fffd62e4>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e 0000:05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <7e>
  next_to_use          <7e>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <fffd48dc>
  next_to_watch        <74>
  jiffies              <fffd6ab4>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x225/0x230()
Hardware name: SUN FIRE X2270 M2
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: autofs4 hidp rfcomm bluetooth rfkill lockd sunrpc cpufreq_ondemand acpi_cpufreq mperf be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs sbshc acpi_pad acpi_ipmi ipmi_msghandler parport_pc lp parport e1000e(U) snd_seq_dummy snd_seq_oss snd_seq_midi_event igb snd_seq snd_seq_device serio_raw snd_pcm_oss snd_mixer_oss snd_pcm tpm_infineon snd_timer snd soundcore i7core_edac iTCO_wdt iTCO_vendor_support snd_page_alloc edac_core i2c_i801 ioatdma i2c_core pcspkr ghes dca hed dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage sd_mod crc_t10dif sg ahci libahci ext3 jbd mbcache [last unloaded: microcode]
Pid: 0, comm: swapper Not tainted 2.6.39-200.24.1.el5uek #1
Call Trace:
 [<c07d9ac5>] ? dev_watchdog+0x225/0x230
 [<c045ba61>] warn_slowpath_common+0x81/0xa0
 [<c07d9ac5>] ? dev_watchdog+0x225/0x230
 [<c045bb23>] warn_slowpath_fmt+0x33/0x40
 [<c07d9ac5>] dev_watchdog+0x225/0x230
 [<c07d98a0>] ? dev_activate+0xb0/0xb0
 [<c0468e82>] call_timer_fn+0x32/0xf0
 [<c046a76d>] run_timer_softirq+0xed/0x1b0
 [<c07d98a0>] ? dev_activate+0xb0/0xb0
 [<c0461a81>] __do_softirq+0x91/0x1a0
 [<c04619f0>] ? local_bh_enable+0x80/0x80
 <IRQ>  [<c0462295>] ? irq_exit+0x95/0xa0
 [<c087f8b8>] ? smp_apic_timer_interrupt+0x38/0x42
 [<c08784f5>] ? apic_timer_interrupt+0x31/0x38
 [<c046007b>] ? do_exit+0x11b/0x370
 [<c065eae4>] ? intel_idle+0xa4/0x100
 [<c078d9b9>] ? cpuidle_idle_call+0xb9/0x1e0
 [<c0411d77>] ? cpu_idle+0x97/0xd0
 [<c085cbbd>] ? rest_init+0x5d/0x70
 [<c0b07a7a>] ? start_kernel+0x28a/0x340
 [<c0b074b0>] ? obsolete_checksetup+0xb0/0xb0
 [<c0b070a4>] ? i386_start_kernel+0x64/0xb0
---[ end trace 5d51553c2ad66677 ]---
e1000e 0000:05:00.0: eth0: Reset adapter
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Any idea?

Thanks,
Joe

> 
> 
> We only have to wait they push their alternative patch, eventually.
> 
> In the mean time, you can use Hiroaki SHIMODA patch, it works.
> 
> 
> 



------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: [PATCH] net: cgroup: fix out of bounds accesses
From: Neil Horman @ 2012-07-09 12:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, linux-kernel, netdev, lizefan, tj, Gao feng
In-Reply-To: <1341834652.3265.2642.camel@edumazet-glaptop>

On Mon, Jul 09, 2012 at 01:50:52PM +0200, Eric Dumazet wrote:
> On Mon, 2012-07-09 at 07:01 -0400, Neil Horman wrote:
> 
> > Thank you for doing this Eric, Gao.  Just to be sure (I asked in the previous
> > thread), would it be better to avoid the length check in skb_update_prio, and
> > instead update the netdev tables to be long enough in cgrp_create and in
> > netprio_device_event on device registration?
> 
> Yes probably, and it is even needed because extend_netdev_table() can
> acutally fail to expand the table if kzalloc() returned NULL.
> 
> Current code just ignores this allocation failure so we also can crash
> in write_priomap()
> 
ACK, can you follow up with a patch please?

Thanks!
Neil

> 
> 
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox