Netdev List


* Re: [PATCH 03/16] tcp: Maintain dynamic metrics in local cache.
From: Eric Dumazet @ 2012-07-10 15:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120710.080714.2272376193166978850.davem@davemloft.net>

On Tue, 2012-07-10 at 08:07 -0700, David Miller wrote:
> Maintain a local hash table of TCP dynamic metrics blobs.
> 
> Computed TCP metrics are no longer maintained in the route metrics.
> 
> The table uses RCU and an extremely simple hash so that it has low
> latency and low overhead.  A simple hash is legitimate because we only
> make metrics blobs for fully established connections.
> 
> The default hash table sizes, metric timeouts, and the hash chain
> length limit certainly could use some tweaking.  But the basic design
> seems sound.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---

Seems to lack namespace support, or maybe I missed something ?


* Re: [PATCH 03/16] tcp: Maintain dynamic metrics in local cache.
From: David Miller @ 2012-07-10 15:33 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1341934298.3265.5514.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 10 Jul 2012 17:31:38 +0200

> On Tue, 2012-07-10 at 08:07 -0700, David Miller wrote:
>> Maintain a local hash table of TCP dynamic metrics blobs.
>> 
>> Computed TCP metrics are no longer maintained in the route metrics.
>> 
>> The table uses RCU and an extremely simple hash so that it has low
>> latency and low overhead.  A simple hash is legitimate because we only
>> make metrics blobs for fully established connections.
>> 
>> The default hash table sizes, metric timeouts, and the hash chain
>> length limit certainly could use some tweaking.  But the basic design
>> seems sound.
>> 
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>> ---
> 
> Seems to lack namespace support, or maybe I missed something ?

It does, I'll have to add it.  Thanks for catching that.


* Re: [PATCH 0/16] Metrics restructuring.
From: Eric Dumazet @ 2012-07-10 15:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120710.080703.916560084659556593.davem@davemloft.net>

On Tue, 2012-07-10 at 08:07 -0700, David Miller wrote:
> This patch series works towards the goal of minimizing the amount
> of things that can change in an ipv4 route.
> 
> In a regime where the routing cache is removed, route changes will
> lead to cloning in the FIB tables or similar.
> 
> The largest trigger of route metrics writes, TCP, now has its own
> cache of dynamic metric state.  The timewait timestamps are stored
> there now as well.
> 
> As a result of that, pre-cowing metrics is no longer necessary,
> and therefore FLOWI_FLAG_PRECOW_METRICS is removed.
> 
> Redirect and PMTU handling is moved back into the ipv4 routes.  I'm
> sorry for all the headaches trying to do this in the inetpeer has
> caused, it was the wrong approach for sure.
> 
> Since metrics become read-only for ipv4 we no longer need the inetpeer
> hung off of the ipv4 routes either.  So those disappear too.
> 
> Also, timewait sockets no longer need to hold onto an inetpeer either.
> 
> After this series, we still have some details to resolve wrt. PMTU and
> redirects for a route-cache-less system:
> 
> 1) With just the plain route cache removal, PMTU will continue to
>    work mostly fine.  This is because of how the local route users
>    call down into the PMTU update code with the route they already
>    hold.
> 
>    However, if we wish to cache pre-computed routes in fib_info
>    nexthops (which we want for performance), then we need to add
>    route cloning for PMTU events.
> 
> 2) Redirects require more work.  First, redirects must be changed to
>    be handled like PMTU.  Wherein we call down into the sockets and
>    other entities, and then they call back into the routing code with
>    the route they were using.
> 
>    So we'll be adding an ->update_nexthop() method alongside
>    ->update_pmtu().
> 
>    And then, like for PMTU, we'll need cloning support once we start
>    caching routes in the fib_info nexthops.
> 
> But that's it, we can completely pull the trigger and remove the
> routing cache with minimal disruptions.
> 
> As it is, this patch series alone helps a lot of things.  For one,
> routing cache entry creation should be a lot faster, because we no
> longer do inetpeer lookups (even to check if an entry exists).
> 
> This patch series also opens the door for non-DST_HOST ipv4 routes,
> because nothing fundamentally cares about rt->rt_dst any more.  It
> can be removed with the base routing cache removal patch.  In fact,
> that was the primary goal of this patch series.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

This looks great !


* Re: [PATCH 0/16] Metrics restructuring.
From: Joe Perches @ 2012-07-10 16:11 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120710.080703.916560084659556593.davem@davemloft.net>

On Tue, 2012-07-10 at 08:07 -0700, David Miller wrote:
> This patch series works towards the goal of minimizing the amount
> of things that can change in an ipv4 route.

Good stuff.

I hope the good information below here will go into the
git log via a merge message too.

thanks, Joe


* Re: [PATCH 06/16] tcp: Move timestamps from inetpeer to metrics cache.
From: Lin Ming @ 2012-07-10 16:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120710.080726.1369956163447074824.davem@davemloft.net>

On Tue, Jul 10, 2012 at 11:07 PM, David Miller <davem@davemloft.net> wrote:

> +
> +/* VJ's idea. Save last timestamp seen from this destination and hold
> + * it at least for normal timewait interval to use for duplicate
> + * segment detection in subsequent connections, before they enter
> + * synchronized state.
> + */
> +bool tcp_remember_stamp(struct sock *sk)
> +{
> +       struct dst_entry *dst = __sk_dst_get(sk);
> +       bool ret = false;
> +
> +       if (dst) {
> +               struct tcp_metrics_block *tm;
> +
> +               rcu_read_lock();
> +               tm = tcp_get_metrics(sk, dst, true);
> +               if (tm) {
> +                       struct tcp_sock *tp = tcp_sk(sk);
> +
> +                       if ((s32)(tm->tcpm_ts - tp->rx_opt.ts_recent) <= 0 ||
> +                           ((u32)get_seconds() - tm->tcpm_ts_stamp > TCP_PAWS_MSL &&
> +                            tm->tcpm_ts_stamp <= (u32)tp->rx_opt.ts_recent_stamp)) {
> +                               tm->tcpm_ts_stamp = (u32)tp->rx_opt.ts_recent_stamp;
> +                               tm->tcpm_ts = tp->rx_opt.ts_recent;
> +                       }
> +                       ret = true;
> +               }
> +               rcu_read_unlock();
> +       }

A trivial thing when running "git am" on this patch.

ERROR: trailing whitespace
#315: FILE: net/ipv4/tcp_metrics.c:595:
+^I}^I^I$

Regards,
Lin Ming

> +       return ret;
> +}
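
For anyone reading along: the `(s32)(tm->tcpm_ts - tp->rx_opt.ts_recent) <= 0` test in the hunk above is the usual wraparound-safe way to compare 32-bit timestamps. A minimal user-space sketch of just that comparison follows. The helper name `ts_not_after()` is made up for illustration and does not appear in the patch, and the stdint types stand in for the kernel's `u32`/`s32`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: true when timestamp 'a'
 * is not newer than 'b' in 32-bit modular (serial number) arithmetic.
 */
static bool ts_not_after(uint32_t a, uint32_t b)
{
	/* The unsigned subtraction wraps around; reading the difference
	 * as a signed value gives the shorter distance between the two
	 * timestamps, so the test keeps working across a counter wrap.
	 */
	return (int32_t)(a - b) <= 0;
}
```

With this, a stored value of 0xfffffff0 still counts as older than a freshly seen 0x10, which a plain `a <= b` would get wrong.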


* Re: [PATCH 06/16] tcp: Move timestamps from inetpeer to metrics cache.
From: David Miller @ 2012-07-10 16:17 UTC (permalink / raw)
  To: mlin; +Cc: netdev
In-Reply-To: <CAF1ivSZ--OKqu-PgTf82Rho4N8tUn+HCaDuh5V4p89KusHauWQ@mail.gmail.com>

From: Lin Ming <mlin@ss.pku.edu.cn>
Date: Wed, 11 Jul 2012 00:14:50 +0800

> A trivial thing when running "git am" on this patch.
> 
> ERROR: trailing whitespace
> #315: FILE: net/ipv4/tcp_metrics.c:595:
> +^I}^I^I$

Thanks, I'll fix that up.


* Re: [PATCH 0/16] Metrics restructuring.
From: David Miller @ 2012-07-10 16:18 UTC (permalink / raw)
  To: joe; +Cc: netdev
In-Reply-To: <1341936694.6118.119.camel@joe2Laptop>

From: Joe Perches <joe@perches.com>
Date: Tue, 10 Jul 2012 09:11:34 -0700

> I hope the good information below here will go into the
> git log via a merge message too.

Is there some voodoo to force a merge commit in what
would otherwise be a fast-forward?


* [PATCH] etherdevice: introduce eth_broadcast_addr
From: Johannes Berg @ 2012-07-10 16:18 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless
In-Reply-To: <20120708.235808.1602900783296556684.davem@davemloft.net>

From: Johannes Berg <johannes.berg@intel.com>

A lot of code has either the memset or an inefficient copy
from a static array that contains the all-ones broadcast
address. Introduce eth_broadcast_addr() to fill an address
with all ones, making the code clearer and allowing us to
get rid of some constant arrays.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
---
 include/linux/etherdevice.h |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 3d406e0..98a27cc 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -138,6 +138,17 @@ static inline void random_ether_addr(u8 *addr)
 }
 
 /**
+ * eth_broadcast_addr - Assign broadcast address
+ * @addr: Pointer to a six-byte array containing the Ethernet address
+ *
+ * Assign the broadcast address to the given address array.
+ */
+static inline void eth_broadcast_addr(u8 *addr)
+{
+	memset(addr, 0xff, ETH_ALEN);
+}
+
+/**
  * eth_hw_addr_random - Generate software assigned random Ethernet and
  * set device flag
  * @dev: pointer to net_device structure
-- 
1.7.10.4
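
To illustrate the motivation in the changelog, here is a user-space sketch that puts the new helper next to the pattern it is meant to replace. It assumes `uint8_t` in place of the kernel's `u8`. The names `bcast_template` and `fill_broadcast_old_style()` are invented for this example and are not taken from any driver.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6	/* octets in an Ethernet address, as in the kernel */

/* User-space copy of the helper from the patch. */
static inline void eth_broadcast_addr(uint8_t *addr)
{
	memset(addr, 0xff, ETH_ALEN);
}

/* The older pattern: a constant array that exists only to be copied
 * from. The name is made up for this sketch.
 */
static const uint8_t bcast_template[ETH_ALEN] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff
};

static void fill_broadcast_old_style(uint8_t *addr)
{
	memcpy(addr, bcast_template, ETH_ALEN);
}
```

Both functions leave the same six bytes in `addr`; the helper simply does it without needing the static array.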


* Re: [PATCH 0/16] Metrics restructuring.
From: Christian Franke @ 2012-07-10 16:34 UTC (permalink / raw)
  To: David Miller; +Cc: joe, netdev
In-Reply-To: <20120710.091825.136189827254363527.davem@davemloft.net>

On 07/10/2012 06:18 PM, David Miller wrote:
> Is there some voodoo to force a merge commit in what
> would otherwise be a fast-forward?

git merge --no-ff

-Christian


* Re: net-next kernel NULL pointer dereference at fib_rules_tclass
From: David Miller @ 2012-07-10 16:44 UTC (permalink / raw)
  To: ogerlitz; +Cc: netdev, shlomop, amirv, erezsh
In-Reply-To: <alpine.LRH.2.00.1207101008270.9760@ogerlitz.voltaire.com>

From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Tue, 10 Jul 2012 10:16:55 +0300

> Starting system logger: BUG: unable to handle kernel NULL pointer dereference at 00000000000000ac
> IP: [<ffffffff81320393>] fib_rules_tclass+0xf/0x17

Ok, fib_rules_tclass() checks for res->r being NULL and only
dereferences it if it is not.

fib4_rule->tclassid has offset ~0x8c on x86-64, and this fault
address is 0x10 bytes off.

Does this patch fix the problem?

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 539c672..000c467 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -230,6 +230,7 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
 			     struct fib_result *res)
 {
 	if (!net->ipv4.fib_has_custom_rules) {
+		res->r = NULL;
 		if (net->ipv4.fib_local &&
 		    !fib_table_lookup(net->ipv4.fib_local, flp, res,
 				      FIB_LOOKUP_NOREF))


* Re: [PATCH 03/16] tcp: Maintain dynamic metrics in local cache.
From: Joe Perches @ 2012-07-10 17:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120710.080714.2272376193166978850.davem@davemloft.net>

On Tue, 2012-07-10 at 08:07 -0700, David Miller wrote:
> Maintain a local hash table of TCP dynamic metrics blobs.

Just trivia.

> diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
[]
> +static bool addr_same(const struct inetpeer_addr *a,
> +		      const struct inetpeer_addr *b)
> +{
> +	int i, n;
> +
> +	if (a->family != b->family)
> +		return false;
> +	n = (a->family == AF_INET ? 1 : 4);
> +	for (i = 0; i < n; i++) {
> +		if (a->addr.a6[i] != b->addr.a6[i])
> +			return false;
> +	}
> +	return true;

Maybe something like this is a bit more legible?
{
	if (a->family != b->family)
		return false;

	if (a->family == AF_INET)
		return a->addr.a4 == b->addr.a4;

	return ipv6_addr_equal((const struct in6_addr *)&a->addr.a6,
			       (const struct in6_addr *)&b->addr.a6);
}

> +static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *addr,
> +						   unsigned int hash)
> +{
> +	struct tcp_metrics_block *tm;
> +	int depth = 0;
> +
> +	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
> +	     tm = rcu_dereference(tm->tcpm_next)) {
> +		if (addr_same(&tm->tcpm_addr, addr))
> +			break;
> +		depth++;
> +	}
> +	return (tm ? tm : (depth > TCP_METRICS_RECLAIM_DEPTH ?
> +			   TCP_METRICS_RECLAIM_PTR :
> +			   NULL));

Using multiple ?: in a single return can be a bit hard to read.

Maybe:

	if (tm)
		return tm;
	if (depth > TCP_METRICS_RECLAIM_DEPTH)
		return TCP_METRICS_RECLAIM_PTR;

	return NULL;

or move the "return tm" into the for loop and avoid
the break and test.
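
The second option could look roughly like the user-space sketch below. It uses plain pointers instead of rcu_dereference(), a simplified block structure, and an illustrative depth limit. `chain_lookup()`, `RECLAIM_DEPTH` and `RECLAIM_PTR` are stand-in names for this example, not the identifiers from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define RECLAIM_DEPTH 5	/* illustrative value for this sketch */

struct metrics_block {
	struct metrics_block *next;
	uint32_t addr;
};

/* Sentinel standing in for TCP_METRICS_RECLAIM_PTR in this sketch. */
static struct metrics_block reclaim_marker;
#define RECLAIM_PTR (&reclaim_marker)

static struct metrics_block *chain_lookup(struct metrics_block *head,
					  uint32_t addr)
{
	struct metrics_block *tm;
	int depth = 0;

	for (tm = head; tm; tm = tm->next) {
		if (tm->addr == addr)
			return tm;	/* found: no break, no later test */
		depth++;
	}
	if (depth > RECLAIM_DEPTH)
		return RECLAIM_PTR;	/* chain too long: ask for a reclaim */

	return NULL;
}
```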

> +static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
> +						       struct dst_entry *dst)
> +{
> +	struct tcp_metrics_block *tm;
> +	struct inetpeer_addr addr;
> +	unsigned int hash;
> +
> +	addr.family = req->rsk_ops->family;
> +	switch (addr.family) {
> +	case AF_INET:
> +		hash = addr.addr.a4 = inet_rsk(req)->rmt_addr;

Is this a sparse error?  __be32 to unsigned int?
Maybe it needs a __force?

> +		break;
> +	case AF_INET6:
> +		*(struct in6_addr *)addr.addr.a6 = inet6_rsk(req)->rmt_addr;
> +		hash = (addr.addr.a6[0] ^
> +			addr.addr.a6[1] ^
> +			addr.addr.a6[2] ^
> +			addr.addr.a6[3]);
> +		break;
> +	default:
> +		return NULL;
> +	}
> +
> +	hash ^= (hash >> 24) ^ (hash >> 16) ^ (hash >> 8);
> +	hash &= tcp_metrics_hash_mask;

[]

> +static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
> +						 struct dst_entry *dst,
> +						 bool create)
> +{
> +	struct tcp_metrics_block *tm;
> +	struct inetpeer_addr addr;
> +	unsigned int hash;
> +	bool reclaim;
> +
> +	addr.family = sk->sk_family;
> +	switch (addr.family) {
> +	case AF_INET:
> +		hash = addr.addr.a4 = inet_sk(sk)->inet_daddr;
> +		break;
> +	case AF_INET6:
> +		*(struct in6_addr *)addr.addr.a6 = inet6_sk(sk)->daddr;
> +		hash = (addr.addr.a6[0] ^
> +			addr.addr.a6[1] ^
> +			addr.addr.a6[2] ^
> +			addr.addr.a6[3]);
> +		break;
> +	default:
> +		return NULL;
> +	}
> +
> +	hash ^= (hash >> 24) ^ (hash >> 16) ^ (hash >> 8);
> +	hash &= tcp_metrics_hash_mask;

Same sparse error?

Maybe this mostly duplicated bit could be consolidated
into some hash = calc_tcp_hash(&addr) function?
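
A rough user-space sketch of what such a helper might look like is below. It assumes a simplified stand-in for struct inetpeer_addr, placeholder family constants, and the mask passed in as a parameter rather than read from tcp_metrics_hash_mask. Family validation (the `default: return NULL` case) would stay with the callers.

```c
#include <stdint.h>

#define SKETCH_AF_INET  2	/* placeholders for AF_INET / AF_INET6 */
#define SKETCH_AF_INET6 10

/* Simplified stand-in for struct inetpeer_addr. */
struct sketch_addr {
	uint16_t family;
	uint32_t a6[4];		/* a6[0] doubles as the IPv4 address */
};

/* Hypothetical consolidated helper; 'mask' plays the role of
 * tcp_metrics_hash_mask. The caller has already checked the family.
 */
static unsigned int calc_tcp_hash(const struct sketch_addr *addr,
				  unsigned int mask)
{
	unsigned int hash;

	if (addr->family == SKETCH_AF_INET)
		hash = addr->a6[0];
	else
		hash = addr->a6[0] ^ addr->a6[1] ^
		       addr->a6[2] ^ addr->a6[3];

	/* Fold the upper bytes down before masking, as the patch does. */
	hash ^= (hash >> 24) ^ (hash >> 16) ^ (hash >> 8);
	return hash & mask;
}
```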


* RE: [PATCH net,1/1] hyperv: Add support for setting MAC from within guests
From: Haiyang Zhang @ 2012-07-10 17:03 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: olaf@aepfle.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, devel@linuxdriverproject.org,
	davem@davemloft.net
In-Reply-To: <1341620355.2923.46.camel@bwh-desktop.uk.solarflarecom.com>



> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Friday, July 06, 2012 8:19 PM
> To: Haiyang Zhang
> Cc: davem@davemloft.net; netdev@vger.kernel.org; KY Srinivasan;
> olaf@aepfle.de; linux-kernel@vger.kernel.org;
> devel@linuxdriverproject.org
> Subject: Re: [PATCH net,1/1] hyperv: Add support for setting MAC from
> within guests
> 
> On Fri, 2012-07-06 at 14:25 -0700, Haiyang Zhang wrote:
> > This adds support for setting synthetic NIC MAC address from within
> > Linux guests. Before using this feature, the option "spoofing of MAC
> > address" should be enabled at the Hyper-V manager / Settings of the
> > synthetic NIC.
> [...]
> > +int rndis_filter_set_device_mac(struct hv_device *hdev, char *mac)
> > +{
> [...]
> > +	t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
> > +	if (t == 0) {
> > +		netdev_err(ndev, "timeout before we got a set response...\n");
> > +		/*
> > +		 * can't put_rndis_request, since we may still receive a
> > +		 * send-completion.
> > +		 */
> > +		return -EBUSY;
> > +	} else {
> > +		set_complete = &request->response_msg.msg.set_complete;
> > +		if (set_complete->status != RNDIS_STATUS_SUCCESS)
> > +			ret = -EINVAL;
> [...]
> 
> Is there a specific error code that indicates the hypervisor is
> configured not to allow MAC address changes?  If so, shouldn't that be
> translated to return EPERM rather than EINVAL?

I have checked the return code, 0xc000000d, which is returned both when MAC
spoofing is not enabled and when the parameter contains other errors. So we
can't tell whether it is a permission error or not. I will re-submit this
patch, still using EINVAL.

Thanks,
- Haiyang


* Re: [RFC PATCH v2] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-10 17:06 UTC (permalink / raw)
  To: David Miller; +Cc: nanditad, netdev, ycheng, codel, mattmathis, ncardwell
In-Reply-To: <1341933215.3265.5476.camel@edumazet-glaptop>

On Tue, 2012-07-10 at 17:13 +0200, Eric Dumazet wrote:
> This introduces TSQ (TCP Small Queues)
> 
> The goal of TSQ is to reduce the number of TCP packets in xmit queues
> (qdisc & device queues), to reduce RTT and cwnd bias, part of the
> bufferbloat problem.
> 
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
> given time.
> 
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
> 
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2 : It can help to reduce
> latencies of high prio packets, having smaller TSO packets.
> 
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send following frames when skb_orphan() [2] is called for the
> already queued skbs.
> 
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, and with or without TSO/GSO. Without reduction of
> nominal bandwidth.
> 
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and both side socket autotuning no longer use 4 Mbytes.
> 
> As skb destructor cannot restart xmit itself (as qdisc lock might be
> taken at this point), we delegate the work to a tasklet. We use one
> tasklet per cpu for performance reasons.
> 
> 
> 
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
>   but some drivers call it in their start_xmit() handler.
>   These drivers should at least use BQL, or else a single TCP
>   session can still fill the whole NIC TX ring, since TSQ will
>   have no effect.
> 
> Not-Yet-Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

By the way, Rick Jones asked me :

"Is there also any chance in service demand?"

I copy my answer here since it's a very good point:

I worked on the idea of a CoDel-like feedback, to have a timed limit
instead of a byte limit ("allow up to 1ms delay in qdisc/dev queue").

But it seemed a bit complex : I would need to add skb fields to properly
track the residence time (sojourn time) of queued packets.

An alternative would be to have a per tcp socket tracking array,
but it might be expensive to search for a packet in it...

With multi queue devices or bad qdiscs, we can have reordering in skb
orphanings. So the lookup can be relatively expensive.
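
For reference, the byte limit from the quoted description (as opposed to the timed limit discussed here) boils down to two small rules, sketched below in user-space C. This is only a reading of the changelog text, not code from the patch. `tsq_may_queue()` and `tsq_tso_cap()` are invented names, and the limit is passed in as a plain parameter standing in for the tcp_limit_output_bytes tunable.

```c
#include <stdbool.h>
#include <stdint.h>

#define GSO_MAX_SIZE 65536u	/* standard gso max limit quoted above */

/* Hypothetical helper: a socket may hand more data to the qdisc/device
 * layers only while its queued bytes stay below the limit.
 */
static bool tsq_may_queue(uint32_t wmem_alloc, uint32_t limit_bytes)
{
	return wmem_alloc < limit_bytes;
}

/* Hypothetical helper: cap TSO packets at half the limit, so that two
 * of them fit under it, without exceeding the standard gso maximum.
 */
static uint32_t tsq_tso_cap(uint32_t limit_bytes)
{
	uint32_t half = limit_bytes / 2;

	return half < GSO_MAX_SIZE ? half : GSO_MAX_SIZE;
}
```

With a limit of 40000 the cap comes out at 20000, which matches the "40000/2" side effect mentioned in the description.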


* [PATCH] net/ipv4/ipip: support move across network namespaces
From: Christian Franke @ 2012-07-10 17:11 UTC (permalink / raw)
  To: netdev; +Cc: Christian Franke

Allow moving ipip devices from one netns to another. This makes it
possible to isolate the tunneled network from the transport network.

Signed-off-by: Christian Franke <christian.franke@adytonsystems.com>
---
 include/net/ipip.h |  1 +
 net/ipv4/ipip.c    | 81 ++++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/include/net/ipip.h b/include/net/ipip.h
index a93cf6d..60f5152 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -18,6 +18,7 @@ struct ip_tunnel_6rd_parm {
 struct ip_tunnel {
 	struct ip_tunnel __rcu	*next;
 	struct net_device	*dev;
+	struct net		*transport_net;
 
 	int			err_count;	/* Number of arrived ICMP errors */
 	unsigned long		err_time;	/* Time when the last ICMP error arrived */
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 715338a..d996f35 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -99,6 +99,7 @@
 #include <asm/uaccess.h>
 #include <linux/skbuff.h>
 #include <linux/netdevice.h>
+#include <linux/notifier.h>
 #include <linux/in.h>
 #include <linux/tcp.h>
 #include <linux/udp.h>
@@ -151,6 +152,13 @@ struct pcpu_tstats {
 	struct u64_stats_sync	syncp;
 };
 
+static inline struct net *transport_net(struct net_device *dev)
+{
+	struct ip_tunnel *t = netdev_priv(dev);
+
+	return t->transport_net ? t->transport_net : dev_net(dev);
+}
+
 static struct rtnl_link_stats64 *ipip_get_stats64(struct net_device *dev,
 						  struct rtnl_link_stats64 *tot)
 {
@@ -314,7 +322,7 @@ failed_free:
 /* called with RTNL */
 static void ipip_tunnel_uninit(struct net_device *dev)
 {
-	struct net *net = dev_net(dev);
+	struct net *net = transport_net(dev);
 	struct ipip_net *ipn = net_generic(net, ipip_net_id);
 
 	if (dev == ipn->fb_tunnel_dev)
@@ -481,7 +489,7 @@ static netdev_tx_t ipip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 		dst = rt->rt_gateway;
 	}
 
-	rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
+	rt = ip_route_output_ports(transport_net(dev), &fl4, NULL,
 				   dst, tiph->saddr,
 				   0, 0,
 				   IPPROTO_IPIP, RT_TOS(tos),
@@ -631,7 +639,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	int err = 0;
 	struct ip_tunnel_parm p;
 	struct ip_tunnel *t;
-	struct net *net = dev_net(dev);
+	struct net *net = transport_net(dev);
 	struct ipip_net *ipn = net_generic(net, ipip_net_id);
 
 	switch (cmd) {
@@ -652,6 +660,9 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		break;
 
 	case SIOCADDTUNNEL:
+		/* New tunnels will be created in the current namespace */
+		net = dev_net(dev);
+		ipn = net_generic(net, ipip_net_id);
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
 		if (!capable(CAP_NET_ADMIN))
@@ -701,6 +712,13 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 				t->parms.iph.tos = p.iph.tos;
 				t->parms.iph.frag_off = p.iph.frag_off;
 				if (t->parms.link != p.link) {
+					if (!net_eq(dev_net(dev),
+						    transport_net(dev))) {
+						pr_info_once("%s: rebinding a device moved across namespaces is not supported\n",
+							     __func__);
+						err = -ENOTTY;
+						goto done;
+					}
 					t->parms.link = p.link;
 					ipip_tunnel_bind_dev(dev);
 					netdev_state_change(dev);
@@ -759,6 +777,10 @@ static const struct net_device_ops ipip_netdev_ops = {
 
 static void ipip_dev_free(struct net_device *dev)
 {
+	struct ip_tunnel *t = netdev_priv(dev);
+
+	if (t->transport_net)
+		put_net(t->transport_net);
 	free_percpu(dev->tstats);
 	free_netdev(dev);
 }
@@ -774,7 +796,6 @@ static void ipip_tunnel_setup(struct net_device *dev)
 	dev->flags		= IFF_NOARP;
 	dev->iflink		= 0;
 	dev->addr_len		= 4;
-	dev->features		|= NETIF_F_NETNS_LOCAL;
 	dev->features		|= NETIF_F_LLTX;
 	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
 }
@@ -904,6 +925,40 @@ static struct pernet_operations ipip_net_ops = {
 	.size = sizeof(struct ipip_net),
 };
 
+static int ipip_device_event(struct notifier_block *unused,
+			     unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct ip_tunnel *t;
+
+	if (dev->type != ARPHRD_TUNNEL)
+		return NOTIFY_DONE;
+
+	t = netdev_priv(dev);
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		/* When the tunnel is moved from its natural
+		 * network namespace, it will keep a reference
+		 * to it. */
+		if (dev->reg_state != NETREG_UNREGISTERING) {
+			if (!t->transport_net)
+				t->transport_net = get_net(dev_net(dev));
+		}
+		break;
+	case NETDEV_REGISTER:
+		if (net_eq(dev_net(dev), t->transport_net)) {
+			put_net(t->transport_net);
+			t->transport_net = NULL;
+		}
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block ipip_notifier_block = {
+	.notifier_call = ipip_device_event,
+};
+
 static int __init ipip_init(void)
 {
 	int err;
@@ -913,11 +968,20 @@ static int __init ipip_init(void)
 	err = register_pernet_device(&ipip_net_ops);
 	if (err < 0)
 		return err;
+
+	err = register_netdevice_notifier(&ipip_notifier_block);
+	if (err < 0)
+		goto out_notifier;
+
 	err = xfrm4_tunnel_register(&ipip_handler, AF_INET);
-	if (err < 0) {
-		unregister_pernet_device(&ipip_net_ops);
-		pr_info("%s: can't register tunnel\n", __func__);
-	}
+	if (err < 0)
+		goto out_xfrm;
+	return err;
+out_xfrm:
+	unregister_netdevice_notifier(&ipip_notifier_block);
+out_notifier:
+	unregister_pernet_device(&ipip_net_ops);
+	pr_info("%s: can't register tunnel\n", __func__);
 	return err;
 }
 
@@ -926,6 +990,7 @@ static void __exit ipip_fini(void)
 	if (xfrm4_tunnel_deregister(&ipip_handler, AF_INET))
 		pr_info("%s: can't deregister tunnel\n", __func__);
 
+	unregister_netdevice_notifier(&ipip_notifier_block);
 	unregister_pernet_device(&ipip_net_ops);
 }
 
-- 
1.7.11.1


* [PATCH net, 1/1] hyperv: Add support for setting MAC from within guests
From: Haiyang Zhang @ 2012-07-10 17:19 UTC (permalink / raw)
  To: davem, netdev; +Cc: devel, haiyangz, olaf, linux-kernel

This adds support for setting synthetic NIC MAC address from within Linux
guests. Before using this feature, the option "spoofing of MAC address"
should be enabled at the Hyper-V manager / Settings of the synthetic
NIC.

Thanks to Kin Cho <kcho@infoblox.com> for the initial implementation and
tests. And, thanks to Long Li <longli@microsoft.com> for the debugging
work.

Reported-and-tested-by: Kin Cho <kcho@infoblox.com>
Reported-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h   |    1 +
 drivers/net/hyperv/netvsc_drv.c   |   30 +++++++++++++-
 drivers/net/hyperv/rndis_filter.c |   79 +++++++++++++++++++++++++++++++++++++
 3 files changed, 109 insertions(+), 1 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 2857ab0..95ceb35 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -131,6 +131,7 @@ int rndis_filter_send(struct hv_device *dev,
 			struct hv_netvsc_packet *pkt);
 
 int rndis_filter_set_packet_filter(struct rndis_device *dev, u32 new_filter);
+int rndis_filter_set_device_mac(struct hv_device *hdev, char *mac);
 
 
 #define NVSP_INVALID_PROTOCOL_VERSION	((u32)0xFFFFFFFF)
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 8f8ed33..8e23c08 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -341,6 +341,34 @@ static int netvsc_change_mtu(struct net_device *ndev, int mtu)
 	return 0;
 }
 
+
+static int netvsc_set_mac_addr(struct net_device *ndev, void *p)
+{
+	struct net_device_context *ndevctx = netdev_priv(ndev);
+	struct hv_device *hdev =  ndevctx->device_ctx;
+	struct sockaddr *addr = p;
+	char save_adr[14];
+	unsigned char save_aatype;
+	int err;
+
+	memcpy(save_adr, ndev->dev_addr, ETH_ALEN);
+	save_aatype = ndev->addr_assign_type;
+
+	err = eth_mac_addr(ndev, p);
+	if (err != 0)
+		return err;
+
+	err = rndis_filter_set_device_mac(hdev, addr->sa_data);
+	if (err != 0) {
+		/* roll back to saved MAC */
+		memcpy(ndev->dev_addr, save_adr, ETH_ALEN);
+		ndev->addr_assign_type = save_aatype;
+	}
+
+	return err;
+}
+
+
 static const struct ethtool_ops ethtool_ops = {
 	.get_drvinfo	= netvsc_get_drvinfo,
 	.get_link	= ethtool_op_get_link,
@@ -353,7 +381,7 @@ static const struct net_device_ops device_ops = {
 	.ndo_set_rx_mode =		netvsc_set_multicast_list,
 	.ndo_change_mtu =		netvsc_change_mtu,
 	.ndo_validate_addr =		eth_validate_addr,
-	.ndo_set_mac_address =		eth_mac_addr,
+	.ndo_set_mac_address =		netvsc_set_mac_addr,
 };
 
 /*
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 981ebb1..fbf5394 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -27,6 +27,7 @@
 #include <linux/if_ether.h>
 #include <linux/netdevice.h>
 #include <linux/if_vlan.h>
+#include <linux/nls.h>
 
 #include "hyperv_net.h"
 
@@ -47,6 +48,7 @@ struct rndis_request {
 	struct hv_page_buffer buf;
 	/* FIXME: We assumed a fixed size request here. */
 	struct rndis_message request_msg;
+	u8 ext[100];
 };
 
 static void rndis_filter_send_completion(void *ctx);
@@ -511,6 +513,83 @@ static int rndis_filter_query_device_mac(struct rndis_device *dev)
 				      dev->hw_mac_adr, &size);
 }
 
+#define NWADR_STR "NetworkAddress"
+#define NWADR_STRLEN 14
+
+int rndis_filter_set_device_mac(struct hv_device *hdev, char *mac)
+{
+	struct netvsc_device *nvdev = hv_get_drvdata(hdev);
+	struct rndis_device *rdev = nvdev->extension;
+	struct net_device *ndev = nvdev->ndev;
+	struct rndis_request *request;
+	struct rndis_set_request *set;
+	struct rndis_config_parameter_info *cpi;
+	wchar_t *cfg_nwadr, *cfg_mac;
+	struct rndis_set_complete *set_complete;
+	char macstr[2*ETH_ALEN+1];
+	u32 extlen = sizeof(struct rndis_config_parameter_info) +
+		2*NWADR_STRLEN + 4*ETH_ALEN;
+	int ret, t;
+
+	request = get_rndis_request(rdev, RNDIS_MSG_SET,
+		RNDIS_MESSAGE_SIZE(struct rndis_set_request) + extlen);
+	if (!request)
+		return -ENOMEM;
+
+	set = &request->request_msg.msg.set_req;
+	set->oid = RNDIS_OID_GEN_RNDIS_CONFIG_PARAMETER;
+	set->info_buflen = extlen;
+	set->info_buf_offset = sizeof(struct rndis_set_request);
+	set->dev_vc_handle = 0;
+
+	cpi = (struct rndis_config_parameter_info *)((ulong)set +
+		set->info_buf_offset);
+	cpi->parameter_name_offset =
+		sizeof(struct rndis_config_parameter_info);
+	/* Multiply by 2 because host needs 2 bytes (utf16) for each char */
+	cpi->parameter_name_length = 2*NWADR_STRLEN;
+	cpi->parameter_type = RNDIS_CONFIG_PARAM_TYPE_STRING;
+	cpi->parameter_value_offset =
+		cpi->parameter_name_offset + cpi->parameter_name_length;
+	/* Multiply by 4 because each MAC byte displayed as 2 utf16 chars */
+	cpi->parameter_value_length = 4*ETH_ALEN;
+
+	cfg_nwadr = (wchar_t *)((ulong)cpi + cpi->parameter_name_offset);
+	cfg_mac = (wchar_t *)((ulong)cpi + cpi->parameter_value_offset);
+	ret = utf8s_to_utf16s(NWADR_STR, NWADR_STRLEN, UTF16_HOST_ENDIAN,
+			      cfg_nwadr, NWADR_STRLEN);
+	if (ret < 0)
+		goto cleanup;
+	snprintf(macstr, 2*ETH_ALEN+1, "%pm", mac);
+	ret = utf8s_to_utf16s(macstr, 2*ETH_ALEN, UTF16_HOST_ENDIAN,
+			      cfg_mac, 2*ETH_ALEN);
+	if (ret < 0)
+		goto cleanup;
+
+	ret = rndis_filter_send_request(rdev, request);
+	if (ret != 0)
+		goto cleanup;
+
+	t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+	if (t == 0) {
+		netdev_err(ndev, "timeout before we got a set response...\n");
+		/*
+		 * can't put_rndis_request, since we may still receive a
+		 * send-completion.
+		 */
+		return -EBUSY;
+	} else {
+		set_complete = &request->response_msg.msg.set_complete;
+		if (set_complete->status != RNDIS_STATUS_SUCCESS)
+			ret = -EINVAL;
+	}
+
+cleanup:
+	put_rndis_request(rdev, request);
+	return ret;
+}
+
+
 static int rndis_filter_query_device_link_status(struct rndis_device *dev)
 {
 	u32 size = sizeof(u32);
-- 
1.7.4.1
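
The "multiply by 2" and "multiply by 4" comments in rndis_filter_set_device_mac() are easier to check with the arithmetic written out. The user-space sketch below assumes that "%pm" yields twelve lowercase hex digits without separators, and that widening ASCII to 16-bit code units is an adequate stand-in for utf8s_to_utf16s() on this input. `cfg_payload_bytes()` and `mac_to_utf16()` are names invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define ETH_ALEN 6
#define NWADR_STRLEN 14		/* strlen("NetworkAddress") */

/* Bytes needed for the UTF-16 name and value that follow the
 * rndis_config_parameter_info header (header size not included).
 */
static size_t cfg_payload_bytes(void)
{
	return 2 * NWADR_STRLEN		/* name: 2 bytes per character */
	     + 4 * ETH_ALEN;		/* value: 2 hex chars per octet,
					 * 2 bytes per character
					 */
}

/* Format a MAC as 12 hex digits and widen each ASCII character to a
 * 16-bit code unit. 'out' must have room for 2 * ETH_ALEN units.
 * Returns the number of code units written.
 */
static int mac_to_utf16(const uint8_t mac[ETH_ALEN], uint16_t *out)
{
	char macstr[2 * ETH_ALEN + 1];
	int i;

	snprintf(macstr, sizeof(macstr), "%02x%02x%02x%02x%02x%02x",
		 mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
	for (i = 0; i < 2 * ETH_ALEN; i++)
		out[i] = (uint16_t)(unsigned char)macstr[i];

	return i;
}
```

So the name and value together take 28 + 24 = 52 bytes after the header, which fits in the 100-byte `ext` area the patch adds to struct rndis_request only if the header itself is no larger than 48 bytes.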


* RE: [PATCH net,1/1] hyperv: Add support for setting MAC from within guests
From: Ben Hutchings @ 2012-07-10 17:23 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: davem@davemloft.net, netdev@vger.kernel.org, KY Srinivasan,
	olaf@aepfle.de, linux-kernel@vger.kernel.org,
	devel@linuxdriverproject.org
In-Reply-To: <A1F3067C9B68744AA19F6802BAB8FFDC0DDC8C5E@TK5EX14MBXC222.redmond.corp.microsoft.com>

On Tue, 2012-07-10 at 17:03 +0000, Haiyang Zhang wrote:
> 
> > -----Original Message-----
> > From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> > Sent: Friday, July 06, 2012 8:19 PM
> > To: Haiyang Zhang
> > Cc: davem@davemloft.net; netdev@vger.kernel.org; KY Srinivasan;
> > olaf@aepfle.de; linux-kernel@vger.kernel.org;
> > devel@linuxdriverproject.org
> > Subject: Re: [PATCH net,1/1] hyperv: Add support for setting MAC from
> > within guests
> > 
> > On Fri, 2012-07-06 at 14:25 -0700, Haiyang Zhang wrote:
> > > This adds support for setting synthetic NIC MAC address from within
> > Linux
> > > guests. Before using this feature, the option "spoofing of MAC
> > address"
> > > should be enabled at the Hyper-V manager / Settings of the synthetic
> > > NIC.
> > [...]
> > > +int rndis_filter_set_device_mac(struct hv_device *hdev, char *mac)
> > > +{
> > [...]
> > > +	t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
> > > +	if (t == 0) {
> > > +		netdev_err(ndev, "timeout before we got a set
> > response...\n");
> > > +		/*
> > > +		 * can't put_rndis_request, since we may still receive a
> > > +		 * send-completion.
> > > +		 */
> > > +		return -EBUSY;
> > > +	} else {
> > > +		set_complete = &request->response_msg.msg.set_complete;
> > > +		if (set_complete->status != RNDIS_STATUS_SUCCESS)
> > > +			ret = -EINVAL;
> > [...]
> > 
> > Is there a specific error code that indicates the hypervisor is
> > configured not to allow MAC address changes?  If so, shouldn't that be
> > translated to return EPERM rather than EINVAL?
> 
> I have checked the return code, 0xc000000d, which is returned both when MAC
> spoofing is not enabled and when the parameter contains other errors. So we
> can't tell whether it is a permission error or not. I will re-submit this
> patch, still using EINVAL.

Oh well, thanks for trying!

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: net-next kernel NULL pointer dereference at fib_rules_tclass
From: Eric Dumazet @ 2012-07-10 17:25 UTC (permalink / raw)
  To: David Miller; +Cc: ogerlitz, netdev, shlomop, amirv, erezsh
In-Reply-To: <20120710.094428.1167234955738653678.davem@davemloft.net>

On Tue, 2012-07-10 at 09:44 -0700, David Miller wrote:
> From: Or Gerlitz <ogerlitz@mellanox.com>
> Date: Tue, 10 Jul 2012 10:16:55 +0300
> 
> > Starting system logger: BUG: unable to handle kernel NULL pointer dereference at 00000000000000ac
> > IP: [<ffffffff81320393>] fib_rules_tclass+0xf/0x17
> 
> Ok, fib_rules_tclass() checks for res->r being NULL and only
> dereferences it if it is not.
> 
> fib4_rule->tclassid has offset ~0x8c on x86-64, and this fault
> address is 0x10 bytes off.
> 
> Does this patch fix the problem?
> 
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 539c672..000c467 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -230,6 +230,7 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
>  			     struct fib_result *res)
>  {
>  	if (!net->ipv4.fib_has_custom_rules) {
> +		res->r = NULL;
>  		if (net->ipv4.fib_local &&
>  		    !fib_table_lookup(net->ipv4.fib_local, flp, res,
>  				      FIB_LOOKUP_NOREF))

It does here, thanks

^ permalink raw reply
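
The failure mode discussed above (a NULL check that cannot help when the pointer was never initialized) can be sketched in user-space C. The struct layout, the names rules_tclass() and lookup_fast_path(), and the 0x8c offset are illustrative assumptions taken from the approximate figure quoted in the thread, not the real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout only: a field sitting at offset 0x8c. */
struct rule {
	char pad[0x8c];
	uint32_t tclassid;
};

struct result {
	struct rule *r;
};

/* Mirrors the guard described above: dereference only a non-NULL rule. */
static uint32_t rules_tclass(const struct result *res)
{
	assert(res != NULL);
	return res->r ? res->r->tclassid : 0;
}

/* Fast path with no custom rules: no rule matched, so the result has
 * to say so explicitly instead of keeping whatever was on the stack.
 */
static int lookup_fast_path(struct result *res)
{
	assert(res != NULL);
	res->r = NULL;
	return 0;
}
```

With this layout, a stale non-NULL value such as 0x20 in res->r would pass the guard and read tclassid at 0x20 + 0x8c = 0xac, a small address that still looks like a NULL pointer dereference in the oops.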

* Re: [PATCH 2/2] bonding: debugfs and network namespaces are incompatible
From: Jay Vosburgh @ 2012-07-10 17:36 UTC (permalink / raw)
  To: David Miller
  Cc: ebiederm, dilip.daya, linux-kernel, containers, netdev, serge
In-Reply-To: <20120709.144932.243254122059983829.davem@davemloft.net>

David Miller <davem@davemloft.net> wrote:

>From: ebiederm@xmission.com (Eric W. Biederman)
>Date: Mon, 09 Jul 2012 13:52:43 -0700
>
>> 
>> The bonding debugfs support has been broken in the presence of network
>> namespaces since it has been added.  The debugfs support does not handle
>> multiple bonding devices with the same name in different network
>> namespaces.
>> 
>> I haven't had any bug reports, and I'm not interested in getting any.
>> Disable the debugfs support when network namespaces are enabled.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
>Applied.

	Since distro kernels appear to set CONFIG_NET_NS, doesn't this
effectively disable debugfs for bonding on most distros?

	Do the other network device drivers that support debugfs have a
similar problem?  E.g., if each of two namespaces has an skge device
with the same name, will there be a debugfs conflict there as well?

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply
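
The name conflict asked about above can be modelled with a flat directory keyed by device name alone. This is only a user-space sketch: flat_dir_create() and flat_dir_create_ns() are hypothetical names, and qualifying the entry with a namespace id is one assumed way around the collision, not something the thread proposes.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define MAX_DIRS 8
#define NAME_LEN 32

struct flat_dir {
	char names[MAX_DIRS][NAME_LEN];
	int count;
};

/* Create an entry in a single global directory; duplicate names fail. */
static int flat_dir_create(struct flat_dir *d, const char *name)
{
	assert(d != NULL && name != NULL);
	if (strlen(name) >= NAME_LEN)
		return -ENAMETOOLONG;
	for (int i = 0; i < d->count; i++)
		if (strcmp(d->names[i], name) == 0)
			return -EEXIST;
	if (d->count == MAX_DIRS)
		return -ENOSPC;
	strcpy(d->names[d->count++], name);
	return 0;
}

/* Hypothetical workaround: qualify the name with a namespace id. */
static int flat_dir_create_ns(struct flat_dir *d, unsigned int ns_id,
			      const char *name)
{
	char qualified[NAME_LEN];
	int n = snprintf(qualified, sizeof(qualified), "ns%u-%s", ns_id, name);

	if (n < 0 || (size_t)n >= sizeof(qualified))
		return -ENAMETOOLONG;
	return flat_dir_create(d, qualified);
}
```

Two "bond0" devices in different namespaces collide in flat_dir_create() but coexist once the key includes the namespace.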

* Re: [RFC PATCH v2] tcp: TCP Small Queues
From: Yuchung Cheng @ 2012-07-10 17:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, dave.taht, netdev, codel, therbert, mattmathis,
	nanditad, ncardwell, andrewmcgr
In-Reply-To: <1341933215.3265.5476.camel@edumazet-glaptop>

On Tue, Jul 10, 2012 at 8:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This introduces TSQ (TCP Small Queues)
>
> The goal of TSQ is to reduce the number of TCP packets in xmit queues
> (qdisc & device queues), to reduce RTT and cwnd bias, part of the
> bufferbloat problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
> given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2 : It can help to reduce
> latencies of high prio packets, having smaller TSO packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, and with or without TSO/GSO. Without reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and socket autotuning on both sides no longer uses 4 Mbytes.
>
> As the skb destructor cannot restart xmit itself (as the qdisc lock might
> be taken at this point), we delegate the work to a tasklet. We use one
> tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
>   but some drivers call it in their start_xmit() handler.
>   These drivers should at least use BQL, or else a single TCP
>   session can still fill the whole NIC TX ring, since TSQ will
>   have no effect.
>
> Not-Yet-Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/tcp.h        |    9 ++
>  include/net/tcp.h          |    3
>  net/ipv4/sysctl_net_ipv4.c |    7 +
>  net/ipv4/tcp.c             |   14 ++-
>  net/ipv4/tcp_minisocks.c   |    1
>  net/ipv4/tcp_output.c      |  132 ++++++++++++++++++++++++++++++++++-
>  6 files changed, 160 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 7d3bced..55b8cf9 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -339,6 +339,9 @@ struct tcp_sock {
>         u32     rcv_tstamp;     /* timestamp of last received ACK (for keepalives) */
>         u32     lsndtime;       /* timestamp of last sent data packet (for restart window) */
>
> +       struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
> +       unsigned long   tsq_flags;
> +
>         /* Data for direct copy to user */
>         struct {
>                 struct sk_buff_head     prequeue;
> @@ -494,6 +497,12 @@ struct tcp_sock {
>         struct tcp_cookie_values  *cookie_values;
>  };
>
> +enum tsq_flags {
> +       TSQ_THROTTLED,
> +       TSQ_QUEUED,
> +       TSQ_OWNED, /* tcp_tasklet_func() found socket was locked */
> +};
> +
>  static inline struct tcp_sock *tcp_sk(const struct sock *sk)
>  {
>         return (struct tcp_sock *)sk;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 53fb7d8..3a6ed09 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -253,6 +253,7 @@ extern int sysctl_tcp_cookie_size;
>  extern int sysctl_tcp_thin_linear_timeouts;
>  extern int sysctl_tcp_thin_dupack;
>  extern int sysctl_tcp_early_retrans;
> +extern int sysctl_tcp_limit_output_bytes;
>
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> @@ -321,6 +322,8 @@ extern struct proto tcp_prot;
>
>  extern void tcp_init_mem(struct net *net);
>
> +extern void tcp_tasklet_init(void);
> +
>  extern void tcp_v4_err(struct sk_buff *skb, u32);
>
>  extern void tcp_shutdown (struct sock *sk, int how);
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 12aa0c5..70730f7 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -598,6 +598,13 @@ static struct ctl_table ipv4_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec
>         },
> +       {
> +               .procname       = "tcp_limit_output_bytes",
> +               .data           = &sysctl_tcp_limit_output_bytes,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec
> +       },
>  #ifdef CONFIG_NET_DMA
>         {
>                 .procname       = "tcp_dma_copybreak",
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3ba605f..8838bd2 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -376,6 +376,7 @@ void tcp_init_sock(struct sock *sk)
>         skb_queue_head_init(&tp->out_of_order_queue);
>         tcp_init_xmit_timers(sk);
>         tcp_prequeue_init(tp);
> +       INIT_LIST_HEAD(&tp->tsq_node);
>
>         icsk->icsk_rto = TCP_TIMEOUT_INIT;
>         tp->mdev = TCP_TIMEOUT_INIT;
> @@ -786,15 +787,17 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>                                        int large_allowed)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
> -       u32 xmit_size_goal, old_size_goal;
> +       u32 xmit_size_goal, old_size_goal, gso_max_size;
>
>         xmit_size_goal = mss_now;
>
>         if (large_allowed && sk_can_gso(sk)) {
> -               xmit_size_goal = ((sk->sk_gso_max_size - 1) -
> -                                 inet_csk(sk)->icsk_af_ops->net_header_len -
> -                                 inet_csk(sk)->icsk_ext_hdr_len -
> -                                 tp->tcp_header_len);
> +               gso_max_size = min_t(u32, sk->sk_gso_max_size,
> +                                    sysctl_tcp_limit_output_bytes >> 1);
> +               xmit_size_goal = (gso_max_size - 1) -
> +                                inet_csk(sk)->icsk_af_ops->net_header_len -
> +                                inet_csk(sk)->icsk_ext_hdr_len -
> +                                tp->tcp_header_len;
>
>                 xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
>
> @@ -3573,4 +3576,5 @@ void __init tcp_init(void)
>         tcp_secret_primary = &tcp_secret_one;
>         tcp_secret_retiring = &tcp_secret_two;
>         tcp_secret_secondary = &tcp_secret_two;
> +       tcp_tasklet_init();
>  }
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 72b7c63..83b358f 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -482,6 +482,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
>                         treq->snt_isn + 1 + tcp_s_data_size(oldtp);
>
>                 tcp_prequeue_init(newtp);
> +               INIT_LIST_HEAD(&newtp->tsq_node);
>
>                 tcp_init_wl(newtp, treq->rcv_isn);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c465d3e..991ae45 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -50,6 +50,9 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
>   */
>  int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
>
> +/* Default TSQ limit of two TSO segments */
> +int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
> +
>  /* This limits the percentage of the congestion window which we
>   * will allow a single TSO frame to consume.  Building TSO frames
>   * which are too large can cause TCP streams to be bursty.
> @@ -65,6 +68,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
>  int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
>  EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
>
> +static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> +                          int push_one, gfp_t gfp);
>
>  /* Account for new data that has been sent to the network. */
>  static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
> @@ -783,6 +788,118 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
>         return size;
>  }
>
> +
> +/* TCP SMALL QUEUES (TSQ)
> + *
> + * TSQ goal is to keep small amount of skbs per tcp flow in tx queues (qdisc+dev)
> + * to reduce RTT and bufferbloat.
> + * We do this using a special skb destructor (tcp_wfree).
> + *
> + * It's important that tcp_wfree() can be replaced by sock_wfree() in the
> + * event the skb needs to be reallocated in a driver.
> + * The invariant being skb->truesize subtracted from sk->sk_wmem_alloc
> + *
> + * Since transmit from skb destructor is forbidden, we use a tasklet
> + * to process all sockets that eventually need to send more skbs.
> + * We use one tasklet per cpu, with its own queue of sockets.
> + */
> +struct tsq_tasklet {
> +       struct tasklet_struct   tasklet;
> +       struct list_head        head; /* queue of tcp sockets */
> +};
> +static DEFINE_PER_CPU(struct tsq_tasklet, tsq_tasklet);
> +
> +/*
> + * One tasklet per cpu tries to send more skbs.
> + * We run in tasklet context but need to disable irqs when
> + * transferring tsq->head because tcp_wfree() might
> + * interrupt us (non NAPI drivers)
> + */
> +static void tcp_tasklet_func(unsigned long data)
> +{
> +       struct tsq_tasklet *tsq = (struct tsq_tasklet *)data;
> +       LIST_HEAD(list);
> +       unsigned long flags;
> +       struct list_head *q, *n;
> +       struct tcp_sock *tp;
> +       struct sock *sk;
> +
> +       local_irq_save(flags);
> +       list_splice_init(&tsq->head, &list);
> +       local_irq_restore(flags);
> +
> +       list_for_each_safe(q, n, &list) {
> +               tp = list_entry(q, struct tcp_sock, tsq_node);
> +               list_del(&tp->tsq_node);
> +
> +               sk = (struct sock *)tp;
> +               bh_lock_sock(sk);
> +
> +               if (!sock_owned_by_user(sk)) {
> +                       if ((1 << sk->sk_state) &
> +                           (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED))
> +                               tcp_write_xmit(sk,
> +                                              tcp_current_mss(sk),
> +                                              0, 0,
> +                                              GFP_ATOMIC);
Is this case possible: the app does a large send and immediately closes
the socket, then the queue is throttled and tcp_write_xmit is called back
when the state is TCP_FIN_WAIT1?

I think tcp_write_xmit should continue regardless of the current state,
because the send may be throttled/delayed but the state change is
synchronous.


> +               } else {
> +                       /* TODO:
> +                        * setup a timer, or check TSQ_OWNED in release_sock()
> +                        */
> +                       set_bit(TSQ_OWNED, &tp->tsq_flags);
> +               }
> +               bh_unlock_sock(sk);
> +
> +               clear_bit(TSQ_QUEUED, &tp->tsq_flags);
> +               sk_free(sk);
> +       }
> +}
> +
> +void __init tcp_tasklet_init(void)
> +{
> +       int i;
> +
> +       for_each_possible_cpu(i) {
> +               struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
> +
> +               INIT_LIST_HEAD(&tsq->head);
> +               tasklet_init(&tsq->tasklet,
> +                            tcp_tasklet_func,
> +                            (unsigned long)tsq);
> +       }
> +}
> +
> +/*
> + * Write buffer destructor automatically called from kfree_skb.
> + * We can't xmit new skbs from this context, as we might already
> + * hold qdisc lock.
> + */
> +void tcp_wfree(struct sk_buff *skb)
> +{
> +       struct sock *sk = skb->sk;
> +       struct tcp_sock *tp = tcp_sk(sk);
> +
> +       if (test_and_clear_bit(TSQ_THROTTLED, &tp->tsq_flags) &&
> +           !test_and_set_bit(TSQ_QUEUED, &tp->tsq_flags)) {
> +               unsigned long flags;
> +               struct tsq_tasklet *tsq;
> +
> +               /* Keep a ref on socket.
> +                * This last ref will be released in tcp_tasklet_func()
> +                */
> +               atomic_sub(skb->truesize - 1, &sk->sk_wmem_alloc);
> +
> +               /* queue this socket to tasklet queue */
> +               local_irq_save(flags);
> +               tsq = &__get_cpu_var(tsq_tasklet);
> +               list_add(&tp->tsq_node, &tsq->head);
> +               tasklet_schedule(&tsq->tasklet);
> +               local_irq_restore(flags);
> +       } else {
> +               sock_wfree(skb);
> +       }
> +}
> +
>  /* This routine actually transmits TCP packets queued in by
>   * tcp_do_sendmsg().  This is used by both the initial
>   * transmission and possible later retransmissions.
> @@ -844,7 +961,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>
>         skb_push(skb, tcp_header_size);
>         skb_reset_transport_header(skb);
> -       skb_set_owner_w(skb, sk);
> +
> +       skb_orphan(skb);
> +       skb->sk = sk;
> +       skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
> +                         tcp_wfree : sock_wfree;
> +       atomic_add(skb->truesize, &sk->sk_wmem_alloc);
>
>         /* Build TCP header and checksum it. */
>         th = tcp_hdr(skb);
> @@ -1780,6 +1902,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>         while ((skb = tcp_send_head(sk))) {
>                 unsigned int limit;
>
> +
>                 tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
>                 BUG_ON(!tso_segs);
>
> @@ -1800,6 +1923,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                                 break;
>                 }
>
> +               /* TSQ : sk_wmem_alloc accounts skb truesize,
> +                * including skb overhead. But thats OK.
> +                */
> +               if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> +                       set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> +                       break;
> +               }
>                 limit = mss_now;
>                 if (tso_segs > 1 && !tcp_urg_mode(tp))
>                         limit = tcp_mss_split_point(sk, skb, mss_now,
>
>

^ permalink raw reply
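
The two numeric rules in the RFC above (cap one TSO packet at half the per-socket limit, and stop sending once the bytes already queued reach the limit) can be written as a stand-alone sketch. tsq_gso_cap() and tsq_should_throttle() are hypothetical names for this illustration; in the patch the same logic sits inline in tcp_xmit_size_goal() and tcp_write_xmit().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cap for one TSO packet: half the per-socket limit, but never more
 * than the device's GSO maximum (mirrors the min_t() in the patch).
 */
static uint32_t tsq_gso_cap(uint32_t gso_max_size, uint32_t limit_bytes)
{
	uint32_t half;

	assert(limit_bytes > 0);	/* this sketch assumes TSQ is enabled */
	half = limit_bytes >> 1;
	return gso_max_size < half ? gso_max_size : half;
}

/* Throttle once the bytes already queued below TCP reach the limit. */
static bool tsq_should_throttle(uint32_t wmem_alloc, uint32_t limit_bytes)
{
	return wmem_alloc >= limit_bytes;
}
```

With the default limit of 131072 the cap stays at the usual 65536; lowering the limit to 40000 brings it down to 20000, which is the side effect the description mentions.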

* Re: [PATCH net-next 5/9] net/eipoib: Add ethtool file support
From: Ben Hutchings @ 2012-07-10 17:48 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, roland, netdev, ali, sean.hefty, Erez Shitrit
In-Reply-To: <1341922569-4118-6-git-send-email-ogerlitz@mellanox.com>

On Tue, 2012-07-10 at 15:16 +0300, Or Gerlitz wrote:
> From: Erez Shitrit <erezsh@mellanox.co.il>
> 
> Via ethtool the driver describes its version, ABI version, on what PIF
> interface it runs and various statistics.
> 
> Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>  drivers/net/eipoib/eth_ipoib_ethtool.c |  147 ++++++++++++++++++++++++++++++++
>  1 files changed, 147 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/eipoib/eth_ipoib_ethtool.c
> 
> diff --git a/drivers/net/eipoib/eth_ipoib_ethtool.c b/drivers/net/eipoib/eth_ipoib_ethtool.c
> new file mode 100644
> index 0000000..b5c20ec
> --- /dev/null
> +++ b/drivers/net/eipoib/eth_ipoib_ethtool.c
> @@ -0,0 +1,147 @@
> +/*
> + * Copyright (c) 2012 Mellanox Technologies. All rights reserved
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * openfabric.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include "eth_ipoib.h"
> +
> +static void parent_ethtool_get_drvinfo(struct net_device *parent_dev,
> +				       struct ethtool_drvinfo *drvinfo)
> +{
> +	struct parent *parent = netdev_priv(parent_dev);
> +
> +	if (strlen(DRV_NAME) + strlen(parent->ipoib_main_interface) > 31)
> +		strncpy(drvinfo->driver, "driver name is too long", 32);

Returning error messages like this is stupid; either truncate or WARN.

> +	else
> +		sprintf(drvinfo->driver, "%s:%s",
> +			DRV_NAME, parent->ipoib_main_interface);

Why do you not use the separate driver and bus_info fields?

> +	strncpy(drvinfo->version, DRV_VERSION, 32);
> +
> +	/* indicates ABI version */
> +	snprintf(drvinfo->fw_version, 32, "%d", EIPOIB_ABI_VER);
> +}
> +
> +static const char parent_strings[][ETH_GSTRING_LEN] = {
> +	/* public statistics */
> +	"rx_packets", "tx_packets", "rx_bytes",
> +	"tx_bytes", "rx_errors", "tx_errors",
> +	"rx_dropped", "tx_dropped", "multicast",
> +	"collisions", "rx_length_errors", "rx_over_errors",
> +	"rx_crc_errors", "rx_frame_errors", "rx_fifo_errors",
> +	"rx_missed_errors", "tx_aborted_errors", "tx_carrier_errors",
> +	"tx_fifo_errors", "tx_heartbeat_errors", "tx_window_errors",
> +#define PUB_STATS_LEN	21
[...]

This is duplicating the basic netdev statistics that are already
available without ethtool.  Most of them are completely meaningless for
Infiniband, I suspect.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply
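
The two review comments above (truncate instead of returning an error string, and use separate fields) could look roughly like the following user-space sketch. struct drvinfo_sketch and both helpers are hypothetical; placing the parent interface name in bus_info is an assumption about what the reviewer had in mind. In-kernel code would more likely use strlcpy() than snprintf() for the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define INFO_LEN 32	/* size of the drvinfo string fields used above */

struct drvinfo_sketch {
	char driver[INFO_LEN];
	char bus_info[INFO_LEN];
};

/* Copy with truncation; always NUL-terminates, which strncpy() does
 * not do when the source fills the whole buffer.
 */
static void copy_truncated(char dst[INFO_LEN], const char *src)
{
	assert(dst != NULL && src != NULL);
	snprintf(dst, INFO_LEN, "%s", src);
}

/* Report the driver name and the parent interface in separate fields. */
static void fill_drvinfo(struct drvinfo_sketch *info, const char *drv_name,
			 const char *parent_if)
{
	copy_truncated(info->driver, drv_name);
	copy_truncated(info->bus_info, parent_if);
}
```

An over-long name is then silently cut to 31 characters plus the terminator instead of being replaced by an error message.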

* RE: 82571EB: Detected Hardware Unit Hang
From: Wyborny, Carolyn @ 2012-07-10 18:14 UTC (permalink / raw)
  To: Joe Jin
  Cc: e1000-devel@lists.sf.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <4FFBDC50.5090800@oracle.com>


>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
>On Behalf Of Joe Jin
>Sent: Tuesday, July 10, 2012 12:40 AM
>To: Joe Jin
>Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
[..]
>I checked all the driver code and did not find anywhere that sets
>upper.data with E1000_TXD_STAT_DD; I guess upper.data is set by hardware?

Yes, the hw sets this bit after transmit; it's how the driver knows to reclaim it.

>> Would you please help on this?

Yes, we'll attempt to reproduce it and get back to you with any more needed info.  I'm sorry you're having problems with our parts.

Thanks for the report,

Carolyn

Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation

^ permalink raw reply

* Re: net-next kernel NULL pointer dereference at fib_rules_tclass
From: Greg Rose @ 2012-07-10 18:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, ogerlitz, netdev, shlomop, amirv, erezsh
In-Reply-To: <1341941101.3265.5799.camel@edumazet-glaptop>

On Tue, 10 Jul 2012 19:25:01 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2012-07-10 at 09:44 -0700, David Miller wrote:
> > From: Or Gerlitz <ogerlitz@mellanox.com>
> > Date: Tue, 10 Jul 2012 10:16:55 +0300
> > 
> > > Starting system logger: BUG: unable to handle kernel NULL pointer
> > > dereference at 00000000000000ac IP: [<ffffffff81320393>]
> > > fib_rules_tclass+0xf/0x17
> > 
> > Ok, fib_rules_tclass() checks for res->r being NULL and only
> > dereferences it if it is not.
> > 
> > fib4_rule->tclassid has offset ~0x8c on x86-64, and this fault
> > address is 0x10 bytes off.
> > 
> > Does this patch fix the problem?
> > 
> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> > index 539c672..000c467 100644
> > --- a/include/net/ip_fib.h
> > +++ b/include/net/ip_fib.h
> > @@ -230,6 +230,7 @@ static inline int fib_lookup(struct net *net,
> > struct flowi4 *flp, struct fib_result *res)
> >  {
> >  	if (!net->ipv4.fib_has_custom_rules) {
> > +		res->r = NULL;
> >  		if (net->ipv4.fib_local &&
> >  		    !fib_table_lookup(net->ipv4.fib_local, flp,
> > res, FIB_LOOKUP_NOREF))
> 
> It does here, thanks

Works for me too.

Thanks,


^ permalink raw reply

* Re: [RFC PATCH v2] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-10 18:32 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, dave.taht, netdev, codel, therbert, mattmathis,
	nanditad, ncardwell, andrewmcgr
In-Reply-To: <CAK6E8=eXMBzG=Zo0N8KWQSNaoekAqAh5N5eY2gE7+BYKJLaOQg@mail.gmail.com>

On Tue, 2012-07-10 at 10:37 -0700, Yuchung Cheng wrote:
> On Tue, Jul 10, 2012 at 8:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > +
> > +               if (!sock_owned_by_user(sk)) {
> > +                       if ((1 << sk->sk_state) &
> > +                           (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED))
> > +                               tcp_write_xmit(sk,
> > +                                              tcp_current_mss(sk),
> > +                                              0, 0,
> > +                                              GFP_ATOMIC);
> Is this case possible: the app does a large send and immediately closes
> the socket, then the queue is throttled and tcp_write_xmit is called back
> when the state is TCP_FIN_WAIT1?
> 
> I think tcp_write_xmit should continue regardless of the current state,
> because the send may be throttled/delayed but the state change is
> synchronous.
> send maybe throttled/delayed but state change is synchronous.
> 

I need to test for some allowed states, I think.

Maybe I missed some states, but I don't think we should call
tcp_write_xmit() if the socket is now in TIMEWAIT state?

(because of the tasklet delay, we might handle TX completion _after_ the
socket state change)

^ permalink raw reply
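
The state test being discussed is a bitmask check on (1 << sk_state). A stand-alone sketch of a wider mask follows. The state numbers match the kernel's TCP state enumeration; the particular set in TSQ_XMIT_STATES is an assumption made for this illustration (states where data queued before or alongside a FIN may still need sending, with TIME_WAIT left out), not something either message settles on.

```c
#include <assert.h>
#include <stdbool.h>

/* TCP state numbers, in the kernel's order. */
enum {
	TCP_ESTABLISHED = 1, TCP_SYN_SENT, TCP_SYN_RECV, TCP_FIN_WAIT1,
	TCP_FIN_WAIT2, TCP_TIME_WAIT, TCP_CLOSE, TCP_CLOSE_WAIT,
	TCP_LAST_ACK, TCP_LISTEN, TCP_CLOSING,
};

#define TCPF(state) (1u << (state))

/* Assumed set of states in which queued data may still need sending. */
#define TSQ_XMIT_STATES (TCPF(TCP_ESTABLISHED) | TCPF(TCP_FIN_WAIT1) | \
			 TCPF(TCP_CLOSING) | TCPF(TCP_CLOSE_WAIT) | \
			 TCPF(TCP_LAST_ACK))

/* True if the deferred transmit should run for a socket in this state. */
static bool tsq_may_xmit(int state)
{
	assert(state >= TCP_ESTABLISHED && state <= TCP_CLOSING);
	return (TCPF(state) & TSQ_XMIT_STATES) != 0;
}
```

A socket that was closed right after a large send is in FIN_WAIT1 and still passes this check, while one that has already reached TIME_WAIT does not.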

* Re: 82571EB: Detected Hardware Unit Hang
From: Dave, Tushar N @ 2012-07-10 19:02 UTC (permalink / raw)
  To: Joe Jin
  Cc: netdev@vger.kernel.org, e1000-devel@lists.sf.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <4FFBDC50.5090800@oracle.com>

>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
>On Behalf Of Joe Jin
>Sent: Tuesday, July 10, 2012 12:40 AM
>To: Joe Jin
>Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>When I debugged the driver I found that, before the HW hang is detected,
>the driver is unable to clean and reclaim the resources:
>
>1457         while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&  <== here upper.data is always 0x300
>1458                (count < tx_ring->count)) {
>     <--- snip --->
>1487         }
>
>
>I checked all the driver code and did not find anywhere that sets
>upper.data with E1000_TXD_STAT_DD; I guess upper.data is set by hardware?

Yes, upper.data (part of which is the STATUS byte) is set by HW. Basically, the driver checks the E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is set, HW has processed that descriptor and the driver can now clean it.
With the value 0x300, the DD bit is not set, which means HW has not processed that descriptor.

How fast does the tx hang reproduce? I suggest you enable the debug code in the driver so that when a tx hang occurs it dumps the HW desc ring info into the kernel log.
You can run "ethtool -s ethx msglvl 0x2c00" to enable debug.
Once the tx hang occurs, please send me the full dmesg log.

Does the tx hang occur with the in-kernel e1000e driver too?

Thanks.

-Tushar


>If the OS is a 32-bit system, what would happen?


>
>Thanks in advance,
>Joe
>
>On 07/09/12 16:51, Joe Jin wrote:
>> Hi list,
>>
>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
>> doing an scp test. This issue is easy to reproduce on SUN FIRE X2270 M2:
>> just copying a big file (>500M) from another server hits it at once.
>>
>> Would you please help on this?
>>
>> device info:
>> # lspci -s 05:00.0
>> 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>> Ethernet Controller (Copper) (rev 06)
>>
>> # lspci -s 05:00.0 -n
>> 05:00.0 0200: 8086:10bc (rev 06)
>>
>> # ethtool -i eth0
>> driver: e1000e
>> version: 2.0.0-NAPI
>> firmware-version: 5.10-2
>> bus-info: 0000:05:00.0
>>
>> # ethtool -k eth0
>> Offload parameters for eth0:
>> rx-checksumming: on
>> tx-checksumming: on
>> scatter-gather: on
>> tcp segmentation offload: on
>> udp fragmentation offload: off
>> generic segmentation offload: on
>> generic-receive-offload: on
>>
>> kernel log:
>> -----------
>> e1000e 0000:05:00.0: eth0: Detected Hardware Unit Hang:
>>   TDH                  <6c>
>>   TDT                  <81>
>>   next_to_use          <81>
>>   next_to_clean        <6b>
>> buffer_info[next_to_clean]:
>>   time_stamp           <fffc7a23>
>>   next_to_watch        <71>
>>   jiffies              <fffc8c0c>
>>   next_to_watch.status <0>
>> MAC Status             <80387>
>> PHY Status             <792d>
>> PHY 1000BASE-T Status  <3c00>
>> PHY Extended Status    <3000>
>> PCI Status             <10>
>> e1000e 0000:05:00.0: eth0: Detected Hardware Unit Hang:
>>   TDH                  <6c>
>>   TDT                  <81>
>>   next_to_use          <81>
>>   next_to_clean        <6b>
>> buffer_info[next_to_clean]:
>>   time_stamp           <fffc7a23>
>>   next_to_watch        <71>
>>   jiffies              <fffc9bac>
>>   next_to_watch.status <0>
>> MAC Status             <80387>
>> PHY Status             <792d>
>> PHY 1000BASE-T Status  <3c00>
>> PHY Extended Status    <3000>
>> PCI Status             <10>
>> ------------[ cut here ]------------
>> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x225/0x230()
>> Hardware name: SUN FIRE X2270 M2
>> NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
>> Modules linked in: autofs4 hidp rfcomm
>> bluetooth rfkill lockd sunrpc cpufreq_ondemand acpi_cpufreq mperf
>> be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
>> ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3
>> mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs sbshc
>> acpi_pad acpi_ipmi ipmi_msghandler parport_pc lp parport e1000e(U)
>> snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device
>> igb snd_pcm_oss serio_raw snd_mixer_oss snd_pcm tpm_infineon snd_timer
>> snd soundcore snd_page_alloc i2c_i801 iTCO_wdt i2c_core pcspkr
>> i7core_edac iTCO_vendor_support ioatdma ghes dca edac_core hed
>> dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage
>> sd_mod crc_t10dif sg ahci libahci ext3 jbd mbcache [last unloaded:
>> microcode]
>> Pid: 0, comm: swapper Not tainted 2.6.39-200.24.1.el5uek #1
>> Call Trace:
>>  [<c07d9ac5>] ? dev_watchdog+0x225/0x230
>>  [<c045ba61>] warn_slowpath_common+0x81/0xa0
>>  [<c07d9ac5>] ? dev_watchdog+0x225/0x230
>>  [<c045bb23>] warn_slowpath_fmt+0x33/0x40
>>  [<c07d9ac5>] dev_watchdog+0x225/0x230
>>  [<c07d98a0>] ? dev_activate+0xb0/0xb0
>>  [<c0468e82>] call_timer_fn+0x32/0xf0
>>  [<c04bceb0>] ? rcu_check_callbacks+0x80/0x80
>>  [<c046a76d>] run_timer_softirq+0xed/0x1b0
>>  [<c07d98a0>] ? dev_activate+0xb0/0xb0
>>  [<c0461a81>] __do_softirq+0x91/0x1a0
>>  [<c04619f0>] ? local_bh_enable+0x80/0x80
>>  <IRQ>
>>  [<c0462295>] ? irq_exit+0x95/0xa0
>>  [<c087f8b8>] ? smp_apic_timer_interrupt+0x38/0x42
>>  [<c08784f5>] ? apic_timer_interrupt+0x31/0x38
>>  [<c046007b>] ? do_exit+0x11b/0x370
>>  [<c065eae4>] ? intel_idle+0xa4/0x100
>>  [<c078d9b9>] ? cpuidle_idle_call+0xb9/0x1e0
>>  [<c0411d77>] ? cpu_idle+0x97/0xd0
>>  [<c085cbbd>] ? rest_init+0x5d/0x70
>>  [<c0b07a7a>] ? start_kernel+0x28a/0x340
>>  [<c0b074b0>] ? obsolete_checksetup+0xb0/0xb0
>>  [<c0b070a4>] ? i386_start_kernel+0x64/0xb0
>> ---[ end trace 5502b55cd4d4e5cb ]---
>> e1000e 0000:05:00.0: eth0: Reset adapter
>> e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>>
>> Thanks,
>> Joe
>>
>
>
>--
>Oracle <http://www.oracle.com>
>Joe Jin | Software Development Senior Manager | +8610.6106.5624 ORACLE |
>Linux and Virtualization No. 24 Zhongguancun Software Park, Haidian
>District | 100193 Beijing
>
>

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: [PATCH 2/2] bonding: debugfs and network namespaces are incompatible
From: Jay Vosburgh @ 2012-07-10 19:13 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <367b470c-c3f5-4555-be11-02223125b741@email.android.com>


	[ adding netdev back to cc: ]

Eric W. Biederman <ebiederm@xmission.com> wrote:

>Jay Vosburgh <fubar@us.ibm.com> wrote:
>
>>David Miller <davem@davemloft.net> wrote:
>>
>>>From: ebiederm@xmission.com (Eric W. Biederman)
>>>Date: Mon, 09 Jul 2012 13:52:43 -0700
>>>
>>>> 
>>>> The bonding debugfs support has been broken in the presence of network
>>>> namespaces since it has been added.  The debugfs support does not handle
>>>> multiple bonding devices with the same name in different network
>>>> namespaces.
>>>> 
>>>> I haven't had any bug reports, and I'm not interested in getting any.
>>>> Disable the debugfs support when network namespaces are enabled.
>>>> 
>>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>
>>>Applied.
>>
>>	Since distro kernels appear to set CONFIG_NET_NS, doesn't this
>>effectively disable debugfs for bonding on most distros?
>
>Yes.
>
>>	Do the other network device drivers that support debugfs have a
>>similar problem?  E.g., if each of two namespaces have an skge device
>>with the same name, will there be a debugfs conflict there as well?
>
>I haven't run across any of those network devices, but if they create a
>debugfs entry that embeds the device name it will be a problem.

	A quick grep suggests that cxgb4, skge, sky2, stmmac, ipoib and
half a dozen of the wireless drivers all create files in debugfs.  I did
not check exhaustively, but at least some of them include the device
name.

>Last I looked any custom user space interface from network devices was
>rare and bonding using debugfs is the first instance of using debugfs
>from networking devices I have seen.
>
>I think the problem will be a little less severe for physical network
>devices as they all start in the initial network namespace and so start
>with distinct names.
>
>With bonding I can do "ip link add type bond" in any network namespace
>and get another bond0.  So name conflicts are very much expected with all
>virtual networking devices.

	Fair enough, although it is trivial to rename any network device
such that a conflict would occur.

	It looks like some of the drivers use fixed names for some
things as well.

>But if you know of any other networking devices using debugfs that
>code should probably get the same treatment as the bonding debugfs code.

	Is there no alternative to simply disabling debugfs whenever
network namespaces are enabled?  The information bonding displays via
debugfs is useful, and having it unavailable on all distro kernels seems
a bit harsh.

	Why is the logic already in the driver not sufficient?  If the
attempt to create the debugfs directory with the interface name fails,
then it merely prints a warning and continues without the debugfs for
that interface.
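
	The warn-and-continue behaviour described above can be sketched
in userspace.  This is a toy model, not the bonding driver's code: it
assumes debugfs behaves like one flat directory shared by every network
namespace, and FlatDebugfs, register_bond and the ns_a/ns_b namespace
labels are hypothetical names invented for the illustration.

```python
class FlatDebugfs:
    """Toy stand-in for a debugfs directory shared by all namespaces."""

    def __init__(self):
        # directory name -> (namespace, device) that created it
        self.entries = {}

    def create_dir(self, name, owner):
        # Creation fails if the name is already taken, whoever took it.
        if name in self.entries:
            return False
        self.entries[name] = owner
        return True


def register_bond(debugfs, netns, ifname, warnings):
    """Hypothetical registration path: warn and carry on if the name is taken."""
    ok = debugfs.create_dir(ifname, (netns, ifname))
    if not ok:
        warnings.append(f"{netns}/{ifname}: debugfs directory already exists")
    return ok


debugfs = FlatDebugfs()
warnings = []

# Two namespaces each create their own bond0.
first = register_bond(debugfs, "ns_a", "bond0", warnings)
second = register_bond(debugfs, "ns_b", "bond0", warnings)
```

	In this model the second bond0 still comes up, only without a
debugfs entry, and the one surviving "bond0" directory belongs to the
device in ns_a.  The sketch does not model renames, which could produce
the same collision after registration.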

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply
