Netdev List
* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Eric Dumazet @ 2012-11-28  3:13 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354051401.14302.40.camel@edumazet-glaptop>

On Tue, 2012-11-27 at 13:23 -0800, Eric Dumazet wrote:
> On Tue, 2012-11-27 at 19:05 +0000, Ben Hutchings wrote:
> > On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:
> 
> > >  struct sock_common {
> > > -	/* skc_daddr and skc_rcv_saddr must be grouped :
> > > -	 * cf INET_MATCH() and INET_TW_MATCH()
> > > +	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
> > > +	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
> > 
> > __aligned(8)?
> 
> Nope, only on 64 bit this requirement exists (since a long time)
> 
> I am not sure we want complexity on this.
> 
> And we don't want holes to be automatically added here either.

Hmm, maybe the following could be the right way, as we did
for skc_hash/skc_u16hashes


 struct sock_common {
-       /* skc_daddr and skc_rcv_saddr must be grouped :
-        * cf INET_MATCH() and INET_TW_MATCH()
+       /* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
+        * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
         */
-       __be32                  skc_daddr;
-       __be32                  skc_rcv_saddr;
-
+       union {
+               unsigned long   skc_laddr;
+               struct {
+                       __be32  skc_daddr;
+                       __be32  skc_rcv_saddr;
+               };
+       };
        union  {
                unsigned int    skc_hash;
                __u16           skc_u16hashes[2];
        };
+       /* skc_dport && skc_num must be grouped as well */
+       union {
+               unsigned int    skc_ports;
+               struct {
+                       __be16  skc_dport;
+                       __u16   skc_num;
+               };
+       };
+
        unsigned short          skc_family;
        volatile unsigned char  skc_state;
        unsigned char           skc_reuse;
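
As a rough user-space illustration of the layout trick in the hunk above (not kernel code): overlaying the two 32-bit addresses with an `unsigned long` gives the pair the alignment of `unsigned long`, which is 8 bytes on typical 64-bit Linux ABIs and 4 bytes on 32-bit ones, so no stricter alignment is imposed where it is not needed. The names `demo_addr_pair`, `demo_sock_common` and `demo_addr_match` are hypothetical, and plain `uint32_t` stands in for `__be32`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the address pair in struct sock_common. */
union demo_addr_pair {
	unsigned long laddr;		/* gives the pair unsigned long alignment */
	struct {
		uint32_t daddr;
		uint32_t rcv_saddr;
	};
};

struct demo_sock_common {
	union demo_addr_pair addrs;
	unsigned int hash;
};

/* Compare both addresses; a single compare when unsigned long is 8 bytes. */
static bool demo_addr_match(const struct demo_sock_common *sk,
			    uint32_t daddr, uint32_t rcv_saddr)
{
	if (sizeof(unsigned long) == 8) {
		union demo_addr_pair key = { .laddr = 0 };

		key.daddr = daddr;
		key.rcv_saddr = rcv_saddr;
		return sk->addrs.laddr == key.laddr;
	}
	/* 32-bit fallback: laddr only covers daddr, so test both fields. */
	return sk->addrs.daddr == daddr && sk->addrs.rcv_saddr == rcv_saddr;
}
```

The key is built through the same union, so the comparison does not depend on byte order.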


* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Joe Perches @ 2012-11-28  3:31 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Eric Dumazet, David Miller, netdev, Ling Ma
In-Reply-To: <1354072351.2701.41.camel@bwh-desktop.uk.solarflarecom.com>

On Wed, 2012-11-28 at 03:12 +0000, Ben Hutchings wrote:
> On Tue, 2012-11-27 at 18:23 -0800, Joe Perches wrote:
> > OK, so it's an and not an or.  Duh.
> [...]
> 
> The way to combine these sorts of comparisons is along the lines of:
> 
> (((left->a ^ right->a) |
>   (left->b ^ right->b) |
>   ...) == 0)
> 
> But when there are big-endian types involved, sparse is likely to
> complain about combining them.

I believe there's only the 2 items that could be combined
for cacheline purposes so using 2 logical tests with AND
seems more readable.  Maybe a single combined test would
be faster.  I don't have equipment at hand to test it.

If you prefer, I suppose it could be converted.
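
For reference, the two forms being compared can be sketched in plain C as follows. This is a minimal illustration with hypothetical names (`demo_key`, `demo_match_and`, `demo_match_xor`) and ordinary integer fields; with the kernel's big-endian annotated types, sparse may complain about combining them, as noted in the quoted text.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_key {
	uint32_t a;
	uint32_t b;
};

/* Two tests joined by &&: may compile to two compare-and-branch pairs. */
static bool demo_match_and(const struct demo_key *l, const struct demo_key *r)
{
	return l->a == r->a && l->b == r->b;
}

/* XOR is zero only for equal fields; OR-ing the results leaves one test. */
static bool demo_match_xor(const struct demo_key *l, const struct demo_key *r)
{
	return ((l->a ^ r->a) | (l->b ^ r->b)) == 0;
}
```

Both functions return the same result for every input; whether the second is measurably faster is left open in the thread.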


* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Ben Hutchings @ 2012-11-28  3:55 UTC (permalink / raw)
  To: Joe Perches; +Cc: Eric Dumazet, David Miller, netdev, Ling Ma
In-Reply-To: <1354073514.8918.22.camel@joe-AO722>

On Tue, 2012-11-27 at 19:31 -0800, Joe Perches wrote:
> On Wed, 2012-11-28 at 03:12 +0000, Ben Hutchings wrote:
> > On Tue, 2012-11-27 at 18:23 -0800, Joe Perches wrote:
> > > OK, so it's an and not an or.  Duh.
> > [...]
> > 
> > The way to combine these sorts of comparisons is along the lines of:
> > 
> > (((left->a ^ right->a) |
> >   (left->b ^ right->b) |
> >   ...) == 0)
> > 
> > But when there are big-endian types involved, sparse is likely to
> > complain about combining them.
> 
> I believe there's only the 2 items that could be combined
> for cacheline purposes so using 2 logical tests with AND
> seems more readable.  Maybe a single combined test would
> be faster.  I don't have equipment at hand to test it.
> 
> If you prefer, I suppose it could be converted.

I don't particularly care, and I gave up this trick myself because it
didn't seem worth fighting sparse.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH v6 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
From: Ming Lei @ 2012-11-28  3:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <5434404.G1ERYjuorE@vostro.rjw.lan>

On Wed, Nov 28, 2012 at 5:19 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Saturday, November 24, 2012 08:59:14 PM Ming Lei wrote:
>> The patch introduces the flag of memalloc_noio in 'struct dev_pm_info'
>> to help PM core to teach mm not allocating memory with GFP_KERNEL
>> flag for avoiding probable deadlock.
>>
>> As explained in the comment, any GFP_KERNEL allocation inside
>> runtime_resume() or runtime_suspend() on any one of device in
>> the path from one block or network device to the root device
>> in the device tree may cause deadlock, the introduced
>> pm_runtime_set_memalloc_noio() sets or clears the flag on
>> device in the path recursively.
>>
>> Cc: Alan Stern <stern@rowland.harvard.edu>
>> Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>> v5:
>>       - fix code style error
>>       - add comment on clear the device memalloc_noio flag
>> v4:
>>       - rename memalloc_noio_resume as memalloc_noio
>>       - remove pm_runtime_get_memalloc_noio()
>>       - add comments on pm_runtime_set_memalloc_noio
>> v3:
>>       - introduce pm_runtime_get_memalloc_noio()
>>       - hold one global lock on pm_runtime_set_memalloc_noio
>>       - hold device power lock when accessing memalloc_noio_resume
>>         flag suggested by Alan Stern
>>       - implement pm_runtime_set_memalloc_noio without recursion
>>         suggested by Alan Stern
>> v2:
>>       - introduce pm_runtime_set_memalloc_noio()
>> ---
>>  drivers/base/power/runtime.c |   60 ++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/pm.h           |    1 +
>>  include/linux/pm_runtime.h   |    3 +++
>>  3 files changed, 64 insertions(+)
>>
>> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
>> index 3148b10..3e198a0 100644
>> --- a/drivers/base/power/runtime.c
>> +++ b/drivers/base/power/runtime.c
>> @@ -124,6 +124,66 @@ unsigned long pm_runtime_autosuspend_expiration(struct device *dev)
>>  }
>>  EXPORT_SYMBOL_GPL(pm_runtime_autosuspend_expiration);
>>
>> +static int dev_memalloc_noio(struct device *dev, void *data)
>> +{
>> +     return dev->power.memalloc_noio;
>> +}
>> +
>> +/*
>> + * pm_runtime_set_memalloc_noio - Set a device's memalloc_noio flag.
>> + * @dev: Device to handle.
>> + * @enable: True for setting the flag and False for clearing the flag.
>> + *
>> + * Set the flag for all devices in the path from the device to the
>> + * root device in the device tree if @enable is true, otherwise clear
>> + * the flag for devices in the path whose siblings don't set the flag.
>> + *
>
> Please use counters instead of walking the whole path every time.  Ie. in
> addition to the flag add a counter to store the number of the device's
> children having that flag set.

Thanks for your review.

IMO, pm_runtime_set_memalloc_noio() is only called in probe() and
release() of block and network devices, which is a very infrequent
path, so I wonder whether it is worth introducing another counter
for all devices.

Also, the current implementation of pm_runtime_set_memalloc_noio()
looks simple and clean enough with the flag, IMO.

> I would use the flag only to store the information that
> pm_runtime_set_memalloc_noio(dev, true) has been run for this device directly
> and I'd use a counter for everything else.
>
> That is, have power.memalloc_count that would be incremented when (1)
> pm_runtime_set_memalloc_noio(dev, true) is called for that device and (2) when
> power.memalloc_count for one of its children changes from 0 to 1 (and
> analogously for decrementation).  Then, check the counter in rpm_callback().

Sorry, could you explain in a bit more detail why we need the counter? It
looks like checking only the flag in rpm_callback() is enough, doesn't it?
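
One possible reading of the counter scheme described in the quoted review, as a user-space sketch rather than kernel code. The flag records a direct request on the device, and the counter is propagated to the parent only on a 0 to 1 or 1 to 0 transition, so the walk stops early once an ancestor is already accounted for. All names here (`demo_dev`, `noio_flag`, `noio_count`, `demo_set_noio`) are hypothetical, and locking is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_dev {
	struct demo_dev *parent;
	bool noio_flag;			/* set directly on this device */
	unsigned int noio_count;	/* own flag + children with count > 0 */
};

/* A device needs NOIO allocations while its counter is nonzero. */
static bool demo_noio_active(const struct demo_dev *dev)
{
	return dev->noio_count > 0;
}

static void demo_set_noio(struct demo_dev *dev, bool enable)
{
	if (dev->noio_flag == enable)
		return;
	dev->noio_flag = enable;

	for (; dev != NULL; dev = dev->parent) {
		if (enable) {
			/* Propagate upwards only on a 0 -> 1 transition. */
			if (dev->noio_count++ != 0)
				break;
		} else {
			/* Propagate upwards only on a 1 -> 0 transition. */
			if (--dev->noio_count != 0)
				break;
		}
	}
}
```

In the worst case the loop still visits every ancestor, but clearing no longer has to inspect siblings.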

>
> Besides, don't you need to check children for the arg device itself?

It isn't needed, since the children of a network/block device can't be
involved in the deadlock in the runtime PM path.

Also, the function is only called by the network device or block device
subsystem; both kinds of device are class devices and should
have no children.

>
>> + * The function should only be called by block device, or network
>> + * device driver for solving the deadlock problem during runtime
>> + * resume/suspend:
>> + *
>> + *     If memory allocation with GFP_KERNEL is called inside runtime
>> + *     resume/suspend callback of any one of its ancestors(or the
>> + *     block device itself), the deadlock may be triggered inside the
>> + *     memory allocation since it might not complete until the block
>> + *     device becomes active and the involved page I/O finishes. The
>> + *     situation is pointed out first by Alan Stern. Network device
>> + *     are involved in iSCSI kind of situation.
>> + *
>> + * The lock of dev_hotplug_mutex is held in the function for handling
>> + * hotplug race because pm_runtime_set_memalloc_noio() may be called
>> + * in async probe().
>> + *
>> + * The function should be called between device_add() and device_del()
>> + * on the affected device(block/network device).
>> + */
>> +void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
>> +{
>> +     static DEFINE_MUTEX(dev_hotplug_mutex);
>
> What's the mutex for?

It is for avoiding a hotplug race. For example, without the mutex,
another child may set the flag between the time device_for_each_child()
runs and the next loop iteration in pm_runtime_set_memalloc_noio(false).

Thanks,
--
Ming Lei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Eric Dumazet @ 2012-11-28  4:11 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354069414.8918.13.camel@joe-AO722>

On Tue, 2012-11-27 at 18:23 -0800, Joe Perches wrote:

> Still, the logical tests that are likely to be in the same
> cacheline could be ANDed together to avoid a test and jump.

The point of having the cond jump on sk_hash/hash was that in one
compare, we catch the yes/no status with 99.999999 % success rate.

All the following compares are predicted by the cpu and essentially are
free. Adding the AND or OR will basically have the same cpu cost.

If we wanted to do a full test of all tuple fields and a single
conditional jump, we would not have to include hash test at all.

(If the 4-tuple matches, then sk_hash/hash value _must_ be the same by
definition)

Note it's quite different from the optimization we did in
ipv6_addr_equal(), as it allowed fewer memory loads and instructions.

I would say this can come later, as the meat of my patch was about
avoiding a full cache line miss, which is far more expensive than any
tricks we can even think about.

Note it will be hard to actually measure any further gains, since I did
TCP_RR tests (200 threads) and the lookup cost went from 1.4 % to 0.8 %
of the grand total, mostly dominated by the atomic to increase socket
refcount.
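
The hash-first ordering described above can be sketched in user-space C as follows. This is an illustration with hypothetical names (`demo_tuple`, `demo_sock`, `demo_lookup_match`), not the kernel's INET_MATCH(): the hash compare rejects nearly every non-matching socket in one branch, and the remaining field compares only run for the rare candidates that pass it.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_tuple {
	uint32_t daddr;
	uint32_t rcv_saddr;
	uint16_t dport;
	uint16_t num;
};

struct demo_sock {
	unsigned int hash;	/* precomputed from the 4-tuple */
	struct demo_tuple t;
};

static bool demo_lookup_match(const struct demo_sock *sk, unsigned int hash,
			      const struct demo_tuple *key)
{
	/* First test: almost always decides the outcome on its own. */
	if (sk->hash != hash)
		return false;

	/* Remaining tests: well predicted, needed only to rule out collisions. */
	return sk->t.daddr == key->daddr &&
	       sk->t.rcv_saddr == key->rcv_saddr &&
	       sk->t.dport == key->dport &&
	       sk->t.num == key->num;
}
```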


* [RFC PATCH] 8139cp: properly support change of MTU values
From: John Greene @ 2012-11-27 20:08 UTC (permalink / raw)
  To: netdev

The 8139cp driver has a change_mtu function that has not been
enabled since the dawn of the git repository. However, the
generic eth_change_mtu is not used in its place, so that
invalid MTU values can be set on the interface.

Original patch salvages the broken code for the single case of
setting the MTU while the interface is down, which is safe
and also includes the range check.  Now enhanced to support up
or down interface.

Original patch from
http://lkml.indiana.edu/hypermail/linux/kernel/1202.2/00770.html

Testing: has been tested on a virtual 8139cp setup without issue;
awaiting real hardware to retest, but wanted to post now.

Signed-Off-By: "John Greene" <jogreene@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 drivers/net/ethernet/realtek/8139cp.c | 22 +++-------------------
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index 6cb96b4..7847c83 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -1226,12 +1226,9 @@ static void cp_tx_timeout(struct net_device *dev)
 	spin_unlock_irqrestore(&cp->lock, flags);
 }
 
-#ifdef BROKEN
 static int cp_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct cp_private *cp = netdev_priv(dev);
-	int rc;
-	unsigned long flags;
 
 	/* check for invalid MTU, according to hardware limits */
 	if (new_mtu < CP_MIN_MTU || new_mtu > CP_MAX_MTU)
@@ -1244,22 +1241,11 @@ static int cp_change_mtu(struct net_device *dev, int new_mtu)
 		return 0;
 	}
 
-	spin_lock_irqsave(&cp->lock, flags);
-
-	cp_stop_hw(cp);			/* stop h/w and free rings */
-	cp_clean_rings(cp);
-
+	/* network IS up, close it, reset MTU, and come up again. */
+	cp_close(dev);
 	dev->mtu = new_mtu;
-	cp_set_rxbufsize(cp);		/* set new rx buf size */
-
-	rc = cp_init_rings(cp);		/* realloc and restart h/w */
-	cp_start_hw(cp);
-
-	spin_unlock_irqrestore(&cp->lock, flags);
-
-	return rc;
+	return cp_open(dev);
 }
-#endif /* BROKEN */
 
 static const char mii_2_8139_map[8] = {
 	BasicModeCtrl,
@@ -1835,9 +1821,7 @@ static const struct net_device_ops cp_netdev_ops = {
 	.ndo_start_xmit		= cp_start_xmit,
 	.ndo_tx_timeout		= cp_tx_timeout,
 	.ndo_set_features	= cp_set_features,
-#ifdef BROKEN
 	.ndo_change_mtu		= cp_change_mtu,
-#endif
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= cp_poll_controller,
-- 
1.7.11.7
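
The control flow of the patched cp_change_mtu() can be modelled in user space roughly as below. This is a sketch under stated assumptions: `demo_netdev`, `demo_open()` and `demo_close()` are hypothetical stand-ins for the driver's structures and its cp_open()/cp_close(), and the MTU limits are illustrative values rather than a statement about the hardware.

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_MIN_MTU	60	/* illustrative lower bound (assumption) */
#define DEMO_MAX_MTU	4096	/* illustrative upper bound (assumption) */

struct demo_netdev {
	int mtu;
	bool running;
	int open_calls;		/* counts how often the device was reopened */
};

static void demo_close(struct demo_netdev *dev)
{
	dev->running = false;
}

static int demo_open(struct demo_netdev *dev)
{
	dev->running = true;
	dev->open_calls++;
	return 0;
}

static int demo_change_mtu(struct demo_netdev *dev, int new_mtu)
{
	/* Reject values outside the supported range. */
	if (new_mtu < DEMO_MIN_MTU || new_mtu > DEMO_MAX_MTU)
		return -EINVAL;

	/* Interface down: just record the new value. */
	if (!dev->running) {
		dev->mtu = new_mtu;
		return 0;
	}

	/* Interface up: close it, set the MTU, and bring it up again. */
	demo_close(dev);
	dev->mtu = new_mtu;
	return demo_open(dev);
}
```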


* Re: [PATCH v6 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
From: Ming Lei @ 2012-11-28  4:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <5434404.G1ERYjuorE@vostro.rjw.lan>

On Wed, Nov 28, 2012 at 5:19 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>
> Please use counters instead of walking the whole path every time.  Ie. in
> addition to the flag add a counter to store the number of the device's
> children having that flag set.

Even if a counter is added, walking the whole path can't be avoided either;
it may be an explicit walk or recursion, because pm_runtime_set_memalloc_noio()
is required to set or clear the flag (or increase/decrease the counter) of
devices along the whole path.

Thanks,
--
Ming Lei



* Re: [RFC PATCH 2/2] bridge: export multicast database via netlink
From: Cong Wang @ 2012-11-28  4:38 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, bridge, Herbert Xu, Jesper Dangaard Brouer,
	Stephen Hemminger, David S. Miller
In-Reply-To: <20121127115905.GA16701@casper.infradead.org>

On Tue, 2012-11-27 at 11:59 +0000, Thomas Graf wrote:
> On 11/27/12 at 05:49pm, Cong Wang wrote:
> > +static int br_rports_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
> > +			       u32 seq, struct net_device *dev)
> > +{
> > +	struct net_bridge *br = netdev_priv(dev);
> > +	struct net_bridge_port *p;
> > +	struct hlist_node *n;
> > +	struct nlattr *nest, *nest2;
> > +
> > +	if (!br->multicast_router || hlist_empty(&br->router_list)) {
> > +		printk(KERN_INFO "no router on bridge\n");
> > +		return 0;
> > +	}
> > +
> > +	nest = nla_nest_start(skb, MDBA_ROUTER);
> > +	if (nest == NULL)
> > +		return -EMSGSIZE;
> > +	nest2 = nla_nest_start(skb, MDBA_MDB_BRPORT);
> > +	if (nest2 == NULL)
> > +		goto fail;
> > +
> > +	hlist_for_each_entry_rcu(p, n, &br->router_list, rlist) {
> > +		if (p && nla_put_u16(skb, MDBA_BRPORT_NO, p->port_no)) {
> > +			nla_nest_cancel(skb, nest2);
> > +			goto fail;
> > +		}
> > +	}
> > +
> > +	nla_nest_end(skb, nest2);
> > +	nla_nest_end(skb, nest);
> 
> I would simplify the MDBA_ROUTER attribute to a u16[len(br->router_list)]. If
> we ever need something more complex we can retire the MDBA_ROUTER
> attribute and replace it with something newer.

Makes sense, will do.

I wanted to reserve some room for adding new attributes to MDBA_ROUTER, but
so far it is not necessary.

> 
> > +	nest = nla_nest_start(skb, MDBA_MDB);
> > +	if (nest == NULL)
> > +		return -EMSGSIZE;
> > +
> > +	for (i = 0; i < mdb->max; i++) {
> > +		struct hlist_node *h;
> > +		struct net_bridge_mdb_entry *mp;
> > +		struct net_bridge_port_group *p, **pp;
> > +		struct net_bridge_port *port;
> > +
> > +		hlist_for_each_entry_rcu(mp, h, &mdb->mhash[i], hlist[mdb->ver]) {
> > +			if (nla_put_be32(skb, MDBA_MDB_MCADDR, mp->addr.u.ip4))
> > +				goto fail;
> > +
> > +			nest2 = nla_nest_start(skb, MDBA_MDB_BRPORT);
> > +			if (nest2 == NULL)
> > +				goto fail;
> 
> What if you can't fit all the hash entries into a single netlink
> message? You need to allow splitting the hash across multiple
> messages. Therefore I suggest that you add a container attribute
> for each mdb_entry like this:
> 
> MDBA_MDB = {
>   1 = {
>     MDBA_MDB_MCADDR = { ... },
>     MDBA_MDB_BRPORT = { ... },
>   },
>   2 = {
>     MDBA_MDB_MCADDR = { ... },
>     MDBA_MDB_BRPORT = { ... },
>   },
>   [...]
> }

I thought user space would reassemble multi-part messages, but I
probably misunderstood this...

Actually I was trying to reduce the size of the netlink message. :)

> 
> > +static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
> > +{
> > +	struct net_device *dev;
> > +	struct net *net = sock_net(skb->sk);
> > +	struct nlmsghdr *nlh;
> > +	u32 seq = cb->nlh->nlmsg_seq;
> > +	int idx = 0, s_idx;
> > +
> > +	s_idx = cb->args[0];
> > +
> > +	rcu_read_lock();
> > +	cb->seq = net->dev_base_seq;
> 
> Using RCU read lock is OK but that means you must be prepared to
> handle additions/removals to the table between dump iterations
> and thus you must introduce a seq counter bumped on each table
> change and add it to the dev_base_seq above.

Yeah, as you told me before. I will make another patch for this. Thanks
for reminding!

> 
> > +	for_each_netdev_rcu(net, dev) {
> > +		if (dev->priv_flags & IFF_EBRIDGE) {
> > +			struct br_port_msg *bpm;
> > +
> > +			if (idx < s_idx)
> > +				goto cont;
> > +
> > +			nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
> > +					seq, RTM_GETMDB,
> > +					sizeof(*bpm), NLM_F_MULTI);
> > +			if (nlh == NULL)
> > +				break;
> > +
> > +			bpm = nlmsg_data(nlh);
> > +			bpm->ifindex = dev->ifindex;
> > +			if (br_mdb_fill_info(skb, cb, seq, dev) < 0) {
> > +				printk(KERN_INFO "br_mdb_fill_info failed\n");
> > +				goto fail;
> 
> As stated above I believe that you should allow for hashtable to be
> split across multiple messages so you need to store the hash table
> offset as well and properly finalize and send the message on error
> here.

You mean saving the offset into cb->args[1]?

Thanks!
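
The idea of resuming a dump from a saved offset can be sketched in user space as follows. This is only a model of the pattern: `demo_cb` mimics the `args[]` scratch array of `struct netlink_callback`, using `args[1]` as the resume index is the assumption raised in the question above, and a plain `int` array stands in for both the hash table and the message buffer.

```c
#include <stddef.h>

struct demo_cb {
	long args[2];	/* args[1]: index of the next entry to dump */
};

/*
 * Copy entries into one "message" until it is full, then remember where
 * to resume. Returns the number of entries written; 0 means the dump is done.
 */
static size_t demo_dump(const int *table, size_t n, int *msg, size_t room,
			struct demo_cb *cb)
{
	size_t idx = (size_t)cb->args[1];
	size_t out = 0;

	while (idx < n && out < room)
		msg[out++] = table[idx++];

	cb->args[1] = (long)idx;
	return out;
}
```

Each call fills at most one message, so a table larger than the buffer is delivered over several calls instead of failing.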


* Re: [RFC PATCH 2/2] bridge: export multicast database via netlink
From: Cong Wang @ 2012-11-28  5:19 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, bridge, Herbert Xu, Jesper Dangaard Brouer,
	Stephen Hemminger, David S. Miller
In-Reply-To: <20121127115905.GA16701@casper.infradead.org>

On Tue, 2012-11-27 at 11:59 +0000, Thomas Graf wrote:
> 
> Using RCU read lock is OK but that means you must be prepared to
> handle additions/removals to the table between dump iterations
> and thus you must introduce a seq counter bumped on each table
> change and add it to the dev_base_seq above.

Something like this?

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index dc73091..4c3b097 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -74,6 +74,8 @@ static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
 		return 0;
 	}
 
+	cb->seq = mdb->seq;
+
 	nest = nla_nest_start(skb, MDBA_MDB);
 	if (nest == NULL)
 		return -EMSGSIZE;
@@ -126,7 +128,6 @@ static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	s_idx = cb->args[0];
 
 	rcu_read_lock();
-	cb->seq = net->dev_base_seq;
 
 	for_each_netdev_rcu(net, dev) {
 		if (dev->priv_flags & IFF_EBRIDGE) {
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 1e6ce50..ccf5cfb 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -322,6 +322,7 @@ static int br_mdb_rehash(struct net_bridge_mdb_htable __rcu **mdbp, int max,
 
 	mdb->size = old ? old->size : 0;
 	mdb->ver = old ? old->ver ^ 1 : 0;
+	mdb->seq = old ? (old->seq + 1): 0;
 
 	if (!old || elasticity)
 		get_random_bytes(&mdb->secret, sizeof(mdb->secret));
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index a02921e..2f5f5b8 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -105,6 +105,7 @@ struct net_bridge_mdb_htable
 	u32				max;
 	u32				secret;
 	u32				ver;
+	u32				seq;
 };
 
 struct net_bridge_port
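
The consistency check that a sequence counter makes possible can be modelled in user space as below. This is a sketch with hypothetical names (`demo_table`, `demo_dump_state`, `demo_dump_pass`). Note that the diff above only advances `seq` when br_mdb_rehash() builds a new table, whereas this sketch bumps it on every modification, which is what the quoted review comment asks for.

```c
#include <stdbool.h>

struct demo_table {
	unsigned int seq;	/* bumped on every change to the table */
	int nentries;
};

struct demo_dump_state {
	unsigned int seq_at_start;
	bool started;
	bool interrupted;	/* table changed between dump passes */
};

static void demo_table_add(struct demo_table *t)
{
	t->nentries++;
	t->seq++;
}

/* Called once per dump pass (one pass per message). */
static void demo_dump_pass(struct demo_dump_state *d, const struct demo_table *t)
{
	if (!d->started) {
		d->seq_at_start = t->seq;
		d->started = true;
	} else if (d->seq_at_start != t->seq) {
		d->interrupted = true;
	}
}
```

A reader that sees `interrupted` set knows the dump may be inconsistent and can restart it.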


* Re: [net-next RFC v2] net_cls: traffic counter based on classification control cgroup
From: Alexey Perevalov @ 2012-11-28  5:21 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Glauber Costa, netdev-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <50B4B9E2.4030200-kQCPcA+X3s7YtjvyW6yDsg@public.gmane.org>

On 11/27/2012 05:02 PM, Daniel Wagner wrote:
> Hi Alexey,
>
> On 27.11.2012 12:03, Glauber Costa wrote:
>> On 11/27/2012 02:56 PM, Alexey Perevalov wrote:
>>> Hello.
>>>
>>> This is the second version of the patch I already sent to netdev.
>>>
>>> The main goal of this patch is counting traffic for processes placed in
>>> the net_cls cgroup (ingress and egress).
>>> It's based on res_counters and holds a counter per network interface.
>>>
>>> Description of patch.
>>> It handles packets in net/core/dev.c for egress and in
>>> /net/ipv4/tcp.c|udp.c for ingress.
>>> These places were chosen because we need to know also network interface.
>>>
>>> Cgroup fs interface provides following files additional to existing
>>> net_cls files:
>>> net_cls.ifacename.usage_in_bytes
>>> Containing rcv/snd lines.
>>> Also this patch adds to net_cls ability to handle a network device
>>> registration.
>>>
>>> It could be included or excluded in compile time.
>>> I moved the menu entry for "Control group classifier" from network/QoS to
>>> General Option/Control Group.
>>>
>>> I'm waiting for your comments.
>>>
>> Daniel Wagner is working on something a lot similar.
> Yes, basically what I try to do is explained by this excellent article
>
> https://lwn.net/Articles/523058/
I read the article and agree with aspects of it.
But the problem of selecting a preferred network for an application can be
solved using the netprio cgroup.


> The short version: Per application routing and statistics.
>
> I have two PoC implementations doing this. Both implementations have the same key
> idea which is to set SO_MARK per application. The routing and statistics would
> then be done by a bunch of iptables rules.
>
> The first implementation extends net_cls to set SO_MARK:
>
> void sock_update_classid(struct sock *sk, struct task_struct *task)
>   {
>          u32 classid;
> +       u32 mark;
>   
>          classid = task_cls_classid(task);
>          if (classid != sk->sk_classid)
>                  sk->sk_classid = classid;
> +
> +       mark = task_cls_mark(task);
> +       if (mark != sk->sk_mark)
> +               sk->sk_mark = mark;
>   }
>
> The second implementation is adding a new iptables matcher which matches
> on LSM contexts. Then you can do something like this:
>
> iptables -t mangle -A OUTPUT -m secmark --secctx unconfined_u:unconfined_r:foo_t:s0-s0:c0.c1023 -j MARK --set-mark 200
As I understand it, with the LSM context this works for both egress and ingress.

>> Maybe you should be in contact, in case you are not yet.
>>
>> A few general comments:
>> 1) res_counters are incredibly expensive. If you are more interested in
>> counting than you are in limiting, they may not be your best choice.
You're right. I now plan to limit traffic here as well.
Alternatively, limiting could be done in QoS, in which case an atomic
counter is better here.
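
If only counting is needed, a per-interface pair of atomic byte counters is one lightweight option. The following is a user-space sketch using C11 atomics, not the kernel's atomic API or the patch's actual data structures; `demo_iface_stats` and the helper names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-cgroup, per-interface traffic counters. */
struct demo_iface_stats {
	atomic_uint_fast64_t rcv_bytes;
	atomic_uint_fast64_t snd_bytes;
};

static void demo_stats_init(struct demo_iface_stats *s)
{
	atomic_init(&s->rcv_bytes, 0);
	atomic_init(&s->snd_bytes, 0);
}

/* Pure counting needs no ordering with other memory, so relaxed is enough. */
static void demo_account_rcv(struct demo_iface_stats *s, uint64_t len)
{
	atomic_fetch_add_explicit(&s->rcv_bytes, len, memory_order_relaxed);
}

static void demo_account_snd(struct demo_iface_stats *s, uint64_t len)
{
	atomic_fetch_add_explicit(&s->snd_bytes, len, memory_order_relaxed);
}

static uint64_t demo_read_rcv(struct demo_iface_stats *s)
{
	return atomic_load_explicit(&s->rcv_bytes, memory_order_relaxed);
}

static uint64_t demo_read_snd(struct demo_iface_stats *s)
{
	return atomic_load_explicit(&s->snd_bytes, memory_order_relaxed);
}
```

Unlike a limit-enforcing counter, this takes no lock and never fails a charge, so it cannot be used for limiting on its own.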

>> 2) When Daniel exposed his use case to me, it gave me the impression
>> that "counting traffic" is something that is totally doable by having a
>> dedicated interface in a separate namespace. Basically, we already count
>> traffic (rx and tx) for all interfaces anyway, so it suggests that it
>> could be an interesting way to see the problem.
> Moving applications into separate net namespaces is for sure a valid solution.
> Though there is one drawback in this approach. The namespaces need to be
> attached to a bridge and then some NATting. That means every application
> would get its own IP address. This might be okay for certain use
> cases but I am still trying to work around this. Glauber and I had some
> discussion about this and he suggested to allow the physical networking
> device to be attached to several namespaces (e.g. via macvlan). Every
> namespace would get the same IP address. Unfortunately, this would result in
> the same mess as several physical devices on a network get the same
> IP address assigned.
Do I understand correctly that, to make statistics work, we need to put the
process into a separate namespace?
The approach of keeping the counter in the cgroup doesn't have such side
effects, but it has others ).

>   
>
>> AFAIK, Daniel is still measuring this. But it would be great to know if
>> that could work for your use case as well.
> I have not started to measure :(
>
> cheers,
> daniel
>
>
Thank you Daniel and Glauber!

-- 
BR
Alexey


* Re: performance regression on HiperSockets depending on MTU size
From: Cong Wang @ 2012-11-28  5:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Network Developers
In-Reply-To: <1353998800.7553.873.camel@edumazet-glaptop>

On Tue, Nov 27, 2012 at 2:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-11-27 at 06:21 +0000, Cong Wang wrote:
>
>> Eric,
>>
>> Do you have a full list of such commits? I am trying to backport TSQ
>> to 2.6.32, and of course I don't want to miss these commits either.
>
> I dont think there are other known issues.
>
> mlx4 had a 'problem' because only recently we removed the skb_orphan()
> call it used to do in its start_xmit() function.
>
> I remember David had to revert BQL on NIU driver, but NIU does the
> skb_orphan() call as well so TSQ is basically disabled.
>
>


Good news! Thanks!


* Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
From: Cong Wang @ 2012-11-28  5:47 UTC (permalink / raw)
  To: Linda Walsh; +Cc: LKML, Linux Kernel Network Developers
In-Reply-To: <50B5248A.5010908@tlinx.org>

Cc netdev...

On Wed, Nov 28, 2012 at 4:37 AM, Linda Walsh <lkml@tlinx.org> wrote:
>
>
> Is this a known problem / bug, or should I file a bug on it?  It doesn't
> cause a complete failure, and it happens multiple times (~28 times
> in 2.5 days?... so maybe 10x/day?)  about 8 start with ifup, and the rest
> start @ kworker -- both happen upon enabling the bonding driver
> on a 10Gb dual port adapter (trying to get 1 20Gb adapter).
>
> The 2 traceback types (ifup-bonding + kworker) follow:


Does this quick fix help?

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5f5b69f..4a4d9eb 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1785,7 +1785,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
                new_slave->link == BOND_LINK_DOWN ? "DOWN" :
                        (new_slave->link == BOND_LINK_UP ? "UP" : "BACK"));

+       read_unlock(&bond->lock);
        bond_update_speed_duplex(new_slave);
+       read_lock(&bond->lock);

        if (USES_PRIMARY(bond->params.mode) && bond->params.primary[0]) {
                /* if there is a primary slave, remember it */

Thanks!

>
>
> ----------------- ifup-bonding traceback:
>
> [  229.208603] bonding: bond0: Setting MII monitoring interval to 100.
> [  229.222336] bonding: bond0: Adding slave p2p1.
> [  229.685599] BUG: scheduling while atomic: ifup-bonding/3711/0x00000002
> [  229.692166] 4 locks held by ifup-bonding/3711:
> [  229.696645]  #0:  (&buffer->mutex){......}, at: [<ffffffff811acd3f>] sysfs_write_file+0x3f/0x150
> [  229.705721]  #1:  (s_active#75){......}, at: [<ffffffff811acdbb>] sysfs_write_file+0xbb/0x150
> [  229.714538]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>] rtnl_trylock+0x10/0x20
> [  229.722772]  #3:  (&bond->lock){......}, at: [<ffffffffa02964af>] bond_enslave+0x4df/0xb50 [bonding]
> [  229.732188] Modules linked in: bonding fan mousedev kvm_intel iTCO_wdt iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm tpm_bios processor button
> [  229.747197] Pid: 3711, comm: ifup-bonding Not tainted 3.6.7-Isht-Van #1
> [  229.753843] Call Trace:
> [  229.756333]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  229.761863]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  229.767214]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  229.772214]  [<ffffffff81697b6c>] schedule_hrtimeout_range_clock+0xfc/0x140
> [  229.779210]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  229.784645]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  229.790950]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  229.797254]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  229.802611]  [<ffffffff814c519c>] ixgbe_acquire_swfw_sync_X540+0xbc/0x110
> [  229.809429]  [<ffffffff814c146d>] ixgbe_read_phy_reg_generic+0x3d/0x120
> [  229.816078]  [<ffffffff814c16dc>] ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  229.824022]  [<ffffffffa02964af>] ? bond_enslave+0x4df/0xb50 [bonding]
> [  229.830581]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  229.836534]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  229.842837]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60 [bonding]
> [  229.850092]  [<ffffffffa0296518>] bond_enslave+0x548/0xb50 [bonding]
> [  229.856478]  [<ffffffffa029e62f>] bonding_store_slaves+0x13f/0x190 [bonding]
> [  229.863556]  [<ffffffff813fe163>] dev_attr_store+0x13/0x30
> [  229.869074]  [<ffffffff811acdd4>] sysfs_write_file+0xd4/0x150
> [  229.874856]  [<ffffffff81142c01>] vfs_write+0xb1/0x180
> [  229.880034]  [<ffffffff81142f28>] sys_write+0x48/0x90
> [  229.885125]  [<ffffffff8169b162>] system_call_fastpath+0x16/0x1b
> [  229.891259] BUG: scheduling while atomic: ifup-bonding/3711/0x00000002
> [  229.897839] 4 locks held by ifup-bonding/3711:
> [  229.902320]  #0:  (&buffer->mutex){......}, at: [<ffffffff811acd3f>] sysfs_write_file+0x3f/0x150
> [  229.911395]  #1:  (s_active#75){......}, at: [<ffffffff811acdbb>] sysfs_write_file+0xbb/0x150
> [  229.920212]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>] rtnl_trylock+0x10/0x20
> [  229.928449]  #3:  (&bond->lock){......}, at: [<ffffffffa02964af>] bond_enslave+0x4df/0xb50 [bonding]
> [  229.937866] Modules linked in: bonding fan mousedev kvm_intel iTCO_wdt iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm tpm_bios processor button
> [  229.952904] Pid: 3711, comm: ifup-bonding Tainted: G        W 3.6.7-Isht-Van #1
> [  229.960507] Call Trace:
> [  229.962997]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  229.968526]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  229.973875]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  229.978876]  [<ffffffff81697b6c>]
> schedule_hrtimeout_range_clock+0xfc/0x140
> [  229.985871]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  229.991303]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  229.996739]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  230.003040]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  230.009344]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  230.014698]  [<ffffffff814c50ce>] ixgbe_release_swfw_sync_X540+0x4e/0x60
> [  230.021435]  [<ffffffff814c1531>] ixgbe_read_phy_reg_generic+0x101/0x120
> [  230.028171]  [<ffffffff814c16dc>]
> ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  230.036117]  [<ffffffffa02964af>] ? bond_enslave+0x4df/0xb50 [bonding]
> [  230.042677]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  230.048630]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  230.054934]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60
> [bonding]
> [  230.062189]  [<ffffffffa0296518>] bond_enslave+0x548/0xb50 [bonding]
> [  230.068580]  [<ffffffffa029e62f>] bonding_store_slaves+0x13f/0x190
> [bonding]
> [  230.075660]  [<ffffffff813fe163>] dev_attr_store+0x13/0x30
> [  230.081189]  [<ffffffff811acdd4>] sysfs_write_file+0xd4/0x150
> [  230.086971]  [<ffffffff81142c01>] vfs_write+0xb1/0x180
> [  230.092148]  [<ffffffff81142f28>] sys_write+0x48/0x90
> [  230.097245]  [<ffffffff8169b162>] system_call_fastpath+0x16/0x1b
> [  230.103427] bonding: bond0: enslaving p2p1 as an active interface with a
> down link.
> [  230.120623] bonding: bond0: Adding slave p2p2.
> [  230.575194] BUG: scheduling while atomic: ifup-bonding/3711/0x00000002
> [  230.581782] 4 locks held by ifup-bonding/3711:
> [  230.586262]  #0:  (&buffer->mutex){......}, at: [<ffffffff811acd3f>]
> sysfs_write_file+0x3f/0x150
> [  230.595287]  #1:  (s_active#75){......}, at: [<ffffffff811acdbb>]
> sysfs_write_file+0xbb/0x150
> [  230.604105]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>]
> rtnl_trylock+0x10/0x20
> [  230.612393]  #3:  (&bond->lock){......}, at: [<ffffffffa02964af>]
> bond_enslave+0x4df/0xb50 [bonding]
> [  230.621801] Modules linked in: bonding fan mousedev kvm_intel iTCO_wdt
> iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm tpm_bios
> processor button
> [  230.636922] Pid: 3711, comm: ifup-bonding Tainted: G        W
> 3.6.7-Isht-Van #1
> [  230.644516] Call Trace:
> [  230.647009]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  230.652537]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  230.657886]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  230.662889]  [<ffffffff81697b6c>]
> schedule_hrtimeout_range_clock+0xfc/0x140
> [  230.669884]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  230.675320]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  230.681622]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  230.687921]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  230.693283]  [<ffffffff814c519c>] ixgbe_acquire_swfw_sync_X540+0xbc/0x110
> [  230.700113]  [<ffffffff814c146d>] ixgbe_read_phy_reg_generic+0x3d/0x120
> [  230.706763]  [<ffffffff814c16dc>]
> ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  230.714713]  [<ffffffffa02964af>] ? bond_enslave+0x4df/0xb50 [bonding]
> [  230.721275]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  230.727231]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  230.733535]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60
> [bonding]
> [  230.740790]  [<ffffffffa0296518>] bond_enslave+0x548/0xb50 [bonding]
> [  230.747189]  [<ffffffffa029e62f>] bonding_store_slaves+0x13f/0x190
> [bonding]
> [  230.754270]  [<ffffffff813fe163>] dev_attr_store+0x13/0x30
> [  230.759792]  [<ffffffff811acdd4>] sysfs_write_file+0xd4/0x150
> [  230.765576]  [<ffffffff81142c01>] vfs_write+0xb1/0x180
> [  230.770753]  [<ffffffff81142f28>] sys_write+0x48/0x90
> [  230.775840]  [<ffffffff8169b162>] system_call_fastpath+0x16/0x1b
> [  230.781933] BUG: scheduling while atomic: ifup-bonding/3711/0x00000002
> [  230.788499] 4 locks held by ifup-bonding/3711:
> [  230.793021]  #0:  (&buffer->mutex){......}, at: [<ffffffff811acd3f>]
> sysfs_write_file+0x3f/0x150
> [  230.802051]  #1:  (s_active#75){......}, at: [<ffffffff811acdbb>]
> sysfs_write_file+0xbb/0x150
> [  230.810872]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>]
> rtnl_trylock+0x10/0x20
> [  230.819160]  #3:  (&bond->lock){......}, at: [<ffffffffa02964af>]
> bond_enslave+0x4df/0xb50 [bonding]
> [  230.828575] Modules linked in: bonding fan mousedev kvm_intel iTCO_wdt
> iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm tpm_bios
> processor button
> [  230.843673] Pid: 3711, comm: ifup-bonding Tainted: G        W
> 3.6.7-Isht-Van #1
> [  230.851271] Call Trace:
> [  230.853759]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  230.859282]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  230.864637]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  230.869598]  [<ffffffff81697b6c>]
> schedule_hrtimeout_range_clock+0xfc/0x140
> [  230.876548]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  230.881966]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  230.887359]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  230.893649]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  230.899907]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  230.905213]  [<ffffffff814c50ce>] ixgbe_release_swfw_sync_X540+0x4e/0x60
> [  230.911908]  [<ffffffff814c1531>] ixgbe_read_phy_reg_generic+0x101/0x120
> [  230.918599]  [<ffffffff814c16dc>]
> ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  230.926501]  [<ffffffffa02964af>] ? bond_enslave+0x4df/0xb50 [bonding]
> [  230.933018]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  230.938929]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  230.945185]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60
> [bonding]
> [  230.952392]  [<ffffffffa0296518>] bond_enslave+0x548/0xb50 [bonding]
> [  230.958739]  [<ffffffffa029e62f>] bonding_store_slaves+0x13f/0x190
> [bonding]
> [  230.965774]  [<ffffffff813fe163>] dev_attr_store+0x13/0x30
> [  230.971251]  [<ffffffff811acdd4>] sysfs_write_file+0xd4/0x150
> [  230.976988]  [<ffffffff81142c01>] vfs_write+0xb1/0x180
> [  230.982146]  [<ffffffff81142f28>] sys_write+0x48/0x90
> [  230.987192]  [<ffffffff8169b162>] system_call_fastpath+0x16/0x1b
> [  230.993297] bonding: bond0: enslaving p2p2 as an active interface with a
> down link.
> [  231.014761] ixgbe 0000:06:00.0: p2p1: changing MTU from 1500 to 9000
> [  231.863728] ixgbe 0000:06:00.1: p2p2: changing MTU from 1500 to 9000
>
>
> ---------- kworker traceback:
> [  236.268690] ixgbe 0000:06:00.0: p2p1: NIC Link is Up 10 Gbps, Flow
> Control: None
> [  236.305628] BUG: scheduling while atomic: kworker/u:2/106/0x00000002
> [  236.312025] 4 locks held by kworker/u:2/106:
> [  236.312037]  #0:  ((bond_dev->name)){......}, at: [<ffffffff8105a956>]
> process_one_work+0x146/0x680
> [  236.312049]  #1:  ((&(&bond->mii_work)->work)){......}, at:
> [<ffffffff8105a956>] process_one_work+0x146/0x680
> [  236.312056]  #2:  (rtnl_mutex){......}, at: [<ffffffff8159e1e0>]
> rtnl_trylock+0x10/0x20
> [  236.312065]  #3:  (&bond->lock){......}, at: [<ffffffffa02955ad>]
> bond_mii_monitor+0x2ed/0x640 [bonding]
> [  236.312078] Modules linked in: ipv6 bonding fan mousedev kvm_intel
> iTCO_wdt iTCO_vendor_support gpio_ich kvm acpi_cpufreq mperf tpm_tis tpm
> tpm_bios processor button
> [  236.312082] Pid: 106, comm: kworker/u:2 Tainted: G        W
> 3.6.7-Isht-Van #1
> [  236.312083] Call Trace:
> [  236.312092]  [<ffffffff8168ead9>] __schedule_bug+0x5e/0x6c
> [  236.312102]  [<ffffffff8169893c>] __schedule+0x77c/0x810
> [  236.312108]  [<ffffffff81698a54>] schedule+0x24/0x70
> [  236.312114]  [<ffffffff81697b6c>]
> schedule_hrtimeout_range_clock+0xfc/0x140
> [  236.312121]  [<ffffffff81064c80>] ? update_rmtp+0x60/0x60
> [  236.312129]  [<ffffffff81065a1f>] ? hrtimer_start_range_ns+0xf/0x20
> [  236.312134]  [<ffffffff81697bbe>] schedule_hrtimeout_range+0xe/0x10
> [  236.312144]  [<ffffffff8104bddb>] usleep_range+0x3b/0x40
> [  236.312150]  [<ffffffff814c519c>] ixgbe_acquire_swfw_sync_X540+0xbc/0x110
> [  236.312157]  [<ffffffff814c146d>] ixgbe_read_phy_reg_generic+0x3d/0x120
> [  236.312161]  [<ffffffff814c16dc>]
> ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
> [  236.312166]  [<ffffffffa02955ad>] ? bond_mii_monitor+0x2ed/0x640
> [bonding]
> [  236.312170]  [<ffffffff814b93e4>] ixgbe_get_settings+0x34/0x2b0
> [  236.312177]  [<ffffffff81594385>] __ethtool_get_settings+0x85/0x140
> [  236.312182]  [<ffffffffa0292303>] bond_update_speed_duplex+0x23/0x60
> [bonding]
> [  236.312188]  [<ffffffffa0295614>] bond_mii_monitor+0x354/0x640 [bonding]
> [  236.312198]  [<ffffffff8105a9b7>] process_one_work+0x1a7/0x680
> [  236.312203]  [<ffffffff8105a956>] ? process_one_work+0x146/0x680
> [  236.312210]  [<ffffffff8108c7ce>] ? put_lock_stats.isra.21+0xe/0x40
> [  236.312215]  [<ffffffffa02952c0>] ? bond_loadbalance_arp_mon+0x2c0/0x2c0
> [bonding]
> [  236.312234]  [<ffffffff8105b9ed>] worker_thread+0x18d/0x4f0
> [  236.312239]  [<ffffffff81070991>] ? sub_preempt_count+0x51/0x60
> [  236.312242]  [<ffffffff8105b860>] ? manage_workers+0x320/0x320
> [  236.312247]  [<ffffffff81060f7d>] kthread+0x9d/0xb0
> [  236.312250]  [<ffffffff8169c264>] kernel_thread_helper+0x4/0x10
> [  236.312254]  [<ffffffff8106c197>] ? finish_task_switch+0x77/0x100
> [  236.312262]  [<ffffffff8169a4a6>] ? _raw_spin_unlock_irq+0x36/0x60
> [  236.312268]  [<ffffffff8169a9dd>] ? retint_restore_args+0xe/0xe
> [  236.312273]  [<ffffffff81060ee0>] ? flush_kthread_worker+0x160/0x160
> [  236.312277]  [<ffffffff8169c260>] ? gs_change+0xb/0xb
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply related

* [PATCH net-next] be2net: fix INTx ISR for interrupt behaviour on BE2
From: Sathya Perla @ 2012-11-28  5:50 UTC (permalink / raw)
  To: netdev; +Cc: Sathya Perla

On the BE2 chip, an interrupt may be raised even when the EQ is in an un-armed
state. As a result, be_intx()::events_get() and be_poll()::events_get() can
race and notify an EQ wrongly.

Fix this by counting events only in be_poll(). Commit 0b545a629 fixes
the same issue in the MSI-x path.

But, on Lancer, INTx can be de-asserted only by notifying num evts. This
is not an issue as the above BE2 behavior doesn't exist/has never been
seen on Lancer.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c |   54 +++++++++++---------------
 1 files changed, 23 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index adef536..0661e93 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -1675,24 +1675,6 @@ static inline int events_get(struct be_eq_obj *eqo)
 	return num;
 }
 
-static int event_handle(struct be_eq_obj *eqo)
-{
-	bool rearm = false;
-	int num = events_get(eqo);
-
-	/* Deal with any spurious interrupts that come without events */
-	if (!num)
-		rearm = true;
-
-	if (num || msix_enabled(eqo->adapter))
-		be_eq_notify(eqo->adapter, eqo->q.id, rearm, true, num);
-
-	if (num)
-		napi_schedule(&eqo->napi);
-
-	return num;
-}
-
 /* Leaves the EQ is disarmed state */
 static void be_eq_clean(struct be_eq_obj *eqo)
 {
@@ -2014,15 +1996,23 @@ static int be_rx_cqs_create(struct be_adapter *adapter)
 
 static irqreturn_t be_intx(int irq, void *dev)
 {
-	struct be_adapter *adapter = dev;
-	int num_evts;
+	struct be_eq_obj *eqo = dev;
+	struct be_adapter *adapter = eqo->adapter;
+	int num_evts = 0;
 
-	/* With INTx only one EQ is used */
-	num_evts = event_handle(&adapter->eq_obj[0]);
-	if (num_evts)
-		return IRQ_HANDLED;
-	else
-		return IRQ_NONE;
+	/* On Lancer, clear-intr bit of the EQ DB does not work.
+	 * INTx is de-asserted only on notifying num evts.
+	 */
+	if (lancer_chip(adapter))
+		num_evts = events_get(eqo);
+
+	/* The EQ-notify may not de-assert INTx rightaway, causing
+	 * the ISR to be invoked again. So, return HANDLED even when
+	 * num_evts is zero.
+	 */
+	be_eq_notify(adapter, eqo->q.id, false, true, num_evts);
+	napi_schedule(&eqo->napi);
+	return IRQ_HANDLED;
 }
 
 static irqreturn_t be_msix(int irq, void *dev)
@@ -2342,10 +2332,10 @@ static int be_irq_register(struct be_adapter *adapter)
 			return status;
 	}
 
-	/* INTx */
+	/* INTx: only the first EQ is used */
 	netdev->irq = adapter->pdev->irq;
 	status = request_irq(netdev->irq, be_intx, IRQF_SHARED, netdev->name,
-			adapter);
+			     &adapter->eq_obj[0]);
 	if (status) {
 		dev_err(&adapter->pdev->dev,
 			"INTx request IRQ failed - err %d\n", status);
@@ -2367,7 +2357,7 @@ static void be_irq_unregister(struct be_adapter *adapter)
 
 	/* INTx */
 	if (!msix_enabled(adapter)) {
-		free_irq(netdev->irq, adapter);
+		free_irq(netdev->irq, &adapter->eq_obj[0]);
 		goto done;
 	}
 
@@ -3023,8 +3013,10 @@ static void be_netpoll(struct net_device *netdev)
 	struct be_eq_obj *eqo;
 	int i;
 
-	for_all_evt_queues(adapter, eqo, i)
-		event_handle(eqo);
+	for_all_evt_queues(adapter, eqo, i) {
+		be_eq_notify(eqo->adapter, eqo->q.id, false, true, 0);
+		napi_schedule(&eqo->napi);
+	}
 
 	return;
 }
-- 
1.7.1

^ permalink raw reply related

* Re: Fwd: Re: [PATCH] net: ipv6: change %8s to %s for rt->dst.dev->name in seq_printf of rt6_info_route
From: Chen Gang @ 2012-11-28  5:54 UTC (permalink / raw)
  To: Shan Wei; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <50B4551C.6030505@asianux.com>


One more point: "right-aligning to 8 characters should not be part of the OS API".
  An API should contain as little as possible.
  For an output-format API, "contents" + "topology" + "separator mark" + "space redundancy" are enough.
  So "right-aligning to 8 characters" should not be part of the API (it belongs only to "User Experience").


  "User Experience" is mostly about looking 'beautiful'!!

  !-)


gchen.


On 2012-11-27 13:52, Chen Gang wrote:
> On 2012-11-27 13:40, Chen Gang wrote:
>>
>> and now, I think:
>>   A) both input and output through /proc/* are for os api level.
>>   B) but both %8s and %s do not change the output interface format (including contents, topology, separator mark, space redundancy).
>>   C) so it is belong to 'User Experience', not belong to os api.
>>
>>   welcome any another members to giving suggestions and completions.
>>
>>   thanks.
>>
>>   :-)
>>
> 
>   completion: 8 right alignment is not belong to interface format.
>     if it was belong to interface format,
>     it would cause correctness issue (the name len may be larger than 8).
>     so if "8 right alignment" is belong to os api, it means the api is not correct, need change.
> 
>   :-)
> 


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent
From: Jason Wang @ 2012-11-28  6:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: mst, netdev, linux-kernel, virtualization, davem
In-Reply-To: <20121127084919.1587c647@nehalam.linuxnetplumber.net>

On 11/28/2012 12:49 AM, Stephen Hemminger wrote:
> On Tue, 27 Nov 2012 14:45:13 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 11/27/2012 01:37 AM, Stephen Hemminger wrote:
>>> On Mon, 26 Nov 2012 15:56:52 +0800
>>> Jason Wang <jasowang@redhat.com> wrote:
>>>
>>>> Some devices do not free the old tx skbs immediately after they have been sent
>>>> (usually in tx interrupt). One such example is virtio-net which optimizes for
>>>> virt and only frees the possible old tx skbs during the next packet sending. This
>>>> would lead the pktgen to wait forever in the refcount of the skb if no other
>>>> packet will be sent afterwards.
>>>>
>>>> Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY which could
>>>> notify the pktgen that the device does not free skb immediately after it has
>>>> been sent and let it not to wait for the refcount to be one.
>>>>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>> Another alternative would be using skb_orphan() and skb->destructor.
>>> There are other cases where skb's are not freed right away.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Hi Stephen:
>>
>> Do you mean registering a skb->destructor for pktgen then set and check
>> bits in skb->tx_flag?
> Yes. Register a destructor that does something like update a counter (number of packets pending),
> then just spin while number of packets pending is over threshold.
> --

I'm not sure this is the best method, since pktgen is used to test the tx
process of the device driver and NIC. If we use skb_orphan(), we would
miss testing the tx completion part.
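
For reference, the counter-based alternative suggested above could look
roughly like the following. This is only a userspace sketch under stated
assumptions, not pktgen code: struct fake_skb, struct fake_pktgen and all
fake_* functions are hypothetical stand-ins, and the "driver" is simulated
by calling fake_skb_free() by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: only the fields this sketch needs. */
struct fake_skb {
	void (*destructor)(struct fake_skb *skb);
	void *owner;
};

/* Hypothetical sender state: packets handed out and not yet freed. */
struct fake_pktgen {
	unsigned int pending;
	unsigned int threshold;
};

/* Runs when the skb is released; drops the pending count. */
void fake_pktgen_destructor(struct fake_skb *skb)
{
	struct fake_pktgen *pg = skb->owner;

	assert(pg != NULL && pg->pending > 0);
	pg->pending--;
}

/* Returns 1 if the packet was queued, 0 if the sender has to wait. */
int fake_pktgen_send(struct fake_pktgen *pg, struct fake_skb *skb)
{
	if (pg->pending >= pg->threshold)
		return 0;
	skb->owner = pg;
	skb->destructor = fake_pktgen_destructor;
	pg->pending++;
	return 1;
}

/* Simulates the device releasing the skb, whenever it gets around to it. */
void fake_skb_free(struct fake_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
}
```

With a threshold of 2, a third fake_pktgen_send() is refused until one of
the earlier packets has been freed. Note that in this scheme the sender only
learns when the destructor ran; if the skb is orphaned early, that is not
the same moment as tx completion, which is the concern raised above.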

^ permalink raw reply

* Re: [RFC PATCH] 8139cp: properly support change of MTU values
From: Rami Rosen @ 2012-11-28  7:23 UTC (permalink / raw)
  To: John Greene; +Cc: netdev
In-Reply-To: <1354046932-13606-1-git-send-email-jogreene@redhat.com>

Hi,

In cp_change_mtu(), there is the following check:
...
if (new_mtu < CP_MIN_MTU || new_mtu > CP_MAX_MTU)
		return -EINVAL;
...
Later on, we set dev->mtu to new_mtu.

CP_MIN_MTU is defined as 60; shouldn't it be 68?


The reason for 68 is (RFC 791,  Internet Protocol,
http://www.ietf.org/rfc/rfc791.txt):
"Every internet module must be able to forward a datagram of 68 octets
without further fragmentation.  This is because an internet  header
may be up to 60 octets, and the minimum fragment is 8 octets."

See also the generic Ethernet method, eth_change_mtu() (net/ethernet/eth.c):

int eth_change_mtu(struct net_device *dev, int new_mtu)
{
	if (new_mtu < 68 || new_mtu > ETH_DATA_LEN)
		return -EINVAL;
	dev->mtu = new_mtu;
	return 0;
}
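
To spell out the arithmetic behind the 68-octet floor, here is a small
standalone sketch. It is not driver code: IP_MAX_HEADER_LEN,
IP_MIN_FRAGMENT_LEN, IP_MIN_MTU and mtu_in_range() are names made up for
this illustration, and the upper bound is whatever the caller passes in.

```c
#include <assert.h>
#include <errno.h>

/* RFC 791: a header of up to 60 octets plus a minimum fragment of 8 octets. */
enum {
	IP_MAX_HEADER_LEN   = 60,
	IP_MIN_FRAGMENT_LEN = 8,
	IP_MIN_MTU          = IP_MAX_HEADER_LEN + IP_MIN_FRAGMENT_LEN
};

/* Returns 0 if new_mtu lies in [IP_MIN_MTU, max_mtu], -EINVAL otherwise. */
int mtu_in_range(int new_mtu, int max_mtu)
{
	assert(max_mtu >= IP_MIN_MTU);

	if (new_mtu < IP_MIN_MTU || new_mtu > max_mtu)
		return -EINVAL;
	return 0;
}
```

With this check a request for an MTU of 60 is rejected, while 68 is the
smallest value accepted.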


regards,
Rami Rosen

http://ramirose.wix.com/ramirosen

On Tue, Nov 27, 2012 at 10:08 PM, John Greene <jogreene@redhat.com> wrote:
> The 8139cp driver has a change_mtu function that has not been
> enabled since the dawn of the git repository. However, the
> generic eth_change_mtu is not used in its place, so that
> invalid MTU values can be set on the interface.
>
> Original patch salvages the broken code for the single case of
> setting the MTU while the interface is down, which is safe
> and also includes the range check.  Now enhanced to support up
> or down interface.
>
> Original patch from
> http://lkml.indiana.edu/hypermail/linux/kernel/1202.2/00770.html
>
> Testing: has been test on virtual 8139cp setup without issue,
> awaiting real hardware and retest again, but wanted to post now.
>
> Signed-Off-By: "John Greene" <jogreene@redhat.com>
> CC: "David S. Miller" <davem@davemloft.net>
> ---
>  drivers/net/ethernet/realtek/8139cp.c | 22 +++-------------------
>  1 file changed, 3 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
> index 6cb96b4..7847c83 100644
> --- a/drivers/net/ethernet/realtek/8139cp.c
> +++ b/drivers/net/ethernet/realtek/8139cp.c
> @@ -1226,12 +1226,9 @@ static void cp_tx_timeout(struct net_device *dev)
>         spin_unlock_irqrestore(&cp->lock, flags);
>  }
>
> -#ifdef BROKEN
>  static int cp_change_mtu(struct net_device *dev, int new_mtu)
>  {
>         struct cp_private *cp = netdev_priv(dev);
> -       int rc;
> -       unsigned long flags;
>
>         /* check for invalid MTU, according to hardware limits */
>         if (new_mtu < CP_MIN_MTU || new_mtu > CP_MAX_MTU)
> @@ -1244,22 +1241,11 @@ static int cp_change_mtu(struct net_device *dev, int new_mtu)
>                 return 0;
>         }
>
> -       spin_lock_irqsave(&cp->lock, flags);
> -
> -       cp_stop_hw(cp);                 /* stop h/w and free rings */
> -       cp_clean_rings(cp);
> -
> +       /* network IS up, close it, reset MTU, and come up again. */
> +       cp_close(dev);
>         dev->mtu = new_mtu;
> -       cp_set_rxbufsize(cp);           /* set new rx buf size */
> -
> -       rc = cp_init_rings(cp);         /* realloc and restart h/w */
> -       cp_start_hw(cp);
> -
> -       spin_unlock_irqrestore(&cp->lock, flags);
> -
> -       return rc;
> +       return cp_open(dev);
>  }
> -#endif /* BROKEN */
>
>  static const char mii_2_8139_map[8] = {
>         BasicModeCtrl,
> @@ -1835,9 +1821,7 @@ static const struct net_device_ops cp_netdev_ops = {
>         .ndo_start_xmit         = cp_start_xmit,
>         .ndo_tx_timeout         = cp_tx_timeout,
>         .ndo_set_features       = cp_set_features,
> -#ifdef BROKEN
>         .ndo_change_mtu         = cp_change_mtu,
> -#endif
>
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>         .ndo_poll_controller    = cp_poll_controller,
> --
> 1.7.11.7
>

^ permalink raw reply

* Re: TCP and reordering
From: Saku Ytti @ 2012-11-28  7:26 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, netdev
In-Reply-To: <20121127.210611.1127622873924794001.davem@davemloft.net>

On (2012-11-27 21:06 -0500), David Miller wrote:

> And the gains of fast retransmit far outweigh whatever strange
> justification would give for reordering packets on purpose.

I don't disagree. I'm not proposing to turn off fast retransmits.

My proposal (or, more accurately, my question) was to add a 'reorder' counter
to sockets, which would increment when a duplicate ACK is followed by the same
sequence twice.
Then you could automatically/dynamically delay duplicate ACKs, as you'd
start to expect to receive the frames out of order, giving non-lossy
reordering links pretty much 100% the same performance as non-lossy in-order
links.

There is a good amount of optimization in TCP for corner cases, and that
is what a TCP stack does: it tries to work within the limitations imposed by
the network.

My main question is: am I underestimating the complexity needed to add such a
counter? Or does such a counter actually already exist (I've not looked at
whether the netstat -s reordering counters are attributable to a particular
socket)?
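
To make the idea a bit more concrete, here is a rough userspace sketch of a
per-connection counter. It is an assumption-laden simplification, not stack
code: struct reorder_stats and the reorder_stats_*() functions are
hypothetical, and it merely counts segments that arrive with a sequence
number below the highest one already seen. On its own that cannot tell a
reordered segment from a retransmitted one; a real stack would need more
information to separate the two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state for counting late arrivals. */
struct reorder_stats {
	uint32_t highest_seq;   /* highest segment start seen so far */
	uint32_t late_segments; /* arrivals below highest_seq */
	int seen_any;           /* 0 until the first segment is recorded */
};

void reorder_stats_init(struct reorder_stats *st)
{
	assert(st != NULL);
	st->highest_seq = 0;
	st->late_segments = 0;
	st->seen_any = 0;
}

/* Records one arriving segment; returns 1 if it arrived late, else 0. */
int reorder_stats_record(struct reorder_stats *st, uint32_t seq)
{
	assert(st != NULL);

	if (!st->seen_any) {
		st->seen_any = 1;
		st->highest_seq = seq;
		return 0;
	}

	/* Signed difference so that 32-bit sequence wraparound is handled. */
	if ((int32_t)(seq - st->highest_seq) < 0) {
		st->late_segments++;
		return 1;
	}

	/* In order, or an exact duplicate of the highest start: not counted. */
	st->highest_seq = seq;
	return 0;
}
```

Feeding it the segment starts 1000, 2408, 5224, 3816 would count one late
arrival (3816).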

-- 
  ++ytti

^ permalink raw reply

* Re: TCP and reordering
From: David Woodhouse @ 2012-11-28  7:59 UTC (permalink / raw)
  To: David Miller; +Cc: saku, rick.jones2, netdev
In-Reply-To: <20121127.210611.1127622873924794001.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 2170 bytes --]

On Tue, 2012-11-27 at 21:06 -0500, David Miller wrote:
> And the gains of fast retransmit far outweigh whatever strange
> justification would give for reordering packets on purpose.

My 'strange justification' for reordering, albeit not entirely on
purpose, is that a single ADSL line at 8Mb/s down, 448Kb/s up is less
bandwidth than I had to my dorm room 16 years ago. So I bond two of
them, and naturally expect a certain amount of reordering.

I've never really done much analysis of this though, and it's never
seemed to be a problem. Then again, I don't think I get *much*
reordering. Big downloads tend to look fairly much like this:

07:36:02.272979 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67016473:67017881, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912240], length 1408
07:36:02.273478 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67017881:67019289, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912240], length 1408
07:36:02.273507 IP6 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530 > 2001:770:15f::2.http: Flags [.], ack 67019289, win 11198, options [nop,nop,TS val 1096912356 ecr 2564943119], length 0
07:36:02.274727 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67019289:67020697, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912241], length 1408
07:36:02.275151 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67020697:67022105, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912241], length 1408
07:36:02.275184 IP6 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530 > 2001:770:15f::2.http: Flags [.], ack 67022105, win 11198, options [nop,nop,TS val 1096912358 ecr 2564943119], length 0

I suppose it might be worse if the lines weren't running at the same
speed, and if the packets weren't running over the same path through the
telco between the ISP's LNS (which alternates one packet per line) and
my local DSLAM.

Short of going through whole dumps and looking, is there a good way to
get statistics?

-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply

* Re: [PATCH] br2684: don't send frames on not-ready vcc
From: Krzysztof Mazur @ 2012-11-28  8:08 UTC (permalink / raw)
  To: David Woodhouse
  Cc: chas williams - CONTRACTOR, davem, netdev, linux-kernel, nathan
In-Reply-To: <1354064086.21562.10.camel@shinybook.infradead.org>

On Wed, Nov 28, 2012 at 12:54:46AM +0000, David Woodhouse wrote:
> On Wed, 2012-11-28 at 00:51 +0100, Krzysztof Mazur wrote:
> > If you do this actually it's better to don't use patch 1/7 because
> > it introduces race condition that you found earlier.
> 
> Right. I've omitted that from the git tree I just pushed out.
> 
> > With this patch you have still theoretical race that was fixed in patches
> > 5 and 8 in pppoatm series, but I never seen that in practice.
> 
> And I think it's even less likely for br2684. At least with pppoatm you
> might have had pppd sending frames. But for br2684 they *only* come from
> its start_xmit function... which is serialised anyway.
> 
> I do get strange oopses when I try to add BQL to br2684, but that's not
> something to be looking at at 1am...
> 
> I *do* need the equivalent of your patch 4, which is the module_put
> race.
> 

I think you might also need an equivalent of
"[PATCH v3 3/7] pppoatm: allow assign only on a connected socket".

I'm not sure yet. I will test whether I can trigger that Oops on pppoatm
without that patch. Testing vcc flags might be sufficient - that's
what I did in the first patch, but you asked what about SOCK_CONNECTED,
and I think it was really needed.

Krzysiek
-- >8 --
Subject: [PATCH] br2684: allow assign only on a connected socket

The br2684 does not check if used vcc is in connected state,
causing potential Oops in pppoatm_send() when vcc->send() is called
on not fully connected socket.

Now br2684 can be assigned only on connected sockets; otherwise
-EINVAL error is returned.

Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net>
---
 net/atm/br2684.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/atm/br2684.c b/net/atm/br2684.c
index 59e8edb..e88998c 100644
--- a/net/atm/br2684.c
+++ b/net/atm/br2684.c
@@ -704,10 +704,13 @@ static int br2684_ioctl(struct socket *sock, unsigned int cmd,
 			return -ENOIOCTLCMD;
 		if (!capable(CAP_NET_ADMIN))
 			return -EPERM;
-		if (cmd == ATM_SETBACKEND)
+		if (cmd == ATM_SETBACKEND) {
+			if (sock->state != SS_CONNECTED)
+				return -EINVAL;
 			return br2684_regvcc(atmvcc, argp);
-		else
+		} else {
 			return br2684_create(argp);
+		}
 #ifdef CONFIG_ATM_BR2684_IPFILTER
 	case BR2684_SETFILT:
 		if (atmvcc->push != br2684_push)
-- 
1.8.0.411.g71a7da8

^ permalink raw reply related

* Re: [net-next RFC v2] net_cls: traffic counter based on classification control cgroup
From: Daniel Wagner @ 2012-11-28  8:09 UTC (permalink / raw)
  To: Alexey Perevalov
  Cc: Glauber Costa, netdev-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <50B59F54.8080401-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>

Hi Alexey,

On 28.11.2012 06:21, Alexey Perevalov wrote:
>>> Daniel Wagner is working on something a lot similar.
>> Yes, basically what I try to do is explained by this excellent article
>>
>> https://lwn.net/Articles/523058/
> I read articles and agreed with aspects.
> But problem of selecting preferred network for application can be solved 
> using netprio cgroup.

Choosing which network to connect to is the job of a connection manager.
I don't see how a cgroup controller can help you there. I guess I do not
understand your statement. Can you rephrase, please?

>> The second implementation is adding a new iptables matcher which matches
>> on LSM contexts. Then you can do something like this:
>>
>> iptables -t mangle -A OUTPUT -m secmark --secctx 
>> unconfined_u:unconfined_r:foo_t:s0-s0:c0.c1023 -j MARK --set-mark 200
> As I understand in LSM context it works for egress and ingress.

Yes, I am using CONNMARK in conjunction with the above LSM context
matcher. I am still playing around, but it looks quite promising.

>>> 2) When Daniel exposed his use case to me, it gave me the impression
>>> that "counting traffic" is something that is totally doable by having a
>>> dedicated interface in a separate namespace. Basically, we already count
>>> traffic (rx and tx) for all interfaces anyway, so it suggests that it
>>> could be an interesting way to see the problem.
>> Moving applications into separate net namespaces is for sure a valid 
>> solution.
>> Though there is a one drawback in this approach. The namespaces need 
>> to be
>> attached to a bridge and then some NATting. That means every application
>> would get it's own IP address. This might be okay for your certain use
>> cases but I am still trying to work around this. Glauber and I had some
>> discussion about this and he suggested to allow the physical networking
>> device to be attached to several namespaces (e.g. via macvlan). Every
>> namespace would get the same IP address. Unfortunately, this would 
>> result in
>> the same mess as several physical devices on a network get the same
>> IP address assigned.
> Is I truly understand what to make statistics works we need to put 
> process to separate namespace?

If a process lives in its own network namespace then you can
count the packets/bytes on the network interface level. The side effect
is that each namespace is obviously a new network and has to be
treated as such.

> Approach to keep counter in cgroup hasn't such side effects, but it has 
> another ).

cgroups do not come for free. Currently a lot of effort is being put into
getting reasonable performance and behavior out of cgroups. In this situation
any new feature added to cgroups will need a pretty good justification for
why it is needed and why it can't be done with existing infrastructure.

Here is some background information on the state of cgroups:

http://thread.gmane.org/gmane.linux.kernel.containers/23698

cheers,
daniel

^ permalink raw reply

* Re: [PATCH v3 8/7] pppoatm: fix missing wakeup in pppoatm_send()
From: Krzysztof Mazur @ 2012-11-28  8:12 UTC (permalink / raw)
  To: David Woodhouse; +Cc: chas williams - CONTRACTOR, netdev, linux-kernel, davem
In-Reply-To: <1354063697.21562.4.camel@shinybook.infradead.org>

On Wed, Nov 28, 2012 at 12:48:17AM +0000, David Woodhouse wrote:
> On Tue, 2012-11-27 at 10:23 -0500, chas williams - CONTRACTOR wrote:
> > yes, but dont call it 8/7 since that doesnt make sense.
> 
> It made enough sense when it was a single patch appended to a thread of
> 7 other patches from Krzysztof. But now it's all got a little more
> complex, so I've tried to collect together the latest version of
> everything we've discussed:

There was also discussion about patch 9/7 "pppoatm: wakeup after ATM
unlock only when it's needed".

> 
>  http://git.infradead.org/users/dwmw2/atm.git
>   git://git.infradead.org/users/dwmw2/atm.git
> 
> David Woodhouse (5):
>       atm: Add release_cb() callback to vcc
>       pppoatm: fix missing wakeup in pppoatm_send()
>       br2684: fix module_put() race

for the three patches above:

Acked-by: Krzysztof Mazur <krzysiek@podlesie.net>

Krzysiek

^ permalink raw reply

* Re: TCP and reordering
From: Christoph Paasch @ 2012-11-28  8:21 UTC (permalink / raw)
  To: David Woodhouse; +Cc: David Miller, saku, rick.jones2, netdev
In-Reply-To: <1354089566.21562.20.camel@shinybook.infradead.org>

On Wednesday 28 November 2012 07:59:26 David Woodhouse wrote:
> My 'strange justification' for reordering, albeit not entirely on
> purpose, is that a single ADSL line at 8Mb/s down, 448Kb/s up is less
> bandwidth than I had to my dorm room 16 years ago. So I bond two of
> them, and naturally expect a certain amount of reordering.

You might want to have a look at MultiPath TCP [1], which allows the use of 
multiple interfaces for a single TCP connection. It is somewhat similar to 
SCTP-CMT, with the difference that MPTCP is able to pass through today's 
firewalls and NATs and does not require any modifications to the applications.


E.g., you could install MPTCP on your end host and set up an HTTP-proxy on a 
public web hoster to terminate your MPTCP session -- as servers don't (yet) 
support MPTCP, you will have to terminate the MPTCP session somewhere.


Cheers,
Christoph

[1] http://multipath-tcp.org

-- 
IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://mptcp.info.ucl.ac.be
Université Catholique de Louvain
--

^ permalink raw reply

* Re: TCP and reordering
From: Vijay Subramanian @ 2012-11-28  8:22 UTC (permalink / raw)
  To: David Woodhouse; +Cc: David Miller, saku, rick.jones2, netdev
In-Reply-To: <1354089566.21562.20.camel@shinybook.infradead.org>

>
> Short of going through whole dumps and looking, is there a good way to
> get statistics?
>

Hi David,

I don't believe reordering is tracked on the receiver side, but on the
sender there are SNMP MIB items for it.
They can be viewed using nstat/netstat:

# nstat -az | grep -i reorder
TcpExtTCPFACKReorder            0                  0.0
TcpExtTCPSACKReorder            0                  0.0
TcpExtTCPRenoReorder            0                  0.0
TcpExtTCPTSReorder              0                  0.0
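
If you would rather collect these from a script, here is a minimal sketch
that parses the text of /proc/net/netstat and keeps only the reordering
counters. The function names are made up for this example, and it assumes
the usual layout of that file: pairs of lines, first the counter names and
then the values, both starting with the same "TcpExt:" style prefix.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat text into {"TcpExtName": value}."""
    stats = {}
    lines = text.splitlines()
    # Lines come in pairs: a line of counter names, then a line of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, _, names = header.partition(":")
        value_prefix, _, vals = values.partition(":")
        if prefix != value_prefix:
            continue  # not a matching name/value pair, skip it
        for name, val in zip(names.split(), vals.split()):
            stats[prefix + name] = int(val)
    return stats


def reorder_counters(stats):
    """Keep only counters whose name mentions reordering."""
    return {k: v for k, v in stats.items() if "reorder" in k.lower()}


def read_reorder_counters(path="/proc/net/netstat"):
    with open(path) as f:
        return reorder_counters(parse_netstat(f.read()))
```

Sampling read_reorder_counters() before and after a transfer and taking the
difference gives the number of reordering events seen by the sender during
that transfer.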

Regards,
Vijay

^ permalink raw reply

* Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Joe Jin @ 2012-11-28  8:31 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Fujinaka, Todd, Mary Mcgrath, netdev@vger.kernel.org,
	e1000-devel@lists.sf.net, linux-kernel@vger.kernel.org, linux-pci
In-Reply-To: <1354039840.2701.14.camel@bwh-desktop.uk.solarflarecom.com>

On 11/28/12 02:10, Ben Hutchings wrote:
> On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
>> Forgive me if I'm being too repetitious as I think some of this has
>> been mentioned in the past.
>>
>> We (and by we I mean the Ethernet part and driver) can only change the
>> advertised availability of a larger MaxPayloadSize. The size is
>> negotiated by both sides of the link when the link is established. The
>> driver should not change the size of the link as it would be poking at
>> registers outside of its scope and is controlled by the upstream
>> bridge (not us).
> [...]
> 
> MaxPayloadSize (MPS) is not negotiated between devices but is programmed
> by the system firmware (at least for devices present at boot - the
> kernel may be responsible in case of hotplug).  You can use the kernel
> parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy
> that overrides this, but no policy will allow setting MPS above the
> device's MaxPayloadSizeSupported (MPSS).
> 
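
As a sanity check on the MPS <= MPSS rule described above, here is a small
sketch that decodes the two fields from raw register values. It assumes the
standard PCIe encoding (MPSS in bits 2:0 of Device Capabilities, MPS in
bits 7:5 of Device Control, size = 128 << code); the function names are
made up for this example.

```python
def decode_payload_size(code):
    """Convert the 3-bit PCIe payload-size encoding to bytes (0 -> 128)."""
    if not 0 <= code <= 5:
        # Encodings 6 and 7 are reserved.
        raise ValueError("reserved payload-size encoding: %d" % code)
    return 128 << code


def mpss_from_devcap(devcap):
    """MaxPayloadSizeSupported: bits 2:0 of the Device Capabilities register."""
    return decode_payload_size(devcap & 0x7)


def mps_from_devctl(devctl):
    """Programmed MaxPayloadSize: bits 7:5 of the Device Control register."""
    return decode_payload_size((devctl >> 5) & 0x7)


def mps_is_valid(devcap, devctl):
    """True if the programmed MPS does not exceed what the device supports."""
    return mps_from_devctl(devctl) <= mpss_from_devcap(devcap)
```

The register values themselves would have to come from a config space dump
of each device on the path; comparing mps_from_devctl() across those devices
is one way to spot a mismatch.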

Ben,

Unfortunately I'm using a 3.0.x kernel, which does not include this parameter.
So I'm trying to use ethtool to modify the value in the EEPROM to see whether
that helps.


Todd, I'll review MaxPayload for all devices. But I have to say that if there
is a mismatch, the customer cannot change it from the BIOS because there is no
entry for it there. To test this and verify whether it is the root cause, we
still need to find the offset in the EEPROM.

Thanks in advance,
Joe

^ permalink raw reply

* Re: [PATCH 1/2] smsc75xx: refactor entering suspend modes
From: Steve Glendinning @ 2012-11-28  8:34 UTC (permalink / raw)
  To: Alan Stern; +Cc: Bjørn Mork, netdev, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <Pine.LNX.4.44L0.1211271443150.1489-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>

Hi Alan,

>> udev->do_remote_wakeup is set in choose_wakeup() in
>> drivers/usb/core/driver.c.  AFAICS it is always set as long as
>> device_may_wakeup(&udev->dev) is true.
>
> That's right.  But is device_may_wakeup(&udev->dev) true?
>
> By default it wouldn't be.  The normal way to set it is for the user or
> a program to do:
>
>         echo enabled >/sys/bus/usb/devices/.../power/wakeup
>
> Of course, a driver could disregard the user's choice and set the flag
> by itself.

If I set that from userspace the system is able to resume, but I can't
work out how to successfully set this from the driver.  I believe the
driver should override this: if the user has asked for the device to
wake on LAN, they expect it to resume the system.

I've tried placing various combinations of device_set_wakeup_capable
and device_set_wakeup_enable in different places (bind, suspend), but
it still doesn't allow the device to resume from suspend.  How should
I do this?
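
For reference, the userspace setting quoted above can be scripted roughly
like this. It is only a sketch of the sysfs write, not of the in-driver
fix being asked about; the function names are made up for this example and
the device directory is left as a parameter.

```python
import os


def set_usb_wakeup(device_dir, enabled=True):
    """Write "enabled" or "disabled" to <device_dir>/power/wakeup."""
    path = os.path.join(device_dir, "power", "wakeup")
    with open(path, "w") as f:
        # Same effect as: echo enabled > <device_dir>/power/wakeup
        f.write("enabled\n" if enabled else "disabled\n")
    return path


def get_usb_wakeup(device_dir):
    """Return the current wakeup setting as a string."""
    with open(os.path.join(device_dir, "power", "wakeup")) as f:
        return f.read().strip()
```

Reading the value back with get_usb_wakeup() just before suspending is a
quick way to check whether anything the driver did actually changed the
flag.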

^ permalink raw reply

