Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v6 5/6] PM / Runtime: force memory allocation with no I/O during Runtime PM callbcack
From: Rafael J. Wysocki @ 2012-11-27 21:24 UTC (permalink / raw)
  To: linux-pm
  Cc: Ming Lei, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <1353761958-12810-6-git-send-email-ming.lei@canonical.com>

On Saturday, November 24, 2012 08:59:17 PM Ming Lei wrote:
> This patch applies the introduced memalloc_noio_save() and
> memalloc_noio_restore() to force memory allocation with no I/O
> during runtime_resume/runtime_suspend callback on device with
> the flag of 'memalloc_noio' set.
> 
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Oliver Neukum <oneukum@suse.de>
> Cc: Rafael J. Wysocki <rjw@sisk.pl>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
> v5:
> 	- use inline memalloc_noio_save()
> v4:
> 	- runtime_suspend need this too because rpm_resume may wait for
> 	completion of concurrent runtime_suspend, so deadlock still may
> 	be triggered in runtime_suspend path.
> ---
>  drivers/base/power/runtime.c |   32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
> index 3e198a0..96d99ea 100644
> --- a/drivers/base/power/runtime.c
> +++ b/drivers/base/power/runtime.c
> @@ -371,6 +371,7 @@ static int rpm_suspend(struct device *dev, int rpmflags)
>  	int (*callback)(struct device *);
>  	struct device *parent = NULL;
>  	int retval;
> +	unsigned int noio_flag;
>  
>  	trace_rpm_suspend(dev, rpmflags);
>  
> @@ -480,7 +481,20 @@ static int rpm_suspend(struct device *dev, int rpmflags)
>  	if (!callback && dev->driver && dev->driver->pm)
>  		callback = dev->driver->pm->runtime_suspend;
>  
> -	retval = rpm_callback(callback, dev);
> +	/*
> +	 * Deadlock might be caused if memory allocation with GFP_KERNEL
> +	 * happens inside runtime_suspend callback of one block device's
> +	 * ancestor or the block device itself. Network device might be
> +	 * thought as part of iSCSI block device, so network device and
> +	 * its ancestor should be marked as memalloc_noio.
> +	 */
> +	if (dev->power.memalloc_noio) {
> +		noio_flag = memalloc_noio_save();
> +		retval = rpm_callback(callback, dev);
> +		memalloc_noio_restore(noio_flag);
> +	} else {
> +		retval = rpm_callback(callback, dev);
> +	}
>  	if (retval)
>  		goto fail;
>  
> @@ -563,6 +577,7 @@ static int rpm_resume(struct device *dev, int rpmflags)
>  	int (*callback)(struct device *);
>  	struct device *parent = NULL;
>  	int retval = 0;
> +	unsigned int noio_flag;
>  
>  	trace_rpm_resume(dev, rpmflags);
>  
> @@ -712,7 +727,20 @@ static int rpm_resume(struct device *dev, int rpmflags)
>  	if (!callback && dev->driver && dev->driver->pm)
>  		callback = dev->driver->pm->runtime_resume;
>  
> -	retval = rpm_callback(callback, dev);
> +	/*
> +	 * Deadlock might be caused if memory allocation with GFP_KERNEL
> +	 * happens inside runtime_resume callback of one block device's
> +	 * ancestor or the block device itself. Network device might be
> +	 * thought as part of iSCSI block device, so network device and
> +	 * its ancestor should be marked as memalloc_noio.
> +	 */
> +	if (dev->power.memalloc_noio) {
> +		noio_flag = memalloc_noio_save();
> +		retval = rpm_callback(callback, dev);
> +		memalloc_noio_restore(noio_flag);
> +	} else {
> +		retval = rpm_callback(callback, dev);
> +	}

Please don't duplicate code this way.

You can move that whole thing to rpm_callback().  Yes, you'll probably need to
check dev->power.memalloc_noio twice in there, but that's OK.


>  	if (retval) {
>  		__update_runtime_status(dev, RPM_SUSPENDED);
>  		pm_runtime_cancel_pending(dev);
> 

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Eric Dumazet @ 2012-11-27 21:23 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354043156.2701.17.camel@bwh-desktop.uk.solarflarecom.com>

On Tue, 2012-11-27 at 19:05 +0000, Ben Hutchings wrote:
> On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:

> >  struct sock_common {
> > -	/* skc_daddr and skc_rcv_saddr must be grouped :
> > -	 * cf INET_MATCH() and INET_TW_MATCH()
> > +	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
> > +	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
> 
> __aligned(8)?

Nope, only on 64 bit this requirement exists (since a long time)

I am not sure we want complexity on this.

And we dont want holes to be automatically added here neither.

^ permalink raw reply

* Re: [PATCH v6 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
From: Rafael J. Wysocki @ 2012-11-27 21:19 UTC (permalink / raw)
  To: linux-pm
  Cc: Ming Lei, linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-mm
In-Reply-To: <1353761958-12810-3-git-send-email-ming.lei@canonical.com>

On Saturday, November 24, 2012 08:59:14 PM Ming Lei wrote:
> The patch introduces the flag of memalloc_noio in 'struct dev_pm_info'
> to help PM core to teach mm not allocating memory with GFP_KERNEL
> flag for avoiding probable deadlock.
> 
> As explained in the comment, any GFP_KERNEL allocation inside
> runtime_resume() or runtime_suspend() on any one of device in
> the path from one block or network device to the root device
> in the device tree may cause deadlock, the introduced
> pm_runtime_set_memalloc_noio() sets or clears the flag on
> device in the path recursively.
> 
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
> v5:
> 	- fix code style error
> 	- add comment on clear the device memalloc_noio flag
> v4:
> 	- rename memalloc_noio_resume as memalloc_noio
> 	- remove pm_runtime_get_memalloc_noio()
> 	- add comments on pm_runtime_set_memalloc_noio
> v3:
> 	- introduce pm_runtime_get_memalloc_noio()
> 	- hold one global lock on pm_runtime_set_memalloc_noio
> 	- hold device power lock when accessing memalloc_noio_resume
> 	  flag suggested by Alan Stern
> 	- implement pm_runtime_set_memalloc_noio without recursion
> 	  suggested by Alan Stern
> v2:
> 	- introduce pm_runtime_set_memalloc_noio()
> ---
>  drivers/base/power/runtime.c |   60 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pm.h           |    1 +
>  include/linux/pm_runtime.h   |    3 +++
>  3 files changed, 64 insertions(+)
> 
> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
> index 3148b10..3e198a0 100644
> --- a/drivers/base/power/runtime.c
> +++ b/drivers/base/power/runtime.c
> @@ -124,6 +124,66 @@ unsigned long pm_runtime_autosuspend_expiration(struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(pm_runtime_autosuspend_expiration);
>  
> +static int dev_memalloc_noio(struct device *dev, void *data)
> +{
> +	return dev->power.memalloc_noio;
> +}
> +
> +/*
> + * pm_runtime_set_memalloc_noio - Set a device's memalloc_noio flag.
> + * @dev: Device to handle.
> + * @enable: True for setting the flag and False for clearing the flag.
> + *
> + * Set the flag for all devices in the path from the device to the
> + * root device in the device tree if @enable is true, otherwise clear
> + * the flag for devices in the path whose siblings don't set the flag.
> + *

Please use counters instead of walking the whole path every time.  Ie. in
addition to the flag add a counter to store the number of the device's
children having that flag set.

Besides, don't you need to check children for the arg device itself?

> + * The function should only be called by block device, or network
> + * device driver for solving the deadlock problem during runtime
> + * resume/suspend:
> + *
> + *     If memory allocation with GFP_KERNEL is called inside runtime
> + *     resume/suspend callback of any one of its ancestors(or the
> + *     block device itself), the deadlock may be triggered inside the
> + *     memory allocation since it might not complete until the block
> + *     device becomes active and the involed page I/O finishes. The
> + *     situation is pointed out first by Alan Stern. Network device
> + *     are involved in iSCSI kind of situation.
> + *
> + * The lock of dev_hotplug_mutex is held in the function for handling
> + * hotplug race because pm_runtime_set_memalloc_noio() may be called
> + * in async probe().
> + *
> + * The function should be called between device_add() and device_del()
> + * on the affected device(block/network device).
> + */
> +void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
> +{
> +	static DEFINE_MUTEX(dev_hotplug_mutex);

What's the mutex for?

> +
> +	mutex_lock(&dev_hotplug_mutex);
> +	for (;;) {
> +		/* hold power lock since bitfield is not SMP-safe. */
> +		spin_lock_irq(&dev->power.lock);
> +		dev->power.memalloc_noio = enable;
> +		spin_unlock_irq(&dev->power.lock);
> +
> +		dev = dev->parent;
> +
> +		/*
> +		 * clear flag of the parent device only if all the
> +		 * children don't set the flag because ancestor's
> +		 * flag was set by any one of the descendants.
> +		 */
> +		if (!dev || (!enable &&
> +			     device_for_each_child(dev, NULL,
> +						   dev_memalloc_noio)))
> +			break;
> +	}
> +	mutex_unlock(&dev_hotplug_mutex);
> +}
> +EXPORT_SYMBOL_GPL(pm_runtime_set_memalloc_noio);
> +
>  /**
>   * rpm_check_suspend_allowed - Test whether a device may be suspended.
>   * @dev: Device to test.
> diff --git a/include/linux/pm.h b/include/linux/pm.h
> index 03d7bb1..1a8a69d 100644
> --- a/include/linux/pm.h
> +++ b/include/linux/pm.h
> @@ -538,6 +538,7 @@ struct dev_pm_info {
>  	unsigned int		irq_safe:1;
>  	unsigned int		use_autosuspend:1;
>  	unsigned int		timer_autosuspends:1;
> +	unsigned int		memalloc_noio:1;
>  	enum rpm_request	request;
>  	enum rpm_status		runtime_status;
>  	int			runtime_error;
> diff --git a/include/linux/pm_runtime.h b/include/linux/pm_runtime.h
> index f271860..775e063 100644
> --- a/include/linux/pm_runtime.h
> +++ b/include/linux/pm_runtime.h
> @@ -47,6 +47,7 @@ extern void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);
>  extern unsigned long pm_runtime_autosuspend_expiration(struct device *dev);
>  extern void pm_runtime_update_max_time_suspended(struct device *dev,
>  						 s64 delta_ns);
> +extern void pm_runtime_set_memalloc_noio(struct device *dev, bool enable);
>  
>  static inline bool pm_children_suspended(struct device *dev)
>  {
> @@ -149,6 +150,8 @@ static inline void pm_runtime_set_autosuspend_delay(struct device *dev,
>  						int delay) {}
>  static inline unsigned long pm_runtime_autosuspend_expiration(
>  				struct device *dev) { return 0; }
> +static inline void pm_runtime_set_memalloc_noio(struct device *dev,
> +						bool enable){}
>  
>  #endif /* !CONFIG_PM_RUNTIME */
>  
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH v3 net-next] sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name
From: Pavel Emelyanov @ 2012-11-27 20:36 UTC (permalink / raw)
  To: Brian Haley; +Cc: David Miller, Eric Dumazet, netdev@vger.kernel.org
In-Reply-To: <50B388E4.7030600@hp.com>

On 11/26/2012 07:21 PM, Brian Haley wrote:
> Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
> will then require another call like if_indextoname() to get the actual interface
> name, have it return the name directly.
> 
> This also matches the existing man page description on socket(7) which mentions
> the argument being an interface name.
> 
> If the value has not been set, zero is returned and optlen will be set to zero
> to indicate there is no interface name present.
> 
> Added a seqlock to protect this code path, and dev_ifname(), from someone
> changing the device name via dev_change_name().
> 
> v2: Added seqlock protection while copying device name.
> 
> v3: Fixed word wrap in patch.
> 
> Signed-off-by: Brian Haley <brian.haley@hp.com>

Acked-by: Pavel Emelyanov <xemul@parallels.com>

^ permalink raw reply

* Re: [PATCH v3 net-next] sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name
From: Ben Hutchings @ 2012-11-27 20:34 UTC (permalink / raw)
  To: Brian Haley
  Cc: David Miller, Pavel Emelyanov, Eric Dumazet,
	netdev@vger.kernel.org
In-Reply-To: <50B388E4.7030600@hp.com>

On Mon, 2012-11-26 at 10:21 -0500, Brian Haley wrote:
> Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
> will then require another call like if_indextoname() to get the actual interface
> name, have it return the name directly.
> 
> This also matches the existing man page description on socket(7) which mentions
> the argument being an interface name.
> 
> If the value has not been set, zero is returned and optlen will be set to zero
> to indicate there is no interface name present.
> 
> Added a seqlock to protect this code path, and dev_ifname(), from someone
> changing the device name via dev_change_name().
> 
> v2: Added seqlock protection while copying device name.
> 
> v3: Fixed word wrap in patch.
> 
> Signed-off-by: Brian Haley <brian.haley@hp.com>
[...]

Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 1/1] Introduce notification events for routing changes
From: Chris Wilson @ 2012-11-27 20:32 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netdev, netfilter-devel
In-Reply-To: <1354048045-17846-2-git-send-email-kadlec@blackhole.kfki.hu>

Hi Joszef,

On Tue, 27 Nov 2012, Jozsef Kadlecsik wrote:

> The netfilter MASQUERADE target does not handle the case when the routing
> changes and the source address of existing connections become invalid.
> The problem can be solved if routing modifications create events to which
> the MASQUERADE target can subscribe and then delete the affected
> connections.

I would like to thank you personally for picking this up and running with 
it, developing the patches and pushing for their inclusion. Thank you. 
Your hard work is very much appreciated here.

Cheers, Chris.
-- 
Aptivate | http://www.aptivate.org | Phone: +44 1223 967 838
Future Business, Cam City FC, Milton Rd, Cambridge, CB4 1UY, UK

Aptivate is a not-for-profit company registered in England and Wales
with company number 04980791.

^ permalink raw reply

* [PATCH 1/1] Introduce notification events for routing changes
From: Jozsef Kadlecsik @ 2012-11-27 20:27 UTC (permalink / raw)
  To: netdev; +Cc: netfilter-devel, Jozsef Kadlecsik
In-Reply-To: <1354048045-17846-1-git-send-email-kadlec@blackhole.kfki.hu>

The netfilter MASQUERADE target does not handle the case when the routing
changes and the source address of existing connections become invalid.
The problem can be solved if routing modifications create events to which
the MASQUERADE target can subscribe and then delete the affected
connections.

The patch adds the required event support for IPv4/IPv6.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/inetdevice.h |    2 ++
 include/linux/netdevice.h  |    1 +
 include/net/ip6_route.h    |    3 ++-
 net/ipv4/fib_trie.c        |   17 +++++++++++++++++
 net/ipv6/route.c           |   21 +++++++++++++++++++++
 5 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index d032780..cf16dab 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -170,6 +170,8 @@ struct in_ifaddr {
 
 extern int register_inetaddr_notifier(struct notifier_block *nb);
 extern int unregister_inetaddr_notifier(struct notifier_block *nb);
+extern int register_iproute_notifier(struct notifier_block *nb);
+extern int unregister_iproute_notifier(struct notifier_block *nb);
 
 extern struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref);
 static inline struct net_device *ip_dev_find(struct net *net, __be32 addr)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e9929ab..cd53253 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1559,6 +1559,7 @@ struct packet_offload {
 #define NETDEV_RELEASE		0x0012
 #define NETDEV_NOTIFY_PEERS	0x0013
 #define NETDEV_JOIN		0x0014
+#define NETDEV_ROUTE_CHANGED	0x0015
 
 extern int register_netdevice_notifier(struct notifier_block *nb);
 extern int unregister_netdevice_notifier(struct notifier_block *nb);
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 27d8318..e3c079d 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -149,7 +149,8 @@ extern int rt6_dump_route(struct rt6_info *rt, void *p_arg);
 extern void rt6_ifdown(struct net *net, struct net_device *dev);
 extern void rt6_mtu_change(struct net_device *dev, unsigned int mtu);
 extern void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
-
+extern int register_ip6route_notifier(struct notifier_block *nb);
+extern int unregister_ip6route_notifier(struct notifier_block *nb);
 
 /*
  *	Store a destination cache entry in a socket
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 31d771c..ee6f968 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -178,6 +178,8 @@ static const int sync_pages = 128;
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
 
+static BLOCKING_NOTIFIER_HEAD(iproute_chain);
+
 /*
  * caller must hold RTNL
  */
@@ -1337,6 +1339,8 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
 	rtmsg_fib(RTM_NEWROUTE, htonl(key), new_fa, plen, tb->tb_id,
 		  &cfg->fc_nlinfo, 0);
 succeeded:
+	blocking_notifier_call_chain(&iproute_chain,
+				     NETDEV_ROUTE_CHANGED, fi);
 	return 0;
 
 out_free_new_fa:
@@ -1713,6 +1717,8 @@ int fib_table_delete(struct fib_table *tb, struct fib_config *cfg)
 	if (fa->fa_state & FA_S_ACCESSED)
 		rt_cache_flush(cfg->fc_nlinfo.nl_net);
 
+	blocking_notifier_call_chain(&iproute_chain,
+				     NETDEV_ROUTE_CHANGED, fa->fa_info);
 	fib_release_info(fa->fa_info);
 	alias_free_mem_rcu(fa);
 	return 0;
@@ -1979,6 +1985,17 @@ void __init fib_trie_init(void)
 					   0, SLAB_PANIC, NULL);
 }
 
+int register_iproute_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&iproute_chain, nb);
+}
+EXPORT_SYMBOL(register_iproute_notifier);
+
+int unregister_iproute_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&iproute_chain, nb);
+}
+EXPORT_SYMBOL(unregister_iproute_notifier);
 
 struct fib_table *fib_trie_table(u32 id)
 {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8f124f5..5d086d2 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -94,6 +94,8 @@ static struct rt6_info *rt6_get_route_info(struct net *net,
 					   const struct in6_addr *gwaddr, int ifindex);
 #endif
 
+static ATOMIC_NOTIFIER_HEAD(ip6route_chain);
+
 static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
 {
 	struct rt6_info *rt = (struct rt6_info *) dst;
@@ -818,6 +820,10 @@ static int __ip6_ins_rt(struct rt6_info *rt, struct nl_info *info)
 	err = fib6_add(&table->tb6_root, rt, info);
 	write_unlock_bh(&table->tb6_lock);
 
+	if (!err)
+		atomic_notifier_call_chain(&ip6route_chain,
+					   NETDEV_ROUTE_CHANGED, rt);
+
 	return err;
 }
 
@@ -1652,6 +1658,9 @@ static int __ip6_del_rt(struct rt6_info *rt, struct nl_info *info)
 	err = fib6_del(rt, info);
 	write_unlock_bh(&table->tb6_lock);
 
+	if (!err)
+		atomic_notifier_call_chain(&ip6route_chain,
+					   NETDEV_ROUTE_CHANGED, rt);
 out:
 	ip6_rt_put(rt);
 	return err;
@@ -2788,6 +2797,18 @@ static int ip6_route_dev_notify(struct notifier_block *this,
 	return NOTIFY_OK;
 }
 
+int register_ip6route_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&ip6route_chain, nb);
+}
+EXPORT_SYMBOL(register_ip6route_notifier);
+
+int unregister_ip6route_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&ip6route_chain, nb);
+}
+EXPORT_SYMBOL(unregister_ip6route_notifier);
+
 /*
  *	/proc
  */
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 0/1] Introduce notification events for routing changes
From: Jozsef Kadlecsik @ 2012-11-27 20:27 UTC (permalink / raw)
  To: netdev; +Cc: netfilter-devel

Hi Dave,

Please consider applying the following patch for net-next, which introduce
events for routing changes. With it, the MASQUERADE target is able to clean
up connections with wrong NATed source addresses after routing changed (backup
default route, VPNs).

Best regards,
Jozsef

Jozsef Kadlecsik (1):
  Introduce notification events for routing changes

 include/linux/inetdevice.h |    2 ++
 include/linux/netdevice.h  |    1 +
 include/net/ip6_route.h    |    3 ++-
 net/ipv4/fib_trie.c        |   17 +++++++++++++++++
 net/ipv6/route.c           |   21 +++++++++++++++++++++
 5 files changed, 43 insertions(+), 1 deletions(-)

^ permalink raw reply

* Re: [PATCH v6 0/6] solve deadlock caused by memory allocation with I/O
From: Andrew Morton @ 2012-11-27 20:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Alan Stern, Oliver Neukum, Minchan Kim,
	Greg Kroah-Hartman, Rafael J. Wysocki, Jens Axboe,
	David S. Miller, netdev, linux-usb, linux-pm, linux-mm
In-Reply-To: <1353761958-12810-1-git-send-email-ming.lei@canonical.com>

On Sat, 24 Nov 2012 20:59:12 +0800
Ming Lei <ming.lei@canonical.com> wrote:

> This patchset try to solve one deadlock problem which might be caused
> by memory allocation with block I/O during runtime PM and block device
> error handling path. Traditionly, the problem is addressed by passing
> GFP_NOIO statically to mm, but that is not a effective solution, see
> detailed description in patch 1's commit log.
> 
> This patch set introduces one process flag and trys to fix the deadlock
> problem on block device/network device during runtime PM or usb bus reset.
> 
> The 1st one is the change on include/sched.h and mm.
> 
> The 2nd patch introduces the flag of memalloc_noio on 'dev_pm_info',
> and pm_runtime_set_memalloc_noio(), so that PM Core can teach mm to not
> allocate mm with GFP_IO during the runtime_resume callback only on
> device with the flag set.
> 
> The following 2 patches apply the introduced pm_runtime_set_memalloc_noio()
> to mark all devices as memalloc_noio_resume in the path from the block or
> network device to the root device in device tree.
> 
> The last 2 patches are applied again PM and USB subsystem to demonstrate
> how to use the introduced mechanism to fix the deadlock problem.
> 
> Andrew, could you queue these patches into your tree since V6 fixes all
> your concerns and looks no one objects these patches?

Yes, this patchset looks ready to run with.  But as we're at -rc7 I'll ask
you to refresh, retest and resend after 3.8-rc1, please.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* DHCP response received after 500mS not handled while having multiple Ethernet interfaces
From: Mugunthan V N @ 2012-11-27 19:58 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: davem@davemloft.net

Hi

While testing AM335X which has Dual EMAC feature, I came across an issue.

Setup:
     Board:  AM335X-EVMSK with Dual EMAC implementation with eth0 connected
             and eth1 not connected to switch.
     Router: Trendnet Wireless router TEW-652BRP

Issue:
    While booting with multiple Ethernet interface, kernel logs as
    "DHCP/BOOTP: Ignoring delayed packet" and ignores the DHCP response 
packet.

Analysis:
The router is giving a DHCP response only after 500mS which I confirmed by
capturing Ethernet packets
While looking at net/ipv4/ipconfig.c, it seems that waiting period of DHCP
response with multiple interface is 500mS, if we get a response from the
server after 500mS, it is ignored as ic_dev_xid will hold the next interface
xid which is updated while sending DHCP request after timeout on the next
interface.

Is this 500mS waiting for response period is per protocol?

In my view even if we get a response after 500mS and DHCP request has moved
to next interface we need to handle the response.
Can someone comment on this approach is right or is there some other way to
handle the issue.

Regards
Mugunthan V N

^ permalink raw reply

* Re: [PATCH 1/2] smsc75xx: refactor entering suspend modes
From: Alan Stern @ 2012-11-27 19:46 UTC (permalink / raw)
  To: Bjørn Mork; +Cc: Steve Glendinning, netdev, linux-usb
In-Reply-To: <87y5hm9ap5.fsf@nemi.mork.no>

On Tue, 27 Nov 2012, Bjørn Mork wrote:

> Steve Glendinning <steve@shawell.net> writes:
> 
> > Hi Bjorn,
> >
> > On 27 November 2012 17:21, Steve Glendinning <steve@shawell.net> wrote:
> >> Hi Bjorn,
> >>
> >>>> +     smsc75xx_set_feature(dev, USB_DEVICE_REMOTE_WAKEUP);
> >>>
> >>> As mentioned in another comment to the smsc95xx driver: This is weird.
> >>> Do you really need to do that?
> >>>
> >>> This is an USB interface driver.  The USB device is handled by the
> >>> generic "usb" driver, which will do the right thing.  See
> >>> drivers/usb/generic.c and drivers/usb/core/hub.c
> >>
> >> Thanks, I've tested removing all these calls from the driver and
> >> wakeup functionality seems to still work.
> >>
> >> I'll resubmit my smsc75xx enhancement patchset with this change once
> >> I've done some more testing.
> >
> > Further testing shows that removing these calls stop wakeup from
> > system suspend working (although don't appear to impact runtime
> > autosuspend).  Have I missed a flag or somewhere that causes
> > udev->do_remote_wakeup to be set in the code you posted?
> 
> udev->do_remote_wakeup is set in choose_wakeup() in
> drivers/usb/core/driver.c.  AFAICS it is always set as long as
> device_may_wakeup(&udev->dev) is true.

That's right.  But is device_may_wakeup(&udev->dev) true?

By default it wouldn't be.  The normal way to set it is for the user or 
a program to do:

	echo enabled >/sys/bus/usb/devices/.../power/wakeup

Of course, a driver could disregard the user's choice and set the flag 
by itself.

Alan Stern

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Ben Hutchings @ 2012-11-27 19:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354028815.14302.35.camel@edumazet-glaptop>

On Tue, 2012-11-27 at 07:06 -0800, Eric Dumazet wrote:
[...]
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -112,6 +112,9 @@ struct inet_timewait_sock {
>  #define tw_net			__tw_common.skc_net
>  #define tw_daddr        	__tw_common.skc_daddr
>  #define tw_rcv_saddr    	__tw_common.skc_rcv_saddr
> +#define tw_dport		__tw_common.skc_dport
> +#define tw_num			__tw_common.skc_num
> +
>  	int			tw_timeout;
>  	volatile unsigned char	tw_substate;
>  	unsigned char		tw_rcv_wscale;
> @@ -119,8 +122,6 @@ struct inet_timewait_sock {
>  	/* Socket demultiplex comparisons on incoming packets. */
>  	/* these three are in inet_sock */
>  	__be16			tw_sport;
> -	__be16			tw_dport;
> -	__u16			tw_num;
>  	kmemcheck_bitfield_begin(flags);
>  	/* And these are ours. */
>  	unsigned int		tw_ipv6only     : 1,
> diff --git a/include/net/sock.h b/include/net/sock.h
> index c945fba..e4bab2e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -132,6 +132,8 @@ struct net;
>   *	@skc_rcv_saddr: Bound local IPv4 addr
>   *	@skc_hash: hash value used with various protocol lookup tables
>   *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
> + *	@skc_dport: placeholder for inet_dport/tw_dport
> + *	@skc_num: placeholder for inet_num/tw_num
>   *	@skc_family: network address family
>   *	@skc_state: Connection state
>   *	@skc_reuse: %SO_REUSEADDR setting
> @@ -149,8 +151,8 @@ struct net;
>   *	for struct sock and struct inet_timewait_sock.
>   */
>  struct sock_common {
> -	/* skc_daddr and skc_rcv_saddr must be grouped :
> -	 * cf INET_MATCH() and INET_TW_MATCH()
> +	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
> +	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()

__aligned(8)?

>  	 */
>  	__be32			skc_daddr;
>  	__be32			skc_rcv_saddr;
> @@ -159,6 +161,10 @@ struct sock_common {
>  		unsigned int	skc_hash;
>  		__u16		skc_u16hashes[2];
>  	};
> +	/* skc_dport && skc_num must be grouped as well */
> +	__be16			skc_dport;

__aligned(4)?

> +	__u16			skc_num;
> +
>  	unsigned short		skc_family;
>  	volatile unsigned char	skc_state;
>  	unsigned char		skc_reuse;

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH v2 3/3] pppoatm: protect against freeing of vcc
From: chas williams - CONTRACTOR @ 2012-11-27 18:54 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Krzysztof Mazur, davem, netdev, linux-kernel, nathan
In-Reply-To: <1354039349.2534.11.camel@shinybook.infradead.org>

On Tue, 27 Nov 2012 18:02:29 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> In solos-pci at least, the ops->close() function doesn't flush all
> pending skbs for this vcc before returning. So can be a tasklet
> somewhere which has loaded the address of the vcc->pop function from one
> of them, and is going to call it in some unspecified amount of time.
> 
> Should we make the device's ->close function wait for all TX and RX skbs
> for this vcc to complete? 

the driver's close routine should wait for any of the pending tx and rx
to complete.  take a look at the he.c in driver/atm

^ permalink raw reply

* Re: [PATCH v2 3/3] pppoatm: protect against freeing of vcc
From: Krzysztof Mazur @ 2012-11-27 18:39 UTC (permalink / raw)
  To: David Woodhouse; +Cc: davem, netdev, linux-kernel, nathan
In-Reply-To: <1354039349.2534.11.camel@shinybook.infradead.org>

On Tue, Nov 27, 2012 at 06:02:29PM +0000, David Woodhouse wrote:
> 
> I'm not running with that patch. This bug exists for br2684 even before
> it, and I think also for pppoatm.
> 

Did you use your "atm: br2684: Fix excessive queue bloat" patch?

With that patch for pppoatm the dev->close()/pppoatm_send() race
was much easier to trigger. Maybe with an equivalent patch for br2684
the races are also easier triggerable.

Krzysiek

^ permalink raw reply

* [PATCH net 2/2] openvswitch: Store flow key len if ARP opcode is not request or reply.
From: Jesse Gross @ 2012-11-27 18:37 UTC (permalink / raw)
  To: David Miller
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	Mehak Mahajan
In-Reply-To: <1354041423-3050-1-git-send-email-jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>

From: Mehak Mahajan <mmahajan-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>

We currently only extract the ARP payload if the opcode indicates
that it is a request or reply.  However, we also only set the
key length in these situations even though it should still be
possible to match on the opcode.  There's no real reason to
restrict the ARP opcode since all have the same format so this
simply removes the check.

Signed-off-by: Mehak Mahajan <mmahajan-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
---
 net/openvswitch/flow.c |   14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 98c7063..733cbf4 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -702,15 +702,11 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 			/* We only match on the lower 8 bits of the opcode. */
 			if (ntohs(arp->ar_op) <= 0xff)
 				key->ip.proto = ntohs(arp->ar_op);
-
-			if (key->ip.proto == ARPOP_REQUEST
-					|| key->ip.proto == ARPOP_REPLY) {
-				memcpy(&key->ipv4.addr.src, arp->ar_sip, sizeof(key->ipv4.addr.src));
-				memcpy(&key->ipv4.addr.dst, arp->ar_tip, sizeof(key->ipv4.addr.dst));
-				memcpy(key->ipv4.arp.sha, arp->ar_sha, ETH_ALEN);
-				memcpy(key->ipv4.arp.tha, arp->ar_tha, ETH_ALEN);
-				key_len = SW_FLOW_KEY_OFFSET(ipv4.arp);
-			}
+			memcpy(&key->ipv4.addr.src, arp->ar_sip, sizeof(key->ipv4.addr.src));
+			memcpy(&key->ipv4.addr.dst, arp->ar_tip, sizeof(key->ipv4.addr.dst));
+			memcpy(key->ipv4.arp.sha, arp->ar_sha, ETH_ALEN);
+			memcpy(key->ipv4.arp.tha, arp->ar_tha, ETH_ALEN);
+			key_len = SW_FLOW_KEY_OFFSET(ipv4.arp);
 		}
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		int nh_len;             /* IPv6 Header + Extensions */
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net 1/2] openvswitch: Print device when warning about over MTU packets.
From: Jesse Gross @ 2012-11-27 18:37 UTC (permalink / raw)
  To: David Miller; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1354041423-3050-1-git-send-email-jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>

If an attempt is made to transmit a packet that is over the device's
MTU then we log it using the datapath's name.  However, it is much
more helpful to use the device name instead.

Signed-off-by: Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
---
 net/openvswitch/vport-netdev.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 3c1e58b..a903348 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -158,7 +158,7 @@ static int netdev_send(struct vport *vport, struct sk_buff *skb)

 	if (unlikely(packet_length(skb) > mtu && !skb_is_gso(skb))) {
 		net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
-				     ovs_dp_name(vport->dp),
+				     netdev_vport->dev->name,
 				     packet_length(skb), mtu);
 		goto error;
 	}
-- 
1.7.9.5

^ permalink raw reply related

* [GIT net] Open vSwitch
From: Jesse Gross @ 2012-11-27 18:37 UTC (permalink / raw)
  To: David Miller; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA

These two small bug fixes are intended for 3.7/net if there is still time.

The following changes since commit ddffeb8c4d0331609ef2581d84de4d763607bd37:

  Linux 3.7-rc1 (2012-10-14 14:41:04 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch.git fixes

for you to fetch changes up to d04d382980c86bdee9960c3eb157a73f8ed230cc:

  openvswitch: Store flow key len if ARP opcode is not request or reply. (2012-10-30 17:17:09 -0700)

----------------------------------------------------------------
Jesse Gross (1):
      openvswitch: Print device when warning about over MTU packets.

Mehak Mahajan (1):
      openvswitch: Store flow key len if ARP opcode is not request or reply.

 net/openvswitch/flow.c         |   14 +++++---------
 net/openvswitch/vport-netdev.c |    2 +-
 2 files changed, 6 insertions(+), 10 deletions(-)

^ permalink raw reply

* Re: [PATCH 1/1] net: ethernet: cpsw: fix build warnings for CPSW when CPTS not selected
From: Richard Cochran @ 2012-11-27 18:31 UTC (permalink / raw)
  To: Mugunthan V N; +Cc: netdev, davem, linux-arm-kernel, linux-omap
In-Reply-To: <1354038820-11095-1-git-send-email-mugunthanvnm@ti.com>

On Tue, Nov 27, 2012 at 11:23:40PM +0530, Mugunthan V N wrote:
>   CC      drivers/net/ethernet/ti/cpsw.o
> drivers/net/ethernet/ti/cpsw.c: In function 'cpsw_ndo_ioctl':
> drivers/net/ethernet/ti/cpsw.c:881:20: warning: unused variable 'priv'
> 
> The build warning is generated when CPTS is not selected in Kernel Build.
> Fixing by passing the net_device pointer to cpts IOCTL instead of passing priv
> 
> Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>

Thanks for this fix.

Acked-by: Richard Cochran <richardcochran@gmail.com>

^ permalink raw reply

* Re: [PATCH v2 3/3] pppoatm: protect against freeing of vcc
From: Krzysztof Mazur @ 2012-11-27 18:28 UTC (permalink / raw)
  To: David Woodhouse; +Cc: davem, netdev, linux-kernel, nathan
In-Reply-To: <1354039349.2534.11.camel@shinybook.infradead.org>

On Tue, Nov 27, 2012 at 06:02:29PM +0000, David Woodhouse wrote:
> On Tue, 2012-11-27 at 18:39 +0100, Krzysztof Mazur wrote:
> > Yes, I missed that one - it's even worse, I introduced that bug 
> > in "[PATCH 1/7] atm: detach protocol before closing vcc". Before that
> > patch that scenario shouldn't happen because vcc was closed before
> > calling pppoatm_send(vcc, NULL) - the driver should provide appropriate
> > synchronization.
> > 
> > I think that we should just drop that patch. With later changes it's not
> > necessary - the pppoatm_send() can be safely called while closing vcc.
> 
> I'm not running with that patch. This bug exists for br2684 even before
> it, and I think also for pppoatm.
> 
> In solos-pci at least, the ops->close() function doesn't flush all
> pending skbs for this vcc before returning. So can be a tasklet
> somewhere which has loaded the address of the vcc->pop function from one
> of them, and is going to call it in some unspecified amount of time.
> 
> Should we make the device's ->close function wait for all TX and RX skbs
> for this vcc to complete? 

Yes, the ->close() can sleep and after vcc is closed the ->pop() shouldn't be
called.


While reviewing your br2684 patch I also found that some ATM drivers does
not call ->pop() when ->send() fails, they should do:

	if (vcc->pop)
		vcc->pop(vcc, skb);
	else
		dev_kfree_skb(skb);

but some drivers just call dev_kfree_skb(skb).

I think that we should add atm_pop() function that does that and fix all
drivers.

Krzysiek

^ permalink raw reply

* RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Fujinaka, Todd @ 2012-11-27 18:24 UTC (permalink / raw)
  To: Ben Hutchings, Mary Mcgrath
  Cc: Joe Jin, netdev@vger.kernel.org, e1000-devel@lists.sf.net,
	linux-kernel@vger.kernel.org, linux-pci
In-Reply-To: <1354039840.2701.14.camel@bwh-desktop.uk.solarflarecom.com>

Thanks for the clarification. I was just going by the PCIe spec, which says the lowest value of both ends is used, and I figured SOMETHING had to be looking at that and doing some sort of negotiation. I'm no BIOS guy, so I'm not sure what's actually going on, whether something walks the PCIe tree or if the BIOS just sets all the values to the minimum.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujinaka@intel.com
(503) 712-4565

-----Original Message-----
From: Ben Hutchings [mailto:bhutchings@solarflare.com] 
Sent: Tuesday, November 27, 2012 10:11 AM
To: Fujinaka, Todd; Mary Mcgrath
Cc: Joe Jin; netdev@vger.kernel.org; e1000-devel@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci
Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang

On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
> Forgive me if I'm being too repetitious as I think some of this has 
> been mentioned in the past.
> 
> We (and by we I mean the Ethernet part and driver) can only change the 
> advertised availability of a larger MaxPayloadSize. The size is 
> negotiated by both sides of the link when the link is established. The 
> driver should not change the size of the link as it would be poking at 
> registers outside of its scope and is controlled by the upstream 
> bridge (not us).
[...]

MaxPayloadSize (MPS) is not negotiated between devices but is programmed by the system firmware (at least for devices present at boot - the kernel may be responsible in case of hotplug).  You can use the kernel parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy that overrides this, but no policy will allow setting MPS above the device's MaxPayloadSizeSupported (MPSS).

(These parameters are not documented in
Documentation/kernel-parameters.txt!  Someone ought to fix that.)

Ben.

--
Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 1/2] smsc75xx: refactor entering suspend modes
From: Bjørn Mork @ 2012-11-27 18:12 UTC (permalink / raw)
  To: Steve Glendinning; +Cc: netdev, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CAKh2mn7f94a7La_EQ9L_qwr7FZEXfawUiAajPMDO8ScS2fJupg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Steve Glendinning <steve-nksJyM/082jR7s880joybQ@public.gmane.org> writes:

> Hi Bjorn,
>
> On 27 November 2012 17:21, Steve Glendinning <steve-nksJyM/082jR7s880joybQ@public.gmane.org> wrote:
>> Hi Bjorn,
>>
>>>> +     smsc75xx_set_feature(dev, USB_DEVICE_REMOTE_WAKEUP);
>>>
>>> As mentioned in another comment to the smsc95xx driver: This is weird.
>>> Do you really need to do that?
>>>
>>> This is an USB interface driver.  The USB device is handled by the
>>> generic "usb" driver, which will do the right thing.  See
>>> drivers/usb/generic.c and drivers/usb/core/hub.c
>>
>> Thanks, I've tested removing all these calls from the driver and
>> wakeup functionality seems to still work.
>>
>> I'll resubmit my smsc75xx enhancement patchset with this change once
>> I've done some more testing.
>
> Further testing shows that removing these calls stop wakeup from
> system suspend working (although don't appear to impact runtime
> autosuspend).  Have I missed a flag or somewhere that causes
> udev->do_remote_wakeup to be set in the code you posted?

udev->do_remote_wakeup is set in choose_wakeup() in
drivers/usb/core/driver.c.  AFAICS it is always set as long as
device_may_wakeup(&udev->dev) is true.

But I am just trying to get a grasp of this code.  Others on the
linux-usb list will know these things much better than me...

Just a thought: Did you remember to remove the clear_feature you have in
resume?


Bjørn
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Ben Hutchings @ 2012-11-27 18:10 UTC (permalink / raw)
  To: Fujinaka, Todd, Mary Mcgrath
  Cc: Joe Jin, netdev@vger.kernel.org, e1000-devel@lists.sf.net,
	linux-kernel@vger.kernel.org, linux-pci
In-Reply-To: <9B4A1B1917080E46B64F07F2989DADD62F2D8070@ORSMSX102.amr.corp.intel.com>

On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
> Forgive me if I'm being too repetitious as I think some of this has
> been mentioned in the past.
> 
> We (and by we I mean the Ethernet part and driver) can only change the
> advertised availability of a larger MaxPayloadSize. The size is
> negotiated by both sides of the link when the link is established. The
> driver should not change the size of the link as it would be poking at
> registers outside of its scope and is controlled by the upstream
> bridge (not us).
[...]

MaxPayloadSize (MPS) is not negotiated between devices but is programmed
by the system firmware (at least for devices present at boot - the
kernel may be responsible in case of hotplug).  You can use the kernel
parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy
that overrides this, but no policy will allow setting MPS above the
device's MaxPayloadSizeSupported (MPSS).

(These parameters are not documented in
Documentation/kernel-parameters.txt!  Someone ought to fix that.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH v2 3/3] pppoatm: protect against freeing of vcc
From: David Woodhouse @ 2012-11-27 18:02 UTC (permalink / raw)
  To: Krzysztof Mazur; +Cc: davem, netdev, linux-kernel, nathan
In-Reply-To: <20121127173906.GA11390@shrek.podlesie.net>

[-- Attachment #1: Type: text/plain, Size: 1028 bytes --]

On Tue, 2012-11-27 at 18:39 +0100, Krzysztof Mazur wrote:
> Yes, I missed that one - it's even worse, I introduced that bug 
> in "[PATCH 1/7] atm: detach protocol before closing vcc". Before that
> patch that scenario shouldn't happen because vcc was closed before
> calling pppoatm_send(vcc, NULL) - the driver should provide appropriate
> synchronization.
> 
> I think that we should just drop that patch. With later changes it's not
> necessary - the pppoatm_send() can be safely called while closing vcc.

I'm not running with that patch. This bug exists for br2684 even before
it, and I think also for pppoatm.

In solos-pci at least, the ops->close() function doesn't flush all
pending skbs for this vcc before returning. So can be a tasklet
somewhere which has loaded the address of the vcc->pop function from one
of them, and is going to call it in some unspecified amount of time.

Should we make the device's ->close function wait for all TX and RX skbs
for this vcc to complete? 

-- 
dwmw2

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply

* Re: TCP and reordering
From: Rick Jones @ 2012-11-27 18:00 UTC (permalink / raw)
  To: Saku Ytti; +Cc: netdev
In-Reply-To: <CAAeewD9MtEx4uF6ezbBj7Ci5OzX8VK7p=WQ2TB3PfjmznA4X0w@mail.gmail.com>

On 11/27/2012 09:15 AM, Saku Ytti wrote:
> On 27 November 2012 19:05, Rick Jones <rick.jones2@hp.com> wrote:
>
>> Packet reordering is supposed to be the exception, not the rule.
>> Links which habitually/constantly introduce reordering are, in my
>> opinion, broken. Optimizing for them would be optimizing an error
>> case.
>
> TCP used to be friendly to reordering before fast retransmit
> optimization was implemented.

It remained "friendly" to reordering even after fast retransmit was 
implemented - just not to particularly bad reordering.

And friendly is somewhat relative.  Even before fast retransmit came to 
be, TCP would immediately ACK each out-of-order segment.

> It seems like minimal complexity in TCP algorithm and would
> dynamically work correctly depending on situation. It is rather slim
> comfort that network should work, when it does not, and you cannot
> affect it.

It is probably considered an "ancient" text these days, but one of the 
chapter intros for The Mythical Man Month includes a quote from Ovid:

adde parvum parvo magnus acervus erit

which if recollection serves the book translated as:

add little to little and soon there will be a big pile.

> But if the complexity is higher than I expect, then I fully agree,
> makes no sense to add it. Reason why reordering can happen in modern
> MPLS network is that you have to essentially duck type your traffic,
> and sometimes you duck type them wrong and you are then calculating
> ECMP on incorrect values, causing packets inside flow to take
> different ports.

I appreciate that one may not always have "access" and there can be 
layer 8 and layer 9 issues involved, but if incorrect typing is the root 
cause of the reordering, treating root cause rather than the symptom is 
what should happen.  How many kludges, no matter how angelic, can fit in 
a TCP implementation?

For other reasons (CPU utilization) various stacks (HP-UX, Solaris, some 
versions of MacOS) have had explicit ACK-avoidance heuristics.  They 
would back-off from ack-every-other to back-every-N, N >> 2.  The 
heuristics worked quite well in LAN environments and on bulk flows (eg 
FTP, ttcp, netperf TCP_STREAM), not necessarily as well in other 
environments.   One (very necessary) part of the heuristic in those is 
to go back to ack-every-other when necessary.  That keyed off the 
standalone ACK timer - if that ever fired the current avoidance count 
would go back to 2, and the max allowed would be half what it was 
before.  However, that took a non-trivial performance hit when there was 
  "tail-drop" and something that wasn't a continuous stream of traffic - 
the tail got dropped, nothing to cause out-of-order to force immediate 
ACKs.  (*) Standalone ACK timer is then the only thing getting us back 
out which means idleness.  I worked a number of "WAN performance 
problems" involving one of those stacks where part of the solution was 
turning down the limits on the ack avoidance heuristic by a considerable 
quantity.  (And I say this as someone with a fondness for the schemes)

I cannot say with certainty your idea would have the same problems but 
as you look to work-out a solution to propose as a patch, you will have 
to keep that in mind.

rick

* yes, the same holds true for a non-ack-avoiding setup, the heuristic 
simply made it worse - especially if the sender wanted to send but had 
gotten limited by cwnd - the ACK(s) of the head of that chunk of data 
were "avoided" and so wouldn't open the cwnd which might then allow 
futher segments to enable detection of the dropped segments.  Even 
without losses it also tended to interact poorly with sending TCPs which 
wantend to increase the congestion window by one MSS for each ACK rather 
than based on the quantity of bytes covered by the ACK.

^ permalink raw reply

* Re: [PATCH for 3.8] iproute2: Add "ip netns pids" and "ip netns identify"
From: Ben Hutchings @ 2012-11-27 18:00 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Stephen Hemminger, netdev, Serge E. Hallyn
In-Reply-To: <87a9u4q7k9.fsf@xmission.com>

On Mon, 2012-11-26 at 17:16 -0600, Eric W. Biederman wrote:
> Add command that go between network namespace names and process
> identifiers.  The code builds and runs agains older kernels but
> only works on Linux 3.8+ kernels where I have fixed stat to work
> properly.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> I don't know if this is too soon to send this patch to iproute as the
> kernel code that fixes stat is currently sitting in my for-next branch
> of:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git
> 
> and has not hit Linus's tree yet.  Still the code runs and is harmless
> on older kernels so it should be harmless whatever happens with it.
> 
>  ip/ipnetns.c        |  141 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  man/man8/ip-netns.8 |    5 ++-
>  2 files changed, 145 insertions(+), 1 deletions(-)
> 
> diff --git a/ip/ipnetns.c b/ip/ipnetns.c
> index e41a598..c55fe3a 100644
> --- a/ip/ipnetns.c
> +++ b/ip/ipnetns.c
> @@ -13,6 +13,7 @@
>  #include <dirent.h>
>  #include <errno.h>
>  #include <unistd.h>
> +#include <ctype.h>
>  
>  #include "utils.h"
>  #include "ip_common.h"
> @@ -48,6 +49,8 @@ static void usage(void)
>  	fprintf(stderr, "Usage: ip netns list\n");
>  	fprintf(stderr, "       ip netns add NAME\n");
>  	fprintf(stderr, "       ip netns delete NAME\n");
> +	fprintf(stderr, "       ip netns identify PID\n");
> +	fprintf(stderr, "       ip netns pids NAME\n");
>  	fprintf(stderr, "       ip netns exec NAME cmd ...\n");
>  	fprintf(stderr, "       ip netns monitor\n");
>  	exit(-1);
> @@ -171,6 +174,138 @@ static int netns_exec(int argc, char **argv)
>  	exit(-1);
>  }
>  
> +static int is_pid(const char *str)
> +{
> +	int ch;
> +	for (; (ch = *str); str++) {
> +		if (!isdigit(ch))

ch must be cast to unsigned char before passing to isdigit().

> +			return 0;
> +	}
> +	return 1;
> +}
> +
> +static int netns_pids(int argc, char **argv)
> +{
> +	const char *name;
> +	char net_path[MAXPATHLEN];
> +	int netns;
> +	struct stat netst;
> +	DIR *dir;
> +	struct dirent *entry;
> +
> +	if (argc < 1) {
> +		fprintf(stderr, "No netns name specified\n");
> +		return -1;
> +	}
> +	if (argc > 1) {
> +		fprintf(stderr, "extra arguments specified\n");
> +		return -1;
> +	}

These, and many other return statements in this file which set the
process exit code, should return 1 (general failure) or 2 (user error)
rather than -1 (likely to be interpreted as command not found).

> +	name = argv[0];
> +	snprintf(net_path, sizeof(net_path), "%s/%s", NETNS_RUN_DIR, name);

No check for truncation?

> +	netns = open(net_path, O_RDONLY);

This file descriptor is leaked, though that probably doesn't really
matter.

[...]
> +static int netns_identify(int argc, char **argv)
> +{
> +	const char *pidstr;
> +	char net_path[MAXPATHLEN];
> +	int netns;
> +	struct stat netst;
> +	DIR *dir;
> +	struct dirent *entry;
> +
> +	if (argc < 1) {
> +		fprintf(stderr, "No pid specified\n");
> +		return -1;
> +	}
> +	if (argc > 1) {
> +		fprintf(stderr, "extra arguments specified\n");
> +		return -1;
> +	}
> +	pidstr = argv[0];
> +
> +	if (!is_pid(pidstr)) {
> +		fprintf(stderr, "Specified string '%s' is not a pid\n",
> +			pidstr);
> +		return -1;
> +	}
> +
> +	snprintf(net_path, sizeof(net_path), "/proc/%s/ns/net", pidstr);
> +	netns = open(net_path, O_RDONLY);
> +	if (netns < 0) {
> +		fprintf(stderr, "Cannot open network namespace: %s\n",
> +			strerror(errno));
> +		return -1;
> +	}
> +	if (fstat(netns, &netst) < 0) {
> +		fprintf(stderr, "Stat of netns failed: %s\n",
> +			strerror(errno));
> +		return -1;
> +	}
> +	dir = opendir(NETNS_RUN_DIR);
> +	if (!dir)
> +		return 0;

Shouldn't this be treated as an error?  Or, if you want it to succeed
when the kernel does not have netns functionality, then treat it as an
error if !dir && errno != ENOENT.

> +	while((entry = readdir(dir))) {
> +		char name_path[MAXPATHLEN];
> +		struct stat st;
> +
> +		if (strcmp(entry->d_name, ".") == 0)
> +			continue;
> +		if (strcmp(entry->d_name, "..") == 0)
> +			continue;
> +
> +		snprintf(name_path, sizeof(name_path), "%s/%s",	NETNS_RUN_DIR,
> +			entry->d_name);
> +
> +		if (stat(name_path, &st) != 0)
> +			continue;
> +
> +		if ((st.st_dev == netst.st_dev) &&
> +		    (st.st_ino == netst.st_ino)) {
> +			printf("%s\n", entry->d_name);
> +		}
> +	}
> +	closedir(dir);
> +	return 0;
> +	
> +}
[...]

It's a shame there isn't a more efficient way to do these lookups.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox