Netdev List
 help / color / mirror / Atom feed
* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Flavio Leitner @ 2011-11-18 17:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321634046.3277.33.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Fri, 18 Nov 2011 17:34:06 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le vendredi 18 novembre 2011 à 14:30 -0200, Flavio Leitner a écrit :
> 
> > I know we are reverting to get it fixed, but this adds the routing
> > cache back, so what is the plan? Revert to get it working and then
> > think on new approach to remove the route cache again later?
> > 
> > I had one previous patch using the routing cache posted to the list,
> > but it won't fix the route flush problem.
> > 
> 
> I dont "add the routing cache back".

Sorry, I meant that we are trying to avoid doing this:
+			hash = rt_hash(daddr, skeys[s], ikeys[i],rt_genid(net));
+
+			rthp = &rt_hash_table[hash].chain;
+
+			while ((rt = rcu_dereference(*rthp)) != NULL) {
+				rthp = &rt->dst.rt_next;

anyway, see below.

> Note I only fix existing route entries in the cache ;)
Exactly.
 
> A "revert" is probably safe, since we should push a fix for
> 3.0/3.1/3.2 kernels...

I agree that reverting is probably safe.
fbl

^ permalink raw reply

* Re: [net-next] iproute2: Add new command to IP link to enable/disable VF spoof check
From: Greg Rose @ 2011-11-18 17:06 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Kirsher, Jeffrey T, davem@davemloft.net, netdev@vger.kernel.org,
	gospo@redhat.com
In-Reply-To: <20111117164528.293ec0ea@nehalam.linuxnetplumber.net>


On 11/17/2011 4:45 PM, Stephen Hemminger wrote:
> On Thu, 17 Nov 2011 16:43:07 -0800
> Greg Rose<gregory.v.rose@intel.com>  wrote:
>
>> On 11/17/2011 4:40 PM, Stephen Hemminger wrote:
>>> On Thu, 17 Nov 2011 15:39:32 -0800
>>> Greg Rose<gregory.v.rose@intel.com>   wrote:
>>>
>>
>> Would you prefer that I Jeff (Kirsher) and I repost the patch at that time?
>
> not required, if they show up in patchwork

I see it under review in patchwork.

Thanks,

- Greg

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Eric Dumazet @ 2011-11-18 17:07 UTC (permalink / raw)
  To: Flavio Leitner; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <20111118150512.5f66f7d2@asterix.rh>

Le vendredi 18 novembre 2011 à 15:05 -0200, Flavio Leitner a écrit :

> Sorry, I meant that we are trying to avoid doing this:
> +			hash = rt_hash(daddr, skeys[s], ikeys[i],rt_genid(net));
> +
> +			rthp = &rt_hash_table[hash].chain;
> +
> +			while ((rt = rcu_dereference(*rthp)) != NULL) {
> +				rthp = &rt->dst.rt_next;

Sure, but this is still needed right now.

Once route cache is removed, this loop wont exist anymore ;)

^ permalink raw reply

* Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
From: Ben Hutchings @ 2011-11-18 17:14 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Krishna Kumar, rusty, mst, netdev, kvm, davem, virtualization
In-Reply-To: <1321633081.2974.1.camel@sasha>

On Fri, 2011-11-18 at 18:18 +0200, Sasha Levin wrote:
> On Fri, 2011-11-18 at 15:40 +0000, Ben Hutchings wrote:
> > On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
> > > On Fri, 2011-11-18 at 01:08 +0000, Ben Hutchings wrote:
> > > > On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
> > > > > Changes for multiqueue virtio_net driver.
> > > > [...]
> > > > > @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
> > > > >  {
> > > > >  	struct virtnet_info *vi = netdev_priv(dev);
> > > > >  	int cpu;
> > > > > -	unsigned int start;
> > > > >  
> > > > >  	for_each_possible_cpu(cpu) {
> > > > > -		struct virtnet_stats __percpu *stats
> > > > > -			= per_cpu_ptr(vi->stats, cpu);
> > > > > -		u64 tpackets, tbytes, rpackets, rbytes;
> > > > > -
> > > > > -		do {
> > > > > -			start = u64_stats_fetch_begin(&stats->syncp);
> > > > > -			tpackets = stats->tx_packets;
> > > > > -			tbytes   = stats->tx_bytes;
> > > > > -			rpackets = stats->rx_packets;
> > > > > -			rbytes   = stats->rx_bytes;
> > > > > -		} while (u64_stats_fetch_retry(&stats->syncp, start));
> > > > > -
> > > > > -		tot->rx_packets += rpackets;
> > > > > -		tot->tx_packets += tpackets;
> > > > > -		tot->rx_bytes   += rbytes;
> > > > > -		tot->tx_bytes   += tbytes;
> > > > > +		int qpair;
> > > > > +
> > > > > +		for (qpair = 0; qpair < vi->num_queue_pairs; qpair++) {
> > > > > +			struct virtnet_send_stats __percpu *tx_stat;
> > > > > +			struct virtnet_recv_stats __percpu *rx_stat;
> > > > 
> > > > While you're at it, you can drop the per-CPU stats and make them only
> > > > per-queue.  There is unlikely to be any benefit in maintaining them
> > > > per-CPU while receive and transmit processing is serialised per-queue.
> > > 
> > > It allows you to update stats without a lock.
> > 
> > But you'll already be holding a lock related to the queue.
> 
> Right, but now you're holding a queue lock just when playing with the
> queue, we don't hold it when we process the data - which is when we
> usually need to update stats.
[...]

The *stack* is holding the appropriate lock when calling the NAPI poll
function or ndo_start_xmit function.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Flavio Leitner @ 2011-11-18 17:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321636073.3277.38.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Fri, 18 Nov 2011 18:07:53 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le vendredi 18 novembre 2011 à 15:05 -0200, Flavio Leitner a écrit :
> 
> > Sorry, I meant that we are trying to avoid doing this:
> > +			hash = rt_hash(daddr, skeys[s],
> > ikeys[i],rt_genid(net)); +
> > +			rthp = &rt_hash_table[hash].chain;
> > +
> > +			while ((rt = rcu_dereference(*rthp)) !=
> > NULL) {
> > +				rthp = &rt->dst.rt_next;
> 
> Sure, but this is still needed right now.

Yes, David will not be happy, unfortunately :)

> Once route cache is removed, this loop wont exist anymore ;)

That's the problem, we need to get rid of it first. :)

fbl

^ permalink raw reply

* b43: BCM 4331: MacBook 8,1: No connection after suspend
From: Nico -telmich- Schottelius @ 2011-11-18 17:32 UTC (permalink / raw)
  To: LKML, netdev, Arend van Spriel

Hello,

new notebook, new problems (*):

Running 3.2.0-rc1 on the MacBook Pro 8,1 with the BCM4331
(14e4:4331) the WLAN indeed works:

[   86.231702] b43-phy0: Broadcom 4331 WLAN found (core revision 29)
[   86.269190] ieee80211 phy0: Selected rate control algorithm
'minstrel_ht'
[   86.270486] Broadcom 43xx driver loaded [ Features: PMNLS ]
[   87.677265] b43-phy0: Loading firmware version 666.2 (2011-02-23
01:15:07)

But: After a suspend it seems not to receive any packets anymore:

[ 2334.494845] wlan0: authenticate with 64:87:d7:37:89:89 (try 1)
[ 2334.694035] wlan0: authenticate with 64:87:d7:37:89:89 (try 2)
[ 2334.893909] wlan0: authenticate with 64:87:d7:37:89:89 (try 3)
[ 2335.093824] wlan0: authentication with 64:87:d7:37:89:89 timed out

wpa_supplicant thus retries to connect to the network again and again
without success.

This is reproducable on the MBP 8,1 and is *NOT* fixed if I unload
b43 and modprobe it again. It is also "not fixed" when doing
multiple suspends/resumes.

Any pointers to this?

Cheers,

Nico


(*) feels like in good old times...

-- 
PGP key: 7ED9 F7D3 6B10 81D7 0EC5  5C09 D7DC C8E4 3187 7DF0

^ permalink raw reply

* Re: [PATCH] net: Use kmemdup rather than duplicating its implementation
From: Jesse Brandeburg @ 2011-11-18 17:32 UTC (permalink / raw)
  To: Thomas Meyer
  Cc: samuel@sortiz.org, davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1321569820.1624.307.camel@localhost.localdomain>

On Thu, 17 Nov 2011 14:43:40 -0800
Thomas Meyer <thomas@m3y3r.de> wrote:

> The semantic patch that makes this change is available
> in scripts/coccinelle/api/memdup.cocci.
> 
> Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
> ---
> 
> diff -u -p a/net/irda/irttp.c b/net/irda/irttp.c
> --- a/net/irda/irttp.c 2011-11-07 19:39:06.071138486 +0100
> +++ b/net/irda/irttp.c 2011-11-08 10:59:07.152748948 +0100
> @@ -1461,14 +1461,13 @@ struct tsap_cb *irttp_dup(struct tsap_cb
>  	}
>  
>  	/* Allocate a new instance */
> -	new = kmalloc(sizeof(struct tsap_cb), GFP_ATOMIC);
> +	new = kmemdup(orig, sizeof(struct tsap_cb), GFP_ATOMIC);
>  	if (!new) {
>  		IRDA_DEBUG(0, "%s(), unable to kmalloc\n", __func__);
>  		spin_unlock_irqrestore(&irttp->tsaps->hb_spinlock, flags);
>  		return NULL;
>  	}
>  	/* Dup */

this ^^^ comment should be removed also.

> -	memcpy(new, orig, sizeof(struct tsap_cb));
>  	spin_lock_init(&new->lock);
>  
>  	/* We don't need the old instance any more */

^ permalink raw reply

* Re: b43: BCM 4331: MacBook 8,1: No connection after suspend
From: Nico Schottelius @ 2011-11-18 17:39 UTC (permalink / raw)
  To: Nico -telmich- Schottelius, LKML, netdev, Arend van Spriel
In-Reply-To: <20111118173242.GA2101@schottelius.org>

[-- Attachment #1: Type: text/plain, Size: 203 bytes --]

Nico -telmich- Schottelius [Fri, Nov 18, 2011 at 06:32:42PM +0100]:
> Hello,
> 
> new notebook, new problems (*):

+ full dmesg attached.

-- 
PGP key: 7ED9 F7D3 6B10 81D7 0EC5  5C09 D7DC C8E4 3187 7DF0

[-- Attachment #2: dmesg.no-connection-after-suspend.gz --]
[-- Type: application/octet-stream, Size: 145793 bytes --]

^ permalink raw reply

* Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
From: Ben Hutchings @ 2011-11-18 17:40 UTC (permalink / raw)
  To: Greg Rose
  Cc: Roopa Prabhu, netdev@vger.kernel.org, davem@davemloft.net,
	chrisw@redhat.com, sri@us.ibm.com, dragos.tatulea@gmail.com,
	kvm@vger.kernel.org, arnd@arndb.de, mst@redhat.com,
	mchan@broadcom.com, dwang2@cisco.com, shemminger@vyatta.com,
	eric.dumazet@gmail.com, kaber@trash.net, benve@cisco.com
In-Reply-To: <4EC68EBB.3080303@intel.com>

On Fri, 2011-11-18 at 08:58 -0800, Greg Rose wrote:
> On 11/17/2011 4:44 PM, Ben Hutchings wrote:
> > On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote:
> >> On 11/17/2011 4:15 PM, Ben Hutchings wrote:
> >>> Sorry to come to this rather late.
> >>>
> >>> On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
> >>> [...]
> >>>> v2 ->   v3
> >>>> - Moved set and get filter ops from rtnl_link_ops to netdev_ops
> >>>> - Support for SRIOV VFs.
> >>>>           [Note: The get filters msg (in the way current get rtnetlink handles
> >>>>           it) might get too big for SRIOV vfs. This patch follows existing sriov
> >>>>           vf get code and tries to accomodate filters for all VF's in a PF.
> >>>>           And for the SRIOV case I have only tested the fact that the VF
> >>>>           arguments are getting delivered to rtnetlink correctly. The code
> >>>>           follows existing sriov vf handling code so rest of it should work fine]
> >>> [...]
> >>>
> >>> This is already broken for large numbers of VFs, and increasing the
> >>> amount of information per VF is going to make the situation worse.  I am
> >>> no netlink expert but I think that the current approach of bundling all
> >>> information about an interface in a single message may not be
> >>> sustainable.
> >>>
> >>> Also, I'm unclear on why this interface is to be used to set filtering
> >>> for the (PF) net device as well as for related VFs.  Doesn't that
> >>> duplicate the functionality of ndo_set_rx_mode and
> >>> ndo_vlan_rx_{add,kill}_vid?
> >>
> >> Functionally yes but contextually no.  This allows the PF driver to know
> >> that it is setting these filters in the context of the existence of VFs,
> >> allowing it to take appropriate action.  The other two functions may be
> >> called without the presence of SR-IOV enablement and the existence of VFs.
> >>
> >> Anyway, that's why I asked Roopa to add that capability.
> >
> > I don't follow.  The PF driver already knows whether it has enabled VFs.
> >
> > How do filters set this way interact with filters set through the
> > existing operations?  Should they override promiscuous mode?  None of
> > this has been specified.
> 
> Promiscuous mode is exactly the issue this feature is intended for.  I'm 
> not familiar with the solarflare device but Intel HW promiscuous mode is 
> only promiscuous on the physical port, not on the VEB.  So a packet sent 
> from a VF will not be captured by the PF across the VEB unless the MAC 
> and VLAN filters have been programmed into the HW.

Yes, I get it.  The hardware bridge needs to know more about the address
configuration on the host than the driver is getting at the moment.

> So you may not need 
> the feature for your devices but it is required for Intel devices.

Well we don't have the hardware bridge but that means each VF driver
needs to know whether to fall back to the software bridge.  The net
driver needs much the same additional information.

> And 
> it's a fairly simple request, just allow -1 to indicate that the target 
> of the filter requests is for the PF itself.  Using the already existing 
> set_rx_mode function wont' work because the PF driver will look at it 
> and figure it's in promiscuous mode anyway, so it won't set the filters 
> into the HW.  At least that is how it is in the case of our HW and 
> driver.  Again, the behavior of your HW and driver is unknown to me and 
> thus you may not require this feature.

What concerns me is that this seems to be a workaround rather than a fix
for over-use of promiscuous mode, and it changes the semantics of
filtering modes in ways that haven't been well-specified.

What if there's a software bridge between two net devices corresponding
to separate physical ports, so that they really need to be promiscuous?
What if the administrator runs tcpdump and really wants the (PF) net
device to be promiscuous?

These cases shouldn't break because of VF acceleration.  Or at least we
should make a conscious and documented decision that 'promiscuous'
doesn't mean that if you enable it on your network adapter.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: lock-warning seen on giving rds-info command
From: David Miller @ 2011-11-18 17:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kumaras, netdev, venkat.x.venkatsubra
In-Reply-To: <1321604802.2444.40.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 18 Nov 2011 09:26:42 +0100

> You mean that following sequence triggers a warning ?
> 
> spin_lock_irqsave(&rds_sock_lock, flags);
> ...
> read_lock_bh(&sk->sk_callback_lock);
> read_unlock_bh(&sk->sk_callback_lock);   // HERE
> ...

This sequence has always been disallowed.

BH's are triggered by software, so when you do local_bh_enable() it checks
to make sure you haven't tried to do this with hard IRQs disabled and
triggers the warning if so.

static void __local_bh_enable(unsigned int cnt)
{
	WARN_ON_ONCE(in_irq());
	WARN_ON_ONCE(!irqs_disabled());
 ...

^ permalink raw reply

* Re: [PATCH] xen-netfront: report link speed to ethtool
From: Ben Hutchings @ 2011-11-18 17:46 UTC (permalink / raw)
  To: Olaf Hering; +Cc: netdev, xen-devel, Jeremy Fitzhardinge, Konrad Rzeszutek Wilk
In-Reply-To: <20111118164805.GA14345@aepfle.de>

On Fri, 2011-11-18 at 17:48 +0100, Olaf Hering wrote:
> Add .get_settings function, return fake data so that ethtool can get
> enough information. For some application like VCS, this is useful,
> otherwise some of application logic will get panic.
> The reported data refers to VMWare vmxnet.
> 
> Signed-off-by: Xin Wei Hu <xwhu@suse.com>
> Signed-off-by: Chunyan Liu <cyliu@suse.com>
> Signed-off-by: Olaf Hering <olaf@aepfle.de>

NAK, we should not just make things up.

Ben.

> ---
>  drivers/net/xen-netfront.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> Index: linux-3.2-rc2/drivers/net/xen-netfront.c
> ===================================================================
> --- linux-3.2-rc2.orig/drivers/net/xen-netfront.c
> +++ linux-3.2-rc2/drivers/net/xen-netfront.c
> @@ -1727,6 +1727,17 @@ static void netback_changed(struct xenbu
>  	}
>  }
>  
> +static int xennet_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
> +{
> +	ecmd->supported = SUPPORTED_1000baseT_Full | SUPPORTED_TP;
> +	ecmd->advertising = ADVERTISED_TP;
> +	ecmd->port = PORT_TP;
> +	ecmd->transceiver = XCVR_INTERNAL;
> +	ecmd->speed = SPEED_1000;
> +	ecmd->duplex = DUPLEX_FULL;
> +	return 0;
> +}
> +
>  static const struct xennet_stat {
>  	char name[ETH_GSTRING_LEN];
>  	u16 offset;
> @@ -1774,6 +1785,7 @@ static const struct ethtool_ops xennet_e
>  {
>  	.get_link = ethtool_op_get_link,
>  
> +	.get_settings = xennet_get_settings,
>  	.get_sset_count = xennet_get_sset_count,
>  	.get_ethtool_stats = xennet_get_ethtool_stats,
>  	.get_strings = xennet_get_strings,
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: David Miller @ 2011-11-18 18:04 UTC (permalink / raw)
  To: fbl; +Cc: eric.dumazet, famzah, netdev, segoon
In-Reply-To: <20111118152142.717324b1@asterix.rh>

From: Flavio Leitner <fbl@redhat.com>
Date: Fri, 18 Nov 2011 15:21:42 -0200

> On Fri, 18 Nov 2011 18:07:53 +0100
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> Le vendredi 18 novembre 2011 à 15:05 -0200, Flavio Leitner a écrit :
>> 
>> > Sorry, I meant that we are trying to avoid doing this:
>> > +			hash = rt_hash(daddr, skeys[s],
>> > ikeys[i],rt_genid(net)); +
>> > +			rthp = &rt_hash_table[hash].chain;
>> > +
>> > +			while ((rt = rcu_dereference(*rthp)) !=
>> > NULL) {
>> > +				rthp = &rt->dst.rt_next;
>> 
>> Sure, but this is still needed right now.
> 
> Yes, David will not be happy, unfortunately :)

He better be happy that someone is fixing all the bugs he added.

^ permalink raw reply

* Re: Occasional oops with IPSec and IPv6.
From: Timo Teräs @ 2011-11-18 18:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Nick Bowler, netdev, David S. Miller
In-Reply-To: <1321634378.3277.35.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On 11/18/2011 06:39 PM, Eric Dumazet wrote:
> Le vendredi 18 novembre 2011 à 11:27 -0500, Nick Bowler a écrit :
>> On 2011-11-17 14:09 -0500, Nick Bowler wrote:
>>> One of the tests we do with IPsec involves sending and receiving UDP
>>> datagrams of all sizes from 1 to N bytes, where N is much larger than
>>> the MTU.  In this particular instance, the MTU is 1500 bytes and N is
>>> 10000 bytes.  This test works fine with IPv4, but I'm getting an
>>> occasional oops on Linus' master with IPv6 (output at end of email).  We
>>> also run the same test where N is less than the MTU, and it does not
>>> trigger this issue.  The resulting fallout seems to eventually lock up
>>> the box (although it continues to work for a little while afterwards).
>>>
>>> The issue appears timing related, and it doesn't always occur.  This
>>> probably also explains why I've not seen this issue before now, as we
>>> recently upgraded all our lab systems to machines from this century
>>> (with newfangled dual core processors).  This also makes it somewhat
>>> hard to reproduce, but I can trigger it pretty reliably by running 'yes'
>>> in an ssh session (which doesn't use IPsec) while running the test:
>>> it'll usually trigger in 2 or 3 runs.  The choice of cipher suite
>>> appears to be irrelevant.
>>>
>>> I built a relatively old kernel (2.6.34) and could not reproduce the
>>> issue there, so I ran a git bisect.  It pointed to the following, which
>>> (unsurprisingly) no longer reverts cleanly.
>>>
>>> Let me know if you need any more info.  I'll see if I can reproduce the
>>> issue with a smaller test case...
>>
>> OK, here's a somewhat straigthforward way to reproduce it that I've
>> found.  It uses a short test program called "udp_burst" which simply
>> transmits a bunch of UDP datagrams at all sizes between 1 and 10000,
>> included at the end of this mail.
>>[snip]
> 
> Please note commit 80c802f307 added a known bug, fixed in commit
> 0b150932197b (xfrm: avoid possible oopse in xfrm_alloc_dst)
> 
> Given commit 80c802f307 complexity, we can assume other bugs are to be
> fixed as well.
> 
> Unfortunately, Timo seems unresponsive.

This looks quite different. And I've been trying to figure out what
causes this. However, the OOPS happens at ip6_fragment(), indicating
that there was not enough allocated headroom (skb underrun). My initial
thought is ipv6 bug that just got uncovered by my commit; especially
since ipv4 side is happy. But I haven't yet been able to figure this one
out.

Could you also try Herbert's latest patch set:
  [0/6] Replace LL_ALLOCATED_SPACE to allow needed_headroom adjustment

This changes how the headroom is calculated, and *might* fix this issue
too if it's caused by the same SMP race condition which got uncovered by
my other commit earlier.

- Timo

^ permalink raw reply

* NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Andrew Cooper @ 2011-11-18 18:40 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA

Hello,

As described originally in
http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
encountering a bug whereby the NFS session was unexpectedly timing out.

I believe I have found the source of the race condition causing the timeout.

Brief overview of setup:
  10GiB network, NFS mounted using TCP.  Problem reproduces with
multiple different NICs, with synchronous or asynchronous mounts, and
with soft and hard mounts.  Reproduces on 2.6.32 and I am currently
trying to reproduce with mainline. (I don't have physical access to the
servers so installing stuff is not fantastically easy)



In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
the sock buffer using xs_sendpages()

When the sock buffer is nearly fully, we get an EAGAIN from
xs_sendpages() which causes a break out of the loop.  Lower down the
function, we switch on status which cases us to call xs_nospace() with
the task.

In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
in the rare case where that bit is clear, we return 0 instead of
EAGAIN.  This promptly overwrites status in xs_tcp_send_request().

The result is that xs_tcp_release_xprt() finds a request which has no
error, but has not sent all of the bytes in its send buffer.  It cleans
up by setting XPRT_CLOSE_WAIT which causes xprt_clear_locked() to queue
xprt->task_cleanup, which closes the TCP connection.


Under normal operation, the TCP connection goes down and back up without
interruption to the NFS layer.  However, when the NFS server hangs in a
half closed state, the client forces a RST of the TCP connection,
leading to the timeout.

I have tried a few naive fixes such as changing the default return value
in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
returned) but this causes a kernel memory leak.  Can someone who a
better understanding of these interactions than me have a look?  It
seems that the if (test_bit()) test in xs_nospace() should have an else
clause.

Thanks in advance,

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] xen-netfront: report link speed to ethtool
From: Olaf Hering @ 2011-11-18 18:43 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, xen-devel, Jeremy Fitzhardinge, Konrad Rzeszutek Wilk
In-Reply-To: <1321638394.2883.32.camel@bwh-desktop>

On Fri, Nov 18, Ben Hutchings wrote:

> On Fri, 2011-11-18 at 17:48 +0100, Olaf Hering wrote:
> > The reported data refers to VMWare vmxnet.
> NAK, we should not just make things up.

So how about removing veth_get_settings, vmxnet3_get_settings,
tun_get_settings and other functions that escaped my grep?

Olaf

^ permalink raw reply

* Re: [PATCH] xen-netfront: report link speed to ethtool
From: Rick Jones @ 2011-11-18 18:44 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Olaf Hering, netdev, xen-devel, Jeremy Fitzhardinge,
	Konrad Rzeszutek Wilk
In-Reply-To: <1321638394.2883.32.camel@bwh-desktop>

On 11/18/2011 09:46 AM, Ben Hutchings wrote:
> On Fri, 2011-11-18 at 17:48 +0100, Olaf Hering wrote:
>> Add .get_settings function, return fake data so that ethtool can get
>> enough information. For some application like VCS, this is useful,
>> otherwise some of application logic will get panic.
>> The reported data refers to VMWare vmxnet.
>>
>> Signed-off-by: Xin Wei Hu<xwhu@suse.com>
>> Signed-off-by: Chunyan Liu<cyliu@suse.com>
>> Signed-off-by: Olaf Hering<olaf@aepfle.de>
>
> NAK, we should not just make things up.

Which raises an interesting question for a virtual interface that isn't 
pretending to be a specific NIC type. What should the reported speed be? 
  Is it a 10/100 NIC?  A 1 or 10 GbE NIC? 3.14 GbE?  For other emulated 
interfaces, it rather falls-out from the emulation.  We can say that the 
driver may not make stuff up, but it would seem what is running in the 
host/hypervisor/dom0/whatever will have to.  It could I suppose, decide 
based on the physical NIC to which it is attached, so long as folks 
using the virtual NIC don't expect its attributes to be the same from 
system to system.

rick

^ permalink raw reply

* Re: [PATCH] xen-netfront: report link speed to ethtool
From: Jeremy Fitzhardinge @ 2011-11-18 18:46 UTC (permalink / raw)
  To: Rick Jones
  Cc: Ben Hutchings, Olaf Hering, netdev@vger.kernel.org,
	xen-devel@lists.xensource.com, Konrad Rzeszutek Wilk
In-Reply-To: <4EC6A778.1000503@hp.com>

On 11/18/2011 10:44 AM, Rick Jones wrote:
>  It could I suppose, decide 
> based on the physical NIC to which it is attached, so long as folks 
> using the virtual NIC don't expect its attributes to be the same from 
> system to system.

And assuming there's a physical NIC at all.

    J

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Trond Myklebust @ 2011-11-18 18:52 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4EC6A681.30902-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>

On Fri, 2011-11-18 at 18:40 +0000, Andrew Cooper wrote: 
> Hello,
> 
> As described originally in
> http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
> encountering a bug whereby the NFS session was unexpectedly timing out.
> 
> I believe I have found the source of the race condition causing the timeout.
> 
> Brief overview of setup:
>   10GiB network, NFS mounted using TCP.  Problem reproduces with
> multiple different NICs, with synchronous or asynchronous mounts, and
> with soft and hard mounts.  Reproduces on 2.6.32 and I am currently
> trying to reproduce with mainline. (I don't have physical access to the
> servers so installing stuff is not fantastically easy)
> 
> 
> 
> In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
> the sock buffer using xs_sendpages()
> 
> When the sock buffer is nearly fully, we get an EAGAIN from
> xs_sendpages() which causes a break out of the loop.  Lower down the
> function, we switch on status which cases us to call xs_nospace() with
> the task.
> 
> In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
> in the rare case where that bit is clear, we return 0 instead of
> EAGAIN.  This promptly overwrites status in xs_tcp_send_request().
> 
> The result is that xs_tcp_release_xprt() finds a request which has no
> error, but has not sent all of the bytes in its send buffer.  It cleans
> up by setting XPRT_CLOSE_WAIT which causes xprt_clear_locked() to queue
> xprt->task_cleanup, which closes the TCP connection.
> 
> 
> Under normal operation, the TCP connection goes down and back up without
> interruption to the NFS layer.  However, when the NFS server hangs in a
> half closed state, the client forces a RST of the TCP connection,
> leading to the timeout.
> 
> I have tried a few naive fixes such as changing the default return value
> in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
> returned) but this causes a kernel memory leak.  Can someone who a
> better understanding of these interactions than me have a look?  It
> seems that the if (test_bit()) test in xs_nospace() should have an else
> clause.

I fully agree with your analysis. The correct thing to do here is to
always return either EAGAIN or ENOTCONN. Thank you very much for working
this one out!

Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org
www.netapp.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* use a special value of -2 for virtual devices to report indeterminate speed?
From: Rick Jones @ 2011-11-18 18:58 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ben Hutchings, Olaf Hering, netdev@vger.kernel.org,
	xen-devel@lists.xensource.com, Konrad Rzeszutek Wilk
In-Reply-To: <4EC6A802.9090805@goop.org>

On 11/18/2011 10:46 AM, Jeremy Fitzhardinge wrote:
> On 11/18/2011 10:44 AM, Rick Jones wrote:
>>   It could I suppose, decide
>> based on the physical NIC to which it is attached, so long as folks
>> using the virtual NIC don't expect its attributes to be the same from
>> system to system.
>
> And assuming there's a physical NIC at all.

It sounds like we need a way to specify "Indeterminate" for link speed? 
  Or some verbiage to that effect. Right now 0 and -1 cause ethtool to 
report "Unknown!"

         if (speed == 0 || speed == (u16)(-1) || speed == (u32)(-1))
                 fprintf(stdout, "Unknown!\n");
         else
                 fprintf(stdout, "%uMb/s\n", speed);


How about -2 for the u32 cast value of speed returning "Indeterminate" 
or something like that?  Not in "proper" patch format:

	if (speed == 0 || speed == (u16)(-1) || speed == (u32)(-1))
		fprintf(stdout, "Unknown!\n");
	else if (speed == (u32)(-2))
		fprintf(stdout, "Indeterminate.");
	else
		fprintf(stdout, "%uMb/s\n", speed);

Signed-off-by: Rick Jones <rick.jones2@hp.com>	

rick jones

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Andrew Cooper @ 2011-11-18 19:04 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1321642368.2653.35.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>

On 18/11/11 18:52, Trond Myklebust wrote:
> On Fri, 2011-11-18 at 18:40 +0000, Andrew Cooper wrote: 
>> Hello,
>>
>> As described originally in
>> http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
>> encountering a bug whereby the NFS session was unexpectedly timing out.
>>
>> I believe I have found the source of the race condition causing the timeout.
>>
>> Brief overview of setup:
>>   10GiB network, NFS mounted using TCP.  Problem reproduces with
>> multiple different NICs, with synchronous or asynchronous mounts, and
>> with soft and hard mounts.  Reproduces on 2.6.32 and I am currently
>> trying to reproduce with mainline. (I don't have physical access to the
>> servers so installing stuff is not fantastically easy)
>>
>>
>>
>> In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
>> the sock buffer using xs_sendpages()
>>
>> When the sock buffer is nearly fully, we get an EAGAIN from
>> xs_sendpages() which causes a break out of the loop.  Lower down the
>> function, we switch on status which cases us to call xs_nospace() with
>> the task.
>>
>> In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
>> in the rare case where that bit is clear, we return 0 instead of
>> EAGAIN.  This promptly overwrites status in xs_tcp_send_request().
>>
>> The result is that xs_tcp_release_xprt() finds a request which has no
>> error, but has not sent all of the bytes in its send buffer.  It cleans
>> up by setting XPRT_CLOSE_WAIT which causes xprt_clear_locked() to queue
>> xprt->task_cleanup, which closes the TCP connection.
>>
>>
>> Under normal operation, the TCP connection goes down and back up without
>> interruption to the NFS layer.  However, when the NFS server hangs in a
>> half closed state, the client forces a RST of the TCP connection,
>> leading to the timeout.
>>
>> I have tried a few naive fixes such as changing the default return value
>> in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
>> returned) but this causes a kernel memory leak.  Can someone who a
>> better understanding of these interactions than me have a look?  It
>> seems that the if (test_bit()) test in xs_nospace() should have an else
>> clause.
> I fully agree with your analysis. The correct thing to do here is to
> always return either EAGAIN or ENOTCONN. Thank you very much for working
> this one out!
>
> Trond

Returning EAGAIN seems to cause a kernel memory leak, as the oomkiller
starts going after processes holding large amounts of LowMem.  Returning
ENOTCONN causes the NFS session to complain about a timeout in the logs,
and in the case of a softmout, give an EIO to the calling process.

>From the looks of the TCP stream, and from the the looks of some
targeted debugging, nothing is actually wrong, so the client should not
be trying to FIN the TCP connection.  Is it possible that there is a
more sinister reason for SOCK_ASYNC_NOSPACE being clear?

I can attempt to find which of the many calls to clear that bit is
actually causing the problem, but I have a feeing that is going to a
little more tricky to narrow down.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Should "N/A" dust bunnies be swept from fw_version?
From: Rick Jones @ 2011-11-18 19:09 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1321575574.2749.55.camel@bwh-desktop>

On 11/17/2011 04:19 PM, Ben Hutchings wrote:
> On Thu, 2011-11-17 at 15:27 -0800, Rick Jones wrote:
>> In the discussion on "enable virtio_net to return bus_info in ethtool -i
>> consistent with emulated NICs" Ben Hutchings had the following feedback
>> on what might go into bus_info:
>>
>>> Please use the existing 'not implemented' value, which is the empty
>>> string.   If you think ethtool should print some helpful message instead
>>> of an empty string, please submit a patch for ethtool.
>>
>> When I was sweeping in the .get_drvinfo routines, I noticed many drivers
>> would return "N/A" for fw_version - presumably they were drivers for
>> cards without firmware.  Should those be removed to have the fw_version
>> be the empty string, or should those sleeping dust bunnies be allowed to
>> lie?
>
> I much prefer the empty string; the ethtool utility can turn that into a
> user-friendly placeholder if it's considered confusing.

Any other opinions out there?  Anyone? Anyone?-)

rick jones

^ permalink raw reply

* Re: [PATCH] xen-netfront: report link speed to ethtool
From: Ben Hutchings @ 2011-11-18 19:10 UTC (permalink / raw)
  To: Olaf Hering; +Cc: netdev, xen-devel, Jeremy Fitzhardinge, Konrad Rzeszutek Wilk
In-Reply-To: <20111118184336.GA16027@aepfle.de>

On Fri, 2011-11-18 at 19:43 +0100, Olaf Hering wrote:
> On Fri, Nov 18, Ben Hutchings wrote:
> 
> > On Fri, 2011-11-18 at 17:48 +0100, Olaf Hering wrote:
> > > The reported data refers to VMWare vmxnet.
> > NAK, we should not just make things up.
> 
> So how about removing veth_get_settings, vmxnet3_get_settings,
> tun_get_settings and other functions that escaped my grep?

If they can't provide meaningful information then maybe they should be
removed.  However, that could result in a regression for existing
working configurations.  (This isn't the same as the case you're trying
to fix, since those applications have never worked with xen-netfront or
many other drivers that don't implement get_settings.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: Should "N/A" dust bunnies be swept from fw_version?
From: David Miller @ 2011-11-18 19:10 UTC (permalink / raw)
  To: rick.jones2; +Cc: netdev
In-Reply-To: <4EC6AD7C.2070307@hp.com>

From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 18 Nov 2011 11:09:48 -0800

> On 11/17/2011 04:19 PM, Ben Hutchings wrote:
>> On Thu, 2011-11-17 at 15:27 -0800, Rick Jones wrote:
>>> In the discussion on "enable virtio_net to return bus_info in ethtool
>>> -i
>>> consistent with emulated NICs" Ben Hutchings had the following
>>> feedback
>>> on what might go into bus_info:
>>>
>>>> Please use the existing 'not implemented' value, which is the empty
>>>> string.  If you think ethtool should print some helpful message
>>>> instead
>>>> of an empty string, please submit a patch for ethtool.
>>>
>>> When I was sweeping in the .get_drvinfo routines, I noticed many
>>> drivers
>>> would return "N/A" for fw_version - presumably they were drivers for
>>> cards without firmware.  Should those be removed to have the
>>> fw_version
>>> be the empty string, or should those sleeping dust bunnies be allowed
>>> to
>>> lie?
>>
>> I much prefer the empty string; the ethtool utility can turn that into
>> a
>> user-friendly placeholder if it's considered confusing.
> 
> Any other opinions out there?  Anyone? Anyone?-)

I agree with Ben, just provide the empty string.

^ permalink raw reply

* Re: use a special value of -2 for virtual devices to report indeterminate speed?
From: Ben Hutchings @ 2011-11-18 19:13 UTC (permalink / raw)
  To: Rick Jones
  Cc: Jeremy Fitzhardinge, Olaf Hering, netdev@vger.kernel.org,
	xen-devel@lists.xensource.com, Konrad Rzeszutek Wilk
In-Reply-To: <4EC6AAF3.6080803@hp.com>

On Fri, 2011-11-18 at 10:58 -0800, Rick Jones wrote:
> On 11/18/2011 10:46 AM, Jeremy Fitzhardinge wrote:
> > On 11/18/2011 10:44 AM, Rick Jones wrote:
> >>   It could I suppose, decide
> >> based on the physical NIC to which it is attached, so long as folks
> >> using the virtual NIC don't expect its attributes to be the same from
> >> system to system.
> >
> > And assuming there's a physical NIC at all.
> 
> It sounds like we need a way to specify "Indeterminate" for link speed? 
>   Or some verbiage to that effect. Right now 0 and -1 cause ethtool to 
> report "Unknown!"
> 
>          if (speed == 0 || speed == (u16)(-1) || speed == (u32)(-1))
>                  fprintf(stdout, "Unknown!\n");
>          else
>                  fprintf(stdout, "%uMb/s\n", speed);
> 
> 
> How about -2 for the u32 cast value of speed returning "Indeterminate" 
> or something like that?  Not in "proper" patch format:
> 
> 	if (speed == 0 || speed == (u16)(-1) || speed == (u32)(-1))
> 		fprintf(stdout, "Unknown!\n");
> 	else if (speed == (u32)(-2))
> 		fprintf(stdout, "Indeterminate.");
> 	else
> 		fprintf(stdout, "%uMb/s\n", speed);

I'm open to something like this, but the problem with assigning new
magic numbers is that older versions of ethtool won't know to report
them as special.

We should also consider stacked drivers like bonding (and presumably
team) that expect real numbers when the link is up.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] xen-netfront: report link speed to ethtool
From: David Miller @ 2011-11-18 19:11 UTC (permalink / raw)
  To: bhutchings; +Cc: olaf, netdev, xen-devel, jeremy.fitzhardinge, konrad.wilk
In-Reply-To: <1321638394.2883.32.camel@bwh-desktop>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Fri, 18 Nov 2011 17:46:34 +0000

> On Fri, 2011-11-18 at 17:48 +0100, Olaf Hering wrote:
>> Add .get_settings function, return fake data so that ethtool can get
>> enough information. For some application like VCS, this is useful,
>> otherwise some of application logic will get panic.
>> The reported data refers to VMWare vmxnet.
>> 
>> Signed-off-by: Xin Wei Hu <xwhu@suse.com>
>> Signed-off-by: Chunyan Liu <cyliu@suse.com>
>> Signed-off-by: Olaf Hering <olaf@aepfle.de>
> 
> NAK, we should not just make things up.

Agreed, if you cannot determine the values with certainty do not
implement this method.

Fix the tools which cannot function without this information.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox