Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [Xen-devel] [PATCH v4] xen/grant-table: Avoid m2p_override during mapping
From: Zoltan Kiss @ 2014-01-22 19:03 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies
In-Reply-To: <alpine.DEB.2.02.1401221848080.21510@kaball.uk.xensource.com>

On 22/01/14 18:50, Stefano Stabellini wrote:
> On Wed, 22 Jan 2014, Zoltan Kiss wrote:
>> On 22/01/14 16:39, Stefano Stabellini wrote:
>>> On Tue, 21 Jan 2014, Zoltan Kiss wrote:
>>>> @@ -121,7 +125,7 @@ static inline unsigned long mfn_to_pfn(unsigned long
>>>> mfn)
>>>>    		pfn = m2p_find_override_pfn(mfn, ~0);
>>>>    	}
>>>>
>>>> -	/*
>>>> +	/*
>>>
>>> Spurious change?
>> It removes a stray space from the original code. Not necessary, but if it's
>> there, I think we can keep it.
>
> Usually cosmetic changes are done in a separate patch, or at the very
> least they are mentioned in the commit message.
Ok, I'll mention it.
>
>
>>>> diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
>>>> index 2ae8699..0060178 100644
>>>> --- a/arch/x86/xen/p2m.c
>>>> +++ b/arch/x86/xen/p2m.c
>>>> @@ -872,15 +872,13 @@ static unsigned long mfn_hash(unsigned long mfn)
>>>>
>>>>    /* Add an MFN override for a particular page */
>>>>    int m2p_add_override(unsigned long mfn, struct page *page,
>>>> -		struct gnttab_map_grant_ref *kmap_op)
>>>> +		struct gnttab_map_grant_ref *kmap_op, unsigned long pfn)
>>>
>>> Do we really need to add another additional parameter to
>>> m2p_add_override?
>>> I would just let m2p_add_override and m2p_remove_override call
>>> page_to_pfn again. It is not that expensive.
>> Yes, because that page_to_pfn can return something different. That's why the
>> v2 patches failed.
>
> I am really curious: how can page_to_pfn return something different?
> I don't think is supposed to happen.
You call set_phys_to_machine before calling m2p* functions.

Zoli

^ permalink raw reply

* Re: [PATCH net-next] net: phy: use network device in phy_print_status
From: Joe Perches @ 2014-01-22 19:05 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, David Miller
In-Reply-To: <CAGVrzcaDTOxBHNgSF1wme8JgnYV-KjpNfqqcNLdebqR0hjnWJg@mail.gmail.com>

On Wed, 2014-01-22 at 10:52 -0800, Florian Fainelli wrote:
> 2014/1/22 Joe Perches <joe@perches.com>:
> > On Wed, 2014-01-22 at 10:32 -0800, Florian Fainelli wrote:
> >> phy_print_status() currently uses dev_name(&phydev->dev) which will
> >> usually result in printing something along those lines for Device Tree
> >> aware drivers:
> >>
> >> libphy: f0b60000.etherne:0a - Link is Down
> >> libphy: f0ba0000.etherne:00 - Link is Up - 1000/Full
[]
> > The leading "- "'s now aren't useful and these could be
> 
> I did not want to remove the leading "- " since this would sort of
> produce a very different message now, that said, I have no strong
> feeling removing it.

The output would already be different as it's no longer
prefixed by "phylib: ".

With the dash it'll look like:

r8169 0000:01:00.0 eth3: - Link is Up

when all the other netdev_info link messages
don't use the - separator after the "ethX: "

> > It may also be possible to slightly improve the output by
> > reducing the number of 0's in the speed block and emitting
> > Kb/Mb/Gb (or Kbps/Mbps/Gbps) and maybe flow control if it's
> > available.
> 
> Sure thing, that calls for a follow-up patch though.

Thanks.

^ permalink raw reply

* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Matija Glavinic Pecotic @ 2014-01-22 19:21 UTC (permalink / raw)
  To: ext Vlad Yasevich; +Cc: linux-sctp, Alexander Sverdlin, netdev
In-Reply-To: <52DF28F1.1030602@gmail.com>

Hello Vlad,

On 01/22/2014 03:12 AM, ext Vlad Yasevich wrote:
> On 01/17/2014 02:01 AM, Matija Glavinic Pecotic wrote:
> [ snip long description ...]
>> Proposed solution simplifies whole algorithm having on mind definition from rfc:
>>
>> o  Receiver Window (rwnd): This gives the sender an indication of the space
>>    available in the receiver's inbound buffer.
>>
>> Core of the proposed solution is given with these lines:
>>
>> sctp_assoc_rwnd_account:
>> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>> 	else
>> 		asoc->rwnd = 0;
>>
>> We advertise to sender (half of) actual space we have. Half is in the braces
>> depending whether you would like to observe size of socket buffer as SO_RECVBUF
>> or twice the amount, i.e. size is the one visible from userspace, that is,
>> from kernelspace.
>> In this way sender is given with good approximation of our buffer space,
>> regardless of the buffer policy - we always advertise what we have. Proposed
>> solution fixes described problems and removes necessity for rwnd restoration
>> algorithm. Finally, as proposed solution is simplification, some lines of code,
>> along with some bytes in struct sctp_association are saved.
>>
>> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
>> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>>
> 
> This is the correct approach.  However there is one issue and
> a few cosmetic suggestions.
> 
>> ---
>>
>> --- net-next.orig/net/sctp/associola.c
>> +++ net-next/net/sctp/associola.c
>> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
>>  	return false;
>>  }
>>  
>> -/* Increase asoc's rwnd by len and send any window update SACK if needed. */
>> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
>> +/* Account asoc's rwnd for the approximated state in the buffer,
>> + * and check whether SACK needs to be sent.
>> + */
>> +void sctp_assoc_rwnd_account(struct sctp_association *asoc, int check_sack)
> 
> Maybe sctp_assoc_rwnd_update()?
> 
> 'check_sack' isn't a very descriptive name for what we are doing.  May
> be 'update_peer'?  Also, since it is either 0 or 1, just make a bool.

these would be more appropriate names, I agree.

>>  {
>> +	int rx_count;
>>  	struct sctp_chunk *sack;
>>  	struct timer_list *timer;
>>  
>> -	if (asoc->rwnd_over) {
>> -		if (asoc->rwnd_over >= len) {
>> -			asoc->rwnd_over -= len;
>> -		} else {
>> -			asoc->rwnd += (len - asoc->rwnd_over);
>> -			asoc->rwnd_over = 0;
>> -		}
>> -	} else {
>> -		asoc->rwnd += len;
>> -	}
>> +	if (asoc->ep->rcvbuf_policy)
>> +		rx_count = atomic_read(&asoc->rmem_alloc);
>> +	else
>> +		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>>  
>> -	/* If we had window pressure, start recovering it
>> -	 * once our rwnd had reached the accumulated pressure
>> -	 * threshold.  The idea is to recover slowly, but up
>> -	 * to the initial advertised window.
>> -	 */
>> -	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
>> -		int change = min(asoc->pathmtu, asoc->rwnd_press);
>> -		asoc->rwnd += change;
>> -		asoc->rwnd_press -= change;
>> -	}
>> +	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> +		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>> +	else
>> +		asoc->rwnd = 0;
>>  
>> -	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
>> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
>> -		 asoc->a_rwnd);
>> +	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
>> +		 __func__, asoc, asoc->rwnd, rx_count,
>> +		 asoc->base.sk->sk_rcvbuf);
>>  
>>  	/* Send a window update SACK if the rwnd has increased by at least the
>>  	 * minimum of the association's PMTU and half of the receive buffer.
>>  	 * The algorithm used is similar to the one described in
>>  	 * Section 4.2.3.3 of RFC 1122.
>>  	 */
>> -	if (sctp_peer_needs_update(asoc)) {
>> +	if (check_sack && sctp_peer_needs_update(asoc)) {
>>  		asoc->a_rwnd = asoc->rwnd;
>>  
>>  		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
>> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
>>  	}
>>  }
>>  
>> -/* Decrease asoc's rwnd by len. */
>> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
>> -{
>> -	int rx_count;
>> -	int over = 0;
>> -
>> -	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
>> -		pr_debug("%s: association:%p has asoc->rwnd:%u, "
>> -			 "asoc->rwnd_over:%u!\n", __func__, asoc,
>> -			 asoc->rwnd, asoc->rwnd_over);
>> -
>> -	if (asoc->ep->rcvbuf_policy)
>> -		rx_count = atomic_read(&asoc->rmem_alloc);
>> -	else
>> -		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>> -
>> -	/* If we've reached or overflowed our receive buffer, announce
>> -	 * a 0 rwnd if rwnd would still be positive.  Store the
>> -	 * the potential pressure overflow so that the window can be restored
>> -	 * back to original value.
>> -	 */
>> -	if (rx_count >= asoc->base.sk->sk_rcvbuf)
>> -		over = 1;
>> -
>> -	if (asoc->rwnd >= len) {
>> -		asoc->rwnd -= len;
>> -		if (over) {
>> -			asoc->rwnd_press += asoc->rwnd;
>> -			asoc->rwnd = 0;
>> -		}
>> -	} else {
>> -		asoc->rwnd_over = len - asoc->rwnd;
>> -		asoc->rwnd = 0;
>> -	}
>> -
>> -	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
>> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
>> -		 asoc->rwnd_press);
>> -}
>>  
>>  /* Build the bind address list for the association based on info from the
>>   * local endpoint and the remote peer.
>> --- net-next.orig/include/net/sctp/structs.h
>> +++ net-next/include/net/sctp/structs.h
>> @@ -1653,17 +1653,6 @@ struct sctp_association {
>>  	/* This is the last advertised value of rwnd over a SACK chunk. */
>>  	__u32 a_rwnd;
>>  
>> -	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
>> -	 * to slop over a maximum of the association's frag_point.
>> -	 */
>> -	__u32 rwnd_over;
>> -
>> -	/* Keeps treack of rwnd pressure.  This happens when we have
>> -	 * a window, but not recevie buffer (i.e small packets).  This one
>> -	 * is releases slowly (1 PMTU at a time ).
>> -	 */
>> -	__u32 rwnd_press;
>> -
>>  	/* This is the sndbuf size in use for the association.
>>  	 * This corresponds to the sndbuf size for the association,
>>  	 * as specified in the sk->sndbuf.
>> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
>>  __u32 sctp_association_get_next_tsn(struct sctp_association *);
>>  
>>  void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
>> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
>> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
>> +void sctp_assoc_rwnd_account(struct sctp_association *, int);
>>  void sctp_assoc_set_primary(struct sctp_association *,
>>  			    struct sctp_transport *);
>>  void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
>> --- net-next.orig/net/sctp/sm_statefuns.c
>> +++ net-next/net/sctp/sm_statefuns.c
>> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
>>  	 * PMTU.  In cases, such as loopback, this might be a rather
>>  	 * large spill over.
>>  	 */
>> -	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
>> +	if ((!chunk->data_accepted) && (!asoc->rwnd ||
>>  	    (datalen > asoc->rwnd + asoc->frag_point))) {
>>  
>>  		/* If this is the next TSN, consider reneging to make
>> --- net-next.orig/net/sctp/socket.c
>> +++ net-next/net/sctp/socket.c
>> @@ -2097,7 +2097,7 @@ static int sctp_recvmsg(struct kiocb *io
>>  		 * rwnd is updated when the event is freed.
>>  		 */
>>  		if (!sctp_ulpevent_is_notification(event))
>> -			sctp_assoc_rwnd_increase(event->asoc, copied);
>> +			sctp_assoc_rwnd_account(event->asoc, 1);
> 
> This is not going to do anything.  The event has not been freed, thus
> rmem_alloc has not changed.  So, the rwnd will not change.

You are right. With current approach we can only remove this code.
I cannot imagine potential problem with it. I simply do not see situation in which receivers behavior regarding reading the whole message would depend on the current state of rwnd. 
In fact, behavior which you pointed is actually in sync with the whole idea. I would say it is safe to remove it.

>>  		goto out;
>>  	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
>>  		   (event->msg_flags & MSG_EOR))
>> --- net-next.orig/net/sctp/ulpevent.c
>> +++ net-next/net/sctp/ulpevent.c
>> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
>>  	skb = sctp_event2skb(event);
>>  	/* Set the owner and charge rwnd for bytes received.  */
>>  	sctp_ulpevent_set_owner(event, asoc);
>> -	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
>> +	sctp_assoc_rwnd_account(asoc, 0);
>>  
>>  	if (!skb->data_len)
>>  		return;
>> @@ -1035,7 +1035,7 @@ static void sctp_ulpevent_release_data(s
>>  	}
>>  
>>  done:
>> -	sctp_assoc_rwnd_increase(event->asoc, len);
>> +	sctp_assoc_rwnd_account(event->asoc, 1);
> 
> Same here.  The below call to sctp_ulpevent_release_owner() will
> finally update the rmem_alloc, so the a above call to rwnd_account()
> will not trigger a window update.

And if we have sack on the way, it will contain approximated rwnd minus the size of the last message received. Ideally, we would here first account rmem_alloc, update rwnd, then put association.

The only thing which comes to my mind is to rewrite this part exactly as described.
Any suggestions from your side?

Thanks,

Matija

> Thanks
> -vlad
> 
>>  	sctp_ulpevent_release_owner(event);
>>  }
>>  
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 

^ permalink raw reply

* [PATCH] 6lowpan: add a license to 6lowpan_iphc module
From: Yann Droneaud @ 2014-01-22 19:25 UTC (permalink / raw)
  To: Alexander Smirnov, Dmitry Eremin-Solenikov, Marcel Holtmann,
	Gustavo Padovan, Johan Hedberg, David S. Miller
  Cc: Yann Droneaud, linux-zigbee-devel, linux-bluetooth, netdev,
	linux-kernel, Jukka Rissanen, Alexander Aring

Since commit 8df8c56a5abc, 6lowpan_iphc is a module of its own.

Unfortunately, it lacks some infrastructure to behave like a
good kernel citizen:

  kernel: 6lowpan_iphc: module license 'unspecified' taints kernel.
  kernel: Disabling lock debugging due to kernel taint

This patch adds the basic MODULE_LICENSE(); with GPL license:
the code was copied from net/ieee802154/6lowpan.c which is GPL
and the module exports symbol with EXPORT_SYMBOL_GPL();.

Cc: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
---
 net/ieee802154/6lowpan_iphc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ieee802154/6lowpan_iphc.c b/net/ieee802154/6lowpan_iphc.c
index e14fe8b2c054..860aa2d445ba 100644
--- a/net/ieee802154/6lowpan_iphc.c
+++ b/net/ieee802154/6lowpan_iphc.c
@@ -52,6 +52,7 @@
 
 #include <linux/bitops.h>
 #include <linux/if_arp.h>
+#include <linux/module.h>
 #include <linux/netdevice.h>
 #include <net/ipv6.h>
 #include <net/af_ieee802154.h>
@@ -797,3 +798,5 @@ int lowpan_header_compress(struct sk_buff *skb, struct net_device *dev,
 	return 0;
 }
 EXPORT_SYMBOL_GPL(lowpan_header_compress);
+
+MODULE_LICENSE("GPL");
-- 
1.8.4.2

^ permalink raw reply related

* Re: [PATCH net-next v3] tuntap: Fix for a race in accessing numqueues
From: Max Krasnyansky @ 2014-01-22 19:44 UTC (permalink / raw)
  To: Dominic Curran, netdev; +Cc: Jason Wang
In-Reply-To: <1390359803-27989-1-git-send-email-dominic.curran@citrix.com>

On 01/21/2014 07:03 PM, Dominic Curran wrote:
> A patch for fixing a race between queue selection and changing queues
> was introduced in commit 92bb73ea2("tuntap: fix a possible race between
> queue selection and changing queues").
> 
> The fix was to prevent the driver from re-reading the tun->numqueues
> more than once within tun_select_queue() using ACCESS_ONCE().
> 
> We have been experiancing 'Divide-by-zero' errors in tun_net_xmit() 
> since we moved from 3.6 to 3.10, and believe that they come from a 
> simular source where the value of tun->numqueues changes to zero 
> between the first and a subsequent read of tun->numqueues.
> 
> The fix is a simular use of ACCESS_ONCE(), as well as a multiply
> instead of a divide in the if statement.
> 
> Signed-off-by: Dominic Curran <dominic.curran@citrix.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
> ---
> V3: Rebase against net-next. Include all numqueues in function.
> V2: Use multiply instead of divide. Suggested by Eric Dumazet.
>     Fixed email address for maxk. Rebase against net tree.
> ---
>  drivers/net/tun.c |   10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> Index: net-next/drivers/net/tun.c
> ===================================================================
> --- net-next.orig/drivers/net/tun.c	2014-01-22 02:50:01.000000000 +0000
> +++ net-next/drivers/net/tun.c	2014-01-22 02:59:42.000000000 +0000
> @@ -738,15 +738,17 @@ static netdev_tx_t tun_net_xmit(struct s
>  	struct tun_struct *tun = netdev_priv(dev);
>  	int txq = skb->queue_mapping;
>  	struct tun_file *tfile;
> +	u32 numqueues = 0;
>  
>  	rcu_read_lock();
>  	tfile = rcu_dereference(tun->tfiles[txq]);
> +	numqueues = ACCESS_ONCE(tun->numqueues);
>  
>  	/* Drop packet if interface is not attached */
> -	if (txq >= tun->numqueues)
> +	if (txq >= numqueues)
>  		goto drop;
>  
> -	if (tun->numqueues == 1) {
> +	if (numqueues == 1) {
>  		/* Select queue was not called for the skbuff, so we extract the
>  		 * RPS hash and save it into the flow_table here.
>  		 */
> @@ -779,8 +781,8 @@ static netdev_tx_t tun_net_xmit(struct s
>  	/* Limit the number of packets queued by dividing txq length with the
>  	 * number of queues.
>  	 */
> -	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
> -			  >= dev->tx_queue_len / tun->numqueues)
> +	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue) * numqueues
> +			  >= dev->tx_queue_len)
>  		goto drop;
>  
>  	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> 

Acked-by: Max Krasnyansky <maxk@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next] net: phy: use network device in phy_print_status
From: Florian Fainelli @ 2014-01-22 19:45 UTC (permalink / raw)
  To: Joe Perches; +Cc: netdev, David Miller
In-Reply-To: <1390417506.31946.89.camel@joe-AO722>

2014/1/22 Joe Perches <joe@perches.com>:
> On Wed, 2014-01-22 at 10:52 -0800, Florian Fainelli wrote:
>> 2014/1/22 Joe Perches <joe@perches.com>:
>> > On Wed, 2014-01-22 at 10:32 -0800, Florian Fainelli wrote:
>> >> phy_print_status() currently uses dev_name(&phydev->dev) which will
>> >> usually result in printing something along those lines for Device Tree
>> >> aware drivers:
>> >>
>> >> libphy: f0b60000.etherne:0a - Link is Down
>> >> libphy: f0ba0000.etherne:00 - Link is Up - 1000/Full
> []
>> > The leading "- "'s now aren't useful and these could be
>>
>> I did not want to remove the leading "- " since this would sort of
>> produce a very different message now, that said, I have no strong
>> feeling removing it.
>
> The output would already be different as it's no longer
> prefixed by "phylib: ".
>
> With the dash it'll look like:
>
> r8169 0000:01:00.0 eth3: - Link is Up
>
> when all the other netdev_info link messages
> don't use the - separator after the "ethX: "

You are right, thanks for spotting that, I will re-submit.

>
>> > It may also be possible to slightly improve the output by
>> > reducing the number of 0's in the speed block and emitting
>> > Kb/Mb/Gb (or Kbps/Mbps/Gbps) and maybe flow control if it's
>> > available.
>>
>> Sure thing, that calls for a follow-up patch though.
>
> Thanks.
>
-- 
Florian

^ permalink raw reply

* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Matija Glavinic Pecotic @ 2014-01-22 19:48 UTC (permalink / raw)
  To: ext David Laight
  Cc: 'Vlad Yasevich', linux-sctp@vger.kernel.org,
	Alexander Sverdlin, netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D463CC4@AcuExch.aculab.com>

Hello David,

On 01/22/2014 04:30 PM, ext David Laight wrote:
> From: Vlad Yasevich 
> ...
>>> IIRC the 'size' taken of the socket buffer is the skb's 'true size'
>>> and that includes any padding before and after the actual rx data.
>>> For short packets the driver may have copied the data into a smaller
>>> skb, for long packets it is likely to be more than that of a full
>>> length ethernet packet.
>>> In either case it can be significantly more than sizeof(sk_buff)
>>> (190?) plus the size of the actual data.
>>
>> SCTP currently doesn't support GRO, so each packet is limited to
>> ethernet packet plus sk_buff overhead.
> 
> The ethernet driver might still pass up a 2k buffer, or even a 4k one.
> Especially if it supports GRO for TCP.
> 
>> What throws a real monkey
>> wrench into this whole accounting business is SCTP bundling.  If you
>> bundle multiple messages into the single packet, accounting for it
>> is a mess.
> 
> How about dividing the 'truesize' by the reference count?
> (except that isn't quite right...)
> I assume there are multiple skb headers but only one data buffer?
> At least the chunks are all for the same connection so end up on
> one socket (except I remember some other horrid stuff for datagrams).
>  
>>> I'm also not sure that continuously removing 'credit' is a good idea.
>>> I've done a lot of comms protocol code, removing credit and 'window
>>> slamming' acks are not good ideas.
>>
>> This patch doesn't continuously remove 'credit'.  It advertises the
>> closest approximation of the window that we support and computes it
>> the same way as we do for Initial Window (available sk_rcvbuff >> 1).
>> As the receiver drains the receive queue, available buffer will increase
>> and the available window will grow.
> 
> Let's assume the socket buffer size is 100k.
> We advertise a window of 50k.
> We now receive 100 bytes of data, the remote has 49900 window left.
> The 'truesize' is something like 190+64(ish)+100+tail_pad, say 400.
> Socket buffer space is reduced to 99600 and any ack would give 49800.
> So we have reduced the window by 100 bytes.
> With that much space it probably doesn't matter much.
> 
> However if the connection is receive limited then the remote system
> will have a second packet in flight that uses the rest of the window.
> We then receive an in-sequence but out-of-window packet that refers
> to window that we had previously given to the remote system.

AFAIK, if the packet is not discarded due to depleted buffer, it will normally be processed.
In case its discarded, sender will have to resend it since it wont be acked.

> I don't know what the sctp (or tcp) code does with such packets.
> In my experience it is best to treat them as valid data (unless
> there are larger free memory issues) and ack them at some point.
> Hopefully the rules of the underlying protocol let you do this!
> 
> If you discard these packets then every packet gets sent twice
> (or even more often if the data is very short).

For small packets we will not get into trouble since sctp has same as tcp sws avoidance algorithm

Regards,

Matija

> 	David
> 

^ permalink raw reply

* [PATCH net-next] net: vxlan: convert to act as a pernet subsystem
From: Daniel Borkmann @ 2014-01-22 20:07 UTC (permalink / raw)
  To: davem; +Cc: netdev, Jesse Brandeburg, Eric W. Biederman

As per suggestion from Eric W. Biederman, vxlan should be using
{un,}register_pernet_subsys() instead of {un,}register_pernet_device()
to ensure the vxlan_net structure is initialized before and cleaned
up after all network devices in a given network namespace i.e. when
dealing with network notifiers. This is similarly handeled already in
commit 91e2ff3528ac ("net: Teach vlans to cleanup as a pernet subsystem")
and, thus, improves upon fd27e0d44a89 ("net: vxlan: do not use vxlan_net
before checking event type"). Just as in 91e2ff3528ac, we do not need
to explicitly handle deletion of vxlan devices as network namespace
exit calls dellink on all remaining virtual devices, and
rtnl_link_unregister() calls dellink on all outstanding devices in that
network namespace, so we can entirely drop the pernet exit operation
as well. Moreover, on vxlan module exit, rcu_barrier() is called by
netns since commit 3a765edadb28 ("netns: Add an explicit rcu_barrier
to unregister_pernet_{device|subsys}"), so this may be omitted. Tested
with various scenarios and works well on my side.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 drivers/net/vxlan.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 942acc2..e1bc925 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2784,12 +2784,10 @@ static int vxlan_lowerdev_event(struct notifier_block *unused,
 				unsigned long event, void *ptr)
 {
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-	struct vxlan_net *vn;
+	struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);

-	if (event == NETDEV_UNREGISTER) {
-		vn = net_generic(dev_net(dev), vxlan_net_id);
+	if (event == NETDEV_UNREGISTER)
 		vxlan_handle_lowerdev_unregister(vn, dev);
-	}

 	return NOTIFY_DONE;
 }
@@ -2812,22 +2810,8 @@ static __net_init int vxlan_init_net(struct net *net)
 	return 0;
 }

-static __net_exit void vxlan_exit_net(struct net *net)
-{
-	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
-	struct vxlan_dev *vxlan, *next;
-	LIST_HEAD(list_kill);
-
-	rtnl_lock();
-	list_for_each_entry_safe(vxlan, next, &vn->vxlan_list, next)
-		vxlan_dellink(vxlan->dev, &list_kill);
-	unregister_netdevice_many(&list_kill);
-	rtnl_unlock();
-}
-
 static struct pernet_operations vxlan_net_ops = {
 	.init = vxlan_init_net,
-	.exit = vxlan_exit_net,
 	.id   = &vxlan_net_id,
 	.size = sizeof(struct vxlan_net),
 };
@@ -2842,7 +2826,7 @@ static int __init vxlan_init_module(void)

 	get_random_bytes(&vxlan_salt, sizeof(vxlan_salt));

-	rc = register_pernet_device(&vxlan_net_ops);
+	rc = register_pernet_subsys(&vxlan_net_ops);
 	if (rc)
 		goto out1;

@@ -2858,7 +2842,7 @@ static int __init vxlan_init_module(void)
 out3:
 	unregister_netdevice_notifier(&vxlan_notifier_block);
 out2:
-	unregister_pernet_device(&vxlan_net_ops);
+	unregister_pernet_subsys(&vxlan_net_ops);
 out1:
 	destroy_workqueue(vxlan_wq);
 	return rc;
@@ -2870,8 +2854,8 @@ static void __exit vxlan_cleanup_module(void)
 	rtnl_link_unregister(&vxlan_link_ops);
 	unregister_netdevice_notifier(&vxlan_notifier_block);
 	destroy_workqueue(vxlan_wq);
-	unregister_pernet_device(&vxlan_net_ops);
-	rcu_barrier();
+	unregister_pernet_subsys(&vxlan_net_ops);
+	/* rcu_barrier() is called by netns */
 }
 module_exit(vxlan_cleanup_module);

-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Matija Glavinic Pecotic @ 2014-01-22 19:34 UTC (permalink / raw)
  To: ext Vlad Yasevich, Neil Horman; +Cc: linux-sctp, Alexander Sverdlin, netdev
In-Reply-To: <52DFD334.7030507@gmail.com>

Hello Neil, Vlad,

On 01/22/2014 03:18 PM, ext Vlad Yasevich wrote:
> On 01/22/2014 07:30 AM, Neil Horman wrote:
>> On Fri, Jan 17, 2014 at 08:01:24AM +0100, Matija Glavinic Pecotic wrote:
>>>
>>> Proposed solution simplifies whole algorithm having on mind
> definition from rfc:
>>>
>>> o  Receiver Window (rwnd): This gives the sender an indication of the
> space
>>>    available in the receiver's inbound buffer.
>>>
>>> Core of the proposed solution is given with these lines:
>>>
>>> sctp_assoc_rwnd_account:
>>> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>>> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>>> 	else
>>> 		asoc->rwnd = 0;
>>>
>>> We advertise to sender (half of) actual space we have. Half is in the
> braces
>>> depending whether you would like to observe size of socket buffer as
> SO_RECVBUF
>>> or twice the amount, i.e. size is the one visible from userspace,
> that is,
>>> from kernelspace.
>>> In this way sender is given with good approximation of our buffer space,
>>> regardless of the buffer policy - we always advertise what we have.
> Proposed
>>> solution fixes described problems and removes necessity for rwnd
> restoration
>>> algorithm. Finally, as proposed solution is simplification, some
> lines of code,
>>> along with some bytes in struct sctp_association are saved.
>>>
>>> Signed-off-by: Matija Glavinic Pecotic
> <matija.glavinic-pecotic.ext@nsn.com>
>>> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>>>
>>
>>
>> General comment - While this seems to make sense to me generally speaking,
>> doesn't it currently violate section 6 of the RFC?
>>
>>
>> A SCTP receiver MUST be able to receive a minimum of 1500 bytes in
>>    one SCTP packet.  This means that a SCTP endpoint MUST NOT indicate
>>    less than 1500 bytes in its Initial a_rwnd sent in the INIT or INIT
>>    ACK.
>>
>> Since we set the initial rwnd value to the larger of sk->sk_rcvbuf/2 or
>> SCTP_MIN_RWND (1500 bytes), won't we potentially advertize half that
> amount?
> 
> Not initially.  Initial window will still be advertized properly.  Once
> we receive the first packet and consumed some space, we'll advertize
> half of available receive buffer.  It is perfectly OK to advertize a
> window smaller the MIN_WINDOW in the middle of the transfer.

I agree to this, although we might be in gray area here:

   Advertised Receiver Window Credit (a_rwnd): 32 bits (unsigned
   integer)

      This value represents the dedicated buffer space, in number of
      bytes, the sender of the INIT has reserved in association with
      this window.  During the life of the association, this buffer
      space SHOULD NOT be lessened (i.e., dedicated buffers taken away
      from this association); however, an endpoint MAY change the value
      of a_rwnd it sends in SACK chunks.

this might be considered as taking away buffer space, although I would agree with point below about doubling.

This however opens another question which you should be aware of. This patch brakes regression, two TCs:

test_timetolive and test_sctp_sendrcvmsg.

This is simply due to 'honest' rwnd reporting. Both of these TCs share code in which initial rwnd is set very low, later socket buffer is increased but with counting on the fact that rwnd will stay as initially set. In TCs, this latter rwnd is fetched from the socket and used as value for the message size which in the end breaks it as message to be sent is too big.
What is important difference to current implementation is that changes of SO_RECVBUF will also change a_rwnd. It is not a big problem to add code which will keep the idea, but bound rwnd to initially set rwnd, but we haven't found it mandated by rfc.

Thanks,

Matija 

>> It seems we need to double the minimum socket receive buffer here.
> 
> Not here specifically, but yes.  It is already broken and this patch
> doesn't change current behavior.  This is something SCTP may need to do
> separately.
> 
> -vlad
> 
>>
>> Neil
>>
>>

^ permalink raw reply

* Re: [PATCH net-next 1/3] bonding: Set the correct value to fail_over_mac at enslavement
From: Jay Vosburgh @ 2014-01-22 20:51 UTC (permalink / raw)
  To: Ding Tianhong; +Cc: Veaceslav Falico, David S. Miller, Netdev, Andy Gospodarek
In-Reply-To: <52DF8DB8.9000006@huawei.com>

Ding Tianhong <dingtianhong@huawei.com> wrote:

>If the new slave don't support setting the MAC address, there are
>two ways to handle this situation:
>
>1). If the new slave is the first slave, set bond to the new slave's
>    MAC address, if the mode is active-backup, set fail_over_mac to
>    active, otherwise set fail_over_mac to none.

	This should be "if the mode is active-backup, set fail_over_mac
to active, otherwise do not change fail_over_mac."  Setting to none here
would undo any setting of fail_over_mac that the user had set prior to
adding the first slave.

>2). If the new slave is not the first slave and the fail_over_mac is
>    active, it means that the slave could work normally in active-backup
>    mode, otherwise if the fail_over_mac is none, the slave could not
>    work normally for no active-backup mode, so bond could not ensalve
>    the new dev.

	This (#2) is not a code change, correct?  You're just restating
the existing behavior of the code, right?

	Also, I don't see where this patch set updates the slave removal
processing where the slave's original MAC is restored.  At present, this
is done by a test against fail_over_mac, but should be tested against
the mode and fail_over_mac.

	My comment to the prior version of this patchset, again:

	The correct way to fix this in general is to permit setting an
option at any time, but only have it take effect in active-backup mode.
This minimizes ordering requirements when setting options.

	I would instead modify the bond enslave and removal processing
to check the mode in addition to fail_over_mac when setting a slave's
MAC during enslavement.  The change active slave processing already only
calls the fail_over_mac function when in active-backup mode.  This
should also be a simpler change set.

	These comments still apply to this version of the patchset.

	-J

>Cc: Jay Vosburgh <fubar@us.ibm.com>
>Cc: Veaceslav Falico <vfalico@redhat.com>
>Cc: Andy Gospodarek <andy@greyhouse.net>
>Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
>---
> drivers/net/bonding/bond_main.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 3220b48..598f100 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1334,9 +1334,17 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>
> 	if (slave_ops->ndo_set_mac_address == NULL) {
> 		if (!bond_has_slaves(bond)) {
>-			pr_warning("%s: Warning: The first slave device specified does not support setting the MAC address. Setting fail_over_mac to active.",
>+			pr_warning("%s: Warning: The first slave device specified does not support setting the MAC address.\n",
> 				   bond_dev->name);
>-			bond->params.fail_over_mac = BOND_FOM_ACTIVE;
>+			if (bond->params.mode == BOND_MODE_ACTIVEBACKUP) {
>+				bond->params.fail_over_mac = BOND_FOM_ACTIVE;
>+				pr_warning("%s: Setting fail_over_mac to active for active-backup mode.\n",
>+					   bond_dev->name);
>+			} else {
>+				bond->params.fail_over_mac = BOND_FOM_NONE;
>+				pr_warning("%s: Setting fail_over_mac to none for no active-backup modes",
>+					   bond_dev->name);
>+			}
> 		} else if (bond->params.fail_over_mac != BOND_FOM_ACTIVE) {
> 			pr_err("%s: Error: The slave device specified does not support setting the MAC address, but fail_over_mac is not set to active.\n",
> 			       bond_dev->name);
>-- 
>1.8.0

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* [PATCH] af_packet: Add Queue mapping mode to af_packet fanout operation
From: Neil Horman @ 2014-01-22 21:01 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, David S. Miller

This patch adds a queue mapping mode to the fanout operation of af_packet
sockets.  This allows user space af_packet users to better filter on flows
ingressing and egressing via a specific hardware queue, and avoids the potential
packet reordering that can occur when FANOUT_CPU is being used and irq affinity
varies.

Tested successfully by myself.  applies to net-next

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 include/uapi/linux/if_packet.h |  1 +
 net/packet/af_packet.c         | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 1988a02..bac27fa 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -60,6 +60,7 @@ struct sockaddr_ll {
 #define PACKET_FANOUT_CPU		2
 #define PACKET_FANOUT_ROLLOVER		3
 #define PACKET_FANOUT_RND		4
+#define PACKET_FANOUT_QM		5
 #define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
 #define PACKET_FANOUT_FLAG_DEFRAG	0x8000
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d711ecb..bd90a87 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1313,6 +1313,13 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 	return idx;
 }
 
+static unsigned int fanout_demux_qm(struct packet_fanout *f,
+				    struct sk_buff *skb,
+				    unsigned int num)
+{
+	return skb_get_queue_mapping(skb) % num;
+}
+
 static bool fanout_has_flag(struct packet_fanout *f, u16 flag)
 {
 	return f->flags & (flag >> 8);
@@ -1352,6 +1359,9 @@ static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev,
 	case PACKET_FANOUT_RND:
 		idx = fanout_demux_rnd(f, skb, num);
 		break;
+	case PACKET_FANOUT_QM:
+		idx = fanout_demux_qm(f, skb, num);
+		break;
 	case PACKET_FANOUT_ROLLOVER:
 		idx = fanout_demux_rollover(f, skb, 0, (unsigned int) -1, num);
 		break;
@@ -1422,6 +1432,7 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 	case PACKET_FANOUT_LB:
 	case PACKET_FANOUT_CPU:
 	case PACKET_FANOUT_RND:
+	case PACKET_FANOUT_QM:
 		break;
 	default:
 		return -EINVAL;
-- 
1.8.3.1

^ permalink raw reply related

* Re: Linux kernel patch: elide fib_validate_source() completely when possible - bad side effect?
From: Julian Anastasov @ 2014-01-22 21:04 UTC (permalink / raw)
  To: gregory hoggarth; +Cc: netdev
In-Reply-To: <52DCF1440200005D000477FD@gwia2.atlnz.lc>


	Hello,

On Mon, 20 Jan 2014, gregory hoggarth wrote:

> >>> Julian Anastasov <ja@ssi.bg> 17/1/14 09:24 PM >>> 
> >
> >	It seems only __fib_validate_source can reject all kind
> > of broadcast addresses in saddr. ip_route_input_slow() rejects
> > only the well known broadcasts. Without rp_filter may be we
> > can at least drop attempts to send replies back to broadcast
> > addresses? For example, checking result of ip_route_output_key() in
> > icmp_reply():
> >
> >	if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
> >		=> ip_rt_put()

> Thank you for your reply.
> 
> While I think your solution may work, isn't the proper approach to drop these 
> rogue packets, rather than wasting CPU and other resources processing them, 
> and then beginning to craft responses which are in turn dropped?

	The problem is that it is __fib_validate_source that
takes time for every packet. And it is a rare case to see
traffic from broadcast addresses. rp_filter set on external
interface will drop such packets. It is an option that one
can use even for internal interface, if needed.

> Also seems better to drop the initial rogue packet, as that should cover all 
> (?) different types of packets, rather than having to add small patches into 
> ICMP, TCP and any other places that may need it.

	Small check looks better compared to FIB lookup for
every received packet. Note that output route can be cached
for sockets, eg. TCP, so such small checks do not occur for
every reply packet.

> So my question really is was the original patch correct and there's something
> wrong with our device configuration, or is this an overlooked / untested side-
> effect from that patch that as a result means the patch should be re-worked?

	I don't remember someone mentioning about such
side-effect, I guess it is overlooked. IMHO, it is not a good
reason to restore the old behavior. Lets see other opinions.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH net-next] IPv6: enable TCP to use an anycast address
From: Alexey Kuznetsov @ 2014-01-22 21:11 UTC (permalink / raw)
  To: François-Xavier Le Bail
  Cc: netdev, David Stevens, Hannes Frederic Sowa, David Miller,
	Hideaki Yoshifuji
In-Reply-To: <1390401156.61208.YahooMailBasic@web125501.mail.ne1.yahoo.com>

Hello!

On Wed, Jan 22, 2014 at 6:32 PM, François-Xavier Le Bail
<fx.lebail@yahoo.com> wrote:
>> Actually, I was alerted by reset processing in your patch, it cannot be right.
>
> Perhaps, I missed something.

Perhaps, I missed something. According to my old knowledge, tcp cannot
be used with anycasts
by many reasons and I have no idea what could change to the moment.

F.e. what do you do when a segment arrives on listening socket w/o
connection associated?
If you sent reset, you could reset someone's valid connection (I mean
this, saying "it cannot be right").
If you do not, you violate protocol, those resets are required to
clean up stale states.

>> Do not you think this must not be enabled for common use? At least
>> some separate sysctl disabled by default.
>
> We could indeed use a sysctl, disabled by default.
>
> But my goal was to enable anycast address usage transparently, if possible.

IMHO it is logically impossible. That's why my first question was
about a document which
allowed anycasts with TCP. (BTW I found your answer unconvincing, at least.)
Traditional anycasting used to be stateless and it was impossible to
use with TCP
because unsynchronized TCP connections would be unavoidable.
AFAIK only suggestion from ancient RFC1546 could cure this problem,
but it is very tricky and implies client knows that destination is anycast.

Actually, I do not even understand how the idea of use of anycast with TCP
emerged. The only situation I can imagine: the router is expected to
do full scale
connection tracking and to bind connections to anycast destinations to
correct nodes.
(And the same moment you understand that the notion of anycast is
completely lost
in this picture: the same thing is done by load balancers in IPv4 for
unicast addresses
for ages) I still have no idea how it is going to work when both
client and servers bound
to anycasts are on the same link.

So, it looks like you need this sysctl to announce: "someone cares
about proper routing
on this network and tcp should work". It looks obvious this mode
cannot be enabled by default.

Alexey

^ permalink raw reply

* Re: [RFC PATCH net] IPv6: Fix broken IPv6 routing table after loopback down-up
From: Hannes Frederic Sowa @ 2014-01-22 21:34 UTC (permalink / raw)
  To: Sabrina Dubroca; +Cc: netdev, gaofeng
In-Reply-To: <1390404908-3914-1-git-send-email-sd@queasysnail.net>

On Wed, Jan 22, 2014 at 04:35:08PM +0100, Sabrina Dubroca wrote:
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 4b6b720..b2eb653 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -2578,6 +2578,23 @@ static void sit_add_v4_addrs(struct inet6_dev *idev)
>  }
>  #endif
>  
> +struct rt6_info *addrconf_dst_alloc_once(struct inet6_dev *idev,
> +					 const struct in6_addr *addr,
> +					 bool anycast)
> +{
> +	struct net *net = dev_net(idev->dev);
> +	struct rt6_info *rt = rt6_lookup(net, addr, NULL, 0, 0);

We should use addrconf_get_prefix_route here. Eric's comment applies
here, too, we do a dst_hold in addrconf_get_prefix_route and thus must
decrease the reference counter with ip6_rt_put then, too.

> +	if (rt == NULL)
> +		return addrconf_dst_alloc(idev, addr, anycast);
> +	else if (!(rt->rt6i_flags & RTF_NONEXTHOP &&
> +		   ((anycast && rt->rt6i_flags & RTF_ANYCAST) ||
> +		    (!anycast && rt->rt6i_flags & RTF_LOCAL))))
> +		return addrconf_dst_alloc(idev, addr, anycast);

The necessary flags can be given to addrconf_get_prefix_route, then.

> +	return ERR_PTR(-EEXIST);
> +}
> +
>  static void init_loopback(struct net_device *dev)
>  {
>  	struct inet6_dev  *idev;
> @@ -2611,10 +2628,7 @@ static void init_loopback(struct net_device *dev)
>  			if (sp_ifa->flags & (IFA_F_DADFAILED | IFA_F_TENTATIVE))
>  				continue;
>  
> -			if (sp_ifa->rt)
> -				continue;

We should try to clean sp_ifa->rt on ifdown so we don't have dangling
pointer to it if it is already in dst gc queue and obsolete. So even if
we leave this code in there we would never hit the continue (it seems we
have to decrease the reference counter in addrconf_ifdown before setting
sp_ifa->rt to NULL).

> -
> -			sp_rt = addrconf_dst_alloc(idev, &sp_ifa->addr, false);
> +			sp_rt = addrconf_dst_alloc_once(idev, &sp_ifa->addr, false);
>  
>  			/* Failure cases are ignored */
>  			if (!IS_ERR(sp_rt)) {

I am fine if we get this fix in for 3.14 because it solves a real problem.
I'll already work on the full route preserving logic on ifdown/up and
would build this on this patch then.

I currently wonder if we need the relookup logic at all as we currently
remove the prefix routes in all cases (in rt6_ifdown, which iterates
over all routing tables). So we always have a obsolete dst which we
just need to clean up in addrconf_ifdown and just allocate the prefix
route in init_loopback in all cases. We just have to steal some code
from inet6_rtm_newaddr to calculate the appropriate flags from the
inet6_ifaddr's ifa_flags (e.g. RTF_EXPIRES).

Thanks a lot,

  Hannes

^ permalink raw reply

* [BUG] null pointer dereference in tcp_gso_segment()
From: Arnaud Ebalard @ 2014-01-22 21:46 UTC (permalink / raw)
  To: David Miller, Eric Dumazet, Daniel Borkmann, Herbert Xu,
	Willy Tarreau
  Cc: netdev

Hi,

The following is a backtrace I got while doing a simple transfer from my
NAS (ARMv7-based ReadyNAS 102) running a 3.13.0 kernel. Usage was a
simple file served by nginx. I managed to get a second identical one
in another session. As this did not looked like some random race
condition or some missing lock, I decided to take a look. Details are
given below. FWIW, driver is mvneta (w/ fixes and performance patches
you accepted for v3.14, David) and ethtool reports gso is enabled and tso
disabled.

[  943.860383] Unable to handle kernel NULL pointer dereference at virtual address 00000090
[  943.868497] pgd = c0004000
[  943.871217] [00000090] *pgd=00000000
[  943.874813] Internal error: Oops: 17 [#1] ARM
[  943.879175] Modules linked in:
[  943.882248] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.0.rn102-00594-ga0fa1dd3cdbc-dirty #63
[  943.890873] task: c0775620 ti: c0768000 task.ti: c0768000
[  943.896291] PC is at tcp_gso_segment+0x1d8/0x390
[  943.900925] LR is at skb_segment+0x510/0x788
[  943.905202] pc : [<c04b2094>]    lr : [<c0448b90>]    psr: 200f0113
[  943.905202] sp : c0769aa8  ip : 00005b27  fp : 00000001
[  943.916697] r10: 00005b27  r9 : 00000868  r8 : dfab5cb0
[  943.921930] r7 : 00000000  r6 : 4ef79279  r5 : 00000020  r4 : deeba830
[  943.928467] r3 : 000005a8  r2 : df993200  r1 : c04a63f8  r0 : 000005a8
[  943.935005] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[  943.942325] Control: 10c5387d  Table: 1534c019  DAC: 00000015
[  943.948079] Process swapper (pid: 0, stack limit = 0xc0768238)
[  943.953921] Stack: (0xc0769aa8 to 0xc076a000)
[  943.958288] 9aa0:                   c077000c 000005a8 0000e397 d52f3540 c0771230 00001c48
[  943.966481] 9ac0: 00000006 deeba830 0000c6bc 00000014 0000009c 00000000 00000014 c04beeb4
[  943.974674] 9ae0: c0fbb8ec dec2ab80 0000008e 00000000 00000001 00004803 00000002 deeba830
[  943.982867] 9b00: c07739cc deeba830 df27d15c df1fd800 c05a045c c0452dcc 00000004 c07739e0
[  943.991061] 9b20: 00004803 00004803 00000002 00010000 00000000 deeba830 df27d15c c04530e8
[  943.999254] 9b40: 00000001 c046fc40 c0492410 df1fd800 00000000 00000000 00000000 def06e00
[  944.007447] 9b60: df1fd800 deeba830 df1fd800 df27d15c 00000010 c0469af4 def06e00 00000000
[  944.015640] 9b80: 00000042 00000000 df27d15c df1fd800 deeba830 c04535b8 def06e60 000333c5
[  944.023834] 9ba0: 00000000 dfaf5180 dfaf51fc 0000000e 00000000 deeba830 df1fd800 def35580
[  944.032027] 9bc0: 00000010 c0492604 80000000 0b50a8c0 00000000 deeba830 df993200 00000000
[  944.040221] 9be0: df993404 def35580 00000000 009c0000 00000000 c0492a2c deeba830 c0492b5c
[  944.048414] 9c00: deeba780 df993200 deeba780 00000020 00000000 0000fb7b 00000020 df993200
[  944.056608] 9c20: deeba830 c0778368 00000000 def35580 00000000 c04a76e8 c3a0f5c7 134b291d
[  944.064802] 9c40: 00000001 00000000 00000002 00000000 00000000 0000fb7b 002411ee 00000000
[  944.072995] 9c60: 79e31870 df993200 deeba780 000005a8 4ef79279 00001c48 00000000 000021f0
[  944.081188] 9c80: 00000000 c04a7d68 c39c239b 134b291d 00000006 000016a0 000021f0 00000001
[  944.089381] 9ca0: 00000000 df9932bc 00000002 00000000 00000006 134b291d d52f3540 df993200
[  944.097575] 9cc0: d567e7e4 d52f3540 00000020 d567e7d0 d52f3540 0000ddc7 c0770618 c04a86d8
[  944.105768] 9ce0: 00000020 df993200 df993200 c04a4954 00000000 00000020 00000001 00000002
[  944.113961] 9d00: d52f3540 df993200 def35a00 df993200 d567e7d0 d52f3540 0000ddc7 c04ac61c
[  944.122154] 9d20: c048da1c c046fc40 c048da1c c046fc40 dec12140 c0770b64 d52f3540 d52f3540
[  944.130348] 9d40: c07a3dc0 00000000 df993200 c04aede0 df1fd800 c046fcc0 00000000 c0769d74
[  944.138541] 9d60: c048da1c 80000000 c048d734 c0770b64 0000000e c0770b64 3050a8c0 c05aacb4
[  944.146734] 9d80: c0773470 00000000 c07a3dc0 d52f3540 d52f3540 00000000 c0770618 c048dab0
[  944.154927] 9da0: d567e7d0 c07739e8 c0770628 df1fd800 d52f3540 c048d85c df1b4400 c0770614
[  944.163120] 9dc0: c0770614 c07739e8 c0770628 df1fd800 00000008 c044f018 00000003 00000000
[  944.171314] 9de0: e1424800 42000000 df1fd800 d52f3540 00000001 c0770628 00000000 d52f3540
[  944.179507] 9e00: 00000003 df3bd400 e1424800 42000000 df1fd800 d52f3540 00000001 c045153c
[  944.187700] 9e20: 00000004 c0019430 e1424000 c0382238 00000000 00000000 00000002 dfbc4600
[  944.195893] 9e40: 00000001 df1fdbe0 00000001 00000042 00000000 00000000 dee34000 00000040
[  944.204086] 9e60: df1fdbe0 00000100 00000000 00000100 df3bd400 c079b684 00000000 c0382354
[  944.212279] 9e80: c04ab2fc c04aa344 df993200 df1fd800 00000100 df1fdbfc 00000000 df993200
[  944.220473] 9ea0: 00000100 c04ab360 c0768000 c0023cf4 c0fba140 c03822b0 df1fdbfc 00000040
[  944.228666] 9ec0: 0000012c c07abb40 c07abb48 c0778368 c07abb40 c0451288 c0769ed8 0000fb7d
[  944.236859] 9ee0: c0fba140 00000001 0000000c c07ac850 c07ac840 c0768000 00000003 00000100
[  944.245052] 9f00: 0000000c c001ebd4 00000000 c004782c c0fba140 0000000a 0000fb7c 00200000
[  944.253245] 9f20: c0769f78 c078af20 00000018 00000000 c0769f78 c07abc01 00000001 c07abc01
[  944.261438] 9f40: 00000000 c001ef18 c078af20 c000ea44 c07f6f40 000003ff c07f6f40 c00084d8
[  944.269632] 9f60: c000ebc0 c0041c54 600f0013 ffffffff c0769fac c00110c0 00000000 00000000
[  944.277825] 9f80: 00000000 c0779a48 c0768000 c0768000 c0768000 c07700cc c07abc01 00000001
[  944.286018] 9fa0: c07abc01 00000000 01000000 c0769fc0 c000ebc0 c0041c54 600f0013 ffffffff
[  944.294211] 9fc0: c0761530 c073ba14 ffffffff ffffffff c073b4f0 00000000 00000000 c0761530
[  944.302405] 9fe0: 00000000 10c53c7d c0770078 c076152c c07765f8 00008070 00000000 00000000
[  944.310608] [<c04b2094>] (tcp_gso_segment+0x1d8/0x390) from [<c04beeb4>] (inet_gso_segment+0x118/0x2dc)
[  944.320029] [<c04beeb4>] (inet_gso_segment+0x118/0x2dc) from [<c0452dcc>] (skb_mac_gso_segment+0xa4/0x178)
[  944.329704] [<c0452dcc>] (skb_mac_gso_segment+0xa4/0x178) from [<c04530e8>] (dev_hard_start_xmit+0x170/0x484)
[  944.339643] [<c04530e8>] (dev_hard_start_xmit+0x170/0x484) from [<c0469af4>] (sch_direct_xmit+0xa4/0x19c)
[  944.349231] [<c0469af4>] (sch_direct_xmit+0xa4/0x19c) from [<c04535b8>] (__dev_queue_xmit+0x1bc/0x3dc)
[  944.358560] [<c04535b8>] (__dev_queue_xmit+0x1bc/0x3dc) from [<c0492604>] (ip_finish_output+0x1f4/0x440)
[  944.368060] [<c0492604>] (ip_finish_output+0x1f4/0x440) from [<c0492a2c>] (ip_local_out+0x28/0x2c)
[  944.377037] [<c0492a2c>] (ip_local_out+0x28/0x2c) from [<c0492b5c>] (ip_queue_xmit+0x12c/0x384)
[  944.385754] [<c0492b5c>] (ip_queue_xmit+0x12c/0x384) from [<c04a76e8>] (tcp_transmit_skb+0x42c/0x86c)
[  944.394992] [<c04a76e8>] (tcp_transmit_skb+0x42c/0x86c) from [<c04a7d68>] (tcp_write_xmit+0x174/0xa74)
[  944.404317] [<c04a7d68>] (tcp_write_xmit+0x174/0xa74) from [<c04a86d8>] (__tcp_push_pending_frames+0x30/0x98)
[  944.414252] [<c04a86d8>] (__tcp_push_pending_frames+0x30/0x98) from [<c04a4954>] (tcp_rcv_established+0x334/0x5a0)
[  944.424622] [<c04a4954>] (tcp_rcv_established+0x334/0x5a0) from [<c04ac61c>] (tcp_v4_do_rcv+0x104/0x240)
[  944.434122] [<c04ac61c>] (tcp_v4_do_rcv+0x104/0x240) from [<c04aede0>] (tcp_v4_rcv+0x6ec/0x728)
[  944.442838] [<c04aede0>] (tcp_v4_rcv+0x6ec/0x728) from [<c048dab0>] (ip_local_deliver_finish+0x94/0x21c)
[  944.452337] [<c048dab0>] (ip_local_deliver_finish+0x94/0x21c) from [<c048d85c>] (ip_rcv_finish+0x128/0x2e8)
[  944.462098] [<c048d85c>] (ip_rcv_finish+0x128/0x2e8) from [<c044f018>] (__netif_receive_skb_core+0x4c4/0x5d0)
[  944.472032] [<c044f018>] (__netif_receive_skb_core+0x4c4/0x5d0) from [<c045153c>] (napi_gro_receive+0x74/0xa0)
[  944.482054] [<c045153c>] (napi_gro_receive+0x74/0xa0) from [<c0382238>] (mvneta_rx+0x420/0x498)
[  944.490771] [<c0382238>] (mvneta_rx+0x420/0x498) from [<c0382354>] (mvneta_poll+0xa4/0x3b8)
[  944.499139] [<c0382354>] (mvneta_poll+0xa4/0x3b8) from [<c0451288>] (net_rx_action+0x98/0x180)
[  944.507773] [<c0451288>] (net_rx_action+0x98/0x180) from [<c001ebd4>] (__do_softirq+0xc8/0x1f4)
[  944.516491] [<c001ebd4>] (__do_softirq+0xc8/0x1f4) from [<c001ef18>] (irq_exit+0x6c/0xa8)
[  944.524690] [<c001ef18>] (irq_exit+0x6c/0xa8) from [<c000ea44>] (handle_IRQ+0x34/0x84)
[  944.532624] [<c000ea44>] (handle_IRQ+0x34/0x84) from [<c00084d8>] (armada_370_xp_handle_irq+0x48/0x50)
[  944.541953] [<c00084d8>] (armada_370_xp_handle_irq+0x48/0x50) from [<c00110c0>] (__irq_svc+0x40/0x50)
[  944.551186] Exception stack(0xc0769f78 to 0xc0769fc0)
[  944.556246] 9f60:                                                       00000000 00000000
[  944.564439] 9f80: 00000000 c0779a48 c0768000 c0768000 c0768000 c07700cc c07abc01 00000001
[  944.572632] 9fa0: c07abc01 00000000 01000000 c0769fc0 c000ebc0 c0041c54 600f0013 ffffffff
[  944.580835] [<c00110c0>] (__irq_svc+0x40/0x50) from [<c0041c54>] (cpu_startup_entry+0x44/0xe0)
[  944.589470] [<c0041c54>] (cpu_startup_entry+0x44/0xe0) from [<c073ba14>] (start_kernel+0x2fc/0x358)
[  944.598533] Code: e5942010 e5872010 e5977000 e59d3004 (e1d729b0) 
[  944.604662] ---[ end trace 39b798f37a10efc0 ]---
[  944.609288] Kernel panic - not syncing: Fatal exception in interrupt


Here is the beginning of the assembly dump of tcp_gso_segment() provided by
arm-linux-gnueabi-objdump -S -EL -D -b binary -m arm --start-address=0x4a9ebc
--stop-address=0x4aa24c Image. If you wonder, LOADADDR is 0x8000 during
compilation, which explains the address shift.

004a9ebc <.data+0x4a9ebc>:
  4a9ebc:	e92d4ff0 	push	{r4, r5, r6, r7, r8, r9, sl, fp, lr}
  4a9ec0:	e1a09003 	mov	r9, r3
  4a9ec4:	e5901054 	ldr	r1, [r0, #84]	; 0x54
  4a9ec8:	e24dd014 	sub	sp, sp, #20
  4a9ecc:	e5903050 	ldr	r3, [r0, #80]	; 0x50
  4a9ed0:	e1a04000 	mov	r4, r0
  4a9ed4:	e1a08002 	mov	r8, r2
  4a9ed8:	e0611003 	rsb	r1, r1, r3
  4a9edc:	e3510013 	cmp	r1, #19
  4a9ee0:	9a000099 	bls	0x4aa14c
  4a9ee4:	e1d439b0 	ldrh	r3, [r4, #144]	; 0x90
  4a9ee8:	e59420a0 	ldr	r2, [r4, #160]	; 0xa0
  4a9eec:	e0823003 	add	r3, r2, r3
  4a9ef0:	e5d3500c 	ldrb	r5, [r3, #12]
  4a9ef4:	e1a05225 	lsr	r5, r5, #4
  4a9ef8:	e1a05105 	lsl	r5, r5, #2
  4a9efc:	e3550013 	cmp	r5, #19
  4a9f00:	9a000097 	bls	0x4aa164
  4a9f04:	e5946050 	ldr	r6, [r4, #80]	; 0x50
  4a9f08:	e5943054 	ldr	r3, [r4, #84]	; 0x54
  4a9f0c:	e0631006 	rsb	r1, r3, r6
  4a9f10:	e1550001 	cmp	r5, r1
  4a9f14:	8a0000c0 	bhi	0x4aa21c
  4a9f18:	e065c006 	rsb	ip, r5, r6
  4a9f1c:	e584c050 	str	ip, [r4, #80]	; 0x50
  4a9f20:	e15c0003 	cmp	ip, r3
  4a9f24:	3a0000c6 	bcc	0x4aa244
  4a9f28:	e594709c 	ldr	r7, [r4, #156]	; 0x9c
  4a9f2c:	e59430a4 	ldr	r3, [r4, #164]	; 0xa4
  4a9f30:	e0833005 	add	r3, r3, r5
  4a9f34:	e58430a4 	str	r3, [r4, #164]	; 0xa4
  4a9f38:	e1d700b2 	ldrh	r0, [r7, #2]
  4a9f3c:	e15c0000 	cmp	ip, r0
  4a9f40:	e58d0004 	str	r0, [sp, #4]
  4a9f44:	9a000086 	bls	0x4aa164
  4a9f48:	e1d710b6 	ldrh	r1, [r7, #6]
  4a9f4c:	e3a02701 	mov	r2, #262144	; 0x40000
  4a9f50:	e3a03000 	mov	r3, #0
  4a9f54:	e1822008 	orr	r2, r2, r8
  4a9f58:	e1833009 	orr	r3, r3, r9
  4a9f5c:	e1a00801 	lsl	r0, r1, #16
  4a9f60:	e0022000 	and	r2, r2, r0
  4a9f64:	e1a0bfc0 	asr	fp, r0, #31
  4a9f68:	e1a0a000 	mov	sl, r0
  4a9f6c:	e003300b 	and	r3, r3, fp
  4a9f70:	e15b0003 	cmp	fp, r3
  4a9f74:	015a0002 	cmpeq	sl, r2
  4a9f78:	0a00007d 	beq	0x4aa174
  4a9f7c:	e5d4107e 	ldrb	r1, [r4, #126]	; 0x7e
  4a9f80:	e1a03009 	mov	r3, r9
  4a9f84:	e1a00004 	mov	r0, r4
  4a9f88:	e1a02008 	mov	r2, r8
  4a9f8c:	e7e071d1 	ubfx	r7, r1, #3, #1
  4a9f90:	e7c3119f 	bfc	r1, #3, #1
  4a9f94:	e5c4107e 	strb	r1, [r4, #126]	; 0x7e
  4a9f98:	e594b068 	ldr	fp, [r4, #104]	; 0x68
  4a9f9c:	ebfe59b7 	bl	0x440680
  4a9fa0:	e3700a01 	cmn	r0, #4096	; 0x1000
  4a9fa4:	e1a09000 	mov	r9, r0
  4a9fa8:	8a00006e 	bhi	0x4aa168
  4a9fac:	e1d929b0 	ldrh	r2, [r9, #144]	; 0x90
  4a9fb0:	e1e06006 	mvn	r6, r6
  4a9fb4:	e59980a0 	ldr	r8, [r9, #160]	; 0xa0
  4a9fb8:	e6ff6076 	uxth	r6, r6
  4a9fbc:	e58d6008 	str	r6, [sp, #8]
  4a9fc0:	e5d9307e 	ldrb	r3, [r9, #126]	; 0x7e
  4a9fc4:	e0888002 	add	r8, r8, r2
  4a9fc8:	e59f2278 	ldr	r2, [pc, #632]	; 0x4aa248
  4a9fcc:	e7c33197 	bfi	r3, r7, #3, #1
  4a9fd0:	e59d0004 	ldr	r0, [sp, #4]
  4a9fd4:	e5c9307e 	strb	r3, [r9, #126]	; 0x7e
  4a9fd8:	e05b2002 	subs	r2, fp, r2
  4a9fdc:	e1d831b0 	ldrh	r3, [r8, #16]
  4a9fe0:	e272b000 	rsbs	fp, r2, #0
  4a9fe4:	e59d1008 	ldr	r1, [sp, #8]
  4a9fe8:	e085c000 	add	ip, r5, r0
  4a9fec:	e0bbb002 	adcs	fp, fp, r2
  4a9ff0:	e1a07009 	mov	r7, r9
  4a9ff4:	e08cc001 	add	ip, ip, r1
  4a9ff8:	e58d900c 	str	r9, [sp, #12]
  4a9ffc:	e6bfcf3c 	rev	ip, ip
  4aa000:	e08cc003 	add	ip, ip, r3
  4aa004:	e3a03000 	mov	r3, #0
  4aa008:	e08cc86c 	add	ip, ip, ip, ror #16
  4aa00c:	e1a0c82c 	lsr	ip, ip, #16
  4aa010:	e1a09003 	mov	r9, r3
  4aa014:	e5986004 	ldr	r6, [r8, #4]
  4aa018:	e1a0a00c 	mov	sl, ip
  4aa01c:	e6bf6f36 	rev	r6, r6
  4aa020:	e5d8200d 	ldrb	r2, [r8, #13]
  4aa024:	e1c8a1b0 	strh	sl, [r8, #16]
  4aa028:	e20220f7 	and	r2, r2, #247	; 0xf7
  4aa02c:	e7c0201f 	bfc	r2, #0, #1
  4aa030:	e5c8200d 	strb	r2, [r8, #13]
  4aa034:	e5d72064 	ldrb	r2, [r7, #100]	; 0x64
  4aa038:	e202200c 	and	r2, r2, #12
  4aa03c:	e352000c 	cmp	r2, #12
  4aa040:	0a000009 	beq	0x4aa06c
  4aa044:	e1d709b0 	ldrh	r0, [r7, #144]	; 0x90
  4aa048:	e1a01005 	mov	r1, r5
  4aa04c:	e597e0a0 	ldr	lr, [r7, #160]	; 0xa0
  4aa050:	e597205c 	ldr	r2, [r7, #92]	; 0x5c
  4aa054:	e08e0000 	add	r0, lr, r0
  4aa058:	ebf7cc4c 	bl	0x29d190
  4aa05c:	e0800860 	add	r0, r0, r0, ror #16
  4aa060:	e1e02000 	mvn	r2, r0
  4aa064:	e1a02822 	lsr	r2, r2, #16
  4aa068:	e1c821b0 	strh	r2, [r8, #16]
  4aa06c:	e35b0000 	cmp	fp, #0
  4aa070:	0a000005 	beq	0x4aa08c
  4aa074:	e5941068 	ldr	r1, [r4, #104]	; 0x68
  4aa078:	e59720a8 	ldr	r2, [r7, #168]	; 0xa8
  4aa07c:	e5871068 	str	r1, [r7, #104]	; 0x68
  4aa080:	e0899002 	add	r9, r9, r2
  4aa084:	e5942010 	ldr	r2, [r4, #16]
  4aa088:	e5872010 	str	r2, [r7, #16]
  4aa08c:	e5977000 	ldr	r7, [r7]
  4aa090:	e59d3004 	ldr	r3, [sp, #4]
  4aa094:	e1d729b0 	ldrh	r2, [r7, #144]	; 0x90   <- HERE!
  4aa098:	e0866003 	add	r6, r6, r3
  4aa09c:	e59780a0 	ldr	r8, [r7, #160]	; 0xa0
  4aa0a0:	e6bf1f36 	rev	r1, r6
  4aa0a4:	e0888002 	add	r8, r8, r2
  4aa0a8:	e5d8200d 	ldrb	r2, [r8, #13]
  4aa0ac:	e5881004 	str	r1, [r8, #4]
  4aa0b0:	e7c7239f 	bfc	r2, #7, #1
  4aa0b4:	e5c8200d 	strb	r2, [r8, #13]
  4aa0b8:	e5972000 	ldr	r2, [r7]
  4aa0bc:	e3520000 	cmp	r2, #0
  4aa0c0:	1affffd6 	bne	0x4aa020


The line marked with the HERE! above is where the null pointer derefence
occurs. The code is in fact the (inlined) code of tcp_hdr(), which is in
turn the inlined code of tcp_transport_header() on given skb:

static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
{
        return skb->head + skb->transport_header;
}

The call to tcp_hdr() is in the following loop in tcp_gso_segment():

     ...
  1  do {
  2          th->fin = th->psh = 0;
  3          th->check = newcheck;
  4
  5          if (skb->ip_summed != CHECKSUM_PARTIAL)
  6                  th->check =
  7                       csum_fold(csum_partial(skb_transport_header(skb),
  8                                              thlen, skb->csum));
  9
 10          seq += mss;
 11          if (copy_destructor) {
 12                  skb->destructor = gso_skb->destructor;
 13                  skb->sk = gso_skb->sk;
 14                  sum_truesize += skb->truesize;
 15          }
 16          skb = skb->next;
 17          th = tcp_hdr(skb);         <-- HERE!
 18
 19          th->seq = htonl(seq);
 20          th->cwr = 0;
 21  } while (skb->next);
     ...

Unless there is an assumption I missed somewhere in the function, the
problem may occur during the first round of the loop, because (unlike
the 'while' condition does at line 21) skb->next is not checked against
null at lines 17 above before it is passed to tcp_hdr() at line 18.

To be honest, I am asking because I am not familiar w/ the code and it
is somewhat old so I wonder why noone got hit before. AFAICT,
f4c50d990dcf ([NET]: Add software TSOv4) added TSOv4 support in 2006 via
introduction of tcp_tso_segmen() (with the same kind of deref but
possibly different assumptions) which was more recently modified via
28850dc7c7 (net: tcp: move GRO/GSO functions to tcp_offload) to become
tcp_gso_segment().

David, can you confirm the analysis and possibly comment on the
conditions needed for the bug to manifest?

Cheers,

a+

^ permalink raw reply

* Re: [BUG] null pointer dereference in tcp_gso_segment()
From: Eric Dumazet @ 2014-01-22 21:57 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: David Miller, Eric Dumazet, Daniel Borkmann, Herbert Xu,
	Willy Tarreau, netdev
In-Reply-To: <87r47z7kqo.fsf@natisbad.org>

On Wed, 2014-01-22 at 22:46 +0100, Arnaud Ebalard wrote:
> Hi,
> 
> The following is a backtrace I got while doing a simple transfer from my
> NAS (ARMv7-based ReadyNAS 102) running a 3.13.0 kernel. Usage was a
> simple file served by nginx. I managed to get a second identical one
> in another session. As this did not looked like some random race
> condition or some missing lock, I decided to take a look. Details are
> given below. FWIW, driver is mvneta (w/ fixes and performance patches
> you accepted for v3.14, David) and ethtool reports gso is enabled and tso
> disabled.
> 
> [  943.860383] Unable to handle kernel NULL pointer dereference at virtual address 00000090
> [  943.868497] pgd = c0004000
> [  943.871217] [00000090] *pgd=00000000
> [  943.874813] Internal error: Oops: 17 [#1] ARM
> [  943.879175] Modules linked in:
> [  943.882248] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.0.rn102-00594-ga0fa1dd3cdbc-dirty #63
> [  943.890873] task: c0775620 ti: c0768000 task.ti: c0768000
> [  943.896291] PC is at tcp_gso_segment+0x1d8/0x390
> [  943.900925] LR is at skb_segment+0x510/0x788
> [  943.905202] pc : [<c04b2094>]    lr : [<c0448b90>]    psr: 200f0113
> [  943.905202] sp : c0769aa8  ip : 00005b27  fp : 00000001
> [  943.916697] r10: 00005b27  r9 : 00000868  r8 : dfab5cb0
> [  943.921930] r7 : 00000000  r6 : 4ef79279  r5 : 00000020  r4 : deeba830
> [  943.928467] r3 : 000005a8  r2 : df993200  r1 : c04a63f8  r0 : 000005a8
> [  943.935005] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
> [  943.942325] Control: 10c5387d  Table: 1534c019  DAC: 00000015
> [  943.948079] Process swapper (pid: 0, stack limit = 0xc0768238)
> [  943.953921] Stack: (0xc0769aa8 to 0xc076a000)
> [  943.958288] 9aa0:                   c077000c 000005a8 0000e397 d52f3540 c0771230 00001c48
> [  943.966481] 9ac0: 00000006 deeba830 0000c6bc 00000014 0000009c 00000000 00000014 c04beeb4
> [  943.974674] 9ae0: c0fbb8ec dec2ab80 0000008e 00000000 00000001 00004803 00000002 deeba830
> [  943.982867] 9b00: c07739cc deeba830 df27d15c df1fd800 c05a045c c0452dcc 00000004 c07739e0
> [  943.991061] 9b20: 00004803 00004803 00000002 00010000 00000000 deeba830 df27d15c c04530e8
> [  943.999254] 9b40: 00000001 c046fc40 c0492410 df1fd800 00000000 00000000 00000000 def06e00
> [  944.007447] 9b60: df1fd800 deeba830 df1fd800 df27d15c 00000010 c0469af4 def06e00 00000000
> [  944.015640] 9b80: 00000042 00000000 df27d15c df1fd800 deeba830 c04535b8 def06e60 000333c5
> [  944.023834] 9ba0: 00000000 dfaf5180 dfaf51fc 0000000e 00000000 deeba830 df1fd800 def35580
> [  944.032027] 9bc0: 00000010 c0492604 80000000 0b50a8c0 00000000 deeba830 df993200 00000000
> [  944.040221] 9be0: df993404 def35580 00000000 009c0000 00000000 c0492a2c deeba830 c0492b5c
> [  944.048414] 9c00: deeba780 df993200 deeba780 00000020 00000000 0000fb7b 00000020 df993200
> [  944.056608] 9c20: deeba830 c0778368 00000000 def35580 00000000 c04a76e8 c3a0f5c7 134b291d
> [  944.064802] 9c40: 00000001 00000000 00000002 00000000 00000000 0000fb7b 002411ee 00000000
> [  944.072995] 9c60: 79e31870 df993200 deeba780 000005a8 4ef79279 00001c48 00000000 000021f0
> [  944.081188] 9c80: 00000000 c04a7d68 c39c239b 134b291d 00000006 000016a0 000021f0 00000001
> [  944.089381] 9ca0: 00000000 df9932bc 00000002 00000000 00000006 134b291d d52f3540 df993200
> [  944.097575] 9cc0: d567e7e4 d52f3540 00000020 d567e7d0 d52f3540 0000ddc7 c0770618 c04a86d8
> [  944.105768] 9ce0: 00000020 df993200 df993200 c04a4954 00000000 00000020 00000001 00000002
> [  944.113961] 9d00: d52f3540 df993200 def35a00 df993200 d567e7d0 d52f3540 0000ddc7 c04ac61c
> [  944.122154] 9d20: c048da1c c046fc40 c048da1c c046fc40 dec12140 c0770b64 d52f3540 d52f3540
> [  944.130348] 9d40: c07a3dc0 00000000 df993200 c04aede0 df1fd800 c046fcc0 00000000 c0769d74
> [  944.138541] 9d60: c048da1c 80000000 c048d734 c0770b64 0000000e c0770b64 3050a8c0 c05aacb4
> [  944.146734] 9d80: c0773470 00000000 c07a3dc0 d52f3540 d52f3540 00000000 c0770618 c048dab0
> [  944.154927] 9da0: d567e7d0 c07739e8 c0770628 df1fd800 d52f3540 c048d85c df1b4400 c0770614
> [  944.163120] 9dc0: c0770614 c07739e8 c0770628 df1fd800 00000008 c044f018 00000003 00000000
> [  944.171314] 9de0: e1424800 42000000 df1fd800 d52f3540 00000001 c0770628 00000000 d52f3540
> [  944.179507] 9e00: 00000003 df3bd400 e1424800 42000000 df1fd800 d52f3540 00000001 c045153c
> [  944.187700] 9e20: 00000004 c0019430 e1424000 c0382238 00000000 00000000 00000002 dfbc4600
> [  944.195893] 9e40: 00000001 df1fdbe0 00000001 00000042 00000000 00000000 dee34000 00000040
> [  944.204086] 9e60: df1fdbe0 00000100 00000000 00000100 df3bd400 c079b684 00000000 c0382354
> [  944.212279] 9e80: c04ab2fc c04aa344 df993200 df1fd800 00000100 df1fdbfc 00000000 df993200
> [  944.220473] 9ea0: 00000100 c04ab360 c0768000 c0023cf4 c0fba140 c03822b0 df1fdbfc 00000040
> [  944.228666] 9ec0: 0000012c c07abb40 c07abb48 c0778368 c07abb40 c0451288 c0769ed8 0000fb7d
> [  944.236859] 9ee0: c0fba140 00000001 0000000c c07ac850 c07ac840 c0768000 00000003 00000100
> [  944.245052] 9f00: 0000000c c001ebd4 00000000 c004782c c0fba140 0000000a 0000fb7c 00200000
> [  944.253245] 9f20: c0769f78 c078af20 00000018 00000000 c0769f78 c07abc01 00000001 c07abc01
> [  944.261438] 9f40: 00000000 c001ef18 c078af20 c000ea44 c07f6f40 000003ff c07f6f40 c00084d8
> [  944.269632] 9f60: c000ebc0 c0041c54 600f0013 ffffffff c0769fac c00110c0 00000000 00000000
> [  944.277825] 9f80: 00000000 c0779a48 c0768000 c0768000 c0768000 c07700cc c07abc01 00000001
> [  944.286018] 9fa0: c07abc01 00000000 01000000 c0769fc0 c000ebc0 c0041c54 600f0013 ffffffff
> [  944.294211] 9fc0: c0761530 c073ba14 ffffffff ffffffff c073b4f0 00000000 00000000 c0761530
> [  944.302405] 9fe0: 00000000 10c53c7d c0770078 c076152c c07765f8 00008070 00000000 00000000
> [  944.310608] [<c04b2094>] (tcp_gso_segment+0x1d8/0x390) from [<c04beeb4>] (inet_gso_segment+0x118/0x2dc)
> [  944.320029] [<c04beeb4>] (inet_gso_segment+0x118/0x2dc) from [<c0452dcc>] (skb_mac_gso_segment+0xa4/0x178)
> [  944.329704] [<c0452dcc>] (skb_mac_gso_segment+0xa4/0x178) from [<c04530e8>] (dev_hard_start_xmit+0x170/0x484)
> [  944.339643] [<c04530e8>] (dev_hard_start_xmit+0x170/0x484) from [<c0469af4>] (sch_direct_xmit+0xa4/0x19c)
> [  944.349231] [<c0469af4>] (sch_direct_xmit+0xa4/0x19c) from [<c04535b8>] (__dev_queue_xmit+0x1bc/0x3dc)
> [  944.358560] [<c04535b8>] (__dev_queue_xmit+0x1bc/0x3dc) from [<c0492604>] (ip_finish_output+0x1f4/0x440)
> [  944.368060] [<c0492604>] (ip_finish_output+0x1f4/0x440) from [<c0492a2c>] (ip_local_out+0x28/0x2c)
> [  944.377037] [<c0492a2c>] (ip_local_out+0x28/0x2c) from [<c0492b5c>] (ip_queue_xmit+0x12c/0x384)
> [  944.385754] [<c0492b5c>] (ip_queue_xmit+0x12c/0x384) from [<c04a76e8>] (tcp_transmit_skb+0x42c/0x86c)
> [  944.394992] [<c04a76e8>] (tcp_transmit_skb+0x42c/0x86c) from [<c04a7d68>] (tcp_write_xmit+0x174/0xa74)
> [  944.404317] [<c04a7d68>] (tcp_write_xmit+0x174/0xa74) from [<c04a86d8>] (__tcp_push_pending_frames+0x30/0x98)
> [  944.414252] [<c04a86d8>] (__tcp_push_pending_frames+0x30/0x98) from [<c04a4954>] (tcp_rcv_established+0x334/0x5a0)
> [  944.424622] [<c04a4954>] (tcp_rcv_established+0x334/0x5a0) from [<c04ac61c>] (tcp_v4_do_rcv+0x104/0x240)
> [  944.434122] [<c04ac61c>] (tcp_v4_do_rcv+0x104/0x240) from [<c04aede0>] (tcp_v4_rcv+0x6ec/0x728)
> [  944.442838] [<c04aede0>] (tcp_v4_rcv+0x6ec/0x728) from [<c048dab0>] (ip_local_deliver_finish+0x94/0x21c)
> [  944.452337] [<c048dab0>] (ip_local_deliver_finish+0x94/0x21c) from [<c048d85c>] (ip_rcv_finish+0x128/0x2e8)
> [  944.462098] [<c048d85c>] (ip_rcv_finish+0x128/0x2e8) from [<c044f018>] (__netif_receive_skb_core+0x4c4/0x5d0)
> [  944.472032] [<c044f018>] (__netif_receive_skb_core+0x4c4/0x5d0) from [<c045153c>] (napi_gro_receive+0x74/0xa0)
> [  944.482054] [<c045153c>] (napi_gro_receive+0x74/0xa0) from [<c0382238>] (mvneta_rx+0x420/0x498)
> [  944.490771] [<c0382238>] (mvneta_rx+0x420/0x498) from [<c0382354>] (mvneta_poll+0xa4/0x3b8)
> [  944.499139] [<c0382354>] (mvneta_poll+0xa4/0x3b8) from [<c0451288>] (net_rx_action+0x98/0x180)
> [  944.507773] [<c0451288>] (net_rx_action+0x98/0x180) from [<c001ebd4>] (__do_softirq+0xc8/0x1f4)
> [  944.516491] [<c001ebd4>] (__do_softirq+0xc8/0x1f4) from [<c001ef18>] (irq_exit+0x6c/0xa8)
> [  944.524690] [<c001ef18>] (irq_exit+0x6c/0xa8) from [<c000ea44>] (handle_IRQ+0x34/0x84)
> [  944.532624] [<c000ea44>] (handle_IRQ+0x34/0x84) from [<c00084d8>] (armada_370_xp_handle_irq+0x48/0x50)
> [  944.541953] [<c00084d8>] (armada_370_xp_handle_irq+0x48/0x50) from [<c00110c0>] (__irq_svc+0x40/0x50)
> [  944.551186] Exception stack(0xc0769f78 to 0xc0769fc0)
> [  944.556246] 9f60:                                                       00000000 00000000
> [  944.564439] 9f80: 00000000 c0779a48 c0768000 c0768000 c0768000 c07700cc c07abc01 00000001
> [  944.572632] 9fa0: c07abc01 00000000 01000000 c0769fc0 c000ebc0 c0041c54 600f0013 ffffffff
> [  944.580835] [<c00110c0>] (__irq_svc+0x40/0x50) from [<c0041c54>] (cpu_startup_entry+0x44/0xe0)
> [  944.589470] [<c0041c54>] (cpu_startup_entry+0x44/0xe0) from [<c073ba14>] (start_kernel+0x2fc/0x358)
> [  944.598533] Code: e5942010 e5872010 e5977000 e59d3004 (e1d729b0) 
> [  944.604662] ---[ end trace 39b798f37a10efc0 ]---
> [  944.609288] Kernel panic - not syncing: Fatal exception in interrupt
> 
> 
> Here is the beginning of the assembly dump of tcp_gso_segment() provided by
> arm-linux-gnueabi-objdump -S -EL -D -b binary -m arm --start-address=0x4a9ebc
> --stop-address=0x4aa24c Image. If you wonder, LOADADDR is 0x8000 during
> compilation, which explains the address shift.

> The line marked with the HERE! above is where the null pointer derefence
> occurs. The code is in fact the (inlined) code of tcp_hdr(), which is in
> turn the inlined code of tcp_transport_header() on given skb:
> 
> static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
> {
>         return skb->head + skb->transport_header;
> }
> 
> The call to tcp_hdr() is in the following loop in tcp_gso_segment():
> 
>      ...
>   1  do {
>   2          th->fin = th->psh = 0;
>   3          th->check = newcheck;
>   4
>   5          if (skb->ip_summed != CHECKSUM_PARTIAL)
>   6                  th->check =
>   7                       csum_fold(csum_partial(skb_transport_header(skb),
>   8                                              thlen, skb->csum));
>   9
>  10          seq += mss;
>  11          if (copy_destructor) {
>  12                  skb->destructor = gso_skb->destructor;
>  13                  skb->sk = gso_skb->sk;
>  14                  sum_truesize += skb->truesize;
>  15          }
>  16          skb = skb->next;
>  17          th = tcp_hdr(skb);         <-- HERE!
>  18
>  19          th->seq = htonl(seq);
>  20          th->cwr = 0;
>  21  } while (skb->next);
>      ...
> 
> Unless there is an assumption I missed somewhere in the function, the
> problem may occur during the first round of the loop, because (unlike
> the 'while' condition does at line 21) skb->next is not checked against
> null at lines 17 above before it is passed to tcp_hdr() at line 18.
> 
> To be honest, I am asking because I am not familiar w/ the code and it
> is somewhat old so I wonder why noone got hit before. AFAICT,
> f4c50d990dcf ([NET]: Add software TSOv4) added TSOv4 support in 2006 via
> introduction of tcp_tso_segmen() (with the same kind of deref but
> possibly different assumptions) which was more recently modified via
> 28850dc7c7 (net: tcp: move GRO/GSO functions to tcp_offload) to become
> tcp_gso_segment().
> 
> David, can you confirm the analysis and possibly comment on the
> conditions needed for the bug to manifest?

A gso packet contains at least 2 segments.

So the NULL deref should not happen.

Something strange is happening here...

^ permalink raw reply

* Re: [PATCH net-next] IPv6: enable TCP to use an anycast address
From: Hannes Frederic Sowa @ 2014-01-22 21:59 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: François-Xavier Le Bail, netdev, David Stevens, David Miller,
	Hideaki Yoshifuji
In-Reply-To: <CAMrnHh5od13Ydo12Dt4OTa3nw+5om5LYO53iu9nQYqWeBxMPAw@mail.gmail.com>

Hi!

On Thu, Jan 23, 2014 at 01:11:52AM +0400, Alexey Kuznetsov wrote:
> On Wed, Jan 22, 2014 at 6:32 PM, François-Xavier Le Bail
> <fx.lebail@yahoo.com> wrote:
> >> Actually, I was alerted by reset processing in your patch, it cannot be right.
> >
> > Perhaps, I missed something.
> 
> Perhaps, I missed something. According to my old knowledge, tcp cannot
> be used with anycasts
> by many reasons and I have no idea what could change to the moment.
> 
> F.e. what do you do when a segment arrives on listening socket w/o
> connection associated?
> If you sent reset, you could reset someone's valid connection (I mean
> this, saying "it cannot be right").
> If you do not, you violate protocol, those resets are required to
> clean up stale states.

When using anycast addresses with routing protocols, like e.g. anycast
bgp one cannot make the distinction in the kernel at all. Operators
currently depend on the BGP being stable enough so that the anycast
destination towards one server remains stable during the whole connection.

In case a flap happens, of course, packets can end up at a different server
and will get a RST. I guess this currently works well enough for anycast
network operators. Most DNS server use this model currently and DNS depends on
UDP and TCP being reachable for name resolution.

But this change does affect anycast IPv6 only in one subnet. If an anycast
address is allocated in the kernel, neighbour discovery logic tries as
hard to stick to one particular neighbour in the current subnet by always
sending neighbour advertisments with non-override flag. So in case we
need to reprobe the hardware address for the anycast neighbour, we will
get multiple answers and will pick that one which does not override the
current address. In case the anycast server really disappears from the
network, we could end up in a situation where the TCP packet gets send
to the wrong server which could result in a RST. IMHO this is acceptable,
but I may be wrong here. What is your opinion on that?

> >> Do not you think this must not be enabled for common use? At least
> >> some separate sysctl disabled by default.
> >
> > We could indeed use a sysctl, disabled by default.
> >
> > But my goal was to enable anycast address usage transparently, if possible.
> 
> IMHO it is logically impossible. That's why my first question was
> about a document which
> allowed anycasts with TCP. (BTW I found your answer unconvincing, at least.)
> Traditional anycasting used to be stateless and it was impossible to
> use with TCP
> because unsynchronized TCP connections would be unavoidable.
> AFAIK only suggestion from ancient RFC1546 could cure this problem,
> but it is very tricky and implies client knows that destination is anycast.
> 
> Actually, I do not even understand how the idea of use of anycast with TCP
> emerged. The only situation I can imagine: the router is expected to
> do full scale
> connection tracking and to bind connections to anycast destinations to
> correct nodes.
> (And the same moment you understand that the notion of anycast is
> completely lost
> in this picture: the same thing is done by load balancers in IPv4 for
> unicast addresses
> for ages) I still have no idea how it is going to work when both
> client and servers bound
> to anycasts are on the same link.
> 
> So, it looks like you need this sysctl to announce: "someone cares
> about proper routing
> on this network and tcp should work". It looks obvious this mode
> cannot be enabled by default.

IMHO it would be ok if a tcp socket binds to an anycast address, because the
administrator or software can check either for the pre-defined ones (the
subnet router anycast address) or the application allocated an anycast via
setsockopt specifically (the address could also be requested by another
process and the tcp socket could bind to it by accident, that might be a
problem. If we want to protect from this scenario we must add a new sysctl
knob.).

We don't use anycast addresses for normal source address selection, so if the
application does not specifically set the anycast address it will always use a
unicast one.

Greetings,

  Hannes

^ permalink raw reply

* Re: [BUG] null pointer dereference in tcp_gso_segment()
From: Arnaud Ebalard @ 2014-01-22 22:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Eric Dumazet, Daniel Borkmann, Herbert Xu,
	Willy Tarreau, netdev
In-Reply-To: <1390427824.27806.36.camel@edumazet-glaptop2.roam.corp.google.com>

Hi Eric,

Eric Dumazet <eric.dumazet@gmail.com> writes:

>> Unless there is an assumption I missed somewhere in the function, the
>> problem may occur during the first round of the loop, because (unlike
>> the 'while' condition does at line 21) skb->next is not checked against
>> null at lines 17 above before it is passed to tcp_hdr() at line 18.
>> 
>> To be honest, I am asking because I am not familiar w/ the code and it
>> is somewhat old so I wonder why noone got hit before. AFAICT,
>> f4c50d990dcf ([NET]: Add software TSOv4) added TSOv4 support in 2006 via
>> introduction of tcp_tso_segmen() (with the same kind of deref but
>> possibly different assumptions) which was more recently modified via
>> 28850dc7c7 (net: tcp: move GRO/GSO functions to tcp_offload) to become
>> tcp_gso_segment().
>> 
>> David, can you confirm the analysis and possibly comment on the
>> conditions needed for the bug to manifest?
>
> A gso packet contains at least 2 segments.

By whom / where is it enforced?

Cheers,

a+

^ permalink raw reply

* Re: [PATCH] af_packet: Add Queue mapping mode to af_packet fanout operation
From: Daniel Borkmann @ 2014-01-22 22:00 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, David S. Miller
In-Reply-To: <1390424504-18543-1-git-send-email-nhorman@tuxdriver.com>

On 01/22/2014 10:01 PM, Neil Horman wrote:
> This patch adds a queue mapping mode to the fanout operation of af_packet
> sockets.  This allows user space af_packet users to better filter on flows
> ingressing and egressing via a specific hardware queue, and avoids the potential

Maybe I'm missing something, but I currently cannot find where this is
being filled out for ingress path? Egress, ok, this gets filled out
somewhere in protocol layers or elsewhere and is being locally pushed
back through dev_queue_xmit_nit(), but I think main use case is ingress
through packet fanout. In driver layer I can find skb->rxhash filled out
which would then be PACKET_FANOUT_HASH.
(Otherwise patch looks good.)

> packet reordering that can occur when FANOUT_CPU is being used and irq affinity
> varies.
>
> Tested successfully by myself.  applies to net-next
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: "David S. Miller" <davem@davemloft.net>
> ---
>   include/uapi/linux/if_packet.h |  1 +
>   net/packet/af_packet.c         | 11 +++++++++++
>   2 files changed, 12 insertions(+)
>
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index 1988a02..bac27fa 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
> @@ -60,6 +60,7 @@ struct sockaddr_ll {
>   #define PACKET_FANOUT_CPU		2
>   #define PACKET_FANOUT_ROLLOVER		3
>   #define PACKET_FANOUT_RND		4
> +#define PACKET_FANOUT_QM		5
>   #define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
>   #define PACKET_FANOUT_FLAG_DEFRAG	0x8000
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index d711ecb..bd90a87 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1313,6 +1313,13 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
>   	return idx;
>   }
>
> +static unsigned int fanout_demux_qm(struct packet_fanout *f,
> +				    struct sk_buff *skb,
> +				    unsigned int num)
> +{
> +	return skb_get_queue_mapping(skb) % num;
> +}
> +
>   static bool fanout_has_flag(struct packet_fanout *f, u16 flag)
>   {
>   	return f->flags & (flag >> 8);
> @@ -1352,6 +1359,9 @@ static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev,
>   	case PACKET_FANOUT_RND:
>   		idx = fanout_demux_rnd(f, skb, num);
>   		break;
> +	case PACKET_FANOUT_QM:
> +		idx = fanout_demux_qm(f, skb, num);
> +		break;
>   	case PACKET_FANOUT_ROLLOVER:
>   		idx = fanout_demux_rollover(f, skb, 0, (unsigned int) -1, num);
>   		break;
> @@ -1422,6 +1432,7 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>   	case PACKET_FANOUT_LB:
>   	case PACKET_FANOUT_CPU:
>   	case PACKET_FANOUT_RND:
> +	case PACKET_FANOUT_QM:
>   		break;
>   	default:
>   		return -EINVAL;
>

^ permalink raw reply

* Re: ipv4_dst_destroy panic regression after 3.10.15
From: Alexei Starovoitov @ 2014-01-22 22:13 UTC (permalink / raw)
  To: dormando
  Cc: Alexei Starovoitov, Eric Dumazet, netdev,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAADnVQ+MNHFKmCSS7m42s7gz68sCwNUY4U571R5XVH0sU0=ffQ@mail.gmail.com>

On Tue, Jan 21, 2014 at 10:02 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Jan 21, 2014 at 8:10 PM, dormando <dormando@rydia.net> wrote:
>>
>>
>> On Tue, 21 Jan 2014, Alexei Starovoitov wrote:
>>
>>> On Tue, Jan 21, 2014 at 5:39 PM, dormando <dormando@rydia.net> wrote:
>>> >
>>> > > On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
>>> > > >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
>>> > > >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
>>> > > >> > > Hi,
>>> > > >> > >
>>> > > >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
>>> > > >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
>>> > > >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
>>> > > >> > >
>>> > > >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
>>> > > >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode
>>> ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp
>>> pps_core mdio
>>> > > >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
>>> > > >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
>>> > > >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
>>> > > >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
>>> > > >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
>>> > > >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
>>> > > >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
>>> > > >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
>>> > > >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>> > > >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
>>> > > >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
>>> > > >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> > > >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
>>> > > >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> > > >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> > > >> > > <4>[196727.311713] Stack:
>>> > > >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
>>> > > >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
>>> > > >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
>>> > > >> > > <4>[196727.311885] Call Trace:
>>> > > >> > > <4>[196727.311907]  <IRQ>
>>> > > >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
>>> > > >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
>>> > > >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
>>> > > >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
>>> > > >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
>>> > > >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
>>> > > >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
>>> > > >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
>>> > > >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
>>> > > >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
>>> > > >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
>>> > > >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
>>> > > >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
>>> > > >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
>>> > > >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
>>> > > >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
>>> > > >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
>>> > > >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
>>> > > >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
>>> > > >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
>>> > > >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
>>> > > >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
>>> > > >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
>>> > > >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
>>> > > >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
>>> > > >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
>>> > > >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
>>> > > >> > > <4>[196727.312722]  <EOI>
>>> > > >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
>>> > > >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
>>> > > >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
>>> > > >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
>>> > > >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01
>>> 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
>>> > > >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
>>> > > >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
>>> > > >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
>>> > > >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
>>> > > >> > >
>>> > > >> > >
>>> > > >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
>>> > > >> > > am trying a revert or two.
>>> > > >> > >
>>> > > >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
>>> > > >> > > crashed, so the change exists between .15 and .25.
>>> > > >> >
>>> > > >> > Please try following fix, thanks for the report !
>>> > > >> >
>>> > > >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>>> > > >> > index 25071b48921c..78a50a22298a 100644
>>> > > >> > --- a/net/ipv4/route.c
>>> > > >> > +++ b/net/ipv4/route.c
>>> > > >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
>>> > > >> > *dst)
>>> > > >> >
>>> > > >> >     if (!list_empty(&rt->rt_uncached)) {
>>> > > >> >             spin_lock_bh(&rt_uncached_lock);
>>> > > >> > -           list_del(&rt->rt_uncached);
>>> > > >> > +           list_del_init(&rt->rt_uncached);
>>> > > >> >             spin_unlock_bh(&rt_uncached_lock);
>>> > > >> >     }
>>> > > >> >  }
>>> > > >> >
>>> > > >>
>>> > > >> Problem could come from this commit, in linux 3.10.23,
>>> > > >> you also could try to revert it
>>> > > >>
>>> > > >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
>>> > > >> Author: Alexei Starovoitov <ast@plumgrid.com>
>>> > > >> Date:   Tue Nov 19 19:12:34 2013 -0800
>>> > > >>
>>> > > >>     ipv4: fix race in concurrent ip_route_input_slow()
>>> > > >>
>>> > > >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
>>> > > >>
>>> > > >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
>>> > > >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
>>> > > >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
>>> > > >>     via rt_cache_route()
>>> > > >>     Most of the time they succeed, but on occasion the following two lines:
>>> > > >>         orig = *p;
>>> > > >>         prev = cmpxchg(p, orig, rt);
>>> > > >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
>>> > > >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
>>> > > >>     so dst is leaking. dst_destroy() is never called and 'lo' device
>>> > > >>     refcnt doesn't go to zero, which can be seen in the logs as:
>>> > > >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
>>> > > >>     Adding mdelay() between above two lines makes it easily reproducible.
>>> > > >>     Fix it similar to nh_pcpu_rth_output case.
>>> > > >>
>>> > > >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
>>> > > >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>> > > >>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>> > > >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> > > >>
>>> > > >
>>> > > > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
>>> > > > my best guess. I have a kernel running in production with only this
>>> > > > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
>>> > > >
>>> > > > I'm building a kernel now with your route patch, but 62713c4 *not*
>>> > > > reverted, which I can throw on a different machine. Does this sound like a
>>> > > > good idea?
>>> > >
>>> > > the traces of your crash don't look similar to dst leak that was fixed by
>>> > > commit 62713c4...
>>> > >
>>> > > To trigger the 'fix' logic of the 62713c4 add the following diff:
>>> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>>> > > index f6c6ab1..8972676 100644
>>> > > --- a/net/ipv4/route.c
>>> > > +++ b/net/ipv4/route.c
>>> > > @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
>>> > > struct rtable *rt)
>>> > >                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
>>> > >         }
>>> > >         orig = *p;
>>> > > -
>>> > > +       mdelay(100);
>>> > >         prev = cmpxchg(p, orig, rt);
>>> > >         if (prev == orig) {
>>> > >                 if (orig)
>>> > >
>>> > > I've been running with it for a day without issues.
>>> > > Note that it will stress both 'input' and 'output' ways of adding dsts to
>>> > > rt_uncached list...
>>> > > and 'output' was there for ages.
>>> > >
>>> > > If mdelay() helps to reproduce it in minutes instead of days
>>> > > then we're on the right path.
>>> > > Could you share details of your workload?
>>> > > In our case it was a lot of namespaces with a ton of processes
>>> > > talking to each other, forcefully killed and restarted.
>>> > > Do you see the same crash in the latest tree?
>>> > >
>>> > > PS sorry for double posts. netdev email bounced few times for me...
>>> > >
>>> >
>>> > So with 62713c4 reverted on top of 3.10.27 (no other new patches), both of
>>> > my machines have survived. One four days, another two days. They'd barely
>>> > make it a day before.
>>> >
>>> > So this patch is introducing a new problem. Weakening the dst reference
>>> > too much in a false positive case?
>>> >
>>> > I'm not entirely sure if the mdelay(100) thing will help much. I can try
>>> > it but it's kind of obnoxious to reboot these machines (far away, ipmi
>>> > only) too often. Unless you folks have any new ideas before I go off and
>>> > do that?
>>>
>>> mdelay() won't fix things, but it will help debugging. Can you try it on just one machine? It should fail within minutes.
>>>
>>> Only afterwards we can try alternative fix like:
>>> @@ -1722,8 +1722,8 @@ local_input:
>>>         }
>>>         if (do_cache) {
>>>                 if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
>>> -                       rth->dst.flags |= DST_NOCACHE;
>>> -                       rt_add_uncached_list(rth);
>>> +                       rt_free(rth);
>>> +                       goto local_input;
>>>                 }
>>>         }
>>>         skb_dst_set(skb, &rth->dst);

the above 'alternative fix' is the wrong fix. I managed to hit good
old dst leak with it:
unregister_netdevice: waiting for lo to become free. Usage count = 2

will continue running with mdelay() and original fix in place for more
days to see
whether I can spot anything.

>>
>> It might take a day or three to find a good time to do it. I don't have
>> access to reboot the machines myself. Oh how I wish things broke in the
>> lab more often :/
>>
>> To be clear, I add your old patch back in + the mdelay. If that fails
>> within a few minutes, I add this new patch instead of your old one?
>
> yes. vanilla 3.10.23+ and mdelay() will be triggering rt_add_uncached path.
> If that fails quickly, I think we better debug it further to really understand
> what's going on. Alternative fix may workaround the problem,
> since it will be freeing dst immediately if it fails to be cached and retrying
> allocation and caching again, but that would mean that double free problem
> is simply hidden instead of fixed.
> I would add tracing of rth free/alloc.
> If mdelay() makes it fail quickly, the trace won't be too long.
> config_debug_kmemleak may be helpful (though unlikely in this situation).
> since your crash traces look like double free, we can enable config_slub_debug
> and call verify_mem_not_deleted() from dst_destroy()

^ permalink raw reply

* Re: [BUG] null pointer dereference in tcp_gso_segment()
From: Eric Dumazet @ 2014-01-22 22:18 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: David Miller, Eric Dumazet, Daniel Borkmann, Herbert Xu,
	Willy Tarreau, netdev
In-Reply-To: <8761pb7jzq.fsf@natisbad.org>

On Wed, 2014-01-22 at 23:02 +0100, Arnaud Ebalard wrote:
> Hi Eric,
> 
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> 
> >> Unless there is an assumption I missed somewhere in the function, the
> >> problem may occur during the first round of the loop, because (unlike
> >> the 'while' condition does at line 21) skb->next is not checked against
> >> null at lines 17 above before it is passed to tcp_hdr() at line 18.
> >> 
> >> To be honest, I am asking because I am not familiar w/ the code and it
> >> is somewhat old so I wonder why noone got hit before. AFAICT,
> >> f4c50d990dcf ([NET]: Add software TSOv4) added TSOv4 support in 2006 via
> >> introduction of tcp_tso_segmen() (with the same kind of deref but
> >> possibly different assumptions) which was more recently modified via
> >> 28850dc7c7 (net: tcp: move GRO/GSO functions to tcp_offload) to become
> >> tcp_gso_segment().
> >> 
> >> David, can you confirm the analysis and possibly comment on the
> >> conditions needed for the bug to manifest?
> >
> > A gso packet contains at least 2 segments.
> 
> By whom / where is it enforced?

For example, tcp_gso_segment() does the following check :

if (unlikely(skb->len <= mss))
	goto out;

If there was one segment, then skb->len should also be smaller than mss

Since TCP stack seemed to be the provider of the packet in your stack
trace, check tcp_set_skb_tso_segs()

^ permalink raw reply

* packet mmap related oops (with reproducer)
From: Dave Jones @ 2014-01-22 22:39 UTC (permalink / raw)
  To: netdev; +Cc: Fedora Kernel Team
In-Reply-To: <bug-1054537-176318-JYaNmE5vAr@bugzilla.redhat.com>

We had a Fedora user report this bug, with a reproducer (below).

On Fri, Jan 17, 2014 at 06:52:06AM +0000, bugzilla@redhat.com wrote:
 > https://bugzilla.redhat.com/show_bug.cgi?id=1054537
 > 
 > 
 > 
 > --- Comment #2 from Karl Auerbach <karl@iwl.com> ---
 >   --> https://bugzilla.redhat.com/attachment.cgi?id=851413
 > Simple program to cause this kernel issue
 > 
 > This is a C program that must be run with root privilege.  It runs fine on
 > older kernels but causes a kernel problem on the most recent one.
 > 
 > The code creates a raw socket with transmit and receive ring buffers.
 > 
 > The rings are then made visible via mmap().
 > 
 > When munmap() is called things go awry.
 > 
 > The code in this example was quickly lifted from a larger body of code that has
 > been running for several years.


#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <unistd.h>

#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/user.h>

#include <features.h>    /* for the glibc version number */
#include <net/ethernet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

#include <net/if.h>
#include <netinet/in.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <linux/if_tun.h>

#define NIL   0

typedef int SOCKET;

int main(int argc, char *argv)
{
size_t              RxMmap_Size;
size_t              TxMmap_Size;
unsigned int        Ether_Sz;
unsigned int        Block_Sz_Order;
unsigned int        Block_Sz;
unsigned int        Block_Cnt;
unsigned int        Frame_Sz;
unsigned int        Frame_Cnt;
unsigned int        Frames_Per_Block;

void *              Mmap_Addr;
size_t              Mmap_Size;
size_t		    TXDataOffset;

SOCKET              Socket;

struct tpacket_req ring_req;

Ether_Sz = 1518;
Block_Sz = Ether_Sz;
Block_Sz_Order = 2; // 16384 byte blocks 
Block_Cnt = 1000;

Socket = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (Socket == -1)
  {
  perror("socket failed");
  return 1;
  }

Frame_Sz = TPACKET_ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN) + Ether_Sz);
TXDataOffset = TPACKET2_HDRLEN - sizeof(struct sockaddr_ll);

Block_Sz = getpagesize() << Block_Sz_Order;

Frames_Per_Block = Block_Sz / Frame_Sz;
Frame_Cnt = Frames_Per_Block * Block_Cnt;

RxMmap_Size = Block_Sz * Block_Cnt;
TxMmap_Size = RxMmap_Size;

Mmap_Size = RxMmap_Size + TxMmap_Size;

// Establish receive ring
// For convenience we will let it be the same size as the TX ring
// The mmap size calculations, far above, assume that the
// rings are the same size.
ring_req.tp_block_nr = Block_Cnt;
ring_req.tp_frame_size = Frame_Sz;
ring_req.tp_block_size = Block_Sz;
ring_req.tp_frame_nr = Frame_Cnt;
if (setsockopt(Socket, SOL_PACKET, PACKET_RX_RING,
               (char *)&ring_req, sizeof(ring_req)) < 0)
  {
  perror("Setsockopt RX_RING failed");
  close(Socket);
  return -1;
  }

// Establish transmit ring
// For convenience we will let it be the same size as the RX ring
// The mmap size calculations, far above, assume that the
// rings are the same size.
ring_req.tp_block_nr = Block_Cnt;
ring_req.tp_frame_size = Frame_Sz;
ring_req.tp_block_size = Block_Sz;
ring_req.tp_frame_nr = Frame_Cnt;
if (setsockopt(Socket, SOL_PACKET, PACKET_TX_RING,
               (char *)&ring_req, sizeof(ring_req)) < 0)
  {
  perror("Setsockopt TX_RING failed");
  close(Socket);
  return -1;
  }

Mmap_Addr = mmap(NIL, Mmap_Size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_LOCKED, Socket, 0);

if (Mmap_Addr == MAP_FAILED)
  {
  perror("mmap failed");
  return 1;
  }

if (Mmap_Addr != MAP_FAILED)
  {
  (void)munmap(Mmap_Addr, Mmap_Size);
  }

close(Socket);
return 0;
}

^ permalink raw reply

* Freescale FEC packet loss
From: Marek Vasut @ 2014-01-22 21:55 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: Frank Li, fugang.duan, fabio.estevam@freescale.com,
	Hector Palacios, linux-arm-kernel, Detlev Zundel, Eric Nelson

Hi guys,

I am running stock Linux 3.13 on i.MX6Q SabreLite board. The CPU is i.MX6Q TO 
1.0 .

I am hitting a WARNING when I use the FEC ethernet to transfer data, thus I 
started investigating this problem. TL;DR I am not able to figure this problem 
out, so I am not attaching a patch :-(

Steps to reproduce:
-------------------
1) Boot stock Linux 3.13 on i.MX6Q SabreLite board
2) Plug in an SD card into one of the SD slots (I use the full-size one)
3) Plug in an USB stick into one of the USB ports (I use the upper one)
4) Plug in an ethernet cable into the board
   -> Connect the other side into a gigabit-capable PC
   -> Let us assume the PC has IP address 192.168.1.1/24 and
      the board 192.168.1.2/24
5) Install iperf on both boards
6) Bring up the ethernet link on both ends:
   $ ifconfig ...
7) On the PC, run:
   $ iperf -s -i 1
8) Start producing load on the in-CPU busses. Run this on the board:
   $ sha1sum /dev/mmcblk0 &
   $ sha1sum /dev/sda &
9) Now let the board generate ethernet traffic:
   $ iperf -c 192.168.1.1 -t 1000 -i 1

Now wait about 10 minutes and the system will produce a warning and you will 
observe dips in the transmission speed. You can see the output at the end of the 
email.

I observe that this happens more often when there is a severe load on the 
busses, which in this case I simulate by running the sha1sum on /dev/mmcblk0 and 
sha1sum /dev/sda in the background in step 8) . I can also well simulate this by 
running 'sha1sum /dev/mmcblk0 & sha1sum /dev/mmcblk1 &' when I have both SD 
cards populated on the MX6Q sabrelite with the same result (WARNING and dips in 
speed).

There was apparently a thread about this problem already [1] where the person 
used SATA to produce high bus load and had exactly the same result.

One more observation is that I managed to replicate this problem all the way 
back to Linux 3.3.8 , which is one of the first kernel versions where sabrelite 
was supported. I also see this one the Freescale's 3.0.35-4.1.0 .

I have trouble figuring out what this problem is all about. Can you please help 
me? Thank you!

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-
October/202519.html

root@(none):~# iperf -c 192.168.1.1 -t 1000 -i 1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.128 port 36760 connected with 192.168.1.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  25.5 MBytes   214 Mbits/sec
[  3]  1.0- 2.0 sec  28.9 MBytes   242 Mbits/sec
[  3]  2.0- 3.0 sec  24.1 MBytes   202 Mbits/sec
[  3]  3.0- 4.0 sec  21.1 MBytes   177 Mbits/sec
[  3]  4.0- 5.0 sec  20.2 MBytes   170 Mbits/sec
[  3]  5.0- 6.0 sec  20.0 MBytes   168 Mbits/sec
[  3]  6.0- 7.0 sec  20.0 MBytes   168 Mbits/sec
[  3]  7.0- 8.0 sec  20.1 MBytes   169 Mbits/sec
[  3]  8.0- 9.0 sec  20.5 MBytes   172 Mbits/sec
[  3]  9.0-10.0 sec  21.5 MBytes   180 Mbits/sec
[  3] 10.0-11.0 sec  20.0 MBytes   168 Mbits/sec
[  3] 11.0-12.0 sec  19.4 MBytes   163 Mbits/sec
[  3] 12.0-13.0 sec  19.6 MBytes   165 Mbits/sec
[  3] 13.0-14.0 sec  19.8 MBytes   166 Mbits/sec
[  3] 14.0-15.0 sec  19.4 MBytes   163 Mbits/sec
[  275.945247] ------------[ cut here ]------------
[  275.949920] WARNING: CPU: 3 PID: 1155 at net/sched/sch_generic.c:264 
dev_watchdog+0x280/0x2a4()
[  275.958679] NETDEV WATCHDOG: eth3 (fec): transmit queue 0 timed out
[  275.964985] Modules linked in:
[  275.968096] CPU: 3 PID: 1155 Comm: sha1sum Not tainted 3.13.0 #18
[  275.974217] Backtrace:                                                                                                                                                      
[  275.976787] [<80012410>] (dump_backtrace+0x0/0x10c) from [<800125ac>] 
(show_stack+0x18/0x1c)                                                                                
[  275.985271]  r6:00000108 r5:00000009 r4:00000000 r3:00000000                                                                                                                
[  275.991050] [<80012594>] (show_stack+0x0/0x1c) from [<8060f9c8>] 
(dump_stack+0x80/0x9c)                                                                                     
[  275.999103] [<8060f948>] (dump_stack+0x0/0x9c) from [<8002703c>] 
(warn_slowpath_common+0x6c/0x90)                                                                           
[  276.008017]  r4:bef43e10 r3:bef96c00                                                                                                                                        
[  276.011663] [<80026fd0>] (warn_slowpath_common+0x0/0x90) from [<80027104>] 
(warn_slowpath_fmt+0x3)                                                                          
[  276.021170]  r8:bf098240 r7:bef42000 r6:bf098200 r5:bf068000 r4:00000000                                                                                                    
[  276.028083] [<800270cc>] (warn_slowpath_fmt+0x0/0x40) from [<804e2754>] 
(dev_watchdog+0x280/0x2a4)                                                                          
[  276.037095]  r3:bf068000 r2:807d2fb4                                                                                                                                        
[  276.040734] [<804e24d4>] (dev_watchdog+0x0/0x2a4) from [<80030e1c>] 
(call_timer_fn+0x70/0xec)                                                                               
[  276.049313] [<80030dac>] (call_timer_fn+0x0/0xec) from [<80031a90>] 
(run_timer_softirq+0x198/0x23)                                                                          
[  276.058407]  r8:00200200 r7:00000000 r6:bef43ec8 r5:bf836000 r4:bf068284                                                                                                    
[  276.065257] [<800318f8>] (run_timer_softirq+0x0/0x230) from [<8002b480>] 
(__do_softirq+0x100/0x25)                                                                          
[  276.074338] [<8002b380>] (__do_softirq+0x0/0x254) from [<8002b9a8>] 
(irq_exit+0xb0/0x108)                                                                                   
[  276.082570] [<8002b8f8>] (irq_exit+0x0/0x108) from [<8000f3dc>] 
(handle_IRQ+0x58/0xb8)                                                                                      
[  276.090528]  r4:8086ccc8 r3:00000184
[  276.094162] [<8000f384>] (handle_IRQ+0x0/0xb8) from [<80008640>] 
(gic_handle_irq+0x30/0x64)
[  276.102552]  r8:5590d9fa r7:f4000100 r6:bef43fb0 r5:8086ce14 r4:f400010c
r3:000000a0
[  276.110575] [<80008610>] (gic_handle_irq+0x0/0x64) from [<800133a0>] 
(__irq_usr+0x40/0x60)
[  276.118871] Exception stack(0xbef43fb0 to 0xbef43ff8)
[  276.123942] 3fa0:                                     fbe9d585 13e98d33 
6ed9eba1 21e67a57
[  276.132173] 3fc0: 170dd9fc bc58d89e 5d1b7878 c19f2bf4 5590d9fa 2577bcc4 
7b2ac1ea 6cee44dd
[  276.140398] 3fe0: cf486f09 7ea92a58 0000ba1f 0000ab72 a0000030 ffffffff
[  276.147041]  r7:c19f2bf4 r6:ffffffff r5:a0000030 r4:0000ab72
[  276.152846] ---[ end trace 054500acb8edb763 ]---
[  3] 15.0-16.0 sec  18.8 MBytes   157 Mbits/sec
[  3] 16.0-17.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 17.0-18.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 18.0-19.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 19.0-20.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 20.0-21.0 sec  19.8 MBytes   166 Mbits/sec
[  3] 21.0-22.0 sec  19.8 MBytes   166 Mbits/sec
[  3] 22.0-23.0 sec  19.5 MBytes   164 Mbits/sec
[  3] 23.0-24.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 24.0-25.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 25.0-26.0 sec  7.88 MBytes  66.1 Mbits/sec
[  3] 26.0-27.0 sec  20.1 MBytes   169 Mbits/sec
[...]
[  3] 71.0-72.0 sec  23.4 MBytes   196 Mbits/sec
[  3] 72.0-73.0 sec  12.2 MBytes   103 Mbits/sec
[  3] 73.0-74.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 74.0-75.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 75.0-76.0 sec  10.9 MBytes  91.2 Mbits/sec
[  3] 76.0-77.0 sec  22.4 MBytes   188 Mbits/sec
[  3] 77.0-78.0 sec  23.0 MBytes   193 Mbits/sec

^ permalink raw reply

* Re: packet mmap related oops (with reproducer)
From: Daniel Borkmann @ 2014-01-22 22:45 UTC (permalink / raw)
  To: Dave Jones; +Cc: netdev, Fedora Kernel Team, Vlastimil Babka, Michel Lespinasse
In-Reply-To: <20140122223920.GA18162@redhat.com>

Hi Dave,

On 01/22/2014 11:39 PM, Dave Jones wrote:
> We had a Fedora user report this bug, with a reproducer (below).
>
> On Fri, Jan 17, 2014 at 06:52:06AM +0000, bugzilla@redhat.com wrote:
>   > https://bugzilla.redhat.com/show_bug.cgi?id=1054537
>   >
>   >
>   >
>   > --- Comment #2 from Karl Auerbach <karl@iwl.com> ---
>   >   --> https://bugzilla.redhat.com/attachment.cgi?id=851413
>   > Simple program to cause this kernel issue
>   >
>   > This is a C program that must be run with root privilege.  It runs fine on
>   > older kernels but causes a kernel problem on the most recent one.
>   >
>   > The code creates a raw socket with transmit and receive ring buffers.
>   >
>   > The rings are then made visible via mmap().
>   >
>   > When munmap() is called things go awry.
>   >
>   > The code in this example was quickly lifted from a larger body of code that has
>   > been running for several years.

It looks to be THP related, you can also use the Linux kernel
selftest suite for networking to trigger that, I've reported
it here:

   https://lkml.org/lkml/2014/1/10/427

Vlastimil seems to have proposed some possibilities, but it
seems discussion stalled for now.

Cheers,

Daniel

> #include <assert.h>
> #include <errno.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <stdarg.h>
> #include <unistd.h>
>
> #include <arpa/inet.h>
> #include <sys/ioctl.h>
> #include <sys/mman.h>
> #include <sys/socket.h>
> #include <sys/un.h>
> #include <sys/user.h>
>
> #include <features.h>    /* for the glibc version number */
> #include <net/ethernet.h>
> #include <linux/if_packet.h>
> #include <linux/if_ether.h>
>
> #include <net/if.h>
> #include <netinet/in.h>
> #include <sys/ioctl.h>
> #include <sys/types.h>
> #include <linux/if_tun.h>
>
> #define NIL   0
>
> typedef int SOCKET;
>
> int main(int argc, char *argv)
> {
> size_t              RxMmap_Size;
> size_t              TxMmap_Size;
> unsigned int        Ether_Sz;
> unsigned int        Block_Sz_Order;
> unsigned int        Block_Sz;
> unsigned int        Block_Cnt;
> unsigned int        Frame_Sz;
> unsigned int        Frame_Cnt;
> unsigned int        Frames_Per_Block;
>
> void *              Mmap_Addr;
> size_t              Mmap_Size;
> size_t		    TXDataOffset;
>
> SOCKET              Socket;
>
> struct tpacket_req ring_req;
>
> Ether_Sz = 1518;
> Block_Sz = Ether_Sz;
> Block_Sz_Order = 2; // 16384 byte blocks
> Block_Cnt = 1000;
>
> Socket = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> if (Socket == -1)
>    {
>    perror("socket failed");
>    return 1;
>    }
>
> Frame_Sz = TPACKET_ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN) + Ether_Sz);
> TXDataOffset = TPACKET2_HDRLEN - sizeof(struct sockaddr_ll);
>
> Block_Sz = getpagesize() << Block_Sz_Order;
>
> Frames_Per_Block = Block_Sz / Frame_Sz;
> Frame_Cnt = Frames_Per_Block * Block_Cnt;
>
> RxMmap_Size = Block_Sz * Block_Cnt;
> TxMmap_Size = RxMmap_Size;
>
> Mmap_Size = RxMmap_Size + TxMmap_Size;
>
> // Establish receive ring
> // For convenience we will let it be the same size as the TX ring
> // The mmap size calculations, far above, assume that the
> // rings are the same size.
> ring_req.tp_block_nr = Block_Cnt;
> ring_req.tp_frame_size = Frame_Sz;
> ring_req.tp_block_size = Block_Sz;
> ring_req.tp_frame_nr = Frame_Cnt;
> if (setsockopt(Socket, SOL_PACKET, PACKET_RX_RING,
>                 (char *)&ring_req, sizeof(ring_req)) < 0)
>    {
>    perror("Setsockopt RX_RING failed");
>    close(Socket);
>    return -1;
>    }
>
> // Establish transmit ring
> // For convenience we will let it be the same size as the RX ring
> // The mmap size calculations, far above, assume that the
> // rings are the same size.
> ring_req.tp_block_nr = Block_Cnt;
> ring_req.tp_frame_size = Frame_Sz;
> ring_req.tp_block_size = Block_Sz;
> ring_req.tp_frame_nr = Frame_Cnt;
> if (setsockopt(Socket, SOL_PACKET, PACKET_TX_RING,
>                 (char *)&ring_req, sizeof(ring_req)) < 0)
>    {
>    perror("Setsockopt TX_RING failed");
>    close(Socket);
>    return -1;
>    }
>
> Mmap_Addr = mmap(NIL, Mmap_Size, PROT_READ | PROT_WRITE,
>                   MAP_SHARED | MAP_LOCKED, Socket, 0);
>
> if (Mmap_Addr == MAP_FAILED)
>    {
>    perror("mmap failed");
>    return 1;
>    }
>
> if (Mmap_Addr != MAP_FAILED)
>    {
>    (void)munmap(Mmap_Addr, Mmap_Size);
>    }
>
> close(Socket);
> return 0;
> }
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH] 6lowpan: add a license to 6lowpan_iphc module
From: Marcel Holtmann @ 2014-01-22 22:49 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Alexander Smirnov, Dmitry Eremin-Solenikov, Gustavo F. Padovan,
	Johan Hedberg, David S. Miller, linux-zigbee-devel,
	BlueZ development, Network Development, linux-kernel,
	Jukka Rissanen, Alexander Aring
In-Reply-To: <1390418724-9804-1-git-send-email-ydroneaud@opteya.com>

Hi Yann,

> Since commit 8df8c56a5abc, 6lowpan_iphc is a module of its own.
> 
> Unfortunately, it lacks some infrastructure to behave like a
> good kernel citizen:
> 
>  kernel: 6lowpan_iphc: module license 'unspecified' taints kernel.
>  kernel: Disabling lock debugging due to kernel taint
> 
> This patch adds the basic MODULE_LICENSE(); with GPL license:
> the code was copied from net/ieee802154/6lowpan.c which is GPL
> and the module exports symbol with EXPORT_SYMBOL_GPL();.
> 
> Cc: Jukka Rissanen <jukka.rissanen@linux.intel.com>
> Cc: Alexander Aring <alex.aring@gmail.com>
> Cc: Marcel Holtmann <marcel@holtmann.org>
> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
> ---
> net/ieee802154/6lowpan_iphc.c | 3 +++
> 1 file changed, 3 insertions(+)
> 
> diff --git a/net/ieee802154/6lowpan_iphc.c b/net/ieee802154/6lowpan_iphc.c
> index e14fe8b2c054..860aa2d445ba 100644
> --- a/net/ieee802154/6lowpan_iphc.c
> +++ b/net/ieee802154/6lowpan_iphc.c
> @@ -52,6 +52,7 @@
> 
> #include <linux/bitops.h>
> #include <linux/if_arp.h>
> +#include <linux/module.h>
> #include <linux/netdevice.h>
> #include <net/ipv6.h>
> #include <net/af_ieee802154.h>
> @@ -797,3 +798,5 @@ int lowpan_header_compress(struct sk_buff *skb, struct net_device *dev,
> 	return 0;
> }
> EXPORT_SYMBOL_GPL(lowpan_header_compress);
> +
> +MODULE_LICENSE("GPL”);

looks good to me.

Acked-by: Marcel Holtmann <marcel@holtmann.org>

Regards

Marcel

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox