Netdev List
 help / color / mirror / Atom feed
* [PATCH] sky2: Avoid race in sky2_change_mtu
From: Mike McCormack @ 2010-05-03 14:18 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

netif_stop_queue does not ensure all in-progress transmits are complete,
 so use netif_tx_disable() instead.

Make sure NAPI polls are disabled, otherwise NAPI might trigger a TX
 restart between when we stop the queue and NAPI is disabled.

Signed-off-by: Mike McCormack <mikem@ring3k.org>
---
 drivers/net/sky2.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index 088c797..b839bae 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -2236,8 +2236,8 @@ static int sky2_change_mtu(struct net_device *dev, int new_mtu)
 	sky2_write32(hw, B0_IMSK, 0);
 
 	dev->trans_start = jiffies;	/* prevent tx timeout */
-	netif_stop_queue(dev);
 	napi_disable(&hw->napi);
+	netif_tx_disable(dev);
 
 	synchronize_irq(hw->pdev->irq);
 
-- 
1.5.6.5


^ permalink raw reply related

* Re: mmotm 2010-04-28 - RCU whinges
From: Valdis.Kletnieks @ 2010-05-03 14:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Peter Zijlstra, Patrick McHardy, David S. Miller,
	linux-kernel, netfilter-devel, netdev, Paul E. McKenney
In-Reply-To: <1272865137.2173.179.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 2704 bytes --]

On Mon, 03 May 2010 07:38:57 +0200, Eric Dumazet said:
> Le dimanche 02 mai 2010 à 13:46 -0400, Valdis.Kletnieks@vt.edu a écrit :
> > On Wed, 28 Apr 2010 16:53:32 PDT, akpm@linux-foundation.org said:
> > > The mm-of-the-moment snapshot 2010-04-28-16-53 has been uploaded to
> > > 
> > >    http://userweb.kernel.org/~akpm/mmotm/
> > 
> > I thought we swatted all these, hit another one...

> Thanks for the report !
> 
> We can use rcu_dereference_protected() in those cases.
> 
> [PATCH] net: Use rcu_dereference_protected in nf_conntrack_ecache
> 
> Writers own nf_ct_ecache_mutex.

I *really* thought we swatted a bunch of these - did the fixes not make it
into linux-next or -mm?  Your patch fixed that one, but then:

[    9.128899] Netfilter messages via NETLINK v0.30.
[    9.128919] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[    9.129108] CONFIG_NF_CT_ACCT is deprecated and will be removed soon. Please use
[    9.129110] nf_conntrack.acct=1 kernel parameter, acct=1 nf_conntrack module option or
[    9.129113] sysctl net.netfilter.nf_conntrack_acct=1 to enable it.
[    9.129135] ctnetlink v0.93: registering with nfnetlink.
[    9.129452] ip_tables: (C) 2000-2006 Netfilter Core Team
[    9.129506] 
[    9.129507] ===================================================
[    9.129683] [ INFO: suspicious rcu_dereference_check() usage. ]
[    9.129777] ---------------------------------------------------
[    9.129872] net/netfilter/nf_log.c:55 invoked rcu_dereference_check() without protection!
[    9.129969] 
[    9.129969] other info that might help us debug this:
[    9.129970] 
[    9.130232] 
[    9.130232] rcu_scheduler_active = 1, debug_locks = 0
[    9.130407] 1 lock held by swapper/1:
[    9.130525]  #0:  (nf_log_mutex){+.+...}, at: [<ffffffff81481154>] nf_log_register+0x57/0x10f
[    9.130955] 
[    9.130956] stack backtrace:
[    9.131162] Pid: 1, comm: swapper Tainted: G        W   2.6.34-rc5-mmotm0428 #2
[    9.131259] Call Trace:
[    9.131370]  [<ffffffff81064832>] lockdep_rcu_dereference+0xaa/0xb2
[    9.131466]  [<ffffffff814811db>] nf_log_register+0xde/0x10f
[    9.131579]  [<ffffffff81b5ca28>] ? log_tg_init+0x0/0x29
[    9.131689]  [<ffffffff81b5ca4d>] log_tg_init+0x25/0x29
[    9.131800]  [<ffffffff810001ef>] do_one_initcall+0x59/0x14e
[    9.131912]  [<ffffffff81b2e68a>] kernel_init+0x144/0x1ce
[    9.132033]  [<ffffffff81003414>] kernel_thread_helper+0x4/0x10
[    9.132146]  [<ffffffff81598a40>] ? restore_args+0x0/0x30
[    9.132257]  [<ffffffff81b2e546>] ? kernel_init+0x0/0x1ce
[    9.132370]  [<ffffffff81003410>] ? kernel_thread_helper+0x0/0x10
[    9.132513] TCP bic registered


[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply

* Re: [PATCH v2] ethernet: call __skb_pull() in eth_type_trans()
From: Eric Dumazet @ 2010-05-03 14:44 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, netdev
In-Reply-To: <1272895972-13799-1-git-send-email-xiaosuo@gmail.com>

Le lundi 03 mai 2010 à 22:12 +0800, Changli Gao a écrit :
> call __skb_pull() in eth_type_trans().
> 
> The callers of eth_type_trans() should always feed it long enough packets. When
> the length of the packet is less than ETH_ZLEN, a warning message will be shown,
> and the later behaviors are undefined.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ----
>  net/ethernet/eth.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
> index 61ec032..1df31cc 100644
> --- a/net/ethernet/eth.c
> +++ b/net/ethernet/eth.c
> @@ -162,7 +162,10 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
>  
>  	skb->dev = dev;
>  	skb_reset_mac_header(skb);
> -	skb_pull_inline(skb, ETH_HLEN);
> +	if (unlikely(skb->len < ETH_ZLEN))
> +		dev_warn(&dev->dev, "too small ethernet packet: %u bytes\n",
> +			 skb->len);
> +	__skb_pull(skb, ETH_HLEN);
>  	eth = eth_hdr(skb);
>  
>  	if (unlikely(is_multicast_ether_addr(eth->h_dest))) {


Hmm, I feel very uncompfortable with this patch.

I am pretty sure some callers dont check minimum ethernet frame length.

At least a WARN_ON_ONCE() is needed, just in case...
In fact our stack has different requirements.

Check net/ipv4/ip_gre.c for example.

                if (tunnel->dev->type == ARPHRD_ETHER) {
                        if (!pskb_may_pull(skb, ETH_HLEN)) {
                                stats->rx_length_errors++;
                                stats->rx_errors++;
                                goto drop;
                        }

                        iph = ip_hdr(skb);
                        skb->protocol = eth_type_trans(skb, tunnel->dev);
                        skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
                }



^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Brian Bloniarz @ 2010-05-03 14:45 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Eric Dumazet, David Miller, hadi, xiaosuo, therbert,
	shemminger, netdev, lenb
In-Reply-To: <20100503070925.572bbee6@infradead.org>

Arjan van de Ven wrote:
> On Mon, 3 May 2010 12:34:26 +0200
> Andi Kleen <andi@firstfloor.org> wrote:
> 
>>>> Maybe its low cost, (apparently, it is, since I can reach ~900.000
>>>> ipis on my 16 cores machine) but multiply this by 16 or 32 or 64
>>>> cpus, and clockevents_notify() cost appears to be a killer, all
>>>> cpus compete on a single lock.
>>>>
>>>> Maybe this notifier could use RCU ?
>>> could this be an artifact of the local apic stopping in deeper C
>>> states? (which is finally fixed in the Westmere generation)
>> Yes it is I think.
>>
>> But I suspect Eric wants a solution for Nehalem.
> 
> sure ;-)
> 
> 
> so the hard problem is that on going idle, the local timers need to be
> funneled to the external HPET. Afaik right now we use one channel of
> the hpet, with the result that we have one global lock for this.

Does the HPET only need to be programmed when going idle?
That could mean that this isn't a big performance issue.
cares if you spin for a while when you're about to sleep for
at least 60usec?

> HPETs have more than one channel (2 or 3 historically, newer chipsets
> iirc have a few more), so in principle we can split this lock at least
> a little bit... if we can get to one hpet channel per level 3 cache
> domain we'd already make huge progress in terms of cost of the
> contention....

Another possible approach: if a core needs the HPET and finds it
locked, it could queue up its request to a backlog which the
locking core will service.

^ permalink raw reply

* Re: mmotm 2010-04-28 - RCU whinges
From: Eric Dumazet @ 2010-05-03 14:48 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, Peter Zijlstra, Patrick McHardy, David S. Miller,
	linux-kernel, netfilter-devel, netdev, Paul E. McKenney
In-Reply-To: <5933.1272897014@localhost>

Le lundi 03 mai 2010 à 10:30 -0400, Valdis.Kletnieks@vt.edu a écrit :

> 
> I *really* thought we swatted a bunch of these - did the fixes not make it
> into linux-next or -mm?  Your patch fixed that one, but then:
> 
> [    9.128899] Netfilter messages via NETLINK v0.30.
> [    9.128919] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
> [    9.129108] CONFIG_NF_CT_ACCT is deprecated and will be removed soon. Please use
> [    9.129110] nf_conntrack.acct=1 kernel parameter, acct=1 nf_conntrack module option or
> [    9.129113] sysctl net.netfilter.nf_conntrack_acct=1 to enable it.
> [    9.129135] ctnetlink v0.93: registering with nfnetlink.
> [    9.129452] ip_tables: (C) 2000-2006 Netfilter Core Team
> [    9.129506] 
> [    9.129507] ===================================================
> [    9.129683] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    9.129777] ---------------------------------------------------
> [    9.129872] net/netfilter/nf_log.c:55 invoked rcu_dereference_check() without protection!
> [    9.129969] 
> [    9.129969] other info that might help us debug this:
> [    9.129970] 
> [    9.130232] 
> [    9.130232] rcu_scheduler_active = 1, debug_locks = 0
> [    9.130407] 1 lock held by swapper/1:
> [    9.130525]  #0:  (nf_log_mutex){+.+...}, at: [<ffffffff81481154>] nf_log_register+0x57/0x10f
> [    9.130955] 
> [    9.130956] stack backtrace:
> [    9.131162] Pid: 1, comm: swapper Tainted: G        W   2.6.34-rc5-mmotm0428 #2
> [    9.131259] Call Trace:
> [    9.131370]  [<ffffffff81064832>] lockdep_rcu_dereference+0xaa/0xb2
> [    9.131466]  [<ffffffff814811db>] nf_log_register+0xde/0x10f
> [    9.131579]  [<ffffffff81b5ca28>] ? log_tg_init+0x0/0x29
> [    9.131689]  [<ffffffff81b5ca4d>] log_tg_init+0x25/0x29
> [    9.131800]  [<ffffffff810001ef>] do_one_initcall+0x59/0x14e
> [    9.131912]  [<ffffffff81b2e68a>] kernel_init+0x144/0x1ce
> [    9.132033]  [<ffffffff81003414>] kernel_thread_helper+0x4/0x10
> [    9.132146]  [<ffffffff81598a40>] ? restore_args+0x0/0x30
> [    9.132257]  [<ffffffff81b2e546>] ? kernel_init+0x0/0x1ce
> [    9.132370]  [<ffffffff81003410>] ? kernel_thread_helper+0x0/0x10
> [    9.132513] TCP bic registered
> 

You probably know this PROVE_RCU thing is new and reserved to
developpers ?

We yet have to change all spots were a rcu_dereference() was used
without rcu_read_lock(). Not a bug by itself, just lockdep is to be
instructed not to shout.

Maybe 30 patches already in, and maybe 30 other are still needed.

^ permalink raw reply

* Re: mmotm 2010-04-28 - RCU whinges
From: Eric Dumazet @ 2010-05-03 14:58 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, Peter Zijlstra, Patrick McHardy, David S. Miller,
	linux-kernel, netfilter-devel, netdev, Paul E. McKenney
In-Reply-To: <5933.1272897014@localhost>

Le lundi 03 mai 2010 à 10:30 -0400, Valdis.Kletnieks@vt.edu a écrit :

> [    9.128899] Netfilter messages via NETLINK v0.30.
> [    9.128919] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
> [    9.129108] CONFIG_NF_CT_ACCT is deprecated and will be removed soon. Please use
> [    9.129110] nf_conntrack.acct=1 kernel parameter, acct=1 nf_conntrack module option or
> [    9.129113] sysctl net.netfilter.nf_conntrack_acct=1 to enable it.
> [    9.129135] ctnetlink v0.93: registering with nfnetlink.
> [    9.129452] ip_tables: (C) 2000-2006 Netfilter Core Team
> [    9.129506] 
> [    9.129507] ===================================================
> [    9.129683] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    9.129777] ---------------------------------------------------
> [    9.129872] net/netfilter/nf_log.c:55 invoked rcu_dereference_check() without protection!
> [    9.129969] 
> [    9.129969] other info that might help us debug this:
> [    9.129970] 
> [    9.130232] 
> [    9.130232] rcu_scheduler_active = 1, debug_locks = 0
> [    9.130407] 1 lock held by swapper/1:
> [    9.130525]  #0:  (nf_log_mutex){+.+...}, at: [<ffffffff81481154>] nf_log_register+0x57/0x10f
> [    9.130955] 
> [    9.130956] stack backtrace:
> [    9.131162] Pid: 1, comm: swapper Tainted: G        W   2.6.34-rc5-mmotm0428 #2
> [    9.131259] Call Trace:
> [    9.131370]  [<ffffffff81064832>] lockdep_rcu_dereference+0xaa/0xb2
> [    9.131466]  [<ffffffff814811db>] nf_log_register+0xde/0x10f
> [    9.131579]  [<ffffffff81b5ca28>] ? log_tg_init+0x0/0x29
> [    9.131689]  [<ffffffff81b5ca4d>] log_tg_init+0x25/0x29
> [    9.131800]  [<ffffffff810001ef>] do_one_initcall+0x59/0x14e
> [    9.131912]  [<ffffffff81b2e68a>] kernel_init+0x144/0x1ce
> [    9.132033]  [<ffffffff81003414>] kernel_thread_helper+0x4/0x10
> [    9.132146]  [<ffffffff81598a40>] ? restore_args+0x0/0x30
> [    9.132257]  [<ffffffff81b2e546>] ? kernel_init+0x0/0x1ce
> [    9.132370]  [<ffffffff81003410>] ? kernel_thread_helper+0x0/0x10
> [    9.132513] TCP bic registered
> 

Thanks for the report !

[PATCH] net: nf_log RCU fixes

nf_log_register() and nf_log_unregister() use a mutex to have exclusive
access to nf_logers[]. Use appropriate rcu_dereference_protected()
lockdep annotation.

Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index 015725a..7df37fd 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -52,7 +52,8 @@ int nf_log_register(u_int8_t pf, struct nf_logger *logger)
 	} else {
 		/* register at end of list to honor first register win */
 		list_add_tail(&logger->list[pf], &nf_loggers_l[pf]);
-		llog = rcu_dereference(nf_loggers[pf]);
+		llog = rcu_dereference_protected(nf_loggers[pf],
+						 lockdep_is_held(&nf_log_mutex));
 		if (llog == NULL)
 			rcu_assign_pointer(nf_loggers[pf], logger);
 	}
@@ -70,7 +71,8 @@ void nf_log_unregister(struct nf_logger *logger)
 
 	mutex_lock(&nf_log_mutex);
 	for (i = 0; i < ARRAY_SIZE(nf_loggers); i++) {
-		c_logger = rcu_dereference(nf_loggers[i]);
+		c_logger = rcu_dereference_protected(nf_loggers[i],
+						     lockdep_is_held(&nf_log_mutex));
 		if (c_logger == logger)
 			rcu_assign_pointer(nf_loggers[i], NULL);
 		list_del(&logger->list[i]);

^ permalink raw reply related

* [PATCH] net/gianfar: drop recycled skbs on MTU change
From: Sebastian Andrzej Siewior @ 2010-05-03 15:17 UTC (permalink / raw)
  To: Andy Fleming; +Cc: netdev

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The size for skb which is added to the recycled list is using the
current descriptor size which is current MTU. gfar_new_skb() is also
using this size. So after changing or alteast increasing the MTU all
recycled skbs should be dropped.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
I'm not 100% sure but it looks like it is wrong.

 drivers/net/gianfar.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index 5267c27..9093106 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -2287,8 +2287,10 @@ static int gfar_change_mtu(struct net_device *dev, int new_mtu)
 
 	/* Only stop and start the controller if it isn't already
 	 * stopped, and we changed something */
-	if ((oldsize != tempsize) && (dev->flags & IFF_UP))
+	if ((oldsize != tempsize) && (dev->flags & IFF_UP)) {
 		stop_gfar(dev);
+		skb_queue_purge(&priv->rx_recycle);
+	}
 
 	priv->rx_buffer_size = tempsize;
 
-- 
1.6.6.1

^ permalink raw reply related

* Re: mmotm 2010-04-28 - RCU whinges
From: Valdis.Kletnieks @ 2010-05-03 15:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Peter Zijlstra, Patrick McHardy, David S. Miller,
	linux-kernel, netfilter-devel, netdev, Paul E. McKenney
In-Reply-To: <1272898726.2226.47.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 2085 bytes --]

On Mon, 03 May 2010 16:58:46 +0200, Eric Dumazet said:
> Le lundi 03 mai 2010 à 10:30 -0400, Valdis.Kletnieks@vt.edu a écrit :

> > [    9.129872] net/netfilter/nf_log.c:55 invoked rcu_dereference_check() without protection!

> Thanks for the report !
> 
> [PATCH] net: nf_log RCU fixes
> 
> nf_log_register() and nf_log_unregister() use a mutex to have exclusive
> access to nf_logers[]. Use appropriate rcu_dereference_protected()
> lockdep annotation.

Confirming that one fixed.  Now it lives a whole 36 seconds before whinging:

[   35.328729] ===================================================
[   35.328803] [ INFO: suspicious rcu_dereference_check() usage. ]
[   35.328837] ---------------------------------------------------
[   35.328872] net/ipv6/addrconf.c:2977 invoked rcu_dereference_check() without protection!
[   35.328926] 
[   35.328927] other info that might help us debug this:
[   35.328928] 
[   35.329016] 
[   35.329016] rcu_scheduler_active = 1, debug_locks = 0
[   35.329089] 2 locks held by ifconfig/2680:
[   35.329120]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff810f5db8>] seq_read+0x3a/0x42d
[   35.329217]  #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff814eec68>] rcu_read_lock_bh+0x0/0x35
[   35.329322] 
[   35.329323] stack backtrace:
[   35.329380] Pid: 2680, comm: ifconfig Tainted: G        W   2.6.34-rc5-mmotm0428 #3
[   35.329439] Call Trace:
[   35.329471]  [<ffffffff81064832>] lockdep_rcu_dereference+0xaa/0xb2
[   35.329514]  [<ffffffff814ef3ae>] if6_get_next+0x34/0x6d
[   35.329554]  [<ffffffff814ef3f8>] if6_seq_next+0x11/0x18
[   35.329595]  [<ffffffff810f6083>] seq_read+0x305/0x42d
[   35.329635]  [<ffffffff810f5d7e>] ? seq_read+0x0/0x42d
[   35.329676]  [<ffffffff81129e8c>] proc_reg_read+0x8d/0xac
[   35.329717]  [<ffffffff810db7e0>] vfs_read+0xe0/0x140
[   35.329758]  [<ffffffff810db8f6>] sys_read+0x45/0x69
[   35.329799]  [<ffffffff810025eb>] system_call_fastpath+0x16/0x1b

Maybe I need to go and stick the "RCU whinge multiple times" patch on this
kernel and get it over with. :)


[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply

* Re: [PATCHv7] add mergeable buffers support to vhost_net
From: David Stevens @ 2010-05-03 15:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, virtualization
In-Reply-To: <20100503103410.GA11113@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 05/03/2010 03:34:11 AM:

> On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
> > This patch adds mergeable receive buffer support to vhost_net.
> > 
> > Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
> 
> I've been doing some more testing before sending out a pull
> request, and I see a drastic performance degradation in guest to host
> traffic when this is applied but mergeable buffers are not in used
> by userspace (existing qemu-kvm userspace).

        Actually, I wouldn't expect it to work at all; the qemu-kvm
patch (particularly the feature bit setting bug fix) is required.
Without it, I think the existing code will tell the guest to use
mergeable buffers while turning it off in vhost. That was completely
non-functional for me -- what version of qemu-kvm are you using?
        What I did to test w/o mergeable buffers is turn off the
bit in VHOST_FEATURES. I'll recheck these, but qemu-kvm definitely
must be updated; the original doesn't correctly handle feature bits.

                                                                +-DLS


^ permalink raw reply

* Re: mmotm 2010-04-28 - RCU whinges
From: Paul E. McKenney @ 2010-05-03 15:43 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Eric Dumazet, Andrew Morton, Peter Zijlstra, Patrick McHardy,
	David S. Miller, linux-kernel, netfilter-devel, netdev
In-Reply-To: <5112.1272900590@localhost>

On Mon, May 03, 2010 at 11:29:50AM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 03 May 2010 16:58:46 +0200, Eric Dumazet said:
> > Le lundi 03 mai 2010 à 10:30 -0400, Valdis.Kletnieks@vt.edu a écrit :
> 
> > > [    9.129872] net/netfilter/nf_log.c:55 invoked rcu_dereference_check() without protection!
> 
> > Thanks for the report !
> > 
> > [PATCH] net: nf_log RCU fixes
> > 
> > nf_log_register() and nf_log_unregister() use a mutex to have exclusive
> > access to nf_logers[]. Use appropriate rcu_dereference_protected()
> > lockdep annotation.
> 
> Confirming that one fixed.  Now it lives a whole 36 seconds before whinging:
> 
> [   35.328729] ===================================================
> [   35.328803] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   35.328837] ---------------------------------------------------
> [   35.328872] net/ipv6/addrconf.c:2977 invoked rcu_dereference_check() without protection!
> [   35.328926] 
> [   35.328927] other info that might help us debug this:
> [   35.328928] 
> [   35.329016] 
> [   35.329016] rcu_scheduler_active = 1, debug_locks = 0
> [   35.329089] 2 locks held by ifconfig/2680:
> [   35.329120]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff810f5db8>] seq_read+0x3a/0x42d
> [   35.329217]  #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff814eec68>] rcu_read_lock_bh+0x0/0x35
> [   35.329322] 
> [   35.329323] stack backtrace:
> [   35.329380] Pid: 2680, comm: ifconfig Tainted: G        W   2.6.34-rc5-mmotm0428 #3
> [   35.329439] Call Trace:
> [   35.329471]  [<ffffffff81064832>] lockdep_rcu_dereference+0xaa/0xb2
> [   35.329514]  [<ffffffff814ef3ae>] if6_get_next+0x34/0x6d
> [   35.329554]  [<ffffffff814ef3f8>] if6_seq_next+0x11/0x18
> [   35.329595]  [<ffffffff810f6083>] seq_read+0x305/0x42d
> [   35.329635]  [<ffffffff810f5d7e>] ? seq_read+0x0/0x42d
> [   35.329676]  [<ffffffff81129e8c>] proc_reg_read+0x8d/0xac
> [   35.329717]  [<ffffffff810db7e0>] vfs_read+0xe0/0x140
> [   35.329758]  [<ffffffff810db8f6>] sys_read+0x45/0x69
> [   35.329799]  [<ffffffff810025eb>] system_call_fastpath+0x16/0x1b
> 
> Maybe I need to go and stick the "RCU whinge multiple times" patch on this
> kernel and get it over with. :)

Highly recommended.  ;-)

And thanks to you for your testing efforts and to Eric for the fixes!!!

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-03 15:52 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Eric Dumazet, David Miller, hadi, xiaosuo, therbert, shemminger,
	netdev, lenb
In-Reply-To: <20100503070925.572bbee6@infradead.org>

> so the hard problem is that on going idle, the local timers need to be
> funneled to the external HPET. Afaik right now we use one channel of
> the hpet, with the result that we have one global lock for this.
> 
> HPETs have more than one channel (2 or 3 historically, newer chipsets
> iirc have a few more), so in principle we can split this lock at least
> a little bit... if we can get to one hpet channel per level 3 cache
> domain we'd already make huge progress in terms of cost of the
> contention....

I suggested the same thing a few emails up @) (great minds think 
alike etc.etc. @) . 

I'm not sure how difficult it would be to implement though.

Potential issues:

Some user applications use the hpet channels directly through
the character device interface so there would be a potential
compatibility issue (but maybe that should be just moved
to be emulated with a hrtimer ?)

And if multiple broadcast controllers are elected this might
make it harder to become idle.

-Andi

^ permalink raw reply

* Re: [PATCHv7] add mergeable buffers support to vhost_net
From: Michael S. Tsirkin @ 2010-05-03 15:56 UTC (permalink / raw)
  To: David Stevens; +Cc: kvm, netdev, virtualization
In-Reply-To: <OF639FB51C.80B669F0-ON88257718.0054F654-88257718.0055FB45@us.ibm.com>

On Mon, May 03, 2010 at 08:39:08AM -0700, David Stevens wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 05/03/2010 03:34:11 AM:
> 
> > On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
> > > This patch adds mergeable receive buffer support to vhost_net.
> > > 
> > > Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
> > 
> > I've been doing some more testing before sending out a pull
> > request, and I see a drastic performance degradation in guest to host
> > traffic when this is applied but mergeable buffers are not in used
> > by userspace (existing qemu-kvm userspace).
> 
>         Actually, I wouldn't expect it to work at all;
> the qemu-kvm
> patch (particularly the feature bit setting bug fix) is required.

Which bugfix is that?

> Without it, I think the existing code will tell the guest to use
> mergeable buffers while turning it off in vhost.

It should not do that, specifically
vhost_net_get_features does:
	features &= ~(1 << VIRTIO_NET_F_MRG_RXBUF);
unconditionally. This was added with:
5751995a20e77cd9d61d00f7390401895fa172a6

I forced mergeable buffers off with -global virtio-net-pci.mrg_rxbuf=off
and it did not seem to help either.

> That was completely
> non-functional for me -- what version of qemu-kvm are you using?

992cc816c433332f2e93db033919a9ddbfcd1da4 or later should work well
AFAIK.

>         What I did to test w/o mergeable buffers is turn off the
> bit in VHOST_FEATURES.

It should be enough to force mergeable buffers to off by qemu command
line: -global virtio-net-pci.mrg_rxbuf=off 

> I'll recheck these, but qemu-kvm definitely
> must be updated; the original doesn't correctly handle feature bits.
> 
>                                                                 +-DLS

Hmm, I don't see the bug.

-- 
MST

^ permalink raw reply

* Re: mmotm 2010-04-28 - RCU whinges
From: Eric Dumazet @ 2010-05-03 16:14 UTC (permalink / raw)
  To: paulmck
  Cc: Valdis.Kletnieks, Andrew Morton, Peter Zijlstra, Patrick McHardy,
	David S. Miller, linux-kernel, netfilter-devel, netdev
In-Reply-To: <20100503154357.GF2597@linux.vnet.ibm.com>

Le lundi 03 mai 2010 à 08:43 -0700, Paul E. McKenney a écrit :

> Highly recommended.  ;-)
> 
> And thanks to you for your testing efforts and to Eric for the fixes!!!
> 

For this last one, I think you should push following patch Paul

Followup of commit 3120438ad6
(rcu: Disable lockdep checking in RCU list-traversal primitives)

Or we might introduce a hlist_for_each_entry_continue_rcu_bh() macro...



diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 004908b..b0c7e24 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -435,10 +435,10 @@ static inline void hlist_add_after_rcu(struct hlist_node *prev,
  * @member:	the name of the hlist_node within the struct.
  */
 #define hlist_for_each_entry_continue_rcu(tpos, pos, member)		\
-	for (pos = rcu_dereference((pos)->next);			\
+	for (pos = rcu_dereference_raw((pos)->next);			\
 	     pos && ({ prefetch(pos->next); 1; }) &&			\
 	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });  \
-	     pos = rcu_dereference(pos->next))
+	     pos = rcu_dereference_raw(pos->next))
 
 
 #endif	/* __KERNEL__ */



^ permalink raw reply related

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Vivek Kashyap @ 2010-05-03 16:18 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Scott Feldman, davem, netdev, chrisw, Jens Osterkamp
In-Reply-To: <201005031332.34955.arnd@arndb.de>

> Yes, that's exactly what I wrote. So do you have any idea why we would
> ever not want to do the resource reservation?

I agree that resource reservation is preferable during migration since it 
avoids failure due to lack of resources at the switch (after doing all the 
work of state transfer to the target system).

In other scenarios where the period of resource reservation is long, 
thereby diminishing the number of active vsi flows, the management 
entity might choose not to use resource reservation.

Vivek

>
> 	Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCHv7] add mergeable buffers support to vhost_net
From: David Stevens @ 2010-05-03 16:19 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, virtualization
In-Reply-To: <20100503155613.GB12449@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 05/03/2010 08:56:14 AM:

> On Mon, May 03, 2010 at 08:39:08AM -0700, David Stevens wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> wrote on 05/03/2010 03:34:11 AM:
> > 
> > > On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
> > > > This patch adds mergeable receive buffer support to vhost_net.
> > > > 
> > > > Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
> > > 
> > > I've been doing some more testing before sending out a pull
> > > request, and I see a drastic performance degradation in guest to 
host
> > > traffic when this is applied but mergeable buffers are not in used
> > > by userspace (existing qemu-kvm userspace).
> > 
> >         Actually, I wouldn't expect it to work at all;
> > the qemu-kvm
> > patch (particularly the feature bit setting bug fix) is required.
> 
> Which bugfix is that?

        Actually, I see you put that upstream already--
commit dc14a397812b91dd0d48b03d1b8f66a251542369  in
Avi's tree is the one I was talking about.
        I'll look further.

                                                +-DLS


^ permalink raw reply

* OOP in ip_cmsg_recv (net-next)
From: Stephen Hemminger @ 2010-05-03 16:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

I am getting occasional NULL pointer references with net-next kernel.
No test, just usual stuff (like DNS).

This is a new regression in net-next only.


[  674.929685] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
[  674.929691] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.929699] PGD 1bce2b067 PUD 1b80af067 PMD 0 
[  674.929704] Oops: 0000 [#1] SMP 
[  674.929708] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
[  674.929712] CPU 2 
[  674.929713] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
[  674.929764] 
[  674.929767] Pid: 4358, comm: dnsmasq Not tainted 2.6.34-rc6-net #121 P6T DELUXE/System Product Name
[  674.929770] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.929776] RSP: 0018:ffff8801bce27ac8  EFLAGS: 00010246
[  674.929778] RAX: 0000000000000000 RBX: ffff8801bde62500 RCX: 0000000000000000
[  674.929781] RDX: ffff8801bce27e48 RSI: ffff8801bde62500 RDI: ffff8801bce27f18
[  674.929784] RBP: ffff8801bce27b48 R08: 0000000000000640 R09: 0000000000000000
[  674.929787] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bce27f18
[  674.929789] R13: ffff8801bce27f18 R14: 0000000000000000 R15: ffff8801bdbe8850
[  674.929793] FS:  00007fe37fbfd700(0000) GS:ffff880001e40000(0000) knlGS:0000000000000000
[  674.929796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  674.929798] CR2: 0000000000000322 CR3: 00000001bce5c000 CR4: 00000000000006e0
[  674.929801] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  674.929804] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  674.929807] Process dnsmasq (pid: 4358, threadinfo ffff8801bce26000, task ffff8801bda54560)
[  674.929810] Stack:
[  674.929811]  0000000000000134 000000000000012c ffff8801bce27b48 ffffffff813b065b
[  674.929816] <0> ffff8801bce27b08 ffffffff8123ce8e ffff8801bdbe8800 ffff8801bce27dc8
[  674.929821] <0> ffff8801bce27b18 ffffffff81464612 ffff8801bce27b48 000000005eba1e95
[  674.929827] Call Trace:
[  674.929834]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
[  674.929840]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
[  674.929845]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
[  674.929850]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
[  674.929856]  [<ffffffff81045190>] ? default_wake_function+0x0/0x10
[  674.929860]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
[  674.929866]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
[  674.929872]  [<ffffffff814388c0>] ? unix_dgram_sendmsg+0x5b0/0x630
[  674.929878]  [<ffffffff81119e12>] ? link_path_walk+0x502/0xaf0
[  674.929882]  [<ffffffff813a3728>] ? sock_aio_write+0x138/0x150
[  674.929888]  [<ffffffff810ca88d>] ? find_get_page+0x1d/0xc0
[  674.929892]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
[  674.929897]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
[  674.929902]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
[  674.929908]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
[  674.929910] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
[  674.929955] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.929959]  RSP <ffff8801bce27ac8>
[  674.929961] CR2: 0000000000000322
[  674.929964] ---[ end trace 443be32e81365554 ]---
[  674.929966] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
[  674.929972] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.929979] PGD 1bb9c7067 PUD 1bd5d3067 PMD 0 
[  674.929985] Oops: 0000 [#2] SMP 
[  674.929989] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
[  674.929994] CPU 7 
[  674.929997] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
[  674.930067] 
[  674.930072] Pid: 4525, comm: dnsmasq Tainted: G      D    2.6.34-rc6-net #121 P6T DELUXE/System Product Name
[  674.930077] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.930084] RSP: 0018:ffff8801bcf03ac8  EFLAGS: 00010246
[  674.930088] RAX: 0000000000000000 RBX: ffff8801b746c500 RCX: 0000000000000000
[  674.930092] RDX: ffff8801bcf03e48 RSI: ffff8801b746c500 RDI: ffff8801bcf03f18
[  674.930097] RBP: ffff8801bcf03b48 R08: 0000000000000640 R09: 0000000000000000
[  674.930101] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bcf03f18
[  674.930105] R13: ffff8801bcf03f18 R14: 0000000000000000 R15: ffff8801bd430850
[  674.930110] FS:  00007f42211eb700(0000) GS:ffff880001ee0000(0000) knlGS:0000000000000000
[  674.930114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  674.930118] CR2: 0000000000000322 CR3: 00000001bb96b000 CR4: 00000000000006e0
[  674.930122] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  674.930127] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  674.930132] Process dnsmasq (pid: 4525, threadinfo ffff8801bcf02000, task ffff8801bd52ae40)
[  674.930135] Stack:
[  674.930137]  0000000000000134 000000000000012c ffff8801bcf03b48 ffffffff813b065b
[  674.930144] <0> ffff8801bcf03b08 ffffffff8123ce8e ffff8801bd430800 ffff8801bcf03dc8
[  674.930152] <0> ffff8801bcf03b18 ffffffff81464612 ffff8801bcf03b48 0000000003fe9d95
[  674.930160] Call Trace:
[  674.930167]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
[  674.930174]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
[  674.930180]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
[  674.930187]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
[  674.930193]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
[  674.930199]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
[  674.930206]  [<ffffffff814388c0>] ? unix_dgram_sendmsg+0x5b0/0x630
[  674.930212]  [<ffffffff8123cf34>] ? do_raw_spin_lock+0x54/0x150
[  674.930218]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
[  674.930224]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
[  674.930231]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
[  674.930238]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
[  674.930241] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
[  674.930307] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.930313]  RSP <ffff8801bcf03ac8>
[  674.930315] CR2: 0000000000000322
[  674.930319] ---[ end trace 443be32e81365555 ]---
[  674.930322] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
[  674.930327] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.930332] PGD 1b97f1067 PUD 1bb827067 PMD 0 
[  674.930338] Oops: 0000 [#3] SMP 
[  674.930341] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
[  674.930345] CPU 3 
[  674.930347] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
[  674.930396] 
[  674.930401] Pid: 4561, comm: dnsmasq Tainted: G      D    2.6.34-rc6-net #121 P6T DELUXE/System Product Name
[  674.930405] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.930413] RSP: 0018:ffff8801bcd95ac8  EFLAGS: 00010246
[  674.930417] RAX: 0000000000000000 RBX: ffff8801b746cb00 RCX: 0000000000000000
[  674.930421] RDX: ffff8801bcd95e48 RSI: ffff8801b746cb00 RDI: ffff8801bcd95f18
[  674.930425] RBP: ffff8801bcd95b48 R08: 0000000000000640 R09: 0000000000000000
[  674.930429] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bcd95f18
[  674.930433] R13: ffff8801bcd95f18 R14: 0000000000000000 R15: ffff8801b6bf8c50
[  674.930439] FS:  00007fc947627700(0000) GS:ffff880001e60000(0000) knlGS:0000000000000000
[  674.930443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  674.930447] CR2: 0000000000000322 CR3: 00000001b9654000 CR4: 00000000000006e0
[  674.930451] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  674.930455] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  674.930460] Process dnsmasq (pid: 4561, threadinfo ffff8801bcd94000, task ffff8801bd5b1720)
[  674.930464] Stack:
[  674.930466]  0000000000000134 000000000000012c ffff8801bcd95b48 ffffffff813b065b
[  674.930473] <0> ffff8801bcd95b08 ffffffff8123ce8e ffff8801b6bf8c00 ffff8801bcd95dc8
[  674.930481] <0> ffff8801bcd95b18 ffffffff81464612 ffff8801bcd95b48 000000008ae6d276
[  674.930490] Call Trace:
[  674.930496]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
[  674.930503]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
[  674.930509]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
[  674.930516]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
[  674.930522]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
[  674.930529]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
[  674.930537]  [<ffffffff810704e2>] ? finish_wait+0x62/0x80
[  674.930543]  [<ffffffff814623f3>] ? __wait_on_bit_lock+0x73/0xb0
[  674.930550]  [<ffffffff81070390>] ? wake_bit_function+0x0/0x40
[  674.930556]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
[  674.930562]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
[  674.930569]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
[  674.930576]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
[  674.930579] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
[  674.930636] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.930641]  RSP <ffff8801bcd95ac8>
[  674.930642] CR2: 0000000000000322
[  674.930645] ---[ end trace 443be32e81365556 ]---
[  674.930647] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
[  674.930653] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.930660] PGD 1bcdbc067 PUD 1bbc3c067 PMD 0 
[  674.930666] Oops: 0000 [#4] SMP 
[  674.930669] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
[  674.930672] CPU 4 
[  674.930673] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
[  674.930712] 
[  674.930715] Pid: 4488, comm: dnsmasq Tainted: G      D    2.6.34-rc6-net #121 P6T DELUXE/System Product Name
[  674.930718] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.930723] RSP: 0018:ffff8801bcd93ac8  EFLAGS: 00010246
[  674.930725] RAX: 0000000000000000 RBX: ffff8801b746cf00 RCX: 0000000000000000
[  674.930727] RDX: ffff8801bcd93e48 RSI: ffff8801b746cf00 RDI: ffff8801bcd93f18
[  674.930730] RBP: ffff8801bcd93b48 R08: 0000000000000640 R09: 0000000000000000
[  674.930732] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bcd93f18
[  674.930735] R13: ffff8801bcd93f18 R14: 0000000000000000 R15: ffff8801b6bf8450
[  674.930738] FS:  00007f4ccbd68700(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000
[  674.930741] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  674.930743] CR2: 0000000000000322 CR3: 00000001bb81d000 CR4: 00000000000006e0
[  674.930745] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  674.930748] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  674.930751] Process dnsmasq (pid: 4488, threadinfo ffff8801bcd92000, task ffff8801bde2dc80)
[  674.930753] Stack:
[  674.930754]  0000000000000134 000000000000012c ffff8801bcd93b48 ffffffff813b065b
[  674.930758] <0> ffff8801bcd93b08 ffffffff8123ce8e ffff8801b6bf8400 ffff8801bcd93dc8
[  674.930763] <0> ffff8801bcd93b18 ffffffff81464612 ffff8801bcd93b48 00000000d5628d65
[  674.930768] Call Trace:
[  674.930773]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
[  674.930778]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
[  674.930783]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
[  674.930787]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
[  674.930792]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
[  674.930796]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
[  674.930801]  [<ffffffff814388c0>] ? unix_dgram_sendmsg+0x5b0/0x630
[  674.930806]  [<ffffffff81119e12>] ? link_path_walk+0x502/0xaf0
[  674.930810]  [<ffffffff813a3728>] ? sock_aio_write+0x138/0x150
[  674.930815]  [<ffffffff810ca88d>] ? find_get_page+0x1d/0xc0
[  674.930819]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
[  674.930823]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
[  674.930828]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
[  674.930833]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
[  674.930835] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
[  674.930880] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
[  674.930884]  RSP <ffff8801bcd93ac8>
[  674.930886] CR2: 0000000000000322
[  674.930889] ---[ end trace 443be32e81365557 ]---

^ permalink raw reply

* [RFC PATCH] sctp: fix sctp to work with ipv6 source address routing
From: Vlad Yasevich @ 2010-05-03 16:50 UTC (permalink / raw)
  To: Weixing.Shi, yjwei; +Cc: netdev, Vlad Yasevich
In-Reply-To: <4BCE68EE.8020500@cn.fujitsu.com>

From: Weixing Shi <Weixing.Shi@windriver.com>

<vlad>
  So I've updated this patch with Wei's and my own suggestings.  Here is an
  update.  Please review and verify that in your environment this still works.
</vlad>

in the below test case, using the source address routing,
sctp can not work.
Node-A
1)ifconfig eth0 inet6 add 2001:1::1/64
2)ip -6 rule add from 2001:1::1 table 100 pref 100
3)ip -6 route add 2001:2::1 dev eth0 table 100
4)sctp_darn -H 2001:1::1 -P 250 -l &
Node-B
1)ifconfig eth0 inet6 add 2001:2::1/64
2)ip -6 rule add from 2001:2::1 table 100 pref 100
3)ip -6 route add 2001:1::1 dev eth0 table 100
4)sctp_darn -H 2001:2::1 -P 250 -h 2001:1::1 -p 250 -s

root cause:
Node-A and Node-B use the source address routing, and
at begining, source address will be NULL,sctp will
search the  routing table by the destination address,
because using the source address routing table, and
the result dst_entry will be NULL.

solution:
walk through the bind address list to get the source
address and then lookup the routing table again to get
the correct dst_entry.

Signed-off-by: Weixing Shi <Weixing.Shi@windriver.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 net/sctp/ipv6.c |  112 ++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 69 insertions(+), 43 deletions(-)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 7326891..2a987a2 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -77,6 +77,10 @@
 #include <net/sctp/sctp.h>
 
 #include <asm/uaccess.h>
+static void sctp_v6_dst_saddr(union sctp_addr *addr, struct dst_entry *dst,
+            __be16 port);
+static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
+          const union sctp_addr *addr2);
 
 /* Event handler for inet6 address addition/deletion events.
  * The sctp_local_addr_list needs to be protocted by a spin lock since
@@ -242,8 +246,12 @@ static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
 					 union sctp_addr *daddr,
 					 union sctp_addr *saddr)
 {
-	struct dst_entry *dst;
+	struct dst_entry *dst = NULL;
 	struct flowi fl;
+	struct sctp_bind_addr *bp;
+	struct sctp_sockaddr_entry *laddr;
+	union sctp_addr dst_saddr;
+	sctp_scope_t scope;
 
 	memset(&fl, 0, sizeof(fl));
 	ipv6_addr_copy(&fl.fl6_dst, &daddr->v6.sin6_addr);
@@ -259,16 +267,70 @@ static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
 	}
 
 	dst = ip6_route_output(&init_net, NULL, &fl);
+	if (!asoc || saddr)
+		goto out;
+
+	bp = &asoc->base.bind_addr;
+	scope = sctp_scope(daddr);
+
 	if (!dst->error) {
+		/* Walk through the bind address list and look for a bind
+		 * address that matches the source address of the returned dst.
+		 */
+		sctp_v6_dst_saddr(&dst_saddr, dst, htons(bp->port));
+		rcu_read_lock();
+		list_for_each_entry_rcu(laddr, &bp->address_list, list) {
+			if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
+				continue;
+
+			/* Do not compare against v4 addrs */
+			if ((laddr->a.sa.sa_family == AF_INET6) &&
+			   (sctp_v6_cmp_addr(&dst_saddr, &laddr->a)))
+				goto out_unlock;
+		}
+		rcu_read_unlock();
+
+		/* None of the bound addresses match the source address of the
+		 * dst. So release it.
+		 */
+		dst_release(dst);
+		dst = NULL;
+	}
+
+	/* Walk through the bind address list and try to get a dst that
+	 * matches a bind address as the source address.
+	 */
+	rcu_read_lock();
+	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
+		if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
+			continue;
+
+		if ((laddr->a.sa.sa_family == AF_INET6) &&
+		    (scope <= sctp_scope(&laddr->a))) {
+			ipv6_addr_copy(&fl.fl6_src, &laddr->a.v6.sin6_addr);
+			dst = ip6_route_output(&init_net, NULL, &fl);
+			if (dst->error) {
+				dst_release(dst);
+				dst = NULL;
+			} else
+				break;
+		}
+	}
+
+out_unlock:
+	rcu_read_unlock();
+
+out:
+	if (dst) {
 		struct rt6_info *rt;
 		rt = (struct rt6_info *)dst;
 		SCTP_DEBUG_PRINTK("rt6_dst:%pI6 rt6_src:%pI6\n",
 			&rt->rt6i_dst.addr, &rt->rt6i_src.addr);
 		return dst;
-	}
-	SCTP_DEBUG_PRINTK("NO ROUTE\n");
-	dst_release(dst);
-	return NULL;
+	} else 
+		SCTP_DEBUG_PRINTK("NO ROUTE\n");
+
+	return dst;
 }
 
 /* Returns the number of consecutive initial bits that match in the 2 ipv6
@@ -289,13 +351,6 @@ static void sctp_v6_get_saddr(struct sctp_sock *sk,
 			      union sctp_addr *daddr,
 			      union sctp_addr *saddr)
 {
-	struct sctp_bind_addr *bp;
-	struct sctp_sockaddr_entry *laddr;
-	sctp_scope_t scope;
-	union sctp_addr *baddr = NULL;
-	__u8 matchlen = 0;
-	__u8 bmatchlen;
-
 	SCTP_DEBUG_PRINTK("%s: asoc:%p dst:%p daddr:%pI6 ",
 			  __func__, asoc, dst, &daddr->v6.sin6_addr);
 
@@ -310,38 +365,9 @@ static void sctp_v6_get_saddr(struct sctp_sock *sk,
 		return;
 	}
 
-	scope = sctp_scope(daddr);
-
-	bp = &asoc->base.bind_addr;
-
-	/* Go through the bind address list and find the best source address
-	 * that matches the scope of the destination address.
-	 */
-	rcu_read_lock();
-	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
-		if (!laddr->valid)
-			continue;
-		if ((laddr->state == SCTP_ADDR_SRC) &&
-		    (laddr->a.sa.sa_family == AF_INET6) &&
-		    (scope <= sctp_scope(&laddr->a))) {
-			bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
-			if (!baddr || (matchlen < bmatchlen)) {
-				baddr = &laddr->a;
-				matchlen = bmatchlen;
-			}
-		}
-	}
-
-	if (baddr) {
-		memcpy(saddr, baddr, sizeof(union sctp_addr));
-		SCTP_DEBUG_PRINTK("saddr: %pI6\n", &saddr->v6.sin6_addr);
-	} else {
-		printk(KERN_ERR "%s: asoc:%p Could not find a valid source "
-		       "address for the dest:%pI6\n",
-		       __func__, asoc, &daddr->v6.sin6_addr);
-	}
+	if (dst)
+		sctp_v6_dst_saddr(saddr, dst, htons(asoc->base.bind_addr.port));
 
-	rcu_read_unlock();
 }
 
 /* Make a copy of all potential local addresses. */
-- 
1.6.0.4


^ permalink raw reply related

* [PATCH 2/2] ppp_generic: handle non-linear skbs when passing them to pppd
From: Simon Arlott @ 2010-05-03 16:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, paulus, linux-ppp
In-Reply-To: <20100502.232711.256856995.davem@davemloft.net>

Frequently when using PPPoE with an interface MTU greater than 1500,
the skb is likely to be non-linear. If the skb needs to be passed to
pppd then the skb data must be read correctly.

The previous commit fixes an issue with accidentally sending skbs
to pppd based on an invalid read of the protocol type. When that
error occurred pppd was reading invalid skb data too.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
On 03/05/10 07:27, David Miller wrote:
> From: Simon Arlott <simon@fire.lp0.eu>
> Date: Fri, 30 Apr 2010 19:41:45 +0100
> 
>> Frequently when using PPPoE with an interface MTU greater than 1500,
>> the skb is likely to be non-linear. If the skb needs to be passed to
>> pppd then the skb must be linearised first.
> 
> Don't propagate stupidity.
> 
> The real problem is that ppp_read() can't handle non-linear SKBs, so
> fix that instead.  The easiest way to do that is to put a "struct
> iovec iov;" on ppp_read()'s stack, fill it in with ther user buffer
> pointer and length, then use that to call
> skb_copy_datagram_const_iovec().

I've updated it to use skb_copy_datagram_iovec():

 drivers/net/ppp_generic.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 75e8903..8518a2e 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -405,6 +405,7 @@ static ssize_t ppp_read(struct file *file, char __user *buf,
 	DECLARE_WAITQUEUE(wait, current);
 	ssize_t ret;
 	struct sk_buff *skb = NULL;
+	struct iovec iov;
 
 	ret = count;
 
@@ -448,7 +449,9 @@ static ssize_t ppp_read(struct file *file, char __user *buf,
 	if (skb->len > count)
 		goto outf;
 	ret = -EFAULT;
-	if (copy_to_user(buf, skb->data, skb->len))
+	iov.iov_base = buf;
+	iov.iov_len = count;
+	if (skb_copy_datagram_iovec(skb, 0, &iov, skb->len))
 		goto outf;
 	ret = skb->len;
 
-- 
1.7.0.4

-- 
Simon Arlott

^ permalink raw reply related

* Re: [RFC PATCH] sctp: fix sctp to work with ipv6 source address routing
From: Vlad Yasevich @ 2010-05-03 16:55 UTC (permalink / raw)
  To: Weixing.Shi, yjwei; +Cc: netdev
In-Reply-To: <1272905403-5978-1-git-send-email-vladislav.yasevich@hp.com>



Vlad Yasevich wrote:
> From: Weixing Shi <Weixing.Shi@windriver.com>
> 
> <vlad>
>   So I've updated this patch with Wei's and my own suggestings.  Here is an
>   update.  Please review and verify that in your environment this still works.
> </vlad>
> 
> in the below test case, using the source address routing,
> sctp can not work.
> Node-A
> 1)ifconfig eth0 inet6 add 2001:1::1/64
> 2)ip -6 rule add from 2001:1::1 table 100 pref 100
> 3)ip -6 route add 2001:2::1 dev eth0 table 100
> 4)sctp_darn -H 2001:1::1 -P 250 -l &
> Node-B
> 1)ifconfig eth0 inet6 add 2001:2::1/64
> 2)ip -6 rule add from 2001:2::1 table 100 pref 100
> 3)ip -6 route add 2001:1::1 dev eth0 table 100
> 4)sctp_darn -H 2001:2::1 -P 250 -h 2001:1::1 -p 250 -s
> 
> root cause:
> Node-A and Node-B use the source address routing, and
> at begining, source address will be NULL,sctp will
> search the  routing table by the destination address,
> because using the source address routing table, and
> the result dst_entry will be NULL.
> 
> solution:
> walk through the bind address list to get the source
> address and then lookup the routing table again to get
> the correct dst_entry.
> 
> Signed-off-by: Weixing Shi <Weixing.Shi@windriver.com>
> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
> ---
>  net/sctp/ipv6.c |  112 ++++++++++++++++++++++++++++++++++---------------------
>  1 files changed, 69 insertions(+), 43 deletions(-)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index 7326891..2a987a2 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -77,6 +77,10 @@
>  #include <net/sctp/sctp.h>
>  
>  #include <asm/uaccess.h>
> +static void sctp_v6_dst_saddr(union sctp_addr *addr, struct dst_entry *dst,
> +            __be16 port);
> +static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
> +          const union sctp_addr *addr2);
>  
>  /* Event handler for inet6 address addition/deletion events.
>   * The sctp_local_addr_list needs to be protocted by a spin lock since
> @@ -242,8 +246,12 @@ static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
>  					 union sctp_addr *daddr,
>  					 union sctp_addr *saddr)
>  {
> -	struct dst_entry *dst;
> +	struct dst_entry *dst = NULL;
>  	struct flowi fl;
> +	struct sctp_bind_addr *bp;
> +	struct sctp_sockaddr_entry *laddr;
> +	union sctp_addr dst_saddr;
> +	sctp_scope_t scope;
>  
>  	memset(&fl, 0, sizeof(fl));
>  	ipv6_addr_copy(&fl.fl6_dst, &daddr->v6.sin6_addr);
> @@ -259,16 +267,70 @@ static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
>  	}
>  
>  	dst = ip6_route_output(&init_net, NULL, &fl);
> +	if (!asoc || saddr)
> +		goto out;
> +

Ok, I don't know how that sneaked in.  I'll double-check and resend.

-vlad

> +	bp = &asoc->base.bind_addr;
> +	scope = sctp_scope(daddr);
> +
>  	if (!dst->error) {
> +		/* Walk through the bind address list and look for a bind
> +		 * address that matches the source address of the returned dst.
> +		 */
> +		sctp_v6_dst_saddr(&dst_saddr, dst, htons(bp->port));
> +		rcu_read_lock();
> +		list_for_each_entry_rcu(laddr, &bp->address_list, list) {
> +			if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
> +				continue;
> +
> +			/* Do not compare against v4 addrs */
> +			if ((laddr->a.sa.sa_family == AF_INET6) &&
> +			   (sctp_v6_cmp_addr(&dst_saddr, &laddr->a)))
> +				goto out_unlock;
> +		}
> +		rcu_read_unlock();
> +
> +		/* None of the bound addresses match the source address of the
> +		 * dst. So release it.
> +		 */
> +		dst_release(dst);
> +		dst = NULL;
> +	}
> +
> +	/* Walk through the bind address list and try to get a dst that
> +	 * matches a bind address as the source address.
> +	 */
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
> +		if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
> +			continue;
> +
> +		if ((laddr->a.sa.sa_family == AF_INET6) &&
> +		    (scope <= sctp_scope(&laddr->a))) {
> +			ipv6_addr_copy(&fl.fl6_src, &laddr->a.v6.sin6_addr);
> +			dst = ip6_route_output(&init_net, NULL, &fl);
> +			if (dst->error) {
> +				dst_release(dst);
> +				dst = NULL;
> +			} else
> +				break;
> +		}
> +	}
> +
> +out_unlock:
> +	rcu_read_unlock();
> +
> +out:
> +	if (dst) {
>  		struct rt6_info *rt;
>  		rt = (struct rt6_info *)dst;
>  		SCTP_DEBUG_PRINTK("rt6_dst:%pI6 rt6_src:%pI6\n",
>  			&rt->rt6i_dst.addr, &rt->rt6i_src.addr);
>  		return dst;
> -	}
> -	SCTP_DEBUG_PRINTK("NO ROUTE\n");
> -	dst_release(dst);
> -	return NULL;
> +	} else 
> +		SCTP_DEBUG_PRINTK("NO ROUTE\n");
> +
> +	return dst;
>  }
>  
>  /* Returns the number of consecutive initial bits that match in the 2 ipv6
> @@ -289,13 +351,6 @@ static void sctp_v6_get_saddr(struct sctp_sock *sk,
>  			      union sctp_addr *daddr,
>  			      union sctp_addr *saddr)
>  {
> -	struct sctp_bind_addr *bp;
> -	struct sctp_sockaddr_entry *laddr;
> -	sctp_scope_t scope;
> -	union sctp_addr *baddr = NULL;
> -	__u8 matchlen = 0;
> -	__u8 bmatchlen;
> -
>  	SCTP_DEBUG_PRINTK("%s: asoc:%p dst:%p daddr:%pI6 ",
>  			  __func__, asoc, dst, &daddr->v6.sin6_addr);
>  
> @@ -310,38 +365,9 @@ static void sctp_v6_get_saddr(struct sctp_sock *sk,
>  		return;
>  	}
>  
> -	scope = sctp_scope(daddr);
> -
> -	bp = &asoc->base.bind_addr;
> -
> -	/* Go through the bind address list and find the best source address
> -	 * that matches the scope of the destination address.
> -	 */
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
> -		if (!laddr->valid)
> -			continue;
> -		if ((laddr->state == SCTP_ADDR_SRC) &&
> -		    (laddr->a.sa.sa_family == AF_INET6) &&
> -		    (scope <= sctp_scope(&laddr->a))) {
> -			bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
> -			if (!baddr || (matchlen < bmatchlen)) {
> -				baddr = &laddr->a;
> -				matchlen = bmatchlen;
> -			}
> -		}
> -	}
> -
> -	if (baddr) {
> -		memcpy(saddr, baddr, sizeof(union sctp_addr));
> -		SCTP_DEBUG_PRINTK("saddr: %pI6\n", &saddr->v6.sin6_addr);
> -	} else {
> -		printk(KERN_ERR "%s: asoc:%p Could not find a valid source "
> -		       "address for the dest:%pI6\n",
> -		       __func__, asoc, &daddr->v6.sin6_addr);
> -	}
> +	if (dst)
> +		sctp_v6_dst_saddr(saddr, dst, htons(asoc->base.bind_addr.port));
>  
> -	rcu_read_unlock();
>  }
>  
>  /* Make a copy of all potential local addresses. */

^ permalink raw reply

* Re: OOP in ip_cmsg_recv (net-next)
From: Eric Dumazet @ 2010-05-03 17:04 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20100503094735.077c2af5@nehalam>

Le lundi 03 mai 2010 à 09:47 -0700, Stephen Hemminger a écrit :
> I am getting occasional NULL pointer references with net-next kernel.
> No test, just usual stuff (like DNS).
> 
> This is a new regression in net-next only.
> 
> 
> [  674.929685] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
> [  674.929691] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.929699] PGD 1bce2b067 PUD 1b80af067 PMD 0 
> [  674.929704] Oops: 0000 [#1] SMP 
> [  674.929708] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
> [  674.929712] CPU 2 
> [  674.929713] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
> [  674.929764] 
> [  674.929767] Pid: 4358, comm: dnsmasq Not tainted 2.6.34-rc6-net #121 P6T DELUXE/System Product Name
> [  674.929770] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.929776] RSP: 0018:ffff8801bce27ac8  EFLAGS: 00010246
> [  674.929778] RAX: 0000000000000000 RBX: ffff8801bde62500 RCX: 0000000000000000
> [  674.929781] RDX: ffff8801bce27e48 RSI: ffff8801bde62500 RDI: ffff8801bce27f18
> [  674.929784] RBP: ffff8801bce27b48 R08: 0000000000000640 R09: 0000000000000000
> [  674.929787] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bce27f18
> [  674.929789] R13: ffff8801bce27f18 R14: 0000000000000000 R15: ffff8801bdbe8850
> [  674.929793] FS:  00007fe37fbfd700(0000) GS:ffff880001e40000(0000) knlGS:0000000000000000
> [  674.929796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  674.929798] CR2: 0000000000000322 CR3: 00000001bce5c000 CR4: 00000000000006e0
> [  674.929801] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  674.929804] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  674.929807] Process dnsmasq (pid: 4358, threadinfo ffff8801bce26000, task ffff8801bda54560)
> [  674.929810] Stack:
> [  674.929811]  0000000000000134 000000000000012c ffff8801bce27b48 ffffffff813b065b
> [  674.929816] <0> ffff8801bce27b08 ffffffff8123ce8e ffff8801bdbe8800 ffff8801bce27dc8
> [  674.929821] <0> ffff8801bce27b18 ffffffff81464612 ffff8801bce27b48 000000005eba1e95
> [  674.929827] Call Trace:
> [  674.929834]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
> [  674.929840]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  674.929845]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
> [  674.929850]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
> [  674.929856]  [<ffffffff81045190>] ? default_wake_function+0x0/0x10
> [  674.929860]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
> [  674.929866]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
> [  674.929872]  [<ffffffff814388c0>] ? unix_dgram_sendmsg+0x5b0/0x630
> [  674.929878]  [<ffffffff81119e12>] ? link_path_walk+0x502/0xaf0
> [  674.929882]  [<ffffffff813a3728>] ? sock_aio_write+0x138/0x150
> [  674.929888]  [<ffffffff810ca88d>] ? find_get_page+0x1d/0xc0
> [  674.929892]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
> [  674.929897]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
> [  674.929902]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
> [  674.929908]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
> [  674.929910] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
> [  674.929955] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.929959]  RSP <ffff8801bce27ac8>
> [  674.929961] CR2: 0000000000000322
> [  674.929964] ---[ end trace 443be32e81365554 ]---
> [  674.929966] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
> [  674.929972] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.929979] PGD 1bb9c7067 PUD 1bd5d3067 PMD 0 
> [  674.929985] Oops: 0000 [#2] SMP 
> [  674.929989] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
> [  674.929994] CPU 7 
> [  674.929997] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
> [  674.930067] 
> [  674.930072] Pid: 4525, comm: dnsmasq Tainted: G      D    2.6.34-rc6-net #121 P6T DELUXE/System Product Name
> [  674.930077] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.930084] RSP: 0018:ffff8801bcf03ac8  EFLAGS: 00010246
> [  674.930088] RAX: 0000000000000000 RBX: ffff8801b746c500 RCX: 0000000000000000
> [  674.930092] RDX: ffff8801bcf03e48 RSI: ffff8801b746c500 RDI: ffff8801bcf03f18
> [  674.930097] RBP: ffff8801bcf03b48 R08: 0000000000000640 R09: 0000000000000000
> [  674.930101] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bcf03f18
> [  674.930105] R13: ffff8801bcf03f18 R14: 0000000000000000 R15: ffff8801bd430850
> [  674.930110] FS:  00007f42211eb700(0000) GS:ffff880001ee0000(0000) knlGS:0000000000000000
> [  674.930114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  674.930118] CR2: 0000000000000322 CR3: 00000001bb96b000 CR4: 00000000000006e0
> [  674.930122] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  674.930127] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  674.930132] Process dnsmasq (pid: 4525, threadinfo ffff8801bcf02000, task ffff8801bd52ae40)
> [  674.930135] Stack:
> [  674.930137]  0000000000000134 000000000000012c ffff8801bcf03b48 ffffffff813b065b
> [  674.930144] <0> ffff8801bcf03b08 ffffffff8123ce8e ffff8801bd430800 ffff8801bcf03dc8
> [  674.930152] <0> ffff8801bcf03b18 ffffffff81464612 ffff8801bcf03b48 0000000003fe9d95
> [  674.930160] Call Trace:
> [  674.930167]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
> [  674.930174]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  674.930180]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
> [  674.930187]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
> [  674.930193]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
> [  674.930199]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
> [  674.930206]  [<ffffffff814388c0>] ? unix_dgram_sendmsg+0x5b0/0x630
> [  674.930212]  [<ffffffff8123cf34>] ? do_raw_spin_lock+0x54/0x150
> [  674.930218]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
> [  674.930224]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
> [  674.930231]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
> [  674.930238]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
> [  674.930241] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
> [  674.930307] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.930313]  RSP <ffff8801bcf03ac8>
> [  674.930315] CR2: 0000000000000322
> [  674.930319] ---[ end trace 443be32e81365555 ]---
> [  674.930322] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
> [  674.930327] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.930332] PGD 1b97f1067 PUD 1bb827067 PMD 0 
> [  674.930338] Oops: 0000 [#3] SMP 
> [  674.930341] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
> [  674.930345] CPU 3 
> [  674.930347] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
> [  674.930396] 
> [  674.930401] Pid: 4561, comm: dnsmasq Tainted: G      D    2.6.34-rc6-net #121 P6T DELUXE/System Product Name
> [  674.930405] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.930413] RSP: 0018:ffff8801bcd95ac8  EFLAGS: 00010246
> [  674.930417] RAX: 0000000000000000 RBX: ffff8801b746cb00 RCX: 0000000000000000
> [  674.930421] RDX: ffff8801bcd95e48 RSI: ffff8801b746cb00 RDI: ffff8801bcd95f18
> [  674.930425] RBP: ffff8801bcd95b48 R08: 0000000000000640 R09: 0000000000000000
> [  674.930429] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bcd95f18
> [  674.930433] R13: ffff8801bcd95f18 R14: 0000000000000000 R15: ffff8801b6bf8c50
> [  674.930439] FS:  00007fc947627700(0000) GS:ffff880001e60000(0000) knlGS:0000000000000000
> [  674.930443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  674.930447] CR2: 0000000000000322 CR3: 00000001b9654000 CR4: 00000000000006e0
> [  674.930451] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  674.930455] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  674.930460] Process dnsmasq (pid: 4561, threadinfo ffff8801bcd94000, task ffff8801bd5b1720)
> [  674.930464] Stack:
> [  674.930466]  0000000000000134 000000000000012c ffff8801bcd95b48 ffffffff813b065b
> [  674.930473] <0> ffff8801bcd95b08 ffffffff8123ce8e ffff8801b6bf8c00 ffff8801bcd95dc8
> [  674.930481] <0> ffff8801bcd95b18 ffffffff81464612 ffff8801bcd95b48 000000008ae6d276
> [  674.930490] Call Trace:
> [  674.930496]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
> [  674.930503]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  674.930509]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
> [  674.930516]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
> [  674.930522]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
> [  674.930529]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
> [  674.930537]  [<ffffffff810704e2>] ? finish_wait+0x62/0x80
> [  674.930543]  [<ffffffff814623f3>] ? __wait_on_bit_lock+0x73/0xb0
> [  674.930550]  [<ffffffff81070390>] ? wake_bit_function+0x0/0x40
> [  674.930556]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
> [  674.930562]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
> [  674.930569]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
> [  674.930576]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
> [  674.930579] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
> [  674.930636] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.930641]  RSP <ffff8801bcd95ac8>
> [  674.930642] CR2: 0000000000000322
> [  674.930645] ---[ end trace 443be32e81365556 ]---
> [  674.930647] BUG: unable to handle kernel NULL pointer dereference at 0000000000000322
> [  674.930653] IP: [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.930660] PGD 1bcdbc067 PUD 1bbc3c067 PMD 0 
> [  674.930666] Oops: 0000 [#4] SMP 
> [  674.930669] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:08/ATK0110:00/hwmon/hwmon0/temp2_label
> [  674.930672] CPU 4 
> [  674.930673] Modules linked in: autofs4 binfmt_misc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm radeon ttm drm_kms_helper drm i2c_algo_bit snd_hda_codec_analog ipv6 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd asus_atk0110 soundcore psmouse snd_page_alloc serio_raw usbhid mvsas libsas floppy scsi_transport_sas sky2 e1000e
> [  674.930712] 
> [  674.930715] Pid: 4488, comm: dnsmasq Tainted: G      D    2.6.34-rc6-net #121 P6T DELUXE/System Product Name
> [  674.930718] RIP: 0010:[<ffffffff813e97c1>]  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.930723] RSP: 0018:ffff8801bcd93ac8  EFLAGS: 00010246
> [  674.930725] RAX: 0000000000000000 RBX: ffff8801b746cf00 RCX: 0000000000000000
> [  674.930727] RDX: ffff8801bcd93e48 RSI: ffff8801b746cf00 RDI: ffff8801bcd93f18
> [  674.930730] RBP: ffff8801bcd93b48 R08: 0000000000000640 R09: 0000000000000000
> [  674.930732] R10: 0000000000000020 R11: 0000000000000246 R12: ffff8801bcd93f18
> [  674.930735] R13: ffff8801bcd93f18 R14: 0000000000000000 R15: ffff8801b6bf8450
> [  674.930738] FS:  00007f4ccbd68700(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000
> [  674.930741] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  674.930743] CR2: 0000000000000322 CR3: 00000001bb81d000 CR4: 00000000000006e0
> [  674.930745] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  674.930748] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  674.930751] Process dnsmasq (pid: 4488, threadinfo ffff8801bcd92000, task ffff8801bde2dc80)
> [  674.930753] Stack:
> [  674.930754]  0000000000000134 000000000000012c ffff8801bcd93b48 ffffffff813b065b
> [  674.930758] <0> ffff8801bcd93b08 ffffffff8123ce8e ffff8801b6bf8400 ffff8801bcd93dc8
> [  674.930763] <0> ffff8801bcd93b18 ffffffff81464612 ffff8801bcd93b48 00000000d5628d65
> [  674.930768] Call Trace:
> [  674.930773]  [<ffffffff813b065b>] ? skb_copy_datagram_iovec+0x5b/0x2c0
> [  674.930778]  [<ffffffff8123ce8e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  674.930783]  [<ffffffff81464612>] ? _raw_spin_unlock_bh+0x12/0x20
> [  674.930787]  [<ffffffff8140cf01>] udp_recvmsg+0x291/0x2b0
> [  674.930792]  [<ffffffff8141403a>] inet_recvmsg+0x4a/0x80
> [  674.930796]  [<ffffffff813a3d2b>] sock_recvmsg+0xeb/0x120
> [  674.930801]  [<ffffffff814388c0>] ? unix_dgram_sendmsg+0x5b0/0x630
> [  674.930806]  [<ffffffff81119e12>] ? link_path_walk+0x502/0xaf0
> [  674.930810]  [<ffffffff813a3728>] ? sock_aio_write+0x138/0x150
> [  674.930815]  [<ffffffff810ca88d>] ? find_get_page+0x1d/0xc0
> [  674.930819]  [<ffffffff813af8a3>] ? verify_iovec+0x93/0x100
> [  674.930823]  [<ffffffff813a52bc>] __sys_recvmsg+0x14c/0x2d0
> [  674.930828]  [<ffffffff813a56f4>] sys_recvmsg+0x44/0x80
> [  674.930833]  [<ffffffff81008f42>] system_call_fastpath+0x16/0x1b
> [  674.930835] Code: c4 80 48 89 5d e0 4c 89 6d f0 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 4c 89 65 e8 4c 89 75 f8 49 89 fd 48 8b 46 18 48 89 f3 <44> 0f b7 a0 22 03 00 00 41 f6 c4 01 74 4b 48 8b 46 58 8b 96 c4 
> [  674.930880] RIP  [<ffffffff813e97c1>] ip_cmsg_recv+0x31/0x2d0
> [  674.930884]  RSP <ffff8801bcd93ac8>
> [  674.930886] CR2: 0000000000000322
> [  674.930889] ---[ end trace 443be32e81365557 ]---

Hmm, skb->sk is NULL

void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
{
	struct inet_sock *inet = inet_sk(skb->sk);
	unsigned flags = inet->cmsg_flags; // CRASH


So a skb_free_datagram_locked() is at fault here...

commit 4b0b72f7dd617b13abd1b04c947e15873e011a24 probably

OK, the skb_orphan() should not be done at this point, if we are not the
only user (and last user)

Oh well, sorry for the regression ;)


diff --git a/net/core/datagram.c b/net/core/datagram.c
index 95b851f..88949b0 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -230,12 +230,8 @@ EXPORT_SYMBOL(skb_free_datagram);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
 	lock_sock_bh(sk);
-	skb_orphan(skb);
-	sk_mem_reclaim_partial(sk);
+	skb_free_datagram(sk, skb);
 	unlock_sock_bh(sk);
-
-	/* skb is now orphaned, might be freed outside of locked section */
-	consume_skb(skb);
 }
 EXPORT_SYMBOL(skb_free_datagram_locked);
 





^ permalink raw reply related

* [RFC] network driver skb allocations
From: Eric Dumazet @ 2010-05-03 17:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Some performance idea about drivers and skb allocations :

-----------------------------------------------------------------------------------------------------------
   PerfTop:     954 irqs/sec  kernel:99.5% [1000Hz cycles],  (all, cpu: 0)
-----------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ _________________

             2378.00 16.3% __alloc_skb                   [kernel.kallsyms]
             1962.00 13.5% eth_type_trans                [kernel.kallsyms]
             1472.00 10.1% __kmalloc_track_caller        [kernel.kallsyms]
              991.00  6.8% __slab_alloc                  [kernel.kallsyms]
              938.00  6.4% _raw_spin_lock                [kernel.kallsyms]
              914.00  6.3% __netdev_alloc_skb            [kernel.kallsyms]
              876.00  6.0% kmem_cache_alloc              [kernel.kallsyms]
              519.00  3.6% tg3_poll_work                 [kernel.kallsyms]
              416.00  2.9% tg3_read32                    [kernel.kallsyms]
              394.00  2.7% get_rps_cpu                   [kernel.kallsyms]

Current logic for drivers is to :

allocate skbs (sk_buff + data) and put them in a ring buffer.

When rx interrupt comes, get the skb and give it to stack.

Allocate a new skb (sk_buff + data) and put it in rx fat ring buffer (511 entries for tg3 )

This is suboptimal, because sk_buff will probably be cold 512 rx later...
Also, NUMA info might be wrong : sk_buff should be allocated on current node,
not on the device preferred node.

Drivers should allocate only the data part for NIC, and at the time of interrupt,
allocate the skb_buff and link it to buffer filled by NIC.

With a prefetch(first_cache_line_of_data) before doing sk_buff allocation and init,
eth_type_trans() / get_rps_cpus() would be much faster.







^ permalink raw reply

* [RFC PATCH v2] sctp: fix sctp to work with ipv6 source address routing
From: Vlad Yasevich @ 2010-05-03 17:07 UTC (permalink / raw)
  To: Weixing.Shi, yjwei; +Cc: netdev, Vlad Yasevich
In-Reply-To: <4BDEFFEF.2090200@hp.com>

From: Weixing Shi <Weixing.Shi@windriver.com>

<vlad>
Ok, updated to be a bit more correct.  Only leave the function early if the
first lookup succeeds.
</vlad>

in the below test case, using the source address routing,
sctp can not work.
Node-A
1)ifconfig eth0 inet6 add 2001:1::1/64
2)ip -6 rule add from 2001:1::1 table 100 pref 100
3)ip -6 route add 2001:2::1 dev eth0 table 100
4)sctp_darn -H 2001:1::1 -P 250 -l &
Node-B
1)ifconfig eth0 inet6 add 2001:2::1/64
2)ip -6 rule add from 2001:2::1 table 100 pref 100
3)ip -6 route add 2001:1::1 dev eth0 table 100
4)sctp_darn -H 2001:2::1 -P 250 -h 2001:1::1 -p 250 -s

root cause:
Node-A and Node-B use the source address routing, and
at begining, source address will be NULL,sctp will
search the  routing table by the destination address,
because using the source address routing table, and
the result dst_entry will be NULL.

solution:
walk through the bind address list to get the source
address and then lookup the routing table again to get
the correct dst_entry.

Signed-off-by: Weixing Shi <Weixing.Shi@windriver.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 net/sctp/ipv6.c |  113 ++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 70 insertions(+), 43 deletions(-)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 7326891..f75e2f8 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -77,6 +77,10 @@
 #include <net/sctp/sctp.h>
 
 #include <asm/uaccess.h>
+static void sctp_v6_dst_saddr(union sctp_addr *addr, struct dst_entry *dst,
+            __be16 port);
+static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
+          const union sctp_addr *addr2);
 
 /* Event handler for inet6 address addition/deletion events.
  * The sctp_local_addr_list needs to be protocted by a spin lock since
@@ -242,8 +246,12 @@ static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
 					 union sctp_addr *daddr,
 					 union sctp_addr *saddr)
 {
-	struct dst_entry *dst;
+	struct dst_entry *dst = NULL;
 	struct flowi fl;
+	struct sctp_bind_addr *bp;
+	struct sctp_sockaddr_entry *laddr;
+	union sctp_addr dst_saddr;
+	sctp_scope_t scope;
 
 	memset(&fl, 0, sizeof(fl));
 	ipv6_addr_copy(&fl.fl6_dst, &daddr->v6.sin6_addr);
@@ -259,16 +267,71 @@ static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
 	}
 
 	dst = ip6_route_output(&init_net, NULL, &fl);
+
+	bp = &asoc->base.bind_addr;
+	scope = sctp_scope(daddr);
+
 	if (!dst->error) {
+		if (!asoc || saddr)
+			goto out;
+
+		/* Walk through the bind address list and look for a bind
+		 * address that matches the source address of the returned dst.
+		 */
+		sctp_v6_dst_saddr(&dst_saddr, dst, htons(bp->port));
+		rcu_read_lock();
+		list_for_each_entry_rcu(laddr, &bp->address_list, list) {
+			if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
+				continue;
+
+			/* Do not compare against v4 addrs */
+			if ((laddr->a.sa.sa_family == AF_INET6) &&
+			   (sctp_v6_cmp_addr(&dst_saddr, &laddr->a)))
+				goto out_unlock;
+		}
+		rcu_read_unlock();
+
+		/* None of the bound addresses match the source address of the
+		 * dst. So release it.
+		 */
+		dst_release(dst);
+		dst = NULL;
+	}
+
+	/* Walk through the bind address list and try to get a dst that
+	 * matches a bind address as the source address.
+	 */
+	rcu_read_lock();
+	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
+		if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
+			continue;
+
+		if ((laddr->a.sa.sa_family == AF_INET6) &&
+		    (scope <= sctp_scope(&laddr->a))) {
+			ipv6_addr_copy(&fl.fl6_src, &laddr->a.v6.sin6_addr);
+			dst = ip6_route_output(&init_net, NULL, &fl);
+			if (dst->error) {
+				dst_release(dst);
+				dst = NULL;
+			} else
+				break;
+		}
+	}
+
+out_unlock:
+	rcu_read_unlock();
+
+out:
+	if (dst) {
 		struct rt6_info *rt;
 		rt = (struct rt6_info *)dst;
 		SCTP_DEBUG_PRINTK("rt6_dst:%pI6 rt6_src:%pI6\n",
 			&rt->rt6i_dst.addr, &rt->rt6i_src.addr);
 		return dst;
-	}
-	SCTP_DEBUG_PRINTK("NO ROUTE\n");
-	dst_release(dst);
-	return NULL;
+	} else 
+		SCTP_DEBUG_PRINTK("NO ROUTE\n");
+
+	return dst;
 }
 
 /* Returns the number of consecutive initial bits that match in the 2 ipv6
@@ -289,13 +352,6 @@ static void sctp_v6_get_saddr(struct sctp_sock *sk,
 			      union sctp_addr *daddr,
 			      union sctp_addr *saddr)
 {
-	struct sctp_bind_addr *bp;
-	struct sctp_sockaddr_entry *laddr;
-	sctp_scope_t scope;
-	union sctp_addr *baddr = NULL;
-	__u8 matchlen = 0;
-	__u8 bmatchlen;
-
 	SCTP_DEBUG_PRINTK("%s: asoc:%p dst:%p daddr:%pI6 ",
 			  __func__, asoc, dst, &daddr->v6.sin6_addr);
 
@@ -310,38 +366,9 @@ static void sctp_v6_get_saddr(struct sctp_sock *sk,
 		return;
 	}
 
-	scope = sctp_scope(daddr);
-
-	bp = &asoc->base.bind_addr;
-
-	/* Go through the bind address list and find the best source address
-	 * that matches the scope of the destination address.
-	 */
-	rcu_read_lock();
-	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
-		if (!laddr->valid)
-			continue;
-		if ((laddr->state == SCTP_ADDR_SRC) &&
-		    (laddr->a.sa.sa_family == AF_INET6) &&
-		    (scope <= sctp_scope(&laddr->a))) {
-			bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
-			if (!baddr || (matchlen < bmatchlen)) {
-				baddr = &laddr->a;
-				matchlen = bmatchlen;
-			}
-		}
-	}
-
-	if (baddr) {
-		memcpy(saddr, baddr, sizeof(union sctp_addr));
-		SCTP_DEBUG_PRINTK("saddr: %pI6\n", &saddr->v6.sin6_addr);
-	} else {
-		printk(KERN_ERR "%s: asoc:%p Could not find a valid source "
-		       "address for the dest:%pI6\n",
-		       __func__, asoc, &daddr->v6.sin6_addr);
-	}
+	if (dst)
+		sctp_v6_dst_saddr(saddr, dst, htons(asoc->base.bind_addr.port));
 
-	rcu_read_unlock();
 }
 
 /* Make a copy of all potential local addresses. */
-- 
1.6.0.4


^ permalink raw reply related

* Re: OOP in ip_cmsg_recv (net-next)
From: Eric Dumazet @ 2010-05-03 17:21 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <1272906266.2226.77.camel@edumazet-laptop>

Le lundi 03 mai 2010 à 19:04 +0200, Eric Dumazet a écrit :
> Le lundi 03 mai 2010 à 09:47 -0700, Stephen Hemminger a écrit :
> > I am getting occasional NULL pointer references with net-next kernel.
> > No test, just usual stuff (like DNS).
> > 
> > This is a new regression in net-next only.
> > 
> > 
> Hmm, skb->sk is NULL
> 
> void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
> {
> 	struct inet_sock *inet = inet_sk(skb->sk);
> 	unsigned flags = inet->cmsg_flags; // CRASH
> 
> 
> So a skb_free_datagram_locked() is at fault here...
> 
> commit 4b0b72f7dd617b13abd1b04c947e15873e011a24 probably
> 
> OK, the skb_orphan() should not be done at this point, if we are not the
> only user (and last user)
> 
> Oh well, sorry for the regression ;)
> 

I'll test following patch and report results to netdev :

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 95b851f..e009753 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -229,13 +229,18 @@ EXPORT_SYMBOL(skb_free_datagram);
 
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return;
+
 	lock_sock_bh(sk);
 	skb_orphan(skb);
 	sk_mem_reclaim_partial(sk);
 	unlock_sock_bh(sk);
 
-	/* skb is now orphaned, might be freed outside of locked section */
-	consume_skb(skb);
+	/* skb is now orphaned, can be freed outside of locked section */
+	__kfree_skb(skb);
 }
 EXPORT_SYMBOL(skb_free_datagram_locked);
 



^ permalink raw reply related

* Re: mmotm 2010-04-28 - RCU whinges
From: Paul E. McKenney @ 2010-05-03 18:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Valdis.Kletnieks, Andrew Morton, Peter Zijlstra, Patrick McHardy,
	David S. Miller, linux-kernel, netfilter-devel, netdev
In-Reply-To: <1272903293.2226.50.camel@edumazet-laptop>

On Mon, May 03, 2010 at 06:14:53PM +0200, Eric Dumazet wrote:
> Le lundi 03 mai 2010 à 08:43 -0700, Paul E. McKenney a écrit :
> 
> > Highly recommended.  ;-)
> > 
> > And thanks to you for your testing efforts and to Eric for the fixes!!!
> > 
> 
> For this last one, I think you should push following patch Paul

I would be happy to if I could find the commit creating
hlist_for_each_entry_continue_rcu()...

I do see a ca. 2008 patch from Stephen Hemminger:

	http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg264661.html

According to http://patchwork.ozlabs.org/patch/47997/, this is
going up the networking tree as of March 18, 2010.

So I would be happy to push the patch below, but to do so, I will need
to adopt the portion of Stephen's patch that created this primitive.

Please let me know how you would like to proceed!

							Thanx, Paul

> Followup of commit 3120438ad6
> (rcu: Disable lockdep checking in RCU list-traversal primitives)
> 
> Or we might introduce a hlist_for_each_entry_continue_rcu_bh() macro...
> 
> 
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 004908b..b0c7e24 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -435,10 +435,10 @@ static inline void hlist_add_after_rcu(struct hlist_node *prev,
>   * @member:	the name of the hlist_node within the struct.
>   */
>  #define hlist_for_each_entry_continue_rcu(tpos, pos, member)		\
> -	for (pos = rcu_dereference((pos)->next);			\
> +	for (pos = rcu_dereference_raw((pos)->next);			\
>  	     pos && ({ prefetch(pos->next); 1; }) &&			\
>  	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });  \
> -	     pos = rcu_dereference(pos->next))
> +	     pos = rcu_dereference_raw(pos->next))
> 
> 
>  #endif	/* __KERNEL__ */
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Receive issues with bonding and vlans
From: Jay Vosburgh @ 2010-05-03 18:25 UTC (permalink / raw)
  To: John Fastabend
  Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek,
	Patrick McHardy, bonding-devel@lists.sourceforge.net
In-Reply-To: <4BDE3D3B.3030800@intel.com>

John Fastabend <john.r.fastabend@intel.com> wrote:

>Jay Vosburgh wrote:
>> Chris Leech <christopher.leech@intel.com> wrote:
>>
>>> On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote:
>>>> 	Is the FCoE supposed to run over the inactive bonding slave?  Or
>>>> am I misunderstanding what you're saying?  I had thought the LLDP, et
>>>> al, special case in bonding was to permit, essentially, path discovery,
>>>> not necessarily active use of the inactive slave.
>>> That's what I'm trying to do, yes.  Mostly because it's a setup that
>>> would work if you removed the FCoE traffic from the network data path,
>>> and only converged at the driver level and below.  It's possible that
>>> the answer is "don't do that".
>>
>> 	So, basically, you want the bond to act like usual for "regular"
>> ethernet traffic, but act like the slaves are independent from the bond
>> for the magic FCoE traffic, right?
>>
>> 	I'm not really sure if that's a "don't do that" or not.
>>
>>>> 	Not that this is necessarily bad; the "drop stuff on inactive
>>>> slaves" is really there for duplicate suppression, but it also can
>>>> uncover network topology issues, e.g., network layouts that won't work
>>>> if the devices fail, but appear to work during testing because the
>>>> "inactive" slave still receives traffic (it hasn't really failed).
>>>>
>>>>> The problem is that it doesn't work for hardware accelerated VLAN
>>>>> devices, because the VLAN receive paths have their own
>>>>> skb_bond_should_drop calls that were not updated.
>>>>>
>>>> >From what I can tell, VLAN receives always end up going through
>>>>> netif_receive_skb anyway, so skb_bond_should_drop gets called twice if
>>>>> the frame isn't dropped the first time.  I think the bonding checks in
>>>>> __vlan_hwaccel_rx and vlan_gro_common should just be removed.
>>>> 	I'm not so sure.  The checks in __vlan_hwaccel_rx are done with
>>>> the original receiving device in skb->dev; by the time the packet gets
>>>> to netif_receive_skb, the original slave the packet was received on has
>>>> been lost (and replaced with the VLAN device).  Various things are
>>>> interested in that, in particular the "arp_validate" and the "inactive
>>>> slave drop" logic for bonding depend on knowing the real device the
>>>> packet arrived on.
>>>>
>>>> 	I note that the vlan accel logic doesn't change skb_iif to the
>>>> VLAN device; it remains as the original device.  I suppose one
>>>> alternative would be to convert the bonding drop, et al, logic to use
>>>> skb_iif instead of skb->dev; if that works, then I think the VLAN core
>>>> would not need to call skb_bond_should_drop, which in turn would be a
>>>> bit more complicated as it would have to look up the dev from the
>>>> skb_iif.  There's already some code in bonding that takes advantage of
>>>> this property of the VLANs, so maybe this is the way to go.
>>> Thanks, I'll take another look and see if I can come up with something
>>> better.
>>
>> 	I looked at the skb_bond_should_drop stuff a bit more after I
>> wrote that; it's not as easy as I had suspected.  The big sticking point
>> is that currently the test in netif_receive_skb (now __netif_receive_skb
>> in net-next-2.6) is on skb->dev->master to identify packets arriving on
>> slaves of bonding.  The VLAN skb->dev has ->master set to NULL.  Doing
>> that test against skb->skb_iif would be much more expensive, as it would
>> require a device lookup for every packet.
>>
>> 	So, I suspect that something has to happen in the VLAN
>> acceleration path, although I don't know exactly what.  I don't know if
>> it would be possible to flag the packets in some special way to indicate
>> that they're "bonding slave" packets, or if it's better to keep the
>> current structure and just fix the calls somehow.
>>
>> 	-J
>>
>>
>
>Jay,
>
>It should be OK to allow packets to be received on the VLAN if it is not
>explicitly in the bond?

	Lemme see if I have this straight, because all of these special
cases are making my brain hurt.  This one is for a configuration like this:

	bond0 ----- eth0
		   /
	vlan.xxx -/

	I.e., a VLAN configured directly atop an ethernet device, said
ethernet also being a slave to bonding.  Is that correct?

	Extrapolating from the ASCII art in a prior message in this
discussion, would this configuration really be something like this:

	vlan.xxx -\
		   \
	bond0 ----- eth1
	bond0 ----- eth0
		   /
	vlan.xxx -/

	I.e., two slaves to bonding, with VLAN xxx configured atop both
of the slaves?  Or would the eth0 and eth1 use discrete VLAN ids?  The
reason I ask is in regards to duplicate suppression.  The whole reason
the "inactive" slave drops (most) incoming packets is to eliminate
duplicates when the switch floods traffic to both slave ports.

	This is a bit tricky, because it's not really about broadcasts /
multicasts so much, but about traffic that the switch sends to all ports
because the switch's MAC address table isn't up to date with the
destination MAC of the traffic (which is a transient condition, so this
would only happen for perhaps one second or so).  That would result in
duplicate unicast packets being received by the bond (back in the day
before bonding had the "drop inactive traffic" logic).

	So if the same VLAN is configured atop the two slaves, I wonder
if that will open a window for the duplicate unicast packet problem.

	If the VLAN ids are different, then I'll assume this is some
kind of device mapper magic, doing the load balancing elsewhere.

>Or if we want to be more paranoid deliver packets only to handlers with
>exact matches for the device. For non vlan devices we deliver skb's to
>packet handlers that match exactly even on inactive slaves so doing this
>on vlan devices as well makes sense and shouldn't cause any unexpected
>problems.

	Yah, the whole concept of "inactive" slave is pretty mutated
now; it's kind of become the "active-backup with semi-manual load
balance for clever protocols, oh, and duplicate suppression" mode.

>Also on a somewhat unrelated note I suspect null_or_orig and null_or_bond
>are not working as expected in __netif_receive_skb().  At least the
>comment 'deliver only exact match' could be inaccurate.

	I don't think this is unrelated at all.  This code (the packet
to device lookup stuff in __netif_receive_skb) has been modified pretty
extensively lately for various bonding-related special cases, and I
think it is getting hard to follow.  Whatever comments are there need to
be accurate, and, honestly, I think this code needs comments to explain
what exactly is supposed to happen for these special cases.

>Here's a patch to illustrate what I'm thinking compile tested only.  If
>this sounds reasonable I'll work up an official patch.
>
>
>[PATCH] net: allow vlans on bonded real net_devices
>
>For converged I/O it is reasonable to use dm_multipathing to provice
>failover and load balancing for storage traffic and then use bonding
>for the LAN failover and load balancing.
>
>Currently this works if the multipathed devices are using the real
>net_device and those devices are enslaved to a bonded device what
>does not work is creating a vlan on the real device and trying to
>use it outside the bond for multipathing.
>
>This patch adds logic so that if the skb is destined for a vlan
>that is not in the bond the skb will not be dropped.
>
>Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>---
>
> net/8021q/vlan_core.c |   31 +++++++++++++++++++++----------
> net/core/dev.c        |   11 ++++++++---
> 2 files changed, 29 insertions(+), 13 deletions(-)
>
>diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
>index c584a0a..3bce0c3 100644
>--- a/net/8021q/vlan_core.c
>+++ b/net/8021q/vlan_core.c
>@@ -8,18 +8,24 @@
> int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
> 		      u16 vlan_tci, int polling)
> {
>+	struct net_device *vlan_dev;
>+
> 	if (netpoll_rx(skb))
> 		return NET_RX_DROP;
>
>-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>+	vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>+
>+	if (!vlan_dev)
>+		goto drop;
>+
>+	if ((vlan_dev->priv_flags & IFF_BONDING ||
>+	    vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) &&
>+	    skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))

	I'm not sure this will do the right thing if the VLAN device
itself is a slave to bonding, e.g., bond0 ---> vlan.xxx ---> eth0.  In
that case, eth0's dev->master is NULL, and the vlan_dev (vlan.xxx's dev)
doesn't have IFF_MASTER (but does have IFF_SLAVE and IFF_BONDING, I
believe).

	I think this will result in all incoming traffic being accepted
on such a configuration (leading to duplicates, as described above).

	I suspect, but have not tested, that something like this might
do what you're looking for:

	if ((vlan_dev->priv_flags & IFF_BONDING ||
	    vlan_dev_real_dev(vlan_dev)->flags & (IFF_MASTER | IFF_SLAVE)) &&
	    skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))

	I.e., if the VLAN device is either a MASTER (configured above
the bond) or a slave (configured below the bond) do the duplicate
suppresion.

> 		goto drop;
>
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
>-	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>-
>-	if (!skb->dev)
>-		goto drop;
>+	skb->dev = vlan_dev;
>
> 	return (polling ? netif_receive_skb(skb) : netif_rx(skb));
>
>@@ -82,16 +88,21 @@ vlan_gro_common(struct napi_struct *napi, struct
>vlan_group *grp,
> 		unsigned int vlan_tci, struct sk_buff *skb)
> {
> 	struct sk_buff *p;
>+	struct net_device *vlan_dev;
>
>-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>+	vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>+
>+	if (!vlan_dev)
>+		goto drop;
>+
>+	if ((vlan_dev->priv_flags & IFF_BONDING ||
>+	    vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) &&
>+	    skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
> 		goto drop;
>
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
>-	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>-
>-	if (!skb->dev)
>-		goto drop;
>+	skb->dev = vlan_dev;
>
> 	for (p = napi->gro_list; p; p = p->next) {
> 		NAPI_GRO_CB(p)->same_flow =
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 100dcbd..9ea4550 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -2780,6 +2780,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 	struct net_device *master;
> 	struct net_device *null_or_orig;
> 	struct net_device *null_or_bond;
>+	struct net_device *real_dev;
> 	int ret = NET_RX_DROP;
> 	__be16 type;
>
>@@ -2853,9 +2854,13 @@ ncls:
> 	 * handler may have to adjust skb->dev and orig_dev.
> 	 */
> 	null_or_bond = NULL;
>-	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
>-	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
>-		null_or_bond = vlan_dev_real_dev(skb->dev);
>+	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN)) {
>+		real_dev = vlan_dev_real_dev(skb->dev);
>+		if (real_dev->priv_flags & IFF_BONDING)
>+			null_or_bond = vlan_dev_real_dev(skb->dev);
>+		if (!(skb->dev->priv_flags & IFF_BONDING) &&
>+		    real_dev->priv_flags & IFF_SLAVE_INACTIVE)
>+			null_or_orig = skb->dev;
> 	}
>
> 	type = skb->protocol;

	Is this another way of accomplishing what I had suggested at the
end of this message:

http://marc.info/?l=linux-netdev&m=127111386702765&w=2

	The patch part of my suggestion was as follows:

>diff --git a/net/core/dev.c b/net/core/dev.c
>index b98ddc6..cc665bb 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -2735,7 +2735,7 @@ ncls:
> 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
> 		if (ptype->type == type && (ptype->dev == null_or_orig ||
> 		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
>-		     ptype->dev == null_or_bond)) {
>+		     (null_or_bond && (ptype->dev == null_or_bond))) {
> 			if (pt_prev)
> 				ret = deliver_skb(skb, pt_prev, orig_dev);
> 			pt_prev = ptype;
>
>
>	I haven't tested this, but the theory is to only test against
>null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN
>traffic over bonding.

	Chris Leech said "that should do it" but I don't recall seeing
if it actually did so in practice.

	Or is your change meant to fix something else?

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox