Netdev List


* Re: [PATCH v2 1/2] ppp_generic: pull 2 bytes so that PPP_PROTO(skb) is valid
From: David Miller @ 2010-05-03 20:27 UTC (permalink / raw)
  To: simon; +Cc: netdev, paulus, linux-ppp
In-Reply-To: <4BDF2FD5.2030509@simon.arlott.org.uk>

From: Simon Arlott <simon@fire.lp0.eu>
Date: Mon, 03 May 2010 21:19:33 +0100

> In ppp_input(), PPP_PROTO(skb) may refer to invalid data in the skb.
> 
> If this happens and (proto >= 0xc000 || proto == PPP_CCPFRAG) then
> the packet is passed directly to pppd.
> 
> This occurs frequently when using PPPoE with an interface MTU
> greater than 1500 because the skb is more likely to be non-linear.
> 
> The next 2 bytes need to be pulled in ppp_input(). The pull of 2
> bytes in ppp_receive_frame() has been removed as it is no longer
> required.
> 
> Signed-off-by: Simon Arlott <simon@fire.lp0.eu>

Applied.

^ permalink raw reply

* Re: [PATCH v2 2/2] ppp_generic: handle non-linear skbs when passing them to pppd
From: David Miller @ 2010-05-03 20:27 UTC (permalink / raw)
  To: simon; +Cc: netdev, paulus, linux-ppp
In-Reply-To: <4BDF300B.1040103@simon.arlott.org.uk>

From: Simon Arlott <simon@fire.lp0.eu>
Date: Mon, 03 May 2010 21:20:27 +0100

> Frequently when using PPPoE with an interface MTU greater than 1500,
> the skb is likely to be non-linear. If the skb needs to be passed to
> pppd then the skb data must be read correctly.
> 
> The previous commit fixes an issue with accidentally sending skbs
> to pppd based on an invalid read of the protocol type. When that
> error occurred pppd was reading invalid skb data too.
> 
> Signed-off-by: Simon Arlott <simon@fire.lp0.eu>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: if6_get_next() fix
From: David Miller @ 2010-05-03 20:24 UTC (permalink / raw)
  To: eric.dumazet
  Cc: paulmck, shemminger, Valdis.Kletnieks, akpm, peterz, kaber,
	linux-kernel, netfilter-devel, netdev
In-Reply-To: <1272917635.2407.106.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 03 May 2010 22:13:55 +0200

> On Monday, 03 May 2010 at 13:09 -0700, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 03 May 2010 21:46:47 +0200
>> 
>> > Then, net-next-2.6 doesn't yet have your commit, Paul, to relax
>> > hlist_for_each_entry_rcu(), so it's a bit difficult to continue the work.
>> 
>> Is that in Linus's tree yet?  If it propagates there I can make it
>> propagate to net-next-2.6, you just have to tell me you need it.
> 
> 
> Hmm, it seems it's already in net-next-2.6, sorry for the confusion.

No problem, just let me know in the future if something upstream needs
to propagate.

^ permalink raw reply

* [PATCH v2 2/2] ppp_generic: handle non-linear skbs when passing them to pppd
From: Simon Arlott @ 2010-05-03 20:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, paulus, linux-ppp
In-Reply-To: <4BDF2FD5.2030509@simon.arlott.org.uk>

Frequently when using PPPoE with an interface MTU greater than 1500,
the skb is likely to be non-linear. If the skb needs to be passed to
pppd then the skb data must be read correctly.

The previous commit fixes an issue with accidentally sending skbs
to pppd based on an invalid read of the protocol type. When that
error occurred pppd was reading invalid skb data too.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
 drivers/net/ppp_generic.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 75e8903..8518a2e 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -405,6 +405,7 @@ static ssize_t ppp_read(struct file *file, char __user *buf,
 	DECLARE_WAITQUEUE(wait, current);
 	ssize_t ret;
 	struct sk_buff *skb = NULL;
+	struct iovec iov;
 
 	ret = count;
 
@@ -448,7 +449,9 @@ static ssize_t ppp_read(struct file *file, char __user *buf,
 	if (skb->len > count)
 		goto outf;
 	ret = -EFAULT;
-	if (copy_to_user(buf, skb->data, skb->len))
+	iov.iov_base = buf;
+	iov.iov_len = count;
+	if (skb_copy_datagram_iovec(skb, 0, &iov, skb->len))
 		goto outf;
 	ret = skb->len;
 
-- 
1.7.0.4

-- 
Simon Arlott
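
The copy problem described above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel code: `struct toy_pkt`, `struct toy_frag` and `toy_copy_out()` are hypothetical names standing in for a non-linear skb and a fragment-aware copy. It shows why copying only the linear head (as a plain copy from `skb->data` for `skb->len` bytes would assume) is not enough once part of the payload lives in fragments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for a packet whose payload is split between a
 * linear head and a few out-of-line fragments. */
struct toy_frag {
	const uint8_t *data;
	size_t len;
};

struct toy_pkt {
	const uint8_t *head;		/* linear part */
	size_t head_len;
	struct toy_frag frags[4];	/* non-linear part */
	size_t nr_frags;
	size_t len;			/* head_len + sum of fragment lengths */
};

/* Copy the whole packet into dst, walking the head and every fragment.
 * Returns the number of bytes copied, or -1 if dst is too small. */
static int toy_copy_out(const struct toy_pkt *p, uint8_t *dst, size_t dst_len)
{
	size_t done, i;

	if (p->len > dst_len)
		return -1;

	memcpy(dst, p->head, p->head_len);
	done = p->head_len;

	for (i = 0; i < p->nr_frags; i++) {
		memcpy(dst + done, p->frags[i].data, p->frags[i].len);
		done += p->frags[i].len;
	}

	/* The bookkeeping must agree with the advertised total length. */
	assert(done == p->len);
	return (int)done;
}
```

A copy that only looked at `head` for `len` bytes would read past the linear area here, which is the userspace analogue of the invalid read the patch avoids.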

^ permalink raw reply related

* [PATCH v2 1/2] ppp_generic: pull 2 bytes so that PPP_PROTO(skb) is valid
From: Simon Arlott @ 2010-05-03 20:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, paulus, linux-ppp

In ppp_input(), PPP_PROTO(skb) may refer to invalid data in the skb.

If this happens and (proto >= 0xc000 || proto == PPP_CCPFRAG) then
the packet is passed directly to pppd.

This occurs frequently when using PPPoE with an interface MTU
greater than 1500 because the skb is more likely to be non-linear.

The next 2 bytes need to be pulled in ppp_input(). The pull of 2
bytes in ppp_receive_frame() has been removed as it is no longer
required.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
 drivers/net/ppp_generic.c |   29 ++++++++++++++++++-----------
 1 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 6e281bc..75e8903 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1567,13 +1567,22 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 	struct channel *pch = chan->ppp;
 	int proto;
 
-	if (!pch || skb->len == 0) {
+	if (!pch) {
 		kfree_skb(skb);
 		return;
 	}
 
-	proto = PPP_PROTO(skb);
 	read_lock_bh(&pch->upl);
+	if (!pskb_may_pull(skb, 2)) {
+		kfree_skb(skb);
+		if (pch->ppp) {
+			++pch->ppp->dev->stats.rx_length_errors;
+			ppp_receive_error(pch->ppp);
+		}
+		goto done;
+	}
+
+	proto = PPP_PROTO(skb);
 	if (!pch->ppp || proto >= 0xc000 || proto == PPP_CCPFRAG) {
 		/* put it on the channel queue */
 		skb_queue_tail(&pch->file.rq, skb);
@@ -1585,6 +1594,8 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 	} else {
 		ppp_do_recv(pch->ppp, skb, pch);
 	}
+
+done:
 	read_unlock_bh(&pch->upl);
 }
 
@@ -1617,7 +1628,8 @@ ppp_input_error(struct ppp_channel *chan, int code)
 static void
 ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
 {
-	if (pskb_may_pull(skb, 2)) {
+	/* note: a 0-length skb is used as an error indication */
+	if (skb->len > 0) {
 #ifdef CONFIG_PPP_MULTILINK
 		/* XXX do channel-level decompression here */
 		if (PPP_PROTO(skb) == PPP_MP)
@@ -1625,15 +1637,10 @@ ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
 		else
 #endif /* CONFIG_PPP_MULTILINK */
 			ppp_receive_nonmp_frame(ppp, skb);
-		return;
+	} else {
+		kfree_skb(skb);
+		ppp_receive_error(ppp);
 	}
-
-	if (skb->len > 0)
-		/* note: a 0-length skb is used as an error indication */
-		++ppp->dev->stats.rx_length_errors;
-
-	kfree_skb(skb);
-	ppp_receive_error(ppp);
 }
 
 static void
-- 
1.7.0.4

-- 
Simon Arlott
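
As a rough illustration of the "pull before you read" rule this patch enforces, here is a userspace sketch under stated assumptions: `struct toy_buf`, `toy_may_pull()` and `toy_proto()` are hypothetical and only mimic the idea behind `pskb_may_pull()` and `PPP_PROTO()`; they are not the kernel implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-part buffer: a directly addressable linear area plus
 * a remainder that must be copied in before it can be read. */
struct toy_buf {
	uint8_t head[64];	/* linear area */
	size_t head_len;	/* valid bytes in head[] */
	const uint8_t *frag;	/* payload not yet in the linear area */
	size_t frag_len;
};

/* Ensure at least n bytes are in the linear area.
 * Returns 1 on success, 0 if the packet is too short. */
static int toy_may_pull(struct toy_buf *b, size_t n)
{
	size_t need;

	assert(b->head_len <= sizeof(b->head));
	if (n <= b->head_len)
		return 1;

	need = n - b->head_len;
	if (need > b->frag_len || n > sizeof(b->head))
		return 0;

	memcpy(b->head + b->head_len, b->frag, need);
	b->head_len += need;
	b->frag += need;
	b->frag_len -= need;
	return 1;
}

/* Read the 16-bit big-endian protocol field, pulling it in first.
 * Returns -1 if the frame does not contain two bytes. */
static int toy_proto(struct toy_buf *b)
{
	if (!toy_may_pull(b, 2))
		return -1;
	return (b->head[0] << 8) | b->head[1];
}
```

Reading `head[0]` and `head[1]` without the pull would return whatever happened to be in the linear area, which is how a bogus protocol value of 0xc000 or above could send a data frame to pppd.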

^ permalink raw reply related

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: jamal @ 2010-05-03 20:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, David Miller, xiaosuo, therbert, shemminger, netdev,
	lenb, arjan
In-Reply-To: <1272838104.2173.166.camel@edumazet-laptop>

On Mon, 2010-05-03 at 00:08 +0200, Eric Dumazet wrote:

> 
> Test I did this week with Jamal.
> 
> We first set a "ee" rps mask, because all NIC interrupts were handled by
> CPU0, and Jamal thought like you, that not using cpu4 would give better
> performance.
> 
> But using "fe" mask gave me a bonus, from ~700.000 pps to ~800.000 pps
> 

I am seeing the opposite with my machine (Nehalem):
with ee i get 99.4% and fe i get 94.2% whereas non-rps
is about 98.1%.


cheers,
jamal

PS:- sorry, I don't have time to collect a lot more data - tomorrow I could
do more.



^ permalink raw reply

* Re: [PATCH net-next-2.6] net: if6_get_next() fix
From: Eric Dumazet @ 2010-05-03 20:13 UTC (permalink / raw)
  To: David Miller
  Cc: paulmck, shemminger, Valdis.Kletnieks, akpm, peterz, kaber,
	linux-kernel, netfilter-devel, netdev
In-Reply-To: <20100503.130902.172620141.davem@davemloft.net>

On Monday, 03 May 2010 at 13:09 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 03 May 2010 21:46:47 +0200
> 
> > Then, net-next-2.6 doesn't yet have your commit, Paul, to relax
> > hlist_for_each_entry_rcu(), so it's a bit difficult to continue the work.
> 
> Is that in Linus's tree yet?  If it propagates there I can make it
> propagate to net-next-2.6, you just have to tell me you need it.


Hmm, it seems it's already in net-next-2.6, sorry for the confusion.

commit 3120438ad68601f341e61e7cb1323b0e1a6ca367
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Mon Feb 22 17:04:48 2010 -0800

    rcu: Disable lockdep checking in RCU list-traversal primitives
    
    The theory is that use of bare rcu_dereference() is more prone
    to error than use of the RCU list-traversal primitives.
    Therefore, disable lockdep RCU read-side critical-section
    checking in these primitives for the time being.  Once all of
    the rcu_dereference() uses have been dealt with, it may be time
    to re-enable lockdep checking for the RCU list-traversal
    primitives.
    



^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-03 20:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272714966.14499.37.camel@bigi>

On Sat, 2010-05-01 at 07:56 -0400, jamal wrote:
> On Sat, 2010-05-01 at 13:42 +0200, Eric Dumazet wrote:
> 
> > But, whole point of epoll is to not change interest each time you get an
> > event.
> > 
> > Without EV_PERSIST, you need two more syscalls per recvfrom()
> > 
> > epoll_wait()
> >  epoll_ctl(REMOVE)
> >  epoll_ctl(ADD)
> >  recvfrom()
> > 
> > Even poll() would be faster in your case
> > 
> > poll(one fd)
> > recvfrom()
> > 
> 
> This is true - but my goal was/is to replicate the regression i was
> seeing[1]. 
> I will try with PERSIST next opportunity. If it gets better
> then it is something that needs documentation in the doc Tom
> promised ;->

I tried it with PERSIST and today's net-next and you are right:
rps was better than non-rps (99.4% vs 98.1% of 750Kpps).
If however I removed the PERSIST, i.e. both rps and non-rps
have two extra syscalls, rps again performed worse (93.2% vs 97.8%
of 750Kpps). Eric, I know the answer is not to do the non-PERSIST mode
for rps ;-> But let's just ignore that for a sec:
what the heck is going on? I would expect the degradation to be the same
for both rps and non-rps.
I also want to repeat the broken-record reminder that kernels before
net-next of Apr14 were doing about 97% (as opposed to 93% currently for
the same test).

cheers,
jamal
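
For readers following the PERSIST discussion: the sketch below uses raw epoll rather than libevent, so it is an approximation of the test setup, not the actual benchmark code. A level-triggered epoll registration stays in place until it is removed, which is the behaviour EV_PERSIST asks libevent for. `recv_persistent()` is a hypothetical helper; the 100 ms timeout and 2048-byte buffer are arbitrary choices.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Receive up to max_msgs datagrams from fd using ONE epoll registration.
 * Per datagram this costs epoll_wait() + recv(), with no epoll_ctl()
 * calls in the loop. Returns the number received, or -1 on setup error. */
static int recv_persistent(int fd, int max_msgs)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	struct epoll_event out;
	char buf[2048];
	int ep, got = 0;

	assert(max_msgs >= 0);

	ep = epoll_create1(0);
	if (ep < 0)
		return -1;
	if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(ep);
		return -1;
	}

	while (got < max_msgs) {
		/* Interest set is unchanged between iterations. */
		int n = epoll_wait(ep, &out, 1, 100);

		if (n <= 0)
			break;		/* timeout or error */
		if (recv(fd, buf, sizeof(buf), 0) < 0)
			break;
		got++;
	}

	close(ep);
	return got;
}
```

The non-PERSIST variant measured above would add an `EPOLL_CTL_DEL` and an `EPOLL_CTL_ADD` inside the loop, i.e. two more syscalls per datagram.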

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: if6_get_next() fix
From: David Miller @ 2010-05-03 20:09 UTC (permalink / raw)
  To: eric.dumazet
  Cc: paulmck, shemminger, Valdis.Kletnieks, akpm, peterz, kaber,
	linux-kernel, netfilter-devel, netdev
In-Reply-To: <1272916007.2407.75.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 03 May 2010 21:46:47 +0200

> Then, net-next-2.6 doesn't yet have your commit, Paul, to relax
> hlist_for_each_entry_rcu(), so it's a bit difficult to continue the work.

Is that in Linus's tree yet?  If it propagates there I can make it
propagate to net-next-2.6, you just have to tell me you need it.

^ permalink raw reply

* Re: [RFC] network driver skb allocations
From: David Miller @ 2010-05-03 20:06 UTC (permalink / raw)
  To: bhutchings; +Cc: eric.dumazet, netdev
In-Reply-To: <1272916166.27948.62.camel@localhost>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Mon, 03 May 2010 20:49:26 +0100

> On Mon, 2010-05-03 at 19:06 +0200, Eric Dumazet wrote:
> [...]
>> Current logic for drivers is to :
>> 
>> allocate skbs (sk_buff + data) and put them in a ring buffer.
> 
> Not all of them.

In particular NIU always allocates SKBs at the time that it passes the
packet up to the stack.

^ permalink raw reply

* Re: [patch 01/13] KS8851: Fix ks8851 snl transmit problem
From: David Miller @ 2010-05-03 20:04 UTC (permalink / raw)
  To: Tristram.Ha; +Cc: ben, netdev, support
In-Reply-To: <14385191E87B904DBD836449AA30269D6DE63A@MORGANITE.micrel.com>

And btw, your commit message should have explained all of the things
you're telling me here.  Rather than just mention some vague "problem".

Your commit messages should explain everything about why you're
making the change, not just say what changes are being made.

^ permalink raw reply

* Re: [patch 01/13] KS8851: Fix ks8851 snl transmit problem
From: David Miller @ 2010-05-03 20:03 UTC (permalink / raw)
  To: Tristram.Ha; +Cc: ben, netdev, support
In-Reply-To: <14385191E87B904DBD836449AA30269D6DE63A@MORGANITE.micrel.com>

From: "Ha, Tristram" <Tristram.Ha@Micrel.Com>
Date: Mon, 3 May 2010 12:06:21 -0700

> The transmit done interrupt in the KSZ8851 chips is not required for
> normal operation.  Turning it off actually will improve transmit
> performance because the system will not be interrupted every time a
> packet is sent.

But you only trigger this workqueue when you notice in ->ndo_start_xmit()
that you're out of space.

This makes the chip sit idle with no packets to send until the workqueue
executes asynchronously to the initial transmit path which noticed the
queue was full.

That doesn't make any sense to me.  If anything you should at least try
to purge the TX queue and make space directly in the ->ndo_start_xmit()
handler.  And if that fails trigger an hrtimer to poll the TX state.

Without some kind of timer based polling mechanism, if the workqueue
finds the TX queue is still full, what's going to do more checks
later?  You will no longer get ->ndo_start_xmit() calls because the
queue has been marked full, so nothing will trigger the workqueue to
run any more.
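
The control flow suggested above (reclaim inline first, fall back to a timer that keeps polling) can be modelled as a small state machine. This is a toy model under stated assumptions, not ks8851 driver code: `struct toy_nic` and the `toy_*` functions are hypothetical, and the "timer" is just a flag that a caller fires by hand.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a NIC transmit queue. */
struct toy_nic {
	int tx_free;		/* free TX slots known to the driver */
	int tx_completed;	/* slots the hardware finished, not yet reclaimed */
	bool queue_stopped;	/* stack will not call start_xmit while set */
	bool timer_armed;	/* a poll is scheduled */
};

/* Return completed slots to the free pool. */
static void toy_reclaim(struct toy_nic *n)
{
	n->tx_free += n->tx_completed;
	n->tx_completed = 0;
}

/* Transmit entry point. Returns true if the packet was accepted. */
static bool toy_start_xmit(struct toy_nic *n)
{
	if (n->tx_free == 0)
		toy_reclaim(n);		/* try to make room inline first */

	if (n->tx_free == 0) {
		/* Still full: stop the queue and make sure something
		 * will look at the TX state again later. */
		n->queue_stopped = true;
		n->timer_armed = true;
		return false;
	}

	n->tx_free--;
	return true;
}

/* Timer callback: poll the TX state until space appears. */
static void toy_timer_fire(struct toy_nic *n)
{
	assert(n->timer_armed);

	toy_reclaim(n);
	if (n->tx_free > 0) {
		n->timer_armed = false;
		n->queue_stopped = false;	/* wake the queue */
	}
	/* Otherwise stay armed and poll again on the next tick. */
}
```

The point of the model is the last branch: if the timer left itself disarmed while the queue was still full, nothing would ever restart transmission, which is the stall described in the message.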

^ permalink raw reply

* [PATCH] dm9601: fix phy/eeprom write routine
From: Peter Korsgaard @ 2010-05-03 20:01 UTC (permalink / raw)
  To: davem, netdev, michael.planes; +Cc: Peter Korsgaard, stable

Use correct bit positions in DM_SHARED_CTRL register for writes.

Michael Planes recently encountered a 'KY-RS9600 USB-LAN converter', which
came with a driver CD containing a Linux driver. This driver turns out to
be a copy of dm9601.c with symbols renamed and my copyright stripped.
That aside, it did contain 1 functional change in dm_write_shared_word(),
and after checking the datasheet the original value was indeed wrong
(read versus write bits).

On Michael's hardware, this change bumps receive speed from ~30KB/s to ~900KB/s.
On other devices the difference is less spectacular, but still significant
(~30%).

Reported-by: Michael Planes <michael.planes@free.fr>
CC: stable@kernel.org
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
---
 drivers/net/usb/dm9601.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/usb/dm9601.c b/drivers/net/usb/dm9601.c
index 04b2810..5dfed92 100644
--- a/drivers/net/usb/dm9601.c
+++ b/drivers/net/usb/dm9601.c
@@ -240,7 +240,7 @@ static int dm_write_shared_word(struct usbnet *dev, int phy, u8 reg, __le16 valu
 		goto out;

 	dm_write_reg(dev, DM_SHARED_ADDR, phy ? (reg | 0x40) : reg);
-	dm_write_reg(dev, DM_SHARED_CTRL, phy ? 0x1c : 0x14);
+	dm_write_reg(dev, DM_SHARED_CTRL, phy ? 0x1a : 0x12);

 	for (i = 0; i < DM_TIMEOUT; i++) {
 		u8 tmp;
-- 
1.7.0
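
To make the "read versus write bits" remark concrete, the sketch below decodes the old and new constants. The bit names and meanings are assumptions based on my reading of the DM9000-family register layout and are not taken from this patch; only the values 0x1c/0x14 and 0x1a/0x12 come from the diff. `toy_shared_write_cmd()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout of the shared (PHY/EEPROM) control register.
 * The names are illustrative, not the driver's. */
#define TOY_EPCR_WRITE	0x02	/* assumed: start a write */
#define TOY_EPCR_READ	0x04	/* assumed: start a read */
#define TOY_EPCR_PHY	0x08	/* assumed: target the PHY, not the EEPROM */
#define TOY_EPCR_WEP	0x10	/* assumed: write enable */

/* Build the command value for a write to the PHY (phy != 0) or EEPROM. */
static uint8_t toy_shared_write_cmd(int phy)
{
	uint8_t cmd = TOY_EPCR_WEP | TOY_EPCR_WRITE |
		      (phy ? TOY_EPCR_PHY : 0);

	/* A write command must not carry the read bit. */
	assert(!(cmd & TOY_EPCR_READ));
	return cmd;
}
```

Under this assumed layout the new values 0x1a and 0x12 set the write bit, while the old 0x1c and 0x14 set the read bit instead, which matches the description in the commit message.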

^ permalink raw reply related

* TCP over lo
From: Chuck Lever @ 2010-05-03 20:00 UTC (permalink / raw)
  To: netdev

I've got a bug report about attempting to mount a remote NFSv4 server 
while the client's lo interface is down.

Apparently a TCP connection attempt over lo when it is DOWN results in 
ETIMEDOUT.  If I attempt a TCP connect on eth0 when it is DOWN I get 
ENETUNREACH.

Is this behavior difference by design?  Is there some other information 
I'm missing?  Just wondering how to go about addressing the issue.

Thanks in advance for advice.

-- 
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply

* Re: [PATCH v2] ethernet: call __skb_pull() in eth_type_trans()
From: David Miller @ 2010-05-03 19:54 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, netdev
In-Reply-To: <1272895972-13799-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Mon,  3 May 2010 22:12:52 +0800

> @@ -162,7 +162,10 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
>  
>  	skb->dev = dev;
>  	skb_reset_mac_header(skb);
> -	skb_pull_inline(skb, ETH_HLEN);
> +	if (unlikely(skb->len < ETH_ZLEN))
> +		dev_warn(&dev->dev, "too small ethernet packet: %u bytes\n",
> +			 skb->len);
> +	__skb_pull(skb, ETH_HLEN);
>  	eth = eth_hdr(skb);

And now it's even more expensive than skb_pull_inline() :-)

Really, things are fine as-is.
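
For context on the cost argument, this is a userspace sketch of the difference between a length-checked pull and an unchecked one. `struct toy_skb`, `toy_pull()` and `toy_pull_unchecked()` are hypothetical stand-ins for the idea behind `skb_pull()` and `__skb_pull()`, not the kernel implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal packet descriptor. */
struct toy_skb {
	uint8_t *data;	/* start of remaining payload */
	size_t len;	/* bytes remaining */
};

/* Unchecked pull: the caller guarantees that n <= s->len. */
static uint8_t *toy_pull_unchecked(struct toy_skb *s, size_t n)
{
	assert(n <= s->len);
	s->len -= n;
	s->data += n;
	return s->data;
}

/* Checked pull: one extra comparison, NULL on a short buffer. */
static uint8_t *toy_pull(struct toy_skb *s, size_t n)
{
	return n > s->len ? NULL : toy_pull_unchecked(s, n);
}
```

The checked variant costs a single compare and branch. The quoted patch keeps a compare (against `ETH_ZLEN`) and adds a warning call on top of it, so it does not come out cheaper.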

^ permalink raw reply

* Re: [RFC] network driver skb allocations
From: Ben Hutchings @ 2010-05-03 19:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1272906384.2226.80.camel@edumazet-laptop>

On Mon, 2010-05-03 at 19:06 +0200, Eric Dumazet wrote:
[...]
> Current logic for drivers is to :
> 
> allocate skbs (sk_buff + data) and put them in a ring buffer.

Not all of them.

> When rx interrupt comes, get the skb and give it to stack.
> 
> Allocate a new skb (sk_buff + data) and put it in rx fat ring buffer (511 entries for tg3 )
> 
> This is suboptimal, because sk_buff will probably be cold 512 rx later...
> Also, NUMA info might be wrong : sk_buff should be allocated on current node,
> not on the device preferred node.

This also avoids allocating sk_buffs that are never needed due to GRO or
scattering of jumbo frames.

> Drivers should allocate only the data part for the NIC and, at interrupt time,
> allocate the sk_buff and link it to the buffer filled by the NIC.

I think we found that this increases latency, so sfc switches between
page and skb allocations dynamically.

Ben.

> With a prefetch(first_cache_line_of_data) before doing sk_buff allocation and init,
> eth_type_trans() / get_rps_cpus() would be much faster.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
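
The scheme Eric proposes (ring holds data buffers only, metadata is allocated when a packet actually arrives) can be sketched in userspace as follows. This is an illustration under stated assumptions, not driver code: `struct toy_meta` stands in for `struct sk_buff`, and the ring size and buffer size are arbitrary.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_RING_SIZE	4
#define TOY_BUF_SIZE	2048

/* Stand-in for the per-packet metadata (struct sk_buff in the kernel). */
struct toy_meta {
	unsigned char *data;
	size_t len;
};

/* The receive ring holds bare data buffers, no metadata. */
struct toy_ring {
	unsigned char *bufs[TOY_RING_SIZE];
	unsigned int next;	/* next slot the "NIC" fills */
};

/* Pre-fill the ring. Returns 0 on success, -1 on allocation failure. */
static int toy_ring_fill(struct toy_ring *r)
{
	unsigned int i;

	for (i = 0; i < TOY_RING_SIZE; i++) {
		r->bufs[i] = malloc(TOY_BUF_SIZE);
		if (!r->bufs[i]) {
			while (i--)
				free(r->bufs[i]);
			return -1;
		}
	}
	r->next = 0;
	return 0;
}

/* Called when the "NIC" reports len bytes in the next slot. The
 * metadata is allocated only now, and the slot gets a fresh buffer. */
static struct toy_meta *toy_rx(struct toy_ring *r, size_t len)
{
	struct toy_meta *m;
	unsigned char *fresh;

	assert(len <= TOY_BUF_SIZE);

	m = malloc(sizeof(*m));
	fresh = malloc(TOY_BUF_SIZE);
	if (!m || !fresh) {
		free(m);
		free(fresh);
		return NULL;	/* drop; the old buffer stays in the ring */
	}

	m->data = r->bufs[r->next];
	m->len = len;
	r->bufs[r->next] = fresh;
	r->next = (r->next + 1) % TOY_RING_SIZE;
	return m;
}
```

The trade-off Ben mentions shows up in `toy_rx()`: the metadata allocation now sits on the receive path, which is one reason a driver might switch between the two strategies dynamically.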


^ permalink raw reply

* Re: [PATCH 1/2] ppp_generic: pull 2 bytes so that PPP_PROTO(skb) is valid
From: David Miller @ 2010-05-03 19:49 UTC (permalink / raw)
  To: simon; +Cc: netdev, paulus, linux-ppp
In-Reply-To: <e85216b167313bb2bd46f5990ed468f5b67a1ad8@8b5064a13e22126c1b9329f0dc35b8915774b7c3.invalid>

From: "Simon Arlott" <simon@fire.lp0.eu>
Date: Mon, 3 May 2010 12:50:04 +0100

> Updated patch attached.

Two things:

1) Please don't post patches as binary attachments, it doesn't get
   queued up properly into patchwork if you post it that way.

2) Always make a fresh email posting for new versions of your patch
   when possible, as this way what to include in the commit log
   message is clear and always applies to the patch posted rather
   than bits and pieces of quoted material from previous postings
   that may or may not apply to the current version of the patch.

So please repost this, and your new 2/2 patch, as clean new submissions.

Thanks.

^ permalink raw reply

* [PATCH net-next-2.6] net: if6_get_next() fix
From: Eric Dumazet @ 2010-05-03 19:46 UTC (permalink / raw)
  To: paulmck, Stephen Hemminger
  Cc: Valdis.Kletnieks, Andrew Morton, Peter Zijlstra, Patrick McHardy,
	David S. Miller, linux-kernel, netfilter-devel, netdev
In-Reply-To: <20100503181629.GJ2597@linux.vnet.ibm.com>

On Monday, 03 May 2010 at 11:16 -0700, Paul E. McKenney wrote:
> I would be happy to if I could find the commit creating
> hlist_for_each_entry_continue_rcu()...
> 
> I do see a ca. 2008 patch from Stephen Hemminger:
> 
> 	http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg264661.html
> 
> According to http://patchwork.ozlabs.org/patch/47997/, this is
> going up the networking tree as of March 18, 2010.
> 
> So I would be happy to push the patch below, but to do so, I will need
> to adopt the portion of Stephen's patch that created this primitive.
> 

Hmm, I realize there is a real bug introduced by Stephen's patch.

Then, net-next-2.6 doesn't yet have your commit, Paul, to relax
hlist_for_each_entry_rcu(), so it's a bit difficult to continue the work.

Thanks

[PATCH net-next-2.6] net: if6_get_next() fix

Must use rcu variant, we are in a rcu_read_lock_bh() section

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 34d2d64..16bb85c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2979,7 +2979,7 @@ static struct inet6_ifaddr *if6_get_next(struct seq_file *seq,
 			return ifa;
 
 	while (++state->bucket < IN6_ADDR_HSIZE) {
-		hlist_for_each_entry(ifa, n,
+		hlist_for_each_entry_rcu(ifa, n,
 				     &inet6_addr_lst[state->bucket], addr_lst) {
 			if (net_eq(dev_net(ifa->idev->dev), net))
 				return ifa;
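
As background on what the `_rcu` traversal variant adds, here is a userspace analogy using C11 atomics. It is only a sketch: it models the ordered pointer loads on the reader side and the ordered publish on the writer side, assumes a single writer, and does not model grace periods or reclamation at all. `struct toy_node`, `toy_publish()` and `toy_find()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical singly linked list node with an atomically read link. */
struct toy_node {
	int key;
	_Atomic(struct toy_node *) next;
};

/* Writer side (single writer assumed): initialise the node fully, then
 * publish it with release ordering so readers never see a half-built
 * node. This plays the role of rcu_assign_pointer(). */
static void toy_publish(_Atomic(struct toy_node *) *headp, struct toy_node *n)
{
	assert(n != NULL);
	atomic_store_explicit(&n->next,
			      atomic_load_explicit(headp, memory_order_relaxed),
			      memory_order_relaxed);
	atomic_store_explicit(headp, n, memory_order_release);
}

/* Reader side: every link is loaded with acquire ordering before it is
 * dereferenced. This plays the role of rcu_dereference() inside the
 * _rcu list iterators. */
static struct toy_node *toy_find(_Atomic(struct toy_node *) *headp, int key)
{
	struct toy_node *n;

	for (n = atomic_load_explicit(headp, memory_order_acquire);
	     n != NULL;
	     n = atomic_load_explicit(&n->next, memory_order_acquire)) {
		if (n->key == key)
			return n;
	}
	return NULL;
}
```

A traversal that used plain loads for `next` would be the analogue of the non-RCU iterator the patch replaces: it compiles and usually works, but nothing orders the reader's loads against a concurrent insert.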

^ permalink raw reply related

* RE: [patch 01/13] KS8851: Fix ks8851 snl transmit problem
From: Ha, Tristram @ 2010-05-03 19:06 UTC (permalink / raw)
  To: David Miller, ben; +Cc: netdev, support
In-Reply-To: <20100502.223852.219748077.davem@davemloft.net>

David Miller wrote:
> From: Ben Dooks <ben@simtec.co.uk>
> Date: Fri, 30 Apr 2010 00:16:22 +0100
> 
>> From: Tristram Ha <Tristram.Ha@micrel.com>
>> 
>> This fixes a transmit problem of the ks8851 snl network driver.
>> 
>> Under heavy TCP traffic the device will stop transmitting. Turning off
>> the transmit interrupt avoids this issue.  A new workqueue was
>> implemented to replace the functionality of the transmit interrupt processing.
>> 
>> Signed-off-by: Tristram Ha <Tristram.Ha@micrel.com>
> 
> Please, try to fix this properly.  Unless you have a known chip errata with the TX interrupt
> that cannot be worked around reasonably, which would need to be detailed and explained
> completely in the commit log message, you should try to figure out what the real problem is.
> 
> Otherwise just tossing everything to a workqueue looks like a hack, at best.
> 
> I suspect you have some kind of race between IRQ processing and the
> ->ndo_start_xmit() handler, so you end up missing a queue wakeup.
> Either that or you end up misprogramming the hardware due to the race.
> 
> There is no way I'm applying this as-is.

As I explained a long time ago (last year), this patch is no longer
intended as a fix but as a performance improvement.

The transmit done interrupt in the KSZ8851 chips is not required for
normal operation.  Turning it off actually will improve transmit
performance because the system will not be interrupted every time a
packet is sent.

This driver runs on an SPI bus.  Just reading a register requires a
workqueue because it cannot be done in interrupt context.  Processing
the transmit interrupt requires scheduling a workqueue anyway.

I tested the driver on the Beagle Zippy2 board with a 24 MHz SPI bus
clock, which limits the throughput to 10 Mbps.  On other systems the
transmit performance improvement may be greater, but I do not have the
data to back that up.

If you feel strongly that this workqueue implementation is not
appropriate, then please disregard the patch.

^ permalink raw reply

* Fun with if_bridge.h and br_private.h
From: Paul E. McKenney @ 2010-05-03 19:12 UTC (permalink / raw)
  To: shemminger; +Cc: netdev, arnd

Hello, Stephen,

I need some help with #include issues surrounding bridge.

Arnd has been putting together a sparse-based RCU-checking tool
that can detect pointer access that should have been protected by an
rcu_dereference() or rcu_assign_pointer().  One of the side effects of
his approach is that rcu_dereference() can no longer handle pointers
to incomplete types, as Arnd's approach uses the address-space feature
of sparse, which in turn must tag the type pointed to rather than the
pointer itself.  And this in turn requires rcu_dereference() and friends
to see the pointed-to type as well as the pointer type.

One place affected by this is handle_bridge() in net/core/dev.c, which
invokes rcu_dereference() on skb->dev->br_port, which is of type struct
net_bridge_port, which is defined in net/bridge/br_private.h.  This is,
well, private, but is included into a couple places in netfilter and atm,
so I tried including it into net/core/dev.c.

This still left me with forward references:

In file included from net/core/dev.c:104:
include/linux/if_bridge.h:106: warning: "struct net_bridge_port" declared inside parameter list
include/linux/if_bridge.h:106: warning: its scope is only this definition or declaration, which is probably not what you want
net/core/dev.c:2331: error: conflicting types for "br_handle_frame_hook"
include/linux/if_bridge.h:105: error: previous declaration of "br_handle_frame_hook" was here
net/core/dev.c:2333: error: conflicting types for "br_handle_frame_hook"
include/linux/if_bridge.h:105: error: previous declaration of "br_handle_frame_hook" was here

This happens because net/bridge/br_private.h includes if_bridge.h before
it defines net_bridge_port.

Any thoughts on how best to allow handle_bridge() to see the definition
of struct net_bridge_port?

							Thanx, Paul
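
The "declared inside parameter list" warning quoted above is a general C scoping rule, and a small standalone example may help. This is a sketch with hypothetical names (`struct toy_port`, `toy_hook_t`), not a proposed change to if_bridge.h: it shows one common way to make a prototype refer to the file-scope struct type, namely a forward declaration placed before the prototype.

```c
#include <assert.h>
#include <stddef.h>

/* File-scope forward declaration. Without this line, the first mention
 * of "struct toy_port" inside a parameter list would declare a new type
 * whose scope ends with that prototype, so it could never match the
 * real struct defined later. */
struct toy_port;

/* The hook type now refers to the file-scope struct toy_port. */
typedef int (*toy_hook_t)(struct toy_port *p);

/* Full definition, which may live in a different (private) header. */
struct toy_port {
	int id;
};

static int toy_handle(struct toy_port *p)
{
	assert(p != NULL);
	return p->id;
}

/* The types agree, so this assignment compiles without a conflict. */
static toy_hook_t toy_hook = toy_handle;
```

Note that a forward declaration only fixes the scope problem. Code that needs the pointed-to type to be complete, as the sparse-based checking described above does, still has to see the full definition.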

^ permalink raw reply

* Re: Receive issues with bonding and vlans
From: Jay Vosburgh @ 2010-05-03 18:25 UTC (permalink / raw)
  To: John Fastabend
  Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek,
	Patrick McHardy, bonding-devel@lists.sourceforge.net
In-Reply-To: <4BDE3D3B.3030800@intel.com>

John Fastabend <john.r.fastabend@intel.com> wrote:

>Jay Vosburgh wrote:
>> Chris Leech <christopher.leech@intel.com> wrote:
>>
>>> On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote:
>>>> 	Is the FCoE supposed to run over the inactive bonding slave?  Or
>>>> am I misunderstanding what you're saying?  I had thought the LLDP, et
>>>> al, special case in bonding was to permit, essentially, path discovery,
>>>> not necessarily active use of the inactive slave.
>>> That's what I'm trying to do, yes.  Mostly because it's a setup that
>>> would work if you removed the FCoE traffic from the network data path,
>>> and only converged at the driver level and below.  It's possible that
>>> the answer is "don't do that".
>>
>> 	So, basically, you want the bond to act like usual for "regular"
>> ethernet traffic, but act like the slaves are independent from the bond
>> for the magic FCoE traffic, right?
>>
>> 	I'm not really sure if that's a "don't do that" or not.
>>
>>>> 	Not that this is necessarily bad; the "drop stuff on inactive
>>>> slaves" is really there for duplicate suppression, but it also can
>>>> uncover network topology issues, e.g., network layouts that won't work
>>>> if the devices fail, but appear to work during testing because the
>>>> "inactive" slave still receives traffic (it hasn't really failed).
>>>>
>>>>> The problem is that it doesn't work for hardware accelerated VLAN
>>>>> devices, because the VLAN receive paths have their own
>>>>> skb_bond_should_drop calls that were not updated.
>>>>>
>>>>> From what I can tell, VLAN receives always end up going through
>>>>> netif_receive_skb anyway, so skb_bond_should_drop gets called twice if
>>>>> the frame isn't dropped the first time.  I think the bonding checks in
>>>>> __vlan_hwaccel_rx and vlan_gro_common should just be removed.
>>>> 	I'm not so sure.  The checks in __vlan_hwaccel_rx are done with
>>>> the original receiving device in skb->dev; by the time the packet gets
>>>> to netif_receive_skb, the original slave the packet was received on has
>>>> been lost (and replaced with the VLAN device).  Various things are
>>>> interested in that, in particular the "arp_validate" and the "inactive
>>>> slave drop" logic for bonding depend on knowing the real device the
>>>> packet arrived on.
>>>>
>>>> 	I note that the vlan accel logic doesn't change skb_iif to the
>>>> VLAN device; it remains as the original device.  I suppose one
>>>> alternative would be to convert the bonding drop, et al, logic to use
>>>> skb_iif instead of skb->dev; if that works, then I think the VLAN core
>>>> would not need to call skb_bond_should_drop, which in turn would be a
>>>> bit more complicated as it would have to look up the dev from the
>>>> skb_iif.  There's already some code in bonding that takes advantage of
>>>> this property of the VLANs, so maybe this is the way to go.
>>> Thanks, I'll take another look and see if I can come up with something
>>> better.
>>
>> 	I looked at the skb_bond_should_drop stuff a bit more after I
>> wrote that; it's not as easy as I had suspected.  The big sticking point
>> is that currently the test in netif_receive_skb (now __netif_receive_skb
>> in net-next-2.6) is on skb->dev->master to identify packets arriving on
>> slaves of bonding.  The VLAN skb->dev has ->master set to NULL.  Doing
>> that test against skb->skb_iif would be much more expensive, as it would
>> require a device lookup for every packet.
>>
>> 	So, I suspect that something has to happen in the VLAN
>> acceleration path, although I don't know exactly what.  I don't know if
>> it would be possible to flag the packets in some special way to indicate
>> that they're "bonding slave" packets, or if it's better to keep the
>> current structure and just fix the calls somehow.
>>
>> 	-J
>>
>>
>
>Jay,
>
>It should be OK to allow packets to be received on the VLAN if it is not
>explicitly in the bond?

	Lemme see if I have this straight, because all of these special
cases are making my brain hurt.  This one is for a configuration like this:

	bond0 ----- eth0
		   /
	vlan.xxx -/

	I.e., a VLAN configured directly atop an ethernet device, said
ethernet also being a slave to bonding.  Is that correct?

	Extrapolating from the ASCII art in a prior message in this
discussion, would this configuration really be something like this:

	vlan.xxx -\
		   \
	bond0 ----- eth1
	bond0 ----- eth0
		   /
	vlan.xxx -/

	I.e., two slaves to bonding, with VLAN xxx configured atop both
of the slaves?  Or would the eth0 and eth1 use discrete VLAN ids?  The
reason I ask is in regards to duplicate suppression.  The whole reason
the "inactive" slave drops (most) incoming packets is to eliminate
duplicates when the switch floods traffic to both slave ports.

	This is a bit tricky, because it's not really about broadcasts /
multicasts so much, but about traffic that the switch sends to all ports
because the switch's MAC address table isn't up to date with the
destination MAC of the traffic (which is a transient condition, so this
would only happen for perhaps one second or so).  That would result in
duplicate unicast packets being received by the bond (back in the day
before bonding had the "drop inactive traffic" logic).

	So if the same VLAN is configured atop the two slaves, I wonder
if that will open a window for the duplicate unicast packet problem.

	If the VLAN ids are different, then I'll assume this is some
kind of device mapper magic, doing the load balancing elsewhere.

>Or if we want to be more paranoid deliver packets only to handlers with
>exact matches for the device. For non vlan devices we deliver skb's to
>packet handlers that match exactly even on inactive slaves so doing this
>on vlan devices as well makes sense and shouldn't cause any unexpected
>problems.

	Yah, the whole concept of "inactive" slave is pretty mutated
now; it's kind of become the "active-backup with semi-manual load
balance for clever protocols, oh, and duplicate suppression" mode.

>Also on a somewhat unrelated note I suspect null_or_orig and null_or_bond
>are not working as expected in __netif_receive_skb().  At least the
>comment 'deliver only exact match' could be inaccurate.

	I don't think this is unrelated at all.  This code (the packet
to device lookup stuff in __netif_receive_skb) has been modified pretty
extensively lately for various bonding-related special cases, and I
think it is getting hard to follow.  Whatever comments are there need to
be accurate, and, honestly, I think this code needs comments to explain
what exactly is supposed to happen for these special cases.

>Here's a patch to illustrate what I'm thinking; it is compile tested only.  If
>this sounds reasonable I'll work up an official patch.
>
>
>[PATCH] net: allow vlans on bonded real net_devices
>
>For converged I/O it is reasonable to use dm_multipathing to provide
>failover and load balancing for storage traffic and then use bonding
>for the LAN failover and load balancing.
>
>Currently this works if the multipathed devices are using the real
>net_device and those devices are enslaved to a bonded device.  What
>does not work is creating a vlan on the real device and trying to
>use it outside the bond for multipathing.
>
>This patch adds logic so that if the skb is destined for a vlan
>that is not in the bond the skb will not be dropped.
>
>Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>---
>
> net/8021q/vlan_core.c |   31 +++++++++++++++++++++----------
> net/core/dev.c        |   11 ++++++++---
> 2 files changed, 29 insertions(+), 13 deletions(-)
>
>diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
>index c584a0a..3bce0c3 100644
>--- a/net/8021q/vlan_core.c
>+++ b/net/8021q/vlan_core.c
>@@ -8,18 +8,24 @@
> int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
> 		      u16 vlan_tci, int polling)
> {
>+	struct net_device *vlan_dev;
>+
> 	if (netpoll_rx(skb))
> 		return NET_RX_DROP;
>
>-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>+	vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>+
>+	if (!vlan_dev)
>+		goto drop;
>+
>+	if ((vlan_dev->priv_flags & IFF_BONDING ||
>+	    vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) &&
>+	    skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))

	I'm not sure this will do the right thing if the VLAN device
itself is a slave to bonding, e.g., bond0 ---> vlan.xxx ---> eth0.  In
that case, eth0's dev->master is NULL, and the vlan_dev (vlan.xxx's dev)
doesn't have IFF_MASTER (but does have IFF_SLAVE and IFF_BONDING, I
believe).

	I think this will result in all incoming traffic being accepted
on such a configuration (leading to duplicates, as described above).

	I suspect, but have not tested, that something like this might
do what you're looking for:

	if ((vlan_dev->priv_flags & IFF_BONDING ||
	    vlan_dev_real_dev(vlan_dev)->flags & (IFF_MASTER | IFF_SLAVE)) &&
	    skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))

	I.e., if the VLAN device is either a MASTER (configured above
the bond) or a slave (configured below the bond) do the duplicate
suppression.
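
	To compare the two conditions side by side, here is a user-space
sketch.  The flag bits and the model_dev struct are stand-ins invented
for the illustration (the real values are in the kernel headers), and it
models only the flag test that gates the call, not what
skb_bond_should_drop() itself does with skb->dev->master.

```c
#include <stdbool.h>

/* Stand-in flag bits for this sketch only; not the kernel's values. */
#define X_IFF_MASTER	0x1u	/* models dev->flags & IFF_MASTER */
#define X_IFF_SLAVE	0x2u	/* models dev->flags & IFF_SLAVE */
#define X_IFF_BONDING	0x4u	/* models dev->priv_flags & IFF_BONDING */

struct model_dev {
	unsigned int flags;
	unsigned int priv_flags;
};

/* Flag test from the posted patch: go on to the bonding drop check only
 * if the VLAN device is part of a bond or its real device is a master. */
static bool check_as_posted(const struct model_dev *vlan_dev,
			    const struct model_dev *real_dev)
{
	return (vlan_dev->priv_flags & X_IFF_BONDING) ||
	       (real_dev->flags & X_IFF_MASTER);
}

/* Suggested variant: also go on to the drop check when the real device
 * is a slave. */
static bool check_as_suggested(const struct model_dev *vlan_dev,
			       const struct model_dev *real_dev)
{
	return (vlan_dev->priv_flags & X_IFF_BONDING) ||
	       (real_dev->flags & (X_IFF_MASTER | X_IFF_SLAVE));
}
```

	In this model the two forms differ only for a VLAN whose real
device is a slave while the VLAN device itself carries no bonding flag:
the posted form skips the drop check, the suggested form runs it.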

> 		goto drop;
>
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
>-	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>-
>-	if (!skb->dev)
>-		goto drop;
>+	skb->dev = vlan_dev;
>
> 	return (polling ? netif_receive_skb(skb) : netif_rx(skb));
>
>@@ -82,16 +88,21 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
> 		unsigned int vlan_tci, struct sk_buff *skb)
> {
> 	struct sk_buff *p;
>+	struct net_device *vlan_dev;
>
>-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>+	vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>+
>+	if (!vlan_dev)
>+		goto drop;
>+
>+	if ((vlan_dev->priv_flags & IFF_BONDING ||
>+	    vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) &&
>+	    skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
> 		goto drop;
>
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
>-	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>-
>-	if (!skb->dev)
>-		goto drop;
>+	skb->dev = vlan_dev;
>
> 	for (p = napi->gro_list; p; p = p->next) {
> 		NAPI_GRO_CB(p)->same_flow =
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 100dcbd..9ea4550 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -2780,6 +2780,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 	struct net_device *master;
> 	struct net_device *null_or_orig;
> 	struct net_device *null_or_bond;
>+	struct net_device *real_dev;
> 	int ret = NET_RX_DROP;
> 	__be16 type;
>
>@@ -2853,9 +2854,13 @@ ncls:
> 	 * handler may have to adjust skb->dev and orig_dev.
> 	 */
> 	null_or_bond = NULL;
>-	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
>-	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
>-		null_or_bond = vlan_dev_real_dev(skb->dev);
>+	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN)) {
>+		real_dev = vlan_dev_real_dev(skb->dev);
>+		if (real_dev->priv_flags & IFF_BONDING)
>+			null_or_bond = vlan_dev_real_dev(skb->dev);
>+		if (!(skb->dev->priv_flags & IFF_BONDING) &&
>+		    real_dev->priv_flags & IFF_SLAVE_INACTIVE)
>+			null_or_orig = skb->dev;
> 	}
>
> 	type = skb->protocol;

	Is this another way of accomplishing what I had suggested at the
end of this message:

http://marc.info/?l=linux-netdev&m=127111386702765&w=2

	The patch part of my suggestion was as follows:

>diff --git a/net/core/dev.c b/net/core/dev.c
>index b98ddc6..cc665bb 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -2735,7 +2735,7 @@ ncls:
> 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
> 		if (ptype->type == type && (ptype->dev == null_or_orig ||
> 		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
>-		     ptype->dev == null_or_bond)) {
>+		     (null_or_bond && (ptype->dev == null_or_bond))) {
> 			if (pt_prev)
> 				ret = deliver_skb(skb, pt_prev, orig_dev);
> 			pt_prev = ptype;
>
>
>	I haven't tested this, but the theory is to only test against
>null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN
>traffic over bonding.

	Chris Leech said "that should do it" but I don't recall seeing
if it actually did so in practice.
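
	The effect of the guard can be seen in a small user-space model
of just the device comparison.  The struct and function names below are
invented for the sketch, and the protocol type test is left out.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_dev { int ifindex; };	/* hypothetical stand-in for net_device */

/* The four devices the handler lookup compares against. */
struct rx_ctx {
	const struct model_dev *skb_dev;
	const struct model_dev *orig_dev;
	const struct model_dev *null_or_orig;
	const struct model_dev *null_or_bond;
};

/* Comparison as it reads before the hunk. */
static bool dev_match_unguarded(const struct rx_ctx *c,
				const struct model_dev *ptype_dev)
{
	return ptype_dev == c->null_or_orig || ptype_dev == c->skb_dev ||
	       ptype_dev == c->orig_dev || ptype_dev == c->null_or_bond;
}

/* Comparison with the guard: null_or_bond counts only when it is set. */
static bool dev_match_guarded(const struct rx_ctx *c,
			      const struct model_dev *ptype_dev)
{
	return ptype_dev == c->null_or_orig || ptype_dev == c->skb_dev ||
	       ptype_dev == c->orig_dev ||
	       (c->null_or_bond && ptype_dev == c->null_or_bond);
}
```

	In the model, when null_or_orig points at a device and
null_or_bond is NULL, a handler with a NULL dev matches the unguarded
form through the last comparison but does not match the guarded form.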

	Or is your change meant to fix something else?

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: mmotm 2010-04-28 - RCU whinges
From: Paul E. McKenney @ 2010-05-03 18:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Valdis.Kletnieks, Andrew Morton, Peter Zijlstra, Patrick McHardy,
	David S. Miller, linux-kernel, netfilter-devel, netdev
In-Reply-To: <1272903293.2226.50.camel@edumazet-laptop>

On Mon, May 03, 2010 at 06:14:53PM +0200, Eric Dumazet wrote:
> On Monday 03 May 2010 at 08:43 -0700, Paul E. McKenney wrote:
> 
> > Highly recommended.  ;-)
> > 
> > And thanks to you for your testing efforts and to Eric for the fixes!!!
> > 
> 
> For this last one, I think you should push the following patch, Paul

I would be happy to if I could find the commit creating
hlist_for_each_entry_continue_rcu()...

I do see a ca. 2008 patch from Stephen Hemminger:

	http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg264661.html

According to http://patchwork.ozlabs.org/patch/47997/, this is
going up the networking tree as of March 18, 2010.

So I would be happy to push the patch below, but to do so, I will need
to adopt the portion of Stephen's patch that created this primitive.

Please let me know how you would like to proceed!

							Thanx, Paul

> Followup of commit 3120438ad6
> (rcu: Disable lockdep checking in RCU list-traversal primitives)
> 
> Or we might introduce a hlist_for_each_entry_continue_rcu_bh() macro...
> 
> 
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 004908b..b0c7e24 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -435,10 +435,10 @@ static inline void hlist_add_after_rcu(struct hlist_node *prev,
>   * @member:	the name of the hlist_node within the struct.
>   */
>  #define hlist_for_each_entry_continue_rcu(tpos, pos, member)		\
> -	for (pos = rcu_dereference((pos)->next);			\
> +	for (pos = rcu_dereference_raw((pos)->next);			\
>  	     pos && ({ prefetch(pos->next); 1; }) &&			\
>  	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });  \
> -	     pos = rcu_dereference(pos->next))
> +	     pos = rcu_dereference_raw(pos->next))
> 
> 
>  #endif	/* __KERNEL__ */
> 
> 

^ permalink raw reply

* Re: OOP in ip_cmsg_recv (net-next)
From: Eric Dumazet @ 2010-05-03 17:21 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <1272906266.2226.77.camel@edumazet-laptop>

On Monday 03 May 2010 at 19:04 +0200, Eric Dumazet wrote:
> On Monday 03 May 2010 at 09:47 -0700, Stephen Hemminger wrote:
> > I am getting occasional NULL pointer references with net-next kernel.
> > No test, just usual stuff (like DNS).
> > 
> > This is a new regression in net-next only.
> > 
> > 
> Hmm, skb->sk is NULL
> 
> void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
> {
> 	struct inet_sock *inet = inet_sk(skb->sk);
> 	unsigned flags = inet->cmsg_flags; // CRASH
> 
> 
> So a skb_free_datagram_locked() is at fault here...
> 
> commit 4b0b72f7dd617b13abd1b04c947e15873e011a24 probably
> 
> OK, the skb_orphan() should not be done at this point, if we are not the
> only user (and last user)
> 
> Oh well, sorry for the regression ;)
> 

I'll test the following patch and report the results to netdev:

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 95b851f..e009753 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -229,13 +229,18 @@ EXPORT_SYMBOL(skb_free_datagram);
 
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return;
+
 	lock_sock_bh(sk);
 	skb_orphan(skb);
 	sk_mem_reclaim_partial(sk);
 	unlock_sock_bh(sk);
 
-	/* skb is now orphaned, might be freed outside of locked section */
-	consume_skb(skb);
+	/* skb is now orphaned, can be freed outside of locked section */
+	__kfree_skb(skb);
 }
 EXPORT_SYMBOL(skb_free_datagram_locked);
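
The reference counting in the patch above can be modelled in user space
with C11 atomics. This is a sketch under assumptions: the structs and
the function name are invented stand-ins, the acquire fence only
approximates smp_rmb(), and the socket locking is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for sock and sk_buff; only the fields used here. */
struct model_sock { int reclaimed; };

struct model_skb {
	atomic_int users;		/* reference count */
	struct model_sock *sk;		/* owning socket, NULL once orphaned */
	bool freed;
};

/* Drop one reference. Only the holder of the last reference orphans the
 * buffer, so other holders still see a valid skb->sk. */
static void model_free_datagram_locked(struct model_sock *sk,
				       struct model_skb *skb)
{
	if (atomic_load(&skb->users) == 1)
		atomic_thread_fence(memory_order_acquire); /* approximates smp_rmb() */
	else if (atomic_fetch_sub(&skb->users, 1) != 1)
		return;			/* someone else still uses the skb */

	/* The real code holds the socket lock around these two steps. */
	skb->sk = NULL;			/* models skb_orphan() */
	sk->reclaimed++;		/* models sk_mem_reclaim_partial() */

	skb->freed = true;		/* models __kfree_skb() */
}
```

With two holders, the first call in this model only drops the count and
leaves skb->sk alone; the second call orphans and frees.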
 



^ permalink raw reply related

* [RFC PATCH v2] sctp: fix sctp to work with ipv6 source address routing
From: Vlad Yasevich @ 2010-05-03 17:07 UTC (permalink / raw)
  To: Weixing.Shi, yjwei; +Cc: netdev, Vlad Yasevich
In-Reply-To: <4BDEFFEF.2090200@hp.com>

From: Weixing Shi <Weixing.Shi@windriver.com>

<vlad>
Ok, updated to be a bit more correct.  Only leave the function early if the
first lookup succeeds.
</vlad>

In the test case below, which uses source address routing,
sctp does not work.
Node-A
1)ifconfig eth0 inet6 add 2001:1::1/64
2)ip -6 rule add from 2001:1::1 table 100 pref 100
3)ip -6 route add 2001:2::1 dev eth0 table 100
4)sctp_darn -H 2001:1::1 -P 250 -l &
Node-B
1)ifconfig eth0 inet6 add 2001:2::1/64
2)ip -6 rule add from 2001:2::1 table 100 pref 100
3)ip -6 route add 2001:1::1 dev eth0 table 100
4)sctp_darn -H 2001:2::1 -P 250 -h 2001:1::1 -p 250 -s

root cause:
Node-A and Node-B use source address routing. At the
beginning the source address is NULL, so sctp searches
the routing table by destination address only. Because
the matching route is in the source address routing
table, that lookup fails and the resulting dst_entry is
NULL.

solution:
Walk through the bind address list to get a source
address, then look up the routing table again to get
the correct dst_entry.

Signed-off-by: Weixing Shi <Weixing.Shi@windriver.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 net/sctp/ipv6.c |  113 ++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 70 insertions(+), 43 deletions(-)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 7326891..f75e2f8 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -77,6 +77,10 @@
 #include <net/sctp/sctp.h>
 
 #include <asm/uaccess.h>
+static void sctp_v6_dst_saddr(union sctp_addr *addr, struct dst_entry *dst,
+            __be16 port);
+static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
+          const union sctp_addr *addr2);
 
 /* Event handler for inet6 address addition/deletion events.
  * The sctp_local_addr_list needs to be protocted by a spin lock since
@@ -242,8 +246,12 @@ static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
 					 union sctp_addr *daddr,
 					 union sctp_addr *saddr)
 {
-	struct dst_entry *dst;
+	struct dst_entry *dst = NULL;
 	struct flowi fl;
+	struct sctp_bind_addr *bp;
+	struct sctp_sockaddr_entry *laddr;
+	union sctp_addr dst_saddr;
+	sctp_scope_t scope;
 
 	memset(&fl, 0, sizeof(fl));
 	ipv6_addr_copy(&fl.fl6_dst, &daddr->v6.sin6_addr);
@@ -259,16 +267,71 @@ static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
 	}
 
 	dst = ip6_route_output(&init_net, NULL, &fl);
+
+	bp = &asoc->base.bind_addr;
+	scope = sctp_scope(daddr);
+
 	if (!dst->error) {
+		if (!asoc || saddr)
+			goto out;
+
+		/* Walk through the bind address list and look for a bind
+		 * address that matches the source address of the returned dst.
+		 */
+		sctp_v6_dst_saddr(&dst_saddr, dst, htons(bp->port));
+		rcu_read_lock();
+		list_for_each_entry_rcu(laddr, &bp->address_list, list) {
+			if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
+				continue;
+
+			/* Do not compare against v4 addrs */
+			if ((laddr->a.sa.sa_family == AF_INET6) &&
+			   (sctp_v6_cmp_addr(&dst_saddr, &laddr->a)))
+				goto out_unlock;
+		}
+		rcu_read_unlock();
+
+		/* None of the bound addresses match the source address of the
+		 * dst. So release it.
+		 */
+		dst_release(dst);
+		dst = NULL;
+	}
+
+	/* Walk through the bind address list and try to get a dst that
+	 * matches a bind address as the source address.
+	 */
+	rcu_read_lock();
+	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
+		if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
+			continue;
+
+		if ((laddr->a.sa.sa_family == AF_INET6) &&
+		    (scope <= sctp_scope(&laddr->a))) {
+			ipv6_addr_copy(&fl.fl6_src, &laddr->a.v6.sin6_addr);
+			dst = ip6_route_output(&init_net, NULL, &fl);
+			if (dst->error) {
+				dst_release(dst);
+				dst = NULL;
+			} else
+				break;
+		}
+	}
+
+out_unlock:
+	rcu_read_unlock();
+
+out:
+	if (dst) {
 		struct rt6_info *rt;
 		rt = (struct rt6_info *)dst;
 		SCTP_DEBUG_PRINTK("rt6_dst:%pI6 rt6_src:%pI6\n",
 			&rt->rt6i_dst.addr, &rt->rt6i_src.addr);
 		return dst;
-	}
-	SCTP_DEBUG_PRINTK("NO ROUTE\n");
-	dst_release(dst);
-	return NULL;
+	} else 
+		SCTP_DEBUG_PRINTK("NO ROUTE\n");
+
+	return dst;
 }
 
 /* Returns the number of consecutive initial bits that match in the 2 ipv6
@@ -289,13 +352,6 @@ static void sctp_v6_get_saddr(struct sctp_sock *sk,
 			      union sctp_addr *daddr,
 			      union sctp_addr *saddr)
 {
-	struct sctp_bind_addr *bp;
-	struct sctp_sockaddr_entry *laddr;
-	sctp_scope_t scope;
-	union sctp_addr *baddr = NULL;
-	__u8 matchlen = 0;
-	__u8 bmatchlen;
-
 	SCTP_DEBUG_PRINTK("%s: asoc:%p dst:%p daddr:%pI6 ",
 			  __func__, asoc, dst, &daddr->v6.sin6_addr);
 
@@ -310,38 +366,9 @@ static void sctp_v6_get_saddr(struct sctp_sock *sk,
 		return;
 	}
 
-	scope = sctp_scope(daddr);
-
-	bp = &asoc->base.bind_addr;
-
-	/* Go through the bind address list and find the best source address
-	 * that matches the scope of the destination address.
-	 */
-	rcu_read_lock();
-	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
-		if (!laddr->valid)
-			continue;
-		if ((laddr->state == SCTP_ADDR_SRC) &&
-		    (laddr->a.sa.sa_family == AF_INET6) &&
-		    (scope <= sctp_scope(&laddr->a))) {
-			bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
-			if (!baddr || (matchlen < bmatchlen)) {
-				baddr = &laddr->a;
-				matchlen = bmatchlen;
-			}
-		}
-	}
-
-	if (baddr) {
-		memcpy(saddr, baddr, sizeof(union sctp_addr));
-		SCTP_DEBUG_PRINTK("saddr: %pI6\n", &saddr->v6.sin6_addr);
-	} else {
-		printk(KERN_ERR "%s: asoc:%p Could not find a valid source "
-		       "address for the dest:%pI6\n",
-		       __func__, asoc, &daddr->v6.sin6_addr);
-	}
+	if (dst)
+		sctp_v6_dst_saddr(saddr, dst, htons(asoc->base.bind_addr.port));
 
-	rcu_read_unlock();
 }
 
 /* Make a copy of all potential local addresses. */
-- 
1.6.0.4
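
The retry described in the commit message can be sketched in user space
as follows. Everything here is a simplified assumption: the route table,
the string addresses and the function names are invented for the
illustration, and the scope check and the comparison of the first
lookup's source address against the bind list are left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical policy route: matches a destination and, when from is
 * non-NULL, only lookups made with that source address. */
struct model_route {
	const char *from;	/* NULL means "any source" */
	const char *to;
};

/* Returns the matching route or NULL. src == NULL means no source
 * address has been chosen yet. */
static const struct model_route *
model_route_output(const struct model_route *tbl, size_t n,
		   const char *src, const char *dst)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(tbl[i].to, dst) != 0)
			continue;
		if (tbl[i].from && (!src || strcmp(tbl[i].from, src) != 0))
			continue;
		return &tbl[i];
	}
	return NULL;
}

/* Try the destination-only lookup first, then retry once per bound
 * address. *chosen_src is the bound address that produced the route,
 * or NULL if the first lookup succeeded or nothing matched. */
static const struct model_route *
model_get_dst(const struct model_route *tbl, size_t n,
	      const char *const *bound, size_t nbound,
	      const char *dst, const char **chosen_src)
{
	const struct model_route *rt = model_route_output(tbl, n, NULL, dst);

	*chosen_src = NULL;
	for (size_t i = 0; !rt && i < nbound; i++) {
		rt = model_route_output(tbl, n, bound[i], dst);
		if (rt)
			*chosen_src = bound[i];
	}
	return rt;
}
```

With a single route "from 2001:1::1 to 2001:2::1", the destination-only
lookup in this model finds nothing, and the retry with the bound address
2001:1::1 finds the route.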


^ permalink raw reply related

* [RFC] network driver skb allocations
From: Eric Dumazet @ 2010-05-03 17:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

A performance idea about drivers and skb allocations:

-----------------------------------------------------------------------------------------------------------
   PerfTop:     954 irqs/sec  kernel:99.5% [1000Hz cycles],  (all, cpu: 0)
-----------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ _________________

             2378.00 16.3% __alloc_skb                   [kernel.kallsyms]
             1962.00 13.5% eth_type_trans                [kernel.kallsyms]
             1472.00 10.1% __kmalloc_track_caller        [kernel.kallsyms]
              991.00  6.8% __slab_alloc                  [kernel.kallsyms]
              938.00  6.4% _raw_spin_lock                [kernel.kallsyms]
              914.00  6.3% __netdev_alloc_skb            [kernel.kallsyms]
              876.00  6.0% kmem_cache_alloc              [kernel.kallsyms]
              519.00  3.6% tg3_poll_work                 [kernel.kallsyms]
              416.00  2.9% tg3_read32                    [kernel.kallsyms]
              394.00  2.7% get_rps_cpu                   [kernel.kallsyms]

Current logic for drivers is to:

Allocate skbs (sk_buff + data) and put them in a ring buffer.

When an rx interrupt comes, get the skb and give it to the stack.

Allocate a new skb (sk_buff + data) and put it in the rx fat ring buffer (511 entries for tg3).

This is suboptimal, because the sk_buff will probably be cold 512 rx later...
Also, NUMA info might be wrong: the sk_buff should be allocated on the current node,
not on the device's preferred node.

Drivers should allocate only the data part for the NIC, and at the time of the interrupt,
allocate the sk_buff and link it to the buffer filled by the NIC.

With a prefetch(first_cache_line_of_data) before doing the sk_buff allocation and init,
eth_type_trans() / get_rps_cpu() would be much faster.
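
A user-space sketch of that scheme, under assumptions: the structs,
sizes and function names are invented for the illustration, malloc()
stands in for the kernel allocators, and __builtin_prefetch is the
GCC/Clang builtin rather than the kernel's prefetch().

```c
#include <stdlib.h>

enum { RING_SIZE = 4, BUF_SIZE = 2048 };	/* arbitrary sizes for the sketch */

/* Hypothetical stand-in for sk_buff: metadata pointing at packet data. */
struct model_skb {
	unsigned char *data;
	unsigned int len;
};

/* The ring holds bare data buffers only; no metadata is pre-allocated. */
struct model_rx_ring {
	unsigned char *buf[RING_SIZE];
	unsigned int next;
};

static int model_ring_init(struct model_rx_ring *r)
{
	r->next = 0;
	for (int i = 0; i < RING_SIZE; i++) {
		r->buf[i] = malloc(BUF_SIZE);
		if (!r->buf[i]) {
			while (i-- > 0)
				free(r->buf[i]);
			return -1;
		}
	}
	return 0;
}

/* Called when the NIC has filled the current slot with len bytes. */
static struct model_skb *model_rx(struct model_rx_ring *r, unsigned int len)
{
	unsigned char *data = r->buf[r->next];
	unsigned char *fresh;
	struct model_skb *skb;

	/* Start fetching the packet header while the metadata is set up. */
	__builtin_prefetch(data);

	skb = malloc(sizeof(*skb));	/* metadata allocated at receive time */
	fresh = malloc(BUF_SIZE);	/* replacement data buffer for the NIC */
	if (!skb || !fresh) {
		free(skb);
		free(fresh);
		return NULL;		/* slot keeps its buffer; frame dropped */
	}

	skb->data = data;
	skb->len = len;
	r->buf[r->next] = fresh;	/* refill the slot with data only */
	r->next = (r->next + 1) % RING_SIZE;
	return skb;
}
```

The point of the model is the ordering: the metadata is allocated when
the frame arrives instead of a ring's worth of frames earlier, and the
slot is refilled with a data buffer alone.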

^ permalink raw reply
