Netdev List


* Re: [PATCH] Make INET_LHTABLE_SIZE a compile-time tunable
From: David Miller @ 2011-01-31 22:05 UTC (permalink / raw)
  To: wsommerfeld; +Cc: netdev, therbert
In-Reply-To: <AANLkTi=5ncH6aRY5ifA0TONB7L5RDdHkriMY1p=aDxs6@mail.gmail.com>

From: Bill Sommerfeld <wsommerfeld@google.com>
Date: Mon, 31 Jan 2011 13:52:03 -0800

> On Fri, Jan 14, 2011 at 13:48, I wrote:
>> INET_LHTABLE_SIZE has been fixed at 32 for a long time.  It should be
>> tunable as larger systems may be running many more than 32 listeners.
> 
> I haven't seen any responses to this patch submission.  Can someone
> take a look?  Thanks.

It should be dynamically sized.  Compile time configuration knobs
generally stick.

^ permalink raw reply
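
One way to read the "dynamically sized" suggestion is to pick the bucket count once at init time from a runtime hint rather than from a compile-time constant. The following is a minimal user-space sketch under that assumption; `struct lhtable`, `lhtable_init()` and friends are hypothetical names for illustration, not kernel API. The size is rounded up to a power of two so a mask can replace the modulo.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical listener table; not the kernel's inet_hashinfo. */
struct lhtable {
	size_t size;     /* number of buckets, always a power of two */
	size_t mask;     /* size - 1, for cheap bucket selection */
	void **buckets;  /* bucket heads, zero-initialised */
};

/* Round n up to the next power of two (n assumed well below SIZE_MAX / 2). */
static size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Allocate a table sized from a runtime hint; returns 0 on success. */
static int lhtable_init(struct lhtable *t, size_t hint)
{
	t->size = roundup_pow2(hint);
	t->mask = t->size - 1;
	t->buckets = calloc(t->size, sizeof(*t->buckets));
	return t->buckets ? 0 : -1;
}

/* Map a hash value (for example one derived from the local port) to a bucket. */
static size_t lhtable_bucket(const struct lhtable *t, unsigned int hash)
{
	return hash & t->mask;
}

static void lhtable_free(struct lhtable *t)
{
	free(t->buckets);
	t->buckets = NULL;
}
```

A caller would pass the expected number of listeners as the hint, e.g. `lhtable_init(&t, 100)` yields 128 buckets. Where the hint should come from (memory size, a sysctl, growth on demand) is left open here.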

* Re: patch "appletalk: move to staging" added to staging tree
From: David Miller @ 2011-01-31 22:05 UTC (permalink / raw)
  To: arnd; +Cc: gregkh, acme, netdev
In-Reply-To: <201101312255.39981.arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Mon, 31 Jan 2011 22:55:39 +0100

> On Monday 31 January 2011, gregkh@suse.de wrote:
>> 
>> This is a note to let you know that I've just added the patch titled
>> 
>>     appletalk: move to staging
>> 
>> to my staging git tree which can be found at
>>     git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6.git
>> in the staging-next branch.
> 
> Actually, David Miller objected to this patch for good reasons, please
> revert it. One of the other patches removes the BKL in appletalk, so
> we should find a way to keep it.

Right.

^ permalink raw reply

* [net-2.6 PATCH 1/3] net: dcb: match dcb_app protocol field with 802.1Qaz spec
From: John Fastabend @ 2011-01-31 22:00 UTC (permalink / raw)
  To: davem; +Cc: john.r.fastabend, netdev

The dcb_app protocol field is a __u32; however, the 802.1Qaz
specification defines it as a 16-bit field. This patch brings
the structure in line with the spec by making it a __u16.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/dcbnl.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
index 68cd248..bdc7ef4 100644
--- a/include/linux/dcbnl.h
+++ b/include/linux/dcbnl.h
@@ -101,7 +101,7 @@ struct ieee_pfc {
  */
 struct dcb_app {
 	__u8	selector;
-	__u32	protocol;
+	__u16	protocol;
 	__u8	priority;
 };
 


^ permalink raw reply related
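
The effect of narrowing the field can be sketched in user space. The example below assumes `uint8_t`/`uint16_t`/`uint32_t` as stand-ins for the kernel's `__u8`/`__u16`/`__u32`; the `_old`/`_new` struct names and `narrow_protocol()` are hypothetical and exist only for comparison. Exact sizes and padding depend on the ABI, so the sketch only relies on the new layout being smaller.

```c
#include <stdint.h>
#include <stddef.h>

/* Layout before the patch: 32-bit protocol field. */
struct dcb_app_old {
	uint8_t  selector;
	uint32_t protocol;
	uint8_t  priority;
};

/* Layout after the patch: 16-bit protocol field, as in 802.1Qaz. */
struct dcb_app_new {
	uint8_t  selector;
	uint16_t protocol;
	uint8_t  priority;
};

/*
 * Values wider than 16 bits no longer fit; an unsigned conversion keeps
 * only the low 16 bits.
 */
static uint16_t narrow_protocol(uint32_t wide)
{
	return (uint16_t)wide;
}
```

Because the structure layout changes, code built against the old header and code built against the new one would likely disagree on field offsets, so both sides need to use the same definition.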

* [net-2.6 PATCH 2/3] net: dcb: use _safe() version of list iterators
From: John Fastabend @ 2011-01-31 22:00 UTC (permalink / raw)
  To: davem; +Cc: john.r.fastabend, netdev
In-Reply-To: <20110131220048.29758.22379.stgit@jf-dev1-dcblab>

Use _safe() version of list iterator macros in dcb_setapp().

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 net/dcb/dcbnl.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index 6b03f56..e3399d6 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -1605,18 +1605,18 @@ u8 dcb_getapp(struct net_device *dev, struct dcb_app *app)
 EXPORT_SYMBOL(dcb_getapp);
 
 /**
- * ixgbe_dcbnl_setapp - add dcb application data to app list
+ * dcbnl_setapp - add dcb application data to app list
  *
  * Priority 0 is the default priority this removes applications
  * from the app list if the priority is set to zero.
  */
 u8 dcb_setapp(struct net_device *dev, struct dcb_app *new)
 {
-	struct dcb_app_type *itr;
+	struct dcb_app_type *itr, *tmp;
 
 	spin_lock(&dcb_lock);
 	/* Search for existing match and replace */
-	list_for_each_entry(itr, &dcb_app_list, list) {
+	list_for_each_entry_safe(itr, tmp, &dcb_app_list, list) {
 		if (itr->app.selector == new->selector &&
 		    itr->app.protocol == new->protocol &&
 		    (strncmp(itr->name, dev->name, IFNAMSIZ) == 0)) {


^ permalink raw reply related
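
The reason for the _safe() iterators can be shown with a small user-space sketch. It does not use the kernel's list.h; `struct node`, `push()` and `remove_zero_priority()` are hypothetical. The point is only that the successor is cached in `tmp` before the current entry may be freed, so the loop never reads through a released pointer.

```c
#include <stdlib.h>

struct node {
	int priority;
	struct node *next;
};

/* Push a new node at the head; returns the new head (old head on failure). */
static struct node *push(struct node *head, int priority)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		return head;
	n->priority = priority;
	n->next = head;
	return n;
}

/*
 * Remove every node with priority 0. 'tmp' holds the successor before
 * 'itr' may be freed, which is the idea behind the _safe() iterators.
 */
static struct node *remove_zero_priority(struct node *head)
{
	struct node **link = &head;
	struct node *itr, *tmp;

	for (itr = head; itr; itr = tmp) {
		tmp = itr->next;
		if (itr->priority == 0) {
			*link = tmp;        /* unlink, then release */
			free(itr);
		} else {
			link = &itr->next;  /* keep, advance the link pointer */
		}
	}
	return head;
}

/* Count the nodes in the list. */
static int count(const struct node *head)
{
	int n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}
```

With a plain iterator the loop step would evaluate `itr->next` after `free(itr)`, which is the use-after-free the _safe() variant avoids.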

* [net-2.6 PATCH 3/3] net: dcb: application priority is per net_device
From: John Fastabend @ 2011-01-31 22:00 UTC (permalink / raw)
  To: davem; +Cc: john.r.fastabend, netdev
In-Reply-To: <20110131220048.29758.22379.stgit@jf-dev1-dcblab>

The app_data priority may not be the same for all net devices.
In order for stacks with application notifiers to identify the
specific net device, dcb_app_type should be passed in the ptr.

This allows handlers to use dev_get_by_name() to pin priority
to net devices.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 net/dcb/dcbnl.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index e3399d6..249bcec 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -1613,6 +1613,10 @@ EXPORT_SYMBOL(dcb_getapp);
 u8 dcb_setapp(struct net_device *dev, struct dcb_app *new)
 {
 	struct dcb_app_type *itr, *tmp;
+	struct dcb_app_type event;
+
+	memcpy(&event.name, dev->name, sizeof(event.name));
+	memcpy(&event.app, new, sizeof(event.app));
 
 	spin_lock(&dcb_lock);
 	/* Search for existing match and replace */
@@ -1644,7 +1648,7 @@ u8 dcb_setapp(struct net_device *dev, struct dcb_app *new)
 	}
 out:
 	spin_unlock(&dcb_lock);
-	call_dcbevent_notifiers(DCB_APP_EVENT, new);
+	call_dcbevent_notifiers(DCB_APP_EVENT, &event);
 	return 0;
 }
 EXPORT_SYMBOL(dcb_setapp);


^ permalink raw reply related
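
How a handler might use the per-device event can be sketched in user space. This is only an illustration: `struct dcb_app_event`, `notify_app_change()`, `struct watch` and `watch_handler()` are hypothetical names, and a plain callback stands in for the kernel's notifier chain.

```c
#include <stdint.h>
#include <string.h>

#define IFNAMSIZ 16  /* same limit the kernel uses for interface names */

struct dcb_app {
	uint8_t  selector;
	uint16_t protocol;
	uint8_t  priority;
};

/* Event payload: the app entry plus the device it applies to. */
struct dcb_app_event {
	char name[IFNAMSIZ];
	struct dcb_app app;
};

typedef void (*dcb_handler)(const struct dcb_app_event *ev, void *ctx);

/* Build the per-device event and hand it to a handler. */
static void notify_app_change(const char *ifname, const struct dcb_app *app,
			      dcb_handler handler, void *ctx)
{
	struct dcb_app_event ev;

	memset(&ev, 0, sizeof(ev));
	strncpy(ev.name, ifname, IFNAMSIZ - 1);  /* buffer stays NUL-terminated */
	ev.app = *app;
	handler(&ev, ctx);
}

/* Example handler state: the device of interest and its last seen priority. */
struct watch {
	const char *ifname;
	int priority;
};

/* Record the priority only when the event is for the watched device. */
static void watch_handler(const struct dcb_app_event *ev, void *ctx)
{
	struct watch *w = ctx;

	if (strncmp(ev->name, w->ifname, IFNAMSIZ) == 0)
		w->priority = ev->app.priority;
}
```

Without the device name in the payload, the handler could not tell an update for eth0 from one for eth1 and would have to apply the priority everywhere.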

* Re: [PATCH v2 16/16] skge: convert to hw_features
From: Michał Mirosław @ 2011-01-31 22:40 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, Ben Hutchings
In-Reply-To: <20110131215343.GA19379@rere.qmqm.pl>

On Mon, Jan 31, 2011 at 10:53:43PM +0100, Michał Mirosław wrote:
> On Mon, Jan 31, 2011 at 08:45:02AM -0800, Stephen Hemminger wrote:
> > On Sat, 22 Jan 2011 23:14:14 +0100 (CET)
> > Michał Mirosław <mirq-linux@rere.qmqm.pl> wrote:
> > > The hardware might do full HW_CSUM, not just IP_CSUM, but it's not tested
> > > and so not changed here.
> > > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > The skge hardware does not do full HW_CSUM. It looks at the IP header.
> Driver code suggests otherwise. Please look at skge_xmit_frame() and TX
> descriptor format.
> 
> Besides, I'm locally using a patch that makes skge advertise HW_CSUM - TCP
> over IPv6 is correctly checksummed in hardware (haven't tried VLANs yet).

IPv4 UDP inside VLAN-tagged packet is also checksummed properly by hardware.
I don't have time now to make a proper HW_CSUM test (with arbitrary header
prepended). If someone has one (or patches?), I'd be glad to use it, though.

Best Regards,
Michał Mirosław

^ permalink raw reply
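
The kind of test mentioned above (an arbitrary header prepended, checksum computed from a given start offset) could be modelled in software roughly as follows. This is a sketch, not driver code: `ones_sum()` and `fill_csum()` are hypothetical helpers, the checksum field is assumed to sit at an even offset from `start`, and it is assumed to hold zero or a pseudo-header seed before the call.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * 16-bit ones' complement sum over len bytes, taken as big-endian words,
 * not inverted. Intended for frame-sized buffers.
 */
static uint16_t ones_sum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;  /* pad odd byte with zero */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);   /* fold carries back in */
	return (uint16_t)sum;
}

/*
 * Software model of protocol-agnostic checksum offload: sum everything
 * from 'start' to the end of the frame and store the inverted result at
 * start + offset. Nothing before 'start' is parsed or touched.
 */
static void fill_csum(uint8_t *frame, size_t len, size_t start, size_t offset)
{
	uint16_t csum = (uint16_t)~ones_sum(frame + start, len - start);

	frame[start + offset] = (uint8_t)(csum >> 8);
	frame[start + offset + 1] = (uint8_t)(csum & 0xff);
}
```

Comparing the value produced by such a model with what the hardware writes, for frames whose leading bytes are not an IP header, would show whether the device depends on parsing the IP header.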

* Re: Problems with /proc/net/tcp6 - possible bug - ipv6
From: PK @ 2011-01-31 22:51 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, linux-kernel, netdev

David Miller wrote:
> 
> Please give this patch a try:
> 
> --------------------
> From d80bc0fd262ef840ed4e82593ad6416fa1ba3fc4 Mon Sep 17 00:00:00 2001
> From: David S. Miller <davem@davemloft.net>
> Date: Mon, 24 Jan 2011 16:01:58 -0800
> Subject: [PATCH] ipv6: Always clone offlink routes.


That patch and all the others seem to be in the official tree, so I pulled
earlier today to test against.

I no longer see kernel warnings or any problems with /proc/net/tcp6, but the
tcp6 layer still has issues with tcp_tw_recycle and a listening socket +
looped connect/disconnects.

First there are intermittent net unreachable connection failures when trying
to connect to a local closed tcp6 port, and eventually connection attempts
start failing with timeouts.  At that point the tcp6 layer seems quite hosed.
It usually gets to that point within a few minutes of starting the loop.
Stopping the script after that point seems to have no positive effect.

https://github.com/runningdogx/net6-bug

Using that script, I get something like the following output, although
sometimes it takes a few more minutes before the timeouts begin.  Using
127.0.0.1 to test against tcp4 shows no net unreachables and no timeouts.
All the errors displayed once the timestamped loops start are from attempts
to connect to a port that's supposed to be closed.

Kernel log is empty since boot.
All this still in a standard ubuntu 10.10 amd64 smp vm.

----output----
# ruby net6-bug/tcp6br.rb ::1 3333

If you're not root, you'll need to enable tcp_tw_recycle yourself
Server listening on ::1:3333

Chose port 55555 (should be closed) to test if stack is functioning
14:28:06  SYN_S:1  SYN_R:0  TWAIT:7  FW1:0  FW2:0  CLOSING:0  LACK:0
14:28:11  SYN_S:1  SYN_R:0  TWAIT:8  FW1:0  FW2:0  CLOSING:0  LACK:0
14:28:16  SYN_S:1  SYN_R:0  TWAIT:12  FW1:0  FW2:0  CLOSING:0  LACK:0
14:28:21  SYN_S:1  SYN_R:0  TWAIT:12  FW1:0  FW2:0  CLOSING:0  LACK:0
14:28:26  SYN_S:1  SYN_R:0  TWAIT:12  FW1:0  FW2:0  CLOSING:0  LACK:0
14:28:31  SYN_S:1  SYN_R:0  TWAIT:17  FW1:0  FW2:0  CLOSING:0  LACK:0
14:28:36  SYN_S:0  SYN_R:0  TWAIT:15  FW1:1  FW2:0  CLOSING:0  LACK:0
tcp socket error: Net Unreachable
14:28:41  SYN_S:1  SYN_R:0  TWAIT:17  FW1:0  FW2:0  CLOSING:0  LACK:0
14:28:46  SYN_S:1  SYN_R:0  TWAIT:16  FW1:0  FW2:0  CLOSING:0  LACK:0
14:28:51  SYN_S:1  SYN_R:0  TWAIT:19  FW1:0  FW2:0  CLOSING:0  LACK:1
14:28:56  SYN_S:1  SYN_R:0  TWAIT:18  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:01  SYN_S:1  SYN_R:0  TWAIT:19  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:06  SYN_S:1  SYN_R:0  TWAIT:10  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:11  SYN_S:1  SYN_R:0  TWAIT:8  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:16  SYN_S:1  SYN_R:0  TWAIT:8  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:21  SYN_S:1  SYN_R:0  TWAIT:7  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:26  SYN_S:1  SYN_R:0  TWAIT:4  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:31  SYN_S:1  SYN_R:0  TWAIT:5  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:36  SYN_S:1  SYN_R:0  TWAIT:5  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:41  SYN_S:1  SYN_R:0  TWAIT:4  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:46  SYN_S:1  SYN_R:0  TWAIT:5  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:51  SYN_S:1  SYN_R:0  TWAIT:3  FW1:0  FW2:0  CLOSING:0  LACK:0
14:29:56  SYN_S:1  SYN_R:0  TWAIT:4  FW1:0  FW2:0  CLOSING:0  LACK:0
14:30:01  SYN_S:1  SYN_R:0  TWAIT:5  FW1:4  FW2:0  CLOSING:0  LACK:1
tcp socket error: Net Unreachable
14:30:06  SYN_S:1  SYN_R:0  TWAIT:6  FW1:2  FW2:0  CLOSING:0  LACK:1
14:30:32  SYN_S:1  SYN_R:0  TWAIT:5  FW1:0  FW2:0  CLOSING:0  LACK:0
14:30:37  SYN_S:1  SYN_R:0  TWAIT:5  FW1:0  FW2:0  CLOSING:0  LACK:0
14:30:42  SYN_S:1  SYN_R:0  TWAIT:3  FW1:0  FW2:0  CLOSING:0  LACK:0
14:30:47  SYN_S:1  SYN_R:0  TWAIT:3  FW1:0  FW2:0  CLOSING:0  LACK:0
!! TCP SOCKET TIMED OUT CONNECTING TO A LOCAL CLOSED PORT
14:34:02  SYN_S:1  SYN_R:0  TWAIT:0  FW1:0  FW2:0  CLOSING:0  LACK:0
!! TCP SOCKET TIMED OUT CONNECTING TO A LOCAL CLOSED PORT
14:37:16  SYN_S:1  SYN_R:0  TWAIT:0  FW1:0  FW2:0  CLOSING:0  LACK:0
!! TCP SOCKET TIMED OUT CONNECTING TO A LOCAL CLOSED PORT
14:40:30  SYN_S:1  SYN_R:0  TWAIT:0  FW1:0  FW2:0  CLOSING:0  LACK:0
^C


      

^ permalink raw reply

* Re: linux-next: Tree for January 31 (ip_vs)
From: Simon Horman @ 2011-01-31 22:57 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Stephen Rothwell, netdev, linux-next, LKML
In-Reply-To: <20110131211846.GH2389@verge.net.au>

On Tue, Feb 01, 2011 at 08:18:47AM +1100, Simon Horman wrote:
> On Mon, Jan 31, 2011 at 10:18:29AM -0800, Randy Dunlap wrote:
> > On Mon, 31 Jan 2011 17:41:13 +1100 Stephen Rothwell wrote:
> > 
> > > Hi all,
> > > 
> > > Changes since 20110121:
> > > 
> > > The net tree lost its build failure.
> > 
> > 
> > When CONFIG_SYSCTL is not enabled:
> > 
> > net/netfilter/ipvs/ip_vs_core.c:1891: warning: format '%lu' expects type 'long unsigned int', but argument 2 has type 'unsigned int'
> > ERROR: "unregister_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko] undefined!
> > ERROR: "register_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko] undefined!
> 
> Thanks, I'm looking into it.

On a related note, does IPVS need to handle the case
where CONFIG_PROC_FS is not enabled?

^ permalink raw reply

* Re: linux-next: Tree for January 31 (ip_vs)
From: David Miller @ 2011-01-31 23:00 UTC (permalink / raw)
  To: horms; +Cc: randy.dunlap, sfr, netdev, linux-next, linux-kernel
In-Reply-To: <20110131225727.GA23992@verge.net.au>

From: Simon Horman <horms@verge.net.au>
Date: Tue, 1 Feb 2011 09:57:28 +1100

> On Tue, Feb 01, 2011 at 08:18:47AM +1100, Simon Horman wrote:
>> On Mon, Jan 31, 2011 at 10:18:29AM -0800, Randy Dunlap wrote:
>> > On Mon, 31 Jan 2011 17:41:13 +1100 Stephen Rothwell wrote:
>> > 
>> > > Hi all,
>> > > 
>> > > Changes since 20110121:
>> > > 
>> > > The net tree lost its build failure.
>> > 
>> > 
>> > When CONFIG_SYSCTL is not enabled:
>> > 
>> > net/netfilter/ipvs/ip_vs_core.c:1891: warning: format '%lu' expects type 'long unsigned int', but argument 2 has type 'unsigned int'
>> > ERROR: "unregister_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko] undefined!
>> > ERROR: "register_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko] undefined!
>> 
>> Thanks, I'm looking into it.
> 
> On a related note, does IPVS need to handle the case
> where CONFIG_PROC_FS is not enabled?

Yes.

^ permalink raw reply

* Re: [PATCH v2 16/16] skge: convert to hw_features
From: Michał Mirosław @ 2011-01-31 23:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, Ben Hutchings
In-Reply-To: <20110131215343.GA19379@rere.qmqm.pl>

On Mon, Jan 31, 2011 at 10:53:43PM +0100, Michał Mirosław wrote:
> On Mon, Jan 31, 2011 at 08:45:02AM -0800, Stephen Hemminger wrote:
> > On Sat, 22 Jan 2011 23:14:14 +0100 (CET)
> > Michał Mirosław <mirq-linux@rere.qmqm.pl> wrote:
> > > The hardware might do full HW_CSUM, not just IP_CSUM, but it's not tested
> > > and so not changed here.
> > > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > The skge hardware does not do full HW_CSUM. It looks at the IP header.
> Driver code suggests otherwise. Please look at skge_xmit_frame() and TX
> descriptor format.

Ah, I just noticed you wrote most of the code. :-)

Anyway, I still think you're wrong on this. I'll get back to it if I find
the time to create a proper test for other people with similar hardware.
This might be an academic dispute unless we can get at the docs.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: IAMT broken by commit 82776a4bcd7aa5fbcd2e6339b3ce88b727dd40ab
From: Jeff Kirsher @ 2011-01-31 23:24 UTC (permalink / raw)
  To: David Miller; +Cc: aurelien@aurel32.net, Allan, Bruce W, netdev@vger.kernel.org
In-Reply-To: <20110131.123605.226778418.davem@davemloft.net>

On Mon, 2011-01-31 at 12:36 -0800, David Miller wrote:
> From: Aurelien Jarno <aurelien@aurel32.net>
> Date: Mon, 31 Jan 2011 12:45:58 +0100
> 
> >> On recent kernels, IAMT support does not work after the machine has 
> >> been powered-off. Even worse, it also goes into this state when I try
> >> to reboot it.
> >> 
> >> I have done a bisect and got this commit:
> >> 
> >> | commit 82776a4bcd7aa5fbcd2e6339b3ce88b727dd40ab
> >> | Author: Bruce Allan <bruce.w.allan@intel.com>
> >> | Date:   Fri Aug 14 14:35:33 2009 +0000
> >> | 
> >> |     e1000e: WoL does not work on 82577/82578 with manageability enabled
> >> |     
> >> |     With manageability (Intel AMT) enabled via BIOS, PHY wakeup does not get
> >> |     configured on newer parts which use PHY wakeup vs. MAC wakeup which causes
> >> |     WoL to not work.  The driver should configure PHY wakeup whether or not
> >> |     manageability is enabled.
> >> |     
> >> |     Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
> >> |     Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> >> |     Signed-off-by: David S. Miller <davem@davemloft.net>
> >> 
> >> I have tried to revert it on recent kernels (2.6.34), and IAMT is then
> >> working as expected. My machine is using a Gigabyte EQ45M-S2 motherboard
> >> with an 82567LM-3 ethernet chip (8086:10de), that is a different model
> >> than the one of the original problem.
> >> 
> >> I do wonder if the changes in the patch should not only be done on some 
> >> chip models, and I will appreciate any help in fixing this issue.
> >> 
> > 
> > Just a short mail to say this problem is still present in 2.6.38-rc2.
> > The same solution still applies, that is reverting the above commit.
> > Note that reverting the first hunk only is enough to get it working
> > again.
> 
> Intel folks please look into this.

We are looking into it, thanks.

^ permalink raw reply

* Kernel wiki for Linux networking
From: Luis R. Rodriguez @ 2011-01-31 23:29 UTC (permalink / raw)
  To: netdev, Stephen Hemminger, David Miller
  Cc: linux-kernel, Etan.Cohen, Hai.Shalom, zhen.xie

https://wiki.kernel.org/

We lack a general networking wiki. Can we get a subdomain to start
one? The Documentation/ directory serves its purpose but wikis can
allow for easier updates and allow for more content to be added and
categorized. We have some pages with some content already like:

http://www.linuxfoundation.org/collaborate/workgroups/networking/
http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge

This is obviously outdated:

http://www.linuxfoundation.org/collaborate/workgroups/networking/802.11

And should just point to wireless.kernel.org, but I don't get the
sense that the LF networking site is a home for any doc updates.

I have found the 802.11 wiki useful for writing proposals [1], summarizing
standards [2], and even providing a home for userspace [3]. I think
something like this can benefit networking in general. Thoughts?

[1] http://wireless.kernel.org/en/developers/DFS
[2] http://wireless.kernel.org/en/developers/Documentation/ieee80211/802.11n
[3] http://wireless.kernel.org/en/users/Documentation

  Luis

^ permalink raw reply

* Re: Kernel wiki for Linux networking
From: Stephen Hemminger @ 2011-01-31 23:31 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: netdev, David Miller, linux-kernel, Etan.Cohen, Hai.Shalom,
	zhen.xie
In-Reply-To: <AANLkTikEhrZmOM3+2jCFXKJh1szd5zKcPxC5HT8NEorQ@mail.gmail.com>

On Mon, 31 Jan 2011 15:29:25 -0800
"Luis R. Rodriguez" <mcgrof@gmail.com> wrote:

> https://wiki.kernel.org/
> 
> We lack a general networking wiki. Can we get a subdomain to start
> one? The Documentation/ directory serves its purpose but wikis can
> allow for easier updates and allow for more content to be added and
> categorized. We have some pages with some content already like:
> 
> http://www.linuxfoundation.org/collaborate/workgroups/networking/
> http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
> 
> This is obviously outdated:
> 
> http://www.linuxfoundation.org/collaborate/workgroups/networking/802.11
> 
> And should just point to wireless.kernel.org, but I don't get the
> sense that the LF networking site is a home for any doc updates.
> 
> I have found the 802.11 wiki useful for writing proposals [1], summarizing
> standards [2], and even providing a home for userspace [3]. I think
> something like this can benefit networking in general. Thoughts?
> 
> [1] http://wireless.kernel.org/en/developers/DFS
> [2] http://wireless.kernel.org/en/developers/Documentation/ieee80211/802.11n
> [3] http://wireless.kernel.org/en/users/Documentation
> 
>   Luis

The LF is willing to host it, but if you would rather host it elsewhere,
just put a link to wherever you want.

-- 

^ permalink raw reply

* Re: linux-next: Tree for January 31 (ip_vs)
From: Simon Horman @ 2011-02-01  0:03 UTC (permalink / raw)
  To: David Miller; +Cc: randy.dunlap, sfr, netdev, linux-next, linux-kernel
In-Reply-To: <20110131.150031.70194609.davem@davemloft.net>

On Mon, Jan 31, 2011 at 03:00:31PM -0800, David Miller wrote:
> From: Simon Horman <horms@verge.net.au>
> Date: Tue, 1 Feb 2011 09:57:28 +1100
> 
> > On Tue, Feb 01, 2011 at 08:18:47AM +1100, Simon Horman wrote:
> >> On Mon, Jan 31, 2011 at 10:18:29AM -0800, Randy Dunlap wrote:
> >> > On Mon, 31 Jan 2011 17:41:13 +1100 Stephen Rothwell wrote:
> >> > 
> >> > > Hi all,
> >> > > 
> >> > > Changes since 20110121:
> >> > > 
> >> > > The net tree lost its build failure.
> >> > 
> >> > 
> >> > When CONFIG_SYSCTL is not enabled:
> >> > 
> >> > net/netfilter/ipvs/ip_vs_core.c:1891: warning: format '%lu' expects type 'long unsigned int', but argument 2 has type 'unsigned int'
> >> > ERROR: "unregister_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko] undefined!
> >> > ERROR: "register_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko] undefined!
> >> 
> >> Thanks, I'm looking into it.
> > 
> > On a related note, does IPVS need to handle the case
> > where CONFIG_PROC_FS is not enabled?
> 
> Yes.

Thanks.

I checked and it does seem to compile without CONFIG_PROC_FS, and now
also without CONFIG_SYSCTL. I'll send a patch for the latter right
after I finish this email.

I think that in both cases there is dead code; I'll clean that up next.

^ permalink raw reply

* [GIT PULL nf-next-2.6] IPVS build fixes and clean-ups
From: Simon Horman @ 2011-02-01  0:14 UTC (permalink / raw)
  To: netdev, linux-next, linux-kernel, lvs-devel
  Cc: Randy Dunlap, Stephen Rothwell, Hans Schillstrom, Patrick McHardy

Hi,

This short patch series addresses two linux-next build problems
raised by Randy Dunlap:

* net/netfilter/ipvs/ip_vs_core.c:1891: warning: format '%lu' expects type 'long unsigned int', but argument 2 has type 'unsigned int'
* ERROR: "unregister_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko] undefined!
  ERROR: "register_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko] undefined!

The remainder of the changeset consists of cleanups that I noticed along the way.

The changes are available at
git://git.kernel.org/pub/scm/linux/kernel/git/horms/lvs-test-2.6.git master

They are currently compile-tested only.

 include/net/ip_vs.h              |    2 --
 net/netfilter/ipvs/ip_vs_core.c  |    2 +-
 net/netfilter/ipvs/ip_vs_ctl.c   |   17 +++++++++--------
 net/netfilter/ipvs/ip_vs_lblc.c  |   20 ++++++++++----------
 net/netfilter/ipvs/ip_vs_lblcr.c |   20 ++++++++++----------
 5 files changed, 30 insertions(+), 31 deletions(-)

^ permalink raw reply

* [PATCH 1/4] IPVS: use z modifier for sizeof() argument
From: Simon Horman @ 2011-02-01  0:14 UTC (permalink / raw)
  To: netdev, linux-next, linux-kernel, lvs-devel
  Cc: Randy Dunlap, Stephen Rothwell, Hans Schillstrom, Patrick McHardy,
	Simon Horman
In-Reply-To: <1296519255-10602-1-git-send-email-horms@verge.net.au>

Cc: Hans Schillstrom <hans@schillstrom.com>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_core.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index d889f4f..4d06617 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1887,7 +1887,7 @@ static int __net_init __ip_vs_init(struct net *net)
 	ipvs->gen = atomic_read(&ipvs_netns_cnt);
 	atomic_inc(&ipvs_netns_cnt);
 	net->ipvs = ipvs;
-	printk(KERN_INFO "IPVS: Creating netns size=%lu id=%d\n",
+	printk(KERN_INFO "IPVS: Creating netns size=%zu id=%d\n",
 			 sizeof(struct netns_ipvs), ipvs->gen);
 	return 0;
 }
-- 
1.7.2.3

^ permalink raw reply related
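
The warning fixed here comes from `sizeof` yielding a `size_t`, whose width differs between 32-bit and 64-bit builds, so `%lu` only matches on some of them. A minimal user-space sketch of the portable form, using `snprintf` in place of `printk`; `struct sample` and `format_size()` are hypothetical:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical structure whose size gets reported. */
struct sample {
	char bytes[40];
};

/*
 * Format a sizeof() result. %zu is the conversion for size_t, so the
 * format string is correct regardless of whether size_t is
 * unsigned int or unsigned long on the target.
 */
static int format_size(char *buf, size_t buflen, int id)
{
	return snprintf(buf, buflen, "size=%zu id=%d",
			sizeof(struct sample), id);
}
```

An alternative that also silences the warning is casting to `unsigned long` and keeping `%lu`, but `%zu` avoids the cast.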

* [PATCH 2/4] IPVS: remove duplicate initialisation of rs_table
From: Simon Horman @ 2011-02-01  0:14 UTC (permalink / raw)
  To: netdev, linux-next, linux-kernel, lvs-devel
  Cc: Randy Dunlap, Stephen Rothwell, Hans Schillstrom, Patrick McHardy,
	Simon Horman
In-Reply-To: <1296519255-10602-1-git-send-email-horms@verge.net.au>

Cc: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_ctl.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 98df59a..d7c2fa8 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -3515,9 +3515,6 @@ int __net_init __ip_vs_control_init(struct net *net)
 	}
 	spin_lock_init(&ipvs->tot_stats->lock);
 
-	for (idx = 0; idx < IP_VS_RTAB_SIZE; idx++)
-		INIT_LIST_HEAD(&ipvs->rs_table[idx]);
-
 	proc_net_fops_create(net, "ip_vs", 0, &ip_vs_info_fops);
 	proc_net_fops_create(net, "ip_vs_stats", 0, &ip_vs_stats_fops);
 	proc_net_fops_create(net, "ip_vs_stats_percpu", 0,
-- 
1.7.2.3

^ permalink raw reply related

* [PATCH 3/4] IPVS: Remove unused variables
From: Simon Horman @ 2011-02-01  0:14 UTC (permalink / raw)
  To: netdev, linux-next, linux-kernel, lvs-devel
  Cc: Randy Dunlap, Stephen Rothwell, Hans Schillstrom, Patrick McHardy,
	Simon Horman
In-Reply-To: <1296519255-10602-1-git-send-email-horms@verge.net.au>

These variables are unused as a result of the recent netns work.

Cc: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 include/net/ip_vs.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index b23bea6..5d75fea 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1109,8 +1109,6 @@ extern int ip_vs_icmp_xmit_v6
  *	we are loaded. Just set ip_vs_drop_rate to 'n' and
  *	we start to drop 1/rate of the packets
  */
-extern int ip_vs_drop_rate;
-extern int ip_vs_drop_counter;
 
 static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
 {
-- 
1.7.2.3

^ permalink raw reply related

* [PATCH 4/4] IPVS: Allow compilation with CONFIG_SYSCTL disabled
From: Simon Horman @ 2011-02-01  0:14 UTC (permalink / raw)
  To: netdev, linux-next, linux-kernel, lvs-devel
  Cc: Randy Dunlap, Stephen Rothwell, Hans Schillstrom, Patrick McHardy,
	Simon Horman
In-Reply-To: <1296519255-10602-1-git-send-email-horms@verge.net.au>

This is a rather naive approach to allowing IPVS to compile with
CONFIG_SYSCTL disabled.  I am working on a more comprehensive patch which
will remove compilation of all sysctl-related IPVS code when CONFIG_SYSCTL
is disabled.

Cc: Hans Schillstrom <hans@schillstrom.com>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_ctl.c   |   14 +++++++++-----
 net/netfilter/ipvs/ip_vs_lblc.c  |   20 ++++++++++----------
 net/netfilter/ipvs/ip_vs_lblcr.c |   20 ++++++++++----------
 3 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index d7c2fa8..c73b0c8 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -3552,10 +3552,15 @@ int __net_init __ip_vs_control_init(struct net *net)
 	tbl[idx++].data = &ipvs->sysctl_nat_icmp_send;
 
 
+#ifdef CONFIG_SYSCTL
 	ipvs->sysctl_hdr = register_net_sysctl_table(net, net_vs_ctl_path,
 						     tbl);
-	if (ipvs->sysctl_hdr == NULL)
-		goto err_reg;
+	if (ipvs->sysctl_hdr == NULL) {
+		if (!net_eq(net, &init_net))
+			kfree(tbl);
+		goto err_dup;
+	}
+#endif
 	ip_vs_new_estimator(net, ipvs->tot_stats);
 	ipvs->sysctl_tbl = tbl;
 	/* Schedule defense work */
@@ -3563,9 +3568,6 @@ int __net_init __ip_vs_control_init(struct net *net)
 	schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
 	return 0;
 
-err_reg:
-	if (!net_eq(net, &init_net))
-		kfree(tbl);
 err_dup:
 	free_percpu(ipvs->cpustats);
 err_alloc:
@@ -3581,7 +3583,9 @@ static void __net_exit __ip_vs_control_cleanup(struct net *net)
 	ip_vs_kill_estimator(net, ipvs->tot_stats);
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
+#ifdef CONFIG_SYSCTL
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
+#endif
 	proc_net_remove(net, "ip_vs_stats_percpu");
 	proc_net_remove(net, "ip_vs_stats");
 	proc_net_remove(net, "ip_vs");
diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index d5bec33..00b5ffa 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -554,33 +554,33 @@ static int __net_init __ip_vs_lblc_init(struct net *net)
 						sizeof(vs_vars_table),
 						GFP_KERNEL);
 		if (ipvs->lblc_ctl_table == NULL)
-			goto err_dup;
+			return -ENOMEM;
 	} else
 		ipvs->lblc_ctl_table = vs_vars_table;
 	ipvs->sysctl_lblc_expiration = 24*60*60*HZ;
 	ipvs->lblc_ctl_table[0].data = &ipvs->sysctl_lblc_expiration;
 
+#ifdef CONFIG_SYSCTL
 	ipvs->lblc_ctl_header =
 		register_net_sysctl_table(net, net_vs_ctl_path,
 					  ipvs->lblc_ctl_table);
-	if (!ipvs->lblc_ctl_header)
-		goto err_reg;
+	if (!ipvs->lblc_ctl_header) {
+		if (!net_eq(net, &init_net))
+			kfree(ipvs->lblc_ctl_table);
+		return -ENOMEM;
+	}
+#endif
 
 	return 0;
-
-err_reg:
-	if (!net_eq(net, &init_net))
-		kfree(ipvs->lblc_ctl_table);
-
-err_dup:
-	return -ENOMEM;
 }
 
 static void __net_exit __ip_vs_lblc_exit(struct net *net)
 {
 	struct netns_ipvs *ipvs = net_ipvs(net);
 
+#ifdef CONFIG_SYSCTL
 	unregister_net_sysctl_table(ipvs->lblc_ctl_header);
+#endif
 
 	if (!net_eq(net, &init_net))
 		kfree(ipvs->lblc_ctl_table);
diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index 61ae8cf..bfa25f1 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -754,33 +754,33 @@ static int __net_init __ip_vs_lblcr_init(struct net *net)
 						sizeof(vs_vars_table),
 						GFP_KERNEL);
 		if (ipvs->lblcr_ctl_table == NULL)
-			goto err_dup;
+			return -ENOMEM;
 	} else
 		ipvs->lblcr_ctl_table = vs_vars_table;
 	ipvs->sysctl_lblcr_expiration = 24*60*60*HZ;
 	ipvs->lblcr_ctl_table[0].data = &ipvs->sysctl_lblcr_expiration;
 
+#ifdef CONFIG_SYSCTL
 	ipvs->lblcr_ctl_header =
 		register_net_sysctl_table(net, net_vs_ctl_path,
 					  ipvs->lblcr_ctl_table);
-	if (!ipvs->lblcr_ctl_header)
-		goto err_reg;
+	if (!ipvs->lblcr_ctl_header) {
+		if (!net_eq(net, &init_net))
+			kfree(ipvs->lblcr_ctl_table);
+		return -ENOMEM;
+	}
+#endif
 
 	return 0;
-
-err_reg:
-	if (!net_eq(net, &init_net))
-		kfree(ipvs->lblcr_ctl_table);
-
-err_dup:
-	return -ENOMEM;
 }
 
 static void __net_exit __ip_vs_lblcr_exit(struct net *net)
 {
 	struct netns_ipvs *ipvs = net_ipvs(net);
 
+#ifdef CONFIG_SYSCTL
 	unregister_net_sysctl_table(ipvs->lblcr_ctl_header);
+#endif
 
 	if (!net_eq(net, &init_net))
 		kfree(ipvs->lblcr_ctl_table);
-- 
1.7.2.3

^ permalink raw reply related
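
A common alternative to #ifdef blocks at every call site is to provide no-op stubs when the feature is compiled out, so callers stay unchanged. The sketch below shows that general pattern in user space; it is not the follow-up patch mentioned above. `DEMO_SYSCTL` is a hypothetical stand-in for CONFIG_SYSCTL, and all `demo_*` names are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical table and registration handle. */
struct demo_table {
	const char *name;
};

struct demo_header {
	const struct demo_table *table;
};

#ifdef DEMO_SYSCTL
static struct demo_header demo_hdr;

/* Real implementation: remember the table and return a handle. */
static int demo_register(const struct demo_table *t, struct demo_header **out)
{
	demo_hdr.table = t;
	*out = &demo_hdr;
	return 0;
}

static void demo_unregister(struct demo_header *h)
{
	if (h)
		h->table = NULL;
}
#else
/* Stubs: registration succeeds trivially and yields no handle. */
static inline int demo_register(const struct demo_table *t,
				struct demo_header **out)
{
	(void)t;
	*out = NULL;
	return 0;
}

static inline void demo_unregister(struct demo_header *h)
{
	(void)h;
}
#endif

/* Init and exit paths contain no conditional compilation at all. */
static int demo_init(const struct demo_table *t, struct demo_header **hdr)
{
	return demo_register(t, hdr);
}

static void demo_exit(struct demo_header *hdr)
{
	demo_unregister(hdr);
}
```

The trade-off is that the conditional logic moves into one place, at the cost of the stubs having to mirror the real signatures.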

* (unknown)
From: Tom Herbert @ 2011-02-01  0:21 UTC (permalink / raw)
  To: davem, netdev

From b6943d0caff7db23aaed20ec7abb7848281e502a Mon Sep 17 00:00:00 2001
From: Tom Herbert <therbert@google.com>
Date: Mon, 31 Jan 2011 16:12:02 -0800
Subject: [PATCH] net: Check rps_flow_table when RPS map length is 1

In get_rps_cpu, add a check that the rps_flow_table for the device is
NULL before taking the fast path when the RPS map length is one.
Without this, RFS is effectively disabled if the map length is one,
which is not correct.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/core/dev.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index ddd5df2..283ed85 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2666,7 +2666,8 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 	map = rcu_dereference(rxqueue->rps_map);
 	if (map) {
-		if (map->len == 1) {
+		if (map->len == 1 &&
+		    !rcu_dereference_raw(rxqueue->rps_flow_table)) {
 			tcpu = map->cpus[0];
 			if (cpu_online(tcpu))
 				cpu = tcpu;
-- 
1.7.3.1


^ permalink raw reply related
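
The condition this patch introduces can be modelled in isolation. The sketch below is a user-space approximation: `struct rx_queue` and `can_use_fast_path()` are hypothetical, and plain pointers stand in for the RCU-protected ones in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the RPS CPU map. */
struct rps_map {
	unsigned int len;
	int cpus[4];
};

/* Simplified receive queue: a map and an optional flow table. */
struct rx_queue {
	const struct rps_map *rps_map;
	const void *rps_flow_table;  /* non-NULL when RFS is configured */
};

/*
 * The single-CPU shortcut is only valid when there is exactly one CPU
 * in the map and no flow table. With a flow table present the slow
 * path must run so RFS can steer the flow.
 */
static bool can_use_fast_path(const struct rx_queue *q)
{
	return q->rps_map && q->rps_map->len == 1 && !q->rps_flow_table;
}
```

Before the change, the equivalent of this predicate ignored the flow table, so a one-entry map always took the shortcut.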

* Re: rps_flow_table bug fix
From: David Miller @ 2011-02-01  0:24 UTC (permalink / raw)
  To: therbert; +Cc: netdev
In-Reply-To: <alpine.DEB.2.00.1101311618140.13796@pokey.mtv.corp.google.com>

From: Tom Herbert <therbert@google.com>
Date: Mon, 31 Jan 2011 16:21:43 -0800 (PST)

Tom, please set your email subject correctly.

> Subject: [PATCH] net: Check rps_flow_table when RPS map length is 1
> 
> In get_rps_cpu, add check that the rps_flow_table for the device is
> NULL when trying to take fast path when RPS map length is one.
> Without this, RFS is effectively disabled if map length is one which
> is not correct.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>

Applied, thanks.

^ permalink raw reply

* Re: Network performance with small packets
From: Steve Dobbelstein @ 2011-02-01  0:24 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, kvm, mashirle, netdev
In-Reply-To: <20110128121616.GA8374@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 01/28/2011 06:16:16 AM:

> OK, so thinking about it more, maybe the issue is this:
> tx becomes full. We process one request and interrupt the guest,
> then it adds one request and the queue is full again.
>
> Maybe the following will help it stabilize?
> By itself it does nothing, but if you set
> all the parameters to a huge value we will
> only interrupt when we see an empty ring.
> Which might be too much: pls try other values
> in the middle: e.g. make bufs half the ring,
> or bytes some small value, or packets some
> small value etc.
>
> Warning: completely untested.
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index aac05bc..6769cdc 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,13 @@
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
>
> +int tx_bytes_coalesce = 0;
> +module_param(tx_bytes_coalesce, int, 0644);
> +int tx_bufs_coalesce = 0;
> +module_param(tx_bufs_coalesce, int, 0644);
> +int tx_packets_coalesce = 0;
> +module_param(tx_packets_coalesce, int, 0644);
> +
>  enum {
>     VHOST_NET_VQ_RX = 0,
>     VHOST_NET_VQ_TX = 1,
> @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
>     int err, wmem;
>     size_t hdr_size;
>     struct socket *sock;
> +   int bytes_coalesced = 0;
> +   int bufs_coalesced = 0;
> +   int packets_coalesced = 0;
>
>     /* TODO: check that we are running from vhost_worker? */
>     sock = rcu_dereference_check(vq->private_data, 1);
> @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
>        if (err != len)
>           pr_debug("Truncated TX packet: "
>               " len %d != %zd\n", err, len);
> -      vhost_add_used_and_signal(&net->dev, vq, head, 0);
>        total_len += len;
> +      packets_coalesced += 1;
> +      bytes_coalesced += len;
> +      bufs_coalesced += in;

Should this instead be:
      bufs_coalesced += out;

Perusing the code, I see an earlier check that errors out of the loop
if "in" is non-zero.  After that check, "in" is not touched until it is
added to bufs_coalesced, so the addition always adds zero.  That means
bufs_coalesced never changes and can never trigger the conditions
below.

Or am I missing something?

> +      if (unlikely(packets_coalesced > tx_packets_coalesce ||
> +              bytes_coalesced > tx_bytes_coalesce ||
> +              bufs_coalesced > tx_bufs_coalesce))
> +         vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +      else
> +         vhost_add_used(vq, head, 0);
>        if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>           vhost_poll_queue(&vq->poll);
>           break;
>        }
>     }
>
> +   if (likely(packets_coalesced > tx_packets_coalesce ||
> +         bytes_coalesced > tx_bytes_coalesce ||
> +         bufs_coalesced > tx_bufs_coalesce))
> +      vhost_signal(&net->dev, vq);
>     mutex_unlock(&vq->mutex);
>  }
>

Steve D.
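
To make the concern concrete, here is a small userspace sketch of the
threshold test from the quoted patch.  It is an assumption-laden model,
not vhost code: the types and helpers (coalesce_state, coalesce_limits,
coalesce_account, coalesce_should_signal) are hypothetical names, and
the per-packet buffer counts are passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters mirroring the three *_coalesced variables. */
struct coalesce_state {
	int packets;
	int bytes;
	int bufs;
};

/* Hypothetical limits mirroring the three tx_*_coalesce parameters. */
struct coalesce_limits {
	int packets;
	int bytes;
	int bufs;
};

/* Account for one transmitted packet of len bytes using bufs descriptors. */
static void coalesce_account(struct coalesce_state *s, int len, int bufs)
{
	assert(len >= 0 && bufs >= 0);
	s->packets += 1;
	s->bytes += len;
	s->bufs += bufs;
}

/* Signal the guest once any counter exceeds its limit. */
static bool coalesce_should_signal(const struct coalesce_state *s,
				   const struct coalesce_limits *lim)
{
	return s->packets > lim->packets ||
	       s->bytes > lim->bytes ||
	       s->bufs > lim->bufs;
}
```

If every call passes bufs == 0 (the "in" case, assuming "in" is always
zero at that point), the bufs counter stays at zero and the bufs limit
can never fire.  Passing a non-zero count per packet (the "out" case)
lets it grow past the limit.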


^ permalink raw reply

* [PATCH 0/2] Consolidate ipv4 default route selection
From: David Miller @ 2011-02-01  0:25 UTC (permalink / raw)
  To: netdev

We've had two copies of this code for ages, and it has always driven
me crazy when I've seen these things.

With some minor changes, we can have one generic copy which is also
much more efficient than what we have right now.

Part of what makes this so easy to do right now is all of the RCU work
Eric has done in the FIB layers.  Since all of the data structures are
RCU protected, and fib_select_default() already runs inside an RCU
read-side critical section, no special locking is needed.

^ permalink raw reply

* [PATCH 1/2] ipv4: Remember FIB alias list head and table in lookup results.
From: David Miller @ 2011-02-01  0:25 UTC (permalink / raw)
  To: netdev


This will be used later to implement fib_select_default() in a
completely generic manner, instead of the current situation where the
default route is re-looked up in the TRIE/HASH table and then the
available aliases are analyzed.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/ip_fib.h     |    3 +++
 net/ipv4/fib_hash.c      |    2 +-
 net/ipv4/fib_lookup.h    |    2 +-
 net/ipv4/fib_semantics.c |    7 +++++--
 net/ipv4/fib_trie.c      |    8 ++++----
 5 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 2c0508a..f5199b0 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -96,12 +96,15 @@ struct fib_info {
 struct fib_rule;
 #endif
 
+struct fib_table;
 struct fib_result {
 	unsigned char	prefixlen;
 	unsigned char	nh_sel;
 	unsigned char	type;
 	unsigned char	scope;
 	struct fib_info *fi;
+	struct fib_table *table;
+	struct list_head *fa_head;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
 	struct fib_rule	*r;
 #endif
diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index b3acb04..0a88866 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -288,7 +288,7 @@ int fib_table_lookup(struct fib_table *tb,
 				if (f->fn_key != k)
 					continue;
 
-				err = fib_semantic_match(&f->fn_alias,
+				err = fib_semantic_match(tb, &f->fn_alias,
 						 flp, res,
 						 fz->fz_order, fib_flags);
 				if (err <= 0)
diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index c079cc0..d5c40d8 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -25,7 +25,7 @@ static inline void fib_alias_accessed(struct fib_alias *fa)
 }
 
 /* Exported by fib_semantics.c */
-extern int fib_semantic_match(struct list_head *head,
+extern int fib_semantic_match(struct fib_table *tb, struct list_head *head,
 			      const struct flowi *flp,
 			      struct fib_result *res, int prefixlen, int fib_flags);
 extern void fib_release_info(struct fib_info *);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 48e93a5..1bf6fb9 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -889,8 +889,9 @@ failure:
 }
 
 /* Note! fib_semantic_match intentionally uses  RCU list functions. */
-int fib_semantic_match(struct list_head *head, const struct flowi *flp,
-		       struct fib_result *res, int prefixlen, int fib_flags)
+int fib_semantic_match(struct fib_table *tb, struct list_head *head,
+		       const struct flowi *flp, struct fib_result *res,
+		       int prefixlen, int fib_flags)
 {
 	struct fib_alias *fa;
 	int nh_sel = 0;
@@ -954,6 +955,8 @@ out_fill_res:
 	res->type = fa->fa_type;
 	res->scope = fa->fa_scope;
 	res->fi = fa->fa_info;
+	res->table = tb;
+	res->fa_head = head;
 	if (!(fib_flags & FIB_LOOKUP_NOREF))
 		atomic_inc(&res->fi->fib_clntref);
 	return 0;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 0f28034..8cee5c8 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1340,7 +1340,7 @@ err:
 }
 
 /* should be called with rcu_read_lock */
-static int check_leaf(struct trie *t, struct leaf *l,
+static int check_leaf(struct fib_table *tb, struct trie *t, struct leaf *l,
 		      t_key key,  const struct flowi *flp,
 		      struct fib_result *res, int fib_flags)
 {
@@ -1356,7 +1356,7 @@ static int check_leaf(struct trie *t, struct leaf *l,
 		if (l->key != (key & ntohl(mask)))
 			continue;
 
-		err = fib_semantic_match(&li->falh, flp, res, plen, fib_flags);
+		err = fib_semantic_match(tb, &li->falh, flp, res, plen, fib_flags);
 
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 		if (err <= 0)
@@ -1398,7 +1398,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
 
 	/* Just a leaf? */
 	if (IS_LEAF(n)) {
-		ret = check_leaf(t, (struct leaf *)n, key, flp, res, fib_flags);
+		ret = check_leaf(tb, t, (struct leaf *)n, key, flp, res, fib_flags);
 		goto found;
 	}
 
@@ -1423,7 +1423,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
 		}
 
 		if (IS_LEAF(n)) {
-			ret = check_leaf(t, (struct leaf *)n, key, flp, res, fib_flags);
+			ret = check_leaf(tb, t, (struct leaf *)n, key, flp, res, fib_flags);
 			if (ret > 0)
 				goto backtrace;
 			goto found;
-- 
1.7.4
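
The idea of this patch, carrying the table and the alias list head in
the lookup result so a later pass need not look the route up again, can
be sketched in plain C.  Everything below is a simplified assumption:
the *_sketch types and functions are hypothetical, use a singly linked
list instead of the kernel's list_head, and ignore RCU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified alias entry (singly linked for brevity). */
struct alias_sketch {
	int type;
	struct alias_sketch *next;
};

struct table_sketch {
	int id;
	int lookups;			/* counts how often the table is searched */
	struct alias_sketch *aliases;
};

/* The result remembers where the match came from. */
struct result_sketch {
	struct alias_sketch *match;
	struct table_sketch *table;
	struct alias_sketch *fa_head;
};

/* Search the table once and record the table and list head in the result. */
static int table_lookup_sketch(struct table_sketch *tb, int type,
			       struct result_sketch *res)
{
	struct alias_sketch *fa;

	assert(tb != NULL && res != NULL);
	tb->lookups++;
	for (fa = tb->aliases; fa != NULL; fa = fa->next) {
		if (fa->type == type) {
			res->match = fa;
			res->table = tb;
			res->fa_head = tb->aliases;
			return 0;
		}
	}
	return -1;
}

/* A later pass walks the remembered list without touching the table. */
static int count_aliases_sketch(const struct result_sketch *res)
{
	const struct alias_sketch *fa;
	int n = 0;

	for (fa = res->fa_head; fa != NULL; fa = fa->next)
		n++;
	return n;
}
```

The second function only needs the result, which is the property the
follow-up patch relies on.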


^ permalink raw reply related

* [PATCH 2/2] ipv4: Consolidate all default route selection implementations.
From: David Miller @ 2011-02-01  0:25 UTC (permalink / raw)
  To: netdev


Both fib_trie and fib_hash have a local implementation of
fib_table_select_default().  This is completely unnecessary
code duplication.

Since we now remember the fib_table and the head of the fib
alias list of the default route, we can implement one single
generic version of this routine.

Looking at the fib_hash implementation, you may get the impression
that there can be multiple top-level routes in the table for the
default route.  There cannot: the insert code allows only one entry
in the zero prefix hash table, because all keys evaluate to zero and
all keys in a hash table must be unique.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/ip_fib.h     |    6 +---
 net/ipv4/fib_frontend.c  |   15 ---------
 net/ipv4/fib_hash.c      |   72 --------------------------------------------
 net/ipv4/fib_semantics.c |   56 ++++++++++++++++++++++++++++++++++
 net/ipv4/fib_trie.c      |   74 ----------------------------------------------
 net/ipv4/route.c         |    2 +-
 6 files changed, 58 insertions(+), 167 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index f5199b0..819d61c 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -158,9 +158,6 @@ extern int fib_table_delete(struct fib_table *, struct fib_config *);
 extern int fib_table_dump(struct fib_table *table, struct sk_buff *skb,
 			  struct netlink_callback *cb);
 extern int fib_table_flush(struct fib_table *table);
-extern void fib_table_select_default(struct fib_table *table,
-				     const struct flowi *flp,
-				     struct fib_result *res);
 extern void fib_free_table(struct fib_table *tb);
 
 
@@ -221,8 +218,7 @@ extern void		ip_fib_init(void);
 extern int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 			       struct net_device *dev, __be32 *spec_dst,
 			       u32 *itag, u32 mark);
-extern void fib_select_default(struct net *net, const struct flowi *flp,
-			       struct fib_result *res);
+extern void fib_select_default(struct fib_result *res);
 
 /* Exported by fib_semantics.c */
 extern int ip_fib_check_default(__be32 gw, struct net_device *dev);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 1d2cdd4..930768b 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -114,21 +114,6 @@ struct fib_table *fib_get_table(struct net *net, u32 id)
 }
 #endif /* CONFIG_IP_MULTIPLE_TABLES */
 
-void fib_select_default(struct net *net,
-			const struct flowi *flp, struct fib_result *res)
-{
-	struct fib_table *tb;
-	int table = RT_TABLE_MAIN;
-#ifdef CONFIG_IP_MULTIPLE_TABLES
-	if (res->r == NULL || res->r->action != FR_ACT_TO_TBL)
-		return;
-	table = res->r->table;
-#endif
-	tb = fib_get_table(net, table);
-	if (FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK)
-		fib_table_select_default(tb, flp, res);
-}
-
 static void fib_flush(struct net *net)
 {
 	int flushed = 0;
diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 0a88866..fadb602 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -302,78 +302,6 @@ out:
 	return err;
 }
 
-void fib_table_select_default(struct fib_table *tb,
-			      const struct flowi *flp, struct fib_result *res)
-{
-	int order, last_idx;
-	struct hlist_node *node;
-	struct fib_node *f;
-	struct fib_info *fi = NULL;
-	struct fib_info *last_resort;
-	struct fn_hash *t = (struct fn_hash *)tb->tb_data;
-	struct fn_zone *fz = t->fn_zones[0];
-	struct hlist_head *head;
-
-	if (fz == NULL)
-		return;
-
-	last_idx = -1;
-	last_resort = NULL;
-	order = -1;
-
-	rcu_read_lock();
-	head = rcu_dereference(fz->fz_hash);
-	hlist_for_each_entry_rcu(f, node, head, fn_hash) {
-		struct fib_alias *fa;
-
-		list_for_each_entry_rcu(fa, &f->fn_alias, fa_list) {
-			struct fib_info *next_fi = fa->fa_info;
-
-			if (fa->fa_scope != res->scope ||
-			    fa->fa_type != RTN_UNICAST)
-				continue;
-
-			if (next_fi->fib_priority > res->fi->fib_priority)
-				break;
-			if (!next_fi->fib_nh[0].nh_gw ||
-			    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
-				continue;
-
-			fib_alias_accessed(fa);
-
-			if (fi == NULL) {
-				if (next_fi != res->fi)
-					break;
-			} else if (!fib_detect_death(fi, order, &last_resort,
-						&last_idx, tb->tb_default)) {
-				fib_result_assign(res, fi);
-				tb->tb_default = order;
-				goto out;
-			}
-			fi = next_fi;
-			order++;
-		}
-	}
-
-	if (order <= 0 || fi == NULL) {
-		tb->tb_default = -1;
-		goto out;
-	}
-
-	if (!fib_detect_death(fi, order, &last_resort, &last_idx,
-				tb->tb_default)) {
-		fib_result_assign(res, fi);
-		tb->tb_default = order;
-		goto out;
-	}
-
-	if (last_idx >= 0)
-		fib_result_assign(res, last_resort);
-	tb->tb_default = last_idx;
-out:
-	rcu_read_unlock();
-}
-
 /* Insert node F to FZ. */
 static inline void fib_insert_node(struct fn_zone *fz, struct fib_node *f)
 {
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 1bf6fb9..b15857d 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1136,6 +1136,62 @@ int fib_sync_down_dev(struct net_device *dev, int force)
 	return ret;
 }
 
+/* Must be invoked inside of an RCU protected region.  */
+void fib_select_default(struct fib_result *res)
+{
+	struct fib_info *fi = NULL, *last_resort = NULL;
+	struct list_head *fa_head = res->fa_head;
+	struct fib_table *tb = res->table;
+	int order = -1, last_idx = -1;
+	struct fib_alias *fa;
+
+	list_for_each_entry_rcu(fa, fa_head, fa_list) {
+		struct fib_info *next_fi = fa->fa_info;
+
+		if (fa->fa_scope != res->scope ||
+		    fa->fa_type != RTN_UNICAST)
+			continue;
+
+		if (next_fi->fib_priority > res->fi->fib_priority)
+			break;
+		if (!next_fi->fib_nh[0].nh_gw ||
+		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
+			continue;
+
+		fib_alias_accessed(fa);
+
+		if (fi == NULL) {
+			if (next_fi != res->fi)
+				break;
+		} else if (!fib_detect_death(fi, order, &last_resort,
+					     &last_idx, tb->tb_default)) {
+			fib_result_assign(res, fi);
+			tb->tb_default = order;
+			goto out;
+		}
+		fi = next_fi;
+		order++;
+	}
+
+	if (order <= 0 || fi == NULL) {
+		tb->tb_default = -1;
+		goto out;
+	}
+
+	if (!fib_detect_death(fi, order, &last_resort, &last_idx,
+				tb->tb_default)) {
+		fib_result_assign(res, fi);
+		tb->tb_default = order;
+		goto out;
+	}
+
+	if (last_idx >= 0)
+		fib_result_assign(res, last_resort);
+	tb->tb_default = last_idx;
+out:
+	return;
+}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 
 /*
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 8cee5c8..16d589c 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1802,80 +1802,6 @@ void fib_free_table(struct fib_table *tb)
 	kfree(tb);
 }
 
-void fib_table_select_default(struct fib_table *tb,
-			      const struct flowi *flp,
-			      struct fib_result *res)
-{
-	struct trie *t = (struct trie *) tb->tb_data;
-	int order, last_idx;
-	struct fib_info *fi = NULL;
-	struct fib_info *last_resort;
-	struct fib_alias *fa = NULL;
-	struct list_head *fa_head;
-	struct leaf *l;
-
-	last_idx = -1;
-	last_resort = NULL;
-	order = -1;
-
-	rcu_read_lock();
-
-	l = fib_find_node(t, 0);
-	if (!l)
-		goto out;
-
-	fa_head = get_fa_head(l, 0);
-	if (!fa_head)
-		goto out;
-
-	if (list_empty(fa_head))
-		goto out;
-
-	list_for_each_entry_rcu(fa, fa_head, fa_list) {
-		struct fib_info *next_fi = fa->fa_info;
-
-		if (fa->fa_scope != res->scope ||
-		    fa->fa_type != RTN_UNICAST)
-			continue;
-
-		if (next_fi->fib_priority > res->fi->fib_priority)
-			break;
-		if (!next_fi->fib_nh[0].nh_gw ||
-		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
-			continue;
-
-		fib_alias_accessed(fa);
-
-		if (fi == NULL) {
-			if (next_fi != res->fi)
-				break;
-		} else if (!fib_detect_death(fi, order, &last_resort,
-					     &last_idx, tb->tb_default)) {
-			fib_result_assign(res, fi);
-			tb->tb_default = order;
-			goto out;
-		}
-		fi = next_fi;
-		order++;
-	}
-	if (order <= 0 || fi == NULL) {
-		tb->tb_default = -1;
-		goto out;
-	}
-
-	if (!fib_detect_death(fi, order, &last_resort, &last_idx,
-				tb->tb_default)) {
-		fib_result_assign(res, fi);
-		tb->tb_default = order;
-		goto out;
-	}
-	if (last_idx >= 0)
-		fib_result_assign(res, last_resort);
-	tb->tb_default = last_idx;
-out:
-	rcu_read_unlock();
-}
-
 static int fn_trie_dump_fa(t_key key, int plen, struct list_head *fah,
 			   struct fib_table *tb,
 			   struct sk_buff *skb, struct netlink_callback *cb)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index b1e5d3a..242a3de 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2711,7 +2711,7 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 	else
 #endif
 	if (!res.prefixlen && res.type == RTN_UNICAST && !fl.oif)
-		fib_select_default(net, &fl, &res);
+		fib_select_default(&res);
 
 	if (!fl.fl4_src)
 		fl.fl4_src = FIB_RES_PREFSRC(res);
-- 
1.7.4
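
The remark above that all keys in the zero prefix table evaluate to
zero follows from the netmask for prefix length 0 being all zeroes.  A
minimal sketch, using hypothetical helper names and host-order
addresses rather than the kernel's own key helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Netmask for a prefix length, in host byte order. */
static uint32_t prefix_mask_sketch(unsigned int plen)
{
	assert(plen <= 32);
	/* A 32-bit shift by 32 is undefined in C, so handle 0 explicitly. */
	if (plen == 0)
		return 0;
	return (uint32_t)(~(uint32_t)0 << (32 - plen));
}

/* Table key for an address under a given prefix length. */
static uint32_t route_key_sketch(uint32_t addr, unsigned int plen)
{
	return addr & prefix_mask_sketch(plen);
}
```

For any two addresses, route_key_sketch(addr, 0) is 0, so a table keyed
this way can hold only one node for the default route.  The alternatives
live on that node's alias list.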


^ permalink raw reply related
