Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH -next] can: softing_cs needs slab.h
From: David Miller @ 2011-02-09 20:43 UTC (permalink / raw)
  To: randy.dunlap-veTT2BtV2gBXrIkS9f7CXA
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w, sfr-3FnU+UHB4dNDw9hX6IcOSA,
	linux-next-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20110209084456.802370fa.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

From: Randy Dunlap <randy.dunlap-veTT2BtV2gBXrIkS9f7CXA@public.gmane.org>
Date: Wed, 9 Feb 2011 08:44:56 -0800

> From: Randy Dunlap <randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> 
> softing_cs.c uses kzalloc & kfree, so it needs to include linux/slab.h.
> 
> drivers/net/can/softing/softing_cs.c:234: error: implicit declaration of function 'kfree'
> drivers/net/can/softing/softing_cs.c:271: error: implicit declaration of function 'kzalloc'
> 
> Signed-off-by: Randy Dunlap <randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

Applied, thanks Randy.

^ permalink raw reply

* Re: [net-2.6][2.6.38-rc2] panic during stress testing
From: David Miller @ 2011-02-09 20:46 UTC (permalink / raw)
  To: jesse.brandeburg; +Cc: netdev, jeffrey.e.pieper
In-Reply-To: <alpine.WNT.2.00.1102081611130.5424@JBRANDEB-DESK2.amr.corp.intel.com>

From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com>
Date: Wed, 9 Feb 2011 10:12:03 -0800 (Pacific Standard Time)

> neigh appears to be zero? since the check just above in dst_destroy was 
> checking it against NULL already, maybe we have a race, with some other 
> free of neigh (assigned from dst->neighbor)
> 
> This is quickly getting beyond me, I tried to check for some changes 
> around dst->neighbor but didn't see anything recent.

"neigh" is non-zero, however it is freed memory and thus unmapped by
the "use after free" debugging code.

That's why it OOPS's.

^ permalink raw reply

* bleeding edge kernels misidentify my Broadcom BCM4401-B0
From: Celejar @ 2011-02-09 20:58 UTC (permalink / raw)
  To: netdev

Hi,

Recently, while using bleeding edge kernels (built from git pulls of
vanillas sources from kernel.org), my Broadcom BCM4401-B0 Ethernet
card, normally driven by the b44 driver, has stopped working.  Poking
around, I discovered that the card is no longer correctly identified;
it now shows up (with 'lspci -v' as root) as this bizarre device:

06:01.0 Network and computing encryption device: Broadcom Corporation Device 0010 (rev 02) (prog-if 10)
        Subsystem: Allied Telesis, Inc Device 0010
        Flags: fast devsel, IRQ 10
        Memory at d0000000 (32-bit, non-prefetchable) [disabled] [size=8K]
        Capabilities: [40] Power Management version 2

Using a stock Debian kernel (2.6.32-5-686), it shows up (and works)
correctly:

06:01.0 Ethernet controller: Broadcom Corporation BCM4401-B0 100Base-TX (rev 02)
        Subsystem: Acer Incorporated [ALI] Device 0090
        Flags: bus master, fast devsel, latency 64, IRQ 21
        Memory at d0000000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: [40] Power Management version 2
        Kernel driver in use: b44

What on earth is this all about?

I tried another kernel I had lying around,
2.6.37-rc6-lizzie-00132-g55ec86f (built from kernel.org vanilla
sources), and it also works fine.  I can do a bisection if necessary,
but I figured I'd check here first.

[Please cc. me on replies.]

Celejar
-- 
foffl.sourceforge.net - Feeds OFFLine, an offline RSS/Atom aggregator
mailmin.sourceforge.net - remote access via secure (OpenPGP) email
ssuds.sourceforge.net - A Simple Sudoku Solver and Generator

^ permalink raw reply

* Re: Linux 2.6.38-rc4 (hysdn: BUG)
From: Randy Dunlap @ 2011-02-09 21:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: netdev, Linux Kernel Mailing List, Karsten Keil
In-Reply-To: <AANLkTimUg8Dm9mZotubcgPHz8_at=_hnbeWUo-LfSALp@mail.gmail.com>

On Wed, 9 Feb 2011 11:44:00 -0800 Linus Torvalds wrote:

> On Wed, Feb 9, 2011 at 9:24 AM, Randy Dunlap <randy.dunlap@oracle.com> wrote:
> >
> > on x86_64.  no HYSDN hardware found (correct).
> > Nearly allmodconfig.
> >
> >
> > [   65.397577] HYSDN: module Rev: 1.6.6.6 loaded
> > [   65.397584] HYSDN: network interface Rev: 1.8.6.4
> > [   65.398057] HYSDN: 0 card(s) found.
> > [   65.398121] BUG: unable to handle kernel paging request at ffffffffa06c99f0
> > [   65.398269] IP: [<ffffffffa06c68ba>] hysdn_getrev+0x2e/0x50 [hysdn]
> > [   65.398379] PGD 1a14067 PUD 1a18063 PMD 6f6c1067 PTE 800000006ce8c161
> > [   65.398613] Oops: 0003 [#1] SMP DEBUG_PAGEALLOC
> > [   65.400030]
> > [   65.400030] Pid: 2497, comm: modprobe Not tainted 2.6.38-rc4 #1 0TY565/OptiPlex 745
> > [   65.400030] RIP: 0010:[<ffffffffa06c68ba>]  [<ffffffffa06c68ba>] hysdn_getrev+0x2e/0x50 [hysdn]
> > [   65.400030] RSP: 0018:ffff88006eec1e68  EFLAGS: 00010206
> > [   65.400030] RAX: ffffffffa06c99f1 RBX: ffffffffa06c99e9 RCX: ffff88007c4159a0
> 
> The instruction sequence decodes to
> 
>   1e:	be 24 00 00 00       	mov    $0x24,%esi
>   23:	48 89 df             	mov    %rbx,%rdi
>   26:	e8 5b 39 c0 e0       	callq  0xffffffffe0c03986
>   2b:*	c6 40 ff 00          	movb   $0x0,-0x1(%rax)     <-- trapping instruction
> 
> which seems to be this
> 
>                 p = strchr(rev, '$');
>                 *--p = 0;
> 
> code. And yes, it's total crap, because while "p" and "rev" are "char
> *", the string that is passed in is actually of type "const char *",
> so that function is seriously broken. It's also seriously broken to
> not test that "p" is non-NULL - the function would just break if there
> is a colon in the string but not a '$'.
> 
> And hysdn_procconf_init() passes in a constant string to the thing:
> 
>     static char *hysdn_procconf_revision = "$Revision: 1.8.6.4 $";
> 
> What happens is that it breaks when we mark the constant section as
> read-only, because you have CONFIG_DEBUG_RODATA enabled.
> 
> So the fix seems to be to
>  - fix the prototype for hysdn_getrev() to not have "const".
>  - fix hysdn_procconf_init() to not pass in a constant string to it
> 
> The minimal patch would appear to be something like the appended. UNTESTED!

for your patch:

Tested-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com>

> Btw, all of this code seems to go back to before the git history even
> started, so it doesn't seem to be new. I assume you haven't tried
> booting these all-module kernels before? Or is it just the
> DEBUG_RODATA thing that is new for you?

Neither is new.  I tested and reported many-modules on 2.6.37-rc1 and
reported these 2 bugs:

https://bugzilla.kernel.org/show_bug.cgi?id=22912
https://bugzilla.kernel.org/show_bug.cgi?id=22882

and that was with CONFIG_DEBUG_RODATA=y.
I don't know how hysdn was missed at that time.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: STMMAC driver: NFS Problem on 2.6.37
From: Chuck Lever @ 2011-02-09 21:26 UTC (permalink / raw)
  To: Brian Downing
  Cc: deepaksi, Armando VISCONTI, Trond Myklebust,
	netdev-u79uwXL29TY76Z2rM5mHXA, Linux NFS Mailing List,
	Shiraz HASHIM, Viresh KUMAR, Peppe CAVALLARO, amitgoel
In-Reply-To: <20110209205855.GB6402-QEOkiq82tQWoLK6CJbI5/KxOck334EZe@public.gmane.org>


On Feb 9, 2011, at 3:58 PM, Brian Downing wrote:

> On Wed, Feb 09, 2011 at 03:12:22PM -0500, Chuck Lever wrote:
>> Based on your console logs, I see that the working case uses UDP to
>> contact the server's mountd, but the failing case uses TCP.  You can
>> try explicitly specifying "proto=udp" to force the use of UDP, to test
>> this theory.
> 
> This does indeed make it work again for me, thanks!
> 
>> Meanwhile, the patch description explicitly states that the default
>> mount option settings have changed.  Does it make sense to change the
>> default behavior of NFSROOT mounts to use UDP again?  I don't see
>> another way to make this process more reliable across NIC
>> initialization.  If this is considered a regression, we can make a
>> patch for 2.6.38-rc and 2.6.37.
> 
> I only use nfsroot for development, so I don't have a terribly strong
> opinion.  I would point out though that the default u-boot parameters
> for nfsrooting a lot of boards will no longer work at this point, so if
> it's not patched to work again without specifying nfs options I think
> there should at least be a note in the documentation and possibly a
> "maybe try proto=udp?" console message on failure.
> 
> I assume it's not feasable to either wait until the chosen interface's
> link is ready before trying to mount nfsroot, or retrying TCP-based
> connections a little bit more aggressively/at all?

Our goal is to use the same mount logic for both normal user space mounts and for NFSROOT (that was the purpose of the patch series this particular patch comes from).  It's exceptionally difficult to add a special case for retrying TCP connections here, as that would change the behavior of user space mounts, which often want to fail quickly, and don't need to worry about NIC initialization.

Sounds like the right thing to do is restore the default UDP behavior.  I'll cook up a patch.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Linux 2.6.38-rc4 (hysdn: BUG)
From: David Miller @ 2011-02-09 21:57 UTC (permalink / raw)
  To: randy.dunlap; +Cc: torvalds, netdev, linux-kernel, isdn
In-Reply-To: <20110209132529.76927f5f.randy.dunlap@oracle.com>

From: Randy Dunlap <randy.dunlap@oracle.com>
Date: Wed, 9 Feb 2011 13:25:29 -0800

> On Wed, 9 Feb 2011 11:44:00 -0800 Linus Torvalds wrote:
> 
>> On Wed, Feb 9, 2011 at 9:24 AM, Randy Dunlap <randy.dunlap@oracle.com> wrote:
>> >
>> > on x86_64.  no HYSDN hardware found (correct).
>> > Nearly allmodconfig.
>> >
>> >
>> > [   65.397577] HYSDN: module Rev: 1.6.6.6 loaded
>> > [   65.397584] HYSDN: network interface Rev: 1.8.6.4
>> > [   65.398057] HYSDN: 0 card(s) found.
>> > [   65.398121] BUG: unable to handle kernel paging request at ffffffffa06c99f0
>> > [   65.398269] IP: [<ffffffffa06c68ba>] hysdn_getrev+0x2e/0x50 [hysdn]
>> > [   65.398379] PGD 1a14067 PUD 1a18063 PMD 6f6c1067 PTE 800000006ce8c161
>> > [   65.398613] Oops: 0003 [#1] SMP DEBUG_PAGEALLOC
>> > [   65.400030]
>> > [   65.400030] Pid: 2497, comm: modprobe Not tainted 2.6.38-rc4 #1 0TY565/OptiPlex 745
>> > [   65.400030] RIP: 0010:[<ffffffffa06c68ba>]  [<ffffffffa06c68ba>] hysdn_getrev+0x2e/0x50 [hysdn]
>> > [   65.400030] RSP: 0018:ffff88006eec1e68  EFLAGS: 00010206
>> > [   65.400030] RAX: ffffffffa06c99f1 RBX: ffffffffa06c99e9 RCX: ffff88007c4159a0
>> 
>> The instruction sequence decodes to
>> 
>>   1e:	be 24 00 00 00       	mov    $0x24,%esi
>>   23:	48 89 df             	mov    %rbx,%rdi
>>   26:	e8 5b 39 c0 e0       	callq  0xffffffffe0c03986
>>   2b:*	c6 40 ff 00          	movb   $0x0,-0x1(%rax)     <-- trapping instruction
>> 
>> which seems to be this
>> 
>>                 p = strchr(rev, '$');
>>                 *--p = 0;
>> 
>> code. And yes, it's total crap, because while "p" and "rev" are "char
>> *", the string that is passed in is actually of type "const char *",
>> so that function is seriously broken. It's also seriously broken to
>> not test that "p" is non-NULL - the function would just break if there
>> is a colon in the string but not a '$'.
>> 
>> And hysdn_procconf_init() passes in a constant string to the thing:
>> 
>>     static char *hysdn_procconf_revision = "$Revision: 1.8.6.4 $";
>> 
>> What happens is that it breaks when we mark the constant section as
>> read-only, because you have CONFIG_DEBUG_RODATA enabled.
>> 
>> So the fix seems to be to
>>  - fix the prototype for hysdn_getrev() to not have "const".
>>  - fix hysdn_procconf_init() to not pass in a constant string to it
>> 
>> The minimal patch would appear to be something like the appended. UNTESTED!
> 
> for your patch:
> 
> Tested-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com>

This stuff just prints out a CVS revision string that hasn't changed
in 10 years into the kernel log.

I propose we just kill this stuff off completely.

I note that there is code in other places of this driver that copy
the read-only revision string into a local string buffer then pass
it into the hysdn_getrev() function, it just doesn't happen in this
one spot.

Anyways, I think I'll fix this like so:

--------------------
isdn: hysdn: Kill (partially buggy) CVS regision log reporting.

Some cases try to modify const strings, and in any event the
CVS revision strings have not changed in over ten years making
these printouts completely worthless.

Just kill all of this stuff off.

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/isdn/hysdn/hysdn_defs.h     |    2 --
 drivers/isdn/hysdn/hysdn_init.c     |   26 +-------------------------
 drivers/isdn/hysdn/hysdn_net.c      |    3 ---
 drivers/isdn/hysdn/hysdn_procconf.c |    3 +--
 4 files changed, 2 insertions(+), 32 deletions(-)

diff --git a/drivers/isdn/hysdn/hysdn_defs.h b/drivers/isdn/hysdn/hysdn_defs.h
index 729df40..18b801a 100644
--- a/drivers/isdn/hysdn/hysdn_defs.h
+++ b/drivers/isdn/hysdn/hysdn_defs.h
@@ -227,7 +227,6 @@ extern hysdn_card *card_root;	/* pointer to first card */
 /*************************/
 /* im/exported functions */
 /*************************/
-extern char *hysdn_getrev(const char *);
 
 /* hysdn_procconf.c */
 extern int hysdn_procconf_init(void);	/* init proc config filesys */
@@ -259,7 +258,6 @@ extern int hysdn_tx_cfgline(hysdn_card *, unsigned char *,
 
 /* hysdn_net.c */
 extern unsigned int hynet_enable; 
-extern char *hysdn_net_revision;
 extern int hysdn_net_create(hysdn_card *);	/* create a new net device */
 extern int hysdn_net_release(hysdn_card *);	/* delete the device */
 extern char *hysdn_net_getname(hysdn_card *);	/* get name of net interface */
diff --git a/drivers/isdn/hysdn/hysdn_init.c b/drivers/isdn/hysdn/hysdn_init.c
index b7cc5c2..0ab42ac 100644
--- a/drivers/isdn/hysdn/hysdn_init.c
+++ b/drivers/isdn/hysdn/hysdn_init.c
@@ -36,7 +36,6 @@ MODULE_DESCRIPTION("ISDN4Linux: Driver for HYSDN cards");
 MODULE_AUTHOR("Werner Cornelius");
 MODULE_LICENSE("GPL");
 
-static char *hysdn_init_revision = "$Revision: 1.6.6.6 $";
 static int cardmax;		/* number of found cards */
 hysdn_card *card_root = NULL;	/* pointer to first card */
 static hysdn_card *card_last = NULL;	/* pointer to first card */
@@ -49,25 +48,6 @@ static hysdn_card *card_last = NULL;	/* pointer to first card */
 /* Additionally newer versions may be activated without rebooting.          */
 /****************************************************************************/
 
-/******************************************************/
-/* extract revision number from string for log output */
-/******************************************************/
-char *
-hysdn_getrev(const char *revision)
-{
-	char *rev;
-	char *p;
-
-	if ((p = strchr(revision, ':'))) {
-		rev = p + 2;
-		p = strchr(rev, '$');
-		*--p = 0;
-	} else
-		rev = "???";
-	return rev;
-}
-
-
 /****************************************************************************/
 /* init_module is called once when the module is loaded to do all necessary */
 /* things like autodetect...                                                */
@@ -175,13 +155,9 @@ static int hysdn_have_procfs;
 static int __init
 hysdn_init(void)
 {
-	char tmp[50];
 	int rc;
 
-	strcpy(tmp, hysdn_init_revision);
-	printk(KERN_NOTICE "HYSDN: module Rev: %s loaded\n", hysdn_getrev(tmp));
-	strcpy(tmp, hysdn_net_revision);
-	printk(KERN_NOTICE "HYSDN: network interface Rev: %s \n", hysdn_getrev(tmp));
+	printk(KERN_NOTICE "HYSDN: module loaded\n");
 
 	rc = pci_register_driver(&hysdn_pci_driver);
 	if (rc)
diff --git a/drivers/isdn/hysdn/hysdn_net.c b/drivers/isdn/hysdn/hysdn_net.c
index feec8d8..11f2cce 100644
--- a/drivers/isdn/hysdn/hysdn_net.c
+++ b/drivers/isdn/hysdn/hysdn_net.c
@@ -26,9 +26,6 @@
 unsigned int hynet_enable = 0xffffffff; 
 module_param(hynet_enable, uint, 0);
 
-/* store the actual version for log reporting */
-char *hysdn_net_revision = "$Revision: 1.8.6.4 $";
-
 #define MAX_SKB_BUFFERS 20	/* number of buffers for keeping TX-data */
 
 /****************************************************************************/
diff --git a/drivers/isdn/hysdn/hysdn_procconf.c b/drivers/isdn/hysdn/hysdn_procconf.c
index 96b3e39..5fe83bd 100644
--- a/drivers/isdn/hysdn/hysdn_procconf.c
+++ b/drivers/isdn/hysdn/hysdn_procconf.c
@@ -23,7 +23,6 @@
 #include "hysdn_defs.h"
 
 static DEFINE_MUTEX(hysdn_conf_mutex);
-static char *hysdn_procconf_revision = "$Revision: 1.8.6.4 $";
 
 #define INFO_OUT_LEN 80		/* length of info line including lf */
 
@@ -404,7 +403,7 @@ hysdn_procconf_init(void)
 		card = card->next;	/* next entry */
 	}
 
-	printk(KERN_NOTICE "HYSDN: procfs Rev. %s initialised\n", hysdn_getrev(hysdn_procconf_revision));
+	printk(KERN_NOTICE "HYSDN: procfs initialised\n");
 	return (0);
 }				/* hysdn_procconf_init */
 
-- 
1.7.4


^ permalink raw reply related

* Re: Linux 2.6.38-rc4 (hysdn: BUG)
From: Linus Torvalds @ 2011-02-09 22:00 UTC (permalink / raw)
  To: David Miller; +Cc: randy.dunlap, netdev, linux-kernel, isdn
In-Reply-To: <20110209.135751.112605587.davem@davemloft.net>

On Wed, Feb 9, 2011 at 1:57 PM, David Miller <davem@davemloft.net> wrote:
>
> I propose we just kill this stuff off completely.
>
> I note that there is code in other places of this driver that copy
> the read-only revision string into a local string buffer then pass
> it into the hysdn_getrev() function, it just doesn't happen in this
> one spot.
>
> Anyways, I think I'll fix this like so:

Ack from me. Removing terminally buggy and pointless code is certainly
better than trying to fix it up. No amount of lipstick will make that
pig look good.

                      Linus

^ permalink raw reply

* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: David Miller @ 2011-02-09 22:03 UTC (permalink / raw)
  To: dfeng; +Cc: netdev, eric.dumazet, therbert, ebiederm, shemminger, ddvlad
In-Reply-To: <1297248769-28530-1-git-send-email-dfeng@redhat.com>

From: Xiaotian feng <dfeng@redhat.com>
Date: Wed,  9 Feb 2011 18:52:49 +0800

> From: Xiaotian Feng <dfeng@redhat.com>
> 
> commit a512b92 adds sysfs entry for net device group, but
> before this commit, tun also uses group sysfs, so after this
> commit checkin, kernel warns like this:
>     sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
> 
> Since tun has used this for years, rename sysfs under tun might
> break existing userspace, so rename group sysfs entry for net device
> group is a better choice.
> 
> Signed-off-by: Xiaotian Feng <dfeng@redhat.com>

I don't think we have much choice in this matter, so I have applied
this patch, thanks!

^ permalink raw reply

* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: David Miller @ 2011-02-09 22:05 UTC (permalink / raw)
  To: dfeng; +Cc: netdev, eric.dumazet, therbert, ebiederm, shemminger, ddvlad
In-Reply-To: <20110209.140323.39178091.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 09 Feb 2011 14:03:23 -0800 (PST)

> From: Xiaotian feng <dfeng@redhat.com>
> Date: Wed,  9 Feb 2011 18:52:49 +0800
> 
>> From: Xiaotian Feng <dfeng@redhat.com>
>> 
>> commit a512b92 adds sysfs entry for net device group, but
>> before this commit, tun also uses group sysfs, so after this
>> commit checkin, kernel warns like this:
>>     sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
>> 
>> Since tun has used this for years, rename sysfs under tun might
>> break existing userspace, so rename group sysfs entry for net device
>> group is a better choice.
>> 
>> Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
> 
> I don't think we have much choice in this matter, so I have applied
> this patch, thanks!

Wait, you didn't even build test this patch?!?!?!?!

net/core/net-sysfs.c: In function ‘format_netdev_group’:
net/core/net-sysfs.c:298: error: ‘const struct net_device’ has no member named ‘netdev_group’
net/core/net-sysfs.c: At top level:
net/core/net-sysfs.c:333: error: ‘show_group’ undeclared here (not in a function)

"RFC" doesn't preclude you from at least build testing patches you
post.

Sigh...




^ permalink raw reply

* RE: [net-next-2.6 PATCH 0/5] enic: updates to version 2.1.1.6
From: Vasanthy Kolluri (vkolluri) @ 2011-02-09 23:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110207.113932.226768580.davem@davemloft.net>

Hi Dave,

Sorry for the trouble. Will handle such cases better next time.
Thanks for taking care of it this time.

Thanks
Vasanthy

-----Original Message-----
From: David Miller [mailto:davem@davemloft.net] 
Sent: Monday, February 07, 2011 11:40 AM
To: Vasanthy Kolluri (vkolluri)
Cc: netdev@vger.kernel.org
Subject: Re: [net-next-2.6 PATCH 0/5] enic: updates to version 2.1.1.6

From: "Vasanthy Kolluri (vkolluri)" <vkolluri@cisco.com>
Date: Mon, 7 Feb 2011 11:32:08 -0800

> Hi David,
> 
> The below patch series should apply clean on top of another enic patch
> "enic: Decouple mac address registration and deregistration from port
> profile set operation" submitted earlier by roprabhu@cisco.com. Just
> want to ensure that it's not missed.
> 
> Please let me know if there is still a need to re-spin.

How in the world was I supposed to know about this dependency?

Tell me.

^ permalink raw reply

* [PATCH 1/3 v2] pch_can: fix 800k comms issue
From: Tomoya MORINAGA @ 2011-02-09 23:56 UTC (permalink / raw)
  To: wg-5Yr1BZd7O62+XT7JhA+gdA, socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	toshiharu-linux-ECg8zkTtlr0C6LszWs/t0g,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w

Previous patch "[PATCH 1/3] pch_can: fix 800k comms issue" is wrong.
I should have modified tseg1_min not tseg2_min.
Please discard previous patch and use this patch.

Currently, 800k comms fails since prop_seg set zero.
(EG20T PCH CAN register of prop_seg must be set more than 1)
To prevent prop_seg set to zero, change tseg2_min 1 to 2.

Signed-off-by: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
---
 drivers/net/can/pch_can.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
index c42e972..0534eed 100644
--- a/drivers/net/can/pch_can.c
+++ b/drivers/net/can/pch_can.c
@@ -185,7 +185,7 @@ struct pch_can_priv {
 
 static struct can_bittiming_const pch_can_bittiming_const = {
 	.name = KBUILD_MODNAME,
-	.tseg1_min = 1,
+	.tseg1_min = 2,
 	.tseg1_max = 16,
 	.tseg2_min = 1,
 	.tseg2_max = 8,
-- 
1.7.3.4

^ permalink raw reply related

* Re: [PATCH 1/3 v2] pch_can: fix 800k comms issue
From: David Miller @ 2011-02-09 23:57 UTC (permalink / raw)
  To: tomoya-linux-ECg8zkTtlr0C6LszWs/t0g
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	toshiharu-linux-ECg8zkTtlr0C6LszWs/t0g,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <1297295776-3735-1-git-send-email-tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>

From: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
Date: Thu, 10 Feb 2011 08:56:16 +0900

> Previous patch "[PATCH 1/3] pch_can: fix 800k comms issue" is wrong.
> I should have modified tseg1_min not tseg2_min.
> Please discard previous patch and use this patch.

I can't "discard" the previous patch as it's already pushed upstream
into my GIT tree and therefore cannot be reverted.

You need to send a relative fix-up patch.

^ permalink raw reply

* [PATCH] pch_can: fix tseg1/tseg2 setting issue
From: Tomoya MORINAGA @ 2011-02-10  0:39 UTC (permalink / raw)
  To: wg-5Yr1BZd7O62+XT7JhA+gdA, socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	toshiharu-linux-ECg8zkTtlr0C6LszWs/t0g,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w

Previous patch "[PATCH 1/3] pch_can: fix 800k comms issue" is wrong.
I should have modified tseg1_min not tseg2_min.
This patch reverts tseg2_min to 1 and set tseg1_min to 2.

Signed-off-by: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
---
 drivers/net/can/pch_can.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
index 7d8bc12..e54712b 100644
--- a/drivers/net/can/pch_can.c
+++ b/drivers/net/can/pch_can.c
@@ -185,9 +185,9 @@ struct pch_can_priv {
 
 static struct can_bittiming_const pch_can_bittiming_const = {
 	.name = KBUILD_MODNAME,
-	.tseg1_min = 1,
+	.tseg1_min = 2,
 	.tseg1_max = 16,
-	.tseg2_min = 2,
+	.tseg2_min = 1,
 	.tseg2_max = 8,
 	.sjw_max = 4,
 	.brp_min = 1,
-- 
1.7.3.4

^ permalink raw reply related

* Re: [PATCH] pch_can: fix tseg1/tseg2 setting issue
From: David Miller @ 2011-02-10  0:46 UTC (permalink / raw)
  To: tomoya-linux-ECg8zkTtlr0C6LszWs/t0g
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	toshiharu-linux-ECg8zkTtlr0C6LszWs/t0g,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <1297298399-6250-1-git-send-email-tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>

From: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
Date: Thu, 10 Feb 2011 09:39:59 +0900

> Previous patch "[PATCH 1/3] pch_can: fix 800k comms issue" is wrong.
> I should have modified tseg1_min not tseg2_min.
> This patch reverts tseg2_min to 1 and set tseg1_min to 2.
> 
> Signed-off-by: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] virtio-net: add schedule check to napi_enable call in refill_work
From: Rusty Russell @ 2011-02-10  1:31 UTC (permalink / raw)
  To: virtualization; +Cc: Ken Stailey, netdev, Bruce Rogers
In-Reply-To: <132885.30568.qm@web110313.mail.gq1.yahoo.com>

On Thu, 10 Feb 2011 06:59:25 am Ken Stailey wrote:
> Justification:
> 
> Impact: Under heavy network I/O load virtio-net driver crashes making VM guest unusable.

Hmm, this went badly wrong.  I acked this patch, and it was mailed to
netdev six months ago.

Bruce's patch used spaces instead of tabs, but that should not have caused
it to be dropped.  I've taken that and ported it forwards, will repost now.

Thanks for picking this up off the floor!
Rusty.

^ permalink raw reply

* [PATCH] virtio_net: Add schedule check to napi_enable call
From: Rusty Russell @ 2011-02-10  2:02 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev, David Miller, virtualization, Ken Stailey

From: "Bruce Rogers" <brogers@novell.com>

Under harsh testing conditions, including low memory, the guest would
stop receiving packets. With this patch applied we no longer see any
problems in the driver while performing these tests for extended periods
of time.

Make sure napi is scheduled subsequent to each napi_enable.

Signed-off-by: Bruce Rogers <brogers@novell.com>
Signed-off-by: Olaf Kirch <okir@suse.de>
Cc: stable@kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/net/virtio_net.c |   27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -446,6 +446,20 @@ static void skb_recv_done(struct virtque
 	}
 }
 
+static void virtnet_napi_enable(struct virtnet_info *vi)
+{
+	napi_enable(&vi->napi);
+
+	/* If all buffers were filled by other side before we napi_enabled, we
+	 * won't get another interrupt, so process any outstanding packets
+	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
+	 * We synchronize against interrupts via NAPI_STATE_SCHED */
+	if (napi_schedule_prep(&vi->napi)) {
+		virtqueue_disable_cb(vi->rvq);
+		__napi_schedule(&vi->napi);
+	}
+}
+
 static void refill_work(struct work_struct *work)
 {
 	struct virtnet_info *vi;
@@ -454,7 +468,7 @@ static void refill_work(struct work_stru
 	vi = container_of(work, struct virtnet_info, refill.work);
 	napi_disable(&vi->napi);
 	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	napi_enable(&vi->napi);
+	virtnet_napi_enable(vi);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
@@ -638,16 +652,7 @@ static int virtnet_open(struct net_devic
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
-	napi_enable(&vi->napi);
-
-	/* If all buffers were filled by other side before we napi_enabled, we
-	 * won't get another interrupt, so process any outstanding packets
-	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
-	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(vi->rvq);
-		__napi_schedule(&vi->napi);
-	}
+	virtnet_napi_enable(vi);
 	return 0;
 }
 

^ permalink raw reply

* Re: [PATCH] virtio-net: add schedule check to napi_enable call in refill_work
From: Bruce Rogers @ 2011-02-10  2:44 UTC (permalink / raw)
  To: virtualization, Rusty Russell; +Cc: netdev
In-Reply-To: <201102101201.19656.rusty@rustcorp.com.au>

 >>> On 2/9/2011 at 06:31 PM, Rusty Russell <rusty@rustcorp.com.au> wrote: 
> On Thu, 10 Feb 2011 06:59:25 am Ken Stailey wrote:
>> Justification:
>> 
>> Impact: Under heavy network I/O load virtio-net driver crashes making VM 
> guest unusable.
> 
> Hmm, this went badly wrong.  I acked this patch, and it was mailed to
> netdev six months ago.
> 
> Bruce's patch used spaces instead of tabs, but that should not have caused
> it to be dropped.  I've taken that and ported it forwards, will repost now.
> 
> Thanks for picking this up off the floor!
> Rusty.

Thanks for taking care of that!

Bruce

^ permalink raw reply

* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: Xiaotian Feng @ 2011-02-10  2:54 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eric.dumazet, therbert, ebiederm, shemminger, ddvlad
In-Reply-To: <20110209.140558.59676278.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]

On 02/10/2011 06:05 AM, David Miller wrote:
> From: David Miller<davem@davemloft.net>
> Date: Wed, 09 Feb 2011 14:03:23 -0800 (PST)
>
>> From: Xiaotian feng<dfeng@redhat.com>
>> Date: Wed,  9 Feb 2011 18:52:49 +0800
>>
>>> From: Xiaotian Feng<dfeng@redhat.com>
>>>
>>> commit a512b92 adds sysfs entry for net device group, but
>>> before this commit, tun also uses group sysfs, so after this
>>> commit checkin, kernel warns like this:
>>>      sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
>>>
>>> Since tun has used this for years, rename sysfs under tun might
>>> break existing userspace, so rename group sysfs entry for net device
>>> group is a better choice.
>>>
>>> Signed-off-by: Xiaotian Feng<dfeng@redhat.com>
>>
>> I don't think we have much choice in this matter, so I have applied
>> this patch, thanks!
>
> Wait, you didn't even build test this patch?!?!?!?!
>
> net/core/net-sysfs.c: In function ‘format_netdev_group’:
> net/core/net-sysfs.c:298: error: ‘const struct net_device’ has no member named ‘netdev_group’
> net/core/net-sysfs.c: At top level:
> net/core/net-sysfs.c:333: error: ‘show_group’ undeclared here (not in a function)
>
> "RFC" doesn't preclude you from at least build testing patches you
> post.
>
> Sigh...
>
Sorry, my bad ... v2 patch is attatched, I've built and r/w this renamed 
sysfs, all work fine now. Sorry again about my carelessness ...

Regards
Xiaotian
>
>


[-- Attachment #2: 0001-net-rename-group-sysfs-entry-to-netdev_group.patch --]
[-- Type: text/plain, Size: 1535 bytes --]

>From 35388da8821a72a71f54cb955146a881f916eb25 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Thu, 10 Feb 2011 10:48:53 +0800
Subject: [PATCH net-next v2] net: rename group sysfs entry to netdev_group

commit a512b92 adds sysfs entry for net device group, but
before this commit, tun also uses group sysfs, so after this
commit checkin, kernel warns like this:
    sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'

Since tun has used this for years, rename sysfs under tun might
break existing userspace, so rename group sysfs entry for net device
group is a better choice.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Vlad Dogaru <ddvlad@rosedu.org>
---
 net/core/net-sysfs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 2e4a393..5ceb257 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -330,7 +330,7 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
-	__ATTR(group, S_IRUGO | S_IWUSR, show_group, store_group),
+	__ATTR(netdev_group, S_IRUGO | S_IWUSR, show_group, store_group),
 	{}
 };
 
-- 
1.7.1


^ permalink raw reply related

* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: David Miller @ 2011-02-10  3:16 UTC (permalink / raw)
  To: dfeng; +Cc: netdev, eric.dumazet, therbert, ebiederm, shemminger, ddvlad
In-Reply-To: <4D53536B.2010505@redhat.com>

From: Xiaotian Feng <dfeng@redhat.com>
Date: Thu, 10 Feb 2011 10:54:35 +0800

> Subject: [PATCH net-next v2] net: rename group sysfs entry to netdev_group
> 
> commit a512b92 adds sysfs entry for net device group, but
> before this commit, tun also uses group sysfs, so after this
> commit checkin, kernel warns like this:
>     sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
> 
> Since tun has used this for years, rename sysfs under tun might
> break existing userspace, so rename group sysfs entry for net device
> group is a better choice.
> 
> Signed-off-by: Xiaotian Feng <dfeng@redhat.com>

Applied.

^ permalink raw reply

* [RFC PATCH 0/5] Cache PMTU/redirects in inetpeer
From: David Miller @ 2011-02-10  6:12 UTC (permalink / raw)
  To: netdev

This is what I've been working on for the past several days.

Right now if the routing cache is turned off (by setting
rt_cache_rebuild_count to "0") several things stop working.

We never make use of any PMTU or redirect information we learn
via ICMP packets.  This is because when the routing cache is
off, we can't "find" the existing cached routes that match
the ICMP because we don't add them to the hash table.

This functionality loss is also a blocker for eliminating the
routing cache entirely.

Solve this by remembering this state in the inetpeer entries.

PMTU information now self-expires.  It gets validated when
cached routes are sanity checked via dst_ops->check().  At
expiration, the original RTAX_MTU metric value is restored.
So we don't have to invalidate the entire cached route just
because it's PMTU learned value has expired.

Similarly, we store redirect information in inetpeer too.
Except that currently my patches don't remember the "original"
gateway the route had, so we have to kill the route off when
we get a dst_ops->negative_advice() call on a redirected route.

Avoid this is easy to fix and I might do that soon.

These patches implement the PMTU/redirect bits in ipv4 only at the
moment, but I do have ipv6 patches I'm in the process of finishing
up.  I just wanted people to see this as soon as possible so that
I can start getting feedback.

And hey if people can test this stuff out that'd be awesome!  If
you've used these changes in an environment where you did hit PMTU
and redirects, please do let me know.

^ permalink raw reply

* [RFC PATCH 1/5] inetpeer: Abstract address representation further.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev


Future changes will add caching information, and some of
these new elements will be addresses.

Since the family is implicit via the ->daddr.family member,
replicating the family in ever address we store is entirely
redundant.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/inetpeer.h |   16 ++++++++++------
 net/ipv4/inetpeer.c    |    6 +++---
 net/ipv4/tcp_ipv4.c    |    2 +-
 net/ipv6/tcp_ipv6.c    |    2 +-
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index ead2cb2..60e2cd8 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -15,12 +15,16 @@
 #include <net/ipv6.h>
 #include <asm/atomic.h>
 
-struct inetpeer_addr {
+struct inetpeer_addr_base {
 	union {
-		__be32		a4;
-		__be32		a6[4];
+		__be32			a4;
+		__be32			a6[4];
 	};
-	__u16	family;
+};
+
+struct inetpeer_addr {
+	struct inetpeer_addr_base	addr;
+	__u16				family;
 };
 
 struct inet_peer {
@@ -67,7 +71,7 @@ static inline struct inet_peer *inet_getpeer_v4(__be32 v4daddr, int create)
 {
 	struct inetpeer_addr daddr;
 
-	daddr.a4 = v4daddr;
+	daddr.addr.a4 = v4daddr;
 	daddr.family = AF_INET;
 	return inet_getpeer(&daddr, create);
 }
@@ -76,7 +80,7 @@ static inline struct inet_peer *inet_getpeer_v6(struct in6_addr *v6daddr, int cr
 {
 	struct inetpeer_addr daddr;
 
-	ipv6_addr_copy((struct in6_addr *)daddr.a6, v6daddr);
+	ipv6_addr_copy((struct in6_addr *)daddr.addr.a6, v6daddr);
 	daddr.family = AF_INET6;
 	return inet_getpeer(&daddr, create);
 }
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 709fbb4..4346c38 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -167,9 +167,9 @@ static int addr_compare(const struct inetpeer_addr *a,
 	int i, n = (a->family == AF_INET ? 1 : 4);
 
 	for (i = 0; i < n; i++) {
-		if (a->a6[i] == b->a6[i])
+		if (a->addr.a6[i] == b->addr.a6[i])
 			continue;
-		if (a->a6[i] < b->a6[i])
+		if (a->addr.a6[i] < b->addr.a6[i])
 			return -1;
 		return 1;
 	}
@@ -510,7 +510,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
 		p->daddr = *daddr;
 		atomic_set(&p->refcnt, 1);
 		atomic_set(&p->rid, 0);
-		atomic_set(&p->ip_id_count, secure_ip_id(daddr->a4));
+		atomic_set(&p->ip_id_count, secure_ip_id(daddr->addr.a4));
 		p->tcp_ts_stamp = 0;
 		p->metrics[RTAX_LOCK-1] = INETPEER_METRICS_NEW;
 		p->rate_tokens = 0;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 02f583b..e2b9be2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1341,7 +1341,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		    tcp_death_row.sysctl_tw_recycle &&
 		    (dst = inet_csk_route_req(sk, req)) != NULL &&
 		    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
-		    peer->daddr.a4 == saddr) {
+		    peer->daddr.addr.a4 == saddr) {
 			inet_peer_refcheck(peer);
 			if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
 			    (s32)(peer->tcp_ts - req->ts_recent) >
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 20aa95e..d6954e3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1323,7 +1323,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 		    tcp_death_row.sysctl_tw_recycle &&
 		    (dst = inet6_csk_route_req(sk, req)) != NULL &&
 		    (peer = rt6_get_peer((struct rt6_info *)dst)) != NULL &&
-		    ipv6_addr_equal((struct in6_addr *)peer->daddr.a6,
+		    ipv6_addr_equal((struct in6_addr *)peer->daddr.addr.a6,
 				    &treq->rmt_addr)) {
 			inet_peer_refcheck(peer);
 			if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
-- 
1.7.4


^ permalink raw reply related

* [RFC PATCH 2/5] inetpeer: Add redirect and PMTU discovery cached info.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev


Validity of the cached PMTU information is indicated by it's
expiration value being non-zero, just as per dst->expires.

The scheme we will use is that we will remember the pre-ICMP value
held in the metrics or route entry, and then at expiration time
we will restore that value.

In this way PMTU expiration does not kill off the cached route as is
done currently.

Redirect information is permanent, or at least until another redirect
is received.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/inetpeer.h |   18 +++++++++++-------
 net/ipv4/inetpeer.c    |    2 ++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 60e2cd8..e6dd8da6 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -43,13 +43,17 @@ struct inet_peer {
 	 */
 	union {
 		struct {
-			atomic_t	rid;		/* Frag reception counter */
-			atomic_t	ip_id_count;	/* IP ID for the next packet */
-			__u32		tcp_ts;
-			__u32		tcp_ts_stamp;
-			u32		metrics[RTAX_MAX];
-			u32		rate_tokens;	/* rate limiting for ICMP */
-			unsigned long	rate_last;
+			atomic_t			rid;		/* Frag reception counter */
+			atomic_t			ip_id_count;	/* IP ID for the next packet */
+			__u32				tcp_ts;
+			__u32				tcp_ts_stamp;
+			u32				metrics[RTAX_MAX];
+			u32				rate_tokens;	/* rate limiting for ICMP */
+			unsigned long			rate_last;
+			unsigned long			pmtu_expires;
+			u32				pmtu_orig;
+			u32				pmtu_learned;
+			struct inetpeer_addr_base	redirect_learned;
 		};
 		struct rcu_head         rcu;
 	};
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 4346c38..48f8d45 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -515,6 +515,8 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
 		p->metrics[RTAX_LOCK-1] = INETPEER_METRICS_NEW;
 		p->rate_tokens = 0;
 		p->rate_last = 0;
+		p->pmtu_expires = 0;
+		memset(&p->redirect_learned, 0, sizeof(p->redirect_learned));
 		INIT_LIST_HEAD(&p->unused);
 
 
-- 
1.7.4


^ permalink raw reply related

* [RFC PATCH 3/5] inet: Create a mechanism for upward inetpeer propagation into routes.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev

If we didn't have a routing cache, we would not be able to properly
propagate certain kinds of dynamic path attributes, for example
PMTU information and redirects.

The reason is that if we didn't have a routing cache, then there would
be no way to lookup all of the active cached routes hanging off of
sockets, tunnels, IPSEC bundles, etc.

Consider the case where we created a cached route, but no inetpeer
entry existed and also we were not asked to pre-COW the route metrics
and therefore did not force the creation a new inetpeer entry.

If we later get a PMTU message, or a redirect, and store this
information in a new inetpeer entry, there is no way to teach that
cached route about the newly existing inetpeer entry.

The facilities implemented here handle this problem.

First we create a generation ID.  When we create a cached route of any
kind, we remember the generation ID at the time of attachment.  Any
time we force-create an inetpeer entry in response to new path
information, we bump that generation ID.

The dst_ops->check() callback is where the knowledge of this event
is propagated.  If the global generation ID does not equal the one
stored in the cached route, and the cached route has not attached
to an inetpeer yet, we look it up and attach if one is found.  Now
that we've updated the cached route's information, we update the
route's generation ID too.

This clears the way for implementing PMTU and redirects directly in
the inetpeer cache.  There is absolutely no need to consult cached
route information in order to maintain this information.

At this point nothing bumps the inetpeer genids, that comes in the
later changes which handle PMTUs and redirects using inetpeers.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/ip6_fib.h |    1 +
 include/net/route.h   |    1 +
 net/ipv4/route.c      |   19 ++++++++++++++++++-
 net/ipv6/route.c      |   18 ++++++++++++++++--
 4 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 708ff7c..46a6e8a 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -108,6 +108,7 @@ struct rt6_info {
 	u32				rt6i_flags;
 	struct rt6key			rt6i_src;
 	u32				rt6i_metric;
+	u32				rt6i_peer_genid;

 	struct inet6_dev		*rt6i_idev;
 	struct inet_peer		*rt6i_peer;
diff --git a/include/net/route.h b/include/net/route.h
index e586465..bf790c1 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@ struct rtable {

 	/* Miscellaneous cached information */
 	__be32			rt_spec_dst; /* RFC1122 specific destination */
+	u32			rt_peer_genid;
 	struct inet_peer	*peer; /* long-living peer info */
 	struct fib_info		*fi; /* for client ref to shared metrics */
 };
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0455af8..0979e03 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1308,6 +1308,13 @@ skip_hashing:
 	return 0;
 }

+static atomic_t __rt_peer_genid = ATOMIC_INIT(0);
+
+static u32 rt_peer_genid(void)
+{
+	return atomic_read(&__rt_peer_genid);
+}
+
 void rt_bind_peer(struct rtable *rt, int create)
 {
 	struct inet_peer *peer;
@@ -1316,6 +1323,8 @@ void rt_bind_peer(struct rtable *rt, int create)

 	if (peer && cmpxchg(&rt->peer, NULL, peer) != NULL)
 		inet_putpeer(peer);
+	else
+		rt->rt_peer_genid = rt_peer_genid();
 }

 /*
@@ -1767,8 +1776,16 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)

 static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 {
-	if (rt_is_expired((struct rtable *)dst))
+	struct rtable *rt = (struct rtable *) dst;
+
+	if (rt_is_expired(rt))
 		return NULL;
+	if (rt->rt_peer_genid != rt_peer_genid()) {
+		if (!rt->peer)
+			rt_bind_peer(rt, 0);
+
+		rt->rt_peer_genid = rt_peer_genid();
+	}
 	return dst;
 }

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 12ec83d..ad8556e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -240,6 +240,13 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 	}
 }

+static atomic_t __rt6_peer_genid = ATOMIC_INIT(0);
+
+static u32 rt6_peer_genid(void)
+{
+	return atomic_read(&__rt6_peer_genid);
+}
+
 void rt6_bind_peer(struct rt6_info *rt, int create)
 {
 	struct inet_peer *peer;
@@ -247,6 +254,8 @@ void rt6_bind_peer(struct rt6_info *rt, int create)
 	peer = inet_getpeer_v6(&rt->rt6i_dst.addr, create);
 	if (peer && cmpxchg(&rt->rt6i_peer, NULL, peer) != NULL)
 		inet_putpeer(peer);
+	else
+		rt->rt6i_peer_genid = rt6_peer_genid();
 }

 static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
@@ -912,9 +921,14 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)

 	rt = (struct rt6_info *) dst;

-	if (rt->rt6i_node && (rt->rt6i_node->fn_sernum == cookie))
+	if (rt->rt6i_node && (rt->rt6i_node->fn_sernum == cookie)) {
+		if (rt->rt6i_peer_genid != rt6_peer_genid()) {
+			if (!rt->rt6i_peer)
+				rt6_bind_peer(rt, 0);
+			rt->rt6i_peer_genid = rt6_peer_genid();
+		}
 		return dst;
-
+	}
 	return NULL;
 }

-- 
1.7.4

^ permalink raw reply related

* [RFC PATCH 4/5] ipv4: Cache learned PMTU information in inetpeer.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev


The general idea is that if we learn new PMTU information, we
bump the peer genid.

This triggers the dst_ops->check() code to validate and if
necessary propagate the new PMTU value into the metrics.

Learned PMTU information self-expires.

This means that it is not necessary to kill a cached route
entry just because the PMTU information is too old.

As a consequence:

1) When the path appears unreachable (dst_ops->link_failure
   or dst_ops->negative_advice) we unwind the PMTU state if
   it is out of date, instead of killing the cached route.

   A redirected route will still be invlidated in these
   situations.

2) rt_check_expire(), rt_worker_func(), et al. are no longer
   necessary at all.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |  260 ++++++++++++++++++------------------------------------
 1 files changed, 86 insertions(+), 174 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0979e03..11faf14 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -131,9 +131,6 @@ static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
 static int rt_chain_length_max __read_mostly	= 20;
 
-static struct delayed_work expires_work;
-static unsigned long expires_ljiffies;
-
 /*
  *	Interface to generic destination cache.
  */
@@ -668,7 +665,7 @@ static inline int rt_fast_clean(struct rtable *rth)
 static inline int rt_valuable(struct rtable *rth)
 {
 	return (rth->rt_flags & (RTCF_REDIRECTED | RTCF_NOTIFY)) ||
-		rth->dst.expires;
+		(rth->peer && rth->peer->pmtu_expires);
 }
 
 static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
@@ -679,13 +676,7 @@ static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long t
 	if (atomic_read(&rth->dst.__refcnt))
 		goto out;
 
-	ret = 1;
-	if (rth->dst.expires &&
-	    time_after_eq(jiffies, rth->dst.expires))
-		goto out;
-
 	age = jiffies - rth->dst.lastuse;
-	ret = 0;
 	if ((age <= tmo1 && !rt_fast_clean(rth)) ||
 	    (age <= tmo2 && rt_valuable(rth)))
 		goto out;
@@ -829,97 +820,6 @@ static int has_noalias(const struct rtable *head, const struct rtable *rth)
 	return ONE;
 }
 
-static void rt_check_expire(void)
-{
-	static unsigned int rover;
-	unsigned int i = rover, goal;
-	struct rtable *rth;
-	struct rtable __rcu **rthp;
-	unsigned long samples = 0;
-	unsigned long sum = 0, sum2 = 0;
-	unsigned long delta;
-	u64 mult;
-
-	delta = jiffies - expires_ljiffies;
-	expires_ljiffies = jiffies;
-	mult = ((u64)delta) << rt_hash_log;
-	if (ip_rt_gc_timeout > 1)
-		do_div(mult, ip_rt_gc_timeout);
-	goal = (unsigned int)mult;
-	if (goal > rt_hash_mask)
-		goal = rt_hash_mask + 1;
-	for (; goal > 0; goal--) {
-		unsigned long tmo = ip_rt_gc_timeout;
-		unsigned long length;
-
-		i = (i + 1) & rt_hash_mask;
-		rthp = &rt_hash_table[i].chain;
-
-		if (need_resched())
-			cond_resched();
-
-		samples++;
-
-		if (rcu_dereference_raw(*rthp) == NULL)
-			continue;
-		length = 0;
-		spin_lock_bh(rt_hash_lock_addr(i));
-		while ((rth = rcu_dereference_protected(*rthp,
-					lockdep_is_held(rt_hash_lock_addr(i)))) != NULL) {
-			prefetch(rth->dst.rt_next);
-			if (rt_is_expired(rth)) {
-				*rthp = rth->dst.rt_next;
-				rt_free(rth);
-				continue;
-			}
-			if (rth->dst.expires) {
-				/* Entry is expired even if it is in use */
-				if (time_before_eq(jiffies, rth->dst.expires)) {
-nofree:
-					tmo >>= 1;
-					rthp = &rth->dst.rt_next;
-					/*
-					 * We only count entries on
-					 * a chain with equal hash inputs once
-					 * so that entries for different QOS
-					 * levels, and other non-hash input
-					 * attributes don't unfairly skew
-					 * the length computation
-					 */
-					length += has_noalias(rt_hash_table[i].chain, rth);
-					continue;
-				}
-			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
-				goto nofree;
-
-			/* Cleanup aged off entries. */
-			*rthp = rth->dst.rt_next;
-			rt_free(rth);
-		}
-		spin_unlock_bh(rt_hash_lock_addr(i));
-		sum += length;
-		sum2 += length*length;
-	}
-	if (samples) {
-		unsigned long avg = sum / samples;
-		unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
-		rt_chain_length_max = max_t(unsigned long,
-					ip_rt_gc_elasticity,
-					(avg + 4*sd) >> FRACT_BITS);
-	}
-	rover = i;
-}
-
-/*
- * rt_worker_func() is run in process context.
- * we call rt_check_expire() to scan part of the hash table
- */
-static void rt_worker_func(struct work_struct *work)
-{
-	rt_check_expire();
-	schedule_delayed_work(&expires_work, ip_rt_gc_interval);
-}
-
 /*
  * Pertubation of rt_genid by a small quantity [1..256]
  * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
@@ -1535,9 +1435,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
 		if (dst->obsolete > 0) {
 			ip_rt_put(rt);
 			ret = NULL;
-		} else if ((rt->rt_flags & RTCF_REDIRECTED) ||
-			   (rt->dst.expires &&
-			    time_after_eq(jiffies, rt->dst.expires))) {
+		} else if (rt->rt_flags & RTCF_REDIRECTED) {
 			unsigned hash = rt_hash(rt->fl.fl4_dst, rt->fl.fl4_src,
 						rt->fl.oif,
 						rt_genid(dev_net(dst->dev)));
@@ -1547,6 +1445,14 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
 #endif
 			rt_del(hash, rt);
 			ret = NULL;
+		} else if (rt->peer &&
+			   rt->peer->pmtu_expires &&
+			   time_after_eq(jiffies, rt->peer->pmtu_expires)) {
+			unsigned long orig = rt->peer->pmtu_expires;
+
+			if (cmpxchg(&rt->peer->pmtu_expires, orig, 0) == orig)
+				dst_metric_set(dst, RTAX_MTU,
+					       rt->peer->pmtu_orig);
 		}
 	}
 	return ret;
@@ -1697,80 +1603,78 @@ unsigned short ip_rt_frag_needed(struct net *net, struct iphdr *iph,
 				 unsigned short new_mtu,
 				 struct net_device *dev)
 {
-	int i, k;
 	unsigned short old_mtu = ntohs(iph->tot_len);
-	struct rtable *rth;
-	int  ikeys[2] = { dev->ifindex, 0 };
-	__be32  skeys[2] = { iph->saddr, 0, };
-	__be32  daddr = iph->daddr;
 	unsigned short est_mtu = 0;
+	struct inet_peer *peer;
 
-	for (k = 0; k < 2; k++) {
-		for (i = 0; i < 2; i++) {
-			unsigned hash = rt_hash(daddr, skeys[i], ikeys[k],
-						rt_genid(net));
-
-			rcu_read_lock();
-			for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
-			     rth = rcu_dereference(rth->dst.rt_next)) {
-				unsigned short mtu = new_mtu;
+	peer = inet_getpeer_v4(iph->daddr, 1);
+	if (peer) {
+		unsigned short mtu = new_mtu;
 
-				if (rth->fl.fl4_dst != daddr ||
-				    rth->fl.fl4_src != skeys[i] ||
-				    rth->rt_dst != daddr ||
-				    rth->rt_src != iph->saddr ||
-				    rth->fl.oif != ikeys[k] ||
-				    rt_is_input_route(rth) ||
-				    dst_metric_locked(&rth->dst, RTAX_MTU) ||
-				    !net_eq(dev_net(rth->dst.dev), net) ||
-				    rt_is_expired(rth))
-					continue;
+		if (new_mtu < 68 || new_mtu >= old_mtu) {
+			/* BSD 4.2 derived systems incorrectly adjust
+			 * tot_len by the IP header length, and report
+			 * a zero MTU in the ICMP message.
+			 */
+			if (mtu == 0 &&
+			    old_mtu >= 68 + (iph->ihl << 2))
+				old_mtu -= iph->ihl << 2;
+			mtu = guess_mtu(old_mtu);
+		}
 
-				if (new_mtu < 68 || new_mtu >= old_mtu) {
+		if (mtu < ip_rt_min_pmtu)
+			mtu = ip_rt_min_pmtu;
+		if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
+			est_mtu = mtu;
+			peer->pmtu_learned = mtu;
+			peer->pmtu_expires = jiffies + ip_rt_mtu_expires;
+		}
 
-					/* BSD 4.2 compatibility hack :-( */
-					if (mtu == 0 &&
-					    old_mtu >= dst_mtu(&rth->dst) &&
-					    old_mtu >= 68 + (iph->ihl << 2))
-						old_mtu -= iph->ihl << 2;
+		inet_putpeer(peer);
 
-					mtu = guess_mtu(old_mtu);
-				}
-				if (mtu <= dst_mtu(&rth->dst)) {
-					if (mtu < dst_mtu(&rth->dst)) {
-						dst_confirm(&rth->dst);
-						if (mtu < ip_rt_min_pmtu) {
-							u32 lock = dst_metric(&rth->dst,
-									      RTAX_LOCK);
-							mtu = ip_rt_min_pmtu;
-							lock |= (1 << RTAX_MTU);
-							dst_metric_set(&rth->dst, RTAX_LOCK,
-								       lock);
-						}
-						dst_metric_set(&rth->dst, RTAX_MTU, mtu);
-						dst_set_expires(&rth->dst,
-							ip_rt_mtu_expires);
-					}
-					est_mtu = mtu;
-				}
-			}
-			rcu_read_unlock();
-		}
+		atomic_inc(&__rt_peer_genid);
 	}
 	return est_mtu ? : new_mtu;
 }
 
+static void check_peer_pmtu(struct dst_entry *dst, struct inet_peer *peer)
+{
+	unsigned long expires = peer->pmtu_expires;
+
+	if (time_before(expires, jiffies)) {
+		u32 orig_dst_mtu = dst_mtu(dst);
+		if (peer->pmtu_learned < orig_dst_mtu) {
+			if (!peer->pmtu_orig)
+				peer->pmtu_orig = dst_metric_raw(dst, RTAX_MTU);
+			dst_metric_set(dst, RTAX_MTU, peer->pmtu_learned);
+		}
+	} else if (cmpxchg(&peer->pmtu_expires, expires, 0) == expires)
+		dst_metric_set(dst, RTAX_MTU, peer->pmtu_orig);
+}
+
 static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
 {
-	if (dst_mtu(dst) > mtu && mtu >= 68 &&
-	    !(dst_metric_locked(dst, RTAX_MTU))) {
-		if (mtu < ip_rt_min_pmtu) {
-			u32 lock = dst_metric(dst, RTAX_LOCK);
+	struct rtable *rt = (struct rtable *) dst;
+	struct inet_peer *peer;
+
+	dst_confirm(dst);
+
+	if (!rt->peer)
+		rt_bind_peer(rt, 1);
+	peer = rt->peer;
+	if (peer) {
+		if (mtu < ip_rt_min_pmtu)
 			mtu = ip_rt_min_pmtu;
-			dst_metric_set(dst, RTAX_LOCK, lock | (1 << RTAX_MTU));
+		if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
+			peer->pmtu_learned = mtu;
+			peer->pmtu_expires = jiffies + ip_rt_mtu_expires;
+
+			atomic_inc(&__rt_peer_genid);
+			rt->rt_peer_genid = rt_peer_genid();
+
+			check_peer_pmtu(dst, peer);
 		}
-		dst_metric_set(dst, RTAX_MTU, mtu);
-		dst_set_expires(dst, ip_rt_mtu_expires);
+		inet_putpeer(peer);
 	}
 }
 
@@ -1781,9 +1685,15 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 	if (rt_is_expired(rt))
 		return NULL;
 	if (rt->rt_peer_genid != rt_peer_genid()) {
+		struct inet_peer *peer;
+
 		if (!rt->peer)
 			rt_bind_peer(rt, 0);
 
+		peer = rt->peer;
+		if (peer && peer->pmtu_expires)
+			check_peer_pmtu(dst, peer);
+
 		rt->rt_peer_genid = rt_peer_genid();
 	}
 	return dst;
@@ -1812,8 +1722,14 @@ static void ipv4_link_failure(struct sk_buff *skb)
 	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH, 0);
 
 	rt = skb_rtable(skb);
-	if (rt)
-		dst_set_expires(&rt->dst, 0);
+	if (rt &&
+	    rt->peer &&
+	    rt->peer->pmtu_expires) {
+		unsigned long orig = rt->peer->pmtu_expires;
+
+		if (cmpxchg(&rt->peer->pmtu_expires, orig, 0) == orig)
+			dst_metric_set(&rt->dst, RTAX_MTU, rt->peer->pmtu_orig);
+	}
 }
 
 static int ip_rt_bug(struct sk_buff *skb)
@@ -1911,6 +1827,9 @@ static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
 			memcpy(peer->metrics, fi->fib_metrics,
 			       sizeof(u32) * RTAX_MAX);
 		dst_init_metrics(&rt->dst, peer->metrics, false);
+
+		if (peer->pmtu_expires)
+			check_peer_pmtu(&rt->dst, peer);
 	} else {
 		if (fi->fib_metrics != (u32 *) dst_default_metrics) {
 			rt->fi = fi;
@@ -2961,7 +2880,8 @@ static int rt_fill_info(struct net *net,
 		NLA_PUT_BE32(skb, RTA_MARK, rt->fl.mark);
 
 	error = rt->dst.error;
-	expires = rt->dst.expires ? rt->dst.expires - jiffies : 0;
+	expires = (rt->peer && rt->peer->pmtu_expires) ?
+		rt->peer->pmtu_expires - jiffies : 0;
 	if (rt->peer) {
 		inet_peer_refcheck(rt->peer);
 		id = atomic_read(&rt->peer->ip_id_count) & 0xffff;
@@ -3418,14 +3338,6 @@ int __init ip_rt_init(void)
 	devinet_init();
 	ip_fib_init();
 
-	/* All the timers, started at system startup tend
-	   to synchronize. Perturb it a bit.
-	 */
-	INIT_DELAYED_WORK_DEFERRABLE(&expires_work, rt_worker_func);
-	expires_ljiffies = jiffies;
-	schedule_delayed_work(&expires_work,
-		net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
-
 	if (ip_rt_proc_init())
 		printk(KERN_ERR "Unable to create route proc files\n");
 #ifdef CONFIG_XFRM
-- 
1.7.4


^ permalink raw reply related

* [RFC PATCH 5/5] ipv4: Cache learned redirect information in inetpeer.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev


Note that we do not generate the redirect netevent any longer,
because we don't create a new cached route.

Instead, once the new neighbour is bound to the cached route,
we emit a neigh update event instead.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |  136 +++++++++++++++++-------------------------------------
 1 files changed, 42 insertions(+), 94 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 11faf14..756f544 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1294,13 +1294,8 @@ static void rt_del(unsigned hash, struct rtable *rt)
 void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 		    __be32 saddr, struct net_device *dev)
 {
-	int i, k;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
-	struct rtable *rth;
-	struct rtable __rcu **rthp;
-	__be32  skeys[2] = { saddr, 0 };
-	int  ikeys[2] = { dev->ifindex, 0 };
-	struct netevent_redirect netevent;
+	struct inet_peer *peer;
 	struct net *net;
 
 	if (!in_dev)
@@ -1312,9 +1307,6 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 	    ipv4_is_zeronet(new_gw))
 		goto reject_redirect;
 
-	if (!rt_caching(net))
-		goto reject_redirect;
-
 	if (!IN_DEV_SHARED_MEDIA(in_dev)) {
 		if (!inet_addr_onlink(in_dev, new_gw, old_gw))
 			goto reject_redirect;
@@ -1325,93 +1317,13 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 			goto reject_redirect;
 	}
 
-	for (i = 0; i < 2; i++) {
-		for (k = 0; k < 2; k++) {
-			unsigned hash = rt_hash(daddr, skeys[i], ikeys[k],
-						rt_genid(net));
-
-			rthp = &rt_hash_table[hash].chain;
-
-			while ((rth = rcu_dereference(*rthp)) != NULL) {
-				struct rtable *rt;
-
-				if (rth->fl.fl4_dst != daddr ||
-				    rth->fl.fl4_src != skeys[i] ||
-				    rth->fl.oif != ikeys[k] ||
-				    rt_is_input_route(rth) ||
-				    rt_is_expired(rth) ||
-				    !net_eq(dev_net(rth->dst.dev), net)) {
-					rthp = &rth->dst.rt_next;
-					continue;
-				}
-
-				if (rth->rt_dst != daddr ||
-				    rth->rt_src != saddr ||
-				    rth->dst.error ||
-				    rth->rt_gateway != old_gw ||
-				    rth->dst.dev != dev)
-					break;
-
-				dst_hold(&rth->dst);
-
-				rt = dst_alloc(&ipv4_dst_ops);
-				if (rt == NULL) {
-					ip_rt_put(rth);
-					return;
-				}
-
-				/* Copy all the information. */
-				*rt = *rth;
-				rt->dst.__use		= 1;
-				atomic_set(&rt->dst.__refcnt, 1);
-				rt->dst.child		= NULL;
-				if (rt->dst.dev)
-					dev_hold(rt->dst.dev);
-				rt->dst.obsolete	= -1;
-				rt->dst.lastuse	= jiffies;
-				rt->dst.path		= &rt->dst;
-				rt->dst.neighbour	= NULL;
-				rt->dst.hh		= NULL;
-#ifdef CONFIG_XFRM
-				rt->dst.xfrm		= NULL;
-#endif
-				rt->rt_genid		= rt_genid(net);
-				rt->rt_flags		|= RTCF_REDIRECTED;
-
-				/* Gateway is different ... */
-				rt->rt_gateway		= new_gw;
-
-				/* Redirect received -> path was valid */
-				dst_confirm(&rth->dst);
-
-				if (rt->peer)
-					atomic_inc(&rt->peer->refcnt);
-				if (rt->fi)
-					atomic_inc(&rt->fi->fib_clntref);
-
-				if (arp_bind_neighbour(&rt->dst) ||
-				    !(rt->dst.neighbour->nud_state &
-					    NUD_VALID)) {
-					if (rt->dst.neighbour)
-						neigh_event_send(rt->dst.neighbour, NULL);
-					ip_rt_put(rth);
-					rt_drop(rt);
-					goto do_next;
-				}
+	peer = inet_getpeer_v4(daddr, 1);
+	if (peer) {
+		peer->redirect_learned.a4 = new_gw;
 
-				netevent.old = &rth->dst;
-				netevent.new = &rt->dst;
-				call_netevent_notifiers(NETEVENT_REDIRECT,
-							&netevent);
+		inet_putpeer(peer);
 
-				rt_del(hash, rth);
-				if (!rt_intern_hash(hash, rt, &rt, NULL, rt->fl.oif))
-					ip_rt_put(rt);
-				goto do_next;
-			}
-		do_next:
-			;
-		}
+		atomic_inc(&__rt_peer_genid);
 	}
 	return;
 
@@ -1678,6 +1590,31 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
 	}
 }
 
+static int check_peer_redir(struct dst_entry *dst, struct inet_peer *peer)
+{
+	struct rtable *rt = (struct rtable *) dst;
+	__be32 orig_gw = rt->rt_gateway;
+
+	dst_confirm(&rt->dst);
+
+	neigh_release(rt->dst.neighbour);
+	rt->dst.neighbour = NULL;
+
+	rt->rt_gateway = peer->redirect_learned.a4;
+	if (arp_bind_neighbour(&rt->dst) ||
+	    !(rt->dst.neighbour->nud_state & NUD_VALID)) {
+		if (rt->dst.neighbour)
+			neigh_event_send(rt->dst.neighbour, NULL);
+		rt->rt_gateway = orig_gw;
+		return -EAGAIN;
+	} else {
+		rt->rt_flags |= RTCF_REDIRECTED;
+		call_netevent_notifiers(NETEVENT_NEIGH_UPDATE,
+					rt->dst.neighbour);
+	}
+	return 0;
+}
+
 static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 {
 	struct rtable *rt = (struct rtable *) dst;
@@ -1694,6 +1631,12 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 		if (peer && peer->pmtu_expires)
 			check_peer_pmtu(dst, peer);
 
+		if (peer && peer->redirect_learned.a4 &&
+		    peer->redirect_learned.a4 != rt->rt_gateway) {
+			if (check_peer_redir(dst, peer))
+				return NULL;
+		}
+
 		rt->rt_peer_genid = rt_peer_genid();
 	}
 	return dst;
@@ -1830,6 +1773,11 @@ static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
 
 		if (peer->pmtu_expires)
 			check_peer_pmtu(&rt->dst, peer);
+		if (peer->redirect_learned.a4 &&
+		    peer->redirect_learned.a4 != rt->rt_gateway) {
+			rt->rt_gateway = peer->redirect_learned.a4;
+			rt->rt_flags |= RTCF_REDIRECTED;
+		}
 	} else {
 		if (fi->fib_metrics != (u32 *) dst_default_metrics) {
 			rt->fi = fi;
-- 
1.7.4


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox