Netdev List
 help / color / mirror / Atom feed
* [PATCH] e100: fix 10sec DHCP delay (regression for 82559ER)
From: Bernhard Kaindl @ 2009-10-21 20:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: Bruce Allan, netdev

From: Bernhard Kaindl <bernhard.kaindl@gmx.net>

With 2.6.30, we noticed a regression on our Intel 82559ER-equipped
boards regarding the PHY initialization. The Intel 82559ER uses an
internal PHY bus, so our PHY environment should be not be very special.

The symptom which we observed on these boards was that the boot-time
DHCP negotiation was stalled for ~5secs and went very slowly until
it was finally completed after ~10secs after initial interface start.

I reported our finding along with a proposal for a fix to netdev and
Bruce Allan@Intel, whose patch to support the new Intel 82552 adapter
included a workaround for the 82552 with this side effect for our env.

Bruce worked on a way to have both environments working and tested
it on as many 10/100 parts he could get his hands on and started
a process which allows for more testing and patch submission from

So pending this process which allows for more testing at Intel, I am
submitting our common patch. For PHYs other than the 82552, it resembles
mostly how 2.6.23-2.6.29 have been selecting and isolating the PHYs.

This patch applies to 2.6.30.9, 2.6.31.4 and 2.6.32-rc5-git1 and is
essentially what Bruce has been testing:

Signed-off-by: Bernhard Kaindl <bernhard.kaindl@gmx.net>
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
---

  drivers/net/e100.c |   10 +++++++---
  1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 5d2f48f..aed18a4 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -1427,13 +1427,17 @@ static int e100_phy_init(struct nic *nic)
  	} else
  		DPRINTK(HW, DEBUG, "phy_addr = %d\n", nic->mii.phy_id);

-	/* Isolate all the PHY ids */
-	for (addr = 0; addr < 32; addr++)
-		mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
  	/* Select the discovered PHY */
  	bmcr &= ~BMCR_ISOLATE;
  	mdio_write(netdev, nic->mii.phy_id, MII_BMCR, bmcr);

+	if (nic->phy != phy_82552_v) {
+		/* Isolate the unused PHY ids */
+		for (addr = 0; addr < 32; addr++)
+			if (addr != nic->mii.phy_id)
+				mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
+	}
+
  	/* Get phy ID */
  	id_lo = mdio_read(netdev, nic->mii.phy_id, MII_PHYSID1);
  	id_hi = mdio_read(netdev, nic->mii.phy_id, MII_PHYSID2);

^ permalink raw reply related

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-21 19:54 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Octavian Purdila, netdev, Cosmin Ratiu
In-Reply-To: <20091021165139.GL877@kvack.org>

Benjamin LaHaise a écrit :
> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> Ben patch only address interface deletion, and one part of the problem,
>> maybe the more visible one for the current kernel.
> 
> The first part I've been tackling has been the overhead in procfs, sysctl 
> and sysfs.  I've got patches for some of the issues, hacks for others, and 
> should have something to post in a few days.  Getting rid of those overheads 
> is enough to get to decent interface creation times for the first 20 or 30k 
> interfaces.
> 
> On the interface deletion side of things, within the network code, fib_hash 
> has a few linear scans that really start hurting.  trie is a bit better, 
> but I haven't started digging too deeply into its flush/remove overhead yet, 
> aside from noticing that trie has a linear scan if 
> CONFIG_IP_ROUTE_MULTIPATH is set since it sets the hash size to 1.  
> fn_trie_flush() is currently the worst offender during deletion.

Well, there are many things to change...

# ip -o link | wc -l
13097
# time ip -o link show mv22248
13045: mv22248@eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN \    link/ether 00:1e:0b:8e:c8:08 brd ff:ff:ff:ff:ff:ff

real    0m0.840s
user    0m0.473s
sys     0m0.368s

almost one second to get link status of one particular interface :(

^ permalink raw reply

* Re: [PATCH v2 8/8] Document future removal of sysctl_tcp_* options
From: William Allen Simpson @ 2009-10-21 19:30 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4ADEE119.7020803@codefidence.com>

Gilad Ben-Yossef wrote:
> I have no issue with leaving those, if everyone thinks we're better off.
> 
> BTW, while we're talking about OS envy, I do believe that Windows do let
> you specify on a per route basis. Not that this is really a good ground for
> technical decision, but still... :-)
> 
I'm not concerned with "envy", I'm concerned with training operators, and
consistency across platforms.

I'm in favor of per route configuration, it seems reasonably clean, as
long as it's done consistently with other systems.  I don't permit Windows
systems to be used here (except under controlled security circumstances), so
I'm not familiar with their configuration.  However, doing things similarly
across platforms will ease documentation and training.


^ permalink raw reply

* Re: [PATCH v2 4/8] Add the no SACK route option feature
From: William Allen Simpson @ 2009-10-21 19:22 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1256115421-12714-5-git-send-email-gilad@codefidence.com>

Gilad Ben-Yossef wrote:
> Implement querying and acting upon the no sack bit in the features
> field.
> 
>  #define RTAX_FEATURE_ECN	0x00000001
> -#define RTAX_FEATURE_SACK	0x00000002
> +#define RTAX_FEATURE_NO_SACK	0x00000002
>  #define RTAX_FEATURE_TIMESTAMP	0x00000004
>  #define RTAX_FEATURE_ALLFRAG	0x00000008
>  
I just realized that unlike NO_DSACK, this change assumes removing the
sysctl and defaulting on.  I'm opposed to removing this sysctl, so I'm
opposed to this change.

I'd prefer the ability to both turn on for global default off, and
turn off for global default on.  Shouldn't that be 2 different bits?

Or should this be a toggle?  How do other systems handle it?


^ permalink raw reply

* Re: [PATCH v2 5/8] Allow disabling TCP timestamp options per route
From: William Allen Simpson @ 2009-10-21 19:22 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1256115421-12714-6-git-send-email-gilad@codefidence.com>

Gilad Ben-Yossef wrote:
> Implement querying and acting upon the no timestamp bit in the feature 
> field.
> 
> Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
> Sigend-off-by: Ori Finkelman <ori@comsleep.com>
> Sigend-off-by: Yony Amit <yony@comsleep.com>
> 
> ---
>  include/linux/rtnetlink.h |    2 +-
>  net/ipv4/tcp_input.c      |    3 ++-
>  net/ipv4/tcp_output.c     |    8 ++++++--
>  3 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index 9c802a6..2ab8c75 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -378,7 +378,7 @@ enum
>  
>  #define RTAX_FEATURE_ECN	0x00000001
>  #define RTAX_FEATURE_NO_SACK	0x00000002
> -#define RTAX_FEATURE_TIMESTAMP	0x00000004
> +#define RTAX_FEATURE_NO_TSTAMP	0x00000004
>  #define RTAX_FEATURE_ALLFRAG	0x00000008
>  
I just realized that unlike NO_WSCALE, this change assumes removing the
sysctl and defaulting on.  I'm opposed to removing this sysctl, so I'm
opposed to this change.

I'd prefer the ability to both turn on for global default off, and
turn off for global default on.  Shouldn't that be 2 different bits?

Or should this be a toggle?  How do other systems handle it?


^ permalink raw reply

* Re: [PATCH  kernel 2.6.32-rc5] pcnet_cs: add cis of PreMax PE-200 ethernet pcmcia card
From: Dan Williams @ 2009-10-21 19:18 UTC (permalink / raw)
  To: Ken Kawasaki; +Cc: netdev
In-Reply-To: <20091018103920.ca9217ee.ken_kawasaki@spring.nifty.jp>

On Sun, 2009-10-18 at 10:39 +0900, Ken Kawasaki wrote:
> pcnet_cs,serial_cs:
> 
> add cis of PreMax ethernet pcmcia card,
> and some Sierra Wireless serial card(AC555, AC7xx, AC8xx).

Random question: are CIS files copyrightable?  What exactly do they
contain, just updates to the the CIS data on the card itself that the
manufacturer forgot to burn before shipping the card?

Also, I've got a Sierra AC860 here that reports as "prod_id(2):
"AC860"", and has the same manf_id (0x0192) and card_id (0x710) as the
AC850.

manf_id: 0x0192 card_id: 0x0710
function: 6 (network)
prod_id(1): "Sierra Wireless" (0xd85f6206)
prod_id(2): "AC860" (0x698f93db)
prod_id(3): "3G Network Adapter" (0xab3c6f47)
prod_id(4): "R1" (0xd9533fec)

It currently requests the 7xx CIS file because there isn't a specific
check for it in the driver and it uses the default 7xx manfid/prodid,
should I submit something like:

+	PCMCIA_DEVICE_CIS_PROD_ID12("Sierra Wireless", "AC860", 0xd85f6206, 0x698f93db, "cis/SW_8xx_SER.cis"), /* Sierra Wireless AC860 3G Network Adapter R1 */

?

Dan

> use PROD_ID for AC7xx, because MANF_ID of AC7xx and AC8xx are the same.
> 
> 
> Signed-off-by: Ken Kawasaki <ken_kawasaki@spring.nifty.jp>
> 
> ---
> 
>  drivers/net/pcmcia/pcnet_cs.c    |    2 +-
>  drivers/serial/serial_cs.c       |    8 ++++----
>  firmware/Makefile                |    6 ++++--
>  firmware/WHENCE                  |    4 ++++
>  firmware/cis/PE-200.cis.ihex     |    9 +++++++++
>  firmware/cis/SW_555_SER.cis.ihex |   12 ++++++++++++
>  firmware/cis/SW_7xx_SER.cis.ihex |   13 +++++++++++++
>  firmware/cis/SW_8xx_SER.cis.ihex |   13 +++++++++++++
>  8 files changed, 60 insertions(+), 7 deletions(-)
> 
> diff -urpN linux-2.6.32-rc5.orig/drivers/net/pcmcia/pcnet_cs.c linux-2.6.32-rc5/drivers/net/pcmcia/pcnet_cs.c
> --- linux-2.6.32-rc5.orig/drivers/net/pcmcia/pcnet_cs.c	2009-10-17 15:53:44.000000000 +0900
> +++ linux-2.6.32-rc5/drivers/net/pcmcia/pcnet_cs.c	2009-10-17 16:07:12.000000000 +0900
> @@ -1760,7 +1760,7 @@ static struct pcmcia_device_id pcnet_ids
>  	PCMCIA_DEVICE_CIS_MANF_CARD(0xc00f, 0x0002, "cis/LA-PCM.cis"),
>  	PCMCIA_DEVICE_CIS_PROD_ID12("KTI", "PE520 PLUS", 0xad180345, 0x9d58d392, "PE520.cis"),
>  	PCMCIA_DEVICE_CIS_PROD_ID12("NDC", "Ethernet", 0x01c43ae1, 0x00b2e941, "cis/NE2K.cis"),
> -	PCMCIA_DEVICE_CIS_PROD_ID12("PMX   ", "PE-200", 0x34f3f1c8, 0x10b59f8c, "PE-200.cis"),
> +	PCMCIA_DEVICE_CIS_PROD_ID12("PMX   ", "PE-200", 0x34f3f1c8, 0x10b59f8c, "cis/PE-200.cis"),
>  	PCMCIA_DEVICE_CIS_PROD_ID12("TAMARACK", "Ethernet", 0xcf434fba, 0x00b2e941, "cis/tamarack.cis"),
>  	PCMCIA_DEVICE_PROD_ID12("Ethernet", "CF Size PC Card", 0x00b2e941, 0x43ac239b),
>  	PCMCIA_DEVICE_PROD_ID123("Fast Ethernet", "CF Size PC Card", "1.0",
> diff -urpN linux-2.6.32-rc5.orig/drivers/serial/serial_cs.c linux-2.6.32-rc5/drivers/serial/serial_cs.c
> --- linux-2.6.32-rc5.orig/drivers/serial/serial_cs.c	2009-10-17 15:53:45.000000000 +0900
> +++ linux-2.6.32-rc5/drivers/serial/serial_cs.c	2009-10-17 16:02:50.000000000 +0900
> @@ -879,10 +879,10 @@ static struct pcmcia_device_id serial_id
>  	PCMCIA_MFC_DEVICE_CIS_MANF_CARD(1, 0x0175, 0x0000, "cis/DP83903.cis"),
>  	PCMCIA_MFC_DEVICE_CIS_MANF_CARD(1, 0x0101, 0x0035, "cis/3CXEM556.cis"),
>  	PCMCIA_MFC_DEVICE_CIS_MANF_CARD(1, 0x0101, 0x003d, "cis/3CXEM556.cis"),
> -	PCMCIA_DEVICE_CIS_PROD_ID12("Sierra Wireless", "AC850", 0xd85f6206, 0x42a2c018, "SW_8xx_SER.cis"),  /* Sierra Wireless AC850 3G Network Adapter R1 */
> -	PCMCIA_DEVICE_CIS_MANF_CARD(0x0192, 0x0710, "SW_7xx_SER.cis"),	/* Sierra Wireless AC710/AC750 GPRS Network Adapter R1 */
> -	PCMCIA_DEVICE_CIS_MANF_CARD(0x0192, 0xa555, "SW_555_SER.cis"),  /* Sierra Aircard 555 CDMA 1xrtt Modem -- pre update */
> -	PCMCIA_DEVICE_CIS_MANF_CARD(0x013f, 0xa555, "SW_555_SER.cis"),  /* Sierra Aircard 555 CDMA 1xrtt Modem -- post update */
> +	PCMCIA_DEVICE_CIS_PROD_ID12("Sierra Wireless", "AC850", 0xd85f6206, 0x42a2c018, "cis/SW_8xx_SER.cis"), /* Sierra Wireless AC850 3G Network Adapter R1 */
> +	PCMCIA_DEVICE_CIS_PROD_ID12("Sierra Wireless", "AC710/AC750", 0xd85f6206, 0x761b11e0, "cis/SW_7xx_SER.cis"),  /* Sierra Wireless AC710/AC750 GPRS Network Adapter R1 */
> +	PCMCIA_DEVICE_CIS_MANF_CARD(0x0192, 0xa555, "cis/SW_555_SER.cis"),  /* Sierra Aircard 555 CDMA 1xrtt Modem -- pre update */
> +	PCMCIA_DEVICE_CIS_MANF_CARD(0x013f, 0xa555, "cis/SW_555_SER.cis"),  /* Sierra Aircard 555 CDMA 1xrtt Modem -- post update */
>  	PCMCIA_DEVICE_CIS_PROD_ID12("MultiTech", "PCMCIA 56K DataFax", 0x842047ee, 0xc2efcf03, "cis/MT5634ZLX.cis"),
>  	PCMCIA_DEVICE_CIS_PROD_ID12("ADVANTECH", "COMpad-32/85B-2", 0x96913a85, 0x27ab5437, "cis/COMpad2.cis"),
>  	PCMCIA_DEVICE_CIS_PROD_ID12("ADVANTECH", "COMpad-32/85B-4", 0x96913a85, 0xcec8f102, "cis/COMpad4.cis"),
> diff -urpN linux-2.6.32-rc5.orig/firmware/cis/PE-200.cis.ihex linux-2.6.32-rc5/firmware/cis/PE-200.cis.ihex
> --- linux-2.6.32-rc5.orig/firmware/cis/PE-200.cis.ihex	1970-01-01 09:00:00.000000000 +0900
> +++ linux-2.6.32-rc5/firmware/cis/PE-200.cis.ihex	2009-10-17 16:13:21.000000000 +0900
> @@ -0,0 +1,9 @@
> +:1000000001030000FF151E0401504D582020200060
> +:1000100050452D3230300045544845524E4554002D
> +:1000200052303100FF210206031A050101000101CF
> +:100030001B0EC181190155E051000F100F30FFFF59
> +:040040001400FF00A9
> +:00000001FF
> +#
> +# Replacement CIS for PE-200 ethernet card
> +#
> diff -urpN linux-2.6.32-rc5.orig/firmware/cis/SW_555_SER.cis.ihex linux-2.6.32-rc5/firmware/cis/SW_555_SER.cis.ihex
> --- linux-2.6.32-rc5.orig/firmware/cis/SW_555_SER.cis.ihex	1970-01-01 09:00:00.000000000 +0900
> +++ linux-2.6.32-rc5/firmware/cis/SW_555_SER.cis.ihex	2009-10-17 16:12:47.000000000 +0900
> @@ -0,0 +1,12 @@
> +:100000000101FF17034100FF20043F0110072102F7
> +:100010000200152A070053696572726120576972E0
> +:10002000656C657373004169724361726420353594
> +:1000300035004135353500526576203100FF1A050F
> +:1000400001030007731B0BE00118A360F8030730DE
> +:10005000BC3F1B08A10108A360F802071B08A2010E
> +:1000600008A360E803071B08A30108A360E80207D0
> +:0A0070001B04A40108231400FF0084
> +:00000001FF
> +#
> +# Replacement CIS for AC555 provided by Sierra Wireless
> +#
> diff -urpN linux-2.6.32-rc5.orig/firmware/cis/SW_7xx_SER.cis.ihex linux-2.6.32-rc5/firmware/cis/SW_7xx_SER.cis.ihex
> --- linux-2.6.32-rc5.orig/firmware/cis/SW_7xx_SER.cis.ihex	1970-01-01 09:00:00.000000000 +0900
> +++ linux-2.6.32-rc5/firmware/cis/SW_7xx_SER.cis.ihex	2009-10-17 16:12:47.000000000 +0900
> @@ -0,0 +1,13 @@
> +:100000000101FF17034100FF2004920110072102A4
> +:1000100002001537070053696572726120576972D3
> +:10002000656C6573730041433731302F4143373579
> +:10003000300047505253204E6574776F726B2041E9
> +:1000400064617074657200523100FF1A050103008B
> +:1000500007731B10E00119784D555D25A360F80367
> +:100060000730BC861B08A10108A360F802071B0823
> +:10007000A20108A360E803071B08A30108A360E826
> +:0C00800002071B04A40108231400FF0069
> +:00000001FF
> +#
> +# Replacement CIS for AC7xx provided by Sierra Wireless
> +#
> diff -urpN linux-2.6.32-rc5.orig/firmware/cis/SW_8xx_SER.cis.ihex linux-2.6.32-rc5/firmware/cis/SW_8xx_SER.cis.ihex
> --- linux-2.6.32-rc5.orig/firmware/cis/SW_8xx_SER.cis.ihex	1970-01-01 09:00:00.000000000 +0900
> +++ linux-2.6.32-rc5/firmware/cis/SW_8xx_SER.cis.ihex	2009-10-17 16:12:47.000000000 +0900
> @@ -0,0 +1,13 @@
> +:100000000101FF17034100FF2004920110072102A4
> +:100010000200152F070053696572726120576972DB
> +:10002000656C657373004143383530003347204EAB
> +:100030006574776F726B20416461707465720052F1
> +:100040003100FF1A0501030007731B10E001197846
> +:100050004D555D25A360F8480730BC861B08A101FB
> +:1000600008A360F847071B08A20108A360E8480737
> +:100070001B08A30108A360E847071B04A401082389
> +:040080001400FF0069
> +:00000001FF
> +#
> +# Replacement CIS for AC8xx provided by Sierra Wireless
> +#
> diff -urpN linux-2.6.32-rc5.orig/firmware/Makefile linux-2.6.32-rc5/firmware/Makefile
> --- linux-2.6.32-rc5.orig/firmware/Makefile	2009-10-17 15:53:52.000000000 +0900
> +++ linux-2.6.32-rc5/firmware/Makefile	2009-10-17 16:12:21.000000000 +0900
> @@ -69,11 +69,13 @@ fw-shipped-$(CONFIG_E100) += e100/d101m_
>  fw-shipped-$(CONFIG_MYRI_SBUS) += myricom/lanai.bin
>  fw-shipped-$(CONFIG_PCMCIA_PCNET) += cis/LA-PCM.cis cis/PCMLM28.cis \
>  				     cis/DP83903.cis cis/NE2K.cis \
> -				     cis/tamarack.cis
> +				     cis/tamarack.cis cis/PE-200.cis
>  fw-shipped-$(CONFIG_PCMCIA_3C589) += cis/3CXEM556.cis
>  fw-shipped-$(CONFIG_PCMCIA_3C574) += cis/3CCFEM556.cis
>  fw-shipped-$(CONFIG_SERIAL_8250_CS) += cis/MT5634ZLX.cis cis/RS-COM-2P.cis \
> -				       cis/COMpad2.cis cis/COMpad4.cis
> +				       cis/COMpad2.cis cis/COMpad4.cis \
> +				       cis/SW_555_SER.cis cis/SW_7xx_SER.cis \
> +				       cis/SW_8xx_SER.cis
>  fw-shipped-$(CONFIG_PCMCIA_SMC91C92) += ositech/Xilinx7OD.bin
>  fw-shipped-$(CONFIG_SCSI_ADVANSYS) += advansys/mcode.bin advansys/38C1600.bin \
>  				      advansys/3550.bin advansys/38C0800.bin
> diff -urpN linux-2.6.32-rc5.orig/firmware/WHENCE linux-2.6.32-rc5/firmware/WHENCE
> --- linux-2.6.32-rc5.orig/firmware/WHENCE	2009-10-17 15:53:52.000000000 +0900
> +++ linux-2.6.32-rc5/firmware/WHENCE	2009-10-17 16:12:04.000000000 +0900
> @@ -600,6 +600,7 @@ File: cis/LA-PCM.cis
>        cis/DP83903.cis
>        cis/NE2K.cis
>        cis/tamarack.cis
> +      cis/PE-200.cis
>  
>  Licence: GPL
>  
> @@ -633,6 +634,9 @@ File: cis/MT5634ZLX.cis
>        cis/RS-COM-2P.cis
>        cis/COMpad2.cis
>        cis/COMpad4.cis
> +      cis/SW_555_SER.cis
> +      cis/SW_7xx_SER.cis
> +      cis/SW_8xx_SER.cis
>  
>  Licence: GPL
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH v2 1/8] Only parse time stamp TCP option in time wait sock
From: William Allen Simpson @ 2009-10-21 18:59 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4ADEDD65.6070802@codefidence.com>

Gilad Ben-Yossef wrote:
> William Allen Simpson wrote:
> 
>> Gilad Ben-Yossef wrote:
>>> A time wait socket is established - we already know if time stamp
>>> option is called for or not.
>>>
>> Not so sure about this.  A timewait sock isn't actually established,
>> and new/changed options could appear.  There's all sorts of edge cases.
> If you examine the specific context where tcp_parse_options is being 
> called here,
> the only TCP option which is of interest is the time stamp option, and 
> this code path
> is only being taken when we already know that the original socket  had
> used the time stamp option.
> 
> So while I agree that in general you are right, I do believe that in the 
> specific context
> of this patch we should call tcp_parse_options with the established flag 
> on and let it
> know we are expecting to see a time stamp option, which is what I was 
> referring to.
> 
No, a major reason for time-wait is rebooted systems.  We don't "know"
anything about them, and they certainly don't know anything about us.

As I mentioned, this is about edge cases.


>>
>> There's also some current work to note:
>>
>>  http://tools.ietf.org/html/draft-ietf-tcpm-1323bis
>>
>>  http://tools.ietf.org/html/draft-gont-tcpm-tcp-timestamps
> 
> Very interesting, thank you.
> 
> As I noted above, my comment about
> TIME WAIT sockets being "established" should really only be considered
> in the context of the specific call to tcp_parse_options() and the 
> "established"
> parameter of that function.
> 
My suggestion, as this patch is not essential to the other patches in the
series, is to separate it.  As I'm relatively new to this list, I don't
know the best practice.  But I'd like to support the others and delay
this for further consideration.

^ permalink raw reply

* Re: NOHZ: local_softirq_pending 08
From: Tilman Schmidt @ 2009-10-21 18:46 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, johannes, hidave.darkstar, linux-kernel, tglx,
	linux-wireless, linux-ppp, netdev, paulus
In-Reply-To: <4AD76184.6030900@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2202 bytes --]

On 15.10.2009 19:53 Jarek Poplawski wrote:
> Jarek Poplawski wrote, On 10/15/2009 01:40 PM:
> 
>> On 12-10-2009 13:25, Tilman Schmidt wrote:

>>> I have encountered the message in the subject during a test of
>>> the Gigaset CAPI driver, and would like to determine whether
>>> it's a bug in the driver, a bug somewhere else, or no bug at
>>> all. The test scenario was PPP over ISDN with pppd+capiplugin.
>>> In an alternative scenario, also PPP over ISDN but with
>>> smpppd+capidrv, the message did not occur.

I'm sorry, I had confused the two cases. The message occurs in
the smpppd+capidrv scenario, not with pppd+capiplugin.

>>> Johannes' answer pointed me to the netif_rx() function.
>>> The Gigaset driver itself doesn't call that function at all.
>>> In the scenario where I saw the message, it was the SYNC_PPP
>>> line discipline that did.

This analysis was therefore wrong. It would be the netif_rx()
call towards the end of isdn_ppp_push_higher() in
drivers/isdn/i4l/isdn_ppp.c L1177.

>> Anyway, I agree with Michael Buesch there is no reason to waste time
>> for tracking all netif_rx vs netif_rx_ni uses, and it seems we could
>> avoid it by using the "proper" version of raise_softirq_irqoff() in
>> __napi_schedule(). Could anybody try if I'm not wrong?
>>
>>  net/core/dev.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 28b0b9e..7fc4009 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2728,7 +2728,7 @@ void __napi_schedule(struct napi_struct *n)
>>  
>>  	local_irq_save(flags);
>>  	list_add_tail(&n->poll_list, &__get_cpu_var(softnet_data).poll_list);
>> -	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
>> +	raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>  	local_irq_restore(flags);
>>  }
>>  EXPORT_SYMBOL(__napi_schedule);

I have tested your patch and I can confirm that it fixes the messages.
I have not noticed any ill effects.

Thanks,
Tilman

-- 
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Ungeöffnet mindestens haltbar bis: (siehe Rückseite)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 254 bytes --]

^ permalink raw reply

* Re: Enable syn cookies by default
From: Olaf van der Spek @ 2009-10-21 18:45 UTC (permalink / raw)
  To: William Allen Simpson; +Cc: netdev
In-Reply-To: <4ADF5499.2080107@gmail.com>

On Wed, Oct 21, 2009 at 8:36 PM, William Allen Simpson
<william.allen.simpson@gmail.com> wrote:
> Olaf van der Spek wrote:
>>
>> How and when do they interfere?
>> If syn cookies are enabled and the queue isn't full, they're not used
>> so they don't interfere.
>> If the queue is full, they do interfere, but the alternative would be
>> no connection at all.
>
> You just answered your own question, both "how" and "when"....

No, I didn't.

>> So I really don't see the disadvantage of enabling cookies by default.
>>
> On systems with long delay paths, it represents turning back the clock
> more than a decade or so.

How's that? Are you saying no connection is better than a connection
with timestamps and SACK?
I don't believe you.

Wasn't there recently a patch to enable these things even when syn
cookies are actually being used?

> A better solution is usually a firewall/IDS.

Why's that?

> The best solution: I'm working on it.

Hmm, got any link to those cookies? I can only find docs on SYN cookies.

> As I'm sure you're aware, Timestamps and Sack options are fairly crucial.

Of course. I'm not saying you should disable them.

>
>>> As Ubuntu is debian based, perhaps they can back-port the Ubuntu changes?
>>
>> Actually changing the value isn't the problem, but the Debian
>> maintainer isn't sure it's a good idea (but he doesn't know why).
>>
> Well, that depends.  For a client, it's a good idea, as the defense is
> mostly local and rare.  For a server run by a small underfunded ISP, it's
> still a good idea as a last ditch defense.  But for a full-fledged ISP,
> especially running in a satellite environment or with a lot of dial-up
> customers, it's terrible!

Why?

> That's a reason the Ubuntu configuration approach works for me.
>
> A caveat: I've not run debian directly in many, many years (IIRC, since
> Red Hat Colgate), and more recently via Unbuntu (since Badger).  I don't
> know whether debian has evolved different installation procedures for
> different environments.

I'm not aware of any differences.

> My comments are based on fairly extensive experience with deployment of
> Yellow Dog Linux servers at an ISP (as a co-founder), and Ubuntu clients
> for the past 2 (US) election cycles.

Olaf

^ permalink raw reply

* Re: Enable syn cookies by default
From: William Allen Simpson @ 2009-10-21 18:36 UTC (permalink / raw)
  To: netdev
In-Reply-To: <b2cc26e40910210310o31ca24dcv50f8bd0c3234b71b@mail.gmail.com>

Olaf van der Spek wrote:
> How and when do they interfere?
> If syn cookies are enabled and the queue isn't full, they're not used
> so they don't interfere.
> If the queue is full, they do interfere, but the alternative would be
> no connection at all.

You just answered your own question, both "how" and "when"....

> So I really don't see the disadvantage of enabling cookies by default.
> 
On systems with long delay paths, it represents turning back the clock
more than a decade or so.  A better solution is usually a firewall/IDS.
The best solution: I'm working on it.

As I'm sure you're aware, Timestamps and Sack options are fairly crucial.


>> As Ubuntu is debian based, perhaps they can back-port the Ubuntu changes?
> 
> Actually changing the value isn't the problem, but the Debian
> maintainer isn't sure it's a good idea (but he doesn't know why).
> 
Well, that depends.  For a client, it's a good idea, as the defense is
mostly local and rare.  For a server run by a small underfunded ISP, it's
still a good idea as a last ditch defense.  But for a full-fledged ISP,
especially running in a satellite environment or with a lot of dial-up
customers, it's terrible!

That's a reason the Ubuntu configuration approach works for me.

A caveat: I've not run debian directly in many, many years (IIRC, since
Red Hat Colgate), and more recently via Unbuntu (since Badger).  I don't
know whether debian has evolved different installation procedures for
different environments.

My comments are based on fairly extensive experience with deployment of
Yellow Dog Linux servers at an ISP (as a co-founder), and Ubuntu clients
for the past 2 (US) election cycles.


^ permalink raw reply

* Re: [PATCH] bonding: fix a race condition in calls to slave MII ioctls
From: Jay Vosburgh @ 2009-10-21 18:13 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: netdev
In-Reply-To: <20091021130301.GA4762@midget.suse.cz>

Jiri Bohac <jbohac@suse.cz> wrote:

>In mii monitor mode, bond_check_dev_link() calls the the ioctl
>handler of slave devices. It stores the ndo_do_ioctl function
>pointer to a static (!) ioctl variable and later uses it to call the
>handler with the IOCTL macro.
>
>If another thread executes bond_check_dev_link() at the same time
>(even with a different bond, which none of the locks prevent), a
>race condition occurs. If the two racing slaves have different
>drivers, this may result in one driver's ioctl handler being
>called with a pointer to a net_device controlled with a different
>driver, resulting in unpredictable breakage.
>
>Unless I am overlooking something, the "static" must be a
>copy'n'paste error (?).

	Heh, I was curious, so I looked it up; this bit was added as-is
in September 2000, when the original "miimon" link monitoring code was
added.  It's interesting that nobody hit this bug back in the days
before netif_carrier; I know I ran a lot of mixed slave environments.

>Signed-off-by: Jiri Bohac <jbohac@suse.cz>

	Anyway, the static is obviously wrong, even without the race.

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>


>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 69c5b15..5bfdd0c 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -691,7 +691,7 @@ static int bond_check_dev_link(struct bonding *bond,
> 			       struct net_device *slave_dev, int reporting)
> {
> 	const struct net_device_ops *slave_ops = slave_dev->netdev_ops;
>-	static int (*ioctl)(struct net_device *, struct ifreq *, int);
>+	int (*ioctl)(struct net_device *, struct ifreq *, int);
> 	struct ifreq ifr;
> 	struct mii_ioctl_data *mii;
>
>
>
>
>-- 
>Jiri Bohac <jbohac@suse.cz>
>SUSE Labs, SUSE CZ
>

^ permalink raw reply

* Re: [PATCH 2/4] [RFC] Add c/r support for connected INET sockets
From: Dan Smith @ 2009-10-21 18:05 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: containers, John Dykstra, netdev
In-Reply-To: <20091021175624.GA20972@us.ibm.com>

SH> Sorry, I think we've discussed this before but can't recall - does
SH> setting sport here allow an unpriv user to bypass
SH> CAP_NET_BIND_SERVICE?

Yes, it does.  I was kinda considering that part of the input sanity
checking that I officially punted on.  However, as far as I know,
we'll just need to check that capability before we bind() in the
listen/closed case and hash in the connected case.

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply

* Re: Enable syn cookies by default
From: William Allen Simpson @ 2009-10-21 18:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091021.060443.193726483.davem@davemloft.net>

David Miller wrote:
> Would please you be patient?
> 
> In case you haven't fucking noticed, all of the major kernel
> developers are in Japan at the annual kernel summit and the Japan
> Linux Symposium since late last week.
> 
Wow, that's way over the top!  I'd noticed your recent rudeness to many
folks in my perusal of this list, and carelessness about reading their
documentation (such as confusing "interdependent" with independent), but
I'd ascribed that to the Peter Principle and overwork....

This behavior is inexcusable.  Please apologize or resign.


> So nobody has the time to look into anything requiring real
> long thinking like this issue does.
> 
Thanks for the information.  Too bad it conflicts with the NANOG and ARIN
conferences hosted here in Michigan this week.

^ permalink raw reply

* [PATCH] [NIU] VLAN does not work with niu driver
From: Joyce Yu @ 2009-10-21 18:02 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 6 bytes --]


-- 


[-- Attachment #2: 0001-VLAN_ETH_HLEN-should-be-used-to-make-sure-that-the-w.patch --]
[-- Type: text/x-patch, Size: 858 bytes --]

>From f301748d3156437d65305f14288c7d5711861980 Mon Sep 17 00:00:00 2001
From: Joyce Yu <joyce.yu@sun.com>
Date: Wed, 21 Oct 2009 05:35:46 -0700
Subject: [PATCH] VLAN_ETH_HLEN should be used to make sure that the whole MAC header was copied to the head buffer in the Vlan packets case
Signed-off-by: Joyce Yu <joyce.yu@sun.com>

---
 drivers/net/niu.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/niu.c b/drivers/net/niu.c
index f9364d0..d6c7ac6 100644
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -3545,7 +3545,7 @@ static int niu_process_rx_pkt(struct napi_struct *napi, struct niu *np,
 	rp->rcr_index = index;
 
 	skb_reserve(skb, NET_IP_ALIGN);
-	__pskb_pull_tail(skb, min(len, NIU_RXPULL_MAX));
+	__pskb_pull_tail(skb, min(len, VLAN_ETH_HLEN));
 
 	rp->rx_packets++;
 	rp->rx_bytes += skb->len;
-- 
1.6.4


^ permalink raw reply related

* Re: [PATCH 2/4] [RFC] Add c/r support for connected INET sockets
From: Serge E. Hallyn @ 2009-10-21 17:56 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, John Dykstra, netdev
In-Reply-To: <1256072803-3518-3-git-send-email-danms@us.ibm.com>

Quoting Dan Smith (danms@us.ibm.com):
> This patch adds basic support for C/R of open INET sockets.  I think that
> all the important bits of the TCP and ICSK socket structures is saved,
> but I think there is still some additional IPv6 stuff that needs to be
> handled.
> 
> With this patch applied, the following script can be used to demonstrate
> the functionality:
> 
>   https://lists.linux-foundation.org/pipermail/containers/2009-October/021239.html
> 
> It shows that this enables migration of a sendmail process with open
> connections from one machine to another without dropping.
> 
> We still need comments from the netdev people about what sort of sanity
> checking we need to do on the values in the ckpt_hdr_socket_inet
> structure on restart.
> 
> Note that this still doesn't address lingering sockets yet.
> 
> Changes in v2:
>  - Restore saddr, rcv_saddr, daddr, sport, and dport from the sockaddr
>    structure instead of saving them separately
>  - Fix 'sock' naming in sock_cptrst()
>  - Don't take the queue lock before skb_queue_tail() since it is
>    done for us
>  - Allow "listen only" restore behavior if RESTART_SOCK_LISTENONLY
>    flag is specified on sys_restart()
>  - Pull the implementation of the list of listening sockets back into
>    this patch
>  - Fix dangling printk
>  - Add some comments around the parent/child restore logic
> 
> Cc: netdev@vger.kernel.org
> Cc: Oren Laadan <orenl@librato.com>
> Cc: John Dykstra <jdykstra72@gmail.com>
> Signed-off-by: Dan Smith <danms@us.ibm.com>

fwiw,

Acked-by: Serge Hallyn <serue@us.ibm.com>

except

> +static int sock_inet_restore_addrs(struct inet_sock *inet,
> +				   struct ckpt_hdr_socket_inet *hh)
> +{
> +	inet->daddr = hh->raddr.sin_addr.s_addr;
> +	inet->saddr = hh->laddr.sin_addr.s_addr;
> +	inet->rcv_saddr = inet->saddr;
> +
> +	inet->dport = hh->raddr.sin_port;
> +	inet->sport = hh->laddr.sin_port;

Sorry, I think we've discussed this before but can't recall - does
setting sport here allow an unpriv user to bypass CAP_NET_BIND_SERVICE?

-serge

^ permalink raw reply

* [RFC] net,socket: introduce build_sockaddr_check helper to catch overflow at build time
From: Cyrill Gorcunov @ 2009-10-21 17:07 UTC (permalink / raw)
  To: Linux-Netdev; +Cc: David Miller

Hi,

while were sneaking thru sockets code I've got the idea that we may
check for __kernel_sockaddr_storage overflow at build time. At moment
this structure is big enough and I hardly believe it could be overflowed
ever (hmm?).

Anyway just an idea which could be stupid perhaps but I decided to
put it out. An idea is that before copy protocol specific data in
socket->ops->getname implementation the driver code may put

build_sockaddr_check(sizeof(some_struct));

and be sure it doesn't overflow the hosting unit.

Feel free to just ignore this RFC, was just an idea to share.

	-- Cyrill
---
net,socket: introduce build_sockaddr_check helper to catch overflow at build time

proto_ops->getname implies copying protocol specific data
into storage unit (particulary to __kernel_sockaddr_storage).
So when one implements new protocol he either may keep this
in mind (or may not).

Lets introduce build_sockaddr_check helper which check if
storage unit is not overfowed. Note that the check is build
time and introduce no slowdown at execution time.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 include/linux/socket.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6.git/include/linux/socket.h
=====================================================================
--- linux-2.6.git.orig/include/linux/socket.h
+++ linux-2.6.git/include/linux/socket.h
@@ -24,6 +24,9 @@ struct __kernel_sockaddr_storage {
 #include <linux/types.h>		/* pid_t			*/
 #include <linux/compiler.h>		/* __user			*/
 
+#define build_sockaddr_check(size)	\
+	BUILD_BUG_ON(((size) > sizeof(struct __kernel_sockaddr_storage)))
+
 #ifdef __KERNEL__
 # ifdef CONFIG_PROC_FS
 struct seq_file;

^ permalink raw reply

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Octavian Purdila @ 2009-10-21 16:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <4ADF2B57.4030708@gmail.com>

On Wednesday 21 October 2009 18:40:07 you wrote:

> >
> > I would also like to see this patch in, we are running into scalability
> > issues with creating/deleting lots of interfaces as well.
> 
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurently.
> 
> Before applying/examining his patch I suggest identifying all dev_put()
>  spots than can be deleted and replaced by something more scalable. I began
>  this job but others can help me.
> 

Yes, I agree with you, there are multiple places which needs to be touched to 
allow for better scaling with regard to the number of interfaces. We do have 
patches that addresses some of these issues, but unfortunately they are based 
on 2.6.7 and some of them are quite ugly hacks :) 

However, we are in the process of switching to 2.6.31 so I hope we will be 
able to contribute on this effort.

> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 

Hmm, how would multiple tasks help here? Isn't the RTNL mutex global?

> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> Following patch doesnt slow down dev_put() users and real scalability
> problems will surface and might be addressed.
> 
> [PATCH] net: allow netdev_wait_allrefs() to run faster
> 

Thanks, I am going to test it on our platform and send back the results.

tavi

^ permalink raw reply

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Benjamin LaHaise @ 2009-10-21 16:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, netdev, Cosmin Ratiu
In-Reply-To: <4ADF2B57.4030708@gmail.com>

On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.

The first part I've been tackling has been the overhead in procfs, sysctl 
and sysfs.  I've got patches for some of the issues, hacks for others, and 
should have something to post in a few days.  Getting rid of those overheads 
is enough to get to decent interface creation times for the first 20 or 30k 
interfaces.

On the interface deletion side of things, within the network code, fib_hash 
has a few linear scans that really start hurting.  trie is a bit better, 
but I haven't started digging too deeply into its flush/remove overhead yet, 
aside from noticing that trie has a linear scan if 
CONFIG_IP_ROUTE_MULTIPATH is set since it sets the hash size to 1.  
fn_trie_flush() is currently the worst offender during deletion.

		-ben

^ permalink raw reply

* Re: possible circular locking dependency in ISDN PPP
From: Tilman Schmidt @ 2009-10-21 16:24 UTC (permalink / raw)
  To: Xiaotian Feng; +Cc: LKML, isdn4linux, Netdev, Karsten Keil
In-Reply-To: <7b6bb4a50910182227y1281b40bj3fcc082d32cf4496@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2152 bytes --]

Thanks for your analysis. The patch you propose does indeed prevent the
"circular locking dependency" message, with no noticeable ill effect.

I cannot say why xmit_lock was taken around isdn_net_lp_busy() in the
first place. The ISDN4Linux people would be the ones to comment on that.
If none of them objects, I propose you add a Signed-off-by line to your
patch and submit it to Karsten Keil, the ISDN maintainer, for inclusion.
You can also add a "Tested-by: Tilman Schmidt <tilman@imap.cc>" line.

Thanks,
Tilman

Am 19.10.2009 07:27 schrieb Xiaotian Feng:
> So there's a circular locking dependency.. Looking into isdn_net_get_locked_lp()
[...]
> Why do we need to hold xmit_lock while using
> isdn_net_lp_busy(nd->queue) ? Can following patch fix this bug?
> 
> ---
> diff --git a/drivers/isdn/i4l/isdn_net.h b/drivers/isdn/i4l/isdn_net.h
> index 74032d0..7511f08 100644
> --- a/drivers/isdn/i4l/isdn_net.h
> +++ b/drivers/isdn/i4l/isdn_net.h
> @@ -83,19 +83,19 @@ static __inline__ isdn_net_local *
> isdn_net_get_locked_lp(isdn_net_dev *nd)
> 
>         spin_lock_irqsave(&nd->queue_lock, flags);
>         lp = nd->queue;         /* get lp on top of queue */
> -       spin_lock(&nd->queue->xmit_lock);
>         while (isdn_net_lp_busy(nd->queue)) {
> -               spin_unlock(&nd->queue->xmit_lock);
>                 nd->queue = nd->queue->next;
>                 if (nd->queue == lp) { /* not found -- should never happen */
>                         lp = NULL;
>                         goto errout;
>                 }
> -               spin_lock(&nd->queue->xmit_lock);
>         }
>         lp = nd->queue;
>         nd->queue = nd->queue->next;
> +       spin_unlock_irqrestore(&nd->queue_lock, flags);
> +       spin_lock(&lp->xmit_lock);
>         local_bh_disable();
> +       return lp;
>  errout:
>         spin_unlock_irqrestore(&nd->queue_lock, flags);
>         return lp;
> 

-- 
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Ungeöffnet mindestens haltbar bis: (siehe Rückseite)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 254 bytes --]

^ permalink raw reply

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-21 16:09 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <4ADF2B57.4030708@gmail.com>

Eric Dumazet a écrit :
> Octavian Purdila a écrit :
>> On Sunday 18 October 2009 21:21:44 you wrote:
>>>> The msleep(250) should be tuned first. Then if this is really necessary
>>>> to dismantle 100.000 netdevices per second, we might have to think a bit
>>>> more. 
>>>> Just try msleep(1 or 2), it should work quite well.
>>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>>>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>>>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>>>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>>>  system can handle that.
>>>
>> I would also like to see this patch in, we are running into scalability issues 
>> with creating/deleting lots of interfaces as well.
> 
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurently.
> 
> Before applying/examining his patch I suggest identifying all dev_put() spots than
> can be deleted and replaced by something more scalable. I began this job
> but others can help me.
> 
> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 
> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> Following patch doesnt slow down dev_put() users and real scalability
> problems will surface and might be addressed.
> 

Here are typical timings (on current kernel, but on following example
netdev_wait_allrefs() doesnt wait at all, because my netdevice has no refs)

# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link set mv161 up

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link set mv161 down

real    0m0.021s
user    0m0.000s
sys     0m0.001s
# time ip link del mv161

real    0m0.022s
user    0m0.000s
sys     0m0.001s

# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real    0m0.001s
user    0m0.001s
sys     0m0.001s
# time ip link set mv161 up

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link del mv161

real    0m0.036s
user    0m0.000s
sys     0m0.001s

22 ms (or 36 ms) delay are also problematic if you want to dismantle 1.000.000 netdevices at once.

^ permalink raw reply

* [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-21 15:40 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <200910211539.01824.opurdila@ixiacom.com>

Octavian Purdila a écrit :
> On Sunday 18 October 2009 21:21:44 you wrote:
>>> The msleep(250) should be tuned first. Then if this is really necessary
>>> to dismantle 100.000 netdevices per second, we might have to think a bit
>>> more. 
>>> Just try msleep(1 or 2), it should work quite well.
>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>>  system can handle that.
>>
> 
> I would also like to see this patch in, we are running into scalability issues 
> with creating/deleting lots of interfaces as well.

Ben patch only address interface deletion, and one part of the problem,
maybe the more visible one for the current kernel.

Adding lots of interfaces only needs several threads to run concurently.

Before applying/examining his patch I suggest identifying all dev_put() spots than
can be deleted and replaced by something more scalable. I began this job
but others can help me.

RTNL and rcu grace periods are going to hurt anyway, so you probably need
to use many tasks to be able to delete lots of interfaces in parallel.

netdev_run_todo() should also use a better algorithm to allow parallelism.

Following patch doesnt slow down dev_put() users and real scalability
problems will surface and might be addressed.

[PATCH] net: allow netdev_wait_allrefs() to run faster

netdev_wait_allrefs() waits that all references to a device vanishes.

It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users report that no more than 4 devices can be dismantled per second,
this is a pretty serious problem for extreme setups.

Most likely, references only wait for a rcu grace period that should come
fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dev.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 28b0b9e..fca2e4a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4983,7 +4983,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}
 
-		msleep(250);
+		schedule_timeout_uninterruptible(1);
 
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "

^ permalink raw reply related

* Re: [net-next-2.6 PATCH 1/3] irq: Export irq_set_affinity() for drivers
From: Ben Hutchings @ 2009-10-21 15:35 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, gospo, netdev, Peter P Waskiewicz Jr
In-Reply-To: <20091021022626.32449.73883.stgit@localhost.localdomain>

On Tue, 2009-10-20 at 19:26 -0700, Jeff Kirsher wrote:
> From: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> 
> This patch allows drivers to specify an IRQ affinity mask for
> their respective interrupt sources.  This is very useful on
> network adapters using MSI-X, where aligning network flows
> linearly to CPUs greatly improves efficiency of the network
> stack.
[...]

Since this is not networking-specific, I think it needs to go to
linux-kernel as well.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Troubles with virtio network
From: Александр Дориф @ 2009-10-21 13:52 UTC (permalink / raw)
  To: davem; +Cc: netdev

Hello! I compiled Linux kernel v 2.6.32-rc5 and got this warning:

WARNING: drivers/net/virtio_net.o(.data+0xa0): Section mismatch in reference from the variable virtio_net to the function .devexit.text:virtnet_remove()
The variable virtio_net references
the function __devexit virtnet_remove()
If the reference is valid then annotate the
variable with __exit* (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console, 

I hope it'll be corrected soon.

^ permalink raw reply

* Re: [PATCH v2 2/8] Allow tcp_parse_options to consult dst entry
From: Gilad Ben-Yossef @ 2009-10-21 14:07 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Netdev, ori
In-Reply-To: <alpine.DEB.2.00.0910211559050.5304@wel-95.cs.helsinki.fi>

Hi Ilpo,


Thanks for the feedback :-)


Ilpo Järvinen wrote:

> On Wed, 21 Oct 2009, Gilad Ben-Yossef wrote:
>
>   
>> We need tcp_parse_options to be aware of dst_entry to 
>> take into account per dst_entry TCP options settings
>>
>> Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
>> Sigend-off-by: Ori Finkelman <ori@comsleep.com>
>> Sigend-off-by: Yony Amit <yony@comsleep.com>
>>
>> ---
>>  include/net/tcp.h        |    3 ++-
>>  net/ipv4/syncookies.c    |   27 ++++++++++++++-------------
>>  net/ipv4/tcp_input.c     |    9 ++++++---
>>  net/ipv4/tcp_ipv4.c      |   19 ++++++++++---------
>>  net/ipv4/tcp_minisocks.c |    7 +++++--
>>  net/ipv6/syncookies.c    |   28 +++++++++++++++-------------
>>  net/ipv6/tcp_ipv6.c      |    3 ++-
>>  7 files changed, 54 insertions(+), 42 deletions(-)
>>
>>
>>     
<snip>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 7cda24b..1cb0ec4 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -1256,11 +1256,18 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>>  	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
>>  #endif
>>  
>> +	ireq = inet_rsk(req);
>> +	ireq->loc_addr = daddr;
>> +	ireq->rmt_addr = saddr;
>> +	ireq->no_srccheck = inet_sk(sk)->transparent;
>> +	ireq->opt = tcp_v4_save_options(sk, skb);
>> +
>> +	dst = inet_csk_route_req(sk, req);
>>  	tcp_clear_options(&tmp_opt);
>>  	tmp_opt.mss_clamp = 536;
>>  	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
>>  
>> -	tcp_parse_options(skb, &tmp_opt, 0);
>> +	tcp_parse_options(skb, &tmp_opt, 0, dst);
>>  
>>  	if (want_cookie && !tmp_opt.saw_tstamp)
>>  		tcp_clear_options(&tmp_opt);
>> @@ -1269,14 +1276,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>>  
>>  	tcp_openreq_init(req, &tmp_opt, skb);
>>  
>> -	ireq = inet_rsk(req);
>> -	ireq->loc_addr = daddr;
>> -	ireq->rmt_addr = saddr;
>> -	ireq->no_srccheck = inet_sk(sk)->transparent;
>> -	ireq->opt = tcp_v4_save_options(sk, skb);
>> -
>>  	if (security_inet_conn_request(sk, skb, req))
>> -		goto drop_and_free;
>> +		goto drop_and_release;
>>  
>>  	if (!want_cookie)
>>  		TCP_ECN_create_request(req, tcp_hdr(skb));
>> @@ -1301,7 +1302,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>>  		 */
>>  		if (tmp_opt.saw_tstamp &&
>>  		    tcp_death_row.sysctl_tw_recycle &&
>> -		    (dst = inet_csk_route_req(sk, req)) != NULL &&
>> +		    dst != NULL &&
>>     
>
> Why you need this NULL check this here while you trap it with BUG_ON 
> elsewhere? Does your patch perhaps create a remote DoS opportunity?
>
>
>   
Indeed, I believe you are right. Good catch.

What about this (I know the patch gets eaten by Thunderbird, sorry about 
that. This is just for explaining what I want to do):

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c

index 1cb0ec4..1d611e3 100644

--- a/net/ipv4/tcp_ipv4.c

+++ b/net/ipv4/tcp_ipv4.c

@@ -1263,6 +1263,9 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)

        ireq->opt = tcp_v4_save_options(sk, skb);

 

        dst = inet_csk_route_req(sk, req);

+       if(!dst)

+               goto drop_and_free;

+

        tcp_clear_options(&tmp_opt);

        tmp_opt.mss_clamp = 536;

        tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;

@@ -1302,7 +1305,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)

                 */

                if (tmp_opt.saw_tstamp &&

                    tcp_death_row.sysctl_tw_recycle &&

-                   dst != NULL &&

                    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&

                    peer->v4daddr == saddr) {

                        if (get_seconds() < peer->tcp_ts_stamp + TCP_PAWS_MSL &&



My rational is that since if the connection is formed we will need to 
send a syn/ack ( call to __tcp_v4_send_synack a couple of lines below) 
and since we can't do that  if we don't have a route, this makes sense.

If this sounds sane, I'll re-spin the patch with this as a fix.

Thanks a bunch!
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker & CTO
Codefidence Ltd.

Web:   http://codefidence.com
Cell:  +972-52-8260388
Skype: gilad_codefidence
Tel:   +972-8-9316883 ext. 201
Fax:   +972-8-9316884
Email: gilad@codefidence.com

Check out our Open Source technology and training blog - http://tuxology.net

	"Sorry cannot parse this, its too long to be true  :)"
	  -- Eric Dumazet on netdev mailing list


^ permalink raw reply

* Re: [PATCH 1/4] Adds random ect generation to tfrc-sp sender side
From: Ivo Calado @ 2009-10-21 13:15 UTC (permalink / raw)
  To: dccp, Gerrit Renker, Ivo Calado, netdev
In-Reply-To: <425e6efa0910210449r1a10fb2cvf7650ad470d987aa@mail.gmail.com>

On Mon, Oct 19, 2009 at 2:23 AM, Gerrit Renker <gerrit@erg.abdn.ac.uk> wrote:
> | Adds random ect generation to tfrc-sp sender side.
>
> I thought about this and found several reasons why it would be better to
> defer ECN Nonce sums to a later implementation.
>
>  1) At the moment the code always sets ECT(0). Even if it would
>    alternate ECT(0) and ECT(1), this would later be overwritten by ECT(0)
>    in dccp_msghdr_parse(). Ok, this could be fixed, but the real problem
>    is that the underlying machinery does not support ECN nonces, since
>
>    * ECN / DiffServ information is in two separate places of the
>      inet_sock (u8 `tos' field and u8 `tclass' field of ipv6_pinfo);
>
>    * the ECN driver sits in include/net/inet_ecn.h as
>      #define INET_ECN_xmit(sk) do { inet_sk(sk)->tos |= INET_ECN_ECT_0; } while (0)
>
>    * hence this would need to be revised and the best way to make an
>      acceptable suggestion would be a coded proof of concept that
>      changing the underlying implementation does have benefits.
>
>    On the receiver side the situation is the same. The function
>    tfrc_sp_check_ecn_sum(), introduced in Patch 2/4 of the TFRC-SP sender
>    implementation is only referenced in Patch 2/2 of the CCID-4 set, where
>    it ends, without side effect in "TODO: consider ecn sum test fail".
>
>    That is, at the moment both the sender and receiver side of the ECN Nonce
>    sum verification are placeholders which currently have no effect.
>

Okay, then the implementation would be useless now.

>
>  2) As far as I can see the ECN Nonce is an optimisation, an
>
>      "optional addition to Explicit Congestion Notification [RFC3168]
>       improving its robustness against malicious or accidental
>       concealment of marked packets [...]" (from the abstract)
>
>    Hence if at all, we would only have a benefit of adding the ECN Nonce
>    verification on top of an already verified implementation.
>

Yes, not priority at all. And you're right, no benefit.

>  3) Starting an implementation throws up further questions that need to
>    be addressed, both the basis and the extension need to be verified.
>
> I would like to suggest to implement the basis, that is CCID-4 with ECN
> (using plain ECT(0)), test with that until it works satisfactorily, and
> then continue adding measures such as the ECN Nonce verification.
>

Okay. But, when would be good to at least include random ECT
generation? When DCCP ECN code will get fixed? Is there any work on
this?

> Nothing is lost, once we are at this stage we can return to this set of
> initial patches and revise the situation based on the insights gained
> with ECT(0) experience.
>
> In summary, I would like to suggest to remove the ECN verification for
> the moment and focus on the "basic" issues first.
>
> Would you be ok with that?
>

Yes, we'll keep the ECN verification code here at our git until the
scenario is ready.

>
>
> Appendix
> --------
> | +int tfrc_sp_get_random_ect(struct tfrc_tx_li_data *li_data, u64 seqn)
> | +{
> | +     int ect;
> | +     struct tfrc_ecn_echo_sum_entry *sum;
> | +
> | +     /* TODO: implement random ect*/
> | +     ect = INET_ECN_ECT_0;
> | +
> | +     sum = kmem_cache_alloc(tfrc_ecn_echo_sum_slab, GFP_ATOMIC);
>
> For a later implementation, there should be protection against NULL, e.g.
>        if (sum == NULL) {
>           DCCP_CRIT("Problem here ...");
>           return 0;
>        }
> | +
> | +     sum->previous = li_data->ecn_sums_head;
> | +     sum->ecn_echo_sum = (sum->previous->ecn_echo_sum) ? !ect : ect;
>

Thanks, i forgot that.

> (Also for later) I wonder how to do the sums, with RFC 3168
> ECT(0) = 0x2 => !0x2 = 0
> ECT(1) = 0x1 => !0x1 = 0
>

I don't understand. Can you try to explain it? Or cite RFC section
that address it?

> From the addition table in RFC 3540, section 2,
> ECT(0) + ECT(0) = 0
> ECT(0) + ECT(1) = 1
> ECT(1) + ECT(1) = 0
>
> One way could be
>        sum->ecn_echo_sum ^= (ect == INET_ECN_ECT_1);

Ok.




-- 
Ivo Augusto Andrade Rocha Calado
MSc. Candidate
Embedded Systems and Pervasive Computing Lab - http://embedded.ufcg.edu.br
Systems and Computing Department - http://www.dsc.ufcg.edu.br
Electrical Engineering and Informatics Center - http://www.ceei.ufcg.edu.br
Federal University of Campina Grande - http://www.ufcg.edu.br

PGP: 0x03422935
Quidquid latine dictum sit, altum viditur.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox