Netdev List
* Re: [PATCH 2/9] IB: amso1100: convert to SKB paged frag API.
From: Steve Wise @ 2011-08-25 14:42 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Tom Tucker, Roland Dreier,
	Sean Hefty, Hal Rosenstock, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1314260895-15936-2-git-send-email-ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>

Acked-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* cassini driver: Use of uninitialized memory
From: Thomas Jarosch @ 2011-08-25 13:58 UTC (permalink / raw)
  To: netdev

Hello,

the interrupt routine of the cassini driver
currently looks like this:

----------------------
static irqreturn_t cas_interruptN(int irq, void *dev_id)
{
	struct net_device *dev = dev_id;
	struct cas *cp = netdev_priv(dev);
	unsigned long flags;
	int ring;
	u32 status = readl(cp->regs + REG_PLUS_INTRN_STATUS(ring));
...
----------------------

-> "ring" is still uninitialized when it is passed to
REG_PLUS_INTRN_STATUS. A few lines further down there is this:

----------------------
	ring = (irq == cp->pci_irq_INTC) ? 2 : 3;
----------------------

Should that line be moved before the readl() call
or should "ring" be initialized with zero?
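
The first option can be sketched as a small user-space model; this is not driver code. `fake_regs`, `struct fake_cas`, `read_status()` and `interrupt_status()` are hypothetical stand-ins for the device registers, `struct cas`, and `readl()` on `REG_PLUS_INTRN_STATUS(ring)`. The only point is that `ring` is derived from `irq` before it selects a register.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-ring interrupt status registers. */
static uint32_t fake_regs[4] = { 0x10, 0x11, 0x12, 0x13 };

struct fake_cas {
	int pci_irq_INTC;	/* IRQ number that maps to ring 2 in this model */
};

/* Stand-in for readl(cp->regs + REG_PLUS_INTRN_STATUS(ring)). */
static uint32_t read_status(int ring)
{
	return fake_regs[ring];
}

/* Compute the ring first, then read the status register for that ring. */
static uint32_t interrupt_status(const struct fake_cas *cp, int irq)
{
	int ring = (irq == cp->pci_irq_INTC) ? 2 : 3;

	return read_status(ring);
}
```

Initializing `ring` to zero instead would presumably silence the warning, but in this model it would read the status of ring 0 rather than the ring that raised the interrupt.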

Credit for spotting this goes to cppcheck.

Cheers,
Thomas

* Re: iwlagn: Random "Time out reading EEPROM".
From: Nicolas de Pesloüan @ 2011-08-25 13:50 UTC (permalink / raw)
  To: wwguy
  Cc: dhalperi-GmWTxIRN22iJaUV4rX00uodd74u8MsAO@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, wireless
In-Reply-To: <1310740111.13897.13.camel@wwguy-ubuntu>

On 15/07/2011 16:28, wwguy wrote:

> The error indicates a failure to read data from the EEPROM. Your 2nd report
> is even stranger: the number at the end of the error message is the index
> of the DWORD the driver was trying to read from the EEPROM.
>
> "Time out reading EEPROM[2]" tells me the first 2 DWORDs were read OK,
> but the 3rd read was not.
>
> How many PCI-E slots do you have in your system? Would it be possible for
> you to switch to another PCI-E slot, or to pull out and re-insert the NIC?

Unfortunately not. On this laptop, the NIC cannot be reached without disassembling the laptop, and I 
don't want to... I will double-check, but...

> Also, is it possible to put the NIC into a different system and see whether
> you get a similar problem?

No, for the exact same reason.

Note that it still happens with 3.0.0-1 from Debian.

[   15.086244] iwlagn: Intel(R) Wireless WiFi Link AGN driver for Linux, in-tree:
[   15.086247] iwlagn: Copyright(c) 2003-2011 Intel Corporation
[   15.086404] iwlagn 0000:05:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[   15.086412] iwlagn 0000:05:00.0: setting latency timer to 64
[   15.086438] iwlagn 0000:05:00.0: Detected Intel(R) WiFi Link 5100 AGN, REV=0x54
[   15.095859] iwlagn 0000:05:00.0: Time out reading EEPROM[6]
[   15.095945] iwlagn 0000:05:00.0: Unable to init EEPROM
[   15.096030] iwlagn 0000:05:00.0: PCI INT A disabled
[   15.096039] iwlagn: probe of 0000:05:00.0 failed with error -110
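
To help read the log above: as explained earlier in the thread, the number in `EEPROM[6]` is the index of the DWORD that timed out, and the probe error -110 matches `ETIMEDOUT` on Linux/x86. The sketch below is a user-space model of such a read loop, not the iwlagn code; `fake_read_dword()`, `read_eeprom()` and the polling budget are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define POLL_BUDGET 3	/* hypothetical number of polls before giving up */

/*
 * Hypothetical device model: every dword is ready on the first poll,
 * except the one at index `stuck`, which never becomes ready.
 */
static int fake_read_dword(int index, int stuck, uint32_t *out)
{
	if (index == stuck)
		return 0;		/* not ready */
	*out = 0x1000u + (uint32_t)index;
	return 1;			/* ready */
}

/*
 * Read `count` dwords. Returns 0 on success. On timeout, returns
 * -ETIMEDOUT and stores the index of the unreadable dword in *failed.
 */
static int read_eeprom(uint32_t *buf, int count, int stuck, int *failed)
{
	for (int i = 0; i < count; i++) {
		int ready = 0;

		for (int poll = 0; poll < POLL_BUDGET && !ready; poll++)
			ready = fake_read_dword(i, stuck, &buf[i]);
		if (!ready) {
			*failed = i;	/* "Time out reading EEPROM[i]" */
			return -ETIMEDOUT;
		}
	}
	return 0;
}
```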

modprobe -r iwlagn ; modprobe iwlagn

[  231.822492] iwlagn: Intel(R) Wireless WiFi Link AGN driver for Linux, in-tree:
[  231.822495] iwlagn: Copyright(c) 2003-2011 Intel Corporation
[  231.822581] iwlagn 0000:05:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[  231.822591] iwlagn 0000:05:00.0: setting latency timer to 64
[  231.822621] iwlagn 0000:05:00.0: Detected Intel(R) WiFi Link 5100 AGN, REV=0x54
[  231.843544] iwlagn 0000:05:00.0: device EEPROM VER=0x11f, CALIB=0x4
[  231.843546] iwlagn 0000:05:00.0: Device SKU: 0Xb
[  231.844889] iwlagn 0000:05:00.0: Tunable channels: 13 802.11bg, 24 802.11a channels
[  231.844961] iwlagn 0000:05:00.0: irq 50 for MSI/MSI-X
[  231.989424] iwlagn 0000:05:00.0: loaded firmware version 8.83.5.1 build 33692
[  232.037456] ieee80211 phy0: Selected rate control algorithm 'iwl-agn-rs'

The error is not easy to reproduce, but the workaround is perfectly reliable: a single unload/reload of 
iwlagn is always enough to solve the problem when it happens. For this reason, it seems hard 
to blame the hardware slot. Could this be related to some other PCI component?

00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:01.0 PCI bridge: Intel Corporation Mobile 4 Series Chipset PCI Express Graphics Port (rev 07)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.2 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 3 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
00:1c.4 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 5 (rev 03)
00:1c.5 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 6 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
01:00.0 VGA compatible controller: nVidia Corporation G98 [GeForce 9300M GS] (rev a1)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8055 PCI-E Gigabit Ethernet Controller (rev 14)
05:00.0 Network controller: Intel Corporation WiFi Link 5100
09:03.0 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller (rev 05)
09:03.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 22)
09:03.2 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter (rev 12)

I'm quite sure I can fix this problem by loading, unloading and reloading iwlagn on every startup... 
but I don't really consider this a fix :-/

	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: iwlagn: Random "Time out reading EEPROM".
From: Guy, Wey-Yi @ 2011-08-25 13:37 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: dhalperi-GmWTxIRN22iJaUV4rX00uodd74u8MsAO@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, wireless
In-Reply-To: <4E565333.3080007-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Thu, 2011-08-25 at 06:50 -0700, Nicolas de Pesloüan wrote:
> On 15/07/2011 16:28, wwguy wrote:
> 
> > The error indicates a failure to read data from the EEPROM. Your 2nd report
> > is even stranger: the number at the end of the error message is the index
> > of the DWORD the driver was trying to read from the EEPROM.
> >
> > "Time out reading EEPROM[2]" tells me the first 2 DWORDs were read OK,
> > but the 3rd read was not.
> >
> > How many PCI-E slots do you have in your system? Would it be possible for
> > you to switch to another PCI-E slot, or to pull out and re-insert the NIC?
> 
> Unfortunately not. On this laptop, the NIC cannot be reached without disassembling the laptop, and I 
> don't want to... I will double-check, but...
> 
> > Also, is it possible to put the NIC into a different system and see whether
> > you get a similar problem?
> 
> No, for the exact same reason.
> 
> Note that it still happens with 3.0.0-1 from Debian.
> 
> [   15.086244] iwlagn: Intel(R) Wireless WiFi Link AGN driver for Linux, in-tree:
> [   15.086247] iwlagn: Copyright(c) 2003-2011 Intel Corporation
> [   15.086404] iwlagn 0000:05:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
> [   15.086412] iwlagn 0000:05:00.0: setting latency timer to 64
> [   15.086438] iwlagn 0000:05:00.0: Detected Intel(R) WiFi Link 5100 AGN, REV=0x54
> [   15.095859] iwlagn 0000:05:00.0: Time out reading EEPROM[6]
> [   15.095945] iwlagn 0000:05:00.0: Unable to init EEPROM
> [   15.096030] iwlagn 0000:05:00.0: PCI INT A disabled
> [   15.096039] iwlagn: probe of 0000:05:00.0 failed with error -110
> 
> modprobe -r iwlagn ; modprobe iwlagn
> 
> [  231.822492] iwlagn: Intel(R) Wireless WiFi Link AGN driver for Linux, in-tree:
> [  231.822495] iwlagn: Copyright(c) 2003-2011 Intel Corporation
> [  231.822581] iwlagn 0000:05:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
> [  231.822591] iwlagn 0000:05:00.0: setting latency timer to 64
> [  231.822621] iwlagn 0000:05:00.0: Detected Intel(R) WiFi Link 5100 AGN, REV=0x54
> [  231.843544] iwlagn 0000:05:00.0: device EEPROM VER=0x11f, CALIB=0x4
> [  231.843546] iwlagn 0000:05:00.0: Device SKU: 0Xb
> [  231.844889] iwlagn 0000:05:00.0: Tunable channels: 13 802.11bg, 24 802.11a channels
> [  231.844961] iwlagn 0000:05:00.0: irq 50 for MSI/MSI-X
> [  231.989424] iwlagn 0000:05:00.0: loaded firmware version 8.83.5.1 build 33692
> [  232.037456] ieee80211 phy0: Selected rate control algorithm 'iwl-agn-rs'
> 
> The error is not easy to reproduce, but the workaround is perfectly reliable: a single unload/reload of 
> iwlagn is always enough to solve the problem when it happens. For this reason, it seems hard 
> to blame the hardware slot. Could this be related to some other PCI component?
> 
> 00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
> 00:01.0 PCI bridge: Intel Corporation Mobile 4 Series Chipset PCI Express Graphics Port (rev 07)
> 00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
> 00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
> 00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
> 00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
> 00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
> 00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
> 00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
> 00:1c.2 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 3 (rev 03)
> 00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
> 00:1c.4 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 5 (rev 03)
> 00:1c.5 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 6 (rev 03)
> 00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
> 00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
> 00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
> 00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
> 00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
> 00:1f.0 ISA bridge: Intel Corporation ICH9M LPC Interface Controller (rev 03)
> 00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03)
> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
> 01:00.0 VGA compatible controller: nVidia Corporation G98 [GeForce 9300M GS] (rev a1)
> 02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8055 PCI-E Gigabit Ethernet Controller (rev 14)
> 05:00.0 Network controller: Intel Corporation WiFi Link 5100
> 09:03.0 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller (rev 05)
> 09:03.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 22)
> 09:03.2 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter (rev 12)
> 
> I'm quite sure I can fix this problem by loading, unloading and reloading iwlagn on every startup... 
> but I don't really consider this a fix :

I'm not sure how to help: since this is hard to reproduce and it is an
EEPROM read problem, I can only guess that it is related to the physical
device.

Thanks
Wey


* Re: TCP port firewall controlled by UDP packets
From: Pavel Machek @ 2011-08-25 13:19 UTC (permalink / raw)
  To: Tonda; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel
In-Reply-To: <1313106172-18455-1-git-send-email-as@strmilov.cz>

No comments, and the variables are named in Czech.

OK for me, but...

But the first thing needed would be a description of what it is good for...?

							Pavel

On Fri 2011-08-12 01:42:52, Tonda wrote:
>  	  If unsure, say N.
> +
> +config TCPFIREWALL
> +	tristate "TCP Firewall controlled by UDP queries"
> +	depends on m
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -51,3 +51,4 @@
>  
>  obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
>  		      xfrm4_output.o
> +obj-$(CONFIG_TCPFIREWALL) += tcpfirewall/
> diff --git a/net/ipv4/tcpfirewall/Makefile b/net/ipv4/tcpfirewall/Makefile
> --- a/net/ipv4/tcpfirewall/Makefile
> +++ b/net/ipv4/tcpfirewall/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_TCPFIREWALL) += tcpfirewall.o
> diff --git a/net/ipv4/tcpfirewall/tcpfirewall.c b/net/ipv4/tcpfirewall/tcpfirewall.c
> --- a/net/ipv4/tcpfirewall/tcpfirewall.c
> +++ b/net/ipv4/tcpfirewall/tcpfirewall.c
> @@ -0,0 +1,451 @@
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/skbuff.h>
> +#include <linux/in.h>
> +#include <linux/if_packet.h>
> +#include <linux/tcp.h>
> +#include <linux/udp.h>
> +#include <net/tcp.h>
> +#include <net/udp.h>
> +
> +struct net_protocol {
> +	int (*handler)(struct sk_buff *skb);
> +	void (*err_handler)(struct sk_buff *skb, u32 info);
> +	int (*gso_send_check)(struct sk_buff *skb);
> +	struct sk_buff *(*gso_segment)(struct sk_buff *skb,
> +		u32 features);
> +	struct sk_buff **(*gro_receive)(struct sk_buff **head,
> +		struct sk_buff *skb);
> +	int (*gro_complete)(struct sk_buff *skb);
> +	unsigned int no_policy:1,
> +		netns_ok:1;
> +};
> +
> +MODULE_LICENSE("GPL");
> +
> +static unsigned long inet_protos = 0x01234567;
> +
> +struct net_protocol **_inet_protos;
> +
> +module_param(inet_protos, ulong, 0);
> +
> +static int *otviraky;
> +static int *zaviraky;
> +
> +static int pocetotviraku;
> +static int pocetzaviraku;
> +static int stav;
> +static int packetcounter;
> +static int tcpport;
> +static int open;
> +static int firewall;
> +
> +int (*tcpv4recv) (struct sk_buff *skb);
> +int (*udprecv) (struct sk_buff *skb);
> +
> +int udpcontroller(struct sk_buff *skb)
> +{
> +	const struct udphdr *uh;
> +
> +	if (skb->pkt_type != PACKET_HOST) {
> +		kfree_skb(skb);
> +		return 0;
> +	}
> +
> +	if (!pskb_may_pull(skb, sizeof(struct tcphdr))) {
> +		kfree_skb(skb);
> +		return 0;
> +	}
> +
> +	uh = udp_hdr(skb);
> +
> +	if (pocetotviraku == 0)
> +		return udprecv(skb);
> +
> +	if (!open) {
> +		if (uh->dest == otviraky[stav]) {
> +			++stav;
> +			packetcounter = 0;
> +
> +			if (stav == pocetotviraku) {
> +				open = 1;
> +				stav = 0;
> +			}
> +		} else {
> +			if (packetcounter <= 16) {
> +				++packetcounter;
> +				if (packetcounter > 16)
> +					stav = 0;
> +			}
> +		}
> +	} else {
> +		if (uh->dest == zaviraky[stav]) {
> +			++stav;
> +			packetcounter = 0;
> +
> +			if (stav == pocetzaviraku) {
> +				open = 0;
> +				stav = 0;
> +			}
> +		} else {
> +			if (packetcounter <= 16) {
> +				++packetcounter;
> +				if (packetcounter > 16)
> +					stav = 0;
> +			}
> +		}
> +	}
> +
> +
> +	return udprecv(skb);
> +}
> +
> +int tcpfirewall(struct sk_buff *skb)
> +{
> +	const struct tcphdr *th;
> +
> +	if (skb->pkt_type != PACKET_HOST) {
> +		kfree_skb(skb);
> +		return 0;
> +	}
> +
> +	if (!pskb_may_pull(skb, sizeof(struct tcphdr))) {
> +		kfree_skb(skb);
> +		return 0;
> +	}
> +
> +	th = tcp_hdr(skb);
> +
> +	if (th->dest == tcpport) {
> +		if (firewall == 1 && !open) {
> +			/*tcpv4sendreset(NULL, skb);*/
> +			kfree_skb(skb);
> +			return 0;
> +		}
> +	}
> +
> +	return tcpv4recv(skb);
> +}
> +
> +static struct net_protocol *zalohatcp;
> +static struct net_protocol *zalohaudp;
> +static struct net_protocol mytcp;
> +static struct net_protocol myudp;
> +
> +static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buffer)
> +{
> +	if (!strcmp(attr->name, "firewall")) {
> +		if (firewall)
> +			buffer[0] = '1';
> +		else
> +			buffer[0] = '0';
> +
> +		buffer[1] = '\n';
> +		return 2;
> +	}
> +
> +	if (!strcmp(attr->name, "tcpport")) {
> +		sprintf(buffer, "%d\n", ntohs(tcpport));
> +		return strlen(buffer)+1;
> +	}
> +
> +	if (!strcmp(attr->name, "openers")) {
> +		int i;
> +		char *znak;
> +		if (pocetotviraku == 0)
> +			return 0;
> +		buffer[0] = '\0';
> +		znak = kmalloc(10, GFP_KERNEL);
> +		for (i = 0; i < pocetotviraku; ++i) {
> +			sprintf(znak, "%d ", ntohs(otviraky[i]));
> +			strcat(buffer, znak);
> +		}
> +		kfree(znak);
> +		buffer[strlen(buffer)-1] = '\n';
> +		return strlen(buffer);
> +	}
> +
> +	if (!strcmp(attr->name, "closers")) {
> +		int i;
> +		char *znak;
> +		if (pocetzaviraku == 0)
> +			return 0;
> +		buffer[0] = '\0';
> +		znak = kmalloc(10, GFP_KERNEL);
> +		for (i = 0; i < pocetzaviraku; ++i) {
> +			sprintf(znak, "%d ", ntohs(zaviraky[i]));
> +			strcat(buffer, znak);
> +		}
> +		kfree(znak);
> +		buffer[strlen(buffer)-1] = '\n';
> +		return strlen(buffer);
> +	}
> +
> +	if (!strcmp(attr->name, "open")) {
> +		if (open)
> +			buffer[0] = '1';
> +		else
> +			buffer[0] = '0';
> +
> +		buffer[1] = '\n';
> +		return 2;
> +	}
> +
> +	if (!strcmp(attr->name, "state")) {
> +		sprintf(buffer, "%d\n", stav);
> +		return strlen(buffer)+1;
> +	}
> +
> +	if (!strcmp(attr->name, "counter")) {
> +		sprintf(buffer, "%d\n", packetcounter);
> +		return strlen(buffer)+1;
> +	}
> +
> +	return 0;
> +}
> +
> +static ssize_t store(struct kobject *kobj, struct attribute *attr,
> +	const char *buffer, size_t size)
> +{
> +	int i;
> +	char *cislo;
> +	if (!strcmp(attr->name, "firewall")) {
> +		if (size > 0 && buffer[0] == '1')
> +			firewall = 1;
> +		else
> +			firewall = 0;
> +		stav = 0;
> +		return size;
> +	}
> +
> +	if (!strcmp(attr->name, "tcpport")) {
> +		cislo = kmalloc(size+1, GFP_KERNEL);
> +		for (i = 0; i < size; ++i)
> +			cislo[i] = buffer[i];
> +		cislo[size] = '\0';
> +		if (kstrtoint(cislo, 10, &i) < 0)
> +			i = -1;
> +		if (i > 0 && i < 65536)
> +			tcpport = htons(i);
> +		kfree(cislo);
> +		stav = 0;
> +		return size;
> +	}
> +
> +	if (!strcmp(attr->name, "openers")) {
> +		int udpport, i;
> +		int *noveotviraky;
> +		int *stareotviraky;
> +		cislo = kmalloc(size+1, GFP_KERNEL);
> +		for (i = 0; i < size; ++i)
> +			cislo[i] = buffer[i];
> +		cislo[size] = '\0';
> +
> +		if (!strcmp(cislo, "reset") || !strcmp(cislo, "reset\n")) {
> +			if (pocetotviraku)
> +				kfree(otviraky);
> +			pocetotviraku = 0;
> +		}
> +
> +		if (kstrtoint(cislo, 10, &i) < 0)
> +			i = -1;
> +		kfree(cislo);
> +
> +		if (i > 0 && i < 65536 && (pocetotviraku == 0 ||
> +			otviraky[pocetotviraku-1] != i))
> +				udpport = htons(i);
> +		else
> +			return size;
> +
> +		if (pocetotviraku < 10) {
> +			noveotviraky = kmalloc((pocetotviraku+1)*sizeof(int),
> +				GFP_KERNEL);
> +
> +			for (i = 0; i < pocetotviraku; ++i)
> +				noveotviraky[i] = otviraky[i];
> +
> +			noveotviraky[pocetotviraku] = udpport;
> +			stareotviraky = otviraky;
> +			otviraky = noveotviraky;
> +			if (pocetotviraku)
> +				kfree(stareotviraky);
> +
> +			++pocetotviraku;
> +		}
> +		stav = 0;
> +		return size;
> +	}
> +
> +	if (!strcmp(attr->name, "closers")) {
> +		int udpport, i;
> +		int *novezaviraky;
> +		int *starezaviraky;
> +		cislo = kmalloc(size+1, GFP_KERNEL);
> +		for (i = 0; i < size; ++i)
> +			cislo[i] = buffer[i];
> +		cislo[size] = '\0';
> +
> +		if (!strcmp(cislo, "reset") || !strcmp(cislo, "reset\n")) {
> +			if (pocetzaviraku)
> +				kfree(zaviraky);
> +			pocetzaviraku = 0;
> +		}
> +
> +		if (kstrtoint(cislo, 10, &i) < 0)
> +			i = -1;
> +		kfree(cislo);
> +
> +		if (i > 0 && i < 65536 && (pocetzaviraku == 0 ||
> +			zaviraky[pocetzaviraku-1] != i))
> +				udpport = htons(i);
> +		else
> +			return size;
> +
> +		if (pocetzaviraku < 10) {
> +			novezaviraky = kmalloc((pocetzaviraku+1)*sizeof(int),
> +				GFP_KERNEL);
> +
> +			for (i = 0; i < pocetzaviraku; ++i)
> +				novezaviraky[i] = zaviraky[i];
> +
> +			novezaviraky[pocetzaviraku] = udpport;
> +			starezaviraky = zaviraky;
> +			zaviraky = novezaviraky;
> +			if (pocetzaviraku)
> +				kfree(starezaviraky);
> +
> +			++pocetzaviraku;
> +		}
> +		stav = 0;
> +		return size;
> +	}
> +
> +	if (!strcmp(attr->name, "open")) {
> +		if (size > 0 && buffer[0] == '1')
> +			open = 1;
> +		else
> +			open = 0;
> +
> +		stav = 0;
> +		return size;
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct sysfs_ops so = {
> +	.show = show,
> +	.store = store,
> +};
> +
> +static struct kobj_type khid = {
> +	.sysfs_ops = &so,
> +};
> +
> +static struct kobject kobj;
> +
> +static const struct attribute fw = {
> +	.name = "firewall",
> +	.mode = S_IRWXU,
> +};
> +
> +static const struct attribute opn = {
> +	.name = "open",
> +	.mode = S_IRWXU,
> +};
> +
> +static const struct attribute tcpp = {
> +	.name = "tcpport",
> +	.mode = S_IRWXU,
> +};
> +
> +static const struct attribute openers = {
> +	.name = "openers",
> +	.mode = S_IRWXU,
> +};
> +
> +static const struct attribute closers = {
> +	.name = "closers",
> +	.mode = S_IRWXU,
> +};
> +
> +static const struct attribute stat = {
> +	.name = "state",
> +	.mode = S_IRUSR,
> +};
> +
> +static const struct attribute counte = {
> +	.name = "counter",
> +	.mode = S_IRUSR,
> +};
> +
> +static int __init start(void)
> +{
> +	if (inet_protos == 0x01234567) {
> +		printk(KERN_WARNING "inet_protos parameter was not");
> +		printk(KERN_WARNING " specified!\nread its value from");
> +		printk(KERN_WARNING " System_map file file, and insert");
> +		printk(KERN_WARNING " the module again!\n");
> +		return -1;
> +	}
> +
> +	pocetotviraku = 0;
> +	pocetzaviraku = 0;
> +	stav = -1;
> +	packetcounter = 0;
> +	tcpport = 0;
> +	open = 1;
> +	firewall = 0;
> +
> +	memset(&kobj, 0, sizeof(struct kobject));
> +
> +	_inet_protos = (struct net_protocol **)inet_protos;
> +
> +	kobject_init(&kobj, &khid);
> +	if (kobject_add(&kobj, NULL, "tcpfirewall") < 0)
> +		printk(KERN_ERR "kobject_add failed");
> +
> +	if (sysfs_create_file(&kobj, &fw) < 0)
> +		printk(KERN_ERR "sysfs_create_file failed");
> +	if (sysfs_create_file(&kobj, &opn) < 0)
> +		printk(KERN_ERR "sysfs_create_file failed");
> +	if (sysfs_create_file(&kobj, &tcpp) < 0)
> +		printk(KERN_ERR "sysfs_create_file failed");
> +	if (sysfs_create_file(&kobj, &openers) < 0)
> +		printk(KERN_ERR "sysfs_create_file failed");
> +	if (sysfs_create_file(&kobj, &closers) < 0)
> +		printk(KERN_ERR "sysfs_create_file failed");
> +	if (sysfs_create_file(&kobj, &stat) < 0)
> +		printk(KERN_ERR "sysfs_create_file failed");
> +	if (sysfs_create_file(&kobj, &counte) < 0)
> +		printk(KERN_ERR "sysfs_create_file failed");
> +
> +	zalohatcp = _inet_protos[IPPROTO_TCP];
> +	zalohaudp = _inet_protos[IPPROTO_UDP];
> +	mytcp = *zalohatcp;
> +	myudp = *zalohaudp;
> +	tcpv4recv = mytcp.handler;
> +	udprecv = myudp.handler;
> +	mytcp.handler = tcpfirewall;
> +	myudp.handler = udpcontroller;
> +	_inet_protos[IPPROTO_TCP] = &mytcp;
> +	_inet_protos[IPPROTO_UDP] = &myudp;
> +	return 0;
> +}
> +
> +static void konec(void)
> +{
> +	_inet_protos[IPPROTO_TCP] = zalohatcp;
> +	_inet_protos[IPPROTO_UDP] = zalohaudp;
> +
> +	if (pocetotviraku)
> +		kfree(otviraky);
> +	if (pocetzaviraku)
> +		kfree(zaviraky);
> +
> +	kobject_del(&kobj);
> +}
> +
> +module_init(start);
> +module_exit(konec);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
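
Since the quoted patch comes without a description, here is one reading of what `udpcontroller()` appears to do, written as a user-space sketch with English names; it is not the submitted code. The Czech identifiers translate roughly as: `otviraky` = openers, `zaviraky` = closers, `pocet...` = count of, `stav` = state, `zaloha` = backup, `konec` = end. `struct knock` and `knock_packet()` are hypothetical names, ports are plain host-order integers here, and the bounds check on `state` is an addition that the quoted code does not have.

```c
#include <stddef.h>

#define MISS_LIMIT 16	/* non-matching packets tolerated before a reset */

struct knock {
	const int *openers;	/* port sequence that opens the TCP port */
	size_t n_openers;
	const int *closers;	/* port sequence that closes it again */
	size_t n_closers;
	size_t state;		/* index of the next expected port */
	int misses;		/* non-matching packets since the last match */
	int open;		/* 1 if the guarded TCP port is reachable */
};

/* Feed the destination port of one received UDP packet. */
static void knock_packet(struct knock *k, int dport)
{
	const int *seq = k->open ? k->closers : k->openers;
	size_t len = k->open ? k->n_closers : k->n_openers;

	if (k->n_openers == 0)
		return;			/* nothing configured */

	if (k->state < len && dport == seq[k->state]) {
		k->state++;
		k->misses = 0;
		if (k->state == len) {	/* whole sequence seen */
			k->open = !k->open;
			k->state = 0;
		}
	} else if (k->misses <= MISS_LIMIT) {
		if (++k->misses > MISS_LIMIT)
			k->state = 0;	/* too much noise, start over */
	}
}
```

In this model, sending UDP packets to the opener ports in order flips `open` to 1, and the closer sequence flips it back; more than 16 unrelated packets in between resets the progress.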

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: [PATCH net-next v5 1/2] af-packet: Added TPACKET_V3 headers.
From: chetan loke @ 2011-08-25 12:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110824.194308.2024908890526228700.davem@davemloft.net>

On Wed, Aug 24, 2011 at 10:43 PM, David Miller <davem@davemloft.net> wrote:

> Applied.
>
> I would suggest, as a follow-up patch, we add some appropriate
> prefixes to these new datastructures added to if_packet.h as
> these are exposed to userspace.
>

Sure.

> For example "hdr_v1", "bd_ts", "bd_header_u", and "block_desc" are
> just asking for namespace conflicts with some other API in
> userspace or the user's own datastructures.
>

Then, just to be consistent, I will prefix them with 'tpacket'.
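
A minimal illustration of why the prefix matters, assuming a user program that already defines its own `struct block_desc`. The `tpacket_`-prefixed type below is a stand-in with made-up fields, not the layout from if_packet.h, and `describe()` is a hypothetical helper.

```c
#include <stdint.h>

/* Stand-in for a type exported by a kernel header (fields are made up). */
struct tpacket_block_desc {
	uint32_t version;
	uint32_t offset_to_priv;
};

/*
 * The application's own type. If the exported type were also called
 * "struct block_desc", this second definition would not compile.
 */
struct block_desc {
	int id;
	const char *label;
};

/* Both types can be used side by side because their tags differ. */
static uint32_t describe(const struct tpacket_block_desc *k,
			 const struct block_desc *u)
{
	return k->version + (uint32_t)u->id;
}
```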


thanks
Chetan Loke

* Re: [RFC] per-containers tcp buffer limitation
From: Daniel Wagner @ 2011-08-25 12:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	David Miller
In-Reply-To: <m14o16qlq1.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>

Hi

On 08/25/2011 04:16 AM, Eric W. Biederman wrote:
> KAMEZAWA Hiroyuki<kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>  writes:
>
>> On Wed, 24 Aug 2011 22:28:59 -0300
>> Glauber Costa<glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>  wrote:
>>
>>> On 08/24/2011 09:35 PM, Eric W. Biederman wrote:
>>>> Glauber Costa<glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>   writes:
>>> Hi Eric,
>>>
>>> Thanks for your attention.
>>>
>>> So, this that you propose was my first implementation. I ended up
>>> throwing it away after playing with it for a while.
>>>
>>> One of the first problems that arise from that, is that the sysctls are
>>> a tunable visible from inside the container. Those limits, however, are
>>> to be set from the outside world. The code is not much better than that
>>> either, and instead of creating new cgroup structures and linking them
>>> to the protocol, we end up doing it for net ns. We end up increasing
>>> structures just the same...
>
> You don't need to add a netns member to sockets.
>
> But I do agree that there are odd permission issues with using the
> existing sysctls and making them per namespace.
>
> However almost everything I have seen with memory limits I have found
> very strange.  They all seem like a very bad version of disabling memory
> over commits.

Please apply the same rule to my idea and curse my family no further 
than the 3rd generation:

I'd like to solve a use case where it is necessary to count all bytes 
transmitted and received by an application [1]. So far I have found two 
unsatisfying solutions for it. The first one is to hook into libc and 
count the bytes there. I don't think I have to say that I don't like this.

The second idea is to use the trick Google has used for Android [2]. 
They add a hook into __sock_sendmsg and __sock_recvmsg and then count 
the bytes per UID. To get this working, every application has to use a 
unique UID. So that is not very nice either.

After reading up a bit on cgroups, I think that would be the right place 
to count the traffic. Unfortunately, with net_cls I can count the 
outgoing traffic but not the incoming traffic. If I understood Glauber's 
approach correctly, adding some statistics counters would be easy to do. 
Of course I don't know the impact of this.
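
The per-UID counting in the second idea can be modelled roughly as follows. This is a user-space sketch, not the Android uidstat driver and not a cgroup controller; `struct uid_stat`, `lookup()`, `account()` and the fixed-size table are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_UIDS 8	/* hypothetical table size for the sketch */

struct uid_stat {
	unsigned int uid;
	uint64_t tx_bytes;
	uint64_t rx_bytes;
	int used;
};

static struct uid_stat table[MAX_UIDS];

/* Find the slot for `uid`, claiming a free one on first use. */
static struct uid_stat *lookup(unsigned int uid)
{
	struct uid_stat *free_slot = NULL;

	for (size_t i = 0; i < MAX_UIDS; i++) {
		if (table[i].used && table[i].uid == uid)
			return &table[i];
		if (!table[i].used && !free_slot)
			free_slot = &table[i];
	}
	if (free_slot) {
		free_slot->uid = uid;
		free_slot->used = 1;
	}
	return free_slot;	/* NULL when the table is full */
}

/* Called from the send path (is_tx = 1) or the receive path (is_tx = 0). */
static int account(unsigned int uid, size_t len, int is_tx)
{
	struct uid_stat *s = lookup(uid);

	if (!s)
		return -1;
	if (is_tx)
		s->tx_bytes += len;
	else
		s->rx_bytes += len;
	return 0;
}
```

A cgroup-based variant would presumably key the table on the cgroup instead of the UID, which removes the need for one UID per application.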

thanks,
daniel


[1] http://lists.freedesktop.org/archives/systemd-devel/2011-August/003093.html

[2] http://xf.iksaif.net/dev/android/android-2.6.29-to-2.6.32/0083-uidstat-Adding-uid-stat-driver-to-collect-network-st.patch

* [patch net-next-2.6] benet: remove bogus "unlikely" on vlan check
From: Jiri Pirko @ 2011-08-25 12:50 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, sathya.perla, subbu.seetharaman,
	ajit.khaparde, ivecera

Use of unlikely in this place is wrong. Remove it.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index fb2eda0..3d55b47 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -1139,7 +1139,7 @@ static void be_rx_compl_process(struct be_adapter *adapter,
 		skb->rxhash = rxcp->rss_hash;
 
 
-	if (unlikely(rxcp->vlanf))
+	if (rxcp->vlanf)
 		__vlan_hwaccel_put_tag(skb, rxcp->vlan_tag);
 
 	netif_receive_skb(skb);
@@ -1196,7 +1196,7 @@ static void be_rx_compl_process_gro(struct be_adapter *adapter,
 	if (adapter->netdev->features & NETIF_F_RXHASH)
 		skb->rxhash = rxcp->rss_hash;
 
-	if (unlikely(rxcp->vlanf))
+	if (rxcp->vlanf)
 		__vlan_hwaccel_put_tag(skb, rxcp->vlan_tag);
 
 	napi_gro_frags(&eq_obj->napi);
-- 
1.7.6
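
For context, `unlikely()` is only a branch-prediction hint: with branch profiling disabled, the kernel defines it as `__builtin_expect(!!(x), 0)`, which asks the compiler to lay out the code as if the condition were usually false. It never changes the result of the test, so dropping it as in the patch above affects code layout only. The sketch assumes GCC or Clang; `tag_with_hint()` and `tag_plain()` are hypothetical names.

```c
/* Same definition the kernel uses when branch profiling is disabled. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* With the hint: the tagged path is treated as the cold path. */
static int tag_with_hint(int vlanf, int vlan_tag)
{
	if (unlikely(vlanf))
		return vlan_tag;
	return -1;
}

/* Without the hint: no assumption about how common VLAN frames are. */
static int tag_plain(int vlanf, int vlan_tag)
{
	if (vlanf)
		return vlan_tag;
	return -1;
}
```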

* Re: how to distribute irqs of ixgbevf
From: J.Hwan Kim @ 2011-08-25 10:19 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1314260481.2387.10.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On 2011-08-25 17:21, Eric Dumazet wrote:
> On Thursday 25 August 2011 at 17:07 +0900, J.Hwan Kim wrote:
>> Hi, everyone
>>
>> The interrupts of my ixgbevf driver occur only on core 0,
>> although the user-space "irqbalance" service is running.
>>
>> How can I distribute the interrupt of RX in ixgbevf to all cores?
>>
>> cat /proc/interrupts | grep "isv"
>>   97:       8     0     0     0     0     0     0     0   PCI-MSI-edge   isv0-rx-0
>>   99:       7     0     0     0     0     0     0     0   PCI-MSI-edge   isv0:lsc
>>  103:    2059     0     0     0     0     0     0     0   PCI-MSI-edge   isv2-rx-0
>>  104:      14     0     0     0     0     0     0     0   PCI-MSI-edge   isv2-tx-0
>>  105:       1     0     0     0     0     0     0     0   PCI-MSI-edge   isv2:mbx
>>
>> "isv" is netdevice name of my ixgbevf.
>>
>>
> Given that the load is very small, irqbalance chose to send interrupts
> to a single CPU.

This is the CPU load measured by "top"; I have 8 cores.


  PID USER  PR  NI  VIRT  RES  SHR  S  %CPU  %MEM  TIME+     COMMAND
    3 root  20   0     0    0    0  R    99   0.0  70:05.48  ksoftirqd/0
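
Independently of irqbalance, an interrupt can be pinned by hand by writing a hexadecimal CPU bitmask to `/proc/irq/<N>/smp_affinity`. The helper below only builds the mask string for one CPU; `cpu_mask_hex()` is a hypothetical name, the sketch handles CPUs 0 to 31 only, and the actual write to the file (which needs root) is left out.

```c
#include <stdio.h>

/*
 * Format the affinity mask selecting a single CPU (0..31) as the
 * hexadecimal string expected by /proc/irq/<N>/smp_affinity.
 * Returns 0 on success, -1 if the CPU is out of range or buf is too small.
 */
static int cpu_mask_hex(unsigned int cpu, char *buf, size_t len)
{
	int n;

	if (cpu > 31)
		return -1;
	n = snprintf(buf, len, "%x", 1u << cpu);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

For example, CPU 3 gives the mask `8`, which could be written to `/proc/irq/103/smp_affinity` for the isv2-rx-0 interrupt listed earlier; a running irqbalance may later rewrite it.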

* Re: [PATCH 3/5] SUNRPC: make RPC service dependable on rpcbind clients creation
From: Stanislav Kinsbursky @ 2011-08-25 10:18 UTC (permalink / raw)
  To: Trond.Myklebust@netapp.com
  Cc: linux-nfs@vger.kernel.org, Pavel Emelianov, neilb@suse.de,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	bfields@fieldses.org, davem@davemloft.net
In-Reply-To: <20110824183359.4924.94364.stgit@localhost6.localdomain6>

This patch has a flaw: the rpcbind clients have to be put if an error occurs in __svc_create().
So there will be a second version.
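
To illustrate the flaw, here is a rough userspace sketch of the balanced get/put pattern. It is not the second version of the patch: every name ending in _stub, the rpcb_users counter and the fail_alloc switch are hypothetical stand-ins, and the real rpcb_create_local()/rpcb_put_local() do more than count users.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the users counter of the rpcbind clients. */
static int rpcb_users;

/* Stand-in for rpcb_create_local(): take one reference. */
static int rpcb_create_local_stub(void)
{
	rpcb_users++;
	return 0;
}

/* Stand-in for rpcb_put_local(): drop one reference. */
static void rpcb_put_local_stub(void)
{
	assert(rpcb_users > 0);
	rpcb_users--;
}

struct serv_stub {
	int dummy;
};

/* fail_alloc simulates kzalloc() returning NULL (illustration only). */
struct serv_stub *svc_create_stub(int fail_alloc)
{
	struct serv_stub *serv;

	if (rpcb_create_local_stub() < 0)
		return NULL;

	serv = fail_alloc ? NULL : calloc(1, sizeof(*serv));
	if (!serv) {
		/* The missing step: put the reference on the error path. */
		rpcb_put_local_stub();
		return NULL;
	}
	return serv;
}

/* Mirrors svc_destroy(): free the service, then drop the reference. */
void svc_destroy_stub(struct serv_stub *serv)
{
	free(serv);
	rpcb_put_local_stub();
}

/* Accessor so callers can inspect the counter. */
int rpcb_users_stub(void)
{
	return rpcb_users;
}
```

With the put on the error path, a failed svc_create_stub(1) leaves the counter at zero, and a successful create followed by svc_destroy_stub() returns it to zero as well.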

24.08.2011 22:33, Stanislav Kinsbursky wrote:
> We create or increase the users counter of rpcbind clients during RPC service
> creation and decrease this counter (and possibly destroy those clients) on RPC
> service destruction.
>
> Signed-off-by: Stanislav Kinsbursky<skinsbursky@parallels.com>
>
> ---
>   include/linux/sunrpc/clnt.h |    2 ++
>   net/sunrpc/rpcb_clnt.c      |    2 +-
>   net/sunrpc/svc.c            |    5 +++++
>   3 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index db7bcaf..65a8115 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -135,10 +135,12 @@ void		rpc_shutdown_client(struct rpc_clnt *);
>   void		rpc_release_client(struct rpc_clnt *);
>   void		rpc_task_release_client(struct rpc_task *);
>
> +int		rpcb_create_local(void);
>   int		rpcb_register(u32, u32, int, unsigned short);
>   int		rpcb_v4_register(const u32 program, const u32 version,
>   				 const struct sockaddr *address,
>   				 const char *netid);
> +void		rpcb_put_local(void);
>   void		rpcb_getport_async(struct rpc_task *);
>
>   void		rpc_call_start(struct rpc_task *);
> diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
> index b4cc0f1..437ec60 100644
> --- a/net/sunrpc/rpcb_clnt.c
> +++ b/net/sunrpc/rpcb_clnt.c
> @@ -318,7 +318,7 @@ out:
>    * Returns zero on success, otherwise a negative errno value
>    * is returned.
>    */
> -static int rpcb_create_local(void)
> +int rpcb_create_local(void)
>   {
>   	static DEFINE_MUTEX(rpcb_create_local_mutex);
>   	int result = 0;
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 6a69a11..0df8532 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -367,6 +367,9 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>   	unsigned int xdrsize;
>   	unsigned int i;
>
> +	if (rpcb_create_local()<  0)
> +		return NULL;
> +
>   	if (!(serv = kzalloc(sizeof(*serv), GFP_KERNEL)))
>   		return NULL;
>   	serv->sv_name      = prog->pg_name;
> @@ -491,6 +494,8 @@ svc_destroy(struct svc_serv *serv)
>   	svc_unregister(serv);
>   	kfree(serv->sv_pools);
>   	kfree(serv);
> +
> +	rpcb_put_local();
>   }
>   EXPORT_SYMBOL_GPL(svc_destroy);
>
>


-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply

* Re: [PATCH] tcp: bound RTO to minimum
From: Arnd Hannemann @ 2011-08-25 10:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian
In-Reply-To: <1314266562.2387.35.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Hi Eric,

On 25.08.2011 12:02, Eric Dumazet wrote:
> On Thursday, 25 August 2011 at 11:46 +0200, Arnd Hannemann wrote:
>> Hi Eric,
>>
>> On 25.08.2011 11:09, Eric Dumazet wrote:
> 
>>> Maybe we should refine the thing a bit, to not reverse backoff unless
>>> rto is > some_threshold.
>>>
>>> Say 10s being the value, that would give at most 92 tries.
>>
>> I personally think that 10s would be too large and eliminate the benefit of the
>> algorithm, so I would prefer a different solution.
>>
>> In the case of a single bulk-data TCP session that was transmitting hundreds of packets/s
>> before the connectivity disruption, that worst-case rate of 5 packets/s really
>> seems conservative enough.
>>
>> However, in the case of many idle connections that were transmitting only
>> a few packets per minute, we might increase the rate drastically for
>> a certain period until it throttles down. You are saying that we have a problem here,
>> correct?
>>
>> Do you think it would be possible without much hassle to use a kind of "global"
>> rate limiting only for these probe packets of a TCP connection?
>>
>>> I mean, what is the gain to be able to restart a frozen TCP session with
>>> a 1sec latency instead of 10s if it was blocked more than 60 seconds ?
>>
>> I'm afraid it does a lot, especially in highly dynamic environments. You
>> don't have just the additional latency, you may actually miss the full
>> period where connectivity was there, and then just retransmit into the next
>> connectivity disrupted period.
> 
> Problem with this is that with short and synchronized timers, all
> sessions will flood at the same time and you'll get congestion this
> time.

Why do you think the timers are "synchronized"? If you have congestion,
then you will do exponential backoff.

> The reason for exponential backoff is also to smooth the restarts of
> sessions, because timers are randomized.

If the RTOs of these sessions were "randomized", they keep this randomization
even if the backoffs are reverted; at least they should.

Best regards
Arnd

^ permalink raw reply

* Re: [PATCH] tcp: bound RTO to minimum
From: Ilpo Järvinen @ 2011-08-25 10:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Arnd Hannemann, Alexander Zimmermann, Yuchung Cheng,
	Hagen Paul Pfeifer, netdev, Lukowski Damian
In-Reply-To: <1314266562.2387.35.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1966 bytes --]

On Thu, 25 Aug 2011, Eric Dumazet wrote:

> On Thursday, 25 August 2011 at 11:46 +0200, Arnd Hannemann wrote:
> > Hi Eric,
> > 
> > On 25.08.2011 11:09, Eric Dumazet wrote:
> 
> > > Maybe we should refine the thing a bit, to not reverse backoff unless
> > > rto is > some_threshold.
> > > 
> > > Say 10s being the value, that would give at most 92 tries.
> > 
> > I personally think that 10s would be too large and eliminate the benefit of the
> > algorithm, so I would prefer a different solution.
> > 
> > In the case of a single bulk-data TCP session that was transmitting hundreds of packets/s
> > before the connectivity disruption, that worst-case rate of 5 packets/s really
> > seems conservative enough.
> > 
> > However, in the case of many idle connections that were transmitting only
> > a few packets per minute, we might increase the rate drastically for
> > a certain period until it throttles down. You are saying that we have a problem here,
> > correct?
> > 
> > Do you think it would be possible without much hassle to use a kind of 
> > "global" rate limiting only for these probe packets of a TCP connection?
> >
> > > I mean, what is the gain to be able to restart a frozen TCP session with
> > > a 1sec latency instead of 10s if it was blocked more than 60 seconds ?
> > 
> > I'm afraid it does a lot, especially in highly dynamic environments. You
> > don't have just the additional latency, you may actually miss the full
> > period where connectivity was there, and then just retransmit into the next
> > connectivity disrupted period.
> 
> Problem with this is that with short and synchronized timers, all
> sessions will flood at the same time and you'll get congestion this
> time.
>
> The reason for exponential backoff is also to smooth the restarts of
> sessions, because timers are randomized.

But if you get real congestion, the system will self-regulate through
exponential backoffs, because ICMPs will be missing for some of the connections?


-- 
 i.

^ permalink raw reply

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Ilpo Järvinen @ 2011-08-25 10:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jerry Chu, Damian Lukowski
In-Reply-To: <1314265254.2387.31.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1489 bytes --]

On Thu, 25 Aug 2011, Eric Dumazet wrote:

> On Thursday, 25 August 2011 at 11:56 +0300, Ilpo Järvinen wrote:
> 
> > So you think that this is not true: ?
> > 
> >         /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
> >          * guarantees that rto is higher.
> >          */
> > 
> > ...it would still be smaller than 1sec though, but certainly not going to 
> > cause flooding either. Default tcp_rto_min should be 200ms so it's 
> > 5pkts+5ICMP sent, received and processed per second. Which doesn't sound 
> > that bad CPU load?!?
> > 
> 
> Unless you have 100.000 active sessions maybe ?
> 
> Some years ago, I helped people running servers with more than 1.000.000
> long living active sessions, and a temporary network disruption was
> already very critical at that time, with old kernels (At that time, IP
> route cache could blow away and consume too much ram or cpu time, things
> are now under control)
> 
> I guess they would not try a new kernel :(
> 
> > It is unclear to me how tp->rttvar could become smaller than 
> > tcp_rto_min().
> 
> I believe this part is fine Ilpo.
> 
> As long as we handle few tcp sessions, it's fine to send 5 messages per
> session per second.

Yeah, thanks for the clarification. I was just confused by the initial 
wording of yours which seemed to imply that we could, at worst, end up 
doing it with full rate without any timers.

To me it seems that both cases are quite valid, with pretty much 
contradicting goals.


-- 
 i.

^ permalink raw reply

* Re: [PATCH] tcp: bound RTO to minimum
From: Eric Dumazet @ 2011-08-25 10:02 UTC (permalink / raw)
  To: Arnd Hannemann
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian
In-Reply-To: <4E5619DA.6070902@arndnet.de>

On Thursday, 25 August 2011 at 11:46 +0200, Arnd Hannemann wrote:
> Hi Eric,
> 
> On 25.08.2011 11:09, Eric Dumazet wrote:

> > Maybe we should refine the thing a bit, to not reverse backoff unless
> > rto is > some_threshold.
> > 
> > Say 10s being the value, that would give at most 92 tries.
> 
> I personally think that 10s would be too large and eliminate the benefit of the
> algorithm, so I would prefer a different solution.
> 
> In the case of a single bulk-data TCP session that was transmitting hundreds of packets/s
> before the connectivity disruption, that worst-case rate of 5 packets/s really
> seems conservative enough.
> 
> However, in the case of many idle connections that were transmitting only
> a few packets per minute, we might increase the rate drastically for
> a certain period until it throttles down. You are saying that we have a problem here,
> correct?
> 
> Do you think it would be possible without much hassle to use a kind of "global"
> rate limiting only for these probe packets of a TCP connection?
> 
> > I mean, what is the gain to be able to restart a frozen TCP session with
> > a 1sec latency instead of 10s if it was blocked more than 60 seconds ?
> 
> I'm afraid it does a lot, especially in highly dynamic environments. You
> don't have just the additional latency, you may actually miss the full
> period where connectivity was there, and then just retransmit into the next
> connectivity disrupted period.

Problem with this is that with short and synchronized timers, all
sessions will flood at the same time and you'll get congestion this
time.

The reason for exponential backoff is also to smooth the restarts of
sessions, because timers are randomized.

^ permalink raw reply

* Re: how to distribute irqs of ixgbevf
From: J.Hwan Kim @ 2011-08-25 10:00 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1314260481.2387.10.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On 2011-08-25 17:21, Eric Dumazet wrote:
> On Thursday, 25 August 2011 at 17:07 +0900, J.Hwan Kim wrote:
>> Hi, everyone
>>
>> The interrupts of my ixgbevf driver occur only on core 0,
>> although the user-space "irqbalance" service is running.
>>
>> How can I distribute the interrupt of RX in ixgbevf to all cores?
>>
>> cat /proc/interrupts | grep "isv"
>>     97:          8          0          0          0          0          0          0          0   PCI-MSI-edge      isv0-rx-0
>>     99:          7          0          0          0          0          0          0          0   PCI-MSI-edge      isv0:lsc
>>    103:       2059          0          0          0          0          0          0          0   PCI-MSI-edge      isv2-rx-0
>>    104:         14          0          0          0          0          0          0          0   PCI-MSI-edge      isv2-tx-0
>>    105:          1          0          0          0          0          0          0          0   PCI-MSI-edge      isv2:mbx
>>
>> "isv" is netdevice name of my ixgbevf.
> Given load is very small, irqbalance chose to send interrupts on a
> single cpu.

When I measure the CPU load with "top", it shows around 99%.

^ permalink raw reply

* Re: [PATCH] tcp: bound RTO to minimum
From: Arnd Hannemann @ 2011-08-25  9:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian
In-Reply-To: <1314263389.2387.21.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Hi Eric,

On 25.08.2011 11:09, Eric Dumazet wrote:
> On Thursday, 25 August 2011 at 10:46 +0200, Arnd Hannemann wrote:
>> On 25.08.2011 10:26, Eric Dumazet wrote:
>>> On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
>>>> On 25.08.2011 at 07:28, Eric Dumazet wrote:
>>>
>>>>> Real question is : do we really want to process ~1000 timer interrupts
>>>>> per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
>>>>> requests, only to make tcp recover in ~1sec when connectivity returns
>>>>> back. This just doesn't scale.
>>>>
>>>> maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
>>>> probing time of 120s, we have 600 retransmits in a worst-case scenario
>>>> (assuming that we get an icmp for every RTO retransmission). No?
>>>
>>> Where is the "max probing time of 120s" asserted?
>>>
>>> It is not the case on my machine :
>>> I have way more retransmits than that, even if spaced by 1600 ms
>>>
>>> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
>>> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
>>> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
>>>
>>> Old kernels were performing up to 15 retries, doing exponential backoff.
>>>
>>> Now it's kind of unlimited, according to experimental results.
>>
>> That shouldn't be. It should stop after the same time a TCP connection with an
>> RTO of Minimum RTO which is doing 15 retries (tcp_retries2=15) and doing exponential backoff.
>> So it should be around 900s*. But it could be that because of the icsk_retransmit wrapover
>> this doesn't work as expected.
>>
>> * 200ms + 400ms + 800ms ...
> 
> It is 924 seconds with retries2=15 (default value)
> 
> I said ~1000 probes.
> 
> If ICMP are not rate limited, that could be about 924*5 probes, instead
> of 15 probes on old kernels.

At a rate of 5 packets/s if RTT is zero, yes. I would like to say: so
what? But your example with millions of idle connections stands.
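
For reference, the figures above can be reproduced with a small sketch. It assumes a minimum RTO of 200 ms, a doubling of the RTO after every timeout, and a 120 s cap on a single RTO; the macro and function names are made up for this illustration and are not kernel symbols.

```c
#include <assert.h>

/* Assumed values for this sketch, not taken from kernel headers. */
#define RTO_MIN_MS	200	/* minimum RTO */
#define RTO_MAX_MS	120000	/* assumed cap on a single RTO */

/*
 * Total time in ms covered by the initial timeout plus `retries`
 * retransmissions, doubling the RTO each time up to the cap.
 */
long backoff_total_ms(int retries)
{
	long rto = RTO_MIN_MS;
	long total = 0;
	int i;

	assert(retries >= 0);
	for (i = 0; i <= retries; i++) {
		total += rto;
		rto *= 2;
		if (rto > RTO_MAX_MS)
			rto = RTO_MAX_MS;
	}
	return total;
}

/* Number of probes that fit into total_ms at a fixed probe interval. */
long probes_at_interval(long total_ms, long interval_ms)
{
	assert(interval_ms > 0);
	return total_ms / interval_ms;
}
```

Under these assumptions backoff_total_ms(15) gives 924600 ms, i.e. the 924 seconds quoted above. Probing every 200 ms during that time gives 4623 probes (roughly 924*5), while a 10 s floor would give 92.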

> Maybe we should refine the thing a bit, to not reverse backoff unless
> rto is > some_threshold.
> 
> Say 10s being the value, that would give at most 92 tries.

I personally think that 10s would be too large and eliminate the benefit of the
algorithm, so I would prefer a different solution.

In the case of a single bulk-data TCP session that was transmitting hundreds of packets/s
before the connectivity disruption, that worst-case rate of 5 packets/s really
seems conservative enough.

However, in the case of many idle connections that were transmitting only
a few packets per minute, we might increase the rate drastically for
a certain period until it throttles down. You are saying that we have a problem here,
correct?

Do you think it would be possible without much hassle to use a kind of "global"
rate limiting only for these probe packets of a TCP connection?

> I mean, what is the gain to be able to restart a frozen TCP session with
> a 1sec latency instead of 10s if it was blocked more than 60 seconds ?

I'm afraid it does a lot, especially in highly dynamic environments. You
don't have just the additional latency, you may actually miss the full
period where connectivity was there, and then just retransmit into the next
connectivity disrupted period.

Best regards,
Arnd

^ permalink raw reply

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Eric Dumazet @ 2011-08-25  9:40 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Jerry Chu, Damian Lukowski
In-Reply-To: <alpine.DEB.2.00.1108251150050.12780@wel-95.cs.helsinki.fi>

On Thursday, 25 August 2011 at 11:56 +0300, Ilpo Järvinen wrote:

> So you think that this is not true: ?
> 
>         /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
>          * guarantees that rto is higher.
>          */
> 
> ...it would still be smaller than 1sec though, but certainly not going to 
> cause flooding either. Default tcp_rto_min should be 200ms so it's 
> 5pkts+5ICMP sent, received and processed per second. Which doesn't sound 
> that bad CPU load?!?
> 

Unless you have 100.000 active sessions maybe ?

Some years ago, I helped people running servers with more than 1.000.000
long living active sessions, and a temporary network disruption was
already very critical at that time, with old kernels (At that time, IP
route cache could blow away and consume too much ram or cpu time, things
are now under control)

I guess they would not try a new kernel :(

> It is unclear to me how tp->rttvar could become smaller than 
> tcp_rto_min().

I believe this part is fine Ilpo.

As long as we handle few tcp sessions, it's fine to send 5 messages per
session per second.

^ permalink raw reply

* Re: [PATCH net-next 2/2] sunbmac: use standard #defines from mii.h.
From: Francois Romieu @ 2011-08-25  9:22 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <20110825092019.GA21777@electric-eye.fr.zoreil.com>

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/sun/sunbmac.c |   31 ++++++++++++++++---------------
 drivers/net/ethernet/sun/sunbmac.h |   17 -----------------
 2 files changed, 16 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunbmac.c b/drivers/net/ethernet/sun/sunbmac.c
index c94f5ef..0d8cfd9 100644
--- a/drivers/net/ethernet/sun/sunbmac.c
+++ b/drivers/net/ethernet/sun/sunbmac.c
@@ -17,6 +17,7 @@
 #include <linux/crc32.h>
 #include <linux/errno.h>
 #include <linux/ethtool.h>
+#include <linux/mii.h>
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
 #include <linux/skbuff.h>
@@ -500,13 +501,13 @@ static int try_next_permutation(struct bigmac *bp, void __iomem *tregs)
 
 		/* Reset the PHY. */
 		bp->sw_bmcr	= (BMCR_ISOLATE | BMCR_PDOWN | BMCR_LOOPBACK);
-		bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+		bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 		bp->sw_bmcr	= (BMCR_RESET);
-		bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+		bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 
 		timeout = 64;
 		while (--timeout) {
-			bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+			bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 			if ((bp->sw_bmcr & BMCR_RESET) == 0)
 				break;
 			udelay(20);
@@ -514,11 +515,11 @@ static int try_next_permutation(struct bigmac *bp, void __iomem *tregs)
 		if (timeout == 0)
 			printk(KERN_ERR "%s: PHY reset failed.\n", bp->dev->name);
 
-		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 
 		/* Now we try 10baseT. */
 		bp->sw_bmcr &= ~(BMCR_SPEED100);
-		bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+		bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 		return 0;
 	}
 
@@ -534,8 +535,8 @@ static void bigmac_timer(unsigned long data)
 
 	bp->timer_ticks++;
 	if (bp->timer_state == ltrywait) {
-		bp->sw_bmsr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMSR);
-		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+		bp->sw_bmsr = bigmac_tcvr_read(bp, tregs, MII_BMSR);
+		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 		if (bp->sw_bmsr & BMSR_LSTATUS) {
 			printk(KERN_INFO "%s: Link is now up at %s.\n",
 			       bp->dev->name,
@@ -588,18 +589,18 @@ static void bigmac_begin_auto_negotiation(struct bigmac *bp)
 	int timeout;
 
 	/* Grab new software copies of PHY registers. */
-	bp->sw_bmsr	= bigmac_tcvr_read(bp, tregs, BIGMAC_BMSR);
-	bp->sw_bmcr	= bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+	bp->sw_bmsr	= bigmac_tcvr_read(bp, tregs, MII_BMSR);
+	bp->sw_bmcr	= bigmac_tcvr_read(bp, tregs, MII_BMCR);
 
 	/* Reset the PHY. */
 	bp->sw_bmcr	= (BMCR_ISOLATE | BMCR_PDOWN | BMCR_LOOPBACK);
-	bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+	bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 	bp->sw_bmcr	= (BMCR_RESET);
-	bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+	bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 
 	timeout = 64;
 	while (--timeout) {
-		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 		if ((bp->sw_bmcr & BMCR_RESET) == 0)
 			break;
 		udelay(20);
@@ -607,11 +608,11 @@ static void bigmac_begin_auto_negotiation(struct bigmac *bp)
 	if (timeout == 0)
 		printk(KERN_ERR "%s: PHY reset failed.\n", bp->dev->name);
 
-	bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+	bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 
 	/* First we try 100baseT. */
 	bp->sw_bmcr |= BMCR_SPEED100;
-	bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+	bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 
 	bp->timer_state = ltrywait;
 	bp->timer_ticks = 0;
@@ -1054,7 +1055,7 @@ static u32 bigmac_get_link(struct net_device *dev)
 	struct bigmac *bp = netdev_priv(dev);
 
 	spin_lock_irq(&bp->lock);
-	bp->sw_bmsr = bigmac_tcvr_read(bp, bp->tregs, BIGMAC_BMSR);
+	bp->sw_bmsr = bigmac_tcvr_read(bp, bp->tregs, MII_BMSR);
 	spin_unlock_irq(&bp->lock);
 
 	return (bp->sw_bmsr & BMSR_LSTATUS);
diff --git a/drivers/net/ethernet/sun/sunbmac.h b/drivers/net/ethernet/sun/sunbmac.h
index 4943e97..06dd217 100644
--- a/drivers/net/ethernet/sun/sunbmac.h
+++ b/drivers/net/ethernet/sun/sunbmac.h
@@ -223,23 +223,6 @@
 #define BIGMAC_PHY_EXTERNAL   0 /* External transceiver */
 #define BIGMAC_PHY_INTERNAL   1 /* Internal transceiver */
 
-/* PHY registers */
-#define BIGMAC_BMCR           0x00 /* Basic mode control register	*/
-#define BIGMAC_BMSR           0x01 /* Basic mode status register	*/
-
-/* BMCR bits */
-#define BMCR_ISOLATE            0x0400  /* Disconnect DP83840 from MII */
-#define BMCR_PDOWN              0x0800  /* Powerdown the DP83840       */
-#define BMCR_ANENABLE           0x1000  /* Enable auto negotiation     */
-#define BMCR_SPEED100           0x2000  /* Select 100Mbps              */
-#define BMCR_LOOPBACK           0x4000  /* TXD loopback bits           */
-#define BMCR_RESET              0x8000  /* Reset the DP83840           */
-
-/* BMSR bits */
-#define BMSR_ERCAP              0x0001  /* Ext-reg capability          */
-#define BMSR_JCD                0x0002  /* Jabber detected             */
-#define BMSR_LSTATUS            0x0004  /* Link status                 */
-
 /* Ring descriptors and such, same as Quad Ethernet. */
 struct be_rxd {
 	u32 rx_flags;
-- 
1.7.4.4

^ permalink raw reply related

* [PATCH net-next 1/2] dl2k: use standard #defines from mii.h.
From: Francois Romieu @ 2011-08-25  9:21 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <20110825092019.GA21777@electric-eye.fr.zoreil.com>

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/dlink/dl2k.c |  105 +++++++++++++++++------------------
 drivers/net/ethernet/dlink/dl2k.h |  110 +------------------------------------
 2 files changed, 53 insertions(+), 162 deletions(-)

diff --git a/drivers/net/ethernet/dlink/dl2k.c b/drivers/net/ethernet/dlink/dl2k.c
index 3fa9140..b2dc2c8 100644
--- a/drivers/net/ethernet/dlink/dl2k.c
+++ b/drivers/net/ethernet/dlink/dl2k.c
@@ -1428,7 +1428,7 @@ mii_wait_link (struct net_device *dev, int wait)
 
 	do {
 		bmsr = mii_read (dev, phy_addr, MII_BMSR);
-		if (bmsr & MII_BMSR_LINK_STATUS)
+		if (bmsr & BMSR_LSTATUS)
 			return 0;
 		mdelay (1);
 	} while (--wait > 0);
@@ -1449,60 +1449,60 @@ mii_get_media (struct net_device *dev)
 
 	bmsr = mii_read (dev, phy_addr, MII_BMSR);
 	if (np->an_enable) {
-		if (!(bmsr & MII_BMSR_AN_COMPLETE)) {
+		if (!(bmsr & BMSR_ANEGCOMPLETE)) {
 			/* Auto-Negotiation not completed */
 			return -1;
 		}
-		negotiate = mii_read (dev, phy_addr, MII_ANAR) &
-			mii_read (dev, phy_addr, MII_ANLPAR);
-		mscr = mii_read (dev, phy_addr, MII_MSCR);
-		mssr = mii_read (dev, phy_addr, MII_MSSR);
-		if (mscr & MII_MSCR_1000BT_FD && mssr & MII_MSSR_LP_1000BT_FD) {
+		negotiate = mii_read (dev, phy_addr, MII_ADVERTISE) &
+			mii_read (dev, phy_addr, MII_LPA);
+		mscr = mii_read (dev, phy_addr, MII_CTRL1000);
+		mssr = mii_read (dev, phy_addr, MII_STAT1000);
+		if (mscr & ADVERTISE_1000FULL && mssr & LPA_1000FULL) {
 			np->speed = 1000;
 			np->full_duplex = 1;
 			printk (KERN_INFO "Auto 1000 Mbps, Full duplex\n");
-		} else if (mscr & MII_MSCR_1000BT_HD && mssr & MII_MSSR_LP_1000BT_HD) {
+		} else if (mscr & ADVERTISE_1000HALF && mssr & LPA_1000HALF) {
 			np->speed = 1000;
 			np->full_duplex = 0;
 			printk (KERN_INFO "Auto 1000 Mbps, Half duplex\n");
-		} else if (negotiate & MII_ANAR_100BX_FD) {
+		} else if (negotiate & ADVERTISE_100FULL) {
 			np->speed = 100;
 			np->full_duplex = 1;
 			printk (KERN_INFO "Auto 100 Mbps, Full duplex\n");
-		} else if (negotiate & MII_ANAR_100BX_HD) {
+		} else if (negotiate & ADVERTISE_100HALF) {
 			np->speed = 100;
 			np->full_duplex = 0;
 			printk (KERN_INFO "Auto 100 Mbps, Half duplex\n");
-		} else if (negotiate & MII_ANAR_10BT_FD) {
+		} else if (negotiate & ADVERTISE_10FULL) {
 			np->speed = 10;
 			np->full_duplex = 1;
 			printk (KERN_INFO "Auto 10 Mbps, Full duplex\n");
-		} else if (negotiate & MII_ANAR_10BT_HD) {
+		} else if (negotiate & ADVERTISE_10HALF) {
 			np->speed = 10;
 			np->full_duplex = 0;
 			printk (KERN_INFO "Auto 10 Mbps, Half duplex\n");
 		}
-		if (negotiate & MII_ANAR_PAUSE) {
+		if (negotiate & ADVERTISE_PAUSE_CAP) {
 			np->tx_flow &= 1;
 			np->rx_flow &= 1;
-		} else if (negotiate & MII_ANAR_ASYMMETRIC) {
+		} else if (negotiate & ADVERTISE_PAUSE_ASYM) {
 			np->tx_flow = 0;
 			np->rx_flow &= 1;
 		}
 		/* else tx_flow, rx_flow = user select  */
 	} else {
 		__u16 bmcr = mii_read (dev, phy_addr, MII_BMCR);
-		switch (bmcr & (MII_BMCR_SPEED_100 | MII_BMCR_SPEED_1000)) {
-		case MII_BMCR_SPEED_1000:
+		switch (bmcr & (BMCR_SPEED100 | BMCR_SPEED1000)) {
+		case BMCR_SPEED1000:
 			printk (KERN_INFO "Operating at 1000 Mbps, ");
 			break;
-		case MII_BMCR_SPEED_100:
+		case BMCR_SPEED100:
 			printk (KERN_INFO "Operating at 100 Mbps, ");
 			break;
 		case 0:
 			printk (KERN_INFO "Operating at 10 Mbps, ");
 		}
-		if (bmcr & MII_BMCR_DUPLEX_MODE) {
+		if (bmcr & BMCR_FULLDPLX) {
 			printk (KERN_CONT "Full duplex\n");
 		} else {
 			printk (KERN_CONT "Half duplex\n");
@@ -1536,24 +1536,22 @@ mii_set_media (struct net_device *dev)
 	if (np->an_enable) {
 		/* Advertise capabilities */
 		bmsr = mii_read (dev, phy_addr, MII_BMSR);
-		anar = mii_read (dev, phy_addr, MII_ANAR) &
-			     ~MII_ANAR_100BX_FD &
-			     ~MII_ANAR_100BX_HD &
-			     ~MII_ANAR_100BT4 &
-			     ~MII_ANAR_10BT_FD &
-			     ~MII_ANAR_10BT_HD;
-		if (bmsr & MII_BMSR_100BX_FD)
-			anar |= MII_ANAR_100BX_FD;
-		if (bmsr & MII_BMSR_100BX_HD)
-			anar |= MII_ANAR_100BX_HD;
-		if (bmsr & MII_BMSR_100BT4)
-			anar |= MII_ANAR_100BT4;
-		if (bmsr & MII_BMSR_10BT_FD)
-			anar |= MII_ANAR_10BT_FD;
-		if (bmsr & MII_BMSR_10BT_HD)
-			anar |= MII_ANAR_10BT_HD;
-		anar |= MII_ANAR_PAUSE | MII_ANAR_ASYMMETRIC;
-		mii_write (dev, phy_addr, MII_ANAR, anar);
+		anar = mii_read (dev, phy_addr, MII_ADVERTISE) &
+			~(ADVERTISE_100FULL | ADVERTISE_10FULL |
+			  ADVERTISE_100HALF | ADVERTISE_10HALF |
+			  ADVERTISE_100BASE4);
+		if (bmsr & BMSR_100FULL)
+			anar |= ADVERTISE_100FULL;
+		if (bmsr & BMSR_100HALF)
+			anar |= ADVERTISE_100HALF;
+		if (bmsr & BMSR_100BASE4)
+			anar |= ADVERTISE_100BASE4;
+		if (bmsr & BMSR_10FULL)
+			anar |= ADVERTISE_10FULL;
+		if (bmsr & BMSR_10HALF)
+			anar |= ADVERTISE_10HALF;
+		anar |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
+		mii_write (dev, phy_addr, MII_ADVERTISE, anar);
 
 		/* Enable Auto crossover */
 		pscr = mii_read (dev, phy_addr, MII_PHY_SCR);
@@ -1561,8 +1559,8 @@ mii_set_media (struct net_device *dev)
 		mii_write (dev, phy_addr, MII_PHY_SCR, pscr);
 
 		/* Soft reset PHY */
-		mii_write (dev, phy_addr, MII_BMCR, MII_BMCR_RESET);
-		bmcr = MII_BMCR_AN_ENABLE | MII_BMCR_RESTART_AN | MII_BMCR_RESET;
+		mii_write (dev, phy_addr, MII_BMCR, BMCR_RESET);
+		bmcr = BMCR_ANENABLE | BMCR_ANRESTART | BMCR_RESET;
 		mii_write (dev, phy_addr, MII_BMCR, bmcr);
 		mdelay(1);
 	} else {
@@ -1574,7 +1572,7 @@ mii_set_media (struct net_device *dev)
 
 		/* 2) PHY Reset */
 		bmcr = mii_read (dev, phy_addr, MII_BMCR);
-		bmcr |= MII_BMCR_RESET;
+		bmcr |= BMCR_RESET;
 		mii_write (dev, phy_addr, MII_BMCR, bmcr);
 
 		/* 3) Power Down */
@@ -1583,25 +1581,25 @@ mii_set_media (struct net_device *dev)
 		mdelay (100);	/* wait a certain time */
 
 		/* 4) Advertise nothing */
-		mii_write (dev, phy_addr, MII_ANAR, 0);
+		mii_write (dev, phy_addr, MII_ADVERTISE, 0);
 
 		/* 5) Set media and Power Up */
-		bmcr = MII_BMCR_POWER_DOWN;
+		bmcr = BMCR_PDOWN;
 		if (np->speed == 100) {
-			bmcr |= MII_BMCR_SPEED_100;
+			bmcr |= BMCR_SPEED100;
 			printk (KERN_INFO "Manual 100 Mbps, ");
 		} else if (np->speed == 10) {
 			printk (KERN_INFO "Manual 10 Mbps, ");
 		}
 		if (np->full_duplex) {
-			bmcr |= MII_BMCR_DUPLEX_MODE;
+			bmcr |= BMCR_FULLDPLX;
 			printk (KERN_CONT "Full duplex\n");
 		} else {
 			printk (KERN_CONT "Half duplex\n");
 		}
 #if 0
 		/* Set 1000BaseT Master/Slave setting */
-		mscr = mii_read (dev, phy_addr, MII_MSCR);
+		mscr = mii_read (dev, phy_addr, MII_CTRL1000);
 		mscr |= MII_MSCR_CFG_ENABLE;
 		mscr &= ~MII_MSCR_CFG_VALUE = 0;
 #endif
@@ -1624,7 +1622,7 @@ mii_get_media_pcs (struct net_device *dev)
 
 	bmsr = mii_read (dev, phy_addr, PCS_BMSR);
 	if (np->an_enable) {
-		if (!(bmsr & MII_BMSR_AN_COMPLETE)) {
+		if (!(bmsr & BMSR_ANEGCOMPLETE)) {
 			/* Auto-Negotiation not completed */
 			return -1;
 		}
@@ -1649,7 +1647,7 @@ mii_get_media_pcs (struct net_device *dev)
 	} else {
 		__u16 bmcr = mii_read (dev, phy_addr, PCS_BMCR);
 		printk (KERN_INFO "Operating at 1000 Mbps, ");
-		if (bmcr & MII_BMCR_DUPLEX_MODE) {
+		if (bmcr & BMCR_FULLDPLX) {
 			printk (KERN_CONT "Full duplex\n");
 		} else {
 			printk (KERN_CONT "Half duplex\n");
@@ -1682,7 +1680,7 @@ mii_set_media_pcs (struct net_device *dev)
 	if (np->an_enable) {
 		/* Advertise capabilities */
 		esr = mii_read (dev, phy_addr, PCS_ESR);
-		anar = mii_read (dev, phy_addr, MII_ANAR) &
+		anar = mii_read (dev, phy_addr, MII_ADVERTISE) &
 			~PCS_ANAR_HALF_DUPLEX &
 			~PCS_ANAR_FULL_DUPLEX;
 		if (esr & (MII_ESR_1000BT_HD | MII_ESR_1000BX_HD))
@@ -1690,22 +1688,21 @@ mii_set_media_pcs (struct net_device *dev)
 		if (esr & (MII_ESR_1000BT_FD | MII_ESR_1000BX_FD))
 			anar |= PCS_ANAR_FULL_DUPLEX;
 		anar |= PCS_ANAR_PAUSE | PCS_ANAR_ASYMMETRIC;
-		mii_write (dev, phy_addr, MII_ANAR, anar);
+		mii_write (dev, phy_addr, MII_ADVERTISE, anar);
 
 		/* Soft reset PHY */
-		mii_write (dev, phy_addr, MII_BMCR, MII_BMCR_RESET);
-		bmcr = MII_BMCR_AN_ENABLE | MII_BMCR_RESTART_AN |
-		       MII_BMCR_RESET;
+		mii_write (dev, phy_addr, MII_BMCR, BMCR_RESET);
+		bmcr = BMCR_ANENABLE | BMCR_ANRESTART | BMCR_RESET;
 		mii_write (dev, phy_addr, MII_BMCR, bmcr);
 		mdelay(1);
 	} else {
 		/* Force speed setting */
 		/* PHY Reset */
-		bmcr = MII_BMCR_RESET;
+		bmcr = BMCR_RESET;
 		mii_write (dev, phy_addr, MII_BMCR, bmcr);
 		mdelay(10);
 		if (np->full_duplex) {
-			bmcr = MII_BMCR_DUPLEX_MODE;
+			bmcr = BMCR_FULLDPLX;
 			printk (KERN_INFO "Manual full duplex\n");
 		} else {
 			bmcr = 0;
@@ -1715,7 +1712,7 @@ mii_set_media_pcs (struct net_device *dev)
 		mdelay(10);
 
 		/*  Advertise nothing */
-		mii_write (dev, phy_addr, MII_ANAR, 0);
+		mii_write (dev, phy_addr, MII_ADVERTISE, 0);
 	}
 	return 0;
 }
diff --git a/drivers/net/ethernet/dlink/dl2k.h b/drivers/net/ethernet/dlink/dl2k.h
index 7caab3d..ba0adca 100644
--- a/drivers/net/ethernet/dlink/dl2k.h
+++ b/drivers/net/ethernet/dlink/dl2k.h
@@ -28,6 +28,7 @@
 #include <linux/init.h>
 #include <linux/crc32.h>
 #include <linux/ethtool.h>
+#include <linux/mii.h>
 #include <linux/bitops.h>
 #include <asm/processor.h>	/* Processor type for cache alignment. */
 #include <asm/io.h>
@@ -271,20 +272,9 @@ enum RFS_bits {
 #define MII_RESET_TIME_OUT		10000
 /* MII register */
 enum _mii_reg {
-	MII_BMCR = 0,
-	MII_BMSR = 1,
-	MII_PHY_ID1 = 2,
-	MII_PHY_ID2 = 3,
-	MII_ANAR = 4,
-	MII_ANLPAR = 5,
-	MII_ANER = 6,
-	MII_ANNPT = 7,
-	MII_ANLPRNP = 8,
-	MII_MSCR = 9,
-	MII_MSSR = 10,
-	MII_ESR = 15,
 	MII_PHY_SCR = 16,
 };
+
 /* PCS register */
 enum _pcs_reg {
 	PCS_BMCR = 0,
@@ -297,102 +287,6 @@ enum _pcs_reg {
 	PCS_ESR = 15,
 };
 
-/* Basic Mode Control Register */
-enum _mii_bmcr {
-	MII_BMCR_RESET = 0x8000,
-	MII_BMCR_LOOP_BACK = 0x4000,
-	MII_BMCR_SPEED_LSB = 0x2000,
-	MII_BMCR_AN_ENABLE = 0x1000,
-	MII_BMCR_POWER_DOWN = 0x0800,
-	MII_BMCR_ISOLATE = 0x0400,
-	MII_BMCR_RESTART_AN = 0x0200,
-	MII_BMCR_DUPLEX_MODE = 0x0100,
-	MII_BMCR_COL_TEST = 0x0080,
-	MII_BMCR_SPEED_MSB = 0x0040,
-	MII_BMCR_SPEED_RESERVED = 0x003f,
-	MII_BMCR_SPEED_10 = 0,
-	MII_BMCR_SPEED_100 = MII_BMCR_SPEED_LSB,
-	MII_BMCR_SPEED_1000 = MII_BMCR_SPEED_MSB,
-};
-
-/* Basic Mode Status Register */
-enum _mii_bmsr {
-	MII_BMSR_100BT4 = 0x8000,
-	MII_BMSR_100BX_FD = 0x4000,
-	MII_BMSR_100BX_HD = 0x2000,
-	MII_BMSR_10BT_FD = 0x1000,
-	MII_BMSR_10BT_HD = 0x0800,
-	MII_BMSR_100BT2_FD = 0x0400,
-	MII_BMSR_100BT2_HD = 0x0200,
-	MII_BMSR_EXT_STATUS = 0x0100,
-	MII_BMSR_PREAMBLE_SUPP = 0x0040,
-	MII_BMSR_AN_COMPLETE = 0x0020,
-	MII_BMSR_REMOTE_FAULT = 0x0010,
-	MII_BMSR_AN_ABILITY = 0x0008,
-	MII_BMSR_LINK_STATUS = 0x0004,
-	MII_BMSR_JABBER_DETECT = 0x0002,
-	MII_BMSR_EXT_CAP = 0x0001,
-};
-
-/* ANAR */
-enum _mii_anar {
-	MII_ANAR_NEXT_PAGE = 0x8000,
-	MII_ANAR_REMOTE_FAULT = 0x4000,
-	MII_ANAR_ASYMMETRIC = 0x0800,
-	MII_ANAR_PAUSE = 0x0400,
-	MII_ANAR_100BT4 = 0x0200,
-	MII_ANAR_100BX_FD = 0x0100,
-	MII_ANAR_100BX_HD = 0x0080,
-	MII_ANAR_10BT_FD = 0x0020,
-	MII_ANAR_10BT_HD = 0x0010,
-	MII_ANAR_SELECTOR = 0x001f,
-	MII_IEEE8023_CSMACD = 0x0001,
-};
-
-/* ANLPAR */
-enum _mii_anlpar {
-	MII_ANLPAR_NEXT_PAGE = MII_ANAR_NEXT_PAGE,
-	MII_ANLPAR_REMOTE_FAULT = MII_ANAR_REMOTE_FAULT,
-	MII_ANLPAR_ASYMMETRIC = MII_ANAR_ASYMMETRIC,
-	MII_ANLPAR_PAUSE = MII_ANAR_PAUSE,
-	MII_ANLPAR_100BT4 = MII_ANAR_100BT4,
-	MII_ANLPAR_100BX_FD = MII_ANAR_100BX_FD,
-	MII_ANLPAR_100BX_HD = MII_ANAR_100BX_HD,
-	MII_ANLPAR_10BT_FD = MII_ANAR_10BT_FD,
-	MII_ANLPAR_10BT_HD = MII_ANAR_10BT_HD,
-	MII_ANLPAR_SELECTOR = MII_ANAR_SELECTOR,
-};
-
-/* Auto-Negotiation Expansion Register */
-enum _mii_aner {
-	MII_ANER_PAR_DETECT_FAULT = 0x0010,
-	MII_ANER_LP_NEXTPAGABLE = 0x0008,
-	MII_ANER_NETXTPAGABLE = 0x0004,
-	MII_ANER_PAGE_RECEIVED = 0x0002,
-	MII_ANER_LP_NEGOTIABLE = 0x0001,
-};
-
-/* MASTER-SLAVE Control Register */
-enum _mii_mscr {
-	MII_MSCR_TEST_MODE = 0xe000,
-	MII_MSCR_CFG_ENABLE = 0x1000,
-	MII_MSCR_CFG_VALUE = 0x0800,
-	MII_MSCR_PORT_VALUE = 0x0400,
-	MII_MSCR_1000BT_FD = 0x0200,
-	MII_MSCR_1000BT_HD = 0X0100,
-};
-
-/* MASTER-SLAVE Status Register */
-enum _mii_mssr {
-	MII_MSSR_CFG_FAULT = 0x8000,
-	MII_MSSR_CFG_RES = 0x4000,
-	MII_MSSR_LOCAL_RCV_STATUS = 0x2000,
-	MII_MSSR_REMOTE_RCVR = 0x1000,
-	MII_MSSR_LP_1000BT_FD = 0x0800,
-	MII_MSSR_LP_1000BT_HD = 0x0400,
-	MII_MSSR_IDLE_ERR_COUNT = 0x00ff,
-};
-
 /* IEEE Extened Status Register */
 enum _mii_esr {
 	MII_ESR_1000BX_FD = 0x8000,
-- 
1.7.4.4

^ permalink raw reply related

* [PATCH net-next 0/2] Duplication of #define with mii.h.
From: Francois Romieu @ 2011-08-25  9:20 UTC (permalink / raw)
  To: davem; +Cc: netdev

Please pull from branch 'davem-next.mii' in repository

git://git.kernel.org/pub/scm/linux/kernel/git/romieu/netdev-2.6.git davem-next.mii

to get the changes below.

The sunbmac changes are not compile tested. Sunbmac changeset is on top of the
stack so it can be instantly removed if untrusted. Building a packaged rpm for a
cross sparc-linux compiler quickly turned more interesting than expected.

Distance from 'davem-next' (0856a304091b33a8e8f9f9c98e776f425af2b625)
---------------------------------------------------------------------

cd2967803617cd0a0bb8611e7d41c33a451207a5
78f6a6bd89e9a33e4be1bc61e6990a1172aa396e

Diffstat
--------

 drivers/net/ethernet/dlink/dl2k.c  |  105 +++++++++++++++++------------------
 drivers/net/ethernet/dlink/dl2k.h  |  110 +-----------------------------------
 drivers/net/ethernet/sun/sunbmac.c |   31 +++++-----
 drivers/net/ethernet/sun/sunbmac.h |   17 ------
 4 files changed, 69 insertions(+), 194 deletions(-)

Shortlog
--------

Francois Romieu (2):
      dl2k: use standard #defines from mii.h.
      sunbmac: use standard #defines from mii.h.

Patch
-----

See patches #1 and #2.
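
For readers skimming the series, here is a minimal userspace sketch of why the
conversion is expected to be a pure rename. The OLD_ values are copied from the
enums removed from dl2k.h in patch #1. The STD_ values restate what the
corresponding linux/mii.h names are understood to expand to. Both prefixes and
the helper count_value_changes() are hypothetical, chosen only so the block
builds without kernel headers.

```c
#include <stddef.h>

/* Old driver-local values, copied from the enums removed from dl2k.h. */
enum {
	OLD_MII_ANAR             = 4,
	OLD_MII_MSCR             = 9,
	OLD_MII_BMCR_RESET       = 0x8000,
	OLD_MII_BMCR_AN_ENABLE   = 0x1000,
	OLD_MII_BMCR_RESTART_AN  = 0x0200,
	OLD_MII_BMCR_DUPLEX_MODE = 0x0100,
	OLD_MII_BMSR_AN_COMPLETE = 0x0020,
};

/* Standard names from linux/mii.h, restated with a STD_ prefix (assumption:
 * these mirror the header; check the header in your tree). */
enum {
	STD_MII_ADVERTISE     = 0x04,
	STD_MII_CTRL1000      = 0x09,
	STD_BMCR_RESET        = 0x8000,
	STD_BMCR_ANENABLE     = 0x1000,
	STD_BMCR_ANRESTART    = 0x0200,
	STD_BMCR_FULLDPLX     = 0x0100,
	STD_BMSR_ANEGCOMPLETE = 0x0020,
};

struct mii_rename {
	const char *old_name;
	int old_val;
	const char *new_name;
	int new_val;
};

static const struct mii_rename mii_renames[] = {
	{ "MII_ANAR",             OLD_MII_ANAR,             "MII_ADVERTISE",     STD_MII_ADVERTISE },
	{ "MII_MSCR",             OLD_MII_MSCR,             "MII_CTRL1000",      STD_MII_CTRL1000 },
	{ "MII_BMCR_RESET",       OLD_MII_BMCR_RESET,       "BMCR_RESET",        STD_BMCR_RESET },
	{ "MII_BMCR_AN_ENABLE",   OLD_MII_BMCR_AN_ENABLE,   "BMCR_ANENABLE",     STD_BMCR_ANENABLE },
	{ "MII_BMCR_RESTART_AN",  OLD_MII_BMCR_RESTART_AN,  "BMCR_ANRESTART",    STD_BMCR_ANRESTART },
	{ "MII_BMCR_DUPLEX_MODE", OLD_MII_BMCR_DUPLEX_MODE, "BMCR_FULLDPLX",     STD_BMCR_FULLDPLX },
	{ "MII_BMSR_AN_COMPLETE", OLD_MII_BMSR_AN_COMPLETE, "BMSR_ANEGCOMPLETE", STD_BMSR_ANEGCOMPLETE },
};

/* Count renames whose numeric value would change; 0 means a pure rename. */
static int count_value_changes(void)
{
	int changed = 0;

	for (size_t i = 0; i < sizeof(mii_renames) / sizeof(mii_renames[0]); i++)
		if (mii_renames[i].old_val != mii_renames[i].new_val)
			changed++;
	return changed;
}
```

Calling count_value_changes() from a small main() and checking for zero is
enough to confirm that none of the listed substitutions alters a register
offset or bit mask.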

-- 
Ueimor

^ permalink raw reply

* Re: Use of 802.3ad bonding for increasing link throughput
From: Simon Horman @ 2011-08-25  9:35 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Tom Brown, netdev
In-Reply-To: <5344.1312998372@death>

On Wed, Aug 10, 2011 at 10:46:12AM -0700, Jay Vosburgh wrote:

[snip]

> 	On linux, the tcp_reordering sysctl value can be raised to
> compensate, but it will still result in increased packet overhead, and
> is not likely to be very efficient, and doesn't help with anything
> that's not TCP/IP.  I have not tested balance-rr in a few years now, but
> my recollection is that, as a best case, throughput of one TCP
> connection could reach about 1.5x with 2 slaves, or about 2.5x with 4
> slaves (where the multipliers are in units of "bandwidth of one slave").

Hi Jay,

For what it is worth, I would like to chip in with the results of some
testing I did using balance-rr and 3 gigabit NICs late last year.  The
link was three direct ("cross-over") cables to a machine that was also
using balance-rr.


I found that by increasing rx-usecs (from 3 to 45) and enabling GRO
and TSO I was able to push 2.7*10^9 bits/s.

Local CPU utilisation was 30% and remote CPU utilisation was 10%.
Local service demand was 1.7 us/KB and remote service demand was 2.2 us/KB.

The MTU was 1500 bytes.

In this configuration, with the tuning options described above, increasing
tcp_reordering (to 127) did not have a noticeable effect on throughput but
did increase local CPU utilisation to about 50% and local service demand to
3.0 us/KB.  There was also increased remote CPU utilisation and service
demand, although not as significant.


By using a 9000 byte MTU I was able to get close to 3*10^9 bits/s
with other parameters at their default values.

Local CPU utilisation was 15% and remote CPU utilisation was 5%.
Local service demand was 0.8 us/KB and remote service demand was 1.1 us/KB.


Increasing rx-usecs was suggested to me by Eric Dumazet on this list.

I no longer have access to the systems that I used to run these tests but I
do have other results that I have omitted from this email for the sake of
brevity.


Anecdotally my opinion after running these and other tests is that if you
want to push more than a gigabit/s over a single TCP stream then you would
be well advised to get a faster link rather than bond gigabit devices.  I
believe you stated something similar earlier on in this thread.

^ permalink raw reply

* When set mtu 9600 by gfar_change_mtu, the maxfrm register is greater than 9600
From: Rongqing Li @ 2011-08-25  9:24 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: netdev

Hi:

When the MTU is set to 9600 via gfar_change_mtu(), the maxfrm register is
set to 9728, which is greater than 9600, in gianfar.c.

But the MPC8315 Reference Manual says the value of maxfrm cannot be
greater than 9600.

Is this a defect? Do we need to fix it?
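
To make the arithmetic concrete, here is a minimal sketch of how 9600 could
turn into 9728. It assumes the driver adds the 14-byte Ethernet header to the
MTU and then rounds to the next 512-byte step, always adding one full step.
The function name maxfrm_for_mtu() is hypothetical, and the sketch ignores any
VLAN, FCB or padding bytes the real code may add.

```c
#define ETH_HLEN                14   /* Ethernet header length in bytes */
#define INCREMENTAL_BUFFER_SIZE 512  /* assumed rounding step */

/*
 * Hypothetical helper: clear the low bits of the frame size, then add one
 * full step. A frame size that is already a multiple of the step therefore
 * still grows by 512.
 */
static int maxfrm_for_mtu(int mtu)
{
	int frame_size = mtu + ETH_HLEN;

	return (frame_size & ~(INCREMENTAL_BUFFER_SIZE - 1)) +
	       INCREMENTAL_BUFFER_SIZE;
}
```

Under these assumptions maxfrm_for_mtu(9600) gives 9728 and
maxfrm_for_mtu(1500) gives 1536. Since 9600 is not a multiple of 512, the
largest value this rounding can produce without exceeding 9600 is 9216.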


-- 
Best Regards,
Roy | RongQing Li

^ permalink raw reply

* Re: [PATCH] tcp: bound RTO to minimum
From: Eric Dumazet @ 2011-08-25  9:09 UTC (permalink / raw)
  To: Arnd Hannemann
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian
In-Reply-To: <4E560BFD.5020301@arndnet.de>

Le jeudi 25 août 2011 à 10:46 +0200, Arnd Hannemann a écrit :
> Hi,
> 
> Am 25.08.2011 10:26, schrieb Eric Dumazet:
> > Le jeudi 25 août 2011 à 09:28 +0200, Alexander Zimmermann a écrit :
> >> Hi Eric,
> >>
> >> Am 25.08.2011 um 07:28 schrieb Eric Dumazet:
> > 
> >>> Real question is : do we really want to process ~1000 timer interrupts
> >>> per tcp session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
> >>> requests, only to make tcp recover in ~1sec when connectivity returns
> >>> back. This just doesn't scale.
> >>
> >> maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
> >> probing time of 120s, we get 600 retransmits in a worst-case scenario
> >> (assuming that we get an icmp for every RTO retransmission). No?
> > 
> > Where is asserted the "max probing time of 120s" ? 
> > 
> > It is not the case on my machine :
> > I have way more retransmits than that, even if spaced by 1600 ms
> > 
> > 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> > 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> > 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> > 
> > Old kernels were performing up to 15 retries, doing exponential backoff.
> > 
> > Now it's kind of unlimited, according to experimental results.
> 
> That shouldn't be. It should stop after the same time a TCP connection with an
> RTO of Minimum RTO which is doing 15 retries (tcp_retries2=15) and doing exponential backoff.
> So it should be around 900s*. But it could be that because of the icsk_retransmit wrapover
> this doesn't work as expected.
> 
> * 200ms + 400ms + 800ms ...

It is 924 seconds with retries2=15 (default value)

I said ~1000 probes.

If ICMP are not rate limited, that could be about 924*5 probes, instead
of 15 probes on old kernels.

Maybe we should refine the thing a bit, to not reverse backoff unless
rto is > some_threshold.

Say 10s being the value, that would give at most 92 tries.

I mean, what is the gain to be able to restart a frozen TCP session with
a 1sec latency instead of 10s if it was blocked more than 60 seconds ?
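
The numbers above can be reproduced with a small sketch. It assumes a minimum
RTO of 200 ms, a cap of 120 s on any single backed-off RTO, and that 15 retries
means 16 timeouts in total. The helper names backoff_total_ms() and
probes_at_fixed_interval() are hypothetical, not kernel functions.

```c
#define RTO_MIN_MS 200     /* minimum RTO discussed in this thread */
#define RTO_MAX_MS 120000  /* assumed cap on a single backed-off RTO */

/* Total time waited across (retries + 1) timeouts with exponential backoff. */
static long backoff_total_ms(int retries)
{
	long total = 0;
	long rto = RTO_MIN_MS;

	for (int i = 0; i <= retries; i++) {
		total += rto;
		rto *= 2;
		if (rto > RTO_MAX_MS)
			rto = RTO_MAX_MS;
	}
	return total;
}

/* Probes sent during total_ms when every probe is spaced interval_ms apart. */
static long probes_at_fixed_interval(long total_ms, long interval_ms)
{
	return total_ms / interval_ms;
}
```

With these assumptions backoff_total_ms(15) is 924600 ms. Spacing probes every
200 ms over that period gives about 4600 probes, every 1 s gives 924, and
every 10 s gives 92.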

^ permalink raw reply

* Re: slow performance on disk/network i/o full speed after drop_caches
From: Stefan Priebe - Profihost AG @ 2011-08-25  9:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Jens Axboe, Linux Netdev List
In-Reply-To: <20110824093336.GB5214@localhost>

Am 24.08.2011 11:33, schrieb Wu Fengguang:
> On Wed, Aug 24, 2011 at 05:01:03PM +0800, Stefan Priebe - Profihost AG wrote:
>>
>>>> sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2 && echo 0 > /proc/sys/vm/drop_caches
>>
>> Another way to get it working again is to stop some processes. Could be
>> mysql or apache or php fcgi doesn't matter. Just free some memory.
>> Although there are already 5GB free.
>
> Is it a NUMA machine and _every_ node has enough free pages?
>
>          grep . /sys/devices/system/node/node*/vmstat
>
> Thanks,
> Fengguang
Hi Fengguang,

thanks for your fast reply.

Here is the data you requested:

root@server1015-han:~# grep . /sys/devices/system/node/node*/vmstat
/sys/devices/system/node/node0/vmstat:nr_written 5546561
/sys/devices/system/node/node0/vmstat:nr_dirtied 5572497
/sys/devices/system/node/node1/vmstat:nr_written 3936
/sys/devices/system/node/node1/vmstat:nr_dirtied 4190

modified it a little bit:
~# while [ true ]; do ps -eo user,pid,tid,class,rtprio,ni,pri,psr,pcpu,vsz,rss,pmem,stat,wchan:28,cmd | grep scp | grep -v grep; sleep 1; done

root     12409 12409 TS       -   0  19   0 59.8  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 64.0  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   0 67.7  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 70.6  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 73.5  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 76.0  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 78.2  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 80.0  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 80.9  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   2 76.7  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 75.6  42136  1724  0.0 Ds pipe_read                    scp -t /tmp/
root     12409 12409 TS       -   0  19   0 76.0  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 75.2  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 76.6  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 77.9  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 79.0  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 72.8  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 73.0  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 73.8  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 74.3  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 73.4  42136  1724  0.0 Ss -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 71.3  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 71.9  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   0 72.7  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   3 73.5  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   3 74.4  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   3 75.2  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   0 76.0  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 76.6  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 74.8  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 73.2  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 73.9  42136  1724  0.0 Rs poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 72.4  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 72.0  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 72.5  42136  1724  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 72.9  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 73.5  42136  1724  0.0 Rs -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1  0.0  42136  1728  0.0 Rs -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 23.0  42136  1728  0.0 Rs -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 49.5  42136  1728  0.0 Rs -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   2 63.3  42136  1728  0.0 Rs -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 71.5  42136  1728  0.0 Rs -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 77.4  42136  1728  0.0 Rs -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 70.3  42136  1728  0.0 Rs -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 73.1  42136  1728  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12566 12566 TS       -   0  19   0 65.7  42136  1728  0.0 Ss poll_schedule_timeout        scp -t /tmp/
root     12566 12566 TS       -   0  19   1 61.2  42136  1728  0.0 Ss -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 63.7  42136  1728  0.0 Rs -                            scp -t /tmp/
root     12636 12636 TS       -   0  19   8  0.0  42136  1728  0.0 Ss poll_schedule_timeout        scp -t /tmp/


Stefan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Ilpo Järvinen @ 2011-08-25  8:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jerry Chu, Damian Lukowski
In-Reply-To: <1314226834.6797.5.camel@edumazet-laptop>


On Thu, 25 Aug 2011, Eric Dumazet wrote:

> Le jeudi 25 août 2011 à 01:44 +0300, Ilpo Järvinen a écrit :
> > On Wed, 24 Aug 2011, Eric Dumazet wrote:
> > 
> > > On one dev machine running net-next, I just found strange tcp sessions
> > > that retransmit a frame forever (The other peer disappeared)
> > > 
> > > # ss -emoi dst 10.2.1.1
> > > State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> > > ESTAB      0      816              10.2.1.2:37930             10.2.1.1:ssh      timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
> > > 	 mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
> > > 
> > > 
> > > You can see the retransmit count : 246 
> > > 
> > > What possibly can be going on ?
> > > 
> > > What happened to backoff ?
> > 
> > But RTO (even without any backoffs) should be lower bounded to some not so 
> > zeroish value?
> 
> Apparently not.
> 
> The only thing that protect us from a flood is that ip_error() uses
> inetpeer cache to ratelimit the icmp_send(ICMP_DEST_UNREACH)
> 
> This is why we get retransmit period >= 1 sec
>
> vi +432 net/ipv4/tcp_ipv4.c
> 
>                 icsk->icsk_backoff--;
>                 inet_csk(sk)->icsk_rto = (tp->srtt ? __tcp_set_rto(tp) :
>                         TCP_TIMEOUT_INIT) << icsk->icsk_backoff;
>                 tcp_bound_rto(sk);
> 
> and __tcp_set_rto() uses : return (tp->srtt >> 3) + tp->rttvar;

So you think that this is not true: ?

        /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
         * guarantees that rto is higher.
         */

...it would still be smaller than 1sec though, but certainly not going to 
cause flooding either. Default tcp_rto_min should be 200ms so it's 
5pkts+5ICMP sent, received and processed per second. Which doesn't sound 
that bad CPU load?!?

It is unclear to me how tp->rttvar could become smaller than 
tcp_rto_min().
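
As a rough illustration of the quoted computation, here is a sketch of the RTO
after one backoff step has been reverted. It works in plain milliseconds rather
than jiffies, takes the smoothed RTT scaled by 8 as in the quoted code, and
assumes that the bounding step only clamps at an upper limit of 120 s. The
function name rto_after_backoff() and all input values are hypothetical.

```c
#define RTO_MAX_MS 120000u  /* assumed upper bound applied after the shift */

/*
 * Hypothetical helper: base RTO is (srtt_x8 >> 3) + rttvar, shifted left by
 * the remaining backoff count and clamped from above only.
 */
static unsigned int rto_after_backoff(unsigned int srtt_x8,
				      unsigned int rttvar,
				      unsigned int backoff)
{
	unsigned int rto = ((srtt_x8 >> 3) + rttvar) << backoff;

	return rto > RTO_MAX_MS ? RTO_MAX_MS : rto;
}
```

With srtt_x8 = 128 (a 16 ms smoothed RTT) and rttvar = 200, the result is
216 ms at backoff 0 and 1728 ms at backoff 3. As long as rttvar stays at or
above 200 ms, the result cannot drop below 200 ms, which matches the estimate
of roughly five retransmits per second at most. If rttvar were allowed to fall
to, say, 30 ms, the same inputs would give 46 ms.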

-- 
 i.

^ permalink raw reply
