Netdev List


* Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size
From: David Miller @ 2012-05-26  0:11 UTC (permalink / raw)
  To: asharma; +Cc: eric.dumazet, netdev, linux-kernel
In-Reply-To: <4FC01F1B.1080009@fb.com>

From: Arun Sharma <asharma@fb.com>
Date: Fri, 25 May 2012 17:08:59 -0700

> On 5/25/12 3:51 PM, David Miller wrote:
>> From: Arun Sharma<asharma@fb.com>
>> Date: Fri, 25 May 2012 15:22:54 -0700
>>
>>> On 5/25/12 1:47 PM, Eric Dumazet wrote:
>>>> On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
>>>>> The algorithm is based on ipv4 and alloc_large_system_hash().
>>>>>
>>>>
>>>> Why is it needed at all ?
>>>>
>>>> IPv4 has a route cache with potentially millions of entries, not IPv6.
>>>
>>> With the default size of 4096 for the ipv6 routing table, entries can
>>> get garbage collected and hosts could lose their default route and
>>> therefore lose connectivity.
>>>
>>> We actually saw it happen.
>>
>> Under no circumstances should administrator configured ipv6 routes be
>> garbage collected, that is a bug.
> 
> These were not admin configured routes. They were discovered via ipv6
> neighbor discovery.

Then such default routes should either be:

1) Passed over by GC

2) Trigger neighbour discovery when GC'd

^ permalink raw reply
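
Option 1 above ("passed over by GC") can be pictured with a toy model. This is a minimal userspace sketch, not kernel code: the struct toy_route type, its nd_default field and the toy_gc() function are hypothetical names invented for illustration, and the eviction policy (least recently used first) is an assumption rather than what the IPv6 route GC actually does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified route entry; not the kernel's struct rt6_info. */
struct toy_route {
	unsigned long last_used;	/* larger means more recently used */
	bool nd_default;		/* default route learned via neighbour discovery */
	bool in_use;			/* slot holds a live entry */
};

/* Evict least recently used entries until at most max_size remain,
 * never evicting entries marked nd_default.
 * Returns the number of entries evicted.
 */
size_t toy_gc(struct toy_route *tbl, size_t n, size_t max_size)
{
	size_t live = 0, evicted = 0;

	for (size_t i = 0; i < n; i++)
		if (tbl[i].in_use)
			live++;

	while (live > max_size) {
		size_t victim = n;	/* n means "no candidate found" */

		for (size_t i = 0; i < n; i++) {
			if (!tbl[i].in_use || tbl[i].nd_default)
				continue;	/* protected or empty slot */
			if (victim == n ||
			    tbl[i].last_used < tbl[victim].last_used)
				victim = i;
		}
		if (victim == n)
			break;		/* only protected entries are left */
		tbl[victim].in_use = false;
		live--;
		evicted++;
	}
	return evicted;
}
```

With four live entries and max_size set to 2, the oldest entry survives if it is marked nd_default, and the two oldest unprotected entries are evicted instead.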

* Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size
From: Arun Sharma @ 2012-05-26  0:08 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, linux-kernel
In-Reply-To: <20120525.185131.2017517041016424794.davem@davemloft.net>

On 5/25/12 3:51 PM, David Miller wrote:
> From: Arun Sharma<asharma@fb.com>
> Date: Fri, 25 May 2012 15:22:54 -0700
>
>> On 5/25/12 1:47 PM, Eric Dumazet wrote:
>>> On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
>>>> The algorithm is based on ipv4 and alloc_large_system_hash().
>>>>
>>>
>>> Why is it needed at all ?
>>>
>>> IPv4 has a route cache with potentially millions of entries, not IPv6.
>>
>> With the default size of 4096 for the ipv6 routing table, entries can
>> get garbage collected and hosts could lose their default route and
>> therefore lose connectivity.
>>
>> We actually saw it happen.
>
> Under no circumstances should administrator configured ipv6 routes be
> garbage collected, that is a bug.

These were not admin configured routes. They were discovered via ipv6 
neighbor discovery.

  -Arun

^ permalink raw reply

* Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size
From: David Miller @ 2012-05-25 22:51 UTC (permalink / raw)
  To: asharma; +Cc: eric.dumazet, netdev, linux-kernel
In-Reply-To: <4FC0063E.8080209@fb.com>

From: Arun Sharma <asharma@fb.com>
Date: Fri, 25 May 2012 15:22:54 -0700

> On 5/25/12 1:47 PM, Eric Dumazet wrote:
>> On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
>>> The algorithm is based on ipv4 and alloc_large_system_hash().
>>>
>>
>> Why is it needed at all ?
>>
>> IPv4 has a route cache with potentially millions of entries, not IPv6.
> 
> With the default size of 4096 for the ipv6 routing table, entries can
> get garbage collected and hosts could lose their default route and
> therefore lose connectivity.
> 
> We actually saw it happen.

Under no circumstances should administrator configured ipv6 routes be
garbage collected, that is a bug.

^ permalink raw reply

* [PATCH] net: compute a more reasonable default ip6_rt_max_size (v2)
From: Arun Sharma @ 2012-05-25 22:26 UTC (permalink / raw)
  To: netdev; +Cc: Arun Sharma, linux-kernel, David Miller

The algorithm is based on ipv4 and alloc_large_system_hash().

The following data is from an x86_64 box I tested:

128MB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
16384
22444

512MB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
65536
99856

1GB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
524288
203068

2GB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
1048576
524288

4GB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
2097152
524288

Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: David Miller <davem@davemloft.net>
---
 net/ipv6/route.c |   21 ++++++++++++++++++++-
 1 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 49d6ce1..bf85926 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2827,6 +2827,16 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
 }
 #endif
 
+static __initdata unsigned long ip6_rt_entries;
+static int __init set_rt_entries(char *str)
+{
+	if (!str)
+		return 0;
+	ip6_rt_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("ip6_rt_entries=", set_rt_entries);
+
 static int __net_init ip6_route_net_init(struct net *net)
 {
 	int ret = -ENOMEM;
@@ -2872,8 +2882,17 @@ static int __net_init ip6_route_net_init(struct net *net)
 			 ip6_template_metrics, true);
 #endif
 
+	/* Compute a reasonable default based on what we do for ipv4
+	 * total size = 1/16th of total RAM
+	 * No more than 512k entries unless overridden on kernel cmdline */
+	if (ip6_rt_entries == 0) {
+		ip6_rt_entries = (totalram_pages << PAGE_SHIFT) >> 4;
+		ip6_rt_entries /= sizeof(struct rt6_info);
+		ip6_rt_entries = min(512 * 1024UL, ip6_rt_entries);
+	}
+
 	net->ipv6.sysctl.flush_delay = 0;
-	net->ipv6.sysctl.ip6_rt_max_size = 4096;
+	net->ipv6.sysctl.ip6_rt_max_size = ip6_rt_entries;
 	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
 	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
 	net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
-- 
1.7.8.4

^ permalink raw reply related
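
The sizing rule in the patch (one sixteenth of total RAM, divided by the size of one route entry, capped at 512k entries) can be tried out in userspace. The sketch below is an illustration under assumptions: toy_ip6_rt_default() and TOY_MAX_ENTRIES are made-up names, and the entry size is a parameter because sizeof(struct rt6_info) depends on the kernel version and configuration.

```c
#include <stddef.h>

#define TOY_MAX_ENTRIES (512 * 1024UL)	/* the 512k cap from the patch */

/* ram_bytes:  total RAM in bytes
 * entry_size: bytes per route entry (sizeof(struct rt6_info) in the patch;
 *             an assumed parameter here)
 */
unsigned long toy_ip6_rt_default(unsigned long long ram_bytes,
				 size_t entry_size)
{
	/* 1/16th of RAM, expressed as a number of entries */
	unsigned long long entries = (ram_bytes >> 4) / entry_size;

	return entries < TOY_MAX_ENTRIES ? (unsigned long)entries
					 : TOY_MAX_ENTRIES;
}
```

For example, with an assumed entry size of 384 bytes, 1GB of RAM gives 174762 entries and 4GB hits the 524288 cap. The sketch takes the RAM size as unsigned long long; an unsigned long byte count, as produced by shifting a page count, could overflow on a 32-bit machine with 4GB or more.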

* Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size
From: Arun Sharma @ 2012-05-25 22:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, David Miller
In-Reply-To: <1337978820.10135.1.camel@edumazet-glaptop>

On 5/25/12 1:47 PM, Eric Dumazet wrote:
> On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
>> The algorithm is based on ipv4 and alloc_large_system_hash().
>>
>
> Why is it needed at all ?
>
> IPv4 has a route cache with potentially millions of entries, not IPv6.

With the default size of 4096 for the ipv6 routing table, entries can 
get garbage collected and hosts could lose their default route and 
therefore lose connectivity.

We actually saw it happen.

  -Arun

^ permalink raw reply

* Inadvertently sending a Christmas Tree TCP packet
From: Earl Chew @ 2012-05-25 22:15 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org; +Cc: netdev
In-Reply-To: <4FBFCFCA.2090501@ixiacom.com>

I had previously observed the following behaviour captured from WireShark:

16220	111.075627	10.64.33.43	10.128.163.100	TCP	59253 > exec [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=2
16222	0.203210	10.128.163.100	10.64.33.43	TCP	exec > 59253 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1250 WS=7
16223	0.000032	10.64.33.43	10.128.163.100	TCP	59253 > exec [ACK] Seq=1 Ack=1 Win=65532 Len=0
... snip ...
16237	0.000319	10.128.163.100	10.64.33.43	TCP	exec > 59253 [FIN, PSH, ACK, URG] Seq=31 Ack=30 Win=5888 Urg=1 Len=1
16240	1.114085	10.128.163.100	10.64.33.43	TCP	[TCP Retransmission] exec > 59253 [FIN, PSH, ACK, URG] Seq=31 Ack=30 Win=5888 Urg=1 Len=1

These packets were sent from an application running on Linux 2.6.18. 

The receiver has become confused, and so the Linux sender retransmits at packet 16240,
and continues retransmitting. In this case, the application code at the receiver is blocked
indefinitely trying to read a socket that seemingly has (URG) data and yet at the same time
doesn't have any more data (FIN).

Looking at the 2.6.18 source code for tcp_output.c, I see code at tcp_send_fin()
that is attaching FIN to the packet.

The code in 3.4 seems fairly much the same:

/* Send a fin.  The caller locks the socket for us.  This cannot be
 * allowed to fail queueing a FIN frame under any circumstances.
 */
void tcp_send_fin(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb = tcp_write_queue_tail(sk);
	int mss_now;

	/* Optimization, tack on the FIN if we have a queue of
	 * unsent frames.  But be careful about outgoing SACKS
	 * and IP options.
	 */
	mss_now = tcp_current_mss(sk);

	if (tcp_send_head(sk) != NULL) {
		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
		TCP_SKB_CB(skb)->end_seq++;
		tp->write_seq++;
	} else {

The comment block says to be careful about IP options, but
the code doesn't appear to worry too much.

Is something like:

	if (tcp_send_head(sk) != NULL &&
		TCP_SKB_CB(skb)->tcp_flags == 0)

more appropriate ?

Earl

^ permalink raw reply
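
The flag combination in packet 16237 can be reproduced with plain bit arithmetic. This is a minimal sketch of the idea only: the TOY_* constants use the standard TCP header bit positions, but toy_tack_on_fin() and toy_fin_with_urgent() are hypothetical helpers, and the real kernel may decide some bits (URG in particular) at a different point than this model suggests.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits (standard bit positions). */
#define TOY_FIN 0x01
#define TOY_SYN 0x02
#define TOY_RST 0x04
#define TOY_PSH 0x08
#define TOY_ACK 0x10
#define TOY_URG 0x20

/* Model of "tack the FIN onto the last unsent segment": OR the bit in,
 * whatever flags the segment already carries.
 */
uint8_t toy_tack_on_fin(uint8_t tail_flags)
{
	return tail_flags | TOY_FIN;
}

/* True when a segment carries both urgent data and FIN,
 * the combination seen in packet 16237.
 */
bool toy_fin_with_urgent(uint8_t flags)
{
	return (flags & (TOY_FIN | TOY_URG)) == (TOY_FIN | TOY_URG);
}
```

Tacking FIN onto a segment that already has PSH, ACK and URG set yields 0x39, which a capture tool displays as [FIN, PSH, ACK, URG]; a guard like the one proposed above would have to refuse the tack-on before that point.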

* Re: WARNING: at net/ipv4/tcp.c:1610 tcp_recvmsg+0xb1b/0xc70()
From: Jack Stone @ 2012-05-25 21:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, Linux Kernel
In-Reply-To: <1337979331.10135.2.camel@edumazet-glaptop>

On 05/25/2012 09:55 PM, Eric Dumazet wrote:
> On Fri, 2012-05-25 at 22:45 +0200, Eric Dumazet wrote:
>> No need, update your tree.
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=1ca7ee30630e1022dbcf1b51be20580815ffab73
> 

Thank you. Rebuilding now.

Jack

^ permalink raw reply

* Re: WARNING: at net/ipv4/tcp.c:1610 tcp_recvmsg+0xb1b/0xc70()
From: Eric Dumazet @ 2012-05-25 20:55 UTC (permalink / raw)
  To: Jack Stone; +Cc: davem, netdev, Linux Kernel
In-Reply-To: <1337978725.10135.0.camel@edumazet-glaptop>

On Fri, 2012-05-25 at 22:45 +0200, Eric Dumazet wrote:
> On Fri, 2012-05-25 at 21:25 +0100, Jack Stone wrote:
> > Hi All,
> > 
> > The following warning keeps hitting me. I couldn't get the first one - it had already left dmesg hence the W taint.
> > The C taint is from r8712u from staging.
> > 
> > I've seen it with 3.4.0-07644-g07acfc2 (recent Linus tree) and 3.4.0-rc3-00089-gc6f5c93.
> > 
> > I am going to attempt to bisect it now.
> 
> No need, update your tree.
> 
> 


http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=1ca7ee30630e1022dbcf1b51be20580815ffab73

^ permalink raw reply

* Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size
From: Eric Dumazet @ 2012-05-25 20:47 UTC (permalink / raw)
  To: Arun Sharma; +Cc: netdev, linux-kernel, David Miller
In-Reply-To: <1337976934-18065-1-git-send-email-asharma@fb.com>

On Fri, 2012-05-25 at 13:15 -0700, Arun Sharma wrote:
> The algorithm is based on ipv4 and alloc_large_system_hash().
> 

Why is it needed at all ?

IPv4 has a route cache with potentially millions of entries, not IPv6.

^ permalink raw reply

* Re: WARNING: at net/ipv4/tcp.c:1610 tcp_recvmsg+0xb1b/0xc70()
From: Eric Dumazet @ 2012-05-25 20:45 UTC (permalink / raw)
  To: Jack Stone; +Cc: davem, netdev, Linux Kernel
In-Reply-To: <4FBFEACC.8040601@fastmail.fm>

On Fri, 2012-05-25 at 21:25 +0100, Jack Stone wrote:
> Hi All,
> 
> The following warning keeps hitting me. I couldn't get the first one - it had already left dmesg hence the W taint.
> The C taint is from r8712u from staging.
> 
> I've seen it with 3.4.0-07644-g07acfc2 (recent Linus tree) and 3.4.0-rc3-00089-gc6f5c93.
> 
> I am going to attempt to bisect it now.

No need, update your tree.

^ permalink raw reply

* Re: [PATCH] net: compute a more reasonable default ip6_rt_max_size
From: David Miller @ 2012-05-25 20:26 UTC (permalink / raw)
  To: asharma; +Cc: netdev, linux-kernel
In-Reply-To: <1337976934-18065-1-git-send-email-asharma@fb.com>

From: Arun Sharma <asharma@fb.com>
Date: Fri, 25 May 2012 13:15:34 -0700

> +	/* Compute a reasonable default based on what we do for ipv4
> +	 * total size = 1/16th of total RAM
> +	 * No more than 512k entries unless overridden on kernel cmdline */

Please format this comment correctly:

	/* Compute a reasonable default based on what we do for ipv4
	 * total size = 1/16th of total RAM
	 * No more than 512k entries unless overridden on kernel cmdline
	 */

^ permalink raw reply

* WARNING: at net/ipv4/tcp.c:1610 tcp_recvmsg+0xb1b/0xc70()
From: Jack Stone @ 2012-05-25 20:25 UTC (permalink / raw)
  To: davem, netdev, Linux Kernel

Hi All,

The following warning keeps hitting me. I couldn't get the first one - it had already left dmesg hence the W taint.
The C taint is from r8712u from staging.

I've seen it with 3.4.0-07644-g07acfc2 (recent Linus tree) and 3.4.0-rc3-00089-gc6f5c93.

I am going to attempt to bisect it now.

[ 3896.037489] ------------[ cut here ]------------
[ 3896.037490] WARNING: at net/ipv4/tcp.c:1610 tcp_recvmsg+0xb1b/0xc70()
[ 3896.037491] Hardware name: System Product Name
[ 3896.037491] recvmsg bug 2: copied 3F1199D7 seq 3F1199D7 rcvnxt 3F119A71 fl 0
[ 3896.037511] Modules linked in: fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge rfcomm lockd 8021q garp stp llc bnep nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_conntrack_ipv4 nf_conntrack_ipv6 nf_defrag_ipv6 nf_defrag_ipv4 xt_state nf_conntrack ip6table_filter ip6_tables vhost_net snd_hda_codec_hdmi macvtap macvlan tun snd_hda_codec_realtek virtio_net btusb bluetooth coretemp kvm_intel kvm snd_hda_intel r8712u(C) snd_hda_codec snd_hwdep e1000e joydev snd_seq snd_seq_device snd_pcm snd_timer snd sunrpc eeepc_wmi asus_wmi hid_logitech_dj sparse_keymap mxm_wmi soundcore iTCO_wdt rfkill snd_page_alloc wmi i2c_i801 pcspkr iTCO_vendor_support serio_raw binfmt_misc uinput microcode crc32c_intel ghash_clmulni_intel firewire_ohci firewire_core crc_itu_t [last unloaded: scsi_wait_scan]
[ 3896.037512] Pid: 3926, comm: spotify Tainted: G        WC   3.4.0-07644-g07acfc2 #2
[ 3896.037513] Call Trace:
[ 3896.037514]  [<ffffffff8106010f>] warn_slowpath_common+0x7f/0xc0
[ 3896.037515]  [<ffffffff81060206>] warn_slowpath_fmt+0x46/0x50
[ 3896.037517]  [<ffffffff8163f4c5>] ? tcp_recvmsg+0x35/0xc70
[ 3896.037518]  [<ffffffff812c130f>] ? avc_has_perm_flags+0xef/0x230
[ 3896.037519]  [<ffffffff812c125c>] ? avc_has_perm_flags+0x3c/0x230
[ 3896.037520]  [<ffffffff8163ffab>] tcp_recvmsg+0xb1b/0xc70
[ 3896.037522]  [<ffffffff8166a8c0>] ? inet_sendmsg+0x230/0x230
[ 3896.037523]  [<ffffffff8166a9f7>] inet_recvmsg+0x137/0x250
[ 3896.037525]  [<ffffffff815d7f58>] ? sock_update_classid+0x128/0x310
[ 3896.037526]  [<ffffffff815cfe40>] do_sock_read+0xf0/0x110
[ 3896.037527]  [<ffffffff815d0b8c>] sock_aio_read.part.5+0x4c/0x70
[ 3896.037528]  [<ffffffff812c130f>] ? avc_has_perm_flags+0xef/0x230
[ 3896.037530]  [<ffffffff815d0bb0>] ? sock_aio_read.part.5+0x70/0x70
[ 3896.037531]  [<ffffffff815d0bdd>] sock_aio_read+0x2d/0x40
[ 3896.037532]  [<ffffffff811bc2b3>] do_sync_readv_writev+0xd3/0x110
[ 3896.037534]  [<ffffffff812beca6>] ? security_file_permission+0x96/0xb0
[ 3896.037535]  [<ffffffff811bb9a1>] ? rw_verify_area+0x61/0x100
[ 3896.037537]  [<ffffffff811bc584>] do_readv_writev+0xd4/0x1d0
[ 3896.037538]  [<ffffffff811bdad8>] ? fget_light+0x48/0x4f0
[ 3896.037540]  [<ffffffff811bdad8>] ? fget_light+0x48/0x4f0
[ 3896.037541]  [<ffffffff811bc71c>] vfs_readv+0x3c/0x50
[ 3896.037543]  [<ffffffff811bc77d>] sys_readv+0x4d/0xc0
[ 3896.037544]  [<ffffffff8174c829>] system_call_fastpath+0x16/0x1b
[ 3896.037545] ---[ end trace 762b4689c56af7ab ]---

The relevant code from tcp.c is:

		/* Next get a buffer. */

                skb_queue_walk(&sk->sk_receive_queue, skb) {
                        /* Now that we have two receive queues this
                         * shouldn't happen.
                         */
                        if (WARN(before(*seq, TCP_SKB_CB(skb)->seq),
                                 "recvmsg bug: copied %X seq %X rcvnxt %X fl %X\n",
                                 *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt,
                                 flags))
                                break;

                        offset = *seq - TCP_SKB_CB(skb)->seq;
                        if (tcp_hdr(skb)->syn)
                                offset--;
                        if (offset < skb->len)
                                goto found_ok_skb;
                        if (tcp_hdr(skb)->fin)
                                goto found_fin_ok;
This warn here ----->        WARN(!(flags & MSG_PEEK),
                             "recvmsg bug 2: copied %X seq %X rcvnxt %X fl %X\n",
                             *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt, flags);
                }

Thanks,

Jack

^ permalink raw reply
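
The sequence checks in the quoted loop can be replayed in userspace with the values from the log line. This is a sketch under assumptions: toy_before() follows the usual wraparound-safe comparison that before() is understood to implement, and toy_offset() is a hypothetical helper that mirrors only the offset arithmetic shown above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "seq1 is before seq2" for 32-bit sequence numbers. */
bool toy_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/* Offset of the read position inside a segment that starts at skb_seq.
 * A SYN occupies one sequence number but carries no data byte.
 */
uint32_t toy_offset(uint32_t copied_seq, uint32_t skb_seq, bool syn)
{
	uint32_t offset = copied_seq - skb_seq;

	if (syn)
		offset--;
	return offset;
}
```

With copied 3F1199D7 and seq 3F1199D7, toy_before() is false, so the first WARN is skipped, and the offset is 0 for a segment without SYN. Reaching the second WARN then requires that the offset is not below skb->len and that FIN is not set, which for offset 0 would suggest a zero-length segment at the head of the queue; that reading is an inference from the sketch, not something the report confirms. rcvnxt is 154 bytes ahead of the read position.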

* Re: [PATCH 1/3] TIPC: Removing EXPERIMENTAL label
From: David Miller @ 2012-05-25 20:24 UTC (permalink / raw)
  To: paul.gortmaker; +Cc: jon.maloy, netdev, tipc-discussion, allan.stephens, maloy
In-Reply-To: <20120525190506.GB25102@windriver.com>

From: Paul Gortmaker <paul.gortmaker@windriver.com>
Date: Fri, 25 May 2012 15:05:06 -0400

> OK, what I'm hearing is that you'd prefer I continue to collect up TIPC
> patches and issue pull requests for a while longer.  I can do that.  Any
> specifics of how you'd like things done? -- e.g. if the reviews of new
> TIPC development patches take place here on netdev before I stage them,
> will that create extra work for you dealing with them in patchworks?

When you want the TIPC patches reviewed here you can simply post the
set, and since you're informing me how this will work I'll know that
you'll send me a pull request later, and therefore I can mark the
patches as "Awaiting Upstream" or similar in patchwork.

^ permalink raw reply

* [PATCH] net: compute a more reasonable default ip6_rt_max_size
From: Arun Sharma @ 2012-05-25 20:15 UTC (permalink / raw)
  To: netdev; +Cc: Arun Sharma, linux-kernel, David Miller

The algorithm is based on ipv4 and alloc_large_system_hash().

The following data is from an x86_64 box I tested:

128MB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
16384
22444

512MB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
65536
99856

1GB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
524288
203068

2GB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
1048576
524288

4GB
$ cat /proc/sys/net/ipv{4,6}/route/max_size
2097152
524288

Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: David Miller <davem@davemloft.net>
---
 net/ipv6/route.c |   21 ++++++++++++++++++++-
 1 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 49d6ce1..c89ebbb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2827,6 +2827,16 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
 }
 #endif
 
+static __initdata unsigned long ip6_rt_entries;
+static int __init set_rt_entries(char *str)
+{
+	if (!str)
+		return 0;
+	ip6_rt_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("ip6_rt_entries=", set_rt_entries);
+
 static int __net_init ip6_route_net_init(struct net *net)
 {
 	int ret = -ENOMEM;
@@ -2872,8 +2882,17 @@ static int __net_init ip6_route_net_init(struct net *net)
 			 ip6_template_metrics, true);
 #endif
 
+	/* Compute a reasonable default based on what we do for ipv4
+	 * total size = 1/16th of total RAM
+	 * No more than 512k entries unless overridden on kernel cmdline */
+        if (ip6_rt_entries == 0) {
+		ip6_rt_entries = (totalram_pages << PAGE_SHIFT) >> 4;
+		ip6_rt_entries /= sizeof(struct rt6_info);
+		ip6_rt_entries = min(512 * 1024UL, ip6_rt_entries);
+        }
+
 	net->ipv6.sysctl.flush_delay = 0;
-	net->ipv6.sysctl.ip6_rt_max_size = 4096;
+	net->ipv6.sysctl.ip6_rt_max_size = ip6_rt_entries;
 	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
 	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
 	net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
-- 
1.7.8.4

^ permalink raw reply related

* Re: [PATCH] gianfar:don't add FCB length to hard_header_len
From: Paul Gortmaker @ 2012-05-25 20:04 UTC (permalink / raw)
  To: Joe Perches; +Cc: Jan Ceuleers, David Miller, b06378, netdev, linuxppc-dev
In-Reply-To: <1337975465.30100.4.camel@joe2Laptop>

On 12-05-25 03:51 PM, Joe Perches wrote:
> On Fri, 2012-05-25 at 11:58 -0400, Paul Gortmaker wrote:
>> But you really shouldn't need the hardware to validate this kind of
>> patch anyways -- aside from your code flow change in the irq routine of
>> gianfar_ptp, you should have been simply able to check for object file
>> equivalence before and after your change.
> 
> No cross compiler either, and I'm lazy 'bout that...

Can't get much easier than using one of these:

http://www.kernel.org/pub/tools/crosstool/

Just untar, export PATH ARCH CROSS_COMPILE and go.

Can't get much lazier than that. Great to have around.

Paul.

> 
> cheers, Joe
> 

^ permalink raw reply

* Re: [PATCH] gianfar:don't add FCB length to hard_header_len
From: Joe Perches @ 2012-05-25 19:51 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: Jan Ceuleers, David Miller, b06378, netdev, linuxppc-dev
In-Reply-To: <20120525155820.GA25102@windriver.com>

On Fri, 2012-05-25 at 11:58 -0400, Paul Gortmaker wrote:
> But you really shouldn't need the hardware to validate this kind of
> patch anyways -- aside from your code flow change in the irq routine of
> gianfar_ptp, you should have been simply able to check for object file
> equivalence before and after your change.

No cross compiler either, and I'm lazy 'bout that...

cheers, Joe

^ permalink raw reply

* Re: [PATCH 1/3] TIPC: Removing EXPERIMENTAL label
From: Paul Gortmaker @ 2012-05-25 19:05 UTC (permalink / raw)
  To: David Miller
  Cc: jon.maloy, netdev, tipc-discussion, ying.xue, erik.hugne,
	allan.stephens, maloy
In-Reply-To: <20120524.161231.1058511318935925082.davem@davemloft.net>

[Re: [PATCH 1/3] TIPC: Removing EXPERIMENTAL label] On 24/05/2012 (Thu 16:12) David Miller wrote:

> From: Paul Gortmaker <paul.gortmaker@windriver.com>
> Date: Thu, 24 May 2012 15:58:16 -0400
> 
> > But for new TIPC development features, future direction, and things like
> > that -- making the right call requires intimate understanding of TIPC
> > and its users, which is something that a maintainer should have but
> > something I know I don't have.  (A man has to know his limitations.)
> > 
> > In this context, I'm not talking about these three trivial patches; but
> > more complicated stuff that I imagine will be floated in the future.
> > 
> > To that end, I can still review and call out issues in a crap patch when
> > I see them.  But I'd like to see new stuff sent to netdev, so that folks
> > smarter than me have a chance to catch when a patch appears generally OK
> > but is architecturally the wrong direction etc.
> 
> For maintainership, taste is more important than deep knowledge of the
> specific technology.  Worst case you ask the submitter to explain the
> background of their change more thoroughly and that information is an
> absolute requirement in the commit message and code comments
> anyways.

OK, what I'm hearing is that you'd prefer I continue to collect up TIPC
patches and issue pull requests for a while longer.  I can do that.  Any
specifics of how you'd like things done? -- e.g. if the reviews of new
TIPC development patches take place here on netdev before I stage them,
will that create extra work for you dealing with them in patchworks?

Paul.

^ permalink raw reply

* Re: Using jiffies for tcp_time_stamp?
From: Eric Dumazet @ 2012-05-25 19:00 UTC (permalink / raw)
  To: Srećko Jurić-Kavelj; +Cc: Dave Taht, Chris Friesen, netdev
In-Reply-To: <CAACrLC0eVZJdu802Ff4BWRBKuabS=Z-T3WvZCxyqBTem8-n8ug@mail.gmail.com>

On Fri, 2012-05-25 at 20:35 +0200, Srećko Jurić-Kavelj wrote:

> From what I've seen in the code, NO_HZ doesn't make jiffies go away,
> it simply doesn't use a regular CONFIG_HZ interrupt to update them, but
> updates them when it has an opportunity?

HZ=1000 makes jiffies 10 times more precise than HZ=100 and, with NO_HZ,
generates no extra timer interrupts.

This also smooths the timer workload instead of producing spikes.

^ permalink raw reply
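
The "10 times more precise" point is simple arithmetic on the tick length. The sketch below is a userspace illustration with made-up helper names (toy_ms_per_jiffy, toy_quantize_ms); it assumes HZ divides 1000 evenly and that a measurement starts on a tick boundary, which real measurements do not guarantee.

```c
/* Milliseconds covered by one jiffy for a given HZ (HZ must divide 1000). */
unsigned int toy_ms_per_jiffy(unsigned int hz)
{
	return 1000 / hz;
}

/* An interval in milliseconds as a jiffies-based clock would report it:
 * rounded down to a whole number of ticks.
 */
unsigned int toy_quantize_ms(unsigned int interval_ms, unsigned int hz)
{
	unsigned int tick = toy_ms_per_jiffy(hz);

	return (interval_ms / tick) * tick;
}
```

A 7 ms round trip is reported as 0 ms at HZ=100 (10 ms ticks) and as 7 ms at HZ=1000 (1 ms ticks), which is why a jiffies-based tcp_time_stamp is coarse on a HZ=100 kernel.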

* Re: Using jiffies for tcp_time_stamp?
From: Srećko Jurić-Kavelj @ 2012-05-25 18:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Dave Taht, Chris Friesen, netdev
In-Reply-To: <1337964880.3347.52.camel@edumazet-glaptop>

On Fri, May 25, 2012 at 6:54 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> linux TCP uses high precision timestamps (ktime_get_real()) where
> needed.
>
> # find net|xargs grep -n TCP_CONG_RTT_STAMP
> net/ipv4/tcp_veno.c:205:        .flags          = TCP_CONG_RTT_STAMP,
> net/ipv4/tcp_vegas.c:308:       .flags          = TCP_CONG_RTT_STAMP,
> net/ipv4/tcp_cubic.c:478:               cubictcp.flags |= TCP_CONG_RTT_STAMP;
> net/ipv4/tcp_output.c:815:      if (icsk->icsk_ca_ops->flags & TCP_CONG_RTT_STAMP)
> net/ipv4/tcp_lp.c:317:  .flags = TCP_CONG_RTT_STAMP,
> net/ipv4/tcp_yeah.c:229:        .flags          = TCP_CONG_RTT_STAMP,
> net/ipv4/tcp_illinois.c:326:    .flags          = TCP_CONG_RTT_STAMP,
> net/ipv4/tcp_input.c:3496:                              if (ca_ops->flags & TCP_CONG_RTT_STAMP &&

Didn't know about TCP_CONG_RTT_STAMP.

Thing is, the device I'm connecting to doesn't even support TCP time
stamp option. The returning SYN ACK packet only has maximum segment
size 1460 bytes in options.

From the net/ipv4/tcp_input.c code, RTT is estimated using

#define tcp_time_stamp          ((__u32)(jiffies))

from include/net/tcp.h.

Could ktime_get_real() be used for tcp_time_stamp instead of jiffies?

> Other than that HZ=1000 seems fine.
>
> HZ=100 seems a poor choice, we have had NO_HZ for a long time.

I have:
$ grep HZ /boot/config-2.6.32-41-generic
CONFIG_NO_HZ=y
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
CONFIG_MACHZ_WDT=m

From what I've seen in the code, NO_HZ doesn't make jiffies go away,
it simply doesn't use a regular CONFIG_HZ interrupt to update them, but
updates them when it has an opportunity?

--
JKS

^ permalink raw reply

* Inadvertently sending a Christmas Tree TCP packet
From: Earl Chew @ 2012-05-25 18:30 UTC (permalink / raw)
  To: netdev

Does anyone have a reference to any discussions or patches that address this issue ?

Running a userspace daemon on a rather old 2.6.18 system can inadvertently produce a TCP
packet containing flags FIN, PSH, ACK and URG (see packet 16237), which can cause the receiver
(not Linux in this case) to become confused:

16220	111.075627	10.64.33.43	10.128.163.100	TCP	59253 > exec [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=2
16222	0.203210	10.128.163.100	10.64.33.43	TCP	exec > 59253 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1250 WS=7
16223	0.000032	10.64.33.43	10.128.163.100	TCP	59253 > exec [ACK] Seq=1 Ack=1 Win=65532 Len=0
16224	0.000215	10.64.33.43	10.128.163.100	TCP	59253 > exec [PSH, ACK] Seq=1 Ack=1 Win=65532 Len=6
16225	0.202465	10.128.163.100	10.64.33.43	TCP	exec > 59253 [ACK] Seq=1 Ack=7 Win=5888 Len=0
16229	0.209383	10.64.33.43	10.128.163.100	TCP	59253 > exec [PSH, ACK] Seq=7 Ack=1 Win=65532 Len=9
16231	0.202573	10.128.163.100	10.64.33.43	TCP	exec > 59253 [ACK] Seq=1 Ack=16 Win=5888 Len=0
16232	0.000024	10.64.33.43	10.128.163.100	TCP	59253 > exec [PSH, ACK] Seq=16 Ack=1 Win=65532 Len=14
16233	0.202618	10.128.163.100	10.64.33.43	TCP	exec > 59253 [ACK] Seq=1 Ack=30 Win=5888 Len=0
16234	0.012718	10.128.163.100	10.64.33.43	TCP	exec > 59253 [PSH, ACK] Seq=1 Ack=30 Win=5888 Len=1
16235	0.101229	10.128.163.100	10.64.33.43	TCP	exec > 59253 [PSH, ACK] Seq=2 Ack=30 Win=5888 Len=29
16236	0.000032	10.64.33.43	10.128.163.100	TCP	59253 > exec [ACK] Seq=30 Ack=31 Win=65504 Len=0
16237	0.000319	10.128.163.100	10.64.33.43	TCP	exec > 59253 [FIN, PSH, ACK, URG] Seq=31 Ack=30 Win=5888 Urg=1 Len=1
16240	1.114085	10.128.163.100	10.64.33.43	TCP	[TCP Retransmission] exec > 59253 [FIN, PSH, ACK, URG] Seq=31 Ack=30 Win=5888 Urg=1 Len=1


The receiver has become confused, and so the Linux sender retransmits at packet 16240, and continues retransmitting.
In this case, the application code at the receiver is blocked indefinitely trying to read a socket that seemingly
has (URG) data and yet at the same time doesn't have any more data (FIN).

Perhaps the makings of a DoS attack?


Earl

^ permalink raw reply

* Re: [PATCH 05/21] vswitchd: Add add_tunnel_ports()
From: Ben Pfaff @ 2012-05-25 17:18 UTC (permalink / raw)
  To: Simon Horman; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1337850554-10339-6-git-send-email-horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org>

On Thu, May 24, 2012 at 06:08:58PM +0900, Simon Horman wrote:
> Add tunnel tundevs for tunnel realdevs as needed.
> 
> In general the notion is that realdevs may be configured by users
> and from an end-user point of view are compatible with the existing
> port-based tunneling code. And that tundevs exist in the datapath
> and are actually used to send and receive packets, based on flows.
> 
> Cc: Kyle Mestery <kmestery-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Simon Horman <horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org>

This seems reasonable at a glance.  There are bits I might quibble
with as this gets closer, but the structure seems reasonable.

^ permalink raw reply

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Eric Dumazet @ 2012-05-25 17:18 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Denys Fedoryshchenko, netdev
In-Reply-To: <CA+mtBx8fpS_w7X2iq+tOEFuFhhgJ-4V_1ubBsd8-MvvmCef45w@mail.gmail.com>

On Fri, 2012-05-25 at 09:59 -0700, Tom Herbert wrote:
> > TX completion has no budget, I am not sure what you mean.
> >
> Right, it's the budget on RX that can be a factor.  TX completion is
> done from the NAPI poll routine, so if RX does not complete the NAPI
> is rescheduled and TX completion is done again for the same HW
> interrupt.
> 
> We need to remove the constraint that netdev_completed can only be
> called once per interrupt...

It is not clear where this constraint is in the code.

Under heavy load, we can end up in a loop, with one cpu serving NAPI
for a bunch of devices (no more hardware interrupts are delivered since
we don't re-enable them at all)

^ permalink raw reply

* Re: [RFC] mac80211: Use correct originator sequence number in a Path Reply
From: Qasim Javed @ 2012-05-25 17:17 UTC (permalink / raw)
  To: Javier Cardona
  Cc: devel-ZwoEplunGu1xMJw8dq7oimD2FQJk+8+b,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, ravip-DNmUmOh1Rg72fBVCVOL8/A
In-Reply-To: <CAEFj987dNMyMcS9rySzVpfY0Fo5t0LtL9FZg770ohKi+bDO9ZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Forgot to add Javier. Could you please comment on this?

Thanks,
-Qasim

On Fri, May 25, 2012 at 11:31 AM, Yeoh Chun-Yeow <yeohchunyeow-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> I think that the PREQ element has originator sequence number at the
>> end but the PREQ element has the target sequence number at the end.
>> This is what mesh_path_sel_frame_tx is doing.
>
> PREP element has originator sequence number at the end but the PREQ
> element has the target sequence number at the end.
>
> Regards,
> Chun-Yeow
> _______________________________________________
> Devel mailing list
> Devel-ZwoEplunGu1xMJw8dq7oimD2FQJk+8+b@public.gmane.org
> http://lists.open80211s.org/cgi-bin/mailman/listinfo/devel

^ permalink raw reply

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Tom Herbert @ 2012-05-25 16:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Denys Fedoryshchenko, netdev
In-Reply-To: <1337926938.7753.8.camel@edumazet-glaptop>

> TX completion has no budget, I am not sure what you mean.
>
Right, it's the budget on RX that can be a factor.  TX completion is
done from the NAPI poll routine, so if RX does not complete the NAPI
is rescheduled and TX completion is done again for the same HW
interrupt.

We need to remove the constraint that netdev_completed can only be
called once per interrupt...

Tom

> e1000e driver indeed has a limit : It cannot clean more than
> tx_ring->count frames per e1000_clean_tx_irq() invocation.
>
> But with BQL, this should not happen ?
>
> # ethtool -g eth0
> Ring parameters for eth0:
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       0
> TX:             4096
> Current hardware settings:
> RX:             256
> RX Mini:        0
> RX Jumbo:       0
> TX:             256
>
>

^ permalink raw reply
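
The interaction described above (TX completion running once per poll, and the poll being rescheduled whenever RX uses up its budget) can be counted with a toy model. This is a sketch under assumptions: toy_polls_per_interrupt() is a hypothetical function, and it models only "reschedule when the RX work equals the budget", not the driver or NAPI code itself.

```c
/* Number of poll invocations, and therefore TX completion passes, that
 * follow one hardware interrupt when rx_pending packets are waiting and
 * each poll may process at most budget of them.
 */
unsigned int toy_polls_per_interrupt(unsigned int rx_pending,
				     unsigned int budget)
{
	unsigned int polls = 0;
	unsigned int work;

	if (budget == 0)
		return 0;	/* a zero budget would never terminate */

	do {
		polls++;	/* TX completion runs once in every poll */
		work = rx_pending < budget ? rx_pending : budget;
		rx_pending -= work;
	} while (work == budget);	/* budget used up: poll again */

	return polls;
}
```

With a budget of 64, ten pending RX packets mean a single poll, while 150 pending packets mean three polls and so three TX completion passes for the same interrupt.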

* Re: Using jiffies for tcp_time_stamp?
From: Eric Dumazet @ 2012-05-25 16:54 UTC (permalink / raw)
  To: Srećko Jurić-Kavelj; +Cc: Dave Taht, Chris Friesen, netdev
In-Reply-To: <CAACrLC0b5dyjJM=DGf-9nUOwar3O9EVTTR0tvynQj285EDpfwA@mail.gmail.com>

On Fri, 2012-05-25 at 18:23 +0200, Srećko Jurić-Kavelj wrote:
> On Fri, May 25, 2012 at 6:17 PM, Dave Taht <dave.taht@gmail.com> wrote:
> > On Fri, May 25, 2012 at 4:58 PM, Chris Friesen
> > <chris.friesen@genband.com> wrote:
> >> I don't know if it would make any difference to the tcp algorithms, but
> >> certainly on some architectures you can get a fast and accurate hardware
> >> timestamp.
> >
> > I would be interested in someone doing that experiment in light of the
> > codel work.
> 
> I've looked this up in other implementations, e.g. FreeBSD uses 1ms
> granularity no matter what HZ says, NetBSD has 500ms ticks, ...
> 
> I guess that granularity also depends on the retransmit timers used. I
> didn't make out what's the precision of the timers that Linux uses in
> TCP, but I guess it uses high resolution timers? At least on x86?
> 
> I've done a simple experiment by repeatedly calling clock_gettime
> (from userspace, but I guess it ends up as a vsyscall). I get >17
> million calls per second on a Q6600.

linux TCP uses high precision timestamps (ktime_get_real()) where
needed.

# find net|xargs grep -n TCP_CONG_RTT_STAMP
net/ipv4/tcp_veno.c:205:	.flags		= TCP_CONG_RTT_STAMP,
net/ipv4/tcp_vegas.c:308:	.flags		= TCP_CONG_RTT_STAMP,
net/ipv4/tcp_cubic.c:478:		cubictcp.flags |= TCP_CONG_RTT_STAMP;
net/ipv4/tcp_output.c:815:	if (icsk->icsk_ca_ops->flags & TCP_CONG_RTT_STAMP)
net/ipv4/tcp_lp.c:317:	.flags = TCP_CONG_RTT_STAMP,
net/ipv4/tcp_yeah.c:229:	.flags		= TCP_CONG_RTT_STAMP,
net/ipv4/tcp_illinois.c:326:	.flags		= TCP_CONG_RTT_STAMP,
net/ipv4/tcp_input.c:3496:				if (ca_ops->flags & TCP_CONG_RTT_STAMP &&

Other than that HZ=1000 seems fine.

HZ=100 seems a poor choice, we have had NO_HZ for a long time.

^ permalink raw reply
