Netdev List
* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: David Miller @ 2011-05-19  4:14 UTC (permalink / raw)
  To: tsunanet
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, hagen, eric.dumazet,
	alexander.zimmermann, netdev, linux-kernel
In-Reply-To: <BANLkTinEN_=jSCw4qR1PqtWaQ+07OMq7tg@mail.gmail.com>

From: tsuna <tsunanet@gmail.com>
Date: Wed, 18 May 2011 20:56:33 -0700

> On Wed, May 18, 2011 at 7:36 PM, David Miller <davem@davemloft.net> wrote:
>> From: Benoit Sigoure <tsunanet@gmail.com>
>> Date: Wed, 18 May 2011 19:22:24 -0700
>>
>>> Prior to this patch, Linux would always use 3 seconds (compile-time
>>> constant) as the initial RTO.  Draft RFC 2988bis-02 proposes to tune
>>> this down to 1 second and, in case of a timeout during the TCP 3WHS,
>>> revert the RTO back up to 3 seconds when data transmission begins.
>>
>> We just had a discussion where it was determined that changes to
>> these settings are "network specific" and therefore that if it
>> is appropriate at all (I'm still not convinced) it is only suitable
>> as a routing metric.
> 
> Fair enough.  I'll take another stab at it and see if I can change
> this to be on a per network basis.  Do I need any patch that's not yet
> in Linus' tree?  I'm referring to this:

Keep in mind another thing I do not like about this knob.

The IETF draft has a requirement that we fall back to 3 seconds if the
initial RTO is 1 second.

Nothing in your facilities ensures this, or provides a way for the
kernel to make sure this is the case.

And for other values of initial RTO, what fallback is appropriate?

As a result of all of this, I do not really think this is something
the user should control at all.

I really would rather see the initial RTO be static and set to 1
second, with a fallback RTO of 3 seconds.
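
The static policy described above (a 1-second initial RTO that falls
back to 3 seconds when the handshake needed a retransmission) could be
sketched roughly as follows. This is a standalone illustration under
stated assumptions, not kernel code: the constant and function names are
hypothetical, and times are in milliseconds rather than jiffies.

```c
#include <stdbool.h>

/* Hypothetical constants for illustration only; the kernel expresses
 * its timeouts in jiffies, not milliseconds. */
enum {
    INIT_RTO_MS          = 1000, /* initial RTO proposed by the draft */
    INIT_RTO_FALLBACK_MS = 3000  /* RTO after a timeout during the 3WHS */
};

/* Pick the RTO to use once data transmission begins. */
static unsigned int initial_rto_ms(bool handshake_timed_out)
{
    return handshake_timed_out ? INIT_RTO_FALLBACK_MS : INIT_RTO_MS;
}
```

With both values fixed at compile time, the pairing required by the
draft cannot be broken by configuration.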


* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: tsuna @ 2011-05-19  4:33 UTC (permalink / raw)
  To: David Miller
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, hagen, eric.dumazet,
	alexander.zimmermann, netdev, linux-kernel
In-Reply-To: <20110519.001426.2119532755281545481.davem@davemloft.net>

On Wed, May 18, 2011 at 9:14 PM, David Miller <davem@davemloft.net> wrote:
> The IETF draft has a requirement that we fallback to 3 seconds if the
> initial RTO is 1 second.
>
> Nothing in your facilities ensure this, or provide a way for the
> kernel to make sure this is the case.

I'm not sure I understand what you're saying.  If tcp_initial_rto = 1000
and tcp_initial_fallback_rto = 3000, then you get exactly the behavior
the draft describes.  The knobs simply allow you to either revert to
today's behavior or use other settings that would make more sense in
your environment (e.g. very high RTT).  Are you concerned about cases
where, say, tcp_initial_fallback_rto < tcp_initial_rto?

> And for other values of initial RTO, what fallback is appropriate?

Presumably if the user decides to tweak these knobs, they'll know
what's appropriate for their environment.  Or are you suggesting that
one value be derived from the other?  (e.g. tcp_initial_fallback_rto =
3 * tcp_initial_rto)

> As a result of all of this, I do not really think this is something
> the user should control at all.
>
> I really would rather see the initial RTO be static and be set to 1
> with fallback RTO of 3.

I can also provide a simple patch for this if you want to start from
there.  And then maybe we can discuss having a runtime knob some more
:-)

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com


* Re: [Patch net-next-2.6] netpoll: disable netpoll when enslave a device
From: Cong Wang @ 2011-05-19  5:13 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, Neil Horman, Jay Vosburgh, Andy Gospodarek,
	David S. Miller, Alexey Dobriyan, Ferenc Wagner, Andrew Morton,
	Paul E. McKenney, Josh Triplett, Ian Campbell, netdev
In-Reply-To: <20110518105558.GA3203@hmsreliant.think-freely.org>

On 2011-05-18 18:56, Neil Horman wrote:
> On Wed, May 18, 2011 at 06:00:35PM +0800, Amerigo Wang wrote:
...
>> -			case NETDEV_GOING_DOWN:
>>   			case NETDEV_BONDING_DESLAVE:
>> +			case NETDEV_ENSLAVE:
>>   				nt->enabled = 0;
>>   				stopped = true;
>>   				break;
> This wasn't introduced by this patch, but looking at it made me realize that
> nt->enabled, if it passes through this code path, doesn't properly track whether
> or not netpoll_setup has been called on this interface.  If you look at
> drop_netconsole_target, you'll see we only call netpoll_cleanup_target if
> nt->enabled is set.  We should probably change the nt->enabled check there, and
> in store_enabled to be if (nt->np.dev), like we do in the NETDEV_UNREGISTER case
> in netconsole_netdev_event.

Yeah, also note that we can change ->enabled via configfs too.
I guess we probably need to fix this in another patch...


>> +#define NETDEV_ENSLAVE		0x0014
>>
> Nit:
> Shouldn't this be NETDEV_BONDING_ENSLAVE, to keep it in line with
> NETDEV_BONDING_DESLAVE above?

Actually that was my first thought, but I plan to use this in the
bridge case too, because using netconsole on a device underlying a
bridge makes little sense either. Thus, I prefer NETDEV_ENSLAVE to
NETDEV_BONDING_ENSLAVE.

>
>>   #define SYS_DOWN	0x0001	/* Notify of system down */
>>   #define SYS_RESTART	SYS_DOWN
>>
>
>
> Other than those two points, this looks good to me

Thanks for the review.


* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Eric Dumazet @ 2011-05-19  5:19 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev, David Miller
In-Reply-To: <1305746989.3019.0.camel@edumazet-laptop>

On Wednesday, 18 May 2011 at 21:29 +0200, Eric Dumazet wrote:
> On Wednesday, 18 May 2011 at 17:52 +0200, Eric Dumazet wrote:
> 
> > Hmm, it seems we have some inetpeer refcount leak somewhere.
> > 
> > Maybe one (struct rtable)->peer is not released on dst/rtable removal,
> > or we also leak dst/rtable (and their ->peer inetpeer)
> > 
> > Watch :
> > 
> > grep peer /proc/slabinfo
> > grep dst /proc/slabinfo
> > 
> 
> FYI, I started a bisection to find the faulty commit.
> 

Oh well, of course the bisection pointed to 2c8cec5c10bced240
(ipv4: Cache learned PMTU information in inetpeer.)

So my method of checking for a leak might be wrong, since the above
commit leaves the cache full of garbage and hopes that subsequent
lookups will find and evict obsolete dst entries.

That's getting difficult :(

Could you please send us

grep . /proc/sys/net/ipv4/route/*

Thanks !




* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: David Miller @ 2011-05-19  5:46 UTC (permalink / raw)
  To: tsunanet
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, hagen, eric.dumazet,
	alexander.zimmermann, netdev, linux-kernel
In-Reply-To: <BANLkTikQEq9+YkJHcTe3PWnRvh7AN=VVWA@mail.gmail.com>

From: tsuna <tsunanet@gmail.com>
Date: Wed, 18 May 2011 21:33:21 -0700

> On Wed, May 18, 2011 at 9:14 PM, David Miller <davem@davemloft.net> wrote:
>> I really would rather see the initial RTO be static and be set to 1
>> with fallback RTO of 3.
> 
> I can also provide a simple patch for this if you want to start from
> there.  And then maybe we can discuss having a runtime knob some more
> :-)

Yeah why don't we do that :-)


* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: Alexander Zimmermann @ 2011-05-19  6:10 UTC (permalink / raw)
  To: tsuna
  Cc: David Miller, kuznet, pekkas, jmorris, yoshfuji, kaber, hagen,
	eric.dumazet, netdev, linux-kernel
In-Reply-To: <BANLkTikQEq9+YkJHcTe3PWnRvh7AN=VVWA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1115 bytes --]

Hi,

On 19.05.2011 at 06:33, tsuna wrote:

> Presumably if the user decides to tweak these knobs, they'll know
> what's appropriate for their environment.

Are you sure? I'm not. I fully agree with David that minRTO is
something that a user should not control at all.

> Or are you suggesting that
> one value be derived from the other?  (e.g. tcp_initial_fallback_rto =
> 3 * tcp_initial_rto)
> 
>> As a result of all of this, I do not really think this is something
>> the user should control at all.
>> 
>> I really would rather see the initial RTO be static and be set to 1
>> with fallback RTO of 3.
> 
> I can also provide a simple patch for this if you want to start from
> there.  And then maybe we can discuss having a runtime knob some more
> :-)
> 
> -- 
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//


[-- Attachment #2: Signed part of the message --]
[-- Type: application/pgp-signature, Size: 243 bytes --]


* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Denys Fedoryshchenko @ 2011-05-19  6:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller
In-Reply-To: <1305782397.3019.5.camel@edumazet-laptop>

 On Thu, 19 May 2011 07:19:57 +0200, Eric Dumazet wrote:
> On Wednesday, 18 May 2011 at 21:29 +0200, Eric Dumazet wrote:
>> On Wednesday, 18 May 2011 at 17:52 +0200, Eric Dumazet wrote:
>>
>> > Hmm, it seems we have some inetpeer refcount leak somewhere.
>> >
>> > Maybe one (struct rtable)->peer is not released on dst/rtable 
>> removal,
>> > or we also leak dst/rtable (and their ->peer inetpeer)
>> >
>> > Watch :
>> >
>> > grep peer /proc/slabinfo
>> > grep dst /proc/slabinfo
>> >
>>
>> FYI, I started a bisection to find the faulty commit.
>>
>
> Oh well, of course this came to 2c8cec5c10bced240
> (ipv4: Cache learned PMTU information in inetpeer.)
>
> So my method to check if we have a leak might be wrong, since the 
> above
> commit let cache full of garbage, and hope that following lookups 
> will
> find and evict obsolete dst.
>
> Thats getting difficult :(
>
> Could you please send us
>
> grep . /proc/sys/net/ipv4/route/*
>
> Thanks !
 NewNet-PPPoE ~ # grep . /proc/sys/net/ipv4/route/*
 /proc/sys/net/ipv4/route/error_burst:5000
 /proc/sys/net/ipv4/route/error_cost:1000
 grep: /proc/sys/net/ipv4/route/flush: Permission denied
 /proc/sys/net/ipv4/route/gc_elasticity:8
 /proc/sys/net/ipv4/route/gc_interval:60
 /proc/sys/net/ipv4/route/gc_min_interval:0
 /proc/sys/net/ipv4/route/gc_min_interval_ms:500
 /proc/sys/net/ipv4/route/gc_thresh:32768
 /proc/sys/net/ipv4/route/gc_timeout:300
 /proc/sys/net/ipv4/route/max_size:524288
 /proc/sys/net/ipv4/route/min_adv_mss:256
 /proc/sys/net/ipv4/route/min_pmtu:552
 /proc/sys/net/ipv4/route/mtu_expires:600
 /proc/sys/net/ipv4/route/redirect_load:20
 /proc/sys/net/ipv4/route/redirect_number:9
 /proc/sys/net/ipv4/route/redirect_silence:20480

 I think these are the defaults.

 PMTU is very relevant here, as this is PPPoE, with up to 2k interfaces
 terminated on this box.

 I don't know if it matters, but this rule is also in place:
 iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

 I can generate and post "ip route ls cache" and any other info.



* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: tsuna @ 2011-05-19  6:25 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: David Miller, kuznet, pekkas, jmorris, yoshfuji, kaber, hagen,
	eric.dumazet, netdev, linux-kernel
In-Reply-To: <9DC9A4D5-8E16-4361-B323-C92D563171A1@comsys.rwth-aachen.de>

On Wed, May 18, 2011 at 11:10 PM, Alexander Zimmermann
<alexander.zimmermann@comsys.rwth-aachen.de> wrote:
> On 19.05.2011 at 06:33, tsuna wrote:
>> Presumably if the user decides to tweak these knobs, they'll know
>> what's appropriate for their environment.
>
> Are you sure? I'm not. I fully agree with David that minRTO is

s/minRTO/initRTO/, right?

> something that a user should not control at all

I personally don't like to hold users' hands and spoon-feed them too
much; I want to trust them to be responsible and know what they're
doing.  Yes, there will always be people who will act stupid and do
stupid things with whatever knobs you expose.  The web is full of
people who advise tuning all the TCP rmem/wmem parameters up to crazy
high levels, based on the voodoo belief that this will improve their
TCP performance.  But as long as you have knobs in your system, these
people will misuse them anyway and shoot themselves in the foot.  What
can we do about that?

There's also a good chunk of people who know what they're doing, and
for them compile-time constants are annoying because it's inconvenient
to experiment and iterate quickly when you need to recompile your
kernel to change a value.  If turning the compile time constant into a
knob leaves the code reasonably straightforward and doesn't incur too
much overhead, then why not do it?

Regarding this knob in particular, I can imagine that people who are
in environments where the RTT easily reaches around 1s will be upset
by the change in the default value, and doubly upset that they have to
recompile their kernel to change the value back to 3s.  I'm in favor
of the reduction of initRTO, for the same reason Google is, but I can
also understand that the direction we're taking might not be
appropriate for everyone.

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com


* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Eric Dumazet @ 2011-05-19  6:30 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev, David Miller
In-Reply-To: <626ba8ae63cfc8fdb68c7f281463dc27@visp.net.lb>

On Thursday, 19 May 2011 at 09:11 +0300, Denys Fedoryshchenko wrote:
> On Thu, 19 May 2011 07:19:57 +0200, Eric Dumazet wrote:
> > On Wednesday, 18 May 2011 at 21:29 +0200, Eric Dumazet wrote:
> >> On Wednesday, 18 May 2011 at 17:52 +0200, Eric Dumazet wrote:
> >>
> >> > Hmm, it seems we have some inetpeer refcount leak somewhere.
> >> >
> >> > Maybe one (struct rtable)->peer is not released on dst/rtable 
> >> removal,
> >> > or we also leak dst/rtable (and their ->peer inetpeer)
> >> >
> >> > Watch :
> >> >
> >> > grep peer /proc/slabinfo
> >> > grep dst /proc/slabinfo
> >> >
> >>
> >> FYI, I started a bisection to find the faulty commit.
> >>
> >
> > Oh well, of course this came to 2c8cec5c10bced240
> > (ipv4: Cache learned PMTU information in inetpeer.)
> >
> > So my method to check if we have a leak might be wrong, since the 
> > above
> > commit let cache full of garbage, and hope that following lookups 
> > will
> > find and evict obsolete dst.
> >
> > Thats getting difficult :(
> >
> > Could you please send us
> >
> > grep . /proc/sys/net/ipv4/route/*
> >
> > Thanks !
>  NewNet-PPPoE ~ # grep . /proc/sys/net/ipv4/route/*
>  /proc/sys/net/ipv4/route/error_burst:5000
>  /proc/sys/net/ipv4/route/error_cost:1000
>  grep: /proc/sys/net/ipv4/route/flush: Permission denied
>  /proc/sys/net/ipv4/route/gc_elasticity:8
>  /proc/sys/net/ipv4/route/gc_interval:60
>  /proc/sys/net/ipv4/route/gc_min_interval:0
>  /proc/sys/net/ipv4/route/gc_min_interval_ms:500
>  /proc/sys/net/ipv4/route/gc_thresh:32768
>  /proc/sys/net/ipv4/route/gc_timeout:300
>  /proc/sys/net/ipv4/route/max_size:524288
>  /proc/sys/net/ipv4/route/min_adv_mss:256
>  /proc/sys/net/ipv4/route/min_pmtu:552
>  /proc/sys/net/ipv4/route/mtu_expires:600
>  /proc/sys/net/ipv4/route/redirect_load:20
>  /proc/sys/net/ipv4/route/redirect_number:9
>  /proc/sys/net/ipv4/route/redirect_silence:20480
> 
>  I think it is default one.
> 
>  PMTU is very actual for that, as it is pppoe, and up to 2k interfaces 
>  terminated there.
> 

Yes, and every time an interface is added, a new route is added and
the route cache is invalidated (we change rt_genid).
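
The idea of invalidating a whole cache by changing a generation ID can
be illustrated with a toy model. This is only a sketch, assuming a tiny
fixed-size table; every name below is hypothetical and none of it is the
kernel's route cache code. Entries remember the generation they were
created in, and a lookup treats any entry from an older generation as
stale and evicts it lazily.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Toy cache entry: remembers the generation it was created in. */
struct toy_route {
    unsigned int dest;
    unsigned int genid;
    bool in_use;
};

struct toy_cache {
    unsigned int genid;                 /* current generation */
    struct toy_route slot[TOY_SLOTS];
};

/* Invalidate every entry at once by moving to a new generation. */
static void toy_cache_invalidate(struct toy_cache *c)
{
    c->genid++;
}

/* Store an entry tagged with the current generation. */
static void toy_cache_insert(struct toy_cache *c, unsigned int dest)
{
    struct toy_route *r = &c->slot[dest % TOY_SLOTS];

    r->dest = dest;
    r->genid = c->genid;
    r->in_use = true;
}

/* A lookup only hits if the entry belongs to the current generation;
 * a stale entry is evicted when the lookup runs into it. */
static struct toy_route *toy_cache_lookup(struct toy_cache *c,
                                          unsigned int dest)
{
    struct toy_route *r = &c->slot[dest % TOY_SLOTS];

    if (!r->in_use || r->dest != dest)
        return NULL;
    if (r->genid != c->genid) {
        r->in_use = false;
        return NULL;
    }
    return r;
}
```

In this model invalidation is cheap, but stale entries linger until a
later lookup happens to touch them.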

>  I don't know, if it matters, but
>  iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS 
>  --clamp-mss-to-pmtu
>  also there.
> 
>  I can generate and put "ip route ls cache" and any other info.
> 

Hmm, would you please send:

rtstat -c10 -i1





* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Julian Anastasov @ 2011-05-19  6:33 UTC (permalink / raw)
  To: Simon Horman; +Cc: Dave Jones, netdev, Wensong Zhang, Hans Schillstrom
In-Reply-To: <20110519032611.GG16688@verge.net.au>


	Hello,

On Thu, 19 May 2011, Simon Horman wrote:

> > >  Call Trace:
> > >   [<ffffffff8107be36>] raw_notifier_chain_register+0xe/0x10
> > >   [<ffffffff81403058>] register_netdevice_notifier+0x2d/0x1b6
> > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > >   [<ffffffffa04322c7>] ip_vs_control_init+0xa5/0xce [ip_vs]
> > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > >   [<ffffffffa0432116>] ip_vs_init+0x10/0x11c [ip_vs]
> > >   [<ffffffff81002099>] do_one_initcall+0x7f/0x13a
> > >   [<ffffffff81096524>] sys_init_module+0x132/0x281
> > >   [<ffffffff814cc702>] system_call_fastpath+0x16/0x1b
> > >  Code: 07 ff c8 89 43 48 eb 08 48 89 df e8 dc 95 44 00 4c 89 e6 48 89 df e8 a7 a5 44 00 5b 41 5c 5d c3 55 48 89 e5 66 66 66 66 90 eb 0c <8b> 50 10 39 56 10 7f 0c 48 8d 78 08 48 8b 07 48 85 c0 75 ec 48 
> > >  RIP  [<ffffffff8107bddb>] notifier_chain_register+0xb/0x2a
> > >   RSP <ffff880114139e68>
> > >  ---[ end trace e90d7053ad1a7a5b ]---
> > > 
> > > 
> > > This script replicates the bug.
> > > (it usually oopses after just a few loops)
> > > 
> > > #!/bin/sh
> > > while [ 1 ];
> > > do
> > > 	modprobe ip_vs_ftp
> > > 	modprobe -r ip_vs_ftp
> > > done
> > > 
> > > Looks like something isn't getting cleaned up on module exit
> > > that we fall over when we encounter it next time it gets loaded ?
> > 
> > Thanks Dave, I will look into this.
> 
> Hi Dave,
> 
> I'm not having much luck reproducing this in KVM.
> I will try this evening on real hardware.
> 
> Just to make sure we are testing the same thing, are you using Linus's tree?

	A call to unregister_netdevice_notifier(&ip_vs_dst_notifier)
is definitely missing from ip_vs_control_cleanup.
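
Why a missing unregister matters can be shown with a toy notifier
chain. This is a userspace sketch with hypothetical names, not the
kernel's notifier implementation. The chain is a linked list of blocks
owned by their registrants, so a block that is never removed stays
linked after its owner is gone, and the next registration then has to
deal with a chain that still points at it.

```c
#include <stddef.h>

/* Toy notifier chain: a singly linked list of registered blocks. */
struct toy_notifier {
    struct toy_notifier *next;
};

static struct toy_notifier *toy_chain;

/* Link a block at the head of the chain. */
static void toy_register(struct toy_notifier *nb)
{
    nb->next = toy_chain;
    toy_chain = nb;
}

/* Unlink a block; a no-op if the block is not on the chain. */
static void toy_unregister(struct toy_notifier *nb)
{
    struct toy_notifier **p;

    for (p = &toy_chain; *p != NULL; p = &(*p)->next) {
        if (*p == nb) {
            *p = nb->next;
            nb->next = NULL;
            return;
        }
    }
}

/* Stand-in for a notifier block that lives in a module's memory. */
static struct toy_notifier toy_module_nb;

/* Init and cleanup must be symmetric: whatever init registers,
 * cleanup has to unregister before the module's memory goes away. */
static void toy_module_init(void)
{
    toy_register(&toy_module_nb);
}

static void toy_module_cleanup(void)
{
    toy_unregister(&toy_module_nb);
}
```

With the symmetric cleanup in place, repeated load and unload cycles
leave the chain empty each time.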

Regards

--
Julian Anastasov <ja@ssi.bg>


* 2.6.39-rc7-git11, x86/32, failed on ppp2897'th interface, PERCPU:  allocation failed
From: Denys Fedoryshchenko @ 2011-05-19  6:35 UTC (permalink / raw)
  To: netdev

 Hi again,

 I just tried to upgrade a large NAS from 2.6.38.6 to 2.6.39-rc7-git11,
 enabling IPv6 on it at the same time.
 I got the following after ppp2897 was brought up (which of course means
 the other 2896 are up as well, along with a few Ethernet VLANs, around 32).
 I am not sure it is a bug, but it looks like I had free memory (the box
 had 8GB free), and lowmem too. I will also try a 64-bit kernel there
 this evening.

 May 17 16:00:42 194.146.155.70 kernel: [14925.897799] PERCPU: allocation failed, size=2048 align=4, failed to allocate new chunk
 May 17 16:00:42 194.146.155.70 kernel: [14925.898163] Pid: 24207, comm: pppd Not tainted 2.6.39-rc7-git11-build-0058 #4
 May 17 16:00:42 194.146.155.70 kernel: [14925.898164] Call Trace:
 May 17 16:00:42 194.146.155.70 kernel: [14925.898169]  [<c0335548>] ? printk+0x18/0x20
 May 17 16:00:42 194.146.155.70 kernel: [14925.898173]  [<c017ecd0>] pcpu_alloc+0x616/0x67a
 May 17 16:00:42 194.146.155.70 kernel: [14925.898176]  [<c0194a80>] ? __kmalloc_track_caller+0x68/0xc0
 May 17 16:00:42 194.146.155.70 kernel: [14925.898189]  [<f8ae196c>] ? kzalloc+0xb/0xd [ipv6]
 May 17 16:00:42 194.146.155.70 kernel: [14925.898193]  [<c01320a5>] ? _local_bh_enable_ip.clone.6+0x18/0x71
 May 17 16:00:42 194.146.155.70 kernel: [14925.898195]  [<c017ed3e>] __alloc_percpu+0xa/0xc
 May 17 16:00:42 194.146.155.70 kernel: [14925.898198]  [<c030aa7d>] snmp_mib_init+0x2f/0x51
 May 17 16:00:42 194.146.155.70 kernel: [14925.898207]  [<f8ae2ad0>] ipv6_add_dev+0x133/0x2a3 [ipv6]
 May 17 16:00:42 194.146.155.70 kernel: [14925.898209]  [<c030e12d>] ? ip_mc_init_dev+0x75/0x86
 May 17 16:00:42 194.146.155.70 kernel: [14925.898211]  [<c0309321>] ? devinet_sysctl_register+0x34/0x38
 May 17 16:00:42 194.146.155.70 kernel: [14925.898221]  [<f8ae5754>] addrconf_notify+0x50/0x6a5 [ipv6]
 May 17 16:00:42 194.146.155.70 kernel: [14925.898224]  [<c0218f52>] ? add_uevent_var+0xa3/0xa3
 May 17 16:00:42 194.146.155.70 kernel: [14925.898226]  [<c0309901>] ? inetdev_event+0x55/0x3c0
 May 17 16:00:42 194.146.155.70 kernel: [14925.898230]  [<c01446f9>] notifier_call_chain+0x26/0x48
 May 17 16:00:42 194.146.155.70 kernel: [14925.898232]  [<c01447a7>] raw_notifier_call_chain+0x1a/0x1c
 May 17 16:00:42 194.146.155.70 kernel: [14925.898236]  [<c02c8115>] call_netdevice_notifiers+0x44/0x4b
 May 17 16:00:42 194.146.155.70 kernel: [14925.898238]  [<c01320a5>] ? _local_bh_enable_ip.clone.6+0x18/0x71
 May 17 16:00:42 194.146.155.70 kernel: [14925.898240]  [<c0132106>] ? local_bh_enable_ip+0x8/0xa
 May 17 16:00:42 194.146.155.70 kernel: [14925.898242]  [<c02ca19b>] register_netdevice+0x1fb/0x255
 May 17 16:00:42 194.146.155.70 kernel: [14925.898244]  [<c02ca227>] register_netdev+0x32/0x41
 May 17 16:00:42 194.146.155.70 kernel: [14925.898247]  [<c021d5cf>] ? sprintf+0x1c/0x1e
 May 17 16:00:42 194.146.155.70 kernel: [14925.898249]  [<c029647a>] ppp_ioctl+0x224/0xaea
 May 17 16:00:42 194.146.155.70 kernel: [14925.898252]  [<c01a35cc>] ? do_filp_open+0x26/0x67
 May 17 16:00:42 194.146.155.70 kernel: [14925.898254]  [<c0296256>] ? ppp_write+0x98/0x98
 May 17 16:00:42 194.146.155.70 kernel: [14925.898256]  [<c01a53ce>] do_vfs_ioctl+0x45e/0x498
 May 17 16:00:42 194.146.155.70 kernel: [14925.898258]  [<c01a118e>] ? getname_flags+0x1e/0xad
 May 17 16:00:42 194.146.155.70 kernel: [14925.898260]  [<c019391b>] ? kmem_cache_free+0x14/0x83
 May 17 16:00:42 194.146.155.70 kernel: [14925.898262]  [<c01ab5bb>] ? alloc_fd+0x4e/0xba
 May 17 16:00:42 194.146.155.70 kernel: [14925.898265]  [<c0199465>] ? do_sys_open+0xdb/0xe5
 May 17 16:00:42 194.146.155.70 kernel: [14925.898266]  [<c019ac7b>] ? fput+0x13/0x155
 May 17 16:00:42 194.146.155.70 kernel: [14925.898268]  [<c01a4387>] ? do_fcntl+0x227/0x3aa
 May 17 16:00:42 194.146.155.70 kernel: [14925.898270]  [<c01a543b>] sys_ioctl+0x33/0x4c
 May 17 16:00:42 194.146.155.70 kernel: [14925.898273]  [<c0336edd>] syscall_call+0x7/0xb


* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: Alexander Zimmermann @ 2011-05-19  6:36 UTC (permalink / raw)
  To: tsuna
  Cc: David Miller, kuznet, pekkas, jmorris, yoshfuji, kaber, hagen,
	eric.dumazet, netdev, linux-kernel
In-Reply-To: <BANLkTimVMYC5rzizaM+3dReG14obxGL=Bw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2625 bytes --]


On 19.05.2011 at 08:25, tsuna wrote:

> On Wed, May 18, 2011 at 11:10 PM, Alexander Zimmermann
> <alexander.zimmermann@comsys.rwth-aachen.de> wrote:
>> On 19.05.2011 at 06:33, tsuna wrote:
>>> Presumably if the user decides to tweak these knobs, they'll know
>>> what's appropriate for their environment.
>> 
>> Are you sure? I'm not. I fully agree with David that minRTO is
> 
> s/minRTO/initRTO/, right?

Yes of course :-)

> 
>> something that a user should not control at all
> 
> I personally don't like to hold the hand and spoon feed users too
> much, I want to trust them to be responsible and know what they're
> doing.  Yes, there will always be people who will act stupid and do
> stupid things with whatever knobs you expose.  The web is full of
> people who advise to tune up all the TCP rmem/wmem parameters to crazy
> high level based on the voodoo belief that they're going to improve
> their TCP performance, but then as long as you have knobs in your
> system, these people will misuse them anyway and shoot themselves in
> the foot, what can we do about that.

But if you tune rmem/wmem to crazy levels, it's only your own TCP
performance that suffers (and maybe the receiver's).

If you set the initRTO=0.1s, it's good for me but bad for the rest of the
world. That's the difference.

Or do you want to implement a lower bound of 1s so that you can ensure
that nobody sets the initRTO lower than 1s?
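
A lower bound of the kind asked about above could look roughly like
this. It is a sketch only: the macro and function names are
hypothetical, the value is in milliseconds rather than jiffies, and
nothing here is taken from an existing kernel interface.

```c
/* Hypothetical floor in milliseconds, for illustration only. */
#define INIT_RTO_FLOOR_MS 1000u

/* Return the requested initial RTO, raised to the floor when a
 * user-supplied value is smaller than the floor. */
static unsigned int clamp_init_rto_ms(unsigned int requested_ms)
{
    if (requested_ms < INIT_RTO_FLOOR_MS)
        return INIT_RTO_FLOOR_MS;
    return requested_ms;
}
```

Under this scheme a request for 100 ms would silently become 1000 ms,
while larger values such as 3000 ms would pass through unchanged.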


> 
> There's also a good chunk of people who know what they're doing, and
> for them compile-time constants are annoying because it's inconvenient
> to experiment and iterate quickly when you need to recompile your
> kernel to change a value.  If turning the compile time constant into a
> knob leaves the code reasonably straightforward and doesn't incur too
> much overhead, then why not do it?
> 
> Regarding this knob in particular, I can imagine that people who are
> in environment where RTT easily gets around 1s will be upset by the
> change in the default value, and doubly upset that they have to
> recompile their kernel to change the value back to 3s.  I'm in favor
> of the reduction of initRTO, for the same reason Google is, but I can
> also understand that the direction we're taking might not be
> appropriate for everyone.
> 
> -- 
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//


[-- Attachment #2: Signed part of the message --]
[-- Type: application/pgp-signature, Size: 243 bytes --]


* [PATCH] tcp: Lower the initial RTO to 1s as per draft RFC 2988bis-02.
From: Benoit Sigoure @ 2011-05-19  6:36 UTC (permalink / raw)
  To: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, hagen,
	eric.dumazet, alexander.zimmermann
  Cc: netdev, linux-kernel, Benoit Sigoure
In-Reply-To: <20110519.014656.1735519603194773578.davem@davemloft.net>

From: Benoit Sigoure <tsuna@stumbleupon.com>

Draft RFC 2988bis-02 recommends that the initial RTO be lowered
from 3 seconds down to 1 second, and that in case of a timeout
during the TCP 3WHS, the RTO should fall back to 3 seconds when
data transmission begins.
---

On Wed, May 18, 2011 at 10:46 PM, David Miller <davem@davemloft.net> wrote:
> From: tsuna <tsunanet@gmail.com>
> Date: Wed, 18 May 2011 21:33:21 -0700
>
>> On Wed, May 18, 2011 at 9:14 PM, David Miller <davem@davemloft.net> wrote:
>>> I really would rather see the initial RTO be static and be set to 1
>>> with fallback RTO of 3.
>>
>> I can also provide a simple patch for this if you want to start from
>> there.  And then maybe we can discuss having a runtime knob some more
>> :-)
>
> Yeah why don't we do that :-)

Alright, here we go.


 include/net/tcp.h    |    5 ++++-
 net/ipv4/tcp_input.c |   13 +++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..274d761 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -122,7 +122,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #endif
 #define TCP_RTO_MAX	((unsigned)(120*HZ))
 #define TCP_RTO_MIN	((unsigned)(HZ/5))
-#define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))	/* RFC 1122 initial RTO value	*/
+/* The next 2 values come from Draft RFC 2988bis-02. */
+#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))		/* initial RTO value	*/
+#define TCP_TIMEOUT_INIT_FALLBACK ((unsigned)(3*HZ))	/* initial RTO to fallback to when
+							 * a timeout happens during the 3WHS.	*/
 
 #define TCP_RESOURCE_PROBE_INTERVAL ((unsigned)(HZ/2U)) /* Maximal interval between probes
 					                 * for local resources.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bef9f04..a36bc35 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -868,6 +868,11 @@ static void tcp_init_metrics(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct dst_entry *dst = __sk_dst_get(sk);
+	/* If we had to retransmit anything during the 3WHS, use
+	 * the initial fallback RTO as per draft RFC 2988bis-02.
+	 */
+	int init_rto = inet_csk(sk)->icsk_retransmits ?
+		TCP_TIMEOUT_INIT_FALLBACK : TCP_TIMEOUT_INIT;
 
 	if (dst == NULL)
 		goto reset;
@@ -890,7 +895,7 @@ static void tcp_init_metrics(struct sock *sk)
 	if (dst_metric(dst, RTAX_RTT) == 0)
 		goto reset;
 
-	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
+	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (init_rto << 3))
 		goto reset;
 
 	/* Initial rtt is determined from SYN,SYN-ACK.
@@ -916,7 +921,7 @@ static void tcp_init_metrics(struct sock *sk)
 		tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
 	}
 	tcp_set_rto(sk);
-	if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
+	if (inet_csk(sk)->icsk_rto < init_rto && !tp->rx_opt.saw_tstamp) {
 reset:
 		/* Play conservative. If timestamps are not
 		 * supported, TCP will fail to recalculate correct
@@ -924,8 +929,8 @@ reset:
 		 */
 		if (!tp->rx_opt.saw_tstamp && tp->srtt) {
 			tp->srtt = 0;
-			tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
-			inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+			tp->mdev = tp->mdev_max = tp->rttvar = init_rto;
+			inet_csk(sk)->icsk_rto = init_rto;
 		}
 	}
 	tp->snd_cwnd = tcp_init_cwnd(tp, dst);
-- 
1.7.0.4


* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Denys Fedoryshchenko @ 2011-05-19  6:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller
In-Reply-To: <1305786623.3019.10.camel@edumazet-laptop>

 On Thu, 19 May 2011 08:30:23 +0200, Eric Dumazet wrote:
> On Thursday, 19 May 2011 at 09:11 +0300, Denys Fedoryshchenko wrote:
>> On Thu, 19 May 2011 07:19:57 +0200, Eric Dumazet wrote:
>> > On Wednesday, 18 May 2011 at 21:29 +0200, Eric Dumazet wrote:
>> >> On Wednesday, 18 May 2011 at 17:52 +0200, Eric Dumazet wrote:
>> >>
>> >> > Hmm, it seems we have some inetpeer refcount leak somewhere.
>> >> >
>> >> > Maybe one (struct rtable)->peer is not released on dst/rtable
>> >> removal,
>> >> > or we also leak dst/rtable (and their ->peer inetpeer)
>> >> >
>> >> > Watch :
>> >> >
>> >> > grep peer /proc/slabinfo
>> >> > grep dst /proc/slabinfo
>> >> >
>> >>
>> >> FYI, I started a bisection to find the faulty commit.
>> >>
>> >
>> > Oh well, of course this came to 2c8cec5c10bced240
>> > (ipv4: Cache learned PMTU information in inetpeer.)
>> >
>> > So my method to check if we have a leak might be wrong, since the
>> > above
>> > commit let cache full of garbage, and hope that following lookups
>> > will
>> > find and evict obsolete dst.
>> >
>> > Thats getting difficult :(
>> >
>> > Could you please send us
>> >
>> > grep . /proc/sys/net/ipv4/route/*
>> >
>> > Thanks !
>>  NewNet-PPPoE ~ # grep . /proc/sys/net/ipv4/route/*
>>  /proc/sys/net/ipv4/route/error_burst:5000
>>  /proc/sys/net/ipv4/route/error_cost:1000
>>  grep: /proc/sys/net/ipv4/route/flush: Permission denied
>>  /proc/sys/net/ipv4/route/gc_elasticity:8
>>  /proc/sys/net/ipv4/route/gc_interval:60
>>  /proc/sys/net/ipv4/route/gc_min_interval:0
>>  /proc/sys/net/ipv4/route/gc_min_interval_ms:500
>>  /proc/sys/net/ipv4/route/gc_thresh:32768
>>  /proc/sys/net/ipv4/route/gc_timeout:300
>>  /proc/sys/net/ipv4/route/max_size:524288
>>  /proc/sys/net/ipv4/route/min_adv_mss:256
>>  /proc/sys/net/ipv4/route/min_pmtu:552
>>  /proc/sys/net/ipv4/route/mtu_expires:600
>>  /proc/sys/net/ipv4/route/redirect_load:20
>>  /proc/sys/net/ipv4/route/redirect_number:9
>>  /proc/sys/net/ipv4/route/redirect_silence:20480
>>
>>  I think it is default one.
>>
>>  PMTU is very actual for that, as it is pppoe, and up to 2k 
>> interfaces
>>  terminated there.
>>
>
> Yes, and every time an interface is added -> new route added, route
> cache is invalidated (we change rt_genid)
 If it matters, there is an ifb device with a shaper on it (for shaping
 traffic from ppp to the outside world).

>
>>  I don't know if it matters, but
>>  iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
>>  is also there.
>>
>>  I can generate and post "ip route ls cache" and any other info.
>>
>
> Hmm would you please send :
>
> rtstat -c10 -i1
 Note: it is off-peak time now, with just 1447 interfaces; peak is in about
 12 hours.

 NewNet-PPPoE ~ # ./rtstat -c10 -i1
 rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
  entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
         |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
     2256|355568844|85929285|    1649|       9|   59954|     293|    1460|14423031| 6865540|       0|       0|       0|       0|       0|22719682| 1262044|
     3408|   14887|    2117|       0|       0|       1|       1|       0|     761|     159|       0|       0|       0|       0|       0|    1209|      46|
     3189|   17185|    5613|       0|       0|       1|       0|       0|     987|     334|       0|       0|       0|       0|       0|     684|      22|
     2698|   18312|    3417|       0|       0|       5|       0|       0|     923|     242|       0|       0|       0|       0|       0|     498|      10|
     4996|   17268|    3604|       0|       0|       1|       0|       0|     847|     240|       0|       0|       0|       0|       0|     830|      23|
     2457|   16439|    4227|       0|       0|       4|       0|       0|     663|     268|       0|       0|       0|       0|       0|     655|      22|
     4763|   16895|    3634|       0|       0|       1|       0|       0|     880|     266|       0|       0|       0|       0|       0|     896|      32|
     6299|   19169|    2220|       0|       0|       2|       0|       0|     898|     206|       0|       0|       0|       0|       0|    1213|      60|
     7511|   20059|    1597|       0|       0|       2|       1|       0|     855|     197|       0|       0|       0|       0|       0|    1917|      54|
     9271|   17731|    2919|       0|       0|       0|       0|       0|     855|     223|       0|       0|       0|       0|       0|    1664|     101|


^ permalink raw reply

* Re: 2.6.39-rc7-git11, x86/32, failed on ppp2897'th interface, PERCPU:  allocation failed
From: Eric Dumazet @ 2011-05-19  6:39 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev
In-Reply-To: <f9797bb034f650a24e927629d1ab77d8@visp.net.lb>

On Thursday 19 May 2011 at 09:35 +0300, Denys Fedoryshchenko wrote:
> Hi, again
> 
>  I just tried to upgrade a large NAS from 2.6.38.6 to 2.6.39-rc7-git11,
>  enabling ipv6 on it at the same time.
>  I got the following after ppp2897 was brought up (which means there are
>  2896 others available, and also a few ethernet vlans, around 32).
>  I am not sure it is a bug, but it looks like I had free memory (the box had
>  8GB free), and lowmem too. I will also try a 64bit kernel there this
>  evening.
> 
>  May 17 16:00:42 194.146.155.70 kernel: [14925.897799] PERCPU: allocation failed, size=2048 align=4, failed to allocate new chunk
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898163] Pid: 24207, comm: pppd Not tainted 2.6.39-rc7-git11-build-0058 #4
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898164] Call Trace:
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898169]  [<c0335548>] ? printk+0x18/0x20
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898173]  [<c017ecd0>] pcpu_alloc+0x616/0x67a
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898176]  [<c0194a80>] ? __kmalloc_track_caller+0x68/0xc0
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898189]  [<f8ae196c>] ? kzalloc+0xb/0xd [ipv6]
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898193]  [<c01320a5>] ? _local_bh_enable_ip.clone.6+0x18/0x71
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898195]  [<c017ed3e>] __alloc_percpu+0xa/0xc
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898198]  [<c030aa7d>] snmp_mib_init+0x2f/0x51
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898207]  [<f8ae2ad0>] ipv6_add_dev+0x133/0x2a3 [ipv6]
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898209]  [<c030e12d>] ? ip_mc_init_dev+0x75/0x86
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898211]  [<c0309321>] ? devinet_sysctl_register+0x34/0x38
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898221]  [<f8ae5754>] addrconf_notify+0x50/0x6a5 [ipv6]
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898224]  [<c0218f52>] ? add_uevent_var+0xa3/0xa3
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898226]  [<c0309901>] ? inetdev_event+0x55/0x3c0
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898230]  [<c01446f9>] notifier_call_chain+0x26/0x48
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898232]  [<c01447a7>] raw_notifier_call_chain+0x1a/0x1c
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898236]  [<c02c8115>] call_netdevice_notifiers+0x44/0x4b
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898238]  [<c01320a5>] ? _local_bh_enable_ip.clone.6+0x18/0x71
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898240]  [<c0132106>] ? local_bh_enable_ip+0x8/0xa
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898242]  [<c02ca19b>] register_netdevice+0x1fb/0x255
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898244]  [<c02ca227>] register_netdev+0x32/0x41
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898247]  [<c021d5cf>] ? sprintf+0x1c/0x1e
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898249]  [<c029647a>] ppp_ioctl+0x224/0xaea
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898252]  [<c01a35cc>] ? do_filp_open+0x26/0x67
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898254]  [<c0296256>] ? ppp_write+0x98/0x98
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898256]  [<c01a53ce>] do_vfs_ioctl+0x45e/0x498
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898258]  [<c01a118e>] ? getname_flags+0x1e/0xad
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898260]  [<c019391b>] ? kmem_cache_free+0x14/0x83
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898262]  [<c01ab5bb>] ? alloc_fd+0x4e/0xba
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898265]  [<c0199465>] ? do_sys_open+0xdb/0xe5
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898266]  [<c019ac7b>] ? fput+0x13/0x155
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898268]  [<c01a4387>] ? do_fcntl+0x227/0x3aa
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898270]  [<c01a543b>] sys_ioctl+0x33/0x4c
>  May 17 16:00:42 194.146.155.70 kernel: [14925.898273]  [<c0336edd>] syscall_call+0x7/0xb
> --

It's a known problem: when ipv6 is enabled, we allocate percpu memory to
hold per-device snmp counters.

Make sure the kernel's idea of max possible cpus matches the real number
of cpus.

And yes, switching to a 64bit kernel helps a lot.




^ permalink raw reply

* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: tsuna @ 2011-05-19  6:42 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: David Miller, kuznet, pekkas, jmorris, yoshfuji, kaber, hagen,
	eric.dumazet, netdev, linux-kernel
In-Reply-To: <C70B920C-6176-481B-B6D2-E52769291EBC@comsys.rwth-aachen.de>

On Wed, May 18, 2011 at 11:36 PM, Alexander Zimmermann
<alexander.zimmermann@comsys.rwth-aachen.de> wrote:
> If you set the initRTO=0.1s, it's good for me but bad for the rest of the
> world. That's the difference.
>
> Or do you want to implement a lower barrier of 1sec so that you can ensure
> that nobody set the initRTO lower than 1s?

Oh, I see.  Yes, there is a lower bound (and an upper bound) on what
values the kernel will accept as initRTO.  In the patch "Implement a
two-level initial RTO as per draft RFC 2988bis-02" above, I re-used
TCP_RTO_MIN and TCP_RTO_MAX in net/ipv4/sysctl_net_ipv4.c in order to
prevent users from setting an initRTO that's outside this range.  They
are defined as follows in tcp.h:

#define TCP_RTO_MAX     ((unsigned)(120*HZ))
#define TCP_RTO_MIN     ((unsigned)(HZ/5))

So we're talking about a [200ms ; 120s] range no matter what.
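
For illustration, a minimal user-space sketch of the kind of range check
described above. It is not the code from the patch: HZ is assumed to be 1000
(the real value is a kernel build option), and set_init_rto() is a
hypothetical helper standing in for the sysctl handling, which is not shown
in this thread.

```c
#include <errno.h>

/* Assumption: HZ = 1000, so one jiffy is one millisecond. */
#define HZ 1000u
#define TCP_RTO_MAX ((unsigned)(120 * HZ))
#define TCP_RTO_MIN ((unsigned)(HZ / 5))

/*
 * Hypothetical helper: store a new initial RTO (in jiffies) only if it
 * lies within [TCP_RTO_MIN, TCP_RTO_MAX]; otherwise leave the old value
 * untouched and return -EINVAL.
 */
static int set_init_rto(unsigned *slot, unsigned value)
{
	if (value < TCP_RTO_MIN || value > TCP_RTO_MAX)
		return -EINVAL;
	*slot = value;
	return 0;
}
```

With the assumed HZ, TCP_RTO_MIN is 200 jiffies and TCP_RTO_MAX is 120000
jiffies, which matches the [200ms ; 120s] range.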

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

^ permalink raw reply

* Re: 2.6.39-rc7-git11, x86/32, failed on ppp2897'th interface,  PERCPU:  allocation failed
From: Denys Fedoryshchenko @ 2011-05-19  6:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1305787158.3019.12.camel@edumazet-laptop>

 On Thu, 19 May 2011 08:39:18 +0200, Eric Dumazet wrote:
> On Thursday 19 May 2011 at 09:35 +0300, Denys Fedoryshchenko wrote:
>> Hi, again
>>
>>  I just tried to upgrade a large NAS from 2.6.38.6 to 2.6.39-rc7-git11,
>>  enabling ipv6 on it at the same time.
>>  I got the following after ppp2897 was brought up (which means there are
>>  2896 others available, and also a few ethernet vlans, around 32).
>>  I am not sure it is a bug, but it looks like I had free memory (the box
>>  had 8GB free), and lowmem too. I will also try a 64bit kernel there this
>>  evening.
>>
>>  May 17 16:00:42 194.146.155.70 kernel: [14925.897799] PERCPU: allocation failed, size=2048 align=4, failed to allocate new chunk
>>  May 17 16:00:42 194.146.155.70 kernel: [14925.898163] Pid: 24207, comm: pppd Not tainted 2.6.39-rc7-git11-build-0058 #4
>
> It's a known problem: when ipv6 is enabled, we allocate percpu memory to
> hold per-device snmp counters.
>
> Make sure the kernel's idea of max possible cpus matches the real number
> of cpus.
>
> And yes, switching to a 64bit kernel helps a lot.
>
 Yes, it matches, I guess.
 CONFIG_NR_CPUS=8

 processor       : 7
 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 26
 model name      : Intel(R) Core(TM) i7 CPU         950  @ 3.07GHz

 Thanks. Then I will simply switch the kernel to 64bit, but for now keep the
 32bit userspace, since this semi-embedded system is mass deployed and I have
 to maintain it alone (I cannot handle both 32 and 64 bit userspace), and
 some PCs don't have the lm flag in cpuinfo :)

 I have been hitting lowmem limits a lot lately, but the only application
 that did not work right with 32bit userspace on a 64bit kernel was ipvsadm.
 Should I report it as a bug (I will check whether it is still an issue)?


^ permalink raw reply

* [PATCH] tcp: Lower the initial RTO to 1s as per draft RFC 2988bis-02.
From: Benoit Sigoure @ 2011-05-19  6:47 UTC (permalink / raw)
  To: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, hagen,
	eric.dumazet, alexander.zimmermann
  Cc: netdev, linux-kernel, Benoit Sigoure
In-Reply-To: <20110519.014656.1735519603194773578.davem@davemloft.net>

Draft RFC 2988bis-02 recommends that the initial RTO be lowered
from 3 seconds down to 1 second, and that in case of a timeout
during the TCP 3WHS, the RTO should fall back to 3 seconds when
data transmission begins.

Signed-off-by: Benoit Sigoure <tsunanet@gmail.com>
---

Apologies for the spam, I sent this patch from the wrong address and without
sob'ing it.  I build the Linux kernel in a 15G tmpfs (it's faster this way :D)
and I lost my .git/config after a reboot.

 include/net/tcp.h    |    5 ++++-
 net/ipv4/tcp_input.c |   13 +++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..274d761 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -122,7 +122,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #endif
 #define TCP_RTO_MAX	((unsigned)(120*HZ))
 #define TCP_RTO_MIN	((unsigned)(HZ/5))
-#define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))	/* RFC 1122 initial RTO value	*/
+/* The next 2 values come from Draft RFC 2988bis-02. */
+#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))		/* initial RTO value	*/
+#define TCP_TIMEOUT_INIT_FALLBACK ((unsigned)(3*HZ))	/* initial RTO to fallback to when
+							 * a timeout happens during the 3WHS.	*/
 
 #define TCP_RESOURCE_PROBE_INTERVAL ((unsigned)(HZ/2U)) /* Maximal interval between probes
 					                 * for local resources.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bef9f04..a36bc35 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -868,6 +868,11 @@ static void tcp_init_metrics(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct dst_entry *dst = __sk_dst_get(sk);
+	/* If we had to retransmit anything during the 3WHS, use
+	 * the initial fallback RTO as per draft RFC 2988bis-02.
+	 */
+	int init_rto = inet_csk(sk)->icsk_retransmits ?
+		TCP_TIMEOUT_INIT_FALLBACK : TCP_TIMEOUT_INIT;
 
 	if (dst == NULL)
 		goto reset;
@@ -890,7 +895,7 @@ static void tcp_init_metrics(struct sock *sk)
 	if (dst_metric(dst, RTAX_RTT) == 0)
 		goto reset;
 
-	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
+	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (init_rto << 3))
 		goto reset;
 
 	/* Initial rtt is determined from SYN,SYN-ACK.
@@ -916,7 +921,7 @@ static void tcp_init_metrics(struct sock *sk)
 		tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
 	}
 	tcp_set_rto(sk);
-	if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
+	if (inet_csk(sk)->icsk_rto < init_rto && !tp->rx_opt.saw_tstamp) {
 reset:
 		/* Play conservative. If timestamps are not
 		 * supported, TCP will fail to recalculate correct
@@ -924,8 +929,8 @@ reset:
 		 */
 		if (!tp->rx_opt.saw_tstamp && tp->srtt) {
 			tp->srtt = 0;
-			tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
-			inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+			tp->mdev = tp->mdev_max = tp->rttvar = init_rto;
+			inet_csk(sk)->icsk_rto = init_rto;
 		}
 	}
 	tp->snd_cwnd = tcp_init_cwnd(tp, dst);
-- 
1.7.0.4
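
For readers skimming the diff, the selection rule it adds to
tcp_init_metrics() can be restated as a small user-space sketch. This is an
illustration only: HZ is assumed to be 1000, and pick_init_rto() is a
hypothetical name that takes the handshake retransmission count directly
instead of reading icsk_retransmits from the socket.

```c
/* Assumption: HZ = 1000, so one jiffy is one millisecond. */
#define HZ 1000u
#define TCP_TIMEOUT_INIT          ((unsigned)(1 * HZ)) /* initial RTO */
#define TCP_TIMEOUT_INIT_FALLBACK ((unsigned)(3 * HZ)) /* RTO after a 3WHS timeout */

/*
 * Hypothetical helper: if anything was retransmitted during the
 * three-way handshake, start data transmission with the 3 second
 * fallback; otherwise keep the 1 second initial RTO.
 */
static unsigned pick_init_rto(unsigned handshake_retransmits)
{
	return handshake_retransmits ? TCP_TIMEOUT_INIT_FALLBACK
				     : TCP_TIMEOUT_INIT;
}
```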

^ permalink raw reply related

* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: Alexander Zimmermann @ 2011-05-19  6:52 UTC (permalink / raw)
  To: tsuna
  Cc: David Miller, kuznet, pekkas, jmorris, yoshfuji, kaber, hagen,
	eric.dumazet, netdev, linux-kernel
In-Reply-To: <BANLkTiku9KTHCm59cib5KY8mz0ewSLsHFQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1462 bytes --]


On 19.05.2011 at 08:42, tsuna wrote:

> On Wed, May 18, 2011 at 11:36 PM, Alexander Zimmermann
> <alexander.zimmermann@comsys.rwth-aachen.de> wrote:
>> If you set the initRTO=0.1s, it's good for me but bad for the rest of the
>> world. That's the difference.
>> 
>> Or do you want to implement a lower barrier of 1sec so that you can ensure
>> that nobody set the initRTO lower than 1s?
> 
> Oh, I see.  Yes, there is a lower bound (and an upper bound) on what
> values the kernel will accept as initRTO.  In the patch "Implement a
> two-level initial RTO as per draft RFC 2988bis-02" above, I re-used
> TCP_RTO_MIN and TCP_RTO_MAX in net/ipv4/sysctl_net_ipv4.c in order to
> prevent users from setting an initRTO that's outside this range.  They
> are defined as follows in tcp.h:
> 
> #define TCP_RTO_MAX     ((unsigned)(120*HZ))
> #define TCP_RTO_MIN     ((unsigned)(HZ/5))
> 
> So we're talking about a [200ms ; 120s] range no matter what.

Why is 200ms a valid lower bound for initRTO? I'm aware of
measurements showing that 1s is safe for the Internet, but I don't know
of any studies showing that 200ms is safe...

> 
> -- 
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//


[-- Attachment #2: Signed part of the message --]
[-- Type: application/pgp-signature, Size: 243 bytes --]

^ permalink raw reply

* Re: 2.6.39-rc7-git11, x86/32, failed on ppp2897'th interface, PERCPU:  allocation failed
From: Eric Dumazet @ 2011-05-19  6:55 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev
In-Reply-To: <1305787158.3019.12.camel@edumazet-laptop>

On Thursday 19 May 2011 at 08:39 +0200, Eric Dumazet wrote:

> It's a known problem: when ipv6 is enabled, we allocate percpu memory to
> hold per-device snmp counters.
> 
> Make sure the kernel's idea of max possible cpus matches the real number
> of cpus.
> 
> And yes, switching to a 64bit kernel helps a lot.
> 
> 

Looking at snmp6_alloc_dev(), we allocate three MIBs per device:

ipstats_mib  (30 * sizeof(u64) * number_of_possible_cpus)
icmpv6_mib    (4 * sizeof(long) * number_of_possible_cpus)
icmpv6msg_mib  (26 * sizeof(long))

For sure the icmp ones don't need percpu counters. Plain atomic_long_t
(shared) would be enough, since ICMP messages are rare enough.
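
To put numbers on the formulas above, here is a small user-space sketch. The
counter counts (30, 4, 26) are taken from this message; the struct and
function names are hypothetical, and the caller supplies the CPU count and
sizeof(long) as assumptions about the target machine.

```c
#include <stddef.h>

/* Bytes needed per network device for each MIB (hypothetical struct). */
struct snmp6_cost {
	size_t ipstats;   /* 30 u64 counters per possible cpu */
	size_t icmpv6;    /* 4 long counters per possible cpu */
	size_t icmpv6msg; /* 26 long counters, not multiplied by cpus */
};

/*
 * Hypothetical helper: evaluate the three formulas for a given number of
 * possible cpus and a given sizeof(long) on the target kernel.
 */
static struct snmp6_cost snmp6_dev_cost(size_t possible_cpus, size_t long_size)
{
	struct snmp6_cost c;

	c.ipstats = 30 * 8 * possible_cpus; /* u64 is always 8 bytes */
	c.icmpv6 = 4 * long_size * possible_cpus;
	c.icmpv6msg = 26 * long_size;
	return c;
}
```

For the box in this thread (CONFIG_NR_CPUS=8, 32bit kernel, so a 4 byte
long), snmp6_dev_cost(8, 4) gives 1920, 128 and 104 bytes per device.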




^ permalink raw reply

* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: tsuna @ 2011-05-19  7:07 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: David Miller, kuznet, pekkas, jmorris, yoshfuji, kaber, hagen,
	eric.dumazet, netdev, linux-kernel
In-Reply-To: <8C5DF277-320D-4DEB-A133-EEC301DE58DC@comsys.rwth-aachen.de>

On Wed, May 18, 2011 at 11:52 PM, Alexander Zimmermann
<alexander.zimmermann@comsys.rwth-aachen.de> wrote:
>> So we're talking about a [200ms ; 120s] range no matter what.
>
> Why is 200ms a valid lower bound for initRTO? I'm aware of
> measurements showing that 1s is safe for the Internet, but I don't know
> of any studies showing that 200ms is safe...

The constants that are quoted aren't specific to the initRTO.  They're
used to bound the RTO as it gets adjusted during the TCP session.  See
`tcp_set_rto' in tcp_input.c for reference.
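
As a rough illustration of what bounding the RTO during a session means, here
is a simplified user-space sketch. It is not a copy of tcp_set_rto: it uses
the textbook RFC 2988 formula RTO = SRTT + 4 * RTTVAR on plain values, clamps
both ends directly, and assumes HZ is 1000. The function names are
hypothetical.

```c
/* Assumption: HZ = 1000, so one jiffy is one millisecond. */
#define HZ 1000u
#define TCP_RTO_MAX ((unsigned)(120 * HZ))
#define TCP_RTO_MIN ((unsigned)(HZ / 5))

/* Hypothetical helper: clamp an RTO (in jiffies) into the allowed range. */
static unsigned bound_rto(unsigned rto)
{
	if (rto < TCP_RTO_MIN)
		return TCP_RTO_MIN;
	if (rto > TCP_RTO_MAX)
		return TCP_RTO_MAX;
	return rto;
}

/*
 * Hypothetical helper: RFC 2988 style RTO from a smoothed RTT and an RTT
 * variance (both in jiffies), bounded as above.
 */
static unsigned rto_from_estimates(unsigned srtt, unsigned rttvar)
{
	return bound_rto(srtt + 4 * rttvar);
}
```

In this sketch a connection with srtt = 50 and rttvar = 10 would compute 90
jiffies and end up with the 200 jiffy minimum.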

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

^ permalink raw reply

* [PATCH 1/2] net: ping: make local functions static
From: Changli Gao @ 2011-05-19  7:16 UTC (permalink / raw)
  To: David S. Miller
  Cc: Alexey Kuznetsov, Pekka Savola (ipv6), James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, Changli Gao

These functions are only used in this file, so make them static.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 net/ipv4/ping.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 6a21da9..5f9e2d1 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -449,8 +449,8 @@ static int ping_push_pending_frames(struct sock *sk, struct pingfakehdr *pfh, st
 	return ip_push_pending_frames(sk, fl4);
 }
 
-int ping_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
-		 size_t len)
+static int ping_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+			size_t len)
 {
 	struct net *net = sock_net(sk);
 	struct flowi4 fl4;
@@ -621,8 +621,8 @@ do_confirm:
 	goto out;
 }
 
-int ping_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
-		 size_t len, int noblock, int flags, int *addr_len)
+static int ping_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+			size_t len, int noblock, int flags, int *addr_len)
 {
 	struct inet_sock *isk = inet_sk(sk);
 	struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;

^ permalink raw reply related

* [PATCH 2/2] net: ping: fix the coding style
From: Changli Gao @ 2011-05-19  7:16 UTC (permalink / raw)
  To: David S. Miller
  Cc: Alexey Kuznetsov, Pekka Savola (ipv6), James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, Changli Gao
In-Reply-To: <1305789361-5366-1-git-send-email-xiaosuo@gmail.com>

Lines should be no longer than 80 characters.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 net/ipv4/ping.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 5f9e2d1..1f3bb11 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -187,7 +187,8 @@ exit:
 	return sk;
 }
 
-static void inet_get_ping_group_range_net(struct net *net, gid_t *low, gid_t *high)
+static void inet_get_ping_group_range_net(struct net *net, gid_t *low,
+					  gid_t *high)
 {
 	gid_t *data = net->ipv4.sysctl_ping_group_range;
 	unsigned seq;
@@ -437,7 +438,8 @@ static int ping_getfrag(void *from, char * to,
 	return 0;
 }
 
-static int ping_push_pending_frames(struct sock *sk, struct pingfakehdr *pfh, struct flowi4 *fl4)
+static int ping_push_pending_frames(struct sock *sk, struct pingfakehdr *pfh,
+				    struct flowi4 *fl4)
 {
 	struct sk_buff *skb = skb_peek(&sk->sk_write_queue);
 
@@ -754,7 +756,9 @@ static struct sock *ping_get_first(struct seq_file *seq, int start)
 	for (state->bucket = start; state->bucket < PING_HTABLE_SIZE;
 	     ++state->bucket) {
 		struct hlist_nulls_node *node;
-		struct hlist_nulls_head *hslot = &ping_table.hash[state->bucket];
+		struct hlist_nulls_head *hslot;
+
+		hslot = &ping_table.hash[state->bucket];
 
 		if (hlist_nulls_empty(hslot))
 			continue;

^ permalink raw reply related

* Re: [PATCH 14/18] virtio: add api for delayed callbacks
From: Michael S. Tsirkin @ 2011-05-19  7:24 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <87boz3dsoe.fsf-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

On Mon, May 16, 2011 at 04:43:21PM +0930, Rusty Russell wrote:
> On Sun, 15 May 2011 15:48:18 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Mon, May 09, 2011 at 03:27:33PM +0930, Rusty Russell wrote:
> > > On Wed, 4 May 2011 23:52:33 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > Add an API that tells the other side that callbacks
> > > > should be delayed until a lot of work has been done.
> > > > Implement using the new used_event feature.
> > > 
> > > Since you're going to add a capacity query anyway, why not add the
> > > threshold argument here?
> > 
> > I thought that if we keep the API kind of generic
> > there might be more of a chance that future transports
> > will be able to implement it. For example, with an
> > old host we can't commit to a specific index.
> 
> No, it's always a hint anyway: you can be notified before the threshold
> is reached.  But best make it explicit I think.
> 
> Cheers,
> Rusty.


I tried doing that and remembered the real reason I went for this API:

capacity is limited by descriptor table space, not
used ring space: each entry in the used ring frees up multiple entries
in the descriptor ring. Thus the ring can't provide a
callback after capacity reaches N: capacity only becomes available
after we get bufs.

We could try and make the API pass in the number of freed bufs, however:
- this is not really what virtio-net cares about (it cares about
  capacity)
- if the driver passes a number > number of outstanding bufs, it will
  never get a callback. So to stay correct the driver will need to
  track number of outstanding requests. The simpler API avoids that. 
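
The capacity argument above can be illustrated with a toy model. This is not
the virtio API: the function name and the idea of passing per-buffer
descriptor counts in completion order are assumptions made purely for the
example.

```c
#include <stddef.h>

/*
 * Toy model (hypothetical): descs[i] is the number of descriptor table
 * entries held by the i-th outstanding buffer, in the order the host
 * completes them. Each completion is one used ring entry and frees all
 * descriptors of that buffer.
 *
 * Returns how many used ring entries must be consumed before at least
 * `want` descriptors are free, or (size_t)-1 if that never happens.
 */
static size_t used_entries_needed(const unsigned *descs, size_t n,
				  unsigned free_now, unsigned want)
{
	size_t i;

	if (free_now >= want)
		return 0;
	for (i = 0; i < n; i++) {
		free_now += descs[i];
		if (free_now >= want)
			return i + 1;
	}
	return (size_t)-1;
}
```

With four single-descriptor buffers outstanding, reaching a capacity of 4
takes four used entries; with one four-descriptor buffer it takes a single
used entry. The same capacity target therefore maps to different used ring
positions, and a target larger than everything outstanding is never reached.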


APIs are easy to change so I'm guessing it's not a major blocker:
we can change later when e.g. block tries to
pass in some kind of extra hint: we'll be smarter
about how this API can change then.

Right?

-- 
MST

^ permalink raw reply

* Re: [PATCH 09/18] virtio: use avail_event index
From: Michael S. Tsirkin @ 2011-05-19  7:27 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <87tycsn9lt.fsf-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

On Wed, May 18, 2011 at 09:49:42AM +0930, Rusty Russell wrote:
> On Tue, 17 May 2011 09:10:31 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > Well one can imagine a driver doing:
> > 
> > 	while (virtqueue_get_buf()) {
> > 		virtqueue_add_buf()
> > 	}
> > 	virtqueue_kick()
> > 
> > which looks sensible (batch kicks) but might
> > process any number of bufs between kicks.
> 
> No, we currently only expose the buffers in the kick, so it can only
> fill the ring doing that.
> 
> We could change that (and maybe that's worth looking at)...

That's actually what one of the early patches in the series did.
I guess I can try and reorder the patches. I do believe
it makes sense to publish immediately, as this way
the host can work in parallel with the guest.

> > If we look at drivers closely enough, I think none
> > of them do the equivalent of the above, but not 100% sure.
> 
> I'm pretty sure we don't have this kind of 'echo' driver yet.  Drivers
> tend to take OS requests and queue them.  The only one which does
> anything even partially sophisticated is the net driver...
> 
> Thanks,
> Rusty.

I guess I'll just need to do the legwork and check then.

-- 
MST

^ permalink raw reply
