* [PATCH] tcp: Socket option to set congestion window
From: Tom Herbert @ 2010-05-26 5:01 UTC
To: davem; +Cc: netdev, ycheng
This patch allows an application to set the TCP congestion window
for a connection through a socket option. The maximum value that
may be set is specified by a sysctl. When the sysctl is set to
zero, the default value, the socket option is disabled.
The socket option is most useful to set the initial congestion
window for a connection to a larger value than the default in
order to improve latency. This socket option would typically be
used by an "intelligent" application which might have better knowledge
than the kernel as to what an appropriate initial congestion window is.
One use of this might be with an application which maintains
per-client path characteristics. This could allow setting the congestion
window more precisely than could be achieved through the
route command.
A second use of this might be to reduce the number of simultaneous
connections that a client might open to a server; for instance,
when a web browser opens multiple connections to a server. With multiple
connections the aggregate congestion window is larger than that of a
single connection (num_conns * cwnd), which effectively can be used to
circumvent slowstart and improve latency. With this socket option, a
single connection with a large initial congestion window could be used,
which retains the latency properties of multiple connections while
nicely reducing the number of connections (load) on the network.
The sysctl to enable and control this feature is
net.ipv4.tcp_user_cwnd_max
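(For example, "sysctl -w net.ipv4.tcp_user_cwnd_max=32" would cap
application requests at 32 MSS; the value here is only illustrative.)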
The socket option call would be:
setsockopt(fd, IPPROTO_TCP, TCP_CWND, &val, sizeof (val))
where val is the congestion window in units of MSS.
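For illustration only, a userspace sketch of the call (not part of
this patch; the helper name and the requested value of 16 are
arbitrary):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_CWND
#define TCP_CWND 18		/* from this patch */
#endif

/* Ask for a larger congestion window on an established TCP socket.
 * Fails with EPERM when tcp_user_cwnd_max is 0, and with EINVAL when
 * the requested value is not positive or the connection is not
 * established and in TCP_CA_Open. */
static int set_cwnd(int fd, int mss_units)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_CWND,
		       &mss_units, sizeof(mss_units)) < 0) {
		perror("setsockopt(TCP_CWND)");
		return -1;
	}
	return 0;
}

A call such as set_cwnd(fd, 16) asks for a 16-MSS window; the kernel
clamps the request to tcp_user_cwnd_max and snd_cwnd_clamp, so the
effective window may be smaller than requested.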
Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..9e9692f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -105,6 +105,7 @@ enum {
#define TCP_COOKIE_TRANSACTIONS 15 /* TCP Cookie Transactions */
#define TCP_THIN_LINEAR_TIMEOUTS 16 /* Use linear timeouts for thin streams*/
#define TCP_THIN_DUPACK 17 /* Fast retrans. after 1 dupack */
+#define TCP_CWND 18 /* Set congestion window */
/* for TCP_INFO socket option */
#define TCPI_OPT_TIMESTAMPS 1
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a144914..3d1f934 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -246,6 +246,7 @@ extern int sysctl_tcp_max_ssthresh;
extern int sysctl_tcp_cookie_size;
extern int sysctl_tcp_thin_linear_timeouts;
extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_user_cwnd_max;
extern atomic_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d96c1da..b35d18f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -597,6 +597,13 @@ static struct ctl_table ipv4_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+ {
+ .procname = "tcp_user_cwnd_max",
+ .data = &sysctl_tcp_user_cwnd_max,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
{
.procname = "udp_mem",
.data = &sysctl_udp_mem,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6596b4f..0ca9832 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2370,6 +2370,24 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
}
break;
+ case TCP_CWND:
+ if (sysctl_tcp_user_cwnd_max <= 0)
+ err = -EPERM;
+ else if (val > 0 && sk->sk_state == TCP_ESTABLISHED &&
+ icsk->icsk_ca_state == TCP_CA_Open) {
+ u32 cwnd = val;
+ cwnd = min(cwnd, (u32)sysctl_tcp_user_cwnd_max);
+ cwnd = min(cwnd, tp->snd_cwnd_clamp);
+
+ if (tp->snd_cwnd != cwnd) {
+ tp->snd_cwnd = cwnd;
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ tp->snd_cwnd_cnt = 0;
+ }
+ } else
+ err = -EINVAL;
+ break;
+
#ifdef CONFIG_TCP_MD5SIG
case TCP_MD5SIG:
/* Read the IP->Key mappings from userspace */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b4ed957..2d10a44 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -60,6 +60,8 @@ int sysctl_tcp_base_mss __read_mostly = 512;
/* By default, RFC2861 behavior. */
int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
+int sysctl_tcp_user_cwnd_max __read_mostly;
+
int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
* Re: [PATCH] tcp: Socket option to set congestion window
From: Stephen Hemminger @ 2010-05-26 5:08 UTC
To: Tom Herbert; +Cc: davem, netdev, ycheng
On Tue, 25 May 2010 22:01:13 -0700 (PDT)
Tom Herbert <therbert@google.com> wrote:
> This patch allows an application to set the TCP congestion window
> for a connection through a socket option. The maximum value that
> may be set is specified by a sysctl. When the sysctl is set to
> zero, the default value, the socket option is disabled.
>
> The socket option is most useful to set the initial congestion
> window for a connection to a larger value than the default in
> order to improve latency. This socket option would typically be
> used by an "intelligent" application which might have better knowledge
> than the kernel as to what an appropriate initial congestion window is.
>
> One use of this might be with an application which maintains
> per-client path characteristics. This could allow setting the congestion
> window more precisely than could be achieved through the
> route command.
>
> A second use of this might be to reduce the number of simultaneous
> connections that a client might open to a server; for instance,
> when a web browser opens multiple connections to a server. With multiple
> connections the aggregate congestion window is larger than that of a
> single connection (num_conns * cwnd), which effectively can be used to
> circumvent slowstart and improve latency. With this socket option, a
> single connection with a large initial congestion window could be used,
> which retains the latency properties of multiple connections while
> nicely reducing the number of connections (load) on the network.
>
> The sysctl to enable and control this feature is
>
> net.ipv4.tcp_user_cwnd_max
>
> The socket option call would be:
>
> setsockopt(fd, IPPROTO_TCP, TCP_CWND, &val, sizeof (val))
>
> where val is the congestion window in units of MSS.
>
The IETF TCP maintainers already think Linux TCP allows unsafe
operation; this will just allow more possible misuse and prove
their argument. Until/unless this behavior is approved by
a wider body of research, I don't think it should be accepted at
this time.
* Re: [PATCH] tcp: Socket option to set congestion window
From: David Miller @ 2010-05-26 5:52 UTC
To: shemminger; +Cc: therbert, netdev, ycheng
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 25 May 2010 22:08:58 -0700
> The IETF TCP maintainers already think Linux TCP allows unsafe
> operation; this will just allow more possible misuse and prove
> their argument. Until/unless this behavior is approved by
> a wider body of research, I don't think it should be accepted at
> this time.
Yes, and two other points I'd like to add.
1) Stop pretending a network path characteristic can be made into
an application level one, else I'll stop reading your patches.
You can try to use smoke and mirrors to make your justification by
saying that an application can circumvent things right now by
opening up multiple connections. But guess what? If that act
overflows a network queue, we'll pull the CWND back on all of those
connections while their CWNDs are still small and therefore way
before things get out of hand.
Whereas if you set the initial window high, the CWND is wildly out
of control before we are even started.
And even after your patch the "abuse" ability is still there. So
since your patch doesn't prevent the "abuse", you really don't care
about CWND abuse. Instead, you simply want to pimp your feature.
2) The very last application I'd want to use something like this is a
damn web browser.
Maybe an extremely sophisticated program, like a database
or caching manager, that runs privileged and somehow has complete
and constantly updated knowledge of the network topology from end
to end. And if, and only if, we would only let privileged
applications make the setting.
Right now we only allow this to be done via a route setting, exactly because:
1) It is a network path characteristic, full stop.
2) Only humans can really know what the exact end-to-end path
characteristics are on a per-route basis, and, given that, whether it
is safe to increase the initial CWND as a result.
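(For reference, with iproute2 that per-route knob looks something like
"ip route change default via 192.0.2.1 initcwnd 16"; the gateway and
window value here are only illustrative.)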
* Re: [PATCH] tcp: Socket option to set congestion window
From: Tom Herbert @ 2010-05-26 7:06 UTC
To: David Miller; +Cc: shemminger, netdev, ycheng
On Tue, May 25, 2010 at 10:52 PM, David Miller <davem@davemloft.net> wrote:
>
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Tue, 25 May 2010 22:08:58 -0700
>
> > The IETF TCP maintainers already think Linux TCP allows unsafe
> > operation; this will just allow more possible misuse and prove
> > their argument. Until/unless this behavior is approved by
> > a wider body of research, I don't think it should be accepted at
> > this time.
>
> Yes, and two other points I'd like to add.
>
> 1) Stop pretending a network path characteristic can be made into
> an application level one, else I'll stop reading your patches.
>
> You can try to use smoke and mirrors to make your justification by
> saying that an application can circumvent things right now by
> opening up multiple connections. But guess what? If that act
> overflows a network queue, we'll pull the CWND back on all of those
> connections while their CWNDs are still small and therefore way
> before things get out of hand.
>
It's really not that simple. In the application with multiple
connections, congestion may only affect some number of connections, so
more of the aggregate window may be preserved. This is an unfairness
issue between the 1- and N-connection scenarios, which is a real problem.
>
> Whereas if you set the initial window high, the CWND is wildly out
> of control before we are even started.
>
> And even after your patch the "abuse" ability is still there. So
> since your patch doesn't prevent the "abuse", you really don't care
> about CWND abuse. Instead, you simply want to pimp your feature.
>
> 2) The very last application I'd want to use something like this is a
> damn web browser.
>
Right, this should be fixed in the server, not at the browsers.
Unfortunately, web browsers seem to have lost any self-control in
limiting the number of simultaneous connections that can be opened (we
managed to get IE8 to open over 100 of them). So the cat's way out of
the bag. Servers can rein this problem in by allowing fewer
connections, but when the cost is increased latency there's not much incentive!
> Maybe an extremely sophisticated program, like a database
> or caching manager, that runs privileged and somehow has complete
> and constantly updated knowledge of the network topology from end
> to end. And if, and only if, we would only let privileged
> applications make the setting.
>
> Right now we only allow this to be done via a route setting, exactly because:
>
> 1) It is a network path characteristic, full stop.
>
Thanks to NAT, a network path, or even a host-specific path, is a
weakened concept. On the Internet this may be a path
characteristic per client, which unfortunately has no visibility in
the kernel other than per-connection state. When a single IP address
may have thousands of hosts behind it, caching TCP parameters for that
IP address is implicitly doing a huge aggregation-- probably dicey...
>
> 2) Only humans can really know what the exact end-to-end path
> characteristics are on a per-route basis, and, given that, whether it
> is safe to increase the initial CWND as a result.
In all but the most trivial networks, I do not believe humans are
capable of making an intelligent decision about this. Don't get me
wrong, it's great that it can be set in the route, but there's nothing
at all that prevents naive abuse (a 2009 study showed that 15% of
connections on the Internet violate the initial congestion window
standards anyway). We have proposed in the IETF to raise the initial
congestion window, but dynamic mechanisms that algorithmically
determine safe values are still of interest and may be safer, which is
what this patch would allow.
Thanks for your comments!
* Re: [PATCH] tcp: Socket option to set congestion window
From: David Miller @ 2010-05-26 7:33 UTC
To: therbert; +Cc: shemminger, netdev, ycheng
From: Tom Herbert <therbert@google.com>
Date: Wed, 26 May 2010 00:06:35 -0700
> It's really not that simple. In the application with multiple
> connections, congestion may only affect some number of connections, so
> more of the aggregate window may be preserved. This is an unfairness
> issue between the 1- and N-connection scenarios, which is a real problem.
If this is true, then by all accounts your patch allows things to be
even worse.
Because now applications can still open up N connections, but with an
even larger initial CWND, with potentially exponential ramifications
on network congestion.
So yet another reason not to consider this feature seriously. It's
not an application level attribute; it's a network path one. Please
take that seriously, because I really mean it.
* Re: [PATCH] tcp: Socket option to set congestion window
From: Andi Kleen @ 2010-05-26 17:33 UTC
To: Tom Herbert; +Cc: David Miller, shemminger, netdev, ycheng
Tom Herbert <therbert@google.com> writes:
>>
> Thanks to NAT, a network path, or even a host-specific path, is a
> weakened concept. On the Internet this may be a path
> characteristic per client, which unfortunately has no visibility in
> the kernel other than per-connection state. When a single IP address
> may have thousands of hosts behind it, caching TCP parameters for that
> IP address is implicitly doing a huge aggregation-- probably dicey...
Yes, all of Saudi Arabia used to be (is?) one IP address...
Caching anything per IP is bogus.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH] tcp: Socket option to set congestion window
From: Denys Fedorysychenko @ 2010-05-26 17:41 UTC
To: Andi Kleen; +Cc: Tom Herbert, David Miller, shemminger, netdev, ycheng
On Wednesday 26 May 2010 20:33:46 Andi Kleen wrote:
> Tom Herbert <therbert@google.com> writes:
> > Thanks to NAT, a network path, or even a host-specific path, is a
> > weakened concept. On the Internet this may be a path
> > characteristic per client, which unfortunately has no visibility in
> > the kernel other than per-connection state. When a single IP address
> > may have thousands of hosts behind it, caching TCP parameters for that
> > IP address is implicitly doing a huge aggregation-- probably dicey...
>
> Yes, all of Saudi Arabia used to be (is?) one IP address...
>
> Caching anything per IP is bogus.
>
> -Andi
>
In Lebanon I have around 30k users (for web) behind a few IP addresses
(around 6). Because backbone here is $1200/Mbit, and mostly satellite
(RTT 400+ ms)... TCP accelerators and a caching proxy are a must. Tproxy
doesn't yet work well enough to use the full set of IPs.
And no local google/youtube servers, so maybe I'm affected by something? :-)
* Re: [PATCH] tcp: Socket option to set congestion window
From: David Miller @ 2010-05-26 21:08 UTC
To: andi; +Cc: therbert, shemminger, netdev, ycheng
From: Andi Kleen <andi@firstfloor.org>
Date: Wed, 26 May 2010 19:33:46 +0200
> Tom Herbert <therbert@google.com> writes:
>>>
>> Thanks to NAT, a network path, or even a host-specific path, is a
>> weakened concept. On the Internet this may be a path
>> characteristic per client, which unfortunately has no visibility in
>> the kernel other than per-connection state. When a single IP address
>> may have thousands of hosts behind it, caching TCP parameters for that
>> IP address is implicitly doing a huge aggregation-- probably dicey...
>
> Yes, all of Saudi Arabia used to be (is?) one IP address...
>
> Caching anything per IP is bogus.
And letting the applications choose the CWND is better?!?!
Every single proposal being mentioned in this thread has huge,
obvious, downsides.
Just because there are some cases of people NAT'ing many machines
behind one IP address doesn't mean we kill performance for the rest of
the world (the majority of internet usage btw) by not caching TCP path
characteristics per IP address.
And just because applications open up many sockets to get better TCP
latency and work around per-connection CWND limits DOES NOT mean we
let the application increase the initial CWND so it can abuse this
EVEN MORE and cause EVEN BIGGER problems.
If people have real, sane, ideas about how to attack this problem I am
all ears. But everything proposed here so far is complete and utter
crap.
* Re: [PATCH] tcp: Socket option to set congestion window
From: Andi Kleen @ 2010-05-26 21:27 UTC
To: David Miller; +Cc: andi, therbert, shemminger, netdev, ycheng
> > Yes, all of Saudi Arabia used to be (is?) one IP address...
> >
> > Caching anything per IP is bogus.
>
> And letting the applications choose the CWND is better?!?!
No, I actually agree with you on that. Just saying that
anything that relies on per-IP caching is bad too.
As I understand it, the idea was that the application knows
which flows belong to a single peer and wants to have
a single cwnd for all of those. Perhaps there would
be a way to generalize that and tell it to the kernel.
e.g. have a "peer id" that is known by applications,
and the kernel could manage cwnds shared between connections
associated with the same peer id?
Just an idea, I admit I haven't thought very deeply
about this. Feel free to poke holes in it.
-Andi
* Re: [PATCH] tcp: Socket option to set congestion window
From: David Miller @ 2010-05-26 22:10 UTC
To: andi; +Cc: therbert, shemminger, netdev, ycheng
From: Andi Kleen <andi@firstfloor.org>
Date: Wed, 26 May 2010 23:27:45 +0200
> As I understand it, the idea was that the application knows
> which flows belong to a single peer and wants to have
> a single cwnd for all of those. Perhaps there would
> be a way to generalize that and tell it to the kernel.
>
> e.g. have a "peer id" that is known by applications,
> and the kernel could manage cwnds shared between connections
> associated with the same peer id?
>
> Just an idea, I admit I haven't thought very deeply
> about this. Feel free to poke holes in it.
Yes, a CWND "domain" that can include multiple sockets is
something that might gain some traction.
The "domain" could just simply be the tuple {process,peer-IP}
* Re: [PATCH] tcp: Socket option to set congestion window
From: Rick Jones @ 2010-05-26 22:29 UTC
To: andi; +Cc: David Miller, therbert, shemminger, netdev, ycheng
David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Wed, 26 May 2010 23:27:45 +0200
>
>>As I understand it, the idea was that the application knows
>>which flows belong to a single peer and wants to have
>>a single cwnd for all of those. Perhaps there would
>>be a way to generalize that and tell it to the kernel.
>>
>>e.g. have a "peer id" that is known by applications,
>>and the kernel could manage cwnds shared between connections
>>associated with the same peer id?
Then all the app does is say "I'm in peer id foo", right? Is that really that
much different from making the setsockopt() call for a different cwnd value?
Particularly if, say, the limit were not a global sysctl, but based on the
existing per-route value (perhaps expanded to have a min, max and default?)
>>Just an idea, I admit I haven't thought very deeply
>>about this. Feel free to poke holes in it.
>
> Yes, a CWND "domain" that can include multiple sockets is
> something that might gain some traction.
>
> The "domain" could just simply be the tuple {process,peer-IP}
Name or PID?
rick jones
* Re: [PATCH] tcp: Socket option to set congestion window
From: Hagen Paul Pfeifer @ 2010-05-26 23:15 UTC
To: David Miller; +Cc: andi, therbert, shemminger, netdev, ycheng
* David Miller | 2010-05-26 15:10:14 [-0700]:
>From: Andi Kleen <andi@firstfloor.org>
>Date: Wed, 26 May 2010 23:27:45 +0200
>
>> As I understand it, the idea was that the application knows
>> which flows belong to a single peer and wants to have
>> a single cwnd for all of those. Perhaps there would
>> be a way to generalize that and tell it to the kernel.
>>
>> e.g. have a "peer id" that is known by applications,
>> and the kernel could manage cwnds shared between connections
>> associated with the same peer id?
>>
>> Just an idea, I admit I haven't thought very deeply
>> about this. Feel free to poke holes in it.
>
>Yes, a CWND "domain" that can include multiple sockets is
>something that might gain some traction.
>
>The "domain" could just simply be the tuple {process,peer-IP}
This discussion - as happens once a month - is about fairness. But if we define
a domain as the tuple {process,peer-IP}, fairness is applied only to the
last link before "peer-IP".
But fairness applies to *all* links in between! For example: consider a
dumbbell scenario:
+------+                                  +------+
|      |                                  |      |
|  H1  |                                  |  H3  |
|      |                                  |      |
+------+                                  +------+
   10MB \    +------+            +------+    / 10MB
         \   |      |   1MB/s    |      |   /
          >  |  R1  |------------|  R2  |  <
         /   |      |            |      |   \
   10MB /    +------+            +------+    \ 10MB
+------+                                  +------+
|      |                                  |      |
|  H2  |                                  |  H4  |
|      |                                  |      |
+------+                                  +------+
How can a domain defined as {process,peer-IP} be fair to the 1MB bottleneck link?
It is not fair! And it is also not fair to open n simultaneous streams, and so
on. This problem is discussed in several RFCs.
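(Rough arithmetic to make this concrete: if N flows share a bottleneck
of capacity C, each flow's fair share is about C/N. A host that opens k
parallel flows toward one peer claims roughly k*C/(N+k-1), nearly k
times the single-flow share while k is small relative to N, and the
bottleneck can be any link along the path, not just the last one.)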
.02
Best regards, Hagen
--
Hagen Paul Pfeifer <hagen@jauu.net> || http://jauu.net/
Telephone: +49 174 5455209 || Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
* Re: [PATCH] tcp: Socket option to set congestion window
From: David Miller @ 2010-05-27 3:04 UTC
To: hagen; +Cc: andi, therbert, shemminger, netdev, ycheng
From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Thu, 27 May 2010 01:15:12 +0200
> How can a domain defined as {process,peer-IP} be fair to the 1MB
> bottleneck link?
You're asking about a network level issue in terms of what can be done
on a local end-node.
All an end-node can do is abide by congestion control rules and respond
to packet drops, as has been going on for decades.
People have basically (especially in Europe) given up on crazy crap
like RSVP and other forms of bandwidth limiting and reservation. They
just oversubscribe their links, and increase their capacity as traffic
increases dictate. It just isn't all that manageable to put people's
traffic into classes and control what they do on a large scale.
I'm also skeptical about those who say the fight belongs squarely at
the end nodes. If you want to control the network traffic of the
meeting point of your dumbbell, you'll need a machine there doing RED
or traffic limiting. End-host schemes simply aren't going to work
because I can just add more end-hosts to reintroduce the problem.
The dumbbell situation is independent of the end-node issues; that's
all I'm really saying.
* Re: [PATCH] tcp: Socket option to set congestion window
From: Hagen Paul Pfeifer @ 2010-05-27 7:08 UTC
To: David Miller; +Cc: andi, therbert, shemminger, netdev, ycheng
* David Miller | 2010-05-26 20:04:43 [-0700]:
>You're asking about a network level issue in terms of what can be done
>on a local end-node.
No, I *write* about network level issues; this is the important item in my
mind. It is about network stability and network fairness. The lion's share of
TCP algorithms are drafted to guarantee _network fairness and network stability_.
And by the way, the IETF (and our) paradigm is still to shift functionality to
end hosts - not into the network core. "The Rise of the Stupid Network" [1] is
still a paradigm that is superior to the alternative where vendors put their
proprietary algorithms into the network and change the behavior in an
uncontrollable fashion.
>All an end-node can do is abide by congestion control rules and respond
>to packet drops, as has been going on for decades.
Right, and this will be reality for the next decades (at least for TCP;
maybe backed by ECN).
>People have basically (especially in Europe) given up on crazy crap
>like RSVP and other forms of bandwidth limiting and reservation. They
>just oversubscribe their links, and increase their capacity as traffic
>increases dictate. It just isn't all that manageable to put people's
>traffic into classes and control what they do on a large scale.
>
>I'm also skeptical about those who say the fight belongs squarely at
>the end nodes. If you want to control the network traffic of the
>meeting point of your dumbbell, you'll need a machine there doing RED
>or traffic limiting. End-host schemes simply aren't going to work
>because I can just add more end-hosts to reintroduce the problem.
I am not happy with this statement. It differs from the previous paragraph,
where you complain about intelligent network components. Davem, to this day
the routers do exactly this: they do RED/WRED or whatever and signal to the
producer to reduce its bandwidth.
And this is the most important aspect in this email: core network components
rely on end hosts to behave in a fair manner. Disable Slow Start/Congestion
Avoidance and the network will instantly collapse (mmh, net-next? ;-)
The mechanism as proposed in the patch is not fair. There are a lot of
publications available that analyse the impact of CWND in great detail, as
well as several RFCs that talk about CWND.
>The dumbbell situation is independant of the end-node issues, that's
>all I'm really saying.
Davem, I know that you are a good guy and worry about fairness aspects
really well. I wrote this email to popularize fairness and network stability
aspects to a broad audience.
Hagen
[1] http://isen.com/stupid.html
--
Censorship is the living confession of the powerful that they can only
tread on stupefied slaves, but cannot govern free peoples.
- Johann Nepomuk Nestroy
* Re: [PATCH] tcp: Socket option to set congestion window
From: David Miller @ 2010-05-27 7:28 UTC
To: hagen; +Cc: andi, therbert, shemminger, netdev, ycheng
From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Thu, 27 May 2010 09:08:27 +0200
> And by the way, the IETF (and our) paradigm is still to shift functionality to
> end hosts - not into the network core. "The Rise of the Stupid Network" [1] is
> still a paradigm that is superior to the alternative where vendors put their
> proprietary algorithms into the network and change the behavior in an
> uncontrollable fashion.
Superior or not, it's simply never going to happen. We are far beyond
being able to get to where we were before NAT'ing and shaping devices
started to get inserted everywhere on the network.
And I also don't see any of this stuff as fundamentally proprietary.
People want deep packet inspection, people want to control their users'
traffic. And people, most importantly, are willing to pay for this.
Therefore, these elements will always be in the network.
Better to co-exist with them and use them to our advantage instead of
fantasizing about a utopia where they don't exist.
* Re: [PATCH] tcp: Socket option to set congestion window
From: Hagen Paul Pfeifer @ 2010-05-27 7:46 UTC
To: David Miller; +Cc: andi, therbert, shemminger, netdev, ycheng
* David Miller | 2010-05-27 00:28:51 [-0700]:
>Superior or not, it's simply never going to happen. We are far beyond
>being able to get to where we were before NAT'ing and shaping devices
>started to get inserted everywhere on the network.
>
>And I also don't see any of this stuff as fundamentally proprietary.
We will see! If no real interaction between peers is required,
ISPs/carriers/Internet exchanges will start to put their proprietary
components into the network: they have nifty features, the product
development phase is shortened (no boring standardization necessary),
and so on. This is no new insight.
>People want deep packet inspection, people want to control their users'
>traffic. And people, most importantly, are willing to pay for this.
>
>Therefore, these elements will always be in the network.
>
>Better to co-exist with them and use them to our advantage instead of
>fantasizing about a utopia where they don't exist.
Sure, we have no alternative.
HGN
--
Hagen Paul Pfeifer <hagen@jauu.net> || http://jauu.net/
Telephone: +49 174 5455209 || Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
* Re: [PATCH] tcp: Socket option to set congestion window
From: Andi Kleen @ 2010-05-27 7:57 UTC
To: Rick Jones; +Cc: andi, David Miller, therbert, shemminger, netdev, ycheng
> Then all the app does is say "I'm in peer id foo", right? Is that really
> that much different from making the setsockopt() call for a different cwnd
> value? Particularly if, say, the limit were not a global sysctl, but based on
> the existing per-route value (perhaps expanded to have a min, max and
> default?)
The worst case with peer ids would be an app using its own peer id
for each connection. So each connection would have its own cwnd,
just like today. So the worst case is the same as today.
If connections share a peer id, the real effective cwnd
of all those connections would also never be "worse" (that is,
larger) than it could be on a single connection.
So peer ids effectively limit the cwnds, although they also
give a nice way to reuse an already existing cwnd for a new
connection (this does not make things worse, because in theory
the app could have reused the same connection too).
So overall, peer ids don't allow cwnds to be enlarged beyond today's.
If the cwnd is fully application-controlled, all these limits
are gone, and a bittorrent client could just always set
it to 1 million.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH] tcp: Socket option to set congestion window
From: Andi Kleen @ 2010-05-27 8:00 UTC
To: David Miller; +Cc: andi, therbert, shemminger, netdev, ycheng
On Wed, May 26, 2010 at 03:10:14PM -0700, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Wed, 26 May 2010 23:27:45 +0200
>
> > As I understand it, the idea was that the application knows
> > which flows belong to a single peer and wants to have
> > a single cwnd for all of those. Perhaps there would
> > be a way to generalize that and tell it to the kernel.
> >
> > e.g. have a "peer id" that is known by applications,
> > and the kernel could manage cwnds shared between connections
> > associated with the same peer id?
> >
> > Just an idea, I admit I haven't thought very deeply
> > about this. Feel free to poke holes in it.
>
> Yes, a CWND "domain" that can include multiple sockets is
> something that might gain some traction.
>
> The "domain" could just simply be the tuple {process,peer-IP}
If process is in there, this wouldn't work for a multi-process
server?
Perhaps it could be associated with an FD so that it could
be passed around with unix sockets if needed (we would just
need to make sure the AF_UNIX gc can handle such cycles):
peer_id = open_peer_id();
/* peer id is like a fd */
socket = socket( ... );
set_peer_id(socket, peer_id);
...
close(peer_id);
-andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH] tcp: Socket option to set congestion window
From: Tom Herbert @ 2010-05-27 16:14 UTC
To: Hagen Paul Pfeifer; +Cc: David Miller, andi, shemminger, netdev, ycheng
> And this is the most important aspect in this email: core network components
> rely on end hosts to behave in a fair manner. Disable Slow Start/Congestion
> Avoidance and the network will instantly collapse (mmh, net-next? ;-)
>
> The mechanism as proposed in the patch is not fair. There are a lot of
> publications available that analyse the impact of CWND in great detail, as
> well as several RFCs that talk about CWND.
The mechanism proposed in the patch is merely an API change; misuse,
abuse, or unfairness are inferences of how it might be used. Proper
safeguards should be applied to prevent misuse, but I don't see that
it should be any more insidious than 350 other mechanisms in the
system that could be used to screw things up.
Yes, there has been a lot of talk about CWND, but the standard has not
changed since 2002. In the meantime, browsers have increased the
number of parallel connections they open to a destination, and servers
hide behind multiple domains-- the end result of this is that browsers
use aggregate initial congestion windows much larger than the
standard, which sidesteps slowstart and is a source of unfairness.
This is contrary to RFC 3390:
"When web browsers open simultaneous TCP connections to the same
destination, they are working against TCP's congestion control
mechanisms"
I have yet to find any paper on CWND that analyzed the effect of this
phenomenon on the Internet, which is quite unfortunate. In our own full
scale experiments
(http://code.google.com/speed/articles/tcp_initcwnd_paper.pdf), we
analyzed the effects of using larger initial congestion windows on the
Internet, which might be the closest thing to such an analysis. I know
that in the LEDBAT WG of the IETF they are trying to come up with new
recommendations for the number of connections a browser can open; this is
good, but I hope it's not after the fact.
It would be better, by almost any perspective, to rein in the number
of connections servers are allowing clients to open. However, this
isn't going to happen if it means increased latency for end users;
there is no competitive rationale for servers to do that. That's
where a primary motivation of this patch becomes evident. Instead of
a server allowing 6 connections from a client, for instance, it could
allow just one connection but with an initial congestion window equal
to the aggregate of the 6 connections. This reduces connections and
does not change the size of the initial data burst going into the
Internet. The Internet is happy because there are fewer connections
(better for fairness) and fewer packets (fewer 3WHS); the server and
client are happy because there are fewer connections to deal with
and no increased latency.
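(Concrete numbers, assuming the common RFC 3390 initial window of 3 MSS
for a 1460-byte MSS: six connections burst roughly 6 * 3 = 18 MSS into
the network, while a single connection with TCP_CWND set to 18 sends the
same 18-MSS initial burst over one handshake and one congestion control
loop instead of six.)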
Tom
* Re: [PATCH] tcp: Socket option to set congestion window
From: Andi Kleen @ 2010-05-27 18:56 UTC
To: Tom Herbert
Cc: Hagen Paul Pfeifer, David Miller, andi, shemminger, netdev,
ycheng
> It would be better, by almost any perspective, to rein in the number
> of connections servers are allowing clients to open. However, this
> isn't going to happen if it means increased latency for end users;
> there is no competitive rationale for servers to do that. That's
> where a primary motivation of this patch becomes evident. Instead of
> a server allowing 6 connections from a client, for instance, it could
> allow just one connection but with an initial congestion window equal
> to the aggregate of the 6 connections. This reduces connections and
I thought the point was to avoid cwnd inflation by multiple connections?
Now you're saying you actually want larger cwnds?
If you simply want larger CWNDs, the easiest thing is to bump up the
define in your local build.
But that cannot be done by default, obviously.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH] tcp: Socket option to set congestion window
From: Hagen Paul Pfeifer @ 2010-05-27 19:19 UTC
To: Tom Herbert; +Cc: David Miller, andi, shemminger, netdev, ycheng, lars.eggert
* Tom Herbert | 2010-05-27 09:14:09 [-0700]:
>"When web browsers open simultaneous TCP connections to the same
>destination, they are working against TCP's congestion control
>mechanisms"
Right, and the problem applies to other protocols as well. Often p2p
protocols behave unfairly. This problem is known, but there is currently no
IETF effort to address it. The problem is not that simple, and it is
difficult to draft a universal statement.
>I have yet to find any paper on CWND that analyzed the effect of this
>phenomenon on the Internet, which is quite unfortunate. In our own full
>scale experiments
>(http://code.google.com/speed/articles/tcp_initcwnd_paper.pdf), we
>analyzed the effects of using larger initial congestion windows on the
>Internet, which might be the closest thing to such an analysis. I know
>that in the LEDBAT WG of the IETF they are trying to come up with new
>recommendations for the number of connections a browser can open; this is
>good, but I hope it's not after the fact.
I know your paper, and if I remember correctly I was a little bit sceptical
about the efforts to analyze the fairness behavior in depth. It takes one day
to validate the fairness issues: take NS3 (with NSC, so you can use the Linux
network stack with your patch), set up a dumbbell topology and analyse the
behavior. I will read the paper one more time.
I would have no problem with your patch if you applied this patch on top of it: ;-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0ca9832..73f9d46 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2371,7 +2371,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;
case TCP_CWND:
- if (sysctl_tcp_user_cwnd_max <= 0)
+ if (sysctl_tcp_user_cwnd_max <= 0 || !capable(CAP_NET_ADMIN))
err = -EPERM;
else if (val > 0 && sk->sk_state == TCP_ESTABLISHED &&
icsk->icsk_ca_state == TCP_CA_Open) {
HGN