* [PATCH] tcp-illinois: incorrect beta usage
  [not found] ` <001001c83063$9adbc9d0$d5897e82@csp.uiuc.edu>
@ 2007-11-28 23:47 ` Stephen Hemminger
  2007-11-29  0:25   ` Lachlan Andrew
  2007-11-29 14:12   ` [PATCH] tcp-illinois: incorrect beta usage Herbert Xu
  0 siblings, 2 replies; 45+ messages in thread
From: Stephen Hemminger @ 2007-11-28 23:47 UTC (permalink / raw)
To: Julian Shao Liu, David S. Miller, Herbert Xu
Cc: 'Lachlan Andrew', shaoliu, 'Douglas Leith', 'Robert Shorten', netdev

Lachlan Andrew observed that my TCP-Illinois implementation uses the
beta value incorrectly.  The parameter beta in the paper specifies the
amount to decrease *by*: that is, on loss,

	W <- W - beta*W

but tcp_illinois_ssthresh() uses beta as the amount to decrease *to*:

	W <- beta*W

This bug makes the Linux TCP-Illinois less aggressive on an uncongested
network, hurting performance.  Note: since the base beta value is .5, it
has no impact on a congested network.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>

--- a/net/ipv4/tcp_illinois.c	2007-08-18 07:50:15.000000000 -0700
+++ b/net/ipv4/tcp_illinois.c	2007-11-28 15:39:04.000000000 -0800
@@ -298,7 +298,7 @@ static u32 tcp_illinois_ssthresh(struct
 	struct illinois *ca = inet_csk_ca(sk);
 
 	/* Multiplicative decrease */
-	return max((tp->snd_cwnd * ca->beta) >> BETA_SHIFT, 2U);
+	return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->beta) >> BETA_SHIFT), 2U);
 }

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH] tcp-illinois: incorrect beta usage 2007-11-28 23:47 ` [PATCH] tcp-illinois: incorrect beta usage Stephen Hemminger @ 2007-11-29 0:25 ` Lachlan Andrew 2007-11-29 0:43 ` Stephen Hemminger 2007-11-29 14:12 ` [PATCH] tcp-illinois: incorrect beta usage Herbert Xu 1 sibling, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2007-11-29 0:25 UTC (permalink / raw) To: Stephen Hemminger Cc: David S. Miller, Herbert Xu, shaoliu, Douglas Leith, Robert Shorten, netdev Thanks Stephen. A related problem (largely due to the published algorithm itself) is that Illinois is very aggressive when it over-estimates the maximum RTT. At high load (say 200Mbps and 200ms RTT), a backlog of packets builds up just after a loss, causing the RTT estimate to become large. This makes Illinois think that *all* losses are due to corruption not congestion, and so only back off by 1/8 instead of 1/2. I can't think how to fix this except by better RTT estimation, or changes to Illinois itself. Currently, I ignore RTT measurements when sacked_out != 0 and have a heuristic "RTT aging" mechanism, but that's pretty ugly. Cheers, Lachlan On 28/11/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote: > Lachlan Andrew observed that my TCP-Illinois implementation uses the > beta value incorrectly: > The parameter beta in the paper specifies the amount to decrease > *by*: that is, on loss, > W <- W - beta*W > but in tcp_illinois_ssthresh() uses beta as the amount > to decrease *to*: W <- beta*W > > This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network, > hurting performance. Note: since the base beta value is .5, it has no > impact on a congested network. 
> > Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> > > --- a/net/ipv4/tcp_illinois.c 2007-08-18 07:50:15.000000000 -0700 > +++ b/net/ipv4/tcp_illinois.c 2007-11-28 15:39:04.000000000 -0800 > @@ -298,7 +298,7 @@ static u32 tcp_illinois_ssthresh(struct > struct illinois *ca = inet_csk_ca(sk); > > /* Multiplicative decrease */ > - return max((tp->snd_cwnd * ca->beta) >> BETA_SHIFT, 2U); > + return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->beta) >> BETA_SHIFT), 2U); > } > > > -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] tcp-illinois: incorrect beta usage 2007-11-29 0:25 ` Lachlan Andrew @ 2007-11-29 0:43 ` Stephen Hemminger 2007-11-29 5:26 ` Shao Liu 0 siblings, 1 reply; 45+ messages in thread From: Stephen Hemminger @ 2007-11-29 0:43 UTC (permalink / raw) To: Lachlan Andrew Cc: David S. Miller, Herbert Xu, shaoliu, Douglas Leith, Robert Shorten, netdev Lachlan Andrew wrote: > Thanks Stephen. > > A related problem (largely due to the published algorithm itself) is > that Illinois is very aggressive when it over-estimates the maximum > RTT. > > At high load (say 200Mbps and 200ms RTT), a backlog of packets builds > up just after a loss, causing the RTT estimate to become large. This > makes Illinois think that *all* losses are due to corruption not > congestion, and so only back off by 1/8 instead of 1/2. > > I can't think how to fix this except by better RTT estimation, or > changes to Illinois itself. Currently, I ignore RTT measurements when > sacked_out != 0 and have a heuristic "RTT aging" mechanism, but > that's pretty ugly. > > Cheers, > Lachlan > > Ageing the RTT estimates needs to be done anyway. Maybe something can be reused from H-TCP. The two are closely related. ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH] tcp-illinois: incorrect beta usage
  2007-11-29  0:43 ` Stephen Hemminger
@ 2007-11-29  5:26 ` Shao Liu
  2007-12-03 22:52   ` [RFC] TCP illinois max rtt aging Stephen Hemminger
  0 siblings, 1 reply; 45+ messages in thread
From: Shao Liu @ 2007-11-29 5:26 UTC (permalink / raw)
To: 'Stephen Hemminger', 'Lachlan Andrew'
Cc: 'David S. Miller', 'Herbert Xu', 'Douglas Leith', 'Robert Shorten', netdev

Hi Stephen and Lachlan,

Thanks for pointing out and fixing this bug.

For the max RTT problem, I have considered it as well and have some
ideas for improving it.  There are also some other places to improve; I
will summarize all my new ideas and send you an update.  So that I can
make the changes, could you please give me a link to download the latest
source code for the whole congestion control module in the Linux
implementation, including the general module shared by all algorithms
and the implementations of specific algorithms like TCP-Illinois and
H-TCP?

Thanks for the help!
-Shao

-----Original Message-----
From: Stephen Hemminger [mailto:shemminger@linux-foundation.org]
Sent: Wednesday, November 28, 2007 4:44 PM
To: Lachlan Andrew
Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith; Robert Shorten; netdev@vger.kernel.org
Subject: Re: [PATCH] tcp-illinois: incorrect beta usage

Lachlan Andrew wrote:
> Thanks Stephen.
>
> A related problem (largely due to the published algorithm itself) is
> that Illinois is very aggressive when it over-estimates the maximum
> RTT.
>
> At high load (say 200Mbps and 200ms RTT), a backlog of packets builds
> up just after a loss, causing the RTT estimate to become large.  This
> makes Illinois think that *all* losses are due to corruption not
> congestion, and so only back off by 1/8 instead of 1/2.
>
> I can't think how to fix this except by better RTT estimation, or
> changes to Illinois itself.  Currently, I ignore RTT measurements when
> sacked_out != 0 and have a heuristic "RTT aging" mechanism, but
> that's pretty ugly.
> > Cheers, > Lachlan > > Ageing the RTT estimates needs to be done anyway. Maybe something can be reused from H-TCP. The two are closely related. ^ permalink raw reply [flat|nested] 45+ messages in thread
* [RFC] TCP illinois max rtt aging 2007-11-29 5:26 ` Shao Liu @ 2007-12-03 22:52 ` Stephen Hemminger 2007-12-03 23:06 ` Lachlan Andrew 0 siblings, 1 reply; 45+ messages in thread From: Stephen Hemminger @ 2007-12-03 22:52 UTC (permalink / raw) To: shaoliu Cc: 'Lachlan Andrew', 'David S. Miller', 'Herbert Xu', 'Douglas Leith', 'Robert Shorten', netdev On Wed, 28 Nov 2007 21:26:12 -0800 "Shao Liu" <shaoliu@Princeton.EDU> wrote: > Hi Stephen and Lachlan, > > Thanks for pointing out and fixing this bug. > > For the max RTT problem, I have considered it also and I have some idea on > improve it. I also have some other places to improve. I will summarize all > my new ideas and send you an update. For me to change it, could you please > give me a link to download to latest source codes for the whole congestion > control module in Linux implementation, including the general module for all > algorithms, and the implementation for specific algorithms like TCP-Illinois > and H-TCP? > > Thanks for the help! > -Shao > > > > -----Original Message----- > From: Stephen Hemminger [mailto:shemminger@linux-foundation.org] > Sent: Wednesday, November 28, 2007 4:44 PM > To: Lachlan Andrew > Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith; > Robert Shorten; netdev@vger.kernel.org > Subject: Re: [PATCH] tcp-illinois: incorrect beta usage > > Lachlan Andrew wrote: > > Thanks Stephen. > > > > A related problem (largely due to the published algorithm itself) is > > that Illinois is very aggressive when it over-estimates the maximum > > RTT. > > > > At high load (say 200Mbps and 200ms RTT), a backlog of packets builds > > up just after a loss, causing the RTT estimate to become large. This > > makes Illinois think that *all* losses are due to corruption not > > congestion, and so only back off by 1/8 instead of 1/2. > > > > I can't think how to fix this except by better RTT estimation, or > > changes to Illinois itself. 
Currently, I ignore RTT measurements when
> > sacked_out != 0 and have a heuristic "RTT aging" mechanism, but
> > that's pretty ugly.
> >
> > Cheers,
> > Lachlan
>
> Ageing the RTT estimates needs to be done anyway.
> Maybe something can be reused from H-TCP. The two are closely related.

The following adds gradual aging of max RTT.

--- a/net/ipv4/tcp_illinois.c	2007-11-29 08:58:35.000000000 -0800
+++ b/net/ipv4/tcp_illinois.c	2007-11-29 09:37:33.000000000 -0800
@@ -63,7 +63,10 @@ static void rtt_reset(struct sock *sk)
 	ca->cnt_rtt = 0;
 	ca->sum_rtt = 0;
 
-	/* TODO: age max_rtt? */
+	/* add slowly fading memory for maxRTT to accommodate routing changes */
+	if (ca->max_rtt > ca->base_rtt)
+		ca->max_rtt = ca->base_rtt
+			+ (((ca->max_rtt - ca->base_rtt) * 31) >> 5);
 }
 
 static void tcp_illinois_init(struct sock *sk)

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging
  2007-12-03 22:52 ` [RFC] TCP illinois max rtt aging Stephen Hemminger
@ 2007-12-03 23:06 ` Lachlan Andrew
  2007-12-03 23:59   ` Shao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Lachlan Andrew @ 2007-12-03 23:06 UTC (permalink / raw)
To: Stephen Hemminger
Cc: shaoliu, David S. Miller, Herbert Xu, Douglas Leith, Robert Shorten, netdev

Greetings Stephen,

Thanks.  We'll have to play with the rate of ageing.  I used the slower
ageing

    if (ca->cnt_rtt > 3) {
        u64 mean_rtt = ca->sum_rtt;
        do_div(mean_rtt, ca->cnt_rtt);
        if (ca->max_rtt > mean_rtt)
            ca->max_rtt -= (ca->max_rtt - mean_rtt) >> 9;
    }

and still found that the max_rtt drops considerably within a congestion
epoch.

What would also really help would be getting rid of RTT outliers
somehow.  I ignore RTT measurements when SACK is active:

    if (ca->max_rtt < rtt) {
        struct tcp_sock *tp = tcp_sk(sk);
        if (!tp->sacked_out)    // SACKs cause hi-CPU/hi-RTT. ignore
            ca->max_rtt = rtt;
    }

which helps a lot, but still gets some outliers.

Would it be possible to time-stamp packets in the hardware interrupt
handler, instead of waiting for the post-processing stage?

Cheers,
Lachlan

On 03/12/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote:
> On Wed, 28 Nov 2007 21:26:12 -0800
> "Shao Liu" <shaoliu@Princeton.EDU> wrote:
>
> > Hi Stephen and Lachlan,
> >
> > Thanks for pointing out and fixing this bug.
> >
> > For the max RTT problem, I have considered it also and I have some idea on
> > improve it. I also have some other places to improve. I will summarize all
> > my new ideas and send you an update. For me to change it, could you please
> > give me a link to download to latest source codes for the whole congestion
> > control module in Linux implementation, including the general module for all
> > algorithms, and the implementation for specific algorithms like TCP-Illinois
> > and H-TCP?
> >
> > Thanks for the help!
> > -Shao > > > > > > > > -----Original Message----- > > From: Stephen Hemminger [mailto:shemminger@linux-foundation.org] > > Sent: Wednesday, November 28, 2007 4:44 PM > > To: Lachlan Andrew > > Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith; > > Robert Shorten; netdev@vger.kernel.org > > Subject: Re: [PATCH] tcp-illinois: incorrect beta usage > > > > Lachlan Andrew wrote: > > > Thanks Stephen. > > > > > > A related problem (largely due to the published algorithm itself) is > > > that Illinois is very aggressive when it over-estimates the maximum > > > RTT. > > > > > > At high load (say 200Mbps and 200ms RTT), a backlog of packets builds > > > up just after a loss, causing the RTT estimate to become large. This > > > makes Illinois think that *all* losses are due to corruption not > > > congestion, and so only back off by 1/8 instead of 1/2. > > > > > > I can't think how to fix this except by better RTT estimation, or > > > changes to Illinois itself. Currently, I ignore RTT measurements when > > > sacked_out != 0 and have a heuristic "RTT aging" mechanism, but > > > that's pretty ugly. > > > > > > Cheers, > > > Lachlan > > > > > > > > Ageing the RTT estimates needs to be done anyway. > > Maybe something can be reused from H-TCP. The two are closely related. > > > > The following adds gradual aging of max RTT. > > --- a/net/ipv4/tcp_illinois.c 2007-11-29 08:58:35.000000000 -0800 > +++ b/net/ipv4/tcp_illinois.c 2007-11-29 09:37:33.000000000 -0800 > @@ -63,7 +63,10 @@ static void rtt_reset(struct sock *sk) > ca->cnt_rtt = 0; > ca->sum_rtt = 0; > > - /* TODO: age max_rtt? 
*/ > + /* add slowly fading memory for maxRTT to accommodate routing changes */ > + if (ca->max_rtt > ca->base_rtt) > + ca->max_rtt = ca->base_rtt > + + (((ca->max_rtt - ca->base_rtt) * 31) >> 5); > } > > static void tcp_illinois_init(struct sock *sk) > -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [RFC] TCP illinois max rtt aging
  2007-12-03 23:06 ` Lachlan Andrew
@ 2007-12-03 23:59 ` Shao Liu
  2007-12-04  0:32   ` Stephen Hemminger
  0 siblings, 1 reply; 45+ messages in thread
From: Shao Liu @ 2007-12-03 23:59 UTC (permalink / raw)
To: 'Lachlan Andrew', 'Stephen Hemminger'
Cc: 'David S. Miller', 'Herbert Xu', 'Douglas Leith', 'Robert Shorten', netdev

Hi Stephen and Lachlan,

Thanks for the discussion.  The gradual aging is certainly an option.
Another possibility is to use the RTT measured just before a congestion
notification, which ideally represents the full queueing delay plus the
propagation delay.  We could keep the average of the last M such values
and either use that average as maxRTT, or use it as a benchmark for
judging whether a sample is an outlier.  What do you think of this idea?

Also a question: why are the samples outliers when SACK is active?

For the accuracy of time stamping, I am not very familiar with the
implementation details, but I can think of two approaches: 1) take the
timestamp at as low a layer as possible; 2) use as high-priority a
thread as possible.  For 2), we could use separate threads for
timestamping and for processing packets.

Thanks, and I will let you know more of my thoughts after I go over the
entire code base!
-Shao

-----Original Message-----
From: Lachlan Andrew [mailto:lachlan.andrew@gmail.com]
Sent: Monday, December 03, 2007 3:06 PM
To: Stephen Hemminger
Cc: shaoliu@Princeton.EDU; David S. Miller; Herbert Xu; Douglas Leith; Robert Shorten; netdev@vger.kernel.org
Subject: Re: [RFC] TCP illinois max rtt aging

Greetings Stephen,

Thanks.  We'll have to play with the rate of ageing.  I used the slower
ageing

    if (ca->cnt_rtt > 3) {
        u64 mean_rtt = ca->sum_rtt;
        do_div(mean_rtt, ca->cnt_rtt);
        if (ca->max_rtt > mean_rtt)
            ca->max_rtt -= (ca->max_rtt - mean_rtt) >> 9;
    }

and still found that the max_rtt drops considerably within a congestion
epoch.

What would also really help would be getting rid of RTT outliers
somehow.
I ignore RTT measurements when SACK is active: if (ca->max_rtt < rtt) { struct tcp_sock *tp = tcp_sk(sk); if (! tp->sacked_out ) // SACKs cause hi-CPU/hi-RTT. ignore ca->max_rtt = rtt; } which helps a lot, but still gets some outliers. Would it be possible to time-stamp packets in the hardware interrupt handler, instead of waiting for the post-processing stage? Cheers, Lachlan On 03/12/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote: > On Wed, 28 Nov 2007 21:26:12 -0800 > "Shao Liu" <shaoliu@Princeton.EDU> wrote: > > > Hi Stephen and Lachlan, > > > > Thanks for pointing out and fixing this bug. > > > > For the max RTT problem, I have considered it also and I have some idea on > > improve it. I also have some other places to improve. I will summarize all > > my new ideas and send you an update. For me to change it, could you please > > give me a link to download to latest source codes for the whole congestion > > control module in Linux implementation, including the general module for all > > algorithms, and the implementation for specific algorithms like TCP-Illinois > > and H-TCP? > > > > Thanks for the help! > > -Shao > > > > > > > > -----Original Message----- > > From: Stephen Hemminger [mailto:shemminger@linux-foundation.org] > > Sent: Wednesday, November 28, 2007 4:44 PM > > To: Lachlan Andrew > > Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith; > > Robert Shorten; netdev@vger.kernel.org > > Subject: Re: [PATCH] tcp-illinois: incorrect beta usage > > > > Lachlan Andrew wrote: > > > Thanks Stephen. > > > > > > A related problem (largely due to the published algorithm itself) is > > > that Illinois is very aggressive when it over-estimates the maximum > > > RTT. > > > > > > At high load (say 200Mbps and 200ms RTT), a backlog of packets builds > > > up just after a loss, causing the RTT estimate to become large. 
This > > > makes Illinois think that *all* losses are due to corruption not > > > congestion, and so only back off by 1/8 instead of 1/2. > > > > > > I can't think how to fix this except by better RTT estimation, or > > > changes to Illinois itself. Currently, I ignore RTT measurements when > > > sacked_out != 0 and have a heuristic "RTT aging" mechanism, but > > > that's pretty ugly. > > > > > > Cheers, > > > Lachlan > > > > > > > > Ageing the RTT estimates needs to be done anyway. > > Maybe something can be reused from H-TCP. The two are closely related. > > > > The following adds gradual aging of max RTT. > > --- a/net/ipv4/tcp_illinois.c 2007-11-29 08:58:35.000000000 -0800 > +++ b/net/ipv4/tcp_illinois.c 2007-11-29 09:37:33.000000000 -0800 > @@ -63,7 +63,10 @@ static void rtt_reset(struct sock *sk) > ca->cnt_rtt = 0; > ca->sum_rtt = 0; > > - /* TODO: age max_rtt? */ > + /* add slowly fading memory for maxRTT to accommodate routing changes */ > + if (ca->max_rtt > ca->base_rtt) > + ca->max_rtt = ca->base_rtt > + + (((ca->max_rtt - ca->base_rtt) * 31) >> 5); > } > > static void tcp_illinois_init(struct sock *sk) > -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging
  2007-12-03 23:59 ` Shao Liu
@ 2007-12-04  0:32 ` Stephen Hemminger
  2007-12-04  1:23   ` Lachlan Andrew
  0 siblings, 1 reply; 45+ messages in thread
From: Stephen Hemminger @ 2007-12-04 0:32 UTC (permalink / raw)
To: shaoliu
Cc: 'Lachlan Andrew', 'David S. Miller', 'Herbert Xu', 'Douglas Leith', 'Robert Shorten', netdev

On Mon, 3 Dec 2007 15:59:23 -0800
"Shao Liu" <shaoliu@Princeton.EDU> wrote:

> Hi Stephen and Lachlan,
>
> Thanks for the discussion. The gradual aging is surely an option. And
> another possibility is that, we compute the RTT just before congestion
> notification, which ideally, represent the full queueing delay + propagation
> delay. We can compute the average of the last M such values, and either use
> the average as maxRTT, or use it as a benchmark to judge whether a sample is
> outlier. How do you think of this idea?

The problem with an average like that would be storing enough values to
be useful and choosing how many to store.  Perhaps some form of weighted
sliding average which favors recent values heavily would work best.
Remember that RTTs have a huge noise component, and you are fighting
against the long-tail distribution trying to see the queue effects.

> And also a question, why the samples when SACK is active are outliers?

Any sample with SACK is going to mean a loss or reordering has occurred.
So the SACKed samples themselves should still be useful; it is the RTT
values from retransmits that are not.

> For the accuracy of time stamping, I am not very familiar with the
> implementation details. But I can think of two ways, 1) do time stamp in as
> low layer as possible; 2) use as high priority thread to do it as possible.
> For 2), we can use separate threads to do time stamp and to process packets.

Right now the resolution is in microseconds using the hardware clock.
The clock usage costs a little bit, but makes the math more accurate.
It would be worth exploring sensitivity by taking out RTT_STAMP from the flags field and varying HZ. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging 2007-12-04 0:32 ` Stephen Hemminger @ 2007-12-04 1:23 ` Lachlan Andrew 2007-12-04 8:37 ` Ilpo Järvinen 0 siblings, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2007-12-04 1:23 UTC (permalink / raw) To: Stephen Hemminger Cc: shaoliu, David S. Miller, Herbert Xu, Douglas Leith, Robert Shorten, netdev Greetings, On 03/12/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote: > On Mon, 3 Dec 2007 15:59:23 -0800 > "Shao Liu" <shaoliu@Princeton.EDU> wrote: > > And also a question, why the samples when SACK is active are outliers? > > Any sample with SACK is going to mean a loss or reordering has occurred. > So shouldn't the SACK values be useful, but RTT values from retransmits > are not useful. When SACK is active, the per-packet processing becomes more involved, tracking the list of lost/SACKed packets. This causes a CPU spike just after a loss, which increases the RTTs, at least in my experience. This is a separate issue from the fact that it is hard to get RTT measurements from lost/retransmitted packets themselves. Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging
  2007-12-04  1:23 ` Lachlan Andrew
@ 2007-12-04  8:37 ` Ilpo Järvinen
  2007-12-07  3:27   ` Lachlan Andrew
  0 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2007-12-04 8:37 UTC (permalink / raw)
To: Lachlan Andrew
Cc: Stephen Hemminger, shaoliu, David S. Miller, Herbert Xu, Douglas Leith, Robert Shorten, Netdev

On Mon, 3 Dec 2007, Lachlan Andrew wrote:

> On 03/12/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote:
> > On Mon, 3 Dec 2007 15:59:23 -0800
> > "Shao Liu" <shaoliu@Princeton.EDU> wrote:
> > > And also a question, why the samples when SACK is active are outliers?
> >
> > Any sample with SACK is going to mean a loss or reordering has occurred.
> > So shouldn't the SACK values be useful, but RTT values from retransmits
> > are not useful.
>
> When SACK is active, the per-packet processing becomes more involved,
> tracking the list of lost/SACKed packets.  This causes a CPU spike
> just after a loss, which increases the RTTs, at least in my
> experience.

I suspect that as long as the old code was able to use the hint, it
wasn't doing that badly.  But it was seriously lacking the ability to
take advantage of the SACK processing hint when, e.g., a new hole
appeared or a cumulative ACK arrived.

...The code available in net-2.6.25 might cure those.

-- 
i.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging 2007-12-04 8:37 ` Ilpo Järvinen @ 2007-12-07 3:27 ` Lachlan Andrew 2007-12-07 11:05 ` Ilpo Järvinen 0 siblings, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2007-12-07 3:27 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: Netdev, Tom Quetchenbach Greetings Ilpo, On 04/12/2007, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > On Mon, 3 Dec 2007, Lachlan Andrew wrote: > > > > When SACK is active, the per-packet processing becomes more involved, > > tracking the list of lost/SACKed packets. This causes a CPU spike > > just after a loss, which increases the RTTs, at least in my > > experience. > > I suspect that as long as old code was able to use hint, it wasn't doing > that bad. But it was seriously lacking ability to take advantage of sack > processing hint when e.g., a new hole appeared, or cumulative ACK arrived. > > ...Code available in net-2.6.25 might cure those. We had been using one of your earlier patches, and still had the problem. I think you've cured the problem with SACK itself, but there still seems to be something taking a lot of CPU while recovering from the loss. It is possible that it was to do with web100 which we have also been running, but I cut out most of the statistics from that and still had problems. Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging 2007-12-07 3:27 ` Lachlan Andrew @ 2007-12-07 11:05 ` Ilpo Järvinen 2007-12-07 12:41 ` David Miller 0 siblings, 1 reply; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-07 11:05 UTC (permalink / raw) To: Lachlan Andrew, David Miller; +Cc: Netdev, Tom Quetchenbach [-- Attachment #1: Type: TEXT/PLAIN, Size: 2032 bytes --] On Thu, 6 Dec 2007, Lachlan Andrew wrote: > On 04/12/2007, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > > On Mon, 3 Dec 2007, Lachlan Andrew wrote: > > > > > > When SACK is active, the per-packet processing becomes more involved, > > > tracking the list of lost/SACKed packets. This causes a CPU spike > > > just after a loss, which increases the RTTs, at least in my > > > experience. > > > > I suspect that as long as old code was able to use hint, it wasn't doing > > that bad. But it was seriously lacking ability to take advantage of sack > > processing hint when e.g., a new hole appeared, or cumulative ACK arrived. > > > > ...Code available in net-2.6.25 might cure those. > > We had been using one of your earlier patches, and still had the > problem. I think you've cured the problem with SACK itself, but there > still seems to be something taking a lot of CPU while recovering from > the loss. I guess if you get a large cumulative ACK, the amount of processing is still overwhelming (added DaveM if he has some idea how to combat it). Even a simple scenario (this isn't anything fancy at all, will occur all the time): Just one loss => rest skbs grow one by one into a single very large SACK block (and we do that efficiently for sure) => then the fast retransmit gets delivered and a cumulative ACK for whole orig_window arrives => clean_rtx_queue has to do a lot of processing. In this case we could optimize RB-tree cleanup away (by just blanking it all) but still getting rid of all those skbs is going to take a larger moment than I'd like to see. 
That tree blanking could be extended to cover anything which ACKs more
than half of the tree, by just replacing the root (and dealing with
potential recolorization of the root).

> It is possible that it was to do with web100 which we
> have also been running, but I cut out most of the statistics from that
> and still had problems.

No idea what it could do; I haven't looked at web100 yet, but I was
planning to at some point...

-- 
i.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging
  2007-12-07 11:05 ` Ilpo Järvinen
@ 2007-12-07 12:41 ` David Miller
  2007-12-07 13:05   ` Ilpo Järvinen
  0 siblings, 1 reply; 45+ messages in thread
From: David Miller @ 2007-12-07 12:41 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET)

> I guess if you get a large cumulative ACK, the amount of processing is
> still overwhelming (added DaveM if he has some idea how to combat it).
>
> Even a simple scenario (this isn't anything fancy at all, will occur all
> the time): Just one loss => rest skbs grow one by one into a single
> very large SACK block (and we do that efficiently for sure) => then the
> fast retransmit gets delivered and a cumulative ACK for whole orig_window
> arrives => clean_rtx_queue has to do a lot of processing. In this case we
> could optimize RB-tree cleanup away (by just blanking it all) but still
> getting rid of all those skbs is going to take a larger moment than I'd
> like to see.
>
> That tree blanking could be extended to cover anything which ACK more than
> half of the tree by just replacing the root (and dealing with potential
> recolorization of the root).

Yes, it's the classic problem.  But it ought to be at least partially
masked when TSO is in use, because we'll only process a handful of
SKBs.  The more effectively TSO batches, the less work
clean_rtx_queue() will do.

When not doing TSO the behavior is super-stupid: we bump reference
counts on the same page multiple times while running over the SKBs,
since consecutive SKBs cover data in different spans of the same page.

The core issue is that we have a poorly behaving data container, and
therefore that's obviously what we need to change.  Conceptually what
we probably need to do is separate the data maintenance from the SKB
objects themselves.
There is a blob that maintains the paged data state for everything in the retransmit queue. SKBs are built and get the page pointers but don't actually grab references to the pages, the blob does that and it keeps track of how many SKB references to each page there are, non-atomically. The hardest part is dealing with the page lifetime issues. Unfortunately, when we trim the rtx queue, references to the clones can still exist in the driver output path. It's a difficult problem to overcome in fact, so in the end my suggestion above might not even be workable. > No idea about what it could do, haven't yet looked web100, I was planning > at some point of time... Web100 just provides statistics and other kinds of connection data to userspace, all the actual algorithm etc. modifications have been merged upstream and yanked out of the web100 patch. I was looking at it the other night and it's frankly totally uninteresting these days :-) ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging 2007-12-07 12:41 ` David Miller @ 2007-12-07 13:05 ` Ilpo Järvinen 2007-12-07 18:27 ` Ilpo Järvinen 2007-12-08 1:32 ` David Miller 0 siblings, 2 replies; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-07 13:05 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen [-- Attachment #1: Type: TEXT/PLAIN, Size: 1856 bytes --] On Fri, 7 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET) > > > I guess if you get a large cumulative ACK, the amount of processing is > > still overwhelming (added DaveM if he has some idea how to combat it). > > > > Even a simple scenario (this isn't anything fancy at all, will occur all > > the time): Just one loss => rest skbs grow one by one into a single > > very large SACK block (and we do that efficiently for sure) => then the > > fast retransmit gets delivered and a cumulative ACK for whole orig_window > > arrives => clean_rtx_queue has to do a lot of processing. In this case we > > could optimize RB-tree cleanup away (by just blanking it all) but still > > getting rid of all those skbs is going to take a larger moment than I'd > > like to see. > > > > That tree blanking could be extended to cover anything which ACK more than > > half of the tree by just replacing the root (and dealing with potential > > recolorization of the root). > > Yes, it's the classic problem. But it ought to be at least > partially masked when TSO is in use, because we'll only process > a handful of SKBs. The more effectively TSO batches, the > less work clean_rtx_queue() will do. No, that's not what is going to happen, TSO won't help at all because one-by-one SACKs will fragment every single one of them (see tcp_match_skb_to_sack) :-(. ...So we're back in non-TSO case, or am I missing something? > Web100 just provides statistics and other kinds of connection data > to userspace, all the actual algorithm etc. 
modifications have been > merged upstream and yanked out of the web100 patch. I was looking > at it the other night and it's frankly totally uninteresting these > days :-) ...Thanks, I'll keep that in my mind when looking... :-) -- i. ^ permalink raw reply [flat|nested] 45+ messages in thread
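The clean_rtx_queue cost Ilpo describes — one loss, then a cumulative ACK that releases the whole original window at once — can be modeled in a few lines. The seg struct and function below are invented userspace stand-ins; the point is only that the work done under a single ACK is linear in the number of outstanding segments.

```c
/* Toy illustration: one cumulative ACK forces a walk over every covered
 * segment, so a full-window ACK does O(packets_out) work in one go. */
#include <assert.h>
#include <stdlib.h>

struct seg { unsigned int end_seq; struct seg *next; };

/* Free every segment fully covered by ack_seq; return how many were
 * processed -- the "larger moment than I'd like to see". */
static int clean_rtx_queue_toy(struct seg **head, unsigned int ack_seq)
{
    int work = 0;
    while (*head && (*head)->end_seq <= ack_seq) {
        struct seg *s = *head;
        *head = s->next;
        free(s);
        work++;
    }
    return work;
}
```

With a 100-segment window, the single ACK that ends recovery costs 100 iterations here, versus one per ACK when ACKs arrive smoothly — the batching, not the total, is the problem.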
* Re: [RFC] TCP illinois max rtt aging 2007-12-07 13:05 ` Ilpo Järvinen @ 2007-12-07 18:27 ` Ilpo Järvinen 2007-12-08 1:32 ` David Miller 1 sibling, 0 replies; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-07 18:27 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen [-- Attachment #1: Type: TEXT/PLAIN, Size: 1911 bytes --] On Fri, 7 Dec 2007, Ilpo Järvinen wrote: > On Fri, 7 Dec 2007, David Miller wrote: > > > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > > Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET) > > > > > I guess if you get a large cumulative ACK, the amount of processing is > > > still overwhelming (added DaveM if he has some idea how to combat it). > > > > > > Even a simple scenario (this isn't anything fancy at all, will occur all > > > the time): Just one loss => rest skbs grow one by one into a single > > > very large SACK block (and we do that efficiently for sure) => then the > > > fast retransmit gets delivered and a cumulative ACK for whole orig_window > > > arrives => clean_rtx_queue has to do a lot of processing. In this case we > > > could optimize RB-tree cleanup away (by just blanking it all) but still > > > getting rid of all those skbs is going to take a larger moment than I'd > > > like to see. > > > > > > That tree blanking could be extended to cover anything which ACK more than > > > half of the tree by just replacing the root (and dealing with potential > > > recolorization of the root). > > > > Yes, it's the classic problem. But it ought to be at least > > partially masked when TSO is in use, because we'll only process > > a handful of SKBs. The more effectively TSO batches, the > > less work clean_rtx_queue() will do. > > No, that's not what is going to happen, TSO won't help at all > because one-by-one SACKs will fragment every single one of them > (see tcp_match_skb_to_sack) :-(. ...So we're back in non-TSO > case, or am I missing something? Hmm... 
this could be solved, though, by postponing the fragmentation of a partially SACKed skb while the first SACK block is likely to still grow, which would remove the need for fragmentation. It has some implications for packet processing: it increases burstiness a bit, and tcp_max_burst kicks in too easily. -- i. ^ permalink raw reply [flat|nested] 45+ messages in thread
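The split savings from postponing can be put in back-of-the-envelope form. This is pure arithmetic, not kernel code, and the two helper names are made up: with a GSO skb carrying pcount MSS-sized segments and SACK blocks growing one segment at a time, eager tagging splits the skb at every partially-covering SACK, while postponing until the block stops growing needs at most one split.

```c
/* Model of the fragmentation cost being discussed above. */
#include <assert.h>

/* Splits needed when every partial SACK coverage triggers tcp_fragment. */
static int splits_eager(int pcount)
{
    return pcount > 1 ? pcount - 1 : 0;
}

/* Splits needed if tagging is postponed while the block still grows:
 * at most one, and none once the whole skb ends up covered. */
static int splits_postponed(int pcount, int covered)
{
    return (covered > 0 && covered < pcount) ? 1 : 0;
}
```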
* Re: [RFC] TCP illinois max rtt aging 2007-12-07 13:05 ` Ilpo Järvinen 2007-12-07 18:27 ` Ilpo Järvinen @ 2007-12-08 1:32 ` David Miller 2007-12-11 11:59 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one (Was: Re: [RFC] TCP illinois max rtt aging) Ilpo Järvinen 1 sibling, 1 reply; 45+ messages in thread From: David Miller @ 2007-12-08 1:32 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Fri, 7 Dec 2007 15:05:59 +0200 (EET) > On Fri, 7 Dec 2007, David Miller wrote: > > > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > > Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET) > > > > > I guess if you get a large cumulative ACK, the amount of processing is > > > still overwhelming (added DaveM if he has some idea how to combat it). > > > > > > Even a simple scenario (this isn't anything fancy at all, will occur all > > > the time): Just one loss => rest skbs grow one by one into a single > > > very large SACK block (and we do that efficiently for sure) => then the > > > fast retransmit gets delivered and a cumulative ACK for whole orig_window > > > arrives => clean_rtx_queue has to do a lot of processing. In this case we > > > could optimize RB-tree cleanup away (by just blanking it all) but still > > > getting rid of all those skbs is going to take a larger moment than I'd > > > like to see. > > > > > > That tree blanking could be extended to cover anything which ACK more than > > > half of the tree by just replacing the root (and dealing with potential > > > recolorization of the root). > > > > Yes, it's the classic problem. But it ought to be at least > > partially masked when TSO is in use, because we'll only process > > a handful of SKBs. The more effectively TSO batches, the > > less work clean_rtx_queue() will do. 
> > No, that's not what is going to happen, TSO won't help at all > because one-by-one SACKs will fragment every single one of them > (see tcp_match_skb_to_sack) :-(. ...So we're back in non-TSO > case, or am I missing something? You're of course right, and it's ironic that I wrote the SACK splitting code so I should have known this :-) A possible approach just occurred to me wherein we maintain the SACK state external to the SKBs so that we don't need to mess with them at all. That would allow us to eliminate the TSO splitting but it would not remove the general problem of clean_rtx_queue()'s overhead. I'll try to give some thought to this over the weekend. ^ permalink raw reply [flat|nested] 45+ messages in thread
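Dave's suggestion — SACK state held external to the SKBs — can be reduced to a toy: a scoreboard indexed by segment, so tagging never touches (or splits) the skb that carries the data. The fixed-size array and all names below are invented for illustration; the real structure was still an open question at this point in the thread.

```c
/* Toy external SACK scoreboard: per-segment state lives outside the
 * skbs, so marking a segment SACKed never modifies or fragments it. */
#include <assert.h>

#define MAX_SEGS 64

struct scoreboard {
    unsigned char sacked[MAX_SEGS]; /* 1 = segment reported SACKed */
    int sacked_out;
};

/* Mark segments [lo, hi) SACKed; idempotent for already-marked ones. */
static void scoreboard_mark(struct scoreboard *sb, int lo, int hi)
{
    for (int i = lo; i < hi && i < MAX_SEGS; i++) {
        if (!sb->sacked[i]) {
            sb->sacked[i] = 1;
            sb->sacked_out++;
        }
    }
}
```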
* [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one (Was: Re: [RFC] TCP illinois max rtt aging) 2007-12-08 1:32 ` David Miller @ 2007-12-11 11:59 ` Ilpo Järvinen 2007-12-11 12:32 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one David Miller 0 siblings, 1 reply; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-11 11:59 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen [-- Attachment #1: Type: TEXT/PLAIN, Size: 12766 bytes --] On Fri, 7 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > Date: Fri, 7 Dec 2007 15:05:59 +0200 (EET) > > > On Fri, 7 Dec 2007, David Miller wrote: > > > > > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > > > Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET) > > > > > > > I guess if you get a large cumulative ACK, the amount of processing is > > > > still overwhelming (added DaveM if he has some idea how to combat it). > > > > > > > > Even a simple scenario (this isn't anything fancy at all, will occur all > > > > the time): Just one loss => rest skbs grow one by one into a single > > > > very large SACK block (and we do that efficiently for sure) => then the > > > > fast retransmit gets delivered and a cumulative ACK for whole orig_window > > > > arrives => clean_rtx_queue has to do a lot of processing. In this case we > > > > could optimize RB-tree cleanup away (by just blanking it all) but still > > > > getting rid of all those skbs is going to take a larger moment than I'd > > > > like to see. > > > > > > Yes, it's the classic problem. But it ought to be at least > > > partially masked when TSO is in use, because we'll only process > > > a handful of SKBs. The more effectively TSO batches, the > > > less work clean_rtx_queue() will do. 
> > > > No, that's not what is going to happen, TSO won't help at all > > because one-by-one SACKs will fragment every single one of them > > (see tcp_match_skb_to_sack) :-(. ...So we're back in non-TSO > > case, or am I missing something? > > You're of course right, and it's ironic that I wrote the SACK > splitting code so I should have known this :-) > > A possible approach just occurred to me wherein we maintain > the SACK state external to the SKBs so that we don't need to > mess with them at all. > > That would allow us to eliminate the TSO splitting but it would > not remove the general problem of clean_rtx_queue()'s overhead. > > I'll try to give some thought to this over the weekend. How about this... ...I've left a couple of FIXMEs there still; it should be quite simple & straightforward to handle them if this seems like a viable solution at all. Beware, this doesn't even compile yet because not all parameters are transferred currently (first_sack_index was killed earlier, I need to summon it back for this). Also, I'd need to do the dirty work of killing recv_sack_cache first to make this not produce, well, interesting effects due to missed SACK blocks... :-) Applies cleanly only after this: [TCP]: Push fack_count calculation deeper into functions -- i. -- [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one Because the receiver reports out-of-order segments one-by-one using SACK, tcp_fragment may do a lot of unnecessary splits that would be avoided if the sender could see the upcoming future. Not only does SACK processing suffer, but clean_rtx_queue takes a considerable hit as well when the corresponding cumulative ACK arrives. Thus, implement a local cache for a single skb to avoid enormous splitting effort while the latest SACK block is still growing. Messy enough, other parts must be made aware of this change as well, because the skb state is a bit fuzzy while we have not yet marked it in tcp_sacktag_one. 
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- include/linux/tcp.h | 2 + include/net/tcp.h | 19 ++++++++ net/ipv4/tcp_input.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++-- net/ipv4/tcp_output.c | 7 +++ 4 files changed, 133 insertions(+), 5 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 56342c3..4fbfa46 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -360,6 +360,8 @@ struct tcp_sock { u32 fackets_out; /* FACK'd packets */ u32 high_seq; /* snd_nxt at onset of congestion */ + u32 sack_pending; /* End seqno of postponed SACK tagging */ + u32 retrans_stamp; /* Timestamp of the last retransmit, * also used in SYN-SENT to remember stamp of * the first SYN. */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 5e6c433..e2b88e3 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -462,6 +462,21 @@ extern void tcp_send_delayed_ack(struct sock *sk); /* tcp_input.c */ extern void tcp_cwnd_application_limited(struct sock *sk); +extern void __tcp_process_postponed_sack(struct sock *sk, struct sk_buff *skb); +extern struct sk_buff *tcp_find_postponed_skb(struct sock *sk); + +static inline void tcp_process_postponed_sack(struct sock *sk) +{ + if (tcp_sk(sk)->sack_pending) + __tcp_process_postponed_sack(sk, tcp_find_postponed_skb(sk)); +} + +static inline void tcp_process_postponed_sack_overlapping(struct sock *sk, + struct sk_buff *skb) +{ + if (tcp_sk(sk)->sack_pending && (skb == tcp_find_postponed_skb(sk))) + __tcp_process_postponed_sack(sk, skb); +} /* tcp_timer.c */ extern void tcp_init_xmit_timers(struct sock *); @@ -1625,6 +1640,10 @@ static inline u32 tcp_highest_sack_seq(struct tcp_sock *tp) if (tp->highest_sack == NULL) return tp->snd_nxt; + if (tp->sack_pending && + before(TCP_SKB_CB(tp->highest_sack)->seq, tp->sack_pending)) + return tp->sack_pending; + return TCP_SKB_CB(tp->highest_sack)->seq; } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 977c68c..5bed358 100644 --- 
a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1181,6 +1181,74 @@ static void tcp_mark_lost_retrans(struct sock *sk) tp->lost_retrans_low = new_low_seq; } +struct sk_buff *tcp_find_postponed_skb(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + + if (before(TCP_SKB_CB(tcp_highest_sack(sk))->seq, tp->sack_pending)) + return tcp_highest_sack(sk); + else + return tcp_write_queue_find(sk, tp->sack_pending, 0); +} + +static void tcp_clear_postponed_tweaks(struct sock *sk, struct sk_buff *skb) +{ + struct tcp_sock *tp = tcp_sk(sk); + unsigned int pcount; + + pcount = (tp->sack_pending - TCP_SKB_CB(skb)->seq) / skb_shinfo(skb)->gso_size; + + tp->sacked_out -= pcount; + if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST) + tp->lost_out += pcount; + /* FIXME: This might be unnecessary */ + if (skb == tcp_highest_sack(sk)) + tp->fackets_out -= pcount; +} + +static int tcp_postpone_sack_handling(struct sock *sk, struct sk_buff *skb, u32 end_seq) +{ + struct tcp_sock *tp = tcp_sk(sk); + unsigned int pcount; + + if (tp->sack_pending) { + struct sk_buff *oskb = tcp_find_postponed_skb(sk); + + /* Already considered this area? 
E.g., due to reordering */ + if (unlikely((skb == oskb) && !after(end_seq, tp->sack_pending))) + return 1; + + if (skb != oskb) + __tcp_process_postponed_sack(sk, oskb); + else + tcp_clear_postponed_tweaks(sk, oskb); + } + + /* To be discarded very soon, this case might be improved somehow */ + if (unlikely(!after(TCP_SKB_CB(skb)->seq, tp->snd_una))) + return 0; + + /* 1 out of 2^32 hits this */ + if (unlikely(!end_seq)) + return 0; + + pcount = (end_seq - TCP_SKB_CB(skb)->seq) / skb_shinfo(skb)->gso_size; + + if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST) + tp->lost_out -= pcount; + tp->sacked_out += pcount; + if (!before(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp))) { + unsigned int fack_count_base; + + fack_count_base = TCP_SKB_CB(tcp_write_queue_head(sk))->fack_count; + tp->fackets_out = TCP_SKB_CB(skb)->fack_count - + fack_count_base + pcount; + tp->highest_sack = skb; + } + tp->sack_pending = end_seq; + +} + /* Check if skb is fully within the SACK block. In presence of GSO skbs, * the incoming SACK may not exactly match but we can find smaller MSS * aligned portion of it that matches. Therefore we might need to fragment @@ -1188,7 +1256,7 @@ static void tcp_mark_lost_retrans(struct sock *sk) * returns). 
*/ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, - u32 start_seq, u32 end_seq) + u32 start_seq, u32 end_seq, int likely_grows) { int in_sack, err; unsigned int pkt_len; @@ -1201,10 +1269,14 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq); - if (!in_sack) + if (!in_sack) { pkt_len = start_seq - TCP_SKB_CB(skb)->seq; - else + } else { pkt_len = end_seq - TCP_SKB_CB(skb)->seq; + + if (likely_grows && tcp_postpone_sack_handling(sk, skb, end_seq)) + return -EAGAIN; + } err = tcp_fragment(sk, skb, pkt_len, skb_shinfo(skb)->gso_size); if (err < 0) return err; @@ -1318,6 +1390,7 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, int dup_sack, int *reord, int *flag, int queue) { + struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *next; tcp_for_write_queue_from_safe(skb, next, sk, queue) { @@ -1330,16 +1403,32 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, if (!before(TCP_SKB_CB(skb)->seq, end_seq)) break; - in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq); + in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq, + !dup_sack && first_block_idx); if (unlikely(in_sack < 0)) break; - if (in_sack) + if (in_sack) { + if (tp->sack_pending && skb == tcp_find_postponed_skb(sk)) + tcp_clear_postponed_tweaks(sk, skb); *flag |= tcp_sacktag_one(skb, sk, reord, dup_sack); + } } return skb; } +void __tcp_process_postponed_sack(struct sock *sk, struct sk_buff *skb) +{ + struct tcp_sock *tp = tcp_sk(sk); + int reord = tp->packets_out; + + tcp_clear_postponed_tweaks(sk, skb); + tp->sack_pending = 0; + tcp_sacktag_one(skb, sk, &reord, 0); + + /* FIXME: reord handling here */ +} + /* Avoid all extra work that is being done by sacktag while walking in * a normal way */ @@ -1705,6 +1794,8 @@ void tcp_enter_frto(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; + tcp_process_postponed_sack(sk); 
+ if ((!tp->frto_counter && icsk->icsk_ca_state <= TCP_CA_Disorder) || tp->snd_una == tp->high_seq || ((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) && @@ -1777,6 +1868,8 @@ static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag) struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; + tcp_process_postponed_sack(sk); + tp->lost_out = 0; tp->retrans_out = 0; if (tcp_is_reno(tp)) @@ -1875,10 +1968,12 @@ void tcp_enter_loss(struct sock *sk, int how) /* Push undo marker, if it was plain RTO and nothing * was retransmitted. */ tp->undo_marker = tp->snd_una; + tcp_process_postponed_sack(sk); tcp_clear_retrans_hints_partial(tp); } else { tp->sacked_out = 0; tp->fackets_out = 0; + tp->sack_pending = 0; tcp_clear_all_retrans_hints(tp); tcp_for_write_queue(skb, sk, TCP_WQ_SACKED) { @@ -2171,6 +2266,7 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit) if (tcp_is_fack(tp)) { cnt = fc - fack_count_base + tcp_skb_pcount(skb); } else { + /* FIXME: SACK postponing adaption */ if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) cnt += tcp_skb_pcount(skb); /* Add SACK blocks between this and skb->prev */ @@ -2224,6 +2320,9 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit) skb = tp->scoreboard_skb_hint ? 
tp->scoreboard_skb_hint : tcp_write_queue_head(sk); + /* FIXME: Performance penalty of this likely significant */ + tcp_process_postponed_sack(sk); + tcp_for_write_queue_from(skb, sk, 0) { if (skb == tcp_send_head(sk)) break; @@ -2422,6 +2521,7 @@ static int tcp_try_undo_loss(struct sock *sk) if (tcp_may_undo(tp)) { struct sk_buff *skb; + tcp_for_write_queue(skb, sk, 0) { if (skb == tcp_send_head(sk)) break; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1d83c65..e7cb39b 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -696,6 +696,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) return -ENOMEM; + tcp_process_postponed_sack_overlapping(sk, skb); + /* Get a new skb... force flag on. */ buff = sk_stream_alloc_skb(sk, nsize, GFP_ATOMIC); if (buff == NULL) @@ -1761,6 +1763,8 @@ void tcp_simple_retransmit(struct sock *sk) unsigned int mss = tcp_current_mss(sk, 0); int lost = 0; + tcp_process_postponed_sack(sk); + tcp_for_write_queue(skb, sk, 0) { if (skb == tcp_send_head(sk)) break; @@ -1928,6 +1932,9 @@ void tcp_xmit_retransmit_queue(struct sock *sk) struct sk_buff *skb; int packet_cnt; + /* FIXME: There are better was to do this in here */ + tcp_process_postponed_sack(sk); + if (tp->retransmit_skb_hint) { skb = tp->retransmit_skb_hint; packet_cnt = tp->retransmit_cnt_hint; -- 1.5.0.6 ^ permalink raw reply related [flat|nested] 45+ messages in thread
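The core of the RFC patch above — remember the still-growing SACK block only as a pending end sequence number and flush the real per-skb tagging lazily — can be modeled outside the kernel. All fields and names below are simplified stand-ins for the patch's tcp_sock state (sack_pending, sacked_out), not the actual implementation.

```c
/* Minimal model of the sack_pending cache: the newest SACK block is
 * recorded as an end seqno, and exact scoreboard state is materialized
 * lazily before paths like tcp_enter_loss() that need it. */
#include <assert.h>

struct toy_tp {
    unsigned int sack_pending; /* end seqno of postponed tagging, 0 = none */
    int sacked_out;            /* segments accounted as SACKed */
    int flushed;               /* how many lazy flushes happened */
};

/* Grow the pending block instead of tagging skbs one by one. */
static void postpone_sack(struct toy_tp *tp, unsigned int end_seq,
                          unsigned int mss, unsigned int start_seq)
{
    tp->sack_pending = end_seq;
    tp->sacked_out = (end_seq - start_seq) / mss;
}

/* Called from paths that need exact state (cf. the patch's
 * tcp_process_postponed_sack call sites). */
static void process_postponed_sack(struct toy_tp *tp)
{
    if (tp->sack_pending) {
        tp->sack_pending = 0;
        tp->flushed++;
    }
}
```

The patch's FIXMEs sit exactly where this model is too simple: reordering, retransmission accounting, and the fackets_out bookkeeping.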
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-11 11:59 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one (Was: Re: [RFC] TCP illinois max rtt aging) Ilpo Järvinen @ 2007-12-11 12:32 ` David Miller 2007-12-12 0:14 ` Lachlan Andrew 2007-12-15 9:51 ` SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) Ilpo Järvinen 0 siblings, 2 replies; 45+ messages in thread From: David Miller @ 2007-12-11 12:32 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Tue, 11 Dec 2007 13:59:16 +0200 (EET) > How about this... > > ...I've left couple of FIXMEs there still, should be quite simple & > straightforward to handle them if this seems viable solution at all. > > Beware, this doesn't even compile yet because not all parameters are > transferred currently (first_sack_index was killed earlier, I need to > summon it back for this). Also, I'd need to do the dirty work of kill > recv_sack_cache first to make this to not produce, well, interesting > effects due to missed SACK blocks... :-) Interesting approach, but I think there is limited value to this (arguably) complex form. The core issue is that the data and the SACK state are maintained in the same data structure. The complexity in all the state management and fixups in your patch is purely because of this. If we maintain SACK scoreboard information separately, outside of the SKB, then there are only two changes to make: 1) Every access to the TCP_SKB_CB() SACK scoreboard is adjusted to the new data structure. 2) Retransmit is adjusted so that it can retransmit an SKB constructed as a portion of an existing SKB. 
Since TSO implies SG, this can be handled with simple offset and length arguments and suitable creation of a clone referencing the pages in the SG vector that contain the desired data. I would envision this SACK state thing to reference into the retransmit queue SKB's somehow. Each SACK block could perhaps look something like: struct sack_ref { struct sk_buff *start_skb; struct sk_buff *end_skb; unsigned int start_off; unsigned int len; }; Traditionally we've prioritized the design of the SKB and other infrastructure to suit TCP optimally and I still think we should operate that way. Therefore, long term, it is time to make a formal data blob to assist with all of the repetitive work we do in cases like this and horribly inefficient places like clean_rtx_queue(). So I'm basically advocating a two-pronged approach to this, the seperate SACK scoreboard datastructure and the data blob. I think we can work on the former right now, and take our time with the data blob because it requires lots of thinking and we should get it right as it might have network driver interface implications. ^ permalink raw reply [flat|nested] 45+ messages in thread
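The sack_ref structure Dave proposes above was only an envisioned sketch; here is a userspace rendering of it with a stubbed sk_buff, plus two invented helpers showing how a (skb, offset, length) reference maps back to the sequence range it covers.

```c
/* Userspace sketch of the proposed sack_ref: scoreboard entries live
 * outside the skbs and reference into the retransmit queue.  The
 * sk_buff stub and the helpers are invented for illustration. */
#include <assert.h>

struct sk_buff_stub { unsigned int seq; unsigned int len; };

struct sack_ref {
    struct sk_buff_stub *start_skb;
    struct sk_buff_stub *end_skb;
    unsigned int start_off; /* bytes into start_skb */
    unsigned int len;       /* total bytes covered by the block */
};

/* Translate a sack_ref back into the sequence range it covers. */
static unsigned int sack_ref_start_seq(const struct sack_ref *r)
{
    return r->start_skb->seq + r->start_off;
}

static unsigned int sack_ref_end_seq(const struct sack_ref *r)
{
    return sack_ref_start_seq(r) + r->len;
}
```

Note the representation is byte-based, which is what lets retransmit later clone an arbitrary (offset, length) slice of an SG skb without splitting it first.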
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-11 12:32 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one David Miller @ 2007-12-12 0:14 ` Lachlan Andrew 2007-12-12 15:11 ` David Miller 2007-12-15 9:51 ` SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) Ilpo Järvinen 1 sibling, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2007-12-12 0:14 UTC (permalink / raw) To: David Miller; +Cc: ilpo.jarvinen, netdev, quetchen Greetings all, On 11/12/2007, David Miller <davem@davemloft.net> wrote: > > we should > get it right as it might have network driver interface implications. If you're redoing the driver interface, could I put in a request for packet time-stamping at a lower level? This thread started because TCP processing interferes with RTT estimation. This problem would be eliminated if time-stamping were done as soon as the packet comes off the NIC. $0.02 Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-12 0:14 ` Lachlan Andrew @ 2007-12-12 15:11 ` David Miller 2007-12-12 23:35 ` Lachlan Andrew 0 siblings, 1 reply; 45+ messages in thread From: David Miller @ 2007-12-12 15:11 UTC (permalink / raw) To: lachlan.andrew; +Cc: ilpo.jarvinen, netdev, quetchen From: "Lachlan Andrew" <lachlan.andrew@gmail.com> Date: Tue, 11 Dec 2007 16:14:36 -0800 > This thread started because TCP processing interferes with RTT > estimation. This problem would be eliminated if time-stamping were > done as soon as the packet comes off the NIC. We don't do that because such timestamping is too expensive. It used to be the case that we did this, but we stopped doing that a long time ago. On x86 for example, timestamping can involve touching a slow I/O device to read the timestamp. We do not want to do that for every packet. Also, we timestamp differently for TCP, the global high resolution timestamp is overkill for this purpose. Really, this is a silly idea and would only be a bandaid for the problem at hand, that TCP input processing is too expensive in certain circumstances. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-12 15:11 ` David Miller @ 2007-12-12 23:35 ` Lachlan Andrew 2007-12-12 23:38 ` David Miller 2007-12-13 0:00 ` Stephen Hemminger 0 siblings, 2 replies; 45+ messages in thread From: Lachlan Andrew @ 2007-12-12 23:35 UTC (permalink / raw) To: David Miller; +Cc: ilpo.jarvinen, netdev, quetchen, Darryl Veitch Greetings Dave, On 12/12/2007, David Miller <davem@davemloft.net> wrote: > From: "Lachlan Andrew" <lachlan.andrew@gmail.com> > Date: Tue, 11 Dec 2007 16:14:36 -0800 > > > This thread started because TCP processing interferes with RTT > > estimation. This problem would be eliminated if time-stamping were > > done as soon as the packet comes off the NIC. > > We don't do that because such timestamping is too expensive. > It used to be the case that we did this, but we stopped doing > that a long time ago. > > On x86 for example, timestamping can involve touching a slow > I/O device to read the timestamp. We do not want to do that > for every packet. OK. Thanks for the background. I thought that a TSC read was fairly cheap. Any messing around to interpret it could be the responsibility of any task which actually needs a high-resolution timestamp, couldn't it? If TSC is disabled, then the timestamp field could be set to "invalid". > Also, we timestamp differently for TCP, the global high > resolution timestamp is overkill for this purpose. Overkill for Reno and Cubic, but useful for Vegas, LP, veno, Illinois and YeAH which are all in the kernel. They currently use "high resolution" timestamps which are effectively quantized to the scheduler resolution because of the way timestamping is done -- reading a high-resolution time source when a task is scheduled. > Really, this is a silly idea Oh... :( > and would only be a bandaid > for the problem at hand, that TCP input processing is > too expensive in certain circumstances. 
That problem should certainly be fixed as well -- I wasn't suggesting this as an alternative. Will fixing it fix the problem of those TCP modules suffering from CPU load from other sources? (I'm Cc'ing this to Darryl Veitch who has often wanted driver-level time-stamping for achieving high-resolution synchronization between hosts.) Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-12 23:35 ` Lachlan Andrew @ 2007-12-12 23:38 ` David Miller 2007-12-13 0:00 ` Stephen Hemminger 1 sibling, 0 replies; 45+ messages in thread From: David Miller @ 2007-12-12 23:38 UTC (permalink / raw) To: lachlan.andrew; +Cc: ilpo.jarvinen, netdev, quetchen, d.veitch From: "Lachlan Andrew" <lachlan.andrew@gmail.com> Date: Wed, 12 Dec 2007 15:35:49 -0800 > I thought that a TSC read was fairly cheap. The TSC is cheap, but on the vast majority of systems it isn't usable as a stable time source, and we have to read the ACPI timer instead. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-12 23:35 ` Lachlan Andrew 2007-12-12 23:38 ` David Miller @ 2007-12-13 0:00 ` Stephen Hemminger 1 sibling, 0 replies; 45+ messages in thread From: Stephen Hemminger @ 2007-12-13 0:00 UTC (permalink / raw) To: Lachlan Andrew Cc: David Miller, ilpo.jarvinen, netdev, quetchen, Darryl Veitch On Wed, 12 Dec 2007 15:35:49 -0800 "Lachlan Andrew" <lachlan.andrew@gmail.com> wrote: > Greetings Dave, > > On 12/12/2007, David Miller <davem@davemloft.net> wrote: > > From: "Lachlan Andrew" <lachlan.andrew@gmail.com> > > Date: Tue, 11 Dec 2007 16:14:36 -0800 > > > > > This thread started because TCP processing interferes with RTT > > > estimation. This problem would be eliminated if time-stamping were > > > done as soon as the packet comes off the NIC. > > > > We don't do that because such timestamping is too expensive. > > It used to be the case that we did this, but we stopped doing > > that a long time ago. > > > > On x86 for example, timestamping can involve touching a slow > > I/O device to read the timestamp. We do not want to do that > > for every packet. > > OK. Thanks for the background. > > I thought that a TSC read was fairly cheap. Any messing around to > interpret it could be the responsibility of any task which actually > needs a high-resolution timestamp, couldn't it? If TSC is disabled, > then the timestamp field could be set to "invalid". > > > Also, we timestamp differently for TCP, the global high > > resolution timestamp is overkill for this purpose. > > Overkill for Reno and Cubic, but useful for Vegas, LP, veno, Illinois > and YeAH which are all in the kernel. They currently use "high > resolution" timestamps which are effectively quantized to the > scheduler resolution because of the way timestamping is done -- > reading a high-resolution time source when a task is scheduled. > > > Really, this is a silly idea > > Oh... 
:( > > > and would only be a bandaid > > for the problem at hand, that TCP input processing is > > too expensive in certain circumstances. > > That problem should certainly be fixed as well -- I wasn't suggesting > this as an alternative. Will fixing it fix the problem of those TCP > modules suffering from CPU load from other sources? > > (I'm Cc'ing this to Darryl Veitch who has often wanted driver-level > time-stamping for achieving high-resolution synchronization between > hosts.) > > Cheers, > Lachlan > In the current kernel, the congestion control module can choose the desired accuracy. The units are always the same (microseconds), but if the flag value "TCP_CONG_RTT_STAMP" is set, then TCP tries to use the high resolution real time clock. This is relatively expensive, so only vegas, veno, LP, and YeAH use it. It would be worth an experiment to see whether it is worth it. All the delay-based congestion control algorithms are impractical in the real world, but everyone keeps trying :-) -- Stephen Hemminger <shemminger@linux-foundation.org> ^ permalink raw reply [flat|nested] 45+ messages in thread
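The accuracy trade-off Stephen and Lachlan are debating is easy to quantify: without a high-resolution stamp, a delay-based module effectively sees RTTs quantized to the jiffy. The arithmetic below is plain userspace C (HZ=250 is just a common 2007-era config assumed for the example), not kernel code.

```c
/* Illustration of RTT quantization without TCP_CONG_RTT_STAMP. */
#include <assert.h>

#define HZ 250  /* assumed config; 1 jiffy = 4000 us */

/* RTT as a jiffy-resolution module would see it, in microseconds. */
static long rtt_jiffies_us(long rtt_us)
{
    long per_jiffy = 1000000 / HZ;
    return (rtt_us / per_jiffy) * per_jiffy;
}
```

A 5.3 ms RTT rounds down to 4 ms and anything under one jiffy reads as zero — useless for Vegas-style queueing-delay estimation, which is why those modules pay for the expensive clock.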
* SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) 2007-12-11 12:32 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one David Miller 2007-12-12 0:14 ` Lachlan Andrew @ 2007-12-15 9:51 ` Ilpo Järvinen 2008-01-08 7:36 ` SACK scoreboard David Miller 1 sibling, 1 reply; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-15 9:51 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen On Tue, 11 Dec 2007, David Miller wrote: > Interesting approach, but I think there is limited value to this > (arguably) complex form. > > The core issue is that the data and the SACK state are maintained in > the same datastructure. The complexity in all the state management > and fixups in your patch is purely because of this. Yeah, I just had too much time while waiting for a person who never arrived... :-) It would have covered the typical case quite well, though; for sure it was very intrusive. > If we maintain SACK scoreboard information seperately, outside of > the SKB, then there are only two changes to make: > > 1) Every access to TCP_SKB_CB() SACK scoreboard is adjusted to > new data structure. Not sure if it is going to be all that straightforward, because you seem to ignore in this simple description all the details of performing the linking between the structures, which becomes tricky when we add those retransmission adjustments. > 2) Retransmit is adjusted so that it can retransmit an SKB > constructed as a portion of an existing SKB. Since TSO > implies SG, this can be handled with simple offset and > length arguments and suitable creation of a clone referencing > the pages in the SG vector that contain the desired data. 
We need the origins when a retransmission gets sacked/acked to reduce retrans_out correctly. To solve this, I think the sack_ref must have ->sacked as well, and the structure should store the R-bits there too, and maybe the L-bits as well, though their definition is more obvious because it's mostly a complement of sacked (except for a small part near fackets_out and timedout marking that probably makes bookkeeping of L still necessary). The pcount boundaries are no longer that well defined after we break skb boundaries during retransmitting, so doing this makes the fackets_out calculation even trickier to track, unless the whole pcount is replaced by byte-based fackets_out :-/, which is very trivial to determine from seqnos only (TCP_NODELAYers might get unhappy though if we do that in a straightforward way...). This would clearly be a good step from one perspective. Nowadays GSO'ed skbs may get fragmented when mss_now changes; for two or more consecutive non-SACKed skbs that means one or more extra (small) packets when retransmitting. > I would envision this SACK state thing to reference into the > retransmit queue SKB's somehow. Each SACK block could perhaps > look something like: > > struct sack_ref { > struct sk_buff *start_skb; > struct sk_buff *end_skb; > unsigned int start_off; > unsigned int len; > }; ...I assume that these are fast searchable? And skbs as well (because the linking of start/end_skb at minimum is a costly operation otherwise)? -- i. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2007-12-15 9:51 ` SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) Ilpo Järvinen @ 2008-01-08 7:36 ` David Miller 2008-01-08 12:12 ` Ilpo Järvinen 2008-01-08 16:51 ` John Heffner 0 siblings, 2 replies; 45+ messages in thread From: David Miller @ 2008-01-08 7:36 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen Ilpo, just trying to keep an old conversation from dying off. Did you happen to read a recent blog posting of mine? http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead I've been thinking more and more and I think we might be able to get away with enforcing that SACKs are always increasing in coverage. I doubt there are any real systems out there that drop out of order packets that are properly formed and are in window, even though the SACK specification (foolishly, in my opinion) allows this. If we could free packets as SACK blocks cover them, all the problems go away. For one thing, this will allow the retransmit queue liberation during loss recovery to be spread out over the event, instead of batched up like crazy to the point where the cumulative ACK finally moves and releases an entire window's worth of data. Next, it would simplify all of this scanning code trying to figure out which holes to fill during recovery. And for SACK scoreboard marking, the RB trie would become very nearly unnecessary as far as I can tell. I would not even entertain this kind of crazy idea unless I thought the fundamental complexity simplification payback was enormous. And in this case I think it is. What we could do is put some experimental hack in there for developers to start playing with, which would enforce that SACKs always increase in coverage. If violated, the connection is reset and a verbose log message is logged so we can analyze any cases that occur. Sounds crazy, but maybe has potential. What do you think? 
^ permalink raw reply [flat|nested] 45+ messages in thread
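The freeing-as-SACKed idea David proposes reads, in toy form, like the sketch below. rtx_skb and free_sacked are invented for illustration; under his "hard SACK" rule a covered block can never be reneged, so the sender may release it immediately:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue entry; "freeing" here just means unlinking. */
struct rtx_skb {
    unsigned int seq, len;
    struct rtx_skb *next;
};

/* Unlink and count every skb wholly inside [start_seq, end_seq).
 * In the real stack the unlink would be followed by kfree_skb(). */
static unsigned int free_sacked(struct rtx_skb **head,
                                unsigned int start_seq,
                                unsigned int end_seq)
{
    unsigned int freed = 0;
    struct rtx_skb **pp = head;

    while (*pp) {
        struct rtx_skb *skb = *pp;

        if (skb->seq >= start_seq && skb->seq + skb->len <= end_seq) {
            *pp = skb->next;      /* kfree_skb(skb) would go here */
            freed++;
        } else {
            pp = &skb->next;
        }
    }
    return freed;
}
```

This is what spreads the release work across the recovery event: each arriving SACK block frees a couple of packets instead of the cumulative ACK freeing a whole window at once.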
* Re: SACK scoreboard 2008-01-08 7:36 ` SACK scoreboard David Miller @ 2008-01-08 12:12 ` Ilpo Järvinen 2008-01-09 7:58 ` David Miller 2008-01-08 16:51 ` John Heffner 1 sibling, 1 reply; 45+ messages in thread From: Ilpo Järvinen @ 2008-01-08 12:12 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen On Mon, 7 Jan 2008, David Miller wrote: > Did you happen to read a recent blog posting of mine? > > http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead > > I've been thinking more and more and I think we might be able > to get away with enforcing that SACKs are always increasing in > coverage. > > I doubt there are any real systems out there that drop out of order > packets that are properly formed and are in window, even though the > SACK specification (foolishly, in my opinion) allows this. Luckily we can already see that from the MIBs, so querying people who have large servers (which are continuously "testing" the internet :-)) under their supervision, or that they can access, and asking if they see any, might help. I checked my dept's interactive servers and all had zero renegings, but I don't think I have access to a www server which would have much wider exposure. > If we could free packets as SACK blocks cover them, all the problems > go away. I thought about it a bit yesterday after reading your blog and came to the conclusion that they won't; we can still get those nasty ACKs regardless of received SACK info (here, missing). Even in some valid cases which include ACK losses besides actual data loss; not that this is the most common case, but I just wanted to point out that the cleanup work is at least partially independent of the SACK problem. So not "all" problems would really go away. > For one thing, this will allow the retransmit queue liberation during > loss recovery to be spread out over the event, instead of batched up > like crazy to the point where the cumulative ACK finally moves and > releases an entire window's worth of data. 
Two key cases for real loss patterns are: 1. Losses once per n, where n is something small, like 2-20 or so; this usually happens at slow start overshoot or when competing traffic slow starts. Cumulative ACKs will cover only a small part of the window once rexmits make it through, thus this is not a problem. 2. Single loss (or a few at the beginning of the window), rest SACKed. The cumulative ACK will cover the original window when the last necessary rexmit gets through. Case 1 becomes nasty ACKy only if a rexmit is lost as well, but in that case the arriving SACK blocks make the rest of the window equal to case 2 :-). So I'm now trying to solve just case 2. What if we could somehow "combine" adjacent skbs (or whatever they're called in that model) if SACK covers them both, so that we still hold them but can drop them in a very efficient way. That would make the combining effort split per ACK. And if reneging would occur, we can think of a way to put the necessary fuzz into a form which cannot hurt the rest of the system (relatively easy & fast if we add CA_Reneging and allow retransmitting a portion of an skb similar to what you suggested earlier). And it might even be possible then to offer the admin a control so that the admin can choose between recover/plain reset if the admin thinks that it's always an indication of an attack. This is a somewhat similar case to what UTO (under IETF evaluation) does, as the purpose of both is, in violation of RFC TCP, to avoid malicious traps, but the control over it is left to the user. > Next, it would simplify all of this scanning code trying to figure out > which holes to fill during recovery. > > And for SACK scoreboard marking, the RB trie would become very nearly > unecessary as far as I can tell. I've been contacted by a person who was interested in reaching 500k windows, so your 4000 sounded like a joke :-/. Having, let's say, every 20th dropped means 25k skbs remaining; can we scan through it in any sensible time without RBs and friends :-)? 
However, allowing the queue walk to begin from either direction would solve most of the common cases well enough for it to be nearly manageable. > I would not even entertain this kind of crazy idea unless I thought > the fundamental complexity simplification payback was enormous. And > in this case I think it is. > > What we could do is put some experimental hack in there for developers > to start playing with, which would enforce that SACKs always increase > in coverage. If violated, the connection is reset and a verbose log > message is logged so we can analyze any cases that occur. We have an initial number already, in MIBs. > Sounds crazy, but maybe has potential. What do you think? If I'd hint to my boss that I'm involved in something like this I'd bet that he also would get quite crazy... ;-) I'm partially paid for making TCP more RFCish :-), or at least for making sure that the places where things diverge are known and controllable for research purposes. -- i. ps. If the others Cc'ed would like to be dropped from any followups, just let me know :-). Else, no need to do anything. ^ permalink raw reply [flat|nested] 45+ messages in thread
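Ilpo's combine-adjacent-skbs idea can be sketched with invented types: collapse runs of adjacent fully-SACKed entries into one spanning entry, so the eventual release at the cumulative ACK touches one node per run instead of one per packet. Real code would also have to merge pcount/fackets_out state, which is exactly the tricky part he mentions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical retransmit-queue entry, not kernel code. */
struct rq_ent {
    unsigned int seq, len;
    int sacked;               /* 1 if fully covered by SACK blocks */
    struct rq_ent *next;
};

/* Merge each pair of sequence-adjacent, fully-SACKed neighbours by
 * absorbing the second into the first. */
static void coalesce_sacked(struct rq_ent *head)
{
    struct rq_ent *e = head;

    while (e && e->next) {
        struct rq_ent *n = e->next;

        if (e->sacked && n->sacked && e->seq + e->len == n->seq) {
            e->len += n->len;  /* absorb the neighbour... */
            e->next = n->next; /* ...and drop it from the list */
        } else {
            e = n;
        }
    }
}
```

The combining cost is paid incrementally, one comparison per arriving SACK block, which is the "split per ACK" property he is after.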
* Re: SACK scoreboard 2008-01-08 12:12 ` Ilpo Järvinen @ 2008-01-09 7:58 ` David Miller 0 siblings, 0 replies; 45+ messages in thread From: David Miller @ 2008-01-09 7:58 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Tue, 8 Jan 2008 14:12:47 +0200 (EET) > If I'd hint my boss that I'm involved in something like this I'd > bet that he also would get quite crazy... ;-) I'm partially paid > for making TCP more RFCish :-), or at least that the places where > thing diverge are known and controllable for research purposes. RFCs are great guides by which to implement things, but they have often been completely wrong or not practical to follow strictly. The handling of out of order ACKs with timestamps is my favorite example. Nobody performs an RFC-compliant timestamp check on ACK packets, or else their performance would go into the toilet during packet reordering. The URG bit setting is another one. Especially when, practically speaking, we can in fact make changes like I believe we can here, I think we should. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-08 7:36 ` SACK scoreboard David Miller 2008-01-08 12:12 ` Ilpo Järvinen @ 2008-01-08 16:51 ` John Heffner 2008-01-08 22:44 ` David Miller 1 sibling, 1 reply; 45+ messages in thread From: John Heffner @ 2008-01-08 16:51 UTC (permalink / raw) To: David Miller; +Cc: ilpo.jarvinen, lachlan.andrew, netdev, quetchen David Miller wrote: > Ilpo, just trying to keep an old conversation from dying off. > > Did you happen to read a recent blog posting of mine? > > http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead > > I've been thinking more and more and I think we might be able > to get away with enforcing that SACKs are always increasing in > coverage. > > I doubt there are any real systems out there that drop out of order > packets that are properly formed and are in window, even though the > SACK specification (foolishly, in my opinion) allows this. > > If we could free packets as SACK blocks cover them, all the problems > go away. > > For one thing, this will allow the retransmit queue liberation during > loss recovery to be spread out over the event, instead of batched up > like crazy to the point where the cumulative ACK finally moves and > releases an entire window's worth of data. > > Next, it would simplify all of this scanning code trying to figure out > which holes to fill during recovery. > > And for SACK scoreboard marking, the RB trie would become very nearly > unecessary as far as I can tell. > > I would not even entertain this kind of crazy idea unless I thought > the fundamental complexity simplification payback was enormous. And > in this case I think it is. > > What we could do is put some experimental hack in there for developers > to start playing with, which would enforce that SACKs always increase > in coverage. If violated the connection reset and a verbose log > message is logged so we can analyze any cases that occur. > > Sounds crazy, but maybe has potential. What do you think? 
Linux has a code path where this can happen under memory over-commit, in tcp_prune_queue(). Also, I think one of the motivations for making SACK strictly advisory is that there was some concern about buggy SACK implementations. Keeping data in your retransmit queue allows you to fall back to timeout and go-back-n if things completely fall apart. For better or worse, we have to deal with the spec the way it is. Even if you made this assumption of "hard" SACKs, you still have to worry about large ACKs if SACK is disabled, though I guess you could say people running with large windows without SACK deserve what they get. :) I haven't thought about this too hard, but can we approximate this by moving sacked data into a sacked queue, then if something bad happens merge this back into the retransmit queue? The code will have to deal with non-contiguous data in the retransmit queue; I'm not sure offhand if that violates any assumptions. You still have a single expensive ACK at the end of recovery, though I wonder how much this really hurts. If you want to ameliorate this, you could save this sacked queue to be batch processed later, in application context for instance. -John ^ permalink raw reply [flat|nested] 45+ messages in thread
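John's park-and-merge-back fallback is essentially a sorted-list splice: sacked data parks on a side queue, and if the receiver misbehaves it is merged back into the retransmit queue in sequence order so go-back-n still sees contiguous data. A toy sketch, with seg and merge_back as illustrative names rather than kernel code:

```c
#include <assert.h>
#include <stddef.h>

struct seg {
    unsigned int seq;
    struct seg *next;
};

/* Merge two seq-sorted lists into one seq-sorted list. */
static struct seg *merge_back(struct seg *rtx, struct seg *sacked)
{
    struct seg head = { 0, NULL };
    struct seg *tail = &head;

    while (rtx && sacked) {
        /* Pick whichever list currently holds the lower sequence. */
        struct seg **lo = (rtx->seq <= sacked->seq) ? &rtx : &sacked;

        tail->next = *lo;
        tail = *lo;
        *lo = (*lo)->next;
    }
    tail->next = rtx ? rtx : sacked;
    return head.next;
}
```

The merge itself is linear in the queue length, but it only ever runs on the (supposedly never-taken) reneging path.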
* Re: SACK scoreboard 2008-01-08 16:51 ` John Heffner @ 2008-01-08 22:44 ` David Miller 2008-01-09 1:34 ` Lachlan Andrew 2008-01-09 2:25 ` Andi Kleen 0 siblings, 2 replies; 45+ messages in thread From: David Miller @ 2008-01-08 22:44 UTC (permalink / raw) To: jheffner; +Cc: ilpo.jarvinen, lachlan.andrew, netdev, quetchen From: John Heffner <jheffner@psc.edu> Date: Tue, 08 Jan 2008 11:51:53 -0500 > I haven't thought about this too hard, but can we approximate this by > moving sacked data into a sacked queue, then if something bad happens > merge this back into the retransmit queue? That defeats the impetus for the change. We want to free up the data, say, 2 packets at a time as ACKs come in. The key goal is smooth liberation of retransmit queue packets over time. The big problem is that recovery from even a single packet loss in a window makes us run kfree_skb() for all the packets in a full window's worth of data when recovery completes. If we just move such packets to a separate list, we still have to iterate over all of them when the cumulative ACK arrives. This problem, that retransmit queue liberation is not smooth, is the biggest flaw in how SACK is specified. I mean, consider Ilpo's mentioned case of 500,000 packet windows. The issue cannot be ignored. SACK is clearly broken. You speak of a path in Linux where we can renege on SACKs, but I doubt it really ever runs because of how aggressive the queue collapser is. Alexey even has a comment there: * This must not ever occur. */ To be honest this code sits here because it was written before the queue collapser was coded up. Really, find me a box where the LINUX_MIB_OFOPRUNED or LINUX_MIB_RECVPRUNED counters are anything other than zero. So this is a non-issue and I did consider it before proposing that we redefine SACK. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-08 22:44 ` David Miller @ 2008-01-09 1:34 ` Lachlan Andrew 2008-01-09 6:35 ` David Miller 2008-01-09 2:25 ` Andi Kleen 1 sibling, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2008-01-09 1:34 UTC (permalink / raw) To: David Miller; +Cc: jheffner, ilpo.jarvinen, netdev, quetchen Greetings David, On 08/01/2008, David Miller <davem@davemloft.net> wrote: > From: John Heffner <jheffner@psc.edu> > > > I haven't thought about this too hard, but can we approximate this by > > moving scaked data into a sacked queue, then if something bad happens > > merge this back into the retransmit queue? > > That defeats the impetus for the change. > > We want to free up the data, say, 2 packets at a time as > ACKs come in. The key goal is smooth liberation of > retransmit queue packets over time. John also suggested freeing the packets as a lower priority task, just doing it after they're acknowledged. When the ACK finally comes, you could do something like moving John's entire list of packets to a "to be freed" list, and free a few every time (say) another ACK comes in. $0.02, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 1:34 ` Lachlan Andrew @ 2008-01-09 6:35 ` David Miller 0 siblings, 0 replies; 45+ messages in thread From: David Miller @ 2008-01-09 6:35 UTC (permalink / raw) To: lachlan.andrew; +Cc: jheffner, ilpo.jarvinen, netdev, quetchen From: "Lachlan Andrew" <lachlan.andrew@gmail.com> Date: Tue, 8 Jan 2008 17:34:03 -0800 > John also suggested freeing the packets as a lower priority task, just > doing it after they're acknowledged. > > When the ACK finally comes, you could do something like moving John's > entire list of packets to a "to be freed" list, and free a few every > time (say) another ACK comes in. You could, but this requires extra state and the operations to properly queue such items to the deferred task and wake it up. And none of this would be necessary if we could make SACK indications hard. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-08 22:44 ` David Miller 2008-01-09 1:34 ` Lachlan Andrew @ 2008-01-09 2:25 ` Andi Kleen 2008-01-09 4:27 ` John Heffner 2008-01-09 6:39 ` David Miller 1 sibling, 2 replies; 45+ messages in thread From: Andi Kleen @ 2008-01-09 2:25 UTC (permalink / raw) To: David Miller; +Cc: jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen David Miller <davem@davemloft.net> writes: > > The big problem is that recovery from even a single packet loss in a > window makes us run kfree_skb() for a all the packets in a full > window's worth of data when recovery completes. Why exactly is it a problem to free them all at once? Are you worried about kernel preemption latencies? -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 2:25 ` Andi Kleen @ 2008-01-09 4:27 ` John Heffner 2008-01-09 6:41 ` David Miller 2008-01-09 12:55 ` Ilpo Järvinen 2008-01-09 6:39 ` David Miller 1 sibling, 2 replies; 45+ messages in thread From: John Heffner @ 2008-01-09 4:27 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, ilpo.jarvinen, lachlan.andrew, netdev, quetchen Andi Kleen wrote: > David Miller <davem@davemloft.net> writes: >> The big problem is that recovery from even a single packet loss in a >> window makes us run kfree_skb() for all the packets in a full >> window's worth of data when recovery completes. > > Why exactly is it a problem to free them all at once? Are you worried > about kernel preemption latencies? > > -Andi I also wonder how much of a problem this is (for now, with window sizes of order 10000 packets). My understanding is that the biggest problems arise from O(N^2) time for recovery because every ack was expensive. Have current tests shown the final ack to be a major source of problems? -John ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 4:27 ` John Heffner @ 2008-01-09 6:41 ` David Miller 2008-01-09 14:56 ` John Heffner 2008-01-09 12:55 ` Ilpo Järvinen 1 sibling, 1 reply; 45+ messages in thread From: David Miller @ 2008-01-09 6:41 UTC (permalink / raw) To: jheffner; +Cc: andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen From: John Heffner <jheffner@psc.edu> Date: Tue, 08 Jan 2008 23:27:08 -0500 > I also wonder how much of a problem this is (for now, with window sizes > of order 10000 packets. My understanding is that the biggest problems > arise from O(N^2) time for recovery because every ack was expensive. > Have current tests shown the final ack to be a major source of problems? Yes, several people have reported this. It's a real issue, and it's only going to get worse. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 6:41 ` David Miller @ 2008-01-09 14:56 ` John Heffner 2008-01-09 18:14 ` SANGTAE HA 0 siblings, 1 reply; 45+ messages in thread From: John Heffner @ 2008-01-09 14:56 UTC (permalink / raw) To: David Miller; +Cc: andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen David Miller wrote: > From: John Heffner <jheffner@psc.edu> > Date: Tue, 08 Jan 2008 23:27:08 -0500 > >> I also wonder how much of a problem this is (for now, with window sizes >> of order 10000 packets. My understanding is that the biggest problems >> arise from O(N^2) time for recovery because every ack was expensive. >> Have current tests shown the final ack to be a major source of problems? > > Yes, several people have reported this. I may have missed some of this. Does anyone have a link to some recent data? -John ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 14:56 ` John Heffner @ 2008-01-09 18:14 ` SANGTAE HA 2008-01-09 18:23 ` John Heffner 0 siblings, 1 reply; 45+ messages in thread From: SANGTAE HA @ 2008-01-09 18:14 UTC (permalink / raw) To: John Heffner Cc: David Miller, andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen On Jan 9, 2008 9:56 AM, John Heffner <jheffner@psc.edu> wrote: > >> I also wonder how much of a problem this is (for now, with window sizes > >> of order 10000 packets. My understanding is that the biggest problems > >> arise from O(N^2) time for recovery because every ack was expensive. > >> Have current tests shown the final ack to be a major source of problems? > > > > Yes, several people have reported this. > > I may have missed some of this. Does anyone have a link to some recent > data? I did some testing on this a month ago. A small set of recent results with Linux 2.6.23.9 is at http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency One of the serious cases with a large number of packet losses (initial loss is around 8000 packets) is at http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency/600--TCP-TCP-NONE--400-3-1.0--1000-120-0-0-1-1-5-500--1.0-0.5-133000-73-3000000-0.93-150--3/ Also, there is a comparison among three Linux kernels (2.6.13, 2.6.18-rc4, 2.6.20.3) at http://netsrv.csc.ncsu.edu/wiki/index.php/Efficiency_of_SACK_processing Sangtae ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 18:14 ` SANGTAE HA @ 2008-01-09 18:23 ` John Heffner 0 siblings, 0 replies; 45+ messages in thread From: John Heffner @ 2008-01-09 18:23 UTC (permalink / raw) To: SANGTAE HA Cc: David Miller, andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen SANGTAE HA wrote: > On Jan 9, 2008 9:56 AM, John Heffner <jheffner@psc.edu> wrote: >>>> I also wonder how much of a problem this is (for now, with window sizes >>>> of order 10000 packets. My understanding is that the biggest problems >>>> arise from O(N^2) time for recovery because every ack was expensive. >>>> Have current tests shown the final ack to be a major source of problems? >>> Yes, several people have reported this. >> I may have missed some of this. Does anyone have a link to some recent >> data? > > I had some testing on this a month ago. > A small set of recent results with linux 2.6.23.9 are at > http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency > One of serious cases with a large number of packet losses (initial > loss is around 8000 packets) is at > http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency/600--TCP-TCP-NONE--400-3-1.0--1000-120-0-0-1-1-5-500--1.0-0.5-133000-73-3000000-0.93-150--3/ > > Also, there is a comparison among three Linux kernels (2.6.13, > 2.6.18-rc4, 2.6.20.3) at > http://netsrv.csc.ncsu.edu/wiki/index.php/Efficiency_of_SACK_processing If I'm reading this right, all these tests occur with large amounts of loss and tons of sack processing. What would be most pertinent to this discussion would be a test with a large window, with delayed ack and sack disabled, and a single loss repaired by fast retransmit. This would isolate the "single big ack" processing from other factors such as doubling the ack rate and sack processing. I could probably set up such a test, but I don't want to duplicate effort if someone else already has done something similar. Thanks, -John ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 4:27 ` John Heffner 2008-01-09 6:41 ` David Miller @ 2008-01-09 12:55 ` Ilpo Järvinen 1 sibling, 0 replies; 45+ messages in thread From: Ilpo Järvinen @ 2008-01-09 12:55 UTC (permalink / raw) To: John Heffner; +Cc: Andi Kleen, David Miller, lachlan.andrew, Netdev, quetchen On Tue, 8 Jan 2008, John Heffner wrote: > Andi Kleen wrote: > > David Miller <davem@davemloft.net> writes: > > > The big problem is that recovery from even a single packet loss in a > > > window makes us run kfree_skb() for a all the packets in a full > > > window's worth of data when recovery completes. > > > > Why exactly is it a problem to free them all at once? Are you worried > > about kernel preemption latencies? > > I also wonder how much of a problem this is (for now, with window sizes of > order 10000 packets. My understanding is that the biggest problems arise from > O(N^2) time for recovery because every ack was expensive. Have current > tests shown the final ack to be a major source of problems? This thread got started because I tried to solve the other latencies, but realized that it helps very little because this latency spike would have remained unsolved, and it happens in one of the most common cases. -- i. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 2:25 ` Andi Kleen 2008-01-09 4:27 ` John Heffner @ 2008-01-09 6:39 ` David Miller 2008-01-09 7:03 ` Andi Kleen 1 sibling, 1 reply; 45+ messages in thread From: David Miller @ 2008-01-09 6:39 UTC (permalink / raw) To: andi; +Cc: jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen From: Andi Kleen <andi@firstfloor.org> Date: Wed, 09 Jan 2008 03:25:05 +0100 > David Miller <davem@davemloft.net> writes: > > > > The big problem is that recovery from even a single packet loss in a > > window makes us run kfree_skb() for all the packets in a full > > window's worth of data when recovery completes. > > Why exactly is it a problem to free them all at once? Are you worried > about kernel preemption latencies? If the cpu is there spinning freeing up 500,000 SKBs, it isn't processing RX packets. It adds severe spikes in CPU utilization that at even moderate line rates begin to affect RTTs. Or do you think it's OK to process 500,000 SKBs while locked in a software interrupt? Perhaps you have another broken awk script to prove this :-) ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 6:39 ` David Miller @ 2008-01-09 7:03 ` Andi Kleen 2008-01-09 7:16 ` David Miller 2008-01-09 9:47 ` Evgeniy Polyakov 0 siblings, 2 replies; 45+ messages in thread From: Andi Kleen @ 2008-01-09 7:03 UTC (permalink / raw) To: David Miller Cc: andi, jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen > It adds severe spikes in CPU utilization that are even moderate > line rates begins to affect RTTs. > > Or do you think it's OK to process 500,000 SKBs while locked > in a software interrupt. You can always push it into a work queue. Even push it to other cores if you want. In fact this is already done partly for the ->completion_queue. Wouldn't be a big change to queue it another level down. Also, even freeing a lot of objects doesn't have to be that expensive. I suspect most of the cost is in taking the slab locks, but that could be batched. Without that, the kmem_free fast path isn't particularly expensive, as long as the headers are still in cache. Long ago I had sk_buff optimized for the routing case so that freeing can be all done with a single cache line. That is long gone, but could probably be restored. But asking for protocol changes just to work around such an internal implementation detail is ... <I miss words> > Perhaps you have another broken awk script to prove this :-) Your hand-waved numbers on inline sizes there were definitely worse than mine. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 7:03 ` Andi Kleen @ 2008-01-09 7:16 ` David Miller 0 siblings, 0 replies; 45+ messages in thread From: David Miller @ 2008-01-09 7:16 UTC (permalink / raw) To: andi; +Cc: jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen From: Andi Kleen <andi@firstfloor.org> Date: Wed, 9 Jan 2008 08:03:18 +0100 > Also even freeing a lot of objects doesn't have to be > that expensive. I suspect the most cost is in taking > the slab locks, but that could be batched. We're touching SKB struct members, doing atomics on them, etc. for objects we haven't referenced for at least two RTTs, so they are guaranteed to be cache cold. Let's say best case we can get it down to 2 cache line misses, and as a very aggressive goal we can get the cost down to 100 cycles. For 500,000 packets this is 50 million cpu cycles to free them all up. That's 25 milliseconds even on a 2 GHz cpu. And yes there are inherent costs in handling TCP windows that are 500,000 packets in size. But, that freeing cost should be spread throughout the handling of the RTT feedback, not handled all at once. > Your hand waved numbers on inline sizes there were definitely worse > than mine. Your primary objective seems to be "being right", and that's fine but realize that it makes discussing anything with you about as fun as picking one's toe nails with an ice axe. Eventually you will be ignored by most folks who get fed up by this style of argument. So, have fun being right rather than being pleasant to work with. ^ permalink raw reply [flat|nested] 45+ messages in thread
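The back-of-envelope estimate above can be parameterized. The per-skb free cost is a guess, so it is left as an argument rather than baked in:

```c
#include <assert.h>

/* Stall time spent freeing a window's worth of cache-cold skbs in one
 * go: packets * cycles-per-free / clock rate.  All inputs are
 * assumptions, matching the rough figures in the thread. */
static double stall_seconds(unsigned long pkts, double cycles_per_pkt,
                            double hz)
{
    return pkts * cycles_per_pkt / hz;
}
```

At 100 cycles per free, a 500,000-packet window costs 25 ms of stall on a 2 GHz cpu; at a less optimistic 1,000 cycles it is a quarter of a second, either way concentrated at the cumulative ACK unless the freeing is spread out.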
* Re: SACK scoreboard 2008-01-09 7:03 ` Andi Kleen 2008-01-09 7:16 ` David Miller @ 2008-01-09 9:47 ` Evgeniy Polyakov 2008-01-09 14:02 ` Andi Kleen 1 sibling, 1 reply; 45+ messages in thread From: Evgeniy Polyakov @ 2008-01-09 9:47 UTC (permalink / raw) To: Andi Kleen Cc: David Miller, jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen Hi. On Wed, Jan 09, 2008 at 08:03:18AM +0100, Andi Kleen (andi@firstfloor.org) wrote: > > It adds severe spikes in CPU utilization that are even moderate > > line rates begins to affect RTTs. > > > > Or do you think it's OK to process 500,000 SKBs while locked > > in a software interrupt. > > You can always push it into a work queue. Even put it to > other cores if you want. > > In fact this is already done partly for the ->completion_queue. > Wouldn't be a big change to queue it another level down. > > Also even freeing a lot of objects doesn't have to be > that expensive. I suspect the most cost is in taking > the slab locks, but that could be batched. Without > that the kmem_free fast path isn't particularly > expensive, as long as the headers are still in cache. Postponing freeing of the skb has major drawbacks. Some time ago I made a patch to postpone skb freeing behind RCU and got 2.5 times slower connection speed on some machines, though with decreased CPU usage. So, a queueing solution has to be proven with real data, and although it looks good in one situation, it can be really bad in another. For the interested reader: results of the RCUfication of kfree_skbmem() http://tservice.net.ru/~s0mbre/blog/devel/networking/2006/12/05 -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 9:47 ` Evgeniy Polyakov @ 2008-01-09 14:02 ` Andi Kleen 0 siblings, 0 replies; 45+ messages in thread From: Andi Kleen @ 2008-01-09 14:02 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Andi Kleen, David Miller, jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen > Postponing freeing of the skb has major drawbacks. Some time ago I Yes, the trick would be to make sure that it also does not tie up too much memory. e.g. it would need some throttling at least. Also the fast path of kmem_cache_free() is actually not that much different from just putting something on a list so perhaps it would not make that much difference. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
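Andi's throttling point, in toy form: defer frees onto a list, but drain whenever the backlog passes a cap, so the deferred memory stays bounded. All names here are invented for illustration; the real analogue would sit near the softirq ->completion_queue handling:

```c
#include <assert.h>
#include <stddef.h>

#define DEFER_CAP 4   /* arbitrary cap for the sketch */

struct node { struct node *next; };

static struct node *defer_head;
static unsigned int defer_len;
static unsigned int drained;     /* stands in for actual free calls */

/* Release everything on the deferred list at once. */
static void drain(void)
{
    while (defer_head) {
        struct node *n = defer_head;
        defer_head = n->next;
        drained++;               /* real code: kfree_skb(n) */
    }
    defer_len = 0;
}

/* Queue a free; drain in batches once the backlog hits the cap. */
static void defer_free(struct node *n)
{
    n->next = defer_head;
    defer_head = n;
    if (++defer_len >= DEFER_CAP)
        drain();
}
```

As Evgeniy's RCU experiment shows, whether a scheme like this wins or loses depends heavily on the workload, so it would need measurement, not just reasoning.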
* Re: [PATCH] tcp-illinois: incorrect beta usage 2007-11-28 23:47 ` [PATCH] tcp-illinois: incorrect beta usage Stephen Hemminger 2007-11-29 0:25 ` Lachlan Andrew @ 2007-11-29 14:12 ` Herbert Xu 1 sibling, 0 replies; 45+ messages in thread From: Herbert Xu @ 2007-11-29 14:12 UTC (permalink / raw) To: Stephen Hemminger Cc: Julian Shao Liu, David S. Miller, 'Lachlan Andrew', shaoliu, 'Douglas Leith', 'Robert Shorten', netdev On Wed, Nov 28, 2007 at 03:47:25PM -0800, Stephen Hemminger wrote: > Lachlan Andrew observed that my TCP-Illinois implementation uses the > beta value incorrectly: > The parameter beta in the paper specifies the amount to decrease > *by*: that is, on loss, > W <- W - beta*W > but in tcp_illinois_ssthresh() uses beta as the amount > to decrease *to*: W <- beta*W > > This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network, > hurting performance. Note: since the base beta value is .5, it has no > impact on a congested network. > > Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Applied to net-2.6. Thanks Stephen! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 45+ messages in thread
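The beta fix from the start of this thread is easy to demonstrate in fixed point. BETA_SHIFT = 6 is an assumption here (the patch only shows the shift being used); with beta = 1/8 (8/64), the decrease-to form cuts cwnd to 12.5% of its value instead of to the intended 87.5%:

```c
#include <assert.h>

#define BETA_SHIFT 6   /* assumed fixed-point shift, beta in 1/64 units */

/* The buggy form: decrease *to* beta*W. */
static unsigned int ssthresh_buggy(unsigned int cwnd, unsigned int beta)
{
    unsigned int v = (cwnd * beta) >> BETA_SHIFT;
    return v > 2 ? v : 2;
}

/* The fixed form: decrease *by* beta*W, i.e. W <- W - beta*W. */
static unsigned int ssthresh_fixed(unsigned int cwnd, unsigned int beta)
{
    unsigned int v = cwnd - ((cwnd * beta) >> BETA_SHIFT);
    return v > 2 ? v : 2;
}
```

At beta = 1/2 (32/64) both forms coincide, which is why the bug was invisible on congested paths where Illinois uses the base beta.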
* Re: SACK scoreboard @ 2008-01-09 6:04 linux 0 siblings, 0 replies; 45+ messages in thread From: linux @ 2008-01-09 6:04 UTC (permalink / raw) To: netdev; +Cc: linux Just some idle brainstorming on the subject... It seems the only way to handle network pipes significantly larger (delay * bandwidth product) than the processor cache is to make freeing retransmit data o(n). Now, there are some ways to reduce the constant factor. The one that comes to mind first is to not queue sk_buffs. Throw away the struct sk_buff after transmission and just queue skb_frag_structs, pages, or maybe even higher-order pages of data. Then freeing the data when it's acked has a much smaller constant factor, particularly d-cache footprint, and no slab operations. The downside is more work to recreate the skb if you do have to retransmit, but optimizing for retransmits is silly. Some implementations could leave large chunks of memory locked until all of the sk_buff->skb_shared_info->skb_frag_structs referencing them have gone away, but you can look at the transmit window when deciding how big a chunk size to use. Then, to actually get below O(n), you want to keep the queued data in a data structure known to the memory manager. Basically, splice the retransmit queue onto the free list. It may require some kludgery in the memory manager. In particular, doing that in O(1) time obviously means that you can't coalesce adjacent free regions to build higher-order pages. So you'd have to have a threshold for uncoalesced pages and a way to force coalescing under memory pressure. You're just deferring work until the page is allocated, but the point is that then it's okay to bring it into cache when it's about to be used again. It's the redundant round trip through the cache just because an ack arrived that's annoying. 
I've done this sort of thing with specialized fixed-block-size allocators before (an alpha-beta minimax search tree allocates nodes one at a time, but frees whole subtrees at once), but might it be feasible for kernel use? ^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2008-01-09 18:23 UTC | newest]
Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <aa7d2c6d0711261023m3d2dd850o76a8f44aef022f39@mail.gmail.com>
[not found] ` <001001c83063$9adbc9d0$d5897e82@csp.uiuc.edu>
2007-11-28 23:47 ` [PATCH] tcp-illinois: incorrect beta usage Stephen Hemminger
2007-11-29 0:25 ` Lachlan Andrew
2007-11-29 0:43 ` Stephen Hemminger
2007-11-29 5:26 ` Shao Liu
2007-12-03 22:52 ` [RFC] TCP illinois max rtt aging Stephen Hemminger
2007-12-03 23:06 ` Lachlan Andrew
2007-12-03 23:59 ` Shao Liu
2007-12-04 0:32 ` Stephen Hemminger
2007-12-04 1:23 ` Lachlan Andrew
2007-12-04 8:37 ` Ilpo Järvinen
2007-12-07 3:27 ` Lachlan Andrew
2007-12-07 11:05 ` Ilpo Järvinen
2007-12-07 12:41 ` David Miller
2007-12-07 13:05 ` Ilpo Järvinen
2007-12-07 18:27 ` Ilpo Järvinen
2007-12-08 1:32 ` David Miller
2007-12-11 11:59 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one (Was: Re: [RFC] TCP illinois max rtt aging) Ilpo Järvinen
2007-12-11 12:32 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one David Miller
2007-12-12 0:14 ` Lachlan Andrew
2007-12-12 15:11 ` David Miller
2007-12-12 23:35 ` Lachlan Andrew
2007-12-12 23:38 ` David Miller
2007-12-13 0:00 ` Stephen Hemminger
2007-12-15 9:51 ` SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) Ilpo Järvinen
2008-01-08 7:36 ` SACK scoreboard David Miller
2008-01-08 12:12 ` Ilpo Järvinen
2008-01-09 7:58 ` David Miller
2008-01-08 16:51 ` John Heffner
2008-01-08 22:44 ` David Miller
2008-01-09 1:34 ` Lachlan Andrew
2008-01-09 6:35 ` David Miller
2008-01-09 2:25 ` Andi Kleen
2008-01-09 4:27 ` John Heffner
2008-01-09 6:41 ` David Miller
2008-01-09 14:56 ` John Heffner
2008-01-09 18:14 ` SANGTAE HA
2008-01-09 18:23 ` John Heffner
2008-01-09 12:55 ` Ilpo Järvinen
2008-01-09 6:39 ` David Miller
2008-01-09 7:03 ` Andi Kleen
2008-01-09 7:16 ` David Miller
2008-01-09 9:47 ` Evgeniy Polyakov
2008-01-09 14:02 ` Andi Kleen
2007-11-29 14:12 ` [PATCH] tcp-illinois: incorrect beta usage Herbert Xu
2008-01-09 6:04 SACK scoreboard linux