* [PATCH] tcp-illinois: incorrect beta usage
  [not found] ` <001001c83063$9adbc9d0$d5897e82@csp.uiuc.edu>
@ 2007-11-28 23:47 ` Stephen Hemminger
  2007-11-29  0:25   ` Lachlan Andrew
  2007-11-29 14:12   ` [PATCH] tcp-illinois: incorrect beta usage Herbert Xu
  0 siblings, 2 replies; 45+ messages in thread
From: Stephen Hemminger @ 2007-11-28 23:47 UTC (permalink / raw)
To: Julian Shao Liu, David S. Miller, Herbert Xu
Cc: 'Lachlan Andrew', shaoliu, 'Douglas Leith', 'Robert Shorten', netdev

Lachlan Andrew observed that my TCP-Illinois implementation uses the
beta value incorrectly.  The parameter beta in the paper specifies the
amount to decrease *by*: that is, on loss,

	W <- W - beta*W

but tcp_illinois_ssthresh() uses beta as the amount to decrease *to*:

	W <- beta*W

This bug makes the Linux TCP-Illinois less aggressive on an uncongested
network, hurting performance.  Note: since the base beta value is .5, it
has no impact on a congested network.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>

--- a/net/ipv4/tcp_illinois.c	2007-08-18 07:50:15.000000000 -0700
+++ b/net/ipv4/tcp_illinois.c	2007-11-28 15:39:04.000000000 -0800
@@ -298,7 +298,7 @@ static u32 tcp_illinois_ssthresh(struct
 	struct illinois *ca = inet_csk_ca(sk);
 
 	/* Multiplicative decrease */
-	return max((tp->snd_cwnd * ca->beta) >> BETA_SHIFT, 2U);
+	return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->beta) >> BETA_SHIFT), 2U);
 }

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH] tcp-illinois: incorrect beta usage 2007-11-28 23:47 ` [PATCH] tcp-illinois: incorrect beta usage Stephen Hemminger @ 2007-11-29 0:25 ` Lachlan Andrew 2007-11-29 0:43 ` Stephen Hemminger 2007-11-29 14:12 ` [PATCH] tcp-illinois: incorrect beta usage Herbert Xu 1 sibling, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2007-11-29 0:25 UTC (permalink / raw) To: Stephen Hemminger Cc: David S. Miller, Herbert Xu, shaoliu, Douglas Leith, Robert Shorten, netdev Thanks Stephen. A related problem (largely due to the published algorithm itself) is that Illinois is very aggressive when it over-estimates the maximum RTT. At high load (say 200Mbps and 200ms RTT), a backlog of packets builds up just after a loss, causing the RTT estimate to become large. This makes Illinois think that *all* losses are due to corruption not congestion, and so only back off by 1/8 instead of 1/2. I can't think how to fix this except by better RTT estimation, or changes to Illinois itself. Currently, I ignore RTT measurements when sacked_out != 0 and have a heuristic "RTT aging" mechanism, but that's pretty ugly. Cheers, Lachlan On 28/11/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote: > Lachlan Andrew observed that my TCP-Illinois implementation uses the > beta value incorrectly: > The parameter beta in the paper specifies the amount to decrease > *by*: that is, on loss, > W <- W - beta*W > but in tcp_illinois_ssthresh() uses beta as the amount > to decrease *to*: W <- beta*W > > This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network, > hurting performance. Note: since the base beta value is .5, it has no > impact on a congested network. 
> > Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> > > --- a/net/ipv4/tcp_illinois.c 2007-08-18 07:50:15.000000000 -0700 > +++ b/net/ipv4/tcp_illinois.c 2007-11-28 15:39:04.000000000 -0800 > @@ -298,7 +298,7 @@ static u32 tcp_illinois_ssthresh(struct > struct illinois *ca = inet_csk_ca(sk); > > /* Multiplicative decrease */ > - return max((tp->snd_cwnd * ca->beta) >> BETA_SHIFT, 2U); > + return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->beta) >> BETA_SHIFT), 2U); > } > > > -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] tcp-illinois: incorrect beta usage 2007-11-29 0:25 ` Lachlan Andrew @ 2007-11-29 0:43 ` Stephen Hemminger 2007-11-29 5:26 ` Shao Liu 0 siblings, 1 reply; 45+ messages in thread From: Stephen Hemminger @ 2007-11-29 0:43 UTC (permalink / raw) To: Lachlan Andrew Cc: David S. Miller, Herbert Xu, shaoliu, Douglas Leith, Robert Shorten, netdev Lachlan Andrew wrote: > Thanks Stephen. > > A related problem (largely due to the published algorithm itself) is > that Illinois is very aggressive when it over-estimates the maximum > RTT. > > At high load (say 200Mbps and 200ms RTT), a backlog of packets builds > up just after a loss, causing the RTT estimate to become large. This > makes Illinois think that *all* losses are due to corruption not > congestion, and so only back off by 1/8 instead of 1/2. > > I can't think how to fix this except by better RTT estimation, or > changes to Illinois itself. Currently, I ignore RTT measurements when > sacked_out != 0 and have a heuristic "RTT aging" mechanism, but > that's pretty ugly. > > Cheers, > Lachlan > > Ageing the RTT estimates needs to be done anyway. Maybe something can be reused from H-TCP. The two are closely related. ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH] tcp-illinois: incorrect beta usage
  2007-11-29  0:43 ` Stephen Hemminger
@ 2007-11-29  5:26 ` Shao Liu
  2007-12-03 22:52   ` [RFC] TCP illinois max rtt aging Stephen Hemminger
  0 siblings, 1 reply; 45+ messages in thread
From: Shao Liu @ 2007-11-29 5:26 UTC (permalink / raw)
To: 'Stephen Hemminger', 'Lachlan Andrew'
Cc: 'David S. Miller', 'Herbert Xu', 'Douglas Leith', 'Robert Shorten', netdev

Hi Stephen and Lachlan,

Thanks for pointing out and fixing this bug.

For the max RTT problem, I have considered it as well and have some
ideas for improving it.  There are also some other places to improve; I
will summarize all my new ideas and send you an update.  So that I can
make the changes, could you please give me a link to download the latest
source code for the whole congestion control module in the Linux
implementation, including the general module shared by all algorithms
and the implementations of specific algorithms like TCP-Illinois and
H-TCP?

Thanks for the help!
-Shao

-----Original Message-----
From: Stephen Hemminger [mailto:shemminger@linux-foundation.org]
Sent: Wednesday, November 28, 2007 4:44 PM
To: Lachlan Andrew
Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith; Robert Shorten; netdev@vger.kernel.org
Subject: Re: [PATCH] tcp-illinois: incorrect beta usage

Lachlan Andrew wrote:
> Thanks Stephen.
>
> A related problem (largely due to the published algorithm itself) is
> that Illinois is very aggressive when it over-estimates the maximum
> RTT.
>
> At high load (say 200Mbps and 200ms RTT), a backlog of packets builds
> up just after a loss, causing the RTT estimate to become large.  This
> makes Illinois think that *all* losses are due to corruption not
> congestion, and so only back off by 1/8 instead of 1/2.
>
> I can't think how to fix this except by better RTT estimation, or
> changes to Illinois itself.  Currently, I ignore RTT measurements when
> sacked_out != 0 and have a heuristic "RTT aging" mechanism, but
> that's pretty ugly.
> > Cheers, > Lachlan > > Ageing the RTT estimates needs to be done anyway. Maybe something can be reused from H-TCP. The two are closely related. ^ permalink raw reply [flat|nested] 45+ messages in thread
* [RFC] TCP illinois max rtt aging 2007-11-29 5:26 ` Shao Liu @ 2007-12-03 22:52 ` Stephen Hemminger 2007-12-03 23:06 ` Lachlan Andrew 0 siblings, 1 reply; 45+ messages in thread From: Stephen Hemminger @ 2007-12-03 22:52 UTC (permalink / raw) To: shaoliu Cc: 'Lachlan Andrew', 'David S. Miller', 'Herbert Xu', 'Douglas Leith', 'Robert Shorten', netdev On Wed, 28 Nov 2007 21:26:12 -0800 "Shao Liu" <shaoliu@Princeton.EDU> wrote: > Hi Stephen and Lachlan, > > Thanks for pointing out and fixing this bug. > > For the max RTT problem, I have considered it also and I have some idea on > improve it. I also have some other places to improve. I will summarize all > my new ideas and send you an update. For me to change it, could you please > give me a link to download to latest source codes for the whole congestion > control module in Linux implementation, including the general module for all > algorithms, and the implementation for specific algorithms like TCP-Illinois > and H-TCP? > > Thanks for the help! > -Shao > > > > -----Original Message----- > From: Stephen Hemminger [mailto:shemminger@linux-foundation.org] > Sent: Wednesday, November 28, 2007 4:44 PM > To: Lachlan Andrew > Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith; > Robert Shorten; netdev@vger.kernel.org > Subject: Re: [PATCH] tcp-illinois: incorrect beta usage > > Lachlan Andrew wrote: > > Thanks Stephen. > > > > A related problem (largely due to the published algorithm itself) is > > that Illinois is very aggressive when it over-estimates the maximum > > RTT. > > > > At high load (say 200Mbps and 200ms RTT), a backlog of packets builds > > up just after a loss, causing the RTT estimate to become large. This > > makes Illinois think that *all* losses are due to corruption not > > congestion, and so only back off by 1/8 instead of 1/2. > > > > I can't think how to fix this except by better RTT estimation, or > > changes to Illinois itself. 
Currently, I ignore RTT measurements when
> > sacked_out != 0 and have a heuristic "RTT aging" mechanism, but
> > that's pretty ugly.
> >
> > Cheers,
> > Lachlan
>
> Ageing the RTT estimates needs to be done anyway.
> Maybe something can be reused from H-TCP. The two are closely related.

The following adds gradual aging of max RTT.

--- a/net/ipv4/tcp_illinois.c	2007-11-29 08:58:35.000000000 -0800
+++ b/net/ipv4/tcp_illinois.c	2007-11-29 09:37:33.000000000 -0800
@@ -63,7 +63,10 @@ static void rtt_reset(struct sock *sk)
 	ca->cnt_rtt = 0;
 	ca->sum_rtt = 0;
 
-	/* TODO: age max_rtt? */
+	/* add slowly fading memory for maxRTT to accommodate routing changes */
+	if (ca->max_rtt > ca->base_rtt)
+		ca->max_rtt = ca->base_rtt
+			+ (((ca->max_rtt - ca->base_rtt) * 31) >> 5);
 }
 
 static void tcp_illinois_init(struct sock *sk)

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging
  2007-12-03 22:52 ` [RFC] TCP illinois max rtt aging Stephen Hemminger
@ 2007-12-03 23:06 ` Lachlan Andrew
  2007-12-03 23:59   ` Shao Liu
  0 siblings, 1 reply; 45+ messages in thread
From: Lachlan Andrew @ 2007-12-03 23:06 UTC (permalink / raw)
To: Stephen Hemminger
Cc: shaoliu, David S. Miller, Herbert Xu, Douglas Leith, Robert Shorten, netdev

Greetings Stephen,

Thanks.  We'll have to play with the rate of ageing.  I used the slower
ageing

    if (ca->cnt_rtt > 3) {
        u64 mean_rtt = ca->sum_rtt;
        do_div(mean_rtt, ca->cnt_rtt);
        if (ca->max_rtt > mean_rtt)
            ca->max_rtt -= (ca->max_rtt - mean_rtt) >> 9;
    }

and still found that the max_rtt drops considerably within a congestion
epoch.

What would also really help would be getting rid of RTT outliers
somehow.  I ignore RTT measurements when SACK is active:

    if (ca->max_rtt < rtt) {
        struct tcp_sock *tp = tcp_sk(sk);
        if (!tp->sacked_out)    // SACKs cause hi-CPU/hi-RTT. ignore
            ca->max_rtt = rtt;
    }

which helps a lot, but still gets some outliers.

Would it be possible to time-stamp packets in the hardware interrupt
handler, instead of waiting for the post-processing stage?

Cheers,
Lachlan

On 03/12/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote:
> On Wed, 28 Nov 2007 21:26:12 -0800
> "Shao Liu" <shaoliu@Princeton.EDU> wrote:
>
> > Hi Stephen and Lachlan,
> >
> > Thanks for pointing out and fixing this bug.
> >
> > For the max RTT problem, I have considered it also and I have some idea on
> > improve it. I also have some other places to improve. I will summarize all
> > my new ideas and send you an update. For me to change it, could you please
> > give me a link to download to latest source codes for the whole congestion
> > control module in Linux implementation, including the general module for all
> > algorithms, and the implementation for specific algorithms like TCP-Illinois
> > and H-TCP?
> >
> > Thanks for the help!
> > -Shao > > > > > > > > -----Original Message----- > > From: Stephen Hemminger [mailto:shemminger@linux-foundation.org] > > Sent: Wednesday, November 28, 2007 4:44 PM > > To: Lachlan Andrew > > Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith; > > Robert Shorten; netdev@vger.kernel.org > > Subject: Re: [PATCH] tcp-illinois: incorrect beta usage > > > > Lachlan Andrew wrote: > > > Thanks Stephen. > > > > > > A related problem (largely due to the published algorithm itself) is > > > that Illinois is very aggressive when it over-estimates the maximum > > > RTT. > > > > > > At high load (say 200Mbps and 200ms RTT), a backlog of packets builds > > > up just after a loss, causing the RTT estimate to become large. This > > > makes Illinois think that *all* losses are due to corruption not > > > congestion, and so only back off by 1/8 instead of 1/2. > > > > > > I can't think how to fix this except by better RTT estimation, or > > > changes to Illinois itself. Currently, I ignore RTT measurements when > > > sacked_out != 0 and have a heuristic "RTT aging" mechanism, but > > > that's pretty ugly. > > > > > > Cheers, > > > Lachlan > > > > > > > > Ageing the RTT estimates needs to be done anyway. > > Maybe something can be reused from H-TCP. The two are closely related. > > > > The following adds gradual aging of max RTT. > > --- a/net/ipv4/tcp_illinois.c 2007-11-29 08:58:35.000000000 -0800 > +++ b/net/ipv4/tcp_illinois.c 2007-11-29 09:37:33.000000000 -0800 > @@ -63,7 +63,10 @@ static void rtt_reset(struct sock *sk) > ca->cnt_rtt = 0; > ca->sum_rtt = 0; > > - /* TODO: age max_rtt? 
*/ > + /* add slowly fading memory for maxRTT to accommodate routing changes */ > + if (ca->max_rtt > ca->base_rtt) > + ca->max_rtt = ca->base_rtt > + + (((ca->max_rtt - ca->base_rtt) * 31) >> 5); > } > > static void tcp_illinois_init(struct sock *sk) > -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [RFC] TCP illinois max rtt aging
  2007-12-03 23:06 ` Lachlan Andrew
@ 2007-12-03 23:59 ` Shao Liu
  2007-12-04  0:32   ` Stephen Hemminger
  0 siblings, 1 reply; 45+ messages in thread
From: Shao Liu @ 2007-12-03 23:59 UTC (permalink / raw)
To: 'Lachlan Andrew', 'Stephen Hemminger'
Cc: 'David S. Miller', 'Herbert Xu', 'Douglas Leith', 'Robert Shorten', netdev

Hi Stephen and Lachlan,

Thanks for the discussion.  The gradual aging is certainly an option.
Another possibility is to use the RTT measured just before a congestion
notification, which ideally represents the full queueing delay plus the
propagation delay.  We could keep the average of the last M such values
and either use that average as maxRTT, or use it as a benchmark for
judging whether a sample is an outlier.  What do you think of this idea?

Also a question: why are the samples outliers when SACK is active?

For the accuracy of time stamping, I am not very familiar with the
implementation details, but I can think of two approaches: 1) take the
timestamp at as low a layer as possible; 2) use as high-priority a
thread as possible.  For 2), we could use separate threads for
timestamping and for processing packets.

Thanks, and I will let you know more of my thoughts after I go over the
entire code base!
-Shao

-----Original Message-----
From: Lachlan Andrew [mailto:lachlan.andrew@gmail.com]
Sent: Monday, December 03, 2007 3:06 PM
To: Stephen Hemminger
Cc: shaoliu@Princeton.EDU; David S. Miller; Herbert Xu; Douglas Leith; Robert Shorten; netdev@vger.kernel.org
Subject: Re: [RFC] TCP illinois max rtt aging

Greetings Stephen,

Thanks.  We'll have to play with the rate of ageing.  I used the slower
ageing

    if (ca->cnt_rtt > 3) {
        u64 mean_rtt = ca->sum_rtt;
        do_div(mean_rtt, ca->cnt_rtt);
        if (ca->max_rtt > mean_rtt)
            ca->max_rtt -= (ca->max_rtt - mean_rtt) >> 9;
    }

and still found that the max_rtt drops considerably within a congestion
epoch.

What would also really help would be getting rid of RTT outliers
somehow.
I ignore RTT measurements when SACK is active: if (ca->max_rtt < rtt) { struct tcp_sock *tp = tcp_sk(sk); if (! tp->sacked_out ) // SACKs cause hi-CPU/hi-RTT. ignore ca->max_rtt = rtt; } which helps a lot, but still gets some outliers. Would it be possible to time-stamp packets in the hardware interrupt handler, instead of waiting for the post-processing stage? Cheers, Lachlan On 03/12/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote: > On Wed, 28 Nov 2007 21:26:12 -0800 > "Shao Liu" <shaoliu@Princeton.EDU> wrote: > > > Hi Stephen and Lachlan, > > > > Thanks for pointing out and fixing this bug. > > > > For the max RTT problem, I have considered it also and I have some idea on > > improve it. I also have some other places to improve. I will summarize all > > my new ideas and send you an update. For me to change it, could you please > > give me a link to download to latest source codes for the whole congestion > > control module in Linux implementation, including the general module for all > > algorithms, and the implementation for specific algorithms like TCP-Illinois > > and H-TCP? > > > > Thanks for the help! > > -Shao > > > > > > > > -----Original Message----- > > From: Stephen Hemminger [mailto:shemminger@linux-foundation.org] > > Sent: Wednesday, November 28, 2007 4:44 PM > > To: Lachlan Andrew > > Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith; > > Robert Shorten; netdev@vger.kernel.org > > Subject: Re: [PATCH] tcp-illinois: incorrect beta usage > > > > Lachlan Andrew wrote: > > > Thanks Stephen. > > > > > > A related problem (largely due to the published algorithm itself) is > > > that Illinois is very aggressive when it over-estimates the maximum > > > RTT. > > > > > > At high load (say 200Mbps and 200ms RTT), a backlog of packets builds > > > up just after a loss, causing the RTT estimate to become large. 
This > > > makes Illinois think that *all* losses are due to corruption not > > > congestion, and so only back off by 1/8 instead of 1/2. > > > > > > I can't think how to fix this except by better RTT estimation, or > > > changes to Illinois itself. Currently, I ignore RTT measurements when > > > sacked_out != 0 and have a heuristic "RTT aging" mechanism, but > > > that's pretty ugly. > > > > > > Cheers, > > > Lachlan > > > > > > > > Ageing the RTT estimates needs to be done anyway. > > Maybe something can be reused from H-TCP. The two are closely related. > > > > The following adds gradual aging of max RTT. > > --- a/net/ipv4/tcp_illinois.c 2007-11-29 08:58:35.000000000 -0800 > +++ b/net/ipv4/tcp_illinois.c 2007-11-29 09:37:33.000000000 -0800 > @@ -63,7 +63,10 @@ static void rtt_reset(struct sock *sk) > ca->cnt_rtt = 0; > ca->sum_rtt = 0; > > - /* TODO: age max_rtt? */ > + /* add slowly fading memory for maxRTT to accommodate routing changes */ > + if (ca->max_rtt > ca->base_rtt) > + ca->max_rtt = ca->base_rtt > + + (((ca->max_rtt - ca->base_rtt) * 31) >> 5); > } > > static void tcp_illinois_init(struct sock *sk) > -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging
  2007-12-03 23:59 ` Shao Liu
@ 2007-12-04  0:32 ` Stephen Hemminger
  2007-12-04  1:23   ` Lachlan Andrew
  0 siblings, 1 reply; 45+ messages in thread
From: Stephen Hemminger @ 2007-12-04 0:32 UTC (permalink / raw)
To: shaoliu
Cc: 'Lachlan Andrew', 'David S. Miller', 'Herbert Xu', 'Douglas Leith', 'Robert Shorten', netdev

On Mon, 3 Dec 2007 15:59:23 -0800
"Shao Liu" <shaoliu@Princeton.EDU> wrote:

> Hi Stephen and Lachlan,
>
> Thanks for the discussion. The gradual aging is surely an option. And
> another possibility is that, we compute the RTT just before congestion
> notification, which ideally, represent the full queueing delay + propagation
> delay. We can compute the average of the last M such values, and either use
> the average as maxRTT, or use it as a benchmark to judge whether a sample is
> outlier. How do you think of this idea?

The problem with an average like that would be storing enough values to
be useful and choosing how many to store.  Perhaps some form of weighted
sliding average which favors recent values heavily would work best.
Remember that RTTs have a huge noise component, and you are fighting
against the long-tail distribution trying to see the queue effects.

> And also a question, why the samples when SACK is active are outliers?

Any sample with SACK is going to mean a loss or reordering has occurred.
So the SACKed samples themselves should still be useful; it is the RTT
values from retransmits that are not.

> For the accuracy of time stamping, I am not very familiar with the
> implementation details. But I can think of two ways, 1) do time stamp in as
> low layer as possible; 2) use as high priority thread to do it as possible.
> For 2), we can use separate threads to do time stamp and to process packets.

Right now the resolution is in microseconds using the hardware clock.
The clock usage costs a little bit, but makes the math more accurate.
It would be worth exploring sensitivity by taking out RTT_STAMP from the flags field and varying HZ. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging 2007-12-04 0:32 ` Stephen Hemminger @ 2007-12-04 1:23 ` Lachlan Andrew 2007-12-04 8:37 ` Ilpo Järvinen 0 siblings, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2007-12-04 1:23 UTC (permalink / raw) To: Stephen Hemminger Cc: shaoliu, David S. Miller, Herbert Xu, Douglas Leith, Robert Shorten, netdev Greetings, On 03/12/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote: > On Mon, 3 Dec 2007 15:59:23 -0800 > "Shao Liu" <shaoliu@Princeton.EDU> wrote: > > And also a question, why the samples when SACK is active are outliers? > > Any sample with SACK is going to mean a loss or reordering has occurred. > So shouldn't the SACK values be useful, but RTT values from retransmits > are not useful. When SACK is active, the per-packet processing becomes more involved, tracking the list of lost/SACKed packets. This causes a CPU spike just after a loss, which increases the RTTs, at least in my experience. This is a separate issue from the fact that it is hard to get RTT measurements from lost/retransmitted packets themselves. Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging
  2007-12-04  1:23 ` Lachlan Andrew
@ 2007-12-04  8:37 ` Ilpo Järvinen
  2007-12-07  3:27   ` Lachlan Andrew
  0 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2007-12-04 8:37 UTC (permalink / raw)
To: Lachlan Andrew
Cc: Stephen Hemminger, shaoliu, David S. Miller, Herbert Xu, Douglas Leith, Robert Shorten, Netdev

On Mon, 3 Dec 2007, Lachlan Andrew wrote:

> On 03/12/2007, Stephen Hemminger <shemminger@linux-foundation.org> wrote:
> > On Mon, 3 Dec 2007 15:59:23 -0800
> > "Shao Liu" <shaoliu@Princeton.EDU> wrote:
> > > And also a question, why the samples when SACK is active are outliers?
> >
> > Any sample with SACK is going to mean a loss or reordering has occurred.
> > So shouldn't the SACK values be useful, but RTT values from retransmits
> > are not useful.
>
> When SACK is active, the per-packet processing becomes more involved,
> tracking the list of lost/SACKed packets.  This causes a CPU spike
> just after a loss, which increases the RTTs, at least in my
> experience.

I suspect that as long as the old code was able to use the hint, it
wasn't doing that badly.  But it was seriously lacking the ability to
take advantage of the SACK processing hint when, e.g., a new hole
appeared or a cumulative ACK arrived.

...The code available in net-2.6.25 might cure those.

-- 
i.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging 2007-12-04 8:37 ` Ilpo Järvinen @ 2007-12-07 3:27 ` Lachlan Andrew 2007-12-07 11:05 ` Ilpo Järvinen 0 siblings, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2007-12-07 3:27 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: Netdev, Tom Quetchenbach Greetings Ilpo, On 04/12/2007, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > On Mon, 3 Dec 2007, Lachlan Andrew wrote: > > > > When SACK is active, the per-packet processing becomes more involved, > > tracking the list of lost/SACKed packets. This causes a CPU spike > > just after a loss, which increases the RTTs, at least in my > > experience. > > I suspect that as long as old code was able to use hint, it wasn't doing > that bad. But it was seriously lacking ability to take advantage of sack > processing hint when e.g., a new hole appeared, or cumulative ACK arrived. > > ...Code available in net-2.6.25 might cure those. We had been using one of your earlier patches, and still had the problem. I think you've cured the problem with SACK itself, but there still seems to be something taking a lot of CPU while recovering from the loss. It is possible that it was to do with web100 which we have also been running, but I cut out most of the statistics from that and still had problems. Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging 2007-12-07 3:27 ` Lachlan Andrew @ 2007-12-07 11:05 ` Ilpo Järvinen 2007-12-07 12:41 ` David Miller 0 siblings, 1 reply; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-07 11:05 UTC (permalink / raw) To: Lachlan Andrew, David Miller; +Cc: Netdev, Tom Quetchenbach [-- Attachment #1: Type: TEXT/PLAIN, Size: 2032 bytes --] On Thu, 6 Dec 2007, Lachlan Andrew wrote: > On 04/12/2007, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > > On Mon, 3 Dec 2007, Lachlan Andrew wrote: > > > > > > When SACK is active, the per-packet processing becomes more involved, > > > tracking the list of lost/SACKed packets. This causes a CPU spike > > > just after a loss, which increases the RTTs, at least in my > > > experience. > > > > I suspect that as long as old code was able to use hint, it wasn't doing > > that bad. But it was seriously lacking ability to take advantage of sack > > processing hint when e.g., a new hole appeared, or cumulative ACK arrived. > > > > ...Code available in net-2.6.25 might cure those. > > We had been using one of your earlier patches, and still had the > problem. I think you've cured the problem with SACK itself, but there > still seems to be something taking a lot of CPU while recovering from > the loss. I guess if you get a large cumulative ACK, the amount of processing is still overwhelming (added DaveM if he has some idea how to combat it). Even a simple scenario (this isn't anything fancy at all, will occur all the time): Just one loss => rest skbs grow one by one into a single very large SACK block (and we do that efficiently for sure) => then the fast retransmit gets delivered and a cumulative ACK for whole orig_window arrives => clean_rtx_queue has to do a lot of processing. In this case we could optimize RB-tree cleanup away (by just blanking it all) but still getting rid of all those skbs is going to take a larger moment than I'd like to see. 
That tree blanking could be extended to cover anything which ACKs more
than half of the tree, by just replacing the root (and dealing with
potential recolorization of the root).

> It is possible that it was to do with web100 which we
> have also been running, but I cut out most of the statistics from that
> and still had problems.

No idea what it could do; I haven't looked at web100 yet, but I was
planning to at some point...

-- 
i.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging
  2007-12-07 11:05 ` Ilpo Järvinen
@ 2007-12-07 12:41 ` David Miller
  2007-12-07 13:05   ` Ilpo Järvinen
  0 siblings, 1 reply; 45+ messages in thread
From: David Miller @ 2007-12-07 12:41 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET)

> I guess if you get a large cumulative ACK, the amount of processing is
> still overwhelming (added DaveM if he has some idea how to combat it).
>
> Even a simple scenario (this isn't anything fancy at all, will occur all
> the time): Just one loss => rest skbs grow one by one into a single
> very large SACK block (and we do that efficiently for sure) => then the
> fast retransmit gets delivered and a cumulative ACK for whole orig_window
> arrives => clean_rtx_queue has to do a lot of processing. In this case we
> could optimize RB-tree cleanup away (by just blanking it all) but still
> getting rid of all those skbs is going to take a larger moment than I'd
> like to see.
>
> That tree blanking could be extended to cover anything which ACK more than
> half of the tree by just replacing the root (and dealing with potential
> recolorization of the root).

Yes, it's the classic problem.  But it ought to be at least partially
masked when TSO is in use, because we'll only process a handful of
SKBs.  The more effectively TSO batches, the less work
clean_rtx_queue() will do.

When not doing TSO the behavior is super-stupid: we bump reference
counts on the same page multiple times while running over the SKBs,
since consecutive SKBs cover data in different spans of the same page.

The core issue is that we have a poorly behaving data container, and
therefore that's obviously what we need to change.  Conceptually what
we probably need to do is separate the data maintenance from the SKB
objects themselves.
There is a blob that maintains the paged data state for everything in the retransmit queue. SKBs are built and get the page pointers but don't actually grab references to the pages, the blob does that and it keeps track of how many SKB references to each page there are, non-atomically. The hardest part is dealing with the page lifetime issues. Unfortunately, when we trim the rtx queue, references to the clones can still exist in the driver output path. It's a difficult problem to overcome in fact, so in the end my suggestion above might not even be workable. > No idea about what it could do, haven't yet looked web100, I was planning > at some point of time... Web100 just provides statistics and other kinds of connection data to userspace, all the actual algorithm etc. modifications have been merged upstream and yanked out of the web100 patch. I was looking at it the other night and it's frankly totally uninteresting these days :-) ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC] TCP illinois max rtt aging 2007-12-07 12:41 ` David Miller @ 2007-12-07 13:05 ` Ilpo Järvinen 2007-12-07 18:27 ` Ilpo Järvinen 2007-12-08 1:32 ` David Miller 0 siblings, 2 replies; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-07 13:05 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen [-- Attachment #1: Type: TEXT/PLAIN, Size: 1856 bytes --] On Fri, 7 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET) > > > I guess if you get a large cumulative ACK, the amount of processing is > > still overwhelming (added DaveM if he has some idea how to combat it). > > > > Even a simple scenario (this isn't anything fancy at all, will occur all > > the time): Just one loss => rest skbs grow one by one into a single > > very large SACK block (and we do that efficiently for sure) => then the > > fast retransmit gets delivered and a cumulative ACK for whole orig_window > > arrives => clean_rtx_queue has to do a lot of processing. In this case we > > could optimize RB-tree cleanup away (by just blanking it all) but still > > getting rid of all those skbs is going to take a larger moment than I'd > > like to see. > > > > That tree blanking could be extended to cover anything which ACK more than > > half of the tree by just replacing the root (and dealing with potential > > recolorization of the root). > > Yes, it's the classic problem. But it ought to be at least > partially masked when TSO is in use, because we'll only process > a handful of SKBs. The more effectively TSO batches, the > less work clean_rtx_queue() will do. No, that's not what is going to happen, TSO won't help at all because one-by-one SACKs will fragment every single one of them (see tcp_match_skb_to_sack) :-(. ...So we're back in non-TSO case, or am I missing something? > Web100 just provides statistics and other kinds of connection data > to userspace, all the actual algorithm etc. 
modifications have been > merged upstream and yanked out of the web100 patch. I was looking > at it the other night and it's frankly totally uninteresting these > days :-) ...Thanks, I'll keep that in my mind when looking... :-) -- i. ^ permalink raw reply [flat|nested] 45+ messages in thread
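The clean_rtx_queue cost Ilpo describes — one loss, then a cumulative ACK that releases the whole original window at once — can be modeled in a few lines. The seg struct and function below are invented userspace stand-ins; the point is only that the work done under a single ACK is linear in the number of outstanding segments.

```c
/* Toy illustration: one cumulative ACK forces a walk over every covered
 * segment, so a full-window ACK does O(packets_out) work in one go. */
#include <assert.h>
#include <stdlib.h>

struct seg { unsigned int end_seq; struct seg *next; };

/* Free every segment fully covered by ack_seq; return how many were
 * processed -- the "larger moment than I'd like to see". */
static int clean_rtx_queue_toy(struct seg **head, unsigned int ack_seq)
{
    int work = 0;
    while (*head && (*head)->end_seq <= ack_seq) {
        struct seg *s = *head;
        *head = s->next;
        free(s);
        work++;
    }
    return work;
}
```

With a 100-segment window, the single ACK that ends recovery costs 100 iterations here, versus one per ACK when ACKs arrive smoothly — the batching, not the total, is the problem.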
* Re: [RFC] TCP illinois max rtt aging 2007-12-07 13:05 ` Ilpo Järvinen @ 2007-12-07 18:27 ` Ilpo Järvinen 2007-12-08 1:32 ` David Miller 1 sibling, 0 replies; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-07 18:27 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen [-- Attachment #1: Type: TEXT/PLAIN, Size: 1911 bytes --] On Fri, 7 Dec 2007, Ilpo Järvinen wrote: > On Fri, 7 Dec 2007, David Miller wrote: > > > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > > Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET) > > > > > I guess if you get a large cumulative ACK, the amount of processing is > > > still overwhelming (added DaveM if he has some idea how to combat it). > > > > > > Even a simple scenario (this isn't anything fancy at all, will occur all > > > the time): Just one loss => rest skbs grow one by one into a single > > > very large SACK block (and we do that efficiently for sure) => then the > > > fast retransmit gets delivered and a cumulative ACK for whole orig_window > > > arrives => clean_rtx_queue has to do a lot of processing. In this case we > > > could optimize RB-tree cleanup away (by just blanking it all) but still > > > getting rid of all those skbs is going to take a larger moment than I'd > > > like to see. > > > > > > That tree blanking could be extended to cover anything which ACK more than > > > half of the tree by just replacing the root (and dealing with potential > > > recolorization of the root). > > > > Yes, it's the classic problem. But it ought to be at least > > partially masked when TSO is in use, because we'll only process > > a handful of SKBs. The more effectively TSO batches, the > > less work clean_rtx_queue() will do. > > No, that's not what is going to happen, TSO won't help at all > because one-by-one SACKs will fragment every single one of them > (see tcp_match_skb_to_sack) :-(. ...So we're back in non-TSO > case, or am I missing something? Hmm... 
this could be solved, though, by postponing the fragmentation of a partially SACKed skb while the first SACK block is likely to still grow, which would remove the need for fragmentation. It has some implications for packet processing: it increases burstiness a bit, and tcp_max_burst kicks in too easily. -- i. ^ permalink raw reply [flat|nested] 45+ messages in thread
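The split savings from postponing can be put in back-of-the-envelope form. This is pure arithmetic, not kernel code, and the two helper names are made up: with a GSO skb carrying pcount MSS-sized segments and SACK blocks growing one segment at a time, eager tagging splits the skb at every partially-covering SACK, while postponing until the block stops growing needs at most one split.

```c
/* Model of the fragmentation cost being discussed above. */
#include <assert.h>

/* Splits needed when every partial SACK coverage triggers tcp_fragment. */
static int splits_eager(int pcount)
{
    return pcount > 1 ? pcount - 1 : 0;
}

/* Splits needed if tagging is postponed while the block still grows:
 * at most one, and none once the whole skb ends up covered. */
static int splits_postponed(int pcount, int covered)
{
    return (covered > 0 && covered < pcount) ? 1 : 0;
}
```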
* Re: [RFC] TCP illinois max rtt aging 2007-12-07 13:05 ` Ilpo Järvinen 2007-12-07 18:27 ` Ilpo Järvinen @ 2007-12-08 1:32 ` David Miller 2007-12-11 11:59 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one (Was: Re: [RFC] TCP illinois max rtt aging) Ilpo Järvinen 1 sibling, 1 reply; 45+ messages in thread From: David Miller @ 2007-12-08 1:32 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Fri, 7 Dec 2007 15:05:59 +0200 (EET) > On Fri, 7 Dec 2007, David Miller wrote: > > > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > > Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET) > > > > > I guess if you get a large cumulative ACK, the amount of processing is > > > still overwhelming (added DaveM if he has some idea how to combat it). > > > > > > Even a simple scenario (this isn't anything fancy at all, will occur all > > > the time): Just one loss => rest skbs grow one by one into a single > > > very large SACK block (and we do that efficiently for sure) => then the > > > fast retransmit gets delivered and a cumulative ACK for whole orig_window > > > arrives => clean_rtx_queue has to do a lot of processing. In this case we > > > could optimize RB-tree cleanup away (by just blanking it all) but still > > > getting rid of all those skbs is going to take a larger moment than I'd > > > like to see. > > > > > > That tree blanking could be extended to cover anything which ACK more than > > > half of the tree by just replacing the root (and dealing with potential > > > recolorization of the root). > > > > Yes, it's the classic problem. But it ought to be at least > > partially masked when TSO is in use, because we'll only process > > a handful of SKBs. The more effectively TSO batches, the > > less work clean_rtx_queue() will do. 
> > No, that's not what is going to happen, TSO won't help at all > because one-by-one SACKs will fragment every single one of them > (see tcp_match_skb_to_sack) :-(. ...So we're back in non-TSO > case, or am I missing something? You're of course right, and it's ironic that I wrote the SACK splitting code so I should have known this :-) A possible approach just occurred to me wherein we maintain the SACK state external to the SKBs so that we don't need to mess with them at all. That would allow us to eliminate the TSO splitting but it would not remove the general problem of clean_rtx_queue()'s overhead. I'll try to give some thought to this over the weekend. ^ permalink raw reply [flat|nested] 45+ messages in thread
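Dave's suggestion — SACK state held external to the SKBs — can be reduced to a toy: a scoreboard indexed by segment, so tagging never touches (or splits) the skb that carries the data. The fixed-size array and all names below are invented for illustration; the real structure was still an open question at this point in the thread.

```c
/* Toy external SACK scoreboard: per-segment state lives outside the
 * skbs, so marking a segment SACKed never modifies or fragments it. */
#include <assert.h>

#define MAX_SEGS 64

struct scoreboard {
    unsigned char sacked[MAX_SEGS]; /* 1 = segment reported SACKed */
    int sacked_out;
};

/* Mark segments [lo, hi) SACKed; idempotent for already-marked ones. */
static void scoreboard_mark(struct scoreboard *sb, int lo, int hi)
{
    for (int i = lo; i < hi && i < MAX_SEGS; i++) {
        if (!sb->sacked[i]) {
            sb->sacked[i] = 1;
            sb->sacked_out++;
        }
    }
}
```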
* [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one (Was: Re: [RFC] TCP illinois max rtt aging) 2007-12-08 1:32 ` David Miller @ 2007-12-11 11:59 ` Ilpo Järvinen 2007-12-11 12:32 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one David Miller 0 siblings, 1 reply; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-11 11:59 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen [-- Attachment #1: Type: TEXT/PLAIN, Size: 12766 bytes --] On Fri, 7 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > Date: Fri, 7 Dec 2007 15:05:59 +0200 (EET) > > > On Fri, 7 Dec 2007, David Miller wrote: > > > > > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> > > > Date: Fri, 7 Dec 2007 13:05:46 +0200 (EET) > > > > > > > I guess if you get a large cumulative ACK, the amount of processing is > > > > still overwhelming (added DaveM if he has some idea how to combat it). > > > > > > > > Even a simple scenario (this isn't anything fancy at all, will occur all > > > > the time): Just one loss => rest skbs grow one by one into a single > > > > very large SACK block (and we do that efficiently for sure) => then the > > > > fast retransmit gets delivered and a cumulative ACK for whole orig_window > > > > arrives => clean_rtx_queue has to do a lot of processing. In this case we > > > > could optimize RB-tree cleanup away (by just blanking it all) but still > > > > getting rid of all those skbs is going to take a larger moment than I'd > > > > like to see. > > > > > > Yes, it's the classic problem. But it ought to be at least > > > partially masked when TSO is in use, because we'll only process > > > a handful of SKBs. The more effectively TSO batches, the > > > less work clean_rtx_queue() will do. 
> > > > No, that's not what is going to happen, TSO won't help at all > > because one-by-one SACKs will fragment every single one of them > > (see tcp_match_skb_to_sack) :-(. ...So we're back in non-TSO > > case, or am I missing something? > > You're of course right, and it's ironic that I wrote the SACK > splitting code so I should have known this :-) > > A possible approach just occurred to me wherein we maintain > the SACK state external to the SKBs so that we don't need to > mess with them at all. > > That would allow us to eliminate the TSO splitting but it would > not remove the general problem of clean_rtx_queue()'s overhead. > > I'll try to give some thought to this over the weekend. How about this... ...I've left a couple of FIXMEs there still; it should be quite simple & straightforward to handle them if this seems like a viable solution at all. Beware, this doesn't even compile yet because not all parameters are transferred currently (first_sack_index was killed earlier, I need to summon it back for this). Also, I'd need to do the dirty work of killing recv_sack_cache first to make this not produce, well, interesting effects due to missed SACK blocks... :-) Applies cleanly only after this: [TCP]: Push fack_count calculation deeper into functions -- i. -- [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one Because the receiver reports out-of-order segments one-by-one using SACK, tcp_fragment may do a lot of unnecessary splits that would be avoided if the sender could see the upcoming future. Not only does SACK processing suffer, but clean_rtx_queue takes a considerable hit as well when the corresponding cumulative ACK arrives. Thus, implement a local cache for a single skb to avoid enormous splitting effort while the latest SACK block is still growing. Messy enough, other parts must be made aware of this change as well, because the skb state is a bit fuzzy while we have not yet marked it in tcp_sacktag_one. 
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- include/linux/tcp.h | 2 + include/net/tcp.h | 19 ++++++++ net/ipv4/tcp_input.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++-- net/ipv4/tcp_output.c | 7 +++ 4 files changed, 133 insertions(+), 5 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 56342c3..4fbfa46 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -360,6 +360,8 @@ struct tcp_sock { u32 fackets_out; /* FACK'd packets */ u32 high_seq; /* snd_nxt at onset of congestion */ + u32 sack_pending; /* End seqno of postponed SACK tagging */ + u32 retrans_stamp; /* Timestamp of the last retransmit, * also used in SYN-SENT to remember stamp of * the first SYN. */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 5e6c433..e2b88e3 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -462,6 +462,21 @@ extern void tcp_send_delayed_ack(struct sock *sk); /* tcp_input.c */ extern void tcp_cwnd_application_limited(struct sock *sk); +extern void __tcp_process_postponed_sack(struct sock *sk, struct sk_buff *skb); +extern struct sk_buff *tcp_find_postponed_skb(struct sock *sk); + +static inline void tcp_process_postponed_sack(struct sock *sk) +{ + if (tcp_sk(sk)->sack_pending) + __tcp_process_postponed_sack(sk, tcp_find_postponed_skb(sk)); +} + +static inline void tcp_process_postponed_sack_overlapping(struct sock *sk, + struct sk_buff *skb) +{ + if (tcp_sk(sk)->sack_pending && (skb == tcp_find_postponed_skb(sk))) + __tcp_process_postponed_sack(sk, skb); +} /* tcp_timer.c */ extern void tcp_init_xmit_timers(struct sock *); @@ -1625,6 +1640,10 @@ static inline u32 tcp_highest_sack_seq(struct tcp_sock *tp) if (tp->highest_sack == NULL) return tp->snd_nxt; + if (tp->sack_pending && + before(TCP_SKB_CB(tp->highest_sack)->seq, tp->sack_pending)) + return tp->sack_pending; + return TCP_SKB_CB(tp->highest_sack)->seq; } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 977c68c..5bed358 100644 --- 
a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1181,6 +1181,74 @@ static void tcp_mark_lost_retrans(struct sock *sk) tp->lost_retrans_low = new_low_seq; } +struct sk_buff *tcp_find_postponed_skb(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + + if (before(TCP_SKB_CB(tcp_highest_sack(sk))->seq, tp->sack_pending)) + return tcp_highest_sack(sk); + else + return tcp_write_queue_find(sk, tp->sack_pending, 0); +} + +static void tcp_clear_postponed_tweaks(struct sock *sk, struct sk_buff *skb) +{ + struct tcp_sock *tp = tcp_sk(sk); + unsigned int pcount; + + pcount = (tp->sack_pending - TCP_SKB_CB(skb)->seq) / skb_shinfo(skb)->gso_size; + + tp->sacked_out -= pcount; + if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST) + tp->lost_out += pcount; + /* FIXME: This might be unnecessary */ + if (skb == tcp_highest_sack(sk)) + tp->fackets_out -= pcount; +} + +static int tcp_postpone_sack_handling(struct sock *sk, struct sk_buff *skb, u32 end_seq) +{ + struct tcp_sock *tp = tcp_sk(sk); + unsigned int pcount; + + if (tp->sack_pending) { + struct sk_buff *oskb = tcp_find_postponed_skb(sk); + + /* Already considered this area? 
E.g., due to reordering */ + if (unlikely((skb == oskb) && !after(end_seq, tp->sack_pending))) + return 1; + + if (skb != oskb) + __tcp_process_postponed_sack(sk, oskb); + else + tcp_clear_postponed_tweaks(sk, oskb); + } + + /* To be discarded very soon, this case might be improved somehow */ + if (unlikely(!after(TCP_SKB_CB(skb)->seq, tp->snd_una))) + return 0; + + /* 1 out of 2^32 hits this */ + if (unlikely(!end_seq)) + return 0; + + pcount = (end_seq - TCP_SKB_CB(skb)->seq) / skb_shinfo(skb)->gso_size; + + if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST) + tp->lost_out -= pcount; + tp->sacked_out += pcount; + if (!before(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp))) { + unsigned int fack_count_base; + + fack_count_base = TCP_SKB_CB(tcp_write_queue_head(sk))->fack_count; + tp->fackets_out = TCP_SKB_CB(skb)->fack_count - + fack_count_base + pcount; + tp->highest_sack = skb; + } + tp->sack_pending = end_seq; + +} + /* Check if skb is fully within the SACK block. In presence of GSO skbs, * the incoming SACK may not exactly match but we can find smaller MSS * aligned portion of it that matches. Therefore we might need to fragment @@ -1188,7 +1256,7 @@ static void tcp_mark_lost_retrans(struct sock *sk) * returns). 
*/ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, - u32 start_seq, u32 end_seq) + u32 start_seq, u32 end_seq, int likely_grows) { int in_sack, err; unsigned int pkt_len; @@ -1201,10 +1269,14 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq); - if (!in_sack) + if (!in_sack) { pkt_len = start_seq - TCP_SKB_CB(skb)->seq; - else + } else { pkt_len = end_seq - TCP_SKB_CB(skb)->seq; + + if (likely_grows && tcp_postpone_sack_handling(sk, skb, end_seq)) + return -EAGAIN; + } err = tcp_fragment(sk, skb, pkt_len, skb_shinfo(skb)->gso_size); if (err < 0) return err; @@ -1318,6 +1390,7 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, int dup_sack, int *reord, int *flag, int queue) { + struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *next; tcp_for_write_queue_from_safe(skb, next, sk, queue) { @@ -1330,16 +1403,32 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, if (!before(TCP_SKB_CB(skb)->seq, end_seq)) break; - in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq); + in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq, + !dup_sack && first_block_idx); if (unlikely(in_sack < 0)) break; - if (in_sack) + if (in_sack) { + if (tp->sack_pending && skb == tcp_find_postponed_skb(sk)) + tcp_clear_postponed_tweaks(sk, skb); *flag |= tcp_sacktag_one(skb, sk, reord, dup_sack); + } } return skb; } +void __tcp_process_postponed_sack(struct sock *sk, struct sk_buff *skb) +{ + struct tcp_sock *tp = tcp_sk(sk); + int reord = tp->packets_out; + + tcp_clear_postponed_tweaks(sk, skb); + tp->sack_pending = 0; + tcp_sacktag_one(skb, sk, &reord, 0); + + /* FIXME: reord handling here */ +} + /* Avoid all extra work that is being done by sacktag while walking in * a normal way */ @@ -1705,6 +1794,8 @@ void tcp_enter_frto(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; + tcp_process_postponed_sack(sk); 
+ if ((!tp->frto_counter && icsk->icsk_ca_state <= TCP_CA_Disorder) || tp->snd_una == tp->high_seq || ((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) && @@ -1777,6 +1868,8 @@ static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag) struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; + tcp_process_postponed_sack(sk); + tp->lost_out = 0; tp->retrans_out = 0; if (tcp_is_reno(tp)) @@ -1875,10 +1968,12 @@ void tcp_enter_loss(struct sock *sk, int how) /* Push undo marker, if it was plain RTO and nothing * was retransmitted. */ tp->undo_marker = tp->snd_una; + tcp_process_postponed_sack(sk); tcp_clear_retrans_hints_partial(tp); } else { tp->sacked_out = 0; tp->fackets_out = 0; + tp->sack_pending = 0; tcp_clear_all_retrans_hints(tp); tcp_for_write_queue(skb, sk, TCP_WQ_SACKED) { @@ -2171,6 +2266,7 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit) if (tcp_is_fack(tp)) { cnt = fc - fack_count_base + tcp_skb_pcount(skb); } else { + /* FIXME: SACK postponing adaption */ if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) cnt += tcp_skb_pcount(skb); /* Add SACK blocks between this and skb->prev */ @@ -2224,6 +2320,9 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit) skb = tp->scoreboard_skb_hint ? 
tp->scoreboard_skb_hint : tcp_write_queue_head(sk); + /* FIXME: Performance penalty of this likely significant */ + tcp_process_postponed_sack(sk); + tcp_for_write_queue_from(skb, sk, 0) { if (skb == tcp_send_head(sk)) break; @@ -2422,6 +2521,7 @@ static int tcp_try_undo_loss(struct sock *sk) if (tcp_may_undo(tp)) { struct sk_buff *skb; + tcp_for_write_queue(skb, sk, 0) { if (skb == tcp_send_head(sk)) break; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1d83c65..e7cb39b 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -696,6 +696,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) return -ENOMEM; + tcp_process_postponed_sack_overlapping(sk, skb); + /* Get a new skb... force flag on. */ buff = sk_stream_alloc_skb(sk, nsize, GFP_ATOMIC); if (buff == NULL) @@ -1761,6 +1763,8 @@ void tcp_simple_retransmit(struct sock *sk) unsigned int mss = tcp_current_mss(sk, 0); int lost = 0; + tcp_process_postponed_sack(sk); + tcp_for_write_queue(skb, sk, 0) { if (skb == tcp_send_head(sk)) break; @@ -1928,6 +1932,9 @@ void tcp_xmit_retransmit_queue(struct sock *sk) struct sk_buff *skb; int packet_cnt; + /* FIXME: There are better was to do this in here */ + tcp_process_postponed_sack(sk); + if (tp->retransmit_skb_hint) { skb = tp->retransmit_skb_hint; packet_cnt = tp->retransmit_cnt_hint; -- 1.5.0.6 ^ permalink raw reply related [flat|nested] 45+ messages in thread
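The core of the RFC patch above — remember the still-growing SACK block only as a pending end sequence number and flush the real per-skb tagging lazily — can be modeled outside the kernel. All fields and names below are simplified stand-ins for the patch's tcp_sock state (sack_pending, sacked_out), not the actual implementation.

```c
/* Minimal model of the sack_pending cache: the newest SACK block is
 * recorded as an end seqno, and exact scoreboard state is materialized
 * lazily before paths like tcp_enter_loss() that need it. */
#include <assert.h>

struct toy_tp {
    unsigned int sack_pending; /* end seqno of postponed tagging, 0 = none */
    int sacked_out;            /* segments accounted as SACKed */
    int flushed;               /* how many lazy flushes happened */
};

/* Grow the pending block instead of tagging skbs one by one. */
static void postpone_sack(struct toy_tp *tp, unsigned int end_seq,
                          unsigned int mss, unsigned int start_seq)
{
    tp->sack_pending = end_seq;
    tp->sacked_out = (end_seq - start_seq) / mss;
}

/* Called from paths that need exact state (cf. the patch's
 * tcp_process_postponed_sack call sites). */
static void process_postponed_sack(struct toy_tp *tp)
{
    if (tp->sack_pending) {
        tp->sack_pending = 0;
        tp->flushed++;
    }
}
```

The patch's FIXMEs sit exactly where this model is too simple: reordering, retransmission accounting, and the fackets_out bookkeeping.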
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-11 11:59 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one (Was: Re: [RFC] TCP illinois max rtt aging) Ilpo Järvinen @ 2007-12-11 12:32 ` David Miller 2007-12-12 0:14 ` Lachlan Andrew 2007-12-15 9:51 ` SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) Ilpo Järvinen 0 siblings, 2 replies; 45+ messages in thread From: David Miller @ 2007-12-11 12:32 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Tue, 11 Dec 2007 13:59:16 +0200 (EET) > How about this... > > ...I've left couple of FIXMEs there still, should be quite simple & > straightforward to handle them if this seems viable solution at all. > > Beware, this doesn't even compile yet because not all parameters are > transferred currently (first_sack_index was killed earlier, I need to > summon it back for this). Also, I'd need to do the dirty work of kill > recv_sack_cache first to make this to not produce, well, interesting > effects due to missed SACK blocks... :-) Interesting approach, but I think there is limited value to this (arguably) complex form. The core issue is that the data and the SACK state are maintained in the same data structure. The complexity in all the state management and fixups in your patch is purely because of this. If we maintain SACK scoreboard information separately, outside of the SKB, then there are only two changes to make: 1) Every access to the TCP_SKB_CB() SACK scoreboard is adjusted to the new data structure. 2) Retransmit is adjusted so that it can retransmit an SKB constructed as a portion of an existing SKB. 
Since TSO implies SG, this can be handled with simple offset and length arguments and suitable creation of a clone referencing the pages in the SG vector that contain the desired data. I would envision this SACK state thing to reference into the retransmit queue SKB's somehow. Each SACK block could perhaps look something like: struct sack_ref { struct sk_buff *start_skb; struct sk_buff *end_skb; unsigned int start_off; unsigned int len; }; Traditionally we've prioritized the design of the SKB and other infrastructure to suit TCP optimally and I still think we should operate that way. Therefore, long term, it is time to make a formal data blob to assist with all of the repetitive work we do in cases like this and horribly inefficient places like clean_rtx_queue(). So I'm basically advocating a two-pronged approach to this, the seperate SACK scoreboard datastructure and the data blob. I think we can work on the former right now, and take our time with the data blob because it requires lots of thinking and we should get it right as it might have network driver interface implications. ^ permalink raw reply [flat|nested] 45+ messages in thread
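The sack_ref structure Dave proposes above was only an envisioned sketch; here is a userspace rendering of it with a stubbed sk_buff, plus two invented helpers showing how a (skb, offset, length) reference maps back to the sequence range it covers.

```c
/* Userspace sketch of the proposed sack_ref: scoreboard entries live
 * outside the skbs and reference into the retransmit queue.  The
 * sk_buff stub and the helpers are invented for illustration. */
#include <assert.h>

struct sk_buff_stub { unsigned int seq; unsigned int len; };

struct sack_ref {
    struct sk_buff_stub *start_skb;
    struct sk_buff_stub *end_skb;
    unsigned int start_off; /* bytes into start_skb */
    unsigned int len;       /* total bytes covered by the block */
};

/* Translate a sack_ref back into the sequence range it covers. */
static unsigned int sack_ref_start_seq(const struct sack_ref *r)
{
    return r->start_skb->seq + r->start_off;
}

static unsigned int sack_ref_end_seq(const struct sack_ref *r)
{
    return sack_ref_start_seq(r) + r->len;
}
```

Note the representation is byte-based, which is what lets retransmit later clone an arbitrary (offset, length) slice of an SG skb without splitting it first.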
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-11 12:32 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one David Miller @ 2007-12-12 0:14 ` Lachlan Andrew 2007-12-12 15:11 ` David Miller 2007-12-15 9:51 ` SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) Ilpo Järvinen 1 sibling, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2007-12-12 0:14 UTC (permalink / raw) To: David Miller; +Cc: ilpo.jarvinen, netdev, quetchen Greetings all, On 11/12/2007, David Miller <davem@davemloft.net> wrote: > > we should > get it right as it might have network driver interface implications. If you're redoing the driver interface, could I put in a request for packet time-stamping at a lower level? This thread started because TCP processing interferes with RTT estimation. This problem would be eliminated if time-stamping were done as soon as the packet comes off the NIC. $0.02 Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-12 0:14 ` Lachlan Andrew @ 2007-12-12 15:11 ` David Miller 2007-12-12 23:35 ` Lachlan Andrew 0 siblings, 1 reply; 45+ messages in thread From: David Miller @ 2007-12-12 15:11 UTC (permalink / raw) To: lachlan.andrew; +Cc: ilpo.jarvinen, netdev, quetchen From: "Lachlan Andrew" <lachlan.andrew@gmail.com> Date: Tue, 11 Dec 2007 16:14:36 -0800 > This thread started because TCP processing interferes with RTT > estimation. This problem would be eliminated if time-stamping were > done as soon as the packet comes off the NIC. We don't do that because such timestamping is too expensive. It used to be the case that we did this, but we stopped doing that a long time ago. On x86 for example, timestamping can involve touching a slow I/O device to read the timestamp. We do not want to do that for every packet. Also, we timestamp differently for TCP, the global high resolution timestamp is overkill for this purpose. Really, this is a silly idea and would only be a bandaid for the problem at hand, that TCP input processing is too expensive in certain circumstances. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-12 15:11 ` David Miller @ 2007-12-12 23:35 ` Lachlan Andrew 2007-12-12 23:38 ` David Miller 2007-12-13 0:00 ` Stephen Hemminger 0 siblings, 2 replies; 45+ messages in thread From: Lachlan Andrew @ 2007-12-12 23:35 UTC (permalink / raw) To: David Miller; +Cc: ilpo.jarvinen, netdev, quetchen, Darryl Veitch Greetings Dave, On 12/12/2007, David Miller <davem@davemloft.net> wrote: > From: "Lachlan Andrew" <lachlan.andrew@gmail.com> > Date: Tue, 11 Dec 2007 16:14:36 -0800 > > > This thread started because TCP processing interferes with RTT > > estimation. This problem would be eliminated if time-stamping were > > done as soon as the packet comes off the NIC. > > We don't do that because such timestamping is too expensive. > It used to be the case that we did this, but we stopped doing > that a long time ago. > > On x86 for example, timestamping can involve touching a slow > I/O device to read the timestamp. We do not want to do that > for every packet. OK. Thanks for the background. I thought that a TSC read was fairly cheap. Any messing around to interpret it could be the responsibility of any task which actually needs a high-resolution timestamp, couldn't it? If TSC is disabled, then the timestamp field could be set to "invalid". > Also, we timestamp differently for TCP, the global high > resolution timestamp is overkill for this purpose. Overkill for Reno and Cubic, but useful for Vegas, LP, veno, Illinois and YeAH which are all in the kernel. They currently use "high resolution" timestamps which are effectively quantized to the scheduler resolution because of the way timestamping is done -- reading a high-resolution time source when a task is scheduled. > Really, this is a silly idea Oh... :( > and would only be a bandaid > for the problem at hand, that TCP input processing is > too expensive in certain circumstances. 
That problem should certainly be fixed as well -- I wasn't suggesting this as an alternative. Will fixing it fix the problem of those TCP modules suffering from CPU load from other sources? (I'm Cc'ing this to Darryl Veitch who has often wanted driver-level time-stamping for achieving high-resolution synchronization between hosts.) Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-12 23:35 ` Lachlan Andrew @ 2007-12-12 23:38 ` David Miller 2007-12-13 0:00 ` Stephen Hemminger 1 sibling, 0 replies; 45+ messages in thread From: David Miller @ 2007-12-12 23:38 UTC (permalink / raw) To: lachlan.andrew; +Cc: ilpo.jarvinen, netdev, quetchen, d.veitch From: "Lachlan Andrew" <lachlan.andrew@gmail.com> Date: Wed, 12 Dec 2007 15:35:49 -0800 > I thought that a TSC read was fairly cheap. The TSC is cheap, but on the vast majority of systems it isn't usable as a stable time source, and we have to read the ACPI timer instead. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one 2007-12-12 23:35 ` Lachlan Andrew 2007-12-12 23:38 ` David Miller @ 2007-12-13 0:00 ` Stephen Hemminger 1 sibling, 0 replies; 45+ messages in thread From: Stephen Hemminger @ 2007-12-13 0:00 UTC (permalink / raw) To: Lachlan Andrew Cc: David Miller, ilpo.jarvinen, netdev, quetchen, Darryl Veitch On Wed, 12 Dec 2007 15:35:49 -0800 "Lachlan Andrew" <lachlan.andrew@gmail.com> wrote: > Greetings Dave, > > On 12/12/2007, David Miller <davem@davemloft.net> wrote: > > From: "Lachlan Andrew" <lachlan.andrew@gmail.com> > > Date: Tue, 11 Dec 2007 16:14:36 -0800 > > > > > This thread started because TCP processing interferes with RTT > > > estimation. This problem would be eliminated if time-stamping were > > > done as soon as the packet comes off the NIC. > > > > We don't do that because such timestamping is too expensive. > > It used to be the case that we did this, but we stopped doing > > that a long time ago. > > > > On x86 for example, timestamping can involve touching a slow > > I/O device to read the timestamp. We do not want to do that > > for every packet. > > OK. Thanks for the background. > > I thought that a TSC read was fairly cheap. Any messing around to > interpret it could be the responsibility of any task which actually > needs a high-resolution timestamp, couldn't it? If TSC is disabled, > then the timestamp field could be set to "invalid". > > > Also, we timestamp differently for TCP, the global high > > resolution timestamp is overkill for this purpose. > > Overkill for Reno and Cubic, but useful for Vegas, LP, veno, Illinois > and YeAH which are all in the kernel. They currently use "high > resolution" timestamps which are effectively quantized to the > scheduler resolution because of the way timestamping is done -- > reading a high-resolution time source when a task is scheduled. > > > Really, this is a silly idea > > Oh... 
:( > > > and would only be a bandaid > > for the problem at hand, that TCP input processing is > > too expensive in certain circumstances. > > That problem should certainly be fixed as well -- I wasn't suggesting > this as an alternative. Will fixing it fix the problem of those TCP > modules suffering from CPU load from other sources? > > (I'm Cc'ing this to Darryl Veitch who has often wanted driver-level > time-stamping for achieving high-resolution synchronization between > hosts.) > > Cheers, > Lachlan > In the current kernel, the congestion control module can choose the desired accuracy. The units are always the same (microseconds), but if the flag value "TCP_CONG_RTT_STAMP" is set, then TCP tries to use the high resolution real time clock. This is relatively expensive, so only vegas, veno, LP, and YeAH use it. It would be worth an experiment to see whether it is worth it. All the delay-based congestion control algorithms are impractical in the real world, but everyone keeps trying :-) -- Stephen Hemminger <shemminger@linux-foundation.org> ^ permalink raw reply [flat|nested] 45+ messages in thread
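The accuracy trade-off Stephen and Lachlan are debating is easy to quantify: without a high-resolution stamp, a delay-based module effectively sees RTTs quantized to the jiffy. The arithmetic below is plain userspace C (HZ=250 is just a common 2007-era config assumed for the example), not kernel code.

```c
/* Illustration of RTT quantization without TCP_CONG_RTT_STAMP. */
#include <assert.h>

#define HZ 250  /* assumed config; 1 jiffy = 4000 us */

/* RTT as a jiffy-resolution module would see it, in microseconds. */
static long rtt_jiffies_us(long rtt_us)
{
    long per_jiffy = 1000000 / HZ;
    return (rtt_us / per_jiffy) * per_jiffy;
}
```

A 5.3 ms RTT rounds down to 4 ms and anything under one jiffy reads as zero — useless for Vegas-style queueing-delay estimation, which is why those modules pay for the expensive clock.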
* SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) 2007-12-11 12:32 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one David Miller 2007-12-12 0:14 ` Lachlan Andrew @ 2007-12-15 9:51 ` Ilpo Järvinen 2008-01-08 7:36 ` SACK scoreboard David Miller 1 sibling, 1 reply; 45+ messages in thread From: Ilpo Järvinen @ 2007-12-15 9:51 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen On Tue, 11 Dec 2007, David Miller wrote: > Interesting approach, but I think there is limited value to this > (arguably) complex form. > > The core issue is that the data and the SACK state are maintained in > the same datastructure. The complexity in all the state management > and fixups in your patch is purely because of this. Yeah, I just had too much time while waiting for a person who never arrived... :-) It would have covered the typical case quite well, though; for sure it was very intrusive. > If we maintain SACK scoreboard information seperately, outside of > the SKB, then there are only two changes to make: > > 1) Every access to TCP_SKB_CB() SACK scoreboard is adjusted to > new data structure. Not sure if it is going to be all that straightforward, because you seem to ignore in this simple description all the details of performing the linking between the structures, which becomes tricky when we add those retransmission adjustments. > 2) Retransmit is adjusted so that it can retransmit an SKB > constructed as a portion of an existing SKB. Since TSO > implies SG, this can be handled with simple offset and > length arguments and suitable creation of a clone referencing > the pages in the SG vector that contain the desired data. 
We need the origins when a retransmission gets sacked/acked to reduce retrans_out correctly. To solve this, I think the sack_ref must have ->sacked as well, and the structure should store the R-bits there too, and maybe the L-bits as well, though their definition is more obvious because it's mostly a complement of sacked (except for a small part near fackets_out and timedout marking that probably makes bookkeeping of L still necessary). The pcount boundaries are no longer that well defined after we break skb boundaries during retransmitting, so doing this makes the fackets_out calculation even trickier to track, unless the whole pcount is replaced by byte-based fackets_out :-/, which is very trivial to determine from seqnos only (TCP_NODELAYers might get unhappy though if we do that in a straightforward way...). This would clearly be a good step from one perspective. Nowadays GSO'ed skbs may get fragmented when mss_now changes; for two or more consecutive non-SACKed skbs that means one or more extra (small) packets when retransmitting. > I would envision this SACK state thing to reference into the > retransmit queue SKB's somehow. Each SACK block could perhaps > look something like: > > struct sack_ref { > struct sk_buff *start_skb; > struct sk_buff *end_skb; > unsigned int start_off; > unsigned int len; > }; ...I assume that these are fast searchable? And skbs as well (because the linking of start/end_skb at minimum is a costly operation otherwise)? -- i. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2007-12-15 9:51 ` SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) Ilpo Järvinen @ 2008-01-08 7:36 ` David Miller 2008-01-08 12:12 ` Ilpo Järvinen 2008-01-08 16:51 ` John Heffner 0 siblings, 2 replies; 45+ messages in thread From: David Miller @ 2008-01-08 7:36 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen Ilpo, just trying to keep an old conversation from dying off. Did you happen to read a recent blog posting of mine? http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead I've been thinking more and more and I think we might be able to get away with enforcing that SACKs are always increasing in coverage. I doubt there are any real systems out there that drop out of order packets that are properly formed and are in window, even though the SACK specification (foolishly, in my opinion) allows this. If we could free packets as SACK blocks cover them, all the problems go away. For one thing, this will allow the retransmit queue liberation during loss recovery to be spread out over the event, instead of batched up like crazy to the point where the cumulative ACK finally moves and releases an entire window's worth of data. Next, it would simplify all of this scanning code trying to figure out which holes to fill during recovery. And for SACK scoreboard marking, the RB trie would become very nearly unnecessary as far as I can tell. I would not even entertain this kind of crazy idea unless I thought the fundamental complexity simplification payback was enormous. And in this case I think it is. What we could do is put some experimental hack in there for developers to start playing with, which would enforce that SACKs always increase in coverage. If violated, the connection is reset and a verbose log message is logged so we can analyze any cases that occur. Sounds crazy, but maybe has potential. What do you think? 
^ permalink raw reply [flat|nested] 45+ messages in thread
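The freeing-as-SACKed idea David proposes reads, in toy form, like the sketch below. rtx_skb and free_sacked are invented for illustration; under his "hard SACK" rule a covered block can never be reneged, so the sender may release it immediately:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue entry; "freeing" here just means unlinking. */
struct rtx_skb {
    unsigned int seq, len;
    struct rtx_skb *next;
};

/* Unlink and count every skb wholly inside [start_seq, end_seq).
 * In the real stack the unlink would be followed by kfree_skb(). */
static unsigned int free_sacked(struct rtx_skb **head,
                                unsigned int start_seq,
                                unsigned int end_seq)
{
    unsigned int freed = 0;
    struct rtx_skb **pp = head;

    while (*pp) {
        struct rtx_skb *skb = *pp;

        if (skb->seq >= start_seq && skb->seq + skb->len <= end_seq) {
            *pp = skb->next;      /* kfree_skb(skb) would go here */
            freed++;
        } else {
            pp = &skb->next;
        }
    }
    return freed;
}
```

This is what spreads the release work across the recovery event: each arriving SACK block frees a couple of packets instead of the cumulative ACK freeing a whole window at once.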
* Re: SACK scoreboard 2008-01-08 7:36 ` SACK scoreboard David Miller @ 2008-01-08 12:12 ` Ilpo Järvinen 2008-01-09 7:58 ` David Miller 2008-01-08 16:51 ` John Heffner 1 sibling, 1 reply; 45+ messages in thread From: Ilpo Järvinen @ 2008-01-08 12:12 UTC (permalink / raw) To: David Miller; +Cc: lachlan.andrew, Netdev, quetchen On Mon, 7 Jan 2008, David Miller wrote: > Did you happen to read a recent blog posting of mine? > > http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead > > I've been thinking more and more and I think we might be able > to get away with enforcing that SACKs are always increasing in > coverage. > > I doubt there are any real systems out there that drop out of order > packets that are properly formed and are in window, even though the > SACK specification (foolishly, in my opinion) allows this. Luckily we can already see that from the MIBs, so querying people who have large servers (which are continuously "testing" the internet :-)) under their supervision, or that they can access, and asking if they see any, might help. I checked my dept's interactive servers and all had zero renegings, but I don't think I have access to a www server which would have much wider exposure. > If we could free packets as SACK blocks cover them, all the problems > go away. I thought about it a bit yesterday after reading your blog and came to the conclusion that they won't; we can still get those nasty ACKs regardless of received SACK info (here, missing). Even in some valid cases which include ACK losses besides actual data loss; not that this is the most common case, but I just wanted to point out that the cleanup work is at least partially independent of the SACK problem. So not "all" problems would really go away. > For one thing, this will allow the retransmit queue liberation during > loss recovery to be spread out over the event, instead of batched up > like crazy to the point where the cumulative ACK finally moves and > releases an entire window's worth of data. 
Two key cases for real loss patterns are: 1. Losses once per n, where n is something small, like 2-20 or so; this usually happens at slow start overshoot or when competing traffic slow starts. Cumulative ACKs will cover only a small part of the window once rexmits make it through, thus this is not a problem. 2. Single loss (or a few at the beginning of the window), rest SACKed. The cumulative ACK will cover the original window when the last necessary rexmit gets through. Case 1 becomes nasty ACKy only if a rexmit is lost as well, but in that case the arriving SACK blocks make the rest of the window equal to case 2 :-). So I'm now trying to solve just case 2. What if we could somehow "combine" adjacent skbs (or whatever they're called in that model) if SACK covers them both, so that we still hold them but can drop them in a very efficient way. That would make the combining effort split per ACK. And if reneging would occur, we can think of a way to put the necessary fuzz into a form which cannot hurt the rest of the system (relatively easy & fast if we add CA_Reneging and allow retransmitting a portion of an skb similar to what you suggested earlier). And it might even be possible then to offer the admin a control so that the admin can choose between recover/plain reset if the admin thinks that it's always an indication of an attack. This is a somewhat similar case to what UTO (under IETF evaluation) does, as the purpose of both is, in violation of RFC TCP, to avoid malicious traps, but the control over it is left to the user. > Next, it would simplify all of this scanning code trying to figure out > which holes to fill during recovery. > > And for SACK scoreboard marking, the RB trie would become very nearly > unecessary as far as I can tell. I've been contacted by a person who was interested in reaching 500k windows, so your 4000 sounded like a joke :-/. Having, let's say, every 20th dropped means 25k skbs remaining; can we scan through it in any sensible time without RBs and friends :-)? 
However, allowing the queue walk to begin from either direction would solve most of the common cases well enough for it to be nearly manageable. > I would not even entertain this kind of crazy idea unless I thought > the fundamental complexity simplification payback was enormous. And > in this case I think it is. > > What we could do is put some experimental hack in there for developers > to start playing with, which would enforce that SACKs always increase > in coverage. If violated, the connection is reset and a verbose log > message is logged so we can analyze any cases that occur. We have an initial number already, in MIBs. > Sounds crazy, but maybe has potential. What do you think? If I'd hint to my boss that I'm involved in something like this I'd bet that he also would get quite crazy... ;-) I'm partially paid for making TCP more RFCish :-), or at least for making sure that the places where things diverge are known and controllable for research purposes. -- i. ps. If the others Cc'ed would like to be dropped from any followups, just let me know :-). Else, no need to do anything. ^ permalink raw reply [flat|nested] 45+ messages in thread
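Ilpo's combine-adjacent-skbs idea can be sketched with invented types: collapse runs of adjacent fully-SACKed entries into one spanning entry, so the eventual release at the cumulative ACK touches one node per run instead of one per packet. Real code would also have to merge pcount/fackets_out state, which is exactly the tricky part he mentions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical retransmit-queue entry, not kernel code. */
struct rq_ent {
    unsigned int seq, len;
    int sacked;               /* 1 if fully covered by SACK blocks */
    struct rq_ent *next;
};

/* Merge each pair of sequence-adjacent, fully-SACKed neighbours by
 * absorbing the second into the first. */
static void coalesce_sacked(struct rq_ent *head)
{
    struct rq_ent *e = head;

    while (e && e->next) {
        struct rq_ent *n = e->next;

        if (e->sacked && n->sacked && e->seq + e->len == n->seq) {
            e->len += n->len;  /* absorb the neighbour... */
            e->next = n->next; /* ...and drop it from the list */
        } else {
            e = n;
        }
    }
}
```

The combining cost is paid incrementally, one comparison per arriving SACK block, which is the "split per ACK" property he is after.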
* Re: SACK scoreboard 2008-01-08 12:12 ` Ilpo Järvinen @ 2008-01-09 7:58 ` David Miller 0 siblings, 0 replies; 45+ messages in thread From: David Miller @ 2008-01-09 7:58 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: lachlan.andrew, netdev, quetchen From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Tue, 8 Jan 2008 14:12:47 +0200 (EET) > If I'd hint my boss that I'm involved in something like this I'd > bet that he also would get quite crazy... ;-) I'm partially paid > for making TCP more RFCish :-), or at least that the places where > thing diverge are known and controllable for research purposes. RFCs are great guides by which to implement things, but they have often been completely wrong or not practical to follow strictly. The handling of out of order ACKs with timestamps is my favorite example. Nobody performs an RFC-compliant timestamp check on ACK packets, or else their performance would go into the toilet during packet reordering. The URG bit setting is another one. Especially when, practically speaking, we can in fact make changes like I believe we can here, I think we should. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-08 7:36 ` SACK scoreboard David Miller 2008-01-08 12:12 ` Ilpo Järvinen @ 2008-01-08 16:51 ` John Heffner 2008-01-08 22:44 ` David Miller 1 sibling, 1 reply; 45+ messages in thread From: John Heffner @ 2008-01-08 16:51 UTC (permalink / raw) To: David Miller; +Cc: ilpo.jarvinen, lachlan.andrew, netdev, quetchen David Miller wrote: > Ilpo, just trying to keep an old conversation from dying off. > > Did you happen to read a recent blog posting of mine? > > http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead > > I've been thinking more and more and I think we might be able > to get away with enforcing that SACKs are always increasing in > coverage. > > I doubt there are any real systems out there that drop out of order > packets that are properly formed and are in window, even though the > SACK specification (foolishly, in my opinion) allows this. > > If we could free packets as SACK blocks cover them, all the problems > go away. > > For one thing, this will allow the retransmit queue liberation during > loss recovery to be spread out over the event, instead of batched up > like crazy to the point where the cumulative ACK finally moves and > releases an entire window's worth of data. > > Next, it would simplify all of this scanning code trying to figure out > which holes to fill during recovery. > > And for SACK scoreboard marking, the RB trie would become very nearly > unecessary as far as I can tell. > > I would not even entertain this kind of crazy idea unless I thought > the fundamental complexity simplification payback was enormous. And > in this case I think it is. > > What we could do is put some experimental hack in there for developers > to start playing with, which would enforce that SACKs always increase > in coverage. If violated the connection reset and a verbose log > message is logged so we can analyze any cases that occur. > > Sounds crazy, but maybe has potential. What do you think? 
Linux has a code path where this can happen under memory over-commit, in tcp_prune_queue(). Also, I think one of the motivations for making SACK strictly advisory is that there was some concern about buggy SACK implementations. Keeping data in your retransmit queue allows you to fall back to timeout and go-back-n if things completely fall apart. For better or worse, we have to deal with the spec the way it is. Even if you made this assumption of "hard" SACKs, you still have to worry about large ACKs if SACK is disabled, though I guess you could say people running with large windows without SACK deserve what they get. :) I haven't thought about this too hard, but can we approximate this by moving sacked data into a sacked queue, then if something bad happens merge this back into the retransmit queue? The code will have to deal with non-contiguous data in the retransmit queue; I'm not sure offhand if that violates any assumptions. You still have a single expensive ACK at the end of recovery, though I wonder how much this really hurts. If you want to ameliorate this, you could save this sacked queue to be batch processed later, in application context for instance. -John ^ permalink raw reply [flat|nested] 45+ messages in thread
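John's park-and-merge-back fallback is essentially a sorted-list splice: sacked data parks on a side queue, and if the receiver misbehaves it is merged back into the retransmit queue in sequence order so go-back-n still sees contiguous data. A toy sketch, with seg and merge_back as illustrative names rather than kernel code:

```c
#include <assert.h>
#include <stddef.h>

struct seg {
    unsigned int seq;
    struct seg *next;
};

/* Merge two seq-sorted lists into one seq-sorted list. */
static struct seg *merge_back(struct seg *rtx, struct seg *sacked)
{
    struct seg head = { 0, NULL };
    struct seg *tail = &head;

    while (rtx && sacked) {
        /* Pick whichever list currently holds the lower sequence. */
        struct seg **lo = (rtx->seq <= sacked->seq) ? &rtx : &sacked;

        tail->next = *lo;
        tail = *lo;
        *lo = (*lo)->next;
    }
    tail->next = rtx ? rtx : sacked;
    return head.next;
}
```

The merge itself is linear in the queue length, but it only ever runs on the (supposedly never-taken) reneging path.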
* Re: SACK scoreboard 2008-01-08 16:51 ` John Heffner @ 2008-01-08 22:44 ` David Miller 2008-01-09 1:34 ` Lachlan Andrew 2008-01-09 2:25 ` Andi Kleen 0 siblings, 2 replies; 45+ messages in thread From: David Miller @ 2008-01-08 22:44 UTC (permalink / raw) To: jheffner; +Cc: ilpo.jarvinen, lachlan.andrew, netdev, quetchen From: John Heffner <jheffner@psc.edu> Date: Tue, 08 Jan 2008 11:51:53 -0500 > I haven't thought about this too hard, but can we approximate this by > moving sacked data into a sacked queue, then if something bad happens > merge this back into the retransmit queue? That defeats the impetus for the change. We want to free up the data, say, 2 packets at a time as ACKs come in. The key goal is smooth liberation of retransmit queue packets over time. The big problem is that recovery from even a single packet loss in a window makes us run kfree_skb() for all the packets in a full window's worth of data when recovery completes. If we just move such packets to a separate list, we still have to iterate over all of them when the cumulative ACK arrives. This problem, that retransmit queue liberation is not smooth, is the biggest flaw in how SACK is specified. I mean, consider Ilpo's mentioned case of 500,000 packet windows. The issue cannot be ignored. SACK is clearly broken. You speak of a path in Linux where we can renege on SACKs, but I doubt it really ever runs because of how aggressive the queue collapser is. Alexey even has a comment there: * This must not ever occur. */ To be honest this code sits here because it was written before the queue collapser was coded up. Really, find me a box where the LINUX_MIB_OFOPRUNED or LINUX_MIB_RECVPRUNED counters are anything other than zero. So this is a non-issue and I did consider it before proposing that we redefine SACK. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-08 22:44 ` David Miller @ 2008-01-09 1:34 ` Lachlan Andrew 2008-01-09 6:35 ` David Miller 2008-01-09 2:25 ` Andi Kleen 1 sibling, 1 reply; 45+ messages in thread From: Lachlan Andrew @ 2008-01-09 1:34 UTC (permalink / raw) To: David Miller; +Cc: jheffner, ilpo.jarvinen, netdev, quetchen Greetings David, On 08/01/2008, David Miller <davem@davemloft.net> wrote: > From: John Heffner <jheffner@psc.edu> > > > I haven't thought about this too hard, but can we approximate this by > > moving scaked data into a sacked queue, then if something bad happens > > merge this back into the retransmit queue? > > That defeats the impetus for the change. > > We want to free up the data, say, 2 packets at a time as > ACKs come in. The key goal is smooth liberation of > retransmit queue packets over time. John also suggested freeing the packets as a lower priority task, just doing it after they're acknowledged. When the ACK finally comes, you could do something like moving John's entire list of packets to a "to be freed" list, and free a few every time (say) another ACK comes in. $0.02, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 1:34 ` Lachlan Andrew @ 2008-01-09 6:35 ` David Miller 0 siblings, 0 replies; 45+ messages in thread From: David Miller @ 2008-01-09 6:35 UTC (permalink / raw) To: lachlan.andrew; +Cc: jheffner, ilpo.jarvinen, netdev, quetchen From: "Lachlan Andrew" <lachlan.andrew@gmail.com> Date: Tue, 8 Jan 2008 17:34:03 -0800 > John also suggested freeing the packets as a lower priority task, just > doing it after they're acknowledged. > > When the ACK finally comes, you could do something like moving John's > entire list of packets to a "to be freed" list, and free a few every > time (say) another ACK comes in. You could, but this requires extra state and the operations to properly queue such items to the deferred task and wake it up. And none of this would be necessary if we could make SACK indications hard. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-08 22:44 ` David Miller 2008-01-09 1:34 ` Lachlan Andrew @ 2008-01-09 2:25 ` Andi Kleen 2008-01-09 4:27 ` John Heffner 2008-01-09 6:39 ` David Miller 1 sibling, 2 replies; 45+ messages in thread From: Andi Kleen @ 2008-01-09 2:25 UTC (permalink / raw) To: David Miller; +Cc: jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen David Miller <davem@davemloft.net> writes: > > The big problem is that recovery from even a single packet loss in a > window makes us run kfree_skb() for a all the packets in a full > window's worth of data when recovery completes. Why exactly is it a problem to free them all at once? Are you worried about kernel preemption latencies? -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 2:25 ` Andi Kleen @ 2008-01-09 4:27 ` John Heffner 2008-01-09 6:41 ` David Miller 2008-01-09 12:55 ` Ilpo Järvinen 2008-01-09 6:39 ` David Miller 1 sibling, 2 replies; 45+ messages in thread From: John Heffner @ 2008-01-09 4:27 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, ilpo.jarvinen, lachlan.andrew, netdev, quetchen Andi Kleen wrote: > David Miller <davem@davemloft.net> writes: >> The big problem is that recovery from even a single packet loss in a >> window makes us run kfree_skb() for all the packets in a full >> window's worth of data when recovery completes. > > Why exactly is it a problem to free them all at once? Are you worried > about kernel preemption latencies? > > -Andi I also wonder how much of a problem this is (for now, with window sizes of order 10000 packets). My understanding is that the biggest problems arise from O(N^2) time for recovery because every ack was expensive. Have current tests shown the final ack to be a major source of problems? -John ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 4:27 ` John Heffner @ 2008-01-09 6:41 ` David Miller 2008-01-09 14:56 ` John Heffner 2008-01-09 12:55 ` Ilpo Järvinen 1 sibling, 1 reply; 45+ messages in thread From: David Miller @ 2008-01-09 6:41 UTC (permalink / raw) To: jheffner; +Cc: andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen From: John Heffner <jheffner@psc.edu> Date: Tue, 08 Jan 2008 23:27:08 -0500 > I also wonder how much of a problem this is (for now, with window sizes > of order 10000 packets. My understanding is that the biggest problems > arise from O(N^2) time for recovery because every ack was expensive. > Have current tests shown the final ack to be a major source of problems? Yes, several people have reported this. It's a real issue, and it's only going to get worse. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 6:41 ` David Miller @ 2008-01-09 14:56 ` John Heffner 2008-01-09 18:14 ` SANGTAE HA 0 siblings, 1 reply; 45+ messages in thread From: John Heffner @ 2008-01-09 14:56 UTC (permalink / raw) To: David Miller; +Cc: andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen David Miller wrote: > From: John Heffner <jheffner@psc.edu> > Date: Tue, 08 Jan 2008 23:27:08 -0500 > >> I also wonder how much of a problem this is (for now, with window sizes >> of order 10000 packets. My understanding is that the biggest problems >> arise from O(N^2) time for recovery because every ack was expensive. >> Have current tests shown the final ack to be a major source of problems? > > Yes, several people have reported this. I may have missed some of this. Does anyone have a link to some recent data? -John ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 14:56 ` John Heffner @ 2008-01-09 18:14 ` SANGTAE HA 2008-01-09 18:23 ` John Heffner 0 siblings, 1 reply; 45+ messages in thread From: SANGTAE HA @ 2008-01-09 18:14 UTC (permalink / raw) To: John Heffner Cc: David Miller, andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen On Jan 9, 2008 9:56 AM, John Heffner <jheffner@psc.edu> wrote: > >> I also wonder how much of a problem this is (for now, with window sizes > >> of order 10000 packets. My understanding is that the biggest problems > >> arise from O(N^2) time for recovery because every ack was expensive. > >> Have current tests shown the final ack to be a major source of problems? > > > > Yes, several people have reported this. > > I may have missed some of this. Does anyone have a link to some recent > data? I did some testing on this a month ago. A small set of recent results with Linux 2.6.23.9 is at http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency One of the serious cases with a large number of packet losses (initial loss is around 8000 packets) is at http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency/600--TCP-TCP-NONE--400-3-1.0--1000-120-0-0-1-1-5-500--1.0-0.5-133000-73-3000000-0.93-150--3/ Also, there is a comparison among three Linux kernels (2.6.13, 2.6.18-rc4, 2.6.20.3) at http://netsrv.csc.ncsu.edu/wiki/index.php/Efficiency_of_SACK_processing Sangtae ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 18:14 ` SANGTAE HA @ 2008-01-09 18:23 ` John Heffner 0 siblings, 0 replies; 45+ messages in thread From: John Heffner @ 2008-01-09 18:23 UTC (permalink / raw) To: SANGTAE HA Cc: David Miller, andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen SANGTAE HA wrote: > On Jan 9, 2008 9:56 AM, John Heffner <jheffner@psc.edu> wrote: >>>> I also wonder how much of a problem this is (for now, with window sizes >>>> of order 10000 packets. My understanding is that the biggest problems >>>> arise from O(N^2) time for recovery because every ack was expensive. >>>> Have current tests shown the final ack to be a major source of problems? >>> Yes, several people have reported this. >> I may have missed some of this. Does anyone have a link to some recent >> data? > > I had some testing on this a month ago. > A small set of recent results with linux 2.6.23.9 are at > http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency > One of serious cases with a large number of packet losses (initial > loss is around 8000 packets) is at > http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency/600--TCP-TCP-NONE--400-3-1.0--1000-120-0-0-1-1-5-500--1.0-0.5-133000-73-3000000-0.93-150--3/ > > Also, there is a comparison among three Linux kernels (2.6.13, > 2.6.18-rc4, 2.6.20.3) at > http://netsrv.csc.ncsu.edu/wiki/index.php/Efficiency_of_SACK_processing If I'm reading this right, all these tests occur with large amounts of loss and tons of sack processing. What would be most pertinent to this discussion would be a test with a large window, with delayed ack and sack disabled, and a single loss repaired by fast retransmit. This would isolate the "single big ack" processing from other factors such as doubling the ack rate and sack processing. I could probably set up such a test, but I don't want to duplicate effort if someone else already has done something similar. Thanks, -John ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 4:27 ` John Heffner 2008-01-09 6:41 ` David Miller @ 2008-01-09 12:55 ` Ilpo Järvinen 1 sibling, 0 replies; 45+ messages in thread From: Ilpo Järvinen @ 2008-01-09 12:55 UTC (permalink / raw) To: John Heffner; +Cc: Andi Kleen, David Miller, lachlan.andrew, Netdev, quetchen On Tue, 8 Jan 2008, John Heffner wrote: > Andi Kleen wrote: > > David Miller <davem@davemloft.net> writes: > > > The big problem is that recovery from even a single packet loss in a > > > window makes us run kfree_skb() for a all the packets in a full > > > window's worth of data when recovery completes. > > > > Why exactly is it a problem to free them all at once? Are you worried > > about kernel preemption latencies? > > I also wonder how much of a problem this is (for now, with window sizes of > order 10000 packets. My understanding is that the biggest problems arise from > O(N^2) time for recovery because every ack was expensive. Have current > tests shown the final ack to be a major source of problems? This thread got started because I tried to solve the other latencies, but realized that it helps very little because this latency spike would have remained unsolved, and it happens in one of the most common cases. -- i. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 2:25 ` Andi Kleen 2008-01-09 4:27 ` John Heffner @ 2008-01-09 6:39 ` David Miller 2008-01-09 7:03 ` Andi Kleen 1 sibling, 1 reply; 45+ messages in thread From: David Miller @ 2008-01-09 6:39 UTC (permalink / raw) To: andi; +Cc: jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen From: Andi Kleen <andi@firstfloor.org> Date: Wed, 09 Jan 2008 03:25:05 +0100 > David Miller <davem@davemloft.net> writes: > > > > The big problem is that recovery from even a single packet loss in a > > window makes us run kfree_skb() for all the packets in a full > > window's worth of data when recovery completes. > > Why exactly is it a problem to free them all at once? Are you worried > about kernel preemption latencies? If the cpu is there spinning freeing up 500,000 SKBs, it isn't processing RX packets. It adds severe spikes in CPU utilization that at even moderate line rates begin to affect RTTs. Or do you think it's OK to process 500,000 SKBs while locked in a software interrupt? Perhaps you have another broken awk script to prove this :-) ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 6:39 ` David Miller @ 2008-01-09 7:03 ` Andi Kleen 2008-01-09 7:16 ` David Miller 2008-01-09 9:47 ` Evgeniy Polyakov 0 siblings, 2 replies; 45+ messages in thread From: Andi Kleen @ 2008-01-09 7:03 UTC (permalink / raw) To: David Miller Cc: andi, jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen > It adds severe spikes in CPU utilization that are even moderate > line rates begins to affect RTTs. > > Or do you think it's OK to process 500,000 SKBs while locked > in a software interrupt. You can always push it into a work queue. Even push it to other cores if you want. In fact this is already done partly for the ->completion_queue. Wouldn't be a big change to queue it another level down. Also, even freeing a lot of objects doesn't have to be that expensive. I suspect most of the cost is in taking the slab locks, but that could be batched. Without that, the kmem_free fast path isn't particularly expensive, as long as the headers are still in cache. Long ago I had sk_buff optimized for the routing case so that freeing can be all done with a single cache line. That is long gone, but could probably be restored. But asking for protocol changes just to work around such an internal implementation detail is ... <I miss words> > Perhaps you have another broken awk script to prove this :-) Your hand-waved numbers on inline sizes there were definitely worse than mine. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 7:03 ` Andi Kleen @ 2008-01-09 7:16 ` David Miller 0 siblings, 0 replies; 45+ messages in thread From: David Miller @ 2008-01-09 7:16 UTC (permalink / raw) To: andi; +Cc: jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen From: Andi Kleen <andi@firstfloor.org> Date: Wed, 9 Jan 2008 08:03:18 +0100 > Also even freeing a lot of objects doesn't have to be > that expensive. I suspect the most cost is in taking > the slab locks, but that could be batched. We're touching SKB struct members, doing atomics on them, etc. for objects we haven't referenced for at least two RTTs, so they are guaranteed to be cache cold. Let's say best case we can get it down to 2 cache line misses, and as a very aggressive goal we can get the cost down to 100 cycles. For 500,000 packets this is 50 million cpu cycles to free them all up. That's 25 milliseconds even on a 2 GHz cpu. And yes there are inherent costs in handling TCP windows that are 500,000 packets in size. But, that freeing cost should be spread throughout the handling of the RTT feedback, not handled all at once. > Your hand waved numbers on inline sizes there were definitely worse > than mine. Your primary objective seems to be "being right", and that's fine but realize that it makes discussing anything with you about as fun as picking one's toe nails with an ice axe. Eventually you will be ignored by most folks who get fed up by this style of argument. So, have fun being right rather than being pleasant to work with. ^ permalink raw reply [flat|nested] 45+ messages in thread
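The back-of-envelope estimate above can be parameterized. The per-skb free cost is a guess, so it is left as an argument rather than baked in:

```c
#include <assert.h>

/* Stall time spent freeing a window's worth of cache-cold skbs in one
 * go: packets * cycles-per-free / clock rate.  All inputs are
 * assumptions, matching the rough figures in the thread. */
static double stall_seconds(unsigned long pkts, double cycles_per_pkt,
                            double hz)
{
    return pkts * cycles_per_pkt / hz;
}
```

At 100 cycles per free, a 500,000-packet window costs 25 ms of stall on a 2 GHz cpu; at a less optimistic 1,000 cycles it is a quarter of a second, either way concentrated at the cumulative ACK unless the freeing is spread out.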
* Re: SACK scoreboard 2008-01-09 7:03 ` Andi Kleen 2008-01-09 7:16 ` David Miller @ 2008-01-09 9:47 ` Evgeniy Polyakov 2008-01-09 14:02 ` Andi Kleen 1 sibling, 1 reply; 45+ messages in thread From: Evgeniy Polyakov @ 2008-01-09 9:47 UTC (permalink / raw) To: Andi Kleen Cc: David Miller, jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen Hi. On Wed, Jan 09, 2008 at 08:03:18AM +0100, Andi Kleen (andi@firstfloor.org) wrote: > > It adds severe spikes in CPU utilization that are even moderate > > line rates begins to affect RTTs. > > > > Or do you think it's OK to process 500,000 SKBs while locked > > in a software interrupt. > > You can always push it into a work queue. Even put it to > other cores if you want. > > In fact this is already done partly for the ->completion_queue. > Wouldn't be a big change to queue it another level down. > > Also even freeing a lot of objects doesn't have to be > that expensive. I suspect the most cost is in taking > the slab locks, but that could be batched. Without > that the kmem_free fast path isn't particularly > expensive, as long as the headers are still in cache. Postponing freeing of the skb has major drawbacks. Some time ago I made a patch to postpone skb freeing behind RCU and got 2.5 times slower connection speed on some machines, though with decreased CPU usage. So, a queueing solution has to be proven with real data, and although it looks good in one situation, it can be really bad in another. For the interested reader: results of the RCUfication of kfree_skbmem() http://tservice.net.ru/~s0mbre/blog/devel/networking/2006/12/05 -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SACK scoreboard 2008-01-09 9:47 ` Evgeniy Polyakov @ 2008-01-09 14:02 ` Andi Kleen 0 siblings, 0 replies; 45+ messages in thread From: Andi Kleen @ 2008-01-09 14:02 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Andi Kleen, David Miller, jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen > Postponing freeing of the skb has major drawbacks. Some time ago I Yes, the trick would be to make sure that it also does not tie up too much memory. e.g. it would need some throttling at least. Also the fast path of kmem_cache_free() is actually not that much different from just putting something on a list so perhaps it would not make that much difference. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
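Andi's throttling point, in toy form: defer frees onto a list, but drain whenever the backlog passes a cap, so the deferred memory stays bounded. All names here are invented for illustration; the real analogue would sit near the softirq ->completion_queue handling:

```c
#include <assert.h>
#include <stddef.h>

#define DEFER_CAP 4   /* arbitrary cap for the sketch */

struct node { struct node *next; };

static struct node *defer_head;
static unsigned int defer_len;
static unsigned int drained;     /* stands in for actual free calls */

/* Release everything on the deferred list at once. */
static void drain(void)
{
    while (defer_head) {
        struct node *n = defer_head;
        defer_head = n->next;
        drained++;               /* real code: kfree_skb(n) */
    }
    defer_len = 0;
}

/* Queue a free; drain in batches once the backlog hits the cap. */
static void defer_free(struct node *n)
{
    n->next = defer_head;
    defer_head = n;
    if (++defer_len >= DEFER_CAP)
        drain();
}
```

As Evgeniy's RCU experiment shows, whether a scheme like this wins or loses depends heavily on the workload, so it would need measurement, not just reasoning.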
* Re: [PATCH] tcp-illinois: incorrect beta usage 2007-11-28 23:47 ` [PATCH] tcp-illinois: incorrect beta usage Stephen Hemminger 2007-11-29 0:25 ` Lachlan Andrew @ 2007-11-29 14:12 ` Herbert Xu 1 sibling, 0 replies; 45+ messages in thread From: Herbert Xu @ 2007-11-29 14:12 UTC (permalink / raw) To: Stephen Hemminger Cc: Julian Shao Liu, David S. Miller, 'Lachlan Andrew', shaoliu, 'Douglas Leith', 'Robert Shorten', netdev On Wed, Nov 28, 2007 at 03:47:25PM -0800, Stephen Hemminger wrote: > Lachlan Andrew observed that my TCP-Illinois implementation uses the > beta value incorrectly: > The parameter beta in the paper specifies the amount to decrease > *by*: that is, on loss, > W <- W - beta*W > but in tcp_illinois_ssthresh() uses beta as the amount > to decrease *to*: W <- beta*W > > This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network, > hurting performance. Note: since the base beta value is .5, it has no > impact on a congested network. > > Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Applied to net-2.6. Thanks Stephen! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 45+ messages in thread
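The beta fix from the start of this thread is easy to demonstrate in fixed point. BETA_SHIFT = 6 is an assumption here (the patch only shows the shift being used); with beta = 1/8 (8/64), the decrease-to form cuts cwnd to 12.5% of its value instead of to the intended 87.5%:

```c
#include <assert.h>

#define BETA_SHIFT 6   /* assumed fixed-point shift, beta in 1/64 units */

/* The buggy form: decrease *to* beta*W. */
static unsigned int ssthresh_buggy(unsigned int cwnd, unsigned int beta)
{
    unsigned int v = (cwnd * beta) >> BETA_SHIFT;
    return v > 2 ? v : 2;
}

/* The fixed form: decrease *by* beta*W, i.e. W <- W - beta*W. */
static unsigned int ssthresh_fixed(unsigned int cwnd, unsigned int beta)
{
    unsigned int v = cwnd - ((cwnd * beta) >> BETA_SHIFT);
    return v > 2 ? v : 2;
}
```

At beta = 1/2 (32/64) both forms coincide, which is why the bug was invisible on congested paths where Illinois uses the base beta.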
* Re: SACK scoreboard @ 2008-01-09 6:04 linux 0 siblings, 0 replies; 45+ messages in thread From: linux @ 2008-01-09 6:04 UTC (permalink / raw) To: netdev; +Cc: linux Just some idle brainstorming on the subject... It seems the only way to handle network pipes significantly larger (delay * bandwidth product) than the processor cache is to make freeing retransmit data o(n). Now, there are some ways to reduce the constant factor. The one that comes to mind first is to not queue sk_buffs. Throw away the struct sk_buff after transmission and just queue skb_frag_structs, pages, or maybe even higher-order pages of data. Then freeing the data when it's acked has a much smaller constant factor, particularly d-cache footprint, and no slab operations. The downside is more work to recreate the skb if you do have to retransmit, but optimizing for retransmits is silly. Some implementations could leave large chunks of memory locked until all of the sk_buff->skb_shared_info->skb_frag_structs referencing them have gone away, but you can look at the transmit window when deciding how big a chunk size to use. Then, to actually get below O(n), you want to keep the queued data in a data structure known to the memory manager. Basically, splice the retransmit queue onto the free list. It may require some kludgery in the memory manager. In particular, doing that in O(1) time obviously means that you can't coalesce adjacent free regions to build higher-order pages. So you'd have to have a threshold for uncoalesced pages and a way to force coalescing under memory pressure. You're just deferring work until the page is allocated, but the point is that then it's okay to bring it into cache when it's about to be used again. It's the redundant round trip through the cache just because an ack arrived that's annoying. 
I've done this sort of thing with specialized fixed-block-size allocators before (an alpha-beta minimax search tree allocates nodes one at a time, but frees whole subtrees at once), but might it be feasible for kernel use? ^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2008-01-09 18:23 UTC | newest]
Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <aa7d2c6d0711261023m3d2dd850o76a8f44aef022f39@mail.gmail.com>
[not found] ` <001001c83063$9adbc9d0$d5897e82@csp.uiuc.edu>
2007-11-28 23:47 ` [PATCH] tcp-illinois: incorrect beta usage Stephen Hemminger
2007-11-29 0:25 ` Lachlan Andrew
2007-11-29 0:43 ` Stephen Hemminger
2007-11-29 5:26 ` Shao Liu
2007-12-03 22:52 ` [RFC] TCP illinois max rtt aging Stephen Hemminger
2007-12-03 23:06 ` Lachlan Andrew
2007-12-03 23:59 ` Shao Liu
2007-12-04 0:32 ` Stephen Hemminger
2007-12-04 1:23 ` Lachlan Andrew
2007-12-04 8:37 ` Ilpo Järvinen
2007-12-07 3:27 ` Lachlan Andrew
2007-12-07 11:05 ` Ilpo Järvinen
2007-12-07 12:41 ` David Miller
2007-12-07 13:05 ` Ilpo Järvinen
2007-12-07 18:27 ` Ilpo Järvinen
2007-12-08 1:32 ` David Miller
2007-12-11 11:59 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one (Was: Re: [RFC] TCP illinois max rtt aging) Ilpo Järvinen
2007-12-11 12:32 ` [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one David Miller
2007-12-12 0:14 ` Lachlan Andrew
2007-12-12 15:11 ` David Miller
2007-12-12 23:35 ` Lachlan Andrew
2007-12-12 23:38 ` David Miller
2007-12-13 0:00 ` Stephen Hemminger
2007-12-15 9:51 ` SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one) Ilpo Järvinen
2008-01-08 7:36 ` SACK scoreboard David Miller
2008-01-08 12:12 ` Ilpo Järvinen
2008-01-09 7:58 ` David Miller
2008-01-08 16:51 ` John Heffner
2008-01-08 22:44 ` David Miller
2008-01-09 1:34 ` Lachlan Andrew
2008-01-09 6:35 ` David Miller
2008-01-09 2:25 ` Andi Kleen
2008-01-09 4:27 ` John Heffner
2008-01-09 6:41 ` David Miller
2008-01-09 14:56 ` John Heffner
2008-01-09 18:14 ` SANGTAE HA
2008-01-09 18:23 ` John Heffner
2008-01-09 12:55 ` Ilpo Järvinen
2008-01-09 6:39 ` David Miller
2008-01-09 7:03 ` Andi Kleen
2008-01-09 7:16 ` David Miller
2008-01-09 9:47 ` Evgeniy Polyakov
2008-01-09 14:02 ` Andi Kleen
2007-11-29 14:12 ` [PATCH] tcp-illinois: incorrect beta usage Herbert Xu
2008-01-09 6:04 SACK scoreboard linux