* Re: Socket buffer sizes with autotuning
From: Jerry Chu @ 2008-04-23 23:29 UTC (permalink / raw)
To: netdev
I've been seeing the same problem here and am trying to fix it.
My fix is to not count packets still sitting in the host queue as
"prior_in_flight" when feeding the latter to tcp_cong_avoid(). This should
make the tcp_is_cwnd_limited() test fail when the in_flight build-up is due
entirely to the large host queue, and stop cwnd from growing beyond what's
really necessary.

sysctl_tcp_tso_win_divisor also causes cwnd to be inflated quite a bit,
unnecessarily, when TSO is enabled.
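A minimal sketch of the idea (the tp->host_queued counter here is a
hypothetical name for illustration, not part of any posted patch):

        /* prior_in_flight normally comes straight from tcp_packets_in_flight();
         * discount segments still queued inside the host (qdisc/driver ring)
         * so that tcp_is_cwnd_limited() fails when the build-up is local only.
         */
        u32 prior_in_flight = tcp_packets_in_flight(tp) - tp->host_queued;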
Jerry
> From: David Miller <davem@davemloft.net>
> Date: Tue, Apr 22, 2008 at 8:59 PM
> Subject: Re: Socket buffer sizes with autotuning
> To: johnwheffner@gmail.com
> Cc: rick.jones2@hp.com, netdev@vger.kernel.org
>
> From: "John Heffner" <johnwheffner@gmail.com>
> Date: Tue, 22 Apr 2008 19:17:39 -0700
>
> > On Tue, Apr 22, 2008 at 5:38 PM, Rick Jones <rick.jones2@hp.com> wrote:
> > > oslowest:~# netstat -an | grep ESTAB
> > > ...
> > > tcp 0 2760560 10.208.0.1:40500 10.208.0.45:42049 ESTABLISHED
> > > ...
> > >
> > > Is this expected behaviour?
> >
>
> > What is your interface txqueuelen and mtu? If you have a very large
> > interface queue, TCP will happily fill it up unless you are using a
> > delay-based congestion controller.
>
>
> Yes, that's the fundamental problem with loss based congestion
> control. If there are any queues in the path, TCP will fill them up.
>
> Vegas and other similar techniques are able to avoid this, but come
> with the fundamental flaw that it's easy to get them into situations
> where they do not respond to increases in pipe space adequately, and
> thus underperform compared to loss based algorithms.
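For a sense of scale (illustrative arithmetic, assuming a typical default
txqueuelen of 1000 and a 1500-byte MTU): the interface queue alone can hold
roughly 1000 * 1500 bytes = 1.5 MB, which at 1 Gbit/s amounts to about 12 ms
of standing queueing delay that a loss-based sender will keep full.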
* Re: Socket buffer sizes with autotuning

From: John Heffner @ 2008-04-24 16:32 UTC (permalink / raw)
To: Jerry Chu; +Cc: netdev

On Wed, Apr 23, 2008 at 4:29 PM, Jerry Chu <hkchu@google.com> wrote:
>
> I've been seeing the same problem here and am trying to fix it.
> My fix is to not count packets still sitting in the host queue as
> "prior_in_flight" when feeding the latter to tcp_cong_avoid(). This should
> make the tcp_is_cwnd_limited() test fail when the in_flight build-up is due
> entirely to the large host queue, and stop cwnd from growing beyond what's
> really necessary.

Sounds like a useful optimization.  Do you have a patch?

  -John
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-04-25  0:49 UTC (permalink / raw)
To: John Heffner; +Cc: netdev, rick.jones2, davem

On Thu, Apr 24, 2008 at 9:32 AM, John Heffner <johnwheffner@gmail.com> wrote:
>
> On Wed, Apr 23, 2008 at 4:29 PM, Jerry Chu <hkchu@google.com> wrote:
> >
> > I've been seeing the same problem here and am trying to fix it.
> > My fix is to not count packets still sitting in the host queue as
> > "prior_in_flight" when feeding the latter to tcp_cong_avoid(). This should
> > make the tcp_is_cwnd_limited() test fail when the in_flight build-up is due
> > entirely to the large host queue, and stop cwnd from growing beyond what's
> > really necessary.
>
> Sounds like a useful optimization.  Do you have a patch?

I'm working on one, but I still need to completely root-cause the problem
first and do a lot more testing. Like Rick Jones, I had thought for a while
that either the autotuning or the Congestion Window Validation (RFC 2861)
code should dampen the cwnd growth, so the bug must be there; that held
until last week, when I decided to get to the bottom of this problem.

One question: I currently use skb_shinfo(skb)->dataref == 1 for skb's on the
sk_write_queue list as the heuristic to determine whether a packet has hit
the wire. This seems a good solution for the normal cases, without requiring
driver changes to notify TCP in the xmit completion path. But I can imagine
cases where another below-IP consumer of the skb, e.g. tcpdump, nullifies
the heuristic. If the below-IP consumer causes the skb ref count to drop to 1
prematurely, the inflated-cwnd problem comes back, but it's no worse than
before. What if the below-IP skb reader causes the ref count to remain > 1
long after the packets have hit the wire? That could keep the fix from
growing cwnd when it should grow, hurting performance. Is there a better
solution than checking dataref to determine whether a packet has hit the
wire?

Also, the code that decides when and how much to defer in the TSO path
seems too aggressive. It's currently based on a percentage
(sysctl_tcp_tso_win_divisor) of min(snd_wnd, snd_cwnd), which can be a lot
when the window is large. E.g., when I disable sysctl_tcp_tso_win_divisor,
the cwnd of my simple netperf run drops by exactly 1/3, from 1037 (segments)
to 695. It seems to me the TSO defer threshold should be based on an
absolute count, e.g. 64KB.

Jerry
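For reference, a sketch of the heuristic being described (illustrative only,
not a submitted patch; the helper name is made up, and it assumes the
dataref/SKB_DATAREF_MASK layout of kernels of that era):

        /* Treat a queued skb as having hit the wire once no lower-layer clone
         * still references its payload, i.e. the data refcount is back to 1.
         */
        static inline int tcp_skb_left_host(const struct sk_buff *skb)
        {
                return (atomic_read(&skb_shinfo(skb)->dataref) &
                        SKB_DATAREF_MASK) == 1;
        }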
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-04-25  6:46 UTC (permalink / raw)
To: hkchu; +Cc: johnwheffner, netdev, rick.jones2

From: "Jerry Chu" <hkchu@google.com>
Date: Thu, 24 Apr 2008 17:49:33 -0700

> One question: I currently use skb_shinfo(skb)->dataref == 1 for skb's on the
> sk_write_queue list as the heuristic to determine whether a packet has hit
> the wire.

This doesn't work for the reasons that you mention in detail next :-)

> Is there a better solution than checking dataref to determine whether a
> packet has hit the wire?

Unfortunately, no there isn't.

Part of the issue is that the driver is only working with a clone, but
if a packet gets resent before the driver gives up its reference,
we'll make a completely new copy.

But even assuming we could say that the driver gets a clone all the
time, the "sent" state would need to be in the shared data area.

> Also, the code that decides when and how much to defer in the TSO path
> seems too aggressive. It's currently based on a percentage
> (sysctl_tcp_tso_win_divisor) of min(snd_wnd, snd_cwnd), which can be a lot
> when the window is large. E.g., when I disable sysctl_tcp_tso_win_divisor,
> the cwnd of my simple netperf run drops by exactly 1/3, from 1037 (segments)
> to 695. It seems to me the TSO defer threshold should be based on an
> absolute count, e.g. 64KB.

This is one of the most difficult knobs to get right in the TSO code.

If the percentage is too low, you'll notice that cpu utilization
increases because you aren't accumulating enough data to send down the
largest possible TSO frames.

But yes, you are absolutely right that we should have a hard limit
of 64K here, since we can't build a larger TSO frame anyways.

In fact I thought we had something like that here already :-/

Wait, in fact we do, it's just hidden behind a variable now:

        /* If a full-sized TSO skb can be sent, do it. */
        if (limit >= sk->sk_gso_max_size)
                goto send_now;

:-)
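For context, the deferral decision being discussed looks roughly like this
(a paraphrased sketch of that era's tcp_tso_should_defer(), not a verbatim
excerpt):

        limit = min(send_win, cong_win);        /* bytes sendable right now */

        /* If a full-sized TSO skb can be sent, do it. */
        if (limit >= sk->sk_gso_max_size)
                goto send_now;

        /* Otherwise, send now only once the chunk reaches a fraction
         * (1/sysctl_tcp_tso_win_divisor) of the current window; with a
         * large cwnd that fraction can itself be hundreds of KB. */
        if (sysctl_tcp_tso_win_divisor) {
                u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache);

                chunk /= sysctl_tcp_tso_win_divisor;
                if (limit >= chunk)
                        goto send_now;
        }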
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-04-25 21:29 UTC (permalink / raw)
To: David Miller; +Cc: johnwheffner, netdev, rick.jones2

On Thu, Apr 24, 2008 at 11:46 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Thu, 24 Apr 2008 17:49:33 -0700
>
> > One question: I currently use skb_shinfo(skb)->dataref == 1 for skb's on the
> > sk_write_queue list as the heuristic to determine whether a packet has hit
> > the wire.
>
> This doesn't work for the reasons that you mention in detail next :-)
>
> > Is there a better solution than checking dataref to determine whether a
> > packet has hit the wire?
>
> Unfortunately, no there isn't.
>
> Part of the issue is that the driver is only working with a clone, but
> if a packet gets resent before the driver gives up its reference,
> we'll make a completely new copy.

I think we can ignore this case if it happens rarely.

> But even assuming we could say that the driver gets a clone all the
> time, the "sent" state would need to be in the shared data area.

Ok.

> > Also, the code that decides when and how much to defer in the TSO path
> > seems too aggressive. [...]
>
> This is one of the most difficult knobs to get right in the TSO code.
>
> If the percentage is too low, you'll notice that cpu utilization
> increases because you aren't accumulating enough data to send down the
> largest possible TSO frames.

Well, there is a fine line to walk between CPU efficiency and traffic
burstiness. The TSO defer code causes a few hundred KB of bursts that
quickly blow away our small switch buffers. The matter may get even worse
for 10GE.

> But yes, you are absolutely right that we should have a hard limit
> of 64K here, since we can't build a larger TSO frame anyways.
>
> In fact I thought we had something like that here already :-/
>
> Wait, in fact we do, it's just hidden behind a variable now:
>
>         /* If a full-sized TSO skb can be sent, do it. */
>         if (limit >= sk->sk_gso_max_size)
>                 goto send_now;

Oh, I just realized I've been working on a very "old" (2.6.18 :-) version of
the kernel. I will get the latest 2.6.25 and take a look. I can't find the
skb_release_all() function you pointed at in a later mail either. I guess
the Linux kernel code gets rewritten every few months :-(.

Jerry
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-04-25 21:35 UTC (permalink / raw)
To: hkchu; +Cc: johnwheffner, netdev, rick.jones2

From: "Jerry Chu" <hkchu@google.com>
Date: Fri, 25 Apr 2008 14:29:25 -0700

> I will get the latest 2.6.25 and take a look. I can't find the
> skb_release_all() function you pointed at in a later mail either. I guess
> the Linux kernel code gets rewritten every few months :-(.

You want bugs fixed, like the one you reported here, don't you?
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-04-28 18:30 UTC (permalink / raw)
To: David Miller; +Cc: johnwheffner, netdev, rick.jones2

On Thu, Apr 24, 2008 at 11:46 PM, David Miller <davem@davemloft.net> wrote:
> [...]
>
> But yes, you are absolutely right that we should have a hard limit
> of 64K here, since we can't build a larger TSO frame anyways.
>
> In fact I thought we had something like that here already :-/
>
> Wait, in fact we do, it's just hidden behind a variable now:
>
>         /* If a full-sized TSO skb can be sent, do it. */
>         if (limit >= sk->sk_gso_max_size)
>                 goto send_now;
>
> :-)

Correct, but its counterpart doesn't exist in tcp_is_cwnd_limited(). So
cwnd will continue to grow when left < cwnd/sysctl_tcp_tso_win_divisor,
which can be very large when cwnd is large.

If I change tcp_tso_win_divisor to 0, cwnd maxes out at 695 rather than
1037, exactly off by 1/3. I tried to add the same check to
tcp_is_cwnd_limited():

diff -c /tmp/tcp.h.old tcp.h
*** /tmp/tcp.h.old      Mon Apr 28 11:00:44 2008
--- tcp.h       Mon Apr 28 10:54:10 2008
***************
*** 828,833 ****
--- 828,835 ----
                return 0;
  
        left = tp->snd_cwnd - in_flight;
+       if (left >= 65536)
+               return 0;
        if (sysctl_tcp_tso_win_divisor)
                return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
        else

But it doesn't seem to help (cwnd still grows to 1037).

Jerry
* Re: Socket buffer sizes with autotuning

From: John Heffner @ 2008-04-28 19:21 UTC (permalink / raw)
To: Jerry Chu; +Cc: David Miller, netdev, rick.jones2

On Mon, Apr 28, 2008 at 11:30 AM, Jerry Chu <hkchu@google.com> wrote:
> Correct, but its counterpart doesn't exist in tcp_is_cwnd_limited(). So
> cwnd will continue to grow when left < cwnd/sysctl_tcp_tso_win_divisor,
> which can be very large when cwnd is large.
>
> If I change tcp_tso_win_divisor to 0, cwnd maxes out at 695 rather than
> 1037, exactly off by 1/3. I tried to add the same check to
> tcp_is_cwnd_limited():
>
> [...]
>
> But it doesn't seem to help (cwnd still grows to 1037).

Cwnd is in packets, not bytes.

Try this series of patches, against net-next.

  -John

[-- Attachment #2: 0001-NET-Allow-send-limited-cwnd-to-grow-up-to-max_bur.patch --]

From 57039cfc486ae4deac3b1fecaa238b5483c034df Mon Sep 17 00:00:00 2001
From: John Heffner <johnwheffner@gmail.com>
Date: Mon, 28 Apr 2008 12:17:49 -0700
Subject: [PATCH 1/2] [NET]: Allow send-limited cwnd to grow up to max_burst
 when gso disabled

This changes the logic in tcp_is_cwnd_limited() so that cwnd may grow
up to tcp_max_burst() even when sk_can_gso() is false, or when
sysctl_tcp_tso_win_divisor != 0.

Signed-off-by: John Heffner <johnwheffner@gmail.com>
---
 net/ipv4/tcp_cong.c |   11 ++++-------
 1 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 3a6be23..bfb1996 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -285,14 +285,11 @@ int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
        if (in_flight >= tp->snd_cwnd)
                return 1;
 
-       if (!sk_can_gso(sk))
-               return 0;
-
        left = tp->snd_cwnd - in_flight;
-       if (sysctl_tcp_tso_win_divisor)
-               return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
-       else
-               return left <= tcp_max_burst(tp);
+       if (sk_can_gso(sk) &&
+           left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd)
+               return 1;
+       return left <= tcp_max_burst(tp);
 }
 EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited);
-- 
1.5.2.5

[-- Attachment #3: 0002-NET-Limit-cwnd-growth-when-deferring-for-GSO.patch --]

From fc1d9fe99748f3811a0a5b9d93463cf0e4788f56 Mon Sep 17 00:00:00 2001
From: John Heffner <johnwheffner@gmail.com>
Date: Mon, 28 Apr 2008 12:19:22 -0700
Subject: [PATCH 2/2] [NET]: Limit cwnd growth when deferring for GSO

This fixes inappropriately large cwnd growth on sender-limited flows
when GSO is enabled, limiting cwnd growth to 64k.

Signed-off-by: John Heffner <johnwheffner@gmail.com>
---
 net/ipv4/tcp_cong.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index bfb1996..6a25082 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -287,7 +287,8 @@ int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
 
        left = tp->snd_cwnd - in_flight;
        if (sk_can_gso(sk) &&
-           left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd)
+           left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
+           left * tp->mss_cache < sk->sk_gso_max_size)
                return 1;
        return left <= tcp_max_burst(tp);
 }
-- 
1.5.2.5
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-04-28 20:44 UTC (permalink / raw)
To: John Heffner; +Cc: David Miller, netdev, rick.jones2

On Mon, Apr 28, 2008 at 12:21 PM, John Heffner <johnwheffner@gmail.com> wrote:
>
> On Mon, Apr 28, 2008 at 11:30 AM, Jerry Chu <hkchu@google.com> wrote:
> > Correct, but its counterpart doesn't exist in tcp_is_cwnd_limited(). So
> > cwnd will continue to grow when left < cwnd/sysctl_tcp_tso_win_divisor,
> > which can be very large when cwnd is large.
> >
> > [...]
> >
> > But it doesn't seem to help (cwnd still grows to 1037).
>
> Cwnd is in packets, not bytes.

Yes, of course.

> Try this series of patches, against net-next.

Ah, you already know about this problem. Yes it does the trick.

Jerry

> -John
* [PATCH 1/2] [NET]: Allow send-limited cwnd to grow up to max_burst when gso disabled

From: John Heffner @ 2008-04-28 23:22 UTC (permalink / raw)
To: davem; +Cc: netdev, John Heffner

This changes the logic in tcp_is_cwnd_limited() so that cwnd may grow
up to tcp_max_burst() even when sk_can_gso() is false, or when
sysctl_tcp_tso_win_divisor != 0.

Signed-off-by: John Heffner <johnwheffner@gmail.com>
---
 net/ipv4/tcp_cong.c |   11 ++++-------
 1 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 3a6be23..bfb1996 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -285,14 +285,11 @@ int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
        if (in_flight >= tp->snd_cwnd)
                return 1;
 
-       if (!sk_can_gso(sk))
-               return 0;
-
        left = tp->snd_cwnd - in_flight;
-       if (sysctl_tcp_tso_win_divisor)
-               return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
-       else
-               return left <= tcp_max_burst(tp);
+       if (sk_can_gso(sk) &&
+           left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd)
+               return 1;
+       return left <= tcp_max_burst(tp);
 }
 EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited);
-- 
1.5.2.5
* [PATCH 2/2] [NET]: Limit cwnd growth when deferring for GSO

From: John Heffner @ 2008-04-28 23:22 UTC (permalink / raw)
To: davem; +Cc: netdev, John Heffner

This fixes inappropriately large cwnd growth on sender-limited flows
when GSO is enabled, limiting cwnd growth to 64k.

Signed-off-by: John Heffner <johnwheffner@gmail.com>
---
 net/ipv4/tcp_cong.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index bfb1996..6a25082 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -287,7 +287,8 @@ int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
 
        left = tp->snd_cwnd - in_flight;
        if (sk_can_gso(sk) &&
-           left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd)
+           left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
+           left * tp->mss_cache < sk->sk_gso_max_size)
                return 1;
        return left <= tcp_max_burst(tp);
 }
-- 
1.5.2.5
[parent not found: <d1c2719f0804281338j3984cf2bga31def0c2c1192a1@mail.gmail.com>]
* Re: Socket buffer sizes with autotuning

From: John Heffner @ 2008-04-28 23:28 UTC (permalink / raw)
To: Jerry Chu; +Cc: David Miller, netdev, rick.jones2

On Mon, Apr 28, 2008 at 1:38 PM, Jerry Chu <hkchu@google.com> wrote:
> > Try this series of patches, against net-next.
>
> Ah, you already know about this problem. Yes it does the trick.

I had not actually known about this.  (Just cooked up the patches
after I saw your mail.)  This looks to me like a clear bug.  After
looking carefully at the code there, it seems there was another small
problem, too, though not likely to have much effect.

Dave, can you apply the patches I just sent?

Thanks,
  -John
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-04-28 23:35 UTC (permalink / raw)
To: johnwheffner; +Cc: hkchu, netdev, rick.jones2

From: "John Heffner" <johnwheffner@gmail.com>
Date: Mon, 28 Apr 2008 16:28:16 -0700

> Dave, can you apply the patches I just sent?

Sure thing.
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-04-29  2:20 UTC (permalink / raw)
To: John Heffner; +Cc: David Miller, netdev, rick.jones2

Just for the record, in my netperf RR (1MB request/20B reply) over 1GE/tg3
test on top of 2.6.18, changing tcp_tso_win_divisor to 0 from the default of
3 gets the cwnd down to 695 from 1037. Applying the bound check
(left * tp->mss_cache < 65536) gets cwnd down to 737, not 695. Still a big
improvement, though. (Anyway, this may be moot after I fix the real culprit.)

Jerry

On Mon, Apr 28, 2008 at 4:28 PM, John Heffner <johnwheffner@gmail.com> wrote:
>
> On Mon, Apr 28, 2008 at 1:38 PM, Jerry Chu <hkchu@google.com> wrote:
> > > Try this series of patches, against net-next.
> >
> > Ah, you already know about this problem. Yes it does the trick.
>
> I had not actually known about this.  (Just cooked up the patches
> after I saw your mail.)  This looks to me like a clear bug.  After
> looking carefully at the code there, it seems there was another small
> problem, too, though not likely to have much effect.
>
> Dave, can you apply the patches I just sent?
>
> Thanks,
>   -John
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-04-25  7:05 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Wed, 23 Apr 2008 16:29:58 -0700

> I've been seeing the same problem here and am trying to fix it.
> My fix is to not count packets still sitting in the host queue as
> "prior_in_flight" when feeding the latter to tcp_cong_avoid(). This should
> make the tcp_is_cwnd_limited() test fail when the in_flight build-up is due
> entirely to the large host queue, and stop cwnd from growing beyond what's
> really necessary.

Does something like the following suit your needs?

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 299ec4b..6cdf4be 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -140,6 +140,7 @@ struct skb_frag_struct {
  */
 struct skb_shared_info {
        atomic_t        dataref;
+       atomic_t        *in_flight;
        unsigned short  nr_frags;
        unsigned short  gso_size;
        /* Warning: this field is not always filled in (UFO)! */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d96d9b1..62bb58d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -271,6 +271,8 @@ struct tcp_sock {
        u32     rcv_tstamp;     /* timestamp of last received ACK (for keepalives) */
        u32     lsndtime;       /* timestamp of last sent data packet (for restart window) */
 
+       atomic_t host_inflight; /* packets queued in transmit path */
+
        /* Data for direct copy to user */
        struct {
                struct sk_buff_head     prequeue;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4fe605f..a6880c2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -212,6 +212,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
        /* make sure we initialize shinfo sequentially */
        shinfo = skb_shinfo(skb);
        atomic_set(&shinfo->dataref, 1);
+       shinfo->in_flight = NULL;
        shinfo->nr_frags  = 0;
        shinfo->gso_size = 0;
        shinfo->gso_segs = 0;
@@ -403,6 +404,8 @@ static void skb_release_all(struct sk_buff *skb)
 void __kfree_skb(struct sk_buff *skb)
 {
        skb_release_all(skb);
+       if (skb_shinfo(skb)->in_flight)
+               atomic_dec(skb_shinfo(skb)->in_flight);
        kfree_skbmem(skb);
 }
 
@@ -486,6 +489,8 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
        atomic_set(&n->users, 1);
 
        atomic_inc(&(skb_shinfo(skb)->dataref));
+       if (skb_shinfo(skb)->in_flight)
+               atomic_inc(skb_shinfo(skb)->in_flight);
        skb->cloned = 1;
 
        return n;
@@ -743,6 +748,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
        skb->hdr_len  = 0;
        skb->nohdr    = 0;
        atomic_set(&skb_shinfo(skb)->dataref, 1);
+       skb_shinfo(skb)->in_flight = NULL;
        return 0;
 
 nodata:
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f886531..28a71fd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -479,6 +479,7 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
        struct tcp_sock *tp = tcp_sk(sk);
        struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
+       skb_shinfo(skb)->in_flight = &tp->host_inflight;
        skb->csum    = 0;
        tcb->seq     = tcb->end_seq = tp->write_seq;
        tcb->flags   = TCPCB_FLAG_ACK;
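The patch above only maintains the counter; the consumer side is not shown.
A plausible sketch of that side (assuming host_inflight ends up counting
whole packets, which the follow-ups below show is not yet the case for
clones and GSO):

        /* When computing prior_in_flight for the congestion-avoidance hook,
         * discount segments still queued inside the host rather than on
         * the network. */
        u32 in_flight = tcp_packets_in_flight(tp) -
                        atomic_read(&tp->host_inflight);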
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-07  3:57 UTC (permalink / raw)
To: David Miller; +Cc: netdev

I fail to see how adding shinfo->in_flight to count outstanding clones can
help account for the number of "host_inflight" pkts. Part of the problem, as
you've mentioned before, is that the driver may not always get a clone. It
may be getting a copy (e.g., when GSO is on?), hence losing all its
connection to the original tp and any chance to have the pkt properly
accounted for as host_inflight by TCP. The skb may also be cloned more than
once (e.g., due to tcpdump)...

That said, I also fail to come up with a more bullet-proof solution after
studying much of the TSO/GSO code, without requiring driver changes and more
skb changes. So I'm currently leaning toward my original fix of checking

        (1 == (atomic_read(&skb_shinfo(skb1)->dataref) & SKB_DATAREF_MASK))

My current prototype scans either sk_send_head or sk_write_queue backwards
until the above condition is true. I'm thinking about adding and maintaining
a new "tp->host_queue_head" field to avoid most of the scanning. Also, it
seems much less costly to add a new field to tcp_sock than to
skb/skb_shared_info.

If you have a better idea please let me know.

Jerry

On Fri, Apr 25, 2008 at 12:05 AM, David Miller <davem@davemloft.net> wrote:
>
> Does something like the following suit your needs?
>
> [...]
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-05-07  4:27 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Tue, 6 May 2008 20:57:46 -0700

> I fail to see how adding shinfo->in_flight to count outstanding clones can
> help account for the number of "host_inflight" pkts. Part of the problem,
> as you've mentioned before, is that the driver may not always get a clone.

Sure, but it will get one 99.9999% of the time.

With TCP, as long as that clone is alive the driver has it.  And the
counter only counts the clones.

Anyways, did you even test my patch and try to use it for your needs
or is this analysis purely from your inspection of it? :-/
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-07 18:36 UTC (permalink / raw)
To: David Miller; +Cc: netdev

On Tue, May 6, 2008 at 9:27 PM, David Miller <davem@davemloft.net> wrote:
>
> From: "Jerry Chu" <hkchu@google.com>
> Date: Tue, 6 May 2008 20:57:46 -0700
>
> > I fail to see how adding shinfo->in_flight to count outstanding clones can
> > help account for the number of "host_inflight" pkts. Part of the problem,
> > as you've mentioned before, is that the driver may not always get a clone.
>
> Sure, but it will get one 99.9999% of the time.

I couldn't care less about some rare boundary case either. What I'm
concerned about is whether there is some popular below-IP module/feature
that, when turned on, will render my host_inflight check (using dataref==1
as an indication that the pkt has left the host) completely bogus.

Turning on pkt tapping (e.g., tcpdump) seems to be one such case, although
one may argue that belongs to the edge usage conditions that we don't need
to worry about. (OTOH it may be undesirable for tcpdump to have a
significant performance side effect on those who are trying to use tcpdump
to debug a performance problem.) Turning on GSO but not TSO seems to be
another case where the driver gets a copy all the time (kernel version
2.6.18). If there is no other popular below-IP module/feature that will
break the dataref==1 check, then perhaps my initial concern was invalid.

In either case, I don't see your patch being any better than my dataref==1
check. Neither one addresses the below-IP module/feature concern described
above.

> With TCP, as long as that clone is alive the driver has it.  And the
> counter only counts the clones.
>
> Anyways, did you even test my patch and try to use it for your needs
> or is this analysis purely from your inspection of it? :-/

No, I haven't tested your patch. I tried to understand skb better before
applying your patch. After I studied a bunch of code, I came to the
conclusion that your patch won't work for me. First, it tracks the number of
clones, which is not what I need; e.g., tcpdump will cause host_inflight to
be grossly wrong. OK, maybe we can ignore tcpdump, but your patch counts the
number of cloned skbs whereas I need a count of the number of pkts. Perhaps
this can be fixed too, but it then dawned on me: wouldn't it be more
desirable to add the space overhead per tcp_sock than per skb?

Jerry
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-05-07 21:18 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Wed, 7 May 2008 11:36:59 -0700

> No, I haven't tested your patch. I tried to understand skb better before
> applying your patch. After I studied a bunch of code, I came to the
> conclusion that your patch won't work for me. First, it tracks the number
> of clones, which is not what I need; e.g., tcpdump will cause host_inflight
> to be grossly wrong.

We can make sub-clones not count.

Also, we already can distinguish this case, because all SKB clones
made by TCP are fast-clones.  So we could only bump the counter for
fast clones.  If tcpdump clones it again, it won't be a fast clone and
therefore we can avoid bumping the counter in that case.  Similarly
for other features that want to clone.

Please try to get your idea working with my infrastructure.  We
can modify it to behave however you need it to, but at the core
it's the idea that tracks the state most directly and properly.
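A sketch of the "only bump the counter for fast clones" idea (illustrative;
the exact hook and placement were never settled in this thread):

        /* Count only the fast-clone child that TCP hands to the device;
         * ordinary clones made by taps such as tcpdump are not counted. */
        if (n->fclone == SKB_FCLONE_CLONE && skb_shinfo(skb)->in_flight)
                atomic_inc(skb_shinfo(skb)->in_flight);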
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-08  1:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev

On Wed, May 7, 2008 at 2:18 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Wed, 7 May 2008 11:36:59 -0700
>
> > No, I haven't tested your patch. [...]
>
> We can make sub-clones not count.
>
> Also, we already can distinguish this case, because all SKB clones
> made by TCP are fast-clones.  So we could only bump the counter for
> fast clones.  If tcpdump clones it again, it won't be a fast clone and
> therefore we can avoid bumping the counter in that case.  Similarly
> for other features that want to clone.

Ok, I just did a Google search for "skb fast clone" and found some postings
from you. Will take a look.

> Please try to get your idea working with my infrastructure.  We
> can modify it to behave however you need it to, but at the core
> it's the idea that tracks the state most directly and properly.

Ok, will give it a try. First I'll fix your patch to atomic_add()/atomic_sub()
by skb_shinfo(skb)->gso_segs rather than always 1, in order for GSO/TSO to
work.

One problem that came to mind: it seems possible for __kfree_skb() to access
a skb_shinfo(skb)->in_flight whose tp has already been freed, since only the
original skb's on TCP's rexmit list have the owner set and the socket held.
One solution is for TCP to zap the skb_shinfo(skb)->in_flight field when it's
ready to free up the skb. I can hack sock_wfree() to do this, but I don't
know how to do it right.

Jerry
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-05-08  1:43 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Wed, 7 May 2008 18:37:01 -0700

> Ok, will give it a try. First I'll fix your patch to atomic_add()/atomic_sub()
> by skb_shinfo(skb)->gso_segs rather than always 1, in order for GSO/TSO to
> work.

That might not work.  gso_segs can change over time as retransmit
packets get split up due to SACKs etc.  It needs to be audited,
at the very least.

> One problem that came to mind: it seems possible for __kfree_skb() to access
> a skb_shinfo(skb)->in_flight whose tp has already been freed, since only the
> original skb's on TCP's rexmit list have the owner set and the socket held.
> One solution is for TCP to zap the skb_shinfo(skb)->in_flight field when it's
> ready to free up the skb. I can hack sock_wfree() to do this, but I don't
> know how to do it right.

There will be references to the socket, so this should be ok.

If it isn't we can adjust the count and zap the pointer in
skb_orphan().
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-08  3:33 UTC (permalink / raw)
To: David Miller; +Cc: netdev

There seems to be quite a bit of complexity, plus one additional pointer
field per skb_shared_info, needed to make skb track more precisely when a
pkt leaves the host. Now I wonder if it's really a better solution than my
original approach of simply checking dataref==1, which, although not bullet
proof, may be "good enough" for all practical purposes?

Jerry

On Wed, May 7, 2008 at 6:43 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Wed, 7 May 2008 18:37:01 -0700
>
> > Ok, will give it a try. First I'll fix your patch to atomic_add()/atomic_sub()
> > by skb_shinfo(skb)->gso_segs rather than always 1, in order for GSO/TSO to
> > work.
>
> That might not work.  gso_segs can change over time as retransmit
> packets get split up due to SACKs etc.  It needs to be audited,
> at the very least.
>
> > [...]
>
> There will be references to the socket, so this should be ok.
>
> If it isn't we can adjust the count and zap the pointer in
> skb_orphan().
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-12 22:22 UTC (permalink / raw)
To: David Miller; +Cc: netdev

I did a quick prototype based on your idea of adding an "in_flight" field to
skb_shared_info to track how many in-flight clones are in the host. I tested
it quickly and it doesn't work. After some thought it was obvious why: what
the TCP stack needs is to track how many in-flight pkts are in the host, but
your proposed patch increments "in_flight" once, on the first __skb_clone()
of the skb sent to the driver, yet decrements "in_flight" TWICE, once for
each of the clones as they are freed. I did a quick hack to make it work for
my limited test case, but I haven't figured out an acceptable (non-hack)
solution.

Continuing the testing, I discovered that the problem I described below,
where "in_flight" may point to a tp that has already been freed, cannot be
addressed by zapping skb_shinfo(skb)->in_flight in sock_wfree(). The reason
is that pkts may be acked and freed by TCP before the driver frees up its
clone copy (e.g., due to the driver's lazy reclaim...). When that happens,
the "host_inflight" accounting will get messed up.

Jerry

On Wed, May 7, 2008 at 8:33 PM, Jerry Chu <hkchu@google.com> wrote:
> There seems to be quite a bit of complexity, plus one additional pointer
> field per skb_shared_info, needed to make skb track more precisely when a
> pkt leaves the host. Now I wonder if it's really a better solution than my
> original approach of simply checking dataref==1, which, although not bullet
> proof, may be "good enough" for all practical purposes?
>
> [...]
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-05-12 22:29 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Mon, 12 May 2008 15:22:55 -0700

> I did a quick prototype based on your idea of adding an "in_flight" field to
> skb_shared_info to track how many in-flight clones are in the host. I tested
> it quickly and it doesn't work. After some thought it was obvious why: what
> the TCP stack needs is to track how many in-flight pkts are in the host, but
> your proposed patch increments "in_flight" once, on the first __skb_clone()
> of the skb sent to the driver, yet decrements "in_flight" TWICE, once for
> each of the clones as they are freed. I did a quick hack to make it work for
> my limited test case, but I haven't figured out an acceptable (non-hack)
> solution.

That's easy to fix, only set the in_flight pointer in the child clone
skb.  Thanks for figuring that out.

> Continuing the testing, I discovered that the problem I described below,
> where "in_flight" may point to a tp that has already been freed, cannot be
> addressed by zapping skb_shinfo(skb)->in_flight in sock_wfree(). The reason
> is that pkts may be acked and freed by TCP before the driver frees up its
> clone copy (e.g., due to the driver's lazy reclaim...). When that happens,
> the "host_inflight" accounting will get messed up.

Simply notice, when we're about to decrement in_flight, that the data
reference is one.  You can take appropriate actions if so.
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-05-12 22:31 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: David Miller <davem@davemloft.net>
Date: Mon, 12 May 2008 15:29:00 -0700 (PDT)

> Simply notice, when we're about to decrement in_flight, that the data
> reference is one.  You can take appropriate actions if so.

Actually, there is an even simpler solution.  Since this is a fast
clone, when the parent SKB is freed we can see if the fast-clone child
is active and has the in_flight atomic_t pointer.  If so, we decrement
in_flight and zap it to NULL.
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-13  3:56 UTC (permalink / raw)
To: David Miller; +Cc: netdev

On Mon, May 12, 2008 at 3:31 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Mon, 12 May 2008 15:29:00 -0700 (PDT)
>
> > Simply notice, when we're about to decrement in_flight, that the data
> > reference is one.  You can take appropriate actions if so.
>
> Actually, there is an even simpler solution.  Since this is a fast
> clone, when the parent SKB is freed we can see if the fast-clone child
> is active and has the in_flight atomic_t pointer.  If so, we decrement
> in_flight and zap it to NULL.

I thought about that too. But what happens if the clone is being freed by
the driver at the same time that TCP is freeing the parent? Isn't there an
intrinsic race condition here?

If true, can we use some atomic operation to zap in_flight, but also return
to the caller whether in_flight had been nullified already, so that the
accounting can be handled correctly?

(Although tp->host_inflight doesn't need to be super accurate, if there is
no simple solution above I'll have to reset tp->host_inflight periodically
and convince myself that the pathological case of host_inflight drifting
forever won't ever happen...)

Jerry
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-05-13  3:58 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Mon, 12 May 2008 20:56:16 -0700

> If true, can we use some atomic operation to zap in_flight, but also return
> to the caller whether in_flight had been nullified already, so that the
> accounting can be handled correctly?

xchg() should work for that purpose.
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-13  4:00 UTC (permalink / raw)
To: David Miller; +Cc: netdev

Does xchg() return the old value? (Can't find any comments on it.)

Jerry

On Mon, May 12, 2008 at 8:58 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Mon, 12 May 2008 20:56:16 -0700
>
> > If true, can we use some atomic operation to zap in_flight, but also return
> > to the caller whether in_flight had been nullified already, so that the
> > accounting can be handled correctly?
>
> xchg() should work for that purpose.
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-05-13  4:02 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Mon, 12 May 2008 21:00:08 -0700

> Does xchg() return the old value? (Can't find any comments on it.)

Yes, otherwise it wouldn't be very useful.
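The pattern under discussion, as a sketch (xchg() returns the previous
value, so only one of the two racing paths that clear the pointer ends up
doing the decrement):

        atomic_t *in_flight = xchg(&skb_shinfo(skb)->in_flight, NULL);

        if (in_flight)
                atomic_dec(in_flight);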
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-17  1:13 UTC (permalink / raw)
To: David Miller; +Cc: netdev

Ok, now the code seems to work when TSO is disabled, i.e., cwnd remains low
(40-60 pkts) and only grows slowly over extended periods on a back-to-back
1GbE link. It also works when tcpdump is running, after I fixed the code,
per your suggestion, to only bump the in_flight count for fast clones.

But it doesn't work when TSO is enabled. I've fixed tso_fragment() to
correctly set skb_shinfo(buff)->in_flight after the split, but cwnd still
grows rapidly to a few hundred pkts (although smaller than without the fix).
The host inflight accounting gets screwed up. It looks like
pskb_expand_head(), called by tcp_tso_acked()->tcp_trim_head(), messes up
the accounting, but I don't know how to fix it (still trying to understand
this complex piece of code). There could be other reasons as well.

Jerry

On Mon, May 12, 2008 at 9:02 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Mon, 12 May 2008 21:00:08 -0700
>
> > Does xchg() return the old value? (Can't find any comments on it.)
>
> Yes, otherwise it wouldn't be very useful.
* Re: Socket buffer sizes with autotuning

From: David Miller @ 2008-05-17  1:29 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Fri, 16 May 2008 18:13:20 -0700

> The host inflight accounting gets screwed up. It looks like
> pskb_expand_head(), called by tcp_tso_acked()->tcp_trim_head(), messes up
> the accounting, but I don't know how to fix it (still trying to understand
> this complex piece of code). There could be other reasons as well.

This is just like freeing up a normal SKB, so decrementing the
in_flight value the appropriate number of packets should do the right
thing.

tcp_tso_acked() calculates this adjustment for you, in packets_acked.
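i.e., something along these lines (a sketch; whether the in_flight pointer
is still valid at this point is exactly the race discussed earlier):

        /* tcp_tso_acked() has computed how many segments of this TSO skb
         * were just acked; retire the same number from the host count. */
        if (skb_shinfo(skb)->in_flight)
                atomic_sub(packets_acked, skb_shinfo(skb)->in_flight);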
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-17  1:47 UTC (permalink / raw)
To: David Miller; +Cc: netdev

On Fri, May 16, 2008 at 6:29 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Fri, 16 May 2008 18:13:20 -0700
>
>> The host inflight accounting gets screwed up. It looks like
>> pskb_expand_head(), called by tcp_tso_acked()->tcp_trim_head(), messes up
>> the accounting, but I don't know how to fix it (still trying to understand
>> this complex piece of code). There could be other reasons as well.
>
> This is just like freeing up a normal SKB, so decrementing the
> in_flight value the appropriate number of packets should do the right
> thing.
>
> tcp_tso_acked() calculates this adjustment for you, in packets_acked.

The current pskb_expand_head() code sets skb_shinfo(skb)->in_flight to NULL,
leaving the in_flight accounting solely to the other clone to handle. To
calculate in_flight more accurately, it seems to require splitting the
gso_segs and in_flight accounting between the clone and the skb (with the
new header), right? Plus any race conditions to take care of...

Hard to think clearly on a Friday afternoon :-(

Jerry
* Re: Socket buffer sizes with autotuning

From: Jerry Chu @ 2008-05-12 22:58 UTC (permalink / raw)
To: David Miller; +Cc: netdev

On Mon, May 12, 2008 at 3:29 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Mon, 12 May 2008 15:22:55 -0700
>
> > I did a quick prototype based on your idea of adding an "in_flight" field
> > to skb_shared_info to track how many in-flight clones are in the host.
> > [...]
>
> That's easy to fix, only set the in_flight pointer in the child clone
> skb.  Thanks for figuring that out.

I must be missing something. All the clones share the same
"skb_shared_info", so how does one set in_flight in only one clone but not
the other?

> > Continuing the testing, I discovered that the problem I described below,
> > where "in_flight" may point to a tp that has already been freed, cannot be
> > addressed by zapping skb_shinfo(skb)->in_flight in sock_wfree().
> > [...]
>
> Simply notice, when we're about to decrement in_flight, that the data
> reference is one.  You can take appropriate actions if so.

Jerry
* Re: Socket buffer sizes with autotuning
  2008-05-12 22:58 ` Jerry Chu
@ 2008-05-12 23:01 ` David Miller
  0 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2008-05-12 23:01 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Mon, 12 May 2008 15:58:58 -0700

> I must be missing something. All the clones share the same
> "skb_shared_info", how does one only set in_flight in one clone but
> not the other?

These are fast clones, we "know" which is the parent and which is the
child.

^ permalink raw reply	[flat|nested] 38+ messages in thread
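For readers following along, the parent/child distinction lives in the skb
itself rather than in the shared skb_shared_info; the enum below is from
include/linux/skbuff.h of that era, while the helper and its use for
in-flight bookkeeping are only a sketch of how the prototype might key off
it:

/* skb->fclone records what role an skb plays when it comes from the
 * fast-clone cache, where a parent and its companion clone are allocated
 * back to back. */
enum {
        SKB_FCLONE_UNAVAILABLE, /* not allocated from the fclone cache */
        SKB_FCLONE_ORIG,        /* the parent, e.g. the skb kept on the write queue */
        SKB_FCLONE_CLONE,       /* the companion clone, e.g. the one handed down */
};

/* Hypothetical helper: only the child clone performs the in-flight
 * bookkeeping, so each transmitted packet is counted exactly once even
 * though both skbs share the same skb_shared_info. */
static int skb_is_fclone_child(const struct sk_buff *skb)
{
        return skb->fclone == SKB_FCLONE_CLONE;
}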
* Re: Socket buffer sizes with autotuning
  2008-05-07 3:57 ` Jerry Chu
  2008-05-07 4:27 ` David Miller
@ 2008-05-07 4:28 ` David Miller
  2008-05-07 18:54 ` Jerry Chu
  1 sibling, 1 reply; 38+ messages in thread
From: David Miller @ 2008-05-07 4:28 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Tue, 6 May 2008 20:57:46 -0700

> It may be getting a copy (e.g., when GSO is on?) hence losing all
> its connection to the original tp and any chance to have the pkt
> properly accounted for as host_inflight by TCP. The skb may also be
> cloned more than once (e.g., due to tcpdump)...

It only gets a copy if the SKB is already cloned and that clone is
alive somewhere (f.e. stuck in the device, a rare occurrence by the
time we retransmit).

It is a clone 99.999999% of the time.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: Socket buffer sizes with autotuning
  2008-05-07 4:28 ` David Miller
@ 2008-05-07 18:54 ` Jerry Chu
  2008-05-07 21:20 ` David Miller
  0 siblings, 1 reply; 38+ messages in thread
From: Jerry Chu @ 2008-05-07 18:54 UTC (permalink / raw)
To: David Miller; +Cc: netdev

On Tue, May 6, 2008 at 9:28 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Tue, 6 May 2008 20:57:46 -0700
>
> > It may be getting a copy (e.g., when GSO is on?) hence losing all
> > its connection to the original tp and any chance to have the pkt
> > properly accounted for as host_inflight by TCP. The skb may also be
> > cloned more than once (e.g., due to tcpdump)...
>
> It only gets a copy if the SKB is already cloned and that clone is
> alive somewhere (f.e. stuck in the device, a rare occurrence by the
> time we retransmit).

This is one of many things about skb that I still don't completely
understand. Why do we have to pskb_copy() in tcp_transmit_skb() if
skb_cloned()? Can't we clone an skb multiple times? Is it due to some
special optimization from the skb->fclone stuff... that imposes this
restriction?

>
> It is a clone 99.999999% of the time.

When I turned on GSO but not TSO, I believe it's a copy 99.999999% of
the time. (I must confess I don't understand the GSO code yet. I
temporarily ran out of steam after studying a bunch of other code.)

Jerry

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: Socket buffer sizes with autotuning
  2008-05-07 18:54 ` Jerry Chu
@ 2008-05-07 21:20 ` David Miller
  2008-05-08 0:16 ` Jerry Chu
  0 siblings, 1 reply; 38+ messages in thread
From: David Miller @ 2008-05-07 21:20 UTC (permalink / raw)
To: hkchu; +Cc: netdev

From: "Jerry Chu" <hkchu@google.com>
Date: Wed, 7 May 2008 11:54:01 -0700

> This is one of many things about skb that I still don't completely
> understand. Why do we have to pskb_copy() in tcp_transmit_skb() if
> skb_cloned()?

The other clone holder owns the packet header area. All packets on
the retransmit queue of TCP are headerless. The call sites down into
the device add the headers.

Therefore we cannot have two paths modifying the headers at the same
time.

> Can't we clone an skb multiple times? Is it due to some special
> optimization from the skb->fclone stuff... that imposes this
> restriction?

No, it has nothing to do with fclone. It has to do with what instance
owns the packet header area in front of the user's TCP data.

^ permalink raw reply	[flat|nested] 38+ messages in thread
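The behaviour being discussed is visible right at the top of
tcp_transmit_skb() in net/ipv4/tcp_output.c; in the kernels of that era it
reads roughly as below (trimmed, so treat it as a paraphrase rather than the
exact source):

/* The skb sitting on the retransmit queue is normally just cloned for
 * transmission.  But if someone else still holds a clone - and therefore
 * shares the headerless data area whose headroom is about to be filled
 * with TCP/IP headers - take a private copy of the header portion with
 * pskb_copy() so two paths never build headers in the same buffer. */
if (unlikely(skb_cloned(skb)))
        skb = pskb_copy(skb, gfp_mask);
else
        skb = skb_clone(skb, gfp_mask);
if (unlikely(!skb))
        return -ENOBUFS;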
* Re: Socket buffer sizes with autotuning
  2008-05-07 21:20 ` David Miller
@ 2008-05-08 0:16 ` Jerry Chu
  0 siblings, 0 replies; 38+ messages in thread
From: Jerry Chu @ 2008-05-08 0:16 UTC (permalink / raw)
To: David Miller; +Cc: netdev

On Wed, May 7, 2008 at 2:20 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Wed, 7 May 2008 11:54:01 -0700
>
> > This is one of many things about skb that I still don't completely
> > understand. Why do we have to pskb_copy() in tcp_transmit_skb() if
> > skb_cloned()?
>
> The other clone holder owns the packet header area. All packets on
> the retransmit queue of TCP are headerless. The call sites down into
> the device add the headers.
>
> Therefore we cannot have two paths modifying the headers at the same
> time.

Oh I see. I think I confused pskb_copy() with skb_copy() and kept
wondering why we would need to copy the whole thing...

> > Can't we clone an skb multiple times? Is it due to some special
> > optimization from the skb->fclone stuff... that imposes this
> > restriction?
>
> No, it has nothing to do with fclone. It has to do with what instance
> owns the packet header area in front of the user's TCP data.

Thanks for explaining.

Jerry

^ permalink raw reply	[flat|nested] 38+ messages in thread
Thread overview: 38+ messages
2008-04-23 23:29 Socket buffer sizes with autotuning Jerry Chu
2008-04-24 16:32 ` John Heffner
2008-04-25 0:49 ` Jerry Chu
2008-04-25 6:46 ` David Miller
2008-04-25 21:29 ` Jerry Chu
2008-04-25 21:35 ` David Miller
2008-04-28 18:30 ` Jerry Chu
2008-04-28 19:21 ` John Heffner
2008-04-28 20:44 ` Jerry Chu
2008-04-28 23:22 ` [PATCH 1/2] [NET]: Allow send-limited cwnd to grow up to max_burst when gso disabled John Heffner
2008-04-28 23:22 ` [PATCH 2/2] [NET]: Limit cwnd growth when deferring for GSO John Heffner
[not found] ` <d1c2719f0804281338j3984cf2bga31def0c2c1192a1@mail.gmail.com>
2008-04-28 23:28 ` Socket buffer sizes with autotuning John Heffner
2008-04-28 23:35 ` David Miller
2008-04-29 2:20 ` Jerry Chu
2008-04-25 7:05 ` David Miller
2008-05-07 3:57 ` Jerry Chu
2008-05-07 4:27 ` David Miller
2008-05-07 18:36 ` Jerry Chu
2008-05-07 21:18 ` David Miller
2008-05-08 1:37 ` Jerry Chu
2008-05-08 1:43 ` David Miller
2008-05-08 3:33 ` Jerry Chu
2008-05-12 22:22 ` Jerry Chu
2008-05-12 22:29 ` David Miller
2008-05-12 22:31 ` David Miller
2008-05-13 3:56 ` Jerry Chu
2008-05-13 3:58 ` David Miller
2008-05-13 4:00 ` Jerry Chu
2008-05-13 4:02 ` David Miller
2008-05-17 1:13 ` Jerry Chu
2008-05-17 1:29 ` David Miller
2008-05-17 1:47 ` Jerry Chu
2008-05-12 22:58 ` Jerry Chu
2008-05-12 23:01 ` David Miller
2008-05-07 4:28 ` David Miller
2008-05-07 18:54 ` Jerry Chu
2008-05-07 21:20 ` David Miller
2008-05-08 0:16 ` Jerry Chu