* small RPS cache for fragments?
From: David Miller @ 2011-05-17 18:33 UTC
To: netdev

It seems to me that we can solve the UDP fragmentation problem for
flow steering very simply by creating a (saddr/daddr/IPID) entry in a
table that maps to the corresponding RPS flow entry.

When we see the initial frag with the UDP header, we create the
saddr/daddr/IPID mapping, and we tear it down when we hit the
saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.

We only inspect the saddr/daddr/IPID cache when iph->frag_off is
non-zero.

It's best effort and should work quite well.

Even a one-behind cache, per-NAPI instance, would do a lot better than
what happens at the moment.  Especially since the IP fragments mostly
arrive as one packet train.
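A minimal sketch of the one-behind cache described above (the structure and
helper names are invented for illustration; only struct iphdr and the
IP_MF/IP_OFFSET masks are existing kernel symbols, and this is not code that
was actually posted to the thread):

	#include <linux/types.h>
	#include <linux/ip.h>
	#include <net/ip.h>		/* IP_MF, IP_OFFSET */

	/* Hypothetical one-behind fragment-steering cache, one per NAPI instance. */
	struct frag_steer_cache {
		__be32 saddr;
		__be32 daddr;
		__be16 id;
		u32    rxhash;		/* flow hash taken from the first fragment */
		bool   valid;
	};

	/* Called for every received IPv4 header; returns the flow hash to use.
	 * l4_rxhash is whatever hash the caller computed from the L4 ports
	 * (meaningful only for the first fragment and unfragmented packets).
	 */
	static u32 frag_steer_rxhash(struct frag_steer_cache *c,
				     const struct iphdr *iph, u32 l4_rxhash)
	{
		/* Not a fragment: nothing to record or look up. */
		if (!(iph->frag_off & htons(IP_MF | IP_OFFSET)))
			return l4_rxhash;

		/* First fragment carries the UDP header: record the mapping. */
		if (!(iph->frag_off & htons(IP_OFFSET))) {
			c->saddr  = iph->saddr;
			c->daddr  = iph->daddr;
			c->id     = iph->id;
			c->rxhash = l4_rxhash;
			c->valid  = true;
			return l4_rxhash;
		}

		/* Later fragments: best effort, reuse the recorded hash on a match. */
		if (c->valid && c->saddr == iph->saddr &&
		    c->daddr == iph->daddr && c->id == iph->id) {
			u32 hash = c->rxhash;

			/* Last fragment (IP_MF clear): tear the mapping down. */
			if (!(iph->frag_off & htons(IP_MF)))
				c->valid = false;
			return hash;
		}

		return l4_rxhash;	/* miss: fall back to the caller's hash */
	}

A one-behind cache like this remembers only the most recently seen first
fragment, which matches the observation that the fragments of one datagram
usually arrive back-to-back as a train.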
* Re: small RPS cache for fragments?
From: Tom Herbert @ 2011-05-17 20:02 UTC
To: David Miller
Cc: netdev

I like it!  And this sounds like the sort of algorithm that NICs might
be able to implement to solve the UDP/RSS unpleasantness, so even
better.

Tom

On Tue, May 17, 2011 at 11:33 AM, David Miller <davem@davemloft.net> wrote:
>
> It seems to me that we can solve the UDP fragmentation problem for
> flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> table that maps to the corresponding RPS flow entry.
>
> When we see the initial frag with the UDP header, we create the
> saddr/daddr/IPID mapping, and we tear it down when we hit the
> saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
>
> We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> non-zero.
>
> It's best effort and should work quite well.
>
> Even a one-behind cache, per-NAPI instance, would do a lot better than
> what happens at the moment.  Especially since the IP fragments mostly
> arrive as one packet train.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 20:17 UTC
To: Tom Herbert
Cc: David Miller, netdev

On Tue, 2011-05-17 at 13:02 -0700, Tom Herbert wrote:
> I like it!  And this sounds like the sort of algorithm that NICs might
> be able to implement to solve the UDP/RSS unpleasantness, so even
> better.

Do (m)any devices take "shortcuts" with UDP datagrams these days?  By
that I mean that back in the day, the HP-PB and "Slider" FDDI
cards/drivers did checksum offload for fragmented UDP datagrams by
sending the first fragment, the one with the UDP header and thus
checksum, last.  It did that to save space on the card and make use of
the checksum accumulator.

rick jones

> Tom
>
> On Tue, May 17, 2011 at 11:33 AM, David Miller <davem@davemloft.net> wrote:
> >
> > It seems to me that we can solve the UDP fragmentation problem for
> > flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> > table that maps to the corresponding RPS flow entry.
> >
> > When we see the initial frag with the UDP header, we create the
> > saddr/daddr/IPID mapping, and we tear it down when we hit the
> > saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
> >
> > We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> > non-zero.
> >
> > It's best effort and should work quite well.
> >
> > Even a one-behind cache, per-NAPI instance, would do a lot better than
> > what happens at the moment.  Especially since the IP fragments mostly
> > arrive as one packet train.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 20:41 UTC
To: Tom Herbert
Cc: David Miller, netdev

On Tue, 2011-05-17 at 13:17 -0700, Rick Jones wrote:
> On Tue, 2011-05-17 at 13:02 -0700, Tom Herbert wrote:
> > I like it!  And this sounds like the sort of algorithm that NICs might
> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > better.
>
> Do (m)any devices take "shortcuts" with UDP datagrams these days?  By
> that I mean that back in the day, the HP-PB and "Slider" FDDI
> cards/drivers did checksum offload for fragmented UDP datagrams by
> sending the first fragment, the one with the UDP header and thus
> checksum, last.  It did that to save space on the card and make use of
> the checksum accumulator.

Even if no devices (mis)behave like that today, ordering of fragments
sent via a mode-rr bond is far from a sure thing.

rick

> > rick jones
> >
> > > Tom
> > >
> > > On Tue, May 17, 2011 at 11:33 AM, David Miller <davem@davemloft.net> wrote:
> > > >
> > > > It seems to me that we can solve the UDP fragmentation problem for
> > > > flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> > > > table that maps to the corresponding RPS flow entry.
> > > >
> > > > When we see the initial frag with the UDP header, we create the
> > > > saddr/daddr/IPID mapping, and we tear it down when we hit the
> > > > saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
> > > >
> > > > We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> > > > non-zero.
> > > >
> > > > It's best effort and should work quite well.
> > > >
> > > > Even a one-behind cache, per-NAPI instance, would do a lot better than
> > > > what happens at the moment.  Especially since the IP fragments mostly
> > > > arrive as one packet train.
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 20:49 UTC
To: therbert
Cc: netdev

From: Tom Herbert <therbert@google.com>
Date: Tue, 17 May 2011 13:02:25 -0700

> I like it!  And this sounds like the sort of algorithm that NICs might
> be able to implement to solve the UDP/RSS unpleasantness, so even
> better.

Actually, I think it won't work.  Even Linux emits fragments last to
first, so we won't see the UDP header until the last packet where it's
no longer useful.

Back to the drawing board. :-/
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:00 UTC
To: David Miller
Cc: therbert, netdev

On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Tue, 17 May 2011 13:02:25 -0700
>
> > I like it!  And this sounds like the sort of algorithm that NICs might
> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > better.
>
> Actually, I think it won't work.  Even Linux emits fragments last to
> first, so we won't see the UDP header until the last packet where it's
> no longer useful.
>
> Back to the drawing board. :-/

Well, we could just use the iph->id in the rxhash computation for frags.

At least all frags of a given datagram should be reassembled on same
cpu, so we get RPS (but not RFS)
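Roughly what this suggests, as an illustrative-only sketch (jhash_3words()
is a real kernel helper; the wrapper function itself is hypothetical):

	#include <linux/ip.h>
	#include <linux/jhash.h>
	#include <net/ip.h>		/* IP_MF, IP_OFFSET */

	/* For fragments, fold the IP ID into the hash in place of the
	 * (unavailable) L4 ports, so every fragment of one datagram hashes
	 * to the same CPU -- giving RPS, but not RFS.
	 */
	static u32 frag_rxhash(const struct iphdr *iph, u32 l4_ports, u32 hashrnd)
	{
		u32 third = l4_ports;

		if (iph->frag_off & htons(IP_MF | IP_OFFSET))
			third = ntohs(iph->id);	/* no L4 header to look at */

		return jhash_3words((__force u32)iph->saddr,
				    (__force u32)iph->daddr, third, hashrnd);
	}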
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:10 UTC
To: eric.dumazet
Cc: therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 May 2011 23:00:50 +0200

> On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
>> From: Tom Herbert <therbert@google.com>
>> Date: Tue, 17 May 2011 13:02:25 -0700
>>
>> > I like it!  And this sounds like the sort of algorithm that NICs might
>> > be able to implement to solve the UDP/RSS unpleasantness, so even
>> > better.
>>
>> Actually, I think it won't work.  Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
>> Back to the drawing board. :-/
>
> Well, we could just use the iph->id in the rxhash computation for frags.
>
> At least all frags of a given datagram should be reassembled on same
> cpu, so we get RPS (but not RFS)

That's true, but one could also argue that in the existing code at least
one of the packets (the one with the UDP header) would make it to the
proper flow cpu.

That could be as much as half of the packets.

So I don't yet see it as a clear win.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 21:13 UTC
To: David Miller
Cc: eric.dumazet, therbert, netdev

On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 17 May 2011 23:00:50 +0200
>
> > On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> >> From: Tom Herbert <therbert@google.com>
> >> Date: Tue, 17 May 2011 13:02:25 -0700
> >>
> >> > I like it!  And this sounds like the sort of algorithm that NICs might
> >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> >> > better.
> >>
> >> Actually, I think it won't work.  Even Linux emits fragments last to
> >> first, so we won't see the UDP header until the last packet where it's
> >> no longer useful.
> >>
> >> Back to the drawing board. :-/
> >
> > Well, we could just use the iph->id in the rxhash computation for frags.
> >
> > At least all frags of a given datagram should be reassembled on same
> > cpu, so we get RPS (but not RFS)
>
> That's true, but one could also argue that in the existing code at least
> one of the packets (the one with the UDP header) would make it to the
> proper flow cpu.
>
> That could be as much as half of the packets.
>
> So I don't yet see it as a clear win.

How heinous would it be to do post-reassembly RFS?

rick
* Re: small RPS cache for fragments?
From: Ben Hutchings @ 2011-05-17 21:13 UTC
To: David Miller
Cc: eric.dumazet, therbert, netdev

On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 17 May 2011 23:00:50 +0200
>
> > On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> >> From: Tom Herbert <therbert@google.com>
> >> Date: Tue, 17 May 2011 13:02:25 -0700
> >>
> >> > I like it!  And this sounds like the sort of algorithm that NICs might
> >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> >> > better.
> >>
> >> Actually, I think it won't work.  Even Linux emits fragments last to
> >> first, so we won't see the UDP header until the last packet where it's
> >> no longer useful.
> >>
> >> Back to the drawing board. :-/
> >
> > Well, we could just use the iph->id in the rxhash computation for frags.
> >
> > At least all frags of a given datagram should be reassembled on same
> > cpu, so we get RPS (but not RFS)
>
> That's true, but one could also argue that in the existing code at least
> one of the packets (the one with the UDP header) would make it to the
> proper flow cpu.

No, we ignore the layer-4 header when either MF or OFFSET is non-zero.

Ben.

> That could be as much as half of the packets.
>
> So I don't yet see it as a clear win.

--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:26 UTC
To: bhutchings
Cc: eric.dumazet, therbert, netdev

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Tue, 17 May 2011 22:13:42 +0100

> On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
>> That's true, but one could also argue that in the existing code at least
>> one of the packets (the one with the UDP header) would make it to the
>> proper flow cpu.
>
> No, we ignore the layer-4 header when either MF or OFFSET is non-zero.

That's right and I now remember we had quite a discussion about this
in the past.

So IP/saddr/daddr keying is out of the question due to reordering
concerns.

The idea to do RFS post fragmentation is interesting, it's sort of
another form of GRO.  We would need to re-fragment (like GRO does)
in the forwarding case.

But it would be nice since it would reduce the number of calls into
the stack (and thus route lookups, etc.) per fragmented frame.

There is of course the issue of fragmentation queue timeouts, and
what semantics of that means when we are not the final destination
and those fragments would have been forwarded rather than consumed
by us.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 21:40 UTC
To: David Miller
Cc: bhutchings, eric.dumazet, therbert, netdev

On Tue, 2011-05-17 at 17:26 -0400, David Miller wrote:
> The idea to do RFS post fragmentation is interesting, it's sort of
> another form of GRO.  We would need to re-fragment (like GRO does)
> in the forwarding case.
>
> But it would be nice since it would reduce the number of calls into
> the stack (and thus route lookups, etc.) per fragmented frame.
>
> There is of course the issue of fragmentation queue timeouts, and
> what semantics of that means when we are not the final destination
> and those fragments would have been forwarded rather than consumed
> by us.

If we are not the final destination, should there be any reassembly
going-on in the first place?  And if reassembly times-out, don't the
frags just get dropped like they would anyway?

Eric keeps asking about (real) workload :)  About the only one I can
think of at this point that would have much in the way of UDP fragments
is EDNS.  Apart from that we may be worrying about how many fragments
can dance on the header of an IP datagram?-)

rick
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:27 UTC
To: Ben Hutchings
Cc: David Miller, therbert, netdev

On Tuesday, 17 May 2011 at 22:13 +0100, Ben Hutchings wrote:
> On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 17 May 2011 23:00:50 +0200
> >
> > > On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> > >> From: Tom Herbert <therbert@google.com>
> > >> Date: Tue, 17 May 2011 13:02:25 -0700
> > >>
> > >> > I like it!  And this sounds like the sort of algorithm that NICs might
> > >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > >> > better.
> > >>
> > >> Actually, I think it won't work.  Even Linux emits fragments last to
> > >> first, so we won't see the UDP header until the last packet where it's
> > >> no longer useful.
> > >>
> > >> Back to the drawing board. :-/
> > >
> > > Well, we could just use the iph->id in the rxhash computation for frags.
> > >
> > > At least all frags of a given datagram should be reassembled on same
> > > cpu, so we get RPS (but not RFS)
> >
> > That's true, but one could also argue that in the existing code at least
> > one of the packets (the one with the UDP header) would make it to the
> > proper flow cpu.
>
> No, we ignore the layer-4 header when either MF or OFFSET is non-zero.

Exactly

As is, RPS (based on our software rxhash computation) should be working
fine with frags, unless we receive different flows with same
(src_addr,dst_addr) pair.

This is why I asked David if real workloads could hit one cpu instead of
many ones.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 21:11 UTC
To: Eric Dumazet
Cc: David Miller, therbert, netdev

On Tue, 2011-05-17 at 23:00 +0200, Eric Dumazet wrote:
> On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> > From: Tom Herbert <therbert@google.com>
> > Date: Tue, 17 May 2011 13:02:25 -0700
> >
> > > I like it!  And this sounds like the sort of algorithm that NICs might
> > > be able to implement to solve the UDP/RSS unpleasantness, so even
> > > better.
> >
> > Actually, I think it won't work.  Even Linux emits fragments last to
> > first, so we won't see the UDP header until the last packet where it's
> > no longer useful.
> >
> > Back to the drawing board. :-/
>
> Well, we could just use the iph->id in the rxhash computation for frags.
>
> At least all frags of a given datagram should be reassembled on same
> cpu, so we get RPS (but not RFS)

Won't that just scatter the fragments of a given flow across processors?
Instead of then going back and forth between two caches - where
reassembly happens and then where the app is running, it will go back
and forth between the app's cache and pretty much nearly every other
cache in the system (or at least configured to take RPS traffic).

rick
* Re: small RPS cache for fragments?
From: Ben Hutchings @ 2011-05-17 21:11 UTC
To: Eric Dumazet
Cc: David Miller, therbert, netdev

On Tue, 2011-05-17 at 23:00 +0200, Eric Dumazet wrote:
> On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> > From: Tom Herbert <therbert@google.com>
> > Date: Tue, 17 May 2011 13:02:25 -0700
> >
> > > I like it!  And this sounds like the sort of algorithm that NICs might
> > > be able to implement to solve the UDP/RSS unpleasantness, so even
> > > better.
> >
> > Actually, I think it won't work.  Even Linux emits fragments last to
> > first, so we won't see the UDP header until the last packet where it's
> > no longer useful.
> >
> > Back to the drawing board. :-/
>
> Well, we could just use the iph->id in the rxhash computation for frags.

But then each datagram lands on a different CPU, and reordering is
liable to happen far more often than it does now.

> At least all frags of a given datagram should be reassembled on same
> cpu, so we get RPS (but not RFS)

You could still do RPS with just IP addresses (same as RSS using
Toeplitz hashes).

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: small RPS cache for fragments?
From: Tom Herbert @ 2011-05-17 21:27 UTC
To: David Miller
Cc: netdev

> Actually, I think it won't work.  Even Linux emits fragments last to
> first, so we won't see the UDP header until the last packet where it's
> no longer useful.
>
I remember observing this a while back, what's the rationale for it?

> Back to the drawing board. :-/
>
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:28 UTC
To: therbert
Cc: netdev

From: Tom Herbert <therbert@google.com>
Date: Tue, 17 May 2011 14:27:10 -0700

>> Actually, I think it won't work.  Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
> I remember observing this a while back, what's the rationale for it?

That's the cheapest way to build the fragments.

Regardless of the reason we have to handle it forever.
* Re: small RPS cache for fragments?
From: Changli Gao @ 2011-05-17 23:59 UTC
To: David Miller
Cc: therbert, netdev

On Wed, May 18, 2011 at 4:49 AM, David Miller <davem@davemloft.net> wrote:
>
> Actually, I think it won't work.  Even Linux emits fragments last to
> first, so we won't see the UDP header until the last packet where it's
> no longer useful.
>

No. Linux emits fragments first to last now. You should check the
current code. :)

--
Regards,
Changli Gao (xiaosuo@gmail.com)
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-18 6:37 UTC
To: xiaosuo
Cc: therbert, netdev

From: Changli Gao <xiaosuo@gmail.com>
Date: Wed, 18 May 2011 07:59:05 +0800

> On Wed, May 18, 2011 at 4:49 AM, David Miller <davem@davemloft.net> wrote:
>>
>> Actually, I think it won't work.  Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
>
> No. Linux emits fragments first to last now. You should check the
> current code. :)

I forgot that we rearranged this, thanks :-)

So maybe the original idea can indeed work.
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 20:14 UTC
To: David Miller
Cc: netdev

On Tuesday, 17 May 2011 at 14:33 -0400, David Miller wrote:
> It seems to me that we can solve the UDP fragmentation problem for
> flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> table that maps to the corresponding RPS flow entry.
>
> When we see the initial frag with the UDP header, we create the
> saddr/daddr/IPID mapping, and we tear it down when we hit the
> saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
>
> We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> non-zero.
>
> It's best effort and should work quite well.
>
> Even a one-behind cache, per-NAPI instance, would do a lot better than
> what happens at the moment.  Especially since the IP fragments mostly
> arrive as one packet train.

OK but do we have workloads actually needing this optimization at all ?

(IP defrag hits a read_lock(&ip4_frags.lock), so maybe steer all frags
on a given cpu ?)
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 20:47 UTC
To: eric.dumazet
Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 May 2011 22:14:48 +0200

> OK but do we have workloads actually needing this optimization at all ?

Yes, I've seen performance graphs where RPS/RFS falls off the cliff
when datagram sizes go from 1024 to 2048 bytes.

Wrt. defrag queue overhead, it still is minor compared to the cost of
processing 1/2 of all packets on one cpu on a 24 core system.

BTW, if we can steer reliably, we could make a per-cpu defrag queue if
you worry about it so much :-)
* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 21:44 UTC
To: Eric Dumazet
Cc: David Miller, netdev

Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> OK but do we have workloads actually needing this optimization at all ?

That's a good question.

> (IP defrag hits a read_lock(&ip4_frags.lock), so maybe steer all frags
> on a given cpu ?)

Couldn't the lock just be replaced with a hashed or bitmap lock, or a
lock bit in the low bits of the pointer?

iirc it just protects the heads of the hash table.

They're not rwlocks, but especially if the locking was more finegrained
that's likely not needed anymore.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:52 UTC
To: Andi Kleen
Cc: David Miller, netdev

On Tuesday, 17 May 2011 at 14:44 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > OK but do we have workloads actually needing this optimization at all ?
>
> That's a good question.
>
> > (IP defrag hits a read_lock(&ip4_frags.lock), so maybe steer all frags
> > on a given cpu ?)
>
> Couldn't the lock just be replaced with a hashed or bitmap lock, or a
> lock bit in the low bits of the pointer?
>
> iirc it just protects the heads of the hash table.
>
> They're not rwlocks, but especially if the locking was more finegrained
> that's likely not needed anymore.

Well, there is the rehashing stuff, and this locks the whole table.

Not easy to switch to rcu or something like that.

Anyway I hardly use frags here at work, so never considered it was a
field to spend time ;)
* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 22:03 UTC
To: Eric Dumazet
Cc: David Miller, netdev

Eric Dumazet <eric.dumazet@gmail.com> writes:
>> They're not rwlocks, but especially if the locking was more finegrained
>> that's likely not needed anymore.
>
> Well, there is the rehashing stuff, and this locks the whole table.
>
> Not easy to switch to rcu or something like that.

No need to switch to RCU, just a more finegrained bucket lock.
If you move a chain between queues you just lock both for the move.

It sounds easy enough. I should probably just code it up.

> Anyway I hardly use frags here at work, so never considered it was a
> field to spend time ;)

Yes that's the problem. On the other hand most scalability problems
hurt sooner or later, so sometimes it's good to fix them in advance.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
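A rough sketch of the per-bucket locking being suggested here (all names are
hypothetical; the ip4_frags table of this era really did use a single global
rwlock plus periodic rehashing, so this is only an illustration of the idea):

	#include <linux/kernel.h>
	#include <linux/list.h>
	#include <linux/spinlock.h>

	#define FRAG_HASHSZ 64

	/* One spinlock per hash chain instead of a single rwlock over the
	 * whole fragment table.  Each bucket's lock is assumed to be set up
	 * with spin_lock_init() when the table is created.
	 */
	struct frag_bucket {
		spinlock_t	  lock;
		struct hlist_head chain;
	};

	static struct frag_bucket frag_hash[FRAG_HASHSZ];

	/* Rehashing moves a queue between chains: take just the two bucket
	 * locks involved, in index order, to avoid ABBA deadlock.
	 */
	static void frag_move(struct hlist_node *q, unsigned int from,
			      unsigned int to)
	{
		struct frag_bucket *a = &frag_hash[min(from, to)];
		struct frag_bucket *b = &frag_hash[max(from, to)];

		spin_lock(&a->lock);
		if (a != b)
			spin_lock_nested(&b->lock, SINGLE_DEPTH_NESTING);

		hlist_del(q);
		hlist_add_head(q, &frag_hash[to].chain);

		if (a != b)
			spin_unlock(&b->lock);
		spin_unlock(&a->lock);
	}

Readers then take only the one bucket lock covering the chain they search,
so contention no longer spans the whole table.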
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:44 UTC
To: netdev

Guys we can't time out fragments if we are not the final destination.

Due to asymmetric routing, the fragment pieces we don't see might reach
the final destination not through us.

So we have to pass them onwards, we can't just drop them.
* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 21:48 UTC
To: David Miller
Cc: netdev

David Miller <davem@davemloft.net> writes:

> Guys we can't time out fragments if we are not the final
> destination.

If you're not the final destination you should never even
try to reassemble them?

I'm probably missing something...

-Andi

--
ak@linux.intel.com -- Speaking for myself only
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:50 UTC
To: andi
Cc: netdev

From: Andi Kleen <andi@firstfloor.org>
Date: Tue, 17 May 2011 14:48:28 -0700

> David Miller <davem@davemloft.net> writes:
>
>> Guys we can't time out fragments if we are not the final
>> destination.
>
> If you're not the final destination you should never even
> try to reassemble them?
>
> I'm probably missing something...

We're discussing the idea to do the defragmentation first
so we can choose the flow properly and steer the packet
to the correct cpu.

This also would allow each fragmented packet to traverse the
stack only once (one route lookup etc.) instead of once per
fragment.

Please read the rest of this thread, we have discussed this
and now I'm repeating information solely for your benefit.
* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 22:06 UTC
To: David Miller
Cc: netdev

David Miller <davem@davemloft.net> writes:
>
> We're discussing the idea to do the defragmentation first
> so we can choose the flow properly and steer the packet
> to the correct cpu.
>
> This also would allow each fragmented packet to traverse the
> stack only once (one route lookup etc.) instead of once per
> fragment.

You could always check first in a cheap way (e.g. a small hash table)
if it's local or not (and bypass the defragmentation if routing is
turned off or the hash table would have collisions)

On the other hand if fragmentation is expensive it's probably better
to do it later anyways to spread it out better.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 22:42 UTC
To: David Miller
Cc: andi, netdev

On Tue, 2011-05-17 at 17:50 -0400, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Tue, 17 May 2011 14:48:28 -0700
>
> > David Miller <davem@davemloft.net> writes:
> >
> >> Guys we can't time out fragments if we are not the final
> >> destination.
> >
> > If you're not the final destination you should never even
> > try to reassemble them?
> >
> > I'm probably missing something...
>
> We're discussing the idea to do the defragmentation first
> so we can choose the flow properly and steer the packet
> to the correct cpu.
>
> This also would allow each fragmented packet to traverse the
> stack only once (one route lookup etc.) instead of once per
> fragment.
>
> Please read the rest of this thread, we have discussed this
> and now I'm repeating information solely for your benefit.

Well, I should probably be beaten with that stick too because I wasn't
thinking about forwarding, only being the destination system when I
broached the suggestion of doing RFS after reassembly.  I can see where
one *might* be able to do limited RPS when forwarding, but I didn't
know that RFS had been extended to forwarding.

Now though I see why you were rightfully concerned about timeouts -
given all the concerns about added latency from bufferbloat, I wouldn't
think that an additional 10 or perhaps even 1ms timeout on a reassembly
attempt to get the layer four header when forwarding would sit well
with folks - they will expect the fragments to flow through without
additional delay.

rick jones
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-24 20:01 UTC
To: netdev

From: David Miller <davem@davemloft.net>
Date: Tue, 17 May 2011 14:33:42 -0400 (EDT)

> It seems to me that we can solve the UDP fragmentation problem for
> flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> table that maps to the corresponding RPS flow entry.
>
> When we see the initial frag with the UDP header, we create the
> saddr/daddr/IPID mapping, and we tear it down when we hit the
> saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
>
> We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> non-zero.

So I looked into implementing this now that it has been established
that we changed even Linux to emit fragments in-order.

The first problem we run into is that there is no "context" we can
use in all the places where skb_get_rxhash() gets called.

Part of the problem is that we call it from strange places, such as
egress packet schedulers.  That's completely bogus.

Examples, FLOW classifier, META e-match, CHOKE, and SFB.

In fact, for the classifiers this means they aren't making use of the
precomputed TX hash values in the sockets like __skb_tx_hash() will
make use of.  So this makes these packet schedulers operate
potentially more expensively than they need to.

If we could get rid of those silly cases, the stuff that remains
(macvtap and net/core/dev.c) could work with a NAPI context during
rxhash computation and use that to store the IP fragmentation
one-behind cached information.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-24 21:38 UTC
To: David Miller
Cc: netdev

On Tue, 2011-05-24 at 16:01 -0400, David Miller wrote:
> So I looked into implementing this now that it has been established
> that we changed even Linux to emit fragments in-order.
>
> The first problem we run into is that there is no "context" we can
> use in all the places where skb_get_rxhash() gets called.
>
> Part of the problem is that we call it from strange places, such as
> egress packet schedulers.  That's completely bogus.
>
> Examples, FLOW classifier, META e-match, CHOKE, and SFB.
>
> In fact, for the classifiers this means they aren't making use of the
> precomputed TX hash values in the sockets like __skb_tx_hash() will
> make use of.  So this makes these packet schedulers operate
> potentially more expensively than they need to.
>
> If we could get rid of those silly cases, the stuff that remains
> (macvtap and net/core/dev.c) could work with a NAPI context during
> rxhash computation and use that to store the IP fragmentation
> one-behind cached information.

Isn't there still an issue (perhaps small) of traffic being sent through
a mode-rr bond, either at the origin or somewhere along the way?  At the
origin point it will depend on the presence of UFO and whether it is
propagated up through the bond interface, but as a quick test, I
disabled TSO, GSO and UFO on four e1000e driven interfaces, bonded them
mode-rr and ran a netperf UDP_RR test with a 1473 byte request size and
this is what they looked like at my un-bonded receiver at the other end:

14:31:01.011370 IP (tos 0x0, ttl 64, id 24960, offset 1480, flags [none], proto UDP (17), length 21)
    tardy.local > raj-8510w.local: udp
14:31:01.011420 IP (tos 0x0, ttl 64, id 24960, offset 0, flags [+], proto UDP (17), length 1500)
    tardy.local.36073 > raj-8510w.local.59951: UDP, length 1473
14:31:01.011514 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
    raj-8510w.local.59951 > tardy.local.36073: UDP, length 1

rick jones
* Re: small RPS cache for fragments?
From: David Miller @ 2011-06-04 20:29 UTC
To: rick.jones2
Cc: netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Tue, 24 May 2011 14:38:48 -0700

> Isn't there still an issue (perhaps small) of traffic being sent through
> a mode-rr bond, either at the origin or somewhere along the way?  At the
> origin point it will depend on the presence of UFO and whether it is
> propagated up through the bond interface, but as a quick test, I
> disabled TSO, GSO and UFO on four e1000e driven interfaces, bonded them
> mode-rr and ran a netperf UDP_RR test with a 1473 byte request size and
> this is what they looked like at my un-bonded receiver at the other end:
>
> 14:31:01.011370 IP (tos 0x0, ttl 64, id 24960, offset 1480, flags [none], proto UDP (17), length 21)
>     tardy.local > raj-8510w.local: udp
> 14:31:01.011420 IP (tos 0x0, ttl 64, id 24960, offset 0, flags [+], proto UDP (17), length 1500)
>     tardy.local.36073 > raj-8510w.local.59951: UDP, length 1473
> 14:31:01.011514 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
>     raj-8510w.local.59951 > tardy.local.36073: UDP, length 1

That's not good behavior, and it's of course going to cause sub-optimal
performance if we do the RPS fragment cache.

RR bond mode could do something similar, to alleviate this.

I assume it doesn't do this kind of reordering for TCP.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-06-06 17:08 UTC
To: David Miller
Cc: netdev

On Sat, 2011-06-04 at 13:29 -0700, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Tue, 24 May 2011 14:38:48 -0700
>
> > Isn't there still an issue (perhaps small) of traffic being sent through
> > a mode-rr bond, either at the origin or somewhere along the way?  At the
> > origin point it will depend on the presence of UFO and whether it is
> > propagated up through the bond interface, but as a quick test, I
> > disabled TSO, GSO and UFO on four e1000e driven interfaces, bonded them
> > mode-rr and ran a netperf UDP_RR test with a 1473 byte request size and
> > this is what they looked like at my un-bonded receiver at the other end:
> >
> > 14:31:01.011370 IP (tos 0x0, ttl 64, id 24960, offset 1480, flags [none], proto UDP (17), length 21)
> >     tardy.local > raj-8510w.local: udp
> > 14:31:01.011420 IP (tos 0x0, ttl 64, id 24960, offset 0, flags [+], proto UDP (17), length 1500)
> >     tardy.local.36073 > raj-8510w.local.59951: UDP, length 1473
> > 14:31:01.011514 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
> >     raj-8510w.local.59951 > tardy.local.36073: UDP, length 1
>
> That's not good behavior, and it's of course going to cause sub-optimal
> performance if we do the RPS fragment cache.
>
> RR bond mode could do something similar, to alleviate this.
>
> I assume it doesn't do this kind of reordering for TCP.

Mode-rr bonding reorders TCP segments all the time.

rick
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-06-06 17:15 UTC
To: rick.jones2
Cc: David Miller, netdev

On Monday, 6 June 2011 at 10:08 -0700, Rick Jones wrote:

> Mode-rr bonding reorders TCP segments all the time.

Shouldn't TCP frames have the DF bit set?
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-06-06 18:06 UTC
To: Eric Dumazet
Cc: David Miller, netdev

On Mon, 2011-06-06 at 19:15 +0200, Eric Dumazet wrote:
> On Monday, 6 June 2011 at 10:08 -0700, Rick Jones wrote:
>
> > Mode-rr bonding reorders TCP segments all the time.
>
> Shouldn't TCP frames have the DF bit set?

I was ass-u-me-ing that when talking about TCP, David was speaking
generally about TCP segments, and suggesting that were bonding's
mode-rr altered to not re-order TCP segments, a similar technique
could/would/should avoid re-ordering IP datagram fragments, regardless
of their payload.

Jay will have to weigh-in on how difficult that would be, I'm guessing
it would mean a fair bit of overhead to mode-rr though, to know the
completion status of frames from the same flow and/or the depth of the
tx queues etc etc.  I thought that one of mode-rr's (few IMO, just
check the archives where I've complained about it :) redeeming
qualities was its minimal overhead.

rick jones
* Re: small RPS cache for fragments?
From: David Miller @ 2011-06-06 19:23 UTC
To: eric.dumazet
Cc: rick.jones2, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 06 Jun 2011 19:15:13 +0200

> On Monday, 6 June 2011 at 10:08 -0700, Rick Jones wrote:
>
>> Mode-rr bonding reorders TCP segments all the time.
>
> Shouldn't TCP frames have the DF bit set?

That has nothing to do with this discussion :-)

DF bit set or not, TCP or UDP, this bonding mode apparently reorders
frames all the time.
* Re: small RPS cache for fragments?
From: David Miller @ 2011-06-06 19:22 UTC
To: rick.jones2
Cc: netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Mon, 06 Jun 2011 10:08:52 -0700

> Mode-rr bonding reorders TCP segments all the time.

Oh well, then don't use this if you care about performance at all.
And therefore it's not even worth considering for our RPS fragment
cache.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-06-06 20:05 UTC
To: David Miller
Cc: netdev

On Mon, 2011-06-06 at 12:22 -0700, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Mon, 06 Jun 2011 10:08:52 -0700
>
> > Mode-rr bonding reorders TCP segments all the time.
>
> Oh well, then don't use this if you care about performance at all.
> And therefore it's not even worth considering for our RPS fragment
> cache.

Heh - the (or at least a) reason people use mode-rr is to make a single
(TCP) stream go faster :)  Without buying the next-up NIC speed.

rick jones
* Re: small RPS cache for fragments?
From: Jay Vosburgh @ 2011-06-06 21:06 UTC
To: rick.jones2
Cc: David Miller, netdev

Rick Jones <rick.jones2@hp.com> wrote:

>On Mon, 2011-06-06 at 12:22 -0700, David Miller wrote:
>> From: Rick Jones <rick.jones2@hp.com>
>> Date: Mon, 06 Jun 2011 10:08:52 -0700
>>
>> > Mode-rr bonding reorders TCP segments all the time.
>>
>> Oh well, then don't use this if you care about performance at all.
>> And therefore it's not even worth considering for our RPS fragment
>> cache.
>
>Heh - the (or at least a) reason people use mode-rr is to make a single
>(TCP) stream go faster :)  Without buying the next-up NIC speed.

Right, the common use case for balance-rr (round robin) is to
maximize TCP throughput for one connection, over a set of whatever
network devices are available (or are cheap) by striping that connection
across multiple interfaces.  The tcp_reordering sysctl is set to some
large value so that TCP will deal with the reordering as best it can.

Since TCP generally won't fragment, I don't see that the RPS frag
cache is going to matter for this usage anyway.  If somebody out there
is using round robin for some UDP-based application that doesn't care
about packet ordering (but might create fragmented datagrams), I've not
heard about it.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: small RPS cache for fragments?
From: David Miller @ 2011-06-06 21:40 UTC
To: fubar
Cc: rick.jones2, netdev

From: Jay Vosburgh <fubar@us.ibm.com>
Date: Mon, 06 Jun 2011 14:06:09 -0700

> Right, the common use case for balance-rr (round robin) is to
> maximize TCP throughput for one connection, over a set of whatever
> network devices are available (or are cheap) by striping that connection
> across multiple interfaces.  The tcp_reordering sysctl is set to some
> large value so that TCP will deal with the reordering as best it can.

FWIW, I really would never, ever, encourage schemes like this.  Even
if they do happen to work.
* Re: small RPS cache for fragments?
From: Chris Friesen @ 2011-06-06 22:49 UTC
To: David Miller
Cc: fubar, rick.jones2, netdev

On 06/06/2011 03:40 PM, David Miller wrote:
> From: Jay Vosburgh <fubar@us.ibm.com>
> Date: Mon, 06 Jun 2011 14:06:09 -0700
>
>> Right, the common use case for balance-rr (round robin) is to
>> maximize TCP throughput for one connection, over a set of whatever
>> network devices are available (or are cheap) by striping that connection
>> across multiple interfaces.  The tcp_reordering sysctl is set to some
>> large value so that TCP will deal with the reordering as best it can.
>
> FWIW, I really would never, ever, encourage schemes like this.  Even
> if they do happen to work.

Why not?  And if not then what's the recommended way to handle the
above scenario?  (Assuming hardware upgrade isn't an option.)

Chris

--
Chris Friesen
Software Developer
GENBAND
chris.friesen@genband.com
www.genband.com