* small RPS cache for fragments?
From: David Miller @ 2011-05-17 18:33 UTC
To: netdev

It seems to me that we can solve the UDP fragmentation problem for
flow steering very simply by creating a (saddr/daddr/IPID) entry in a
table that maps to the corresponding RPS flow entry.

When we see the initial frag with the UDP header, we create the
saddr/daddr/IPID mapping, and we tear it down when we hit the
saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.

We only inspect the saddr/daddr/IPID cache when iph->frag_off is
non-zero.

It's best effort and should work quite well.

Even a one-behind cache, per-NAPI instance, would do a lot better than
what happens at the moment.  Especially since the IP fragments mostly
arrive as one packet train.
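A minimal sketch of the one-behind cache described above (the structure and
helper names are invented for illustration; only struct iphdr and the
IP_MF/IP_OFFSET masks are existing kernel symbols, and this is not code that
was actually posted to the thread):

	#include <linux/types.h>
	#include <linux/ip.h>
	#include <net/ip.h>		/* IP_MF, IP_OFFSET */

	/* Hypothetical one-behind fragment-steering cache, one per NAPI instance. */
	struct frag_steer_cache {
		__be32 saddr;
		__be32 daddr;
		__be16 id;
		u32    rxhash;		/* flow hash taken from the first fragment */
		bool   valid;
	};

	/* Called for every received IPv4 header; returns the flow hash to use.
	 * l4_rxhash is whatever hash the caller computed from the L4 ports
	 * (meaningful only for the first fragment and unfragmented packets).
	 */
	static u32 frag_steer_rxhash(struct frag_steer_cache *c,
				     const struct iphdr *iph, u32 l4_rxhash)
	{
		/* Not a fragment: nothing to record or look up. */
		if (!(iph->frag_off & htons(IP_MF | IP_OFFSET)))
			return l4_rxhash;

		/* First fragment carries the UDP header: record the mapping. */
		if (!(iph->frag_off & htons(IP_OFFSET))) {
			c->saddr  = iph->saddr;
			c->daddr  = iph->daddr;
			c->id     = iph->id;
			c->rxhash = l4_rxhash;
			c->valid  = true;
			return l4_rxhash;
		}

		/* Later fragments: best effort, reuse the recorded hash on a match. */
		if (c->valid && c->saddr == iph->saddr &&
		    c->daddr == iph->daddr && c->id == iph->id) {
			u32 hash = c->rxhash;

			/* Last fragment (IP_MF clear): tear the mapping down. */
			if (!(iph->frag_off & htons(IP_MF)))
				c->valid = false;
			return hash;
		}

		return l4_rxhash;	/* miss: fall back to the caller's hash */
	}

A one-behind cache like this remembers only the most recently seen first
fragment, which matches the observation that the fragments of one datagram
usually arrive back-to-back as a train.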
* Re: small RPS cache for fragments?
From: Tom Herbert @ 2011-05-17 20:02 UTC
To: David Miller
Cc: netdev

I like it!  And this sounds like the sort of algorithm that NICs might
be able to implement to solve the UDP/RSS unpleasantness, so even
better.

Tom

On Tue, May 17, 2011 at 11:33 AM, David Miller <davem@davemloft.net> wrote:
>
> It seems to me that we can solve the UDP fragmentation problem for
> flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> table that maps to the corresponding RPS flow entry.
>
> When we see the initial frag with the UDP header, we create the
> saddr/daddr/IPID mapping, and we tear it down when we hit the
> saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
>
> We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> non-zero.
>
> It's best effort and should work quite well.
>
> Even a one-behind cache, per-NAPI instance, would do a lot better than
> what happens at the moment.  Especially since the IP fragments mostly
> arrive as one packet train.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 20:17 UTC
To: Tom Herbert
Cc: David Miller, netdev

On Tue, 2011-05-17 at 13:02 -0700, Tom Herbert wrote:
> I like it!  And this sounds like the sort of algorithm that NICs might
> be able to implement to solve the UDP/RSS unpleasantness, so even
> better.

Do (m)any devices take "shortcuts" with UDP datagrams these days?  By
that I mean that back in the day, the HP-PB and "Slider" FDDI
cards/drivers did checksum offload for fragmented UDP datagrams by
sending the first fragment, the one with the UDP header and thus
checksum, last.  It did that to save space on the card and make use of
the checksum accumulator.

rick jones

> Tom
>
> On Tue, May 17, 2011 at 11:33 AM, David Miller <davem@davemloft.net> wrote:
> >
> > It seems to me that we can solve the UDP fragmentation problem for
> > flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> > table that maps to the corresponding RPS flow entry.
> >
> > When we see the initial frag with the UDP header, we create the
> > saddr/daddr/IPID mapping, and we tear it down when we hit the
> > saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
> >
> > We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> > non-zero.
> >
> > It's best effort and should work quite well.
> >
> > Even a one-behind cache, per-NAPI instance, would do a lot better than
> > what happens at the moment.  Especially since the IP fragments mostly
> > arrive as one packet train.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 20:41 UTC
To: Tom Herbert
Cc: David Miller, netdev

On Tue, 2011-05-17 at 13:17 -0700, Rick Jones wrote:
> On Tue, 2011-05-17 at 13:02 -0700, Tom Herbert wrote:
> > I like it!  And this sounds like the sort of algorithm that NICs might
> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > better.
>
> Do (m)any devices take "shortcuts" with UDP datagrams these days?  By
> that I mean that back in the day, the HP-PB and "Slider" FDDI
> cards/drivers did checksum offload for fragmented UDP datagrams by
> sending the first fragment, the one with the UDP header and thus
> checksum, last.  It did that to save space on the card and make use of
> the checksum accumulator.

Even if no devices (mis)behave like that today, ordering of fragments
sent via a mode-rr bond is far from a sure thing.

rick

> > rick jones
> >
> > > Tom
> > >
> > > On Tue, May 17, 2011 at 11:33 AM, David Miller <davem@davemloft.net> wrote:
> > > >
> > > > It seems to me that we can solve the UDP fragmentation problem for
> > > > flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> > > > table that maps to the corresponding RPS flow entry.
> > > >
> > > > When we see the initial frag with the UDP header, we create the
> > > > saddr/daddr/IPID mapping, and we tear it down when we hit the
> > > > saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
> > > >
> > > > We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> > > > non-zero.
> > > >
> > > > It's best effort and should work quite well.
> > > >
> > > > Even a one-behind cache, per-NAPI instance, would do a lot better than
> > > > what happens at the moment.  Especially since the IP fragments mostly
> > > > arrive as one packet train.
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 20:49 UTC
To: therbert
Cc: netdev

From: Tom Herbert <therbert@google.com>
Date: Tue, 17 May 2011 13:02:25 -0700

> I like it!  And this sounds like the sort of algorithm that NICs might
> be able to implement to solve the UDP/RSS unpleasantness, so even
> better.

Actually, I think it won't work.  Even Linux emits fragments last to
first, so we won't see the UDP header until the last packet where it's
no longer useful.

Back to the drawing board. :-/
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:00 UTC
To: David Miller
Cc: therbert, netdev

On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Tue, 17 May 2011 13:02:25 -0700
>
> > I like it!  And this sounds like the sort of algorithm that NICs might
> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > better.
>
> Actually, I think it won't work.  Even Linux emits fragments last to
> first, so we won't see the UDP header until the last packet where it's
> no longer useful.
>
> Back to the drawing board. :-/

Well, we could just use the iph->id in the rxhash computation for frags.

At least all frags of a given datagram should be reassembled on same
cpu, so we get RPS (but not RFS)
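Roughly what this suggests, as an illustrative-only sketch (jhash_3words()
is a real kernel helper; the wrapper function itself is hypothetical):

	#include <linux/ip.h>
	#include <linux/jhash.h>
	#include <net/ip.h>		/* IP_MF, IP_OFFSET */

	/* For fragments, fold the IP ID into the hash in place of the
	 * (unavailable) L4 ports, so every fragment of one datagram hashes
	 * to the same CPU -- giving RPS, but not RFS.
	 */
	static u32 frag_rxhash(const struct iphdr *iph, u32 l4_ports, u32 hashrnd)
	{
		u32 third = l4_ports;

		if (iph->frag_off & htons(IP_MF | IP_OFFSET))
			third = ntohs(iph->id);	/* no L4 header to look at */

		return jhash_3words((__force u32)iph->saddr,
				    (__force u32)iph->daddr, third, hashrnd);
	}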
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:10 UTC
To: eric.dumazet
Cc: therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 May 2011 23:00:50 +0200

> On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
>> From: Tom Herbert <therbert@google.com>
>> Date: Tue, 17 May 2011 13:02:25 -0700
>>
>> > I like it!  And this sounds like the sort of algorithm that NICs might
>> > be able to implement to solve the UDP/RSS unpleasantness, so even
>> > better.
>>
>> Actually, I think it won't work.  Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
>> Back to the drawing board. :-/
>
> Well, we could just use the iph->id in the rxhash computation for frags.
>
> At least all frags of a given datagram should be reassembled on same
> cpu, so we get RPS (but not RFS)

That's true, but one could also argue that in the existing code at least
one of the packets (the one with the UDP header) would make it to the
proper flow cpu.

That could be as much as half of the packets.

So I don't yet see it as a clear win.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 21:13 UTC
To: David Miller
Cc: eric.dumazet, therbert, netdev

On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 17 May 2011 23:00:50 +0200
>
> > On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> >> From: Tom Herbert <therbert@google.com>
> >> Date: Tue, 17 May 2011 13:02:25 -0700
> >>
> >> > I like it!  And this sounds like the sort of algorithm that NICs might
> >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> >> > better.
> >>
> >> Actually, I think it won't work.  Even Linux emits fragments last to
> >> first, so we won't see the UDP header until the last packet where it's
> >> no longer useful.
> >>
> >> Back to the drawing board. :-/
> >
> > Well, we could just use the iph->id in the rxhash computation for frags.
> >
> > At least all frags of a given datagram should be reassembled on same
> > cpu, so we get RPS (but not RFS)
>
> That's true, but one could also argue that in the existing code at least
> one of the packets (the one with the UDP header) would make it to the
> proper flow cpu.
>
> That could be as much as half of the packets.
>
> So I don't yet see it as a clear win.

How heinous would it be to do post-reassembly RFS?

rick
* Re: small RPS cache for fragments?
From: Ben Hutchings @ 2011-05-17 21:13 UTC
To: David Miller
Cc: eric.dumazet, therbert, netdev

On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 17 May 2011 23:00:50 +0200
>
> > On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> >> From: Tom Herbert <therbert@google.com>
> >> Date: Tue, 17 May 2011 13:02:25 -0700
> >>
> >> > I like it!  And this sounds like the sort of algorithm that NICs might
> >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> >> > better.
> >>
> >> Actually, I think it won't work.  Even Linux emits fragments last to
> >> first, so we won't see the UDP header until the last packet where it's
> >> no longer useful.
> >>
> >> Back to the drawing board. :-/
> >
> > Well, we could just use the iph->id in the rxhash computation for frags.
> >
> > At least all frags of a given datagram should be reassembled on same
> > cpu, so we get RPS (but not RFS)
>
> That's true, but one could also argue that in the existing code at least
> one of the packets (the one with the UDP header) would make it to the
> proper flow cpu.

No, we ignore the layer-4 header when either MF or OFFSET is non-zero.

Ben.

> That could be as much as half of the packets.
>
> So I don't yet see it as a clear win.

--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:26 UTC
To: bhutchings
Cc: eric.dumazet, therbert, netdev

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Tue, 17 May 2011 22:13:42 +0100

> On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
>> That's true, but one could also argue that in the existing code at least
>> one of the packets (the one with the UDP header) would make it to the
>> proper flow cpu.
>
> No, we ignore the layer-4 header when either MF or OFFSET is non-zero.

That's right and I now remember we had quite a discussion about this
in the past.

So IP/saddr/daddr keying is out of the question due to reordering
concerns.

The idea to do RFS post fragmentation is interesting, it's sort of
another form of GRO.  We would need to re-fragment (like GRO does)
in the forwarding case.

But it would be nice since it would reduce the number of calls into
the stack (and thus route lookups, etc.) per fragmented frame.

There is of course the issue of fragmentation queue timeouts, and
what semantics of that means when we are not the final destination
and those fragments would have been forwarded rather than consumed
by us.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 21:40 UTC
To: David Miller
Cc: bhutchings, eric.dumazet, therbert, netdev

On Tue, 2011-05-17 at 17:26 -0400, David Miller wrote:
> The idea to do RFS post fragmentation is interesting, it's sort of
> another form of GRO.  We would need to re-fragment (like GRO does)
> in the forwarding case.
>
> But it would be nice since it would reduce the number of calls into
> the stack (and thus route lookups, etc.) per fragmented frame.
>
> There is of course the issue of fragmentation queue timeouts, and
> what semantics of that means when we are not the final destination
> and those fragments would have been forwarded rather than consumed
> by us.

If we are not the final destination, should there be any reassembly
going-on in the first place?  And if reassembly times-out, don't the
frags just get dropped like they would anyway?

Eric keeps asking about (real) workload :)  About the only one I can
think of at this point that would have much in the way of UDP fragments
is EDNS.  Apart from that we may be worrying about how many fragments
can dance on the header of an IP datagram?-)

rick
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:27 UTC
To: Ben Hutchings
Cc: David Miller, therbert, netdev

On Tuesday, 17 May 2011 at 22:13 +0100, Ben Hutchings wrote:
> On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 17 May 2011 23:00:50 +0200
> >
> > > On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> > >> From: Tom Herbert <therbert@google.com>
> > >> Date: Tue, 17 May 2011 13:02:25 -0700
> > >>
> > >> > I like it!  And this sounds like the sort of algorithm that NICs might
> > >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > >> > better.
> > >>
> > >> Actually, I think it won't work.  Even Linux emits fragments last to
> > >> first, so we won't see the UDP header until the last packet where it's
> > >> no longer useful.
> > >>
> > >> Back to the drawing board. :-/
> > >
> > > Well, we could just use the iph->id in the rxhash computation for frags.
> > >
> > > At least all frags of a given datagram should be reassembled on same
> > > cpu, so we get RPS (but not RFS)
> >
> > That's true, but one could also argue that in the existing code at least
> > one of the packets (the one with the UDP header) would make it to the
> > proper flow cpu.
>
> No, we ignore the layer-4 header when either MF or OFFSET is non-zero.

Exactly

As is, RPS (based on our software rxhash computation) should be working
fine with frags, unless we receive different flows with same
(src_addr,dst_addr) pair.

This is why I asked David if real workloads could hit one cpu instead of
many ones.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 21:11 UTC
To: Eric Dumazet
Cc: David Miller, therbert, netdev

On Tue, 2011-05-17 at 23:00 +0200, Eric Dumazet wrote:
> On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> > From: Tom Herbert <therbert@google.com>
> > Date: Tue, 17 May 2011 13:02:25 -0700
> >
> > > I like it!  And this sounds like the sort of algorithm that NICs might
> > > be able to implement to solve the UDP/RSS unpleasantness, so even
> > > better.
> >
> > Actually, I think it won't work.  Even Linux emits fragments last to
> > first, so we won't see the UDP header until the last packet where it's
> > no longer useful.
> >
> > Back to the drawing board. :-/
>
> Well, we could just use the iph->id in the rxhash computation for frags.
>
> At least all frags of a given datagram should be reassembled on same
> cpu, so we get RPS (but not RFS)

Won't that just scatter the fragments of a given flow across processors?
Instead of then going back and forth between two caches - where
reassembly happens and then where the app is running, it will go back
and forth between the app's cache and pretty much nearly every other
cache in the system (or at least configured to take RPS traffic).

rick
* Re: small RPS cache for fragments?
From: Ben Hutchings @ 2011-05-17 21:11 UTC
To: Eric Dumazet
Cc: David Miller, therbert, netdev

On Tue, 2011-05-17 at 23:00 +0200, Eric Dumazet wrote:
> On Tuesday, 17 May 2011 at 16:49 -0400, David Miller wrote:
> > From: Tom Herbert <therbert@google.com>
> > Date: Tue, 17 May 2011 13:02:25 -0700
> >
> > > I like it!  And this sounds like the sort of algorithm that NICs might
> > > be able to implement to solve the UDP/RSS unpleasantness, so even
> > > better.
> >
> > Actually, I think it won't work.  Even Linux emits fragments last to
> > first, so we won't see the UDP header until the last packet where it's
> > no longer useful.
> >
> > Back to the drawing board. :-/
>
> Well, we could just use the iph->id in the rxhash computation for frags.

But then each datagram lands on a different CPU, and reordering is
liable to happen far more often than it does now.

> At least all frags of a given datagram should be reassembled on same
> cpu, so we get RPS (but not RFS)

You could still do RPS with just IP addresses (same as RSS using
Toeplitz hashes).

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: small RPS cache for fragments?
From: Tom Herbert @ 2011-05-17 21:27 UTC
To: David Miller
Cc: netdev

> Actually, I think it won't work.  Even Linux emits fragments last to
> first, so we won't see the UDP header until the last packet where it's
> no longer useful.
>
I remember observing this a while back, what's the rationale for it?

> Back to the drawing board. :-/
>
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:28 UTC
To: therbert
Cc: netdev

From: Tom Herbert <therbert@google.com>
Date: Tue, 17 May 2011 14:27:10 -0700

>> Actually, I think it won't work.  Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
> I remember observing this a while back, what's the rationale for it?

That's the cheapest way to build the fragments.

Regardless of the reason we have to handle it forever.
* Re: small RPS cache for fragments?
From: Changli Gao @ 2011-05-17 23:59 UTC
To: David Miller
Cc: therbert, netdev

On Wed, May 18, 2011 at 4:49 AM, David Miller <davem@davemloft.net> wrote:
>
> Actually, I think it won't work.  Even Linux emits fragments last to
> first, so we won't see the UDP header until the last packet where it's
> no longer useful.
>

No. Linux emits fragments first to last now. You should check the
current code. :)

--
Regards,
Changli Gao (xiaosuo@gmail.com)
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-18 6:37 UTC
To: xiaosuo
Cc: therbert, netdev

From: Changli Gao <xiaosuo@gmail.com>
Date: Wed, 18 May 2011 07:59:05 +0800

> On Wed, May 18, 2011 at 4:49 AM, David Miller <davem@davemloft.net> wrote:
>>
>> Actually, I think it won't work.  Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
>
> No. Linux emits fragments first to last now. You should check the
> current code. :)

I forgot that we rearranged this, thanks :-)

So maybe the original idea can indeed work.
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 20:14 UTC
To: David Miller
Cc: netdev

On Tuesday, 17 May 2011 at 14:33 -0400, David Miller wrote:
> It seems to me that we can solve the UDP fragmentation problem for
> flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> table that maps to the corresponding RPS flow entry.
>
> When we see the initial frag with the UDP header, we create the
> saddr/daddr/IPID mapping, and we tear it down when we hit the
> saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
>
> We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> non-zero.
>
> It's best effort and should work quite well.
>
> Even a one-behind cache, per-NAPI instance, would do a lot better than
> what happens at the moment.  Especially since the IP fragments mostly
> arrive as one packet train.

OK but do we have workloads actually needing this optimization at all ?

(IP defrag hits a read_lock(&ip4_frags.lock), so maybe steer all frags
on a given cpu ?)
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 20:47 UTC
To: eric.dumazet
Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 May 2011 22:14:48 +0200

> OK but do we have workloads actually needing this optimization at all ?

Yes, I've seen performance graphs where RPS/RFS falls off the cliff
when datagram sizes go from 1024 to 2048 bytes.

Wrt. defrag queue overhead, it still is minor compared to the cost of
processing 1/2 of all packets on one cpu on a 24 core system.

BTW, if we can steer reliably, we could make a per-cpu defrag queue if
you worry about it so much :-)
* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 21:44 UTC
To: Eric Dumazet
Cc: David Miller, netdev

Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> OK but do we have workloads actually needing this optimization at all ?

That's a good question.

> (IP defrag hits a read_lock(&ip4_frags.lock), so maybe steer all frags
> on a given cpu ?)

Couldn't the lock just be replaced with a hashed or bitmap lock, or a
lock bit in the low bits of the pointer?

iirc it just protects the heads of the hash table.

They're not rwlocks, but especially if the locking was more finegrained
that's likely not needed anymore.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:52 UTC
To: Andi Kleen
Cc: David Miller, netdev

On Tuesday, 17 May 2011 at 14:44 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > OK but do we have workloads actually needing this optimization at all ?
>
> That's a good question.
>
> > (IP defrag hits a read_lock(&ip4_frags.lock), so maybe steer all frags
> > on a given cpu ?)
>
> Couldn't the lock just be replaced with a hashed or bitmap lock, or a
> lock bit in the low bits of the pointer?
>
> iirc it just protects the heads of the hash table.
>
> They're not rwlocks, but especially if the locking was more finegrained
> that's likely not needed anymore.

Well, there is the rehashing stuff, and this locks the whole table.

Not easy to switch to rcu or something like that.

Anyway I hardly use frags here at work, so never considered it was a
field to spend time ;)
* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 22:03 UTC
To: Eric Dumazet
Cc: David Miller, netdev

Eric Dumazet <eric.dumazet@gmail.com> writes:
>> They're not rwlocks, but especially if the locking was more finegrained
>> that's likely not needed anymore.
>
> Well, there is the rehashing stuff, and this locks the whole table.
>
> Not easy to switch to rcu or something like that.

No need to switch to RCU, just a more finegrained bucket lock.
If you move a chain between queues you just lock both for the move.

It sounds easy enough. I should probably just code it up.

> Anyway I hardly use frags here at work, so never considered it was a
> field to spend time ;)

Yes that's the problem. On the other hand most scalability problems
hurt sooner or later, so sometimes it's good to fix them in advance.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
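A rough sketch of the per-bucket locking being suggested here (all names are
hypothetical; the ip4_frags table of this era really did use a single global
rwlock plus periodic rehashing, so this is only an illustration of the idea):

	#include <linux/kernel.h>
	#include <linux/list.h>
	#include <linux/spinlock.h>

	#define FRAG_HASHSZ 64

	/* One spinlock per hash chain instead of a single rwlock over the
	 * whole fragment table.  Each bucket's lock is assumed to be set up
	 * with spin_lock_init() when the table is created.
	 */
	struct frag_bucket {
		spinlock_t	  lock;
		struct hlist_head chain;
	};

	static struct frag_bucket frag_hash[FRAG_HASHSZ];

	/* Rehashing moves a queue between chains: take just the two bucket
	 * locks involved, in index order, to avoid ABBA deadlock.
	 */
	static void frag_move(struct hlist_node *q, unsigned int from,
			      unsigned int to)
	{
		struct frag_bucket *a = &frag_hash[min(from, to)];
		struct frag_bucket *b = &frag_hash[max(from, to)];

		spin_lock(&a->lock);
		if (a != b)
			spin_lock_nested(&b->lock, SINGLE_DEPTH_NESTING);

		hlist_del(q);
		hlist_add_head(q, &frag_hash[to].chain);

		if (a != b)
			spin_unlock(&b->lock);
		spin_unlock(&a->lock);
	}

Readers then take only the one bucket lock covering the chain they search,
so contention no longer spans the whole table.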
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:44 UTC
To: netdev

Guys we can't time out fragments if we are not the final destination.

Due to asymmetric routing, the fragment pieces we don't see might reach
the final destination not through us.

So we have to pass them onwards, we can't just drop them.
* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 21:48 UTC
To: David Miller
Cc: netdev

David Miller <davem@davemloft.net> writes:

> Guys we can't time out fragments if we are not the final
> destination.

If you're not the final destination you should never even
try to reassemble them?

I'm probably missing something...

-Andi

--
ak@linux.intel.com -- Speaking for myself only
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:50 UTC
To: andi
Cc: netdev

From: Andi Kleen <andi@firstfloor.org>
Date: Tue, 17 May 2011 14:48:28 -0700

> David Miller <davem@davemloft.net> writes:
>
>> Guys we can't time out fragments if we are not the final
>> destination.
>
> If you're not the final destination you should never even
> try to reassemble them?
>
> I'm probably missing something...

We're discussing the idea to do the defragmentation first
so we can choose the flow properly and steer the packet
to the correct cpu.

This also would allow each fragmented packet to traverse the
stack only once (one route lookup etc.) instead of once per
fragment.

Please read the rest of this thread, we have discussed this
and now I'm repeating information solely for your benefit.
* Re: small RPS cache for fragments?
From: Andi Kleen @ 2011-05-17 22:06 UTC
To: David Miller
Cc: netdev

David Miller <davem@davemloft.net> writes:
>
> We're discussing the idea to do the defragmentation first
> so we can choose the flow properly and steer the packet
> to the correct cpu.
>
> This also would allow each fragmented packet to traverse the
> stack only once (one route lookup etc.) instead of once per
> fragment.

You could always check first in a cheap way (e.g. a small hash table)
if it's local or not (and bypass the defragmentation if routing is
turned off or the hash table would have collisions)

On the other hand if fragmentation is expensive it's probably better
to do it later anyways to spread it out better.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 22:42 UTC
To: David Miller
Cc: andi, netdev

On Tue, 2011-05-17 at 17:50 -0400, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Tue, 17 May 2011 14:48:28 -0700
>
> > David Miller <davem@davemloft.net> writes:
> >
> >> Guys we can't time out fragments if we are not the final
> >> destination.
> >
> > If you're not the final destination you should never even
> > try to reassemble them?
> >
> > I'm probably missing something...
>
> We're discussing the idea to do the defragmentation first
> so we can choose the flow properly and steer the packet
> to the correct cpu.
>
> This also would allow each fragmented packet to traverse the
> stack only once (one route lookup etc.) instead of once per
> fragment.
>
> Please read the rest of this thread, we have discussed this
> and now I'm repeating information solely for your benefit.

Well, I should probably be beaten with that stick too because I wasn't
thinking about forwarding, only being the destination system when I
broached the suggestion of doing RFS after reassembly.  I can see where
one *might* be able to do limited RPS when forwarding, but I didn't
know that RFS had been extended to forwarding.

Now though I see why you were rightfully concerned about timeouts -
given all the concerns about added latency from bufferbloat, I wouldn't
think that an additional 10 or perhaps even 1ms timeout on a reassembly
attempt to get the layer four header when forwarding would sit well
with folks - they will expect the fragments to flow through without
additional delay.

rick jones
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-24 20:01 UTC
To: netdev

From: David Miller <davem@davemloft.net>
Date: Tue, 17 May 2011 14:33:42 -0400 (EDT)

> It seems to me that we can solve the UDP fragmentation problem for
> flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> table that maps to the corresponding RPS flow entry.
>
> When we see the initial frag with the UDP header, we create the
> saddr/daddr/IPID mapping, and we tear it down when we hit the
> saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
>
> We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> non-zero.

So I looked into implementing this now that it has been established
that we changed even Linux to emit fragments in-order.

The first problem we run into is that there is no "context" we can
use in all the places where skb_get_rxhash() gets called.

Part of the problem is that we call it from strange places, such as
egress packet schedulers.  That's completely bogus.

Examples, FLOW classifier, META e-match, CHOKE, and SFB.

In fact, for the classifiers this means they aren't making use of the
precomputed TX hash values in the sockets like __skb_tx_hash() will
make use of.  So this makes these packet schedulers operate
potentially more expensively than they need to.

If we could get rid of those silly cases, the stuff that remains
(macvtap and net/core/dev.c) could work with a NAPI context during
rxhash computation and use that to store the IP fragmentation
one-behind cached information.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-24 21:38 UTC
To: David Miller
Cc: netdev

On Tue, 2011-05-24 at 16:01 -0400, David Miller wrote:
> So I looked into implementing this now that it has been established
> that we changed even Linux to emit fragments in-order.
>
> The first problem we run into is that there is no "context" we can
> use in all the places where skb_get_rxhash() gets called.
>
> Part of the problem is that we call it from strange places, such as
> egress packet schedulers.  That's completely bogus.
>
> Examples, FLOW classifier, META e-match, CHOKE, and SFB.
>
> In fact, for the classifiers this means they aren't making use of the
> precomputed TX hash values in the sockets like __skb_tx_hash() will
> make use of.  So this makes these packet schedulers operate
> potentially more expensively than they need to.
>
> If we could get rid of those silly cases, the stuff that remains
> (macvtap and net/core/dev.c) could work with a NAPI context during
> rxhash computation and use that to store the IP fragmentation
> one-behind cached information.

Isn't there still an issue (perhaps small) of traffic being sent through
a mode-rr bond, either at the origin or somewhere along the way?  At the
origin point it will depend on the presence of UFO and whether it is
propagated up through the bond interface, but as a quick test, I
disabled TSO, GSO and UFO on four e1000e driven interfaces, bonded them
mode-rr and ran a netperf UDP_RR test with a 1473 byte request size and
this is what they looked like at my un-bonded receiver at the other end:

14:31:01.011370 IP (tos 0x0, ttl 64, id 24960, offset 1480, flags [none], proto UDP (17), length 21)
    tardy.local > raj-8510w.local: udp
14:31:01.011420 IP (tos 0x0, ttl 64, id 24960, offset 0, flags [+], proto UDP (17), length 1500)
    tardy.local.36073 > raj-8510w.local.59951: UDP, length 1473
14:31:01.011514 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
    raj-8510w.local.59951 > tardy.local.36073: UDP, length 1

rick jones
* Re: small RPS cache for fragments?
From: David Miller @ 2011-06-04 20:29 UTC
To: rick.jones2
Cc: netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Tue, 24 May 2011 14:38:48 -0700

> Isn't there still an issue (perhaps small) of traffic being sent through
> a mode-rr bond, either at the origin or somewhere along the way?  At the
> origin point it will depend on the presence of UFO and whether it is
> propagated up through the bond interface, but as a quick test, I
> disabled TSO, GSO and UFO on four e1000e driven interfaces, bonded them
> mode-rr and ran a netperf UDP_RR test with a 1473 byte request size and
> this is what they looked like at my un-bonded receiver at the other end:
>
> 14:31:01.011370 IP (tos 0x0, ttl 64, id 24960, offset 1480, flags [none], proto UDP (17), length 21)
>     tardy.local > raj-8510w.local: udp
> 14:31:01.011420 IP (tos 0x0, ttl 64, id 24960, offset 0, flags [+], proto UDP (17), length 1500)
>     tardy.local.36073 > raj-8510w.local.59951: UDP, length 1473
> 14:31:01.011514 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
>     raj-8510w.local.59951 > tardy.local.36073: UDP, length 1

That's not good behavior, and it's of course going to cause sub-optimal
performance if we do the RPS fragment cache.

RR bond mode could do something similar, to alleviate this.

I assume it doesn't do this kind of reordering for TCP.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-06-06 17:08 UTC
To: David Miller
Cc: netdev

On Sat, 2011-06-04 at 13:29 -0700, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Tue, 24 May 2011 14:38:48 -0700
>
> > Isn't there still an issue (perhaps small) of traffic being sent through
> > a mode-rr bond, either at the origin or somewhere along the way?  At the
> > origin point it will depend on the presence of UFO and whether it is
> > propagated up through the bond interface, but as a quick test, I
> > disabled TSO, GSO and UFO on four e1000e driven interfaces, bonded them
> > mode-rr and ran a netperf UDP_RR test with a 1473 byte request size and
> > this is what they looked like at my un-bonded receiver at the other end:
> >
> > 14:31:01.011370 IP (tos 0x0, ttl 64, id 24960, offset 1480, flags [none], proto UDP (17), length 21)
> >     tardy.local > raj-8510w.local: udp
> > 14:31:01.011420 IP (tos 0x0, ttl 64, id 24960, offset 0, flags [+], proto UDP (17), length 1500)
> >     tardy.local.36073 > raj-8510w.local.59951: UDP, length 1473
> > 14:31:01.011514 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
> >     raj-8510w.local.59951 > tardy.local.36073: UDP, length 1
>
> That's not good behavior, and it's of course going to cause sub-optimal
> performance if we do the RPS fragment cache.
>
> RR bond mode could do something similar, to alleviate this.
>
> I assume it doesn't do this kind of reordering for TCP.

Mode-rr bonding reorders TCP segments all the time.

rick
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-06-06 17:15 UTC
To: rick.jones2
Cc: David Miller, netdev

On Monday, 6 June 2011 at 10:08 -0700, Rick Jones wrote:

> Mode-rr bonding reorders TCP segments all the time.

Shouldn't TCP frames have the DF bit set?
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-06-06 18:06 UTC
To: Eric Dumazet
Cc: David Miller, netdev

On Mon, 2011-06-06 at 19:15 +0200, Eric Dumazet wrote:
> On Monday, 6 June 2011 at 10:08 -0700, Rick Jones wrote:
>
> > Mode-rr bonding reorders TCP segments all the time.
>
> Shouldn't TCP frames have the DF bit set?

I was ass-u-me-ing that when talking about TCP, David was speaking
generally about TCP segments, and suggesting that were bonding's
mode-rr altered to not re-order TCP segments, a similar technique
could/would/should avoid re-ordering IP datagram fragments, regardless
of their payload.

Jay will have to weigh-in on how difficult that would be, I'm guessing
it would mean a fair bit of overhead to mode-rr though, to know the
completion status of frames from the same flow and/or the depth of the
tx queues etc etc.  I thought that one of mode-rr's (few IMO, just
check the archives where I've complained about it :) redeeming
qualities was its minimal overhead.

rick jones
* Re: small RPS cache for fragments?
From: David Miller @ 2011-06-06 19:23 UTC
To: eric.dumazet
Cc: rick.jones2, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 06 Jun 2011 19:15:13 +0200

> On Monday, 6 June 2011 at 10:08 -0700, Rick Jones wrote:
>
>> Mode-rr bonding reorders TCP segments all the time.
>
> Shouldn't TCP frames have the DF bit set?

That has nothing to do with this discussion :-)

DF bit set or not, TCP or UDP, this bonding mode apparently reorders
frames all the time.
* Re: small RPS cache for fragments?
From: David Miller @ 2011-06-06 19:22 UTC
To: rick.jones2
Cc: netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Mon, 06 Jun 2011 10:08:52 -0700

> Mode-rr bonding reorders TCP segments all the time.

Oh well, then don't use this if you care about performance at all.
And therefore it's not even worth considering for our RPS fragment
cache.
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-06-06 20:05 UTC
To: David Miller
Cc: netdev

On Mon, 2011-06-06 at 12:22 -0700, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Mon, 06 Jun 2011 10:08:52 -0700
>
> > Mode-rr bonding reorders TCP segments all the time.
>
> Oh well, then don't use this if you care about performance at all.
> And therefore it's not even worth considering for our RPS fragment
> cache.

Heh - the (or at least a) reason people use mode-rr is to make a single
(TCP) stream go faster :)  Without buying the next-up NIC speed.

rick jones
* Re: small RPS cache for fragments?
From: Jay Vosburgh @ 2011-06-06 21:06 UTC
To: rick.jones2
Cc: David Miller, netdev

Rick Jones <rick.jones2@hp.com> wrote:

>On Mon, 2011-06-06 at 12:22 -0700, David Miller wrote:
>> From: Rick Jones <rick.jones2@hp.com>
>> Date: Mon, 06 Jun 2011 10:08:52 -0700
>>
>> > Mode-rr bonding reorders TCP segments all the time.
>>
>> Oh well, then don't use this if you care about performance at all.
>> And therefore it's not even worth considering for our RPS fragment
>> cache.
>
>Heh - the (or at least a) reason people use mode-rr is to make a single
>(TCP) stream go faster :)  Without buying the next-up NIC speed.

Right, the common use case for balance-rr (round robin) is to
maximize TCP throughput for one connection, over a set of whatever
network devices are available (or are cheap) by striping that connection
across multiple interfaces.  The tcp_reordering sysctl is set to some
large value so that TCP will deal with the reordering as best it can.

Since TCP generally won't fragment, I don't see that the RPS frag
cache is going to matter for this usage anyway.  If somebody out there
is using round robin for some UDP-based application that doesn't care
about packet ordering (but might create fragmented datagrams), I've not
heard about it.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: small RPS cache for fragments?
From: David Miller @ 2011-06-06 21:40 UTC
To: fubar
Cc: rick.jones2, netdev

From: Jay Vosburgh <fubar@us.ibm.com>
Date: Mon, 06 Jun 2011 14:06:09 -0700

> Right, the common use case for balance-rr (round robin) is to
> maximize TCP throughput for one connection, over a set of whatever
> network devices are available (or are cheap) by striping that connection
> across multiple interfaces.  The tcp_reordering sysctl is set to some
> large value so that TCP will deal with the reordering as best it can.

FWIW, I really would never, ever, encourage schemes like this.  Even
if they do happen to work.
* Re: small RPS cache for fragments?
From: Chris Friesen @ 2011-06-06 22:49 UTC
To: David Miller
Cc: fubar, rick.jones2, netdev

On 06/06/2011 03:40 PM, David Miller wrote:
> From: Jay Vosburgh <fubar@us.ibm.com>
> Date: Mon, 06 Jun 2011 14:06:09 -0700
>
>> Right, the common use case for balance-rr (round robin) is to
>> maximize TCP throughput for one connection, over a set of whatever
>> network devices are available (or are cheap) by striping that connection
>> across multiple interfaces.  The tcp_reordering sysctl is set to some
>> large value so that TCP will deal with the reordering as best it can.
>
> FWIW, I really would never, ever, encourage schemes like this.  Even
> if they do happen to work.

Why not?  And if not then what's the recommended way to handle the
above scenario?  (Assuming hardware upgrade isn't an option.)

Chris

--
Chris Friesen
Software Developer
GENBAND
chris.friesen@genband.com
www.genband.com