* [PATCH net-next] route: fix breakage after moving lwtunnel state
@ 2015-08-21 10:41 Jiri Benc
  2015-08-23 23:51 ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Jiri Benc @ 2015-08-21 10:41 UTC (permalink / raw)
  To: netdev

__refcnt and related fields need to be in their own cacheline for
performance reasons. Commit 61adedf3e3f1 ("route: move lwtunnel state to
dst_entry") broke that on 32bit archs, causing the BUILD_BUG_ON in dst_hold
to trigger.

This patch fixes the breakage by moving the lwtunnel state to the end of
dst_entry on 32bit archs. Unfortunately, this makes it share the cacheline
with __refcnt and may affect performance; further patches may be needed.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 61adedf3e3f1 ("route: move lwtunnel state to dst_entry")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
I'm working on this; I'm going to grab performance numbers with this patch
applied and work on follow-up patches as necessary. Until then, this patch
at least fixes the 32bit build.

I'm very sorry for the breakage. I tried to build the patchset with various
configs (IPv6 off, lwtunnel off, etc.) but obviously did not test on 32bit.
I have no excuse for this; I should have tested it, as the #ifdef was very
obvious.
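
For context, the check that fires is the alignment assertion in dst_hold(),
roughly the following (a simplified sketch; the exact comment text in the
tree may differ):

static inline void dst_hold(struct dst_entry *dst)
{
	/*
	 * If the build stops here, check the __pad_to_align_refcnt
	 * declaration in struct dst_entry: __refcnt has to start on a
	 * 64-byte boundary.
	 */
	BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
	atomic_inc(&dst->__refcnt);
}
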
---
 include/net/dst.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 0a9a723f6c19..ef8f1d43a203 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -44,7 +44,6 @@ struct dst_entry {
 #else
 	void			*__pad1;
 #endif
-	struct lwtunnel_state   *lwtstate;
 	int			(*input)(struct sk_buff *);
 	int			(*output)(struct sock *sk, struct sk_buff *skb);
 
@@ -85,11 +84,12 @@ struct dst_entry {
 	__u32			__pad2;
 #endif
 
+#ifdef CONFIG_64BIT
+	struct lwtunnel_state   *lwtstate;
 	/*
 	 * Align __refcnt to a 64 bytes alignment
 	 * (L1_CACHE_SIZE would be too much)
 	 */
-#ifdef CONFIG_64BIT
 	long			__pad_to_align_refcnt[1];
 #endif
 	/*
@@ -99,6 +99,9 @@ struct dst_entry {
 	atomic_t		__refcnt;	/* client references	*/
 	int			__use;
 	unsigned long		lastuse;
+#ifndef CONFIG_64BIT
+	struct lwtunnel_state   *lwtstate;
+#endif
 	union {
 		struct dst_entry	*next;
 		struct rtable __rcu	*rt_next;
-- 
1.8.3.1


* Re: [PATCH net-next] route: fix breakage after moving lwtunnel state
  2015-08-21 10:41 [PATCH net-next] route: fix breakage after moving lwtunnel state Jiri Benc
@ 2015-08-23 23:51 ` David Miller
  2015-08-26 16:19   ` Jiri Benc
  0 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2015-08-23 23:51 UTC (permalink / raw)
  To: jbenc; +Cc: netdev

From: Jiri Benc <jbenc@redhat.com>
Date: Fri, 21 Aug 2015 12:41:14 +0200

> @@ -99,6 +99,9 @@ struct dst_entry {
>  	atomic_t		__refcnt;	/* client references	*/
>  	int			__use;
>  	unsigned long		lastuse;
> +#ifndef CONFIG_64BIT
> +	struct lwtunnel_state   *lwtstate;
> +#endif
>  	union {
>  		struct dst_entry	*next;
>  		struct rtable __rcu	*rt_next;

I'm going to apply this to fix the build error without reverting your
change entirely, but this is really an undesirable solution.

This cache line of the dst is for write-heavy members of struct
dst_entry, and so if you put a read-mostly member here it's going to
result in performance problems.


* Re: [PATCH net-next] route: fix breakage after moving lwtunnel state
  2015-08-23 23:51 ` David Miller
@ 2015-08-26 16:19   ` Jiri Benc
  2015-08-26 22:13     ` Thomas Graf
  2015-08-27 18:30     ` David Miller
  0 siblings, 2 replies; 10+ messages in thread
From: Jiri Benc @ 2015-08-26 16:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Sun, 23 Aug 2015 16:51:03 -0700 (PDT), David Miller wrote:
> From: Jiri Benc <jbenc@redhat.com>
> Date: Fri, 21 Aug 2015 12:41:14 +0200
> 
> > @@ -99,6 +99,9 @@ struct dst_entry {
> >  	atomic_t		__refcnt;	/* client references	*/
> >  	int			__use;
> >  	unsigned long		lastuse;
> > +#ifndef CONFIG_64BIT
> > +	struct lwtunnel_state   *lwtstate;
> > +#endif
> >  	union {
> >  		struct dst_entry	*next;
> >  		struct rtable __rcu	*rt_next;
> 
> I'm going to apply this to fix the build error without reverting your
> change entirely, but this is really an undesirable solution.
> 
> This cache line of the dst is for write-heavy members of struct
> dst_entry, and so if you put a read-mostly member here it's going to
> result in performance problems.

I've spent the last few days taking measurements using tbench (which was
originally used for dst_entry reshuffling) and netperf (super_netperf
with various numbers of concurrent TCP streams) on i686 and found no
regression introduced by this.

This looks somewhat surprising until we look at where and how lwtstate is
used. In the non-tunnel case, it's either:

(1) used in netlink route notifications or
(2) used together with the first few fields in struct rtable/rt6_info or
(3) accessed via a skb_tunnel_info call.

There are no cases other than those three. Of those, (1) is not of much
concern.

As for (2), the first few fields in struct rtable and struct rt6_info share
the same cacheline with __refcnt, thus accessing the lwtstate makes no
difference here.

About (3), skb_tunnel_info is called from ip_route_input_slow and
ip6_route_input. However, skb->dst is set only in the case of a metadata
dst_entry. In such a case, the tunnel info is fetched from the metadata;
otherwise, skb->dst is NULL. In either case, lwtstate is not accessed at all.
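
For reference, the lookup order is roughly the following (a simplified
sketch of skb_tunnel_info; the exact signature and helper names have varied
between versions of the tree):

static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
{
	struct metadata_dst *md_dst = skb_metadata_dst(skb);
	struct dst_entry *dst;

	/* metadata dst: tunnel info comes from the metadata itself */
	if (md_dst)
		return &md_dst->u.tun_info;

	/* only lwtunnel routes reach the lwtstate dereference */
	dst = skb_dst(skb);
	if (dst && dst->lwtstate)
		return lwt_tun_info(dst->lwtstate);

	return NULL;
}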

This confirms what I measured: placing lwtstate in the same cacheline as
__refcnt has no impact on performance in the non-tunneling case.

As for tunneling, I did not see any performance degradation after my patch
in most cases. However, there were some cases with a small degradation
(<2%). I bisected it to the addition of the IPv6 fields to ip_tunnel_key
(i.e. commit c1ea5d672aaf). This is not very conclusive, though; the
variation of the benchmark results was relatively high and this might be
noise. However, there's definitely room for performance improvement here:
the lwtunnel vxlan throughput is at about 40% of the non-vxlan throughput.
I have not spent much time analyzing this yet, but it's clear the dst_entry
layout is not our biggest concern here.

As a result, I think that the lwtstate field may stay where it is. To
simplify the code and get rid of the #ifdefs, we could have it at the end
of the struct in the 64bit case as well. Let me know if you prefer this
and I'll submit a patch.

Please let me know if you disagree with my analysis above.

Thanks,

  Jiri

-- 
Jiri Benc


* Re: [PATCH net-next] route: fix breakage after moving lwtunnel state
  2015-08-26 16:19   ` Jiri Benc
@ 2015-08-26 22:13     ` Thomas Graf
  2015-08-27 19:47       ` Tom Herbert
  2015-08-28  8:36       ` Jiri Benc
  2015-08-27 18:30     ` David Miller
  1 sibling, 2 replies; 10+ messages in thread
From: Thomas Graf @ 2015-08-26 22:13 UTC (permalink / raw)
  To: Jiri Benc; +Cc: David Miller, netdev

On 08/26/15 at 06:19pm, Jiri Benc wrote:
> might be noise. However, there's definitely room for performance
> improvement here: the lwtunnel vxlan throughput is at about 40% of the
> non-vxlan throughput. I have not spent much time analyzing this yet, but
> it's clear the dst_entry layout is not our biggest concern here.

I'm currently working on reducing the overhead for VXLAN and GRE, and
effectively Geneve once Pravin's work is in. The main disadvantage
of lwt based flow tunneling is the additional fib_lookup() performed
for each packet. It seems tempting to cache the tunnel endpoint dst in
the lwt state of the overlay route. It will usually point to the same
dst for every packet. The cache behaviour is dependent on no fib rules
being installed and the route being a single nexthop route.
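
To illustrate the idea, a hypothetical sketch (the struct, field and
function names below are made up for illustration; this is not an actual
implementation and no invalidation or locking policy is shown):

/* Hypothetical per-lwtstate cache of the underlay route. */
struct lwt_dst_cache {
	struct dst_entry	*dst;		/* cached tunnel endpoint dst */
	u32			cookie;		/* for dst validity checks */
};

static struct dst_entry *lwt_dst_cache_get(struct lwt_dst_cache *c)
{
	struct dst_entry *dst = READ_ONCE(c->dst);

	/* Reuse the cached dst only while it is still valid; as noted
	 * above, this also assumes no fib rules and a single nexthop
	 * route, otherwise the per-packet fib_lookup() cannot be skipped.
	 */
	if (dst && dst->obsolete && !dst->ops->check(dst, c->cookie))
		return NULL;
	return dst;
}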

Did you test with a card that features UDP encapsulation offloads?


* Re: [PATCH net-next] route: fix breakage after moving lwtunnel state
  2015-08-26 16:19   ` Jiri Benc
  2015-08-26 22:13     ` Thomas Graf
@ 2015-08-27 18:30     ` David Miller
  1 sibling, 0 replies; 10+ messages in thread
From: David Miller @ 2015-08-27 18:30 UTC (permalink / raw)
  To: jbenc; +Cc: netdev

From: Jiri Benc <jbenc@redhat.com>
Date: Wed, 26 Aug 2015 18:19:26 +0200

> Please let me know if you disagree with my analysis above.

I agree with your analysis, thanks for taking the time to investigate.

As for the 40% degradation, as Thomas mentioned, it's because of the
extra FIB lookups.


* Re: [PATCH net-next] route: fix breakage after moving lwtunnel state
  2015-08-26 22:13     ` Thomas Graf
@ 2015-08-27 19:47       ` Tom Herbert
  2015-08-27 21:00         ` Thomas Graf
  2015-08-28  8:36       ` Jiri Benc
  1 sibling, 1 reply; 10+ messages in thread
From: Tom Herbert @ 2015-08-27 19:47 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Jiri Benc, David Miller, Linux Kernel Network Developers

On Wed, Aug 26, 2015 at 3:13 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 08/26/15 at 06:19pm, Jiri Benc wrote:
>> might be noise. However, there's definitely room for performance
>> improvement here: the lwtunnel vxlan throughput is at about 40% of the
>> non-vxlan throughput. I have not spent much time analyzing this yet, but
>> it's clear the dst_entry layout is not our biggest concern here.
>
> I'm currently working on reducing the overhead for VXLAN and GRE, and
> effectively Geneve once Pravin's work is in. The main disadvantage
> of lwt based flow tunneling is the additional fib_lookup() performed
> for each packet. It seems tempting to cache the tunnel endpoint dst in
> the lwt state of the overlay route. It will usually point to the same
> dst for every packet. The cache behaviour is dependent on no fib rules
> being installed and the route being a single nexthop route.
>
Or set the nexthop appropriately. This is what we do for ILA. It works great
without any other dst references, but it might put too much weight on the
administrator to configure a nexthop per encapsulating destination.

Tom

> Did you test with a card that features UDP encapsulation offloads?


* Re: [PATCH net-next] route: fix breakage after moving lwtunnel state
  2015-08-27 19:47       ` Tom Herbert
@ 2015-08-27 21:00         ` Thomas Graf
  2015-08-27 21:20           ` Tom Herbert
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Graf @ 2015-08-27 21:00 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Jiri Benc, David Miller, Linux Kernel Network Developers

On 08/27/15 at 12:47pm, Tom Herbert wrote:
> On Wed, Aug 26, 2015 at 3:13 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > On 08/26/15 at 06:19pm, Jiri Benc wrote:
> >> might be noise. However, there's definitely room for performance
> >> improvement here: the lwtunnel vxlan throughput is at about 40% of the
> >> non-vxlan throughput. I have not spent much time analyzing this yet, but
> >> it's clear the dst_entry layout is not our biggest concern here.
> >
> > I'm currently working on reducing the overhead for VXLAN and GRE, and
> > effectively Geneve once Pravin's work is in. The main disadvantage
> > of lwt based flow tunneling is the additional fib_lookup() performed
> > for each packet. It seems tempting to cache the tunnel endpoint dst in
> > the lwt state of the overlay route. It will usually point to the same
> > dst for every packet. The cache behaviour is dependent on no fib rules
> > being installed and the route being a single nexthop route.
> >
> Or set the nexthop appropriately. This is what we do for ILA. It works great
> without any other dst references, but it might put too much weight on the
> administrator to configure a nexthop per encapsulating destination.

I assume you mean something like this, right?

        ip route [...] encap vxlan dst 10.1.1.1 dev eth0

The IP metadata encap at FIB level is currently encap agnostic
and requires an intermediate encap device which then defines the
actual encap protocol:

        ip route overlay/prefix encap ip dst 10.1.1.1 dev vxlan0
        ip route 10.1.1.1/prefix dev eth0

I like it because we don't have to embed all the options as metadata
and can still set them through the device. An option would also be
to allow for both and add the following alternative:

        ip route overlay/prefix encap ip type vxlan dst 10.1.1.1 dev eth0


* Re: [PATCH net-next] route: fix breakage after moving lwtunnel state
  2015-08-27 21:00         ` Thomas Graf
@ 2015-08-27 21:20           ` Tom Herbert
  2015-08-28 10:20             ` Thomas Graf
  0 siblings, 1 reply; 10+ messages in thread
From: Tom Herbert @ 2015-08-27 21:20 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Jiri Benc, David Miller, Linux Kernel Network Developers

On Thu, Aug 27, 2015 at 2:00 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 08/27/15 at 12:47pm, Tom Herbert wrote:
>> On Wed, Aug 26, 2015 at 3:13 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> > On 08/26/15 at 06:19pm, Jiri Benc wrote:
>> >> might be noise. However, there's definitely room for performance
>> >> improvement here: the lwtunnel vxlan throughput is at about 40% of the
>> >> non-vxlan throughput. I have not spent much time analyzing this yet, but
>> >> it's clear the dst_entry layout is not our biggest concern here.
>> >
>> > I'm currently working on reducing the overhead for VXLAN and GRE, and
>> > effectively Geneve once Pravin's work is in. The main disadvantage
>> > of lwt based flow tunneling is the additional fib_lookup() performed
>> > for each packet. It seems tempting to cache the tunnel endpoint dst in
>> > the lwt state of the overlay route. It will usually point to the same
>> > dst for every packet. The cache behaviour is dependent on no fib rules
>> > being installed and the route being a single nexthop route.
>> >
>> Or set the nexthop appropriately. This is what we do for ILA. It works great
>> without any other dst references, but it might put too much weight on the
>> administrator to configure a nexthop per encapsulating destination.
>
> I assume you mean something like this, right?
>
>         ip route [...] encap vxlan dst 10.1.1.1 dev eth0
>
I'm doing:

ip route add 3333:0:0:1:5555:0:2:0/128 encap ila 2001:0:0:2 via
2401:db00:20:911a:face:0:27:0

so that 2401:db00:20:911a:face:0:27:0 is the next hop for destination
2001:0:0:2:5555:0:2:0. The dst_output for lwt just calls the original
dst_output after transforming the packet, without the use of any
additional routes. So in this way ILA LWT is just acting as a
"pass-through" packet transformation mechanism. Such a model might have
additional utility: LWT occurs before iptables so that iptables sees the
translated or encapsulated packet (davem mentioned this is probably what
we want), and we may want to defer translation until IP fragmentation
(Roopa mentioned she needs this for MPLS).
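
To illustrate the pass-through model, a simplified sketch of such an lwt
output hook (the translation helper name is illustrative only; the relevant
part is the orig_output indirection, which avoids any extra route lookup):

/* Sketch of a pass-through lwtunnel output hook: rewrite the packet in
 * place, then hand it back to the route's original output.
 * ila_rewrite_locator() is an illustrative stand-in for the real
 * translation helper.
 */
static int ila_output_sketch(struct sock *sk, struct sk_buff *skb)
{
	struct dst_entry *dst = skb_dst(skb);

	if (skb->protocol != htons(ETH_P_IPV6))
		goto drop;

	/* rewrite the destination locator using the per-route lwt state */
	ila_rewrite_locator(skb, dst->lwtstate);

	/* no extra route lookup: reuse the original dst_output */
	return dst->lwtstate->orig_output(sk, skb);

drop:
	kfree_skb(skb);
	return -EINVAL;
}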

> The IP metadata encap at FIB level is currently encap agnostic
> and requires an intermediate encap device which then defines the
> actual encap protocol:
>
>         ip route overlay/prefix encap ip dst 10.1.1.1 dev vxlan0
>         ip route 10.1.1.1/prefix dev eth0
>
But then you're outputting through another device, multiple routes are
involved, and performance drops :-( Why not just set the route through
VXLAN in that case?

> I like it because we don't have to embed all the options as metadata
> and can still set them through the device. An option would also be
> to allow for both and add the following alternative:
>
>         ip route overlay/prefix encap ip type vxlan dst 10.1.1.1 dev eth0

Better, we should be able to send encapsulated packets without needing a device.

Tom


* Re: [PATCH net-next] route: fix breakage after moving lwtunnel state
  2015-08-26 22:13     ` Thomas Graf
  2015-08-27 19:47       ` Tom Herbert
@ 2015-08-28  8:36       ` Jiri Benc
  1 sibling, 0 replies; 10+ messages in thread
From: Jiri Benc @ 2015-08-28  8:36 UTC (permalink / raw)
  To: Thomas Graf; +Cc: David Miller, netdev

On Thu, 27 Aug 2015 00:13:30 +0200, Thomas Graf wrote:
> Did you test with a card that features UDP encapsulation offloads?

No, I did not. I do have access to NICs supporting it but I think in
this case, the numbers without vxlan offloading were more interesting.

 Jiri

-- 
Jiri Benc


* Re: [PATCH net-next] route: fix breakage after moving lwtunnel state
  2015-08-27 21:20           ` Tom Herbert
@ 2015-08-28 10:20             ` Thomas Graf
  0 siblings, 0 replies; 10+ messages in thread
From: Thomas Graf @ 2015-08-28 10:20 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Jiri Benc, David Miller, Linux Kernel Network Developers

On 08/27/15 at 02:20pm, Tom Herbert wrote:
> I'm doing:
> 
> ip route add 3333:0:0:1:5555:0:2:0/128 encap ila 2001:0:0:2 via
> 2401:db00:20:911a:face:0:27:0
> 
> so that 2401:db00:20:911a:face:0:27:0 is the next hop for destination
> 2001:0:0:2:5555:0:2:0. The dst_output for lwt just calls the original
> dst_output after transforming the packet, without the use of any
> additional routes. So in this way ILA LWT is just acting as a
> "pass-through" packet transformation mechanism. Such a model might have
> additional utility: LWT occurs before iptables so that iptables sees the
> translated or encapsulated packet (davem mentioned this is probably what
> we want), and we may want to defer translation until IP fragmentation
> (Roopa mentioned she needs this for MPLS).
> 
> > The IP metadata encap at FIB level is currently encap agnostic
> > and requires an intermediate encap device which then defines the
> > actual encap protocol:
> >
> >         ip route overlay/prefix encap ip dst 10.1.1.1 dev vxlan0
> >         ip route 10.1.1.1/prefix dev eth0
> >
> But then you're outputting through another device, multiple routes are
> involved, and performance drops :-( Why not just set the route through
> VXLAN in that case?

The problem with having a single route is that it doesn't allow separating
the management of the overlay and the underlay. It is common to manage
the underlay with Quagga, bird or even static routes and defer the
overlay to Neutron or a fancy container orchestration system.

Caching of the 10.1.1.1 nexthop route in the overlay route would
essentially lead to the same behaviour without requiring the nexthop to be
hardcoded. I should have patches to demonstrate this in a bit.

> > I like it because we don't have to embed all the options as metadata
> > and can still set them through the device. An option would also be
> > to allow for both and add the following alternative:
> >
> >         ip route overlay/prefix encap ip type vxlan dst 10.1.1.1 dev eth0
> 
> Better, we should be able to send encapsulated packets without needing a device.

Why is the device itself bad? I understand that we want to minimize
overhead, but why is a single logical device that keeps common config and
stats undesirable?

