* XFRM pcpu cache issue
@ 2017-08-03 15:48 Ilan Tayari
2017-08-04 16:55 ` Florian Westphal
From: Ilan Tayari @ 2017-08-03 15:48 UTC
To: Florian Westphal
Cc: Steffen Klassert, netdev@vger.kernel.org, Yevgeny Kliteynik,
Yossi Kuperman, Boris Pismenny, Yossef Efraim
Hi Florian,
I did some debugging on the regression I told you about the other day.
Steps and Symptoms:
1. Set up a host-to-host IPsec tunnel (or transport mode, it doesn't matter)
2. Ping over IPsec, or do anything else that populates the pcpu cache
3. Join a multicast (MC) group, then leave it
4. Ping again from the same CPU as before -> traffic doesn't egress the machine at all
Pinging from another CPU (with a clean cache) works.
Clearing the pcpu cache also makes it work again.
With a little more digging I found that when the cache is first populated (step 2), xdst->u.dst.dev and xdst->u.dst.path->dev both point to the same device (my intended device).
At step 4, the cached xdst->u.dst.dev is unchanged, but xdst->u.dst.path->dev points to the 'lo' device.
With a HW breakpoint I found what changes it. It is this call stack:
#0 0xffffffff8158bc09 in dst_dev_put at net/core/dst.c:172
#1 0xffffffff815bff14 in rt_cache_route at net/ipv4/route.c:1367
#2 0xffffffff815c0005 in rt_set_nexthop at net/ipv4/route.c:1468
#3 0xffffffff815c25b9 in __mkroute_output at net/ipv4/route.c:2262
#4 ip_route_output_key_hash_rcu at net/ipv4/route.c:2454
#5 0xffffffff815c2b0e in ip_route_output_key_hash at net/ipv4/route.c:2289
#6 0xffffffff815f02e9 in __ip_route_output_key at ./include/net/route.h:125
#7 ip_route_connect at ./include/net/route.h:297
#8 __ip4_datagram_connect at net/ipv4/datagram.c:51
#9 0xffffffff815f048c in ip4_datagram_connect at net/ipv4/datagram.c:92
#10 0xffffffff815ff45e in inet_dgram_connect at net/ipv4/af_inet.c:540
#11 0xffffffff81563207 in SYSC_connect at net/socket.c:1628
#12 0xffffffff81564b8e in SyS_connect at net/socket.c:1609
#13 0xffffffff816aa5f7 in entry_SYSCALL_64_fastpath at arch/x86/entry/entry_64.S:203
The line at the top frame (net/core/dst.c:172) is telling:
dst->dev = dev_net(dst->dev)->loopback_dev;
So the dev is replaced when the first packet is sent *after* the MC join/leave, not during the join/leave itself.
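To make the observation concrete, here is a minimal user-space sketch in C of the state change described above. It is a toy model, not kernel code: the toy_* types and functions are hypothetical, and the "dead" flag is an assumption standing in for whatever marks a retired route. It only shows how a cached bundle can keep its own dev while the route underneath it is re-pointed at loopback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel structures; all names are hypothetical. */
struct toy_dev {
	const char *name;
};

struct toy_route {
	struct toy_dev *dev;	/* device packets would leave through */
	bool dead;		/* assumption: models the route being retired */
};

struct toy_bundle {
	struct toy_dev *dev;	/* plays the role of xdst->u.dst.dev */
	struct toy_route *path;	/* plays the role of xdst->u.dst.path */
};

/* Models the quoted line: a retired route is re-pointed at loopback. */
static void toy_route_retire(struct toy_route *rt, struct toy_dev *loopback)
{
	assert(rt != NULL && loopback != NULL);
	rt->dev = loopback;
	rt->dead = true;
}

/* True when the bundle and its underlying route disagree on the device. */
static bool toy_bundle_diverged(const struct toy_bundle *b)
{
	return b->dev != b->path->dev;
}
```

In this model, a bundle built over a route on the intended device reports no divergence until toy_route_retire() runs on that route, which mirrors the step 2 versus step 4 difference.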
For reference, in step 3 above, we do:
socket(AF_INET,SOCK_DGRAM, IPPROTO_UDP)
setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
setsockopt(SOL_IP, IP_MULTICAST_TTL, 1)
setsockopt(SOL_IP, IP_MULTICAST_LOOP, 1)
setsockopt(SOL_IP, IP_MULTICAST_IF, <ip of device>)
setsockopt(SOL_IP, IP_ADD_MEMBERSHIP, <group>)
setsockopt(SOL_IP, IP_MULTICAST_TTL, 1)
bind(<group>, <some port>)
and then exit the process after a few seconds.
I am using net-next from around two weeks ago.
I'll continue digging, but would love to hear your opinion and maybe suggestions on where to look next.
Ilan.
* Re: XFRM pcpu cache issue
2017-08-03 15:48 XFRM pcpu cache issue Ilan Tayari
@ 2017-08-04 16:55 ` Florian Westphal
2017-08-06 6:50 ` Ilan Tayari
From: Florian Westphal @ 2017-08-04 16:55 UTC
To: Ilan Tayari
Cc: Florian Westphal, Steffen Klassert, netdev@vger.kernel.org,
Yevgeny Kliteynik, Yossi Kuperman, Boris Pismenny, Yossef Efraim
Ilan Tayari <ilant@mellanox.com> wrote:
> I did some debugging on the regression I told you about the other day.
>
> Steps and Symptoms:
> 1. Set up a host-to-host IPsec tunnel (or transport mode, it doesn't matter)
> 2. Ping over IPsec, or do anything else that populates the pcpu cache
> 3. Join a multicast (MC) group, then leave it
> 4. Ping again from the same CPU as before -> traffic doesn't egress the machine at all
>
> Pinging from another CPU (with a clean cache) works.
> Clearing the pcpu cache also makes it work again.
Yes, I think I see the problem, thanks for debugging this.
Compared to the RFC version of the patch, I dropped the stale_bundle() check. That was a stupid thing
to do, because that check is what would detect this.
Does this help?
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1818,7 +1818,8 @@ xfrm_resolve_and_create_bundle(struct xfrm_policy **pols, int num_pols,
 	    xdst->num_pols == num_pols &&
 	    !xfrm_pol_dead(xdst) &&
 	    memcmp(xdst->pols, pols,
-		   sizeof(struct xfrm_policy *) * num_pols) == 0) {
+		   sizeof(struct xfrm_policy *) * num_pols) == 0 &&
+	    xfrm_bundle_ok(xdst)) {
 		dst_hold(&xdst->u.dst);
 		return xdst;
 	}
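
The effect of the extra condition can be illustrated with a small stand-alone model in C. Everything here is hypothetical: toy_bundle_ok() only checks an assumed "path_dead" flag and makes no claim about what xfrm_bundle_ok() actually verifies. The point is just the shape of the lookup: a cache hit requires both a match and a passing validity check, otherwise the caller falls back to building a new bundle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a cached bundle. */
struct toy_cached_bundle {
	int policy_id;	/* stands in for the pols[] comparison */
	bool path_dead;	/* assumption: set once the underlying route is retired */
};

/* Assumed analogue of the added validity check: reject stale entries. */
static bool toy_bundle_ok(const struct toy_cached_bundle *b)
{
	return !b->path_dead;
}

/*
 * Return the cached bundle only if it matches AND is still valid.
 * NULL tells the caller to build a fresh bundle instead.
 */
static struct toy_cached_bundle *
toy_cache_lookup(struct toy_cached_bundle *cached, int policy_id)
{
	if (cached != NULL &&
	    cached->policy_id == policy_id &&
	    toy_bundle_ok(cached))
		return cached;
	return NULL;
}
```

Without the toy_bundle_ok() term, the lookup would keep returning the stale entry for a matching policy, which corresponds to the symptom in the report.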
* RE: XFRM pcpu cache issue
2017-08-04 16:55 ` Florian Westphal
@ 2017-08-06 6:50 ` Ilan Tayari
From: Ilan Tayari @ 2017-08-06 6:50 UTC
To: Florian Westphal
Cc: Steffen Klassert, netdev@vger.kernel.org, Yevgeny Kliteynik,
Yossi Kuperman, Boris Pismenny, Yossef Efraim, Ayham Masood
> -----Original Message-----
> From: Florian Westphal [mailto:fw@strlen.de]
> Subject: Re: XFRM pcpu cache issue
>
> Compared to the RFC version of the patch, I dropped the stale_bundle() check. That was a stupid thing
> to do, because that check is what would detect this.
>
> Does this help?
Yes, this fixes the regression for me.
Reported-by: Ayham Masood <ayhamm@mellanox.com>
Tested-by: Ilan Tayari <ilant@mellanox.com>
Thanks, Florian!