* rcu locking issue in mpls output code?
@ 2016-06-20 0:45 Lennert Buytenhek
2016-06-20 2:19 ` David Ahern
0 siblings, 1 reply; 9+ messages in thread
From: Lennert Buytenhek @ 2016-06-20 0:45 UTC (permalink / raw)
To: Roopa Prabhu, Robert Shearman; +Cc: netdev
Hi!
While trying to chase down a memory corruption issue that only occurs
when originating large amounts of MPLS tagged IP traffic, I came across
something in the MPLS output code for which I'm not entirely sure that
it's correct.
Specifically, there is the code path dst_output() -> lwtunnel_output()
-> mpls_output() -> neigh_xmit() -> ___neigh_lookup_noref(), where the
latter accesses a RCU-bh protected struct neigh_table pointer, but there
is no RCU-bh protection being arranged anywhere in this call chain.
Since this is locally generated IP traffic, we're running in process
context, and while lwtunnel_output() holds rcu_read_lock() across its
call to lwtunnel_encap_ops::output() (which is mpls_output() here),
nothing in the chain disables BHs, and in RCU-bh, the completion of a
softirq signals the end of any pending read-side critical sections,
and BHs can preempt this call chain at any time because it runs with
hardirqs and softirqs both enabled, so that would mean that neighbour
table entries can be zapped at any time even while we hold
rcu_read_lock(). I think.
The mpls_forward() path doesn't seem susceptible to the same issue,
as it runs from softirq, where rcu_read_lock() suffices, so I figured
that mpls_output() would be a good place to deal with this and that
something like the patch below would do the trick. I can't say yet if
this makes my memory corruption issues go away, as they don't reproduce
that easily, but I'll keep testing. Any thoughts so far?
Thanks,
Lennert
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index fb31aa8..802956b 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct sock *sk, struct sk_buff *skb)
bos = false;
}
+ rcu_read_lock_bh();
if (rt)
err = neigh_xmit(NEIGH_ARP_TABLE, out_dev, &rt->rt_gateway,
skb);
else if (rt6)
err = neigh_xmit(NEIGH_ND_TABLE, out_dev, &rt6->rt6i_gateway,
skb);
+ rcu_read_unlock_bh();
+
if (err)
net_dbg_ratelimited("%s: packet transmission failed: %d\n",
__func__, err);
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: rcu locking issue in mpls output code?
2016-06-20 0:45 rcu locking issue in mpls output code? Lennert Buytenhek
@ 2016-06-20 2:19 ` David Ahern
2016-06-20 6:30 ` Lennert Buytenhek
0 siblings, 1 reply; 9+ messages in thread
From: David Ahern @ 2016-06-20 2:19 UTC (permalink / raw)
To: Lennert Buytenhek, Roopa Prabhu, Robert Shearman; +Cc: netdev
On 6/19/16 6:45 PM, Lennert Buytenhek wrote:
> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
> index fb31aa8..802956b 100644
> --- a/net/mpls/mpls_iptunnel.c
> +++ b/net/mpls/mpls_iptunnel.c
> @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> bos = false;
> }
>
> + rcu_read_lock_bh();
> if (rt)
> err = neigh_xmit(NEIGH_ARP_TABLE, out_dev, &rt->rt_gateway,
> skb);
> else if (rt6)
> err = neigh_xmit(NEIGH_ND_TABLE, out_dev, &rt6->rt6i_gateway,
> skb);
> + rcu_read_unlock_bh();
> +
> if (err)
> net_dbg_ratelimited("%s: packet transmission failed: %d\n",
> __func__, err);
>
I think those need to be added to neigh_xmit in the
if (likely(index < NEIGH_NR_TABLES)) {
}
block.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: rcu locking issue in mpls output code?
2016-06-20 2:19 ` David Ahern
@ 2016-06-20 6:30 ` Lennert Buytenhek
2016-06-20 15:19 ` David Ahern
0 siblings, 1 reply; 9+ messages in thread
From: Lennert Buytenhek @ 2016-06-20 6:30 UTC (permalink / raw)
To: David Ahern; +Cc: Roopa Prabhu, Robert Shearman, netdev
On Sun, Jun 19, 2016 at 08:19:20PM -0600, David Ahern wrote:
> > diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
> > index fb31aa8..802956b 100644
> > --- a/net/mpls/mpls_iptunnel.c
> > +++ b/net/mpls/mpls_iptunnel.c
> > @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > bos = false;
> > }
> >
> > + rcu_read_lock_bh();
> > if (rt)
> > err = neigh_xmit(NEIGH_ARP_TABLE, out_dev, &rt->rt_gateway,
> > skb);
> > else if (rt6)
> > err = neigh_xmit(NEIGH_ND_TABLE, out_dev, &rt6->rt6i_gateway,
> > skb);
> > + rcu_read_unlock_bh();
> > +
> > if (err)
> > net_dbg_ratelimited("%s: packet transmission failed: %d\n",
> > __func__, err);
> >
>
> I think those need to be added to neigh_xmit in the
>
> if (likely(index < NEIGH_NR_TABLES)) {
>
> }
That'll force callers that don't need the extra protection (i.e.
mpls_forward(), since that always runs from softirq and it's enough
to protect the neigh state with rcu_read_lock() from softirq and we're
already running under rcu_read_lock() when we get to neigh_xmit()) to
eat the useless overhead of an extra rcu_read_{,un}lock_bh() pair, but
sure, functionally that's correct, I think, and in my workload I don't
care about MPLS forwarding performance anyway. ;-)
Want me to send a patch moving it to neigh_xmit() ?
Thank you for having a look!
Cheers,
Lennert
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: rcu locking issue in mpls output code?
2016-06-20 6:30 ` Lennert Buytenhek
@ 2016-06-20 15:19 ` David Ahern
2016-06-20 16:13 ` Roopa Prabhu
0 siblings, 1 reply; 9+ messages in thread
From: David Ahern @ 2016-06-20 15:19 UTC (permalink / raw)
To: Lennert Buytenhek; +Cc: Roopa Prabhu, Robert Shearman, netdev
On 6/20/16 12:30 AM, Lennert Buytenhek wrote:
> On Sun, Jun 19, 2016 at 08:19:20PM -0600, David Ahern wrote:
>
>>> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
>>> index fb31aa8..802956b 100644
>>> --- a/net/mpls/mpls_iptunnel.c
>>> +++ b/net/mpls/mpls_iptunnel.c
>>> @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>> bos = false;
>>> }
>>>
>>> + rcu_read_lock_bh();
>>> if (rt)
>>> err = neigh_xmit(NEIGH_ARP_TABLE, out_dev, &rt->rt_gateway,
>>> skb);
>>> else if (rt6)
>>> err = neigh_xmit(NEIGH_ND_TABLE, out_dev, &rt6->rt6i_gateway,
>>> skb);
>>> + rcu_read_unlock_bh();
>>> +
>>> if (err)
>>> net_dbg_ratelimited("%s: packet transmission failed: %d\n",
>>> __func__, err);
>>>
>>
>> I think those need to be added to neigh_xmit in the
>>
>> if (likely(index < NEIGH_NR_TABLES)) {
>>
>> }
>
> That'll force callers that don't need the extra protection (i.e.
> mpls_forward(), since that always runs from softirq and it's enough
> to protect the neigh state with rcu_read_lock() from softirq and we're
> already running under rcu_read_lock() when we get to neigh_xmit()) to
> eat the useless overhead of an extra rcu_read_{,un}lock_bh() pair, but
> sure, functionally that's correct, I think, and in my workload I don't
> care about MPLS forwarding performance anyway. ;-)
__neigh_lookup_noref expects bh level protection. Since the if block in
neigh_xmit requires the locking seems like this the appropriate place
for it.
>
> Want me to send a patch moving it to neigh_xmit() ?
Roopa/Robert: agree?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: rcu locking issue in mpls output code?
2016-06-20 15:19 ` David Ahern
@ 2016-06-20 16:13 ` Roopa Prabhu
2016-06-20 16:33 ` Lennert Buytenhek
0 siblings, 1 reply; 9+ messages in thread
From: Roopa Prabhu @ 2016-06-20 16:13 UTC (permalink / raw)
To: David Ahern; +Cc: Lennert Buytenhek, Robert Shearman, netdev@vger.kernel.org
On Mon, Jun 20, 2016 at 8:19 AM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 6/20/16 12:30 AM, Lennert Buytenhek wrote:
>>
>> On Sun, Jun 19, 2016 at 08:19:20PM -0600, David Ahern wrote:
>>
>>>> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
>>>> index fb31aa8..802956b 100644
>>>> --- a/net/mpls/mpls_iptunnel.c
>>>> +++ b/net/mpls/mpls_iptunnel.c
>>>> @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct
>>>> sock *sk, struct sk_buff *skb)
>>>> bos = false;
>>>> }
>>>>
>>>> + rcu_read_lock_bh();
>>>> if (rt)
>>>> err = neigh_xmit(NEIGH_ARP_TABLE, out_dev,
>>>> &rt->rt_gateway,
>>>> skb);
>>>> else if (rt6)
>>>> err = neigh_xmit(NEIGH_ND_TABLE, out_dev,
>>>> &rt6->rt6i_gateway,
>>>> skb);
>>>> + rcu_read_unlock_bh();
>>>> +
>>>> if (err)
>>>> net_dbg_ratelimited("%s: packet transmission failed:
>>>> %d\n",
>>>> __func__, err);
>>>>
>>>
>>> I think those need to be added to neigh_xmit in the
>>>
>>> if (likely(index < NEIGH_NR_TABLES)) {
>>>
>>> }
>>
>>
>> That'll force callers that don't need the extra protection (i.e.
>> mpls_forward(), since that always runs from softirq and it's enough
>> to protect the neigh state with rcu_read_lock() from softirq and we're
>> already running under rcu_read_lock() when we get to neigh_xmit()) to
>> eat the useless overhead of an extra rcu_read_{,un}lock_bh() pair, but
>> sure, functionally that's correct, I think, and in my workload I don't
>> care about MPLS forwarding performance anyway. ;-)
>
>
> __neigh_lookup_noref expects bh level protection. Since the if block in
> neigh_xmit requires the locking seems like this the appropriate place for
> it.
>
>>
>> Want me to send a patch moving it to neigh_xmit() ?
>
>
> Roopa/Robert: agree?
>
yes, seems like an appropriate place for it. provided it does not add
unnecessary overhead for others.
But then neigh_xmit seems to be only called from mpls_output and mpls_forward.
thanks!
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: rcu locking issue in mpls output code?
2016-06-20 16:13 ` Roopa Prabhu
@ 2016-06-20 16:33 ` Lennert Buytenhek
2016-06-20 16:38 ` David Ahern
0 siblings, 1 reply; 9+ messages in thread
From: Lennert Buytenhek @ 2016-06-20 16:33 UTC (permalink / raw)
To: Roopa Prabhu; +Cc: David Ahern, Robert Shearman, netdev@vger.kernel.org
On Mon, Jun 20, 2016 at 09:13:36AM -0700, Roopa Prabhu wrote:
> >>>> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
> >>>> index fb31aa8..802956b 100644
> >>>> --- a/net/mpls/mpls_iptunnel.c
> >>>> +++ b/net/mpls/mpls_iptunnel.c
> >>>> @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct
> >>>> sock *sk, struct sk_buff *skb)
> >>>> bos = false;
> >>>> }
> >>>>
> >>>> + rcu_read_lock_bh();
> >>>> if (rt)
> >>>> err = neigh_xmit(NEIGH_ARP_TABLE, out_dev,
> >>>> &rt->rt_gateway,
> >>>> skb);
> >>>> else if (rt6)
> >>>> err = neigh_xmit(NEIGH_ND_TABLE, out_dev,
> >>>> &rt6->rt6i_gateway,
> >>>> skb);
> >>>> + rcu_read_unlock_bh();
> >>>> +
> >>>> if (err)
> >>>> net_dbg_ratelimited("%s: packet transmission failed:
> >>>> %d\n",
> >>>> __func__, err);
> >>>>
> >>>
> >>> I think those need to be added to neigh_xmit in the
> >>>
> >>> if (likely(index < NEIGH_NR_TABLES)) {
> >>>
> >>> }
> >>
> >>
> >> That'll force callers that don't need the extra protection (i.e.
> >> mpls_forward(), since that always runs from softirq and it's enough
> >> to protect the neigh state with rcu_read_lock() from softirq and we're
> >> already running under rcu_read_lock() when we get to neigh_xmit()) to
> >> eat the useless overhead of an extra rcu_read_{,un}lock_bh() pair, but
> >> sure, functionally that's correct, I think, and in my workload I don't
> >> care about MPLS forwarding performance anyway. ;-)
> >
> >
> > __neigh_lookup_noref expects bh level protection. Since the if block in
> > neigh_xmit requires the locking seems like this the appropriate place for
> > it.
> >
> >>
> >> Want me to send a patch moving it to neigh_xmit() ?
> >
> >
> > Roopa/Robert: agree?
>
> yes, seems like an appropriate place for it. provided it does not add
> unnecessary overhead for others.
> But then neigh_xmit seems to be only called from mpls_output and mpls_forward.
OK, patch coming up. Thanks!
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: rcu locking issue in mpls output code?
2016-06-20 16:33 ` Lennert Buytenhek
@ 2016-06-20 16:38 ` David Ahern
2016-06-20 17:44 ` Lennert Buytenhek
0 siblings, 1 reply; 9+ messages in thread
From: David Ahern @ 2016-06-20 16:38 UTC (permalink / raw)
To: Lennert Buytenhek, Roopa Prabhu; +Cc: Robert Shearman, netdev@vger.kernel.org
On 6/20/16 10:33 AM, Lennert Buytenhek wrote:
> OK, patch coming up. Thanks!
can you build a kernel with rcu debugging enabled as well and run it
through your tests?
Thanks,
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: rcu locking issue in mpls output code?
2016-06-20 16:38 ` David Ahern
@ 2016-06-20 17:44 ` Lennert Buytenhek
2016-06-20 19:37 ` Stephen Hemminger
0 siblings, 1 reply; 9+ messages in thread
From: Lennert Buytenhek @ 2016-06-20 17:44 UTC (permalink / raw)
To: David Ahern; +Cc: Roopa Prabhu, Robert Shearman, netdev@vger.kernel.org
On Mon, Jun 20, 2016 at 10:38:39AM -0600, David Ahern wrote:
> > OK, patch coming up. Thanks!
>
> can you build a kernel with rcu debugging enabled as well and run
> it through your tests?
git HEAD with CONFIG_DEBUG_RT_MUTEXES=y CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y CONFIG_PROVE_RCU=y gives me a lockdep splat on the
machine under my desk when I cause mpls_output() to be called.
The script I use for that is this one -- it creates a namespace that
accepts MPLS tagged packets for one of its local IPs and then sends
an MPLS tagged packet into that namespace. If you run the script on
an unpatched kernel with lock debugging enabled, you should be able
to see the issue as well, the lockdep splat happens on the very first
packet.
=====
#!/bin/sh
ip link add tons type veth peer name tempitf
ifconfig tons 172.16.20.20 netmask 255.255.255.0
ip netns add ns1
ip netns exec ns1 ifconfig lo 127.0.0.1 up
ip link set tempitf netns ns1
ip netns exec ns1 ip link set tempitf name eth0
ip netns exec ns1 ifconfig eth0 172.16.20.21 netmask 255.255.255.0
modprobe mpls_iptunnel
ip route add 10.10.10.10/32 encap mpls 100 via inet 172.16.20.21
ip netns exec ns1 sysctl -w net.ipv4.conf.all.rp_filter=0
ip netns exec ns1 sysctl -w net.ipv4.conf.lo.rp_filter=0
ip netns exec ns1 sysctl -w net.mpls.conf.eth0.input=1
ip netns exec ns1 sysctl -w net.mpls.platform_labels=1000
ip netns exec ns1 ip addr add 10.10.10.10/32 dev lo
ip netns exec ns1 ip -f mpls route add 100 dev lo
ping -c 1 10.10.10.10
=====
The patch below (which I'll submit shortly with a proper commit
message) makes this lockdep splat go away.
Enabling lock/rcu debugging gives you a lockdep splat on the first
packet going out through mpls_output(), but then makes the packet
loss / memory corruption issue stop appearing, both on my local
space heater and on much more serious hardware, probably due to
timing differences.
But, with lock/rcu debugging disabled and the patch below included,
I don't see packet loss anymore in a production environment during a
test that would fairly reliably show it before.
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f18ae91..769cece 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2467,13 +2467,17 @@ int neigh_xmit(int index, struct net_device *dev,
tbl = neigh_tables[index];
if (!tbl)
goto out;
+ rcu_read_lock_bh();
neigh = __neigh_lookup_noref(tbl, addr, dev);
if (!neigh)
neigh = __neigh_create(tbl, addr, dev, false);
err = PTR_ERR(neigh);
- if (IS_ERR(neigh))
+ if (IS_ERR(neigh)) {
+ rcu_read_unlock_bh();
goto out_kfree_skb;
+ }
err = neigh->output(neigh, skb);
+ rcu_read_unlock_bh();
}
else if (index == NEIGH_LINK_TABLE) {
err = dev_hard_header(skb, dev, ntohs(skb->protocol),
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: rcu locking issue in mpls output code?
2016-06-20 17:44 ` Lennert Buytenhek
@ 2016-06-20 19:37 ` Stephen Hemminger
0 siblings, 0 replies; 9+ messages in thread
From: Stephen Hemminger @ 2016-06-20 19:37 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: David Ahern, Roopa Prabhu, Robert Shearman,
netdev@vger.kernel.org
> ifconfig tons 172.16.20.20 netmask 255.255.255.0
Really, still using ifconfig?
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2016-06-20 19:37 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-06-20 0:45 rcu locking issue in mpls output code? Lennert Buytenhek
2016-06-20 2:19 ` David Ahern
2016-06-20 6:30 ` Lennert Buytenhek
2016-06-20 15:19 ` David Ahern
2016-06-20 16:13 ` Roopa Prabhu
2016-06-20 16:33 ` Lennert Buytenhek
2016-06-20 16:38 ` David Ahern
2016-06-20 17:44 ` Lennert Buytenhek
2016-06-20 19:37 ` Stephen Hemminger
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).