All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jiri Wiesner <jwiesner@suse.de>
To: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org, David Ahern <dsahern@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"David S. Miller" <davem@davemloft.net>
Subject: Re: [RFC PATCH] ipv6: route: release reference of dsts cached in sockets
Date: Tue, 1 Oct 2024 17:26:09 +0200	[thread overview]
Message-ID: <20241001152609.GA24007@incl> (raw)
In-Reply-To: <CANn89iJQDWEyTC5Xc77PEXyxbbvKjm=exb5jKB0-O3ZzZ=W1Hg@mail.gmail.com>

On Mon, Sep 30, 2024 at 08:27:50PM +0200, Eric Dumazet wrote:
> On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@suse.de> wrote:
> >
> > An unbalanced refcount was reported for the loopback device of a net
> > namespace being destroyed:
> > unregister_netdevice: waiting for lo to become free. Usage count = 2
> >
> > Analysis revealed that the IPv6 net device corresponding to the loopback
> > device did not get destroyed (in6_dev_finish_destroy() was not called).
> > The IPv6 loopback device had an unbalaced refcount:
> > net dev 73da100 lo refcount 1
> > Operation                     Count Balancing Operation     Count
> > hold  __ipv6_dev_mc_inc           2 ma_put                      2
> >       addrconf_ifdown             1                             0 unbalanced
> > hold  fib6_nh_init                2 fib6_nh_init                2
> > put   inet6_ifa_finish_destroy    1 ipv6_add_addr               1
> >       ip6_dst_destroy            90                             0 unbalanced
> >       ip6_dst_ifdown             90                             0 unbalanced
> > hold  ip6_route_dev_notify        6 ip6_route_dev_notify        6
> > hold  ipv6_add_addr               1 inet6_ifa_finish_destroy    1
> > put   ma_put                      2 __ipv6_dev_mc_inc           2
> > hold  ndisc_netdev_event          2 ndisc_netdev_event          2
> >       rt6_disable_ip              1                             0 unbalanced
> >
> > The refcount of addrconf_ifdown() balances the refcount increment in
> > ipv6_add_dev(), which had no corresponding trace entry. The
> > rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the
> > looback device, and the ip6_dst_destroy() entries were put operations. One
> > refcount decrement in ip6_dst_destroy() was not executed. At this point, a
> > hash was implemented in the debug kernel to hold the changes of the
> > refcount of dst objects per namespace. The trace for the dst object that
> > did not decrement the IPv6 refcount of loopback follows:
> >
> > Function        Parent       Op  Net            Device Dst              Refcount Diff
> > ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84
> > dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20
> > dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29
> > dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34
> > dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1
> >
> > The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely
> > indicates that the net device of the dst object was changed to
> > blackhole_dev and the IPv6 refcount was transferred onto the loopback
> > device. There was no ip6_dst_destroy() trace entry, which means the dst
> > object was not destroyed. There were 86 hold operations but only 85 put
> > operations so the dst object was not destroyed because the refcount of the
> > dst object was unbalanced.
> >
> > The problem is that the refcount sums are ambiguous. The most probable
> > explanation is this: The dst object was a route for an IPv6 TCP connection
> > that kept timing out. Sometimes, the process closed the socket, which
> > corresponds to the refcount decrements of the
> > dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit
> > timer reset the dst of the sockets, which corresponds to the
> > dst_release()/tcp_retransmit_timer() entries. I am unsure about the
> > dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets
> > skb->_skb_refdst with SKB_DST_NOREF.
> >
> > The feature that sets the above trace apart from all the other dst traces
> > is the execution of ip6_negative_advice() for a cached and also expired
> > dst object in the exception table. The cached and expired dst object has
> > its refcount set to at least 2 before executing rt6_remove_exception_rt()
> > found in ip6_negative_advice(). One decrement happens in
> > rt6_remove_exception() after the dst object has been removed from the
> > exception table. The other decrement happens in sk_dst_reset() but that
> > one is counteracted by a dst_hold() intentionally placed just before the
> > sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that
> > keeps a reference to a dst in its sk_dst_cache member increments the
> > refcount of the dst by 1. This is apparent in the following code paths:
> > * When ip6_route_output_flags() finds a dst that is then stored in
> >   sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket().
> > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
> > Provided the dst is not kept in the sk_dst_cache member of another socket,
> > there is no other object tied to the dst (the socket lost its reference
> > and the dst is no longer in the exception table) and the dst becomes a
> > leaked object after ip6_negative_advice() finishes. This leak then
> > precludes the net namespace from being destroyed.
> >
> > The patch that introduced the dst_hold() in ip6_negative_advice() was
> > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
> > only refactored the code with regards to the dst refcount so the issue was
> > present even before 92f1655aa2b22.
> >
> > Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
> > ---
> > At the moment, I am sending this as an RFC because I am not able to
> > reproduce the issue in-house. The customer that encountered the issue is
> > currently running tests. For the customer's testing, I fixed the issue
> > with a kprobe module that calls dst_release() right after
> > rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not
> > quite the same as the change proposed below.
> >
> >  net/ipv6/route.c | 3 ---
> >  1 file changed, 3 deletions(-)
> >
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index b4251915585f..b70267c8d251 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk,
> >         if (rt->rt6i_flags & RTF_CACHE) {
> >                 rcu_read_lock();
> >                 if (rt6_check_expired(rt)) {
> > -                       /* counteract the dst_release() in sk_dst_reset() */
> > -                       dst_hold(dst);
> >                         sk_dst_reset(sk);
> > -
> >                         rt6_remove_exception_rt(rt);
> >                 }
> >                 rcu_read_unlock();
> > --
> 
> Interesting, what kernel version is your customer using ?

It is a distribution kernel (SLES 15 SP5) that is based on 5.14. The newest SLES service pack, SP6, has a 6.4-based kernel. But the customer has yet to move to that one.

> I think that with recent kernels (after 5.18), we do not see issues
> because of the use of blackhole_netdev
> instead of loopback.

Thank you very much for mentioning that. I will be backporting these patches to SUSE's 5.14-based kernels:
> 9cc341286e99 ("dn_route: set rt neigh to blackhole_netdev instead of loopback_dev in ifdown")
> 4d33ab08c0af ("xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown")
> dd263a8cb194 ("ipv6: blackhole_netdev needs snmp6 counters")
> e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev")
These patches will not resolve the dst leak but being able to destroy net namespaces at the expense of sustaining infrequently occurring dst leaks is a favourable trade.

The testing results (debugging SLES15 SP5 kernel plus the kprobe module calling dst_release() in probe_ip6_negative_advice()) came back and the kernel log indicates negative overflow of the dst refcount. The first machine:
> [ 9366.700330] dst_release underflow
> [ 9366.700346] WARNING: CPU: 42 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
> [ 9366.701492] Call Trace:
> [ 9366.701497]  <IRQ>
> [ 9366.701506]  probe_ip6_negative_advice+0x12/0x30 [probe 0c57153a505ed7cc20e2c5db9522a1a524ec4f56]
> [ 9366.701514]  opt_pre_handler+0x3d/0x70
> [ 9366.701528]  optimized_callback+0x64/0x90
> [ 9366.701538]  0xffffffffc0476032
> [ 9366.701548]  ? ip6_negative_advice+0xdf/0x180
> [ 9366.701558]  ? tcp_retransmit_timer+0x658/0x930
> [ 9366.701595]  ? tcp_write_timer_handler+0xba/0x1f0
> [ 9366.701601]  ? tcp_write_timer+0x211/0x260
> [ 9366.701624]  ? call_timer_fn+0x27/0x130
> [ 9366.701631]  ? run_timer_softirq+0x443/0x480
> [ 9366.701645]  ? __do_softirq+0xd2/0x2c0
> [ 9366.701654]  ? irq_exit_rcu+0xa4/0xc0
> [ 9366.701661]  ? sysvec_apic_timer_interrupt+0x50/0x90
> [ 9366.701667]  </IRQ>
> [ 9366.701744] dst_release: dst:00000000ac3b9120 refcnt:-1
> [ 9366.701757] dst_release: dst:00000000ac3b9120 refcnt:-2
> [ 9382.832164] dst_release: dst:00000000ac3b9120 refcnt:-3
> [ 9659.561679] dst_release: dst:00000000494633c6 refcnt:-1
Obviously, the refcount of the cached and expired dst, dst:00000000ac3b9120, was 1 before probe_ip6_negative_advice() was called from tcp_write_timeout()/tcp_retransmit_timer(). So, my supposition that cached dsts being passed to probe_ip6_negative_advice() have a refcount of at least 2 was wrong.

The second machine:
[11479.272704] dst_release underflow
[11479.272718] WARNING: CPU: 52 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
[11479.272924] Call Trace:
[11479.272928]  <IRQ>
[11479.272932]  tcp_retransmit_timer+0x47c/0x930
[11479.272953]  tcp_write_timer_handler+0xba/0x1f0
[11479.272956]  tcp_write_timer+0x211/0x260
[11479.272962]  call_timer_fn+0x27/0x130
[11479.272966]  run_timer_softirq+0x443/0x480
[11479.272973]  __do_softirq+0xd2/0x2c0
[11479.272980]  irq_exit_rcu+0xa4/0xc0
[11479.272985]  sysvec_apic_timer_interrupt+0x50/0x90
[11479.272989]  </IRQ>
[11479.273044] dst_release: dst:00000000ff7e6845 refcnt:-1
[11681.380331] dst_release: dst:00000000893d8a2d refcnt:-1
In this case, the refcount of dst:00000000ff7e6845 was have been 2 before probe_ip6_negative_advice() so the dst_release() called from the kprobe handler decremented the refcount to 0. No warning was printed. There probably was another socket having the dst stored in sk_dst_cache, to which the refcount of 1 belonged. When the second __sk_dst_reset() was called in tcp_retransmit_timer(), which was the tcp_retransmit_timer+0x47c site, the refcount dropped below 0.

I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue.
-- 
Jiri Wiesner
SUSE Labs

  reply	other threads:[~2024-10-01 15:26 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-30 18:09 [RFC PATCH] ipv6: route: release reference of dsts cached in sockets Jiri Wiesner
2024-09-30 18:27 ` Eric Dumazet
2024-10-01 15:26   ` Jiri Wiesner [this message]
2024-10-02 20:27     ` Xin Long
2024-10-03 17:01       ` Jiri Wiesner
2024-10-06 18:25         ` Xin Long
2024-10-09 13:39           ` Jiri Wiesner
2024-11-12  8:56           ` Jiri Wiesner
2024-11-12 21:51             ` Xin Long
2024-11-13 10:59               ` Jiri Wiesner
2024-11-13 17:27               ` David Ahern

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241001152609.GA24007@incl \
    --to=jwiesner@suse.de \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.