From: Krister Johansen <kjlx@templeofstupid.com>
To: davem@davemloft.net, eric.dumazet@gmail.com
Cc: netdev@vger.kernel.org
Subject: [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free.
Date: Thu, 8 Jun 2017 13:12:38 -0700 [thread overview]
Message-ID: <20170608201238.GC2553@templeofstupid.com> (raw)
It looks like this:
Message from syslogd@flamingo at Apr 26 00:45:00 ...
kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
They seem to coincide with net namespace teardown.
The message is emitted by netdev_wait_allrefs().
Forced a kdump in netdev_run_todo, but found that the refcount on the lo
device was already 0 at the time we got to the panic.
Used bcc to check the blocking in netdev_run_todo. The only places
where we're off cpu there are in the rcu_barrier() and msleep() calls.
That behavior is expected. The msleep time coincides with the amount of
time we spend waiting for the refcount to reach zero; the rcu_barrier()
wait times are not excessive.
After looking through the list of callbacks that the netdevice notifiers
invoke in this path, it appears that the dst_dev_event is the most
interesting. The dst_ifdown path places a hold on the loopback_dev as
part of releasing the dev associated with the original dst cache entry.
Most of our notifier callbacks are straight-forward, but this one a)
looks complex, and b) places a hold on the network interface in
question.
I constructed a new bcc script that watches various events in the
liftime of a dst cache entry. Note that dst_ifdown will take a hold on
the loopback device until the invalidated dst entry gets freed.
[ __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183
__dst_free
rcu_nocb_kthread
kthread
ret_from_fork
---
[ dst_gc_task] completed in 4: start: 1282115783057516 end 1282115783062462
[ dst_gc_task] completed in 5: start: 1282116447054101 end 1282116447059392
[ dst_gc_task] completed in 5: start: 1282118111030391 end 1282118111036041
[ dst_gc_task] completed in 6: start: 1282121247074938 end 1282121247081141
[ dst_gc_task] completed in 5: start: 1282126815053252 end 1282126815058751
[ dst_gc_task] completed in 6: start: 1282135007055673 end 1282135007061877
[ dst_gc_task] completed in 6: start: 1282145759021745 end 1282145759027830
[ dst_release] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282147838674257
dst_release
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_send_fin
tcp_close
inet_release
sock_release
sock_close
__fput
____fput
task_work_run
do_exit
do_group_exit
SyS_exit_group
do_syscall_64
return_from_SYSCALL_64
[ dst_ifdown] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282148067377680
dst_ifdown
notifier_call_chain
raw_notifier_call_chain
call_netdevice_notifiers_info
netdev_run_todo
kretprobe_trampoline
default_device_exit_batch
ops_exit_list.isra.4
cleanup_net
process_one_work
worker_thread
kthread
ret_from_fork
---
[ dst_ifdown] on DST: ffff883ccabb7900 IF lo invoked at 1282148067411269
dst_ifdown
notifier_call_chain
raw_notifier_call_chain
call_netdevice_notifiers_info
netdev_run_todo
kretprobe_trampoline
default_device_exit_batch
ops_exit_list.isra.4
cleanup_net
process_one_work
worker_thread
kthread
ret_from_fork
---
<...>
[ dst_destroy] on DST: ffff883ccabb7900 IF lo invoked at 1282160607074022
dst_destroy
kretprobe_trampoline
worker_thread
kthread
ret_from_fork
---
[ dst_gc_task] completed in 30: start: 1282160607047161 end 1282160607077572
The way this works is that if there's still a reference on the dst entry
at the time we try to free it, it gets placed in the gc list by
__dst_free and the dst_destroy() call is invoked by the gc task once the
refcount is 0. If the gc task processes a 10th or less of its entries
on a single pass, it inreases the amount of time it waits between gc
intervals.
Looking at the gc_task intervals, they started at 663ms when we invoked
__dst_free(). After that, they increased to 1663, 3136, 5567, 8191,
10751, and 14848. The release that set the refcnt to 0 on our dst entry
occurred after the gc_task was enqueued for 14 second interval so we had
to wait longer than the warning time in wait_allrefs in order for the
dst entry to get free'd and the hold on 'lo' to be released.
A simple solution to this problem is to have dst_dev_event() reset the
gc timer, which causes us to process this list shortly after the
gc_mutex is relased when dst_dev_event() completes.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
net/core/dst.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/net/core/dst.c b/net/core/dst.c
index 6192f11..13ba4a0 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
spin_lock_bh(&dst_garbage.lock);
dst = dst_garbage.list;
dst_garbage.list = NULL;
+ /* The code in dst_ifdown places a hold on the loopback device.
+ * If the gc entry processing is set to expire after a lengthy
+ * interval, this hold can cause netdev_wait_allrefs() to hang
+ * out and wait for a long time -- until the the loopback
+ * interface is released. If we're really unlucky, it'll emit
+ * pr_emerg messages to console too. Reset the interval here,
+ * so dst cleanups occur in a more timely fashion.
+ */
+ if (dst_garbage.timer_inc > DST_GC_INC) {
+ dst_garbage.timer_inc = DST_GC_INC;
+ dst_garbage.timer_expires = DST_GC_MIN;
+ mod_delayed_work(system_wq, &dst_gc_work,
+ dst_garbage.timer_expires);
+ }
spin_unlock_bh(&dst_garbage.lock);
if (last)
--
2.7.4
next reply other threads:[~2017-06-08 20:12 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-06-08 20:12 Krister Johansen [this message]
2017-06-08 21:06 ` [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free Eric Dumazet
2017-06-08 21:41 ` Wei Wang
2017-06-09 16:28 ` David Miller
2017-06-09 18:18 ` Cong Wang
2017-06-09 18:43 ` Krister Johansen
2017-06-09 21:48 ` Cong Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170608201238.GC2553@templeofstupid.com \
--to=kjlx@templeofstupid.com \
--cc=davem@davemloft.net \
--cc=eric.dumazet@gmail.com \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.