* [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free.
@ 2017-06-08 20:12 Krister Johansen
2017-06-08 21:06 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Krister Johansen @ 2017-06-08 20:12 UTC (permalink / raw)
To: davem, eric.dumazet; +Cc: netdev
It looks like this:
Message from syslogd@flamingo at Apr 26 00:45:00 ...
kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
They seem to coincide with net namespace teardown.
The message is emitted by netdev_wait_allrefs().
Forced a kdump in netdev_run_todo, but found that the refcount on the lo
device was already 0 at the time we got to the panic.
Used bcc to check the blocking in netdev_run_todo. The only places
where we're off cpu there are in the rcu_barrier() and msleep() calls.
That behavior is expected. The msleep time coincides with the amount of
time we spend waiting for the refcount to reach zero; the rcu_barrier()
wait times are not excessive.
After looking through the list of callbacks that the netdevice notifiers
invoke in this path, it appears that the dst_dev_event is the most
interesting. The dst_ifdown path places a hold on the loopback_dev as
part of releasing the dev associated with the original dst cache entry.
Most of our notifier callbacks are straight-forward, but this one a)
looks complex, and b) places a hold on the network interface in
question.
I constructed a new bcc script that watches various events in the
liftime of a dst cache entry. Note that dst_ifdown will take a hold on
the loopback device until the invalidated dst entry gets freed.
[ __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183
__dst_free
rcu_nocb_kthread
kthread
ret_from_fork
---
[ dst_gc_task] completed in 4: start: 1282115783057516 end 1282115783062462
[ dst_gc_task] completed in 5: start: 1282116447054101 end 1282116447059392
[ dst_gc_task] completed in 5: start: 1282118111030391 end 1282118111036041
[ dst_gc_task] completed in 6: start: 1282121247074938 end 1282121247081141
[ dst_gc_task] completed in 5: start: 1282126815053252 end 1282126815058751
[ dst_gc_task] completed in 6: start: 1282135007055673 end 1282135007061877
[ dst_gc_task] completed in 6: start: 1282145759021745 end 1282145759027830
[ dst_release] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282147838674257
dst_release
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_send_fin
tcp_close
inet_release
sock_release
sock_close
__fput
____fput
task_work_run
do_exit
do_group_exit
SyS_exit_group
do_syscall_64
return_from_SYSCALL_64
[ dst_ifdown] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282148067377680
dst_ifdown
notifier_call_chain
raw_notifier_call_chain
call_netdevice_notifiers_info
netdev_run_todo
kretprobe_trampoline
default_device_exit_batch
ops_exit_list.isra.4
cleanup_net
process_one_work
worker_thread
kthread
ret_from_fork
---
[ dst_ifdown] on DST: ffff883ccabb7900 IF lo invoked at 1282148067411269
dst_ifdown
notifier_call_chain
raw_notifier_call_chain
call_netdevice_notifiers_info
netdev_run_todo
kretprobe_trampoline
default_device_exit_batch
ops_exit_list.isra.4
cleanup_net
process_one_work
worker_thread
kthread
ret_from_fork
---
<...>
[ dst_destroy] on DST: ffff883ccabb7900 IF lo invoked at 1282160607074022
dst_destroy
kretprobe_trampoline
worker_thread
kthread
ret_from_fork
---
[ dst_gc_task] completed in 30: start: 1282160607047161 end 1282160607077572
The way this works is that if there's still a reference on the dst entry
at the time we try to free it, it gets placed in the gc list by
__dst_free and the dst_destroy() call is invoked by the gc task once the
refcount is 0. If the gc task processes a 10th or less of its entries
on a single pass, it inreases the amount of time it waits between gc
intervals.
Looking at the gc_task intervals, they started at 663ms when we invoked
__dst_free(). After that, they increased to 1663, 3136, 5567, 8191,
10751, and 14848. The release that set the refcnt to 0 on our dst entry
occurred after the gc_task was enqueued for 14 second interval so we had
to wait longer than the warning time in wait_allrefs in order for the
dst entry to get free'd and the hold on 'lo' to be released.
A simple solution to this problem is to have dst_dev_event() reset the
gc timer, which causes us to process this list shortly after the
gc_mutex is relased when dst_dev_event() completes.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
net/core/dst.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/net/core/dst.c b/net/core/dst.c
index 6192f11..13ba4a0 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
spin_lock_bh(&dst_garbage.lock);
dst = dst_garbage.list;
dst_garbage.list = NULL;
+ /* The code in dst_ifdown places a hold on the loopback device.
+ * If the gc entry processing is set to expire after a lengthy
+ * interval, this hold can cause netdev_wait_allrefs() to hang
+ * out and wait for a long time -- until the the loopback
+ * interface is released. If we're really unlucky, it'll emit
+ * pr_emerg messages to console too. Reset the interval here,
+ * so dst cleanups occur in a more timely fashion.
+ */
+ if (dst_garbage.timer_inc > DST_GC_INC) {
+ dst_garbage.timer_inc = DST_GC_INC;
+ dst_garbage.timer_expires = DST_GC_MIN;
+ mod_delayed_work(system_wq, &dst_gc_work,
+ dst_garbage.timer_expires);
+ }
spin_unlock_bh(&dst_garbage.lock);
if (last)
--
2.7.4
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free.
2017-06-08 20:12 [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free Krister Johansen
@ 2017-06-08 21:06 ` Eric Dumazet
2017-06-08 21:41 ` Wei Wang
2017-06-09 16:28 ` David Miller
2017-06-09 18:18 ` Cong Wang
2 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2017-06-08 21:06 UTC (permalink / raw)
To: Krister Johansen; +Cc: davem, netdev, Wei Wang
On Thu, 2017-06-08 at 13:12 -0700, Krister Johansen wrote:
...
> Looking at the gc_task intervals, they started at 663ms when we invoked
> __dst_free(). After that, they increased to 1663, 3136, 5567, 8191,
> 10751, and 14848. The release that set the refcnt to 0 on our dst entry
> occurred after the gc_task was enqueued for 14 second interval so we had
> to wait longer than the warning time in wait_allrefs in order for the
> dst entry to get free'd and the hold on 'lo' to be released.
>
> A simple solution to this problem is to have dst_dev_event() reset the
> gc timer, which causes us to process this list shortly after the
> gc_mutex is relased when dst_dev_event() completes.
>
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> ---
> net/core/dst.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/net/core/dst.c b/net/core/dst.c
> index 6192f11..13ba4a0 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
> spin_lock_bh(&dst_garbage.lock);
> dst = dst_garbage.list;
> dst_garbage.list = NULL;
> + /* The code in dst_ifdown places a hold on the loopback device.
> + * If the gc entry processing is set to expire after a lengthy
> + * interval, this hold can cause netdev_wait_allrefs() to hang
> + * out and wait for a long time -- until the the loopback
> + * interface is released. If we're really unlucky, it'll emit
> + * pr_emerg messages to console too. Reset the interval here,
> + * so dst cleanups occur in a more timely fashion.
> + */
> + if (dst_garbage.timer_inc > DST_GC_INC) {
> + dst_garbage.timer_inc = DST_GC_INC;
> + dst_garbage.timer_expires = DST_GC_MIN;
> + mod_delayed_work(system_wq, &dst_gc_work,
> + dst_garbage.timer_expires);
> + }
> spin_unlock_bh(&dst_garbage.lock);
>
> if (last)
Looks very nice to me !
Acked-by: Eric Dumazet <edumazet@google.com>
Wei Wan is actually working on a patch series removing all this
dst_garbage list stuff.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free.
2017-06-08 21:06 ` Eric Dumazet
@ 2017-06-08 21:41 ` Wei Wang
0 siblings, 0 replies; 7+ messages in thread
From: Wei Wang @ 2017-06-08 21:41 UTC (permalink / raw)
To: Eric Dumazet
Cc: Krister Johansen, David S . Miller,
Linux Kernel Network Developers
> Wei Wan is actually working on a patch series removing all this
> dst_garbage list stuff.
Yes. I am working on removing the dst garbage collector completely.
dst_dev_event() will be removed from the list of callbacks that the
netdevice notifiers invoke in my patch series.
On Thu, Jun 8, 2017 at 2:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-06-08 at 13:12 -0700, Krister Johansen wrote:
>
> ...
>
>> Looking at the gc_task intervals, they started at 663ms when we invoked
>> __dst_free(). After that, they increased to 1663, 3136, 5567, 8191,
>> 10751, and 14848. The release that set the refcnt to 0 on our dst entry
>> occurred after the gc_task was enqueued for 14 second interval so we had
>> to wait longer than the warning time in wait_allrefs in order for the
>> dst entry to get free'd and the hold on 'lo' to be released.
>>
>> A simple solution to this problem is to have dst_dev_event() reset the
>> gc timer, which causes us to process this list shortly after the
>> gc_mutex is relased when dst_dev_event() completes.
>>
>> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
>> ---
>> net/core/dst.c | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>>
>> diff --git a/net/core/dst.c b/net/core/dst.c
>> index 6192f11..13ba4a0 100644
>> --- a/net/core/dst.c
>> +++ b/net/core/dst.c
>> @@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
>> spin_lock_bh(&dst_garbage.lock);
>> dst = dst_garbage.list;
>> dst_garbage.list = NULL;
>> + /* The code in dst_ifdown places a hold on the loopback device.
>> + * If the gc entry processing is set to expire after a lengthy
>> + * interval, this hold can cause netdev_wait_allrefs() to hang
>> + * out and wait for a long time -- until the the loopback
>> + * interface is released. If we're really unlucky, it'll emit
>> + * pr_emerg messages to console too. Reset the interval here,
>> + * so dst cleanups occur in a more timely fashion.
>> + */
>> + if (dst_garbage.timer_inc > DST_GC_INC) {
>> + dst_garbage.timer_inc = DST_GC_INC;
>> + dst_garbage.timer_expires = DST_GC_MIN;
>> + mod_delayed_work(system_wq, &dst_gc_work,
>> + dst_garbage.timer_expires);
>> + }
>> spin_unlock_bh(&dst_garbage.lock);
>>
>> if (last)
>
> Looks very nice to me !
>
> Acked-by: Eric Dumazet <edumazet@google.com>
>
> Wei Wan is actually working on a patch series removing all this
> dst_garbage list stuff.
>
>
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free.
2017-06-08 20:12 [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free Krister Johansen
2017-06-08 21:06 ` Eric Dumazet
@ 2017-06-09 16:28 ` David Miller
2017-06-09 18:18 ` Cong Wang
2 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2017-06-09 16:28 UTC (permalink / raw)
To: kjlx; +Cc: eric.dumazet, netdev
From: Krister Johansen <kjlx@templeofstupid.com>
Date: Thu, 8 Jun 2017 13:12:38 -0700
> It looks like this:
>
> Message from syslogd@flamingo at Apr 26 00:45:00 ...
> kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
>
> They seem to coincide with net namespace teardown.
>
> The message is emitted by netdev_wait_allrefs().
>
> Forced a kdump in netdev_run_todo, but found that the refcount on the lo
> device was already 0 at the time we got to the panic.
>
> Used bcc to check the blocking in netdev_run_todo. The only places
> where we're off cpu there are in the rcu_barrier() and msleep() calls.
> That behavior is expected. The msleep time coincides with the amount of
> time we spend waiting for the refcount to reach zero; the rcu_barrier()
> wait times are not excessive.
>
> After looking through the list of callbacks that the netdevice notifiers
> invoke in this path, it appears that the dst_dev_event is the most
> interesting. The dst_ifdown path places a hold on the loopback_dev as
> part of releasing the dev associated with the original dst cache entry.
> Most of our notifier callbacks are straight-forward, but this one a)
> looks complex, and b) places a hold on the network interface in
> question.
>
> I constructed a new bcc script that watches various events in the
> liftime of a dst cache entry. Note that dst_ifdown will take a hold on
> the loopback device until the invalidated dst entry gets freed.
...
> The way this works is that if there's still a reference on the dst entry
> at the time we try to free it, it gets placed in the gc list by
> __dst_free and the dst_destroy() call is invoked by the gc task once the
> refcount is 0. If the gc task processes a 10th or less of its entries
> on a single pass, it inreases the amount of time it waits between gc
> intervals.
>
> Looking at the gc_task intervals, they started at 663ms when we invoked
> __dst_free(). After that, they increased to 1663, 3136, 5567, 8191,
> 10751, and 14848. The release that set the refcnt to 0 on our dst entry
> occurred after the gc_task was enqueued for 14 second interval so we had
> to wait longer than the warning time in wait_allrefs in order for the
> dst entry to get free'd and the hold on 'lo' to be released.
>
> A simple solution to this problem is to have dst_dev_event() reset the
> gc timer, which causes us to process this list shortly after the
> gc_mutex is relased when dst_dev_event() completes.
>
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Yeah this is one of the more unsatisfying areas of dst and device handling
in the tree, thanks for working on this.
Applied and queued up for -stable, thanks again.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free.
2017-06-08 20:12 [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free Krister Johansen
2017-06-08 21:06 ` Eric Dumazet
2017-06-09 16:28 ` David Miller
@ 2017-06-09 18:18 ` Cong Wang
2017-06-09 18:43 ` Krister Johansen
2 siblings, 1 reply; 7+ messages in thread
From: Cong Wang @ 2017-06-09 18:18 UTC (permalink / raw)
To: Krister Johansen
Cc: David Miller, Eric Dumazet, Linux Kernel Network Developers,
Kaiwen Xu
On Thu, Jun 8, 2017 at 1:12 PM, Krister Johansen
<kjlx@templeofstupid.com> wrote:
> After looking through the list of callbacks that the netdevice notifiers
> invoke in this path, it appears that the dst_dev_event is the most
> interesting. The dst_ifdown path places a hold on the loopback_dev as
> part of releasing the dev associated with the original dst cache entry.
> Most of our notifier callbacks are straight-forward, but this one a)
> looks complex, and b) places a hold on the network interface in
> question.
>
> I constructed a new bcc script that watches various events in the
> liftime of a dst cache entry. Note that dst_ifdown will take a hold on
> the loopback device until the invalidated dst entry gets freed.
>
Yeah, this is what I observed when Kevin (Cc'ed) reported a similar
(if not same) bug, I thought we have a refcnt leak on dst.
...
> The way this works is that if there's still a reference on the dst entry
> at the time we try to free it, it gets placed in the gc list by
> __dst_free and the dst_destroy() call is invoked by the gc task once the
> refcount is 0. If the gc task processes a 10th or less of its entries
> on a single pass, it inreases the amount of time it waits between gc
> intervals.
>
> Looking at the gc_task intervals, they started at 663ms when we invoked
> __dst_free(). After that, they increased to 1663, 3136, 5567, 8191,
> 10751, and 14848. The release that set the refcnt to 0 on our dst entry
> occurred after the gc_task was enqueued for 14 second interval so we had
> to wait longer than the warning time in wait_allrefs in order for the
> dst entry to get free'd and the hold on 'lo' to be released.
>
I am glad to see you don't have a dst leak here.
But from my experience of a similar bug (refcnt wait on lo), this goes
infinitely rather than just 14sec, so it looked more like a real leak than
just a gc delay. So in your case, this annoying warning eventually
disappears, right?
Thanks.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free.
2017-06-09 18:18 ` Cong Wang
@ 2017-06-09 18:43 ` Krister Johansen
2017-06-09 21:48 ` Cong Wang
0 siblings, 1 reply; 7+ messages in thread
From: Krister Johansen @ 2017-06-09 18:43 UTC (permalink / raw)
To: Cong Wang
Cc: Krister Johansen, David Miller, Eric Dumazet,
Linux Kernel Network Developers, Kaiwen Xu
On Fri, Jun 09, 2017 at 11:18:44AM -0700, Cong Wang wrote:
> On Thu, Jun 8, 2017 at 1:12 PM, Krister Johansen
> <kjlx@templeofstupid.com> wrote:
> > The way this works is that if there's still a reference on the dst entry
> > at the time we try to free it, it gets placed in the gc list by
> > __dst_free and the dst_destroy() call is invoked by the gc task once the
> > refcount is 0. If the gc task processes a 10th or less of its entries
> > on a single pass, it inreases the amount of time it waits between gc
> > intervals.
> >
> > Looking at the gc_task intervals, they started at 663ms when we invoked
> > __dst_free(). After that, they increased to 1663, 3136, 5567, 8191,
> > 10751, and 14848. The release that set the refcnt to 0 on our dst entry
> > occurred after the gc_task was enqueued for 14 second interval so we had
> > to wait longer than the warning time in wait_allrefs in order for the
> > dst entry to get free'd and the hold on 'lo' to be released.
> >
>
> I am glad to see you don't have a dst leak here.
>
> But from my experience of a similar bug (refcnt wait on lo), this goes
> infinitely rather than just 14sec, so it looked more like a real leak than
> just a gc delay. So in your case, this annoying warning eventually
> disappears, right?
That's correct. The problem occurs intermittently, and the warnings are
less frequent than the interval in netdev_wait_allrefs(). At least when
I observed it, it tended to conincide with our controlplane canary
issuing an API call that lead to a network namespace teardown on the
dataplane.
Sometimes, the message would look like this:
unregister_netdevice: waiting for lo to become free. Usage count = 0
The dst entries were getting released, it's just that often our dst
cache gc interval was longer than the warning interval in wait_allrefs.
The other concern was that because the wait_allrefs happens in the
netdev_todo path, a long gc interval can cause the rtnl_lock hold times
to be much longer than necessary if this bug is encountered.
-K
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free.
2017-06-09 18:43 ` Krister Johansen
@ 2017-06-09 21:48 ` Cong Wang
0 siblings, 0 replies; 7+ messages in thread
From: Cong Wang @ 2017-06-09 21:48 UTC (permalink / raw)
To: Krister Johansen
Cc: David Miller, Eric Dumazet, Linux Kernel Network Developers,
Kaiwen Xu
On Fri, Jun 9, 2017 at 11:43 AM, Krister Johansen
<kjlx@templeofstupid.com> wrote:
> On Fri, Jun 09, 2017 at 11:18:44AM -0700, Cong Wang wrote:
>> On Thu, Jun 8, 2017 at 1:12 PM, Krister Johansen
>> <kjlx@templeofstupid.com> wrote:
>> > The way this works is that if there's still a reference on the dst entry
>> > at the time we try to free it, it gets placed in the gc list by
>> > __dst_free and the dst_destroy() call is invoked by the gc task once the
>> > refcount is 0. If the gc task processes a 10th or less of its entries
>> > on a single pass, it inreases the amount of time it waits between gc
>> > intervals.
>> >
>> > Looking at the gc_task intervals, they started at 663ms when we invoked
>> > __dst_free(). After that, they increased to 1663, 3136, 5567, 8191,
>> > 10751, and 14848. The release that set the refcnt to 0 on our dst entry
>> > occurred after the gc_task was enqueued for 14 second interval so we had
>> > to wait longer than the warning time in wait_allrefs in order for the
>> > dst entry to get free'd and the hold on 'lo' to be released.
>> >
>>
>> I am glad to see you don't have a dst leak here.
>>
>> But from my experience of a similar bug (refcnt wait on lo), this goes
>> infinitely rather than just 14sec, so it looked more like a real leak than
>> just a gc delay. So in your case, this annoying warning eventually
>> disappears, right?
>
> That's correct. The problem occurs intermittently, and the warnings are
> less frequent than the interval in netdev_wait_allrefs(). At least when
> I observed it, it tended to conincide with our controlplane canary
> issuing an API call that lead to a network namespace teardown on the
> dataplane.
Great! Then the bug I saw is different from this one and it is probably
a dst leak.
Thanks.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2017-06-09 21:49 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-06-08 20:12 [PATCH net] Fix an intermittent pr_emerg warning about lo becoming free Krister Johansen
2017-06-08 21:06 ` Eric Dumazet
2017-06-08 21:41 ` Wei Wang
2017-06-09 16:28 ` David Miller
2017-06-09 18:18 ` Cong Wang
2017-06-09 18:43 ` Krister Johansen
2017-06-09 21:48 ` Cong Wang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).