* Re: [2.6.21.1] soft lockup when removing netconsole module
[not found] ` <20070529005628.f7f3abc6.akpm@linux-foundation.org>
@ 2007-06-12 11:02 ` Jarek Poplawski
2007-06-13 9:25 ` [PATCH] " Jarek Poplawski
0 siblings, 1 reply; 6+ messages in thread
From: Jarek Poplawski @ 2007-06-12 11:02 UTC (permalink / raw)
To: Andrew Morton
Cc: Folkert van Heusden, linux-kernel, Jason Wessel, Thomas Gleixner,
stable, netdev
On Tue, May 29, 2007 at 12:56:28AM -0700, Andrew Morton wrote:
> On Sat, 26 May 2007 17:40:12 +0200 Folkert van Heusden <folkert@vanheusden.com> wrote:
>
> > When trying to remove the netconsole module, I got the following kernel
> > output after a while (couple of minutes iirc):
> >
> > [525720.117293] BUG: soft lockup detected on CPU#1!
> > [525720.117353] [<c1004d53>] show_trace_log_lvl+0x1a/0x30
> > [525720.117439] [<c1004d7b>] show_trace+0x12/0x14
> > [525720.117526] [<c1004e75>] dump_stack+0x16/0x18
> > [525720.117613] [<c104dd5b>] softlockup_tick+0xa6/0xc2
> > [525720.117694] [<c1026855>] run_local_timers+0x12/0x14
> > [525720.117738] [<c1026669>] update_process_times+0x72/0xa1
> > [525720.117744] [<c1038673>] tick_sched_timer+0x53/0xb6
> > [525720.117748] [<c1033d62>] hrtimer_interrupt+0x189/0x1e3
> > [525720.117753] [<c100e9e2>] local_apic_timer_interrupt+0x55/0x5b
> > [525720.117761] [<c100ea12>] smp_apic_timer_interrupt+0x2a/0x39
> > [525720.117766] [<c1004a3f>] apic_timer_interrupt+0x33/0x38
> > [525720.117770] [<c120f4b1>] mutex_lock+0x8/0xa
> > [525720.117775] [<c102d2f0>] flush_workqueue+0x2f/0x8f
> > [525720.117780] [<c102d7a0>] cancel_rearming_delayed_workqueue+0x29/0x2b
> > [525720.117785] [<c102d7b1>] cancel_rearming_delayed_work+0xf/0x11
> > [525720.117790] [<c11be143>] netpoll_cleanup+0x75/0xa5
> > [525720.117794] [<f893712d>] cleanup_netconsole+0x17/0x1a [netconsole]
> > [525720.117804] [<c1041f11>] sys_delete_module+0x12f/0x14f
> > [525720.117809] [<c1003f74>] syscall_call+0x7/0xb
> > [525720.117812] =======================
> >
> > Also the rmmod hangs and would not exit even with kill -9. It also
> > sucks up 100% cpu.
>
> Jason recently posted a mystery patch without telling us what problem it
> fixed.
>
To be fair the problem should be known:
http://marc.info/?l=linux-kernel&m=117700287817801&w=2
List: linux-kernel
Subject: Re: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
From: Chuck Ebbert <cebbert () redhat ! com>
Date: 2007-04-19 17:07:11
Message-ID: 4627A1BF.8080406 () redhat ! com
> Okay, an easy test for it: insmod netconsole ; rmmod netconsole
>
> In 2.6.20.x it loops forever and cancel_rearming_delayed_work()
> is part of the trace...
I hoped the discussion about cancel_rearming_delayed_work would
reach more people (there was also a patch proposal to add a warning
to the usage comment). But it seems it was not enough...
Of course such a problem should preferably be fixed by somebody who
knows the code (alas I don't know netconsole), to be sure all needed
cancels are still done after this change. I hope Jason's patch is
right but I'm a little surprised I can't see netdev in cc (I'll try
to fix this).
Cheers,
Jarek P.
PS: I'm very sorry for such a late response (holidays).
> It looks like you just found it: cancel_rearming_delayed_work() will hang
> if the work isn't actually pending. Please test this:
>
>
> From: Jason Wessel <jason.wessel@windriver.com>
>
> Do not call cancel_rearming_delayed_work() if there is no
> pending work.
>
> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
> net/core/netpoll.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff -puN net/core/netpoll.c~a net/core/netpoll.c
> --- a/net/core/netpoll.c~a
> +++ a/net/core/netpoll.c
> @@ -784,8 +784,10 @@ void netpoll_cleanup(struct netpoll *np)
> if (atomic_dec_and_test(&npinfo->refcnt)) {
> skb_queue_purge(&npinfo->arp_tx);
> skb_queue_purge(&npinfo->txq);
> - cancel_rearming_delayed_work(&npinfo->tx_work);
> - flush_scheduled_work();
> + if (delayed_work_pending(&npinfo->tx_work)) {
> + cancel_rearming_delayed_work(&npinfo->tx_work);
> + flush_scheduled_work();
> + }
>
> kfree(npinfo);
> }
> _
>
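The hang discussed in this message can be reproduced in a small user-space model. This is only a sketch: it assumes the 2.6.20/2.6.21 helper was essentially a loop of the form "while (!cancel_delayed_work(dwork)) flush_workqueue(wq);", and every model_* name below is hypothetical, not kernel API. The iteration budget exists only so that the model terminates where the real loop would not.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a delayed work item; not the kernel's struct delayed_work. */
struct model_dwork {
	bool pending;	/* timer armed, work waiting for it to expire */
	bool rearms;	/* does the work function schedule itself again? */
};

/* Models cancel_delayed_work(): succeeds only if a timer was armed. */
static bool model_cancel(struct model_dwork *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/* Models flush_workqueue(): the work function runs and may rearm the timer. */
static void model_flush(struct model_dwork *w)
{
	if (w->rearms)
		w->pending = true;
}

/*
 * Models the assumed cancel_rearming_delayed_work() loop, bounded by
 * max_iter. Returns the number of flushes that were needed, or -1 if the
 * budget ran out, which corresponds to the endless loop in the report.
 */
static int model_cancel_rearming(struct model_dwork *w, int max_iter)
{
	int flushes = 0;

	assert(max_iter > 0);
	while (!model_cancel(w)) {
		if (flushes == max_iter)
			return -1;
		model_flush(w);
		flushes++;
	}
	return flushes;
}
```

In this model an armed timer is cancelled at once, a work item that rearms is cancelled after one flush, and an idle work item that never rearms exhausts any budget, which matches the rmmod that spins at 100% CPU.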
* [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
2007-06-12 11:02 ` [2.6.21.1] soft lockup when removing netconsole module Jarek Poplawski
@ 2007-06-13 9:25 ` Jarek Poplawski
2007-06-26 23:07 ` Andrew Morton
0 siblings, 1 reply; 6+ messages in thread
From: Jarek Poplawski @ 2007-06-13 9:25 UTC (permalink / raw)
To: Andrew Morton
Cc: Folkert van Heusden, linux-kernel, Jason Wessel, Thomas Gleixner,
stable, netdev
On Tue, Jun 12, 2007 at 01:02:33PM +0200, Jarek Poplawski wrote:
...
> Of course such a problem should preferably be fixed by somebody who
> knows the code (alas I don't know netconsole), to be sure all needed
> cancels are still done after this change. I hope Jason's patch is
> right but I'm a little surprised I can't see netdev in cc (I'll try
> to fix this).
So, I've had a look into netpoll and, unfortunately, I don't
think this patch is right...
> > From: Jason Wessel <jason.wessel@windriver.com>
> >
> > Do not call cancel_rearming_delayed_work() if there is no
> > pending work.
> >
> > Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >
> > net/core/netpoll.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff -puN net/core/netpoll.c~a net/core/netpoll.c
> > --- a/net/core/netpoll.c~a
> > +++ a/net/core/netpoll.c
> > @@ -784,8 +784,10 @@ void netpoll_cleanup(struct netpoll *np)
> > if (atomic_dec_and_test(&npinfo->refcnt)) {
> > skb_queue_purge(&npinfo->arp_tx);
> > skb_queue_purge(&npinfo->txq);
> > - cancel_rearming_delayed_work(&npinfo->tx_work);
> > - flush_scheduled_work();
> > + if (delayed_work_pending(&npinfo->tx_work)) {
> > + cancel_rearming_delayed_work(&npinfo->tx_work);
> > + flush_scheduled_work();
> > + }
> >
> > kfree(npinfo);
> > }
> > _
The following scenarios are possible:
1. After a positive delayed_work_pending(&npinfo->tx_work) test
some work is queued, but there is no guarantee that it will rearm
when it runs, so cancel_rearming_delayed_work() can still loop;
2. After a negative delayed_work_pending(&npinfo->tx_work) test
the work may be running right now, e.g. waiting on netif_tx_lock,
while kfree(npinfo) is done here (oops?!).
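The two scenarios can be written down as a small decision model. It is a sketch only: the struct, the enum and guarded_cleanup() are hypothetical names that describe the "test delayed_work_pending() first" approach, and real concurrency is reduced to a few boolean flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One interleaving of cleanup and the work item, reduced to flags. */
struct race_case {
	bool pending_at_test;		/* result of delayed_work_pending() */
	bool fires_before_cancel;	/* timer expires between test and cancel */
	bool work_rearms;		/* work function schedules itself again */
	bool work_running;		/* work function executing during cleanup */
};

enum outcome {
	OUTCOME_OK,
	OUTCOME_MAY_LOOP,		/* scenario 1 */
	OUTCOME_FREE_UNDER_WORKER	/* scenario 2 */
};

static enum outcome guarded_cleanup(const struct race_case *c)
{
	assert(c != NULL);

	if (!c->pending_at_test) {
		/* Cancel and flush are skipped; kfree(npinfo) follows at once. */
		return c->work_running ? OUTCOME_FREE_UNDER_WORKER : OUTCOME_OK;
	}
	/* The cancel loop is entered. */
	if (!c->fires_before_cancel)
		return OUTCOME_OK;	/* first cancel finds the timer armed */
	/* Timer already fired: the loop ends only if the work rearms. */
	return c->work_rearms ? OUTCOME_OK : OUTCOME_MAY_LOOP;
}
```

The model makes the point that the pending test is only a snapshot: whichever way it turns out, one interleaving remains in which the cleanup either spins or frees npinfo under a running work function.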
I've found an additional problem here, with or without this patch:
after the timer is deleted in cancel_rearming_delayed_work(), a last
skb could remain queued in npinfo->txq, and after kfree(npinfo) we
have a small memory leak. If I'm right, a similar fix is needed in
the current netpoll code: either just an additional purge of
npinfo->txq, or maybe the whole cancel_rearming_ call changed as in
this patch.
I've tried to eliminate these problems in the patch proposal attached
below. I'm not sure it's entirely right: as I wrote earlier, I don't
know netconsole well enough, but it's probably a little better than
the solution above.
I still have some doubts (I didn't have time to check all of this):
1. I hope the other schedule_delayed_work(), called from
netpoll_send_skb(), cannot happen while netpoll_cleanup() runs; if
I'm wrong, an additional check of npinfo->refcnt should be done there;
2. I also hope that checking npinfo->refcnt before scheduling is
enough here; if not, another possibility is adding some locking,
e.g. taking netif_tx_lock before the cancel, for synchronization.
Of course it would be very nice if somebody could test or verify
this patch more.
Regards,
Jarek P.
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
---
diff -Nurp 2.6.21-/net/core/netpoll.c 2.6.21/net/core/netpoll.c
--- 2.6.21-/net/core/netpoll.c 2007-04-26 15:08:32.000000000 +0200
+++ 2.6.21/net/core/netpoll.c 2007-06-12 21:05:23.000000000 +0200
@@ -73,7 +73,8 @@ static void queue_process(struct work_st
netif_tx_unlock(dev);
local_irq_restore(flags);
- schedule_delayed_work(&npinfo->tx_work, HZ/10);
+ if (atomic_read(&npinfo->refcnt))
+ schedule_delayed_work(&npinfo->tx_work, HZ/10);
return;
}
netif_tx_unlock(dev);
@@ -780,9 +781,15 @@ void netpoll_cleanup(struct netpoll *np)
if (atomic_dec_and_test(&npinfo->refcnt)) {
skb_queue_purge(&npinfo->arp_tx);
skb_queue_purge(&npinfo->txq);
- cancel_rearming_delayed_work(&npinfo->tx_work);
+ cancel_delayed_work(&npinfo->tx_work);
flush_scheduled_work();
+ /* clean after last, unfinished work */
+ if (!skb_queue_empty(&npinfo->txq)) {
+ struct sk_buff *skb;
+ skb = __skb_dequeue(&npinfo->txq);
+ kfree_skb(skb);
+ }
kfree(npinfo);
}
}
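The effect of the refcnt test added to queue_process() in the patch above can be sketched in user space. The model below is an illustration under stated assumptions, not the kernel code: all model_* names are hypothetical, the work function is assumed to always take its retry branch (the worst case), and the concurrent netpoll_send_skb() path mentioned as a doubt is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct netpoll_info. */
struct model_npinfo {
	int refcnt;		/* stands in for atomic_t npinfo->refcnt */
	bool timer_pending;	/* delayed work armed */
	bool work_queued;	/* timer expired, work function not yet run */
};

/*
 * Retry branch of the work function. With guarded == true it rearms only
 * while npinfo is still referenced, as in the patch; with guarded == false
 * it rearms unconditionally, as before the patch.
 */
static void model_queue_process(struct model_npinfo *n, bool guarded)
{
	n->work_queued = false;
	if (!guarded || n->refcnt > 0)
		n->timer_pending = true;
}

/*
 * Final put in the cleanup path: drop the last reference, cancel once,
 * flush once. Returns true if nothing is armed or queued afterwards,
 * i.e. freeing npinfo would be safe in this model.
 */
static bool model_cleanup(struct model_npinfo *n, bool guarded)
{
	assert(n->refcnt == 1);
	n->refcnt--;
	n->timer_pending = false;		/* cancel_delayed_work() */
	if (n->work_queued)
		model_queue_process(n, guarded);	/* flush_scheduled_work() */
	return !n->timer_pending && !n->work_queued;
}
```

With the guard, a single cancel followed by a single flush leaves nothing behind, so the looping cancel_rearming_ variant is not needed. Without it, a work item flushed after the cancel can arm the timer again just before npinfo is freed.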
* Re: [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
2007-06-13 9:25 ` [PATCH] " Jarek Poplawski
@ 2007-06-26 23:07 ` Andrew Morton
2007-06-27 0:46 ` Wessel, Jason
0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2007-06-26 23:07 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Folkert van Heusden, linux-kernel, Jason Wessel, Thomas Gleixner,
stable, netdev
On Wed, 13 Jun 2007 11:25:37 +0200
Jarek Poplawski <jarkao2@o2.pl> wrote:
> On Tue, Jun 12, 2007 at 01:02:33PM +0200, Jarek Poplawski wrote:
> ...
> > Of course such a problem should preferably be fixed by somebody who
> > knows the code (alas I don't know netconsole), to be sure all needed
> > cancels are still done after this change. I hope Jason's patch is
> > right but I'm a little surprised I can't see netdev in cc (I'll try
> > to fix this).
>
> So, I've had a look into netpoll and, unfortunately, I don't
> think this patch is right...
>
> > > From: Jason Wessel <jason.wessel@windriver.com>
> > >
> > > Do not call cancel_rearming_delayed_work() if there is no
> > > pending work.
> > >
> > > Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > > ---
> > >
> > > net/core/netpoll.c | 6 ++++--
> > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff -puN net/core/netpoll.c~a net/core/netpoll.c
> > > --- a/net/core/netpoll.c~a
> > > +++ a/net/core/netpoll.c
> > > @@ -784,8 +784,10 @@ void netpoll_cleanup(struct netpoll *np)
> > > if (atomic_dec_and_test(&npinfo->refcnt)) {
> > > skb_queue_purge(&npinfo->arp_tx);
> > > skb_queue_purge(&npinfo->txq);
> > > - cancel_rearming_delayed_work(&npinfo->tx_work);
> > > - flush_scheduled_work();
> > > + if (delayed_work_pending(&npinfo->tx_work)) {
> > > + cancel_rearming_delayed_work(&npinfo->tx_work);
> > > + flush_scheduled_work();
> > > + }
> > >
> > > kfree(npinfo);
> > > }
> > > _
>
> The following scenarios are possible:
>
> 1. After a positive delayed_work_pending(&npinfo->tx_work) test
> some work is queued, but there is no guarantee that it will rearm
> when it runs, so cancel_rearming_delayed_work() can still loop;
>
> 2. After a negative delayed_work_pending(&npinfo->tx_work) test
> the work may be running right now, e.g. waiting on netif_tx_lock,
> while kfree(npinfo) is done here (oops?!).
>
> I've found an additional problem here, with or without this patch:
> after the timer is deleted in cancel_rearming_delayed_work(), a last
> skb could remain queued in npinfo->txq, and after kfree(npinfo) we
> have a small memory leak. If I'm right, a similar fix is needed in
> the current netpoll code: either just an additional purge of
> npinfo->txq, or maybe the whole cancel_rearming_ call changed as in
> this patch.
>
> I've tried to eliminate these problems in the patch proposal attached
> below. I'm not sure it's entirely right: as I wrote earlier, I don't
> know netconsole well enough, but it's probably a little better than
> the solution above.
>
> I still have some doubts (I didn't have time to check all of this):
>
> 1. I hope the other schedule_delayed_work(), called from
> netpoll_send_skb(), cannot happen while netpoll_cleanup() runs; if
> I'm wrong, an additional check of npinfo->refcnt should be done there;
> 2. I also hope that checking npinfo->refcnt before scheduling is
> enough here; if not, another possibility is adding some locking,
> e.g. taking netif_tx_lock before the cancel, for synchronization.
>
> Of course it would be very nice if somebody could test or verify
> this patch more.
>
> Regards,
> Jarek P.
>
>
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
>
> ---
>
> diff -Nurp 2.6.21-/net/core/netpoll.c 2.6.21/net/core/netpoll.c
> --- 2.6.21-/net/core/netpoll.c 2007-04-26 15:08:32.000000000 +0200
> +++ 2.6.21/net/core/netpoll.c 2007-06-12 21:05:23.000000000 +0200
> @@ -73,7 +73,8 @@ static void queue_process(struct work_st
> netif_tx_unlock(dev);
> local_irq_restore(flags);
>
> - schedule_delayed_work(&npinfo->tx_work, HZ/10);
> + if (atomic_read(&npinfo->refcnt))
> + schedule_delayed_work(&npinfo->tx_work, HZ/10);
> return;
> }
> netif_tx_unlock(dev);
> @@ -780,9 +781,15 @@ void netpoll_cleanup(struct netpoll *np)
> if (atomic_dec_and_test(&npinfo->refcnt)) {
> skb_queue_purge(&npinfo->arp_tx);
> skb_queue_purge(&npinfo->txq);
> - cancel_rearming_delayed_work(&npinfo->tx_work);
> + cancel_delayed_work(&npinfo->tx_work);
> flush_scheduled_work();
>
> + /* clean after last, unfinished work */
> + if (!skb_queue_empty(&npinfo->txq)) {
> + struct sk_buff *skb;
> + skb = __skb_dequeue(&npinfo->txq);
> + kfree_skb(skb);
> + }
> kfree(npinfo);
> }
> }
Everything went quiet?
If this patch has been tested and fixes the bug, can you please send a
version which is ready for merging? (ie: add a suitable description of
what it does).
Thanks.
* RE: [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
2007-06-26 23:07 ` Andrew Morton
@ 2007-06-27 0:46 ` Wessel, Jason
2007-06-27 1:00 ` Andrew Morton
0 siblings, 1 reply; 6+ messages in thread
From: Wessel, Jason @ 2007-06-27 0:46 UTC (permalink / raw)
To: Andrew Morton, Jarek Poplawski
Cc: Folkert van Heusden, linux-kernel, Thomas Gleixner, stable,
netdev
> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> >
> > Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> >
> > ---
> >
> > diff -Nurp 2.6.21-/net/core/netpoll.c 2.6.21/net/core/netpoll.c
> > --- 2.6.21-/net/core/netpoll.c 2007-04-26 15:08:32.000000000 +0200
> > +++ 2.6.21/net/core/netpoll.c 2007-06-12 21:05:23.000000000 +0200
> > @@ -73,7 +73,8 @@ static void queue_process(struct work_st
> > netif_tx_unlock(dev);
> > local_irq_restore(flags);
> >
> > - schedule_delayed_work(&npinfo->tx_work, HZ/10);
> > + if (atomic_read(&npinfo->refcnt))
> > + schedule_delayed_work(&npinfo->tx_work, HZ/10);
> > return;
> > }
> > netif_tx_unlock(dev);
> > @@ -780,9 +781,15 @@ void netpoll_cleanup(struct netpoll *np)
> > if (atomic_dec_and_test(&npinfo->refcnt)) {
> > skb_queue_purge(&npinfo->arp_tx);
> > skb_queue_purge(&npinfo->txq);
> > - cancel_rearming_delayed_work(&npinfo->tx_work);
> > + cancel_delayed_work(&npinfo->tx_work);
> > flush_scheduled_work();
> >
> > + /* clean after last, unfinished work */
> > + if (!skb_queue_empty(&npinfo->txq)) {
> > + struct sk_buff *skb;
> > + skb = __skb_dequeue(&npinfo->txq);
> > + kfree_skb(skb);
> > + }
> > kfree(npinfo);
> > }
> > }
>
> Everything went quiet?
>
> If this patch has been tested and fixes the bug, can you
> please send a version which is ready for merging? (ie: add a
> suitable description of what it does).
>
>
I mailed Jarek separately.
I had tested the patch with netconsole and kgdb and it does in fact fix
the problem that was reported.
Jason.
* Re: [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
2007-06-27 0:46 ` Wessel, Jason
@ 2007-06-27 1:00 ` Andrew Morton
2007-06-27 7:24 ` Jarek Poplawski
0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2007-06-27 1:00 UTC (permalink / raw)
To: Wessel, Jason
Cc: Jarek Poplawski, Folkert van Heusden, linux-kernel,
Thomas Gleixner, stable, netdev
On Tue, 26 Jun 2007 17:46:13 -0700 "Wessel, Jason" <jason.wessel@windriver.com> wrote:
> > > }
> > > }
> >
> > Everything went quiet?
> >
> > If this patch has been tested and fixes the bug, can you
> > please send a version which is ready for merging? (ie: add a
> > suitable description of what it does).
> >
> >
>
> I mailed Jarek separately.
>
> I had tested the patch with netconsole and kgdb and it does in fact fix
> the problem that was reported.
OK, thanks. Please don't mail people separately!
I queued this up with a null changelog for now.
* [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
2007-06-27 1:00 ` Andrew Morton
@ 2007-06-27 7:24 ` Jarek Poplawski
0 siblings, 0 replies; 6+ messages in thread
From: Jarek Poplawski @ 2007-06-27 7:24 UTC (permalink / raw)
To: Andrew Morton
Cc: Wessel, Jason, Folkert van Heusden, linux-kernel, Thomas Gleixner,
stable, netdev
On Tue, Jun 26, 2007 at 06:00:00PM -0700, Andrew Morton wrote:
> On Tue, 26 Jun 2007 17:46:13 -0700 "Wessel, Jason" <jason.wessel@windriver.com> wrote:
...
> > > Everything went quiet?
> > >
> > > If this patch has been tested and fixes the bug, can you
> > > please send a version which is ready for merging? (ie: add a
> > > suitable description of what it does).
> > >
> > >
> >
> > I mailed Jarek separately.
> >
> > I had tested the patch with netconsole and kgdb and it does in fact fix
> > the problem that was reported.
>
> OK, thanks. Please don't mail people separately!
>
> I queued this up with a null changelog for now.
>
I've pasted the queued version here; only the changelog is added.
Regards,
Jarek P.
------------------------------------------------------
Subject: netconsole: fix soft lockup (2.6.21) and memory leak when removing module
From: Jarek Poplawski <jarkao2@o2.pl>
#1
Up to and including kernel 2.6.21, cancel_rearming_delayed_work()
required that the work function always (unconditionally) rearm itself
with delay > 0; otherwise it would loop endlessly. This patch replaces
this function with cancel_delayed_work(). Later kernel versions don't
require this, so here it's only for uniformity.
#2
After the timer is deleted in cancel_[rearming_]delayed_work(), a last
skb could remain queued in npinfo->txq, causing a memory leak after
kfree(npinfo).
Initial patch & testing by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
net/core/netpoll.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff -puN net/core/netpoll.c~netconsole-fix-soft-lockup-when-removing-module net/core/netpoll.c
--- a/net/core/netpoll.c~netconsole-fix-soft-lockup-when-removing-module
+++ a/net/core/netpoll.c
@@ -72,7 +72,8 @@ static void queue_process(struct work_st
netif_tx_unlock(dev);
local_irq_restore(flags);
- schedule_delayed_work(&npinfo->tx_work, HZ/10);
+ if (atomic_read(&npinfo->refcnt))
+ schedule_delayed_work(&npinfo->tx_work, HZ/10);
return;
}
netif_tx_unlock(dev);
@@ -785,9 +786,15 @@ void netpoll_cleanup(struct netpoll *np)
if (atomic_dec_and_test(&npinfo->refcnt)) {
skb_queue_purge(&npinfo->arp_tx);
skb_queue_purge(&npinfo->txq);
- cancel_rearming_delayed_work(&npinfo->tx_work);
+ cancel_delayed_work(&npinfo->tx_work);
flush_scheduled_work();
+ /* clean after last, unfinished work */
+ if (!skb_queue_empty(&npinfo->txq)) {
+ struct sk_buff *skb;
+ skb = __skb_dequeue(&npinfo->txq);
+ kfree_skb(skb);
+ }
kfree(npinfo);
}
}
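Point #2 of the changelog can be illustrated with a user-space model of the queue. It is a sketch under assumptions: the model_* types and functions are hypothetical stand-ins for sk_buff, sk_buff_head and their helpers, and the retry path of the work function is assumed to put the skb it holds back at the head of txq after the purge has already run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for sk_buff and sk_buff_head; not kernel types. */
struct model_skb {
	struct model_skb *next;
};

struct model_txq {
	struct model_skb *head;
};

static struct model_skb *model_alloc_skb(void)
{
	struct model_skb *skb = malloc(sizeof(*skb));

	assert(skb != NULL);
	skb->next = NULL;
	return skb;
}

static void model_queue_head(struct model_txq *q, struct model_skb *skb)
{
	skb->next = q->head;
	q->head = skb;
}

static struct model_skb *model_dequeue(struct model_txq *q)
{
	struct model_skb *skb = q->head;

	if (skb)
		q->head = skb->next;
	return skb;
}

/*
 * Teardown in the order used by the patch. in_flight is an skb that the
 * last work run took off the queue before the purge and puts back after
 * it. Returns how many skbs are still queued at the point where npinfo
 * would be freed, i.e. how many would be leaked.
 */
static int model_teardown(struct model_txq *q, struct model_skb *in_flight,
			  bool drain_after_flush)
{
	struct model_skb *skb;
	int leaked = 0;

	while ((skb = model_dequeue(q)) != NULL)	/* skb_queue_purge() */
		free(skb);
	if (in_flight)
		model_queue_head(q, in_flight);	/* last run fails to send */
	if (drain_after_flush && (skb = model_dequeue(q)) != NULL)
		free(skb);			/* the added clean-up step */
	/* kfree(npinfo) point: count leftovers, free them so the model is clean. */
	while ((skb = model_dequeue(q)) != NULL) {
		free(skb);
		leaked++;
	}
	return leaked;
}
```

In this model, calling model_teardown() with an in-flight skb and drain_after_flush set to false reports one leaked skb, while the same call with drain_after_flush set to true reports none. At most one skb can be put back by the last work run, which is why a single dequeue is enough in the patch.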