netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Linux 3.0 release
       [not found] <CA+55aFxakA2U+oMJ1T7awTYa+p6xp9N0aCbfrUqgkF7BJ8gnQw@mail.gmail.com>
@ 2011-07-22 19:11 ` David
  2011-07-22 19:21   ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: David @ 2011-07-22 19:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, netdev

On 22/07/11 03:59, Linus Torvalds wrote:
> really smooth. Which is not to say that there may not be bugs, but if
> anything, there are hopefully fewer than usual, rather than the normal
> ".0" problems.

I'm getting the following warning at boot from 3.0, everything seems to be fine otherwise though.

Jul 22 19:40:02 server kernel: [   15.526629] ------------[ cut here ]------------
Jul 22 19:40:02 server kernel: [   15.526635] WARNING: at kernel/timer.c:1011 del_timer_sync+0x4e/0x50()
Jul 22 19:40:02 server kernel: [   15.526637] Hardware name: System Product Name
Jul 22 19:40:02 server kernel: [   15.526638] Modules linked in: xt_owner ipt_REDIRECT ipt_MASQUERADE ts_kmp xt_string ipt_REJECT xt_recent xt_state xt_multiport xt_tcpudp xt_pkttype ipt_LOG xt_limit
iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables ip6table_filter ip6_tables x_tables wctdm dahdi hisax nfsd isl6421 crc_ccitt b2c2_flexcop_pci
dvb_pll b2c2_flexcop mt352 isdn cx24123 dvb_usb_digitv cx24113 s5h1420 dvb_usb e1000e snd_hda_codec_hdmi exportfs it87 dvb_core pl2303 hwmon_vid lp snd_hda_codec_via ppdev nfs k10temp i2c_piix4
snd_hda_intel snd_hda_codec r8169 mii lockd snd_pcm snd_timer snd shpchp soundcore parport_pc snd_page_alloc parport auth_rpcgss nfs_acl sunrpc
Jul 22 19:40:02 server kernel: [   15.526667] Pid: 0, comm: kworker/0:0 Not tainted 3.0.0 #1
Jul 22 19:40:02 server kernel: [   15.526668] Call Trace:
Jul 22 19:40:02 server kernel: [   15.526670]  <IRQ>  [<ffffffff8104599a>] warn_slowpath_common+0x7a/0xb0
Jul 22 19:40:02 server kernel: [   15.526675]  [<ffffffff810459e5>] warn_slowpath_null+0x15/0x20
Jul 22 19:40:02 server kernel: [   15.526677]  [<ffffffff81054b4e>] del_timer_sync+0x4e/0x50
Jul 22 19:40:02 server kernel: [   15.526679]  [<ffffffff8145e224>] linkwatch_schedule_work+0x84/0xa0
Jul 22 19:40:02 server kernel: [   15.526681]  [<ffffffff8145e2bc>] linkwatch_fire_event+0x7c/0x100
Jul 22 19:40:02 server kernel: [   15.526684]  [<ffffffff8146b1ed>] netif_carrier_on+0x2d/0x40
Jul 22 19:40:02 server kernel: [   15.526689]  [<ffffffffa006b6fb>] __rtl8169_check_link_status+0x4b/0xc0 [r8169]
Jul 22 19:40:02 server kernel: [   15.526693]  [<ffffffffa006c016>] rtl8169_interrupt+0x166/0x3a0 [r8169]
Jul 22 19:40:02 server kernel: [   15.526696]  [<ffffffff810a4385>] handle_irq_event_percpu+0x55/0x1f0
Jul 22 19:40:02 server kernel: [   15.526698]  [<ffffffff810a4551>] handle_irq_event+0x31/0x50
Jul 22 19:40:02 server kernel: [   15.526701]  [<ffffffff8101a371>] ? ack_apic_edge+0x31/0x40
Jul 22 19:40:02 server kernel: [   15.526703]  [<ffffffff810a6eb5>] handle_edge_irq+0x65/0x120
Jul 22 19:40:02 server kernel: [   15.526706]  [<ffffffff81003f8d>] handle_irq+0x1d/0x30
Jul 22 19:40:02 server kernel: [   15.526708]  [<ffffffff81003918>] do_IRQ+0x58/0xe0
Jul 22 19:40:02 server kernel: [   15.526711]  [<ffffffff815588d3>] common_interrupt+0x13/0x13
Jul 22 19:40:02 server kernel: [   15.526712]  <EOI>  [<ffffffff8106bf5d>] ? sched_clock_local+0x1d/0x90
Jul 22 19:40:02 server kernel: [   15.526717]  [<ffffffff81009e25>] ? default_idle+0x55/0x170
Jul 22 19:40:02 server kernel: [   15.526719]  [<ffffffff81009fc3>] amd_e400_idle+0x83/0x100
Jul 22 19:40:02 server kernel: [   15.526721]  [<ffffffff8155bb55>] ? atomic_notifier_call_chain+0x15/0x20
Jul 22 19:40:02 server kernel: [   15.526723]  [<ffffffff81001fd9>] cpu_idle+0x59/0xb0
Jul 22 19:40:02 server kernel: [   15.526726]  [<ffffffff81551552>] start_secondary+0x181/0x185
Jul 22 19:40:02 server kernel: [   15.526727] ---[ end trace 5bac97729a402448 ]---

(dual core phenom 2, dmesg etc on request)

Cheers
David


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 3.0 release
  2011-07-22 19:11 ` Linux 3.0 release David
@ 2011-07-22 19:21   ` Linus Torvalds
  2011-07-22 19:44     ` Ben Greear
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2011-07-22 19:21 UTC (permalink / raw)
  To: David; +Cc: Linux Kernel Mailing List, netdev

On Fri, Jul 22, 2011 at 12:11 PM, David <david@unsolicited.net> wrote:
>
> I'm getting the following warning at boot from 3.0, everything seems to be fine otherwise though.
>
> Jul 22 19:40:02 server kernel: [   15.526629] ------------[ cut here ]------------
> Jul 22 19:40:02 server kernel: [   15.526635] WARNING: at kernel/timer.c:1011 del_timer_sync+0x4e/0x50()

Hmm. That looks like a real bug: you shouldn't do a "del_timer_sync()"
from an interrupt. It probably works, but it sounds like a really bad
idea.

> Jul 22 19:40:02 server kernel: [   15.526677]  [<ffffffff81054b4e>] del_timer_sync+0x4e/0x50
> Jul 22 19:40:02 server kernel: [   15.526679]  [<ffffffff8145e224>] linkwatch_schedule_work+0x84/0xa0
> Jul 22 19:40:02 server kernel: [   15.526681]  [<ffffffff8145e2bc>] linkwatch_fire_event+0x7c/0x100
> Jul 22 19:40:02 server kernel: [   15.526684]  [<ffffffff8146b1ed>] netif_carrier_on+0x2d/0x40
> Jul 22 19:40:02 server kernel: [   15.526689]  [<ffffffffa006b6fb>] __rtl8169_check_link_status+0x4b/0xc0 [r8169]
> Jul 22 19:40:02 server kernel: [   15.526693]  [<ffffffffa006c016>] rtl8169_interrupt+0x166/0x3a0 [r8169]
> Jul 22 19:40:02 server kernel: [   15.526696]  [<ffffffff810a4385>] handle_irq_event_percpu+0x55/0x1f0
> Jul 22 19:40:02 server kernel: [   15.526698]  [<ffffffff810a4551>] handle_irq_event+0x31/0x50

I'm not seeing a lot of changes in any of these areas, though. I
wonder what made it start happen.

                     Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 3.0 release
  2011-07-22 19:21   ` Linus Torvalds
@ 2011-07-22 19:44     ` Ben Greear
  2011-07-22 20:32       ` Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: Ben Greear @ 2011-07-22 19:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David, Linux Kernel Mailing List, netdev

On 07/22/2011 12:21 PM, Linus Torvalds wrote:
> On Fri, Jul 22, 2011 at 12:11 PM, David<david@unsolicited.net>  wrote:
>>
>> I'm getting the following warning at boot from 3.0, everything seems to be fine otherwise though.
>>
>> Jul 22 19:40:02 server kernel: [   15.526629] ------------[ cut here ]------------
>> Jul 22 19:40:02 server kernel: [   15.526635] WARNING: at kernel/timer.c:1011 del_timer_sync+0x4e/0x50()
>
> Hmm. That looks like a real bug: you shouldn't do a "del_timer_sync()"
> from an interrupt. It probably works, but it sounds like a really bad
> idea.
>
>> Jul 22 19:40:02 server kernel: [   15.526677]  [<ffffffff81054b4e>] del_timer_sync+0x4e/0x50
>> Jul 22 19:40:02 server kernel: [   15.526679]  [<ffffffff8145e224>] linkwatch_schedule_work+0x84/0xa0
>> Jul 22 19:40:02 server kernel: [   15.526681]  [<ffffffff8145e2bc>] linkwatch_fire_event+0x7c/0x100
>> Jul 22 19:40:02 server kernel: [   15.526684]  [<ffffffff8146b1ed>] netif_carrier_on+0x2d/0x40
>> Jul 22 19:40:02 server kernel: [   15.526689]  [<ffffffffa006b6fb>] __rtl8169_check_link_status+0x4b/0xc0 [r8169]
>> Jul 22 19:40:02 server kernel: [   15.526693]  [<ffffffffa006c016>] rtl8169_interrupt+0x166/0x3a0 [r8169]
>> Jul 22 19:40:02 server kernel: [   15.526696]  [<ffffffff810a4385>] handle_irq_event_percpu+0x55/0x1f0
>> Jul 22 19:40:02 server kernel: [   15.526698]  [<ffffffff810a4551>] handle_irq_event+0x31/0x50
>
> I'm not seeing a lot of changes in any of these areas, though. I
> wonder what made it start happen.

This has been around since at least 2.6.38.  I haven't tested recently
on my rtl8169 system, but I don't recall seeing any attempts to fix
it...

http://permalink.gmane.org/gmane.linux.network/193565
http://lists.openwall.net/netdev/2011/05/04/183

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 3.0 release
  2011-07-22 19:44     ` Ben Greear
@ 2011-07-22 20:32       ` Stephen Hemminger
  2011-07-22 20:35         ` Linus Torvalds
  2011-07-22 21:26         ` Francois Romieu
  0 siblings, 2 replies; 11+ messages in thread
From: Stephen Hemminger @ 2011-07-22 20:32 UTC (permalink / raw)
  To: Ben Greear, Linus Torvalds, David, Tejun Heo
  Cc: Linux Kernel Mailing List, netdev

This a regression which probably began with

commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
Author: Tejun Heo <tj@kernel.org>
Date:   Tue Jun 29 10:07:14 2010 +0200

    workqueue: implement concurrency managed dynamic worker pool

Before that it was perfectly legal for link watch code to
call schedule_delayed_work from IRQ. This should be allowable;
the code to manage the worker pool should handle it.

Network devices call netif_carrier_on/off from IRQ all the
time. The problem is that the new pool code breaks this
if link watch tries to schedule work before there enough worker
threads.

The workqueue code should have a fallback and not try and
do anything if being called from IRQ.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 3.0 release
  2011-07-22 20:32       ` Stephen Hemminger
@ 2011-07-22 20:35         ` Linus Torvalds
  2011-07-23  2:27           ` Tejun Heo
  2011-07-22 21:26         ` Francois Romieu
  1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2011-07-22 20:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Ben Greear, David, Tejun Heo, Linux Kernel Mailing List, netdev

On Fri, Jul 22, 2011 at 1:32 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
>
> The workqueue code should have a fallback and not try and
> do anything if being called from IRQ.

Fair enough. Especially since one of the *points* of workqueues is
indeed to schedule stuff from irqs and that cannot be done
immediately.

Tejun?

                  Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 3.0 release
  2011-07-22 20:32       ` Stephen Hemminger
  2011-07-22 20:35         ` Linus Torvalds
@ 2011-07-22 21:26         ` Francois Romieu
  2011-07-22 22:09           ` Stephen Hemminger
  1 sibling, 1 reply; 11+ messages in thread
From: Francois Romieu @ 2011-07-22 21:26 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Ben Greear, Linus Torvalds, David, Tejun Heo,
	Linux Kernel Mailing List, netdev

Stephen Hemminger <shemminger@vyatta.com> :
> This a regression which probably began with
> 
> commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
> Author: Tejun Heo <tj@kernel.org>
> Date:   Tue Jun 29 10:07:14 2010 +0200
> 
>     workqueue: implement concurrency managed dynamic worker pool
> 
> Before that it was perfectly legal for link watch code to
> call schedule_delayed_work from IRQ. This should be allowable;
> the code to manage the worker pool should handle it.

I beg to differ: see Ben's first report
(http://lists.openwall.net/netdev/2011/05/04/183).
                                       ^^

One of the code path in the netif_carrier code leads it to try and disable
a late workqueue to reenable it immediately (mod_workqueue anyone ?):
netif_carrier_on
-> linkwatch_fire_event
   -> linkwatch_schedule_work
      -> cancel_delayed_work
         -> del_timer_sync

The del_timer_sync has been here for ages. Afaiks it is not a new pool code
nor a schedule_delayed_work only problem.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 3.0 release
  2011-07-22 21:26         ` Francois Romieu
@ 2011-07-22 22:09           ` Stephen Hemminger
  2011-07-22 22:53             ` [PATCH] net: allow netif_carrier to be called safely from IRQ Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2011-07-22 22:09 UTC (permalink / raw)
  To: Francois Romieu
  Cc: Ben Greear, Linus Torvalds, David, Tejun Heo,
	Linux Kernel Mailing List, netdev

On Fri, 22 Jul 2011 23:26:29 +0200
Francois Romieu <romieu@fr.zoreil.com> wrote:

> Stephen Hemminger <shemminger@vyatta.com> :
> > This a regression which probably began with
> > 
> > commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
> > Author: Tejun Heo <tj@kernel.org>
> > Date:   Tue Jun 29 10:07:14 2010 +0200
> > 
> >     workqueue: implement concurrency managed dynamic worker pool
> > 
> > Before that it was perfectly legal for link watch code to
> > call schedule_delayed_work from IRQ. This should be allowable;
> > the code to manage the worker pool should handle it.
> 
> I beg to differ: see Ben's first report
> (http://lists.openwall.net/netdev/2011/05/04/183).
>                                        ^^
> 
> One of the code path in the netif_carrier code leads it to try and disable
> a late workqueue to reenable it immediately (mod_workqueue anyone ?):
> netif_carrier_on
> -> linkwatch_fire_event
>    -> linkwatch_schedule_work
>       -> cancel_delayed_work
>          -> del_timer_sync
> 
> The del_timer_sync has been here for ages. Afaiks it is not a new pool code
> nor a schedule_delayed_work only problem.
> 

That path can be fixed by calling _cancel_delayed_work() instead.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] net: allow netif_carrier to be called safely from IRQ
  2011-07-22 22:09           ` Stephen Hemminger
@ 2011-07-22 22:53             ` Stephen Hemminger
  2011-07-23  0:16               ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2011-07-22 22:53 UTC (permalink / raw)
  To: David Miller
  Cc: Francois Romieu, Ben Greear, Linus Torvalds, David, Tejun Heo,
	Linux Kernel Mailing List, netdev

As reported by Ben Greer and Froncois Romieu. The code path in 
the netif_carrier code leads it to try and disable
a late workqueue to reenable it immediately
netif_carrier_on
-> linkwatch_fire_event
   -> linkwatch_schedule_work
      -> cancel_delayed_work
         -> del_timer_sync  

If __cancel_delayed_work is used instead then there is no
problem of waiting for running linkwatch_event.

There is a race between linkwatch_event running re-scheduling
but it is harmless to schedule an extra scan of the linkwatch queue.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


--- a/net/core/link_watch.c	2011-07-22 15:25:31.027533604 -0700
+++ b/net/core/link_watch.c	2011-07-22 15:31:27.531520028 -0700
@@ -126,7 +126,7 @@ static void linkwatch_schedule_work(int
 		return;
 
 	/* It's already running which is good enough. */
-	if (!cancel_delayed_work(&linkwatch_work))
+	if (!__cancel_delayed_work(&linkwatch_work))
 		return;
 
 	/* Otherwise we reschedule it again for immediate execution. */

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: allow netif_carrier to be called safely from IRQ
  2011-07-22 22:53             ` [PATCH] net: allow netif_carrier to be called safely from IRQ Stephen Hemminger
@ 2011-07-23  0:16               ` David Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2011-07-23  0:16 UTC (permalink / raw)
  To: shemminger; +Cc: romieu, greearb, torvalds, david, tj, linux-kernel, netdev

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Fri, 22 Jul 2011 15:53:56 -0700

> As reported by Ben Greer and Froncois Romieu. The code path in 
> the netif_carrier code leads it to try and disable
> a late workqueue to reenable it immediately
> netif_carrier_on
> -> linkwatch_fire_event
>    -> linkwatch_schedule_work
>       -> cancel_delayed_work
>          -> del_timer_sync  
> 
> If __cancel_delayed_work is used instead then there is no
> problem of waiting for running linkwatch_event.
> 
> There is a race between linkwatch_event running re-scheduling
> but it is harmless to schedule an extra scan of the linkwatch queue.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 3.0 release
  2011-07-22 20:35         ` Linus Torvalds
@ 2011-07-23  2:27           ` Tejun Heo
  2011-07-23  2:30             ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2011-07-23  2:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen Hemminger, Ben Greear, David, Linux Kernel Mailing List,
	netdev

Hello, Stephen, Linus.

On Fri, Jul 22, 2011 at 01:35:16PM -0700, Linus Torvalds wrote:
> On Fri, Jul 22, 2011 at 1:32 PM, Stephen Hemminger
> <shemminger@vyatta.com> wrote:
> >
> > The workqueue code should have a fallback and not try and
> > do anything if being called from IRQ.
> 
> Fair enough. Especially since one of the *points* of workqueues is
> indeed to schedule stuff from irqs and that cannot be done
> immediately.
> 
> Tejun?

It seems to have been already tracked down but, just to be clear.
Nothing changed regarding synchronization requirements for all the
queue, flush and cancel functions.  If it worked before cmwq, it
should work with cmwq.

While on the topic, we do have some workqueue API problems.  The
delayed ones are a bit screwy.  e.g. requeueing an already pending
delayed work item should probably update the timer but it doesn't andp
we have a bunch of users doing cancel/requeue or using separate timers
for that.  Also, the cancel/flush[_sync] variants are subtly different
making using the correct one difficult, which has possibility of
introducing bugs which are extremely difficult to reproduce.

Again, most of these had accumulated well before cmwq came into the
picture.  I think we need to make workqueue simpler and easier to use.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Linux 3.0 release
  2011-07-23  2:27           ` Tejun Heo
@ 2011-07-23  2:30             ` Tejun Heo
  0 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2011-07-23  2:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen Hemminger, Ben Greear, David, Linux Kernel Mailing List,
	netdev

On Sat, Jul 23, 2011 at 04:27:15AM +0200, Tejun Heo wrote:
> While on the topic, we do have some workqueue API problems.  The
> delayed ones are a bit screwy.  e.g. requeueing an already pending
> delayed work item should probably update the timer but it doesn't andp
> we have a bunch of users doing cancel/requeue or using separate timers
> for that.

(after reading the other branch of the thread) Ooh, bingo, this
actually was the issue which triggered the problem reported here. :)

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2011-07-23  2:30 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CA+55aFxakA2U+oMJ1T7awTYa+p6xp9N0aCbfrUqgkF7BJ8gnQw@mail.gmail.com>
2011-07-22 19:11 ` Linux 3.0 release David
2011-07-22 19:21   ` Linus Torvalds
2011-07-22 19:44     ` Ben Greear
2011-07-22 20:32       ` Stephen Hemminger
2011-07-22 20:35         ` Linus Torvalds
2011-07-23  2:27           ` Tejun Heo
2011-07-23  2:30             ` Tejun Heo
2011-07-22 21:26         ` Francois Romieu
2011-07-22 22:09           ` Stephen Hemminger
2011-07-22 22:53             ` [PATCH] net: allow netif_carrier to be called safely from IRQ Stephen Hemminger
2011-07-23  0:16               ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).