* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 10:55 ` Oleg Nesterov
@ 2007-05-08 11:09 ` Jiri Slaby
2007-05-08 11:09 ` Jiri Slaby
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Jiri Slaby @ 2007-05-08 11:09 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, linux-pm, Pavel Machek, Linux kernel mailing list
Oleg Nesterov napsal(a):
> On 05/08, Andrew Morton wrote:
>> On Tue, 08 May 2007 10:57:35 +0200 Jiri Slaby <jirislaby@gmail.com> wrote:
>>
>>> this occured in dmesg during resuming from hwsusp in 2.6.21-mm1 (captured
>>> through netconsole). Perfectly reproducible, it simply happens each time I
>>> try it.
>> Let's cc Oleg.
>>
>>> usb_endpoint usbdev5.1_ep00: PM: resume from 0, parent usb5 still 2
>>> ------------[ cut here ]------------
>>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
>>> invalid opcode: 0000 [#1]
>>> SMP
>>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
>>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
>>> CPU: 1
>>> EIP: 0060:[<c0132161>] Not tainted VLI
>>> EFLAGS: 00010046 (2.6.21-mm1 #272)
>>> EIP is at insert_work+0x6d/0x71
>>> eax: c1c3b3c0 ebx: c1814aa0 ecx: 00000001 edx: c1814aa0
>>> esi: c1c3b340 edi: 00000282 ebp: c04d2f68 esp: c04d2f50
>>> ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
>>> Process swapper (pid: 0, ti=c04d2000 task=c1c26030 task.ti=c1c20000)
>>> Stack: c4685f54 c18148ac c04d2f98 c013816f c1c3b340 c1814aa0 c04d2f88 c013256c
>>> 00000066 c1814880 c1c5e000 c04d2fc4 c01325a1
>>> c1c5e000 00000100 c012ba35 00000000 c04d2fb8 c01333f4 Call Trace:
>>> [<c0104f27>] [<c0104fe2>] show_stack_log_lvl+0xa5/0xca
>>> show_registers+0x1e2/0x2da
>>> [<c01053f5>] [<c010559a>] do_trap+0x84/0xaa
>>> do_invalid_op+0x88/0x92
>>> [<c0378662>] [<c013256c>] __queue_work+0x22/0x33
>>> delayed_work_timer_fn+0x24/0x2a
>>> [<c012ba35>] [<c01288a9>] __do_softirq+0x75/0xe6
>>> do_softirq+0x63/0xac
>>> [<c0128713>] [<c0116d7e>] smp_apic_timer_interrupt+0x5c/0x88
>>> apic_timer_interrupt+0x28/0x30
>> hm, how come it's so messy?
I have no idea, this is how it appeared in the `nc -ul' output...
> queue_delayed_work().
>
> Probably, cancel_delayed_work(&delayed_work->work) was called with the ->timer
> pending. This is wrong, cancel_delayed_work() clears _PENDING unconditionally,
> that is why the comment says
>
> it is expected that, prior to calling cancel_work_sync(), the caller has
> arranged for the work to not be requeued.
>
> (Just in case, after make-cancel_rearming_delayed_work-reliable.patch this is still
> wrong (as documented) to do cancel_delayed_work() before cancel_delayed_timer(),
> but should work correctly).
>
> ata_port_flush_task() and ata_port_detach() do this, I sent the patch to fix this
> twice. The last one is
>
> [PATCH -mm] libata-core: convert to use cancel_rearming_delayed_work()
> http://marc.info/?l=linux-kernel&m=117840349108505
>
>
> Jiri, any chance you can re-test with the patch below?
Yes. In the meantime I investigated, that the regression is between
broken-out-2007-04-28-05-06
and
special -js edition:
I guess it's time to end the staircase experiment in -mm.
http://userweb.kernel.org/~akpm/js.bz2 is my current rollup (against
2.6.21) minus staircase and related things.
I don't know if it's possible to dig out the patch list from it, otherwise,
2.6.21-mm1 list may be used...
> --- OLD/kernel/workqueue.c~ 2007-05-06 00:01:06.000000000 +0400
> +++ OLD/kernel/workqueue.c 2007-05-08 14:50:39.000000000 +0400
> @@ -103,7 +103,10 @@ static inline void set_wq_data(struct wo
> {
> unsigned long new;
>
> - BUG_ON(!work_pending(work));
> + if (!work_pending(work)) {
> + printk(KERN_ERR "BUG: set_wq_data ");
> + print_symbol("%s\n", (unsigned long) work->func);
> + }
>
> new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
> new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
building and will report,
--
http://www.fi.muni.cz/~xslaby/ Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 10:55 ` Oleg Nesterov
2007-05-08 11:09 ` Jiri Slaby
@ 2007-05-08 11:09 ` Jiri Slaby
2007-05-08 12:00 ` Jiri Slaby
2007-05-08 12:00 ` Jiri Slaby
2007-05-08 12:12 ` Jarek Poplawski
2007-05-08 12:12 ` Jarek Poplawski
3 siblings, 2 replies; 26+ messages in thread
From: Jiri Slaby @ 2007-05-08 11:09 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Rafael J. Wysocki, Pavel Machek, linux-pm,
Linux kernel mailing list
Oleg Nesterov napsal(a):
> On 05/08, Andrew Morton wrote:
>> On Tue, 08 May 2007 10:57:35 +0200 Jiri Slaby <jirislaby@gmail.com> wrote:
>>
>>> this occured in dmesg during resuming from hwsusp in 2.6.21-mm1 (captured
>>> through netconsole). Perfectly reproducible, it simply happens each time I
>>> try it.
>> Let's cc Oleg.
>>
>>> usb_endpoint usbdev5.1_ep00: PM: resume from 0, parent usb5 still 2
>>> ------------[ cut here ]------------
>>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
>>> invalid opcode: 0000 [#1]
>>> SMP
>>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
>>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
>>> CPU: 1
>>> EIP: 0060:[<c0132161>] Not tainted VLI
>>> EFLAGS: 00010046 (2.6.21-mm1 #272)
>>> EIP is at insert_work+0x6d/0x71
>>> eax: c1c3b3c0 ebx: c1814aa0 ecx: 00000001 edx: c1814aa0
>>> esi: c1c3b340 edi: 00000282 ebp: c04d2f68 esp: c04d2f50
>>> ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
>>> Process swapper (pid: 0, ti=c04d2000 task=c1c26030 task.ti=c1c20000)
>>> Stack: c4685f54 c18148ac c04d2f98 c013816f c1c3b340 c1814aa0 c04d2f88 c013256c
>>> 00000066 c1814880 c1c5e000 c04d2fc4 c01325a1
>>> c1c5e000 00000100 c012ba35 00000000 c04d2fb8 c01333f4 Call Trace:
>>> [<c0104f27>] [<c0104fe2>] show_stack_log_lvl+0xa5/0xca
>>> show_registers+0x1e2/0x2da
>>> [<c01053f5>] [<c010559a>] do_trap+0x84/0xaa
>>> do_invalid_op+0x88/0x92
>>> [<c0378662>] [<c013256c>] __queue_work+0x22/0x33
>>> delayed_work_timer_fn+0x24/0x2a
>>> [<c012ba35>] [<c01288a9>] __do_softirq+0x75/0xe6
>>> do_softirq+0x63/0xac
>>> [<c0128713>] [<c0116d7e>] smp_apic_timer_interrupt+0x5c/0x88
>>> apic_timer_interrupt+0x28/0x30
>> hm, how come it's so messy?
I have no idea, this is how it appeared in the `nc -ul' output...
> queue_delayed_work().
>
> Probably, cancel_delayed_work(&delayed_work->work) was called with the ->timer
> pending. This is wrong, cancel_delayed_work() clears _PENDING unconditionally,
> that is why the comment says
>
> it is expected that, prior to calling cancel_work_sync(), the caller has
> arranged for the work to not be requeued.
>
> (Just in case, after make-cancel_rearming_delayed_work-reliable.patch this is still
> wrong (as documented) to do cancel_delayed_work() before cancel_delayed_timer(),
> but should work correctly).
>
> ata_port_flush_task() and ata_port_detach() do this, I sent the patch to fix this
> twice. The last one is
>
> [PATCH -mm] libata-core: convert to use cancel_rearming_delayed_work()
> http://marc.info/?l=linux-kernel&m=117840349108505
>
>
> Jiri, any chance you can re-test with the patch below?
Yes. In the meantime I investigated, that the regression is between
broken-out-2007-04-28-05-06
and
special -js edition:
I guess it's time to end the staircase experiment in -mm.
http://userweb.kernel.org/~akpm/js.bz2 is my current rollup (against
2.6.21) minus staircase and related things.
I don't know if it's possible to dig out the patch list from it, otherwise,
2.6.21-mm1 list may be used...
> --- OLD/kernel/workqueue.c~ 2007-05-06 00:01:06.000000000 +0400
> +++ OLD/kernel/workqueue.c 2007-05-08 14:50:39.000000000 +0400
> @@ -103,7 +103,10 @@ static inline void set_wq_data(struct wo
> {
> unsigned long new;
>
> - BUG_ON(!work_pending(work));
> + if (!work_pending(work)) {
> + printk(KERN_ERR "BUG: set_wq_data ");
> + print_symbol("%s\n", (unsigned long) work->func);
> + }
>
> new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
> new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
building and will report,
--
http://www.fi.muni.cz/~xslaby/ Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 11:09 ` Jiri Slaby
@ 2007-05-08 12:00 ` Jiri Slaby
2007-05-08 12:00 ` Jiri Slaby
1 sibling, 0 replies; 26+ messages in thread
From: Jiri Slaby @ 2007-05-08 12:00 UTC (permalink / raw)
To: Jiri Slaby
Cc: Linux kernel mailing list, Pavel Machek, linux-pm, Andrew Morton,
Oleg Nesterov
Jiri Slaby napsal(a):
> Oleg Nesterov napsal(a):
>> On 05/08, Andrew Morton wrote:
>>> On Tue, 08 May 2007 10:57:35 +0200 Jiri Slaby <jirislaby@gmail.com> wrote:
>>>
>>>> this occured in dmesg during resuming from hwsusp in 2.6.21-mm1 (captured
>>>> through netconsole). Perfectly reproducible, it simply happens each time I
>>>> try it.
>>> Let's cc Oleg.
>>>
>>>> usb_endpoint usbdev5.1_ep00: PM: resume from 0, parent usb5 still 2
>>>> ------------[ cut here ]------------
>>>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
>>>> invalid opcode: 0000 [#1]
>>>> SMP
>>>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
>>>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
>>>> CPU: 1
>>>> EIP: 0060:[<c0132161>] Not tainted VLI
>>>> EFLAGS: 00010046 (2.6.21-mm1 #272)
>>>> EIP is at insert_work+0x6d/0x71
>>>> eax: c1c3b3c0 ebx: c1814aa0 ecx: 00000001 edx: c1814aa0
>>>> esi: c1c3b340 edi: 00000282 ebp: c04d2f68 esp: c04d2f50
>>>> ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
>>>> Process swapper (pid: 0, ti=c04d2000 task=c1c26030 task.ti=c1c20000)
>>>> Stack: c4685f54 c18148ac c04d2f98 c013816f c1c3b340 c1814aa0 c04d2f88 c013256c
>>>> 00000066 c1814880 c1c5e000 c04d2fc4 c01325a1
>>>> c1c5e000 00000100 c012ba35 00000000 c04d2fb8 c01333f4 Call Trace:
>>>> [<c0104f27>] [<c0104fe2>] show_stack_log_lvl+0xa5/0xca
>>>> show_registers+0x1e2/0x2da
>>>> [<c01053f5>] [<c010559a>] do_trap+0x84/0xaa
>>>> do_invalid_op+0x88/0x92
>>>> [<c0378662>] [<c013256c>] __queue_work+0x22/0x33
>>>> delayed_work_timer_fn+0x24/0x2a
>>>> [<c012ba35>] [<c01288a9>] __do_softirq+0x75/0xe6
>>>> do_softirq+0x63/0xac
>>>> [<c0128713>] [<c0116d7e>] smp_apic_timer_interrupt+0x5c/0x88
>>>> apic_timer_interrupt+0x28/0x30
[...]
>> queue_delayed_work().
>>
>> Probably, cancel_delayed_work(&delayed_work->work) was called with the ->timer
>> pending. This is wrong, cancel_delayed_work() clears _PENDING unconditionally,
>> that is why the comment says
>>
>> it is expected that, prior to calling cancel_work_sync(), the caller has
>> arranged for the work to not be requeued.
>>
>> (Just in case, after make-cancel_rearming_delayed_work-reliable.patch this is still
>> wrong (as documented) to do cancel_delayed_work() before cancel_delayed_timer(),
>> but should work correctly).
[...]
>> --- OLD/kernel/workqueue.c~ 2007-05-06 00:01:06.000000000 +0400
>> +++ OLD/kernel/workqueue.c 2007-05-08 14:50:39.000000000 +0400
>> @@ -103,7 +103,10 @@ static inline void set_wq_data(struct wo
>> {
>> unsigned long new;
>>
>> - BUG_ON(!work_pending(work));
>> + if (!work_pending(work)) {
>> + printk(KERN_ERR "BUG: set_wq_data ");
>> + print_symbol("%s\n", (unsigned long) work->func);
>> + }
>>
>> new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
>> new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
vmstat_update+0x0/0x2b
Ccing Christoph.
regards,
--
http://www.fi.muni.cz/~xslaby/ Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 11:09 ` Jiri Slaby
2007-05-08 12:00 ` Jiri Slaby
@ 2007-05-08 12:00 ` Jiri Slaby
2007-05-08 13:48 ` Oleg Nesterov
2007-05-08 13:48 ` Oleg Nesterov
1 sibling, 2 replies; 26+ messages in thread
From: Jiri Slaby @ 2007-05-08 12:00 UTC (permalink / raw)
To: Jiri Slaby
Cc: Oleg Nesterov, Andrew Morton, Rafael J. Wysocki, Pavel Machek,
linux-pm, Linux kernel mailing list
Jiri Slaby napsal(a):
> Oleg Nesterov napsal(a):
>> On 05/08, Andrew Morton wrote:
>>> On Tue, 08 May 2007 10:57:35 +0200 Jiri Slaby <jirislaby@gmail.com> wrote:
>>>
>>>> this occured in dmesg during resuming from hwsusp in 2.6.21-mm1 (captured
>>>> through netconsole). Perfectly reproducible, it simply happens each time I
>>>> try it.
>>> Let's cc Oleg.
>>>
>>>> usb_endpoint usbdev5.1_ep00: PM: resume from 0, parent usb5 still 2
>>>> ------------[ cut here ]------------
>>>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
>>>> invalid opcode: 0000 [#1]
>>>> SMP
>>>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
>>>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
>>>> CPU: 1
>>>> EIP: 0060:[<c0132161>] Not tainted VLI
>>>> EFLAGS: 00010046 (2.6.21-mm1 #272)
>>>> EIP is at insert_work+0x6d/0x71
>>>> eax: c1c3b3c0 ebx: c1814aa0 ecx: 00000001 edx: c1814aa0
>>>> esi: c1c3b340 edi: 00000282 ebp: c04d2f68 esp: c04d2f50
>>>> ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
>>>> Process swapper (pid: 0, ti=c04d2000 task=c1c26030 task.ti=c1c20000)
>>>> Stack: c4685f54 c18148ac c04d2f98 c013816f c1c3b340 c1814aa0 c04d2f88 c013256c
>>>> 00000066 c1814880 c1c5e000 c04d2fc4 c01325a1
>>>> c1c5e000 00000100 c012ba35 00000000 c04d2fb8 c01333f4 Call Trace:
>>>> [<c0104f27>] [<c0104fe2>] show_stack_log_lvl+0xa5/0xca
>>>> show_registers+0x1e2/0x2da
>>>> [<c01053f5>] [<c010559a>] do_trap+0x84/0xaa
>>>> do_invalid_op+0x88/0x92
>>>> [<c0378662>] [<c013256c>] __queue_work+0x22/0x33
>>>> delayed_work_timer_fn+0x24/0x2a
>>>> [<c012ba35>] [<c01288a9>] __do_softirq+0x75/0xe6
>>>> do_softirq+0x63/0xac
>>>> [<c0128713>] [<c0116d7e>] smp_apic_timer_interrupt+0x5c/0x88
>>>> apic_timer_interrupt+0x28/0x30
[...]
>> queue_delayed_work().
>>
>> Probably, cancel_delayed_work(&delayed_work->work) was called with the ->timer
>> pending. This is wrong, cancel_delayed_work() clears _PENDING unconditionally,
>> that is why the comment says
>>
>> it is expected that, prior to calling cancel_work_sync(), the caller has
>> arranged for the work to not be requeued.
>>
>> (Just in case, after make-cancel_rearming_delayed_work-reliable.patch this is still
>> wrong (as documented) to do cancel_delayed_work() before cancel_delayed_timer(),
>> but should work correctly).
[...]
>> --- OLD/kernel/workqueue.c~ 2007-05-06 00:01:06.000000000 +0400
>> +++ OLD/kernel/workqueue.c 2007-05-08 14:50:39.000000000 +0400
>> @@ -103,7 +103,10 @@ static inline void set_wq_data(struct wo
>> {
>> unsigned long new;
>>
>> - BUG_ON(!work_pending(work));
>> + if (!work_pending(work)) {
>> + printk(KERN_ERR "BUG: set_wq_data ");
>> + print_symbol("%s\n", (unsigned long) work->func);
>> + }
>>
>> new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
>> new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
vmstat_update+0x0/0x2b
Ccing Christoph.
regards,
--
http://www.fi.muni.cz/~xslaby/ Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 12:00 ` Jiri Slaby
@ 2007-05-08 13:48 ` Oleg Nesterov
2007-05-08 14:20 ` Oleg Nesterov
2007-05-08 14:20 ` 2.6.21-mm1 hwsusp: BUG at workqueue.c:106 Oleg Nesterov
2007-05-08 13:48 ` Oleg Nesterov
1 sibling, 2 replies; 26+ messages in thread
From: Oleg Nesterov @ 2007-05-08 13:48 UTC (permalink / raw)
To: Jiri Slaby
Cc: Andrew Morton, Rafael J. Wysocki, Pavel Machek, linux-pm,
Linux kernel mailing list
On 05/08, Jiri Slaby wrote:
>
> > Oleg Nesterov napsal(a):
> >>>
> >>>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
> >>>> invalid opcode: 0000 [#1]
> >>>> SMP
> >>>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
> >>>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
> >>>> CPU: 1
> >>>> EIP: 0060:[<c0132161>] Not tainted VLI
> >>>> EFLAGS: 00010046 (2.6.21-mm1 #272)
> >>>> EIP is at insert_work+0x6d/0x71
> [...]
> >> --- OLD/kernel/workqueue.c~ 2007-05-06 00:01:06.000000000 +0400
> >> +++ OLD/kernel/workqueue.c 2007-05-08 14:50:39.000000000 +0400
> >> @@ -103,7 +103,10 @@ static inline void set_wq_data(struct wo
> >> {
> >> unsigned long new;
> >>
> >> - BUG_ON(!work_pending(work));
> >> + if (!work_pending(work)) {
> >> + printk(KERN_ERR "BUG: set_wq_data ");
> >> + print_symbol("%s\n", (unsigned long) work->func);
> >> + }
> >>
> >> new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
> >> new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
>
> vmstat_update+0x0/0x2b
Thanks a lot.
I know nothing about hwsusp, to the point I don't even know what it does.
I'll try to do some reading tomorrow.
Right now,
> +static void vmstat_update(struct work_struct *w)
> +{
> + refresh_cpu_vm_stats(smp_processor_id());
> + schedule_delayed_work(&__get_cpu_var(vmstat_work),
> + sysctl_stat_interval);
> +}
This is not precisely correct. We cam schedule the wrong vmstat_work
if this timer/work migrates to another CPU. I'd suggest
schedule_delayed_work(container_of(w, struct delayed_work, work))
This should not happen because we are doing cancel_rearming_delayed_work()
below, however:
> + case CPU_DOWN_PREPARE:
> + case CPU_DOWN_PREPARE_FROZEN:
> + cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> + per_cpu(vmstat_work, cpu).work.func = NULL;
> + case CPU_DOWN_FAILED:
> + case CPU_DOWN_FAILED_FROZEN:
> + start_cpu_timer(cpu);
we need a "break;" before "case CPU_DOWN_FAILED", otherwise we re-start
vmstat_update() immediately.
This is a bug, but I am not sure is this the only problem.
Oleg.
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 13:48 ` Oleg Nesterov
@ 2007-05-08 14:20 ` Oleg Nesterov
2007-05-08 14:30 ` [PATCH 1/1] vmstat-use-our-own-timer-events-fix Jiri Slaby
2007-05-08 14:30 ` Jiri Slaby
2007-05-08 14:20 ` 2.6.21-mm1 hwsusp: BUG at workqueue.c:106 Oleg Nesterov
1 sibling, 2 replies; 26+ messages in thread
From: Oleg Nesterov @ 2007-05-08 14:20 UTC (permalink / raw)
To: Jiri Slaby
Cc: Andrew Morton, Rafael J. Wysocki, Pavel Machek, linux-pm,
Linux kernel mailing list
On 05/08, Oleg Nesterov wrote:
>
> On 05/08, Jiri Slaby wrote:
> >
> > vmstat_update+0x0/0x2b
>
> Thanks a lot.
>
> Right now,
>
> > +static void vmstat_update(struct work_struct *w)
> > +{
> > + refresh_cpu_vm_stats(smp_processor_id());
> > + schedule_delayed_work(&__get_cpu_var(vmstat_work),
> > + sysctl_stat_interval);
> > +}
>
> This is not precisely correct. We cam schedule the wrong vmstat_work
> if this timer/work migrates to another CPU. I'd suggest
>
> schedule_delayed_work(container_of(w, struct delayed_work, work))
>
> This should not happen because we are doing cancel_rearming_delayed_work()
> below, however:
>
> > + case CPU_DOWN_PREPARE:
> > + case CPU_DOWN_PREPARE_FROZEN:
> > + cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> > + per_cpu(vmstat_work, cpu).work.func = NULL;
> > + case CPU_DOWN_FAILED:
> > + case CPU_DOWN_FAILED_FROZEN:
> > + start_cpu_timer(cpu);
>
> we need a "break;" before "case CPU_DOWN_FAILED", otherwise we re-start
> vmstat_update() immediately.
>
> This is a bug, but I am not sure is this the only problem.
In case I was not clear, this _can_ explain the problem. Because an extra
start_cpu_timer() (due to missed "break;") re-initializes dwork, and clears
_PENDING.
Oleg.
^ permalink raw reply [flat|nested] 26+ messages in thread* [PATCH 1/1] vmstat-use-our-own-timer-events-fix
2007-05-08 14:20 ` Oleg Nesterov
@ 2007-05-08 14:30 ` Jiri Slaby
2007-05-08 14:30 ` Jiri Slaby
1 sibling, 0 replies; 26+ messages in thread
From: Jiri Slaby @ 2007-05-08 14:30 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Pavel Machek, linux-pm, Arjan van de Ven,
Oleg Nesterov, Christoph Lameter
Oleg Nesterov wrote:
> On 05/08, Oleg Nesterov wrote:
> >
> > On 05/08, Jiri Slaby wrote:
> >
> > This should not happen because we are doing cancel_rearming_delayed_work()
> > below, however:
> >
> > > + case CPU_DOWN_PREPARE:
> > > + case CPU_DOWN_PREPARE_FROZEN:
> > > + cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> > > + per_cpu(vmstat_work, cpu).work.func = NULL;
> > > + case CPU_DOWN_FAILED:
> > > + case CPU_DOWN_FAILED_FROZEN:
> > > + start_cpu_timer(cpu);
> >
> > we need a "break;" before "case CPU_DOWN_FAILED", otherwise we re-start
> > vmstat_update() immediately.
> >
> > This is a bug, but I am not sure is this the only problem.
>
> In case I was not clear, this _can_ explain the problem. Because an extra
> start_cpu_timer() (due to missed "break;") re-initializes dwork, and clears
> _PENDING.
Yes, it helped, thanks. If no objections, please apply the patch below.
--
vmstat-use-our-own-timer-events-fix
add missing break
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
---
commit e0fe3cf4776a3c16ac7625ce9eae75ea42eb9b4e
tree c9a4277860e2cea03911b3d73c9386c9284d991b
parent b1034323f7e57642eee59a0f5e77c50bc26982c5
author Jiri Slaby <jirislaby@gmail.com> Tue, 08 May 2007 16:25:28 +0200
committer Jiri Slaby <jirislaby@gmail.com> Tue, 08 May 2007 16:25:28 +0200
mm/vmstat.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 709e2f0..fac05e8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -677,6 +677,7 @@ static int __cpuinit vmstat_cpuup_callback(struct notifier_block *nfb,
case CPU_DOWN_PREPARE_FROZEN:
cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
per_cpu(vmstat_work, cpu).work.func = NULL;
+ break;
case CPU_DOWN_FAILED:
case CPU_DOWN_FAILED_FROZEN:
start_cpu_timer(cpu);
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 1/1] vmstat-use-our-own-timer-events-fix
2007-05-08 14:20 ` Oleg Nesterov
2007-05-08 14:30 ` [PATCH 1/1] vmstat-use-our-own-timer-events-fix Jiri Slaby
@ 2007-05-08 14:30 ` Jiri Slaby
2007-05-08 14:42 ` Oleg Nesterov
` (3 more replies)
1 sibling, 4 replies; 26+ messages in thread
From: Jiri Slaby @ 2007-05-08 14:30 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Arjan van de Ven, Oleg Nesterov,
Rafael J. Wysocki, Pavel Machek, linux-pm, linux-kernel
Oleg Nesterov wrote:
> On 05/08, Oleg Nesterov wrote:
> >
> > On 05/08, Jiri Slaby wrote:
> >
> > This should not happen because we are doing cancel_rearming_delayed_work()
> > below, however:
> >
> > > + case CPU_DOWN_PREPARE:
> > > + case CPU_DOWN_PREPARE_FROZEN:
> > > + cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> > > + per_cpu(vmstat_work, cpu).work.func = NULL;
> > > + case CPU_DOWN_FAILED:
> > > + case CPU_DOWN_FAILED_FROZEN:
> > > + start_cpu_timer(cpu);
> >
> > we need a "break;" before "case CPU_DOWN_FAILED", otherwise we re-start
> > vmstat_update() immediately.
> >
> > This is a bug, but I am not sure is this the only problem.
>
> In case I was not clear, this _can_ explain the problem. Because an extra
> start_cpu_timer() (due to missed "break;") re-initializes dwork, and clears
> _PENDING.
Yes, it helped, thanks. If no objections, please apply the patch below.
--
vmstat-use-our-own-timer-events-fix
add missing break
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
---
commit e0fe3cf4776a3c16ac7625ce9eae75ea42eb9b4e
tree c9a4277860e2cea03911b3d73c9386c9284d991b
parent b1034323f7e57642eee59a0f5e77c50bc26982c5
author Jiri Slaby <jirislaby@gmail.com> Tue, 08 May 2007 16:25:28 +0200
committer Jiri Slaby <jirislaby@gmail.com> Tue, 08 May 2007 16:25:28 +0200
mm/vmstat.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 709e2f0..fac05e8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -677,6 +677,7 @@ static int __cpuinit vmstat_cpuup_callback(struct notifier_block *nfb,
case CPU_DOWN_PREPARE_FROZEN:
cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
per_cpu(vmstat_work, cpu).work.func = NULL;
+ break;
case CPU_DOWN_FAILED:
case CPU_DOWN_FAILED_FROZEN:
start_cpu_timer(cpu);
^ permalink raw reply related [flat|nested] 26+ messages in thread* Re: [PATCH 1/1] vmstat-use-our-own-timer-events-fix
2007-05-08 14:30 ` Jiri Slaby
@ 2007-05-08 14:42 ` Oleg Nesterov
2007-05-08 14:42 ` Oleg Nesterov
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2007-05-08 14:42 UTC (permalink / raw)
To: Jiri Slaby
Cc: linux-kernel, Pavel Machek, Andrew Morton, Arjan van de Ven,
linux-pm, Christoph Lameter
On 05/08, Jiri Slaby wrote:
>
> Oleg Nesterov wrote:
> > On 05/08, Oleg Nesterov wrote:
> > >
> > > On 05/08, Jiri Slaby wrote:
> > >
> > > This should not happen because we are doing cancel_rearming_delayed_work()
> > > below, however:
> > >
> > > > + case CPU_DOWN_PREPARE:
> > > > + case CPU_DOWN_PREPARE_FROZEN:
> > > > + cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> > > > + per_cpu(vmstat_work, cpu).work.func = NULL;
> > > > + case CPU_DOWN_FAILED:
> > > > + case CPU_DOWN_FAILED_FROZEN:
> > > > + start_cpu_timer(cpu);
> > >
> > > we need a "break;" before "case CPU_DOWN_FAILED", otherwise we re-start
> > > vmstat_update() immediately.
> > >
> > > This is a bug, but I am not sure is this the only problem.
> >
> > In case I was not clear, this _can_ explain the problem. Because an extra
> > start_cpu_timer() (due to missed "break;") re-initializes dwork, and clears
> > _PENDING.
>
> Yes, it helped, thanks. If no objections, please apply the patch below.
Great! Thanks.
>
> vmstat-use-our-own-timer-events-fix
>
> add missing break
>
> Cc: Christoph Lameter <clameter@sgi.com>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
>
> ---
> commit e0fe3cf4776a3c16ac7625ce9eae75ea42eb9b4e
> tree c9a4277860e2cea03911b3d73c9386c9284d991b
> parent b1034323f7e57642eee59a0f5e77c50bc26982c5
> author Jiri Slaby <jirislaby@gmail.com> Tue, 08 May 2007 16:25:28 +0200
> committer Jiri Slaby <jirislaby@gmail.com> Tue, 08 May 2007 16:25:28 +0200
>
> mm/vmstat.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 709e2f0..fac05e8 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -677,6 +677,7 @@ static int __cpuinit vmstat_cpuup_callback(struct notifier_block *nfb,
> case CPU_DOWN_PREPARE_FROZEN:
> cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> per_cpu(vmstat_work, cpu).work.func = NULL;
> + break;
> case CPU_DOWN_FAILED:
> case CPU_DOWN_FAILED_FROZEN:
> start_cpu_timer(cpu);
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 1/1] vmstat-use-our-own-timer-events-fix
2007-05-08 14:30 ` Jiri Slaby
2007-05-08 14:42 ` Oleg Nesterov
@ 2007-05-08 14:42 ` Oleg Nesterov
2007-05-08 16:08 ` Christoph Lameter
2007-05-08 16:08 ` Christoph Lameter
3 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2007-05-08 14:42 UTC (permalink / raw)
To: Jiri Slaby
Cc: Andrew Morton, Christoph Lameter, Arjan van de Ven,
Rafael J. Wysocki, Pavel Machek, linux-pm, linux-kernel
On 05/08, Jiri Slaby wrote:
>
> Oleg Nesterov wrote:
> > On 05/08, Oleg Nesterov wrote:
> > >
> > > On 05/08, Jiri Slaby wrote:
> > >
> > > This should not happen because we are doing cancel_rearming_delayed_work()
> > > below, however:
> > >
> > > > + case CPU_DOWN_PREPARE:
> > > > + case CPU_DOWN_PREPARE_FROZEN:
> > > > + cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> > > > + per_cpu(vmstat_work, cpu).work.func = NULL;
> > > > + case CPU_DOWN_FAILED:
> > > > + case CPU_DOWN_FAILED_FROZEN:
> > > > + start_cpu_timer(cpu);
> > >
> > > we need a "break;" before "case CPU_DOWN_FAILED", otherwise we re-start
> > > vmstat_update() immediately.
> > >
> > > This is a bug, but I am not sure is this the only problem.
> >
> > In case I was not clear, this _can_ explain the problem. Because an extra
> > start_cpu_timer() (due to missed "break;") re-initializes dwork, and clears
> > _PENDING.
>
> Yes, it helped, thanks. If no objections, please apply the patch below.
Great! Thanks.
>
> vmstat-use-our-own-timer-events-fix
>
> add missing break
>
> Cc: Christoph Lameter <clameter@sgi.com>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
>
> ---
> commit e0fe3cf4776a3c16ac7625ce9eae75ea42eb9b4e
> tree c9a4277860e2cea03911b3d73c9386c9284d991b
> parent b1034323f7e57642eee59a0f5e77c50bc26982c5
> author Jiri Slaby <jirislaby@gmail.com> Tue, 08 May 2007 16:25:28 +0200
> committer Jiri Slaby <jirislaby@gmail.com> Tue, 08 May 2007 16:25:28 +0200
>
> mm/vmstat.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 709e2f0..fac05e8 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -677,6 +677,7 @@ static int __cpuinit vmstat_cpuup_callback(struct notifier_block *nfb,
> case CPU_DOWN_PREPARE_FROZEN:
> cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> per_cpu(vmstat_work, cpu).work.func = NULL;
> + break;
> case CPU_DOWN_FAILED:
> case CPU_DOWN_FAILED_FROZEN:
> start_cpu_timer(cpu);
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/1] vmstat-use-our-own-timer-events-fix
2007-05-08 14:30 ` Jiri Slaby
2007-05-08 14:42 ` Oleg Nesterov
2007-05-08 14:42 ` Oleg Nesterov
@ 2007-05-08 16:08 ` Christoph Lameter
2007-05-08 16:08 ` Christoph Lameter
3 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-05-08 16:08 UTC (permalink / raw)
To: Jiri Slaby
Cc: linux-kernel, Pavel Machek, linux-pm, Andrew Morton,
Arjan van de Ven, Oleg Nesterov
On Tue, 8 May 2007, Jiri Slaby wrote:
> Yes, it helped, thanks. If no objections, please apply the patch below.
Acked-by: Christoph Lameter <clameter@sgi.com>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/1] vmstat-use-our-own-timer-events-fix
2007-05-08 14:30 ` Jiri Slaby
` (2 preceding siblings ...)
2007-05-08 16:08 ` Christoph Lameter
@ 2007-05-08 16:08 ` Christoph Lameter
3 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-05-08 16:08 UTC (permalink / raw)
To: Jiri Slaby
Cc: Andrew Morton, Arjan van de Ven, Oleg Nesterov, Rafael J. Wysocki,
Pavel Machek, linux-pm, linux-kernel
On Tue, 8 May 2007, Jiri Slaby wrote:
> Yes, it helped, thanks. If no objections, please apply the patch below.
Acked-by: Christoph Lameter <clameter@sgi.com>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 13:48 ` Oleg Nesterov
2007-05-08 14:20 ` Oleg Nesterov
@ 2007-05-08 14:20 ` Oleg Nesterov
1 sibling, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2007-05-08 14:20 UTC (permalink / raw)
To: Jiri Slaby
Cc: Andrew Morton, linux-pm, Pavel Machek, Linux kernel mailing list
On 05/08, Oleg Nesterov wrote:
>
> On 05/08, Jiri Slaby wrote:
> >
> > vmstat_update+0x0/0x2b
>
> Thanks a lot.
>
> Right now,
>
> > +static void vmstat_update(struct work_struct *w)
> > +{
> > + refresh_cpu_vm_stats(smp_processor_id());
> > + schedule_delayed_work(&__get_cpu_var(vmstat_work),
> > + sysctl_stat_interval);
> > +}
>
> This is not precisely correct. We cam schedule the wrong vmstat_work
> if this timer/work migrates to another CPU. I'd suggest
>
> schedule_delayed_work(container_of(w, struct delayed_work, work))
>
> This should not happen because we are doing cancel_rearming_delayed_work()
> below, however:
>
> > + case CPU_DOWN_PREPARE:
> > + case CPU_DOWN_PREPARE_FROZEN:
> > + cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> > + per_cpu(vmstat_work, cpu).work.func = NULL;
> > + case CPU_DOWN_FAILED:
> > + case CPU_DOWN_FAILED_FROZEN:
> > + start_cpu_timer(cpu);
>
> we need a "break;" before "case CPU_DOWN_FAILED", otherwise we re-start
> vmstat_update() immediately.
>
> This is a bug, but I am not sure is this the only problem.
In case I was not clear, this _can_ explain the problem. Because an extra
start_cpu_timer() (due to missed "break;") re-initializes dwork, and clears
_PENDING.
Oleg.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 12:00 ` Jiri Slaby
2007-05-08 13:48 ` Oleg Nesterov
@ 2007-05-08 13:48 ` Oleg Nesterov
1 sibling, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2007-05-08 13:48 UTC (permalink / raw)
To: Jiri Slaby
Cc: Andrew Morton, linux-pm, Pavel Machek, Linux kernel mailing list
On 05/08, Jiri Slaby wrote:
>
> > Oleg Nesterov napsal(a):
> >>>
> >>>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
> >>>> invalid opcode: 0000 [#1]
> >>>> SMP
> >>>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
> >>>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
> >>>> CPU: 1
> >>>> EIP: 0060:[<c0132161>] Not tainted VLI
> >>>> EFLAGS: 00010046 (2.6.21-mm1 #272)
> >>>> EIP is at insert_work+0x6d/0x71
> [...]
> >> --- OLD/kernel/workqueue.c~ 2007-05-06 00:01:06.000000000 +0400
> >> +++ OLD/kernel/workqueue.c 2007-05-08 14:50:39.000000000 +0400
> >> @@ -103,7 +103,10 @@ static inline void set_wq_data(struct wo
> >> {
> >> unsigned long new;
> >>
> >> - BUG_ON(!work_pending(work));
> >> + if (!work_pending(work)) {
> >> + printk(KERN_ERR "BUG: set_wq_data ");
> >> + print_symbol("%s\n", (unsigned long) work->func);
> >> + }
> >>
> >> new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
> >> new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
>
> vmstat_update+0x0/0x2b
Thanks a lot.
I know nothing about hwsusp, to the point I don't even know what it does.
I'll try to do some reading tomorrow.
Right now,
> +static void vmstat_update(struct work_struct *w)
> +{
> + refresh_cpu_vm_stats(smp_processor_id());
> + schedule_delayed_work(&__get_cpu_var(vmstat_work),
> + sysctl_stat_interval);
> +}
This is not precisely correct. We cam schedule the wrong vmstat_work
if this timer/work migrates to another CPU. I'd suggest
schedule_delayed_work(container_of(w, struct delayed_work, work))
This should not happen because we are doing cancel_rearming_delayed_work()
below, however:
> + case CPU_DOWN_PREPARE:
> + case CPU_DOWN_PREPARE_FROZEN:
> + cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
> + per_cpu(vmstat_work, cpu).work.func = NULL;
> + case CPU_DOWN_FAILED:
> + case CPU_DOWN_FAILED_FROZEN:
> + start_cpu_timer(cpu);
we need a "break;" before "case CPU_DOWN_FAILED", otherwise we re-start
vmstat_update() immediately.
This is a bug, but I am not sure is this the only problem.
Oleg.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 10:55 ` Oleg Nesterov
2007-05-08 11:09 ` Jiri Slaby
2007-05-08 11:09 ` Jiri Slaby
@ 2007-05-08 12:12 ` Jarek Poplawski
2007-05-08 12:40 ` Oleg Nesterov
2007-05-08 12:40 ` Oleg Nesterov
2007-05-08 12:12 ` Jarek Poplawski
3 siblings, 2 replies; 26+ messages in thread
From: Jarek Poplawski @ 2007-05-08 12:12 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Jiri Slaby, Rafael J. Wysocki, Pavel Machek,
linux-pm, Linux kernel mailing list
On 08-05-2007 12:55, Oleg Nesterov wrote:
> On 05/08, Andrew Morton wrote:
>> On Tue, 08 May 2007 10:57:35 +0200 Jiri Slaby <jirislaby@gmail.com> wrote:
>>
>>> this occured in dmesg during resuming from hwsusp in 2.6.21-mm1 (captured
>>> through netconsole). Perfectly reproducible, it simply happens each time I
>>> try it.
>> Let's cc Oleg.
>>
>>> usb_endpoint usbdev5.1_ep00: PM: resume from 0, parent usb5 still 2
>>> ------------[ cut here ]------------
>>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
>>> invalid opcode: 0000 [#1]
>>> SMP
>>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
>>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
...
> queue_delayed_work().
>
> Probably, cancel_delayed_work(&delayed_work->work) was called with the ->timer
> pending. This is wrong, cancel_delayed_work() clears _PENDING unconditionally,
Maybe I miss your point, but clearing is conditional: on timer delete...
I think more suspicious is calling cancel_work_sync() for a delayed work
(with timer pending). Or maybe some race profits from _PENDING cleared
without locking?
BTW, it seems some debugging is needed to show, whose work is doing the
mess.
Regards,
Jarek P.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 12:12 ` Jarek Poplawski
@ 2007-05-08 12:40 ` Oleg Nesterov
2007-05-08 12:40 ` Oleg Nesterov
1 sibling, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2007-05-08 12:40 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Andrew Morton, Jiri Slaby, Rafael J. Wysocki, Pavel Machek,
linux-pm, Linux kernel mailing list
On 05/08, Jarek Poplawski wrote:
>
> On 08-05-2007 12:55, Oleg Nesterov wrote:
> > On 05/08, Andrew Morton wrote:
> >> On Tue, 08 May 2007 10:57:35 +0200 Jiri Slaby <jirislaby@gmail.com> wrote:
> >>
> >>> this occured in dmesg during resuming from hwsusp in 2.6.21-mm1 (captured
> >>> through netconsole). Perfectly reproducible, it simply happens each time I
> >>> try it.
> >> Let's cc Oleg.
> >>
> >>> usb_endpoint usbdev5.1_ep00: PM: resume from 0, parent usb5 still 2
> >>> ------------[ cut here ]------------
> >>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
> >>> invalid opcode: 0000 [#1]
> >>> SMP
> >>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
> >>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
> ...
> > queue_delayed_work().
> >
> > Probably, cancel_delayed_work(&delayed_work->work) was called with the ->timer
> > pending. This is wrong, cancel_delayed_work() clears _PENDING unconditionally,
>
> Maybe I miss your point, but clearing is conditional: on timer delete...
>
> I think more suspicious is calling cancel_work_sync() for a delayed work
> (with timer pending). Or maybe some race profits from _PENDING cleared
> without locking?
Yes, of course, I meant cancel_work_sync(), sorry for the confusion.
Thanks!
So, once again, cancel_work_sync(&dwork->work) is wrong unless the timer
was stopped.
Before make-cancel_rearming_delayed_work-reliable.patch
it requires that the @work can't be re-queued.
After
works, but waits for the timer expiration in a busy-wait loop.
Oleg.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 12:12 ` Jarek Poplawski
2007-05-08 12:40 ` Oleg Nesterov
@ 2007-05-08 12:40 ` Oleg Nesterov
1 sibling, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2007-05-08 12:40 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Jiri Slaby, Linux kernel mailing list, Pavel Machek,
Andrew Morton, linux-pm
On 05/08, Jarek Poplawski wrote:
>
> On 08-05-2007 12:55, Oleg Nesterov wrote:
> > On 05/08, Andrew Morton wrote:
> >> On Tue, 08 May 2007 10:57:35 +0200 Jiri Slaby <jirislaby@gmail.com> wrote:
> >>
> >>> this occured in dmesg during resuming from hwsusp in 2.6.21-mm1 (captured
> >>> through netconsole). Perfectly reproducible, it simply happens each time I
> >>> try it.
> >> Let's cc Oleg.
> >>
> >>> usb_endpoint usbdev5.1_ep00: PM: resume from 0, parent usb5 still 2
> >>> ------------[ cut here ]------------
> >>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
> >>> invalid opcode: 0000 [#1]
> >>> SMP
> >>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
> >>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
> ...
> > queue_delayed_work().
> >
> > Probably, cancel_delayed_work(&delayed_work->work) was called with the ->timer
> > pending. This is wrong, cancel_delayed_work() clears _PENDING unconditionally,
>
> Maybe I miss your point, but clearing is conditional: on timer delete...
>
> I think more suspicious is calling cancel_work_sync() for a delayed work
> (with timer pending). Or maybe some race profits from _PENDING cleared
> without locking?
Yes, of course, I meant cancel_work_sync(), sorry for the confusion.
Thanks!
So, once again, cancel_work_sync(&dwork->work) is wrong unless the timer
was stopped.
Before make-cancel_rearming_delayed_work-reliable.patch
it requires that the @work can't be re-queued.
After
works, but waits for the timer expiration in a busy-wait loop.
Oleg.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: 2.6.21-mm1 hwsusp: BUG at workqueue.c:106
2007-05-08 10:55 ` Oleg Nesterov
` (2 preceding siblings ...)
2007-05-08 12:12 ` Jarek Poplawski
@ 2007-05-08 12:12 ` Jarek Poplawski
3 siblings, 0 replies; 26+ messages in thread
From: Jarek Poplawski @ 2007-05-08 12:12 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Jiri Slaby, Linux kernel mailing list, Pavel Machek,
Andrew Morton, linux-pm
On 08-05-2007 12:55, Oleg Nesterov wrote:
> On 05/08, Andrew Morton wrote:
>> On Tue, 08 May 2007 10:57:35 +0200 Jiri Slaby <jirislaby@gmail.com> wrote:
>>
>>> this occured in dmesg during resuming from hwsusp in 2.6.21-mm1 (captured
>>> through netconsole). Perfectly reproducible, it simply happens each time I
>>> try it.
>> Let's cc Oleg.
>>
>>> usb_endpoint usbdev5.1_ep00: PM: resume from 0, parent usb5 still 2
>>> ------------[ cut here ]------------
>>> kernel BUG at /home/l/latest/xxx/kernel/workqueue.c:106!
>>> invalid opcode: 0000 [#1]
>>> SMP
>>> Modules linked in: ipv6 floppy ohci1394 ieee1394 parport_pc parport usbhid
>>> ehci_hcd pata_acpi ff_memless sr_mod cdrom
...
> queue_delayed_work().
>
> Probably, cancel_delayed_work(&delayed_work->work) was called with the ->timer
> pending. This is wrong, cancel_delayed_work() clears _PENDING unconditionally,
Maybe I miss your point, but clearing is conditional: on timer delete...
I think more suspicious is calling cancel_work_sync() for a delayed work
(with timer pending). Or maybe some race profits from _PENDING cleared
without locking?
BTW, it seems some debugging is needed to show, whose work is doing the
mess.
Regards,
Jarek P.
^ permalink raw reply [flat|nested] 26+ messages in thread