* next-20081119: general protection fault: get_next_timer_interrupt()
@ 2008-11-19 15:14 Alexander Beregalov
2008-11-19 21:14 ` Thomas Gleixner
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Beregalov @ 2008-11-19 15:14 UTC (permalink / raw)
To: LKML, linux-next, tglx, mingo
Hi
This is a 4-way x86_64 machine.
The kernel does not boot.
...
scsi0 : LSI SAS based MegaRAID driver
Driver 'sd' needs updating - please use bus_type methods
scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5
general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file:
CPU 3
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.28-rc5-next-20081119 #6
RIP: 0010:[<ffffffff80240061>] [<ffffffff80240061>]
get_next_timer_interrupt+0x11b/0x1f0
RSP: 0018:ffff88007dfe3e98 EFLAGS: 00010016
RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007dfd0000 RCX: 000000000000003e
RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000037 RDI: 6b6b6b6b6b6b6b6b
RBP: ffff88007dfe3ef8 R08: ffff88007dfe3ea8 R09: 0000000000fffeb7
R10: 0000000000000001 R11: ffff88007dfd1430 R12: 000000013ffeb69c
R13: 00000000fffeb69c R14: ffff88007dfd1050 R15: 0000000000000040
FS: 0000000000000000(0000) GS:ffff88007f402a28(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88007dfd6000, task ffff88007dfa4bc0)
Stack:
ffff88008486d000 ffff8800051390c0 ffff88007dfd1050 ffff88007dfd1450
ffff88007dfd1850 ffff88007dfd1c50 ffff880001431780 ffffffff80705000
ffff880001431780 ffff880004f72010 0000000000000000 0000000000000003
Call Trace:
<IRQ> <0> [<ffffffff80253f68>] tick_nohz_stop_sched_tick+0x17b/0x37f
[<ffffffff8023c023>] ? __do_softirq+0xf8/0x101
[<ffffffff8023bbec>] irq_exit+0x91/0xa2
[<ffffffff8021d05d>] smp_apic_timer_interrupt+0xa4/0xce
[<ffffffff8020c610>] apic_timer_interrupt+0x70/0x80
<EOI> <0> [<ffffffff80212df9>] ? mwait_idle+0x3e/0x48
[<ffffffff80212df0>] ? mwait_idle+0x35/0x48
[<ffffffff8020a930>] ? cpu_idle+0x51/0xba
[<ffffffff804dc06d>] ? start_secondary+0x185/0x18a
Code: 30 83 e6 3f 89 f1 48 63 c1 48 c1 e0 04 4a 8b 14 30 4d 8d 1c 06
eb 14 48 8b 42 10 41 ba 01 00 00 00 4c 39 e0 48 89 fa 4c 0f 48 e0 <48>
8b 3a 4c 39 da 0f 18 0f 75 e1 45 85 d2 74 10 85 f6 74 04 39
RIP [<ffffffff80240061>] get_next_timer_interrupt+0x11b/0x1f0
RSP <ffff88007dfe3e98>
---[ end trace c0adcd63427ce040 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
Pid: 0, comm: swapper Tainted: G D 2.6.28-rc5-next-20081119 #6
Call Trace:
<IRQ> [<ffffffff804de999>] panic+0xaa/0x15f
[<ffffffff804deab5>] ? printk+0x67/0x6a
[<ffffffff803ba593>] ? account+0xde/0xed
[<ffffffff803ba6e5>] ? extract_entropy+0x50/0x97
[<ffffffff80239c6e>] do_exit+0x70/0x891
[<ffffffff803ba816>] ? get_random_bytes+0x1b/0x1d
[<ffffffff8020f650>] oops_end+0x89/0x8e
[<ffffffff8020f81c>] die+0x55/0x5e
[<ffffffff8020d9d8>] do_general_protection+0x11e/0x126
[<ffffffff8020d8ba>] ? do_general_protection+0x0/0x126
[<ffffffff804e1f22>] error_exit+0x0/0xa9
[<ffffffff80240061>] ? get_next_timer_interrupt+0x11b/0x1f0
[<ffffffff8023ff83>] ? get_next_timer_interrupt+0x3d/0x1f0
[<ffffffff80253f68>] tick_nohz_stop_sched_tick+0x17b/0x37f
[<ffffffff8023c023>] ? __do_softirq+0xf8/0x101
[<ffffffff8023bbec>] irq_exit+0x91/0xa2
[<ffffffff8021d05d>] smp_apic_timer_interrupt+0xa4/0xce
[<ffffffff8020c610>] apic_timer_interrupt+0x70/0x80
<EOI> [<ffffffff80212df9>] ? mwait_idle+0x3e/0x48
[<ffffffff80212df0>] ? mwait_idle+0x35/0x48
[<ffffffff8020a930>] ? cpu_idle+0x51/0xba
[<ffffffff804dc06d>] ? start_secondary+0x185/0x18a
------------[ cut here ]------------
WARNING: at kernel/smp.c:333 smp_call_function_mask+0x40/0x1e6()
Modules linked in:
Pid: 0, comm: swapper Tainted: G D 2.6.28-rc5-next-20081119 #6
Call Trace:
<IRQ> [<ffffffff80236baf>] warn_on_slowpath+0x58/0x7d
[<ffffffff804deab5>] ? printk+0x67/0x6a
[<ffffffff804dc06d>] ? start_secondary+0x185/0x18a
[<ffffffff804dc06d>] ? start_secondary+0x185/0x18a
[<ffffffff804deab5>] ? printk+0x67/0x6a
[<ffffffff80237864>] ? vprintk+0x312/0x355
[<ffffffff8021d513>] ? touch_nmi_watchdog+0x54/0x58
[<ffffffff80212f6c>] ? stop_this_cpu+0x0/0x24
[<ffffffff8025df57>] smp_call_function_mask+0x40/0x1e6
[<ffffffff80212f6c>] ? stop_this_cpu+0x0/0x24
[<ffffffff8020f9d8>] ? print_context_stack+0xa8/0xc0
[<ffffffff8020eb24>] ? dump_trace+0x25d/0x285
[<ffffffff8020f88a>] ? show_trace_log_lvl+0x4c/0x58
[<ffffffff80212f6c>] ? stop_this_cpu+0x0/0x24
[<ffffffff8025e130>] smp_call_function+0x33/0x6c
[<ffffffff8021bad4>] native_smp_send_stop+0x22/0x48
[<ffffffff804de9a6>] panic+0xb7/0x15f
[<ffffffff804deab5>] ? printk+0x67/0x6a
[<ffffffff803ba593>] ? account+0xde/0xed
[<ffffffff803ba6e5>] ? extract_entropy+0x50/0x97
[<ffffffff80239c6e>] do_exit+0x70/0x891
[<ffffffff803ba816>] ? get_random_bytes+0x1b/0x1d
[<ffffffff8020f650>] oops_end+0x89/0x8e
[<ffffffff8020f81c>] die+0x55/0x5e
[<ffffffff8020d9d8>] do_general_protection+0x11e/0x126
[<ffffffff8020d8ba>] ? do_general_protection+0x0/0x126
[<ffffffff804e1f22>] error_exit+0x0/0xa9
[<ffffffff80240061>] ? get_next_timer_interrupt+0x11b/0x1f0
[<ffffffff8023ff83>] ? get_next_timer_interrupt+0x3d/0x1f0
[<ffffffff80253f68>] tick_nohz_stop_sched_tick+0x17b/0x37f
[<ffffffff8023c023>] ? __do_softirq+0xf8/0x101
[<ffffffff8023bbec>] irq_exit+0x91/0xa2
[<ffffffff8021d05d>] smp_apic_timer_interrupt+0xa4/0xce
[<ffffffff8020c610>] apic_timer_interrupt+0x70/0x80
<EOI> [<ffffffff80212df9>] ? mwait_idle+0x3e/0x48
[<ffffffff80212df0>] ? mwait_idle+0x35/0x48
[<ffffffff8020a930>] ? cpu_idle+0x51/0xba
[<ffffffff804dc06d>] ? start_secondary+0x185/0x18a
---[ end trace c0adcd63427ce040 ]---
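A note on reading the register dump above: RAX, RDX and RDI all hold 0x6b6b6b6b6b6b6b6b. Assuming this is the slab allocator's free poison (the 0x6b byte, POISON_FREE in include/linux/poison.h), the timer code has loaded a list pointer from memory that was already freed. That value is not a canonical x86_64 address, which would explain why the dereference raises a general protection fault rather than a page fault. A minimal sketch of the check in user-space C, assuming 48-bit virtual addresses; the helper names are made up for illustration and are not kernel functions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is canonical under 48-bit virtual
 * addressing, i.e. bits 63..47 are either all zero or all one. */
bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the upper 17 bits */

	return top == 0 || top == 0x1ffff;
}

/* Hypothetical helper: true if a register value is the 0x6b byte repeated
 * eight times, which is assumed here to be the slab free poison. */
bool looks_like_free_poison(uint64_t value)
{
	return value == 0x6b6b6b6b6b6b6b6bULL;
}
```

Applied to the dump, RDX (0x6b6b6b6b6b6b6b6b) fails the canonical check, while RBX (0xffff88007dfd0000) passes it.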
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-19 15:14 next-20081119: general protection fault: get_next_timer_interrupt() Alexander Beregalov
@ 2008-11-19 21:14 ` Thomas Gleixner
2008-11-21 10:50 ` Alexander Beregalov
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2008-11-19 21:14 UTC (permalink / raw)
To: Alexander Beregalov; +Cc: LKML, linux-next, mingo
Alexander,
On Wed, 19 Nov 2008, Alexander Beregalov wrote:
>
> It is 4way X86_64
> The kernel does not boot.
> RIP: 0010:[<ffffffff80240061>] [<ffffffff80240061>]
> get_next_timer_interrupt+0x11b/0x1f0
Can you please enable:
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
and add "debug_objects" to the kernel command line?
Thanks,
tglx
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-19 21:14 ` Thomas Gleixner
@ 2008-11-21 10:50 ` Alexander Beregalov
2008-11-24 17:43 ` Thomas Gleixner
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Beregalov @ 2008-11-21 10:50 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: LKML, linux-next, mingo, linux-scsi, James.Bottomley
2008/11/20 Thomas Gleixner <tglx@linutronix.de>:
> Alexander,
>
> On Wed, 19 Nov 2008, Alexander Beregalov wrote:
>>
>> It is 4way X86_64
>> The kernel does not boot.
>
>> RIP: 0010:[<ffffffff80240061>] [<ffffffff80240061>]
>> get_next_timer_interrupt+0x11b/0x1f0
>
> Can you please enable:
>
> CONFIG_DEBUG_OBJECTS=y
> CONFIG_DEBUG_OBJECTS_FREE=y
> CONFIG_DEBUG_OBJECTS_TIMERS=y
>
> and add "debug_objects" to the kernel command line?
I added these options:
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
hpet0: 3 comparators, 64-bit 14.318180 MHz counter
ODEBUG: object is on stack, but not annotated
------------[ cut here ]------------
WARNING: at lib/debugobjects.c:251 __debug_object_init+0x2bf/0x36d()
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.28-rc5-next-20081119 #9
Call Trace:
[<ffffffff80236ba7>] warn_on_slowpath+0x58/0x7d
[<ffffffff804e2584>] ? printk+0x67/0x6b
[<ffffffff803800bc>] ? __debug_object_init+0x191/0x36d
[<ffffffff803801ea>] __debug_object_init+0x2bf/0x36d
[<ffffffff802977c6>] ? compound_order+0x15/0x26
[<ffffffff803802c5>] debug_object_init+0x14/0x17
[<ffffffff8023fc77>] init_timer+0x18/0x5b
[<ffffffff80220821>] hpet_cpuhp_notify+0x93/0x105
[<ffffffff8022095a>] ? hpet_work+0x0/0x206
[<ffffffff803d1394>] ? hpet_alloc+0x333/0x38f
[<ffffffff80257785>] ? trace_hardirqs_on_caller+0x128/0x153
[<ffffffff802577bd>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff806cf5f5>] ? hpet_late_init+0x0/0x19e
[<ffffffff806cf5f5>] ? hpet_late_init+0x0/0x19e
[<ffffffff806cf75f>] hpet_late_init+0x16a/0x19e
[<ffffffff806cd7c2>] ? print_all_ICs+0x0/0x540
[<ffffffff80209058>] _stext+0x58/0x138
[<ffffffff804e5003>] ? _spin_unlock+0x4a/0x57
[<ffffffff802de241>] ? proc_register+0x17f/0x193
[<ffffffff802de37d>] ? create_proc_entry+0x7e/0x94
[<ffffffff802686e5>] ? register_irq_proc+0xb0/0xcc
[<ffffffff802d0000>] ? do_usbdevfs_bulk+0xf8/0xfe
[<ffffffff806c466d>] kernel_init+0x125/0x179
[<ffffffff804e49ba>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8020c899>] child_rip+0xa/0x11
[<ffffffff8020bd88>] ? restore_args+0x0/0x30
[<ffffffff806c4548>] ? kernel_init+0x0/0x179
[<ffffffff8020c88f>] ? child_rip+0x0/0x11
---[ end trace 4eaa2a86a8e2da22 ]---
<...>
scsi0 : LSI SAS based MegaRAID driver
Driver 'sd' needs updating - please use bus_type methods
scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5
------------[ cut here ]------------
WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57()
ODEBUG: free active object type: timer_list
Modules linked in:
Pid: 580, comm: scsi_scan_0 Tainted: G W 2.6.28-rc5-next-20081119 #9
Call Trace:
[<ffffffff80236b28>] warn_slowpath+0xae/0xd5
[<ffffffff8037f9e8>] ? debug_check_no_obj_freed+0x75/0x1c8
[<ffffffff8037f8b1>] debug_print_object+0x4f/0x57
[<ffffffff8037fa0f>] debug_check_no_obj_freed+0x9c/0x1c8
[<ffffffff8029c7b2>] kmem_cache_free+0x64/0xc0
[<ffffffff8036a6e0>] ? blk_release_queue+0x61/0x66
[<ffffffff8036a6e0>] blk_release_queue+0x61/0x66
[<ffffffff803760f2>] kobject_release+0x52/0x68
[<ffffffff803760a0>] ? kobject_release+0x0/0x68
[<ffffffff80376ec5>] kref_put+0x43/0x4f
[<ffffffff80375ffa>] kobject_put+0x47/0x4b
[<ffffffff80368c53>] blk_cleanup_queue+0x57/0x5c
[<ffffffff803f8729>] scsi_free_queue+0x9/0xb
[<ffffffff803fd3c7>] scsi_device_dev_release_usercontext+0xdc/0x127
[<ffffffff803fd2eb>] ? scsi_device_dev_release_usercontext+0x0/0x127
[<ffffffff802472a8>] execute_in_process_context+0x2a/0x70
[<ffffffff803fd2e9>] scsi_device_dev_release+0x17/0x19
[<ffffffff803e03e0>] device_release+0x43/0x68
[<ffffffff803760f2>] kobject_release+0x52/0x68
[<ffffffff803760a0>] ? kobject_release+0x0/0x68
[<ffffffff80376ec5>] kref_put+0x43/0x4f
[<ffffffff80375ffa>] kobject_put+0x47/0x4b
[<ffffffff803dfd36>] put_device+0x15/0x17
[<ffffffff803fa772>] scsi_destroy_sdev+0x48/0x4c
[<ffffffff803fba05>] scsi_probe_and_add_lun+0xb5d/0xb81
[<ffffffff803faaba>] ? scsi_alloc_target+0x22b/0x267
[<ffffffff803fbcb0>] __scsi_scan_target+0x9d/0x598
[<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153
[<ffffffff804e39a9>] ? __mutex_lock_common+0x371/0x3be
[<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133
[<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153
[<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133
[<ffffffff803fc1fd>] scsi_scan_channel+0x52/0x78
[<ffffffff803fc314>] scsi_scan_host_selected+0xf1/0x133
[<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
[<ffffffff803fc3c1>] do_scsi_scan_host+0x6b/0x70
[<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
[<ffffffff803fc3dd>] do_scan_async+0x17/0x127
[<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
[<ffffffff80249d5d>] kthread+0x49/0x76
[<ffffffff8020c899>] child_rip+0xa/0x11
[<ffffffff8020bd88>] ? restore_args+0x0/0x30
[<ffffffff80249d14>] ? kthread+0x0/0x76
[<ffffffff8020c88f>] ? child_rip+0x0/0x11
---[ end trace 4eaa2a86a8e2da22 ]---
<...>
ata2: port disabled. ignoring.
scsi: waiting for bus probes to complete ...
WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57()
ODEBUG: free active object type: timer_list
Modules linked in:
Pid: 580, comm: scsi_scan_0 Tainted: G W 2.6.28-rc5-next-20081119 #9
Call Trace:
[<ffffffff80236b28>] warn_slowpath+0xae/0xd5
[<ffffffff803925b9>] ? write_vga+0x18/0x4e
[<ffffffff8037f9e8>] ? debug_check_no_obj_freed+0x75/0x1c8
[<ffffffff8037f8b1>] debug_print_object+0x4f/0x57
[<ffffffff8037fa0f>] debug_check_no_obj_freed+0x9c/0x1c8
[<ffffffff8029c7b2>] kmem_cache_free+0x64/0xc0
[<ffffffff8036a6e0>] ? blk_release_queue+0x61/0x66
[<ffffffff8036a6e0>] blk_release_queue+0x61/0x66
[<ffffffff803760f2>] kobject_release+0x52/0x68
[<ffffffff803760a0>] ? kobject_release+0x0/0x68
[<ffffffff80376ec5>] kref_put+0x43/0x4f
[<ffffffff80375ffa>] kobject_put+0x47/0x4b
[<ffffffff80368c53>] blk_cleanup_queue+0x57/0x5c
[<ffffffff803f8729>] scsi_free_queue+0x9/0xb
[<ffffffff803fd3c7>] scsi_device_dev_release_usercontext+0xdc/0x127
[<ffffffff803fd2eb>] ? scsi_device_dev_release_usercontext+0x0/0x127
[<ffffffff802472a8>] execute_in_process_context+0x2a/0x70
[<ffffffff803fd2e9>] scsi_device_dev_release+0x17/0x19
[<ffffffff803e03e0>] device_release+0x43/0x68
[<ffffffff803760f2>] kobject_release+0x52/0x68
[<ffffffff803760a0>] ? kobject_release+0x0/0x68
[<ffffffff80376ec5>] kref_put+0x43/0x4f
[<ffffffff80375ffa>] kobject_put+0x47/0x4b
[<ffffffff803dfd36>] put_device+0x15/0x17
[<ffffffff803fa772>] scsi_destroy_sdev+0x48/0x4c
[<ffffffff803fba05>] scsi_probe_and_add_lun+0xb5d/0xb81
[<ffffffff803faaba>] ? scsi_alloc_target+0x22b/0x267
[<ffffffff803fbcb0>] __scsi_scan_target+0x9d/0x598
[<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153
[<ffffffff804e39a9>] ? __mutex_lock_common+0x371/0x3be
[<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133
[<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153
[<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133
[<ffffffff803fc1fd>] scsi_scan_channel+0x52/0x78
[<ffffffff803fc314>] scsi_scan_host_selected+0xf1/0x133
[<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
[<ffffffff803fc3c1>] do_scsi_scan_host+0x6b/0x70
[<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
[<ffffffff803fc3dd>] do_scan_async+0x17/0x127
[<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
[<ffffffff80249d5d>] kthread+0x49/0x76
[<ffffffff8020c899>] child_rip+0xa/0x11
[<ffffffff8020bd88>] ? restore_args+0x0/0x30
[<ffffffff80249d14>] ? kthread+0x0/0x76
[<ffffffff8020c88f>] ? child_rip+0x0/0x11
---[ end trace 4eaa2a86a8e2da22 ]---
<...>
BUG: using smp_processor_id() in preemptible [00000000] code: init-early.sh/741
caller is sock_prot_inuse_add+0x24/0x42
Pid: 741, comm: init-early.sh Tainted: G W 2.6.28-rc5-next-20081119 #9
Call Trace:
[<ffffffff8037f622>] debug_smp_processor_id+0xca/0xe0
[<ffffffff8046ab5b>] sock_prot_inuse_add+0x24/0x42
[<ffffffff804bb124>] unix_create1+0x161/0x176
[<ffffffff804bb196>] unix_create+0x5d/0x68
[<ffffffff80469368>] __sock_create+0x114/0x17e
[<ffffffff80469420>] sock_create+0x2d/0x2f
[<ffffffff80469623>] sys_socket+0x29/0x5c
[<ffffffff8020b74b>] system_call_fastpath+0x16/0x1b
BUG: using smp_processor_id() in preemptible [00000000] code: init-early.sh/741
caller is sock_prot_inuse_add+0x24/0x42
Pid: 741, comm: init-early.sh Tainted: G W 2.6.28-rc5-next-20081119 #9
Call Trace:
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-21 10:50 ` Alexander Beregalov
@ 2008-11-24 17:43 ` Thomas Gleixner
2008-11-24 19:15 ` James Bottomley
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2008-11-24 17:43 UTC (permalink / raw)
To: Alexander Beregalov
Cc: LKML, linux-next, Ingo Molnar, linux-scsi, James.Bottomley,
David Miller
Alexander,
On Fri, 21 Nov 2008, Alexander Beregalov wrote:
> 2008/11/20 Thomas Gleixner <tglx@linutronix.de>:
> > Alexander,
> >
> > On Wed, 19 Nov 2008, Alexander Beregalov wrote:
> >>
> >> It is 4way X86_64
> >> The kernel does not boot.
> >
> >> RIP: 0010:[<ffffffff80240061>] [<ffffffff80240061>]
> >> get_next_timer_interrupt+0x11b/0x1f0
> >
> > Can you please enable:
> >
> > CONFIG_DEBUG_OBJECTS=y
> > CONFIG_DEBUG_OBJECTS_FREE=y
> > CONFIG_DEBUG_OBJECTS_TIMERS=y
> >
> > and add "debug_objects" to the kernel command line?
>
> I added these options:
Thanks.
> hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
> hpet0: 3 comparators, 64-bit 14.318180 MHz counter
> ODEBUG: object is on stack, but not annotated
ok, that's homework for me.
> scsi0 : LSI SAS based MegaRAID driver
> Driver 'sd' needs updating - please use bus_type methods
> scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5
> ------------[ cut here ]------------
> WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57()
> ODEBUG: free active object type: timer_list
That's the cause of your boot crash. The scsi/blk code is freeing a
page which contains an active timer, so the timer code references
freed memory. You triggered it because DEBUG_PAGEALLOC unmaps the page
when it's freed.
James, or other SCSI experts, could you please take a look?
> Modules linked in:
> Pid: 580, comm: scsi_scan_0 Tainted: G W 2.6.28-rc5-next-20081119 #9
> Call Trace:
> [<ffffffff80236b28>] warn_slowpath+0xae/0xd5
> [<ffffffff8037f9e8>] ? debug_check_no_obj_freed+0x75/0x1c8
> [<ffffffff8037f8b1>] debug_print_object+0x4f/0x57
> [<ffffffff8037fa0f>] debug_check_no_obj_freed+0x9c/0x1c8
> [<ffffffff8029c7b2>] kmem_cache_free+0x64/0xc0
> [<ffffffff8036a6e0>] ? blk_release_queue+0x61/0x66
> [<ffffffff8036a6e0>] blk_release_queue+0x61/0x66
> [<ffffffff803760f2>] kobject_release+0x52/0x68
> [<ffffffff803760a0>] ? kobject_release+0x0/0x68
> [<ffffffff80376ec5>] kref_put+0x43/0x4f
> [<ffffffff80375ffa>] kobject_put+0x47/0x4b
> [<ffffffff80368c53>] blk_cleanup_queue+0x57/0x5c
> [<ffffffff803f8729>] scsi_free_queue+0x9/0xb
> [<ffffffff803fd3c7>] scsi_device_dev_release_usercontext+0xdc/0x127
> [<ffffffff803fd2eb>] ? scsi_device_dev_release_usercontext+0x0/0x127
> [<ffffffff802472a8>] execute_in_process_context+0x2a/0x70
> [<ffffffff803fd2e9>] scsi_device_dev_release+0x17/0x19
> [<ffffffff803e03e0>] device_release+0x43/0x68
> [<ffffffff803760f2>] kobject_release+0x52/0x68
> [<ffffffff803760a0>] ? kobject_release+0x0/0x68
> [<ffffffff80376ec5>] kref_put+0x43/0x4f
> [<ffffffff80375ffa>] kobject_put+0x47/0x4b
> [<ffffffff803dfd36>] put_device+0x15/0x17
> [<ffffffff803fa772>] scsi_destroy_sdev+0x48/0x4c
> [<ffffffff803fba05>] scsi_probe_and_add_lun+0xb5d/0xb81
> [<ffffffff803faaba>] ? scsi_alloc_target+0x22b/0x267
> [<ffffffff803fbcb0>] __scsi_scan_target+0x9d/0x598
> [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153
> [<ffffffff804e39a9>] ? __mutex_lock_common+0x371/0x3be
> [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133
> [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153
> [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133
> [<ffffffff803fc1fd>] scsi_scan_channel+0x52/0x78
> [<ffffffff803fc314>] scsi_scan_host_selected+0xf1/0x133
> [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
> [<ffffffff803fc3c1>] do_scsi_scan_host+0x6b/0x70
> [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
> [<ffffffff803fc3dd>] do_scan_async+0x17/0x127
> [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
> [<ffffffff80249d5d>] kthread+0x49/0x76
> [<ffffffff8020c899>] child_rip+0xa/0x11
> [<ffffffff8020bd88>] ? restore_args+0x0/0x30
> [<ffffffff80249d14>] ? kthread+0x0/0x76
> [<ffffffff8020c88f>] ? child_rip+0x0/0x11
> ---[ end trace 4eaa2a86a8e2da22 ]---
> <...>
> ata2: port disabled. ignoring.
> scsi: waiting for bus probes to complete ...
> WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57()
> ODEBUG: free active object type: timer_list
Same as above.
> BUG: using smp_processor_id() in preemptible [00000000] code: init-early.sh/741
> caller is sock_prot_inuse_add+0x24/0x42
> Pid: 741, comm: init-early.sh Tainted: G W 2.6.28-rc5-next-20081119 #9
> Call Trace:
> [<ffffffff8037f622>] debug_smp_processor_id+0xca/0xe0
> [<ffffffff8046ab5b>] sock_prot_inuse_add+0x24/0x42
> [<ffffffff804bb124>] unix_create1+0x161/0x176
> [<ffffffff804bb196>] unix_create+0x5d/0x68
> [<ffffffff80469368>] __sock_create+0x114/0x17e
> [<ffffffff80469420>] sock_create+0x2d/0x2f
> [<ffffffff80469623>] sys_socket+0x29/0x5c
> [<ffffffff8020b74b>] system_call_fastpath+0x16/0x1b
Dave ???
Thanks,
tglx
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-24 17:43 ` Thomas Gleixner
@ 2008-11-24 19:15 ` James Bottomley
2008-11-24 19:31 ` Thomas Gleixner
0 siblings, 1 reply; 16+ messages in thread
From: James Bottomley @ 2008-11-24 19:15 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi,
David Miller
On Mon, 2008-11-24 at 18:43 +0100, Thomas Gleixner wrote:
> > scsi0 : LSI SAS based MegaRAID driver
> > Driver 'sd' needs updating - please use bus_type methods
> > scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5
> > ------------[ cut here ]------------
> > WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57()
> > ODEBUG: free active object type: timer_list
>
> That's the cause for your boot crash. The scsi/blk code is freeing a
> page which contains an active timer, so the timer code references gone
> memory. You triggered it because DEBUG_PAGEALLOC unmaps the page when
> it's freed.
>
> James, or other scsi experts please.
>
> > Modules linked in:
> > Pid: 580, comm: scsi_scan_0 Tainted: G W 2.6.28-rc5-next-20081119 #9
> > Call Trace:
> > [<ffffffff80236b28>] warn_slowpath+0xae/0xd5
> > [<ffffffff8037f9e8>] ? debug_check_no_obj_freed+0x75/0x1c8
> > [<ffffffff8037f8b1>] debug_print_object+0x4f/0x57
> > [<ffffffff8037fa0f>] debug_check_no_obj_freed+0x9c/0x1c8
> > [<ffffffff8029c7b2>] kmem_cache_free+0x64/0xc0
> > [<ffffffff8036a6e0>] ? blk_release_queue+0x61/0x66
> > [<ffffffff8036a6e0>] blk_release_queue+0x61/0x66
> > [<ffffffff803760f2>] kobject_release+0x52/0x68
> > [<ffffffff803760a0>] ? kobject_release+0x0/0x68
> > [<ffffffff80376ec5>] kref_put+0x43/0x4f
> > [<ffffffff80375ffa>] kobject_put+0x47/0x4b
> > [<ffffffff80368c53>] blk_cleanup_queue+0x57/0x5c
> > [<ffffffff803f8729>] scsi_free_queue+0x9/0xb
> > [<ffffffff803fd3c7>] scsi_device_dev_release_usercontext+0xdc/0x127
> > [<ffffffff803fd2eb>] ? scsi_device_dev_release_usercontext+0x0/0x127
> > [<ffffffff802472a8>] execute_in_process_context+0x2a/0x70
> > [<ffffffff803fd2e9>] scsi_device_dev_release+0x17/0x19
> > [<ffffffff803e03e0>] device_release+0x43/0x68
> > [<ffffffff803760f2>] kobject_release+0x52/0x68
> > [<ffffffff803760a0>] ? kobject_release+0x0/0x68
> > [<ffffffff80376ec5>] kref_put+0x43/0x4f
> > [<ffffffff80375ffa>] kobject_put+0x47/0x4b
> > [<ffffffff803dfd36>] put_device+0x15/0x17
> > [<ffffffff803fa772>] scsi_destroy_sdev+0x48/0x4c
> > [<ffffffff803fba05>] scsi_probe_and_add_lun+0xb5d/0xb81
> > [<ffffffff803faaba>] ? scsi_alloc_target+0x22b/0x267
> > [<ffffffff803fbcb0>] __scsi_scan_target+0x9d/0x598
> > [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153
> > [<ffffffff804e39a9>] ? __mutex_lock_common+0x371/0x3be
> > [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133
> > [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153
> > [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133
> > [<ffffffff803fc1fd>] scsi_scan_channel+0x52/0x78
> > [<ffffffff803fc314>] scsi_scan_host_selected+0xf1/0x133
> > [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
> > [<ffffffff803fc3c1>] do_scsi_scan_host+0x6b/0x70
> > [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
> > [<ffffffff803fc3dd>] do_scan_async+0x17/0x127
> > [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127
> > [<ffffffff80249d5d>] kthread+0x49/0x76
> > [<ffffffff8020c899>] child_rip+0xa/0x11
> > [<ffffffff8020bd88>] ? restore_args+0x0/0x30
> > [<ffffffff80249d14>] ? kthread+0x0/0x76
> > [<ffffffff8020c88f>] ? child_rip+0x0/0x11
> > ---[ end trace 4eaa2a86a8e2da22 ]---
Well, not sure. Most likely candidate is the new block timer code.
What seems to be happening is that the queue is being released with
either an outstanding request (refcounting problem) or ticking timer
with no work (block timer problem). The way scanning works is that we
create a request queue for each device we probe and then delete it again
if nothing appears after the bus settle time. The argument against
this is that it should show up on every scanned bus. However, these are
getting rarer; I was just about to write that I hadn't seen it when I
remembered that all my SCSI testing systems are currently running
hotplug reporting busses (i.e. don't do scanning). However,
fortunately, I've also booted voyager recently which does use parallel
SCSI and doesn't see this either, so it could also be megaraid_sas
specific.
Could you turn on SCSI logging so we can see the sequences? Since this
is boot time, it is probably easiest to just enable all logging:
echo 0xffffffff > /sys/module/scsi_mod/parameters/scsi_logging_level
(the kernel must be compiled with CONFIG_SCSI_LOGGING=y)
James
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-24 19:15 ` James Bottomley
@ 2008-11-24 19:31 ` Thomas Gleixner
2008-11-24 21:35 ` Mike Anderson
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2008-11-24 19:31 UTC (permalink / raw)
To: James Bottomley
Cc: Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi,
David Miller, Jens Axboe, Mike Anderson
On Mon, 24 Nov 2008, James Bottomley wrote:
> On Mon, 2008-11-24 at 18:43 +0100, Thomas Gleixner wrote:
> > > scsi0 : LSI SAS based MegaRAID driver
> > > Driver 'sd' needs updating - please use bus_type methods
> > > scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5
> > > ------------[ cut here ]------------
> > > WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57()
> > > ODEBUG: free active object type: timer_list
> >
> > That's the cause for your boot crash. The scsi/blk code is freeing a
> > page which contains an active timer, so the timer code references gone
> > memory. You triggered it because DEBUG_PAGEALLOC unmaps the page when
> > it's freed.
> >
> > James, or other scsi experts please.
>
> Well, not sure. Most likely candidate is the new block timer code.
> What seems to be happening is that the queue is being released with
> either an outstanding request (refcounting problem) or ticking timer
> with no work (block timer problem). The way scanning works is that we
> create a request queue for each device we probe and then delete it again
> if nothing appears after the bus settle time. The argument against
> this is that it should show up on every scanned bus. However, these are
> getting rarer; I was just about to write that I hadn't seen it when I
> remembered that all my SCSI testing systems are currently running
> hotplug reporting busses (i.e. don't do scanning). However,
> fortunately, I've also booted voyager recently which does use parallel
> SCSI and doesn't see this either, so it could also be megaraid_sas
> specific.
Yeah, it could be block as well. Jens, Mike?
One note about not seeing it: We have had such bugs before where the
page was freed but not touched and the timer survived w/o tripping the
system over. Alexander noticed because of DEBUG_PAGEALLOC and you can
also see it by enabling debugobjects, which will give you the nice
backtrace.
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
and add "debug_objects" to the kernel command line.
> Could you turn on SCSI logging so we can see the sequences? Since this
> is boot time, it is probably easiest to just enable all logging:
>
> echo 0xffffffff > /sys/module/scsi_mod/parameters/scsi_logging_level
>
> (the kernel must be compiled with CONFIG_SCSI_LOGGING=y)
>
> James
>
>
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-24 19:31 ` Thomas Gleixner
@ 2008-11-24 21:35 ` Mike Anderson
2008-11-24 22:33 ` Thomas Gleixner
0 siblings, 1 reply; 16+ messages in thread
From: Mike Anderson @ 2008-11-24 21:35 UTC (permalink / raw)
To: Thomas Gleixner
Cc: James Bottomley, Alexander Beregalov, LKML, linux-next,
Ingo Molnar, linux-scsi, David Miller, Jens Axboe
Thomas Gleixner <tglx@linutronix.de> wrote:
> > Well, not sure. Most likely candidate is the new block timer code.
> > What seems to be happening is that the queue is being released with
> > either an outstanding request (refcounting problem) or ticking timer
> > with no work (block timer problem). The way scanning works is that we
> > create a request queue for each device we probe and then delete it again
> > if nothing appears after the bus settle time. The argument against
> > this is that it should show up on every scanned bus. However, these are
> > getting rarer; I was just about to write that I hadn't seen it when I
> > remembered that all my SCSI testing systems are currently running
> > hotplug reporting busses (i.e. don't do scanning). However,
> > fortunately, I've also booted voyager recently which does use parallel
> > SCSI and doesn't see this either, so it could also be megaraid_sas
> > specific.
>
> Yeah, it could be block as well. Jens, Mike?
I added a comment to bug 12020 on Thursday about a few other systems that
were seeing the signature shown in bug 12020. It appeared from debug that
there were a few paths that were adding timers for requests that were
not expected.
http://bugzilla.kernel.org/show_bug.cgi?id=12020
It would be good to know if the debug patch below affects your problem as well.
If it does, we need to investigate a proper way to avoid adding a
timer for these requests.
-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com
blk: blk_add_timer debug patch
[DEBUG] Debug only patch.
Debug patch to blk_add_timer to not start the timer for requests that do
not have the REQ_STARTED flag set.
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
---
block/blk-timeout.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 69185ea..4389391 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -177,6 +177,9 @@ void blk_add_timer(struct request *req)
 	BUG_ON(!list_empty(&req->timeout_list));
 	BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags));
 
+	if (!(req->cmd_flags & REQ_STARTED))
+		return;
+
 	if (req->timeout)
 		req->deadline = jiffies + req->timeout;
 	else {
--
1.5.6.5
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-24 21:35 ` Mike Anderson
@ 2008-11-24 22:33 ` Thomas Gleixner
2008-11-24 23:42 ` malahal
2008-11-25 0:09 ` malahal
0 siblings, 2 replies; 16+ messages in thread
From: Thomas Gleixner @ 2008-11-24 22:33 UTC (permalink / raw)
To: Mike Anderson
Cc: James Bottomley, Alexander Beregalov, LKML, linux-next,
Ingo Molnar, linux-scsi, David Miller, Jens Axboe
On Mon, 24 Nov 2008, Mike Anderson wrote:
> Thomas Gleixner <tglx@linutronix.de> wrote:
> > Yeah, it could be block as well. Jens, Mike?
>
> I added a comment to bug 12020 on Thursday about a few other systems that
> were seeing the signature shown in bug 12020. It appeared from debug that
> there were a few paths that were adding timers for requests that were
> not expected.
>
> http://bugzilla.kernel.org/show_bug.cgi?id=12020
>
> It would be good to know if the debug patch below affects your problem as well.
>
> If it does, we need to investigate a proper way to avoid adding a
> timer for these requests.
Wrong.
The problem is not a timer which is armed in the first place.
The problem is an armed timer which is not canceled before the data
structure which contains it is freed.
So not arming the timer will probably prevent this particular scan
problem, but it does not solve the general wreckage of freeing a data
structure with a possibly armed timer in it.
You need to fix the code path which frees the data structure which
contains the timer and cancel the timer _before_ freeing the data
structure.
Thanks,
tglx
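The rule stated in the message above (cancel the timer before freeing the structure that embeds it) can be sketched in userspace C. This is only an illustration under stated assumptions: `toy_timer`, `toy_queue`, and every `toy_*` function are hypothetical stand-ins, not the kernel's `timer_list` API or the block layer's code. The 6b6b6b6b6b6b6b6b register values in the original oops match the slab free-poison byte (0x6b), which is consistent with the timer code walking memory that had already been freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel timer; not the real struct timer_list. */
struct toy_timer {
	struct toy_timer *next;	/* link in the pending list */
	bool pending;		/* true while the timer is armed */
};

/* One global list of armed timers, playing the role of the timer wheel. */
static struct toy_timer *pending_list;

/* Arm a timer: link it into the pending list. */
static void toy_add_timer(struct toy_timer *t)
{
	assert(!t->pending);
	t->next = pending_list;
	pending_list = t;
	t->pending = true;
}

/* Disarm a timer: unlink it if armed. Returns true if it was pending. */
static bool toy_del_timer(struct toy_timer *t)
{
	struct toy_timer **pp;

	for (pp = &pending_list; *pp; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;
			t->next = NULL;
			t->pending = false;
			return true;
		}
	}
	return false;
}

/* Count the timers that the "wheel" would still walk. */
static size_t toy_pending_count(void)
{
	size_t n = 0;

	for (struct toy_timer *t = pending_list; t; t = t->next)
		n++;
	return n;
}

/* Hypothetical object that embeds a timer, like a queue with q->timeout. */
struct toy_queue {
	int id;
	struct toy_timer timeout;
};

static struct toy_queue *toy_queue_alloc(int id)
{
	struct toy_queue *q = calloc(1, sizeof(*q));

	if (q)
		q->id = id;
	return q;
}

/* Teardown: cancel the embedded timer _before_ freeing the structure. */
static void toy_queue_free(struct toy_queue *q)
{
	toy_del_timer(&q->timeout);
	free(q);
}
```

If `toy_queue_free()` skipped the `toy_del_timer()` call, `pending_list` would keep pointing into freed memory, and the next walk of the list would dereference it. Not arming the timer in one caller hides that, but any other path that arms it brings the problem back.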
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-24 22:33 ` Thomas Gleixner
@ 2008-11-24 23:42 ` malahal
2008-11-25 0:09 ` malahal
1 sibling, 0 replies; 16+ messages in thread
From: malahal @ 2008-11-24 23:42 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mike Anderson, James Bottomley, Alexander Beregalov, LKML,
linux-next, Ingo Molnar, linux-scsi, David Miller, Jens Axboe
Thomas Gleixner [tglx@linutronix.de] wrote:
> > were seeing the signature shown in bug 12020. It appeared from debug that
> > there were a few paths that were adding timers for requests that were
> > not expected.
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=12020
> >
> > It would be good to know if the debug patch below affects your problem as well.
> >
> > If it does we need to investigate a solution that avoids adding a
> > timer for these requests.
>
> Wrong.
>
> The problem is not a timer which is armed in the first place.
No, this could be a problem if such a timer is not disarmed! As far as
I know, the queue timer will be disarmed in end_that_request_last() if
needed. Do we know that end_that_request_last() gets called for every
request queued?
> The problem is an armed timer which is not canceled before the data
> structure which contains it is freed.
>
> So not arming the timer will probably prevent this particular scan
> problem, but it does not solve the general wreckage of freeing a data
> structure with a possibly armed timer in it.
>
> You need to fix the code path which frees the data structure which
> contains the timer and cancel the timer _before_ freeing the data
> structure.
Agreed, but the timer is armed when a request is sent and is disarmed
when it is completed. Essentially there should NOT be any active
timer(s) when you try to free the request queue. In other words, the
code which frees the data structure (request queue) is correct and there
is no need to cancel the timer there!
--Malahal.
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-24 22:33 ` Thomas Gleixner
2008-11-24 23:42 ` malahal
@ 2008-11-25 0:09 ` malahal
2008-11-25 0:57 ` Stephen Rothwell
1 sibling, 1 reply; 16+ messages in thread
From: malahal @ 2008-11-25 0:09 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mike Anderson, James Bottomley, Alexander Beregalov, LKML,
linux-next, Ingo Molnar, linux-scsi, David Miller, Jens Axboe
Thomas Gleixner [tglx@linutronix.de] wrote:
> On Mon, 24 Nov 2008, Mike Anderson wrote:
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > > Yeah, it could be block as well. Jens, Mike ?
> >
> > I added a comment to bug 12020 on Thursday about a few other systems that
> > were seeing the signature shown in bug 12020. It appeared from debug that
> > there were a few paths that were adding timers for requests that were
> > not expected.
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=12020
> >
> > It would be good to know if the debug patch below affects your problem as well.
> >
> > If it does we need to investigate a solution that avoids adding a
> > timer for these requests.
The block timer code calls del_timer(); should it call del_timer_sync()?
It is possible, although unlikely, that you are hitting a del_timer_sync()
vs del_timer() problem in the block timeout code. It can only be seen on
SMP systems, though!
--Malahal.
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-25 0:09 ` malahal
@ 2008-11-25 0:57 ` Stephen Rothwell
2008-11-25 2:08 ` malahal
0 siblings, 1 reply; 16+ messages in thread
From: Stephen Rothwell @ 2008-11-25 0:57 UTC (permalink / raw)
To: malahal
Cc: Thomas Gleixner, Mike Anderson, James Bottomley,
Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi,
David Miller, Jens Axboe
On Mon, 24 Nov 2008 16:09:02 -0800 malahal@us.ibm.com wrote:
>
> Thomas Gleixner [tglx@linutronix.de] wrote:
> > On Mon, 24 Nov 2008, Mike Anderson wrote:
> > > Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > Yeah, it could be block as well. Jens, Mike ?
> > >
> > > I added a comment to bug 12020 on Thursday about a few other systems that
> > > were seeing the signature shown in bug 12020. It appeared from debug that
> > > there were a few paths that were adding timers for requests that were
> > > not expected.
> > >
> > > http://bugzilla.kernel.org/show_bug.cgi?id=12020
> > >
> > > It would be good to know if the debug patch below affects your problem as well.
> > >
> > > If it does we need to investigate a solution that avoids adding a
> > > timer for these requests.
>
> The block timer code calls del_timer(), should it call del_timer_sync()?
> It is possible although unlikely that you are hitting del_timer_sync vs
> del_timer problem in the block timeout code. Can only be seen on SMP
> systems though!
Is this still a problem in next-20081121? In that tree, the block commit
"block: leave the request timeout timer running even on an empty list"
was changed to add this:
diff --git a/block/blk-core.c b/block/blk-core.c
index 04267d6..44f547c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue);
 void blk_sync_queue(struct request_queue *q)
 {
 	del_timer_sync(&q->unplug_timer);
+	del_timer_sync(&q->timeout);
 	kblockd_flush_work(&q->unplug_work);
 }
 EXPORT_SYMBOL(blk_sync_queue);
That was added after I spent some time bisecting a boot failure on PowerPC.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-25 0:57 ` Stephen Rothwell
@ 2008-11-25 2:08 ` malahal
2008-11-25 8:51 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: malahal @ 2008-11-25 2:08 UTC (permalink / raw)
To: Stephen Rothwell
Cc: Thomas Gleixner, Mike Anderson, James Bottomley,
Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi,
David Miller, Jens Axboe
Stephen Rothwell [sfr@canb.auug.org.au] wrote:
> > The block timer code calls del_timer(), should it call del_timer_sync()?
> > It is possible although unlikely that you are hitting del_timer_sync vs
> > del_timer problem in the block timeout code. Can only be seen on SMP
> > systems though!
>
> Is this still a problem in next-20081121? In that tree, the block commit
> "block: leave the request timeout timer running even on an empty list"
> was changed to add this:
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 04267d6..44f547c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue);
> void blk_sync_queue(struct request_queue *q)
> {
> del_timer_sync(&q->unplug_timer);
> + del_timer_sync(&q->timeout);
> kblockd_flush_work(&q->unplug_work);
> }
> EXPORT_SYMBOL(blk_sync_queue);
I was looking at the Linux tree. Clearly the same problem doesn't exist with
the above commit! I wonder why kblockd_flush_work() is called after the
del_timer_sync(). It makes sense to cancel the work and then shut down
the timer(s). I doubt that you are running into this problem, though.
-Malahal.
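On the ordering question above, one possible reason for stopping the timers first is offered here as an assumption rather than something stated in this thread: if a timer handler hands off to a work item, then flushing the work before the timer is stopped leaves a window in which the timer can fire and queue the work again. The toy model below is a single-threaded sketch of that window; `toy_sync_queue` and all `toy_*` and `sync_*` names are hypothetical and do not correspond to real block-layer functions.

```c
#include <stdbool.h>

/* Hypothetical model of a queue with one timer and one work item. */
struct toy_sync_queue {
	bool timer_armed;	/* stands in for a pending queue timer */
	bool work_queued;	/* stands in for a queued work item */
};

/* Timer expiry: assumed to queue the work item, then disarm itself. */
static void toy_timer_fire(struct toy_sync_queue *q)
{
	if (q->timer_armed) {
		q->timer_armed = false;
		q->work_queued = true;
	}
}

/* Cancel the timer so that it can no longer fire. */
static void toy_cancel_timer(struct toy_sync_queue *q)
{
	q->timer_armed = false;
}

/* Flush the work item so that nothing remains queued. */
static void toy_flush_work(struct toy_sync_queue *q)
{
	q->work_queued = false;
}

/* Timer first, then work: a late expiry after the flush is a no-op. */
static bool sync_timer_then_work(struct toy_sync_queue *q)
{
	toy_cancel_timer(q);
	toy_flush_work(q);
	toy_timer_fire(q);	/* nothing happens, the timer is already off */
	return !q->timer_armed && !q->work_queued;
}

/* Work first, then timer: an expiry in between requeues the work. */
static bool sync_work_then_timer(struct toy_sync_queue *q)
{
	toy_flush_work(q);
	toy_timer_fire(q);	/* timer expires inside the window */
	toy_cancel_timer(q);
	return !q->timer_armed && !q->work_queued;
}
```

Both functions return true only when the queue ends up fully quiescent. Starting from a queue with the timer armed and work queued, only the timer-first order reaches that state in this model.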
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-25 2:08 ` malahal
@ 2008-11-25 8:51 ` Jens Axboe
2008-11-25 16:59 ` malahal
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2008-11-25 8:51 UTC (permalink / raw)
To: malahal
Cc: Stephen Rothwell, Thomas Gleixner, Mike Anderson, James Bottomley,
Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi,
David Miller
On Mon, Nov 24 2008, malahal@us.ibm.com wrote:
> Stephen Rothwell [sfr@canb.auug.org.au] wrote:
> > > The block timer code calls del_timer(), should it call del_timer_sync()?
> > > It is possible although unlikely that you are hitting del_timer_sync vs
> > > del_timer problem in the block timeout code. Can only be seen on SMP
> > > systems though!
> >
> > Is this still a problem in next-20081121? In that tree, the block commit
> > "block: leave the request timeout timer running even on an empty list"
> > was changed to add this:
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 04267d6..44f547c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue);
> > void blk_sync_queue(struct request_queue *q)
> > {
> > del_timer_sync(&q->unplug_timer);
> > + del_timer_sync(&q->timeout);
> > kblockd_flush_work(&q->unplug_work);
> > }
> > EXPORT_SYMBOL(blk_sync_queue);
>
> I was looking at the Linux tree. Clearly same problem doesn't exist with
> the above commit! I wonder why kblockd_flush_work() is called after the
> del_timer_sync(). It makes sense to cancel the work and then shutdown
> the timer(s). I doubt if you are running into this problem though.
If the kernel tested doesn't include the above fix, it'll surely go
boom. Can someone verify that this is the case?
--
Jens Axboe
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-25 8:51 ` Jens Axboe
@ 2008-11-25 16:59 ` malahal
2008-11-25 17:14 ` Alexander Beregalov
0 siblings, 1 reply; 16+ messages in thread
From: malahal @ 2008-11-25 16:59 UTC (permalink / raw)
To: Jens Axboe
Cc: Stephen Rothwell, Thomas Gleixner, Mike Anderson, James Bottomley,
Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi,
David Miller
Jens Axboe [jens.axboe@oracle.com] wrote:
> On Mon, Nov 24 2008, malahal@us.ibm.com wrote:
> > Stephen Rothwell [sfr@canb.auug.org.au] wrote:
> > > > The block timer code calls del_timer(), should it call del_timer_sync()?
> > > > It is possible although unlikely that you are hitting del_timer_sync vs
> > > > del_timer problem in the block timeout code. Can only be seen on SMP
> > > > systems though!
> > >
> > > Is this still a problem in next-20081121? In that tree, the block commit
> > > "block: leave the request timeout timer running even on an empty list"
> > > was changed to add this:
> > >
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 04267d6..44f547c 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue);
> > > void blk_sync_queue(struct request_queue *q)
> > > {
> > > del_timer_sync(&q->unplug_timer);
> > > + del_timer_sync(&q->timeout);
> > > kblockd_flush_work(&q->unplug_work);
> > > }
> > > EXPORT_SYMBOL(blk_sync_queue);
> >
> > I was looking at the Linux tree. Clearly same problem doesn't exist with
> > the above commit! I wonder why kblockd_flush_work() is called after the
> > del_timer_sync(). It makes sense to cancel the work and then shutdown
> > the timer(s). I doubt if you are running into this problem though.
>
> If the kernel tested doesn't include the above fix, it'll surely go
> boom. Can someone verify that this is the case?
Just looked: next-20081119 doesn't have the above fix. It is included in
next-20081120. Also note that the above fix is only partially copied;
there is another part that removes deleting the timer when there are no
outstanding requests.
--Malahal.
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-25 16:59 ` malahal
@ 2008-11-25 17:14 ` Alexander Beregalov
2008-11-25 17:43 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Beregalov @ 2008-11-25 17:14 UTC (permalink / raw)
To: Jens Axboe, Stephen Rothwell, Thomas Gleixner, Mike Anderson,
James
2008/11/25 <malahal@us.ibm.com>:
> Jens Axboe [jens.axboe@oracle.com] wrote:
>> On Mon, Nov 24 2008, malahal@us.ibm.com wrote:
>> > Stephen Rothwell [sfr@canb.auug.org.au] wrote:
>> > > > The block timer code calls del_timer(), should it call del_timer_sync()?
>> > > > It is possible although unlikely that you are hitting del_timer_sync vs
>> > > > del_timer problem in the block timeout code. Can only be seen on SMP
>> > > > systems though!
>> > >
>> > > Is this still a problem in next-20081121? In that tree, the block commit
>> > > "block: leave the request timeout timer running even on an empty list"
>> > > was changed to add this:
>> > >
>> > > diff --git a/block/blk-core.c b/block/blk-core.c
>> > > index 04267d6..44f547c 100644
>> > > --- a/block/blk-core.c
>> > > +++ b/block/blk-core.c
>> > > @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue);
>> > > void blk_sync_queue(struct request_queue *q)
>> > > {
>> > > del_timer_sync(&q->unplug_timer);
>> > > + del_timer_sync(&q->timeout);
>> > > kblockd_flush_work(&q->unplug_work);
>> > > }
>> > > EXPORT_SYMBOL(blk_sync_queue);
>> >
>> > I was looking at the Linux tree. Clearly same problem doesn't exist with
>> > the above commit! I wonder why kblockd_flush_work() is called after the
>> > del_timer_sync(). It makes sense to cancel the work and then shutdown
>> > the timer(s). I doubt if you are running into this problem though.
>>
>> If the kernel tested doesn't include the above fix, it'll surely go
>> boom. Can someone verify that this is the case?
>
> Just looked, next-20081119 doesn't have the above fix. It is included in
> next-20081120. Also note that the above fix is only partially copied,
> there is other part that removed deleting the timer when there are no
> outstanding requests.
>
Yes, I cannot reproduce it anymore on linux-next 1121 and newer. (I
did not try 1120.)
It seems the fix works pretty well.
Is it still needed and reasonable to investigate the problem on next-20081119?
Unfortunately I do not have much time for it.
All these problems have gone away on next-1125 except the ODEBUG warning on HPET.
* Re: next-20081119: general protection fault: get_next_timer_interrupt()
2008-11-25 17:14 ` Alexander Beregalov
@ 2008-11-25 17:43 ` Jens Axboe
0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2008-11-25 17:43 UTC (permalink / raw)
To: Alexander Beregalov
Cc: Stephen Rothwell, Thomas Gleixner, Mike Anderson, James Bottomley,
LKML, linux-next, Ingo Molnar, linux-scsi, David Miller
On Tue, Nov 25 2008, Alexander Beregalov wrote:
> 2008/11/25 <malahal@us.ibm.com>:
> > Jens Axboe [jens.axboe@oracle.com] wrote:
> >> On Mon, Nov 24 2008, malahal@us.ibm.com wrote:
> >> > Stephen Rothwell [sfr@canb.auug.org.au] wrote:
> >> > > > The block timer code calls del_timer(), should it call del_timer_sync()?
> >> > > > It is possible although unlikely that you are hitting del_timer_sync vs
> >> > > > del_timer problem in the block timeout code. Can only be seen on SMP
> >> > > > systems though!
> >> > >
> >> > > Is this still a problem in next-20081121? In that tree, the block commit
> >> > > "block: leave the request timeout timer running even on an empty list"
> >> > > was changed to add this:
> >> > >
> >> > > diff --git a/block/blk-core.c b/block/blk-core.c
> >> > > index 04267d6..44f547c 100644
> >> > > --- a/block/blk-core.c
> >> > > +++ b/block/blk-core.c
> >> > > @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue);
> >> > > void blk_sync_queue(struct request_queue *q)
> >> > > {
> >> > > del_timer_sync(&q->unplug_timer);
> >> > > + del_timer_sync(&q->timeout);
> >> > > kblockd_flush_work(&q->unplug_work);
> >> > > }
> >> > > EXPORT_SYMBOL(blk_sync_queue);
> >> >
> >> > I was looking at the Linux tree. Clearly same problem doesn't exist with
> >> > the above commit! I wonder why kblockd_flush_work() is called after the
> >> > del_timer_sync(). It makes sense to cancel the work and then shutdown
> >> > the timer(s). I doubt if you are running into this problem though.
> >>
> >> If the kernel tested doesn't include the above fix, it'll surely go
> >> boom. Can someone verify that this is the case?
> >
> > Just looked, next-20081119 doesn't have the above fix. It is included in
> > next-20081120. Also note that the above fix is only partially copied,
> > there is other part that removed deleting the timer when there are no
> > outstanding requests.
> >
> Yes, I can not reproduce it anymore on linux-next 1121 and newer. (I
> did not try 1120) It seems the fix works pretty good. Is it still
> needed and reasonable to investigate the problem on next-20081119?
> Unfortunately I do not have much time for it.
No, you don't have to investigate further. This was a known bug that is
fixed in -next and mainline basically right after next-20081119.
>
> All these problems have gone away on next-1125 except ODEBUG warning
> on HPET.
--
Jens Axboe
end of thread, other threads:[~2008-11-25 17:45 UTC | newest]
Thread overview: 16+ messages
2008-11-19 15:14 next-20081119: general protection fault: get_next_timer_interrupt() Alexander Beregalov
2008-11-19 21:14 ` Thomas Gleixner
2008-11-21 10:50 ` Alexander Beregalov
2008-11-24 17:43 ` Thomas Gleixner
2008-11-24 19:15 ` James Bottomley
2008-11-24 19:31 ` Thomas Gleixner
2008-11-24 21:35 ` Mike Anderson
2008-11-24 22:33 ` Thomas Gleixner
2008-11-24 23:42 ` malahal
2008-11-25 0:09 ` malahal
2008-11-25 0:57 ` Stephen Rothwell
2008-11-25 2:08 ` malahal
2008-11-25 8:51 ` Jens Axboe
2008-11-25 16:59 ` malahal
2008-11-25 17:14 ` Alexander Beregalov
2008-11-25 17:43 ` Jens Axboe