* next-20081119: general protection fault: get_next_timer_interrupt() @ 2008-11-19 15:14 Alexander Beregalov 2008-11-19 21:14 ` Thomas Gleixner 0 siblings, 1 reply; 16+ messages in thread From: Alexander Beregalov @ 2008-11-19 15:14 UTC (permalink / raw) To: LKML, linux-next, tglx, mingo Hi It is 4way X86_64 The kernel does not boot. ... scsi0 : LSI SAS based MegaRAID driver Driver 'sd' needs updating - please use bus_type methods scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5 general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC last sysfs file: CPU 3 Modules linked in: Pid: 0, comm: swapper Not tainted 2.6.28-rc5-next-20081119 #6 RIP: 0010:[<ffffffff80240061>] [<ffffffff80240061>] get_next_timer_interrupt+0x11b/0x1f0 RSP: 0018:ffff88007dfe3e98 EFLAGS: 00010016 RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007dfd0000 RCX: 000000000000003e RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000037 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88007dfe3ef8 R08: ffff88007dfe3ea8 R09: 0000000000fffeb7 R10: 0000000000000001 R11: ffff88007dfd1430 R12: 000000013ffeb69c R13: 00000000fffeb69c R14: ffff88007dfd1050 R15: 0000000000000040 FS: 0000000000000000(0000) GS:ffff88007f402a28(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffff88007dfd6000, task ffff88007dfa4bc0) Stack: ffff88008486d000 ffff8800051390c0 ffff88007dfd1050 ffff88007dfd1450 ffff88007dfd1850 ffff88007dfd1c50 ffff880001431780 ffffffff80705000 ffff880001431780 ffff880004f72010 0000000000000000 0000000000000003 Call Trace: <IRQ> <0> [<ffffffff80253f68>] tick_nohz_stop_sched_tick+0x17b/0x37f [<ffffffff8023c023>] ? __do_softirq+0xf8/0x101 [<ffffffff8023bbec>] irq_exit+0x91/0xa2 [<ffffffff8021d05d>] smp_apic_timer_interrupt+0xa4/0xce [<ffffffff8020c610>] apic_timer_interrupt+0x70/0x80 <EOI> <0> [<ffffffff80212df9>] ? mwait_idle+0x3e/0x48 [<ffffffff80212df0>] ? mwait_idle+0x35/0x48 [<ffffffff8020a930>] ? cpu_idle+0x51/0xba [<ffffffff804dc06d>] ? start_secondary+0x185/0x18a Code: 30 83 e6 3f 89 f1 48 63 c1 48 c1 e0 04 4a 8b 14 30 4d 8d 1c 06 eb 14 48 8b 42 10 41 ba 01 00 00 00 4c 39 e0 48 89 fa 4c 0f 48 e0 <48> 8b 3a 4c 39 da 0f 18 0f 75 e1 45 85 d2 74 10 85 f6 74 04 39 RIP [<ffffffff80240061>] get_next_timer_interrupt+0x11b/0x1f0 RSP <ffff88007dfe3e98> ---[ end trace c0adcd63427ce040 ]--- Kernel panic - not syncing: Attempted to kill the idle task! Pid: 0, comm: swapper Tainted: G D 2.6.28-rc5-next-20081119 #6 Call Trace: <IRQ> [<ffffffff804de999>] panic+0xaa/0x15f [<ffffffff804deab5>] ? printk+0x67/0x6a [<ffffffff803ba593>] ? account+0xde/0xed [<ffffffff803ba6e5>] ? extract_entropy+0x50/0x97 [<ffffffff80239c6e>] do_exit+0x70/0x891 [<ffffffff803ba816>] ? get_random_bytes+0x1b/0x1d [<ffffffff8020f650>] oops_end+0x89/0x8e [<ffffffff8020f81c>] die+0x55/0x5e [<ffffffff8020d9d8>] do_general_protection+0x11e/0x126 [<ffffffff8020d8ba>] ? do_general_protection+0x0/0x126 [<ffffffff804e1f22>] error_exit+0x0/0xa9 [<ffffffff80240061>] ? get_next_timer_interrupt+0x11b/0x1f0 [<ffffffff8023ff83>] ? get_next_timer_interrupt+0x3d/0x1f0 [<ffffffff80253f68>] tick_nohz_stop_sched_tick+0x17b/0x37f [<ffffffff8023c023>] ? __do_softirq+0xf8/0x101 [<ffffffff8023bbec>] irq_exit+0x91/0xa2 [<ffffffff8021d05d>] smp_apic_timer_interrupt+0xa4/0xce [<ffffffff8020c610>] apic_timer_interrupt+0x70/0x80 <EOI> [<ffffffff80212df9>] ? mwait_idle+0x3e/0x48 [<ffffffff80212df0>] ? mwait_idle+0x35/0x48 [<ffffffff8020a930>] ? cpu_idle+0x51/0xba [<ffffffff804dc06d>] ? start_secondary+0x185/0x18a ------------[ cut here ]------------ WARNING: at kernel/smp.c:333 smp_call_function_mask+0x40/0x1e6() Modules linked in: Pid: 0, comm: swapper Tainted: G D 2.6.28-rc5-next-20081119 #6 Call Trace: <IRQ> [<ffffffff80236baf>] warn_on_slowpath+0x58/0x7d [<ffffffff804deab5>] ? printk+0x67/0x6a [<ffffffff804dc06d>] ? start_secondary+0x185/0x18a [<ffffffff804dc06d>] ? start_secondary+0x185/0x18a [<ffffffff804deab5>] ? printk+0x67/0x6a [<ffffffff80237864>] ? vprintk+0x312/0x355 [<ffffffff8021d513>] ? touch_nmi_watchdog+0x54/0x58 [<ffffffff80212f6c>] ? stop_this_cpu+0x0/0x24 [<ffffffff8025df57>] smp_call_function_mask+0x40/0x1e6 [<ffffffff80212f6c>] ? stop_this_cpu+0x0/0x24 [<ffffffff8020f9d8>] ? print_context_stack+0xa8/0xc0 [<ffffffff8020eb24>] ? dump_trace+0x25d/0x285 [<ffffffff8020f88a>] ? show_trace_log_lvl+0x4c/0x58 [<ffffffff80212f6c>] ? stop_this_cpu+0x0/0x24 [<ffffffff8025e130>] smp_call_function+0x33/0x6c [<ffffffff8021bad4>] native_smp_send_stop+0x22/0x48 [<ffffffff804de9a6>] panic+0xb7/0x15f [<ffffffff804deab5>] ? printk+0x67/0x6a [<ffffffff803ba593>] ? account+0xde/0xed [<ffffffff803ba6e5>] ? extract_entropy+0x50/0x97 [<ffffffff80239c6e>] do_exit+0x70/0x891 [<ffffffff803ba816>] ? get_random_bytes+0x1b/0x1d [<ffffffff8020f650>] oops_end+0x89/0x8e [<ffffffff8020f81c>] die+0x55/0x5e [<ffffffff8020d9d8>] do_general_protection+0x11e/0x126 [<ffffffff8020d8ba>] ? do_general_protection+0x0/0x126 [<ffffffff804e1f22>] error_exit+0x0/0xa9 [<ffffffff80240061>] ? get_next_timer_interrupt+0x11b/0x1f0 [<ffffffff8023ff83>] ? get_next_timer_interrupt+0x3d/0x1f0 [<ffffffff80253f68>] tick_nohz_stop_sched_tick+0x17b/0x37f [<ffffffff8023c023>] ? __do_softirq+0xf8/0x101 [<ffffffff8023bbec>] irq_exit+0x91/0xa2 [<ffffffff8021d05d>] smp_apic_timer_interrupt+0xa4/0xce [<ffffffff8020c610>] apic_timer_interrupt+0x70/0x80 <EOI> [<ffffffff80212df9>] ? mwait_idle+0x3e/0x48 [<ffffffff80212df0>] ? mwait_idle+0x35/0x48 [<ffffffff8020a930>] ? cpu_idle+0x51/0xba [<ffffffff804dc06d>] ? start_secondary+0x185/0x18a ---[ end trace c0adcd63427ce040 ]--- ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-19 15:14 next-20081119: general protection fault: get_next_timer_interrupt() Alexander Beregalov @ 2008-11-19 21:14 ` Thomas Gleixner 2008-11-21 10:50 ` Alexander Beregalov 0 siblings, 1 reply; 16+ messages in thread From: Thomas Gleixner @ 2008-11-19 21:14 UTC (permalink / raw) To: Alexander Beregalov; +Cc: LKML, linux-next, mingo Alexander, On Wed, 19 Nov 2008, Alexander Beregalov wrote: > > It is 4way X86_64 > The kernel does not boot. > RIP: 0010:[<ffffffff80240061>] [<ffffffff80240061>] > get_next_timer_interrupt+0x11b/0x1f0 Can you please enable: CONFIG_DEBUG_OBJECTS=y CONFIG_DEBUG_OBJECTS_FREE=y CONFIG_DEBUG_OBJECTS_TIMERS=Y and add "debug_objects" to the kernel command line ? Thanks, tglx ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-19 21:14 ` Thomas Gleixner @ 2008-11-21 10:50 ` Alexander Beregalov 2008-11-24 17:43 ` Thomas Gleixner 0 siblings, 1 reply; 16+ messages in thread From: Alexander Beregalov @ 2008-11-21 10:50 UTC (permalink / raw) To: Thomas Gleixner; +Cc: LKML, linux-next, mingo, linux-scsi, James.Bottomley 2008/11/20 Thomas Gleixner <tglx@linutronix.de>: > Alexander, > > On Wed, 19 Nov 2008, Alexander Beregalov wrote: >> >> It is 4way X86_64 >> The kernel does not boot. > >> RIP: 0010:[<ffffffff80240061>] [<ffffffff80240061>] >> get_next_timer_interrupt+0x11b/0x1f0 > > Can you please enable: > > CONFIG_DEBUG_OBJECTS=y > CONFIG_DEBUG_OBJECTS_FREE=y > CONFIG_DEBUG_OBJECTS_TIMERS=Y > > and add "debug_objects" to the kernel command line ? I added these options: hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0 hpet0: 3 comparators, 64-bit 14.318180 MHz counter ODEBUG: object is on stack, but not annotated ------------[ cut here ]------------ WARNING: at lib/debugobjects.c:251 __debug_object_init+0x2bf/0x36d() Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.28-rc5-next-20081119 #9 Call Trace: [<ffffffff80236ba7>] warn_on_slowpath+0x58/0x7d [<ffffffff804e2584>] ? printk+0x67/0x6b [<ffffffff803800bc>] ? __debug_object_init+0x191/0x36d [<ffffffff803801ea>] __debug_object_init+0x2bf/0x36d [<ffffffff802977c6>] ? compound_order+0x15/0x26 [<ffffffff803802c5>] debug_object_init+0x14/0x17 [<ffffffff8023fc77>] init_timer+0x18/0x5b [<ffffffff80220821>] hpet_cpuhp_notify+0x93/0x105 [<ffffffff8022095a>] ? hpet_work+0x0/0x206 [<ffffffff803d1394>] ? hpet_alloc+0x333/0x38f [<ffffffff80257785>] ? trace_hardirqs_on_caller+0x128/0x153 [<ffffffff802577bd>] ? trace_hardirqs_on+0xd/0xf [<ffffffff806cf5f5>] ? hpet_late_init+0x0/0x19e [<ffffffff806cf5f5>] ? hpet_late_init+0x0/0x19e [<ffffffff806cf75f>] hpet_late_init+0x16a/0x19e [<ffffffff806cd7c2>] ? print_all_ICs+0x0/0x540 [<ffffffff80209058>] _stext+0x58/0x138 [<ffffffff804e5003>] ? _spin_unlock+0x4a/0x57 [<ffffffff802de241>] ? proc_register+0x17f/0x193 [<ffffffff802de37d>] ? create_proc_entry+0x7e/0x94 [<ffffffff802686e5>] ? register_irq_proc+0xb0/0xcc [<ffffffff802d0000>] ? do_usbdevfs_bulk+0xf8/0xfe [<ffffffff806c466d>] kernel_init+0x125/0x179 [<ffffffff804e49ba>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8020c899>] child_rip+0xa/0x11 [<ffffffff8020bd88>] ? restore_args+0x0/0x30 [<ffffffff806c4548>] ? kernel_init+0x0/0x179 [<ffffffff8020c88f>] ? child_rip+0x0/0x11 ---[ end trace 4eaa2a86a8e2da22 ]--- <...> scsi0 : LSI SAS based MegaRAID driver Driver 'sd' needs updating - please use bus_type methods scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5 ------------[ cut here ]------------ WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57() ODEBUG: free active object type: timer_list Modules linked in: Pid: 580, comm: scsi_scan_0 Tainted: G W 2.6.28-rc5-next-20081119 #9 Call Trace: [<ffffffff80236b28>] warn_slowpath+0xae/0xd5 [<ffffffff8037f9e8>] ? debug_check_no_obj_freed+0x75/0x1c8 [<ffffffff8037f8b1>] debug_print_object+0x4f/0x57 [<ffffffff8037fa0f>] debug_check_no_obj_freed+0x9c/0x1c8 [<ffffffff8029c7b2>] kmem_cache_free+0x64/0xc0 [<ffffffff8036a6e0>] ? blk_release_queue+0x61/0x66 [<ffffffff8036a6e0>] blk_release_queue+0x61/0x66 [<ffffffff803760f2>] kobject_release+0x52/0x68 [<ffffffff803760a0>] ? kobject_release+0x0/0x68 [<ffffffff80376ec5>] kref_put+0x43/0x4f [<ffffffff80375ffa>] kobject_put+0x47/0x4b [<ffffffff80368c53>] blk_cleanup_queue+0x57/0x5c [<ffffffff803f8729>] scsi_free_queue+0x9/0xb [<ffffffff803fd3c7>] scsi_device_dev_release_usercontext+0xdc/0x127 [<ffffffff803fd2eb>] ? scsi_device_dev_release_usercontext+0x0/0x127 [<ffffffff802472a8>] execute_in_process_context+0x2a/0x70 [<ffffffff803fd2e9>] scsi_device_dev_release+0x17/0x19 [<ffffffff803e03e0>] device_release+0x43/0x68 [<ffffffff803760f2>] kobject_release+0x52/0x68 [<ffffffff803760a0>] ? kobject_release+0x0/0x68 [<ffffffff80376ec5>] kref_put+0x43/0x4f [<ffffffff80375ffa>] kobject_put+0x47/0x4b [<ffffffff803dfd36>] put_device+0x15/0x17 [<ffffffff803fa772>] scsi_destroy_sdev+0x48/0x4c [<ffffffff803fba05>] scsi_probe_and_add_lun+0xb5d/0xb81 [<ffffffff803faaba>] ? scsi_alloc_target+0x22b/0x267 [<ffffffff803fbcb0>] __scsi_scan_target+0x9d/0x598 [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153 [<ffffffff804e39a9>] ? __mutex_lock_common+0x371/0x3be [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133 [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153 [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133 [<ffffffff803fc1fd>] scsi_scan_channel+0x52/0x78 [<ffffffff803fc314>] scsi_scan_host_selected+0xf1/0x133 [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 [<ffffffff803fc3c1>] do_scsi_scan_host+0x6b/0x70 [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 [<ffffffff803fc3dd>] do_scan_async+0x17/0x127 [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 [<ffffffff80249d5d>] kthread+0x49/0x76 [<ffffffff8020c899>] child_rip+0xa/0x11 [<ffffffff8020bd88>] ? restore_args+0x0/0x30 [<ffffffff80249d14>] ? kthread+0x0/0x76 [<ffffffff8020c88f>] ? child_rip+0x0/0x11 ---[ end trace 4eaa2a86a8e2da22 ]--- <...> ata2: port disabled. ignoring. scsi: waiting for bus probes to complete ... WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57() ODEBUG: free active object type: timer_list Modules linked in: Pid: 580, comm: scsi_scan_0 Tainted: G W 2.6.28-rc5-next-20081119 #9 Call Trace: [<ffffffff80236b28>] warn_slowpath+0xae/0xd5 [<ffffffff803925b9>] ? write_vga+0x18/0x4e [<ffffffff8037f9e8>] ? debug_check_no_obj_freed+0x75/0x1c8 [<ffffffff8037f8b1>] debug_print_object+0x4f/0x57 [<ffffffff8037fa0f>] debug_check_no_obj_freed+0x9c/0x1c8 [<ffffffff8029c7b2>] kmem_cache_free+0x64/0xc0 [<ffffffff8036a6e0>] ? blk_release_queue+0x61/0x66 [<ffffffff8036a6e0>] blk_release_queue+0x61/0x66 [<ffffffff803760f2>] kobject_release+0x52/0x68 [<ffffffff803760a0>] ? kobject_release+0x0/0x68 [<ffffffff80376ec5>] kref_put+0x43/0x4f [<ffffffff80375ffa>] kobject_put+0x47/0x4b [<ffffffff80368c53>] blk_cleanup_queue+0x57/0x5c [<ffffffff803f8729>] scsi_free_queue+0x9/0xb [<ffffffff803fd3c7>] scsi_device_dev_release_usercontext+0xdc/0x127 [<ffffffff803fd2eb>] ? scsi_device_dev_release_usercontext+0x0/0x127 [<ffffffff802472a8>] execute_in_process_context+0x2a/0x70 [<ffffffff803fd2e9>] scsi_device_dev_release+0x17/0x19 [<ffffffff803e03e0>] device_release+0x43/0x68 [<ffffffff803760f2>] kobject_release+0x52/0x68 [<ffffffff803760a0>] ? kobject_release+0x0/0x68 [<ffffffff80376ec5>] kref_put+0x43/0x4f [<ffffffff80375ffa>] kobject_put+0x47/0x4b [<ffffffff803dfd36>] put_device+0x15/0x17 [<ffffffff803fa772>] scsi_destroy_sdev+0x48/0x4c [<ffffffff803fba05>] scsi_probe_and_add_lun+0xb5d/0xb81 [<ffffffff803faaba>] ? scsi_alloc_target+0x22b/0x267 [<ffffffff803fbcb0>] __scsi_scan_target+0x9d/0x598 [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153 [<ffffffff804e39a9>] ? __mutex_lock_common+0x371/0x3be [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133 [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153 [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133 [<ffffffff803fc1fd>] scsi_scan_channel+0x52/0x78 [<ffffffff803fc314>] scsi_scan_host_selected+0xf1/0x133 [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 [<ffffffff803fc3c1>] do_scsi_scan_host+0x6b/0x70 [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 [<ffffffff803fc3dd>] do_scan_async+0x17/0x127 [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 [<ffffffff80249d5d>] kthread+0x49/0x76 [<ffffffff8020c899>] child_rip+0xa/0x11 [<ffffffff8020bd88>] ? restore_args+0x0/0x30 [<ffffffff80249d14>] ? kthread+0x0/0x76 [<ffffffff8020c88f>] ? child_rip+0x0/0x11 ---[ end trace 4eaa2a86a8e2da22 ]--- <...> BUG: using smp_processor_id() in preemptible [00000000] code: init-early.sh/741 caller is sock_prot_inuse_add+0x24/0x42 Pid: 741, comm: init-early.sh Tainted: G W 2.6.28-rc5-next-20081119 #9 Call Trace: [<ffffffff8037f622>] debug_smp_processor_id+0xca/0xe0 [<ffffffff8046ab5b>] sock_prot_inuse_add+0x24/0x42 [<ffffffff804bb124>] unix_create1+0x161/0x176 [<ffffffff804bb196>] unix_create+0x5d/0x68 [<ffffffff80469368>] __sock_create+0x114/0x17e [<ffffffff80469420>] sock_create+0x2d/0x2f [<ffffffff80469623>] sys_socket+0x29/0x5c [<ffffffff8020b74b>] system_call_fastpath+0x16/0x1b BUG: using smp_processor_id() in preemptible [00000000] code: init-early.sh/741 caller is sock_prot_inuse_add+0x24/0x42 Pid: 741, comm: init-early.sh Tainted: G W 2.6.28-rc5-next-20081119 #9 Call Trace: ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-21 10:50 ` Alexander Beregalov @ 2008-11-24 17:43 ` Thomas Gleixner 2008-11-24 19:15 ` James Bottomley 0 siblings, 1 reply; 16+ messages in thread From: Thomas Gleixner @ 2008-11-24 17:43 UTC (permalink / raw) To: Alexander Beregalov Cc: LKML, linux-next, Ingo Molnar, linux-scsi, James.Bottomley, David Miller Alexander, On Fri, 21 Nov 2008, Alexander Beregalov wrote: > 2008/11/20 Thomas Gleixner <tglx@linutronix.de>: > > Alexander, > > > > On Wed, 19 Nov 2008, Alexander Beregalov wrote: > >> > >> It is 4way X86_64 > >> The kernel does not boot. > > > >> RIP: 0010:[<ffffffff80240061>] [<ffffffff80240061>] > >> get_next_timer_interrupt+0x11b/0x1f0 > > > > Can you please enable: > > > > CONFIG_DEBUG_OBJECTS=y > > CONFIG_DEBUG_OBJECTS_FREE=y > > CONFIG_DEBUG_OBJECTS_TIMERS=Y > > > > and add "debug_objects" to the kernel command line ? > > I added these options: Thanks. > hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0 > hpet0: 3 comparators, 64-bit 14.318180 MHz counter > ODEBUG: object is on stack, but not annotated ok, that's homework for me. > scsi0 : LSI SAS based MegaRAID driver > Driver 'sd' needs updating - please use bus_type methods > scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5 > ------------[ cut here ]------------ > WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57() > ODEBUG: free active object type: timer_list That's the cause for your boot crash. The scsi/blk code is freeing a page which contains an active timer, so the timer code references gone memory. You triggered it because DEBUG_PAGEALLOC unmaps the page when it's freed. James, or other scsi experts please. > Modules linked in: > Pid: 580, comm: scsi_scan_0 Tainted: G W 2.6.28-rc5-next-20081119 #9 > Call Trace: > [<ffffffff80236b28>] warn_slowpath+0xae/0xd5 > [<ffffffff8037f9e8>] ? debug_check_no_obj_freed+0x75/0x1c8 > [<ffffffff8037f8b1>] debug_print_object+0x4f/0x57 > [<ffffffff8037fa0f>] debug_check_no_obj_freed+0x9c/0x1c8 > [<ffffffff8029c7b2>] kmem_cache_free+0x64/0xc0 > [<ffffffff8036a6e0>] ? blk_release_queue+0x61/0x66 > [<ffffffff8036a6e0>] blk_release_queue+0x61/0x66 > [<ffffffff803760f2>] kobject_release+0x52/0x68 > [<ffffffff803760a0>] ? kobject_release+0x0/0x68 > [<ffffffff80376ec5>] kref_put+0x43/0x4f > [<ffffffff80375ffa>] kobject_put+0x47/0x4b > [<ffffffff80368c53>] blk_cleanup_queue+0x57/0x5c > [<ffffffff803f8729>] scsi_free_queue+0x9/0xb > [<ffffffff803fd3c7>] scsi_device_dev_release_usercontext+0xdc/0x127 > [<ffffffff803fd2eb>] ? scsi_device_dev_release_usercontext+0x0/0x127 > [<ffffffff802472a8>] execute_in_process_context+0x2a/0x70 > [<ffffffff803fd2e9>] scsi_device_dev_release+0x17/0x19 > [<ffffffff803e03e0>] device_release+0x43/0x68 > [<ffffffff803760f2>] kobject_release+0x52/0x68 > [<ffffffff803760a0>] ? kobject_release+0x0/0x68 > [<ffffffff80376ec5>] kref_put+0x43/0x4f > [<ffffffff80375ffa>] kobject_put+0x47/0x4b > [<ffffffff803dfd36>] put_device+0x15/0x17 > [<ffffffff803fa772>] scsi_destroy_sdev+0x48/0x4c > [<ffffffff803fba05>] scsi_probe_and_add_lun+0xb5d/0xb81 > [<ffffffff803faaba>] ? scsi_alloc_target+0x22b/0x267 > [<ffffffff803fbcb0>] __scsi_scan_target+0x9d/0x598 > [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153 > [<ffffffff804e39a9>] ? __mutex_lock_common+0x371/0x3be > [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133 > [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153 > [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133 > [<ffffffff803fc1fd>] scsi_scan_channel+0x52/0x78 > [<ffffffff803fc314>] scsi_scan_host_selected+0xf1/0x133 > [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 > [<ffffffff803fc3c1>] do_scsi_scan_host+0x6b/0x70 > [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 > [<ffffffff803fc3dd>] do_scan_async+0x17/0x127 > [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 > [<ffffffff80249d5d>] kthread+0x49/0x76 > [<ffffffff8020c899>] child_rip+0xa/0x11 > [<ffffffff8020bd88>] ? restore_args+0x0/0x30 > [<ffffffff80249d14>] ? kthread+0x0/0x76 > [<ffffffff8020c88f>] ? child_rip+0x0/0x11 > ---[ end trace 4eaa2a86a8e2da22 ]--- > <...> > ata2: port disabled. ignoring. > scsi: waiting for bus probes to complete ... > WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57() > ODEBUG: free active object type: timer_list Same as above. > BUG: using smp_processor_id() in preemptible [00000000] code: init-early.sh/741 > caller is sock_prot_inuse_add+0x24/0x42 > Pid: 741, comm: init-early.sh Tainted: G W 2.6.28-rc5-next-20081119 #9 > Call Trace: > [<ffffffff8037f622>] debug_smp_processor_id+0xca/0xe0 > [<ffffffff8046ab5b>] sock_prot_inuse_add+0x24/0x42 > [<ffffffff804bb124>] unix_create1+0x161/0x176 > [<ffffffff804bb196>] unix_create+0x5d/0x68 > [<ffffffff80469368>] __sock_create+0x114/0x17e > [<ffffffff80469420>] sock_create+0x2d/0x2f > [<ffffffff80469623>] sys_socket+0x29/0x5c > [<ffffffff8020b74b>] system_call_fastpath+0x16/0x1b Dave ??? Thanks, tglx ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-24 17:43 ` Thomas Gleixner @ 2008-11-24 19:15 ` James Bottomley 2008-11-24 19:31 ` Thomas Gleixner 0 siblings, 1 reply; 16+ messages in thread From: James Bottomley @ 2008-11-24 19:15 UTC (permalink / raw) To: Thomas Gleixner Cc: Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller On Mon, 2008-11-24 at 18:43 +0100, Thomas Gleixner wrote: > > scsi0 : LSI SAS based MegaRAID driver > > Driver 'sd' needs updating - please use bus_type methods > > scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5 > > ------------[ cut here ]------------ > > WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57() > > ODEBUG: free active object type: timer_list > > That's the cause for your boot crash. The scsi/blk code is freeing a > page which contains an active timer, so the timer code references gone > memory. You triggered it because DEBUG_PAGEALLOC unmaps the page when > it's freed. > > James, or other scsi experts please. > > > Modules linked in: > > Pid: 580, comm: scsi_scan_0 Tainted: G W 2.6.28-rc5-next-20081119 #9 > > Call Trace: > > [<ffffffff80236b28>] warn_slowpath+0xae/0xd5 > > [<ffffffff8037f9e8>] ? debug_check_no_obj_freed+0x75/0x1c8 > > [<ffffffff8037f8b1>] debug_print_object+0x4f/0x57 > > [<ffffffff8037fa0f>] debug_check_no_obj_freed+0x9c/0x1c8 > > [<ffffffff8029c7b2>] kmem_cache_free+0x64/0xc0 > > [<ffffffff8036a6e0>] ? blk_release_queue+0x61/0x66 > > [<ffffffff8036a6e0>] blk_release_queue+0x61/0x66 > > [<ffffffff803760f2>] kobject_release+0x52/0x68 > > [<ffffffff803760a0>] ? kobject_release+0x0/0x68 > > [<ffffffff80376ec5>] kref_put+0x43/0x4f > > [<ffffffff80375ffa>] kobject_put+0x47/0x4b > > [<ffffffff80368c53>] blk_cleanup_queue+0x57/0x5c > > [<ffffffff803f8729>] scsi_free_queue+0x9/0xb > > [<ffffffff803fd3c7>] scsi_device_dev_release_usercontext+0xdc/0x127 > > [<ffffffff803fd2eb>] ? scsi_device_dev_release_usercontext+0x0/0x127 > > [<ffffffff802472a8>] execute_in_process_context+0x2a/0x70 > > [<ffffffff803fd2e9>] scsi_device_dev_release+0x17/0x19 > > [<ffffffff803e03e0>] device_release+0x43/0x68 > > [<ffffffff803760f2>] kobject_release+0x52/0x68 > > [<ffffffff803760a0>] ? kobject_release+0x0/0x68 > > [<ffffffff80376ec5>] kref_put+0x43/0x4f > > [<ffffffff80375ffa>] kobject_put+0x47/0x4b > > [<ffffffff803dfd36>] put_device+0x15/0x17 > > [<ffffffff803fa772>] scsi_destroy_sdev+0x48/0x4c > > [<ffffffff803fba05>] scsi_probe_and_add_lun+0xb5d/0xb81 > > [<ffffffff803faaba>] ? scsi_alloc_target+0x22b/0x267 > > [<ffffffff803fbcb0>] __scsi_scan_target+0x9d/0x598 > > [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153 > > [<ffffffff804e39a9>] ? __mutex_lock_common+0x371/0x3be > > [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133 > > [<ffffffff8025767c>] ? trace_hardirqs_on_caller+0x1f/0x153 > > [<ffffffff803fc2d9>] ? scsi_scan_host_selected+0xb6/0x133 > > [<ffffffff803fc1fd>] scsi_scan_channel+0x52/0x78 > > [<ffffffff803fc314>] scsi_scan_host_selected+0xf1/0x133 > > [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 > > [<ffffffff803fc3c1>] do_scsi_scan_host+0x6b/0x70 > > [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 > > [<ffffffff803fc3dd>] do_scan_async+0x17/0x127 > > [<ffffffff803fc3c6>] ? do_scan_async+0x0/0x127 > > [<ffffffff80249d5d>] kthread+0x49/0x76 > > [<ffffffff8020c899>] child_rip+0xa/0x11 > > [<ffffffff8020bd88>] ? restore_args+0x0/0x30 > > [<ffffffff80249d14>] ? kthread+0x0/0x76 > > [<ffffffff8020c88f>] ? child_rip+0x0/0x11 > > ---[ end trace 4eaa2a86a8e2da22 ]--- Well, not sure. Most likely candidate is the new block timer code. What seems to be happening is that the queue is being released with either an outstanding request (refcounting problem) or ticking timer with no work (block timer problem). The way scanning works is that we create a request queue for each device we probe and then delete it again if nothing appears after the bus settle time. The argument against this is that it should show up on every scanned bus. However, these are getting rarer; I was just about to write that I hadn't seen it when I remembered that all my SCSI testing systems are currently running hotplug reporting busses (i.e. don't do scanning). However, fortunately, I've also booted voyager recently which does use parallel SCSI and doesn't see this either, so it could also be megaraid_sas specific. Could you turn on SCSI logging so we can see the sequences. Probably since this is boot time, just enable all logging: echo 0xffffffff > /sys/module/scsi_mod/parameters/scsi_logging_level (kernel must be compiled with CONFIG_SCSI_LOGGING=y James ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-24 19:15 ` James Bottomley @ 2008-11-24 19:31 ` Thomas Gleixner 2008-11-24 21:35 ` Mike Anderson 0 siblings, 1 reply; 16+ messages in thread From: Thomas Gleixner @ 2008-11-24 19:31 UTC (permalink / raw) To: James Bottomley Cc: Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller, Jens Axboe, Mike Anderson On Mon, 24 Nov 2008, James Bottomley wrote: > On Mon, 2008-11-24 at 18:43 +0100, Thomas Gleixner wrote: > > > scsi0 : LSI SAS based MegaRAID driver > > > Driver 'sd' needs updating - please use bus_type methods > > > scsi 0:0:0:0: Direct-Access ATA SAMSUNG HE160HJ 0-24 PQ: 0 ANSI: 5 > > > ------------[ cut here ]------------ > > > WARNING: at lib/debugobjects.c:215 debug_print_object+0x4f/0x57() > > > ODEBUG: free active object type: timer_list > > > > That's the cause for your boot crash. The scsi/blk code is freeing a > > page which contains an active timer, so the timer code references gone > > memory. You triggered it because DEBUG_PAGEALLOC unmaps the page when > > it's freed. > > > > James, or other scsi experts please. > > Well, not sure. Most likely candidate is the new block timer code. > What seems to be happening is that the queue is being released with > either an outstanding request (refcounting problem) or ticking timer > with no work (block timer problem). The way scanning works is that we > create a request queue for each device we probe and then delete it again > if nothing appears after the bus settle time. The argument against > this is that it should show up on every scanned bus. However, these are > getting rarer; I was just about to write that I hadn't seen it when I > remembered that all my SCSI testing systems are currently running > hotplug reporting busses (i.e. don't do scanning). However, > fortunately, I've also booted voyager recently which does use parallel > SCSI and doesn't see this either, so it could also be megaraid_sas > specific. Yeah, block could it be as well. Jens, Mike ? One note about not seeing it: We have had such bugs before where the page was freed but not touched and the timer survived w/o tripping the system over. Alexander noticed because of DEBUG_PAGEALLOC and you can also see it by enabling debugobjects, which will give you the nice backtrace. CONFIG_DEBUG_OBJECTS=y CONFIG_DEBUG_OBJECTS_FREE=y CONFIG_DEBUG_OBJECTS_TIMERS=Y and add "debug_objects" to the kernel command line. > Could you turn on SCSI logging so we can see the sequences. Probably > since this is boot time, just enable all logging: > > echo 0xffffffff > /sys/module/scsi_mod/parameters/scsi_logging_level > > (kernel must be compiled with CONFIG_SCSI_LOGGING=y > > James > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-24 19:31 ` Thomas Gleixner @ 2008-11-24 21:35 ` Mike Anderson 2008-11-24 22:33 ` Thomas Gleixner 0 siblings, 1 reply; 16+ messages in thread From: Mike Anderson @ 2008-11-24 21:35 UTC (permalink / raw) To: Thomas Gleixner Cc: James Bottomley, Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller, Jens Axboe Thomas Gleixner <tglx@linutronix.de> wrote: > > Well, not sure. Most likely candidate is the new block timer code. > > What seems to be happening is that the queue is being released with > > either an outstanding request (refcounting problem) or ticking timer > > with no work (block timer problem). The way scanning works is that we > > create a request queue for each device we probe and then delete it again > > if nothing appears after the bus settle time. The argument against > > this is that it should show up on every scanned bus. However, these are > > getting rarer; I was just about to write that I hadn't seen it when I > > remembered that all my SCSI testing systems are currently running > > hotplug reporting busses (i.e. don't do scanning). However, > > fortunately, I've also booted voyager recently which does use parallel > > SCSI and doesn't see this either, so it could also be megaraid_sas > > specific. > > Yeah, block could it be as well. Jens, Mike ? I added a comment to bug 12020 on Thursday about a few other systems that where seeing the signature shown in bug 12020. It appeared from debug that there where a few paths that where adding timers for requests that where not expected. http://bugzilla.kernel.org/show_bug.cgi?id=12020 It would be good to know if the debug patch below effects your problem as while. If it does we need to investigated a solution to resolve not adding a timer for these requests. -andmike -- Michael Anderson andmike@linux.vnet.ibm.com blk: blk_add_timer debug patch [DEBUG] Debug only patch. Debug patch to blk_add_timer to not start timer for request that do not have the REQ_STARTED flag set. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> --- block/blk-timeout.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/block/blk-timeout.c b/block/blk-timeout.c index 69185ea..4389391 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -177,6 +177,9 @@ void blk_add_timer(struct request *req) BUG_ON(!list_empty(&req->timeout_list)); BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags)); + if (!(req->cmd_flags & REQ_STARTED)) + return; + if (req->timeout) req->deadline = jiffies + req->timeout; else { -- 1.5.6.5 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-24 21:35 ` Mike Anderson @ 2008-11-24 22:33 ` Thomas Gleixner 2008-11-24 23:42 ` malahal 2008-11-25 0:09 ` malahal 0 siblings, 2 replies; 16+ messages in thread From: Thomas Gleixner @ 2008-11-24 22:33 UTC (permalink / raw) To: Mike Anderson Cc: James Bottomley, Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller, Jens Axboe On Mon, 24 Nov 2008, Mike Anderson wrote: > Thomas Gleixner <tglx@linutronix.de> wrote: > > Yeah, block could it be as well. Jens, Mike ? > > I added a comment to bug 12020 on Thursday about a few other systems that > where seeing the signature shown in bug 12020. It appeared from debug that > there where a few paths that where adding timers for requests that where > not expected. > > http://bugzilla.kernel.org/show_bug.cgi?id=12020 > > It would be good to know if the debug patch below effects your problem as while. > > If it does we need to investigated a solution to resolve not adding a > timer for these requests. Wrong. The problem is not a timer which is armed in the first place. The problem is an armed timer which is not canceled before the data structure which contains it is freed. So not arming the timer will probably prevent this particular scan problem, but it does not solve the general wreckage of freeing a data structure with a possibly armed timer in it. You need to fix the code path which frees the data structure which contains the timer and cancel the timer _before_ freeing the data structure. Thanks, tglx ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-24 22:33 ` Thomas Gleixner @ 2008-11-24 23:42 ` malahal 2008-11-25 0:09 ` malahal 1 sibling, 0 replies; 16+ messages in thread From: malahal @ 2008-11-24 23:42 UTC (permalink / raw) To: Thomas Gleixner Cc: Mike Anderson, James Bottomley, Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller, Jens Axboe Thomas Gleixner [tglx@linutronix.de] wrote: > > where seeing the signature shown in bug 12020. It appeared from debug that > > there where a few paths that where adding timers for requests that where > > not expected. > > > > http://bugzilla.kernel.org/show_bug.cgi?id=12020 > > > > It would be good to know if the debug patch below effects your problem as while. > > > > If it does we need to investigated a solution to resolve not adding a > > timer for these requests. > > Wrong. > > The problem is not a timer which is armed in the first place. No, this could be a problem if such a timer is not dis-armed! As fas as I know, the queue timer will be dis-armed in end_that_request_last() if needed. Do we know end_that_request_last() gets called for every request queued? > The problem is an armed timer which is not canceled before the data > structure which contains it is freed. > > So not arming the timer will probably prevent this particular scan > problem, but it does not solve the general wreckage of freeing a data > structure with a possibly armed timer in it. > > You need to fix the code path which frees the data structure which > contains the timer and cancel the timer _before_ freeing the data > structure. Agreed but the timer is armed when a request is sent and is dis-armed when it is completed. Essentially there should NOT be any active timer(s) when you try to free the request queue. In other words, the code which frees the data structure (request queue) is correct and there is no need to cancel the timer there! --Malahal. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-24 22:33 ` Thomas Gleixner 2008-11-24 23:42 ` malahal @ 2008-11-25 0:09 ` malahal 2008-11-25 0:57 ` Stephen Rothwell 1 sibling, 1 reply; 16+ messages in thread From: malahal @ 2008-11-25 0:09 UTC (permalink / raw) To: Thomas Gleixner Cc: Mike Anderson, James Bottomley, Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller, Jens Axboe Thomas Gleixner [tglx@linutronix.de] wrote: > On Mon, 24 Nov 2008, Mike Anderson wrote: > > Thomas Gleixner <tglx@linutronix.de> wrote: > > > Yeah, block could it be as well. Jens, Mike ? > > > > I added a comment to bug 12020 on Thursday about a few other systems that > > where seeing the signature shown in bug 12020. It appeared from debug that > > there where a few paths that where adding timers for requests that where > > not expected. > > > > http://bugzilla.kernel.org/show_bug.cgi?id=12020 > > > > It would be good to know if the debug patch below effects your problem as while. > > > > If it does we need to investigated a solution to resolve not adding a > > timer for these requests. The block timer code calls del_timer(), should it call del_timer_sync()? It is possible although unlikely that you are hitting del_timer_sync vs del_timer problem in the block timeout code. Can only be seen on SMP systems though! --Malahal. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-25 0:09 ` malahal @ 2008-11-25 0:57 ` Stephen Rothwell 2008-11-25 2:08 ` malahal 0 siblings, 1 reply; 16+ messages in thread From: Stephen Rothwell @ 2008-11-25 0:57 UTC (permalink / raw) To: malahal Cc: Thomas Gleixner, Mike Anderson, James Bottomley, Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 1786 bytes --] On Mon, 24 Nov 2008 16:09:02 -0800 malahal@us.ibm.com wrote: > > Thomas Gleixner [tglx@linutronix.de] wrote: > > On Mon, 24 Nov 2008, Mike Anderson wrote: > > > Thomas Gleixner <tglx@linutronix.de> wrote: > > > > Yeah, block could it be as well. Jens, Mike ? > > > > > > I added a comment to bug 12020 on Thursday about a few other systems that > > > where seeing the signature shown in bug 12020. It appeared from debug that > > > there where a few paths that where adding timers for requests that where > > > not expected. > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=12020 > > > > > > It would be good to know if the debug patch below effects your problem as while. > > > > > > If it does we need to investigated a solution to resolve not adding a > > > timer for these requests. > > The block timer code calls del_timer(), should it call del_timer_sync()? > It is possible although unlikely that you are hitting del_timer_sync vs > del_timer problem in the block timeout code. Can only be seen on SMP > systems though! Is this still a problem in next-20081121? In that tree, the block commit "block: leave the request timeout timer running even on an empty list" was changed to add this: diff --git a/block/blk-core.c b/block/blk-core.c index 04267d6..44f547c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue); void blk_sync_queue(struct request_queue *q) { del_timer_sync(&q->unplug_timer); + del_timer_sync(&q->timeout); kblockd_flush_work(&q->unplug_work); } EXPORT_SYMBOL(blk_sync_queue); After I spent some time bisecting a boot failure in PowerPC. -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/ [-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-25 0:57 ` Stephen Rothwell @ 2008-11-25 2:08 ` malahal 2008-11-25 8:51 ` Jens Axboe 0 siblings, 1 reply; 16+ messages in thread From: malahal @ 2008-11-25 2:08 UTC (permalink / raw) To: Stephen Rothwell Cc: Thomas Gleixner, Mike Anderson, James Bottomley, Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller, Jens Axboe Stephen Rothwell [sfr@canb.auug.org.au] wrote: > > The block timer code calls del_timer(), should it call del_timer_sync()? > > It is possible although unlikely that you are hitting del_timer_sync vs > > del_timer problem in the block timeout code. Can only be seen on SMP > > systems though! > > Is this still a problem in next-20081121? In that tree, the block commit > "block: leave the request timeout timer running even on an empty list" > was changed to add this: > > diff --git a/block/blk-core.c b/block/blk-core.c > index 04267d6..44f547c 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue); > void blk_sync_queue(struct request_queue *q) > { > del_timer_sync(&q->unplug_timer); > + del_timer_sync(&q->timeout); > kblockd_flush_work(&q->unplug_work); > } > EXPORT_SYMBOL(blk_sync_queue); I was looking at the Linux tree. Clearly same problem doesn't exist with the above commit! I wonder why kblockd_flush_work() is called after the del_timer_sync(). It makes sense to cancel the work and then shutdown the timer(s). I doubt if you are running into this problem though. -Malahal. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-25 2:08 ` malahal @ 2008-11-25 8:51 ` Jens Axboe 2008-11-25 16:59 ` malahal 0 siblings, 1 reply; 16+ messages in thread From: Jens Axboe @ 2008-11-25 8:51 UTC (permalink / raw) To: malahal Cc: Stephen Rothwell, Thomas Gleixner, Mike Anderson, James Bottomley, Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller On Mon, Nov 24 2008, malahal@us.ibm.com wrote: > Stephen Rothwell [sfr@canb.auug.org.au] wrote: > > > The block timer code calls del_timer(), should it call del_timer_sync()? > > > It is possible although unlikely that you are hitting del_timer_sync vs > > > del_timer problem in the block timeout code. Can only be seen on SMP > > > systems though! > > > > Is this still a problem in next-20081121? In that tree, the block commit > > "block: leave the request timeout timer running even on an empty list" > > was changed to add this: > > > > diff --git a/block/blk-core.c b/block/blk-core.c > > index 04267d6..44f547c 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue); > > void blk_sync_queue(struct request_queue *q) > > { > > del_timer_sync(&q->unplug_timer); > > + del_timer_sync(&q->timeout); > > kblockd_flush_work(&q->unplug_work); > > } > > EXPORT_SYMBOL(blk_sync_queue); > > I was looking at the Linux tree. Clearly same problem doesn't exist with > the above commit! I wonder why kblockd_flush_work() is called after the > del_timer_sync(). It makes sense to cancel the work and then shutdown > the timer(s). I doubt if you are running into this problem though. If the kernel tested doesn't include the above fix, it'll surely go boom. Can someone verify that this is the case? -- Jens Axboe ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-25 8:51 ` Jens Axboe @ 2008-11-25 16:59 ` malahal 2008-11-25 17:14 ` Alexander Beregalov 0 siblings, 1 reply; 16+ messages in thread From: malahal @ 2008-11-25 16:59 UTC (permalink / raw) To: Jens Axboe Cc: Stephen Rothwell, Thomas Gleixner, Mike Anderson, James Bottomley, Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller Jens Axboe [jens.axboe@oracle.com] wrote: > On Mon, Nov 24 2008, malahal@us.ibm.com wrote: > > Stephen Rothwell [sfr@canb.auug.org.au] wrote: > > > > The block timer code calls del_timer(), should it call del_timer_sync()? > > > > It is possible although unlikely that you are hitting del_timer_sync vs > > > > del_timer problem in the block timeout code. Can only be seen on SMP > > > > systems though! > > > > > > Is this still a problem in next-20081121? In that tree, the block commit > > > "block: leave the request timeout timer running even on an empty list" > > > was changed to add this: > > > > > > diff --git a/block/blk-core.c b/block/blk-core.c > > > index 04267d6..44f547c 100644 > > > --- a/block/blk-core.c > > > +++ b/block/blk-core.c > > > @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue); > > > void blk_sync_queue(struct request_queue *q) > > > { > > > del_timer_sync(&q->unplug_timer); > > > + del_timer_sync(&q->timeout); > > > kblockd_flush_work(&q->unplug_work); > > > } > > > EXPORT_SYMBOL(blk_sync_queue); > > > > I was looking at the Linux tree. Clearly same problem doesn't exist with > > the above commit! I wonder why kblockd_flush_work() is called after the > > del_timer_sync(). It makes sense to cancel the work and then shutdown > > the timer(s). I doubt if you are running into this problem though. > > If the kernel tested doesn't include the above fix, it'll surely go > boom. Can someone verify that this is the case? Just looked, next-20081119 doesn't have the above fix. It is included in next-20081120. Also note that the above fix is only partially copied, there is other part that removed deleting the timer when there are no outstanding requests. --Malahal. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-25 16:59 ` malahal @ 2008-11-25 17:14 ` Alexander Beregalov 2008-11-25 17:43 ` Jens Axboe 0 siblings, 1 reply; 16+ messages in thread From: Alexander Beregalov @ 2008-11-25 17:14 UTC (permalink / raw) To: Jens Axboe, Stephen Rothwell, Thomas Gleixner, Mike Anderson, James Bottomley, Alexander Beregalov, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller 2008/11/25 <malahal@us.ibm.com>: > Jens Axboe [jens.axboe@oracle.com] wrote: >> On Mon, Nov 24 2008, malahal@us.ibm.com wrote: >> > Stephen Rothwell [sfr@canb.auug.org.au] wrote: >> > > > The block timer code calls del_timer(), should it call del_timer_sync()? >> > > > It is possible although unlikely that you are hitting del_timer_sync vs >> > > > del_timer problem in the block timeout code. Can only be seen on SMP >> > > > systems though! >> > > >> > > Is this still a problem in next-20081121? In that tree, the block commit >> > > "block: leave the request timeout timer running even on an empty list" >> > > was changed to add this: >> > > >> > > diff --git a/block/blk-core.c b/block/blk-core.c >> > > index 04267d6..44f547c 100644 >> > > --- a/block/blk-core.c >> > > +++ b/block/blk-core.c >> > > @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue); >> > > void blk_sync_queue(struct request_queue *q) >> > > { >> > > del_timer_sync(&q->unplug_timer); >> > > + del_timer_sync(&q->timeout); >> > > kblockd_flush_work(&q->unplug_work); >> > > } >> > > EXPORT_SYMBOL(blk_sync_queue); >> > >> > I was looking at the Linux tree. Clearly same problem doesn't exist with >> > the above commit! I wonder why kblockd_flush_work() is called after the >> > del_timer_sync(). It makes sense to cancel the work and then shutdown >> > the timer(s). I doubt if you are running into this problem though. >> >> If the kernel tested doesn't include the above fix, it'll surely go >> boom. Can someone verify that this is the case? > > Just looked, next-20081119 doesn't have the above fix. It is included in > next-20081120. Also note that the above fix is only partially copied, > there is other part that removed deleting the timer when there are no > outstanding requests. > Yes, I can not reproduce it anymore on linux-next 1121 and newer. (I did not try 1120) It seems the fix works pretty good. Is it still needed and reasonable to investigate the problem on next-20081119? Unfortunately I do not have much time for it. All these problems have gone away on next-1125 except ODEBUG warning on HPET. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: next-20081119: general protection fault: get_next_timer_interrupt() 2008-11-25 17:14 ` Alexander Beregalov @ 2008-11-25 17:43 ` Jens Axboe 0 siblings, 0 replies; 16+ messages in thread From: Jens Axboe @ 2008-11-25 17:43 UTC (permalink / raw) To: Alexander Beregalov Cc: Stephen Rothwell, Thomas Gleixner, Mike Anderson, James Bottomley, LKML, linux-next, Ingo Molnar, linux-scsi, David Miller On Tue, Nov 25 2008, Alexander Beregalov wrote: > 2008/11/25 <malahal@us.ibm.com>: > > Jens Axboe [jens.axboe@oracle.com] wrote: > >> On Mon, Nov 24 2008, malahal@us.ibm.com wrote: > >> > Stephen Rothwell [sfr@canb.auug.org.au] wrote: > >> > > > The block timer code calls del_timer(), should it call del_timer_sync()? > >> > > > It is possible although unlikely that you are hitting del_timer_sync vs > >> > > > del_timer problem in the block timeout code. Can only be seen on SMP > >> > > > systems though! > >> > > > >> > > Is this still a problem in next-20081121? In that tree, the block commit > >> > > "block: leave the request timeout timer running even on an empty list" > >> > > was changed to add this: > >> > > > >> > > diff --git a/block/blk-core.c b/block/blk-core.c > >> > > index 04267d6..44f547c 100644 > >> > > --- a/block/blk-core.c > >> > > +++ b/block/blk-core.c > >> > > @@ -391,6 +391,7 @@ EXPORT_SYMBOL(blk_stop_queue); > >> > > void blk_sync_queue(struct request_queue *q) > >> > > { > >> > > del_timer_sync(&q->unplug_timer); > >> > > + del_timer_sync(&q->timeout); > >> > > kblockd_flush_work(&q->unplug_work); > >> > > } > >> > > EXPORT_SYMBOL(blk_sync_queue); > >> > > >> > I was looking at the Linux tree. Clearly same problem doesn't exist with > >> > the above commit! I wonder why kblockd_flush_work() is called after the > >> > del_timer_sync(). It makes sense to cancel the work and then shutdown > >> > the timer(s). I doubt if you are running into this problem though. > >> > >> If the kernel tested doesn't include the above fix, it'll surely go > >> boom. Can someone verify that this is the case? > > > > Just looked, next-20081119 doesn't have the above fix. It is included in > > next-20081120. Also note that the above fix is only partially copied, > > there is other part that removed deleting the timer when there are no > > outstanding requests. > > > Yes, I can not reproduce it anymore on linux-next 1121 and newer. (I > did not try 1120) It seems the fix works pretty good. Is it still > needed and reasonable to investigate the problem on next-20081119? > Unfortunately I do not have much time for it. No, you don't have to investigate further. This was a known bug that is fixed in -next and mainline basically right after next-20081119. > > All these problems have gone away on next-1125 except ODEBUG warning > on HPET. -- Jens Axboe ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2008-11-25 17:45 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-11-19 15:14 next-20081119: general protection fault: get_next_timer_interrupt() Alexander Beregalov 2008-11-19 21:14 ` Thomas Gleixner 2008-11-21 10:50 ` Alexander Beregalov 2008-11-24 17:43 ` Thomas Gleixner 2008-11-24 19:15 ` James Bottomley 2008-11-24 19:31 ` Thomas Gleixner 2008-11-24 21:35 ` Mike Anderson 2008-11-24 22:33 ` Thomas Gleixner 2008-11-24 23:42 ` malahal 2008-11-25 0:09 ` malahal 2008-11-25 0:57 ` Stephen Rothwell 2008-11-25 2:08 ` malahal 2008-11-25 8:51 ` Jens Axboe 2008-11-25 16:59 ` malahal 2008-11-25 17:14 ` Alexander Beregalov 2008-11-25 17:43 ` Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox