From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Morton
Subject: Re: 2.6.29-rc3: BUG: NMI Watchdog detected LOCKUP
Date: Thu, 12 Feb 2009 16:19:08 -0800
Message-ID: <20090212161908.2cc2045c.akpm@linux-foundation.org>
References: <19f34abd0902080221w662635f0h51875a125b156535@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
Received: from smtp1.linux-foundation.org ([140.211.169.13]:36476 "EHLO smtp1.linux-foundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752010AbZBMATr (ORCPT ); Thu, 12 Feb 2009 19:19:47 -0500
In-Reply-To: <19f34abd0902080221w662635f0h51875a125b156535@mail.gmail.com>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Vegard Nossum
Cc: linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org, Jens Axboe, linux-scsi@vger.kernel.org

On Sun, 8 Feb 2009 11:21:20 +0100 Vegard Nossum wrote:

> Hi,
>
> Not sure exactly what happened here. Was running LTP, and it seems
> that the USB flash disk (which held the root device, though I was
> running LTP in a chroot on a fixed harddisk) disconnect, although I
> didn't touch it.
>
> [ 3344.890073] usb 1-6: unregistering interface 1-6:1.0
> [ 3344.895744] sd 2:0:0:0: Device offlined - not ready after error recovery
> [ 3344.902893] sd 2:0:0:0: [sdb] Unhandled error code
> [ 3344.908051] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
> [ 3344.916810] end_request: I/O error, dev sdb, sector 1735619
> [ 3344.922746] Write-error on swap-device (8:16:1735627)
> [ 3344.928195] Write-error on swap-device (8:16:1735635)
> [ 3344.933611] Write-error on swap-device (8:16:1735643)
> [ 3344.939020] Write-error on swap-device (8:16:1735651)
> [ 3344.944427] Write-error on swap-device (8:16:1735659)
> [ 3344.949836] Write-error on swap-device (8:16:1735667)
> [ 3344.955320] Write-error on swap-device (8:16:1735675)
> [ 3344.960757] sd 2:0:0:0: rejecting I/O to offline device
> [ 3344.961735] sd 2:0:0:0: rejecting I/O to offline device

Presumably the device layer (USB or SCSI) shat itself. Bad hardware or a
kernel bug?

> [ 3344.972984] BUG: NMI Watchdog detected LOCKUP on CPU1, ip ffffffff81491f02, :
> [ 3344.972984] CPU 1
> [ 3344.972984] Modules linked in:
> [ 3344.972984] Pid: 11127, comm: hackbench Not tainted 2.6.29-rc3 #219
> [ 3344.972984] RIP: 0010:[] [] _spin_lock_b
> [ 3344.972984] RSP: 0018:ffff880006b01408 EFLAGS: 00000093
> [ 3344.972984] RAX: 0000000000003b39 RBX: 0000000000000001 RCX: 6db6db6db6db6db7
> [ 3344.972984] RDX: ffff88003ec688d8 RSI: ffff880006b01428 RDI: ffff88003ec68b40
> [ 3344.972984] RBP: ffff880006b01408 R08: b000000000000000 R09: 0000000000000000
> [ 3344.972984] R10: ffff880006b01918 R11: 0000000000000000 R12: ffff88003ec688d8
> [ 3344.972984] R13: 0000000000001000 R14: 00000000001aeeb3 R15: ffff88003ec688d8
> [ 3344.972984] FS: 0000000000000000(0000) GS:ffff88003f801a80(0063) knlGS:00000
> [ 3344.972984] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
> [ 3344.972984] CR2: 0000000000b9dea0 CR3: 0000000006ae3000 CR4: 00000000000006a0
> [ 3344.972984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3344.972984] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 3344.972984] Process hackbench (pid: 11127, threadinfo ffff880006b00000, task)
> [ 3344.972984] Stack:
> [ 3344.972984] ffff880006b01468 ffffffff8118d26a ffff88001f7e8000 0000000000001
> [ 3344.972984] ffff88001bc33500 0001121000000010 0000000000000047 ffff88001bc30
> [ 3344.972984] ffff88001bc33500 ffff88003ec688d8 00000000001aeeb3 ffff88003ec68
> [ 3344.972984] Call Trace:
> [ 3344.972984] [] __make_request+0x3e/0x412
> [ 3344.972984] [] generic_make_request+0x279/0x2c3
> [ 3344.972984] [] ? radix_tree_tag_set+0x6b/0xce
> [ 3344.972984] [] submit_bio+0xc6/0xcf
> [ 3344.972984] [] ? unlock_page+0x22/0x26
> [ 3344.972984] [] swap_writepage+0xa2/0xac
> [ 3344.972984] [] shrink_page_list+0x3a7/0x67b
> [ 3344.972984] [] ? finish_task_switch+0x68/0x88
> [ 3344.972984] [] ? __cpus_empty+0x9/0xb
> [ 3344.972984] [] ? flush_tlb_page+0x66/0x83
> [ 3344.972984] [] ? thread_return+0x3d/0xc6
> [ 3344.972984] [] shrink_list+0x29d/0x59f
> [ 3344.972984] [] ? get_dirty_limits+0x22/0x24a
> [ 3344.972984] [] shrink_zone+0x281/0x32b
> [ 3344.972984] [] ? __up_read+0x92/0x9c
> [ 3344.972984] [] ? shrink_slab+0x146/0x158
> [ 3344.972984] [] try_to_free_pages+0x23d/0x38f
> [ 3344.972984] [] ? isolate_pages_global+0x0/0x219
> [ 3344.972984] [] __alloc_pages_internal+0x292/0x43d
> [ 3344.972984] [] alloc_pages_current+0xb9/0xc2
> [ 3344.972984] [] alloc_slab_page+0x19/0x69
> [ 3344.972984] [] new_slab+0x49/0x1cc
> [ 3344.972984] [] ? rb_insert_color+0xbd/0xe6
> [ 3344.972984] [] __slab_alloc+0x1f3/0x36c
> [ 3344.972984] [] ? __alloc_skb+0x42/0x130
> [ 3344.972984] [] ? __alloc_skb+0x42/0x130
> [ 3344.972984] [] kmem_cache_alloc_node+0x69/0xa2
> [ 3344.972984] [] __alloc_skb+0x42/0x130
> [ 3344.972984] [] sock_alloc_send_skb+0xa1/0x200
> [ 3344.972984] [] ? security_socket_getpeersec_dgram+0x11/0x3
> [ 3344.972984] [] unix_stream_sendmsg+0x138/0x2b5
> [ 3344.972984] [] __sock_sendmsg+0x59/0x62
> [ 3344.972984] [] sock_aio_write+0xe8/0xf8
> [ 3344.972984] [] do_sync_write+0xe7/0x12d
> [ 3344.972984] [] ? autoremove_wake_function+0x0/0x38
> [ 3344.972984] [] ? selinux_file_permission+0xbd/0xc6
> [ 3344.972984] [] ? security_file_permission+0x11/0x13
> [ 3344.972984] [] vfs_write+0xbe/0x105
> [ 3344.972984] [] sys_write+0x47/0x6f
> [ 3344.972984] [] sysenter_dispatch+0x7/0x27
> [ 3344.972984] Code: 01 00 00 f0 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c9 c
> [ 3344.972984] BUG: NMI Watchdog detected LOCKUP<4>---[ end trace 820f38a7b2441-
> [ 3344.972984] on CPU0, ip ffffffff81491f6c, registers:

And then the block layer died.

Looks like it was trying to take the queue lock, probably against the
recently-offlined device.

I'd say that either someone forgot to release the lock on an error path,
or the structure was freed but the kernel still tries to use it.
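
To make the first of those two guesses concrete, here is a minimal userspace sketch of a lock leaked on an error path. It is not kernel code and is not taken from the block or SCSI layers: `toy_queue`, `toy_submit_buggy()`, `toy_submit_fixed()` and the other `toy_*` names are all hypothetical, and the spinlock is a C11 `atomic_flag` stand-in rather than the kernel's `spinlock_t`. The point is only that once the error path returns with the lock held, the next caller to spin on it never gets out, which is the kind of thing the NMI watchdog would report.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request queue; NOT the kernel's structure. */
struct toy_queue {
	atomic_flag lock;	/* set means "held" */
	bool offline;		/* device has gone away */
	int queued;		/* requests accepted so far */
};

/* Spin until the flag is acquired, like a very naive spinlock. */
void toy_lock(struct toy_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* busy-wait; spins forever if the lock was leaked */
}

void toy_unlock(struct toy_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Probe the lock once without spinning; true if somebody holds it. */
bool toy_lock_is_held(struct toy_queue *q)
{
	if (!atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire)) {
		toy_unlock(q);	/* we got it, so it was free: put it back */
		return false;
	}
	return true;
}

/* Buggy variant: the error path returns with the lock still held. */
int toy_submit_buggy(struct toy_queue *q)
{
	assert(q != NULL);
	toy_lock(q);
	if (q->offline)
		return -EIO;	/* BUG: missing toy_unlock() */
	q->queued++;
	toy_unlock(q);
	return 0;
}

/* Fixed variant: every exit goes through a single unlock label. */
int toy_submit_fixed(struct toy_queue *q)
{
	int ret = 0;

	assert(q != NULL);
	toy_lock(q);
	if (q->offline) {
		ret = -EIO;
		goto out;
	}
	q->queued++;
out:
	toy_unlock(q);
	return ret;
}
```

With `offline` set, one call to `toy_submit_buggy()` returns `-EIO` and leaves the flag set, so any later `toy_lock()` on that queue spins forever; `toy_submit_fixed()` returns the same error but leaves the lock free. The second guess (use after free) would look different: the lock word itself would be whatever happens to be in the freed memory.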