* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Anton Altaparmakov @ 2006-09-21 9:54 UTC
To: Jonathan Woithe; +Cc: linux-kernel, Andrew Morton

Hi,

On Tue, 2006-09-12 at 20:56 -0700, Andrew Morton wrote:

Andrew, thanks for forwarding me the message...

> Begin forwarded message:
>
> Date: Wed, 13 Sep 2006 10:05:42 +0930 (CST)
> From: Jonathan Woithe <jwoithe@physics.adelaide.edu.au>
> To: linux-kernel@vger.kernel.org
> Cc: jwoithe@physics.adelaide.edu.au (Jonathan Woithe)
> Subject: 2.6.17 oops, possibly ntfs/mmap related
>
>
> We have a machine which is currently making heavy use of a usb hard disc
> formatted with ntfs. There have been two occasions where the kernel has
> oopsed while this disc was being accessed heavily. Before adding this HDD
> the machine in question was rock solid which leads me to think that it
> might be related to ntfs. USB drives formatted with other filesystems do
> not appear to suffer from this problem.
>
> Unfortunately bogofilter considers the oops reports as spam so I cannot post
> them to the list. I have instead put the full text of my original post
> regarding this topic on the web at
>
> http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt
>
> I'm happy to try things to narrow down the cause if it will help.
>
> Please CC me on reply.

These were the oopses from above text url:

> The first oops caused the machine to totally lock up:
>
> BUG: unable to handle kernel paging request at virtual address e4004de0
> printing eip:
> c012de0c
> *pde = 00000000
> Oops: 0000 [#1]
> Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
> CPU: 0
> EIP: 0060:[<c012de0c>] Not tainted VLI
> EFLAGS: 00010082 (2.6.17 #2)
> EIP is at find_get_page+0x11/0x22
> eax: e4004de0 ebx: c02f01a8 ecx: e4004de0 edx: e4004de0
> esi: 00000000 edi: 00000066 ebp: cfc20574 esp: c770bee8
> ds: 007b es: 007b ss: 0068
> Process sh (pid: 10467, threadinfo=c770a000 task=cfa495c0)
> Stack: c012ea09 00000002 00000000 cfc204d8 cff882a4 cff88260 c770bf30 c5962544
>        c02f01a8 00000000 ceec10a0 080aef10 c01385d1 00000000 cfc20574 080aef10
>        c5962544 ceec10a0 00000002 c7aeb080 080aef10 ceec10a0 080aef10 c013886a
> Call Trace:
> <c012ea09> filemap_nopage+0x98/0x2b2 <c01385d1> do_no_page+0x6d/0x1e1
> <c013886a> __handle_mm_fault+0xc4/0x162 <c0112190> do_page_fault+0x23e/0x56b
> <c01c43c1> copy_to_user+0x41/0x49 <c0111f52> do_page_fault+0x0/0x56b
> <c010342f> error_code+0x4f/0x54
> Code: a0 fe ff ff 89 ea b9 e2 d7 12 c0 6a 02 e8 5f ec 15 00 83 c4 44 5b 5e 5f 5d c3 fa 83 c0 04 e8 2c 3f 09 00 85 c0 89 c1 74 0f 89 c2 <8b> 00 f6 c4 40 74 03 8b 51 0c ff 42 04 fb 89 c8 c3 fa 83 c0 04
>
>
> In the case of the second oops the machine was still partially usable and a
> clean shutdown was possible. However, services such as sshd were no longer
> responding.
>
> BUG: unable to handle kernel paging request at virtual address 0010c744
> printing eip:
> c013be50
> *pde = 00000000
> Oops: 0002 [#1]
> Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
> CPU: 0
> EIP: 0060:[<c013be50>] Tainted: G M VLI
> EFLAGS: 00010282 (2.6.17 #2)
> EIP is at anon_vma_unlink+0x16/0x3c
> eax: 0010c740 ebx: cf1070cc ecx: cf107104 edx: cf8bc740
> esi: cf8bc740 edi: b7e82000 ebp: 00000000 esp: cdad7f58
> ds: 007b es: 007b ss: 0068
> Process sh (pid: 20272, threadinfo=cdad6000 task=c0d8d580)
> Stack: cf1070cc cf61f3e4 c0136b5f cdad7f80 c4084b74 cf8b5860 00000001 00000000
>        c013ab92 00000000 c0371b7c 000000b9 cf8b5860 c0d8d580 c01145dd cdad6000
>        c0118187 cdad6000 00000000 00000000 cdad6000 c0118380 00000000 b7f9968c
> Call Trace:
> <c0136b5f> free_pgtables+0x41/0x82 <c013ab92> exit_mmap+0x6a/0xb8
> <c01145dd> mmput+0x1b/0x5e <c0118187> do_exit+0x14e/0x2d1
> <c0118380> sys_exit_group+0x0/0xd <c010299b> syscall_call+0x7/0xb
> Code: c9 74 10 8b 11 8d 40 38 89 42 04 89 53 38 89 48 04 89 01 5b c3 56 53 8b 70 40 89 c3 85 f6 74 2e 8d 48 38 8b 40 38 8b 51 04 89 02 <89> 50 04 c7 43 38 00 01 10 00 39 36 c7 41 04 00 02 20 00 75 0e
> EIP: [<c013be50>] anon_vma_unlink+0x16/0x3c SS:ESP 0068:cdad7f58
> <1>Fixing recursive fault but reboot is needed!
>
> I'm not entirely sure why the kernel considered itself tainted in the second
> oops and not in the first - the setup hadn't changed and precisely the same
> kernel modules were loaded. This machine does not have any external (ie:
> out-of-tree) modules installed.

Weird. The traces do not include NTFS at all in them so I have no idea
if NTFS has anything to do with this or not...

Anyone have any idea what these oopses mean?!?

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://www.linux-ntfs.org/ & http://www-stu.christs.cam.ac.uk/~aia21/

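
A note for readers decoding the two reports quoted above: the hex number
printed after "Oops:" is the page-fault error code. The sketch below
assumes the usual i386 meaning of its low three bits (bit 0: protection
violation rather than page not present, bit 1: write access, bit 2: fault
taken in user mode). The names pf_error and decode_pf_error are made up
for this illustration and are not kernel interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the low three bits of an x86 page-fault error code,
 * i.e. the hex value printed after "Oops:" in an oops report.
 * Hypothetical helper type, not taken from the kernel sources. */
struct pf_error {
    bool protection; /* bit 0: 1 = protection violation, 0 = page not present */
    bool write;      /* bit 1: 1 = write access, 0 = read access */
    bool user;       /* bit 2: 1 = fault in user mode, 0 = kernel mode */
};

/* Split the error code into its three flag bits. */
struct pf_error decode_pf_error(uint32_t code)
{
    struct pf_error e;

    e.protection = (code & 0x1u) != 0;
    e.write = (code & 0x2u) != 0;
    e.user = (code & 0x4u) != 0;
    return e;
}
```

Read this way, "Oops: 0000" in the first report would be a kernel-mode
read of a page that is not present, and "Oops: 0002" in the second a
kernel-mode write to a page that is not present.
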
* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Anton Altaparmakov @ 2006-09-21 14:41 UTC
To: Jonathan Woithe; +Cc: linux-kernel, Andrew Morton

Hi,

On Thu, 2006-09-21 at 10:54 +0100, Anton Altaparmakov wrote:
> On Tue, 2006-09-12 at 20:56 -0700, Andrew Morton wrote:
> Andrew, thanks for forwarding me the message...
> > Begin forwarded message:
> >
> > We have a machine which is currently making heavy use of a usb hard disc
> > formatted with ntfs. There have been two occasions where the kernel has
> > oopsed while this disc was being accessed heavily. Before adding this HDD
> > the machine in question was rock solid which leads me to think that it
> > might be related to ntfs. USB drives formatted with other filesystems do
> > not appear to suffer from this problem.

I have now seen such an oops too with 2.6.18 kernel. Note no NTFS file
systems were mounted at the time (but I had an NTFS file system mounted
earlier in the day).

The oops is caused by kswapd0 kernel thread, the stack trace is:

Call Trace:
 [<c10470a3>] shrink_inactive_list+0x46b/0x790
 [<c104747c>] shrink_zone+0xb4/0xd3
 [<c104797d>] kswapd+0x2de/0x3cf
 [<c102c18e>] kthread+0xc2/0xf0
 [<c1000bf1>] kernel_thread_helper+0x5/0xb
DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Leftover inexact backtrace:
 [<c1003e6c>] show_stack_log_lvl+0x8c/0x97
 [<c1003fc8>] show_registers+0x151/0x1c6
 [<c10041af>] die+0x172/0x27b
 [<c145f22c>] do_page_fault+0x42c/0x4f9
 [<c10037dd>] error_code+0x39/0x40
 [<c10470a3>] shrink_inactive_list+0x46b/0x790
 [<c104747c>] shrink_zone+0xb4/0xd3
 [<c104797d>] kswapd+0x2de/0x3cf
 [<c102c18e>] kthread+0xc2/0xf0
 [<c1000bf1>] kernel_thread_helper+0x5/0xb

And the EIP is at fs/buffer.c::try_to_release_page() the code of which
is here:

int try_to_release_page(struct page *page, gfp_t gfp_mask)
{
        struct address_space * const mapping = page->mapping;

        BUG_ON(!PageLocked(page));
        if (PageWriteback(page))
                return 0;

        if (mapping && mapping->a_ops->releasepage)

^^^ bug happens here when the value of mapping->a_ops is used to obtain
mapping->a_ops->releasepage

                return mapping->a_ops->releasepage(page, gfp_mask);
        return try_to_free_buffers(page);
}

This bug seems to suggest that there is a page which the kernel is
trying to release private data which has page->mapping set to a valid
value and page->mapping->a_ops apparently set to an invalid value and
when page->mapping->a_ops->releasepage is dereferenced it causes an oops
with the kernel saying:

BUG: unable to handle kernel paging request at virtual address 020030d2

The values of the relevant variables from the oops are:

page = 0xc2248fa0
page->mapping = 0xe3a79eac
page->mapping->a_ops = 0x020030aa

Note that 0x020030aa+0x28 = 020030d2 which is the oops causing address
and 0x28 is the offset of the releasepage function pointer in the
address space operations structure...

This oops is not identical to the oopses pointed out by Jonathan at:

http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt

But those oopses have to do with pages also so could be related...

Anyone have any ideas how a page can end up in such a weird state?

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://www.linux-ntfs.org/ & http://www-stu.christs.cam.ac.uk/~aia21/

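
The address arithmetic in the message above (a_ops value plus the 0x28
offset of releasepage gives the faulting address) can be checked with a
few lines of C. This is only a sketch: member_address, pointer_slot and
is_word_aligned are hypothetical helper names, and the 4-byte pointer
size is an assumption that matches a 32-bit i386 kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Address touched when loading a member at byte offset `off` from a
 * structure pointer `base`, using 32-bit arithmetic as on i386. */
uint32_t member_address(uint32_t base, uint32_t off)
{
    return base + off;
}

/* Zero-based slot index of a member in a table of 4-byte pointers
 * (assumes 32-bit pointers). */
uint32_t pointer_slot(uint32_t off)
{
    return off / 4u;
}

/* True if `addr` is a multiple of 4, as a pointer to a table of 32-bit
 * function pointers would normally be. */
bool is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0;
}
```

With the reported values, member_address(0x020030aa, 0x28) gives
0x020030d2, the address in the BUG line. The a_ops value 0x020030aa is
also not 4-byte aligned, which on its own would be unusual for a pointer
to an operations table.
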
* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Andrew Morton @ 2006-09-21 17:52 UTC
To: Anton Altaparmakov; +Cc: Jonathan Woithe, linux-kernel

On Thu, 21 Sep 2006 15:41:36 +0100
Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> Hi,
>
> On Thu, 2006-09-21 at 10:54 +0100, Anton Altaparmakov wrote:
> > On Tue, 2006-09-12 at 20:56 -0700, Andrew Morton wrote:
> > Andrew, thanks for forwarding me the message...
> > > Begin forwarded message:
> > >
> > > We have a machine which is currently making heavy use of a usb hard disc
> > > formatted with ntfs. There have been two occasions where the kernel has
> > > oopsed while this disc was being accessed heavily. Before adding this HDD
> > > the machine in question was rock solid which leads me to think that it
> > > might be related to ntfs. USB drives formatted with other filesystems do
> > > not appear to suffer from this problem.
>
> I have now seen such an oops too with 2.6.18 kernel.

I assume it is a once-off?

> Note no NTFS file
> systems were mounted at the time (but I had an NTFS file system mounted
> earlier in the day).
>
> The oops is caused by kswapd0 kernel thread, the stack trace is:
>
> Call Trace:
>  [<c10470a3>] shrink_inactive_list+0x46b/0x790
>  [<c104747c>] shrink_zone+0xb4/0xd3
>  [<c104797d>] kswapd+0x2de/0x3cf
>  [<c102c18e>] kthread+0xc2/0xf0
>  [<c1000bf1>] kernel_thread_helper+0x5/0xb
> DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
> Leftover inexact backtrace:
>  [<c1003e6c>] show_stack_log_lvl+0x8c/0x97
>  [<c1003fc8>] show_registers+0x151/0x1c6
>  [<c10041af>] die+0x172/0x27b
>  [<c145f22c>] do_page_fault+0x42c/0x4f9
>  [<c10037dd>] error_code+0x39/0x40
>  [<c10470a3>] shrink_inactive_list+0x46b/0x790
>  [<c104747c>] shrink_zone+0xb4/0xd3
>  [<c104797d>] kswapd+0x2de/0x3cf
>  [<c102c18e>] kthread+0xc2/0xf0
>  [<c1000bf1>] kernel_thread_helper+0x5/0xb
>
> And the EIP is at fs/buffer.c::try_to_release_page() the code of which
> is here:
>
> int try_to_release_page(struct page *page, gfp_t gfp_mask)
> {
>         struct address_space * const mapping = page->mapping;
>
>         BUG_ON(!PageLocked(page));
>         if (PageWriteback(page))
>                 return 0;
>
>         if (mapping && mapping->a_ops->releasepage)
>
> ^^^ bug happens here when the value of mapping->a_ops is used to obtain
> mapping->a_ops->releasepage
>
>                 return mapping->a_ops->releasepage(page, gfp_mask);
>         return try_to_free_buffers(page);
> }
>
> This bug seems to suggest that there is a page which the kernel is
> trying to release private data which has page->mapping set to a valid
> value and page->mapping->a_ops apparently set to an invalid value and
> when page->mapping->a_ops->releasepage is dereferenced it causes an oops
> with the kernel saying:
>
> BUG: unable to handle kernel paging request at virtual address 020030d2
>
> The values of the relevant variables from the oops are:
>
> page = 0xc2248fa0
> page->mapping = 0xe3a79eac
> page->mapping->a_ops = 0x020030aa

I wonder if page->mapping really wanted to be 0xc3a79eac, only something
set bit 29.

> Note that 0x020030aa+0x28 = 020030d2 which is the oops causing address
> and 0x28 is the offset of the releasepage function pointer in the
> address space operations structure...
>
> This oops is not identical to the oopses pointed out by Jonathan at:
>
> http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt
>
> But those oopses have to do with pages also so could be related...

Looks a bit different - Jonathan appears to have pulled a bad page* out
of the radix tree whereas you got your page off the LRU.

> Anyone have any ideas how a page can end up in such a weird state?

Nope.

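
The "bit 29" suggestion above can be checked by XOR-ing the observed
page->mapping value against the suspected intended one and testing
whether exactly one bit differs. A minimal sketch; the helper name
single_bit_difference is hypothetical and not a kernel function.

```c
#include <stdint.h>

/* If `a` and `b` differ in exactly one bit, return that bit's index
 * (0 = least significant). Return -1 if they are equal or differ in
 * more than one bit. */
int single_bit_difference(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    int bit = 0;

    /* diff & (diff - 1) clears the lowest set bit; a non-zero result
     * means more than one bit was set. */
    if (diff == 0 || (diff & (diff - 1u)) != 0)
        return -1;

    while ((diff >> bit) != 1u)
        bit++;
    return bit;
}
```

For 0xe3a79eac and 0xc3a79eac the XOR is 0x20000000, so the call returns
29. A single flipped bit is what one might expect from a hardware memory
error, although this check says nothing about whether that is what
actually happened here.
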
* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Hugh Dickins @ 2006-09-21 19:04 UTC
To: Andrew Morton; +Cc: Anton Altaparmakov, Jonathan Woithe, linux-kernel

On Thu, 21 Sep 2006, Andrew Morton wrote:
> On Thu, 21 Sep 2006 15:41:36 +0100
> Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> >
> > BUG: unable to handle kernel paging request at virtual address 020030d2
> >
> > The values of the relevant variables from the oops are:
> >
> > page = 0xc2248fa0
> > page->mapping = 0xe3a79eac
> > page->mapping->a_ops = 0x020030aa
>
> I wonder if page->mapping really wanted to be 0xc3a79eac, only something
> set bit 29.

Perhaps, but I'm more suspicious of that 0x0200 top half of the a_ops ptr.

> > Note that 0x020030aa+0x28 = 020030d2 which is the oops causing address
> > and 0x28 is the offset of the releasepage function pointer in the
> > address space operations structure...
> >
> > This oops is not identical to the oopses pointed out by Jonathan at:
> >
> > http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt
> >
> > But those oopses have to do with pages also so could be related...
>
> Looks a bit different - Jonathan appears to have pulled a bad page* out
> of the radix tree whereas you got your page off the LRU.

Jonathan does show a second oops, from a later boot:

BUG: unable to handle kernel paging request at virtual address 0010c744
printing eip:
c013be50
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
CPU: 0
EIP: 0060:[<c013be50>] Tainted: G M VLI
EFLAGS: 00010282 (2.6.17 #2)
EIP is at anon_vma_unlink+0x16/0x3c
eax: 0010c740 ebx: cf1070cc ecx: cf107104 edx: cf8bc740
esi: cf8bc740 edi: b7e82000 ebp: 00000000 esp: cdad7f58

I haven't worked out the disassembly in detail to support the idea
(though certainly anon_vma_unlink would be trying to list_del around
here), but that eax and esi do suggest a corrupted list: somehow the
top half of a pointer overwritten by the top half of LIST_POISON1.

And in Anton's case, the top half of a pointer overwritten by the
bottom half of LIST_POISON2.

Maybe just coincidence, and I've nothing more illuminating to add;
but just a hint of a list_del going very wrong somewhere?

Hugh

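
The half-word pattern described above can be spelled out as follows. The
poison constants are taken to be 0x00100100 and 0x00200200; that is an
assumption about this kernel, but it agrees with the immediates visible
in the Code: bytes of Jonathan's second oops (c7 43 38 00 01 10 00 and
c7 41 04 00 02 20 00, little-endian). The helpers top_half, bottom_half
and join_halves are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* List poison values as assumed for this kernel; both also appear as
 * immediates in the Code: bytes of the second oops. */
#define LIST_POISON1 0x00100100u
#define LIST_POISON2 0x00200200u

/* Upper 16 bits of a 32-bit value. */
uint16_t top_half(uint32_t v)
{
    return (uint16_t)(v >> 16);
}

/* Lower 16 bits of a 32-bit value. */
uint16_t bottom_half(uint32_t v)
{
    return (uint16_t)(v & 0xffffu);
}

/* Rebuild a 32-bit value from two 16-bit halves. */
uint32_t join_halves(uint16_t top, uint16_t bottom)
{
    return ((uint32_t)top << 16) | bottom;
}
```

In Jonathan's oops, eax (0x0010c740) equals the top half of LIST_POISON1
joined with the bottom half of esi (0xcf8bc740), and the faulting
address 0010c744 is eax + 4. In Anton's oops, the top half of a_ops
(0x020030aa) equals the bottom half of LIST_POISON2. This only restates
the observed pattern; it does not show how the overwrite happened.
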
* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Dave Jones @ 2006-09-21 19:24 UTC
To: Hugh Dickins
Cc: Andrew Morton, Anton Altaparmakov, Jonathan Woithe, linux-kernel

On Thu, Sep 21, 2006 at 08:04:49PM +0100, Hugh Dickins wrote:

> BUG: unable to handle kernel paging request at virtual address 0010c744
> printing eip:
> c013be50
> *pde = 00000000
> Oops: 0002 [#1]
> Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
> CPU: 0
> EIP: 0060:[<c013be50>] Tainted: G M VLI
> EFLAGS: 00010282 (2.6.17 #2)
> EIP is at anon_vma_unlink+0x16/0x3c
> eax: 0010c740 ebx: cf1070cc ecx: cf107104 edx: cf8bc740
> esi: cf8bc740 edi: b7e82000 ebp: 00000000 esp: cdad7f58
>
> I haven't worked out the disassembly in detail to support the idea
> (though certainly anon_vma_unlink would be trying to list_del around
> here), but that eax and esi do suggest a corrupted list: somehow the
> top half of a pointer overwritten by the top half of LIST_POISON1.
>
> And in Anton's case, the top half of a pointer overwritten by the
> bottom half of LIST_POISON2.
>
> Maybe just coincidence, and I've nothing more illuminating to add;
> but just a hint of a list_del going very wrong somewhere?

Given a machine check happened, the state of the machine in general
is questionable. I'd recommend a run of memtest86+

        Dave

* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Jonathan Woithe @ 2006-09-22 7:17 UTC
To: Dave Jones
Cc: Hugh Dickins, Andrew Morton, Anton Altaparmakov, Jonathan Woithe, linux-kernel

> On Thu, Sep 21, 2006 at 08:04:49PM +0100, Hugh Dickins wrote:
>
> > BUG: unable to handle kernel paging request at virtual address 0010c744
> > printing eip:
> > c013be50
> > *pde = 00000000
> > Oops: 0002 [#1]
> > Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
> > CPU: 0
> > EIP: 0060:[<c013be50>] Tainted: G M VLI
> > EFLAGS: 00010282 (2.6.17 #2)
> > EIP is at anon_vma_unlink+0x16/0x3c
> > eax: 0010c740 ebx: cf1070cc ecx: cf107104 edx: cf8bc740
> > esi: cf8bc740 edi: b7e82000 ebp: 00000000 esp: cdad7f58
> >
> > I haven't worked out the disassembly in detail to support the idea
> > (though certainly anon_vma_unlink would be trying to list_del around
> > here), but that eax and esi do suggest a corrupted list: somehow the
> > top half of a pointer overwritten by the top half of LIST_POISON1.
> >
> > And in Anton's case, the top half of a pointer overwritten by the
> > bottom half of LIST_POISON2.
> >
> > Maybe just coincidence, and I've nothing more illuminating to add;
> > but just a hint of a list_del going very wrong somewhere?
>
> Given a machine check happened, the state of the machine in general
> is questionable. I'd recommend a run of memtest86+

That was already done. No memory errors were reported over 10 passes.

Secondly, the machine check indication was only present on one of the two
oopses we saw. Furthermore, there was no indication in any log files
that a machine check had occurred in the case of the second oops.
Then again, perhaps machine checks don't get logged which would make this
observation irrelevant.

Could we be looking at a dying CPU?

Regards
  jonathan

* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Dave Jones @ 2006-09-22 15:40 UTC
To: Jonathan Woithe
Cc: Hugh Dickins, Andrew Morton, Anton Altaparmakov, linux-kernel

On Fri, Sep 22, 2006 at 04:47:00PM +0930, Jonathan Woithe wrote:

> > Given a machine check happened, the state of the machine in general
> > is questionable. I'd recommend a run of memtest86+
>
> That was already done. No memory errors were reported over 10 passes.
>
> Secondly, the machine check indication was only present on one of the two
> oopses we saw. Furthermore, there was no indication in any log files
> that a machine check had occurred in the case of the second oops.
> Then again, perhaps machine checks don't get logged which would make this
> observation irrelevant.
>
> Could we be looking at a dying CPU?

Maybe. Or some other hardware problem. Insufficient cooling/power for eg.

        Dave

* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Jonathan Woithe @ 2006-09-24 23:49 UTC
To: Dave Jones
Cc: Jonathan Woithe, Hugh Dickins, Andrew Morton, Anton Altaparmakov, linux-kernel

> > > Given a machine check happened, the state of the machine in general
> > > is questionable. I'd recommend a run of memtest86+
> >
> > That was already done. No memory errors were reported over 10 passes.
> >
> > Secondly, the machine check indication was only present on one of the two
> > oopses we saw. Furthermore, there was no indication in any log files
> > that a machine check had occurred in the case of the second oops.
> > Then again, perhaps machine checks don't get logged which would make this
> > observation irrelevant.
> >
> > Could we be looking at a dying CPU?
>
> Maybe. Or some other hardware problem. Insufficient cooling/power for eg.

Power and cooling should be fine, and I've checked fans etc for correct
functioning - all is ok.

The other thing worth noting is that the problems with this machine only
started once the USB/NTFS HDD started being used. Before this the machine
has been rock solid for 2+ years, and the usage patterns haven't changed.

Anyway, I'll keep an eye on it and post any subsequent information as it
becomes available.

Regards
  jonathan

* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Anton Altaparmakov @ 2006-09-22 6:59 UTC
To: Andrew Morton; +Cc: Jonathan Woithe, linux-kernel

On Thu, 2006-09-21 at 10:52 -0700, Andrew Morton wrote:
> On Thu, 21 Sep 2006 15:41:36 +0100
> Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > On Thu, 2006-09-21 at 10:54 +0100, Anton Altaparmakov wrote:
> > > On Tue, 2006-09-12 at 20:56 -0700, Andrew Morton wrote:
> > > Andrew, thanks for forwarding me the message...
> > > > Begin forwarded message:
> > > >
> > > > We have a machine which is currently making heavy use of a usb hard disc
> > > > formatted with ntfs. There have been two occasions where the kernel has
> > > > oopsed while this disc was being accessed heavily. Before adding this HDD
> > > > the machine in question was rock solid which leads me to think that it
> > > > might be related to ntfs. USB drives formatted with other filesystems do
> > > > not appear to suffer from this problem.
> >
> > I have now seen such an oops too with 2.6.18 kernel.
>
> I assume it is a once-off?

So far yes. I now have seen a recursive locking thing reported by the
new lock analyzer but that looks like it has to do with NFS (my home
directory is on NFS) so I don't think it is in any way related.

> > Note no NTFS file
> > systems were mounted at the time (but I had an NTFS file system mounted
> > earlier in the day).
> >
> > The oops is caused by kswapd0 kernel thread, the stack trace is:
> >
> > Call Trace:
> >  [<c10470a3>] shrink_inactive_list+0x46b/0x790
> >  [<c104747c>] shrink_zone+0xb4/0xd3
> >  [<c104797d>] kswapd+0x2de/0x3cf
> >  [<c102c18e>] kthread+0xc2/0xf0
> >  [<c1000bf1>] kernel_thread_helper+0x5/0xb
> > DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
> > Leftover inexact backtrace:
> >  [<c1003e6c>] show_stack_log_lvl+0x8c/0x97
> >  [<c1003fc8>] show_registers+0x151/0x1c6
> >  [<c10041af>] die+0x172/0x27b
> >  [<c145f22c>] do_page_fault+0x42c/0x4f9
> >  [<c10037dd>] error_code+0x39/0x40
> >  [<c10470a3>] shrink_inactive_list+0x46b/0x790
> >  [<c104747c>] shrink_zone+0xb4/0xd3
> >  [<c104797d>] kswapd+0x2de/0x3cf
> >  [<c102c18e>] kthread+0xc2/0xf0
> >  [<c1000bf1>] kernel_thread_helper+0x5/0xb
> >
> > And the EIP is at fs/buffer.c::try_to_release_page() the code of which
> > is here:
> >
> > int try_to_release_page(struct page *page, gfp_t gfp_mask)
> > {
> >         struct address_space * const mapping = page->mapping;
> >
> >         BUG_ON(!PageLocked(page));
> >         if (PageWriteback(page))
> >                 return 0;
> >
> >         if (mapping && mapping->a_ops->releasepage)
> >
> > ^^^ bug happens here when the value of mapping->a_ops is used to obtain
> > mapping->a_ops->releasepage
> >
> >                 return mapping->a_ops->releasepage(page, gfp_mask);
> >         return try_to_free_buffers(page);
> > }
> >
> > This bug seems to suggest that there is a page which the kernel is
> > trying to release private data which has page->mapping set to a valid
> > value and page->mapping->a_ops apparently set to an invalid value and
> > when page->mapping->a_ops->releasepage is dereferenced it causes an oops
> > with the kernel saying:
> >
> > BUG: unable to handle kernel paging request at virtual address 020030d2
> >
> > The values of the relevant variables from the oops are:
> >
> > page = 0xc2248fa0
> > page->mapping = 0xe3a79eac
> > page->mapping->a_ops = 0x020030aa
>
> I wonder if page->mapping really wanted to be 0xc3a79eac, only something
> set bit 29.

I don't know, it could be but the machine is totally stable so I would
be surprised if it is bad ram...

> > Note that 0x020030aa+0x28 = 020030d2 which is the oops causing address
> > and 0x28 is the offset of the releasepage function pointer in the
> > address space operations structure...
> >
> > This oops is not identical to the oopses pointed out by Jonathan at:
> >
> > http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt
> >
> > But those oopses have to do with pages also so could be related...
>
> Looks a bit different - Jonathan appears to have pulled a bad page* out
> of the radix tree whereas you got your page off the LRU.
>
> > Anyone have any ideas how a page can end up in such a weird state?
>
> Nope.

)-:

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://www.linux-ntfs.org/ & http://www-stu.christs.cam.ac.uk/~aia21/

* Re: Fw: 2.6.17 oops, possibly ntfs/mmap related
From: Jonathan Woithe @ 2006-09-22 7:23 UTC
To: Anton Altaparmakov; +Cc: Andrew Morton, Jonathan Woithe, linux-kernel

> > > > > We have a machine which is currently making heavy use of a usb hard disc
> > > > > formatted with ntfs. There have been two occasions where the kernel has
> > > > > oopsed while this disc was being accessed heavily. Before adding this HDD
> > > > > the machine in question was rock solid which leads me to think that it
> > > > > might be related to ntfs. USB drives formatted with other filesystems do
> > > > > not appear to suffer from this problem.
> > >
> > > I have now seen such an oops too with 2.6.18 kernel.
> >
> > I assume it is a once-off?
>
> So far yes. I now have seen a recursive locking thing reported by the
> new lock analyzer but that looks like it has to do with NFS (my home
> directory is on NFS) so I don't think it is in any way related.

Our setup also has user home directories on NFS, so that much at least is
common to our configurations. I don't know if the USB/NTFS user was writing
to their home directory as part of their work at the time of the oops
though.

Regards
  jonathan
