* 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
@ 2005-08-11 15:59 Phil Dier
2005-08-12 2:07 ` Neil Brown
0 siblings, 1 reply; 10+ messages in thread
From: Phil Dier @ 2005-08-11 15:59 UTC (permalink / raw)
To: linux-kernel; +Cc: ziggy, Scott Holdren, Jack Massari
Hi,
I posted an oops a few days ago from 2.6.12.3 [1]. Here are the results
of my tests on 2.6.13-rc6. The kernel oopses, but the box isn't completely
hosed; I can still log in and move around. It appears that the only things that are
locked are the apps that were doing i/o to the test partition. More detailed info
about my configuration can be found here:
<http://www.icglink.com/debug-2.6.13-rc6.html>
Here is the oops:
Oops: 0000 [#1]
SMP
Modules linked in:
CPU: 0
EIP: 0060:[<c0116dd0>] Not tainted VLI
EFLAGS: 00010207 (2.6.13-rc6)
EIP is at kmap+0x10/0x30
eax: 00000003 ebx: d0977440 ecx: c9efb470 edx: 00000000
esi: c1000000 edi: 00000000 ebp: ce59d570 esp: f7adde18
ds: 007b es: 007b ss: 0068
Process md4_raid1 (pid: 6442, threadinfo=f7adc000 task=f70eda20)
Stack: c014e0bd c9efb470 00000001 00000000 00000000 cb129000 00000001 e0c65f00
f7dcbe18 087641ef 00000000 f7dcbe18 c014e146 f7dcbe18 f7addeb0 c3146940
c033f03f f7dcbe18 f7addeb0 0021d906 00000000 0000003f 00000040 00000000
Call Trace:
[<c014e0bd>] __blk_queue_bounce+0x20d/0x260
[<c014e146>] blk_queue_bounce+0x36/0x60
[<c033f03f>] __make_request+0x5f/0x560
[<c0132b90>] autoremove_wake_function+0x0/0x60
[<c033f921>] generic_make_request+0x151/0x230
[<c0132b90>] autoremove_wake_function+0x0/0x60
[<c04a5fac>] schedule+0x62c/0xcb0
[<c04a5fe0>] schedule+0x660/0xcb0
[<c0132b90>] autoremove_wake_function+0x0/0x60
[<c0126973>] del_timer+0x73/0x80
[<c033db0d>] blk_remove_plug+0x3d/0x80
[<c03feed9>] raid1d+0x289/0x2a0
[<c0414bb3>] md_thread+0x143/0x190
[<c0132b90>] autoremove_wake_function+0x0/0x60
[<c0102ed2>] ret_from_fork+0x6/0x14
[<c0132b90>] autoremove_wake_function+0x0/0x60
[<c0414a70>] md_thread+0x0/0x190
[<c01011b5>] kernel_thread_helper+0x5/0x10
Code: 00 40 c7 46 0c 90 30 15 c0 c7 46 10 90 31 15 c0 eb b9 90 90 90 90 90 90 90 90 90 8b 4c 24 04 8b 01 c1 e8 1e 8b 14 85 14 f4 63 c0 <8b> 82 0c 04 00 00 05 00 09 00 00 39 c2 74 05 e9 ac 73 03 00 89
Thanks for looking..
--
Phil Dier (ICGLink.com -- 615 370-1530 x733)
/* vim:set noai nocindent ts=8 sw=8: */
* Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
2005-08-11 15:59 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS Phil Dier
@ 2005-08-12 2:07 ` Neil Brown
2005-08-12 4:17 ` Phil Dier
2005-08-12 17:35 ` Phil Dier
0 siblings, 2 replies; 10+ messages in thread
From: Neil Brown @ 2005-08-12 2:07 UTC (permalink / raw)
To: Phil Dier; +Cc: linux-kernel, ziggy, Scott Holdren, Jack Massari
On Thursday August 11, phil@icglink.com wrote:
> Hi,
>
> I posted an oops a few days ago from 2.6.12.3 [1]. Here are the results
> of my tests on 2.6.13-rc6. The kernel oopses, but the box isn't completely
> hosed; I can still log in and move around. It appears that the only things that are
> locked are the apps that were doing i/o to the test partition. More detailed info
> about my configuration can be found here:
>
> <http://www.icglink.com/debug-2.6.13-rc6.html>

You don't seem to give details on how lvm is used to combine the md
arrays, though I'm not sure that would help particularly.

>
> Here is the oops:
>
> Oops: 0000 [#1]
> SMP
> Modules linked in:
> CPU: 0
> EIP: 0060:[<c0116dd0>] Not tainted VLI
> EFLAGS: 00010207 (2.6.13-rc6)
> EIP is at kmap+0x10/0x30
> eax: 00000003 ebx: d0977440 ecx: c9efb470 edx: 00000000
> esi: c1000000 edi: 00000000 ebp: ce59d570 esp: f7adde18
> ds: 007b es: 007b ss: 0068
> Process md4_raid1 (pid: 6442, threadinfo=f7adc000 task=f70eda20)
> Stack: c014e0bd c9efb470 00000001 00000000 00000000 cb129000 00000001 e0c65f00
> f7dcbe18 087641ef 00000000 f7dcbe18 c014e146 f7dcbe18 f7addeb0 c3146940
> c033f03f f7dcbe18 f7addeb0 0021d906 00000000 0000003f 00000040 00000000
> Call Trace:
> [<c014e0bd>] __blk_queue_bounce+0x20d/0x260
--snip---
> Code: 00 40 c7 46 0c 90 30 15 c0 c7 46 10 90 31 15 c0 eb b9 90 90 90 90 90 90 90 90 90 8b 4c 24 04 8b 01 c1 e8 1e 8b 14 85 14 f4 63 c0 <8b> 82 0c 04 00 00 05 00 09 00 00 39 c2 74 05 e9 ac 73 03 00 89
>

The code is Oopsing in a call to kmap in arch/i386/mm/highmem.c

The PageHighMem macro calls is_highmem(page_zone(page)).
page_zone is defined in mm.h

    static inline struct zone *page_zone(struct page *page)
    {
            return zone_table[(page->flags >> ZONETABLE_PGSHIFT) &
                            ZONETABLE_MASK];
    }

Now at the point of the crash, eax is
(page->flags >> ZONETABLE_PGSHIFT), which is '3'. So it seems that
this page is in zone 3.
However zone_table[3] is now in edx, and we can see it is '0'.
There are only 3 zones (normal, dma, highmem), so nothing should ever
be in zone 3. This page is clearly bad.

However that is as far as I can get. I don't know whether this is a
bad page pointer passed down from jfs or nfsd, a page pointer that was
corrupted by either lvm or md, or a valid page pointer that has
managed to get a bad zone number encoded in its flags.

You could possibly put something like

    struct bio_vec *from;
    int i;
    bio_for_each_segment(from, bio, i)
            BUG_ON(page_zone(from->bv_page)==NULL);

in generic_make_request in drivers/block/ll_rw_blk.c, just before
the call to q->make_request_fn.
This might trigger the bug early enough to see what is happening.

>
> Thanks for looking..
>

Thanks for testing.

NeilBrown
* Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
2005-08-12 2:07 ` Neil Brown
@ 2005-08-12 4:17 ` Phil Dier
2005-08-12 17:35 ` Phil Dier
1 sibling, 0 replies; 10+ messages in thread
From: Phil Dier @ 2005-08-12 4:17 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-kernel, ziggy, scott, jack
On Fri, 12 Aug 2005 12:07:21 +1000
Neil Brown <neilb@cse.unsw.edu.au> wrote:
> On Thursday August 11, phil@icglink.com wrote:
> > Hi,
> >
> > I posted an oops a few days ago from 2.6.12.3 [1]. Here are the results
> > of my tests on 2.6.13-rc6. The kernel oopses, but the box isn't completely
> > hosed; I can still log in and move around. It appears that the only things that are
> > locked are the apps that were doing i/o to the test partition. More detailed info
> > about my configuration can be found here:
> >
> > <http://www.icglink.com/debug-2.6.13-rc6.html>
>
> You don't seem to give details on how lvm is used to combine the md
> arrays, though I'm not sure that would help particularly.
>

FYI:

vgdisplay -v vg1
    Using volume group(s) on command line
    Finding volume group "vg1"
  --- Volume group ---
  VG Name               vg1
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  8
  VG Access             read/write
  VG Status             resizable
  MAX LV                255
  Cur LV                1
  Open LV               0
  Max PV                255
  Cur PV                2
  Act PV                2
  VG Size               410.00 GB
  PE Size               128.00 MB
  Total PE              3280
  Alloc PE / Size       1093 / 136.62 GB
  Free PE / Size        2187 / 273.38 GB
  VG UUID               XuRomW-O6Uw-oQGq-vdwD-YwMT-Dltj-NExFmV

  --- Logical volume ---
  LV Name                /dev/vg1/home
  VG Name                vg1
  LV UUID                K7Gq9l-Vjte-ksFt-s0vn-ejqT-RGYc-5Aibtx
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                136.62 GB
  Current LE             1093
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:3

  --- Physical volumes ---
  PV Name               /dev/md4
  PV UUID               VgHU6k-lZmE-j686-dvfX-OSsM-yh28-Jyfidn
  PV Status             allocatable
  Total PE / Free PE    1093 / 0

  PV Name               /dev/md7
  PV UUID               n4rVmy-rARO-a5mY-Iiqo-GvOx-2nbG-HluaTa
  PV Status             allocatable
  Total PE / Free PE    2187 / 2187

md7 is in there to test live migration from smaller disks to larger ones.

>
>     struct bio_vec *from;
>     int i;
>     bio_for_each_segment(from, bio, i)
>             BUG_ON(page_zone(from->bv_page)==NULL);
>
> in generic_make_request in drivers/block/ll_rw_blk.c, just before
> the call to q->make_request_fn.
> This might trigger the bug early enough to see what is happening.

I'll try this and report the results.

--
Phil Dier <phil@dier.us>
/* vim:set ts=8 sw=8 nocindent noai: */
* Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
2005-08-12 2:07 ` Neil Brown
2005-08-12 4:17 ` Phil Dier
@ 2005-08-12 17:35 ` Phil Dier
2005-08-12 18:35 ` Sonny Rao
1 sibling, 1 reply; 10+ messages in thread
From: Phil Dier @ 2005-08-12 17:35 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-kernel, ziggy, scott, jack
On Fri, 12 Aug 2005 12:07:21 +1000
Neil Brown <neilb@cse.unsw.edu.au> wrote:
> You could possibly put something like
>
>     struct bio_vec *from;
>     int i;
>     bio_for_each_segment(from, bio, i)
>             BUG_ON(page_zone(from->bv_page)==NULL);
>
> in generic_make_request in drivers/block/ll_rw_blk.c, just before
> the call to q->make_request_fn.
> This might trigger the bug early enough to see what is happening.

I've got tests running with this code in place, but I/O is so slow now
I don't think it's going to oops (or if it does, it'll be a while)..

Is there any other info I can collect to help track this down?

--
Phil Dier (ICGLink.com -- 615 370-1530 x733)
/* vim:set noai nocindent ts=8 sw=8: */
* Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
2005-08-12 17:35 ` Phil Dier
@ 2005-08-12 18:35 ` Sonny Rao
2005-08-15 2:03 ` Phil Dier
0 siblings, 1 reply; 10+ messages in thread
From: Sonny Rao @ 2005-08-12 18:35 UTC (permalink / raw)
To: Phil Dier; +Cc: Neil Brown, linux-kernel, ziggy, scott, jack
On Fri, Aug 12, 2005 at 12:35:05PM -0500, Phil Dier wrote:
> On Fri, 12 Aug 2005 12:07:21 +1000
> Neil Brown <neilb@cse.unsw.edu.au> wrote:
> > You could possibly put something like
> >
> >     struct bio_vec *from;
> >     int i;
> >     bio_for_each_segment(from, bio, i)
> >             BUG_ON(page_zone(from->bv_page)==NULL);
> >
> > in generic_make_request in drivers/block/ll_rw_blk.c, just before
> > the call to q->make_request_fn.
> > This might trigger the bug early enough to see what is happening.
>
>
> I've got tests running with this code in place, but I/O is so slow now
> I don't think it's going to oops (or if it does, it'll be a while)..
>
> Is there any other info I can collect to help track this down?

Well, while we are slowing things down in the name of debugging..
you might try setting the following debug options in your config:

CONFIG_DEBUG_PAGEALLOC
CONFIG_DEBUG_HIGHMEM
CONFIG_DEBUG_SLAB
CONFIG_FRAME_POINTER

Can anyone think of anything else?

According to the website you don't have these on right now.

Sonny
* Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
2005-08-12 18:35 ` Sonny Rao
@ 2005-08-15 2:03 ` Phil Dier
2005-08-15 2:40 ` Zwane Mwaikambo
0 siblings, 1 reply; 10+ messages in thread
From: Phil Dier @ 2005-08-15 2:03 UTC (permalink / raw)
To: Sonny Rao; +Cc: neilb, linux-kernel, ziggy, scott, jack
I just got this:

Unable to handle kernel paging request at virtual address eeafefc0
printing eip:
c0188487
*pde = 00681067
*pte = 2eafe000
Oops: 0000 [#1]
SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 1
EIP: 0060:[<c0188487>] Not tainted VLI
EFLAGS: 00010296 (2.6.13-rc6)
EIP is at inotify_inode_queue_event+0x17/0x130
eax: eeafefc0 ebx: 00000000 ecx: 00000200 edx: eeafee9c
esi: 00000000 edi: ef4cbe9c ebp: f66e1eac esp: f66e1e84
ds: 007b es: 007b ss: 0068
Process nfsd (pid: 6259, threadinfo=f66e0000 task=f6307b00)
Stack: eeafee9c c0536a34 ee900f6c f66e1eac c0179949 eeafefc0 23644d80 00000000
00000000 ef4cbe9c f66e1ed4 c01713ad eeafee9c 00000400 00000000 00000000
eeafee9c ee900f6c f0940f6c f6f0adf8 f66e1f00 c020caa1 ef4cbe9c ee900f6c
Call Trace:
[<c0103e7f>] show_stack+0x7f/0xa0
[<c0104030>] show_registers+0x160/0x1d0
[<c0104260>] die+0x100/0x180
[<c0116199>] do_page_fault+0x369/0x6ed
[<c0103aa3>] error_code+0x4f/0x54
[<c01713ad>] vfs_unlink+0x17d/0x210
[<c020caa1>] nfsd_unlink+0x161/0x240
[<c0207c64>] nfsd_proc_remove+0x44/0x90
[<c0206747>] nfsd_dispatch+0xd7/0x200
[<c0491b13>] svc_process+0x533/0x670
[<c02064dd>] nfsd+0x1bd/0x350
[<c01011e5>] kernel_thread_helper+0x5/0x10
Code: ff ff ff 8b 5d f8 8b 75 fc 89 ec 5d c3 8d b4 26 00 00 00 00 55 89 e5 57 56 53 83 ec 1c 8b 45 08 8b 55 08 05 24 01 00 00 89 45 ec <39> 82 24 01 00 00 74 5d f0 ff 8a 2c 01 00 00 0f 88 d1 0b 00 00

--
Phil Dier <phil@dier.us>
/* vim:set ts=8 sw=8 nocindent noai: */
* Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
2005-08-15 2:03 ` Phil Dier
@ 2005-08-15 2:40 ` Zwane Mwaikambo
2005-08-15 3:08 ` Robert Love
0 siblings, 1 reply; 10+ messages in thread
From: Zwane Mwaikambo @ 2005-08-15 2:40 UTC (permalink / raw)
To: Robert Love
Cc: Sonny Rao, Neil Brown, Linux Kernel, ziggy, scott, jack, Alexander Viro, Phil Dier
On Sun, 14 Aug 2005, Phil Dier wrote:
> I just got this:
>
> Unable to handle kernel paging request at virtual address eeafefc0
> printing eip:
> c0188487
> *pde = 00681067
> *pte = 2eafe000
> Oops: 0000 [#1]
> SMP DEBUG_PAGEALLOC
> Modules linked in:
> CPU: 1
> EIP: 0060:[<c0188487>] Not tainted VLI
> EFLAGS: 00010296 (2.6.13-rc6)
> EIP is at inotify_inode_queue_event+0x17/0x130
> eax: eeafefc0 ebx: 00000000 ecx: 00000200 edx: eeafee9c
> esi: 00000000 edi: ef4cbe9c ebp: f66e1eac esp: f66e1e84
> ds: 007b es: 007b ss: 0068
> Process nfsd (pid: 6259, threadinfo=f66e0000 task=f6307b00)
> Stack: eeafee9c c0536a34 ee900f6c f66e1eac c0179949 eeafefc0 23644d80 00000000
> 00000000 ef4cbe9c f66e1ed4 c01713ad eeafee9c 00000400 00000000 00000000
> eeafee9c ee900f6c f0940f6c f6f0adf8 f66e1f00 c020caa1 ef4cbe9c ee900f6c
> Call Trace:
> [<c0103e7f>] show_stack+0x7f/0xa0
> [<c0104030>] show_registers+0x160/0x1d0
> [<c0104260>] die+0x100/0x180
> [<c0116199>] do_page_fault+0x369/0x6ed
> [<c0103aa3>] error_code+0x4f/0x54
> [<c01713ad>] vfs_unlink+0x17d/0x210
> [<c020caa1>] nfsd_unlink+0x161/0x240
> [<c0207c64>] nfsd_proc_remove+0x44/0x90
> [<c0206747>] nfsd_dispatch+0xd7/0x200
> [<c0491b13>] svc_process+0x533/0x670
> [<c02064dd>] nfsd+0x1bd/0x350
> [<c01011e5>] kernel_thread_helper+0x5/0x10
> Code: ff ff ff 8b 5d f8 8b 75 fc 89 ec 5d c3 8d b4 26 00 00 00 00 55 89 e5 57 56 53 83 ec 1c 8b 45 08 8b 55 08 05 24 01 00 00 89 45 ec <39> 82 24 01 00 00 74 5d f0 ff 8a 2c 01 00 00 0f 88 d1 0b 00 00

int vfs_unlink(struct inode *dir, struct dentry *dentry)
{
<snipped>
        /* We don't d_delete() NFS sillyrenamed files--they still exist. */
        if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
                struct inode *inode = dentry->d_inode;
                d_delete(dentry);       <==========
                fsnotify_unlink(dentry, inode, dir);
        }
        return error;
}

static inline void fsnotify_unlink(struct dentry *dentry, struct inode *inode, struct inode *dir)
{
<snipped>
        inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL);      <=====
<snipped>
}

void inotify_inode_queue_event(struct inode *inode, u32 mask, u32 cookie,
                               const char *name)
{
        struct inotify_watch *watch, *next;

        if (!inotify_inode_watched(inode))      <======
                return;
<snipped>
}

static inline int inotify_inode_watched(struct inode *inode)
{
        return !list_empty(&inode->inotify_watches);    <===
}

I'm new here, if the inode isn't being watched, what's to stop d_delete
from removing the inode before fsnotify_unlink proceeds to use it?

Thanks,
Zwane
* Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
2005-08-15 2:40 ` Zwane Mwaikambo
@ 2005-08-15 3:08 ` Robert Love
2005-08-15 3:20 ` Zwane Mwaikambo
0 siblings, 1 reply; 10+ messages in thread
From: Robert Love @ 2005-08-15 3:08 UTC (permalink / raw)
To: Zwane Mwaikambo
Cc: Robert Love, Sonny Rao, Neil Brown, Linux Kernel, ziggy, scott, jack, Alexander Viro, Phil Dier
On Sun, 2005-08-14 at 20:40 -0600, Zwane Mwaikambo wrote:

> I'm new here, if the inode isn't being watched, what's to stop d_delete
> from removing the inode before fsnotify_unlink proceeds to use it?

Nothing. But check out

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a91bf7f5c22c8407a9991cbd9ce5bb87caa6b4a

Should solve this problem?

Robert Love
* Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
2005-08-15 3:08 ` Robert Love
@ 2005-08-15 3:20 ` Zwane Mwaikambo
2005-08-15 17:44 ` Phil Dier
0 siblings, 1 reply; 10+ messages in thread
From: Zwane Mwaikambo @ 2005-08-15 3:20 UTC (permalink / raw)
To: Robert Love
Cc: Sonny Rao, Neil Brown, Linux Kernel, ziggy, scott, jack, Alexander Viro, Phil Dier
On Sun, 14 Aug 2005, Robert Love wrote:
> On Sun, 2005-08-14 at 20:40 -0600, Zwane Mwaikambo wrote:
>
> > I'm new here, if the inode isn't being watched, what's to stop d_delete
> > from removing the inode before fsnotify_unlink proceeds to use it?
>
> Nothing. But check out
>
> http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a91bf7f5c22c8407a9991cbd9ce5bb87caa6b4a

That git web interface looks rather spiffy.

> Should solve this problem?

Seems to fit the bill perfectly.

Thanks,
Zwane
* Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS
2005-08-15 3:20 ` Zwane Mwaikambo
@ 2005-08-15 17:44 ` Phil Dier
0 siblings, 0 replies; 10+ messages in thread
From: Phil Dier @ 2005-08-15 17:44 UTC (permalink / raw)
To: linux-kernel; +Cc: ziggy, scott, jack
On Sun, 14 Aug 2005 21:20:35 -0600 (MDT)
Zwane Mwaikambo <zwane@arm.linux.org.uk> wrote:
> On Sun, 14 Aug 2005, Robert Love wrote:
>
> > On Sun, 2005-08-14 at 20:40 -0600, Zwane Mwaikambo wrote:
> >
> > > I'm new here, if the inode isn't being watched, what's to stop d_delete
> > > from removing the inode before fsnotify_unlink proceeds to use it?
> >
> > Nothing. But check out
> >
> > http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a91bf7f5c22c8407a9991cbd9ce5bb87caa6b4a
>
> That git web interface looks rather spiffy.
>
> > Should solve this problem?
>
> Seems to fit the bill perfectly.
>
> Thanks,
> Zwane
>

So, for the record, I patched my 2.6.13-rc6 kernel with the patch at this location:

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=7a91bf7f5c22c8407a9991cbd9ce5bb87caa6b4a;hp=1963c907b21e140082d081b1c8f8c2154593c7d7

and I will be testing it today. Thanks to all of you guys.

--
Phil Dier (ICGLink.com -- 615 370-1530 x733)
/* vim:set noai nocindent ts=8 sw=8: */