* slab corruption in test9 (NFS related?)
@ 2003-11-09 23:16 Burton Windle
2003-11-10 3:04 ` Andrew Morton
0 siblings, 1 reply; 11+ messages in thread
From: Burton Windle @ 2003-11-09 23:16 UTC (permalink / raw)
To: Linux-kernel
Forwarding to LKML as per Neil Brown.
---------- Forwarded message ----------
Date: Sun, 9 Nov 2003 17:21:47 -0500 (EST)
From: Burton Windle <bwindle@fint.org>
To: Trond Myklebust <trond.myklebust@fys.uio.no>, neilb@cse.unsw.edu.au
Subject: nfsd caused slab corruption in test9
Hello. I'm not quite sure whom to send this to, so my appologizes if you
aren't the right people.
Two linux boxes, both running Debian Testing, both x86 with 2.6.0-test9.
The nfs client was running the nhfsstone stress test against an ext3 NFS
exported share. After about an hour (I forgot it was running), the console on the
server woke up and spit this out:
Slab corruption: start=c9df6bcc, expend=c9df6c8b, problemat=c9df6bcc
Last user: [<c017f87e>](d_callback+0x1e/0x40)
Data: 6A**********************************************************************************************************************************************************************************************A5
Next: 71 F0 2C .7E F8 17 C0 A5 C2 0F 17 00 00 00 00 08 00 00 00 3C 4B 24 1D 00 00 00 00 0A 00 00 00 slab error in check_poison_obj(): cache
`dentry_cache': object was modified after freeing Call Trace:
[<c0143653>] check_poison_obj+0xf3/0x180
[<c0181d19>] d_alloc+0x19/0x360
[<c01457da>] kmem_cache_alloc+0x13a/0x180
[<c0181d19>] d_alloc+0x19/0x360
[<c0174a3e>] cached_lookup+0x5e/0x80
[<c0175ebb>] __lookup_hash+0x5b/0xa0
[<c0175f10>] lookup_hash+0x10/0x20
[<c0175f75>] lookup_one_len+0x55/0x80
[<c01e7d05>] nfsd_lookup+0xa5/0x5e0
[<c01f020b>] nfsd3_proc_lookup+0x8b/0x100
[<c01e5147>] nfsd_dispatch+0xc7/0x1a0
[<c0313a7f>] svc_process+0x49f/0x640
[<c01e4c47>] nfsd+0x307/0x740
[<c01e4940>] nfsd+0x0/0x740
[<c0107429>] kernel_thread_helper+0x5/0x1c
# CONFIG_SMP is not set
CONFIG_PREEMPT=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_NFS_V4 is not set
# CONFIG_NFS_DIRECTIO is not set
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
# CONFIG_NFSD_V4 is not set
# CONFIG_NFSD_TCP is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_IOVIRT=y
CONFIG_DEBUG_SPINLOCK=y
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_INFO is not set
CONFIG_DEBUG_SPINLOCK_SLEEP=y
Linux version 2.6.0-test9 (root@morpheus) (gcc version 3.3.2 20030908
(Debian prerelease)) #2 Sat Oct 25 19:33:41 EDT 2003
Gnu C 3.3.2
Gnu make 3.80
util-linux 2.12
mount 2.12
module-init-tools implemented
e2fsprogs 1.35-WIP
nfs-utils 1.0.5
Linux C Library 2.3.2
Dynamic linker (ldd) 2.3.2
Procps 3.1.12
Net-tools 1.60
Console-tools 0.2.3
Sh-utils 5.0
--
Burton Windle burton@fint.org
Linux: the "grim reaper of innocent orphaned children."
from /usr/src/linux-2.4.18/init/main.c:461
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: slab corruption in test9 (NFS related?) 2003-11-09 23:16 slab corruption in test9 (NFS related?) Burton Windle @ 2003-11-10 3:04 ` Andrew Morton 2003-11-10 3:05 ` Neil Brown 0 siblings, 1 reply; 11+ messages in thread From: Andrew Morton @ 2003-11-10 3:04 UTC (permalink / raw) To: Burton Windle; +Cc: Linux-kernel Burton Windle <bwindle@fint.org> wrote: > > Forwarding to LKML as per Neil Brown. > > > ---------- Forwarded message ---------- > Date: Sun, 9 Nov 2003 17:21:47 -0500 (EST) > From: Burton Windle <bwindle@fint.org> > To: Trond Myklebust <trond.myklebust@fys.uio.no>, neilb@cse.unsw.edu.au > Subject: nfsd caused slab corruption in test9 > > Hello. I'm not quite sure whom to send this to, so my appologizes if you > aren't the right people. > > Two linux boxes, both running Debian Testing, both x86 with 2.6.0-test9. > The nfs client was running the nhfsstone stress test against an ext3 NFS > exported share. After about an hour (I forgot it was running), the console on the > server woke up and spit this out: > > > Slab corruption: start=c9df6bcc, expend=c9df6c8b, problemat=c9df6bcc > Last user: [<c017f87e>](d_callback+0x1e/0x40) > Data: 6A**********************************************************************************************************************************************************************************************A5 > Next: 71 F0 2C .7E F8 17 C0 A5 C2 0F 17 00 00 00 00 08 00 00 00 3C 4B 24 1D 00 00 00 00 0A 00 00 00 slab error in check_poison_obj(): cache > `dentry_cache': object was modified after freeing Call Trace: > [<c0143653>] check_poison_obj+0xf3/0x180 > [<c0181d19>] d_alloc+0x19/0x360 > [<c01457da>] kmem_cache_alloc+0x13a/0x180 > [<c0181d19>] d_alloc+0x19/0x360 > [<c0174a3e>] cached_lookup+0x5e/0x80 > [<c0175ebb>] __lookup_hash+0x5b/0xa0 > [<c0175f10>] lookup_hash+0x10/0x20 > [<c0175f75>] lookup_one_len+0x55/0x80 > [<c01e7d05>] nfsd_lookup+0xa5/0x5e0 > [<c01f020b>] nfsd3_proc_lookup+0x8b/0x100 > [<c01e5147>] nfsd_dispatch+0xc7/0x1a0 > [<c0313a7f>] svc_process+0x49f/0x640 > [<c01e4c47>] nfsd+0x307/0x740 > [<c01e4940>] nfsd+0x0/0x740 > [<c0107429>] kernel_thread_helper+0x5/0x1c > Someone did a dput() of a freed dentry. This is quite possibly an nfsd bug. There's also http://bugme.osdl.org/show_bug.cgi?id=1497, in which someone did a scribble on the buffer_head slab's metadata. Could be unrelated. There's a debug patch in -mm which should catch the rogue dput(). Perhaps you could include it in your testing. It is at ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0-test9/2.6.0-test9-mm2/broken-out/atomic_dec-debug.patch ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: slab corruption in test9 (NFS related?) 2003-11-10 3:04 ` Andrew Morton @ 2003-11-10 3:05 ` Neil Brown 2003-11-10 4:09 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Neil Brown @ 2003-11-10 3:05 UTC (permalink / raw) To: Andrew Morton; +Cc: Burton Windle, Linux-kernel On Sunday November 9, akpm@osdl.org wrote: > Burton Windle <bwindle@fint.org> wrote: > > Slab corruption: start=c9df6bcc, expend=c9df6c8b, problemat=c9df6bcc > > Last user: [<c017f87e>](d_callback+0x1e/0x40) > > Data: 6A**********************************************************************************************************************************************************************************************A5 > > Next: 71 F0 2C .7E F8 17 C0 A5 C2 0F 17 00 00 00 00 08 00 00 00 3C 4B 24 1D 00 00 00 00 0A 00 00 00 slab error in check_poison_obj(): cache > > > > Someone did a dput() of a freed dentry. This is quite possibly an > nfsd bug. Ahhh. That would be: =============================================== An extra dput was introduced in nfsd_rename 20 months ago.... time to remove it. ----------- Diffstat output ------------ ./fs/nfsd/vfs.c | 1 - 1 files changed, 1 deletion(-) diff ./fs/nfsd/vfs.c~current~ ./fs/nfsd/vfs.c --- ./fs/nfsd/vfs.c~current~ 2003-11-05 11:25:45.000000000 +1100 +++ ./fs/nfsd/vfs.c 2003-11-05 11:25:59.000000000 +1100 @@ -1368,7 +1368,6 @@ nfsd_rename(struct svc_rqst *rqstp, stru nfsd_sync_dir(tdentry); nfsd_sync_dir(fdentry); } - dput(ndentry); out_dput_new: dput(ndentry); ============================================== NeilBrown ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: slab corruption in test9 (NFS related?) 2003-11-10 3:05 ` Neil Brown @ 2003-11-10 4:09 ` Linus Torvalds 2003-11-10 4:23 ` Joshua Kwan ` (4 more replies) 0 siblings, 5 replies; 11+ messages in thread From: Linus Torvalds @ 2003-11-10 4:09 UTC (permalink / raw) To: Neil Brown; +Cc: Andrew Morton, Burton Windle, Linux-kernel On Mon, 10 Nov 2003, Neil Brown wrote: > > An extra dput was introduced in nfsd_rename 20 months ago.... > > time to remove it. Oh, you stand-up comedian you. I'm just wondering how the hell this hasn't bit us seriously until now? What's up? In other words, your patch certainly looks obviously correct, but it also looks _so_ obviously correct that my alarm bells are going off. If the code was quite that broken at counting dentries, how the hell did it ever work AT ALL? Call me suspicious, but I find this really strange.. Linus ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: slab corruption in test9 (NFS related?) 2003-11-10 4:09 ` Linus Torvalds @ 2003-11-10 4:23 ` Joshua Kwan 2003-11-10 4:36 ` Trond Myklebust ` (3 subsequent siblings) 4 siblings, 0 replies; 11+ messages in thread From: Joshua Kwan @ 2003-11-10 4:23 UTC (permalink / raw) To: Linux-kernel [-- Attachment #1: Type: text/plain, Size: 991 bytes --] On Sun, Nov 09, 2003 at 08:09:40PM -0800, Linus Torvalds wrote: > In other words, your patch certainly looks obviously correct, but it also > looks _so_ obviously correct that my alarm bells are going off. If the > code was quite that broken at counting dentries, how the hell did it ever > work AT ALL? From Jargon File (4.3.0, 30 APR 2001) [jargon]: schroedinbug /shroh'din-buhg/ n. [MIT: from the Schroedinger's Cat thought-experiment in quantum physics] A design or implementation bug in a program that doesn't manifest until someone reading source or using the program in an unusual way notices that it never should have worked, at which point the program promptly stops working for everybody until fixed. Though (like {bit rot}) this sounds impossible, it happens; some programs have harbored latent schroedinbugs for years. Compare {heisenbug}, {Bohr bug}, {mandelbug}. Hold on to your horses everyone? :) -- Joshua Kwan [-- Attachment #2: Type: application/pgp-signature, Size: 827 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: slab corruption in test9 (NFS related?) 2003-11-10 4:09 ` Linus Torvalds 2003-11-10 4:23 ` Joshua Kwan @ 2003-11-10 4:36 ` Trond Myklebust 2003-11-10 4:42 ` Linus Torvalds 2003-11-10 4:36 ` Neil Brown ` (2 subsequent siblings) 4 siblings, 1 reply; 11+ messages in thread From: Trond Myklebust @ 2003-11-10 4:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: Neil Brown, Andrew Morton, Burton Windle, Linux-kernel >>>>> " " == Linus Torvalds <torvalds@osdl.org> writes: > In other words, your patch certainly looks obviously correct, > but it also looks _so_ obviously correct that my alarm bells > are going off. If the code was quite that broken at counting > dentries, how the hell did it ever work AT ALL? Given that d_free() now uses rcu, and hence defers the actual call to kmem_cache_free(), might that not suffice to explain why actual consequences are rare? Cheers, Trond ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: slab corruption in test9 (NFS related?) 2003-11-10 4:36 ` Trond Myklebust @ 2003-11-10 4:42 ` Linus Torvalds 0 siblings, 0 replies; 11+ messages in thread From: Linus Torvalds @ 2003-11-10 4:42 UTC (permalink / raw) To: Trond Myklebust; +Cc: Neil Brown, Andrew Morton, Burton Windle, Linux-kernel On 9 Nov 2003, Trond Myklebust wrote: > > Given that d_free() now uses rcu, and hence defers the actual call to > kmem_cache_free(), might that not suffice to explain why actual > consequences are rare? The thing is, since it gets free'd before all users have let go of their pointers to it, I'd expect people to now have a pointer to a _totally_ stale entry somewhere. Although I guess in 99% of all cases there just won't be any other users for the new dentry in particular (it might be different if it was the old dentry that got dropped too much), and clearly it hasn't been noticed for the last 20 months. Even so, I'm a bit surprised. But I obviously applied the patch. Linus ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: slab corruption in test9 (NFS related?) 2003-11-10 4:09 ` Linus Torvalds 2003-11-10 4:23 ` Joshua Kwan 2003-11-10 4:36 ` Trond Myklebust @ 2003-11-10 4:36 ` Neil Brown 2003-11-10 4:45 ` Linus Torvalds 2003-11-10 4:51 ` Andrew Morton 2003-11-10 5:56 ` viro 4 siblings, 1 reply; 11+ messages in thread From: Neil Brown @ 2003-11-10 4:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Burton Windle, Linux-kernel On Sunday November 9, torvalds@osdl.org wrote: > > On Mon, 10 Nov 2003, Neil Brown wrote: > > > > An extra dput was introduced in nfsd_rename 20 months ago.... > > > > time to remove it. > > Oh, you stand-up comedian you. > > I'm just wondering how the hell this hasn't bit us seriously until now? > What's up? > > In other words, your patch certainly looks obviously correct, but it also > looks _so_ obviously correct that my alarm bells are going off. If the > code was quite that broken at counting dentries, how the hell did it ever > work AT ALL? > > Call me suspicious, but I find this really strange.. > > Linus Me too. I had to read through the code several times, and then find the patch that introduced the bug and make sure it looked wrong too, which it does. The only time that the dentry is used after the dput that reduces d_count to 0, is that dput is called once more, and it is very soon and only decrements the count (down to -1). So there is a very small window for someone else to get that piece of memory and have it corrupted. This bug was first brought to my attention by the NFSv4 team who have obviously been hammering nfsd harder than the rest of us (an in more interesting ways, as is the nature of nfsv4). NeilBrown ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: slab corruption in test9 (NFS related?) 2003-11-10 4:36 ` Neil Brown @ 2003-11-10 4:45 ` Linus Torvalds 0 siblings, 0 replies; 11+ messages in thread From: Linus Torvalds @ 2003-11-10 4:45 UTC (permalink / raw) To: Neil Brown; +Cc: Andrew Morton, Burton Windle, Linux-kernel On Mon, 10 Nov 2003, Neil Brown wrote: > > I had to read through the code several times, and then find the patch > that introduced the bug and make sure it looked wrong too, which it > does. Yeah, me too. I looked up the original patch from Feb 2002 to verify that there wasn't something subtle going on. But I agree - there's nothing subtle there at all, it really is an obvious bug. Linus ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: slab corruption in test9 (NFS related?) 2003-11-10 4:09 ` Linus Torvalds ` (2 preceding siblings ...) 2003-11-10 4:36 ` Neil Brown @ 2003-11-10 4:51 ` Andrew Morton 2003-11-10 5:56 ` viro 4 siblings, 0 replies; 11+ messages in thread From: Andrew Morton @ 2003-11-10 4:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: neilb, bwindle, Linux-kernel Linus Torvalds <torvalds@osdl.org> wrote: > > > On Mon, 10 Nov 2003, Neil Brown wrote: > > > > An extra dput was introduced in nfsd_rename 20 months ago.... > > > > time to remove it. > > Oh, you stand-up comedian you. > > I'm just wondering how the hell this hasn't bit us seriously until now? > What's up? Me too. If the dentry was on the dentry_unused list then it would get a whacky refcount and we would simply leak it. It is only in the very rare case that the dentry gets freed up between the two dput()s that the slab debugging will trip. So probably, we've been leaking dentries as a result of rename activity by NFS clients, but nobody noticed. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: slab corruption in test9 (NFS related?) 2003-11-10 4:09 ` Linus Torvalds ` (3 preceding siblings ...) 2003-11-10 4:51 ` Andrew Morton @ 2003-11-10 5:56 ` viro 4 siblings, 0 replies; 11+ messages in thread From: viro @ 2003-11-10 5:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: Neil Brown, Andrew Morton, Burton Windle, Linux-kernel On Sun, Nov 09, 2003 at 08:09:40PM -0800, Linus Torvalds wrote: > > On Mon, 10 Nov 2003, Neil Brown wrote: > > > > An extra dput was introduced in nfsd_rename 20 months ago.... > > > > time to remove it. > > Oh, you stand-up comedian you. > > I'm just wondering how the hell this hasn't bit us seriously until now? > What's up? > > In other words, your patch certainly looks obviously correct, but it also > looks _so_ obviously correct that my alarm bells are going off. If the > code was quite that broken at counting dentries, how the hell did it ever > work AT ALL? > > Call me suspicious, but I find this really strange.. Arrgh... No, Neil is right - it was a plain and simple fsckup on my part. No hidden logics is there; his fix is correct. ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2003-11-10 5:56 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2003-11-09 23:16 slab corruption in test9 (NFS related?) Burton Windle 2003-11-10 3:04 ` Andrew Morton 2003-11-10 3:05 ` Neil Brown 2003-11-10 4:09 ` Linus Torvalds 2003-11-10 4:23 ` Joshua Kwan 2003-11-10 4:36 ` Trond Myklebust 2003-11-10 4:42 ` Linus Torvalds 2003-11-10 4:36 ` Neil Brown 2003-11-10 4:45 ` Linus Torvalds 2003-11-10 4:51 ` Andrew Morton 2003-11-10 5:56 ` viro
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox