* Re: Slab corruption in 2.6.16-rc5-mm2
@ 2006-03-08 6:25 Chuck Ebbert
2006-03-08 8:32 ` Nick Piggin
0 siblings, 1 reply; 53+ messages in thread
From: Chuck Ebbert @ 2006-03-08 6:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Lee Schermerhorn, Mike Christie, Jesper Juhl, Andrew Morton,
linux-kernel, James Bottomley
In-Reply-To: <Pine.LNX.4.64.0603061917330.3573@g5.osdl.org>
On Mon, 6 Mar 2006 19:20:13 -0800, Linus Torvalds wrote:
> > When someone converted the *buffer* allocation to kzalloc they
> > also removed the memset for the *packet_command* struct.
> >
> > The
> >
> > memset(&cgc, 0, sizeof(struct packet_command));
> >
> > should be added back I think.
>
> Good eyes. I bet that's it.
Heh. This exact fix was posted to linux-kernel by Lee Schermerhorn
three weeks ago:
Date: Wed, 15 Feb 2006 14:07:37 -0500
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Subject: [PATCH] 2.6.16-rc3-mm1 - restore zeroing of packet_command
struct in sr_ioctl.c
To: linux-kernel <linux-kernel@vger.kernel.org>
Cc: Andrew Morton <akpm@osdl.org>
Message-ID: <1140030457.6619.3.camel@localhost.localdomain>
--
Chuck
"Penguins don't come from next door, they come from the Antarctic!"
^ permalink raw reply [flat|nested] 53+ messages in thread

* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-08 6:25 Slab corruption in 2.6.16-rc5-mm2 Chuck Ebbert
@ 2006-03-08 8:32 ` Nick Piggin
2006-03-08 8:46 ` Andrew Morton
0 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2006-03-08 8:32 UTC (permalink / raw)
To: Chuck Ebbert
Cc: Linus Torvalds, Lee Schermerhorn, Mike Christie, Jesper Juhl,
Andrew Morton, linux-kernel, James Bottomley

Chuck Ebbert wrote:
> In-Reply-To: <Pine.LNX.4.64.0603061917330.3573@g5.osdl.org>
>
> On Mon, 6 Mar 2006 19:20:13 -0800, Linus Torvalds wrote:
>
>
>>>When someone converted the *buffer* allocation to kzalloc they
>>>also removed the memset for the *packet_command* struct.
>>>
>>>The
>>>
>>>memset(&cgc, 0, sizeof(struct packet_command));
>>>
>>>should be added back I think.
>>
>>Good eyes. I bet that's it.
>
>
> Heh. This exact fix was posted to linux-kernel by Lee Schermerhorn
> three weeks ago:
>
> Date: Wed, 15 Feb 2006 14:07:37 -0500
> From: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Subject: [PATCH] 2.6.16-rc3-mm1 - restore zeroing of packet_command
> struct in sr_ioctl.c
> To: linux-kernel <linux-kernel@vger.kernel.org>
> Cc: Andrew Morton <akpm@osdl.org>
> Message-ID: <1140030457.6619.3.camel@localhost.localdomain>
>
>

It isn't Andrew's job to make sure a patch gets to the right place
until it is safely in -mm, and even then he's not always going to
know the severity and importance unless he's told.

If it was a patch to "restore" a regression in behaviour, CCs should
at least have gone to the author of the patch that broke it, and the
subsystem maintainers / list / etc as well.

--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-08 8:32 ` Nick Piggin
@ 2006-03-08 8:46 ` Andrew Morton
2006-03-08 9:02 ` Nick Piggin
0 siblings, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2006-03-08 8:46 UTC (permalink / raw)
To: Nick Piggin
Cc: 76306.1226, torvalds, lee.schermerhorn, michaelc, jesper.juhl,
linux-kernel, James.Bottomley

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Chuck Ebbert wrote:
> > In-Reply-To: <Pine.LNX.4.64.0603061917330.3573@g5.osdl.org>
> >
> > On Mon, 6 Mar 2006 19:20:13 -0800, Linus Torvalds wrote:
> >
> >
> >>>When someone converted the *buffer* allocation to kzalloc they
> >>>also removed the memset for the *packet_command* struct.
> >>>
> >>>The
> >>>
> >>>memset(&cgc, 0, sizeof(struct packet_command));
> >>>
> >>>should be added back I think.
> >>
> >>Good eyes. I bet that's it.
> >
> >
> > Heh. This exact fix was posted to linux-kernel by Lee Schermerhorn
> > three weeks ago:
> >
> > Date: Wed, 15 Feb 2006 14:07:37 -0500
> > From: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > Subject: [PATCH] 2.6.16-rc3-mm1 - restore zeroing of packet_command
> > struct in sr_ioctl.c
> > To: linux-kernel <linux-kernel@vger.kernel.org>
> > Cc: Andrew Morton <akpm@osdl.org>
> > Message-ID: <1140030457.6619.3.camel@localhost.localdomain>
> >
> >
>
>
> It isn't Andrew's job to make sure a patch gets to the right place
> until it is safely in -mm, and even then he's not always going to
> know the severity and importance unless he's told.

Is too!

> If it was a patch to "restore" a regression in behaviour, CCs should
> at least have gone to the author of the patch that broke it, and the
> subsystem maintainers / list / etc as well.

I actually merged Lee's patch into -mm, copied James on it and then I
dropped it when I saw that it spat rejects against an updated version of
James's tree, assuming that it had been merged.

Often I'll check that a patch reverts successfully from the upstream tree
before dropping it, but for an obvious one like that I guess I didn't
bother, and assumed that James had taken it. Only he hadn't - instead he'd
gone and merged something else, hence the rejects. Oh well.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-08 8:46 ` Andrew Morton
@ 2006-03-08 9:02 ` Nick Piggin
2006-03-08 9:12 ` Andrew Morton
0 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2006-03-08 9:02 UTC (permalink / raw)
To: Andrew Morton
Cc: 76306.1226, torvalds, lee.schermerhorn, michaelc, jesper.juhl,
linux-kernel, James.Bottomley

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>It isn't Andrew's job to make sure a patch gets to the right place
>>until it is safely in -mm, and even then he's not always going to
>>know the severity and importance unless he's told.
>
>
> Is too!
>

OK, partially. As this case illustrates, everybody makes mistakes and
you obviously can't go back and verify you got all the patches because.
The guy who hits the bug and/or writes the patch can easily see it is
still not merged and shout.

>
>>If it was a patch to "restore" a regression in behaviour, CCs should
>>at least have gone to the author of the patch that broke it, and the
>>subsystem maintainers / list / etc as well.
>
>
> I actually merged Lee's patch into -mm, copied James on it and then I
> dropped it when I saw that it spat rejects against an updated version of
> James's tree, assuming that it had been merged.
>
> Often I'll check that a patch reverts successfully from the upstream tree
> before dropping it, but for an obvious one like that I guess I didn't
> bother, and assumed that James had taken it. Only he hadn't - instead he'd
> gone and merged something else, hence the rejects. Oh well.
>

You do a great job, but "push the work out to the end nodes", right?
That's how we get this network to scale. It is trivial for people to
verify their important patches have propagated as the release approaches.

(A little harder for part-timers who aren't in the loop about exactly
when the release will happen, thanks to our -ridiculous-count release
system, but still easy compared with your having to double check
everything).

--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-08 9:02 ` Nick Piggin
@ 2006-03-08 9:12 ` Andrew Morton
2006-03-08 9:23 ` Nick Piggin
0 siblings, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2006-03-08 9:12 UTC (permalink / raw)
To: Nick Piggin
Cc: 76306.1226, torvalds, lee.schermerhorn, michaelc, jesper.juhl,
linux-kernel, James.Bottomley

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> > Often I'll check that a patch reverts successfully from the upstream tree
> > before dropping it, but for an obvious one like that I guess I didn't
> > bother, and assumed that James had taken it. Only he hadn't - instead he'd
> > gone and merged something else, hence the rejects. Oh well.
> >
>
> You do a great job, but "push the work out to the end nodes", right?
> That's how we get this network to scale. It is trivial for people to
> verify their important patches have propagated as the release approaches.
>
> (A little harder for part-timers who aren't in the loop about exactly
> when the release will happen, thanks to our -ridiculous-count release
> system, but still easy compared with your having to double check
> everything).

Well yes, Lee sent the fix to the guy who he got the kernel release from in
the reasonable expectation that I'd take care of getting it to where it
needed to be.

Problem is, a) I screwed up, b) James screwed up and c) someone just
happened to change those few lines of code in that place within a few-day
window.

That triple-combo doesn't happen very often.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-08 9:12 ` Andrew Morton
@ 2006-03-08 9:23 ` Nick Piggin
2006-03-08 14:35 ` Lee Schermerhorn
0 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2006-03-08 9:23 UTC (permalink / raw)
To: Andrew Morton
Cc: 76306.1226, torvalds, lee.schermerhorn, michaelc, jesper.juhl,
linux-kernel, James.Bottomley

Andrew Morton wrote:

> Well yes, Lee sent the fix to the guy who he got the kernel release from in
> the reasonable expectation that I'd take care of getting it to where it
> needed to be.
>
> Problem is, a) I screwed up, b) James screwed up and c) someone just
> happened to change those few lines of code in that place within a few-day
> window.
>
> That triple-combo doesn't happen very often.
>

I guess what I'm advocating isn't foolproof either: the guy who wrote
the patch might die (knock on wood) ;)

Carry on.

--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-08 9:23 ` Nick Piggin
@ 2006-03-08 14:35 ` Lee Schermerhorn
0 siblings, 0 replies; 53+ messages in thread
From: Lee Schermerhorn @ 2006-03-08 14:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, 76306.1226, torvalds, michaelc, jesper.juhl,
linux-kernel, James.Bottomley

On Wed, 2006-03-08 at 20:23 +1100, Nick Piggin wrote:
> Andrew Morton wrote:
>
> > Well yes, Lee sent the fix to the guy who he got the kernel release from in
> > the reasonable expectation that I'd take care of getting it to where it
> > needed to be.
> >
> > Problem is, a) I screwed up, b) James screwed up and c) someone just
> > happened to change those few lines of code in that place within a few-day
> > window.
> >
> > That triple-combo doesn't happen very often.
> >
>
> I guess what I'm advocating isn't foolproof either: the guy who wrote
> the patch might die (knock on wood) ;)

Thanks, Nick. I'll remember that...

See you at OLS?

Lee

P.S. ;-)

^ permalink raw reply [flat|nested] 53+ messages in thread
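The fix discussed in the messages above comes down to one observation:
switching the data *buffer* to a zeroing allocator clears only that
buffer, while the separate command structure still needs its own
memset(). A minimal user-space sketch of that pattern follows. It is not
the sr_ioctl.c code: struct demo_packet_command and demo_prepare() are
hypothetical stand-ins invented for illustration, and calloc() stands in
for kzalloc().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct packet_command. */
struct demo_packet_command {
	unsigned char cmd[12];
	unsigned char *buffer;
	unsigned int buflen;
	void *sense;
	int quiet;
	int timeout;
};

/*
 * Prepare a command that uses a freshly allocated data buffer.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int demo_prepare(struct demo_packet_command *cgc, unsigned int len)
{
	/* calloc() zeroes the data buffer only, like kzalloc() would. */
	unsigned char *buffer = calloc(1, len);

	if (!buffer)
		return -1;

	/*
	 * The command structure lives elsewhere (typically on the
	 * caller's stack), so it must be cleared separately. This is
	 * the step the thread says was dropped by mistake.
	 */
	memset(cgc, 0, sizeof(*cgc));

	cgc->buffer = buffer;
	cgc->buflen = len;
	return 0;
}
```

Without the memset(), fields such as sense, quiet and timeout in this
sketch would keep whatever stale bytes the caller's storage happened to
hold, even though the data buffer itself is clean.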
* Slab corruption in 2.6.16-rc5-mm2
@ 2006-03-06 0:17 Jesper Juhl
2006-03-06 18:25 ` Linus Torvalds
0 siblings, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 0:17 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, markhe, Andrea Arcangeli

Just found the following in dmesg :

Slab corruption: start=f72948a0, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f7294854, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c0173923>](real_lookup+0x93/0xe0)
000: 6c 69 62 6b 64 65 69 6e 69 74 5f 6b 69 6f 5f 68
010: 74 74 70 5f 63 61 63 68 65 5f 63 6c 65 61 6e 65
Next obj: start=f72948ec, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c01c7c80>](ext3_init_block_alloc_info+0x20/0x70)
000: 10 cf ce f7 00 00 00 00 00 00 00 00 00 00 00 00
010: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Slab corruption: start=f70aeab4, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f70aea68, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c023d6df>](init_dev+0x5cf/0x630)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Next obj: start=f70aeb00, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c0173923>](real_lookup+0x93/0xe0)
000: 6c 69 62 62 6f 6f 73 74 5f 70 72 67 5f 65 78 65
010: 63 5f 6d 6f 6e 69 74 6f 72 2d 67 63 63 2d 31 5f

Machine is running 2.6.16-rc5-mm2 :

$ uname -a
Linux dragon 2.6.16-rc5-mm2 #1 SMP PREEMPT Mon Mar 6 00:06:54 CET 2006 i686 athlon-4 i386 GNU/Linux

CPU is a dualcore Athlon X2 4400+
Kernel is 32bit, built for i386.
The machine has 2GB of RAM.

Let me know what additional info would be useful, if any.
If patches need testing then just send them my way.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 0:17 Jesper Juhl
@ 2006-03-06 18:25 ` Linus Torvalds
2006-03-06 18:43 ` Jesper Juhl
2006-03-06 18:48 ` Mike Christie
0 siblings, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2006-03-06 18:25 UTC (permalink / raw)
To: Jesper Juhl
Cc: Linux Kernel Mailing List, Andrew Morton, markhe,
Andrea Arcangeli, Mike Christie, James Bottomley

On Mon, 6 Mar 2006, Jesper Juhl wrote:
>
> Slab corruption: start=f72948a0, len=64
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
> 000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Prev obj: start=f7294854, len=64
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c0173923>](real_lookup+0x93/0xe0)
> 000: 6c 69 62 6b 64 65 69 6e 69 74 5f 6b 69 6f 5f 68
> 010: 74 74 70 5f 63 61 63 68 65 5f 63 6c 65 61 6e 65
> Next obj: start=f72948ec, len=64
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c01c7c80>](ext3_init_block_alloc_info+0x20/0x70)
> 000: 10 cf ce f7 00 00 00 00 00 00 00 00 00 00 00 00
> 010: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Ok, this is interesting because the previous and the next objects look
fine, which implies that it's not something else that overwrote it.

Perhaps more importantly, the actual corrupted object _should_ contain the
POISON_FREE bytes, but doesn't. In fact, what it _does_ contain is
apparently perfectly valid SCSI sense request data (if I read it right,
its ASC/ASCQ of 24/00 means "Invalid field in cdb").

So that "last user: sr_do_ioctl" thing actually matches what the contents
are.

Your other small corruption:

> Slab corruption: start=f70aeab4, len=64
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
> 000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Prev obj: start=f70aea68, len=64
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c023d6df>](init_dev+0x5cf/0x630)
> 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Next obj: start=f70aeb00, len=64
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c0173923>](real_lookup+0x93/0xe0)
> 000: 6c 69 62 62 6f 6f 73 74 5f 70 72 67 5f 65 78 65
> 010: 63 5f 6d 6f 6e 69 74 6f 72 2d 67 63 63 2d 31 5f

is exactly the same thing. The objects around it look ok. The object that
should be poisoned again looks like it contains normal request-sense data
(this time 3a/01: "Medium not present - tray closed", if I read it right).

So it really looks like in this case, the sr_do_ioctl() function free'd
the sense key object, but that the sense data was actually written _after_
the object was free'd.

That in turn would imply that scsi_execute() returned before the SCSI
command had actually completed. Now, scsi_execute() just calls to the
generic block device layer, which in turn just submits it, and then waits
for its completion, so if it returns too early, that means that it got
_completed_ too early by the driver.

Alternatively, it got re-tried.

The reason I mention that is that we have commit
17e01f216b611fc46956dcd9063aec4de75991e3, which changes scsi_execute() to
add retries. I wonder if something does a "complete(rq->waiting)" while
the thing is still retrying?

In general, I do not believe that we should retry special commands that
have been initiated by a user, we should return the error. But I haven't
thought this through.

Anyway, Jesper, I see two potential reasons for this bug:

 - total and utter slab confusion (the slab layer returned the same slab
   allocation twice to two different callers). I consider this pretty
   unlikely, because it's such a _major_ failure of the slab code, and the
   slab code hasn't changed that much, but I mention it just in case.

 - SCSI layer breakage. It might well be the low-level driver completing a
   request too early, or it might be the re-trying. If it's the re-trying,
   you could try just reverting that commit I pointed to (ie if you're a
   git user, just do "git revert 17e01f21", otherwise you'd need to look
   it up from gitweb and un-apply the patch)

Regardless, Jesper, it would be great to hear _what_ strange CDROM device
you have that would be implied in sr_ioctl.c - is it USB, SATA or
something else?

James, Mike, can you double-check the retries? In particular, it's _wrong_
to retry after you've already marked a command completed with
"complete(rq->waiting)", so if that happens somewhere, things are really
broken.

Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
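The reading of the dumps in the message above can be reproduced
mechanically. The sketch below assumes the bytes are fixed-format SCSI
sense data (response code 0x70 or 0x71 in byte 0, sense key in the low
nibble of byte 2, ASC in byte 12, ASCQ in byte 13) and that a freed,
untouched slab object is filled with the POISON_FREE byte 0x6b. All
demo_* names are hypothetical helpers written for this illustration, not
kernel functions.

```c
#include <stddef.h>

/* Fill byte Linux slab debugging writes into freed objects. */
#define DEMO_POISON_FREE 0x6b

/* Fields of interest from a fixed-format sense buffer. */
struct demo_sense {
	unsigned char response_code;
	unsigned char sense_key;
	unsigned char asc;
	unsigned char ascq;
};

/* Returns 1 if the first len bytes all still hold the poison value. */
static int demo_is_poisoned(const unsigned char *obj, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (obj[i] != DEMO_POISON_FREE)
			return 0;
	return 1;
}

/*
 * Decode fixed-format sense data. Returns 0 on success, -1 if the
 * buffer is too short or does not start with response code 0x70/0x71.
 */
static int demo_decode_fixed_sense(const unsigned char *buf, size_t len,
				   struct demo_sense *out)
{
	unsigned char code;

	if (len < 14)
		return -1;

	code = buf[0] & 0x7f;		/* bit 7 is the VALID flag */
	if (code != 0x70 && code != 0x71)
		return -1;

	out->response_code = code;
	out->sense_key = buf[2] & 0x0f;
	out->asc = buf[12];
	out->ascq = buf[13];
	return 0;
}
```

Fed the first 16 bytes of the first corrupted object (70 00 05 ... 24 00),
the decoder yields sense key 5 with ASC/ASCQ 24/00, and the poison check
fails, which is the combination the message describes: a freed object that
holds valid sense data instead of poison.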
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 18:25 ` Linus Torvalds
@ 2006-03-06 18:43 ` Jesper Juhl
2006-03-06 19:32 ` Linus Torvalds
2006-03-06 18:48 ` Mike Christie
1 sibling, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 18:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe,
Andrea Arcangeli, Mike Christie, James Bottomley, Jesper Juhl

On Monday 06 March 2006 19:25, Linus Torvalds wrote:
>
<...snip...>
> Anyway, Jesper, I see two potential reasons for this bug:
>
>  - total and utter slab confusion (the slab layer returned the same slab
>    allocation twice to two different callers). I consider this pretty
>    unlikely, because it's such a _major_ failure of the slab code, and the
>    slab code hasn't changed that much, but I mention it just in case.
>
>  - SCSI layer breakage. It might well be the low-level driver completing a
>    request too early, or it might be the re-trying. If it's the re-trying,
>    you could try just reverting that commit I pointed to (ie if you're a
>    git user, just do "git revert 17e01f21", otherwise you'd need to look
>    it up from gitweb and un-apply the patch)
>

Not a git user (I need to become one but haven't found the time to read up
on it yet), but no problem, I'll dig out the patch and try reverting it.

Luckily it seems this is pretty repeatable on every boot, I find it in the
logs instantly after logging in and launching a shell on my KDE desktop and
running dmesg - I'll do a few more reboots to make sure it *really* is
reproducible before reverting the patch so we can be sure if it fixes the
problem or not.

Btw, the messages turn out slightly different on each boot, here are the
ones from this current boot of my box:

Slab corruption: start=f72b6b98, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f72b6b4c, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813e6>](free_fdtable_rcu+0x66/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=f72b6be4, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<00000000>](_stext+0x3feffd68/0x8)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=f72b6b98, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f72b6b4c, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813e6>](free_fdtable_rcu+0x66/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=f72b6be4, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<00000000>](_stext+0x3feffd68/0x8)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=f72b6b98, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01d3769>](ext3_clear_inode+0x29/0x40)
000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
010: 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Prev obj: start=f72b6b4c, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813e6>](free_fdtable_rcu+0x66/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=f72b6be4, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<00000000>](_stext+0x3feffd68/0x8)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

Would gathering more of these help you out?

> Regardless, Jesper, it would be great to hear _what_ strange CDROM device
> you have that would implied in sr_ioctl.c - is it USB, SATA or something
> else?
>

I have no USB, SATA or similar devices in the box, only a floppy drive,
a SCSI harddisk, a SCSI CD writer and a SCSI DVD-ROM.

Here are some details :

$ cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 04 Lun: 00
  Vendor: PIONEER Model: DVD-ROM DVD-305 Rev: 1.03
  Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 05 Lun: 00
  Vendor: PLEXTOR Model: CD-R PX-W1210S Rev: 1.01
  Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: IBM Model: DDYS-T36950N Rev: S96H
  Type: Direct-Access ANSI SCSI revision: 03

From dmesg :

scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
        <Adaptec 29160N Ultra160 SCSI adapter>
        aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
  Vendor: PIONEER Model: DVD-ROM DVD-305 Rev: 1.03
  Type: CD-ROM ANSI SCSI revision: 02
 target0:0:4: Beginning Domain Validation
 target0:0:4: FAST-20 SCSI 20.0 MB/s ST (50 ns, offset 16)
 target0:0:4: Domain Validation skipping write tests
 target0:0:4: Ending Domain Validation
  Vendor: PLEXTOR Model: CD-R PX-W1210S Rev: 1.01
  Type: CD-ROM ANSI SCSI revision: 02
 target0:0:5: Beginning Domain Validation
 target0:0:5: FAST-20 SCSI 20.0 MB/s ST (50 ns, offset 16)
 target0:0:5: Domain Validation skipping write tests
 target0:0:5: Ending Domain Validation
  Vendor: IBM Model: DDYS-T36950N Rev: S96H
  Type: Direct-Access ANSI SCSI revision: 03
scsi0:A:6:0: Tagged Queuing enabled. Depth 200
 target0:0:6: Beginning Domain Validation
 target0:0:6: wide asynchronous
 target0:0:6: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 63)
 target0:0:6: Ending Domain Validation
SCSI device sda: 71687340 512-byte hdwr sectors (36704 MB)
sda: Write Protect is off
sda: Mode Sense: cb 00 00 08
SCSI device sda: drive cache: write back
SCSI device sda: 71687340 512-byte hdwr sectors (36704 MB)
sda: Write Protect is off
sda: Mode Sense: cb 00 00 08
SCSI device sda: drive cache: write back
 sda: sda1 sda2 sda3 sda4
sd 0:0:6:0: Attached scsi disk sda
sr0: scsi3-mmc drive: 16x/40x cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
sr 0:0:4:0: Attached scsi CD-ROM sr0
sr1: scsi3-mmc drive: 32x/32x writer cd/rw xa/form2 cdda tray
sr 0:0:5:0: Attached scsi CD-ROM sr1
sr 0:0:4:0: Attached scsi generic sg0 type 5
sr 0:0:5:0: Attached scsi generic sg1 type 5
sd 0:0:6:0: Attached scsi generic sg2 type 0

# lspci -vvx
00:00.0 Host bridge: ALi Corporation M1695 K8 Northbridge [PCI Express and HyperTransport]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Capabilities: [40] #08 [0060]
        Capabilities: [5c] #08 [a800]
        Capabilities: [68] #08 [9000]
        Capabilities: [74] #08 [8000]
        Capabilities: [7c] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
                Address: 00000000fee00000 Data: 0000
00: b9 10 95 16 07 00 10 00 00 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00

00:01.0 PCI bridge: ALi Corporation: Unknown device 524b (prog-if 00 [Normal decode])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, cache line size 10
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        Memory behind bridge: ff200000-ff2fffff
        BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
                Address: 00000000fee00000 Data: 0000
        Capabilities: [58] #10 [0141]
        Capabilities: [7c] #08 [a800]
        Capabilities: [88] #08 [8825]
00: b9 10 4b 52 06 01 10 00 00 00 04 06 10 00 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 00 f0 00 00 00
20: 20 ff 20 ff f0 ff 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0a 01 03 00

00:02.0 PCI bridge: ALi Corporation: Unknown device 524c (prog-if 00 [Normal decode])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, cache line size 10
        Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
        Memory behind bridge: ff300000-ff3fffff
        BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
                Address: 00000000fee00000 Data: 0000
        Capabilities: [58] #10 [0141]
        Capabilities: [7c] #08 [a800]
        Capabilities: [88] #08 [8825]
00: b9 10 4c 52 06 01 10 00 00 00 04 06 10 00 01 00
10: 00 00 00 00 00 00 00 00 00 02 02 00 f0 00 00 00
20: 30 ff 30 ff f0 ff 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 03 00

00:04.0 Host bridge: ALi Corporation M1689 K8 Northbridge [Super K8 Single Chip]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Region 0: Memory at dc000000 (32-bit, prefetchable) [size=64M]
        Capabilities: [40] #08 [0024]
        Capabilities: [60] #08 [8038]
        Capabilities: [80] AGP version 3.0
                Status: RQ=28 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW- AGP3- Rate=x1,x2,x4
                Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit- FW- Rate=<none>
00: b9 10 89 16 06 01 10 00 00 00 00 06 00 00 00 00
10: 08 00 00 dc 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00

00:05.0 PCI bridge: ALi Corporation AGP8X Controller (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Bus: primary=00, secondary=03, subordinate=03, sec-latency=64
        Memory behind bridge: ff400000-ff4fffff
        Prefetchable memory behind bridge: c7f00000-d7efffff
        BridgeCtl: Parity+ SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
00: b9 10 46 52 07 01 20 00 00 00 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 03 03 40 f0 00 20 22
20: 40 ff 40 ff f0 c7 e0 d7 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0b 00

00:06.0 PCI bridge: ALi Corporation M5249 HTT to PCI Bridge (prog-if 01 [Subtractive decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Bus: primary=00, secondary=04, subordinate=04, sec-latency=32
        I/O behind bridge: 0000d000-0000dfff
        Memory behind bridge: ff500000-ff5fffff
        Prefetchable memory behind bridge: 88000000-880fffff
        BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
00: b9 10 49 52 07 01 00 00 00 01 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 04 04 20 d0 d0 00 22
20: 50 ff 50 ff 00 88 00 88 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 00

00:07.0 ISA bridge: ALi Corporation M1563 HyperTransport South Bridge (rev 70)
        Subsystem: ASRock Incorporation: Unknown device 1563
        Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0 (250ns min, 6000ns max)
00: b9 10 63 15 0f 00 00 02 70 00 01 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 63 15
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 18

00:07.1 Bridge: ALi Corporation M7101 Power Management Controller [PMU]
        Subsystem: ASRock Incorporation: Unknown device 7101
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
00: b9 10 01 71 00 00 00 02 00 00 80 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 01 71
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:11.0 Ethernet controller: ALi Corporation M5263 Ethernet Controller (rev 40)
        Subsystem: ASRock Incorporation: Unknown device 5263
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (5000ns min, 10000ns max), cache line size 08
        Interrupt: pin A routed to IRQ 10
        Region 0: I/O ports at e800 [size=256]
        Region 1: Memory at ff6ffc00 (32-bit, non-prefetchable) [size=256]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: b9 10 63 52 07 01 10 02 40 00 00 02 08 20 00 00
10: 01 e8 00 00 00 fc 6f ff 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 63 52
30: 00 00 00 00 50 00 00 00 00 00 00 00 0a 01 14 28

00:12.0 IDE interface: ALi Corporation M5229 IDE (rev c7) (prog-if 8a [Master SecP PriP])
        Subsystem: ASRock Incorporation: Unknown device 5229
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32
        Interrupt: pin A routed to IRQ 0
        Region 0: I/O ports at <ignored>
        Region 1: I/O ports at <ignored>
        Region 2: I/O ports at <ignored>
        Region 3: I/O ports at <ignored>
        Region 4: I/O ports at ff00 [size=16]
00: b9 10 29 52 05 00 a0 02 c7 8a 01 01 00 20 00 00
10: f1 01 00 00 f5 03 00 00 71 01 00 00 75 03 00 00
20: 01 ff 00 00 00 00 00 00 00 00 00 00 49 18 29 52
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00

00:13.0 USB Controller: ALi Corporation USB 1.1 Controller (rev 03) (prog-if 10 [OHCI])
        Subsystem: ASRock Incorporation: Unknown device 5237
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (20000ns max), cache line size 10
        Interrupt: pin A routed to IRQ 11
        Region 0: Memory at ff6fe000 (32-bit, non-prefetchable) [size=4K]
00: b9 10 37 52 17 01 a8 02 03 10 03 0c 10 20 80 00
10: 00 e0 6f ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 37 52
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 01 00 50

00:13.1 USB Controller: ALi Corporation USB 1.1 Controller (rev 03) (prog-if 10 [OHCI])
        Subsystem: ASRock Incorporation: Unknown device 5237
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (20000ns max), cache line size 10
        Interrupt: pin B routed to IRQ 3
        Region 0: Memory at ff6fd000 (32-bit, non-prefetchable) [size=4K]
00: b9 10 37 52 17 01 a8 02 03 10 03 0c 10 20 80 00
10: 00 d0 6f ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 37 52
30: 00 00 00 00 00 00 00 00 00 00 00 00 03 02 00 50

00:13.2 USB Controller: ALi Corporation USB 1.1 Controller (rev 03) (prog-if 10 [OHCI])
        Subsystem: ASRock Incorporation: Unknown device 5237
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (20000ns max), cache line size 10
        Interrupt: pin C routed to IRQ 11
        Region 0: Memory at ff6fc000 (32-bit, non-prefetchable) [size=4K]
00: b9 10 37 52 17 01 a8 02 03 10 03 0c 10 20 80 00
10: 00 c0 6f ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 37 52
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 03 00 50

00:13.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01) (prog-if 20 [EHCI])
        Subsystem: ASRock Incorporation: Unknown device 5239
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (4000ns min, 8000ns max), cache line size 10
        Interrupt: pin D routed to IRQ 5
        Region 0: Memory at ff6ff800 (32-bit, non-prefetchable) [size=256]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] #0a [2090]
00: b9 10 39 52 16 01 b0 02 01 20 03 0c 10 20 80 00
10: 00 f8 6f ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 49 18 39 52
30: 00 00 00 00 50 00 00 00 00 00 00 00 05 04 10 20

00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR-
<PERR- Capabilities: [80] #08 [2101] 00: 22 10 00 11 00 00 10 00 00 00 00 06 00 00 80 00 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 30: 00 00 00 00 80 00 00 00 00 00 00 00 00 00 00 00 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- 00: 22 10 01 11 00 00 00 00 00 00 00 06 00 00 80 00 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- 00: 22 10 02 11 00 00 00 00 00 00 00 06 00 00 80 00 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- 00: 22 10 03 11 00 00 00 00 00 00 00 06 00 00 80 00 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA Parhelia AGP (rev 03) (prog-if 00 [VGA]) Subsystem: Matrox Graphics, Inc. 
Parhelia 128Mb Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 (4000ns min, 8000ns max), cache line size 10 Interrupt: pin A routed to IRQ 5 Region 0: Memory at c8000000 (32-bit, prefetchable) [size=128M] Region 1: Memory at ff4fe000 (32-bit, non-prefetchable) [size=8K] Expansion ROM at ff4c0000 [disabled] [size=128K] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [f0] AGP version 2.0 Status: RQ=32 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4 Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit- FW- Rate=<none> 00: 2b 10 27 05 07 00 b0 02 03 00 00 03 10 20 00 00 10: 08 00 00 c8 00 e0 4f ff 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 2b 10 40 08 30: 00 00 4c ff dc 00 00 00 00 00 00 00 05 01 10 20 04:05.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 0a) Subsystem: Creative Labs SBLive! 5.1 eMicro 28028 Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 (500ns min, 5000ns max) Interrupt: pin A routed to IRQ 20 Region 0: I/O ports at d880 [size=32] Capabilities: [dc] Power Management version 1 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00: 02 11 02 00 05 01 90 02 0a 00 01 04 00 20 80 00 10: 81 d8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 02 11 67 80 30: 00 00 00 00 dc 00 00 00 00 00 00 00 0b 01 02 14 04:05.1 Input device controller: Creative Labs SB Live! 
MIDI/Game Port (rev 0a) Subsystem: Creative Labs Gameport Joystick Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 Region 0: I/O ports at dc00 [size=8] Capabilities: [dc] Power Management version 1 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00: 02 11 02 70 05 01 90 02 0a 00 80 09 00 20 80 00 10: 01 dc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 02 11 20 00 30: 00 00 00 00 dc 00 00 00 00 00 00 00 00 00 00 00 04:06.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02) Subsystem: Adaptec 29160N Ultra160 SCSI Controller Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 (10000ns min, 6250ns max), cache line size 10 Interrupt: pin A routed to IRQ 19 BIST result: 00 Region 0: I/O ports at d400 [disabled] [size=256] Region 1: Memory at ff5ff000 (64-bit, non-prefetchable) [size=4K] Expansion ROM at 88000000 [disabled] [size=128K] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00: 05 90 80 00 16 01 b0 02 02 00 00 01 10 20 00 80 10: 01 d4 00 00 04 f0 5f ff 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 05 90 a0 62 30: 00 00 5c ff dc 00 00 00 00 00 00 00 03 01 28 19 04:07.0 Ethernet controller: VIA Technologies, Inc. 
VT6102 [Rhine-II] (rev 42) Subsystem: D-Link System Inc DFE-530TX rev B Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 32 (750ns min, 2000ns max), cache line size 10 Interrupt: pin A routed to IRQ 18 Region 0: I/O ports at d000 [size=256] Region 1: Memory at ff5fec00 (32-bit, non-prefetchable) [size=256] Expansion ROM at 88020000 [disabled] [size=64K] Capabilities: [40] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00: 06 11 65 30 17 01 10 02 42 00 00 02 10 20 00 00 10: 01 d0 00 00 00 ec 5f ff 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 11 01 14 30: 00 00 ff ff 40 00 00 00 00 00 00 00 02 01 03 08 /Jesper ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 18:43 ` Jesper Juhl
@ 2006-03-06 19:32 ` Linus Torvalds
2006-03-06 19:51 ` Jesper Juhl
2006-03-06 20:06 ` Linus Torvalds
0 siblings, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2006-03-06 19:32 UTC (permalink / raw)
To: Jesper Juhl
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2986 bytes --]

On Mon, 6 Mar 2006, Jesper Juhl wrote:
>
> Not a git user (I need to become one but haven't found the time to read up
> on it yet), but no problem, I'll dig out the patch and try reverting it.

It's attached here.

NOTE! I'm not at all sure it's the re-try logic. It could be something
else. Anything that completes the request before it's actually totally
done - or possibly re-uses the sense data for something else would be
wrong and buggy.

> Btw, the messages turn out slightly different on each boot, here are the
> ones from this current boot of my box:
>
> Slab corruption: start=f72b6b98, len=64
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
> 000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Ok, same deal. "Medium not present - tray closed" sense data.

> Slab corruption: start=f72b6b98, len=64
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
> 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Hmm. Totally empty sense data? Strange.

> Slab corruption: start=f72b6b98, len=64
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c01d3769>](ext3_clear_inode+0x29/0x40)
> 000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
> 010: 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

This is different. But it looks similar. It looks like the thing was
actually re-allocated for something else (posix acl data?) but then
overwritten. However, the overwritten data does look like SCSI sense
information again ("Invalid field in cdb"), so I think it's the same
thing despite the fact that it had gotten re-allocated for something else.

> Would gathering more of these help you out?

It's always interesting when trying to find the pattern, but I think the
pattern is already pretty clear. sr_do_ioctl() seems to be the thing, and
sense data is written too late.

> I have no USB, SATA or similar devices in the box, only a floppy drive, a
> SCSI harddisk, a SCSI CD writer and a SCSI DVD-ROM.

Well, the fact that you have a SCSI CD-writer and a SCSI DVD-ROM explains
the thing, so that's all good.

> scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
> <Adaptec 29160N Ultra160 SCSI adapter>
> aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

So it's either an aic7xxx bug, or it's generic SCSI.

Considering that we've had other slab corruption issues (the reason I was
looking closely at yours), generic SCSI isn't out of the question.

If you were a git user, doing a bisection run would be useful since you
seem to be able to recreate it at will. Oh, well. Testing that one patch
would still help.

Linus

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2046 bytes --]

diff-tree 17e01f216b611fc46956dcd9063aec4de75991e3 (from 6e68af666f5336254b5715dca591026b7324499a)
Author: Mike Christie <michaelc@cs.wisc.edu>
Date: Fri Nov 11 05:31:37 2005 -0600

[SCSI] add retries field to request for REQ_BLOCK_PC use

For tape we need to control the retries. This patch adds a retries
counter on the request for REQ_BLOCK_PC commands originating from
scsi_execute* to use. REQ_BLOCK_PC commands coming from the block
layer SG_IO path continue to use the retries set in the ULD
init_command. (scsi_execute* does not set the gendisk so we do not
execute the init_command in that path).

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index eb0cfbf..365843a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -259,6 +259,7 @@ int scsi_execute(struct scsi_device *sde
 	memcpy(req->cmd, cmd, req->cmd_len);
 	req->sense = sense;
 	req->sense_len = 0;
+	req->retries = retries;
 	req->timeout = timeout;
 	req->flags |= flags | REQ_BLOCK_PC | REQ_SPECIAL | REQ_QUIET;

@@ -472,6 +473,7 @@ int scsi_execute_async(struct scsi_devic
 	req->sense = sioc->sense;
 	req->sense_len = 0;
 	req->timeout = timeout;
+	req->retries = retries;
 	req->flags |= REQ_BLOCK_PC | REQ_QUIET;
 	req->end_io_data = sioc;

@@ -1393,7 +1395,7 @@ static int scsi_prep_fn(struct request_q
 		cmd->sc_data_direction = DMA_NONE;

 	cmd->transfersize = req->data_len;
-	cmd->allowed = 3;
+	cmd->allowed = req->retries;
 	cmd->timeout_per_command = req->timeout;
 	cmd->done = scsi_generic_done;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9a68716..509e9a0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -184,6 +184,7 @@ struct request {
 	void *sense;

 	unsigned int timeout;
+	int retries;

 	/*
 	 * For Power Management requests

^ permalink raw reply related [flat|nested] 53+ messages in thread
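[Editorial aside on the dumps in the message above.] The byte strings read there as sense data match the SCSI fixed sense format: byte 0 carries the response code (0x70), the low nibble of byte 2 is the sense key, and bytes 12 and 13 hold the ASC/ASCQ pair. What follows is only a minimal sketch of that decoding in C, assuming fixed-format data. `decode_fixed_sense()` and `sense_text()` are hypothetical helper names, not kernel functions, and the lookup covers just the two codes seen in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of interest in a fixed-format SCSI sense buffer. */
struct sense_fields {
	uint8_t response_code;	/* byte 0, low 7 bits (0x70 or 0x71) */
	uint8_t sense_key;	/* byte 2, low 4 bits */
	uint8_t asc;		/* byte 12: additional sense code */
	uint8_t ascq;		/* byte 13: additional sense code qualifier */
};

/*
 * Hypothetical helper, not a kernel API.
 * Returns 0 on success, -1 if the buffer is too short or is not
 * fixed-format sense data.
 */
static int decode_fixed_sense(const uint8_t *buf, size_t len,
			      struct sense_fields *out)
{
	assert(buf != NULL && out != NULL);

	if (len < 14)
		return -1;

	out->response_code = buf[0] & 0x7f;
	if (out->response_code != 0x70 && out->response_code != 0x71)
		return -1;

	out->sense_key = buf[2] & 0x0f;
	out->asc = buf[12];
	out->ascq = buf[13];
	return 0;
}

/* Hypothetical lookup limited to the two codes quoted in this thread. */
static const char *sense_text(const struct sense_fields *f)
{
	if (f->sense_key == 0x2 && f->asc == 0x3a && f->ascq == 0x01)
		return "Medium not present - tray closed";
	if (f->sense_key == 0x5 && f->asc == 0x24 && f->ascq == 0x00)
		return "Invalid field in cdb";
	return "unknown (not in this sketch's table)";
}
```

Fed the first dump (70 00 02 ... 3a 01), this gives sense key 2 with ASC/ASCQ 3a/01; the third dump (70 00 05 ... 24 00) gives sense key 5 with 24/00. The all-zero dump is rejected because byte 0 is not a fixed-format response code, which fits the "totally empty sense data" remark.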
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 19:32 ` Linus Torvalds
@ 2006-03-06 19:51 ` Jesper Juhl
2006-03-06 19:58 ` Jesper Juhl
2006-03-06 20:06 ` Linus Torvalds
1 sibling, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 19:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley

On Monday 06 March 2006 20:32, Linus Torvalds wrote:
>
> On Mon, 6 Mar 2006, Jesper Juhl wrote:
> >
> > Not a git user (I need to become one but haven't found the time to read up
> > on it yet), but no problem, I'll dig out the patch and try reverting it.
>
> It's attached here.
>
Thanks.

> NOTE! I'm not at all sure it's the re-try logic. It could be something
> else. Anything that completes the request before it's actually totally
> done - or possibly re-uses the sense data for something else would be
> wrong and buggy.
>
Ohh well, let's work on the assumption that it is the re-try logic first,
then try something else if it turns out it isn't. I have no problem testing
a bunch of patches if needed.

<...>

> This is different. But it looks similar. It looks like the thing was
> actually re-allocated for something else (posix acl data?) but then

I doubt it's POSIX ACL data :

juhl@dragon:~/download/kernel/linux-2.6.16-rc5-mm2$ grep -i ACL .config
# CONFIG_FS_POSIX_ACL is not set

> overwritten. However, the overwritten data does look like SCSI sense
> information again ("Invalid field in cdb"), so I think it's the same
> thing despite the fact that it had gotten re-allocated for something else.
>
> > Would gathering more of these help you out?
>
> It's always interesting when trying to find the pattern, but I think the
> pattern is already pretty clear. sr_do_ioctl() seems to be the thing, and
> sense data is written too late.
>
Ok, it's reproducible on demand (at least I've now reproduced it on 6 more
boots), so if you need any more just let me know and I'll gather a few.

> > I have no USB, SATA or similar devices in the box, only a floppy drive, a
> > SCSI harddisk, a SCSI CD writer and a SCSI DVD-ROM.
>
> Well, the fact that you have a SCSI CD-writer and a SCSI DVD-ROM explains
> the thing, so that's all good.
>
> > scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
> > <Adaptec 29160N Ultra160 SCSI adapter>
> > aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
>
> So it's either an aic7xxx bug, or it's generic SCSI.
>
> Considering that we've had other slab corruption issues (the reason I was
> looking closely at yours), generic SCSI isn't out of the question.
>
> If you were a git user, doing a bisection run would be useful since you

Well, now is probably as good a time as any for becoming a git user;
tracking my own patches as individual plain-text files is getting
unmanageable and, as you say, bisection would be useful to be able to do.
I'll dig up some git docs and start reading.

> seem to be able to recreate it at will. Oh, well. Testing that one patch
> would still help.
>
Hmm, that patch does not apply to the 2.6.16-rc5-mm2 kernel :

patching file drivers/scsi/scsi_lib.c
Hunk #1 succeeded at 260 (offset 1 line).
Hunk #2 FAILED at 473.
Hunk #3 FAILED at 1394.
2 out of 3 hunks FAILED -- saving rejects to file drivers/scsi/scsi_lib.c.rej
patching file include/linux/blkdev.h
Hunk #1 succeeded at 190 (offset 6 lines).

I'll go see if the problem also exists in mainline - will report on that
shortly.

/Jesper

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 19:51 ` Jesper Juhl
@ 2006-03-06 19:58 ` Jesper Juhl
0 siblings, 0 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 19:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley

On 3/6/06, Jesper Juhl <jesper.juhl@gmail.com> wrote:
> On Monday 06 March 2006 20:32, Linus Torvalds wrote:
> >
> > seem to be able to recreate it at will. Oh, well. Testing that one patch
> > would still help.
> >
> Hmm, that patch does not apply to the 2.6.16-rc5-mm2 kernel :
>
I fixed it up by hand. Building a new kernel at the moment - results
in a short while.

>
> I'll go see if the problem also exists in mainline - will report on that
> shortly.
>
I'll still do this. Just downloading 2.6.16-rc5-git8 as we speak.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 19:32 ` Linus Torvalds
2006-03-06 19:51 ` Jesper Juhl
@ 2006-03-06 20:06 ` Linus Torvalds
2006-03-06 20:24 ` Jesper Juhl
2006-03-06 20:36 ` Jesper Juhl
1 sibling, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2006-03-06 20:06 UTC (permalink / raw)
To: Jesper Juhl
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley

On Mon, 6 Mar 2006, Linus Torvalds wrote:
>
> So it's either an aic7xxx bug, or it's generic SCSI.
>
> Considering that we've had other slab corruption issues (the reason I was
> looking closely at yours), generic SCSI isn't out of the question.
>
> If you were a git user, doing a bisection run would be useful since you
> seem to be able to recreate it at will. Oh, well. Testing that one patch
> would still help.

Hmm.. This appended patch may or may not help.

It overwrites the SCSI command "req" pointer when the request has been
done. The request cannot be used afterwards, so anybody accessing it would
be a bug. I think.

HOWEVER. I noticed something else strange. Your slab corruption report
says

	Slab corruption: start=f72948a0, len=64
	Redzone: 0x5a2cf071/0x5a2cf071.
	Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
	...

and the scary thing is that "len=64".

The thing is, SCSI uses "SCSI_SENSE_BUFFERSIZE" to determine the maximum
sense size to copy, and what do we have, if not

	include/scsi/scsi_cmnd.h:#define SCSI_SENSE_BUFFERSIZE 96

ie a 64-byte buffer is simply TOO DAMN SMALL!

Now, the thing is, the 64 comes from "sizeof(struct request_sense)", which
is what "struct packet_command *" uses. We can change that sizeof() to
just use SCSI_SENSE_BUFFERSIZE, but that still makes me worry about
somebody else having allocated a "packet_command->sense" using just the
same 64-byte "struct request_sense".

Can a SCSI sense-buffer really be 96? If so, why is "struct request_sense"
the wrong size? This looks like a really nasty bug.

Linus

----
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 701a328..296ac8e 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -788,6 +788,9 @@ static struct scsi_cmnd *scsi_end_reques
 		}
 	}

+	/* Poison the request pointer: it is done and no longer exists after this */
+	cmd->request = (void *) 0x5a5a5a5a;
+
 	add_disk_randomness(req->rq_disk);

 	spin_lock_irqsave(q->queue_lock, flags);

^ permalink raw reply related [flat|nested] 53+ messages in thread
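[Editorial aside on the size concern raised in the message above.] If a producer copies up to SCSI_SENSE_BUFFERSIZE (96) bytes into a buffer sized as sizeof(struct request_sense) (64, per that message), the copy can run 32 bytes past the end. The C sketch below only illustrates that arithmetic and a clamped copy; it is not the kernel's code and not a confirmed diagnosis of this report. `REQUEST_SENSE_SIZE`, `sense_overrun()` and `copy_sense_bounded()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value quoted in the thread from include/scsi/scsi_cmnd.h. */
#define SCSI_SENSE_BUFFERSIZE	96
/* sizeof(struct request_sense) as reported in the thread (assumption). */
#define REQUEST_SENSE_SIZE	64

/*
 * Hypothetical helper: number of bytes an unclamped copy of
 * SCSI_SENSE_BUFFERSIZE bytes would write past a buffer of dst_size.
 */
static size_t sense_overrun(size_t dst_size)
{
	return dst_size < SCSI_SENSE_BUFFERSIZE ?
		SCSI_SENSE_BUFFERSIZE - dst_size : 0;
}

/*
 * Hypothetical helper: copy sense data but never more than the
 * destination can hold. Returns the number of bytes copied.
 */
static size_t copy_sense_bounded(void *dst, size_t dst_size,
				 const void *src, size_t src_len)
{
	size_t n = src_len < dst_size ? src_len : dst_size;

	assert(dst != NULL && src != NULL);
	memcpy(dst, src, n);
	return n;
}
```

With a 64-byte destination, `sense_overrun()` reports 32 bytes at risk, while `copy_sense_bounded()` stops at byte 64 and leaves whatever follows the buffer untouched. Clamping at the copy site is one option; sizing every sense buffer to SCSI_SENSE_BUFFERSIZE, as suggested in the message, is the other.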
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 20:06 ` Linus Torvalds
@ 2006-03-06 20:24 ` Jesper Juhl
2006-03-06 20:30 ` Jens Axboe
2006-03-06 20:36 ` Jesper Juhl
1 sibling, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 20:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley

On Monday 06 March 2006 21:06, Linus Torvalds wrote:
>
> On Mon, 6 Mar 2006, Linus Torvalds wrote:
> >
> > So it's either an aic7xxx bug, or it's generic SCSI.
> >
> > Considering that we've had other slab corruption issues (the reason I was
> > looking closely at yours), generic SCSI isn't out of the question.
> >
> > If you were a git user, doing a bisection run would be useful since you
> > seem to be able to recreate it at will. Oh, well. Testing that one patch
> > would still help.
>

Since the patch you sent me didn't apply cleanly to the mm kernel I made the
changes by hand. This is what I ended up with (should be the same end result
as what you intended as far as I can see) :

--- linux-2.6.16-rc5-mm2-orig/drivers/scsi/scsi_lib.c 2006-03-05 23:43:56.000000000 +0100
+++ linux-2.6.16-rc5-mm2/drivers/scsi/scsi_lib.c 2006-03-06 21:13:53.000000000 +0100
@@ -260,7 +260,6 @@ int scsi_execute(struct scsi_device *sde
 	memcpy(req->cmd, cmd, req->cmd_len);
 	req->sense = sense;
 	req->sense_len = 0;
-	req->retries = retries;
 	req->timeout = timeout;
 	req->flags |= flags | REQ_BLOCK_PC | REQ_SPECIAL | REQ_QUIET;

@@ -478,7 +477,6 @@ int scsi_execute_async(struct scsi_devic
 	req->sense = sioc->sense;
 	req->sense_len = 0;
 	req->timeout = timeout;
-	req->retries = retries;
 	req->end_io_data = sioc;

 	sioc->data = privdata;
@@ -1240,7 +1238,7 @@ static void scsi_setup_blk_pc_cmnd(struc
 		cmd->sc_data_direction = DMA_FROM_DEVICE;

 	cmd->transfersize = req->data_len;
-	cmd->allowed = req->retries;
+	cmd->allowed = 3;
 	cmd->timeout_per_command = req->timeout;
 	cmd->done = scsi_blk_pc_done;
 }
--- linux-2.6.16-rc5-mm2-orig/block/scsi_ioctl.c 2006-03-05 23:43:41.000000000 +0100
+++ linux-2.6.16-rc5-mm2/block/scsi_ioctl.c 2006-03-06 21:16:19.000000000 +0100
@@ -314,8 +314,6 @@ static int sg_io(struct file *file, requ
 	if (!rq->timeout)
 		rq->timeout = BLK_DEFAULT_TIMEOUT;

-	rq->retries = 0;
-
 	start_time = jiffies;

 	/* ignore return value. All information is passed back to caller
@@ -433,7 +431,6 @@ static int sg_scsi_ioctl(struct file *fi
 	rq->data = buffer;
 	rq->data_len = bytes;
 	rq->flags |= REQ_BLOCK_PC;
-	rq->retries = 0;

 	blk_execute_rq(q, bd_disk, rq, 0);
 	err = rq->errors & 0xff; /* only 8 bit SCSI status */
--- linux-2.6.16-rc5-mm2-orig/include/linux/blkdev.h 2006-03-05 23:44:06.000000000 +0100
+++ linux-2.6.16-rc5-mm2/include/linux/blkdev.h 2006-03-06 21:13:02.000000000 +0100
@@ -190,7 +190,6 @@ struct request {
 	void *sense;

 	unsigned int timeout;
-	int retries;

 	/*
 	 * For Power Management requests


Unfortunately that didn't fix it. After booting the patched kernel I found
this in dmesg :

Slab corruption: start=f77d768c, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934db>](sr_do_ioctl+0x11b/0x270)
000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f77d7640, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c023d64d>](init_dev+0x55d/0x630)
000: 00 01 00 00 05 00 00 00 bf 00 00 00 3b 8a 00 00
010: 00 03 1c 7f 15 04 00 01 00 11 13 1a 00 12 0f 17
Next obj: start=f77d76d8, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c01c7c80>](ext3_init_block_alloc_info+0x20/0x70)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Slab corruption: start=f7001640, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934db>](sr_do_ioctl+0x11b/0x270)
000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f70015f4, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<f8bdf124>](__snd_util_mem_alloc+0x74/0x80 [snd_util_mem])
000: 00 10 00 00 00 00 00 00 b0 75 7d f7 50 33 bd f5
010: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
Next obj: start=f700168c, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c0173923>](real_lookup+0x93/0xe0)
000: 6c 69 62 62 6f 6f 73 74 5f 75 6e 69 74 5f 74 65
010: 73 74 5f 66 72 61 6d 65 77 6f 72 6b 2d 67 63 63


> Hmm.. This appended patch may or may not help.
>
I'll give it a spin.

> It overwrites the SCSI command "req" pointer when the request has been
> done. The request cannot be used afterwards, so anybody accessing it would
> be a bug. I think.
>
Let's see what happens. I've applied it on top of the changes mentioned
above - let me know if that's wrong.

> HOWEVER. I noticed something else strange. Your slab corruption report
> says
>
> Slab corruption: start=f72948a0, len=64
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
> ...
>
> and the scary thing is that "len=64".
>
> The thing is, SCSI uses "SCSI_SENSE_BUFFERSIZE" to determine the maximum
> sense size to copy, and what do we have, if not
>
> include/scsi/scsi_cmnd.h:#define SCSI_SENSE_BUFFERSIZE 96
>
> ie a 64-byte buffer is simply TOO DAMN SMALL!
>
> Now, the thing is, the 64 comes from "sizeof(struct request_sense)", which
> is what "struct packet_command *" uses. We can change that sizeof() to
> just use SCSI_SENSE_BUFFERSIZE, but that still makes me worry about

I'll try that after booting with the patch you just supplied and let you
know the results shortly.

/Jesper

^ permalink raw reply [flat|nested] 53+ messages in thread
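[Editorial aside on how reports like the ones above arise.] With slab debugging enabled, a freed object is filled with a poison pattern and re-checked later; any byte that no longer matches means something wrote to the object after it was freed. The C sketch below illustrates that check under stated assumptions. The 0x6b fill byte is visible in the dumps in this thread, and 0xa5 as the final byte is how the slab allocator's poison constants are recalled here, so treat it as an assumption. `poison_free_object()`, `first_corrupt_byte()` and `corrupt_extent()` are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_FREE	0x6b	/* fill byte for a freed object (seen in the dumps) */
#define POISON_END	0xa5	/* assumed marker for the last byte of the object */

/* Hypothetical helper: poison an object at free time. */
static void poison_free_object(unsigned char *obj, size_t size)
{
	assert(obj != NULL && size > 0);
	memset(obj, POISON_FREE, size - 1);
	obj[size - 1] = POISON_END;
}

/* Expected poison value at a given offset. */
static unsigned char poison_at(size_t i, size_t size)
{
	return i == size - 1 ? POISON_END : POISON_FREE;
}

/* Index of the first byte that no longer holds poison, or -1 if intact. */
static long first_corrupt_byte(const unsigned char *obj, size_t size)
{
	size_t i;

	for (i = 0; i < size; i++)
		if (obj[i] != poison_at(i, size))
			return (long)i;
	return -1;
}

/* One past the index of the last modified byte, or 0 if intact. */
static size_t corrupt_extent(const unsigned char *obj, size_t size)
{
	size_t i;

	for (i = size; i > 0; i--)
		if (obj[i - 1] != poison_at(i - 1, size))
			return i;
	return 0;
}
```

Applied to the earlier dump that still shows 6b bytes from offset 0x12 onward, a check of this kind would put the damage in the first 18 bytes, which matches a fixed-format sense block whose additional-length byte is 0x0a (8 + 10 bytes).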
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 20:24 ` Jesper Juhl
@ 2006-03-06 20:30 ` Jens Axboe
2006-03-06 20:33 ` Jens Axboe
2006-03-06 21:41 ` Jesper Juhl
0 siblings, 2 replies; 53+ messages in thread
From: Jens Axboe @ 2006-03-06 20:30 UTC (permalink / raw)
To: Jesper Juhl
Cc: Linus Torvalds, Linux Kernel Mailing List, Andrew Morton, markhe,
Andrea Arcangeli, Mike Christie, James Bottomley

On Mon, Mar 06 2006, Jesper Juhl wrote:
> On Monday 06 March 2006 21:06, Linus Torvalds wrote:
> >
> > On Mon, 6 Mar 2006, Linus Torvalds wrote:
> > >
> > > So it's either an aic7xxx bug, or it's generic SCSI.
> > >
> > > Considering that we've had other slab corruption issues (the reason I was
> > > looking closely at yours), generic SCSI isn't out of the question.
> > >
> > > If you were a git user, doing a bisection run would be useful since you
> > > seem to be able to recreate it at will. Oh, well. Testing that one patch
> > > would still help.
> >
>
> Since the patch you sent me didn't apply cleanly to the mm kernel I made the
> changes by hand. This is what I ended up with (should be the same end result
> as what you intended as far as I can see) :
>
> --- linux-2.6.16-rc5-mm2-orig/drivers/scsi/scsi_lib.c 2006-03-05 23:43:56.000000000 +0100
> +++ linux-2.6.16-rc5-mm2/drivers/scsi/scsi_lib.c 2006-03-06 21:13:53.000000000 +0100
> @@ -260,7 +260,6 @@ int scsi_execute(struct scsi_device *sde
>  	memcpy(req->cmd, cmd, req->cmd_len);
>  	req->sense = sense;
>  	req->sense_len = 0;
> -	req->retries = retries;
>  	req->timeout = timeout;
>  	req->flags |= flags | REQ_BLOCK_PC | REQ_SPECIAL | REQ_QUIET;
>
> @@ -478,7 +477,6 @@ int scsi_execute_async(struct scsi_devic
>  	req->sense = sioc->sense;
>  	req->sense_len = 0;
>  	req->timeout = timeout;
> -	req->retries = retries;
>  	req->end_io_data = sioc;
>
>  	sioc->data = privdata;
> @@ -1240,7 +1238,7 @@ static void scsi_setup_blk_pc_cmnd(struc
>  		cmd->sc_data_direction = DMA_FROM_DEVICE;
>
>  	cmd->transfersize = req->data_len;
> -	cmd->allowed = req->retries;
> +	cmd->allowed = 3;
>  	cmd->timeout_per_command = req->timeout;
>  	cmd->done = scsi_blk_pc_done;
>  }
> --- linux-2.6.16-rc5-mm2-orig/block/scsi_ioctl.c 2006-03-05 23:43:41.000000000 +0100
> +++ linux-2.6.16-rc5-mm2/block/scsi_ioctl.c 2006-03-06 21:16:19.000000000 +0100
> @@ -314,8 +314,6 @@ static int sg_io(struct file *file, requ
>  	if (!rq->timeout)
>  		rq->timeout = BLK_DEFAULT_TIMEOUT;
>
> -	rq->retries = 0;
> -
>  	start_time = jiffies;
>
>  	/* ignore return value. All information is passed back to caller
> @@ -433,7 +431,6 @@ static int sg_scsi_ioctl(struct file *fi
>  	rq->data = buffer;
>  	rq->data_len = bytes;
>  	rq->flags |= REQ_BLOCK_PC;
> -	rq->retries = 0;
>
>  	blk_execute_rq(q, bd_disk, rq, 0);
>  	err = rq->errors & 0xff; /* only 8 bit SCSI status */
> --- linux-2.6.16-rc5-mm2-orig/include/linux/blkdev.h 2006-03-05 23:44:06.000000000 +0100
> +++ linux-2.6.16-rc5-mm2/include/linux/blkdev.h 2006-03-06 21:13:02.000000000 +0100
> @@ -190,7 +190,6 @@ struct request {
>  	void *sense;
>
>  	unsigned int timeout;
> -	int retries;
>
>  	/*
>  	 * For Power Management requests
>
>
> Unfortunately that didn't fix it. After booting the patched kernel I found
> this in dmesg :

I don't see how it could be, honestly, we would gladly oops in locally
close places if that was the case. If you disable slab debug/poison, do
you get a nice NULL pointer dereference instead? There have been some
reports on a NULL queue for sr devices as of late, I wonder if some
SCSI change recently was broken.

Tejun, I seem to recall you looking at this, but I can't seem to locate
the thread. Did anything come of it?

--
Jens Axboe

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 20:30 ` Jens Axboe
@ 2006-03-06 20:33 ` Jens Axboe
2006-03-06 21:14 ` Jesper Juhl
2006-03-06 21:41 ` Jesper Juhl
1 sibling, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2006-03-06 20:33 UTC (permalink / raw)
To: Jesper Juhl
Cc: Linus Torvalds, Linux Kernel Mailing List, Andrew Morton, markhe,
Andrea Arcangeli, Mike Christie, James Bottomley

On Mon, Mar 06 2006, Jens Axboe wrote:
> I don't see how it could be, honestly, we would gladly oops in locally
> close places if that was the case. If you disable slab debug/poison, do
> you get a nice NULL pointer dereference instead? There have been some
> reports on a NULL queue for sr devices as of late, I wonder if some
> SCSI change recently was broken.
>
> Tejun, I seem to recall you looking at this, but I can't seem to locate
> the thread. Did anything come of it?

This is the one:

http://marc.theaimsgroup.com/?l=linux-kernel&m=114041855331295&w=2

Also an -mm report, btw. Does this reproduce with 2.6.16-rcX latest?

--
Jens Axboe

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 20:33 ` Jens Axboe
@ 2006-03-06 21:14 ` Jesper Juhl
0 siblings, 0 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 21:14 UTC (permalink / raw)
To: Jens Axboe
Cc: Linus Torvalds, Linux Kernel Mailing List, Andrew Morton, markhe,
Andrea Arcangeli, Mike Christie, James Bottomley

On 3/6/06, Jens Axboe <axboe@suse.de> wrote:
<snip>
>
> This is the one:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=114041855331295&w=2
>
> Also an -mm report, btw. Does this reproduce with 2.6.16-rcX latest?
>
I just built and booted a 2.6.16-rc5-git8 with the same config that I
used for the -mm kernel and the problem did not manifest itself there.
So it seems that mainline is fine, but we need to find the bug before
it propagates from mm to mainline.

I'll test a -mm kernel without slab debug/poison now to see if it goes Oops.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 20:30 ` Jens Axboe
2006-03-06 20:33 ` Jens Axboe
@ 2006-03-06 21:41 ` Jesper Juhl
2006-03-06 21:55 ` Dave Jones
1 sibling, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 21:41 UTC (permalink / raw)
To: Jens Axboe
Cc: Linus Torvalds, Linux Kernel Mailing List, Andrew Morton, markhe,
Andrea Arcangeli, Mike Christie, James Bottomley

On 3/6/06, Jens Axboe <axboe@suse.de> wrote:
[...snip...]
>
> I don't see how it could be, honestly, we would gladly oops in locally
> close places if that was the case. If you disable slab debug/poison, do
> you get a nice NULL pointer dereference instead? There have been some
> reports on a NULL queue for sr devices as of lately, I wonder if some
> SCSI change recently was broken.
>

I just built and booted a plain 2.6.16-rc5-mm2 with the same config
I've used previously, but with the following options disabled :

CONFIG_DEBUG_SLAB
CONFIG_PAGE_OWNER
CONFIG_DEBUG_VM
CONFIG_DEBUG_PAGEALLOC

The resulting kernel boots and runs just fine (no Oops) and leaves
nothing in dmesg.
So, without the debugging options it appears to the user that
everything is OK - nasty.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 21:41 ` Jesper Juhl
@ 2006-03-06 21:55 ` Dave Jones
2006-03-06 21:57 ` Jesper Juhl
2006-03-09 15:50 ` Martin J. Bligh
0 siblings, 2 replies; 53+ messages in thread
From: Dave Jones @ 2006-03-06 21:55 UTC (permalink / raw)
To: Jesper Juhl
Cc: Jens Axboe, Linus Torvalds, Linux Kernel Mailing List, Andrew Morton,
markhe, Andrea Arcangeli, Mike Christie, James Bottomley

On Mon, Mar 06, 2006 at 10:41:07PM +0100, Jesper Juhl wrote:
> CONFIG_DEBUG_SLAB
> CONFIG_PAGE_OWNER
> CONFIG_DEBUG_VM
> CONFIG_DEBUG_PAGEALLOC
>
> The resulting kernel boots and runs just fine (no Oops) and leaves
> nothing in dmesg.
> So, without the debugging options it appears to the user that
> everything is OK - nasty.

DEBUG_PAGEALLOC in particular is *fantastic* at making bugs hide.
I've lost many an hour trying to pin bugs down due to that.

Dave

--
http://www.codemonkey.org.uk

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 21:55 ` Dave Jones
@ 2006-03-06 21:57 ` Jesper Juhl
2006-03-09 15:50 ` Martin J. Bligh
1 sibling, 0 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 21:57 UTC (permalink / raw)
To: Dave Jones, Jesper Juhl, Jens Axboe, Linus Torvalds,
Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley

On 3/6/06, Dave Jones <davej@redhat.com> wrote:
> On Mon, Mar 06, 2006 at 10:41:07PM +0100, Jesper Juhl wrote:
>
> > CONFIG_DEBUG_SLAB
> > CONFIG_PAGE_OWNER
> > CONFIG_DEBUG_VM
> > CONFIG_DEBUG_PAGEALLOC
> >
> > The resulting kernel boots and runs just fine (no Oops) and leaves
> > nothing in dmesg.
> > So, without the debugging options it appears to the user that
> > everything is OK - nasty.
>
> DEBUG_PAGEALLOC in particular is *fantastic* at making bugs hide.
> I've lost many an hour trying to pin bugs down due to that.
>

Well, in this case, turning the option *off* hides the bug ;)

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 21:55 ` Dave Jones
2006-03-06 21:57 ` Jesper Juhl
@ 2006-03-09 15:50 ` Martin J. Bligh
2006-03-09 15:54 ` Martin J. Bligh
` (2 more replies)
1 sibling, 3 replies; 53+ messages in thread
From: Martin J. Bligh @ 2006-03-09 15:50 UTC (permalink / raw)
To: Dave Jones
Cc: Jesper Juhl, Jens Axboe, Linus Torvalds, Linux Kernel Mailing List,
Andrew Morton, markhe, Andrea Arcangeli, Mike Christie, James Bottomley

Dave Jones wrote:
> On Mon, Mar 06, 2006 at 10:41:07PM +0100, Jesper Juhl wrote:
>
> > CONFIG_DEBUG_SLAB
> > CONFIG_PAGE_OWNER
> > CONFIG_DEBUG_VM
> > CONFIG_DEBUG_PAGEALLOC
> >
> > The resulting kernel boots and runs just fine (no Oops) and leaves
> > nothing in dmesg.
> > So, without the debugging options it appears to the user that
> > everything is OK - nasty.
>
> DEBUG_PAGEALLOC in particular is *fantastic* at making bugs hide.
> I've lost many an hour trying to pin bugs down due to that.

Is this backwards? We're saying DEBUG_PAGEALLOC is bad?

OK, what I'm going to try to do, given the recent comments re
CONFIG_DEBUG_SLAB and also PAGEALLOC is to arrange with Andy to run
a debug kernel as well as a normal kernel for every test, and then
we can publish the results on http://test.kernel.org. For now it'll be
a separate matrix, until I work out how to fold the 3d cube nicely
into 2d - I know I have to do that anyway, so no big deal.

Do we NOT want to have DEBUG_SLAB and DEBUG_PAGEALLOC both enabled?
Running multiple permutations is going to get really painful on the
systems involved. Any other requests for what gets enabled (I really
want to just stick to one 'debug' setup if possible).

I have no idea why I didn't do this a year ago <slaps self>.

M.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-09 15:50 ` Martin J. Bligh
@ 2006-03-09 15:54 ` Martin J. Bligh
2006-03-09 15:54 ` Benjamin LaHaise
2006-03-09 16:08 ` Linus Torvalds
2 siblings, 0 replies; 53+ messages in thread
From: Martin J. Bligh @ 2006-03-09 15:54 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Dave Jones, Jesper Juhl, Jens Axboe, Linus Torvalds,
Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley

>> DEBUG_PAGEALLOC in particular is *fantastic* at making bugs hide.
>> I've lost many an hour trying to pin bugs down due to that.
>
>
> Is this backwards? We're saying DEBUG_PAGEALLOC is bad?
>
> OK, what I'm going to try to do, given the recent comments re
> CONFIG_DEBUG_SLAB and also PAGEALLOC is to arrange with Andy to run
> a debug kernel as well as a normal kernel for every test, and then
> we can publish the results on http://test.kernel.org. For now it'll be
> a separate matrix, until I work out how to fold the 3d cube nicely
> into 2d - I know I have to do that anyway, so no big deal.
>
> Do we NOT want to have DEBUG_SLAB and DEBUG_PAGEALLOC both enabled?
> Running multiple permutations is going to get really painful on the
> systems involved. Any other requests for what gets enabled (I really
> want to just stick to one 'debug' setup if possible).
>
> I have no idea why I didn't do this a year ago <slaps self>.

Oh, and I ported CONFIG_DEBUG_PAGEALLOC to x86_64 last week. Will send
out the patch later today.

M.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-09 15:50 ` Martin J. Bligh
2006-03-09 15:54 ` Martin J. Bligh
@ 2006-03-09 15:54 ` Benjamin LaHaise
2006-03-09 16:04 ` Martin J. Bligh
2006-03-09 16:08 ` Linus Torvalds
2 siblings, 1 reply; 53+ messages in thread
From: Benjamin LaHaise @ 2006-03-09 15:54 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Dave Jones, Jesper Juhl, Jens Axboe, Linus Torvalds,
Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley

On Thu, Mar 09, 2006 at 07:50:15AM -0800, Martin J. Bligh wrote:
> Do we NOT want to have DEBUG_SLAB and DEBUG_PAGEALLOC both enabled?
> Running multiple permutations is going to get really painful on the
> systems involved. Any other requests for what gets enabled (I really
> want to just stick to one 'debug' setup if possible).

Debug kernels are incredibly slow, making hitting certain races next to
impossible. By all means non-DEBUG kernels should definitely be getting
tested.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-09 15:54 ` Benjamin LaHaise
@ 2006-03-09 16:04 ` Martin J. Bligh
0 siblings, 0 replies; 53+ messages in thread
From: Martin J. Bligh @ 2006-03-09 16:04 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: Dave Jones, Jesper Juhl, Jens Axboe, Linus Torvalds,
Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley

Benjamin LaHaise wrote:
> On Thu, Mar 09, 2006 at 07:50:15AM -0800, Martin J. Bligh wrote:
>
>>Do we NOT want to have DEBUG_SLAB and DEBUG_PAGEALLOC both enabled?
>>Running multiple permutations is going to get really painful on the
>>systems involved. Any other requests for what gets enabled (I really
>>want to just stick to one 'debug' setup if possible).
>
>
> Debug kernels are incredibly slow, making hitting certain races next to
> impossible. By all means non-DEBUG kernels should definitely be getting
> tested.

It'd still run a totally non-debug kernel as well - I want that for the
perf tests etc anyway. I guess the question is whether the debug kernels
should have most of the DEBUG_* turned on, or just a select few.

M.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-09 15:50 ` Martin J. Bligh
2006-03-09 15:54 ` Martin J. Bligh
2006-03-09 15:54 ` Benjamin LaHaise
@ 2006-03-09 16:08 ` Linus Torvalds
2006-03-09 16:41 ` Dave Jones
2 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2006-03-09 16:08 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Dave Jones, Jesper Juhl, Jens Axboe, Linux Kernel Mailing List,
Andrew Morton, markhe, Andrea Arcangeli, Mike Christie, James Bottomley

On Thu, 9 Mar 2006, Martin J. Bligh wrote:
> >
> > DEBUG_PAGEALLOC in particular is *fantastic* at making bugs hide.
> > I've lost many an hour trying to pin bugs down due to that.
>
> Is this backwards? We're saying DEBUG_PAGEALLOC is bad?

DEBUG_PAGEALLOC is great for finding the really stupid kinds of bugs, and
it's definitely worth doing every once in a while.

However, DEBUG_PAGEALLOC makes many things orders of magnitude slower, and
it eats memory like mad (because it turns some slabs into whole pages -
but it still doesn't help small allocation debugging that much). So unlike
DEBUG_SLAB, it's not reasonable to have it on all the time.

IOW, DEBUG_SLAB is something that a distro kernel can reasonably enable
for users by default (I think fedora-devel does, for example). In
contrast, DEBUG_PAGEALLOC is more of a "useful for special cases" thing,
where you want to validate that there's nothing _obviously_ bad going on.

> Do we NOT want to have DEBUG_SLAB and DEBUG_PAGEALLOC both enabled?

I suspect that once DEBUG_PAGEALLOC is on, whether you do DEBUG_SLAB or
not is a toss-up. The interesting cases tend to be

 - neither: usable for benchmarking
 - DEBUG_SLAB: perfectly usable for normal work
 - DEBUG_PAGEALLOC (with or without DEBUG_SLAB): debugging tool only

At least that's my opinion, maybe others have other experiences.

Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-09 16:08 ` Linus Torvalds
@ 2006-03-09 16:41 ` Dave Jones
0 siblings, 0 replies; 53+ messages in thread
From: Dave Jones @ 2006-03-09 16:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin J. Bligh, Jesper Juhl, Jens Axboe, Linux Kernel Mailing List,
Andrew Morton, markhe, Andrea Arcangeli, Mike Christie, James Bottomley

On Thu, Mar 09, 2006 at 08:08:52AM -0800, Linus Torvalds wrote:

> IOW, DEBUG_SLAB is something that a distro kernel can reasonably enable
> for users by default (I think fedora-devel does, for example).

Correct.

> - neither: usable for benchmarking
> - DEBUG_SLAB: perfectly usable for normal work
> - DEBUG_PAGEALLOC (with or without DEBUG_SLAB): debugging tool only
>
> At least that's my opinion, maybe others have other experiences.

That's pretty much my experience. I get people screaming at me
when I enable PAGEALLOC for a day or so in the Fedora-devel kernel
if I'm chasing something :)

Dave

--
http://www.codemonkey.org.uk

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 20:06 ` Linus Torvalds
2006-03-06 20:24 ` Jesper Juhl
@ 2006-03-06 20:36 ` Jesper Juhl
2006-03-06 20:53 ` Jesper Juhl
1 sibling, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 20:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe

On Monday 06 March 2006 21:06, Linus Torvalds wrote:
>
> On Mon, 6 Mar 2006, Linus Torvalds wrote:
> >
> > So it's either an aic7xxx bug, or it's generic SCSI.
> >
> > Considering that we've had other slab corruption issues (the reason I was
> > looking closely at yours), generic SCSI isn't out of the question.
> >
> > If you were a git user, doing a bisection run would be useful since you
> > seem to be able to recreate it at will. Oh, well. Testing that one patch
> > would still help.
>
> Hmm.. This appended patch may or may not help.
>
> It overwrites the SCSI command "req" pointer when the request has been
> done. The request cannot be used afterwards, so anybody accessing it would
> be a bug. I think.
>

With the retry code removed and your req poisoning patch on top I just got this :

Slab corruption: start=f727c5a8, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934db>](sr_do_ioctl+0x11b/0x270)
000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f727c55c, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813ee>](free_fdtable_rcu+0x6e/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=f727c5f4, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813ee>](free_fdtable_rcu+0x6e/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=f727c5a8, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934db>](sr_do_ioctl+0x11b/0x270)
000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f727c55c, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813ee>](free_fdtable_rcu+0x6e/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=f727c5f4, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813ee>](free_fdtable_rcu+0x6e/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

and another, probably unrelated, thing I just noticed in my dmesg output:

initcall at 0xc0428240: init_hpet_clocksource+0x0/0x90(): returned with error code -19

> HOWEVER. I noticed something else strange. Your slab corruption report
> says
>
> 	Slab corruption: start=f72948a0, len=64
> 	Redzone: 0x5a2cf071/0x5a2cf071.
> 	Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
> 	...
>
> and the scary thing is that "len=64".
>
> The thing is, SCSI uses "SCSI_SENSE_BUFFERSIZE" to determine the maximum
> sense size to copy, and what do we have, if not
>
> 	include/scsi/scsi_cmnd.h:#define SCSI_SENSE_BUFFERSIZE 96
>
> ie a 64-byte buffer is simply TOO DAMN SMALL!
>
> Now, the thing is, the 64 comes from "sizeof(struct request_sense)", which
> is what "struct packet_command *" uses. We can change that sizeof() to
> just use SCSI_SENSE_BUFFERSIZE, but that still makes me worry about

Building a kernel with that change on top of the other ones atm.

/ Jesper

^ permalink raw reply [flat|nested] 53+ messages in thread
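The first 16 bytes of each corrupted object in the report above do not look like the 0x6b free-poison pattern visible in the neighbouring objects; they look like fixed-format SCSI sense data, which would fit a sense buffer being written after it was freed. That reading is an interpretation, not something the report itself states. A minimal userspace sketch, assuming the standard fixed-format layout (response code in byte 0, sense key in the low nibble of byte 2, ASC and ASCQ in bytes 12 and 13); `struct sense_summary`, `decode_fixed_sense()` and `sense_key_name()` are names made up for illustration, not kernel functions:

```c
#include <stddef.h>
#include <stdint.h>

/* Fields pulled out of a fixed-format sense block (hypothetical helper type). */
struct sense_summary {
	uint8_t response_code;	/* byte 0, low 7 bits: 0x70 current, 0x71 deferred */
	uint8_t sense_key;	/* byte 2, low 4 bits */
	uint8_t asc;		/* byte 12: additional sense code */
	uint8_t ascq;		/* byte 13: additional sense code qualifier */
};

/* Returns 0 on success, -1 if the buffer is too short or not fixed format. */
static int decode_fixed_sense(const uint8_t *buf, size_t len,
			      struct sense_summary *out)
{
	uint8_t rc;

	if (len < 14)
		return -1;
	rc = buf[0] & 0x7f;
	if (rc != 0x70 && rc != 0x71)
		return -1;
	out->response_code = rc;
	out->sense_key = buf[2] & 0x0f;
	out->asc = buf[12];
	out->ascq = buf[13];
	return 0;
}

/* Names only for the two sense keys that occur in the dumps above. */
static const char *sense_key_name(uint8_t key)
{
	switch (key) {
	case 0x2: return "NOT READY";
	case 0x5: return "ILLEGAL REQUEST";
	default:  return "(other)";
	}
}
```

Fed the "000:" rows of the two corrupted objects, this yields sense key 0x2 (NOT READY) with ASC/ASCQ 0x3a/0x01 for the first and sense key 0x5 (ILLEGAL REQUEST) with ASC 0x24 for the second, i.e. what a CD-ROM drive with no disc loaded would plausibly answer.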
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 20:36 ` Jesper Juhl
@ 2006-03-06 20:53 ` Jesper Juhl
2006-03-06 20:56 ` Jesper Juhl
0 siblings, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 20:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe

On 3/6/06, Jesper Juhl <jesper.juhl@gmail.com> wrote:
> On Monday 06 March 2006 21:06, Linus Torvalds wrote:
>
> <...snip...>
> > and the scary thing is that "len=64".
> >
> > The thing is, SCSI uses "SCSI_SENSE_BUFFERSIZE" to determine the maximum
> > sense size to copy, and what do we have, if not
> >
> > include/scsi/scsi_cmnd.h:#define SCSI_SENSE_BUFFERSIZE 96
> >
> > ie a 64-byte buffer is simply TOO DAMN SMALL!
> >
> > Now, the thing is, the 64 comes from "sizeof(struct request_sense)", which
> > is what "struct packet_command *" uses. We can change that sizeof() to
> > just use SCSI_SENSE_BUFFERSIZE, but that still makes me worry about
>
> Building a kernel with that change on top of the other ones atm.
>

Changing the sizeof() to SCSI_SENSE_BUFFERSIZE doesn't fix it :

Slab corruption: start=f79da5a8, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934db>](sr_do_ioctl+0x11b/0x270)
000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f79da55c, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c0158918>](__vmalloc_node+0x68/0x80)
000: d0 1e 1e c3 18 1f 1e c3 60 1f 1e c3 a8 1f 1e c3
010: f0 1f 1e c3 38 20 1e c3 80 20 1e c3 c8 20 1e c3
Next obj: start=f79da5f4, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c0173923>](real_lookup+0x93/0xe0)
000: 6c 69 62 62 6f 6f 73 74 5f 70 72 67 5f 65 78 65
010: 63 5f 6d 6f 6e 69 74 6f 72 2d 67 63 63 2d 6d 74
Slab corruption: start=f79da5a8, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934db>](sr_do_ioctl+0x11b/0x270)
000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f79da55c, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c0158918>](__vmalloc_node+0x68/0x80)
000: d0 1e 1e c3 18 1f 1e c3 60 1f 1e c3 a8 1f 1e c3
010: f0 1f 1e c3 38 20 1e c3 80 20 1e c3 c8 20 1e c3
Next obj: start=f79da5f4, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c0173923>](real_lookup+0x93/0xe0)
000: 6c 69 62 62 6f 6f 73 74 5f 70 72 67 5f 65 78 65
010: 63 5f 6d 6f 6e 69 74 6f 72 2d 67 63 63 2d 6d 74

I'll now go test the things Jens suggested. Expect more feedback shortly.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 20:53 ` Jesper Juhl
@ 2006-03-06 20:56 ` Jesper Juhl
2006-03-06 21:07 ` Linus Torvalds
0 siblings, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 20:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe

On 3/6/06, Jesper Juhl <jesper.juhl@gmail.com> wrote:
> On 3/6/06, Jesper Juhl <jesper.juhl@gmail.com> wrote:
> > On Monday 06 March 2006 21:06, Linus Torvalds wrote:
> >
> > <...snip...>
> > > and the scary thing is that "len=64".
> > >
> > > The thing is, SCSI uses "SCSI_SENSE_BUFFERSIZE" to determine the maximum
> > > sense size to copy, and what do we have, if not
> > >
> > > include/scsi/scsi_cmnd.h:#define SCSI_SENSE_BUFFERSIZE 96
> > >
> > > ie a 64-byte buffer is simply TOO DAMN SMALL!
> > >
> > > Now, the thing is, the 64 comes from "sizeof(struct request_sense)", which
> > > is what "struct packet_command *" uses. We can change that sizeof() to
> > > just use SCSI_SENSE_BUFFERSIZE, but that still makes me worry about
> >
> > Building a kernel with that change on top of the other ones atm.
> >
> Changing the sizeof() to SCSI_SENSE_BUFFERSIZE doesn't fix it :
>
> Slab corruption: start=f79da5a8, len=64

Hmm, is it just me or should that len= have read len=96 ???

This is the change I made :

--- linux-2.6.16-rc5-mm2/block/scsi_ioctl.c~	2006-03-06 21:43:56.000000000 +0100
+++ linux-2.6.16-rc5-mm2/block/scsi_ioctl.c	2006-03-06 21:43:56.000000000 +0100
@@ -568,7 +568,7 @@ int scsi_cmd_ioctl(struct file *file, st
 			hdr.dxferp = cgc.buffer;
 			hdr.sbp = cgc.sense;
 			if (hdr.sbp)
-				hdr.mx_sb_len = sizeof(struct request_sense);
+				hdr.mx_sb_len = SCSI_SENSE_BUFFERSIZE;
 			hdr.timeout = cgc.timeout;
 			hdr.cmdp = ((struct cdrom_generic_command __user*) arg)->cmd;
 			hdr.cmd_len = sizeof(cgc.cmd);

did I mess up?

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 20:56 ` Jesper Juhl
@ 2006-03-06 21:07 ` Linus Torvalds
2006-03-06 21:16 ` Jesper Juhl
2006-03-06 21:54 ` Jesper Juhl
0 siblings, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2006-03-06 21:07 UTC (permalink / raw)
To: Jesper Juhl
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe

On Mon, 6 Mar 2006, Jesper Juhl wrote:
>
> Hmm, is it just me or should that len= have read len=96 ???
>
> This is the change I made :
>
> --- linux-2.6.16-rc5-mm2/block/scsi_ioctl.c~	2006-03-06
> 21:43:56.000000000 +0100
> +++ linux-2.6.16-rc5-mm2/block/scsi_ioctl.c	2006-03-06
> 21:43:56.000000000 +0100
> @@ -568,7 +568,7 @@ int scsi_cmd_ioctl(struct file *file, st
>  			hdr.dxferp = cgc.buffer;
>  			hdr.sbp = cgc.sense;
>  			if (hdr.sbp)
> -				hdr.mx_sb_len = sizeof(struct request_sense);
> +				hdr.mx_sb_len = SCSI_SENSE_BUFFERSIZE;
>  			hdr.timeout = cgc.timeout;
>  			hdr.cmdp = ((struct cdrom_generic_command __user*) arg)->cmd;
>  			hdr.cmd_len = sizeof(cgc.cmd);
>
> did I mess up?

That's not the one to change. It's the one in "sr_do_ioctl()", where it
uses "sizeof(*sense)".

Linus

----
diff --git a/drivers/scsi/sr_ioctl.c b/drivers/scsi/sr_ioctl.c
index 5d02ff4..b65462f 100644
--- a/drivers/scsi/sr_ioctl.c
+++ b/drivers/scsi/sr_ioctl.c
@@ -192,7 +192,7 @@ int sr_do_ioctl(Scsi_CD *cd, struct pack
 	SDev = cd->device;
 
 	if (!sense) {
-		sense = kmalloc(sizeof(*sense), GFP_KERNEL);
+		sense = kmalloc(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL);
 		if (!sense) {
 			err = -ENOMEM;
 			goto out;

^ permalink raw reply related [flat|nested] 53+ messages in thread
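The worry behind the patch above is a size mismatch: the producer may hand over up to SCSI_SENSE_BUFFERSIZE (96) bytes of sense data, while the buffer allocated with sizeof(*sense) holds only 64. The patch resolves that by enlarging the allocation. The other generic way to close such a gap is to clamp the copy to the destination size. A minimal userspace sketch of that clamping, which is not what the patch does; `struct toy_sense`, `SENSE_BOUND` and `copy_sense()` are hypothetical names, and the 64-byte struct merely stands in for the size quoted in the thread:

```c
#include <stddef.h>
#include <string.h>

#define SENSE_BOUND 96	/* mirrors the SCSI_SENSE_BUFFERSIZE value quoted above */

/* Hypothetical stand-in for a 64-byte sense structure. */
struct toy_sense {
	unsigned char bytes[64];
};

/*
 * Copy sense data into dst without ever writing more than dst_size bytes,
 * whatever length the producer claims. Returns the number of bytes copied.
 */
static size_t copy_sense(void *dst, size_t dst_size,
			 const void *src, size_t src_len)
{
	size_t n = src_len < dst_size ? src_len : dst_size;

	memcpy(dst, src, n);
	return n;
}
```

Called with a 96-byte source and a `struct toy_sense` destination, `copy_sense()` copies 64 bytes and leaves whatever follows the destination untouched, at the price of truncating the sense data.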
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 21:07 ` Linus Torvalds
@ 2006-03-06 21:16 ` Jesper Juhl
2006-03-06 21:54 ` Jesper Juhl
1 sibling, 0 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 21:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe

On 3/6/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 6 Mar 2006, Jesper Juhl wrote:
> >
> > Hmm, is it just me or should that len= have read len=96 ???
> >
> > This is the change I made :
> >
> > --- linux-2.6.16-rc5-mm2/block/scsi_ioctl.c~	2006-03-06
> > 21:43:56.000000000 +0100
> > +++ linux-2.6.16-rc5-mm2/block/scsi_ioctl.c	2006-03-06
> > 21:43:56.000000000 +0100
> > @@ -568,7 +568,7 @@ int scsi_cmd_ioctl(struct file *file, st
> >  			hdr.dxferp = cgc.buffer;
> >  			hdr.sbp = cgc.sense;
> >  			if (hdr.sbp)
> > -				hdr.mx_sb_len = sizeof(struct request_sense);
> > +				hdr.mx_sb_len = SCSI_SENSE_BUFFERSIZE;
> >  			hdr.timeout = cgc.timeout;
> >  			hdr.cmdp = ((struct cdrom_generic_command __user*) arg)->cmd;
> >  			hdr.cmd_len = sizeof(cgc.cmd);
> >
> > did I mess up?
>
> That's not the one to change. It's the one in "sr_do_ioctl()", where it
> uses "sizeof(*sense)".
>

Ahh, so I did mess up - whoops - I just grep'ed for "sizeof(struct
request_sense)" :-(

I'll try it again (with the correct change) in a moment, after I've
tested Jens's "does no slab poison/debug make it go Oops" question...

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 21:07 ` Linus Torvalds
2006-03-06 21:16 ` Jesper Juhl
@ 2006-03-06 21:54 ` Jesper Juhl
2006-03-06 22:05 ` Andrew Morton
2006-03-06 22:17 ` Linus Torvalds
1 sibling, 2 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 21:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe

On 3/6/06, Linus Torvalds <torvalds@osdl.org> wrote:
> [.snip.]
>
> That's not the one to change. It's the one in "sr_do_ioctl()", where it
> uses "sizeof(*sense)".
>
> Linus
>
> ----
> diff --git a/drivers/scsi/sr_ioctl.c b/drivers/scsi/sr_ioctl.c
> index 5d02ff4..b65462f 100644
> --- a/drivers/scsi/sr_ioctl.c
> +++ b/drivers/scsi/sr_ioctl.c
> @@ -192,7 +192,7 @@ int sr_do_ioctl(Scsi_CD *cd, struct pack
>  	SDev = cd->device;
>
>  	if (!sense) {
> -		sense = kmalloc(sizeof(*sense), GFP_KERNEL);
> +		sense = kmalloc(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL);
>  		if (!sense) {
>  			err = -ENOMEM;
>  			goto out;
>

Ok, booting a plain 2.6.16-rc5-mm2 kernel with the above being the
only change made results in this :

Slab corruption: start=f4f6a11c, len=128
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f4f6a090, len=128
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c01f4a26>](alloc_as_io_context+0x16/0xd0)
000: 01 00 00 00 00 00 00 00 ad 4e ad de ff ff ff ff
010: ff ff ff ff b0 49 1f c0 c0 49 1f c0 07 00 00 00
Next obj: start=f4f6a1a8, len=128
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c01f4a26>](alloc_as_io_context+0x16/0xd0)
000: 01 00 00 00 00 00 00 00 ad 4e ad de ff ff ff ff
010: ff ff ff ff b0 49 1f c0 c0 49 1f c0 07 00 00 00
Slab corruption: start=f4f6a11c, len=128
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f4f6a090, len=128
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c01f4a26>](alloc_as_io_context+0x16/0xd0)
000: 01 00 00 00 00 00 00 00 ad 4e ad de ff ff ff ff
010: ff ff ff ff b0 49 1f c0 c0 49 1f c0 07 00 00 00
Next obj: start=f4f6a1a8, len=128
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c01f4a26>](alloc_as_io_context+0x16/0xd0)
000: 01 00 00 00 00 00 00 00 ad 4e ad de ff ff ff ff
010: ff ff ff ff b0 49 1f c0 c0 49 1f c0 07 00 00 00

Where do we go from here ?

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 21:54 ` Jesper Juhl
@ 2006-03-06 22:05 ` Andrew Morton
2006-03-06 22:08 ` Jesper Juhl
2006-03-06 22:17 ` Linus Torvalds
1 sibling, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2006-03-06 22:05 UTC (permalink / raw)
To: Jesper Juhl
Cc: torvalds, linux-kernel, markhe, andrea, michaelc, James.Bottomley,
axboe

"Jesper Juhl" <jesper.juhl@gmail.com> wrote:
>
> Where do we go from here ?
>

If you can test just

	2.6.16-rc5 + linus.patch + git-scsi-misc.patch

then we'd have a clearer idea.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 22:05 ` Andrew Morton
@ 2006-03-06 22:08 ` Jesper Juhl
2006-03-06 22:27 ` Jesper Juhl
0 siblings, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 22:08 UTC (permalink / raw)
To: Andrew Morton
Cc: torvalds, linux-kernel, markhe, andrea, michaelc, James.Bottomley,
axboe

On 3/6/06, Andrew Morton <akpm@osdl.org> wrote:
> "Jesper Juhl" <jesper.juhl@gmail.com> wrote:
> >
> > Where do we go from here ?
> >
>
> If you can test just
>
> 	2.6.16-rc5 + linus.patch + git-scsi-misc.patch
>
> then we'd have a clearer idea.
>

Sure, I'll get right on it.
I'll post the results in 15min or so.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 22:08 ` Jesper Juhl
@ 2006-03-06 22:27 ` Jesper Juhl
0 siblings, 0 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 22:27 UTC (permalink / raw)
To: Andrew Morton
Cc: torvalds, linux-kernel, markhe, andrea, michaelc, James.Bottomley,
axboe

On 3/6/06, Jesper Juhl <jesper.juhl@gmail.com> wrote:
> On 3/6/06, Andrew Morton <akpm@osdl.org> wrote:
> > "Jesper Juhl" <jesper.juhl@gmail.com> wrote:
> > >
> > > Where do we go from here ?
> > >
> >
> > If you can test just
> >
> > 	2.6.16-rc5 + linus.patch + git-scsi-misc.patch
> >
> > then we'd have a clearer idea.
> >
> Sure, I'll get right on it.
> I'll post the results in 15min or so.
>

Ok, a plain 2.6.15-rc5 + linus.patch + git-scsi-misc.patch results in this :

Slab corruption: start=f4812d14, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c028f61b>](sr_do_ioctl+0x11b/0x270)
000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f4812cc8, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<00000000>](_stext+0x3feffd68/0x8)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=f4812d60, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c02367ef>](init_dev+0x5cf/0x630)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Slab corruption: start=f4812d14, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c028f61b>](sr_do_ioctl+0x11b/0x270)
000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f4812cc8, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<00000000>](_stext+0x3feffd68/0x8)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=f4812d60, len=64
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c02367ef>](init_dev+0x5cf/0x630)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 21:54 ` Jesper Juhl
2006-03-06 22:05 ` Andrew Morton
@ 2006-03-06 22:17 ` Linus Torvalds
2006-03-06 22:34 ` Linus Torvalds
2006-03-06 22:44 ` Jesper Juhl
1 sibling, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2006-03-06 22:17 UTC (permalink / raw)
To: Jesper Juhl
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe

On Mon, 6 Mar 2006, Jesper Juhl wrote:
>
> Ok, booting a plain 2.6.16-rc5-mm2 kernel with the above being the
> only change made results in this :

Yeah. I'm not surprised. A real mode-sense shouldn't be even 64 bytes,
much less 96, so it shouldn't have overflowed, and we had no indication of
the corruption spreading past the one allocation anyway. It did/does seem
a bug, though, so worth checking.

So onward in our tireless battle. Does this patch make any difference for
you? It does two things:

 - it clears the "->sense" buffer in blk_end_sync_rq() (since it won't be
   valid any more: the request is gone)

 - it adds a BUG_ON() if we appear to have already done the sense fill on
   SCSI IO completion, and do it again.

Now, I've not tried either of these, and the BUG_ON() in particular might
be a false positive itself, but it might be worth testing.

Linus

---
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 03d9c82..4351d34 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -2637,6 +2637,7 @@ void blk_end_sync_rq(struct request *rq,
 	struct completion *waiting = rq->waiting;
 
 	rq->waiting = NULL;
+	rq->sense = NULL;
 	__blk_put_request(rq->q, rq);
 
 	/*
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 701a328..2b60769 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -961,6 +961,10 @@ void scsi_io_completion(struct scsi_cmnd
 	if (result) {
 		clear_errors = 0;
 		if (sense_valid && req->sense) {
+
+			/* Have we already filled the sense buffer? */
+			BUG_ON(req->sense_len);
+
 			/*
 			 * SG_IO wants current and deferred errors
 			 */

^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 22:17 ` Linus Torvalds
@ 2006-03-06 22:34 ` Linus Torvalds
2006-03-06 22:52 ` Jesper Juhl
` (2 more replies)
2006-03-06 22:44 ` Jesper Juhl
1 sibling, 3 replies; 53+ messages in thread
From: Linus Torvalds @ 2006-03-06 22:34 UTC (permalink / raw)
To: Jesper Juhl
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe, Pekka Enberg

Ok,
 I have a new favorite suspect.

It is this one: commit 4d268eba1187ef66844a6a33b9431e5d0dadd4ad:

    [PATCH] slab: extract slab order calculation to separate function

    This patch moves the ugly loop that determines the 'optimal' size (page
    order) of cache slabs from kmem_cache_create() to a separate function and
    cleans it up a bit.

    Thanks to Matthew Wilcox for the help with this patch.

    Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

and I think it may be broken.

In particular, as far as I can tell, that

+		/* More than offslab_limit objects will cause problems */
+		if (flags & CFLGS_OFF_SLAB && cachep->num > offslab_limit)
+			break;

has been incorrectly translated for several reasons:

 - we shouldn't check "cachep->num > offslab_limit". We should check just
   "num > offslab_limit" (cachep->num is the _previous_ number we tested).

 - when we do "break", we've already incremented "gfporder", and we should
   correct for that.

Now, maybe I'm just off my rocker again (I've certainly been batting 0.000
so far, even if I think I've been finding real bugs). So who knows. But I
get the feeling that that patch is broken.

Either revert it, or try this (TOTALLY UNTESTED!!!) patch..

And hey, maybe I'm just crazy.

		Linus

----
diff --git a/mm/slab.c b/mm/slab.c
index 2b0b151..1cca41d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1628,25 +1628,22 @@ static inline size_t calculate_slab_orde
 		size_t size, size_t align, unsigned long flags)
 {
 	size_t left_over = 0;
+	int gfporder;
 
-	for (;; cachep->gfporder++) {
+	for (gfporder = 0 ; gfporder < MAX_GFP_ORDER; gfporder++) {
 		unsigned int num;
 		size_t remainder;
 
-		if (cachep->gfporder > MAX_GFP_ORDER) {
-			cachep->num = 0;
-			break;
-		}
-
-		cache_estimate(cachep->gfporder, size, align, flags,
-			       &remainder, &num);
+		cache_estimate(gfporder, size, align, flags, &remainder, &num);
 		if (!num)
 			continue;
+
 		/* More than offslab_limit objects will cause problems */
-		if (flags & CFLGS_OFF_SLAB && cachep->num > offslab_limit)
+		if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit)
 			break;
 
 		cachep->num = num;
+		cachep->gfporder = gfporder;
 		left_over = remainder;
 
 		/*

^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2 2006-03-06 22:34 ` Linus Torvalds @ 2006-03-06 22:52 ` Jesper Juhl 2006-03-06 22:54 ` Linus Torvalds 2006-03-07 19:28 ` Bill Davidsen 2 siblings, 0 replies; 53+ messages in thread From: Jesper Juhl @ 2006-03-06 22:52 UTC (permalink / raw) To: Linus Torvalds Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli, Mike Christie, James Bottomley, Jens Axboe, Pekka Enberg On 3/6/06, Linus Torvalds <torvalds@osdl.org> wrote: > > Ok, > I have a new favorite suspect. > Heh, you are coming up with stuff to test faster than I can build & boot kernels ;-) Which is good, we'll get to the bottom of this all the faster :) > It is this one: commit 4d268eba1187ef66844a6a33b9431e5d0dadd4ad: > [--snip--] > > Now, maybe I'm just off my rocker again (I've certainly been batting 0.000 > so far, even if I think I've been finding real bugs). So who knows. But I > get the feeling that that patch is broken. > > Either revert it, or try this (TOTALLY UNTESTED!!!) patch.. > Hmm, that patch doesn't apply at all to 2.6.16-rc5-mm2 :/ patching file mm/slab.c Hunk #1 FAILED at 1628. 
1 out of 1 hunk FAILED -- saving rejects to file mm/slab.c.rej $ cat mm/slab.c.rej *************** *** 1628,1649 **** size_t size, size_t align, unsigned long flags) { size_t left_over = 0; - int gfporder; - for (gfporder = 0 ; gfporder < MAX_GFP_ORDER; gfporder++) { unsigned int num; size_t remainder; - cache_estimate(gfporder, size, align, flags, &remainder, &num); if (!num) continue; - /* More than offslab_limit objects will cause problems */ - if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit) break; cachep->num = num; - cachep->gfporder = gfporder; left_over = remainder; /* --- 1628,1652 ---- size_t size, size_t align, unsigned long flags) { size_t left_over = 0; + for (;; cachep->gfporder++) { unsigned int num; size_t remainder; + if (cachep->gfporder > MAX_GFP_ORDER) { + cachep->num = 0; + break; + } + + cache_estimate(cachep->gfporder, size, align, flags, + &remainder, &num); if (!num) continue; /* More than offslab_limit objects will cause problems */ + if (flags & CFLGS_OFF_SLAB && cachep->num > offslab_limit) break; cachep->num = num; left_over = remainder; /* > And hey, maybe I'm just crazy. > Somehow I don't think that's the core problem here ;) -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 22:34 ` Linus Torvalds
2006-03-06 22:52 ` Jesper Juhl
@ 2006-03-06 22:54 ` Linus Torvalds
2006-03-06 23:01 ` Jesper Juhl
2006-03-07 19:28 ` Bill Davidsen
2 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2006-03-06 22:54 UTC (permalink / raw)
To: Jesper Juhl
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe, Pekka Enberg

On Mon, 6 Mar 2006, Linus Torvalds wrote:
>
> Either revert it, or try this (TOTALLY UNTESTED!!!) patch..

Don't even bother with the untested patch.

> +	for (gfporder = 0 ; gfporder < MAX_GFP_ORDER; gfporder++) {

At a minimum, this "<" needs to be "<=".

After that, it might even work. Not that I can convince me that the test
for "offslab_limit" ever even triggers, so..

		Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
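The two complaints about the translated loop (testing the previous iteration's count, and leaving "gfporder" one step ahead on "break"), together with the "<=" correction, can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: struct cache_sketch, estimate_num(), SKETCH_MAX_ORDER and SKETCH_PAGE_SIZE are hypothetical stand-ins, and the real calculate_slab_order() has further exit conditions that the quoted hunk does not show. The sketch simply keeps the largest order whose object count stays within the limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's MAX_GFP_ORDER and PAGE_SIZE. */
#define SKETCH_MAX_ORDER 5u
#define SKETCH_PAGE_SIZE 4096u

/* Cut-down, hypothetical version of the two cache fields the loop updates. */
struct cache_sketch {
	unsigned int num;      /* objects per slab for the accepted order */
	unsigned int gfporder; /* accepted page order */
};

/* Simplified estimate: objects of `size` bytes that fit in 2^order pages. */
static unsigned int estimate_num(unsigned int order, size_t size)
{
	assert(size > 0); /* the sketch divides by size */
	return (unsigned int)(((size_t)SKETCH_PAGE_SIZE << order) / size);
}

/* Mirrors the suspect loop: the limit test reads the stale c->num. */
static void pick_order_stale(struct cache_sketch *c, size_t size,
			     unsigned int limit)
{
	c->num = 0;
	for (c->gfporder = 0; c->gfporder <= SKETCH_MAX_ORDER; c->gfporder++) {
		unsigned int num = estimate_num(c->gfporder, size);

		if (!num)
			continue;
		if (c->num > limit) /* previous iteration's count */
			break;
		c->num = num;
	}
}

/* Mirrors the proposed fix, including the later "<=" correction. */
static void pick_order_fixed(struct cache_sketch *c, size_t size,
			     unsigned int limit)
{
	unsigned int gfporder;

	c->num = 0;
	c->gfporder = 0;
	for (gfporder = 0; gfporder <= SKETCH_MAX_ORDER; gfporder++) {
		unsigned int num = estimate_num(gfporder, size);

		if (!num)
			continue;
		if (num > limit) /* this iteration's count */
			break;
		/* Record count and order together, only once accepted. */
		c->num = num;
		c->gfporder = gfporder;
	}
}
```

With 512-byte objects and a limit of 20, the stale version ends with num = 32 at order 3 (a count over the limit, paired with an order one step past the one it was computed for), while the fixed version ends with num = 16 at order 1. With a limit that never triggers, the fixed loop reaches order 5, which a "<" bound would have skipped.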
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 22:54 ` Linus Torvalds
@ 2006-03-06 23:01 ` Jesper Juhl
2006-03-06 23:06 ` Andrew Morton
0 siblings, 1 reply; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 23:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe, Pekka Enberg

On 3/6/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 6 Mar 2006, Linus Torvalds wrote:
> >
> > Either revert it, or try this (TOTALLY UNTESTED!!!) patch..
>
> Don't even bother with the untested patch.
>
> > +	for (gfporder = 0 ; gfporder < MAX_GFP_ORDER; gfporder++) {
>
> At a minimum, this "<" needs to be "<=".
>
> After that, it might even work. Not that I can convince me that the test
> for "offslab_limit" ever even triggers, so..
>
Ehh, it's getting pretty clear that you are looking at
2.6.16-rc5-git<latest> and I'm using -mm here, since that code is not
present in mm/slab.c in 2.6.16-rc5-mm2 in anything near that form.

And since 2.6.16-rc5-git8 is not experiencing problems I'd suggest you
perhaps instead take a look at what's in -mm... That's where we need
to work (it seems) to find the bug...

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2 2006-03-06 23:01 ` Jesper Juhl @ 2006-03-06 23:06 ` Andrew Morton 2006-03-06 23:24 ` Jesper Juhl 0 siblings, 1 reply; 53+ messages in thread From: Andrew Morton @ 2006-03-06 23:06 UTC (permalink / raw) To: Jesper Juhl Cc: torvalds, linux-kernel, markhe, andrea, michaelc, James.Bottomley, axboe, penberg "Jesper Juhl" <jesper.juhl@gmail.com> wrote: > > And since 2.6.16-rc5-git8 is not experiencing problems I'd suggest you > perhaps instead take a look at what's in -mm... That's where we need > to work (it seems) to find the bug... Yes, it's very probably something in git-scsi-misc. drivers/block/cciss.c | 3 drivers/message/fusion/Kconfig | 1 drivers/message/fusion/mptbase.c | 72 - drivers/message/fusion/mptbase.h | 14 drivers/message/fusion/mptfc.c | 4 drivers/message/fusion/mptlan.c | 5 drivers/message/fusion/mptsas.c | 196 ++ drivers/message/fusion/mptscsih.c | 2402 +----------------------------------- drivers/message/fusion/mptscsih.h | 11 drivers/message/fusion/mptspi.c | 733 ++++++++++ drivers/scsi/53c700.c | 18 drivers/scsi/aacraid/aacraid.h | 5 drivers/scsi/aacraid/comminit.c | 1 drivers/scsi/aacraid/commsup.c | 14 drivers/scsi/aacraid/linit.c | 14 drivers/scsi/aha152x.c | 7 drivers/scsi/aic7xxx/aic79xx_core.c | 24 drivers/scsi/aic7xxx/aic7xxx_core.c | 24 drivers/scsi/aic7xxx/aic7xxx_osm.c | 45 drivers/scsi/aic7xxx/aic7xxx_osm.h | 5 drivers/scsi/hosts.c | 3 drivers/scsi/ipr.c | 109 + drivers/scsi/ips.c | 2 drivers/scsi/jazz_esp.c | 19 drivers/scsi/lpfc/lpfc.h | 40 drivers/scsi/lpfc/lpfc_attr.c | 162 +- drivers/scsi/lpfc/lpfc_crtn.h | 28 drivers/scsi/lpfc/lpfc_ct.c | 74 - drivers/scsi/lpfc/lpfc_disc.h | 19 drivers/scsi/lpfc/lpfc_els.c | 772 +++++++---- drivers/scsi/lpfc/lpfc_hbadisc.c | 492 +++---- drivers/scsi/lpfc/lpfc_hw.h | 65 drivers/scsi/lpfc/lpfc_init.c | 265 ++- drivers/scsi/lpfc/lpfc_mbox.c | 33 drivers/scsi/lpfc/lpfc_nportdisc.c | 374 +++-- drivers/scsi/lpfc/lpfc_scsi.c | 25 drivers/scsi/lpfc/lpfc_scsi.h | 5 
drivers/scsi/lpfc/lpfc_sli.c | 385 +++-- drivers/scsi/lpfc/lpfc_sli.h | 5 drivers/scsi/lpfc/lpfc_version.h | 6 drivers/scsi/ncr53c8xx.c | 127 - drivers/scsi/ncr53c8xx.h | 37 drivers/scsi/osst.c | 24 drivers/scsi/qla2xxx/qla_def.h | 6 drivers/scsi/qla2xxx/qla_gbl.h | 2 drivers/scsi/qla2xxx/qla_isr.c | 7 drivers/scsi/qla2xxx/qla_mbx.c | 4 drivers/scsi/qla2xxx/qla_os.c | 89 - drivers/scsi/qla2xxx/qla_sup.c | 4 drivers/scsi/scsi.c | 6 drivers/scsi/scsi_debug.c | 9 drivers/scsi/scsi_ioctl.c | 3 drivers/scsi/scsi_lib.c | 76 - drivers/scsi/scsi_scan.c | 100 - drivers/scsi/scsi_sysfs.c | 4 drivers/scsi/scsi_transport_fc.c | 9 drivers/scsi/scsi_transport_iscsi.c | 3 drivers/scsi/scsi_transport_sas.c | 258 +++ drivers/scsi/scsi_transport_spi.c | 83 - drivers/scsi/sd.c | 11 drivers/scsi/sg.c | 16 drivers/scsi/sr.c | 5 drivers/scsi/sr_ioctl.c | 6 drivers/scsi/st.c | 29 drivers/scsi/sym53c8xx_2/sym_hipd.c | 53 include/linux/workqueue.h | 6 include/scsi/scsi.h | 2 include/scsi/scsi_cmnd.h | 20 include/scsi/scsi_device.h | 16 include/scsi/scsi_transport_sas.h | 22 include/scsi/scsi_transport_spi.h | 4 kernel/workqueue.c | 29 72 files changed, 3444 insertions(+), 4107 deletions(-) ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 23:06 ` Andrew Morton
@ 2006-03-06 23:24 ` Jesper Juhl
2006-03-07 0:17 ` Linus Torvalds
2006-03-07 3:15 ` Mike Christie
0 siblings, 2 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 23:24 UTC (permalink / raw)
To: Andrew Morton
Cc: torvalds, linux-kernel, markhe, andrea, michaelc, James.Bottomley,
axboe, penberg

On 3/7/06, Andrew Morton <akpm@osdl.org> wrote:
> "Jesper Juhl" <jesper.juhl@gmail.com> wrote:
> >
> > And since 2.6.16-rc5-git8 is not experiencing problems I'd suggest you
> > perhaps instead take a look at what's in -mm... That's where we need
> > to work (it seems) to find the bug...
>
> Yes, it's very probably something in git-scsi-misc.
>
I would say that's correct. I just built 2.6.16-rc5-mm2 with just
git-scsi-misc.patch reverted, and that makes the problem go away.

So now the big question is; what part(s) of git-scsi-misc is broken?

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 23:24 ` Jesper Juhl
@ 2006-03-07 0:17 ` Linus Torvalds
2006-03-07 0:25 ` Jesper Juhl
2006-03-07 3:15 ` Mike Christie
1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2006-03-07 0:17 UTC (permalink / raw)
To: Jesper Juhl
Cc: Andrew Morton, linux-kernel, markhe, andrea, michaelc,
James.Bottomley, axboe, penberg

On Tue, 7 Mar 2006, Jesper Juhl wrote:
>
> On 3/7/06, Andrew Morton <akpm@osdl.org> wrote:
> > "Jesper Juhl" <jesper.juhl@gmail.com> wrote:
> > >
> > > And since 2.6.16-rc5-git8 is not experiencing problems I'd suggest you
> > > perhaps instead take a look at what's in -mm... That's where we need
> > > to work (it seems) to find the bug...
> >
> > Yes, it's very probably something in git-scsi-misc.
> >
> I would say that's correct. I just built 2.6.16-rc5-mm2 with just
> git-scsi-misc.patch reverted, and that makes the problem go away.

Ok. I was kind of hoping that it was just a more reliable case of the
corruption that Andrew had been seeing too (which seems to be hard to
trigger in mainline too, but might exist there).

> So now the big question is; what part(s) of git-scsi-misc is broken?

Well, its origin is actually a git tree, so you could try the "git bisect"
approach using the

	git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git

tree that the patch comes from..

		Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-07 0:17 ` Linus Torvalds
@ 2006-03-07 0:25 ` Jesper Juhl
0 siblings, 0 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-07 0:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, linux-kernel, markhe, andrea, michaelc,
James.Bottomley, axboe, penberg

On 3/7/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Tue, 7 Mar 2006, Jesper Juhl wrote:
> >
> > On 3/7/06, Andrew Morton <akpm@osdl.org> wrote:
> > > "Jesper Juhl" <jesper.juhl@gmail.com> wrote:
> > > >
> > > > And since 2.6.16-rc5-git8 is not experiencing problems I'd suggest you
> > > > perhaps instead take a look at what's in -mm... That's where we need
> > > > to work (it seems) to find the bug...
> > >
> > > Yes, it's very probably something in git-scsi-misc.
> > >
> > I would say that's correct. I just built 2.6.16-rc5-mm2 with just
> > git-scsi-misc.patch reverted, and that makes the problem go away.
>
> Ok. I was kind of hoping that it was just a more reliable case of the
> corruption that Andrew had been seeing too (which seems to be hard to
> trigger in mainline too, but might exist there).
>
> > So now the big question is; what part(s) of git-scsi-misc is broken?
>
> Well, its origin is actually a git tree, so you could try the "git bisect"
> approach using the
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
>
> tree that the patch comes from..
>
I'll give that a go tomorrow - right now I need to get some sleep.

If there are other things to try, then just drop me a mail and I'll
test it tomorrow.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 23:24 ` Jesper Juhl
2006-03-07 0:17 ` Linus Torvalds
@ 2006-03-07 3:15 ` Mike Christie
2006-03-07 3:20 ` Linus Torvalds
1 sibling, 1 reply; 53+ messages in thread
From: Mike Christie @ 2006-03-07 3:15 UTC (permalink / raw)
To: Jesper Juhl
Cc: Andrew Morton, torvalds, linux-kernel, markhe, andrea,
James.Bottomley, axboe, penberg

Jesper Juhl wrote:
> On 3/7/06, Andrew Morton <akpm@osdl.org> wrote:
>
>>"Jesper Juhl" <jesper.juhl@gmail.com> wrote:
>>
>>>And since 2.6.16-rc5-git8 is not experiencing problems I'd suggest you
>>> perhaps instead take a look at what's in -mm... That's where we need
>>> to work (it seems) to find the bug...
>>
>>Yes, it's very probably something in git-scsi-misc.
>>
>
> I would say that's correct. I just built 2.6.16-rc5-mm2 with just
> git-scsi-misc.patch reverted, and that makes the problem go away.
>
> So now the big question is; what part(s) of git-scsi-misc is broken?
>

Is it related to this change?

diff --git a/drivers/scsi/sr_ioctl.c b/drivers/scsi/sr_ioctl.c
index 5d02ff4..03fbc4b 100644
--- a/drivers/scsi/sr_ioctl.c
+++ b/drivers/scsi/sr_ioctl.c
@@ -44,11 +44,10 @@ static int sr_read_tochdr(struct cdrom_d
 	int result;
 	unsigned char *buffer;
 
-	buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
+	buffer = kzalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
 	if (!buffer)
 		return -ENOMEM;
 
-	memset(&cgc, 0, sizeof(struct packet_command));
 	cgc.timeout = IOCTL_TIMEOUT;
 	cgc.cmd[0] = GPCMD_READ_TOC_PMA_ATIP;
 	cgc.cmd[8] = 12;		/* LSB of length */
@@ -74,11 +73,10 @@ static int sr_read_tocentry(struct cdrom
 	int result;
 	unsigned char *buffer;
 
-	buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
+	buffer = kzalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
 	if (!buffer)
 		return -ENOMEM;
 
-	memset(&cgc, 0, sizeof(struct packet_command));


When someone converted the *buffer* allocation to kzalloc they
also removed the memset for the *packet_command* struct.

The

memset(&cgc, 0, sizeof(struct packet_command));

should be added back I think.

^ permalink raw reply related [flat|nested] 53+ messages in thread
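The observation that the kzalloc conversion zeroes the wrong object can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the driver code: struct pc_sketch is a hypothetical cut-down stand-in for struct packet_command, calloc/malloc stand in for kzalloc/kmalloc, the 0xA5 fill is only a stand-in for whatever an uninitialised stack slot might hold, and the timeout value is made up. How the lower layers then misuse the leftover fields is not spelled out in the thread, so the sketch only shows that the fields stay unzeroed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for struct packet_command. */
struct pc_sketch {
	unsigned char cmd[12];
	unsigned char *buffer;
	unsigned int buflen;
	void *sense; /* lower layers may write through this when non-NULL */
	int quiet;
	int timeout;
};

/* Stand-in for whatever an uninitialised stack slot happens to hold. */
static void fill_with_garbage(struct pc_sketch *pc)
{
	assert(pc != NULL);
	memset(pc, 0xA5, sizeof(*pc));
}

/* Fields the caller sets explicitly in both variants. */
static void fill_command(struct pc_sketch *pc, unsigned char *buffer)
{
	pc->timeout = 30;  /* hypothetical value */
	pc->cmd[0] = 0x43; /* READ TOC/PMA/ATIP opcode */
	pc->cmd[8] = 12;   /* LSB of length */
	pc->buffer = buffer;
	pc->buflen = 12;
}

/* Shape of the converted code: the buffer is zeroed, the struct is not. */
static int setup_without_memset(struct pc_sketch *pc)
{
	unsigned char *buffer = calloc(1, 32); /* userspace analogue of kzalloc */

	if (!buffer)
		return -1;
	fill_command(pc, buffer);
	return 0;
}

/* Shape of the reverted code: plain allocation plus memset of the struct. */
static int setup_with_memset(struct pc_sketch *pc)
{
	unsigned char *buffer = malloc(32); /* userspace analogue of kmalloc */

	if (!buffer)
		return -1;
	memset(pc, 0, sizeof(*pc)); /* the line the conversion dropped */
	fill_command(pc, buffer);
	return 0;
}
```

After setup_without_memset() every field that is not assigned explicitly (cmd[1], quiet, sense, and so on) still holds the fill pattern, whereas setup_with_memset() leaves them zero. Zeroing the 32-byte buffer does nothing for the struct, which is a separate object.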
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-07 3:15 ` Mike Christie
@ 2006-03-07 3:20 ` Linus Torvalds
2006-03-07 18:01 ` James Bottomley
0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2006-03-07 3:20 UTC (permalink / raw)
To: Mike Christie
Cc: Jesper Juhl, Andrew Morton, linux-kernel, markhe, andrea,
James.Bottomley, axboe, penberg

On Mon, 6 Mar 2006, Mike Christie wrote:
> -	buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
> +	buffer = kzalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
>  	if (!buffer)
>  		return -ENOMEM;
>
> -	memset(&cgc, 0, sizeof(struct packet_command));
...
> -	buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
> +	buffer = kzalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
>  	if (!buffer)
>  		return -ENOMEM;
>
> -	memset(&cgc, 0, sizeof(struct packet_command));

> When someone converted the *buffer* allocation to kzalloc they
> also removed the memset for the *packet_command* struct.
>
> The
>
> memset(&cgc, 0, sizeof(struct packet_command));
>
> should be added back I think.

Good eyes. I bet that's it.

		Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-07 3:20 ` Linus Torvalds
@ 2006-03-07 18:01 ` James Bottomley
2006-03-07 19:40 ` Jesper Juhl
0 siblings, 1 reply; 53+ messages in thread
From: James Bottomley @ 2006-03-07 18:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mike Christie, Jesper Juhl, Andrew Morton, linux-kernel, markhe,
andrea, axboe, penberg

On Mon, 2006-03-06 at 19:20 -0800, Linus Torvalds wrote:
> > should be added back I think.
>
> Good eyes. I bet that's it.

Yes, well done. Do we have confirmation yet that reversing this fixes
the bug?

I think a full reversal is in order, since buffer is a quantity being
written to, there's no point in zeroing it.

[Note to self: must do better in checking janitors patches]

James

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-07 18:01 ` James Bottomley
@ 2006-03-07 19:40 ` Jesper Juhl
0 siblings, 0 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-07 19:40 UTC (permalink / raw)
To: James Bottomley
Cc: Linus Torvalds, Mike Christie, Andrew Morton, linux-kernel, markhe,
andrea, axboe, penberg

On 3/7/06, James Bottomley <James.Bottomley@steeleye.com> wrote:
> On Mon, 2006-03-06 at 19:20 -0800, Linus Torvalds wrote:
> > > should be added back I think.
> >
> > Good eyes. I bet that's it.
>
> Yes, well done. Do we have confirmation yet that reversing this fixes
> the bug?
>
I just tried reverting that bit only from 2.6.16-rc5-mm2, and it does
indeed fix the problem.
Thanks for spotting that Mike.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 22:34 ` Linus Torvalds
2006-03-06 22:52 ` Jesper Juhl
2006-03-06 22:54 ` Linus Torvalds
@ 2006-03-07 19:28 ` Bill Davidsen
2 siblings, 0 replies; 53+ messages in thread
From: Bill Davidsen @ 2006-03-07 19:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe, Pekka Enberg

Linus Torvalds wrote:

> has been incorrectly translated for several reasons:
>
> - we shouldn't check "cachep->num > offslab_limit". We should check just
> "num > offslab_limit" (cachep->num is the _previous_ number we tested).
>
> - when we do "break", we've already incremented "gfporder", and we should
> correct for that.
>
> Now, maybe I'm just off my rocker again (I've certainly been batting 0.000
> so far, even if I think I've been finding real bugs). So who knows. But I
> get the feeling that that patch is broken.

I thought stumbling over bugs while looking for other things was part of
the new development model ;-)
>
> Either revert it, or try this (TOTALLY UNTESTED!!!) patch..
>
> And hey, maybe I'm just crazy.

Being crazy and being right are not mutually exclusive.

--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 22:17 ` Linus Torvalds
2006-03-06 22:34 ` Linus Torvalds
@ 2006-03-06 22:44 ` Jesper Juhl
1 sibling, 0 replies; 53+ messages in thread
From: Jesper Juhl @ 2006-03-06 22:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Andrew Morton, markhe, Andrea Arcangeli,
Mike Christie, James Bottomley, Jens Axboe

On 3/6/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 6 Mar 2006, Jesper Juhl wrote:
> >
> > Ok, booting a plain 2.6.16-rc5-mm2 kernel with the above being the
> > only change made results in this :
>
> Yeah. I'm not surprised. A real mode-sense shouldn't be even 64 bytes,
> much less 96, so it shouldn't have overflowed, and we had no indication of
> the corruption spreading past the one allocation anyway.
>
> It did/does seem a bug, though, so worth checking.
>
Well, hopefully the SCSI people can take a look at that as a separate issue...

> So onward in our tireless battle. Does this patch make any difference for
> you? It does two things:
>
> - it clears the "->sense" buffer in blk_end_sync_rq() (since it won't be
> valid any more: the request is gone)
> - it adds a BUG_ON() if we appear to have already done the sense fill on
> SCSI IO completion, and do it again.
>
> Now, I've not tried either of these, and the BUG_ON() in particular might
> be a false positive itself, but it might be worth testing.
>

Unfortunately this doesn't seem to make a difference :

Slab corruption: start=f70c0770, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934fb>](sr_do_ioctl+0x11b/0x270)
000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f70c0724, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813e6>](free_fdtable_rcu+0x66/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=f70c07bc, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813ee>](free_fdtable_rcu+0x6e/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=f70c0770, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02934fb>](sr_do_ioctl+0x11b/0x270)
000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=f70c0724, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813e6>](free_fdtable_rcu+0x66/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=f70c07bc, len=64
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01813ee>](free_fdtable_rcu+0x6e/0x150)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

^ permalink raw reply [flat|nested] 53+ messages in thread
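The first 16 bytes of each corrupted object in the dumps above look like fixed-format SCSI sense data (leading byte 0x70, sense key in byte 2, additional sense code and qualifier in bytes 12 and 13). That reading is an inference from the byte layout, not something the message states. A minimal decoder sketch, with struct sense_sketch and decode_fixed_sense() as hypothetical names introduced only for this illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of the fields this sketch extracts. */
struct sense_sketch {
	int valid;               /* 1 if the bytes parsed as fixed-format sense */
	unsigned char sense_key; /* low nibble of byte 2 */
	unsigned char asc;       /* additional sense code, byte 12 */
	unsigned char ascq;      /* additional sense code qualifier, byte 13 */
};

/* Decode fixed-format sense data (response code 0x70 or 0x71). */
static struct sense_sketch decode_fixed_sense(const unsigned char *buf,
					      size_t len)
{
	struct sense_sketch s = { 0, 0, 0, 0 };
	unsigned char response;

	assert(buf != NULL);
	if (len < 14) /* need bytes 0..13 */
		return s;
	response = buf[0] & 0x7f; /* top bit is the VALID flag, not the code */
	if (response != 0x70 && response != 0x71)
		return s;
	s.sense_key = buf[2] & 0x0f;
	s.asc = buf[12];
	s.ascq = buf[13];
	s.valid = 1;
	return s;
}
```

Applied to the two dumps, this yields sense key 0x2 with ASC/ASCQ 0x3a/0x01 for the first and sense key 0x5 with ASC/ASCQ 0x24/0x00 for the second. In the standard SCSI tables those correspond to NOT READY (medium not present, tray closed) and ILLEGAL REQUEST (invalid field in CDB), which would be consistent with sense data being written into an object after it was freed.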
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 18:25 ` Linus Torvalds
2006-03-06 18:43 ` Jesper Juhl
@ 2006-03-06 18:48 ` Mike Christie
2006-03-06 18:49 ` Mike Christie
1 sibling, 1 reply; 53+ messages in thread
From: Mike Christie @ 2006-03-06 18:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jesper Juhl, Linux Kernel Mailing List, Andrew Morton, markhe,
Andrea Arcangeli, James Bottomley

Linus Torvalds wrote:
>
> James, Mike, can you double-check the retries? In particular, it's _wrong_
> to retry after you've already marked a command completed with
> "complete(rq->waiting)", so if that happens somewhere, things are really
> broken.
>

I am looking into it. I think it has something to do with the request
getting completed too early or maybe something crazy like twice. This
looks like a similar problem that was reported to linux-scsi where for
some tape setup the request's bio gets freed twice.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Slab corruption in 2.6.16-rc5-mm2
2006-03-06 18:48 ` Mike Christie
@ 2006-03-06 18:49 ` Mike Christie
0 siblings, 0 replies; 53+ messages in thread
From: Mike Christie @ 2006-03-06 18:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jesper Juhl, Linux Kernel Mailing List, Andrew Morton, markhe,
Andrea Arcangeli, James Bottomley

Mike Christie wrote:
> Linus Torvalds wrote:
>
>>
>> James, Mike, can you double-check the retries? In particular, it's
>> _wrong_ to retry after you've already marked a command completed with
>> "complete(rq->waiting)", so if that happens somewhere, things are
>> really broken.
>>
>
> I am looking into it. I think it has something to do with the request
> getting completed too early or maybe something crazy like twice. This
> looks like a similar problem that was reported to linux-scsi

Oh yeah here is that thread
http://marc.theaimsgroup.com/?l=linux-scsi&m=114127615918030&w=2

^ permalink raw reply [flat|nested] 53+ messages in thread