tg3: unable to handle null pointer dereference [Re: Linux 2.6.21-rc6]

netdev.vger.kernel.org archive mirror

* tg3: unable to handle null pointer dereference [Re: Linux 2.6.21-rc6]
       [not found] <Pine.LNX.4.64.0704051944230.6730@woody.linux-foundation.org>
@ 2007-04-06 21:40 ` Nishanth Aravamudan
  2007-04-06 22:57   ` Michael Chan
  2007-04-14  0:36 ` [1/3] 2.6.21-rc6: known regressions Adrian Bunk
  1 sibling, 1 reply; 14+ messages in thread
From: Nishanth Aravamudan @ 2007-04-06 21:40 UTC (permalink / raw)
  To: mchan; +Cc: Linus Torvalds, LKML, netdev

On 05.04.2007 [19:50:11 -0700], Linus Torvalds wrote:
> 
> Ok,
>  I don't think there really is anything very interesting here, but we're 
> hopefully whittling down the list of regressions, and fixing various 
> random other small issues while at it.
> 
> Some smallish MIPS updates, networking (and network driver) fixes, removal 
> of a long obsolete framebuffer driver, etc etc. The shortlog really tells 
> the story.
> 
> We should be getting close to a 2.6.21 release, so please update any 
> regression reports you've done,

2.6.21-rc5 is ok. 2.6.21-rc6 results in

[   14.241665] Unable to handle kernel NULL pointer dereference (address 0000000000000000)
[   14.250025] swapper[1]: Oops 11003706212352 [1]
[   14.254753] Modules linked in:
[   14.258046] 
[   14.258047] Pid: 1, CPU 7, comm:              swapper
[   14.264962] psr : 00001210084a6010 ifs : 8000000000000610 ip  : [<a000000100495371>]    Not tainted
[   14.274399] ip is at tg3_chip_reset+0xf1/0x12c0
[   14.279124] unat: 0000000000000000 pfs : 0000000000000610 rsc : 0000000000000003
[   14.286862] rnat: e000001005bc7d40 bsps: e000001005bc0000 pr  : 68105a9195655599
[   14.294598] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
[   14.302338] csd : 0000000000000000 ssd : 0000000000000000
[   14.307946] b0  : a0000001004952c0 b6  : a00000010038b2e0 b7  : a000000100486580
[   14.315688] f6  : 1003e000000054e304351 f7  : 1003e0000000000000640
[   14.322164] f8  : 1003e000000054e2dd251 f9  : 1003e0000000000000064
[   14.328643] f10 : 10015e7d113fff182eec0 f11 : 1003e000000000073e88a
[   14.335116] r1  : a000000100d4be30 r2  : a000000100b68fc0 r3  : a000000100b68eb0
[   14.342851] r8  : 0000000000000000 r9  : 0000000000000200 r10 : a00000010089d1a8
[   14.350597] r11 : a000000100486580 r12 : e000001005bc7d70 r13 : e000001005bc0000
[   14.358332] r14 : 0000000000000002 r15 : e000001005d08f10 r16 : e000001005d08ee0
[   14.366072] r17 : e000001005d08748 r18 : e000001005d08758 r19 : 0000000000000000
[   14.373815] r20 : e000001005d08748 r21 : 0000000000000000 r22 : 0000000040027401
[   14.381557] r23 : 0000000000027401 r24 : 0000000040000000 r25 : a00000010089d2f0
[   14.389293] r26 : a000000100b5b5c0 r27 : 0000000000000000 r28 : 0000000000000000
[   14.397035] r29 : 0000000000000000 r30 : 0000000000000000 r31 : e000001005d08708
[   14.404847] 
[   14.404848] Call Trace:
[   14.409160]  [<a000000100013900>] show_stack+0x80/0xa0
[   14.409162]                                 sp=e000001005bc7900 bsp=e000001005bc1120
[   14.422595]  [<a0000001000141f0>] show_regs+0x870/0x8a0
[   14.422597]                                 sp=e000001005bc7ad0 bsp=e000001005bc10c8
[   14.436128]  [<a0000001000390f0>] die+0x1b0/0x320
[   14.436131]                                 sp=e000001005bc7ad0 bsp=e000001005bc1080
[   14.449182]  [<a000000100730980>] ia64_do_page_fault+0xa00/0xba0
[   14.449185]                                 sp=e000001005bc7af0 bsp=e000001005bc1028
[   14.463498]  [<a00000010000b760>] ia64_leave_kernel+0x0/0x280
[   14.463501]                                 sp=e000001005bc7ba0 bsp=e000001005bc1028
[   14.477553]  [<a000000100495370>] tg3_chip_reset+0xf0/0x12c0
[   14.477555]                                 sp=e000001005bc7d70 bsp=e000001005bc0fa0
[   14.491516]  [<a000000100496590>] tg3_halt+0x50/0xa0
[   14.491518]                                 sp=e000001005bc7d80 bsp=e000001005bc0f68
[   14.504796]  [<a0000001004a86a0>] tg3_init_one+0x1c80/0x3080
[   14.504799]                                 sp=e000001005bc7d80 bsp=e000001005bc0eb8
[   14.518796]  [<a000000100399c70>] pci_device_probe+0x1f0/0x2c0
[   14.518799]                                 sp=e000001005bc7dd0 bsp=e000001005bc0e70
[   14.532961]  [<a000000100466560>] really_probe+0x100/0x3a0
[   14.532963]                                 sp=e000001005bc7dd0 bsp=e000001005bc0e20
[   14.546745]  [<a0000001004669c0>] driver_probe_device+0x1c0/0x1e0
[   14.546747]                                 sp=e000001005bc7dd0 bsp=e000001005bc0de8
[   14.561148]  [<a000000100466c80>] __driver_attach+0xc0/0x160
[   14.561150]                                 sp=e000001005bc7dd0 bsp=e000001005bc0db0
[   14.575108]  [<a000000100464bb0>] bus_for_each_dev+0xb0/0x120
[   14.575111]                                 sp=e000001005bc7dd0 bsp=e000001005bc0d78
[   14.589157]  [<a000000100466240>] driver_attach+0x40/0x60
[   14.589160]                                 sp=e000001005bc7df0 bsp=e000001005bc0d58
[   14.602862]  [<a000000100465330>] bus_add_driver+0xf0/0x3c0
[   14.602864]                                 sp=e000001005bc7df0 bsp=e000001005bc0d18
[   14.616736]  [<a000000100467180>] driver_register+0xe0/0x1a0
[   14.616738]                                 sp=e000001005bc7df0 bsp=e000001005bc0cf8
[   14.630696]  [<a00000010039a1a0>] __pci_register_driver+0x120/0x1a0
[   14.630699]                                 sp=e000001005bc7df0 bsp=e000001005bc0cc0
[   14.645280]  [<a0000001008e8270>] tg3_init+0x30/0x60
[   14.645283]                                 sp=e000001005bc7e00 bsp=e000001005bc0ca8
[   14.658540]  [<a0000001008a85f0>] init+0x390/0x740
[   14.658542]                                 sp=e000001005bc7e00 bsp=e000001005bc0c58
[   14.671627]  [<a0000001000113d0>] kernel_thread_helper+0xd0/0x100
[   14.671629]                                 sp=e000001005bc7e30 bsp=e000001005bc0c30
[   14.686023]  [<a000000100009140>] start_kernel_thread+0x20/0x40
[   14.686025]                                 sp=e000001005bc7e30 bsp=e000001005bc0c30
[   14.700284] Kernel panic - not syncing: Attempted to kill init!

on an 8-way IA64. I'm guessing it's one of these:

> Michael Chan (5):
>       [TG3]: Eliminate the unused TG3_FLAG_SPLIT_MODE flag.
>       [TG3]: Exit irq handler during chip reset.
>       [TG3]: Update version and reldate.

probably "Exit irq handler during chip reset"?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tg3: unable to handle null pointer dereference [Re: Linux 2.6.21-rc6]
  2007-04-06 21:40 ` tg3: unable to handle null pointer dereference [Re: Linux 2.6.21-rc6] Nishanth Aravamudan
@ 2007-04-06 22:57   ` Michael Chan
  2007-04-07  0:36     ` tg3: unable to handle null pointer dereference David Miller
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Chan @ 2007-04-06 22:57 UTC (permalink / raw)
  To: Nishanth Aravamudan, davem; +Cc: Linus Torvalds, LKML, netdev

On Fri, 2007-04-06 at 14:40 -0700, Nishanth Aravamudan wrote:

> 2.6.21-rc5 is ok. 2.6.21-rc6 results in
> 
> [   14.241665] Unable to handle kernel NULL pointer dereference (address 0000000000000000)

Sorry, I think this should fix it:

[TG3]: Fix crash during tg3_init_one().

The driver will crash when the chip has been initialized by EFI before
tg3_init_one().  In this case, the driver will call tg3_chip_reset()
before allocating consistent memory.

The bug is fixed by checking for tp->hw_status before accessing it
during tg3_chip_reset().

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 0acee9f..256969e 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -4834,8 +4834,10 @@ static int tg3_chip_reset(struct tg3 *tp)
 	 * sharing or irqpoll.
 	 */
 	tp->tg3_flags |= TG3_FLAG_CHIP_RESETTING;
-	tp->hw_status->status = 0;
-	tp->hw_status->status_tag = 0;
+	if (tp->hw_status) {
+		tp->hw_status->status = 0;
+		tp->hw_status->status_tag = 0;
+	}
 	tp->last_tag = 0;
 	smp_mb();
 	synchronize_irq(tp->pdev->irq);
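
The guard in the patch above can be shown in isolation. This is a minimal sketch, not the driver's code: `struct nic`, `struct hw_status_block` and `nic_chip_reset()` are hypothetical, cut-down stand-ins for `struct tg3`, its status block and `tg3_chip_reset()`. It assumes only what the commit message states, namely that the reset path can run before the status block has been allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the DMA status block the chip writes to. */
struct hw_status_block {
	uint32_t status;
	uint32_t status_tag;
};

/* Hypothetical stand-in for the per-device driver state. */
struct nic {
	struct hw_status_block *hw_status; /* NULL until consistent memory is allocated */
	uint32_t last_tag;
	int resetting;
};

/* Reset path that may be entered before the status block exists. */
static void nic_chip_reset(struct nic *n)
{
	assert(n != NULL);
	n->resetting = 1;
	if (n->hw_status) {	/* guard: a probe-time reset has no block yet */
		n->hw_status->status = 0;
		n->hw_status->status_tag = 0;
	}
	n->last_tag = 0;
}
```

Calling `nic_chip_reset()` on a zero-initialised `struct nic` (status block pointer still NULL) now only updates the fields owned by the structure itself; without the `if`, the same call would dereference NULL, which is the failure mode the oops reports.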





^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: tg3: unable to handle null pointer dereference
  2007-04-06 22:57   ` Michael Chan
@ 2007-04-07  0:36     ` David Miller
  2007-04-07  1:53       ` Nishanth Aravamudan
  0 siblings, 1 reply; 14+ messages in thread
From: David Miller @ 2007-04-07  0:36 UTC (permalink / raw)
  To: mchan; +Cc: nacc, torvalds, linux-kernel, netdev

From: "Michael Chan" <mchan@broadcom.com>
Date: Fri, 06 Apr 2007 15:57:13 -0700

> On Fri, 2007-04-06 at 14:40 -0700, Nishanth Aravamudan wrote:
> 
> > 2.6.21-rc5 is ok. 2.6.21-rc6 results in
> > 
> > [   14.241665] Unable to handle kernel NULL pointer dereference (address 0000000000000000)
> 
> Sorry, I think this should fix it:
> 
> [TG3]: Fix crash during tg3_init_one().
> 
> The driver will crash when the chip has been initialized by EFI before
> tg3_init_one().  In this case, the driver will call tg3_chip_reset()
> before allocating consistent memory.
> 
> The bug is fixed by checking for tp->hw_status before accessing it
> during tg3_chip_reset().
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied, thanks Michael.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: tg3: unable to handle null pointer dereference
  2007-04-07  0:36     ` tg3: unable to handle null pointer dereference David Miller
@ 2007-04-07  1:53       ` Nishanth Aravamudan
  0 siblings, 0 replies; 14+ messages in thread
From: Nishanth Aravamudan @ 2007-04-07  1:53 UTC (permalink / raw)
  To: David Miller; +Cc: mchan, torvalds, linux-kernel, netdev

On 06.04.2007 [17:36:00 -0700], David Miller wrote:
> From: "Michael Chan" <mchan@broadcom.com>
> Date: Fri, 06 Apr 2007 15:57:13 -0700
> 
> > On Fri, 2007-04-06 at 14:40 -0700, Nishanth Aravamudan wrote:
> > 
> > > 2.6.21-rc5 is ok. 2.6.21-rc6 results in
> > > 
> > > [   14.241665] Unable to handle kernel NULL pointer dereference (address 0000000000000000)
> > 
> > Sorry, I think this should fix it:
> > 
> > [TG3]: Fix crash during tg3_init_one().
> > 
> > The driver will crash when the chip has been initialized by EFI before
> > tg3_init_one().  In this case, the driver will call tg3_chip_reset()
> > before allocating consistent memory.
> > 
> > The bug is fixed by checking for tp->hw_status before accessing it
> > during tg3_chip_reset().
> > 
> > Signed-off-by: Michael Chan <mchan@broadcom.com>
> 
> Applied, thanks Michael.

FWIW, tested, no panic.

Tested-by: Nishanth Aravamudan <nacc@us.ibm.com>

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [1/3] 2.6.21-rc6: known regressions
       [not found] <Pine.LNX.4.64.0704051944230.6730@woody.linux-foundation.org>
  2007-04-06 21:40 ` tg3: unable to handle null pointer dereference [Re: Linux 2.6.21-rc6] Nishanth Aravamudan
@ 2007-04-14  0:36 ` Adrian Bunk
  2007-04-14  1:34   ` Linus Torvalds
  1 sibling, 1 reply; 14+ messages in thread
From: Adrian Bunk @ 2007-04-14  0:36 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: Linux Kernel Mailing List, Stephen Clark, jgarzik, linux-ide,
	Michal Jaegermann, lenb, linux-acpi, netdev, Dave Jones, cramerj,
	john.ronciak, jeffrey.t.kirsher, auke-jan.h.kok, e1000-devel,
	Ingo Molnar, Ayaz Abdulla

This email lists some known regressions in Linus' tree compared to 2.6.20.

If you find your name in the Cc header, you are either the submitter of one
of the bugs, the maintainer of an affected subsystem or driver, the author of
a patch that caused a breakage, or someone I consider possibly involved in
one or more of these issues in some other way.

Due to the large number of recipients, please trim the Cc when answering.


Subject    : ali_pata: boot from CD fails
References : http://lkml.org/lkml/2007/3/31/160
Submitter  : Stephen Clark <Stephen.Clark@seclark.us>
Status     : unknown


Subject    : kernels fail to boot with drives on ATIIXP controller
             (ACPI/IRQ related)
References : https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229621
             http://lkml.org/lkml/2007/3/4/257
Submitter  : Michal Jaegermann <michal@ellpspace.math.ualberta.ca>
Status     : unknown


Subject    : boot failure: rtl8139: exception in interrupt routine
References : http://lkml.org/lkml/2007/3/31/160
Submitter  : Stephen Clark <Stephen.Clark@seclark.us>
Status     : unknown


Subject    : laptops with e1000: lockups
References : https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603
Submitter  : Dave Jones <davej@redhat.com>
Handled-By : Jesse Brandeburg <jesse.brandeburg@intel.com>
Status     : problem is being debugged


Subject    : forcedeth: interface hangs under load
References : http://lkml.org/lkml/2007/4/3/39
Submitter  : Ingo Molnar <mingo@elte.hu>
Handled-By : Ingo Molnar <mingo@elte.hu>
             Ayaz Abdulla <aabdulla@nvidia.com>
Status     : problem is being debugged



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [1/3] 2.6.21-rc6: known regressions
  2007-04-14  0:36 ` [1/3] 2.6.21-rc6: known regressions Adrian Bunk
@ 2007-04-14  1:34   ` Linus Torvalds
  2007-04-14  1:49     ` Brandeburg, Jesse
                       ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Linus Torvalds @ 2007-04-14  1:34 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Ayaz Abdulla, e1000-devel, netdev, David S. Miller, Greg KH,
	Dave Jones, Andrew Morton, Jeff Garzik, Ingo Molnar

Note: Ingo also reports what looks like a memory corruption due to 
the 6b6b6b6b pattern on presumably the same box.

The 6b6b6b6b pattern is POISON_FREE, implying some kind of slab misuse, 
most likely a use-after-free, although possibly just due to overrunning a 
slab into the next one or something like that.
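
Why a 6b6b6b6b value points at freed memory can be sketched with a toy allocator. This is an illustration only, not the kernel's slab code: `toy_alloc()`, `toy_free()` and `looks_poisoned()` are hypothetical names, and the single static slot exists so that reading the object after `toy_free()` is well defined in this example. The one fact taken from the kernel is the poison byte, 0x6b.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_POISON_FREE 0x6b	/* byte value the kernel's POISON_FREE uses */

struct toy_obj {
	uint32_t refcount;
	uint32_t flags;
};

/* One-object "slab" in static storage (an assumption of this toy). */
static struct toy_obj toy_slot;

static struct toy_obj *toy_alloc(void)
{
	memset(&toy_slot, 0, sizeof(toy_slot));
	return &toy_slot;
}

/* Freeing overwrites the whole object with the poison byte. */
static void toy_free(struct toy_obj *obj)
{
	assert(obj != NULL);
	memset(obj, TOY_POISON_FREE, sizeof(*obj));
}

/* Returns 1 if a 32-bit value looks like it was read from a freed object. */
static int looks_poisoned(uint32_t value)
{
	return value == 0x6b6b6b6bu;
}
```

A stale pointer that is used after `toy_free()` reads 0x6b6b6b6b from every field, which is how such a value in a register dump or faulting address hints at a use-after-free rather than random corruption.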

What I'm leading up to is that I'm wondering if these mysterious network 
driver bugs aren't due to the network drivers themselves, but due to some 
higher-level problem. I think the hangs that Ingo sees with forcedeth were 
preceded by mysterious and "impossible" NULL pointer oopses. Ingo?

Davem - have there been network infrastructure changes that might be 
suspect? Jeff and/or Greg - anything in the generic network driver/device 
driver level? We had some trouble earlier with the transition to the 
driver core, and kref miscounting. Related? The last Oops Ingo saw was a 
module refcounting one, iirc.

It does seem networking related somehow. Yeah, it could obviously be a 
combination of independent bugs both in e1000/ and forcedeth drivers, but 
maybe there is something in common here...

I'll make an -rc7, but I'm a bit worried about the fact that I haven't 
actually gotten anything that looks like it might address any of this 
(unless some of the network patches I just pulled do, but that sounds 
unlikely). It's been very quiet. I don't like that. I don't get the 
feeling that we're making progress here, unlike the timer-related 
regressions which seem to all be slowly working themselves out..

So please people, give it a look. Comments?

		Linus

On Sat, 14 Apr 2007, Adrian Bunk wrote:
> 
> Subject    : laptops with e1000: lockups
> References : https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603
> Submitter  : Dave Jones <davej@redhat.com>
> Handled-By : Jesse Brandeburg <jesse.brandeburg@intel.com>
> Status     : problem is being debugged
> 
> Subject    : forcedeth: interface hangs under load
> References : http://lkml.org/lkml/2007/4/3/39
> Submitter  : Ingo Molnar <mingo@elte.hu>
> Handled-By : Ingo Molnar <mingo@elte.hu>
>              Ayaz Abdulla <aabdulla@nvidia.com>
> Status     : problem is being debugged


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [1/3] 2.6.21-rc6: known regressions
  2007-04-14  1:34   ` Linus Torvalds
@ 2007-04-14  1:49     ` Brandeburg, Jesse
  2007-04-14  4:25     ` David Miller
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Brandeburg, Jesse @ 2007-04-14  1:49 UTC (permalink / raw)
  To: Linus Torvalds, Adrian Bunk
  Cc: Ayaz Abdulla, e1000-devel, Greg KH, Ingo Molnar, netdev,
	Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller

> On Sat, 14 Apr 2007, Adrian Bunk wrote:
>> 
>> Subject    : laptops with e1000: lockups
>> References :
>> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603
>> Submitter  : Dave Jones <davej@redhat.com> 
>> Handled-By : Jesse Brandeburg <jesse.brandeburg@intel.com>
>> Status     : problem is being debugged
>> 
>> Subject    : forcedeth: interface hangs under load
>> References : http://lkml.org/lkml/2007/4/3/39
>> Submitter  : Ingo Molnar <mingo@elte.hu>
>> Handled-By : Ingo Molnar <mingo@elte.hu>
>>              Ayaz Abdulla <aabdulla@nvidia.com>
>> Status     : problem is being debugged

Linus Torvalds wrote:
> It does seem networking related somehow. Yeah, it could obviously
> be a combination of independent bugs both in e1000/ and forcedeth
> drivers, but maybe there is something in common here...
 
> So please people, give it a look. Comments?

I mentioned this in the bugzilla (229603 above), but we have at least
reproduced this here in our lab with e1000.  Some people were on
vacation this week, so the issue didn't progress (regardless of whether
this is e1000-specific, we will have some resources helping to report on
this next week).  So we're not sure yet whether this is an e1000 problem.
More soon; maybe I'll try to bisect between known-good and known-bad
kernels, as the problem is pretty quick to occur and didn't seem to be
present in 2.6.20.

Jesse

 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [1/3] 2.6.21-rc6: known regressions
  2007-04-14  1:34   ` Linus Torvalds
  2007-04-14  1:49     ` Brandeburg, Jesse
@ 2007-04-14  4:25     ` David Miller
  2007-04-14  5:07     ` Ian McDonald
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2007-04-14  4:25 UTC (permalink / raw)
  To: torvalds
  Cc: aabdulla, e1000-devel, netdev, bunk, greg, davej, akpm, jgarzik,
	mingo

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 13 Apr 2007 18:34:23 -0700 (PDT)

> Davem - have there been network infrastructure changes that might be 
> suspect? Jeff and/or Greg - anything in the generic network driver/device 
> driver level? We had some trouble earlier with the transition to the 
> driver core, and kref miscounting. Related? The last Oops Ingo saw was a 
> module refcounting one, iirc.

Nothing stands out in the recent changes I've merged, I'll study this
issue and see if I can see any pattern or a clue.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [1/3] 2.6.21-rc6: known regressions
  2007-04-14  1:34   ` Linus Torvalds
  2007-04-14  1:49     ` Brandeburg, Jesse
  2007-04-14  4:25     ` David Miller
@ 2007-04-14  5:07     ` Ian McDonald
  2007-04-14  5:29     ` David Miller
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Ian McDonald @ 2007-04-14  5:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Adrian Bunk, Andrew Morton, Jeff Garzik, netdev, e1000-devel,
	Ingo Molnar, Ayaz Abdulla, Dave Jones, David S. Miller, Greg KH

On 4/14/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Note: Ingo also reports what looks like a memory corruption due to
> the 6b6b6b6b pattern on presumably the same box.
>
> The 6b6b6b6b pattern is POISON_FREE, implying some kind of slab misuse,
> most likely a use-after-free, although possibly just due to overrunning a
> slab into the next one or something like that.
>
> What I'm leading up to is that I'm wondering if these mysterious network
> driver bugs aren't due to the network drivers themselves, but due to some
> higher-level problem. I think the hangs that Ingo sees with forcedeth were
> preceded by mysterious and "impossible" NULL pointer oopses. Ingo?
>
> Davem - have there been network infrastructure changes that might be
> suspect? Jeff and/or Greg - anything in the generic network driver/device
> driver level? We had some trouble earlier with the transition to the
> driver core, and kref miscounting. Related? The last Oops Ingo saw was a
> module refcounting one, iirc.
>
> It does seem networking related somehow. Yeah, it could obviously be a
> combination of independent bugs both in e1000/ and forcedeth drivers, but
> maybe there is something in common here...
>
I don't know if this is a red herring or not but I reported on March
13th slab corruption and it looked like file_free_rcu - these are
fairly recent changes I think (rcu)?

Anyway original message is at http://lkml.org/lkml/2007/3/12/364

My apologies if this is not related.

Ian
-- 
Web: http://wand.net.nz/~iam4/
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [1/3] 2.6.21-rc6: known regressions
  2007-04-14  1:34   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2007-04-14  5:07     ` Ian McDonald
@ 2007-04-14  5:29     ` David Miller
  2007-04-14  6:21     ` Ingo Molnar
  2007-04-20 13:46     ` Ingo Molnar
  5 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2007-04-14  5:29 UTC (permalink / raw)
  To: torvalds
  Cc: aabdulla, e1000-devel, netdev, bunk, greg, davej, akpm, jgarzik,
	mingo

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 13 Apr 2007 18:34:23 -0700 (PDT)

Let's see how related these two might actually be.

> On Sat, 14 Apr 2007, Adrian Bunk wrote:
> > 
> > Subject    : laptops with e1000: lockups
> > References : https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603
> > Submitter  : Dave Jones <davej@redhat.com>
> > Handled-By : Jesse Brandeburg <jesse.brandeburg@intel.com>
> > Status     : problem is being debugged

In this case the entire machine hangs and sometimes spits out an
NMI message.

The user confirms that using another network interface (albeit
wireless) works properly.

The Intel folks can reproduce this one in-house and will look more
deeply into it on Monday.

> > Subject    : forcedeth: interface hangs under load
> > References : http://lkml.org/lkml/2007/4/3/39
> > Submitter  : Ingo Molnar <mingo@elte.hu>
> > Handled-By : Ingo Molnar <mingo@elte.hu>
> >              Ayaz Abdulla <aabdulla@nvidia.com>
> > Status     : problem is being debugged

In Ingo's case here the interface stops working entirely, but his
system is still otherwise operational.

I looked at the interrupt handler for this driver and it is absolutely
awful especially in the NAPI enabled case.

It tries to handle TX done interrupts and other status events in the
HW irq handler, and the RX packet processing via NAPI ->poll().

Time has shown that this is a faulty way to use NAPI and that all
event types should be handled in the NAPI ->poll() handler, not just
RX packet processing.

The way the loop is coded now it will keep prodding at the interrupt
status register in the HW irq handler loop even after the RX packet
processing has been deferred to NAPI ->poll().  It seems likely that
since the RX packets aren't being processed there, the RX irq event
status should keep showing as set as new packets arrive.

Really, the interrupt status should be checked exactly once, all the
work deferred to NAPI's ->poll() and then the HW interrupt handler
should return immediately.  This is what e1000 and tg3 do, and it is
therefore the most well tested manner in which to use NAPI in a
network driver.

Anything else is racy and error prone.

This would also eliminate the max_interrupt_work hack, which is a side
effect of the way the interrupt handler is implemented in this
driver.
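
The split described above (read the interrupt status once, defer all work to the poll routine, return immediately) can be sketched as a small user-space simulation. It deliberately uses no kernel APIs: `struct sim_nic`, `sim_irq_handler()` and `sim_poll()` are hypothetical names, `poll_scheduled` merely stands in for scheduling NAPI, and the rule for leaving polled mode is simplified to "RX ring drained".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EV_RX 0x1u
#define EV_TX 0x2u

/* Hypothetical simulated NIC; not modelled on any real driver's layout. */
struct sim_nic {
	uint32_t irq_status;	/* pending event bits set by the "hardware" */
	int irq_enabled;	/* device interrupt mask */
	int poll_scheduled;	/* stands in for scheduling the NAPI poll */
	int rx_pending;		/* packets waiting in the RX ring */
	int tx_done_pending;	/* completed TX descriptors to reclaim */
	int irq_status_reads;	/* how often the handler read the status */
};

/* Hard IRQ handler: check the status exactly once, mask, defer everything. */
static int sim_irq_handler(struct sim_nic *n)
{
	assert(n != NULL);
	n->irq_status_reads++;
	if (!n->irq_status)
		return 0;		/* not our interrupt */
	n->irq_enabled = 0;		/* mask further device interrupts */
	n->poll_scheduled = 1;		/* all work happens in the poll routine */
	return 1;
}

/* Poll routine: handles TX completions and RX, bounded by a budget. */
static int sim_poll(struct sim_nic *n, int budget)
{
	int done = 0;

	n->tx_done_pending = 0;		/* reclaim all finished TX descriptors */
	while (n->rx_pending > 0 && done < budget) {
		n->rx_pending--;
		done++;
	}
	if (n->rx_pending == 0) {	/* ring drained: leave polled mode */
		n->irq_status = 0;
		n->poll_scheduled = 0;
		n->irq_enabled = 1;
	}
	return done;
}
```

With five packets pending and a budget of three, the handler runs once and reads the status once; the first `sim_poll()` call processes three packets and stays in polled mode, and the second processes the remaining two and re-enables the interrupt. No loop in the hard IRQ handler is needed, so there is nothing for a max_interrupt_work-style limit to bound.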


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [1/3] 2.6.21-rc6: known regressions
  2007-04-14  1:34   ` Linus Torvalds
                       ` (3 preceding siblings ...)
  2007-04-14  5:29     ` David Miller
@ 2007-04-14  6:21     ` Ingo Molnar
  2007-04-14  7:25       ` Greg KH
  2007-04-20 13:39       ` Ingo Molnar
  2007-04-20 13:46     ` Ingo Molnar
  5 siblings, 2 replies; 14+ messages in thread
From: Ingo Molnar @ 2007-04-14  6:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ayaz Abdulla, e1000-devel, netdev, Adrian Bunk, Greg KH,
	Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Note: Ingo also reports what looks like a memory corruption due to the 
> 6b6b6b6b pattern on presumably the same box.
> 
> The 6b6b6b6b pattern is POISON_FREE, implying some kind of slab 
> misuse, most likely a use-after-free, although possibly just due to 
> overrunning a slab into the next one or something like that.

unfortunately, while being at -rc6 based kernel #445 meanwhile, this 
incident was the only time i saw this problem. Note: while it's a 
CONFIG_SMP kernel, in that bootup i was using maxcpus=1:

   WARNING: maxcpus limit of 1 reached. Processor ignored.

so it's a pure UP problem. Plus i used PREEMPT_NONE. So this really must 
be something fundamental.

> What I'm leading up to is that I'm wondering if these mysterious 
> network driver bugs aren't due to the network drivers themselves, but 
> due to some higher-level problem. I think the hangs that Ingo sees 
> with forcedeth were preceded by mysterious and "impossible" NULL 
> pointer oopses. Ingo?

hm. I would tend to exclude networking, because the oops happened right 
during bootup (i saw it happen real time on the serial console), 
possibly before networking was brought up. It was udevd that crashed, 
and rarely does udevd do anything after its initial /dev hierarchy setup 
frenzy. (But this testbox boots very fast so it might have been near 
network bringup.)

note that i can pretty much freely force the forcedeth problem to occur 
on -rt [but all the reports i sent about it were done on a vanilla 
kernel]. I triggered that problem at least a couple of dozen times, and 
it _never_ caused any other effect besides the skb NULL dereference - or 
lately (with the latest forcedeth.c version), a pure forcedeth interface 
hang. That doesn't exclude networking driver badness, but makes it less 
likely.

to me this crash has the feeling of being sysfs related: not just 
because the crash itself is within sysfs:

 EIP is at module_put+0x19/0x2d

 [<c0104c44>] show_trace_log_lvl+0x19/0x2e
 [<c0104cf4>] show_stack_log_lvl+0x9b/0xa3
 [<c0104fdd>] show_registers+0x1c8/0x29a
 [<c01052d0>] die+0x119/0x1f0
 [<c03cd075>] do_page_fault+0x4e3/0x5b8
 [<c03cb7a4>] error_code+0x7c/0x84
 [<c019e832>] sysfs_release+0x55/0x76
 [<c0167c7f>] __fput+0xb9/0x15e
 [<c0167d3b>] fput+0x17/0x19
 [<c01658b2>] filp_close+0x52/0x5a
 [<c01660a3>] sys_close+0x76/0xad
 [<c0103dc0>] syscall_call+0x7/0xb

but also because udevd itself is _very_ sysfs intense - and in fact on 
this bzImage kernel it's perhaps the _only_ true sysfs activity that 
happens. (there are no loadable modules whatsoever, all drivers are 
built in)

	Ingo


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [1/3] 2.6.21-rc6: known regressions
  2007-04-14  6:21     ` Ingo Molnar
@ 2007-04-14  7:25       ` Greg KH
  2007-04-20 13:39       ` Ingo Molnar
  1 sibling, 0 replies; 14+ messages in thread
From: Greg KH @ 2007-04-14  7:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ayaz Abdulla, e1000-devel, netdev, Adrian Bunk, Linus Torvalds,
	Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller

On Sat, Apr 14, 2007 at 08:21:43AM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Note: Ingo also reports what looks like a memory corruption due to the 
> > 6b6b6b6b pattern on presumably the same box.
> > 
> > The 6b6b6b6b pattern is POISON_FREE, implying some kind of slab 
> > misuse, most likely a use-after-free, although possibly just due to 
> > overrunning a slab into the next one or something like that.
> 
> unfortunately, while being at -rc6 based kernel #445 meanwhile, this 
> incident was the only time i saw this problem. Note: while it's a 
> CONFIG_SMP kernel, in that bootup i was using maxcpus=1:
> 
>    WARNING: maxcpus limit of 1 reached. Processor ignored.
> 
> so it's a pure UP problem. Plus i used PREEMPT_NONE. So this really must 
> be something fundamental.
> 
> > What I'm leading up to is that I'm wondering if these mysterious 
> > network driver bugs aren't due to the network drivers themselves, but 
> > due to some higher-level problem. I think the hangs that Ingo sees 
> > with forcedeth were preceded by mysterious and "impossible" NULL 
> > pointer oopses. Ingo?
> 
> hm. I would tend to exclude networking, because the oops happened right 
> during bootup (i saw it happen real time on the serial console), 
> possibly before networking was brought up. It was udevd that crashed, 
> and rarely does udevd do anything after its initial /dev hierarchy setup 
> frenzy. (But this testbox boots very fast so it might have been near 
> network bringup.)
> 
> note that i can pretty much freely force the forcedeth problem to occur 
> on -rt [but all the reports i sent about it were done on a vanilla 
> kernel]. I triggered that problem at least a couple of dozen times, and 
> it _never_ caused any other effect besides the skb NULL dereference - or 
> lately (with the latest forcedeth.c version), a pure forcedeth interface 
> hang. That doesn't exclude networking driver badness, but makes it less 
> likely.
> 
> to me this crash has the feeling of being sysfs related: not just 
> because the crash itself is within sysfs:
> 
>  EIP is at module_put+0x19/0x2d
> 
>  [<c0104c44>] show_trace_log_lvl+0x19/0x2e
>  [<c0104cf4>] show_stack_log_lvl+0x9b/0xa3
>  [<c0104fdd>] show_registers+0x1c8/0x29a
>  [<c01052d0>] die+0x119/0x1f0
>  [<c03cd075>] do_page_fault+0x4e3/0x5b8
>  [<c03cb7a4>] error_code+0x7c/0x84
>  [<c019e832>] sysfs_release+0x55/0x76
>  [<c0167c7f>] __fput+0xb9/0x15e
>  [<c0167d3b>] fput+0x17/0x19
>  [<c01658b2>] filp_close+0x52/0x5a
>  [<c01660a3>] sys_close+0x76/0xad
>  [<c0103dc0>] syscall_call+0x7/0xb
> 
> but also because udevd itself is _very_ sysfs intense - and in fact on 
> this bzImage kernel it's perhaps the _only_ true sysfs activity that 
> happens. (there are no loadable modules whatsoever, all drivers are 
> built in)
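
For readers following the trace quoted above: each entry has the form
symbol+0xoffset/0xsize, so "module_put+0x19/0x2d" means the faulting
instruction sits 0x19 bytes into module_put, which is 0x2d bytes long.
A minimal userspace sketch of splitting such an entry apart follows. The
struct oops_loc type and the parse_oops_loc() helper are hypothetical
names made up for this illustration; they are not kernel APIs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one "symbol+0xoff/0xsize" entry. */
struct oops_loc {
	char symbol[64];
	unsigned long offset;	/* bytes from the start of the symbol */
	unsigned long size;	/* total size of the symbol in bytes */
};

/* Hypothetical helper: parse text such as "module_put+0x19/0x2d".
 * Returns 0 on success, -1 if the text does not match the format. */
static int parse_oops_loc(const char *text, struct oops_loc *out)
{
	assert(text != NULL && out != NULL);

	/* %63[^+] reads the symbol name up to the '+'; %lx accepts the
	 * optional "0x" prefix on each hexadecimal number. */
	if (sscanf(text, "%63[^+]+%lx/%lx",
		   out->symbol, &out->offset, &out->size) != 3)
		return -1;
	return 0;
}
```

Feeding it "sysfs_release+0x55/0x76" from the same trace would give the
symbol sysfs_release with offset 0x55 and size 0x76.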

What version of udev are you using?  Newer versions of udev don't hit
sysfs as much as they get the majority of their information from the
uevent message instead.

thanks,

greg k-h

* Re: [1/3] 2.6.21-rc6: known regressions
  2007-04-14  6:21     ` Ingo Molnar
  2007-04-14  7:25       ` Greg KH
@ 2007-04-20 13:39       ` Ingo Molnar
  1 sibling, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2007-04-20 13:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ayaz Abdulla, e1000-devel, netdev, Adrian Bunk, Greg KH,
	Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller


* Ingo Molnar <mingo@elte.hu> wrote:

> > The 6b6b6b6b pattern is POISON_FREE, implying some kind of slab 
> > misuse, most likely a use-after-free, although possibly just due to 
> > overrunning a slab into the next one or something like that.
> 
> unfortunately, while being at -rc6 based kernel #445 meanwhile, this 
> incident was the only time i saw this problem. [...]
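
As background to the quoted remark: POISON_FREE is the byte 0x6b
(defined in include/linux/poison.h), which the slab debug code writes
over an object when it is freed, so a pointer loaded from freed memory
reads back as 6b6b6b6b on a 32-bit box. A minimal userspace sketch of
spotting the pattern in dumped data follows. The helper names
is_free_poison32() and free_poison_run() are hypothetical and exist only
for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte the slab debug code writes over freed objects
 * (POISON_FREE in include/linux/poison.h). */
#define POISON_FREE 0x6b

/* Hypothetical helper: does a 32-bit value from an oops (a register or
 * the faulting address) consist only of POISON_FREE bytes? */
static int is_free_poison32(uint32_t value)
{
	return value == 0x6b6b6b6bu;
}

/* Hypothetical helper: number of leading POISON_FREE bytes in a dumped
 * object. A long run suggests the object had already been freed. */
static size_t free_poison_run(const unsigned char *buf, size_t len)
{
	size_t n = 0;

	assert(buf != NULL);
	while (n < len && buf[n] == POISON_FREE)
		n++;
	return n;
}
```

A match only hints at a use-after-free. As the quoted text notes, an
overrun from a neighbouring slab object could leave similar traces.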

meanwhile i'm at kernel bootup #657, and still this crash did not 
reoccur. So it could have been some pre-existing sysfs bug that triggers 
only extremely rarely. I'd suggest that this bug have its priority 
lowered (to not hold up a v2.6.21 release) - there's no smoking gun and 
no reproducer. I'll keep an eye on it.

	Ingo

* Re: [1/3] 2.6.21-rc6: known regressions
  2007-04-14  1:34   ` Linus Torvalds
                       ` (4 preceding siblings ...)
  2007-04-14  6:21     ` Ingo Molnar
@ 2007-04-20 13:46     ` Ingo Molnar
  5 siblings, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2007-04-20 13:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ayaz Abdulla, e1000-devel, netdev, Adrian Bunk, Greg KH,
	Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...] I think the hangs that Ingo sees with forcedeth were preceded by 
> mysterious and "impossible" NULL pointer oopses. Ingo?

update: the 'forcedeth NULL pointer oops' problem got resolved by one of 
these commits:

 commit 3ba4d093fe8a26f5f2da94411bf8732fa6e9da86
 Author: Ayaz Abdulla <aabdulla@nvidia.com>
 Date:   Fri Mar 23 05:50:02 2007 -0500

     forcedeth: fix tx timeout

 commit fcc5f2665c81e087fb95143325ed769a41128d50
 Author: Ayaz Abdulla <aabdulla@nvidia.com>
 Date:   Fri Mar 23 05:49:37 2007 -0500

     forcedeth: fix nic poll

it never reoccurred since these went upstream - so i'd close the NULL 
dereference bug.

furthermore, i haven't seen the 'forcedeth interface hangs' problem 
trigger with recent kernels (haven't seen it trigger for the past 2 
weeks), but no forcedeth specific change went into the kernel since i 
last reproduced a hang, so either it got fixed by something else, or the 
hang is very rare. We could lower its priority for v2.6.21. If it ever 
happens again i'll send another ethtool dump to Ayaz.

	Ingo

end of thread (newest message: 2007-04-20 13:46 UTC)

Thread overview: 14+ messages
     [not found] <Pine.LNX.4.64.0704051944230.6730@woody.linux-foundation.org>
2007-04-06 21:40 ` tg3: unable to handle null pointer dereference [Re: Linux 2.6.21-rc6] Nishanth Aravamudan
2007-04-06 22:57   ` Michael Chan
2007-04-07  0:36     ` tg3: unable to handle null pointer dereference David Miller
2007-04-07  1:53       ` Nishanth Aravamudan
2007-04-14  0:36 ` [1/3] 2.6.21-rc6: known regressions Adrian Bunk
2007-04-14  1:34   ` Linus Torvalds
2007-04-14  1:49     ` Brandeburg, Jesse
2007-04-14  4:25     ` David Miller
2007-04-14  5:07     ` Ian McDonald
2007-04-14  5:29     ` David Miller
2007-04-14  6:21     ` Ingo Molnar
2007-04-14  7:25       ` Greg KH
2007-04-20 13:39       ` Ingo Molnar
2007-04-20 13:46     ` Ingo Molnar