* Re: [1/3] 2.6.21-rc6: known regressions
2007-04-14 1:34 ` Linus Torvalds
@ 2007-04-14 1:49 ` Brandeburg, Jesse
2007-04-14 4:25 ` David Miller
` (4 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Brandeburg, Jesse @ 2007-04-14 1:49 UTC
To: Linus Torvalds, Adrian Bunk
Cc: Ayaz Abdulla, e1000-devel, Greg KH, Ingo Molnar, netdev,
Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller
> On Sat, 14 Apr 2007, Adrian Bunk wrote:
>>
>> Subject : laptops with e1000: lockups
>> References :
>> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603
>> Submitter : Dave Jones <davej@redhat.com>
>> Handled-By : Jesse Brandeburg <jesse.brandeburg@intel.com>
>> Status : problem is being debugged
>>
>> Subject : forcedeth: interface hangs under load
>> References : http://lkml.org/lkml/2007/4/3/39
>> Submitter : Ingo Molnar <mingo@elte.hu>
>> Handled-By : Ingo Molnar <mingo@elte.hu>
>> Ayaz Abdulla <aabdulla@nvidia.com>
>> Status : problem is being debugged
Linus Torvalds wrote:
> It does seem networking related somehow. Yeah, it could obviously
> be a combination of independent bugs both in e1000/ and forcedeth
> drivers, but maybe there is something in common here...
> So please people, give it a look. Comments?
I mentioned this in the bugzilla (229603 above), but we have at least
reproduced this here in our lab w.r.t. e1000. Some people were on
vacation this week, so the issue didn't progress (regardless of whether
this is e1000-specific, we will have some resources helping to report on
this next week). So we're not sure yet whether this is an e1000 problem.
More soon; maybe I'll try to bisect between some good and bad branches,
as the problem is pretty quick to occur and didn't seem to be present in
2.6.20.
Jesse
* Re: [1/3] 2.6.21-rc6: known regressions
2007-04-14 1:34 ` Linus Torvalds
2007-04-14 1:49 ` Brandeburg, Jesse
@ 2007-04-14 4:25 ` David Miller
2007-04-14 5:07 ` Ian McDonald
` (3 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2007-04-14 4:25 UTC
To: torvalds
Cc: aabdulla, e1000-devel, netdev, bunk, greg, davej, akpm, jgarzik,
mingo
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 13 Apr 2007 18:34:23 -0700 (PDT)
> Davem - have there been network infrastructure changes that might be
> suspect? Jeff and/or Greg - anything in the generic network driver/device
> driver level? We had some trouble earlier with the transition to the
> driver core, and kref miscounting. Related? The last Oops Ingo saw was a
> module refcounting one, iirc.
Nothing stands out in the recent changes I've merged, I'll study this
issue and see if I can see any pattern or a clue.
* Re: [1/3] 2.6.21-rc6: known regressions
2007-04-14 1:34 ` Linus Torvalds
2007-04-14 1:49 ` Brandeburg, Jesse
2007-04-14 4:25 ` David Miller
@ 2007-04-14 5:07 ` Ian McDonald
2007-04-14 5:29 ` David Miller
` (2 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Ian McDonald @ 2007-04-14 5:07 UTC
To: Linus Torvalds
Cc: Adrian Bunk, Andrew Morton, Jeff Garzik, netdev, e1000-devel,
Ingo Molnar, Ayaz Abdulla, Dave Jones, David S. Miller, Greg KH
On 4/14/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Note: Ingo also reports what looks like a memory corruption due to
> the 6b6b6b6b pattern on presumably the same box.
>
> The 6b6b6b6b pattern is POISON_FREE, implying some kind of slab misuse,
> most likely a use-after-free, although possibly just due to overrunning a
> slab into the next one or something like that.
>
> What I'm leading up to is that I'm wondering if these mysterious network
> driver bugs aren't due to the network drivers themselves, but due to some
> higher-level problem. I think the hangs that Ingo sees with forcedeth were
> preceded by mysterious and "impossible" NULL pointer oopses. Ingo?
>
> Davem - have there been network infrastructure changes that might be
> suspect? Jeff and/or Greg - anything in the generic network driver/device
> driver level? We had some trouble earlier with the transition to the
> driver core, and kref miscounting. Related? The last Oops Ingo saw was a
> module refcounting one, iirc.
>
> It does seem networking related somehow. Yeah, it could obviously be a
> combination of independent bugs both in e1000/ and forcedeth drivers, but
> maybe there is something in common here...
>
I don't know if this is a red herring or not, but on March 13th I
reported slab corruption, and it looked like file_free_rcu was
involved - these are fairly recent changes, I think (rcu)?
Anyway, the original message is at http://lkml.org/lkml/2007/3/12/364
My apologies if this is not related.
Ian
--
Web: http://wand.net.nz/~iam4/
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group
* Re: [1/3] 2.6.21-rc6: known regressions
2007-04-14 1:34 ` Linus Torvalds
` (2 preceding siblings ...)
2007-04-14 5:07 ` Ian McDonald
@ 2007-04-14 5:29 ` David Miller
2007-04-14 6:21 ` Ingo Molnar
2007-04-20 13:46 ` Ingo Molnar
5 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2007-04-14 5:29 UTC
To: torvalds
Cc: aabdulla, e1000-devel, netdev, bunk, greg, davej, akpm, jgarzik,
mingo
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 13 Apr 2007 18:34:23 -0700 (PDT)
Let's see how related these two might actually be.
> On Sat, 14 Apr 2007, Adrian Bunk wrote:
> >
> > Subject : laptops with e1000: lockups
> > References : https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603
> > Submitter : Dave Jones <davej@redhat.com>
> > Handled-By : Jesse Brandeburg <jesse.brandeburg@intel.com>
> > Status : problem is being debugged
In this case the entire machine hangs and sometimes spits out an
NMI message.
The user confirms that using another network interface (albeit
wireless) works properly.
The Intel folks can reproduce this one in-house and will look more
deeply into it on Monday.
> > Subject : forcedeth: interface hangs under load
> > References : http://lkml.org/lkml/2007/4/3/39
> > Submitter : Ingo Molnar <mingo@elte.hu>
> > Handled-By : Ingo Molnar <mingo@elte.hu>
> > Ayaz Abdulla <aabdulla@nvidia.com>
> > Status : problem is being debugged
In Ingo's case here the interface stops working entirely, but his
system is still otherwise operational.
I looked at the interrupt handler for this driver and it is absolutely
awful, especially in the NAPI enabled case.
It tries to handle TX done interrupts and other status events in the
HW irq handler, and the RX packet processing via NAPI ->poll().
Time has shown that this is a faulty way to use NAPI and that all
event types should be handled in the NAPI ->poll() handler, not just
RX packet processing.
The way the loop is coded now it will keep prodding at the interrupt
status register in the HW irq handler loop even after the RX packet
processing has been deferred to NAPI ->poll(). It seems likely that
since the RX packets aren't being processed there, the RX irq event
status should keep showing as set as new packets arrive.
Really, the interrupt status should be checked exactly once, all the
work deferred to NAPI's ->poll() and then the HW interrupt handler
should return immediately. This is what e1000 and tg3 do, and it is
therefore the most well tested manner in which to use NAPI in a
network driver.
Anything else is racy and error prone.
This would also eliminate the max_interrupt_work hack, which is a side
effect of the way the interrupt handler is implemented in this
driver.
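A rough model of the structure described above, written as plain Python
rather than driver code: the hard-IRQ handler reads the status register
exactly once, masks further interrupts, schedules the poll, and returns;
the poll routine then handles every event type (TX completion as well as
RX) and unmasks interrupts only once it finishes under budget. The
FakeNic class, its fields, and the packet names are assumptions made up
for this sketch; they are not forcedeth, e1000, or tg3 code.

```python
from collections import deque


class FakeNic:
    """Toy device: hypothetical stand-in for a NIC, not a real driver."""

    def __init__(self):
        self.rx_ring = deque()      # packets waiting to be received
        self.tx_done = 0            # completed transmissions to reclaim
        self.irq_enabled = True
        self.poll_scheduled = False
        self.status_reads = 0       # how often the IRQ path read the status

    def read_status(self):
        self.status_reads += 1
        return bool(self.rx_ring) or self.tx_done > 0

    def irq_handler(self):
        """Hard-IRQ path: check status once, defer all work, return."""
        if not self.irq_enabled or not self.read_status():
            return False            # not our interrupt
        self.irq_enabled = False    # mask until poll() has caught up
        self.poll_scheduled = True
        return True

    def poll(self, budget):
        """Deferred path: handle every event type, not just RX."""
        reclaimed, self.tx_done = self.tx_done, 0
        received = []
        while self.rx_ring and len(received) < budget:
            received.append(self.rx_ring.popleft())
        if len(received) < budget:  # all caught up: leave polling mode
            self.poll_scheduled = False
            self.irq_enabled = True
        return reclaimed, received


nic = FakeNic()
nic.rx_ring.extend(["pkt0", "pkt1", "pkt2"])
nic.tx_done = 2

nic.irq_handler()                   # one status read, then straight to poll
while nic.poll_scheduled:
    reclaimed, received = nic.poll(budget=2)
    print(reclaimed, received)
```

In this model the status register is read once no matter how many poll
passes are needed, which is the property the max_interrupt_work loop
lacks.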
* Re: [1/3] 2.6.21-rc6: known regressions
2007-04-14 1:34 ` Linus Torvalds
` (3 preceding siblings ...)
2007-04-14 5:29 ` David Miller
@ 2007-04-14 6:21 ` Ingo Molnar
2007-04-14 7:25 ` Greg KH
2007-04-20 13:39 ` Ingo Molnar
2007-04-20 13:46 ` Ingo Molnar
5 siblings, 2 replies; 14+ messages in thread
From: Ingo Molnar @ 2007-04-14 6:21 UTC
To: Linus Torvalds
Cc: Ayaz Abdulla, e1000-devel, netdev, Adrian Bunk, Greg KH,
Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Note: Ingo also reports what looks like a memory corruption due to the
> 6b6b6b6b pattern on presumably the same box.
>
> The 6b6b6b6b pattern is POISON_FREE, implying some kind of slab
> misuse, most likely a use-after-free, although possibly just due to
> overrunning a slab into the next one or something like that.
unfortunately, while being at -rc6 based kernel #445 meanwhile, this
incident was the only time i saw this problem. Note: while it's a
CONFIG_SMP kernel, in that bootup i was using maxcpus=1:
WARNING: maxcpus limit of 1 reached. Processor ignored.
so it's a pure UP problem. Plus i used PREEMPT_NONE. So this really must
be something fundamental.
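As background on why 6b6b6b6b points at freed slab memory: with slab
debugging enabled, freed objects are filled with the POISON_FREE byte
(0x6b), so a pointer or counter later read out of such an object shows
up as a run of 6b bytes. The sketch below is a userspace Python model,
not kernel code; the 16-byte object layout and the helper names
(free_object, looks_poisoned) are assumptions for illustration only.

```python
import struct

POISON_FREE = 0x6B  # byte written over freed objects by slab debugging


def free_object(obj: bytearray) -> None:
    """Model a debug-enabled free: overwrite the whole object with poison."""
    obj[:] = bytes([POISON_FREE]) * len(obj)


def looks_poisoned(value: int, width: int = 4) -> bool:
    """True if every byte of a `width`-byte value equals POISON_FREE."""
    return value.to_bytes(width, "little") == bytes([POISON_FREE]) * width


# A hypothetical 16-byte object whose first 4 bytes hold a 32-bit pointer.
obj = bytearray(struct.pack("<I12x", 0xC019E832))
free_object(obj)

# Reading the stale "pointer" after the free yields the poison pattern.
(stale_ptr,) = struct.unpack_from("<I", obj, 0)
print(hex(stale_ptr))  # 0x6b6b6b6b
```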
> What I'm leading up to is that I'm wondering if these mysterious
> network driver bugs aren't due to the network drivers themselves, but
> due to some higher-level problem. I think the hangs that Ingo sees
> with forcedeth were preceded by mysterious and "impossible" NULL
> pointer oopses. Ingo?
hm. I would tend to exclude networking, because the oops happened right
during bootup (i saw it happen real time on the serial console),
possibly before networking was brought up. It was udevd that crashed,
and rarely does udevd do anything after its initial /dev hierarchy setup
frenzy. (But this testbox boots very fast so it might have been near
network bringup.)
note that i can pretty much freely force the forcedeth problem to occur
on -rt [but all the reports i sent about it were done on a vanilla
kernel]. I triggered that problem at least a couple of dozen times, and
it _never_ caused any other effect besides the skb NULL dereference - or
lately (with the latest forcedeth.c version), a pure forcedeth interface
hang. That doesn't exclude networking driver badness, but makes it less
likely.
to me this crash has the feeling of being sysfs related: not just
because the crash itself is within sysfs:
EIP is at module_put+0x19/0x2d
[<c0104c44>] show_trace_log_lvl+0x19/0x2e
[<c0104cf4>] show_stack_log_lvl+0x9b/0xa3
[<c0104fdd>] show_registers+0x1c8/0x29a
[<c01052d0>] die+0x119/0x1f0
[<c03cd075>] do_page_fault+0x4e3/0x5b8
[<c03cb7a4>] error_code+0x7c/0x84
[<c019e832>] sysfs_release+0x55/0x76
[<c0167c7f>] __fput+0xb9/0x15e
[<c0167d3b>] fput+0x17/0x19
[<c01658b2>] filp_close+0x52/0x5a
[<c01660a3>] sys_close+0x76/0xad
[<c0103dc0>] syscall_call+0x7/0xb
but also because udevd itself is _very_ sysfs intense - and in fact on
this bzImage kernel it's perhaps the _only_ true sysfs activity that
happens. (there are no loadable modules whatsoever, all drivers are
built in)
Ingo
* Re: [1/3] 2.6.21-rc6: known regressions
2007-04-14 6:21 ` Ingo Molnar
@ 2007-04-14 7:25 ` Greg KH
2007-04-20 13:39 ` Ingo Molnar
1 sibling, 0 replies; 14+ messages in thread
From: Greg KH @ 2007-04-14 7:25 UTC
To: Ingo Molnar
Cc: Ayaz Abdulla, e1000-devel, netdev, Adrian Bunk, Linus Torvalds,
Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller
On Sat, Apr 14, 2007 at 08:21:43AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Note: Ingo also reports what looks like a memory corruption due to the
> > 6b6b6b6b pattern on presumably the same box.
> >
> > The 6b6b6b6b pattern is POISON_FREE, implying some kind of slab
> > misuse, most likely a use-after-free, although possibly just due to
> > overrunning a slab into the next one or something like that.
>
> unfortunately, while being at -rc6 based kernel #445 meanwhile, this
> incident was the only time i saw this problem. Note: while it's a
> CONFIG_SMP kernel, in that bootup i was using maxcpus=1:
>
> WARNING: maxcpus limit of 1 reached. Processor ignored.
>
> so it's a pure UP problem. Plus i used PREEMPT_NONE. So this really must
> be something fundamental.
>
> > What I'm leading up to is that I'm wondering if these mysterious
> > network driver bugs aren't due to the network drivers themselves, but
> > due to some higher-level problem. I think the hangs that Ingo sees
> > with forcedeth were preceded by mysterious and "impossible" NULL
> > pointer oopses. Ingo?
>
> hm. I would tend to exclude networking, because the oops happened right
> during bootup (i saw it happen real time on the serial console),
> possibly before networking was brought up. It was udevd that crashed,
> and rarely does udevd do anything after its initial /dev hierarchy setup
> frenzy. (But this testbox boots very fast so it might have been near
> network bringup.)
>
> note that i can pretty much freely force the forcedeth problem to occur
> on -rt [but all the reports i sent about it were done on a vanilla
> kernel]. I triggered that problem at least a couple of dozen times, and
> it _never_ caused any other effect besides the skb NULL dereference - or
> lately (with the latest forcedeth.c version), a pure forcedeth interface
> hang. That doesn't exclude networking driver badness, but makes it less
> likely.
>
> to me this crash has the feeling of being sysfs related: not just
> because the crash itself is within sysfs:
>
> EIP is at module_put+0x19/0x2d
>
> [<c0104c44>] show_trace_log_lvl+0x19/0x2e
> [<c0104cf4>] show_stack_log_lvl+0x9b/0xa3
> [<c0104fdd>] show_registers+0x1c8/0x29a
> [<c01052d0>] die+0x119/0x1f0
> [<c03cd075>] do_page_fault+0x4e3/0x5b8
> [<c03cb7a4>] error_code+0x7c/0x84
> [<c019e832>] sysfs_release+0x55/0x76
> [<c0167c7f>] __fput+0xb9/0x15e
> [<c0167d3b>] fput+0x17/0x19
> [<c01658b2>] filp_close+0x52/0x5a
> [<c01660a3>] sys_close+0x76/0xad
> [<c0103dc0>] syscall_call+0x7/0xb
>
> but also because udevd itself is _very_ sysfs intense - and in fact on
> this bzImage kernel it's perhaps the _only_ true sysfs activity that
> happens. (there are no loadable modules whatsoever, all drivers are
> built in)
What version of udev are you using? Newer versions of udev don't hit
sysfs as much as they get the majority of their information from the
uevent message instead.
thanks,
greg k-h
* Re: [1/3] 2.6.21-rc6: known regressions
2007-04-14 6:21 ` Ingo Molnar
2007-04-14 7:25 ` Greg KH
@ 2007-04-20 13:39 ` Ingo Molnar
1 sibling, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2007-04-20 13:39 UTC
To: Linus Torvalds
Cc: Ayaz Abdulla, e1000-devel, netdev, Adrian Bunk, Greg KH,
Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller
* Ingo Molnar <mingo@elte.hu> wrote:
> > The 6b6b6b6b pattern is POISON_FREE, implying some kind of slab
> > misuse, most likely a use-after-free, although possibly just due to
> > overrunning a slab into the next one or something like that.
>
> unfortunately, while being at -rc6 based kernel #445 meanwhile, this
> incident was the only time i saw this problem. [...]
meanwhile i'm at kernel bootup #657, and still this crash did not
reoccur. So it could have been some pre-existing sysfs bug that triggers
only extremely rarely. I'd suggest that this bug have its priority
lowered (to not hold up a v2.6.21 release) - there's no smoking gun and
no reproducer. I'll keep an eye on it.
Ingo
* Re: [1/3] 2.6.21-rc6: known regressions
2007-04-14 1:34 ` Linus Torvalds
` (4 preceding siblings ...)
2007-04-14 6:21 ` Ingo Molnar
@ 2007-04-20 13:46 ` Ingo Molnar
5 siblings, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2007-04-20 13:46 UTC
To: Linus Torvalds
Cc: Ayaz Abdulla, e1000-devel, netdev, Adrian Bunk, Greg KH,
Dave Jones, Andrew Morton, Jeff Garzik, David S. Miller
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> [...] I think the hangs that Ingo sees with forcedeth were preceded by
> mysterious and "impossible" NULL pointer oopses. Ingo?
update: the 'forcedeth NULL pointer oops' problem got resolved by one of
these commits:
commit 3ba4d093fe8a26f5f2da94411bf8732fa6e9da86
Author: Ayaz Abdulla <aabdulla@nvidia.com>
Date: Fri Mar 23 05:50:02 2007 -0500
forcedeth: fix tx timeout
commit fcc5f2665c81e087fb95143325ed769a41128d50
Author: Ayaz Abdulla <aabdulla@nvidia.com>
Date: Fri Mar 23 05:49:37 2007 -0500
forcedeth: fix nic poll
it never reoccurred since this went upstream - so i'd close the NULL
dereference bug.
furthermore, i haven't seen the 'forcedeth interface hangs' problem
trigger with recent kernels (haven't seen it trigger for the past 2
weeks), but no forcedeth specific change went into the kernel since i
last reproduced a hang, so either it got fixed by something else, or the
hang is very rare. We could lower its priority for v2.6.21. If it ever
happens again i'll send another ethtool dump to Ayaz.
Ingo