From: Ayaz Abdulla <aabdulla@nvidia.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Jeff Garzik <jeff@garzik.org>, Adrian Bunk <bunk@stusta.de>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: Linux 2.6.21-rc5
Date: Mon, 26 Mar 2007 03:17:22 -0500
Message-ID: <46078192.6020307@nvidia.com>
In-Reply-To: <20070326083146.GA11666@elte.hu>

This issue might be resolved by the patch attached to the following
bug report: http://bugzilla.kernel.org/show_bug.cgi?id=8058

Please try the patch from the bug report without your patch and see
whether the issue still reproduces.

Ayaz

Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
>>There's various fixes here, ranging from some architecture updates
>>(ia64, ARM, MIPS, SH, Sparc64) to KVM, networking and network drivers.
>
>
> here's a new v2.6.20 -> v2.6.21 forcedeth.c regression:
>
> in the last week or so i've been seeing sporadic under-load forcedeth.c
> crashes (see the full oops further below):
>
> eth1: too many iterations (6) in nv_nic_irq.
> Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP:
> [<ffffffff80404587>] nv_tx_done+0xf4/0x1cf
>
> this is line 1906 of drivers/net/forcedeth.c:
>
> np->stats.tx_bytes += np->get_tx_ctx->skb->len;
>
> struct sk_buff's len field is at offset 0x88, so np->get_tx_ctx->skb is
> NULL. That is an 'impossible' scenario for tx descriptors here - the tx
> ring descriptors are always set up with a valid skb (and a valid dma
> address), and their completion is serialized via np->lock.
>
> these crashes are almost instant on the .21-rc5-rt kernel, but extremely
> sporadic on the upstream kernel and needed very high networking loads to
> trigger. Today i found a good way to trigger it almost instantly on
> upstream kernels too: apply the debug patch attached further below and
> do:
>
> echo 100 > /proc/sys/kernel/panic
>
> that will inject 100 artificial 'too many iterations' failures and
> provoke a TX timeout, and that TX timeout will crash. (i've used a
> dual-core Athlon64 system in this test)
>
> my first quick guess was to extend np->priv locking to the whole of
> nv_start_xmit/nv_start_xmit_optimized - while that appeared to make the
> crash a bit less likely, it did not prevent it. So there must be some
> other, more fundamental problem left as well. At first glance the SMP
> locking looks OK, so maybe the ring indices are messed up somehow and we
> got into a 'ring head bites the tail' scenario?
>
> i can provide more info if needed.
>
> Ingo
>
> -------------->
> eth1: too many iterations (6) in nv_nic_irq.
> Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP:
> [<ffffffff80404587>] nv_tx_done+0xf4/0x1cf
> PGD 34d03067 PUD 34d02067 PMD 0
> Oops: 0000 [1] PREEMPT SMP
> CPU 1
> Modules linked in:
> Pid: 0, comm: swapper Not tainted 2.6.21-rc5 #8
> RIP: 0010:[<ffffffff80404587>] [<ffffffff80404587>] nv_tx_done+0xf4/0x1cf
> RSP: 0018:ffff81003ff6be40 EFLAGS: 00010206
> RAX: 0000000000000000 RBX: ffff810002e26700 RCX: 0000000000000001
> RDX: 0000000000000042 RSI: 000000003ef00cbe RDI: ffff81003fbeb070
> RBP: ffff81003ff6be60 R08: ffff810002e26a00 R09: 0000000000000003
> R10: ffff81003ff4e100 R11: ffff810001e283f8 R12: 000000003ef00cbe
> R13: ffff810002e26000 R14: ffff810002e28fc0 R15: 0000000000000000
> FS: 00002b6cb57f1db0(0000) GS:ffff81003ff4ad40(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000088 CR3: 0000000034c87000 CR4: 00000000000006e0
> Process swapper (pid: 0, threadinfo ffff81003ff64000, task ffff81003ff4e100)
> Stack: ffff810002e26700 0000000000000032 ffffc2000001a000 ffff810002e26000
> ffff81003ff6bea0 ffffffff80406dae ffff810002e26700 ffff810002e26700
> ffff810002e26000 00000000000000ff ffffc2000001a000 ffffffff80749080
> Call Trace:
> <IRQ> [<ffffffff80406dae>] nv_nic_irq+0x76/0x261
> [<ffffffff8040961e>] nv_do_nic_poll+0x200/0x284
> [<ffffffff8040941e>] nv_do_nic_poll+0x0/0x284
> [<ffffffff80241995>] run_timer_softirq+0x167/0x1dd
> [<ffffffff8023de45>] __do_softirq+0x5b/0xc9
> [<ffffffff8020af0c>] call_softirq+0x1c/0x28
> [<ffffffff8020c2b4>] do_softirq+0x31/0x84
> [<ffffffff8023db16>] irq_exit+0x3f/0x50
> [<ffffffff802190c2>] smp_apic_timer_interrupt+0x49/0x5b
> [<ffffffff802087fb>] default_idle+0x0/0x44
> [<ffffffff8020a9b6>] apic_timer_interrupt+0x66/0x70
> <EOI> [<ffffffff8020882a>] default_idle+0x2f/0x44
> [<ffffffff8020804c>] enter_idle+0x22/0x24
> [<ffffffff802088d0>] cpu_idle+0x91/0xd4
> [<ffffffff80218572>] start_secondary+0x2e3/0x2f5
>
> ---
> drivers/net/forcedeth.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> Index: linux/drivers/net/forcedeth.c
> ===================================================================
> --- linux.orig/drivers/net/forcedeth.c
> +++ linux/drivers/net/forcedeth.c
> @@ -2908,6 +2908,10 @@ static irqreturn_t nv_nic_irq(int foo, v
> spin_unlock(&np->lock);
> break;
> }
> + if (panic_timeout > 0) {
> + panic_timeout--;
> + i = max_interrupt_work+1;
> + }
> if (unlikely(i > max_interrupt_work)) {
> spin_lock(&np->lock);
> /* disable interrupts on the nic */
> @@ -3026,6 +3030,10 @@ static irqreturn_t nv_nic_irq_optimized(
> break;
> }
>
> + if (panic_timeout > 0) {
> + panic_timeout--;
> + i = max_interrupt_work+1;
> + }
> if (unlikely(i > max_interrupt_work)) {
> spin_lock(&np->lock);
> /* disable interrupts on the nic */
> @@ -3076,6 +3084,10 @@ static irqreturn_t nv_nic_irq_tx(int foo
> dprintk(KERN_DEBUG "%s: received irq with events 0x%x. Probably TX fail.\n",
> dev->name, events);
> }
> + if (panic_timeout > 0) {
> + panic_timeout--;
> + i = max_interrupt_work+1;
> + }
> if (unlikely(i > max_interrupt_work)) {
> spin_lock_irqsave(&np->lock, flags);
> /* disable interrupts on the nic */
> @@ -3191,6 +3203,10 @@ static irqreturn_t nv_nic_irq_rx(int foo
> }
> }
>
> + if (panic_timeout > 0) {
> + panic_timeout--;
> + i = max_interrupt_work+1;
> + }
> if (unlikely(i > max_interrupt_work)) {
> spin_lock_irqsave(&np->lock, flags);
> /* disable interrupts on the nic */
> @@ -3264,6 +3280,10 @@ static irqreturn_t nv_nic_irq_other(int
> printk(KERN_DEBUG "%s: received irq with unknown events 0x%x. Please report\n",
> dev->name, events);
> }
> + if (panic_timeout > 0) {
> + panic_timeout--;
> + i = max_interrupt_work+1;
> + }
> if (unlikely(i > max_interrupt_work)) {
> spin_lock_irqsave(&np->lock, flags);
> /* disable interrupts on the nic */
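
For anyone reading the oops above: the reasoning that a fault address of
0x88 implies a NULL skb can be sketched in plain C. This is only an
illustration, not driver code. The struct fake_skb layout and the
fault_address() helper are hypothetical stand-ins; the only thing taken
from the report is that len sits 0x88 bytes into the structure, matching
CR2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct sk_buff. The real structure has many
 * fields; here a filler array pushes `len` to offset 0x88 so the layout
 * matches the fault address reported in CR2.
 */
struct fake_skb {
	unsigned char pad[0x88];	/* filler, not a real sk_buff field */
	unsigned int len;
};

/* Compile-time check that the stand-in layout puts len at 0x88. */
static_assert(offsetof(struct fake_skb, len) == 0x88,
	      "len must sit at offset 0x88 in the stand-in struct");

/*
 * Address the CPU touches when reading a field through a pointer:
 * base address of the object plus the field's offset. With a NULL
 * base (0), the result is just the field offset.
 */
static uintptr_t fault_address(uintptr_t base, size_t field_offset)
{
	return base + field_offset;
}
```

Calling fault_address(0, offsetof(struct fake_skb, len)) gives 0x88, which
is why a read of skb->len through a NULL skb shows up as a dereference at
0000000000000088 rather than at address 0.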