From: Ayaz Abdulla <aabdulla@nvidia.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Jeff Garzik <jeff@garzik.org>, Adrian Bunk <bunk@stusta.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: Linux 2.6.21-rc5
Date: Mon, 26 Mar 2007 03:17:22 -0500
Message-ID: <46078192.6020307@nvidia.com>
In-Reply-To: <20070326083146.GA11666@elte.hu>

This issue might be resolved with the patch provided in the following 
bug report: http://bugzilla.kernel.org/show_bug.cgi?id=8058

Please try out the patch in the bug report without your patch and see if 
the issue reproduces.

Ayaz


Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> 
>>There's various fixes here, ranging from some architecture updates 
>>(ia64, ARM, MIPS, SH, Sparc64) to KVM, networking and network drivers.
> 
> 
> here's a new v2.6.20 -> v2.6.21 forcedeth.c regression:
> 
> in the last week or so i've been seeing sporadic under-load forcedeth.c 
> crashes (see the full oops further below):
> 
>  eth1: too many iterations (6) in nv_nic_irq.
>  Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP: 
>  [<ffffffff80404587>] nv_tx_done+0xf4/0x1cf
> 
> this is line 1906 of drivers/net/forcedeth.c:
> 
>     np->stats.tx_bytes += np->get_tx_ctx->skb->len;
> 
> struct sk_buff's len field is at offset 0x88, which matches the faulting 
> address above, so np->get_tx_ctx->skb is NULL. That is an 'impossible' 
> scenario for tx descriptors here - the tx ring descriptors are always 
> set up with a valid skb (and a valid dma address), and their completion 
> is serialized via np->lock.
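
To illustrate the reasoning above: reading a field through a NULL struct pointer faults at an address equal to the field's offset, which is how the CR2 value of 0x88 in the oops points at skb->len. The sketch below uses a hypothetical struct mock_skb whose padding is chosen so that len lands at 0x88; it is not the real struct sk_buff layout, and len_fault_address() is an illustrative helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff (assumption, not the real
 * layout): the padding is sized so that len sits at offset 0x88. */
struct mock_skb {
	unsigned char pad[0x88];
	unsigned int len;
};

/* Compile-time check that the mock layout matches the offset in the oops. */
static_assert(offsetof(struct mock_skb, len) == 0x88,
	      "mock len field must sit at offset 0x88");

/* Address the CPU would touch when evaluating skb->len for a given
 * base address of the skb. */
static uintptr_t len_fault_address(uintptr_t skb_base)
{
	return skb_base + offsetof(struct mock_skb, len);
}
```

With skb_base equal to 0 (a NULL skb), len_fault_address() returns 0x88, the same value the oops shows in CR2.
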
> 
> these crashes are almost instant on the .21-rc5-rt kernel, but extremely 
> sporadic on the upstream kernel and needed very high networking loads to 
> trigger. Today i found a good way to trigger it almost instantly on 
> upstream kernels too: apply the debug patch attached further below and 
> do:
> 
> 	echo 100 > /proc/sys/kernel/panic
> 
> that will inject 100 artificial 'too many iterations' failures and 
> provoke a TX timeout, and that TX timeout will crash. (i've used a 
> dual-core Athlon64 system in this test)
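
The injection mechanism described above can be modelled outside the kernel. This is a minimal sketch, assuming a limit of 5 (inferred from the "(6)" in the log message, not confirmed here); inject_budget is a hypothetical name standing in for the panic_timeout variable that the debug patch reuses as a counter, and irq_pass_overflows() is an illustrative helper.

```c
#include <assert.h>

/* Assumed limit: the log prints "(6)", i.e. limit + 1. */
static int max_interrupt_work = 5;

/* Hypothetical stand-in for panic_timeout, which the debug patch
 * borrows as a countdown of failures still to inject. */
static int inject_budget;

/* Returns 1 if this pass through the irq loop takes the
 * "too many iterations" path, 0 otherwise. */
static int irq_pass_overflows(int i)
{
	assert(i >= 0);
	if (inject_budget > 0) {
		/* Consume one injected failure and force the overflow. */
		inject_budget--;
		i = max_interrupt_work + 1;
	}
	return i > max_interrupt_work;
}
```

Setting inject_budget to 100 corresponds to the echo command above: the next 100 passes overflow regardless of the real iteration count, after which the normal check applies again.
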
> 
> my first quick guess was to extend np->priv locking to the whole of 
> nv_start_xmit/nv_start_xmit_optimized - while that appeared to make the 
> crash a bit less likely, it did not prevent it. So there must be some 
> other, more fundamental problem left as well. At first glance the SMP 
> locking looks OK, so maybe the ring indices are messed up somehow and we 
> got into a 'ring head bites the tail' scenario?
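
As a rough illustration of the ring-index hypothesis above, and not a diagnosis of the actual bug: if the completion index ever runs past the queueing index, completion lands on a slot that holds no skb. The struct mock_tx_ring type and the helpers below are hypothetical and much simpler than the driver's real descriptor rings.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4u

/* Hypothetical tx ring: put is where the next skb is queued,
 * get is the next slot whose completion will be processed. */
struct mock_tx_ring {
	const void *skb[RING_SIZE];
	unsigned int put;
	unsigned int get;
};

/* Queue an skb for transmission at the put index. */
static void ring_queue(struct mock_tx_ring *r, const void *skb)
{
	assert(skb != NULL);
	r->skb[r->put] = skb;
	r->put = (r->put + 1) % RING_SIZE;
}

/* True while there are queued slots that have not been completed. */
static int ring_has_pending(const struct mock_tx_ring *r)
{
	return r->get != r->put;
}

/* Complete the slot at get without checking whether anything is
 * pending; returns NULL if the indices have got out of step. */
static const void *ring_complete_unchecked(struct mock_tx_ring *r)
{
	const void *skb = r->skb[r->get];

	r->skb[r->get] = NULL;
	r->get = (r->get + 1) % RING_SIZE;
	return skb;
}
```

After one ring_queue() and one completion, a second call to ring_complete_unchecked() returns NULL; dereferencing that result is the kind of access that would fault as in the oops. A ring_has_pending() check before completing avoids it in this toy model.
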
> 
> i can provide more info if needed.
> 
> 	Ingo
> 
> -------------->
> eth1: too many iterations (6) in nv_nic_irq.
> Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP: 
>  [<ffffffff80404587>] nv_tx_done+0xf4/0x1cf
> PGD 34d03067 PUD 34d02067 PMD 0 
> Oops: 0000 [1] PREEMPT SMP 
> CPU 1 
> Modules linked in:
> Pid: 0, comm: swapper Not tainted 2.6.21-rc5 #8
> RIP: 0010:[<ffffffff80404587>]  [<ffffffff80404587>] nv_tx_done+0xf4/0x1cf
> RSP: 0018:ffff81003ff6be40  EFLAGS: 00010206
> RAX: 0000000000000000 RBX: ffff810002e26700 RCX: 0000000000000001
> RDX: 0000000000000042 RSI: 000000003ef00cbe RDI: ffff81003fbeb070
> RBP: ffff81003ff6be60 R08: ffff810002e26a00 R09: 0000000000000003
> R10: ffff81003ff4e100 R11: ffff810001e283f8 R12: 000000003ef00cbe
> R13: ffff810002e26000 R14: ffff810002e28fc0 R15: 0000000000000000
> FS:  00002b6cb57f1db0(0000) GS:ffff81003ff4ad40(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000088 CR3: 0000000034c87000 CR4: 00000000000006e0
> Process swapper (pid: 0, threadinfo ffff81003ff64000, task ffff81003ff4e100)
> Stack:  ffff810002e26700 0000000000000032 ffffc2000001a000 ffff810002e26000
>  ffff81003ff6bea0 ffffffff80406dae ffff810002e26700 ffff810002e26700
>  ffff810002e26000 00000000000000ff ffffc2000001a000 ffffffff80749080
> Call Trace:
>  <IRQ>  [<ffffffff80406dae>] nv_nic_irq+0x76/0x261
>  [<ffffffff8040961e>] nv_do_nic_poll+0x200/0x284
>  [<ffffffff8040941e>] nv_do_nic_poll+0x0/0x284
>  [<ffffffff80241995>] run_timer_softirq+0x167/0x1dd
>  [<ffffffff8023de45>] __do_softirq+0x5b/0xc9
>  [<ffffffff8020af0c>] call_softirq+0x1c/0x28
>  [<ffffffff8020c2b4>] do_softirq+0x31/0x84
>  [<ffffffff8023db16>] irq_exit+0x3f/0x50
>  [<ffffffff802190c2>] smp_apic_timer_interrupt+0x49/0x5b
>  [<ffffffff802087fb>] default_idle+0x0/0x44
>  [<ffffffff8020a9b6>] apic_timer_interrupt+0x66/0x70
>  <EOI>  [<ffffffff8020882a>] default_idle+0x2f/0x44
>  [<ffffffff8020804c>] enter_idle+0x22/0x24
>  [<ffffffff802088d0>] cpu_idle+0x91/0xd4
>  [<ffffffff80218572>] start_secondary+0x2e3/0x2f5
> 
> ---
>  drivers/net/forcedeth.c |   20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> Index: linux/drivers/net/forcedeth.c
> ===================================================================
> --- linux.orig/drivers/net/forcedeth.c
> +++ linux/drivers/net/forcedeth.c
> @@ -2908,6 +2908,10 @@ static irqreturn_t nv_nic_irq(int foo, v
>  			spin_unlock(&np->lock);
>  			break;
>  		}
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock(&np->lock);
>  			/* disable interrupts on the nic */
> @@ -3026,6 +3030,10 @@ static irqreturn_t nv_nic_irq_optimized(
>  			break;
>  		}
>  
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock(&np->lock);
>  			/* disable interrupts on the nic */
> @@ -3076,6 +3084,10 @@ static irqreturn_t nv_nic_irq_tx(int foo
>  			dprintk(KERN_DEBUG "%s: received irq with events 0x%x. Probably TX fail.\n",
>  						dev->name, events);
>  		}
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock_irqsave(&np->lock, flags);
>  			/* disable interrupts on the nic */
> @@ -3191,6 +3203,10 @@ static irqreturn_t nv_nic_irq_rx(int foo
>  			}
>  		}
>  
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock_irqsave(&np->lock, flags);
>  			/* disable interrupts on the nic */
> @@ -3264,6 +3280,10 @@ static irqreturn_t nv_nic_irq_other(int 
>  			printk(KERN_DEBUG "%s: received irq with unknown events 0x%x. Please report\n",
>  						dev->name, events);
>  		}
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock_irqsave(&np->lock, flags);
>  			/* disable interrupts on the nic */
