Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Kok, Auke" <auke-jan.h.kok@intel.com>
To: Attila Nagy <bra@fsn.hu>
Cc: linux-kernel@vger.kernel.org
Subject: Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
Date: Mon, 30 Jul 2007 09:18:46 -0700	[thread overview]
Message-ID: <46AE0F66.7000504@intel.com> (raw)
In-Reply-To: <46AE0420.4030900@fsn.hu>

Attila Nagy wrote:
> Hello,
> 
> I have four identical machines, based on Supermicro X7DBE motherboards.
> All of the machines have two Xeon 5130 CPUs, 16GB RAM and two add-on
> cards: an Areca 1261 SATA RAID HBA and a Qlogic QLA2342 fibre channel HBA.
> 
> I would like to use these as file servers (via FC), but during the 
> performance and
> reliabilty tests it turned out that the machines are very unreliable, 
> despite that they
> seemed to be OK hardware-wise (memtest and the usual stuff).
> 
> During the debugging of this (seemingly) high IO load related problem, I 
> have
> observed the following:
> - when MSI is enabled (the first iteration), the machines sometimes 
> "hang", but
> not the whole system, just the SCSI target subsystem (SCST), which makes
> heavy IO on the Areca (arcmsr) and the Qlogic (patched qla2xxx) HBAs
> - when MSI is disabled, I couldn't reproduce that hung up state, instead the
> machines sometimes throw an MCE (see below), but I couldn't find its cause
> - when MSI is disabled, and CONFIG_DEBUG_SHIRQ is enabled, the machines
> can't even boot normally, I get an oops instantly during the kernel 
> initialization
> - with MSI disabled sometimes the machines fail to respond, the ssh 
> sessions terminate
> and on the console I can't type for very long seconds. I have nearly all 
> debugging turned on,
> but can't see anything in the logs or on the console. The machine 
> recovers from this hang
> automatically. The whole thing seems like when a high (eg. network) 
> interrupt activity happens
> on a highly loaded machine, but I could observe this even after a fresh 
> boot, without anything
> (of course minus the standard stuff, sshd, and the others) running on 
> the machine.
> 
> The kernel is 2.6.21.5 (I've tried 2.6.18, the effects are the same), 
> running in 64 bit mode.

something is definately not happy on this system. There was a e1000 fix related 
to DEBUG_SHIRQ in 2.6.22, so I definately advise you to test 2.6.22.1 
immediately - however:

> The oops I get with MSI disabled and CONFIG_DEBUG_SHIRQ enabled:
> 
> [   92.681320] NET: Registered protocol family 17
> [   93.491658] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> [   93.557402]  [<0000000000000000>]
> [   93.626770] PGD 0
> [   93.651106] Oops: 0010 [1] SMP
> [   93.689170] CPU 1
> [   93.713506] Modules linked in:
> [   93.750322] Pid: 1, comm: swapper Not tainted 2.6.21.5 #1
> [   93.815011] RIP: 0010:[<0000000000000000>]  [<0000000000000000>]
> [   93.887187] RSP: 0018:ffff81042fc5dc68  EFLAGS: 00010002
> [   93.950836] RAX: ffff81042fbe6b70 RBX: 0000000000000202 RCX: ffff81042fbe6b70
> [   94.036323] RDX: ffffc20000040000 RSI: ffff81042f51cdf8 RDI: ffff81042fbe6800
> [   94.121812] RBP: ffff81042fc5dd10 R08: 0000000000000000 R09: ffff81042f4c0ea8
> [   94.207298] R10: 0000000000000000 R11: ffff81042fbe6800 R12: 00000000fffffff4
> [   94.292788] R13: ffff81042fbe6000 R14: 0000000000000001 R15: ffffffff80399450
> [   94.378275] FS:  0000000000000000(0000) GS:ffff81042fc694c8(0000) knlGS:0000000000000000
> [   94.475307] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [   94.544153] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> [   94.629643] Process swapper (pid: 1, threadinfo ffff81042fc5c000, task ffff81042fc58040)
> [   94.726673] Stack:  ffffffff80399559 ffff81042fc5dca0 ffffffff802121b2 ffff81042f4c0ea0
> [   94.823603]  ffff81042fc5dca0 ffffffff80221cd7 ffffffff802addb5 ffff81042fbe6800
> [   94.913042]  ffffffff8020c9bc ffff81042fbe6b70 0000000000000246 ffff81042fbe6b70
> [   95.000194] Call Trace:
> [   95.031814]  [<ffffffff80399559>] e1000_intr+0x109/0x590
> [   95.095461]  [<ffffffff802121b2>] poison_obj+0x42/0x60
> [   95.157027]  [<ffffffff80221cd7>] dbg_redzone1+0x17/0x30
> [   95.220676]  [<ffffffff802addb5>] request_irq+0x95/0x150
> [   95.284324]  [<ffffffff8020c9bc>] cache_alloc_debugcheck_after+0x17c/0x1c0
> [   95.366690]  [<ffffffff8020a43d>] kmem_cache_alloc+0xcd/0xf0
> [   95.434500]  [<ffffffff80399450>] e1000_intr+0x0/0x590
> [   95.496067]  [<ffffffff802ade00>] request_irq+0xe0/0x150
> [   95.559716]  [<ffffffff8039558c>] e1000_request_irq+0x3c/0x80
> [   95.628564]  [<ffffffff803985bc>] e1000_open+0x5c/0x100
> [   95.691172]  [<ffffffff8041d937>] dev_open+0x37/0x80
> [   95.750661]  [<ffffffff8041becd>] dev_change_flags+0x6d/0x150
> [   95.819508]  [<ffffffff80616565>] ip_auto_config+0x175/0xea0
> [   95.887317]  [<ffffffff80442f88>] tcp_set_default_congestion_control+0x18/0x70
> [   95.973947]  [<ffffffff80442fcf>] tcp_set_default_congestion_control+0x5f/0x70
> [   96.060582]  [<ffffffff80265236>] _spin_unlock+0x26/0x30
> [   96.124227]  [<ffffffff805f1754>] init+0x1a4/0x2b0
> [   96.181635]  [<ffffffff802a0e7b>] trace_hardirqs_on+0x14b/0x180
> [   96.252563]  [<ffffffff8025ff28>] child_rip+0xa/0x12
> [   96.312051]  [<ffffffff8026563b>] _spin_unlock_irq+0x2b/0x40
> [   96.379859]  [<ffffffff8025f63c>] restore_args+0x0/0x30
> [   96.442467]  [<ffffffff805f15b0>] init+0x0/0x2b0
> [   96.497795]  [<ffffffff8025ff1e>] child_rip+0x0/0x12
> [   96.557282]
> [   96.575170]
> [   96.575171] Code:  Bad RIP value.
> [   96.633203] RIP  [<0000000000000000>]
> [   96.677297]  RSP <ffff81042fc5dc68>
> [   96.719105] CR2: 0000000000000000
> [   96.758835] Kernel panic - not syncing: Attempted to kill init!
> 
> 
> MCE:
> [153103.918654] HARDWARE ERROR
> [153103.918655] CPU 1: Machine Check Exception:                5 Bank 0: 
> b200004010000400
> [153104.066037] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
> [153104.145699] TSC 1167e915e93ce
> [153104.183554] This is not a software problem!
> [153104.234724] Run through mcelog --ascii to decode and contact your 
> hardware vendor
> [153104.325517]
> [153104.325518] HARDWARE ERROR
> [153104.325519] CPU 1: Machine Check Exception:                5 Bank 5: 
> b200221024080400
> [153104.472883] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
> [153104.552546] TSC 1167e915e9ea8
> [153104.590402] This is not a software problem!
> [153104.641572] Run through mcelog --ascii to decode and contact your 
> hardware vendor
> [153104.732365] Kernel panic - not syncing: Machine check

this is serious problems that might not be resolved and be a symptom of a true 
hardware issue. Looking at the time it seems an issue on itself and unrelated to 
the e1000 debug_shirq fix.

> I've got exactly the same errors (only the TSC and the CPU value 
> changing) on all four machines,
> could this really be a hardware error?

yes

> full dmesg: http://people.fsn.hu/~bra/linux/x7dbe-20070730/with_debug_shirq
> dmesg with MSI enabled: 
> http://people.fsn.hu/~bra/linux/x7dbe-20070730/with_msi_and_debug_shirq
> kernel config: http://people.fsn.hu/~bra/linux/x7dbe-20070730/config
> 
> I've tried to disable all possible devices which consume interrupts,
> and placed the cards into various slots (the FC HBA is PCI-X, which the 
> Areca
> is PCI-E). Currently the arcmsr and (one of, it's a dual channel
> HBA) qla2xxx are on a shared IRQ.
> 
> Could you please help?
> Do you think this is related to the strange hang under high IO load, the
> occasional, complete "blackouts", where all ssh network sessions
> time out, but the machine recovers, and the MCEs?


Like I said, please try 2.6.22 which should have the e1000 issue fixed. The MCE 
looks real and a different problem

Auke

next prev parent reply	other threads:[~2007-07-30 16:27 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-07-30 15:30 Hangs and reboots under high loads, oops with DEBUG_SHIRQ Attila Nagy
2007-07-30 16:18 ` Kok, Auke [this message]
2007-07-30 16:19 ` Alan Cox
2007-07-31 14:21   ` Attila Nagy
2007-07-31 22:08     ` Roger Heflin
2007-08-02  8:29       ` Attila Nagy
2007-08-04 22:56         ` Mr. James W. Laferriere

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=46AE0F66.7000504@intel.com \
    --to=auke-jan.h.kok@intel.com \
    --cc=bra@fsn.hu \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.