From: Joe Malicki <jmalicki@metacarta.com>
To: Sumant Patro <Sumant.Patro@lsi.com>
Cc: linux-scsi@vger.kernel.org, linux-poweredge@dell.com
Subject: megaraid_sas xscale interrupt mask?
Date: Wed, 03 Jan 2007 20:41:24 -0500
Message-ID: <459C5B44.30202@metacarta.com>
In-Reply-To: <459C455C.4090707@metacarta.com>

Hi Sumant,

While debugging problems with our Dell PERC 5/i RAID controller and the
megaraid_sas driver, we have been comparing the driver in the Red Hat
EL 4 kernel (which Dell officially supports) with the one in the stock
Linux 2.6.17.13 kernel we use.  We found a change, introduced in
Linux 2.6.16, that seems very odd to us:

http://groups.google.com/group/fa.linux.kernel/browse_frm/thread/51f889bd09bafd2d/cbbe2a30b8c2eb94?lnk=st&q=outbound_intr_mask+0x1f+0x00000001&rnum=1#cbbe2a30b8c2eb94

The thread is titled "megaraid_sas: new template defined to
represent each type of controllers", and it introduces this curious change:

 /**
  * megasas_disable_intr -      Disables interrupts
  * @regs:                      MFI register set
  */
 static inline void
 megasas_disable_intr(struct megasas_register_set __iomem * regs)
 {
-       u32 mask = readl(&regs->outbound_intr_mask) & (~0x00000001);
+       u32 mask = 0x1f;
        writel(mask, &regs->outbound_intr_mask);

        /* Dummy readl to force pci flush */

Interrupts are enabled by writing "1" to the same register.
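
To make the difference concrete, here is a minimal sketch in userspace C
(not driver code) of the value each version of megasas_disable_intr would
write.  The names ENABLE_VALUE, old_disable_value, new_disable_value and
compare_disable_values are made up for illustration; the only facts taken
from the diff and the note above are the three constants.  The sketch makes
no assumption about what each bit of outbound_intr_mask means on a given
controller.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical name: the value the enable path writes, per the note above. */
#define ENABLE_VALUE UINT32_C(0x00000001)

/* Pre-2.6.16 behaviour: read-modify-write that clears only bit 0. */
static uint32_t old_disable_value(uint32_t current_mask)
{
    return current_mask & ~UINT32_C(0x00000001);
}

/* 2.6.16 behaviour: a constant, regardless of the current register value. */
static uint32_t new_disable_value(void)
{
    return UINT32_C(0x1f);
}

/* Print both results for a register that currently holds current_mask. */
static void compare_disable_values(uint32_t current_mask)
{
    uint32_t old_val = old_disable_value(current_mask);
    uint32_t new_val = new_disable_value();

    /* The old code always leaves bit 0 clear; the new code always sets it. */
    assert((old_val & 1u) == 0);
    assert((new_val & 1u) == 1);

    printf("current=0x%08lx old=0x%08lx new=0x%08lx enable=0x%08lx\n",
           (unsigned long)current_mask, (unsigned long)old_val,
           (unsigned long)new_val, (unsigned long)ENABLE_VALUE);
}
```

With the register holding 1 (the enable value), the old code would write 0,
while the new code writes 0x1f, which has bit 0 set just as the enable value
does.  That is the part that looks odd to us.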

Is there a specific reason for this?  Is it possible that Dell PERC 5/i
controllers differ from LSI controllers in this respect?  It seems odd
that this change would be introduced without any explanation for what
it's meant to do, so I am very curious if it could be an inadvertently
introduced bug that is causing some problems.

Thanks!
Joe Malicki

-- 
Joseph Malicki
Software Engineer
Metacarta, Inc.
350 Massachusetts Avenue
4th Floor
Cambridge, MA 02451 USA

email: joe.malicki@metacarta.com

http://www.metacarta.com

Joe Malicki wrote:
> After upgrading to the new 5.0.3-0001 "package build" firmware, released
> 12/12/06, from
> http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&osl=en&deviceid=9182&releaseid=R141188,
> we just experienced one firmware problem that's leaving a clear
> traceback.  I don't know if this is
> 
> 1) the same problem we were experiencing before, that the new firmware
> introduced debugging/a detailed error message for (if this is the case,
> I do really appreciate that Dell did this, since it may help to fix
> these problems eventually),
> 2) A problem introduced by the new firmware, or
> 3) A preexisting problem that we never happened to experience before.
> 
> In the firmware logs at the end of this message, note that just 15
> minutes after a battery relearn is finished and the battery finished
> charging, we see the message:
> 
> 01/02/07  0:33:50: Diag Retention test is running...all activities are
> stopped
> 
> This corresponds to when the megasas driver timed out SCSI commands and
> the controller stopped responding.
> 
> 1) Does anyone know what a "Diag Retention test" is?  Documentation
> mentions "BBU Retention tests" and "NVRAM Retention tests", but not
> "Diag Retention test" - is the "Diag Retention test" a synonym for one
> of these, or is it something different?
> 2) Has anyone seen a similar failure?
> 
> Note that 4 hours after the controller went offline, a stack
> backtrace with a firmware source code file and line number appears in
> the firmware logs, which is something I wouldn't expect to happen under
> any circumstances on a stable product, and the firmware seems to drop
> to a debug console.  (We haven't tried hooking up a serial port to what
> look like the headers on the PERC card; we didn't experiment much the
> first time it happened, as it's a production machine we wanted to get
> back up quickly.)
> 
> We have previously noticed failures corresponding with patrol reads, and
> this failure takes place several hours later, and the traceback happens
> within the "PatrolReadTimer" procedure - is this the same failure as before?
> 
> We don't yet have a clear reproduction case, but we are working on one
> using the additional information from this crash.  We have begun remote
> logging to capture the state of the machine as it is dying, because in
> previous crashes syslog could not write to disk, which limited the
> information we could get.
> 
> Thanks,
> Joe
> 
> Logs follow:
> 
> 01/01/07 20:16:57: PR cycle complete
> 01/01/07 20:16:57: EVT#06277-01/01/07 20:16:57:  35=Patrol Read complete
> 01/01/07 20:16:57: Next PR scheduled to start at 01/02/07 18:13:20
> 01/01/07 21:17:01: EVT#06278-01/01/07 21:17:01:  44=Time established as
> 01/01/07 21:17:01; (1727059 seconds since power on)
> 01/01/07 21:23:40: EVT#06279-01/01/07 21:23:40: 162=Current capacity of
> the battery is below threshold
> 01/01/07 21:23:40: EVT#06280-01/01/07 21:23:40: 195=BBU disabled;
> changing WB virtual disks to WT
> 01/01/07 21:26:40: EVT#06281-01/01/07 21:26:40: 153=Battery relearn
> completed
> 01/01/07 21:26:40: Learn completed successfully
> 01/01/07 21:26:40: Next Learn will start on 04 01 2007
> 
> 01/01/07 21:26:40:       *** BATTERY FEATURE PROPERTIES ***
> 01/01/07 21:26:40:  _________________________________________________
> 
> 01/01/07 21:26:40:       Auto Learn Period     : 90  days
> 01/01/07 21:26:40:       Next Learn Time       : 228778000
> 01/01/07 21:26:40:       Battery ID            : 34ec019f
> 01/01/07 21:26:40:       Delayed Learn Interval: 0  hours from scheduled
> time
> 01/01/07 21:26:40:       Next Learn cheduled on: 04 01 2007
> 01/01/07 21:26:40:  _________________________________________________
> 
> 01/01/07 21:26:55: EVT#06282-01/01/07 21:26:55: 147=Battery started charging
> 01/01/07 21:26:55: EVT#06283-01/01/07 21:26:55: 162=Current capacity of
> the battery is below threshold
> 01/01/07 21:49:40: EVT#06284-01/01/07 21:49:40: 163=Current capacity of
> the battery is above threshold
> 01/01/07 21:49:40: EVT#06285-01/01/07 21:49:40: 194=BBU enabled;
> changing WT virtual disks to WB
> 01/01/07 23:16:52: EVT#06286-01/01/07 23:16:52:  73=VD 00/0 Properties
> updated to [ID=00,dcp=0d,ccp=0c,ap=0,dc=0,dbgi=0] (from
> [ID=00,dcp=0c,ccp=0c,ap=0,dc=0,dbgi=0])
> 01/02/07  0:18:05: EVT#06287-01/02/07  0:18:05: 242=Battery charge complete
> 01/02/07  0:33:50: Diag Retention test is running...all activities are
> stopped
> 01/02/07  4:41:08: TaskAdd: No more tasks available!!!
> [0]: fp=a00ffde4, lr=a0885aac  -  TaskAdd+7c
> [1]: fp=a00ffe00, lr=a086a3ac  -  PatrolReadTimer+fc
> [2]: fp=a00ffe40, lr=a0885f2c  -  TimerISR+a4
> [3]: fp=a00ffe60, lr=a088e428  -  FIQ_isr+48
> [4]: fp=a00ffe88, lr=a000a848  -  dbits+1787e34
> [5]: fp=a00ffe9c, lr=a000a24c  -  dbits+1787838
> [6]: fp=a00ffee4, lr=a0883440  -  kbhit+48
> [7]: fp=a00ffef8, lr=a0866e28  -  MonCheck+14
> [8]: fp=a00fff0c, lr=a0815930  -  diagRetentionCmdBlockDone+7c
> [9]: fp=a00fff34, lr=a084d630  -  CmdBlocked+1b4
> [10]: fp=a00fff60, lr=a0874c28  -  set_state+278
> [11]: fp=a00fff94, lr=a08748b0  -  raid_task+2f0
> [12]: fp=a00fffb8, lr=a088e0b0  -  main+3b0
> [13]: fp=a00fffe4, lr=a088c774  -  c_start+30
> [14]: fp=a00ffffc, lr=9e8804cc  -  _start+6c
> [15]: fp=a0018344, lr=a00061d0  -  dbits+17837bc
> [16]: fp=a00183fc, lr=4c0  -  000004c0
> MonTask: line 100 in file ../../raid/taskman.c
> INTCTL=16c00000:1003dcf, IINTSRC=0:0, FINTSRC=0:0, CPSR=600000d3,
> sp=a00ffb28
> MegaMon>
> 
> T0: LSI Logic MegaRAID firmware loaded
> T0: Firmware version 1.00.02-0163 built on Nov 13 2006 at 18:32:21
> T0: Board is type 1028/0015/1028/1f03
> 
> T0: Initializing 1MB memory pool
> T0: LogInit: Flushing events from previous boot
> T0: EVT#06288-01/02/07  4:41:08:  15=Fatal firmware error: Line 100 in
> ../../raid/taskman.c
> 
> T0: EVT#06289-T0:   0=Firmware initialization started (PCI ID
> 0015/1028/1f03/1028)
> T0: EVT#06290-T0:   1=Firmware version 1.00.02-0163
> T0: EVT#06291-T0: 209=BBU Retention test was initiated on previous boot
> T12: EVT#06292-T12: 210=BBU Retention test passed
> T12: EVT#06293-T12: 212=NVRAM Retention test was initiated on previous boot
> T12: EVT#06294-T12: 213=NVRAM Retention test passed
> T12: Authenticating RAID key: Done!
> 
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge@dell.com
> http://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq
> 

