megaraid_sas xscale interrupt mask?

From: Joe Malicki <jmalicki@metacarta.com>
To: Sumant Patro <Sumant.Patro@lsi.com>
Cc: linux-scsi@vger.kernel.org, linux-poweredge@dell.com
Subject: megaraid_sas xscale interrupt mask?
Date: Wed, 03 Jan 2007 20:41:24 -0500
Message-ID: <459C5B44.30202@metacarta.com>
In-Reply-To: <459C455C.4090707@metacarta.com>

Hi Sumant,

While trying to debug problems with our Dell PERC 5/i RAID controllers
under the megaraid_sas driver, we've been comparing the driver in the
Red Hat EL 4 kernel (which Dell officially supports) against the one in
the stock Linux 2.6.17.13 kernel we use.  We found an interesting
change, introduced in Linux 2.6.16, that seems very odd to us:

http://groups.google.com/group/fa.linux.kernel/browse_frm/thread/51f889bd09bafd2d/cbbe2a30b8c2eb94?lnk=st&q=outbound_intr_mask+0x1f+0x00000001&rnum=1#cbbe2a30b8c2eb94

The thread is titled "megaraid_sas: new template defined to
represent each type of controllers", and the patch introduces this
curious change:

 /**
  * megasas_disable_intr -      Disables interrupts
  * @regs:                      MFI register set
  */
 static inline void
 megasas_disable_intr(struct megasas_register_set __iomem * regs)
 {
-       u32 mask = readl(&regs->outbound_intr_mask) & (~0x00000001);
+       u32 mask = 0x1f;
        writel(mask, &regs->outbound_intr_mask);

        /* Dummy readl to force pci flush */

Interrupts are enabled by writing "1" to the same register, so bit 0 is
set in both the enable value and the new disable value (0x1f).

Is there a specific reason for this?  Is it possible that Dell PERC 5/i
controllers differ from LSI controllers in this respect?  It seems odd
that this change would be introduced without any explanation for what
it's meant to do, so I am very curious if it could be an inadvertently
introduced bug that is causing some problems.
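
To make the difference concrete, here is a minimal user-space sketch of
the bit arithmetic only.  The helper names are hypothetical and not part
of the driver, nothing here touches hardware, and it assumes nothing
about what the individual mask bits mean on the controller; it just
compares the value written by the removed line with the value written
by the added line.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of megaraid_sas. */

/* Value written by the removed line: current contents with bit 0 cleared. */
static uint32_t old_disable_value(uint32_t current)
{
	return current & ~UINT32_C(0x00000001);
}

/* Value written by the added line: a constant, ignoring current contents. */
static uint32_t new_disable_value(void)
{
	return UINT32_C(0x1f);
}

/* Value written to enable interrupts, as stated in this message. */
static uint32_t enable_value(void)
{
	return UINT32_C(0x00000001);
}

/* Checks on the arithmetic only; this says nothing about the hardware. */
static void check_disable_values(void)
{
	uint32_t after_enable = enable_value();

	/* Old code clears bit 0, the bit that the enable path sets. */
	assert((old_disable_value(after_enable) & 1u) == 0);
	/* New code leaves bit 0 set and additionally sets bits 1-4. */
	assert((new_disable_value() & 1u) == 1u);
	assert(new_disable_value() == (enable_value() | UINT32_C(0x1e)));
}
```

In other words, the old code and the new code treat bit 0 in opposite
ways when disabling, which is what prompts the question above.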

Thanks!
Joe Malicki

-- 
Joseph Malicki
Software Engineer
Metacarta, Inc.
350 Massachusetts Avenue
4th Floor
Cambridge, MA 02451 USA

email: joe.malicki@metacarta.com

http://www.metacarta.com

Joe Malicki wrote:
> After upgrading to the new 5.0.3-0001 "package build" firmware, released
> 12/12/06, from
> http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&osl=en&deviceid=9182&releaseid=R141188,
> we just experienced one firmware problem that's leaving a clear
> traceback.  I don't know if this is
> 
> 1) the same problem we were experiencing before, for which the new
> firmware introduced debugging/a detailed error message (if this is the
> case, I really do appreciate that Dell did this, since it may help to
> fix these problems eventually),
> 2) a problem introduced by the new firmware, or
> 3) a preexisting problem that we never happened to experience before.
> 
> In the firmware logs at the end of this message, note that just 15
> minutes after a battery relearn had finished and the battery had
> finished charging, we see the message:
> 
> 01/02/07  0:33:50: Diag Retention test is running...all activities are
> stopped
> 
> This corresponds to when the megasas driver timed out SCSI commands and
> the controller stopped responding.
> 
> 1) Does anyone know what a "Diag Retention test" is?  Documentation
> mentions "BBU Retention tests" and "NVRAM Retention tests", but not
> "Diag Retention test" - is the "Diag Retention test" a synonym for one
> of these, or is it something different?
> 2) Has anyone seen a similar failure?
> 
> Note that 4 hours after the controller went offline, a stack
> backtrace, with a firmware source code file and line number, appears in
> the firmware logs, which is something I wouldn't expect to happen under
> any circumstances on a stable product.  The firmware also seems to drop
> to a debug console.  (We haven't tried hooking up a serial port to what
> look like the headers on the PERC card; we didn't experiment much the
> first time it happened, as it's a production machine we wanted to get
> back up quickly.)
> 
> We have previously noticed failures coinciding with patrol reads.  This
> failure takes place several hours after a patrol read completed, and the
> traceback passes through the "PatrolReadTimer" procedure - is this the
> same failure as before?
> 
> We don't yet have a clear reproduction case, but are working on one
> using additional information from this crash.  (We've begun remote
> logging to capture the state of the machine as it's dying, since in
> previous crashes syslog failed because it couldn't write to disk, which
> limited the information we could get.)
> 
> Thanks,
> Joe
> 
> Logs follow:
> 
> 01/01/07 20:16:57: PR cycle complete
> 01/01/07 20:16:57: EVT#06277-01/01/07 20:16:57:  35=Patrol Read complete
> 01/01/07 20:16:57: Next PR scheduled to start at 01/02/07 18:13:20
> 01/01/07 21:17:01: EVT#06278-01/01/07 21:17:01:  44=Time established as
> 01/01/07 21:17:01; (1727059 seconds since power on)
> 01/01/07 21:23:40: EVT#06279-01/01/07 21:23:40: 162=Current capacity of
> the battery is below threshold
> 01/01/07 21:23:40: EVT#06280-01/01/07 21:23:40: 195=BBU disabled;
> changing WB virtual disks to WT
> 01/01/07 21:26:40: EVT#06281-01/01/07 21:26:40: 153=Battery relearn
> completed
> 01/01/07 21:26:40: Learn completed successfully
> 01/01/07 21:26:40: Next Learn will start on 04 01 2007
> 
> 01/01/07 21:26:40:       *** BATTERY FEATURE PROPERTIES ***
> 01/01/07 21:26:40:  _________________________________________________
> 
> 01/01/07 21:26:40:       Auto Learn Period     : 90  days
> 01/01/07 21:26:40:       Next Learn Time       : 228778000
> 01/01/07 21:26:40:       Battery ID            : 34ec019f
> 01/01/07 21:26:40:       Delayed Learn Interval: 0  hours from scheduled
> time
> 01/01/07 21:26:40:       Next Learn cheduled on: 04 01 2007
> 01/01/07 21:26:40:  _________________________________________________
> 
> 01/01/07 21:26:55: EVT#06282-01/01/07 21:26:55: 147=Battery started charging
> 01/01/07 21:26:55: EVT#06283-01/01/07 21:26:55: 162=Current capacity of
> the battery is below threshold
> 01/01/07 21:49:40: EVT#06284-01/01/07 21:49:40: 163=Current capacity of
> the battery is above threshold
> 01/01/07 21:49:40: EVT#06285-01/01/07 21:49:40: 194=BBU enabled;
> changing WT virtual disks to WB
> 01/01/07 23:16:52: EVT#06286-01/01/07 23:16:52:  73=VD 00/0 Properties
> updated to [ID=00,dcp=0d,ccp=0c,ap=0,dc=0,dbgi=0] (from
> [ID=00,dcp=0c,ccp=0c,ap=0,dc=0,dbgi=0])
> 01/02/07  0:18:05: EVT#06287-01/02/07  0:18:05: 242=Battery charge complete
> 01/02/07  0:33:50: Diag Retention test is running...all activities are
> stopped
> 01/02/07  4:41:08: TaskAdd: No more tasks available!!!
> [0]: fp=a00ffde4, lr=a0885aac  -  TaskAdd+7c
> [1]: fp=a00ffe00, lr=a086a3ac  -  PatrolReadTimer+fc
> [2]: fp=a00ffe40, lr=a0885f2c  -  TimerISR+a4
> [3]: fp=a00ffe60, lr=a088e428  -  FIQ_isr+48
> [4]: fp=a00ffe88, lr=a000a848  -  dbits+1787e34
> [5]: fp=a00ffe9c, lr=a000a24c  -  dbits+1787838
> [6]: fp=a00ffee4, lr=a0883440  -  kbhit+48
> [7]: fp=a00ffef8, lr=a0866e28  -  MonCheck+14
> [8]: fp=a00fff0c, lr=a0815930  -  diagRetentionCmdBlockDone+7c
> [9]: fp=a00fff34, lr=a084d630  -  CmdBlocked+1b4
> [10]: fp=a00fff60, lr=a0874c28  -  set_state+278
> [11]: fp=a00fff94, lr=a08748b0  -  raid_task+2f0
> [12]: fp=a00fffb8, lr=a088e0b0  -  main+3b0
> [13]: fp=a00fffe4, lr=a088c774  -  c_start+30
> [14]: fp=a00ffffc, lr=9e8804cc  -  _start+6c
> [15]: fp=a0018344, lr=a00061d0  -  dbits+17837bc
> [16]: fp=a00183fc, lr=4c0  -  000004c0
> MonTask: line 100 in file ../../raid/taskman.c
> INTCTL=16c00000:1003dcf, IINTSRC=0:0, FINTSRC=0:0, CPSR=600000d3,
> sp=a00ffb28
> MegaMon>
> 
> T0: LSI Logic MegaRAID firmware loaded
> T0: Firmware version 1.00.02-0163 built on Nov 13 2006 at 18:32:21
> T0: Board is type 1028/0015/1028/1f03
> 
> T0: Initializing 1MB memory pool
> T0: LogInit: Flushing events from previous boot
> T0: EVT#06288-01/02/07  4:41:08:  15=Fatal firmware error: Line 100 in
> ../../raid/taskman.c
> 
> T0: EVT#06289-T0:   0=Firmware initialization started (PCI ID
> 0015/1028/1f03/1028)
> T0: EVT#06290-T0:   1=Firmware version 1.00.02-0163
> T0: EVT#06291-T0: 209=BBU Retention test was initiated on previous boot
> T12: EVT#06292-T12: 210=BBU Retention test passed
> T12: EVT#06293-T12: 212=NVRAM Retention test was initiated on previous boot
> T12: EVT#06294-T12: 213=NVRAM Retention test passed
> T12: Authenticating RAID key: Done!
> 
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge@dell.com
> http://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq
> 