[Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI
@ 2016-11-07 13:53 bugzilla-daemon
  2016-11-07 15:25 ` [Bug 187231] " bugzilla-daemon
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: bugzilla-daemon @ 2016-11-07 13:53 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=187231

            Bug ID: 187231
           Summary: kernel panic during hpsa MSI plus tg3 MSI
           Product: IO/Storage
           Version: 2.5
    Kernel Version: 4.8.6
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: SCSI
          Assignee: linux-scsi@vger.kernel.org
          Reporter: kernelorg@bof.de
        Regression: No

Created attachment 243801
  --> https://bugzilla.kernel.org/attachment.cgi?id=243801&action=edit
kernel 4.8.6 .config

I'm not sure whether this is a SCSI / HPSA bug or a networking / tg3 driver
bug. Both are seen in the stack dump. As the trigger seems to be HPSA I'm
reporting as a SCSI issue here...

I've been recently attempting to run mainline 4.8.x kernels, most recently
4.8.6, on our production HP DL 380 Intel servers.

On several of them there is some related issue reported in
https://bugzilla.kernel.org/show_bug.cgi?id=187221 where the HPSA driver on
some of the hosts sometimes resets the logical device. I had seen that already
with 4.4.x kernels, and again with 4.8.6.

Now, specifically with 4.8.6, on the box which has the worst of these symptoms,
I _additionally_ experienced multiple full kernel panics. The same box (with
the same hpsa reset symtoms) had been running 4.4.x kernels before without such
kernel panics. The panics then happened multiple times with about a day in
between.

On the last round I had the ILO SSH console running under screen with logging
enabled, and was able to retrieve the following panic backtrace:

[187283.903173] hpsa 0000:03:00.0: scsi 0:1:0:0: resetting logical 
Direct-Access     HP       LOGICAL VOLUME   RAID-5 SSDSmartPathCap- En- Exp=1   
[187314.331375] sd 0:1:0:0: rejecting I/O to offline device                     
[187314.413441] sd 0:1:0:0: rejecting I/O to offline device                     
[187314.854183] sd 0:1:0:0: rejecting I/O to offline device                     
... lots of these ...
[187328.991285] sd 0:1:0:0: rejecting I/O to offline device                     
[187328.991389] sd 0:1:0:0: rejecting I/O to offline device                     
[187329.190166] sd 0:1:0:0: rejecting I/O to offline device                     
[187329.271304]  ffff88bd1a7e8000 ffff88bd1a7be500 ffff88bd7f483eb8
ffffffff8143
493f                                                                            
[187329.271304] Call Trace:                                                     
[187329.271310]  <IRQ>                                                          
[187329.271310]  [<ffffffffa002e332>] ? tg3_poll_msix+0xc2/0x160 [tg3]          
[187329.271311]  [<ffffffff8143493f>] do_hpsa_intr_msi+0x8f/0x1c0               
[187329.271314]  [<ffffffff81148c46>] __handle_irq_event_percpu+0x66/0xe0       
[187329.271315]  [<ffffffff81148cde>] handle_irq_event_percpu+0x1e/0x50         
[187329.271316]  [<ffffffff81148d37>] handle_irq_event+0x27/0x50                
[187329.271318]  [<ffffffff8114bda5>] handle_edge_irq+0x65/0x140                
[187329.271320]  [<ffffffff81057255>] handle_irq+0x15/0x20                      
[187329.271321]  [<ffffffff81057086>] do_IRQ+0x46/0xd0                          
[187329.271324]  [<ffffffff816dc4fc>] common_interrupt+0x7c/0x7c                
[187329.271325]  <EOI>                                                          
[187329.271338] Code: 53 48 89 fb 48 83 ec 28 4c 8b a7 5c 02 00 00 4c 8b bf 40
0
2 00 00 4c 8b b7 38 02 00 00 4c 8b af 4c 02 00 00 49 8b 04 24 4c 89 e7 <48> 8b
8
0 98 00 00 00 48 89 45 c0 49 8b 87 d0 01 00 00 48 89 45                         
[187329.271339] RIP  [<ffffffff81431417>] complete_scsi_command+0x37/0x8c0      
[187329.271339]  RSP <ffff88bd7f483e38>                                         
[187329.271339] CR2: 0000000000000098                                           
[187329.271341] ---[ end trace 52898916f0da5c53 ]---                            
[187329.273413] Kernel panic - not syncing: Fatal exception in interrupt        
[187330.308465] Shutting down cpus with NMI                                     
[187330.308471] Kernel Offset: disabled                                         
[187330.919173] Rebooting in 300 seconds..  

I'll attach my kernel .config.

As this is a production system and so far the panics only hit with our usual
(webserver and DB kvm machine) production load active, there's not much testing
or bisecting I can do, but I didn't want to drop the issue unreported, either. 

Hope this helps somebody. If there is any more info I can provide, just ask
what would be useful.

(I'm back to running 4.4.x)

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 187231] kernel panic during hpsa MSI plus tg3 MSI
  2016-11-07 13:53 [Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI bugzilla-daemon
@ 2016-11-07 15:25 ` bugzilla-daemon
  2016-11-07 15:45 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2016-11-07 15:25 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=187231

Don <don.brace@microsemi.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |don.brace@microsemi.com

--- Comment #1 from Don <don.brace@microsemi.com> ---
Created attachment 243811
  --> https://bugzilla.kernel.org/attachment.cgi?id=243811&action=edit
Patch to correct resets

I will be uploading this patch to linux-scsi this week.

I am attaching the patch in case you would like to test this patch now.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 187231] kernel panic during hpsa MSI plus tg3 MSI
  2016-11-07 13:53 [Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI bugzilla-daemon
  2016-11-07 15:25 ` [Bug 187231] " bugzilla-daemon
@ 2016-11-07 15:45 ` bugzilla-daemon
  2016-11-07 16:16 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2016-11-07 15:45 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=187231

--- Comment #2 from Patrick Schaaf <kernelorg@bof.de> ---
Thanks Don for the reaction!

Right now, on the box that had that panic and the worst resetting/reset issues
(see the other bug I linked), I'm back to 3.14.79, and want to stay there for
another 24 to 36 hours, to see that this issue was not present with that kernel
series.

What would your patch help with? Specifically the panic potential in case a
logical device reset is ongoing? Or should it affect / remedy the mysterious
(to me) "resetting logical" events in the first place?

I'm willing to test patches on that box starting Thursday, but I'd like to
understand a bit better what we are dealing with here.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 187231] kernel panic during hpsa MSI plus tg3 MSI
  2016-11-07 13:53 [Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI bugzilla-daemon
  2016-11-07 15:25 ` [Bug 187231] " bugzilla-daemon
  2016-11-07 15:45 ` bugzilla-daemon
@ 2016-11-07 16:16 ` bugzilla-daemon
  2016-11-08 16:01 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2016-11-07 16:16 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=187231

--- Comment #3 from Don <don.brace@microsemi.com> ---

(In reply to Patrick Schaaf from comment #2)
> Thanks Don for the reaction!
> 
> Right now, on the box that had that panic and the worst resetting/reset
> issues (see the other bug I linked), I'm back to 3.14.79, and want to stay
> there for another 24 to 36 hours, to see that this issue was not present
> with that kernel series.
> 
> What would your patch help with? Specifically the panic potential in case a
> logical device reset is ongoing? Or should it affect / remedy the mysterious
> (to me) "resetting logical" events in the first place?
> 
> I'm willing to test patches on that box starting Thursday, but I'd like to
> understand a bit better what we are dealing with here.

The specific issue that this patch addresses is that during a reset,
complete_scsi_command returns without having called scsi_done which causes the
OS to offline the disk (after two more occurrences). But this code path is not
often followed so the issue does not happen with all resets.

There are some other recent patches that should also be tested that have been
recently applied.

>From git format-patch:
0457-scsi-hpsa-Check-for-null-device-pointers.patch
    * This checks for a NULL device that can happen if the OS
      off-lines the disk because of the afore mentioned reset issue.
0460-scsi-hpsa-Check-for-null-devices-in-ioaccel-submissi.patch
0462-scsi-hpsa-correct-call-to-hpsa_do_reset.patch
    * Fine tunes resets into LOGICAL/Physical resets.

A patch I still have pending on linux-scsi
0464-hpsa-add-generate-controller-NMI-on-lockup.patch
    * This patch just adds more granularity on lock-up detection.

It would be nice to know why the reset is happening in the first place.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 187231] kernel panic during hpsa MSI plus tg3 MSI
  2016-11-07 13:53 [Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI bugzilla-daemon
                   ` (2 preceding siblings ...)
  2016-11-07 16:16 ` bugzilla-daemon
@ 2016-11-08 16:01 ` bugzilla-daemon
  2016-11-11  9:54 ` bugzilla-daemon
  2016-11-16  6:17 ` bugzilla-daemon
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2016-11-08 16:01 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=187231

--- Comment #4 from Patrick Schaaf <kernelorg@bof.de> ---
That problematic box, which showed the kernel panic with 4.8.6, and the
resetting/reset-up-to-20-seconds pauses several times a day with both 4.8 and
4.4.x, has now been running on 3.14.79 (with the same kvm load as before), for
30 hours, without any such HPSA resetting symptoms, or untoward pauses in the
VMs that I could otherwise notice in monitoring.

So somehow 3.14 does not trigger these episodes, or so it seems.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 187231] kernel panic during hpsa MSI plus tg3 MSI
  2016-11-07 13:53 [Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI bugzilla-daemon
                   ` (3 preceding siblings ...)
  2016-11-08 16:01 ` bugzilla-daemon
@ 2016-11-11  9:54 ` bugzilla-daemon
  2016-11-16  6:17 ` bugzilla-daemon
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2016-11-11  9:54 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=187231

--- Comment #5 from Patrick Schaaf <kernelorg@bof.de> ---
After almost 4 days my problematic box downgraded to 3.14.79, finally made some
noise, like this:

2016-11-11T03:31:10.608539+01:00 kvm3f kernel: [320020.727691] hpsa
0000:03:00.0: Abort request on C0:B0:T0:L0
2016-11-11T03:31:10.608555+01:00 kvm3f kernel: [320020.728175] hpsa
0000:03:00.0: cp ffff8868f2c17000 is reported invalid (probably means target
device no longer present)
2016-11-11T03:31:10.608557+01:00 kvm3f kernel: [320020.728796] hpsa
0000:03:00.0: cp ffff8868f2c17000 is reported invalid (probably means target
device no longer present)
2016-11-11T03:31:10.608558+01:00 kvm3f kernel: [320020.729389] hpsa
0000:03:00.0: FAILED abort on device C0:B0:T0:L0
2016-11-11T03:31:10.608560+01:00 kvm3f kernel: [320020.729708] hpsa
0000:03:00.0: resetting device 0:0:0:0
2016-11-11T03:31:26.968534+01:00 kvm3f kernel: [320037.081397] hpsa
0000:03:00.0: device is ready.

So, maybe there is a somewhat weirdly faulty drive in that array, which
otherwise does not show any (SMART / ILO logs) symptoms...

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 187231] kernel panic during hpsa MSI plus tg3 MSI
  2016-11-07 13:53 [Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI bugzilla-daemon
                   ` (4 preceding siblings ...)
  2016-11-11  9:54 ` bugzilla-daemon
@ 2016-11-16  6:17 ` bugzilla-daemon
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2016-11-16  6:17 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=187231

Patrick Schaaf <kernelorg@bof.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |UNREPRODUCIBLE

--- Comment #6 from Patrick Schaaf <kernelorg@bof.de> ---
After several more such Abort request / reset sequences with 3.14.79, two days
ago the box _finally_ announced that one of its 8 drives has a SMART
"predictive failure"; after swapping that drive for a spare, the symptoms are
no longer seen.

This is the third or fourth time, over the last year, that I've seen Gen9
servers with P440ar cards behave that way.

Anyway, my immediate test case is gone, so I'll close this as RESOLVED /
unreproducible...

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-11-16  6:17 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-07 13:53 [Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI bugzilla-daemon
2016-11-07 15:25 ` [Bug 187231] " bugzilla-daemon
2016-11-07 15:45 ` bugzilla-daemon
2016-11-07 16:16 ` bugzilla-daemon
2016-11-08 16:01 ` bugzilla-daemon
2016-11-11  9:54 ` bugzilla-daemon
2016-11-16  6:17 ` bugzilla-daemon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).