* [PATCH] sparc64: sun4v TLB error power off events
@ 2014-09-07 15:47 Bob Picco
2014-09-09 19:22 ` David Miller
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: Bob Picco @ 2014-09-07 15:47 UTC (permalink / raw)
To: sparclinux
From: bob picco <bpicco@meloft.net>
We've witnessed a few TLB events causing the machine to power off because
of prom_halt. In one case it was some nfs related area during rmmod. Another
was an mmapper of /dev/mem. A more recent one is an ITLB issue with
a bad pagesize which could be a hardware bug. Bugs happen but we should
attempt to not power off the machine and/or hang it when possible.
This is a DTLB error from an mmapper of /dev/mem:
[root@sparcie ~]# SUN4V-DTLB: Error at TPC[fffff80100903e6c], tl 1
SUN4V-DTLB: TPC<0xfffff80100903e6c>
SUN4V-DTLB: O7[fffff801081979d0]
SUN4V-DTLB: O7<0xfffff801081979d0>
SUN4V-DTLB: vaddr[fffff80100000000] ctx[1250] pte[98000000000f0610] error[2]
.
This is recent mainline for ITLB:
[ 3708.179864] SUN4V-ITLB: TPC<0xfffffc010071cefc>
[ 3708.188866] SUN4V-ITLB: O7[fffffc010071cee8]
[ 3708.197377] SUN4V-ITLB: O7<0xfffffc010071cee8>
[ 3708.206539] SUN4V-ITLB: vaddr[e0003] ctx[1a3c] pte[2900000dcc800eeb] error[4]
.
We've treated DTLB/ITLB error events identically within the patch.
Should TL be <= 1, we proceed to die_if_kernel. Fully expect, though,
that for a privileged access the machine will be reset when
panic_on_oops is armed. Should panic_on_oops not be armed, the machine
remains up, but the quality and duration of that uptime will depend on
what the error condition caused. An unprivileged task is killed off
with a SIGSEGV.
Powering off large sparc64 machines is painful. Plus, die_if_kernel provides
more context. A reset sequence isn't brief on large sparc64, but it is
better than a power-off/power-on sequence.
For TL > 1 the machine still powers off abruptly, as it did before.
Cc: sparclinux@vger.kernel.org
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Bob Picco <bob.picco@oracle.com>
---
arch/sparc/kernel/traps_64.c | 16 ++++++++++++++--
1 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index fb6640e..6a34e96 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -2104,6 +2104,18 @@ void sun4v_nonresum_overflow(struct pt_regs *regs)
atomic_inc(&sun4v_nonresum_oflow_cnt);
}
+static void sun4v_tlb_error(struct pt_regs *regs, int tl, char *message)
+{
+ /* Should we be above TL=1 then we just prom_halt. Should
+ * pstate.priv have been true at trap time and panic_on_oops
+ * disabled then we proceed but YMMV.
+ */
+ if (tl > 1)
+ prom_halt();
+ else
+ die_if_kernel(message, regs);
+}
+
unsigned long sun4v_err_itlb_vaddr;
unsigned long sun4v_err_itlb_ctx;
unsigned long sun4v_err_itlb_pte;
@@ -2125,7 +2137,7 @@ void sun4v_itlb_error_report(struct pt_regs *regs, int tl)
sun4v_err_itlb_vaddr, sun4v_err_itlb_ctx,
sun4v_err_itlb_pte, sun4v_err_itlb_error);
- prom_halt();
+ sun4v_tlb_error(regs, tl, "ITLB HV ERROR");
}
unsigned long sun4v_err_dtlb_vaddr;
@@ -2149,7 +2161,7 @@ void sun4v_dtlb_error_report(struct pt_regs *regs, int tl)
sun4v_err_dtlb_vaddr, sun4v_err_dtlb_ctx,
sun4v_err_dtlb_pte, sun4v_err_dtlb_error);
- prom_halt();
+ sun4v_tlb_error(regs, tl, "DTLB HV ERROR");
}
void hypervisor_tlbop_error(unsigned long err, unsigned long op)
--
1.7.1
* Re: [PATCH] sparc64: sun4v TLB error power off events
2014-09-07 15:47 [PATCH] sparc64: sun4v TLB error power off events Bob Picco
@ 2014-09-09 19:22 ` David Miller
2014-09-09 21:12 ` Bob Picco
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2014-09-09 19:22 UTC (permalink / raw)
To: sparclinux
From: Bob Picco <bpicco@meloft.net>
Date: Sun, 7 Sep 2014 11:47:38 -0400
> We've witnessed a few TLB events causing the machine to power off because
> of prom_halt. In one case it was some nfs related area during rmmod. Another
> was an mmapper of /dev/mem. A more recent one is an ITLB issue with
> a bad pagesize which could be a hardware bug. Bugs happen but we should
> attempt to not power off the machine and/or hang it when possible.
prom_halt() should not power off the machine, but rather drop us to
the OF command line "ok" prompt.
Why doesn't it do that?
We properly do a >tl1 vs. tl1 etrap call, so we should be at trap
level zero when we call into the prom to "exit".
* Re: [PATCH] sparc64: sun4v TLB error power off events
2014-09-07 15:47 [PATCH] sparc64: sun4v TLB error power off events Bob Picco
2014-09-09 19:22 ` David Miller
@ 2014-09-09 21:12 ` Bob Picco
2014-09-09 21:52 ` David Miller
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Bob Picco @ 2014-09-09 21:12 UTC (permalink / raw)
To: sparclinux
David Miller wrote: [Tue Sep 09 2014, 03:22:37PM EDT]
> From: Bob Picco <bpicco@meloft.net>
> Date: Sun, 7 Sep 2014 11:47:38 -0400
>
> > We've witnessed a few TLB events causing the machine to power off because
> > of prom_halt. In one case it was some nfs related area during rmmod. Another
> > was an mmapper of /dev/mem. A more recent one is an ITLB issue with
> > a bad pagesize which could be a hardware bug. Bugs happen but we should
> > attempt to not power off the machine and/or hang it when possible.
>
> prom_halt() should not power off the machine, but rather drop us to
> the OF command line "ok" prompt.
I didn't know this. This would be ideal.
For my nearly P0 T4-2 it always powers off.
>
> Why doesn't it do that?
Don't know.
>
> We properly do a >tl1 vs. tl1 etrap call, so we should be at trap
> level zero when we call into the prom to "exit".
I agree.
I just ran a quick experiment on my T5-2, which is supported hardware. The
kernel is 3.17-rc3 without any modification from me, apart from ixgbe. As root,
I mmapped /dev/mem at address 0UL. It powered off:
4 GNU/Linux
[root@t5-2 ~]# [31732.360547] SUN4V-DTLB: Error at TPC[fffffc01001cac48], tl 1
[31732.371659] SUN4V-DTLB: TPC<0xfffffc01001cac48>
[31732.380652] SUN4V-DTLB: O7[100970]
[31732.387418] SUN4V-DTLB: O7<0x100970>
[31732.394548] SUN4V-DTLB: vaddr[fffffc0100028000] ctx[1634] pte[9a00000000000610] error[2]
Message from syslogd@t5-2 at Sep 9 16:53:25 ...
kernel:[31732.360547] SUN4V-DTLB: Error at TPC[fffffc01001cac48], tl 1
Message from syslogd@t5-2 at Sep 9 16:53:25 ...
kernel:[31732.371659] SUN4V-DTLB: TPC<0xfffffc01001cac48>
Message from syslogd@t5-2 at Sep 9 16:53:25 ...
kernel:[31732.380652] SUN4V-DTLB: O7[102014-09-09 20:35:34 SP> NOTICE: Host is off
. Some firmware widget we are unaware of?
Should you like the code, it is below.
thanx,
bob
<<CLIP HERE>>
#define _GNU_SOURCE
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#define PGSIZE (8192)
int main(int argc, char **argv)
{
unsigned long addr;
char buf[PGSIZE];
void *mmap_addr;
ssize_t size;
off_t offset;
int rc, fd;
if (argc != 2)
fprintf(stderr, "%s: 0xaddress\n", argv[0]), exit(1);
rc = sscanf(argv[1], "%lx", &addr);
if (rc != 1)
fprintf(stderr, "%s: address-format-invalid\n", argv[0]),
exit(1);
fd = open("/dev/mem", O_RDONLY);
if (fd < 0)
fprintf(stderr, "%s: failed to open /dev/mem\n", argv[0]),
exit(1);
offset = addr;
size = PGSIZE;
mmap_addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, offset);
if (mmap_addr == MAP_FAILED)
fprintf(stderr, "%s: failed mmap offset=0x%lx\n", argv[0],
(unsigned long) offset), exit(1);
memcpy(buf, mmap_addr, sizeof (buf));
(void) munmap(mmap_addr, size);
close(fd);
}
* Re: [PATCH] sparc64: sun4v TLB error power off events
2014-09-07 15:47 [PATCH] sparc64: sun4v TLB error power off events Bob Picco
2014-09-09 19:22 ` David Miller
2014-09-09 21:12 ` Bob Picco
@ 2014-09-09 21:52 ` David Miller
2014-09-10 14:18 ` Bob Picco
2014-09-10 18:39 ` David Miller
4 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2014-09-09 21:52 UTC (permalink / raw)
To: sparclinux
From: Bob Picco <bob.picco@oracle.com>
Date: Tue, 9 Sep 2014 17:12:27 -0400
> I just ran a quick experiment on my T5-2 which is supported hardware. The
> kernel is 3.17-rc3 without any modification from me - well ixgbe. As root mmap
> of /dev/mem at address 0UL. It powered off:
Just out of curiosity what ixgbe patches do you need that aren't
upstream already?
> 4 GNU/Linux
> [root@t5-2 ~]# [31732.360547] SUN4V-DTLB: Error at TPC[fffffc01001cac48], tl 1
> [31732.371659] SUN4V-DTLB: TPC<0xfffffc01001cac48>
> [31732.380652] SUN4V-DTLB: O7[100970]
> [31732.387418] SUN4V-DTLB: O7<0x100970>
> [31732.394548] SUN4V-DTLB: vaddr[fffffc0100028000] ctx[1634] pte[9a00000000000610] error[2]
>
> Message from syslogd@t5-2 at Sep 9 16:53:25 ...
> kernel:[31732.360547] SUN4V-DTLB: Error at TPC[fffffc01001cac48], tl 1
>
> Message from syslogd@t5-2 at Sep 9 16:53:25 ...
> kernel:[31732.371659] SUN4V-DTLB: TPC<0xfffffc01001cac48>
>
> Message from syslogd@t5-2 at Sep 9 16:53:25 ...
> kernel:[31732.380652] SUN4V-DTLB: O7[102014-09-09 20:35:34 SP> NOTICE: Host is off
> . Some firmware widget we are unaware of?
Hmmm...
Oh I see, if LDOMs are enabled we do ldom_power_off() instead of doing
an OF "exit".
That explains everything.
I seem to remember that for some reason after early boot it got to the
point with LDOMs that you had to stop talking to the OF, and that's
why for all of these interfaces that could be invoked after early
boot, we revector to a ldom_*() routine if ldom_domaining_enabled is
true.
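The revectoring described above can be pictured roughly as follows. This is a self-contained userspace sketch, not the kernel source: the flag name ldom_domaining_enabled and the routine name ldom_power_off() come from the discussion, but every body here is a stand-in that only records which path was taken, and prom_halt_model(), of_exit() and enum halt_path are names made up for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Which exit path a halt request ended up taking (illustrative only). */
enum halt_path { HALT_OF_EXIT, HALT_LDOM_POWER_OFF };

/* Stand-in for the kernel flag of the same name. */
static bool ldom_domaining_enabled;

/* Stand-in body: only reports that the LDOM power-off path was chosen. */
static enum halt_path ldom_power_off(void)
{
	puts("ldom: power off");
	return HALT_LDOM_POWER_OFF;
}

/* Hypothetical helper standing in for the OF "exit" call. */
static enum halt_path of_exit(void)
{
	puts("OF: exit to ok prompt");
	return HALT_OF_EXIT;
}

/* Model of the halt entry point: revector when domaining is enabled. */
static enum halt_path prom_halt_model(void)
{
	if (ldom_domaining_enabled)
		return ldom_power_off();
	return of_exit();
}
```

With the flag clear the model takes the OF "exit" path; with it set, the same call ends in the power-off path, which matches the behaviour seen on the T5-2.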
So I don't think there is anything we can do about this, so perhaps we
should just unconditionally avoid using prom_halt() here, and just do
a die_if_kernel() regardless of the trap level.
Also, for the >tl1 case, it would be beneficial to print out the stack
of trap state registers that etraptl1 saves on the stack right after
pt_regs. The format is traps_64.c's "struct tl1_traplog", and there
is a dump_tl1_traplog() helper there already.
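One way to picture that suggestion is the small userspace model below. It is a sketch under stated assumptions, not the eventual patch: handle_tlb_error() and the ACT_* flags are hypothetical names, and the only things taken from the thread are the policy itself (no prom_halt(), dump the TL1 trap log when tl > 1) and the outcomes described in the patch message (oops for a privileged access, SIGSEGV for an unprivileged task).

```c
#include <stdbool.h>

/* Hypothetical action flags for this model; not kernel definitions. */
#define ACT_DUMP_TL1_TRAPLOG  0x1u  /* would call dump_tl1_traplog() */
#define ACT_OOPS              0x2u  /* privileged access: oops path */
#define ACT_SIGSEGV           0x4u  /* unprivileged task is killed */

/*
 * Model of the proposed policy: never halt the machine. Dump the saved
 * trap log first when tl > 1, then take the die_if_kernel() outcome that
 * matches the privilege level at trap time.
 */
static unsigned int handle_tlb_error(int tl, bool privileged)
{
	unsigned int actions = 0;

	if (tl > 1)
		actions |= ACT_DUMP_TL1_TRAPLOG;
	actions |= privileged ? ACT_OOPS : ACT_SIGSEGV;
	return actions;
}
```

The point of returning a bitmask in the model is only to make the policy easy to check; the real handler would perform these actions rather than report them.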
Thanks for looking into this Bob.
* Re: [PATCH] sparc64: sun4v TLB error power off events
2014-09-07 15:47 [PATCH] sparc64: sun4v TLB error power off events Bob Picco
` (2 preceding siblings ...)
2014-09-09 21:52 ` David Miller
@ 2014-09-10 14:18 ` Bob Picco
2014-09-10 18:39 ` David Miller
4 siblings, 0 replies; 6+ messages in thread
From: Bob Picco @ 2014-09-10 14:18 UTC (permalink / raw)
To: sparclinux
David Miller wrote: [Tue Sep 09 2014, 05:52:46PM EDT]
> From: Bob Picco <bob.picco@oracle.com>
> Date: Tue, 9 Sep 2014 17:12:27 -0400
>
> > I just ran a quick experiment on my T5-2 which is supported hardware. The
> > kernel is 3.17-rc3 without any modification from me - well ixgbe. As root mmap
> > of /dev/mem at address 0UL. It powered off:
>
> Just out of curiosity what ixgbe patches do you need that aren't
> upstream already?
It is really Martin's (mkp) from last year. I ported it over to mainline.
Basically the mac is acquired with:
addr = of_get_property(dp, "local-mac-address", &len);
. I'll append at the end of this email what my local T5-2 is using.
>
> > 4 GNU/Linux
> > [root@t5-2 ~]# [31732.360547] SUN4V-DTLB: Error at TPC[fffffc01001cac48], tl 1
> > [31732.371659] SUN4V-DTLB: TPC<0xfffffc01001cac48>
> > [31732.380652] SUN4V-DTLB: O7[100970]
> > [31732.387418] SUN4V-DTLB: O7<0x100970>
> > [31732.394548] SUN4V-DTLB: vaddr[fffffc0100028000] ctx[1634] pte[9a00000000000610] error[2]
> >
> > Message from syslogd@t5-2 at Sep 9 16:53:25 ...
> > kernel:[31732.360547] SUN4V-DTLB: Error at TPC[fffffc01001cac48], tl 1
> >
> > Message from syslogd@t5-2 at Sep 9 16:53:25 ...
> > kernel:[31732.371659] SUN4V-DTLB: TPC<0xfffffc01001cac48>
> >
> > Message from syslogd@t5-2 at Sep 9 16:53:25 ...
> > kernel:[31732.380652] SUN4V-DTLB: O7[102014-09-09 20:35:34 SP> NOTICE: Host is off
> > . Some firmware widget we are unaware of?
>
> Hmmm...
>
> Oh I see, if LDOMs are enabled we do ldom_power_off() instead of doing
> an OF "exit".
>
> That explains everything.
>
> I seem to remember that for some reason after early boot it got to the
> point with LDOMs that you had to stop talking to the OF, and that's
> why for all of these interfaces that could be invoked after early
> boot, we revector to a ldom_*() routine if ldom_domaining_enabled is
> true.
I seem to remember encountering similar for kexec and start/stop strand but
that was long ago too :)
>
> So I don't think there is anything we can do about this, so perhaps we
> should just unconditionally avoid using prom_halt() here, and just do
> a die_if_kernel() regardless of the trap level.
okay.
>
> Also, for the >tl1 case, it would be beneficial to print out the stack
> of trap state registers that etraptl1 saves on the stack right after
> pt_regs. The format is traps_64.c's "struct tl1_traplog", and there
> is a dump_tl1_traplog() helper there already.
okay.
>
> Thanks for looking into this Bob.
You're welcome and thanx,
bob
<<ixgbe>>
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 32 +++++++++++++++++++++++++
1 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 87bd53f..bb37bd7 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -49,6 +49,10 @@
#include <linux/if_bridge.h>
#include <linux/prefetch.h>
#include <scsi/fc/fc_fcoe.h>
+#ifdef CONFIG_SPARC
+#include <asm/idprom.h>
+#include <asm/prom.h>
+#endif
#include "ixgbe.h"
#include "ixgbe_common.h"
@@ -8063,6 +8067,33 @@ int ixgbe_wol_supported(struct ixgbe_adapter *adapter, u16 device_id,
return is_wol_supported;
}
+#ifdef CONFIG_SPARC
+/**
+ * ixgbe_mac_addr_sparc - Look up MAC address on SPARC
+ * @adapter: Pointer to adapter struct
+ */
+static void ixgbe_mac_addr_sparc(struct ixgbe_adapter *adapter)
+{
+ struct device_node *dp = pci_device_to_OF_node(adapter->pdev);
+ struct ixgbe_hw *hw = &adapter->hw;
+ const unsigned char *addr;
+ int len;
+
+ addr = of_get_property(dp, "local-mac-address", &len);
+ if (addr && len == 6) {
+ e_dev_info("Using OpenPROM MAC address\n");
+ memcpy(hw->mac.perm_addr, addr, 6);
+ }
+
+ if (!is_valid_ether_addr(hw->mac.perm_addr)) {
+ e_dev_info("Using IDPROM MAC address\n");
+ memcpy(hw->mac.perm_addr, idprom->id_ethaddr, 6);
+ }
+}
+#else
+static void ixgbe_mac_addr_sparc(struct ixgbe_adapter *adapter) {}
+#endif
+
/**
* ixgbe_probe - Device Initialization Routine
* @pdev: PCI device information struct
@@ -8330,6 +8361,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_sw_init;
}
+ ixgbe_mac_addr_sparc(adapter);
memcpy(netdev->dev_addr, hw->mac.perm_addr, netdev->addr_len);
if (!is_valid_ether_addr(netdev->dev_addr)) {
* Re: [PATCH] sparc64: sun4v TLB error power off events
2014-09-07 15:47 [PATCH] sparc64: sun4v TLB error power off events Bob Picco
` (3 preceding siblings ...)
2014-09-10 14:18 ` Bob Picco
@ 2014-09-10 18:39 ` David Miller
4 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2014-09-10 18:39 UTC (permalink / raw)
To: sparclinux
From: Bob Picco <bpicco@meloft.net>
Date: Wed, 10 Sep 2014 10:18:22 -0400
> David Miller wrote: [Tue Sep 09 2014, 05:52:46PM EDT]
>> From: Bob Picco <bob.picco@oracle.com>
>> Date: Tue, 9 Sep 2014 17:12:27 -0400
>>
>> > I just ran a quick experiment on my T5-2 which is supported hardware. The
>> > kernel is 3.17-rc3 without any modification from me - well ixgbe. As root mmap
>> > of /dev/mem at address 0UL. It powered off:
>>
>> Just out of curiosity what ixgbe patches do you need that aren't
>> upstream already?
> It is really Martin's (mkp) from last year. I ported it over to mainline.
> Basically the mac is acquired with:
> addr = of_get_property(dp, "local-mac-address", &len);
> . I'll append at the end of this email what my local T5-2 is using.
This really needs to: 1) get fixed to depend upon CONFIG_OF rather than
a specific architecture like SPARC and 2) get submitted and accepted
by the Intel ethernet driver maintainers.