* [PATCH] parisc: Try to fix random segmentation faults in package builds
@ 2024-05-05 16:58 John David Anglin
2024-05-08 8:54 ` Vidra.Jonas
0 siblings, 1 reply; 18+ messages in thread
From: John David Anglin @ 2024-05-05 16:58 UTC (permalink / raw)
To: linux-parisc; +Cc: Helge Deller
[-- Attachment #1: Type: text/plain, Size: 3573 bytes --]
The majority of random segmentation faults that I have looked at
appear to be memory corruption in memory allocated using mmap and
malloc. This got me thinking that there might be issues with the
parisc implementation of flush_anon_page.
On PA8800/PA8900 CPUs, we use flush_user_cache_page to flush anonymous
pages. I modified flush_user_cache_page to leave interrupts disabled
for the entire flush just to be sure the context didn't get modified
mid-flush.
In looking at the implementation of flush_anon_page on other architectures,
I noticed that they all invalidate the kernel mapping as well as flush
the user page. I added code to invalidate the kernel mapping to this
page in the PA8800/PA8900 path. It's possible this is also needed for
other processors, but I don't have a way to test that.
I removed the use of flush_data_cache when the mapping is shared. In theory,
shared mappings are all equivalent, so flush_user_cache_page should
flush all shared mappings. It is much faster.
Lightly tested on rp3440 and c8000.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
---
diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index ca4a302d4365..8d14a8a5d4d6 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -333,8 +333,6 @@ static void flush_user_cache_page(struct vm_area_struct *vma, unsigned long vmad
vmaddr &= PAGE_MASK;
- preempt_disable();
-
/* Set context for flush */
local_irq_save(flags);
prot = mfctl(8);
@@ -344,7 +342,6 @@ static void flush_user_cache_page(struct vm_area_struct *vma, unsigned long vmad
pgd_lock = mfctl(28);
#endif
switch_mm_irqs_off(NULL, vma->vm_mm, NULL);
- local_irq_restore(flags);
flush_user_dcache_range_asm(vmaddr, vmaddr + PAGE_SIZE);
if (vma->vm_flags & VM_EXEC)
@@ -352,7 +349,6 @@ static void flush_user_cache_page(struct vm_area_struct *vma, unsigned long vmad
flush_tlb_page(vma, vmaddr);
/* Restore previous context */
- local_irq_save(flags);
#ifdef CONFIG_TLB_PTLOCK
mtctl(pgd_lock, 28);
#endif
@@ -360,8 +356,6 @@ static void flush_user_cache_page(struct vm_area_struct *vma, unsigned long vmad
mtsp(space, SR_USER);
mtctl(prot, 8);
local_irq_restore(flags);
-
- preempt_enable();
}
static inline pte_t *get_ptep(struct mm_struct *mm, unsigned long addr)
@@ -543,7 +537,7 @@ void __init parisc_setup_cache_timing(void)
parisc_tlb_flush_threshold/1024);
}
-extern void purge_kernel_dcache_page_asm(unsigned long);
+extern void purge_kernel_dcache_page_asm(const void *addr);
extern void clear_user_page_asm(void *, unsigned long);
extern void copy_user_page_asm(void *, void *, unsigned long);
@@ -558,6 +552,16 @@ void flush_kernel_dcache_page_addr(const void *addr)
}
EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
+static void purge_kernel_dcache_page_addr(const void *addr)
+{
+ unsigned long flags;
+
+ purge_kernel_dcache_page_asm(addr);
+ purge_tlb_start(flags);
+ pdtlb(SR_KERNEL, addr);
+ purge_tlb_end(flags);
+}
+
static void flush_cache_page_if_present(struct vm_area_struct *vma,
unsigned long vmaddr, unsigned long pfn)
{
@@ -725,10 +729,8 @@ void flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned lon
return;
if (parisc_requires_coherency()) {
- if (vma->vm_flags & VM_SHARED)
- flush_data_cache();
- else
- flush_user_cache_page(vma, vmaddr);
+ flush_user_cache_page(vma, vmaddr);
+ purge_kernel_dcache_page_addr(page_address(page));
return;
}
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-05 16:58 [PATCH] parisc: Try to fix random segmentation faults in package builds John David Anglin
@ 2024-05-08 8:54 ` Vidra.Jonas
2024-05-08 15:23 ` John David Anglin
0 siblings, 1 reply; 18+ messages in thread
From: Vidra.Jonas @ 2024-05-08 8:54 UTC (permalink / raw)
To: linux-parisc; +Cc: John David Anglin, Helge Deller
[-- Attachment #1: Type: text/plain, Size: 4541 bytes --]
---------- Original e-mail ----------
From: John David Anglin
To: linux-parisc@vger.kernel.org
CC: Helge Deller
Date: 5. 5. 2024 19:07:17
Subject: [PATCH] parisc: Try to fix random segmentation faults in package builds
> The majority of random segmentation faults that I have looked at
> appear to be memory corruption in memory allocated using mmap and
> malloc. This got me thinking that there might be issues with the
> parisc implementation of flush_anon_page.
>
> [...]
>
> Lightly tested on rp3440 and c8000.
Hello,
thank you very much for working on the issue and for the patch! I tested
it on my C8000 with the 6.8.9 kernel with Gentoo distribution patches.
My machine is affected heavily by the segfaults – with some kernel
configurations, I get several per hour when compiling Gentoo packages
on all four cores. This patch doesn't fix them, though. On the patched
kernel, it happened after ~8h of uptime during installation of the
perl-core/Test-Simple package. I got no error output from the running
program, but an HPMC was logged to the serial console:
[30007.186309] mm/pgtable-generic.c:54: bad pmd 539b0030.
<Cpu3> 78000c6203e00000 a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
<Cpu0> e800009800e00000 0000000041093be4 CC_ERR_CHECK_HPMC
<Cpu1> e800009801e00000 00000000404ce130 CC_ERR_CHECK_HPMC
<Cpu3> 76000c6803e00000 0000000000000520 CC_PAT_DATA_FIELD_WARNING
<Cpu0> 37000f7300e00000 84000[30007.188321] Backtrace:
[30007.188321] [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
[30007.188321] [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
[30007.188321] [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
[30007.188321] [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
[30007.188321] [<00000000401e95c0>] handle_interruption+0x330/0xe60
[30007.188321] [<0000000040295b44>] schedule_tail+0x78/0xe8
[30007.188321] [<00000000401e0f6c>] finish_child_return+0x0/0x58
A longer excerpt of the logs is attached. The error happened at boot
time 30007, the preceding unaligned accesses seem to be unrelated.
The patch didn't apply cleanly, but all hunks succeeded with some
offsets and fuzz. This may also be a part of it – I didn't check the
code for merge conflicts manually.
If you want me to provide you with more logs (such as the HPMC dumps)
or run some experiments, let me know.
Some speculation about the cause of the errors follows:
I don't think it's a hardware error, as HP-UX 11i v1 works flawlessly on
the same machine. The errors seem to be more frequent with a heavy IO
load, so it might be system-bus or PCI-bus-related. Using X11 causes
lockups rather quickly, but that could be caused by unrelated errors in
the graphics subsystem and/or the Radeon drivers.
Limiting the machine to a single socket (2 cores) by disabling the other
socket in firmware, or even booting on a single core using a maxcpus=1
kernel cmdline option, decreases the error frequency, but doesn't
prevent them completely, at least on an (unpatched) 6.1 kernel. So it's
probably not an SMP bug. If it's related to cache coherency, it's
coherency between the CPUs and bus IO.
The errors typically manifest as a null page access to a very low
address, so probably a null pointer dereference. I think the kernel
accidentally maps a zeroed page in place of one that the program was
using previously, making it load (and subsequently dereference) a null
pointer instead of a valid one. There are two problems with this theory,
though:
1. It would mean the program could also load zeroed /data/ instead of a
zeroed /pointer/, causing data corruption. I never conclusively observed
this, although I am getting GCC ICEs from time to time, which could
be explained by data corruption.
2. The segfault is sometimes preceded by an unaligned access, which I
believe is also caused by a corrupted machine state rather than by a
coding error in the program – sometimes a bunch of unaligned accesses
show up in the logs just prior to a segfault / lockup, even from
unrelated programs such as random bash processes. Sometimes the machine
keeps working afterwards (although I typically reboot it immediately
to limit the consequences of potential kernel data structure damage),
sometimes it HPMCs or LPMCs. This is difficult to explain by just a wild
zeroed page appearance. But this typically happens when running X11, so
again, it might be caused by another bug, such as the GPU randomly
writing to memory via misconfigured DMA.
[-- Attachment #2: parisc-hpmc-6.8.9-patched.log --]
[-- Type: application/octet-stream, Size: 5352 bytes --]
[19083.299828] try(5911): unaligned access to 0xf95e3709 at ip 0x4100d673 (iir 0xec0109d)
[19083.394626] try(5911): unaligned access to 0xf95e370a at ip 0x4100d673 (iir 0xec0109d)
[19083.487926] try(5911): unaligned access to 0xf95e370b at ip 0x4100d673 (iir 0xec0109d)
[19083.584588] try(5911): unaligned access to 0xf95e3709 at ip 0x4100d6b7 (iir 0xf941280)
[19083.677922] try(5911): unaligned access to 0xf95e3709 at ip 0x4100d6bb (iir 0xf801095)
[27578.934693] handle_unaligned: 4 callbacks suppressed
[27578.934726] gs(13408): unaligned access to 0x417d39b4 at ip 0xf56971a3 (iir 0x2f801004)
[27579.090196] gs(13408): unaligned access to 0x417d39b4 at ip 0xf56971a3 (iir 0x2f801004)
[27579.271876] gs(13408): unaligned access to 0x417d39b4 at ip 0xf56971a3 (iir 0x2f801004)
[27579.366418] gs(13408): unaligned access to 0x417d39b4 at ip 0xf56971a3 (iir 0x2f801004)
[27579.464046] gs(13408): unaligned access to 0x417d39b4 at ip 0xf56971a3 (iir 0x2f801004)
[28810.160265] handle_unaligned: 9 callbacks suppressed
[28810.160296] ruby(12588): unaligned access to 0xf8a78b54 at ip 0xf87875af (iir 0x2f1c0005)
[28810.880682] ruby(12589): unaligned access to 0xf8a78b54 at ip 0xf87875af (iir 0x2f1c0005)
[28811.473918] ruby(12590): unaligned access to 0xf8a78b54 at ip 0xf87875af (iir 0x2f1c0005)
[28812.065843] ruby(12591): unaligned access to 0xf8a78b54 at ip 0xf87875af (iir 0x2f1c0005)
[28812.654894] ruby(12592): unaligned access to 0xf8a78b54 at ip 0xf87875af (iir 0x2f1c0005)
[28815.235781] handle_unaligned: 4 callbacks suppressed
[28815.235802] ruby(12601): unaligned access to 0xf8a78b54 at ip 0xf87875af (iir 0x2f1c0005)
[28815.884917] ruby(12602): unaligned access to 0xf8678b54 at ip 0xf83875af (iir 0x2f1c0005)
[30007.186309] mm/pgtable-generic.c:54: bad pmd 539b0030.
<Cpu3> 78000c6203e00000 a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
<Cpu0> e800009800e00000 0000000041093be4 CC_ERR_CHECK_HPMC
<Cpu1> e800009801e00000 00000000404ce130 CC_ERR_CHECK_HPMC
<Cpu3> 76000c6803e00000 0000000000000520 CC_PAT_DATA_FIELD_WARNING
<Cpu0> 37000f7300e00000 84000[30007.188321] Backtrace:
[30007.188321] [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
[30007.188321] [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
[30007.188321] [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
[30007.188321] [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
[30007.188321] [<00000000401e95c0>] handle_interruption+0x330/0xe60
[30007.188321] [<0000000040295b44>] schedule_tail+0x78/0xe8
[30007.188321] [<00000000401e0f6c>] finish_child_return+0x0/0x58
[30007.188321]
[30007.188321]
[30007.188321] Kernel Fault: Code=26 (Data memory access rights trap) at addr 0000000041001f24
[30007.188321] CPU: 3 PID: 28344 Comm: bash Not tainted 6.8.9-gentoo-64bit-debug #1
[30007.188321] Hardware name: 9000/785/C8000
[30007.188321]
[30007.188321] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[30007.188321] PSW: 00001000000001100000000000001111 Not tainted
[30007.188321] r00-03 000000000806000f 0000000000000000 0000000041074ec8 0000000273ea88a0
[30007.188321] r04-07 0000000041165180 0000000041001f24 0000000055659990 0000000273ea8110
[30007.188321] r08-11 00000000f92bc0a8 0000000057d28c90 0000000000000002 0000000273ea8110
[30007.188321] r12-15 0000000055659990 0000000000000400 0000000041185980 0000000001000000
[30007.188321] r16-19 0000000041185980 0000000041185980 0000000041185980 0000000000000000
[30007.188321] r20-23 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[30007.188321] r24-27 0000000000000000 0000000000000000 0000000000000000 0000000041165180
[30007.188321] r28-31 000000000000002a 0000000000000000 0000000273ea8930 0000000000000000
[30007.188321] sr00-03 000000000336a000 0000000000000000 0000000000000000 000000000336ac00
[30007.188321] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[30007.188321]
[30007.188321] IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000041074ec8 0000000041074ecc
[30007.188321] IIR: 0ca01280 ISR: 0000000000000000 IOR: 0000000041001f24
[30007.188321] CPU: 3 CR30: 0000000273e48ba0 CR31: ffffffffffffffff
[30007.188321] ORIG_R28: 000000000000002a
[30007.188321] IAOQ[0]: pmd_clear_bad+0x54/0xa8
[30007.188321] IAOQ[1]: pmd_clear_bad+0x58/0xa8
[30007.188321] RP(r2): pmd_clear_bad+0x54/0xa8
[30007.188321] Backtrace:
[30007.188321] [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
[30007.188321] [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
[30007.188321] [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
[30007.188321] [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
[30007.188321] [<00000000401e95c0>] handle_interruption+0x330/0xe60
[30007.188321] [<0000000040295b44>] schedule_tail+0x78/0xe8
[30007.188321] [<00000000401e0f6c>] finish_child_return+0x0/0x58
[30007.188321]
000800000 CC_ERR_CPU_CHECK_SUMMARY
<Cpu1> 37000f7301e00000 8400000000800000 CC_ERR_CPU_CHECK_SUMMARY
<Cpu3> 0300109103e00000 0000000000000000 CC_PROCS_ENTRY_OUT
[30007.188321] Kernel panic - not syncing: Kernel Fault
<Cpu0> f600105e00e00000 fffffff0f0c00000 CC_MC_HPMC_MONARCH_SELECTED
<Cpu1> 5600109b01e00000 00000000001de024 CC_MC_BR_TO_OS_HPMC
<Cpu2> 00000000a2aa0000 0000000000000000
<Cpu3> e000006603e00000 090000000174b6f8 CC_BOOT_UNEXPECTED_INTERRUPT
<Cpu3> 030010d503e00000 0000000000000000 CC_CPU_STOP
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-08 8:54 ` Vidra.Jonas
@ 2024-05-08 15:23 ` John David Anglin
2024-05-08 19:18 ` matoro
2024-05-12 6:57 ` Vidra.Jonas
0 siblings, 2 replies; 18+ messages in thread
From: John David Anglin @ 2024-05-08 15:23 UTC (permalink / raw)
To: Vidra.Jonas, linux-parisc; +Cc: John David Anglin, Helge Deller
On 2024-05-08 4:54 a.m., Vidra.Jonas@seznam.cz wrote:
> ---------- Original e-mail ----------
> From: John David Anglin
> To: linux-parisc@vger.kernel.org
> CC: Helge Deller
> Date: 5. 5. 2024 19:07:17
> Subject: [PATCH] parisc: Try to fix random segmentation faults in package builds
>
>> The majority of random segmentation faults that I have looked at
>> appear to be memory corruption in memory allocated using mmap and
>> malloc. This got me thinking that there might be issues with the
>> parisc implementation of flush_anon_page.
>>
>> [...]
>>
>> Lightly tested on rp3440 and c8000.
> Hello,
>
> thank you very much for working on the issue and for the patch! I tested
> it on my C8000 with the 6.8.9 kernel with Gentoo distribution patches.
Thanks for testing. Trying to fix these faults is largely guesswork.
In my opinion, the 6.1.x branch is the most stable branch on parisc. 6.6.x and later
branches have folio changes and haven't had much testing in build environments.
I did run 6.8.7 and 6.8.8 on rp3440 for some time, but I have gone back to a slightly
modified 6.1.90.
>
> My machine is affected heavily by the segfaults – with some kernel
> configurations, I get several per hour when compiling Gentoo packages
That's more than normal, although the number seems to depend on the package.
At this rate, you wouldn't be able to build gcc.
> on all four cores. This patch doesn't fix them, though. On the patched
Okay. There are likely multiple problems. The problem I was trying to address is null
objects in the hash tables used by ld and as. The symptom is usually a null pointer
dereference after a pointer has been loaded from a null object. These occur in multiple
places in libbfd during hash table traversal. Typically, a couple occur in a gcc
testsuite run. _objalloc_alloc uses malloc. The faults can be seen on the console and
in the gcc testsuite log.
How these null objects are generated is not known. It must be a kernel issue because
they don't occur with qemu. I think the frequency of these faults is reduced with the
patch. I suspect the objects are zeroed after they are initialized. In some cases, ld can
successfully link by ignoring null objects.
The next time I see a fault caused by a null object, I think it would be useful to see if
we have a full null page. This might indicate a swap problem.
Random faults also occur during gcc compilations. gcc uses mmap to allocate memory.
> kernel, it happened after ~8h of uptime during installation of the
> perl-core/Test-Simple package. I got no error output from the running
> program, but an HPMC was logged to the serial console:
>
> [30007.186309] mm/pgtable-generic.c:54: bad pmd 539b0030.
> <Cpu3> 78000c6203e00000 a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
> <Cpu0> e800009800e00000 0000000041093be4 CC_ERR_CHECK_HPMC
> <Cpu1> e800009801e00000 00000000404ce130 CC_ERR_CHECK_HPMC
> <Cpu3> 76000c6803e00000 0000000000000520 CC_PAT_DATA_FIELD_WARNING
> <Cpu0> 37000f7300e00000 84000[30007.188321] Backtrace:
> [30007.188321] [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
> [30007.188321] [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
> [30007.188321] [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
> [30007.188321] [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
> [30007.188321] [<00000000401e95c0>] handle_interruption+0x330/0xe60
> [30007.188321] [<0000000040295b44>] schedule_tail+0x78/0xe8
> [30007.188321] [<00000000401e0f6c>] finish_child_return+0x0/0x58
>
> A longer excerpt of the logs is attached. The error happened at boot
> time 30007, the preceding unaligned accesses seem to be unrelated.
I doubt this HPMC is related to the patch. In the above, the pmd table appears to have
become corrupted.
>
> The patch didn't apply cleanly, but all hunks succeeded with some
> offsets and fuzz. This may also be a part of it – I didn't check the
> code for merge conflicts manually.
Sorry, the patch was generated against 6.1.90. This is likely the cause of the offsets
and fuzz.
>
> If you want me to provide you with more logs (such as the HPMC dumps)
> or run some experiments, let me know.
>
>
> Some speculation about the cause of the errors follows:
>
> I don't think it's a hardware error, as HP-UX 11i v1 works flawlessly on
> the same machine. The errors seem to be more frequent with a heavy IO
> load, so it might be system-bus or PCI-bus-related. Using X11 causes
> lockups rather quickly, but that could be caused by unrelated errors in
> the graphics subsystem and/or the Radeon drivers.
I am not using X11 on my c8000. I have frame buffer support on. Radeon acceleration
is broken on parisc.
Maybe there are more problems with Debian kernels because of their use of X11.
>
> Limiting the machine to a single socket (2 cores) by disabling the other
> socket in firmware, or even booting on a single core using a maxcpus=1
> kernel cmdline option, decreases the error frequency, but doesn't
> prevent them completely, at least on an (unpatched) 6.1 kernel. So it's
> probably not an SMP bug. If it's related to cache coherency, it's
> coherency between the CPUs and bus IO.
>
> The errors typically manifest as a null page access to a very low
> address, so probably a null pointer dereference. I think the kernel
> accidentally maps a zeroed page in place of one that the program was
> using previously, making it load (and subsequently dereference) a null
> pointer instead of a valid one. There are two problems with this theory,
> though:
> 1. It would mean the program could also load zeroed /data/ instead of a
> zeroed /pointer/, causing data corruption. I never conclusively observed
> this, although I am getting GCC ICEs from time to time, which could
> be explained by data corruption.
GCC catches page faults and no core dump is generated when it ICEs. So, it's harder
to debug memory issues in gcc.
I have observed zeroed data multiple times in ld faults.
> 2. The segfault is sometimes preceded by an unaligned access, which I
> believe is also caused by a corrupted machine state rather than by a
> coding error in the program – sometimes a bunch of unaligned accesses
> show up in the logs just prior to a segfault / lockup, even from
> unrelated programs such as random bash processes. Sometimes the machine
> keeps working afterwards (although I typically reboot it immediately
> to limit the consequences of potential kernel data structure damage),
> sometimes it HPMCs or LPMCs. This is difficult to explain by just a wild
> zeroed page appearance. But this typically happens when running X11, so
> again, it might be caused by another bug, such as the GPU randomly
> writing to memory via misconfigured DMA.
There was a bug in the unaligned handler for double-word instructions (ldd) that was
recently fixed. ldd/std are not used in userspace, so this problem didn't affect it.
Kernel unaligned faults are not logged, so problems could occur inside the kernel
and not be noticed until disaster strikes. Still, it seems unlikely that an unaligned
fault would corrupt more than a single word.
We have observed that the faults appear to be SMP- and memory-size-related. An rp4440 with
6 CPUs and 4 GB RAM faulted a lot. It's mostly a PA8800/PA8900 issue.
It has been months since I had an HPMC or LPMC on rp3440 and c8000. Stalls still happen,
but they are rare.
Dave
--
John David Anglin dave.anglin@bell.net
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-08 15:23 ` John David Anglin
@ 2024-05-08 19:18 ` matoro
2024-05-08 20:52 ` John David Anglin
2024-05-12 6:57 ` Vidra.Jonas
1 sibling, 1 reply; 18+ messages in thread
From: matoro @ 2024-05-08 19:18 UTC (permalink / raw)
To: John David Anglin
Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-05-08 11:23, John David Anglin wrote:
> On 2024-05-08 4:54 a.m., Vidra.Jonas@seznam.cz wrote:
>> ---------- Original e-mail ----------
>> From: John David Anglin
>> To: linux-parisc@vger.kernel.org
>> CC: Helge Deller
>> Date: 5. 5. 2024 19:07:17
>> Subject: [PATCH] parisc: Try to fix random segmentation faults in package
>> builds
>>
>>> The majority of random segmentation faults that I have looked at
>>> appear to be memory corruption in memory allocated using mmap and
>>> malloc. This got me thinking that there might be issues with the
>>> parisc implementation of flush_anon_page.
>>>
>>> [...]
>>>
>>> Lightly tested on rp3440 and c8000.
>> Hello,
>>
>> thank you very much for working on the issue and for the patch! I tested
>> it on my C8000 with the 6.8.9 kernel with Gentoo distribution patches.
> Thanks for testing. Trying to fix these faults is largely guess work.
>
> In my opinion, the 6.1.x branch is the most stable branch on parisc. 6.6.x
> and later
> branches have folio changes and haven't had very much testing in build
> environments.
> I did run 6.8.7 and 6.8.8 on rp3440 for some time but I have gone back to a
> slightly
> modified 6.1.90.
>>
>> My machine is affected heavily by the segfaults – with some kernel
>> configurations, I get several per hour when compiling Gentoo packages
> That's more than normal although number seems to depend on package.
> At this rate, you wouldn't be able to build gcc.
>> on all four cores. This patch doesn't fix them, though. On the patched
> Okay. There are likely multiple problems. The problem I was trying to
> address is null
> objects in the hash tables used by ld and as. The symptom is usually a null
> pointer
> dereference after pointer has been loaded from null object. These occur in
> multiple
> places in libbfd during hash table traversal. Typically, a couple would
> occur in a gcc
> testsuite run. _objalloc_alloc uses malloc. One can see the faults on the
> console and
> in the gcc testsuite log.
>
> How these null objects are generated is not known. It must be a kernel
> issue because
> they don't occur with qemu. I think the frequency of these faults is
> reduced with the
> patch. I suspect the objects are zeroed after they are initialized. In
> some cases, ld can
> successfully link by ignoring null objects.
>
> The next time I see a fault caused by a null object, I think it would be
> useful to see if
> we have a full null page. This might indicate a swap problem.
>
> random faults also occur during gcc compilations. gcc uses mmap to allocate
> memory.
>
>> kernel, it happened after ~8h of uptime during installation of the
>> perl-core/Test-Simple package. I got no error output from the running
>> program, but an HPMC was logged to the serial console:
>>
>> [30007.186309] mm/pgtable-generic.c:54: bad pmd 539b0030.
>> <Cpu3> 78000c6203e00000 a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
>> <Cpu0> e800009800e00000 0000000041093be4 CC_ERR_CHECK_HPMC
>> <Cpu1> e800009801e00000 00000000404ce130 CC_ERR_CHECK_HPMC
>> <Cpu3> 76000c6803e00000 0000000000000520 CC_PAT_DATA_FIELD_WARNING
>> <Cpu0> 37000f7300e00000 84000[30007.188321] Backtrace:
>> [30007.188321] [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
>> [30007.188321] [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
>> [30007.188321] [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
>> [30007.188321] [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
>> [30007.188321] [<00000000401e95c0>] handle_interruption+0x330/0xe60
>> [30007.188321] [<0000000040295b44>] schedule_tail+0x78/0xe8
>> [30007.188321] [<00000000401e0f6c>] finish_child_return+0x0/0x58
>>
>> A longer excerpt of the logs is attached. The error happened at boot
>> time 30007, the preceding unaligned accesses seem to be unrelated.
> I doubt this HPMC is related to the patch. In the above, the pmd table
> appears to have
> become corrupted.
>>
>> The patch didn't apply cleanly, but all hunks succeeded with some
>> offsets and fuzz. This may also be a part of it – I didn't check the
>> code for merge conflicts manually.
> Sorry, the patch was generated against 6.1.90. This is likely the cause of
> the offsets
> and fuzz.
>>
>> If you want me to provide you with more logs (such as the HPMC dumps)
>> or run some experiments, let me know.
>>
>>
>> Some speculation about the cause of the errors follows:
>>
>> I don't think it's a hardware error, as HP-UX 11i v1 works flawlessly on
>> the same machine. The errors seem to be more frequent with a heavy IO
>> load, so it might be system-bus or PCI-bus-related. Using X11 causes
>> lockups rather quickly, but that could be caused by unrelated errors in
>> the graphics subsystem and/or the Radeon drivers.
> I am not using X11 on my c8000. I have frame buffer support on. Radeon
> acceleration
> is broken on parisc.
>
> Maybe there are more problems with debian kernels because of its use of X11.
>>
>> Limiting the machine to a single socket (2 cores) by disabling the other
>> socket in firmware, or even booting on a single core using a maxcpus=1
>> kernel cmdline option, decreases the error frequency, but doesn't
>> prevent them completely, at least on an (unpatched) 6.1 kernel. So it's
>> probably not an SMP bug. If it's related to cache coherency, it's
>> coherency between the CPUs and bus IO.
>>
>> The errors typically manifest as a null page access to a very low
>> address, so probably a null pointer dereference. I think the kernel
>> accidentally maps a zeroed page in place of one that the program was
>> using previously, making it load (and subsequently dereference) a null
>> pointer instead of a valid one. There are two problems with this theory,
>> though:
>> 1. It would mean the program could also load zeroed /data/ instead of a
>> zeroed /pointer/, causing data corruption. I never conclusively observed
>> this, although I am getting GCC ICEs from time to time, which could
>> be explained by data corruption.
> GCC catches page faults and no core dump is generated when it ICEs. So, it's
> harder
> to debug memory issues in gcc.
>
> I have observed zeroed data multiple times in ld faults.
>> 2. The segfault is sometimes preceded by an unaligned access, which I
>> believe is also caused by a corrupted machine state rather than by a
>> coding error in the program – sometimes a bunch of unaligned accesses
>> show up in the logs just prior to a segfault / lockup, even from
>> unrelated programs such as random bash processes. Sometimes the machine
>> keeps working afterwards (although I typically reboot it immediately
>> to limit the consequences of potential kernel data structure damage),
>> sometimes it HPMCs or LPMCs. This is difficult to explain by just a wild
>> zeroed page appearance. But this typically happens when running X11, so
>> again, it might be caused by another bug, such as the GPU randomly
>> writing to memory via misconfigured DMA.
> There was a bug in the unaligned handler for double word instructions (ldd)
> that was
> recently fixed. ldd/std are not used in userspace, so this problem didn't
> affect it.
>
> Kernel unaligned faults are not logged, so problems could occur internal to
> the kernel
> and not be noticed till disaster. Still, it seems unlikely that an
> unaligned fault would
> corrupt more than a single word.
>
> We have observed that the faults appear SMP and memory size related. A
> rp4440 with
> 6 CPUs and 4 GB RAM faulted a lot. It's mostly a PA8800/PA8900 issue.
>
> It's months since I had a HPMC or LPMC on rp3440 and c8000. Stalls still
> happen but they
> are rare.
>
> Dave
Hi, I also tested this patch on an rp3440 with PA8900. Unfortunately it
seems to have exacerbated an existing issue that takes the whole machine
down. Occasionally I would get a message:
[ 7497.061892] Kernel panic - not syncing: Kernel Fault
with no accompanying stack trace and then the BMC would restart the whole
machine automatically. These were infrequent enough that the segfaults were
the bigger problem, but after applying this patch on top of 6.8, this changed
the dynamic. It seems to occur during builds with varying I/O loads. For
example, I was able to build gcc fine, with no segfaults, but I was unable to
build perl, a much smaller build, without crashing the machine. I did not
observe any segfaults over the day or two I ran this patch, but that's not an
unheard-of stretch of time even without it, and I am being forced to revert
because of the panics.
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-08 19:18 ` matoro
@ 2024-05-08 20:52 ` John David Anglin
2024-05-08 23:51 ` matoro
2024-05-09 17:10 ` John David Anglin
0 siblings, 2 replies; 18+ messages in thread
From: John David Anglin @ 2024-05-08 20:52 UTC (permalink / raw)
To: matoro; +Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-05-08 3:18 p.m., matoro wrote:
> On 2024-05-08 11:23, John David Anglin wrote:
>> On 2024-05-08 4:54 a.m., Vidra.Jonas@seznam.cz wrote:
>>> ---------- Original e-mail ----------
>>> From: John David Anglin
>>> To: linux-parisc@vger.kernel.org
>>> CC: Helge Deller
>>> Date: 5. 5. 2024 19:07:17
>>> Subject: [PATCH] parisc: Try to fix random segmentation faults in package builds
>>>
>>>> The majority of random segmentation faults that I have looked at
>>>> appear to be memory corruption in memory allocated using mmap and
>>>> malloc. This got me thinking that there might be issues with the
>>>> parisc implementation of flush_anon_page.
>>>>
>>>> [...]
>>>>
>>>> Lightly tested on rp3440 and c8000.
>>> Hello,
>>>
>>> thank you very much for working on the issue and for the patch! I tested
>>> it on my C8000 with the 6.8.9 kernel with Gentoo distribution patches.
>> Thanks for testing. Trying to fix these faults is largely guesswork.
>>
>> In my opinion, the 6.1.x branch is the most stable branch on parisc. 6.6.x and later
>> branches have folio changes and haven't had very much testing in build environments.
>> I did run 6.8.7 and 6.8.8 on rp3440 for some time but I have gone back to a slightly
>> modified 6.1.90.
>>>
>>> My machine is affected heavily by the segfaults – with some kernel
>>> configurations, I get several per hour when compiling Gentoo packages
>> That's more than normal although number seems to depend on package.
>> At this rate, you wouldn't be able to build gcc.
>>> on all four cores. This patch doesn't fix them, though. On the patched
>> Okay. There are likely multiple problems. The problem I was trying to address is null
>> objects in the hash tables used by ld and as. The symptom is usually a null pointer
>> dereference after the pointer has been loaded from a null object. These occur in multiple
>> places in libbfd during hash table traversal. Typically, a couple would occur in a gcc
>> testsuite run. _objalloc_alloc uses malloc. One can see the faults on the console and
>> in the gcc testsuite log.
>>
>> How these null objects are generated is not known. It must be a kernel issue because
>> they don't occur with qemu. I think the frequency of these faults is reduced with the
>> patch. I suspect the objects are zeroed after they are initialized. In some cases, ld can
>> successfully link by ignoring null objects.
>>
>> The next time I see a fault caused by a null object, I think it would be useful to see if
>> we have a full null page. This might indicate a swap problem.
>>
>> Random faults also occur during gcc compilations. gcc uses mmap to allocate memory.
>>
>>> kernel, it happened after ~8h of uptime during installation of the
>>> perl-core/Test-Simple package. I got no error output from the running
>>> program, but an HPMC was logged to the serial console:
>>>
>>> [30007.186309] mm/pgtable-generic.c:54: bad pmd 539b0030.
>>> <Cpu3> 78000c6203e00000 a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
>>> <Cpu0> e800009800e00000 0000000041093be4 CC_ERR_CHECK_HPMC
>>> <Cpu1> e800009801e00000 00000000404ce130 CC_ERR_CHECK_HPMC
>>> <Cpu3> 76000c6803e00000 0000000000000520 CC_PAT_DATA_FIELD_WARNING
>>> <Cpu0> 37000f7300e00000 84000[30007.188321] Backtrace:
>>> [30007.188321] [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
>>> [30007.188321] [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
>>> [30007.188321] [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
>>> [30007.188321] [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
>>> [30007.188321] [<00000000401e95c0>] handle_interruption+0x330/0xe60
>>> [30007.188321] [<0000000040295b44>] schedule_tail+0x78/0xe8
>>> [30007.188321] [<00000000401e0f6c>] finish_child_return+0x0/0x58
>>>
>>> A longer excerpt of the logs is attached. The error happened at boot
>>> time 30007, the preceding unaligned accesses seem to be unrelated.
>> I doubt this HPMC is related to the patch. In the above, the pmd table appears to have
>> become corrupted.
>>>
>>> The patch didn't apply cleanly, but all hunks succeeded with some
>>> offsets and fuzz. This may also be a part of it – I didn't check the
>>> code for merge conflicts manually.
>> Sorry, the patch was generated against 6.1.90. This is likely the cause of the offsets
>> and fuzz.
>>>
>>> If you want me to provide you with more logs (such as the HPMC dumps)
>>> or run some experiments, let me know.
>>>
>>>
>>> Some speculation about the cause of the errors follows:
>>>
>>> I don't think it's a hardware error, as HP-UX 11i v1 works flawlessly on
>>> the same machine. The errors seem to be more frequent with a heavy IO
>>> load, so it might be system-bus or PCI-bus-related. Using X11 causes
>>> lockups rather quickly, but that could be caused by unrelated errors in
>>> the graphics subsystem and/or the Radeon drivers.
>> I am not using X11 on my c8000. I have frame buffer support on. Radeon acceleration
>> is broken on parisc.
>>
>> Maybe there are more problems with debian kernels because of its use of X11.
>>>
>>> Limiting the machine to a single socket (2 cores) by disabling the other
>>> socket in firmware, or even booting on a single core using a maxcpus=1
>>> kernel cmdline option, decreases the error frequency, but doesn't
>>> prevent them completely, at least on an (unpatched) 6.1 kernel. So it's
>>> probably not an SMP bug. If it's related to cache coherency, it's
>>> coherency between the CPUs and bus IO.
>>>
>>> The errors typically manifest as a null page access to a very low
>>> address, so probably a null pointer dereference. I think the kernel
>>> accidentally maps a zeroed page in place of one that the program was
>>> using previously, making it load (and subsequently dereference) a null
>>> pointer instead of a valid one. There are two problems with this theory,
>>> though:
>>> 1. It would mean the program could also load zeroed /data/ instead of a
>>> zeroed /pointer/, causing data corruption. I never conclusively observed
>>> this, although I am getting GCC ICEs from time to time, which could
>>> be explained by data corruption.
>> GCC catches page faults and no core dump is generated when it ICEs. So, it's harder
>> to debug memory issues in gcc.
>>
>> I have observed zeroed data multiple times in ld faults.
>>> 2. The segfault is sometimes preceded by an unaligned access, which I
>>> believe is also caused by a corrupted machine state rather than by a
>>> coding error in the program – sometimes a bunch of unaligned accesses
>>> show up in the logs just prior to a segfault / lockup, even from
>>> unrelated programs such as random bash processes. Sometimes the machine
>>> keeps working afterwards (although I typically reboot it immediately
>>> to limit the consequences of potential kernel data structure damage),
>>> sometimes it HPMCs or LPMCs. This is difficult to explain by just a wild
>>> zeroed page appearance. But this typically happens when running X11, so
>>> again, it might be caused by another bug, such as the GPU randomly
>>> writing to memory via misconfigured DMA.
>> There was a bug in the unaligned handler for double word instructions (ldd) that was
>> recently fixed. ldd/std are not used in userspace, so this problem didn't affect it.
>>
>> Kernel unaligned faults are not logged, so problems could occur internal to the kernel
>> and not be noticed till disaster. Still, it seems unlikely that an unaligned fault would
>> corrupt more than a single word.
>>
>> We have observed that the faults appear SMP and memory size related. A rp4440 with
>> 6 CPUs and 4 GB RAM faulted a lot. It's mostly a PA8800/PA8900 issue.
>>
>> It's months since I had a HPMC or LPMC on rp3440 and c8000. Stalls still happen but they
>> are rare.
>>
>> Dave
>
> Hi, I also tested this patch on an rp3440 with PA8900. Unfortunately it seems to have exacerbated an existing issue which takes the whole
> machine down. Occasionally I would get a message:
>
> [ 7497.061892] Kernel panic - not syncing: Kernel Fault
>
> with no accompanying stack trace and then the BMC would restart the whole machine automatically. These were infrequent enough that the
> segfaults were the bigger problem, but after applying this patch on top of 6.8, this changed the dynamic. It seems to occur during builds
> with varying I/O loads. For example, I was able to build gcc fine, with no segfaults, but I was unable to build perl, a much smaller build,
> without crashing the machine. I did not observe any segfaults over the day or 2 I ran this patch, but that's not an unheard-of stretch of
> time even without it, and I am being forced to revert because of the panics.
Looks like there is a problem with 6.8. I'll do some testing with it.
I haven't had any panics with 6.1 on rp3440 or c8000.
Trying a debian perl-5.38.2 build.
Dave
--
John David Anglin dave.anglin@bell.net
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-08 20:52 ` John David Anglin
@ 2024-05-08 23:51 ` matoro
2024-05-09 1:21 ` John David Anglin
2024-05-09 17:10 ` John David Anglin
1 sibling, 1 reply; 18+ messages in thread
From: matoro @ 2024-05-08 23:51 UTC (permalink / raw)
To: John David Anglin
Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-05-08 16:52, John David Anglin wrote:
> [...]
> Looks like there is a problem with 6.8. I'll do some testing with it.
>
> I haven't had any panics with 6.1 on rp3440 or c8000.
>
> Trying a debian perl-5.38.2 build.
>
> Dave
Oops, it seems that after reverting this patch I ran into the exact same problem.
First, the failing package is actually perl XS-Parse-Keyword, not the perl
interpreter itself. I didn't have a serial console hooked up to check it exactly.
And second, it did the exact same thing even without the patch, on kernel
6.8.9, so the patch is definitely not the problem. I'm going to try checking some
older kernels to see if I can identify any that aren't susceptible to this
crash. Luckily, this package build seems to trigger it pretty reliably.
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-08 23:51 ` matoro
@ 2024-05-09 1:21 ` John David Anglin
0 siblings, 0 replies; 18+ messages in thread
From: John David Anglin @ 2024-05-09 1:21 UTC (permalink / raw)
To: matoro; +Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-05-08 7:51 p.m., matoro wrote:
> [...]
>
> Oops, seems after reverting this patch I ran into the exact same problem.
It was hard to understand how the patch could cause a kernel crash. The only significant change
is adding the purge_kernel_dcache_page_addr call in flush_anon_page. It uses the pdc instruction
to invalidate the kernel mapping. Assuming pdc is actually implemented as described in the architecture
book, it doesn't write back to memory at priority 0. It just invalidates the addressed cache line.
My 6.8.9 build is still going after 2 hours and 42 minutes...
>
> First the failing package is actually perl XS-Parse-Keyword, not the actual perl interpreter. Didn't have serial console hooked up to check
> it exactly. And secondly it did the exact same thing even without the patch, on kernel 6.8.9, so that's definitely not the problem. I'm
> going to try checking some older kernels to see if I can identify any that aren't susceptible to this crash. Luckily this package build seems
> to be pretty reliably triggering it.
That's a good find.
libxs-parse-keyword-perl just built a couple of hours ago on sap rp4440. As of my last email, it was running 6.1.
Dave
--
John David Anglin dave.anglin@bell.net
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-08 20:52 ` John David Anglin
2024-05-08 23:51 ` matoro
@ 2024-05-09 17:10 ` John David Anglin
2024-05-29 15:54 ` matoro
1 sibling, 1 reply; 18+ messages in thread
From: John David Anglin @ 2024-05-09 17:10 UTC (permalink / raw)
To: matoro; +Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-05-08 4:52 p.m., John David Anglin wrote:
>> [...]
> Looks like there is a problem with 6.8. I'll do some testing with it.
So far, I haven't seen any panics with 6.8.9, but I have seen some random segmentation faults
in the gcc testsuite. I looked at one ld fault in some detail. 18 contiguous words in the elf_link_hash_entry
struct were zeroed, starting with the last word in the bfd_link_hash_entry struct, causing the fault.
The section pointer was zeroed.
18 is a rather strange number of words to corrupt, and the corruption doesn't seem related
to the object structure. In any case, it is not page related.
It's really hard to tell how this happens. The corrupt object was at a slightly different location
than it is when ld is run under gdb, so I can't duplicate the fault in gdb.
Dave
--
John David Anglin dave.anglin@bell.net
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-08 15:23 ` John David Anglin
2024-05-08 19:18 ` matoro
@ 2024-05-12 6:57 ` Vidra.Jonas
1 sibling, 0 replies; 18+ messages in thread
From: Vidra.Jonas @ 2024-05-12 6:57 UTC (permalink / raw)
To: linux-parisc; +Cc: John David Anglin, Helge Deller
---------- Original e-mail ----------
From: John David Anglin
To: linux-parisc@vger.kernel.org
CC: John David Anglin, Helge Deller
Date: 8. 5. 2024 17:23:27
Subject: Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
> In my opinion, the 6.1.x branch is the most stable branch on parisc. 6.6.x and later
> branches have folio changes and haven't had very much testing in build environments.
> I did run 6.8.7 and 6.8.8 on rp3440 for some time but I have gone back to a slightly
> modified 6.1.90.
OK, thanks, I'll roll back as well.
>> My machine is affected heavily by the segfaults – with some kernel
>> configurations, I get several per hour when compiling Gentoo packages
> That's more than normal although number seems to depend on package.
> At this rate, you wouldn't be able to build gcc.
Well, yeah. :-) The crashes are rarer when using a kernel with many
debugging options turned on, which suggests that it's some kind of a
race condition. Unfortunately, that also means it doesn't manifest when
the program is run under strace or gdb. I build large packages with -j1,
as the crashes are rarer with a smaller load.
The worst offender is the `moc` program used in builds of Qt packages,
it crashes a lot.
>> on all four cores. This patch doesn't fix them, though. On the patched
> Okay. There are likely multiple problems. The problem I was trying to address is null
> objects in the hash tables used by ld and as. The symptom is usually a null pointer
> dereference after a pointer has been loaded from a null object. These occur in multiple
> places in libbfd during hash table traversal. Typically, a couple would occur in a gcc
> testsuite run. _objalloc_alloc uses malloc. One can see the faults on the console and
> in the gcc testsuite log.
>
> How these null objects are generated is not known. It must be a kernel issue because
> they don't occur with qemu. I think the frequency of these faults is reduced with the
> patch. I suspect the objects are zeroed after they are initialized. In some cases, ld can
> successfully link by ignoring null objects.
>
> The next time I see a fault caused by a null object, I think it would be useful to see if
> we have a full null page. This might indicate a swap problem.
I did see a full zeroed page at least once, but it's hard to debug.
Also, I'm not sure whether core dumps are reliable in this case – since
this is a kernel bug, the view of memory stored in a core dump might be
different from what the program saw at the time of the crash.
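Dave's full-null-page check is quick to script outside the kernel. Below is a minimal sketch (a hypothetical helper, not part of any patch in this thread) that counts fully zeroed 4 KiB pages in a buffer, e.g. the contents of a core file read into memory:

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096  /* parisc base page size */

/* Return nonzero if the PAGE_SZ-byte page at p is entirely zero. */
static int page_is_zero(const unsigned char *p)
{
	static const unsigned char zero[PAGE_SZ];  /* all-zero reference page */

	return memcmp(p, zero, PAGE_SZ) == 0;
}

/* Count fully zeroed pages in buf; a trailing partial page is ignored. */
static size_t count_zero_pages(const unsigned char *buf, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i + PAGE_SZ <= len; i += PAGE_SZ)
		if (page_is_zero(buf + i))
			n++;
	return n;
}
```

A wrapper main() could fread() or mmap() the core file and print the file offset of each zeroed page; a cluster of hits around the corrupted object would support the swap/zero-page theory, while isolated zeroed words would not.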
>> kernel, it happened after ~8h of uptime during installation of the
>> perl-core/Test-Simple package. I got no error output from the running
>> program, but an HPMC was logged to the serial console:
>>
>> [30007.186309] mm/pgtable-generic.c:54: bad pmd 539b0030.
>> <Cpu3> 78000c6203e00000 a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
>> <Cpu0> e800009800e00000 0000000041093be4 CC_ERR_CHECK_HPMC
>> <Cpu1> e800009801e00000 00000000404ce130 CC_ERR_CHECK_HPMC
>> <Cpu3> 76000c6803e00000 0000000000000520 CC_PAT_DATA_FIELD_WARNING
>> <Cpu0> 37000f7300e00000 84000[30007.188321] Backtrace:
>> [30007.188321] [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
>> [30007.188321] [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
>> [30007.188321] [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
>> [30007.188321] [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
>> [30007.188321] [<00000000401e95c0>] handle_interruption+0x330/0xe60
>> [30007.188321] [<0000000040295b44>] schedule_tail+0x78/0xe8
>> [30007.188321] [<00000000401e0f6c>] finish_child_return+0x0/0x58
>>
>> A longer excerpt of the logs is attached. The error happened at boot
>> time 30007, the preceding unaligned accesses seem to be unrelated.
> I doubt this HPMC is related to the patch. In the above, the pmd table appears to have
> become corrupted.
I see all kinds of corruption in both kernel space and user space, and I
assumed they all share the same underlying mechanism, but you're right
that there might be multiple unrelated causes.
>> I don't think it's a hardware error, as HP-UX 11i v1 works flawlessly on
>> the same machine. The errors seem to be more frequent with a heavy IO
>> load, so it might be system-bus or PCI-bus-related. Using X11 causes
>> lockups rather quickly, but that could be caused by unrelated errors in
>> the graphics subsystem and/or the Radeon drivers.
> I am not using X11 on my c8000. I have frame buffer support on. Radeon acceleration
> is broken on parisc.
Yeah, accel doesn't work, but unaccelerated graphics works fine. Except
for the crashes, that is.
>> 2. The segfault is sometimes preceded by an unaligned access, which I
>> believe is also caused by a corrupted machine state rather than by a
>> coding error in the program – sometimes a bunch of unaligned accesses
>> show up in the logs just prior to a segfault / lockup, even from
>> unrelated programs such as random bash processes. Sometimes the machine
>> keeps working afterwards (although I typically reboot it immediately
>> to limit the consequences of potential kernel data structure damage),
>> sometimes it HPMCs or LPMCs. This is difficult to explain by just a wild
>> zeroed page appearance. But this typically happens when running X11, so
>> again, it might be caused by another bug, such as the GPU randomly
>> writing to memory via misconfigured DMA.
> There was a bug in the unaligned handler for double word instructions (ldd) that was
> recently fixed. ldd/std are not used in userspace, so this problem didn't affect userspace.
Yes, but this fixes the case when a program has a coding bug, performs
an unaligned access and the kernel has to emulate the load. What I'm
seeing is that sometimes, several programs which usually run just fine
with no unaligned accesses all perform an unaligned access at once,
which seems very weird. I sometimes (but not always) see this on X11
startup.
> We have observed that the faults appear SMP and memory size related. A rp4440 with
> 6 CPUs and 4 GB RAM faulted a lot. It's mostly a PA8800/PA8900 issue.
>
> It's months since I had a HPMC or LPMC on rp3440 and c8000. Stalls still happen but they
> are rare.
I have 16 GiB of memory and 4 × PA8900 @ 1 GHz, but I've seen a lot of the faults even
with 2 GiB.
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-09 17:10 ` John David Anglin
@ 2024-05-29 15:54 ` matoro
2024-05-29 16:33 ` John David Anglin
0 siblings, 1 reply; 18+ messages in thread
From: matoro @ 2024-05-29 15:54 UTC (permalink / raw)
To: John David Anglin
Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-05-09 13:10, John David Anglin wrote:
> On 2024-05-08 4:52 p.m., John David Anglin wrote:
>>> with no accompanying stack trace and then the BMC would restart the whole
>>> machine automatically. These were infrequent enough that the segfaults
>>> were the bigger problem, but after applying this patch on top of 6.8, this
>>> changed the dynamic. It seems to occur during builds with varying I/O
>>> loads. For example, I was able to build gcc fine, with no segfaults, but
>>> I was unable to build perl, a much smaller build, without crashing the
>>> machine. I did not observe any segfaults over the day or 2 I ran this
>>> patch, but that's not an unheard-of stretch of
>>> time even without it, and I am being forced to revert because of the panics.
>> Looks like there is a problem with 6.8. I'll do some testing with it.
> So far, I haven't seen any panics with 6.8.9 but I have seen some random
> segmentation faults
> in the gcc testsuite. I looked at one ld fault in some detail. 18
> contiguous words in the elf_link_hash_entry
> struct were zeroed starting with the last word in the bfd_link_hash_entry
> struct causing the fault.
> The section pointer was zeroed.
>
> 18 words is a rather strange number of words to corrupt and corruption
> doesn't seem related
> to object structure. In any case, it is not page related.
>
> It's really hard to tell how this happens. The corrupt object was at a
> slightly different location
> than it is when ld is run under gdb. Can't duplicate in gdb.
>
> Dave
Dave, not sure how much testing you have done with current mainline kernels,
but I've had to temporarily give up on 6.8 and 6.9 for now, as most heavy
builds quickly hit that kernel panic. 6.6 does not seem to have the problem
though. The patch from this thread does not seem to have made a difference
one way or the other w.r.t. segfaults.
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-29 15:54 ` matoro
@ 2024-05-29 16:33 ` John David Anglin
2024-05-30 5:00 ` matoro
0 siblings, 1 reply; 18+ messages in thread
From: John David Anglin @ 2024-05-29 16:33 UTC (permalink / raw)
To: matoro; +Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
[-- Attachment #1: Type: text/plain, Size: 3176 bytes --]
On 2024-05-29 11:54 a.m., matoro wrote:
> On 2024-05-09 13:10, John David Anglin wrote:
>> On 2024-05-08 4:52 p.m., John David Anglin wrote:
>>>> with no accompanying stack trace and then the BMC would restart the whole machine automatically. These were infrequent enough that the
>>>> segfaults were the bigger problem, but after applying this patch on top of 6.8, this changed the dynamic. It seems to occur during builds
>>>> with varying I/O loads. For example, I was able to build gcc fine, with no segfaults, but I was unable to build perl, a much smaller
>>>> build, without crashing the machine. I did not observe any segfaults over the day or 2 I ran this patch, but that's not an unheard-of
>>>> stretch of time even without it, and I am being forced to revert because of the panics.
>>> Looks like there is a problem with 6.8. I'll do some testing with it.
>> So far, I haven't seen any panics with 6.8.9 but I have seen some random segmentation faults
>> in the gcc testsuite. I looked at one ld fault in some detail. 18 contiguous words in the elf_link_hash_entry
>> struct were zeroed starting with the last word in the bfd_link_hash_entry struct causing the fault.
>> The section pointer was zeroed.
>>
>> 18 words is a rather strange number of words to corrupt and corruption doesn't seem related
>> to object structure. In any case, it is not page related.
>>
>> It's really hard to tell how this happens. The corrupt object was at a slightly different location
>> than it is when ld is run under gdb. Can't duplicate in gdb.
>>
>> Dave
>
> Dave, not sure how much testing you have done with current mainline kernels, but I've had to temporarily give up on 6.8 and 6.9 for now, as
> most heavy builds quickly hit that kernel panic. 6.6 does not seem to have the problem though. The patch from this thread does not seem to
> have made a difference one way or the other w.r.t. segfaults.
My latest patch is looking good. I have 6 days of testing on c8000 (1 GHz PA8800) with 6.8.10 and 6.8.11, and I haven't had any random
segmentation faults. The system has been building debian packages. In addition, it has been building and testing gcc. It's on its third
gcc build and check with the patch.

The latest version uses lpa_user() with a fallback to a page table search in flush_cache_page_if_present() to obtain the physical page
address. It revises copy_to_user_page() and copy_from_user_page() to flush the kernel mapping with tmpalias flushes; copy_from_user_page()
was missing the kernel mapping flush. flush_cache_vmap() and flush_cache_vunmap() are moved into cache.c. The TLB is now flushed before
the cache flush in these routines to inhibit move-in. flush_cache_vmap() now handles small VM_IOREMAP flushes instead of flushing the
entire cache; this latter change is an optimization.

If random faults are still present, I believe we will have to give up trying to optimize flush_cache_mm() and flush_cache_range() and
flush the whole cache in these routines.

Some work would be needed to backport my current patch to longterm kernels because of the folio changes in 6.8.
Dave
--
John David Anglin dave.anglin@bell.net
[-- Attachment #2: flush-cache-v13.txt --]
[-- Type: text/plain, Size: 18156 bytes --]
diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h
index ba4c05bc24d6..8597f8c387d7 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -31,18 +31,19 @@ void flush_cache_all_local(void);
void flush_cache_all(void);
void flush_cache_mm(struct mm_struct *mm);
-void flush_kernel_dcache_page_addr(const void *addr);
-
#define flush_kernel_dcache_range(start,size) \
flush_kernel_dcache_range_asm((start), (start)+(size));
+/* The only way to flush a vmap range is to flush whole cache */
#define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1
void flush_kernel_vmap_range(void *vaddr, int size);
void invalidate_kernel_vmap_range(void *vaddr, int size);
-#define flush_cache_vmap(start, end) flush_cache_all()
+// #define flush_cache_vmap(start, end) flush_cache_all()
+void flush_cache_vmap(unsigned long start, unsigned long end);
#define flush_cache_vmap_early(start, end) do { } while (0)
-#define flush_cache_vunmap(start, end) flush_cache_all()
+// #define flush_cache_vunmap(start, end) flush_cache_all()
+void flush_cache_vunmap(unsigned long start, unsigned long end);
void flush_dcache_folio(struct folio *folio);
#define flush_dcache_folio flush_dcache_folio
@@ -77,17 +78,11 @@ void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr,
void flush_cache_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
-/* defined in pacache.S exported in cache.c used by flush_anon_page */
-void flush_dcache_page_asm(unsigned long phys_addr, unsigned long vaddr);
-
#define ARCH_HAS_FLUSH_ANON_PAGE
void flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr);
#define ARCH_HAS_FLUSH_ON_KUNMAP
-static inline void kunmap_flush_on_unmap(const void *addr)
-{
- flush_kernel_dcache_page_addr(addr);
-}
+void kunmap_flush_on_unmap(const void *addr);
#endif /* _PARISC_CACHEFLUSH_H */
diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index 422f3e1e6d9c..b5b1094a1fe7 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -31,20 +31,23 @@
#include <asm/mmu_context.h>
#include <asm/cachectl.h>
+#define PTR_PAGE_ALIGN_DOWN(addr) PTR_ALIGN_DOWN(addr, PAGE_SIZE)
+
int split_tlb __ro_after_init;
int dcache_stride __ro_after_init;
int icache_stride __ro_after_init;
EXPORT_SYMBOL(dcache_stride);
+/* Internal implementation in arch/parisc/kernel/pacache.S */
void flush_dcache_page_asm(unsigned long phys_addr, unsigned long vaddr);
EXPORT_SYMBOL(flush_dcache_page_asm);
void purge_dcache_page_asm(unsigned long phys_addr, unsigned long vaddr);
void flush_icache_page_asm(unsigned long phys_addr, unsigned long vaddr);
-
-/* Internal implementation in arch/parisc/kernel/pacache.S */
void flush_data_cache_local(void *); /* flushes local data-cache only */
void flush_instruction_cache_local(void); /* flushes local code-cache only */
+static void flush_kernel_dcache_page_addr(const void *addr);
+
/* On some machines (i.e., ones with the Merced bus), there can be
* only a single PxTLB broadcast at a time; this must be guaranteed
* by software. We need a spinlock around all TLB flushes to ensure
@@ -321,6 +324,18 @@ __flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr,
{
if (!static_branch_likely(&parisc_has_cache))
return;
+
+ /*
+ * The TLB is the engine of coherence on parisc. The CPU is
+ * entitled to speculate any page with a TLB mapping, so here
+ * we kill the mapping then flush the page along a special flush
+ * only alias mapping. This guarantees that the page is no-longer
+ * in the cache for any process and nor may it be speculatively
+ * read in (until the user or kernel specifically accesses it,
+ * of course).
+ */
+ flush_tlb_page(vma, vmaddr);
+
preempt_disable();
flush_dcache_page_asm(physaddr, vmaddr);
if (vma->vm_flags & VM_EXEC)
@@ -328,18 +343,66 @@ __flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr,
preempt_enable();
}
-static void flush_user_cache_page(struct vm_area_struct *vma, unsigned long vmaddr)
+static void flush_kernel_dcache_page_addr(const void *addr)
{
- unsigned long flags, space, pgd, prot;
-#ifdef CONFIG_TLB_PTLOCK
- unsigned long pgd_lock;
-#endif
+ unsigned long vaddr = (unsigned long)addr;
+ unsigned long flags;
- vmaddr &= PAGE_MASK;
+ /* Purge TLB entry to remove translation on all CPUs */
+ purge_tlb_start(flags);
+ pdtlb(SR_KERNEL, addr);
+ purge_tlb_end(flags);
+ /* Use tmpalias flush to prevent data cache move-in */
preempt_disable();
+ flush_dcache_page_asm(__pa(vaddr), vaddr);
+ preempt_enable();
+}
+
+static void flush_kernel_icache_page_addr(const void *addr)
+{
+ unsigned long vaddr = (unsigned long)addr;
+ unsigned long flags;
+
+ /* Purge TLB entry to remove translation on all CPUs */
+ purge_tlb_start(flags);
+ pdtlb(SR_KERNEL, addr);
+ purge_tlb_end(flags);
+
+ /* Use tmpalias flush to prevent instruction cache move-in */
+ preempt_disable();
+ flush_icache_page_asm(__pa(vaddr), vaddr);
+ preempt_enable();
+}
+
+void kunmap_flush_on_unmap(const void *addr)
+{
+ flush_kernel_dcache_page_addr(addr);
+}
+EXPORT_SYMBOL(kunmap_flush_on_unmap);
+
+void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
+ unsigned int nr)
+{
+ void *kaddr = page_address(page);
+
+ for (;;) {
+ flush_kernel_dcache_page_addr(kaddr);
+ flush_kernel_icache_page_addr(kaddr);
+ if (--nr == 0)
+ break;
+ kaddr += PAGE_SIZE;
+ }
+}
- /* Set context for flush */
+static inline unsigned long get_upa(struct mm_struct *mm, unsigned long addr)
+{
+ unsigned long flags, space, pgd, prot, pa;
+#ifdef CONFIG_TLB_PTLOCK
+ unsigned long pgd_lock;
+#endif
+
+ /* Save context */
local_irq_save(flags);
prot = mfctl(8);
space = mfsp(SR_USER);
@@ -347,16 +410,12 @@ static void flush_user_cache_page(struct vm_area_struct *vma, unsigned long vmad
#ifdef CONFIG_TLB_PTLOCK
pgd_lock = mfctl(28);
#endif
- switch_mm_irqs_off(NULL, vma->vm_mm, NULL);
- local_irq_restore(flags);
- flush_user_dcache_range_asm(vmaddr, vmaddr + PAGE_SIZE);
- if (vma->vm_flags & VM_EXEC)
- flush_user_icache_range_asm(vmaddr, vmaddr + PAGE_SIZE);
- flush_tlb_page(vma, vmaddr);
+ /* Set context for lpa_user */
+ switch_mm_irqs_off(NULL, mm, NULL);
+ pa = lpa_user(addr);
/* Restore previous context */
- local_irq_save(flags);
#ifdef CONFIG_TLB_PTLOCK
mtctl(pgd_lock, 28);
#endif
@@ -365,21 +424,7 @@ static void flush_user_cache_page(struct vm_area_struct *vma, unsigned long vmad
mtctl(prot, 8);
local_irq_restore(flags);
- preempt_enable();
-}
-
-void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
- unsigned int nr)
-{
- void *kaddr = page_address(page);
-
- for (;;) {
- flush_kernel_dcache_page_addr(kaddr);
- flush_kernel_icache_page(kaddr);
- if (--nr == 0)
- break;
- kaddr += PAGE_SIZE;
- }
+ return pa;
}
static inline pte_t *get_ptep(struct mm_struct *mm, unsigned long addr)
@@ -404,12 +449,6 @@ static inline pte_t *get_ptep(struct mm_struct *mm, unsigned long addr)
return ptep;
}
-static inline bool pte_needs_flush(pte_t pte)
-{
- return (pte_val(pte) & (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_NO_CACHE))
- == (_PAGE_PRESENT | _PAGE_ACCESSED);
-}
-
void flush_dcache_folio(struct folio *folio)
{
struct address_space *mapping = folio_flush_mapping(folio);
@@ -458,50 +497,23 @@ void flush_dcache_folio(struct folio *folio)
if (addr + nr * PAGE_SIZE > vma->vm_end)
nr = (vma->vm_end - addr) / PAGE_SIZE;
- if (parisc_requires_coherency()) {
- for (i = 0; i < nr; i++) {
- pte_t *ptep = get_ptep(vma->vm_mm,
- addr + i * PAGE_SIZE);
- if (!ptep)
- continue;
- if (pte_needs_flush(*ptep))
- flush_user_cache_page(vma,
- addr + i * PAGE_SIZE);
- /* Optimise accesses to the same table? */
- pte_unmap(ptep);
- }
- } else {
+ if (old_addr == 0 || (old_addr & (SHM_COLOUR - 1))
+ != (addr & (SHM_COLOUR - 1))) {
+ for (i = 0; i < nr; i++)
+ __flush_cache_page(vma,
+ addr + i * PAGE_SIZE,
+ (pfn + i) * PAGE_SIZE);
/*
- * The TLB is the engine of coherence on parisc:
- * The CPU is entitled to speculate any page
- * with a TLB mapping, so here we kill the
- * mapping then flush the page along a special
- * flush only alias mapping. This guarantees that
- * the page is no-longer in the cache for any
- * process and nor may it be speculatively read
- * in (until the user or kernel specifically
- * accesses it, of course)
+ * Software is allowed to have any number
+ * of private mappings to a page.
*/
- for (i = 0; i < nr; i++)
- flush_tlb_page(vma, addr + i * PAGE_SIZE);
- if (old_addr == 0 || (old_addr & (SHM_COLOUR - 1))
- != (addr & (SHM_COLOUR - 1))) {
- for (i = 0; i < nr; i++)
- __flush_cache_page(vma,
- addr + i * PAGE_SIZE,
- (pfn + i) * PAGE_SIZE);
- /*
- * Software is allowed to have any number
- * of private mappings to a page.
- */
- if (!(vma->vm_flags & VM_SHARED))
- continue;
- if (old_addr)
- pr_err("INEQUIVALENT ALIASES 0x%lx and 0x%lx in file %pD\n",
- old_addr, addr, vma->vm_file);
- if (nr == folio_nr_pages(folio))
- old_addr = addr;
- }
+ if (!(vma->vm_flags & VM_SHARED))
+ continue;
+ if (old_addr)
+ pr_err("INEQUIVALENT ALIASES 0x%lx and 0x%lx in file %pD\n",
+ old_addr, addr, vma->vm_file);
+ if (nr == folio_nr_pages(folio))
+ old_addr = addr;
}
WARN_ON(++count == 4096);
}
@@ -591,35 +603,31 @@ extern void purge_kernel_dcache_page_asm(unsigned long);
extern void clear_user_page_asm(void *, unsigned long);
extern void copy_user_page_asm(void *, void *, unsigned long);
-void flush_kernel_dcache_page_addr(const void *addr)
-{
- unsigned long flags;
-
- flush_kernel_dcache_page_asm(addr);
- purge_tlb_start(flags);
- pdtlb(SR_KERNEL, addr);
- purge_tlb_end(flags);
-}
-EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
-
static void flush_cache_page_if_present(struct vm_area_struct *vma,
- unsigned long vmaddr, unsigned long pfn)
+ unsigned long vmaddr)
{
- bool needs_flush = false;
pte_t *ptep;
+ unsigned long flags, pfn, physaddr;
+ struct mm_struct *mm = vma->vm_mm;
+
+ physaddr = get_upa(mm, vmaddr);
+ if (!physaddr) {
+ spin_lock_irqsave(&mm->page_table_lock, flags);
+ ptep = get_ptep(mm, vmaddr);
+ if (!ptep) {
+ spin_unlock_irqrestore(&mm->page_table_lock, flags);
+ return;
+ }
+ pfn = pte_pfn(*ptep);
+ spin_unlock_irqrestore(&mm->page_table_lock, flags);
- /*
- * The pte check is racy and sometimes the flush will trigger
- * a non-access TLB miss. Hopefully, the page has already been
- * flushed.
- */
- ptep = get_ptep(vma->vm_mm, vmaddr);
- if (ptep) {
- needs_flush = pte_needs_flush(*ptep);
+ if (WARN_ON(!pfn_valid(pfn)))
+ return;
+ physaddr = PFN_PHYS(pfn);
pte_unmap(ptep);
}
- if (needs_flush)
- flush_cache_page(vma, vmaddr, pfn);
+
+ __flush_cache_page(vma, vmaddr, physaddr);
}
void copy_user_highpage(struct page *to, struct page *from,
@@ -629,7 +637,7 @@ void copy_user_highpage(struct page *to, struct page *from,
kfrom = kmap_local_page(from);
kto = kmap_local_page(to);
- flush_cache_page_if_present(vma, vaddr, page_to_pfn(from));
+ __flush_cache_page(vma, vaddr, PFN_PHYS(page_to_pfn(from)));
copy_page_asm(kto, kfrom);
kunmap_local(kto);
kunmap_local(kfrom);
@@ -638,16 +646,17 @@ void copy_user_highpage(struct page *to, struct page *from,
void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)
{
- flush_cache_page_if_present(vma, user_vaddr, page_to_pfn(page));
+ __flush_cache_page(vma, user_vaddr, PFN_PHYS(page_to_pfn(page)));
memcpy(dst, src, len);
- flush_kernel_dcache_range_asm((unsigned long)dst, (unsigned long)dst + len);
+ flush_kernel_dcache_page_addr(PTR_PAGE_ALIGN_DOWN(dst));
}
void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)
{
- flush_cache_page_if_present(vma, user_vaddr, page_to_pfn(page));
+ __flush_cache_page(vma, user_vaddr, PFN_PHYS(page_to_pfn(page)));
memcpy(dst, src, len);
+ flush_kernel_dcache_page_addr(PTR_PAGE_ALIGN_DOWN(src));
}
/* __flush_tlb_range()
@@ -681,32 +690,10 @@ int __flush_tlb_range(unsigned long sid, unsigned long start,
static void flush_cache_pages(struct vm_area_struct *vma, unsigned long start, unsigned long end)
{
- unsigned long addr, pfn;
- pte_t *ptep;
+ unsigned long addr;
- for (addr = start; addr < end; addr += PAGE_SIZE) {
- bool needs_flush = false;
- /*
- * The vma can contain pages that aren't present. Although
- * the pte search is expensive, we need the pte to find the
- * page pfn and to check whether the page should be flushed.
- */
- ptep = get_ptep(vma->vm_mm, addr);
- if (ptep) {
- needs_flush = pte_needs_flush(*ptep);
- pfn = pte_pfn(*ptep);
- pte_unmap(ptep);
- }
- if (needs_flush) {
- if (parisc_requires_coherency()) {
- flush_user_cache_page(vma, addr);
- } else {
- if (WARN_ON(!pfn_valid(pfn)))
- return;
- __flush_cache_page(vma, addr, PFN_PHYS(pfn));
- }
- }
- }
+ for (addr = start; addr < end; addr += PAGE_SIZE)
+ flush_cache_page_if_present(vma, addr);
}
static inline unsigned long mm_total_size(struct mm_struct *mm)
@@ -757,21 +744,19 @@ void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned
if (WARN_ON(IS_ENABLED(CONFIG_SMP) && arch_irqs_disabled()))
return;
flush_tlb_range(vma, start, end);
- flush_cache_all();
+ if (vma->vm_flags & VM_EXEC)
+ flush_cache_all();
+ else
+ flush_data_cache();
return;
}
- flush_cache_pages(vma, start, end);
+ flush_cache_pages(vma, start & PAGE_MASK, end);
}
void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn)
{
- if (WARN_ON(!pfn_valid(pfn)))
- return;
- if (parisc_requires_coherency())
- flush_user_cache_page(vma, vmaddr);
- else
- __flush_cache_page(vma, vmaddr, PFN_PHYS(pfn));
+ __flush_cache_page(vma, vmaddr, PFN_PHYS(pfn));
}
void flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr)
@@ -779,35 +764,91 @@ void flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned lon
if (!PageAnon(page))
return;
- if (parisc_requires_coherency()) {
- if (vma->vm_flags & VM_SHARED)
- flush_data_cache();
- else
- flush_user_cache_page(vma, vmaddr);
+ __flush_cache_page(vma, vmaddr, PFN_PHYS(page_to_pfn(page)));
+}
+
+/*
+ * The physical address for pages in the ioremap case can be obtained
+ * from the vm_struct struct. I wasn't able to successfully handle the
+ * vmalloc and vmap cases. We have an array of struct page pointers in
+ * the uninitialized vmalloc case but the flush failed using page_to_pfn.
+ */
+void flush_cache_vmap(unsigned long start, unsigned long end)
+{
+ unsigned long addr, physaddr;
+ struct vm_struct *vm;
+
+ /* Prevent cache move-in */
+ flush_tlb_kernel_range(start, end);
+
+ if (end - start >= parisc_cache_flush_threshold) {
+ flush_cache_all();
return;
}
- flush_tlb_page(vma, vmaddr);
- preempt_disable();
- flush_dcache_page_asm(page_to_phys(page), vmaddr);
- preempt_enable();
+ if (WARN_ON_ONCE(!is_vmalloc_addr((void *)start))) {
+ flush_cache_all();
+ return;
+ }
+
+ vm = find_vm_area((void *)start);
+ if (WARN_ON_ONCE(!vm)) {
+ flush_cache_all();
+ return;
+ }
+
+ /* The physical addresses of IOREMAP regions are contiguous */
+ if (vm->flags & VM_IOREMAP) {
+ physaddr = vm->phys_addr;
+ for (addr = start; addr < end; addr += PAGE_SIZE) {
+ preempt_disable();
+ flush_dcache_page_asm(physaddr, start);
+ flush_icache_page_asm(physaddr, start);
+ preempt_enable();
+ physaddr += PAGE_SIZE;
+ }
+ return;
+ }
+
+ flush_cache_all();
}
+EXPORT_SYMBOL(flush_cache_vmap);
+/*
+ * The vm_struct has been retired and the page table is set up. The
+ * last page in the range is a guard page. Its physical address can't
+ * be determined using lpa, so there is no way to flush the range
+ * using flush_dcache_page_asm.
+ */
+void flush_cache_vunmap(unsigned long start, unsigned long end)
+{
+ /* Prevent cache move-in */
+ flush_tlb_kernel_range(start, end);
+ flush_data_cache();
+}
+EXPORT_SYMBOL(flush_cache_vunmap);
+
+/*
+ * On systems with PA8800/PA8900 processors, there is no way to flush
+ * a vmap range other than using the architected loop to flush the
+ * entire cache. The page directory is not set up, so we can't use
+ * fdc, etc. FDCE/FICE don't work to flush a portion of the cache.
+ * L2 is physically indexed but FDCE/FICE instructions in virtual
+ * mode output their virtual address on the core bus, not their
+ * real address. As a result, the L2 cache index formed from the
+ * virtual address will most likely not be the same as the L2 index
+ * formed from the real address.
+ */
void flush_kernel_vmap_range(void *vaddr, int size)
{
unsigned long start = (unsigned long)vaddr;
unsigned long end = start + size;
- if ((!IS_ENABLED(CONFIG_SMP) || !arch_irqs_disabled()) &&
- (unsigned long)size >= parisc_cache_flush_threshold) {
- flush_tlb_kernel_range(start, end);
- flush_data_cache();
- return;
- }
+ BUG_ON(IS_ENABLED(CONFIG_SMP) && arch_irqs_disabled());
- flush_kernel_dcache_range_asm(start, end);
flush_tlb_kernel_range(start, end);
-}
+ flush_data_cache();
+ }
EXPORT_SYMBOL(flush_kernel_vmap_range);
void invalidate_kernel_vmap_range(void *vaddr, int size)
@@ -818,16 +859,11 @@ void invalidate_kernel_vmap_range(void *vaddr, int size)
/* Ensure DMA is complete */
asm_syncdma();
- if ((!IS_ENABLED(CONFIG_SMP) || !arch_irqs_disabled()) &&
- (unsigned long)size >= parisc_cache_flush_threshold) {
- flush_tlb_kernel_range(start, end);
- flush_data_cache();
- return;
- }
+ BUG_ON(IS_ENABLED(CONFIG_SMP) && arch_irqs_disabled());
- purge_kernel_dcache_range_asm(start, end);
flush_tlb_kernel_range(start, end);
-}
+ flush_data_cache();
+ }
EXPORT_SYMBOL(invalidate_kernel_vmap_range);
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-29 16:33 ` John David Anglin
@ 2024-05-30 5:00 ` matoro
2024-06-04 15:07 ` matoro
0 siblings, 1 reply; 18+ messages in thread
From: matoro @ 2024-05-30 5:00 UTC (permalink / raw)
To: John David Anglin
Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-05-29 12:33, John David Anglin wrote:
> On 2024-05-29 11:54 a.m., matoro wrote:
>> On 2024-05-09 13:10, John David Anglin wrote:
>>> On 2024-05-08 4:52 p.m., John David Anglin wrote:
>>>>> with no accompanying stack trace and then the BMC would restart the
>>>>> whole machine automatically. These were infrequent enough that the
>>>>> segfaults were the bigger problem, but after applying this patch on top
>>>>> of 6.8, this changed the dynamic. It seems to occur during builds with
>>>>> varying I/O loads. For example, I was able to build gcc fine, with no
>>>>> segfaults, but I was unable to build perl, a much smaller build, without
>>>>> crashing the machine. I did not observe any segfaults over the day or 2
>>>>> I ran this patch, but that's not an unheard-of stretch of
>>>>> time even without it, and I am being forced to revert because of the panics.
>>>> Looks like there is a problem with 6.8. I'll do some testing with it.
>>> So far, I haven't seen any panics with 6.8.9 but I have seen some random
>>> segmentation faults
>>> in the gcc testsuite. I looked at one ld fault in some detail. 18
>>> contiguous words in the elf_link_hash_entry
>>> struct were zeroed starting with the last word in the bfd_link_hash_entry
>>> struct causing the fault.
>>> The section pointer was zeroed.
>>>
>>> 18 words is a rather strange number of words to corrupt and corruption
>>> doesn't seem related
>>> to object structure. In any case, it is not page related.
>>>
>>> It's really hard to tell how this happens. The corrupt object was at a
>>> slightly different location
>>> than it is when ld is run under gdb. Can't duplicate in gdb.
>>>
>>> Dave
>>
>> Dave, not sure how much testing you have done with current mainline
>> kernels, but I've had to temporarily give up on 6.8 and 6.9 for now, as
>> most heavy builds quickly hit that kernel panic. 6.6 does not seem to have
>> the problem though. The patch from this thread does not seem to have made
>> a difference one way or the other w.r.t. segfaults.
> My latest patch is looking good. I have 6 days of testing on c8000 (1 GHz
> PA8800) with 6.8.10 and 6.8.11, and I haven't had any random segmentation
> faults. System has been building debian packages. In addition, it has been
> building and testing gcc. It's on its third gcc build and check with patch.
>
> The latest version uses lpa_user() with fallback to page table search in
> flush_cache_page_if_present() to obtain physical page address.
> It revises copy_to_user_page() and copy_from_user_page() to flush kernel
> mapping with tmpalias flushes. copy_from_user_page()
> was missing kernel mapping flush. flush_cache_vmap() and
> flush_cache_vunmap() are moved into cache.c. TLB is now flushed before
> cache flush to inhibit move-in in these routines. flush_cache_vmap() now
> handles small VM_IOREMAP flushes instead of flushing
> entire cache. This latter change is an optimization.
>
> If random faults are still present, I believe we will have to give up trying
> to optimize flush_cache_mm() and flush_cache_range() and
> flush the whole cache in these routines.
>
> Some work would be needed to backport my current patch to longterm kernels
> because of folio changes in 6.8.
>
> Dave
Thanks a ton Dave, I've applied this on top of 6.9.2 and also think I'm
seeing improvement! No panics yet, I have a couple weeks' worth of package
testing to catch up on so I'll report if I see anything!
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-05-30 5:00 ` matoro
@ 2024-06-04 15:07 ` matoro
2024-06-04 17:08 ` John David Anglin
0 siblings, 1 reply; 18+ messages in thread
From: matoro @ 2024-06-04 15:07 UTC (permalink / raw)
To: John David Anglin
Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-05-30 01:00, matoro wrote:
> On 2024-05-29 12:33, John David Anglin wrote:
>> On 2024-05-29 11:54 a.m., matoro wrote:
>>> On 2024-05-09 13:10, John David Anglin wrote:
>>>> On 2024-05-08 4:52 p.m., John David Anglin wrote:
>>>>>> with no accompanying stack trace and then the BMC would restart the
>>>>>> whole machine automatically. These were infrequent enough that the
>>>>>> segfaults were the bigger problem, but after applying this patch on top
>>>>>> of 6.8, this changed the dynamic. It seems to occur during builds with
>>>>>> varying I/O loads. For example, I was able to build gcc fine, with no
>>>>>> segfaults, but I was unable to build perl, a much smaller build,
>>>>>> without crashing the machine. I did not observe any segfaults over the
>>>>>> day or 2 I ran this patch, but that's not an unheard-of stretch of
>>>>>> time even without it, and I am being forced to revert because of the panics.
>>>>> Looks like there is a problem with 6.8. I'll do some testing with it.
>>>> So far, I haven't seen any panics with 6.8.9 but I have seen some random
>>>> segmentation faults
>>>> in the gcc testsuite. I looked at one ld fault in some detail. 18
>>>> contiguous words in the elf_link_hash_entry
>>>> struct were zeroed starting with the last word in the bfd_link_hash_entry
>>>> struct causing the fault.
>>>> The section pointer was zeroed.
>>>>
>>>> 18 words is a rather strange number of words to corrupt and corruption
>>>> doesn't seem related
>>>> to object structure. In any case, it is not page related.
>>>>
>>>> It's really hard to tell how this happens. The corrupt object was at a
>>>> slightly different location
>>>> than it is when ld is run under gdb. Can't duplicate in gdb.
>>>>
>>>> Dave
>>>
>>> Dave, not sure how much testing you have done with current mainline
>>> kernels, but I've had to temporarily give up on 6.8 and 6.9 for now, as
>>> most heavy builds quickly hit that kernel panic. 6.6 does not seem to have
>>> the problem though. The patch from this thread does not seem to have made
>>> a difference one way or the other w.r.t. segfaults.
>> My latest patch is looking good. I have 6 days of testing on c8000 (1 GHz
>> PA8800) with 6.8.10 and 6.8.11, and I haven't had any random segmentation
>> faults. System has been building debian packages. In addition, it has
>> been building and testing gcc. It's on its third gcc build and check with
>> patch.
>>
>> The latest version uses lpa_user() with fallback to page table search in
>> flush_cache_page_if_present() to obtain physical page address.
>> It revises copy_to_user_page() and copy_from_user_page() to flush kernel
>> mapping with tmpalias flushes. copy_from_user_page()
>> was missing kernel mapping flush. flush_cache_vmap() and
>> flush_cache_vunmap() are moved into cache.c. TLB is now flushed before
>> cache flush to inhibit move-in in these routines. flush_cache_vmap() now
>> handles small VM_IOREMAP flushes instead of flushing
>> entire cache. This latter change is an optimization.
>>
>> If random faults are still present, I believe we will have to give up
>> trying to optimize flush_cache_mm() and flush_cache_range() and
>> flush the whole cache in these routines.
>>
>> Some work would be needed to backport my current patch to longterm kernels
>> because of folio changes in 6.8.
>>
>> Dave
>
> Thanks a ton Dave, I've applied this on top of 6.9.2 and also think I'm
> seeing improvement! No panics yet, I have a couple weeks' worth of package
> testing to catch up on so I'll report if I see anything!
I've seen a few warnings in my dmesg while testing, although I didn't see any
immediately corresponding failures. Any danger?
[Sun Jun 2 18:46:29 2024] ------------[ cut here ]------------
[Sun Jun 2 18:46:29 2024] WARNING: CPU: 0 PID: 26808 at arch/parisc/kernel/cache.c:624 flush_cache_page_if_present+0x1a4/0x330
[Sun Jun 2 18:46:29 2024] Modules linked in: raw_diag tcp_diag inet_diag
netlink_diag unix_diag nfnetlink overlay loop nfsv4 dns_resolver nfs
lockd grace sunrpc netfs autofs4 binfmt_misc sr_mod ohci_pci cdrom ehci_pci
ohci_hcd ehci_hcd tg3 pata_cmd64x usbcore ipmi_si hwmon usb_common
libata libphy ipmi_devintf nls_base ipmi_msghandler
[Sun Jun 2 18:46:29 2024] CPU: 0 PID: 26808 Comm: bash Tainted: G W 6.9.3-gentoo-parisc64 #1
[Sun Jun 2 18:46:29 2024] Hardware name: 9000/800/rp3440
[Sun Jun 2 18:46:29 2024] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[Sun Jun 2 18:46:29 2024] PSW: 00001000000001101111100100001111 Tainted: G W
[Sun Jun 2 18:46:29 2024] r00-03 000000ff0806f90f 000000004106b280
00000000402090bc 000000005160c6a0
[Sun Jun 2 18:46:29 2024] r04-07 0000000040f99a80 00000000f96da000
00000001659a2360 000000000800000f
[Sun Jun 2 18:46:29 2024] r08-11 0000000c0063f89c 0000000000000000
000000004ce09e9c 000000005160c5a8
[Sun Jun 2 18:46:29 2024] r12-15 000000004ce09eb0 00000000414ebd70
0000000041687768 0000000041646830
[Sun Jun 2 18:46:29 2024] r16-19 00000000516333c0 0000000001200000
00000001c36be780 0000000000000003
[Sun Jun 2 18:46:29 2024] r20-23 0000000000001a46 000000000f584000
ffffffffc0000000 000000000000000f
[Sun Jun 2 18:46:29 2024] r24-27 0000000000000000 000000000800000f
000000004ce09ea0 0000000040f99a80
[Sun Jun 2 18:46:29 2024] r28-31 0000000000000000 000000005160c720
000000005160c750 0000000000000000
[Sun Jun 2 18:46:29 2024] sr00-03 00000000052be800 00000000052be800
0000000000000000 00000000052be800
[Sun Jun 2 18:46:29 2024] sr04-07 0000000000000000 0000000000000000
0000000000000000 0000000000000000
[Sun Jun 2 18:46:29 2024] IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000040209104 0000000040209108
[Sun Jun 2 18:46:29 2024] IIR: 03ffe01f ISR: 0000000010240000 IOR: 0000003382609ea0
[Sun Jun 2 18:46:29 2024] CPU: 0 CR30: 00000000516333c0 CR31: fffffff0f0e05ee0
[Sun Jun 2 18:46:29 2024] ORIG_R28: 000000005160c7b0
[Sun Jun 2 18:46:29 2024] IAOQ[0]: flush_cache_page_if_present+0x1a4/0x330
[Sun Jun 2 18:46:29 2024] IAOQ[1]: flush_cache_page_if_present+0x1a8/0x330
[Sun Jun 2 18:46:29 2024] RP(r2): flush_cache_page_if_present+0x15c/0x330
[Sun Jun 2 18:46:29 2024] Backtrace:
[Sun Jun 2 18:46:29 2024] [<000000004020afb8>] flush_cache_mm+0x1a8/0x1c8
[Sun Jun 2 18:46:29 2024] [<000000004023cf3c>] copy_mm+0x2a8/0xfd0
[Sun Jun 2 18:46:29 2024] [<0000000040241040>] copy_process+0x1684/0x26e8
[Sun Jun 2 18:46:29 2024] [<0000000040242218>] kernel_clone+0xcc/0x754
[Sun Jun 2 18:46:29 2024] [<0000000040242908>] __do_sys_clone+0x68/0x80
[Sun Jun 2 18:46:29 2024] [<0000000040242d14>] sys_clone+0x30/0x60
[Sun Jun 2 18:46:29 2024] [<0000000040203fbc>] syscall_exit+0x0/0x10
[Sun Jun 2 18:46:29 2024] ---[ end trace 0000000000000000 ]---
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-06-04 15:07 ` matoro
@ 2024-06-04 17:08 ` John David Anglin
2024-06-10 19:52 ` matoro
0 siblings, 1 reply; 18+ messages in thread
From: John David Anglin @ 2024-06-04 17:08 UTC (permalink / raw)
To: matoro; +Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-06-04 11:07 a.m., matoro wrote:
>> Thanks a ton Dave, I've applied this on top of 6.9.2 and also think I'm seeing improvement! No panics yet, I have a couple week's worth of
>> package testing to catch up on so I'll report if I see anything!
>
> I've seen a few warnings in my dmesg while testing, although I didn't see any immediately corresponding failures. Any danger?
We have determined that most of the warnings arise from pages that have been swapped out. Mostly, it seems these
pages have been flushed to memory before the pte is changed to a swap pte. There might be issues for pages that
have been cleared. It is possible the random faults aren't related to the warning I added to
flush_cache_page_if_present for pages with an invalid pfn. The only thing I know for certain is that there is no way
to flush these pages on parisc other than flushing the whole cache.
My c8000 has run almost two weeks without any random faults. On the other hand, Helge has two machines that
frequently fault and generate these warnings.
Flushing the whole cache in flush_cache_mm and flush_cache_range might eliminate the random faults but
there will be a significant performance hit.
Dave
--
John David Anglin dave.anglin@bell.net
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-06-04 17:08 ` John David Anglin
@ 2024-06-10 19:52 ` matoro
2024-06-10 20:17 ` John David Anglin
0 siblings, 1 reply; 18+ messages in thread
From: matoro @ 2024-06-10 19:52 UTC (permalink / raw)
To: John David Anglin
Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-06-04 13:08, John David Anglin wrote:
> On 2024-06-04 11:07 a.m., matoro wrote:
>>> Thanks a ton Dave, I've applied this on top of 6.9.2 and also think I'm
>>> seeing improvement! No panics yet, I have a couple week's worth of
>>> package testing to catch up on so I'll report if I see anything!
>>
>> I've seen a few warnings in my dmesg while testing, although I didn't see
>> any immediately corresponding failures. Any danger?
> We have determined most of the warnings arise from pages that have been
> swapped out. Mostly, it seems these
> pages have been flushed to memory before the pte is changed to a swap pte.
> There might be issues for pages that
> have been cleared. It is possible the random faults aren't related to the
> warning I added for pages with an invalid pfn
> in flush_cache_page_if_present. The only thing I know for certain is there
> is no way to flush these pages on parisc
> other than flushing the whole cache.
>
> My c8000 has run almost two weeks without any random faults. On the other
> hand, Helge has two machines that
> frequently fault and generate these warnings.
>
> Flushing the whole cache in flush_cache_mm and flush_cache_range might
> eliminate the random faults but
> there will be a significant performance hit.
>
> Dave
Unfortunately I had a few of these faults trip today after ~4 days of uptime
with corresponding random segfaults. One of the WARNs was emitted shortly
before, though not for the same PID. Reattempted the build twice and
randomly segfaulted all 3 times. Had to reboot as usual to get it out of the
bad state.
[Mon Jun 10 14:26:20 2024] ------------[ cut here ]------------
[Mon Jun 10 14:26:20 2024] WARNING: CPU: 1 PID: 26453 at arch/parisc/kernel/cache.c:624 flush_cache_page_if_present+0x1a4/0x330
[Mon Jun 10 14:26:20 2024] Modules linked in: nfnetlink af_packet overlay loop nfsv4 dns_resolver nfs lockd grace sunrpc netfs autofs4 binfmt_misc sr_mod ohci_pci cdrom ehci_pci ohci_hcd ehci_hcd tg3 usbcore pata_cmd64x ipmi_si hwmon usb_common ipmi_devintf libata libphy nls_base ipmi_msghandler
[Mon Jun 10 14:26:20 2024] CPU: 1 PID: 26453 Comm: ld.so.1 Tainted: G W 6.9.3-gentoo-parisc64 #1
[Mon Jun 10 14:26:20 2024] Hardware name: 9000/800/rp3440
[Mon Jun 10 14:26:20 2024] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[Mon Jun 10 14:26:20 2024] PSW: 00001000000001001111100100001111 Tainted: G W
[Mon Jun 10 14:26:20 2024] r00-03 000000ff0804f90f 000000004106b280
00000000402090bc 000000007f4c85f0
[Mon Jun 10 14:26:20 2024] r04-07 0000000040f99a80 00000000f855d000
00000000561b6360 000000000800000f
[Mon Jun 10 14:26:20 2024] r08-11 0000000c009674de 0000000000000000
0000004100b2e39c 000000007f4c81c0
[Mon Jun 10 14:26:20 2024] r12-15 00000000561b6360 0000004100b2e330
0000000000000002 0000000000000000
[Mon Jun 10 14:26:20 2024] r16-19 0000000040f50360 fffffffffffffff4
000000007f4c8108 0000000000000003
[Mon Jun 10 14:26:20 2024] r20-23 0000000000001a46 0000000011b81000
ffffffffc0000000 00000000f859d000
[Mon Jun 10 14:26:20 2024] r24-27 0000000000000000 000000000800000f
0000004100b2e3a0 0000000040f99a80
[Mon Jun 10 14:26:20 2024] r28-31 0000000000000000 000000007f4c8670
000000007f4c86a0 0000000000000000
[Mon Jun 10 14:26:20 2024] sr00-03 000000000604d000 000000000604d000
0000000000000000 000000000604d000
[Mon Jun 10 14:26:20 2024] sr04-07 0000000000000000 0000000000000000
0000000000000000 0000000000000000
[Mon Jun 10 14:26:20 2024] IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000040209104 0000000040209108
[Mon Jun 10 14:26:20 2024] IIR: 03ffe01f ISR: 0000000000000000 IOR: 0000000000000000
[Mon Jun 10 14:26:20 2024] CPU: 1 CR30: 00000001e700e780 CR31: fffffff0f0e05ee0
[Mon Jun 10 14:26:20 2024] ORIG_R28: 00000000414cab90
[Mon Jun 10 14:26:20 2024] IAOQ[0]: flush_cache_page_if_present+0x1a4/0x330
[Mon Jun 10 14:26:20 2024] IAOQ[1]: flush_cache_page_if_present+0x1a8/0x330
[Mon Jun 10 14:26:20 2024] RP(r2): flush_cache_page_if_present+0x15c/0x330
[Mon Jun 10 14:26:20 2024] Backtrace:
[Mon Jun 10 14:26:20 2024] [<000000004020b110>] flush_cache_range+0x138/0x158
[Mon Jun 10 14:26:20 2024] [<00000000405fdfc8>] change_protection+0x134/0xb78
[Mon Jun 10 14:26:20 2024] [<00000000405feb4c>] mprotect_fixup+0x140/0x478
[Mon Jun 10 14:26:20 2024] [<00000000405ff15c>] do_mprotect_pkey.constprop.0+0x2d8/0x5f0
[Mon Jun 10 14:26:20 2024] [<00000000405ff4a4>] sys_mprotect+0x30/0x60
[Mon Jun 10 14:26:20 2024] [<0000000040203fbc>] syscall_exit+0x0/0x10
[Mon Jun 10 14:26:20 2024] ---[ end trace 0000000000000000 ]---
[Mon Jun 10 14:28:04 2024] do_page_fault() command='ld.so.1' type=15 address=0x161236a0 in libc.so[f8b9c000+1b6000]
trap #15: Data TLB miss fault, vm_start = 0x4208e000, vm_end = 0x420af000
[Mon Jun 10 14:28:04 2024] CPU: 0 PID: 26681 Comm: ld.so.1 Tainted: G W 6.9.3-gentoo-parisc64 #1
[Mon Jun 10 14:28:04 2024] Hardware name: 9000/800/rp3440
[Mon Jun 10 14:28:04 2024] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[Mon Jun 10 14:28:04 2024] PSW: 00000000000001100000000000001111 Tainted: G W
[Mon Jun 10 14:28:04 2024] r00-03 000000000006000f 00000000f8d584a8
00000000f8c46e33 0000000000000028
[Mon Jun 10 14:28:04 2024] r04-07 00000000f8d54660 00000000f8d54648
0000000000000020 000000000001ab91
[Mon Jun 10 14:28:04 2024] r08-11 00000000f8d54654 00000000f8d5bf78
0000000000000005 00000000f9ad87c8
[Mon Jun 10 14:28:04 2024] r12-15 0000000000000000 0000000000000000
000000000000003f 00000000000003e9
[Mon Jun 10 14:28:04 2024] r16-19 000000000001a000 000000000001a000
000000000001a000 00000000f8d56ca8
[Mon Jun 10 14:28:04 2024] r20-23 0000000000000000 00000000f8c46bcc
000000000001a2d8 00000000ffffffff
[Mon Jun 10 14:28:04 2024] r24-27 0000000000000000 0000000000000020
00000000f8d54648 000000000001a000
[Mon Jun 10 14:28:04 2024] r28-31 0000000000000001 0000000016123698
00000000f9ad8cc0 00000000f9ad8c2c
[Mon Jun 10 14:28:04 2024] sr00-03 0000000006069400 0000000006069400
0000000000000000 0000000006069400
[Mon Jun 10 14:28:04 2024] sr04-07 0000000006069400 0000000006069400
0000000006069400 0000000006069400
[Mon Jun 10 14:28:04 2024] VZOUICununcqcqcqcqcqcrmunTDVZOUI
[Mon Jun 10 14:28:04 2024] FPSR: 00000000000000000000000000000000
[Mon Jun 10 14:28:04 2024] FPER1: 00000000
[Mon Jun 10 14:28:04 2024] fr00-03 0000000000000000 0000000000000000
0000000000000000 0000000000000000
[Mon Jun 10 14:28:04 2024] fr04-07 3fbc58dcd6e825cf 41d98fdb92c00000
00001d29b5e9bfb4 41d999952df718f9
[Mon Jun 10 14:28:04 2024] fr08-11 ffe3d998c543273c ff60537aba025d00
004698b61bd9b9ee 000527c1bed53af7
[Mon Jun 10 14:28:04 2024] fr12-15 0000000000000000 0000000000000000
0000000000000000 0000000000000000
[Mon Jun 10 14:28:04 2024] fr16-19 0000000000000000 0000000000000000
0000000000000000 0000000000000000
[Mon Jun 10 14:28:04 2024] fr20-23 0000000000000000 0000000000000000
0000000000000020 0000000000000000
[Mon Jun 10 14:28:04 2024] fr24-27 0000000000000003 0000000000000000
3d473181aed58d64 bff0000000000000
[Mon Jun 10 14:28:04 2024] fr28-31 3fc999b324f10111 057028cc5c564e70
dbc91a3f6bd13476 02632fb493c76730
[Mon Jun 10 14:28:04 2024] IASQ: 0000000006069400 0000000006069400 IAOQ: 00000000f8c44063 00000000f8c44067
[Mon Jun 10 14:28:04 2024] IIR: 0fb0109c ISR: 0000000006069400 IOR: 00000000161236a0
[Mon Jun 10 14:28:04 2024] CPU: 0 CR30: 00000001e70099e0 CR31: fffffff0f0e05ee0
[Mon Jun 10 14:28:04 2024] ORIG_R28: 0000000000000000
[Mon Jun 10 14:28:04 2024] IAOQ[0]: 00000000f8c44063
[Mon Jun 10 14:28:04 2024] IAOQ[1]: 00000000f8c44067
[Mon Jun 10 14:28:04 2024] RP(r2): 00000000f8c46e33
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-06-10 19:52 ` matoro
@ 2024-06-10 20:17 ` John David Anglin
2024-06-26 6:12 ` matoro
0 siblings, 1 reply; 18+ messages in thread
From: John David Anglin @ 2024-06-10 20:17 UTC (permalink / raw)
To: matoro; +Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
Hi Matoro,
On 2024-06-10 3:52 p.m., matoro wrote:
> Unfortunately I had a few of these faults trip today after ~4 days of uptime with corresponding random segfaults. One of the WARNs was
> emitted shortly before, though not for the same PID. Reattempted the build twice and randomly segfaulted all 3 times. Had to reboot as usual
> to get it out of the bad state.
Please try v3 patch sent today.
Dave
--
John David Anglin dave.anglin@bell.net
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-06-10 20:17 ` John David Anglin
@ 2024-06-26 6:12 ` matoro
2024-06-26 15:44 ` John David Anglin
0 siblings, 1 reply; 18+ messages in thread
From: matoro @ 2024-06-26 6:12 UTC (permalink / raw)
To: John David Anglin
Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-06-10 16:17, John David Anglin wrote:
> Hi Matoro,
>
> On 2024-06-10 3:52 p.m., matoro wrote:
>> Unfortunately I had a few of these faults trip today after ~4 days of
>> uptime with corresponding random segfaults. One of the WARNs was emitted
>> shortly before, though not for the same PID. Reattempted the build twice
>> and randomly segfaulted all 3 times. Had to reboot as usual to get it out
>> of the bad state.
> Please try v3 patch sent today.
>
> Dave
I think this patch is probably a winner! I now have 14 days of continuous
uptime with a lot of intense package testing and not a single random
corruption or crash observed. I'm switching to vanilla 6.9.6 now that it's
in tree. Thanks so much for your great work!
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
2024-06-26 6:12 ` matoro
@ 2024-06-26 15:44 ` John David Anglin
0 siblings, 0 replies; 18+ messages in thread
From: John David Anglin @ 2024-06-26 15:44 UTC (permalink / raw)
To: matoro; +Cc: Vidra.Jonas, linux-parisc, John David Anglin, Helge Deller
On 2024-06-26 2:12 a.m., matoro wrote:
> On 2024-06-10 16:17, John David Anglin wrote:
>> Hi Matoro,
>>
>> On 2024-06-10 3:52 p.m., matoro wrote:
>>> Unfortunately I had a few of these faults trip today after ~4 days of uptime with corresponding random segfaults. One of the WARNs was
>>> emitted shortly before, though not for the same PID. Reattempted the build twice and randomly segfaulted all 3 times. Had to reboot as
>>> usual to get it out of the bad state.
>> Please try v3 patch sent today.
>>
>> Dave
>
> I think this patch is probably a winner! I now have 14 days continuous uptime where I've done a lot of intense package testing and not a
> single random corruption or crash observed. I'm switching to vanilla 6.9.6 now that it's in tree. Thanks so much for your great work!
The important change in v3 and the version committed was to flush the cache page when a page table entry was cleared.
Dave
--
John David Anglin dave.anglin@bell.net
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-06-26 15:44 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-05 16:58 [PATCH] parisc: Try to fix random segmentation faults in package builds John David Anglin
2024-05-08 8:54 ` Vidra.Jonas
2024-05-08 15:23 ` John David Anglin
2024-05-08 19:18 ` matoro
2024-05-08 20:52 ` John David Anglin
2024-05-08 23:51 ` matoro
2024-05-09 1:21 ` John David Anglin
2024-05-09 17:10 ` John David Anglin
2024-05-29 15:54 ` matoro
2024-05-29 16:33 ` John David Anglin
2024-05-30 5:00 ` matoro
2024-06-04 15:07 ` matoro
2024-06-04 17:08 ` John David Anglin
2024-06-10 19:52 ` matoro
2024-06-10 20:17 ` John David Anglin
2024-06-26 6:12 ` matoro
2024-06-26 15:44 ` John David Anglin
2024-05-12 6:57 ` Vidra.Jonas
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox