* [PATCH] parisc: Fix TLB related boot crash with PA8000-PA8700 CPUs
@ 2016-12-07 20:52 Helge Deller
  2016-12-07 21:34 ` Aaro Koskinen
  0 siblings, 1 reply; 8+ messages in thread
From: Helge Deller @ 2016-12-07 20:52 UTC (permalink / raw)
  To: linux-parisc, James Bottomley, John David Anglin

Machines with PA8000-PA8700 CPUs crash during startup while we measure
and calculate a good threshold for the TLB flush.

Avoid this crash by simply skipping the test until we figure out what
really triggers the crash.

Cc: <stable@vger.kernel.org> # v3.18+
Signed-off-by: Helge Deller <deller@gmx.de>

diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index c263301..63c10ea 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -393,6 +393,14 @@ void __init parisc_setup_cache_timing(void)
 
 	/* calculate TLB flush threshold */
 
+	/* skip TLB measure on PA8000-PA8700 CPUs */
+	if (boot_cpu_data.cpu_type >= pcxu &&
+	    boot_cpu_data.cpu_type <= pcxw2) {
+		threshold = max(cache_info.it_size, cache_info.dt_size);
+		threshold *= PAGE_SIZE;
+		goto set_tlb_threshold;
+	}
+
 	alltime = mfctl(16);
 	flush_tlb_all();
 	alltime = mfctl(16) - alltime;
@@ -411,6 +419,8 @@ void __init parisc_setup_cache_timing(void)
 		alltime, size, rangetime);
 
 	threshold = PAGE_ALIGN(num_online_cpus() * size * alltime / rangetime);
+
+set_tlb_threshold:
 	if (threshold)
 		parisc_tlb_flush_threshold = threshold;
 	printk(KERN_INFO "TLB flush threshold set to %lu KiB\n",


* Re: [PATCH] parisc: Fix TLB related boot crash with PA8000-PA8700 CPUs
  2016-12-07 20:52 [PATCH] parisc: Fix TLB related boot crash with PA8000-PA8700 CPUs Helge Deller
@ 2016-12-07 21:34 ` Aaro Koskinen
  2016-12-07 22:17   ` John David Anglin
  2016-12-08 19:58   ` Helge Deller
  0 siblings, 2 replies; 8+ messages in thread
From: Aaro Koskinen @ 2016-12-07 21:34 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc, James Bottomley, John David Anglin

On Wed, Dec 07, 2016 at 09:52:40PM +0100, Helge Deller wrote:
> Machines with PA8000-PA8700 CPUs crash during startup while we measure
> and calculate a good threshold for the TLB flush.

I haven't seen any crashes on HP c3700 (PA8700)...

A.


* Re: [PATCH] parisc: Fix TLB related boot crash with PA8000-PA8700 CPUs
  2016-12-07 21:34 ` Aaro Koskinen
@ 2016-12-07 22:17   ` John David Anglin
  2016-12-08 19:58   ` Helge Deller
  1 sibling, 0 replies; 8+ messages in thread
From: John David Anglin @ 2016-12-07 22:17 UTC (permalink / raw)
  To: Aaro Koskinen, Helge Deller; +Cc: linux-parisc, James Bottomley

On 2016-12-07 4:34 PM, Aaro Koskinen wrote:
> On Wed, Dec 07, 2016 at 09:52:40PM +0100, Helge Deller wrote:
>> Machines with PA8000-PA8700 CPUs crash during startup while we measure
>> and calculate a good threshold for the TLB flush.
> I haven't seen any crashes on HP c3700 (PA8700)...
I believe the issue only affects SMP machines.  My c3750 also boots okay.

Dave

-- 
John David Anglin  dave.anglin@bell.net



* Re: [PATCH] parisc: Fix TLB related boot crash with PA8000-PA8700 CPUs
  2016-12-07 21:34 ` Aaro Koskinen
  2016-12-07 22:17   ` John David Anglin
@ 2016-12-08 19:58   ` Helge Deller
  2016-12-08 20:00     ` [PATCH] parisc: Fix TLB related boot crash on SMP machines Helge Deller
  1 sibling, 1 reply; 8+ messages in thread
From: Helge Deller @ 2016-12-08 19:58 UTC (permalink / raw)
  To: Aaro Koskinen; +Cc: linux-parisc, James Bottomley, John David Anglin

Hi Aaro,

On 07.12.2016 22:34, Aaro Koskinen wrote:
> On Wed, Dec 07, 2016 at 09:52:40PM +0100, Helge Deller wrote:
>> Machines with PA8000-PA8700 CPUs crash during startup while we measure
>> and calculate a good threshold for the TLB flush.
> 
> I haven't seen any crashes on HP c3700 (PA8700)...

Thanks for the feedback!
I did some additional analysis today, and it happens only on SMP machines.
I could reproduce the crash on an A500-44 and a J5000; both are 2-way boxes.

Helge


* Re: [PATCH] parisc: Fix TLB related boot crash on SMP machines
  2016-12-08 19:58   ` Helge Deller
@ 2016-12-08 20:00     ` Helge Deller
  2016-12-08 20:49       ` John David Anglin
  0 siblings, 1 reply; 8+ messages in thread
From: Helge Deller @ 2016-12-08 20:00 UTC (permalink / raw)
  To: linux-parisc, James Bottomley, John David Anglin; +Cc: Aaro Koskinen

At bootup we run measurements to calculate the best threshold for when we
should be using full TLB flushes instead of just flushing a specific number of
TLB entries.  This performance test is run over the kernel text segment.
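
For illustration only, the measured threshold can be sketched in plain C from
the formula visible in the patch context below
(PAGE_ALIGN(num_online_cpus() * size * alltime / rangetime)). The helper name,
the 4 KiB PAGE_SIZE, and the reading of size as the number of bytes covered by
the range flush are assumptions made for this sketch, not kernel code:

```c
/* Assumption: 4 KiB base pages, as in the default parisc configuration. */
#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/*
 * Hypothetical helper mirroring the formula from the patch context.
 *   ncpus     - number of online CPUs
 *   size      - bytes covered by the page-by-page range flush
 *   alltime   - cycles measured for one full TLB flush
 *   rangetime - cycles measured for the range flush over `size` bytes
 * Returns the range size (rounded up to a whole page) above which a full
 * flush is expected to be cheaper than flushing page by page.
 */
unsigned long measured_tlb_threshold(unsigned long ncpus, unsigned long size,
				     unsigned long alltime,
				     unsigned long rangetime)
{
	if (rangetime == 0)
		return 0;	/* no usable measurement */
	return PAGE_ALIGN(ncpus * size * alltime / rangetime);
}
```

A result of 0 corresponds to the "if (threshold)" check in the patch context,
which then leaves parisc_tlb_flush_threshold at its previous value.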

But running this TLB performance test on the kernel text segment turned out to
crash some SMP machines when the kernel text pages were mapped as huge pages.

To avoid those crashes this patch simply skips this test on some SMP machines
and calculates an optimal threshold based on the maximum number of available
TLB entries and number of online CPUs.
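
As a rough sketch of that fallback calculation (the function name and the
sample TLB sizes in the usage note are hypothetical, and PAGE_SIZE is assumed
to be 4 KiB), the arithmetic from the hunk below looks like this in plain C:

```c
/* Assumption: 4 KiB base pages. */
#define PAGE_SIZE 4096UL

/*
 * Hypothetical stand-alone version of the fallback used when the timing
 * test is skipped: take the larger of the instruction and data TLB entry
 * counts, convert it to bytes, and divide by the number of online CPUs.
 */
unsigned long fallback_tlb_threshold(unsigned long it_size,
				     unsigned long dt_size,
				     unsigned long ncpus)
{
	unsigned long threshold = it_size > dt_size ? it_size : dt_size;

	if (ncpus == 0)
		ncpus = 1;	/* guard for the sketch; the patch only
				 * takes this path with more than one CPU */
	threshold *= PAGE_SIZE;
	threshold /= ncpus;
	return threshold;
}
```

With a hypothetical 240-entry TLB on a 2-way box this gives
240 * 4096 / 2 = 491520 bytes, i.e. 480 KiB.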

On the technical side, this is what seems to happen:
The TLB measurement code uses flush_tlb_kernel_range() to flush specific TLB
entries with a page size of 4k (pdtlb 0(sr1,addr)). On UP systems this purge
instruction seems to work without problems even if the pages were mapped as
huge pages.  But on SMP systems the TLB purge instruction is broadcast to
other CPUs. Those CPUs then crash the machine because the page size is not as
expected.  C8000 machines with PA8800/PA8900 CPUs were not affected by this
problem, because the required cache coherency prohibits the use of huge pages
at all.  Sadly I didn't find any documentation about this behaviour, so this
finding is purely based on testing with physical SMP machines (A500-44 and
J5000, both were 2-way boxes).

Cc: <stable@vger.kernel.org> # v3.18+
Signed-off-by: Helge Deller <deller@gmx.de>

diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index c263301..977f0a4f 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -393,6 +393,15 @@ void __init parisc_setup_cache_timing(void)
 
 	/* calculate TLB flush threshold */
 
+	/* On SMP machines, skip the TLB measure of kernel text which
+	 * has been mapped as huge pages. */
+	if (num_online_cpus() > 1 && !parisc_requires_coherency()) {
+		threshold = max(cache_info.it_size, cache_info.dt_size);
+		threshold *= PAGE_SIZE;
+		threshold /= num_online_cpus();
+		goto set_tlb_threshold;
+	}
+
 	alltime = mfctl(16);
 	flush_tlb_all();
 	alltime = mfctl(16) - alltime;
@@ -411,6 +420,8 @@ void __init parisc_setup_cache_timing(void)
 		alltime, size, rangetime);
 
 	threshold = PAGE_ALIGN(num_online_cpus() * size * alltime / rangetime);
+
+set_tlb_threshold:
 	if (threshold)
 		parisc_tlb_flush_threshold = threshold;
 	printk(KERN_INFO "TLB flush threshold set to %lu KiB\n",


* Re: [PATCH] parisc: Fix TLB related boot crash on SMP machines
  2016-12-08 20:00     ` [PATCH] parisc: Fix TLB related boot crash on SMP machines Helge Deller
@ 2016-12-08 20:49       ` John David Anglin
  2016-12-08 21:15         ` Helge Deller
  0 siblings, 1 reply; 8+ messages in thread
From: John David Anglin @ 2016-12-08 20:49 UTC (permalink / raw)
  To: Helge Deller, linux-parisc, James Bottomley; +Cc: Aaro Koskinen

On 2016-12-08 3:00 PM, Helge Deller wrote:
> On the technical side, this is what seems to happen:
> The TLB measurement code uses flush_tlb_kernel_range() to flush specific TLB
> entries with a page size of 4k (pdtlb 0(sr1,addr)). On UP systems this purge
> instruction seems to work without problems even if the pages were mapped as
> huge pages.  But on SMP systems the TLB purge instruction is broadcast to
> other CPUs. Those CPUs then crash the machine because the page size is not as
> expected.  C8000 machines with PA8800/PA8900 CPUs were not affected by this
> problem, because the required cache coherency prohibits the use of huge pages
> at all.  Sadly I didn't find any documentation about this behaviour, so this
> finding is purely based on testing with physical SMP machines (A500-44 and
> J5000, both were 2-way boxes).
I doubt the problem is the 4k iteration using pdtlb 0(sr1,addr).  I think the
issue is the huge page size for the kernel.  Each pdtlb instruction knocks out
the same tlb entry including the entry used for tlb interruptions.  This likely
leads to stack overflow.  In any case, it probably doesn't provide accurate
timing because each pdtlb knocks out the entry for the interruption handler on
systems with combined tlb.
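
To put a number on the "same tlb entry" point, here is a small user-space
sketch that counts how many purges a 4k-stride loop issues and how many
distinct huge-page mappings those purges actually hit. The struct, the
function name, and the 1 MiB huge page size used in the usage note are
assumptions for illustration, not kernel code or measured values:

```c
/* Result of walking a range with a fixed purge stride. */
struct purge_stats {
	unsigned long purges;	/* purge operations issued */
	unsigned long entries;	/* distinct huge-page mappings hit */
};

/*
 * Hypothetical model: walk [start, end) in `step`-byte increments, as a
 * page-by-page flush loop would, and count how many different mappings of
 * size `huge` the purged addresses fall into.
 */
struct purge_stats count_purges(unsigned long start, unsigned long end,
				unsigned long step, unsigned long huge)
{
	struct purge_stats s = { 0, 0 };
	unsigned long last = ~0UL;	/* no mapping seen yet */
	unsigned long addr;

	for (addr = start; addr < end; addr += step) {
		unsigned long entry = addr / huge;

		s.purges++;
		if (entry != last) {
			s.entries++;
			last = entry;
		}
	}
	return s;
}
```

Under these assumptions, walking 4 MiB of kernel text in 4 KiB steps with
1 MiB huge pages issues 1024 purges against only 4 mappings, so each mapping
is purged 256 times in a row.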

Dave
-- 

John David Anglin  dave.anglin@bell.net



* Re: [PATCH] parisc: Fix TLB related boot crash on SMP machines
  2016-12-08 20:49       ` John David Anglin
@ 2016-12-08 21:15         ` Helge Deller
  2016-12-08 21:21           ` John David Anglin
  0 siblings, 1 reply; 8+ messages in thread
From: Helge Deller @ 2016-12-08 21:15 UTC (permalink / raw)
  To: John David Anglin, linux-parisc, James Bottomley; +Cc: Aaro Koskinen

On 08.12.2016 21:49, John David Anglin wrote:
> On 2016-12-08 3:00 PM, Helge Deller wrote:
>> On the technical side, this is what seems to happen:
>> The TLB measurement code uses flush_tlb_kernel_range() to flush specific TLB
>> entries with a page size of 4k (pdtlb 0(sr1,addr)). On UP systems this purge
>> instruction seems to work without problems even if the pages were mapped as
>> huge pages.  But on SMP systems the TLB purge instruction is broadcast to
>> other CPUs. Those CPUs then crash the machine because the page size is not as
>> expected.  C8000 machines with PA8800/PA8900 CPUs were not affected by this
>> problem, because the required cache coherency prohibits the use of huge pages
>> at all.  Sadly I didn't find any documentation about this behaviour, so this
>> finding is purely based on testing with physical SMP machines (A500-44 and
>> J5000, both were 2-way boxes).

> I doubt the problem is the 4k iteration using pdtlb 0(sr1,addr). I
> think the issue is the huge page size for the kernel. Each pdtlb
> instruction knocks out the same tlb entry including the entry used
> for tlb interruptions. This likely leads to stack overflow. 

Yes, likely.

> In any
> case, it probably doesn't provide accurate timing because each pdtlb
> knocks out the entry for the interruption handler on systems with
> combined tlb.

True.

So, how to continue?
I see two options:
a) skip the TLB measuring code as my patch does.
b) kmalloc() another region and do measurement there.

I'd like to submit some fix-patch for 4.9, else the machines won't boot 4.9.
That's why I'd prefer option a).
Opinions?

Helge


* Re: [PATCH] parisc: Fix TLB related boot crash on SMP machines
  2016-12-08 21:15         ` Helge Deller
@ 2016-12-08 21:21           ` John David Anglin
  0 siblings, 0 replies; 8+ messages in thread
From: John David Anglin @ 2016-12-08 21:21 UTC (permalink / raw)
  To: Helge Deller, linux-parisc, James Bottomley; +Cc: Aaro Koskinen

On 2016-12-08 4:15 PM, Helge Deller wrote:
> On 08.12.2016 21:49, John David Anglin wrote:
>> On 2016-12-08 3:00 PM, Helge Deller wrote:
>>> On the technical side, this is what seems to happen:
>>> The TLB measurement code uses flush_tlb_kernel_range() to flush specific TLB
>>> entries with a page size of 4k (pdtlb 0(sr1,addr)). On UP systems this purge
>>> instruction seems to work without problems even if the pages were mapped as
>>> huge pages.  But on SMP systems the TLB purge instruction is broadcast to
>>> other CPUs. Those CPUs then crash the machine because the page size is not as
>>> expected.  C8000 machines with PA8800/PA8900 CPUs were not affected by this
>>> problem, because the required cache coherency prohibits the use of huge pages
>>> at all.  Sadly I didn't find any documentation about this behaviour, so this
>>> finding is purely based on testing with physical SMP machines (A500-44 and
>>> J5000, both were 2-way boxes).
>> I doubt the problem is the 4k iteration using pdtlb 0(sr1,addr). I
>> think the issue is the huge page size for the kernel. Each pdtlb
>> instruction knocks out the same tlb entry including the entry used
>> for tlb interruptions. This likely leads to stack overflow.
> Yes, likely.
>
>> In any
>> case, it probably doesn't provide accurate timing because each pdtlb
>> knocks out the entry for the interruption handler on systems with
>> combined tlb.
> True.
>
> So, how to continue?
> I see two options:
> a) skip the TLB measuring code as my patch does.
> b) kmalloc() another region and do measurement there.
>
> I'd like to submit some fix-patch for 4.9, else the machines won't boot 4.9.
> That's why I'd prefer option a).
> Opinions?
Go with option a) for 4.9

Dave

-- 
John David Anglin  dave.anglin@bell.net


