* [PATCH 0/4] TLB and mm-related optimisations
From: Will Deacon @ 2013-03-25 18:19 UTC
To: linux-arm-kernel
Hello again,
This is another slightly random set of patches which I cooked up when
trying to optimise various bits and pieces in arch/arm/mm after writing
my changes to the ASID allocator.
The first two patches modify the local_* variants of our TLB and BP
invalidation functions to use the non-shareable versions, avoiding the
need to broadcast them in hardware. This allows us to update the ASID
allocator to use only local operations (which it needs anyway for
11MPCore).
The third patch removes some redundant cache-flushing when writing page
table entries on ARMv7 SMP targets (where the hardware walker is required
to be able to read the L1 cache), and the final patch allows us to use
non-exclusive instructions for atomic64 read/set (like we do for the
32-bit variants) when we have LPAE, since LPAE requires aligned
double-word accesses to be atomic (it relies on this in order to
manipulate ptes atomically).
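As a rough sketch of that last idea (illustrative only; the exact code is
in patch 4/4 and may differ in detail, and the atomic64_t/u64 types are
the kernel's), the LPAE guarantee means a plain ldrd/strd pair is enough:

/* Sketch, assuming LPAE: an aligned ldrd/strd is single-copy atomic,
 * so no ldrexd/strexd loop is needed for a plain read or set.
 */
static inline u64 atomic64_read(const atomic64_t *v)
{
        u64 result;

        __asm__ __volatile__("@ atomic64_read\n"
"       ldrd    %0, %H0, [%1]"
        : "=&r" (result)
        : "r" (&v->counter), "Qo" (v->counter)
        );

        return result;
}

static inline void atomic64_set(atomic64_t *v, u64 i)
{
        __asm__ __volatile__("@ atomic64_set\n"
"       strd    %2, %H2, [%1]"
        : "=Qo" (v->counter)
        : "r" (&v->counter), "r" (i)
        );
}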
All feedback welcome,
Will
Will Deacon (4):
  ARM: tlb: don't perform inner-shareable invalidation for local TLB
    ops
  ARM: tlb: don't perform inner-shareable invalidation for local BP ops
  ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP
    instead
  ARM: atomics: don't use exclusives for atomic64 read/set with LPAE
arch/arm/include/asm/atomic.h | 24 ++++++++++++
arch/arm/include/asm/tlbflush.h | 83 +++++++++++++++++++++++++++++++++++++----
arch/arm/kernel/smp_tlb.c | 10 ++---
arch/arm/mm/context.c | 5 +--
arch/arm/mm/proc-v6.S | 2 -
arch/arm/mm/proc-v7-2level.S | 3 +-
arch/arm/mm/proc-v7-3level.S | 3 +-
arch/arm/mm/proc-v7.S | 4 +-
8 files changed, 112 insertions(+), 22 deletions(-)
--
1.8.0
* [PATCH 1/4] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
From: Will Deacon @ 2013-03-25 18:19 UTC
To: linux-arm-kernel

Inner-shareable TLB invalidation is typically more expensive than local
(non-shareable) invalidation, so performing the broadcasting for
local_flush_tlb_* operations is a waste of cycles and needlessly clobbers
entries in the TLBs of other CPUs.

This patch introduces __flush_tlb_* versions for many of the TLB
invalidation functions, which use only the inner-shareable variants of
the invalidation instructions. This allows us to modify the v7 SMP TLB
flags to include *both* inner-shareable and non-shareable operations and
then check the relevant flags depending on whether the operation is
local or not.

This gains us around 0.5% in hackbench scores for a dual-core A15, but I
would expect this to improve as more cores (and clusters) are added to
the equation.

Reported-by: Albin Tonnerre <Albin.Tonnerre@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/tlbflush.h | 67 ++++++++++++++++++++++++++++++++++++++---
 arch/arm/kernel/smp_tlb.c       |  8 ++---
 arch/arm/mm/context.c           |  5 +--
 3 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
index 4db8c88..ae9f34f 100644
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -170,6 +170,8 @@
 #endif

 #define v7wbi_tlb_flags_smp     (TLB_WB | TLB_DCLEAN | TLB_BARRIER | \
+                                 TLB_V6_U_FULL | TLB_V6_U_PAGE | \
+                                 TLB_V6_U_ASID | \
                                  TLB_V7_UIS_FULL | TLB_V7_UIS_PAGE | \
                                  TLB_V7_UIS_ASID | TLB_V7_UIS_BP)
 #define v7wbi_tlb_flags_up      (TLB_WB | TLB_DCLEAN | TLB_BARRIER | \
@@ -334,6 +336,21 @@ static inline void local_flush_tlb_all(void)
        tlb_op(TLB_V4_U_FULL | TLB_V6_U_FULL, "c8, c7, 0", zero);
        tlb_op(TLB_V4_D_FULL | TLB_V6_D_FULL, "c8, c6, 0", zero);
        tlb_op(TLB_V4_I_FULL | TLB_V6_I_FULL, "c8, c5, 0", zero);
+
+       if (tlb_flag(TLB_BARRIER)) {
+               dsb();
+               isb();
+       }
+}
+
+static inline void __flush_tlb_all(void)
+{
+       const int zero = 0;
+       const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+       if (tlb_flag(TLB_WB))
+               dsb();
+
        tlb_op(TLB_V7_UIS_FULL, "c8, c3, 0", zero);

        if (tlb_flag(TLB_BARRIER)) {
@@ -352,22 +369,33 @@ static inline void local_flush_tlb_mm(struct mm_struct *mm)
                dsb();

        if (possible_tlb_flags & (TLB_V3_FULL|TLB_V4_U_FULL|TLB_V4_D_FULL|TLB_V4_I_FULL)) {
-               if (cpumask_test_cpu(get_cpu(), mm_cpumask(mm))) {
+               if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
                        tlb_op(TLB_V3_FULL, "c6, c0, 0", zero);
                        tlb_op(TLB_V4_U_FULL, "c8, c7, 0", zero);
                        tlb_op(TLB_V4_D_FULL, "c8, c6, 0", zero);
                        tlb_op(TLB_V4_I_FULL, "c8, c5, 0", zero);
                }
-               put_cpu();
        }

        tlb_op(TLB_V6_U_ASID, "c8, c7, 2", asid);
        tlb_op(TLB_V6_D_ASID, "c8, c6, 2", asid);
        tlb_op(TLB_V6_I_ASID, "c8, c5, 2", asid);
+
+       if (tlb_flag(TLB_BARRIER))
+               dsb();
+}
+
+static inline void __flush_tlb_mm(struct mm_struct *mm)
+{
+       const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+       if (tlb_flag(TLB_WB))
+               dsb();
+
 #ifdef CONFIG_ARM_ERRATA_720789
-       tlb_op(TLB_V7_UIS_ASID, "c8, c3, 0", zero);
+       tlb_op(TLB_V7_UIS_ASID, "c8, c3, 0", 0);
 #else
-       tlb_op(TLB_V7_UIS_ASID, "c8, c3, 2", asid);
+       tlb_op(TLB_V7_UIS_ASID, "c8, c3, 2", ASID(mm));
 #endif

        if (tlb_flag(TLB_BARRIER))
@@ -398,6 +426,21 @@ local_flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
        tlb_op(TLB_V6_U_PAGE, "c8, c7, 1", uaddr);
        tlb_op(TLB_V6_D_PAGE, "c8, c6, 1", uaddr);
        tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", uaddr);
+
+       if (tlb_flag(TLB_BARRIER))
+               dsb();
+}
+
+static inline void
+__flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
+{
+       const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+       uaddr = (uaddr & PAGE_MASK) | ASID(vma->vm_mm);
+
+       if (tlb_flag(TLB_WB))
+               dsb();
+
 #ifdef CONFIG_ARM_ERRATA_720789
        tlb_op(TLB_V7_UIS_PAGE, "c8, c3, 3", uaddr & PAGE_MASK);
 #else
@@ -428,6 +471,22 @@ static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
        tlb_op(TLB_V6_U_PAGE, "c8, c7, 1", kaddr);
        tlb_op(TLB_V6_D_PAGE, "c8, c6, 1", kaddr);
        tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", kaddr);
+
+       if (tlb_flag(TLB_BARRIER)) {
+               dsb();
+               isb();
+       }
+}
+
+static inline void __flush_tlb_kernel_page(unsigned long kaddr)
+{
+       const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+       kaddr &= PAGE_MASK;
+
+       if (tlb_flag(TLB_WB))
+               dsb();
+
        tlb_op(TLB_V7_UIS_PAGE, "c8, c3, 1", kaddr);

        if (tlb_flag(TLB_BARRIER)) {
diff --git a/arch/arm/kernel/smp_tlb.c b/arch/arm/kernel/smp_tlb.c
index bd03005..6ae08b1 100644
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -74,7 +74,7 @@ void flush_tlb_all(void)
        if (tlb_ops_need_broadcast())
                on_each_cpu(ipi_flush_tlb_all, NULL, 1);
        else
-               local_flush_tlb_all();
+               __flush_tlb_all();
 }

 void flush_tlb_mm(struct mm_struct *mm)
@@ -82,7 +82,7 @@ void flush_tlb_mm(struct mm_struct *mm)
        if (tlb_ops_need_broadcast())
                on_each_cpu_mask(mm_cpumask(mm), ipi_flush_tlb_mm, mm, 1);
        else
-               local_flush_tlb_mm(mm);
+               __flush_tlb_mm(mm);
 }

 void flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
@@ -94,7 +94,7 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
                on_each_cpu_mask(mm_cpumask(vma->vm_mm), ipi_flush_tlb_page,
                                        &ta, 1);
        } else
-               local_flush_tlb_page(vma, uaddr);
+               __flush_tlb_page(vma, uaddr);
 }

 void flush_tlb_kernel_page(unsigned long kaddr)
@@ -104,7 +104,7 @@ void flush_tlb_kernel_page(unsigned long kaddr)
                ta.ta_start = kaddr;
                on_each_cpu(ipi_flush_tlb_kernel_page, &ta, 1);
        } else
-               local_flush_tlb_kernel_page(kaddr);
+               __flush_tlb_kernel_page(kaddr);
 }

 void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/arm/mm/context.c b/arch/arm/mm/context.c
index a5a4b2bc..550acf8 100644
--- a/arch/arm/mm/context.c
+++ b/arch/arm/mm/context.c
@@ -134,10 +134,7 @@ static void flush_context(unsigned int cpu)
        }

        /* Queue a TLB invalidate and flush the I-cache if necessary. */
-       if (!tlb_ops_need_broadcast())
-               cpumask_set_cpu(cpu, &tlb_flush_pending);
-       else
-               cpumask_setall(&tlb_flush_pending);
+       cpumask_setall(&tlb_flush_pending);

        if (icache_is_vivt_asid_tagged())
                __flush_icache_all();
-- 
1.8.0
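For readers tracing the diff above: both the local_* and __flush_*
variants are built from the same tlb_flag()/tlb_op() pattern in
tlbflush.h. A simplified sketch of that pattern follows (the real macros
differ in detail; always_tlb_flags and possible_tlb_flags are the
build-time constants from that header):

/* Sketch of the pattern (simplified from tlbflush.h): __tlb_flag is
 * read from __cpu_tlb_flags at runtime, but it is masked with the
 * build-time constants, so the compiler deletes every tlb_op() whose
 * flag cannot possibly be set for this kernel configuration.
 */
#define tlb_flag(f)     ((always_tlb_flags & (f)) || \
                         (__tlb_flag & possible_tlb_flags & (f)))

#define tlb_op(f, regs, arg)                                    \
        do {                                                    \
                if (tlb_flag(f))                                \
                        asm("mcr p15, 0, %0, " regs             \
                            : : "r" (arg) : "cc");              \
        } while (0)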
* [PATCH 1/4] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
From: Catalin Marinas @ 2013-03-27 10:34 UTC
To: linux-arm-kernel

On Mon, Mar 25, 2013 at 06:19:38PM +0000, Will Deacon wrote:
> @@ -352,22 +369,33 @@ static inline void local_flush_tlb_mm(struct mm_struct *mm)
>                dsb();
> 
>        if (possible_tlb_flags & (TLB_V3_FULL|TLB_V4_U_FULL|TLB_V4_D_FULL|TLB_V4_I_FULL)) {
> -              if (cpumask_test_cpu(get_cpu(), mm_cpumask(mm))) {
> +              if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
>                        tlb_op(TLB_V3_FULL, "c6, c0, 0", zero);
>                        tlb_op(TLB_V4_U_FULL, "c8, c7, 0", zero);
>                        tlb_op(TLB_V4_D_FULL, "c8, c6, 0", zero);
>                        tlb_op(TLB_V4_I_FULL, "c8, c5, 0", zero);
>                }
> -              put_cpu();

Why is this change needed? You only flush the local TLB if the mm was
never active on this processor?

> @@ -398,6 +426,21 @@ local_flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
>        tlb_op(TLB_V6_U_PAGE, "c8, c7, 1", uaddr);
>        tlb_op(TLB_V6_D_PAGE, "c8, c6, 1", uaddr);
>        tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", uaddr);
> +
> +      if (tlb_flag(TLB_BARRIER))
> +              dsb();
> +}
> +
> +static inline void
> +__flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
> +{
> +      const unsigned int __tlb_flag = __cpu_tlb_flags;
> +
> +      uaddr = (uaddr & PAGE_MASK) | ASID(vma->vm_mm);
> +
> +      if (tlb_flag(TLB_WB))
> +              dsb();
> +

I guess here we could just have a single *_tlb_page() variant. I
couldn't find any place where we call the local_flush_tlb_page()
explicitly, I guess we don't really need local semantics. On ARMv6 SMP,
they are local anyway.

If we have a single *_tlb_page() function, you would need to drop the
TLB_V6_*_PAGE from the v8 possible TLB ops.

> #ifdef CONFIG_ARM_ERRATA_720789
>        tlb_op(TLB_V7_UIS_PAGE, "c8, c3, 3", uaddr & PAGE_MASK);
> #else
> @@ -428,6 +471,22 @@ static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
>        tlb_op(TLB_V6_U_PAGE, "c8, c7, 1", kaddr);
>        tlb_op(TLB_V6_D_PAGE, "c8, c6, 1", kaddr);
>        tlb_op(TLB_V6_I_PAGE, "c8, c5, 1", kaddr);
> +
> +      if (tlb_flag(TLB_BARRIER)) {
> +              dsb();
> +              isb();
> +      }
> +}

I have some worries with this function. It is used by set_top_pte() and
it really doesn't look like it has local-only semantics. For example,
you use it to flush the I-cache aliases and this must target all the
CPUs because of speculative prefetches, which means that set_top_pte()
must set the new alias on all the CPUs.

Highmem mappings need to be revisited as well.

> --- a/arch/arm/mm/context.c
> +++ b/arch/arm/mm/context.c
> @@ -134,10 +134,7 @@ static void flush_context(unsigned int cpu)
>        }
> 
>        /* Queue a TLB invalidate and flush the I-cache if necessary. */
> -      if (!tlb_ops_need_broadcast())
> -              cpumask_set_cpu(cpu, &tlb_flush_pending);
> -      else
> -              cpumask_setall(&tlb_flush_pending);
> +      cpumask_setall(&tlb_flush_pending);

That's a good change ;)

-- 
Catalin
* [PATCH 1/4] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
From: Will Deacon @ 2013-03-27 12:07 UTC
To: linux-arm-kernel

Hi Catalin,

Cheers for looking at this.

On Wed, Mar 27, 2013 at 10:34:30AM +0000, Catalin Marinas wrote:
> On Mon, Mar 25, 2013 at 06:19:38PM +0000, Will Deacon wrote:
> > @@ -352,22 +369,33 @@ static inline void local_flush_tlb_mm(struct mm_struct *mm)
> > [...]
> 
> Why is this change needed? You only flush the local TLB if the mm was
> never active on this processor?

Ouch, that's a cock-up, sorry. I'll remove the '!'.

> > @@ -398,6 +426,21 @@ local_flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
> > [...]
> 
> I guess here we could just have a single *_tlb_page() variant. I
> couldn't find any place where we call the local_flush_tlb_page()
> explicitly, I guess we don't really need local semantics. On ARMv6 SMP,
> they are local anyway.
> 
> If we have a single *_tlb_page() function, you would need to drop the
> TLB_V6_*_PAGE from the v8 possible TLB ops.

Having the local variant doesn't hurt though, and provides the same symmetry
as other architectures (powerpc, sh, tile, mips, ...).

> > @@ -428,6 +471,22 @@ static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
> > [...]
> 
> I have some worries with this function. It is used by set_top_pte() and
> it really doesn't look like it has local-only semantics. For example,
> you use it to flush the I-cache aliases and this must target all the
> CPUs because of speculative prefetches, which means that set_top_pte()
> must set the new alias on all the CPUs.

This looks like a bug in set_top_pte when it's called for cache-flushing.
However, the only core this would affect is 11MPCore, which uses the
ipi-based flushing anyway, so I think we're ok.

> Highmem mappings need to be revisited as well.

I think they're ok. Everything is either done in atomic context or under a
raw spinlock, so the mappings aren't expected to be used by other CPUs.

Cheers,

Will
* [PATCH 1/4] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
From: Catalin Marinas @ 2013-03-27 12:30 UTC
To: linux-arm-kernel

On Wed, Mar 27, 2013 at 12:07:37PM +0000, Will Deacon wrote:
> Cheers for looking at this.

Just warming up for Marc's KVM patches ;)

> On Wed, Mar 27, 2013 at 10:34:30AM +0000, Catalin Marinas wrote:
> > On Mon, Mar 25, 2013 at 06:19:38PM +0000, Will Deacon wrote:
> > > @@ -352,22 +369,33 @@ static inline void local_flush_tlb_mm(struct mm_struct *mm)
> > > [...]
> > 
> > Why is this change needed? You only flush the local TLB if the mm was
> > never active on this processor?
> 
> Ouch, that's a cock-up, sorry. I'll remove the '!'.

Do we also need to disable preemption?

> > > @@ -398,6 +426,21 @@ local_flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
> > > [...]
> > 
> > I guess here we could just have a single *_tlb_page() variant. I
> > couldn't find any place where we call the local_flush_tlb_page()
> > explicitly, I guess we don't really need local semantics. On ARMv6 SMP,
> > they are local anyway.
> > 
> > If we have a single *_tlb_page() function, you would need to drop the
> > TLB_V6_*_PAGE from the v8 possible TLB ops.
> 
> Having the local variant doesn't hurt though, and provides the same symmetry
> as other architectures (powerpc, sh, tile, mips, ...).

It's probably harmless to have them, though they may not get used.

> > > @@ -428,6 +471,22 @@ static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
> > > [...]
> > 
> > I have some worries with this function. It is used by set_top_pte() and
> > it really doesn't look like it has local-only semantics. [...]
> 
> This looks like a bug in set_top_pte when it's called for cache-flushing.
> However, the only core this would affect is 11MPCore, which uses the
> ipi-based flushing anyway, so I think we're ok.

I don't think it's 11MPCore only: set_top_pte() is called by
flush_icache_alias() from flush_ptrace_access() even on ARMv7.

> > Highmem mappings need to be revisited as well.
> 
> I think they're ok. Everything is either done in atomic context or under a
> raw spinlock, so the mappings aren't expected to be used by other CPUs.

It's not whether they are used explicitly but whether a speculative TLB
load can bring them in on a different CPU. I don't immediately see a
problem with non-aliasing caches but it needs some more thinking.

-- 
Catalin
* [PATCH 1/4] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
From: Will Deacon @ 2013-03-27 12:56 UTC
To: linux-arm-kernel

On Wed, Mar 27, 2013 at 12:30:55PM +0000, Catalin Marinas wrote:
> On Wed, Mar 27, 2013 at 12:07:37PM +0000, Will Deacon wrote:
> > On Wed, Mar 27, 2013 at 10:34:30AM +0000, Catalin Marinas wrote:
> > > Why is this change needed? You only flush the local TLB if the mm was
> > > never active on this processor?
> > 
> > Ouch, that's a cock-up, sorry. I'll remove the '!'.
> 
> Do we also need to disable preemption?

I don't think so, that should be taken care of by the caller if they are
issuing the local_ operation (otherwise it's racy anyway).

> > > I have some worries with this function. It is used by set_top_pte() and
> > > it really doesn't look like it has local-only semantics. For example,
> > > you use it to flush the I-cache aliases and this must target all the
> > > CPUs because of speculative prefetches, which means that set_top_pte()
> > > must set the new alias on all the CPUs.
> > 
> > This looks like a bug in set_top_pte when it's called for cache-flushing.
> > However, the only core this would affect is 11MPCore, which uses the
> > ipi-based flushing anyway, so I think we're ok.
> 
> I don't think it's 11MPCore only: set_top_pte() is called by
> flush_icache_alias() from flush_ptrace_access() even on ARMv7.

Damn, yes, I missed those. Perhaps we should add set_top_pte_atomic, which
just does the local flush, and then promote the current flush to be IS?

> > > Highmem mappings need to be revisited as well.
> > 
> > I think they're ok. Everything is either done in atomic context or under a
> > raw spinlock, so the mappings aren't expected to be used by other CPUs.
> 
> It's not whether they are used explicitly but whether a speculative TLB
> load can bring them in on a different CPU. I don't immediately see a
> problem with non-aliasing caches but it needs some more thinking.

But why do we care about the speculation? If the core doing the speculating
is always going to write a new pte before dereferencing anything mapped
there, then it will invalidate its own TLB then.

Will
* [PATCH 1/4] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
From: Catalin Marinas @ 2013-03-27 13:40 UTC
To: linux-arm-kernel

On Wed, Mar 27, 2013 at 12:56:39PM +0000, Will Deacon wrote:
> On Wed, Mar 27, 2013 at 12:30:55PM +0000, Catalin Marinas wrote:
> > Do we also need to disable preemption?
> 
> I don't think so, that should be taken care of by the caller if they are
> issuing the local_ operation (otherwise it's racy anyway).

OK.

> > I don't think it's 11MPCore only: set_top_pte() is called by
> > flush_icache_alias() from flush_ptrace_access() even on ARMv7.
> 
> Damn, yes, I missed those. Perhaps we should add set_top_pte_atomic, which
> just does the local flush, and then promote the current flush to be IS?

Where would we use the set_top_pte_atomic() on ARMv7?

> > It's not whether they are used explicitly but whether a speculative TLB
> > load can bring them in on a different CPU. I don't immediately see a
> > problem with non-aliasing caches but it needs some more thinking.
> 
> But why do we care about the speculation? If the core doing the speculating
> is always going to write a new pte before dereferencing anything mapped
> there, then it will invalidate its own TLB then.

It's about speculation on another CPU.

Let's say CPU0 does several kmap_atomic() calls which in turn call
set_top_pte(). The same page tables are visible to CPU1 which
speculatively loads some top pte (not the latest). At this point we have
a VA pointing to different PAs on CPU0 and CPU1. CPU1 would not access
this VA, so not a problem here, but whether this matters for
inner-shareable cache maintenance (dma_cache_maint_page), I can't tell
yet (internal thread with the architecture guys).

-- 
Catalin
* [PATCH 1/4] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
From: Will Deacon @ 2013-03-27 13:54 UTC
To: linux-arm-kernel

On Wed, Mar 27, 2013 at 01:40:29PM +0000, Catalin Marinas wrote:
> On Wed, Mar 27, 2013 at 12:56:39PM +0000, Will Deacon wrote:
> > Damn, yes, I missed those. Perhaps we should add set_top_pte_atomic, which
> > just does the local flush, and then promote the current flush to be IS?
> 
> Where would we use the set_top_pte_atomic() on ARMv7?

I was thinking of kmap_atomic{_pfn}, to avoid adding further overhead to
highmem.

> > > It's not whether they are used explicitly but whether a speculative TLB
> > > load can bring them in on a different CPU. I don't immediately see a
> > > problem with non-aliasing caches but it needs some more thinking.
> > 
> > But why do we care about the speculation? If the core doing the speculating
> > is always going to write a new pte before dereferencing anything mapped
> > there, then it will invalidate its own TLB then.
> 
> It's about speculation on another CPU.
> 
> Let's say CPU0 does several kmap_atomic() calls which in turn call
> set_top_pte(). The same page tables are visible to CPU1 which
> speculatively loads some top pte (not the latest). At this point we have
> a VA pointing to different PAs on CPU0 and CPU1. CPU1 would not access
> this VA, so not a problem here, but whether this matters for
> inner-shareable cache maintenance (dma_cache_maint_page), I can't tell
> yet (internal thread with the architecture guys).

Ok... given that the (speculated) lines won't be dirty, I don't see why
this matters for cache maintenance, but I'll wait to see what the
architecture guys come back with.

Will
* [PATCH 2/4] ARM: tlb: don't perform inner-shareable invalidation for local BP ops
From: Will Deacon @ 2013-03-25 18:19 UTC
To: linux-arm-kernel

Now that the ASID allocator doesn't require inner-shareable maintenance,
we can convert the local_bp_flush_all function to perform only
non-shareable flushing, in a similar manner to the TLB invalidation
routines.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/tlbflush.h | 16 +++++++++++++---
 arch/arm/kernel/smp_tlb.c       |  2 +-
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
index ae9f34f..c7cdb59 100644
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -171,7 +171,7 @@
 #define v7wbi_tlb_flags_smp     (TLB_WB | TLB_DCLEAN | TLB_BARRIER | \
                                  TLB_V6_U_FULL | TLB_V6_U_PAGE | \
-                                 TLB_V6_U_ASID | \
+                                 TLB_V6_U_ASID | TLB_V6_BP | \
                                  TLB_V7_UIS_FULL | TLB_V7_UIS_PAGE | \
                                  TLB_V7_UIS_ASID | TLB_V7_UIS_BP)
 #define v7wbi_tlb_flags_up      (TLB_WB | TLB_DCLEAN | TLB_BARRIER | \
@@ -495,14 +495,24 @@ static inline void __flush_tlb_kernel_page(unsigned long kaddr)
        }
 }

-static inline void local_flush_bp_all(void)
+static inline void __flush_bp_all(void)
 {
        const int zero = 0;
        const unsigned int __tlb_flag = __cpu_tlb_flags;

        if (tlb_flag(TLB_V7_UIS_BP))
                asm("mcr p15, 0, %0, c7, c1, 6" : : "r" (zero));
-       else if (tlb_flag(TLB_V6_BP))
+
+       if (tlb_flag(TLB_BARRIER))
+               isb();
+}
+
+static inline void local_flush_bp_all(void)
+{
+       const int zero = 0;
+       const unsigned int __tlb_flag = __cpu_tlb_flags;
+
+       if (tlb_flag(TLB_V6_BP))
                asm("mcr p15, 0, %0, c7, c5, 6" : : "r" (zero));

        if (tlb_flag(TLB_BARRIER))
diff --git a/arch/arm/kernel/smp_tlb.c b/arch/arm/kernel/smp_tlb.c
index 6ae08b1..bc3617a 100644
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -137,5 +137,5 @@ void flush_bp_all(void)
        if (tlb_ops_need_broadcast())
                on_each_cpu(ipi_flush_bp_all, NULL, 1);
        else
-               local_flush_bp_all();
+               __flush_bp_all();
 }
-- 
1.8.0
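For reference, the two branch-predictor maintenance operations the patch
separates, wrapped as illustrative helpers (the wrapper names are made
up; the kernel open-codes the mcr instructions as shown in the diff):

/* Illustrative wrappers, not kernel API: BPIALLIS broadcasts the
 * invalidation across the inner-shareable domain, BPIALL invalidates
 * only the local core's branch predictor.
 */
static inline void bp_inval_all_is(void)        /* BPIALLIS */
{
        asm volatile("mcr p15, 0, %0, c7, c1, 6" : : "r" (0));
}

static inline void bp_inval_all_local(void)     /* BPIALL */
{
        asm volatile("mcr p15, 0, %0, c7, c5, 6" : : "r" (0));
}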
* [PATCH 2/4] ARM: tlb: don't perform inner-shareable invalidation for local BP ops
From: Catalin Marinas @ 2013-03-27 10:36 UTC
To: linux-arm-kernel

On Mon, Mar 25, 2013 at 06:19:39PM +0000, Will Deacon wrote:
> Now that the ASID allocator doesn't require inner-shareable maintenance,
> we can convert the local_bp_flush_all function to perform only
> non-shareable flushing, in a similar manner to the TLB invalidation
> routines.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Will Deacon @ 2013-03-25 18:19 UTC
To: linux-arm-kernel

Many ARMv7 cores have hardware page table walkers that can read the L1
cache. This is discoverable from the ID_MMFR3 register, although this
can be expensive to access from the low-level set_pte functions and is a
pain to cache, particularly with multi-cluster systems.

A useful observation is that the multi-processing extensions for ARMv7
require coherent table walks, meaning that we can make use of ALT_SMP
patching in proc-v7-* to patch away the cache flush safely for these
cores.

Reported-by: Albin Tonnerre <Albin.Tonnerre@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/tlbflush.h | 2 +-
 arch/arm/mm/proc-v6.S           | 2 --
 arch/arm/mm/proc-v7-2level.S    | 3 ++-
 arch/arm/mm/proc-v7-3level.S    | 3 ++-
 arch/arm/mm/proc-v7.S           | 4 ++--
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
index c7cdb59..42d155e 100644
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -169,7 +169,7 @@
 # define v6wbi_always_flags     (-1UL)
 #endif

-#define v7wbi_tlb_flags_smp     (TLB_WB | TLB_DCLEAN | TLB_BARRIER | \
+#define v7wbi_tlb_flags_smp     (TLB_WB | TLB_BARRIER | \
                                  TLB_V6_U_FULL | TLB_V6_U_PAGE | \
                                  TLB_V6_U_ASID | TLB_V6_BP | \
                                  TLB_V7_UIS_FULL | TLB_V7_UIS_PAGE | \
diff --git a/arch/arm/mm/proc-v6.S b/arch/arm/mm/proc-v6.S
index bcaaa8d..a286d47 100644
--- a/arch/arm/mm/proc-v6.S
+++ b/arch/arm/mm/proc-v6.S
@@ -80,12 +80,10 @@ ENTRY(cpu_v6_do_idle)
        mov     pc, lr

 ENTRY(cpu_v6_dcache_clean_area)
-#ifndef TLB_CAN_READ_FROM_L1_CACHE
 1:     mcr     p15, 0, r0, c7, c10, 1          @ clean D entry
        add     r0, r0, #D_CACHE_LINE_SIZE
        subs    r1, r1, #D_CACHE_LINE_SIZE
        bhi     1b
-#endif
        mov     pc, lr

 /*
diff --git a/arch/arm/mm/proc-v7-2level.S b/arch/arm/mm/proc-v7-2level.S
index 78f520b..9704097 100644
--- a/arch/arm/mm/proc-v7-2level.S
+++ b/arch/arm/mm/proc-v7-2level.S
@@ -110,7 +110,8 @@ ENTRY(cpu_v7_set_pte_ext)
 ARM(   str     r3, [r0, #2048]!        )
 THUMB( add     r0, r0, #2048   )
 THUMB( str     r3, [r0]        )
-       mcr     p15, 0, r0, c7, c10, 1          @ flush_pte
+       ALT_SMP(mov     pc,lr)
+       ALT_UP (mcr     p15, 0, r0, c7, c10, 1) @ flush_pte
 #endif
        mov     pc, lr
 ENDPROC(cpu_v7_set_pte_ext)
diff --git a/arch/arm/mm/proc-v7-3level.S b/arch/arm/mm/proc-v7-3level.S
index 6ffd78c..363027e 100644
--- a/arch/arm/mm/proc-v7-3level.S
+++ b/arch/arm/mm/proc-v7-3level.S
@@ -73,7 +73,8 @@ ENTRY(cpu_v7_set_pte_ext)
        tst     r3, #1 << (55 - 32)             @ L_PTE_DIRTY
        orreq   r2, #L_PTE_RDONLY
 1:     strd    r2, r3, [r0]
-       mcr     p15, 0, r0, c7, c10, 1          @ flush_pte
+       ALT_SMP(mov     pc, lr)
+       ALT_UP (mcr     p15, 0, r0, c7, c10, 1) @ flush_pte
 #endif
        mov     pc, lr
 ENDPROC(cpu_v7_set_pte_ext)
diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index 3a3c015..37716b0 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -75,14 +75,14 @@ ENTRY(cpu_v7_do_idle)
 ENDPROC(cpu_v7_do_idle)

 ENTRY(cpu_v7_dcache_clean_area)
-#ifndef TLB_CAN_READ_FROM_L1_CACHE
+       ALT_SMP(mov     pc, lr)                 @ MP extensions imply L1 PTW
+       ALT_UP(W(nop))
        dcache_line_size r2, r3
 1:     mcr     p15, 0, r0, c7, c10, 1          @ clean D entry
        add     r0, r0, r2
        subs    r1, r1, r2
        bhi     1b
        dsb
-#endif
        mov     pc, lr
 ENDPROC(cpu_v7_dcache_clean_area)
-- 
1.8.0
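Conceptually, the change to cpu_v7_set_pte_ext amounts to the following
C model (the names coherent_table_walks and clean_dcache_line are made
up for illustration; they are not kernel symbols):

/* Conceptual model only -- this is what the ALT_SMP/ALT_UP patching
 * above achieves for the pte write path.
 */
static void set_pte_model(pte_t *ptep, pte_t pte)
{
        *ptep = pte;                    /* the str/strd in set_pte_ext */

        if (!coherent_table_walks)      /* UP: walker may not snoop L1 */
                clean_dcache_line(ptep);/* mcr p15, 0, ptep, c7, c10, 1 */

        /* SMP: the MP extensions mandate coherent walks, so the
         * patched code returns immediately instead of cleaning. */
}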
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Catalin Marinas @ 2013-03-27 10:53 UTC
To: linux-arm-kernel

On Mon, Mar 25, 2013 at 06:19:40PM +0000, Will Deacon wrote:
> Many ARMv7 cores have hardware page table walkers that can read the L1
> cache. This is discoverable from the ID_MMFR3 register, although this
> can be expensive to access from the low-level set_pte functions and is a
> pain to cache, particularly with multi-cluster systems.
> 
> A useful observation is that the multi-processing extensions for ARMv7
> require coherent table walks, meaning that we can make use of ALT_SMP
> patching in proc-v7-* to patch away the cache flush safely for these
> cores.
> 
> Reported-by: Albin Tonnerre <Albin.Tonnerre@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

There are some pmd flushing functions we should target as well
(flush_pmd_entry, clean_pmd_entry) in this patch or a new one.

-- 
Catalin
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Will Deacon @ 2013-03-27 12:20 UTC
To: linux-arm-kernel

On Wed, Mar 27, 2013 at 10:53:49AM +0000, Catalin Marinas wrote:
> On Mon, Mar 25, 2013 at 06:19:40PM +0000, Will Deacon wrote:
> > [...]
> 
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> 
> There are some pmd flushing functions we should target as well
> (flush_pmd_entry, clean_pmd_entry) in this patch or a new one.

I already took care of those by avoiding the TLB_DCLEAN flag for v7 SMP.

Will
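A sketch of why dropping TLB_DCLEAN from v7wbi_tlb_flags_smp covers those
helpers too (simplified from the real tlbflush.h; the actual
flush_pmd_entry differs in detail):

/* The pmd helpers are guarded by the same flag test as everything else:
 * once TLB_DCLEAN is gone from the v7 SMP flags the clean is skipped at
 * runtime, and on a v7-SMP-only build, where no possible flag contains
 * TLB_DCLEAN, the compiler removes it entirely.
 */
static inline void flush_pmd_entry(void *pmd)
{
        const unsigned int __tlb_flag = __cpu_tlb_flags;

        tlb_op(TLB_DCLEAN, "c7, c10, 1", pmd);  /* no-op without TLB_DCLEAN */

        if (tlb_flag(TLB_WB))
                dsb();
}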
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Gregory CLEMENT @ 2013-05-15 13:18 UTC
To: linux-arm-kernel

Hi Will,

On 03/25/2013 07:19 PM, Will Deacon wrote:
> Many ARMv7 cores have hardware page table walkers that can read the L1
> cache. This is discoverable from the ID_MMFR3 register, although this
> can be expensive to access from the low-level set_pte functions and is a
> pain to cache, particularly with multi-cluster systems.
> 
> A useful observation is that the multi-processing extensions for ARMv7
> require coherent table walks, meaning that we can make use of ALT_SMP
> patching in proc-v7-* to patch away the cache flush safely for these
> cores.

I encountered a regression with 3.10-rc1 on the Armada 370 based boards.
With 3.10-rc1 they hang during the self-tests of the xor engine, which
are mainly DMA transfers. If I revert this patch, they no longer hang. I
found this by using bisect; it was not obvious at all to me that this
patch might have caused the regression.

The issue appears in SMP and in UP. However, I think the PJ4B-v7 used in
the Armada 370 is not MP capable.

I did some investigation. In UP, if I remove the line:

	ALT_UP(W(nop))

at the beginning of cpu_v7_dcache_clean_area in arch/arm/mm/proc-v7.S,
then the kernel boots again. That is not surprising, because in this
case we get the same generated code as before this patch was applied.

What I don't really understand is why a W(nop) would cause this issue.

In SMP mode the kernel hangs even with this line removed, but in that
case I am not sure what happens exactly, nor how the .alt.smp.init
section is used.

I don't know if it is relevant, but I tested with these two versions of
gcc: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) and gcc version
4.7.3 (Ubuntu/Linaro 4.7.3-1ubuntu1).

I hope you will find some explanation and solution for this bug, because
currently the only solution I have is to revert this patch.

Thanks,
Gregory

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Will Deacon @ 2013-05-15 13:41 UTC
To: linux-arm-kernel

On Wed, May 15, 2013 at 02:18:53PM +0100, Gregory CLEMENT wrote:
> Hi Will,

Hi Gregory,

> On 03/25/2013 07:19 PM, Will Deacon wrote:
> > [...]
> 
> I encountered a regression with 3.10-rc1 on the Armada 370 based boards.
> With 3.10-rc1 they hang during the self-tests of the xor engine, which
> are mainly DMA transfers. If I revert this patch, they no longer hang. I
> found this by using bisect; it was not obvious at all to me that this
> patch might have caused the regression.

Is this using dmatest.ko, or a different test program?

> The issue appears in SMP and in UP. However, I think the PJ4B-v7 used in
> the Armada 370 is not MP capable.

Ok, so the ALT_UP case should be emitted after patching, correct?

> I did some investigation. In UP, if I remove the line:
> 
> 	ALT_UP(W(nop))

Did you remove the ALT_SMP case as well? You could also try making the
ALT_SMP case use W(nop) too and see if it changes anything.

> at the beginning of cpu_v7_dcache_clean_area in arch/arm/mm/proc-v7.S,
> then the kernel boots again. That is not surprising, because in this
> case we get the same generated code as before this patch was applied.
> 
> What I don't really understand is why a W(nop) would cause this issue.

No, that sounds weird. Can you inspect the functions using JTAG after the smp
patching code has executed?

> In SMP mode the kernel hangs even with this line removed, but in that
> case I am not sure what happens exactly, nor how the .alt.smp.init
> section is used.

If you don't have both of the alternatives, things will go wrong.

> I hope you will find some explanation and solution for this bug, because
> currently the only solution I have is to revert this patch.

Let's not jump to that just yet!

Cheers,

Will
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Gregory CLEMENT @ 2013-05-15 13:54 UTC
To: linux-arm-kernel

On 05/15/2013 03:41 PM, Will Deacon wrote:
> Is this using dmatest.ko, or a different test program?

No, they are the self-tests from the mv_xor driver.

> Ok, so the ALT_UP case should be emitted after patching, correct?

Indeed, that is what I expected, but I didn't check (I don't know how to).

> Did you remove the ALT_SMP case as well? You could also try making the
> ALT_SMP case use W(nop) too and see if it changes anything.

OK, I will try it.

> No, that sounds weird. Can you inspect the functions using JTAG after the smp
> patching code has executed?

No, I can't: I don't have any JTAG :/

> If you don't have both of the alternatives, things will go wrong.

For my own education, how do ALT_UP and ALT_SMP work in the SMP case?
When I disassembled proc-v7.o, I saw that the SMP variant of the code
was written. How does the kernel switch to the UP version of the code?

> Let's not jump to that just yet!

Sure, I hope we will find a fix for that.

Thanks,
Gregory

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Will Deacon @ 2013-05-15 14:06 UTC
To: linux-arm-kernel

On Wed, May 15, 2013 at 02:54:14PM +0100, Gregory CLEMENT wrote:
> On 05/15/2013 03:41 PM, Will Deacon wrote:
> > Is this using dmatest.ko, or a different test program?
> 
> No, they are the self-tests from the mv_xor driver.

Ok. And the CPU locks up, rather than the DMA master?

> > Ok, so the ALT_UP case should be emitted after patching, correct?
> 
> Indeed, that is what I expected, but I didn't check (I don't know how to).

If you don't have JTAG, that's trickier. Hopefully my other suggestion will
help.

> > > I did some investigation. In UP, if I remove the line:
> > > 	ALT_UP(W(nop))
> > 
> > Did you remove the ALT_SMP case as well? You could also try making the
> > ALT_SMP case use W(nop) too and see if it changes anything.
> 
> OK, I will try it.

You could also try deleting both of the ALT_* lines and just putting a
W(nop) in there directly.

> > If you don't have both of the alternatives, things will go wrong.
> 
> For my own education, how do ALT_UP and ALT_SMP work in the SMP case?
> When I disassembled proc-v7.o, I saw that the SMP variant of the code
> was written. How does the kernel switch to the UP version of the code?

Early in boot (head.S:__fixup_smp), we detect whether the CPU has the MP
extensions then, if we're actually UP but the kernel has CONFIG_SMP=y, we
walk over the .alt.smp.init section looking at each entry in there. The
ALT_UP macro spits out an (address, instruction) pair, so in
__do_fixup_smp_on_up, we store the instruction to the address for each pair,
replacing the SMP instruction which sat there in the compiled image.

It could be that the CPUID checks are failing on your Marvell part. Can you
tell me what you have in your mpidr (mrc p15, 0, Rd, c0, c0, 5) and also
your midr (mrc p15, 0, Rd, c0, c0) please?

Cheers,

Will
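A rough C model of the patching loop described above (the real
__do_fixup_smp_on_up is assembly in arch/arm/kernel/head.S and runs
before the MMU is enabled, adjusting addresses accordingly; the struct
and function names here are made up for illustration):

/* Rough model only: each ALT_UP() records a (location, UP instruction)
 * pair in .alt.smp.init; on a UP CPU the loop stores the UP instruction
 * over the SMP one that was compiled into the image.
 */
struct smp_alt {
        unsigned long loc;      /* address of the SMP instruction */
        unsigned int  up_insn;  /* UP replacement emitted by ALT_UP() */
};

static void fixup_smp_on_up(struct smp_alt *start, struct smp_alt *end)
{
        struct smp_alt *a;

        for (a = start; a < end; a++)
                *(unsigned int *)a->loc = a->up_insn;

        /* the caches must then be cleaned/invalidated so the patched
         * instructions are what the CPU actually fetches */
}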
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Gregory CLEMENT @ 2013-05-15 14:46 UTC
To: linux-arm-kernel

On 05/15/2013 04:06 PM, Will Deacon wrote:
> You could also try deleting both of the ALT_* lines and just putting a
> W(nop) in there directly.

If I just delete both of the ALT_* lines, it no longer hangs.
If I put a W(nop) there instead, it hangs.

> It could be that the CPUID checks are failing on your Marvell part. Can you
> tell me what you have in your mpidr (mrc p15, 0, Rd, c0, c0, 5) and also
> your midr (mrc p15, 0, Rd, c0, c0) please?

mpidr = 0x0
midr  = 0x561F5811

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
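For reference, decoding the reported MIDR with the standard ARMv7 field
layout gives (just arithmetic on the value Gregory reports):

/* Quick decode of midr = 0x561F5811: */
unsigned int midr = 0x561f5811;

unsigned int implementer  = (midr >> 24) & 0xff;  /* 0x56: Marvell ('V') */
unsigned int variant      = (midr >> 20) & 0xf;   /* 0x1 */
unsigned int architecture = (midr >> 16) & 0xf;   /* 0xf: CPUID scheme */
unsigned int part         = (midr >>  4) & 0xfff; /* 0x581 */
unsigned int revision     = midr & 0xf;           /* 0x1 */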
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Will Deacon @ 2013-05-15 15:04 UTC
To: linux-arm-kernel

On Wed, May 15, 2013 at 03:46:20PM +0100, Gregory CLEMENT wrote:
> On 05/15/2013 04:06 PM, Will Deacon wrote:
> > You could also try deleting both of the ALT_* lines and just putting a
> > W(nop) in there directly.
> 
> If I just delete both of the ALT_* lines, it no longer hangs.
> If I put a W(nop) there instead, it hangs.

Wow. This doesn't sound good for your CPU and you might want to check with
the Marvell guys...

Extra things you could try:

  - Try adding a 16-bit nop instead (remove the W, build Thumb-2 and
    double check in your disassembly)

  - Try adding the W(nop) to other places in the kernel and see if you
    can tickle the lock-up elsewhere.

  - Can you reproduce on the Armada XP? (since I have access to one of
    those)

> > It could be that the CPUID checks are failing on your Marvell part. Can you
> > tell me what you have in your mpidr (mrc p15, 0, Rd, c0, c0, 5) and also
> > your midr (mrc p15, 0, Rd, c0, c0) please?
> 
> mpidr = 0x0
> midr  = 0x561F5811

Should be fine, but it doesn't look like the patching is the issue here.

Will
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
From: Gregory CLEMENT @ 2013-05-15 15:36 UTC
To: linux-arm-kernel

On 05/15/2013 05:04 PM, Will Deacon wrote:
> On Wed, May 15, 2013 at 03:46:20PM +0100, Gregory CLEMENT wrote:
>> If I just delete both of the ALT_* lines, it no longer hangs.
>> If I put a W(nop) there instead, it hangs.

It also hangs with a simple nop, by the way.

> Wow. This doesn't sound good for your CPU and you might want to check with
> the Marvell guys...
> 
> Extra things you could try:
> 
>   - Try adding a 16-bit nop instead (remove the W, build Thumb-2 and
>     double check in your disassembly)
> 
>   - Try adding the W(nop) to other places in the kernel and see if you
>     can tickle the lock-up elsewhere.

I managed to add W(nop) elsewhere in the kernel without getting any
lock-up.

Could the fact that this nop is the first instruction of the routine
have an influence?

>   - Can you reproduce on the Armada XP? (since I have access to one of
>     those)

No, on Armada XP I don't have this kind of problem, even in UP.

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
  2013-05-15 15:36 ` Gregory CLEMENT
@ 2013-05-15 15:41 ` Will Deacon
  2013-05-15 16:29 ` Gregory CLEMENT
  0 siblings, 1 reply; 26+ messages in thread
From: Will Deacon @ 2013-05-15 15:41 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 15, 2013 at 04:36:43PM +0100, Gregory CLEMENT wrote:
> On 05/15/2013 05:04 PM, Will Deacon wrote:
> > On Wed, May 15, 2013 at 03:46:20PM +0100, Gregory CLEMENT wrote:
> >> On 05/15/2013 04:06 PM, Will Deacon wrote:
> >>> You could also try deleting both of the ALT_* lines and just putting a
> >>> W(nop) in there directly.
> >>
> >> If I just delete both of the ALT_* lines, it no longer hangs.
> >> If I put a W(nop) there instead, it hangs.
>
> It also hangs with a simple nop, by the way.

Ok. Have you tried adding a different instruction (mov r0, r0, for example)?

> >
> > Wow. This doesn't sound good for your CPU and you might want to check with
> > the Marvell guys...
> >
> > Extra things you could try:
> >
> >   - Try adding a 16-bit nop instead (remove the W, build Thumb-2 and
> >     double-check in your disassembly)
> >
> >   - Try adding the W(nop) to other places in the kernel and see if you
> >     can tickle the lock-up elsewhere.
>
> I managed to add W(nop) elsewhere in the kernel without getting any lock-up.
>
> Could the fact that this nop is the first instruction of the macro have an
> influence?

I can't think why -- the macro is just expanded inline during assembly.

>
> >
> >   - Can you reproduce on the Armada XP? (since I have access to one of
> >     those)
>
> No, on Armada XP I don't have this kind of problem, even on UP.

Is that compiling with CONFIG_SMP=n?

Will

^ permalink raw reply	[flat|nested] 26+ messages in thread
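Since "the macro" here is ALT_SMP/ALT_UP, a simplified sketch may help
readers without the tree to hand (the real definitions live in
arch/arm/include/asm/assembler.h and carry extra Thumb-2 plumbing that this
sketch omits):

	#ifdef CONFIG_SMP
	/* Emit the SMP instruction into .text and remember its address. */
	#define ALT_SMP(instr...)			\
	9998:	instr

	/* Record (address, UP instruction) in .alt.smp.init so the boot-time
	 * fixup can patch over the SMP instruction on UP hardware. */
	#define ALT_UP(instr...)			\
		.pushsection ".alt.smp.init", "a"	;\
		.long	9998b				;\
		instr					;\
		.popsection
	#else
	#define ALT_SMP(instr...)
	#define ALT_UP(instr...)	instr
	#endif

So, as noted above, nothing survives to run time beyond the patched
instruction itself; the macros are pure assembly-time bookkeeping.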
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
  2013-05-15 15:41 ` Will Deacon
@ 2013-05-15 16:29 ` Gregory CLEMENT
  2013-05-15 16:48 ` Will Deacon
  0 siblings, 1 reply; 26+ messages in thread
From: Gregory CLEMENT @ 2013-05-15 16:29 UTC (permalink / raw)
To: linux-arm-kernel

On 05/15/2013 05:41 PM, Will Deacon wrote:
> On Wed, May 15, 2013 at 04:36:43PM +0100, Gregory CLEMENT wrote:
>> On 05/15/2013 05:04 PM, Will Deacon wrote:
>>> On Wed, May 15, 2013 at 03:46:20PM +0100, Gregory CLEMENT wrote:
>>>> On 05/15/2013 04:06 PM, Will Deacon wrote:
>>>>> You could also try deleting both of the ALT_* lines and just putting a
>>>>> W(nop) in there directly.
>>>>
>>>> If I just delete both of the ALT_* lines, it no longer hangs.
>>>> If I put a W(nop) there instead, it hangs.
>>
>> It also hangs with a simple nop, by the way.
>
> Ok. Have you tried adding a different instruction (mov r0, r0, for example)?

It hung again :(

I also tried to boot a kernel in Thumb-2; it doesn't hang in the same place,
but the kernel crashes later.

>>>
>>> Wow. This doesn't sound good for your CPU and you might want to check with
>>> the Marvell guys...
>>>
>>> Extra things you could try:
>>>
>>>   - Try adding a 16-bit nop instead (remove the W, build Thumb-2 and
>>>     double-check in your disassembly)
>>>
>>>   - Try adding the W(nop) to other places in the kernel and see if you
>>>     can tickle the lock-up elsewhere.
>>
>> I managed to add W(nop) elsewhere in the kernel without getting any lock-up.
>>
>> Could the fact that this nop is the first instruction of the macro have an
>> influence?
>
> I can't think why -- the macro is just expanded inline during assembly.

If I put anything before the "dcache_line_size r2, r3" line, the kernel
hangs, but just after it, it's OK.

>>
>>>
>>>   - Can you reproduce on the Armada XP? (since I have access to one of
>>>     those)
>>
>> No, on Armada XP I don't have this kind of problem, even on UP.
>
> Is that compiling with CONFIG_SMP=n?

Yes

>
> Will

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
  2013-05-15 16:29 ` Gregory CLEMENT
@ 2013-05-15 16:48 ` Will Deacon
  2013-05-15 17:16 ` Russell King - ARM Linux
  0 siblings, 1 reply; 26+ messages in thread
From: Will Deacon @ 2013-05-15 16:48 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 15, 2013 at 05:29:03PM +0100, Gregory CLEMENT wrote:
> On 05/15/2013 05:41 PM, Will Deacon wrote:
> > On Wed, May 15, 2013 at 04:36:43PM +0100, Gregory CLEMENT wrote:
> >> On 05/15/2013 05:04 PM, Will Deacon wrote:
> >>> On Wed, May 15, 2013 at 03:46:20PM +0100, Gregory CLEMENT wrote:
> >>>> On 05/15/2013 04:06 PM, Will Deacon wrote:
> >>>>> You could also try deleting both of the ALT_* lines and just putting a
> >>>>> W(nop) in there directly.
> >>>>
> >>>> If I just delete both of the ALT_* lines, it no longer hangs.
> >>>> If I put a W(nop) there instead, it hangs.
> >>
> >> It also hangs with a simple nop, by the way.
> >
> > Ok. Have you tried adding a different instruction (mov r0, r0, for example)?
>
> It hung again :(
>
> I also tried to boot a kernel in Thumb-2; it doesn't hang in the same place,
> but the kernel crashes later.

Crikey, that's not much fun. My last idea is to try sticking an isb
instruction in there. Does that make any difference?

Can you get access to another 370 board and see if the problem reproduces
there? Is there any possibility of getting JTAG (again, perhaps on a
different board)?

Failing that, I think you need to pester the CPU guys and/or the board guys
to help you debug this.

Will

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead
  2013-05-15 16:48 ` Will Deacon
@ 2013-05-15 17:16 ` Russell King - ARM Linux
  0 siblings, 0 replies; 26+ messages in thread
From: Russell King - ARM Linux @ 2013-05-15 17:16 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 15, 2013 at 05:48:43PM +0100, Will Deacon wrote:
> On Wed, May 15, 2013 at 05:29:03PM +0100, Gregory CLEMENT wrote:
> > On 05/15/2013 05:41 PM, Will Deacon wrote:
> > > On Wed, May 15, 2013 at 04:36:43PM +0100, Gregory CLEMENT wrote:
> > >> On 05/15/2013 05:04 PM, Will Deacon wrote:
> > >>> On Wed, May 15, 2013 at 03:46:20PM +0100, Gregory CLEMENT wrote:
> > >>>> On 05/15/2013 04:06 PM, Will Deacon wrote:
> > >>>>> You could also try deleting both of the ALT_* lines and just putting a
> > >>>>> W(nop) in there directly.
> > >>>>
> > >>>> If I just delete both of the ALT_* lines, it no longer hangs.
> > >>>> If I put a W(nop) there instead, it hangs.
> > >>
> > >> It also hangs with a simple nop, by the way.
> > >
> > > Ok. Have you tried adding a different instruction (mov r0, r0, for example)?
> >
> > It hung again :(
> >
> > I also tried to boot a kernel in Thumb-2; it doesn't hang in the same place,
> > but the kernel crashes later.
>
> Crikey, that's not much fun. My last idea is to try sticking an isb
> instruction in there. Does that make any difference?
>
> Can you get access to another 370 board and see if the problem reproduces
> there? Is there any possibility of getting JTAG (again, perhaps on a
> different board)?
>
> Failing that, I think you need to pester the CPU guys and/or the board guys
> to help you debug this.

I've just tried it on the cubox (Dove, Armada 510), which has the xor
engines enabled in the config. I see in the boot messages:

mv_xor mv_xor.0: Marvell XOR driver
mv_xor mv_xor.0: Marvell XOR: ( xor cpy )
mv_xor mv_xor.0: Marvell XOR: ( xor fill cpy )
mv_xor mv_xor.1: Marvell XOR driver
mv_xor mv_xor.1: Marvell XOR: ( xor cpy )
mv_xor mv_xor.1: Marvell XOR: ( xor fill cpy )

and /proc/interrupts shows that the interrupts have fired:

 39:          2  orion_irq  mv_xor.0
 40:          2  orion_irq  mv_xor.0
 42:          2  orion_irq  mv_xor.1
 43:          2  orion_irq  mv_xor.1

on each boot. So I assume that they're doing this self-test.

Anyway, I've added the 'nop' to cpu_v7_dcache_clean_area, and it boots
just fine. The CPU is:

CPU: ARMv7 Processor [560f5815] revision 5 (ARMv7), cr=10c53c7d

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 4/4] ARM: atomics: don't use exclusives for atomic64 read/set with LPAE
  2013-03-25 18:19 [PATCH 0/4] TLB and mm-related optimisations Will Deacon
                   ` (2 preceding siblings ...)
  2013-03-25 18:19 ` [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead Will Deacon
@ 2013-03-25 18:19 ` Will Deacon
  2013-03-27 10:57 ` Catalin Marinas
  3 siblings, 1 reply; 26+ messages in thread
From: Will Deacon @ 2013-03-25 18:19 UTC (permalink / raw)
To: linux-arm-kernel

To ease page table updates with 64-bit descriptors, CPUs implementing
LPAE are required to implement ldrd/strd as atomic operations.

This patch uses these accessors instead of the exclusive variants when
performing atomic64_{read,set} on LPAE systems.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/atomic.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h
index c79f61f..da1c77d 100644
--- a/arch/arm/include/asm/atomic.h
+++ b/arch/arm/include/asm/atomic.h
@@ -243,6 +243,29 @@ typedef struct {
 
 #define ATOMIC64_INIT(i)	{ (i) }
 
+#ifdef CONFIG_ARM_LPAE
+static inline u64 atomic64_read(const atomic64_t *v)
+{
+	u64 result;
+
+	__asm__ __volatile__("@ atomic64_read\n"
+"	ldrd	%0, %H0, [%1]"
+	: "=&r" (result)
+	: "r" (&v->counter), "Qo" (v->counter)
+	);
+
+	return result;
+}
+
+static inline void atomic64_set(atomic64_t *v, u64 i)
+{
+	__asm__ __volatile__("@ atomic64_set\n"
+"	strd	%2, %H2, [%1]"
+	: "=Qo" (v->counter)
+	: "r" (&v->counter), "r" (i)
+	);
+}
+#else
 static inline u64 atomic64_read(const atomic64_t *v)
 {
 	u64 result;
@@ -269,6 +292,7 @@ static inline void atomic64_set(atomic64_t *v, u64 i)
 	: "r" (&v->counter), "r" (i)
 	: "cc");
 }
+#endif
 
 static inline void atomic64_add(u64 i, atomic64_t *v)
 {
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread
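For readers wanting to see the effect, here is a minimal, assumed usage
sketch (the counter and function names are invented; atomic64_read() and
atomic64_set() are the kernel API the patch modifies). On an LPAE kernel the
two accessors now compile to a single ldrd/strd rather than an
exclusive-monitor loop:

	#include <linux/atomic.h>
	#include <linux/types.h>

	static atomic64_t demo_counter = ATOMIC64_INIT(0);	/* hypothetical */

	static void demo_update(void)
	{
		/* Single ldrd on LPAE; an ldrexd loop otherwise. */
		u64 snapshot = atomic64_read(&demo_counter);

		/*
		 * Single strd on LPAE. Note that a read followed by a set is
		 * not an atomic read-modify-write; use atomic64_add() and
		 * friends for that.
		 */
		atomic64_set(&demo_counter, snapshot + 1);
	}

The design point is that LPAE guarantees single-copy atomicity for aligned
64-bit ldrd/strd accesses (the page-table walker depends on it), so the
exclusives buy nothing for a plain read or store.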
* [PATCH 4/4] ARM: atomics: don't use exclusives for atomic64 read/set with LPAE
  2013-03-25 18:19 ` [PATCH 4/4] ARM: atomics: don't use exclusives for atomic64 read/set with LPAE Will Deacon
@ 2013-03-27 10:57 ` Catalin Marinas
  0 siblings, 0 replies; 26+ messages in thread
From: Catalin Marinas @ 2013-03-27 10:57 UTC (permalink / raw)
To: linux-arm-kernel

On Mon, Mar 25, 2013 at 06:19:41PM +0000, Will Deacon wrote:
> To ease page table updates with 64-bit descriptors, CPUs implementing
> LPAE are required to implement ldrd/strd as atomic operations.
>
> This patch uses these accessors instead of the exclusive variants when
> performing atomic64_{read,set} on LPAE systems.
>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads: [~2013-05-15 17:16 UTC | newest]

Thread overview: 26+ messages
2013-03-25 18:19 [PATCH 0/4] TLB and mm-related optimisations Will Deacon
2013-03-25 18:19 ` [PATCH 1/4] ARM: tlb: don't perform inner-shareable invalidation for local TLB ops Will Deacon
2013-03-27 10:34   ` Catalin Marinas
2013-03-27 12:07     ` Will Deacon
2013-03-27 12:30       ` Catalin Marinas
2013-03-27 12:56         ` Will Deacon
2013-03-27 13:40           ` Catalin Marinas
2013-03-27 13:54             ` Will Deacon
2013-03-25 18:19 ` [PATCH 2/4] ARM: tlb: don't perform inner-shareable invalidation for local BP ops Will Deacon
2013-03-27 10:36   ` Catalin Marinas
2013-03-25 18:19 ` [PATCH 3/4] ARM: mm: kill unused TLB_CAN_READ_FROM_L1_CACHE and use ALT_SMP instead Will Deacon
2013-03-27 10:53   ` Catalin Marinas
2013-03-27 12:20     ` Will Deacon
2013-05-15 13:18       ` Gregory CLEMENT
2013-05-15 13:41         ` Will Deacon
2013-05-15 13:54           ` Gregory CLEMENT
2013-05-15 14:06             ` Will Deacon
2013-05-15 14:46               ` Gregory CLEMENT
2013-05-15 15:04                 ` Will Deacon
2013-05-15 15:36                   ` Gregory CLEMENT
2013-05-15 15:41                     ` Will Deacon
2013-05-15 16:29                       ` Gregory CLEMENT
2013-05-15 16:48                         ` Will Deacon
2013-05-15 17:16                           ` Russell King - ARM Linux
2013-03-25 18:19 ` [PATCH 4/4] ARM: atomics: don't use exclusives for atomic64 read/set with LPAE Will Deacon
2013-03-27 10:57   ` Catalin Marinas