* [GIT PULL] arm64 patches for 3.15
@ 2014-03-31 17:52 Catalin Marinas
2014-04-01 16:10 ` Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15) Jon Medhurst (Tixy)
0 siblings, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2014-03-31 17:52 UTC (permalink / raw)
To: linux-arm-kernel
Hi Linus,
Please pull the arm64 patches below for 3.15. Thanks.
The following changes since commit cfbf8d4857c26a8a307fb7cd258074c9dcd8c691:
Linux 3.14-rc4 (2014-02-23 17:40:03 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-upstream
for you to fetch changes up to 196adf2f3015eacac0567278ba538e3ffdd16d0e:
arm64: Remove pgprot_dmacoherent() (2014-03-24 10:35:35 +0000)
----------------------------------------------------------------
- KGDB support for arm64
- PCI I/O space extended to 16M (in preparation for PCIe support patches)
- Dropping ZONE_DMA32 in favour of ZONE_DMA (we only need one for the
time being), together with swiotlb late initialisation to correctly
set up the bounce buffer
- DMA API cache maintenance support (not all ARMv8 platforms have
hardware cache coherency)
- Crypto extensions advertising via ELF_HWCAP2 for compat user space
- Perf support for dwarf unwinding in compat mode
- asm/tlb.h converted to the generic mmu_gather code
- asm-generic rwsem implementation
- Code clean-up
----------------------------------------------------------------
Ard Biesheuvel (4):
binfmt_elf: add ELF_HWCAP2 to compat auxv entries
arm64: add AT_HWCAP2 support for 32-bit compat
arm64: advertise ARMv8 extensions to 32-bit compat ELF binaries
arm64: enable generic CPU feature modalias matching for this architecture
Catalin Marinas (9):
arm64: Extend the PCI I/O space to 16MB
arm64: Convert asm/tlb.h to generic mmu_gather
arm64: Extend the idmap to the whole kernel image
arm64: Replace ZONE_DMA32 with ZONE_DMA
arm64: Use swiotlb late initialisation
arm64: Implement coherent DMA API based on swiotlb
arm64: Make DMA coherent and strongly ordered mappings not executable
arm64: Do not synchronise I and D caches for special ptes
arm64: Remove pgprot_dmacoherent()
Christopher Covington (1):
arm64: Fix __range_ok macro
Geoff Levand (1):
arm64: Fix the soft_restart routine
Jean Pihet (3):
ARM64: perf: add support for perf registers API
ARM64: perf: add support for frame pointer unwinding in compat mode
ARM64: perf: support dwarf unwinding in compat mode
Jingoo Han (2):
arm64: debug: make local symbols static
arm64: smp: make local symbol static
Laura Abbott (2):
arm64: Implement custom mmap functions for dma mapping
arm64: Support DMA_ATTR_WRITE_COMBINE
Mark Brown (2):
arm64: topology: Implement basic CPU topology support
arm64: Fix duplicated Kconfig entries
Mark Rutland (1):
arm64: remove unnecessary cache flush at boot
Nathan Lynch (1):
arm64: vdso: clean up vdso_pagelist initialization
Radha Mohan Chintakuntla (1):
arm64: Add boot time configuration of Intermediate Physical Address size
Ritesh Harjani (1):
arm64: Change misleading function names in dma-mapping
Rob Herring (1):
cpufreq: enable ARM drivers on arm64
Steve Capper (1):
arm64: mm: Route pmd thp functions through pte equivalents
Vijaya Kumar K (7):
arm64: Add macros to manage processor debug state
arm64: KGDB: Add Basic KGDB support
arm64: KGDB: Add step debugging support
KGDB: make kgdb_breakpoint() as noinline
misc: debug: remove compilation warnings
arm64: KGDB: Add KGDB config
arm64: enable processor debug state for secondary cpus
Vladimir Murzin (2):
arm64: remove redundant "psci:" prefixes
arm64: remove return value form psci_init()
Will Deacon (3):
arm64: barriers: add dmb barrier
asm-generic: rwsem: de-PPCify rwsem.h
arm64: rwsem: use asm-generic rwsem implementation
Documentation/arm64/memory.txt | 16 +-
arch/arm64/Kconfig | 26 ++-
arch/arm64/include/asm/Kbuild | 1 +
arch/arm64/include/asm/barrier.h | 1 +
arch/arm64/include/asm/cacheflush.h | 7 +
arch/arm64/include/asm/compat.h | 2 +-
arch/arm64/include/asm/cpufeature.h | 29 +++
arch/arm64/include/asm/debug-monitors.h | 64 ++++--
arch/arm64/include/asm/dma-mapping.h | 7 +
arch/arm64/include/asm/hwcap.h | 9 +-
arch/arm64/include/asm/io.h | 2 +-
arch/arm64/include/asm/irqflags.h | 23 +++
arch/arm64/include/asm/kgdb.h | 84 ++++++++
arch/arm64/include/asm/kvm_arm.h | 15 +-
arch/arm64/include/asm/pgtable-hwdef.h | 5 +-
arch/arm64/include/asm/pgtable.h | 60 +++---
arch/arm64/include/asm/psci.h | 2 +-
arch/arm64/include/asm/ptrace.h | 5 +-
arch/arm64/include/asm/tlb.h | 136 ++-----------
arch/arm64/include/asm/topology.h | 39 ++++
arch/arm64/include/asm/uaccess.h | 4 +-
arch/arm64/include/uapi/asm/Kbuild | 1 +
arch/arm64/include/uapi/asm/perf_regs.h | 40 ++++
arch/arm64/kernel/Makefile | 6 +-
arch/arm64/kernel/debug-monitors.c | 10 +-
arch/arm64/kernel/head.S | 20 +-
arch/arm64/kernel/kgdb.c | 336 ++++++++++++++++++++++++++++++++
arch/arm64/kernel/perf_event.c | 75 ++++++-
arch/arm64/kernel/perf_regs.c | 44 +++++
arch/arm64/kernel/process.c | 11 +-
arch/arm64/kernel/psci.c | 13 +-
arch/arm64/kernel/setup.c | 33 ++++
arch/arm64/kernel/smp.c | 12 ++
arch/arm64/kernel/smp_spin_table.c | 2 +-
arch/arm64/kernel/topology.c | 95 +++++++++
arch/arm64/kernel/vdso.c | 42 ++--
arch/arm64/kvm/hyp-init.S | 6 +
arch/arm64/mm/cache.S | 80 +++++++-
arch/arm64/mm/dma-mapping.c | 246 +++++++++++++++++++++--
arch/arm64/mm/init.c | 33 ++--
arch/arm64/mm/proc.S | 14 +-
drivers/cpufreq/Kconfig | 2 +-
fs/compat_binfmt_elf.c | 5 +
include/asm-generic/rwsem.h | 10 +-
kernel/debug/debug_core.c | 2 +-
45 files changed, 1368 insertions(+), 307 deletions(-)
create mode 100644 arch/arm64/include/asm/cpufeature.h
create mode 100644 arch/arm64/include/asm/kgdb.h
create mode 100644 arch/arm64/include/asm/topology.h
create mode 100644 arch/arm64/include/uapi/asm/perf_regs.h
create mode 100644 arch/arm64/kernel/kgdb.c
create mode 100644 arch/arm64/kernel/perf_regs.c
create mode 100644 arch/arm64/kernel/topology.c
--
Catalin
^ permalink raw reply	[flat|nested] 11+ messages in thread

* Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15)
2014-04-01 16:10 ` Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15) Jon Medhurst (Tixy)
@ 2014-04-01 17:29 ` Catalin Marinas
0 siblings, 1 reply; 11+ messages in thread
From: Jon Medhurst (Tixy) @ 2014-04-01 16:10 UTC (permalink / raw)
To: linux-arm-kernel

On Mon, 2014-03-31 at 18:52 +0100, Catalin Marinas wrote:
> The following changes since commit cfbf8d4857c26a8a307fb7cd258074c9dcd8c691:
>
>   Linux 3.14-rc4 (2014-02-23 17:40:03 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-upstream
>
> for you to fetch changes up to 196adf2f3015eacac0567278ba538e3ffdd16d0e:
>
>   arm64: Remove pgprot_dmacoherent() (2014-03-24 10:35:35 +0000)

I may have spotted a bug in commit 7363590d2c46 (arm64: Implement
coherent DMA API based on swiotlb), see my inline comment below...

[...]
> diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> index 1ea9f26..97fcef5 100644
> --- a/arch/arm64/mm/cache.S
> +++ b/arch/arm64/mm/cache.S
> @@ -166,3 +166,81 @@ ENTRY(__flush_dcache_area)
>      dsb sy
>      ret
>  ENDPROC(__flush_dcache_area)
> +
> +/*
> + *  __dma_inv_range(start, end)
> + *  - start   - virtual start address of region
> + *  - end     - virtual end address of region
> + */
> +__dma_inv_range:
> +    dcache_line_size x2, x3
> +    sub x3, x2, #1
> +    bic x0, x0, x3
> +    bic x1, x1, x3

Why is the 'end' value in x1 above rounded down to be cache aligned?
This means the cache invalidate won't include the cache line containing
the final bytes of the region, unless it happened to already be cache
line aligned. This looks especially suspect as the other two cache
operations added in the same patch (below) don't do that.

> +1:  dc ivac, x0                  // invalidate D / U line
> +    add x0, x0, x2
> +    cmp x0, x1
> +    b.lo 1b
> +    dsb sy
> +    ret
> +ENDPROC(__dma_inv_range)
> +
> +/*
> + *  __dma_clean_range(start, end)
> + *  - start   - virtual start address of region
> + *  - end     - virtual end address of region
> + */
> +__dma_clean_range:
> +    dcache_line_size x2, x3
> +    sub x3, x2, #1
> +    bic x0, x0, x3
> +1:  dc cvac, x0                  // clean D / U line
> +    add x0, x0, x2
> +    cmp x0, x1
> +    b.lo 1b
> +    dsb sy
> +    ret
> +ENDPROC(__dma_clean_range)
> +
> +/*
> + *  __dma_flush_range(start, end)
> + *  - start   - virtual start address of region
> + *  - end     - virtual end address of region
> + */
> +ENTRY(__dma_flush_range)
> +    dcache_line_size x2, x3
> +    sub x3, x2, #1
> +    bic x0, x0, x3
> +1:  dc civac, x0                 // clean & invalidate D / U line
> +    add x0, x0, x2
> +    cmp x0, x1
> +    b.lo 1b
> +    dsb sy
> +    ret
> +ENDPROC(__dma_flush_range)
[...]

-- 
Tixy

^ permalink raw reply	[flat|nested] 11+ messages in thread
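To make the arithmetic in the report above concrete, here is a minimal user-space C sketch (not kernel code; the 64-byte line size and the buffer addresses are made-up values) of how rounding both ends of an unaligned buffer down to a cache-line boundary leaves the final line untouched:

#include <stdint.h>
#include <stdio.h>

/* Illustration only: a hypothetical 64-byte cache line and an unaligned
 * DMA buffer occupying [0x1010, 0x10f0). */
#define LINE_SIZE 64UL

int main(void)
{
    uintptr_t start = 0x1010, end = 0x10f0;
    uintptr_t mask = LINE_SIZE - 1;

    /* What __dma_inv_range does before any fix: round both ends down. */
    uintptr_t aligned_start = start & ~mask;    /* 0x1000 */
    uintptr_t aligned_end = end & ~mask;        /* 0x10c0 */

    /* Lines in [aligned_start, aligned_end) get a "dc ivac"... */
    for (uintptr_t p = aligned_start; p < aligned_end; p += LINE_SIZE)
        printf("invalidate line at %#lx\n", (unsigned long)p);

    /* ...but the line holding the last bytes of the buffer does not. */
    printf("last buffer byte %#lx lives in line %#lx, which is missed\n",
           (unsigned long)(end - 1), (unsigned long)((end - 1) & ~mask));
    return 0;
}

With these example values the loop only touches 0x1000, 0x1040 and 0x1080, so stale cache contents covering 0x10c0-0x10ef could later mask data written by the device.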
* Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15)
2014-04-01 16:10 ` Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15) Jon Medhurst (Tixy)
@ 2014-04-01 17:29 ` Catalin Marinas
2014-04-02 8:52 ` Jon Medhurst (Tixy)
0 siblings, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2014-04-01 17:29 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, Apr 01, 2014 at 05:10:57PM +0100, Jon Medhurst (Tixy) wrote:
> On Mon, 2014-03-31 at 18:52 +0100, Catalin Marinas wrote:
> > The following changes since commit cfbf8d4857c26a8a307fb7cd258074c9dcd8c691:
> >
> >   Linux 3.14-rc4 (2014-02-23 17:40:03 -0800)
> >
> > are available in the git repository at:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-upstream
> >
> > for you to fetch changes up to 196adf2f3015eacac0567278ba538e3ffdd16d0e:
> >
> >   arm64: Remove pgprot_dmacoherent() (2014-03-24 10:35:35 +0000)
>
> I may have spotted a bug in commit 7363590d2c46 (arm64: Implement
> coherent DMA API based on swiotlb), see my inline comment below...
>
> [...]
> > diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> > index 1ea9f26..97fcef5 100644
> > --- a/arch/arm64/mm/cache.S
> > +++ b/arch/arm64/mm/cache.S
> > @@ -166,3 +166,81 @@ ENTRY(__flush_dcache_area)
> >      dsb sy
> >      ret
> >  ENDPROC(__flush_dcache_area)
> > +
> > +/*
> > + *  __dma_inv_range(start, end)
> > + *  - start   - virtual start address of region
> > + *  - end     - virtual end address of region
> > + */
> > +__dma_inv_range:
> > +    dcache_line_size x2, x3
> > +    sub x3, x2, #1
> > +    bic x0, x0, x3
> > +    bic x1, x1, x3
>
> Why is the 'end' value in x1 above rounded down to be cache aligned?
> This means the cache invalidate won't include the cache line containing
> the final bytes of the region, unless it happened to already be cache
> line aligned. This looks especially suspect as the other two cache
> operations added in the same patch (below) don't do that.

Cache invalidation is destructive, so we want to make sure that it
doesn't affect anything beyond x1. But you are right, if either end of
the buffer is not cache line aligned it can get it wrong. The fix is to
use clean+invalidate on the unaligned ends:

diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index c46f48b33c14..6a26bf1965d3 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -175,10 +175,17 @@ ENDPROC(__flush_dcache_area)
 __dma_inv_range:
     dcache_line_size x2, x3
     sub x3, x2, #1
-    bic x0, x0, x3
+    tst x1, x3                   // end cache line aligned?
     bic x1, x1, x3
-1:  dc ivac, x0                  // invalidate D / U line
-    add x0, x0, x2
+    b.eq 1f
+    dc civac, x1                 // clean & invalidate D / U line
+1:  tst x0, x3                   // start cache line aligned?
+    bic x0, x0, x3
+    b.eq 2f
+    dc civac, x0                 // clean & invalidate D / U line
+    b 3f
+2:  dc ivac, x0                  // invalidate D / U line
+3:  add x0, x0, x2
     cmp x0, x1
     b.lo 1b
     dsb sy

-- 
Catalin

^ permalink raw reply related	[flat|nested] 11+ messages in thread
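For readers not fluent in arm64 assembly, the intent of the fix can be sketched in C as follows; dc_ivac() and dc_civac() are hypothetical stand-ins for the "dc ivac" and "dc civac" instructions, not real kernel helpers, and this is a sketch of the intended flow rather than the actual implementation:

#include <stdint.h>

/* Hypothetical stand-ins for the "dc ivac" / "dc civac" instructions. */
void dc_ivac(uintptr_t line_addr);   /* invalidate D/U line */
void dc_civac(uintptr_t line_addr);  /* clean & invalidate D/U line */

/*
 * Sketch of the intended flow of the fixed __dma_inv_range: destructive
 * invalidation is only applied to lines wholly inside [start, end); a
 * partially covered line at either end is cleaned & invalidated so the
 * neighbouring data is written back rather than discarded.
 */
static void dma_inv_range(uintptr_t start, uintptr_t end, uintptr_t line)
{
    uintptr_t mask = line - 1;
    uintptr_t p = start & ~mask;
    uintptr_t last = end & ~mask;

    if (end & mask)         /* unaligned tail */
        dc_civac(last);
    if (start & mask) {     /* unaligned head */
        dc_civac(p);
        p += line;
    }
    for (; p < last; p += line)
        dc_ivac(p);         /* fully covered lines can simply be discarded */
    /* the assembly version ends with "dsb sy" to complete the maintenance */
}

The follow-ups below pick up the remaining questions: the loop's branch target and whether the Cache Writeback Granule makes even this insufficient.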
* Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15)
2014-04-01 17:29 ` Catalin Marinas
@ 2014-04-02 8:52 ` Jon Medhurst (Tixy)
2014-04-02 9:20 ` Catalin Marinas
0 siblings, 1 reply; 11+ messages in thread
From: Jon Medhurst (Tixy) @ 2014-04-02 8:52 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, 2014-04-01 at 18:29 +0100, Catalin Marinas wrote:
> On Tue, Apr 01, 2014 at 05:10:57PM +0100, Jon Medhurst (Tixy) wrote:
> > On Mon, 2014-03-31 at 18:52 +0100, Catalin Marinas wrote:
> > > The following changes since commit cfbf8d4857c26a8a307fb7cd258074c9dcd8c691:
> > >
> > >   Linux 3.14-rc4 (2014-02-23 17:40:03 -0800)
> > >
> > > are available in the git repository at:
> > >
> > >   git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-upstream
> > >
> > > for you to fetch changes up to 196adf2f3015eacac0567278ba538e3ffdd16d0e:
> > >
> > >   arm64: Remove pgprot_dmacoherent() (2014-03-24 10:35:35 +0000)
> >
> > I may have spotted a bug in commit 7363590d2c46 (arm64: Implement
> > coherent DMA API based on swiotlb), see my inline comment below...
> >
> > [...]
> > > diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> > > index 1ea9f26..97fcef5 100644
> > > --- a/arch/arm64/mm/cache.S
> > > +++ b/arch/arm64/mm/cache.S
> > > @@ -166,3 +166,81 @@ ENTRY(__flush_dcache_area)
> > >      dsb sy
> > >      ret
> > >  ENDPROC(__flush_dcache_area)
> > > +
> > > +/*
> > > + *  __dma_inv_range(start, end)
> > > + *  - start   - virtual start address of region
> > > + *  - end     - virtual end address of region
> > > + */
> > > +__dma_inv_range:
> > > +    dcache_line_size x2, x3
> > > +    sub x3, x2, #1
> > > +    bic x0, x0, x3
> > > +    bic x1, x1, x3
> >
> > Why is the 'end' value in x1 above rounded down to be cache aligned?
> > This means the cache invalidate won't include the cache line containing
> > the final bytes of the region, unless it happened to already be cache
> > line aligned. This looks especially suspect as the other two cache
> > operations added in the same patch (below) don't do that.
>
> Cache invalidation is destructive, so we want to make sure that it
> doesn't affect anything beyond x1. But you are right, if either end of
> the buffer is not cache line aligned it can get it wrong. The fix is to
> use clean+invalidate on the unaligned ends:

Like the ARMv7 implementation does :-) However, I wonder, is it possible
for the Cache Writeback Granule (CWG) to come into play? If the CWG of
further out caches was bigger than closer (to CPU) caches then it would
cause data corruption. So for these region ends, should we not be using
the CWG size, not the minimum D cache line size? On second thoughts,
that wouldn't be safe either in the converse case where the CWG of a
closer cache was bigger. So we would need to first use minimum cache
line size to clean a CWG sized region, then invalidate cache lines by
the same method. But then that leaves a time period where a write can
happen between the clean and the invalidate, again leading to data
corruption. I hope all this means I've either got rather confused or
that the cache architectures are smart enough to automatically cope.

I also have a couple of comments on the specific changes below...

>
> diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> index c46f48b33c14..6a26bf1965d3 100644
> --- a/arch/arm64/mm/cache.S
> +++ b/arch/arm64/mm/cache.S
> @@ -175,10 +175,17 @@ ENDPROC(__flush_dcache_area)
>  __dma_inv_range:
>      dcache_line_size x2, x3
>      sub x3, x2, #1
> -    bic x0, x0, x3
> +    tst x1, x3                   // end cache line aligned?
>      bic x1, x1, x3
> -1:  dc ivac, x0                  // invalidate D / U line
> -    add x0, x0, x2
> +    b.eq 1f
> +    dc civac, x1                 // clean & invalidate D / U line

That is actually cleaning the address one byte past the end of the
region, not sure it matters though because it is still within the same
minimum cache line sized region.

> +1:  tst x0, x3                   // start cache line aligned?
> +    bic x0, x0, x3
> +    b.eq 2f
> +    dc civac, x0                 // clean & invalidate D / U line
> +    b 3f
> +2:  dc ivac, x0                  // invalidate D / U line
> +3:  add x0, x0, x2
>      cmp x0, x1
>      b.lo 1b

The above obviously also needs changing to branch to 3b

>      dsb sy
>

-- 
Tixy

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15)
2014-04-02 8:52 ` Jon Medhurst (Tixy)
@ 2014-04-02 9:20 ` Catalin Marinas
2014-04-02 9:40 ` Russell King - ARM Linux
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Catalin Marinas @ 2014-04-02 9:20 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Apr 02, 2014 at 09:52:02AM +0100, Jon Medhurst (Tixy) wrote:
> On Tue, 2014-04-01 at 18:29 +0100, Catalin Marinas wrote:
> > On Tue, Apr 01, 2014 at 05:10:57PM +0100, Jon Medhurst (Tixy) wrote:
> > > On Mon, 2014-03-31 at 18:52 +0100, Catalin Marinas wrote:
> > > > +__dma_inv_range:
> > > > +    dcache_line_size x2, x3
> > > > +    sub x3, x2, #1
> > > > +    bic x0, x0, x3
> > > > +    bic x1, x1, x3
> > >
> > > Why is the 'end' value in x1 above rounded down to be cache aligned?
> > > This means the cache invalidate won't include the cache line containing
> > > the final bytes of the region, unless it happened to already be cache
> > > line aligned. This looks especially suspect as the other two cache
> > > operations added in the same patch (below) don't do that.
> >
> > Cache invalidation is destructive, so we want to make sure that it
> > doesn't affect anything beyond x1. But you are right, if either end of
> > the buffer is not cache line aligned it can get it wrong. The fix is to
> > use clean+invalidate on the unaligned ends:
>
> Like the ARMv7 implementation does :-) However, I wonder, is it possible
> for the Cache Writeback Granule (CWG) to come into play? If the CWG of
> further out caches was bigger than closer (to CPU) caches then it would
> cause data corruption. So for these region ends, should we not be using
> the CWG size, not the minimum D cache line size? On second thoughts,
> that wouldn't be safe either in the converse case where the CWG of a
> closer cache was bigger. So we would need to first use minimum cache
> line size to clean a CWG sized region, then invalidate cache lines by
> the same method.

CWG gives us the maximum size (of all cache levels in the system, even
on a different CPU for example in big.LITTLE configurations) that would
be evicted by the cache operation. So we need small loops of Dmin size
that go over the bigger CWG (and that's guaranteed to be at least Dmin).

> But then that leaves a time period where a write can
> happen between the clean and the invalidate, again leading to data
> corruption. I hope all this means I've either got rather confused or
> that the cache architectures are smart enough to automatically cope.

You are right. I think having unaligned DMA buffers for inbound
transfers is pointless. We can avoid losing data written by another CPU
in the same cache line but, depending on the stage of the DMA transfer,
it can corrupt the DMA data.

I wonder whether it's easier to define the cache_line_size() macro to
read CWG and assume that the DMA buffers are always aligned, ignoring
the invalidation of the unaligned boundaries. This wouldn't be much
different from your scenario where the shared cache line is written
(just less likely to trigger but still a bug, so I would rather notice
this early).

The ARMv7 code has a similar issue, it performs clean&invalidate on the
unaligned start but it doesn't move r0, so it goes into the main loop
invalidating the same cache line again. If it was written by something
else, the information would be lost.

> I also have a couple of comments on the specific changes below...
>
> > diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> > index c46f48b33c14..6a26bf1965d3 100644
> > --- a/arch/arm64/mm/cache.S
> > +++ b/arch/arm64/mm/cache.S
> > @@ -175,10 +175,17 @@ ENDPROC(__flush_dcache_area)
> >  __dma_inv_range:
> >      dcache_line_size x2, x3
> >      sub x3, x2, #1
> > -    bic x0, x0, x3
> > +    tst x1, x3                   // end cache line aligned?
> >      bic x1, x1, x3
> > -1:  dc ivac, x0                  // invalidate D / U line
> > -    add x0, x0, x2
> > +    b.eq 1f
> > +    dc civac, x1                 // clean & invalidate D / U line
>
> That is actually cleaning the address one byte past the end of the
> region, not sure it matters though because it is still within the same
> minimum cache line sized region.

It shouldn't, there is a "bic x1, x1, x3" above and this dc only
happens if the address was unaligned.

> > +1:  tst x0, x3                   // start cache line aligned?
> > +    bic x0, x0, x3
> > +    b.eq 2f
> > +    dc civac, x0                 // clean & invalidate D / U line
> > +    b 3f
> > +2:  dc ivac, x0                  // invalidate D / U line
> > +3:  add x0, x0, x2
> >      cmp x0, x1
> >      b.lo 1b
>
> The above obviously also needs changing to branch to 3b

Good point. (but I'm no longer convinced we need the hassle above ;))

-- 
Catalin

^ permalink raw reply	[flat|nested] 11+ messages in thread
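For reference, both quantities discussed here come from the CTR_EL0 register. A sketch (user-space C with inline assembly; field positions per the ARMv8 ARM, and it assumes EL0 access to CTR_EL0 is not trapped by the kernel) of how they could be decoded:

#include <stdint.h>

/*
 * Sketch only (not kernel code): decode the two CTR_EL0 fields being
 * discussed. DminLine is bits [19:16], CWG is bits [27:24]; both encode
 * log2 of a number of 4-byte words.
 */
static inline uint64_t read_ctr_el0(void)
{
    uint64_t ctr;

    __asm__ volatile("mrs %0, ctr_el0" : "=r" (ctr));
    return ctr;
}

/* Smallest D-cache line in the system: the only safe stride for dc loops. */
static inline unsigned int dmin_line_bytes(void)
{
    return 4U << ((read_ctr_el0() >> 16) & 0xf);
}

/* Cache writeback granule: worst-case memory size affected by one eviction. */
static inline unsigned int cwg_bytes(void)
{
    unsigned int cwg = (read_ctr_el0() >> 24) & 0xf;

    return cwg ? 4U << cwg : 2048;  /* 0 means "not provided"; assume the 2K architectural maximum */
}

The dcache_line_size assembler macro used by the routines above derives its stride from DminLine in the same way; using CWG as the stride instead is exactly what the follow-up below argues against.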
* Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15)
2014-04-02 9:20 ` Catalin Marinas
@ 2014-04-02 9:40 ` Russell King - ARM Linux
2014-04-02 11:13 ` Catalin Marinas
0 siblings, 1 reply; 11+ messages in thread
From: Russell King - ARM Linux @ 2014-04-02 9:40 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Apr 02, 2014 at 10:20:32AM +0100, Catalin Marinas wrote:
> You are right. I think having unaligned DMA buffers for inbound
> transfers is pointless. We can avoid losing data written by another CPU
> in the same cache line but, depending on the stage of the DMA transfer,
> it can corrupt the DMA data.
>
> I wonder whether it's easier to define the cache_line_size() macro to
> read CWG and assume that the DMA buffers are always aligned, ignoring
> the invalidation of the unaligned boundaries. This wouldn't be much
> different from your scenario where the shared cache line is written
> (just less likely to trigger but still a bug, so I would rather notice
> this early).
>
> The ARMv7 code has a similar issue, it performs clean&invalidate on the
> unaligned start but it doesn't move r0, so it goes into the main loop
> invalidating the same cache line again. If it was written by something
> else, the information would be lost.

You can't make that a requirement. People have shared stuff across a
cache line for years in Linux, and people have brought it up and tried
to fix it, but there's much resistance against it. In particular is
SCSI, which submits the sense buffer as part of a larger structure (the
host.) SCSI sort-of guarantees that the surrounding struct members
won't be touched, but their data has to be preserved.

In any case, remember that there are strict rules about ownership of the
DMA memory vs calls to the DMA API. It is invalid to call the DMA
streaming API functions while a DMA transfer is active.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 11+ messages in thread
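As a reminder of the ownership rule referred to above, a hedged sketch of a streaming mapping for an inbound transfer; dev, buf, len and the start_transfer()/wait_for_dma_done() helpers are placeholders for driver-specific code, only the dma_* calls are the real API:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Hypothetical driver-specific helpers, declared here only for the sketch. */
extern void start_transfer(struct device *dev, dma_addr_t addr, size_t len);
extern void wait_for_dma_done(struct device *dev);

/*
 * Between dma_map_single() and dma_unmap_single() the buffer belongs to
 * the device; the CPU must not touch it and no further streaming-API calls
 * may be made on it while the transfer is in flight.
 */
static int rx_one_buffer(struct device *dev, void *buf, size_t len)
{
    dma_addr_t addr;

    /* CPU -> device handover; any needed cache maintenance happens here */
    addr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, addr))
        return -ENOMEM;

    start_transfer(dev, addr, len);     /* device owns buf: no CPU access */
    wait_for_dma_done(dev);

    /* device -> CPU handover; only now may the CPU read the new data */
    dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);
    return 0;
}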
* Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15)
2014-04-02 9:40 ` Russell King - ARM Linux
@ 2014-04-02 11:13 ` Catalin Marinas
0 siblings, 0 replies; 11+ messages in thread
From: Catalin Marinas @ 2014-04-02 11:13 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Apr 02, 2014 at 10:40:45AM +0100, Russell King - ARM Linux wrote:
> On Wed, Apr 02, 2014 at 10:20:32AM +0100, Catalin Marinas wrote:
> > You are right. I think having unaligned DMA buffers for inbound
> > transfers is pointless. We can avoid losing data written by another CPU
> > in the same cache line but, depending on the stage of the DMA transfer,
> > it can corrupt the DMA data.
> >
> > I wonder whether it's easier to define the cache_line_size() macro to
> > read CWG and assume that the DMA buffers are always aligned, ignoring
> > the invalidation of the unaligned boundaries. This wouldn't be much
> > different from your scenario where the shared cache line is written
> > (just less likely to trigger but still a bug, so I would rather notice
> > this early).
> >
> > The ARMv7 code has a similar issue, it performs clean&invalidate on the
> > unaligned start but it doesn't move r0, so it goes into the main loop
> > invalidating the same cache line again. If it was written by something
> > else, the information would be lost.
>
> You can't make that a requirement. People have shared stuff across a
> cache line for years in Linux, and people have brought it up and tried
> to fix it, but there's much resistance against it. In particular is
> SCSI, which submits the sense buffer as part of a larger structure (the
> host.) SCSI sort-of guarantees that the surrounding struct members
> won't be touched, but their data has to be preserved.

Let's hope that CWG stays small enough on real hardware (as the
architecture specifies it to max 2K).

> In any case, remember that there are strict rules about ownership of the
> DMA memory vs calls to the DMA API. It is invalid to call the DMA
> streaming API functions while a DMA transfer is active.

Yes, I was referring to non-DMA buffer area in the same cache line being
touched during a DMA transfer.

-- 
Catalin

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb"
2014-04-02 9:20 ` Catalin Marinas
2014-04-02 9:40 ` Russell King - ARM Linux
@ 2014-04-02 10:41 ` Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" Jon Medhurst (Tixy)
2014-04-02 11:37 ` Catalin Marinas
0 siblings, 1 reply; 11+ messages in thread
From: Jon Medhurst (Tixy) @ 2014-04-02 10:41 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, 2014-04-02 at 10:20 +0100, Catalin Marinas wrote:
> On Wed, Apr 02, 2014 at 09:52:02AM +0100, Jon Medhurst (Tixy) wrote:
> > On Tue, 2014-04-01 at 18:29 +0100, Catalin Marinas wrote:
> > > On Tue, Apr 01, 2014 at 05:10:57PM +0100, Jon Medhurst (Tixy) wrote:
> > > > On Mon, 2014-03-31 at 18:52 +0100, Catalin Marinas wrote:
> > > > > +__dma_inv_range:
> > > > > +    dcache_line_size x2, x3
> > > > > +    sub x3, x2, #1
> > > > > +    bic x0, x0, x3
> > > > > +    bic x1, x1, x3
> > > >
> > > > Why is the 'end' value in x1 above rounded down to be cache aligned?
> > > > This means the cache invalidate won't include the cache line containing
> > > > the final bytes of the region, unless it happened to already be cache
> > > > line aligned. This looks especially suspect as the other two cache
> > > > operations added in the same patch (below) don't do that.
> > >
> > > Cache invalidation is destructive, so we want to make sure that it
> > > doesn't affect anything beyond x1. But you are right, if either end of
> > > the buffer is not cache line aligned it can get it wrong. The fix is to
> > > use clean+invalidate on the unaligned ends:
> >
> > Like the ARMv7 implementation does :-) However, I wonder, is it possible
> > for the Cache Writeback Granule (CWG) to come into play? If the CWG of
> > further out caches was bigger than closer (to CPU) caches then it would
> > cause data corruption. So for these region ends, should we not be using
> > the CWG size, not the minimum D cache line size? On second thoughts,
> > that wouldn't be safe either in the converse case where the CWG of a
> > closer cache was bigger. So we would need to first use minimum cache
> > line size to clean a CWG sized region, then invalidate cache lines by
> > the same method.
>
> CWG gives us the maximum size (of all cache levels in the system, even
> on a different CPU for example in big.LITTLE configurations) that would
> be evicted by the cache operation. So we need small loops of Dmin size
> that go over the bigger CWG (and that's guaranteed to be at least Dmin).

Yes, that's what I was getting at.

>
> > But then that leaves a time period where a write can
> > happen between the clean and the invalidate, again leading to data
> > corruption. I hope all this means I've either got rather confused or
> > that the cache architectures are smart enough to automatically cope.
>
> You are right. I think having unaligned DMA buffers for inbound
> transfers is pointless. We can avoid losing data written by another CPU
> in the same cache line but, depending on the stage of the DMA transfer,
> it can corrupt the DMA data.
>
> I wonder whether it's easier to define the cache_line_size() macro to
> read CWG

That won't work, the stride of cache operations needs to be the
_minimum_ cache line size, otherwise we might skip over some cache lines
and not flush them. (We've been hit before by bugs caused by the fact
that big.LITTLE systems report different minimum i-cache line sizes
depending on whether you execute on the big or LITTLE cores [1], we need
the 'real' minimum otherwise things go horribly wrong.)

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-February/149950.html

> and assume that the DMA buffers are always aligned,

We can't assume the region in any particular DMA transfer is cache
aligned, but I agree, that if multiple actors were operating on adjacent
memory locations in the same cache line, without implementing their own
coordination then there's nothing the low level DMA code can do to avoid
data corruption from cache cleaning.

We at least need to make sure that the memory allocation functions used
for DMA buffers return regions of whole CWG size, to avoid unrelated
buffers corrupting each other. If I have correctly read
__dma_alloc_noncoherent and the functions it calls, then it looks like
buffers are actually whole pages, so that's not a problem.

> ignoring
> the invalidation of the unaligned boundaries. This wouldn't be much
> different from your scenario where the shared cache line is written
> (just less likely to trigger but still a bug, so I would rather notice
> this early).
>
> The ARMv7 code has a similar issue, it performs clean&invalidate on the
> unaligned start but it doesn't move r0, so it goes into the main loop
> invalidating the same cache line again.

Yes, and as it's missing a dsb could also lead to the wrong behaviour if
the invalidate was reordered to execute prior to the clean+invalidate on
the same line. I just dug into git history to see if I could find a clue
as to how the v7 code came to look like it does, but I see that it's
been like that since the day it was submitted in 2007, by a certain
Catalin Marinas ;-)

-- 
Tixy

^ permalink raw reply	[flat|nested] 11+ messages in thread
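A small illustration of the stride argument above (made-up addresses and sizes): if the maintenance loop stepped by a 128-byte CWG while some level of cache used 64-byte lines, half of those lines would never be named by a "dc" operation at all.

#include <stdio.h>

/*
 * Illustration only. Each "dc" operation acts on the line containing the
 * given address at every cache level, so lines whose addresses are never
 * generated by the loop receive no maintenance at that level.
 */
int main(void)
{
    const unsigned long start = 0x1000, end = 0x1200;
    const unsigned long cwg_stride = 128;   /* hypothetical, too-large stride */

    for (unsigned long p = start; p < end; p += cwg_stride)
        printf("dc issued for address %#lx\n", p);

    /* With 64-byte lines, the lines at 0x1040, 0x10c0, 0x1140 and 0x11c0
     * are never touched: the "skip over some cache lines" problem. */
    return 0;
}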
* Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb"
2014-04-02 10:41 ` Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" Jon Medhurst (Tixy)
@ 2014-04-02 11:37 ` Catalin Marinas
0 siblings, 0 replies; 11+ messages in thread
From: Catalin Marinas @ 2014-04-02 11:37 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Apr 02, 2014 at 11:41:30AM +0100, Jon Medhurst (Tixy) wrote:
> On Wed, 2014-04-02 at 10:20 +0100, Catalin Marinas wrote:
> > On Wed, Apr 02, 2014 at 09:52:02AM +0100, Jon Medhurst (Tixy) wrote:
> > > But then that leaves a time period where a write can
> > > happen between the clean and the invalidate, again leading to data
> > > corruption. I hope all this means I've either got rather confused or
> > > that the cache architectures are smart enough to automatically cope.
> >
> > You are right. I think having unaligned DMA buffers for inbound
> > transfers is pointless. We can avoid losing data written by another CPU
> > in the same cache line but, depending on the stage of the DMA transfer,
> > it can corrupt the DMA data.
> >
> > I wonder whether it's easier to define the cache_line_size() macro to
> > read CWG
>
> That won't work, the stride of cache operations needs to be the
> _minimum_ cache line size, otherwise we might skip over some cache lines
> and not flush them. (We've been hit before by bugs caused by the fact
> that big.LITTLE systems report different minimum i-cache line sizes
> depending on whether you execute on the big or LITTLE cores [1], we need
> the 'real' minimum otherwise things go horribly wrong.)
>
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-February/149950.html

Yes, I remember this. CWG should also be the same in a big.LITTLE system.

> > and assume that the DMA buffers are always aligned,
>
> We can't assume the region in any particular DMA transfer is cache
> aligned, but I agree, that if multiple actors were operating on adjacent
> memory locations in the same cache line, without implementing their own
> coordination then there's nothing the low level DMA code can do to avoid
> data corruption from cache cleaning.
>
> We at least need to make sure that the memory allocation functions used
> for DMA buffers return regions of whole CWG size, to avoid unrelated
> buffers corrupting each other. If I have correctly read
> __dma_alloc_noncoherent and the functions it calls, then it looks like
> buffers are actually whole pages, so that's not a problem.

It's not about dma_alloc but the streaming DMA API like dma_map_sg().

> > ignoring
> > the invalidation of the unaligned boundaries. This wouldn't be much
> > different from your scenario where the shared cache line is written
> > (just less likely to trigger but still a bug, so I would rather notice
> > this early).
> >
> > The ARMv7 code has a similar issue, it performs clean&invalidate on the
> > unaligned start but it doesn't move r0, so it goes into the main loop
> > invalidating the same cache line again.
>
> Yes, and as it's missing a dsb could also lead to the wrong behaviour if
> the invalidate was reordered to execute prior to the clean+invalidate on
> the same line. I just dug into git history to see if I could find a clue
> as to how the v7 code came to look like it does, but I see that it's
> been like that since the day it was submitted in 2007, by a certain
> Catalin Marinas ;-)

I don't remember ;). But there are some rules about reordering of cache
line operations by MVA with regards to memory accesses. I have to check
whether they apply to other d-cache maintenance to the same address as
well. I'll try to come up with another patch using CWG.

-- 
Catalin

^ permalink raw reply	[flat|nested] 11+ messages in thread
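To make the dma_alloc vs streaming distinction above concrete, a sketch of the dma_map_sg() case (the device pointer and the 190-byte length are arbitrary placeholders): the buffer is an ordinary kmalloc() allocation chosen by the caller, so the DMA layer has no control over where it starts or ends within a cache writeback granule and the cache maintenance has to cope with whatever alignment it gets.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/* Sketch only: a streaming mapping of a small, possibly unaligned buffer. */
static int map_unaligned_buffer(struct device *dev)
{
    struct scatterlist sg;
    void *buf = kmalloc(190, GFP_KERNEL);   /* not page sized, possibly not CWG aligned */

    if (!buf)
        return -ENOMEM;

    sg_init_one(&sg, buf, 190);
    if (!dma_map_sg(dev, &sg, 1, DMA_FROM_DEVICE)) {
        kfree(buf);
        return -EIO;
    }

    /* ... run the transfer, then hand the buffer back to the CPU ... */
    dma_unmap_sg(dev, &sg, 1, DMA_FROM_DEVICE);
    kfree(buf);
    return 0;
}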
* Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" (was Re: [GIT PULL] arm64 patches for 3.15)
2014-04-02 9:20 ` Catalin Marinas
2014-04-02 9:40 ` Russell King - ARM Linux
2014-04-02 10:41 ` Bug(?) in patch "arm64: Implement coherent DMA API based on swiotlb" Jon Medhurst (Tixy)
@ 2014-04-02 10:54 ` Jon Medhurst (Tixy)
2 siblings, 0 replies; 11+ messages in thread
From: Jon Medhurst (Tixy) @ 2014-04-02 10:54 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, 2014-04-02 at 10:20 +0100, Catalin Marinas wrote:
> On Wed, Apr 02, 2014 at 09:52:02AM +0100, Jon Medhurst (Tixy) wrote:
> > On Tue, 2014-04-01 at 18:29 +0100, Catalin Marinas wrote:
> > > +1:  tst x0, x3                   // start cache line aligned?
> > > +    bic x0, x0, x3
> > > +    b.eq 2f
> > > +    dc civac, x0                 // clean & invalidate D / U line
> > > +    b 3f
> > > +2:  dc ivac, x0                  // invalidate D / U line
> > > +3:  add x0, x0, x2
> > >      cmp x0, x1
> > >      b.lo 1b
> >
> > The above obviously also needs changing to branch to 3b
>
> Good point.

Actually, it should be 2b :-)

-- 
Tixy

^ permalink raw reply	[flat|nested] 11+ messages in thread
* [GIT PULL] arm64 patches for 3.15
@ 2014-04-08 17:37 Catalin Marinas
0 siblings, 0 replies; 11+ messages in thread
From: Catalin Marinas @ 2014-04-08 17:37 UTC (permalink / raw)
To: linux-arm-kernel
Hi Linus,
The following changes since commit 196adf2f3015eacac0567278ba538e3ffdd16d0e:
arm64: Remove pgprot_dmacoherent() (2014-03-24 10:35:35 +0000)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-upstream
for you to fetch changes up to ebf81a938dade3b450eb11c57fa744cfac4b523f:
arm64: Fix DMA range invalidation for cache line unaligned buffers (2014-04-08 11:45:08 +0100)
A second pull request for this merge window, mainly with fixes and
docs clarification. As I haven't rebased my tree, you'll get a conflict
with latest mainline in arch/arm64/kernel/head.S. The fix-up is below.
Thanks.
diff --cc arch/arm64/kernel/head.S
index 1fe5d8d2bdfd,26109682d2fa..0fd565000772
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@@ -461,12 -476,26 +476,23 @@@ __create_page_tables
sub x6, x6, #1 // inclusive range
create_block_map x0, x7, x3, x5, x6
1:
-#ifdef CONFIG_EARLY_PRINTK
/*
- * Create the pgd entry for the UART mapping. The full mapping is done
- * later based earlyprintk kernel parameter.
+ * Create the pgd entry for the fixed mappings.
*/
- ldr x5, =EARLYCON_IOBASE // UART virtual address
+ ldr x5, =FIXADDR_TOP // Fixed mapping virtual address
add x0, x26, #2 * PAGE_SIZE // section table address
create_pgd_entry x26, x0, x5, x6, x7
-#endif
+
+ /*
+ * Since the page tables have been populated with non-cacheable
+ * accesses (MMU disabled), invalidate the idmap and swapper page
+ * tables again to remove any speculatively loaded cache lines.
+ */
+ mov x0, x25
+ add x1, x26, #SWAPPER_DIR_SIZE
+ bl __inval_cache_range
+
+ mov lr, x27
ret
ENDPROC(__create_page_tables)
.ltorg
----------------------------------------------------------------
- Documentation clarification on CPU topology and booting requirements
- Additional cache flushing during boot (needed in the presence of
external caches or under virtualisation)
- DMA range invalidation fix for non cache line aligned buffers
- Build failure fix with !COMPAT
- Kconfig update for STRICT_DEVMEM
----------------------------------------------------------------
Catalin Marinas (4):
arm64: Update the TCR_EL1 translation granule definitions for 16K pages
arm64: Relax the kernel cache requirements for boot
Revert "arm64: virt: ensure visibility of __boot_cpu_mode"
arm64: Fix DMA range invalidation for cache line unaligned buffers
Laura Abbott (1):
arm64: Add missing Kconfig for CONFIG_STRICT_DEVMEM
Mark Brown (1):
ARM: topology: Make it clear that all CPUs need to be described
Mark Salter (1):
arm64: fix !CONFIG_COMPAT build failures
Documentation/arm64/booting.txt | 10 ++++++--
Documentation/devicetree/bindings/arm/topology.txt | 7 ++---
arch/arm64/Kconfig.debug | 14 ++++++++++
arch/arm64/include/asm/pgtable-hwdef.h | 6 ++++-
arch/arm64/include/asm/virt.h | 13 ----------
arch/arm64/kernel/head.S | 30 ++++++++++++++++++++--
arch/arm64/kernel/perf_event.c | 4 +++
arch/arm64/kernel/perf_regs.c | 2 ++
arch/arm64/mm/cache.S | 24 ++++++++++++++---
arch/arm64/mm/proc.S | 25 ++++++++++--------
10 files changed, 99 insertions(+), 36 deletions(-)
--
Catalin
^ permalink raw reply	[flat|nested] 11+ messages in thread