* [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups
@ 2013-01-15 16:48 Dave Martin
2013-01-15 16:48 ` [RFC PATCH 1/4] ARM: b.L: Remove C declarations for vlocks Dave Martin
` (4 more replies)
0 siblings, 5 replies; 15+ messages in thread
From: Dave Martin @ 2013-01-15 16:48 UTC (permalink / raw)
To: linux-arm-kernel
After much head-scratching and discussion, I have concluded that we
need comprehensive memory barriers in order to ensure that the
low-level synchronisation code executes robustly on all platforms.
DSBs are excessive in most situations though, so many DSBs can be
replaced with DMBs.
As was observed in review, providing a C interface to the vlocks
makes little sense, so this series gets rid of it.
Dave Martin (4):
ARM: b.L: Remove C declarations for vlocks
ARM: b.L: vlocks: Add architecturally required memory barriers
ARM: bL_entry: Match memory barriers to architectural requirements
ARM: vexpress/dcscb: power_up_setup memory barrier cleanup
arch/arm/common/bL_head.S | 40 ++++++++-----------------------
arch/arm/common/vlock.S | 7 ++++-
arch/arm/common/vlock.h | 43 ----------------------------------
arch/arm/mach-vexpress/dcscb_setup.S | 5 +--
4 files changed, 18 insertions(+), 77 deletions(-)
delete mode 100644 arch/arm/common/vlock.h
--
1.7.4.1
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC PATCH 1/4] ARM: b.L: Remove C declarations for vlocks
2013-01-15 16:48 [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Dave Martin
@ 2013-01-15 16:48 ` Dave Martin
2013-01-15 16:48 ` [RFC PATCH 2/4] ARM: b.L: vlocks: Add architecturally required memory barriers Dave Martin
` (3 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Dave Martin @ 2013-01-15 16:48 UTC (permalink / raw)
To: linux-arm-kernel
Vlocks are only useful in very specialised situations, so the
presence of a C header file is more confusing than helpful. All
current use of vlocks, including allocation of space for the lock
structures themselves, is done in assembler.
This patch removes the C interface, so that people are not tempted
to use vlocks when they shouldn't.
Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
arch/arm/common/vlock.h | 43 -------------------------------------------
1 files changed, 0 insertions(+), 43 deletions(-)
delete mode 100644 arch/arm/common/vlock.h
diff --git a/arch/arm/common/vlock.h b/arch/arm/common/vlock.h
deleted file mode 100644
index 94c29a6..0000000
--- a/arch/arm/common/vlock.h
+++ /dev/null
@@ -1,43 +0,0 @@
-/*
- * vlock.h - simple voting lock implementation
- *
- * Created by: Dave Martin, 2012-08-16
- * Copyright: (C) 2012 Linaro Limited
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, write to the Free Software Foundation, Inc.,
- * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
- */
-
-#ifndef __VLOCK_H
-#define __VLOCK_H
-
-#include <asm/bL_entry.h>
-
-#define VLOCK_OWNER_OFFSET 0
-#define VLOCK_VOTING_OFFSET 4
-#define VLOCK_VOTING_SIZE ((BL_CPUS_PER_CLUSTER + 3) / 4 * 4)
-#define VLOCK_SIZE (VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE)
-#define VLOCK_OWNER_NONE 0
-
-#ifndef __ASSEMBLY__
-
-struct vlock {
- char data[VLOCK_SIZE];
-};
-
-int vlock_trylock(struct vlock *lock, unsigned int owner);
-void vlock_unlock(struct vlock *lock);
-
-#endif /* __ASSEMBLY__ */
-#endif /* ! __VLOCK_H */
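As an aside on the sizing arithmetic being deleted: VLOCK_VOTING_SIZE rounds
the per-cluster voting byte-array up to a multiple of 4. A quick C sketch of
that expression (the BL_CPUS_PER_CLUSTER value of 4 is a hypothetical
stand-in here; the real value comes from <asm/bL_entry.h>):

```c
#include <assert.h>

/* Mirrors of the removed macros; BL_CPUS_PER_CLUSTER is a hypothetical
 * stand-in -- the real value comes from <asm/bL_entry.h>. */
#define BL_CPUS_PER_CLUSTER 4
#define VLOCK_OWNER_OFFSET  0
#define VLOCK_VOTING_OFFSET 4
#define VLOCK_VOTING_SIZE   ((BL_CPUS_PER_CLUSTER + 3) / 4 * 4)
#define VLOCK_SIZE          (VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE)

/* The same round-up-to-a-multiple-of-4 rule as a function, so the
 * behaviour for other CPU counts can be checked directly. */
static inline unsigned int voting_size(unsigned int ncpus)
{
    return (ncpus + 3) / 4 * 4;
}
```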
--
1.7.4.1
* [RFC PATCH 2/4] ARM: b.L: vlocks: Add architecturally required memory barriers
2013-01-15 16:48 [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Dave Martin
2013-01-15 16:48 ` [RFC PATCH 1/4] ARM: b.L: Remove C declarations for vlocks Dave Martin
@ 2013-01-15 16:48 ` Dave Martin
2013-01-15 16:48 ` [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements Dave Martin
` (2 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Dave Martin @ 2013-01-15 16:48 UTC (permalink / raw)
To: linux-arm-kernel
For architectural correctness even Strongly-Ordered memory accesses
require barriers in order to guarantee that multiple CPUs have a
coherent view of the ordering of memory accesses.
Whether or not this matters depends on hardware implementation
details of the memory system.
Since the purpose of this code is to provide a clean, generic
locking mechanism with no platform-specific dependencies the
barriers should be present to avoid unpleasant surprises on future
platforms.
This patch adds the required barriers.
Note:
* When taking the lock, we don't care about implicit background
memory operations and other signalling which may be pending,
because those are not part of the critical section anyway.
A DMB is sufficient to ensure correctly observed ordering of
the explicit memory accesses in vlock_trylock.
* No barrier is required after checking the election result,
because the result is determined by the store at
VLOCK_OWNER_OFFSET and is already globally observed due to the
barriers in voting_end. This means that global agreement on
the winner is guaranteed, even before the winner is known
locally.
* The magic to guarantee correct barrierless access to the vlocks
by aligning them in memory now makes no sense and is removed.
However, we must still ensure that these don't share a
cacheline with anything else.
Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
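For readers following the barrier placement, here is a hedged single-threaded
C11 sketch of the trylock path, with atomic_thread_fence standing in for each
DMB the patch touches. This is an illustration of the algorithm only, not the
kernel code: the spin waiting for other CPUs' votes to be withdrawn is elided,
and owner values are assumed nonzero so they cannot collide with
VLOCK_OWNER_NONE.

```c
#include <stdatomic.h>

#define NR_CPUS          4
#define VLOCK_OWNER_NONE 0

struct vlock {
    atomic_char owner;            /* VLOCK_OWNER_OFFSET */
    atomic_char voting[NR_CPUS];  /* VLOCK_VOTING_OFFSET */
};

void vlock_init(struct vlock *lock)
{
    atomic_store_explicit(&lock->owner, VLOCK_OWNER_NONE, memory_order_relaxed);
    for (int i = 0; i < NR_CPUS; i++)
        atomic_store_explicit(&lock->voting[i], 0, memory_order_relaxed);
}

/* Returns 0 if @owner won the lock, nonzero otherwise.
 * @owner must be nonzero and less than NR_CPUS in this sketch. */
int vlock_trylock(struct vlock *lock, unsigned int owner)
{
    /* voting_begin: raise my voting flag, then dmb */
    atomic_store_explicit(&lock->voting[owner], 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);

    if (atomic_load_explicit(&lock->owner, memory_order_relaxed)
            != VLOCK_OWNER_NONE) {
        /* voting_end: dmb, then lower my flag */
        atomic_thread_fence(memory_order_seq_cst);
        atomic_store_explicit(&lock->voting[owner], 0, memory_order_relaxed);
        return 1;  /* already owned: fail */
    }

    /* dmb, then submit my vote */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&lock->owner, (char)owner, memory_order_relaxed);

    /* voting_end: dmb, then lower my flag */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&lock->voting[owner], 0, memory_order_relaxed);

    /* ...the real code now spins until all voting flags are clear... */

    /* dmb, then check who won */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&lock->owner, memory_order_relaxed)
            != (char)owner;
}
```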
arch/arm/common/bL_head.S | 19 -------------------
arch/arm/common/vlock.S | 7 +++++--
2 files changed, 5 insertions(+), 21 deletions(-)
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index 314d4ae..fd71ff6 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -187,26 +187,7 @@ ENDPROC(bL_entry_point)
.bss
- @ Magic to size and align the first-man vlock structures
- @ so that each does not cross a 1KB boundary.
- @ We also must ensure that none of these shares a cacheline with
- @ any data which might be accessed through the cache.
-
- .equ .Log2, 0
- .rept 11
- .if (1 << .Log2) < VLOCK_SIZE
- .equ .Log2, .Log2 + 1
- .endif
- .endr
- .if .Log2 > 10
- .error "vlock struct is too large for guaranteed barrierless access ordering"
- .endif
- .equ .Lvlock_size, 1 << .Log2
-
- @ The presence of two .align directives here is deliberate: we must
- @ align to whichever of the two boundaries is larger:
.align __CACHE_WRITEBACK_ORDER
- .align .Log2
first_man_locks:
.rept BL_NR_CLUSTERS
.space .Lvlock_size
diff --git a/arch/arm/common/vlock.S b/arch/arm/common/vlock.S
index 0a1ee3a..f55744f 100644
--- a/arch/arm/common/vlock.S
+++ b/arch/arm/common/vlock.S
@@ -39,10 +39,11 @@
.macro voting_begin rbase:req, rcpu:req, rscratch:req
mov \rscratch, #1
strb \rscratch, [\rbase, \rcpu]
- dsb
+ dmb
.endm
.macro voting_end rbase:req, rcpu:req, rscratch:req
+ dmb
mov \rscratch, #0
strb \rscratch, [\rbase, \rcpu]
dsb
@@ -68,6 +69,7 @@ ENTRY(vlock_trylock)
cmp r2, #VLOCK_OWNER_NONE
bne trylock_fail @ fail if so
+ dmb
strb r1, [r0, #VLOCK_OWNER_OFFSET] @ submit my vote
voting_end r0, r1, r2
@@ -87,6 +89,7 @@ ENTRY(vlock_trylock)
@ Check who won:
+ dmb
ldrb r2, [r0, #VLOCK_OWNER_OFFSET]
eor r0, r1, r2 @ zero if I won, else nonzero
bx lr
@@ -99,8 +102,8 @@ ENDPROC(vlock_trylock)
@ r0: lock structure base
ENTRY(vlock_unlock)
+ dmb
mov r1, #VLOCK_OWNER_NONE
- dsb
strb r1, [r0, #VLOCK_OWNER_OFFSET]
dsb
sev
--
1.7.4.1
* [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements
2013-01-15 16:48 [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Dave Martin
2013-01-15 16:48 ` [RFC PATCH 1/4] ARM: b.L: Remove C declarations for vlocks Dave Martin
2013-01-15 16:48 ` [RFC PATCH 2/4] ARM: b.L: vlocks: Add architecturally required memory barriers Dave Martin
@ 2013-01-15 16:48 ` Dave Martin
2013-01-16 6:50 ` Santosh Shilimkar
2013-01-15 16:48 ` [RFC PATCH 4/4] ARM: vexpress/dcscb: power_up_setup memory barrier cleanup Dave Martin
2013-01-15 17:29 ` [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Nicolas Pitre
4 siblings, 1 reply; 15+ messages in thread
From: Dave Martin @ 2013-01-15 16:48 UTC (permalink / raw)
To: linux-arm-kernel
For architectural correctness even Strongly-Ordered memory accesses
require barriers in order to guarantee that multiple CPUs have a
coherent view of the ordering of memory accesses.
Virtually everything done by this early code is done via explicit
memory access only, so DSBs are seldom required. Existing barriers
are demoted to DMB, except where a DSB is needed to synchronise
non-memory signalling (i.e., before a SEV). If a particular
platform performs cache maintenance in its power_up_setup function,
it should force it to complete explicitly including a DSB, instead
of relying on the bL_head framework code to do it.
Some additional DMBs are added to ensure all the memory ordering
properties required by the race avoidance algorithm. DMBs are also
moved out of loops, and for clarity some are moved so that most
directly follow the memory operation which needs to be
synchronised.
The setting of a CPU's bL_entry_vectors[] entry is also required to
act as a synchronisation point, so a DMB is added after checking
that entry to ensure that other CPUs do not observe gated
operations leaking across the opening of the gate.
Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
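The gate described in the last paragraph can be sketched in C11, with
acquire/release ordering standing in for the DMBs around the
bL_entry_vectors[] access. The names here are illustrative, and the real
code spins with WFE in assembler rather than busy-waiting:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4

/* Hypothetical stand-in for bL_entry_vectors[]: one entry vector per CPU,
 * zero meaning "gate closed". */
static atomic_uintptr_t entry_vectors[NR_CPUS];

/* Open the gate for a CPU by publishing its entry point.  Release
 * ordering plays the role of the DMB before the store. */
void open_gate(unsigned int cpu, uintptr_t entry)
{
    atomic_store_explicit(&entry_vectors[cpu], entry, memory_order_release);
}

/* Wait at the gate.  The acquire load corresponds to the DMB added after
 * reading the vector, so gated operations cannot be observed leaking
 * across the opening of the gate. */
uintptr_t wait_at_gate(unsigned int cpu)
{
    uintptr_t entry;
    while ((entry = atomic_load_explicit(&entry_vectors[cpu],
                                         memory_order_acquire)) == 0)
        ;  /* the assembler uses WFE here instead of a busy spin */
    return entry;
}
```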
arch/arm/common/bL_head.S | 21 +++++++++++----------
1 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index fd71ff6..a4a20e5 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -87,8 +87,7 @@ ENTRY(bL_entry_point)
mov r5, #BL_SYNC_CPU_SIZE
mla r5, r9, r5, r8 @ r5 = bL_sync cpu address
strb r0, [r5]
-
- dsb
+ dmb
@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
@ state, because there is at least one active CPU (this CPU).
@@ -97,7 +96,7 @@ ENTRY(bL_entry_point)
mla r11, r0, r10, r11 @ r11 = cluster first man lock
mov r0, r11
mov r1, r9 @ cpu
- bl vlock_trylock
+ bl vlock_trylock @ implies DSB
cmp r0, #0 @ failed to get the lock?
bne cluster_setup_wait @ wait for cluster setup if so
@@ -115,11 +114,12 @@ cluster_setup:
@ Wait for any previously-pending cluster teardown operations to abort
@ or complete:
- dsb
- ldrb r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+ dmb
+0: ldrb r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
cmp r0, #CLUSTER_GOING_DOWN
wfeeq
- beq cluster_setup
+ beq 0b
+ dmb
@ If the outbound gave up before teardown started, skip cluster setup:
@@ -131,8 +131,8 @@ cluster_setup:
cmp r7, #0
mov r0, #1 @ second (cluster) affinity level
blxne r7 @ Call power_up_setup if defined
+ dmb
- dsb
mov r0, #CLUSTER_UP
strb r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
dsb
@@ -146,11 +146,11 @@ cluster_setup_leave:
@ In the contended case, non-first men wait here for cluster setup
@ to complete:
cluster_setup_wait:
- dsb
ldrb r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
cmp r0, #CLUSTER_UP
wfene
bne cluster_setup_wait
+ dmb
cluster_setup_complete:
@ If a platform-specific CPU setup hook is needed, it is
@@ -162,13 +162,14 @@ cluster_setup_complete:
@ Mark the CPU as up:
- dsb
+ dmb
mov r0, #CPU_UP
strb r0, [r5]
+ dmb
bL_entry_gated:
- dsb
ldr r5, [r6, r4, lsl #2] @ r5 = CPU entry vector
+ dmb
cmp r5, #0
wfeeq
beq bL_entry_gated
--
1.7.4.1
* [RFC PATCH 4/4] ARM: vexpress/dcscb: power_up_setup memory barrier cleanup
2013-01-15 16:48 [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Dave Martin
` (2 preceding siblings ...)
2013-01-15 16:48 ` [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements Dave Martin
@ 2013-01-15 16:48 ` Dave Martin
2013-01-15 17:29 ` [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Nicolas Pitre
4 siblings, 0 replies; 15+ messages in thread
From: Dave Martin @ 2013-01-15 16:48 UTC (permalink / raw)
To: linux-arm-kernel
The bL_head framework has sufficient barriers to ensure correct
globally observed ordering of explicit memory accesses done by the
power_up_setup function.
This patch removes the unnecessary DSB from dcscb_setup.S.
Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
arch/arm/mach-vexpress/dcscb_setup.S | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/arm/mach-vexpress/dcscb_setup.S b/arch/arm/mach-vexpress/dcscb_setup.S
index c75ee8c..e338be7 100644
--- a/arch/arm/mach-vexpress/dcscb_setup.S
+++ b/arch/arm/mach-vexpress/dcscb_setup.S
@@ -64,14 +64,13 @@ ENTRY(dcscb_power_up_setup)
ldr r3, =RTSM_CCI_PHYS_BASE
- b 1f
-0: dsb
1: ldr r0, [r3, #CCI_STATUS_OFFSET]
tst r0, #STATUS_CHANGE_PENDING
- bne 0b
+ bne 1b
2: @ Implementation-specific local CPU setup operations should go here,
@ if any. In this case, there is nothing to do.
bx lr
+
ENDPROC(dcscb_power_up_setup)
--
1.7.4.1
* [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups
2013-01-15 16:48 [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Dave Martin
` (3 preceding siblings ...)
2013-01-15 16:48 ` [RFC PATCH 4/4] ARM: vexpress/dcscb: power_up_setup memory barrier cleanup Dave Martin
@ 2013-01-15 17:29 ` Nicolas Pitre
2013-01-15 17:42 ` Dave Martin
4 siblings, 1 reply; 15+ messages in thread
From: Nicolas Pitre @ 2013-01-15 17:29 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 15 Jan 2013, Dave Martin wrote:
> After much head-scratching and discussion, I have concluded that we
> need comprehensive memory barriers in order to ensure that the
> low-level synchronisation code executes robustly on all platforms.
>
> DSBs are excessive in most situations though, so many DSBs can be
> replaced with DMBs.
>
> As was observed in review, providing a C interface to the vlocks
> makes little sense, so this series gets rid of it.
Thanks. I have already removed the C stuff in my version but I'll merge
the rest no problem.
Nicolas
* [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups
2013-01-15 17:29 ` [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Nicolas Pitre
@ 2013-01-15 17:42 ` Dave Martin
0 siblings, 0 replies; 15+ messages in thread
From: Dave Martin @ 2013-01-15 17:42 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Jan 15, 2013 at 12:29:22PM -0500, Nicolas Pitre wrote:
> On Tue, 15 Jan 2013, Dave Martin wrote:
>
> > After much head-scratching and discussion, I have concluded that we
> > need comprehensive memory barriers in order to ensure that the
> > low-level synchronisation code executes robustly on all platforms.
> >
> > DSBs are excessive in most situations though, so many DSBs can be
> > replaced with DMBs.
> >
> > As was observed in review, providing a C interface to the vlocks
> > makes little sense, so this series gets rid of it.
>
> Thanks. I have already removed the C stuff in my version but I'll merge
> the rest no problem.
OK, I wasn't sure whether you'd already done that, but figured the patch
would be trivial to throw away (it's not like I invested a lot of time
in that one!)
I've done my best with the barriers, but if you have any concerns about
correctness, let me know.
Cheers
---Dave
* [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements
2013-01-15 16:48 ` [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements Dave Martin
@ 2013-01-16 6:50 ` Santosh Shilimkar
2013-01-16 11:49 ` Dave Martin
2013-01-16 15:05 ` Catalin Marinas
0 siblings, 2 replies; 15+ messages in thread
From: Santosh Shilimkar @ 2013-01-16 6:50 UTC (permalink / raw)
To: linux-arm-kernel
+ Catalin, RMK
Dave,
On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> For architectural correctness even Strongly-Ordered memory accesses
> require barriers in order to guarantee that multiple CPUs have a
> coherent view of the ordering of memory accesses.
>
> Virtually everything done by this early code is done via explicit
> memory access only, so DSBs are seldom required. Existing barriers
> are demoted to DMB, except where a DSB is needed to synchronise
> non-memory signalling (i.e., before a SEV). If a particular
> platform performs cache maintenance in its power_up_setup function,
> it should force it to complete explicitly including a DSB, instead
> of relying on the bL_head framework code to do it.
>
> Some additional DMBs are added to ensure all the memory ordering
> properties required by the race avoidance algorithm. DMBs are also
> moved out of loops, and for clarity some are moved so that most
> directly follow the memory operation which needs to be
> synchronised.
>
> The setting of a CPU's bL_entry_vectors[] entry is also required to
> act as a synchronisation point, so a DMB is added after checking
> that entry to ensure that other CPUs do not observe gated
> operations leaking across the opening of the gate.
>
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> ---
Sorry to pick on this again, but I am not able to understand why
strongly ordered accesses need barriers. At least from the ARM
point of view, a strongly ordered write is more of a blocking
write, and the downstream interconnect is also supposed to respect
that rule. SO reads and writes are like adding a barrier after
every load and store, so adding explicit barriers doesn't make
sense. Is this a side effect of some "write early response" kind
of optimisation at the interconnect level?
Will you be able to point to specs or documents which state
this requirement?
Regards
santosh
* [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements
2013-01-16 6:50 ` Santosh Shilimkar
@ 2013-01-16 11:49 ` Dave Martin
2013-01-16 12:11 ` Santosh Shilimkar
2013-01-16 15:05 ` Catalin Marinas
1 sibling, 1 reply; 15+ messages in thread
From: Dave Martin @ 2013-01-16 11:49 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote:
> + Catalin, RMK
>
> Dave,
>
> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> >For architectural correctness even Strongly-Ordered memory accesses
> >require barriers in order to guarantee that multiple CPUs have a
> >coherent view of the ordering of memory accesses.
> >
> >Virtually everything done by this early code is done via explicit
> >memory access only, so DSBs are seldom required. Existing barriers
> >are demoted to DMB, except where a DSB is needed to synchronise
> >non-memory signalling (i.e., before a SEV). If a particular
> >platform performs cache maintenance in its power_up_setup function,
> >it should force it to complete explicitly including a DSB, instead
> >of relying on the bL_head framework code to do it.
> >
> >Some additional DMBs are added to ensure all the memory ordering
> >properties required by the race avoidance algorithm. DMBs are also
> >moved out of loops, and for clarity some are moved so that most
> >directly follow the memory operation which needs to be
> >synchronised.
> >
> >The setting of a CPU's bL_entry_vectors[] entry is also required to
> >act as a synchronisation point, so a DMB is added after checking
> >that entry to ensure that other CPUs do not observe gated
> >operations leaking across the opening of the gate.
> >
> >Signed-off-by: Dave Martin <dave.martin@linaro.org>
> >---
>
> Sorry to pick on this again but I am not able to understand why
> the strongly ordered access needs barriers. At least from the
> ARM point of view, a strongly ordered write will be more of blocking
> write and the further interconnect also is suppose to respect that
This is what I originally assumed (hence the absence of barriers in
the initial patch).
> rule. SO read writes are like adding barrier after every load store
This assumption turns out to be wrong, unfortunately, although in
a uniprocessor scenario it makes no difference. An SO memory access
does block the CPU making the access, but explicitly does not
block the interconnect.
In a typical boot scenario for example, all secondary CPUs are
quiescent or powered down, so there's no problem. But we can't make
the same assumptions when we're trying to coordinate between
multiple active CPUs.
> so adding explicit barriers doesn't make sense. Is this a side
> effect of some "write early response" kind of optimizations at
> interconnect level ?
Strongly-Ordered accesses are always non-shareable, so there is
no explicit guarantee of coherency between multiple masters.
If there is only one master, it makes no difference, but if there
are multiple masters, there is no guarantee that they are connected
to a slave device (a DRAM controller in this case) via a single
slave port.
The architecture only guarantees global serialisation when there is a
single slave device, but provides no way to know whether two accesses
from different masters will reach the same slave port. This is in the
realms of "implementation defined."
Unfortunately, a high-performance component like a DRAM controller
is exactly the kind of component which may implement multiple
master ports, so you can't guarantee that accesses are serialised
in the same order from the perspective of all masters. There may
be some pipelining and caching between each master port and the actual
memory, for example. This is allowed, because there is no requirement
for the DMC to look like a single slave device from the perspective
of multiple masters.
A multi-ported slave might provide transparent coherency between master
ports, but it is only required to guarantee this when the accesses
are shareable (SO is always non-shared), or when explicit barriers
are used to force synchronisation between the device's master ports.
Of course, a given platform may have a DMC with only one slave
port, in which case the barriers should not be needed. But I wanted
this code to be generic enough to be reusable -- hence the
addition of the barriers. The CPU does not need to wait for a DMB
to "complete" in any sense, so this does not necessarily have a
meaningful impact on performance.
This is my understanding anyway.
> Will you be able to point to specs or documents which puts
> this requirement ?
Unfortunately, this is one of those things which we require not because
there is a statement in the ARM ARM to say that we need it -- rather,
there is no statement in the ARM ARM to say that we don't.
Cheers
---Dave
* [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements
2013-01-16 11:49 ` Dave Martin
@ 2013-01-16 12:11 ` Santosh Shilimkar
2013-01-16 12:47 ` Dave Martin
0 siblings, 1 reply; 15+ messages in thread
From: Santosh Shilimkar @ 2013-01-16 12:11 UTC (permalink / raw)
To: linux-arm-kernel
On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote:
> On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote:
>> + Catalin, RMK
>>
>> Dave,
>>
>> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
>>> For architectural correctness even Strongly-Ordered memory accesses
>>> require barriers in order to guarantee that multiple CPUs have a
>>> coherent view of the ordering of memory accesses.
>>>
>>> Virtually everything done by this early code is done via explicit
>>> memory access only, so DSBs are seldom required. Existing barriers
>>> are demoted to DMB, except where a DSB is needed to synchronise
>>> non-memory signalling (i.e., before a SEV). If a particular
>>> platform performs cache maintenance in its power_up_setup function,
>>> it should force it to complete explicitly including a DSB, instead
>>> of relying on the bL_head framework code to do it.
>>>
>>> Some additional DMBs are added to ensure all the memory ordering
>>> properties required by the race avoidance algorithm. DMBs are also
>>> moved out of loops, and for clarity some are moved so that most
>>> directly follow the memory operation which needs to be
>>> synchronised.
>>>
>>> The setting of a CPU's bL_entry_vectors[] entry is also required to
>>> act as a synchronisation point, so a DMB is added after checking
>>> that entry to ensure that other CPUs do not observe gated
>>> operations leaking across the opening of the gate.
>>>
>>> Signed-off-by: Dave Martin <dave.martin@linaro.org>
>>> ---
>>
>> Sorry to pick on this again but I am not able to understand why
>> the strongly ordered access needs barriers. At least from the
>> ARM point of view, a strongly ordered write will be more of blocking
>> write and the further interconnect also is suppose to respect that
>
> This is what I originally assumed (hence the absence of barriers in
> the initial patch).
>
>> rule. SO read writes are like adding barrier after every load store
>
> This assumption turns out to be wrong, unfortunately, although in
> a uniprocessor scenario it makes no difference. An SO memory access
> does block the CPU making the access, but explicitly does not
> block the interconnect.
>
I suspected the interconnect part when you described the barrier
need for SO memory regions.
> In a typical boot scenario for example, all secondary CPUs are
> quiescent or powered down, so there's no problem. But we can't make
> the same assumptions when we're trying to coordinate between
> multiple active CPUs.
>
>> so adding explicit barriers doesn't make sense. Is this a side
>> effect of some "write early response" kind of optimizations at
>> interconnect level ?
>
> Strongly-Ordered accesses are always non-shareable, so there is
> no explicit guarantee of coherency between multiple masters.
>
This is probably where the issue is, then. My understanding is exactly
the opposite here, and hence I wasn't worried about the multi-master
CPU scenario, since the shareable attributes would take care of it,
considering that the same page tables are used in an SMP system.
ARM documentation says -
------------
Shareability and the S bit, with TEX remap
The memory type of a region, as indicated in the Memory type column of
Table B3-12 on page B3-1350, provides
the first level of control of whether the region is shareable:
• If the memory type is Strongly-ordered then the region is Shareable
------------------------------------------------------------
> If there is only one master, it makes no difference, but if there
> are multiple masters, there is no guarantee that they are connected
> to a slave device (DRAM controller in this case) via a single
> slave port.
>
See above. We are talking about multiple CPUs here and not
the DSP or other co-processors. In either case, we are discussing
code which is executed on ARM CPUs, so we can safely limit
it to the multi-master ARM CPU case.
> The architecture only guarantees global serialisation when there is a
> single slave device, but provides no way to know whether two accesses
> from different masters will reach the same slave port. This is in the
> realms of "implementation defined."
>
> Unfortunately, a high-performance component like a DRAM controller
> is exactly the kind of component which may implement multiple
> master ports, so you can't guarantee that accesses are serialised
> in the same order from the perspective of all masters. There may
> be some pipelining and caching between each master port and the actual
> memory, for example. This is allowed, because there is no requirement
> for the DMC to look like a single slave device from the perspective
> of multiple masters.
>
> A multi-ported slave might provide transparent coherency between master
> ports, but it is only required to guarantee this when the accesses
> are shareable (SO is always non-shared), or when explicit barriers
> are used to force synchronisation between the device's master ports.
>
> Of course, a given platform may have a DMC with only one slave
> port, in which case the barriers should not be needed. But I wanted
> this code to be generic enough to be reusable -- hence the
> addition of the barriers. The CPU does not need to wait for a DMB
> to "complete" in any sense, so this does not necessarily have a
> meaningful impact on performance.
>
> This is my understanding anyway.
>
>> Will you be able to point to specs or documents which puts
>> this requirement ?
>
> Unfortunately, this is one of those things which we require not because
> there is a statement in the ARM ARM to say that we need it -- rather,
> there is no statement in the ARM ARM to say that we don't.
>
Thanks a lot for the elaborate answer. It helps to understand the rationale,
at least.
Regards
Santosh
* [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements
2013-01-16 12:11 ` Santosh Shilimkar
@ 2013-01-16 12:47 ` Dave Martin
2013-01-16 14:36 ` Santosh Shilimkar
0 siblings, 1 reply; 15+ messages in thread
From: Dave Martin @ 2013-01-16 12:47 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Jan 16, 2013 at 05:41:00PM +0530, Santosh Shilimkar wrote:
> On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote:
> >On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote:
> >>+ Catalin, RMK
> >>
> >>Dave,
> >>
> >>On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> >>>For architectural correctness even Strongly-Ordered memory accesses
> >>>require barriers in order to guarantee that multiple CPUs have a
> >>>coherent view of the ordering of memory accesses.
> >>>
> >>>Virtually everything done by this early code is done via explicit
> >>>memory access only, so DSBs are seldom required. Existing barriers
> >>>are demoted to DMB, except where a DSB is needed to synchronise
> >>>non-memory signalling (i.e., before a SEV). If a particular
> >>>platform performs cache maintenance in its power_up_setup function,
> >>>it should force it to complete explicitly including a DSB, instead
> >>>of relying on the bL_head framework code to do it.
> >>>
> >>>Some additional DMBs are added to ensure all the memory ordering
> >>>properties required by the race avoidance algorithm. DMBs are also
> >>>moved out of loops, and for clarity some are moved so that most
> >>>directly follow the memory operation which needs to be
> >>>synchronised.
> >>>
> >>>The setting of a CPU's bL_entry_vectors[] entry is also required to
> >>>act as a synchronisation point, so a DMB is added after checking
> >>>that entry to ensure that other CPUs do not observe gated
> >>>operations leaking across the opening of the gate.
> >>>
> >>>Signed-off-by: Dave Martin <dave.martin@linaro.org>
> >>>---
> >>
> >>Sorry to pick on this again but I am not able to understand why
> >>the strongly ordered access needs barriers. At least from the
> >>ARM point of view, a strongly ordered write will be more of blocking
> >>write and the further interconnect also is suppose to respect that
> >
> >This is what I originally assumed (hence the absence of barriers in
> >the initial patch).
> >
> >>rule. SO read writes are like adding barrier after every load store
> >
> >This assumption turns out to be wrong, unfortunately, although in
> >a uniprocessor scenario it makes no difference. An SO memory access
> >does block the CPU making the access, but explicitly does not
> >block the interconnect.
> >
> I suspected the interconnect part when you described the barrier
> need for SO memory region.
>
> >In a typical boot scenario for example, all secondary CPUs are
> >quiescent or powered down, so there's no problem. But we can't make
> >the same assumptions when we're trying to coordinate between
> >multiple active CPUs.
> >
> >>so adding explicit barriers doesn't make sense. Is this a side
> >>effect of some "write early response" kind of optimization at
> >>the interconnect level?
> >
> >Strongly-Ordered accesses are always non-shareable, so there is
> >no explicit guarantee of coherency between multiple masters.
> >
> This is probably where the issue is, then. My understanding is exactly
> the opposite here, and hence I wasn't worried about the multi-master
> CPU scenario, since the shareable attributes would take care of it,
> considering that the same page tables are used in an SMP system.
>
> The ARM documentation says:
> ------------------------------------------------------------
> Shareability and the S bit, with TEX remap
> The memory type of a region, as indicated in the Memory type column
> of Table B3-12 on page B3-1350, provides the first level of control
> of whether the region is shareable:
> • If the memory type is Strongly-ordered then the region is Shareable
> ------------------------------------------------------------
Hmmm, it looks like you're right here. My assumption that SO implies
non-shareable is wrong. This is backed up by:

A3.5.6 Device and Strongly-ordered memory

"Address locations marked as Strongly-ordered [...] are always treated
as Shareable."

I think this is sufficient to ensure that if two CPUs access the same
location with SO accesses, each will see an access order to any single
location which is consistent with the program order of the accesses on
the other CPUs. (This comes from the glossary definition of Coherent.)

However, I can't see any general guarantee for accesses to _different_
locations, beyond the guarantees for certain special cases given in
A3.8.2 Ordering requirements for memory accesses (address and control
dependencies etc.)

This may make some of the dmbs unnecessary, but it is not clear whether
they are all unnecessary.

I'll need to follow up on this and see if we can get an answer.
Cheers
---Dave
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements
2013-01-16 12:47 ` Dave Martin
@ 2013-01-16 14:36 ` Santosh Shilimkar
0 siblings, 0 replies; 15+ messages in thread
From: Santosh Shilimkar @ 2013-01-16 14:36 UTC (permalink / raw)
To: linux-arm-kernel
On Wednesday 16 January 2013 06:17 PM, Dave Martin wrote:
> On Wed, Jan 16, 2013 at 05:41:00PM +0530, Santosh Shilimkar wrote:
>> On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote:
>>> On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote:
>>>> + Catalin, RMK
>>>>
>>>> Dave,
>>>>
>>>> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
>>>>> For architectural correctness even Strongly-Ordered memory accesses
>>>>> require barriers in order to guarantee that multiple CPUs have a
>>>>> coherent view of the ordering of memory accesses.
>>>>>
>>>>> Virtually everything done by this early code is done via explicit
>>>>> memory access only, so DSBs are seldom required. Existing barriers
>>>>> are demoted to DMB, except where a DSB is needed to synchronise
>>>>> non-memory signalling (i.e., before a SEV). If a particular
>>>>> platform performs cache maintenance in its power_up_setup function,
>>>>> it should force it to complete explicitly including a DSB, instead
>>>>> of relying on the bL_head framework code to do it.
>>>>>
>>>>> Some additional DMBs are added to ensure all the memory ordering
>>>>> properties required by the race avoidance algorithm. DMBs are also
>>>>> moved out of loops, and for clarity some are moved so that most
>>>>> directly follow the memory operation which needs to be
>>>>> synchronised.
>>>>>
>>>>> The setting of a CPU's bL_entry_vectors[] entry is also required to
>>>>> act as a synchronisation point, so a DMB is added after checking
>>>>> that entry to ensure that other CPUs do not observe gated
>>>>> operations leaking across the opening of the gate.
>>>>>
>>>>> Signed-off-by: Dave Martin <dave.martin@linaro.org>
>>>>> ---
>>>>
>>>> Sorry to pick on this again, but I am not able to understand why
>>>> strongly-ordered accesses need barriers. At least from the
>>>> ARM point of view, a strongly-ordered write is more of a blocking
>>>> write, and the further interconnect is also supposed to respect that
>>>
>>> This is what I originally assumed (hence the absence of barriers in
>>> the initial patch).
>>>
>>>> rule. SO reads and writes are like adding a barrier after every load/store,
>>>
>>> This assumption turns out to be wrong, unfortunately, although in
>>> a uniprocessor scenario it makes no difference. An SO memory access
>>> does block the CPU making the access, but explicitly does not
>>> block the interconnect.
>>>
>> I suspected the interconnect part when you described the barrier
>> need for the SO memory region.
>>
>>> In a typical boot scenario for example, all secondary CPUs are
>>> quiescent or powered down, so there's no problem. But we can't make
>>> the same assumptions when we're trying to coordinate between
>>> multiple active CPUs.
>>>
>>>> so adding explicit barriers doesn't make sense. Is this a side
>>>> effect of some "write early response" kind of optimization at
>>>> the interconnect level?
>>>
>>> Strongly-Ordered accesses are always non-shareable, so there is
>>> no explicit guarantee of coherency between multiple masters.
>>>
>> This is probably where the issue is, then. My understanding is exactly
>> the opposite here, and hence I wasn't worried about the multi-master
>> CPU scenario, since the shareable attributes would take care of it,
>> considering that the same page tables are used in an SMP system.
>>
>> The ARM documentation says:
>> ------------------------------------------------------------
>> Shareability and the S bit, with TEX remap
>> The memory type of a region, as indicated in the Memory type column
>> of Table B3-12 on page B3-1350, provides the first level of control
>> of whether the region is shareable:
>> • If the memory type is Strongly-ordered then the region is Shareable
>> ------------------------------------------------------------
>
> Hmmm, it looks like you're right here. My assumption that SO implies
> non-shareable is wrong. This is backed up by:
>
> A3.5.6 Device and Strongly-ordered memory
>
> "Address locations marked as Strongly-ordered [...] are always treated
> as Shareable."
>
>
> I think this is sufficient to ensure that if two CPUs access the same
> location with SO accesses, each will see an access order to any single
> location which is consistent with the program order of the accesses on
> the other CPUs. (This comes from the glossary definition of Coherent.)
>
> However, I can't see any general guarantee for accesses to _different_
> locations, beyond the guarantees for certain special cases given in
> A3.8.2 Ordering requirements for memory accesses (address and control
> dependencies etc.)
>
> This may make some of the dmbs unnecessary, but it is not clear whether
> they are all unnecessary.
>
>
> I'll need to follow up on this and see if we can get an answer.
>
Thanks, David. I am looking forward to hearing more on this.
Regards,
Santosh
* [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements
2013-01-16 6:50 ` Santosh Shilimkar
2013-01-16 11:49 ` Dave Martin
@ 2013-01-16 15:05 ` Catalin Marinas
2013-01-16 15:37 ` Dave Martin
2013-01-17 6:39 ` Santosh Shilimkar
1 sibling, 2 replies; 15+ messages in thread
From: Catalin Marinas @ 2013-01-16 15:05 UTC (permalink / raw)
To: linux-arm-kernel
Santosh,
On Wed, Jan 16, 2013 at 06:50:47AM +0000, Santosh Shilimkar wrote:
> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> > For architectural correctness even Strongly-Ordered memory accesses
> > require barriers in order to guarantee that multiple CPUs have a
> > coherent view of the ordering of memory accesses.
> >
> > Virtually everything done by this early code is done via explicit
> > memory access only, so DSBs are seldom required. Existing barriers
> > are demoted to DMB, except where a DSB is needed to synchronise
> > non-memory signalling (i.e., before a SEV). If a particular
> > platform performs cache maintenance in its power_up_setup function,
> > it should force it to complete explicitly including a DSB, instead
> > of relying on the bL_head framework code to do it.
> >
> > Some additional DMBs are added to ensure all the memory ordering
> > properties required by the race avoidance algorithm. DMBs are also
> > moved out of loops, and for clarity some are moved so that most
> > directly follow the memory operation which needs to be
> > synchronised.
> >
> > The setting of a CPU's bL_entry_vectors[] entry is also required to
> > act as a synchronisation point, so a DMB is added after checking
> > that entry to ensure that other CPUs do not observe gated
> > operations leaking across the opening of the gate.
> >
> > Signed-off-by: Dave Martin <dave.martin@linaro.org>
>
> Sorry to pick on this again, but I am not able to understand why
> strongly-ordered accesses need barriers. At least from the
> ARM point of view, a strongly-ordered write is more of a blocking
> write, and the further interconnect is also supposed to respect that
> rule. SO reads and writes are like adding a barrier after every load/store,
> so adding explicit barriers doesn't make sense. Is this a side
> effect of some "write early response" kind of optimization at
> the interconnect level?
SO or Device memory accesses are *not* like putting a proper barrier
between each access, though it may behave in some situations like having
a barrier. The ARM ARM (A3.8.3, fig 3.5) defines how accesses must
*arrive* at a peripheral or block of memory depending on the memory type,
and in the case of Device or SO we don't need additional barriers because
such accesses would *arrive* in order (given the minimum 1KB range
restriction). But it does not say anything about *observability* by a
different *master*. That's because you can't guarantee that your memory
accesses go to the same slave port.

For observability by a different master, you need an explicit DMB even
though the memory type is Device or SO. While it may work fine in most
cases (especially when the accesses by various masters go to the same
slave port), you can't be sure what the memory controller or whatever
interconnect does.

As Dave said, it's more about what the ARM ARM doesn't say than
what it explicitly states.
--
Catalin
* [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements
2013-01-16 15:05 ` Catalin Marinas
@ 2013-01-16 15:37 ` Dave Martin
2013-01-17 6:39 ` Santosh Shilimkar
1 sibling, 0 replies; 15+ messages in thread
From: Dave Martin @ 2013-01-16 15:37 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Jan 16, 2013 at 03:05:34PM +0000, Catalin Marinas wrote:
> Santosh,
>
> On Wed, Jan 16, 2013 at 06:50:47AM +0000, Santosh Shilimkar wrote:
> > On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
> > > For architectural correctness even Strongly-Ordered memory accesses
> > > require barriers in order to guarantee that multiple CPUs have a
> > > coherent view of the ordering of memory accesses.
> > >
> > > Virtually everything done by this early code is done via explicit
> > > memory access only, so DSBs are seldom required. Existing barriers
> > > are demoted to DMB, except where a DSB is needed to synchronise
> > > non-memory signalling (i.e., before a SEV). If a particular
> > > platform performs cache maintenance in its power_up_setup function,
> > > it should force it to complete explicitly including a DSB, instead
> > > of relying on the bL_head framework code to do it.
> > >
> > > Some additional DMBs are added to ensure all the memory ordering
> > > properties required by the race avoidance algorithm. DMBs are also
> > > moved out of loops, and for clarity some are moved so that most
> > > directly follow the memory operation which needs to be
> > > synchronised.
> > >
> > > The setting of a CPU's bL_entry_vectors[] entry is also required to
> > > act as a synchronisation point, so a DMB is added after checking
> > > that entry to ensure that other CPUs do not observe gated
> > > operations leaking across the opening of the gate.
> > >
> > > Signed-off-by: Dave Martin <dave.martin@linaro.org>
> >
> > Sorry to pick on this again, but I am not able to understand why
> > strongly-ordered accesses need barriers. At least from the
> > ARM point of view, a strongly-ordered write is more of a blocking
> > write, and the further interconnect is also supposed to respect that
> > rule. SO reads and writes are like adding a barrier after every load/store,
> > so adding explicit barriers doesn't make sense. Is this a side
> > effect of some "write early response" kind of optimization at
> > the interconnect level?
>
> SO or Device memory accesses are *not* like putting a proper barrier
> between each access, though it may behave in some situations like having
> a barrier. The ARM ARM (A3.8.3, fig 3.5) defines how accesses must
> *arrive* at a peripheral or block of memory depending on the memory type
> and in case of Device or SO we don't need additional barriers because
> such accesses would *arrive* in order (given the minimum 1KB range
> restriction). But it does not say anything about *observability* by a
> different *master*. That's because you can't guarantee that your memory
> accesses go to the same slave port.
>
> For observability by a different master, you need an explicit DMB even
> though the memory type is Device or SO. While it may work fine in most
> cases (especially when the accesses by various masters go to the same
> slave port), you can't be sure what the memory controller or whatever
> interconnect does.
>
> As Dave said, it's more about what the ARM ARM doesn't say than
> what it explicitly states.
OK, so I talked to one of the ARM architects, and he strongly recommends
having the DMBs.

The fact that SO accesses are shareable guarantees a coherent view of
access order only per memory location (i.e., per address), _not_
per slave. This is what was missing from my original assumption.

For different memory locations, even within a single block of memory,
you don't get that guarantee about what other masters see, except for
the precise documented situations where there are address, control,
or data dependencies between instructions which imply a particular
observation order by other masters. (See A3.8.2, Ordering requirements
for memory accesses.)

For this locking code, we're concerned about ordering between different
memory locations in almost all instances, so most or all of the
barriers will be needed for strict correctness.
Cheers
---Dave
* [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements
2013-01-16 15:05 ` Catalin Marinas
2013-01-16 15:37 ` Dave Martin
@ 2013-01-17 6:39 ` Santosh Shilimkar
1 sibling, 0 replies; 15+ messages in thread
From: Santosh Shilimkar @ 2013-01-17 6:39 UTC (permalink / raw)
To: linux-arm-kernel
On Wednesday 16 January 2013 08:35 PM, Catalin Marinas wrote:
> Santosh,
>
> On Wed, Jan 16, 2013 at 06:50:47AM +0000, Santosh Shilimkar wrote:
>> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
>>> For architectural correctness even Strongly-Ordered memory accesses
>>> require barriers in order to guarantee that multiple CPUs have a
>>> coherent view of the ordering of memory accesses.
>>>
>>> Virtually everything done by this early code is done via explicit
>>> memory access only, so DSBs are seldom required. Existing barriers
>>> are demoted to DMB, except where a DSB is needed to synchronise
>>> non-memory signalling (i.e., before a SEV). If a particular
>>> platform performs cache maintenance in its power_up_setup function,
>>> it should force it to complete explicitly including a DSB, instead
>>> of relying on the bL_head framework code to do it.
>>>
>>> Some additional DMBs are added to ensure all the memory ordering
>>> properties required by the race avoidance algorithm. DMBs are also
>>> moved out of loops, and for clarity some are moved so that most
>>> directly follow the memory operation which needs to be
>>> synchronised.
>>>
>>> The setting of a CPU's bL_entry_vectors[] entry is also required to
>>> act as a synchronisation point, so a DMB is added after checking
>>> that entry to ensure that other CPUs do not observe gated
>>> operations leaking across the opening of the gate.
>>>
>>> Signed-off-by: Dave Martin <dave.martin@linaro.org>
>>
>> Sorry to pick on this again, but I am not able to understand why
>> strongly-ordered accesses need barriers. At least from the
>> ARM point of view, a strongly-ordered write is more of a blocking
>> write, and the further interconnect is also supposed to respect that
>> rule. SO reads and writes are like adding a barrier after every load/store,
>> so adding explicit barriers doesn't make sense. Is this a side
>> effect of some "write early response" kind of optimization at
>> the interconnect level?
>
> SO or Device memory accesses are *not* like putting a proper barrier
> between each access, though it may behave in some situations like having
> a barrier. The ARM ARM (A3.8.3, fig 3.5) defines how accesses must
> *arrive* at a peripheral or block of memory depending on the memory type
> and in case of Device or SO we don't need additional barriers because
> such accesses would *arrive* in order (given the minimum 1KB range
> restriction). But it does not say anything about *observability* by a
> different *master*. That's because you can't guarantee that your memory
> accesses go to the same slave port.
>
> For observability by a different master, you need an explicit DMB even
> though the memory type is Device or SO. While it may work fine in most
> cases (especially when the accesses by various masters go to the same
> slave port), you can't be sure what the memory controller or whatever
> interconnect do.
>
I agree that the interconnect behavior and the number of slave ports
used are implementation specific, and we can't assume anything here. So
it is safer to keep those additional barriers so that we don't hit the
corner cases.
> As Dave said, it's more about what the ARM ARM doesn't say than
> what it explicitly states.
>
I understand it better now. Thanks for the good discussion.
Regards,
Santosh
end of thread, other threads:[~2013-01-17 6:39 UTC | newest]
Thread overview: 15+ messages
2013-01-15 16:48 [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Dave Martin
2013-01-15 16:48 ` [RFC PATCH 1/4] ARM: b.L: Remove C declarations for vlocks Dave Martin
2013-01-15 16:48 ` [RFC PATCH 2/4] ARM: b.L: vlocks: Add architecturally required memory barriers Dave Martin
2013-01-15 16:48 ` [RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements Dave Martin
2013-01-16 6:50 ` Santosh Shilimkar
2013-01-16 11:49 ` Dave Martin
2013-01-16 12:11 ` Santosh Shilimkar
2013-01-16 12:47 ` Dave Martin
2013-01-16 14:36 ` Santosh Shilimkar
2013-01-16 15:05 ` Catalin Marinas
2013-01-16 15:37 ` Dave Martin
2013-01-17 6:39 ` Santosh Shilimkar
2013-01-15 16:48 ` [RFC PATCH 4/4] ARM: vexpress/dcscb: power_up_setup memory barrier cleanup Dave Martin
2013-01-15 17:29 ` [RFC PATCH 0/4] b.L: Memory barriers and miscellaneous tidyups Nicolas Pitre
2013-01-15 17:42 ` Dave Martin